In [9]:
import os
os.listdir()


['.config', 'sample_data']

In [11]:
from google.colab import files
uploaded = files.upload()

import pandas as pd

# Automatically read uploaded file
df = pd.read_csv(next(iter(uploaded)))

df.head()


Saving Titanic-Dataset.csv to Titanic-Dataset.csv


,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [12]:
# Check missing values
df.isnull().sum()


Column,Missing count
PassengerId,0
Survived,0
Pclass,0
Name,0
Sex,0
Age,177
SibSp,0
Parch,0
Ticket,0
Fare,0


In [13]:
# Fill missing Age with median
df['Age'].fillna(df['Age'].median(), inplace=True)

# Fill missing Embarked with most frequent value
df['Embarked'].fillna(df['Embarked'].mode()[0], inplace=True)

# Drop Cabin column (too many missing values)
df.drop(columns=['Cabin'], inplace=True)


The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df['Age'].fillna(df['Age'].median(), inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df['Embarked'].fillna(df['Embarked'].mode()[0], inplace=True)
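
The warning above names two alternatives to `df[col].fillna(value, inplace=True)`. The dictionary form can be sketched as follows on a small made-up frame (the `toy` data and the `fill_values` name are illustrative assumptions, not part of the notebook):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for two Titanic columns (values are made up).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Embarked": ["S", "C", None, "S"],
})

# One fillna call on the DataFrame itself: each key is a column name and
# each value is the replacement for that column's missing entries.
fill_values = {
    "Age": toy["Age"].median(),
    "Embarked": toy["Embarked"].mode()[0],
}
toy = toy.fillna(fill_values)

print(toy)
```

Because the call operates on the whole DataFrame and the result is assigned back, no intermediate column copy is modified in place.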


In [14]:
# Convert Sex column
df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})

# Convert Embarked column into dummy variables
df = pd.get_dummies(df, columns=['Embarked'], drop_first=True)


In [15]:
X = df.drop(columns=['Survived', 'Name', 'Ticket', 'PassengerId'])
y = df['Survived']


# Task
Split the preprocessed data into training and testing sets (80% train, 20% test), train a Logistic Regression model on the training data, evaluate the model's accuracy on the test set, and summarize the model's performance.

## Data Preprocessing

### Subtask:
Refine data preprocessing by filling missing 'Age' values with the median, 'Embarked' values with the mode, and dropping the 'Cabin' column, using the updated syntax to avoid warnings.


**Reasoning**:
The previous cell (`CsHB0cp1bBmg`) called `fillna(..., inplace=True)` on a selected column, which triggered a pandas `FutureWarning` for both 'Age' and 'Embarked'. This code block follows the warning's suggestion: it assigns the result of `fillna` back to the column instead of using `inplace=True`, filling 'Age' with the median and 'Embarked' with the mode, and it drops 'Cabin' by reassigning `df`.



In [16]:
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
df = df.drop(columns=['Cabin'])


KeyError: 'Embarked'

**Reasoning**:
The `KeyError: 'Embarked'` occurred because an earlier cell (`cwoWtC2NbGLk`) had already replaced the 'Embarked' column with dummy variables, so the column no longer existed in `df` when cell `7c363789` tried to fill its missing values. 'Cabin' had also been dropped by then, so the `drop` call would have failed next for the same reason. To apply every preprocessing step in the correct order with the updated syntax (fill 'Age' and 'Embarked', drop 'Cabin', map 'Sex', then dummify 'Embarked'), the next cell reloads `df` from the uploaded file and reapplies the transformations sequentially.
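
The failure mode can be reproduced on a small made-up frame. The sketch below shows that `pd.get_dummies(..., columns=[...])` removes the original column, and adds a hypothetical helper, `fill_if_present`, that makes a fill step safe to re-run; neither the toy data nor the helper comes from the notebook.

```python
import pandas as pd

# Made-up rows; only the column names mirror the Titanic data.
toy = pd.DataFrame({
    "Embarked": ["S", "C", "Q", "S"],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

encoded = pd.get_dummies(toy, columns=["Embarked"], drop_first=True)
print(list(encoded.columns))  # ['Fare', 'Embarked_Q', 'Embarked_S']


def fill_if_present(frame, column, value):
    """Fill missing values in `column` only if that column still exists."""
    if column in frame.columns:
        frame = frame.assign(**{column: frame[column].fillna(value)})
    return frame


# 'Embarked' is gone after encoding, so this returns the frame unchanged
# instead of raising KeyError.
safe = fill_if_present(encoded, "Embarked", "S")
```

A guard like this only hides the ordering problem. Reloading the data and running the steps in order, as the next cell does, is the more robust fix.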



In [17]:
df = pd.read_csv(next(iter(uploaded)))

# Fill missing Age with median using updated syntax
df['Age'] = df['Age'].fillna(df['Age'].median())

# Fill missing Embarked with most frequent value using updated syntax
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])

# Drop Cabin column
df = df.drop(columns=['Cabin'])

# Convert Sex column
df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})

# Convert Embarked column into dummy variables
df = pd.get_dummies(df, columns=['Embarked'], drop_first=True)

# Separate features (X) and target (y)
X = df.drop(columns=['Survived', 'Name', 'Ticket', 'PassengerId'])
y = df['Survived']

**Reasoning**:
With the features `X` and target `y` prepared, this step imports the required scikit-learn utilities, splits the data into training and testing sets (80% train, 20% test), trains a Logistic Regression model on the training data, makes predictions on the test set, and prints the accuracy score.



In [18]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a Logistic Regression model
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model's accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.4f}")

Model Accuracy: 0.8101
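
An accuracy figure is easier to interpret next to a majority-class baseline and a confusion matrix. The following is a minimal sketch of that comparison. It runs on synthetic data from `make_classification` as a stand-in for the Titanic features, so its numbers will not match the result above, and every variable name in it is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic, mildly imbalanced binary data (a stand-in, not the Titanic set).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.6, 0.4], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Baseline that always predicts the most frequent training class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
clf = LogisticRegression(max_iter=200).fit(X_tr, y_tr)

baseline_acc = accuracy_score(y_te, baseline.predict(X_te))
model_acc = accuracy_score(y_te, clf.predict(X_te))

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_te, clf.predict(X_te))

print(f"Baseline accuracy: {baseline_acc:.4f}")
print(f"Model accuracy:    {model_acc:.4f}")
print(cm)
```

In the notebook itself, the same three calls could be applied to `X_train`, `X_test`, `y_train`, and `y_test` from the cell above.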


## Final Task

### Subtask:
Summarize the model's performance and the overall results of the analysis.


## Summary:

### Q&A
The Logistic Regression model reached an accuracy of 0.8101 (about 81%) on the held-out test set.

### Data Analysis Key Findings
*   Missing 'Age' values were imputed with the median, and missing 'Embarked' values were filled with the mode.
*   The 'Cabin' column was dropped due to a high number of missing values.
*   Categorical features 'Sex' and 'Embarked' were converted into numerical representations using mapping and one-hot encoding, respectively.
*   The data was split into training and testing sets, with 80% used for training and 20% for testing.
*   A Logistic Regression model was trained on the preprocessed training data.
*   The trained Logistic Regression model achieved an accuracy of approximately 81.01% (0.8101) on the test set.
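
The notebook computes the 'Age' median and the 'Embarked' mode on the full dataset before splitting. One alternative, sketched here under assumptions, is to wrap the same steps in a scikit-learn `Pipeline` so that imputation values are learned only from the data passed to `fit`. The toy rows and labels below are made up, and the column grouping is an illustrative choice rather than the notebook's method.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

numeric_cols = ["Age", "Fare"]
categorical_cols = ["Sex", "Embarked"]

preprocess = ColumnTransformer([
    # Median imputation for numeric columns.
    ("num", SimpleImputer(strategy="median"), numeric_cols),
    # Mode imputation, then one-hot encoding with the first level dropped.
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(drop="first")),
    ]), categorical_cols),
])

pipe = Pipeline([
    ("prep", preprocess),
    ("model", LogisticRegression(max_iter=200)),
])

# Made-up rows and labels, used only to show that the pipeline runs.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, 35.0, np.nan, 54.0, 2.0],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05, 8.46, 51.86, 21.08],
    "Sex": ["male", "female", "female", "female", "male", "male", "male", "male"],
    "Embarked": ["S", "C", "S", "S", "S", "Q", np.nan, "S"],
})
survived = pd.Series([0, 1, 1, 1, 0, 0, 0, 0])

pipe.fit(toy, survived)
preds = pipe.predict(toy)
print(preds)
```

With this layout, `pipe.fit(X_train, y_train)` followed by `pipe.predict(X_test)` would apply training-set statistics to the test rows.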

### Insights or Next Steps
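*   The model reaches about 81% accuracy (0.8101) on a single 80/20 split. Accuracy alone does not show how the errors are distributed between survivors and non-survivors, so the figure is best read alongside a baseline or a confusion matrix.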
*   Consider exploring other machine learning algorithms (e.g., Random Forest, Gradient Boosting) or implementing feature engineering techniques to potentially improve model performance.
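
A comparison of the algorithms mentioned above could be sketched with cross-validation, which averages over several splits instead of relying on one. The example uses synthetic data from `make_classification` in place of the Titanic features, so its scores say nothing about this dataset. The `candidates` dictionary and the default hyperparameters are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data; replace with X and y from the notebook.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

candidates = {
    "LogisticRegression": LogisticRegression(max_iter=200),
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=42),
    "GradientBoosting": GradientBoostingClassifier(random_state=42),
}

# Mean 5-fold cross-validated accuracy for each candidate model.
mean_scores = {
    name: cross_val_score(est, X_demo, y_demo, cv=5, scoring="accuracy").mean()
    for name, est in candidates.items()
}

for name, score in mean_scores.items():
    print(f"{name}: {score:.4f}")
```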
