# Task 3: Introduction to Machine Learning

## Section 1: Setup & Dataset

### **Task 1**: Load the Dataset

*Instruction*: Load the preprocessed Titanic dataset (reuse it from the previous module, or reload it if needed). Separate it into features (`X`) and target (`y`, where the target is the `Survived` column).

In [4]:
import pandas as pd

df = pd.read_csv('titanic.csv')
X = df.drop('Survived', axis=1)
y = df['Survived']
print(X.head())
print(y.head())

   Pclass                                               Name     Sex   Age  \
0       3                             Mr. Owen Harris Braund    male  22.0   
1       1  Mrs. John Bradley (Florence Briggs Thayer) Cum...  female  38.0   
2       3                              Miss. Laina Heikkinen  female  26.0   
3       1        Mrs. Jacques Heath (Lily May Peel) Futrelle  female  35.0   
4       3                            Mr. William Henry Allen    male  35.0   

   Siblings/Spouses Aboard  Parents/Children Aboard     Fare  
0                        1                        0   7.2500  
1                        1                        0  71.2833  
2                        0                        0   7.9250  
3                        1                        0  53.1000  
4                        0                        0   8.0500  
0    0
1    1
2    1
3    1
4    0
Name: Survived, dtype: int64
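
Before modelling, it can help to confirm that the target did not leak into the features and that the rows still line up. The sketch below uses a tiny hand-made frame (`toy`, `X_toy`, and `y_toy` are hypothetical names, and the four rows only mimic the columns shown above), so it illustrates the check rather than reproducing the notebook's data.

```python
import pandas as pd

# Hypothetical four-row frame that mimics a few Titanic columns.
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Pclass": [3, 1, 3, 1],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0],
})

X_toy = toy.drop(columns="Survived")  # every column except the target
y_toy = toy["Survived"]               # the target as a Series

# The target must not appear among the features, and row counts must match.
assert "Survived" not in X_toy.columns
assert len(X_toy) == len(y_toy)

print(X_toy.dtypes)          # text columns still need encoding before fitting
print(y_toy.value_counts())  # class balance of the target
```

Running the same two assertions on the real `X` and `y` is a cheap guard against accidentally training on the label.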


## Section 2: Splitting the Data

### **Task 2**: Train/Test Split

*Instruction*: Split the dataset into training and testing sets (80/20 split).


In [5]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.head())
print(X_test.head())
print(y_train.head())
print(y_test.head())


     Pclass                              Name     Sex   Age  \
730       2       Mr. Moses Aaron Troupiansky    male  23.0   
390       3       Mr. Johan Birger Gustafsson    male  28.0   
118       3  Miss. Ellis Anna Maria Andersson  female   2.0   
440       2          Ms. Encarnacion Reynaldo  female  28.0   
309       1         Miss. Emily Borie Ryerson  female  18.0   

     Siblings/Spouses Aboard  Parents/Children Aboard     Fare  
730                        0                        0   13.000  
390                        2                        0    7.925  
118                        4                        2   31.275  
440                        0                        0   13.000  
309                        2                        2  262.375  
     Pclass                                   Name   Sex   Age  \
296       1                   Mr. Adolphe Saalfeld  male  47.0   
682       2  Mr. Joseph Philippe Lemercier Laroche  male  25.0   
535       3                Mr. Sa
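
The split above is purely random, so the share of survivors can differ slightly between the two sets. One common option is the `stratify` argument of `train_test_split`, which keeps the class ratio the same in both parts. A minimal sketch on made-up labels (`X_demo` and `y_demo` are hypothetical, not the Titanic data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 samples, 30 positives and 70 negatives.
y_demo = np.array([1] * 30 + [0] * 70)
X_demo = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.2,      # 80/20 split, as in the task
    random_state=42,    # fixed seed for reproducibility
    stratify=y_demo,    # preserve the 30/70 class ratio in both parts
)

print(len(X_tr), len(X_te))          # 80 20
print(y_tr.mean(), y_te.mean())      # positive share in each part
```

Whether stratification changes the Titanic results is not shown here; it mainly reduces split-to-split variance when classes are imbalanced.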

## Section 3: Train Your First Model

### **Task 3**: Logistic Regression

*Instruction*: Train a Logistic Regression model on the Titanic dataset. Display accuracy on both train and test sets.



In [8]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder, StandardScaler

df = pd.read_csv('titanic.csv')

X = df.drop('Survived', axis=1)
y = df['Survived']

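# Label-encode the text columns; columns that are absent from X are skipped.
# Note: 'Name' is close to unique per passenger, so its integer code carries little signal.
categorical_cols = ['Name', 'Sex', 'Embarked', 'Cabin']
label_encoders = {}
for col in categorical_cols:
    if col in X.columns:
        label_encoders[col] = LabelEncoder()
        X[col] = label_encoders[col].fit_transform(X[col].astype(str))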

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

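# Fit the scaler on the training split only, then reuse it on the test split
# so that no test-set statistics leak into training.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)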

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

accuracy_train = accuracy_score(y_train, y_pred_train)
accuracy_test = accuracy_score(y_test, y_pred_test)

print(f"Accuracy on Train Set: {accuracy_train}")
print(f"Accuracy on Test Set: {accuracy_test}")

Accuracy on Train Set: 0.8152327221438646
Accuracy on Test Set: 0.7471910112359551
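
The encode, scale, and fit steps above can also be bundled into a single scikit-learn `Pipeline`, which keeps preprocessing and model together and fits the transformers on the training split only. The sketch below is an alternative, not the method used in this notebook: it swaps label encoding for one-hot encoding of `Sex`, and it runs on synthetic data (`demo` and all of its values are made up), so the printed accuracies say nothing about the Titanic results.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the Titanic frame (hypothetical values).
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "Sex": rng.choice(["male", "female"], size=n),
    "Age": rng.uniform(1, 70, size=n),
    "Fare": rng.uniform(5, 100, size=n),
})
# Target follows Sex, with about 20% of labels flipped as noise.
is_female = (demo["Sex"] == "female").to_numpy()
noise = rng.random(n) < 0.2
demo["Survived"] = np.where(noise, ~is_female, is_female).astype(int)

X_demo = demo.drop(columns="Survived")
y_demo = demo["Survived"]
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# One-hot encode the categorical column, standardize the numeric ones.
preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Sex"]),
    ("num", StandardScaler(), ["Age", "Fare"]),
])
clf = Pipeline([
    ("prep", preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])
clf.fit(X_tr, y_tr)

train_acc = clf.score(X_tr, y_tr)
test_acc = clf.score(X_te, y_te)
print(f"Accuracy on Train Set: {train_acc:.3f}")
print(f"Accuracy on Test Set: {test_acc:.3f}")
```

Because the whole pipeline is one estimator, it can be passed directly to tools such as `cross_val_score` or `GridSearchCV` without repeating the preprocessing code in every cell.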


## Section 4: Model Evaluation

### **Task 4**: Confusion Matrix & Classification Report

*Instruction*: Evaluate the model using a confusion matrix and a classification report.

In [10]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.preprocessing import LabelEncoder, StandardScaler

df = pd.read_csv('titanic.csv')

X = df.drop('Survived', axis=1)
y = df['Survived']

categorical_cols = ['Name', 'Sex', 'Embarked', 'Cabin']
label_encoders = {}
for col in categorical_cols:
    if col in X.columns:
        label_encoders[col] = LabelEncoder()
        X[col] = label_encoders[col].fit_transform(X[col].astype(str))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

y_pred_test = model.predict(X_test)

print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_test))

print("\nClassification Report:")
print(classification_report(y_test, y_pred_test))

Confusion Matrix:
[[97 14]
 [31 36]]

Classification Report:
              precision    recall  f1-score   support

           0       0.76      0.87      0.81       111
           1       0.72      0.54      0.62        67

    accuracy                           0.75       178
   macro avg       0.74      0.71      0.71       178
weighted avg       0.74      0.75      0.74       178
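
The figures in the classification report can be re-derived by hand from the confusion matrix printed above. The sketch below assumes scikit-learn's layout, where rows are true labels and columns are predicted labels, so the matrix unpacks as TN, FP, FN, TP for the positive class `1`.

```python
import numpy as np

# Confusion matrix copied from the output above.
cm = np.array([[97, 14],
               [31, 36]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()     # 133 / 178, about 0.747
precision_1 = tp / (tp + fp)        # 36 / 50 = 0.72
recall_1 = tp / (tp + fn)           # 36 / 67, about 0.537
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)  # about 0.615

print(f"accuracy={accuracy:.3f} precision={precision_1:.2f} "
      f"recall={recall_1:.2f} f1={f1_1:.2f}")
```

The low recall for class `1` comes from the 31 survivors predicted as non-survivors, which is the largest error cell in the matrix.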



## Section 5: Try Another Model

### **Task 5**: Random Forest Classifier

*Instruction*: Train a `RandomForestClassifier` and compare its performance with Logistic Regression.


In [15]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.preprocessing import LabelEncoder

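# Assumption: X_train / X_test still hold the scaled NumPy arrays from the previous cell,
# so they are wrapped back into DataFrames with the original column names.
X_train = pd.DataFrame(X_train, columns=X.columns)
X_test = pd.DataFrame(X_test, columns=X.columns)

categorical_cols = ['Name', 'Sex', 'Embarked', 'Cabin']

# Caution: these columns were already label-encoded and scaled above, so this step
# re-encodes the scaled values as strings and replaces them with new integer codes.
label_encoders = {}
for col in categorical_cols:
    if col in X_train.columns:
        label_encoders[col] = LabelEncoder()
        label_encoders[col].fit(pd.concat([X_train[col], X_test[col]]).astype(str))
        X_train[col] = label_encoders[col].transform(X_train[col].astype(str))
        X_test[col] = label_encoders[col].transform(X_test[col].astype(str))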

logreg_model = LogisticRegression(max_iter=1000)
logreg_model.fit(X_train, y_train)
y_pred_logreg = logreg_model.predict(X_test)
accuracy_logreg = accuracy_score(y_test, y_pred_logreg)

rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_test)
accuracy_rf = accuracy_score(y_test, y_pred_rf)

print("Logistic Regression:")
print(f"Accuracy: {accuracy_logreg}")
print(classification_report(y_test, y_pred_logreg))

print("\nRandom Forest Classifier:")
print(f"Accuracy: {accuracy_rf}")
print(classification_report(y_test, y_pred_rf))

Logistic Regression:
Accuracy: 0.7584269662921348
              precision    recall  f1-score   support

           0       0.77      0.87      0.82       111
           1       0.73      0.57      0.64        67

    accuracy                           0.76       178
   macro avg       0.75      0.72      0.73       178
weighted avg       0.76      0.76      0.75       178


Random Forest Classifier:
Accuracy: 0.797752808988764
              precision    recall  f1-score   support

           0       0.82      0.87      0.84       111
           1       0.76      0.67      0.71        67

    accuracy                           0.80       178
   macro avg       0.79      0.77      0.78       178
weighted avg       0.80      0.80      0.79       178
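
A single 178-row test set gives a fairly noisy comparison between two models. One way to get a steadier picture is k-fold cross-validation with `cross_val_score`. This is a sketch on synthetic data from `make_classification` (`X_demo`, `y_demo`, and the `models` dictionary are hypothetical), so the scores it prints are not Titanic results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary classification problem (not the Titanic data).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=42
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42),
}

cv_means = {}
for name, model in models.items():
    # Five accuracy scores, one per held-out fold.
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_means[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

If the gap between the two means is smaller than the fold-to-fold spread, the difference seen on one split may not be meaningful.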



## Section 6: Model Tuning

### **Task 6**: Hyperparameter Tuning (GridSearch)

*Instruction*: Use `GridSearchCV` to tune `n_estimators` and `max_depth` of the Random Forest model.

In [17]:
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import LabelEncoder

categorical_cols = ['Name', 'Sex', 'Embarked', 'Cabin']

# Note: if the previous cell has already run, these columns hold integer codes;
# this pass converts them to strings and assigns codes again.
label_encoders = {}
for col in categorical_cols:
    if col in X_train.columns:
        label_encoders[col] = LabelEncoder()
        label_encoders[col].fit(pd.concat([X_train[col], X_test[col]]).astype(str))
        X_train[col] = label_encoders[col].transform(X_train[col].astype(str))
        X_test[col] = label_encoders[col].transform(X_test[col].astype(str))

param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [5, 10, 15]
}

rf_model = RandomForestClassifier(random_state=42)

grid_search = GridSearchCV(estimator=rf_model, param_grid=param_grid, cv=5, scoring='accuracy')

grid_search.fit(X_train, y_train)

best_params = grid_search.best_params_
print(f"Best Hyperparameters: {best_params}")

best_rf_model = grid_search.best_estimator_

y_pred_rf = best_rf_model.predict(X_test)

accuracy_rf = accuracy_score(y_test, y_pred_rf)
print(f"Accuracy of the best Random Forest model: {accuracy_rf}")
print(classification_report(y_test, y_pred_rf))

Best Hyperparameters: {'max_depth': 15, 'n_estimators': 100}
Accuracy of the best Random Forest model: 0.797752808988764
              precision    recall  f1-score   support

           0       0.81      0.88      0.84       111
           1       0.77      0.66      0.71        67

    accuracy                           0.80       178
   macro avg       0.79      0.77      0.78       178
weighted avg       0.80      0.80      0.79       178
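
`best_params_` reports only the winning combination. To see how close the other candidates were, the full grid can be read from `cv_results_`. The sketch below uses synthetic data and a deliberately small grid (`X_demo`, `y_demo`, and the grid values are hypothetical, chosen to keep the run short), so it shows the inspection pattern rather than the notebook's actual search.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification problem (not the Titanic data).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=3, random_state=0
)

grid = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [20, 50], "max_depth": [3, 5]},
    cv=3,
    scoring="accuracy",
)
grid.fit(X_demo, y_demo)

# One row per parameter combination, best-ranked first.
results = (
    pd.DataFrame(grid.cv_results_)[
        ["param_n_estimators", "param_max_depth",
         "mean_test_score", "std_test_score", "rank_test_score"]
    ]
    .sort_values("rank_test_score")
    .reset_index(drop=True)
)
print(results)
print(f"Best params: {grid.best_params_}")
print(f"Best mean CV accuracy: {grid.best_score_:.3f}")
```

Note that `best_score_` is the mean cross-validated accuracy on the training data, which is a different quantity from the held-out test accuracy printed in the cell above.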

