# Task 3: Introduction to Machine Learning

## Section 1: Setup & Dataset

### **Task 1**: Load the Dataset

*Instruction*: Load the preprocessed Titanic dataset (from the previous module or load again if needed). Separate it into features (`X`) and target (`y`, where target = `Survived`).

In [12]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

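# Load the preprocessed Titanic data from the previous module
df = pd.read_csv('titanic_preprocessed.csv')

# Drop free-text columns if they are still present
columns_to_drop = ['Name', 'Ticket', 'Cabin']
df = df.drop(columns=[col for col in columns_to_drop if col in df.columns])

# One-hot encode any remaining categorical columns
df = pd.get_dummies(df, drop_first=True)

# Separate the features (X) from the target (y)
X = df.drop('Survived', axis=1)
y = df['Survived']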

## Section 2: Splitting the Data

### **Task 2**: Train/Test Split

*Instruction*: Split the dataset into training and testing sets (80/20 split).


In [13]:

from sklearn.model_selection import train_test_split

# Hold out 20% of the rows for testing; random_state=42 makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
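
To make the 80/20 split concrete, here is a minimal sketch of the size arithmetic. It assumes 886 rows, a figure inferred from the 178 test samples reported in the outputs further down rather than stated in the notebook. It also assumes scikit-learn rounds a fractional `test_size` up to the next whole row and gives the remainder to training. The helper `split_sizes` is hypothetical and only mirrors that arithmetic.

```python
import math

def split_sizes(n_rows, test_size=0.2):
    """Return (n_train, n_test) for a fractional test_size."""
    n_test = math.ceil(test_size * n_rows)   # test share is rounded up
    n_train = n_rows - n_test                # the remainder goes to training
    return n_train, n_test

# Assumption: 886 rows, inferred from the outputs in this notebook.
n_train, n_test = split_sizes(886, test_size=0.2)
print(n_train, n_test)
```

Under these assumptions the split is 708 training rows and 178 test rows, so the test share is slightly above 20%.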


## Section 3: Train Your First Model

### **Task 3**: Logistic Regression

*Instruction*: Train a Logistic Regression model on the Titanic dataset. Display accuracy on both train and test sets.



In [7]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

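# Raise the iteration cap above the default of 100 to give the solver room to converge
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)

# Predict on both splits so train and test accuracy can be compared
y_pred_train = logreg.predict(X_train)
y_pred_test = logreg.predict(X_test)

print("Train Accuracy:", accuracy_score(y_train, y_pred_train))
print("Test Accuracy:", accuracy_score(y_test, y_pred_test))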


Train Accuracy: 0.8177966101694916
Test Accuracy: 0.7752808988764045


## Section 4: Model Evaluation

### **Task 4**: Confusion Matrix & Classification Report

*Instruction*: Evaluate the model using confusion matrix and classification report.

In [14]:
from sklearn.metrics import confusion_matrix, classification_report

print("Confusion Matrix (Logistic Regression):")
print(confusion_matrix(y_test, y_pred_test))

print("\nClassification Report (Logistic Regression):")
print(classification_report(y_test, y_pred_test))



Confusion Matrix (Logistic Regression):
[[81 15]
 [25 57]]

Classification Report (Logistic Regression):
              precision    recall  f1-score   support

           0       0.76      0.84      0.80        96
           1       0.79      0.70      0.74        82

    accuracy                           0.78       178
   macro avg       0.78      0.77      0.77       178
weighted avg       0.78      0.78      0.77       178
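
The classification report can be read directly off the confusion matrix. The sketch below recomputes accuracy, precision, recall, and F1 by hand from the matrix printed above, using plain Python instead of scikit-learn. The helper `per_class_metrics` is a hypothetical name introduced for illustration.

```python
# Confusion matrix printed above: rows are true classes, columns are predictions.
cm = [[81, 15],
      [25, 57]]

def per_class_metrics(cm, cls):
    """Precision, recall and F1 for one class of a square confusion matrix."""
    tp = cm[cls][cls]
    predicted = sum(row[cls] for row in cm)   # column total: predicted as cls
    actual = sum(cm[cls])                     # row total: truly cls (the support)
    precision = tp / predicted
    recall = tp / actual
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Accuracy is the diagonal (correct predictions) over all samples.
accuracy = sum(cm[i][i] for i in range(len(cm))) / sum(map(sum, cm))
print(f"accuracy: {accuracy:.2f}")
for cls in range(len(cm)):
    p, r, f = per_class_metrics(cm, cls)
    print(f"class {cls}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

For class 1 (survived), recall is 57 / 82, so the model misses 25 of the 82 actual survivors in the test set.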



## Section 5: Try Another Model

### **Task 5**: Random Forest Classifier

*Instruction*: Train a `RandomForestClassifier` and compare its performance with Logistic Regression.


In [16]:
from sklearn.ensemble import RandomForestClassifier

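# Default hyperparameters; random_state=42 makes the forest reproducible
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)

# Predict on both splits so the train/test gap can be compared with Logistic Regression
y_pred_rf_train = rf.predict(X_train)
y_pred_rf_test = rf.predict(X_test)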

print("Train Accuracy (Random Forest):", accuracy_score(y_train, y_pred_rf_train))
print("Test Accuracy (Random Forest):", accuracy_score(y_test, y_pred_rf_test))

print("\nConfusion Matrix (Random Forest):")
print(confusion_matrix(y_test, y_pred_rf_test))

print("\nClassification Report (Random Forest):")
print(classification_report(y_test, y_pred_rf_test))



Train Accuracy (Random Forest): 0.9844632768361582
Test Accuracy (Random Forest): 0.8202247191011236

Confusion Matrix (Random Forest):
[[81 15]
 [17 65]]

Classification Report (Random Forest):
              precision    recall  f1-score   support

           0       0.83      0.84      0.84        96
           1       0.81      0.79      0.80        82

    accuracy                           0.82       178
   macro avg       0.82      0.82      0.82       178
weighted avg       0.82      0.82      0.82       178
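
To make the comparison with Logistic Regression explicit, the following sketch tabulates the accuracies printed by the two cells and the gap between train and test accuracy. The numbers are copied from the outputs above. The `results` dictionary and the `gap` column are illustrative additions, not part of the original notebook.

```python
# Accuracies copied from the printed outputs of the two models.
results = {
    "Logistic Regression": {"train": 0.8177966101694916, "test": 0.7752808988764045},
    "Random Forest":       {"train": 0.9844632768361582, "test": 0.8202247191011236},
}

print(f"{'model':<20} {'train':>6} {'test':>6} {'gap':>6}")
for name, acc in results.items():
    gap = acc["train"] - acc["test"]   # a large gap is a common sign of overfitting
    print(f"{name:<20} {acc['train']:>6.3f} {acc['test']:>6.3f} {gap:>6.3f}")
```

On these numbers the Random Forest scores higher on the test set (about 0.82 versus 0.78) but shows a much larger train/test gap (about 0.16 versus 0.04), which is usually read as a sign that the default forest overfits the training data.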



## Section 6: Model Tuning

### **Task 6**: Hyperparameter Tuning (GridSearch)

*Instruction*: Use `GridSearchCV` to tune `n_estimators` and `max_depth` of the Random Forest model.

In [17]:

from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [4, 6, 8, 10, None]
}

# Fresh, unfitted estimator for the search (this rebinds the `rf` name used in Task 5)
rf = RandomForestClassifier(random_state=42)

grid_search = GridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    cv=5,  # 5-fold cross-validation
    scoring='accuracy',
    n_jobs=-1,
    verbose=1
)

grid_search.fit(X_train, y_train)

print("Best Parameters:", grid_search.best_params_)

# best_estimator_ is the winning configuration refit on the full training set
best_rf = grid_search.best_estimator_
y_pred_best_rf = best_rf.predict(X_test)

# accuracy_score, confusion_matrix and classification_report were imported in earlier cells
print("\nTest Accuracy (Tuned Random Forest):", accuracy_score(y_test, y_pred_best_rf))
print("\nConfusion Matrix (Tuned RF):")
print(confusion_matrix(y_test, y_pred_best_rf))
print("\nClassification Report (Tuned RF):")
print(classification_report(y_test, y_pred_best_rf))


Fitting 5 folds for each of 15 candidates, totalling 75 fits
Best Parameters: {'max_depth': 6, 'n_estimators': 50}

Test Accuracy (Tuned Random Forest): 0.8146067415730337

Confusion Matrix (Tuned RF):
[[83 13]
 [20 62]]

Classification Report (Tuned RF):
              precision    recall  f1-score   support

           0       0.81      0.86      0.83        96
           1       0.83      0.76      0.79        82

    accuracy                           0.81       178
   macro avg       0.82      0.81      0.81       178
weighted avg       0.82      0.81      0.81       178
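
The "15 candidates, totalling 75 fits" line in the output follows from the size of the grid. As a rough sketch of that count, the code below enumerates every combination in the same `param_grid` with `itertools.product`. It mirrors the arithmetic only and is not how `GridSearchCV` is implemented; `cv_folds`, `names`, and `candidates` are names introduced for illustration.

```python
from itertools import product

# Same grid and fold count as the GridSearchCV cell above.
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [4, 6, 8, 10, None],
}
cv_folds = 5

# Every combination of one value per hyperparameter is one candidate.
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

# Each candidate is trained once per cross-validation fold.
total_fits = len(candidates) * cv_folds
print(f"{len(candidates)} candidates x {cv_folds} folds = {total_fits} fits")
```

Because the cost grows multiplicatively with each added hyperparameter or value, counting candidates this way is a cheap check before launching a larger search.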

