# Task 3: Introduction to Machine Learning

## Section 1: Setup & Dataset

### **Task 1**: Load the Dataset

*Instruction*: Load the preprocessed Titanic dataset (from the previous module or load again if needed). Separate it into features (`X`) and target (`y`, where target = `Survived`).

In [16]:
import pandas as pd

df = pd.read_csv('titanic.csv')
X = df.drop('Survived', axis=1)
y = df['Survived']
print(X.shape, y.shape)  # quick sanity check on the dimensions of features and target
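
Before splitting, it can help to confirm that the target really has been removed from the features and to see which columns still hold text. The sketch below is only an illustration: it uses a tiny hand-made DataFrame as a hypothetical stand-in for `titanic.csv` (the rows and the names `non_numeric_cols` and `missing_counts` are assumptions, not part of the original notebook).

```python
import pandas as pd

# Hypothetical stand-in for titanic.csv so the snippet runs on its own.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Pclass": [3, 1, 3, 1, 2, 3],
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 54.0, 27.0, 35.0],
})

# Same separation as in the cell above.
X = df.drop("Survived", axis=1)
y = df["Survived"]

# Basic checks: shapes, class balance, text columns, missing values.
print(X.shape, y.shape)
print(y.value_counts(normalize=True))

non_numeric_cols = X.select_dtypes(exclude="number").columns.tolist()
missing_counts = X.isna().sum()
print("Columns that still need encoding:", non_numeric_cols)
print("Missing values per column:")
print(missing_counts)
```

Any column listed as non-numeric has to be encoded before it can be passed to a scikit-learn estimator.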




## Section 2: Splitting the Data

### **Task 2**: Train/Test Split

*Instruction*: Split the dataset into training and testing sets (80/20 split).


In [17]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
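
The split above is purely random. One optional variation, not used in this notebook, is to pass `stratify=y` so that both splits keep the same class proportions. A minimal sketch on synthetic data (the 40/60 class mix and the names `X_demo`, `y_demo` are assumptions chosen for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 40% positive class.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({"feature": rng.normal(size=100)})
y_demo = pd.Series([1] * 40 + [0] * 60, name="target")

# stratify=y_demo keeps the class ratio the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(len(X_tr), len(X_te))
print(y_tr.mean(), y_te.mean())
```

Stratifying matters most when one class is rare, because a plain random split can leave the test set with too few examples of it.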


## Section 3: Train Your First Model

### **Task 3**: Logistic Regression

*Instruction*: Train a Logistic Regression model on the Titanic dataset. Display accuracy on both train and test sets.



In [15]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder

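# Text columns that must be converted to integers before fitting.
# Columns that are not present in the dataset are skipped by the check below.
categorical_cols = ['Name', 'Sex', 'Embarked', 'Cabin']

label_encoders = {}
for col in categorical_cols:
    if col in X_train.columns:
        label_encoders[col] = LabelEncoder()
        # Fit on train and test together so every category in either split gets a code.
        label_encoders[col].fit(pd.concat([X_train[col], X_test[col]]).astype(str))
        X_train[col] = label_encoders[col].transform(X_train[col].astype(str))
        X_test[col] = label_encoders[col].transform(X_test[col].astype(str))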


model = LogisticRegression()
model.fit(X_train, y_train)

y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

accuracy_train = accuracy_score(y_train, y_pred_train)
accuracy_test = accuracy_score(y_test, y_pred_test)

print(f"Accuracy on Train Set: {accuracy_train}")
print(f"Accuracy on Test Set: {accuracy_test}")

Accuracy on Train Set: 0.8110014104372355
Accuracy on Test Set: 0.7471910112359551


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
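
The warning above suggests two remedies: raising `max_iter` or scaling the data. A minimal sketch of both combined is shown below. It runs on synthetic data from `make_classification` rather than the Titanic features, and the value `max_iter=1000` and the names `scaled_logreg` and `n_convergence_warnings` are assumptions for illustration, so the accuracy it prints says nothing about the results reported in this notebook.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with one feature on a much larger scale than the rest.
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)
X_demo[:, 0] *= 1000.0

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Standardize the features, then fit with a higher iteration limit.
scaled_logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Record any warnings raised during fitting so they can be counted.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    scaled_logreg.fit(X_tr, y_tr)

n_convergence_warnings = sum(
    issubclass(w.category, ConvergenceWarning) for w in caught
)
test_accuracy = scaled_logreg.score(X_te, y_te)

print(f"Convergence warnings: {n_convergence_warnings}")
print(f"Accuracy on Test Set: {test_accuracy}")
```

Putting the scaler inside the pipeline means it is fitted on the training split only, so no statistics from the test split leak into the model.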


## Section 4: Model Evaluation

### **Task 4**: Confusion Matrix & Classification Report

*Instruction*: Evaluate the model using confusion matrix and classification report.

In [14]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.preprocessing import LabelEncoder

categorical_cols = ['Name', 'Sex', 'Embarked', 'Cabin']

label_encoders = {}
for col in categorical_cols:
    if col in X_train.columns:
        label_encoders[col] = LabelEncoder()
        label_encoders[col].fit(pd.concat([X_train[col], X_test[col]]).astype(str))
        X_train[col] = label_encoders[col].transform(X_train[col].astype(str))
        X_test[col] = label_encoders[col].transform(X_test[col].astype(str))


model = LogisticRegression()
model.fit(X_train, y_train)

y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

accuracy_train = accuracy_score(y_train, y_pred_train)
accuracy_test = accuracy_score(y_test, y_pred_test)

print(f"Accuracy on Train Set: {accuracy_train}")
print(f"Accuracy on Test Set: {accuracy_test}")

# Confusion Matrix and Classification Report
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred_test))

print("\nClassification Report:")
print(classification_report(y_test, y_pred_test))

Accuracy on Train Set: 0.8039492242595204
Accuracy on Test Set: 0.7359550561797753

Confusion Matrix:
[[96 15]
 [32 35]]

Classification Report:
              precision    recall  f1-score   support

           0       0.75      0.86      0.80       111
           1       0.70      0.52      0.60        67

    accuracy                           0.74       178
   macro avg       0.72      0.69      0.70       178
weighted avg       0.73      0.74      0.73       178
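
The figures in the classification report can be recomputed by hand from the confusion matrix printed above. scikit-learn lays the matrix out with true labels as rows and predicted labels as columns, so for a binary problem `ravel()` yields the counts in the order TN, FP, FN, TP. A short worked example for class `1`:

```python
import numpy as np

# Confusion matrix copied from the output above.
cm = np.array([[96, 15],
               [32, 35]])

# Rows are true labels, columns are predicted labels.
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()        # correct predictions / all predictions
precision_1 = tp / (tp + fp)           # of those predicted 1, how many were 1
recall_1 = tp / (tp + fn)              # of the actual 1s, how many were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(f"accuracy:  {accuracy:.2f}")
print(f"precision: {precision_1:.2f}")
print(f"recall:    {recall_1:.2f}")
print(f"f1-score:  {f1_1:.2f}")
```

The recall of 0.52 for class `1` comes from the 32 false negatives: the model misses almost half of the passengers who actually survived.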



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


## Section 5: Try Another Model

### **Task 5**: Random Forest Classifier

*Instruction*: Train a `RandomForestClassifier` and compare its performance with Logistic Regression.


In [13]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.preprocessing import LabelEncoder

categorical_cols = ['Name', 'Sex', 'Embarked', 'Cabin']

label_encoders = {}
for col in categorical_cols:
    if col in X_train.columns:
        label_encoders[col] = LabelEncoder()
        label_encoders[col].fit(pd.concat([X_train[col], X_test[col]]).astype(str))
        X_train[col] = label_encoders[col].transform(X_train[col].astype(str))
        X_test[col] = label_encoders[col].transform(X_test[col].astype(str))

# Logistic Regression
logreg_model = LogisticRegression()
logreg_model.fit(X_train, y_train)
y_pred_logreg = logreg_model.predict(X_test)
accuracy_logreg = accuracy_score(y_test, y_pred_logreg)

# Random Forest Classifier
rf_model = RandomForestClassifier(random_state=42)  # Added random_state for reproducibility
rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_test)
accuracy_rf = accuracy_score(y_test, y_pred_rf)

# Compare Performance
print("Logistic Regression:")
print(f"Accuracy: {accuracy_logreg}")
print(classification_report(y_test, y_pred_logreg))

print("\nRandom Forest Classifier:")
print(f"Accuracy: {accuracy_rf}")
print(classification_report(y_test, y_pred_rf))


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Logistic Regression:
Accuracy: 0.7303370786516854
              precision    recall  f1-score   support

           0       0.75      0.86      0.80       111
           1       0.69      0.52      0.59        67

    accuracy                           0.73       178
   macro avg       0.72      0.69      0.70       178
weighted avg       0.72      0.73      0.72       178


Random Forest Classifier:
Accuracy: 0.797752808988764
              precision    recall  f1-score   support

           0       0.82      0.87      0.84       111
           1       0.76      0.67      0.71        67

    accuracy                           0.80       178
   macro avg       0.79      0.77      0.78       178
weighted avg       0.80      0.80      0.79       178
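
The comparison above rests on a single 178-row test split, so part of the gap between the two models may be due to which rows landed in that split. One way to get a steadier comparison is k-fold cross-validation. The sketch below is an illustration on synthetic data, not a rerun of the Titanic experiment; the `candidates` dictionary, the scaling pipeline, and `n_estimators=50` are assumptions made to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data.
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)

candidates = {
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=42),
}

# Score each model on the same 5 folds and report mean and spread.
cv_scores = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_scores[name] = scores
    print(f"{name}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```

If the difference between the mean scores is small compared with the standard deviations, the two models are hard to tell apart on that data.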



## Section 6: Model Tuning

### **Task 6**: Hyperparameter Tuning (GridSearch)

*Instruction*: Use `GridSearchCV` to tune `n_estimators` and `max_depth` of the Random Forest model.

In [12]:
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import LabelEncoder

categorical_cols = ['Name', 'Sex', 'Embarked', 'Cabin']

label_encoders = {}
for col in categorical_cols:
    if col in X_train.columns:
        label_encoders[col] = LabelEncoder()
        label_encoders[col].fit(pd.concat([X_train[col], X_test[col]]).astype(str))
        X_train[col] = label_encoders[col].transform(X_train[col].astype(str))
        X_test[col] = label_encoders[col].transform(X_test[col].astype(str))

# Define the parameter grid
param_grid = {
    'n_estimators': [100, 200, 300],  # Try different numbers of trees
    'max_depth': [5, 10, 15]  # Try different maximum depths of trees
}

# Create a RandomForestClassifier
rf_model = RandomForestClassifier(random_state=42)

# Create GridSearchCV object
grid_search = GridSearchCV(estimator=rf_model, param_grid=param_grid, cv=5, scoring='accuracy')

# Fit the GridSearchCV to the training data
grid_search.fit(X_train, y_train)

# Get the best hyperparameters
best_params = grid_search.best_params_
print(f"Best Hyperparameters: {best_params}")

# Get the best model
best_rf_model = grid_search.best_estimator_

# Make predictions using the best model
y_pred_rf = best_rf_model.predict(X_test)

# Evaluate the best model
accuracy_rf = accuracy_score(y_test, y_pred_rf)
print(f"Accuracy of the best Random Forest model: {accuracy_rf}")
print(classification_report(y_test, y_pred_rf))


Best Hyperparameters: {'max_depth': 10, 'n_estimators': 300}
Accuracy of the best Random Forest model: 0.7921348314606742
              precision    recall  f1-score   support

           0       0.80      0.88      0.84       111
           1       0.77      0.64      0.70        67

    accuracy                           0.79       178
   macro avg       0.79      0.76      0.77       178
weighted avg       0.79      0.79      0.79       178
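
Note that the tuned model's test accuracy (about 0.792) is slightly below that of the default Random Forest in Task 5 (about 0.798). `GridSearchCV` picks the combination with the best mean cross-validation score on the training data, which does not guarantee a better score on one particular test split. Looking at `cv_results_` shows how close the candidates were. The sketch below uses synthetic data and a deliberately small grid (both assumptions, chosen so it runs quickly); the names `small_grid`, `search`, and `results` are illustrative.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# A smaller grid than the one above, to keep the example fast.
small_grid = {"n_estimators": [25, 50], "max_depth": [3, 5]}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid=small_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_tr, y_tr)

# One row per parameter combination, best-ranked first.
results = pd.DataFrame(search.cv_results_)[
    ["param_n_estimators", "param_max_depth",
     "mean_test_score", "std_test_score", "rank_test_score"]
].sort_values("rank_test_score")
print(results.to_string(index=False))

# Cross-validated score of the winner versus its score on the held-out split.
print(f"Best CV accuracy: {search.best_score_:.3f}")
print(f"Held-out accuracy: {search.score(X_te, y_te):.3f}")
```

When the `mean_test_score` values differ by less than their `std_test_score`, the grid has not found a clearly better setting, and the simpler or faster configuration is usually a reasonable choice.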

