# Task 3: Introduction to Machine Learning

## Section 1: Setup & Dataset

### **Task 1**: Load the Dataset

*Instruction*: Load the preprocessed Titanic dataset (from the previous module or load again if needed). Separate it into features (`X`) and target (`y`, where target = `Survived`).

In [7]:
import pandas as pd

# Look for the preprocessed file in the 'data' subfolder first. If it is missing,
# fall back to 'titanic.csv' in the current directory.
try:
    df = pd.read_csv('data/titanic_preprocessed.csv')
except FileNotFoundError:
    print("File not found in 'data' subfolder. Check the file path or place the file in the correct location.")
    # Fallback: 'titanic.csv' in the current directory. The file name suggests the raw
    # dataset, so it may still contain text columns that need encoding later.
    df = pd.read_csv('titanic.csv')

X = df.drop('Survived', axis=1)
y = df['Survived']

File not found in 'data' subfolder. Check the file path or place the file in the correct location.
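
Because the fallback path was taken here, it can help to confirm what actually ended up in `X` and `y` before modelling. The following is a minimal sketch on a small hypothetical frame; the `toy` DataFrame and its column names are assumptions standing in for the real file, not its actual contents.

```python
import pandas as pd

# Hypothetical stand-in for the loaded Titanic frame; column names are assumed.
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Pclass": [3, 1, 3, 2],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, None, 35.0],
})

X_toy = toy.drop(columns="Survived")
y_toy = toy["Survived"]

# The target must not leak into the feature matrix.
assert "Survived" not in X_toy.columns

# Non-numeric columns will need encoding before they reach a model.
non_numeric_cols = X_toy.select_dtypes(exclude="number").columns.tolist()

# Missing values per column; many scikit-learn estimators,
# LogisticRegression among them, reject NaN.
missing = X_toy.isna().sum()

print(X_toy.shape, y_toy.shape)
print("Non-numeric columns:", non_numeric_cols)
print(missing[missing > 0])
```

Running the same three checks on the real `X` and `y` shows early whether the loaded file is the preprocessed one or the raw one.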


## Section 2: Splitting the Data

### **Task 2**: Train/Test Split

*Instruction*: Split the dataset into training and testing sets (80/20 split).


In [8]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
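
The split above is purely random, so the share of survivors can differ between the training and test sets. A variation worth knowing is the `stratify` argument of `train_test_split`, which keeps the class ratio the same in both parts. The sketch below uses made-up labels (`y_demo`, `X_demo` are hypothetical) rather than the Titanic data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 60% class 0, 40% class 1.
y_demo = np.array([0] * 60 + [1] * 40)
X_demo = np.arange(100).reshape(-1, 1)

# stratify=y_demo asks for the same class proportions in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(len(X_tr), len(X_te))      # 80 20
print(y_tr.mean(), y_te.mean())  # 0.4 0.4
```

In the notebook this would amount to adding `stratify=y` to the existing call; doing so changes which rows land in the test set, so the accuracies reported below would change as well.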


## Section 3: Train Your First Model

### **Task 3**: Logistic Regression

*Instruction*: Train a Logistic Regression model on the Titanic dataset. Display accuracy on both train and test sets.



In [10]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Uses X_train, X_test, y_train, y_test from the split in Task 2.

# Identify columns with string values (object dtype)
categorical_cols = X_train.select_dtypes(include=['object']).columns

# One-hot encode categorical features
X_train = pd.get_dummies(X_train, columns=categorical_cols)
X_test = pd.get_dummies(X_test, columns=categorical_cols)

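# Align the test columns to the training columns: dummy columns that appear only
# in the test set are dropped, and columns missing from it are filled with 0
X_train, X_test = X_train.align(X_test, join='left', axis=1, fill_value=0)


# Create and train the model (default settings, so max_iter=100)
model = LogisticRegression()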
model.fit(X_train, y_train)

# Make predictions and calculate accuracy
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)

print(f"Training Accuracy: {train_accuracy}")
print(f"Test Accuracy: {test_accuracy}")

Training Accuracy: 0.843441466854725
Test Accuracy: 0.7528089887640449


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
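
The warning above recommends either scaling the features or raising `max_iter`. One way to do both, while also keeping the one-hot encoding inside the model so train and test columns cannot drift apart, is a scikit-learn `Pipeline` with a `ColumnTransformer`. This is a sketch under assumptions, not the notebook's method: the toy rows and the column names `Age`, `Fare`, and `Sex` are hypothetical, and `max_iter=1000` is an arbitrary generous value.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; the notebook would pass its own X_train / y_train.
train = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0, 54.0, 2.0, 27.0, 14.0],
    "Fare": [7.25, 71.28, 7.92, 53.10, 51.86, 21.08, 11.13, 30.07],
    "Sex": ["male", "female", "female", "female",
            "male", "male", "female", "female"],
})
labels = [0, 1, 1, 1, 0, 0, 1, 1]

numeric_cols = ["Age", "Fare"]
categorical_cols = ["Sex"]

# Scale numeric columns; one-hot encode text columns. Categories unseen
# during fit are encoded as all zeros instead of raising an error.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

clf = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])
clf.fit(train, labels)

# A row with a category the encoder never saw still gets a prediction.
new_rows = pd.DataFrame({"Age": [30.0], "Fare": [10.0], "Sex": ["unknown"]})
print(clf.predict(new_rows))
```

Because the preprocessing is fitted only on the data passed to `fit`, the same pipeline object can later be handed to cross-validation or a grid search without leaking test information into the scaler or encoder.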


## Section 4: Model Evaluation

### **Task 4**: Confusion Matrix & Classification Report

*Instruction*: Evaluate the model using confusion matrix and classification report.

In [11]:
from sklearn.metrics import confusion_matrix, classification_report

# Generate predictions on the test set
y_pred = model.predict(X_test)

# Calculate the confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Print the confusion matrix
print("Confusion Matrix:")
print(cm)

# Generate the classification report
cr = classification_report(y_test, y_pred)

# Print the classification report
print("\nClassification Report:")
print(cr)

Confusion Matrix:
[[95 16]
 [28 39]]

Classification Report:
              precision    recall  f1-score   support

           0       0.77      0.86      0.81       111
           1       0.71      0.58      0.64        67

    accuracy                           0.75       178
   macro avg       0.74      0.72      0.73       178
weighted avg       0.75      0.75      0.75       178
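
The figures in the classification report can be derived by hand from the confusion matrix, which is a useful check on how to read it. The sketch below recomputes the class-1 (survived) row and the overall accuracy from the four counts printed above.

```python
import numpy as np

# Confusion matrix reported above: rows are true classes, columns are predictions.
cm = np.array([[95, 16],
               [28, 39]])

# For binary labels 0/1, ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)   # of predicted survivors, how many survived
recall = tp / (tp + fn)      # of actual survivors, how many were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / cm.sum()

print(f"precision={precision:.2f} recall={recall:.2f} "
      f"f1={f1:.2f} accuracy={accuracy:.2f}")
# precision=0.71 recall=0.58 f1=0.64 accuracy=0.75
```

The recall of 0.58 means the model misses 28 of the 67 actual survivors in the test set, which the overall accuracy of 0.75 does not show on its own.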



## Section 5: Try Another Model

### **Task 5**: Random Forest Classifier

*Instruction*: Train a `RandomForestClassifier` and compare its performance with Logistic Regression.


In [13]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Create a Random Forest Classifier model
rf_model = RandomForestClassifier(random_state=42)  # Setting random_state for reproducibility

# Train the model on the training data
rf_model.fit(X_train, y_train)

# Make predictions on the training and test sets
y_train_pred_rf = rf_model.predict(X_train)
y_test_pred_rf = rf_model.predict(X_test)

# Calculate accuracy scores
train_accuracy_rf = accuracy_score(y_train, y_train_pred_rf)
test_accuracy_rf = accuracy_score(y_test, y_test_pred_rf)

# Display the accuracy scores for comparison with Logistic Regression
print("Random Forest Classifier:")
print(f"Training Accuracy: {train_accuracy_rf}")
print(f"Test Accuracy: {test_accuracy_rf}")


Random Forest Classifier:
Training Accuracy: 1.0
Test Accuracy: 0.8202247191011236
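
A training accuracy of 1.0 against a test accuracy of about 0.82 suggests the forest fits the training rows almost perfectly, and a single 178-row test set gives a fairly noisy comparison with Logistic Regression. Cross-validation is one common way to get a steadier estimate. The sketch below uses synthetic data from `make_classification` as a stand-in (an assumption); in the notebook, `X_train` and `y_train` would take its place.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption, not the Titanic features).
X_syn, y_syn = make_classification(n_samples=300, n_features=8, random_state=42)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=42),
}

# Score each model on the same 5 folds and report mean and spread.
cv_scores = {}
for name, estimator in models.items():
    scores = cross_val_score(estimator, X_syn, y_syn, cv=5, scoring="accuracy")
    cv_scores[name] = scores
    print(f"{name}: mean={scores.mean():.3f} std={scores.std():.3f}")
```

If the difference between the two means is smaller than the fold-to-fold standard deviation, the comparison is probably not strong evidence that one model is better.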


## Section 6: Model Tuning

### **Task 6**: Hyperparameter Tuning (GridSearch)

*Instruction*: Use `GridSearchCV` to tune `n_estimators` and `max_depth` of the Random Forest model.

In [14]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# Define the parameter grid
param_grid = {
    'n_estimators': [50, 100, 150, 200],  # Number of trees in the forest
    'max_depth': [None, 10, 20, 30],  # Maximum depth of the trees
}

# Create a Random Forest Classifier
rf_model = RandomForestClassifier(random_state=42)

# Create GridSearchCV object
grid_search = GridSearchCV(estimator=rf_model, param_grid=param_grid,
                          scoring='accuracy', cv=5, n_jobs=-1, verbose=2)

# Fit the grid search to the data
grid_search.fit(X_train, y_train)

# Get the best parameters and the mean cross-validated accuracy they achieved
best_params = grid_search.best_params_
best_score = grid_search.best_score_

# Print the best parameters and score
print(f"Best Parameters: {best_params}")
print(f"Best Accuracy Score: {best_score}")

# Evaluate the refit best estimator on the held-out test set. A separate variable
# name keeps the Logistic Regression test_accuracy from Task 3 intact.
best_rf_model = grid_search.best_estimator_
tuned_test_accuracy = best_rf_model.score(X_test, y_test)
print(f"Test Accuracy with Best Parameters: {tuned_test_accuracy}")


Fitting 5 folds for each of 16 candidates, totalling 80 fits
Best Parameters: {'max_depth': 20, 'n_estimators': 50}
Best Accuracy Score: 0.8307761462391369
Test Accuracy with Best Parameters: 0.7752808988764045
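
The tuned forest scores about 0.831 in cross-validation but 0.775 on the test set, which is below the 0.820 of the untuned forest in Task 5. Before reading much into the "best" parameters, it may help to look at how close the candidates were to each other. `GridSearchCV` stores per-candidate results in `cv_results_`; the sketch below tabulates them for a deliberately small grid on synthetic data (both the data and the grid values are assumptions for illustration).

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (an assumption, not the Titanic features).
X_syn, y_syn = make_classification(n_samples=200, n_features=8, random_state=42)

# Small illustrative grid: 2 x 2 = 4 candidates.
small_grid = {"n_estimators": [10, 30], "max_depth": [None, 5]}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid=small_grid,
    scoring="accuracy",
    cv=3,
)
search.fit(X_syn, y_syn)

# One row per candidate, best-ranked first.
results = pd.DataFrame(search.cv_results_)[
    ["param_n_estimators", "param_max_depth",
     "mean_test_score", "std_test_score", "rank_test_score"]
].sort_values("rank_test_score")
print(results.to_string(index=False))
```

In the notebook, the same table built from `grid_search.cv_results_` would show whether the winning combination leads the other 15 candidates by more than `std_test_score`; if not, the choice among them is largely down to fold noise.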
