## Importing Required Libraries

In this cell, we import the Python libraries needed for data handling, visualization, model building, and evaluation.

In [18]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

## Loading the Training Dataset

Here, we load the processed training dataset from a CSV file using `pandas.read_csv()`.
The dataset (`train_features.csv`) contains engineered features that will be used to train our machine learning model.

In [19]:
train_data = pd.read_csv('../data/processed/train_features.csv')

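### Convert to integer codes

We encode the categorical columns `Embarked`, `Title`, and `CabinDeck` as integer codes through the pandas `category` dtype, storing each result in a new `*_int` column.
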
In [20]:
train_data['Embarked_int'] = train_data['Embarked'].astype('category').cat.codes
train_data['Title_int'] = train_data['Title'].astype('category').cat.codes
train_data['CabinDeck_int'] = train_data['CabinDeck'].astype('category').cat.codes

KeyError: 'Embarked'
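
The `KeyError` above means the loaded frame has no column named `Embarked`, so it is worth checking which expected columns are present before encoding. The following is a minimal sketch on a made-up toy frame (the `toy` data and the `wanted` and `missing` names are illustrative assumptions, not part of the original notebook). It also shows what `.cat.codes` produces: categories are sorted, and missing values receive the code `-1`.

```python
import pandas as pd

# Toy frame standing in for train_data; the values are made up for illustration.
toy = pd.DataFrame({
    "Embarked": ["S", "C", None, "Q", "S"],
    "Title": ["Mr", "Mrs", "Miss", "Mr", "Miss"],
})

# Columns the notebook tries to encode.
wanted = ["Embarked", "Title", "CabinDeck"]

# Report absent columns up front instead of letting the lookup raise KeyError.
missing = [col for col in wanted if col not in toy.columns]
print("Missing columns:", missing)

# Encode only the columns that exist. Categories are sorted, so the codes
# follow alphabetical order, and missing values are encoded as -1.
for col in wanted:
    if col in toy.columns:
        toy[col + "_int"] = toy[col].astype("category").cat.codes

print(toy)
```

If a column is reported as missing in the real dataset, the likely causes are that the feature-engineering step dropped or renamed it, or that it was already encoded under a different name; printing `train_data.columns` would show which.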

## Defining Features (X) and Target (y)

This step separates the independent variables (features) from the dependent variable (label) for model training.

In [None]:

X = train_data[[
    "Sex", "Pclass", "Fare", "SibSp", "Parch",
    "Age", "Embarked_int", "Title_int", "CabinDeck_int",
]].copy()
y = train_data['Survived']


## Previewing the Feature Matrix

We use `X.head()` to display the first few rows of the feature matrix.

In [None]:
X.head()


## Splitting Data into Training and Validation Sets

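We hold out 20% of the rows (`test_size=0.2`) as a validation set and fix `random_state=42` so the split is reproducible. This lets us train the model on one portion of the data and evaluate it on data it has not seen.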

In [None]:
X_train, X_val, y_train, y_val = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42
)
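
To make the effect of `test_size=0.2` concrete, here is a small sketch on synthetic data (the `X_toy` and `y_toy` frames are made up for illustration). The second call adds `stratify=y_toy`, an optional `train_test_split` argument that the notebook does not use; it keeps the class ratio the same in both parts, which can matter when the classes are imbalanced.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 30 of them in the positive class.
X_toy = pd.DataFrame({"feature": np.arange(100)})
y_toy = pd.Series([1] * 30 + [0] * 70, name="Survived")

# Same settings as the notebook: 20% validation, fixed seed.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)
print("Train rows:", len(X_tr), "Validation rows:", len(X_va))

# Optional variant (not in the notebook): preserve the 30% positive rate
# in both the training and the validation part.
X_tr_s, X_va_s, y_tr_s, y_va_s = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)
print("Positive rate (train):", y_tr_s.mean())
print("Positive rate (validation):", y_va_s.mean())
```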

## Initializing the Random Forest Classifier

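We configure the forest with 200 trees (`n_estimators`), a maximum tree depth of 15 (`max_depth`), and a fixed `random_state` for reproducibility. The values `min_samples_split=2` and `min_samples_leaf=1` match the scikit-learn defaults and are written out for clarity.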

In [None]:
rf_model = RandomForestClassifier(
    n_estimators=200,
    max_depth=15,
    min_samples_split=2,
    min_samples_leaf=1,
    random_state=42
)
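
One way to see which of these settings actually differ from the library defaults is to compare `get_params()` on a default estimator and on the configured one. This is a small sketch, assuming a recent scikit-learn release (the `defaults`, `configured`, and `changed` names are illustrative).

```python
from sklearn.ensemble import RandomForestClassifier

# Parameters of an estimator built with no arguments.
defaults = RandomForestClassifier().get_params()

# Parameters of the estimator configured as in the notebook.
configured = RandomForestClassifier(
    n_estimators=200,
    max_depth=15,
    min_samples_split=2,
    min_samples_leaf=1,
    random_state=42,
).get_params()

# Keep only the parameters whose value differs from the default,
# as (default, configured) pairs.
changed = {
    name: (defaults[name], value)
    for name, value in configured.items()
    if defaults[name] != value
}
print(changed)
```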

## Training the Random Forest Model


Calling `fit` trains the forest on the training split (`X_train`, `y_train`). The model learns patterns in the features that help it predict survival outcomes.

In [None]:
rf_model.fit(X_train, y_train)

## Generating Predictions on Validation Data

After training, we use the model to predict survival outcomes for the validation dataset (`X_val`).
The predictions are stored in `y_pred`, which will be used to evaluate model performance.

In [None]:
y_pred = rf_model.predict(X_val)

## Evaluating Model Performance

We report three views of validation performance: overall accuracy, a per-class classification report (precision, recall, and F1-score), and the confusion matrix.

In [None]:
print("Accuracy:", accuracy_score(y_val, y_pred))
print("\nClassification Report:\n", classification_report(y_val, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_val, y_pred))
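
To show how these metrics relate to one another, the sketch below computes them by hand from a confusion matrix on a tiny made-up label set (`y_true` and `y_hat` are illustrative, not model output) and compares the results with the scikit-learn functions.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Made-up labels: 4 negatives and 4 positives.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_hat = [0, 0, 0, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_hat)
tn, fp, fn, tp = cm.ravel()
print(cm)

# Metrics derived directly from the four cells.
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # of the predicted positives, how many are correct
recall = tp / (tp + fn)     # of the actual positives, how many were found

print("Accuracy:", accuracy, "vs", accuracy_score(y_true, y_hat))
print("Precision:", precision, "vs", precision_score(y_true, y_hat))
print("Recall:", recall, "vs", recall_score(y_true, y_hat))
```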

## Cross-Validation for More Robust Evaluation

Cross-validation provides a more reliable estimate of model performance by reducing dependence on a single train-test split.

In [None]:
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(rf_model, X_train, y_train, cv=cv)

print("\nCV Accuracy Scores:", cv_scores)
print("Mean CV Accuracy:", cv_scores.mean())
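
As a rough illustration of what `StratifiedKFold` does, the sketch below splits synthetic data (random features and a 60/40 class split, both made up for this example) and prints the class composition of each validation fold. It then reports the spread of the fold scores alongside the mean, since the mean alone hides how much the folds disagree. The smaller `n_estimators=20` is only there to keep the example fast.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic data: 100 rows, 3 random features, 40 positives.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = np.array([0] * 60 + [1] * 40)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Each validation fold keeps the overall 60/40 class ratio.
fold_positives = []
for fold, (train_idx, val_idx) in enumerate(cv.split(X_toy, y_toy)):
    positives = int(y_toy[val_idx].sum())
    fold_positives.append(positives)
    print(f"Fold {fold}: {len(val_idx)} rows, {positives} positives")

# cross_val_score fits a fresh clone of the model on each training fold.
model = RandomForestClassifier(n_estimators=20, random_state=42)
scores = cross_val_score(model, X_toy, y_toy, cv=cv)
print("Mean accuracy:", scores.mean())
print("Std of accuracy:", scores.std())
```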
