
1. Use any binary classification dataset
2. Define a validation strategy and use it unchanged for all subsequent steps
3. Train a decision tree model and estimate its performance on validation

*   Validation approach: the dataset is small, so I will use **stratified K-fold** cross-validation with 5 folds.
*   Metric: I will optimize the **F-beta score** with beta = 0.5, which weights precision more heavily than recall. This penalizes false positives, i.e. predicting that a passenger survived when they actually did not.
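
To illustrate why beta = 0.5 favors precision, here is a small sketch on made-up labels (the toy arrays below are assumptions for illustration, not Titanic data). It compares a cautious classifier (no false positives, some misses) with an eager one (no misses, many false positives) under beta = 0.5 and beta = 2.

```python
from sklearn.metrics import fbeta_score, precision_score, recall_score

# Toy labels (made up for illustration): 4 positives, 6 negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
# Cautious classifier: 2 true positives, 0 false positives, 2 false negatives.
y_cautious = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
# Eager classifier: 4 true positives, 4 false positives, 0 false negatives.
y_eager = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]

for name, y_pred in [("cautious", y_cautious), ("eager", y_eager)]:
    p = precision_score(y_true, y_pred)
    r = recall_score(y_true, y_pred)
    f_half = fbeta_score(y_true, y_pred, beta=0.5)  # precision-weighted
    f_two = fbeta_score(y_true, y_pred, beta=2)     # recall-weighted
    print(f"{name}: precision={p:.2f} recall={r:.2f} F0.5={f_half:.3f} F2={f_two:.3f}")
```

With beta = 0.5 the cautious classifier scores higher, while beta = 2 reverses the ranking, which is the behavior the metric choice above relies on.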



In [11]:
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np


from google.colab import drive
drive.mount('/content/drive')

train_full = pd.read_csv("/content/drive/MyDrive/MLb4/EDA Titanic/train.csv")
test_full = pd.read_csv("/content/drive/MyDrive/MLb4/EDA Titanic/test.csv")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [12]:
def generate(df, age_median, fare_median):
  """Drop unused columns, one-hot encode Sex and Embarked, and fill missing Age and Fare."""
  useless_features = ['Name', 'Ticket', 'Cabin']
  data_cleaned = df.drop(columns=useless_features)

  # One-hot encode the categorical columns using get_dummies
  data_cleaned = pd.get_dummies(data_cleaned, columns=['Sex'], prefix=["Sex"])
  data_cleaned = pd.get_dummies(data_cleaned, columns=['Embarked'], prefix=["Embarked"])

  # Fill missing values with the medians passed in (computed on the training set)
  data_cleaned['Age'] = data_cleaned['Age'].fillna(age_median)
  data_cleaned['Fare'] = data_cleaned['Fare'].fillna(fare_median)

  return data_cleaned


In [13]:
features_columns = ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare', 'Sex_female', 'Sex_male', 'Embarked_C', 'Embarked_Q', 'Embarked_S']

In [14]:
# get medians that will fill NaNs in generate func
age_median = train_full['Age'].median()
fare_median = train_full['Fare'].median()

In [15]:
train = generate(train_full,age_median=age_median,fare_median=fare_median)
train

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare,Sex_female,Sex_male,Embarked_C,Embarked_Q,Embarked_S
0,1,0,3,22.0,1,0,7.2500,0,1,0,0,1
1,2,1,1,38.0,1,0,71.2833,1,0,1,0,0
2,3,1,3,26.0,0,0,7.9250,1,0,0,0,1
3,4,1,1,35.0,1,0,53.1000,1,0,0,0,1
4,5,0,3,35.0,0,0,8.0500,0,1,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,27.0,0,0,13.0000,0,1,0,0,1
887,888,1,1,19.0,0,0,30.0000,1,0,0,0,1
888,889,0,3,28.0,1,2,23.4500,1,0,0,0,1
889,890,1,1,26.0,0,0,30.0000,0,1,1,0,0


In [16]:
titanic_features = train[features_columns]
titanic_label = train['Survived']

In [17]:
from sklearn.model_selection import StratifiedKFold, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, fbeta_score, classification_report, accuracy_score
from sklearn import metrics


In [18]:
random_state = 42

In [19]:
# Initialize the StratifiedKFold object
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=random_state)
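
As a quick sanity check of what stratification buys us, the sketch below splits synthetic labels with the same class counts as the classification report in this notebook (549 negatives, 342 positives) and prints the positive rate in each validation fold. The names `skf_demo`, `X_dummy`, and `y_dummy` are hypothetical and exist only for this illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic labels with the class counts reported in this notebook.
y_dummy = np.array([0] * 549 + [1] * 342)
# The feature values do not influence how StratifiedKFold splits the rows.
X_dummy = np.zeros((len(y_dummy), 1))

skf_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_rates = []
for fold, (train_idx, val_idx) in enumerate(skf_demo.split(X_dummy, y_dummy)):
    rate = y_dummy[val_idx].mean()  # share of positives in this validation fold
    fold_rates.append(rate)
    print(f"fold {fold}: {len(val_idx)} validation rows, positive rate {rate:.3f}")
```

Every fold should show nearly the same positive rate, so the F-beta score is computed on comparably balanced validation sets.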

In [20]:
# Define the custom scoring function with the desired beta value
beta = 0.5
custom_scorer = make_scorer(fbeta_score, beta=beta)

In [21]:
from sklearn.tree import DecisionTreeClassifier



# Define the DecisionTreeClassifier model
classifier = DecisionTreeClassifier(random_state=random_state)

# Define the hyperparameter grid to search over
param_grid = {
    'criterion': ['gini'],
    'splitter': ['random'],
    'max_depth': [None, 3, 5, 7, 9, 11],
    'min_samples_leaf': [1,  3,  5,  7,  9,  11],
    # Note: features_columns has 10 entries, so 10 is the largest meaningful value here
    'max_features': [ 1,  3,  5,  7,  9,  11],
    'class_weight': ['balanced']
}


# Initialize the GridSearchCV object with the DecisionTreeClassifier, hyperparameter grid, and custom scorer
grid_search = GridSearchCV(classifier, param_grid, scoring=custom_scorer, cv=skf)

# Perform grid search to find the best hyperparameters
grid_search.fit(titanic_features, titanic_label)

# Get the best hyperparameters and the corresponding model
best_params = grid_search.best_params_
best_model = grid_search.best_estimator_


# Print the best hyperparameters
print("Best Hyperparameters:", best_params)

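# Optionally, evaluate the best model on the full training data using the F-beta score with beta=0.5.
# This is an optimistic in-sample estimate; the cross-validated score is grid_search.best_score_.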
y_pred = best_model.predict(titanic_features)

fbeta = fbeta_score(titanic_label, y_pred, beta=beta)
accuracy = accuracy_score(titanic_label, y_pred)

print("F-beta Score (beta={}): {:.4f}".format(beta, fbeta))
print("Accuracy: {:.4f}".format(accuracy))
print(metrics.classification_report(titanic_label, y_pred))
print(metrics.confusion_matrix(titanic_label, y_pred))


Best Hyperparameters: {'class_weight': 'balanced', 'criterion': 'gini', 'max_depth': 5, 'max_features': 3, 'min_samples_leaf': 3, 'splitter': 'random'}
F-beta Score (beta=0.5): 0.7950
Accuracy: 0.8249
              precision    recall  f1-score   support

           0       0.82      0.91      0.87       549
           1       0.83      0.69      0.75       342

    accuracy                           0.82       891
   macro avg       0.83      0.80      0.81       891
weighted avg       0.83      0.82      0.82       891

[[500  49]
 [107 235]]
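
The reported scores can be cross-checked by hand from the confusion matrix printed above. The sketch below only plugs those four counts into the standard precision, recall, and F-beta formulas; the variable names are chosen for this illustration.

```python
# Counts taken from the confusion matrix above:
# rows are true classes, columns are predicted classes.
tn, fp = 500, 49
fn, tp = 107, 235
beta = 0.5

precision = tp / (tp + fp)
recall = tp / (tp + fn)
# F-beta combines both, with beta < 1 giving more weight to precision.
f_beta = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.4f} recall={recall:.4f}")
print(f"F-beta (beta={beta})={f_beta:.4f} accuracy={accuracy:.4f}")
```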



4. Train a bagging model with a decision tree as the base model and estimate its performance on validation
5. Write your own bagging implementation:
  <br>5.1. Define `__init__` for our CustomBaggingClassifier
  <br>5.2. Write `fit` as described in the lecture: divide the training data into n parts (`n_estimators` in CustomBaggingClassifier), train `base_estimator` on each part, and save these models inside the class
  <br>5.3. For predictions, use all saved models and combine their predictions by voting
6. Compare the performance of the sklearn bagging model with your own implementation
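
For step 4, one possible setup is sketched below. It runs on synthetic data from `make_classification` rather than the Titanic features so that it is self-contained. The hyperparameter values (`n_estimators=25`, `max_samples=0.8`, `max_features=0.8`, `max_depth=5`) are arbitrary assumptions, not tuned results. The base estimator is passed positionally because its keyword name differs between scikit-learn versions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real features and labels (assumption for this sketch).
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=42)

# Same validation strategy and metric as in the rest of the notebook.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scorer = make_scorer(fbeta_score, beta=0.5)

bagging = BaggingClassifier(
    DecisionTreeClassifier(max_depth=5, random_state=42),  # base model
    n_estimators=25,   # number of trees in the ensemble
    max_samples=0.8,   # fraction of rows drawn for each tree
    max_features=0.8,  # fraction of columns drawn for each tree
    random_state=42,
)

scores = cross_val_score(bagging, X_demo, y_demo, scoring=scorer, cv=cv)
print("F-beta per fold:", scores.round(3))
print("mean F-beta:", scores.mean().round(3))
```

On the real data, `X_demo` and `y_demo` would be replaced by `titanic_features` and `titanic_label`, and `cv` by the `skf` object defined earlier.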

In [22]:
class CustomBaggingClassifier:
    def __init__(self, base_estimator, n_estimators, max_samples, max_features):
        ...

    def fit(self, X, y):
        ...

    def predict(self, X):
        ...
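
The skeleton above is left unfilled. As a rough guide, one way such a class could work is sketched below under a different, hypothetical name (`SketchBaggingClassifier`). Note the deviation from step 5.2: instead of dividing the data into disjoint parts, this sketch draws a bootstrap sample (rows with replacement) for each estimator, which is the usual definition of bagging. Treating `max_samples` and `max_features` as fractions, and adding a `random_state` argument, are assumptions of this sketch.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier


class SketchBaggingClassifier:
    """Hypothetical sketch: bootstrap rows, random feature subsets, majority vote."""

    def __init__(self, base_estimator, n_estimators=10, max_samples=1.0,
                 max_features=1.0, random_state=None):
        self.base_estimator = base_estimator
        self.n_estimators = n_estimators
        self.max_samples = max_samples    # fraction of rows per estimator
        self.max_features = max_features  # fraction of columns per estimator
        self.random_state = random_state

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        rng = np.random.default_rng(self.random_state)
        n_rows, n_cols = X.shape
        n_sub_rows = max(1, int(self.max_samples * n_rows))
        n_sub_cols = max(1, int(self.max_features * n_cols))
        self.classes_ = np.unique(y)
        self.estimators_, self.feature_subsets_ = [], []
        for _ in range(self.n_estimators):
            rows = rng.integers(0, n_rows, size=n_sub_rows)            # with replacement
            cols = rng.choice(n_cols, size=n_sub_cols, replace=False)  # without replacement
            model = clone(self.base_estimator).fit(X[rows][:, cols], y[rows])
            self.estimators_.append(model)
            self.feature_subsets_.append(cols)
        return self

    def predict(self, X):
        X = np.asarray(X)
        # votes[i, j] is the class predicted by estimator i for row j
        votes = np.array([m.predict(X[:, cols])
                          for m, cols in zip(self.estimators_, self.feature_subsets_)])
        # Count votes per class and return the most frequent class for each row
        counts = np.array([(votes == c).sum(axis=0) for c in self.classes_])
        return self.classes_[counts.argmax(axis=0)]


# Usage on synthetic data (an assumption; not the Titanic features).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
model = SketchBaggingClassifier(DecisionTreeClassifier(random_state=0),
                                n_estimators=15, max_samples=0.8,
                                max_features=0.8, random_state=0).fit(X_demo, y_demo)
pred = model.predict(X_demo)
print("training accuracy:", (pred == y_demo).mean())
```

Because the sketch only needs `fit` and `predict` on the base estimator, any scikit-learn classifier can be plugged in. Ties in the vote are resolved in favor of the class that comes first in `classes_`.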