<a href="https://colab.research.google.com/github/sahar-mariam/level2-report/blob/main/EnsembleTechniques.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook applies ensemble techniques to the Titanic dataset to predict whether a passenger survived.
- It uses two popular ensemble methods: Random Forest and Gradient Boosting (via XGBoost).
- The models come from the scikit-learn and XGBoost libraries.

In [None]:
# Import the libraries used throughout the notebook
import warnings
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Suppress FutureWarning messages to keep the cell output readable
warnings.filterwarnings("ignore", category=FutureWarning)

# Load the Titanic dataset
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
titanic_data = pd.read_csv(url)

Preprocessing:
- Removing columns that are not used as features (Name, Ticket, Cabin, PassengerId).
- One-hot encoding the categorical variables (Sex, Embarked).
- Dropping rows that still contain missing values.

Splitting the Data:
- Separating the data into features (X) and the target variable (y, the Survived column).
- Splitting the data into training (80%) and testing (20%) sets.

In [None]:
# Preprocessing
titanic_data = titanic_data.drop(['Name', 'Ticket', 'Cabin', 'PassengerId'], axis=1)  # Drop unnecessary columns
titanic_data = pd.get_dummies(titanic_data, drop_first=True)  # One-hot encoding for categorical variables
titanic_data = titanic_data.dropna()  # Drop rows with missing values

# Split the data into features (X) and target variable (y)
X = titanic_data.drop('Survived', axis=1)
y = titanic_data['Survived']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
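The order of the preprocessing steps matters, and a toy frame makes this easier to see. The sketch below is a minimal illustration on made-up values (the `toy` frame and its contents are assumptions, not Titanic rows). It shows that `pd.get_dummies` turns a missing category into all-zero dummy columns rather than NaN, so a later `dropna()` only removes rows with gaps in the numeric columns.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic frame (values are made up for illustration)
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Embarked": ["S", "C", np.nan, "Q"],
})

# Same call as in the notebook: one dummy column per category, first category dropped
encoded = pd.get_dummies(toy, drop_first=True)
print(list(encoded.columns))

# Row 2 has a missing Embarked value: both remaining dummy columns are zero/False
print(encoded.loc[2, ["Embarked_Q", "Embarked_S"]].tolist())

# Only the row with a missing Age is removed
cleaned = encoded.dropna()
print(len(toy), "->", len(cleaned))
```

If missing categories should be kept distinguishable, `pd.get_dummies(..., dummy_na=True)` adds a separate indicator column for them.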

Random Forest Model:
- Hyperparameter tuning using GridSearchCV.
- Training the Random Forest model with the best parameters.
- Cross-validation scores and model evaluation.
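
The tune-then-retrain pattern listed above can be sketched on synthetic data. This is a minimal illustration, not the notebook's actual run: the data from `make_classification`, the tiny `demo_grid`, and all variable names are assumptions chosen to keep it fast. It shows that, with the default `refit=True`, `GridSearchCV` already exposes a model trained on the full training set as `best_estimator_`, so retraining by hand gives the same predictions when the random seed matches.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (not the Titanic set)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Deliberately tiny grid so the sketch runs quickly
demo_grid = {"n_estimators": [10, 20], "max_depth": [3, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid=demo_grid, cv=3)
search.fit(X_tr, y_tr)

# Option 1: retrain by hand with the best parameters
manual = RandomForestClassifier(random_state=0, **search.best_params_).fit(X_tr, y_tr)

# Option 2: reuse the estimator that GridSearchCV refit on all of X_tr
refit = search.best_estimator_

same = np.array_equal(manual.predict(X_te), refit.predict(X_te))
print(search.best_params_, "identical predictions:", same)
```

Either option works; reusing `best_estimator_` simply avoids one extra fit.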


In [None]:
# Ensemble 1: Random Forest

# Hyperparameter tuning using GridSearchCV
rf_param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2']  # 'auto' (an alias of 'sqrt' for classifiers) was removed in scikit-learn 1.3
}

rf_grid_search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid=rf_param_grid, cv=5)
rf_grid_search.fit(X_train, y_train)

# Best parameters from the grid search
best_rf_params = rf_grid_search.best_params_

# Train the Random Forest model with the best parameters
rf_model = RandomForestClassifier(random_state=42, **best_rf_params)
rf_model.fit(X_train, y_train)

# Cross-validation scores
rf_cv_scores = cross_val_score(rf_model, X_train, y_train, cv=5)
print(f"Random Forest Cross-Validation Scores: {rf_cv_scores}")
print(f"Mean Cross-Validation Score: {rf_cv_scores.mean()}")

# Predictions on the test set
rf_predictions = rf_model.predict(X_test)

# Model evaluation
print("Random Forest Model Evaluation:")
print(f"Accuracy: {accuracy_score(y_test, rf_predictions)}")
print("Confusion Matrix:\n", confusion_matrix(y_test, rf_predictions))
print("Classification Report:\n", classification_report(y_test, rf_predictions))

Random Forest Cross-Validation Scores: [0.8173913  0.84210526 0.85087719 0.89473684 0.8245614 ]
Mean Cross-Validation Score: 0.8459344012204424
Random Forest Model Evaluation:
Accuracy: 0.7972027972027972
Confusion Matrix:
 [[75 12]
 [17 39]]
Classification Report:
               precision    recall  f1-score   support

           0       0.82      0.86      0.84        87
           1       0.76      0.70      0.73        56

    accuracy                           0.80       143
   macro avg       0.79      0.78      0.78       143
weighted avg       0.80      0.80      0.80       143
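
The figures in the classification report follow directly from the confusion matrix. As a worked check, the sketch below recomputes accuracy and the class-1 (survived) precision, recall, and F1 from the matrix printed above; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix reported above: rows are true classes, columns are predictions
cm = np.array([[75, 12],
               [17, 39]])
tn, fp = cm[0]
fn, tp = cm[1]

accuracy = (tp + tn) / cm.sum()        # correct predictions / all predictions
precision_1 = tp / (tp + fp)           # of predicted survivors, how many survived
recall_1 = tp / (tp + fn)              # of actual survivors, how many were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(f"accuracy={accuracy:.4f} precision={precision_1:.2f} "
      f"recall={recall_1:.2f} f1={f1_1:.2f}")
# prints: accuracy=0.7972 precision=0.76 recall=0.70 f1=0.73
```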



XGBoost Model:
- Hyperparameter tuning using GridSearchCV.
- Training the XGBoost model with the best parameters.
- Cross-validation scores and model evaluation.
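
An exhaustive grid search grows multiplicatively with each added hyperparameter, which is worth checking before launching it. The sketch below counts the candidates for the XGBoost grid used in the next cell with scikit-learn's `ParameterGrid`; the `n_candidates`, `cv_folds`, and `total_fits` names are illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the tuning cell
xgb_param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.2],
    'subsample': [0.8, 0.9, 1.0],
    'colsample_bytree': [0.8, 0.9, 1.0],
}

n_candidates = len(ParameterGrid(xgb_param_grid))  # 3 * 3 * 3 * 3 * 3 combinations
cv_folds = 5
# One fit per candidate per fold, plus one final refit on the full training set
total_fits = n_candidates * cv_folds + 1

print(n_candidates, total_fits)
# prints: 243 1216
```

If the run time becomes a concern, passing `n_jobs=-1` to `GridSearchCV` parallelizes the fits, and `RandomizedSearchCV` samples a fixed number of candidates instead of trying all of them.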

In [None]:
# Ensemble 2: XGBoost

# Hyperparameter tuning using GridSearchCV
xgb_param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.2],
    'subsample': [0.8, 0.9, 1.0],
    'colsample_bytree': [0.8, 0.9, 1.0]
}

xgb_grid_search = GridSearchCV(XGBClassifier(random_state=42), param_grid=xgb_param_grid, cv=5)
xgb_grid_search.fit(X_train, y_train)

# Best parameters from the grid search
best_xgb_params = xgb_grid_search.best_params_

# Train the XGBoost model with the best parameters
xgb_model = XGBClassifier(random_state=42, **best_xgb_params)
xgb_model.fit(X_train, y_train)

# Cross-validation scores
xgb_cv_scores = cross_val_score(xgb_model, X_train, y_train, cv=5)
print(f"\nXGBoost Cross-Validation Scores: {xgb_cv_scores}")
print(f"Mean Cross-Validation Score: {xgb_cv_scores.mean()}")

# Predictions on the test set
xgb_predictions = xgb_model.predict(X_test)

# Model evaluation
print("\nXGBoost Model Evaluation:")
print(f"Accuracy: {accuracy_score(y_test, xgb_predictions)}")
print("Confusion Matrix:\n", confusion_matrix(y_test, xgb_predictions))
print("Classification Report:\n", classification_report(y_test, xgb_predictions))



XGBoost Cross-Validation Scores: [0.8173913  0.85964912 0.85964912 0.86842105 0.84210526]
Mean Cross-Validation Score: 0.8494431731502671

XGBoost Model Evaluation:
Accuracy: 0.8181818181818182
Confusion Matrix:
 [[75 12]
 [14 42]]
Classification Report:
               precision    recall  f1-score   support

           0       0.84      0.86      0.85        87
           1       0.78      0.75      0.76        56

    accuracy                           0.82       143
   macro avg       0.81      0.81      0.81       143
weighted avg       0.82      0.82      0.82       143



Ensemble Model:
- Combining the two models' test-set predictions with a simple voting rule: the ensemble predicts survival only when both models do.
- Model evaluation for the ensemble.



In [None]:
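# Combine predictions: with two binary classifiers, integer division of the sum by 2
# yields 1 only when both models predict 1 (a disagreement resolves to class 0)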
ensemble_predictions = (rf_predictions + xgb_predictions) // 2

# Ensemble Model Evaluation
print("\nEnsemble Model Evaluation:")
print(f"Accuracy: {accuracy_score(y_test, ensemble_predictions)}")
print("Confusion Matrix:\n", confusion_matrix(y_test, ensemble_predictions))
print("Classification Report:\n", classification_report(y_test, ensemble_predictions))


Ensemble Model Evaluation:
Accuracy: 0.8111888111888111
Confusion Matrix:
 [[77 10]
 [17 39]]
Classification Report:
               precision    recall  f1-score   support

           0       0.82      0.89      0.85        87
           1       0.80      0.70      0.74        56

    accuracy                           0.81       143
   macro avg       0.81      0.79      0.80       143
weighted avg       0.81      0.81      0.81       143
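
With only two voters, the `// 2` rule behaves like a logical AND, which explains why the ensemble above has fewer false positives but no better recall than the Random Forest. The sketch below demonstrates this on hypothetical predictions and contrasts it with soft voting, a common alternative that averages predicted probabilities. All arrays are made-up values, and soft voting is not what the notebook itself does.

```python
import numpy as np

# Hypothetical hard predictions from two binary classifiers
pred_a = np.array([1, 1, 0, 0])
pred_b = np.array([1, 0, 1, 0])

# Integer-dividing the sum by 2 gives 1 only when both models predict 1
hard_vote = (pred_a + pred_b) // 2
print(hard_vote, np.array_equal(hard_vote, pred_a & pred_b))

# Hypothetical positive-class probabilities, e.g. from predict_proba(X)[:, 1]
proba_a = np.array([0.9, 0.8, 0.3, 0.2])
proba_b = np.array([0.7, 0.4, 0.6, 0.1])

# Soft voting: average the probabilities, then threshold at 0.5
soft_vote = ((proba_a + proba_b) / 2 >= 0.5).astype(int)
print(soft_vote)
```

In the second position the models disagree: the hard rule falls back to class 0, while soft voting follows the more confident model. scikit-learn's `VotingClassifier` with `voting='soft'` wraps the same idea for fitted pipelines.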

