# Problem Statement

### Build a machine learning model that predicts which passengers survived the Titanic shipwreck and identifies the characteristics most associated with survival.

In [1]:
# Importing all the required libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import linear_model, preprocessing, model_selection
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, ExtraTreesClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, KFold
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn import tree
# !pip install ydata-profiling
from ydata_profiling import ProfileReport


ModuleNotFoundError: No module named 'ydata_profiling'

### Part 1: Loading train and test datasets

In [None]:
# Reading both train and test datasets
train = pd.read_csv("train.csv")  # Read train dataset from CSV file into DataFrame 'train'
test = pd.read_csv("test.csv")    # Read test dataset from CSV file into DataFrame 'test'

# Extracting the target variable 'Survived' into y_train
y_train = train['Survived'].to_numpy()  # Extract 'Survived' column from train DataFrame into a NumPy array 'y_train' (Series.ravel is deprecated in recent pandas)



The dataset includes the following columns, in addition to **Name**, which is used later to derive titles and family names:

1. **PassengerId**: Unique identifier for each passenger.
2. **Survived**: Survival status (0 = Not Survived, 1 = Survived).
3. **Pclass**: Passenger class (1 = First, 2 = Second, 3 = Third).
4. **Sex**: Gender of the passenger.
5. **Age**: Age of the passenger.
6. **SibSp**: Number of siblings/spouses aboard.
7. **Parch**: Number of parents/children aboard.
8. **Fare**: Fare paid by the passenger.
9. **Cabin**: Cabin number of the passenger.
10. **Embarked**: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton).


In [None]:
# Listing all the columns present in the train dataset
train.columns

In [None]:
train.shape

In [None]:
train.head()

In [None]:
# Count the occurrences of each value in the 'Embarked' column
train['Embarked'].value_counts()


In [None]:
train.info()

In [None]:
# Check for missing values in each column of the train DataFrame
train.isnull().sum()

# Here we can see that the columns Age, Embarked, and Cabin have null values. We will handle these missing values shortly.


In [None]:
train.describe()

In [None]:
# Concatenate the train and test DataFrames along the rows (axis=0)
combined_df = pd.concat([train, test], axis=0)

# Reset the index of the combined DataFrame
combined_df.reset_index(drop=True, inplace=True)

# Display summary statistics of the combined DataFrame
combined_df.describe()


- The dataset contains information on 1309 passengers.

- The survival rate is approximately 38.4%. This figure comes from the training rows only, because the test set has no Survived column.

- The majority of passengers were in second or third class, as indicated by the mean Pclass value of around 2.29.

- The average age of passengers in the dataset is around 29.88 years old.

- Passengers traveled with few siblings or spouses on average, with a mean SibSp value of approximately 0.50.

- Similarly, passengers traveled with few parents or children on average, with a mean Parch value of about 0.39.

- The average fare paid by passengers is approximately $33.30.

- The age range of passengers is from as young as 0.17 years old to as old as 80 years old.

- The majority of fares paid were relatively low, as indicated by the 25th percentile value of $7.90 and the median (50th percentile) value of $14.45.

- There were passengers who traveled with up to 8 siblings or spouses, and up to 9 parents or children.

- The maximum fare paid by a passenger was $512.33, significantly higher than the average fare.


### Part 2: Exploratory Data Analysis

### Plotting the relationship between the target class and various predictor variables


In [None]:
# Create a figure for the plot with a specific size
fig = plt.figure(figsize=(32, 12))

# Subplot for bar plot of passengers survived vs deceased
plt.subplot2grid((3, 4), (0, 0))
train['Survived'].value_counts(normalize=True).plot(kind="bar")
plt.title('Survived')

# Subplot for scatter plot of passengers Age vs Survival
plt.subplot2grid((3, 4), (0, 1))
plt.scatter(train['Survived'], train['Age'], alpha=0.5)
plt.title('Age Vs Survival')

# Subplot for bar plot of passenger's ticket class
plt.subplot2grid((3, 4), (0, 2))
train['Pclass'].value_counts().plot(kind="bar")
plt.title('Class')

# Subplot for density plot to find the distribution of age in each class
plt.subplot2grid((3, 4), (1, 0))
for x in [1, 2, 3]:
    train['Age'][train['Pclass'] == x].plot(kind='kde')
plt.legend(("1st", "2nd", "3rd"))

# Subplot for bar plot of Female survivors
plt.subplot2grid((3, 4), (1, 1))
train['Survived'][train['Sex'] == "female"].value_counts().plot(kind="bar")
plt.title('Female Survivors')

# Subplot for bar plot of gender of the passengers survived
plt.subplot2grid((3, 4), (1, 2))
train['Sex'][train['Survived'] == 1].value_counts().plot(kind="bar", color=['r'])
plt.title('Sex of Survivors')

# Subplot for plot showing the distribution of passengers survived in each ticket class
plt.subplot2grid((3, 4), (2, 0))
for x in [1, 2, 3]:
    train['Survived'][train['Pclass'] == x].plot(kind='kde')
plt.legend(("1st", "2nd", "3rd"))

# Show the plot
plt.show()


In [None]:
df = pd.DataFrame(train)

# Generate a profiling report using ydata_profiling.ProfileReport
profile = ProfileReport(df, title="Profiling Report")

In [None]:
# Render the profiling report as a notebook iframe
profile.to_notebook_iframe()


In [None]:
# Save the profiling report to an HTML file
profile.to_file("output_file.html")


### Part 3: Data Cleansing

Data cleansing involves mainly the following:
1. Checking for null values and imputation of null values using mean, median, or mode.
2. Label encoding and mapping of categorical variables to numeric variables.
3. Grouping into different buckets
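
The three steps above can be sketched on a tiny example. This is only an illustration under stated assumptions: the `toy` DataFrame, its values, and the bin edges are hypothetical and are not taken from the Titanic files or from the functions defined in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data (hypothetical values, not taken from the Titanic files)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, 4.0],
    "Embarked": ["S", "C", None, "S"],
})

# 1. Impute: median for a numeric column, mode for a categorical one
toy["Age"] = toy["Age"].fillna(toy["Age"].median())
toy["Embarked"] = toy["Embarked"].fillna(toy["Embarked"].mode()[0])

# 2. Label-encode the categorical column as integers (classes are sorted: C, S)
toy["Embarked"] = LabelEncoder().fit_transform(toy["Embarked"])

# 3. Bucket the numeric column into ordered groups (illustrative bin edges)
toy["AgeGroup"] = pd.cut(toy["Age"], bins=[0, 12, 18, np.inf], labels=[0, 1, 2]).astype(int)

print(toy)
```

Converting the `pd.cut` result with `.astype(int)` keeps the bucket codes usable in later arithmetic, which a categorical column would not allow.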


In [None]:
# Function to replace null values with the median of the corresponding column
def fill_null_with_median(df, col):
    """Replace null values with the median of the corresponding column."""
    df[col] = df[col].fillna(df[col].dropna().median())

# Function to clean the data by filling null values and imputing values
def clean_data(data):
    """
    Clean the data by filling null values and imputing values.

    Args:
        data (DataFrame): The input DataFrame to be cleaned.

    Returns:
        DataFrame: The cleaned DataFrame.
    """
    fill_null_with_median(data, 'Fare')
    fill_null_with_median(data, 'Age')
    data["Embarked"] = data["Embarked"].fillna("S")
    return data

# Function to add a new column 'Deck' based on the 'Cabin' column
def add_deck_column(data):
    """
    Create a new column 'Deck' based on the 'Cabin' column.

    Args:
        data (DataFrame): The input DataFrame.

    Returns:
        DataFrame: The DataFrame with the new 'Deck' column.
    """
    data['Cabin'] = data['Cabin'].fillna('')  # Assign back instead of chained inplace fill, which is unreliable in recent pandas
    data['Deck'] = data['Cabin'].apply(lambda cabin: cabin[0] if cabin != '' else 'None')
    return data

# Function to perform label encoding on categorical features
def label_encode(data, columns):
    """
    Perform label encoding on categorical features.

    Args:
        data (DataFrame): The input DataFrame.
        columns (list): List of columns to encode.
    """
    label_encoder = preprocessing.LabelEncoder()
    for column in columns:
        data[column] = label_encoder.fit_transform(data[column])

# Function to extract titles from names and map them to numerical values
def set_title(data):
    """
    Extract titles from names and map them to numerical values.

    Args:
        data (DataFrame): The input DataFrame.
    """
    data['Title'] = data['Name'].apply(lambda x: x.split(',')[1].split('.')[0].strip())
    data['Title'] = data['Title'].replace(main_title_map)
    data['Title'] = data['Title'].map(replacement_title_map)
    data['Title'] = data['Title'].fillna(0)  # Unmapped titles become 0
    data['Title'] = data['Title'].astype(int)
    data['Lastname'] = data['Name'].apply(lambda x: x.split(',')[0])

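# Function to assign passengers into age groups (superseded by the redefinition of get_age_group later in this cell, which uses slightly different bin edges)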
def get_age_group(data):
    """
    Assign passengers into age groups.

    Args:
        data (DataFrame): The input DataFrame.
    """
    data['Age'] = pd.cut(data['Age'], bins=[0, 12, 18, 23, 28, 34, 41, 67, np.inf], labels=range(8))

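# Function to assign passengers into fare groups (superseded by the redefinition of get_fare_group later in this cell, which uses the same bin edges)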
def get_fare_group(data):
    """
    Assign passengers into fare groups.

    Args:
        data (DataFrame): The input DataFrame.
    """
    data['Fare'] = pd.cut(data['Fare'], bins=[-np.inf, 7.91, 14.454, 31, 99, 250, np.inf], labels=range(6))

# Function to calculate 'classAge' by multiplying 'Pclass' and 'Age'
def get_class_age(data):
    """
    Calculate 'classAge' by multiplying 'Pclass' and 'Age'.

    Args:
        data (DataFrame): The input DataFrame.
    """
    data['classAge'] = data['Pclass'] * data['Age']

# Function to check if the passenger is traveling alone
def is_alone(data):
    """
    Check if the passenger is traveling alone.

    Args:
        data (DataFrame): The input DataFrame.
    """
    data['isAlone'] = np.where((data['SibSp'] > 0) | (data['Parch'] > 0), 0, 1)

# Dictionaries to consolidate uncommon titles and then map each title to a numeric code
# (all rare titles use the key 'Rare' so that they match replacement_title_map)
main_title_map = {'Lady': 'Rare', 'Mme': 'Mrs', 'Dona': 'Rare', 'the Countess': 'Rare',
         'Ms': 'Mrs', 'Mlle': 'Mrs',
         'Sir': 'Rare', 'Major': 'Officer', 'Capt': 'Officer', 'Jonkheer': 'Rare', 'Don': 'Rare', 'Col': 'Officer', 'Rev': 'Officer', 'Dr': 'Rare'}
replacement_title_map = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5, "Officer": 6}

# List to store known families
known_family = []

# Splitting the name to get the title and last name
def get_title(full_name):
    return full_name.split(',')[1].split('.')[0].strip()

def get_lastname(full_name):
    return full_name.split(',')[0]

# Function to check whether a last name belongs to a known family (prints the name when it does)
def get_known_family(last_name):
    if last_name in known_family:
        print(last_name)
        return True
    else:
        return False

# Function to obtain the survival rate of passengers per family
# (requires the 'Lastname' column created by set_title)
def get_known_family_survival_rate(Lastname):
    if get_known_family(Lastname):
        print(train.Lastname[train.Survived==1].value_counts())
        return train.Lastname[train.Survived==1].value_counts()/train.Lastname.value_counts()
    else:
        return 0

# Function to find the survival rate of passengers traveling in each deck of the ship
# (the 'deck' argument is unused; the rates for all decks are returned)
def get_deck_survival_rate(deck):
    return train.Deck[train.Survived==1].value_counts()/train.Deck.value_counts()

# Function to assign the passenger fare details into 6 buckets
def get_fare_group(data):
    """Assign passengers into fare groups."""
    data.loc[ data['Fare'] <= 7.91, 'Fare'] = 0
    data.loc[(data['Fare'] > 7.91) & (data['Fare'] <= 14.454), 'Fare'] = 1
    data.loc[(data['Fare'] > 14.454) & (data['Fare'] <= 31), 'Fare']   = 2
    data.loc[(data['Fare'] > 31) & (data['Fare'] <= 99), 'Fare']   = 3
    data.loc[(data['Fare'] > 99) & (data['Fare'] <= 250), 'Fare']   = 4
    data.loc[ data['Fare'] > 250, 'Fare'] = 5
    data['Fare'] = data['Fare'].astype(int)

# Function to assign the passengers based on age into 8 buckets
def get_age_group(data):
    """Assign passengers into age groups."""
    data.loc[ data['Age'] <= 11, 'Age'] = 0
    data.loc[(data['Age'] > 11) & (data['Age'] <= 18), 'Age'] = 1
    data.loc[(data['Age'] > 18) & (data['Age'] <= 22), 'Age'] = 2
    data.loc[(data['Age'] > 22) & (data['Age'] <= 27), 'Age'] = 3
    data.loc[(data['Age'] > 27) & (data['Age'] <= 33), 'Age'] = 4
    data.loc[(data['Age'] > 33) & (data['Age'] <= 40), 'Age'] = 5
    data.loc[(data['Age'] > 40) & (data['Age'] <= 66), 'Age'] = 6
    data.loc[ data['Age'] > 66, 'Age'] = 7

# Function to get a boolean value based on a threshold
def get_boolean(pred):
    """Get a boolean value based on a threshold."""
    return 1 if pred > 0.3 else 0

# Function to write predictions to a CSV file
def write_prediction(prediction, name):
    """
    Write predictions to a CSV file.

    Args:
        prediction (array): Array containing the predictions.
        name (str): Name of the output file.
    """
    PassengerId = np.array(test["PassengerId"]).astype(int)
    solution = pd.DataFrame(prediction, PassengerId, columns=["Survived"])
    solution.to_csv(name, index_label=["PassengerId"])


## Preprocessing and Data Cleaning Steps:
### Step 1: Handling Null Values
- Replace null values in the "Fare" and "Age" columns with their respective medians.
- Fill null values in the "Embarked" column with 'S'.

### Step 2: Creating Additional Features
- Create a new column "Deck" based on the first letter of the "Cabin" column to represent the deck each passenger was on.
- Extract titles from the "Name" column and map them to predefined categories, creating a new feature "Title".
- Extract family names from the "Name" column to identify passengers belonging to the same family.
- Group passengers into age categories based on predefined ranges.
- Group fares into predefined ranges and assign them to passengers.
- Create a binary feature "isChild" to indicate whether a passenger is a child (age < 18).
- Create a new feature "classAge" by multiplying the "Pclass" and "Age" columns.

### Step 3: Handling Categorical Data
- Label encode categorical columns such as "Deck", "Sex", "Embarked", and "Lastname" using sklearn's LabelEncoder.

### Step 4: Generating VIF (Variance Inflation Factor)
- Check for collinearity among predictor variables using the Variance Inflation Factor.

### Step 5: Outputting Predictions
- Provide a function "write_prediction" to output predictions for whether a passenger survived or not.

### Step 6: Miscellaneous
- Provide functions to check the survival rate of passengers per family and per deck.
- Create an additional feature "isAlone" to indicate whether a passenger is traveling alone (without siblings, spouses, parents, or children).
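
Step 4 mentions the Variance Inflation Factor, but the notebook does not show how it is computed. A minimal sketch, assuming all predictors are numeric: the VIF of each column equals the corresponding diagonal entry of the inverse correlation matrix. The helper name `compute_vif` and the synthetic `demo` data are hypothetical and only serve the illustration.

```python
import numpy as np
import pandas as pd

def compute_vif(features: pd.DataFrame) -> pd.Series:
    """Return the VIF of each column as the diagonal of the inverse correlation matrix."""
    corr = np.corrcoef(features.to_numpy(dtype=float), rowvar=False)
    vif_values = np.diag(np.linalg.inv(corr))
    return pd.Series(vif_values, index=features.columns, name="VIF")

# Synthetic data: 'a_plus_noise' is almost a copy of 'a', while 'b' is independent
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
demo = pd.DataFrame({"a": a, "b": b, "a_plus_noise": a + 0.1 * rng.normal(size=200)})

vif = compute_vif(demo)
print(vif)  # 'a' and 'a_plus_noise' get large values, 'b' stays close to 1
```

In this notebook, the same helper could be applied to the numeric predictor columns once preprocessing is complete; a common rule of thumb treats values above roughly 5 to 10 as a sign of problematic collinearity.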


In [None]:
# Cleaning both datasets and adding the Deck column to the train dataset
clean_data(train)
clean_data(test)
train = add_deck_column(train)  # The function is defined above as add_deck_column
columns=['Deck','Sex','Embarked']
label_encode(train,columns)
train.head()

In [None]:
#Mapping passenger class 1 and 2 to 0 and passenger_class 3 to 1
class3_map = {1: 0, 2: 0, 3: 1}
train['Pclass_3'] = train['Pclass'].map(class3_map)
test['Pclass_3'] = test['Pclass'].map(class3_map)

In [None]:
train.info()

In [None]:
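# The steps above describe an 'isChild' feature (age < 18), but no helper for it is defined earlier, so it is defined here.
def is_child(data):
    """Flag passengers younger than 18 as children."""
    data['isChild'] = np.where(data['Age'] < 18, 1, 0)

# Applying all the above functions to both train and test datasets.
# is_child must run before get_age_group, which overwrites 'Age' with bucket codes.
set_title(train)
set_title(test)
is_alone(train)
is_alone(test)
is_child(train)
is_child(test)
get_age_group(train)
get_age_group(test)
get_fare_group(train)
get_fare_group(test)
get_class_age(train)
get_class_age(test)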

In [None]:
train.head(10)

In [None]:
# The same steps are applied to the test dataset, and the target variable is removed from the train dataset.
# Note: label_encode fits a separate encoder on each DataFrame, so a category (e.g. a last name) can receive different codes in train and test.
test = add_deck_column(test)
columns=['Deck','Sex','Embarked','Lastname','isAlone']
label_encode(train,columns)
label_encode(test,columns)
train = train.drop(['Survived'], axis=1)

In [None]:
train.head()

In [None]:
# Save preprocessed train dataset to CSV file
train.to_csv('preprocessed_train.csv', index=False)

# Save preprocessed test dataset to CSV file
test.to_csv('preprocessed_test.csv', index=False)


### Part 4: Feature selection: using ensemble methods to obtain the best features for survival prediction

In [None]:
# Read preprocessed train dataset from CSV file
train = pd.read_csv('preprocessed_train.csv')

# Read preprocessed test dataset from CSV file
test = pd.read_csv('preprocessed_test.csv')

In [None]:
def get_train_test_data(columns):
    """
    Extracts the specified columns from the train and test datasets and converts them into arrays.

    Parameters:
    columns (list): List of column names to extract from the datasets.

    Returns:
    tuple: A tuple containing the arrays of training data and testing data respectively.
    """

    train_features = train[columns]  # Extract specified columns from the train dataset
    test_features = test[columns]    # Extract specified columns from the test dataset

    x_train = train_features.values  # Creates an array of the train data
    x_test = test_features.values    # Creates an array of the test data

    ntrain = x_train.shape[0]        # Number of rows in the training data array
    ntest = x_test.shape[0]          # Number of rows in the testing data array

    return x_train, x_test


KFold cross-validation splits the training data into NFOLDS folds. For each fold, the model is trained on the other k-1 folds and evaluated on the held-out fold.
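
To make the splitting concrete, here is a small sketch on ten hypothetical samples. The names `X_demo`, `kf_demo`, and `folds` are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import KFold

X_demo = np.arange(10).reshape(-1, 1)  # 10 hypothetical samples
kf_demo = KFold(n_splits=5)            # no shuffling, so the folds are contiguous

folds = []
for fold, (train_idx, val_idx) in enumerate(kf_demo.split(X_demo)):
    # Each sample lands in the held-out fold exactly once
    folds.append(val_idx.tolist())
    print(f"fold {fold}: train on {len(train_idx)} rows, validate on rows {val_idx.tolist()}")
```

Because every row is held out exactly once, the predictions made on the held-out folds together cover the whole training set, which is what the out-of-fold procedure relies on.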

In [None]:
# Some useful parameters which will come in handy later on
ntrain = train.shape[0]
ntest = test.shape[0]
SEED = 0 # for reproducibility
NFOLDS = 5 # set folds for out-of-fold prediction
kf = KFold(n_splits = NFOLDS, random_state = None)
# Class to extend the Sklearn classifier for easier handling and customization
class SklearnHelper(object):
    """
    A helper class to extend the functionality of Sklearn classifiers.

    Parameters:
    clf (object): A Sklearn classifier object.
    seed (int): Random seed for reproducibility (by default we have set it to 0).
    params (dict): Additional parameters to be passed to the classifier (default is None).

    Attributes:
    clf (object): The Sklearn classifier object initialized with the specified parameters.

    Methods:
    train(x_train, y_train): Trains the classifier on the provided training data.
    predict(x): Predicts the target variable for the provided data.
    fit(x, y): Fits the classifier to the provided data.
    feature_importances(x, y): Calculates and returns the feature importances of the fitted model.
    """

    def __init__(self, clf, seed=0, params=None):
        """
        Initialize the SklearnHelper object with the specified classifier and parameters.

        Args:
        clf (object): A Sklearn classifier object.
        seed (int): Random seed for reproducibility (default is 0).
        params (dict): Additional parameters to be passed to the classifier (default is None).
        """
        params = dict(params or {})  # Copy so the caller's dict is not mutated, and tolerate params=None
        params['random_state'] = seed
        self.clf = clf(**params)

    def train(self, x_train, y_train):
        """
        Train the classifier on the provided training data.

        Args:
        x_train (array-like): Training data features.
        y_train (array-like): Target variable for training.

        Returns:
        None
        """
        self.clf.fit(x_train, y_train)

    def predict(self, x):
        """
        Predict the target variable for the provided data.

        Args:
        x (array-like): Data for prediction.

        Returns:
        array-like: Predicted target variable.
        """
        return self.clf.predict(x)

    def fit(self, x, y):
        """
        Fit the classifier to the provided data.

        Args:
        x (array-like): Training data features.
        y (array-like): Target variable for training.

        Returns:
        object: Fitted classifier.
        """
        return self.clf.fit(x, y)

    def feature_importances(self, x, y):
        """
        Calculate and return the feature importances of the fitted model.

        Args:
        x (array-like): Training data features.
        y (array-like): Target variable for training.

        Returns:
        array-like: Feature importances.
        """
        return self.clf.fit(x, y).feature_importances_


In [None]:
def get_oof(clf, model_x_train, y_train, model_x_test):
    """
    Generate out-of-fold predictions using a classifier.

    Args:
    clf (object): Classifier object with methods train and predict.
    model_x_train (array-like): Features of the training dataset.
    y_train (array-like): Target variable of the training dataset.
    model_x_test (array-like): Features of the testing dataset.

    Returns:
    tuple: A tuple containing out-of-fold predictions for the training set and predictions for the testing set.
    """
    oof_train = np.zeros((ntrain,))
    oof_test = np.zeros((ntest,))
    oof_test_skf = np.empty((NFOLDS, ntest))

    for i, (train_index, test_index) in enumerate(kf.split(model_x_train)):  # Split the array passed in rather than the global 'train'
        x_tr = model_x_train[train_index]
        y_tr = y_train[train_index]
        x_te = model_x_train[test_index]

        clf.train(x_tr, y_tr)

        oof_train[test_index] = clf.predict(x_te)
        oof_test_skf[i, :] = clf.predict(model_x_test)

    oof_test[:] = oof_test_skf.mean(axis=0)
    return oof_train.reshape(-1, 1), oof_test.reshape(-1, 1)


In [None]:
# Put in our parameters for said classifiers
# Random Forest parameters
rf_params = {
    'n_jobs': -1,
    'n_estimators': 100,
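    'warm_start': False,  # With warm_start=True, refitting with the same n_estimators reuses the existing trees instead of retraining on each fold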
    'max_depth': 6,
    'min_samples_leaf': 1,
    'max_features' : 'sqrt',
    'min_samples_split': 16,
    'criterion': 'gini',
    'verbose': 0
}


# Extra Trees Parameters
et_params = {
    'n_jobs': -1,
    'n_estimators':1000,
    'max_features':4,
    'max_depth': 50,
    'min_samples_leaf': 5,
    'criterion': 'entropy',
    'verbose': 0
}

# AdaBoost parameters
ada_params = {
    'n_estimators': 500,
    'learning_rate' : 0.75
}

# Gradient Boosting parameters
gb_params = {
    'n_estimators': 500,
     #'max_features': 0.2,
    'max_depth': 5,
    'min_samples_leaf': 2,
    'verbose': 0
}

# Support Vector Classifier parameters
svc_params = {
    'kernel' : 'linear',
    'C' : 0.025
    }


### List of models used:
1. Random Forest Classifier
2. Extra Trees Classifier
3. AdaBoost Classifier
4. Gradient Boosting Classifier
5. Support Vector Classifier

In [None]:
# Create instances of classifiers with specified parameters
rf = SklearnHelper(clf=RandomForestClassifier, seed=SEED, params=rf_params)
et = SklearnHelper(clf=ExtraTreesClassifier, seed=SEED, params=et_params)
ada = SklearnHelper(clf=AdaBoostClassifier, seed=SEED, params=ada_params)
gb = SklearnHelper(clf=GradientBoostingClassifier, seed=SEED, params=gb_params)
svc = SklearnHelper(clf=SVC, seed=SEED, params=svc_params)


Best Features: Using `feature_importances_` from the four tree-based algorithms (Random Forest, Extra Trees, AdaBoost, and Gradient Boosting), we can obtain the best features for prediction.
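
As a standalone illustration of reading feature importances from a fitted tree ensemble, the sketch below uses synthetic data from `make_classification`; the feature names `f0` to `f4` and the model settings are hypothetical and unrelated to the Titanic columns.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 5 features, of which 2 carry signal
X_demo, y_demo = make_classification(
    n_samples=200, n_features=5, n_informative=2, n_redundant=0, random_state=0
)
names = [f"f{i}" for i in range(5)]

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# Impurity-based importances are non-negative and sum to 1
importances = pd.Series(model.feature_importances_, index=names).sort_values(ascending=False)
print(importances)
```

Because each model's importances are normalized to sum to 1, averaging them across models gives a comparable per-feature score.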

In [None]:
# Define columns to be used for training
columns_to_train = ['Pclass_3', 'Sex', 'Title', 'Lastname', 'isAlone', 'Age', 'Fare', 'classAge', 'Deck', 'isChild']

# Get training and test datasets based on the specified columns
x_train, x_test = get_train_test_data(columns_to_train)


In [None]:
# Extract feature importances from each model
rf_features = rf.feature_importances(x_train, y_train)
et_features = et.feature_importances(x_train, y_train)
ada_features = ada.feature_importances(x_train, y_train)
gb_features = gb.feature_importances(x_train, y_train)


In [None]:
# Define the columns used as features
cols = columns_to_train
print(cols)

# Create a DataFrame to store feature importances
feature_dataframe = pd.DataFrame({
    'features': cols,
    'Random Forest feature importances': rf_features,
    'Extra Trees feature importances': et_features,
    'AdaBoost feature importances': ada_features,
    'Gradient Boost feature importances': gb_features
})

# Display the first rows of the DataFrame
feature_dataframe.head()

Base Feature Mean: calculate the mean importance of each feature across the four models and plot the averages as a bar plot.

In [None]:
feature_dataframe['mean'] = feature_dataframe.mean(axis=1, numeric_only=True)  # Skip the string 'features' column when averaging

In [None]:
feature_dataframe

###### Observations on feature importances from various classifiers:
- 'Sex' and 'Title' are consistently identified as highly important features across Random Forest, Extra Trees, and AdaBoost classifiers.
- 'Lastname' receives the highest importance from AdaBoost and Gradient Boosting classifiers, suggesting family groups might have affected survival.
- 'Age' is moderately important across most classifiers, with Gradient Boosting assigning relatively higher importance.
- 'Fare' shows moderate importance, indicating ticket fare might have influenced survival.
- 'Deck' receives relatively higher importance from Gradient Boosting, suggesting cabin location as a factor.
- 'isAlone' and 'isChild' have low importance across all classifiers, implying they might not have significantly influenced survival.

These observations highlight the varying perspectives of different classifiers on feature importance and suggest potential factors influencing survival aboard the Titanic.


In [None]:
#Plotting the mean of the features
import seaborn as sns

sns.set_context('paper')
sns.barplot(x = 'features', y = 'mean',  data = feature_dataframe.sort_values(by=['mean']),palette = 'Blues', edgecolor = 'w')
plt.show()

In [None]:
# Generate out-of-fold predictions

et_oof_train, et_oof_test = get_oof(et, x_train, y_train, x_test)     # Extra Trees
rf_oof_train, rf_oof_test = get_oof(rf, x_train, y_train, x_test)     # Random Forest
ada_oof_train, ada_oof_test = get_oof(ada, x_train, y_train, x_test)  # AdaBoost
gb_oof_train, gb_oof_test = get_oof(gb, x_train, y_train, x_test)     # Gradient Boost
svc_oof_train, svc_oof_test = get_oof(svc, x_train, y_train, x_test)  # Support Vector Classifier


In [None]:
# Assemble base predictions for training data
base_predictions_train = pd.DataFrame({
    'RandomForest': rf_oof_train.ravel(),       # Predictions using Random Forest classifier
    'ExtraTrees': et_oof_train.ravel(),         # Predictions using Extra Trees classifier
    'AdaBoost': ada_oof_train.ravel(),          # Predictions using AdaBoost classifier
    'GradientBoost': gb_oof_train.ravel(),      # Predictions using Gradient Boosting classifier
    'SVC': svc_oof_train.ravel()                # Predictions using Support Vector Classifier
})

# Display the initial rows of the assembled base predictions
base_predictions_train.head()


Concatenate the out-of-fold predictions of all base models into the second-level feature matrices x_train and x_test.
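
The idea behind this step is stacking: the base models' out-of-fold predictions become the input features of a second-level model. A minimal sketch on synthetic data follows. It uses scikit-learn's `cross_val_predict` in place of the notebook's `get_oof` helper and a logistic regression as the second-level model purely for brevity; both substitutions, and all variable names, are assumptions of this illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

base_models = [
    RandomForestClassifier(n_estimators=50, random_state=0),
    LogisticRegression(max_iter=1000),
]

# One column of out-of-fold class predictions per base model
meta_features = np.column_stack(
    [cross_val_predict(model, X_demo, y_demo, cv=5) for model in base_models]
)

# The second-level model learns how to combine the base predictions
meta_model = LogisticRegression().fit(meta_features, y_demo)
print(meta_features.shape)
```

Using out-of-fold predictions means each row's meta-feature comes from a model that never saw that row during training, which limits leakage into the second-level model.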

In [None]:
# Concatenate OOF predictions for training dataset
x_train = np.concatenate((et_oof_train, rf_oof_train, ada_oof_train, gb_oof_train, svc_oof_train), axis=1)

# Concatenate OOF predictions for test dataset
x_test = np.concatenate((et_oof_test, rf_oof_test, ada_oof_test, gb_oof_test, svc_oof_test), axis=1)

### Part 5: Model Creation and prediction

Finding the best Model by using different Algorithms:

  1. Using XGBoost Algorithm

In [None]:
from xgboost import XGBClassifier

# Defining and training the XGBoost classifier
# Only the tuned settings are passed explicitly; deprecated or unused arguments such as 'silent', 'nthread' and 'gpu_id' are omitted
gbm = XGBClassifier(base_score=0.5, colsample_bytree=0.3, learning_rate=0.15,
              max_depth=5, min_child_weight=7, n_estimators=600, n_jobs=1,
              random_state=0, tree_method='exact').fit(x_train, y_train)

# Making predictions using the trained XGBoost classifier
predictions = gbm.predict(x_test)


In [None]:
# Write the predictions to the results.csv file
write_prediction(predictions,'results.csv')

2. Using RandomForest Algorithm

In [None]:
from sklearn.model_selection import cross_val_score
rf = RandomForestClassifier(n_estimators=100)
scores = cross_val_score(rf, x_train, y_train, cv=10, scoring = "accuracy")
print("Scores:", scores)
print("Mean:", scores.mean())
print("Standard Deviation:", scores.std())

In [None]:
#Performing hyperparameter tuning and finding the best parameters for RandomForest Algorithm
param_grid = { "criterion" : ["gini", "entropy"], "min_samples_leaf" : [1, 5, 10, 25, 50, 70], "min_samples_split" : [2, 4, 10, 12, 16, 18, 25, 35], "n_estimators": [100, 400, 700, 1000, 1500]}
from sklearn.model_selection import GridSearchCV, cross_val_score
rf = RandomForestClassifier(n_estimators=100, max_features='sqrt', oob_score=True, random_state=1, n_jobs=-1)  # 'auto' was removed in recent scikit-learn; 'sqrt' is its equivalent for classifiers
clf = GridSearchCV(estimator=rf, param_grid=param_grid, n_jobs=-1)
clf.fit(x_train, y_train)
clf.best_params_

In [None]:
# One can also use these parameters too
# {'criterion': 'gini',
#  'min_samples_leaf': 1,
#  'min_samples_split': 16,
#  'n_estimators': 1000}

random_forest = RandomForestClassifier(criterion = "gini",
                                       min_samples_leaf = 1,
                                       min_samples_split = 12,
                                       n_estimators=1000,
                                       max_features='sqrt',
                                       oob_score=True,
                                       random_state=1,
                                       n_jobs=-1)

random_forest.fit(x_train, y_train)
Y_prediction = random_forest.predict(x_test)
random_forest.score(x_train, y_train)
print(f"oob score of the model is: {random_forest.oob_score_:.2%}")

In [None]:
# Write predictions to 'results.csv'
write_prediction(Y_prediction,'results.csv')

3. Using Extra Trees Classifier

In [None]:
# Instantiating and fitting an Extra Trees Classifier
extc = ExtraTreesClassifier(n_estimators=1000, max_features=4, criterion='entropy', min_samples_split=5,
                            max_depth=50, min_samples_leaf=5)

extc.fit(x_train, y_train)


In [None]:
y_pred = extc.predict(x_test)

In [None]:
# Write predictions to 'results.csv'
write_prediction(y_pred,'results.csv')

## Inferences

This model mainly predicts which passengers survived the Titanic shipwreck.
The main challenge is selecting the features that best characterize the passengers who survived.
By building the model this way, the most relevant features are selected, and based on them we can predict whether a passenger in the test dataset survived.
