# Titanic - Machine Learning from Disaster
## The Challenge

The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

The competition is simple: we want you to use the Titanic passenger data (name, age, price of ticket, etc.) to try to predict who will survive and who will die.

* Kaggle challenge link: https://www.kaggle.com/competitions/titanic/overview

### Dataset Description
#### Overview

The data has been split into two groups:

   * training set (train.csv)
   * test set (test.csv)

> The training set should be used to build your machine learning models. For the training set, we provide the outcome (also known as the “ground truth”) for each passenger. Your model will be based on “features” like passengers’ gender and class. You can also use feature engineering to create new features.

> The test set should be used to see how well your model performs on unseen data. For the test set, we do not provide the ground truth for each passenger. It is your job to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV, train_test_split
from sklearn.metrics import f1_score,mean_absolute_error,accuracy_score
from xgboost import XGBClassifier

In [2]:
# Importing data 
df = pd.read_csv("titanic-data/train.csv")
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [30]:
df.isna().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
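
Raw counts of missing values are easier to compare as shares of the rows. A minimal sketch on a small made-up frame (the values are illustrative and not taken from `train.csv`):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the training data
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# isna().mean() gives the fraction of missing values per column
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

Running the same expression on `df` would rank `Cabin`, `Age` and `Embarked` by how much of each column is missing.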

In [31]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [3]:
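# Note: this does not copy the data. df_tmp and df are the same DataFrame,
# so every change made through df_tmp below is also visible through df.
df_tmp = df
df_tmp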

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [4]:
for label,content in df_tmp.items():
    if pd.api.types.is_string_dtype(content):
        print(label)

Name
Sex
Ticket


In [5]:
for label,content in df_tmp.items():
    if pd.api.types.is_object_dtype(content):
        print(label)
        df_tmp[label]=content.astype("category").cat.as_ordered()

Name
Sex
Ticket
Cabin
Embarked


In [6]:
for label,content in df_tmp.items():
    if pd.api.types.is_object_dtype(content):
        df_tmp[label]=content.astype("category").cat.as_ordered()

In [7]:
df_tmp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   PassengerId  891 non-null    int64   
 1   Survived     891 non-null    int64   
 2   Pclass       891 non-null    int64   
 3   Name         891 non-null    category
 4   Sex          891 non-null    category
 5   Age          714 non-null    float64 
 6   SibSp        891 non-null    int64   
 7   Parch        891 non-null    int64   
 8   Ticket       891 non-null    category
 9   Fare         891 non-null    float64 
 10  Cabin        204 non-null    category
 11  Embarked     889 non-null    category
dtypes: category(5), float64(2), int64(5)
memory usage: 122.0 KB


In [8]:
df_tmp.isna().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [9]:
for label,content in df_tmp.items():
    if pd.api.types.is_numeric_dtype(content):
        if(pd.isnull(content).sum()):
            df_tmp[label+"_is_missing"]=pd.isnull(content)
            df_tmp[label]=content.fillna(content.median())
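
The loop above fills numeric gaps with the column median and keeps a flag for the rows that were filled. A small sketch of the same idea on a hypothetical one-column frame:

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one missing entry
toy = pd.DataFrame({"Age": [22.0, np.nan, 26.0, 40.0]})

# Record which rows were missing before filling them
toy["Age_is_missing"] = toy["Age"].isna()
# median() skips NaN by default, so this is the median of 22, 26 and 40
toy["Age"] = toy["Age"].fillna(toy["Age"].median())
print(toy)
```

Keeping the flag lets a tree-based model treat "age unknown" as a signal of its own instead of losing it in the imputed value.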

In [10]:
# Turn categorical variables into numbers and fill missing
for label,content in df_tmp.items():
    if not pd.api.types.is_numeric_dtype(content):
        df_tmp[label+"_is_missing"] = pd.isnull(content)
        df_tmp[label]= pd.Categorical(content).codes+1
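
The `+1` in the cell above matters because pandas encodes a missing category as `-1`. A short sketch with a hypothetical `Embarked`-like series shows the shift:

```python
import numpy as np
import pandas as pd

# Hypothetical port codes with one missing entry
embarked = pd.Series(["S", "C", np.nan, "Q", "S"])

# Categories are sorted (C, Q, S) and get codes 0, 1, 2; missing values get -1
codes = pd.Categorical(embarked).codes
# Adding 1 moves missing values to 0 and the real categories to 1..3
shifted = codes + 1
print(codes.tolist())
print(shifted.tolist())
```

This matches the encoded frame shown later in the notebook, where `S` appears as 3 and `C` as 1.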

In [11]:
df_tmp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 18 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   PassengerId          891 non-null    int64  
 1   Survived             891 non-null    int64  
 2   Pclass               891 non-null    int64  
 3   Name                 891 non-null    int16  
 4   Sex                  891 non-null    int8   
 5   Age                  891 non-null    float64
 6   SibSp                891 non-null    int64  
 7   Parch                891 non-null    int64  
 8   Ticket               891 non-null    int16  
 9   Fare                 891 non-null    float64
 10  Cabin                891 non-null    int16  
 11  Embarked             891 non-null    int8   
 12  Age_is_missing       891 non-null    bool   
 13  Name_is_missing      891 non-null    bool   
 14  Sex_is_missing       891 non-null    bool   
 15  Ticket_is_missing    891 non-null    boo

In [12]:
df_tmp.isna().sum()

PassengerId            0
Survived               0
Pclass                 0
Name                   0
Sex                    0
Age                    0
SibSp                  0
Parch                  0
Ticket                 0
Fare                   0
Cabin                  0
Embarked               0
Age_is_missing         0
Name_is_missing        0
Sex_is_missing         0
Ticket_is_missing      0
Cabin_is_missing       0
Embarked_is_missing    0
dtype: int64

In [13]:
x_train,x_test,y_train,y_test = train_test_split(df_tmp.drop("Survived",axis=1),df_tmp.Survived,test_size=0.2,random_state=42)
x_train.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age_is_missing,Name_is_missing,Sex_is_missing,Ticket_is_missing,Cabin_is_missing,Embarked_is_missing
331,332,1,638,2,45.5,0,0,21,28.5,57,3,False,False,False,False,False,False
733,734,2,85,2,23.0,0,0,229,13.0,0,3,False,False,False,False,True,False
382,383,3,812,2,32.0,0,0,666,7.925,0,3,False,False,False,False,True,False
704,705,3,327,2,26.0,1,0,399,7.8542,0,3,False,False,False,False,True,False
813,814,3,24,1,6.0,4,2,334,31.275,0,3,False,False,False,False,True,False


In [66]:
y_train

331    0
733    0
382    0
704    0
813    0
      ..
106    1
270    0
860    0
435    1
102    0
Name: Survived, Length: 712, dtype: int64
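
The split above is purely random. One optional variation, not used in this notebook, is to pass `stratify` so that both parts keep the same class ratio. A sketch on made-up labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 60 zeros and 40 ones
y = np.array([0] * 60 + [1] * 40)
X = np.arange(100).reshape(-1, 1)

# stratify=y keeps the 60/40 class ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(y_tr.mean(), y_te.mean())
```

With a target as unbalanced as `Survived`, this can make validation scores a little less dependent on the particular random split.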

In [68]:
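%%time
## Modelling using XGBoost
# Note: Survived has only two classes (0 and 1), yet the objective below
# declares three; 'binary:logistic' is the usual objective for a binary target.
model = XGBClassifier(n_estimators=100,
                      max_depth=3,
                      learning_rate=0.2,
                      objective='multi:softmax',
                      num_class=3,
                      n_jobs=-1)
model.fit(x_train, y_train)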

CPU times: user 707 ms, sys: 47.5 ms, total: 755 ms
Wall time: 114 ms


In [69]:
model.score(x_test,y_test)

0.8268156424581006
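
`XGBClassifier` follows the scikit-learn estimator interface, so `score` reports mean accuracy on the data passed in. The sketch below checks that equivalence, using a `RandomForestClassifier` on synthetic data as a stand-in so that it runs without XGBoost installed:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data, not the Titanic set
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# For classifiers, score() is the accuracy of predict() against the labels
score = clf.score(X_te, y_te)
manual = accuracy_score(y_te, clf.predict(X_te))
print(score == manual)
```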

### Hyperparameter tuning

In [70]:
# Search space for the XGBoost classifier
xgb_grid = {"n_estimators": np.arange(10, 1000, 50),
            "max_depth": [None, 3, 5, 10],
            "min_child_weight": np.arange(1, 20, 2)}
np.random.seed(42)
rs_rf_rg = RandomizedSearchCV(XGBClassifier(tree_method='auto', n_jobs=-1, enable_categorical=True),
                              param_distributions=xgb_grid,
                              cv=5,
                              n_iter=100,
                              verbose=True)
rs_rf_rg.fit(x_train, y_train)

Fitting 5 folds for each of 100 candidates, totalling 500 fits


In [71]:
rs_rf_rg.score(x_test,y_test)

0.8435754189944135
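
After a randomized search it is often useful to look at which combination won and how it scored in cross-validation, not only at the hold-out score. A minimal sketch on synthetic data with a deliberately tiny search space (the estimator and grid here are placeholders, not the ones tuned above):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data, not the Titanic set
X, y = make_classification(n_samples=150, n_features=6, random_state=0)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={
        "n_estimators": [10, 20, 30],
        "max_depth": [None, 3, 5],
    },
    n_iter=4,
    cv=3,
    random_state=42,  # makes the sampled candidates reproducible
)
search.fit(X, y)

print(search.best_params_)  # the winning combination
print(search.best_score_)   # its mean cross-validated accuracy
```

The same attributes are available on `rs_rf_rg`, and `best_estimator_` holds the refitted model.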

In [72]:
mod = RandomForestClassifier()
mod.fit(x_train,y_train)
mod.score(x_test,y_test)

0.8379888268156425

In [14]:
# 2. Hyperparameter grid 
rf_grid={"n_estimators":np.arange(10,1000,50),
        "max_depth":[None,3,5,10],
        "min_samples_split": np.arange(2,20,2),
         "min_samples_leaf": np.arange(1,20,2)
        }


In [77]:
# tune random forest classifier
np.random.seed(42)
rf_mod=RandomizedSearchCV(RandomForestClassifier(),
                           param_distributions=rf_grid,
                           cv=5,
                           n_iter=100,
                          n_jobs=-1,
                           verbose=True)
rf_mod.fit(x_train,y_train)


Fitting 5 folds for each of 100 candidates, totalling 500 fits


In [78]:
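# Note: this scores `mod`, the untuned forest fitted earlier, which is why the value is unchanged.
# Call rf_mod.score(x_test, y_test) to evaluate the tuned search.
mod.score(x_test,y_test)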

0.8379888268156425

In [79]:
gs_rf = GridSearchCV(
    RandomForestClassifier(),
    param_grid=rf_grid,
    cv=5,
    verbose=True,
    n_jobs=-1  # ⬅️ This uses all cores
)
gs_rf.fit(x_train,y_train)
gs_rf.score(x_test,y_test)

Fitting 5 folds for each of 7200 candidates, totalling 36000 fits


KeyboardInterrupt: 
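
The exhaustive search was interrupted because the grid is large. The number of candidates can be checked before fitting anything; the sketch below counts them with scikit-learn's `ParameterGrid`, using the same value ranges as `rf_grid`:

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

rf_grid = {
    "n_estimators": np.arange(10, 1000, 50),   # 20 values
    "max_depth": [None, 3, 5, 10],             # 4 values
    "min_samples_split": np.arange(2, 20, 2),  # 9 values
    "min_samples_leaf": np.arange(1, 20, 2),   # 10 values
}

# Every combination is one candidate; each candidate is fitted once per fold
n_candidates = len(ParameterGrid(rf_grid))
n_fits = n_candidates * 5  # cv=5
print(n_candidates, n_fits)
```

That is 20 x 4 x 9 x 10 = 7200 candidates and 36000 fits, compared with the 500 fits of the randomized search with `n_iter=100`.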

In [82]:
df_tmp.to_csv("preprocessed_df.csv",index=False)
import joblib
joblib.dump(rs_rf_rg,"XGBC-model.pkl")

['XGBC-model.pkl']

In [83]:
joblib.dump(mod,"Random-forest.pkl")

['Random-forest.pkl']
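
A quick way to confirm that a saved model is usable is to load it back and compare predictions. A sketch with a small stand-in model and a temporary directory (the file name is hypothetical):

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data and a small stand-in model
X, y = make_classification(n_samples=100, n_features=5, random_state=0)
clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.pkl")  # hypothetical file name
    joblib.dump(clf, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions
same = (restored.predict(X) == clf.predict(X)).all()
print(same)
```

Note that `rs_rf_rg` above is the whole search object; dumping `rs_rf_rg.best_estimator_` instead would store only the refitted model.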

In [15]:
import joblib
mod_xgb = joblib.load("XGBC-model.pkl")

In [16]:
test_df = pd.read_csv("titanic-data/test.csv")
test_df.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


In [17]:
df_tmp.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age_is_missing,Name_is_missing,Sex_is_missing,Ticket_is_missing,Cabin_is_missing,Embarked_is_missing
0,1,0,3,109,2,22.0,1,0,524,7.25,0,3,False,False,False,False,True,False
1,2,1,1,191,1,38.0,1,0,597,71.2833,82,1,False,False,False,False,False,False
2,3,1,3,354,1,26.0,0,0,670,7.925,0,3,False,False,False,False,True,False
3,4,1,1,273,1,35.0,1,0,50,53.1,56,3,False,False,False,False,False,False
4,5,0,3,16,2,35.0,0,0,473,8.05,0,3,False,False,False,False,True,False


In [18]:
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass       418 non-null    int64  
 2   Name         418 non-null    object 
 3   Sex          418 non-null    object 
 4   Age          332 non-null    float64
 5   SibSp        418 non-null    int64  
 6   Parch        418 non-null    int64  
 7   Ticket       418 non-null    object 
 8   Fare         417 non-null    float64
 9   Cabin        91 non-null     object 
 10  Embarked     418 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 36.1+ KB


In [19]:
test_df.isna().sum()

PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64

In [20]:
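def preprocessing(df_tmp):
    """Encode object columns as integer codes and median-fill numeric gaps.

    Modifies df_tmp in place and returns it. Codes and medians are computed
    from the frame passed in, so they are not guaranteed to match the
    training frame.
    """
    # Convert object (string) columns to ordered categories
    for label, content in df_tmp.items():
        if pd.api.types.is_object_dtype(content):
            print(label)
            df_tmp[label] = content.astype("category").cat.as_ordered()

    # Fill missing numeric values with the median and flag the filled rows
    for label, content in df_tmp.items():
        if pd.api.types.is_numeric_dtype(content):
            if pd.isnull(content).sum():
                df_tmp[label + "_is_missing"] = pd.isnull(content)
                df_tmp[label] = content.fillna(content.median())

    # Turn categorical variables into numbers; missing values become 0
    for label, content in df_tmp.items():
        if not pd.api.types.is_numeric_dtype(content):
            df_tmp[label + "_is_missing"] = pd.isnull(content)
            df_tmp[label] = pd.Categorical(content).codes + 1

    return df_tmp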
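
When `preprocessing` is applied to a frame on its own, the integer codes depend on which values happen to occur in that frame. The sketch below, on hypothetical data, shows the mismatch and one possible remedy: reusing the categories learned from the training frame. This is an alternative to consider, not what the notebook does.

```python
import pandas as pd

# Hypothetical frames; the test frame has no "C"
train = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})
test = pd.DataFrame({"Embarked": ["Q", "S", "S"]})

# Encoding each frame separately: after the +1 shift,
# "Q" becomes 2 in train but 1 in test
separate_train = pd.Categorical(train["Embarked"]).codes + 1
separate_test = pd.Categorical(test["Embarked"]).codes + 1

# Reusing the categories learned from train keeps the mapping fixed
cats = pd.Categorical(train["Embarked"]).categories
shared_test = pd.Categorical(test["Embarked"], categories=cats).codes + 1

print(separate_test.tolist())
print(shared_test.tolist())
```

With shared categories, a value that never occurred in training is encoded as 0, the same code as a missing value.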

In [21]:
test_df=preprocessing(test_df)
test_df.head()

Name
Sex
Ticket
Cabin
Embarked


Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age_is_missing,Fare_is_missing,Name_is_missing,Sex_is_missing,Ticket_is_missing,Cabin_is_missing,Embarked_is_missing
0,892,3,207,2,34.5,0,0,153,7.8292,0,2,False,False,False,False,False,True,False
1,893,3,404,1,47.0,1,0,222,7.0,0,3,False,False,False,False,False,True,False
2,894,2,270,2,62.0,0,0,74,9.6875,0,2,False,False,False,False,False,True,False
3,895,3,409,2,27.0,0,0,148,8.6625,0,3,False,False,False,False,False,True,False
4,896,3,179,1,22.0,1,1,139,12.2875,0,3,False,False,False,False,False,True,False


In [22]:
df_tmp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 18 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   PassengerId          891 non-null    int64  
 1   Survived             891 non-null    int64  
 2   Pclass               891 non-null    int64  
 3   Name                 891 non-null    int16  
 4   Sex                  891 non-null    int8   
 5   Age                  891 non-null    float64
 6   SibSp                891 non-null    int64  
 7   Parch                891 non-null    int64  
 8   Ticket               891 non-null    int16  
 9   Fare                 891 non-null    float64
 10  Cabin                891 non-null    int16  
 11  Embarked             891 non-null    int8   
 12  Age_is_missing       891 non-null    bool   
 13  Name_is_missing      891 non-null    bool   
 14  Sex_is_missing       891 non-null    bool   
 15  Ticket_is_missing    891 non-null    boo

In [23]:
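# test.csv has one missing Fare, so preprocessing() added a Fare_is_missing column there.
# Add the same column to the training frame so both frames have identical features.
df_tmp["Fare_is_missing"] = False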

In [24]:
df_tmp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 19 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   PassengerId          891 non-null    int64  
 1   Survived             891 non-null    int64  
 2   Pclass               891 non-null    int64  
 3   Name                 891 non-null    int16  
 4   Sex                  891 non-null    int8   
 5   Age                  891 non-null    float64
 6   SibSp                891 non-null    int64  
 7   Parch                891 non-null    int64  
 8   Ticket               891 non-null    int16  
 9   Fare                 891 non-null    float64
 10  Cabin                891 non-null    int16  
 11  Embarked             891 non-null    int8   
 12  Age_is_missing       891 non-null    bool   
 13  Name_is_missing      891 non-null    bool   
 14  Sex_is_missing       891 non-null    bool   
 15  Ticket_is_missing    891 non-null    boo

In [25]:
test_df.isna().sum()

PassengerId            0
Pclass                 0
Name                   0
Sex                    0
Age                    0
SibSp                  0
Parch                  0
Ticket                 0
Fare                   0
Cabin                  0
Embarked               0
Age_is_missing         0
Fare_is_missing        0
Name_is_missing        0
Sex_is_missing         0
Ticket_is_missing      0
Cabin_is_missing       0
Embarked_is_missing    0
dtype: int64

In [30]:
# Move Fare_is_missing to the end so the test columns follow the training column order
test_df["Fare_is_missing"] = test_df.pop("Fare_is_missing")
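
Reordering one column by hand works here, but the general problem is making the test frame match the training columns in both name and order. A sketch of one way to do that with `reindex`, using hypothetical column lists:

```python
import pandas as pd

# Hypothetical training column order
train_cols = ["Pclass", "Age", "Age_is_missing", "Fare_is_missing"]
test = pd.DataFrame({
    "Pclass": [3, 1],
    "Fare_is_missing": [False, True],
    "Age": [34.5, 47.0],
})

# reindex puts the columns in the training order; a column absent
# from the test frame is created and filled with False
aligned = test.reindex(columns=train_cols, fill_value=False)
print(aligned.columns.tolist())
```

In this notebook the training order would come from `df_tmp.drop("Survived", axis=1).columns`.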

In [31]:
test_df.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age_is_missing,Name_is_missing,Sex_is_missing,Ticket_is_missing,Cabin_is_missing,Embarked_is_missing,Fare_is_missing
0,892,3,207,2,34.5,0,0,153,7.8292,0,2,False,False,False,False,True,False,False
1,893,3,404,1,47.0,1,0,222,7.0,0,3,False,False,False,False,True,False,False
2,894,2,270,2,62.0,0,0,74,9.6875,0,2,False,False,False,False,True,False,False
3,895,3,409,2,27.0,0,0,148,8.6625,0,3,False,False,False,False,True,False,False
4,896,3,179,1,22.0,1,1,139,12.2875,0,3,False,False,False,False,True,False,False


In [32]:
df_tmp.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age_is_missing,Name_is_missing,Sex_is_missing,Ticket_is_missing,Cabin_is_missing,Embarked_is_missing,Fare_is_missing
0,1,0,3,109,2,22.0,1,0,524,7.25,0,3,False,False,False,False,True,False,False
1,2,1,1,191,1,38.0,1,0,597,71.2833,82,1,False,False,False,False,False,False,False
2,3,1,3,354,1,26.0,0,0,670,7.925,0,3,False,False,False,False,True,False,False
3,4,1,1,273,1,35.0,1,0,50,53.1,56,3,False,False,False,False,False,False,False
4,5,0,3,16,2,35.0,0,0,473,8.05,0,3,False,False,False,False,True,False,False


In [33]:
# Re-split now that df_tmp includes Fare_is_missing, then rerun the XGBoost search
x_train, x_test, y_train, y_test = train_test_split(df_tmp.drop("Survived", axis=1),
                                                    df_tmp.Survived,
                                                    test_size=0.2,
                                                    random_state=42)
xgb_grid = {"n_estimators": np.arange(10, 1000, 50),
            "max_depth": [None, 3, 5, 10],
            "min_child_weight": np.arange(1, 20, 2)}
np.random.seed(42)
up_mod = RandomizedSearchCV(XGBClassifier(tree_method='auto', n_jobs=-1, enable_categorical=True),
                            param_distributions=xgb_grid,
                            cv=5,
                            n_iter=100,
                            verbose=True)
up_mod.fit(x_train, y_train)
up_mod.score(x_test, y_test)

Fitting 5 folds for each of 100 candidates, totalling 500 fits


0.8435754189944135

In [35]:
predictions = up_mod.predict(test_df)
predictions

array([0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0,
       1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1,
       1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1,
       1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0,
       1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1,
       0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0,
       0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1,
       0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1,
       1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1,
       0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0,
       1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1,
       0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,
       0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0,

In [36]:
len(predictions)

418

In [39]:
pass_id = test_df["PassengerId"]

### Preparing the prediction CSV file

In [41]:
pred = pd.DataFrame(predictions,index=pass_id,columns=["Survived"])
pred

Unnamed: 0_level_0,Survived
PassengerId,Unnamed: 1_level_1
892,0
893,0
894,0
895,1
896,0
...,...
1305,0
1306,1
1307,0
1308,0


In [42]:
pred.to_csv("test_results.csv")
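
The competition expects a CSV with a `PassengerId` column and a `Survived` column. The cell above gets there by using the ids as a named index; an equivalent sketch with explicit columns, on hypothetical ids and labels, looks like this:

```python
import io

import numpy as np
import pandas as pd

# Hypothetical ids and labels standing in for pass_id and predictions
ids = pd.Series([892, 893, 894], name="PassengerId")
labels = np.array([0, 1, 0])

submission = pd.DataFrame({"PassengerId": ids, "Survived": labels})

# Write to an in-memory buffer; index=False avoids an extra unnamed column
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
print(buffer.getvalue())
```

Reading the written file back with `pd.read_csv` and checking its shape and header is a cheap sanity check before uploading.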

In [45]:
def random_forest_optimization():
    rf_grid={"n_estimators":np.arange(10,1000,50),
            "max_depth":[None,3,5,10],
            "min_samples_split": np.arange(2,20,2),
             "min_samples_leaf": np.arange(1,20,2)
            }
    gs_rf = GridSearchCV(
        RandomForestClassifier(),
        param_grid=rf_grid,
        cv=5,
        verbose=True,
        n_jobs=-1  # ⬅️ This uses all cores
        )
    gs_rf.fit(x_train,y_train)
    return gs_rf


In [46]:
rf_model = random_forest_optimization()

Fitting 5 folds for each of 7200 candidates, totalling 36000 fits


In [47]:
rf_model.score(x_test,y_test)

0.8379888268156425

In [48]:
preds = rf_model.predict(test_df)
preds

array([0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0,
       1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1,
       1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1,
       1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0,
       1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1,
       0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1,
       0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1,
       1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1,
       0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0,
       1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1,
       0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0,
       0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,

In [49]:
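# Note: this passes `predictions` (the XGBoost output), not `preds` from rf_model,
# so the table and the CSV written below repeat the XGBoost results.
# Use `preds` here to export the random forest predictions.
pred = pd.DataFrame(predictions,index=pass_id,columns=["Survived"])
pred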

Unnamed: 0_level_0,Survived
PassengerId,Unnamed: 1_level_1
892,0
893,0
894,0
895,1
896,0
...,...
1305,0
1306,1
1307,0
1308,0


In [50]:
pred.to_csv("test_results_using_random_forest.csv")

In [51]:
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 18 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   PassengerId          418 non-null    int64  
 1   Pclass               418 non-null    int64  
 2   Name                 418 non-null    int16  
 3   Sex                  418 non-null    int8   
 4   Age                  418 non-null    float64
 5   SibSp                418 non-null    int64  
 6   Parch                418 non-null    int64  
 7   Ticket               418 non-null    int16  
 8   Fare                 418 non-null    float64
 9   Cabin                418 non-null    int8   
 10  Embarked             418 non-null    int8   
 11  Age_is_missing       418 non-null    bool   
 12  Name_is_missing      418 non-null    bool   
 13  Sex_is_missing       418 non-null    bool   
 14  Ticket_is_missing    418 non-null    bool   
 15  Cabin_is_missing     418 non-null    boo

In [52]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 19 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   PassengerId          891 non-null    int64  
 1   Survived             891 non-null    int64  
 2   Pclass               891 non-null    int64  
 3   Name                 891 non-null    int16  
 4   Sex                  891 non-null    int8   
 5   Age                  891 non-null    float64
 6   SibSp                891 non-null    int64  
 7   Parch                891 non-null    int64  
 8   Ticket               891 non-null    int16  
 9   Fare                 891 non-null    float64
 10  Cabin                891 non-null    int16  
 11  Embarked             891 non-null    int8   
 12  Age_is_missing       891 non-null    bool   
 13  Name_is_missing      891 non-null    bool   
 14  Sex_is_missing       891 non-null    bool   
 15  Ticket_is_missing    891 non-null    boo

In [53]:
df_tmp.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age_is_missing,Name_is_missing,Sex_is_missing,Ticket_is_missing,Cabin_is_missing,Embarked_is_missing,Fare_is_missing
0,1,0,3,109,2,22.0,1,0,524,7.25,0,3,False,False,False,False,True,False,False
1,2,1,1,191,1,38.0,1,0,597,71.2833,82,1,False,False,False,False,False,False,False
2,3,1,3,354,1,26.0,0,0,670,7.925,0,3,False,False,False,False,True,False,False
3,4,1,1,273,1,35.0,1,0,50,53.1,56,3,False,False,False,False,False,False,False
4,5,0,3,16,2,35.0,0,0,473,8.05,0,3,False,False,False,False,True,False,False


In [54]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Drop identifier-like columns. df is the same object as df_tmp, so it is already preprocessed.
df = df.drop(['PassengerId', 'Ticket', 'Name', 'Cabin'], axis=1)

# Target and features
X = df.drop('Survived', axis=1)
y = df['Survived']

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a Random Forest model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Predictions and accuracy
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")


Accuracy: 0.8101


In [56]:
model.score(X_test,y_test)

0.8100558659217877

In [57]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(model, X, y, cv=5)  # 5-fold cross-validation
print("CV Accuracy:", scores.mean())


CV Accuracy: 0.8002761910740066
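
A single mean hides how much the folds disagree. The sketch below, on synthetic data, reports the spread as well and uses shuffled stratified folds with a fixed seed so the estimate is repeatable; the fold settings are an assumption, not what the cell above used.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic data, not the Titanic set
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
clf = RandomForestClassifier(n_estimators=50, random_state=42)

# Shuffled, stratified folds with a fixed seed make the estimate repeatable
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(clf, X, y, cv=cv)
print(f"CV accuracy: {scores.mean():.4f} +/- {scores.std():.4f}")
```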


In [58]:
t_df = test_df.drop(['PassengerId', 'Ticket', 'Name', 'Cabin'], axis=1)
pred_t = model.predict(t_df)

In [59]:
pred_t = pd.DataFrame(pred_t,index=pass_id,columns=["Survived"])
pred_t


Unnamed: 0_level_0,Survived
PassengerId,Unnamed: 1_level_1
892,0
893,0
894,0
895,1
896,0
...,...
1305,0
1306,1
1307,0
1308,0


In [60]:
pred_t.to_csv("test_results_dropping_columns.csv")