# Kaggle - Titanic Machine Learning
This solution is based on Alexis Cook's [Titanic Tutorial notebook](https://www.kaggle.com/code/alexisbcook/titanic-tutorial).

## Problem:

The goal is to predict whether a given passenger survived the sinking of the Titanic. <br>The competition provides a dataset of passengers who were on board the ship.

## Dataset description:

The competition provides two datasets:
1. Training data
2. Testing data

We use the training dataset to train our machine learning model and then predict whether each passenger in the test dataset survived.

The training dataset consists of the following columns: 'PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'<br>
The 'Survived' column is our target, and the remaining columns are the candidate training features.

The testing dataset consists of 'PassengerId', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'<br>
For this dataset we have to predict the 'Survived' column.


### 1. Import and loading data
We import the pandas library, which is used for data processing,
and load the data using pandas' read_csv() function.

In [None]:
import pandas as pd

In [64]:
train_data = pd.read_csv("/kaggle/input/titanic/train.csv")
train_data.head()

Index(['PassengerId', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch',
       'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

In [5]:
test_data = pd.read_csv("/kaggle/input/titanic/test.csv")
test_data.head()

|   | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q |
| 1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0 | NaN | S |
| 2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q |
| 3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S |
| 4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S |


### 2. EDA
Here we calculate the survival rate of women and of men in the training data.

About 74% of the women survived, compared with about 19% of the men. So the "Sex" feature is one of the essential features for prediction: women were far more likely to survive than men.

In [65]:
women = train_data.loc[train_data.Sex == 'female']["Survived"]
rate_women = sum(women)/len(women)

print("% of women who survived:", rate_women)

% of women who survived: 0.7420382165605095


In [66]:
men = train_data.loc[train_data.Sex == 'male']["Survived"]
rate_men = sum(men)/len(men)

print("% of men who survived:", rate_men)

% of men who survived: 0.18890814558058924
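
Both rates can also be computed in one step with a group-by. The snippet below is a minimal sketch on a made-up toy frame (the `toy` data is an assumption for illustration, not the Kaggle data); it relies on the fact that the mean of a 0/1 column is the fraction of ones.

```python
import pandas as pd

# Toy stand-in for train_data; the values are invented for illustration.
toy = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "male", "female"],
    "Survived": [1, 1, 0, 0, 1, 0],
})

# The mean of a 0/1 column is the fraction of ones, i.e. the survival rate.
rate_by_sex = toy.groupby("Sex")["Survived"].mean()
print(rate_by_sex)
```

On the real data the same call would be `train_data.groupby("Sex")["Survived"].mean()`.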


### 3. Loading model and selecting features
Here we select the features that are most likely to be related to survival: "Pclass", "Sex", "SibSp" and "Parch". <br>
The "Sex" feature is categorical and needs to be converted to numerical columns, which is done with the pandas get_dummies function.
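
To see what get_dummies does to a categorical column, here is a small sketch on toy frames (`toy_train` and `toy_test` are hypothetical names with made-up values). It also shows one way to guard against a category that is missing from the test data, by reindexing the test columns to match the training columns; the notebook itself does not need this step because both datasets contain both sexes.

```python
import pandas as pd

# Made-up rows for illustration only.
toy_train = pd.DataFrame({"Pclass": [1, 3, 2], "Sex": ["male", "female", "female"]})
toy_test = pd.DataFrame({"Pclass": [3, 1], "Sex": ["male", "male"]})

# Numeric columns pass through; "Sex" becomes one indicator column per category.
X_toy = pd.get_dummies(toy_train)
X_toy_test = pd.get_dummies(toy_test)

# toy_test has no "female" rows, so Sex_female is missing there.
# Reindexing adds it (filled with 0) and keeps the column order identical.
X_toy_test = X_toy_test.reindex(columns=X_toy.columns, fill_value=0)

print(list(X_toy.columns))
print(list(X_toy_test.columns))
```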

### 4. Fitting the model, predicting on test data and saving output
We use a Random Forest Classifier to predict survival. The model is initialized with arbitrarily chosen parameter values (n_estimators=100, max_depth=5, random_state=1). We fit it on the training features and the training target, and then use it to predict the target for the test dataset.

Finally we write the predictions to a csv file and submit it to the challenge.

In [None]:
from sklearn.ensemble import RandomForestClassifier

y = train_data["Survived"]

features = ["Pclass", "Sex", "SibSp", "Parch"]
X = pd.get_dummies(train_data[features])
X_test = pd.get_dummies(test_data[features])

model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
model.fit(X, y)
predictions = model.predict(X_test)

output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': predictions})
output.to_csv('term_project_submission.csv', index=False)
print("Your submission was successfully saved!")

# Contribution

In [None]:
train_data.describe()

Here we try to include the "Age" column in the training features.<br>But first we have to fill the null values in the "Age" column of the training and testing datasets. Both are filled with the mean age of the training dataset.

In [4]:
train_mean_age = train_data['Age'].mean()
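# Assign the result back; inplace=True on a selected column may not update train_data in recent pandas versions
train_data['Age'] = train_data['Age'].fillna(train_mean_age)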

In [None]:
train_mean_age

In [None]:
train_data.describe()

In [None]:
test_data.describe()

In [5]:
# Fill with the training mean and assign back instead of using inplace=True
test_data['Age'] = test_data['Age'].fillna(train_mean_age)

In [None]:
test_data

In [None]:
new_features=["Pclass", "Sex", "SibSp", "Parch","Age"]
X_new = pd.get_dummies(train_data[new_features])
X_test_new = pd.get_dummies(test_data[new_features])
# train_data["Age"].unique
X_new

We will use the same model with the same parameters.

In [None]:
from sklearn.ensemble import RandomForestClassifier

y_new = train_data["Survived"]
model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
model.fit(X_new, y_new)
predictions = model.predict(X_test_new)

output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': predictions})
output.to_csv('term_project_submission2.csv', index=False)
print("Your submission was successfully saved!")

We got a score of 

In [None]:
train_data.head()

# Contribution 2

### 1. Dividing train data into train and validation sets and using mean_squared_error
Here we divide the training data further into training (70%) and validation (30%) sets, stratified on "Survived" so that both sets keep roughly the same survival ratio.<br> We import mean_squared_error (MSE) to measure the performance of our model on the validation dataset.

In [12]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

In [13]:
train_data, validation_data = train_test_split(train_data,test_size=0.3,random_state=42,stratify=train_data['Survived'])
print(train_data.head())
print(validation_data.head())

     PassengerId  Survived  Pclass                            Name     Sex  \
748          749         0       1       Marvin, Mr. Daniel Warner    male   
45            46         0       3        Rogers, Mr. William John    male   
28            29         1       3   O'Dwyer, Miss. Ellen "Nellie"  female   
633          634         0       1   Parr, Mr. William Henry Marsh    male   
403          404         0       3  Hakkarainen, Mr. Pekka Pietari    male   

      Age  SibSp  Parch            Ticket     Fare Cabin Embarked  
748  19.0      1      0            113773  53.1000   D30        S  
45    NaN      0      0   S.C./A.4. 23567   8.0500   NaN        S  
28    NaN      0      0            330959   7.8792   NaN        Q  
633   NaN      0      0            112052   0.0000   NaN        S  
403  28.0      1      0  STON/O2. 3101279  15.8500   NaN        S  
     PassengerId  Survived  Pclass                     Name   Sex   Age  \
625          626         0       1    Sutton, Mr
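
The effect of the 70/30 stratified split can be checked on toy data. This is a minimal sketch with an invented frame (`toy`, with 40% positive labels, is an assumption for illustration); it uses the same train_test_split arguments as the cell above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Invented data: 100 rows, 40 survivors and 60 non-survivors.
toy = pd.DataFrame({"feature": range(100), "Survived": [1] * 40 + [0] * 60})

toy_train, toy_val = train_test_split(
    toy, test_size=0.3, random_state=42, stratify=toy["Survived"]
)

# Sizes follow test_size; stratify keeps the survival ratio similar in both parts.
print(len(toy_train), len(toy_val))
print(toy_train["Survived"].mean(), toy_val["Survived"].mean())
```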

### 2. Filling null values for train, validation, test data
Here we also use the "Age" feature for training the model, but when we describe the datasets we see that "Age" has missing values in the training, validation and test datasets.<br> So we fill these null values with the median age of the training dataset. <br>The same training median is used for the validation and test datasets as well, which prevents "data leakage" from those datasets into the model. Missing "Fare" values in the test dataset are filled with the training median fare in the same way.

In [44]:
# Compute the median age from the training split only
train_median_age = train_data['Age'].median()
train_data['Age'] = train_data['Age'].fillna(train_median_age)
validation_data['Age'] = validation_data['Age'].fillna(train_median_age)
test_data['Age'] = test_data['Age'].fillna(train_median_age)

# Fill missing fares in the test data with the training median fare
train_median_fare = train_data['Fare'].median()
test_data['Fare'] = test_data['Fare'].fillna(train_median_fare)
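
The same "learn the statistic on the training split, apply it everywhere" idea can also be expressed with scikit-learn's SimpleImputer. This is an alternative sketch, not what the notebook does; the toy frames and their values are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Invented ages; NaN marks a missing value.
toy_train = pd.DataFrame({"Age": [20.0, 30.0, np.nan, 40.0]})
toy_val = pd.DataFrame({"Age": [np.nan, 80.0]})

# fit() learns the median from the training split only (here 30.0).
imputer = SimpleImputer(strategy="median")
imputer.fit(toy_train[["Age"]])

# transform() fills gaps in any split with that training median.
toy_train["Age"] = imputer.transform(toy_train[["Age"]]).ravel()
toy_val["Age"] = imputer.transform(toy_val[["Age"]]).ravel()

print(toy_val["Age"].tolist())
```

Note that the validation value 80.0 never influences the fill value, which is the point of avoiding leakage.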

In [24]:
validation_data.describe()

|   | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 268.0 | 268.0 | 268.0 | 268.0 | 268.0 | 268.0 | 268.0 |
| mean | 453.563433 | 0.384328 | 2.283582 | 28.896455 | 0.675373 | 0.391791 | 35.296438 |
| std | 251.327054 | 0.487346 | 0.848987 | 12.90435 | 1.361258 | 0.718566 | 53.886969 |
| min | 2.0 | 0.0 | 1.0 | 0.75 | 0.0 | 0.0 | 0.0 |
| 25% | 251.5 | 0.0 | 1.0 | 23.0 | 0.0 | 0.0 | 8.05 |
| 50% | 450.5 | 0.0 | 3.0 | 29.0 | 0.0 | 0.0 | 15.875 |
| 75% | 667.75 | 1.0 | 3.0 | 33.0 | 1.0 | 1.0 | 37.50315 |
| max | 887.0 | 1.0 | 3.0 | 70.0 | 8.0 | 4.0 | 512.3292 |


### 3. Added new feature
Here we add the "Age" feature to the list of training features.<br> To convert the categorical values to numerical ones we use the get_dummies function.

### 4. Predicting with chosen parameters and calculating MSE
We initialize a model with n_estimators=100, max_depth=4 and random_state=1, fit it on our training data and predict the result for the validation dataset.<br> We use MSE to evaluate the model on the validation data and get a score of 0.1865671641791045. Because the labels and predictions are both 0 or 1, this MSE equals the fraction of misclassified passengers, which corresponds to a validation accuracy of about 81.3%.

In [45]:
features_to_use = ['Pclass','Sex','SibSp','Parch','Age']
X = pd.get_dummies(train_data[features_to_use])
X_validation = pd.get_dummies(validation_data[features_to_use])
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=100,max_depth=4,random_state=1)
model.fit(X,train_data['Survived'])

RandomForestClassifier(max_depth=4, random_state=1)

In [46]:
validation_prediction=model.predict(X_validation)
print(mean_squared_error(validation_data['Survived'],validation_prediction))

0.1865671641791045
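
For 0/1 labels and 0/1 predictions, each squared error is 1 for a wrong prediction and 0 for a correct one, so the MSE is simply the error rate (1 minus accuracy). A minimal sketch with made-up label arrays:

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_squared_error

# Invented labels and predictions; two of the eight predictions are wrong.
y_true = np.array([1, 0, 0, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 1, 1, 0, 0, 0, 0])

mse = mean_squared_error(y_true, y_pred)
acc = accuracy_score(y_true, y_pred)

# For binary labels, mse and 1 - acc are the same quantity.
print(mse, 1 - acc)
```

So accuracy_score could be used directly on `validation_data['Survived']` and `validation_prediction` to report the same result in a more familiar form.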


In [None]:
X_test = pd.get_dummies(test_data[features_to_use])
test_predictions = model.predict(X_test)

output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': test_predictions})
output.to_csv('term_project_submission3.csv', index=False)
print("Your submission was successfully saved!")

# Contribution 3 - GridSearch

### 5. Using GridSearch for hyperparameter tuning
We can try to improve the model performance by exhaustively trying different parameter combinations.

For this we use GridSearchCV with the parameters 'n_estimators':[10,20,50,100,120,150,200,300,350,400,500] and 'max_depth':[2,4,6,8,10,12,14], and fit it on the training dataset. Each of the 77 combinations is scored with 5-fold cross-validation.

In [54]:
from sklearn.model_selection import GridSearchCV
parameters = {'n_estimators':[10,20,50,100,120,150,200,300,350,400,500],'max_depth':[2,4,6,8,10,12,14],'random_state':[1]}
clf = GridSearchCV(model,parameters,verbose=3)
# clf.fit(X,X_validation)

In [55]:
clf.fit(X,train_data['Survived'])

Fitting 5 folds for each of 77 candidates, totalling 385 fits
[CV 1/5] END max_depth=2, n_estimators=10, random_state=1;, score=0.808 total time=   0.0s
[CV 2/5] END max_depth=2, n_estimators=10, random_state=1;, score=0.840 total time=   0.0s
[CV 3/5] END max_depth=2, n_estimators=10, random_state=1;, score=0.848 total time=   0.0s
[CV 4/5] END max_depth=2, n_estimators=10, random_state=1;, score=0.782 total time=   0.0s
[CV 5/5] END max_depth=2, n_estimators=10, random_state=1;, score=0.766 total time=   0.0s
[CV 1/5] END max_depth=2, n_estimators=20, random_state=1;, score=0.808 total time=   0.0s
[CV 2/5] END max_depth=2, n_estimators=20, random_state=1;, score=0.840 total time=   0.0s
[CV 3/5] END max_depth=2, n_estimators=20, random_state=1;, score=0.840 total time=   0.0s
[CV 4/5] END max_depth=2, n_estimators=20, random_state=1;, score=0.782 total time=   0.0s
[CV 5/5] END max_depth=2, n_estimators=20, random_state=1;, score=0.766 total time=   0.0s
[CV 1/5] END max_depth=2, n_

GridSearchCV(estimator=RandomForestClassifier(max_depth=4, random_state=1),
             param_grid={'max_depth': [2, 4, 6, 8, 10, 12, 14],
                         'n_estimators': [10, 20, 50, 100, 120, 150, 200, 300,
                                          350, 400, 500],
                         'random_state': [1]},
             verbose=3)

This gives us the best parameters: {'max_depth': 6, 'n_estimators': 150, 'random_state': 1}.

In [57]:
clf.best_params_

{'max_depth': 6, 'n_estimators': 150, 'random_state': 1}
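
Besides best_params_, a fitted GridSearchCV also exposes best_score_ (the mean cross-validated score of the winning combination) and best_estimator_ (a model refit on the full training data). The sketch below uses synthetic data from make_classification and a deliberately tiny grid; the data, grid values and variable names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data, not the Titanic data.
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

# A small grid keeps the example fast: 2 x 2 = 4 candidates, 5 folds each.
param_grid = {"n_estimators": [10, 50], "max_depth": [2, 4]}
search = GridSearchCV(
    RandomForestClassifier(random_state=1), param_grid, cv=5, scoring="accuracy"
)
search.fit(X_toy, y_toy)

print(search.best_params_)
print(round(search.best_score_, 3))

# The best combination, already refit on all of X_toy (refit=True is the default).
best_model = search.best_estimator_
```

Keep in mind that best_score_ is an average over folds of the training data, so the selected parameters are not guaranteed to score best on a separate validation split.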

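We use these parameters to retrain the model and save the result to an output file.<br> But this actually reduced the performance of our model on the validation dataset: the MSE rose from 0.1865671641791045 to 0.208955223880597.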

In [62]:
model = RandomForestClassifier(n_estimators=150,max_depth=6,random_state=1)
model.fit(X,train_data['Survived'])
validation_prediction=model.predict(X_validation)
print(mean_squared_error(validation_data['Survived'],validation_prediction))


0.208955223880597


In [None]:
X_test = pd.get_dummies(test_data[features_to_use])
test_predictions = model.predict(X_test)

output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': test_predictions})
output.to_csv('term_project_submission4.csv', index=False)
print("Your submission was successfully saved!")

### Future work
* We can try different machine learning models.
* We can try to further tune the parameters.
* We can try to use multiple models and compare their results.
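
As a rough sketch of how such a comparison might look, the snippet below scores two candidate models with 5-fold cross-validation. The choice of LogisticRegression as the second model, the synthetic data and all variable names are assumptions for illustration; none of this has been run on the Titanic data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary classification data, not the Titanic data.
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

# Hypothetical candidates to compare.
candidates = {
    "random_forest": RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

# Score every candidate on the same folds and keep the mean accuracy.
mean_scores = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X_toy, y_toy, cv=5, scoring="accuracy")
    mean_scores[name] = scores.mean()
    print(name, round(scores.mean(), 3))
```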