In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
#hide some warnings :p
import sys
import warnings

if not sys.warnoptions:
    warnings.simplefilter("ignore")

# Loading Data And Exploring Data

We basically have two files.
1. Data on people who boarded the Titanic, with the outcome of whether they survived or not. (train data)
2. Data on people who boarded the Titanic without the outcome; the task of this competition is to predict whether they survived or not. (test data)

In [None]:
kaggle_path = '../input/titanic/'
train = pd.read_csv(kaggle_path + 'train.csv')
test = pd.read_csv(kaggle_path + 'test.csv')
submission = pd.read_csv(kaggle_path + 'gender_submission.csv')

Note that we should always check train and test data sets simultaneously to gain more insights and avoid inconsistency issues later on.

In [None]:
print(train.shape)
train.head()

In [None]:
print(test.shape)
test.head()

In [None]:
train.isna().sum()

In [None]:
test.isna().sum()

<big> The Missing Data: </big>

1. The 'Cabin' column has an immense number of empty values (but don't worry, we will just drop it at the end :D )
2. The 'Age' column is another issue to deal with, as it has a huge number of N/As as well
3. There is one row with an empty 'Fare' value in the test data.

Let's see what we can do about this.

*(Observe that there is no empty 'Fare' value in the train dataset. If we had built the model without exploring the test set, it would have thrown an error when predicting on the test set.)*
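
A quick way to spot this kind of mismatch is to put the missing-value counts of both sets side by side. This is just a minimal sketch on made-up toy frames (`toy_train` and `toy_test` are hypothetical stand-ins, not the real Titanic files):

```python
import numpy as np
import pandas as pd

# Toy stand-ins for train/test (hypothetical values, not the real data)
toy_train = pd.DataFrame({'Age': [22.0, np.nan, 30.0], 'Fare': [7.25, 71.28, 8.05]})
toy_test = pd.DataFrame({'Age': [np.nan, 40.0, 35.0], 'Fare': [7.9, np.nan, 9.0]})

# One row per column, one count per dataset
missing = pd.concat(
    [toy_train.isna().sum(), toy_test.isna().sum()],
    axis=1, keys=['train', 'test'],
)

# Columns that are complete in train but have gaps in test are the risky ones
only_in_test = missing[(missing['train'] == 0) & (missing['test'] > 0)]
print(missing)
print(only_in_test.index.tolist())
```

On the toy data only 'Fare' shows up in the second print, which is the same situation as in the real files.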

# Feature Engineering

(First of all, this is not a great feature analysis, but it will still help create a good, consistent model!!)

# *Exploring names:*

In [None]:
train['Name'][0:10]

If you take a good look at the names, every name tells us 3 things:
1. The family name
2. The title, such as Mr., Mrs., Miss. etc.
3. The given name of the person
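
The three parts can be pulled apart in one go with a regular expression. This is only a sketch, assuming every name follows the "Family, Title. Given names" pattern; the column names `Family`, `Title` and `Given` are my own choice:

```python
import pandas as pd

# A few names written in the dataset's "Family, Title. Given names" format
names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'Palsson, Master. Gosta Leonard',
])

# Named groups become the column names of the resulting DataFrame
pattern = r'^(?P<Family>[^,]+),\s*(?P<Title>[^.]+\.)\s*(?P<Given>.*)$'
parts = names.str.extract(pattern)
print(parts)
```

Names that do not fit the pattern come back as NaN, so it is worth checking `parts.isna().sum()` before relying on the result.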


In [None]:
titles = ['Mr.', 'Mrs.', 'Miss.', 'Ms.', 'Master.', 'Dr.']   #creating list of relevant titles

In [None]:
print('Names with missing ages insight:\n')
total = 0   # missing-age rows in train that carry one of the titles
for t in titles:
    # combine both conditions in one mask instead of chaining boolean indexers
    count1 = (train['Age'].isna() & train['Name'].str.contains(t, regex=False)).sum()
    count2 = (test['Age'].isna() & test['Name'].str.contains(t, regex=False)).sum()
    print('Names having', t, 'in them are:', count1+count2)
    total += count1

The names of the rows with N/A (empty) ages have these titles:
1. Mr.  
2. Mrs.
3. Miss.
4. Master. 
5. Dr.

Master is a fancy way of addressing little boys who are not yet old enough to be called Mister :) 

(We can verify that these titles cover every missing age in train with this code)

In [None]:
total == train['Age'].isna().sum()

Now let's compute the mean age for each title, using the passengers whose age is not missing.

This fancy code computes the average age for each title from both train and test datasets.

In [None]:
age_dict = {}
for t in titles:
    # Mean age of a title in the train set (.mean() skips missing ages;
    # the result is NaN if no passenger with this title has a known age)
    trm = train.loc[train['Name'].str.contains(t, regex=False), 'Age'].mean()
    # Mean age of the same title in the test set
    tsm = test.loc[test['Name'].str.contains(t, regex=False), 'Age'].mean()
    # Average the two means; a NaN mean is ignored rather than counted as 0,
    # which would halve the result
    avg = round(np.nanmean([trm, tsm]))
    print('average age of Names having',t ,'in them is: ', avg)
    age_dict[t] = avg

Now we should be having a cool nice dictionary of titles and their average ages.

In [None]:
age_dict

Let's fill in the 'Age' of the people whose 'Age' is missing. This fancy function will do that job.

In [None]:
def fill_na_names(df):
    missing_ages = df['Name'][df['Age'].isna()]
    index = df['Name'][df['Age'].isna()].index
    for name,i in zip(missing_ages, index):
        for ttl in age_dict:
            if name.find(ttl) != -1:
                df.loc[i, 'Age'] = age_dict[ttl]
                
            
fill_na_names(train)
fill_na_names(test)

Let's see if the 'Age' was filled or not.

In [None]:
train['Age'].isna().sum(), test['Age'].isna().sum()

Sweet!
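
For comparison, the same idea (fill a missing age with the mean age of that title) can be written without explicit loops. This is a sketch on a made-up frame; note that it takes the mean inside a single frame, which is a simplification of the train/test averaging used above:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data in the "Family, Title. Given" name format
toy = pd.DataFrame({
    'Name': ['A, Mr. X', 'B, Mr. Y', 'C, Miss. Z', 'D, Miss. W', 'E, Mr. V'],
    'Age': [30.0, np.nan, 20.0, np.nan, 40.0],
})

# Title = the text between the comma and the first period
title = toy['Name'].str.extract(r',\s*([^.]+\.)', expand=False)

# Mean age per title, broadcast back onto every row (missing ages are skipped)
title_mean = toy.groupby(title)['Age'].transform('mean')

# Only the missing ages are replaced
toy['Age'] = toy['Age'].fillna(title_mean.round())
print(toy)
```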

# *Filling Fare*

There was an empty 'Fare' value in test. I could have put average of fare or mode of fare but I decided to be smart about it! (I guess)

In [None]:
m = test[test['Fare'].isna()]
i = test[test['Fare'].isna()].index
m

In [None]:
# let's see if he shares a ticket with someone else
test[test.Ticket=='3701']

No one else with that ticket.

Let's just fill the Fare with a mean based on his Pclass, Age, Embarked port, Sex, SibSp and Parch

In [None]:
m1 = test['Fare'][test['Pclass']==3].mean()
m2 = test['Fare'][test['Age']> 50].mean()
m3 = test['Fare'][test['Embarked']=='S'].mean()
m4 = test['Fare'][test['Sex']=='male'].mean()
m5 = test['Fare'][test['SibSp']==0].mean()
m6 = test['Fare'][test['Parch']==0].mean()

avg = round( (m1+ m2+ m3 + m4 + m5 + m6) / 6 )

test.loc[i,'Fare'] = avg
print('Filled ', round(avg))
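
Averaging one mean per attribute is not the only option. A possible alternative is to take the mean over passengers who match on several attributes at the same time. The sketch below contrasts the two on hypothetical toy data with just two attributes; it is an illustration, not a claim that either is the better choice for the real data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the last row is the passenger with the missing fare
toy = pd.DataFrame({
    'Pclass': [3, 3, 3, 1, 1, 3],
    'Embarked': ['S', 'S', 'C', 'S', 'C', 'S'],
    'Fare': [8.0, 7.0, 9.0, 80.0, 90.0, np.nan],
})
target = toy[toy['Fare'].isna()].iloc[0]

# Average of per-attribute means (the approach above, with two attributes)
m_class = toy.loc[toy['Pclass'] == target['Pclass'], 'Fare'].mean()
m_port = toy.loc[toy['Embarked'] == target['Embarked'], 'Fare'].mean()
avg_of_means = (m_class + m_port) / 2

# Mean over passengers who match on both attributes at once
same_group = (toy['Pclass'] == target['Pclass']) & (toy['Embarked'] == target['Embarked'])
group_mean = toy.loc[same_group, 'Fare'].mean()

print(avg_of_means, group_mean)
```

On this toy data the per-attribute average is pulled up by the expensive first-class fares that share the same port, while the joint group only looks at third-class passengers from that port.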

# *Creating An Additional Family Feature*

In [None]:
train.head(2)

Observe that every Name tells us the family name.

A thought:
- Rich families would have had more influence and thus higher chances of survival
- Similarly, poor families would have had comparatively lower chances of survival

So it can be helpful to add a Family feature that tells which family a person belongs to.

*This fancy function pulls the family name out of each Name and makes a new Family column*

In [None]:
def add_fam_col(df):
    # the family name is everything before the first comma in 'Name'
    df['Family'] = df['Name'].str.split(',').str[0]
        
add_fam_col(train)
add_fam_col(test)

In [None]:
train.head(2)

A cool family feature has been added!

# Merging train and test datasets

It is important to merge the train and test datasets before encoding the features, so that both end up with exactly the same columns when they are fed to the model

In [None]:
df = pd.concat([train, test], join='outer')
df.shape

Let's drop the features we won't use ('Survived' is the target, so it has to come out of the inputs too)

In [None]:
df.columns

In [None]:

useless_feats = ['PassengerId', 'Survived', 'Name', 'SibSp', 
                 'Parch', 'Ticket', 'Cabin', 'Embarked']
df.drop(useless_feats, axis=1,inplace=True)
df.head(2)

Let's convert non-numeric features to one-hot (dummy) columns.

In [None]:
df = pd.get_dummies(df, drop_first=True)
df.head(1)
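
This encoding step is the reason for merging first. A small sketch with hypothetical toy frames shows what goes wrong when train and test are encoded separately: each one only gets dummy columns for the categories it happens to contain.

```python
import pandas as pd

# Hypothetical toy frames; 'Cumings' only appears in train, 'Kelly' only in test
toy_train = pd.DataFrame({'Sex': ['male', 'female'], 'Family': ['Braund', 'Cumings']})
toy_test = pd.DataFrame({'Sex': ['female', 'male'], 'Family': ['Kelly', 'Braund']})

# Encoded separately: the column sets differ
sep_train = pd.get_dummies(toy_train, drop_first=True)
sep_test = pd.get_dummies(toy_test, drop_first=True)
print(list(sep_train.columns), list(sep_test.columns))

# Encoded together, then split back by position: identical columns on both sides
both = pd.get_dummies(pd.concat([toy_train, toy_test]), drop_first=True)
enc_train, enc_test = both[:len(toy_train)], both[len(toy_train):]
print(list(enc_train.columns))
```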

Time to prepare the data for our model.

In [None]:
X = df[:train.shape[0]]
tst = df[train.shape[0]:]
y = train['Survived']
print(X.shape, tst.shape, y.shape)
X.head(2)

And time to scale the features so they all sit on a comparable, small range. This helps distance-based and gradient-based models such as SVM, KNN and Logistic Regression train faster and better!

In [None]:
from sklearn.preprocessing import scale
X = scale(X)
tst = scale(tst)
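
One thing to be aware of: calling `scale` on each set separately standardizes the test set with its own mean and standard deviation. A common alternative is to learn the statistics from the training data only and reuse them for the test data. Here is a minimal sketch with `StandardScaler` on made-up arrays (`X_train` and `X_test` are hypothetical, not the notebook's variables):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy features
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test = np.array([[2.0, 40.0], [4.0, 60.0]])

# Learn mean and standard deviation from the training data only
scaler = StandardScaler().fit(X_train)

# Apply the same transformation to both sets
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

print(scaler.mean_)
print(X_test_s)
```

With this approach a value that equals the training mean maps to 0 in both sets, so the model sees test rows on the same scale it was trained on.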

# **Creating Classification Models**

**Finally!!**

*We will try different models and see which one works best.*

I will use GridSearchCV, which is a sweet feature you get from Sklearn. It basically runs a model with every combination of the hyperparameters you list, scores each one with cross-validation, and keeps the best.
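
Because every combination is tried, grids grow quickly. A small sketch with `ParameterGrid` (the helper GridSearchCV uses to enumerate candidates) shows how to count the work up front; `toy_grid` is a made-up example, not one of the grids used in this notebook:

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical small grid: 3 * 2 * 2 combinations
toy_grid = {
    'max_depth': [None, 10, 20],
    'n_estimators': [10, 50],
    'bootstrap': [True, False],
}

n_candidates = len(ParameterGrid(toy_grid))
n_fits = n_candidates * 3   # each candidate is fitted once per fold with cv=3

print(n_candidates, n_fits)
```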

# 1. Random Forest

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

parameter_grid = {
                 'max_depth' : [None, 10, 12, 20],
                 # n_estimators, min_samples_split and min_samples_leaf need
                 # numeric values (None is not accepted for them)
                 'n_estimators': [50, 10],
                 'max_features': [None, 'sqrt'],
                 'min_samples_split': [2, 3, 10],
                 'min_samples_leaf': [1, 3, 10],
                 'bootstrap': [True, False],
                 }

forest = RandomForestClassifier(n_jobs=2)

M1 = GridSearchCV(forest,
                           scoring='accuracy',
                           param_grid=parameter_grid,
                           cv=3,
                           n_jobs=-1,
                           verbose=1)

M1.fit(X, y)

parameters = M1.best_params_
print('Best score: ' , M1.best_score_ * 100)
print('Best estimator: ' , M1.best_estimator_)


# 2. Support Vector Classifier (SVM)

In [None]:
from sklearn.svm import SVC


parameter_grid = {
                 # kernel and gamma must be given explicitly (None is not accepted), and C must be > 0
                 'kernel': ['rbf', 'sigmoid', 'linear'],
                    'C'  : [0.25,0.5,1,2,3,4],
                 'gamma' : ['auto', 0.01, 0.03, 0.1, 0.3, 1, 3, 5, 20, 50],
                  'class_weight' : [None, 'balanced']
                 }

M2 = GridSearchCV(SVC(),
                      scoring = 'accuracy',
                      cv=3,
                      param_grid= parameter_grid,
                      n_jobs=-1,
                      verbose=1
                     )

M2.fit(X,y)

print('Best score: ', M2.best_score_ * 100)
print('Best estimator: ', M2.best_estimator_)


# 3. K Nearest Neighbors

In [None]:
from sklearn.neighbors import KNeighborsClassifier

grid = { 'n_neighbors': [1, 5, 15, 25, 30, 40],
        'weights': ['distance'],
        # leaf_size must be a positive integer (None is not accepted)
        'leaf_size': [1, 3, 10, 25, 40],
        'p':[1,2]
       }


M3 = GridSearchCV(KNeighborsClassifier(),
                      scoring='accuracy',
                      cv=3,
                      param_grid=grid,
                      n_jobs=-1,
                      verbose=1
)

M3.fit(X, y)

print('Best score: ', M3.best_score_ * 100)
print('Best estimator: ', M3.best_estimator_)


# 4. XGB Classifier

In [None]:
from xgboost import XGBClassifier

parameter_grid = {
                 'max_depth' : [None, 5, 7, 10 ],
                 'max_delta_step': [None, 1, 2],
                 'n_estimators': [None,10, 20, 30, 40],
                 'colsample_bylevel': [None,0.2, 0.5, 0.8],
                 'colsample_bytree': [None,0.2, 0.6],
                 'subsample': [None,0.01,0.1, 0.3, 0.4,1],
                 }

M4 = GridSearchCV(XGBClassifier(),
                               scoring='accuracy',
                               param_grid=parameter_grid,
                               cv=3,
                               n_jobs=-1,
                               verbose=1)

M4.fit(X, y)

print('Best score: ', M4.best_score_ * 100)
print('Best estimator: ', M4.best_estimator_)


# 5. Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression

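# Each solver supports only some penalties, so the grid is split to keep every
# combination valid: all four solvers handle 'l2'; of these, only 'saga' handles 'l1'.
# ('elasticnet' needs 'saga' plus an l1_ratio value, so it is not searched here.)
common = {
    'C' : [0.01, 0.1, 0.5, 1, 2],
    'class_weight' : [None, 'balanced'],
    'max_iter' : [100, 500, 1000]
}
parameter_grid = [
    {**common, 'penalty' : ['l2'], 'solver' : ['newton-cg', 'lbfgs', 'sag', 'saga']},
    {**common, 'penalty' : ['l1'], 'solver' : ['saga']},
]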

M5 = GridSearchCV(LogisticRegression(),
                               scoring='accuracy',
                               param_grid=parameter_grid,
                               cv=3,
                               n_jobs=-1,
                               verbose=1)

M5.fit(X, y)

print('Best score: ', M5.best_score_ * 100)
print('Best estimator: ', M5.best_estimator_)

# 6. Bagging Classifier

To score higher in Kaggle competitions, it is usually the case that the final model is built out of multiple models.
I will try to make a bagging ensemble using Logistic Regression and SVM as base estimators

In [None]:
from sklearn.ensemble import BaggingClassifier

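parameter_grid = {
    # 'base_estimator' was renamed to 'estimator' in scikit-learn 1.2;
    # use the new name on recent versions
    'base_estimator': [LogisticRegression(), SVC()],
    'n_estimators' : [10, 20, 30, 40],
}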

M6 = GridSearchCV(BaggingClassifier(),
                               scoring='accuracy',
                               param_grid=parameter_grid,
                               cv=3,
                               n_jobs=-1,
                               verbose=1)

M6.fit(X, y)

print('Best score: ', M6.best_score_ * 100)
print('Best estimator: ', M6.best_estimator_)
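
A note on terminology: `BaggingClassifier` trains its base models independently on random resamples of the data and combines their votes, while a boosting model trains them one after another, each focusing on the mistakes of the previous ones. The sketch below puts the two side by side on synthetic data; `AdaBoostClassifier` is just one example of boosting that I picked for illustration, and the data comes from `make_classification`, not from the Titanic set:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary classification data (not the Titanic data)
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

# Bagging: independent logistic regressions on bootstrap samples, then a vote
bagging = BaggingClassifier(LogisticRegression(), n_estimators=10, random_state=0)

# Boosting: weak learners fitted sequentially, reweighting misclassified rows
boosting = AdaBoostClassifier(n_estimators=10, random_state=0)

bag_scores = cross_val_score(bagging, X_toy, y_toy, cv=3, scoring='accuracy')
boost_scores = cross_val_score(boosting, X_toy, y_toy, cv=3, scoring='accuracy')

print('bagging :', bag_scores.mean())
print('boosting:', boost_scores.mean())
```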


# Predicting Test Data

Thing is, I kinda discovered the original outcomes of the test dataset :p <br>
So I will just compare my output with them instead of submitting the file every time. <br> <br>
*I hope that is legal: the results have been out there for a long time, and they would have been removed if it were a problem, I mean the Kaggle competition holders must have seen them. Besides, it is not a serious competition, so I guess it is fine in the end. :)*

(If you choose to upload that document straight up, woe to you! I mean, what's the purpose of this competition then?)

In [None]:
temp = pd.read_csv('../input/titanic-leaked/titanic.csv')
original = temp['Survived']

In [None]:
from sklearn.metrics import accuracy_score

preds_M1 = M1.predict(tst)
preds_M2 = M2.predict(tst)
preds_M3 = M3.predict(tst)
preds_M4 = M4.predict(tst)
preds_M5 = M5.predict(tst)
preds_M6 = M6.predict(tst)


score_M1 = accuracy_score(original, preds_M1)
score_M2 = accuracy_score(original, preds_M2)
score_M3 = accuracy_score(original, preds_M3)
score_M4 = accuracy_score(original, preds_M4)
score_M5 = accuracy_score(original, preds_M5)
score_M6 = accuracy_score(original, preds_M6)

print('score with model 1: ', score_M1*100)
print('score with model 2: ', score_M2*100)
print('score with model 3: ', score_M3*100)
print('score with model 4: ', score_M4*100)
print('score with model 5: ', score_M5*100)
print('score with model 6: ', score_M6*100)

Bruh this is seriously unexpected. <br>
The bagged model (BaggingClassifier) got 82% accuracy on the test dataset!<br>
Beginner's luck I must say!

In [None]:
subm_dir = kaggle_path + 'gender_submission.csv'
submit_file = pd.read_csv(subm_dir)
submit_file['Survived'] = preds_M6
submit_file.to_csv('gender_submission.csv', index=False)

Hopefully the rest of the score can be attained by using the Cabin and Embarked features properly.
But I got a lower score with them included, so I dropped them and got better results.

It wasn't much of a special kernel, but considering I did most of the stuff in it on my own and managed to end up in the top 17%, it is a very good thing for me and I am very happy with it. It has only been a month since I got into Machine Learning.

Also, please do give advice on improving this model, especially about the Cabin feature <br>
If you think I screwed anything up, let me know about that too hehe.

Cheers! Have a nice day!