This notebook is meant to help students who are starting out with exploratory data analysis (EDA) and machine learning (ML), using the Titanic dataset.
The basic steps are:
1. Import the necessary libraries.
2. Load the data.
3. Examine the percentage of null values in each attribute of the train and test data.
4. Examine the class distribution, in this case the distribution of Survived.
5. From the movie, we can guess that more females survived than males. Examine that.
6. Pclass (a proxy for social status) might have played a role in survival, since priority among passengers of the same gender may have been based on it. So, examine Pclass among the females who did not survive.
7. Visualize a few relations using seaborn.
8. Create a new feature, Salutation, and merge similar salutations into one. Find the median age for each salutation and use it to impute the missing values of Age.
9. Construct a new attribute, 'GenderPlus', with possible values female, boy and male. This is because we suspect that boys might have had a better chance of survival than adult males.
10. Compare the survival percentage of boys vs adult males to test the above point.
11. Construct a new feature Family_Size as SibSp (siblings and spouses) + Parch (parents and children) + 1.
12. Construct FamilySurvivalRate. The hypothesis is that if another member of a passenger's family survived, then that passenger's chance of survival is also higher.
13. Label-encode GenderPlus and Sex.
14. There is a missing value in Fare; impute it and create Fare_Bins and Fare_Code.
15. Do anomaly detection using pycaret.anomaly.
16. Use only the Sex, GenderPlus, Family_Size, FamilySurvivalRate and Pclass attributes in the final training and testing sets (we tried others, but this set of attributes worked well).
17. Use XGBClassifier with hyperparameter tuning.

That is it. 

The notebook is not original; it is inspired by many excellent notebooks publicly available on Kaggle. Some of the notebooks from which this one has taken cues are:

https://www.kaggle.com/pliptor/how-am-i-doing-with-my-score

https://www.kaggle.com/mrisdal/exploring-survival-on-the-titanic

https://www.kaggle.com/cdeotte/titantic-mega-model-0-84210

https://www.kaggle.com/konstantinmasich/titanic-0-82-0-83

https://www.kaggle.com/tomigelo/titanic-with-family-survival-tpot-0-81818

https://www.kaggle.com/cdeotte/titanic-wcg-xgboost-0-84688

I hope the notebook will be helpful. Do upvote it if you find it useful.

In [None]:
#Importing 
#Importing a few extra classifiers and stats which I may not use. Those can be removed.
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns

#setting sns
sns.set(style='white', context='notebook', palette='deep')

from scipy.stats import mode, norm, skew, kurtosis
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler

from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier, ExtraTreesClassifier, VotingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split, GridSearchCV

#handle warnings
import warnings
warnings.filterwarnings('ignore')

#ensuring inline display of plots
get_ipython().run_line_magic('matplotlib', 'inline')

In [None]:
#Checking for GPU
!nvidia-smi

In [None]:
#read data files
trdf=pd.read_csv("/kaggle/input/titanic/train.csv", header='infer')
tsdf=pd.read_csv("/kaggle/input/titanic/test.csv", header='infer')
submission=pd.read_csv('/kaggle/input/titanic/gender_submission.csv',header='infer')

#Take a glimpse at the data
print(trdf.head())
#tsdf.head()

#trdf.describe()
#tsdf.describe()

#observe data types and missing values
trdf.info()
tsdf.info()

In [None]:
#Examine Null Percentages in train attributewise
per_null_series=(trdf.isnull().sum()/len(trdf))*100
print(type(per_null_series))
print(per_null_series)
sorted_per_null_series=per_null_series.sort_values(ascending=False)
print(sorted_per_null_series)
temp=pd.DataFrame({"Missing Ratio in Train":sorted_per_null_series})
print(temp.head())

#Examine null percentages in test attributewise
per_null_series_test=(tsdf.isnull().sum()/len(tsdf))*100
sorted_per_null_series_test=per_null_series_test.sort_values(ascending=False)
temp_ts=pd.DataFrame({"Percentage Missing Value in Test":sorted_per_null_series_test})
print(temp_ts)

#In the output, observe that the missing percentages are almost the same in train and test, so that is good
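
The same comparison can be put into a single table, which makes the train and test percentages easier to read side by side. The sketch below is only an illustration: `toy_train` and `toy_test` are made-up stand-ins for `trdf` and `tsdf`, not the real Titanic data.

```python
import numpy as np
import pandas as pd

# Made-up stand-ins for trdf and tsdf (not the real Titanic data)
toy_train = pd.DataFrame({"Age": [22.0, np.nan, 30.0, np.nan],
                          "Cabin": [np.nan, np.nan, np.nan, "C85"],
                          "Fare": [7.25, 71.28, 8.05, 53.1]})
toy_test = pd.DataFrame({"Age": [34.0, np.nan, 62.0, 27.0],
                         "Cabin": [np.nan, np.nan, "B45", np.nan],
                         "Fare": [7.83, np.nan, 9.69, 8.66]})

# isnull().mean() is the fraction of nulls per column; multiply by 100 for percent
null_pct = pd.DataFrame({
    "train_missing_pct": toy_train.isnull().mean() * 100,
    "test_missing_pct": toy_test.isnull().mean() * 100,
}).sort_values("train_missing_pct", ascending=False)

print(null_pct)
```

With the real data, replacing the two toy frames by `trdf` and `tsdf` should give one row per attribute and one column per dataset.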

In [None]:
sns.histplot(x="Survived", stat="percent", data=trdf) #This works

per_class0=round(((trdf.loc[:,"Survived"]==0).sum()/len(trdf))*100,2)
per_class1=round(((trdf.loc[:,"Survived"]==1).sum()/len(trdf))*100,2)

for i, fr in [(0,per_class0), (0.92,per_class1)]:
    plt.text(i, fr+0.1, str(fr))
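
The class percentages can also be obtained in one step with `value_counts(normalize=True)`. This is a minimal sketch on made-up labels; the `toy` frame is hypothetical and only stands in for `trdf`.

```python
import pandas as pd

# Made-up labels standing in for trdf["Survived"]
toy = pd.DataFrame({"Survived": [0, 0, 0, 1, 1]})

# normalize=True returns fractions instead of counts
class_pct = (toy["Survived"].value_counts(normalize=True) * 100).round(2)
print(class_pct)
```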

In [None]:
#We have seen the movie, so we suspect that more females survived than males.
#Let us examine the percentage of males and females who survived
per_males=100*( (trdf.loc[:,"Sex"]=="male") & (trdf.loc[:,"Survived"]==1) ).sum()/(trdf.loc[:,"Sex"]=="male").sum()
per_females=100*( (trdf.loc[:,"Sex"]=="female") & (trdf.loc[:,"Survived"]==1) ).sum()/(trdf.loc[:,"Sex"]=="female").sum()

plt.bar(x=["male", "female"], height=[per_males, per_females])
plt.ylabel("Survived(%)")

#write the value on top of each bar
for x,per in [(0,per_males), (0.92, per_females)]:
    plt.text(x,per+0.5, str(round(per,2)))

#Output shows that only about 19% of males survived, while this is almost 74% for females
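
Since Survived is coded as 0/1, its mean within a group is that group's survival rate, so the same numbers can be read off a `groupby`. A small sketch on a made-up frame (`toy` is hypothetical, not the real data):

```python
import pandas as pd

# Made-up passengers standing in for trdf
toy = pd.DataFrame({
    "Sex": ["male", "male", "male", "male", "female", "female"],
    "Survived": [0, 0, 0, 1, 1, 1],
})

# The mean of a 0/1 column is the fraction of ones, i.e. the survival rate
surv_pct_by_sex = toy.groupby("Sex")["Survived"].mean() * 100
print(surv_pct_by_sex)
```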

In [None]:
#Pclass might have played a role in survival, as priority among passengers of the same gender may have been based on it
#So let us examine Pclass among the females who did not survive.

#First overall Pclass distribution among females
ax=sns.countplot(x="Pclass", data=trdf.loc[ (trdf.loc[:,"Sex"]=='female'), :])
ax.set_ylabel('Pclass counts among females')

#Now Pclass among females who died
plt.figure()

ax=sns.countplot(x="Pclass", data=trdf.loc[ (trdf.loc[:,"Sex"]=='female') & (trdf.loc[:,"Survived"]==0), :])
ax.set_ylabel('Pclass counts among females who died')

counts=trdf.loc[ trdf.loc[:,"Sex"]=='female' ,["Sex", "Pclass"]].groupby("Pclass").count()

#Following is the other way 
counts=trdf.loc[ trdf.loc[:,"Sex"]=='female' ,["Sex", "Pclass"]].groupby("Pclass")["Sex"].count()
#other way ends

total_n_females_Pclasswise=counts.tolist()


not_survived_n_females_Pclasswise = trdf.loc[ (trdf.loc[:,"Sex"]=='female') & (trdf.loc[:, "Survived"]==0) ,["Sex", "Pclass"]].groupby("Pclass")["Sex"].count().tolist()

plt.figure()
height1=[round(i/j,2) for i, j in zip(not_survived_n_females_Pclasswise, total_n_females_Pclasswise)]

plt.bar(x=[1,2,3], height=height1)
plt.xlabel("Pclass")
plt.ylabel("Fraction of females who did not survive (Pclasswise)")

for i, h in zip([1,2,3],height1):
    plt.text(i,h+0.005,str(h))

#The output clearly shows that Pclass did matter
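
The two count lists that are zipped together line up only if every Pclass has at least one female who did not survive, because `groupby` leaves out empty groups. `pd.crosstab` keeps the classes aligned in that case. This is a sketch on made-up data; `toy` is a hypothetical stand-in for `trdf`.

```python
import pandas as pd

# Made-up passengers; no female in Pclass 1 died in this toy data
toy = pd.DataFrame({
    "Sex": ["female"] * 6 + ["male"] * 2,
    "Pclass": [1, 1, 2, 2, 3, 3, 3, 1],
    "Survived": [1, 1, 1, 0, 0, 0, 0, 1],
})

females = toy[toy["Sex"] == "female"]

# normalize="index" makes each row (one Pclass) sum to 1
rates = pd.crosstab(females["Pclass"], females["Survived"], normalize="index")

# Column 0 holds the fraction of females who did not survive in each Pclass
not_survived_rate = rates[0]
print(not_survived_rate)
```

Here Pclass 1 still appears, with a rate of 0, instead of silently dropping out of the result.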

In [None]:
#Let us try some visualization

#count plot Pclasswise survival
ax=sns.countplot(x="Pclass", hue="Survived", data=trdf) #cannot take both x and y
ax.set_ylabel("Survival Count")

#count plots for Pclasswise and Genderwise Survival
#catplot is a figure-level function that creates its own figure, so plt.figure() is not needed before it
sns.catplot(x="Pclass", hue="Survived", col="Sex", data=trdf, kind="count")

#Let us see barplot of Pclasswise and Genderwise Survival
sns.catplot(x="Pclass", y="Survived", hue="Sex", data=trdf, kind="bar")
#Another way for the same thing
plt.figure()
sns.barplot(x="Pclass", y="Survived", hue="Sex", data=trdf)

#Let us try to see relation between Survival and Age
plt.figure()
sns.regplot(x="Age", y="Survived", data=trdf)

In [None]:
#Impute missing age Salutationwise

#first let us create an attribute Salutation

#For training data
#the pattern is a raw string so that \. is passed to the regex engine unchanged
trdf["Salutation"]=trdf.loc[:, "Name"].str.extract(pat=r'([a-zA-Z]+)\.', expand=False)
print(trdf["Salutation"].unique())

#For testing data
tsdf["Salutation"]=tsdf.loc[:, "Name"].str.extract(pat=r'([a-zA-Z]+)\.', expand=False)

#Let us use below mapping to combine some of the common salutations
#Mlle (Mademoiselle) maps to Miss and Mme (Madame) maps to Mrs
mapping={'Mlle': 'Miss', 'Major': 'Mr', 'Col': 'Mr', 'Sir': 'Mr', 'Don': 'Mr', 'Mme': 'Mrs',
          'Jonkheer': 'Mr', 'Lady': 'Mrs', 'Capt': 'Mr', 'Countess': 'Mrs', 'Ms': 'Miss', 'Dona': 'Mrs'}

trdf.replace({"Salutation":mapping}, inplace=True)
tsdf.replace({"Salutation":mapping}, inplace=True)
print(trdf["Salutation"].unique())

#get the median age Salutationwise (computed on the training data)
temp=trdf.loc[:,["Age", "Salutation"]].groupby("Salutation")["Age"].median()

#do the imputation now: fill only the missing ages and leave the known ages untouched
for sal, med_age in temp.items():
    trdf.loc[ (trdf.loc[:,"Salutation"]==sal) & (trdf.loc[:,"Age"].isnull()) , "Age"]=med_age
    tsdf.loc[ (tsdf.loc[:,"Salutation"]==sal) & (tsdf.loc[:,"Age"].isnull()) , "Age"]=med_age
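
Salutation-wise imputation can also be written without an explicit loop, by mapping each row's salutation to its median age and filling only the gaps. The following is a minimal sketch on made-up data; `toy_train` and `toy_test` are hypothetical stand-ins for `trdf` and `tsdf`.

```python
import numpy as np
import pandas as pd

# Made-up stand-ins for trdf and tsdf
toy_train = pd.DataFrame({
    "Salutation": ["Mr", "Mr", "Mr", "Miss", "Miss", "Master"],
    "Age": [30.0, 40.0, np.nan, 20.0, np.nan, 4.0],
})
toy_test = pd.DataFrame({
    "Salutation": ["Mr", "Miss", "Master"],
    "Age": [np.nan, 18.0, np.nan],
})

# Medians are computed on the training data only (NaN ages are ignored)
median_age = toy_train.groupby("Salutation")["Age"].median()

# map() looks up the median for each row's salutation;
# fillna() uses that value only where Age is missing
toy_train["Age"] = toy_train["Age"].fillna(toy_train["Salutation"].map(median_age))
toy_test["Age"] = toy_test["Age"].fillna(toy_test["Salutation"].map(median_age))

print(toy_train)
print(toy_test)
```

Known ages, such as the 18-year-old Miss in the toy test set, stay as they are; only the missing entries receive the median.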

In [None]:
#Construct a new attribute, let us name it 'GenderPlus', with possible values female, boy and male
#We suspect that boys had a better chance of survival than adult males
trdf["GenderPlus"]=trdf["Sex"]
tsdf["GenderPlus"]=tsdf["Sex"]

trdf.loc[ trdf.loc[:,"Salutation"]=="Master", "GenderPlus"]="boy"
tsdf.loc[ tsdf.loc[:,"Salutation"]=="Master", "GenderPlus"]="boy"

In [None]:
#Compare the survival rate of boys vs adult males (printed as fractions between 0 and 1)
n_males_survived=trdf.loc[ (trdf.loc[:,"GenderPlus"]=='male') & (trdf.loc[:,"Survived"]==1) ,"GenderPlus"].count()
n_males=trdf.loc[ trdf.loc[:,"GenderPlus"]=='male' ,"GenderPlus"].count()
surv_perc_of_males= round(n_males_survived/n_males,2)
print(surv_perc_of_males)

n_boys_survived=trdf.loc[ (trdf.loc[:,"GenderPlus"]=='boy') & (trdf.loc[:,"Survived"]==1) ,"GenderPlus"].count()
n_boys=trdf.loc[ trdf.loc[:,"GenderPlus"]=='boy' ,"GenderPlus"].count()
surv_perc_of_boys= round(n_boys_survived/n_boys,2)
print(surv_perc_of_boys)

#From the output it is clear that boys had a better chance of survival than adult males, and therefore GenderPlus can be useful
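
Boys are a small group, so it helps to look at the group size next to the survival rate. The sketch below builds GenderPlus and summarizes it in one table; `toy` is a made-up frame, and the column names `survival_rate` and `n` are chosen here only for illustration.

```python
import pandas as pd

# Made-up passengers standing in for trdf
toy = pd.DataFrame({
    "Sex": ["male", "male", "male", "male", "female", "male"],
    "Salutation": ["Mr", "Mr", "Master", "Master", "Miss", "Mr"],
    "Survived": [0, 0, 1, 1, 1, 1],
})

# Start from Sex and relabel passengers titled "Master" as "boy"
toy["GenderPlus"] = toy["Sex"].mask(toy["Salutation"] == "Master", "boy")

# Survival rate and group size for each GenderPlus value
summary = toy.groupby("GenderPlus")["Survived"].agg(survival_rate="mean", n="count")
print(summary)
```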

In [None]:
#Construct Family Size, we may use it.
trdf["Family_Size"]=trdf["SibSp"]+trdf["Parch"]+1
tsdf["Family_Size"]=tsdf["SibSp"]+tsdf["Parch"]+1

In [None]:
#Construct FamilySurvivalRate
#The hypothesis is that if another member of a passenger's family survived, then that passenger's
#chance of survival is also higher

#First combine train and test
data=pd.concat([trdf,tsdf], axis=0, ignore_index=True)

#Add surname as it can help identify family

data["Surname"]=data.loc[:,"Name"].str.split(pat=",").str[0]
#print(type(data.loc[:,"Name"].str.split(pat=",")))
#print(data.loc[:,"Name"].str.split(pat=",").str[0])

#Add Family_Survival_Rate

data["FamilySurvivalRate"]=0.5

for grpid, grpdf in data.groupby("Surname"):
    if len(grpdf) > 1:
        for ind, row in grpdf.iterrows():
            smax=grpdf.drop(ind).loc[:, "Survived"].max()
            smin=grpdf.drop(ind).loc[:, "Survived"].min()
            pid=row["PassengerId"]
            
            if smax == 1:
                data.loc[ data.loc[:,"PassengerId"]==pid, "FamilySurvivalRate"]=1
            elif smin==0:
                data.loc[ data.loc[:,"PassengerId"]==pid, "FamilySurvivalRate"]=0
            
#Same ticket is also an indication of the same family
for grpid1, grpdf1 in data.groupby("Ticket"):
    if len(grpdf1) > 1:
        for ind, row in grpdf1.iterrows():
            smax=grpdf1.drop(ind).loc[:, "Survived"].max()
            smin=grpdf1.drop(ind).loc[:, "Survived"].min()
            pid=row["PassengerId"]
            
            if smax == 1:
                data.loc[ data.loc[:,"PassengerId"]==pid, "FamilySurvivalRate"]=1
            elif smin==0:
                data.loc[ data.loc[:,"PassengerId"]==pid, "FamilySurvivalRate"]=0         

#drop Salutation and Surname as they are no longer required
#data.drop(columns=["Salutation", "Surname"], inplace=True)          

#split back to trdf and tsdf

trdf=data.loc[0:890,:].copy()
tsdf=data.loc[891:,:].copy()

#label encoding of GenderPlus and Sex
#assign whole columns (df[col]=...) so that the encoded columns get an integer dtype;
#df.loc[:, col]=... can keep the old object dtype in recent pandas versions
lec=LabelEncoder()
lec.fit(trdf.loc[:,"GenderPlus"])
trdf["GenderPlus"]=lec.transform(trdf.loc[:,"GenderPlus"])
tsdf["GenderPlus"]=lec.transform(tsdf.loc[:,"GenderPlus"])

lec=LabelEncoder()
lec.fit(trdf.loc[:,"Sex"])
trdf["Sex"]=lec.transform(trdf.loc[:,"Sex"])
tsdf["Sex"]=lec.transform(tsdf.loc[:,"Sex"])
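
The Surname loop and the Ticket loop apply the same rule to different grouping keys, so the rule can be factored into one function. The sketch below is an illustration on made-up data: `group_survival_rate` is a hypothetical helper name, and NaN in Survived plays the role of the unlabelled test rows.

```python
import numpy as np
import pandas as pd

def group_survival_rate(df, key, rate_col="FamilySurvivalRate"):
    """Hypothetical helper. For each passenger, look at the OTHER members of the
    same `key` group: set the rate to 1 if any of them survived, else to 0 if any
    of them is known to have died; otherwise leave the current value unchanged."""
    df = df.copy()
    for _, grp in df.groupby(key):
        if len(grp) < 2:
            continue  # a passenger alone in the group has nobody to learn from
        for idx in grp.index:
            # max() and min() skip NaN, i.e. members whose outcome is unknown
            others = grp["Survived"].drop(idx)
            if others.max() == 1:
                df.loc[idx, rate_col] = 1.0
            elif others.min() == 0:
                df.loc[idx, rate_col] = 0.0
    return df

# Made-up passengers; NaN in Survived marks rows without a label
toy = pd.DataFrame({
    "Surname": ["Smith", "Smith", "Jones", "Jones", "Brown", "Lee", "Lee"],
    "Survived": [1, np.nan, 0, np.nan, 1, np.nan, np.nan],
})
toy["FamilySurvivalRate"] = 0.5  # neutral default when nothing is known
toy = group_survival_rate(toy, "Surname")
print(toy)
```

Calling the helper once with `"Surname"` and then once with `"Ticket"` on the combined frame would mirror the two loops. It assumes a unique row index, which `ignore_index=True` in the concatenation provides.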

In [None]:
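#There is a missing value in Fare in the test data. Let us impute it first with the mean training fare.
#Assign the result back instead of calling fillna(inplace=True) on a selected column, which may not update tsdf in recent pandas versions
tsdf["Fare"]=tsdf["Fare"].fillna(trdf["Fare"].mean())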

#Create Fare_Code column but before that create Fare_Bins
trdf["Fare_Bins"], bins=pd.qcut(trdf["Fare"], q=5, retbins=True)

lec=LabelEncoder()
lec.fit(trdf["Fare_Bins"])
trdf["Fare_Code"]=lec.transform(trdf["Fare_Bins"])

tsdf["Fare_Bins"]=pd.cut(tsdf["Fare"], bins=bins, include_lowest=True)

tsdf["Fare_Code"]=lec.transform(tsdf["Fare_Bins"])
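
A test fare larger than every training fare would fall outside all the bins and come out of `pd.cut` as NaN, a label the encoder never saw during `fit`. One way to guard against this, sketched below on made-up fares, is to ask pandas for integer bin codes directly with `labels=False` and to widen the outer bin edges. The variable names here are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up fares; 500 exceeds the largest training fare
train_fare = pd.Series([5.0, 7.0, 8.0, 10.0, 15.0, 30.0, 60.0, 100.0])
test_fare = pd.Series([5.0, 9.0, 100.0, 500.0])

# labels=False returns integer bin codes; retbins=True also returns the bin edges
train_code, edges = pd.qcut(train_fare, q=4, labels=False, retbins=True)

# Widen the outer edges so fares outside the training range still get a bin
edges[0], edges[-1] = -np.inf, np.inf
test_code = pd.cut(test_fare, bins=edges, labels=False)

print(train_code.tolist())
print(test_code.tolist())
```

This skips the separate LabelEncoder step, because the codes already follow the order of the bins.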


In [None]:
!pip install pycaret

In [None]:
trdf.loc[:, ["GenderPlus", "Pclass", "FamilySurvivalRate", "Family_Size", "Sex"]].info()

In [None]:
#Try pycaret.anomaly for anomaly detection

#import first
from pycaret.anomaly import *

#setup the experiment
#note: silent=True is accepted by PyCaret 2.x; newer releases (3.x) no longer accept it, so drop it there
exp_name=setup(data=trdf.loc[:, ["GenderPlus", "Pclass", "FamilySurvivalRate", "Family_Size", "Sex"]], silent=True, verbose=False)

#Choose the model for anomaly detection
pca=create_model('pca')

#apply the model to generate the Anomaly and Anomaly_Score columns
pca_df=assign_model(pca, transformation=True)

#check the info of the new dataframe
pca_df.info()

#collect the index labels of the rows flagged as anomalies into a list
anomaly_index=pca_df.loc[ pca_df["Anomaly"]==1].index.tolist()

#drop examples with anomaly and then check
pca_df.drop(anomaly_index, inplace=True)
pca_df.info()
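
To give some intuition for what a PCA-based detector looks for, the sketch below scores each point by its PCA reconstruction error using plain NumPy. This is a simplified illustration and not what PyCaret computes internally. The data are synthetic, and the 5% cut-off is an assumption made for this example.

```python
import numpy as np

# Synthetic 2-D data lying close to the line y = x
rng = np.random.default_rng(0)
x = rng.normal(size=50)
X = np.column_stack([x, x + rng.normal(scale=0.05, size=50)])
X[0] = [3.0, -3.0]  # planted outlier, far away from the line

# PCA via SVD on the centered data; keep only the first principal component
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
components = Vt[:1]

# Project onto the component and back; the residual length is the anomaly score
reconstructed = Xc @ components.T @ components
score = np.linalg.norm(Xc - reconstructed, axis=1)

# Flag the highest-scoring 5% of the points (assumed fraction)
threshold = np.quantile(score, 0.95)
is_anomaly = score > threshold

print(np.flatnonzero(is_anomaly))
```

Points that follow the main direction of the data reconstruct well and get a low score, while the planted outlier gets the highest one.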

In [None]:
#Train model with hyperparameter tuning

#Use only attributes which matter
#We tried to include Fare_Code but it reduced the accuracy marginally, so we won't use it
X_train=trdf.loc[:, ["GenderPlus", "Pclass", "FamilySurvivalRate", "Family_Size", "Sex"]]
#Survived became float when train was concatenated with the unlabelled test data, so cast it back to int
y_train=trdf.loc[:, "Survived"].astype(int)

#removing examples flagged as anomalies; however, even without removing them we got the same performance
X_train=X_train.drop(index=anomaly_index)
y_train=y_train.drop(index=anomaly_index)

X_test=tsdf.loc[:, ["GenderPlus", "Pclass", "FamilySurvivalRate", "Family_Size", "Sex"]]

#Hyper-parameter Grid for XGBoost
param_grid={
    'max_depth': range (2, 8, 1),
    'n_estimators': range(10, 100, 10),
    'learning_rate': [0.1, 0.01, 0.05, 0.001]
}

#Learn the classifier
clf=XGBClassifier(eval_metric='logloss')
clf_gs=GridSearchCV(clf, param_grid,cv=5)
clf_gs.fit(X_train,y_train)

print(clf_gs.best_params_)

#make prediction
predictions=clf_gs.predict(X_test)

'''
clf=RandomForestClassifier(n_estimators=110, max_depth= 8, max_features='auto',
                                   random_state=0, oob_score=False, min_samples_split = 2,
                                   criterion= 'gini', min_samples_leaf=2, bootstrap=False)
clf.fit(X_train,y_train)
predictions=clf.predict(X_test)'''

#write prediction in submission.csv
output=pd.read_csv('/kaggle/input/titanic/gender_submission.csv',header='infer')
output['Survived']=predictions.astype('int')
output.to_csv('submission.csv', index=False)
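
Before running a grid search it is useful to know how many models it will train. A small sketch that counts the combinations in the grid used above, with `cv=5` as in the `GridSearchCV` call:

```python
from itertools import product

# Same grid as in the training cell
param_grid = {
    'max_depth': range(2, 8, 1),
    'n_estimators': range(10, 100, 10),
    'learning_rate': [0.1, 0.01, 0.05, 0.001],
}

# Every combination of one value per hyperparameter is one candidate
n_candidates = len(list(product(*param_grid.values())))

# Each candidate is trained once per cross-validation fold
n_fits = n_candidates * 5

print(n_candidates, n_fits)  # 216 1080
```

By default GridSearchCV also refits the best candidate once on all the training data. After fitting, `clf_gs.best_score_` holds the mean cross-validated score of that candidate, which is worth printing next to `clf_gs.best_params_`.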