![Caption for the picture.](https://images.app.goo.gl/rcco9AayW5Xh6HWh9)

**Table of Contents**
    1. Import Data and Data Structure
        1.1 Import Data
        1.2 Overview of Data Structure
    2. Data Visualization and Missing Values
        2.1 Overview of Relation
        2.2 Visualization of Features
        2.3 Missing Values
    3. Feature Engineering and Statistical Analysis
        3.1 Features Generation
        3.2 Statistical Analysis and Feature Selection
    4. Modelling


 # 1. Import and Read Data

## 1.1 Import Data

In [None]:
# read the data
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
import matplotlib.pyplot as plt
%matplotlib inline
from mpl_toolkits import mplot3d
from matplotlib import cm
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error



file_path1 = "../input/titanic/train.csv"
file_path2= "../input/titanic/test.csv"
data = pd.read_csv(file_path1)
test_data=pd.read_csv(file_path2)
data.describe()

## 1.2 Overview of Data structure 
> Find out the data structure and type

In [None]:
# show the first five rows of the data
data.head()

In [None]:
# show the data structure and column types
data.info()

In [None]:
test_data.info()

There are 11 variables in total, which we can divide into 3 types: categorical, numeric and text variables:
### 1.2.1 Categorical Variable

* **Pclass**:
>         1=1st  
>         2=2nd  
>         3=3rd 


In [None]:
data.Pclass.value_counts()
#s1=data.groupby('Pclass').apply(lambda df: df.loc[df.Survived==0].Survived.value_counts())
#s2=data.groupby('Pclass').Survived.count()
s1=data.groupby('Pclass').apply(lambda df: df.Survived.value_counts()/len(df)) 
s2=data.groupby('Pclass').apply(lambda df: df.Survived.value_counts()) 
pd.concat([s2,s1],axis=1,keys=['Survived/Death count','Survived/Death Rate'])

The table above shows that the survival rate decreases as the 'Pclass' value increases: 0.629 > 0.472 > 0.242. Pclass=3 also has the highest number of deaths. Pclass will be an important feature for predicting Survived.
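
The same count and rate table can also be produced with `pd.crosstab`. The sketch below is only an illustration on a small made-up DataFrame (the `toy` values are assumptions, not rows of the Titanic file):

```python
import pandas as pd

# Toy data, made up for illustration (not the Titanic file)
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 2, 2, 3, 3, 3, 3, 3],
    "Survived": [1, 1, 0, 1, 0, 0, 0, 0, 1, 0],
})

# Counts per (Pclass, Survived) pair
counts = pd.crosstab(toy["Pclass"], toy["Survived"])

# Row-normalised version: each row sums to 1, so column 1 is the survival rate
rates = pd.crosstab(toy["Pclass"], toy["Survived"], normalize="index")

print(counts)
print(rates.round(3))
```

With `normalize="index"` each rate is computed within its own Pclass, which is the quantity compared in the paragraph above.
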
* **Sex**:
>         Male 
>         Female  

Most of the passengers are male; fewer than 50% are female.

In [None]:
data.Sex.value_counts()

* **Cabin**: 

A cabin number is a letter followed by a number. There are 147 different cabin numbers in the train dataset and 76 in the test dataset. We now group the cabin numbers by their first character.




In [None]:
data.Cabin.value_counts()
test_data.Cabin.value_counts()

In [None]:
data.Cabin.str[0].value_counts()
test_data.Cabin.str[0].value_counts()

All the cabin numbers start with one of these characters:
>         A 
>         B  
>         C 
>         D  
>         E 
>         F  
>         G 
>         T  

* Embarked (Port of Embarkation):
>         S 
>         C  
>         Q 

------------


In [None]:
# Embarked counts in the train dataset; value_counts() skips the missing values
s3=data.Embarked.value_counts()

pd.concat([s3,s3/s3.sum()],axis=1,keys=['count(Embarked)','portion'])

### Numeric Variables
* PassengerId (Uniquely define a passenger)

In [None]:
data.PassengerId.value_counts()

There are 891 passengers in the train dataset and each has a unique PassengerId. However, PassengerId is only a numeric identifier and its correlation with 'Survived' is very small (about -0.005), so we may leave this feature out of the prediction model.


* Age
* SibSp (# of siblings / spouses aboard the Titanic)
* Parch (# of parents / children aboard the Titanic)
* Fare (the ticket price)

-------
### Text Variable
* Name
* Ticket
-------
# 2. Data Visualization and Missing Values
## 2.1 Categorical Encoding
Apply categorical encoding to the categorical variables and visualize their relations


In [None]:
copy1=data.copy()
copy2=test_data.copy()


### Sex
* Convert Sex into categorical variable by applying Label-Encoding
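
Label encoding assigns integer codes to the class labels in sorted order. A minimal sketch on made-up values (the lists `train_sex` and `test_sex` are hypothetical) shows the mapping that is learned on the training values and then reused for the test values:

```python
from sklearn.preprocessing import LabelEncoder

# Made-up sample values, not the Titanic columns
train_sex = ["male", "female", "female", "male"]
test_sex = ["female", "male"]

encoder = LabelEncoder()
train_codes = encoder.fit_transform(train_sex)  # learn the mapping on the training values
test_codes = encoder.transform(test_sex)        # reuse the same mapping on the test values

# classes_ holds the labels in sorted order; the position of a label is its code
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)
```

Fitting once and calling `transform` on the test values keeps the codes consistent between the two datasets.
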

In [None]:
label_encoder=LabelEncoder()
copy1['new_Sex']=label_encoder.fit_transform(copy1['Sex'])
copy2['new_Sex']=label_encoder.transform(copy2['Sex'])

### Cabin
* Organize the Cabin data by extracting the first character (A, B, C, D, E, F, G, T)
* Treat it as a categorical variable

In [None]:
copy1['new_Cabin']=data['Cabin'].str[0]
s1=copy1.new_Cabin.value_counts()
# portion among the non-null Cabin values
s2=s1/s1.sum()
pd.concat([s1,s2],axis=1,keys=['count(new_Cabin)','portion'])

In [None]:
# add the same column to the copy of the test dataset
copy2['new_Cabin']=test_data['Cabin'].str[0]
s3=copy2.new_Cabin.value_counts()
pd.concat([s3,s3/s3.sum()],axis=1,keys=['count(new_Cabin)','portion'])

* new_Cabin='C' has the highest portion in both datasets
* Treat new_Cabin as a categorical variable by applying label encoding
* However, 'new_Cabin' contains missing values (null)
* Treat all missing values as new_Cabin='Z' for now and leave this problem to the next part

In [None]:
copy1['new_Cabin']=copy1['new_Cabin'].fillna("Z")
copy2['new_Cabin']=copy2.new_Cabin.fillna("Z")

label_encoder=LabelEncoder()
copy1['new_Cabin']=label_encoder.fit_transform(copy1['new_Cabin'])
copy2['new_Cabin']=label_encoder.transform(copy2['new_Cabin'])


### Embarked
* Embarked also contains missing values
* Treat all missing values as Embarked='N' for now
* Convert Embarked into integers by applying the label encoder

In [None]:
copy1['new_Embarked'] = copy1['Embarked'].fillna("N")

label_encoder=LabelEncoder()
copy1['new_Embarked']=label_encoder.fit_transform(copy1['new_Embarked'])



----
## 2.2 Overview of Relation
Visualize the relations between features with a pairs plot and a correlation heatmap

* Apply the pairs plot

In [None]:
# bw_method replaces the deprecated 'bw' keyword of the KDE on the diagonal
sns.pairplot(data,hue='Pclass', diag_kws={'bw_method':0.1}, palette="husl")

* Apply the correlation Heatmap

In [None]:
# the correlation matrix
features=['PassengerId','Survived','Pclass','new_Sex','Age','SibSp','Parch','Fare','new_Cabin','new_Embarked']
corr=copy1[features].corr() 
sns.set(style="white")
plt.figure(figsize=(11,7))
# mask the upper triangle (np.bool was removed from NumPy; the builtin bool works)
mask=np.triu(np.ones_like(corr,dtype=bool))
# annot=True displays the correlation value in each cell
sns.heatmap(corr,annot=True,mask=mask,cmap='RdYlBu',linewidths=0.6)

* Some of the features are highly correlated with Survived
* Apply the pairs plot on those features

------
## 2.2 Visualization of Features
Some of the features are highly correlated with survival rate or with each other. Data visualization can help us find out the pattern behind them.

### Gender, Age and Survived
* First, consider the age structure of each gender
* The age structures of male and female passengers are similar

In [None]:
s1=copy1.loc[copy1.Sex=='female'].Age.describe()
s2=copy1.loc[copy1.Sex=='male'].Age.describe()
pd.concat([s1,s2],axis=1,keys=['Age|Sex=female','Age|Sex=male'])

* The correlation Heatmap shows that Gender is highly correlated to 'Survived'
* Calculate the survival rate for men and women

In [None]:
s1=copy1.groupby('Sex').apply(lambda df: df.Survived.value_counts())
s2=copy1.groupby('Sex').apply(lambda df: df.Survived.value_counts()/len(df))
s3=pd.concat([s1,s2],axis=1,keys=['count(Survival/Death)','Survival/Death rate'])
s3

In [None]:
sns.set(style="whitegrid")
ax1=sns.barplot(y=s3.index, x =s3['count(Survival/Death)'],linewidth=2.5,facecolor=(1,1,1,0),errcolor="1", edgecolor=".1")
# the bars follow the row order of s3
ylabels = ['(female, survived)','(female, Dead)', '(male, Dead)', '(male, survived)']
ax1.set_yticklabels(ylabels)
rates=s3['Survival/Death rate']
# annotate each bar with the rate within its Sex group
for i,p in enumerate(ax1.patches):
    label = rates.iloc[i]*100
    plt.text(-36+p.get_width(), p.get_y()+0.55*p.get_height(),
             '{:1.2f}%'.format(label),
             ha='center', va='center')


* The survival rate of female passengers is much higher than that of male passengers
* Consider Sex vs. Age vs. Survival

In [None]:
sns.set(style="darkgrid")
f, (ax1,ax2) = plt.subplots(figsize=(18,7),ncols=2)
s1= copy1.loc[copy1.Sex=='female']
ax1 = sns.distplot(a=s1.Age,bins=34, kde=False, 
                  hist_kws={"rwidth":1,'edgecolor':'black', 'alpha':1.0},color='azure',label="Age",ax=ax1)
s2= s1.loc[(s1.Survived==1)]
ax1 = sns.distplot(a=s2.Age,bins=34, kde=False, 
                  hist_kws={"rwidth":1,'edgecolor':'black', 'alpha':1.0},color='cyan',ax=ax1)
#ax1.set_title("Histogram of Survival, female")
ax1.legend(['total count', 'Survived count'])
ax1.set_title("Histogram of Survival, female")


s3= copy1.loc[copy1.Sex=='male']
ax2 = sns.distplot(a=s3.Age,bins=34, kde=False, 
                  hist_kws={"rwidth":1,'edgecolor':'black', 'alpha':1.0},color='lavender',label="Age",ax=ax2)
# survivors among the male passengers (s3), not the female subset s1
s4= s3.loc[(s3.Survived==1)]
ax2 = sns.distplot(a=s4.Age,bins=34, kde=False, 
                  hist_kws={"rwidth":1,'edgecolor':'black', 'alpha':1.0},color='orchid',ax=ax2)
ax2.legend(['total count', 'Survived count'])
ax2.set_title("Histogram of Survival, male")

* The survivor share among women (cyan area) is much larger than the survivor share among men (orchid area)
* In nearly every age range, the survival rate of women is much higher than that of men
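
The comparison read from the histograms can also be expressed as a table of survival rates per sex and age band. The following is a sketch under stated assumptions: the `toy` rows are made up and the bin edges in `bins` are an illustrative choice, not taken from the plots above.

```python
import pandas as pd

# Made-up rows for illustration (not the Titanic data)
toy = pd.DataFrame({
    "Sex":      ["female", "female", "female", "male", "male", "male", "male", "female"],
    "Age":      [4, 25, 40, 5, 30, 45, 60, 62],
    "Survived": [1, 1, 1, 1, 0, 0, 0, 1],
})

# Assumed age bands; pd.cut builds right-closed intervals such as (0, 18]
bins = [0, 18, 35, 55, 80]
toy["age_group"] = pd.cut(toy["Age"], bins=bins)

# The mean of a 0/1 column is the survival rate within each group
rate = (
    toy.groupby(["Sex", "age_group"], observed=True)["Survived"]
       .mean()
       .unstack("Sex")
)
print(rate)
```

Each row of `rate` is one age band and each column one sex, so the two rates can be compared band by band.
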

### Parch and SibSp vs. Age
* Parch is highly correlated with SibSp
* Both Parch and SibSp are negatively correlated with Age
* Group the dataset by pairs of (x=Parch, y=SibSp) values and compare the average age of each group
* Ignore the rows with a missing 'Age' in this part

In [None]:
copy3=copy1.loc[copy1.Age.notnull()]

In [None]:
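# X: Parch values (0..6), Y: SibSp values (0..5)
X=[0.0,1.0,2.0,3.0,4.0,5.0,6.0]
Y=[0,1,2,3,4,5]
def f(x,y):
    # mean Age of the passengers with Parch==x and SibSp==y (NaN if the group is empty)
    a=copy3.loc[(copy3.Parch==x)&(copy3.SibSp==y)].Age.mean()
    return a
X,Y=np.meshgrid(X,Y)
# Z[i][j] is the mean Age for SibSp=Y[i][j] and Parch=X[i][j]
Z=np.zeros((6, 7))
for i in range(6):
    for j in range(7):
        Z[i][j]=f(X[i][j],Y[i][j])
# rows 4 and 5 are overwritten with the pooled mean Age of all passengers with SibSp>=4
a= copy3.loc[(copy3.SibSp>=4)].Age.mean()
for i in range(2):
    for j in range(7):
        Z[i+4][j]=a

# overwrite selected cells by hand with the value of a neighbouring cell in the same row
Z[0][6]=Z[0][5]
Z[2][4]=Z[2][5]=Z[2][6]=Z[2][3]
Z[3][3]=Z[3][4]=Z[3][5]=Z[3][6]=Z[3][2]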


* 3D surface plot and contour plot to visualize the relation among SibSp, Parch and Age
* Z represents the average Age for certain pairs of (SibSp,Parch) value

In [None]:
#Z = np.cos(X ** 2 + Y ** 2)
fig= plt.figure(figsize=(10,6))
#ax = plt.axes(projection='3d')
ax = fig.add_subplot(111, projection='3d')
surf=ax.plot_surface(X, Y, Z,cmap='viridis', edgecolor='none')
fig.colorbar(surf, ax=ax, shrink=0.5, aspect=5)
ax.set_title('Average Age for SibSp and Parch')
ax.set_xlim(0, 6);
ax.set_ylim(5, 0);
ax.set_xlabel('Parch')
ax.set_ylabel('SibSp')
ax.set_zlabel('Age.mean()');
#plt.show()
plt.show()

In [None]:

fig = plt.figure(figsize=(8,6))
ax = fig.add_subplot()

cset = ax.contourf(X, Y, Z)
plt.colorbar(cset)

ax.set_title('contour plot');
plt.show()

* Larger SibSp and Parch values imply a lower average Age
* Young people are more likely to be accompanied by family members
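
A grid of group means like the Z matrix used for the surface plot can also be obtained in one call with `pivot_table`. This is a sketch on made-up rows (the `toy` values are assumptions); combinations with no known Age stay NaN instead of being filled by hand.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration; the last row has a missing Age
toy = pd.DataFrame({
    "SibSp": [0, 0, 1, 1, 1, 2, 0],
    "Parch": [0, 0, 0, 1, 1, 2, 1],
    "Age":   [30.0, 40.0, 28.0, 10.0, 6.0, 4.0, np.nan],
})

# Rows: SibSp, columns: Parch, cells: mean Age of that combination
grid = toy.pivot_table(index="SibSp", columns="Parch", values="Age", aggfunc="mean")
print(grid)
```

How to treat the empty cells (leave them as NaN, or borrow a neighbouring value as the notebook does) remains a modelling choice.
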


### Age, Family Size, Fare and Survive

* Age and Family size vs Survive
* Divide all the people into several Age group: 0~18, 18~35, 35~55, 55~80

In [None]:
# work on a copy to avoid pandas' SettingWithCopyWarning
copy3=copy1[['Age','Fare','Survived','SibSp','Parch']].copy()
copy3['family_size']=copy3['SibSp']+copy3['Parch']
def age_group(p):
    # a missing Age fails every comparison, so the function returns None for it
    if p<=18: return '0~18'
    if p<35: return '18~35'
    if p<55: return '35~55'
    if p>=55: return '55~80'
copy3['age_group']=copy3.Age.map(age_group)

In [None]:

sns.set(style="darkgrid")
#sns.set(style="darkgrid")
sns.catplot(x='family_size',col='age_group', col_wrap=2,data=copy3,hue="Survived", kind="count",height=8, aspect=.8,palette=['aqua','pink'])
#sns.catplot(x='Age',y='SibSp',data=copy3,height=8, aspect=.7)

* Visualize Age, Fare,Family size and Survived

In [None]:
# create a single figure that holds the 3D axes
fig = plt.figure(figsize=(11,6))
ax = fig.add_subplot(111, projection='3d')
X=copy1['SibSp']+copy1['Parch']
Y=copy1.Age
Z=copy1.Fare
g=ax.scatter(X, Y, Z,  c= copy1.Survived,marker='o',cmap='cool')
plt.colorbar(g)
ax.set_zlim(0, 250);
ax.set_xlabel('Family size')
ax.set_ylabel('Age')
ax.set_zlabel('Fare');
plt.show()

* On average, the purple dots (survivors) have a higher Fare than the blue dots
* Passengers who paid a higher Fare were more likely to survive

### Pclass, Fare vs. Survived
* First, consider Pclass and Fare which are negatively correlated
* Fare distribution given a certain Pclass

In [None]:
s1=copy1.loc[copy1.Pclass==1].Fare.describe()
s2=copy1.loc[copy1.Pclass==2].Fare.describe()
s3=copy1.loc[copy1.Pclass==3].Fare.describe()
pd.concat([s1,s2,s3],axis=1,keys=['Fare|Pclass=1','Fare|Pclass=2','Fare|Pclass=3'])

In [None]:
sns.set(style="darkgrid")
fig, ax1 = plt.subplots(figsize=(16,5))
ax=sns.kdeplot(data=copy1.loc[copy1.Pclass==1]['Fare'],shade=True,color='red')
ax.set_title("Fare distribution|Pclass")
ax=sns.kdeplot(data=copy1.loc[copy1.Pclass==2]['Fare'],shade=True,color='blue')
ax=sns.kdeplot(data=copy1.loc[copy1.Pclass==3]['Fare'],shade=True,color='purple')
ax.legend(["Fare|Pclass=1","Fare|Pclass=2","Fare|Pclass=3"])


* According to the distribution plot and table above, we can roughly say that Fare|Pclass=1 > Fare|Pclass=2 > Fare|Pclass=3 on average.
* Fare|Pclass=1 also has a larger standard deviation than the other Pclass groups
* This implies that the higher the class (the smaller the Pclass value), the more expensive the ticket is on average

In [None]:
fig, ax1 = plt.subplots(figsize=(12,15))
sns.swarmplot(x=copy1.Survived,y=copy1.Fare,hue=copy1.Pclass, palette='cool')

In [None]:
sns.catplot(x="Survived",y="Fare",hue="Pclass",data=copy1,kind='violin',height=8, aspect=2, palette=['violet','turquoise','tomato'])

### Sex, Pclass, Fare and Survived

In [None]:
sns.set(style="darkgrid")
sns.catplot(x="Survived",y="Fare",col="Pclass",data=copy1 ,hue="Sex",height=8, aspect=.7, palette=['violet','turquoise'])

* Passengers with a higher Fare and in a higher class were more likely to survive 

### Pclass,Age,Sex vs Survived

In [None]:
g = sns.FacetGrid(copy1,height=5, col="Pclass", row="Sex", margin_titles=True, hue = "Survived" )
g = g.map(sns.distplot, "Age",kde=False,bins=15,hist_kws={"rwidth":1,'edgecolor':'black', 'alpha':1.0}).add_legend();

### Embarked, Fare and Pclass
* Embarked and Pclass

In [None]:
s0=copy1.Embarked.value_counts()
s1=copy1.loc[copy1.Pclass==1].Embarked.value_counts()
s2=copy1.loc[copy1.Pclass==2].Embarked.value_counts()
s3=copy1.loc[copy1.Pclass==3].Embarked.value_counts()
pd.concat([s0,s1/s0*100,s2/s0*100,s3/s0*100],axis=1,keys=['Total','Embarked|Pclass=1 (%)','Embarked|Pclass=2 (%)','Embarked|Pclass=3 (%)'])

In [None]:
plt.figure(figsize=(12,12))
sns.boxplot(x=copy1.Embarked,y=copy1.Fare,hue=copy1.Pclass,palette='cool')

### Embarked, Pclass and Fare vs. Survived
* Embarked and Survived

In [None]:
plt.figure(figsize=(12,12))
sns.swarmplot(x=copy1.Embarked,hue=copy1.Survived,y=copy1.Fare,palette='spring')

* Passengers who embarked at C have the highest survival rate, ahead of Q and S
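
A swarm plot shows individual passengers rather than rates, so the comparison between ports is easier to verify with a group mean. The sketch below uses made-up rows (the `toy` values are assumptions and its numbers are not the Titanic rates):

```python
import pandas as pd

# Made-up rows for illustration (not the Titanic data)
toy = pd.DataFrame({
    "Embarked": ["S", "S", "S", "S", "C", "C", "Q", "Q"],
    "Survived": [0, 0, 1, 0, 1, 1, 0, 1],
})

# Passenger count, number of survivors and survival rate per port
by_port = toy.groupby("Embarked")["Survived"].agg(
    count="count", survived="sum", rate="mean"
)
print(by_port.sort_values("rate", ascending=False))
```

Reporting the count next to the rate helps to judge how much weight a small group such as one port deserves.
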


In [None]:
sns.set(style="darkgrid")
sns.catplot(x="Survived",col="Pclass",data=copy1 ,hue="Embarked",kind='count',height=8, aspect=.7, palette=['gold','orangered','brown'])

------
## 2.3  Missing Values
There are 891 entries in the train dataset and 418 entries in the test dataset. However, Age, Cabin and Embarked contain null values in the train dataset, and Age, Fare and Cabin contain null values in the test dataset. This part looks for a suitable way to handle each of them.

### Embarked
* According to the correlation heatmap, Embarked is highly correlated with Sex, Fare and Pclass
* We will fill in the missing values based on these features

In [None]:
copy1.loc[copy1.Embarked.isnull()]

* There are only 2 missing values in Embarked
* They share the same Sex, Pclass and Fare

In [None]:
copy1.loc[(copy1.Pclass==1) & (copy1.Sex=='female')].Embarked.value_counts()

* Passengers with Pclass=1 and Sex=female most likely embarked at S or C
* Let's look at the Fare summary for (Pclass=1 & Sex='female')

In [None]:
s1=copy1.loc[(copy1.Pclass==1) & (copy1.Sex=='female')]
s2=s1.loc[s1.Embarked=='S'].Fare.describe()
s3=s1.loc[s1.Embarked=='C'].Fare.describe()
pd.concat([s2,s3],axis=1,keys=['Fare|Embarked=S', 'Fare|Embarked=C'])

* The mean Fare is about 99 for Embarked='S' and about 115 for Embarked='C'
* Both passengers with a missing Embarked value paid a Fare of 80, and 99 is much closer to 80 than 115 is
* Assign Embarked='S' to these 2 missing values

* Refill the missing values and rebuild new_Embarked for both the training and testing datasets
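
The reasoning above (pick the port whose mean Fare is closest to the passenger's Fare) can be sketched as a small rule. The `toy` rows are made up so that the group means come out at 99 and 115; this illustrates the idea and is not the notebook's own imputation code.

```python
import pandas as pd

# Made-up rows: the last passenger has no Embarked value and paid 80
toy = pd.DataFrame({
    "Embarked": ["S", "S", "S", "C", "C", None],
    "Fare":     [90.0, 100.0, 107.0, 110.0, 120.0, 80.0],
})

# Mean Fare per port, computed on the rows where the port is known
known = toy[toy["Embarked"].notna()]
mean_fare = known.groupby("Embarked")["Fare"].mean()

# For each missing row, take the port whose mean Fare has the smallest absolute gap
missing = toy["Embarked"].isna()
toy.loc[missing, "Embarked"] = toy.loc[missing, "Fare"].map(
    lambda fare: (mean_fare - fare).abs().idxmin()
)
print(toy)
```

With only two missing values the manual assignment in the notebook is just as practical; a rule like this mainly documents the criterion.
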

In [None]:
copy1['Embarked']=copy1.Embarked.fillna('S')

In [None]:
label_encoder=LabelEncoder()
copy1['new_Embarked']=label_encoder.fit_transform(copy1['Embarked'])
copy2['new_Embarked']=label_encoder.transform(copy2['Embarked'])


### Cabin
* Cabin is highly correlated to Pclass, Fare, Age and Sex


In [None]:
# percentage of null Cabin values in each dataset
print('Null percentage of Cabin in training dataset : ',copy1.Cabin.isnull().mean()*100,'%')
print('Null percentage of Cabin in testing dataset : ',copy2.Cabin.isnull().mean()*100,'%')

* About 77% of the Cabin values are null in the training dataset and about 78% in the testing dataset
* The Cabin column should be dropped
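
The same check can be run for every column at once, which makes the decision to drop a column explicit. In this sketch the `toy` frame and the 50% `threshold` are assumptions chosen for illustration:

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration
toy = pd.DataFrame({
    "Cabin": ["C85", np.nan, np.nan, "E46", np.nan],
    "Age":   [22.0, np.nan, 26.0, 35.0, 54.0],
})

# The mean of a boolean mask is the fraction of nulls in each column
null_pct = toy.isnull().mean() * 100

threshold = 50  # hypothetical cut-off, in percent
to_drop = null_pct[null_pct > threshold].index.tolist()
print(null_pct)
print("columns to drop:", to_drop)
```
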

In [None]:
copy1=copy1.drop(['Cabin','new_Cabin'],axis=1)
copy2=copy2.drop(['Cabin','new_Cabin'],axis=1)

### Age
* Age is highly correlated with Pclass, SibSp and Parch
* We can predict the missing values from these features
* LogisticRegression is used in this part, with Age cast to an integer so that each age acts as a class label

In [None]:


# training dataset
feature=['Pclass','SibSp','Parch']
copy3=copy1.loc[copy1.Age.notnull()]
x_train=copy3[feature]
#convert y_train to integer
y_train=copy3.Age.astype(int)

# prediction
copy4=copy1.loc[copy1.Age.isnull()]
x_test=copy4[feature]

log=LogisticRegression()
log.fit(x_train,y_train)
y_pred = log.predict(x_test)


In [None]:
# assign the new values to Age column
copy1.loc[copy1.Age.isnull(), "Age"] = y_pred

In [None]:
# Testing dataset: reuse the Age model fitted on the training data
x_test=copy2.loc[copy2.Age.isnull()][feature]
y_pred=log.predict(x_test)
copy2.loc[copy2.Age.isnull(), "Age"] = y_pred

### Fare
* There is one missing entry in the Fare column of the testing dataset
* Fare is highly correlated with Pclass, new_Embarked, new_Sex, SibSp and Parch
* Apply a random forest regressor to predict it

In [None]:
copy2.loc[copy2.Fare.isnull()]

In [None]:
# training dataset: rows of the test set where Fare is known
feature=['Pclass','SibSp','Parch','new_Sex','new_Embarked']
copy3=copy2.loc[copy2.Fare.notnull()]
x_train=copy3[feature]
# Fare is continuous, so it is used as-is as the regression target
y_train=copy3.Fare

# prediction
copy4=copy2.loc[copy2.Fare.isnull()]
x_test=copy4[feature]

forest_model=RandomForestRegressor(random_state=1)
forest_model.fit(x_train,y_train)
y_pred = forest_model.predict(x_test)



In [None]:
copy2.loc[copy2.Fare.isnull(), "Fare"] = y_pred

-----
# 3. Feature Engineering and Statistical Analysis
In this part, we create new features through feature engineering, then use statistical analysis to select the useful ones for the prediction model.

## 3.1 Features Generation
### Name
* Passenger names also contain a title
* Extract the title from the name and treat it as a categorical variable
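
One possible way to extract the title is a regular expression that captures the text between the comma and the first period. This is a sketch, assuming names follow the "Surname, Title. Given names" pattern; the three sample names are in the style of the dataset.

```python
import pandas as pd

# A few names in the "Surname, Title. Given names" style
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
])

# Capture everything after the comma (and optional spaces) up to the first period
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
print(titles.tolist())
```

Compared with chained `split` calls, the pattern states the rule in one place and yields NaN for names that do not match, instead of raising an error.
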

In [None]:
# keep the text between the comma and the first period, e.g. "Braund, Mr. Owen Harris" -> "Mr"
copy1['Title']=[n.split('.')[0].split(',')[1].strip() for n in copy1.Name]
copy2['Title']=[n.split('.')[0].split(',')[1].strip() for n in copy2.Name]



* List all the titles and their counts:

In [None]:
pd.concat([copy1,copy2]).Title.value_counts()
#copy1.Title.value_counts()

* Convert the 'Title' column into a categorical variable
* There are 18 different titles in the whole dataset, and the train dataset contains only 17 of them
* Fit the label encoder on the concatenation of the train and test datasets
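
The reason for fitting on the concatenation is that `LabelEncoder.transform` raises a `ValueError` for labels it has not seen during fitting. A minimal sketch with made-up title lists (`train_titles` and `test_titles` are hypothetical):

```python
from sklearn.preprocessing import LabelEncoder

# Made-up lists: "Dona" occurs only in the test list
train_titles = ["Mr", "Mrs", "Miss", "Mr"]
test_titles = ["Mr", "Dona"]

# Fitting on the training titles alone fails on the unseen label
encoder = LabelEncoder().fit(train_titles)
try:
    encoder.transform(test_titles)
    unseen_failed = False
except ValueError:
    unseen_failed = True

# Fitting on the union gives every title a code
encoder_all = LabelEncoder().fit(train_titles + test_titles)
test_codes = encoder_all.transform(test_titles)
print(unseen_failed, list(encoder_all.classes_), list(test_codes))
```
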

In [None]:
label_encoder=LabelEncoder()
# fit on train and test together so that every title has a code
label_encoder.fit(pd.concat([copy1,copy2])['Title'])
copy1['new_Title']=label_encoder.transform(copy1['Title'])
copy2['new_Title']=label_encoder.transform(copy2['Title'])

### Ticket
* Create a feature from the string length of Ticket

In [None]:
copy1['Ticket_length']=[len(i) for i in copy1.Ticket]
copy2['Ticket_length']=[len(i) for i in copy2.Ticket]

### SibSp & Parch
* SibSp and Parch both describe the number of family members aboard
* Set a new feature Famsize = SibSp + Parch

In [None]:
copy1['Famsize']=copy1['SibSp']+copy1['Parch']
copy2['Famsize']=copy2['SibSp']+copy2['Parch']

## 3.2 Statistical Analysis and Feature Selection
* Draw the Correlation Heatmap based on the features in hand

In [None]:
# the correlation matrix (numeric columns only; text columns such as Name cannot be correlated)
corr=copy1.select_dtypes('number').corr() 
sns.set(style="white")
plt.figure(figsize=(11,7))
# mask the upper triangle (np.bool was removed from NumPy; the builtin bool works)
mask=np.triu(np.ones_like(corr,dtype=bool))
# annot=True displays the correlation value in each cell
sns.heatmap(corr,annot=True,mask=mask,cmap='jet',linewidths=0.6)

* Some of the features have almost no correlation with Survived (e.g. PassengerId)
* The correlations of [PassengerId, Age, SibSp, Parch, Ticket_length, Famsize] with Survived are all less than 0.1 in magnitude
* Apply Pearson's chi-squared test of independence to get a better insight
-----
#### Pearson's Chi-Squared Test
* For a given factor, the null hypothesis is that Survived is independent of that factor, i.e. the factor does not help to predict Survived.
* Use the scipy.stats library to calculate the p-value and compare it with alpha=0.05. If the p-value is larger than alpha, the null hypothesis is not rejected and the factor is not significant at the 5% level. If the p-value is less than alpha, the null hypothesis is rejected and the factor is considered significant.
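
The decision rule can be tried on two small contingency tables before applying it to the real features. The counts below are made up: one table with a clear association between the factor and the outcome, and one where the rows are identical.

```python
from scipy.stats import chi2_contingency

alpha = 0.05

# Made-up 2x2 tables: rows are factor levels, columns are outcome counts
tables = {
    "dependent":   [[90, 10], [20, 80]],
    "independent": [[50, 50], [50, 50]],
}

p_values = {}
for name, table in tables.items():
    stat, p, dof, expected = chi2_contingency(table)
    p_values[name] = p
    decision = "reject H0" if p < alpha else "fail to reject H0"
    print(f"{name}: chi2={stat:.2f}, p={p:.3g}, dof={dof} -> {decision}")
```

Only the first table leads to rejecting the null hypothesis of independence, which is the behaviour the rule above relies on.
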

In [None]:
# chi-squared test of independence between each factor and Survived
# compare each p-value with alpha=0.05
from scipy.stats import chi2_contingency

features= ['PassengerId', 'Pclass','Age','SibSp','Parch','Fare','Embarked','new_Sex','new_Embarked','new_Title','Ticket_length','Famsize']
for feature in features:
    # contingency table: levels of the factor x Survived
    table = pd.crosstab(copy1[feature], copy1['Survived'], margins=False)
    stat, p, dof, expected = chi2_contingency(table)
    print("The p-value of", feature,"is: ",p)
    
    

* The p-value of PassengerId is larger than 0.05.
* PassengerId is not significant at the 5% level and can be dropped
* Also, the object-type features [Name, Sex, Ticket, Embarked, Title] should be dropped
* Famsize is the sum of SibSp and Parch, so they carry overlapping information and we only need either Famsize or the pair (SibSp, Parch).
* Famsize is much more significant than SibSp and Parch, so SibSp and Parch are dropped

In [None]:
copy1=copy1.drop(['PassengerId','Name','Sex','Embarked','Ticket','Title','SibSp','Parch'],axis=1)

-----
# 4. Modelling
Apply several prediction models and find the best of them

## 4.1 Building Models
* Divide the training dataset into 2 groups 
* One group is used for training and the other (the validation dataset) for testing the model
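
A minimal sketch of such a hold-out split, on synthetic data from `make_classification` rather than the Titanic features (an assumption made so that the block is self-contained). It also shows that, for hard 0/1 predictions, the mean absolute error equals one minus the accuracy:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic binary data standing in for the prepared features
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# 80% of the rows for training, 20% held out for validation
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, train_size=0.8, test_size=0.2, random_state=0
)

clf = LogisticRegression().fit(X_tr, y_tr)
pred = clf.predict(X_va)

acc = accuracy_score(y_va, pred)
mae = mean_absolute_error(y_va, pred)
print(f"accuracy={acc:.3f}, MAE={mae:.3f}")
```

For a regressor the MAE is computed on continuous predictions, so it is not directly comparable with the misclassification rate of a classifier.
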

In [None]:

y=copy1['Survived']
copy3=copy1.drop(['Survived'],axis=1)
X_train,X_valid,y_train,y_valid = train_test_split(copy3,y,train_size=0.8,test_size=0.2,random_state=0)

### Logistic Regression

In [None]:
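l1=LogisticRegression()
l1.fit(X_train,y_train)
# predict with the model fitted above (l1), not the Age model `log`
y_pred=l1.predict(X_valid)
# for 0/1 labels the mean absolute error is the misclassification rate
mean_absolute_error(y_valid, y_pred)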

### Random Forest Regressor

In [None]:
forest_model=RandomForestRegressor(random_state=1)
forest_model.fit(X_train,y_train)
y_pred=forest_model.predict(X_valid)
mean_absolute_error(y_valid, y_pred)

### ROC Curve
* Let the sample size be a+b+c+d
* Let Y be the true value and $\hat{Y}$ the predicted value

In [None]:
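# 2x2 layout of the counts a, b, c, d (rows: true value, columns: prediction)
s1=pd.Series(['a','c'], index=['Y=1','Y=0'])
s2=pd.Series(['b','d'], index=['Y=1','Y=0'])
# raw strings keep the backslash in \hat from being read as an escape sequence
pd.concat([s1,s2],axis=1,keys=[r'$\hat{Y}$=1', r'$\hat{Y}$=0'])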

Then we can calculate the sensitivity and specificity:
* **Sensitivity**: $P(\hat{Y}=1|Y=1) = \frac{a}{a+b}$
* **Specificity**: $P(\hat{Y}=0|Y=0) = \frac{d}{c+d}$
* The ROC curve plots **sensitivity** on the y-axis against **1-specificity** on the x-axis. 
* The concordance index c is the area under the ROC curve. The bigger c is, the better the model.
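
These quantities can be computed with scikit-learn. The sketch below uses made-up labels and scores (`y_true`, `y_score`) and a 0.5 threshold, all of which are assumptions for illustration; the names a, b, c, d follow the table above.

```python
from sklearn.metrics import confusion_matrix, roc_auc_score

# Made-up true labels and predicted probabilities
y_true  = [1, 1, 1, 1, 0, 0, 0, 0]
y_score = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]
y_hat   = [1 if s >= 0.5 else 0 for s in y_score]

# Rows are true labels, columns are predictions, both ordered [0, 1],
# so the matrix reads [[d, c], [b, a]] in the notation above
(d, c), (b, a) = confusion_matrix(y_true, y_hat, labels=[0, 1])

sensitivity = a / (a + b)
specificity = d / (c + d)

# Area under the ROC curve (the concordance index), from the scores
auc = roc_auc_score(y_true, y_score)
print(sensitivity, specificity, auc)
```

Sensitivity and specificity depend on the chosen threshold, while the area under the curve summarises the ranking quality over all thresholds.
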

AIC , BIC and ROC curve

In [None]:
#train_data.dropna(axis=0, subset['Survived'],inplace=True)
from sklearn.model_selection import train_test_split
data.dropna(axis=0, subset=['Survived'], inplace=True)
y=data['Survived']
data.drop('Survived',axis=1,inplace=True)

# Break off into validation and training dataset
X_train,X_valid,y_train,y_valid = train_test_split(data,y,train_size=0.8,test_size=0.2,random_state=0)

In [None]:
# A function to calculate the mean absolute error
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

def score_dataset(X_train, X_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)

In [None]:
# Method 1: drop the columns that contain missing values
cols_with_missing = [col for col in X_train.columns
                    if X_train[col].isnull().any()]
# keep only numeric columns: the random forest cannot be fitted on text columns such as Name
reduced_X_train = X_train.drop(cols_with_missing,axis=1).select_dtypes(include='number')
reduced_X_valid = X_valid.drop(cols_with_missing,axis=1).select_dtypes(include='number')

print("MAE for dropping the missing value columns: ", score_dataset(reduced_X_train,reduced_X_valid, y_train, y_valid))