The Challenge
The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

#### In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (i.e. name, age, gender, socio-economic class, etc.).

What Data Will I Use in This Competition?
In this competition, you’ll gain access to two similar datasets that include passenger information like name, age, gender, socio-economic class, etc. One dataset is titled `train.csv` and the other is titled `test.csv`.

The `train.csv` dataset contains the details of a subset of the passengers on board (891 to be exact) and, importantly, reveals whether they survived or not, also known as the “ground truth”.

The `test.csv` dataset contains similar information but does not disclose the “ground truth” for each passenger. It’s your job to predict these outcomes.

Using the patterns you find in the `train.csv` data, predict whether the other 418 passengers on board (found in `test.csv`) survived.

Check out the “Data” tab to explore the datasets even further. Once you feel you’ve created a competitive model, submit it to Kaggle to see where your model stands on our leaderboard against other Kagglers.

In [527]:
import pandas as pd
train_data = pd.read_csv('titanic/train.csv')
test_data = pd.read_csv('titanic/test.csv')

### Data description:

#### PassengerId - Passenger ID
#### Survived - Survival (0 = No; 1 = Yes). Not included in test.csv file.
#### Pclass - Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
#### Name - Name
#### Sex - Sex
#### Age - Age
#### SibSp - Number of Siblings/Spouses Aboard
#### Parch - Number of Parents/Children Aboard
#### Ticket - Ticket Number
#### Fare - Passenger Fare
#### Cabin - Cabin
#### Embarked - Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)


In [528]:
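def data_information(data):
    """Print dtypes, shape, column names and the fraction of nulls per column."""
    print("Data Types: ")
    print(data.dtypes)
    print("Row and Columns: ")
    print(data.shape)
    print("Columns Name: ")
    print(data.columns)
    print("Null Values: ")
    # Mean of the boolean null mask = fraction of missing values in each column
    print(data.isnull().mean())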

data_information(train_data)

Data Types: 
PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object
Row and Columns: 
(891, 12)
Columns Name: 
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')
Null Values: 
PassengerId    0.000000
Survived       0.000000
Pclass         0.000000
Name           0.000000
Sex            0.000000
Age            0.198653
SibSp          0.000000
Parch          0.000000
Ticket         0.000000
Fare           0.000000
Cabin          0.771044
Embarked       0.002245
dtype: float64


In [529]:
train_data

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [530]:
train_data.isnull().values.any()

True

In [531]:
train_data.isnull().sum().sum()

866

In [532]:
train_data.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [533]:
train_data.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [534]:
train_data.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

In [535]:
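# Drop every row that has at least one missing value, then renumber the index.
# Because Cabin is missing for most passengers, only 183 of the 891 rows remain.
train_data.dropna(how='any', inplace = True)
train_data.reset_index(drop=True, inplace=True)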

In [536]:
# Same filter on the test set; predictions can then only be made for the rows that remain
test_data.dropna(how='any', inplace = True)
test_data.reset_index(drop=True, inplace=True)
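
Dropping every incomplete row keeps 183 of the 891 training rows and also shrinks the test set. A minimal sketch of one alternative, assuming a small hypothetical frame with the same column names: fill `Age` with the median, fill `Embarked` with the most frequent value, and keep a flag for whether `Cabin` was recorded. The `Has_cabin` name is an illustration, not a column used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with the same kinds of gaps as train.csv
sample = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, "C123"],
    "Embarked": ["S", "C", np.nan, "S"],
})

filled = sample.copy()
# Numeric gap: use the median of the observed ages
filled["Age"] = filled["Age"].fillna(filled["Age"].median())
# Categorical gap: use the most frequent port
filled["Embarked"] = filled["Embarked"].fillna(filled["Embarked"].mode()[0])
# Mostly-missing column: keep a flag instead of dropping rows
filled["Has_cabin"] = filled["Cabin"].notna().astype(int)

print(filled)
```

In a train/test setting, the fill values would normally be computed from the training data and reused on the test data.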

In [537]:
train_data

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
1,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
2,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
3,11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,PP 9549,16.7000,G6,S
4,12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.5500,C103,S
...,...,...,...,...,...,...,...,...,...,...,...,...
178,872,1,1,"Beckwith, Mrs. Richard Leonard (Sallie Monypeny)",female,47.0,1,1,11751,52.5542,D35,S
179,873,0,1,"Carlsson, Mr. Frans Olof",male,33.0,0,0,695,5.0000,B51 B53 B55,S
180,880,1,1,"Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)",female,56.0,0,1,11767,83.1583,C50,C
181,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S


In [538]:
train_data['Cabin'].describe()

count             183
unique            133
top       C23 C25 C27
freq                4
Name: Cabin, dtype: object

In [539]:
import re

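# Ordinal code for each deck letter; "U" stands for an unknown deck
deck = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F": 6, "G": 7, "U": 8}

data = [train_data, test_data]

for item in data:
    # Rows with a missing Cabin were already dropped above, so this fill is only a safeguard
    item['Cabin'] = item['Cabin'].fillna("U0")
    # The deck is the leading letter of the cabin code, e.g. "C85" -> "C"
    item['Deck'] = item['Cabin'].map(lambda x: re.search(r"[a-zA-Z]+", x).group())
    item['Deck'] = item['Deck'].map(deck)
    # Letters that are not in the mapping become 0
    item['Deck'] = item['Deck'].fillna(0)
    item['Deck'] = item['Deck'].astype(int)
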

train_data = train_data.drop(['Cabin'], axis=1)
test_data = test_data.drop(['Cabin'], axis=1)
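
The deck extraction above can also be written with vectorised string methods. This is a sketch on hypothetical cabin values (including a missing entry and a letter outside the mapping), not a replacement for the cell above.

```python
import pandas as pd

deck = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F": 6, "G": 7, "U": 8}

# Hypothetical cabin values, including a missing one and an unmapped deck letter
cabins = pd.Series(["C85", "B51 B53 B55", None, "G6", "T"])

# First run of letters in the cabin code, with "U" standing in for unknown
letters = cabins.fillna("U0").str.extract(r"([A-Za-z]+)", expand=False)
# Letters missing from the mapping (such as "T") fall back to 0
deck_codes = letters.map(deck).fillna(0).astype(int)

print(deck_codes.tolist())
```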

In [540]:
train_data.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Embarked        object
Deck             int64
dtype: object

In [541]:
train_data

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked,Deck
0,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C,3
1,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,S,3
2,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,S,5
3,11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,PP 9549,16.7000,S,7
4,12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.5500,S,3
...,...,...,...,...,...,...,...,...,...,...,...,...
178,872,1,1,"Beckwith, Mrs. Richard Leonard (Sallie Monypeny)",female,47.0,1,1,11751,52.5542,S,4
179,873,0,1,"Carlsson, Mr. Frans Olof",male,33.0,0,0,695,5.0000,S,2
180,880,1,1,"Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)",female,56.0,0,1,11767,83.1583,C,3
181,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,S,2


In [542]:
data = [train_data, test_data]
titles = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}

for item in data:
    # The title is the word followed by a period, e.g. "Mr" in "Braund, Mr. Owen Harris"
    item['P_Title'] = item.Name.str.extract(r' ([A-Za-z]+)\.', expand=False)

    # Group uncommon titles together and normalise alternative spellings
    item['P_Title'] = item['P_Title'].replace(['Lady', 'Countess', 'Capt', 'Col', 'Don', 'Dr',
                                               'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
    item['P_Title'] = item['P_Title'].replace('Mlle', 'Miss')
    item['P_Title'] = item['P_Title'].replace('Ms', 'Miss')
    item['P_Title'] = item['P_Title'].replace('Mme', 'Mrs')

    # Convert each title to its numeric code
    item['P_Title'] = item['P_Title'].map(titles)

train_data = train_data.drop(['Name'], axis=1)
test_data = test_data.drop(['Name'], axis=1)
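
To see what the title step produces, here is a small sketch run on three names taken from the training table shown earlier. It uses the same `titles` mapping and groups only `Rev` into `Rare` for brevity.

```python
import pandas as pd

titles = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Montvila, Rev. Juozas",
])

# Raw string pattern: a space, the title letters, then a literal period
extracted = names.str.extract(r" ([A-Za-z]+)\.", expand=False)
# "Rev" is one of the uncommon titles that the notebook groups as "Rare"
grouped = extracted.replace(["Rev"], "Rare")
codes = grouped.map(titles)

print(pd.DataFrame({"name": names, "title": extracted, "code": codes}))
```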

In [543]:
train_data

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked,Deck,P_Title
0,2,1,1,female,38.0,1,0,PC 17599,71.2833,C,3,3
1,4,1,1,female,35.0,1,0,113803,53.1000,S,3,3
2,7,0,1,male,54.0,0,0,17463,51.8625,S,5,1
3,11,1,3,female,4.0,1,1,PP 9549,16.7000,S,7,2
4,12,1,1,female,58.0,0,0,113783,26.5500,S,3,2
...,...,...,...,...,...,...,...,...,...,...,...,...
178,872,1,1,female,47.0,1,1,11751,52.5542,S,4,3
179,873,0,1,male,33.0,0,0,695,5.0000,S,2,1
180,880,1,1,female,56.0,0,1,11767,83.1583,C,3,3
181,888,1,1,female,19.0,0,0,112053,30.0000,S,2,2


In [544]:

p_embarked = pd.get_dummies(train_data.Embarked)
train_data = pd.concat([train_data, p_embarked], axis=1)

In [545]:
p_sex = pd.get_dummies(train_data.Sex)
train_data = pd.concat([train_data, p_sex], axis=1)



In [546]:
p_class_cat = pd.get_dummies(train_data.Pclass)
train_data=pd.concat([train_data, p_class_cat], axis=1)
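# Caution: class 3 is named 'Second_Class3' here but 'Third_Class3' in test_data;
# the feature names of the two frames should match before modelling
train_data.rename(columns={1:'Frist_Class',2:'Second_Class2',3:'Second_Class3'}, inplace=True)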



In [547]:
p_embarked = pd.get_dummies(test_data.Embarked)
test_data = pd.concat([test_data, p_embarked], axis=1)
p_sex = pd.get_dummies(test_data.Sex)
test_data = pd.concat([test_data, p_sex], axis=1)
p_class_cat = pd.get_dummies(test_data.Pclass)
test_data=pd.concat([test_data, p_class_cat], axis=1)
test_data.rename(columns={1:'Frist_Class',2:'Second_Class2',3:'Third_Class3'}, inplace=True)
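
The train and test frames are encoded in separate cells, and the third-class column ends up as `Second_Class3` in one and `Third_Class3` in the other. A sketch of one way to keep the two frames aligned, assuming hypothetical three-row and two-row slices: let `pd.get_dummies` generate prefixed names, then reindex the test frame to the training columns.

```python
import pandas as pd

# Hypothetical slices; the test slice happens to contain no third-class passenger
train = pd.DataFrame({"Pclass": [1, 2, 3], "Sex": ["male", "female", "female"]})
test = pd.DataFrame({"Pclass": [1, 2], "Sex": ["female", "male"]})

cols = ["Pclass", "Sex"]
train_enc = pd.get_dummies(train, columns=cols, dtype=int)
test_enc = pd.get_dummies(test, columns=cols, dtype=int)

# Give the test frame exactly the training columns, filling absent ones with 0
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(train_enc.columns))
print(test_enc)
```

With this approach the column names come from the data (`Pclass_1`, `Sex_female`, and so on), so there is no manual rename step to keep in sync.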


In [548]:
train_data

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked,Deck,P_Title,C,Q,S,female,male,Frist_Class,Second_Class2,Second_Class3
0,2,1,1,female,38.0,1,0,PC 17599,71.2833,C,3,3,1,0,0,1,0,1,0,0
1,4,1,1,female,35.0,1,0,113803,53.1000,S,3,3,0,0,1,1,0,1,0,0
2,7,0,1,male,54.0,0,0,17463,51.8625,S,5,1,0,0,1,0,1,1,0,0
3,11,1,3,female,4.0,1,1,PP 9549,16.7000,S,7,2,0,0,1,1,0,0,0,1
4,12,1,1,female,58.0,0,0,113783,26.5500,S,3,2,0,0,1,1,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
178,872,1,1,female,47.0,1,1,11751,52.5542,S,4,3,0,0,1,1,0,1,0,0
179,873,0,1,male,33.0,0,0,695,5.0000,S,2,1,0,0,1,0,1,1,0,0
180,880,1,1,female,56.0,0,1,11767,83.1583,C,3,3,1,0,0,1,0,1,0,0
181,888,1,1,female,19.0,0,0,112053,30.0000,S,2,2,0,0,1,1,0,1,0,0


In [549]:
train_data = train_data.drop(['Sex'], axis=1)
test_data = test_data.drop(['Sex'], axis=1)
train_data = train_data.drop(['Pclass'], axis=1)
test_data = test_data.drop(['Pclass'], axis=1)
train_data = train_data.drop(['Embarked'], axis=1)
test_data = test_data.drop(['Embarked'], axis=1)

In [550]:
train_data.head()

Unnamed: 0,PassengerId,Survived,Age,SibSp,Parch,Ticket,Fare,Deck,P_Title,C,Q,S,female,male,Frist_Class,Second_Class2,Second_Class3
0,2,1,38.0,1,0,PC 17599,71.2833,3,3,1,0,0,1,0,1,0,0
1,4,1,35.0,1,0,113803,53.1,3,3,0,0,1,1,0,1,0,0
2,7,0,54.0,0,0,17463,51.8625,5,1,0,0,1,0,1,1,0,0
3,11,1,4.0,1,1,PP 9549,16.7,7,2,0,0,1,1,0,0,0,1
4,12,1,58.0,0,0,113783,26.55,3,2,0,0,1,1,0,1,0,0


In [551]:
data = [train_data, test_data]
for item in data:
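    # Total number of relatives aboard
    item['Family_Member'] = item['SibSp'] + item['Parch']
    # Note: despite its name, With_family is 1 for passengers with no relatives aboard, 0 otherwise
    item.loc[item['Family_Member'] > 0, 'With_family'] = 0
    item.loc[item['Family_Member'] == 0, 'With_family'] = 1
    item['With_family'] = item['With_family'].astype(int)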
    
    
train_data

Unnamed: 0,PassengerId,Survived,Age,SibSp,Parch,Ticket,Fare,Deck,P_Title,C,Q,S,female,male,Frist_Class,Second_Class2,Second_Class3,Family_Member,With_family
0,2,1,38.0,1,0,PC 17599,71.2833,3,3,1,0,0,1,0,1,0,0,1,0
1,4,1,35.0,1,0,113803,53.1000,3,3,0,0,1,1,0,1,0,0,1,0
2,7,0,54.0,0,0,17463,51.8625,5,1,0,0,1,0,1,1,0,0,0,1
3,11,1,4.0,1,1,PP 9549,16.7000,7,2,0,0,1,1,0,0,0,1,2,0
4,12,1,58.0,0,0,113783,26.5500,3,2,0,0,1,1,0,1,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
178,872,1,47.0,1,1,11751,52.5542,4,3,0,0,1,1,0,1,0,0,2,0
179,873,0,33.0,0,0,695,5.0000,2,1,0,0,1,0,1,1,0,0,0,1
180,880,1,56.0,0,1,11767,83.1583,3,3,1,0,0,1,0,1,0,0,1,0
181,888,1,19.0,0,0,112053,30.0000,2,2,0,0,1,1,0,1,0,0,0,1
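
In the table above, `With_family` is 1 exactly where `Family_Member` is 0, so the flag marks passengers travelling alone. A short sketch of the same computation with a name that matches that meaning (`Is_alone` is a hypothetical column name):

```python
import pandas as pd

sample = pd.DataFrame({"SibSp": [1, 0, 1], "Parch": [0, 0, 1]})

sample["Family_Member"] = sample["SibSp"] + sample["Parch"]
# Hypothetical column name: 1 when the passenger has no relatives aboard
sample["Is_alone"] = (sample["Family_Member"] == 0).astype(int)

print(sample)
```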


In [552]:
train_data['Age'].value_counts()

36.00    11
24.00     9
35.00     6
19.00     6
31.00     5
         ..
62.00     1
64.00     1
14.00     1
63.00     1
0.92      1
Name: Age, Length: 63, dtype: int64

In [553]:
data = [train_data, test_data]

for item in data:
    # Truncate ages to whole years, then replace each age with an ordinal group code
    item['Age'] = item['Age'].astype(int)
    item.loc[item['Age'] <= 11, 'Age'] = 0
    item.loc[(item['Age'] > 11) & (item['Age'] <= 18), 'Age'] = 1
    item.loc[(item['Age'] > 18) & (item['Age'] <= 22), 'Age'] = 2
    item.loc[(item['Age'] > 22) & (item['Age'] <= 27), 'Age'] = 3
    item.loc[(item['Age'] > 27) & (item['Age'] <= 33), 'Age'] = 4
    item.loc[(item['Age'] > 33) & (item['Age'] <= 40), 'Age'] = 5
    # Everyone older than 40 falls into group 6 (the original split at 66 assigned 6 on both sides)
    item.loc[item['Age'] > 40, 'Age'] = 6
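
The same grouping can be expressed in one call with `pd.cut`. This sketch reuses the cut points from the loop above on a handful of ages that appear in the training data; it is an equivalent formulation, not the code that produced the outputs in this notebook.

```python
import numpy as np
import pandas as pd

ages = pd.Series([0.92, 4.0, 19.0, 38.0, 54.0, 80.0])

# Same edges as the loop above; intervals are closed on the right, e.g. (11, 18]
edges = [-np.inf, 11, 18, 22, 27, 33, 40, np.inf]
# labels=False returns the integer position of the bin each age falls into
age_group = pd.cut(ages.astype(int), bins=edges, labels=False)

print(age_group.tolist())
```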


In [554]:
train_data

Unnamed: 0,PassengerId,Survived,Age,SibSp,Parch,Ticket,Fare,Deck,P_Title,C,Q,S,female,male,Frist_Class,Second_Class2,Second_Class3,Family_Member,With_family
0,2,1,5,1,0,PC 17599,71.2833,3,3,1,0,0,1,0,1,0,0,1,0
1,4,1,5,1,0,113803,53.1000,3,3,0,0,1,1,0,1,0,0,1,0
2,7,0,6,0,0,17463,51.8625,5,1,0,0,1,0,1,1,0,0,0,1
3,11,1,0,1,1,PP 9549,16.7000,7,2,0,0,1,1,0,0,0,1,2,0
4,12,1,6,0,0,113783,26.5500,3,2,0,0,1,1,0,1,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
178,872,1,6,1,1,11751,52.5542,4,3,0,0,1,1,0,1,0,0,2,0
179,873,0,4,0,0,695,5.0000,2,1,0,0,1,0,1,1,0,0,0,1
180,880,1,6,0,1,11767,83.1583,3,3,1,0,0,1,0,1,0,0,1,0
181,888,1,2,0,0,112053,30.0000,2,2,0,0,1,1,0,1,0,0,0,1


In [555]:
train_data = train_data.drop(['SibSp'], axis=1)
test_data = test_data.drop(['SibSp'], axis=1)
train_data = train_data.drop(['Parch'], axis=1)
test_data = test_data.drop(['Parch'], axis=1)

In [556]:
train_data 

Unnamed: 0,PassengerId,Survived,Age,Ticket,Fare,Deck,P_Title,C,Q,S,female,male,Frist_Class,Second_Class2,Second_Class3,Family_Member,With_family
0,2,1,5,PC 17599,71.2833,3,3,1,0,0,1,0,1,0,0,1,0
1,4,1,5,113803,53.1000,3,3,0,0,1,1,0,1,0,0,1,0
2,7,0,6,17463,51.8625,5,1,0,0,1,0,1,1,0,0,0,1
3,11,1,0,PP 9549,16.7000,7,2,0,0,1,1,0,0,0,1,2,0
4,12,1,6,113783,26.5500,3,2,0,0,1,1,0,1,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
178,872,1,6,11751,52.5542,4,3,0,0,1,1,0,1,0,0,2,0
179,873,0,4,695,5.0000,2,1,0,0,1,0,1,1,0,0,0,1
180,880,1,6,11767,83.1583,3,3,1,0,0,1,0,1,0,0,1,0
181,888,1,2,112053,30.0000,2,2,0,0,1,1,0,1,0,0,0,1


In [557]:
train_data['Fare'].describe()

count    183.000000
mean      78.682469
std       76.347843
min        0.000000
25%       29.700000
50%       57.000000
75%       90.000000
max      512.329200
Name: Fare, dtype: float64

In [558]:
pd.qcut(train_data.Fare, q=4) 

0        (57.0, 90.0]
1        (29.7, 57.0]
2        (29.7, 57.0]
3      (-0.001, 29.7]
4      (-0.001, 29.7]
            ...      
178      (29.7, 57.0]
179    (-0.001, 29.7]
180      (57.0, 90.0]
181      (29.7, 57.0]
182      (29.7, 57.0]
Name: Fare, Length: 183, dtype: category
Categories (4, interval[float64]): [(-0.001, 29.7] < (29.7, 57.0] < (57.0, 90.0] < (90.0, 512.329]]

In [559]:
data = [train_data, test_data]

for item in data:
    # Hand-picked cut points; they differ from the quartiles reported by pd.qcut above
    item.loc[item['Fare'] <= 3.0, 'Fare'] = 0
    item.loc[(item['Fare'] > 3.0) & (item['Fare'] <= 5.0), 'Fare'] = 1
    item.loc[(item['Fare'] > 5.0) & (item['Fare'] <= 6.0), 'Fare'] = 2
    item.loc[(item['Fare'] > 6.0) & (item['Fare'] <= 151.55), 'Fare'] = 3
    item.loc[item['Fare'] > 151.55, 'Fare'] = 4
    item['Fare'] = item['Fare'].astype(int)
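
The `pd.qcut` call earlier shows quartile boundaries but its result is not used for the binning. A sketch of how quartile bins could be applied to both frames, assuming hypothetical fare values: learn the edges on the training fares with `retbins=True`, then reuse them on the test fares with `pd.cut`.

```python
import numpy as np
import pandas as pd

# Hypothetical fares standing in for train_data.Fare and test_data.Fare
train_fare = pd.Series([5.0, 16.7, 26.55, 30.0, 51.86, 53.1, 71.28, 83.16])
test_fare = pd.Series([7.25, 60.0, 512.33])

# Learn quartile edges from the training fares only
train_bin, edges = pd.qcut(train_fare, q=4, labels=False, retbins=True)

# Open the outer edges so unseen test fares still land in a bin
open_edges = [-np.inf, *edges[1:-1], np.inf]
test_bin = pd.cut(test_fare, bins=open_edges, labels=False)

print(train_bin.tolist())
print(test_bin.tolist())
```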

In [560]:
test_data
    


Unnamed: 0,PassengerId,Age,Ticket,Fare,Deck,P_Title,C,Q,S,female,male,Frist_Class,Second_Class2,Third_Class3,Family_Member,With_family
0,904,3,21228,3,2,3,0,0,1,1,0,1,0,0,1,0
1,906,6,W.E.P. 5734,3,5,3,0,0,1,1,0,1,0,0,1,0
2,916,6,PC 17608,4,2,3,1,0,0,1,0,1,0,0,4,0
3,918,2,113509,3,2,2,1,0,0,1,0,1,0,0,1,0
4,920,6,113054,3,1,1,0,0,1,0,1,1,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
82,1296,6,17765,3,4,1,1,0,0,0,1,1,0,0,1,0
83,1297,2,SC/PARIS 2166,3,4,1,1,0,0,0,1,0,1,0,0,1
84,1299,6,113503,4,3,1,1,0,0,0,1,1,0,0,2,0
85,1303,5,19928,3,3,3,0,1,0,1,0,1,0,0,1,0


In [561]:
PassengerId = test_data['PassengerId']


In [562]:
train_data = train_data.drop(['Ticket'], axis=1)
test_data = test_data.drop(['Ticket'], axis=1)

train_data = train_data.drop(['PassengerId'], axis=1)
test_data = test_data.drop(['PassengerId'], axis=1)

In [563]:
train_data.dtypes

Survived         int64
Age              int64
Fare             int64
Deck             int64
P_Title          int64
C                uint8
Q                uint8
S                uint8
female           uint8
male             uint8
Frist_Class      uint8
Second_Class2    uint8
Second_Class3    uint8
Family_Member    int64
With_family      int64
dtype: object

In [564]:
test_data.dtypes

Age              int64
Fare             int64
Deck             int64
P_Title          int64
C                uint8
Q                uint8
S                uint8
female           uint8
male             uint8
Frist_Class      uint8
Second_Class2    uint8
Third_Class3     uint8
Family_Member    int64
With_family      int64
dtype: object

In [565]:
# from sklearn.preprocessing import StandardScaler

# scaler = StandardScaler()
# train_s_data = pd.DataFrame(data = train_data)
# train_s_data = scaler.fit_transform(train_s_data)

In [566]:
# from sklearn.preprocessing import StandardScaler



# scaler = StandardScaler()
# test_s_data = pd.DataFrame(data = test_data)
# test_s_data = scaler.fit_transform(test_s_data)
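
The commented-out cells above would call `fit_transform` on the whole `train_data` frame, which still contains `Survived`, and would fit a second scaler on the test frame. Since KNN is distance-based, scaling can matter; a sketch of one common arrangement is a pipeline that fits the scaler on the training features only and reuses it at prediction time. The arrays below are synthetic stand-ins, not Titanic data.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the feature matrices, with very different column scales
rng = np.random.default_rng(0)
X_train_demo = rng.normal(size=(60, 4)) * [1, 10, 100, 1000]
y_train_demo = (X_train_demo[:, 0] > 0).astype(int)
X_test_demo = rng.normal(size=(10, 4)) * [1, 10, 100, 1000]

# The scaler is fitted on the training features only, then reused on the test set
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_train_demo, y_train_demo)
pred = model.predict(X_test_demo)

print(pred)
```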

In [567]:
X_train = train_data.drop('Survived', axis=1)
Y_train = train_data['Survived']
X_test = test_data.copy()

In [568]:
print(type(X_train))
print(type(Y_train))
print(type(X_test))

<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.frame.DataFrame'>


In [569]:
print(X_train.shape)
print(Y_train.shape)
print(X_test.shape)

(183, 14)
(183,)
(87, 14)


## K-nearest neighbors (KNN) Classification
### scikit-learn 4-step modeling pattern

In [570]:
# step 1: import the class

from sklearn.neighbors import KNeighborsClassifier

In [571]:
# step 2: instantiate the estimator
# "estimator" is scikit-learn's term for a model
# "instantiate" means "make an instance of"

clf = KNeighborsClassifier(n_neighbors=5)

In [572]:
print(clf)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')


In [573]:
# step 3: fit the model with data
# the model learns the relationship between X and y
# fitting happens in place: the estimator object itself is updated

clf.fit(X_train,Y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')

In [574]:
clf.predict(X_test)

array([1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1,
       1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0,
       1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1,
       0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1])

## Logistic regression

In [575]:
# import the class
from sklearn.linear_model import LogisticRegression

# instantiate the model (using default parameters)

logreg = LogisticRegression()

# fit the model with data
logreg.fit(X_train,Y_train)

# predict the response values for the observations in X_test
logreg.predict(X_test)

# compute classification accuracy on the training data
# (this measures fit to rows the model has already seen, not performance on new passengers)

acc_log = round(logreg.score(X_train, Y_train) * 100, 2)
acc_log

79.78
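
The 79.78 above is accuracy on the same rows the model was fitted on. A sketch of how an out-of-sample estimate could be obtained with 5-fold cross-validation; the data here is synthetic (`make_classification`), and `max_iter=1000` is an assumption added to avoid convergence warnings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for X_train / Y_train (not Titanic data)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

model = LogisticRegression(max_iter=1000)
# Each of the 5 scores is measured on a fold the model did not see during fitting
scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")

print(scores.round(3), scores.mean().round(3))
```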

## KNN (K=5)

In [576]:
from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train,Y_train)
knn_prediction = clf.predict(X_test)

# accuracy measured on the training data itself
acc_knn = round(clf.score(X_train, Y_train) * 100, 2)

print("Knn Prediction of Test Data Where N_neighbors = 5\n" , knn_prediction)
print("Training Accuracy Score" , acc_knn)


Knn Prediction of Test Data Where N_neighbors = 5
 [1 1 1 1 0 0 1 0 1 0 0 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 1 1 0 1 1 0 1 1 0 0 0
 1 1 1 1 1 1 0 1 0 1 1 0 0 1 0 0 1 0 1 0 0 1 1 1 1 1 1 1 1 0 0 1 1 0 1 1 1
 0 1 0 1 1 1 1 1 0 1 0 1 1]
Training Accuracy Score 81.97


## KNN (K=1)

In [577]:
from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier(n_neighbors=1)
clf.fit(X_train,Y_train)

y_pred = clf.predict(X_test)
# With K=1 each training row is its own nearest neighbour, so training accuracy is inflated
acc_knn = round(clf.score(X_train, Y_train) * 100, 2)
print("Knn Prediction of Test Data Where N_neighbors = 1\n" , y_pred)
print("Training Accuracy Score" , acc_knn)


Knn Prediction of Test Data Where N_neighbors = 1
 [1 1 1 1 1 0 1 0 1 0 1 0 1 1 0 0 1 1 0 1 0 0 1 1 0 1 1 1 0 1 1 0 1 1 0 0 1
 1 1 0 1 1 1 0 1 0 1 1 0 1 1 0 0 1 0 1 0 0 1 1 1 1 0 0 0 1 1 1 1 1 0 1 1 1
 0 1 0 1 1 1 1 1 1 1 0 1 1]
Training Accuracy Score 94.54
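
Both KNN scores are computed on the rows the model was fitted on, which favours small K: with K=1 every training row is its own nearest neighbour. A sketch of comparing K values by cross-validated accuracy instead, on synthetic data rather than the Titanic features:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for X_train / Y_train (not Titanic data)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Mean 5-fold accuracy for several candidate values of K
cv_accuracy = {}
for k in (1, 3, 5, 7, 9):
    knn = KNeighborsClassifier(n_neighbors=k)
    cv_accuracy[k] = cross_val_score(knn, X_demo, y_demo, cv=5).mean()

best_k = max(cv_accuracy, key=cv_accuracy.get)
print(cv_accuracy, best_k)

# For contrast: training accuracy at K=1 on rows with distinct feature values
train_acc_k1 = KNeighborsClassifier(n_neighbors=1).fit(X_demo, y_demo).score(X_demo, y_demo)
print(train_acc_k1)
```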


In [578]:
# Restore the IDs saved earlier and move them to the first column
test_data['PassengerId'] = PassengerId
test_data = test_data[['PassengerId', 'Age','Fare','Deck','P_Title','C','Q','S','female','male','Frist_Class',
                      'Second_Class2','Third_Class3','Family_Member','With_family']]


# Append the K=1 predictions (y_pred) as the last column and save the whole frame
test_data.insert(test_data.shape[1], 'Survived', y_pred)
test_data.to_csv('KNeighborsClassifier.csv')
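
The file written above contains every feature column plus the row index. For a Kaggle submission the expected layout is, as far as the competition's sample file goes, two columns (`PassengerId`, `Survived`) with one row per test passenger. A sketch of building that layout, with hypothetical IDs and predictions standing in for `PassengerId` and `y_pred`:

```python
import pandas as pd

# Hypothetical IDs and predictions standing in for PassengerId and y_pred
passenger_ids = pd.Series([904, 906, 916], name="PassengerId")
predictions = [1, 1, 0]

submission = pd.DataFrame({"PassengerId": passenger_ids, "Survived": predictions})
# index=False keeps the pandas row index out of the file; no path returns the CSV as text
csv_text = submission.to_csv(index=False)

print(csv_text)
```

Passing a filename to `to_csv` instead would write the same content to disk. Note that rows dropped from `test_data` earlier have no prediction, so a complete submission would need the missing values handled rather than dropped.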


In [588]:
pre_data = pd.read_csv('KNeighborsClassifier.csv',index_col=[0])

In [589]:
pre_data

Unnamed: 0,PassengerId,Age,Fare,Deck,P_Title,C,Q,S,female,male,Frist_Class,Second_Class2,Third_Class3,Family_Member,With_family,Survived
0,904,3,3,2,3,0,0,1,1,0,1,0,0,1,0,1
1,906,6,3,5,3,0,0,1,1,0,1,0,0,1,0,1
2,916,6,4,2,3,1,0,0,1,0,1,0,0,4,0,1
3,918,2,3,2,2,1,0,0,1,0,1,0,0,1,0,1
4,920,6,3,1,1,0,0,1,0,1,1,0,0,0,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
82,1296,6,3,4,1,1,0,0,0,1,1,0,0,1,0,1
83,1297,2,3,4,1,1,0,0,0,1,0,1,0,0,1,1
84,1299,6,4,3,1,1,0,0,0,1,1,0,0,2,0,0
85,1303,5,3,3,3,0,1,0,1,0,1,0,0,1,0,1
