[Titanic: Machine Learning from Disaster](https://www.kaggle.com/c/titanic/overview)

#### Competition Description
The sinking of the RMS Titanic is one of the most infamous shipwrecks in history.  On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

#### Practice Skills
- Binary classification
- Python and R basics

#### Goal
It is your job to predict if a passenger survived the sinking of the Titanic or not. 
For each PassengerId in the test set, you must predict a 0 or 1 value for the Survived variable.

#### Metric
Your score is the percentage of passengers you correctly predict. This is known simply as "accuracy".
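
As a minimal sketch of the metric, accuracy is the number of matching predictions divided by the total number of predictions. The function name `accuracy` and the sample labels below are made up for illustration; they are not part of the competition materials.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy of an empty sequence")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical labels: 1 = survived, 0 = did not survive
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]
score = accuracy(y_true, y_pred)
print(f"accuracy = {score:.2f}")
```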

#### Submission File Format
You should submit a CSV file with exactly 418 entries plus a header row. Your submission will show an error if it has extra columns (beyond PassengerId and Survived) or extra rows.
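
The format rules above can be checked locally before uploading. The following is only a sketch: the helper `validate_submission` is a hypothetical name, and the 0/1 check on `Survived` is taken from the Goal section rather than from any official validator.

```python
import csv
import io

EXPECTED_ROWS = 418
EXPECTED_HEADER = ["PassengerId", "Survived"]


def validate_submission(handle, expected_rows=EXPECTED_ROWS):
    """Return a list of problems found in a submission CSV (empty if none)."""
    problems = []
    reader = csv.reader(handle)
    header = next(reader, None)
    if header != EXPECTED_HEADER:
        problems.append(f"unexpected header: {header}")
    rows = [row for row in reader if row]  # ignore blank lines
    if len(rows) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(rows)}")
    for row_no, row in enumerate(rows, start=1):
        # each data row needs exactly two fields and a 0/1 prediction
        if len(row) != 2 or row[1] not in ("0", "1"):
            problems.append(f"data row {row_no}: bad row {row}")
    return problems


# Hypothetical three-row file, checked against expected_rows=3
sample = io.StringIO("PassengerId,Survived\n892,0\n893,1\n894,0\n")
print(validate_submission(sample, expected_rows=3))
```

For a real submission, the same function could be called on an open file handle with the default of 418 expected rows.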

#### Data Dictionary

|Variable|Definition|Key|
|---|:---:|:---:| 
| survival | Survival | 0 = No, 1 = Yes |
|pclass|Ticket class|1 = 1st, 2 = 2nd, 3 = 3rd|
|sex|Sex||
|age|Age in years||
|sibsp|# of siblings / spouses aboard the Titanic||
|parch|# of parents / children aboard the Titanic||
|ticket|Ticket number||
|fare|Passenger fare||
|cabin|Cabin number||
|embarked|Port of Embarkation|C = Cherbourg, Q = Queenstown, S = Southampton|

#### Variable Notes
**pclass:** 
A proxy for socio-economic status (SES)
1st = Upper
2nd = Middle
3rd = Lower

**age:**
Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5
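
A small sketch of how this convention could be applied to individual values. The function name `describe_age` and its labels are hypothetical, and the check assumes that every age of 1 or more ending in .5 is an estimate, which the note implies but does not guarantee.

```python
def describe_age(age):
    """Classify an age value using the conventions from the variable notes."""
    if age is None or age != age:  # None or NaN (NaN is never equal to itself)
        return "missing"
    if age < 1:
        return "fractional (under 1 year)"
    if age % 1 == 0.5:
        return "estimated"
    return "reported"


for value in [0.42, 22.0, 34.5, float("nan")]:
    print(value, "->", describe_age(value))
```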

**sibsp:**
The dataset defines family relations in this way...
Sibling = brother, sister, stepbrother, stepsister
Spouse = husband, wife (mistresses and fiancés were ignored)

**parch:** 
The dataset defines family relations in this way...
Parent = mother, father
Child = daughter, son, stepdaughter, stepson
Some children travelled only with a nanny, therefore parch=0 for them.

In [1]:
! kaggle competitions download -c titanic

train.csv: Skipping, found more recently modified local copy (use --force to force download)
test.csv: Skipping, found more recently modified local copy (use --force to force download)
gender_submission.csv: Skipping, found more recently modified local copy (use --force to force download)


In [2]:
import numpy as np
import pandas as pd
train = pd.read_csv('train.csv')
print(train.shape)
train.head()

(891, 12)


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [3]:
test = pd.read_csv('test.csv')
test.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


In [4]:
# selecting columns
# df = df[["Column Name","Column Name2"]]
# df.filter(regex='[A-CEG-I]') 
# df.iloc[:, 1:]

# encode sex as a 0/1 indicator (1 = male); numpy was already imported above
train['MaleSex'] = (train.Sex == 'male').astype(int)

subset = train[['Survived', 'Pclass', 'MaleSex', 'Age', 'Fare']].copy()
print(len(subset))
subset = subset.dropna()
print(len(subset))
subset.head()

891
714


Unnamed: 0,Survived,Pclass,MaleSex,Age,Fare
0,0,3,1,22.0,7.25
1,1,1,0,38.0,71.2833
2,1,3,0,26.0,7.925
3,1,1,0,35.0,53.1
4,0,3,1,35.0,8.05


In [5]:
train_X = subset.iloc[:, 1:]
print(train_X.head())

train_y = subset.iloc[:, 0]
#train_y = subset.Survived
print(train_y.head())

# fraction of passengers in the subset who survived
sum(train_y==1) / len(train_y)

   Pclass  MaleSex   Age     Fare
0       3        1  22.0   7.2500
1       1        0  38.0  71.2833
2       3        0  26.0   7.9250
3       1        0  35.0  53.1000
4       3        1  35.0   8.0500
0    0
1    1
2    1
3    1
4    0
Name: Survived, dtype: int64


0.4061624649859944

In [6]:
from sklearn.linear_model import LogisticRegression
# the target is binary, so multi_class (deprecated in recent scikit-learn releases) is not needed
%time logistic1 = LogisticRegression(solver='lbfgs').fit(train_X, train_y)

CPU times: user 36 ms, sys: 0 ns, total: 36 ms
Wall time: 19.3 ms


In [7]:
logistic1.score(train_X, train_y)

0.7941176470588235
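
To put this training accuracy in context, it can be compared with always predicting the majority class. The sketch below assumes the counts implied by the outputs above: 714 complete rows and a survival rate of about 0.406, which corresponds to 290 survivors.

```python
n_rows = 714        # rows left after dropna()
n_survived = 290    # implied by the survival rate printed above: 290 / 714
model_accuracy = 0.7941176470588235  # logistic1.score(train_X, train_y)

# Always predicting the majority class (0 = did not survive)
baseline_accuracy = max(n_survived, n_rows - n_survived) / n_rows
improvement = model_accuracy - baseline_accuracy

print(f"baseline accuracy: {baseline_accuracy:.3f}")
print(f"improvement over baseline: {improvement:.3f}")
```

Both figures are measured on the training rows, so they say nothing yet about how the model scores on the held-out test set.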

In [8]:
# exponentiated coefficients are odds ratios, in the column order of train_X
np.exp(logistic1.coef_)

array([[0.29496168, 0.09007459, 0.96483365, 1.00081636]])
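
The exponentiated coefficients can be read as odds ratios: the factor by which the odds of survival change for a one-unit increase in a feature, holding the others fixed. As a rough illustration, the sketch below copies the values from the output above and converts them to percent changes. The feature order is assumed to match the columns of `train_X`, and `percent_change_in_odds` is a hypothetical helper name.

```python
features = ["Pclass", "MaleSex", "Age", "Fare"]
# values copied from np.exp(logistic1.coef_) above
odds_ratios = [0.29496168, 0.09007459, 0.96483365, 1.00081636]


def percent_change_in_odds(odds_ratio):
    """Percent change in the odds of survival for a one-unit increase."""
    return (odds_ratio - 1.0) * 100.0


changes = {name: round(percent_change_in_odds(ratio), 1)
           for name, ratio in zip(features, odds_ratios)}
for name, change in changes.items():
    print(f"{name}: {change:+.1f}% odds of survival per unit increase")
```

Read this way, being male is associated with roughly 91% lower odds of survival in this model, and each step down in ticket class with roughly 70% lower odds.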

In [9]:
# now, let's predict survival in the test dataset
# create MaleSex column
test['MaleSex'] = (test.Sex == 'male') + 0

# Choose the same columns as in the training data set
cols = ['PassengerId', 'Pclass', 'MaleSex', 'Age', 'Fare']
subset = test[cols]

# rows with one or more NaNs
missing = subset.isnull().any(axis=1)
# .copy() so the assignment below modifies an independent DataFrame
subset_missing = subset[missing].copy()

# since most passengers did not survive, impute Survived = 0 for passengers with missing data
subset_missing['Survived'] = 0

subset_missing.head()



Unnamed: 0,PassengerId,Pclass,MaleSex,Age,Fare,Survived
10,902,3,1,,7.8958,0
22,914,1,0,,31.6833,0
29,921,3,1,,21.6792,0
33,925,3,0,,23.45,0
36,928,3,0,,8.05,0
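
The split between rows the model can score and rows that receive a fallback value can be sketched on a toy frame. The data below is made up for illustration; only the pattern (a row-wise boolean mask plus explicit copies) mirrors the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature test set; values are made up for illustration
df = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Pclass": [3, 1, 2, 3],
    "Age": [22.0, np.nan, 40.0, np.nan],
})

has_missing = df.isnull().any(axis=1)  # True for rows with at least one NaN
complete = df[~has_missing].copy()     # rows the model can score
incomplete = df[has_missing].copy()    # explicit copy, safe to modify
incomplete["Survived"] = 0             # fallback prediction for these rows

print(len(df), len(complete), len(incomplete))
```

Taking explicit copies means the new column is added to an independent DataFrame, which is what pandas' SettingWithCopyWarning asks for.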


In [10]:
# back to the test dataset
# drop any rows with missing values
print(len(subset))
subset = subset.dropna()
print(len(subset))

subset.head()

418
331


Unnamed: 0,PassengerId,Pclass,MaleSex,Age,Fare
0,892,3,1,34.5,7.8292
1,893,3,0,47.0,7.0
2,894,2,1,62.0,9.6875
3,895,3,1,27.0,8.6625
4,896,3,0,22.0,12.2875


In [11]:
# remove passenger ID
test_X = subset.iloc[:, 1:]
# predict survival in the test subset 
test_y = logistic1.predict(test_X)

# create a submission dataframe, and csv file
submission = pd.DataFrame({'PassengerId': subset.PassengerId, 'Survived': test_y})

# append passengers with missing data, whose survival was imputed as 0 (see above)
imputed = subset_missing[['PassengerId', 'Survived']]

# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
submission = pd.concat([submission, imputed])

print(submission.shape)

(418, 2)


In [12]:
# create csv file
submission.to_csv('submission_lr.csv', index=False)

In [13]:
! cat submission_lr.csv | head

PassengerId,Survived
892,0
893,0
894,0
895,0
896,1
897,0
898,1
899,0
900,1


In [15]:
# ! kaggle competitions submit -c titanic -f submission_lr.csv -m "logistic regression with 4 features"

100%|██████████████████████████████████████| 2.77k/2.77k [00:00<00:00, 4.24kB/s]
Successfully submitted to Titanic: Machine Learning from Disaster
