# Machine Learning
## Predict Titanic Survival
The RMS Titanic set sail on its maiden voyage in 1912, crossing the Atlantic from Southampton, England to New York City. The ship never completed the voyage: it struck an iceberg and sank in the Atlantic Ocean, killing 1,502 of the 2,224 passengers and crew on board.
In this project you will create a Logistic Regression model that predicts which passengers survived the sinking of the Titanic, based on features like age and class.
The data we will be using for training our model is provided by Kaggle. Feel free to make the model better on your own and submit it to the Kaggle Titanic competition!

In [36]:
#import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [37]:
# Load the passenger data
df = pd.read_csv("passengers.csv")
print(df.head())

   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   NaN        S  


In [38]:
#convert sex column to numerical: female = 1 / male = 0
df["Sex"] = df['Sex'].map({'female':1,'male':0})

In [39]:
#check if we have missing values
df.isna().any()

PassengerId    False
Survived       False
Pclass         False
Name           False
Sex            False
Age             True
SibSp          False
Parch          False
Ticket         False
Fare           False
Cabin           True
Embarked        True
dtype: bool

In [40]:
#fill missing values in Age with the column mean, assigning the result back to the column
df["Age"] = df["Age"].fillna(df["Age"].mean())
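
The mean imputation used for `Age` can be checked on a small example. This is a minimal sketch on a hypothetical toy column (the values are illustrative, not taken from `passengers.csv`); it shows that the mean is computed from the known ages only and then written into the gaps.

```python
import numpy as np
import pandas as pd

# Hypothetical toy stand-in for the Age column
toy = pd.DataFrame({"Age": [22.0, 38.0, np.nan, 35.0, np.nan]})

# Series.mean() skips NaN by default, so only the three known ages are averaged
mean_age = toy["Age"].mean()

# Fill the gaps and assign the filled column back to the DataFrame
toy["Age"] = toy["Age"].fillna(mean_age)

print(mean_age)
print(toy["Age"].tolist())
```

Mean imputation keeps every row available for training, at the cost of pulling the imputed passengers toward the average age.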

In [41]:
#another useful feature is Pclass: encode first class as 1 and all other classes as 0
df["FirstClass"] = df['Pclass'].apply(lambda x: 1 if x == 1 else 0)

In [42]:
#encode second class as 1 and all other classes as 0
df["SecondClass"] = df['Pclass'].apply(lambda x: 1 if x == 2 else 0)
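
The two indicator columns amount to a one-hot encoding of `Pclass` with third class as the baseline (both columns zero). As an alternative sketch, the same columns could be produced with `pd.get_dummies`; the toy frame and the `Class_` prefix below are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

# Hypothetical toy stand-in for the Pclass column
toy = pd.DataFrame({"Pclass": [1, 3, 2, 3, 1]})

# One indicator column per class value; dtype=int gives 0/1 instead of booleans
dummies = pd.get_dummies(toy["Pclass"], prefix="Class", dtype=int)

# Keep first and second class only, so third class is the implicit baseline
encoded = dummies[["Class_1", "Class_2"]].rename(
    columns={"Class_1": "FirstClass", "Class_2": "SecondClass"}
)

print(encoded)
```

Dropping the third-class column avoids a redundant feature, since it is fully determined by the other two.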

### Select and Split the Data

In [43]:
#select data
features = df[["Sex", "Age", "FirstClass", "SecondClass"]]
survival = df["Survived"]

In [44]:
#split data
x_train, x_test, y_train, y_test = train_test_split(features, survival)
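
By default `train_test_split` shuffles the rows and holds out 25% of them for testing, so the scores change from run to run. The sketch below, on hypothetical toy arrays, shows how the optional `random_state` and `stratify` arguments make the split reproducible and keep the class balance the same in both parts. The notebook itself does not set these arguments.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 samples, 12 of class 0 and 8 of class 1
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 12 + [1] * 8)

# random_state fixes the shuffle; stratify=y preserves the 60/40 class ratio
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

print(X_tr.shape, X_te.shape)
print(y_tr.sum(), y_te.sum())
```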

### Normalize the Data

In [45]:
#create a StandardScaler, fit it on the training data only, then apply the same scaling to the test data
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)
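
To make the scaling step concrete: `StandardScaler` subtracts each column's training mean and divides by its training standard deviation, and `transform` reuses those training statistics on new data. A minimal sketch with hypothetical toy values (the names `toy_scaler`, `train`, and `test` are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy features: columns could stand for Sex and Age
train = np.array([[0.0, 22.0], [1.0, 38.0], [1.0, 26.0], [0.0, 35.0]])
test = np.array([[1.0, 30.0]])

toy_scaler = StandardScaler()
train_scaled = toy_scaler.fit_transform(train)  # learns mean and std from train
test_scaled = toy_scaler.transform(test)        # reuses the train statistics

# Manual equivalent: z = (x - mean) / std, with the population std (ddof=0)
mean = train.mean(axis=0)
std = train.std(axis=0)
manual_test = (test - mean) / std

print(np.allclose(test_scaled, manual_test))
```

Fitting the scaler on the training split alone keeps information about the test set out of the model.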



### Create and Evaluate the Model

In [46]:
#create Logistic Regression
model = LogisticRegression()
model.fit(x_train, y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

In [47]:
# Score the model on the train data
model.score(x_train, y_train)

0.8038922155688623

In [48]:
# Score the model on the test data
model.score(x_test, y_test)

0.7713004484304933

Which features are most important in predicting survival in the sinking of the Titanic?

In [49]:
#print coefficient of the model
print(model.coef_)

[[ 1.29062142 -0.43538297  0.97674925  0.42633307]]


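Sex and FirstClass have the largest coefficients (about 1.29 and 0.98), so they were the most important features for predicting survival. Because the features were standardized before training, the coefficient magnitudes can be compared directly.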
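
One way to read the coefficients is to pair them with the feature names and rank them by absolute value. The sketch below copies the coefficient values from the printed output above; exponentiating a coefficient gives the multiplicative change in the odds of survival for a one standard deviation increase in that (scaled) feature.

```python
import numpy as np

# Feature order matches the columns selected for training
feature_names = ["Sex", "Age", "FirstClass", "SecondClass"]

# Values copied from the model.coef_ output shown above
coefs = np.array([1.29062142, -0.43538297, 0.97674925, 0.42633307])

# Rank features from largest to smallest absolute coefficient
order = np.argsort(-np.abs(coefs))
ranked = [(feature_names[i], float(coefs[i])) for i in order]

for name, c in ranked:
    print(f"{name:12s} coef={c:+.3f}  odds ratio per +1 std={np.exp(c):.2f}")
```

The negative sign on `Age` indicates that, in this model, older passengers had lower odds of survival.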

### Predict with the Model

In [50]:
# Sample passenger features, in the order [Sex, Age, FirstClass, SecondClass]
Jack = np.array([0.0,20.0,0.0,0.0])
Rose = np.array([1.0,17.0,1.0,0.0])
You = np.array([0.0,37.0, 0.0,1.0])

In [52]:
# Combine arrays
sample_passengers = np.array([Jack, Rose, You])

In [53]:
#scale our data
sample_passengers = scaler.transform(sample_passengers)

In [54]:
#make predictions
model.predict(sample_passengers)

array([0, 1, 0], dtype=int64)

According to the model, Rose would have been the only survivor.

In [55]:
#Check the probabilities
model.predict_proba(sample_passengers)

array([[0.88466204, 0.11533796],
       [0.04804346, 0.95195654],
       [0.82422774, 0.17577226]])

Jack has about an 88% probability of dying and only a 12% probability of surviving;
Rose has only about a 5% probability of dying and a 95% probability of surviving;
and You have about an 82% probability of dying and an 18% probability of surviving.
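
In `predict_proba`, the first column is the probability of class 0 (died) and the second is the probability of class 1 (survived). For a binary logistic regression, the second column is the sigmoid of the linear score. The following sketch checks this on hypothetical synthetic data (the generated `X`, `y`, and the name `clf` are illustrative and unrelated to the Titanic data).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical synthetic data with two features and a noisy linear boundary
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

clf = LogisticRegression().fit(X, y)

# Linear score z = w . x + b for the first three samples
z = X[:3] @ clf.coef_.ravel() + clf.intercept_[0]

# Sigmoid of the score gives P(class 1)
p_manual = 1.0 / (1.0 + np.exp(-z))
p_sklearn = clf.predict_proba(X[:3])[:, 1]

print(np.allclose(p_manual, p_sklearn))

# predict() returns class 1 exactly when P(class 1) exceeds 0.5
print(np.array_equal(clf.predict(X[:3]), (p_sklearn > 0.5).astype(int)))
```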