# A first introduction to predictive modeling (aka machine learning)

## Let's try to predict (with some hindsight) who will survive the Titanic disaster<br>We need pandas to do the data wrangling and scikit-learn to do the modeling and predictions

In [69]:
import numpy as np
import pandas as pd

import plotly.express as px

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [67]:
df = pd.read_csv('https://github.com/wortell-smart-learning/python-data-fundamentals/raw/main/data/titanic.csv')

In [68]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   survived     891 non-null    int64  
 1   pclass       891 non-null    int64  
 2   sex          891 non-null    object 
 3   age          714 non-null    float64
 4   sibsp        891 non-null    int64  
 5   parch        891 non-null    int64  
 6   fare         891 non-null    float64
 7   embarked     889 non-null    object 
 8   class        891 non-null    object 
 9   who          891 non-null    object 
 10  adult_male   891 non-null    bool   
 11  deck         203 non-null    object 
 12  embark_town  889 non-null    object 
 13  alive        891 non-null    object 
 14  alone        891 non-null    bool   
dtypes: bool(2), float64(2), int64(4), object(7)
memory usage: 92.4+ KB


In [11]:
df.head(2)

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.283,C,First,woman,False,C,Cherbourg,yes,False


### Let's pretend we don't know anything. A random model would predict a 50/50 chance of surviving or not. This is the dumbest model we can come up with. Let's create some random predictions:

In [77]:
random_predictions = np.random.randint(0, 2, 891)

### What is our accuracy score when we use this model of random predictions?

In [79]:
accuracy_score(df['survived'], random_predictions)

0.4792368125701459
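A single run of random predictions lands a bit above or below 50% by chance. The sketch below repeats the coin-flip experiment many times to show where the accuracy settles on average. It uses only the standard library and rebuilds the labels from the class counts implied by the group table later in this notebook (81 + 468 non-survivors, 233 + 109 survivors); the seed and the number of repetitions are arbitrary choices.

```python
import random

# Class counts implied by the tables in this notebook:
# 549 passengers did not survive (0) and 342 survived (1).
y_true = [0] * 549 + [1] * 342

rng = random.Random(42)  # fixed seed (arbitrary) so the run is repeatable


def random_guess_accuracy(labels, rng):
    """Accuracy of one round of fair coin-flip predictions."""
    hits = sum(rng.randint(0, 1) == label for label in labels)
    return hits / len(labels)


# Repeat the experiment many times and average the accuracies.
scores = [random_guess_accuracy(y_true, rng) for _ in range(2000)]
mean_score = sum(scores) / len(scores)

print(f"mean accuracy over {len(scores)} runs: {mean_score:.3f}")
print(f"lowest: {min(scores):.3f}, highest: {max(scores):.3f}")
```

A fair coin is right half of the time whatever the true label is, so the average stays close to 0.5 even though the classes are not balanced.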

## But we can do better by looking at the percentage of survivors

In [31]:
px.bar(
    df['survived'].value_counts(dropna=False, normalize=True) * 100.,
    title='Survival rate on the Titanic',
    range_y=[0, 100]
)

## So about 62% did not survive and only about 38% did. If we predicted that no one survived, we would be correct about 62% of the time. That's already a better model!

In [95]:
accuracy_score(df['survived'], np.zeros(len(df), dtype=int))

0.6161616161616161
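This "always predict the most common outcome" model is often called a majority-class baseline. As a sketch, scikit-learn's `DummyClassifier` can produce the same number. The labels here are rebuilt from the counts in this notebook (549 non-survivors, 342 survivors) and `X_placeholder` is a made-up feature column, because the dummy model ignores its inputs.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Labels rebuilt from the counts in this notebook: 549 zeros and 342 ones.
y = np.array([0] * 549 + [1] * 342)

# DummyClassifier ignores the features, so a placeholder column is enough.
X_placeholder = np.zeros((len(y), 1))

# "most_frequent" always predicts the class seen most often during fit.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_placeholder, y)

baseline_accuracy = baseline.score(X_placeholder, y)
print(baseline_accuracy)
```

Any real model should beat this number, otherwise it has learned nothing beyond the class balance.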

## But of course we can do better if we look at the data and see what else predicts survival<br>Let's see what the effect of passenger class is

In [100]:
group_pclass = df.groupby(['pclass', 'survived'], as_index=False)['alive'].count()
group_pclass['perc_of_group'] = group_pclass['alive'] / group_pclass.groupby('pclass')['alive'].transform('sum') * 100.

px.bar(group_pclass, 'survived', 'perc_of_group', facet_row='pclass')

## And gender may also affect the chances of survival

In [102]:
group_sex = df.groupby(['sex', 'survived'], as_index=False)['alive'].count()
group_sex['perc_of_group'] = group_sex['alive'] / group_sex.groupby(['sex'])['alive'].transform('sum') * 100.
group_sex

Unnamed: 0,sex,survived,alive,perc_of_group
0,female,0,81,25.796
1,female,1,233,74.204
2,male,0,468,81.109
3,male,1,109,18.891


In [103]:
px.bar(group_sex, 'survived', 'perc_of_group', facet_row='sex')
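A shorter route to the same survival percentages, shown here as a sketch: because `survived` is a 0/1 column, its mean within each group is the survival rate. To keep the example self-contained, the passengers are rebuilt from the counts in the table above rather than read from the CSV file.

```python
import pandas as pd

# Rebuild one row per passenger from the counts in the table above
# (sex, survived, number of passengers).
counts = [("female", 0, 81), ("female", 1, 233), ("male", 0, 468), ("male", 1, 109)]
passengers = pd.DataFrame(
    [(sex, survived) for sex, survived, n in counts for _ in range(n)],
    columns=["sex", "survived"],
)

# The mean of a 0/1 column is the share of ones, i.e. the survival rate.
survival_rate = passengers.groupby("sex")["survived"].mean() * 100
print(survival_rate.round(3))
```

On the real DataFrame the same idea would be `df.groupby('sex')['survived'].mean()`.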

## And so on and so on: there could be many variables that have a predictive effect. This is where we need a statistical model to keep track of all the effects and come up with good predictions.

## Let's try to build a first model with the 2 variables that have a clear effect on survival rates: passenger class and sex.

## But statistical models need numbers, and our column sex only contains the strings `male` and `female`. So we need a numerical column.

In [107]:
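# get_dummies returns a DataFrame; select its single 'male' column as 0/1 integers
df['sex_code'] = pd.get_dummies(df['sex'], drop_first=True, dtype=int)['male']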

df.head(3)

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone,sex_code
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False,1
1,1,1,female,38.0,1,0,71.283,C,First,woman,False,C,Cherbourg,yes,False,0
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True,0
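To see what `get_dummies` does, here is a small sketch on four made-up values. Without `drop_first` you get one indicator column per category. With `drop_first=True` the first category (alphabetically `female`) is dropped, which leaves a single column that is 1 for `male` and 0 for `female`. Passing `dtype=int` is a choice made here to get 0/1 integers.

```python
import pandas as pd

sex = pd.Series(["male", "female", "female", "male"], name="sex")  # toy values

# One indicator column per category, in sorted order: female, male.
all_dummies = pd.get_dummies(sex, dtype=int)

# drop_first=True removes the first category (female), leaving one 0/1 column.
male_only = pd.get_dummies(sex, drop_first=True, dtype=int)

print(all_dummies)
print(male_only)
```

One column is enough for a two-valued variable: if a passenger is not `male`, the value is 0 and the model knows the passenger is `female`.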


## We need to split what is used as an input to predict and what needs to be predicted: X and y

In [109]:
X = df[['pclass', 'sex_code']]

y = df['survived']

## Now we can build our first model

In [112]:
logit_model = LogisticRegression()

logit_model.fit(X, y)

LogisticRegression()
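What did `fit` actually learn? A logistic regression stores one weight per input column plus an intercept, and turns their weighted sum into a probability with the sigmoid function. The sketch below checks this by hand. It is a simplification: it uses only the sex indicator, rebuilt from the counts in this notebook, so the fitted numbers are not those of the two-variable model above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Simplified data: only the sex indicator, rebuilt from the counts in this notebook.
# (sex_code, survived, number of passengers); sex_code 1 = male, 0 = female.
counts = [(0, 0, 81), (0, 1, 233), (1, 0, 468), (1, 1, 109)]
X = np.array([[code] for code, _, n in counts for _ in range(n)])
y = np.array([survived for _, survived, n in counts for _ in range(n)])

model = LogisticRegression().fit(X, y)

# Manual prediction: weighted sum of the inputs, pushed through the sigmoid.
z = model.intercept_[0] + X[:, 0] * model.coef_[0, 0]
manual_proba = 1.0 / (1.0 + np.exp(-z))

# predict_proba returns one column per class; column 1 is P(survived = 1).
sklearn_proba = model.predict_proba(X)[:, 1]
print(np.allclose(manual_proba, sklearn_proba))

# A class prediction is simply "probability above 0.5".
manual_pred = (manual_proba > 0.5).astype(int)
print((manual_pred == model.predict(X)).all())
```

In this sex-only sketch the model should end up predicting survival for women and not for men, which is right for 233 + 468 = 701 of the 891 passengers (about 78.7%). That matches the accuracy reported below for the two-variable model, which may mean that passenger class changes few or none of the yes/no predictions at the 0.5 threshold.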

## And now almost 79% of our predictions are correct when we check the accuracy on the data we trained on :)

In [115]:
accuracy_score(y, logit_model.predict(X))

0.7867564534231201
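The score above is measured on the same passengers the model was fitted on. A common next step, sketched here under assumptions, is to hold back part of the data and score the model only on passengers it has not seen. The data is again the simplified sex-only reconstruction from the counts in this notebook, and the 25% test size and `random_state=0` are arbitrary choices.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Simplified sex-only data rebuilt from the counts in this notebook.
# (sex_code, survived, number of passengers); sex_code 1 = male, 0 = female.
counts = [(0, 0, 81), (0, 1, 233), (1, 0, 468), (1, 1, 109)]
X = np.array([[code] for code, _, n in counts for _ in range(n)])
y = np.array([survived for _, survived, n in counts for _ in range(n)])

# Keep 25% of the passengers aside; stratify keeps the survival ratio similar
# in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Fit on the training part only, then score on both parts.
model = LogisticRegression().fit(X_train, y_train)
train_accuracy = accuracy_score(y_train, model.predict(X_train))
test_accuracy = accuracy_score(y_test, model.predict(X_test))

print(f"train accuracy: {train_accuracy:.3f}")
print(f"test accuracy:  {test_accuracy:.3f}")
```

With the real DataFrame you would pass `df[['pclass', 'sex_code']]` and `df['survived']` to `train_test_split` instead of the reconstructed arrays.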