### v4 Exploring the Titanic Data Set
Avoid data leakage with a column transformer and a pipeline. Uses cross-validation.

In [1]:
import pandas as pd

In [2]:
data = pd.read_csv('data/titanic-all.csv')
data.shape

(1309, 11)

### Dealing with missing (NaN) values

The idiomatic Pandas way to count the missing values in each column is:

In [3]:
data.isna().sum()

pclass         0
survived       0
name           0
sex            0
age          263
sibsp          0
parch          0
ticket         0
fare           1
cabin       1014
embarked       2
dtype: int64

In [4]:
data.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S


There are three general approaches to dealing with missing values:  
1. Omit the observation altogether.
2. Omit just the column (variable) with the missing value.
3. "Fill in" the missing value, a process known as imputation.
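
The three approaches above can be sketched on a small toy frame. This is only an illustration: the values and the names `toy`, `rows_dropped`, `col_dropped`, and `imputed` are made up here and are not part of the Titanic data.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with gaps in two columns.
toy = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 58.0],
    "cabin": [np.nan, np.nan, "C85", np.nan],
    "fare": [7.25, 71.28, 8.05, 26.55],
})

# 1. Omit every observation (row) that has a missing age.
rows_dropped = toy.dropna(subset=["age"])

# 2. Omit the variable (column) that is mostly missing.
col_dropped = toy.drop(columns=["cabin"])

# 3. Impute: fill missing ages with the mean of the observed ages.
imputed = toy.assign(age=toy["age"].fillna(toy["age"].mean()))

print(rows_dropped.shape)
print(col_dropped.shape)
print(imputed["age"].tolist())
```

Which approach fits best depends on how much is missing: dropping rows is cheap when only a handful are affected, while a mostly empty column is usually a candidate for removal.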

In [5]:
data[data.embarked.isna()]

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked
168,1,1,"Icard, Miss. Amelie",female,38.0,0,0,113572,80.0,B28,
284,1,1,"Stone, Mrs. George Nelson (Martha Evelyn)",female,62.0,0,0,113572,80.0,B28,


There isn't an easy way for us to determine where these two passengers embarked, so we can either drop this variable or drop the two observations. Let's do the latter by keeping only the rows where `embarked` is not missing, and then do the same for the single row with a missing `fare`:

In [6]:
df = data[data.embarked.notna()]
df.shape

(1307, 11)

In [7]:
df = df[df.fare.notna()]
df.shape

(1306, 11)

Now, what about the `age` and `cabin` variables, which also have NA values? The `cabin` number doesn't seem to be a strong predictor, so let's omit it. Let's impute `age` with the mean value.

In [8]:
features = 'sex embarked fare age'.split()
features

['sex', 'embarked', 'fare', 'age']

In [9]:
X = df[features]
y = df.survived

#### Imputation

In [10]:
X.age.head(20)

0     29.0000
1      0.9167
2      2.0000
3     30.0000
4     25.0000
5     48.0000
6     63.0000
7     39.0000
8     53.0000
9     71.0000
10    47.0000
11    18.0000
12    24.0000
13    26.0000
14    80.0000
15        NaN
16    24.0000
17    50.0000
18    32.0000
19    36.0000
Name: age, dtype: float64

In [11]:
X.age.isna().sum()

263

In [12]:
from sklearn.impute import SimpleImputer

In [13]:
imp = SimpleImputer()

In [14]:
imp.fit(X[['age']])

The fitted mean is stored in the `statistics_` attribute.

In [15]:
imp.statistics_

array([29.81319914])

In [16]:
imp.transform(X[['age']])[:20]

array([[29.        ],
       [ 0.9167    ],
       [ 2.        ],
       [30.        ],
       [25.        ],
       [48.        ],
       [63.        ],
       [39.        ],
       [53.        ],
       [71.        ],
       [47.        ],
       [18.        ],
       [24.        ],
       [26.        ],
       [80.        ],
       [29.81319914],
       [24.        ],
       [50.        ],
       [32.        ],
       [36.        ]])

In [17]:
imp.fit_transform(X[['age']]) ;

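**Important:** Make sure you understand the difference between `.fit`, `.transform`, and `.fit_transform`. `.fit` learns the statistic (here, the mean) from the data, `.transform` applies the learned statistic, and `.fit_transform` does both on the same data.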
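
The distinction matters most when there is separate training and test data. A minimal sketch, assuming made-up `train_age` and `test_age` arrays (they are not from the notebook): the mean is learned from the training data only and then reused on the test data, which is what keeps test information from leaking into the imputation.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical train/test ages; NaN marks a missing value.
train_age = np.array([[20.0], [30.0], [np.nan], [40.0]])
test_age = np.array([[np.nan], [60.0]])

imp_demo = SimpleImputer()                     # default strategy is "mean"

imp_demo.fit(train_age)                        # learn the mean from train only
print(imp_demo.statistics_)                    # mean of 20, 30 and 40

train_filled = imp_demo.transform(train_age)   # apply the learned mean
test_filled = imp_demo.transform(test_age)     # reuse the *train* mean

# fit_transform is fit followed by transform on the same data.
same = SimpleImputer().fit_transform(train_age)
print(np.array_equal(same, train_filled))
```

Calling `fit_transform` on the test data instead would recompute the mean from the test set, which is the kind of leakage a pipeline is meant to prevent.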

### One Hot Encoding

Many ML algorithms require the feature matrix to be numeric, so nominal (unordered categorical) variables such as `embarked` must be encoded, typically with one-hot encoding.

In [18]:
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

In [19]:
ohe = OneHotEncoder(sparse_output=False)  # dense output, only so the result is easy to read here

In [20]:
ohe.fit_transform(X[['embarked']])

array([[0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.],
       ...,
       [1., 0., 0.],
       [1., 0., 0.],
       [0., 0., 1.]])

## Baseline

In [21]:
df.survived.value_counts(normalize=True)

survived
0    0.618683
1    0.381317
Name: proportion, dtype: float64
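
The proportions above give the accuracy a model would get by always predicting the majority class (did not survive), which is the number any real model has to beat. A minimal sketch of the same idea with scikit-learn's `DummyClassifier`, using hypothetical labels `y_toy` with roughly the same class balance:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels: 62 non-survivors, 38 survivors.
y_toy = np.array([0] * 62 + [1] * 38)
X_toy = np.zeros((len(y_toy), 1))   # the dummy classifier ignores the features

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_toy, y_toy)

# Accuracy of always predicting the most frequent class.
baseline_acc = baseline.score(X_toy, y_toy)
print(baseline_acc)
```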

## Pipeline

In [22]:
from sklearn.compose import make_column_transformer

In [23]:
imp = SimpleImputer()
ohe = OneHotEncoder()

ct = make_column_transformer(
    (ohe, ['sex', 'embarked']),
    (imp, ['age']),
    remainder='passthrough'
)

In [24]:
a = ct.fit_transform(X)
a[:5, : ]

array([[  1.    ,   0.    ,   0.    ,   0.    ,   1.    ,  29.    ,
        211.3375],
       [  0.    ,   1.    ,   0.    ,   0.    ,   1.    ,   0.9167,
        151.55  ],
       [  1.    ,   0.    ,   0.    ,   0.    ,   1.    ,   2.    ,
        151.55  ],
       [  0.    ,   1.    ,   0.    ,   0.    ,   1.    ,  30.    ,
        151.55  ],
       [  1.    ,   0.    ,   0.    ,   0.    ,   1.    ,  25.    ,
        151.55  ]])

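The array above is the transformation of the data frame below. Its columns are, in order: the one-hot columns for `sex` (female, male), the one-hot columns for `embarked` (C, Q, S), the imputed `age`, and finally the passed-through `fare`. Note that `age` now comes before `fare`, because remainder columns are appended last.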

In [25]:
X.head(5)

Unnamed: 0,sex,embarked,fare,age
0,female,S,211.3375,29.0
1,male,S,151.55,0.9167
2,female,S,151.55,2.0
3,male,S,151.55,30.0
4,female,S,151.55,25.0


## Cross Validation

In [26]:
from sklearn.pipeline import make_pipeline

In [27]:
data.survived

0       1
1       1
2       0
3       0
4       0
       ..
1304    0
1305    0
1306    0
1307    0
1308    0
Name: survived, Length: 1309, dtype: int64

In [28]:
ct = make_column_transformer(
       (ohe, ['sex', 'embarked']),
       (imp, ['age']),
       remainder = 'passthrough'
)

from sklearn.linear_model import LogisticRegression

lgr = LogisticRegression(max_iter=1000)

p1 = make_pipeline(ct, lgr)

X = df[features]
y = df.survived

from sklearn.model_selection import cross_val_score

scores = cross_val_score(p1, X, y, cv=10, scoring='accuracy')

print(scores)
print()
print(scores.mean())

[0.80152672 0.80152672 0.85496183 0.78625954 0.80152672 0.79389313
 0.71538462 0.69230769 0.65384615 0.75384615]

0.7655079271873164
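
The fold scores above drift downward from about 0.80 to about 0.65. With an integer `cv` and a classifier, `cross_val_score` uses `StratifiedKFold` without shuffling, so each fold is a contiguous block of rows. The first rows shown by `head()` are all `pclass` 1, which suggests the file may be sorted by class; if so, shuffling could give more even folds. The sketch below shows how a shuffled splitter could be passed in. It runs on synthetic stand-in data (`X_demo`, `y_demo`, and all values are made up), so its score says nothing about the Titanic data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for the Titanic features (values are made up).
rng = np.random.default_rng(0)
n = 200
X_demo = pd.DataFrame({
    "sex": rng.choice(["female", "male"], size=n),
    "embarked": rng.choice(["C", "Q", "S"], size=n),
    "fare": rng.uniform(5, 200, size=n),
    "age": rng.uniform(1, 80, size=n),
})
X_demo.loc[rng.choice(n, size=30, replace=False), "age"] = np.nan
y_demo = (X_demo["sex"] == "female").astype(int)   # toy target

# Same structure as the pipeline above: preprocessing lives inside the
# pipeline, so it is refit on the training part of every fold.
pipe = make_pipeline(
    make_column_transformer(
        (OneHotEncoder(handle_unknown="ignore"), ["sex", "embarked"]),
        (SimpleImputer(), ["age"]),
        remainder="passthrough",
    ),
    LogisticRegression(max_iter=1000),
)

# Shuffle before splitting so each fold mixes rows from the whole file.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
demo_scores = cross_val_score(pipe, X_demo, y_demo, cv=cv, scoring="accuracy")
print(demo_scores.mean())
```

Because the imputer and encoder are steps of the pipeline, each fold's mean age is computed only from that fold's training rows, which is the leakage protection this notebook is after.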
