### v2 Exploring the Titanic Data Set
One-hot encoding the `sex` variable

In [1]:
import pandas as pd

In [2]:
data = pd.read_csv('data/titanic-all.csv')
data.shape

(1309, 11)

### Dealing with missing (NaN) values

The idiomatic pandas way to count the missing values in each column is:

In [3]:
data.isna().sum()

pclass         0
survived       0
name           0
sex            0
age          263
sibsp          0
parch          0
ticket         0
fare           1
cabin       1014
embarked       2
dtype: int64

In [4]:
data.head()

|   | pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | Allen, Miss. Elisabeth Walton | female | 29.0 | 0 | 0 | 24160 | 211.3375 | B5 | S |
| 1 | 1 | 1 | Allison, Master. Hudson Trevor | male | 0.9167 | 1 | 2 | 113781 | 151.55 | C22 C26 | S |
| 2 | 1 | 0 | Allison, Miss. Helen Loraine | female | 2.0 | 1 | 2 | 113781 | 151.55 | C22 C26 | S |
| 3 | 1 | 0 | Allison, Mr. Hudson Joshua Creighton | male | 30.0 | 1 | 2 | 113781 | 151.55 | C22 C26 | S |
| 4 | 1 | 0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0 | 1 | 2 | 113781 | 151.55 | C22 C26 | S |


There are three general approaches to dealing with missing values:  
1. Omit the observation altogether.
2. Omit just the column (variable) with the missing values.
3. "Fill in" the missing value, a process known as imputation.
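
The three approaches can be sketched on a small made-up frame. The `toy` DataFrame and its values are assumptions for illustration only, not rows from the Titanic file, and median imputation is just one of several possible fill strategies.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Titanic data (values are invented for illustration).
toy = pd.DataFrame({
    "age": [29.0, np.nan, 2.0, 30.0],
    "fare": [211.34, 151.55, np.nan, 151.55],
    "embarked": ["S", "S", None, "C"],
})

# 1. Omit every observation (row) that has at least one missing value.
rows_dropped = toy.dropna()

# 2. Omit a column (variable) that contains missing values.
col_dropped = toy.drop(columns=["age"])

# 3. Impute: fill missing ages with the median of the observed ages.
imputed = toy.assign(age=toy["age"].fillna(toy["age"].median()))

print(rows_dropped.shape)             # rows 0 and 3 survive
print(col_dropped.shape)              # all rows, one column fewer
print(imputed["age"].isna().sum())    # no missing ages remain
```

Which approach is appropriate depends on how many values are missing: dropping rows is cheap when only a handful are affected, while a column that is mostly empty is usually a candidate for removal.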

In [5]:
data[data.embarked.isna()]

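|   | pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 168 | 1 | 1 | Icard, Miss. Amelie | female | 38.0 | 0 | 0 | 113572 | 80.0 | B28 | NaN |
| 284 | 1 | 1 | Stone, Mrs. George Nelson (Martha Evelyn) | female | 62.0 | 0 | 0 | 113572 | 80.0 | B28 | NaN |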


There isn't an easy way for us to determine where these two passengers embarked, so we can either drop this variable or drop the two observations. Let's do the latter, using `dropna` restricted to the `embarked` column:

In [6]:
data2 = data.dropna(subset=['embarked'])
data2.shape

(1307, 11)
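
A second way to drop the same rows is to filter with a boolean mask. The sketch below uses a made-up three-row frame (`toy` is an assumption, not the Titanic data) to suggest that the two forms select the same rows.

```python
import numpy as np
import pandas as pd

# Invented frame with one missing port of embarkation.
toy = pd.DataFrame({
    "name": ["A", "B", "C"],
    "embarked": ["S", np.nan, "C"],
})

# Way 1: dropna restricted to the column of interest.
via_dropna = toy.dropna(subset=["embarked"])

# Way 2: keep only rows where the column is not missing.
via_mask = toy[toy["embarked"].notna()]

print(via_dropna.equals(via_mask))
```

Both forms return a new DataFrame and leave the original untouched, which is why the notebook assigns the result to a new name.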

### Baseline

In [7]:
data2.survived.value_counts(normalize=True)

survived
0    0.618975
1    0.381025
Name: proportion, dtype: float64
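
These proportions give a reference point: a model that always predicts the majority class (0, did not survive) would be right about 62% of the time, so any useful classifier should beat that. A minimal sketch of such a baseline follows, assuming scikit-learn's `DummyClassifier` and a made-up label vector `y_toy` with roughly the same class balance (it is not the real `survived` column).

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Invented labels: 62 non-survivors and 38 survivors, mimicking the balance above.
y_toy = np.array([0] * 62 + [1] * 38)
X_toy = np.zeros((len(y_toy), 1))  # features are ignored by this strategy

# Always predict the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_toy, y_toy)

baseline_accuracy = baseline.score(X_toy, y_toy)
print(baseline_accuracy)  # fraction of labels equal to the majority class
```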

In [8]:
from sklearn.preprocessing import OneHotEncoder

In [9]:
ohe = OneHotEncoder(sparse_output=False)  # optionally pass drop='first' to keep a single column
X = ohe.fit_transform(data2[['sex']])

In [10]:
X

array([[1., 0.],
       [0., 1.],
       [1., 0.],
       ...,
       [0., 1.],
       [0., 1.],
       [0., 1.]])

In [11]:
y = data2['survived']

In [12]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=7)

In [13]:
from sklearn.linear_model import LogisticRegression
lgr = LogisticRegression()

lgr.fit(X_train, y_train)

In [14]:
y_pred = lgr.predict(X_test)
y_pred

array([0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0,
       1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1,
       1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0,
       1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1,
       1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,
       0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0,
       0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,
       0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1,
       0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0,
       0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1,
       1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0,

In [15]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [16]:
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

print(accuracy)
print(report)
print(conf_matrix)

0.7531806615776081
              precision    recall  f1-score   support

           0       0.79      0.83      0.81       246
           1       0.69      0.63      0.65       147

    accuracy                           0.75       393
   macro avg       0.74      0.73      0.73       393
weighted avg       0.75      0.75      0.75       393

[[204  42]
 [ 55  92]]
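
The numbers in the report can be recomputed by hand from the confusion matrix, which helps make the definitions concrete. The sketch below copies the matrix printed above and assumes scikit-learn's layout (rows are true classes, columns are predicted classes); the variable names are illustrative.

```python
import numpy as np

# Confusion matrix copied from the output above: rows = true class, columns = predicted class.
cm = np.array([[204, 42],
               [55, 92]])

# For a binary problem, ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = cm.ravel()

accuracy_from_cm = (tn + tp) / cm.sum()   # correct predictions / all predictions
precision_1 = tp / (tp + fp)              # of predicted survivors, how many survived
recall_1 = tp / (tp + fn)                 # of actual survivors, how many were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(round(accuracy_from_cm, 4), round(precision_1, 2),
      round(recall_1, 2), round(f1_1, 2))
```

The accuracy of about 0.75 is well above the 0.62 majority-class proportion, so `sex` alone already carries a good deal of information about survival.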
