## Titanic Dataset
## Predict whether a passenger survived the Titanic disaster, given the passenger's attributes.

In [None]:
from sklearn.datasets import fetch_openml
d  = fetch_openml(name="Titanic", as_frame=True, version=1)
df = d["frame"]
print(d['DESCR'])
# df.head(5)

Features description

* pclass (Passenger class: 1 = First, 2 = Second, 3 = Third)
* survived (0 = died, 1 = survived)
* name
* sex
* age
* sibsp (Number of siblings/spouses on board)
* parch (Number of parents/children on board)
* ticket (Ticket Number)
* fare  (Price of the ticket)
* cabin (Cabin Number)
* embarked (Port where the passenger embarked: C = Cherbourg; Q = Queenstown; S = Southampton)
* boat (Lifeboat ID if passenger was rescued)
* body (Body ID if passenger died and body was recovered)
* home.dest (Passenger home/destination)

Types of features 
* Categorical (Number or Text)
    * Ordinal
    * Nominal
* Numerical
* Text
    
TBD: What is the type of each of the following features? (Choose from the list above.)
* name (Text)
* sex  (Nominal)
* age  (Numerical)
* ticket (Text)
* embarked (port of embarkation) (Nominal)
* survived (Nominal)
* parch (number of parents/children aboard) (Numerical)
* pclass (Ordinal)
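
The ordinal/nominal distinction above can be made explicit in pandas. This is a minimal sketch on hand-written toy values (not the real dataset), assuming we choose to model `pclass` as an ordered categorical and `embarked` as an unordered one:

```python
import pandas as pd

# Ordinal: the categories carry an order (third < second < first class).
pclass = pd.Categorical([3, 1, 2, 3], categories=[3, 2, 1], ordered=True)

# Nominal: the categories are plain labels with no inherent order.
embarked = pd.Categorical(["S", "C", "Q", "S"])

print(pclass.ordered)    # True
print(embarked.ordered)  # False
# Order-based operations such as max() are only defined for ordered categoricals.
print(pclass.max())      # 1 (first class is the highest category)
```
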



In [None]:
# Missing Data
## TBD: Which features in the Titanic dataset have missing data?

df.info()
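
`df.info()` reports non-null counts per column; counting the nulls directly can be easier to read. A minimal sketch on a toy frame (the values here are made up for illustration, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame with a few deliberately missing entries.
toy = pd.DataFrame({
    "age": [29.0, np.nan, 2.0, np.nan],
    "fare": [211.3, 151.6, np.nan, 7.9],
    "sex": ["female", "male", "female", "male"],
})

# Number of missing values per column, showing only the affected columns.
missing = toy.isna().sum()
print(missing[missing > 0].sort_values(ascending=False))
```

The same two lines should work unchanged on the Titanic `df`.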

## ML Ready: Is the data ready to be fed to a classifier?
## TBD: What are the 3 main issues that need to be resolved in the data?
1. Missing data either needs to be imputed or removed
2. Text data either needs to be vectorized or removed if not relevant
3. Categorical data needs to be encoded
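
One possible way to address each of the three issues is sketched below on a toy frame. The choice of median imputation and one-hot encoding is an assumption made for illustration; other strategies may suit the real data better.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value, one text column, and one categorical column.
toy = pd.DataFrame({
    "age": [22.0, np.nan, 40.0],
    "embarked": ["S", "C", "S"],
    "name": ["A", "B", "C"],
})

# 1. Impute missing numeric values with the column median.
toy["age"] = toy["age"].fillna(toy["age"].median())

# 2. Drop free-text columns that are not used as features.
toy = toy.drop(columns=["name"])

# 3. One-hot encode the nominal column.
toy = pd.get_dummies(toy, columns=["embarked"])

print(toy)
```
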

In [None]:
import pandas as pd

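def cleaner(df):
    """Encode categorical columns as integer codes and drop free-text columns."""
    df = df.copy()  # avoid mutating the caller's DataFrame
    for col_name in df.columns:
        if df[col_name].dtype.name == 'category':
            # Convert categorical types to integer codes (missing values become -1)
            df[col_name] = df[col_name].cat.codes
        elif df[col_name].dtype.name == 'object':
            # Drop free-text columns
            df = df.drop(columns=[col_name])
    return df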

def titanic_cleaner(df):
    """Drop the mostly empty 'body' column, then drop rows with any missing value."""
    df = df.drop(columns=['body'])
    return df.dropna()

# TBD: Use the two naive cleaner functions above to clean the Titanic DataFrame

df = titanic_cleaner(cleaner(df))
df.info()
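
One caveat of the naive approach: `.cat.codes` encodes a missing category as `-1`, so a later `dropna()` no longer sees it as missing. A minimal sketch on a toy series, with one possible way to restore the missing marker:

```python
import pandas as pd

# Toy categorical series with one missing entry.
embarked = pd.Series(["S", None, "C"], dtype="category")

codes = embarked.cat.codes
print(codes.tolist())        # [1, -1, 0]
print(codes.isna().sum())    # 0, so dropna() would keep the row

# Turn the -1 sentinel back into NaN so missing-value handling still applies.
restored = codes.astype("float").where(codes >= 0)
print(restored.isna().sum())  # 1
```
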

## TBD: Split the data into train and test sets (test_size=0.25, random_state=101)

In [None]:
from sklearn.model_selection import train_test_split
import numpy as np

# Target vector: the survival label
Y = np.array(df['survived'])
# Feature matrix: every column except the target
X = df.drop('survived', axis = 1)

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=101)
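
When classes are imbalanced, `train_test_split` also accepts a `stratify` argument that keeps the class proportions the same in both splits. The notebook does not use it; the sketch below only illustrates the option on toy arrays.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 8 samples, 4 per class.
toy_X = np.arange(16).reshape(8, 2)
toy_y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

Xtr, Xte, ytr, yte = train_test_split(
    toy_X, toy_y, test_size=0.25, random_state=101, stratify=toy_y
)

print(Xtr.shape, Xte.shape)  # (6, 2) (2, 2)
print(sorted(yte.tolist()))  # [0, 1]: one sample of each class in the test split
```
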

## TBD: Apply Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Fit a logistic regression model on the training split (this is our baseline)
model = LogisticRegression()
model.fit(X_train, Y_train)
# TBD: Accuracy score on the test set
Y_hat = model.predict(X_test)
accuracy_score(Y_test, Y_hat)
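
An accuracy number is easier to interpret next to a trivial reference point. The sketch below, on toy labels, uses scikit-learn's `DummyClassifier` to always predict the majority class; comparing against it is a suggestion, not part of the original exercise.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Toy labels: class 0 is the majority in the training split.
toy_y_train = np.array([0, 0, 0, 1, 1, 0])
toy_y_test = np.array([0, 1, 0, 0])
# The dummy classifier ignores the features, so placeholders are enough.
toy_X_train = np.zeros((len(toy_y_train), 1))
toy_X_test = np.zeros((len(toy_y_test), 1))

reference = DummyClassifier(strategy="most_frequent")
reference.fit(toy_X_train, toy_y_train)
reference_acc = accuracy_score(toy_y_test, reference.predict(toy_X_test))
print(reference_acc)  # 0.75
```

A trained model is only useful if it clearly beats this majority-class score on the same test split.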

In [None]:
## What are the top 3 most important features for predicting survival?
import matplotlib.pyplot as plt
# One coefficient per feature; a larger magnitude suggests a stronger influence
print(np.round(model.coef_, 2))
plt.bar(X.columns, model.coef_.ravel())
plt.xticks(rotation=45)
plt.show()
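
Raw coefficients are only comparable across features when the features share a similar scale (a fare in the hundreds and a 0/1 flag do not). One common remedy, sketched here on synthetic data, is to standardize the features before fitting and then rank by absolute coefficient. The toy columns `fare` and `sex` and the way the label is generated are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200
toy_X = pd.DataFrame({
    "fare": rng.uniform(0, 500, n),  # large scale, unrelated to the label
    "sex": rng.integers(0, 2, n),    # 0/1 flag that drives the label
})
# Label equals `sex`, with roughly 10% of the entries flipped as noise.
flip = rng.random(n) < 0.1
toy_y = np.where(flip, 1 - toy_X["sex"], toy_X["sex"])

# Standardize, then fit, so coefficients are on a comparable scale.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(toy_X, toy_y)

coefs = pd.Series(
    pipe.named_steps["logisticregression"].coef_.ravel(), index=toy_X.columns
)
ranked = coefs.abs().sort_values(ascending=False)
print(ranked)
```
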