# Our First Kaggle Competition

For this assignment we are going to make our first Kaggle submission, for [the Titanic Dataset Competition](https://www.kaggle.com/c/titanic/data).

Kaggle is a website where data scientists compete in data science competitions; the goal is to produce the best predictions for a specific dataset. Companies launch these competitions and usually offer substantial rewards (on the order of thousands of dollars).

For the Titanic competition, the dataset contains information about passengers who were aboard the Titanic on its first (and last) trip.

The target variable is whether the passenger survived or died when the ship sank.
You can download the competition data (and check the data dictionary) on [Kaggle](https://www.kaggle.com/c/titanic/data).

You will use the training data (file `train.csv`) to train your classifier, and then create a submission with predictions for `test.csv`.

You have to submit a CSV file in the following format:

```
PassengerId,Survived
892,0
893,1
894,1
895,0
...
```

Here `PassengerId` holds the ids of the passengers in the `test.csv` dataset, and `Survived` is your model's prediction for that passenger (0 = died, 1 = survived).
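
Before uploading, it can help to check that a file really has this format. The sketch below is only an illustration: `check_submission` is a hypothetical helper (not part of Kaggle or pandas), and the rules it enforces are the ones described above, namely the exact header, ids taken from the test set, and predictions of 0 or 1.

```python
import csv
import io


def check_submission(csv_text, expected_ids):
    """Return a list of problems found in a submission's text (empty if none)."""
    # Skip blank lines so a trailing newline does not count as a row.
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    if not rows or rows[0] != ["PassengerId", "Survived"]:
        return ["header must be exactly PassengerId,Survived"]

    problems = []
    data = rows[1:]
    ids = sorted(row[0] for row in data)
    if ids != sorted(str(i) for i in expected_ids):
        problems.append("PassengerId values do not match the test set")
    if any(len(row) != 2 or row[1] not in {"0", "1"} for row in data):
        problems.append("Survived must be 0 or 1 on every row")
    return problems


# The four example rows shown above.
sample = "PassengerId,Survived\n892,0\n893,1\n894,1\n895,0\n"
print(check_submission(sample, [892, 893, 894, 895]))  # []
```

In practice `expected_ids` would be the `PassengerId` column of `test.csv`.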

To submit a file you have to create a profile on the website. Then you can upload the submission through the website or with the [Kaggle API](https://github.com/Kaggle/kaggle-api).

In [1]:
from IPython.display import Image
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from matplotlib import cm
import warnings
warnings.simplefilter("ignore")
%matplotlib inline

matplotlib.rcParams['figure.figsize'] = [6, 6]

In [2]:
pwd

'/Users/sdessouki/Documents/Portugal/9.classification'

In [3]:
titanic = pd.read_csv("../titanic/train.csv")

In [4]:
titanic.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [5]:
titanic.shape

(891, 12)

In [None]:
titanic.Pclass.plot.kde();

In [None]:
titanic.Age.plot.kde();

In [None]:
titanic.Parch.plot.kde();

In [None]:
titanic.SibSp.plot.kde();

In [6]:
from sklearn.model_selection import train_test_split
y = titanic["Survived"]
# Feature columns: everything except the target and the identifier-like columns.
independent_variables = titanic.drop(
    columns=["Name", "PassengerId", "Ticket", "Cabin", "Survived"]
).columns

In [7]:
X = titanic[independent_variables]

In [12]:
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.3,
                                                    random_state=42)

In [13]:
X_train.shape

(623, 7)

In [14]:
X_test.shape

(268, 7)

In [15]:
y_train.shape

(623,)

In [16]:
y_test.shape

(268,)
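
The four shapes above follow from the 891 rows and `test_size=0.3`. As a quick arithmetic check, assuming `train_test_split` rounds the test fraction up and gives the remainder to the training set:

```python
import math

n_samples = 891   # rows in train.csv, from titanic.shape
test_size = 0.3   # fraction passed to train_test_split

# Assumption: the test count is the fraction rounded up, and the
# training set receives whatever is left.
n_test = math.ceil(n_samples * test_size)
n_train = n_samples - n_test

print(n_train, n_test)  # 623 268
```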

In [18]:
# Take the numeric feature columns from X, which already excludes the target
# and the identifier-like columns, so this cell does not depend on whether
# those columns have been dropped from `titanic` yet.
numerical_cols = X.drop(columns=["Pclass"]).select_dtypes(np.number).columns
categorical_col = ["Sex", "Embarked"]
ordinal_col = ["Pclass"]

In [None]:
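# errors="ignore" keeps this cell safe to re-run after the columns are gone.
titanic = titanic.drop(
    columns=["Survived", "Name", "PassengerId", "Ticket", "Cabin"],
    errors="ignore",
)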

In [19]:
titanic.Embarked.unique()

array(['S', 'C', 'Q', nan], dtype=object)

In [20]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

imputer = SimpleImputer(strategy="mean")
scaler = StandardScaler()

from mlxtend.feature_selection import ColumnSelector
numerical_col_selector = ColumnSelector(cols=numerical_cols)

from sklearn.pipeline import make_pipeline
numerical_pipeline = make_pipeline(
    numerical_col_selector,
    imputer,
    scaler)
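
To make the two numeric steps concrete, here is a toy version of mean imputation followed by standardization, written with the standard library only. It is a sketch of the idea, not scikit-learn's implementation; `impute_mean` and `standardize` are hypothetical names, and the ages are four values from the `head()` output plus one made-up gap.

```python
import math
from statistics import fmean, pstdev


def impute_mean(values):
    """Replace NaN entries with the mean of the observed entries."""
    observed = [v for v in values if not math.isnan(v)]
    fill = fmean(observed)
    return [fill if math.isnan(v) else v for v in values]


def standardize(values):
    """Rescale to zero mean and unit (population) standard deviation."""
    mu, sigma = fmean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


# Toy ages: four from the head() output above, plus one made-up missing value.
ages = [22.0, 38.0, 26.0, math.nan, 35.0]
filled = impute_mean(ages)
scaled = standardize(filled)

print(filled)
print([round(v, 3) for v in scaled])
```

The missing age is filled with 30.25, the mean of the four observed values, and the scaled column ends up with mean 0 and standard deviation 1.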

In [21]:
from category_encoders import OneHotEncoder
categorical_pipeline = make_pipeline(
     ColumnSelector(cols=categorical_col),
     OneHotEncoder())
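
One-hot encoding turns each category into its own 0/1 column. The following is a toy illustration only: `one_hot` is a hypothetical function, and it does not try to reproduce how `category_encoders` names its columns or handles missing values.

```python
def one_hot(values):
    """Map each value to a 0/1 indicator row, one column per distinct category."""
    categories = list(dict.fromkeys(values))  # distinct values, first-seen order
    rows = [[int(v == c) for c in categories] for v in values]
    return categories, rows


# Toy Embarked values.
categories, rows = one_hot(["S", "C", "S", "Q"])
print(categories)  # ['S', 'C', 'Q']
print(rows)        # [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]]
```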

In [22]:
titanic.Pclass.unique()

array([3, 1, 2])

In [23]:
from category_encoders import OrdinalEncoder

# ColumnSelector returns an array without column names, so the encoder refers to the column by position (0)
ordinal_encoder = OrdinalEncoder(mapping=[
    {"col": 0, 
      "mapping": {
        1: 1,
        2: 2,
        3: 3,
      } 
     }
])
ordinal_pipeline = make_pipeline(
    ColumnSelector(cols=ordinal_col),
    ordinal_encoder
)

In [24]:
from sklearn.pipeline import make_union
processing_pipeline = make_union(
    numerical_pipeline,
    categorical_pipeline,
    ordinal_pipeline
)
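
Conceptually, the union places the outputs of the three sub-pipelines side by side, so every passenger ends up with one long feature row. A minimal sketch of that concatenation with made-up values (`union_columns` is a hypothetical helper, not the scikit-learn API):

```python
def union_columns(*blocks):
    """Concatenate several feature blocks row by row, left to right."""
    return [
        [value for row in rows for value in row]
        for rows in zip(*blocks)
    ]


# Made-up outputs of the three sub-pipelines for two passengers.
numeric = [[0.5, -1.0], [1.5, 2.0]]   # scaled numeric features
onehot = [[1, 0], [0, 1]]             # one-hot encoded category
ordinal = [[3], [1]]                  # passenger class

features = union_columns(numeric, onehot, ordinal)
print(features)  # [[0.5, -1.0, 1, 0, 3], [1.5, 2.0, 0, 1, 1]]
```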

In [28]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()
estimator_pipeline = make_pipeline(
    processing_pipeline,
    clf
)

In [29]:
estimator_pipeline.fit(X_train, y_train)

Pipeline(memory=None,
     steps=[('featureunion', FeatureUnion(n_jobs=None,
       transformer_list=[('pipeline-1', Pipeline(memory=None,
     steps=[('columnselector', ColumnSelector(cols=Index(['Age', 'SibSp', 'Parch', 'Fare'], dtype='object'),
        drop_axis=False)), ('simpleimputer', SimpleImputer(copy=True, fill_valu...penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False))])

In [30]:
estimator_pipeline.predict(X_test)[:5]

array([0, 0, 0, 1, 1])

In [33]:
predictions = estimator_pipeline.predict(X_test)
true_classes = y_test
prediction_probabilities = estimator_pipeline.predict_proba(X_test)
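
For a two-class model, the predicted label generally corresponds to the more probable class in the probability output. The sketch below shows that relationship with made-up probabilities; `classify` and its `threshold` parameter are hypothetical, and sending ties to class 0 is an assumption of this example.

```python
def classify(prob_rows, threshold=0.5):
    """Turn rows of [P(class 0), P(class 1)] into 0/1 labels."""
    return [int(p1 > threshold) for _, p1 in prob_rows]


# Made-up rows shaped like two-class probability output.
probs = [[0.9, 0.1], [0.35, 0.65], [0.5, 0.5]]
print(classify(probs))  # [0, 1, 0]
```

Raising the threshold makes the model more reluctant to predict survival, which trades false positives for false negatives.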

In [35]:
def tuple_class_prediction(y_true, y_pred):
    return list(zip(y_true, y_pred))

tuple_class_prediction(true_classes, predictions)[:10]

[(1, 0),
 (0, 0),
 (0, 0),
 (1, 1),
 (1, 1),
 (1, 1),
 (1, 1),
 (0, 0),
 (1, 1),
 (1, 1)]

In [36]:
def TP(true_classes, predictions):
    pairs_class_prediction = tuple_class_prediction(true_classes, predictions)
    return len([obs for obs in pairs_class_prediction if obs[0]==1 and obs[1]==1])

def TN(true_classes, predictions):
    pairs_class_prediction = tuple_class_prediction(true_classes, predictions)
    return len([obs for obs in pairs_class_prediction if obs[0]==0 and obs[1]==0])
    
def FP(true_classes, predictions):
    pairs_class_prediction = tuple_class_prediction(true_classes, predictions)
    return len([obs for obs in pairs_class_prediction if obs[0]==0 and obs[1]==1])

def FN(true_classes, predictions):
    pairs_class_prediction = tuple_class_prediction(true_classes, predictions)
    return len([obs for obs in pairs_class_prediction if obs[0]==1 and obs[1]==0])


print("""
True Positives: {}
True Negatives: {}
False Positives: {}
False Negatives: {}
""".format(
    TP(true_classes, predictions),
    TN(true_classes, predictions),
    FP(true_classes, predictions),
    FN(true_classes, predictions)    
))    


True Positives: 80
True Negatives: 137
False Positives: 20
False Negatives: 31
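
The four functions above each walk the full list of pairs. As an alternative sketch, the same counts can be collected in one pass with `collections.Counter`. `confusion_counts` is a hypothetical name, and the data are the first ten (true, predicted) pairs printed earlier.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP and FN in a single pass over the pairs."""
    counts = Counter(zip(y_true, y_pred))
    return {
        "TP": counts[(1, 1)],
        "TN": counts[(0, 0)],
        "FP": counts[(0, 1)],
        "FN": counts[(1, 0)],
    }


# The first ten (true, predicted) pairs shown earlier.
pairs = [(1, 0), (0, 0), (0, 0), (1, 1), (1, 1),
         (1, 1), (1, 1), (0, 0), (1, 1), (1, 1)]
y_true, y_pred = zip(*pairs)
print(confusion_counts(y_true, y_pred))
# {'TP': 6, 'TN': 3, 'FP': 0, 'FN': 1}
```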



In [37]:
def accuracy(true_classes, predictions):
    tp = TP(true_classes, predictions)
    tn = TN(true_classes, predictions)
    return (tp+tn) / len(true_classes)

accuracy(true_classes, predictions)

0.8097014925373134
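
Accuracy treats both kinds of error the same. The four counts printed earlier also give precision and recall, which separate them. A short worked calculation, using only those counts:

```python
tp, tn, fp, fn = 80, 137, 20, 31  # counts printed earlier

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # of the predicted survivors, the share who survived
recall = tp / (tp + fn)     # of the actual survivors, the share the model found

print(round(accuracy, 4), round(precision, 4), round(recall, 4))
# 0.8097 0.8 0.7207
```

The accuracy matches the value computed above, while the lower recall shows that the model misses about 28% of the actual survivors.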

In [38]:
titanic_test = pd.read_csv("../titanic/test.csv")

In [39]:
titanic_test.head()

,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


In [43]:
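# The ColumnSelector steps pick the feature columns by name, so the extra
# columns in test.csv (Name, Ticket, Cabin, ...) are simply ignored.
titanic_test["Survived"] = estimator_pipeline.predict(titanic_test)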

In [44]:
titanic_test.head()

,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Predictions,Survived
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q,0,0
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S,0,0
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q,0,0
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S,0,0
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S,1,1


In [45]:
titanic_test[["PassengerId","Survived"]].to_csv("Submission1.csv", index=False)