Note: this is meant to be a demo of `oo-learning` (https://github.com/shane-kercheval/oo-learning); it is not meant to show the best approach to exploring/cleaning/modeling this particular dataset.

# Set Up Environment

In [None]:
# !pip install oolearning --upgrade

In [3]:
from oolearning import *
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline
pd.set_option('display.max_colwidth', None)  # None = no truncation (pandas >= 1.0; older versions used -1)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
width = 10
plt.rcParams['figure.figsize'] = [width, width/1.333]

# Import Data

`ExploreClassificationDataset` is a class that provides a lot of convenience for exploring a new (classification) dataset.

Below, the class is initialized from a csv file, but you can also initialize it from an existing pandas DataFrame using the constructor (**`ExploreClassificationDataset(dataset, target_variable)`**).

Additionally, sometimes the target is stored as a number even though it is logically categorical, and we would like to relabel it so the outcome is easier to read. In this example, we load the **`titanic`** dataset (https://www.kaggle.com/c/titanic/data) and change the target variable (**`Survived`**) from **`1`**s and **`0`**s to **`lived`**/**`died`**.

In [22]:
csv_file = '../data/titanic.csv'
target_variable = 'Survived'
target_mapping = {0: 'died', 1: 'lived'}  # so we can convert from numeric to categoric

explore = ExploreClassificationDataset.from_csv(csv_file_path=csv_file,
                                                target_variable=target_variable,
                                                map_numeric_target=target_mapping)
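
For readers who want to see the relabeling idea outside of `oo-learning`, here is a minimal plain-pandas sketch. It is not how `from_csv` is implemented; the `toy` DataFrame and its values are made up for illustration, and only `target_mapping` mirrors the cell above.

```python
import pandas as pd

# Toy stand-in for the Titanic target column (values are illustrative only).
toy = pd.DataFrame({'Survived': [0, 1, 1, 0]})
target_mapping = {0: 'died', 1: 'lived'}

# Series.map returns NaN for any value missing from the mapping,
# so check for that before overwriting the column.
mapped = toy['Survived'].map(target_mapping)
assert mapped.notnull().all(), 'unmapped target value found'
toy['Survived'] = mapped

print(toy['Survived'].tolist())
```

The explicit check matters because a typo in the mapping would otherwise silently turn part of the target into missing values.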

# Explore Data and Feature Engineer

In [23]:
explore.dataset.head()  # we can access the data directly by `.dataset`

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,died,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,lived,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,lived,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,lived,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,died,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


The first thing that I noticed is that **`Pclass`** is imported as a numeric feature, but it might be better suited as categoric, so let's convert it. (Note below that **`Pclass`** will now show up under categoric values when we do **`.categoric_summary()`**.)

In [24]:
# Description of data says (https://www.kaggle.com/c/titanic/data):
# pclass: A proxy for socio-economic status (SES)
# 1st = Upper
# 2nd = Middle
# 3rd = Lower

# let's convert Pclass to categoric variable
explore.set_as_categoric(feature='Pclass', mapping={1: 'Upper', 2: 'Middle', 3: 'Lower'})
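
As a rough plain-pandas analogue of this conversion (a sketch only, not the library's implementation), the column can be relabeled and stored with the `category` dtype, after which it no longer counts as a numeric column. The `toy` data and the `class_labels` name are assumptions for illustration.

```python
import pandas as pd

# Toy data; only the column names mirror the Titanic dataset.
toy = pd.DataFrame({'Pclass': [3, 1, 3, 2],
                    'Fare': [7.25, 71.28, 7.93, 13.0]})
class_labels = {1: 'Upper', 2: 'Middle', 3: 'Lower'}

# Replace the integer codes with labels and store them as a pandas category.
toy['Pclass'] = toy['Pclass'].map(class_labels).astype('category')

# After the conversion, Pclass is grouped with the categoric columns.
numeric_columns = toy.select_dtypes(include='number').columns.tolist()
categoric_columns = toy.select_dtypes(include='category').columns.tolist()
print(numeric_columns, categoric_columns)
```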

Let's explore the numeric columns.

In [25]:
explore.numeric_summary()

Unnamed: 0,count,nulls,perc_nulls,num_zeros,perc_zeros,mean,st_dev,coef of var,skewness,kurtosis,min,10%,25%,50%,75%,90%,max
PassengerId,891,0,0.0,0,0.0,446.0,257.354,0.577,0.0,-1.2,1.0,90.0,223.5,446.0,668.5,802.0,891.0
Age,714,177,0.199,0,0.0,29.699,14.526,0.489,0.389,0.178,0.42,14.0,20.125,28.0,38.0,50.0,80.0
SibSp,891,0,0.0,608,0.682,0.523,1.103,2.108,3.695,17.88,0.0,0.0,0.0,0.0,1.0,1.0,8.0
Parch,891,0,0.0,678,0.761,0.382,0.806,2.112,2.749,9.778,0.0,0.0,0.0,0.0,0.0,2.0,6.0
Fare,891,0,0.0,15,0.017,32.204,49.693,1.543,4.787,33.398,0.0,7.55,7.91,14.454,31.0,77.958,512.329


A couple of things we might note, for example:

    A) `Age` has missing values.
    B) `Fare` has `15` zeros, which we might assume is equivalent to a missing/null value. (Tickets probably weren't free.)
    C) `PassengerId` will not be helpful.
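
Points A and B can be checked, and B acted on, with a few lines of plain pandas. This is a sketch on made-up data; treating a zero `Fare` as missing is the assumption stated in B, not an established fact about the dataset.

```python
import numpy as np
import pandas as pd

# Toy data with one missing Age and two zero Fares (illustrative values).
toy = pd.DataFrame({'Age': [22.0, np.nan, 26.0, 35.0],
                    'Fare': [7.25, 0.0, 7.925, 0.0]})

null_counts = toy.isnull().sum()   # missing values per column
zero_counts = (toy == 0).sum()     # exact zeros per column
print(null_counts.to_dict(), zero_counts.to_dict())

# Assumption: a zero fare means "unknown", so recode it as missing.
toy['Fare'] = toy['Fare'].replace(0, np.nan)
```

Recoding zeros as missing lets any later imputation step handle `Fare` the same way it handles `Age`.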

Now let's explore the categoric columns.

In [26]:
explore.categoric_summary()

Unnamed: 0,count,nulls,perc_nulls,top,unique,perc_unique
Pclass,891,0,0.0,Lower,3,0.003
Name,891,0,0.0,"van Melkebeke, Mr. Philemon",891,1.0
Sex,891,0,0.0,male,2,0.002
Ticket,891,0,0.0,CA. 2343,681,0.764
Cabin,204,687,0.771,B96 B98,147,0.721
Embarked,889,2,0.002,S,3,0.003
Survived,891,0,0.0,died,2,0.002


For these columns, I might note, for example, that `Name`, `Ticket`, and `Cabin` all have a very high number of unique values (`Cabin` also has a high number of null values). This makes sense, but it also means we might not want to use these columns.
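
The `perc_unique` figures in the summary above are consistent with dividing the number of unique values by the non-null count (for `Cabin`, 147 / 204 ≈ 0.721). Assuming that definition, the same ratio can be sketched in plain pandas on toy data:

```python
import numpy as np
import pandas as pd

# Toy data (illustrative values); Cabin has two missing entries.
toy = pd.DataFrame({'Cabin': ['C85', np.nan, 'C123', 'C85', np.nan],
                    'Sex': ['male', 'female', 'female', 'male', 'male']})

non_null = toy.count()    # non-null values per column
unique = toy.nunique()    # distinct non-null values per column

# Assumed definition: share of non-null values that are distinct.
perc_unique = (unique / non_null).round(3)
print(perc_unique.to_dict())
```

A ratio near 1 means almost every row has its own value, which is the signal used here to flag `Name`, `Ticket`, and `Cabin`.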

_(Note: in https://ahmedbesbes.com/how-to-score-08134-in-titanic-kaggle-challenge.html, the author describes how he cleverly extracts the title associated with the name, where applicable. That is a great idea, but outside the scope of this demo.)_

In [27]:
explore.drop(columns=['PassengerId', 'Name', 'Ticket', 'Cabin'])  # drop/remove specific columns

In [28]:
explore.dataset.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,died,Lower,male,22.0,1,0,7.25,S
1,lived,Upper,female,38.0,1,0,71.2833,C
2,lived,Lower,female,26.0,0,0,7.925,S
3,lived,Upper,female,35.0,1,0,53.1,S
4,died,Lower,male,35.0,0,0,8.05,S


In [None]:
explore.plot_against_target('Age')

In [None]:
explore.plot_against_target('Sex')

In [None]:
explore.plot_against_target(feature='Pclass')
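
To get a tabular view of how a categoric feature relates to the target, one option is a normalized cross-tabulation in plain pandas. This is only a sketch on made-up data; it is not a description of what `plot_against_target` draws.

```python
import pandas as pd

# Toy data (illustrative values only).
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male', 'male', 'female'],
    'Survived': ['died', 'lived', 'lived', 'died', 'lived', 'died'],
})

# normalize='index' makes each row sum to 1:
# the share of each outcome within that category.
rates = pd.crosstab(toy['Sex'], toy['Survived'], normalize='index')
print(rates)
```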