# 2. A Data Science Framework: To Achieve 99% Accuracy

URL : https://www.kaggle.com/ldfreeman3/a-data-science-framework-to-achieve-99-accuracy

- Chapter 1 - How a Data Scientist Beat the Odds
- Chapter 2 - A Data Science Framework
- Chapter 3 - Step 1: Define the Problem and Step 2: Gather the Data
- Chapter 4 - Step 3: Prepare Data for Consumption
- Chapter 5 - The 4 C's of Data Cleaning: Correcting, Completing, Creating, and Converting
- Chapter 6 - Step 4: Perform Exploratory Analysis with Statistics
- Chapter 7 - Step 5: Model Data
- Chapter 8 - Evaluate Model Performance
- Chapter 9 - Tune Model with Hyper-Parameters
- Chapter 10 - Tune Model with Feature Selection
- Chapter 11 - Step 6: Validate and Implement
- Chapter 12 - Conclusion and Step 7: Optimize and Strategize

## Chapter 1 - How a Data Scientist Beat the Odds

## Chapter 2 - A Data Science Framework

1. <b>Define the Problem</b>: If data science, big data, machine learning, predictive analytics, business intelligence, or any other buzzword is the solution, then what is the problem? As the saying goes, don't put the cart before the horse. Problems before requirements, requirements before solutions, solutions before design, and design before technology. Too often we are quick to jump on the new shiny technology, tool, or algorithm before determining the actual problem we are trying to solve.
2. <b>Gather the Data</b>: John Naisbitt wrote in his 1984 (yes, 1984) book Megatrends that we are "drowning in data, yet starving for knowledge." So, chances are, the dataset(s) already exist somewhere, in some format. It may be external or internal, structured or unstructured, static or streamed, objective or subjective, etc. As the saying goes, you don't have to reinvent the wheel, you just have to know where to find it. In the next step, we will worry about transforming "dirty data" to "clean data."
3. <b>Prepare Data for Consumption</b>: This step is often referred to as data wrangling, a required process to turn "wild" data into "manageable" data. Data wrangling includes implementing data architectures for storage and processing, developing data governance standards for quality and control, data extraction (e.g., ETL and web scraping), and data cleaning to identify aberrant, missing, or outlier data points.
4. <b>Perform Exploratory Analysis</b>: Anybody who has ever worked with data knows, garbage-in, garbage-out (GIGO). Therefore, it is important to deploy descriptive and graphical statistics to look for potential problems, patterns, classifications, correlations and comparisons in the dataset. In addition, data categorization (i.e. qualitative vs quantitative) is also important to understand and select the correct hypothesis test or data model.
5. <b>Model Data</b>: Like descriptive and inferential statistics, data modeling can either summarize the data or predict future outcomes. Your dataset and expected results will determine the algorithms available for use. It's important to remember that algorithms are tools, not magic wands or silver bullets. You must still be the master craftsperson who knows how to select the right tool for the job. An analogy would be asking someone to hand you a Phillips screwdriver, and they hand you a flathead screwdriver or, worse, a hammer. At best, it shows a complete lack of understanding. At worst, it makes completing the project impossible. The same is true in data modeling. The wrong model can lead to poor performance at best and the wrong conclusion (that's used as actionable intelligence) at worst.
6. <b>Validate and Implement Data Model</b>: After you've trained your model on a subset of your data, it's time to test your model. This helps ensure you haven't overfit your model, that is, made it so specific to the selected subset that it does not accurately fit another subset from the same dataset. In this step we determine whether our model overfits, generalizes, or underfits our dataset.
7. <b>Optimize and Strategize</b>: This is the "bionic man" step, where you iterate back through the process to make it better...stronger...faster than it was before. As a data scientist, your strategy should be to outsource developer operations and application plumbing, so you have more time to focus on recommendations and design. Once you're able to package your ideas, this becomes your "currency exchange" rate.
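
Step 6 in the list above can be made concrete with a small sketch. It assumes scikit-learn is available and uses synthetic data (not the Titanic set) with a noisy label, so no model can learn it perfectly. An unconstrained decision tree memorizes the training subset, and the gap between training and held-out accuracy is one common sign of overfitting. All variable names here are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption for illustration): the label depends on
# one feature plus noise, so it cannot be predicted perfectly.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = (X[:, 0] + rng.normal(scale=1.0, size=400) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A fully grown tree can memorize every training row.
deep_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
train_acc = deep_tree.score(X_train, y_train)
test_acc = deep_tree.score(X_test, y_test)

# A depth-limited tree is a simpler model to compare against.
shallow_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_train, y_train)
shallow_train_acc = shallow_tree.score(X_train, y_train)
shallow_test_acc = shallow_tree.score(X_test, y_test)

print("deep    train/test:", train_acc, test_acc)
print("shallow train/test:", shallow_train_acc, shallow_test_acc)
```

A large gap between the two numbers for the deep tree suggests the model fits the training subset rather than the underlying pattern.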

## Chapter 3 - Step 1: Define the Problem and Step 2: Gather the Data

## Chapter 4 - Step 3: Prepare Data for Consumption

### 3.1 Import Libraries

In [1]:
from __future__ import absolute_import, division, print_function, unicode_literals
import sys
import pandas as pd
import matplotlib
import numpy as np
import scipy as sp
import IPython
from IPython import display
import sklearn

# misc libraries (misc = miscellaneous: of various kinds)

import random
import time

import warnings
warnings.filterwarnings('ignore')

from subprocess import check_output

In [2]:
from sklearn import svm, tree, linear_model, neighbors, naive_bayes, ensemble, \
                        discriminant_analysis,gaussian_process
from xgboost import XGBClassifier

from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn import feature_selection, model_selection, metrics

import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.pylab as pylab
import seaborn as sns
from pandas.plotting import scatter_matrix


%matplotlib inline
mpl.style.use('ggplot')
sns.set_style('white')
pylab.rcParams['figure.figsize'] = 12,8

### 3.2 Meet and Greet Data

In [3]:
data_raw = pd.read_csv('E:\\kaggle\\Titanic Machine Learning from Disaster\\titanic\\train.csv')
data_val = pd.read_csv('E:\\kaggle\\Titanic Machine Learning from Disaster\\titanic\\test.csv')

data1 = data_raw.copy(deep=True)
data_cleaner = [data1, data_val]
print(data_raw.info())
data_raw.sample(10)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
None


| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 864 | 865 | 0 | 2 | Gill, Mr. John William | male | 24.0 | 0 | 0 | 233866 | 13.0 | NaN | S |
| 886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0 | NaN | S |
| 321 | 322 | 0 | 3 | Danoff, Mr. Yoto | male | 27.0 | 0 | 0 | 349219 | 7.8958 | NaN | S |
| 539 | 540 | 1 | 1 | Frolicher, Miss. Hedwig Margaritha | female | 22.0 | 0 | 2 | 13568 | 49.5 | B39 | C |
| 71 | 72 | 0 | 3 | Goodwin, Miss. Lillian Amy | female | 16.0 | 5 | 2 | CA 2144 | 46.9 | NaN | S |
| 211 | 212 | 1 | 2 | Cameron, Miss. Clear Annie | female | 35.0 | 0 | 0 | F.C.C. 13528 | 21.0 | NaN | S |
| 528 | 529 | 0 | 3 | Salonen, Mr. Johan Werner | male | 39.0 | 0 | 0 | 3101296 | 7.925 | NaN | S |
| 661 | 662 | 0 | 3 | Badt, Mr. Mohamed | male | 40.0 | 0 | 0 | 2623 | 7.225 | NaN | C |
| 807 | 808 | 0 | 3 | Pettersson, Miss. Ellen Natalia | female | 18.0 | 0 | 0 | 347087 | 7.775 | NaN | S |
| 413 | 414 | 0 | 2 | Cunningham, Mr. Alfred Fleming | male | NaN | 0 | 0 | 239853 | 0.0 | NaN | S |


## Chapter 5 - 3.21 The 4 C's of Data Cleaning: Correcting, Completing, Creating, and Converting

In [4]:
print(data1.isnull().sum())

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64


In [5]:
print(data_val.isnull().sum())

PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64
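
Raw counts are easier to judge next to the share of rows they represent: in the training output above, Cabin is missing in 687 of 891 rows (about 77%), while Age is missing in 177 (about 20%). The following is a minimal sketch on a made-up toy frame, not the Titanic data, showing one way to get both numbers side by side. The column values and the `missing` name are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Titanic data (values are made up).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

missing = pd.DataFrame({
    "n_missing": toy.isnull().sum(),   # count of null cells per column
    "share": toy.isnull().mean(),      # fraction of rows that are null
})
print(missing.sort_values("share", ascending=False))
```

A column with a very high share of nulls is a candidate for dropping, while one with a modest share is a candidate for imputation.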


### 3.22 Clean Data
Developer Documentation

- pandas.DataFrame
- pandas.DataFrame.info
- pandas.DataFrame.describe
- Indexing and Selecting Data
- pandas.isnull
- pandas.DataFrame.sum
- pandas.DataFrame.mode
- pandas.DataFrame.copy
- pandas.DataFrame.fillna
- pandas.DataFrame.drop
- pandas.Series.value_counts
- pandas.DataFrame.loc

In [6]:
for dataset in data_cleaner:
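    # Assign the result back: fillna(..., inplace=True) on a selected column
    # is chained assignment and may not update the frame in newer pandas.
    dataset['Age'] = dataset['Age'].fillna(dataset['Age'].median())
    dataset['Embarked'] = dataset['Embarked'].fillna(dataset['Embarked'].mode()[0])
    dataset['Fare'] = dataset['Fare'].fillna(dataset['Fare'].median())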
    
drop_column=['PassengerId','Cabin','Ticket']
data1.drop(drop_column, axis=1, inplace=True)

print(data1.isnull().sum())
print('-'*10)
print(data_val.isnull().sum())

Survived    0
Pclass      0
Name        0
Sex         0
Age         0
SibSp       0
Parch       0
Fare        0
Embarked    0
dtype: int64
----------
PassengerId      0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          327
Embarked         0
dtype: int64
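
To see which values the completing step actually fills in, here is a small worked example on a made-up toy frame (an assumption for illustration, not rows from the dataset). Numeric columns get the column median and the categorical column gets its most frequent value, mirroring the choices made in the cell above.

```python
import numpy as np
import pandas as pd

# Toy frame with one gap per column (values are made up).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 40.0],
    "Embarked": ["S", "C", None, "S"],
    "Fare": [7.25, 71.28, np.nan, 8.05],
})

# Median for numeric columns, mode (most frequent value) for the category.
toy["Age"] = toy["Age"].fillna(toy["Age"].median())
toy["Embarked"] = toy["Embarked"].fillna(toy["Embarked"].mode()[0])
toy["Fare"] = toy["Fare"].fillna(toy["Fare"].median())

print(toy)
print(toy.isnull().sum())
```

The median is used instead of the mean for Fare because a single very large value (71.28 here) would pull the mean far from a typical fare.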


In [7]:
data1.head()

| | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | 7.25 | S |
| 1 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | 71.2833 | C |
| 2 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | 7.925 | S |
| 3 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 53.1 | S |
| 4 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 8.05 | S |


In [8]:
pp = data1.copy()
pp['Name'].str.split(', ',expand=True)[1].str.split('.', expand=True)[0][:10]

0        Mr
1       Mrs
2      Miss
3       Mrs
4        Mr
5        Mr
6        Mr
7    Master
8       Mrs
9       Mrs
Name: 0, dtype: object
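
The title extraction above relies on the name format `Surname, Title. Given names`: the first split keeps everything after `", "` and the second keeps everything before the first `"."`. The sketch below repeats that on three names taken from the `head()` output and compares it with a single regular expression. The regex variant is an alternative added here for illustration, not the notebook's method.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
])

# Two-step split: text after ", " and then text before the first ".".
by_split = names.str.split(", ", expand=True)[1].str.split(".", expand=True)[0]

# Alternative: capture whatever sits between the comma and the first period.
by_regex = names.str.extract(r",\s*([^.]+)\.", expand=False)

print(by_split.tolist())
print(by_regex.tolist())
```

Both approaches assume every name contains a comma followed by a title ending in a period. A name that breaks that pattern would need separate handling.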

In [9]:
pp.Fare.describe()

count    891.000000
mean      32.204208
std       49.693429
min        0.000000
25%        7.910400
50%       14.454200
75%       31.000000
max      512.329200
Name: Fare, dtype: float64

In [10]:
for dataset in data_cleaner:
    dataset['FamilySize'] = dataset['SibSp'] + dataset['Parch'] + 1
    dataset['IsAlone'] = 1
    dataset.loc[dataset['FamilySize'] > 1, 'IsAlone'] = 0  # single .loc call avoids chained assignment
    
    dataset['Title'] = dataset['Name'].str.split(', ',expand=True)[1].str.split('.', expand=True)[0]
    dataset['FareBin'] = pd.qcut(dataset['Fare'], 4)
    dataset['AgeBin'] = pd.cut(dataset['Age'].astype(int), 5)
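
The cell above bins Fare with `pd.qcut` and Age with `pd.cut`. The difference matters for skewed columns such as Fare, whose `describe()` output shows a median of about 14.45 but a maximum of about 512.33. The sketch below uses a handful of made-up fare values (an assumption for illustration) to show how the two functions distribute rows.

```python
import pandas as pd

# Made-up fares with one extreme value, mimicking the skew seen in Fare.
fares = pd.Series([0.0, 7.25, 7.9, 8.05, 13.0, 26.55, 53.1, 512.33])

# qcut: quantile-based edges, so each bin holds about the same number of rows.
fare_q = pd.qcut(fares, 4)

# cut: equal-width edges over the value range, so skewed data piles into one bin.
fare_w = pd.cut(fares, 4)

print(fare_q.value_counts().sort_index())
print(fare_w.value_counts().sort_index())
```

With quantile bins every bin receives two rows here, while the equal-width bins put seven of the eight rows into the first bin.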

In [11]:
print(data1['Title'].value_counts())

Mr              517
Miss            182
Mrs             125
Master           40
Dr                7
Rev               6
Mlle              2
Col               2
Major             2
Mme               1
Jonkheer          1
Lady              1
Don               1
the Countess      1
Ms                1
Sir               1
Capt              1
Name: Title, dtype: int64


In [12]:
stat_min = 10
title_names = (data1['Title'].value_counts() < stat_min)
print(title_names[:5])

Mr        False
Miss      False
Mrs       False
Master    False
Dr         True
Name: Title, dtype: bool


In [14]:
# Replace every title that occurs fewer than stat_min times with 'Misc'
data1['Title'] = data1['Title'].apply(lambda x: 'Misc' if title_names.loc[x] else x)
print(data1['Title'].value_counts())

Mr        517
Miss      182
Mrs       125
Master     40
Misc       27
Name: Title, dtype: int64
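
The same rare-category grouping can be written without a per-row `apply`. This is a minimal sketch on a toy series, using a threshold of 2 instead of the notebook's 10 so that the effect is visible on seven rows. The names `rare` and `grouped` are illustrative.

```python
import pandas as pd

titles = pd.Series(["Mr", "Mr", "Mr", "Miss", "Miss", "Dr", "Rev"])
stat_min = 2  # toy threshold; the notebook uses 10

counts = titles.value_counts()
rare = counts[counts < stat_min].index   # labels seen fewer than stat_min times

# Keep a title where it is not rare, otherwise substitute "Misc".
grouped = titles.where(~titles.isin(rare), "Misc")

print(grouped.value_counts())
```

Grouping rare labels keeps the later encoding step from creating many near-empty categories.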


In [15]:
data1.info()
data_val.info()
data1.sample(10)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 14 columns):
Survived      891 non-null int64
Pclass        891 non-null int64
Name          891 non-null object
Sex           891 non-null object
Age           891 non-null float64
SibSp         891 non-null int64
Parch         891 non-null int64
Fare          891 non-null float64
Embarked      891 non-null object
FamilySize    891 non-null int64
IsAlone       891 non-null int64
Title         891 non-null object
FareBin       891 non-null category
AgeBin        891 non-null category
dtypes: category(2), float64(2), int64(6), object(4)
memory usage: 85.5+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 16 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Name           418 non-null object
Sex            418 non-null object
Age            418 non-null float64
SibSp          418 non-null int64
Parch          418 non-null in

| | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Fare | Embarked | FamilySize | IsAlone | Title | FareBin | AgeBin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 826 | 0 | 3 | Lam, Mr. Len | male | 28.0 | 0 | 0 | 56.4958 | S | 1 | 1 | Mr | (31.0, 512.329] | (16.0, 32.0] |
| 243 | 0 | 3 | Maenpaa, Mr. Matti Alexanteri | male | 22.0 | 0 | 0 | 7.125 | S | 1 | 1 | Mr | (-0.001, 7.91] | (16.0, 32.0] |
| 82 | 1 | 3 | McDermott, Miss. Brigdet Delia | female | 28.0 | 0 | 0 | 7.7875 | Q | 1 | 1 | Miss | (-0.001, 7.91] | (16.0, 32.0] |
| 640 | 0 | 3 | Jensen, Mr. Hans Peder | male | 20.0 | 0 | 0 | 7.8542 | S | 1 | 1 | Mr | (-0.001, 7.91] | (16.0, 32.0] |
| 808 | 0 | 2 | Meyer, Mr. August | male | 39.0 | 0 | 0 | 13.0 | S | 1 | 1 | Mr | (7.91, 14.454] | (32.0, 48.0] |
| 857 | 1 | 1 | Daly, Mr. Peter Denis | male | 51.0 | 0 | 0 | 26.55 | S | 1 | 1 | Mr | (14.454, 31.0] | (48.0, 64.0] |
| 403 | 0 | 3 | Hakkarainen, Mr. Pekka Pietari | male | 28.0 | 1 | 0 | 15.85 | S | 2 | 0 | Mr | (14.454, 31.0] | (16.0, 32.0] |
| 364 | 0 | 3 | O'Brien, Mr. Thomas | male | 28.0 | 1 | 0 | 15.5 | Q | 2 | 0 | Mr | (14.454, 31.0] | (16.0, 32.0] |
| 725 | 0 | 3 | Oreskovic, Mr. Luka | male | 20.0 | 0 | 0 | 8.6625 | S | 1 | 1 | Mr | (7.91, 14.454] | (16.0, 32.0] |
| 473 | 1 | 2 | Jerwan, Mrs. Amin S (Marie Marthe Thuillard) | female | 23.0 | 0 | 0 | 13.7917 | C | 1 | 1 | Mrs | (7.91, 14.454] | (16.0, 32.0] |


### 3.23 Convert Formats

In [20]:
data1.columns

Index(['Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare',
       'Embarked', 'FamilySize', 'IsAlone', 'Title', 'FareBin', 'AgeBin',
       'Sex_Code', 'Embarked_Code', 'Title_Code', 'AgeBin_Code',
       'FareBin_Code'],
      dtype='object')

In [17]:
label = LabelEncoder()
for dataset in data_cleaner:
    dataset['Sex_Code'] = label.fit_transform(dataset['Sex'])
    dataset['Embarked_Code'] = label.fit_transform(dataset['Embarked'])
    dataset['Title_Code'] = label.fit_transform(dataset['Title'])
    dataset['AgeBin_Code'] = label.fit_transform(dataset['AgeBin'])
    dataset['FareBin_Code'] = label.fit_transform(dataset['FareBin'])
    
Target=['Survived']

data1_x = ['Sex','Pclass','Embarked','Title','SibSp','Parch','Age','Fare','FamilySize','IsAlone']
data1_x_calc = ['Sex_Code','Pclass','Embarked_Code','Title_Code','SibSp','Parch','Age','Fare']
data1_xy = Target + data1_x
print('Original X Y: ', data1_xy)

Original X Y:  ['Survived', 'Sex', 'Pclass', 'Embarked', 'Title', 'SibSp', 'Parch', 'Age', 'Fare', 'FamilySize', 'IsAlone']
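
One thing worth checking with `LabelEncoder` is how codes are assigned: the encoder sorts the distinct labels it sees during `fit` and numbers them from 0. Because the loop above calls `fit_transform` separately for each dataset, the same label could receive different codes if the two datasets do not contain the same set of values. The sketch below illustrates this on toy lists (made up, not taken from the data) and shows the fit-once alternative, which is a suggestion rather than the notebook's approach.

```python
from sklearn.preprocessing import LabelEncoder

train_embarked = ["S", "C", "Q", "S"]
val_embarked = ["Q", "S", "S"]   # toy validation values with no "C"

# Fit once on the training values, then reuse the same mapping.
enc = LabelEncoder().fit(train_embarked)
classes = list(enc.classes_)                          # sorted labels
train_codes = enc.transform(train_embarked).tolist()
val_codes = enc.transform(val_embarked).tolist()

# Refitting on the validation values alone yields a different mapping.
refit_codes = LabelEncoder().fit_transform(val_embarked).tolist()

print(classes, train_codes, val_codes, refit_codes)
```

Note that `transform` raises an error for labels that were not present during `fit`, so the fit-once approach requires the training data to cover every label.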


In [18]:
data1_x_bin = ['Sex_Code','Pclass','Embarked_Code','Title_Code', 'FamilySize', 'AgeBin_Code', 'FareBin_Code']
data1_xy_bin = Target + data1_x_bin
print('Bin X Y:', data1_xy_bin)

Bin X Y: ['Survived', 'Sex_Code', 'Pclass', 'Embarked_Code', 'Title_Code', 'FamilySize', 'AgeBin_Code', 'FareBin_Code']


In [19]:
data1_dummy = pd.get_dummies(data1[data1_x])
data1_x_dummy = data1_dummy.columns.tolist()
data1_xy_dummy = Target + data1_x_dummy
print('Dummy X Y: ', data1_xy_dummy)
data1_dummy.head()

Dummy X Y:  ['Survived', 'Pclass', 'SibSp', 'Parch', 'Age', 'Fare', 'FamilySize', 'IsAlone', 'Sex_female', 'Sex_male', 'Embarked_C', 'Embarked_Q', 'Embarked_S', 'Title_Master', 'Title_Misc', 'Title_Miss', 'Title_Mr', 'Title_Mrs']


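| | Pclass | SibSp | Parch | Age | Fare | FamilySize | IsAlone | Sex_female | Sex_male | Embarked_C | Embarked_Q | Embarked_S | Title_Master | Title_Misc | Title_Miss | Title_Mr | Title_Mrs |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 1 | 0 | 22.0 | 7.25 | 2 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 1 | 1 | 1 | 0 | 38.0 | 71.2833 | 2 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2 | 3 | 0 | 0 | 26.0 | 7.925 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 3 | 1 | 1 | 0 | 35.0 | 53.1 | 2 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 4 | 3 | 0 | 0 | 35.0 | 8.05 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |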
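
As the dummy output above suggests, `pd.get_dummies` leaves numeric columns untouched and expands each text column into one indicator column per distinct value, named `<column>_<value>`. The following is a minimal sketch on a made-up three-row frame. Passing `dtype=int` is an assumption added here so the result shows 0/1 as in the table above, since newer pandas versions return booleans by default.

```python
import pandas as pd

# Toy frame (values are made up for illustration).
toy = pd.DataFrame({
    "Pclass": [3, 1, 3],
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "S"],
})

# Numeric columns pass through; text columns become indicator columns.
dummies = pd.get_dummies(toy, dtype=int)

print(dummies.columns.tolist())
print(dummies)
```

Only values that actually occur get a column, so a frame without any `Embarked == "Q"` rows has no `Embarked_Q` column.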


### 3.24 Da-Double Check Cleaned Data

In [21]:
print('Train columns with null values:\n', data1.isnull().sum())
print("-"*10)
print (data1.info())
print("-"*10)

print('Test/Validation columns with null values:\n', data_val.isnull().sum())
print("-"*10)
print (data_val.info())
print("-"*10)

data_raw.describe(include = 'all')

Train columns with null values: 
 Survived         0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Fare             0
Embarked         0
FamilySize       0
IsAlone          0
Title            0
FareBin          0
AgeBin           0
Sex_Code         0
Embarked_Code    0
Title_Code       0
AgeBin_Code      0
FareBin_Code     0
dtype: int64
----------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 19 columns):
Survived         891 non-null int64
Pclass           891 non-null int64
Name             891 non-null object
Sex              891 non-null object
Age              891 non-null float64
SibSp            891 non-null int64
Parch            891 non-null int64
Fare             891 non-null float64
Embarked         891 non-null object
FamilySize       891 non-null int64
IsAlone          891 non-null int64
Title            891 non-null object
FareBin          891 non-null catego

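| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 891 | 891 | 714.0 | 891.0 | 891.0 | 891.0 | 891.0 | 204 | 889 |
| unique | NaN | NaN | NaN | 891 | 2 | NaN | NaN | NaN | 681.0 | NaN | 147 | 3 |
| top | NaN | NaN | NaN | Moussa, Mrs. (Mantoura Boulos) | male | NaN | NaN | NaN | 1601.0 | NaN | B96 B98 | S |
| freq | NaN | NaN | NaN | 1 | 577 | NaN | NaN | NaN | 7.0 | NaN | 4 | 644 |
| mean | 446.0 | 0.383838 | 2.308642 | NaN | NaN | 29.699118 | 0.523008 | 0.381594 | NaN | 32.204208 | NaN | NaN |
| std | 257.353842 | 0.486592 | 0.836071 | NaN | NaN | 14.526497 | 1.102743 | 0.806057 | NaN | 49.693429 | NaN | NaN |
| min | 1.0 | 0.0 | 1.0 | NaN | NaN | 0.42 | 0.0 | 0.0 | NaN | 0.0 | NaN | NaN |
| 25% | 223.5 | 0.0 | 2.0 | NaN | NaN | 20.125 | 0.0 | 0.0 | NaN | 7.9104 | NaN | NaN |
| 50% | 446.0 | 0.0 | 3.0 | NaN | NaN | 28.0 | 0.0 | 0.0 | NaN | 14.4542 | NaN | NaN |
| 75% | 668.5 | 1.0 | 3.0 | NaN | NaN | 38.0 | 1.0 | 0.0 | NaN | 31.0 | NaN | NaN |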


### 3.25 Split Training and Testing Data

In [23]:
model_selection

<module 'sklearn.model_selection' from 'C:\\Users\\la\\Anaconda3\\envs\\py36\\lib\\site-packages\\sklearn\\model_selection\\__init__.py'>

In [22]:
data1_x_calc

['Sex_Code',
 'Pclass',
 'Embarked_Code',
 'Title_Code',
 'SibSp',
 'Parch',
 'Age',
 'Fare']

In [27]:
len(data1_x_dummy), data1_x_dummy

(17,
 ['Pclass',
  'SibSp',
  'Parch',
  'Age',
  'Fare',
  'FamilySize',
  'IsAlone',
  'Sex_female',
  'Sex_male',
  'Embarked_C',
  'Embarked_Q',
  'Embarked_S',
  'Title_Master',
  'Title_Misc',
  'Title_Miss',
  'Title_Mr',
  'Title_Mrs'])

In [26]:
train1_x, test1_x, train1_y, test1_y = model_selection.train_test_split(data1[data1_x_calc],
                                                                        data1[Target], random_state = 0)
train1_x_bin, test1_x_bin, train1_y_bin, test1_y_bin = model_selection.train_test_split(data1[data1_x_bin],
                                                                                        data1[Target] ,
                                                                                        random_state = 0)
train1_x_dummy, test1_x_dummy, train1_y_dummy, test1_y_dummy = model_selection.\
                        train_test_split(data1_dummy[data1_x_dummy], data1[Target], random_state = 0)


print("Data1 Shape: {}".format(data1.shape))
print("Train1 Shape: {}".format(train1_x.shape))
print("Test1 Shape: {}".format(test1_x.shape))

train1_x_bin.head()

Data1 Shape: (891, 19)
Train1 Shape: (668, 8)
Test1 Shape: (223, 8)


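| | Sex_Code | Pclass | Embarked_Code | Title_Code | FamilySize | AgeBin_Code | FareBin_Code |
|---|---|---|---|---|---|---|---|
| 105 | 1 | 3 | 2 | 3 | 1 | 1 | 0 |
| 68 | 0 | 3 | 2 | 2 | 7 | 1 | 1 |
| 253 | 1 | 3 | 2 | 3 | 2 | 1 | 2 |
| 320 | 1 | 3 | 2 | 3 | 1 | 1 | 0 |
| 706 | 0 | 2 | 2 | 4 | 1 | 2 | 1 |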
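
The 668/223 shapes printed above follow from the `train_test_split` default: when no `test_size` is given, scikit-learn holds out 25% of the rows, rounded up. The sketch below reproduces those counts on placeholder arrays with 891 rows (the arrays themselves are made up). The `stratify` variant is an optional extra shown for comparison and is not used in the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as data1.
X = np.arange(891 * 2).reshape(891, 2)
y = np.arange(891) % 2

# No test_size given, so 25% of the rows are held out.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
print(X_train.shape, X_test.shape)

# Optional variant: stratify keeps the class ratio similar in both parts.
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X, y, random_state=0, stratify=y
)
print(y.mean(), ys_test.mean())
```

Using the same `random_state` for every call, as the notebook does, means the three feature sets are split on the same rows, which keeps their results comparable.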


## Chapter 6 - Step 4: Perform Exploratory Analysis with Statistics

## Chapter 7 - Step 5: Model Data

## Chapter 8 - Evaluate Model Performance

## Chapter 9 - Tune Model with Hyper-Parameters

## Chapter 10 - Tune Model with Feature Selection

## Chapter 11 - Step 6: Validate and Implement

## Chapter 12 - Conclusion and Step 7: Optimize and Strategize