# Week 2 Problem 2

A few things you should keep in mind when working on assignments:

1. Make sure you fill in every place that says YOUR CODE HERE. Do not write your answer anywhere else. Anything you write elsewhere will be removed or overwritten by the autograder.

2. Before you submit your assignment, make sure everything runs as expected. In the menubar, select Kernel, then restart the kernel and run all cells (Restart & Run All).

3. Do not change the title (i.e. file name) of this notebook.

4. Make sure that you save your work (in the menubar, select File → Save and Checkpoint).

5. When you are ready to submit your assignment, go to Dashboard → Assignments and click the Submit button. Your work is not submitted until you click Submit.

6. You are allowed to submit an assignment multiple times, but only the most recent submission will be graded.

7. If your code does not pass the unit tests, it will not pass the autograder.



# Due Date: 6 PM, January 29, 2018


In [2]:
# Set up Notebook
%matplotlib inline

# Standard imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Testing helpers
from nose.tools import assert_false, assert_equal, assert_almost_equal, assert_true

# Model selection, metrics, and tree estimators
from sklearn import metrics
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import DecisionTreeRegressor

# We do this to ignore several specific Pandas warnings
import warnings
warnings.filterwarnings('ignore')

sns.set_style('white')

### Reading in the Titanic Data Set


This data set provides information on the fate of passengers on the fatal maiden voyage of the ocean liner Titanic, summarized according to economic status (class), sex, age and survival.

The dependent variable in this dataset is: survived (categorical).

The independent continuous variables are: age and fare. <br>
The independent categorical variables are: pclass, sex, sibsp, parch, embarked and alone.



In [3]:
col_names = ['pclass', 'sex', 'age', 'sibsp', 'parch', 'fare',
             'embarked', 'alone', 'survived']

titanic_data = sns.load_dataset("titanic")

titanic_data = titanic_data[col_names]
titanic_data.head()


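|   | pclass | sex | age | sibsp | parch | fare | embarked | alone | survived |
|---|--------|-----|-----|-------|-------|------|----------|-------|----------|
| 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | False | 0 |
| 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | False | 1 |
| 2 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | True | 1 |
| 3 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | False | 1 |
| 4 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | True | 0 |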


## Problem 1

Write a data preprocessing function that removes rows containing NaNs and converts the categorical features into binarized (dummy) features. The function should return a single dataframe that contains the binarized features, the numerical features, and the dependent variable.
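The two preprocessing steps can be sketched on a tiny, made-up dataframe. The column values below are illustrative assumptions, not rows from the Titanic data; only `dropna`, `get_dummies`, and `concat` mirror what the problem asks for.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one row has a missing age
toy = pd.DataFrame({
    'age': [22.0, np.nan, 35.0, 41.0],
    'sex': ['male', 'female', 'female', 'male'],
    'survived': [0, 1, 1, 0],
})

# Step 1: drop rows with NaNs (returns a new frame, toy is unchanged)
clean = toy.dropna()

# Step 2: binarize the categorical column into sex_female / sex_male
dummies = pd.get_dummies(clean[['sex']])

# Step 3: put numerical and binarized columns side by side
combined = pd.concat([clean[['age', 'survived']], dummies], axis=1)
print(combined)
```

Using `dropna()` without `inplace=True` keeps the original frame intact, which makes it safe to rerun the cell.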

In [4]:
def preprocess_data(data):
    '''
    1. Preprocess data by removing all rows that contain NaNs.
    2. Create a binarized features dataframe for the categorical features (can use the get_dummies() function)
    3. Construct a dataframe for the numerical features
    4. Combine the two DataFrames into a new DataFrame

    Parameters
    ----------
    data: dataframe containing titanic dataset
    
    Returns
    -------
    a dataframe 
    '''
    
    # Drop all rows with NaNs; work on a copy so the caller's dataframe is not modified
    data = data.dropna()
    
    #create binarized features dataframe
    categorical = ['sex', 'embarked', 'alone']
    cat_data = pd.get_dummies(data[categorical])
    
    #construct a dataframe for numerical feature
    numerical = ['pclass','age','sibsp','parch', 'fare','survived']
    num_data = data[numerical]
    
    #combine two dataframes into a new dataframe(axis = 1 means concatenate along the columns )
    results = pd.concat([num_data, cat_data], axis = 1)
    
    return results

#### Sample output of the 1st 5 rows for the correct solution:

In [5]:
data = preprocess_data(titanic_data)
data.head()


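|   | pclass | age | sibsp | parch | fare | survived | alone | sex_female | sex_male | embarked_C | embarked_Q | embarked_S |
|---|--------|-----|-------|-------|------|----------|-------|------------|----------|------------|------------|------------|
| 0 | 3 | 22.0 | 1 | 0 | 7.25 | 0 | False | 0 | 1 | 0 | 0 | 1 |
| 1 | 1 | 38.0 | 1 | 0 | 71.2833 | 1 | False | 1 | 0 | 1 | 0 | 0 |
| 2 | 3 | 26.0 | 0 | 0 | 7.925 | 1 | True | 1 | 0 | 0 | 0 | 1 |
| 3 | 1 | 35.0 | 1 | 0 | 53.1 | 1 | False | 1 | 0 | 0 | 0 | 1 |
| 4 | 3 | 35.0 | 0 | 0 | 8.05 | 0 | True | 0 | 1 | 0 | 0 | 1 |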


In [6]:
assert_equal(len(data.index), 712)
assert_equal(len(data.columns), 12)


# Problem 2

We will now use the preprocessed Titanic dataset to build a decision tree model that predicts whether a person survived (encoded as 1) or died (encoded as 0), based on the features provided in the data set.

In the code cell below do the following:

- Create a function that builds a decision tree classification model with scikit-learn, using random_state=40 and taking the maximum depth as a parameter.<br>
- Fit your model on the training features and labels (which are stored in d_train and l_train).


In [7]:
independent_vars = list(data)
independent_vars.remove('survived')
dependent_var = 'survived'

frac = 0.3
d_train, d_test, l_train, l_test = \
    train_test_split(data[independent_vars], data[dependent_var],
                    test_size=frac, random_state=40)


In [8]:
def DecisionTree(data,tree_depth):
    '''
    Create a Decision Tree classification model using random_state=40 and
    the given maximum depth, then fit it on the training data.

    Parameters
    ----------
    data: dataframe containing titanic dataset (not used directly; the model
          is fit on the global d_train and l_train)
    tree_depth: max depth of the tree
    
    Returns
    -------
    A fitted DecisionTreeClassifier model 
    '''

    # Construct the decision tree with the requested maximum depth; fixing
    # random_state ensures reproducibility.
    dtc = DecisionTreeClassifier(max_depth=tree_depth, random_state=40)

    # Fit the estimator to the training data
    dtc = dtc.fit(d_train, l_train)

    return dtc

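As the maximum depth of the tree increases, the tree can fit the training data more closely, so the risk of overfitting increases: training accuracy rises and the gap between training and test accuracy tends to widen.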
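The effect of depth can be explored on synthetic data. The sketch below is an illustration under stated assumptions: the features, the noise level, and the depths are made up and are not part of the assignment. It compares training and test accuracy for a shallow tree, a deeper tree, and an unrestricted tree.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy binary classification problem (assumed for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
noise = rng.normal(scale=1.0, size=300)
y = (X[:, 0] + 0.5 * X[:, 1] + noise > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=40)

scores = {}
for depth in (3, 8, None):  # None lets the tree grow until its leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=40)
    tree.fit(X_train, y_train)
    train_acc = tree.score(X_train, y_train)
    test_acc = tree.score(X_test, y_test)
    scores[depth] = (train_acc, test_acc)
    print(f'max_depth={depth}: train={train_acc:.3f}, test={test_acc:.3f}')
```

An unrestricted tree can memorize the training set, so a perfect training score by itself says little about how the model will do on unseen data.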

In [9]:
titanic_model_3 = DecisionTree(data,tree_depth=3)
titanic_model_8 = DecisionTree(data,tree_depth=8)

predicted_3 = titanic_model_3.predict(d_test)
score_test_3 = 100.0 * metrics.accuracy_score(l_test, predicted_3)
print(f'Decision Tree Classification [Titanic Test Data(depth=3)] Score = {score_test_3:4.1f}%\n')

predicted_3 = titanic_model_3.predict(d_train)
score_train_3 = 100.0 * metrics.accuracy_score(l_train, predicted_3)
print(f'Decision Tree Classification [Titanic Train Data(depth=3)] Score = {score_train_3:4.1f}%\n')

predicted_8 = titanic_model_8.predict(d_test)
score_test_8 = 100.0 * metrics.accuracy_score(l_test, predicted_8)
print(f'Decision Tree Classification [Titanic Test Data(depth=8)] Score = {score_test_8:4.1f}%\n')

predicted_8 = titanic_model_8.predict(d_train)
score_train_8 = 100.0 * metrics.accuracy_score(l_train, predicted_8)
print(f'Decision Tree Classification [Titanic Train Data(depth=8)] Score = {score_train_8:4.1f}%\n')


Decision Tree Classification [Titanic Test Data(depth=3)] Score = 78.0%

Decision Tree Classification [Titanic Train Data(depth=3)] Score = 82.3%

Decision Tree Classification [Titanic Test Data(depth=8)] Score = 80.4%

Decision Tree Classification [Titanic Train Data(depth=8)] Score = 91.8%



In [10]:
assert_equal(isinstance(titanic_model_3, DecisionTreeClassifier), True)
assert_equal(titanic_model_3.max_depth, 3)


## Problem 3

When building a model, it helps to know which features carry the most information. Create a function that finds the names of the top 4 most important features along with their corresponding importance values.
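The ranking step itself does not depend on scikit-learn. As a minimal sketch, assuming a list of names and a parallel list of importance values (both made up here, and `top_k_features` is a hypothetical helper name), the pairs can be sorted by value and truncated:

```python
def top_k_features(feature_names, importances, k=4):
    """Return the k (name, importance) pairs with the largest importance."""
    ranked = sorted(zip(feature_names, importances),
                    key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Hypothetical names and importance values, for illustration only
names = ['a', 'b', 'c', 'd', 'e']
values = [0.10, 0.40, 0.05, 0.30, 0.15]

top = top_k_features(names, values, k=3)
print(top)
```

Note that `sorted(..., reverse=True)` keeps tied items in their original order, whereas sorting ascending and then applying `reversed` flips the order of ties. When several features share the same importance (for example 0.0), the two approaches can return different names in the last positions.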

In [11]:
def importance(feature_names,model):
    '''
    Find top 4 most important feature names along with the corresponding value of importance.  
    
    Parameters
    ----------
    feature_names : a list of independent variables (independent_vars)
    model : a fitted decision tree model
    
    Returns
    -------
    A list of 4 elements sorted in descending order of importance, where each element is a (feature, importance) tuple
    
    Example of what return output will look like
    --------------------------------------------
    [('sex_female', 0.43425212116045792),
     ('fare', 0.17077252568364368),
     ('pclass', 0.15921755986638889),
     ('age', 0.15394111874015437)]
    
    Hint
    ----
    Use zip(feature_names, model.feature_importances_) and use reversed and sorted functions to get the zip in sorted order. 
    Then convert this into a list and take the first 4 elements. 
    Or if you are not using reversed, then return the last 4 elements.
    '''
    
    sort_data = sorted(zip(feature_names, model.feature_importances_), key=lambda x: x[1])[-4:]
    results = list(reversed(sort_data))
    
    return results


In [12]:
imp = importance(independent_vars,titanic_model_3)
a=[]
for i in range(len(imp)):
    a.append(np.round(imp[i][1],4))
assert_almost_equal(a,[0.6597, 0.2367, 0.1036, 0.0],places=4)
b=[]
for i in range(len(imp)):
    b.append(imp[i][0])

assert_equal(b,['sex_female', 'pclass', 'age', 'embarked_S'])

#Feature importance for titanic_model_3 will have the values : 
for i in range(len(imp)):
    print(f'{imp[i][0]} importance = {100.0*imp[i][1]:5.2f}%')



sex_female importance = 65.97%
pclass importance = 23.67%
age importance = 10.36%
embarked_S importance =  0.00%


## Problem 4

We will be using the Hitters dataset, which contains performance and salary information for Major League Baseball players from the 1986 and 1987 seasons. This dataset was taken from the StatLib library, which is maintained at Carnegie Mellon University.

We will try to predict the Salary of players based on the factors available to us in the dataset (information like Number of Hits, Number of Home Runs, etc.).



In [13]:
wage = pd.read_csv('/home/data_scientist/data/misc/wages.csv')
wage['log_Salary'] = np.log(wage['Salary'])


In [14]:
import patsy as pts 

y, x = pts.dmatrices('Salary + log_Salary ~ AtBat + Hits + HmRun + Runs + RBI + Walks + Years +' +
                     'CAtBat + CHits + CHmRun + CRuns + CRBI + CWalks + C(League) +' + 
                     'C(Division) + PutOuts + Assists + C(NewLeague)',
                     data=wage, return_type='dataframe')


In the code cell below do the following:

- Split the data into training and testing sets using random_state=23.
- Create 2 decision tree models, for Salary and log_Salary respectively, with scikit-learn, using random_state=23, mae as the error criterion, and 7 as the maximum number of features.<br>
- Fit your models on the training features and labels (which are stored in ind_train and dep_train).
- Find the RMSE on Salary for both models; predictions from the log_Salary model must be transformed back to the Salary scale first.
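To compare the two models fairly, both errors have to be measured on the same scale. The sketch below uses made-up salaries and made-up log-scale predictions (assumptions for illustration, not values from the Hitters data) to show the back-transformation with `np.exp` before computing the RMSE.

```python
import numpy as np

# Hypothetical true salaries and predictions made on the log scale
salary_true = np.array([100.0, 200.0, 400.0])
log_pred = np.log(np.array([110.0, 190.0, 420.0]))

# Undo the log transform so predictions are in salary units again
salary_pred = np.exp(log_pred)

# RMSE: square root of the mean squared difference
rmse = np.sqrt(np.mean((salary_true - salary_pred) ** 2))
print(f'RMSE on the salary scale: {rmse:.4f}')
```

Computing the RMSE directly between `salary_true` and `log_pred` would mix units and produce a number that cannot be compared with the RMSE of the model trained on raw salaries.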

In [37]:
'''

- Split data into training and testing sets (random_state=23)

- Create the following 2 Decision Tree models:

1. a model named wage_model using Salary as dependent variable
2. a model named log_wage_model using log_Salary as dependent variable

- Using the above models, find:
1. the rmse value using wage_model: rmse_wage
2. the rmse value using log_wage_model: rmse_logwage

'''
frac = 0.3

# Split data into training and testing sets
ind_train, ind_test, dep_train, dep_test = \
    train_test_split(x, y, test_size=frac, random_state=23)

# Use a separate regressor for each response variable: fit() returns the
# estimator itself, so fitting one object twice would leave both names
# pointing at the model that was fit last.
# Note: criterion='mae' was renamed to 'absolute_error' in later
# scikit-learn releases.
wage_model = DecisionTreeRegressor(criterion='mae', random_state=23, max_features=7)
log_wage_model = DecisionTreeRegressor(criterion='mae', random_state=23, max_features=7)

# Fit each estimator on its own response variable
wage_model.fit(ind_train, dep_train['Salary'])
log_wage_model.fit(ind_train, dep_train['log_Salary'])

# Predict on the test data
wage_pred = wage_model.predict(ind_test)
log_wage_pred = log_wage_model.predict(ind_test)

# Transform the log-scale predictions back to the Salary scale
logtransform = np.exp(log_wage_pred)

# Compute RMSE on the Salary scale for both models
rmse_wage = np.sqrt(mean_squared_error(dep_test['Salary'], wage_pred))
rmse_logwage = np.sqrt(mean_squared_error(dep_test['Salary'], logtransform))
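One detail worth keeping in mind when training on two response variables: in scikit-learn, `fit` returns the estimator itself. The sketch below, on synthetic data (an assumption for illustration), shows that fitting one object twice yields a single model, and that `sklearn.base.clone` is one way to get independent, identically configured estimators.

```python
import numpy as np
from sklearn.base import clone
from sklearn.tree import DecisionTreeRegressor

# Synthetic features and two versions of the same target (raw and log scale)
rng = np.random.default_rng(23)
X = rng.uniform(size=(50, 3))
y_raw = 100.0 + 500.0 * X[:, 0]
y_log = np.log(y_raw)

# Fitting the same object twice: both names refer to one estimator,
# which now only remembers the last fit (the log-scale target)
shared = DecisionTreeRegressor(random_state=23)
first = shared.fit(X, y_raw)
second = shared.fit(X, y_log)
same_object = first is second

# Cloning an unfitted template gives two independent estimators
template = DecisionTreeRegressor(random_state=23)
raw_model = clone(template).fit(X, y_raw)
log_model = clone(template).fit(X, y_log)

print(same_object)
print(raw_model.predict(X[:3]), log_model.predict(X[:3]))
```

With separate estimators, `raw_model` predicts on the raw scale and `log_model` on the log scale, so each can be evaluated on its own terms.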

In [39]:
assert_true(isinstance(wage_model, DecisionTreeRegressor))
assert_true(isinstance(log_wage_model, DecisionTreeRegressor))
assert_almost_equal(423.44, round(rmse_logwage,2), places = 3)

