In [1]:
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf #needed for models in this script
import pylab as pl
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

In [2]:
pd.set_option('display.notebook_repr_html', True)  # render DataFrames as HTML tables in the notebook
%matplotlib inline

## Decision Trees Overview

Consider a pool of college applicants. The average SAT score for admission has historically been 2,200, and the average GPA 4.9. We are given the application info for 1,000 applicants and asked to create a model that predicts which students are most likely to be admitted. How do we go about doing this?

One approach is to first split the applicants by SAT score and call those who scored over 2,200 the "more likely" group. We then split this group by GPA: applicants with a GPA over 4.9 are "most likely", and those with a GPA of 4.9 or less are a "high maybe". We do the same for the group with SAT scores of 2,200 or below, calling the high-GPA subgroup a "maybe" and the low-GPA subgroup a "probably not".

The following is an example of how the decision tree looks for this problem.

![](files/dtree1.jpg)
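The two-level split described above can be written as a pair of nested conditionals. This is a minimal sketch: the function name `classify_applicant` and its parameter names are made up for illustration, and the default cutoffs are simply the historical averages quoted in the text.

```python
def classify_applicant(sat, gpa, sat_cutoff=2200, gpa_cutoff=4.9):
    """Return the group label for one applicant using the SAT-then-GPA split."""
    if sat > sat_cutoff:
        # "More likely" branch: split further on GPA.
        return "most likely" if gpa > gpa_cutoff else "high maybe"
    # SAT at or below the cutoff: split on GPA in the same way.
    return "maybe" if gpa > gpa_cutoff else "probably not"


print(classify_applicant(2300, 5.0))
print(classify_applicant(2100, 4.5))
```

Every applicant ends up in exactly one of the four leaves, which is what makes this a tree rather than a set of overlapping rules.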

1. What do you think would happen if we split on GPA first and then on SAT scores? Would we get the same groupings? (i.e., what is the best way to split?)
2. What if we used more criteria, such as essay evaluation scores, extracurriculars, awards, and distinctions in sports? (i.e., how many attributes should we use to create splits, and what are the most significant attributes?)
3. We were given averages, but what about the spread and the outliers? (i.e., how does the distribution of attributes affect misclassification?)

A <u>decision tree</u> uses the intrinsic structure of the data to make these splits. Decision trees are commonly used in operations research, specifically in decision analysis, to help identify a strategy most likely to reach a goal. In machine learning, decision trees are commonly used to help identify features, and specific values of those features, that are most likely to result in a target value. If the target value is categorical, the model is a classification tree; if the target value is continuous, the model is a regression tree.
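To make "the best way to split" concrete, one common scoring rule is Gini impurity: a split is better when the groups it produces are closer to containing a single class. The sketch below assumes that criterion (the lesson has not named one at this point), and the six applicant records are invented purely for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means perfectly pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_impurity(rows, labels, feature, threshold):
    """Weighted Gini impurity after splitting on rows[i][feature] > threshold."""
    left = [lab for row, lab in zip(rows, labels) if row[feature] <= threshold]
    right = [lab for row, lab in zip(rows, labels) if row[feature] > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


# Invented applicants, for illustration only.
rows = [
    {"sat": 2300, "gpa": 4.5},
    {"sat": 2350, "gpa": 5.0},
    {"sat": 2250, "gpa": 4.8},
    {"sat": 2100, "gpa": 5.0},
    {"sat": 2000, "gpa": 4.7},
    {"sat": 2150, "gpa": 4.6},
]
labels = ["admit", "admit", "admit", "reject", "reject", "reject"]

sat_score = split_impurity(rows, labels, "sat", 2200)
gpa_score = split_impurity(rows, labels, "gpa", 4.9)
print("SAT split impurity:", sat_score)
print("GPA split impurity:", gpa_score)
```

In this toy data the SAT split separates the two classes completely while the GPA split does not, so a tree built with this criterion would split on SAT first. With different data the order could just as easily be reversed.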

In this lesson, we're going to focus on classification trees.

## Data Cleaning and Exploration

Download and unzip UCI HAR Dataset.zip from: https://github.com/shubhabrataroy/Thinkful/tree/master/Data/Unit4.2.2

 * Read through features_info.txt (https://www.dropbox.com/s/mnj4x46z7jb3xw7/features_info.txt?dl=0). This file describes each feature, its physical significance, and also describes features that are derived from raw data by doing some averaging, or sampling, or some operation that gives a numerical result. What do you notice about the dataset? What kind of guidelines for approaching the dataset do you find?
 
 * In static activities (sit, stand, lie down) motion information will not be very useful.
 
 * In the dynamic activities (3 types of walking) motion will be significant.
 
 * Angle variables will be useful both in differentiating “lie vs stand” and “walk up vs walk down”.
 
 * Acceleration and Jerk variables are important for distinguishing various kinds of motion and for filtering out random tremors during static activities.
 
 * Mag and Angle variables contain the same info as (i.e., are strongly correlated with) XYZ variables. We choose to focus on the latter as they are simpler to reason about. This is a very important point to understand, as it results in the elimination of a few hundred variables.
 
 * We ignore the band variables as we have no simple way to interpret the meaning and relate them to physical activities.
 
 * mean and std are important, and skewness and kurtosis may be as well, so we include all of these.
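The guidelines above amount to a filter on column names. The following is a rough sketch of such a filter, applied to a handful of names taken from the header of the data file. The helper `keep_column` is hypothetical, and the `"bands"` token is a guess at how the band variables are named.

```python
# Statistics we want to keep, per the guidelines.
KEEP_STATS = ("mean()", "std()", "skewness()", "kurtosis()")
# Name fragments that mark variables we set aside ("bands" is a guess).
DROP_TOKENS = ("Mag", "angle", "bands")


def keep_column(name):
    """Return True if a feature column passes the guidelines."""
    if any(token in name for token in DROP_TOKENS):
        return False
    return any(stat in name for stat in KEEP_STATS)


columns = [
    "tBodyAcc-mean()-X",
    "tBodyAcc-std()-X",
    "tBodyAcc-mad()-X",
    "fBodyBodyGyroJerkMag-kurtosis()",
    "angle(X,gravityMean)",
]
selected = [name for name in columns if keep_column(name)]
print(selected)
```

Note that a filter like this only looks at feature columns; `subject` and `activity` would have to be kept separately.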

<b>Clean-Up:</b> for each of the tasks below, think about how you will fix the data and what the implications/ramifications of your fixes will be. Also, can you think of a way to accomplish several of the cleaning tasks at once?

* Identify and fix the inclusion of ( ) in column names. <b>Think I have this one</b>
* Identify and remove duplicate column names. <b>Do I need a function that loops over the column names and appends duplicates to a list? Then I use drop function to drop that list of duplicates?</b>
* Identify and fix the inclusion of ‘-’ in column names. <b>Think I have this one</b>
* Identify and fix extra ) in some column names. <b>Think I have this one</b>
* Identify and fix inclusion of multiple ‘,’ in column names. <b>Not Sure</b>
* Identify and fix column names containing “BodyBody” <b>Think I have this one</b>
* Drop 'Body' and 'Mag' from column names. <b>Think I have this one</b>
* Map 'mean' and 'std' to 'Mean' and 'STD' <b>Think I have this one</b>
* Make 'activity' a categorical variable.
* Plot a histogram of Body Acceleration Magnitude for each of the 6 activities to see how well this variable separates static from dynamic activities.
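Two of the tasks above, removing duplicate column names and making `activity` categorical, do not need an explicit loop in pandas. The sketch below uses a tiny made-up frame with placeholder column names (`a`, `b`), not the real data. It keeps the first occurrence of each duplicated name, which is an assumption about what "remove" should mean here.

```python
import pandas as pd

# Toy frame with one duplicated column name; values are placeholders.
toy = pd.DataFrame(
    [[1.0, 2.0, 3.0, "standing"], [4.0, 5.0, 6.0, "sitting"]],
    columns=["a", "b", "a", "activity"],
)

# columns.duplicated() marks the second and later occurrences of each name.
deduped = toy.loc[:, ~toy.columns.duplicated()].copy()

# Convert the label column to pandas' categorical dtype.
deduped["activity"] = deduped["activity"].astype("category")

print(list(deduped.columns))
print(deduped["activity"].dtype)
```

Dropping by name with `drop` would remove every column that shares the name, including the first, so a boolean mask over the columns is the safer route when one copy should survive.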

<b>Open Data:</b>

In [11]:
frame = pd.read_csv('samsungdata.csv')
frame.head(1)

Unnamed: 0.1,Unnamed: 0,tBodyAcc-mean()-X,tBodyAcc-mean()-Y,tBodyAcc-mean()-Z,tBodyAcc-std()-X,tBodyAcc-std()-Y,tBodyAcc-std()-Z,tBodyAcc-mad()-X,tBodyAcc-mad()-Y,tBodyAcc-mad()-Z,...,fBodyBodyGyroJerkMag-kurtosis(),"angle(tBodyAccMean,gravity)","angle(tBodyAccJerkMean),gravityMean)","angle(tBodyGyroMean,gravityMean)","angle(tBodyGyroJerkMean,gravityMean)","angle(X,gravityMean)","angle(Y,gravityMean)","angle(Z,gravityMean)",subject,activity
0,1,0.288585,-0.020294,-0.132905,-0.995279,-0.983111,-0.913526,-0.995112,-0.983185,-0.923527,...,-0.710304,-0.112754,0.0304,-0.464761,-0.018446,-0.841247,0.179941,-0.058627,1,standing
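The `Unnamed: 0.1` and `Unnamed: 0` columns in the output above look like row indices that were written out when the frame was saved to CSV. That is an inference from the names and values, not something the dataset documentation states. If so, they carry no information and can be dropped. A minimal sketch on a one-row excerpt of the file:

```python
import io

import pandas as pd

# A short excerpt of the header and first row shown above.
csv_text = (
    "Unnamed: 0.1,Unnamed: 0,tBodyAcc-mean()-X,subject,activity\n"
    "0,1,0.288585,1,standing\n"
)
sample = pd.read_csv(io.StringIO(csv_text))

# Collect the leftover index columns by name and drop them.
leftover = [name for name in sample.columns if name.startswith("Unnamed")]
sample = sample.drop(columns=leftover)

print(list(sample.columns))
```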


<b>Clean Data Column Headers:</b>

In [None]:
def fix_column_headers(dataframe):
    """Clean up the column names of `dataframe` in place."""
    # (old, new) pairs, applied in order to every column name.
    replacements = [
        ('),', ','),            # remove the extra ')' in 'angle(tBodyAccJerkMean),gravityMean)'
        ('()', ''),             # remove '()' from column names
        ('-', ''),              # remove '-' from column names
        ('BodyBody', 'Body'),   # collapse the doubled 'BodyBody'
        ('Body', ''),           # drop 'Body'
        ('Mag', ''),            # drop 'Mag'
        ('mean', 'Mean'),       # map 'mean' to 'Mean'
        ('std', 'STD'),         # map 'std' to 'STD'
    ]

    def clean(name):
        for old, new in replacements:
            name = name.replace(old, new)
        return name

    dataframe.rename(columns=clean, inplace=True)
    