## Day 23 Lecture 1 Assignment

In this assignment, we will explore feature selection and dimensionality reduction techniques. We will use both the FIFA ratings dataset and the Chicago traffic crashes dataset.

In [1]:
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

In [2]:
crash_data = pd.read_csv('https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/traffic_crashes_chicago.csv')
soccer_data = pd.read_csv('https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/fifa_ratings.csv')

In [3]:
soccer_data.head()

Unnamed: 0,ID,Name,Overall,Crossing,Finishing,HeadingAccuracy,ShortPassing,Volleys,Dribbling,Curve,...,LongShots,Aggression,Interceptions,Positioning,Vision,Penalties,Composure,Marking,StandingTackle,SlidingTackle
0,158023,L. Messi,94,84,95,70,90,86,97,93,...,94,48,22,94,94,75,96,33,28,26
1,20801,Cristiano Ronaldo,94,84,94,89,81,87,88,81,...,93,63,29,95,82,85,95,28,31,23
2,190871,Neymar Jr,92,79,87,62,84,84,96,88,...,82,56,36,89,87,81,94,27,24,33
3,192985,K. De Bruyne,91,93,82,55,92,82,86,85,...,91,76,61,87,94,79,88,68,58,51
4,183277,E. Hazard,91,81,84,61,89,80,95,83,...,80,54,41,87,89,86,91,34,27,22


In [4]:
crash_data.head()

Unnamed: 0,RD_NO,CRASH_DATE,POSTED_SPEED_LIMIT,TRAFFIC_CONTROL_DEVICE,DEVICE_CONDITION,WEATHER_CONDITION,LIGHTING_CONDITION,FIRST_CRASH_TYPE,TRAFFICWAY_TYPE,LANE_CNT,...,WORKERS_PRESENT_I,NUM_UNITS,MOST_SEVERE_INJURY,INJURIES_TOTAL,INJURIES_FATAL,INJURIES_INCAPACITATING,INJURIES_NON_INCAPACITATING,INJURIES_REPORTED_NOT_EVIDENT,INJURIES_NO_INDICATION,INJURIES_UNKNOWN
0,JC334993,7/4/2019 22:33,45,NO CONTROLS,NO CONTROLS,CLEAR,"DARKNESS, LIGHTED ROAD",REAR END,DIVIDED - W/MEDIAN BARRIER,,...,,,,,,,,,,
1,JC370822,7/30/2019 10:22,30,NO CONTROLS,NO CONTROLS,CLEAR,DAYLIGHT,TURNING,DIVIDED - W/MEDIAN (NOT RAISED),,...,,,,,,,,,,
2,JC387098,8/10/2019 17:00,25,NO CONTROLS,NO CONTROLS,CLEAR,DAYLIGHT,PARKED MOTOR VEHICLE,ONE-WAY,,...,,1.0,,,,,,,,
3,JC395195,8/16/2019 16:53,30,NO CONTROLS,NO CONTROLS,CLEAR,DAYLIGHT,PARKED MOTOR VEHICLE,NOT DIVIDED,,...,,1.0,NO INDICATION OF INJURY,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,JC396604,8/17/2019 16:04,30,NO CONTROLS,NO CONTROLS,CLEAR,DAYLIGHT,PARKED MOTOR VEHICLE,PARKING LOT,,...,,1.0,NO INDICATION OF INJURY,0.0,0.0,0.0,0.0,0.0,1.0,0.0


We will begin with the Chicago traffic crashes dataset, focusing on removing columns with significant missing data.

Remove all columns with more than 5% missing data from the dataframe. (The *missingness summary* function we wrote a few exercises ago will speed this process up significantly.) Print out the columns that were removed, and the proportion of missing data for each column.

In [11]:
# answer goes here

# Return a Series mapping each column name to the percentage of its values that are null
def missingness_summary(df, print_log, sort):
    s = df.isna().sum() * 100 / len(df)
    if sort == 'asc':
        s = s.sort_values(ascending=True)
    elif sort == 'desc':
        s = s.sort_values(ascending=False)
    if print_log:
        print(s)
    return s

# assign the missingness series to a variable
missing = missingness_summary(crash_data, True, 'desc')

WORKERS_PRESENT_I                99.835205
DOORING_I                        99.661554
WORK_ZONE_TYPE                   99.439054
WORK_ZONE_I                      99.293316
PHOTOS_TAKEN_I                   98.731833
STATEMENTS_TAKEN_I               97.976032
NOT_RIGHT_OF_WAY_I               95.391656
INTERSECTION_RELATED_I           77.945704
HIT_AND_RUN_I                    72.242307
LANE_CNT                         46.710683
REPORT_TYPE                       2.301220
MOST_SEVERE_INJURY                0.579465
INJURIES_NO_INDICATION            0.577586
INJURIES_TOTAL                    0.577586
INJURIES_FATAL                    0.577586
INJURIES_INCAPACITATING           0.577586
INJURIES_NON_INCAPACITATING       0.577586
INJURIES_REPORTED_NOT_EVIDENT     0.577586
INJURIES_UNKNOWN                  0.577586
NUM_UNITS                         0.375485
BEAT_OF_OCCURRENCE                0.001074
STREET_DIRECTION                  0.000537
STREET_NAME                       0.000268
TRAFFIC_CON

In [18]:
# Select the columns whose missing percentage exceeds 5
missing_too_much = missing.loc[missing > 5]

# Print it out
print(missing_too_much)

# Drop the columns from crash_data
crash_data.drop(missing_too_much.index, axis=1, inplace=True)
crash_data.columns

WORKERS_PRESENT_I         99.835205
DOORING_I                 99.661554
WORK_ZONE_TYPE            99.439054
WORK_ZONE_I               99.293316
PHOTOS_TAKEN_I            98.731833
STATEMENTS_TAKEN_I        97.976032
NOT_RIGHT_OF_WAY_I        95.391656
INTERSECTION_RELATED_I    77.945704
HIT_AND_RUN_I             72.242307
LANE_CNT                  46.710683
dtype: float64


Index(['RD_NO', 'CRASH_DATE', 'POSTED_SPEED_LIMIT', 'TRAFFIC_CONTROL_DEVICE',
       'DEVICE_CONDITION', 'WEATHER_CONDITION', 'LIGHTING_CONDITION',
       'FIRST_CRASH_TYPE', 'TRAFFICWAY_TYPE', 'ALIGNMENT',
       'ROADWAY_SURFACE_COND', 'ROAD_DEFECT', 'REPORT_TYPE', 'CRASH_TYPE',
       'DAMAGE', 'DATE_POLICE_NOTIFIED', 'PRIM_CONTRIBUTORY_CAUSE',
       'SEC_CONTRIBUTORY_CAUSE', 'STREET_NO', 'STREET_DIRECTION',
       'STREET_NAME', 'BEAT_OF_OCCURRENCE', 'NUM_UNITS', 'MOST_SEVERE_INJURY',
       'INJURIES_TOTAL', 'INJURIES_FATAL', 'INJURIES_INCAPACITATING',
       'INJURIES_NON_INCAPACITATING', 'INJURIES_REPORTED_NOT_EVIDENT',
       'INJURIES_NO_INDICATION', 'INJURIES_UNKNOWN'],
      dtype='object')
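
The same filter can also be sketched without mutating the original frame. The helper name `drop_sparse_columns` and the toy frame below are hypothetical, introduced only for illustration; the 5% threshold matches the exercise.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df, max_missing_pct=5.0):
    """Return (filtered copy, Series of dropped columns with their missing %)."""
    # isna().mean() gives the fraction of nulls per column
    missing_pct = df.isna().mean() * 100
    dropped = missing_pct[missing_pct > max_missing_pct].sort_values(ascending=False)
    kept = df.drop(columns=dropped.index)
    return kept, dropped


# Toy data (not the crash dataset): 'b' is 50% missing, 'c' is 100% missing
toy = pd.DataFrame({
    'a': [1, 2, 3, 4],
    'b': [1.0, np.nan, 3.0, np.nan],
    'c': [np.nan] * 4,
})

kept, dropped = drop_sparse_columns(toy)
print(dropped)             # removed columns and their missing percentages
print(list(kept.columns))  # columns that survived the filter
```

Returning a new frame leaves the input untouched, so the cell can be re-run safely; the in-place `drop` above raises a `KeyError` on a second run because the columns are already gone.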

Next, we will shift our focus to the FIFA ratings dataset and explore univariate feature selection techniques. We will treat "Overall" as the response and the other ratings as features.

Using the correlations between the response and features, identify the 5 features with the greatest univariate correlation to the response.

In [29]:
# answer goes here

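# Restrict to numeric columns: recent pandas versions raise on text columns such as 'Name'
soccer_corr = soccer_data.select_dtypes(include='number').corr()
# 'ID' is an identifier, not a rating, so exclude it along with the response itself
correlations = soccer_corr.loc[:, 'Overall'].drop(['Overall', 'ID'], axis=0)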

np.abs(correlations).sort_values(ascending=False).head(5)

Reactions       0.847739
Composure       0.801749
ShortPassing    0.722720
BallControl     0.717933
LongPassing     0.585104
Name: Overall, dtype: float64
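
The absolute value matters in this ranking because a strongly negative correlation is as informative as a strongly positive one. A minimal sketch on synthetic data (the column names and coefficients are made up for illustration) uses `DataFrame.corrwith` to compute only the feature-to-response correlations rather than the full matrix:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
strong = rng.normal(size=n)
negative = rng.normal(size=n)
noise = rng.normal(size=n)

# Synthetic response: depends strongly on 'strong', negatively on 'negative'
toy = pd.DataFrame({
    'strong': strong,
    'negative': negative,
    'noise': noise,
    'target': 3 * strong - 1 * negative + rng.normal(size=n),
})

features = toy.drop(columns='target')

# Correlate every feature with the response, then rank by magnitude
signed = features.corrwith(toy['target'])
ranking = signed.abs().sort_values(ascending=False)

print(signed)
print(ranking)
```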

Use sklearn's "SelectKBest" class to select the top 5 features using two different scoring metrics: f_regression and mutual_info_regression. Print out the top 5 columns selected by each. How do they compare to the ones selected by univariate correlation?

In [35]:
# answer goes here

Y = soccer_data['Overall']
X = soccer_data.drop(['Overall', 'ID', 'Name'], axis=1)

from sklearn.feature_selection import SelectKBest, f_regression, mutual_info_regression

k = 5
kbest_f_reg = SelectKBest(k=k, score_func=f_regression)
X_best_features_1 = kbest_f_reg.fit_transform(X, Y)

# fit_transform returns a NumPy array, so rebuild a DataFrame using the selected column names from get_support()
X_best_features_1 = pd.DataFrame(X_best_features_1, columns=X.columns[kbest_f_reg.get_support()])

X_best_features_1.columns

Index(['ShortPassing', 'LongPassing', 'BallControl', 'Reactions', 'Composure'], dtype='object')
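
Note that `get_support()` returns a boolean mask in the original column order, so the selected columns come back unranked. A minimal sketch on synthetic data (feature names `f0` to `f3` are hypothetical) shows how the fitted selector's `scores_` attribute can recover a ranking, and that ranking by the univariate F-statistic agrees with ranking by absolute correlation:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
n = 300
X_toy = pd.DataFrame(rng.normal(size=(n, 4)), columns=['f0', 'f1', 'f2', 'f3'])
# Synthetic response: f2 matters most, f0 matters less, f1 and f3 are noise
y_toy = 2.0 * X_toy['f2'] - 1.0 * X_toy['f0'] + rng.normal(scale=0.5, size=n)

selector = SelectKBest(score_func=f_regression, k=2).fit(X_toy, y_toy)

# Boolean mask in original column order, not in order of importance
selected = X_toy.columns[selector.get_support()]

# One F-statistic per column; sorting gives the ranking
scores = pd.Series(selector.scores_, index=X_toy.columns).sort_values(ascending=False)

# The univariate F-statistic increases with r^2, so |corr| yields the same order
abs_corr = X_toy.corrwith(y_toy).abs().sort_values(ascending=False)

print(list(selected))
print(scores)
print(list(abs_corr.index) == list(scores.index))
```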

In [36]:
Y = soccer_data['Overall']
X = soccer_data.drop(['Overall', 'ID', 'Name'], axis=1)

k = 5
kbest_mi_reg = SelectKBest(k=k, score_func=mutual_info_regression)
X_best_features_2 = kbest_mi_reg.fit_transform(X, Y)

# fit_transform returns a NumPy array, so rebuild a DataFrame using the selected column names from get_support()
X_best_features_2 = pd.DataFrame(X_best_features_2, columns=X.columns[kbest_mi_reg.get_support()])

X_best_features_2.columns

Index(['ShortPassing', 'Dribbling', 'BallControl', 'Reactions', 'Composure'], dtype='object')

In [41]:
print(np.abs(correlations).sort_values(ascending=False).head(5).index)
print(X_best_features_1.columns)
print(X_best_features_2.columns)

print('\nThe first two methods came up with the same features, but in different orders.')
print('The last method yielded the \'Dribbling\' feature instead of \'LongPassing\'')

Index(['Reactions', 'Composure', 'ShortPassing', 'BallControl', 'LongPassing'], dtype='object')
Index(['ShortPassing', 'LongPassing', 'BallControl', 'Reactions', 'Composure'], dtype='object')
Index(['ShortPassing', 'Dribbling', 'BallControl', 'Reactions', 'Composure'], dtype='object')

The first two methods came up with the same features, but in different orders.
The last method yielded the 'Dribbling' feature instead of 'LongPassing'
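
The comparison can also be made with set operations, which ignore ordering. The sketch below hard-codes the three feature lists printed above; the variable names are introduced here for illustration.

```python
# Feature names copied from the three outputs above
corr_top5 = {'Reactions', 'Composure', 'ShortPassing', 'BallControl', 'LongPassing'}
f_reg_top5 = {'ShortPassing', 'LongPassing', 'BallControl', 'Reactions', 'Composure'}
mi_top5 = {'ShortPassing', 'Dribbling', 'BallControl', 'Reactions', 'Composure'}

# Sets compare by membership only, so ordering differences disappear
same_as_corr = corr_top5 == f_reg_top5
shared_by_all = sorted(corr_top5 & f_reg_top5 & mi_top5)
only_in_mi = sorted(mi_top5 - f_reg_top5)
only_in_f_reg = sorted(f_reg_top5 - mi_top5)

print(same_as_corr)
print(shared_by_all)
print(only_in_mi, only_in_f_reg)
```

The agreement between correlation and `f_regression` is expected rather than coincidental: the univariate F-statistic is a monotonic function of the squared correlation. Mutual information can also pick up nonlinear dependence, which may be why it prefers 'Dribbling' here.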


Shifting our focus from feature selection to dimensionality reduction, perform PCA on the ratings provided, excluding "Overall". Then, answer the following questions:

- What percentage of the total variance is captured by the first component? What about the first two, or first three?
- Looking at the components themselves, how would you interpret the first two components in plain English?

In [52]:
# answer goes here

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

Y = soccer_data['Overall']
X = soccer_data.drop(['Overall', 'ID', 'Name'], axis=1)

k = 5
pca = PCA(n_components=k)

principalComponents = pca.fit_transform(X)

principalDf = pd.DataFrame(principalComponents, 
        columns=['principal component {}'.format(n + 1) for n in range(k)])

principalDf.head()

Unnamed: 0,principal component 1,principal component 2,principal component 3,principal component 4,principal component 5
0,149.506331,43.713614,0.752653,-1.024545,1.239662
1,132.251163,45.030476,35.464595,-40.51368,-1.442205
2,133.728951,38.197314,-7.738964,-6.082813,4.220901
3,96.639506,94.488071,8.74432,4.223451,9.017821
4,129.262864,39.657661,-5.819084,-8.254157,11.778411


In [58]:
for i in range(3):
    print('Percentage of variance captured by component {} is {}'.format(i + 1,
                                                         pca.explained_variance_ratio_[i]))

Percentage of variance captured by component 1 is 0.39592940781178054
Percentage of variance captured by component 2 is 0.2633194816048358
Percentage of variance captured by component 3 is 0.08504495241196014


Principal component 1 is a synthetic ('fake') feature, a linear combination of the original ratings, that captures about 40% of the total variance across all features in the dataset.
Principal component 2 is a second synthetic feature that captures an additional 26% of the total variance, so the first two components together capture about 66% and the first three about 74%.
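
To interpret a component in plain English, one common approach is to inspect its loadings in `pca.components_`: the original features with the largest absolute weights indicate what the component measures. The sketch below uses synthetic data built from two hypothetical latent skills, not the FIFA ratings, so the column names are borrowed only for readability.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n = 400
attack = rng.normal(size=n)  # hypothetical latent "attacking" skill
defend = rng.normal(size=n)  # hypothetical latent "defending" skill

# Toy ratings: two columns driven by each latent skill, plus a little noise
toy = pd.DataFrame({
    'Finishing': 3 * attack + rng.normal(scale=0.3, size=n),
    'Dribbling': 3 * attack + rng.normal(scale=0.3, size=n),
    'Marking': 2 * defend + rng.normal(scale=0.3, size=n),
    'StandingTackle': 2 * defend + rng.normal(scale=0.3, size=n),
})

pca_toy = PCA(n_components=2).fit(toy)

# Rows of components_ are components; transpose so rows are original features
loadings = pd.DataFrame(pca_toy.components_.T, index=toy.columns, columns=['PC1', 'PC2'])
print(loadings.round(2))

# Features with the largest absolute weight describe each component
top_pc1 = loadings['PC1'].abs().sort_values(ascending=False).head(2).index
top_pc2 = loadings['PC2'].abs().sort_values(ascending=False).head(2).index
print(list(top_pc1), list(top_pc2))
```

Applied to the fitted `pca` object above, the same table would use `X.columns` as the index; the sign of a whole component is arbitrary, so only relative signs and magnitudes within a component carry meaning.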

