## This notebook is part of the Georgetown University Data Science Project - Team Ship Happen

## The purpose of this notebook is feature selection

Remember: our goal is to find the smallest subset of the available features with which the fitted model still reaches its maximal predictive value.


### Import

In [1]:
import os
import zipfile
import requests
import pandas as pd

from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

### Load incident data into a pandas DataFrame

In [21]:
incident = pd.read_csv(os.path.abspath('mvinjury_data.txt'), sep='\t')

In [22]:
incident.head()

Unnamed: 0,gross_ton,vlength,vdepth,vessel_class,vessel_age,route_type,mvaccident
0,159.0,89.3,12.0,8,70.0,10,1
1,0.0,250.0,10.5,1,65.0,10,1
2,9876.0,459.8,36.2,1,22.0,7,1
3,1830.0,284.0,11.2,1,53.0,4,1
4,1983.0,209.5,12.5,6,35.0,10,1


### Separate dataframe into features and targets

In [23]:
features = incident[['gross_ton', 'vlength', 'vdepth', 'vessel_class', 'vessel_age','route_type']]
labels   = incident['mvaccident']

In [24]:
list(features)

['gross_ton', 'vlength', 'vdepth', 'vessel_class', 'vessel_age', 'route_type']

# Regularization techniques

## LASSO (L1 Regularization)

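LASSO shrinks the coefficients of weak features to exactly zero, effectively dropping the least predictive features. How many features survive depends on the penalty strength `alpha`; the cell below uses the default, `alpha=1.0`.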

In [10]:
model = Lasso()
model.fit(features, labels)
print(list(zip(features, model.coef_.tolist())))

[('gross_ton', -1.1660091146857848e-07), ('vlength', -0.0), ('vdepth', -0.0), ('vessel_class', -0.0), ('vessel_age', 0.0), ('route_type', 0.0)]
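
The near-total zeroing above is likely tied to the default penalty strength and to the unscaled features. As a minimal sketch on synthetic data (the columns, scales, and `alpha` values below are illustrative assumptions, not values from this project), standardizing the features and lowering `alpha` lets a weaker but real signal survive:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
n = 500

# Hypothetical stand-in data: three independent columns on very different scales
Z = rng.normal(size=(n, 3))
X = Z * np.array([5000.0, 100.0, 1.0])

# The target depends strongly on column 0, weakly on column 1, not at all on column 2
y = 3.0 * Z[:, 0] + 0.5 * Z[:, 1] + rng.normal(scale=0.1, size=n)

# Put every column on unit variance so the L1 penalty treats them alike
X_scaled = StandardScaler().fit_transform(X)

# Fit once with a strong penalty and once with a weak one
coefs = {}
for alpha in (1.0, 0.01):
    coefs[alpha] = Lasso(alpha=alpha).fit(X_scaled, y).coef_
    print(alpha, np.round(coefs[alpha], 3))
```

With the strong penalty only the dominant column keeps a nonzero weight; with the weaker penalty the second column is retained as well.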


## Ridge Regression (L2 Regularization)
Ridge assigns every feature a nonzero weight but spreads the coefficient values more evenly, shrinking the less predictive features without dropping them.

In [11]:
model = Ridge()
model.fit(features, labels)
print(list(zip(features, model.coef_.tolist())))

[('gross_ton', 9.064740466184525e-08), ('vlength', -9.058693645562245e-06), ('vdepth', -3.5814870364876126e-06), ('vessel_class', -0.0009251451434086009), ('vessel_age', 7.094743250644458e-06), ('route_type', 0.00788699282869464)]
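
To illustrate the shrink-but-keep behavior, here is a small sketch on synthetic data (the data and the `alpha` grid are assumptions chosen for illustration). As `alpha` grows the coefficient vector gets shorter, yet no entry is set exactly to zero:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)

# Hypothetical data: column 0 is a strong signal, column 1 a weak one, column 2 is noise
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

coefs = {}
norms = {}
for alpha in (0.1, 10.0, 1000.0):
    coefs[alpha] = Ridge(alpha=alpha).fit(X, y).coef_
    norms[alpha] = np.linalg.norm(coefs[alpha])  # length of the coefficient vector
    print(alpha, np.round(coefs[alpha], 4), round(norms[alpha], 4))
```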


## ElasticNet
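ElasticNet penalizes the model with a weighted combination of the L1 and L2 terms, so it sits between LASSO and Ridge. The `l1_ratio` parameter sets the mix: 1.0 is a pure L1 penalty, and smaller values such as the 0.10 used below shift most of the weight to L2.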

In [12]:
model = ElasticNet(l1_ratio=0.10)
model.fit(features, labels)
print(list(zip(features, model.coef_.tolist())))

[('gross_ton', -6.841538744242996e-09), ('vlength', -1.135450668664708e-05), ('vdepth', -0.0), ('vessel_class', -0.0), ('vessel_age', 0.0), ('route_type', 0.0)]
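
One way to see how `l1_ratio` moves ElasticNet between the two penalties: at `l1_ratio=1.0` the L2 term vanishes and the fit should coincide with Lasso at the same `alpha`. A minimal check on synthetic data (the data and `alpha` are assumptions chosen only for illustration):

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.RandomState(0)

# Hypothetical data: two informative columns and two noise columns
X = rng.normal(size=(200, 4))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
enet_l1 = ElasticNet(alpha=0.1, l1_ratio=1.0).fit(X, y)   # pure L1 penalty
enet_mix = ElasticNet(alpha=0.1, l1_ratio=0.1).fit(X, y)  # mostly L2 penalty

print("lasso:           ", np.round(lasso.coef_, 4))
print("enet l1_ratio=1: ", np.round(enet_l1.coef_, 4))
print("enet l1_ratio=.1:", np.round(enet_mix.coef_, 4))
```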


## Transformer methods
### SelectFromModel()

Scikit-learn's SelectFromModel is a meta-transformer that selects features based on importance weights, here the absolute values of each fitted model's coefficients.

In [13]:
model = Lasso()
sfm = SelectFromModel(model)
sfm.fit(features, labels)
print(list(features.columns[sfm.get_support()]))  # names of the selected columns

[]


In [14]:
model = Ridge()
sfm = SelectFromModel(model)
sfm.fit(features, labels)
print(list(features.columns[sfm.get_support()]))  # names of the selected columns

['route_type']


In [15]:
model = ElasticNet()
sfm = SelectFromModel(model)
sfm.fit(features, labels)
print(list(features.columns[sfm.get_support()]))  # names of the selected columns

['vlength']
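
The selections above depend on the cutoff that SelectFromModel applies to the coefficients. For an estimator such as Ridge the default cutoff is the mean absolute coefficient, and the fitted value is exposed as `threshold_`. The sketch below uses synthetic data with hypothetical column names (`a` to `d`) to show the cutoff next to the importances it is compared against:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)

# Hypothetical data: "a" and "b" drive the target, "c" and "d" are noise
X = pd.DataFrame(rng.normal(size=(200, 4)), columns=["a", "b", "c", "d"])
y = 2.0 * X["a"] - 1.0 * X["b"] + rng.normal(scale=0.1, size=200)

sfm = SelectFromModel(Ridge()).fit(X, y)

# Importances are the absolute coefficients of the fitted inner estimator
importances = np.abs(sfm.estimator_.coef_)
selected = list(X.columns[sfm.get_support()])

print("threshold:", sfm.threshold_)
print("importances:", dict(zip(X.columns, np.round(importances, 4))))
print("selected:", selected)
```

Passing an explicit `threshold` to SelectFromModel is one way to make the cutoff a deliberate choice rather than a default.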


## Dimensionality reduction

### Principal component analysis (PCA)

PCA performs linear dimensionality reduction using the Singular Value Decomposition (SVD) of the data, keeping only the most significant singular vectors to project the data into a lower-dimensional space.

- Unsupervised method
- Uses a signal representation criterion
- Identifies the combinations of attributes that account for the most variance in the data

In [16]:
pca = PCA(n_components=2)
new_features = pca.fit(features).transform(features)
print(new_features)

[[-3716.75904752     9.22059303]
 [-3874.14054158   171.32492014]
 [ 6003.49063603   282.98540402]
 ..., 
 [-3862.3411009    -48.59537522]
 [-3867.28788895   -42.729949  ]
 [-3863.33198755   -48.09081489]]
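
Because PCA ranks directions by variance, a column measured on a much larger scale can dominate the leading component. The first column of the projection above sits in the thousands, on the same scale as `gross_ton`, which suggests that may be happening here. The sketch below uses synthetic data (the scales are assumptions picked to mimic that situation) and compares `explained_variance_ratio_` before and after standardizing:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)

# Hypothetical data: three independent columns on very different scales
X = rng.normal(size=(500, 3)) * np.array([5000.0, 100.0, 1.0])

pca_raw = PCA(n_components=2).fit(X)
pca_scaled = PCA(n_components=2).fit(StandardScaler().fit_transform(X))

# Share of total variance captured by each retained component
print("raw:   ", np.round(pca_raw.explained_variance_ratio_, 4))
print("scaled:", np.round(pca_scaled.explained_variance_ratio_, 4))
```

On the raw data the first component is almost entirely the large-scale column; after standardizing, the variance is shared far more evenly across components.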


### Linear discriminant analysis (LDA)

LDA is a classifier with a linear decision boundary, generated by fitting class-conditional densities to the data and applying Bayes' rule. The model fits a Gaussian density to each class, assuming that all classes share the same covariance matrix. It can also be used to reduce the dimensionality of the input by projecting it onto the most discriminative directions.

- Supervised method
- Uses a classification criterion
- Tries to identify the attributes that account for the most variance between classes

In [17]:
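# LDA yields at most n_classes - 1 components; the output below has a single column,
# and recent scikit-learn versions reject a larger n_components
lda = LDA(n_components=1)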
new_features = lda.fit(features, labels).transform(features)
print(new_features)

[[ 0.28579206]
 [ 0.66647869]
 [-1.25107163]
 ..., 
 [-0.18863414]
 [-0.81867291]
 [ 0.29352174]]
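
LDA can return at most `min(n_classes - 1, n_features)` discriminant components, which is consistent with the single-column projection above if `mvaccident` is binary. A minimal sketch on synthetic two-class data (the data is an assumption for illustration):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.RandomState(0)
n_per_class = 150

# Hypothetical data: two Gaussian classes with shifted means, four features
X = np.vstack([
    rng.normal(loc=0.0, size=(n_per_class, 4)),
    rng.normal(loc=1.0, size=(n_per_class, 4)),
])
y = np.array([0] * n_per_class + [1] * n_per_class)

# Upper bound on the number of discriminant directions
max_components = min(len(np.unique(y)) - 1, X.shape[1])

lda = LinearDiscriminantAnalysis(n_components=max_components)
projected = lda.fit_transform(X, y)
print(max_components, projected.shape)
```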


In [25]:
# Save the cleaned data (without a header row) to a file for use in model selection
incident.to_csv('mvinjury_data_final.txt', header=False, sep='\t', index=False)