# Mini-Lab: Logistic Regression and SVMs

Names:
Dylan Scott
Jobin Joseph
Nnenna Okpara
Satvik Ajmera

Instructions:
You are to perform predictive analysis (classification) upon a data set: model the dataset using
methods we have discussed in class (logistic regression and support vector machines) and draw
conclusions from the analysis. Follow the CRISP-DM framework in your analysis (you are not
performing all of the CRISP-DM outline, only the portions relevant to the grading rubric outlined
below). This report is worth 10% of the final grade. You may complete this assignment in teams of
as many as three people.

Write a report covering all the steps of the project. The format of the document can be PDF,
*.ipynb, or HTML. You can write the report in whatever format you like, but it is easiest to turn in the
rendered iPython notebook. The results should be reproducible using your report. Please carefully
describe every assumption and every step in your report.

SVM and Logistic Regression Modeling
• [50 points] Create a logistic regression model and a support vector machine model for the
classification task involved with your dataset. Assess how well each model performs (use
80/20 training/testing split for your data). Adjust parameters of the models to make them more
accurate. If your dataset size requires the use of stochastic gradient descent, then linear kernel
only is fine to use.

[pick performance stats]

• [10 points] Discuss the advantages of each model for each classification task. Does one type
of model offer superior performance over another in terms of prediction accuracy? In terms of
training time or efficiency? Explain in detail.

• [30 points] Use the weights from logistic regression to interpret the importance of different
features for each classification task. Explain your interpretation in detail. Why do you think
some variables are more important?

• [10 points] Look at the chosen support vectors for the classification task. Do these provide
any insight into the data? Explain.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='darkgrid')

import plotly.express as px
import plotly.graph_objects as go

### Dataset add-on
Since submitting the first project, we have added more data from the NTSB website. We merged in new columns using a join and appended more recent records. This gives us more variables, but some of the added rows need cleaning, which the next section covers.
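
As a rough illustration of the join-then-append step described above, here is a minimal sketch on made-up frames. The frame names (`events`, `extra_cols`, `recent`) and all values are hypothetical; only `ev_id`, `damage`, and `rwy_len` are column names taken from this notebook, and using `ev_id` as the join key is an assumption.

```python
import pandas as pd

# Hypothetical stand-ins for the original events table, the extra NTSB columns,
# and the more recent records. All values are made up.
events = pd.DataFrame({"ev_id": ["A1", "A2", "A3"], "damage": ["SUBS", "DEST", "MINR"]})
extra_cols = pd.DataFrame({"ev_id": ["A1", "A2"], "rwy_len": [3000.0, 5200.0]})
recent = pd.DataFrame({"ev_id": ["A4"], "damage": ["SUBS"], "rwy_len": [4100.0]})

# Left join keeps every original event, even when no extra columns were found.
# validate raises if ev_id is unexpectedly duplicated on either side.
merged = events.merge(extra_cols, on="ev_id", how="left", validate="one_to_one")

# Append the more recent rows; ignore_index rebuilds a clean 0..n-1 index.
combined = pd.concat([merged, recent], ignore_index=True)

print(combined.shape)
print(combined["rwy_len"].isna().sum())  # rows that the join could not fill
```

Rows that the join could not fill show up as missing values, which is why a clean-up pass is needed afterwards.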

In [None]:
#Read in the Aviation Data
final_data = pd.read_csv("Data/final_data.csv",low_memory=False,dtype={'damage': str})
#Delete columns that were imported incorrectly
del final_data["Unnamed: 0"]
del final_data["dprt_state.1"]
del final_data["index"]
del final_data["ntsb_no_x"]
final_data.info()

# Checking Data Cleaning

In [None]:
#It looks like we have some missing values and inconsistent variants of UNK in the damage column
#combine all injuries, including those on the ground
#check sky_cond_ceil, sky_cond_nonceil
#check U vs UNK for wind_vel_ind
#check flight crew
finaldamagecount = final_data["damage"].value_counts().reset_index()
finaldamagecount.head(50)

In [None]:
#some cities are inconsistent since some are upper case and some are lower case, so convert them all to upper case
final_data['ev_city'] = final_data['ev_city'].str.upper()
ev_city_fix = final_data["ev_city"].value_counts().reset_index()
ev_city_fix.head(10)

In [None]:
final_data.loc[final_data['damage'].str.contains('UNK', na=False), 'damage'] = 'UNK'
finaldamagecount = final_data["damage"].value_counts().reset_index()
finaldamagecount.head(50)

In [None]:
#checking to see if wind_vel_ind has a mismatch between U and UNK
wind_count = final_data["wind_vel_ind"].value_counts().reset_index()
wind_count.head(50)

In [None]:
#dealing with unknowns
#for some columns we can't simply replace the blank value with "Unknown" or 0s since that would skew our data:
#'cert_max_gr_wt','afm_hrs_last_insp','rwy_len','rwy_width'
#we have elected to remove any rows where these columns are blank. This helps focus our data and still leaves us with an ample amount of data
final_data.dropna(subset=['cert_max_gr_wt','afm_hrs_last_insp','rwy_len','rwy_width'],inplace=True)

In [None]:
#rename the injuries columns to make them easier to read
final_data = final_data.rename(columns={"inj_tot_f": "Total_Fatal_Injuries", "inj_tot_s": "Total_Serious_Injuries","inj_tot_m":"Total_Minor_Injuries","inj_tot_n":'Total_Uninjured',"inj_tot_t":"Total_Injuries_Flight"})

#fill in 0s when there wasn't an injury in that category
final_data.update(final_data[['Total_Fatal_Injuries','Total_Serious_Injuries','Total_Minor_Injuries','Total_Uninjured','Total_Injuries_Flight','inj_f_grnd','inj_m_grnd','inj_s_grnd']].fillna(0))
final_data.head()

In [None]:
#set missing variables to Unknown in order to run our models
final_data.update(final_data.fillna("UNK"))
final_data.info()

We will be using code from this class's GitHub repository:
https://github.com/jakemdrew/DataMiningNotebooks/blob/master/04.%20Logits%20and%20SVM.ipynb

In [None]:
#we want to account for ALL injuries. This includes injuries on the ground as well as to passengers
#Here we make a new column that shows total injuries, including those on the ground
final_data['Total_Injuries_Ground'] = final_data['inj_f_grnd']+final_data['inj_m_grnd']+final_data['inj_s_grnd']
final_data['Total_Injuries'] = final_data['Total_Injuries_Ground']+final_data['Total_Injuries_Flight']
final_data.head()

In [None]:
#create a new injured-or-not column to get a binary response
#1 means at least one person was hurt, 0 means no one was hurt
final_data['Injury'] = np.where(final_data['Total_Injuries'] >0,1,0)
injuries = final_data["Injury"].value_counts().reset_index()
injuries.head(50)

In [None]:
final_data.info()

In [None]:
#work on a copy so final_data stays intact
final_df = final_data.copy()
#Since Injury is derived from the injury counts, we don't need the other injury count columns: they are collinear with our response variable
#ev_id and dprt_city are dropped as well
final_df = final_df.drop(['Total_Fatal_Injuries','Total_Serious_Injuries','Total_Minor_Injuries','Total_Uninjured','Total_Injuries_Flight','inj_f_grnd','inj_m_grnd','inj_s_grnd','Total_Injuries_Ground',"Total_Injuries","ev_id", "dprt_city"],axis = 1)

In [None]:
X = final_df.drop("Injury", axis = 1).copy()
y = final_df["Injury"].copy()

Do not run this cell!!

In [None]:
# from sklearn.preprocessing import StandardScaler, OneHotEncoder

# # Define which columns should be encoded vs scaled
# #Categorical columns to convert to one hot encoding
# columns_to_encode = ['acft_make',"acft_model","acft_category", "damage",
#                      "far_part","type_fly","dprt_state",
#                      "ev_type","ev_city","ev_state","ev_country",
#                      "ev_highest_injury","sky_cond_ceil",
#                      "sky_cond_nonceil","wind_vel_ind","wx_int_precip",
#                      "phase_flt_spec"]
# #Continuous columns to be scaled
# columns_to_scale  = ['cert_max_gr_wt', 'afm_hrs_last_insp',
#                      'rwy_len',"rwy_width"]

# # Instantiate encoder/scaler
# scaler = StandardScaler()
# ohe = OneHotEncoder(drop="first")

# # Scale and Encode Separate Columns (each transformer only sees its own columns)
# a = scaler.fit_transform(X[columns_to_scale])
# b = ohe.fit_transform(X[columns_to_encode])  # sparse matrix
# # Concatenate (Column-Bind) Processed Columns Back Together
# # b is sparse, so use scipy.sparse.hstack rather than np.concatenate
# from scipy import sparse
# c = sparse.hstack([a, b]).tocsr()

# One hot encoding

In [None]:
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(drop="first")
encoder.fit(X)
X = encoder.transform(X)
X

In [None]:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(y)
y = le.transform(y)
y

# Train/test split

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

In [None]:
from sklearn.preprocessing import StandardScaler

scl_obj = StandardScaler(with_mean=False)
scl_obj.fit(X_train) 

X_train_scaled = scl_obj.transform(X_train) 
X_test_scaled = scl_obj.transform(X_test)

In [None]:
from sklearn.linear_model import LogisticRegression
logisticRegr = LogisticRegression(class_weight="balanced")
logisticRegr.fit(X_train, y_train)
y_hat = logisticRegr.predict(X_test)

In [None]:
from sklearn import metrics as mt
acc = mt.accuracy_score(y_test,y_hat)
conf = mt.confusion_matrix(y_test,y_hat)
print(acc)
print(conf)

final_data

Here we checked whether any predictor was too closely tied to the response. The first model reached about 99% accuracy, which told us a variable was leaking the answer: ev_highest_injury, as its name suggests, records the most severe injury in the event, so it effectively encodes the Injury label and should be treated as unknown at prediction time. We removed ev_highest_injury, which fixes the problem of an unrealistically high accuracy.
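
One quick way to screen for this kind of leakage is to ask how well each column predicts the response on its own. The sketch below is an illustration on made-up rows, not the check we actually ran. The helper `single_column_accuracy` and the toy values are hypothetical.

```python
import pandas as pd

# Made-up rows; only the column names come from this notebook.
toy = pd.DataFrame({
    "ev_highest_injury": ["NONE", "FATL", "NONE", "MINR", "SERS", "NONE"],
    "damage": ["SUBS", "SUBS", "DEST", "DEST", "SUBS", "DEST"],
    "Injury": [0, 1, 0, 1, 1, 0],
})

def single_column_accuracy(frame, column, target="Injury"):
    """Accuracy of predicting the target from the majority class within each category."""
    majority = frame.groupby(column)[target].agg(lambda s: s.mode().iloc[0])
    predicted = frame[column].map(majority)
    return (predicted == frame[target]).mean()

scores = {col: single_column_accuracy(toy, col) for col in ["ev_highest_injury", "damage"]}
print(scores)
```

A column that scores at or near 1.0 by itself deserves a closer look. Note that this is an in-sample score, so high-cardinality columns such as `ev_city` can look better than they really are.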

# Final Models: USE THIS HOMIES

In [None]:
df = final_df.copy()
del df['ev_highest_injury']

In [None]:
X = df.drop("Injury", axis = 1).copy()
y = df["Injury"].copy()

In [None]:
#one hot encoding with proper model
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(drop="first")
encoder.fit(X)
X = encoder.transform(X)
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(y)
y = le.transform(y)

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

In [None]:
from sklearn.preprocessing import StandardScaler
# scale attributes by the training set
scl_obj = StandardScaler(with_mean=False)
scl_obj.fit(X_train) # find scalings for each column that give unit std (with_mean=False skips centering so X stays sparse)
# the line of code above only looks at training data to get the std and we can use it 
# to transform new feature data

X_train_scaled = scl_obj.transform(X_train) # apply to training
X_test_scaled = scl_obj.transform(X_test)

# Logistic Regression

In [None]:
#scaled
from sklearn.linear_model import LogisticRegression
logisticRegr = LogisticRegression(class_weight="balanced",solver='liblinear',penalty = 'l2')
logisticRegr.fit(X_train_scaled, y_train)
y_hat = logisticRegr.predict(X_test_scaled)

from sklearn import metrics as mt
acc = mt.accuracy_score(y_test,y_hat)
conf = mt.confusion_matrix(y_test,y_hat)
print(acc)
print(conf)

In [None]:
#NOT Scaled
import time
from sklearn.linear_model import LogisticRegression
from sklearn import metrics as mt

logisticRegr = LogisticRegression(class_weight="balanced",solver='lbfgs', max_iter=1000)

### timer wrapped around the fit to see the efficiency of the model
start = time.perf_counter()
logisticRegr.fit(X_train, y_train)
fit_seconds = time.perf_counter() - start

y_hat = logisticRegr.predict(X_test)

acc = mt.accuracy_score(y_test,y_hat)
conf = mt.confusion_matrix(y_test,y_hat)
print(acc)
print(conf)
print('fit time (s):', fit_seconds)

In [None]:
from sklearn.model_selection import ShuffleSplit

# X and y were already built above: X is the one-hot encoded (sparse) feature matrix
#    and y holds the encoded Injury labels, so both can be used with scikit learn as they are.
#    We leave final_df untouched so the cells above can be re-run.

# to use the cross validation object in scikit learn, we need to grab an instance
#    of the object and set it up. This object will be able to split our data into 
#    training and testing splits
num_cv_iterations = 3
num_instances = len(y)
cv_object = ShuffleSplit(n_splits=num_cv_iterations,
                         test_size  = 0.2)
                         
print(cv_object)
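
A cross-validation object like the one above is typically passed to `cross_val_score`. The following is a minimal sketch on synthetic data from `make_classification`, used here only so the example is self-contained. The names `X_demo`, `y_demo`, `cv_demo`, and `clf_demo` are hypothetical, and the fixed `random_state` is an assumption added for reproducibility.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

# Synthetic stand-in for the encoded feature matrix and the Injury labels.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Three random 80/20 splits, mirroring the settings used in this notebook.
cv_demo = ShuffleSplit(n_splits=3, test_size=0.2, random_state=0)

clf_demo = LogisticRegression(class_weight="balanced", max_iter=1000)

# One accuracy score per split; the model is refit from scratch each time.
scores = cross_val_score(clf_demo, X_demo, y_demo, cv=cv_demo, scoring="accuracy")
print(scores.mean(), scores.std())
```

Reporting the mean and spread over several splits gives a steadier picture than a single 80/20 split.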

In [None]:
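# sort these attributes and spit them out
# coef_ has one weight per one-hot encoded column, so the names come from the encoder
# (get_feature_names_out needs scikit-learn 1.0 or newer)
feature_names = encoder.get_feature_names_out()
zip_vars = zip(logisticRegr.coef_[0],feature_names) # combine attributes
zip_vars = sorted(zip_vars)
for coef, name in zip_vars:
    print(name, 'has weight of', coef) # now print them out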
    

In [None]:
# interpret the weights

# iterate over the coefficients
weights = logisticRegr.coef_.T # take transpose to make a column vector
# one name per one-hot encoded column, matching the length of coef_
variable_names = encoder.get_feature_names_out()
for coef, name in zip(weights,variable_names):
    print(name, 'has weight of', coef[0])
    
# does this look correct?

In [None]:
# now let's make a pandas Series with the names and values, and plot them
from matplotlib import pyplot as plt
%matplotlib inline
plt.style.use('ggplot')


# index by the encoded feature names so the lengths match coef_
weights = pd.Series(logisticRegr.coef_[0],index=encoder.get_feature_names_out())
weights.plot(kind='bar')
plt.show()

# SVM


In [None]:

# lets investigate SVMs on the data and play with the parameters and kernels
import time
from sklearn.svm import SVC

# train the model just as before
svm_clf = SVC(C=0.5, kernel='rbf', degree=3, gamma='auto') # get object

### timer wrapped around the fit to see the efficiency of the model
start = time.perf_counter()
svm_clf.fit(X_train_scaled, y_train)  # train object
fit_seconds = time.perf_counter() - start

# the model was trained on scaled data, so predict on the scaled test set
y_hat = svm_clf.predict(X_test_scaled) # get test set predictions

acc = mt.accuracy_score(y_test,y_hat)
conf = mt.confusion_matrix(y_test,y_hat)
print('accuracy:', acc )
print(conf)
print('fit time (s):', fit_seconds)


In [None]:
import time
from sklearn.svm import SVC
# the linear kernel took forever to run, so it is left commented out;
# as written, this cell refits the svm_clf defined in the previous cell

# train the model just as before
#svm_clf = SVC(C=0.5, kernel='linear', degree=3, gamma='auto') # get object

### timer wrapped around the fit to see the efficiency of the model
start = time.perf_counter()
svm_clf.fit(X_train_scaled, y_train)  # train object
fit_seconds = time.perf_counter() - start

# the model was trained on scaled data, so predict on the scaled test set
y_hat = svm_clf.predict(X_test_scaled) # get test set predictions

acc = mt.accuracy_score(y_test,y_hat)
conf = mt.confusion_matrix(y_test,y_hat)
print('accuracy:', acc )
print(conf)
print('fit time (s):', fit_seconds)


to do:
one hot encoding
avoid confounding variables - this causes an issue with feature importance
check for highly correlated variables
use a confusion matrix
scale our data
large diff in KDE for support vectors - they fall along the decision boundary vs the real data
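
For the last to-do item, a fitted `SVC` exposes its support vectors through `support_` and `n_support_`. The sketch below uses synthetic data from `make_classification` purely for illustration, with hypothetical names such as `svm_demo`. It compares how far the support vectors sit from the decision boundary with how far all rows sit from it.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the scaled training data.
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
X_demo = StandardScaler().fit_transform(X_demo)

svm_demo = SVC(C=0.5, kernel="rbf", gamma="auto").fit(X_demo, y_demo)

sv_idx = svm_demo.support_  # row indices of the support vectors in the training data
print("support vectors per class:", svm_demo.n_support_)

# Absolute decision-function value: a measure of distance from the decision boundary.
dist_all = np.abs(svm_demo.decision_function(X_demo))
dist_sv = dist_all[sv_idx]
print("mean distance, support vectors:", dist_sv.mean())
print("mean distance, all rows:", dist_all.mean())
```

The same indices can be used to pull the support vector rows out of the original data frame and compare their feature distributions, for example with KDE plots, against the full data.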


In [None]:

# now let's make a pandas Series with the names and values, and plot them
from matplotlib import pyplot as plt
%matplotlib inline
plt.style.use('ggplot')


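# lr_clf and df_imputed come from the class notebook; here the model is logisticRegr
# and the feature names come from the one-hot encoder
weights = pd.Series(logisticRegr.coef_[0],index=encoder.get_feature_names_out())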
weights.plot(kind='bar')
plt.show()