# Mini-Lab: Logistic Regression and SVMs

Names:
Dylan Scott
Jobin Joseph
Nnenna Okpara
Satvik Ajmera

Instructions:
You are to perform predictive analysis (classification) upon a data set: model the dataset using
methods we have discussed in class (logistic regression and support vector machines) and draw
conclusions from the analysis. Follow the CRISP-DM framework in your analysis (you are not
performing all of the CRISP-DM outline, only the portions relevant to the grading rubric outlined
below). This report is worth 10% of the final grade. You may complete this assignment in teams of
as many as three people.

Write a report covering all the steps of the project. The format of the document can be PDF,
*.ipynb, or HTML. You can write the report in whatever format you like, but it is easiest to turn in the
rendered iPython notebook. The results should be reproducible using your report. Please carefully
describe every assumption and every step in your report.

SVM and Logistic Regression Modeling
• [50 points] Create a logistic regression model and a support vector machine model for the
classification task involved with your dataset. Assess how well each model performs (use
80/20 training/testing split for your data). Adjust parameters of the models to make them more
accurate. If your dataset size requires the use of stochastic gradient descent, then linear kernel
only is fine to use.

[pick performance stats]

• [10 points] Discuss the advantages of each model for each classification task. Does one type
of model offer superior performance over another in terms of prediction accuracy? In terms of
training time or efficiency? Explain in detail.

• [30 points] Use the weights from logistic regression to interpret the importance of different
features for each classification task. Explain your interpretation in detail. Why do you think
some variables are more important?

• [10 points] Look at the chosen support vectors for the classification task. Do these provide
any insight into the data? Explain.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='darkgrid')

import plotly.express as px
import plotly.graph_objects as go

### Dataset add-on
Since submitting the first project, we have added more data that we found on the NTSB website. We merged in new columns using a join and appended more recent records. This gives us more variables, but some of the added rows need cleaning, which the next section covers.

In [2]:
#Read in the Aviation Data
final_data = pd.read_csv("Data/final_data.csv",low_memory=False,dtype={'damage': str})
final_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 115706 entries, 0 to 115705
Data columns (total 35 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   Unnamed: 0         115706 non-null  int64  
 1   index              115706 non-null  int64  
 2   ev_id              115706 non-null  object 
 3   ntsb_no_x          115706 non-null  object 
 4   acft_make          115643 non-null  object 
 5   acft_model         115630 non-null  object 
 6   cert_max_gr_wt     98673 non-null   float64
 7   acft_category      115287 non-null  object 
 8   damage             113877 non-null  object 
 9   far_part           114925 non-null  object 
 10  afm_hrs_last_insp  60298 non-null   float64
 11  type_fly           108599 non-null  object 
 12  dprt_city          111864 non-null  object 
 13  dprt_state         108791 non-null  object 
 14  dprt_state.1       108791 non-null  object 
 15  rwy_len            64222 non-null   float64
 16  rw

# Checking Data Cleaning

In [3]:
#It looks like we have some missing values and two inconsistent variants of UNK in the damage column
#combine all injuries, including those on the ground
#sky_cond_ceil, sky_cond_nonceil
#check U vs UNK for wind_vel_ind
#flight crew 
finaldamagecount = final_data["damage"].value_counts().reset_index()
finaldamagecount.head(50)


Unnamed: 0,index,damage
0,SUBS,87994
1,DEST,20892
2,MINR,3302
3,NONE,1600
4,UNK,45
5,UNK,44


In [4]:
final_data.loc[final_data['damage'].str.contains('UNK', na=False), 'damage'] = 'UNK'
finaldamagecount = final_data["damage"].value_counts().reset_index()
finaldamagecount.head(50)

Unnamed: 0,index,damage
0,SUBS,87994
1,DEST,20892
2,MINR,3302
3,NONE,1600
4,UNK,89
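
The earlier count listed `UNK` twice, which suggests the two labels differ only by invisible characters such as trailing whitespace (an assumption, since the raw values are not shown). A minimal sketch on a toy Series with made-up labels, not the NTSB data, shows how stripping and upper-casing before counting would merge such variants:

```python
import pandas as pd

# Toy labels (hypothetical): two "UNK" variants differing only by a trailing space.
damage = pd.Series(["SUBS", "DEST", "UNK", "UNK ", "SUBS", None])

# Before cleaning, the two variants are counted separately.
raw_counts = damage.value_counts()

# Strip surrounding whitespace and upper-case; missing values stay missing.
cleaned = damage.str.strip().str.upper()
clean_counts = cleaned.value_counts()

print(raw_counts)
print(clean_counts)
```

Unlike `str.contains('UNK')`, this normalization would not overwrite a label that merely contains the substring `UNK`, which may or may not matter for this column.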


In [5]:
#checking to see if wind_vel_ind had a mismatch between U and UNK
wind_count = final_data["wind_vel_ind"].value_counts().reset_index()
wind_count.head(50)

Unnamed: 0,index,wind_vel_ind
0,F,47663
1,UNK,42470
2,SPEC,10900
3,CALM,7074
4,T,4030
5,LVAR,1486


In [6]:
#dealing with unknowns
#for some columns we can't simply replace blank values with "Unknown" or 0s since that would skew our data:
#'cert_max_gr_wt','afm_hrs_last_insp','rwy_len','rwy_width'
#for the columns listed above we have elected to remove any rows where they are blank. This focuses our data and still leaves us with an ample amount of it
final_data.dropna(subset=['cert_max_gr_wt','afm_hrs_last_insp','rwy_len','rwy_width'],inplace=True)
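
Before dropping rows, it can help to see how much each column contributes to the loss. The sketch below uses a toy frame with made-up values (only the column names are borrowed from the dataset) to count missing values per required column and report the fraction of rows that survive `dropna`:

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with gaps in two numeric columns.
df = pd.DataFrame({
    "cert_max_gr_wt": [2300.0, np.nan, 3800.0, 14100.0],
    "rwy_len": [5000.0, 3200.0, np.nan, 4100.0],
    "damage": ["SUBS", "DEST", "MINR", "SUBS"],
})
required = ["cert_max_gr_wt", "rwy_len"]

# How many values are missing in each required column?
missing_per_column = df[required].isna().sum()

# Rows that survive when every required column must be present.
kept = df.dropna(subset=required)
retained_fraction = len(kept) / len(df)

print(missing_per_column)
print(f"kept {len(kept)} of {len(df)} rows ({retained_fraction:.0%})")
```

Running the same two checks on `final_data` would show which of the four columns drives most of the row loss.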

In [7]:
#rename the injuries columns to make them easier to read
final_data = final_data.rename(columns={"inj_tot_f": "Total_Fatal_Injuries", "inj_tot_s": "Total_Serious_Injuries","inj_tot_m":"Total_Minor_Injuries","inj_tot_n":'Total_Uninjured',"inj_tot_t":"Total_Injuries_Flight"})

#fill in 0s when there wasn't an injury in that category
final_data.update(final_data[['Total_Fatal_Injuries','Total_Serious_Injuries','Total_Minor_Injuries','Total_Uninjured','Total_Injuries_Flight','inj_f_grnd','inj_m_grnd','inj_s_grnd']].fillna(0))
final_data.head()

Unnamed: 0.1,Unnamed: 0,index,ev_id,ntsb_no_x,acft_make,acft_model,cert_max_gr_wt,acft_category,damage,far_part,...,Total_Fatal_Injuries,Total_Minor_Injuries,Total_Uninjured,Total_Serious_Injuries,Total_Injuries_Flight,sky_cond_ceil,sky_cond_nonceil,wind_vel_ind,wx_int_precip,phase_flt_spec
1,1,1,20001204X00001,ANC99IA025,Boeing,747-100,750000.0,AIR,MINR,121,...,0.0,0.0,4.0,0.0,0.0,NONE,SCAT,CALM,UNK,Landing
3,3,3,20001204X00003,ANC99LA022,Cessna,172,2300.0,AIR,SUBS,91,...,0.0,0.0,1.0,0.0,0.0,BKN,UNK,UNK,LGT,Unknown
4,4,4,20001204X00004,ANC99LA023,Cessna,207,3800.0,AIR,SUBS,135,...,0.0,0.0,1.0,0.0,0.0,BKN,UNK,UNK,UNK,Descent
6,6,6,20001204X00006,ATL99FA044,Beech,300,14100.0,AIR,DEST,91,...,2.0,0.0,0.0,0.0,2.0,BKN,UNK,UNK,MOD,Approach
8,8,8,20001204X00008,ATL99FA046,Aero Commander,560A,6000.0,AIR,DEST,91,...,2.0,0.0,0.0,2.0,4.0,NONE,CLER,UNK,UNK,Approach


In [8]:
#set remaining missing values to "UNK" so that we can run our models
final_data.update(final_data.fillna("UNK"))
final_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 35427 entries, 1 to 115696
Data columns (total 35 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Unnamed: 0              35427 non-null  int64  
 1   index                   35427 non-null  int64  
 2   ev_id                   35427 non-null  object 
 3   ntsb_no_x               35427 non-null  object 
 4   acft_make               35427 non-null  object 
 5   acft_model              35427 non-null  object 
 6   cert_max_gr_wt          35427 non-null  float64
 7   acft_category           35427 non-null  object 
 8   damage                  35427 non-null  object 
 9   far_part                35427 non-null  object 
 10  afm_hrs_last_insp       35427 non-null  float64
 11  type_fly                35427 non-null  object 
 12  dprt_city               35427 non-null  object 
 13  dprt_state              35427 non-null  object 
 14  dprt_state.1            35427 non-nul

We will be using code from this class's GitHub repository:
https://github.com/jakemdrew/DataMiningNotebooks/blob/master/04.%20Logits%20and%20SVM.ipynb

In [9]:
#drop columns we don't need: index, ev_id, ntsb_no_x, dprt_state.1
del final_data["index"]
del final_data["ev_id"]
del final_data["ntsb_no_x"]
del final_data["dprt_state.1"]
#final_data.info()

In [10]:
#injuries = final_data["Total.Injuries"].value_counts().reset_index()
#injuries.head(50)

In [11]:
#we want to account for ALL injuries. This includes injuries on the ground as well as to passengers
#Here we will make a new column that shows total injuries, including those on the ground
final_data['Total_Injuries_Ground'] = final_data['inj_f_grnd']+final_data['inj_m_grnd']+final_data['inj_s_grnd']
final_data['Total_Injuries'] = final_data['Total_Injuries_Ground']+final_data['Total_Injuries_Flight']
final_data.head()

Unnamed: 0.1,Unnamed: 0,acft_make,acft_model,cert_max_gr_wt,acft_category,damage,far_part,afm_hrs_last_insp,type_fly,dprt_city,...,Total_Uninjured,Total_Serious_Injuries,Total_Injuries_Flight,sky_cond_ceil,sky_cond_nonceil,wind_vel_ind,wx_int_precip,phase_flt_spec,Total_Injuries_Ground,Total_Injuries
1,1,Boeing,747-100,750000.0,AIR,MINR,121,113.0,UNK,CHITOSE,...,4.0,0.0,0.0,NONE,SCAT,CALM,UNK,Landing,0.0,0.0
3,3,Cessna,172,2300.0,AIR,SUBS,91,40.0,PERS,,...,1.0,0.0,0.0,BKN,UNK,UNK,LGT,Unknown,0.0,0.0
4,4,Cessna,207,3800.0,AIR,SUBS,135,49.0,UNK,,...,1.0,0.0,0.0,BKN,UNK,UNK,UNK,Descent,0.0,0.0
6,6,Beech,300,14100.0,AIR,DEST,91,3.0,EXEC,GREENEVILLE,...,0.0,0.0,2.0,BKN,UNK,UNK,MOD,Approach,0.0,2.0
8,8,Aero Commander,560A,6000.0,AIR,DEST,91,13.0,PERS,,...,0.0,2.0,4.0,NONE,CLER,UNK,UNK,Approach,0.0,4.0


In [12]:
#create a new column indicating injured or not, to get a binary response
#1 means someone was hurt, 0 means no one was
final_data['Injury'] = np.where(final_data['Total_Injuries'] >0,1,0)
injuries = final_data["Injury"].value_counts().reset_index()
injuries.head(50)

Unnamed: 0,index,Injury
0,1,18750
1,0,16677
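
The two classes are fairly balanced, which gives a useful reference point for later accuracy figures. As a small worked example, using only the counts printed above, the share of the larger class is the accuracy a model would reach by always predicting that class:

```python
# Class counts copied from the value_counts output above.
counts = {1: 18750, 0: 16677}
total = sum(counts.values())

# Share of each class in the cleaned data.
shares = {label: n / total for label, n in counts.items()}

# Accuracy of a trivial model that always predicts the most common class.
majority_label = max(counts, key=counts.get)
baseline_accuracy = shares[majority_label]

print(f"total rows: {total}")
print(f"majority class: {majority_label}, baseline accuracy: {baseline_accuracy:.3f}")
```

Any logistic regression or SVM result should be compared against this baseline rather than against 50%.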


In [13]:
#delete the index column called "Unnamed: 0"
final_df = final_data.copy()

del final_df["Unnamed: 0"]
del final_df["Total_Injuries"]
#Since we added up all of our injuries, we don't need the other injury-count columns: they would be collinear with our response variable
final_df = final_df.drop(['Total_Fatal_Injuries','Total_Serious_Injuries','Total_Minor_Injuries','Total_Uninjured','Total_Injuries_Flight','inj_f_grnd','inj_m_grnd','inj_s_grnd','Total_Injuries_Ground'],axis = 1)

In [14]:
final_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 35427 entries, 1 to 115696
Data columns (total 23 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   acft_make          35427 non-null  object 
 1   acft_model         35427 non-null  object 
 2   cert_max_gr_wt     35427 non-null  float64
 3   acft_category      35427 non-null  object 
 4   damage             35427 non-null  object 
 5   far_part           35427 non-null  object 
 6   afm_hrs_last_insp  35427 non-null  float64
 7   type_fly           35427 non-null  object 
 8   dprt_city          35427 non-null  object 
 9   dprt_state         35427 non-null  object 
 10  rwy_len            35427 non-null  float64
 11  rwy_width          35427 non-null  float64
 12  ev_type            35427 non-null  object 
 13  ev_city            35427 non-null  object 
 14  ev_state           35427 non-null  object 
 15  ev_country         35427 non-null  object 
 16  ev_highest_injury  35

# Train Test Split

final_data

In [15]:
from sklearn.model_selection import ShuffleSplit

# we want to separate the X and y data as follows:
# note: use final_df (not final_data), since final_df has the injury-count columns removed
if 'Injury' in final_df:
    y = final_df['Injury'].values # get the labels we want
    del final_df['Injury'] # get rid of the class label
    X = final_df.values # use everything else to predict!

    ## X and y are now numpy matrices, by calling 'values' on the pandas data frames we
    #    have converted them into simple matrices to use with scikit learn
    
    
# to use the cross validation object in scikit learn, we need to grab an instance
#    of the object and set it up. This object will be able to split our data into 
#    training and testing splits
num_cv_iterations = 3
num_instances = len(y)
cv_object = ShuffleSplit(n_splits=num_cv_iterations,
                         test_size  = 0.2)
                         
print(cv_object)

ShuffleSplit(n_splits=3, random_state=None, test_size=0.2, train_size=None)
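
To illustrate what the cross-validation object produces, here is a minimal sketch on toy arrays (not the aviation data). `X_toy`, `y_toy`, and the `random_state=0` argument are assumptions added for the example; the notebook's own object leaves `random_state` unset, so its splits differ between runs.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

# Toy data (hypothetical): 10 rows, 2 features.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 1] * 5)

# random_state is an added assumption so the splits repeat between runs.
cv = ShuffleSplit(n_splits=3, test_size=0.2, random_state=0)

split_sizes = []
for train_idx, test_idx in cv.split(X_toy, y_toy):
    # Each iteration reshuffles the rows and holds out 20% for testing.
    split_sizes.append((len(train_idx), len(test_idx)))
    assert set(train_idx).isdisjoint(test_idx)

print(split_sizes)
```

Each of the three iterations yields index arrays for an 80/20 split, which is how the assignment's required split can be repeated several times to gauge variability.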


# Logistic Regression

to do:
- one-hot encoding
- avoid confounding variables - this causes an issue with feature importance
- check for highly correlated variables
- use a confusion matrix
- scale our data
- large difference in KDE for support vectors - they fall along the decision boundary vs the real data
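
Several of the items above (one-hot encoding, scaling, and a confusion matrix) can be combined in a single scikit-learn pipeline. The following is only a sketch under stated assumptions: the data is synthetic, the toy target is constructed from `damage` purely so the example has signal, and the column choices and hyperparameters are illustrative rather than the ones this project will settle on.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (hypothetical values, not the NTSB records).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "cert_max_gr_wt": rng.uniform(1000, 20000, n),
    "damage": rng.choice(["SUBS", "DEST", "MINR"], n),
})
# Toy target: destroyed aircraft are always labelled as injury events.
toy["Injury"] = (toy["damage"] == "DEST").astype(int)

X = toy.drop(columns="Injury")
y = toy["Injury"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Scale numeric columns and one-hot encode categorical ones in one step.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["cert_max_gr_wt"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["damage"]),
])
model = Pipeline([
    ("prep", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X_train, y_train)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, model.predict(X_test))
print(cm)
```

Fitting the scaler and encoder inside the pipeline means they only ever see training rows, which avoids leaking test-set statistics into the model.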
