You are to build upon the predictive analysis (classification) that you already completed in the previous mini-project, adding additional modeling from new classification algorithms as well as more explanations that are in line with the CRISP-DM framework. You should use appropriate cross-validation for all of your analysis (explain your chosen method of performance validation in detail). Try to use as much testing data as possible in a realistic manner (you should define what you think is realistic and why).
This report is worth 20% of the final grade. Please upload a report (one per team) with all code used, visualizations, and text in a single document. The format of the document can be PDF, *.ipynb, or HTML. You can write the report in whatever format you like, but it is easiest to turn in the rendered iPython notebook. The results should be reproducible using your report. Please carefully describe every assumption and every step in your report.

### Dataset Selection
Select a dataset the same way you did for the first project work week and mini-project. You are not required to use the same dataset that you used in the past, but you are encouraged to. You must identify two tasks from the dataset to regress or classify. That is:
- two classification tasks OR
- two regression tasks OR
- one classification task and one regression task

For example, if your dataset was the diabetes data, you might predict two tasks: (1) classifying whether a patient will be readmitted within a 30-day period, and (2) regressing the total number of days a patient will spend in the hospital, given their history and specifics of the encounter such as tests administered and previous admissions.

### Grading Rubric

#### Data Preparation (15 points total)
- [10 points] Define and prepare your class variables. Use proper variable representations (int, float, one-hot, etc.). Use pre-processing methods (as needed) for dimensionality reduction, scaling, etc. Remove variables that are not needed/useful for the analysis.
 
- [5 points] Describe the final dataset that is used for classification/regression (include a description of any newly formed variables you created).
#### Modeling and Evaluation (70 points total)
- [10 points] Choose and explain the evaluation metrics that you will use (e.g., accuracy, precision, recall, F-measure, or any metric we have discussed). Why are the measure(s) appropriate for analyzing the results of your modeling? Give a detailed explanation backing up any assertions.

- [10 points] Choose the method you will use for dividing your data into training and testing splits (e.g., are you using stratified 10-fold cross-validation? Why?). Explain why your chosen method is appropriate or use more than one method as appropriate.
- [20 points] Create three different classification/regression models (e.g., random forest, KNN, and SVM). Two modeling techniques must be new (but the third could be SVM or logistic regression). Adjust parameters as appropriate to increase generalization performance using your chosen metric.
- [10 points] Analyze the results using your chosen method of evaluation. Use visualizations of the results to bolster the analysis. Explain any visuals and analyze why they are interesting to someone that might use this model.
- [10 points] Discuss the advantages of each model for each classification task, if any. If there are not advantages, explain why. Is any model better than another? Is the difference significant with 95% confidence? Use proper statistical comparison methods.
- [10 points] Which attributes from your analysis are most important? Use proper methods discussed in class to evaluate the importance of different attributes. Discuss the results and hypothesize about why certain attributes are more important than others for a given classification task.
#### Deployment (5 points total)
- [5 points] How useful is your model for interested parties (i.e., the companies or organizations that might want to use it for prediction)? How would you measure the model's value if it was used by these parties? How would you deploy your model for interested parties? What other data should be collected? How often would the model need to be updated, etc.?
#### Exceptional Work (10 points total)
- You have free rein to provide additional modeling.
- One idea: grid search parameters in a parallelized fashion and visualize the performances across attributes. Which parameters are most significant for making a good model for each classification algorithm?

Two dataframes for each classification task

Data cleanup (Dylan and Satvik)
Broad phase of flight dataframe

Injury (Injury)  for KNN (Nnenna)
- Look into ROC Curves
- Look at Sklearn parameters for KNN


Injury (Injury) for Decision Trees (Jobin)
- Look at Sklearn parameters for decision trees

Injury (Injury) for KNN

Injury (Injury) for Decision Trees



In [246]:
import pandas as pd
import numpy as np

In [247]:
#Read in the Aviation Data
final_data = pd.read_csv("../Data/final_data.csv",low_memory=False,dtype={'damage': str})
#Delete columns that were imported incorrectly
del final_data["Unnamed: 0"]
del final_data["dprt_state.1"]
del final_data["index"]
del final_data["ntsb_no_x"]
del final_data['wind_vel_ind']

final_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 115706 entries, 0 to 115705
Data columns (total 30 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   ev_id              115706 non-null  object 
 1   acft_make          115643 non-null  object 
 2   acft_model         115630 non-null  object 
 3   cert_max_gr_wt     98673 non-null   float64
 4   acft_category      115287 non-null  object 
 5   damage             113877 non-null  object 
 6   far_part           114925 non-null  object 
 7   afm_hrs_last_insp  60298 non-null   float64
 8   type_fly           108599 non-null  object 
 9   dprt_city          111864 non-null  object 
 10  dprt_state         108791 non-null  object 
 11  rwy_len            64222 non-null   float64
 12  rwy_width          63110 non-null   float64
 13  ev_type            115706 non-null  object 
 14  ev_city            115646 non-null  object 
 15  ev_state           109635 non-null  object 
 16  ev

In [248]:
pd.set_option('display.max_columns', 50)

In [249]:
final_data.head(10)

Unnamed: 0,ev_id,acft_make,acft_model,cert_max_gr_wt,acft_category,damage,far_part,afm_hrs_last_insp,type_fly,dprt_city,dprt_state,rwy_len,rwy_width,ev_type,ev_city,ev_state,ev_country,ev_highest_injury,inj_f_grnd,inj_m_grnd,inj_s_grnd,inj_tot_f,inj_tot_m,inj_tot_n,inj_tot_s,inj_tot_t,sky_cond_ceil,sky_cond_nonceil,wx_int_precip,phase_flt_spec
0,20001204X00000,Cessna,207,3800.0,AIR,SUBS,135,75.0,UNK,BETHEL,AK,,,ACC,QUINHAGAK,AK,USA,MINR,0.0,0.0,0.0,,1.0,,,1.0,BKN,UNK,UNK,Approach
1,20001204X00001,Boeing,747-100,750000.0,AIR,MINR,121,113.0,UNK,CHITOSE,JA,11800.0,150.0,INC,FAIRBANKS,AK,USA,NONE,0.0,0.0,0.0,,,4.0,,,NONE,SCAT,UNK,Landing
2,20001204X00002,Piper,PA-31-350,7369.0,AIR,SUBS,135,32.0,UNK,CHENEGA BAY,AK,,,ACC,ANCHORAGE,AK,USA,NONE,0.0,0.0,0.0,,,6.0,,,OVC,SCAT,UNK,Unknown
3,20001204X00003,Cessna,172,2300.0,AIR,SUBS,91,40.0,PERS,,,6398.0,150.0,ACC,BETHEL,AK,USA,NONE,0.0,0.0,0.0,,,1.0,,,BKN,UNK,LGT,Unknown
4,20001204X00004,Cessna,207,3800.0,AIR,SUBS,135,49.0,UNK,,AK,2610.0,40.0,ACC,CHEVAK,AK,USA,NONE,0.0,0.0,0.0,,,1.0,,,BKN,UNK,UNK,Descent
5,20001204X00005,Piper,PA-22-160,1840.0,AIR,SUBS,91,,PERS,,,2200.0,70.0,ACC,ANCHORAGE,AK,USA,NONE,0.0,0.0,0.0,,,1.0,,,UNK,BKNT,UNK,Takeoff
6,20001204X00006,Beech,300,14100.0,AIR,DEST,91,3.0,EXEC,GREENEVILLE,SC,5500.0,100.0,ACC,CULLMAN,AL,USA,FATL,0.0,0.0,0.0,2.0,,,,2.0,BKN,UNK,MOD,Approach
7,20001204X00007,Piper,PA-28-181,2550.0,AIR,DEST,91,,PERS,ANDREWS,NC,,,ACC,BREVARD,NC,USA,FATL,0.0,0.0,0.0,1.0,,,,1.0,NONE,CLER,UNK,Maneuvering
8,20001204X00008,Aero Commander,560A,6000.0,AIR,DEST,91,13.0,PERS,,,3800.0,36.0,ACC,BELLEVIEW,FL,USA,FATL,0.0,0.0,0.0,2.0,,,2.0,4.0,NONE,CLER,UNK,Approach
9,20001204X00009,Piper,PA-24-250,2900.0,AIR,SUBS,91,40.0,PERS,ALLAIRE,NJ,,,ACC,COLBERT,GA,USA,NONE,0.0,0.0,0.0,,,1.0,,,OVC,UNK,LGT,Unknown


In [250]:
# Replace whitespace-only strings with NaN so that blank dprt_city entries are treated as missing
final_data= final_data.replace(r'^\s+$', np.nan, regex=True)

In [251]:
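# Count events per departure city (despite the variable name); note the mixed-case duplicates such as Anchorage vs. ANCHORAGE
finaldamagecount = final_data["dprt_city"].value_counts().reset_index()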
finaldamagecount

Unnamed: 0,index,dprt_city
0,Anchorage,393
1,ANCHORAGE,380
2,Houston,226
3,Phoenix,199
4,Fairbanks,196
...,...,...
16399,W. KINGSTON,1
16400,St. Peterburg,1
16401,Sumner,1
16402,WRIGHTSTORM,1


In [252]:
# Upper-case the text columns so that values such as "Anchorage" and "ANCHORAGE" fall into one category
upper_cols = ['acft_make', 'acft_category', 'damage', 'type_fly', 'dprt_city',
              'dprt_state', 'ev_city', 'ev_type', 'ev_country', 'sky_cond_ceil',
              'sky_cond_nonceil', 'wx_int_precip', 'phase_flt_spec', 'ev_highest_injury']
for col in upper_cols:
    final_data[col] = final_data[col].str.upper()
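
As a quick check on why the upper-casing matters, the sketch below uses a tiny made-up Series (not the aviation data) to show that mixed-case spellings such as `Anchorage` and `ANCHORAGE` collapse into one category once `.str.upper()` is applied. The `.str.strip()` call is an extra assumption, added because some one-hot column names later in the notebook (for example `acft_make_ HACKNEY`) suggest that a few values carry leading spaces.

```python
import pandas as pd

# Toy stand-in for a column like dprt_city; these values are illustrative only
cities = pd.Series(["Anchorage", "ANCHORAGE", " Houston", "Houston", None])

raw_counts = cities.value_counts()

# Strip surrounding whitespace, then upper-case; missing values stay missing
clean = cities.str.strip().str.upper()
clean_counts = clean.value_counts()

print(raw_counts.shape[0], "distinct raw values")
print(clean_counts.shape[0], "distinct cleaned values")
print(clean_counts)
```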

In [253]:
final_data.loc[final_data['damage'].str.contains('UNK', na=False), 'damage'] = 'UNK'
# final_data = final_data.loc[final_data['phase_flt_spec'].str.contains('UNK', na=False), 'damage'] = 'UNK'

In [254]:
#rename the injuries columns to make them easier to read
final_data = final_data.rename(columns={"inj_tot_f": "Total_Fatal_Injuries", 
                                        "inj_tot_s":"Total_Serious_Injuries",
                                        "inj_tot_m":"Total_Minor_Injuries",
                                        "inj_tot_n":'Total_Uninjured',
                                        "inj_tot_t":"Total_Injuries_Flight"})

#fill in 0s when there wasn't an injury in that category
final_data.update(final_data[['Total_Fatal_Injuries','Total_Serious_Injuries',
                              'Total_Minor_Injuries','Total_Uninjured',
                              'Total_Injuries_Flight','inj_f_grnd',
                              'inj_m_grnd','inj_s_grnd']].fillna(0))
final_data.head()

Unnamed: 0,ev_id,acft_make,acft_model,cert_max_gr_wt,acft_category,damage,far_part,afm_hrs_last_insp,type_fly,dprt_city,dprt_state,rwy_len,rwy_width,ev_type,ev_city,ev_state,ev_country,ev_highest_injury,inj_f_grnd,inj_m_grnd,inj_s_grnd,Total_Fatal_Injuries,Total_Minor_Injuries,Total_Uninjured,Total_Serious_Injuries,Total_Injuries_Flight,sky_cond_ceil,sky_cond_nonceil,wx_int_precip,phase_flt_spec
0,20001204X00000,CESSNA,207,3800.0,AIR,SUBS,135,75.0,UNK,BETHEL,AK,,,ACC,QUINHAGAK,AK,USA,MINR,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,BKN,UNK,UNK,APPROACH
1,20001204X00001,BOEING,747-100,750000.0,AIR,MINR,121,113.0,UNK,CHITOSE,JA,11800.0,150.0,INC,FAIRBANKS,AK,USA,NONE,0.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,NONE,SCAT,UNK,LANDING
2,20001204X00002,PIPER,PA-31-350,7369.0,AIR,SUBS,135,32.0,UNK,CHENEGA BAY,AK,,,ACC,ANCHORAGE,AK,USA,NONE,0.0,0.0,0.0,0.0,0.0,6.0,0.0,0.0,OVC,SCAT,UNK,UNKNOWN
3,20001204X00003,CESSNA,172,2300.0,AIR,SUBS,91,40.0,PERS,,,6398.0,150.0,ACC,BETHEL,AK,USA,NONE,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,BKN,UNK,LGT,UNKNOWN
4,20001204X00004,CESSNA,207,3800.0,AIR,SUBS,135,49.0,UNK,,AK,2610.0,40.0,ACC,CHEVAK,AK,USA,NONE,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,BKN,UNK,UNK,DESCENT


In [255]:
# Keep only events with all four numeric features recorded (this reduces 115,706 rows to 35,427)
final_data.dropna(subset=['cert_max_gr_wt','afm_hrs_last_insp',
                          'rwy_len','rwy_width'],inplace=True)
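
To see how much data this `dropna` discards, the non-null counts printed by `final_data.info()` can be plugged into a few lines of arithmetic. The sketch below hard-codes those counts, together with the 35,427 rows that `df.info()` reports for the final dataframe; it is a back-of-the-envelope check, not part of the pipeline.

```python
# Non-null counts copied from the final_data.info() output (115,706 rows in total)
total_rows = 115706
non_null = {
    "cert_max_gr_wt": 98673,
    "afm_hrs_last_insp": 60298,
    "rwy_len": 64222,
    "rwy_width": 63110,
}
rows_after_dropna = 35427  # row count reported by df.info() after the drop

# Fraction of rows missing each numeric feature
missing_frac = {col: 1 - n / total_rows for col, n in non_null.items()}
for col, frac in missing_frac.items():
    print(f"{col:>18}: {frac:.1%} missing")

kept_frac = rows_after_dropna / total_rows
print(f"rows kept after dropna: {kept_frac:.1%}")
```

Roughly 31% of the events survive the drop, so it may be worth checking whether the discarded rows differ systematically from the retained ones.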

In [256]:
final_data = final_data.reset_index(drop=True)

In [257]:
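# Fill the remaining missing values (all in categorical columns at this point) with the placeholder "UNK"
final_data.update(final_data.fillna("UNK"))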


In [258]:
phase_df = final_data.copy()

In [259]:
# We want to account for ALL injuries: those on the ground as well as those of passengers.
# Create one new column for total ground injuries and another for total injuries overall.
final_data['Total_Injuries_Ground'] = final_data['inj_f_grnd']+final_data['inj_m_grnd']+final_data['inj_s_grnd']
final_data['Total_Injuries'] = final_data['Total_Injuries_Ground']+final_data['Total_Injuries_Flight']
final_data.head()

Unnamed: 0,ev_id,acft_make,acft_model,cert_max_gr_wt,acft_category,damage,far_part,afm_hrs_last_insp,type_fly,dprt_city,dprt_state,rwy_len,rwy_width,ev_type,ev_city,ev_state,ev_country,ev_highest_injury,inj_f_grnd,inj_m_grnd,inj_s_grnd,Total_Fatal_Injuries,Total_Minor_Injuries,Total_Uninjured,Total_Serious_Injuries,Total_Injuries_Flight,sky_cond_ceil,sky_cond_nonceil,wx_int_precip,phase_flt_spec,Total_Injuries_Ground,Total_Injuries
0,20001204X00001,BOEING,747-100,750000.0,AIR,MINR,121,113.0,UNK,CHITOSE,JA,11800.0,150.0,INC,FAIRBANKS,AK,USA,NONE,0.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,NONE,SCAT,UNK,LANDING,0.0,0.0
1,20001204X00003,CESSNA,172,2300.0,AIR,SUBS,91,40.0,PERS,UNK,UNK,6398.0,150.0,ACC,BETHEL,AK,USA,NONE,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,BKN,UNK,LGT,UNKNOWN,0.0,0.0
2,20001204X00004,CESSNA,207,3800.0,AIR,SUBS,135,49.0,UNK,UNK,AK,2610.0,40.0,ACC,CHEVAK,AK,USA,NONE,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,BKN,UNK,UNK,DESCENT,0.0,0.0
3,20001204X00006,BEECH,300,14100.0,AIR,DEST,91,3.0,EXEC,GREENEVILLE,SC,5500.0,100.0,ACC,CULLMAN,AL,USA,FATL,0.0,0.0,0.0,2.0,0.0,0.0,0.0,2.0,BKN,UNK,MOD,APPROACH,0.0,2.0
4,20001204X00008,AERO COMMANDER,560A,6000.0,AIR,DEST,91,13.0,PERS,UNK,UNK,3800.0,36.0,ACC,BELLEVIEW,FL,USA,FATL,0.0,0.0,0.0,2.0,0.0,0.0,2.0,4.0,NONE,CLER,UNK,APPROACH,0.0,4.0


In [260]:
final_data['Injury'] = np.where(final_data['Total_Injuries'] >0,1,0)
injuries = final_data["Injury"].value_counts().reset_index()
injuries.head(3)

Unnamed: 0,index,Injury
0,1,18750
1,0,16677


In [261]:
final_df = final_data.copy()
# Injury was derived from the injury-count columns (and is closely tied to ev_highest_injury),
# so keeping them would leak the target into the features. ev_id and dprt_city are dropped as well.
final_df = final_df.drop(['Total_Fatal_Injuries','Total_Serious_Injuries','Total_Minor_Injuries',
                          'Total_Uninjured','Total_Injuries_Flight','inj_f_grnd','inj_m_grnd',
                          'inj_s_grnd','Total_Injuries_Ground',"Total_Injuries","ev_highest_injury",
                          "ev_id","dprt_city"],axis = 1)
final_df = final_df.reset_index(drop=True)

# Final Dataframe for predicting total injury

In [105]:
df = final_df.copy()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35427 entries, 0 to 35426
Data columns (total 20 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   acft_make          35427 non-null  object 
 1   acft_model         35427 non-null  object 
 2   cert_max_gr_wt     35427 non-null  float64
 3   acft_category      35427 non-null  object 
 4   damage             35427 non-null  object 
 5   far_part           35427 non-null  object 
 6   afm_hrs_last_insp  35427 non-null  float64
 7   type_fly           35427 non-null  object 
 8   dprt_state         35427 non-null  object 
 9   rwy_len            35427 non-null  float64
 10  rwy_width          35427 non-null  float64
 11  ev_type            35427 non-null  object 
 12  ev_city            35427 non-null  object 
 13  ev_state           35427 non-null  object 
 14  ev_country         35427 non-null  object 
 15  sky_cond_ceil      35427 non-null  object 
 16  sky_cond_nonceil   354

In [106]:
X = df.drop("Injury", axis = 1).copy()
y = df["Injury"].copy()

In [107]:
#One hot encode specific columns without standardizing and scaling continuous variables
from sklearn.preprocessing import OneHotEncoder
categorical_features = ['acft_make', 'acft_model', 'acft_category', 'damage','far_part', 'type_fly',
                        'dprt_state','ev_type','ev_city', 'ev_state','ev_country', 'sky_cond_ceil', 'sky_cond_nonceil',
                        'wx_int_precip', 'phase_flt_spec']

ohe = OneHotEncoder()

X_object = X.select_dtypes('object')
ohe.fit(X_object)

codes = ohe.transform(X_object).toarray()
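# get_feature_names was removed in scikit-learn 1.2; get_feature_names_out (available since 1.0) replaces it
feature_names = ohe.get_feature_names_out(categorical_features)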

X = pd.concat([X.select_dtypes(exclude='object'), 
               pd.DataFrame(codes,columns=feature_names).astype(int)], axis=1)

In [121]:
X = X.to_numpy()
X.shape

In [108]:
# X.reset_index(drop = True, inplace=True)

In [109]:
from sklearn.preprocessing import LabelEncoder
LE = LabelEncoder()
LE.fit(y)
y = LE.transform(y)
y

array([0, 0, 0, ..., 1, 1, 1])

In [110]:
from sklearn.model_selection import StratifiedShuffleSplit 
cv = StratifiedShuffleSplit(n_splits=1,test_size=0.10, random_state=42)
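
With `n_splits=1`, the splitter above yields a single 90/10 hold-out split, so each score reported below comes from one test set. As a hedged illustration of the stratified 10-fold alternative mentioned in the rubric, the sketch below scores a logistic regression with `StratifiedKFold` and `cross_val_score` on synthetic data from `make_classification`; the data and the model settings are stand-ins, not the notebook's.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary problem standing in for the Injury task
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)

# Ten folds, each preserving the class proportions of y_demo
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
clf = LogisticRegression(max_iter=1000)

scores = cross_val_score(clf, X_demo, y_demo, cv=skf, scoring="accuracy")
print("fold accuracies:", np.round(scores, 3))
print("mean +/- std: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```

Ten per-fold scores also provide the spread needed for the statistical comparison between models that the rubric asks for.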

In [125]:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics as mt
import time

# L2-regularized logistic regression; class_weight="balanced" reweights the two Injury classes
lr_clf = LogisticRegression(class_weight="balanced", solver='liblinear', penalty="l2",
                            max_iter=1000, random_state=42)

for iter_num, (train_indices, test_indices) in enumerate(cv.split(X, y)):
    X_train, y_train = X[train_indices], y[train_indices]
    X_test, y_test = X[test_indices], y[test_indices]

    lr_clf.fit(X_train, y_train)    # fit on the training split
    y_hat = lr_clf.predict(X_test)  # predict on the held-out split

    print("====Iteration", iter_num, " ====")
    acc = mt.accuracy_score(y_test, y_hat)
    conf = mt.confusion_matrix(y_test, y_hat)
    print('accuracy:', acc)
    print(conf)

====Iteration 0  ====
accuracy: 0.8309342365227208
[[1537  131]
 [ 468 1407]]
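
Accuracy alone hides how the errors are distributed. Using only the confusion matrix printed above (rows are true classes and columns are predicted classes, following scikit-learn's convention), precision and recall for the Injury class can be worked out directly. This is a small arithmetic sketch, not additional model output.

```python
import numpy as np

# Confusion matrix copied from the output above: rows = true class, columns = predicted class
conf = np.array([[1537, 131],
                 [468, 1407]])
tn, fp, fn, tp = conf.ravel()

accuracy = (tp + tn) / conf.sum()
precision = tp / (tp + fp)  # of the events predicted as Injury, the share that truly were
recall = tp / (tp + fn)     # of the true Injury events, the share that were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

Recall of about 0.75 against precision of about 0.91 indicates that the model misses roughly a quarter of the injury events while raising comparatively few false alarms.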


# Dataframe for predicting broad phase of flight

In [262]:
# Remove rows whose phase of flight is UNKNOWN or UNK
new_phase_df = phase_df[(phase_df['phase_flt_spec'] != "UNKNOWN") & (phase_df['phase_flt_spec'] != "UNK")].copy()

In [263]:
new_phase_df["phase_flt_spec"].value_counts()

LANDING        16743
TAKEOFF         5776
APPROACH        3300
DESCENT         2995
MANEUVERING     1569
CRUISE          1218
CLIMB            923
TAXI             741
STANDING         393
GOAROUND         301
OTHER            162
HOVER             94
Name: phase_flt_spec, dtype: int64

In [264]:
new_phase_df.reset_index(drop=True,inplace=True)

In [265]:
del new_phase_df["ev_id"]
del new_phase_df["ev_highest_injury"]
del new_phase_df["dprt_city"]

For classifying the broad phase of flight, we keep all of the individual injury-count columns as features.

In [266]:
new_phase_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34215 entries, 0 to 34214
Data columns (total 27 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   acft_make               34215 non-null  object 
 1   acft_model              34215 non-null  object 
 2   cert_max_gr_wt          34215 non-null  float64
 3   acft_category           34215 non-null  object 
 4   damage                  34215 non-null  object 
 5   far_part                34215 non-null  object 
 6   afm_hrs_last_insp       34215 non-null  float64
 7   type_fly                34215 non-null  object 
 8   dprt_state              34215 non-null  object 
 9   rwy_len                 34215 non-null  float64
 10  rwy_width               34215 non-null  float64
 11  ev_type                 34215 non-null  object 
 12  ev_city                 34215 non-null  object 
 13  ev_state                34215 non-null  object 
 14  ev_country              34215 non-null

In [267]:
new_phase_df.head()

Unnamed: 0,acft_make,acft_model,cert_max_gr_wt,acft_category,damage,far_part,afm_hrs_last_insp,type_fly,dprt_state,rwy_len,rwy_width,ev_type,ev_city,ev_state,ev_country,inj_f_grnd,inj_m_grnd,inj_s_grnd,Total_Fatal_Injuries,Total_Minor_Injuries,Total_Uninjured,Total_Serious_Injuries,Total_Injuries_Flight,sky_cond_ceil,sky_cond_nonceil,wx_int_precip,phase_flt_spec
0,BOEING,747-100,750000.0,AIR,MINR,121,113.0,UNK,JA,11800.0,150.0,INC,FAIRBANKS,AK,USA,0.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,NONE,SCAT,UNK,LANDING
1,CESSNA,207,3800.0,AIR,SUBS,135,49.0,UNK,AK,2610.0,40.0,ACC,CHEVAK,AK,USA,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,BKN,UNK,UNK,DESCENT
2,BEECH,300,14100.0,AIR,DEST,91,3.0,EXEC,SC,5500.0,100.0,ACC,CULLMAN,AL,USA,0.0,0.0,0.0,2.0,0.0,0.0,0.0,2.0,BKN,UNK,MOD,APPROACH
3,AERO COMMANDER,560A,6000.0,AIR,DEST,91,13.0,PERS,UNK,3800.0,36.0,ACC,BELLEVIEW,FL,USA,0.0,0.0,0.0,2.0,0.0,0.0,2.0,4.0,NONE,CLER,UNK,APPROACH
4,GETTIS H. HUDSON,CORBAN BABY ACE,900.0,AIR,SUBS,91,10.0,PERS,UNK,1900.0,75.0,ACC,PLEASANT VIEW,TN,USA,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,NONE,OVCT,UNK,TAKEOFF


|Phase of Flight|Code|
|----|----|
|OTHER|1|
|APPROACH|2|
|CLIMB|3|
|CRUISE|4|
|DESCENT|5|
|GOAROUND|6|
|HOVER|7|
|LANDING|8|
|MANEUVERING|9|
|STANDING|10|
|TAKEOFF|11|
|TAXI|12|

In [268]:
new_phase_df["phase_flt_spec"] = new_phase_df["phase_flt_spec"].replace({"OTHER":1,"APPROACH":2,"CLIMB":3,
                                                                         "CRUISE":4,"DESCENT":5,"GOAROUND":6,
                                                                         "HOVER":7,"LANDING":8,"MANEUVERING":9,
                                                                         "STANDING":10,"TAKEOFF":11,"TAXI":12})

In [269]:
new_phase_df.columns

Index(['acft_make', 'acft_model', 'cert_max_gr_wt', 'acft_category', 'damage',
       'far_part', 'afm_hrs_last_insp', 'type_fly', 'dprt_state', 'rwy_len',
       'rwy_width', 'ev_type', 'ev_city', 'ev_state', 'ev_country',
       'inj_f_grnd', 'inj_m_grnd', 'inj_s_grnd', 'Total_Fatal_Injuries',
       'Total_Minor_Injuries', 'Total_Uninjured', 'Total_Serious_Injuries',
       'Total_Injuries_Flight', 'sky_cond_ceil', 'sky_cond_nonceil',
       'wx_int_precip', 'phase_flt_spec'],
      dtype='object')

In [270]:
new_phase_df["phase_flt_spec"].value_counts()

8     16743
11     5776
2      3300
5      2995
9      1569
4      1218
3       923
12      741
10      393
6       301
1       162
7        94
Name: phase_flt_spec, dtype: int64

In [271]:
X_broad = new_phase_df.drop("phase_flt_spec", axis = 1).copy()
y_broad = new_phase_df["phase_flt_spec"].copy()

In [272]:
X_broad.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34215 entries, 0 to 34214
Data columns (total 26 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   acft_make               34215 non-null  object 
 1   acft_model              34215 non-null  object 
 2   cert_max_gr_wt          34215 non-null  float64
 3   acft_category           34215 non-null  object 
 4   damage                  34215 non-null  object 
 5   far_part                34215 non-null  object 
 6   afm_hrs_last_insp       34215 non-null  float64
 7   type_fly                34215 non-null  object 
 8   dprt_state              34215 non-null  object 
 9   rwy_len                 34215 non-null  float64
 10  rwy_width               34215 non-null  float64
 11  ev_type                 34215 non-null  object 
 12  ev_city                 34215 non-null  object 
 13  ev_state                34215 non-null  object 
 14  ev_country              34215 non-null

In [273]:
#One hot encode specific columns without standardizing and scaling continuous variables
from sklearn.preprocessing import OneHotEncoder
categorical_features = ['acft_make', 'acft_model', 'acft_category', 'damage','far_part', 'type_fly',
                        'dprt_state','ev_type','ev_city', 'ev_state','ev_country', 'sky_cond_ceil',
                        'sky_cond_nonceil','wx_int_precip']

ohe = OneHotEncoder()

X_object_broad = X_broad.select_dtypes('object')
ohe.fit(X_object_broad)

codes_1 = ohe.transform(X_object_broad).toarray()
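# get_feature_names was removed in scikit-learn 1.2; get_feature_names_out (available since 1.0) replaces it
feature_names_1 = ohe.get_feature_names_out(categorical_features)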

X_broad = pd.concat([X_broad.select_dtypes(exclude='object'), 
               pd.DataFrame(codes_1,columns=feature_names_1).astype(int)], axis=1)

In [274]:
X_broad

Unnamed: 0,cert_max_gr_wt,afm_hrs_last_insp,rwy_len,rwy_width,inj_f_grnd,inj_m_grnd,inj_s_grnd,Total_Fatal_Injuries,Total_Minor_Injuries,Total_Uninjured,Total_Serious_Injuries,Total_Injuries_Flight,acft_make_ HACKNEY,acft_make_ LARSON,acft_make_1977 COLFER-CHAN,acft_make_2001 MCGIRL,acft_make_781569 INC,acft_make_A PAIR OF JACKS,acft_make_A. H. GETTINGS,acft_make_AARDEMA ROBERT JOHN,acft_make_AB SPORTINE AVIACIJA,acft_make_ABERNATHY,acft_make_ACRO SPORT,acft_make_ACRODUSTER,acft_make_ADAMS,...,ev_country_TK,ev_country_TW,ev_country_UK,ev_country_UNK,ev_country_USA,ev_country_VG,ev_country_WN,sky_cond_ceil_BKN,sky_cond_ceil_NONE,sky_cond_ceil_OBSC,sky_cond_ceil_OVC,sky_cond_ceil_UNK,sky_cond_ceil_VV,sky_cond_nonceil_BKNT,sky_cond_nonceil_CLER,sky_cond_nonceil_FEW,sky_cond_nonceil_OVCT,sky_cond_nonceil_POBS,sky_cond_nonceil_SCAT,sky_cond_nonceil_UNK,wx_int_precip_HVY,wx_int_precip_LGT,wx_int_precip_LT,wx_int_precip_MOD,wx_int_precip_UNK
0,750000.0,113.0,11800.0,150.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1
1,3800.0,49.0,2610.0,40.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1
2,14100.0,3.0,5500.0,100.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,2.0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0
3,6000.0,13.0,3800.0,36.0,0.0,0.0,0.0,2.0,0.0,0.0,2.0,4.0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1
4,900.0,10.0,1900.0,75.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
34210,2080.0,1.0,2700.0,80.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,1
34211,48300.0,17.0,7001.0,100.0,0.0,0.0,0.0,6.0,0.0,0.0,0.0,6.0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1
34212,1320.0,1.0,600.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1
34213,1450.0,44.0,4000.0,75.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1


In [236]:
X_broad[0:3].describe()

Unnamed: 0,cert_max_gr_wt,afm_hrs_last_insp,rwy_len,rwy_width,inj_f_grnd,inj_m_grnd,inj_s_grnd,Total_Fatal_Injuries,Total_Minor_Injuries,Total_Uninjured,Total_Serious_Injuries,Total_Injuries_Flight,acft_make_ HACKNEY,acft_make_ LARSON,acft_make_1977 COLFER-CHAN,acft_make_2001 MCGIRL,acft_make_781569 INC,acft_make_A PAIR OF JACKS,acft_make_A. H. GETTINGS,acft_make_AARDEMA ROBERT JOHN,acft_make_AB SPORTINE AVIACIJA,acft_make_ABERNATHY,acft_make_ACRO SPORT,acft_make_ACRODUSTER,acft_make_ADAMS,...,ev_country_TK,ev_country_TW,ev_country_UK,ev_country_UNK,ev_country_USA,ev_country_VG,ev_country_WN,sky_cond_ceil_BKN,sky_cond_ceil_NONE,sky_cond_ceil_OBSC,sky_cond_ceil_OVC,sky_cond_ceil_UNK,sky_cond_ceil_VV,sky_cond_nonceil_BKNT,sky_cond_nonceil_CLER,sky_cond_nonceil_FEW,sky_cond_nonceil_OVCT,sky_cond_nonceil_POBS,sky_cond_nonceil_SCAT,sky_cond_nonceil_UNK,wx_int_precip_HVY,wx_int_precip_LGT,wx_int_precip_LT,wx_int_precip_MOD,wx_int_precip_UNK
count,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,...,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0
mean,255966.666667,55.0,6636.666667,96.666667,0.0,0.0,0.0,0.666667,0.0,1.666667,0.0,0.666667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.666667,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.333333,0.666667,0.0,0.0,0.0,0.333333,0.666667
std,427876.411284,55.244909,4699.258807,55.075705,0.0,0.0,0.0,1.154701,0.0,2.081666,0.0,1.154701,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.57735,0.57735,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.57735,0.57735,0.0,0.0,0.0,0.57735,0.57735
min,3800.0,3.0,2610.0,40.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,8950.0,26.0,4055.0,70.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.5
50%,14100.0,49.0,5500.0,100.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0
75%,382050.0,81.0,8650.0,125.0,0.0,0.0,0.0,1.0,0.0,2.5,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,1.0,0.0,0.0,0.0,0.5,1.0
max,750000.0,113.0,11800.0,150.0,0.0,0.0,0.0,2.0,0.0,4.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0


In [275]:
X_broad = X_broad.to_numpy()

In [276]:
from sklearn.preprocessing import LabelEncoder
LE = LabelEncoder()
LE.fit(y_broad)
y_broad = LE.transform(y_broad)
y_broad

array([ 7,  4,  1, ...,  5,  7, 10])
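
Because `LabelEncoder` sorts the classes it sees, the manual codes 1 to 12 are re-mapped to 0 to 11 here, so every encoded value is one less than the code in the phase-of-flight table. The helper below (`decode_phase` is a hypothetical name introduced for illustration) inverts both steps to recover readable phase names, which helps when labelling the rows of a confusion matrix.

```python
# Manual mapping applied earlier in the notebook (phase name -> code 1..12)
phase_codes = {"OTHER": 1, "APPROACH": 2, "CLIMB": 3, "CRUISE": 4,
               "DESCENT": 5, "GOAROUND": 6, "HOVER": 7, "LANDING": 8,
               "MANEUVERING": 9, "STANDING": 10, "TAKEOFF": 11, "TAXI": 12}

# LabelEncoder turns the sorted codes 1..12 into 0..11, i.e. encoded = code - 1
encoded_to_phase = {code - 1: name for name, code in phase_codes.items()}

def decode_phase(encoded_labels):
    """Map LabelEncoder outputs (0..11) back to phase-of-flight names."""
    return [encoded_to_phase[int(k)] for k in encoded_labels]

# The first three encoded labels shown in the output above
print(decode_phase([7, 4, 1]))
```

The decoded values match the `phase_flt_spec` entries of the first three rows shown by `new_phase_df.head()`.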

In [277]:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics as mt
import time
lr_clf_2 = LogisticRegression(random_state=42)
iter_num=0
for train_indices, test_indices in cv.split(X_broad,y_broad): 
    X_train = X_broad[train_indices]
    y_train = y_broad[train_indices]
    
    X_test = X_broad[test_indices]
    y_test = y_broad[test_indices]
    lr_clf_2.fit(X_train,y_train)  # train object

    y_hat = lr_clf_2.predict(X_test) # get test set predictions
    print("====Iteration",iter_num," ====")
    acc = mt.accuracy_score(y_test,y_hat)
    conf = mt.confusion_matrix(y_test,y_hat)
    print('accuracy:', acc )
    print(conf )

    iter_num+=1

====Iteration 0  ====
accuracy: 0.44447691408533024
[[   0    0    0    0    0    0    0   14    0    0    2    0]
 [   0    0    0   17    0    0    0  310    0    0    3    0]
 [   0    0    0    0    0    0    0   91    0    0    1    0]
 [   0    0    0   60    1    0    0   56    0    0    5    0]
 [   0    0    0   96    0    0    0  191    0    0   13    0]
 [   0    0    0    0    0    0    0   30    0    0    0    0]
 [   0    0    0    3    0    0    0    5    0    0    1    0]
 [   0    0    0  191    0    0    0 1454    0    0   30    0]
 [   0    0    0   75    0    0    0   74    0    0    8    0]
 [   0    0    0    2    0    0    0   36    0    0    1    0]
 [   0    0    0   39    1    0    0  531    0    0    7    0]
 [   0    0    0   15    0    0    0   58    0    0    1    0]]
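
The confusion matrix above can be compared against a trivial baseline. The sketch below copies the matrix, computes the accuracy of always predicting the most common true class in the test split, and lists per-class recall. The phase names are attached under the assumption that row `k` corresponds to manual code `k + 1` from the mapping table.

```python
import numpy as np

phases = ["OTHER", "APPROACH", "CLIMB", "CRUISE", "DESCENT", "GOAROUND",
          "HOVER", "LANDING", "MANEUVERING", "STANDING", "TAKEOFF", "TAXI"]

# Confusion matrix copied from the output above: rows = true class, columns = predicted class
conf = np.array([
    [0, 0, 0,   0, 0, 0, 0,   14, 0, 0,  2, 0],
    [0, 0, 0,  17, 0, 0, 0,  310, 0, 0,  3, 0],
    [0, 0, 0,   0, 0, 0, 0,   91, 0, 0,  1, 0],
    [0, 0, 0,  60, 1, 0, 0,   56, 0, 0,  5, 0],
    [0, 0, 0,  96, 0, 0, 0,  191, 0, 0, 13, 0],
    [0, 0, 0,   0, 0, 0, 0,   30, 0, 0,  0, 0],
    [0, 0, 0,   3, 0, 0, 0,    5, 0, 0,  1, 0],
    [0, 0, 0, 191, 0, 0, 0, 1454, 0, 0, 30, 0],
    [0, 0, 0,  75, 0, 0, 0,   74, 0, 0,  8, 0],
    [0, 0, 0,   2, 0, 0, 0,   36, 0, 0,  1, 0],
    [0, 0, 0,  39, 1, 0, 0,  531, 0, 0,  7, 0],
    [0, 0, 0,  15, 0, 0, 0,   58, 0, 0,  1, 0],
])

support = conf.sum(axis=1)             # number of true examples per class
accuracy = np.trace(conf) / conf.sum()
baseline = support.max() / conf.sum()  # accuracy of always predicting the most frequent class
recall = np.diag(conf) / support       # per-class recall

print(f"model accuracy:    {accuracy:.4f}")
print(f"majority baseline: {baseline:.4f} ({phases[support.argmax()]})")
for name, r, n in zip(phases, recall, support):
    print(f"{name:<12} recall={r:.3f} (n={n})")
```

On this split the model's accuracy (about 0.444) is below the always-LANDING baseline (about 0.489), and nine of the twelve classes have zero recall, which is consistent with the convergence warning that follows.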


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
