You are to build upon the predictive analysis (classification) that you completed in the previous mini-project, adding modeling from new classification algorithms as well as more explanations that are in line with the CRISP-DM framework. You should use appropriate cross-validation for all of your analysis (explain your chosen method of performance validation in detail). Try to use as much testing data as possible in a realistic manner (you should define what you think is realistic and why).
This report is worth 20% of the final grade. Please upload a report (one per team) with all code used, visualizations, and text in a single document. The format of the document can be PDF, *.ipynb, or HTML. You can write the report in whatever format you like, but it is easiest to turn in the rendered iPython notebook. The results should be reproducible using your report. Please carefully describe every assumption and every step in your report.
### Dataset Selection
Select a dataset the same way you did for the first project work week and mini-project. You are not required to use the same dataset that you used in the past, but you are encouraged to do so. You must identify two tasks from the dataset to regress or classify. That is:
• two classification tasks OR
• two regression tasks OR
• one classification task and one regression task
For example, if your dataset was the diabetes data, you might try to predict two tasks: (1) classifying whether a patient will be readmitted within a 30-day period, and (2) regressing the total number of days a patient will spend in the hospital, given their history and specifics of the encounter such as tests administered and previous admittance.
### Grading Rubric
• Data Preparation (15 points total)
• [10 points] Define and prepare your class variables. Use proper variable representations (int, float, one-hot, etc.). Use pre-processing methods (as needed) for dimensionality reduction, scaling, etc. Remove variables that are not needed/useful for the analysis.
• [5 points] Describe the final dataset that is used for classification/regression (include a description of any newly formed variables you created).
• Modeling and Evaluation (70 points total)
• [10 points] Choose and explain the evaluation metrics that you will use (e.g., accuracy, precision, recall, F-measure, or any metric we have discussed). Why are the measure(s) appropriate for analyzing the results of your modeling? Give a detailed explanation backing up any assertions.
• [10 points] Choose the method you will use for dividing your data into training and testing splits (e.g., are you using stratified 10-fold cross-validation? Why?). Explain why your chosen method is appropriate or use more than one method as appropriate.
• [20 points] Create three different classification/regression models (e.g., random forest, KNN, and SVM). Two modeling techniques must be new (but the third could be SVM or logistic regression). Adjust parameters as appropriate to increase generalization performance using your chosen metric.
• [10 points] Analyze the results using your chosen method of evaluation. Use visualizations of the results to bolster the analysis. Explain any visuals and analyze why they are interesting to someone that might use this model.
• [10 points] Discuss the advantages of each model for each classification task, if any. If there are no advantages, explain why. Is any model better than another? Is the difference significant with 95% confidence? Use proper statistical comparison methods.
• [10 points] Which attributes from your analysis are most important? Use proper methods discussed in class to evaluate the importance of different attributes. Discuss the results and hypothesize about why certain attributes are more important than others for a given classification task.
• Deployment (5 points total)
• [5 points] How useful is your model for interested parties (i.e., the companies or organizations that might want to use it for prediction)? How would you measure the model's value if it was used by these parties? How would you deploy your model for interested parties? What other data should be collected? How often would the model need to be updated, etc.?
• Exceptional Work (10 points total)
• You have free rein to provide additional modeling.
• One idea: grid search parameters in a parallelized fashion and visualize the performances across attributes. Which parameters are most significant for making a good model for each classification algorithm?
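As a starting point for the grid search idea, here is a minimal sketch of a parallelized search under stratified 10-fold cross-validation. It is an illustration only: the synthetic data from `make_classification`, the KNN pipeline, the parameter grid, and the F1 scoring choice are all assumptions, not requirements of the rubric or the aviation dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): replace with the prepared features and target.
X, y = make_classification(n_samples=400, n_features=8, n_informative=4, random_state=0)

# Scaling lives inside the pipeline so it is re-fit on each training fold only.
pipeline = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])

# Hypothetical grid: 3 neighbor counts x 2 weighting schemes = 6 candidates.
param_grid = {"knn__n_neighbors": [3, 7, 15], "knn__weights": ["uniform", "distance"]}

# Stratified folds keep the class balance roughly equal in every split.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# n_jobs=-1 evaluates candidates in parallel on all available cores.
search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=cv, n_jobs=-1)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

The per-candidate scores in `search.cv_results_` can be loaded into a DataFrame and plotted to see which parameters move the metric most.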

Two dataframes for each classification task

Data cleanup (Dylan and Satvik)
Broad phase of flight dataframe

Injury (Injury)  for KNN (Nnenna)
- Look into ROC Curves
- Look at Sklearn parameters for KNN
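For the two KNN items above, a rough sketch of how an ROC curve could be computed from a KNN classifier in scikit-learn. The data here is synthetic and binary (an assumption; the real injury target may need binarizing or a one-vs-rest treatment), and `n_neighbors=15` with `weights="distance"` are placeholder values, not tuned choices.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary data (assumption): stands in for the injury dataframe.
X, y = make_classification(n_samples=500, n_features=8, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# KNN is distance based, so features are standardized first.
knn = make_pipeline(
    StandardScaler(),
    KNeighborsClassifier(n_neighbors=15, weights="distance"),
)
knn.fit(X_train, y_train)

# Probability of the positive class drives the ROC curve.
scores = knn.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, scores)
auc = roc_auc_score(y_test, scores)
print(f"ROC AUC: {auc:.3f}")
```

Plotting `fpr` against `tpr` (for example with matplotlib) gives the curve. The main `KNeighborsClassifier` parameters to look at are `n_neighbors`, `weights`, `metric`, and `p`.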


Injury (Injury) for Decision Trees (Jobin)
- Look at Sklearn parameters for decision trees
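A minimal sketch for exploring decision tree parameters, assuming synthetic imbalanced data in place of the injury dataframe. The depths tried, `min_samples_leaf=5`, `class_weight="balanced"`, and F1 scoring are illustrative assumptions rather than settled choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced binary data (assumption): roughly 80% / 20% classes.
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5, weights=[0.8, 0.2], random_state=0
)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# Compare mean cross-validated F1 for a few depths; None lets the tree grow fully.
results = {}
for depth in (3, 6, None):
    tree = DecisionTreeClassifier(
        max_depth=depth,          # limits tree depth to control overfitting
        min_samples_leaf=5,       # each leaf must cover at least 5 samples
        class_weight="balanced",  # reweights classes inversely to their frequency
        random_state=0,
    )
    fold_scores = cross_val_score(tree, X, y, cv=cv, scoring="f1")
    results[depth] = fold_scores.mean()

for depth, mean_f1 in results.items():
    print(f"max_depth={depth}: mean F1 = {mean_f1:.3f}")
```

Other `DecisionTreeClassifier` parameters worth checking include `criterion`, `min_samples_split`, `max_features`, and `ccp_alpha`.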

Injury (Injury) for KNN

Injury (Injury) for Decision Trees



In [1]:
import pandas as pd
import numpy as np

In [2]:
# Read in the aviation data
final_data = pd.read_csv("../Data/final_data.csv", low_memory=False, dtype={'damage': str})
# Drop columns that were imported incorrectly
final_data = final_data.drop(columns=["Unnamed: 0", "dprt_state.1", "index", "ntsb_no_x", "ev_id"])

final_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 115706 entries, 0 to 115705
Data columns (total 30 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   acft_make          115643 non-null  object 
 1   acft_model         115630 non-null  object 
 2   cert_max_gr_wt     98673 non-null   float64
 3   acft_category      115287 non-null  object 
 4   damage             113877 non-null  object 
 5   far_part           114925 non-null  object 
 6   afm_hrs_last_insp  60298 non-null   float64
 7   type_fly           108599 non-null  object 
 8   dprt_city          111864 non-null  object 
 9   dprt_state         108791 non-null  object 
 10  rwy_len            64222 non-null   float64
 11  rwy_width          63110 non-null   float64
 12  ev_type            115706 non-null  object 
 13  ev_city            115646 non-null  object 
 14  ev_state           109635 non-null  object 
 15  ev_country         115199 non-null  object 
 16  ev

In [3]:
final_data

Unnamed: 0,acft_make,acft_model,cert_max_gr_wt,acft_category,damage,far_part,afm_hrs_last_insp,type_fly,dprt_city,dprt_state,...,inj_tot_f,inj_tot_m,inj_tot_n,inj_tot_s,inj_tot_t,sky_cond_ceil,sky_cond_nonceil,wind_vel_ind,wx_int_precip,phase_flt_spec
0,Cessna,207,3800.0,AIR,SUBS,135,75.0,UNK,BETHEL,AK,...,,1.0,,,1.0,BKN,UNK,UNK,UNK,Approach
1,Boeing,747-100,750000.0,AIR,MINR,121,113.0,UNK,CHITOSE,JA,...,,,4.0,,,NONE,SCAT,CALM,UNK,Landing
2,Piper,PA-31-350,7369.0,AIR,SUBS,135,32.0,UNK,CHENEGA BAY,AK,...,,,6.0,,,OVC,SCAT,UNK,UNK,Unknown
3,Cessna,172,2300.0,AIR,SUBS,091,40.0,PERS,,,...,,,1.0,,,BKN,UNK,UNK,LGT,Unknown
4,Cessna,207,3800.0,AIR,SUBS,135,49.0,UNK,,AK,...,,,1.0,,,BKN,UNK,UNK,UNK,Descent
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
115701,PIPER,PA38,,,,,,,,,...,0.0,0.0,1.0,0.0,1.0,,,,,Maneuvering
115702,PIPER,PA-18-150,,AIR,SUBS,091,,PERS,,,...,0.0,0.0,1.0,0.0,1.0,NONE,CLER,F,,Takeoff
115703,CESSNA,172,,,,091,,,,,...,0.0,0.0,2.0,0.0,2.0,,,,,Taxi
115704,PIPER,PA-12,,,,091,,,,,...,0.0,0.0,2.0,0.0,2.0,,,,,GoAround


In [7]:
# Replace whitespace-only strings with NaN to fix the dprt_city column
final_data = final_data.replace(r'^\s+$', np.nan, regex=True)
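To check what the `replace` call above does and does not catch, here is a small sketch on a hypothetical toy frame (the values are made up, not taken from the aviation data). It suggests that the pattern `^\s+$` converts whitespace-only strings to NaN but leaves truly empty strings untouched, because `\s+` needs at least one character.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (assumption): mimics blank-looking departure fields.
toy = pd.DataFrame({
    "dprt_city": ["BETHEL", "   ", "", " ANCHORAGE "],
    "dprt_state": ["AK", " ", "AK", "AK"],
})

# Same pattern as the cell above: only strings made up entirely of whitespace match.
cleaned = toy.replace(r"^\s+$", np.nan, regex=True)

# Count missing values per column after cleaning.
missing_per_column = cleaned.isna().sum()
print(missing_per_column)

# The empty string in row 2 is still present, since "\s+" requires one or more characters.
print(repr(cleaned.loc[2, "dprt_city"]))
```

If empty strings should also count as missing, the pattern `^\s*$` would cover both cases. Whether the aviation data contains any empty strings is not known from the output shown here.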

In [8]:
final_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 115706 entries, 0 to 115705
Data columns (total 30 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   acft_make          115635 non-null  object 
 1   acft_model         115608 non-null  object 
 2   cert_max_gr_wt     98673 non-null   float64
 3   acft_category      115287 non-null  object 
 4   damage             113877 non-null  object 
 5   far_part           114925 non-null  object 
 6   afm_hrs_last_insp  60298 non-null   float64
 7   type_fly           108599 non-null  object 
 8   dprt_city          93107 non-null   object 
 9   dprt_state         90890 non-null   object 
 10  rwy_len            64222 non-null   float64
 11  rwy_width          63110 non-null   float64
 12  ev_type            115706 non-null  object 
 13  ev_city            115644 non-null  object 
 14  ev_state           109635 non-null  object 
 15  ev_country         115199 non-null  object 
 16  ev