## Grading Rubric
### Business Understanding (10 points total)

• Describe the purpose of the data set you selected (i.e., why was this data collected in the first place?). Describe how you would define and measure the outcomes from the dataset. That is, why is this data important and how do you know if you have mined useful knowledge from the dataset? How would you measure the effectiveness of a good prediction algorithm? Be specific.

### Data Understanding (80 points total)
• [10 points] Describe the meaning and type of data (scale, values, etc.) for each attribute in the data file.

• [15 points] Verify data quality: Explain any missing values, duplicate data, and outliers. Are those mistakes? How do you deal with these problems? Be specific.

• [10 points] Give simple, appropriate statistics (range, mode, mean, median, variance, counts, etc.) for the most important attributes and describe what they mean or if you found something interesting. Note: You can also use data from other sources for comparison. Explain the significance of the statistics run and why they are meaningful.

• [15 points] Visualize the most important attributes appropriately (at least 5 attributes). Important: Provide an interpretation for each chart. Explain for each attribute why the chosen visualization is appropriate.

• [15 points] Explore relationships between attributes: Look at the attributes via scatter plots, correlation, cross-tabulation, group-wise averages, etc. as appropriate. Explain any interesting relationships.

• [10 points] Identify and explain interesting relationships between features and the class you are trying to predict (i.e., relationships with variables and the target classification).

• [5 points] Are there other features that could be added to the data or created from existing features? Which ones?
 
### Exceptional Work (10 points total)
• You have free rein to provide additional analyses.
• One idea: implement dimensionality reduction, then visualize and interpret the results.  

# Business Understanding

In [2]:
import pandas as pd
import numpy as np

In [3]:
aviation_data = pd.read_csv("Data/AviationData.csv")
aviation_data.columns

Index(['Event.Id', 'Investigation.Type', 'Accident.Number', 'Event.Date',
       'LOCATION', 'Country', 'Latitude', 'Longitude', 'Airport.Code',
       'Airport.Name', 'Injury.Severity', 'Aircraft.damage',
       'Aircraft.Category', 'Registration.Number', 'Make', 'Model',
       'Amateur.Built', 'Number.of.Engines', 'Engine.Type', 'FAR.Description',
       'Schedule', 'Purpose.of.flight', 'Air.carrier', 'Total.Fatal.Injuries',
       'Total.Serious.Injuries', 'Total.Minor.Injuries', 'Total.Uninjured',
       'Weather.Condition', 'Broad.phase.of.flight', 'Report.Status',
       'Publication.Date'],
      dtype='object')

In [4]:
aviation_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 85976 entries, 0 to 85975
Data columns (total 31 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Event.Id                85976 non-null  object 
 1   Investigation.Type      85963 non-null  object 
 2   Accident.Number         85976 non-null  object 
 3   Event.Date              85976 non-null  object 
 4   LOCATION                85898 non-null  object 
 5   Country                 85469 non-null  object 
 6   Latitude                31587 non-null  float64
 7   Longitude               31578 non-null  float64
 8   Airport.Code            48612 non-null  object 
 9   Airport.Name            51298 non-null  object 
 10  Injury.Severity         85842 non-null  object 
 11  Aircraft.damage         83047 non-null  object 
 12  Aircraft.Category       29226 non-null  object 
 13  Registration.Number     81756 non-null  object 
 14  Make                    85908 non-null

In [8]:
aviation_data.head()

Unnamed: 0,Event.Id,Investigation.Type,Accident.Number,Event.Date,LOCATION,Country,Latitude,Longitude,Airport.Code,Airport.Name,...,Purpose.of.flight,Air.carrier,Total.Fatal.Injuries,Total.Serious.Injuries,Total.Minor.Injuries,Total.Uninjured,Weather.Condition,Broad.phase.of.flight,Report.Status,Publication.Date
0,20210200000000.0,Accident,CEN21FA130,2021-02-16,"JANESVILLE, WI",United States,42.595377,-89.030245,,,...,Ferry,,2.0,0.0,0.0,0.0,VMC,,,
1,20210200000000.0,Accident,ERA21FA130,2021-02-15,"St Thomas, CB",United States,18.354444,-65.027778,,,...,Aobv,Caribbean Buzz Management Llc.,4.0,0.0,0.0,0.0,VMC,,,
2,20210200000000.0,Accident,ANC21LA017,2021-02-13,"TYONEK, AK",United States,61.336392,-152.01643,,,...,Personal,Paul Andrews,0.0,0.0,2.0,0.0,,,,
3,20210200000000.0,Accident,CEN21LA127,2021-02-12,"PRAIRIE DU SAC, WI",United States,43.297731,-89.755693,91C,SAUK-PRAIRIE,...,Instructional,,0.0,0.0,0.0,1.0,VMC,,,
4,20210200000000.0,Accident,ERA21LA131,2021-02-10,"LAKE PLACID, FL",United States,27.243723,-81.413767,09FA,,...,Personal,Case Robert,0.0,0.0,1.0,0.0,,,,


In [5]:
from pandas.plotting import scatter_matrix
import seaborn as sns

# Data Meaning/Type

In [17]:
aviation_data.head()

Unnamed: 0,Event.Id,Investigation.Type,Accident.Number,Event.Date,Country,Latitude,Longitude,Airport.Code,Airport.Name,Injury.Severity,...,Total.Fatal.Injuries,Total.Serious.Injuries,Total.Minor.Injuries,Total.Uninjured,Weather.Condition,Broad.phase.of.flight,Report.Status,Publication.Date,City,State
0,20210200000000.0,Accident,CEN21FA130,2021-02-16,United States,42.595377,-89.030245,,,Fatal,...,2.0,0.0,0.0,0.0,VMC,,,,JANESVILLE,WI
1,20210200000000.0,Accident,ERA21FA130,2021-02-15,United States,18.354444,-65.027778,,,Fatal,...,4.0,0.0,0.0,0.0,VMC,,,,St Thomas,CB
2,20210200000000.0,Accident,ANC21LA017,2021-02-13,United States,61.336392,-152.01643,,,Minor,...,0.0,0.0,2.0,0.0,,,,,TYONEK,AK
3,20210200000000.0,Accident,CEN21LA127,2021-02-12,United States,43.297731,-89.755693,91C,SAUK-PRAIRIE,Non-Fatal,...,0.0,0.0,0.0,1.0,VMC,,,,PRAIRIE DU SAC,WI
4,20210200000000.0,Accident,ERA21LA131,2021-02-10,United States,27.243723,-81.413767,09FA,,Minor,...,0.0,0.0,1.0,0.0,,,,,LAKE PLACID,FL


# Verify Data Quality

In [6]:
#We have 31 columns to work with
#First we will check what percent of each column is null

#Percent of missing data
percent_missing = aviation_data.isnull().sum() * 100 / len(aviation_data)
#Create DF
missing_value_df = pd.DataFrame({'column_name': aviation_data.columns,
                                 'percent_missing': percent_missing})
#Then sort from least to most missing
missing_value_df.sort_values('percent_missing', inplace=True)
missing_value_df

Unnamed: 0,column_name,percent_missing
Event.Id,Event.Id,0.0
Accident.Number,Accident.Number,0.0
Event.Date,Event.Date,0.0
Investigation.Type,Investigation.Type,0.01512
Make,Make,0.079092
LOCATION,LOCATION,0.090723
Model,Model,0.115148
Injury.Severity,Injury.Severity,0.155857
Country,Country,0.589699
Amateur.Built,Amateur.Built,0.790918


The missing-data summary above shows that columns such as Air.carrier and Schedule are missing the most values. When predicting with this dataset we will primarily focus on the total number of injuries and on Injury.Severity. One thing to consider about the missing Air.carrier values: if there were a correlation between air carrier and plane crashes, we do not believe that business would still be operating.
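To read off the columns with the most missing values directly, the same percentages can be sorted in descending order. The following is a minimal sketch on a made-up toy frame (the values are placeholders, not rows from the dataset); `toy` and `top_missing` are names introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for aviation_data; all values are made up.
toy = pd.DataFrame({
    "Event.Id": ["a", "b", "c", "d"],
    "Schedule": [np.nan, np.nan, np.nan, "x"],
    "Air.carrier": [np.nan, "y", np.nan, "z"],
    "Country": ["United States", None, "United States", "United States"],
})

# isna().mean() gives the fraction of missing values per column;
# multiplying by 100 turns it into a percentage.
percent_missing = toy.isna().mean().mul(100).sort_values(ascending=False)

# The two columns with the highest share of missing values.
top_missing = percent_missing.head(2)
print(top_missing)
```

On the real data, replacing `toy` with `aviation_data` would list the most incomplete columns first.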

Regarding the many missing values in Total.Fatal.Injuries, Total.Minor.Injuries, and Total.Serious.Injuries: we will add a total injuries column and consult outside sources to confirm that these nulls can be treated as 0s.
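A minimal sketch of how the total injuries column could be built, assuming the nulls are confirmed to mean zero. The column names `Total.Injuries` and `Injuries.All.Missing` are hypothetical, and the toy values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for aviation_data; all values are made up.
sample = pd.DataFrame({
    "Total.Fatal.Injuries": [2.0, np.nan, 0.0],
    "Total.Serious.Injuries": [0.0, np.nan, 1.0],
    "Total.Minor.Injuries": [np.nan, np.nan, 3.0],
})

injury_cols = [
    "Total.Fatal.Injuries",
    "Total.Serious.Injuries",
    "Total.Minor.Injuries",
]

# Flag rows where every injury count is missing, so the
# "null means zero" assumption can be audited later.
sample["Injuries.All.Missing"] = sample[injury_cols].isna().all(axis=1)

# Assumption: a missing count means zero injuries of that kind.
sample["Total.Injuries"] = sample[injury_cols].fillna(0).sum(axis=1)
print(sample)
```

Keeping the flag column means the filled zeros can still be told apart from reported zeros if the outside sources disagree with the assumption.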

In [7]:
#Summary of the data for continuous variables
aviation_data.describe().apply(lambda s: s.apply('{0:.1f}'.format))

Unnamed: 0,Latitude,Longitude,Number.of.Engines,Total.Fatal.Injuries,Total.Serious.Injuries,Total.Minor.Injuries,Total.Uninjured
count,31587.0,31578.0,80399.0,58158.0,55466.0,56695.0,71092.0
mean,37.5,-2655.6,1.1,0.8,0.3,0.5,6.1
std,12.5,455322.7,0.4,6.3,1.4,2.9,30.2
min,-78.0,-80911844.0,0.0,0.0,0.0,0.0,0.0
25%,33.3,-114.7,1.0,0.0,0.0,0.0,0.0
50%,38.1,-94.3,1.0,0.0,0.0,0.0,1.0
75%,42.5,-81.6,1.0,1.0,0.0,1.0,2.0
max,89.2,435.8,8.0,349.0,111.0,380.0,699.0
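The summary above shows Longitude values far outside the valid range of -180 to 180 (a minimum of -80911844.0 and a maximum of 435.8), which also distorts the mean and standard deviation. A minimal sketch of flagging such rows follows; the toy rows pair made-up coordinates, and `Coordinates.Valid` is a hypothetical column name.

```python
import pandas as pd

# Toy rows standing in for aviation_data; the pairings are made up,
# though the extreme longitudes match the min and max reported above.
coords = pd.DataFrame({
    "Latitude": [42.595377, 89.2, 18.354444],
    "Longitude": [-89.030245, 435.8, -80911844.0],
})

# Valid geographic ranges: latitude in [-90, 90], longitude in [-180, 180].
# between() returns False for NaN, so missing coordinates are flagged too.
valid_lat = coords["Latitude"].between(-90, 90)
valid_lon = coords["Longitude"].between(-180, 180)
coords["Coordinates.Valid"] = valid_lat & valid_lon

# Rows to inspect or exclude before computing statistics or plotting.
suspect = coords[~coords["Coordinates.Valid"]]
print(suspect)
```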


In [15]:
#Splitting city and state from location
location_parts = aviation_data['LOCATION'].str.split(',')
#Strip whitespace so that ", WI" yields "WI" rather than " WI"
aviation_data['City'] = location_parts.str[0].str.strip()
aviation_data['State'] = location_parts.str[1].str.strip()
#Dropping location since we now have state and city
aviation_data = aviation_data.drop(['LOCATION'], axis=1)
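The split above assumes each LOCATION contains exactly one comma. A hedged sketch of a more defensive variant, splitting once from the right so that only the last comma separates city from state, is shown below on made-up values; whether the real data contains such cases is not established here.

```python
import pandas as pd

# Made-up location strings, including a multi-comma value,
# a value with no comma, and a missing value.
locations = pd.Series([
    "JANESVILLE, WI",
    "St Thomas, CB",
    "Near Lake, Placid, FL",
    "Unknown",
    None,
])

# Split once from the right; expand=True returns a two-column frame,
# padding with missing values when there is no comma.
parts = locations.str.rsplit(",", n=1, expand=True)
city = parts[0].str.strip()
state = parts[1].str.strip()
print(pd.DataFrame({"City": city, "State": state}))
```

With this variant, a location without a comma keeps its text in City and leaves State missing instead of raising an error.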

In [16]:
aviation_data.head()

Unnamed: 0,Event.Id,Investigation.Type,Accident.Number,Event.Date,Country,Latitude,Longitude,Airport.Code,Airport.Name,Injury.Severity,...,Total.Fatal.Injuries,Total.Serious.Injuries,Total.Minor.Injuries,Total.Uninjured,Weather.Condition,Broad.phase.of.flight,Report.Status,Publication.Date,City,State
0,20210200000000.0,Accident,CEN21FA130,2021-02-16,United States,42.595377,-89.030245,,,Fatal,...,2.0,0.0,0.0,0.0,VMC,,,,JANESVILLE,WI
1,20210200000000.0,Accident,ERA21FA130,2021-02-15,United States,18.354444,-65.027778,,,Fatal,...,4.0,0.0,0.0,0.0,VMC,,,,St Thomas,CB
2,20210200000000.0,Accident,ANC21LA017,2021-02-13,United States,61.336392,-152.01643,,,Minor,...,0.0,0.0,2.0,0.0,,,,,TYONEK,AK
3,20210200000000.0,Accident,CEN21LA127,2021-02-12,United States,43.297731,-89.755693,91C,SAUK-PRAIRIE,Non-Fatal,...,0.0,0.0,0.0,1.0,VMC,,,,PRAIRIE DU SAC,WI
4,20210200000000.0,Accident,ERA21LA131,2021-02-10,United States,27.243723,-81.413767,09FA,,Minor,...,0.0,0.0,1.0,0.0,,,,,LAKE PLACID,FL


# Simple Statistics