# Project Title
**Author**: Todd Strain



## Overview
Virtucon has decided to diversify and explore new revenue streams by offering flight operation services
for commercial and private enterprises. The company is concerned about the risks and environmental impact
of this new line of business. Because the company has little knowledge of these risks, we have undertaken an
analysis of these issues. Three datasets were used to determine which aircraft models are the safest to
operate while having an acceptable level of environmental impact. The analysis found that the Canadair RJ190 was the
safest aircraft to operate. The <blank> was the most environmentally friendly aircraft. Our recommendation is


## Business Problem
Virtucon wants to diversify its business and introduce new revenue streams by operating aircraft for commercial
and private enterprises. They want to know which aircraft are the safest to operate with the least environmental impact.
To answer that question we used the Bureau of Transportation Statistics (BTS) formula:

*“Rates are computed by dividing the number of Fatalities, Seriously injured persons, Total accidents, and Fatal accidents by the number of Aircraft-miles, Aircraft departures, or Flight hours.”*
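
As a minimal sketch of how the BTS definition translates into arithmetic, the snippet below divides an event count by an exposure count and scales the result. The function name `rate_per`, the scaling factor, and all numbers are illustrative assumptions, not values taken from the datasets.

```python
def rate_per(events, exposure, per=100_000):
    """Return events per `per` units of exposure (departures, flight hours, or miles)."""
    if exposure <= 0:
        raise ValueError("exposure must be positive")
    return events / exposure * per


# Hypothetical figures for illustration only
accidents = 3
departures = 1_200_000
accident_rate = rate_per(accidents, departures)
print(f"{accident_rate:.2f} accidents per 100,000 departures")
```

The same function covers fatalities or serious injuries in the numerator and flight hours or aircraft-miles in the denominator.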

To determine environmental impact we used data for popular aircraft from EASA.

## Data Understanding
For aircraft safety we used two datasets, the NTSB aircraft crash database and the BTS T-100 Domestic Segment database, for the 10 years preceding 2024.

The NTSB crash database includes data from 1962 to 2023 on civil aviation accidents and selected incidents in the United States and international waters. We are most interested in the number of accidents per aircraft model type, and the number and grade of passenger injuries.

The BTS T-100 Domestic Segment database includes on-flight origin and destination records. From this dataset we use the make and model of aircraft, and the number of flight instances. 

In both datasets we've selected data from 2014-2024 (the most recently reported).


***

### Data Structures
- **df**            DataFrame. Full AviationData.csv provided by project. Contains flight crash data.
- **bts_df**        DataFrame. Flight data from BTS (2012 to 2022 in the files loaded below). Contains airline flight data.
- **ac_df**         DataFrame. Legend that matches aircraft code in bts_df to Make/Model.
- **joined_df**     DataFrame. Merged bts_df with Make/Model from ac_df.
- **bts_actype**    DataFrame. List of aircraft from joined_df. Used to create spreadsheet for standardized names.
- **bts_map_df**    DataFrame. Imported from bts_actype spreadsheet with actype converted to standardized names.
- **bts_map_dic**   Dict. bts_map_df converted to dictionary so it can be added as new column to joined_df.
- **crash_actype**  DataFrame. Copy of Make, Model columns from df. Later updated with standardized model name.
- **custom_ac**     DataFrame. Copy of Model column from crash_actype with value counts.
- **custom_ac_li**  List. custom_ac converted to list. Used to update crash_actype individual Make with 'Custom'.
- **df3**           DataFrame. df merged with updated crash_actype to get standardized names. Added Injury_Score column.
- **total_score**   DataFrame. Aircraft type and score totaled from df3.




## Data Cleaning and Preparation
***

In [187]:
import pandas as pd
import numpy as np
import glob

In [188]:
df = pd.read_csv('zippedData/AviationData.csv', encoding='latin-1')


In [189]:
df.columns

Index(['Event.Id', 'Investigation.Type', 'Accident.Number', 'Event.Date',
       'Location', 'Country', 'Latitude', 'Longitude', 'Airport.Code',
       'Airport.Name', 'Injury.Severity', 'Aircraft.damage',
       'Aircraft.Category', 'Registration.Number', 'Make', 'Model',
       'Amateur.Built', 'Number.of.Engines', 'Engine.Type', 'FAR.Description',
       'Schedule', 'Purpose.of.flight', 'Air.carrier', 'Total.Fatal.Injuries',
       'Total.Serious.Injuries', 'Total.Minor.Injuries', 'Total.Uninjured',
       'Weather.Condition', 'Broad.phase.of.flight', 'Report.Status',
       'Publication.Date'],
      dtype='object')

In [190]:
df.shape

(88889, 31)

In [191]:
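# Drop rows before 2012 so we can match data by year in bts_df.
# Event.Date is still a string here; the comparison relies on the YYYY-MM-DD format.
date_mask = (df['Event.Date'] > '2011-12-31')
# Keep only rows after the date above. .copy() avoids a SettingWithCopyWarning on later assignments.
df = df.loc[date_mask].copy()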

In [192]:
# Rows without an aircraft model type don't help us. Dropping these rows.
df = df.dropna(subset=['Model'])

In [193]:
# Some values in Make are upper, and some are mixed case. Change the Make column to all uppercase
df['Make'] = df['Make'].str.upper()

In [194]:
# Convert Event.Date from string to datetime. Used to filter df on years
df['Event.Date'] = df['Event.Date'].astype('datetime64[ns]')
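
The date filter earlier compares strings, which works only while `Event.Date` is in `YYYY-MM-DD` form. A minimal sketch of an alternative, assuming the same column name, is to parse the dates first and then filter on the year. The `events` frame and its dates are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the crash data; the dates are made up
events = pd.DataFrame({"Event.Date": ["2011-12-31", "2012-01-01", "2015-06-30"]})

# Parse once; values that cannot be parsed become NaT instead of raising
events["Event.Date"] = pd.to_datetime(events["Event.Date"], errors="coerce")

# Keep events from 2012 onward
recent = events.loc[events["Event.Date"].dt.year >= 2012].copy()
print(recent)
```

Filtering after parsing makes the cutoff independent of how the date strings happen to be formatted.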

In [195]:
# Load the T-100 data from the CSV files and stack them into one DataFrame
files_li = glob.glob('zippedData/T_T100D*.2.csv')
# reversed() keeps the row order of the original loop, which prepended each new file
bts_df = pd.concat([pd.read_csv(f) for f in reversed(files_li)], ignore_index=True)

In [196]:
bts_df.columns

Index(['DISTANCE', 'AIR_TIME', 'UNIQUE_CARRIER_NAME', 'ORIGIN', 'DEST',
       'AIRCRAFT_TYPE', 'YEAR', 'MONTH', 'DISTANCE_GROUP'],
      dtype='object')

In [197]:
bts_df['YEAR'].value_counts()

YEAR
2022    411461
2019    395552
2021    384081
2016    382397
2018    380238
2017    373658
2015    370162
2014    358366
2012    347452
2013    343132
2020    297833
Name: count, dtype: int64

In [198]:
bts_df['UNIQUE_CARRIER_NAME'].value_counts().head(20)

UNIQUE_CARRIER_NAME
Southwest Airlines Co.                  381471
Delta Air Lines Inc.                    348303
United Air Lines Inc.                   301166
SkyWest Airlines Inc.                   237574
American Airlines Inc.                  188528
Federal Express Corporation             186737
ExpressJet Airlines LLC d/b/a aha!      129799
Hageland Aviation Service               122776
Allegiant Air                           122148
United Parcel Service                   107885
Republic Airline                        104241
Alaska Airlines Inc.                     98684
Envoy Air                                92405
Frontier Airlines Inc.                   85449
Endeavor Air Inc.                        85005
Spirit Air Lines                         79363
Grant Aviation                           72125
JetBlue Airways                          69685
PSA Airlines Inc.                        68853
Ryan Air f/k/a Arctic Transportation     65132
Name: count, dtype: int64

## Because the aviation crash data and BTS use different model names for aircraft types, we need to standardize on one set of names.
***

In [199]:
# BTS uses a numeric code to identify aircraft type in its flight data. The legend is in a separate csv file.
# Load in the legend for aircraft type.
ac_df = pd.read_csv('zippedData/L_AIRCRAFT_TYPE.csv')

In [200]:
# Merge flight data with aircraft type. This will match BTS type names to the flight data.
joined_df = pd.merge(bts_df,ac_df,left_on='AIRCRAFT_TYPE',right_on='Code')
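
The merge above uses the default inner join, so flight rows whose `AIRCRAFT_TYPE` has no entry in the legend are dropped without notice. The sketch below shows one way to check for that with the `how`, `indicator`, and `validate` parameters of `pd.merge`. The frames `flights` and `legend`, and every code and description in them, are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for bts_df and ac_df; codes and descriptions are made up
flights = pd.DataFrame({"AIRCRAFT_TYPE": [612, 612, 999],
                        "DISTANCE": [300.0, 450.0, 120.0]})
legend = pd.DataFrame({"Code": [612, 694],
                       "Description": ["Type A", "Type B"]})

# how="left" keeps every flight row; indicator=True adds a _merge column;
# validate="many_to_one" raises if a code appears more than once in the legend
checked = pd.merge(flights, legend, left_on="AIRCRAFT_TYPE", right_on="Code",
                   how="left", indicator=True, validate="many_to_one")

# Codes present in the flight data but missing from the legend
unmatched = checked.loc[checked["_merge"] == "left_only", "AIRCRAFT_TYPE"].unique()
print(unmatched)
```

An empty `unmatched` array would suggest the inner join loses no rows.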

In [201]:
# Making a list of aircraft from the flight data. We use this list to assign standard names.
bts_actype = joined_df[['Description']].copy()

In [202]:
# Drop duplicates from flight aircraft 
bts_actype.drop_duplicates(inplace=True)

In [None]:
# Write this to Excel so we can standardize aircraft names.
# bts_actype.to_excel('zippedData/bts_actype2.xlsx')

In [203]:
# Read in our standardized label spreadsheet after edit
bts_map_df = pd.read_excel('zippedData/bts_actype.2.xlsx')

In [204]:
bts_map_df['Model'] = bts_map_df['Model'].astype(str)

In [205]:
# Convert standardized name spreadsheet to mapping dictionary
bts_map_dic = dict(zip(bts_map_df.Description, bts_map_df.Model))

In [206]:
# Map original aircraft names to standardized name and put in new column
joined_df['NewModel'] = joined_df['Description'].map(bts_map_dic)
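
`Series.map` with a dictionary returns NaN for any key the dictionary lacks, so descriptions missed in the spreadsheet end up without a standardized name. A small sketch of how to list them follows. The names `descriptions` and `name_map` and their contents are hypothetical.

```python
import pandas as pd

# Toy stand-ins; the descriptions and standardized names are made up
descriptions = pd.Series(["Boeing 737-700", "Airbus A320-100/200", "Unknown Type"])
name_map = {"Boeing 737-700": "737", "Airbus A320-100/200": "A320"}

new_model = descriptions.map(name_map)

# Descriptions missing from the dictionary come back as NaN
missing = descriptions[new_model.isna()].unique().tolist()
print(missing)
```

Running a check like this before converting the column to strings makes it easier to spot gaps in the mapping.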

In [None]:
joined_df.loc[4044331]

In [207]:
# Convert some integers to strings
joined_df['NewModel'] = joined_df['NewModel'].astype(str)

In [208]:
# Making a list of aircraft from the NTSB crash data. 
crash_actype = df[['Make', 'Model']].copy()

In [209]:
# Strip '-' and '/' from the model names in crash_actype (not in df).
# Assigning the result back avoids the chained inplace replace, which may not update the DataFrame.
crash_actype['Model'] = crash_actype['Model'].str.replace(r'[-/]', '', regex=True)
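
To illustrate what stripping `-` and `/` does to model names, here is a short sketch on a handful of values. The first three strings appear in the crash data shown later in this notebook, and the last one is made up.

```python
import pandas as pd

# The last value is hypothetical and only there to exercise the '/' case
models = pd.Series(["PA-28-151", "PA 28R-200", "G 164B", "A/B-12"])

# One character class removes both separators in a single pass
cleaned = models.str.replace(r"[-/]", "", regex=True)
print(cleaned.tolist())
```

Spaces are left in place, so `PA 28R-200` becomes `PA 28R200` rather than `PA28R200`. Whether spaces should also be removed depends on how the BTS names are written.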

In [210]:
# Drop duplicates from the crash_actype df.
crash_actype.drop_duplicates(inplace=True)


In [211]:
# Get value counts per model. After dropping duplicates, this counts how many Make/Model rows list each model.
custom_ac = crash_actype['Model'].value_counts().to_frame()

In [212]:
# Keep models listed more than twice, then convert the index to a list.
# Used to replace the individual Make in crash_actype with 'Custom'.
custom_ac = custom_ac[custom_ac['count'] > 2]
custom_ac_li = custom_ac.index.to_list()


In [213]:
# Replace the individual Make with 'Custom' for every model in custom_ac_li
crash_actype.loc[crash_actype['Model'].isin(custom_ac_li), 'Make'] = 'Custom'

In [214]:
# Lots of duplicates after replacing Make with 'Custom'. Drop duplicates from the crash_actype df
crash_actype.drop_duplicates(inplace=True)
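
The relabelling steps above can be sketched on a toy frame. This is an illustration under assumptions, not the notebook's exact code: the frame `pairs` and its values are made up, and `nunique` is used to count makes per model.

```python
import pandas as pd

# Toy Make/Model pairs; the values are made up
pairs = pd.DataFrame({
    "Make": ["SMITH", "JONES", "BROWN", "CESSNA", "CESSNA"],
    "Model": ["RV7", "RV7", "RV7", "172", "172"],
}).drop_duplicates()

# Number of distinct makes listed for each model
makes_per_model = pairs.groupby("Model")["Make"].nunique()

# Models listed under more than two makes get the Make 'Custom'
custom_models = makes_per_model[makes_per_model > 2].index.to_list()
pairs.loc[pairs["Model"].isin(custom_models), "Make"] = "Custom"

# Relabelling creates identical rows, so drop duplicates afterwards
pairs = pairs.drop_duplicates()
print(pairs)
```

Dropping duplicates after the relabelling, rather than before, is what collapses the three `RV7` rows into one.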

In [186]:
crash_actype['NewModel'].isna().sum()

15418

In [215]:
def match_types(model):
    '''
    Take a model string from crash_actype and return the first NewModel
    in joined_df that the model string starts with, or None if none matches.
    The result is used to fill the NewModel column of crash_actype.
    TODO: Replace with faster version.
    '''
    for newmodel in joined_df['NewModel']:
        if model.startswith(newmodel):
            #print(f'Processing: {model} \t:{newmodel}')
            return newmodel
    return None
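
One possible faster version, offered as a sketch and not as the notebook's method: the loop above scans every row of `joined_df['NewModel']`, although only the distinct names matter. The helper `build_matcher` below is hypothetical. It removes repeats once, keeps first-seen order so the first match is unchanged, and skips empty strings and the literal `'nan'`, which is an added assumption.

```python
def build_matcher(new_models):
    """Return a function mapping a model string to the first name it starts with."""
    # dict.fromkeys removes repeats while keeping first-seen order
    candidates = [m for m in dict.fromkeys(new_models) if m and m != "nan"]

    def match(model):
        for candidate in candidates:
            if model.startswith(candidate):
                return candidate
        return None

    return match


# Hypothetical inputs for illustration
match = build_matcher(["737", "A320", "737", "PA31", "nan"])
print(match("737 7H4"))
print(match("PA31350"))
print(match("RV7"))
```

In the notebook this could be called as `build_matcher(joined_df['NewModel'])` and applied to `crash_actype['Model']`. Because the first match wins, a short name such as `190` can shadow a longer one such as `1900`. Sorting the candidates longest-first is one way to avoid that, at the cost of changing the results.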
        

In [None]:
# Creating the NewModel column takes a lot of CPU and time. It has been saved as a pickle file
# so we can just read it in and save time.
#crash_actype['NewModel'] = [match_types(model) for model in crash_actype['Model']]

In [179]:
# Save df to pickle file.
#crash_actype.to_pickle('zippedData/crash_actype.2.pkl')

In [216]:
# Read in pickle file
crash_actype = pd.read_pickle(r'zippedData/crash_actype.pkl')
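
The compute-once, read-afterwards pattern used for the pickle file can be wrapped in a small helper, sketched below. The function `load_or_build` and the demo file name are hypothetical, and the demo writes to a temporary directory instead of `zippedData`.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_or_build(path, build):
    """Read a pickled DataFrame from `path`, or call `build()` and save the result."""
    path = Path(path)
    if path.exists():
        return pd.read_pickle(path)
    frame = build()
    frame.to_pickle(path)
    return frame


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "demo.pkl"
    # First call builds and saves; second call reads the saved file
    first = load_or_build(cache, lambda: pd.DataFrame({"Model": ["737"]}))
    second = load_or_build(cache, lambda: pd.DataFrame({"Model": ["ignored"]}))
    print(second)
```

A stale cache is the main risk with this pattern: if the cleaning steps change, the pickle file has to be deleted or rebuilt by hand. Pickle files should also only be loaded from trusted sources.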

## Data Modeling
***

In [217]:
# List the top 40 flown aircraft types
joined_df['NewModel'].value_counts().head(40)

NewModel
737         888665
A320        642230
208         219309
RJ100       201830
757         178044
ERJ175      173038
RJ700       169427
MD80        156614
145         153205
206         139183
CRJ700      137718
767         104339
A300         65703
PA31         63511
EMB170       57466
747          44479
1900         44466
DASH8        39522
190          38706
PC12         36956
212          32537
MD11         28799
DC10         27205
DHC2         26880
140          24471
DC9          22351
nan          20616
777          19567
PA32         16926
135          15880
King Air     15591
GA8          12387
402          12274
727          12106
GIV          11323
340B         11120
A330         10326
C185         10051
FALCON        8001
EMB120        7929
Name: count, dtype: int64

In [248]:
df['Model'].value_counts()[200:250]

Model
PA-34-200T        15
7BCM              15
M20M              15
AT-802A           15
737 7H4           15
150J              15
340A              15
PA 24-250         15
B75N1             15
G36               14
PA-23-250         14
RV-8              14
PA-28-235         14
F33               14
DHC-2             14
SGS 2-33A         14
PA38              14
M20TN             14
188               14
RV7               14
114               14
PA-28-151         14
G 164B            14
AL3               14
441               13
PA-24-260         13
PA 25-235         13
PA-28             13
150H              13
55                13
AT-502            13
310R              13
B36TC             13
JA30 SUPERSTOL    13
RV-7              13
SPORTCRUISER      13
II                13
PA 22             13
PA 28R-200        13
AT-301            13
RV-12             13
A-1C-180          13
PA 28-161         13
PA-28R-180        13
7EC               13
560XL             13
DA20              13
172L   

In [218]:
# Merge Aircraft Data with crash_actype to get standardized Aircraft Types
df3 = pd.merge(df, crash_actype,left_on='Model',right_on='Model', suffixes=("_l", "_r"))

In [219]:
top_40 = joined_df['NewModel'].value_counts().head(40).index.to_list()

In [249]:
top_40

['737',
 'A320',
 '208',
 'RJ100',
 '757',
 'ERJ175',
 'RJ700',
 'MD80',
 '145',
 '206',
 'CRJ700',
 '767',
 'A300',
 'PA31',
 'EMB170',
 '747',
 '1900',
 'DASH8',
 '190',
 'PC12',
 '212',
 'MD11',
 'DC10',
 'DHC2',
 '140',
 'DC9',
 'nan',
 '777',
 'PA32',
 '135',
 'King Air',
 'GA8',
 '402',
 '727',
 'GIV',
 '340B',
 'A330',
 'C185',
 'FALCON',
 'EMB120']

In [225]:
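# Keep only crashes for the top 40 aircraft types. .copy() avoids a SettingWithCopyWarning on later assignments.
df3 = df3.loc[df3['NewModel'].isin(top_40)].copy()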

In [226]:
df3.shape

(3201, 34)

In [227]:
# Some rows have NaN for values. Replace these with 0 value. 
df3['Total.Fatal.Injuries'] = df3['Total.Fatal.Injuries'].fillna(0)
df3['Total.Serious.Injuries'] = df3['Total.Serious.Injuries'].fillna(0)
df3['Total.Minor.Injuries'] = df3['Total.Minor.Injuries'].fillna(0)

In [228]:
# Check replacement
df3['Total.Fatal.Injuries'].isna().sum()


0

In [229]:
# Our scoring system uses 3 points per fatality, 1 point per serious injury and 0.5 points per minor injury.
df3['Injury_Score'] = (df3['Total.Fatal.Injuries'] * 3
                       + df3['Total.Serious.Injuries']
                       + df3['Total.Minor.Injuries'] * 0.5)
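
A worked example of the scoring rule on made-up injury counts, as a sketch: the `WEIGHTS` dictionary and the `sample` frame are illustrative names, while the weights themselves (3, 1, 0.5) come from the comment above.

```python
import pandas as pd

# Points per injury type, as described in the scoring comment
WEIGHTS = {
    "Total.Fatal.Injuries": 3.0,
    "Total.Serious.Injuries": 1.0,
    "Total.Minor.Injuries": 0.5,
}

# Made-up counts; None stands in for a missing value
sample = pd.DataFrame({
    "Total.Fatal.Injuries": [2, 0, None],
    "Total.Serious.Injuries": [1, 0, 4],
    "Total.Minor.Injuries": [4, 0, 2],
})

# Missing counts are treated as 0, then each column is multiplied by its weight
score = sample[list(WEIGHTS)].fillna(0).mul(pd.Series(WEIGHTS), axis=1).sum(axis=1)
print(score.tolist())
```

For the first row this gives 2 × 3 + 1 × 1 + 4 × 0.5 = 9.0.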

In [230]:
df3.loc[:,['Total.Fatal.Injuries', 'Total.Serious.Injuries','Total.Minor.Injuries','Injury_Score']].head(5)


| | Total.Fatal.Injuries | Total.Serious.Injuries | Total.Minor.Injuries | Injury_Score |
|---|---|---|---|---|
| 2485 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2486 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2487 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2488 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2489 | 0.0 | 0.0 | 0.0 | 0.0 |


In [231]:
total_score = df3.groupby('NewModel')['Injury_Score'].sum().to_frame()

In [236]:
df3['NewModel'].value_counts()

NewModel
206         1492
737          429
A320         222
208          159
140          158
PA32         107
A330          81
777           72
402           71
PA31          60
767           55
747           47
757           42
1900          32
212           32
PC12          22
DHC2          18
FALCON        18
MD11          18
MD80          12
GIV           11
A300          10
190            8
DC9            6
GA8            6
727            5
EMB120         4
145            2
King Air       1
ERJ175         1
Name: count, dtype: int64

In [232]:
type(total_score)

pandas.core.frame.DataFrame

In [233]:
total_score['Count'] = joined_df.groupby('NewModel').size()

In [235]:
joined_df.groupby('NewModel').size()[1:50]


NewModel
12           88
124         601
135       15880
140       24471
145      153205
172        6631
180        1350
182         110
185        2520
190       38706
1900      44466
2000        702
204B         21
206      139183
208      219309
212       32537
23          109
235         264
31          525
328        1936
340A       2727
340B      11120
35A        3552
360         330
4000        182
400XP      1106
402       12274
407         755
412         224
500         168
5000       4047
550        3676
60          230
600           4
6000        583
65          366
650         734
700        1981
727       12106
737      888665
747       44479
757      178044
767      104339
76TD         35
777       19567
787        4912
99          887
A119         69
A200       4570
dtype: int64

In [None]:
total_score.columns

In [None]:
# Rating: total injury score per 1,000 BTS flight records for each aircraft type
total_score['Rating'] = (total_score['Injury_Score'] / total_score['Count']) * 1000
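
The rating calculation can be traced end to end on toy data. This is a sketch under assumptions: the frames `crashes` and `flights` and all numbers in them are made up, and only the column names mirror the notebook.

```python
import pandas as pd

# Toy inputs; the numbers are made up
crashes = pd.DataFrame({"NewModel": ["737", "737", "206"],
                        "Injury_Score": [3.0, 1.0, 6.0]})
flights = pd.DataFrame({"NewModel": ["737"] * 4 + ["206"] * 2 + ["A320"] * 3})

rating = crashes.groupby("NewModel")["Injury_Score"].sum().to_frame()

# Assignment aligns on the NewModel index, so models without crashes are left out
rating["Count"] = flights.groupby("NewModel").size()

# Injury score per 1,000 flight records
rating["Rating"] = rating["Injury_Score"] / rating["Count"] * 1000
print(rating.sort_values("Rating"))
```

Note that a model with flights but no recorded crashes, such as `A320` in this toy data, never enters the table, so a missing row is not the same as a rating of zero.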

In [None]:
total_score.sort_values('Rating')

In [None]:
df['Broad.phase.of.flight'].notna().sum()