### Problem Statement: 
How severe can an airplane accident be?
Flying has been the go-to mode of travel for years now; it is time-saving, affordable, and extremely convenient. According to the FAA, 2,781,971 passengers flew every day in the US as of June 2019. Passengers regard flying as very safe, given the strict inspections and security measures in place to avoid or mitigate mishaps. However, a small chance of unfortunate incidents remains.
Imagine you have received a project from a leading airline. You are required to build machine learning models to anticipate and classify the severity of an airplane accident based on past incidents. With this, airlines, and even the entire aviation industry, can predict the severity of airplane accidents caused by various factors and, correspondingly, have a plan of action to minimize the associated risk.

### Data:
The dataset comprises 2 files (shared in the same folder):
●	Train.csv: [10000 x 12 excluding the headers] contains Training data
●	Test.csv: [2500 x 11 excluding the headers] contains Test data
#### Columns

| Column | Description |
| --- | --- |
| Accident_ID | unique id assigned to each row |
| Accident_Type_Code | the type of accident (factor, not numeric) |
| Cabin_Temperature | the last recorded temperature before the incident, measured in degrees Fahrenheit |
| Turbulence_In_gforces | the recorded/estimated turbulence experienced during the accident |
| Control_Metric | an estimation of how much control the pilot had during the incident given the factors at play |
| Total_Safety_Complaints | number of complaints from mechanics prior to the accident |
| Days_Since_Inspection | how long the plane went without inspection before the incident |
| Safety_Score | a measure of how safe the plane was deemed to be |
| Violations | number of violations that the aircraft received during inspections |
| Max_Elevation | maximum altitude the airplane reached during the event |
| Adverse_Weather_Metric | weather metric measured on the basis of the adverse events that occurred |
| Severity | a description (4 level factor) of the severity of the crash [Target] |

 
 
### Solution Approach:
#### Libraries Used:
•	Pandas for data manipulation
•	NumPy for performing mathematical operations on the data
•	Matplotlib, Seaborn and Scikitplot for visualization
•	Sklearn for pre-processing and model building & evaluation
#### Steps and Inferences:
1.	The data was read into a DataFrame and the first few records were inspected
2.	The data types were examined, from which we could infer the following
a.	Safety_Score, Control_Metric, Turbulence_In_gforces, Cabin_Temperature, Max_Elevation and Adverse_Weather_Metric are continuous in nature
b.	Severity, Days_Since_Inspection, Total_Safety_Complaints, Accident_Type_Code and Violations are discrete or categorical
3.	There were no null values present in the dataset
4.	A five-point summary was computed for the numeric data and the following was observed
a.	The mean of the following attributes differs noticeably from the median, which suggests skew and the presence of outliers
i.	Total_Safety_Complaints
ii.	Violations
iii.	Adverse_Weather_Metric
5.	Exploratory Data Analysis was performed, starting with univariate analysis, and the following were observed
a.	Safety_Score is approximately normally distributed and has a few outliers
b.	Control_Metric is roughly bell-shaped but skewed to the left, with outliers
c.	Turbulence_In_gforces is roughly bell-shaped but skewed to the left, with outliers
d.	Cabin_Temperature is roughly bell-shaped but skewed to the right, with outliers
e.	Max_Elevation is approximately normally distributed, with outliers
f.	Adverse_Weather_Metric is highly skewed to the right with a large number of outliers
g.	Average Days_Since_Inspection is 13
h.	Most records have Total_Safety_Complaints between 0 and 10
i.	Accident_Type_Code 5 has comparatively few records
j.	Records with 2 violations are the most common
k.	There is a slight class imbalance, with the 'Significant_Damage_And_Fatalities' group having fewer records
6.	Multivariate analysis was performed along with a correlation matrix; the feature pairs with the highest correlations were
Accident_Type_Code  	Adverse_Weather_Metric    	0.739361
Safety_Score        	Days_Since_Inspection     	0.685386
Control_Metric      	Turbulence_In_gforces     	0.643285
7.	Numerical data was standardized using z-score
8.	All records with a z-score greater than 3 or less than -3 were removed as outliers
9.	493 records were removed in this way
10.	Label Encoding was performed on the Target Column
11.	Only basic feature engineering was performed, given limited time and a lack of the domain expertise needed to create new features
12.	Independent features and Target columns were split as X and Y
13.	K-Fold cross-validation was used to create multiple folds of train and validation data
14.	Two models were considered based on the data
a.	Gradient Boosting
b.	XG Boost
15.	XGBoost was chosen as the final model and RandomizedSearchCV was used for hyperparameter tuning
16.	The validation set was evaluated and all metrics (F1, recall, precision and accuracy) were above 95%
17.	The test set was then loaded and predicted with the model already built

#### Things that can improve model performance (Not done in this analysis due to time constraints):
1.	There was a slight class imbalance in the dataset; over-sampling the minority classes may give better model performance across all classes
2.	By gaining domain expertise, we can create new features which the model can interpret better.
3.	Test Data outliers can be imputed to get better results
4.	Statistical inference testing can be done to improve feature selection
5.	Feature importance can be obtained and model can be retrained
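
As an illustration of the first item in the list above, here is a minimal sketch of random over-sampling with pandas. The helper `oversample_minority` and the toy values are hypothetical and not part of this analysis; in practice the resampling would presumably be applied to the training folds only, so that validation scores are not inflated by duplicated rows.

```python
import pandas as pd

def oversample_minority(df, target, random_state=0):
    """Resample each class with replacement up to the size of the largest class."""
    largest = df[target].value_counts().max()
    parts = []
    for _, group in df.groupby(target):
        shortfall = largest - len(group)
        if shortfall > 0:
            # Draw the missing rows from the same class, with replacement
            extra = group.sample(n=shortfall, replace=True, random_state=random_state)
            group = pd.concat([group, extra])
        parts.append(group)
    return pd.concat(parts).reset_index(drop=True)

# Toy data: class "A" has 4 rows, class "B" only 2 (values are made up)
toy = pd.DataFrame({
    "Safety_Score": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
    "Severity": ["A", "A", "A", "A", "B", "B"],
})
balanced = oversample_minority(toy, "Severity")
print(balanced["Severity"].value_counts())
```

Dedicated libraries offer more elaborate resampling strategies; plain random over-sampling is shown here only because it needs nothing beyond pandas.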


### Importing Necessary Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from scipy.stats import zscore
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import KFold, GridSearchCV, cross_val_score, RandomizedSearchCV
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.pipeline import Pipeline
from sklearn import metrics
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import multilabel_confusion_matrix
import scikitplot as skplt

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

### Reading and understanding the data-set

In [None]:
data=pd.read_csv('/kaggle/input/airplane-accidents-severity-dataset/train.csv')

#### Checking the head of the data-set

In [None]:
data.head()

##### ● Accident_ID : the unique ID of the accident as recorded by the authorities
##### ● Adverse_Weather_Metric : weather metric measured on the basis of the adverse events that occurred
##### ● Violations : number of violations that the aircraft received during inspections
##### ● Max_Elevation : maximum altitude the airplane reached during the event
##### ● Accident_Type_Code : code of the accident as classified by the authorities
##### ● Cabin_Temperature : the last recorded temperature before the incident, measured in degrees Fahrenheit
##### ● Turbulence_In_gforces : the recorded/estimated turbulence experienced during the accident
##### ● Control_Metric : an estimation of how much control the pilot had during the incident given the factors at play
##### ● Total_Safety_Complaints : number of complaints from mechanics prior to the accident
##### ● Days_Since_Inspection : how long the plane went without inspection before the incident
##### ● Safety_Score : a measure of how safe the plane was deemed to be

#### Checking the data-types of the data

In [None]:
data.dtypes

##### Safety_Score, Control_Metric, Turbulence_In_gforces, Cabin_Temperature, Max_Elevation and Adverse_Weather_Metric are continuous in nature
##### Severity, Days_Since_Inspection, Total_Safety_Complaints, Accident_Type_Code and Violations are discrete or categorical

#### Checking the information of the data

In [None]:
data.info()

In [None]:
data.shape

##### There are 10000 records with 11 independent variables and 1 target variable

#### Check for any null values in the data

In [None]:
na_values=data.isna().sum()
print(na_values)

#### Describing the data (five-point summary); Accident_ID is dropped first as it does not help with the analysis

In [None]:
data.drop(['Accident_ID'],axis=1,inplace=True)
data.describe().T

##### The mean of the following attributes differs noticeably from the median, which suggests skew and the presence of outliers
###### Total_Safety_Complaints, Violations, Adverse_Weather_Metric
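
The mean-versus-median comparison above can also be made explicit in code. The following is a small sketch on made-up values (not the accident data): a single large value pulls the mean of `Violations` away from its median, while the symmetric `Safety_Score` column shows no gap.

```python
import pandas as pd

# Toy values chosen for illustration only
toy = pd.DataFrame({
    "Violations": [0, 1, 1, 2, 2, 2, 3, 15],   # one extreme value
    "Safety_Score": [38.0, 40.0, 41.0, 42.0, 43.0, 44.0, 45.0, 47.0],
})

summary = pd.DataFrame({
    "mean": toy.mean(),
    "median": toy.median(),
    "skew": toy.skew(),
})
# A large positive gap hints at a right tail, a large negative gap at a left tail
summary["mean_minus_median"] = summary["mean"] - summary["median"]
print(summary)
```

A gap between mean and median is only a hint; the boxplots in the univariate analysis are what actually show the outlying points.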

### Exploratory Data Analysis

#### Uni-Variate Analysis

In [None]:
# Distribution of continuous data
# Safety_Score, Control_Metric, Turbulence_In_gforces
cols = ['Safety_Score', 'Control_Metric', 'Turbulence_In_gforces']
colors = ['red', 'blue', 'green']

# Histograms with a KDE overlay (histplot replaces the deprecated distplot)
plt.figure(figsize=(30,6))
for i, (col, color) in enumerate(zip(cols, colors), start=1):
    plt.subplot(1,3,i)
    plt.title(col)
    sns.histplot(data[col], kde=True, stat='density', color=color)

# Horizontal boxplots of the same columns to show outliers
plt.figure(figsize=(30,6))
for i, (col, color) in enumerate(zip(cols, colors), start=1):
    plt.subplot(1,3,i)
    plt.title(col)
    sns.boxplot(x=data[col], color=color)


##### Safety_Score is approximately normally distributed and has a few outliers
##### Control_Metric is roughly bell-shaped but skewed to the left, with outliers
##### Turbulence_In_gforces is roughly bell-shaped but skewed to the left, with outliers

In [None]:
# Distribution of continuous data
# Cabin_Temperature, Max_Elevation, Adverse_Weather_Metric
cols = ['Cabin_Temperature', 'Max_Elevation', 'Adverse_Weather_Metric']
colors = ['red', 'blue', 'green']

# Histograms with a KDE overlay (histplot replaces the deprecated distplot)
plt.figure(figsize=(30,6))
for i, (col, color) in enumerate(zip(cols, colors), start=1):
    plt.subplot(1,3,i)
    plt.title(col)
    sns.histplot(data[col], kde=True, stat='density', color=color)

# Horizontal boxplots of the same columns to show outliers
plt.figure(figsize=(30,6))
for i, (col, color) in enumerate(zip(cols, colors), start=1):
    plt.subplot(1,3,i)
    plt.title(col)
    sns.boxplot(x=data[col], color=color)


##### Cabin_Temperature is roughly bell-shaped but skewed to the right, with outliers
##### Max_Elevation is approximately normally distributed, with outliers
##### Adverse_Weather_Metric is highly skewed to the right with a large number of outliers

In [None]:
# Days_Since_Inspection, Total_Safety_Complaints
plt.figure(figsize=(30,6))

#Subplot 1
plt.subplot(1,2,1)
plt.title('Days_Since_Inspection')
sns.countplot(x=data['Days_Since_Inspection'],color='red')

#Subplot 2
plt.subplot(1,2,2)
plt.title('Total_Safety_Complaints')
sns.countplot(x=data['Total_Safety_Complaints'],color='blue')


##### Average Days_Since_Inspection is 13
##### Most records have Total_Safety_Complaints between 0 and 10

In [None]:
# Accident_Type_Code and Violations

plt.figure(figsize=(30,6))


#Subplot 1
plt.subplot(1,2,1)
plt.title('Accident_Type_Code')
sns.countplot(x=data['Accident_Type_Code'],color='red')

#Subplot 2
plt.subplot(1,2,2)
plt.title('Violations')
sns.countplot(x=data['Violations'],color='blue')

##### Accident_Type_Code 5 has comparatively few records
##### Records with 2 violations are the most common

In [None]:
plt.figure(figsize=(30,6))

plt.title('Severity')
sns.countplot(x=data['Severity'],color='red')

#### There is a slight class imbalance, with the 'Significant_Damage_And_Fatalities' group having fewer records

#### Multi-variate Analysis

In [None]:
sns.pairplot(data,palette="Set2", diag_kind="kde", height=2.5)

#### Correlation Matrix

In [None]:
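# Severity is still a string column at this point, so restrict the correlation to numeric columns
correlation=data.select_dtypes('number').corr()
correlation.style.background_gradient(cmap='coolwarm')

In [None]:
correlation>0.5

In [None]:
correlation<-0.5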

In [None]:
df=data.drop(['Severity'],axis=1)
def get_redundant_pairs(df):
    '''Get diagonal and lower triangular pairs of correlation matrix'''
    pairs_to_drop = set()
    cols = df.columns
    for i in range(0, df.shape[1]):
        for j in range(0, i+1):
            pairs_to_drop.add((cols[i], cols[j]))
    return pairs_to_drop

def get_top_abs_correlations(df, n=20):
    au_corr = df.corr().abs().unstack()
    labels_to_drop = get_redundant_pairs(df)
    au_corr = au_corr.drop(labels=labels_to_drop).sort_values(ascending=False)
    return au_corr[0:n]

print("Top Absolute Correlations")
print(get_top_abs_correlations(df, 3))

##### These pairs of independent attributes are the most strongly correlated
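
The pair-dropping helper above can be written more compactly by masking the correlation matrix with its upper triangle. This is an alternative sketch, not the approach used in this notebook; the function name `top_abs_correlations` and the toy values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def top_abs_correlations(df, n=3):
    """Return the n column pairs with the largest absolute correlation."""
    corr = df.select_dtypes("number").corr().abs()
    # True strictly above the diagonal: each pair once, no self-correlations
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    return pairs.sort_values(ascending=False).head(n)

# Toy data: the second column is exactly twice the first
toy = pd.DataFrame({
    "Control_Metric": [1.0, 2.0, 3.0, 4.0, 5.0],
    "Turbulence_In_gforces": [2.0, 4.0, 6.0, 8.0, 10.0],
    "Safety_Score": [5.0, 3.0, 4.0, 1.0, 2.0],
})
top = top_abs_correlations(toy, n=2)
print(top)
```

Because the mask keeps only entries above the diagonal, no separate set of redundant pairs has to be built.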

#### Handle Outliers

In [None]:
data.info()

In [None]:
dataNumericals = pd.DataFrame(data, columns =data.columns[data.dtypes == 'float64']) 
dataNumericals.head()

#### Applying z-score to standardize the data


In [None]:
dataNumericals=dataNumericals.apply(zscore)

In [None]:
dataNumericals.head()

#### Removing all records with a z-score greater than 3 or less than -3, as these values are treated as outliers

In [None]:
floats = dataNumericals.columns[dataNumericals.dtypes == 'float64']
for columns in floats:
    indexNames_larger = dataNumericals[dataNumericals[columns]>3].index
    indexNames_lesser = dataNumericals[dataNumericals[columns]<-3].index
    # Delete these row indexes from dataFrame
    dataNumericals.drop(indexNames_larger , inplace=True)
    dataNumericals.drop(indexNames_lesser , inplace=True)
    data.drop(indexNames_larger , inplace=True)
    data.drop(indexNames_lesser , inplace=True)

In [None]:
dataNumericals.info()

In [None]:
data.info()

#### 493 records were removed as they were considered outliers
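
The column-by-column loop above drops a row as soon as any of its z-scores falls outside [-3, 3]. A sketch of the same rule as a single boolean mask follows, on synthetic data with one planted outlier (the values are assumptions, not the accident data). The z-score is computed by hand with `ddof=0`, which matches the default of `scipy.stats.zscore`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Safety_Score": rng.normal(40, 15, 200),
    "Control_Metric": rng.normal(65, 12, 200),
    "Violations": rng.integers(0, 5, 200),   # integer column, left unscaled
})
toy.loc[0, "Safety_Score"] = 500.0           # planted outlier

float_cols = toy.select_dtypes("float64").columns
# Population z-score (ddof=0) for every float column
z = (toy[float_cols] - toy[float_cols].mean()) / toy[float_cols].std(ddof=0)

# Keep a row only if all of its z-scores lie within [-3, 3]
keep = (z.abs() <= 3).all(axis=1)
cleaned = toy[keep]
print(f"removed {len(toy) - len(cleaned)} of {len(toy)} rows")
```

Building the mask once avoids mutating two DataFrames in step inside a loop.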

#### Merging the scaled columns back to the original dataframe

In [None]:
data.drop(data.columns[data.dtypes == 'float64'],axis=1,inplace=True)

In [None]:
data.head()

In [None]:
for column in dataNumericals.columns:
    data[column]=dataNumericals[column]

In [None]:
data.head()

In [None]:
data.info()

#### Label Encoding the Target Column

In [None]:
data['Severity'].unique()

In [None]:
encoder=LabelEncoder()
data['Severity']=encoder.fit_transform(data['Severity'])

In [None]:
data.head()
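
For reference, `LabelEncoder` assigns integer codes in sorted order of the class names, and the mapping can be inspected through `classes_`. In the sketch below only 'Significant_Damage_And_Fatalities' is taken from this notebook; the other class names are placeholders, and `encoder_demo` is a separate object so that it does not interfere with the `encoder` used above.

```python
from sklearn.preprocessing import LabelEncoder

# Placeholder_Class_A / Placeholder_Class_B are hypothetical names for illustration
labels = [
    "Significant_Damage_And_Fatalities",
    "Placeholder_Class_A",
    "Placeholder_Class_B",
    "Placeholder_Class_A",
]

encoder_demo = LabelEncoder()
codes = encoder_demo.fit_transform(labels)

# classes_ is sorted, so the position of a name is its integer code
mapping = {name: code for code, name in enumerate(encoder_demo.classes_)}
decoded = encoder_demo.inverse_transform(codes)
print(mapping)
```

Keeping the fitted encoder around is what later allows `inverse_transform` to turn predicted codes back into severity labels.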

### Feature Engineering

In [None]:
data['Total_Safety_Complaints'] = np.power(2, data['Total_Safety_Complaints'])
data['Days_Since_Inspection'] = np.power(2, data['Days_Since_Inspection'])
data['Safety_Score'] = np.power(2, data['Safety_Score'])

#### Splitting X-independent attributes and Y-dependent attributes and keeping the test set separate
#### Creating multiple cross-validation folds to reduce overfitting

In [None]:
X=data.drop(['Severity'],axis=1)

In [None]:
Y=data['Severity']

In [None]:
Xtrain_val,X_test,ytrain_val,Y_test=train_test_split(X,Y,test_size=0.2,random_state=22)

In [None]:
kf = KFold(n_splits=10,random_state=2,shuffle=True)
kf.get_n_splits(Xtrain_val)
print(kf)


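# Note: X_train/X_val and y_train/y_val are overwritten on every pass,
# so only the split from the final fold is kept for the model fitting below
for train_index, val_index in kf.split(Xtrain_val):
    print("TRAIN:", train_index, "VALIDATION:", val_index)
    X_train, X_val = Xtrain_val.iloc[train_index], Xtrain_val.iloc[val_index]
    y_train, y_val = ytrain_val.iloc[train_index], ytrain_val.iloc[val_index]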
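
Since the loop above retains only the last fold's split, one way to make use of every fold is to let `cross_val_score` fit and score the model once per fold. The following is a minimal sketch on synthetic data from `make_classification` (an assumption, not the accident dataset), with a deliberately small `GradientBoostingClassifier` so that it runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import KFold, cross_val_score

# Synthetic 4-class problem standing in for the real features and target
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=4, random_state=0
)

kf_demo = KFold(n_splits=5, shuffle=True, random_state=2)
model = GradientBoostingClassifier(n_estimators=20, random_state=0)

# One accuracy value per fold; each fold serves as the validation set exactly once
scores = cross_val_score(model, X_demo, y_demo, cv=kf_demo, scoring="accuracy")
print("fold accuracies:", scores)
print("mean: {:.3f}, std: {:.3f}".format(scores.mean(), scores.std()))
```

The spread of the per-fold scores gives a rough idea of how much a single validation score can vary with the split.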

### Model Building: Gradient Boosting classifier tuned with RandomizedSearchCV

In [None]:
#Pipeline
pipe_GBR = Pipeline([('GBR', GradientBoostingClassifier())])

#Parameter values to sample from
param_grid = {'GBR__n_estimators': [50,100,150],'GBR__learning_rate':[0.1,0.2,0.5]}

#Using RandomizedSearchCV (n_iter=3 samples 3 of the 9 possible combinations)
Random_GBR = RandomizedSearchCV(pipe_GBR, param_distributions=param_grid, cv=5, n_iter=3)

#Fitting the data in the model
Random_GBR.fit(X_train, y_train)

print(" Best cross-validation score obtained is: {:.2f}".format(Random_GBR.best_score_))
print(" Best parameters found by the randomized search: ", Random_GBR.best_params_)
print(" Train set score obtained is: {:.2f}".format(Random_GBR.score(X_train, y_train)))
print(" Validation set score obtained is: {:.2f}".format(Random_GBR.score(X_val, y_val)))
print(" Test set score obtained is: {:.2f}".format(Random_GBR.score(X_test, Y_test)))

In [None]:
y_pred=Random_GBR.predict(X_test)

In [None]:
accuracy=metrics.accuracy_score(Y_test,y_pred)
precision=metrics.precision_score(Y_test,y_pred,average='macro')
recall=metrics.recall_score(Y_test,y_pred,average='macro')
f1=metrics.f1_score(Y_test,y_pred,average='macro')
print("The Accuracy of this model is {0:.2f}%".format(accuracy*100))
print("The Precision of this model is {0:.2f}%".format(precision*100))
print("The Recall score of this model is {0:.2f}%".format(recall*100))
print("The f1 score of this model is {0:.2f}%".format(f1*100))

In [None]:
Random_GBR.cv_results_

In [None]:
classification_report=metrics.classification_report(Y_test,y_pred)

In [None]:
print(classification_report)
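
The precision, recall and F1 values above use `average='macro'`, which gives every class equal weight regardless of how many records it has. With an imbalanced target this can differ noticeably from a support-weighted average, as this small sketch on made-up labels shows.

```python
from sklearn.metrics import recall_score

# Made-up labels: class 0 has 8 records, class 1 only 2
y_true_demo = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred_demo = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]   # one minority record is missed

# Per-class recall is 1.0 for class 0 and 0.5 for class 1
macro = recall_score(y_true_demo, y_pred_demo, average="macro")        # (1.0 + 0.5) / 2
weighted = recall_score(y_true_demo, y_pred_demo, average="weighted")  # (8*1.0 + 2*0.5) / 10
print("macro recall: {:.2f}, weighted recall: {:.2f}".format(macro, weighted))
```

Macro averaging is the stricter choice here, because a poorly predicted minority class pulls the score down as much as a majority class would.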

### Model Building: XGBoost classifier tuned with RandomizedSearchCV (final model)

In [None]:
#Pipeline
pipe_XGB = Pipeline([('XGB', XGBClassifier())])

#Parameter values to sample from
param_grid = {'XGB__learning_rate':[0.1,0.2,0.3],'XGB__max_depth' :[10,50,100], 'XGB__gamma':[0.1,0.3,0.5]}

#Using RandomizedSearchCV (n_iter=3 samples 3 of the 27 possible combinations)
Random_XGB = RandomizedSearchCV(pipe_XGB, param_distributions=param_grid, cv=5, n_iter=3)
#Fitting the data in the model
Random_XGB.fit(X_train, y_train)

print(" Best cross-validation score obtained is: {:.2f}".format(Random_XGB.best_score_))
print(" Best parameters found by the randomized search: ", Random_XGB.best_params_)
print(" Train set score obtained is: {:.2f}".format(Random_XGB.score(X_train, y_train)))
print(" Validation set score obtained is: {:.2f}".format(Random_XGB.score(X_val, y_val)))
print(" Test set score obtained is: {:.2f}".format(Random_XGB.score(X_test, Y_test)))

In [None]:
y_pred=Random_XGB.predict(X_test)

#### Test Evaluation Metrics

In [None]:
accuracy=metrics.accuracy_score(Y_test,y_pred)
precision=metrics.precision_score(Y_test,y_pred,average='macro')
recall=metrics.recall_score(Y_test,y_pred,average='macro')
f1=metrics.f1_score(Y_test,y_pred,average='macro')
print("The Accuracy of this model is {0:.2f}%".format(accuracy*100))
print("The Precision of this model is {0:.2f}%".format(precision*100))
print("The Recall score of this model is {0:.2f}%".format(recall*100))
print("The f1 score of this model is {0:.2f}%".format(f1*100))

In [None]:
Random_XGB.cv_results_

In [None]:
classification_report=metrics.classification_report(Y_test,y_pred)

In [None]:
print(classification_report)

In [None]:
skplt.metrics.plot_confusion_matrix(Y_test,y_pred,figsize=(12,12))

### Predicting the Test Data

In [None]:
testData=pd.read_csv("/kaggle/input/airplane-accidents-severity-dataset/test.csv")

#### Pre-processing the test data

In [None]:
testData.drop(['Accident_ID'],axis=1,inplace=True)
testData.head()

In [None]:
testData.info()

In [None]:
testDataNumericals = pd.DataFrame(testData, columns =testData.columns[testData.dtypes == 'float64']) 
testDataNumericals.head()

In [None]:
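# Note: this standardizes the test set with its own mean and standard deviation,
# not with the statistics of the training data the model was fitted on
testDataNumericals=testDataNumericals.apply(zscore)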

In [None]:
testData.drop(testData.columns[testData.dtypes == 'float64'],axis=1,inplace=True)
testData.head()

In [None]:
for column in testDataNumericals.columns:
    testData[column]=testDataNumericals[column]

In [None]:
testData.head()

In [None]:
testData['Total_Safety_Complaints'] = np.power(2, testData['Total_Safety_Complaints'])
testData['Days_Since_Inspection'] = np.power(2, testData['Days_Since_Inspection'])
testData['Safety_Score'] = np.power(2, testData['Safety_Score'])
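
The test-set preprocessing above standardizes with the test file's own statistics. A common alternative, sketched below on toy values and not used in this notebook, is to fit a `StandardScaler` on the training columns and reuse it for the test columns, so that both sets are expressed on the same scale. The frames `train_demo` and `test_demo` are hypothetical stand-ins.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy stand-ins for one float column of the train and test files
train_demo = pd.DataFrame({"Cabin_Temperature": [70.0, 80.0, 90.0]})
test_demo = pd.DataFrame({"Cabin_Temperature": [80.0, 100.0]})

# Learn mean and standard deviation from the training data only
scaler = StandardScaler().fit(train_demo)

# Apply the training statistics to the test data
test_scaled = pd.DataFrame(
    scaler.transform(test_demo), columns=test_demo.columns, index=test_demo.index
)
print(test_scaled)
```

With this approach a test value equal to the training mean maps to 0, whereas standardizing the two toy test values on their own would map them to -1 and 1 whatever their relation to the training data.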

#### Predictions using Extreme Gradient Boosting (XGBoost)

In [None]:
testPredictions=Random_XGB.predict(testData)

In [None]:
testData['Severity']=encoder.inverse_transform(testPredictions)

In [None]:
testData.head()

In [None]:
finalData=pd.read_csv("/kaggle/input/airplane-accidents-severity-dataset/test.csv")

In [None]:
finalData['Severity']=testData['Severity']

In [None]:
finalData.head()

In [None]:
finalData.to_csv('test.csv', index=False)  # index=False keeps the DataFrame index out of the output file