# Machine Learning, Winternship
### Widhya.org
**An airline company is struggling with its inability to predict flight delays, and this is causing it heavy losses. The company needs a machine learning engineer to build a model that accurately predicts flight delays.**

## Importing Libraries
Let's import the libraries we need to get started. We will add more as required.

In [None]:
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score
from sklearn import metrics


## Data Collection

Let's read the data. Note that the file is large (approximately 565 MB).

In [None]:

flight_df = pd.read_csv('/kaggle/input/flight-delays/flights.csv')
flight_df

This raises a warning that some columns contain mixed data types. To resolve it, we will set the low_memory parameter to False.

To learn more about this warning, follow the link below.

*[StackOverflow: Pandas read_csv low_memory and dtype options](https://stackoverflow.com/questions/24251219/pandas-read-csv-low-memory-and-dtype-options)*


In [None]:
flight_df = pd.read_csv('/kaggle/input/flight-delays/flights.csv', low_memory=False)
flight_df
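
An alternative to low_memory=False is to declare the dtype of the offending columns explicitly. The snippet below is a minimal sketch on a made-up in-memory CSV; it assumes, for illustration only, that 'ORIGIN_AIRPORT' is one of the columns mixing numeric-looking and text codes, so check the warning message for the actual column positions in flights.csv.

```python
import io
import pandas as pd

# Toy CSV standing in for flights.csv. The column names are borrowed from the
# dataset, but the values are made up to mix numeric-looking and text codes.
csv_text = "ORIGIN_AIRPORT,DEPARTURE_DELAY\nANC,-11\n10397,5\nLAX,-8\n"

# Declaring the dtype up front means pandas does not have to guess it.
toy_df = pd.read_csv(io.StringIO(csv_text), dtype={"ORIGIN_AIRPORT": str})

print(toy_df["ORIGIN_AIRPORT"].tolist())
```

Every airport code is read as text, including the numeric-looking one, so the column has a single consistent type.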

Now, to analyse the data further, we will take the first 100,000 rows as a sample.

In [None]:
# Copy the slice so that adding columns later does not trigger SettingWithCopyWarning.
flight_df = flight_df.iloc[:100000].copy()
flight_df.info()
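
Taking the first rows is fast, but if the file is ordered by date (an assumption about flights.csv, not something checked here), those rows cover only the earliest days. A hedged alternative is a random sample with DataFrame.sample; the toy frame below is made up to show the difference.

```python
import numpy as np
import pandas as pd

# Toy frame ordered by day, standing in for a date-ordered flights table.
toy_df = pd.DataFrame({"DAY": np.repeat(np.arange(1, 11), 10)})

head_sample = toy_df.iloc[:20]                         # first rows only
random_sample = toy_df.sample(n=20, random_state=42)   # rows drawn from the whole frame

print(head_sample["DAY"].nunique())     # only the earliest days appear
print(random_sample["DAY"].nunique())   # days spread across the frame
```

On the real data this would read something like flight_df.sample(n=100000, random_state=42), which changes the counts reported in the rest of the notebook.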

Let's find out the number of flights diverted in the sample. We will count the values of the 'DIVERTED' column, which takes two values: 0 indicates not diverted and 1 indicates diverted.

In [None]:
flight_df.value_counts('DIVERTED')

So, out of the first 100,000 rows, 224 flights were diverted. Moving on!
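
Raw counts are easier to interpret alongside proportions. A small sketch on made-up values, using the normalize option of value_counts:

```python
import pandas as pd

# Made-up flags: one diverted flight out of ten.
toy = pd.Series([0, 0, 0, 1, 0, 0, 0, 0, 0, 0], name="DIVERTED")

counts = toy.value_counts()                 # absolute counts per value
shares = toy.value_counts(normalize=True)   # fraction of rows per value

print(counts)
print(shares)
```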

## EDA
Some key points:
- [Exploratory Data Analysis (EDA)](https://www.itl.nist.gov/div898/handbook/eda/section1/eda11.htm) is an approach/philosophy for data analysis that employs a variety of techniques (mostly graphical) to:
    - maximize insight into a data set
    - uncover underlying structure
    - extract important variables
    - detect outliers and anomalies
    - test underlying assumptions
    - develop parsimonious models
    - determine optimal factor settings. 
    


Now, let's analyse the dataset using plots such as pairplots and jointplots from the Seaborn module.
- [Seaborn](https://towardsdatascience.com/data-visualization-using-seaborn-fc24db95a850) is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.


Let's plot the jointplot between 'SCHEDULED_ARRIVAL' and 'ARRIVAL_TIME'.

In [None]:
sns.jointplot(data = flight_df, x = "SCHEDULED_ARRIVAL", y = "ARRIVAL_TIME")

Now, we will calculate correlation among the variables to see the strength of association/relation between them.

### More about [Correlation](https://medium.com/analytics-vidhya/what-is-correlation-4fe0c6fbed47).

In [None]:
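# numeric_only=True skips text columns such as 'AIRLINE' (needed on pandas 2.x, available from 1.5).
flight_df.corr(numeric_only=True)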
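
To make the numbers in the correlation table concrete, here is a minimal sketch that computes the Pearson coefficient by hand and compares it with DataFrame.corr. The values are made up, and numeric_only=True assumes pandas 1.5 or later.

```python
import numpy as np
import pandas as pd

# Made-up delays in minutes; 'AIRLINE' is text and is skipped by corr().
toy_df = pd.DataFrame({
    "DEPARTURE_DELAY": [0, 5, 10, 20, 40],
    "ARRIVAL_DELAY": [-3, 4, 8, 22, 37],
    "AIRLINE": ["AA", "DL", "AA", "UA", "DL"],
})

corr = toy_df.corr(numeric_only=True)

# Pearson r = covariance / (std_x * std_y), written out with deviations from the mean.
x = toy_df["DEPARTURE_DELAY"].to_numpy(dtype=float)
y = toy_df["ARRIVAL_DELAY"].to_numpy(dtype=float)
dx, dy = x - x.mean(), y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

print(corr.loc["DEPARTURE_DELAY", "ARRIVAL_DELAY"], r_manual)
```

A value close to 1 means the two columns rise together almost linearly; a value close to 0 means there is little linear relation.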

From the above we can see that there are a number of unwanted features (i.e. features that are not highly correlated).

Let's drop these variables:
- 'YEAR'
- 'FLIGHT_NUMBER'
- 'AIRLINE'
- 'DISTANCE'
- 'TAIL_NUMBER'
- 'TAXI_OUT'
- 'SCHEDULED_TIME'
- 'DEPARTURE_TIME'
- 'WHEELS_OFF'
- 'ELAPSED_TIME'
- 'AIR_TIME'
- 'WHEELS_ON'
- 'DAY_OF_WEEK'
- 'TAXI_IN'
- 'CANCELLATION_REASON'

In [None]:
flight_df = flight_df.drop(
    ['YEAR', 'FLIGHT_NUMBER', 'AIRLINE', 'DISTANCE', 'TAIL_NUMBER', 'TAXI_OUT',
     'SCHEDULED_TIME', 'DEPARTURE_TIME', 'WHEELS_OFF', 'ELAPSED_TIME',
     'AIR_TIME', 'WHEELS_ON', 'DAY_OF_WEEK', 'TAXI_IN', 'CANCELLATION_REASON'],
    axis=1,
)

Let's print the correlation of each feature with 'ARRIVAL_DELAY' and find the feature with the highest correlation.

In [None]:
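# Skip the first column, keep numeric columns only, and rank by correlation with 'ARRIVAL_DELAY'.
flight_df[flight_df.columns[1:]].corr(numeric_only=True)['ARRIVAL_DELAY'].sort_values(ascending=False)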

As can be seen, 'DEPARTURE_DELAY' has the highest correlation with 'ARRIVAL_DELAY' (apart from 'ARRIVAL_DELAY' itself).

Moving on!

## Data Cleaning and Preprocessing

As we have already removed the least correlated variables, let's now find out which variables have missing values.

In [None]:
flight_df.isna().sum()

Notice that all the features with missing values above are numerical. To handle these missing values, we will replace them with the column means.

In [None]:
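# numeric_only=True computes means for numeric columns only; text columns are left as they are.
flight_df = flight_df.fillna(flight_df.mean(numeric_only=True))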
flight_df.isna().sum()
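
The effect of mean imputation is easy to verify on a toy frame. This is a sketch with made-up values; the column names are borrowed from the dataset for readability only.

```python
import numpy as np
import pandas as pd

# Made-up frame: one numeric column with gaps and one text column without.
toy_df = pd.DataFrame({
    "WEATHER_DELAY": [10.0, np.nan, 30.0, np.nan],
    "ORIGIN_AIRPORT": ["ANC", "LAX", "SFO", "SEA"],
})

col_means = toy_df.mean(numeric_only=True)   # mean of the observed values: 20.0
filled = toy_df.fillna(col_means)            # each NaN takes its own column's mean

print(filled)
```

Both gaps are filled with 20.0, the mean of the two observed values, and the text column is untouched.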

In order to build a predictive machine learning classification model, we need one or more independent variables and one dependent variable.
The dataset does not have a dependent variable that states whether a flight is delayed or not.

So, let's create a variable named 'Result' which takes the values 0 and 1:
- '0' : flight not delayed
- '1' : flight delayed

These values are assigned using the condition: if 'ARRIVAL_DELAY' is greater than 15, assign '1' (flight delayed), else '0'.

In [None]:
result = []
for row_value in flight_df['ARRIVAL_DELAY']:
    if row_value > 15:
        result.append(1)
    else:
        result.append(0) 
        
flight_df['Result']  = result
flight_df.tail(10)
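
The same labels can be produced without an explicit loop. The sketch below applies the threshold as a vectorized comparison on a made-up column; it is an equivalent formulation, not a change to the rule.

```python
import numpy as np
import pandas as pd

# Made-up arrival delays in minutes, including the boundary value and a gap.
toy_df = pd.DataFrame({"ARRIVAL_DELAY": [-5.0, 15.0, 16.0, 120.0, np.nan]})

# The comparison yields booleans; astype(int) maps False to 0 and True to 1.
toy_df["Result"] = (toy_df["ARRIVAL_DELAY"] > 15).astype(int)

print(toy_df["Result"].tolist())  # [0, 0, 1, 1, 0]
```

A delay of exactly 15 minutes counts as not delayed, and a missing delay also maps to 0, which matches the behaviour of the loop.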

Let's find out number of flights delayed.

In [None]:
flight_df['Result'].value_counts()

It can be seen that 36,221 flights were delayed out of the 100,000 rows that we sampled.

## Model Creation

We will keep only the following columns for model building ('Result' is the target):
>'MONTH', 'DAY', 'SCHEDULED_DEPARTURE', 'DEPARTURE_DELAY', 'SCHEDULED_ARRIVAL', 'DIVERTED', 'CANCELLED', 'AIR_SYSTEM_DELAY', 'SECURITY_DELAY', 'AIRLINE_DELAY', 'LATE_AIRCRAFT_DELAY', 'WEATHER_DELAY', 'Result'

and drop the other variables.

In [None]:
flight_df = flight_df.drop(['ORIGIN_AIRPORT', 'DESTINATION_AIRPORT', 'ARRIVAL_TIME', 'ARRIVAL_DELAY'],axis=1)


Next, for model training and testing, let's split the dataset with a test size of 30%.

In [None]:
flight_df = flight_df.values

# X holds all independent variables (every column except the last).
# y holds the dependent variable ('Result', the last column).
X, y = flight_df[:,:-1], flight_df[:,-1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state = 42)
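
Because delayed flights are the minority class, it can help to keep the class ratio equal in both splits. The sketch below uses the stratify argument of train_test_split on made-up data with roughly the delay rate seen above; whether stratification matters for this dataset is not something verified here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 36% positives.
X = np.arange(200).reshape(100, 2)
y = np.array([1] * 36 + [0] * 64)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y
)

# Both splits keep a positive rate close to 0.36.
print(y_tr.mean(), y_te.mean())
```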

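Now we will use the StandardScaler class to standardize the features. The scaler is fitted on the training split only and then applied to both splits, so no information from the test set leaks into preprocessing.

In [None]:
scaler = StandardScaler()
# Learn the mean and standard deviation from the training data, then scale it.
X_train = scaler.fit_transform(X_train)
# Reuse the training statistics to scale the test data.
X_test = scaler.transform(X_test)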
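
One way to make sure the scaler only ever sees training data is to chain it with the classifier in a scikit-learn pipeline. This is a sketch on synthetic data from make_classification, not the notebook's dataset, and the variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the flight features.
X, y = make_classification(n_samples=500, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=42)

# fit() fits the scaler on X_tr only; predict_proba() reuses those statistics on X_te.
model = make_pipeline(StandardScaler(), DecisionTreeClassifier(random_state=0))
model.fit(X_tr, y_tr)
proba = model.predict_proba(X_te)

print(proba.shape)
```

Decision trees split on thresholds, so they are not sensitive to feature scaling; the scaling step matters more if the tree is later swapped for a distance-based or linear model.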

Next, we will fit a DecisionTreeClassifier on the training data.

In [None]:
clf = DecisionTreeClassifier()
clf = clf.fit(X_train,y_train)
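
With default settings, a decision tree keeps growing until its leaves are pure, which can memorize the training data. The sketch below compares an unrestricted tree with one limited by max_depth on synthetic data; the depth of 4 is an arbitrary illustrative choice, not a tuned value.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with some label noise (flip_y) to make overfitting visible.
X, y = make_classification(n_samples=1000, n_features=8, flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=42)

full_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
small_tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_tr, y_tr)

# Compare depth, training accuracy and test accuracy of both trees.
for name, tree in [("full", full_tree), ("max_depth=4", small_tree)]:
    print(name, tree.get_depth(), tree.score(X_tr, y_tr), tree.score(X_te, y_te))
```

A large gap between training and test accuracy for the unrestricted tree is a sign of overfitting; a suitable depth would normally be chosen by cross-validation.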

## Model Prediction
Let's compute the AUC score of the model on the test set.

In [None]:
pred = clf.predict_proba(X_test)
auc_score = roc_auc_score(y_test, pred[:,1])
auc_score
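
To see what the AUC measures, here is a small worked example with made-up labels and scores. AUC equals the fraction of (delayed, not delayed) pairs in which the delayed flight receives the higher score, with ties counted as half.

```python
from sklearn.metrics import roc_auc_score

# Made-up true labels and predicted scores for the positive class.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc_score(y_true, scores)

# Manual check: compare every positive score with every negative score.
pos = [s for s, t in zip(scores, y_true) if t == 1]
neg = [s for s, t in zip(scores, y_true) if t == 0]
wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
manual_auc = wins / (len(pos) * len(neg))

print(auc, manual_auc)  # 0.75 0.75
```

Three of the four pairs are ranked correctly, giving 0.75. A score of 0.5 corresponds to random ranking and 1.0 to perfect ranking.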

Reference: [https://www.kaggle.com/hrishikeshmalkar/flight-delay-prediction](https://www.kaggle.com/hrishikeshmalkar/flight-delay-prediction)