# Preamble
This is my first data science project. I learned a lot from it, both about Python and about data science in general. The biggest part of such a project is data preparation and cleansing. As in every other data science project, I spent a lot of time there before I came to the most fun part: model building and prediction. I tried different approaches during data preparation by trial and error and read about better and worse ways to do it. Sometimes I took the long way instead of the shorter and faster approach to get a better understanding of how the data has to be prepared.
The main goal of this project is to show some ways of preparing data and to predict the delay of a flight based on its input features. To get there, I first need to prepare and create data, which is handled in the first chapter ***“Data Analysis and Preprocessing”***. There I also give an overview of the data structure and the condition of the data. The next chapter, ***“Feature and Label Selection”***, determines the features using different feature selection approaches. In line with the goal of predicting the delay, the target of this essay is the *flight delay at arrival*. To predict the target I use Random Forest Regression. 5 million flights are quite a lot in terms of computation time, so I will slice smaller datasets and use them for prediction.

I would be very pleased to get feedback on this project and on my coding style. ***So feel free to leave some comments***. It's my first project and it can only get better. I am looking forward to your feedback and an upvote if you liked it. <br>
***Thanks a lot.***
<br>
<br>
Cheers,<br>
Robin

___

# Imports
## Library Imports and Helpers

In [None]:
# Imports 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import matplotlib as mpl
import matplotlib.gridspec as gridspec
import matplotlib.pyplot as plt
import matplotlib.cm as cm
from mpl_toolkits.basemap import Basemap


import seaborn as sns

# machine learning libraries
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix 

from sklearn.metrics import mean_absolute_error
from sklearn.svm import SVC

import datetime
import time
from time import strftime, gmtime

## CSV-File Import

In [None]:
df_flights = pd.read_csv('../input/flights.csv')

# Data Analysis and Preprocessing
The following is the first overview of all attributes:

In [None]:
df_flights.head()

The first thing I noticed is that there is no single date field for each record. The date is split into the attributes **YEAR**, **MONTH** and **DAY**. I could create a full date out of these values right now, but in further processing a datetime or string object could cause more problems. Right now we have clean numerical values which can be fed into any prediction model without problems:

In [None]:
df_flights.loc[:,('YEAR','MONTH','DAY')].dtypes
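
If a full date is needed later, for example for plotting or to look up the weekday, it could be assembled from the three columns. The following is only a minimal sketch on a made-up toy frame; `df_demo` and the `DATE` column are assumptions for illustration and are not used elsewhere in this notebook:

```python
import pandas as pd

# Toy frame with the same three date columns as flights.csv (values are made up)
df_demo = pd.DataFrame({"YEAR": [2015, 2015], "MONTH": [1, 12], "DAY": [1, 31]})

# pd.to_datetime can assemble a date from columns named year / month / day
parts = df_demo[["YEAR", "MONTH", "DAY"]].rename(columns=str.lower)
df_demo["DATE"] = pd.to_datetime(parts)

# Weekday as a number, Monday=0 ... Sunday=6
print(df_demo["DATE"].dt.dayofweek.tolist())
```

The numeric columns stay untouched, so the model input remains purely numerical.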

In [None]:
df_flights.count()

In [None]:
df_flights.head(15)

## Overview

Let's get an overview. **CANCELLATION_REASON** has far fewer non-null values than the other attributes. Another anomaly is found in the attributes **AIR_SYSTEM_DELAY**, **SECURITY_DELAY**, **AIRLINE_DELAY**, **LATE_AIRCRAFT_DELAY** and **WEATHER_DELAY**. These attributes have only about one-fifth as many non-null values as the other attributes.

There are a lot of time-related values. Some are times of day and some are durations in minutes, but they are all stored as numerical values:
    - SCHEDULED_DEPARTURE
    - DEPARTURE_TIME
    - DEPARTURE_DELAY
    - TAXI_OUT
    - WHEELS_OFF
    - SCHEDULED_TIME
    - ELAPSED_TIME
    - AIR_TIME
    - WHEELS_ON
    - TAXI_IN
    - SCHEDULED_ARRIVAL
    - ARRIVAL_TIME
    - ARRIVAL_DELAY
    - AIR_SYSTEM_DELAY
    - SECURITY_DELAY
    - AIRLINE_DELAY
    - LATE_AIRCRAFT_DELAY


This list can also be divided into two types of time values: those that are actual times of day and those that are durations in minutes:

***Time of Day Values:***
    - ARRIVAL_TIME
    - DEPARTURE_TIME
    - SCHEDULED_DEPARTURE
    - WHEELS_ON
    - WHEELS_OFF
    - SCHEDULED_ARRIVAL
        
        
***Minute Values:***
    - DEPARTURE_DELAY
    - TAXI_IN 
    - TAXI_OUT
    - SCHEDULED_TIME
    - ELAPSED_TIME
    - AIR_TIME
    - ARRIVAL_DELAY
    - AIR_SYSTEM_DELAY
    - SECURITY_DELAY
    - AIRLINE_DELAY
    - LATE_AIRCRAFT_DELAY

It now becomes clear that the actual delay is missing. The delay itself will be the focus of this work, since the goal is to predict delayed flights. More on that later.

Besides the time values there are some non-time values:

- **AIRLINE** - consists of two-letter airline codes corresponding to the data in airlines.csv.

- **FLIGHT_NUMBER** - A code that an airline uses to identify a flight service from an origin to a destination.

- **TAIL_NUMBER** - The registration number that identifies the individual aircraft.

- **ORIGIN_AIRPORT** and **DESTINATION_AIRPORT** - contain the three-letter IATA codes of the airports.


### Delay Times
The flight delays have already been calculated in the field **ARRIVAL_DELAY**. It is the difference between **ARRIVAL_TIME** and **SCHEDULED_ARRIVAL**. Therefore **ARRIVAL_DELAY** is negative when the flight arrives ahead of schedule and positive when it is delayed.



## Analyzing Datatypes
First of all, I will take a look at our dataset and analyze our datatypes. 

In [None]:
df_flights.dtypes

We already know that we have two different kinds of time values, the ***time of day*** values and the ***minute*** values. The ***time of day*** values listed below are mostly stored as float64, where the first two digits represent the hour and the last two the minutes:

    - SCHEDULED_DEPARTURE
    - DEPARTURE_TIME
    - WHEELS_OFF
    - WHEELS_ON
    - SCHEDULED_ARRIVAL
    - ARRIVAL_TIME
    
These need to be converted to a time datatype, and it is helpful to write a function for this conversion. (Thanks to  <a href="https://www.kaggle.com/fabiendaniel">fabiendaniel</a> and the great tutorial <a href="https://www.kaggle.com/fabiendaniel/predicting-flight-delays-tutorial">here</a> ):


In [None]:
# converting an HHMM-encoded input value (e.g. 1435.0) to a datetime.time object.
def conv_time(time_val):
    if pd.isnull(time_val):
        return np.nan
    # replace 24:00 o'clock with 00:00 o'clock:
    if time_val == 2400:
        time_val = 0
    # creating a 4 digit, zero-padded string out of the input value:
    time_val = "{0:04d}".format(int(time_val))
    # creating a time datatype out of the hour and minute digits:
    return datetime.time(int(time_val[0:2]), int(time_val[2:4]))
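
The HHMM encoding can also be decoded into a plain number, which keeps the column numeric for a model. This is only a sketch of an alternative that is not used in this notebook; `hhmm_to_minutes` is a hypothetical helper name:

```python
def hhmm_to_minutes(value):
    """Hypothetical helper: HHMM-encoded number -> minutes since midnight."""
    # divmod by 100 splits 1435 into (14, 35)
    hours, minutes = divmod(int(value), 100)
    # hours % 24 maps the special value 2400 back to midnight
    return (hours % 24) * 60 + minutes


print(hhmm_to_minutes(5.0), hhmm_to_minutes(1435.0), hhmm_to_minutes(2400.0))
```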

In [None]:
# convert the time-of-day columns to datetime.time and write them back into the dataframe:
df_flights['ARRIVAL_TIME'] = df_flights['ARRIVAL_TIME'].apply(conv_time)
df_flights['DEPARTURE_TIME'] = df_flights['DEPARTURE_TIME'].apply(conv_time)
df_flights['SCHEDULED_DEPARTURE'] = df_flights['SCHEDULED_DEPARTURE'].apply(conv_time)
df_flights['WHEELS_OFF'] = df_flights['WHEELS_OFF'].apply(conv_time)
df_flights['WHEELS_ON'] = df_flights['WHEELS_ON'].apply(conv_time)
df_flights['SCHEDULED_ARRIVAL'] = df_flights['SCHEDULED_ARRIVAL'].apply(conv_time)

The required data now has the correct format and can already be inspected:

In [None]:
df_flights[['YEAR','MONTH','DAY','SCHEDULED_DEPARTURE','DEPARTURE_TIME', 'DEPARTURE_DELAY', 'TAXI_OUT',
       'WHEELS_OFF', 'SCHEDULED_TIME', 'ELAPSED_TIME', 'AIR_TIME','WHEELS_ON', 'TAXI_IN', 'SCHEDULED_ARRIVAL'
            ,'ARRIVAL_TIME','ARRIVAL_DELAY','AIR_SYSTEM_DELAY','SECURITY_DELAY','AIRLINE_DELAY','LATE_AIRCRAFT_DELAY']].dtypes

## Handling the Null Values
After converting the necessary time values to a time datatype, I need to check the integrity of the data. Null values and missing data occur frequently and need to be handled.

Of the many available methods, I will focus on three in this notebook to deal with null values or missing data.

One option is to delete the corresponding rows.

Another option is to reconstruct the missing data from information in other columns. Imagine there is a start and an end time and only the duration is missing. You could calculate the missing values simply as the difference between end time and start time. That way you do not have to delete the column and can continue to use the information it contains.

One of the most useful ways to handle missing or null values is imputation. Imputation fills the gaps with numbers derived from the existing data in the column, for example its mean. These numbers are not as accurate as the real data, but they are good enough for most prediction models and usually lead to a better model than dropping the data. If you need more information about this method, read <a href="https://www.kaggle.com/dansbecker/handling-missing-values">this</a> notebook by DanB.
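
The three options can be compared side by side on a small example. This is a minimal sketch on a made-up toy frame; the column names `start`, `end`, `duration` and `distance` are assumptions for illustration and do not come from the flight data:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "start": [10.0, 20.0, 30.0, 40.0],
    "end": [15.0, 28.0, 33.0, 49.0],
    "duration": [5.0, np.nan, 3.0, np.nan],
    "distance": [100.0, np.nan, 300.0, 200.0],
})

# Option 1: drop every row that contains a missing value
dropped = toy.dropna()

# Option 2: reconstruct a missing value from other columns
reconstructed = toy.copy()
reconstructed["duration"] = reconstructed["duration"].fillna(
    reconstructed["end"] - reconstructed["start"])

# Option 3: impute with a statistic of the existing values (here: the mean)
imputed = reconstructed.copy()
imputed["distance"] = imputed["distance"].fillna(imputed["distance"].mean())

print(len(dropped), "rows survive dropping")
print(imputed)
```

Dropping loses half of the toy rows, while reconstruction and imputation keep all of them.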

In [None]:
#-------------------------------------------------------------
# null value analysing function.
# gives some infos on column types and number of null values:
def nullAnalysis(df):
    tab_info = pd.DataFrame(df.dtypes).T.rename(index={0:'column type'})
    null_nb = pd.DataFrame(df.isnull().sum()).T.rename(index={0:'null values (nb)'})
    null_pct = pd.DataFrame(df.isnull().sum()/df.shape[0]*100).T.rename(index={0:'null values (%)'})
    # stack the three one-row frames with pd.concat
    # (DataFrame.append is not available in pandas 2.0 and later)
    return pd.concat([tab_info, null_nb, null_pct])

In [None]:
nullAnalysis(df_flights)

### Reconstruct Data Manually
Our null analysis above shows the following features with a lot of null values:
    - CANCELLATION_REASON
    - AIR_SYSTEM_DELAY
    - SECURITY_DELAY
    - AIRLINE_DELAY
    - LATE_AIRCRAFT_DELAY
    - WEATHER_DELAY
    
Here I try to determine the missing data by reasoning about the situation. These columns are mostly empty because an airline consistently tries not to be the reason for a delay, so for most flights none of these delay causes occurred. The missing data (or Not-a-Number data) is therefore not a sign of bad data quality; it simply means that no such delay was recorded for the flight. You can verify this by looking at rows where at least one of these features is set: all the other features are **"initialized"** with **"0.0"**:

In [None]:
# show the delay columns for rows where AIRLINE_DELAY is not null
df_flights.loc[df_flights['AIRLINE_DELAY'].notnull(), ['AIRLINE_DELAY','AIR_SYSTEM_DELAY','SECURITY_DELAY','LATE_AIRCRAFT_DELAY','WEATHER_DELAY']].head()

So it's ok to replace the NaN values with ***"0.0"***, because these causes had no impact on the flight and did not cause a delay:

In [None]:
df_flights['AIRLINE_DELAY'] = df_flights['AIRLINE_DELAY'].fillna(0)
df_flights['AIR_SYSTEM_DELAY'] = df_flights['AIR_SYSTEM_DELAY'].fillna(0)
df_flights['SECURITY_DELAY'] = df_flights['SECURITY_DELAY'].fillna(0)
df_flights['LATE_AIRCRAFT_DELAY'] = df_flights['LATE_AIRCRAFT_DELAY'].fillna(0)
df_flights['WEATHER_DELAY'] = df_flights['WEATHER_DELAY'].fillna(0)

In [None]:
nullAnalysis(df_flights)

The number of null values has now decreased significantly. There are only a few attributes left. Particularly striking, however, is **CANCELLATION_REASON**, which stands out with around 98% null values. We need to take a closer look at the cancellation data.

### Dealing with Null Values in Categorical Data

In [None]:
df_flights.loc[df_flights['CANCELLATION_REASON'].notnull(),['CANCELLATION_REASON']].head(15)

The reason for cancellation of flights splits into the following occurrences:
- A - Airline/Carrier
- B - Weather
- C - National Air System
- D - Security

... and has the following ratio:

In [None]:
# count the occurrences of each CANCELLATION_REASON to see the ratio
df_flights['CANCELLATION_REASON'].value_counts()

As you can see, the main reason for cancellation is **B**, the weather. It is well known that the weather is often the cause of delays and cancellations. In the case of this attribute, we look at the weather as a cancellation reason, not a delay reason. This raises the following question: if we want to predict the delay times of flights, is it necessary to include cancellation reasons in our calculation? Don't we want to focus only on flights that were not canceled, on flights with a departure and a (late) arrival time?
The answer is: no, we want them all! We don't want to lose data for our prediction. Every piece of information is important here. Take the cancellation reason "Weather" for a canceled flight. The flight itself did not take place, that's right, but what about the consequences? All the passengers still need to get to their destinations, so they will be rebooked on the next flight, or the canceled flight will depart in a later time slot and will probably block another plane's slot. All of that leads to a knock-on effect on other flights.

#### "Manual" Conversion of Categories to Numeric Values
Most models don't work well with categorical values. They need to be converted into numeric values before a prediction model can use them. One way to convert categorical data into numeric values is called **One-Hot Encoding**. This approach creates a separate column for every categorical value and marks each row with 1 where the value occurs and 0 where it does not. It is worth taking a deeper look into it.

A similar approach is included in the Pandas library and is called ***get_dummies()***. This function converts categorical values into dummy/indicator values as well. Null values are handled too: such rows are filled with ***0*** in every indicator column. Nevertheless, because there are only four categorical values here, I will convert **CANCELLATION_REASON** manually. The final result will be the following:
- NaN = 0
- A = 1
- B = 2
- C = 3
- D = 4


In [None]:
# -------------------------------------
# converting categoric value to numeric
df_flights.loc[df_flights['CANCELLATION_REASON'] == 'A', 'CANCELLATION_REASON'] = 1
df_flights.loc[df_flights['CANCELLATION_REASON'] == 'B', 'CANCELLATION_REASON'] = 2
df_flights.loc[df_flights['CANCELLATION_REASON'] == 'C', 'CANCELLATION_REASON'] = 3
df_flights.loc[df_flights['CANCELLATION_REASON'] == 'D', 'CANCELLATION_REASON'] = 4

# -----------------------------------
# converting NaN data to numeric zero
df_flights['CANCELLATION_REASON'] = df_flights['CANCELLATION_REASON'].fillna(0)
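
For comparison, the ***get_dummies()*** approach mentioned above could look roughly like this. It is only a sketch on a made-up toy series, and the prefix `CANCEL` is an arbitrary choice for illustration:

```python
import numpy as np
import pandas as pd

# Toy cancellation reasons, including one missing value
reasons = pd.Series(["A", "B", np.nan, "B", "D"], name="CANCELLATION_REASON")

# One indicator column per category; a row with NaN gets 0 in every column
dummies = pd.get_dummies(reasons, prefix="CANCEL", dtype=int)
print(dummies)
```

Unlike the manual mapping to 1 to 4, the indicator columns do not suggest an order between the reasons.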

In [None]:
# check null values
nullAnalysis(df_flights)

### Null Values in Different Types of Time Values

Because of the null values in the different time values (see the evaluation above), I need to take a closer look at them.

In contrast to the delay-cause values (**AIR_SYSTEM_DELAY, SECURITY_DELAY, AIRLINE_DELAY, LATE_AIRCRAFT_DELAY, WEATHER_DELAY**), the remaining time values are partly measured times and partly calculated times. I already mentioned that separation and now dig deeper into it:

**Measured Times:**
- SCHEDULED_TIME
- TAXI_IN
- WHEELS_ON


**Calculated Times:**
- ELAPSED_TIME
- AIR_TIME
- ARRIVAL_TIME

The values listed under ***Measured Times*** have been determined by the airline. They are not calculated; they have been measured or scheduled. The situation is different for the ***Calculated Times***. For these we have formulas which are composed as follows:

* **ELAPSED_TIME** = TAXI_OUT + AIR_TIME + TAXI_IN

* **AIR_TIME** = WHEELS_ON - WHEELS_OFF 

* **ARRIVAL_TIME** =  WHEELS_ON + TAXI_IN

It is good to know how to calculate these values. The bad news is that the values needed for the calculation are NaN as well, which is probably the reason why the calculated values are NaN in the first place. I have no choice but to treat these rows as outliers and drop them.
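
To illustrate why the formulas do not help here, the following sketch tries to rebuild **ELAPSED_TIME** on a made-up toy frame. The values are invented; only the formula comes from the list above:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "TAXI_OUT": [12.0, 15.0, np.nan],
    "AIR_TIME": [100.0, 80.0, np.nan],
    "TAXI_IN": [5.0, 7.0, np.nan],
    "ELAPSED_TIME": [117.0, np.nan, np.nan],
})

# ELAPSED_TIME = TAXI_OUT + AIR_TIME + TAXI_IN
# (the sum is NaN as soon as one component is NaN)
rebuilt = toy["TAXI_OUT"] + toy["AIR_TIME"] + toy["TAXI_IN"]
toy["ELAPSED_TIME"] = toy["ELAPSED_TIME"].fillna(rebuilt)

print(toy["ELAPSED_TIME"].tolist())
```

The second row can be repaired because all components are present. The third row stays NaN, which is the situation found in the flight data.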

In [None]:
# drop the last 1% of missing data rows.
df_flights = df_flights.dropna(axis=0)

### Analyzing Distribution after Cleansing, Conversion and Preprocessing

In [None]:

df_times = df_flights[
[
    'SCHEDULED_DEPARTURE',
    'DEPARTURE_TIME',
    'DEPARTURE_DELAY',
    'TAXI_OUT',
    'WHEELS_OFF',
    'SCHEDULED_TIME',
    'ELAPSED_TIME',
    'AIR_TIME',
    'DISTANCE',
    'WHEELS_ON',
    'TAXI_IN',
    'SCHEDULED_ARRIVAL',
    'ARRIVAL_TIME',
    'ARRIVAL_DELAY',
    'DIVERTED',
    'CANCELLED',
    'CANCELLATION_REASON',
    'AIR_SYSTEM_DELAY',
    'SECURITY_DELAY',
    'AIRLINE_DELAY',
    'LATE_AIRCRAFT_DELAY',
    'WEATHER_DELAY'
]]

In [None]:
pd.set_option('float_format', '{:f}'.format)

df_times.describe()

I have cleaned up the data and assigned the correct data types. Now it is time to select our label and moreover our features to predict the label.

# Feature and Label Selection
For our prediction, I now need to identify the features that are most likely to have an impact on the flight delays.

I want to include the airports and figure out whether the departure airport has an impact on the delay. For this, I will bring in the airports from another file. With the included information about the location of each airport, I could identify regions on the map that are prone to delays.

First, however, I will include the airlines in the evaluation to get a distribution of the delays per airline. Later I will add the airports and their location data to get a closer view of the map and some location-based delays.

## Merging the Airline Codes (IATA-Codes)
I am going to merge the IATA-Airline codes from the other .csv-file. 

In [None]:
df_airlines = pd.read_csv('../input/airlines.csv')
df_airlines

Above are the values from the csv-file. Now I am going to check the distribution in the flight dataset and join them with the other data.

In [None]:
df_flights['AIRLINE'].value_counts()

***Southwest Airlines (WN)*** is the airline with the most entries in this evaluation. By contrast, ***Virgin America (VX)*** is the one with the fewest. I will join the airline names to the other data to get a closer look.

In [None]:
# joining airlines
df_flights = df_flights.merge(df_airlines, left_on='AIRLINE', right_on='IATA_CODE', how='inner')

In [None]:
# dropping old column and rename new one
df_flights = df_flights.drop(['AIRLINE_x','IATA_CODE'], axis=1)
df_flights = df_flights.rename(columns={"AIRLINE_y":"AIRLINE"})

### Analyzing the Delays by Airline
Getting an overview of delays by airlines companies.

In [None]:
sns.set(style="whitegrid")

# initialize the figure
fig_dim = (16,14)
f, ax = plt.subplots(figsize=fig_dim)
sns.despine(bottom=True, left=True)

# Show each observation with a scatterplot
sns.stripplot(x="ARRIVAL_DELAY", y="AIRLINE",
              data=df_flights, dodge=True, jitter=True
            )

The distribution above shows the airlines in comparison to their **ARRIVAL_DELAYs**. It clearly shows that ***American Airlines*** has a wide spread of delays. By contrast, the airline with the most entries is ***Southwest Airlines*** and their delays look pretty low compared to the ***American Airlines*** delays. I will elaborate on this in the following:

In [None]:
# Group by airline and sum up / count the values
df_flights_grouped_sum = df_flights.groupby('AIRLINE', as_index= False)['ARRIVAL_DELAY'].agg('sum').rename(columns={"ARRIVAL_DELAY":"ARRIVAL_DELAY_SUM"})
df_flights_grouped_cnt = df_flights.groupby('AIRLINE', as_index= False)['ARRIVAL_DELAY'].agg('count').rename(columns={"ARRIVAL_DELAY":"ARRIVAL_DELAY_CNT"})

# Merge the two groups together
df_flights_grouped_delay = df_flights_grouped_sum.merge(df_flights_grouped_cnt, left_on='AIRLINE', right_on='AIRLINE', how='inner')
# Calculate the average delay per airline
df_flights_grouped_delay.loc[:,'AVG_DELAY_AIRLINE'] = df_flights_grouped_delay['ARRIVAL_DELAY_SUM'] / df_flights_grouped_delay['ARRIVAL_DELAY_CNT']

df_flights_grouped_delay.sort_values('ARRIVAL_DELAY_SUM', ascending=False)
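
The same kind of table could also be built in a single `groupby` with named aggregation. This is a sketch on a made-up toy frame, not a replacement for the step-by-step version above:

```python
import pandas as pd

# Toy data: two airlines with invented arrival delays in minutes
toy = pd.DataFrame({
    "AIRLINE": ["AA", "AA", "WN", "WN", "WN"],
    "ARRIVAL_DELAY": [30.0, -10.0, 5.0, 10.0, 0.0],
})

# Sum, count and mean per airline in one pass
summary = (
    toy.groupby("AIRLINE", as_index=False)
       .agg(ARRIVAL_DELAY_SUM=("ARRIVAL_DELAY", "sum"),
            ARRIVAL_DELAY_CNT=("ARRIVAL_DELAY", "count"),
            AVG_DELAY_AIRLINE=("ARRIVAL_DELAY", "mean"))
       .sort_values("ARRIVAL_DELAY_SUM", ascending=False)
)
print(summary)
```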

In conclusion, ***Southwest Airlines*** has a lot of mostly smaller delays which in total add up to the highest sum of delays in this evaluation. On the other hand, and as the distribution chart above hints, ***American Airlines*** has a number of huge delays on single flights which affect the total delay of the airline. They are in the upper third of the total delays, but their mean delay per flight is one of the lowest of all airlines.

## Feature Correlation

So let us look at the correlation between each of the features (and the label as well). This might be the first step towards a closer feature selection. The main goal is to identify the features that affect the **ARRIVAL_DELAY** in a positive or negative way.

In [None]:
# Dataframe correlation (numeric columns only; text and time columns cannot be correlated)
del_corr = df_flights.corr(numeric_only=True)

# Draw the figure
f, ax = plt.subplots(figsize=(11, 9))

# Draw the heatmap
sns.heatmap(del_corr)

### Results from Correlation Matrix
I am dividing the correlations into two groups: the **positive correlations** (higher than *0.6*) and the **less positive correlations** (between *0.2* and *0.6*). The results are listed below:

#### Positive correlations between:
* DEPARTURE_DELAY and
    * ARRIVAL_DELAY
    * LATE_AIRCRAFT_DELAY
    * AIRLINE_DELAY
* ARRIVAL_DELAY and
    * DEPARTURE_DELAY
    * LATE_AIRCRAFT_DELAY
    * AIRLINE_DELAY

#### Less positive correlations between:
* ARRIVAL_DELAY and
    * AIR_SYSTEM_DELAY
    * WEATHER_DELAY
* DEPARTURE_DELAY and
    * AIR_SYSTEM_DELAY
    * WEATHER_DELAY
* TAXI_OUT and
    * AIR_SYSTEM_DELAY
    * ELAPSED_TIME
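
Instead of reading such pairs off the heatmap by eye, they could be extracted from the correlation matrix with the same two thresholds. The following is only a sketch on synthetic data; the column names are borrowed from the dataset, but the values are random numbers generated for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
dep = rng.normal(size=500)
toy = pd.DataFrame({
    "DEPARTURE_DELAY": dep,
    "ARRIVAL_DELAY": dep + 0.1 * rng.normal(size=500),  # strongly related
    "TAXI_OUT": rng.normal(size=500),                    # unrelated
})

corr = toy.corr()

# Keep only the upper triangle so that each pair appears once, then flatten
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

strong = pairs[pairs > 0.6]                       # "positive correlations"
moderate = pairs[(pairs > 0.2) & (pairs <= 0.6)]  # "less positive correlations"

print(strong)
print(moderate)
```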

#### This leads to the following factors of influence:
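Here I count, for each feature, how often it appears in the correlation lists above, to see which features influence the most other features. In the table, *++* marks a positive correlation and *+-* a less positive correlation.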

<table class="table">
  <thead>
    <tr>
    <th scope="col">Correlation Strength</th>
    <th scope="col">Count</th>
    <th scope="col">Feature</th>
    </tr>
  </thead>
  <tr>
    <td>++</td>
    <td>2</td>
    <td>LATE_AIRCRAFT_DELAY</td>
  </tr>
  <tr>
    <td>++</td>
    <td>2</td>
    <td>AIRLINE_DELAY</td>             
  </tr>
  <tr>
    <td>++</td>
    <td>1</td>
    <td>ARRIVAL_DELAY</td>             
  </tr>
 <tr>
    <td>+-</td>
    <td>3</td>
    <td>AIR_SYSTEM_DELAY</td>             
  </tr>
  <tr>
    <td>+-</td>
    <td>2</td>
    <td>WEATHER_DELAY</td>             
  </tr>
   <tr>
    <td>+-</td>
    <td>1</td>
    <td>ELAPSED_TIME</td>             
  </tr>
</table>

These could be the main features (except the **ARRIVAL_DELAY** itself) that partly or entirely influence the flight delays. This needs to be verified with a feature selection method.

## Feature Selection with Machine Learning Algorithms
In the following, I want to verify the correlation counts above with a machine learning algorithm. Do these features really correlate with the **ARRIVAL_DELAY** as well as I think? I will classify the data into delayed and not delayed flights and define a label (DELAYED) for that in the dataframe. Afterward I will show the feature importance for the given attributes.

To begin with, I need to reduce computation time by restricting the data to January 2015. Otherwise, the whole prediction would take too long to run.

In [None]:
# Only using data from January (copy, so that new columns can be added safely later)
df_flights_jan = df_flights.loc[(df_flights['YEAR'] == 2015) & (df_flights['MONTH'] == 1)].copy()

In [None]:
df_flights_jan.head()

In [None]:
# Marking the delayed flights
df_flights_jan['DELAYED'] = df_flights_jan.loc[:,'ARRIVAL_DELAY'].values > 0

In [None]:
# Label definition
y = df_flights_jan.DELAYED

# Choosing the predictors
feature_list_s = [
    'LATE_AIRCRAFT_DELAY'
    ,'AIRLINE_DELAY'
    ,'AIR_SYSTEM_DELAY'
    ,'WEATHER_DELAY'
    ,'ELAPSED_TIME']

# New dataframe based on a small feature list
X_small = df_flights_jan[feature_list_s]

In [None]:
# RandomForestClassifier with 10 trees and fitted on the small feature set 
clf = RandomForestClassifier(n_estimators = 10, random_state=32) 
clf.fit(X_small, y)

# Extracting feature importance for each feature
df_feature_small = pd.DataFrame({'FEATURE': feature_list_s,
                                 'IMPORTANCE': clf.feature_importances_})

df_feature_small.sort_values('IMPORTANCE', ascending=False)

A little bit has changed. Now **AIR_SYSTEM_DELAY** has the most influence on whether a flight is delayed. This feature had only a less positive correlation in our correlation summary above. All the other features have remained in the order of importance we found before. Let us try a wider range of features with the same model. Please keep in mind that we use a classification here. We have classified the data into delayed and not delayed flights and now want to find out which of these features affects the delay of a flight the most. There could be, and probably will be, different features for a flight that arrives just in time, but this will be part of a later section where we try to determine the actual arrival at an airport.

In [None]:
# choosing the predictors
feature_list = [
    'YEAR'
    ,'MONTH'
    ,'DAY'
    ,'AIRLINE'
    ,'LATE_AIRCRAFT_DELAY'
    ,'AIRLINE_DELAY'
    ,'AIR_SYSTEM_DELAY'
    ,'WEATHER_DELAY'
    ,'ELAPSED_TIME'
    ,'DEPARTURE_DELAY'
    ,'SCHEDULED_TIME'
    ,'AIR_TIME'
    ,'DISTANCE'
    ,'TAXI_IN'
    ,'TAXI_OUT'
    ,'DAY_OF_WEEK'
    ,'SECURITY_DELAY'
]

# copy, so that the AIRLINE column can be re-encoded without touching df_flights_jan
X = df_flights_jan[feature_list].copy()

Here I need to convert the **AIRLINE** feature into a numeric value. The feature itself is not important for determining our features, but I probably want to show the airline in a later approach.

In [None]:
# Label encoding of AIRLINE and write this back to df
from sklearn.preprocessing import LabelEncoder
labelenc = LabelEncoder()

# Converting "category" airline to integer values
X['AIRLINE'] = labelenc.fit_transform(X['AIRLINE'])

In case we need the old values back, we will use the *.inverse_transform()* function:

In [None]:
# Convert my encoded categories back
labelenc.inverse_transform(X.iloc[:, feature_list.index('AIRLINE')])

Now the feature importance will be shown:

In [None]:
# Fit the new features and the label (based on feature_list)
clf = RandomForestClassifier(n_estimators=10, random_state=32) 
clf.fit(X, y)

In [None]:
df_feature_selection = pd.DataFrame({'FEATURE': feature_list,
                                     'IMPORTANCE': clf.feature_importances_})

df_feature_selection.sort_values('IMPORTANCE', ascending=False)

This looks quite different from our first approach. **AIR_SYSTEM_DELAY** still stays near the top, but **LATE_AIRCRAFT_DELAY**, **AIRLINE_DELAY**, **WEATHER_DELAY** and **ELAPSED_TIME** moved down some positions. With the additional features in the set, our original features are assigned a different importance. Remember that **ELAPSED_TIME** was the weakest feature in our first calculation, and yet it remains in the top five.


### Summary
We have now analyzed several features and compared them using a classification model. We came to the conclusion that with a limited view of the features, the apparent causes of a flight's delay can be completely different. This makes it all the more important to use high-quality data and a diverse set of features for a prediction.

This feature importance will help us later on if we need to prune decision trees in the random forest to maximize our prediction accuracy.

In the next chapter, I will start building the main prediction model for predicting the actual arrival delay itself.

# Data Prediction
## Preparing the Prediction



### Building the Model First
I am building the model first. I am choosing 100 trees for the model to keep the computation time within limits later on.

In [None]:
# RandomForest with 100 trees
forest_model = RandomForestRegressor(n_estimators = 100, random_state=42)


### Choosing the Prediction Target
This time I choose the ARRIVAL_DELAY as the target and change the model to the Random Forest Regressor (seen above) to predict the exact number of minutes a flight arrives late or early.

In [None]:
y = df_flights_jan.ARRIVAL_DELAY
y = np.array(y)

### Choosing the Predictors 
To predict our prediction target (ARRIVAL_DELAY), we need some features. I will select the same features as in the chapter before.

In [None]:
X = np.array(X)

## Separating into Test and Train Datasets
It is necessary to separate the data into a training and a test dataset.

In [None]:
# split data into training and validation data, for both predictors and target
# The split is based on a random number generator. Supplying a numeric value to
# the random_state argument guarantees we get the same split every time we
# run this script.

train_X, val_X, train_y, val_y = train_test_split(X, y, test_size = 0.35, random_state = 42)

### The Shape of Train and Test Data

In [None]:
print('Training Features Shape:', train_X.shape)
print('Training Labels Shape:', train_y.shape)
print('Testing Features Shape:', val_X.shape)
print('Testing Labels Shape:', val_y.shape)

## Model Training and Prediction


### Establish Baseline

In [None]:
# Average arrival delay for our dataset
baseline_preds = df_flights_jan['ARRIVAL_DELAY'].agg('sum') / df_flights_jan['ARRIVAL_DELAY'].agg('count') 

# Baseline error by average arrival delay 
baseline_errors = abs(baseline_preds - val_y)
print('Average baseline error: ', round(np.mean(baseline_errors),2))

This is our average baseline error of 22.16 minutes of delay, which we want to beat with our regression model.

### Train Model

In [None]:
# Fit the model
forest_model.fit(train_X, train_y)

### Predict and Validate the Result

In [None]:
# Predict the target based on testdata 
flightdelay_pred= forest_model.predict(val_X)

In [None]:
#Calculate the absolute errors
errors = abs(flightdelay_pred - val_y)

#### Return the Absolute Error

In [None]:
print('Mean Absolute Error: ', round(np.mean(errors),3), 'minutes.')

This looks a lot like overfitting. The mean absolute error is pretty small, which means the model predicts the arrival delay almost exactly, perhaps too exactly. I will validate and visualize the model in the next chapter.

# Validate and Visualize the Model
In this chapter, I will validate and visualize the prediction model. The previously mentioned mean absolute error of 0.857 minutes seems to be quite a good prediction of the arrival delay: the predictions are on average around 0.857 minutes away from the real value. This is a very exact prediction, so it is mandatory to check whether the model is overfitted or not.
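
One common first check for overfitting is to compare the error on the training data with the error on the held-out data. The following is only a sketch on synthetic data, so the numbers it prints say nothing about the flight model; `X_demo`, `y_demo` and the other names are assumptions for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic data: one informative feature plus noise in the target
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(400, 3))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=1.0, size=400)

tr_X, va_X, tr_y, va_y = train_test_split(
    X_demo, y_demo, test_size=0.35, random_state=42)
model = RandomForestRegressor(n_estimators=50, random_state=42).fit(tr_X, tr_y)

# Error on data the model has seen vs. data it has not seen
mae_train = mean_absolute_error(tr_y, model.predict(tr_X))
mae_val = mean_absolute_error(va_y, model.predict(va_X))
print(f"train MAE: {mae_train:.3f}  validation MAE: {mae_val:.3f}")
```

A training error far below the validation error hints at overfitting. If both errors are similarly small, the low error more likely comes from very informative features.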

The previously shown feature importance of the model looks like this:

In [None]:
# Determine the feature importance of our model
df_model_features = pd.DataFrame({'FEATURE': feature_list,
                                  'IMPORTANCE': forest_model.feature_importances_})

# Print the determined feature importance
df_model_features.sort_values('IMPORTANCE', ascending=False)

Nearly 85% of the feature importance comes from the **DEPARTURE_DELAY** feature, which is a lot. None of the other features individually reaches even 10% of importance. **AIR_SYSTEM_DELAY**, **SCHEDULED_TIME** and **ELAPSED_TIME** at least have values greater than 1.0%. All remaining features have an importance lower than 1.0%.

I will take a closer look into the single features in the next section.

## Visualize the Linear Regression between Features and Target
Next, I will visualize the linear regression for the six most important features that I used to fit the model and predict the target. Based on that, I will look at the coefficient of determination per feature. This gives an overview of how well each single feature is able to predict the target. Right after that, I will calculate the model's R-squared.
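
The fit used for these plots is an ordinary least-squares line. As a minimal, standalone sketch (NumPy instead of the `statistics` module, made-up delay values, and a hypothetical helper name `fit_line`), the closed-form slope and intercept can be cross-checked against `np.polyfit`:

```python
import numpy as np

def fit_line(xs, ys):
    """Return slope and intercept of the least-squares line through (xs, ys)."""
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)
    slope = ((np.mean(xs) * np.mean(ys) - np.mean(xs * ys)) /
             (np.mean(xs) ** 2 - np.mean(xs ** 2)))
    intercept = np.mean(ys) - slope * np.mean(xs)
    return slope, intercept

# Hypothetical departure and arrival delays in minutes (illustration only)
dep = np.array([0.0, 10.0, 20.0, 30.0, 40.0])
arr = np.array([-2.0, 9.0, 17.0, 31.0, 38.0])

m, b = fit_line(dep, arr)

# Cross-check: a degree-1 polynomial fit returns [slope, intercept]
m_ref, b_ref = np.polyfit(dep, arr, 1)

print('slope:', round(m, 3), 'intercept:', round(b, 3))
```

On large columns the NumPy version is also much faster than `statistics.mean`, which iterates over the values in pure Python.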

In [None]:
from statistics import mean

# Calculate the slope and intercept of the least-squares line
def best_fit_slope_and_intercept(xs,ys):
    m = ( ((mean(xs) * mean(ys)) - mean(xs*ys)) /
          ((mean(xs) * mean(xs)) - mean(xs*xs)) )
    b = mean(ys) - m*mean(xs)
    return m, b

# Calculate the regression line
def regression_line(m, feature, b):
    return [(m*x) + b for x in feature]

# Draw six grid scatter plot and calculate all necessary functions
def draw_sixgrid_scatterplot(feature1, feature2, feature3, feature4, feature5, feature6, target):
    fig = plt.figure(1, figsize=(16,15))
    gs=gridspec.GridSpec(3,3)
    
    # Axis for the grid
    ax1=fig.add_subplot(gs[0,0])
    ax2=fig.add_subplot(gs[0,1])
    ax3=fig.add_subplot(gs[0,2])
    ax4=fig.add_subplot(gs[1,0])
    ax5=fig.add_subplot(gs[1,1])
    ax6=fig.add_subplot(gs[1,2])
    
    # Drawing dots based on feature and target
    ax1.scatter(feature1, target, color = 'g')
    ax2.scatter(feature2, target, color = 'c')
    ax3.scatter(feature3, target, color = 'y')
    ax4.scatter(feature4, target, color = 'k')
    ax5.scatter(feature5, target, color = 'grey')
    ax6.scatter(feature6, target, color = 'm')
    
    # Get best fit for slope and intercept
    m1,b1 = best_fit_slope_and_intercept(feature1, target)
    m2,b2 = best_fit_slope_and_intercept(feature2, target)
    m3,b3 = best_fit_slope_and_intercept(feature3, target)
    m4,b4 = best_fit_slope_and_intercept(feature4, target)
    m5,b5 = best_fit_slope_and_intercept(feature5, target)
    m6,b6 = best_fit_slope_and_intercept(feature6, target)

    # Build regression lines
    regression_line1 = regression_line(m1, feature1, b1)
    regression_line2 = regression_line(m2, feature2, b2)
    regression_line3 = regression_line(m3, feature3, b3)
    regression_line4 = regression_line(m4, feature4, b4)
    regression_line5 = regression_line(m5, feature5, b5)
    regression_line6 = regression_line(m6, feature6, b6)
            
    # Plotting regression lines
    ax1.plot(feature1,regression_line1)
    ax2.plot(feature2,regression_line2)
    ax3.plot(feature3,regression_line3)
    ax4.plot(feature4,regression_line4)
    ax5.plot(feature5,regression_line5)
    ax6.plot(feature6,regression_line6)
    
    # Naming the axis
    ax1.set_xlabel(feature1.name)
    ax1.set_ylabel(target.name)
    ax2.set_xlabel(feature2.name)    
    ax2.set_ylabel(target.name)
    ax3.set_xlabel(feature3.name)
    ax3.set_ylabel(target.name)
    ax4.set_xlabel(feature4.name)
    ax4.set_ylabel(target.name)
    ax5.set_xlabel(feature5.name)
    ax5.set_ylabel(target.name)
    ax6.set_xlabel(feature6.name)
    ax6.set_ylabel(target.name)
    
    # Give the labels space
    plt.tight_layout()
    plt.show()
        

In [None]:
# Determine the squared error
def squared_error_reg(ys_orig, ys_line):
    return sum((ys_line-ys_orig)**2)

# Calculating r-squared
def coefficient_of_determination(ys_orig, ys_line):
    # Compute the mean once and repeat it for every data point
    y_mean = mean(ys_orig)
    y_mean_line = [y_mean for _ in ys_orig]
    squared_error_regr = squared_error_reg(ys_orig, ys_line)
    squared_error_y_mean = squared_error_reg(ys_orig, y_mean_line)
    return 1 - (squared_error_regr / squared_error_y_mean)

In [None]:
# Draw the grid scatters
draw_sixgrid_scatterplot(df_flights_jan['DEPARTURE_DELAY'], df_flights_jan['AIR_SYSTEM_DELAY'], df_flights_jan['SCHEDULED_TIME'],
                         df_flights_jan['ELAPSED_TIME'], df_flights_jan['TAXI_OUT'], df_flights_jan['AIRLINE_DELAY'], df_flights_jan['ARRIVAL_DELAY'])




Here we see the linear regression lines for the following features:
* DEPARTURE_DELAY
* AIR_SYSTEM_DELAY
* SCHEDULED_TIME
* ELAPSED_TIME 
* TAXI_OUT
* AIRLINE_DELAY

These are the six features that affect the decision trees the most. As I mentioned before, **DEPARTURE_DELAY** is the most important feature with 85% importance. This is clearly visible in the first chart, where you can see the linear correlation and the well-fitting regression line between the target **ARRIVAL_DELAY** and the feature **DEPARTURE_DELAY**. When the departure delay rises, the arrival delay rises as well in most cases. The straight line fits the scatter pretty well. 

However, this looks a bit different for the feature **AIR_SYSTEM_DELAY**. The scatter spreads a little wider, and the **ARRIVAL_DELAY** rises right from the first values of the feature. The linear regression line probably still fits the correlation well. It is not as good as for the previous combination of **DEPARTURE_DELAY** and **ARRIVAL_DELAY**, but it is still a good regression line.

For the third most important feature, **SCHEDULED_TIME**, everything looks very different. Here we have a big, wide scatter, and the regression line lies flat on the ground. The line runs straight along the x-axis with no visible increase. This means the **SCHEDULED_TIME** has no visible linear impact on the **ARRIVAL_DELAY**. This seems obvious: what kind of effect could a scheduled arrival time have on an arrival delay? Nearly none. The only thing that could matter here would be a flight restriction for certain times when airplanes are not allowed to land at that airport. But an airline would not plan this kind of flight route. 

The feature **ELAPSED_TIME** looks the same. The scatter and the regression line look almost the same as for the **SCHEDULED_TIME**, which comes down to the same cause.

The next feature, **TAXI_OUT**, makes the regression line rise a bit, but the scatter is as wide as before. This feature acts almost like the **DEPARTURE_DELAY**. It affects the **DEPARTURE_TIME** through the calculation **WHEEL_OFF** **-** **TAXI_OUT**, which is used to determine the **DEPARTURE_DELAY** afterward.

The last feature in the feature importance check is the **AIRLINE_DELAY**. Here we again have a feature that has an active impact on the **ARRIVAL_DELAY**. The scatter spreads a little wider, and the **ARRIVAL_DELAY** rises right from the first values of the feature, which leads to this slope of the regression line. The slope looks nearly the same as in the **DEPARTURE_DELAY** chart. The interesting fact is that, according to the prediction model, this feature does not have a high impact on the **ARRIVAL_DELAY**. It is only in sixth position in the model's importance check and does not even have a feature importance higher than 0.01. Yet according to this regression line, the **AIRLINE_DELAY** must affect the **ARRIVAL_DELAY** in one way or another.

The question here is why the prediction model does not give the **AIRLINE_DELAY** a higher weight in its prediction. Is there a need to intervene in the model to make it rank the **AIRLINE_DELAY** higher? I will go deeper into that specific feature in the upcoming chapter.

## Visual Tree
In the following, I will visualize one of the decision trees from the model above. For better visibility, I will only show the tree down to a depth of six, to show the six most important features. In contrast to the conventional presentation of decision trees, I will show this tree rotated, from left to right instead of top-down, so that it fits on this page at full size. This makes it easier to read. I chose six in the hope of finding the features from the feature importance discovery, and especially the **AIRLINE_DELAY** from the previous chapter.

In [None]:
# The original forest model
model = forest_model

# Extract single tree
estimator = model.estimators_[1]

from sklearn.tree import export_graphviz
# Export as dot file
export_graphviz(estimator, out_file='tree.dot', 
                max_depth=6,
                rotate = True,
                feature_names = feature_list,
               # class_names = ,
                rounded = True, proportion = False, 
                precision = 2, filled = True)

# Convert to png using system command (requires Graphviz)
from subprocess import call
call(['dot', '-Tpng', 'tree.dot', '-o', 'tree.png', '-Gdpi=3000'])

# Display in jupyter notebook
from IPython.display import Image
Image(filename = 'tree.png')

Within the displayed depth, several features are used more often than others for split decisions. One of them is, of course, the **DEPARTURE_DELAY**. There are several more, which will be discussed in the following comparison between the *Feature Importance by Arrival Delay*, where the goal was to determine the minutes of arrival delay (prediction model with the decision tree above), and the *Feature Importance by Delayed Flights*, where the data was classified into delayed or not delayed flights (previous feature selection). 

##### Feature Importance by Arrival Delay (minute based)
The decision tree shown above uses the following features in its splits, based on the model's feature importance:

In [None]:
i=0
df_model_s_features = pd.DataFrame(columns=['FEATURE','IMPORTANCE'])
for val in (forest_model.feature_importances_):
    df_model_s_features.loc[i] = [feature_list[i],val]
    i = i + 1
    

df_model_s_features.sort_values('IMPORTANCE', ascending=False)

##### Feature Importance by Delayed Flights (True or False)
Here is once again our previous feature importance analysis based on delayed flights, with the top six features for comparison:

In [None]:
df_feature_selection.sort_values('IMPORTANCE', ascending=False).head(6) 

If we compare the two feature importance tables, the first one (*Feature Importance by Arrival Delay* - based on the decision tree above) and the second one (*Feature Importance by Delayed Flights*), we clearly see that all of the features used by the decision tree (first table) are included in the second table. The second table represents the features discovered in the first investigation, where we tried to determine the important features of the model. They are all used by the decision tree in a slightly different order, except the **AIR_TIME**. The **AIR_TIME** does not appear in the first six split decisions; the **AIRLINE_DELAY** has taken its place. 

Our previous feature importance identification (second table) only shows what influences a flight that has already been delayed, not in which cases the flight arrived after schedule, as shown in the feature importance of the decision tree's splits (first table).
As you can see, the most important feature for the regression model is still the **DEPARTURE_DELAY**, with an importance of around 0.85. There are a lot of splits based on the **DEPARTURE_DELAY** in the visualized tree. This could also explain the low mean absolute error of around 1.0 minute.

The **DEPARTURE_DELAY** seems to be the main feature for predicting the right **ARRIVAL_DELAY**. Let's take a closer look at it before we go on with the validation:

In [None]:
# Count of DEPARTURE_DELAYs that are not zero and could influence our prediction.
print("DEPARTURE_DELAY count: ")
print(df_flights_jan[df_flights_jan['DEPARTURE_DELAY'] != 0]['DEPARTURE_DELAY'].count())
print("-------------------------------")
print("All datarow count:")
print((df_flights_jan)['DEPARTURE_DELAY'].count())
print("-------------------------------")
print("-------------------------------")
print("Share of DEPARTURE_DELAY values that are not zero:")
print(df_flights_jan[df_flights_jan['DEPARTURE_DELAY'] != 0]['DEPARTURE_DELAY'].count() / df_flights_jan['DEPARTURE_DELAY'].count())

Nearly 95% of the **DEPARTURE_DELAY** values are not zero. This nearly complete coverage, together with the effect of the **DEPARTURE_DELAY** on the **ARRIVAL_DELAY**, leads to that feature importance in the built model. So such an accuracy does not seem unusual in this case.
Still, this seems too accurate: we are talking about a difference of about one minute to the real arrival delay of a flight. The accuracy of the model needs to be checked more thoroughly.
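
One caveat on the count above: a non-zero value also includes early departures, which have a negative delay. The following sketch, with made-up delay values, separates the share of non-zero values from the share of flights that actually left late.

```python
import pandas as pd

# Hypothetical departure delays in minutes (illustration only)
dep_delay = pd.Series([-3, 0, 12, -1, 5, 0, 48, -7, 2, -4])

# ne(0) marks non-zero entries; the mean of a boolean Series is the share of True values
share_nonzero = dep_delay.ne(0).mean()

# gt(0) marks flights that left later than scheduled
share_late = dep_delay.gt(0).mean()

print('Share of non-zero values:', share_nonzero)
print('Share of late departures:', share_late)
```

Applied to `df_flights_jan['DEPARTURE_DELAY']`, the same two lines would show how much of the 95% is made up of early departures.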

In the next chapter, I will analyze the coefficient of determination to get a better overview of how well the model fits the dataset.

## The Coefficient of Determination -  The Model Fitness
In this chapter, I will calculate the coefficient of determination, or "*R-squared*", for the model. It shows how well the inputs explain the output of the model, or how well the model represents the underlying data. If R-squared is close to 1, the independent variables (the features) are well suited to predict the dependent variable (our target, the **ARRIVAL_DELAY**). 
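
For reference, R-squared compares the squared error of the predictions with the squared error of always predicting the mean. This is a minimal sketch with made-up values and a hypothetical helper name `r_squared`. The `score` method of scikit-learn regressors returns the same quantity.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # variation the model leaves unexplained
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    return 1.0 - ss_res / ss_tot

# Hypothetical delays in minutes (illustration only)
y_true = [5.0, -3.0, 40.0, 12.0, 0.0]

# Perfect predictions give the best possible score
print(r_squared(y_true, y_true))

# Always predicting the mean explains none of the variation
print(r_squared(y_true, [np.mean(y_true)] * len(y_true)))
```

A model that predicts worse than the mean gets a negative score, so R-squared is not bounded below by zero.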

I will now calculate the R-squared for the built model based on the training and test dataset:

In [None]:
print("----------------- TRAINING ------------------------")
print("r-squared score: ",forest_model.score(train_X, train_y))
print("------------------- TEST --------------------------")
print("r-squared score: ", forest_model.score(val_X, val_y))

This seems to be pretty accurate as well. The training dataset is already known to the model, which is why the test dataset is used here as well. As we know from the previous analysis, the model relies heavily on the **DEPARTURE_DELAY** feature. Most of the model's decisions are based on what the **DEPARTURE_DELAY** does, which leads to that accuracy.

I will test the model with another new dataset and calculate the necessary key figures.

### Test with Unknown Data Again
I will now use data from February to test the model against totally new, unknown data. After all the necessary preparations, I will print out the mean absolute error as well as the R-squared score of the new test data.
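
One detail matters when preparing new data: a category such as the airline has to be mapped to the same integers that were used for training. The sketch below uses a plain dictionary as a stand-in for the label encoder, with made-up airline codes, to show how the codes shift when the mapping is rebuilt from the new data alone.

```python
import pandas as pd

# Hypothetical airline codes (illustration only)
train_airlines = pd.Series(['AA', 'DL', 'UA', 'WN'])
new_airlines = pd.Series(['DL', 'WN', 'DL'])   # 'AA' and 'UA' do not occur here

# Mapping learned once from the training data and reused for the new data
code_of = {name: code for code, name in enumerate(sorted(train_airlines.unique()))}
encoded_reused = new_airlines.map(code_of)

# Mapping rebuilt from the new data only
code_of_refit = {name: code for code, name in enumerate(sorted(new_airlines.unique()))}
encoded_refit = new_airlines.map(code_of_refit)

print('Reused mapping: ', encoded_reused.tolist())
print('Rebuilt mapping:', encoded_refit.tolist())
```

With scikit-learn's `LabelEncoder`, reusing the mapping corresponds to calling `transform` on the already fitted encoder instead of `fit_transform`. If both months contain the same set of airlines, the two approaches give the same codes.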

In [None]:
df_flights_feb = df_flights.loc[(df_flights.loc[:,'YEAR'] == 2015 ) & (df_flights.loc[:,'MONTH'] == 2 )]

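# We only need them as test sets, no split in train and test(val) needed
# .copy() gives an independent dataframe, so the AIRLINE conversion below does not act on a slice of df_flights_feb
X2 = df_flights_feb[feature_list].copy()
y2 = df_flights_feb.ARRIVAL_DELAY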

In [None]:
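# Converting "category" airline to integer values
# Note: fit_transform refits the encoder on the February data. The integer codes only match
# the ones from the earlier fit if February contains the same set of airlines.
X2.iloc[:,feature_list.index('AIRLINE')] = labelenc.fit_transform(X2.iloc[:,feature_list.index('AIRLINE')])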

In [None]:
# Filling the features and the target again
X2 = np.array(X2)
y2 = np.array(y2)

# Predict the new data based on the old model (forest_model)
flightdelay_pred_feb = forest_model.predict(X2)

#Calculate the absolute errors
errors_feb = abs(flightdelay_pred_feb - y2)

In [None]:
# Mean Absolute Error in comparison
print('Mean Absolute Error January: ', round(np.mean(errors),3), 'minutes.')
print('---------------------------------------------------------------')
print('Mean Absolute Error February: ', round(np.mean(errors_feb),3), 'minutes.')

The difference between the two datasets (January and February) is very small. The model even fits totally new data. What about the R-squared calculation?

In [None]:
print("r-squared score January: ",forest_model.score(val_X, val_y))
print("------------------- TEST --------------------------")
print("r-squared score February: ", forest_model.score(X2, y2))

On the one hand, the model seems to be good on new data; on the other hand, the model seems to have learned the data by heart. The data from February probably does not differ much from the data from January, and a comparison with a summer month with different circumstances would not be a fair one, because we did not train the model with data from the whole year.

I would have needed to use data from the whole year to train the model and afterward test it with months from different seasons. This could be a task for a new version of this notebook, but right now this "fits the needs". 
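
For such a later version, the error could be reported per month instead of as a single number. This is only a sketch with a made-up evaluation table. The column `PREDICTED` is a hypothetical name for the model output.

```python
import pandas as pd

# Hypothetical evaluation data (illustration only): month, true and predicted delay in minutes
df_eval = pd.DataFrame({
    'MONTH':         [1, 1, 2, 2, 7, 7],
    'ARRIVAL_DELAY': [10.0, -5.0, 3.0, 20.0, 45.0, 0.0],
    'PREDICTED':     [11.0, -4.0, 2.0, 22.0, 40.0, 6.0],
})

# Absolute error per flight, then averaged per month
df_eval['ABS_ERROR'] = (df_eval['PREDICTED'] - df_eval['ARRIVAL_DELAY']).abs()
mae_per_month = df_eval.groupby('MONTH')['ABS_ERROR'].mean()

print(mae_per_month)
```

A month whose error stands out from the others would point to seasonal conditions the model has not seen during training.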

As I already mentioned, both the mean absolute error and the R-squared score suggest that the model might not generalize that well, because they seem too accurate. The model relies heavily on the **DEPARTURE_DELAY** feature and makes its decisions based on it. For a flight that has been delayed for reasons other than the **DEPARTURE_DELAY**, the model would probably not give a prediction with that accuracy. 

I will test this in the following. 

## Model Check without DEPARTURE_DELAY Impact
For this test, I will search for a special flight that was not delayed at departure (negative **DEPARTURE_DELAY**) but arrived more than 60 minutes late (**ARRIVAL_DELAY** > 60).

In [None]:
# Searching for a flight that fits our needs
df_flights_feb[(df_flights_feb.loc[:,'DEPARTURE_DELAY'] < 0) & (df_flights_feb.loc[:,'ARRIVAL_DELAY'] > 60)].head(10)

The delayed flight with the index number *19777* seems to be a good one. It has the following properties:

In [None]:
# Look into the flight with indexnumber 19777
df_flights_feb.loc[19777]

We clearly see that the **DEPARTURE_DELAY** is not the reason for the delay this time; in fact, the airplane departed earlier than scheduled. So let's use this flight for the model check. The preparations follow in the next step:

In [None]:
# Setting up a new dataframe for February and converting the AIRLINE feature again
X3 = df_flights_feb.loc[:,feature_list]
X3.iloc[:,feature_list.index('AIRLINE')] = labelenc.fit_transform(X3.iloc[:,feature_list.index('AIRLINE')])

# Retrieving the flight with index 19777 (delayed flight without departure delay).
X3 = X3.loc[19777]
# Setting the target for our flight index 19777
y3 = df_flights_feb.loc[19777]['ARRIVAL_DELAY']

# Converting to array for the model use
X3 = np.array(X3)
y3 = np.array(y3)

### Flight Delay Prediction without DEPARTURE_DELAY
The next step is the prediction and the validation of the result. For this, I will use the already trained model and give it the information from the special flight above.

In [None]:
# Printing the important stuff
flight_pred_s = forest_model.predict([X3])
print("Predicted Delay of the Flight (Minutes): ", flight_pred_s)
print("-------------------------------------------------")
print("Original Delay of the Flight (Minutes):  ", y3)
print("_________________________________________________")
print("_________________________________________________")
print("Difference (Minutes)                   : ", flight_pred_s - y3)



### Conclusion
The gap between the predicted and the original delay is 8.49 minutes. Here we can see how the model's behavior changes when the main feature (the **DEPARTURE_DELAY**) has no impact. The gap is much higher than the mean absolute error of 0.857 minutes from the previous calculations. The conjecture about the risk of one highly rated feature has been confirmed. Nevertheless, this difference is in a range that is not bad at all. The model seems to predict the flight delay with good accuracy. 

Some kind of pruning for the **DEPARTURE_DELAY** would definitely improve the model further. I will keep that in mind for a later version of this notebook. Right now I'm happy with the result of the model and will leave it as it is. 
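
A single flight is a thin basis for a conclusion. As a possible next step, the same check could be run over all flights that left early but arrived more than 60 minutes late. The sketch below uses a made-up table, and the column `PREDICTED` is a hypothetical name for the model output.

```python
import pandas as pd

# Hypothetical flights (illustration only): delays and model predictions in minutes
df_check = pd.DataFrame({
    'DEPARTURE_DELAY': [-5, 30, -2, 0, 75, -8],
    'ARRIVAL_DELAY':   [70, 28, 95, -3, 80, 61],
    'PREDICTED':       [62, 29, 80, -2, 79, 55],
})
df_check['ABS_ERROR'] = (df_check['PREDICTED'] - df_check['ARRIVAL_DELAY']).abs()

# Flights that left early but still arrived more than 60 minutes late
mask = (df_check['DEPARTURE_DELAY'] < 0) & (df_check['ARRIVAL_DELAY'] > 60)

mae_special = df_check.loc[mask, 'ABS_ERROR'].mean()
mae_other = df_check.loc[~mask, 'ABS_ERROR'].mean()

print('MAE, early departure but late arrival:', round(mae_special, 3))
print('MAE, all other flights:               ', round(mae_other, 3))
```

Comparing the two averages on the real February data would show whether the 8.49 minutes are typical for this group of flights or an outlier.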

<br>

Thanks for reviewing. **If you like it, give it an upvote** and don't forget your **feedback** down below!<br>
I am looking forward to it.

Cheers,

Robin
