# 1. Introduction: Business Goal & Problem Definition


The goal of this project is to predict the number of passengers by month and year, so that airlines can allocate the resources they need to serve their clients adequately without wasting funds on unnecessary actions, which improves profitability. The available dataset covers 1949 to 1960. Please see the comments in the Conclusions section at the end.

# 2. Importing Basic Libraries

In [None]:
!pip install openpyxl
import io
import openpyxl
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# 3. Data Collection

In [None]:
air_passenger_ds = pd.read_csv("../input/air-passengers/AirPassengers.csv", sep=",")

air_passenger_ds

# 4. Data Preliminary Exploration

In [None]:
#Checking a dataset sample

pd.set_option("display.max_rows", 100)
pd.set_option("display.max_columns", 100)
pd.options.display.float_format="{:,.2f}".format
air_passenger_ds.sample(n=10, random_state=0)

In [None]:
#Checking dataset info by feature

air_passenger_ds.info(verbose=True, show_counts=True)  # show_counts replaces the deprecated null_counts

In [None]:
#Checking the existence of zeros in each column

(air_passenger_ds==0).sum(axis=0).to_excel("zeros_per_feature.xlsx")
(air_passenger_ds==0).sum(axis=0)

In [None]:
#Checking the existence of duplicated rows

air_passenger_ds.duplicated().sum()

In [None]:
#Checking basic statistical data by feature

air_passenger_ds.describe(include="all")

# 5. Data Cleaning

We'll perform the following:

1. Change the "Month" name to "Date" in order to have a more intuitive name for the column
2. Convert "Date" to the datetime datatype
3. Set the "Date" column as the index
4. Sort the dataset by "Date"

In [None]:
#1

air_passenger_ds.rename({"Month": "Date"}, axis=1, inplace=True)

#2

air_passenger_ds["Date"] = pd.to_datetime(air_passenger_ds["Date"])


#3

air_passenger_ds.set_index("Date", inplace=True)

#4

air_passenger_ds.sort_index(inplace=True)  # "Date" is now the index; sort in place so the result is kept


air_passenger_ds.to_excel("air_passenger_ds_clean.xlsx")

# 6. Data Exploration

# 6.1 Visualizing Data Over Time

In [None]:
fig = px.line(air_passenger_ds, y=["#Passengers"], height=500, width=1500)
fig.layout.showlegend = False
fig.update_layout(title="Air Passengers across the years (1949-1960)", xaxis_title="Date", yaxis_title="Passengers")

# 6.2 Checking Data Stationarity

For a series to be stationary, the three statistics below must not change over time:
* mean
* variance
* autocorrelation


To check this, we'll use two methods:
1. Moving Average
2. ADF (Augmented Dickey–Fuller) Test


* We'll conclude from sections 6.1 and 6.2 that the original dataset is nonstationary; in section 7 we'll make it stationary using several methods
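
To make the definition concrete, here is a minimal sketch on synthetic data (not the AirPassengers series). White noise should keep roughly the same mean and variance in both halves of the sample, while a random walk generally does not. The helper name `half_stats` and the seed are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
white_noise = rng.normal(size=2000)     # stationary by construction
random_walk = np.cumsum(white_noise)    # nonstationary: its variance grows with time

def half_stats(series):
    """Return (mean, variance) for the first and the second half of a series."""
    first, second = np.array_split(series, 2)
    return (first.mean(), first.var()), (second.mean(), second.var())

for name, series in [("white noise", white_noise), ("random walk", random_walk)]:
    (m1, v1), (m2, v2) = half_stats(series)
    print(f"{name}: mean {m1:.2f} -> {m2:.2f}, variance {v1:.2f} -> {v2:.2f}")
```

Comparing two halves is only a rough visual check; the rolling statistics and the ADF test used in this section are the more systematic versions of the same idea.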

# 6.2.1 Moving Average

In [None]:
mean = air_passenger_ds["#Passengers"].rolling(window=12).mean() #moving Mean along the 12 prior months
std = air_passenger_ds["#Passengers"].rolling(window=12).std() #moving Standard Deviation along the 12 prior months

import plotly.graph_objects as go

fig1 = px.line(air_passenger_ds, y=["#Passengers"])
fig1.update_traces(line=dict(color = "blue"), name="Original Data")

fig2 = px.line(mean)
fig2.update_traces(line=dict(color = "yellow"), name="Rolling Mean")

fig3 = px.line(std)
fig3.update_traces(line=dict(color = "red"), name="Rolling Standard Deviation")

fig4 = go.Figure(data=fig1.data + fig2.data + fig3.data)
fig4.update_layout(title="Data vs Mean vs Std", xaxis_title="Date", yaxis_title="Passengers", height=500, width=1500)

fig4.show()
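
As a small illustration of what `rolling(window=12)` computes, the sketch below applies it to 24 toy values. The first 11 positions have no complete 12-value window, so they come out as NaN, which is why the rolling curves in the plot start later than the original data.

```python
import pandas as pd

values = pd.Series(range(1, 25), dtype=float)   # 24 toy "monthly" values: 1..24
rolling_mean = values.rolling(window=12).mean()

print(rolling_mean.isna().sum())   # number of positions without a full window
print(rolling_mean.iloc[11])       # mean of the values 1..12
print(rolling_mean.iloc[23])       # mean of the values 13..24
```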

# 6.2.2 ADF (Augmented Dickey–Fuller) Test

In [None]:
# For the data to be considered stationary, the p-value should be < 0.05 and the Test Statistic should be lower (more negative) than the critical values

from statsmodels.tsa.stattools import adfuller

print("Results of Dickey-Fuller Test:")
dftest = adfuller(air_passenger_ds["#Passengers"], autolag="AIC")
dfoutput = pd.Series(dftest[0:4], index=["Test Statistic", "p-value", "#Lags Used", "Number of Observations Used"])
for key, value in dftest[4].items():
    dfoutput["Critical Value (%s)"%key] = value

dfoutput
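
The decision rule for reading the ADF output can be sketched as a small helper. The function name `adf_verdict` is hypothetical, and the numbers below are invented for illustration; they are not results from this notebook. It assumes the critical values arrive as a dict keyed by "1%", "5%" and "10%", as in `dftest[4]`.

```python
def adf_verdict(test_statistic, p_value, critical_values, alpha=0.05):
    """Return True when the unit-root null is rejected, i.e. the series can be
    treated as stationary: the p-value is below alpha and the test statistic
    is more negative than the critical value for that level."""
    level = f"{round(alpha * 100)}%"          # e.g. 0.05 -> "5%"
    below_critical = test_statistic < critical_values[level]
    return p_value < alpha and below_critical

# Made-up numbers, for illustration only
example_critical = {"1%": -3.48, "5%": -2.88, "10%": -2.58}
print(adf_verdict(-1.20, 0.65, example_critical))    # weak evidence: not stationary
print(adf_verdict(-4.10, 0.001, example_critical))   # strong evidence: stationary
```

With the real test it could be called as `adf_verdict(dftest[0], dftest[1], dftest[4])`.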

In [None]:
#Alternatively using Profile Report to see variables statistics and correlations

from pandas_profiling import ProfileReport
profile = ProfileReport(air_passenger_ds, title="Air Passenger")
profile.to_file(output_file="Air_Passenger.html")

# 7. Data Stationarity Transformation

# 7.1 Applying Log

In [None]:
air_passenger_ds_log = np.log(air_passenger_ds)
mean_log = air_passenger_ds_log.rolling(window=12).mean()
std_log = air_passenger_ds_log.rolling(window=12).std()

fig1 = px.line(air_passenger_ds_log, y=["#Passengers"])
fig1.update_traces(line=dict(color = "blue"), name="Original Data (Log)")

fig2 = px.line(mean_log)
fig2.update_traces(line=dict(color = "yellow"), name="Rolling Mean (Log)")

fig3 = px.line(std_log)
fig3.update_traces(line=dict(color = "red"), name="Rolling Standard Deviation (Log)")

fig4 = go.Figure(data=fig1.data + fig2.data + fig3.data)
fig4.update_layout(title="Logarithmic Data vs Mean vs Std", xaxis_title="Date", yaxis_title="Passengers (Log)", height=500, width=1500)

fig4.show()

In [None]:
# ADF (Augmented Dickey–Fuller) Test

print("Results of Dickey-Fuller Test:")
dftest = adfuller(air_passenger_ds_log["#Passengers"], autolag="AIC")
dfoutput = pd.Series(dftest[0:4], index=["Test Statistic", "p-value", "#Lags Used", "Number of Observations Used"])
for key, value in dftest[4].items():
    dfoutput["Critical Value (%s)"%key] = value

dfoutput
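
A short sketch of why the log helps: when a series grows by a roughly constant percentage, the raw increments keep getting larger, but the increments of the log stay constant. The toy series below (10% growth per step) is an assumption chosen for illustration.

```python
import numpy as np

level = 100 * 1.10 ** np.arange(6)    # grows 10% per step: 100, 110, 121, ...
print(np.diff(level))                  # raw increments keep growing
print(np.diff(np.log(level)))          # log increments are all equal to log(1.10)
```

The log alone does not remove the trend, so it is usually combined with one of the differencing steps in the next subsections.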

# 7.2 Applying Log Differencing Simple Moving Average

In [None]:
air_passenger_ds_log_dsma = air_passenger_ds_log - mean_log
air_passenger_ds_log_dsma.dropna(inplace=True)
mean_log_dsma = air_passenger_ds_log_dsma.rolling(window=12).mean()
std_log_dsma = air_passenger_ds_log_dsma.rolling(window=12).std()

fig1 = px.line(air_passenger_ds_log_dsma, y=["#Passengers"])
fig1.update_traces(line=dict(color = "blue"), name="Original Data (Log Differencing Simple Moving Average)")

fig2 = px.line(mean_log_dsma)
fig2.update_traces(line=dict(color = "yellow"), name="Rolling Mean (Log Differencing Simple Moving Average)")

fig3 = px.line(std_log_dsma)
fig3.update_traces(line=dict(color = "red"), name="Rolling Standard Deviation (Log Differencing Simple Moving Average)")

fig4 = go.Figure(data=fig1.data + fig2.data + fig3.data)
fig4.update_layout(title="Log Differencing Simple Moving Average Data vs Mean vs Std", xaxis_title="Date", yaxis_title="Passengers (Log Differencing Simple Moving Average)", height=500, width=1500)

fig4.show()

In [None]:
# ADF (Augmented Dickey–Fuller) Test

print("Results of Dickey-Fuller Test:")
dftest = adfuller(air_passenger_ds_log_dsma["#Passengers"], autolag="AIC")
dfoutput = pd.Series(dftest[0:4], index=["Test Statistic", "p-value", "#Lags Used", "Number of Observations Used"])
for key, value in dftest[4].items():
    dfoutput["Critical Value (%s)"%key] = value

dfoutput

# 7.3 Applying Log Exponential Moving Average

In [None]:
exponentialDecayWeightedAverage = air_passenger_ds_log.ewm(halflife=12, min_periods=0, adjust=True).mean()
air_passenger_ds_log_ema = air_passenger_ds_log - exponentialDecayWeightedAverage
air_passenger_ds_log_ema.dropna(inplace=True)
mean_log_ema = air_passenger_ds_log_ema.rolling(window=12).mean()
std_log_ema = air_passenger_ds_log_ema.rolling(window=12).std()

fig1 = px.line(air_passenger_ds_log_ema, y=["#Passengers"])
fig1.update_traces(line=dict(color = "blue"), name="Original Data (Log Exponential Moving Average)")

fig2 = px.line(mean_log_ema)
fig2.update_traces(line=dict(color = "yellow"), name="Rolling Mean (Log Exponential Moving Average)")

fig3 = px.line(std_log_ema)
fig3.update_traces(line=dict(color = "red"), name="Rolling Standard Deviation (Log Exponential Moving Average)")

fig4 = go.Figure(data=fig1.data + fig2.data + fig3.data)
fig4.update_layout(title="Log Exponential Moving Average Data vs Mean vs Std", xaxis_title="Date", yaxis_title="Passengers (Exponential Moving Average)", height=500, width=1500)

fig4.show()

In [None]:
# ADF (Augmented Dickey–Fuller) Test

print("Results of Dickey-Fuller Test:")
dftest = adfuller(air_passenger_ds_log_ema["#Passengers"], autolag="AIC")
dfoutput = pd.Series(dftest[0:4], index=["Test Statistic", "p-value", "#Lags Used", "Number of Observations Used"])
for key, value in dftest[4].items():
    dfoutput["Critical Value (%s)"%key] = value

dfoutput
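
To show what `ewm(halflife=12, adjust=True).mean()` computes, the sketch below rebuilds it by hand on toy values. With a half-life of 12, the weight of an observation halves every 12 periods, which corresponds to a smoothing factor `alpha = 1 - exp(-ln(2) / halflife)`. The helper `ewm_adjusted` is a hypothetical name used only here.

```python
import numpy as np
import pandas as pd

halflife = 12
alpha = 1 - np.exp(-np.log(2) / halflife)   # smoothing factor implied by the half-life

x = pd.Series([1.0, 3.0, 2.0, 5.0, 4.0])    # toy values, not the passenger data

def ewm_adjusted(values, alpha):
    """Weighted average with weights (1 - alpha)**i, where i = 0 is the newest point."""
    out = []
    for t in range(len(values)):
        weights = (1 - alpha) ** np.arange(t, -1, -1)   # oldest point gets the largest exponent
        out.append(np.dot(weights, values[: t + 1]) / weights.sum())
    return np.array(out)

manual = ewm_adjusted(x.to_numpy(), alpha)
pandas_result = x.ewm(halflife=halflife, min_periods=0, adjust=True).mean().to_numpy()
print(np.allclose(manual, pandas_result))
```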

# 7.4 Applying Log Differencing Previous Value

In [None]:
air_passenger_ds_log_dpv = air_passenger_ds_log - air_passenger_ds_log.shift()
air_passenger_ds_log_dpv.dropna(inplace=True)
mean_log_dpv = air_passenger_ds_log_dpv.rolling(window=12).mean()
std_log_dpv = air_passenger_ds_log_dpv.rolling(window=12).std()

fig1 = px.line(air_passenger_ds_log_dpv, y=["#Passengers"])
fig1.update_traces(line=dict(color = "blue"), name="Original Data (Log Differencing Previous Value)")

fig2 = px.line(mean_log_dpv)
fig2.update_traces(line=dict(color = "yellow"), name="Rolling Mean (Log Differencing Previous Value)")

fig3 = px.line(std_log_dpv)
fig3.update_traces(line=dict(color = "red"), name="Rolling Standard Deviation (Log Differencing Previous Value)")

fig4 = go.Figure(data=fig1.data + fig2.data + fig3.data)
fig4.update_layout(title="Log Differencing Previous Value Data vs Mean vs Std", xaxis_title="Date", yaxis_title="Passengers (Log Differencing Previous Value)", height=500, width=1500)

fig4.show()

In [None]:
# ADF (Augmented Dickey–Fuller) Test

print("Results of Dickey-Fuller Test:")
dftest = adfuller(air_passenger_ds_log_dpv["#Passengers"], autolag="AIC")
dfoutput = pd.Series(dftest[0:4], index=["Test Statistic", "p-value", "#Lags Used", "Number of Observations Used"])
for key, value in dftest[4].items():
    dfoutput["Critical Value (%s)"%key] = value

dfoutput
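
Differencing the log series is reversible, which matters later when predictions are converted back to the original scale. A minimal sketch on toy values: the first log value plus the running sum of the differences rebuilds the log series, and `np.exp` undoes the log.

```python
import numpy as np
import pandas as pd

passengers = pd.Series([112.0, 118.0, 132.0, 129.0, 121.0])  # toy values
log_series = np.log(passengers)
log_diff = (log_series - log_series.shift()).dropna()        # difference with the previous value

# Undo the differencing: first log value + running sum of the differences
rebuilt_log = pd.Series(log_series.iloc[0], index=log_series.index)
rebuilt_log = rebuilt_log.add(log_diff.cumsum(), fill_value=0)
rebuilt = np.exp(rebuilt_log)

print(np.allclose(rebuilt, passengers))
```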

# 7.5 Applying Log Seasonal Decomposition

In [None]:
from statsmodels.tsa.seasonal import seasonal_decompose

decomposition = seasonal_decompose(air_passenger_ds_log)

trend = decomposition.trend
seasonal = decomposition.seasonal
residual = decomposition.resid

plt.figure(figsize=(20, 7))
plt.subplot(411)
plt.plot(air_passenger_ds_log, label="Original")
plt.legend(loc="best")
plt.subplot(412)
plt.plot(trend, label="Trend")
plt.legend(loc="best")
plt.subplot(413)
plt.plot(seasonal, label="Seasonality")
plt.legend(loc="best")
plt.subplot(414)
plt.plot(residual, label="Residuals")
plt.legend(loc="best")
plt.tight_layout()

In [None]:
air_passenger_ds_log_sd = residual
air_passenger_ds_log_sd.dropna(inplace=True)
mean_log_sd = air_passenger_ds_log_sd.rolling(window=12).mean()
std_log_sd = air_passenger_ds_log_sd.rolling(window=12).std()

fig1 = px.line(air_passenger_ds_log_sd)
fig1.update_traces(line=dict(color = "blue"), name="Original Data (Log Seasonal Decomposition)")

fig2 = px.line(mean_log_sd)
fig2.update_traces(line=dict(color = "yellow"), name="Rolling Mean (Log Seasonal Decomposition)")

fig3 = px.line(std_log_sd)
fig3.update_traces(line=dict(color = "red"), name="Rolling Standard Deviation (Log Seasonal Decomposition)")

fig4 = go.Figure(data=fig1.data + fig2.data + fig3.data)
fig4.update_layout(title="Log Seasonal Decomposition Data vs Mean vs Std", xaxis_title="Date", yaxis_title="Passengers (Log Seasonal Decomposition)", height=500, width=1500)

fig4.show()

In [None]:
# ADF (Augmented Dickey–Fuller) Test

print("Results of Dickey-Fuller Test:")
dftest = adfuller(air_passenger_ds_log_sd, autolag="AIC")
dfoutput = pd.Series(dftest[0:4], index=["Test Statistic", "p-value", "#Lags Used", "Number of Observations Used"])
for key, value in dftest[4].items():
    dfoutput["Critical Value (%s)"%key] = value

dfoutput
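
The idea behind an additive decomposition can be sketched by hand. This is a simplified version on a synthetic series (linear trend plus a yearly sine pattern, no noise); `seasonal_decompose` follows the same general idea, but its implementation details may differ from this sketch.

```python
import numpy as np
import pandas as pd

period = 12
months = np.arange(48)
# Synthetic series: linear trend + yearly sine pattern (no noise)
series = pd.Series(0.05 * months + np.sin(2 * np.pi * months / period))

# Trend: centered moving average over 13 points, half weight on the two end points
weights = np.r_[0.5, np.ones(period - 1), 0.5] / period
trend = series.rolling(window=period + 1, center=True).apply(
    lambda w: np.dot(w, weights), raw=True
)

# Seasonal: average detrended value per position in the cycle, centered on zero
detrended = series - trend
seasonal_means = detrended.groupby(months % period).mean()
seasonal_means -= seasonal_means.mean()
seasonal = pd.Series(seasonal_means.to_numpy()[months % period], index=series.index)

# Residual: whatever is left after removing trend and seasonality
residual = series - trend - seasonal
print(residual.dropna().abs().max())
```

Because the synthetic series has no noise, the residual is essentially zero; on real data the residual is the part that gets tested for stationarity.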

# 8. Finding the Lags for the AR and MA Models

A nonseasonal ARIMA model is classified as an "ARIMA(p,d,q)" model, where:
* p is the number of autoregressive terms
* d is the number of nonseasonal differences needed for stationarity
* q is the number of lagged forecast errors in the prediction equation
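
As an illustration of what an autoregressive term is, the sketch below simulates an AR(1) process (p = 1, one autoregressive term) and recovers its coefficient with a lag-1 least-squares regression. The coefficient 0.6, the seed and the sample size are assumptions chosen for the demo.

```python
import numpy as np

rng = np.random.default_rng(42)
phi = 0.6                       # true autoregressive coefficient (chosen for the demo)
noise = rng.normal(size=5000)

x = np.zeros_like(noise)
for t in range(1, len(x)):
    x[t] = phi * x[t - 1] + noise[t]   # AR(1): each value depends on the previous one

# Least-squares estimate of phi from the regression of x[t] on x[t-1]
phi_hat = np.dot(x[:-1], x[1:]) / np.dot(x[:-1], x[:-1])
print(round(phi_hat, 2))
```

The d term corresponds to differencing the series before fitting, and the q term adds lagged forecast errors in the same way that p adds lagged values.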

In [None]:
#For example: a lag k autocorrelation is the correlation between values that are k time periods apart

import warnings
warnings.filterwarnings('ignore', 'statsmodels.tsa.arima_model.ARMA',
                        FutureWarning)
warnings.filterwarnings('ignore', 'statsmodels.tsa.arima_model.ARIMA',
                        FutureWarning)

from statsmodels.tsa.stattools import arma_order_select_ic

#Select here the Data Stationarity Transformation Method (dstm) to use:
#For this exercise we're choosing Log Differencing Previous Value
dstm = air_passenger_ds_log_dpv

#Lags output
print(arma_order_select_ic(dstm))
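
Order selection of this kind compares candidate (p, q) pairs by an information criterion, which rewards fit and penalizes the number of parameters. A minimal sketch of the two common criteria follows; the log-likelihoods and parameter counts are invented for illustration and are not outputs of this notebook.

```python
import math

def aic(log_likelihood, n_params):
    """Akaike information criterion: lower is better."""
    return 2 * n_params - 2 * log_likelihood

def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: penalizes extra parameters more as n_obs grows."""
    return n_params * math.log(n_obs) - 2 * log_likelihood

# Hypothetical fits on 143 observations: order -> (log-likelihood, parameter count)
candidates = {(1, 0): (120.0, 3), (2, 1): (124.0, 5)}

best_by_bic = min(candidates, key=lambda order: bic(*candidates[order], n_obs=143))
best_by_aic = min(candidates, key=lambda order: aic(*candidates[order]))
print(best_by_bic, best_by_aic)
```

In this made-up case the two criteria disagree: BIC prefers the smaller model because its penalty per parameter is heavier.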

# 9. Algorithm Implementation & Assessment

# 9.1 AR Model

In [None]:
#Creating an AR model and checking its Metrics

from statsmodels.tsa.arima_model import ARIMA
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

model_ar = ARIMA(air_passenger_ds_log, order = (2,1,0)).fit(disp=-1)
y_preds = model_ar.fittedvalues
rss = sum((y_preds-dstm["#Passengers"])**2)
score = r2_score(dstm, y_preds)
mse = mean_squared_error(dstm, y_preds)
print("Metrics: RSS:{0:,.3f}, R2:{1:,.3f}, MSE:{2:,.2f}, RMSE:{3:,.2f}".format(rss, score, mse, np.sqrt(mse)))

#Plotting
x_ax = range(len(dstm))
plt.scatter(x_ax, dstm, s=5, color="blue", label="Original")
plt.plot(x_ax, y_preds, lw=0.8, color="red", label="Predicted")
plt.title("RSS: %.4f"% sum((y_preds-dstm["#Passengers"])**2))
plt.legend()
plt.show()


#Converting predictions to original scale
predictions_ARIMA_diff = pd.Series(model_ar.fittedvalues, copy=True)
predictions_ARIMA_diff_cumsum = predictions_ARIMA_diff.cumsum()
predictions_ARIMA_log = pd.Series(air_passenger_ds_log["#Passengers"].iloc[0], index=air_passenger_ds_log.index)
predictions_ARIMA_log = predictions_ARIMA_log.add(predictions_ARIMA_diff_cumsum, fill_value=0)
predictions_ARIMA = np.exp(predictions_ARIMA_log)

#Plotting Original vs Predicted Data
fig1 = px.line(air_passenger_ds, y=["#Passengers"])  # plot only the observed series, even if the cell is rerun
fig1.update_traces(line=dict(color = "blue"), name="Original Data")

fig2 = px.line(predictions_ARIMA)
fig2.update_traces(line=dict(color = "purple"), name="Predicted Data")

fig3 = go.Figure(data=fig1.data + fig2.data)
fig3.update_layout(title="Original vs Predicted Data", xaxis_title="Date", yaxis_title="Passengers", height=500, width=1500).show()

#Plotting Future Predicted Data for five years
plt.rc("figure", figsize=(20,7))
pred_plot = model_ar.plot_predict(1,204)
plt.title("Future Predicted Data")
plt.show()

#Visualizing y_pred in the dataset
y_pred_all = predictions_ARIMA
air_passenger_ds["passengers_predicted"] = y_pred_all
air_passenger_ds.to_excel("model_ar.xlsx")
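
The metrics printed at the top of the cell above can be reproduced by hand. The sketch below uses a few toy values (not the model's output) to show how RSS, MSE, RMSE and R2 relate to each other.

```python
import numpy as np

y_true = np.array([0.05, -0.02, 0.11, 0.03])   # toy observed values
y_hat = np.array([0.04, 0.00, 0.08, 0.05])     # toy fitted values

residuals = y_true - y_hat
rss = np.sum(residuals ** 2)                    # residual sum of squares
mse = rss / len(y_true)                         # mean squared error
rmse = np.sqrt(mse)                             # root mean squared error
tss = np.sum((y_true - y_true.mean()) ** 2)     # total sum of squares
r2 = 1 - rss / tss                              # coefficient of determination

print(f"RSS:{rss:.4f}, R2:{r2:.3f}, MSE:{mse:.5f}, RMSE:{rmse:.4f}")
```

Note that RSS grows with the number of observations, so MSE or RMSE is easier to compare across series of different lengths.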

# 9.2 MA Model

In [None]:
#Creating a MA model and checking its Metrics

model_ma = ARIMA(air_passenger_ds_log, order = (0,1,2)).fit(disp=-1)
y_preds = model_ma.fittedvalues
rss = sum((y_preds-dstm["#Passengers"])**2)
score = r2_score(dstm, y_preds)
mse = mean_squared_error(dstm, y_preds)
print("Metrics: RSS:{0:,.3f}, R2:{1:,.3f}, MSE:{2:,.2f}, RMSE:{3:,.2f}".format(rss, score, mse, np.sqrt(mse)))

#Plotting
x_ax = range(len(dstm))
plt.scatter(x_ax, dstm, s=5, color="blue", label="Original")
plt.plot(x_ax, y_preds, lw=0.8, color="red", label="Predicted")
plt.title("RSS: %.4f"% sum((y_preds-dstm["#Passengers"])**2))
plt.legend()
plt.show()


#Converting predictions to original scale
predictions_ARIMA_diff = pd.Series(model_ma.fittedvalues, copy=True)
predictions_ARIMA_diff_cumsum = predictions_ARIMA_diff.cumsum()
predictions_ARIMA_log = pd.Series(air_passenger_ds_log["#Passengers"].iloc[0], index=air_passenger_ds_log.index)
predictions_ARIMA_log = predictions_ARIMA_log.add(predictions_ARIMA_diff_cumsum, fill_value=0)
predictions_ARIMA = np.exp(predictions_ARIMA_log)

#Plotting Original vs Predicted Data
fig1 = px.line(air_passenger_ds, y=["#Passengers"])  # exclude the "passengers_predicted" column added in 9.1
fig1.update_traces(line=dict(color = "blue"), name="Original Data")

fig2 = px.line(predictions_ARIMA)
fig2.update_traces(line=dict(color = "purple"), name="Predicted Data")

fig3 = go.Figure(data=fig1.data + fig2.data)
fig3.update_layout(title="Original vs Predicted Data", xaxis_title="Date", yaxis_title="Passengers", height=500, width=1500).show()

#Plotting Future Predicted Data for five years
plt.rc("figure", figsize=(20,7))
pred_plot = model_ma.plot_predict(1,204)
plt.title("Future Predicted Data")
plt.show()

#Visualizing y_pred in the dataset
y_pred_all = predictions_ARIMA
air_passenger_ds["passengers_predicted"] = y_pred_all
air_passenger_ds.to_excel("model_ma.xlsx")

# 9.3 ARIMA Model

In [None]:
#Creating an ARIMA model and checking its Metrics

model_arima = ARIMA(air_passenger_ds_log, order = (2,1,2)).fit(disp=-1)
y_preds = model_arima.fittedvalues
rss = sum((y_preds-dstm["#Passengers"])**2)
score = r2_score(dstm, y_preds)
mse = mean_squared_error(dstm, y_preds)
print("Metrics: RSS:{0:,.3f}, R2:{1:,.3f}, MSE:{2:,.2f}, RMSE:{3:,.2f}".format(rss, score, mse, np.sqrt(mse)))

#Plotting
x_ax = range(len(dstm))
plt.scatter(x_ax, dstm, s=5, color="blue", label="Original")
plt.plot(x_ax, y_preds, lw=0.8, color="red", label="Predicted")
plt.title("RSS: %.4f"% sum((y_preds-dstm["#Passengers"])**2))
plt.legend()
plt.show()


#Converting predictions to original scale
predictions_ARIMA_diff = pd.Series(model_arima.fittedvalues, copy=True)
predictions_ARIMA_diff_cumsum = predictions_ARIMA_diff.cumsum()
predictions_ARIMA_log = pd.Series(air_passenger_ds_log["#Passengers"].iloc[0], index=air_passenger_ds_log.index)
predictions_ARIMA_log = predictions_ARIMA_log.add(predictions_ARIMA_diff_cumsum, fill_value=0)
predictions_ARIMA = np.exp(predictions_ARIMA_log)

#Plotting Original vs Predicted Data
fig1 = px.line(air_passenger_ds, y=["#Passengers"])  # exclude the "passengers_predicted" column added earlier
fig1.update_traces(line=dict(color = "blue"), name="Original Data")

fig2 = px.line(predictions_ARIMA)
fig2.update_traces(line=dict(color = "purple"), name="Predicted Data")

fig3 = go.Figure(data=fig1.data + fig2.data)
fig3.update_layout(title="Original vs Predicted Data", xaxis_title="Date", yaxis_title="Passengers", height=500, width=1500).show()

#Plotting Future Predicted Data for five years
plt.rc("figure", figsize=(20,7))
pred_plot = model_arima.plot_predict(1,204)
plt.title("Future Predicted Data")
plt.show()

#Visualizing y_pred in the dataset
y_pred_all = predictions_ARIMA
air_passenger_ds["passengers_predicted"] = y_pred_all
air_passenger_ds.to_excel("model_arima.xlsx")

# 10. Model Deployment

In [None]:
pd.options.display.float_format="{:,.4f}".format

deploy_ds = pd.date_range(start="1/1/1961", end="12/1/1965", freq="MS")
deploy_ds = pd.DataFrame({"Date":deploy_ds})
deploy_ds["Passengers"] = np.exp(pd.DataFrame(model_arima.forecast(steps=60)[0]))
date_input = input("Enter the date you would like to estimate the passengers number - valid for next five years after dataset range, meaning from 1961 to 1965 (MM/YYYY): ")
print("{}".format(deploy_ds.loc[deploy_ds["Date"] == date_input]))
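
The date lookup could be made a little more robust by parsing the typed text explicitly and checking that it falls inside the forecast range. This is a sketch under the assumption that the user types MM/YYYY; `user_text` stands in for the value returned by `input`.

```python
import pandas as pd

forecast_dates = pd.date_range(start="1961-01-01", end="1965-12-01", freq="MS")
user_text = "03/1962"                                  # stand-in for the value typed by the user
requested = pd.to_datetime(user_text, format="%m/%Y")  # parses to the first day of that month

in_range = requested in forecast_dates                 # False for dates outside 1961-1965
print(len(forecast_dates), requested.date(), in_range)
```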

# 11. Conclusions


We were able to develop a model that predicts the number of passengers, helping the airline allocate the required resources in future months and years and maximize its profitability. We used an ARIMA model, which gave RSS = 1.0292. The project can be further improved, first by choosing a better Data Stationarity Transformation Method (one with a better p-value), and second by exploring a SARIMA model, since this is a seasonal dataset.