In [None]:
import pmdarima

In [None]:
import pandas as pd 
df = pd.read_csv("AirPassengers.csv")

In [None]:
df.head()

In [None]:
df.tail()

In [None]:
df['Month'] = pd.to_datetime(df['Month'], format='%Y-%m')
print(df.head())

Note that the conversion sets the day to the first of each month. That day is only a placeholder, since the data is monthly and contains no daily passenger counts.

In [None]:
# Use the Month column as the index (set_index also removes it from the columns)
df = df.set_index('Month')
print(df.head())

<h2>Visualization</h2>

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns 

sns.lineplot(data=df)  # pass the frame by keyword so older seaborn versions do not read it as x
plt.ylabel("Number of Passengers")

<h2>Stationarity</h2>

* Stationarity is a key concept in time series analysis.
* A stationary time series is one whose statistical properties, such as mean, variance, and autocorrelation, remain constant over time.
* You should check for stationarity because it not only makes modeling time series easier, but many time series methods also assume it. ARIMA, for example, differences the series until it is stationary before modeling it.


We will use the augmented Dickey-Fuller test to check for stationarity in our data. The test produces a test statistic, critical values, and a p-value, which let us decide whether to reject the null hypothesis that the series is non-stationary (has a unit root). If we reject the null hypothesis, we accept the alternative, which states that the series is stationary. If we cannot reject it, we treat the series as non-stationary.

The test statistic measures how strongly the change from one period to the next depends on the previous level of the series. In a stationary series, a high value tends to be followed by a fall back toward the mean. In a non-stationary series, the previous level carries no significant information about the next change.
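
To make the idea concrete, here is a minimal sketch of the regression behind the basic Dickey-Fuller test: the change in the series is regressed on its previous level. It is a simplification and not what `adfuller` computes, since it omits the lagged difference terms and the test's special critical values. The helper name `df_slope` and both toy series are made up for illustration.

```python
def df_slope(series):
    """OLS slope (with intercept) of the period-to-period change on the previous level."""
    prev = series[:-1]
    change = [b - a for a, b in zip(series[:-1], series[1:])]
    n = len(prev)
    mean_prev = sum(prev) / n
    mean_change = sum(change) / n
    cov = sum((p - mean_prev) * (c - mean_change) for p, c in zip(prev, change))
    var = sum((p - mean_prev) ** 2 for p in prev)
    return cov / var


# Steady upward drift: the level never pulls the series back
trending = [float(t) for t in range(1, 21)]
# Alternating series: every value snaps back across the mean
reverting = [1.0 if t % 2 == 0 else -1.0 for t in range(20)]

print("trending slope: ", df_slope(trending))
print("reverting slope:", df_slope(reverting))
```

A slope near zero means the previous level says nothing about the next change, which is the non-stationary case. A clearly negative slope means the series is pulled back toward its mean.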

In [None]:
# Let's test for stationarity in our airline passenger data.
# To start, let's calculate a seven-month rolling mean and rolling standard deviation.

rolling_mean = df.rolling(7).mean()
rolling_std = df.rolling(7).std()
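
For reference, a rolling mean can be sketched in plain Python. This only illustrates what a call such as `df.rolling(7).mean()` computes. The helper `rolling_mean_sketch` and the sample numbers are made up, and a window of 3 is used to keep the output short.

```python
def rolling_mean_sketch(values, window):
    """Mean of the last `window` values at each position."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            # Not enough history yet; pandas reports NaN at these positions
            out.append(None)
        else:
            chunk = values[i + 1 - window : i + 1]
            out.append(sum(chunk) / window)
    return out


print(rolling_mean_sketch([10, 20, 30, 40, 50], 3))
# [None, None, 20.0, 30.0, 40.0]
```

With a window of 7, the first six positions have no value for the same reason, which is why the rolling curves in the plot start later than the original series.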

In [None]:
plt.plot(df, color="blue",label="Original Passenger Data")
plt.plot(rolling_mean, color="red", label="Rolling Mean Passenger Number")
plt.plot(rolling_std, color="black", label = "Rolling Standard Deviation in Passenger Number")
plt.title("Passenger Time Series, Rolling Mean, Standard Deviation")
plt.legend(loc="best")



In [None]:
! pip install statsmodels

In [None]:
from statsmodels.tsa.stattools import adfuller

# adfuller expects a one-dimensional series, so pass the passenger column explicitly
adft = adfuller(df['#Passengers'], autolag="AIC")
output_df = pd.DataFrame({
    "Values": [adft[0], adft[1], adft[2], adft[3], adft[4]['1%'], adft[4]['5%'], adft[4]['10%']],
    "Metric": ["Test Statistics", "p-value", "No. of lags used", "Number of observations used",
               "critical value (1%)", "critical value (5%)", "critical value (10%)"],
})
print(output_df)

We can see that our data is not stationary: the p-value is greater than 5 percent and the test statistic is greater than the critical values, so we cannot reject the null hypothesis of non-stationarity. Inspecting the plot supports this conclusion, as there is a clear, increasing trend in the number of passengers.

<h2>Autocorrelation</h2>

* Autocorrelation measures how correlated the values of a time series are with its own past values.
* If our passenger data has strong autocorrelation, high passenger numbers this month suggest a strong likelihood that they will be high next month as well.
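
As a rough sketch of what a lag-based autocorrelation measures, the snippet below computes the Pearson correlation between a series and a copy of itself shifted by `lag` steps, which is also how pandas documents `Series.autocorr`. The helper names `pearson` and `autocorr_sketch` and the toy series are made up for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def autocorr_sketch(series, lag):
    """Correlate each value with the value `lag` steps earlier."""
    return pearson(series[lag:], series[:-lag])


# Toy series that repeats every 3 steps
seasonal = [1.0, 3.0, 2.0] * 4

print("lag 3:", autocorr_sketch(seasonal, 3))
print("lag 1:", autocorr_sketch(seasonal, 1))
```

Because the toy series repeats every three steps, the lag-3 correlation is essentially 1, while the lag-1 correlation is negative.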


In [None]:
autocorrelation_lag1 = df['#Passengers'].autocorr(lag=1)
print("One Month Lag: ", autocorrelation_lag1)

In [None]:
#Now, let’s try three, six and nine months:
autocorrelation_lag3 = df['#Passengers'].autocorr(lag=3)
print("Three Month Lag: ", autocorrelation_lag3)

autocorrelation_lag6 = df['#Passengers'].autocorr(lag=6)
print("Six Month Lag: ", autocorrelation_lag6)

autocorrelation_lag9 = df['#Passengers'].autocorr(lag=9)
print("Nine Month Lag: ", autocorrelation_lag9)

* "Three Month Lag: 0.837394765081794": This indicates a strong positive autocorrelation at a lag of three months. In other words, the passenger count in a given month is highly correlated with the count three months earlier.

* The other lags are read the same way: each value is the correlation between the series and a copy of itself shifted by that many months.

<h2>Decomposition</h2>

* Trend decomposition is another useful way to visualize the trends in time series data.
* It is a technique used to break down a time series into its individual components.
* The components are the trend component (T), the seasonal component (S), and the residual or error component (R).


<b>Trend Component (T)</b>
* It captures the underlying direction or tendency of the data over time, such as increasing, decreasing, or staying relatively constant.

<b>Seasonal Component (S)</b>
* The seasonal component represents the periodic, repetitive patterns or fluctuations in the time series that occur at fixed intervals.
  
<b>Residual Component (R) or Error Component</b>
* The residual component (also known as the error or remainder component) represents the random noise or irregular variations in the time series data.
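
In an additive model, each observation is assumed to be the sum of the three components, y = T + S + R. The sketch below builds a noise-free synthetic series that way and recovers the trend with a centered moving average over one full period, which averages the seasonal pattern away. It is written in the spirit of classical decomposition and is not a reimplementation of `seasonal_decompose`. The period of 3 and all values are made up to keep the example short.

```python
period = 3
n = 12

trend = [10.0 + 2.0 * t for t in range(n)]           # steady upward trend
pattern = [-1.0, 0.0, 1.0]                           # sums to zero over one period
seasonal = [pattern[t % period] for t in range(n)]
series = [tr + s for tr, s in zip(trend, seasonal)]  # additive model, no residual

# Centered moving average spanning exactly one period
half = period // 2
est_trend = [
    sum(series[t - half : t + half + 1]) / period
    for t in range(half, n - half)
]

# Subtracting the estimated trend leaves the seasonal pattern
est_seasonal = [series[t] - est_trend[t - half] for t in range(half, n - half)]

print(est_trend)
print(est_seasonal)
```

The first and last points are lost because the centered window needs neighbours on both sides, which is also why the trend panel of a decomposition plot is shorter than the observed series.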

To proceed, let’s import seasonal_decompose from the statsmodels package:

In [None]:
from statsmodels.tsa.seasonal import seasonal_decompose

# The data is monthly with a yearly cycle, so one seasonal period is 12 observations
decompose = seasonal_decompose(df['#Passengers'], model='additive', period=12)
decompose.plot()
plt.show()

From this plot, we can clearly see the increasing trend in number of passengers and the seasonality patterns in the rise and fall in values each year.

<h2>Forecasting</h2>

* Time series forecasting allows us to predict future values in a time series given current and past data.
* We will use the ARIMA (autoregressive integrated moving average) method to forecast the number of passengers. It expresses future values as a linear combination of past values and past forecast errors.
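
To illustrate the "linear combination of past values" idea, here is a minimal sketch of the autoregressive part only. A full ARIMA model also differences the series and includes moving-average error terms, and its coefficients are estimated from data. The function `ar_forecast`, the constant, and the coefficients below are hypothetical values chosen for readability, not fitted ones.

```python
def ar_forecast(history, const, coefs, steps):
    """Forecast `steps` values ahead; coefs[0] weights the most recent value."""
    values = list(history)
    forecasts = []
    for _ in range(steps):
        recent = values[::-1][: len(coefs)]  # most recent value first
        nxt = const + sum(c * v for c, v in zip(coefs, recent))
        forecasts.append(nxt)
        values.append(nxt)  # feed the forecast back in for the next step
    return forecasts


print(ar_forecast([100.0, 110.0, 120.0], const=5.0, coefs=[0.5, 0.5], steps=3))
# [120.0, 125.0, 127.5]
```

Each forecast beyond the first step is built partly from earlier forecasts, so errors can accumulate as the horizon grows.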

In [None]:
# Split chronologically: everything before August 1960 is training data, the rest is test data
split_date = pd.to_datetime("1960-08", format='%Y-%m')

# Keep one column per frame, renamed to 'train' / 'test'.
# Selecting with .loc and renaming avoids modifying a slice of df in place.
train = df.loc[df.index < split_date, ['#Passengers']].rename(columns={'#Passengers': 'train'})
test = df.loc[df.index >= split_date, ['#Passengers']].rename(columns={'#Passengers': 'test'})

In [None]:
test

In [None]:
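# Not used: train_test_split shuffles rows by default, which would mix future months
# into the training set. The data is split by date above instead.
# from sklearn.model_selection import train_test_split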


# X = df.drop(columns=['#Passengers'])  # Features excluding '#Passengers'
# y = df['#Passengers']  # Target variable '#Passengers'

# # Split the data into training and testing sets
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [None]:
sns.set()  # apply the seaborn theme before drawing so it affects this figure
plt.plot(train, color = "black")
plt.plot(test, color = "red")
plt.title("Train/Test split for Passenger Data")
plt.ylabel("Passenger Number")
plt.xlabel('Year-Month')
plt.show()

In [None]:
!pip install pmdarima

In [None]:
# import pmdarima

In [None]:
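import numpy as np
from pmdarima.arima import auto_arima

# auto_arima searches over model orders and returns an already fitted model,
# so no separate fit() call is needed
model = auto_arima(train)
forecast = model.predict(n_periods=len(test))
# Use the raw forecast values so they align with the test dates by position
forecast = pd.DataFrame({'Prediction': np.asarray(forecast)}, index=test.index)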

In [None]:
from math import sqrt
from sklearn.metrics import mean_squared_error
rms = sqrt(mean_squared_error(test,forecast))
print("RMSE: ", rms)
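
For reference, the root mean squared error can be written out by hand: square each forecast error, average the squares, and take the square root. The helper `rmse_sketch` and the numbers are made up for illustration.

```python
from math import sqrt


def rmse_sketch(actual, predicted):
    """Root mean squared error of two equal-length lists."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared_errors) / len(squared_errors))


# Errors are -1 and 7, so the mean squared error is (1 + 49) / 2 = 25
print(rmse_sketch([10.0, 20.0], [11.0, 13.0]))
# 5.0
```

Because the errors are squared before averaging, a single large miss raises the RMSE more than several small ones. The result is in the same unit as the data, here number of passengers.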

In [None]:
sns.set()  # apply the seaborn theme before drawing so it affects this figure
plt.plot(train, color = "black")
plt.plot(forecast, color = "red")
plt.title("Train/Prediction for Passenger Data")
plt.ylabel("Passenger Number")
plt.xlabel('Year-Month')
plt.show()