# EDA
Exploratory Data Analysis (EDA) is typically carried out before developing machine learning models. It's important to understand the data first, because these algorithms often make assumptions about the data.

**This notebook achieves the following:**
* Autocorrelation of time lags (Kevin)
* Determine stationarity
    * Dickey-Fuller test (Kevin)
    * Time series decomposition into trend, seasonal, and residual components, using the appropriate model (additive or multiplicative) (Kevin)
    * Heteroskedasticity
* Cointegration between exogenous variables and target (Natalie)
* Cross-correlation between target and exogenous variables (Muhammed)
* Constant time series (Kevin)
* ACF/PACF (Natalie)
* ANOVA (Muhammed)

## Imports:

In [None]:
import pandas as pd
from pathlib import Path
import re
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.stattools import acf,pacf
from statsmodels.graphics.tsaplots import plot_pacf,plot_acf
import matplotlib.pyplot as plt
import numpy as np
from IPython.display import Markdown

## Data

In [None]:
cwd=Path.cwd()
data= pd.read_csv(cwd.parent / 'Data' /'Train'/'train1990s.csv',parse_dates=[0],date_format='%m%Y',index_col=0)
display(data)

## Get untransformed variables:

FRED has applied transformations to the original data (such as a log transformation to reduce heteroskedasticity). However, we need to analyze the original data first. To do this, we drop the transformed variables (columns named fred_\<name of variable\>_\<name of transform\>) and work with the untransformed data.

In [None]:
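transformedCols=[]
for col in data.columns:
    # Transformed columns follow the pattern fred_<variable>_<transform>;
    # store the full column name so that drop() always finds it.
    if re.search(r'fred_.*_.*',col):
        transformedCols.append(col)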

unmodifiedDf= data.drop(transformedCols,axis=1)
display(unmodifiedDf)
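
To make the behaviour of the pattern `fred_.*_.*` concrete, here is a minimal sketch run against a few made-up column names. The names are hypothetical and chosen only to show which shapes the pattern flags; the real columns in `train1990s.csv` may differ.

```python
import re

pattern = r"fred_.*_.*"

# Hypothetical column names, for illustration only.
columns = ["target", "fred_GDP", "fred_GDP_log", "fred_CPI_ALL"]

# A name is flagged when "fred_" is followed by at least one more underscore.
flagged = [col for col in columns if re.search(pattern, col)]
print(flagged)  # ['fred_GDP_log', 'fred_CPI_ALL']
```

Note that the pattern cannot tell a transform suffix from an underscore inside a variable's own name, so it may be worth inspecting the list of dropped columns before relying on it.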

# EDA:

## Display the graphs:

In [None]:
def display_series(series:pd.Series, decompose:bool=False, title:str=None):
    '''
    This function displays the time series or the trend-seasonal-residual decomposition of a time series.

    Parameters:
    ------------
    series: The time series to be displayed
    
    decompose: If True, the trend-seasonal-residual decomposition will be displayed, otherwise only the original series (defaults to False).

    title: an optional title to be added to the graph.

    Returns:
    ---------
    Returns Nothing, but does display a graph.

    '''

    # Perform time series decomposition (additive by default, 12-month period):
    if decompose:
        decomp= seasonal_decompose(series,period=12)

        fig,ax= plt.subplots(4,sharex=True)
        
        ax[0].plot(series, c='black')
        ax[0].set_ylabel('Original')

        ax[1].plot(decomp.trend, c='green')
        ax[1].set_ylabel('Trend')

        ax[2].plot(decomp.seasonal, c='orange')
        ax[2].set_ylabel('Seasonality')

        ax[3].plot(decomp.resid,c='blue')
        ax[3].set_ylabel('Residual')


        if title is not None:
            fig.suptitle(title)

        
        fig.supxlabel('Time')
        fig.subplots_adjust(wspace=0, hspace=0.15)
        plt.xticks([]) # Hides the date tick labels (not legible)
        plt.show()
    
    # Display original time series (no decomposition)
    else:
        plt.plot(series, c='black')
        plt.ylabel('Level')
        plt.xlabel('Time')

        plt.xticks([]) # Hides the date tick labels (not legible)

        if title is not None:
            plt.title(title)

        plt.show()

def is_stationary(series: pd.Series,sig_level:float=0.05):
    '''
    This function checks if a time series is stationary, using the augmented Dickey-Fuller (ADF) test.

    Parameters:
    -----------
    series: the time series to check for stationarity.

    sig_level: the significance level to be used for the hypothesis test (as a decimal).

    Returns:
    --------
    Returns a bool, True if series is stationary, otherwise False.

    '''

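    # adfuller returns (statistic, p-value, ...). The null hypothesis is a unit root
    # (non-stationarity), so a p-value below sig_level indicates stationarity.
    return adfuller(series)[1]<sig_level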

def get_mean_std(series:pd.Series):
    '''
    Calculates the mean and standard deviation of a time series.
    NOTE: these metrics are only useful if the timeseries is stationary.

    Parameters:
    ------------
    series: The timeseries of interest.

    Returns:
    ---------
    A tuple, where the first element is the mean and the 2nd element is the standard deviation of the time series.
    '''

    return np.mean(series), np.std(series)

In [None]:
for col in unmodifiedDf.columns:
    
    stationary=is_stationary(unmodifiedDf[col])
    display_series(unmodifiedDf[col],decompose=(not stationary),title=col)
    mean,std=get_mean_std(unmodifiedDf[col])
    display(Markdown(f"**Stationary:** {stationary}   <br>**Mean:** {mean}<br>**Std:** {std}"))
    
