## 6.6 Sourcing & Analyzing Time Series Data

#### Contents:
1. Import libraries
2. Import data
3. Subsetting, wrangling, and cleaning time-series data
4. Time-series analysis: decomposition
5. Stationarizing the visitor time series

#### 1. Import libraries

In [417]:
import os
import warnings # Standard-library module that controls warning messages
from datetime import datetime

import quandl
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt
import statsmodels.api as sm # The .api module exposes the public interface of statsmodels, a library for statistical models
from statsmodels.graphics.tsaplots import plot_acf

# Suppress warnings (for instance, deprecation notices about a library or feature being retired).
# These matter more to developers than to analysts.
warnings.filterwarnings("ignore")

In [418]:
print(plt.style.available)

['Solarize_Light2', '_classic_test_patch', '_mpl-gallery', '_mpl-gallery-nogrid', 'bmh', 'classic', 'dark_background', 'fast', 'fivethirtyeight', 'ggplot', 'grayscale', 'seaborn-v0_8', 'seaborn-v0_8-bright', 'seaborn-v0_8-colorblind', 'seaborn-v0_8-dark', 'seaborn-v0_8-dark-palette', 'seaborn-v0_8-darkgrid', 'seaborn-v0_8-deep', 'seaborn-v0_8-muted', 'seaborn-v0_8-notebook', 'seaborn-v0_8-paper', 'seaborn-v0_8-pastel', 'seaborn-v0_8-poster', 'seaborn-v0_8-talk', 'seaborn-v0_8-ticks', 'seaborn-v0_8-white', 'seaborn-v0_8-whitegrid', 'tableau-colorblind10']


In [419]:
# Set plot style
plt.style.use('seaborn-v0_8-whitegrid') # options to try: 'seaborn-v0_8-whitegrid', 'fivethirtyeight', 'bmh', or 'seaborn-v0_8-deep'

#### 2. Import data

In [420]:
path = r'/Users/rose/Documents/My Tableau Repository/Advance Analytics & Dashboard Design'

In [421]:
# Load the dataset 
df_listings_unsupervised_ml = pd.read_pickle(os.path.join(path, '02 Data', 'Prepared Data', 'listings_unsupervised_ml.pkl'))

In [422]:
# Load the dataset 
df_listings_numeric_unsupervised_ml = pd.read_pickle(os.path.join(path, '02 Data','Prepared Data', 'listings_numeric_unsupervised_ml.pkl'))

In [423]:
# Load the Excel dataset
full_path = os.path.join(path, '02 Data', 'Original Data', 'TouristNationality.xlsx')

airbnb_timeseries = pd.read_excel(full_path, sheet_name='NoOfVisitors')

In [424]:
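# Configure the API key. It is read from an environment variable so the secret is not stored in the notebook;
# the variable name QUANDL_API_KEY is an assumption.
quandl.ApiConfig.api_key = os.environ.get('QUANDL_API_KEY')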

In [425]:
airbnb_timeseries.head()

Unnamed: 0,# of visitors,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022
0,Foreigners,36828475,48072626,59061977,68399860,63045754,65292492,70899043,3140049,1492657,3392469
1,Japanese,994518773,985671330,1021639827,1044526146,1084119193,1097994322,1138296900,640261324,659202013,823539008
2,Total,1031347248,1033743956,1080701804,1112926006,1147164947,1163286814,1209195943,643401373,660694670,826931477


In [426]:
type(airbnb_timeseries)

pandas.core.frame.DataFrame

In [427]:
airbnb_timeseries.shape

(3, 11)

In [428]:
airbnb_timeseries.columns

Index(['# of visitors', 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021,
       2022],
      dtype='object')

In [429]:
airbnb_timeseries.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 11 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   # of visitors  3 non-null      object
 1   2013           3 non-null      int64 
 2   2014           3 non-null      int64 
 3   2015           3 non-null      int64 
 4   2016           3 non-null      int64 
 5   2017           3 non-null      int64 
 6   2018           3 non-null      int64 
 7   2019           3 non-null      int64 
 8   2020           3 non-null      int64 
 9   2021           3 non-null      int64 
 10  2022           3 non-null      int64 
dtypes: int64(10), object(1)
memory usage: 396.0+ bytes
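
The `info()` output shows one column per year and no `Year` column, so the sheet is in wide format. The sketch below shows one way such a layout could be reshaped into a year-indexed series with `melt()`. The toy values and the names `wide`, `long_df`, and `total_by_year` are made up for illustration and are not part of the original notebook.

```python
import pandas as pd

# Toy frame mirroring the wide layout above (values are invented for illustration)
wide = pd.DataFrame({
    "# of visitors": ["Foreigners", "Japanese", "Total"],
    2013: [10, 90, 100],
    2014: [20, 95, 115],
    2015: [30, 100, 130],
})

# melt() turns the year columns into rows: one row per (group, year) pair
long_df = wide.melt(id_vars="# of visitors", var_name="Year", value_name="Number of visitors")

# Keep only the 'Total' rows and index them by year
total_by_year = (
    long_df[long_df["# of visitors"] == "Total"]
    .astype({"Year": int})
    .set_index("Year")[["Number of visitors"]]
    .sort_index()
)
print(total_by_year)
```

The result has a single `Number of visitors` column and an integer `Year` index, which is the shape the plotting and subsetting cells further down expect.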


In [430]:
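# The sheet is in wide format (one column per year), so there is no 'Year' column to index on yet.
# Reshape the 'Total' row into a long frame with the years as the index.
totals = airbnb_timeseries.set_index('# of visitors').loc[['Total']].T
airbnb_timeseries = totals.rename(columns={'Total': 'Number of visitors'}).rename_axis(index='Year', columns=None)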

In [None]:
airbnb_timeseries.head()

In [None]:
airbnb_timeseries.columns

In [None]:
plt.figure(figsize=(8,4), dpi=100)
plt.plot(airbnb_timeseries.index, airbnb_timeseries['Number of visitors'], marker='o', linestyle='-')
plt.title('Total number of visitors per year')
plt.xlabel('Year')
plt.ylabel('Number of visitors')
plt.grid(True)
plt.show()

#### 3. Subsetting, wrangling, and cleaning time-series data

In [None]:
# Reset the index so that the "Year" column can be used as a filter
airbnb_timeseries_2 = airbnb_timeseries.reset_index()

In [None]:
airbnb_timeseries_2.head()

In [None]:
# Assuming the 'Year' column holds integer years; .copy() gives an independent frame that is safe to modify later
data_sub = airbnb_timeseries_2.loc[(airbnb_timeseries_2['Year'] >= 2013) & (airbnb_timeseries_2['Year'] <= 2017)].copy()

In [None]:
data_sub.shape

In [None]:
data_sub.head()

In [None]:
# Convert 'Year' (e.g. 2013) to datetime; with format='%Y' each year maps to January 1 of that year
data_sub = data_sub.assign(Year=pd.to_datetime(data_sub['Year'].astype(str), format='%Y'))

# Set 'Year' as the index (reassigning instead of using inplace=True avoids modifying a slice)
data_sub = data_sub.set_index('Year')

# Verify the index to ensure correct years are present
print(data_sub.index)  # This should show the index (years) present in your DataFrame

# Now 'Year' is the index of the DataFrame
print(data_sub.head())

In [None]:
# Plot the new data set

plt.figure(figsize=(8,4), dpi=100)
plt.plot(data_sub)

In [None]:
# Check for missing values (you shouldn't have any)

data_sub.isnull().sum() 

In [None]:
# Check for duplicates

dups = data_sub.duplicated()
dups.sum()

#### 4. Time-series analysis: decomposition

In [None]:
data_sub.head()

In [None]:
# Sort the data by date in ascending order
data_sub_sorted = data_sub.sort_index()

In [None]:
# Plot the raw time series before decomposing it

plt.figure(figsize=(8, 4))
plt.plot(data_sub.index, data_sub['Number of visitors'], marker='o', linestyle='-')
plt.title('Raw Time Series')
plt.xlabel('Year')
plt.ylabel('Number of visitors')
plt.grid(True)
plt.show()


In [None]:
# Decompose the time series using an additive model

decomposition = sm.tsa.seasonal_decompose(data_sub_sorted['Number of visitors'], model='additive')

# Plot the decomposition
decomposition.plot()
plt.show()
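
With only five annual observations there is no within-year cycle for the decomposition to pick up, so the seasonal panel above is unlikely to be informative. To illustrate what an additive decomposition (observed = trend + seasonal + residual) does when seasonality is present, here is a hand-rolled sketch on synthetic monthly data. It follows the classical moving-average approach in spirit but is not the statsmodels implementation; the data and all variable names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series: linear trend plus a repeating 12-month pattern (made-up data)
period = 12
idx = pd.date_range("2015-01-01", periods=4 * period, freq="MS")
pattern = np.sin(2 * np.pi * np.arange(period) / period)  # averages to ~0 over one cycle
observed = pd.Series(0.5 * np.arange(len(idx)) + np.tile(pattern, 4), index=idx)

# 1) Trend: centered moving average spanning one full period.
#    For an even period the two end points get half weight so the window stays centered.
weights = np.r_[0.5, np.ones(period - 1), 0.5] / period
half = period // 2
trend_values = np.full(len(observed), np.nan)
trend_values[half:-half] = np.convolve(observed.to_numpy(), weights, mode="valid")
trend = pd.Series(trend_values, index=idx)

# 2) Seasonal: average the detrended values per calendar month, then center them on zero
detrended = observed - trend
month_means = detrended.groupby(detrended.index.month).mean()
month_means = month_means - month_means.mean()
seasonal = pd.Series(month_means.loc[idx.month].to_numpy(), index=idx)

# 3) Residual: whatever the trend and seasonal components do not explain
resid = observed - trend - seasonal

print(pd.DataFrame({"trend": trend, "seasonal": seasonal, "resid": resid}).iloc[6:12])
```

Because the synthetic series is built from exactly a linear trend and a periodic pattern, the residual is numerically zero wherever the trend is defined. On real data the residual is where irregular shocks would show up.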

In [None]:
from matplotlib import rcParams # Sets a default figure size for all charts drawn after this cell

rcParams['figure.figsize'] = 8, 5

In [None]:
# Plot the separate components

decomposition.plot()
plt.show()

In [None]:
# adfuller() from statsmodels runs the augmented Dickey-Fuller test, but it only returns a plain tuple of numbers.
# The helper function below labels each value so that the output is readable.

from statsmodels.tsa.stattools import adfuller # Import the adfuller() function

def dickey_fuller(timeseries): # Define the function
    # Perform the Dickey-Fuller test:
    print('Dickey-Fuller Stationarity test:')
    test = adfuller(timeseries, autolag='AIC')
    result = pd.Series(test[0:4], index=['Test Statistic', 'p-value', 'Number of Lags Used', 'Number of Observations Used'])
    for key, value in test[4].items():
        result[f'Critical Value ({key})'] = value
    print(result)

# Apply the test using the function on the time series
dickey_fuller(data_sub_sorted['Number of visitors'])

The test statistic is larger than the critical values, so we cannot reject the null hypothesis that the series has a unit root. The series is therefore treated as non-stationary.
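
The decision rule behind this reading can be sketched in a few lines. The Dickey-Fuller test is left-tailed, so the null hypothesis is rejected only when the test statistic is more negative than the critical value. The helper name `adf_decision` and the numbers below are hypothetical and are not results from this notebook.

```python
def adf_decision(test_statistic, critical_values, level="5%"):
    """Return True if the unit-root null hypothesis is rejected at the given level."""
    # Left-tailed test: reject only when the statistic lies below the critical value
    return test_statistic < critical_values[level]

# Hypothetical values for illustration only
crit = {"1%": -3.43, "5%": -2.86, "10%": -2.57}

print(adf_decision(-1.20, crit))  # statistic above the critical value: cannot reject
print(adf_decision(-4.10, crit))  # statistic below the critical value: reject
```

In practice the dictionary of critical values would come from the fifth element of the tuple returned by `adfuller()`, as in the `dickey_fuller` helper above.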

In [None]:
# Convert the column to a pandas Series if it's not already
ts_series = data_sub_sorted['Number of visitors']

# Plot the autocorrelation function
plot_acf(ts_series)
plt.show()

The autocorrelation plot above shows one bar extending beyond the confidence band, which points to significant autocorrelation. This is consistent with the non-stationarity suggested by the Dickey-Fuller test.
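
To make the bars in the plot less of a black box, the sketch below computes sample autocorrelations by hand with NumPy and compares them with the simple large-sample band of ±1.96/√n. This is an illustration under stated assumptions: `sample_acf` is a hypothetical helper, the input values are made up, and the band drawn by `plot_acf` may be computed differently (for example with Bartlett's formula), so the numbers need not match the plot exactly. Note that the lag-0 autocorrelation is always 1 by construction, so a bar at lag 0 is not evidence of anything.

```python
import numpy as np

def sample_acf(x, nlags):
    """Sample autocorrelation r_k for k = 0..nlags."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    denom = np.sum(centered ** 2)
    # r_k = sum of products of values k steps apart, divided by the total sum of squares
    return np.array([
        np.sum(centered[k:] * centered[:len(x) - k]) / denom
        for k in range(nlags + 1)
    ])

values = [1.0, 2.0, 3.0, 4.0, 5.0]    # made-up, steadily increasing series
acf = sample_acf(values, nlags=2)
band = 1.96 / np.sqrt(len(values))    # approximate 95% band for white noise

for lag, r in enumerate(acf):
    flag = "outside" if abs(r) > band else "inside"
    print(f"lag {lag}: r = {r:.2f} ({flag} the ±{band:.2f} band)")
```

With so few observations the band is very wide, which is one reason autocorrelation plots on five data points should be read with caution.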

#### 5. Stationarizing the visitor time series

In [None]:
data_diff = data_sub_sorted - data_sub_sorted.shift(1) # shift(1) moves each observation down one row, so this computes value(t) - value(t-1)

In [None]:
data_diff.dropna(inplace = True) # Here, you remove the missing values that came about as a result of the differencing. 
# You need to remove these or you won't be able to run the Dickey-Fuller test.
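
As a small worked example of first-order differencing, the sketch below applies the same `shift(1)` subtraction to a made-up series and checks that it matches the built-in `Series.diff()`. The values and variable names are assumptions for illustration only.

```python
import pandas as pd

# Made-up annual series with an upward trend
s = pd.Series(
    [100, 103, 109, 118, 130],
    index=pd.to_datetime(["2013", "2014", "2015", "2016", "2017"], format="%Y"),
)

# Manual differencing: value(t) - value(t-1); the first row has no predecessor and becomes NaN
manual = (s - s.shift(1)).dropna()

# Built-in equivalent
builtin = s.diff().dropna()

print(manual)
print(manual.equals(builtin))
```

Differencing shortens the series by one observation, which is why the `dropna()` step is needed before the Dickey-Fuller test is rerun.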

In [None]:
data_diff.head()

In [None]:
data_diff.columns

In [None]:
# Check out what the differencing did to the time-series curve

plt.figure(figsize=(8,5), dpi=100)
plt.plot(data_diff)

In [None]:
ts_series = data_diff['Number of visitors']

# Perform the Dickey-Fuller test on the differenced series and print the labeled results
dickey_fuller(ts_series)

The second Dickey-Fuller test, run on the differenced series, now indicates that the time series is stationary. The test statistic is smaller (more negative) than all the critical values, and the low p-value provides strong evidence to reject the null hypothesis of a unit root.

In [None]:
plot_acf(ts_series)
plt.show()