<h1>MLTS Exercise 01 - Data Exploration</h1>

### Task:

This notebook provides three time series datasets, each containing the daily user counts of a different application during 2020:
* `App_1_Users_2020.csv`
* `App_2_Users_2020.csv`
* `App_3_Users_2020.csv`

For each dataset, complete the following tasks:

* Load the dataset.
* Visualize the data to observe user trends over time.
* Examine the data for any notable characteristics.
* Split the data into training and test sets, with the test set comprising the final three months of 2020.

Exercises 02 and 03 will use these datasets for Bayesian linear regression.

In [None]:
# import packages
import os
import pandas as pd
import matplotlib.pyplot as plt

Load the dataset

In [None]:
# paths to data
base_path = '../00_Datasets/01-03_App_Datasets'
path = 'App_1_Users_2020.csv'
# path = 'App_2_Users_2020.csv'
# path = 'App_3_Users_2020.csv'

path = os.path.join(base_path, path)

In [None]:
# Read data from file
data = pd.read_csv(path, sep=';', index_col='date')
data.index = pd.to_datetime(data.index)
data.head()
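
Before plotting, it can help to confirm that the loaded frame really is a gap-free daily series. The following is a minimal sketch, not part of the original exercise: `check_daily_index` is a hypothetical helper, and `demo` is a small synthetic stand-in for `data` with one day deliberately left out.

```python
import pandas as pd


def check_daily_index(df: pd.DataFrame) -> dict:
    """Summarize basic integrity checks for a frame indexed by daily dates.

    Hypothetical helper for illustration; it assumes a DatetimeIndex.
    """
    idx = df.index
    # Every calendar day between the first and last observation.
    expected = pd.date_range(idx.min(), idx.max(), freq="D")
    return {
        "n_rows": len(df),
        "is_sorted": bool(idx.is_monotonic_increasing),
        "n_duplicate_dates": int(idx.duplicated().sum()),
        "n_missing_dates": len(expected.difference(idx)),
        "n_missing_values": int(df.isna().sum().sum()),
    }


# Synthetic stand-in for `data` (NOT the real app data):
# one week of daily counts with 2020-01-04 removed.
demo = pd.DataFrame(
    {"users": [100, 110, 120, 130, 140, 150]},
    index=pd.to_datetime(
        ["2020-01-01", "2020-01-02", "2020-01-03",
         "2020-01-05", "2020-01-06", "2020-01-07"]
    ),
)
demo.index.name = "date"

report = check_daily_index(demo)
print(report)
```

In the notebook, the same call would be `check_daily_index(data)`. A non-zero `n_missing_dates` would indicate dropped rows rather than recorded zeros.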

Visualize the data to observe user trends over time

In [None]:
# Plot data
fig, ax = plt.subplots()
data["users"].plot(ax=ax)
ax.set_xlabel('Date')
ax.set_ylabel('Users')
ax.grid()
ax.set_title('Users over Time')

plt.show()

Examine the data for any notable characteristics

* Dataset 1: The number of users grows roughly linearly over time. From early April onward, the daily fluctuations become noticeably larger.
* Dataset 2: The data follows a roughly quadratic trend: user numbers rise steadily from January, peak around July and August, and decline gradually from September to December.
* Dataset 3: The trend resembles a cubic function: user numbers rise, level off around 14,000, and then increase slightly again in November and December. A sharp drop to zero in March suggests an outage or missing data.
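
Two of the observations above (the change in daily fluctuations and the drop to zero) can also be checked numerically instead of by eye. The sketch below is one possible approach, not the method used in the exercise: `rolling_volatility` and `zero_days` are hypothetical helpers, the 14-day window is an arbitrary choice, and `series` is synthetic data built to mimic the described behaviour.

```python
import numpy as np
import pandas as pd


def rolling_volatility(series: pd.Series, window: int = 14) -> pd.Series:
    """Rolling standard deviation of the day-to-day changes.

    Differencing removes a smooth trend, so what remains mostly
    reflects the size of the daily fluctuations.
    """
    return series.diff().rolling(window).std()


def zero_days(series: pd.Series) -> pd.DatetimeIndex:
    """Dates on which the recorded value is exactly zero."""
    return series.index[series == 0]


# Synthetic stand-in (NOT the real app data): linear growth whose
# noise level increases on 2020-04-01, plus a three-day drop to zero.
dates = pd.date_range("2020-01-01", "2020-12-31", freq="D")
rng = np.random.default_rng(0)
noise_scale = np.where(dates < pd.Timestamp("2020-04-01"), 5.0, 50.0)
values = 1000 + 10 * np.arange(len(dates)) + rng.normal(0, 1, len(dates)) * noise_scale
series = pd.Series(values, index=dates, name="users")
series.loc["2020-03-10":"2020-03-12"] = 0

vol = rolling_volatility(series)
zeros = zero_days(series)

print("mean volatility in February:", round(vol.loc["2020-02"].mean(), 1))
print("mean volatility in June:", round(vol.loc["2020-06"].mean(), 1))
print("zero days:", list(zeros.strftime("%Y-%m-%d")))
```

Applied to the real data, this would be `rolling_volatility(data["users"])` and `zero_days(data["users"])`. Note that a drop to zero also inflates the rolling volatility around it, so the two checks are best read together.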

Split the data into training and test sets, with the test set comprising the final three months of 2020.

In [None]:
# Split into train and test set
split_time = '2020-10-01'
train = data[data.index < split_time]
test = data[data.index >= split_time]

In [None]:
# Relative size of test set
round((len(test) / len(data)) * 100, 2)
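
As a reference value for the result above: 2020 is a leap year with 366 days, and October to December contain 31 + 30 + 31 = 92 days, so a complete daily series should give a test share of 92 / 366 ≈ 25.14 %. The sketch below reproduces the split on a synthetic, gap-free stand-in (`demo` is an assumption, not the real data) and adds two sanity checks on the split itself.

```python
import pandas as pd

# Synthetic stand-in for `data` (NOT the real app data):
# one value per calendar day of 2020.
dates = pd.date_range("2020-01-01", "2020-12-31", freq="D")
demo = pd.DataFrame({"users": range(len(dates))}, index=dates)

split_time = "2020-10-01"
train = demo[demo.index < split_time]
test = demo[demo.index >= split_time]

# The two parts should cover the data exactly once and not overlap in time.
assert len(train) + len(test) == len(demo)
assert train.index.max() < test.index.min()

test_share = round(len(test) / len(demo) * 100, 2)
print(len(train), len(test), test_share)  # 274 92 25.14
```

If the share computed on a real dataset differs from 25.14, the file probably has missing or duplicated dates.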

In [None]:
# Plot train and test set
fig, ax = plt.subplots()
train["users"].plot(ax=ax, label='train')
test["users"].plot(ax=ax, label='test')
ax.axvline(split_time, c="k", ls=":")
ax.set_xlabel('Date')
ax.set_ylabel('Users')
ax.grid()
ax.set_title('Users over Time')

plt.legend()
plt.show()