# Data Links

Useful data sources:

- UCI Machine Learning repository : https://archive.ics.uci.edu/ml/index.php
- UEA & UCR Time Series Classification Repository : https://timeseriesclassification.com/dataset.php

# Useful tips

- Downsampling and upsampling can be useful to adapt data to the sampling frequency you need
- Use interpolation techniques carefully
- Smoothing can be useful to prepare data for prediction (e.g. the pandas `ewm` method performs exponentially weighted smoothing: recent data have greater weight)
- Be careful to know what the timestamps mean (upload time or data acquisition time, ...) so as not to introduce lookahead into the data
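
The tips above can be sketched with pandas on a small synthetic series. This is a minimal illustration, not a recipe: the data, the bin sizes and the `span` value are all assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

# Synthetic daily series 0, 1, ..., 13 (illustrative data only).
daily = pd.Series(
    np.arange(14, dtype=float),
    index=pd.date_range("2021-01-01", periods=14, freq="D"),
)

# Downsampling: aggregate into 7-day bins.
weekly_mean = daily.resample("7D").mean()

# Upsampling: a series observed every other day, put back on a daily grid.
# The new rows have no observation, so they are NaN.
sparse = daily.iloc[::2]
upsampled = sparse.resample("D").asfreq()

# Interpolation fills a gap using the points on BOTH sides of it,
# so each filled value depends on a future observation (lookahead).
interpolated = upsampled.interpolate(method="time")

# Forward fill only uses past observations, so it adds no lookahead.
forward_filled = upsampled.ffill()

# Exponentially weighted smoothing: recent values get greater weight.
smoothed = daily.ewm(span=5, adjust=False).mean()
```

Whether interpolation or forward fill is appropriate depends on how the data will be used: for forecasting, a value that could not have been known at its timestamp should not appear in the features.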

# Exploratory analysis

- Use histograms and plots
- Statistical summary tables
- Correlation tables
- Plot the differenced series (`diff`) to remove trends and time correlations
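
As a rough sketch of the last three points, the example below builds two synthetic columns that share a trend but have independent noise (the data and the column names `a` and `b` are made up for illustration). On the raw levels the shared trend dominates the correlation; after differencing it mostly disappears.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
trend = np.linspace(0.0, 20.0, n)

# Two hypothetical series: same trend, independent noise.
df_demo = pd.DataFrame(
    {"a": trend + rng.normal(size=n), "b": trend + rng.normal(size=n)},
    index=pd.date_range("2020-01-01", periods=n, freq="D"),
)

# Statistical summary table (count, mean, std, quantiles, ...).
summary = df_demo.describe()

# Correlation table on the raw levels: driven by the common trend.
corr_levels = df_demo.corr()

# Correlation table after differencing: the trend is removed,
# so what remains reflects the (independent) noise.
corr_diff = df_demo.diff().dropna().corr()

print(summary)
print(corr_levels)
print(corr_diff)
```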

In [None]:
import warnings

import matplotlib.pyplot as plt
import pandas as pd
import plotly.io as pio
import statsmodels.api as sm
from statsmodels.graphics.tsaplots import plot_pacf
from statsmodels.tsa.stattools import adfuller

from ds_toolbox.graphs import plot_evolution, plot_hist

warnings.filterwarnings('ignore')

%pylab inline
# Dark theme for both matplotlib and plotly figures
plt.style.use('dark_background')
pd.options.plotting.backend = "plotly"
pio.templates.default = 'plotly_dark'

# Data: daily total female births in California, 1959
df = pd.read_csv('data_general/daily-total-female-births-in-cal.csv', sep=',').iloc[0:365]
df.columns = ['date', 'female_births']
df.index = pd.to_datetime(df['date'])
df = plot_hist(df=df, keys=['female_births'])

## Stationarity

A time series is stationary if its distribution does not change over time: for any lag, the shifted values follow the same distribution as the original ones.

Data can be transformed to make it stationary (by differencing, taking the logarithm, squaring), but it is important to keep in mind what such a transformation does to the information carried by the data.
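
A small sketch of why the meaning of the transformed data matters, using a synthetic series with a constant 3% growth per step (the numbers are assumptions for illustration). The log-differenced series is flat, but it now describes a growth rate rather than a level, and the first value is needed to get back to the original scale.

```python
import numpy as np
import pandas as pd

# Synthetic exponentially growing series (illustrative data only).
t = np.arange(100)
series = pd.Series(
    100.0 * np.exp(0.03 * t),
    index=pd.date_range("2020-01-01", periods=100, freq="D"),
)

# The logarithm turns exponential growth into a linear trend.
log_series = np.log(series)

# Differencing the log removes that trend: what is left is the
# per-step growth rate, constant here at 0.03.
log_diff = log_series.diff().dropna()

# Going back to levels requires undoing both steps, and the first
# log value that differencing discarded.
reconstructed = np.exp(log_series.iloc[0] + log_diff.cumsum())
```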

In [None]:
# Expanding (cumulative) mean and standard deviation of the births series
df['moving_mean'] = df['female_births'].expanding(min_periods=2).mean()
df['moving_std'] = df['female_births'].expanding(min_periods=2).std()
# Exponential moving average: recent observations weigh more
df['EMA'] = df['female_births'].ewm(span=40, adjust=False).mean()
df = plot_evolution(df=df, keys=['female_births', 'moving_mean', 'moving_std', 'EMA'], title='Female Births in California 1959')

In [None]:
# Augmented Dickey-Fuller test: a small p-value rejects the unit-root (non-stationary) null hypothesis
print('Dickey-Fuller test: p =', sm.tsa.stattools.adfuller(df['female_births'])[1])

## Auto-Correlation

In [None]:
# Row i holds the autocorrelation at lag i; large lags rely on few pairs and are noisy
df['auto_corr'] = [df['female_births'].autocorr(lag=i) for i in range(len(df))]
df = plot_evolution(df=df, keys=['auto_corr'], title='Auto correlation functions')
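
As a sketch of what `autocorr` computes, the lag-k autocorrelation is the Pearson correlation between the series and a copy of itself shifted by k steps. The example uses a synthetic sine wave with a period of 20 steps; `autocorr_by_hand` is a hypothetical helper written only for this illustration.

```python
import numpy as np
import pandas as pd

# Synthetic sine wave with a period of 20 steps (illustrative data only).
t = np.arange(200)
wave = pd.Series(np.sin(2 * np.pi * t / 20))


def autocorr_by_hand(s: pd.Series, lag: int) -> float:
    """Correlation between the series and itself shifted by `lag` steps."""
    return s.corr(s.shift(lag))


# One full period apart the values coincide, half a period apart
# they are opposite.
acf = {lag: wave.autocorr(lag=lag) for lag in (10, 20)}
by_hand = {lag: autocorr_by_hand(wave, lag) for lag in (10, 20)}

print(acf)
print(by_hand)
```

A peak in the autocorrelation at a given lag is therefore a hint of a seasonal pattern with that period.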

In [None]:
# Create the axes explicitly so the figure size applies to the PACF plot
fig, ax = plt.subplots(figsize=(30, 10))
pacf_fig = plot_pacf(df['female_births'], lags=50, ax=ax)

## Residual Seasonal Trend Decomposition

In [None]:
# The decomposition draws its own figure, so resize that figure afterwards
fig = sm.tsa.seasonal_decompose(df['female_births']).plot()
fig.set_size_inches(20, 10)
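
To make the three components less of a black box, here is a simplified sketch of a classical additive decomposition written with pandas only. It follows the same moving-average idea but is not the statsmodels implementation; the synthetic data, the weekly `period` and all variable names are assumptions for the example.

```python
import numpy as np
import pandas as pd

period = 7
n = 10 * period
idx = pd.date_range("2021-01-04", periods=n, freq="D")

# Synthetic series: linear trend plus a weekly pattern that sums to zero.
pattern = np.array([3.0, 1.0, 0.0, -1.0, -2.0, -1.0, 0.0])
trend_true = 0.1 * np.arange(n)
y = pd.Series(trend_true + np.tile(pattern, n // period), index=idx)

# Trend: centered moving average over one full period
# (undefined, hence NaN, for the first and last 3 points).
trend = y.rolling(window=period, center=True).mean()

# Seasonal: average detrended value at each position in the cycle,
# centered so the seasonal component sums to zero over a period.
detrended = y - trend
position = np.arange(n) % period
seasonal_means = detrended.groupby(position).mean()
seasonal_means = seasonal_means - seasonal_means.mean()
seasonal = pd.Series(seasonal_means.to_numpy()[position], index=idx)

# Residual: what neither the trend nor the seasonal part explains.
resid = y - trend - seasonal
```

On this noise-free example the residual is zero wherever the trend is defined; on real data it is the part worth inspecting for outliers or leftover structure.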