![](CRISP_DM.png)

In [None]:
import os
import pandas as pd
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
# for plotting
%matplotlib inline
import matplotlib as mpl
import plotly.graph_objects as go
import matplotlib.pyplot as plt
mpl.rcParams['figure.figsize'] = (16, 10)
# dataframe option
pd.set_option('display.max_rows', 200)

In [None]:
# Same settings as in the earlier notebooks: load the processed table, create country_list
# and show the first rows sorted by ascending date
df_analyse=pd.read_csv('../data/processed/COVID_small_sync_timeline_table.csv',sep=';')
country_list=df_analyse.columns[1:] # create country list
df_analyse.sort_values('date',ascending=True).head()

## 8.1 Helper functions
* Create a function to plot different values quickly and in a similar format for the evaluation of a time series dataset.
* In Python, a function is a block of code that runs only when it is called. You can pass data, known as parameters, into a function. A function can return data, draw a plot, or perform any other action specified in its body.

In [None]:
# same as in the last notebook, to ease plotting of different time series datasets
def quick_plotting(x_in, df_input,y_scale='log',slider=False):
    """ Quick basic plot for quick static evaluation of a time series
    
        you can push selective columns of your data frame by .iloc[:,[0,6,7,8]]
        
        Parameters:
        ----------
        x_in : array 
            array of date time object, or array of numbers
        df_input : pandas dataframe 
            the plotting matrix where each column is plotted
            the name of the column will be used for the legend
        y_scale: str
            y-axis scale as 'log' or 'linear'
        slider: bool
            True or False for x-axis slider
    
        
        Returns:
        ----------
        
    """
    fig = go.Figure()

    for each in df_input.columns:
        fig.add_trace(go.Scatter(x=x_in, y=df_input[each], name=each, opacity=1.0))
    
    fig.update_layout(autosize=True, width=800,height=800,xaxis_title = 'Timeline in Days',
                      yaxis_title = 'Confirmed infected people (Source:Johns-hopkins csse)',
                      font=dict(family="PT Sans, monospace",size=14,color="#860303"))
    fig.update_yaxes(type=y_scale)
    fig.update_xaxes(tickangle=-45,nticks=20,tickfont=dict(size=12,color="#860303"))
    if slider:
        fig.update_layout(xaxis_rangeslider_visible=True)
    fig.show()

In [None]:
# define a function to calculate the mean absolute percentage error (MAPE), in percent
def MAPE(y_true, y_pred): 
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

In [None]:
quick_plotting(df_analyse.date, df_analyse.iloc[:,3:-1], y_scale='log',slider=True)

## 8.2 Fitting a polynomial curve
*The `Polynomial_Regression` function below is from the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas; the content is available [on GitHub](https://github.com/jakevdp/PythonDataScienceHandbook).*
* Polynomial regression approximates a function with a polynomial of degree `degree`. The linked [scikit-learn.org](https://scikit-learn.org/stable/auto_examples/linear_model/plot_polynomial_interpolation.html#sphx-glr-auto-examples-linear-model-plot-polynomial-interpolation-py) example fits the polynomial with ridge regression; this notebook uses ordinary least squares (`LinearRegression`) instead.
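
To make the idea concrete, here is a minimal sketch of a polynomial least-squares fit written with NumPy only. It expands `x` into powers and then solves an ordinary least-squares problem, which is the same idea as chaining `PolynomialFeatures` and `LinearRegression`. The data and the names `x_demo`, `y_demo`, `features` and `coef` are assumptions made for illustration; they are not part of the COVID dataset.

```python
import numpy as np

# Synthetic, noise-free data (an assumption for illustration): y = 1 + 2x + 3x^2
x_demo = np.arange(10, dtype=float)
y_demo = 1.0 + 2.0 * x_demo + 3.0 * x_demo ** 2

# Expand x into the polynomial features [1, x, x^2]
degree = 2
features = np.vander(x_demo, degree + 1, increasing=True)

# Ordinary least squares on the expanded features
coef, *_ = np.linalg.lstsq(features, y_demo, rcond=None)

# Extrapolate one step beyond the fitted range (x = 10)
x_next = np.vander(np.array([10.0]), degree + 1, increasing=True)
y_next = float((x_next @ coef)[0])
print(coef, y_next)
```

Because the synthetic data follow a degree-2 polynomial exactly, the fit recovers the coefficients. With real, noisy data a high degree can fit the training range closely and still extrapolate poorly.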

In [None]:
# check the first 27 rows, skipping the doubling rate column, and save them in another dataframe
## to verify that all data are present
df_check=df_analyse.iloc[0:27,3:-1].reset_index()
df_check.head(20)

### 8.2.1 Additional info on the usage of `*args` and `**kwargs`
+ `*args` and `**kwargs` are mostly used in function definitions. They allow you to pass a variable number of arguments to a function. "Variable" here means that you do not know beforehand how many arguments the user will pass to your function. `*args` collects a variable-length list of non-keyword (positional) arguments, and `**kwargs` collects a variable number of keyword arguments as a dictionary.
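
A short sketch of how both forms behave. The functions `describe_call` and `forward_options` are hypothetical helpers written only for this illustration.

```python
def describe_call(*args, **kwargs):
    """Return the positional and keyword arguments exactly as received."""
    # args arrives as a tuple, kwargs as a dict
    return args, kwargs


def forward_options(degree=2, **kwargs):
    """Hypothetical helper: keep `degree` and collect all other keywords."""
    # Everything that is not `degree` ends up in kwargs and could be
    # passed on to another callable with other_function(**kwargs)
    return {"degree": degree, "forwarded": kwargs}


positional, keyword = describe_call(1, 2, fit_intercept=False)
options = forward_options(3, fit_intercept=False)
print(positional, keyword, options)
```

Collecting unknown keywords in `**kwargs` and forwarding them unchanged lets a thin wrapper expose the options of the object it wraps without listing them one by one.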

In [None]:
#defining function for polynomial regression
def Polynomial_Regression(degree=2, **kwargs):
    return make_pipeline(PolynomialFeatures(degree),
                         LinearRegression(**kwargs))

In [None]:
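# Stack the selected country columns into one Series (MultiIndex: country, row number),
# then sort by row number so that the values of all countries for the same day are adjacent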
y=df_check[['Germany','Italy','US','Spain','Korea, South']].unstack().sort_index(axis=0,level=1)

In [None]:
y.head()

In [None]:
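# set test_points = 29 to see overfitting
test_points=28
# train on everything except the last test_points samples, so that
# X_test (which starts at len(y_train)) lines up with y_test
y_train=y[0:-test_points]
y_test=y[-test_points:]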

In [None]:
X_train=np.arange(len(y_train)).reshape(-1, 1)/4.0 
X_test=np.arange(len(y_train),len(y_train)+test_points).reshape(-1, 1)/4.0

In [None]:
# plotting using the subplots feature from matplotlib
fig, ax1 = plt.subplots(1, 1)

ax1.scatter(np.arange(len(y))/4,y, color='red')
# shade the region covered by the test points
ax1.axvspan(X_test.min(), len(y)/4, facecolor='y', alpha=0.5)

for degree in [1,3,7,15]:
    # fit once per degree and reuse the fitted model for both predictions
    model=Polynomial_Regression(degree).fit(X_train, y_train)
    y_hat_insample=model.predict(X_train)
    y_hat_test=model.predict(X_test)

    X_plot=np.concatenate((X_train, X_test), axis=None)
    y_plot=np.concatenate((y_hat_insample, y_hat_test), axis=None)

    # MAPE expects (y_true, y_pred), so the observed values come first
    ax1.plot(X_plot, y_plot, label='degree={0}'.format(degree)+
                 '     MAPE train:  {0:.1f}'.format(MAPE(y_train, y_hat_insample))
                 +'    MAPE test    {0:.1f}'.format(MAPE(y_test, y_hat_test)))

ax1.set_ylim(100, 1500000)
ax1.set_yscale('log')
ax1.legend(loc='best',prop={'size': 16});

## 8.3 Theory
### Regression Metrics <font color=red>(source: Wikipedia)</font>
    
<font color = green>
    
#### 1. Mean Absolute Error
In statistics, the mean absolute error (MAE) is a measure of the errors between paired observations expressing the same phenomenon.

Comparing forecasts $y_i$ with actual (observed) values $x_i$ across $n$ time steps, the MAE is calculated as:</font>
 
$\mathrm{MAE} = \frac{\sum_{i=1}^n\left| y_i-x_i\right|}{n} =\frac{\sum_{i=1}^n\left| e_i \right|}{n}.$
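
As a small worked example of the formula, the sketch below computes the MAE with NumPy. The function name `mean_absolute_error_demo` and the three value pairs are assumptions chosen only for illustration.

```python
import numpy as np


def mean_absolute_error_demo(x_actual, y_forecast):
    """MAE = mean of |y_i - x_i| over all paired observations."""
    x_actual = np.asarray(x_actual, dtype=float)
    y_forecast = np.asarray(y_forecast, dtype=float)
    return float(np.mean(np.abs(y_forecast - x_actual)))


# Illustrative values: absolute errors are 10, 20 and 0
mae_value = mean_absolute_error_demo([100, 200, 400], [110, 180, 400])
print(mae_value)
```

The MAE keeps the unit of the data, so its size has to be read against the magnitude of the series.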

#### 2. Mean Absolute Percentage Error (MAPE)

The mean absolute percentage error (MAPE), also known as mean absolute percentage deviation (MAPD), is a measure of the prediction accuracy of a forecasting method in statistics, for example in trend estimation. It is also used as a loss function for regression problems in machine learning.
It usually expresses the accuracy as a ratio defined by the formula below, with actual values $x_i$ and forecasts $y_i$; the `MAPE` helper in this notebook multiplies this ratio by 100 to report a percentage:

$\mbox{MAPE} = \frac{1}{n}\sum_{i=1}^n  \left|\frac{x_i-y_i}{x_i}\right| $
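
The following sketch evaluates the ratio on a few illustrative numbers and shows that the argument order matters, because the division is by the actual values. The function name `mape_ratio` and the sample values are assumptions for illustration.

```python
import numpy as np


def mape_ratio(x_actual, y_forecast):
    """MAPE as a ratio: mean of |(x_i - y_i) / x_i|; actual values must be non-zero."""
    x_actual = np.asarray(x_actual, dtype=float)
    y_forecast = np.asarray(y_forecast, dtype=float)
    return float(np.mean(np.abs((x_actual - y_forecast) / x_actual)))


actual = [100, 200, 400]
forecast = [110, 180, 400]

mape_percent = 100 * mape_ratio(actual, forecast)   # relative errors: 0.1, 0.1, 0.0
swapped_percent = 100 * mape_ratio(forecast, actual)  # divides by the forecasts instead
print(mape_percent, swapped_percent)
```

The two results differ, so the actual values should always be passed as the first argument. The ratio is undefined whenever an actual value is zero.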
    
#### 3. Symmetric mean absolute percentage error
The symmetric mean absolute percentage error (SMAPE or sMAPE) is an accuracy measure based on percentage (or relative) errors. It is usually defined as follows, where $A_t$ is the actual value and $F_t$ is the forecast value:

$ \text{SMAPE} = \frac{100\%}{n} \sum_{t=1}^n \frac{\left|F_t-A_t\right|}{(|A_t|+|F_t|)/2}$
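
A minimal sketch of this definition with NumPy follows. The function name `smape_percent` and the sample values are assumptions for illustration, and the sketch assumes that no pair has both values equal to zero.

```python
import numpy as np


def smape_percent(a_actual, f_forecast):
    """SMAPE in percent: mean of |F_t - A_t| / ((|A_t| + |F_t|) / 2), times 100."""
    a_actual = np.asarray(a_actual, dtype=float)
    f_forecast = np.asarray(f_forecast, dtype=float)
    denominator = (np.abs(a_actual) + np.abs(f_forecast)) / 2.0
    return float(100.0 * np.mean(np.abs(f_forecast - a_actual) / denominator))


actual = [100, 200, 400]
forecast = [110, 180, 400]

smape_value = smape_percent(actual, forecast)
smape_swapped = smape_percent(forecast, actual)
print(smape_value, smape_swapped)
```

Unlike the MAPE ratio, swapping actual and forecast values leaves this value unchanged, because the denominator treats both series alike.
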