# Federico Brian - On Vodka Consumption in Russia (1998-2016)

### Russia has long been known for its history of heavy alcohol consumption. Most of us have, at least once, seen in person or on video a drunken Russian acting under the influence or downing an entire bottle of vodka. But how has the consumption of vodka changed over the past few years? Let's have a closer look.
#### We want to use this dataset to implement polynomial regression, using the Ridge regression provided by the scikit-learn Python package.
Let's import some packages:

In [None]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

Let's load the dataset:

In [None]:
alcohol_ds = pd.read_csv("../input/alcohol-consumption-in-russia/russia_alcohol.csv", parse_dates = ['year'])

Let's plot the data:

In [None]:
plt.rcParams['figure.figsize'] = [14, 10]

plt.xlabel('Years')
plt.ylabel('Vodka consumption in litres per capita')
plt.scatter(alcohol_ds.year, alcohol_ds.vodka, marker='.', color='blue')
plt.show()

Every dot represents a different Russian region in a given year.

In [None]:
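# Average the per capita vodka consumption across regions for each year.
# groupby sorts by year and mean() skips missing values, so each yearly
# average is divided by the number of regions that actually reported data
# (a fixed divisor of 85 would silently count missing values as 0 litres).
# Assumes the rows are ordered by year, so the result lines up with
# alcohol_ds.year.unique() in the plots below.
vodka = alcohol_ds.groupby('year')['vodka'].mean().tolist()

print(vodka)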
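
The yearly averaging above has to deal with missing values. Here is a minimal sketch on a tiny made-up table (the regions and numbers are invented for illustration and are not taken from the dataset) showing how a fixed divisor differs from a mean that skips missing entries. Whether a missing entry should count as zero is a modelling choice; the sketch only makes the difference visible.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 2 years x 3 regions, one missing value in 1999.
toy = pd.DataFrame({
    "year":   [1998, 1998, 1998, 1999, 1999, 1999],
    "region": ["A", "B", "C", "A", "B", "C"],
    "vodka":  [12.0, 9.0, 15.0, 10.0, np.nan, 14.0],
})

# Fixed divisor: the missing value effectively counts as 0 litres.
fixed_divisor = toy.groupby("year")["vodka"].sum() / 3

# NaN-aware mean: divides by the number of regions that reported a value.
nan_aware = toy.groupby("year")["vodka"].mean()

print(fixed_divisor.tolist())
print(nan_aware.tolist())
```

In this toy table the two approaches agree for 1998 and disagree for 1999, the year with the missing value.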

Here, we can see another plot, representing the per capita consumption per year, averaged over all regions:

In [None]:
plt.rcParams['figure.figsize'] = [14, 10]

plt.xlabel('Years')
plt.ylabel('Vodka consumption in litres per capita')
plt.scatter(alcohol_ds.year.unique(), vodka, marker='o', color='blue')
plt.show()

Now, let's try to implement polynomial regression using the Ridge Regression method, available in scikit-learn.

## Linear regression

In [None]:
alphas = [.1, .5, 1, 10, 100]
predictions = []
years = StandardScaler().fit_transform(alcohol_ds.year.unique().reshape(-1, 1))

for alpha in alphas:
    ridge_reg = Ridge(alpha = alpha)
    ridge_reg.fit(years, vodka) 
    predictions.append(ridge_reg.predict(years))


Let's plot our result:

In [None]:
plt.rcParams['figure.figsize'] = [14, 10]

plt.xlabel('Years')
plt.ylabel('Vodka consumption in litres per capita')
plt.scatter(alcohol_ds.year.unique(), vodka, marker='o', color='blue')

colors = ["green", "red", "purple", "black", "orange"]
# One line per alpha, each with its own color and legend entry.
for alpha, prediction, color in zip(alphas, predictions, colors):
    label = "alpha = " + str(alpha)
    plt.plot(alcohol_ds.year.unique(), prediction, color=color, linewidth=1, label=label)
    
plt.legend()    
plt.show()

### Discussion
We can see how, as alpha's value rises, the penalty on the slope grows, so the fitted line flattens and moves toward the mean consumption over the whole period; in that sense it "remembers" the high values of the earlier years. Instead, as alpha's value drops, the fit approaches ordinary least squares and follows the downward trend more closely, and thus gives a more precise prediction in the short run. However, in my opinion, higher values of alpha give a more precise prediction in the long run.
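
To make the effect of alpha concrete, here is a small sketch on synthetic data (the numbers are made up to mimic a declining trend; they are not the vodka series). Ridge minimizes the squared error plus alpha times the squared coefficients. For a single standardized feature, with the intercept left unpenalized as scikit-learn's `Ridge` does by default, this should give a slope of `sum(x * (y - mean(y))) / (n + alpha)`, so the slope shrinks toward zero as alpha grows.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a yearly series: 19 points with a downward trend.
rng = np.random.default_rng(0)
raw_years = np.arange(1998, 2017).reshape(-1, 1)
x = StandardScaler().fit_transform(raw_years)   # zero mean, unit variance
y = 14.0 - 3.0 * x.ravel() + rng.normal(scale=0.5, size=len(x))

alphas = [.1, .5, 1, 10, 100]
slopes = []
for alpha in alphas:
    model = Ridge(alpha=alpha).fit(x, y)
    slopes.append(model.coef_[0])
    # Closed form for one standardized feature with an unpenalized intercept.
    closed_form = np.sum(x.ravel() * (y - y.mean())) / (len(x) + alpha)
    print(f"alpha={alpha:>5}: slope={model.coef_[0]:.3f}, closed form={closed_form:.3f}")
```

With 19 points, alpha = 100 divides the unregularized slope by roughly (19 + 100) / 19, which is why the corresponding line looks almost flat.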

## Polynomial Regression

In [None]:
degrees = [1, 2, 4, 6, 10]
predictions = []

for degree in degrees:
    model = make_pipeline(PolynomialFeatures(degree), Ridge(alpha=alphas[-1]))
    model.fit(years, vodka)
    predictions.append(model.predict(years))


Let's plot our results:

In [None]:
plt.rcParams['figure.figsize'] = [14, 10]

plt.xlabel('Years')
plt.ylabel('Vodka consumption in litres per capita')
plt.scatter(alcohol_ds.year.unique(), vodka, marker='o', color='blue')

colors = ["green", "red", "purple", "black", "orange"]
# One curve per polynomial degree, each with its own color and legend entry.
for degree, prediction, color in zip(degrees, predictions, colors):
    label = "degree = " + str(degree)
    plt.plot(alcohol_ds.year.unique(), prediction, color=color, linewidth=1, label=label)
    
plt.legend()    
plt.show()

### Discussion
We can see how, as the polynomial's degree rises, the regression follows the points in the dataset too closely. In fact, vodka consumption in 2016 reached a historical minimum, and it could well rise again over the next few years. This phenomenon is called *overfitting*: the regression is very good at reproducing values from the training set, but (perhaps) less good at predicting new and unknown values.
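
One common way to check for overfitting is to hold out the most recent points, fit on the earlier ones, and compare the two errors. The sketch below does this on synthetic data (a made-up curved trend with noise, not the vodka series). The helper `holdout_errors`, its parameters, and the chosen `alpha` are hypothetical and introduced here only for illustration. With just 19 yearly points a split like this is noisy, so it is best read as a rough diagnostic.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler


def holdout_errors(x, y, degrees, alpha=0.1, n_test=4):
    """Fit on the first points, evaluate on the last n_test (hypothetical helper)."""
    x_train, x_test = x[:-n_test], x[-n_test:]
    y_train, y_test = y[:-n_test], y[-n_test:]
    errors = {}
    for degree in degrees:
        # The scaler sits inside the pipeline, so it is fitted on training years only.
        model = make_pipeline(StandardScaler(), PolynomialFeatures(degree), Ridge(alpha=alpha))
        model.fit(x_train, y_train)
        errors[degree] = (
            mean_squared_error(y_train, model.predict(x_train)),
            mean_squared_error(y_test, model.predict(x_test)),
        )
    return errors


# Synthetic data: 19 "years" with a curved trend plus noise.
rng = np.random.default_rng(0)
x = np.arange(1998, 2017, dtype=float).reshape(-1, 1)
t = (x.ravel() - x.mean()) / x.std()
y = 10.0 - 2.0 * t + 1.5 * t**2 + rng.normal(scale=0.3, size=len(t))

errors = holdout_errors(x, y, degrees=[1, 2, 4, 6, 10])
for degree, (train_mse, test_mse) in errors.items():
    print(f"degree={degree:>2}: train MSE={train_mse:.3f}, test MSE={test_mse:.3f}")
```

A training error that keeps falling while the held-out error rises would be the signature of the overfitting described above.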
