First notebook here! Giving it a shot, as an exercise, with this curious dataset on alcohol consumption patterns around the world.

Many interesting relationships could be explored with other country indicators (GDP per capita, life expectancy, Gini index...) but I will keep it minimal for now.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

from plotly.offline import init_notebook_mode, iplot, plot
import plotly as py
init_notebook_mode(connected=True)
import plotly.graph_objects as go

plt.style.use('ggplot')

In [None]:
df = pd.read_csv('../input/alcohol-comsumption-around-the-world/drinks.csv')

In [None]:
df.describe()

# Joint distributions

In [None]:
sns.pairplot(df, kind='reg')

# Rankings

Before looking at the rankings, note that several countries report zero alcohol consumption (at least according to their figures...). There are some surprises in this list, like Monaco, about which I have [some doubts](https://youtu.be/M4AKfRkwT2Q?t=648).

Let us separate them from the rest.

In [None]:
no_alcohol = df.total_litres_of_pure_alcohol == 0
dfd = df[~no_alcohol]

print(df.country[no_alcohol])

In [None]:
# Top 10 and smallest 10 alcohol consumers

idx = dfd.total_litres_of_pure_alcohol.argsort().values

fig = go.Figure()
fig.add_trace(go.Bar(x=dfd.total_litres_of_pure_alcohol.iloc[idx[:10]], y=dfd.country.iloc[idx[:10]], orientation='h', name='total_liters', marker=dict(color='indianred')))
fig.add_trace(go.Bar(x=[0], y=['...'], orientation='h'))
fig.add_trace(go.Bar(x=dfd.total_litres_of_pure_alcohol.iloc[idx[-10:]], y=dfd.country.iloc[idx[-10:]], orientation='h', name='total_liters', marker=dict(color='indianred')))
fig.update_layout(showlegend=False, xaxis_title='total_litres_alcohol', title='Top 10 and smallest 10 alcohol consumers')
fig.show()

Note the presence of several small countries with favorable tax regimes among the top alcohol consumers, like Andorra and Luxembourg, where inhabitants of neighboring countries most likely go to buy tax-free alcohol.

Let us look at particular beverages.
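
The ranking cells below all follow the same pattern (sort by one column, keep both ends), so the selection step could be factored into a small helper. This is only a sketch: the function name `extremes` and the toy data are made up for illustration and are not part of the dataset.

```python
import pandas as pd

def extremes(frame, column, n=10):
    """Return the n smallest and n largest rows of `frame` by `column`, in ascending order."""
    ordered = frame.sort_values(column)
    return pd.concat([ordered.head(n), ordered.tail(n)])

# Made-up toy data, only to show the call.
toy = pd.DataFrame({
    'country': ['A', 'B', 'C', 'D', 'E', 'F'],
    'beer_servings': [120, 5, 300, 45, 210, 80],
})
print(extremes(toy, 'beer_servings', n=2))
```

With fewer than `2 * n` rows the two ends would overlap, so rows would appear twice; the real dataset is large enough for `n=10`.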

In [None]:
# Top 10 and smallest 10 beer consumers

idx = dfd.beer_servings.argsort().values

fig = go.Figure()
fig.add_trace(go.Bar(x=dfd.beer_servings.iloc[idx[:10]], y=dfd.country.iloc[idx[:10]], orientation='h', name='beer_servings', marker=dict(color='indianred')))
fig.add_trace(go.Bar(x=[0], y=['...'], orientation='h'))
fig.add_trace(go.Bar(x=dfd.beer_servings.iloc[idx[-10:]], y=dfd.country.iloc[idx[-10:]], orientation='h', name='beer_servings', marker=dict(color='indianred')))
fig.update_layout(showlegend=False, xaxis_title='beer_servings', title='Top 10 and smallest 10 beer consumers')
fig.show()

There is a lot more variety than I expected in this top 10. Only 2 countries have zero beer consumption.

In [None]:
# Build the mask from dfd itself so that its index matches dfd.country
print(dfd.country[dfd.beer_servings == 0])

In [None]:
# Top 10 and smallest 10 spirits consumers

idx = dfd.spirit_servings.argsort().values

fig = go.Figure()
fig.add_trace(go.Bar(x=dfd.spirit_servings.iloc[idx[:10]], y=dfd.country.iloc[idx[:10]], orientation='h', name='spirit_servings', marker=dict(color='indianred')))
fig.add_trace(go.Bar(x=[0], y=['...'], orientation='h'))
fig.add_trace(go.Bar(x=dfd.spirit_servings.iloc[idx[-10:]], y=dfd.country.iloc[idx[-10:]], orientation='h', name='spirit_servings', marker=dict(color='indianred')))
fig.update_layout(showlegend=False, xaxis_title='spirit_servings', title='Top 10 and smallest 10 spirits consumers')
fig.show()

We can see that Slavic and Caribbean countries are the most frequent at the top of this ranking, which may be explained by social inequalities combined with easy access to strong beverages. Grenada, for example, has been grappling with this problem for [many years](https://www.gov.gd/sites/default/files/egov/ncodc/draft-alcohol-policy.pdf).

Also, some countries seem to have zero spirit consumption, namely:

In [None]:
# Build the mask from dfd itself so that its index matches dfd.country
print(dfd.country[dfd.spirit_servings == 0])

In [None]:
# Top 10 and smallest 10 wine consumers

idx = dfd.wine_servings.argsort().values

fig = go.Figure()
fig.add_trace(go.Bar(x=dfd.wine_servings.iloc[idx[:10]], y=dfd.country.iloc[idx[:10]], orientation='h', name='wine_servings', marker=dict(color='indianred')))
fig.add_trace(go.Bar(x=[0], y=['...'], orientation='h'))
fig.add_trace(go.Bar(x=dfd.wine_servings.iloc[idx[-10:]], y=dfd.country.iloc[idx[-10:]], orientation='h', name='wine_servings', marker=dict(color='indianred')))
fig.update_layout(showlegend=False, xaxis_title='wine_servings', title='Top 10 and smallest 10 wine consumers')
fig.show()

France is in first place! According to this ranking, wine seems to be the elixir of richer countries. I am more surprised to see so many countries with zero wine consumption, probably because wine is culturally absent there.

In [None]:
# Build the mask from dfd itself so that its index matches dfd.country
print(dfd.country[dfd.wine_servings == 0])

# Shares by beverage

Let us compare the proportion of each alcoholic beverage worldwide and in some selected countries.

In [None]:
dfsel = df[df.country.isin(['Germany', 'France', 'Russian Federation',  'USA', 
                          'China', 'India', 'Italy', 'United Kingdom', 'Australia', 'South Africa', 'Brazil'])]

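# DataFrame.append was removed in pandas 2.0; build the "World" row and concatenate it instead
world = pd.DataFrame([{'country': 'World',
                       'beer_servings': df.beer_servings.sum(),
                       'wine_servings': df.wine_servings.sum(),
                       'spirit_servings': df.spirit_servings.sum(),
                       'total_litres_of_pure_alcohol': df.total_litres_of_pure_alcohol.sum()}])
dfsel = pd.concat([dfsel, world], ignore_index=True)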
dfsel['total_servings'] = dfsel.beer_servings + dfsel.spirit_servings + dfsel.wine_servings
dfsel['prop_wine'] = dfsel.wine_servings / dfsel.total_servings
dfsel['prop_beer'] = (dfsel.beer_servings) / dfsel.total_servings
dfsel['prop_spirit'] = dfsel.spirit_servings / dfsel.total_servings
dfsel = dfsel.sort_values(by='country')

In [None]:
colors = ['firebrick', 'yellow', 'cyan']

fig = go.Figure()
fig.add_trace(go.Bar(x=dfsel['prop_wine'], y=dfsel['country'], orientation='h', name='wine',
                     marker=dict(color=colors[0])))
fig.add_trace(go.Bar(x=dfsel['prop_beer'], y=dfsel['country'], orientation='h', name='beer',
                     marker=dict(color=colors[1])))
fig.add_trace(go.Bar(x=dfsel['prop_spirit'], y=dfsel['country'], orientation='h', name='spirit',
                     marker=dict(color=colors[2])))

fig.update_layout(barmode='stack', title='Types of alcohol shares')

fig.show()

Worldwide, the three types of beverages are quite balanced. 

# Relationship between servings and liters?

While looking at the data, I found it weird to have different units (servings on the one hand, liters of pure alcohol on the other), and wondered whether there is a fixed relationship between them. Let's find out.

Naturally one would imagine a linear formula with a different coefficient for each beverage.
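
Written out, such a formula could look like the following. The symbols are my own notation (an assumption, not something given by the dataset): $L$ is `total_litres_of_pure_alcohol`, while $B$, $S$ and $W$ are `beer_servings`, `spirit_servings` and `wine_servings`.

$$
L \approx a_B \, B + a_S \, S + a_W \, W + b
$$

Each coefficient $a_B$, $a_S$, $a_W$ would then be the litres of pure alcohol in one serving of that beverage, and the intercept $b$ should be close to zero if the relationship were exact. Note that scikit-learn's `LinearRegression` fits an intercept by default.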

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

mod = LinearRegression().fit(df[['beer_servings', 'spirit_servings', 'wine_servings']], df.total_litres_of_pure_alcohol.values)
pred = mod.predict(df[['beer_servings', 'spirit_servings', 'wine_servings']])
print(r2_score(df.total_litres_of_pure_alcohol, pred))

Sadly, the linear relationship is not exact. It gets weirder when we look at the distribution of the residuals, which suggests the presence of outliers for which the linear model underestimates the actual litres of pure alcohol.

In [None]:
sns.histplot(pred - df.total_litres_of_pure_alcohol)

If we look at the direct servings -> liters relationship, the outliers are very apparent.

In [None]:
df['total_servings'] = df.beer_servings + df.spirit_servings + df.wine_servings
sns.scatterplot(data=df, x='total_servings', y='total_litres_of_pure_alcohol')

This is weird. It suggests using the [RANSAC regressor](https://sklearn.org/modules/linear_model.html#ransac-regression) to identify these outliers. It is a very interesting method for robust regression: it fits a linear regression to a population of "inlier" points and flags the remaining points as outliers.
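
To make the idea concrete, here is a minimal from-scratch sketch of the RANSAC loop in NumPy. It is not scikit-learn's implementation, which has more options and stopping criteria. The function name `ransac_line`, the synthetic data and all parameter values are made up for illustration.

```python
import numpy as np

def ransac_line(X, y, threshold, n_trials=200, seed=0):
    """Minimal RANSAC for a linear model y ~ X @ coef + intercept."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    A = np.column_stack([X, np.ones(n_samples)])  # append an intercept column
    min_samples = n_features + 1                  # smallest subset that determines the fit
    best_mask = np.zeros(n_samples, dtype=bool)
    for _ in range(n_trials):
        # 1. Fit a candidate model on a random minimal subset.
        subset = rng.choice(n_samples, size=min_samples, replace=False)
        params, *_ = np.linalg.lstsq(A[subset], y[subset], rcond=None)
        # 2. Points close enough to the candidate are its inliers.
        mask = np.abs(A @ params - y) < threshold
        # 3. Keep the candidate that explains the most points.
        if mask.sum() > best_mask.sum():
            best_mask = mask
    # 4. Refit on all inliers of the best candidate.
    params, *_ = np.linalg.lstsq(A[best_mask], y[best_mask], rcond=None)
    return params, best_mask

# Synthetic data (made up): y = 0.02 * x plus small noise, with 5 gross outliers.
rng = np.random.default_rng(42)
x = rng.uniform(0, 500, size=100)
y = 0.02 * x + rng.normal(0, 0.1, size=100)
y[:5] += 5.0  # corrupt the first five points

params, inliers = ransac_line(x.reshape(-1, 1), y, threshold=1.0)
print("slope:", params[0], "intercept:", params[1])
print("outlier indices:", np.flatnonzero(~inliers))
```

The `threshold` argument plays the same role as `residual_threshold` in scikit-learn's `RANSACRegressor`: it decides how far a point may sit from the fitted model before it is treated as an outlier.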

In [None]:
from sklearn.linear_model import RANSACRegressor
from sklearn.metrics import r2_score

mod = RANSACRegressor(residual_threshold=2).fit(df[['beer_servings', 'spirit_servings', 'wine_servings']], df.total_litres_of_pure_alcohol.values)
# plt.scatter(data=df.loc[mod.inlier_mask_], x='total_servings', y='total_litres_of_pure_alcohol', c='blue')
# plt.scatter(data=df.loc[~mod.inlier_mask_], x='total_servings', y='total_litres_of_pure_alcohol', c='red')

fig = go.Figure()
fig.add_trace(go.Scatter(x=df.loc[mod.inlier_mask_, 'total_servings'], 
                         y=df.loc[mod.inlier_mask_, 'total_litres_of_pure_alcohol'],
                        mode='markers', name='inliers', text=df.loc[mod.inlier_mask_, 'country']))
fig.add_trace(go.Scatter(x=df.loc[~mod.inlier_mask_, 'total_servings'], 
                         y=df.loc[~mod.inlier_mask_, 'total_litres_of_pure_alcohol'],
                        mode='markers', name='outliers', text=df.loc[~mod.inlier_mask_, 'country']))
fig.show()

In [None]:
df.loc[~mod.inlier_mask_]

In [None]:
print(mod.estimator_.coef_)
pred = mod.predict(df.loc[mod.inlier_mask_, ['beer_servings', 'spirit_servings', 'wine_servings']])
print(r2_score(df.loc[mod.inlier_mask_, 'total_litres_of_pure_alcohol'], pred))

The robust regressor is a very good predictor of litres of pure alcohol on the inliers, although the coefficients are not what I expected (approximately 1.8 cL of pure alcohol per serving, i.e. a coefficient of about 0.018 litres, for all beverages).
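
Since the regression target is in litres, each coefficient is expressed in litres of pure alcohol per serving. A small sketch of the unit conversion, using 0.018 as an assumed placeholder value (substitute the coefficients printed above); the 330 mL beer at 5% ABV is likewise just an assumed example drink for scale.

```python
# Assumed placeholder coefficient, in litres of pure alcohol per serving.
coef_litres = 0.018

coef_cl = coef_litres * 100    # 1 L = 100 cL
coef_ml = coef_litres * 1000   # 1 L = 1000 mL
print(f"{coef_litres} L = {coef_cl:.1f} cL = {coef_ml:.0f} mL per serving")

# For scale: pure alcohol in an assumed 330 mL beer at 5% ABV.
beer_ml = 330 * 0.05
print(f"330 mL beer at 5% ABV: {beer_ml:.1f} mL of pure alcohol")
```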

But looking at who the outliers are, nothing jumps out at first sight. Any ideas on where this curious relationship comes from?