# BADSS Workshop 3
Presented by Data Science Society at Berkeley

## Data Visualization, Modeling, & Inferences

Saturday, March 16, 2019

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

We will be looking at the same `country` dataset from the World Development Indicators. Run the cell below to load the data.

In [None]:
data = pd.read_csv("Country.csv")
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
data.head()

## `matplotlib` Example

In [None]:
x = np.arange(0, 10, 0.1)

# Plot the curves themselves with matplotlib so they match the labels below.
plt.plot(x, np.sin(x))
plt.plot(x, np.cos(x))
plt.xlabel('x')
plt.ylabel('y')
plt.title('Sinusoids')
plt.legend(('sin(x)', 'cos(x)'));

## Continuous Data

#### Histogram 1: Latest Industrial Data

In [None]:
latest_indust = data.dropna(axis=0, subset= ["LatestIndustrialData"])
sns.distplot(latest_indust['LatestIndustrialData']);

#### Histogram 2: Latest Industrial Data (Rug Plot)

In [None]:
sns.distplot(latest_indust['LatestIndustrialData'], kde = True, rug = True);

#### Boxplot

In [None]:
d = latest_indust.dropna(axis=0, subset=["LatestWaterWithdrawalData"])
sns.boxplot(x="LatestWaterWithdrawalData", data = d);

In [None]:
lower, upper = np.percentile(d["LatestWaterWithdrawalData"], [25, 75])
iqr = upper - lower
iqr

In [None]:
upper_cutoff = upper + 1.5 * iqr
lower_cutoff = lower - 1.5 * iqr
upper_cutoff, lower_cutoff
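
The cutoffs above are the usual 1.5 × IQR rule, which is also the default whisker rule in a seaborn boxplot. As a minimal sketch of how they can be used to flag outliers, the cell below applies the same steps to a small made-up array (`values` is hypothetical and stands in for a column such as `LatestWaterWithdrawalData`).

```python
import numpy as np

# Hypothetical years standing in for a real column of the dataset.
values = np.array([2000, 2002, 2003, 2005, 2005, 2006, 2007, 2008, 1975])

# Same steps as above: quartiles, IQR, then the 1.5 * IQR cutoffs.
lower, upper = np.percentile(values, [25, 75])
iqr = upper - lower
lower_cutoff = lower - 1.5 * iqr
upper_cutoff = upper + 1.5 * iqr

# Boolean mask: True where a value falls outside the cutoffs.
is_outlier = (values < lower_cutoff) | (values > upper_cutoff)
outliers = values[is_outlier]
print(outliers)
```

With a DataFrame, the same mask can be built from a column and passed to `d[...]` to list the rows drawn as individual points in the boxplot.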

#### Scatter Plot (with regression line)

In [None]:
scatter_plot = sns.lmplot(x="LatestIndustrialData", y="LatestTradeData", data=data)
plt.ylim(2001, 2016);

#### Scatter Plot (without regression line)

In [None]:
s_plot = sns.lmplot(x="LatestIndustrialData", y="LatestTradeData", data=data, fit_reg = False)
plt.ylim(2001, 2016);

#### Scatter Plot with a third variable (hue)

In [None]:
s_plot = sns.lmplot(x="LatestIndustrialData", y="LatestTradeData", data=data, hue= "IncomeGroup", fit_reg = False)
plt.ylim(2001, 2016);

### Exercise 1: Continuous Plot

Plot an overlaid histogram showing the distributions of `LatestIndustrialData` and `LatestWaterWithdrawalData` between 1997 and 2014. Your final plot should look like this:

<img src="plot1.png" />

In [None]:
plt.figure(figsize=(8,5))
...
...

# Remember to label your axes and give your plot a title.
...
...
...

plt.legend(...)
plt.xlim(...);

## Discrete Data

#### Bar Chart

In [None]:
plt.figure(figsize=(8,5))
ax = sns.countplot(x='IncomeGroup', data=data)
ax.set_xticklabels(ax.get_xticklabels(), rotation=40, ha="right");

In [None]:
plt.figure(figsize=(8,5))
sns.barplot(x='SystemOfTrade', y='LatestWaterWithdrawalData', data=d, ci=False)
plt.ylim(1990, 2010);

#### Dot Chart

In [None]:
sns.pointplot(x='SystemOfTrade', y='LatestIndustrialData', data=d);

### Exercise 2: Discrete Plot

Plot a barplot comparing `LatestIndustrialData` and `IncomeGroup`, segmented by `SystemOfTrade`, between 1990 and 2015. Your final plot should look like this:

<img src="plot2.png" />

In [None]:
sns.barplot(x=..., y=..., hue=..., data=d, ci=False);
plt.legend(loc=...)
plt.xlim(..., ...);

## Linear Regression

Run this cell to install `scikit-learn` if it is not already installed on your machine.

In [None]:
!pip install scikit-learn

We will be exploring a related dataset (the actual important one) from the World Development Indicators. Each of the 3 main variables has its own subsets that separate male and female statistics. Make sure you choose the right variable so that your model makes sense!

In [None]:
# Importing dataset while doing some additional cleaning.
ind = pd.read_csv('ind.csv').dropna().reset_index(drop=True)
ind.head()

We will explore the relationships among life expectancy, access to electricity, and adult literacy. We are only looking at a very small portion of the World Development Indicators data from 2000.

Our demo will look at the relationship between access to electricity and life expectancy. We will be using a module called `linear_model` in the `scikit-learn` package.

In [None]:
from sklearn.linear_model import LinearRegression

# scikit-learn expects 2-D input for x, so we select with double brackets
# to get DataFrames instead of Series. y is kept 2-D as well.
x = ind[['elec']]  
y = ind[['exp']]

exp_v_elec = LinearRegression()
exp_v_elec.fit(x, y)

We can look at the coefficient (slope) and intercept of the linear model to interpret our model. Remember our simple linear regression formula:

$$ y = a + bx + \epsilon $$

In [None]:
print("intercept:    ", exp_v_elec.intercept_[0])
print("coefficient:  ", exp_v_elec.coef_[0][0])
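
To make the formula concrete, here is a minimal sketch on made-up numbers (not the `ind` data). It swaps in `np.polyfit` for `LinearRegression` so that it runs with NumPy alone; the arrays `elec_demo` and `exp_demo` are hypothetical.

```python
import numpy as np

# Hypothetical data: access to electricity (%) and life expectancy (years).
elec_demo = np.array([0.0, 25.0, 50.0, 75.0, 100.0])
exp_demo = np.array([50.0, 55.0, 60.0, 65.0, 70.0])

# A degree-1 polynomial fit returns the slope b first, then the intercept a.
b, a = np.polyfit(elec_demo, exp_demo, 1)

# Prediction from y = a + b * x for a country with 80% access.
predicted = a + b * 80

print("intercept a:", round(a, 2))
print("slope b:    ", round(b, 2))
print("prediction: ", round(predicted, 2))
```

In this toy data, the intercept is the predicted life expectancy at 0% access, and the slope is the change in predicted life expectancy for each additional percentage point of access.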

***Question to Ponder:*** What does the model tell you?

We can also convert both variables in our linear model `exp_v_elec` to standard units; the slope of the refitted model is then the correlation coefficient `r`. Recall that standardizing can be done with this formula:

$$ z = \frac{x-\mu}{\sigma}, $$

where $z$ is the standardized unit, $\mu$ the mean, and $\sigma$ the standard deviation. Instead of doing this by hand, we will be using `StandardScaler` in the `preprocessing` module of the `scikit-learn` package.

In [None]:
from sklearn.preprocessing import StandardScaler

scalerX = StandardScaler()
scalerX.fit(x)
x_std = scalerX.transform(x)

scalerY = StandardScaler()
scalerY.fit(y)
y_std = scalerY.transform(y)
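
For comparison, the formula can also be applied by hand. The sketch below uses a small made-up array (`x_demo` is hypothetical) and relies on the fact that `np.std` defaults to the population standard deviation, which is also what `StandardScaler` uses.

```python
import numpy as np

# Hypothetical values standing in for one column of the dataset.
x_demo = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

mu = x_demo.mean()
sigma = x_demo.std()  # population standard deviation (ddof=0)

# z = (x - mu) / sigma, applied elementwise.
z = (x_demo - mu) / sigma
print(z)
```

After standardizing, the values have mean 0 and standard deviation 1, whatever the original units were.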

In [None]:
exp_v_elec_std = LinearRegression()
exp_v_elec_std.fit(x_std, y_std)
print("correlation coefficient (r):  ", exp_v_elec_std.coef_[0][0])
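
As a sanity check on the idea that the slope in standard units is `r`, the sketch below compares the two on made-up data using NumPy only. The arrays `x_demo` and `y_demo` are hypothetical, and `np.corrcoef` is used here as an independent way to compute `r`.

```python
import numpy as np

# Hypothetical paired observations.
x_demo = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_demo = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

# Convert both variables to standard units.
zx = (x_demo - x_demo.mean()) / x_demo.std()
zy = (y_demo - y_demo.mean()) / y_demo.std()

# Least-squares slope of zy on zx. Both have mean 0, so the fitted
# line passes through the origin and the slope is sum(zx*zy) / sum(zx**2).
slope_std = np.sum(zx * zy) / np.sum(zx ** 2)

# Pearson correlation coefficient computed directly.
r = np.corrcoef(x_demo, y_demo)[0, 1]

print("standardized slope:", round(slope_std, 4))
print("np.corrcoef r:     ", round(r, 4))
```

The two numbers agree, which is why refitting on standardized data is a valid way to read off `r`.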

A common rule of thumb for the strength of a correlation: $|r| > 0.7$ is strong, $|r| < 0.3$ is weak, and anything in between is moderate.
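
The thresholds can be written as a small helper. This is only a sketch of the rule as stated here; the function name `correlation_strength` is hypothetical and not part of any library.

```python
def correlation_strength(r):
    """Label a correlation coefficient using the 0.3 / 0.7 thresholds."""
    magnitude = abs(r)  # the sign only gives the direction, not the strength
    if magnitude > 0.7:
        return "strong"
    if magnitude < 0.3:
        return "weak"
    return "moderate"

print(correlation_strength(-0.85))
print(correlation_strength(0.5))
print(correlation_strength(0.1))
```

Note that a value exactly at 0.3 or 0.7 is labeled moderate, since the rule uses strict inequalities.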

To see what a strong correlation is visually, we can plot a scatter plot with a regression line for `elec` and `exp`, using `lmplot` in seaborn.

In [None]:
# Pass x and y by keyword; recent seaborn versions no longer accept them positionally.
sns.lmplot(x='elec', y='exp', data=ind)
plt.title('Life Expectancy vs. Access to Electricity')
plt.xlabel('Access to Electricity (% population)')
plt.ylabel('Life Expectancy at Birth (years)');

***Question to Ponder:*** Are you convinced that the correlation between these 2 variables is this strong? Is there something from the dataset that may be misleading?

### Linear Model Exercise

Choose 2 variables of your choice from the indicator (`ind`) dataset and build a linear model. Report the coefficient (slope), intercept, and correlation coefficient (standardized slope) for your linear model. At the end, generate a plot that shows the regression line for your model.

Use the code above to help you get started. 