# Today's Coding Topics
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/xiangshiyin/data-programming-with-python/blob/main/2023-summmer/2023-07-05/notebook/concept_and_code_demo.ipynb)

* Recap of previous lecture
* Demos on common visualization libraries
    * `matplotlib`
    * `seaborn`
* Regression Analysis

# The Python Visualization Landscape


`Visualization` is one of the most important tasks in data analysis. In the data exploration stage, you can use visualization to identify outliers and general patterns in the data. In the modeling stage, you can use it to monitor the performance of your model. In the reporting stage, you can use it to better illustrate the business insights you want to deliver to your audience. Several commonly used Python libraries help with these use cases:

**Matplotlib**<br/>
`matplotlib` is a comprehensive library for creating static, animated, and interactive visualizations in Python. It is the most popular plotting library in Python, which gives users different levels of control over every aspect of a figure.

**Seaborn**<br/>
`seaborn` provides a high-level interface for drawing attractive and informative statistical graphics. It builds on top of matplotlib and integrates closely with pandas data structures.

**Plotly**<br/>
`plotly` is an open source library that enables users to create beautiful interactive web-based visualizations that can be displayed in Jupyter notebooks, saved to standalone HTML files, or served as part of pure Python-built web applications using `Dash`.

**Bokeh**<br/>
Bokeh is an interactive visualization library for modern web browsers. It provides elegant, concise construction of versatile graphics, and affords high-performance interactivity over large or streaming datasets. Bokeh can help anyone who would like to quickly and easily make interactive plots, dashboards, and data applications.

# Visualization with `matplotlib`

* Comes with the `Anaconda` installation
* If you don't have it on your computer, you could install it with pip or conda at the command line or terminal with
`pip install matplotlib` or `conda install matplotlib`
* `matplotlib` breaks a plot/graph into different components. With the library, you can add/delete/edit these components one by one based on detailed use cases.
* With `matplotlib`, you could create reproducible figures in a programmatic way.

![](https://matplotlib.org/_images/anatomy1.png)

## Import the core library

In [None]:
import matplotlib.pyplot as plt
import numpy as np

**`matplotlib` provides two types of methods for you to construct plots**
* `Function-based methods`: MATLAB-style plotting functions. The code is very concise, and you control the plot through a few high-level settings.
* `Object-oriented methods`: You create figure and axes objects, and all plot control statements go through the methods or attributes of those objects. The code is slightly more verbose, but you get finer control over the figure components.
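
To make the contrast concrete, here is a minimal sketch that draws the same curve with each interface. The data and the titles are made up for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)
y = x ** 2

# Function-based (pyplot) style: each call acts on the implicit "current" figure and axes
plt.figure()
plt.plot(x, y)
plt.xlabel('x')
plt.ylabel('y')
plt.title('Function-based interface')

# Object-oriented style: keep explicit handles and call their methods
fig, ax = plt.subplots()
ax.plot(x, y)
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_title('Object-oriented interface')

plt.show()
```

The rest of this notebook mostly uses the object-oriented style, since the explicit `ax` handle makes it clear which plot each setting applies to.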

## Figure parameter tuning

### Plot range

In [None]:
## revisit the simple X-Y plot

fig, axes = plt.subplots()

x = np.linspace(0,10,1000)
y = [v ** 2 for v in x]

axes.plot(x,y)
axes.set_xlabel('X label')
axes.set_ylabel('Y label')
axes.set_title('This is another test')
axes.set_xlim(0,3)
axes.set_ylim(0,10)

plt.show()

### Legend

In [None]:
## plot multiple lines in one graph

fig, axes = plt.subplots()

axes.plot(x,y,label='$x^2$')
axes.plot(x, x**3, label='$x^3$')

axes.set_xlabel('X label')
axes.set_ylabel('Y label')
axes.set_title('This is another test')
axes.set_xlim(0,3)
axes.set_ylim(0,10)

axes.legend()
plt.show()

You can customize the position of the legend box

```sh
        ===============   =============
        Location String   Location Code
        ===============   =============
        'best'            0
        'upper right'     1
        'upper left'      2
        'lower left'      3
        'lower right'     4
        'right'           5
        'center left'     6
        'center right'    7
        'lower center'    8
        'upper center'    9
        'center'          10
        ===============   =============
```

In [None]:
fig, axes = plt.subplots()

axes.plot(x,y,label='$x^2$')
axes.plot(x, x**3, label='$x^3$')

axes.set_xlabel('X label')
axes.set_ylabel('Y label')
axes.set_title('This is another test')
axes.set_xlim(0,3)
axes.set_ylim(0,10)

axes.legend(loc=1) # customize the position
plt.show()

### Text

In [None]:
## add a text annotation at a specific position in the graph
fig,ax = plt.subplots()

t = np.arange(0.0, 5.0, 0.01)
s = np.cos(2*np.pi*t)
line, = plt.plot(t, s, lw=2)

plt.annotate('local max', xy=(2, 1), xytext=(3, 1.5),
             arrowprops=dict(facecolor='black', shrink=0.05),
             )

plt.ylim(-2, 2)
plt.show()

In [None]:
mu, sigma = 100, 15
x = mu + sigma * np.random.randn(10000)

# the histogram of the data
n, bins, patches = plt.hist(x, 50, density=1, facecolor='g', alpha=0.75)


plt.xlabel('Smarts')
plt.ylabel('Probability')
plt.title('Histogram of IQ')
plt.text(60, .025, r'$\mu=100,\ \sigma=15$')
plt.axis([40, 160, 0, 0.03]) # a different way to define x-y limits
plt.grid(True)
plt.show()

### Line colors, width, styles
* For detailed parameter settings, check https://matplotlib.org/api/_as_gen/matplotlib.axes.Axes.plot.html

In [None]:
fig, ax = plt.subplots()

x = np.linspace(0,5,20)
ax.plot(x, x+1, color='b')
ax.plot(x, x+2, color='g')
ax.plot(x, x+3, color='r')

plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(12,6))

ax.plot(x, x+1, color='r', linewidth=0.25)
ax.plot(x, x+2, color='r', linewidth=0.50)
ax.plot(x, x+3, color='r', linewidth=1.00)
ax.plot(x, x+4, color='r', linewidth=2.00)

plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(12,6))

ax.plot(x, x+1, color='r', linewidth=0.25)
ax.plot(x, x+2, color='r', linewidth=0.50)
ax.plot(x, x+3, color='r', linewidth=1.00)
ax.plot(x, x+4, color='r', linewidth=2.00)

ax.plot(x, x+5, color="green", lw=3, linestyle='-')
ax.plot(x, x+6, color="green", lw=3, ls='-.')
ax.plot(x, x+7, color="green", lw=3, ls=':')

plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(12,6))

ax.plot(x, x+1, color='r', linewidth=0.25)
ax.plot(x, x+2, color='r', linewidth=0.50)
ax.plot(x, x+3, color='r', linewidth=1.00)
ax.plot(x, x+4, color='r', linewidth=2.00)

ax.plot(x, x+5, color="green", lw=3, linestyle='-')
ax.plot(x, x+6, color="green", lw=3, ls='-.')
ax.plot(x, x+7, color="green", lw=3, ls=':')

ax.plot(x, x+ 8, color="blue", lw=3, ls='-', marker='+')
ax.plot(x, x+ 9, color="blue", lw=3, ls='--', marker='o')
ax.plot(x, x+10, color="blue", lw=3, ls='-', marker='s')

plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(12,6))

ax.plot(x, x+1, color='r', linewidth=0.25)
ax.plot(x, x+2, color='r', linewidth=0.50)
ax.plot(x, x+3, color='r', linewidth=1.00)
ax.plot(x, x+4, color='r', linewidth=2.00)

ax.plot(x, x+5, color="green", lw=3, linestyle='-')
ax.plot(x, x+6, color="green", lw=3, ls='-.')
ax.plot(x, x+7, color="green", lw=3, ls=':')

ax.plot(x, x+ 8, color="blue", lw=3, ls='-', marker='+')
ax.plot(x, x+ 9, color="blue", lw=3, ls='--', marker='o')
ax.plot(x, x+10, color="blue", lw=3, ls='-', marker='s')

ax.plot(x, x+11, color="purple", lw=1, ls='-', marker='o', markersize=2)
ax.plot(x, x+12, color="purple", lw=1, ls='-', marker='o', markersize=4)
ax.plot(x, x+13, color="purple", lw=1, ls='-', marker='o', markersize=8, markerfacecolor="red")
ax.plot(x, x+14, color="purple", lw=1, ls='-', marker='s', markersize=8, 
        markerfacecolor="yellow", markeredgewidth=3, markeredgecolor="green");

plt.show()

# Visualization with `seaborn`

## Import the core library

In [None]:
import seaborn as sns
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Distribution of Data

* All about penguins: [[link](https://seaworld.org/animals/all-about/penguins/physical-characteristics/#:~:text=Flippers,modified%20into%20paddle%2Dlike%20flippers.&text=Each%20flipper%20is%20covered%20with,tapered%2C%20flattened%20flippers%20for%20swimming.)]

In [None]:
penguins = sns.load_dataset("penguins")
penguins.head(3)

In [None]:
type(penguins)

In [None]:
penguins.shape

In [None]:
penguins.island.unique()

In [None]:
penguins.species.unique()

### Single variable distribution

In [None]:
sns.histplot(penguins.flipper_length_mm)

In [None]:
sns.histplot(penguins.flipper_length_mm, kde=False, bins=30)

### Multi-variable joint distribution

In [None]:
sns.jointplot(x='flipper_length_mm',y='body_mass_g',data=penguins,kind='scatter')

In [None]:
sns.jointplot(x='flipper_length_mm',y='body_mass_g',data=penguins,kind='hex')

In [None]:
sns.jointplot(x='flipper_length_mm',y='body_mass_g',data=penguins,kind='reg')

In [None]:
sns.jointplot(x='flipper_length_mm',y='body_mass_g',data=penguins,hue='sex')

### Pair Plot

In [None]:
sns.pairplot(penguins, hue='island')
# sns.pairplot(penguins)

In [None]:
sns.pairplot(penguins, hue='sex')

In [None]:
sns.pairplot(penguins, hue='sex',palette='rainbow')

### barplot and countplot

* `Barplot` shows point estimates and confidence intervals as rectangular bars. By default, it displays the mean value. You can also pass in a custom function through the `estimator` argument.
* Reference: https://seaborn.pydata.org/generated/seaborn.barplot.html

In [None]:
sns.barplot(x='species',y='bill_length_mm',data=penguins) # by default it displays the mean values

In [None]:
sns.barplot(x='species',y='bill_length_mm',data=penguins,estimator=np.std)

In [None]:
sns.barplot(x='species',y='bill_length_mm',data=penguins,hue='sex')

* `Countplot` shows the counts of observations in each categorical bin using bars.

In [None]:
sns.countplot(x='island',data=penguins)

### Boxplot/Violinplot/Stripplot

In [None]:
# sns.boxplot(x='species',y='flipper_length_mm',data=penguins)
sns.boxplot(x='species',y='flipper_length_mm',data=penguins,hue='island')

In [None]:
sns.boxplot(y='species',x='flipper_length_mm',data=penguins,hue='island',orient='h')

`violin plot`: https://seaborn.pydata.org/generated/seaborn.violinplot.html

In [None]:
sns.violinplot(x='species',y='flipper_length_mm',data=penguins,hue='island')

`strip plot`: https://seaborn.pydata.org/generated/seaborn.stripplot.html

In [None]:
sns.stripplot(x='species',y='flipper_length_mm',data=penguins,hue='island')

A more general format

`cat plot`: https://seaborn.pydata.org/generated/seaborn.catplot.html

In [None]:
sns.catplot(x='species',y='flipper_length_mm',data=penguins,hue='island',kind='box')

## Heatmap

In [None]:
penguins.info()

In [None]:
sns.heatmap(penguins.select_dtypes(include=np.number).corr(),cmap='coolwarm')

## Grids

In [None]:
sns.PairGrid(penguins)

In [None]:
g = sns.PairGrid(penguins)
g.map(plt.scatter)

In [None]:
g = sns.PairGrid(penguins)
g.map_diag(plt.hist)
g.map_upper(plt.scatter)
g.map_lower(sns.kdeplot)

In [None]:
g = sns.JointGrid(x="bill_length_mm", y="body_mass_g", data=penguins)

In [None]:
g = sns.JointGrid(x="bill_length_mm", y="body_mass_g", data=penguins)
g = g.plot(sns.regplot, sns.histplot)

## Regression Plot

* Reference: https://seaborn.pydata.org/tutorial/regression.html

In [None]:
sns.lmplot(x='flipper_length_mm',y='body_mass_g',data=penguins,)

In [None]:
sns.lmplot(x='flipper_length_mm',y='body_mass_g',data=penguins,hue='sex')

In [None]:
sns.lmplot(x='flipper_length_mm',y='body_mass_g',data=penguins,hue='sex',markers=['o','v'])

In [None]:
sns.lmplot(x='flipper_length_mm',y='body_mass_g',data=penguins,col='sex')

In [None]:
sns.lmplot(x='flipper_length_mm',y='body_mass_g',data=penguins,col='sex',row='species')

In [None]:
sns.lmplot(x='flipper_length_mm',y='body_mass_g',data=penguins,hue='sex',row='species')

## Colors and Styles

Reference: `matplotlib` colormap: https://matplotlib.org/3.1.1/gallery/color/colormap_reference.html

In [None]:
sns.set_style(style='darkgrid')
# sns.set_context('notebook',font_scale=1)
sns.lmplot(x='flipper_length_mm',y='body_mass_g',data=penguins,hue='sex',palette='seismic')

# Linear Regression

Suppose you want to know **if money makes people happy**, so you download the Better Life Index data from the OECD’s website and stats about gross domestic product (GDP) per capita from the IMF’s website.
* OECD: https://www.oecdbetterlifeindex.org/
* IMF: https://ourworldindata.org/

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
pd.options.display.max_rows=None
pd.options.display.max_columns=None

## Get the data

In [None]:
oecd_bli = pd.read_csv('../data/oecd-better-life-index.csv',sep=',',header=0)
gdp_per_capita = pd.read_csv('../data/imf-gdp-per-capita.csv',sep=',',header=0,thousands=',')

In [None]:
oecd_bli.head(3)

In [None]:
gdp_per_capita.head(3)

## Transform the data

Build a view of the form

| Country | GDP per capita (USD) | Life satisfaction |
|---------|----------------------|-------------------|

### format the oecd data

In [None]:
oecd_bli.Indicator.unique()

In [None]:
oecd_bli.Inequality.unique()

In [None]:
oecd_bli2 = oecd_bli.loc[
    (oecd_bli.Inequality=='Total')&(oecd_bli.Indicator=='Life satisfaction'),
    ['Country','Value']
].rename(columns={'Country':'country','Value':'life_satisfaction'})

In [None]:
oecd_bli2.head(3)

In [None]:
oecd_bli2.country.nunique()

### format the gdp per capita data

In [None]:
gdp_per_capita2 = gdp_per_capita.loc[:,['Country','2015']].rename(columns={
    'Country':'country',
    '2015':'gdp_per_capita'
})
gdp_per_capita2.head(3)

In [None]:
gdp_per_capita2.country.nunique()

### join the two

In [None]:
country_stats = pd.merge(
    gdp_per_capita2,
    oecd_bli2,
    on='country'
).sort_values(by='gdp_per_capita').reset_index(drop=True)

In [None]:
country_stats.head(3)

In [None]:
country_stats.shape

## split the data

In [None]:
# remove_indices = [2,3,8,12,37,38,39]
# keep_indices = list(set(range(40)) - set(remove_indices))

countries_remove = ['Brazil','Mexico','Chile','Czech Republic','Norway','Switzerland','Luxembourg']
countries_keep = set(country_stats.country) - set(countries_remove)

In [None]:
# sample_data = country_stats.iloc[keep_indices]
# missing_data = country_stats.iloc[remove_indices]
sample_data = country_stats[country_stats.country.isin(countries_keep)]
missing_data = country_stats[country_stats.country.isin(countries_remove)]

In [None]:
missing_data

## visualize the sample data

In [None]:
plt.scatter(sample_data.gdp_per_capita, sample_data.life_satisfaction, label = 'gdp vs. happiness')
plt.xlabel('gdp per capita')
plt.ylabel('life satisfaction')
plt.legend()

plt.show()

In [None]:
sample_data[sample_data.country.isin(['Hungary','Korea','France','Australia','United States'])]

In [None]:
plt.scatter(sample_data.gdp_per_capita, sample_data.life_satisfaction)
position_text = {
    "Hungary": (15000, 5),
    "Korea": (28000, 5.5),
    "France": (35000, 6.0),
    "Australia": (42000, 6.5),
    "United States": (52000, 6.0),
}

for country, pos_text in position_text.items():
    pos_data_x = sample_data.loc[sample_data.country==country,'gdp_per_capita']
    pos_data_y = sample_data.loc[sample_data.country==country,'life_satisfaction']
    country = "U.S." if country == "United States" else country
    plt.annotate(country, xy=(pos_data_x, pos_data_y), xytext=pos_text,
            arrowprops=dict(facecolor='black', shrink=0.1))
    plt.plot(pos_data_x, pos_data_y, "ro")

plt.xlabel('GDP per capita (USD)')
plt.ylabel('Life satisfaction')
plt.show()

There seems to be a linear trend here!!

## Fit a linear regression model!

Linear regression model
$$
y = \theta_0 + \theta_1X
$$

The fit minimizes the Sum of Squared Errors (SSE)
$$
SSE = \sum_{i=1}^{N} (y_i - \hat{y_i})^2 = \sum_{i=1}^{N} (y_i - \theta_0 - \theta_1X_i)^2
$$
Or, in matrix form, where $w = (\theta_0, \theta_1)^T$ and $X$ has a leading column of ones,
$$
SSE = ||y - Xw||^2
$$
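
As a sanity check on the formula, the least-squares coefficients can be computed directly with NumPy. This is a minimal sketch on synthetic data (the intercept 5.0 and slope 0.5 are made up), not the notebook's dataset, and `np.linalg.lstsq` is used here simply as one convenient way to minimize $||y - Xw||^2$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 5.0 + 0.5 * x + noise (values chosen for illustration only)
x = rng.uniform(0, 10, size=200)
y = 5.0 + 0.5 * x + rng.normal(scale=0.1, size=200)

# Design matrix with a column of ones, so that w = [theta_0, theta_1]
X = np.column_stack([np.ones_like(x), x])

# Solve min_w ||y - Xw||^2
w, *_ = np.linalg.lstsq(X, y, rcond=None)
theta0, theta1 = w

# Sum of squared errors at the fitted coefficients
sse = np.sum((y - X @ w) ** 2)
print(theta0, theta1, sse)
```

The recovered `theta0` and `theta1` should land close to the values used to generate the data, which is the same quantity `LinearRegression` reports as `intercept_` and `coef_`.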

![](https://nextjournal.com/data/QmfPuPp4V74FyvTTojMj6ix9T8Skj1ji4GhX5Pr6zK8w4N?filename=linear-regression.png&content-type=image/png)

In [None]:
%%time

from sklearn import linear_model
lr = linear_model.LinearRegression()

# prepare the data
Xsample = np.c_[sample_data.gdp_per_capita]
ysample = np.c_[sample_data.life_satisfaction]

# train the model
lr.fit(Xsample, ysample)

# get the model parameters
# t0, t1 = lr.intercept_[0], lr.coef_[0][0]
t0 = lr.intercept_[0]
t1 = lr.coef_[0][0]

t0, t1

In [None]:
Xsample.ndim

In [None]:
## same as previous plot
plt.scatter(sample_data.gdp_per_capita, sample_data.life_satisfaction)
position_text = {
    "Hungary": (20000, 5),
    "Korea": (28000, 5.5),
    "France": (35000, 6.0),
    "Australia": (42000, 6.5),
    "United States": (52000, 6.0),
}

for country, pos_text in position_text.items():
    pos_data_x = sample_data.loc[sample_data.country==country,'gdp_per_capita']
    pos_data_y = sample_data.loc[sample_data.country==country,'life_satisfaction']
    country = "U.S." if country == "United States" else country
    plt.annotate(country, xy=(pos_data_x, pos_data_y), xytext=pos_text,
            arrowprops=dict(facecolor='black', shrink=0.1))
    plt.plot(pos_data_x, pos_data_y, "ro")

## plot the fitted line here
X = np.linspace(start=0, stop=60000, num=1000)
plt.plot(X, t1*X + t0, label='fitted line')
plt.text(1000, 5, '$\\theta_0=5.16$')
plt.text(1000, 4.8, '$\\theta_1=4.24 \\times 10^{-5}$')
    
plt.xlabel('GDP per capita (USD)')
plt.ylabel('Life satisfaction')
plt.legend()
plt.show()

## Make predictions to a new country

In [None]:
gdp_per_capita2[gdp_per_capita2.country=='Cyprus']

In [None]:
X_new = [[22587.49]]
lr.predict(X_new)

In [None]:
gdp_per_capita2[gdp_per_capita2.country=='China']

In [None]:
X_new = [[7989.72]]
lr.predict(X_new)

## What happens when we add the missing countries back to the data?

In [None]:
missing_data

In [None]:
%%time

lr2 = linear_model.LinearRegression()

# prepare the data
Xfull = np.c_[country_stats.gdp_per_capita]
yfull = np.c_[country_stats.life_satisfaction]

# train the model
lr2.fit(Xfull, yfull)

# get the model parameters
t0full, t1full = lr2.intercept_[0], lr2.coef_[0][0]
t0full, t1full

In [None]:
## same as previous plot
plt.scatter(country_stats.gdp_per_capita, country_stats.life_satisfaction)

X = np.linspace(start=0, stop=100000, num=1000)

## plot the fitted line based on partial dataset
plt.plot(X, t1*X + t0, label='fitted line (partial)')

## plot the fitted line of the complete dataset
plt.plot(X, t1full*X + t0full, label='fitted line (complete)')

## highlight the countries from the missing data
position_text2 = {
    "Brazil": (-1000, 7),
    "Mexico": (15000, 8),
    "Chile": (30000, 8.2),
    "Czech Republic": (45000, 8.5),
    "Norway": (55000, 5),
    "Switzerland": (75000, 6),
    "Luxembourg": (85000, 5)
}

for country, pos_text in position_text2.items():
    pos_data_x = country_stats.loc[country_stats.country==country,'gdp_per_capita']
    pos_data_y = country_stats.loc[country_stats.country==country,'life_satisfaction']
    country = "U.S." if country == "United States" else country
    plt.annotate(country, xy=(pos_data_x, pos_data_y), xytext=pos_text,
            arrowprops=dict(facecolor='black', shrink=0.1))
    plt.plot(pos_data_x, pos_data_y, "ro")

plt.xlabel('GDP per capita (USD)')
plt.ylabel('Life satisfaction')
plt.legend()
plt.show()

Apparently, the previous training data `sample_data` we used isn't representative!!

## Overfitting

In [None]:
## add polynomial features
from sklearn import preprocessing

df = 10
poly = preprocessing.PolynomialFeatures(degree=df)
scaler = preprocessing.StandardScaler()
lr3 = linear_model.LinearRegression()

Xfull_new = scaler.fit_transform(
    poly.fit_transform(Xfull)
)

## fit linear regression model on Xfull_new
lr3.fit(Xfull_new,yfull)

In [None]:
## define a set of new X
X = np.c_[np.linspace(start=0, stop=100000, num=1000)]

## reuse the transformers fitted on the training data (transform only, do not refit)
X_new = scaler.transform(
    poly.transform(X)
)

## predict on the new X
y = lr3.predict(X_new)

In [None]:
plt.scatter(country_stats.gdp_per_capita, country_stats.life_satisfaction)

## plot the fitted line based on partial dataset
plt.plot(X, t1*X + t0, label='fitted line (partial)')

## plot the fitted line of the complete dataset
plt.plot(X, t1full*X + t0full, label='fitted line (complete)')

## highlight the countries from the missing data
position_text2 = {
    "Brazil": (-1000, 7),
    "Mexico": (15000, 8),
    "Chile": (30000, 8.2),
    "Czech Republic": (45000, 8.5),
    "Norway": (55000, 5),
    "Switzerland": (75000, 6),
    "Luxembourg": (85000, 5)
}

for country, pos_text in position_text2.items():
    pos_data_x = country_stats.loc[country_stats.country==country,'gdp_per_capita']
    pos_data_y = country_stats.loc[country_stats.country==country,'life_satisfaction']
    country = "U.S." if country == "United States" else country
    plt.annotate(country, xy=(pos_data_x, pos_data_y), xytext=pos_text,
            arrowprops=dict(facecolor='black', width=0.5, shrink=0.1, headwidth=5))
    plt.plot(pos_data_x, pos_data_y, "ro")

## plot the polynomial fit
plt.plot(X,y,label='polynomial fit (df={})'.format(df))
    
    
plt.xlabel('GDP per capita (USD)')
plt.ylabel('Life satisfaction')
plt.ylim(0,13)
plt.legend()
plt.show()
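
One way to see overfitting numerically is to compare training and held-out error for a simple and a flexible model. The sketch below uses made-up, truly linear data rather than the country data, and wraps the preprocessing in a scikit-learn pipeline so that the polynomial and scaling steps are fitted on the training split only. The variable names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)

# Made-up data with a truly linear relationship plus noise
x_demo = rng.uniform(0, 1, size=(40, 1))
y_demo = 1.0 + 2.0 * x_demo[:, 0] + rng.normal(scale=0.3, size=40)

x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.5, random_state=0
)

results = {}
for degree in [1, 10]:
    model = make_pipeline(
        PolynomialFeatures(degree=degree), StandardScaler(), LinearRegression()
    )
    model.fit(x_tr, y_tr)  # transformers are fitted on the training split only
    results[degree] = (
        mean_squared_error(y_tr, model.predict(x_tr)),  # training error
        mean_squared_error(y_te, model.predict(x_te)),  # held-out error
    )
    print(degree, results[degree])
```

The degree-10 model cannot have a higher training error than the straight line, because it contains the line as a special case. Its held-out error is typically, though not always, worse, and that gap is the signature of overfitting.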

## Apply L2 Regularization (Ridge Regression)

Instead of minimizing
$$
SSE = ||y - Xw||^2
$$

we minimize the penalized objective
$$
J(w) = ||y - Xw||^2 + \alpha ||w||^2
$$

In the 2D situation, with the intercept $\theta_0$ left unpenalized (as scikit-learn's `Ridge` does by default), this becomes

$$
J(\theta) = \sum_{i=1}^{N} (y_i - \theta_0 - \theta_1X_i)^2 + \alpha \theta_1^2
$$
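
The effect of $\alpha$ on the slope can be sketched in closed form for a single feature. The example below assumes the convention used by scikit-learn's `Ridge`, where the intercept is not penalized; under that assumption the slope is $\theta_1 = \sum_i x^c_i y^c_i / (\sum_i (x^c_i)^2 + \alpha)$ with $x^c$, $y^c$ the centered data. The helper `ridge_1d` and the data are made up for illustration.

```python
import numpy as np

def ridge_1d(x, y, alpha):
    """Closed-form ridge fit for one feature with an unpenalized intercept."""
    xc = x - x.mean()
    yc = y - y.mean()
    theta1 = (xc @ yc) / (xc @ xc + alpha)   # the penalty only enters the denominator
    theta0 = y.mean() - theta1 * x.mean()
    return theta0, theta1

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=50)
y = 2.0 + 3.0 * x + rng.normal(size=50)      # made-up data

alphas_demo = [0, 10, 1_000, 100_000]
slopes = []
for alpha in alphas_demo:
    theta0, theta1 = ridge_1d(x, y, alpha)
    slopes.append(theta1)
    print(f"alpha={alpha:>7}: theta0={theta0:.3f}, theta1={theta1:.4f}")
```

As `alpha` grows, the slope shrinks toward zero and the intercept moves toward the mean of `y`, which is why a very large `alpha` produces an almost flat line.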

* https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html
* There is another kind of regularization called `L1 Regularization` (also called `Lasso Regression`)
    * More details can be found here: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html
    * Aurélien Géron, `Hands-On Machine Learning with Scikit-Learn and TensorFlow`, O’Reilly Media, 2022.
        * Chapter 4
    * Differences between `L1` and `L2` regularization
        * https://www.linkedin.com/pulse/intuitive-visual-explanation-differences-between-l1-l2-xiaoli-chen/
        * https://www.quora.com/What-is-the-difference-between-L1-and-L2-regularization-How-does-it-solve-the-problem-of-overfitting-Which-regularizer-to-use-and-when

The amount of regularization to apply during learning can be controlled by a hyperparameter. A hyperparameter is a parameter of a learning algorithm (not of the model). As such, it is not affected by the learning algorithm itself; it must be set prior to training and remains constant during training. 

Here, the parameter $\alpha$ is a hyperparameter. If you set the regularization hyperparameter to a very large value, you will get an almost flat model (a slope close to zero); the learning algorithm will almost certainly not overfit the training data, but it will be less likely to find a good solution. Tuning hyperparameters is an important part of building a Machine Learning system.
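
One common way to tune $\alpha$ is cross-validation. The sketch below uses scikit-learn's `RidgeCV` on synthetic data; the candidate values in `alphas_cv` and the coefficients in `true_w` are arbitrary choices for illustration, not recommendations.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(42)

# Synthetic regression problem with made-up coefficients
X_cv = rng.normal(size=(100, 5))
true_w = np.array([1.5, 0.0, -2.0, 0.0, 0.5])
y_cv = X_cv @ true_w + rng.normal(scale=0.5, size=100)

# Candidate regularization strengths (arbitrary grid)
alphas_cv = [0.01, 0.1, 1.0, 10.0, 100.0]

# 5-fold cross-validation picks the alpha with the best average validation score
model_cv = RidgeCV(alphas=alphas_cv, cv=5)
model_cv.fit(X_cv, y_cv)

print("selected alpha:", model_cv.alpha_)
print("coefficients:", model_cv.coef_)
```

The selected value depends on the data and on the grid, so in practice the grid is usually widened or refined until the best value is not at either end.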

In [None]:
## apply regularization on the original fit on sample_data
ridge = linear_model.Ridge(alpha=10**9.5)
Xsample = np.c_[sample_data.gdp_per_capita]
ysample = np.c_[sample_data.life_satisfaction]
ridge.fit(Xsample, ysample)
t0ridge, t1ridge = ridge.intercept_[0], ridge.coef_[0][0]

In [None]:
## same as previous plot
plt.scatter(country_stats.gdp_per_capita, country_stats.life_satisfaction)

X = np.linspace(start=0, stop=100000, num=1000)

## plot the fitted line based on partial dataset
plt.plot(X, t1*X + t0, label='fitted line (partial)')

## plot the fitted line of the complete dataset
plt.plot(X, t1full*X + t0full, label='fitted line (complete)')

## plot the fitted line based on partial dataset with regularization
plt.plot(X, t1ridge*X + t0ridge, label='fitted line (partial+regularization)')

## highlight the countries from the missing data
position_text2 = {
    "Brazil": (-1000, 7),
    "Mexico": (15000, 8),
    "Chile": (30000, 8.2),
    "Czech Republic": (45000, 8.5),
    "Norway": (55000, 5),
    "Switzerland": (75000, 6),
    "Luxembourg": (85000, 5)
}

for country, pos_text in position_text2.items():
    pos_data_x = country_stats.loc[country_stats.country==country,'gdp_per_capita']
    pos_data_y = country_stats.loc[country_stats.country==country,'life_satisfaction']
    country = "U.S." if country == "United States" else country
    plt.annotate(country, xy=(pos_data_x, pos_data_y), xytext=pos_text,
            arrowprops=dict(facecolor='black', shrink=0.1))
    plt.plot(pos_data_x, pos_data_y, "ro")

plt.xlabel('GDP per capita (USD)')
plt.ylabel('Life satisfaction')
plt.legend()
plt.ylim(4,12)
plt.show()

### Vary the regularization parameter $\alpha$

In [None]:
alphas = [10**i for i in range(20)]

In [None]:
Xsample = np.c_[sample_data.gdp_per_capita]
ysample = np.c_[sample_data.life_satisfaction]

In [None]:
%%time

from sklearn.metrics import mean_squared_error

rmses = []
r2s = []

for a in alphas:
    ridge = linear_model.Ridge(alpha=a)
    ridge.fit(Xsample, ysample)
    ysample_predict = ridge.predict(Xsample)
    rmse = np.sqrt(mean_squared_error(ysample, ysample_predict))  # square root of the MSE gives the RMSE
    r2 = ridge.score(Xsample,ysample)
    
    rmses.append(rmse)
    r2s.append(r2)

In [None]:
# plt.plot(alphas, rmses)
plt.plot(list(range(20)), rmses)
plt.show()

In [None]:
plt.plot(list(range(20)), r2s)
plt.show()

# Logistic Regression

The Logistic Regression model is typically used to fit or predict a binary variable, or a variable whose value lies between 0 and 1 (such as a probability)
$$
y = \sigma(z) = \frac{1}{1+e^{-z}} = \frac{1}{1+e^{-(\theta_0 + \theta_1x_1 + \theta_2x_2 + ... + \theta_nx_n)}}
$$

$\sigma(z)$ is also called the `sigmoid` function. It has the following property:

$$
\frac{d\sigma(z)}{dz} = \sigma(z)(1-\sigma(z))
$$
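
The derivative identity can be checked numerically. This is a small sketch that compares a central finite difference against $\sigma(z)(1-\sigma(z))$ on a few points; the step size `h` and the grid are arbitrary choices.

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-6, 6, 13)
h = 1e-5  # finite-difference step (arbitrary small value)

numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)   # central difference
analytic = sigmoid(z) * (1 - sigmoid(z))                # closed-form derivative

print(np.max(np.abs(numeric - analytic)))
```

The largest difference should be tiny, limited only by floating-point and finite-difference error. The derivative peaks at $z = 0$, where $\sigma(0) = 0.5$.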

![](https://upload.wikimedia.org/wikipedia/commons/thumb/8/88/Logistic-curve.svg/1200px-Logistic-curve.svg.png)

Linear Regression vs. Logistic Regression:
![](https://static.javatpoint.com/tutorial/machine-learning/images/linear-regression-vs-logistic-regression.png)
![](https://saedsayad.com/images/LogReg_1.png)

**Example**
* We'll look at the Pima Indian Diabetes dataset: https://www.kaggle.com/uciml/pima-indians-diabetes-database
* We are going to predict diabetes using Logistic Regression Classifier

## Prepare the data

In [None]:
# load dataset
pima = pd.read_csv("../data/diabetes.csv", header=0, sep=',')
pima.head(3)

In [None]:
pima.columns

In [None]:
X = pima[pima.columns[:-1]]
y = pima.Outcome

In [None]:
X.head(3)

In [None]:
X.shape

Reference: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

## Train-Test split

In [None]:
# split X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=123)

## Fit the model

In [None]:
# import the class
from sklearn.linear_model import LogisticRegression

# instantiate the model (using the default parameters)
logreg = LogisticRegression()

# fit the model with data
logreg.fit(X_train,y_train)

# predict values in test dataset
y_pred=logreg.predict(X_test)

In [None]:
y_pred

Reference: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html

## Evaluate the model

In [None]:
# import the metrics class
from sklearn import metrics
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
cnf_matrix

In [None]:
class_names=[0,1] # name  of classes
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
# create heatmap
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu" ,fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')

In [None]:
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
print("Precision:",metrics.precision_score(y_test, y_pred))
print("Recall:",metrics.recall_score(y_test, y_pred))
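
These three metrics follow directly from the four cells of the confusion matrix. The sketch below uses a hypothetical matrix (the counts are made up, not the Pima results) in scikit-learn's layout, where rows are actual classes and columns are predicted classes.

```python
import numpy as np

# Hypothetical confusion matrix: rows = actual class, columns = predicted class
cm = np.array([[50, 10],
               [ 5, 35]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all predictions that are correct
precision = tp / (tp + fp)        # share of predicted positives that are truly positive
recall = tp / (tp + fn)           # share of actual positives that are found

print(accuracy, precision, recall)
```

Precision and recall both ignore the true negatives, which is why they are more informative than accuracy when the positive class is rare.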

In [None]:
logreg.score(X_test, y_test) # also returns the accuracy

In [None]:
# ROC curve
y_pred_proba = logreg.predict_proba(X_test)[:,1]
fpr, tpr, _ = metrics.roc_curve(y_test,  y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr,tpr,label="auc="+str(auc))
plt.legend(loc=4)
plt.show()

## More

In [None]:
logreg.intercept_, logreg.coef_

In [None]:
pima.head(3)

$$
p = \sigma(z) = \frac{1}{1+e^{-z}} = \frac{1}{1+e^{-(\theta_0 + \theta_1x_1 + \theta_2x_2 + ... + \theta_nx_n)}}
$$

We can rewrite this relation in terms of the log-odds (logit):

$$
\log\left(\frac{p}{1-p}\right) = \theta_0 + \theta_1x_1 + \theta_2x_2 + ... + \theta_nx_n
$$
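
This relation can be verified on a fitted model: the log-odds computed from `predict_proba` should match the linear combination built from `intercept_` and `coef_`. The sketch below uses a synthetic dataset from `make_classification` rather than the Pima data, and the `_demo` names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification problem (not the Pima data)
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
clf = LogisticRegression().fit(X_demo, y_demo)

# Left-hand side: log-odds from the predicted probability of class 1
p = clf.predict_proba(X_demo)[:, 1]
log_odds = np.log(p / (1 - p))

# Right-hand side: theta_0 + theta_1 * x_1 + ... + theta_n * x_n
linear_part = clf.intercept_[0] + X_demo @ clf.coef_[0]

print(np.allclose(log_odds, linear_part))
```

Because of this relation, each coefficient can be read as the change in log-odds for a one-unit increase in the corresponding feature, holding the others fixed.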

In [None]:
df_coef = pd.DataFrame({
    'feature': X.columns,
    'coef': logreg.coef_[0]
})

In [None]:
df_coef