<font color="green">*To start working on this notebook, or any other notebook that we will use in the Moringa Data Science Course, we will need to save our own copy of it. We can do this by clicking File > Save a Copy in Drive. We will then be able to make edits to our own copy of this notebook.*</font>

# Python Programming: Ridge Regression

## 1.0 Example 

In [1]:
# Example 
# ---
# Regularization penalizes the coefficients of a model, either by removing variables entirely or by reducing their impact.
# Ridge regression shrinks the coefficients of problematic variables toward zero but never fully removes them.
# ---
# Question: Build a regression model to predict expenses based on the variables available.
# ---
# Dataset source: Pydataset Library: VietNamI Dataset
# ---
#
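
The shrinking behaviour described above can be seen in the simplest possible case. The sketch below is an illustration only: it assumes a single feature, no intercept, and a tiny made-up dataset (`xs`, `ys`), and it uses the closed-form ridge solution for that case. The helper name `ridge_slope` is hypothetical and is not part of the notebook.

```python
# Illustrative sketch: one-feature ridge regression without an intercept.
# Minimizing sum((y - w*x)^2) + alpha * w^2 gives w = sum(x*y) / (sum(x^2) + alpha).
# The data below is made up for demonstration purposes.

def ridge_slope(xs, ys, alpha):
    """Closed-form ridge coefficient for a single feature (hypothetical helper)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + alpha)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

# alpha = 0 is ordinary least squares; larger alphas penalize the coefficient more
alphas = [0.0, 1.0, 10.0, 100.0, 1000.0]
slopes = [ridge_slope(xs, ys, a) for a in alphas]

for a, w in zip(alphas, slopes):
    print(f"alpha={a:>7}: slope={w:.4f}")
```

As alpha grows, the slope moves closer to zero but stays positive, which is the "reduced but never removed" behaviour of the ridge penalty.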

In [1]:
# Importing our libraries
# 
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

In [2]:
# Install pydataset first (run !pip install pydataset in a notebook cell), then import it so that we can use a dataset from the package
# 
from pydataset import data 

In [5]:
# Data Preparation
# 

# Loading the data and converting the sex variable to a dummy variable (male = 0, female = 1)
#
df = pd.DataFrame(data('VietNamI'))
df.loc[df.sex== 'male', 'sex'] = 0
df.loc[df.sex== 'female','sex'] = 1
df['sex'] = df['sex'].astype(int)

# Setting up our X and y datasets
#
X = df[['pharvis','age','sex','married','educ','illness','injury','illdays','actdays','insurance']]
y = df['lnhhexp']
y

1        2.730363
2        2.737248
3        2.266935
4        2.392753
5        3.105335
           ...   
27762    1.847290
27763    2.461460
27764    2.460262
27765    1.920169
27766    2.468833
Name: lnhhexp, Length: 27765, dtype: float64

In [6]:
# Creating our baseline regression model
# This is a model that has no regularization to it
# 
regression = LinearRegression()
regression.fit(X,y)
first_model = (mean_squared_error(y_true=y,y_pred=regression.predict(X)))
print(first_model)

# The output value of 0.355289 (the mean squared error on the training data) will be our benchmark for judging whether the regularized ridge regression model is superior or not.

0.35528915032173053


In [7]:
# In order to create our ridge model, we first need to determine the most appropriate strength for the L2 regularization. 
# L2 is the type of penalty used in ridge regression; in scikit-learn its strength is set by the hyperparameter alpha. 
# We choose the value of this hyperparameter with a grid search. 
# In the code below, we first create our ridge model and indicate normalization in order to get better estimates. 
# Note: the normalize argument was deprecated in scikit-learn 1.0 and removed in 1.2, so this cell needs an older version; 
# on newer versions, scale the features in a pipeline instead (the results will not be numerically identical). 
# Next we set up the grid that we will use. 
# The search object has several arguments within it. Alpha is the hyperparameter we are trying to set. 
# np.logspace(-5, 2, 8) gives the values we want to test: 
# 8 values evenly spaced on a log scale from 10^-5 to 10^2. 
# Our metric is the negative mean squared error (scikit-learn maximizes scores, so the error is negated). 
# refit=True means the best model is refit on the whole dataset once the search is done, 
# and cv is the number of folds to use for the cross-validation. 
#
ridge = Ridge(normalize=True)
search = GridSearchCV(estimator=ridge,param_grid={'alpha':np.logspace(-5,2,8)},scoring='neg_mean_squared_error',n_jobs=1,refit=True,cv=10)
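
For readers on a scikit-learn release where `Ridge(normalize=True)` is no longer accepted, one possible substitute is to standardize the features inside a pipeline and search over alpha as before. The sketch below is an assumption-laden illustration, not the notebook's method: it uses synthetic data (`X_demo`, `y_demo`) in place of the VietNamI data, and `StandardScaler` is not numerically equivalent to the old `normalize` option, so the selected alpha and scores would differ.

```python
# Sketch: tuning the ridge alpha with feature scaling done in a pipeline.
# Synthetic data stands in for X and y (an assumption for illustration).
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
true_coef = np.array([1.0, 0.5, 0.0, -0.5, 2.0])
y_demo = X_demo @ true_coef + rng.normal(scale=0.5, size=200)

# The candidate alphas: 8 values evenly spaced on a log scale from 1e-5 to 1e2
alphas = np.logspace(-5, 2, 8)
print(alphas)

# make_pipeline names each step after its lowercased class name,
# which is why the grid key is "ridge__alpha"
pipeline = make_pipeline(StandardScaler(), Ridge())
search_demo = GridSearchCV(
    estimator=pipeline,
    param_grid={"ridge__alpha": alphas},
    scoring="neg_mean_squared_error",
    cv=10,
    refit=True,
)
search_demo.fit(X_demo, y_demo)

print(search_demo.best_params_)
print(abs(search_demo.best_score_))
```

Keeping the scaler inside the pipeline means it is refit on each training fold, so no information from the held-out fold leaks into the scaling.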

In [8]:
# We now use the .fit method to run the search and then use the .best_params_ and
#  .best_score_ attributes to determine the model's strength. 
# 
search.fit(X,y)
print(search.best_params_)   # {'alpha': 0.01}
abs(search.best_score_) 

# The best_params_ tells us what to set alpha to, which in this case is 0.01. 
# The best_score_ is the best mean cross-validated score; its absolute value is the corresponding mean squared error. 
# In this case, the value of 0.38 is worse than the baseline model's 0.355, 
# although the baseline figure was computed on the training data rather than cross-validated. 

0.38013256937541345
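
The 0.38 above is a cross-validated error, while the baseline's 0.355 was measured on the data the model was trained on, so the two are not directly comparable. A like-for-like check would cross-validate the baseline as well. The sketch below shows one way to do that with `cross_val_score`; it runs on synthetic stand-in data (`X_demo`, `y_demo` are assumptions), and in the notebook the same calls would take `X` and `y`.

```python
# Sketch: training-set MSE versus 10-fold cross-validated MSE for a plain linear regression.
# Synthetic data is used so the example is self-contained.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo @ np.array([1.0, 0.5, 0.0, -0.5, 2.0]) + rng.normal(scale=0.5, size=200)

baseline = LinearRegression()

# Error measured on the same rows the model was fitted on
baseline.fit(X_demo, y_demo)
train_mse = mean_squared_error(y_demo, baseline.predict(X_demo))

# Error measured on held-out folds, using the same metric and fold count as the grid search
cv_scores = cross_val_score(
    baseline, X_demo, y_demo, scoring="neg_mean_squared_error", cv=10
)
cv_mse = abs(cv_scores.mean())

print(f"training MSE:        {train_mse:.4f}")
print(f"cross-validated MSE: {cv_mse:.4f}")
```

Comparing the cross-validated baseline error with `abs(search.best_score_)` puts both models on the same footing.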

In [9]:
# We can check this by fitting a ridge model with the selected alpha and finding the mean squared error on the training data below
#
ridge = Ridge(normalize=True,alpha=0.01)
ridge.fit(X,y)
second_model = (mean_squared_error(y_true=y,y_pred=ridge.predict(X)))
print(second_model)

0.35529321992606566


In [10]:
# The 0.35 is lower than the 0.38. This is because this last result is computed on the training data and is not cross-validated. 
# In addition, these results indicate that there is little difference between the ridge and baseline models. 
# This can be checked by comparing the coefficients of each model; the baseline coefficients are found below.
# 
coef_dict_baseline = {}
# Pair each coefficient with the feature it belongs to. We use X.columns rather than the
# full dataset's columns, which also include the target (lnhhexp) and would misalign the labels.
for coef, feat in zip(regression.coef_, X.columns):
    coef_dict_baseline[feat] = coef
coef_dict_baseline

# The ridge coefficients (ridge.coef_) are about the same. This means that the penalization made little difference with this dataset.

{'pharvis': 0.013282050886951472,
 'age': 0.06480086550467927,
 'sex': 0.0040124122787959186,
 'married': -0.08739614349708912,
 'educ': 0.07527646383836173,
 'illness': -0.0618092130060028,
 'injury': 0.04087038457896277,
 'illdays': -0.002763768716569054,
 'actdays': -0.0067170633108931,
 'insurance': 0.14687843649771162}
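
To see how close two sets of coefficients are, it helps to print them side by side. The sketch below does this for an unregularized model and a ridge model with a small alpha. It is an illustration on synthetic data, and the feature names `f1` to `f3` are hypothetical; in the notebook, the same comparison would use `regression.coef_`, `ridge.coef_` and `X.columns`.

```python
# Sketch: side-by-side comparison of baseline and ridge coefficients on synthetic data.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
feature_names = ["f1", "f2", "f3"]  # hypothetical names for illustration
X_demo = rng.normal(size=(500, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=500)

baseline = LinearRegression().fit(X_demo, y_demo)
ridge_demo = Ridge(alpha=0.01).fit(X_demo, y_demo)

# Map each feature name to its (baseline, ridge) coefficient pair
coef_table = {
    name: (float(b), float(r))
    for name, b, r in zip(feature_names, baseline.coef_, ridge_demo.coef_)
}

for name, (b, r) in coef_table.items():
    print(f"{name}: baseline={b:.6f}  ridge={r:.6f}  diff={abs(b - r):.2e}")

# Largest absolute gap between the two models' coefficients
max_diff = max(abs(b - r) for b, r in coef_table.values())
print(f"largest difference: {max_diff:.2e}")
```

A very small largest difference indicates that the penalty barely changed the fit, which is the situation the notebook describes for alpha = 0.01.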

## 2.0 Challenges

### <font color="green">Challenge 1</font>

In [None]:
# Challenge 1 
# ---
# Question: Build an accurate model that can estimate the weight of fish given the following dataset.
# ---
# Dataset url = http://bit.ly/FishDataset
# ---
# 
OUR CODE GOES HERE

### <font color="green">Challenge 2</font>

In [None]:
# Challenge 2
# ---
# Question: Build a regression algorithm for predicting unemployment within an economy.
# ---
# Dataset url = http://bit.ly/EconomicDataset
# ---
# Dataset Info
# 1. date, month of data collection
# 2. psavert, personal savings rate
# 3. pce, personal consumption expenditures, in billions of dollars
# 4. unemploy, number of unemployed in thousands 
# 5. empmed, median duration of unemployment, in weeks
# 6. pop, total population, in thousands
# ---
# 
OUR CODE GOES HERE

### <font color="green">Challenge 3</font>

In [None]:
# Challenge 3
# ---
# Question: Build a regression model to predict the life expectancy of a country. 
# Apply ridge regression to your model.
# ---
# Dataset url = http://bit.ly/LifeExpectancyDataset
# ---
# Dataset Info:
# Country: Country
# Year: Year
# Status: Developed or Developing status
# Life expectancy: Life expectancy in years
# Adult Mortality: Adult Mortality Rates of both sexes (probability of dying between 15 and 60 years per 1000 population)
# infant deaths: Number of Infant Deaths per 1000 population
# Alcohol: Alcohol, recorded per capita (15+) consumption (in litres of pure alcohol)
# percentage expenditure: Expenditure on health as a percentage of Gross Domestic Product per capita (%)
# Hepatitis B: Hepatitis B (HepB) immunization coverage among 1-year-olds (%)
# Measles: Number of reported cases per 1000 population
# BMI: Average Body Mass Index of entire population
# under-five deaths: Number of under-five deaths per 1000 population
# Polio: Polio (Pol3) immunization coverage among 1-year-olds (%)
# Total expenditure: General government expenditure on health as a percentage of total government expenditure (%)
# Diphtheria: Diphtheria tetanus toxoid and pertussis (DTP3) immunization coverage among 1-year-olds (%)
# HIV/AIDS: Deaths per 1 000 live births HIV/AIDS (0-4 years)
# GDP: Gross Domestic Product per capita (in USD)
# Population: Population of the country
# thinness 1-19 years: Prevalence of thinness among children and adolescents for Age 10 to 19 (%)
# thinness 5-9 years: Prevalence of thinness among children for Age 5 to 9 (%)
# Income composition of resources: Human Development Index in terms of income composition of resources (index ranging from 0 to 1)
# Schooling: Number of years of Schooling (years)
# ---
# 
OUR CODE GOES HERE

### <font color="green">Challenge 4</font>

In [None]:
# Challenge 4
# ---
# Question: Given the beauty dataset below, build a ridge regression model to predict wages.
# ---
# Dataset url = http://bit.ly/BeautyDataset
# ---
# 
OUR CODE GOES HERE

### <font color="green">Challenge 5</font>

In [None]:
# Challenge 5
# ---
# Question: Create a regression model to predict sales prices. 
# Apply regularization techniques.
# ---
# Dataset source = http://bit.ly/HousePricesDataset
# ---
# 
OUR CODE GOES HERE