<h1 style="text-align: center;" markdown="1">Intro to Data Science and Machine Learning Using Python</h1> 
<h2 style="text-align: center;" markdown="2">Nov 30 - Dec 1, 2019</h2>
<h3 style="text-align: center;" markdown="3">Lab on Linear Regression</h3>


> *This notebook is part of the Workshop on Introduction to Data Science and Machine Learning with Python, a 2-day workshop organized by NAAMII. The objective of this notebook is to use Linear Regression to predict district-wise poverty rates in Nepal using predictor variables such as literacy rate, household size, and share of kids. The data used in this exercise is part of Census 2011, conducted by the Central Bureau of Statistics (CBS), Nepal.*


# Table of Contents

[Introduction](#introduction)  
&nbsp;&nbsp;&nbsp;[Explanation and Theory](#introduction-theory)  
[Data Setup](#setup)  
[Two Dimensional Linear Regression](#1dlinreg)   
[Scikit-Learn Linear Regression](#sklearn-linreg)    
[Summary](#summary)

# Linear Regression Introduction <a class="anchor" id="introduction"></a>

Linear regression is one of the simplest and best-known machine learning algorithms for regression. It approximates the relationship between two or more variables using a straight line (or, with several predictors, a plane or hyperplane).

It assumes that there is a linear relationship between the independent variables and the dependent variable. In many real datasets, particularly high-dimensional ones, this may not hold, so linear regression is not always the best choice.



## Explanation and Theory <a class="anchor" id="introduction-theory"></a>

**Assumptions and properties.** 
Suppose we have a data set that consists of $n$ samples and $m$ features:

\begin{equation}
\label{eqn:samples}
\mathbf{X} = \{\mathbf{x_1}, \mathbf{x_2}, \ldots, \mathbf{x_n}\}, \quad   \mathbf{x_i} \in \mathbb{R}^{m}
\end{equation}

where

\begin{equation}
\label{eqn:features}
\mathbf{x_i} =  (x_i^{(1)}, x_i^{(2)}, \ldots, x_i^{(m)} ) ^T
\end{equation}

The target variable is given by
\begin{equation}
\label{eqn:target}
\mathbf{Y}= \{y_1, y_2, \ldots, y_n \}, \quad \text{where} \,\,\, y_i \in (0,1)
\end{equation}

**Algorithm and Training.** Linear regression takes the form of a linear model:

$$f(i)=\beta_0+\beta_1x_{i}^{(1)}+...+\beta_m x_{i}^{(m)} $$

where $\beta_0,...,\beta_m$ are the regression coefficients or weights: $\beta_0$ is the intercept, and $\beta_1,...,\beta_m$ are the weights assigned to the features $x_i^{(1)},...,x_i^{(m)}$. For each data point $i$, a pseudo-variable $x_i^{(0)}=1$ is added to correspond to the intercept coefficient $\beta_0$. This allows us to write the model in vector form as:
$$f(i)=\boldsymbol{\beta}\cdot\mathbf{x_i}$$

When we train the linear regression model, we are trying to find the values of $\boldsymbol{\beta}$ that best match the data, i.e. that minimize the following error functional or **objective function**:
$$ \frac{1}{2} \sum_{i=1}^{n} (y_i - f(i))^2$$

There are several techniques to solve this minimization problem, such as [gradient descent](https://en.wikipedia.org/wiki/Gradient_descent).
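
To make the minimization concrete, here is a minimal sketch of batch gradient descent on the objective function above. It uses synthetic data (not the census data), and the step size and iteration count are hand-picked assumptions for this toy example. The result is compared with NumPy's least-squares solver as a reference.

```python
import numpy as np

# Synthetic data (hypothetical, not the census data): y = 0.8 - 0.9 x + noise
rng = np.random.default_rng(0)
x = rng.uniform(0.3, 0.9, size=50)
y = 0.8 - 0.9 * x + rng.normal(0.0, 0.02, size=50)

# Prepend the pseudo-variable x^(0) = 1 so that beta[0] acts as the intercept
X = np.column_stack([np.ones_like(x), x])

# Batch gradient descent on 1/2 * sum((y - X @ beta)**2)
beta = np.zeros(2)
learning_rate = 0.01  # illustrative step size, chosen by hand for this data
for _ in range(20000):
    residual = y - X @ beta
    gradient = -X.T @ residual  # gradient of the objective with respect to beta
    beta -= learning_rate * gradient

# Reference solution from NumPy's least-squares solver
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta, beta_lstsq)
```

Both approaches should arrive at essentially the same intercept and slope; gradient descent just gets there iteratively.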

# Data setup<a class="anchor" id="setup"></a>

First, we'll load the libraries that will be used throughout this notebook.

In [0]:
import pandas as pd
from matplotlib import pyplot as plt
%matplotlib inline
import seaborn as sns
import random

In [0]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

Next, we load the data. The data for this notebook has already been cleaned and organized, so we simply need to import the data file into a dataframe.

In [0]:
#copy path of your file inside quotes
df = pd.read_csv("")


In [0]:
#Write the code to display the first 5 rows of the dataframe


In [0]:
#Write the code to display the number of rows and columns of the dataframe


In [0]:
#rename the column 'Unnamed: 0' to 'district'
df.rename(columns = {'Unnamed: 0': 'district'},inplace=True)

# Two Dimensional Linear Regression<a class="anchor" id="1dlinreg"></a>

We will begin by using linear regression to estimate poverty rates for each district using only one independent variable - literacy rate. Let's begin by making a scatter plot of the Poverty rate vs. Literacy rate to get a sense of how they depend on one another. 

In [0]:
plt.scatter(df['literacy rate'],df['poverty rate'])
plt.xlabel('Literacy Rate')
plt.ylabel('Poverty Rate')
plt.title('Poverty Rate vs. Literacy Rate in Districts of Nepal')

It looks like the Poverty rate is inversely related to Literacy rate in a somewhat linear way. We can quantify this relationship by finding the correlation between the two variables.
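
As a reminder of what Pearson's correlation measures, the sketch below computes it directly from its definition on a handful of made-up values (these numbers are hypothetical, not taken from the census data) and checks the result against `np.corrcoef`.

```python
import numpy as np

x = np.array([0.45, 0.55, 0.60, 0.70, 0.80])  # hypothetical literacy rates
y = np.array([0.48, 0.35, 0.33, 0.20, 0.12])  # hypothetical poverty rates

# Pearson's r: covariance of x and y divided by the product of their spreads
dx = x - x.mean()
dy = y - y.mean()
r = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# NumPy's built-in correlation matrix; entry [0, 1] is the correlation of x with y
r_numpy = np.corrcoef(x, y)[0, 1]
print(r, r_numpy)
```

A value close to -1 indicates a strong inverse linear relationship, and a value close to 0 indicates little or no linear relationship.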

In [0]:
#Write the code to find the Pearson correlation between literacy rate and poverty rate


In [0]:
df.head()

In [0]:
#Write the code to display the last 5 rows of the dataframe


Let us remove Achham and Udayapur for now. We will then use the model we create to predict the poverty rates of Achham and Udayapur.



In [0]:
#Write the code to get rid of the first and last rows of the dataframe and call the new dataframe df_2d
df_2d = 

In [0]:
df_2d.shape

In two dimensions, the equation of the line is 
$$y=mx+b $$
where $b$ is the intercept and $m$ is the slope of the line. The linear regression algorithm gives us the optimal values of the intercept and the slope (in two dimensions). The $x$ and $y$ values are fixed, since they are the data; the values we can control are the intercept ($b$) and the slope ($m$). Different values of the intercept and slope give different straight lines, and linear regression returns the one line, among all of them, that results in the least squared error on the data points.
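
In two dimensions the least-squares slope and intercept have a simple closed form, so no search over lines is actually needed. The sketch below applies it to a few made-up values (hypothetical, not the census data) and checks that shifting the slope or the intercept away from the solution increases the squared error. The helper `sse` is introduced here only for illustration.

```python
import numpy as np

x = np.array([0.45, 0.55, 0.60, 0.70, 0.80])  # hypothetical literacy rates
y = np.array([0.48, 0.35, 0.33, 0.20, 0.12])  # hypothetical poverty rates

# Closed-form least-squares estimates for a single predictor
m_hat = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
b_hat = y.mean() - m_hat * x.mean()

def sse(m, b):
    """Sum of squared errors of the line y = m*x + b on the toy data."""
    return ((y - (m * x + b)) ** 2).sum()

# The fitted line should have a smaller error than nearby lines
print(sse(m_hat, b_hat), sse(m_hat + 0.1, b_hat), sse(m_hat, b_hat + 0.05))
```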

In [0]:
model= LinearRegression()

In [0]:
#Fill in the x and y values for fit (Note that we want to predict the poverty rates given the literacy rates)
model.fit(,)

In [0]:
#Fill in the x values for predict (Note that we want to predict the poverty rates given the literacy rates)
predictions = model.predict()

In [0]:
df_2d['predictions'] = predictions

In [0]:
df_2d.head()

In [0]:
#Plotting the regression line and the data in one graph
sns.lineplot(x='literacy rate', y = 'predictions', data=df_2d, color='red')
sns.scatterplot(x='literacy rate', y = 'poverty rate', data=df_2d)

As discussed earlier, the linear regression model finds the values of the intercept and slope that give the line that best fits the data. To see the values of the intercept and slope calculated by the linear regression algorithm for our dataset, execute the following code.

In [0]:
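#The indexing below assumes y was passed to fit() as a one-column DataFrame, e.g. df_2d[['poverty rate']]
#To retrieve the intercept:
b = model.intercept_[0]
print(b)
#For retrieving the slope:
m = model.coef_[0][0]
print(m)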

So the equation of the line that best fits the data is 
$$ y = -0.94631567 x + 0.82211828 $$
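
The slope and intercept are all that is needed to make a prediction by hand. The sketch below fits a model on a few made-up values (hypothetical, not the census data; the `toy_` names are introduced only for this example) and checks that $mx+b$ agrees with `predict`. It also shows the shapes of `coef_` and `intercept_` when the target is passed as a one-column DataFrame.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy values, not taken from the census data
toy = pd.DataFrame({
    "literacy rate": [0.45, 0.55, 0.60, 0.70, 0.80],
    "poverty rate": [0.48, 0.35, 0.33, 0.20, 0.12],
})

toy_model = LinearRegression()
toy_model.fit(toy[["literacy rate"]], toy[["poverty rate"]])  # 2-D X and 2-D y

# With a one-column DataFrame as y, intercept_ has shape (1,) and coef_ has shape (1, 1)
b_toy = toy_model.intercept_[0]
m_toy = toy_model.coef_[0][0]

# Prediction by hand versus prediction by the model for a literacy rate of 0.5
by_hand = m_toy * 0.5 + b_toy
by_model = toy_model.predict(pd.DataFrame({"literacy rate": [0.5]}))[0][0]
print(by_hand, by_model)
```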

For Achham, the predicted poverty rate is

In [0]:
round(m*0.476151+b , 3)

However, the actual poverty rate was 0.472.

For Udayapur, the predicted poverty rate is

In [0]:
#Write the code for finding the poverty rate for Udayapur


However, the actual poverty rate was 0.259.

The score of a linear regression is tied to how strongly the variables are correlated. For linear regression with a single predictor, the coefficient of determination ($R^2$) is the square of the Pearson correlation between the predictor and the target.
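
The relationship between $R^2$ and the correlation can be checked numerically. The sketch below uses synthetic data (not the census data) with a single predictor and compares the model's `score`, evaluated on the same data it was fitted on, with the squared Pearson correlation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic single-predictor data (hypothetical)
rng = np.random.default_rng(1)
x = rng.uniform(0.3, 0.9, size=40)
y = 0.8 - 0.9 * x + rng.normal(0.0, 0.05, size=40)

# score() returns the coefficient of determination R^2
model_1d = LinearRegression().fit(x.reshape(-1, 1), y)
r_squared = model_1d.score(x.reshape(-1, 1), y)

# Pearson correlation between predictor and target
r = np.corrcoef(x, y)[0, 1]
print(r_squared, r ** 2)
```

Note that this equality holds for a single predictor and for the data the model was fitted on; with several predictors, $R^2$ is no longer the square of any one pairwise correlation.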

# Scikit-Learn Linear Regression <a class="anchor" id="sklearn-linreg"></a>

Scikit-Learn is one of the most popular Python machine learning libraries available. It contains a multitude of powerful tools that make it an ideal choice for use in production environments. It provides a robust, standardized API and comprehensive documentation, so once a user becomes familiar with the structure of building models with the package, it is fairly easy to apply it to other models.

For this reason, we will utilize Scikit-Learn throughout the remainder of this workshop. 

Scikit-Learn's [LinearRegression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) model fits the coefficients by ordinary least squares, i.e. it minimizes the sum of squared errors described earlier, and handles details such as the intercept term behind the scenes.

We will use four predictor variables - 'hhsize', 'literacy rate', 'nkid06' and 'nkid715' - to predict the 'poverty rate', where -

* `literacy rate` -  ratio of literate population and total population in the district

* `nkid06` - average share of kids of ages between 1 and 6 in the households of the district

* `nkid715` - average share of kids of ages between 7 and 15 in the households of the district



In [0]:
df_predict = df[['literacy rate','district','nkid06','nkid715','hhsize']]
df_y = df[['poverty rate']]

In [0]:
df_predict.shape

In [0]:
df_predict = df_predict.dropna()
df_y = df_y.loc[df_predict.index]  #keep the targets aligned with the rows that remain

In [0]:
df_predict.isnull().any()

In [0]:
#Write the code to drop the column 'district' from the dataframe df_predict


Let us split the data into training and testing sets. The `test_size` parameter controls the fraction of the data set that is randomly held out for testing. Fixing the `random_state` parameter guarantees that the output of Run 1 will be equal to the output of Run 2, i.e. the split will always be the same.
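
The effect of these two parameters can be seen on a small made-up array (the `X_toy` and `y_toy` names are hypothetical and used only for this sketch): with 10 rows and `test_size=0.2`, 2 rows are held out, and repeating the call with the same `random_state` gives the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples with 2 features each
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# Two independent calls with the same random_state
a_train, a_test, _, _ = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)
b_train, b_test, _, _ = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

print(a_train.shape, a_test.shape)     # 8 rows for training, 2 rows for testing
print(np.array_equal(a_test, b_test))  # True: the same rows are held out both times
```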

In [0]:
X_train, X_test, y_train, y_test = train_test_split(df_predict, df_y, test_size=0.20, random_state=42)

In [0]:
# Let us look at the shape of our training and testing splits
X_train.shape, X_test.shape, y_train.shape, y_test.shape

In [0]:
model= LinearRegression()

In [0]:
model.fit(X_train,y_train)

In [0]:
predictions = model.predict(X_test)

In [0]:
df_compare = X_test.reset_index(drop=True)

In [0]:
df_compare['Actual'] = y_test.values

In [0]:
df_compare['Predicted'] = predictions

In [0]:
df_compare.head()

In [0]:
df_compare.info()

In [0]:
#Write the code to plot a scatterplot of the 'Actual' and the 'Predicted' values from df_compare

The evaluation of Linear Regression can also be done by examining the residuals, which are the differences between the actual and the predicted values. For a good fit, the residuals should be close to zero and show no clear pattern when plotted against the predicted values.

In [0]:
df_compare['residuals'] = df_compare['Actual'] - df_compare['Predicted']

In [0]:
sns.scatterplot(x='Predicted', y = 'residuals', data=df_compare)
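
Besides plotting them, the residuals can be summarized with a few numbers. The sketch below is one possible way to do this, using made-up actual and predicted values (hypothetical, not results from this notebook) and the metric functions from `sklearn.metrics`.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical actual and predicted poverty rates
actual = np.array([0.30, 0.25, 0.40, 0.15])
predicted = np.array([0.28, 0.27, 0.35, 0.18])

residuals = actual - predicted

mae = mean_absolute_error(actual, predicted)           # average size of a residual
rmse = np.sqrt(mean_squared_error(actual, predicted))  # penalizes large residuals more
r2 = r2_score(actual, predicted)                       # 1.0 would be a perfect fit
print(mae, rmse, r2)
```

The same functions could be applied to `df_compare['Actual']` and `df_compare['Predicted']` once those columns exist.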

# Linear Regression Summary <a class="anchor" id="summary"></a>
In this notebook, we have covered the basic concepts of linear regression and applied it to estimating poverty rates using several socio-economic variables.