## Module 7: Supervised Machine Learning (SML) and Linear Regression

<b>What is Machine Learning?</b>

Machine learning is the field of study that focuses on giving computers the ability to learn without being explicitly programmed. 

In Supervised Machine Learning (SML), models (also called algorithms) learn from labeled data. Learning from the data means picking up the patterns and relationships that exist within your dataset and between variables. Once the model has learned these patterns, it can apply them to unlabeled data. In other words, the model takes what it has learned and applies it in a setting where it isn't told the answer, much like a student who learns through lessons (examples and exercises where the correct answer is provided) and is then expected to apply what they've learned on a test (where answers are not provided). Once your model has "taken the test", it is up to you to grade it to determine how well it performed. 

There are two types of SML algorithms: regression and classification. 
* <b>Regression</b> predicts continuous values (e.g. sales).
* <b>Classification</b> predicts discrete outcomes (e.g. survived/died).

***************************************

<LEFT><img src='https://miro.medium.com/max/1164/1*589X2eXJJkatGRG-z-s_oA.png'></LEFT>
***************************************

Today's lesson will focus on Linear Regression. As in Linear Regression with Statsmodels (Module 6), we are interested in creating a linear regression model that fits our data well. However, unlike in Module 6, we aren't only interested in learning about the relationships within the dataset we have in front of us -- we are also interested in creating a linear regression model that can help us make predictions with data we don't yet have, or data we can anticipate getting in the future. To do this, there are several new steps that we will have to complete, including:
* Cleaning up our data
* Handling categorical variables
* Dividing our data into two sets of data based on the dependent and independent variables
* Determining the characteristics of our linear regression model
* Splitting our data into subsets
* Training our model
* Testing our model
* And finally, evaluating how well our model did at predicting outcomes
***************************************
Creating a linear regression model is typically done with the goal of making <b>accurate</b> predictions about your dependent variable. 

Using the data regarding student grades, we are hoping to predict the final grade of a student given various characteristics. Our goal is to not only better understand the relationship between student characteristics and their final grade, but to create a model that can accurately predict the grade a student will receive given several other factors. 

If a teacher created a model, based on historic data, that could accurately predict final grades, they could input student information into the model during the middle of the term to predict how well a student is expected to perform. With this information, they can intervene early if a student is on the path to performing poorly in the course. 

This is only one example, but forecasting future trends and predicting specific outcomes is the main objective of linear regression. However, the full utility of your model will not be evident immediately. Instead of working through all of the uses of your model, this lesson covers how to create an accurate linear regression model that fits your data.
***************************************

***************************************
<b>Step 1: Importing necessary libraries and data</b>
***************************************

In [1]:
# libraries to work with data
import pandas as pd
import numpy as np

# libraries to visualize data
import seaborn as sns
from matplotlib import pyplot as plt
%matplotlib inline

# statsmodels linear regression library
import statsmodels.formula.api as smf

# import the student grade dataset
df = pd.read_csv("gradedata2.csv")

# drop the variables that you won't include in model
df = df.drop(["fname","lname","address"], axis = 1)

# review the first few rows of your data
df.head()

Unnamed: 0,gender,age,exercise,hours,grade
0,female,17,3,10,82.4
1,male,18,4,4,78.2
2,male,18,5,9,79.3
3,female,14,2,7,83.2
4,female,18,4,15,87.4


***************************************
<b>Step 2: Handling categorical variables</b>
***************************************
When you use the sklearn linear regression functions, you <u>cannot</u> include categorical variables in your model unless you convert them to numeric variables. 

Dummy variables convert categorical variables into a series of zeros and ones, which makes working with these variables a lot easier. When you convert your variables to dummy variables, "1" represents the presence of a factor and "0" represents the absence of a factor. Converting a variable in this manner allows us to statistically treat categorical variables like numeric variables. It is good practice to convert ALL your categorical variables to dummy variables prior to developing your regression model.

When you create dummy variables, you will exclude one category of each categorical variable from your dataset (this is what drop_first = True does). The excluded category becomes the reference category: the coefficients of the remaining dummy variables are interpreted relative to it.
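
To make the reference category concrete, here is a minimal sketch on a made-up column with three categories. The `toy` DataFrame and its `year` column are assumptions for illustration only and are not part of the grade dataset.

```python
import pandas as pd

# hypothetical toy data: one categorical column with three categories
toy = pd.DataFrame({"year": ["freshman", "sophomore", "junior", "freshman"]})

# drop_first=True removes the first category in sorted order ("freshman"),
# which becomes the reference category; dtype=int keeps the columns as 0/1
toy_dummies = pd.get_dummies(toy, columns=["year"], drop_first=True, dtype=int)
print(toy_dummies)
```

A row with zeros in both `year_junior` and `year_sophomore` is a freshman, so no separate column is needed for that category.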
***************************************

In [2]:
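# create dummy variables for the column gender
# dtype = int keeps the new column as 0/1 (newer pandas versions default to True/False)
df = pd.get_dummies(df, columns = ['gender'], drop_first = True, dtype = int)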
df.head()

Unnamed: 0,age,exercise,hours,grade,gender_male
0,17,3,10,82.4,0
1,18,4,4,78.2,1
2,18,5,9,79.3,1
3,14,2,7,83.2,0
4,18,4,15,87.4,0


***************************************
<b>Step 3: Preparing data for linear regression</b>
***************************************
We will be using the sklearn library for our SML linear regression analyses. The sklearn library is a very popular machine learning library for Python. There are several functions within the general sklearn library that we will need to import before we can get started. 

Before we start our analyses, we need to organize our data into the proper format. Sklearn requires that we pass our dependent and independent variables separately. To do this, we will simply create two sets of data -- one set that includes everything BUT our dependent variable, and another that includes ONLY our dependent variable. 

In [3]:
# import the functions needed for sklearn
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import math

# create a LinearRegression model object and give it a short variable name
lm = LinearRegression()

In [4]:
# create a dataset with only the dependent variable
y = df['grade']
print(y.head()) # preview the first few rows

# create a dataset with only the independent variables
X = df.drop('grade', axis = 1)
print(X.head()) # preview the first few rows 

0    82.4
1    78.2
2    79.3
3    83.2
4    87.4
Name: grade, dtype: float64
   age  exercise  hours  gender_male
0   17         3     10            0
1   18         4      4            1
2   18         5      9            1
3   14         2      7            0
4   18         4     15            0


***************************************
<b>Step 4: Exploring the features of your linear regression model</b>
***************************************

Sklearn allows you to view information about your model -- similar to statsmodels -- however, you have to work a little harder to see the information. Below, we are going to fit our model to our data, check the R^2 score, determine the intercept of the model, and check what our variable coefficients are. The interpretations of the information are the same -- the information is just presented differently. 

In [6]:
# fit the linear regression function to your full data
lm.fit(X, y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [7]:
# determine the R^2 value for our full data
lm.score(X, y)

0.6645580504702335
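
For reference, the value returned by `.score()` is the coefficient of determination, R^2 = 1 - SS_res / SS_tot. The sketch below computes it by hand; the `actual` and `predicted` lists are made-up toy values (an assumption for illustration), not values from the grade dataset.

```python
# hypothetical toy values, not taken from the grade dataset
actual = [2.0, 4.0, 6.0, 8.0]
predicted = [2.5, 3.5, 6.5, 7.5]

mean_actual = sum(actual) / len(actual)

# residual sum of squares: how far the predictions are from the actual values
ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))

# total sum of squares: how far the actual values are from their own mean
ss_tot = sum((a - mean_actual) ** 2 for a in actual)

r_squared = 1 - ss_res / ss_tot
print(r_squared)
```

Here SS_res is 1 and SS_tot is 20, so R^2 comes out at about 0.95: the predictions account for roughly 95% of the variation in the toy values.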

In [8]:
# determine the y-intercept/constant
lm.intercept_

58.08744547305868

In [9]:
# determine the coefficients for the independent variables
lm.coef_

array([ 0.04050091,  0.98413259,  1.91732377, -0.44848377])

In [10]:
# create a list of the variable names and the coefficient values
summary = pd.DataFrame(list(zip(X.columns, lm.coef_)), columns = ["Variable", "Coefficient"])
summary

Unnamed: 0,Variable,Coefficient
0,age,0.040501
1,exercise,0.984133
2,hours,1.917324
3,gender_male,-0.448484
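
The intercept and coefficients together define the regression equation: predicted grade = intercept + the sum of each coefficient times its variable. As a rough sketch, the code below plugs the values printed above into that equation for one student. The `student` dictionary is the first row of the data (age 17, 3 hours of exercise, 10 hours of study, female); the variable names are chosen here for illustration.

```python
# intercept and coefficients copied from the lm.intercept_ and lm.coef_ output
intercept = 58.08744547305868
coefficients = {
    "age": 0.04050091,
    "exercise": 0.98413259,
    "hours": 1.91732377,
    "gender_male": -0.44848377,
}

# first row of the dataset: a 17-year-old female student
student = {"age": 17, "exercise": 3, "hours": 10, "gender_male": 0}

# regression equation: intercept + sum of (coefficient * value)
predicted_grade = intercept + sum(
    coefficients[name] * student[name] for name in coefficients
)
print(round(predicted_grade, 2))
```

The result is about 80.9, compared with the student's actual grade of 82.4.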


In [11]:
## sklearn does not have direct functionality to calculate p-values
## you can use output from the statsmodels library to determine the p-values of each variable

result = smf.ols('grade ~ age + exercise + hours + gender_male', data = df).fit()
result.summary()

0,1,2,3
Dep. Variable:,grade,R-squared:,0.665
Model:,OLS,Adj. R-squared:,0.664
Method:,Least Squares,F-statistic:,988.1
Date:,"Sat, 14 Nov 2020",Prob (F-statistic):,0.0
Time:,13:53:01,Log-Likelihood:,-6299.1
No. Observations:,2000,AIC:,12610.0
Df Residuals:,1995,BIC:,12640.0
Df Model:,4,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,58.0874,1.326,43.804,0.000,55.487,60.688
age,0.0405,0.075,0.543,0.587,-0.106,0.187
exercise,0.9841,0.089,11.073,0.000,0.810,1.158
hours,1.9173,0.031,61.617,0.000,1.856,1.978
gender_male,-0.4485,0.253,-1.773,0.076,-0.944,0.047

0,1,2,3
Omnibus:,325.522,Durbin-Watson:,2.048
Prob(Omnibus):,0.0,Jarque-Bera (JB):,2284.723
Skew:,-0.569,Prob(JB):,0.0
Kurtosis:,8.111,Cond. No.,214.0


***************************************
<b>Step 5: Splitting data into training and testing sets</b>
***************************************

We've determined the relationship between our variables, but now we want to train and test our model to be able to accurately predict our outcome variable. Before we train and test, we have to split our data into four separate groups:

* <b>Subset of Training Data for Independent Variables (X_train)</b>: this is a dataset that includes only the independent variables and is a larger randomized percentage of the original dataset
* <b>Subset of Training Data for Dependent Variable (y_train)</b>: this is a dataset that includes only the dependent variable and is a larger randomized percentage of the original dataset
* <b>Subset of Testing Data for Independent Variables (X_test)</b>: this is a dataset that includes only the independent variables and is a smaller randomized percentage of the original dataset
* <b>Subset of Testing Data for Dependent Variable (y_test)</b>: this is a dataset that includes only the dependent variable and is a smaller randomized percentage of the original dataset

We split our data so we can use some of our data to train our model (i.e., expose our model to the patterns in our data) and use the remainder of our data to test how well our model learned from the training data. Your dataset will be divided randomly into the training and testing sets -- the training set is typically the larger share of your data (here, test_size=0.33 keeps about two-thirds of the rows for training). 

Splitting our dataset is made much easier with the <b>train_test_split function</b> that we imported from the sklearn library. 

In [12]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(1340, 4)
(1340,)
(660, 4)
(660,)
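
To see where these sizes come from, here is a simplified sketch of a random split using NumPy. It is not sklearn's actual implementation and will not select the same rows as `train_test_split`; it only illustrates the idea of shuffling the row positions and cutting them into two groups. All variable names are assumptions for illustration.

```python
import math
import numpy as np

n_rows = 2000      # number of rows in the grade dataset
test_size = 0.33   # share of rows reserved for testing

# shuffle the row positions 0..1999 (the seed only makes the shuffle repeatable)
rng = np.random.default_rng(1)
shuffled = rng.permutation(n_rows)

# the test set gets the first 33% of the shuffled positions (rounded up),
# and the training set gets everything else
n_test = math.ceil(test_size * n_rows)
test_idx = shuffled[:n_test]
train_idx = shuffled[n_test:]

print(len(train_idx), len(test_idx))
```

Every row lands in exactly one of the two groups, which is what keeps the test data unseen during training.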


In [13]:
y_train.head()

1037     64.9
664      32.0
813      70.8
1269    100.0
1848     99.3
Name: grade, dtype: float64

***************************************
<b>Step 6: Training linear regression model</b>
***************************************

Training your model is simple -- all you have to do is fit the Linear Regression model to the training data. The key is to expose the model to BOTH the independent variables and the dependent variable. We are basically showing the model: we have a female student who is 17 years old, exercises 4 hours per week, studies 5 hours per week, and received a final grade of 65.... 

Once we do this for all the rows in our training set, we are going to test the model by providing it all the information about the independent variables but not the outcome: we have a male student who is 19 years old, exercises 6 hours per week, and studies 3 hours per week -- what final grade do you predict this student will receive?

In [14]:
lm.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

***************************************
<b>Step 7: Testing linear regression model</b>
***************************************

Testing our model requires us to show our model the subset of testing data and ask it to make predictions based on this data. In this case, we are specifically showing the model the student's age, gender, hours of exercise, and hours of study, and asking the model to predict that student's final grade. 

In [15]:
# test how well your model can predict y-values given x-values
y_predict = lm.predict(X_test)

# let's see what predictions are being made and how they compare to the actual values
data_predict = pd.DataFrame(list(zip(y_test, y_predict)), columns = ["actual", "prediction"])
data_predict.head()

Unnamed: 0,actual,prediction
0,91.4,95.8327
1,81.9,76.680998
2,84.5,79.078059
3,96.2,92.11932
4,66.2,65.389395
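
One quick way to check whether a model carries over to unseen data is to compare its R^2 on the training set with its R^2 on the testing set. The sketch below does this on synthetic stand-in data (an assumption for illustration, not the grade dataset); names such as `X_demo` and `demo_model` are made up here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# synthetic stand-in data: grade depends linearly on study hours plus noise
rng = np.random.default_rng(0)
hours = rng.uniform(0, 20, size=300)
grade = 60 + 2 * hours + rng.normal(0, 3, size=300)
X_demo = hours.reshape(-1, 1)  # sklearn expects a 2-D array of features

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, grade, test_size=0.33, random_state=1
)

demo_model = LinearRegression().fit(X_tr, y_tr)

# R^2 on data the model has seen vs. data it has not seen
train_r2 = demo_model.score(X_tr, y_tr)
test_r2 = demo_model.score(X_te, y_te)
print(train_r2, test_r2)
```

If the testing R^2 is much lower than the training R^2, the model has likely learned quirks of the training rows rather than a pattern that generalizes.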


***************************************
<b>Step 8: Evaluating linear regression model</b>
***************************************

Our model made some predictions about our dependent variable. Now what? The next step is to statistically evaluate how well our model was able to predict these values. We will do this by using the following metrics:

* <b>Mean Squared Error (MSE)</b>: this is the average of the squared differences between the observed values and the model's predicted values, so it measures how close the fitted regression line is to the actual data points. The smaller the MSE, the closer the fit is to the data. There is no gold standard for an acceptable MSE value -- typically, this value is used to compare different iterations of models (similar to using the adjusted R^2 to determine which model fits the data best). 

* <b>Root Mean Squared Error (RMSE)</b>: this is the square root of the MSE, and is much easier to interpret directly because it is in the same units as the dependent variable. A well-fitting regression model results in predicted values that are close to the observed data values, and the RMSE summarizes how close they are. The RMSE can be read as the typical distance between the observed values and the predicted values; because errors are squared before averaging, large misses count more heavily than they would in a simple average of the differences. 
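
As a worked sketch of both metrics, the code below computes the MSE and RMSE by hand for the five actual/prediction pairs previewed in Step 7. Because it uses only those five rows, the result will differ from the MSE and RMSE of the full testing set.

```python
import math

# the five actual/prediction pairs previewed in Step 7
actual = [91.4, 81.9, 84.5, 96.2, 66.2]
predicted = [95.8327, 76.680998, 79.078059, 92.11932, 65.389395]

# MSE: square each error, then take the average
squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
mse = sum(squared_errors) / len(squared_errors)

# RMSE: square root of the MSE, back in grade points
rmse = math.sqrt(mse)

print(mse, rmse)
```

For these five rows the MSE is roughly 18.7 and the RMSE is roughly 4.3 grade points.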

Once you calculate these values, you can determine if you are happy with the results. If not, you should tweak your initial model, run through the training/testing process again, and determine if another model is a better fit. How much accuracy you need depends on what the model is for. A model designed to predict how much money a customer plans to spend in your store will still be useful if its predictions are off by $10. For a model that predicts the rate of growth of a brain tumor, the accuracy of the predictions is going to be a lot more important.  
***************************************
There are many factors that may contribute to inaccuracy in your model. If you aren't satisfied with the performance of your model, there are a few changes you can make before re-fitting your model:

<b>Need more data:</b> A small dataset may not contain enough examples for the model to learn the underlying pattern. More data generally leads to better predictions. 

<b>Bad assumptions:</b> We made the assumption that this data has a linear relationship, but that might not be the case. Visualizing the data may help you determine that.

<b>Poor features:</b> The features we used may not have had a high enough correlation to the values we were trying to predict.
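
One simple way to screen for poor features is to check how strongly each one correlates with the dependent variable. The sketch below uses synthetic stand-in data (an assumption for illustration, not the grade dataset) in which `hours` drives the grade and `age` does not.

```python
import numpy as np

# synthetic stand-in data: grade depends on hours, but not on age
rng = np.random.default_rng(0)
hours = rng.uniform(0, 20, size=200)
age = rng.uniform(14, 19, size=200)
grade = 60 + 2 * hours + rng.normal(0, 3, size=200)

# Pearson correlation of each feature with the dependent variable
corr_hours = np.corrcoef(hours, grade)[0, 1]
corr_age = np.corrcoef(age, grade)[0, 1]

print(corr_hours, corr_age)
```

A correlation near zero, like the one for `age` here, suggests the feature adds little to a linear model. Keep in mind that correlation only captures linear relationships, so a plot is still worth a look.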

In [16]:
# calculating the Mean Squared Error (MSE)
# sklearn expects the actual values first, then the predictions
MSE = mean_squared_error(y_test, y_predict)
print("The Mean Squared Error (MSE) of our model is:",MSE)

# calculating the Root Mean Squared Error (RMSE)

RMSE = math.sqrt(MSE)
print("On average, our model is", RMSE, "points away from the actual grade when making predictions.")

The Mean Squared Error (MSE) of our model is: 33.47358012521237
On average, our model is 5.785635671662395 points away from the actual grade when making predictions.


***************************************
<b>Making predictions with linear regression model</b>
***************************************

What would be the fun of a linear regression model without the ability to make specific predictions about our data? Making predictions is very simple with the sklearn library because we can make use of the .predict() method. 

Before making predictions for new cases, we re-fit the Linear Regression model to the full data (not just the training subset) so that the final model learns from every row we have. 

In [17]:
# re-fit the linear regression model to the full X and y data
lm.fit(X,y)

# in the order of your variables, list the values you want to input for each variable
# age: 17
# hours exercise: 3
# hours of study: 10
# gender-male: 0 <this means the student is a female>
lm.predict([[17, 3, 10, 0]])

array([80.90159641])
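
In recent scikit-learn versions, a model fit on a DataFrame remembers the column names and may warn when it is later given a plain list. One way to avoid this is to pass the new case as a one-row DataFrame with the same columns. The sketch below shows the pattern on made-up toy data; `toy_X`, `toy_y`, and `toy_model` are assumptions for illustration, and the target is constructed as an exact linear function of the features.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# hypothetical toy data with the same column names as X
toy_X = pd.DataFrame({
    "age": [17, 18, 18, 14, 18, 16],
    "exercise": [3, 4, 5, 2, 4, 1],
    "hours": [10, 4, 9, 7, 15, 12],
    "gender_male": [0, 1, 1, 0, 0, 1],
})

# made-up target that is an exact linear function of the features
toy_y = (50 + 0.5 * toy_X["age"] + 1.0 * toy_X["exercise"]
         + 2.0 * toy_X["hours"] - 1.0 * toy_X["gender_male"])

toy_model = LinearRegression().fit(toy_X, toy_y)

# describe the new student by column name instead of by position
new_student = pd.DataFrame(
    [{"age": 17, "exercise": 3, "hours": 10, "gender_male": 0}]
)

# reorder the columns to match the training data before predicting
toy_prediction = toy_model.predict(new_student[toy_X.columns])
print(toy_prediction)
```

Naming the columns also guards against accidentally listing the values in the wrong order.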

In [18]:
df.head(2)

Unnamed: 0,age,exercise,hours,grade,gender_male
0,17,3,10,82.4,0
1,18,4,4,78.2,1


***************************************
<b>Next week</b>: Supervised Machine Learning and Classification with Logistic Regression
***************************************