# Linear Regression
File(s) needed: Employee.csv

We create a model from data to either describe that data or predict a future value or outcome. When the data we need to model appears to exhibit a linear relationship between two or more variables, we use linear models to describe them. The idea of a linear model is simple: find the straight line that best describes the data.

We have seen that idea applied when we created plots that included a fitted regression line. What we are interested in now, however, is **_the equation that describes that fitted regression line_**. If we have the equation, we can use it to make a prediction, such as the expected tip for a given total bill.

To review from algebra, the equation of a line is represented by 
\begin{equation*}
y = mx + b
\end{equation*}
where
<ul style="list-style-type:none;">
    <li><i><b>y</b></i> is the value we are predicting (the dependent or response variable),</li>
    <li><i><b>x</b></i> is the value we are using to make the prediction (the independent or predictor variable),</li>
    <li><i><b>m</b></i> is the slope of the line, and</li>
    <li><i><b>b</b></i> is the y-intercept.</li>
</ul>
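
To make the roles of the slope and intercept concrete, here is a minimal sketch that recovers them from a handful of made-up points. It assumes NumPy's `polyfit` with degree 1 as the fitting routine, which is not the library used later in this notebook, and the data values are invented for illustration.

```python
import numpy as np

# Made-up points that lie exactly on the line y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# A degree-1 polynomial fit returns the slope first, then the intercept
m, b = np.polyfit(x, y, deg=1)

# Use the fitted equation y = mx + b to predict y at a new x value
x_new = 5.0
y_pred = m * x_new + b
print(f"slope={m:.2f}, intercept={b:.2f}, prediction={y_pred:.2f}")
```

Because the points were generated from a known line, the fitted slope and intercept should match the values used to create them.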

Let's look again at a scatterplot with a fitted regression line as a reminder.

In [None]:
# Import the pandas and seaborn libraries and see example of earlier scatter plot
import pandas as pd
import seaborn as sns

# load a copy of the tips dataset
tips = sns.load_dataset('tips')

# Scatterplot with regression line using lmplot
fig = sns.lmplot(x='total_bill', y='tip', data=tips)

# The statsmodels library
Since we know what the regression line looks like from the above plot, let's use the "tips" data to find the equation of the line. There are multiple libraries we could use to create our regression model, but we will use the `statsmodels` library here. Later, we'll take a quick look at another popular library.

https://www.statsmodels.org/dev/regression.html

If you look at the `statsmodels` documentation, you will see that there are even multiple ways to specify our model parameters when using that library. We will use the `formula` API from `statsmodels` for our model specification. This API allows us to use what are called R-style formulas in our code. We'll talk about that more in our example.

https://www.statsmodels.org/dev/example_formulas.html

In [None]:
# import formula API from statsmodels using the conventional statement



In [None]:
# Look at the tips data to review how it is structured
tips.head()

# Simple linear regression
A simple linear regression model uses just one independent variable to explain the dependent variable. This simple linear regression problem will use the `ols` function from the `statsmodels.formula.api` module. Ordinary least squares (OLS) is one common method for estimating the parameters of a regression line. First we'll write the code using the R-style formula, then discuss what it does.
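
As a warm-up, the general pattern can be sketched on made-up data. The column names `x` and `y`, the random seed, and the true slope and intercept are all assumptions chosen for illustration; they are not the tips data.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Made-up data: y is roughly 0.5 * x + 2 plus a little random noise
rng = np.random.default_rng(0)
x = np.arange(1, 25, dtype=float)
df = pd.DataFrame({"x": x, "y": 0.5 * x + 2.0 + rng.normal(scale=0.5, size=x.size)})

# "y ~ x" reads as "y is modeled by x"; the intercept is included automatically
model = smf.ols(formula="y ~ x", data=df)

# Fitting estimates the intercept and slope by ordinary least squares
results = model.fit()

# The summary reports coefficients, standard errors, p-values, and fit statistics
print(results.summary())
```

The left side of the `~` is the response variable and the right side holds the predictor(s), which is the convention R uses for model formulas.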


In [None]:
# Create the model specification using OLS, then fit the data to the model spec


In [None]:
# Display the results from our OLS regression (use print())


How do we interpret the results? We can use the coefficient values to predict a tip (y) based upon the total bill (x).

\begin{equation*}
y = 0.1050x + 0.9203
\end{equation*}
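
To see the equation at work, here is a small sketch that plugs a total bill into it. The coefficients are copied from the equation; the helper name `predict_tip` and the $20.00 bill are hypothetical.

```python
# Coefficients copied from the fitted equation
slope = 0.1050      # change in predicted tip per dollar of total bill
intercept = 0.9203  # predicted tip when the total bill is zero

def predict_tip(total_bill):
    """Return the predicted tip for a given total bill (hypothetical helper)."""
    return slope * total_bill + intercept

# Predicted tip for a hypothetical $20.00 bill
tip_for_20 = predict_tip(20.00)
print(f"{tip_for_20:.2f}")  # prints 3.02
```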

We have some options for getting the results. We can use the `params` attribute to get just the coefficients and the `conf_int()` method to get just the confidence intervals for those coefficients.
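
Both options can be sketched on made-up data. The column names, seed, and true coefficients below are assumptions for illustration; the calls are the same ones we would use on a model fit to the tips data.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Made-up data: y is roughly 0.5 * x + 2 plus noise
rng = np.random.default_rng(0)
x = np.arange(1, 25, dtype=float)
df = pd.DataFrame({"x": x, "y": 0.5 * x + 2.0 + rng.normal(scale=0.5, size=x.size)})

results = smf.ols(formula="y ~ x", data=df).fit()

# params is a pandas Series indexed by term name ("Intercept" and "x")
coefs = results.params
print(coefs)

# conf_int() returns one row per term; columns 0 and 1 are the lower and
# upper bounds of the 95% interval by default
ci = results.conf_int()
print(ci)

# The fitted model can also predict directly from new predictor values
new_data = pd.DataFrame({"x": [30.0]})
print(results.predict(new_data))
```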

In [None]:
# Use the params attribute to get the coefficients


In [None]:
# Use the conf_int method to get the confidence interval


## Exercise: simple linear regression
Use the file 'Employee.csv' and `statsmodels.formula` to create a simple linear regression model that predicts 'salary' from 'salbegin'. Note: the column 'id' should be used as the index column when reading the data.

In [None]:
# Simple regression using statsmodels


# Multiple Regression
With the simple regression we just did, we regressed our response variable on one predictor variable. We can also regress the response variable on more than one predictor variable. When we do that, we call it _multiple regression_.

In this case, the equation of the line is represented by 
\begin{equation*}
y = m_1x_1 + m_2x_2 + \dots + m_ix_i + b
\end{equation*}
where
<ul style="list-style-type:none;">
    <li><i><b>y</b></i> is the value we are predicting (the dependent or response variable),</li>
    <li><i><b>x</b></i> is the value of a predictor variable. The multiple predictors are denoted as x<sub>1</sub> through x<sub>i</sub>,</li>
    <li><i><b>m</b></i> is the value of a regression coefficient. The multiple coefficients are denoted as m<sub>1</sub> through m<sub>i</sub>, and</li>
    <li><i><b>b</b></i> is the intercept.</li>
</ul>

The code to fit a multiple regression model to our data set using the `statsmodels.formula` API is the same as for a simple regression, with one exception: we need to "add" the additional predictors to the right hand side of the model specification using plus signs.

For example, in the earlier simple example we wrote
```
model = smf.ols(formula='tip ~ total_bill', data=tips)
```
Now we need to add `size` as a predictor.
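
The plus-sign syntax can be sketched on made-up data first. The column names `x1`, `x2`, and `y`, the seed, and the true coefficients are assumptions chosen for illustration only.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Made-up data with two predictors: y is roughly 1 + 0.3*x1 + 0.8*x2 plus noise
rng = np.random.default_rng(1)
n = 40
df = pd.DataFrame({
    "x1": rng.uniform(0, 10, n),
    "x2": rng.integers(1, 6, n).astype(float),
})
df["y"] = 1.0 + 0.3 * df["x1"] + 0.8 * df["x2"] + rng.normal(scale=0.2, size=n)

# Each additional predictor is joined to the right-hand side with a plus sign
results = smf.ols(formula="y ~ x1 + x2", data=df).fit()

# One coefficient per predictor, plus the intercept
print(results.params)
```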

In [None]:
# Specify the model adding size as a predictor

# Inspect the results


With two predictors, we are modeling the following equation:
\begin{equation*}
y = m_1x_1 + m_2x_2 + b
\end{equation*}

We interpret the overall results the same way as with the simple regression example. We can use the coefficient values to predict a tip (y) based upon the total bill (x<sub>1</sub>) and size (x<sub>2</sub>).

The estimated regression coefficients are all significant, so they all remain in our model. With the results added, the equation becomes
\begin{equation*}
y = 0.0927 \cdot \text{total bill} + 0.1926 \cdot \text{size} + 0.6689
\end{equation*}

If we want to interpret individual coefficients, it is with the understanding that the others are held constant. For example, we would say that the tip increases by about 19 cents for every person added to the party as long as the total bill doesn't change.
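
The "held constant" interpretation can be checked with a little arithmetic. The coefficients are the ones reported for the two-predictor model; the helper name `predict_tip`, the $20.00 bill, and the party sizes are hypothetical.

```python
# Coefficients reported for the two-predictor model
coef_total_bill = 0.0927
coef_size = 0.1926
intercept = 0.6689

def predict_tip(total_bill, size):
    """Return the predicted tip for a bill amount and party size (hypothetical helper)."""
    return coef_total_bill * total_bill + coef_size * size + intercept

# Same hypothetical $20.00 bill, party of 3 versus party of 4
tip_party_of_3 = predict_tip(20.00, 3)
tip_party_of_4 = predict_tip(20.00, 4)

print(f"{tip_party_of_3:.2f}")
print(f"{tip_party_of_4:.2f}")

# With the bill held constant, the difference equals the size coefficient
difference = tip_party_of_4 - tip_party_of_3
print(f"{difference:.4f}")
```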

We can still use the `params` attribute to get just the coefficients and the `conf_int()` method to get just their confidence intervals if that is all we want.

## Exercise: multiple regression with statsmodels
Use the file 'Employee.csv' to create a linear regression model that predicts 'salary' using 'salbegin' and 'educ' and the statsmodels library. Write all the necessary code in the cell below. Note: the column 'id' should be used as the index column when reading the data.

In [None]:
# Multiple regression using statsmodels


# Regression with categorical predictors

The examples have only used continuous predictors to this point. Like many data sets, the `tips` data contains categorical data like "sex" (values of "Male" and "Female") and "day" (values of "Thur", "Fri", "Sat", and "Sun"). What if we want to use one of those variables as a possible predictor?

The ordinary least squares algorithm (and many others) can't do anything with text values like "Male" or "Fri" when modeling a regression. We get around that by creating dummy variables.

A _**dummy variable**_ takes the value of 0 (not present) or 1 (present) to indicate the state of a categorical value we think might have an effect on the outcome. To implement dummy variables, each unique value of the categorical variable becomes a new variable (i.e., column in the data set) with a 0 or 1 value. Then, because of some potential advanced statistical problems, we designate one of the values as the reference value and drop it. In the example of the values for "sex," we only need one dummy variable, because if we keep the "Female" dummy variable, we either have a value of 1 (a female) or a value of 0 (not a female, so a male). "Male" becomes the reference value and we drop it from the analysis.
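
Here is a minimal sketch of building the dummy variables by hand, assuming a tiny made-up data frame with a single `sex` column. The new column names `sex_Male` and `sex_Female` are illustrative choices.

```python
import pandas as pd

# A tiny made-up data frame with one categorical column
df = pd.DataFrame({"sex": ["Male", "Female", "Female", "Male"]})

# Each unique value becomes its own 0/1 column
df["sex_Male"] = (df["sex"] == "Male").astype(int)
df["sex_Female"] = (df["sex"] == "Female").astype(int)

# "Male" is the reference value, so its column is dropped;
# sex_Female == 0 now means "Male"
df = df.drop(columns="sex_Male")
print(df)
```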

In [None]:
# Check for unique values for sex and day


But there is good news! `statsmodels` will automatically create dummy variables for us AND it drops the reference value. All we do is add the desired categorical variable(s) to the formula.
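
This behavior can be sketched on made-up data. The column names `x`, `group`, and `y`, the three group labels, and the seed are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Made-up data: one numeric predictor and one text-valued categorical predictor
rng = np.random.default_rng(2)
n = 30
df = pd.DataFrame({
    "x": rng.uniform(0, 10, n),
    "group": ["a", "b", "c"] * (n // 3),
})
df["y"] = 1.0 + 0.5 * df["x"] + rng.normal(scale=0.3, size=n)

# The categorical column goes into the formula like any other predictor
results = smf.ols(formula="y ~ x + group", data=df).fit()

# Three group values produce only two dummy coefficients;
# the remaining value is the reference and gets no coefficient of its own
print(results.params)
```

The parameter list should contain the intercept, the coefficient for `x`, and one coefficient for each non-reference group.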

In [None]:
# Add the categorical variables to our model specification


# Inspect the results


We interpret the results for the dummy variables _in terms of the reference value_. For example, we would say from these results that tips are about 16 cents bigger on Friday than Thursday, all else equal. Thursday is the reference value because it is the one missing from the output summary.

**_Except_** - look at the p-values in the summary. None of the categorical variables is significant here, so we would not include them in our model.

---

# The sklearn library
A library specifically created for machine learning in Python is **scikit-learn**. It is built on the NumPy, SciPy, and matplotlib libraries. We import it (or parts of it) using the library name `sklearn`.

https://scikit-learn.org/stable/index.html

Why scikit-learn? This quote from CIO.com sums things up nicely:

> Scikits are Python-based scientific toolboxes built around SciPy, the Python library for scientific computing. Scikit-learn is an open source project focused on machine learning that is careful about avoiding scope creep and jumping on unproven algorithms. On the other hand, it has quite a nice selection of solid algorithms, and it uses Cython (the Python to C compiler) for functions that need to be fast, such as inner loops.
>
> Scikit-learn earns the highest marks for ease of development among all the machine learning frameworks I’ve tested. The algorithms work as advertised and documented, the APIs are consistent and well-designed, and there are few “impedance mismatches” between data structures. It’s a pleasure to work with a library in which features have been thoroughly fleshed out and bugs thoroughly flushed out.
><p style="text-align:right;font-size:80%">from https://www.cio.com/article/3213189/10-hot-data-analytics-trends-and-5-going-cold.html</p>

Of course, the way we build our model specification and the output we see are different with scikit-learn than with statsmodels. 

## Multiple regression using sklearn
We build our regression model in three steps when we use sklearn.
1. Import the linear_model module from sklearn.
2. Create the linear regression object.
3. Specify the predictor (X) and response (y) variables and fit the model.

We'll create the multiple regression model and inspect the coefficients and intercept. Other values from the model results are also available using the `metrics` module of sklearn. See https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics for more.
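
As one example of what the `metrics` module offers, here is a minimal sketch that computes R-squared and mean squared error for a fitted model. The data are made up, and the seed and true coefficients are assumptions chosen for illustration.

```python
import numpy as np
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score

# Made-up data: two predictors, y is roughly 1 + 0.3*x1 + 0.8*x2 plus noise
rng = np.random.default_rng(3)
X = rng.uniform(0, 10, size=(50, 2))
y = 1.0 + 0.3 * X[:, 0] + 0.8 * X[:, 1] + rng.normal(scale=0.2, size=50)

# Create the linear regression object and fit it
lr = linear_model.LinearRegression()
lr.fit(X, y)

# Compare the model's predictions with the observed values
y_hat = lr.predict(X)
r2 = r2_score(y, y_hat)
mse = mean_squared_error(y, y_hat)
print(r2)
print(mse)
```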

In [None]:
# import the linear_model module from sklearn
from sklearn import linear_model

In [None]:
# Create the linear regression object
lr = linear_model.LinearRegression()

In [None]:
# Use total_bill and size as predictors (note the double square brackets in X)
# Remember why we do it that way?
predicted = lr.fit(X=tips[['total_bill','size']], y=tips['tip'])

In [None]:
# Inspect the results
print(predicted.coef_)
print(predicted.intercept_)

## Simple regression using sklearn
The simple regression model follows the same steps as the multiple regression one. We have just one predictor in X, so it is tempting to pass that column by itself rather than inside a list. Let's try it.

In [None]:
# The import statement is already run.
# Create the linear regression object
lr = linear_model.LinearRegression()

In [None]:
# Specify the predictor (X) and response (y) variable
# Note: this call is expected to raise a ValueError because X is one-dimensional
predicted = lr.fit(X=tips['total_bill'], y=tips['tip'])

This version fails because of the shape of the data that the sklearn linear regression model expects. You can see at the bottom of the error message what we need to do. The `fit` method expects X to be a two-dimensional array with one row per observation and one column per predictor, but a single column selected from a data frame is one-dimensional. Since we have only one predictor here ("a single feature"), we use `reshape(-1,1)` on our predictor (X) data values to turn them into a two-dimensional array with a single column.

Let's look at the difference with and without the reshape to see what is happening.
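
The shapes involved can be sketched with a small made-up Series standing in for the `total_bill` column. The double-bracket selection at the end is an alternative way to get a two-dimensional X, shown here as an option rather than as the approach this notebook takes.

```python
import pandas as pd

# A made-up Series standing in for a single predictor column
s = pd.Series([10.0, 20.0, 30.0], name="total_bill")
df = s.to_frame()

# One-dimensional: just a sequence of values
shape_1d = s.values.shape
print(shape_1d)          # (3,)

# reshape(-1, 1) means "as many rows as needed, exactly one column"
shape_reshaped = s.values.reshape(-1, 1).shape
print(shape_reshaped)    # (3, 1)

# Double square brackets return a one-column data frame, which is also 2D
shape_double = df[["total_bill"]].shape
print(shape_double)      # (3, 1)
```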

In [None]:
# Inspect the values without reshape
tips['total_bill'].values

In [None]:
# See the result of using reshape
tips['total_bill'].values.reshape(-1,1)

In [None]:
# Run the corrected code to create the model
predicted = lr.fit(X=tips['total_bill'].values.reshape(-1,1), y=tips['tip'])

In [None]:
# Look at the results generated using sklearn
# They are the same values as with statsmodels
print(predicted.coef_)
print(predicted.intercept_)

## Categoricals using sklearn
With sklearn, we have to create our own dummy variables. However, pandas has a function called `get_dummies` that will do the work for us. It will convert all categoricals in a data frame into dummy variables. We can save the results as a new data frame and run the multiple regression like before with the additional variables included in the model specification.

In [None]:
# create the dummy variables and save in a new data frame (double square brackets again)
tips_dummy=pd.get_dummies(tips[['total_bill','size','sex','smoker','day','time']])
tips_dummy.head()

The only problem is that we have all the dummy values present. We need to drop the reference values. We can do that with the same code by adding `drop_first=True` to the get_dummies function as an argument.
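
The effect of `drop_first=True` can be sketched on a tiny made-up data frame. Note that these made-up text values are ordered alphabetically, so "Female" is the first level and is the one dropped here; which level comes first in a real data set depends on how its categories are ordered.

```python
import pandas as pd

# A tiny made-up data frame with one numeric and one text column
df = pd.DataFrame({
    "size": [2, 3, 2, 4],
    "sex": ["Male", "Female", "Female", "Male"],
})

# Without drop_first, every unique value gets its own dummy column
all_dummies = pd.get_dummies(df)
print(list(all_dummies.columns))

# With drop_first=True, the first level of each categorical is dropped
reduced = pd.get_dummies(df, drop_first=True)
print(list(reduced.columns))
```

Numeric columns such as `size` pass through unchanged in both cases.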

In [None]:
# create the dummy variables and save in a new data frame without reference values
tips_dummy=pd.get_dummies(tips[['total_bill','size','sex','smoker','day','time']], drop_first=True)
tips_dummy.head()

Now we can fit the regression model to this data set. Note that since we only have the predictor variables in the tips_dummy data frame, we don't need to list specific columns for the value of X.

In [None]:
# Create the linear regression object
lr = linear_model.LinearRegression()

# Use every column in tips_dummy as a predictor (no column list needed for X)
predicted = lr.fit(X=tips_dummy, y=tips['tip'])

# Inspect the results
print(predicted.coef_)
print(predicted.intercept_)

## Cleaning up the sklearn output
The coefficients are in the same order as the columns in the data set, but it would be much nicer if they were labeled. sklearn stores the results in a NumPy array, which holds the values but not their labels. We have to build the labels separately and then pair them with the coefficient values.

This same technique can be applied to the results in **_any_** of the sklearn examples covered in this notebook.

In [None]:
# import numpy
import numpy as np

# The model was fit in the last code cell.
# Get the intercept and coefficients, and store them in an object named values
values = np.append(predicted.intercept_, predicted.coef_)

# Get the names of the values
names = np.append('intercept', tips_dummy.columns)

# Put everything together in a labeled data frame and display results
results = pd.DataFrame(values, index=names, columns=['coeff'])
print(results)