# Lab 07: Gradient Descent and Sklearn

In this lab, we will work through the process of:
1. Defining loss functions, and
1. Performing feature engineering.

This lab will continue using the toy `tips` calculation dataset used in a prior lab.


<br/>
<hr style="border: 5px solid #003262;" />
<hr style="border: 1px solid #fdb515;" />

# Feature Engineering

To begin, let's load the tips dataset from the `seaborn` library.  This dataset contains records of tips, total bill, and information about the person who paid the bill. As earlier, we'll be trying to predict tips from the other data.

In [3]:
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import OneHotEncoder
import matplotlib.pyplot as plt
np.random.seed(42)
plt.style.use('fivethirtyeight')
sns.set()
sns.set_context("talk")
%matplotlib inline

In [4]:
# Run this cell to load the tips dataset; no further action is needed.
data = sns.load_dataset("tips")

print("Number of Records:", len(data))
data.head()

Number of Records: 244


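| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |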


---

## Feature Functions

So far, we've only considered models of the form $\hat{y} = f_{\theta}(x) = \theta_0 + \sum_{j=1}^p x_j\theta_j$, where $\hat{y}$ is a quantitative continuous variable. 

We call this a linear model because it is a linear combination of the features $x_1, \dots, x_p$. However, our features don't need to be numbers: we could have categorical values such as names. Additionally, the true relationship doesn't have to be linear, as we could have a relationship that is quadratic, such as the relationship between the height of a projectile and time.

In these cases, we often apply **feature functions**, functions that take in some value and output another value. This might look like converting a string into a number, combining multiple numeric values, or creating a boolean value from some filter.

If we use $\phi$ to represent the feature (_"phi"_-ture) function or transformation applied to our data, then our model takes the following form: $$\hat{y} = f_{\theta}(x) = \theta_0 + \sum_{j=1}^p \phi(x)_j\theta_j$$
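
As a rough illustration of the formula above, the sketch below applies a hypothetical feature function `phi` (not part of the lab) that maps a time $t$ to the features $(t, t^2)$, then fits $\theta$ by least squares on synthetic projectile data. The data, the name `phi`, and the coefficients are assumptions made up for this example.

```python
import numpy as np

def phi(x):
    """Hypothetical feature function: map each time t to the features (t, t^2)."""
    x = np.asarray(x, dtype=float)
    return np.column_stack([x, x ** 2])

# Synthetic, noise-free projectile heights: h(t) = 2 + 10 t - 4.9 t^2.
t = np.linspace(0, 2, 21)
height = 2 + 10 * t - 4.9 * t ** 2

# Design matrix: a column of ones for theta_0, followed by phi(t).
Phi = np.column_stack([np.ones_like(t), phi(t)])

# The model is quadratic in t but still linear in theta,
# so ordinary least squares applies directly.
theta_hat, *_ = np.linalg.lstsq(Phi, height, rcond=None)
print(theta_hat)
```

Because the synthetic data is generated without noise, the estimated parameters should match the generating coefficients up to floating-point error.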

### Example Feature Functions

1. **One-hot encoding**
    - Converts a single categorical feature into many binary features, each of which represents one of the possible values in the original column.
    - Each of the binary feature columns produced contains a 1 for rows that had that column's label in the original column and 0 elsewhere.
1. **Polynomial feature**
    - Creates polynomial combinations of features.
1. **Normalized/Standardized feature**
    - Normalizes features so they have a mean of 0 and a standard deviation of 1.
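
The three feature functions listed above can be sketched on a small made-up table. This is only one possible way to compute them: it assumes `pandas.get_dummies` for the one-hot encoding and `sklearn`'s `PolynomialFeatures` and `StandardScaler` for the other two, and the `toy` DataFrame is invented for illustration.

```python
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Made-up data for illustration only.
toy = pd.DataFrame({
    "day": ["Sun", "Sat", "Sun", "Thur"],
    "total_bill": [10.0, 20.0, 30.0, 40.0],
})

# 1. One-hot encoding: one binary column per distinct value of `day`.
one_hot = pd.get_dummies(toy["day"], prefix="day", dtype=int)

# 2. Polynomial features: total_bill and total_bill^2 (no constant column).
poly = PolynomialFeatures(degree=2, include_bias=False)
bill_poly = poly.fit_transform(toy[["total_bill"]])

# 3. Standardization: subtract the mean, then divide by the standard deviation.
bill_std = StandardScaler().fit_transform(toy[["total_bill"]])

print(one_hot)
print(bill_poly)
print(bill_std.mean(), bill_std.std())
```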

<br/>
<hr style="border: 1px solid #fdb515;" />

## Defining the Model and Engineering Features

In Lab 5, we used both a Simple Linear Regression (SLR) model and a constant model on this dataset. Now, let's make a more complicated model that utilizes other features in our dataset. You can imagine that we might want to use the features with an equation that looks as shown below:

$$ \text{Tip} = \theta_0 + \theta_1 \cdot \text{total}\_\text{bill} + \theta_2 \cdot \text{sex} + \theta_3 \cdot \text{smoker} + \theta_4 \cdot \text{day} + \theta_5 \cdot \text{time} + \theta_6 \cdot \text{size} $$

Unfortunately, that's not possible because some of these features like "day" are not numbers, so it doesn't make sense to multiply by a numerical parameter. Let's start by converting some of these non-numerical values into numerical values.

Before we do this, let's separate out the tips and the features into two separate variables, and add a bias term using `DataFrame.insert` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.insert.html)).

In [5]:
# Run this cell to create our design matrix X; no further action is needed.
tips = data['tip']
X = data.drop(columns='tip')
X.insert(0, 'bias', 1)
X.head()

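| | bias | total_bill | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 16.99 | Female | No | Sun | Dinner | 2 |
| 1 | 1 | 10.34 | Male | No | Sun | Dinner | 3 |
| 2 | 1 | 21.01 | Male | No | Sun | Dinner | 3 |
| 3 | 1 | 23.68 | Male | No | Sun | Dinner | 2 |
| 4 | 1 | 24.59 | Female | No | Sun | Dinner | 4 |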


<br/>
<hr style="border: 1px solid #fdb515;" />

## Question 1: Feature Engineering

First, let's convert our features to numerical values. A straightforward approach is to map some of these non-numerical features into numerical ones. 

For example, we could convert the `day` feature to a numerical value from 1-7. However, one of the disadvantages of directly translating to a numeric value is that we unintentionally assign certain features disproportionate weight. Consider assigning Sunday to the numeric value of 7, and Monday to the numeric value of 1. In our linear model, Sunday will have 7 times the influence of Monday, which can (and likely will) lower the performance of our model.

Instead, let's use **one-hot encoding** to better represent these features!  As you learned in the lecture, one-hot encoding is a feature engineering method that represents non-numeric features using boolean vectors (numerical values 0 or 1).

In the `tips` dataset, for example, we encode Sunday as the row vector `[0 0 0 1]` (taking the columns in the order Thursday, Friday, Saturday, Sunday) because our dataset only contains bills from Thursday through Sunday. This replaces the `day` feature with four boolean features indicating if the record occurred on Thursday, Friday, Saturday, or Sunday. One-hot encoding therefore assigns a more even weight across each category in non-numeric features. Note that an encoder may order its output columns differently, for example alphabetically by category name.
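
To make the `[0 0 0 1]` encoding concrete, here is a minimal sketch using `pandas.get_dummies` on a made-up column of days. The category order (Thursday, Friday, Saturday, Sunday) is declared explicitly as an assumption of this example so that the columns come out in that order.

```python
import pandas as pd

# Declare the category order explicitly (an assumption made for this sketch)
# so the indicator columns come out as Thur, Fri, Sat, Sun.
days = pd.Series(
    pd.Categorical(["Sun", "Thur", "Sat", "Fri"],
                   categories=["Thur", "Fri", "Sat", "Sun"])
)

encoded = pd.get_dummies(days, dtype=int)
print(encoded)

# The first record is a Sunday, so only the last indicator is switched on.
sunday_row = encoded.iloc[0].tolist()
print(sunday_row)
```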

Complete the code below to one-hot encode our dataset. This `DataFrame` holds our "featurized" data, which is also often denoted by $\phi$.

**Hint 1:** You should use sklearn's `OneHotEncoder` class ([documentation](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html)) when doing your one-hot encoding. Note that `OneHotEncoder` transforms data into a [SciPy sparse matrix](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html) to save space; we'll need to convert these back into regular arrays before doing any operations on them. Check out `.toarray()` ([documentation](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.toarray.html)) for how to convert this to a `NumPy` array.

**Hint 2:** Look through the lecture slides and/or course notes for examples of how to use `OneHotEncoder`.

In [19]:
def one_hot_encode(data):
    """
    Return the one-hot encoded DataFrame of our input data.
    
    Parameters
    -----------
    data: A DataFrame that may include non-numerical features.
    
    Returns
    -----------
    A one-hot encoded DataFrame that only contains numeric features.
    
    """
    # Initialize a OneHotEncoder object
    ohe = OneHotEncoder()
    
    # Fit the encoder on the `day` column only
    ohe.fit(data[["day"]])
    
    # Transform `day`, then convert the sparse result to a dense array
    encoded_day = ohe.transform(data[["day"]]).toarray()
    encoded_day_df = pd.DataFrame(encoded_day, columns=ohe.get_feature_names_out())
    
    # Note: only the one-hot columns for `day` are returned
    return encoded_day_df
    
one_hot_X = one_hot_encode(X)
one_hot_X.head()

| | day_Fri | day_Sat | day_Sun | day_Thur |
|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | 0.0 | 0.0 | 1.0 | 0.0 |
| 2 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | 0.0 | 0.0 | 1.0 | 0.0 |
| 4 | 0.0 | 0.0 | 1.0 | 0.0 |


<br>

---

### Question 2b: sklearn

Another way to fit a linear regression model is to use `scikit-learn`/`sklearn` as we have seen in Lab 6.  As a reminder, here are the three steps to use `sklearn`:

1. Create a `sklearn` object.
1. `fit` the object to data.
1. Analyze fit, or call `predict`.
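
The three steps above can be sketched on synthetic data as follows. The data and the name `toy_model` are made up for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data that follows y = 1 + 2x exactly.
x = np.arange(5, dtype=float).reshape(-1, 1)
y = 1 + 2 * x.ravel()

toy_model = LinearRegression()                  # 1. Create the sklearn object.
toy_model.fit(x, y)                             # 2. Fit it to data.
preds = toy_model.predict(np.array([[10.0]]))   # 3. Call predict on new inputs.

print(toy_model.intercept_, toy_model.coef_, preds)
```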


The `sklearn` `LinearRegression` object ([documentation](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)) models the Ordinary Least Squares (OLS) problem, also using numerical methods to estimate $\hat{\theta}$. Fill in the code below such that `sklearn_model` **fits** OLS using `sklearn`.

**Hint:** Since we have included the bias column in our design matrix explicitly, we need to adjust the `fit_intercept` parameter appropriately when creating the `LinearRegression` model. 
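
To see why `fit_intercept` matters, the sketch below fits the same synthetic data two ways: with an explicit column of ones and `fit_intercept=False`, and without that column but with `fit_intercept=True`. The data and variable names are assumptions invented for this illustration, and a single numeric feature is used so that the design matrix has full column rank.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
x = rng.uniform(0, 50, size=(30, 1))
y = 1.0 + 0.1 * x.ravel() + rng.normal(0, 0.5, size=30)

# Option A: explicit bias column, so sklearn must not add its own intercept.
X_bias = np.column_stack([np.ones(len(x)), x])
model_a = LinearRegression(fit_intercept=False).fit(X_bias, y)

# Option B: no bias column; sklearn estimates the intercept itself.
model_b = LinearRegression(fit_intercept=True).fit(x, y)

# In option A, theta_0 is the first coefficient;
# in option B, it is stored separately in intercept_.
print(model_a.coef_)
print(model_b.intercept_, model_b.coef_)
```

Both options describe the same fitted line here; they differ only in where the intercept is stored.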

In [25]:
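from sklearn.linear_model import LinearRegression

# one_hot_X (from one_hot_encode above) holds only the four `day` indicator
# columns and no bias column, so the default fit_intercept=True is kept here.
model = LinearRegression()
sklearn_model = model.fit(one_hot_X, data[['tip']])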

print("sklearn with bias column thetas:")
print(sklearn_model.coef_)

sklearn with bias column thetas:
[[-0.20386903  0.05449758  0.31652571 -0.16715426]]


<br>
 
---
 
### Question 2c: sklearn and `fit_intercept`

To avoid having to explicitly add a bias column to our design matrix every time, `sklearn`'s `LinearRegression` object also supports `fit_intercept=True` during instantiation.

Fill in the code below by first assigning `one_hot_X_nobias` to the `one_hot_X` design matrix with the bias column dropped, then fit a new `LinearRegression` model, with intercept.

In [31]:
one_hot_X_nobias = X.drop(['bias'], axis=1)

one_hot_X_nobias = one_hot_encode(one_hot_X_nobias)

sklearn_model_intercept = LinearRegression().fit(one_hot_X_nobias, tips)

# Note that sklearn returns intercept (theta_0) and coefficients (other thetas) separately.
# We concatenate the intercept and other thetas before printing for easier comparison with the models above.
print("sklearn with intercept thetas:")
print(np.concatenate(([sklearn_model_intercept.intercept_], sklearn_model_intercept.coef_)))

sklearn with intercept thetas:
[ 2.93860587 -0.20386903  0.05449758  0.31652571 -0.16715426]


We print the MSE for the `SciPy` and both `sklearn` solutions below (all using L2 loss). Notice that while the theta coefficients are different for the two `sklearn` models (with the bias column, vs. with `fit_intercept=True`), all three models have similar MSEs! We will explain this when we explore Gradient Descent later in this lab.

In [33]:
from sklearn.metrics import mean_squared_error

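# mean_squared_error expects (y_true, y_pred); MSE is symmetric, so the values are unchanged.
# Note: `model` is the same fitted LinearRegression object as `sklearn_model` (fit returns self).
print("MSE scipy: \t\t\t" + str(mean_squared_error(tips, model.predict(one_hot_X))))
print("MSE sklearn bias column: \t" + str(mean_squared_error(tips, sklearn_model.predict(one_hot_X))))
print("MSE sklearn intercept model: \t" + str(mean_squared_error(tips, sklearn_model_intercept.predict(one_hot_X_nobias))))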

MSE scipy: 			1.867568048328792
MSE sklearn bias column: 	1.867568048328792
MSE sklearn intercept model: 	1.867568048328792
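
For reference, the MSE reported above is the mean of the squared residuals. The sketch below checks that a manual computation agrees with `mean_squared_error` on a few made-up values; the arrays are assumptions for illustration only.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up observations and predictions for illustration only.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.0, 2.0, 5.0])

# Mean of squared residuals, computed by hand.
mse_manual = np.mean((y_true - y_pred) ** 2)

# The same quantity via sklearn; the signature is (y_true, y_pred).
mse_sklearn = mean_squared_error(y_true, y_pred)

print(mse_manual, mse_sklearn)
```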
