In [None]:
# imports
# best practice is to keep all the imports in one cell, at the top of your notebook or script

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error, root_mean_squared_error


In [None]:
# Bring the prepared csv data in

df = pd.read_csv('cleanedSO.csv')

Using some of the basic pandas tools, explore this dataset. Think back to descriptive statistics and metadata about the DataFrame.

In [None]:
df.shape

In [None]:
df.describe()

In [None]:
df.describe(include='object')

In [None]:
df.dtypes

In [None]:
df.isna().sum()

#### Questions to consider:

1. Does anything jump out while looking at the descriptive statistics?
2. What columns can we use without manipulation?
3. How are we going to handle the non-numerics?
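
For question 3, one way to see which columns will need encoding is to select them by dtype. The sketch below uses a small made-up frame (the name `toy` and its columns are illustrative and are not taken from `cleanedSO.csv`); the same two calls can be run on `df`.

```python
import pandas as pd

# Hypothetical stand-in for the real data; column names are made up
toy = pd.DataFrame({
    "Salary": [50000.0, 72000.0, 61000.0, 58000.0],
    "YearsCode": [3, 10, 6, 4],
    "Country": ["US", "DE", "US", "IN"],
    "Remote": ["Yes", "No", "Yes", "Yes"],
})

# Every column that is not stored as a number
non_numeric = toy.select_dtypes(exclude="number").columns.tolist()

# Distinct values per non-numeric column; each distinct value
# becomes one new column once the data is dummy-encoded
cardinality = toy[non_numeric].nunique()

print(non_numeric)
print(cardinality)
```

Columns with many distinct values will add many dummy columns, which is worth knowing before encoding.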

### Data Preparation

Remember those non-numeric variables? We need to deal with them. We will do that by creating [dummy variables](https://en.wikipedia.org/wiki/Dummy_variable_(statistics)) using the built-in pandas function [`pd.get_dummies()`](https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html).

In [None]:
# By default, pd.get_dummies looks at the dtypes and only encodes
# the object, string, and category columns.
df = pd.get_dummies(df)

In [None]:
df.head()

In [None]:
# check the datatypes!
df.dtypes

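Note that after "dummying", every column is now numeric or boolean. Recent pandas versions create the dummy columns as `bool`, which scikit-learn treats as 1/0.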
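
To make the effect of dummy encoding concrete, here is a minimal sketch on a made-up three-row frame (the `toy` data is an assumption for illustration, not part of the dataset). Numeric columns pass through unchanged, and each distinct text value becomes its own column.

```python
import pandas as pd

# Hypothetical mini dataset: one numeric column, one text column
toy = pd.DataFrame({
    "Salary": [50000, 72000, 61000],
    "Country": ["US", "DE", "US"],
})

encoded = pd.get_dummies(toy)

print(encoded.columns.tolist())
print(encoded.dtypes)
```

The new columns are named `<original column>_<value>`. If one dummy per category should be dropped (for example, to avoid a redundant column alongside the intercept), `pd.get_dummies(toy, drop_first=True)` is one option.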

### Variable Selection and TTS

(recall that TTS stands for train-test-split)

Statsmodels uses terms like endogenous and exogenous variables. scikit-learn uses different terminology, more in keeping with its focus on prediction.

The dependent variable, or, what we are trying to predict, is always `y`.

The independent variables, or what we are using to predict `y`, are called `X`.

Note that the `y` is lowercase and the `X` is uppercase. This follows matrix naming conventions: `X` is a matrix of independent variables and `y` is a vector for the dependent variable.
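
The matrix/vector distinction shows up directly in the shapes. A small sketch, assuming a made-up frame with a `Salary` target (the other column names are hypothetical):

```python
import pandas as pd

# Hypothetical, already-encoded data
toy = pd.DataFrame({
    "Salary": [50000, 72000, 61000],
    "YearsCode": [3, 10, 6],
    "Country_US": [True, False, True],
})

y = toy["Salary"]               # Series: one value per row (a vector)
X = toy.drop(columns="Salary")  # DataFrame: rows x features (a matrix)

print(X.shape)  # (number of rows, number of features)
print(y.shape)  # (number of rows,)
```

`X` is two-dimensional and `y` is one-dimensional, and both have the same number of rows.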

In [None]:
# Set the target, or the y variable

y = df['Salary']

In [None]:
# Set the independent variable matrix, X

X = df.drop('Salary', axis=1)

In [None]:
# create your train-test splits

X_train, X_test, y_train, y_test = train_test_split(X,y)
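
The call above uses the defaults: a random split that holds out 25% of the rows for testing and changes on every run, so the scores later in the notebook will vary slightly between runs. A minimal sketch of a reproducible split follows; `random_state=42` and the `*_demo` arrays are arbitrary illustrative choices, not values used elsewhere in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 20 rows, 2 features
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.arange(20)

# Fixing random_state makes the shuffle repeatable
split_a = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape)

# Same seed, same rows in the test set
print(np.array_equal(split_a[1], split_b[1]))
```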

### Modeling

We will now instantiate, train, and use a model to make predictions.

scikit-learn has a large number of methods, which we will not be exploring in their entirety here. We will focus on two methods:
1. [`.fit()`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression.fit)
2. [`.score()`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression.score)

`.fit()` is the method we will use to train our linear regression algorithm.

`.score()` returns the model's default metric, which for linear regression is $r^2$.
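
To see that `.score()` really is $r^2$, we can compare it with `r2_score` from `sklearn.metrics`. This is a sketch on synthetic data; the `*_demo` arrays and the coefficients used to generate them are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic data: a linear signal plus noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=100)

model = LinearRegression().fit(X_demo, y_demo)

# Two routes to the same number
score_from_method = model.score(X_demo, y_demo)
score_from_metric = r2_score(y_demo, model.predict(X_demo))

print(score_from_method, score_from_metric)
```

Note the argument order: `.score()` takes the features and the true targets, while `r2_score` takes the true targets and the predictions.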

In [None]:
# instantiate
# this creates a linear regression object that we can train and score

lr = LinearRegression()

We now need to train the model. This is the process of fitting the coefficients of the linear regression equation to the data. Recall the closed-form solution for linear regression:

$\vec{b} = (X^TX)^{-1}X^Ty$

where $\vec{b}$ is the vector of coefficients we are training. This is done via the `.fit()` method.
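
As a sanity check on the formula, the sketch below solves the normal equations with NumPy and compares the result with what `LinearRegression` learns. The data is synthetic and the variable names are illustrative. A column of ones is added to `X` so that the first coefficient plays the role of the intercept.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with a known intercept (3.0) and slopes (1.5, -2.0)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
y_demo = 3.0 + X_demo @ np.array([1.5, -2.0]) + rng.normal(scale=0.1, size=50)

# Prepend a column of ones so b[0] acts as the intercept
X_design = np.column_stack([np.ones(len(X_demo)), X_demo])

# Solve (X^T X) b = X^T y; solving the system avoids forming the inverse explicitly
b = np.linalg.solve(X_design.T @ X_design, X_design.T @ y_demo)

model = LinearRegression().fit(X_demo, y_demo)

print(b)
print(model.intercept_, model.coef_)
```

Both routes should agree up to floating-point error. In practice scikit-learn relies on a least-squares solver rather than an explicit matrix inverse, which is more stable numerically.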

In [None]:
# Train the model with the correct data set

lr.fit(X_train,y_train)

### Evaluation

We are going to use the scikit-learn metrics $r^2$ and mean squared error to evaluate how the model is performing.

In [None]:
# use .score(). Recall this returns the default scoring method, which, for linear regression is r^2.

train_r2 = lr.score(X_train,y_train)
print(f'training r^2: {train_r2}')

Looking at that number, what are your takeaways?





Now we'll look at mean squared error. In order to get this metric, we need to create our predictions using our trained model. This is done with the [`.predict()`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression.predict) method.

In [None]:
# Get predictions

y_train_pred = lr.predict(X_train)

In [None]:
# Pass the actual values and the predictions to mean_squared_error

train_mse = mean_squared_error(y_train, y_train_pred)
print(f'training MSE: {train_mse}')

That number is ridiculously large. The reason is that this metric is the mean of the *squared* errors, so its units are dollars squared. That doesn't translate well, so we'll take the square root of it to get a more usable number: the root mean squared error (RMSE). Another advantage is that the units of the RMSE are the same as the target's. In this instance, RMSE tells us the typical size of the error in dollars, since that's what the target is.
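
A small worked example, using made-up salaries, shows how the two metrics relate. It also shows that RMSE weights large misses more heavily than a plain average of the absolute errors does.

```python
import math

# Hypothetical actual and predicted salaries, in dollars
actual    = [50_000, 60_000, 70_000, 80_000]
predicted = [51_000, 59_000, 77_000, 87_000]

errors = [p - a for a, p in zip(actual, predicted)]  # 1000, -1000, 7000, 7000

mse = sum(e ** 2 for e in errors) / len(errors)   # units: dollars squared
rmse = math.sqrt(mse)                             # units: dollars
mae = sum(abs(e) for e in errors) / len(errors)   # plain average of absolute errors

print(mse, rmse, mae)  # 25000000.0 5000.0 4000.0
```

The MSE of 25,000,000 is hard to interpret, while the RMSE of 5,000 is on the same scale as the salaries. It sits above the mean absolute error of 4,000 because the two 7,000-dollar misses dominate once the errors are squared.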

In [None]:
# use a better metric

train_rmse = root_mean_squared_error(y_train, y_train_pred)
train_rmse

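In the run recorded here, our $r^2$ is .2282. This is a very low number. (Your numbers will differ somewhat, because the train-test split is random.)

Our root mean squared error is 60,410. This can be read as: a typical salary prediction is off by about $60k. This is not good, either.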

#### Unseen data

Now that we have trained our model, let's see how it performs on unseen data.

In [None]:
# Score with test data

test_r2 = lr.score(X_test, y_test)
print(f'testing r^2: {test_r2}')

In [None]:
# Get the RMSE for the test set

y_test_pred = lr.predict(X_test)
test_rmse = root_mean_squared_error(y_test, y_test_pred)
print(f'testing RMSE: {test_rmse}')

In [None]:
# As a reminder
print(f'training r^2: {train_r2}')
print(f'testing r^2: {test_r2}')

print(f'training RMSE: {train_rmse}')
print(f'testing RMSE: {test_rmse}')
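
Printing each metric by hand makes it easy to attach the wrong label to a number. One option is a small helper that computes both metrics for any split. This is a sketch on synthetic data, and `evaluate` is a hypothetical helper, not part of scikit-learn. The RMSE is computed as the square root of `mean_squared_error`, which is what `root_mean_squared_error` returns.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def evaluate(model, X, y):
    """Return (r^2, RMSE) for a fitted model on the given data."""
    r2 = model.score(X, y)
    rmse = mean_squared_error(y, model.predict(X)) ** 0.5
    return r2, rmse


# Synthetic data: a linear signal plus noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)
model = LinearRegression().fit(X_tr, y_tr)

# The label travels with the data it describes
results = {
    "training": evaluate(model, X_tr, y_tr),
    "testing": evaluate(model, X_te, y_te),
}

for label, (r2, rmse) in results.items():
    print(f"{label} r^2: {r2:.4f}   {label} RMSE: {rmse:.4f}")
```

With the notebook's own objects, the equivalent calls would be `evaluate(lr, X_train, y_train)` and `evaluate(lr, X_test, y_test)`.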

So what can we conclude from this?

1. The model isn't good. At all.
2. There is an issue with the data (a minimum salary of .024?)

The good news is that it isn't overfit: the model scores about equally badly on both the training and the testing data.
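
A possible next step for the data issue is to look at the smallest salaries and decide what counts as implausible. The sketch below uses made-up values, and `MIN_PLAUSIBLE_SALARY` is a hypothetical cutoff chosen only for this demo. The right threshold for the real data is a judgment call.

```python
import pandas as pd

# Hypothetical salaries, including a couple of suspicious values
toy = pd.DataFrame({"Salary": [0.024, 15.0, 48_000.0, 61_000.0, 95_000.0]})

# Look at the smallest values before deciding on a cutoff
print(toy["Salary"].nsmallest(3))

MIN_PLAUSIBLE_SALARY = 1_000  # hypothetical cutoff, for illustration only

suspect = toy[toy["Salary"] < MIN_PLAUSIBLE_SALARY]
cleaned = toy[toy["Salary"] >= MIN_PLAUSIBLE_SALARY]

print(len(suspect), len(cleaned))
```

Inspecting the `suspect` rows before dropping them helps distinguish data-entry errors from values recorded in different units.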