# Multiple Linear Regression Code Example

The goal of this walkthrough is to introduce you to the [scikit-learn](https://scikit-learn.org/stable/) environment for *predictive* modeling.

By the end, you should be comfortable with the basics of scikit-learn model instantiation, training, and prediction.

HR has asked you to help them come up with a salary model for a new position your company is hiring for. Let's follow the steps of the [CRISP-DM](https://en.wikipedia.org/wiki/Cross-industry_standard_process_for_data_mining) process.

--insert graphic here--
    

### Business Understanding

Fortunately, this is pretty straightforward: how much should we pay people?

### Data Understanding

We are going to take a shortcut and use a pre-prepared data set from the 2024 [Stack Overflow Developer Survey](https://survey.stackoverflow.co/2024/). Here, however, is the data dictionary:

| **Column**  | **Data Type** | **Description**                        |
|-------------|---------------|----------------------------------------|
| Age         | object        | The age binned into several categories |
| RemoteWork  | object        | What kind of remote work               |
| EdLevel     | object        | Education level                        |
| YearsCode   | int64         | How many years coding                  |
| primaryDB   | object        | Primary Database Used                  |
| primaryLang | object        | Primary language used                  |
| Salary      | float64       | Salary in USD                          |

Looking at this, it is clear that we will need to create dummy variables for the five `object` columns.

### Data Preparation

Again, this has been done for you to facilitate the exercise. The one additional step we will take is to use [`pd.get_dummies`](https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html) to turn the object columns into numeric variables.

### Modeling

We will be using [`LinearRegression`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) for our model.

### Evaluation

We will be looking at two metrics to determine how effective our model is for predictive purposes.

1. [$r^2$ score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html)
2. [mean squared error](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html)
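
For reference, here are the standard definitions of the two metrics. The notation is introduced here for illustration and is not part of the original walkthrough: $y_i$ is an observed salary, $\hat{y}_i$ is the model's prediction for that row, $\bar{y}$ is the mean observed salary, and $n$ is the number of rows.

$$
r^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2},
\qquad
\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
$$

An $r^2$ of 1 means the predictions are perfect, while 0 means the model does no better than always predicting the mean. MSE is expressed in squared units of the target, which here would be dollars squared.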



### Deployment

For our purposes, deployment will be completing this notebook, and successfully predicting a salary.

# Data import, examination, and preparation

Ready? Excellent, let's get to the code!

In [None]:
# imports
# best practice is to keep all the imports in one cell, at the top of your notebook or script
# hint: data handling, model, and metrics


In [None]:
# Bring the prepared csv file in 

Using some of the basic pandas tools, explore this dataset. Think back to descriptive statistics and metadata about the dataframe.
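
As a warm-up, here is a minimal sketch of the usual pandas inspection tools on a tiny made-up DataFrame. The column names mirror the data dictionary, but the values and the name `toy` are invented for illustration and are not taken from the survey file.

```python
import pandas as pd

# Tiny made-up frame; values are illustrative only, not survey data
toy = pd.DataFrame(
    {
        "RemoteWork": ["Remote", "Hybrid", None, "In-person"],
        "YearsCode": [3, 10, 7, 22],
        "Salary": [55000.0, 98000.0, 72000.0, 130000.0],
    }
)

print(toy.shape)                                      # (rows, columns)
print(toy.describe())                                 # descriptive statistics for numeric columns
print(toy.select_dtypes(exclude="number").nunique())  # distinct values in non-numeric columns
print(toy.dtypes)                                     # data type of each column
print(toy.isna().sum())                               # number of nulls per column
```

The same calls should work unchanged on the survey dataframe once it is loaded.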

In [None]:
# how big is it?

In [None]:
# descriptive statistics?


In [None]:
# non-numerical features?


In [None]:
# data types? What are you working with?


In [None]:
# Are there any nulls?
df.isna().sum()

#### Questions to consider:

1. Does anything jump out while looking at the descriptive statistics?
2. What columns can we use without manipulation?
3. How are we going to handle the non-numerics?

### Data Preparation

Remember those non-numeric variables? We need to deal with them. We will do that by creating [dummy variables](https://en.wikipedia.org/wiki/Dummy_variable_(statistics)) using the built-in pandas function [`pd.get_dummies()`](https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html).
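
To see what dummy encoding does before applying it to the survey data, here is a small sketch on made-up rows. The name `toy` and its values are assumptions for illustration only.

```python
import pandas as pd

# Made-up rows; column names follow the data dictionary, values are invented
toy = pd.DataFrame(
    {
        "RemoteWork": ["Remote", "Hybrid", "Remote"],
        "YearsCode": [3, 10, 7],
    }
)

# Only the text column is expanded into one column per category;
# the numeric YearsCode column passes through untouched
dummies = pd.get_dummies(toy)
print(dummies.columns.tolist())
print(dummies)
```

Each category becomes its own indicator column, named `<original column>_<category>`.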

In [None]:
# pd.get_dummies is smart enough to look at the datatypes and only use the fields 
# that are object fields. 


In [None]:
# sanity check your work


In [None]:
# check the datatypes!


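Note that after "dummying", every column is numeric or boolean. (Recent pandas versions return the dummy columns as `bool` by default, which scikit-learn accepts alongside numeric columns.)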

# Modeling

We will now instantiate, train, and use a model to make predictions.

scikit-learn has a large number of methods, which we will not be exploring in their entirety here. We will focus on two methods:
1. [`.fit()`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression.fit)
2. [`.score()`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression.score)

`.fit()` is the method we will use to train our linear regression algorithm.

`.score()` will return the default scoring method, which, for linear regression, is $r^2$.
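
The instantiate, fit, and score pattern can be sketched on synthetic data. Everything below (`X_toy`, `y_toy`, `toy_model`, and the relationship used to generate the target) is invented for illustration; because the target is an exact linear function of the features, the fit should be essentially perfect.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data that follows y = 2*x1 + 3*x2 + 5 exactly (invented for illustration)
X_toy = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y_toy = 2 * X_toy[:, 0] + 3 * X_toy[:, 1] + 5

toy_model = LinearRegression()       # instantiate
toy_model.fit(X_toy, y_toy)          # train
r2 = toy_model.score(X_toy, y_toy)   # r^2 on the same data the model was trained on

print(toy_model.coef_, toy_model.intercept_, r2)
```

Real data is noisy, so expect an $r^2$ well below 1 on the survey data.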

In [None]:
# instantiate
# this creates a linear regression object that we can train and score


Statsmodels uses terms like endogenous and exogenous variables. scikit-learn uses different terminology, more in keeping with the predictive nature.

The dependent variable, or, what we are trying to predict, is always `y`.

The independent variables, or what we are using to predict `y`, are called `X`.

Note that the `y` is lower case and the `X` is upper case. This follows matrix naming conventions: `X` is a matrix of independent variables and `y` is a vector for the dependent variable.
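
A minimal sketch of splitting a dataframe into `X` and `y`, using a made-up and already-numeric frame. The names `toy`, `X_toy`, and `y_toy` and all values are assumptions for illustration.

```python
import pandas as pd

# Made-up, already-numeric frame standing in for the dummied survey data
toy = pd.DataFrame(
    {
        "YearsCode": [3, 10, 7],
        "RemoteWork_Remote": [1, 0, 1],
        "Salary": [55000.0, 98000.0, 72000.0],
    }
)

y_toy = toy["Salary"]                  # target: a 1-D Series (the vector y)
X_toy = toy.drop(columns=["Salary"])   # features: a 2-D DataFrame (the matrix X)

print(X_toy.shape, y_toy.shape)
```

Dropping the target column from `X` matters: leaving it in would let the model "predict" salary from salary.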

In [None]:
# Set the target, or the y variable


In [None]:
# Set the independent variable matrix, X


We now need to train the model. This is the process of fitting the linear regression equation to the data. Recall the closed-form solution for linear regression.

$\vec{b} = (X^TX)^{-1}X^Ty$

Here $\vec{b}$ is the vector of coefficients we are estimating. This is done via the `.fit()` method.
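
As a sanity check on the formula, the sketch below solves the normal equations with NumPy on synthetic data and compares the result to `LinearRegression`. The data, the variable names, and the added column of ones (which lets the formula estimate the intercept) are assumptions for illustration. scikit-learn's internal solver may differ from this formula, so the sketch only shows that the two agree on well-behaved data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic features and a noisy linear target (invented for illustration)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 3))
y_toy = X_toy @ np.array([1.5, -2.0, 0.5]) + 4.0 + rng.normal(scale=0.1, size=50)

# Prepend a column of ones so the intercept is estimated as one of the coefficients
X_design = np.column_stack([np.ones(len(X_toy)), X_toy])

# Solve (X^T X) b = X^T y rather than forming the inverse explicitly
b = np.linalg.solve(X_design.T @ X_design, X_design.T @ y_toy)

toy_model = LinearRegression().fit(X_toy, y_toy)
print(b)                                        # [intercept, coef_1, coef_2, coef_3]
print(toy_model.intercept_, toy_model.coef_)
```

Using `np.linalg.solve` instead of computing $(X^TX)^{-1}$ directly is a common choice for numerical stability.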

In [None]:
# Train the model


# Evaluation

We are going to use the scikit-learn metrics $r^2$ and mean squared error to evaluate how the model is performing.

In [None]:
# Recall this returns the default scoring method, which, for linear regression, is r^2.


Looking at that number, what are your take-aways?





Now we'll look at mean squared error. In order to get this metric, we need to create our predictions using our trained model. This is done with the [`.predict()`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression.predict) method.

In [None]:
# Create some predictions


In [None]:
# Use y_pred and y with the mean squared error method
mse = mean_squared_error(y, y_pred)
mse

That number is ridiculously large. The reason is that this method returns the *squared* error, which is measured in dollars squared. This doesn't really translate well, so we'll take the square root of it to get a more usable number, back in dollars.
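
A small worked sketch of the relationship between the two numbers, using made-up salaries and predictions. The names `y_true_toy` and `y_pred_toy` and their values are invented for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up salaries and predictions, for illustration only
y_true_toy = np.array([50000.0, 80000.0, 120000.0])
y_pred_toy = np.array([60000.0, 70000.0, 150000.0])

mse_toy = mean_squared_error(y_true_toy, y_pred_toy)   # units: dollars squared
rmse_toy = np.sqrt(mse_toy)                            # units: dollars

print(mse_toy, rmse_toy)
```

The errors here are 10,000, 10,000, and 30,000 dollars in size, so the MSE is in the hundreds of millions while the root mean squared error lands at roughly 19,000, a figure on the same scale as the errors themselves.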

In [None]:
# Create a variable with the root mean squared error


In [None]:
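# You can also get this number directly from a dedicated function
# (scikit-learn 1.4+; import root_mean_squared_error from sklearn.metrics).
# The older mean_squared_error(..., squared=False) flag is deprecated and removed in newer releases.
rmse2 = root_mean_squared_error(y, y_pred)
rmse2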

Our $r^2$ is .2285, which means the model explains only about 23% of the variance in salary. This is a very low number.

Our root mean squared error is 60,592. Roughly speaking, a typical salary prediction is off by about $60k. This is not good, either.

So what can we conclude from this?

1. The model isn't good. At all.
2. There is an issue with the data (a minimum salary of .024?)

# Conclusion

You now know how to:
1. instantiate a scikit-learn model
2. train a model
3. score a model with multiple metrics