In [1]:
# imports
# best practice is to keep all the imports in one cell, at the top of your notebook or script

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error, root_mean_squared_error


In [2]:
# Bring the prepared csv data in
df = pd.read_csv('cleanedSO.csv')

Using some of the basic pandas tools, explore this dataset. Think back to descriptive statistics and metadata about the dataframe.

In [3]:
df.shape

(21807, 7)

In [4]:
df.head()

Unnamed: 0,Age,RemoteWork,EdLevel,YearsCode,Salary,primaryDB,primaryLang
0,25-34 years old,"Hybrid (some remote, some in-person)","Professional degree (JD, MD, Ph.D, Ed.D, etc.)",12,31360.0,other,C
1,35-44 years old,Remote,"Master’s degree (M.A., M.S., M.Eng., MBA, etc.)",15,95200.0,other,JavaScript
2,35-44 years old,Remote,"Master’s degree (M.A., M.S., M.Eng., MBA, etc.)",27,56000.0,BigQuery,Python
3,25-34 years old,Remote,Some college/university study without earning ...,7,110000.0,other,HTML/CSS
4,35-44 years old,"Hybrid (some remote, some in-person)","Professional degree (JD, MD, Ph.D, Ed.D, etc.)",32,166874.4,BigQuery,C#


In [5]:
df.describe()

Unnamed: 0,YearsCode,Salary
count,21807.0,21807.0
mean,15.685055,90012.179599
std,9.926717,68987.446492
min,1.0,0.024
25%,8.0,42560.0
50%,13.0,75000.0
75%,20.0,120000.0
max,50.0,498000.0


In [6]:
df.describe(include='object')

Unnamed: 0,Age,RemoteWork,EdLevel,primaryDB,primaryLang
count,21807,21807,21807,21807,21807
unique,8,3,8,8,11
top,25-34 years old,Remote,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)",other,Bash/Shell (all shells)
freq,9233,10001,9646,12370,7735


In [7]:
df.dtypes

Age             object
RemoteWork      object
EdLevel         object
YearsCode        int64
Salary         float64
primaryDB       object
primaryLang     object
dtype: object

In [8]:
df.isna().sum()

Age            0
RemoteWork     0
EdLevel        0
YearsCode      0
Salary         0
primaryDB      0
primaryLang    0
dtype: int64

#### Questions to consider:

1. Does anything jump out while looking at the descriptive statistics?
2. What columns can we use without manipulation?
3. How are we going to handle the non-numerics?

### Data Preparation

Remember those non-numeric variables? We need to deal with them. We will do that by creating [dummy variables](https://en.wikipedia.org/wiki/Dummy_variable_(statistics)) using the built-in pandas function [`pd.get_dummies()`](https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html).
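To see what dummy encoding does on a small scale, here is a minimal sketch. The `toy` dataframe and its values are made up for illustration and are not taken from `cleanedSO.csv`.

```python
import pandas as pd

# Toy data (hypothetical values, not from cleanedSO.csv)
toy = pd.DataFrame({
    "YearsCode": [3, 10, 7],
    "RemoteWork": ["Remote", "Hybrid", "Remote"],
})

# Each non-numeric column is expanded into one indicator column per category;
# numeric columns such as YearsCode pass through unchanged.
toy_dummies = pd.get_dummies(toy)
print(toy_dummies.columns.tolist())
print(toy_dummies)
```

Each new column is named `<original column>_<category>`, which is the same pattern that appears in the full dataset once it has been encoded.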

In [9]:
# pd.get_dummies is smart enough to look at the datatypes and only use the fields 
# that are object fields.

df = pd.get_dummies(df) 

In [10]:
df.head()

Unnamed: 0,YearsCode,Salary,Age_18-24 years old,Age_25-34 years old,Age_35-44 years old,Age_45-54 years old,Age_55-64 years old,Age_65 years or older,Age_Prefer not to say,Age_Under 18 years old,...,primaryLang_Bash/Shell (all shells),primaryLang_C,primaryLang_C#,primaryLang_C++,primaryLang_Go,primaryLang_HTML/CSS,primaryLang_Java,primaryLang_JavaScript,primaryLang_Python,primaryLang_other
0,12,31360.0,False,True,False,False,False,False,False,False,...,False,True,False,False,False,False,False,False,False,False
1,15,95200.0,False,False,True,False,False,False,False,False,...,False,False,False,False,False,False,False,True,False,False
2,27,56000.0,False,False,True,False,False,False,False,False,...,False,False,False,False,False,False,False,False,True,False
3,7,110000.0,False,True,False,False,False,False,False,False,...,False,False,False,False,False,True,False,False,False,False
4,32,166874.4,False,False,True,False,False,False,False,False,...,False,False,True,False,False,False,False,False,False,False


In [11]:
# check the datatypes!
df.dtypes

YearsCode                                                                                       int64
Salary                                                                                        float64
Age_18-24 years old                                                                              bool
Age_25-34 years old                                                                              bool
Age_35-44 years old                                                                              bool
Age_45-54 years old                                                                              bool
Age_55-64 years old                                                                              bool
Age_65 years or older                                                                            bool
Age_Prefer not to say                                                                            bool
Age_Under 18 years old                                                            

Note that after "dummying", every column is now numeric or boolean. scikit-learn treats boolean columns as 0/1 values.

### Variable Selection and TTS

(recall that TTS stands for train-test-split)

Statsmodels uses terms like endogenous and exogenous variables. scikit-learn uses different terminology, more in keeping with the predictive nature.

The dependent variable, or, what we are trying to predict, is always `y`.

The independent variables, or, what we are using to predict `y`, are called `X`.

Note the `y` is lower case and the `X` is upper case. This follows matrix naming conventions. `X` is a matrix of independent variables and `y` is a vector for the dependent variable.
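The matrix/vector distinction shows up directly in the shapes of the objects. A small sketch, using a made-up `toy` dataframe (the column `IsRemote` and all values are hypothetical):

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "YearsCode": [3, 10, 7, 15],
    "IsRemote": [1, 0, 1, 0],
    "Salary": [50000.0, 90000.0, 70000.0, 120000.0],
})

y_toy = toy["Salary"]               # a Series: 1-D, one value per row
X_toy = toy.drop(columns="Salary")  # a DataFrame: 2-D, rows x features

print(X_toy.shape, y_toy.shape)
```

`X_toy` has two dimensions (rows and features) while `y_toy` has only one, which is what scikit-learn estimators expect.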

In [12]:
# Set the target, or the y variable

y = df['Salary']

In [13]:
# Set the independent variable matrix, X

X = df.drop('Salary', axis=1)

In [14]:
# create your train-test splits

X_train, X_test, y_train, y_test = train_test_split(X, y)
X_train.shape, X_test.shape

((16355, 39), (5452, 39))
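By default `train_test_split` shuffles the rows and holds out 25% of them for testing. A minimal sketch on synthetic arrays shows how `test_size` and `random_state` control the split; the array names and the value `42` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 2 features
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100)

# test_size=0.25 matches the default; random_state fixes the shuffle
# so the same rows land in the same split on every run.
Xa_train, Xa_test, ya_train, ya_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)
Xb_train, Xb_test, yb_train, yb_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

print(Xa_train.shape, Xa_test.shape)
print(np.array_equal(Xa_test, Xb_test))
```

Without `random_state`, each run produces a different split, so scores will vary slightly from run to run.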

### Modeling

We will now instantiate, train, and use a model to make predictions.

scikit-learn has a large number of methods, which we will not be exploring in their entirety here. We will focus on two methods:
1. [`.fit()`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression.fit)
2. [`.score()`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression.score)

`.fit()` is the method we will use to train our linear regression algorithm.

`.score()` will return the default scoring method, which, for linear regression is $r^2$.
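To confirm that `.score()` and the `r2_score` metric agree, here is a small sketch on synthetic data. The data, the coefficients, and the variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic regression problem: 3 features plus noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=200)

model = LinearRegression().fit(X_demo, y_demo)

# .score() predicts internally and then computes r^2
score_from_method = model.score(X_demo, y_demo)
# The same value, computed from explicit predictions
score_from_metric = r2_score(y_demo, model.predict(X_demo))

print(score_from_method, score_from_metric)
```

Note the argument order: `.score()` takes the features and the true target, while `r2_score` takes the true target and the predictions.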

In [15]:
# instantiate
# this creates a linear regression object that we can train and score
lr = LinearRegression()

We now need to train the model. This is the process of using the data to estimate the coefficients of the linear regression equation. Recall the closed-form solution for linear regression.

$\vec{b} = (X^TX)^{-1}X^Ty$

Where $\vec{b}$ are the coefficients we are training. This is done via the `.fit()` method.
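The closed-form solution can be checked numerically. This is a sketch on synthetic data, not a description of what scikit-learn does internally (it relies on a least-squares solver rather than this formula). A column of ones is added to `X` so that the first coefficient plays the role of the intercept.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with a known intercept (4.0) and slopes (3.0, -2.0)
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(100, 2))
y_demo = 4.0 + X_demo @ np.array([3.0, -2.0]) + rng.normal(scale=0.1, size=100)

# Prepend a column of ones so b[0] is the intercept
X_aug = np.column_stack([np.ones(len(X_demo)), X_demo])

# Solve (X^T X) b = X^T y, which is equivalent to b = (X^T X)^{-1} X^T y
# but avoids forming the inverse explicitly
b = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y_demo)

model = LinearRegression().fit(X_demo, y_demo)
print(b)
print(model.intercept_, model.coef_)
```

The two sets of coefficients should match up to floating-point error. The formula requires $X^TX$ to be invertible, which holds for this synthetic example.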

In [16]:
# Train the model with the .fit() method
# note: this fits on the full dataset (X, y), not only on the training split
lr.fit(X, y)


### Evaluation

We are going to use the scikit-learn metrics $r^2$ and mean squared error to evaluate how the model is performing.

In [17]:
# use .score(). Recall this returns the default scoring method, which, for linear regression is r^2.

lr.score(X, y)

0.22852390181570326

Looking at that number, what are your takeaways?





Now we'll look at mean squared error. In order to get this metric, we need to create our predictions using our trained model. This is done with the [`.predict()`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression.predict) method.

In [18]:
# Get predictions

y_pred = lr.predict(X)

In [19]:
# Use y_pred and y with the mean squared error method
mse = mean_squared_error(y, y_pred)
mse



np.float64(3671492961.327205)

That number is very large because the method returns the *squared* error, which is measured in squared dollars. That doesn't translate well, so we'll take the square root of it to get a more usable number: the root mean squared error (RMSE). Another advantage is that the units of the RMSE are the same as the target. In this instance, RMSE gives us a typical prediction error in dollars, since that's what the target is.
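A tiny worked example makes the relationship between MSE and RMSE concrete. The two values below are made up so that the arithmetic is easy to follow by hand.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([100.0, 200.0])
y_hat = np.array([103.0, 196.0])   # errors of +3 and -4

# Mean of the squared errors: (3**2 + 4**2) / 2 = 12.5
mse_demo = mean_squared_error(y_true, y_hat)

# The square root brings the metric back to the units of y (about 3.54)
rmse_demo = np.sqrt(mse_demo)

print(mse_demo, rmse_demo)
```

The RMSE of about 3.54 sits between the two absolute errors (3 and 4), which is why it is easier to interpret than the MSE of 12.5.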

In [20]:
# use a better metric

rmse = mse**(1/2)
rmse

np.float64(60592.845793271714)

In [21]:
rmse2 = root_mean_squared_error(y, y_pred)
rmse2

np.float64(60592.845793271714)

Our $r^2$ is 0.2285. This is a very low number.

Our root mean squared error is about 60,593. This can be read as a typical salary prediction being off by roughly $60k. This is not good, either.

So what can we conclude from this?

1. The model isn't good. At all.
2. There is an issue with the data (a minimum salary of 0.024?)
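One way to start investigating the data issue is to count how many rows fall below a plausibility cutoff. This is only a sketch: the `toy` values and the `threshold` of 1000 are hypothetical, and choosing a real cutoff requires judgment about the dataset.

```python
import pandas as pd

# Toy salaries for illustration only
toy = pd.DataFrame({"Salary": [0.024, 42560.0, 75000.0, 120000.0, 498000.0]})

# Hypothetical cutoff, chosen only to demonstrate the filter
threshold = 1000
suspicious = toy[toy["Salary"] < threshold]

print(len(suspicious), "of", len(toy), "rows fall below", threshold)
```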

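Whether the model is overfit remains an open question: the model above was fit and scored on the full dataset (`X`, `y`), so the train-test splits were never used. To check for overfitting, fit on `X_train` and `y_train`, then compare the scores on the training and testing data.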
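A comparison of training and testing performance can be sketched as follows. The data here is synthetic and the variable names are illustrative. The same pattern would apply to `X_train`, `X_test`, `y_train`, and `y_test` from the split created earlier.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic regression problem with a fair amount of noise
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(500, 4))
y_demo = X_demo @ np.array([1.0, 0.0, -2.0, 0.5]) + rng.normal(scale=2.0, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# Fit on the training split only
model = LinearRegression().fit(X_tr, y_tr)

train_r2 = model.score(X_tr, y_tr)
test_r2 = model.score(X_te, y_te)   # rows the model has never seen

print(train_r2, test_r2)
```

If the training score is much higher than the testing score, the model is likely overfit. Similar scores on both splits suggest it is not.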

### Conclusion

You now know how to:

1. instantiate a scikit-learn model
2. train a model
3. score a model with multiple metrics