Run the following cell to import all the code we need to complete the exercise.

In [1]:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline

from plotnine import *

import pandas as pd

We will use the same data as in your Unit 3 assignment. See the FiveThirtyEight article ["Higher Rates Of Hate Crimes Are Tied To Income Inequality"](https://fivethirtyeight.com/features/higher-rates-of-hate-crimes-are-tied-to-income-inequality/) for reference.

The data URL is http://bit.ly/2ItxYg3. Use `pd.read_csv` to read in this data and call the new dataframe `df_hate_crimes`.

Use the snippet below 👇.
```python
df_hate_crimes = (
    pd.read_csv('http://bit.ly/2ItxYg3')  # read in the data
    .dropna()                             # remove rows with missing values
)

df_hate_crimes.head()                     # preview the table
```

In [2]:
df_hate_crimes = (
    pd.read_csv('http://bit.ly/2ItxYg3')  # read in the data
    .dropna()                             # remove rows with missing values
)

df_hate_crimes.head()                     # preview the table

Unnamed: 0,state,median_house_inc,share_pop_metro,hs,hate_crimes,trump_support,unemployment,urbanization,income
0,New Mexico,low,0.69,83.0,0.295,low,high,low,46686
1,Maine,low,0.54,90.0,0.616,low,low,low,51710
2,New York,low,0.94,85.0,0.351,low,low,high,54310
3,Illinois,low,0.9,86.0,0.195,low,high,high,54916
4,Delaware,high,0.9,87.0,0.323,low,low,high,57522


Let's build a model of `income` using the `urbanization` and `hs` variables.

First we'll make a preprocessor to dummy encode the `urbanization` column and pass the `hs` through untransformed to our model.

In [3]:
# fix the following code and execute the cell

ct = make_column_transformer(
    ['passthrough', ['hs']],
    [OneHotEncoder(drop=['low']), ['urbanization']]
)

ct

Now let's split the data into predictors and outcome. We establish our `outcome` variable and will send the rest of the data to our modeling pipeline.

In [4]:
outcome = 'income'

# idiomatically "X" stands for training data and "y" for the outcome
X, y = df_hate_crimes.loc[:, df_hate_crimes.columns != outcome], df_hate_crimes[outcome]
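The boolean-mask selection above can also be written with `DataFrame.drop`. The sketch below uses a small made-up dataframe (`toy`, `X_toy`, and `y_toy` are hypothetical names, not part of the exercise) to show that both approaches give the same result.

```python
import pandas as pd

# Hypothetical toy data standing in for df_hate_crimes
toy = pd.DataFrame({
    "hs": [83.0, 90.0, 85.0],
    "urbanization": ["low", "low", "high"],
    "income": [46686, 51710, 54310],
})

outcome = "income"

# Approach used in the cell above: keep every column whose name is not the outcome
X_mask = toy.loc[:, toy.columns != outcome]

# Equivalent approach: drop the outcome column by name
X_toy = toy.drop(columns=outcome)
y_toy = toy[outcome]

print(X_toy.columns.tolist())
print(X_toy.equals(X_mask))
```

Either form leaves the original dataframe unchanged, so `toy` still contains the `income` column afterwards.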

Preview `X` to confirm that the outcome column is no longer in the training data.

In [5]:
X.head()

Unnamed: 0,state,median_house_inc,share_pop_metro,hs,hate_crimes,trump_support,unemployment,urbanization
0,New Mexico,low,0.69,83.0,0.295,low,high,low
1,Maine,low,0.54,90.0,0.616,low,low,low
2,New York,low,0.94,85.0,0.351,low,low,high
3,Illinois,low,0.9,86.0,0.195,low,high,high
4,Delaware,high,0.9,87.0,0.323,low,low,high


In [6]:
y.head()

0    46686
1    51710
2    54310
3    54916
4    57522
Name: income, dtype: int64

Use the column transformer's `fit_transform` method to see how it transforms your training data. That is, call `fit_transform` with `X` as the argument.

In [11]:
ct.fit_transform(X)

array([[83.,  0.],
       [90.,  0.],
       [85.,  1.],
       [86.,  1.],
       [87.,  1.],
       [85.,  1.],
       [89.,  1.],
       [90.,  1.],
       [81.,  1.],
       [91.,  0.],
       [89.,  1.],
       [89.,  1.],
       [87.,  1.],
       [87.,  1.],
       [92.,  0.],
       [87.,  1.],
       [89.,  1.],
       [89.,  1.],
       [84.,  0.],
       [85.,  1.],
       [84.,  0.],
       [84.,  1.],
       [84.,  1.],
       [88.,  0.],
       [84.,  1.],
       [88.,  1.],
       [80.,  1.],
       [88.,  1.],
       [91.,  0.],
       [90.,  0.],
       [90.,  1.],
       [91.,  0.],
       [91.,  0.],
       [80.,  0.],
       [83.,  0.],
       [82.,  0.],
       [82.,  1.],
       [82.,  0.],
       [83.,  1.],
       [82.,  0.],
       [86.,  0.],
       [87.,  0.],
       [91.,  0.],
       [88.,  0.],
       [90.,  0.]])

Next we'll build our pipeline/model. Execute the following cell. Does the output make sense? Can you find which columns were passed through your column transformer by clicking through the output pipeline visualization?

In [None]:
pl = make_pipeline(
    ct, # or whatever you called your column transformer
    LinearRegression()
)

pl.fit(X, y)

Make a new dataframe called `df_hate_crimes_w_pred` by calling the following method from the `df_hate_crimes` dataframe: `.assign(pred_income=lambda df_: pl.predict(df_))`.

What is the name of your predictions column?

Use Plotnine and your `df_hate_crimes_w_pred` dataframe to plot your model. Use `geom_point` for your observed values and `geom_line` for your predicted values.

Use the function below to show the regression table for your model.

In [None]:
def get_regression_table(pipeline):
    
    ct = pipeline['columntransformer']
    terms = list(ct.get_feature_names_out()) + ['intercept']
    
    mod = pipeline['linearregression']
    coefs = mod.coef_
    intercept = mod.intercept_
    estimates = list(coefs) + [intercept]
    
    data = {
        "term": terms,
        "estimate": estimates
    }
    
    return pd.DataFrame(data)

😎 **BONUS** Inspect the function above. How do you access the "model" from a pipeline? What about the coefficients for your model terms?
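If you want to check your answer, here is a minimal sketch on made-up data (`toy` and `toy_pl` are hypothetical names). It assumes the pipeline was built with `make_pipeline`, which names each step after its lowercased class name, and shows a few equivalent ways to reach the fitted model.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data, not the hate crimes dataset
toy = pd.DataFrame({
    "hs": [83.0, 90.0, 85.0, 86.0],
    "urbanization": ["low", "low", "high", "high"],
    "income": [46686, 51710, 54310, 54916],
})

toy_pl = make_pipeline(
    make_column_transformer(
        ("passthrough", ["hs"]),
        (OneHotEncoder(drop=["low"]), ["urbanization"]),
    ),
    LinearRegression(),
).fit(toy[["hs", "urbanization"]], toy["income"])

# Step names generated by make_pipeline
step_names = list(toy_pl.named_steps)
print(step_names)

# Three ways to get the same fitted estimator
model = toy_pl["linearregression"]                     # index by step name
same_model = toy_pl.named_steps["linearregression"]    # via named_steps
last_step = toy_pl[-1]                                 # index by position

# Fitted parameters live on the model, one coefficient per transformed column
print(model.coef_)
print(model.intercept_)
```

Indexing by name, as `get_regression_table` does, tends to be more readable than indexing by position when a pipeline has several steps.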