# Lec 23 - Step Functions
## CMSE 381 - Fall 2023
## Nov 3, 2023



We're going to try again with the step functions.

In [None]:
# Everyone's favorite standard imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import time


# ML imports we've used previously
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression

import statsmodels.api as sm


# 0. Loading in the data

We're going to use the `Wage` data used in the book, so note that many of your plots can be checked by looking at figures in the book.

In [None]:
df = pd.read_csv('../../DataSets/Wage.csv', index_col =0 )
df.head()

In [None]:
df.info()

In [None]:
df.describe()

Here's the plot we used multiple times in class to look at a single variable:  `age` vs `wage`

In [None]:
plt.scatter(df.age[df.wage <= 250], df.wage[df.wage <= 250], marker='*', label='<= 250')
plt.scatter(df.age[df.wage > 250], df.wage[df.wage > 250], label='> 250')
plt.legend()

plt.xlabel('Age')
plt.ylabel('Wage')

# 1. Step functions

Now let's try to use step functions to learn a model using `age` to predict `wage`. Like with the polynomial example from last time, all we're going to do is build a data frame or feature matrix that has the step function values in each column, and then pass that matrix to our favorite linear modeling function. 
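As a reminder of the setup, and assuming the notation from class matches the book's, with knots $c_1 < c_2 < \cdots < c_K$ the step functions are indicator functions, and the model is linear in them:

$$
C_0(X) = I(X < c_1), \qquad C_k(X) = I(c_k \le X < c_{k+1}) \ \text{ for } 1 \le k \le K-1, \qquad C_K(X) = I(c_K \le X)
$$

$$
y_i = \beta_0 + \beta_1 C_1(x_i) + \beta_2 C_2(x_i) + \cdots + \beta_K C_K(x_i) + \epsilon_i
$$

Here $I(\cdot)$ is 1 when its condition holds and 0 otherwise, so for any $X$ exactly one of the $C_k(X)$ equals 1.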

First, we want to get a series recording which bin each `age` falls in. Setting `right = False` tells `pd.cut` not to include the right endpoint in each bin, so our bins end up as $[c_i,c_{i+1})$, which follows the notation in the book.

In [None]:
df_cut, bins = pd.cut(df.age, 4, 
                      retbins = True, #<---- Says I want it to return the bins (aka the knots)
                      right = False) 

I will define the entries in the bins to be the $c_i$'s as follows. 


In [None]:
print(bins)

In [None]:
print(r'c_1 = ', bins[0])
print(r'c_2 = ', bins[1])
print(r'c_3 = ', bins[2])
print(r'c_4 = ', bins[3])
print(r'c_5 = ', bins[4])
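If you're wondering where a last knot like `80.062` comes from, here is a minimal sketch on made-up ages (not the `Wage` data). As far as I understand `pd.cut` with an integer number of bins, it takes equally spaced edges between the min and max, and with `right = False` it pushes the last edge out by 0.1% of the range so the maximum still lands inside the final bin. The names `toy_ages`, `toy_bins`, and `manual` are just for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ages; only the min (18) and max (80) matter for the edges
toy_ages = pd.Series([18, 25, 33, 47, 52, 61, 80])

toy_cut, toy_bins = pd.cut(toy_ages, 4, retbins=True, right=False)

# Rebuild the edges by hand: equal-width edges across [min, max] ...
manual = np.linspace(toy_ages.min(), toy_ages.max(), 5)
# ... then extend the last edge by 0.1% of the range (the right=False case)
manual[-1] += 0.001 * (toy_ages.max() - toy_ages.min())

print(toy_bins)
print(manual)
```

Both printed arrays should agree, and every toy age (including 80) gets assigned to a bin.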

&#9989; **<font color=red>Do this:</font>**
 For each of the functions $C_0(X)$, $C_1(X)$, $C_2(X)$, $C_3(X)$, $C_4(X)$, $C_5(X)$ (following our notation in class), determine the domains where they have value 1. 

*Your answer here*

- $C_0(X)$:
- $C_1(X)$:
- $C_2(X)$: 
- $C_3(X)$: 
- $C_4(X)$: 
- $C_5(X)$: 

Below is my code that generates the data frame storing $C_i(X)$ for all our entries. 

In [None]:
df_steps_dummies = pd.get_dummies(df_cut)
df_steps_dummies.head()

&#9989; **<font color=red>Q:</font>** Which of the functions $C_i(X)$ for $i=0,\cdots, 5$ have columns represented in this matrix? *Note: it's not all of them*


*Your answer here*

One annoying difference from the book is that because our code saw no data in the intervals $(-\infty, 18)$ or $[80.062,\infty)$, it doesn't make us a column for either of those. This is totally fine as long as we don't later ask our model to predict anything outside of the range $[18.0, 80.062)$, so for the remainder of the notebook, we'll make sure we don't pass it anything outside of those values.
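To see why out-of-range inputs are a problem, here is a small sketch. The edges in `demo_bins` are typed in by hand and assume ages running from 18 to 80 cut into 4 bins; `demo_ages` is made up. Values outside the edges get no bin at all, so their row of step-function features is all zeros.

```python
import numpy as np
import pandas as pd

# Assumed bin edges for ages 18 to 80 in 4 bins (typed in by hand for this sketch)
demo_bins = np.array([18.0, 33.5, 49.0, 64.5, 80.062])
demo_ages = pd.Series([10, 25, 90])  # 10 and 90 fall outside [18, 80.062)

# Passing explicit edges reuses the same bins instead of recomputing them
demo_cut = pd.cut(demo_ages, bins=demo_bins, right=False)
demo_dummies = pd.get_dummies(demo_cut)

print(demo_cut)                   # NaN for the out-of-range ages
print(demo_dummies.sum(axis=1))   # 0 for those rows, 1 for the in-range row
```

A model fit on these columns has nothing to go on for an all-zero row, which is why we restrict predictions to the observed range.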

&#9989; **<font color=red>Do this:</font>** Pass this matrix to a linear regression model and use it to predict `wage`. What is the equation for your learned model? Be specific in terms of the $C_i$ functions you learned earlier.

In [None]:
# Your code here #
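One way to sanity check whatever model you get: a linear regression on step-function features should predict the average response within each bin. Here is a sketch on made-up data (not the `Wage` data) with hypothetical names `toy`, `toy_model`, and so on. I convert the dummies to a plain float array only to sidestep any fuss over column labels; that is a choice for this sketch, not a requirement.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: 8 points that fall two per bin
toy = pd.DataFrame({'x': [1, 2, 3, 4, 5, 6, 7, 8],
                    'y': [2.0, 4.0, 3.0, 5.0, 10.0, 12.0, 7.0, 9.0]})

toy_cut = pd.cut(toy.x, 4, right=False)
toy_X = pd.get_dummies(toy_cut).to_numpy(dtype=float)

toy_model = LinearRegression().fit(toy_X, toy.y)
toy_pred = toy_model.predict(toy_X)

# Mean of y within each bin, lined up with the original rows
toy_bin_means = toy.y.groupby(toy_cut.cat.codes).transform('mean')

print(toy_pred)
print(toy_bin_means.to_numpy())
```

The two printed arrays should match up to floating point error. Note that the four dummy columns sum to 1, so together with the intercept the individual coefficients are not uniquely determined, but the predictions are.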

Assuming you stored your linear regression model as `linreg`, the following code will plot the learned function. Check that the equation you wrote down above matches what you're seeing in the graph.

In [None]:
t = np.linspace(20,80,100) #<--- Remember my rule that I can't pass anything outside
                           #     of [18,80.062)

bin_mapping = np.digitize(t, bins)

# print(bin_mapping)
t_dummies = pd.get_dummies(bin_mapping)
t_dummies.head()
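A quick sketch of what `np.digitize` is doing here, using hand-typed edges (assumed, for ages 18 to 80 in 4 bins) and a made-up grid. With the default settings, index `i` means `bins[i-1] <= x < bins[i]`, which lines up with our $[c_i, c_{i+1})$ bins. One thing to watch: `pd.get_dummies` only makes columns for indices that actually show up, so if a grid skips a bin you'd be missing a column. The `reindex` call below is one possible guard against that.

```python
import numpy as np
import pandas as pd

# Assumed bin edges (typed in by hand for this sketch)
demo_bins = np.array([18.0, 33.5, 49.0, 64.5, 80.062])
t_demo = np.array([18.0, 33.5, 40.0, 80.0])  # note: nothing lands in bin 3

idx = np.digitize(t_demo, demo_bins)
print(idx)

# Force one column per bin, filling any missing bin with zeros
t_demo_dummies = pd.get_dummies(idx).reindex(columns=[1, 2, 3, 4], fill_value=0)
print(t_demo_dummies)
```

In the notebook's grid of 100 points between 20 and 80 every bin gets hit, so the guard isn't needed there.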

In [None]:
stepPredict = linreg.predict(t_dummies) #<---- If you named your linear regression 
                                        #      something else, you can fix this to match.
            
#--------Uncomment below to draw the scatter plot of the data as well-------#
# plt.scatter(df.age,df.wage,marker = '+')


plt.xlabel('Age')
plt.ylabel('Wage')

plt.plot(t,stepPredict,color='red')


![Stop Icon](https://upload.wikimedia.org/wikipedia/commons/thumb/1/1e/Vienna_Convention_road_sign_B2a.svg/180px-Vienna_Convention_road_sign_B2a.svg.png)

Great, you got to here! Hang out for a bit, there's more lecture before we go on to the next portion. 

# 2.  Classification version of step functions

Now we can try out the classification version of the problem. Let's build the classifier that predicts whether a person of a given age will make more than $250,000 (that is, `wage > 250`). You already made the matrix of step function features, so we just have to hand it to `LogisticRegression` to do its thing.

In [None]:

plt.scatter(df.age[df.wage <=250], df.wage[df.wage<=250],marker = '*')
plt.scatter(df.age[df.wage >250], df.wage[df.wage>250])

plt.xlabel('Age')
plt.ylabel('Wage')

# plt.savefig('WageColoredBy250.png', bbox_inches = 'tight')


In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
y = np.array(df.wage>250) #<--- this makes sure I 
                          #     just have true/false input
clf = LogisticRegression(random_state=48824)
clf.fit(df_steps_dummies,y)

In [None]:
f = clf.predict_proba(t_dummies)
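A note on reading the output of `predict_proba`: it returns one column per class, in the order given by the model's `classes_` attribute. With a true/false target that order is `[False, True]`, so column 1 is the estimated probability of `True`. Here is a sketch on a made-up problem (the names `X_toy`, `y_toy`, and `toy_clf` are just for illustration).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy problem: one 0/1 feature and a boolean label
X_toy = np.array([[0], [0], [0], [0], [1], [1], [1], [1]])
y_toy = np.array([False, False, False, True, True, True, True, False])

toy_clf = LogisticRegression().fit(X_toy, y_toy)
toy_proba = toy_clf.predict_proba(X_toy)

print(toy_clf.classes_)        # order of the columns in toy_proba
print(toy_proba.sum(axis=1))   # each row sums to 1
print(toy_proba[:, 1])         # estimated P(True) for each row
```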

In [None]:
below = df.age[df.wage <=250]
above = df.age[df.wage >250]

# Comment this out to see the function better
plt.scatter(above,np.ones(above.shape[0]),marker = '|', color = 'orange')
plt.scatter(below,np.zeros(below.shape[0]),marker = '|', color = 'blue')



plt.xlabel('Age')
plt.ylabel('P[Wage > 250]')
plt.plot(t,f[:,1])



-----
### Congratulations, we're done!
Written by Dr. Liz Munch, Michigan State University

<a rel="license" href="http://creativecommons.org/licenses/by-nc/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc/4.0/">Creative Commons Attribution-NonCommercial 4.0 International License</a>.