## Transformations in Linear Regression

Linear regression assumes that the dependent variable is linearly related to its predictors.  This assumption applies to the terms as you specify them in the model, so you can transform the variables before estimating the model.  Transformations can be applied to:

- Just the dependent variable
- One or more predictor variables
- Both the predictor and dependent variables

Some common transformations are outlined here:

https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=&cad=rja&uact=8&ved=2ahUKEwibtIr4o8rvAhVYXM0KHTl9CXYQFjADegQIFRAD&url=https%3A%2F%2Fonlinepubs.trb.org%2Fonlinepubs%2Fnchrp%2Fcd-22%2Fmanual%2Fv2appendixb.pdf&usg=AOvVaw2MUNcGvy4XZhF9G6DBqFZq


This lesson draws from David Dranove's excellent explanation of log transformations in regression.  I very much recommend that you read it: 

https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=15&cad=rja&uact=8&ved=2ahUKEwjI0fu82NvoAhXRWM0KHUuVDCkQFjAOegQICBAB&url=https%3A%2F%2Fcanvas.northwestern.edu%2Ffiles%2F1812457%2Fdownload%3Fdownload_frd%3D1%26verifier%3DQBFTMd2yRHbWR6yC6mp2s0in7G9N3rRZiRFStMrA&usg=AOvVaw1JP3fLhZJA3IEhBHVFbuRq

### Log transformations

In this lesson, we will focus on log transformations, which are among the most common.  First, a refresher on the properties of logs and exponentials.  (It's been a while, so I tend to keep the cheat sheet around!)

![image](log_exp_graph.jpg)

![image](log_properties.gif)

### What happens when we specify a log-transformed regression model?

Let's work out the math...
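
As a starting point, here is a sketch of the standard textbook derivation.  The symbols $y$, $x$, $\beta_0$, $\beta_1$ and $\varepsilon$ are generic notation introduced here for illustration, and "log" means the natural log.

In the linear model, a one-unit increase in $x$ adds $\beta_1$ units to $y$:

$$
y = \beta_0 + \beta_1 x + \varepsilon
$$

In the log-linear model, only the dependent variable is logged.  Exponentiating both sides shows that a one-unit increase in $x$ multiplies $y$ by $e^{\beta_1}$, which is roughly a $100 \cdot \beta_1$ percent change when $\beta_1$ is small:

$$
\ln y = \beta_0 + \beta_1 x + \varepsilon
\quad\Longrightarrow\quad
y = e^{\beta_0} \, e^{\beta_1 x} \, e^{\varepsilon}
$$

In the log-log model, both sides are logged.  Here $\beta_1$ is an elasticity: a 1 percent increase in $x$ is associated with approximately a $\beta_1$ percent change in $y$:

$$
\ln y = \beta_0 + \beta_1 \ln x + \varepsilon
\quad\Longrightarrow\quad
y = e^{\beta_0} \, x^{\beta_1} \, e^{\varepsilon},
\qquad
\beta_1 = \frac{d \ln y}{d \ln x}
$$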

### Why take logs?

There are two main reasons to specify a log model instead of a linear model:

1. It fits the data better
2. You have a theoretical reason to think that the relationship should be multiplicative instead of additive
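
To make the second reason concrete, here is a minimal sketch using made-up data, not the wage data.  It assumes a hypothetical multiplicative process `y = a * x**b * noise` (the names `a`, `b`, `x` and `y` are invented for this illustration) and shows that a straight-line fit in log space recovers the parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical multiplicative process: y = a * x**b * noise
a, b = 2.0, 0.7
x = rng.uniform(1, 100, size=5000)
y = a * x**b * np.exp(rng.normal(0, 0.1, size=5000))

# Taking logs turns the product into a sum:
#   log(y) = log(a) + b * log(x) + error
# so an ordinary straight-line fit on the logged values estimates b and log(a).
slope, intercept = np.polyfit(np.log(x), np.log(y), deg=1)

print(f"estimated b: {slope:.3f}")
print(f"estimated a: {np.exp(intercept):.3f}")
```

If the true process were additive instead, the same fit in log space would generally be the wrong specification, which is why the theoretical argument matters.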

### Let's try again with the wage data

In [None]:
import pandas as pd
import seaborn as sns
import numpy as np
import statsmodels.formula.api as smf

# this allows plots to appear directly in the notebook
%matplotlib inline

In [None]:
# get the data
df = pd.read_csv('data/psam_p21.csv')
df.head()

In [None]:
# keep only the people who have worked within the last 12 months
'''
WKL 1
When last worked
b .N/A (less than 16 years old)
1 .Within the past 12 months
2 .1-5 years ago
3 .Over 5 years ago or never worked
'''

# .copy() avoids a SettingWithCopyWarning when we add columns later
df = df[df['WKL']==1].copy()
len(df)

### Look at the data

It is often a good idea to look at the data we want to model, and see how it relates to some variables we expect to be important.  We can do this using seaborn, which we learned about a few weeks ago.  

In [None]:
'''
WAGP 6
Wages or salary income past 12 months
bbbbbb .N/A (less than 15 years old)
000000 .None
000001..999999 .$1 to 999999 (Rounded and top-coded)
'''

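# distplot is deprecated in recent seaborn versions; histplot with kde=True is the closest replacement
sns.histplot(df['WAGP'], kde=True)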

In [None]:
# we may think that wages relate to hours worked, so let's look at that relationship
'''
WKHP 2
Usual hours worked per week past 12 months
bb .N/A (less than 16 years old/did not work during the past 12
.months)
01..98 .1 to 98 usual hours
99 .99 or more usual hours
'''

sns.jointplot(x="WKHP", y="WAGP", data=df, height=8)

### Estimating some models

In [None]:
# linear model
mod = smf.ols(formula='WAGP ~ WKHP', data=df)
res = mod.fit()
print(res.summary())

In [None]:
# log-log model
mod = smf.ols(formula='np.log(1+WAGP) ~ np.log(1+WKHP)', data=df)
res = mod.fit()
print(res.summary())

In [None]:
# log-linear model
mod = smf.ols(formula='np.log(1+WAGP) ~ WKHP', data=df)
res = mod.fit()
print(res.summary())
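
When comparing these models, it can help to translate log-model coefficients into percent changes.  The helpers below are a minimal sketch; the function names and the coefficient values passed to them are hypothetical, not results from the wage data.  Because the formulas above use `log(1+WAGP)` rather than `log(WAGP)`, the percent interpretation is only approximate, especially for people with very low wages.

```python
import numpy as np

def pct_change_log_linear(beta):
    """Percent change in y for a one-unit increase in x when the model is log(y) ~ x."""
    return 100.0 * (np.exp(beta) - 1.0)

def pct_change_log_log(beta, pct_x):
    """Percent change in y for a pct_x percent increase in x when the model is log(y) ~ log(x)."""
    return 100.0 * ((1.0 + pct_x / 100.0) ** beta - 1.0)

# Hypothetical coefficients, for illustration only
print(pct_change_log_linear(0.05))   # effect of one more unit of x in a log-linear model
print(pct_change_log_log(0.8, 10))   # effect of a 10% increase in x in a log-log model
```

Also keep in mind that R-squared is not directly comparable between a model of `WAGP` and a model of `log(1+WAGP)`, because the dependent variables differ.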

In [None]:
# What makes sense here?

### Categorical variables

We may also want to include categorical variables.  We can include them by calculating a 'dummy' variable, which is 1 if the value is in a category, and zero otherwise.  
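
Here is a small sketch of building dummy variables with pandas.  The toy data frame, the bin edges and the category labels are made up for illustration; only the `SCHL` column name comes from the data dictionary.

```python
import pandas as pd

# Toy data, made up for illustration (not from the PUMS file)
toy = pd.DataFrame({"SCHL": [16, 19, 21, 24, 12]})

# A single dummy: 1 if the person has an associate's degree or higher, 0 otherwise
toy["college_grad"] = (toy["SCHL"] >= 20).astype(int)

# Several dummies from one categorical variable.
# The bin edges and labels here are assumptions chosen for the example.
toy["edu_level"] = pd.cut(
    toy["SCHL"],
    bins=[0, 15, 19, 24],
    labels=["below_hs", "hs_or_some_college", "college_degree"],
)

# drop_first=True omits one category, which becomes the reference group
dummies = pd.get_dummies(toy["edu_level"], prefix="edu", drop_first=True)
print(toy.join(dummies))
```

One category is always left out as the reference group, because including a dummy for every category alongside an intercept would make the predictors perfectly collinear.  In a statsmodels formula, wrapping a column name in `C(...)` asks the formula parser to build the dummies for you.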

In [None]:
'''
SCHL 2
Educational attainment
bb .N/A (less than 3 years old)
01 .No schooling completed
02 .Nursery school, preschool
03 .Kindergarten
04 .Grade 1
05 .Grade 2
06 .Grade 3
07 .Grade 4
08 .Grade 5
09 .Grade 6
10 .Grade 7
11 .Grade 8
12 .Grade 9
13 .Grade 10
14 .Grade 11
15 .12th grade - no diploma
16 .Regular high school diploma
17 .GED or alternative credential
18 .Some college, but less than 1 year
19 .1 or more years of college credit, no degree
20 .Associate's degree
21 .Bachelor's degree
22 .Master's degree
23 .Professional degree beyond a bachelor's degree
24 .Doctorate degree
'''

# Flag college graduates: SCHL >= 20 means an associate's degree or higher
df['college_grad'] = df['SCHL'] >= 20

In [None]:
# note that I can wrap long strings with a \ character
mod = smf.ols(formula="np.log(1+WAGP) \
                       ~ np.log(1+WKHP) \
                       + college_grad", 
              data=df)
res = mod.fit()
print(res.summary())

In [None]:
# What would you recommend for age?  

## Homework

1. Read and understand:

https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=15&cad=rja&uact=8&ved=2ahUKEwjI0fu82NvoAhXRWM0KHUuVDCkQFjAOegQICBAB&url=https%3A%2F%2Fcanvas.northwestern.edu%2Ffiles%2F1812457%2Fdownload%3Fdownload_frd%3D1%26verifier%3DQBFTMd2yRHbWR6yC6mp2s0in7G9N3rRZiRFStMrA&usg=AOvVaw1JP3fLhZJA3IEhBHVFbuRq

2. Revisit our previous homework modeling wages.  Specify your best log-log and log-linear models.  What would you recommend in this situation?  