# The Multiple Regression Model (Second Example)

### Intro and objectives


### In this lab you will learn:
1. examples of multiple regression models.
2. how to fit multiple regression models in Python.


## What I hope you'll get out of this lab
* The feeling that you'll "know where to start" when you need to fit a multiple regression model.
* Worked examples of multiple regression models
* How to interpret the results obtained

In [19]:
!pip install wooldridge
import wooldridge as woo
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
import seaborn as sns



# Example 2. Explaining arrest records

#### CRIME1 contains data on arrests during the year 1986 and other information on 2,725 men born in either 1960 or 1961 in California. Each man in the sample was arrested at least once prior to 1986.
#### The variable narr86 is the number of times the man was arrested during 1986: it is zero for most men in the sample (72.29%), and it varies from 0 to 12. (The percentage of men arrested once during 1986 was 20.51.) The variable pcnv is the proportion (not percentage) of arrests prior to 1986 that led to conviction, avgsen is average sentence length served for prior convictions (zero for most people), ptime86 is months spent in prison in 1986, and qemp86 is the number of quarters during which the man was employed in 1986 (from zero to four).


#### In this case we consider the following multiple linear model (the first fit below omits avgsen; it is added in the second fit):
$ narr86=\beta_0+\beta_1*pcnv+\beta_2*avgsen+\beta_3*ptime86+\beta_4*qemp86+u $




### Using the data in CRIME1 where n=2725 individuals

In [20]:
crimeData = woo.dataWoo('crime1')


In [21]:
crimeData.head()

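| | narr86 | nfarr86 | nparr86 | pcnv | avgsen | tottime | ptime86 | qemp86 | inc86 | durat | black | hispan | born60 | pcnvsq | pt86sq | inc86sq |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0.38 | 17.6 | 35.200001 | 12 | 0.0 | 0.0 | 0.0 | 0 | 0 | 1 | 0.1444 | 144 | 0.0 |
| 1 | 2 | 2 | 0 | 0.44 | 0.0 | 0.0 | 0 | 1.0 | 0.8 | 0.0 | 0 | 1 | 0 | 0.1936 | 0 | 0.64 |
| 2 | 1 | 1 | 0 | 0.33 | 22.799999 | 22.799999 | 0 | 0.0 | 0.0 | 11.0 | 1 | 0 | 1 | 0.1089 | 0 | 0.0 |
| 3 | 2 | 2 | 1 | 0.25 | 0.0 | 0.0 | 5 | 2.0 | 8.8 | 0.0 | 0 | 1 | 1 | 0.0625 | 25 | 77.440002 |
| 4 | 1 | 1 | 0 | 0.0 | 0.0 | 0.0 | 0 | 2.0 | 8.1 | 1.0 | 0 | 0 | 0 | 0.0 | 0 | 65.610008 |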


In [22]:
crimeData.describe()

| | narr86 | nfarr86 | nparr86 | pcnv | avgsen | tottime | ptime86 | qemp86 | inc86 | durat | black | hispan | born60 | pcnvsq | pt86sq | inc86sq |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2725.0 | 2725.0 | 2725.0 | 2725.0 | 2725.0 | 2725.0 | 2725.0 | 2725.0 | 2725.0 | 2725.0 | 2725.0 | 2725.0 | 2725.0 | 2725.0 | 2725.0 | 2725.0 |
| mean | 0.404404 | 0.233394 | 0.125505 | 0.357787 | 0.632294 | 0.838752 | 0.387156 | 2.309028 | 54.967046 | 2.251376 | 0.161101 | 0.217615 | 0.362569 | 0.284131 | 3.951193 | 7458.93262 |
| std | 0.859077 | 0.581014 | 0.482847 | 0.395192 | 3.508031 | 4.607019 | 1.950051 | 1.610428 | 66.627213 | 4.607063 | 0.367691 | 0.4127 | 0.48083 | 0.390734 | 22.08584 | 16361.238454 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.16 |
| 50% | 0.0 | 0.0 | 0.0 | 0.25 | 0.0 | 0.0 | 0.0 | 3.0 | 29.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0625 | 0.0 | 841.0 |
| 75% | 1.0 | 0.0 | 0.0 | 0.67 | 0.0 | 0.0 | 0.0 | 4.0 | 90.099998 | 2.0 | 0.0 | 0.0 | 1.0 | 0.4489 | 0.0 | 8118.009766 |
| max | 12.0 | 6.0 | 8.0 | 1.0 | 59.200001 | 63.400002 | 12.0 | 4.0 | 541.0 | 25.0 | 1.0 | 1.0 | 1.0 | 1.0 | 144.0 | 292681.0 |


In [23]:
type(crimeData)

pandas.core.frame.DataFrame

In [24]:
# We specify a simple linear model
# and pass crimeData as the empirical dataset

reg = smf.ols(formula='narr86 ~ pcnv + ptime86 + qemp86', data=crimeData)

In [25]:
# We fit the model
results = reg.fit()


In [26]:
b = results.params
print(f'b: \n{b}\n')

b: 
Intercept    0.711772
pcnv        -0.149927
ptime86     -0.034420
qemp86      -0.104113
dtype: float64



In [27]:
results.rsquared

0.04132330770123016

## Based on the previous results, we have fitted the following model:

$ \widehat{narr86}=0.712-0.150*pcnv-0.034*ptime86-0.104*qemp86 $

$R^2=0.041$


#### where pcnv is a proxy for the likelihood of being convicted of a crime. The variable ptime86 captures the incarcerative effects of crime: if an individual is in prison, he cannot be arrested for a crime outside of prison. Labor market opportunities are crudely captured by qemp86.
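
#### To make the fitted equation concrete, the sketch below evaluates it for one hypothetical man. The helper `predict_narr86` and the example inputs are illustrative assumptions, not part of the lab; the coefficients are copied from the `results.params` output above.

```python
# Coefficients copied from the results.params output of the three-regressor fit.
COEFS = {
    "Intercept": 0.711772,
    "pcnv": -0.149927,
    "ptime86": -0.034420,
    "qemp86": -0.104113,
}

def predict_narr86(pcnv, ptime86, qemp86):
    """Predicted number of 1986 arrests from the fitted linear equation."""
    return (COEFS["Intercept"]
            + COEFS["pcnv"] * pcnv
            + COEFS["ptime86"] * ptime86
            + COEFS["qemp86"] * qemp86)

# Hypothetical man: 25% of prior arrests led to conviction,
# no prison time in 1986, employed in 3 quarters of 1986.
example = predict_narr86(pcnv=0.25, ptime86=0, qemp86=3)
print(round(example, 3))
```

#### Because the model is linear, the prediction is a fractional number of arrests; it is best read as an average over many men with the same characteristics.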

## How do we interpret the equation?

#### This equation says that, as a group, the three variables pcnv, ptime86, and qemp86 explain about 4.1% ($R^2=0.041$) of the variation in narr86.
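
#### As a reminder of what $R^2$ measures, here is a minimal sketch of the usual definition $R^2 = 1 - SSR/SST$. The function name `r_squared` and the toy numbers are assumptions made for illustration; they are not CRIME1 data, and the toy fitted values are invented rather than produced by OLS.

```python
def r_squared(y, y_hat):
    """R^2 = 1 - SSR/SST for observed values y and fitted values y_hat."""
    y_bar = sum(y) / len(y)
    ssr = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # residual sum of squares
    sst = sum((yi - y_bar) ** 2 for yi in y)               # total sum of squares
    return 1 - ssr / sst

# Toy data: mostly zeros, with fitted values that track y only loosely.
y_toy = [0, 0, 1, 0, 2, 0, 1, 0]
y_hat_toy = [0.4, 0.3, 0.6, 0.5, 0.7, 0.4, 0.6, 0.5]
r2_toy = r_squared(y_toy, y_hat_toy)
print(round(r2_toy, 3))
```

#### For a model with an intercept, `results.rsquared` reports this same quantity, so a value of 0.041 means the residual sum of squares is still about 95.9% of the total variation in narr86.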

#### Each of the OLS slope coefficients has the anticipated sign. An increase in the proportion of convictions lowers the predicted number of arrests.

#### Similarly, a longer prison term leads to a lower predicted number of arrests. In fact, if ptime86 increases from 0 to 12 months, predicted arrests for a particular man fall by .034*12=.408. Another quarter in which legal employment is reported lowers predicted arrests by .104, which would be 10.4 arrests among 100 men.
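
#### The arithmetic in this interpretation can be reproduced in a few lines. This is a sketch using the rounded slope coefficients; the helper `change_in_prediction` is a hypothetical name, and the 0.5 change in pcnv is an extra illustrative case not discussed in the text.

```python
# Rounded slope coefficients from the fitted three-regressor model.
B_PCNV = -0.150
B_PTIME86 = -0.034
B_QEMP86 = -0.104

def change_in_prediction(coef, delta_x):
    """Change in predicted narr86 when one regressor moves by delta_x,
    holding the other regressors fixed."""
    return coef * delta_x

d_prison = change_in_prediction(B_PTIME86, 12)  # 0 -> 12 months in prison
d_job = change_in_prediction(B_QEMP86, 1)       # one more quarter employed
d_job_per_100 = 100 * d_job                     # same change, per 100 men
d_pcnv = change_in_prediction(B_PCNV, 0.5)      # illustrative: pcnv up by 0.5

print(round(d_prison, 3), round(d_job_per_100, 1), round(d_pcnv, 3))
```

#### Each of these is a ceteris paribus change: the other two regressors are held at whatever values they had.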

## Let's try to improve the model.

#### The previous model explained just 4.1% of the variation in narr86. We can try to improve the model by adding additional explanatory variables, for instance avgsen: the average sentence length served for prior convictions (zero for most people).

In [28]:
reg = smf.ols(formula='narr86 ~ pcnv + avgsen + ptime86 + qemp86', data=crimeData)

In [29]:
# We fit the model
results = reg.fit()


In [30]:
b = results.params
print(f'b: \n{b}\n')

b: 
Intercept    0.706756
pcnv        -0.150832
avgsen       0.007443
ptime86     -0.037391
qemp86      -0.103341
dtype: float64



In [31]:
results.rsquared

0.04219385157543343

#### Thus, adding the average sentence variable increases $R^2$ from .0413 to .0422, a practically small effect. The sign of the coefficient on avgsen is also unexpected: it says that a longer average sentence length increases predicted arrests, and hence criminal activity.
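
#### A short sketch of how small these effects are, using the numbers printed by the two fits. The variable names are illustrative, and the 10-unit change in avgsen is a hypothetical scenario (10 months, if avgsen is recorded in months like ptime86).

```python
# R-squared values copied from the two results.rsquared outputs.
r2_without_avgsen = 0.04132330770123016
r2_with_avgsen = 0.04219385157543343

# Coefficient on avgsen copied from the second results.params output.
b_avgsen = 0.007443

# Extra share of the variation in narr86 explained by adding avgsen.
r2_gain = r2_with_avgsen - r2_without_avgsen

# Hypothetical: change in predicted arrests if avgsen rises by 10 units,
# holding pcnv, ptime86 and qemp86 fixed.
d_arrests = b_avgsen * 10

print(round(r2_gain, 5), round(d_arrests, 4))
```

#### The gain in $R^2$ is under one tenth of a percentage point, and even a 10-unit increase in avgsen moves the prediction by less than 0.1 arrests.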