## Gregory M. Eirich
## Example 
## Lab #5

# Lab 5 Examples

## 1. Run a multiple linear probability model (have at least 2 Xs in the model).  Tell me how you think your independent variables will affect your dependent variable.  Interpret your results.  Were your expectations correct?  Why or why not?

## 2. Run a multiple (binary) logistic model.  (It can be the same as the above LPM or a new model.)  If it is a new model, tell me how you think your independent variables will affect your dependent variable.  Interpret your results in the logit scale.  Were your expectations correct?  Why or why not?

## 3. Get odds ratios from your logit model in Question 2 and interpret some of them.  

## 4. Extra Credit: Get predicted probabilities from your logit model in Question 2 for some constellations of X values and interpret the results.  

## (P.S., I am mostly doing this to show you code that works, but you should put more time/thought into your write-ups and descriptions than I am here.)


## Some preliminary setup code:

In [22]:
from __future__ import division
import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
import os
import matplotlib.pyplot as plt

## Use the 2006 GSS again.

In [23]:
os.chdir('/Users/gregoryeirich/Desktop/Data Analysis/') # change working directory

g = pd.read_csv("GSS.2006.csv.xls")
g.head()

Unnamed: 0,vpsu,vstrat,adults,ballot,dateintv,famgen,form,formwt,gender1,hompop,...,away7,gender14,old14,relate14,relhh14,relhhd14,relsp14,where12,where6,where7
0,1,1957,1,3,316,2,1,1,2,3,...,,,,,,,,,,
1,1,1957,2,2,630,1,2,1,2,2,...,,,,,,,,,,
2,1,1957,2,2,314,2,1,1,2,2,...,,,,,,,,,,
3,1,1957,1,1,313,1,2,1,2,1,...,,,,,,,,,,
4,1,1957,3,1,322,2,2,1,2,3,...,,,,,,,,,,


## 1. Run a multiple linear probability model (have at least 2 Xs in the model).  Tell me how you think your independent variables will affect your dependent variable.  Interpret your results.  Were your expectations correct?  Why or why not?
  

## The question is: what predicts answers to "How often do you come home from work exhausted?", with responses ranging from always to never?

In [24]:
## get rid of all missings; necessary for predictions later ##
## .copy() makes sub an independent DataFrame, so adding a column below does not trigger SettingWithCopyWarning ##

sub = g.dropna(subset = ["xhaustn", "hrs1", "age", "prestg80", "babies", "wrkstat"]).copy()

## I am going to recode so that "always exhausted" (xhaustn == 1) is 1 and every other answer is 0 ##

In [25]:
conditions = [
    (sub['xhaustn'] == 1) ,
    (sub['xhaustn'] > 1)]
choices = [1, 0]
sub['exh'] = np.select(conditions, choices, default=np.nan)


In [26]:
pd.crosstab(index=sub["exh"], columns="count")  ## check that the recode worked okay ##

col_0  count
exh
0.0      785
1.0      130
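The same recode can also be written with a single boolean comparison. The sketch below uses a small made-up `toy` DataFrame (not the GSS data) to check that the `np.select` recode and the shortcut agree, including how a missing answer stays missing; the column names `exh_select` and `exh_bool` are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy answers: 1 = "always", larger codes = less often, NaN = missing.
toy = pd.DataFrame({"xhaustn": [1, 2, 5, np.nan, 1, 3]})

# Recode as in the lab: the first matching condition wins, otherwise NaN.
conditions = [toy["xhaustn"] == 1, toy["xhaustn"] > 1]
toy["exh_select"] = np.select(conditions, [1, 0], default=np.nan)

# Shortcut: compare to 1, then blank out rows where the answer was missing.
toy["exh_bool"] = (toy["xhaustn"] == 1).astype(float).where(toy["xhaustn"].notna())

print(toy)
print(toy["exh_select"].value_counts(dropna=False))
```

Either form gives a 0/1 outcome; `np.select` is more explicit when there are several categories to collapse, while the boolean version is shorter for a single cut.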


## We will fit our LPM on full-time workers only (wrkstat == 1) ##

In [27]:
lm1 = smf.ols(formula = 'exh ~ hrs1 + age + prestg80 + babies', subset=(sub['wrkstat']==1), data = sub).fit()
print (lm1.summary())

                            OLS Regression Results                            
Dep. Variable:                    exh   R-squared:                       0.031
Model:                            OLS   Adj. R-squared:                  0.026
Method:                 Least Squares   F-statistic:                     6.321
Date:                Mon, 24 Jun 2019   Prob (F-statistic):           5.23e-05
Time:                        05:21:21   Log-Likelihood:                -270.86
No. Observations:                 783   AIC:                             551.7
Df Residuals:                     778   BIC:                             575.0
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.0158      0.077      0.205      0.8

## These results make sense.  The more hours you work, the more likely you are to say that you are always exhausted, etc.
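For reference, the linear probability model behind the formula `exh ~ hrs1 + age + prestg80 + babies` can be written as below. The β symbols are notation introduced here, not names taken from the output.

```latex
\[
\Pr(\text{exh} = 1 \mid X)
  = \beta_0 + \beta_1\,\text{hrs1} + \beta_2\,\text{age}
  + \beta_3\,\text{prestg80} + \beta_4\,\text{babies}
\]
```

Each slope is read directly as the change in the probability of always being exhausted for a one-unit increase in that X, holding the other Xs constant.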

## 2. Run a multiple (binary) logistic model.  (It can be the same as the above LPM or a new model.)  If it's a new model, tell me how you think your independent variables will affect your dependent variable.  Interpret your results in the logit scale.  Were your expectations correct?  Why or why not?

In [28]:
logit1 = sm.formula.logit(formula = "exh ~ hrs1 + age + prestg80 + babies", subset=(sub['wrkstat']==1), data = sub).fit()
print (logit1.summary()) 

Optimization terminated successfully.
         Current function value: 0.390574
         Iterations 6
                           Logit Regression Results                           
Dep. Variable:                    exh   No. Observations:                  783
Model:                          Logit   Df Residuals:                      778
Method:                           MLE   Df Model:                            4
Date:                Mon, 24 Jun 2019   Pseudo R-squ.:                 0.03762
Time:                        05:21:26   Log-Likelihood:                -305.82
converged:                       True   LL-Null:                       -317.78
                                        LLR p-value:                 8.321e-05
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -2.7199      0.654     -4.157      0.000      -4.002      -1.438
hrs1           0.0318      0.

## Here are the logit coefficients.  Here too, the more hours you work, on average, the higher the log-odds (logit) of always being exhausted: each additional hour adds about 0.032 to the logit, net of the other variables, etc.
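Written out, the logit model has the same right-hand side as the LPM, but the left-hand side is the log-odds rather than the probability itself. As before, the β symbols and `p` are notation introduced here for illustration.

```latex
\[
\log\frac{p}{1 - p}
  = \beta_0 + \beta_1\,\text{hrs1} + \beta_2\,\text{age}
  + \beta_3\,\text{prestg80} + \beta_4\,\text{babies},
\qquad p = \Pr(\text{exh} = 1 \mid X)
\]
```

A coefficient on this scale is an additive change in the log-odds, so its sign tells you the direction of the effect, but its size is not a change in probability.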

## 3. Get odds ratios from your logit model in Question 2 and interpret some of them.  

In [29]:
np.exp(logit1.params)

Intercept    0.065882
hrs1         1.032340
age          1.009581
prestg80     0.975567
babies       1.495597
dtype: float64

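## These are the odds ratios.  For each additional hour someone works per week, the odds of always being exhausted go up by about 3.2%, net of other factors.  Likewise, each additional young child raises the odds by about 49.6%, while each additional point of occupational prestige lowers them by about 2.4%.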
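The percent changes quoted for odds ratios come from `(odds ratio - 1) * 100`. A minimal sketch of that arithmetic follows, with the odds ratios typed in by hand from the output above rather than read from the fitted model; the names `odds_ratios`, `pct_change`, and `ten_hour_ratio` are illustrative.

```python
# Odds ratios copied from the np.exp(logit1.params) output above.
odds_ratios = {
    "hrs1": 1.032340,
    "age": 1.009581,
    "prestg80": 0.975567,
    "babies": 1.495597,
}

# Percent change in the odds for a one-unit increase in each X.
pct_change = {name: (ratio - 1) * 100 for name, ratio in odds_ratios.items()}
for name, pct in pct_change.items():
    print(f"{name}: {pct:+.1f}% change in odds per unit")

# Odds ratios multiply, so a 10-hour increase compounds rather than adds.
ten_hour_ratio = odds_ratios["hrs1"] ** 10
print(f"10 more hours: odds multiplied by {ten_hour_ratio:.2f}")
```

Because the effects are multiplicative, ten extra hours raise the odds by more than ten times 3.2%.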

## 4. Extra Credit: Get predicted probabilities from your logit model in Question 2 for some constellations of X values and interpret the results.  

## I am going to define a function that converts logits to predicted probabilities and apply it to the parameters from my logit model.

In [30]:
def logit2prob(logit):
    """Convert a logit (log-odds) to a probability."""
    odds = np.exp(logit)
    prob = odds / (1 + odds)
    return prob


intercept = logit1.params.Intercept
b_hrs1 = logit1.params.hrs1
b_age = logit1.params.age
b_prestg80 = logit1.params.prestg80
b_babies = logit1.params.babies
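As a quick sanity check on the conversion, the sketch below compares `logit2prob` with the algebraically equivalent form `1 / (1 + exp(-logit))` and with its inverse. The helpers `logit2prob_alt` and `prob2logit` and the test logits are illustrative additions, not part of the lab.

```python
import numpy as np

def logit2prob(logit):
    """Convert log-odds to a probability via the odds, as in the lab."""
    odds = np.exp(logit)
    return odds / (1 + odds)

def logit2prob_alt(logit):
    """Algebraically identical form: 1 / (1 + exp(-logit))."""
    return 1 / (1 + np.exp(-logit))

def prob2logit(prob):
    """Inverse transformation: probability back to log-odds."""
    return np.log(prob / (1 - prob))

logits = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])  # illustrative values only
probs = logit2prob(logits)
print(probs)
print(np.allclose(probs, logit2prob_alt(logits)))  # the two forms agree
print(np.allclose(prob2logit(probs), logits))      # the round trip recovers the logits
```

A logit of 0 corresponds to a probability of 0.5, negative logits to probabilities below 0.5, and positive logits to probabilities above it.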

## I am going to find out the predicted probability of being exhausted for a person who works 40 hours/week, is 30 years old, has a low prestige job of 30 and no young children.

In [31]:
## CHOOSE REPRESENTATIVE VALUES FOR ALL Xs ##
logits_exh = intercept + (40 * b_hrs1) + (30 * b_age) + (30 * b_prestg80) + (0 * b_babies)
logit2prob(logits_exh)

0.12979137976801786

## Someone like that has about a 13.0% probability of always being exhausted.

## Now, I am going to find out the predicted probability of being exhausted for a person who works 80 hours/week, is 50 years old, has a high prestige job of 80 and 3 young children.

In [32]:
## CHOOSE REPRESENTATIVE VALUES FOR ALL Xs ##
logits_exh = intercept + (80 * b_hrs1) + (50 * b_age) + (80 * b_prestg80) + (3 * b_babies)
logit2prob(logits_exh) 

0.3850395409635665

## Someone like that has about a 38.5% probability of always being exhausted.
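Both scenarios can be reproduced in one step by stacking them as rows of a small matrix. This is a sketch that rebuilds the coefficients by taking logs of the odds ratios printed earlier, so it runs without the fitted model; the names `beta` and `scenarios` are illustrative.

```python
import numpy as np

# Odds ratios copied from the np.exp(logit1.params) output; logs recover the coefficients.
odds_ratios = np.array([0.065882, 1.032340, 1.009581, 0.975567, 1.495597])
beta = np.log(odds_ratios)  # order: Intercept, hrs1, age, prestg80, babies

# One row per scenario; the leading 1 multiplies the intercept.
scenarios = np.array([
    [1, 40, 30, 30, 0],  # 40 hrs/week, age 30, prestige 30, no young children
    [1, 80, 50, 80, 3],  # 80 hrs/week, age 50, prestige 80, 3 young children
])

logits = scenarios @ beta
probs = 1 / (1 + np.exp(-logits))
for row, p in zip(scenarios, probs):
    print(row[1:], round(float(p), 4))
```

With the fitted model in memory, calling `logit1.predict` on a DataFrame that has columns `hrs1`, `age`, `prestg80`, and `babies` should give the same probabilities without the hand calculation.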

## Now, I am going to find out the predicted probability of being exhausted for a person who works 80 hours/week but has average values on the other Xs (means taken over everyone in sub).

In [33]:
logits_exh = intercept + (80 * b_hrs1) + (sub.age.mean() * b_age) + (sub.prestg80.mean() * b_prestg80) + (sub.babies.mean() * b_babies)
logit2prob(logits_exh)

0.31123940798690736

## Someone like that has about a 31.1% probability of always being exhausted.
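To isolate the effect of work hours, the sketch below varies hours from 40 to 80 while holding the other Xs fixed. Because the sample means are not printed above, it fixes the others at the first scenario's values (age 30, prestige 30, no young children) rather than at the means, and it again recovers the coefficients from the printed odds ratios.

```python
import numpy as np

# Coefficients recovered from the odds ratios printed earlier.
b0, b_hrs1, b_age, b_prestg80, b_babies = np.log(
    [0.065882, 1.032340, 1.009581, 0.975567, 1.495597]
)

hours = np.arange(40, 81, 10)  # 40, 50, 60, 70, 80
logits = b0 + b_hrs1 * hours + b_age * 30 + b_prestg80 * 30 + b_babies * 0
probs = 1 / (1 + np.exp(-logits))
steps = np.diff(probs)  # change in probability for each extra 10 hours

for h, p in zip(hours, probs):
    print(f"{h} hours/week: {p:.3f}")
print("change per extra 10 hours:", np.round(steps, 3))
```

Each extra 10 hours adds the same amount to the logit, but the change in predicted probability grows across this range. This is the main practical difference from the LPM, where the change in probability per hour is constant.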