# Regression Modeling in Practice
### Week 4: Test a Logistic Regression Model

Although my data set (GapMinder) does not contain any binary variables, I wanted to build an example of a logistic regression model. I therefore split internetuserate into two categories, with 0 being a “low” internet use rate (below the median rate) and 1 being a “high” internet use rate (at or above the median rate), and then modeled it against income per person.

Load the data and set variables to numeric

In [27]:
'''
Code for Peer-graded Assignments: Test a Logistic Regression Model
Course: Regression Modeling in Practice
Specialization: Data Analysis and Interpretation
'''
 
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
import scipy.stats

data = pd.read_csv('c:/users/greg/desktop/gapminder.csv', low_memory=False)
 
data['internetuserate'] = pd.to_numeric(data['internetuserate'], errors='coerce')
data['incomeperperson'] = pd.to_numeric(data['incomeperperson'], errors='coerce')
data['employrate'] = pd.to_numeric(data['employrate'], errors='coerce')

Convert response variable to binary

In [28]:
binarydata = data.copy()

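# Compute the median once, from the original numeric column
median_rate = data['internetuserate'].median()

def internetgrp(row):
    # 0 = below the median ("low"), 1 = at or above the median ("high").
    # Note: a missing rate compares False against the median, so it is also coded 1.
    if row['internetuserate'] < median_rate:
        return 0
    else:
        return 1

binarydata['internetuserate'] = binarydata.apply(internetgrp, axis=1)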
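
The row-wise function above codes a missing internet use rate as 1, because a comparison against NaN is False. As a minimal sketch of an alternative, the split can be done in one vectorized step that leaves missing rates as NaN, so that the formula interface can drop those rows. The toy data frame and the column name `internet_high` are hypothetical stand-ins, not part of the original analysis, and using this version on the real data would change the results reported in this post.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the GapMinder data.
toy = pd.DataFrame({'internetuserate': [5.0, 20.0, np.nan, 60.0, 90.0]})

# The median skips NaN by default.
median_rate = toy['internetuserate'].median()

# 1.0 = at or above the median, 0.0 = below the median.
above = (toy['internetuserate'] >= median_rate).astype(float)

# Keep the coding only where the original rate is present; missing stays NaN.
toy['internet_high'] = above.where(toy['internetuserate'].notna())

print(toy)
```

Keeping the binary response in a new column also leaves the original rate available for later checks.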

Perform logistic regression with income per person

In [29]:
lreg1 = smf.logit(formula = 'internetuserate ~ incomeperperson', data = binarydata).fit()
lreg1.summary()

Optimization terminated successfully.
         Current function value: 0.357431
         Iterations 9


                           Logit Regression Results                           
Dep. Variable:        internetuserate   No. Observations:                  190
Model:                          Logit   Df Residuals:                      188
Method:                           MLE   Df Model:                            1
Date:                Thu, 12 Jan 2017   Pseudo R-squ.:                  0.4843
Time:                        16:47:59   Log-Likelihood:                -67.912
converged:                       True   LL-Null:                       -131.69
                                        LLR p-value:                 1.407e-29
                      coef    std err          z      P>|z|      [95.0% Conf. Int.]
-----------------------------------------------------------------------------------
Intercept          -2.2530      0.345     -6.525      0.000        -2.930    -1.576
incomeperperson     0.0006      0.000      5.817      0.000         0.000     0.001


Calculate odds ratios

In [30]:
np.exp(lreg1.params)

Intercept          0.105085
incomeperperson    1.000608
dtype: float64
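
To make these odds ratios more concrete, the sketch below converts them back into predicted probabilities of falling in the group coded 1. It uses only the two odds ratios printed above. The helper name `predicted_probability` and the income levels in the loop are illustrative assumptions, not part of the original analysis.

```python
import math

def predicted_probability(intercept, slope, x):
    """Logistic model: P(y = 1 | x) = 1 / (1 + exp(-(intercept + slope * x)))."""
    log_odds = intercept + slope * x
    return 1.0 / (1.0 + math.exp(-log_odds))

# Coefficients recovered as the natural log of the reported odds ratios.
intercept = math.log(0.105085)
slope = math.log(1.000608)

# Illustrative income-per-person values (hypothetical, chosen for the example).
for income in (0, 1000, 5000, 10000):
    p = predicted_probability(intercept, slope, income)
    print(income, round(p, 3))
```

Because the slope is positive, the predicted probability rises with income per person.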

Calculate odds ratios with 95% confidence intervals

In [31]:
params = lreg1.params
conf = lreg1.conf_int()
conf['OR'] = params
conf.columns = ['Lower CI', 'Upper CI', 'OR']
np.exp(conf)

                 Lower CI  Upper CI        OR
Intercept        0.053414  0.206742  0.105085
incomeperperson  1.000403  1.000813  1.000608


Perform logistic regression with income per person and employment rate

In [32]:
lreg2 = smf.logit(formula = 'internetuserate ~ incomeperperson + employrate', data = binarydata).fit()
print (lreg2.summary())

Optimization terminated successfully.
         Current function value: 0.345366
         Iterations 9
                           Logit Regression Results                           
Dep. Variable:        internetuserate   No. Observations:                  166
Model:                          Logit   Df Residuals:                      163
Method:                           MLE   Df Model:                            2
Date:                Thu, 12 Jan 2017   Pseudo R-squ.:                  0.5009
Time:                        16:48:09   Log-Likelihood:                -57.331
converged:                       True   LL-Null:                       -114.87
                                        LLR p-value:                 1.026e-25
                      coef    std err          z      P>|z|      [95.0% Conf. Int.]
-----------------------------------------------------------------------------------
Intercept           1.1446      1.426      0.803      0.422        -1.650     3.939
incomeperperso

My first model examined internet use rate against income per person. The relationship is highly significant: the p-value is reported as 0.000, meaning p < 0.001. The odds ratio is 1.000608, which is greater than 1, so the odds of a country falling in the high internet use group rise as income per person rises. The odds ratio looks very close to 1, but it describes the change in odds for a single one-unit increase in income per person, so its small size reflects the scale of the variable rather than a weak relationship. The 95% confidence interval for this odds ratio is 1.000403 to 1.000813, which is narrow and excludes 1, consistent with the significant p-value.
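
An odds ratio for a one-unit change can be hard to read when the variable spans thousands of units. As a rough sketch, the reported odds ratio and its confidence limits can be rescaled to a 1,000-unit increase in incomeperperson by multiplying the log odds ratio by 1,000. The helper `rescale_odds_ratio` and the step size of 1,000 are assumptions made for illustration.

```python
import math

def rescale_odds_ratio(odds_ratio, step):
    """Odds ratio for a `step`-unit increase, given the one-unit odds ratio."""
    return math.exp(math.log(odds_ratio) * step)

# One-unit odds ratio and 95% confidence limits reported for incomeperperson.
per_unit = {'OR': 1.000608, 'Lower CI': 1.000403, 'Upper CI': 1.000813}

# The same quantities expressed per 1,000 units of incomeperperson.
per_thousand = {name: rescale_odds_ratio(value, 1000)
                for name, value in per_unit.items()}

for name, value in per_thousand.items():
    print(name, round(value, 2))
```

On this scale the odds ratio comes out at roughly 1.8, which is easier to interpret than 1.000608.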

Next, I added employment rate to the model as a second explanatory variable. The estimate for income per person changed only slightly, so employment rate does not appear to confound the relationship, although this model is fit on 166 observations rather than 190, so the comparison is approximate. The p-value for income per person is still reported as 0.000 (p < 0.001), and the p-value for employment rate is 0.019. Both are below our threshold of 0.05, so both variables are significant. The odds ratio for income per person is now 1.0005, and the odds ratio for employment rate is 0.944. Higher income per person is associated with higher odds of high internet use, while a *lower* employment rate is also associated with higher odds of high internet use. Too much Facebook, perhaps? The confidence intervals for these odds ratios are about as narrow as in the single-variable model.
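
One rough way to put a number on how much the income estimate moved is to compare its log-odds coefficient before and after employment rate was added. The sketch below does this from the two reported odds ratios. The helper `coefficient_change` is hypothetical, the second odds ratio is only known to four decimal places, and the two models were fit on different numbers of observations, so the result is an approximation at best. Some analysts treat a change larger than about 10% as a sign of possible confounding, though that threshold is a convention rather than a rule.

```python
import math

def coefficient_change(or_before, or_after):
    """Relative change in the log-odds coefficient after adding a covariate."""
    b_before = math.log(or_before)
    b_after = math.log(or_after)
    return (b_after - b_before) / b_before

# Odds ratios for incomeperperson: single-variable model, then two-variable model.
change = coefficient_change(1.000608, 1.0005)

# Printed as a signed percentage; negative means the coefficient shrank.
print(f"{change:.1%}")
```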

These results match my previous analysis: a higher income per person is associated with a higher internet use rate, whereas a higher employment rate is associated with a *lower* internet use rate.