A researcher is interested in how predictor variables, such as
1) GRE (Graduate Record Exam scores),
2) GPA (grade point average), and
3) rank/prestige of the undergraduate institution,
affect admission into graduate school.

# The response variable "admission to grad school" is a binary variable.
The only two choices are admit/don’t admit.
Values are 0 = no admit, 1 = admit.

THIS TASK IS CALLED CLASSIFICATION.
CLASSIFICATION => The target output is one of a limited number of categories.
In this problem we have only two possible targets: no admit and admit.



CHECK FOR UNDERSTANDING: Why is Classification different from Regression, a/k/a Linear Regression?

Note: The fact that you solve CLASSIFICATION problems with a technique called LOGISTIC REGRESSION is unfortunate, but a fact of life.
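
A minimal sketch of why the technique still carries the name "regression": the model computes a linear combination of the predictors, just as linear regression does, and then passes it through the logistic (sigmoid) function to get a probability between 0 and 1. The coefficient values below are made up for illustration and are not fitted to any data.

```python
import numpy as np

def logistic(z):
    """Map a log-odds value z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients, chosen only to illustrate the mechanics.
intercept, b_gre, b_gpa = -4.0, 0.002, 0.8

gre, gpa = 600, 3.0
log_odds = intercept + b_gre * gre + b_gpa * gpa   # linear part, as in linear regression
probability = logistic(log_odds)                   # squashed into (0, 1)
predicted_class = int(probability >= 0.5)          # classification: threshold the probability

print(log_odds, probability, predicted_class)
```

Regression stops at a continuous number; classification turns that number into one of a fixed set of labels.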

In [3]:
import pandas as pd
import statsmodels.formula.api as smf

In [4]:
df = pd.read_csv("https://stats.idre.ucla.edu/stat/data/binary.csv")

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 4 columns):
admit    400 non-null int64
gre      400 non-null int64
gpa      400 non-null float64
rank     400 non-null int64
dtypes: float64(1), int64(3)
memory usage: 12.6 KB


In [9]:
df.describe()

          admit         gre       gpa     rank
count     400.0       400.0     400.0    400.0
mean     0.3175       587.7    3.3899    2.485
std    0.466087  115.516536  0.380567  0.94446
min         0.0       220.0      2.26      1.0
25%         0.0       520.0      3.13      2.0
50%         0.0       580.0     3.395      2.0
75%         1.0       660.0      3.67      3.0
max         1.0       800.0       4.0      4.0


In [14]:
# admit is the target we want to model. Categorical: 0 = no admit, 1 = admit
# gre is numerical/continuous
# gpa is numerical/continuous
# rank is categorical: 1, 2, 3, or 4


# You could do this. But don't. (Rank should be treated as a categorical variable!)

In [15]:
#DON'T DO THIS IRL

fitted_model = smf.logit(formula='admit ~ gre + gpa + rank', data=df).fit()
fitted_model.summary()

Optimization terminated successfully.
         Current function value: 0.574302
         Iterations 6


Dep. Variable:     admit              No. Observations:   400
Model:             Logit              Df Residuals:       396
Method:            MLE                Df Model:           3
Date:              Tue, 19 Nov 2019   Pseudo R-squ.:      0.08107
Time:              10:28:45           Log-Likelihood:     -229.72
converged:         True               LL-Null:            -249.99
Covariance Type:   nonrobust          LLR p-value:        8.207e-09

                  coef   std err        z    P>|z|    [0.025    0.975]
Intercept      -3.4495     1.133   -3.045    0.002    -5.670    -1.229
gre             0.0023     0.001    2.101    0.036     0.000     0.004
gpa             0.7770     0.327    2.373    0.018     0.135     1.419
rank           -0.5600     0.127   -4.405    0.000    -0.809    -0.311


# Do This Instead!
## In reality you should *always* explicitly separate out Categorical factors.
## Notice that this model has more coefficients. (why?)
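
One way to see where the extra coefficients come from: a categorical variable with four levels is expanded into three 0/1 indicator columns, with the first level (rank 1) absorbed into the intercept as the baseline. The sketch below reproduces that encoding by hand with pandas on a tiny made-up frame; `C(rank)` in the formula performs an equivalent expansion internally.

```python
import pandas as pd

# Tiny made-up sample; only the rank column matters here.
toy = pd.DataFrame({"rank": [1, 2, 3, 4, 2]})

# One indicator column per level, dropping the first level (rank 1) as the baseline.
dummies = pd.get_dummies(toy["rank"], prefix="rank", drop_first=True).astype(int)
print(dummies)
```

A row with rank 1 has zeros in all three columns, which is why its effect is carried by the intercept.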

In [16]:
fitted_model = smf.logit(formula='admit ~ gre + gpa + C(rank)', data=df).fit()
fitted_model.summary()

Optimization terminated successfully.
         Current function value: 0.573147
         Iterations 6


Dep. Variable:     admit              No. Observations:   400
Model:             Logit              Df Residuals:       394
Method:            MLE                Df Model:           5
Date:              Mon, 18 Nov 2019   Pseudo R-squ.:      0.08292
Time:              17:15:57           Log-Likelihood:     -229.26
converged:         True               LL-Null:            -249.99
Covariance Type:   nonrobust          LLR p-value:        7.578e-08

                  coef   std err        z    P>|z|    [0.025    0.975]
Intercept      -3.9900     1.140   -3.500    0.000    -6.224    -1.756
C(rank)[T.2]   -0.6754     0.316   -2.134    0.033    -1.296    -0.055
C(rank)[T.3]   -1.3402     0.345   -3.881    0.000    -2.017    -0.663
C(rank)[T.4]   -1.5515     0.418   -3.713    0.000    -2.370    -0.733
gre             0.0023     0.001    2.070    0.038     0.000     0.004
gpa             0.8040     0.332    2.423    0.015     0.154     1.454
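
Logit coefficients are on the log-odds scale, so exponentiating them gives odds ratios, which are often easier to read. A minimal sketch using the coefficient values copied from the summary above (rounded as printed, so the results are approximate):

```python
import math

# Coefficients as printed in the summary table above.
coefs = {
    "C(rank)[T.2]": -0.6754,
    "C(rank)[T.3]": -1.3402,
    "C(rank)[T.4]": -1.5515,
    "gre": 0.0023,
    "gpa": 0.8040,
}

# exp(coef) is the multiplicative change in the odds of admission
# for a one-unit increase (or, for rank, relative to rank 1).
odds_ratios = {name: math.exp(b) for name, b in coefs.items()}
for name, ratio in odds_ratios.items():
    print(f"{name:14s} {ratio:.3f}")
```

For example, the odds ratio for `C(rank)[T.2]` comes out at roughly 0.51: the odds of admission from a rank-2 institution are about half those from a rank-1 institution, holding GRE and GPA fixed.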


In [18]:
#CONFUSION MATRIX. HOW WELL DID YOUR MODEL PREDICT THE REALITY OF YOUR DATA?
fitted_model.pred_table()

array([[253.,  20.],
       [ 98.,  29.]])

In [1]:
#pred_table[i,j] refers to the number of times “i” was observed and the model predicted “j”. 
#Correct predictions are along the diagonal.
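
A few summary numbers can be read straight off this table. The sketch below hard-codes the matrix printed above and assumes the default 0.5 threshold used by `pred_table()`; the variable names are illustrative.

```python
import numpy as np

# Rows = observed class, columns = predicted class (0 = no admit, 1 = admit).
table = np.array([[253., 20.],
                  [98., 29.]])

accuracy = np.trace(table) / table.sum()            # correct predictions / all predictions
recall_admit = table[1, 1] / table[1].sum()         # share of actual admits the model caught
precision_admit = table[1, 1] / table[:, 1].sum()   # share of predicted admits that were right
baseline = table[0].sum() / table.sum()             # accuracy of always predicting "no admit"

print(accuracy, recall_admit, precision_admit, baseline)
```

The accuracy of 0.705 is only slightly above the 0.6825 you would get by always predicting "no admit", and the model catches fewer than a quarter of the actual admits.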

In [6]:
#Will our model perform better if we make the data more "uniform"?
#What if we rescale the GRE and GPA columns to z-scores (mean 0, standard deviation 1)?
#Will that make it easier for the numerical solver in statsmodels to find a better answer?

In [11]:
# Here are the means and standard deviations of the two columns
gre_mean = df['gre'].mean()
gre_std = df['gre'].std()

gpa_mean = df['gpa'].mean()
gpa_std = df['gpa'].std()

print(gre_mean, gre_std)
print(gpa_mean, gpa_std)

587.7 115.51653637223805
3.3899 0.3805667716303841


In [None]:
#let's add two new columns to our dataset to reflect the gpa and gre score on a standardized basis

In [12]:
# Vectorized z-score: subtract the column mean, divide by the sample standard deviation
df['gre_zscore'] = (df['gre'] - gre_mean) / gre_std
df['gpa_zscore'] = (df['gpa'] - gpa_mean) / gpa_std


In [13]:
df.describe() #the new z-score columns have mean zero and variance 1.

          admit         gre       gpa     rank     gre_zscore     gpa_zscore
count     400.0       400.0     400.0    400.0          400.0          400.0
mean     0.3175       587.7    3.3899    2.485  -3.907985e-16   2.198242e-16
std    0.466087  115.516536  0.380567  0.94446            1.0            1.0
min         0.0       220.0      2.26      1.0      -3.183094      -2.968993
25%         0.0       520.0      3.13      2.0     -0.5860633     -0.6829288
50%         0.0       580.0     3.395      2.0    -0.06665712     0.01340106
75%         1.0       660.0      3.67      3.0      0.6258844      0.7360075
max         1.0       800.0       4.0      4.0       1.837832       1.603135
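
Note that pandas' `Series.std()` uses the sample standard deviation (`ddof=1`) by default, whereas NumPy's `np.std` defaults to the population version (`ddof=0`). A small sketch on made-up numbers, with a hypothetical helper `zscore`, shows the standardization and that difference:

```python
import numpy as np
import pandas as pd

def zscore(series: pd.Series) -> pd.Series:
    """Standardize a column using the sample standard deviation (pandas default)."""
    return (series - series.mean()) / series.std()

# Made-up GRE-like scores, for illustration only.
scores = pd.Series([520.0, 580.0, 660.0, 800.0, 220.0])
z = zscore(scores)

print(z.mean(), z.std())                          # approximately 0 and 1
print(scores.std(), np.std(scores.to_numpy()))    # sample vs population standard deviation
```

With 400 rows the two conventions differ very little, but it is worth knowing which one a library uses before comparing results across tools.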


In [17]:
#let's re-run the model using gre_zscore, gpa_zscore, and rank
formula = 'admit ~ gre_zscore + gpa_zscore + C(rank)'

fitted_model = smf.logit(formula=formula, data=df).fit()
fitted_model.summary()

Optimization terminated successfully.
         Current function value: 0.573147
         Iterations 6


Dep. Variable:     admit              No. Observations:   400
Model:             Logit              Df Residuals:       394
Method:            MLE                Df Model:           5
Date:              Tue, 19 Nov 2019   Pseudo R-squ.:      0.08292
Time:              14:25:06           Log-Likelihood:     -229.26
converged:         True               LL-Null:            -249.99
Covariance Type:   nonrobust          LLR p-value:        7.578e-08

                  coef   std err        z    P>|z|    [0.025    0.975]
Intercept       0.0664     0.266    0.250    0.802    -0.454     0.587
C(rank)[T.2]   -0.6754     0.316   -2.134    0.033    -1.296    -0.055
C(rank)[T.3]   -1.3402     0.345   -3.881    0.000    -2.017    -0.663
C(rank)[T.4]   -1.5515     0.418   -3.713    0.000    -2.370    -0.733
gre_zscore      0.2616     0.126    2.070    0.038     0.014     0.509
gpa_zscore      0.3060     0.126    2.423    0.015     0.058     0.553
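
One benefit of standardizing is that the intercept becomes directly interpretable: it is the log-odds of admission for an applicant with average GRE and GPA (both z-scores equal to 0) from a rank-1 institution. A rough sketch using the rounded coefficients printed above, with a hypothetical helper `admit_probability`:

```python
import math

# Rounded coefficients from the z-score model summary above.
INTERCEPT = 0.0664
RANK_EFFECT = {1: 0.0, 2: -0.6754, 3: -1.3402, 4: -1.5515}  # rank 1 is the baseline
B_GRE_Z, B_GPA_Z = 0.2616, 0.3060

def admit_probability(gre_z: float, gpa_z: float, rank: int) -> float:
    """Predicted admission probability from the logit model's linear predictor."""
    log_odds = INTERCEPT + RANK_EFFECT[rank] + B_GRE_Z * gre_z + B_GPA_Z * gpa_z
    return 1.0 / (1.0 + math.exp(-log_odds))

for rank in (1, 2, 3, 4):
    print(rank, round(admit_probability(0.0, 0.0, rank), 3))
```

An average applicant from a rank-1 institution comes out at roughly 0.52, and the same applicant from a rank-4 institution at roughly 0.18.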


In [18]:
dir(fitted_model)

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_cache',
 '_data_attr',
 '_get_endog_name',
 '_get_robustcov_results',
 'aic',
 'bic',
 'bse',
 'conf_int',
 'cov_kwds',
 'cov_params',
 'cov_type',
 'df_model',
 'df_resid',
 'f_test',
 'fittedvalues',
 'get_margeff',
 'initialize',
 'k_constant',
 'llf',
 'llnull',
 'llr',
 'llr_pvalue',
 'load',
 'mle_retvals',
 'mle_settings',
 'model',
 'nobs',
 'normalized_cov_params',
 'params',
 'pred_table',
 'predict',
 'prsquared',
 'pvalues',
 'remove_data',
 'resid_dev',
 'resid_generalized',
 'resid_pearson',
 'resid_response',
 'save',
 'scale',
 'set_null_options',
 'summary',
 'summary2',
 't_test',
 't_

In [None]:
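## Notice that the gre and gpa coefficients are different, but their z-statistics and p-values are the same.
## The rank coefficients, the log-likelihood, and the pseudo R-squared do not change at all. Only the intercept changes its meaning (and therefore its p-value).
## So, for *statsmodels*, in this case, the answer is no: standardizing does not change the quality of the fit.
## This is not necessarily the case everywhere. For example, scikit-learn's LogisticRegression applies L2 regularization by default, so rescaling the inputs sometimes DOES make a difference.
## Double-check when you really need to be sure.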
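
The two fits are the same model in different units. As a check, the raw-scale coefficients can be recovered from the z-score coefficients using the means and standard deviations printed earlier: divide each slope by the column's standard deviation, and subtract the slope-times-mean terms from the intercept. The values below are copied from the outputs above, so agreement is only up to rounding.

```python
# Means and standard deviations printed earlier in the notebook.
gre_mean, gre_std = 587.7, 115.51653637223805
gpa_mean, gpa_std = 3.3899, 0.3805667716303841

# Rounded coefficients from the z-score model.
intercept_z, b_gre_z, b_gpa_z = 0.0664, 0.2616, 0.3060

# z = (x - mean) / std, so b_z * z = (b_z / std) * x - b_z * mean / std.
b_gre_raw = b_gre_z / gre_std
b_gpa_raw = b_gpa_z / gpa_std
intercept_raw = intercept_z - b_gre_z * gre_mean / gre_std - b_gpa_z * gpa_mean / gpa_std

print(round(b_gre_raw, 4), round(b_gpa_raw, 4), round(intercept_raw, 4))
```

These land close to the 0.0023, 0.8040, and -3.9900 reported by the model fitted on the raw columns.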