In [2]:
#import modules
import pandas as pd
import numpy as np
from scipy import stats
import statsmodels.api as sm

Each row represents a group of subjects that took a pass/fail test.  The first two columns are the dependent variable: the number of successes and failures for each group.  All of the other columns are the features.  Each group is a combination of these eight features, expressed as percentages (stored as fractions between 0 and 1).  The feature columns for each group add up to 100%, or nearly so; there is some rounding error.

In [20]:
#import data
data = pd.read_csv('../capstone/sample_binomial_data.csv',index_col=0)
data.head()

Unnamed: 0,sucesses,failures,A,B,C,D,E,F,G,H
0,339.0,73445.0,0.0,0.48,0.0,0.39,0.12,0.0,0.0,0.0
1,8.0,3340.0,0.0,0.0,0.0,0.98,0.0,0.0,0.0,0.0
2,12.0,1546.0,0.0,0.38,0.13,0.04,0.44,0.0,0.0,0.0
3,10.0,1167.0,0.47,0.0,0.0,0.14,0.0,0.37,0.0,0.0
4,9.0,1163.0,0.0,0.0,0.99,0.0,0.0,0.0,0.0,0.0
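
As a sanity check on the claim that the feature columns sum to roughly 100%, here is a minimal sketch that rebuilds the five rows shown above by hand (so it runs without the CSV) and looks at the row totals.  The names `sample`, `row_sums` and `off`, and the 0.05 tolerance, are my own choices for illustration, not part of the original notebook.

```python
import pandas as pd

features = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H']

# The five rows printed by data.head(), typed in by hand so the check is self-contained
sample = pd.DataFrame(
    [
        [339.0, 73445.0, 0.0, 0.48, 0.0, 0.39, 0.12, 0.0, 0.0, 0.0],
        [8.0, 3340.0, 0.0, 0.0, 0.0, 0.98, 0.0, 0.0, 0.0, 0.0],
        [12.0, 1546.0, 0.0, 0.38, 0.13, 0.04, 0.44, 0.0, 0.0, 0.0],
        [10.0, 1167.0, 0.47, 0.0, 0.0, 0.14, 0.0, 0.37, 0.0, 0.0],
        [9.0, 1163.0, 0.0, 0.0, 0.99, 0.0, 0.0, 0.0, 0.0, 0.0],
    ],
    columns=['sucesses', 'failures'] + features,
)

# Total share per group; should be close to 1.0 if the features are a composition
row_sums = sample[features].sum(axis=1)
print(row_sums)

# Flag any group whose shares are further from 1.0 than the (assumed) tolerance
off = row_sums[(row_sums - 1.0).abs() > 0.05]
print("rows outside tolerance:", len(off))
```

On the full data set the same two lines would apply to `data[features]` instead of `sample[features]`.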


In [23]:
#define dependent and independent variables for binomial regression

dep = data[['sucesses','failures']]
features = ['A','B','C','D','E','F','G','H']
indy = data[features]

In [24]:
#run regression and get summary
glm_binom = sm.GLM(dep, indy, family=sm.families.Binomial())
result = glm_binom.fit()
print(result.summary()) 

                    Generalized Linear Model Regression Results                     
Dep. Variable:     ['sucesses', 'failures']   No. Observations:                  176
Model:                                  GLM   Df Residuals:                      168
Model Family:                      Binomial   Df Model:                            7
Link Function:                        logit   Scale:                             1.0
Method:                                IRLS   Log-Likelihood:                -1294.0
Date:                      Sun, 11 Dec 2016   Deviance:                       1889.5
Time:                              21:04:03   Pearson chi2:                 2.36e+03
No. Iterations:                          12                                         
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
A             -6.0339      0.059   -102.033      0.000        -6.150    -5.91

All of my coefficients are negative.  This implies that every possible feature has a negative effect, doesn't it?  But logically, this can't be so: these features add up to 100% for each group, so at least one of them has to improve the likelihood of success ... right?  I can't explain to my boss that every possible feature makes success less likely ... I need to be able to distinguish positive from negative.  I thought maybe I needed to have a constant, but then this happens:

In [25]:
#try it using a constant
indy = sm.add_constant(data[features])
glm_binom = sm.GLM(dep, indy, family=sm.families.Binomial())
result = glm_binom.fit()
print(result.summary()) 

                    Generalized Linear Model Regression Results                     
Dep. Variable:     ['sucesses', 'failures']   No. Observations:                  176
Model:                                  GLM   Df Residuals:                      167
Model Family:                      Binomial   Df Model:                            8
Link Function:                        logit   Scale:                             1.0
Method:                                IRLS   Log-Likelihood:                -1291.1
Date:                      Sun, 11 Dec 2016   Deviance:                       1883.6
Time:                              21:04:25   Pearson chi2:                 2.32e+03
No. Iterations:                          12                                         
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const          6.1887      2.540      2.436      0.015         1.210    11.16

That goes in the opposite direction of what I was hoping for.  I also tried normalizing the data but, as you would expect when dealing with numbers that were percentages in the first place, it doesn't have much effect.  Any other ideas?
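
One guess about the constant, not a diagnosis: if the eight shares sum to (almost exactly) 1 in every row, then a column of ones is (almost) a linear combination of the features, so the design matrix becomes nearly collinear once the constant is added.  The sketch below illustrates the idea on synthetic data only.  The Dirichlet draw, the seed, and the names `X`, `X_const`, `cond_without` and `cond_with` are all assumptions for illustration and have nothing to do with the real data set.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic compositions: 176 rows of 8 shares that sum to 1, rounded to 2 decimals
X = np.round(rng.dirichlet(np.ones(8), size=176), 2)

# Same matrix with a leading column of ones, mimicking what sm.add_constant does
X_const = np.column_stack([np.ones(len(X)), X])

# Condition number: ratio of largest to smallest singular value of the design matrix
cond_without = np.linalg.cond(X)
cond_with = np.linalg.cond(X_const)

print("condition number without constant:", cond_without)
print("condition number with constant:   ", cond_with)
```

A much larger condition number with the constant would be consistent with the inflated standard error on const (2.540) in the summary above, though I haven't confirmed that this is what is happening with my data.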

In [32]:
#try it with normalized data

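def normalize(df, cols):
    """Return a copy of df with cols min-max scaled to the range [0, 1]."""
    df = df.copy() #leave the caller's DataFrame untouched
    df[cols] = df[cols].apply(lambda x: (x - x.min()) / (x.max() - x.min())) #min-max scaling, column by column
    df[cols] = df[cols].fillna(0) #fills in empty data points (including NaN from constant columns) with zeros
    return df

normed_data = normalize(data, features)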

dep = normed_data[['sucesses','failures']]
features = ['A','B','C','D','E','F','G','H']
indy = normed_data[features]

glm_binom = sm.GLM(dep, indy, family=sm.families.Binomial())
result = glm_binom.fit()
print(result.summary()) 

                    Generalized Linear Model Regression Results                     
Dep. Variable:     ['sucesses', 'failures']   No. Observations:                  176
Model:                                  GLM   Df Residuals:                      168
Model Family:                      Binomial   Df Model:                            7
Link Function:                        logit   Scale:                             1.0
Method:                                IRLS   Log-Likelihood:                -1294.0
Date:                      Sun, 11 Dec 2016   Deviance:                       1889.5
Time:                              22:41:56   Pearson chi2:                 2.36e+03
No. Iterations:                          12                                         
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
A             -5.9736      0.059   -102.033      0.000        -6.088    -5.85
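
On the question of what the all-negative coefficients mean: in a logit model each coefficient is on the log-odds scale, so a negative value only says the implied success probability is below 50%, not that the feature hurts.  Because the shares sum to about 1 and the first model has no constant, the coefficient for A can be read (roughly) as the log-odds of success for a hypothetical group made up entirely of A.  Here is a minimal sketch of that conversion, using the A coefficient from the first summary and the counts from the first row of data.head().  The names `coef_A`, `p_A` and `observed_rate` are made up for illustration.

```python
import numpy as np

# Coefficient for A from the first regression summary (no constant, raw shares)
coef_A = -6.0339

# Inverse logit: convert log-odds to a probability
p_A = 1.0 / (1.0 + np.exp(-coef_A))

# Success and failure counts from the first row of data.head(), for scale
successes, failures = 339.0, 73445.0
observed_rate = successes / (successes + failures)

print("implied success probability for an all-A group:", p_A)
print("observed success rate in the first row:        ", observed_rate)
```

If that reading is right, the useful comparison is between coefficients (which features have higher or lower log-odds than the others) rather than the sign of each coefficient on its own.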