In [1]:
library('aod')
library(ggplot2)
library(caret)

Loading required package: lattice


# Preprocessing

The features are already selected. For mode details, see the preProcessing notebooks. 

Let's open the file:

In [44]:
features <- read.csv('features.csv')

In [45]:
names(features)

## Actionable Important Features 

Only important actionable variables will go into logistic regression model. For more details, see Finding-Importanct-Variables notebook.

In [48]:
categorical_col_names <- c('primary_focus_subject_grouped', 'resource_type_grouped'
                           ,'eligible_double_your_impact_match', 'funding_status',
                           'eligible_almost_home_match', 'semester_posted','Giving_Page','Promo_Code')
numerical_col_names <- c('total_price_including_optional_support', 'students_reached',
                          'previousProposal_Teacher','previousProposal_School')

categorical_features <- features[,categorical_col_names]
numerical_features <- features[,numerical_col_names]

## Scaling  

It doesn't hurt to make sure the categorical variables are seen as factors:

In [49]:
categorical_features <- sapply(categorical_features,function(col) as.factor(col))

The make the coefficients more related:

1. Divide the total price by 100 
2. Divide the number of students reached by 10
3. Divide the number or teacher's and school's previouse proposals by 5

In [50]:
numerical_features$total_price_including_optional_support <- numerical_features$total_price_including_optional_support/100
numerical_features$students_reached <- numerical_features$students_reached/10
numerical_features$previousProposal_Teacher <- numerical_features$previousProposal_Teacher / 5
numerical_features$previousProposal_School <- numerical_features$previousProposal_School / 5

Time to combine the numerical and categorical features:

In [51]:
features <- cbind.data.frame(categorical_features,numerical_features)

In [52]:
head(features)

Unnamed: 0,primary_focus_subject_grouped,resource_type_grouped,eligible_double_your_impact_match,funding_status,eligible_almost_home_match,semester_posted,Giving_Page,Promo_Code,total_price_including_optional_support,students_reached,previousProposal_Teacher,previousProposal_School
1,literacy_math,Books_Supplies,0,completed,0,1st,0,0,6.5126,9.0,0.8,6.6
2,literacy_math,Books_Supplies,0,completed,0,1st,0,0,3.775,6.0,0.8,6.6
3,literacy_math,Technology_other,0,completed,0,1st,0,0,23.4626,12.5,0.4,6.0
4,health,Books_Supplies,0,completed,0,1st,0,0,21.9126,15.0,0.2,5.4
5,health,Technology_other,0,completed,0,1st,0,0,7.0874,20.0,0.8,26.2
6,literacy_math,Technology_other,0,completed,0,1st,1,0,5.575,10.0,0.8,25.6


## Setting The Reference Level 

Something that will my life easier is to set the reference level of categorical variables. The levels of a factor are re-ordered so that the level specified by ref is first and the others are moved down.

The reference level is choosen base on the explanatory analysis. For more details see preprocessing notebooks.

In [53]:
# Music & Art are the most funded subjects
features$primary_focus_subject_grouped <- relevel(features$primary_focus_subject_grouped, ref='music_art')

# Trips & Visitors are the most funded requested resources
features$resource_type_grouped <- relevel(features$resource_type_grouped, ref='Trips_Visitor')

# Interested in the increase chance of success and not the other way
features$funding_status <- relevel(features$funding_status, ref='expired')

## Renaming Columns 

Some column names are funky. Need to make them more readable.

In [54]:
names(features)

In [55]:
colnames(features) <- c('Subject_', 'Resource_', 'Double_Match_', 'funding_status', 'Promo_Matched_',
                        'Almost_Home_Match_', 'Semester_Posted_', 'Giving_Page_', 'Total_Price', 'Students_Reached',
                        'PreviousProposal_Teacher', 'PreviousProposal_School')

## Train/Test 

75% goes to the training dataset and the rest to the test dataset.

In [56]:
set.seed(3456)
trainIndex <- createDataPartition(features$funding_status, p = .75, list = FALSE)

train <- features[ trainIndex,]
test  <- features[-trainIndex,]

dim(train)
dim(test)

Good! Ready for feeding it into the dear GLM!

# Building Model

Logit provides the insight we are looking for.

In [57]:
mylogit <- glm(funding_status ~ Total_Price + Giving_Page_ + Semester_Posted_ + PreviousProposal_Teacher +
               PreviousProposal_School + Students_Reached + Double_Match_ + Almost_Home_Match_ + Promo_Matched_ + 
               Subject_ + Resource_ , data=train,family=binomial(logit))

In [58]:
summary(mylogit)


Call:
glm(formula = funding_status ~ Total_Price + Giving_Page_ + Semester_Posted_ + 
    PreviousProposal_Teacher + PreviousProposal_School + Students_Reached + 
    Double_Match_ + Almost_Home_Match_ + Promo_Matched_ + Subject_ + 
    Resource_, family = binomial(logit), data = train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.9985   0.2626   0.3550   0.5168   5.8911  

Coefficients:
                            Estimate Std. Error z value Pr(>|z|)    
(Intercept)                3.0308166  0.3003789  10.090  < 2e-16 ***
Total_Price               -0.0387752  0.0034260 -11.318  < 2e-16 ***
Giving_Page_1             -0.0198695  0.0708133  -0.281 0.779025    
Semester_Posted_1          1.9399790  0.0588751  32.951  < 2e-16 ***
PreviousProposal_Teacher   0.0157188  0.0054379   2.891 0.003845 ** 
PreviousProposal_School    0.0055657  0.0014644   3.801 0.000144 ***
Students_Reached          -0.0008564  0.0010607  -0.807 0.419409    
Double_Match_1            -0.1

Ha! The promotions are not that effective. Interesting. Almost home match is marginally important.

Confidence intervals for odds ratios should tell us more.

In [80]:
## odds ratios and 95% CI
exp(cbind(OR = coef(mylogit), confint(mylogit), Prob = coef(mylogit)/(coef(mylogit) + 1)))

Waiting for profiling to be done...


Unnamed: 0,OR,2.5 %,97.5 %,Prob
(Intercept),20.71414,11.82449,38.59028,2.12105
Total_Price,0.961967,0.9553884,0.9683012,0.9604635
Giving_Page_1,0.9803266,0.8539877,1.127312,0.9799318
Semester_Posted_1,6.958605,6.203888,7.814624,1.934524
PreviousProposal_Teacher,1.015843,1.005195,1.026863,1.015596
PreviousProposal_School,1.005581,1.002725,1.008498,1.00555
Students_Reached,0.9991439,0.99674,1.00141,0.9991432
Double_Match_1,0.8907353,0.7999915,0.9918411,0.877351
Almost_Home_Match_2nd,0.6152051,0.5560645,0.6803577,0.3887702
Promo_Matched_1,0.7851066,0.6447286,0.9618272,0.7267669


# Model Accuracy 

In [60]:
Y_test <- test$funding_status
test$funding_status <- NULL

In [73]:
prediction <- rep('completed',length(Y_test))

prediction_prob <- predict(mylogit,test,type='response')
prediction[prediction_prob < 0.73] <- 'expired'

In [74]:
confusionMatrix(Y_test,prediction)

In confusionMatrix.default(Y_test, prediction): Levels are not in the same order for reference and data. Refactoring data to match.

Confusion Matrix and Statistics

           Reference
Prediction  completed expired
  completed      3784     545
  expired         333     379
                                          
               Accuracy : 0.8258          
                 95% CI : (0.8151, 0.8362)
    No Information Rate : 0.8167          
    P-Value [Acc > NIR] : 0.04813         
                                          
                  Kappa : 0.3614          
 Mcnemar's Test P-Value : 1.072e-12       
                                          
            Sensitivity : 0.9191          
            Specificity : 0.4102          
         Pos Pred Value : 0.8741          
         Neg Pred Value : 0.5323          
             Prevalence : 0.8167          
         Detection Rate : 0.7506          
   Detection Prevalence : 0.8588          
      Balanced Accuracy : 0.6646          
                                          
       'Positive' Class : completed       
                                       