In [1]:
library('aod')
library(ggplot2)
library(caret)

Loading required package: lattice


# Preprocessing

The features are already selected. For more details, see the preProcessing notebooks.

Let's open the file:

In [102]:
features <- read.csv('features.csv')

In [104]:
names(features)

## Actionable Important Features 

Only the important, actionable variables go into the logistic regression model. For more details, see the Finding-Importanct-Variables notebook.

In [105]:
categorical_col_names <- c('primary_focus_subject_grouped', 'resource_type_grouped'
                           ,'eligible_double_your_impact_match', 'funding_status',
                           'eligible_almost_home_match', 'semester_posted','via_giving_page')
numerical_col_names <- c('total_price_including_optional_support', 'students_reached',
                          'previousProposal_Teacher','previousProposal_School')

categorical_features <- features[,categorical_col_names]
numerical_features <- features[,numerical_col_names]

## Scaling  

It doesn't hurt to make sure the categorical variables are seen as factors:

In [106]:
# lapply keeps every column a factor; sapply would collapse the data frame into a character matrix
categorical_features[] <- lapply(categorical_features, as.factor)

To make the coefficients more comparable:

1. Divide the total price by 100
2. Divide the number of students reached by 10
3. Divide the number of the teacher's and the school's previous proposals by 5

In [107]:
numerical_features$total_price_including_optional_support <- numerical_features$total_price_including_optional_support/100
numerical_features$students_reached <- numerical_features$students_reached/10
numerical_features$previousProposal_Teacher <- numerical_features$previousProposal_Teacher / 5
numerical_features$previousProposal_School <- numerical_features$previousProposal_School / 5
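
A quick sketch of what this rescaling does, using simulated data (the variables `price` and `funded` and all numbers below are made up for illustration, not taken from the project data). Dividing a predictor by 100 multiplies its coefficient by 100, so the coefficient reads "per $100" instead of "per $1", while the fitted probabilities stay the same.

```r
# Simulated data, for illustration only (not the project data)
set.seed(1)
price  <- runif(500, 100, 2000)                       # price in dollars
funded <- rbinom(500, 1, plogis(2 - 0.001 * price))   # 1 = funded

fit_raw    <- glm(funded ~ price, family = binomial)
price_100  <- price / 100                             # price in units of $100
fit_scaled <- glm(funded ~ price_100, family = binomial)

# The coefficient per $100 is 100 times the coefficient per $1
coef(fit_raw)["price"] * 100
coef(fit_scaled)["price_100"]

# The fitted probabilities do not change, only the unit of the coefficient
all.equal(fitted(fit_raw), fitted(fit_scaled), tolerance = 1e-6)
```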

Time to combine the numerical and categorical features:

In [108]:
features <- cbind.data.frame(categorical_features,numerical_features)

In [109]:
head(features)

Unnamed: 0,primary_focus_subject_grouped,resource_type_grouped,eligible_double_your_impact_match,funding_status,eligible_almost_home_match,semester_posted,via_giving_page,total_price_including_optional_support,students_reached,previousProposal_Teacher,previousProposal_School
1,other,Books_Supplies,0,completed,0,1st,False,14.375,8,5.0,62.4
2,health,Technology_other,0,completed,0,1st,False,11.2126,9,11.4,62.4
3,literacy_math,Books_Supplies,0,completed,0,1st,False,6.5126,9,0.8,6.6
4,health,Technology_other,0,completed,0,1st,False,34.0876,9,11.4,62.4
5,literacy_math,Books_Supplies,0,completed,0,1st,False,5.1126,2,0.4,14.0
6,literacy_math,Books_Supplies,0,completed,0,1st,False,3.775,6,0.8,6.6


## Setting The Reference Level 

Something that will make my life easier is to set the reference level of the categorical variables. The levels of a factor are re-ordered so that the level specified by ref is first and the others are moved down.

The reference level is chosen based on the exploratory analysis. For more details, see the preprocessing notebooks.

In [110]:
# Music & Art are the most funded subjects
features$primary_focus_subject_grouped <- relevel(features$primary_focus_subject_grouped, ref='music_art')

# Trips & Visitors are the most funded requested resources
features$resource_type_grouped <- relevel(features$resource_type_grouped, ref='Trips_Visitor')

# Interested in the increased chance of success, not the other way around
features$funding_status <- relevel(features$funding_status, ref='expired')
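
A minimal sketch of what `relevel` does, on a toy factor (the vector `status` is invented for illustration). For a factor response, a binomial `glm` treats the first level as failure and every other level as success, so putting `expired` first means the model estimates the probability of `completed`.

```r
# Toy factor, for illustration only
status <- factor(c("completed", "expired", "completed", "completed"))
levels(status)          # levels are alphabetical by default

status <- relevel(status, ref = "expired")
levels(status)          # "expired" is now the first (reference) level
```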

## Renaming Columns 

Some column names are funky. Need to make them more readable.

In [111]:
names(features)

In [113]:
colnames(features) <- c('Subject_', 'Resource_', 'Double_Match_', 'funding_status',
                        'Almost_Home_Match_', 'Semester_Posted_', 'Giving_Page_', 'Total_Price', 'Students_Reached',
                        'PreviousProposal_Teacher', 'PreviousProposal_School')

## Train/Test 

75% goes to the training dataset and the rest to the test dataset.

In [120]:
set.seed(34)
trainIndex <- createDataPartition(features$funding_status, p = .75, list = FALSE)

train <- features[ trainIndex,]
test  <- features[-trainIndex,]

dim(train)
dim(test)
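
For a factor outcome, `createDataPartition` samples within each class so that the class proportions are roughly preserved in both parts. The base R sketch below approximates that idea on a toy outcome; the vector `y` and the helper logic are assumptions for illustration, not caret's actual implementation.

```r
# Toy outcome with an 80/20 class split, for illustration only
set.seed(34)
y <- factor(c(rep("completed", 80), rep("expired", 20)))

# Draw 75% of the row indices separately within each class
idx <- unlist(lapply(split(seq_along(y), y),
                     function(i) sample(i, size = round(0.75 * length(i)))))

prop.table(table(y))         # class shares in the full data
prop.table(table(y[idx]))    # class shares in the training part
prop.table(table(y[-idx]))   # class shares in the test part
```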

Good! Ready for feeding it into the dear GLM!

# Building Model

Logit provides the insight we are looking for.

In [121]:
mylogit <- glm(funding_status ~ Total_Price + Giving_Page_ + Semester_Posted_ + PreviousProposal_Teacher +
               PreviousProposal_School + Students_Reached + Double_Match_ + Almost_Home_Match_ + 
               Subject_ + Resource_ , data=train,family=binomial(logit))

In [122]:
summary(mylogit)


Call:
glm(formula = funding_status ~ Total_Price + Giving_Page_ + Semester_Posted_ + 
    PreviousProposal_Teacher + PreviousProposal_School + Students_Reached + 
    Double_Match_ + Almost_Home_Match_ + Subject_ + Resource_, 
    family = binomial(logit), data = train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-3.0254   0.2754   0.3698   0.5142   6.1822  

Coefficients:
                           Estimate Std. Error z value Pr(>|z|)    
(Intercept)                2.800641   0.213623  13.110  < 2e-16 ***
Total_Price               -0.041666   0.002442 -17.065  < 2e-16 ***
Giving_Page_True           1.943373   0.035437  54.840  < 2e-16 ***
Semester_Posted_2nd       -0.508396   0.032909 -15.449  < 2e-16 ***
PreviousProposal_Teacher   0.016323   0.003880   4.207 2.59e-05 ***
PreviousProposal_School    0.007484   0.001066   7.020 2.21e-12 ***
Students_Reached          -0.002340   0.001050  -2.227 0.025920 *  
Double_Match_1            -0.001256   0.035291  -0.036 

Ha! The promotions are not that effective. Interesting. I am not surprised that the community related subjects have a similar success probability as music and art. That came out of the exploratory analysis, as well.

Let's look at ANOVA $\chi^2$ results:

In [136]:
anova(mylogit, test='Chisq')

Unnamed: 0,Df,Deviance,Resid. Df,Resid. Dev,Pr(>Chi)
,,,34773,29670.89,
Total_Price,1.0,358.8876,34772,29312.01,4.918122e-80
Giving_Page_,1.0,3629.054,34771,25682.95,0.0
Semester_Posted_,1.0,271.6844,34770,25411.27,4.873094e-61
PreviousProposal_Teacher,1.0,126.8332,34769,25284.43,2.020633e-29
PreviousProposal_School,1.0,43.50509,34768,25240.93,4.228576e-11
Students_Reached,1.0,2.223169,34767,25238.71,0.1359537
Double_Match_,1.0,0.1044614,34766,25238.6,0.7465401
Almost_Home_Match_,1.0,0.8800693,34765,25237.72,0.3481827
Subject_,5.0,78.29696,34760,25159.42,1.905456e-15
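
As a worked check of how to read this table: each Deviance entry is the drop in residual deviance when that term is added, and the p-value is the upper tail of a $\chi^2$ distribution with Df degrees of freedom. The numbers below are copied from the rows above.

```r
# Residual deviances copied from the first two rows of the table
null_dev  <- 29670.89
price_dev <- 29312.01
drop_in_deviance <- null_dev - price_dev    # about 358.88, the Total_Price entry

# Upper-tail chi-square probability with 1 degree of freedom
pchisq(drop_in_deviance, df = 1, lower.tail = FALSE)

# Same calculation for Students_Reached, using its Deviance entry
pchisq(2.223169, df = 1, lower.tail = FALSE)
```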


Wow, giving page is pretty important. I am surprised that the promotions are not that effective. Maybe because their prevalence is pretty low.

Confidence intervals for odds ratios should tell us more.

In [123]:
## odds ratios and 95% CI
exp(cbind(OR = coef(mylogit), confint(mylogit)))

Waiting for profiling to be done...


Unnamed: 0,OR,2.5 %,97.5 %
(Intercept),16.45519,10.98892,25.43683
Total_Price,0.9591905,0.9545584,0.9637369
Giving_Page_True,6.98226,6.515191,7.486154
Semester_Posted_2nd,0.6014598,0.5638432,0.6414833
PreviousProposal_Teacher,1.016457,1.008812,1.024276
PreviousProposal_School,1.007512,1.005422,1.009633
Students_Reached,0.9976631,0.9955312,0.9995501
Double_Match_1,0.9987449,0.9320276,1.070315
Almost_Home_Match_1,0.9102895,0.8023245,1.035354
Subject_community_related,0.9507275,0.8219447,1.099313
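
The OR column is simply the exponentiated coefficient. A short worked example with two estimates copied from the summary output (remember that Total_Price is in units of $100):

```r
# Estimates copied from the summary output
b_giving_page <- 1.943373
b_total_price <- -0.041666

exp(b_giving_page)        # odds ratio for Giving_Page_True, about 6.98
exp(b_total_price)        # odds ratio per extra $100, about 0.959

# Odds ratios multiply, so an extra $500 scales the odds by exp(5 * b)
exp(5 * b_total_price)
```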


In [133]:
prob <- as.data.frame(exp(coef(mylogit))/(exp(coef(mylogit)) + 1))
colnames(prob) <- c('success probability')
prob

Unnamed: 0,success probability
(Intercept),0.9427105
Total_Price,0.4895851
Giving_Page_True,0.8747222
Semester_Posted_2nd,0.3755697
PreviousProposal_Teacher,0.5040806
PreviousProposal_School,0.5018711
Students_Reached,0.4994151
Double_Match_1,0.499686
Almost_Home_Match_1,0.4765191
Subject_community_related,0.4873707


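Very cool! The intercept corresponds to a success probability of about 94% for the baseline case (reference levels for all categorical variables and zero for the numerical ones). Only the intercept row is a probability in its own right; the other rows are inverse-logit transforms of individual coefficients, so a value above 0.5 marks a feature that raises the odds and a value below 0.5 marks one that lowers them. If teachers follow all the advice, the success probability will be pretty high.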

In [227]:
ideal.Case <- data.frame('music_art','Trips_Visitor','0','0','1st','True',4,1,1,2)
colnames(ideal.Case) <- c('Subject_', 'Resource_', 'Double_Match_', 'Almost_Home_Match_', 
                       'Semester_Posted_', 'Giving_Page_', 'Total_Price', 'Students_Reached',
                        'PreviousProposal_Teacher', 'PreviousProposal_School')
predict(mylogit, newdata = ideal.Case, type = 'response')
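
As a sanity check, the ideal-case probability can be approximated by hand from the coefficients printed in the summary. This is a sketch that assumes the printed (rounded) estimates; every categorical variable of the ideal case sits at its reference level except Giving_Page_, so only the terms below contribute.

```r
# Linear predictor for the ideal case, built from the printed estimates
eta <- 2.800641 +            # intercept
       -0.041666 * 4 +       # Total_Price = 4 (i.e. $400)
        1.943373 +           # Giving_Page_True
        0.016323 * 1 +       # PreviousProposal_Teacher = 1
        0.007484 * 2 +       # PreviousProposal_School = 2
       -0.002340 * 1         # Students_Reached = 1

# Inverse logit turns the linear predictor into a probability
p_ideal <- plogis(eta)
p_ideal                      # roughly 0.99
```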

Cool! For the ideal case, success is almost guaranteed! Ideal case:

1. Post your project on a Giving Page
2. Post in the first semester
3. Keep the project cost around $400-600 (including all the fees)
4. Relate your project to art or music
5. Include a trip or a visitor

# Model Accuracy 

The confusionMatrix function in caret is amazing and provides all sorts of goodness-of-fit metrics for the model.

In [124]:
Y_test <- test$funding_status
test$funding_status <- NULL

With type='response', logit returns the predicted probability of the completed level. I will use 0.7 as a conservative cut-off for completed. I picked this number because about 70% of the projects in LA are funded.

In [130]:
prediction <- rep('completed',length(Y_test))

prediction_prob <- predict(mylogit,test,type='response')
prediction[prediction_prob < 0.7] <- 'expired'
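
The same thresholding step on a few hypothetical probabilities (the vector `p_hat` is invented for illustration). Storing the result as a factor with explicit levels keeps the label order fixed for any later comparison with the true labels.

```r
# Hypothetical predicted probabilities of 'completed', for illustration only
p_hat  <- c(0.95, 0.65, 0.72, 0.30)
cutoff <- 0.7

# Label a project 'completed' only when its probability reaches the cut-off
pred <- factor(ifelse(p_hat >= cutoff, "completed", "expired"),
               levels = c("completed", "expired"))
pred
table(pred)
```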

### Confusion Matrix 

In [129]:
confusionMatrix(Y_test,prediction)

In confusionMatrix.default(Y_test, prediction): Levels are not in the same order for reference and data. Refactoring data to match.

Confusion Matrix and Statistics

           Reference
Prediction  completed expired
  completed      8634    1192
  expired         838     926
                                          
               Accuracy : 0.8248          
                 95% CI : (0.8178, 0.8317)
    No Information Rate : 0.8173          
    P-Value [Acc > NIR] : 0.01734         
                                          
                  Kappa : 0.3729          
 Mcnemar's Test P-Value : 4.697e-15       
                                          
            Sensitivity : 0.9115          
            Specificity : 0.4372          
         Pos Pred Value : 0.8787          
         Neg Pred Value : 0.5249          
             Prevalence : 0.8173          
         Detection Rate : 0.7450          
   Detection Prevalence : 0.8478          
      Balanced Accuracy : 0.6744          
                                          
       'Positive' Class : completed       
                                       

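The p-value looks good! It comes from a one-sided test of whether the accuracy is better than the "no information rate," which is taken to be the largest class percentage in the data. One caveat: confusionMatrix takes the predictions as its first argument and the true labels as its second, and the call above passes them the other way around. In this output the rows labelled Prediction are therefore the true labels, and the no-information rate (0.8173) is the share of projects the model predicted as completed rather than the share that actually completed (0.8478, shown as Detection Prevalence). The comparison is worth rerunning with the arguments swapped before leaning on the p-value.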
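
The headline metrics can be reproduced by hand from the four cell counts, which makes it clear what each one measures. The counts below are copied from the output; rows follow the first argument of the confusionMatrix call and columns the second.

```r
# Cell counts copied from the confusion matrix output (filled column by column)
cm <- matrix(c(8634, 838, 1192, 926), nrow = 2,
             dimnames = list(row = c("completed", "expired"),
                             col = c("completed", "expired")))
n <- sum(cm)

accuracy    <- sum(diag(cm)) / n                                   # agreement rate
sensitivity <- cm["completed", "completed"] / sum(cm[, "completed"])
specificity <- cm["expired", "expired"] / sum(cm[, "expired"])
nir         <- max(colSums(cm)) / n                                # largest column share

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, nir = nir), 4)
```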

It's interesting that sensitivity and specificity are very much comparable to what random forest and gradient boosting provided, although the accuracy is a little lower. Logit is doing a decent job.