## Goal

In this project, I build a logistic regression model using all of the data set's predictors and interpret some of the coefficients. 
This notebook does not cover data preparation, class imbalance, choosing an optimal threshold, or selecting relevant features for the model; those topics are covered in the next notebook. A formal exploration of the data set will also be carried out there.

### Dataset

The data set I selected is the [Spambase Data Set](https://archive.ics.uci.edu/ml/datasets/spambase) from the UCI Machine Learning Repository.
The spam e-mails were collected from the data set creators' postmaster and from individuals who had filed spam. The non-spam e-mails came from filed work and personal e-mails. There is no missing data.

### Description of Variables 
The data set contains 4601 observations of 57 attributes.
There are 48 columns for word frequency, 6 columns for character frequency, and 3 columns related to run length. The word and character frequency columns are percentages: the share of words (or characters) in the e-mail that match the given word (or character). The run length columns describe runs of consecutive capital letters: the average run length is continuous, while the longest run length and the total run length are integers.

The last column 'class' denotes whether the e-mail was considered spam (1) or not (0).

### Split the Data into Train and Test Data 

In [28]:
library(dplyr)    # filter() and %>% are used when tabulating the coefficients
library(caret)
library(caTools)

# Load the data; the last column, class, is the outcome
spam_dataset <- read.csv("spambase_csv (1).csv")
ncol(spam_dataset)

# Munge the data: convert class into a factor so glm() treats it as categorical
spam_dataset$class <- factor(spam_dataset$class)
is.factor(spam_dataset$class)

# Hold out a random 20% of the rows as the test set
set.seed(364)
n <- nrow(spam_dataset)
test_idx <- sample.int(n, size = round(0.2 * n))

train <- spam_dataset[-test_idx, ]
nrow(train)  # number of rows in the training set

test <- spam_dataset[test_idx, ]
nrow(test)   # number of rows in the test set
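
The split sizes can be checked without the data file. The following is a minimal sketch that assumes only the 4601 rows reported in the description above; `train_idx` is a name introduced here for illustration and is not part of the notebook's code.

```r
set.seed(364)
n <- 4601  # number of observations reported for the Spambase data set

# sample.int() draws without replacement by default, so the indices are unique
test_idx <- sample.int(n, size = round(0.2 * n))
train_idx <- setdiff(seq_len(n), test_idx)

length(test_idx)   # 920 rows held out for testing
length(train_idx)  # 3681 rows left for training

# The two sets should be disjoint and together cover every row
stopifnot(length(intersect(train_idx, test_idx)) == 0)
stopifnot(length(train_idx) + length(test_idx) == n)
```

These counts agree with the totals of the training and test confusion matrices reported later in the notebook (3681 and 920 e-mails).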




### Train the Model using the Training Data 

In [103]:
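# family = "binomial" fits a logistic regression (binomial errors, logit link)
mymodel <- glm(class ~ ., data = train, family = "binomial")
summary(mymodel)

# Collect the coefficient table in a data frame
Coeff.df <- as.data.frame(summary(mymodel)$coefficients)
colnames(Coeff.df)[colnames(Coeff.df) == "Pr(>|z|)"] <- "P.Value"
Coeff.df$B_i <- rownames(Coeff.df)        # keep the term names as a column
rownames(Coeff.df) <- NULL
Coeff.df <- Coeff.df[, c(5, 1, 2, 3, 4)]  # move B_i to the front

# Keep the coefficients that are significant at the 0.05 level
Sig_Coeff.df <- Coeff.df %>% filter(P.Value < 0.05)
Sig_Coeff.df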


“glm.fit: fitted probabilities numerically 0 or 1 occurred”


Call:
glm(formula = class ~ ., family = "binomial", data = train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-4.6544  -0.2020   0.0000   0.1207   4.2060  

Coefficients:
                             Estimate Std. Error z value Pr(>|z|)    
(Intercept)                -1.669e+00  1.686e-01  -9.899  < 2e-16 ***
word_freq_make             -3.767e-01  2.416e-01  -1.559 0.119023    
word_freq_address          -1.956e-01  1.012e-01  -1.932 0.053313 .  
word_freq_all               8.400e-02  1.259e-01   0.667 0.504611    
word_freq_3d                1.963e+00  1.462e+00   1.342 0.179512    
word_freq_our               6.383e-01  1.201e-01   5.316 1.06e-07 ***
word_freq_over              1.174e+00  2.824e-01   4.157 3.23e-05 ***
word_freq_remove            2.706e+00  4.100e-01   6.601 4.08e-11 ***
word_freq_internet          7.382e-01  2.173e-01   3.397 0.000682 ***
word_freq_order             4.238e-01  2.933e-01   1.445 0.148460    
word_freq_mail              4.479e

| B_i | Estimate | Std. Error | z value | P.Value |
|---|---|---|---|---|
| (Intercept) | -1.6692946596 | 0.1686343624 | -9.8989 | 4.208794e-23 |
| word_freq_our | 0.638297167 | 0.1200738085 | 5.315873 | 1.061469e-07 |
| word_freq_over | 1.1739575458 | 0.2824372235 | 4.156526 | 3.23124e-05 |
| word_freq_remove | 2.7064578667 | 0.4099930783 | 6.601228 | 4.077653e-11 |
| word_freq_internet | 0.7381906779 | 0.2173229655 | 3.396745 | 0.000681925 |
| word_freq_free | 0.9049646819 | 0.1497646967 | 6.042577 | 1.516721e-09 |
| word_freq_business | 0.907028877 | 0.2512932052 | 3.609444 | 0.0003068534 |
| word_freq_credit | 1.3825637817 | 0.6418495019 | 2.154031 | 0.03123773 |
| word_freq_your | 0.2310842778 | 0.059454578 | 3.886736 | 0.0001016009 |
| word_freq_000 | 1.8201773007 | 0.4458584652 | 4.082411 | 4.457097e-05 |


### Brief Interpretation of Coefficients (Significant, in this case)

Of the 57 predictors, 25 are statistically significantly associated with the outcome, class, at the 0.05 level. 
The coefficient estimate for the word frequency of 'our' is b = 0.6382971670, which is positive. This means that an increase in the percentage of the word 'our' is associated with an increase in the probability of the e-mail being spam. For each one-unit (one percentage point) increase in this frequency, with the other predictors held fixed, the odds of the e-mail being spam are multiplied by $e^{0.6382971670} \approx 1.89$. This factor applies to the odds, not to the probability. 

The coefficient estimate for the word frequency of 'project' is b = -1.6955674606, which is negative. This means that an increase in the percentage of the word 'project' is associated with a decrease in the probability of the e-mail being spam. For each one-unit increase, the odds of the e-mail being spam are multiplied by $e^{-1.6955674606} \approx 0.18$, that is, they go down by about 82%. 

Both coefficients have p-values below 0.05, so both words are statistically significant predictors of the e-mail's class.
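
To make the odds interpretation concrete, the two coefficients quoted above can be exponentiated directly. This is a small sketch that uses only those two estimates; the names `b`, `odds_ratio`, `pct_change`, and `p0` are introduced here for illustration, and the starting probability of 0.5 is an arbitrary assumption.

```r
# Coefficient estimates quoted in the interpretation above (log-odds scale)
b <- c(our = 0.6382971670, project = -1.6955674606)

# Exponentiating a logistic regression coefficient gives the odds ratio
# for a one-unit increase in that predictor, other predictors held fixed
odds_ratio <- exp(b)
round(odds_ratio, 2)

# Percentage change in the odds per one-unit increase
pct_change <- 100 * (odds_ratio - 1)
round(pct_change, 1)

# The factor applies to the odds, not the probability. Starting from an
# assumed probability of 0.5 (odds of 1), a one-unit increase in 'our' gives:
p0 <- 0.5
new_odds <- (p0 / (1 - p0)) * odds_ratio[["our"]]
new_prob <- new_odds / (1 + new_odds)
new_prob
```

With the fitted model in memory, `exp(coef(mymodel))` would give the same kind of odds ratio for every term at once.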

### Run the Training Data through the Model (All Predictors)


In [105]:
# Classify an e-mail as spam when its predicted probability exceeds 0.5,
# a common default cutoff.
trainpred <- predict(mymodel, newdata = train, type = "response")  # predicted P(spam)
y_pred_num_train <- ifelse(trainpred > 0.5, 1, 0)           # convert probabilities to 1 and 0
y_pred_train <- factor(y_pred_num_train, levels = c(0, 1))  # recode as a factor

y_act_train <- train$class

# Confusion matrix: how well is the model performing?
# Tabulating the factor keeps both rows even if one class is never predicted.
tab1 <- table(Predicted = y_pred_train, Actual = y_act_train)
tab1

# Accuracy
mean(y_pred_train == y_act_train)

# Misclassification rate
1 - sum(diag(tab1)) / sum(tab1)

         Actual
Predicted    0    1
        0 2110  166
        1   97 1308

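For the training data set, the model's classification accuracy is about 93% (3418 of 3681 e-mails classified correctly), and the misclassification error rate is about 7%. Since the model was fit to these same e-mails, the test set below gives the fairer estimate of its performance.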

### Run the Test Data through the Model (All Predictors)

In [104]:
# Classify an e-mail as spam when its predicted probability exceeds 0.5,
# a common default cutoff.
pred <- predict(mymodel, newdata = test, type = "response")  # predicted P(spam)
y_pred_num <- ifelse(pred > 0.5, 1, 0)           # convert probabilities to 1 and 0
y_pred <- factor(y_pred_num, levels = c(0, 1))   # recode as a factor

y_act <- test$class


### Validate the Model: Accuracy, Precision, etc. 

In [29]:
# Confusion matrix: how well is the model performing?
# Tabulating the factor keeps both rows even if one class is never predicted.
tab2 <- table(Predicted = y_pred, Actual = y_act)
tab2

# Accuracy
mean(y_pred == y_act)

# Misclassification rate
1 - sum(diag(tab2)) / sum(tab2)

         Actual
Predicted   0   1
        0 558  32
        1  23 307

For the test data set, the model's classification accuracy is about 94% (865 of 920 e-mails classified correctly), which is in line with the training accuracy. The misclassification error rate is about 6%.
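
Accuracy is only one of the measures named in the heading. As a sketch, precision, recall, and related measures can be derived from the test confusion matrix shown above. The counts are typed in by hand from that output, and the names `cm_test`, `TP`, `FP`, `FN`, `TN`, and `metrics` are introduced here for illustration; spam (1) is treated as the positive class.

```r
# Test-set confusion matrix, typed in from the output above
# (rows = predicted class, columns = actual class; 1 = spam).
# matrix() fills column by column, so the first two counts are the Actual = 0 column.
cm_test <- matrix(c(558, 23, 32, 307), nrow = 2,
                  dimnames = list(Predicted = c("0", "1"), Actual = c("0", "1")))

TP <- cm_test["1", "1"]  # spam predicted as spam
FP <- cm_test["1", "0"]  # non-spam predicted as spam
FN <- cm_test["0", "1"]  # spam predicted as non-spam
TN <- cm_test["0", "0"]  # non-spam predicted as non-spam

accuracy    <- (TP + TN) / sum(cm_test)
precision   <- TP / (TP + FP)  # share of predicted spam that really is spam
recall      <- TP / (TP + FN)  # share of actual spam that was caught
specificity <- TN / (TN + FP)  # share of non-spam correctly let through
f1          <- 2 * precision * recall / (precision + recall)

metrics <- c(accuracy = accuracy, precision = precision, recall = recall,
             specificity = specificity, f1 = f1)
round(metrics, 3)
```

Under these counts, precision is about 0.93 and recall about 0.91, so the model misses slightly more spam than it wrongly flags. With the notebook's objects in memory, caret's `confusionMatrix(y_pred, y_act, positive = "1")` should report comparable statistics.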

Sources Used:

https://www.machinelearningplus.com/machine-learning/logistic-regression-tutorial-examples-r/
https://nbviewer.jupyter.org/gist/justmarkham/6d5c061ca5aee67c4316471f8c2ae976
http://www.sthda.com/english/articles/36-classification-methods-essentials/151-logistic-regression-essentials-in-r/
