# Discussion 8 Logistic Regression

#### by Luke Wiebolt

One of my biggest weaknesses is that I have food allergies. To find out what I'm allergic to, I have been on a food elimination diet for the past 6 months. This means that I can't eat food that contains dairy, seafood, wheat, soy, nuts, or eggs. This cuts out a large portion of foods, but I manage. One thing I miss is chocolate and candy. I can eat some candy, like Sour Patch Kids, but not others, like chocolate, because of the milk and dairy. Let's say I wanted to build a model to predict the likelihood of a candy being chocolate based on a variety of input variables. Until now I have simply guessed whether a piece of candy is chocolate or not, so we will benchmark the model against 50% accuracy.

How would you know if the output of your logistic regression model is valid?

What are the key statistics to determine the validity of your regression model?

### Instantiate Libraries

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

In [None]:
data = pd.read_csv('../input/the-ultimate-halloween-candy-power-ranking/candy-data.csv')
data.head(5)

In [None]:
print('Shape of Data', data.shape)

In [None]:
# Inspect the column types; competitorname (the candy's name) is dropped in the next cell
data.dtypes

In [None]:
df = data[['chocolate', 'fruity', 'caramel', 'peanutyalmondy', 'nougat', 'crispedricewafer',
               'hard', 'bar', 'pluribus', 'sugarpercent', 'pricepercent', 'winpercent']]

df.head()

In [None]:
X = df.values[:, 1:12]
Y = df.values[:, 0]

print(X.shape)
print(Y.shape)

In [None]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split (X, Y, test_size = 0.2, random_state = 42)

In [None]:
from matplotlib import pyplot as plt
import seaborn as sns
fig, ax = plt.subplots(figsize=(11,11));
sns.heatmap(df.corr(), ax=ax, annot=True, linewidths=.5, cmap = "YlGnBu");
plt.xlabel('');
plt.ylabel('');
plt.title('Pearson Correlation matrix heatmap');

In [None]:
# Check the class balance: the two groups are similar in size
df['chocolate'].value_counts()

#### Logistic Regression Assumptions

1. Linearity - there is a linear relationship between any continuous predictors and the logit of the outcome variable, or target variable (Field, 2013). This can be checked with a scatter plot of each continuous predictor against the logit of the outcome.
2. Independence of Errors - the distribution of errors is random and not correlated with or influenced by the errors in previous observations. The opposite of this is autocorrelation, and violating this assumption produces overdispersion (Field, 2013). Checking for this is done by plotting the residuals sequentially as well as running a Durbin-Watson test (Eliezer, 2008).
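As a rough illustration of the Durbin-Watson check mentioned in the second assumption, here is a minimal sketch that computes the statistic by hand with NumPy. The residuals are synthetic (they do not come from the candy model), and `durbin_watson_stat` is a hypothetical helper name used only for this example. A value near 2 suggests no autocorrelation, while values toward 0 or 4 suggest positive or negative autocorrelation.

```python
import numpy as np

def durbin_watson_stat(residuals):
    """Sum of squared successive differences divided by the sum of squared residuals."""
    residuals = np.asarray(residuals, dtype=float)
    diff = np.diff(residuals)
    return float(np.sum(diff ** 2) / np.sum(residuals ** 2))

# Synthetic residuals for illustration only (assumption: not taken from the candy data)
rng = np.random.default_rng(0)
independent = rng.normal(size=500)        # no serial correlation
autocorrelated = np.cumsum(independent)   # random walk: strong positive autocorrelation

dw_independent = durbin_watson_stat(independent)
dw_autocorrelated = durbin_watson_stat(autocorrelated)
print(f"independent residuals:    {dw_independent:.2f}")
print(f"autocorrelated residuals: {dw_autocorrelated:.2f}")
```

statsmodels also ships the same statistic as `statsmodels.stats.stattools.durbin_watson`, which could be applied to the residuals of a fitted model instead of the hand-rolled helper.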

In [None]:
import statsmodels.api as sm
logit_model=sm.Logit(y_train, x_train)
result=logit_model.fit()
print(result.summary2())

After the first run I removed the variables hard, crispedricewafer, and bar from my model because they were causing it to fail to converge. Looking at the variable x4 (nougat), the coefficient is extremely high and the p-value is nearly 1.0. This is clearly an issue, so we will remove it as well.

In [None]:
df_slim = data[['chocolate', 'fruity', 'caramel', 'peanutyalmondy',
                'hard','pluribus', 'sugarpercent', 'pricepercent', 'winpercent']]

X_s = df_slim.values[:, 1:9]
Y_s = df_slim.values[:, 0]

x_train_1, x_test_1, y_train_1, y_test_1 = train_test_split (X_s, Y_s, test_size = 0.2, random_state = 42)


In [None]:
import statsmodels.api as sm
logit_model=sm.Logit(y_train_1, x_train_1)
result=logit_model.fit()
print(result.summary2())

In [None]:
df_final = data[['chocolate', 'fruity', 'pluribus', 'winpercent']]

X_f = df_final.values[:, 1:4]
Y_f = df_final.values[:, 0]

x_train_2, x_test_2, y_train_2, y_test_2 = train_test_split (X_f, Y_f, test_size = 0.2, random_state = 42)

In [None]:
import statsmodels.api as sm
logit_model=sm.Logit(y_train_2, x_train_2)
result=logit_model.fit()
print(result.summary2())

## Below is a way we can test whether our model is valid based on the output and various key statistics

The output above gives us a wide range of statistics, some of which are similar to those from a linear regression, like the coefficient table, even though the overall equation itself is different. The log-likelihood is based on summing the log probabilities associated with the predicted and actual outcomes. It indicates how much unexplained information remains in the model after fitting: the value is always zero or negative, and the further it is from zero, the more unexplained observations there are.
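To make the log-likelihood concrete, here is a minimal sketch of how it can be computed for a binary outcome. The labels and predicted probabilities are made-up toy values, and `log_likelihood` is a hypothetical helper, not something taken from the statsmodels output above.

```python
import numpy as np

def log_likelihood(y_true, p_pred):
    """Sum of y*ln(p) + (1 - y)*ln(1 - p) over all observations."""
    y = np.asarray(y_true, dtype=float)
    # Clip to avoid taking the log of exactly 0 or 1
    p = np.clip(np.asarray(p_pred, dtype=float), 1e-12, 1 - 1e-12)
    return float(np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)))

# Toy data (assumption: illustrative values only)
y = np.array([1, 0, 1, 1, 0])
p_good = np.array([0.9, 0.1, 0.8, 0.7, 0.2])  # confident and mostly right
p_poor = np.array([0.6, 0.5, 0.5, 0.4, 0.5])  # close to guessing

ll_good = log_likelihood(y, p_good)
ll_poor = log_likelihood(y, p_poor)
print(f"well-fitting predictions:   {ll_good:.3f}")
print(f"poorly fitting predictions: {ll_poor:.3f}")
```

The better-fitting predictions give a log-likelihood closer to zero, which is the sense in which a value further from zero means more is left unexplained.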

Pseudo R-squared is similar to the R-squared used in linear regression, but it is not the same. The least-squares approach to goodness of fit does not apply in this case because the coefficients are estimated by maximizing the likelihood. The pseudo R-squareds were created as a way to measure fit on a similar scale of 0 to 1. More research should be done on these before using or reporting them:
https://stats.idre.ucla.edu/other/mult-pkg/faq/general/faq-what-are-pseudo-r-squareds/
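One common variant is McFadden's pseudo R-squared, which compares the model's log-likelihood with that of an intercept-only (null) model. The sketch below is a minimal illustration on made-up toy values; `bernoulli_ll` and `mcfadden_r2` are hypothetical helper names.

```python
import numpy as np

def bernoulli_ll(y, p):
    """Log-likelihood of binary outcomes y under predicted probabilities p."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return float(np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)))

def mcfadden_r2(y_true, p_model):
    """McFadden's pseudo R-squared: 1 - LL(model) / LL(null)."""
    y = np.asarray(y_true, dtype=float)
    p_model = np.asarray(p_model, dtype=float)
    # The null model predicts the overall positive rate for every observation
    p_null = np.full_like(y, y.mean())
    return 1.0 - bernoulli_ll(y, p_model) / bernoulli_ll(y, p_null)

# Toy data (assumption: illustrative values only)
y = np.array([1, 0, 1, 1, 0])
p_model = np.array([0.9, 0.1, 0.8, 0.7, 0.2])

r2 = mcfadden_r2(y, p_model)
r2_null = mcfadden_r2(y, np.full(5, y.mean()))
print(f"McFadden pseudo R-squared: {r2:.3f}")
```

In statsmodels, the fitted `Logit` results expose McFadden's version as `result.prsquared`.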

In [None]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(x_train_2, y_train_2)


We let our logistic regression model predict the values for the x_test_2 test set (y_pred) and we measure the accuracy. These are the 17 records that we separated out when creating the train and test data sets.

In [None]:
y_pred = logreg.predict(x_test_2)
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(x_test_2, y_test_2)))

In [None]:
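from sklearn.metrics import confusion_matrix
# Store the result under a new name so the imported function is not overwritten,
# and compare against y_test_2, the labels that belong to x_test_2
cm = confusion_matrix(y_test_2, y_pred)
print('Confusion Matrix')
print(cm)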
print('This means we have 8 + 7 = 15 correct predictions')
print('and')
print('This means we have 1 + 1 = 2 incorrect predictions')

#### From our confusion matrix we conclude that:
* True positive: 8 (we predicted a positive result and it was positive)
* True negative: 7 (we predicted a negative result and it was negative)
* False positive: 1 (we predicted a positive result and it was negative)
* False negative: 1 (we predicted a negative result and it was positive)
* Accuracy = (TP+TN)/total
* Accuracy = (8+7)/17 ~ 88%
* Error rate = (FP+FN)/total
* Error rate = (1+1)/17 ~ 12%
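The arithmetic above can be reproduced in a few lines. This is a minimal sketch that plugs in the counts listed above; precision and recall are added here as extra illustrations and are computed from the same four counts.

```python
# Counts taken from the confusion matrix summary above
tp, tn, fp, fn = 8, 7, 1, 1
total = tp + tn + fp + fn

accuracy = (tp + tn) / total      # share of all predictions that were correct
error_rate = (fp + fn) / total    # share of all predictions that were wrong
precision = tp / (tp + fp)        # of the predicted positives, how many were right
recall = tp / (tp + fn)           # of the actual positives, how many were found

print(f"Accuracy:   {accuracy:.1%}")
print(f"Error rate: {error_rate:.1%}")
print(f"Precision:  {precision:.1%}")
print(f"Recall:     {recall:.1%}")
```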

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test_2, y_pred))
print('read more from the documentation on each of these metrics')
print('https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html')

In [None]:
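from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
# Score the AUC on predicted probabilities rather than hard 0/1 labels,
# so the area shown in the legend matches the curve that is plotted
y_prob = logreg.predict_proba(x_test_2)[:, 1]
logit_roc_auc = roc_auc_score(y_test_2, y_prob)
fpr, tpr, thresholds = roc_curve(y_test_2, y_prob)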
plt.figure()
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % logit_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('Log_ROC')
plt.show()

Our model implies that fruity, pluribus, and winpercent are the variables most indicative of whether a candy is chocolate or non-chocolate. The model shows that if a candy is fruity, it is less likely to be chocolate; if it is pluribus (it comes in multiple packs), it is less likely to be chocolate; and if it has a higher win percent, it is more likely to be chocolate.
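One way to read the direction of these effects is to convert each logistic regression coefficient into an odds ratio by exponentiating it. The sketch below uses placeholder coefficients chosen only to match the signs described above; they are assumptions for illustration and are not the fitted estimates from this model.

```python
import numpy as np

# Placeholder coefficients (assumption: signs follow the text, magnitudes are made up)
coefs = {"fruity": -2.0, "pluribus": -1.0, "winpercent": 0.05}

# exp(coefficient) is the multiplicative change in the odds of "chocolate"
# for a one-unit increase in that variable, holding the others fixed
odds_ratios = {name: float(np.exp(beta)) for name, beta in coefs.items()}

for name, ratio in odds_ratios.items():
    direction = "lowers" if ratio < 1 else "raises"
    print(f"{name}: odds ratio {ratio:.2f} ({direction} the odds of chocolate)")
```

An odds ratio below 1 corresponds to a negative coefficient (less likely to be chocolate), and an odds ratio above 1 corresponds to a positive coefficient (more likely to be chocolate).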

One of the worries here is that our dataset is too small. If this model were ever to be used further it would need to be tested against a larger and more robust dataset. There are ways this can be expanded on by simply comparing the two models side by side and assessing both of their accuracy. In the future I will touch this up and see what else I can add. Thanks for following along!

### References

Eliezer (2008) Lecture 8 - Residual Analysis - Checking Independence of Errors. Mar 4th, 2008. GSB420 Business Statistics. Retrieved From http://gsb420.blogspot.com/2008/03/lecture-8-residual-analysis-checking_04.html

Field, A (2013) Discovering Statistics using IBM SPSS Statistics. Sage Publications Ltd. 4th Edition. ISBN 978-1-4462-4917-8

Li, S (2017) Building A Logistic Regression in Python, Step by Step. Towards Data Science. Sep 28th, 2017. Retrieved From https://towardsdatascience.com/building-a-logistic-regression-in-python-step-by-step-becd4d56c9c8
 
Walia, A (2019) Logistic Regression in Python. Medium. Mar 9th, 2019. Retrieved From https://medium.com/@anishsingh20/logistic-regression-in-python-423c8d32838b
