In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        if 'train' in os.path.join(dirname, filename):
            train = pd.read_csv(os.path.join(dirname, filename))
        elif 'test' in os.path.join(dirname, filename):
            test = pd.read_csv(os.path.join(dirname, filename))
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
import pylab
from scipy.stats import shapiro

I'll be doing a series of different classification methods throughout my Santander series (with explanations). For each model, I will attempt to over-explain rather than to under-explain. It is my hope that this can be approachable. 

Below, I'll be adding links to the different classifications as I get to them. For logistic regression, I'm currently thinking that I'll have one very simple model and a few more complex models. I'll link to those here and at the bottom.


# **SCIKIT Classification**

## ***Linear Models***

### **Logistic Regression**

#### 1) Simple model. This one.


2) Adding things like k fold & perhaps playing around with loss models & SGDClassifier



*Ridge Classifier*

1) Sometimes referred to as a Least Squares Support Vector Machines with a linear kernel (according to scikit)



*Perceptron*

 
*Passive Aggressive Classifier*
 
 

 
***Linear & Quadratic Discriminant Analysis***



***Support Vector Machines***



***Nearest Neighbors***


*K neighbors Classification*


*Radius Neighbors*


*Nearest Centroid*


*Neighborhood Components Analysis*


***Gaussian Process Classifier***


***Naive Bayes***


*Gaussian Naive Bayes*


*Multinomial Naive Bayes*


*Bernoulli Naive Bayes*

*Categorical Naive Bayes*

***Decision Tree***

***Ensemble Methods***

***Probability Calibration***

***Neural Networks***

*Multi layer perceptron*



I'll mostly keep it to one type per. I might do some ensemble and other combinations in separate posts if feasible.

# **Basic Exploratory Analysis**

For my EDA (Exploratory Data Analysis), I need to start off by seeing what is actually going on with my data. At this point, I'm assuming that I have no idea what is in it. Currently, my framework is as follows:

EDA
1. Check the raw data & data dictionary (if applicable)
2. Check for nulls
3. Check to make sure that quantitative and qualitative data are in the correct format
4. Check data distributions & related data transformations
5. If applicable, do some feature tinkering.
6. Picking my model. Assessing assumptions of model & if it fits with data. (If assumptions do not hold, go back to 4 or decide why it is ok to continue).


Training

7. Training the model
8. Submitting my test

I'm always open to tweaking this framework, but overall it is general enough to get me through the day. You could easily add things like factor analysis in the EDA if you wanted to. I'll refrain for now.

## Checking Raw Data & Data Dictionary

In this circumstance, there is no data dictionary. The data is anonymized financial data & thus pretty difficult to glean any real world information. Not ideal, but workable as there is so much financial data.

In [None]:
train.head()

In [None]:
test.head()

**Train**

ID_code

Target

200 variables

________________________________________________________

**Test**

ID_code

200 variables

________________________________________________________

The obvious question here is: how do you comfortably work with anonymized data? You have to get comfortable working with *the numbers*. This can be a little intimidating, but it doesn't have to be. In some ways, when you work with just the numbers you are working with less abstraction. What can you do with just numbers?

Two things come to mind:

1) You can always check for nulls

2) You can check data distributions, mean distances, etc. Essentially, you're looking for summary statistics. Heck, you could do some factor analysis or clustering on this stuff and see what comes up. At this point, I'm going to keep it more on the basic summary statistics, but I might come back to some multivariate methods later.

In [None]:
train.isnull().values.any()

In [None]:
test.isnull().values.any()

Since these two isnull() tests end up as false, I don't have to worry about checking where any null data might be. No need to spend the time (computationally or programmatically) before doing an easy check.
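If either check had come back True, the next step would be finding out where the nulls are. Here is a minimal sketch of that on a small made-up frame (the `demo` DataFrame and its values are hypothetical, not the competition data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for train/test
demo = pd.DataFrame({
    "var_0": [1.0, np.nan, 3.0],
    "var_1": [4.0, 5.0, 6.0],
    "var_2": [np.nan, np.nan, 9.0],
})

# Quick yes/no check, same call as used on train and test
has_nulls = demo.isnull().values.any()

# Count nulls per column and keep only the columns that have any
null_counts = demo.isnull().sum()
null_counts = null_counts[null_counts > 0]

print(has_nulls)
print(null_counts)
```

The cheap yes/no check goes first, and the per-column count only matters when it says yes.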

Time for some descriptive statistics

In [None]:
train.describe()

In [None]:
test.describe()

This is slightly chaotic to read. Essentially, what I really want from all of this is to see the differences between variables in terms of mean, standard deviation, min and max numbers. From a glance, it seems like standardizing these numbers could be useful here.
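One way to make the describe() output easier to scan is to transpose it and keep only the statistics of interest. This is just a sketch on a made-up frame (the `demo` columns and their scales are hypothetical); the same idea should carry over to `train.describe()`:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical stand-in for a few of the anonymized columns
demo = pd.DataFrame({
    "var_0": rng.normal(10.0, 3.0, 500),
    "var_1": rng.normal(-2.0, 0.1, 500),
    "var_2": rng.normal(0.0, 25.0, 500),
})

# One row per variable, only the statistics we care about
summary = demo.describe().T[["mean", "std", "min", "max"]]
# Sort so the variables with the widest spread come first
summary = summary.sort_values("std", ascending=False)
print(summary)
```

With one row per variable, large differences in scale between columns are much easier to spot.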


I should probably check for normality and scedasticity of the raw data. Normality assumption test first.

In [None]:

def normality_plots(df, rb=0):
    # Draw histograms for 25 consecutive columns: var_(rb+1) through var_(rb+25)
    fig, axes = plt.subplots(5, 5, figsize=(15, 14))

    for i, ax in enumerate(axes.flat, start=1):
        col = 'var_' + str(i + rb)
        ax.hist(df[col])  # use the frame that was passed in, not the global train
        ax.set_title(col)

    plt.show()

In [None]:
normality_plots(train)

In [None]:
normality_plots(train, 27)

In [None]:
normality_plots(train,52)

In [None]:
normality_plots(train, 77)

In [None]:
normality_plots(train, 102)

In [None]:
normality_plots(train, 127)

In [None]:
normality_plots(train, 152)

Most of the histograms look roughly bell-shaped. I don't *need* a normal distribution here, so I'm not going to hammer down on this. I'll illustrate a probability plot on one variable just to see how close to normal we are.

In [None]:
stats.probplot(train['var_1'], dist='norm',plot=pylab)
pylab.show()
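A formal test can back up what the probability plot shows. The sketch below runs `shapiro` from scipy.stats on two made-up columns (`normal_col` and `skewed_col` are hypothetical, not competition variables). SciPy notes that the p-value may not be accurate above roughly 5000 observations, so on the real data I would probably test a random sample of a column rather than the whole thing.

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(0)
# Hypothetical columns: one drawn from a normal, one clearly skewed
normal_col = rng.normal(size=2000)
skewed_col = rng.exponential(size=2000)

results = {}
for name, col in [("normal", normal_col), ("skewed", skewed_col)]:
    stat, p_value = shapiro(col)  # W statistic and p-value
    results[name] = (stat, p_value)
    print(name, round(stat, 4), p_value)
```

A W statistic close to 1 is consistent with normality, while a small p-value is evidence against it.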

In [None]:
sns.countplot(x='target', data=train)

Any correlated variables?

In [None]:
train25=train.iloc[:,1:26]
train25

In [None]:
train25=train.iloc[:,1:26]
corr=train25.corr()

sns.heatmap(corr,xticklabels=corr.columns,
        yticklabels=corr.columns)

I'm going to make the not so bold decision to believe that none of these really have much correlation to speak of here.
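Eyeballing a heatmap only goes so far, so here is a sketch of how the strongest pairwise correlations could be listed as numbers instead. The `demo` frame is hypothetical, with `var_b` deliberately built to correlate with `var_a`. On the real data the same steps could be applied to `train25.corr()`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=1000)
# Hypothetical frame: var_b is built from var_a, var_c is independent
demo = pd.DataFrame({
    "var_a": base,
    "var_b": base + rng.normal(scale=0.5, size=1000),
    "var_c": rng.normal(size=1000),
})

corr = demo.corr().abs()
# Keep only the upper triangle so each pair shows up once
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna().sort_values(ascending=False)
print(pairs.head())
```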

At this point, I'm going to go forward with my logistic regression model and just see what happens. I'll likely come back and adjust. I saw some of the feature engineering in the most popular EDA for this, but I'm going to avoid that this time. I'll approach it in a later edition once I work through it and see what gems it has.

**The Model - Logistic Regression**

Going to be focusing on the scikit version of logistic regression here. Hoping to explain how things work along the way.

First thing: despite the name, logistic regression is a *classification* algorithm. The reason it has the name regression is due to the algebraic equation nestled snugly in there. There are, of course, variations on this. I'll be focusing on the most typical one here: binary logistic regression.


# Binary Logistic Regression

$\huge p(X) = \frac{e^{\large\beta_{\large0}+\beta_{\large1}X}}{1+e^{\large\beta_{\large0}+\beta_{\large1}X}} $

where

$\large p(X) = $ The probability that the response is 1, given X

$\large e =$ Euler's number

$\large \beta_{0} = $ Intercept Term

$\large \beta_{1}= $ Slope Term

The formula above does some great things for us. First and foremost, it reduces our output to a value between 0 and 1. In this circumstance, that is perfect. We are looking for a yes/no answer, and a probability between 0 and 1 can be turned into that binary response.

Before we go on, it is worth pointing out, from a mathematical position, where the name logistic regression comes from. Quickly, here is the linear regression formula.

$\large y = \beta_{0}+\beta_{1}X $

I hope you see it. If you don't, look at the exponents in the numerator and denominator of the binary logistic regression formula. For those that are more visually minded, it may help to see what the plot looks like.

In [None]:

# Sigmoid function
#
def sigmoid(z):
    return 1 / (1 + np.exp(-z))
# Creating sample Z points
#
z = np.arange(-5, 5, 0.1)
 
# Invoking Sigmoid function on all Z points
#
phi_z = sigmoid(z)
 
# Plotting the Sigmoid function
#
plt.plot(z, phi_z)
plt.axvline(0.0, color='k')
plt.xlabel('z')
plt.ylabel(r'$\phi(z)$')  # raw string so the backslash is not treated as an escape
plt.yticks([0.0, 0.5, 1.0])
ax = plt.gca()
ax.yaxis.grid(True)
plt.tight_layout()
plt.show()

The line, which we call a sigmoid curve, shows how the *probability* of getting a 1 output grows as the input z increases.

Ok, ok. So now you've seen the logistic regression formula & you've seen why it is misleadingly called regression when it is in fact a classification algorithm. But...what does it actually mean? How do we convert the abstract notation into something generally usable? To make the above notation a bit more understandable, we can manipulate this formula to show us *odds*.

$\huge \frac{p(X)}{1-p(X)} = e^{\beta_{0}+\beta_{1}X} $


If you know anything about odds, you know that you can now read the formula by saying, "If I have a p(X) = 0.4, then that means that I have odds of .4/.6," or odds of 2:3. It is important to note here what odds is *not*. Odds is not saying that your chance is 2/3. Instead, it is saying that you will, probabilistically, arrive at answer X 2 times out of 5 and answer Y 3 times out of 5.


Think about it this way. An odds of 1:1 would just be...50% probability, right? If you ever get confused, just remember the odds of 1:1 and then work your interpretation from there.
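The arithmetic above can be checked with a couple of tiny helper functions. The names `prob_to_odds` and `odds_to_prob` are just mine for illustration:

```python
def prob_to_odds(p):
    """Odds in favor of the event: p / (1 - p)."""
    return p / (1 - p)


def odds_to_prob(odds):
    """Inverse mapping: odds / (1 + odds)."""
    return odds / (1 + odds)


# p = 0.4 gives odds of 2:3, p = 0.5 gives odds of 1:1
for p in (0.4, 0.5, 0.9):
    print(p, round(prob_to_odds(p), 3))
```

Going from probability to odds and back again returns the original probability, which is a handy sanity check when interpreting coefficients.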

But, of course, that isn't it with odds. You still have to take the logarithm (this makes everything easier to work with, because it brings the intercept, slope, & X down out of the exponent).
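Written out, taking the natural logarithm of both sides of the odds formula gives the log odds (also called the logit), which is linear in X just like the linear regression formula:

$\huge \log\left(\frac{p(X)}{1-p(X)}\right) = \beta_{0}+\beta_{1}X $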


Now the question is, how on earth do we use this? Well, it ends up being pretty simple here. Once you have all the more complicated stuff out of the way, then you check on your probability switch. What I mean by that is that you decide what your decision boundary is for moving your outputs to 0 or 1.

For this modeling practice, we won't mess with that, it'll remain at a 50% probability for our decision line. In certain business practices, you may want to shift this in order to be more conservative or more aggressive.
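For the curious, here is a rough sketch of what shifting the decision line could look like. It uses synthetic data from `make_classification` rather than the competition data, and the 0.8 cutoff is an arbitrary value picked for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical synthetic data, not the competition data
X_demo, y_demo = make_classification(n_samples=500, n_features=5, random_state=0)
model = LogisticRegression().fit(X_demo, y_demo)

# Probability of class 1 for each row
proba = model.predict_proba(X_demo)[:, 1]

default_pred = (proba >= 0.5).astype(int)   # the usual 50% decision line
cautious_pred = (proba >= 0.8).astype(int)  # only call it a 1 when very sure

print(default_pred.sum(), cautious_pred.sum())
```

Raising the cutoff can only reduce the number of rows labeled 1, which is the more conservative setting.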

## Assumptions with Logistic Regression

Now the question is, what assumptions does logistic regression make about our model? After all, all models make some sort of assumptions.

Some big differences from ordinary linear regression right off the bat (logistic regression is itself a GLM, or generalized linear model).

    1) Residuals do not need to be normally distributed

    2) No homoscedasticity required


Assumptions for **binary** logistic regression

    1) Target variable must be binary

    2) Independent observations

    3) Little multicollinearity

    4) Linearity of independent variables & log odds

    5) Relatively large sample size


For us

    1) Not a problem

    2) Unsure, but it doesn't really seem like this should be an issue. At the very least, I'm not diving into the anonymous data to see if we have matched or repeated data, although it would strike me as odd if we did.
    
    3) This trips people up. Multicollinearity describes a linear relationship between multiple predictors. We don't really know about this yet.
    
    4) We're good here.
    
    5) Excellent here
    

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

def calc_vif(X):

    # Calculating VIF
    vif = pd.DataFrame()
    vif["variables"] = X.columns
    vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

    return(vif)


#Big problem here is that this calculation is soooo damn slow

In [None]:
#mu_co=train.iloc[:,2:-1]
#calc_vif(mu_co)

mu_co=train.iloc[0:25,2:10]
calc_vif(mu_co)

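These VIF values look pretty terrible, but they should be read with caution: they come from only 25 rows and 8 columns, and `variance_inflation_factor` is run here without an intercept column, which tends to inflate the numbers when the variables have non-zero means. The correlation heatmap above did not show much correlation either. I'm not going to attempt fixing it right now, I just want to see how the model performs as is. I'll likely want to revisit this on my next go around.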

I'm separating my X from my y values and then splitting them into a train-test (or validation in this case) set that scikit-learn can work with.

A lot of people can run into problems here (I know I can as well). Scikit gets pretty picky about having everything in the EXACT right format. Pay attention to your columns

In [None]:
X=train.drop(['target','ID_code'], axis=1)
y=train.target

In [None]:
y

Doing an 80/20 split here. Not concerned about having too little data at all.

In [None]:
X

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val=train_test_split(X, y, test_size=0.2,random_state=0)

Printing these out so you can see what is going on in between all of these. In essence:

X_train: Large dataset with 80% of all data & no response

X_val: Smaller dataset with 20% of the data & no response

y_train: Corresponding 80% response values

y_val: Corresponding 20% response values.

The reason that I am labeling these as 'val' instead of 'test' is that the real test happens on the unseen data in the test file. Labeling this split as validation keeps everything a little bit more clear than it otherwise would be. This is pretty common practice, although unfortunately I don't think it is quite common enough.

In [None]:
X_train.head()

In [None]:
len(X_train)

In [None]:
X_val.head()
len(X_val)

In [None]:
y_train

In [None]:
y_val.head()

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
logreg = LogisticRegression(solver='sag')

I originally used the liblinear solver here due to an issue I was having with the default lbfgs solver. I think it is fixed, but I didn't want to mess with updating scikit right now in case other things break. Sag should work a little faster on a dataset this size, although scikit-learn only guarantees fast convergence for it when the features are on roughly the same scale. No real need to go into the details of what exactly this does (for me, right now... I could be wrong and it could drastically improve performance later).
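Since sag is sensitive to feature scale, one option is to put a `StandardScaler` in front of the model with a pipeline. This is only a sketch on synthetic data from `make_classification`, and `max_iter=1000` is an arbitrary choice. It is not what the notebook runs.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data, not the competition data
X_demo, y_demo = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# The scaler is fit on the training split only, then reused on validation
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver="sag", max_iter=1000),
)
pipe.fit(X_tr, y_tr)

val_accuracy = pipe.score(X_va, y_va)
print(val_accuracy)
```

Wrapping both steps in one pipeline means the validation and test data automatically get the same scaling that was learned from the training data.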

In [None]:
# fit estimates the model coefficients from the X_train features and the y_train labels
logreg.fit(X_train, y_train)

The fit part of scikit is a little bit weird. Essentially, you're training the model with whatever algorithm you have chosen. For logistic regression, fit uses the solver to estimate an intercept and one coefficient per feature, picking the values that best match the predicted probabilities to the y_train labels (subject to the regularization settings you pass in).

In [None]:
y_pred=logreg.predict(X_val)

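Predict returns the model's 0 or 1 classification for each row of the validation set. Comparing those predictions to y_val tells you how well the trained model generalizes. If there is a wide disparity between training and validation performance, it is a good sign that you overfit.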

I'm going to add a confusion matrix in here & start seeing how well I did with my logistic regression model.

In [None]:
from sklearn import metrics
cnf_matrix = metrics.confusion_matrix(y_val,y_pred)
cnf_matrix

In [None]:
print("Accuracy:",metrics.accuracy_score(y_val, y_pred))
print("Precision:",metrics.precision_score(y_val, y_pred))
print("Recall:",metrics.recall_score(y_val, y_pred))

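Accuracy is basic. Exactly what you would think you were looking for: $\huge\frac{TP+TN}{TP+TN+FP+FN}$. Overall, I'm relatively pleased with this, although accuracy can look flattering when one class dominates, so it is worth reading alongside precision and recall.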

_____________________________________________

Precision goes down substantially. Precision is a little more difficult for people to grasp. $\huge\frac{TP}{TP+FP}$ or in other words: out of everything the model predicted as positive (classified as a 1 in our case), the share that really is positive.

______________________________________________


Our recall score...isn't good. It also tends to be the hardest one for people to grasp. Recall tells us how often we correctly predict a positive when a positive exists.
$\huge\frac{TP}{TP+FN}$
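To make the three formulas concrete, here is a small worked example that computes them by hand from a confusion matrix. The `y_true` and `y_hat` arrays are made up for illustration:

```python
import numpy as np
from sklearn import metrics

# Hypothetical labels and predictions
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_hat = np.array([0, 0, 0, 0, 0, 1, 1, 0, 0, 0])

# For binary 0/1 labels, ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = metrics.confusion_matrix(y_true, y_hat).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(tn, fp, fn, tp)
print(accuracy, precision, recall)
```

Here the model finds only 1 of the 4 true positives, so recall is low even though accuracy still looks passable.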


In [None]:
y_pred_proba = logreg.predict_proba(X_val)[:, 1]
fpr, tpr, _ = metrics.roc_curve(y_val,  y_pred_proba)
auc = metrics.roc_auc_score(y_val, y_pred_proba)
plt.plot(fpr,tpr,label="data 1, auc="+str(auc))
plt.legend(loc=4)
plt.show()

ROC curves show us a tradeoff. In this circumstance, the tradeoff is between sensitivity (the true positive rate) and specificity (1 - false positive rate). The AUC (Area Under Curve) is pretty decent here. Actually better than I expected.

In [None]:
my_test=test.drop(['ID_code'],axis=1)
my_test

Now, I'll finish up by putting everything into my test dataset and then submitting this simplest of logistic regressions. I'll make a more complicated logistic regression model which introduces k-folds and some other things in my following model, which I'll link here when it is available.

In [None]:
my_submission=logreg.predict(my_test)

In [None]:
my_submission

In [None]:
output= pd.DataFrame({'ID_code': test.ID_code,
                       'target': my_submission})
output.to_csv('submission.csv', index=False)

In [None]:
output

Some code & information used from the following sources:

1) https://vitalflux.com/logistic-regression-sigmoid-function-python-code/

2) Datacamp

3) "An Introduction to Statistical Learning: with Applications in R" by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani

4) "Pattern Recognition and Machine Learning" by Christopher M. Bishop

5) https://www.analyticsvidhya.com/blog/2020/03/what-is-multicollinearity/