# Logistic regression model

## Fit a logistic regression model and make global and local explanations


Predict whether a student will drop out of their class.

The workflow is the following:

- Identify variables that are good predictors of the target.
- Identify and remove high multicollinearity among the predictors.
- Fit a logistic model and assess the goodness of fit.
- Ensure all features make statistically significant contributions to the outcome.
- Interpret the coefficients (global interpretation).
- Evaluate a few observations individually (local interpretation).

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

## Load data

To obtain the data, check the folder `prepare-data` in this repo, or section 2 in the course.

In [2]:
df = pd.read_csv('../../student_dropout_logit.csv')

print(df.shape)

df.head()

(4424, 102)


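| Unnamed: 0 | Application order | Daytime/evening attendance\t | Previous qualification (grade) | Admission grade | Displaced | Educational special needs | Debtor | Tuition fees up to date | Gender | Scholarship holder | ... | Nacionality_Italian | Nacionality_Cape Verdean | Nacionality_Turkish | Nacionality_Moldova (Republic of) | Nacionality_Guinean | Nacionality_Colombian | Nacionality_German | Nacionality_Cuban | Nacionality_Russian | Nacionality_English |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5 | 1 | 122.0 | 127.3 | 1 | 0 | 0 | 1 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 1 | 160.0 | 142.5 | 1 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 5 | 1 | 122.0 | 124.8 | 1 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 2 | 1 | 122.0 | 119.6 | 1 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 1 | 0 | 100.0 | 141.5 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |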


In [3]:
# Split the data

X_train, X_test, y_train, y_test = train_test_split(
    df.drop("dropout", axis=1),
    df["dropout"],
    test_size=0.2,
    random_state=1,
)

X_train.shape, X_test.shape

((3539, 101), (885, 101))

In [4]:
# Fraction of students who drop out.

y_train.mean(), y_test.mean()

(0.3241028539135349, 0.3096045197740113)

In [5]:
# scale the variables

scaler = MinMaxScaler().set_output(transform="pandas")

X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

## Variables

In [6]:
# Lists with numerical and categorical variables
# The categorical variables were one-hot encoded
# into k-1 variables

cat_vars = [var for var in X_train.columns if "_" in var]
num_vars = [var for var in X_train.columns if var not in cat_vars]

len(cat_vars), len(num_vars)

(74, 27)

## Chi-square (categorical variables)

Find and remove categorical variables that are not good predictors of dropout. Use the chi-square test.

**Hint:** sklearn's chi-square function is not what you want. Use `scipy.stats`.
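
One possible way to approach this step is sketched below. It builds a contingency table per column with `pd.crosstab` and tests it with `scipy.stats.chi2_contingency`. The helper name `chi2_pvalues`, the synthetic demo data, and the 0.05 cut-off are assumptions made for illustration; they are not part of the course material.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


def chi2_pvalues(X_cat, y):
    """Chi-square p-value of each categorical (0/1) column against y."""
    pvalues = {}
    for col in X_cat.columns:
        # Contingency table: counts of each category per target class.
        table = pd.crosstab(X_cat[col], y)
        # A column with a single observed level carries no information.
        if table.shape[0] < 2 or table.shape[1] < 2:
            pvalues[col] = 1.0
            continue
        _, p_value, _, _ = chi2_contingency(table)
        pvalues[col] = p_value
    return pd.Series(pvalues).sort_values()


# Synthetic demonstration data (NOT the student dataset).
rng = np.random.default_rng(0)
target = rng.integers(0, 2, size=500)
flip = rng.random(500) < 0.1  # flip 10% of the labels to add noise
y_demo = pd.Series(target, name="dropout")
X_demo = pd.DataFrame({
    "informative_flag": np.where(flip, 1 - target, target),
    "noise_flag": rng.integers(0, 2, size=500),
})

pvals = chi2_pvalues(X_demo, y_demo)
selected = pvals[pvals < 0.05].index.tolist()  # 0.05 is an assumed cut-off
print(pvals)
print(selected)
```

In the notebook, the equivalent call would be `chi2_pvalues(X_train[cat_vars], y_train)`. Keep in mind that the chi-square approximation becomes unreliable for very rare categories, where the expected counts in the contingency table are small.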

## ANOVA (numerical variables)

Find and remove numerical variables that are not good predictors of dropout. Use ANOVA or a t-test.

**Hint:** The `f_classif` function from sklearn is what you want.
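
A minimal sketch of this step, using `f_classif` as the hint suggests. The helper name `anova_table`, the synthetic demo data, and the 0.05 cut-off are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import f_classif


def anova_table(X_num, y):
    """One-way ANOVA F statistic and p-value of each column against y."""
    f_stat, p_value = f_classif(X_num, y)
    table = pd.DataFrame({"F": f_stat, "p_value": p_value}, index=X_num.columns)
    return table.sort_values("p_value")


# Synthetic demonstration data (NOT the student dataset).
rng = np.random.default_rng(0)
y_demo = pd.Series(rng.integers(0, 2, size=500), name="dropout")
X_demo = pd.DataFrame({
    "signal": y_demo.to_numpy() + rng.normal(size=500),  # mean shifts with the class
    "noise": rng.normal(size=500),                       # unrelated to the class
})

anova = anova_table(X_demo, y_demo)
selected = anova.index[anova["p_value"] < 0.05].tolist()  # assumed cut-off
print(anova)
print(selected)
```

In the notebook, the equivalent call would be `anova_table(X_train[num_vars], y_train)`. With a binary target, the one-way ANOVA F-test is equivalent to a two-sample t-test with pooled variance.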

## Correlation

Find and remove variables that are highly correlated with each other. Use 0.7 as the threshold. Evaluate only the numerical variables, not the categorical ones.
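
The sketch below shows one common way to do this: compute the absolute Pearson correlation matrix, look only at its upper triangle, and flag the later column of every pair above the threshold. The helper name `correlated_to_drop` and the demo data are assumptions. The rule is greedy, so it can flag more columns than strictly necessary.

```python
import numpy as np
import pandas as pd


def correlated_to_drop(X_num, threshold=0.7):
    """Columns whose absolute correlation with an earlier column exceeds threshold."""
    corr = X_num.corr().abs()
    # Keep the strict upper triangle so each pair is inspected once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    return [col for col in upper.columns if (upper[col] > threshold).any()]


# Synthetic demonstration data (NOT the student dataset).
rng = np.random.default_rng(0)
a = rng.normal(size=500)
X_demo = pd.DataFrame({
    "a": a,
    "b": a + 0.1 * rng.normal(size=500),  # near-duplicate of "a"
    "c": rng.normal(size=500),            # independent of both
})

to_drop = correlated_to_drop(X_demo, threshold=0.7)
print(to_drop)
```

In the notebook, the result could be applied with `X_train.drop(columns=to_drop)` and the same columns dropped from `X_test`.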

## Logistic regression

Fit and evaluate a logistic regression model to predict dropout. 

Determine the accuracy of the model and its goodness of fit.
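
A minimal sketch of fitting and scoring the model with scikit-learn, on synthetic data. The helper name `fit_and_score` is an assumption. Note that `LogisticRegression` applies L2 regularization by default, which shrinks the coefficients slightly. The majority-class baseline is included because accuracy alone can mislead: in this dataset roughly 32% of the training students drop out, so always predicting "no dropout" already scores about 68%.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def fit_and_score(X_tr, y_tr, X_te, y_te):
    """Fit a logistic regression and report accuracy next to a naive baseline."""
    model = LogisticRegression(max_iter=1000)  # default L2 penalty, C=1.0
    model.fit(X_tr, y_tr)
    scores = {
        "train_accuracy": accuracy_score(y_tr, model.predict(X_tr)),
        "test_accuracy": accuracy_score(y_te, model.predict(X_te)),
        # Accuracy of always predicting the most frequent class.
        "majority_baseline": max(y_te.mean(), 1 - y_te.mean()),
    }
    return model, scores


# Synthetic demonstration data (NOT the student dataset).
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["x0", "x1", "x2"])
log_odds = 2.0 * X_demo["x0"].to_numpy() - 1.5 * X_demo["x1"].to_numpy()
prob = 1.0 / (1.0 + np.exp(-log_odds))
y_demo = pd.Series((rng.random(1000) < prob).astype(int), name="dropout")

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)
model, scores = fit_and_score(X_tr, y_tr, X_te, y_te)
print(scores)
```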

## Goodness of fit

Determine the model's goodness of fit.
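
The notebook does not say which goodness-of-fit measure to use. The sketch below is one option, chosen here as an assumption: McFadden's pseudo-R² together with a likelihood-ratio test against an intercept-only model. The helper names and demo data are hypothetical, and because scikit-learn's default fit is L2-regularized, the likelihood-ratio p-value should be read as approximate.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2
from sklearn.linear_model import LogisticRegression


def bernoulli_log_likelihood(y_true, prob):
    """Log-likelihood of 0/1 outcomes under predicted probabilities."""
    prob = np.clip(prob, 1e-12, 1 - 1e-12)  # avoid log(0)
    return float(np.sum(y_true * np.log(prob) + (1 - y_true) * np.log(1 - prob)))


def goodness_of_fit(model, X, y):
    """McFadden pseudo-R2 and likelihood-ratio test vs. an intercept-only model."""
    y_arr = np.asarray(y, dtype=float)
    ll_model = bernoulli_log_likelihood(y_arr, model.predict_proba(X)[:, 1])
    # The intercept-only model predicts the observed dropout rate for everyone.
    ll_null = bernoulli_log_likelihood(y_arr, np.full(len(y_arr), y_arr.mean()))
    lr_stat = 2.0 * (ll_model - ll_null)
    return {
        "mcfadden_r2": 1.0 - ll_model / ll_null,
        "lr_stat": lr_stat,
        "p_value": float(chi2.sf(lr_stat, df=X.shape[1])),
    }


# Synthetic demonstration data (NOT the student dataset).
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["x0", "x1", "x2"])
log_odds = 2.0 * X_demo["x0"].to_numpy() - 1.5 * X_demo["x1"].to_numpy()
y_demo = pd.Series((rng.random(1000) < 1 / (1 + np.exp(-log_odds))).astype(int))

model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)
fit = goodness_of_fit(model, X_demo, y_demo)
print(fit)
```

A small p-value suggests the predictors improve on the intercept-only model. A pseudo-R² closer to 1 indicates a larger improvement in log-likelihood, but it is not directly comparable to the R² of a linear regression.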

## Assess the significance of the coefficients

Use bootstrapping to determine the error of the coefficients.

Identify variables whose coefficients are not significantly different from 0.
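
A minimal bootstrap sketch on synthetic data: resample the rows with replacement, refit the model each time, and summarize the spread of each coefficient. The helper name `bootstrap_coefficients`, the 100 resamples, and the 95% percentile interval are assumptions. A coefficient is flagged as significant here when its interval excludes zero.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression


def bootstrap_coefficients(X, y, n_boot=100, seed=0):
    """Percentile bootstrap summary of logistic regression coefficients."""
    rng = np.random.default_rng(seed)
    n_rows = len(X)
    coefs = np.empty((n_boot, X.shape[1]))
    for b in range(n_boot):
        # Sample row positions with replacement.
        idx = rng.integers(0, n_rows, size=n_rows)
        model = LogisticRegression(max_iter=1000)
        model.fit(X.iloc[idx], y.iloc[idx])
        coefs[b] = model.coef_[0]
    summary = pd.DataFrame(
        {
            "mean": coefs.mean(axis=0),
            "std_error": coefs.std(axis=0, ddof=1),
            "ci_low": np.percentile(coefs, 2.5, axis=0),
            "ci_high": np.percentile(coefs, 97.5, axis=0),
        },
        index=X.columns,
    )
    # Significant if the 95% interval does not contain zero.
    summary["significant"] = (summary["ci_low"] > 0) | (summary["ci_high"] < 0)
    return summary


# Synthetic demonstration data (NOT the student dataset).
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(500, 2)), columns=["signal", "noise"])
log_odds = 2.0 * X_demo["signal"].to_numpy()
y_demo = pd.Series((rng.random(500) < 1 / (1 + np.exp(-log_odds))).astype(int))

summary = bootstrap_coefficients(X_demo, y_demo, n_boot=100, seed=0)
print(summary)
```

In the notebook, the variables to keep for the re-trained model would be `summary.index[summary["significant"]]`. More resamples give more stable intervals at the cost of run time.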

## Re-train the logistic regression 

Use only the variables whose coefficients were significantly different from zero.

## Sign and magnitude of coefficients

Plot the coefficients of the logistic regression and draw some conclusions.

## Coefficient magnitude

## Odds ratio

Plot the odds ratio and draw some conclusions.
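
The odds ratio of a feature is the exponential of its coefficient. The sketch below applies that to a small set of made-up coefficient values; the helper name `odds_ratios`, the feature names, and the values are purely illustrative and do not come from the fitted model.

```python
import numpy as np
import pandas as pd


def odds_ratios(coefficients):
    """Convert logistic regression coefficients (log-odds) into odds ratios."""
    return np.exp(coefficients).sort_values(ascending=False)


# Hypothetical coefficients, for illustration only.
demo_coefs = pd.Series({"feature_a": 0.7, "feature_b": 0.0, "feature_c": -1.2})

ratios = odds_ratios(demo_coefs)
# feature_a: about 2.01 (odds roughly double per unit increase)
# feature_b: exactly 1.00 (no effect on the odds)
# feature_c: about 0.30 (odds fall by roughly 70% per unit increase)
print(ratios.round(2))
```

With a fitted scikit-learn model, the input would be `pd.Series(model.coef_[0], index=X_train.columns)`, and calling `.plot.barh()` on the result is one way to chart it, assuming matplotlib is installed. Because the features were scaled with `MinMaxScaler`, a one-unit increase corresponds to moving from a variable's minimum to its maximum in the training set. For a one-hot encoded variable, it corresponds to belonging to that category rather than the reference category.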

## Local explanations

Evaluate observations 525 and 3017 from the test set and draw some conclusions.
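
For a logistic regression, the log-odds of one prediction is the intercept plus the sum of coefficient times feature value, so each product can be read as that feature's contribution for that student. The sketch below illustrates this on synthetic data; the helper name `local_contributions` and the demo data are assumptions. The test set has only 885 rows, so 525 and 3017 have to be index labels, selected with `.loc` rather than `.iloc`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression


def local_contributions(model, row):
    """Per-feature contribution (coefficient * value) to the log-odds of one row."""
    coefs = pd.Series(model.coef_[0], index=row.index)
    contributions = coefs * row
    # Largest absolute contributions first.
    return contributions.sort_values(key=np.abs, ascending=False)


# Synthetic demonstration data (NOT the student dataset), with a
# non-default index to mimic the labels kept by train_test_split.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(
    rng.random(size=(500, 3)),
    columns=["x0", "x1", "x2"],
    index=np.arange(1000, 1500),
)
log_odds_true = 4.0 * X_demo["x0"].to_numpy() - 3.0 * X_demo["x1"].to_numpy()
y_demo = pd.Series(
    (rng.random(500) < 1 / (1 + np.exp(-log_odds_true))).astype(int),
    index=X_demo.index,
)
model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

label = 1005  # an index label, not a row position
contrib = local_contributions(model, X_demo.loc[label])
log_odds = contrib.sum() + model.intercept_[0]
probability = 1.0 / (1.0 + np.exp(-log_odds))
print(contrib)
print(probability)
```

In the notebook, the equivalent call would be `local_contributions(model, X_test.loc[525])`. Positive contributions push the prediction towards dropout and negative ones push it away.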