# LSE Machine Learning: Practical Applications
## Module 5 Unit 1 IDE Activity (Practice)
### In this IDE activity, you have the opportunity to practise the execution of logistic regression on a data set in R.
As you work through this activity, read each text cell in the notebook and then run the code in the cell that follows. Pay attention to the R syntax used to produce each result. Completing this activity should give you the practical skills needed for the IDE Activity (Assessment) that follows.

### Step 1: Load the relevant packages

The function used to estimate logistic regression, `glm()`, is part of base R and does not have to be loaded separately. However, you need to load the `tidyverse` package for data manipulation. To interpret the coefficients of the logistic regression, the `margins` package is used.

In [9]:
# install.packages("margins")  # uncomment and run once if the package is not installed

library(tidyverse)
library(margins)  #to obtain the average marginal effects

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

✔ ggplot2 3.3.0     ✔ purrr   0.3.4
✔ tibble  3.1.2     ✔ dplyr   1.0.4
✔ tidyr   1.1.3     ✔ stringr 1.4.0
✔ readr   1.4.0     ✔ forcats 0.5.1

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()



### Step 2: Load the data

To execute logistic regression, load the credit card data set.

In [10]:
credit_data <- read.csv("Credit.csv")

Once the data set is loaded into R, use the `str()` function to analyse the structure of the data frame. Using the `head()` function, also consider the first few rows of the data.

In [11]:
str(credit_data)
head(credit_data)

'data.frame':	10000 obs. of  4 variables:
 $ default: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
 $ student: Factor w/ 2 levels "No","Yes": 1 2 1 1 1 2 1 2 1 1 ...
 $ balance: num  730 817 1074 529 786 ...
 $ income : num  44362 12106 31767 35704 38463 ...


|   | default | student | balance | income |
|---|---|---|---|---|
|   | `<fct>` | `<fct>` | `<dbl>` | `<dbl>` |
| 1 | No | No | 729.5265 | 44361.625 |
| 2 | No | Yes | 817.1804 | 12106.135 |
| 3 | No | No | 1073.5492 | 31767.139 |
| 4 | No | No | 529.2506 | 35704.494 |
| 5 | No | No | 785.6559 | 38463.496 |
| 6 | No | Yes | 919.5885 | 7491.559 |


The first rows indicate that all the columns are relevant. The first two variables, ***default*** and ***student***, are categorical and must be stored as factors. In the output above they were already imported as factors, but from R 4.0.0 onward `read.csv()` imports text columns as character vectors by default, so convert them explicitly using the `as.factor()` function. Once this has been done, confirm that the structure of the data is correct.

In [12]:
credit_data$default <- as.factor(credit_data$default)
credit_data$student <- as.factor(credit_data$student)
str(credit_data)

'data.frame':	10000 obs. of  4 variables:
 $ default: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
 $ student: Factor w/ 2 levels "No","Yes": 1 2 1 1 1 2 1 2 1 1 ...
 $ balance: num  730 817 1074 529 786 ...
 $ income : num  44362 12106 31767 35704 38463 ...
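As a minimal sketch of what `as.factor()` does, the following uses a small made-up data frame (`toy` and its values are hypothetical, not the credit data) whose categorical column starts out as a character vector:

```r
# Hypothetical toy data; 'toy' and its values are illustrative only
toy <- data.frame(default = c("No", "Yes", "No"),
                  balance = c(700, 2100, 950),
                  stringsAsFactors = FALSE)  # keep text as character, the default from R 4.0.0

class(toy$default)                     # "character"
toy$default <- as.factor(toy$default)  # convert to a factor
class(toy$default)                     # "factor"
levels(toy$default)                    # "No" "Yes" (levels are sorted alphabetically)
toy_codes <- as.integer(toy$default)   # underlying integer codes: 1 2 1
```

The level order matters later: with a factor response, the binomial family in `glm()` treats the first level as "failure" and any other level as "success", so fitted probabilities refer to `default` being `Yes`.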


Once the data is ready, the logistic regression model can be fitted to the data set. In this example, ***default*** is the variable to be predicted, and all other variables are predictors.
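Written out, a logistic regression with these predictors takes the following form. Here $p$ denotes the probability that ***default*** is `Yes`, and the symbols $\beta_0, \dots, \beta_3$ are notation introduced for illustration rather than taken from the notebook:

```latex
\log\!\left(\frac{p}{1-p}\right)
  = \beta_0 + \beta_1\,\text{student}_{\text{Yes}} + \beta_2\,\text{balance} + \beta_3\,\text{income},
\qquad
p = \frac{1}{1 + e^{-\left(\beta_0 + \beta_1\,\text{student}_{\text{Yes}} + \beta_2\,\text{balance} + \beta_3\,\text{income}\right)}}
```

The left-hand equation is on the log-odds scale, which is the scale of the estimated coefficients; the right-hand equation converts the same linear predictor back to a probability.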

### Step 3: Interpret the output

Use the `glm()` function to estimate the logistic regression model.

In [18]:
logitReg <- glm(default ~., data = credit_data, family = binomial(link = logit))
summary(logitReg)

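ERROR: Error in glm.control(iter = 1000): unused argument (iter = 1000)

This error comes from passing `iter = 1000` to `glm()`, which forwards it to `glm.control()`; that function has no `iter` argument. The call shown above does not pass `iter` and therefore returns the model summary when run.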
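If a higher iteration limit is wanted, the argument that `glm.control()` accepts is `maxit`. The sketch below fits a logistic regression to simulated data (the data frame `sim` and the chosen coefficients are hypothetical, not the credit data) and shows where the convergence information and the coefficient table can be found:

```r
set.seed(1)  # reproducible simulated data (hypothetical, not the credit data)
n <- 200
sim <- data.frame(x = rnorm(n))
sim$y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.2 * sim$x))

# The iteration limit is 'maxit', supplied through glm.control()
fit <- glm(y ~ x, data = sim, family = binomial(link = "logit"),
           control = glm.control(maxit = 100))

fit$converged       # whether the fitting algorithm converged
fit$iter            # number of Fisher scoring iterations actually used
coef(summary(fit))  # estimates, standard errors, z values and p-values
```

For a well-behaved data set the default limit is usually sufficient, so raising `maxit` is only needed when R warns that the algorithm did not converge.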


The summary reports the significance of each predictor: ***student*** and ***balance*** are statistically significant, whereas ***income*** is not. Because the coefficients are expressed on the log-odds scale, their size is hard to interpret directly. To quantify the relationship between each predictor and the outcome, the `margins()` function may be used. In its output, the average marginal effect (AME) is the average change in the predicted **probability** of default associated with a one-unit increase in the predictor; for a factor such as ***student***, it is the average change when moving from `No` to `Yes`.

In [17]:
summary(margins(logitReg))

|   | factor | AME | SE | z | p | lower | upper |
|---|---|---|---|---|---|---|---|
|   | `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 1 | balance | 0.0001232348 | 4.851866e-06 | 25.3994633 | 2.556535e-142 | 0.0001137253 | 0.0001327443 |
| 2 | income | 6.516621e-08 | 1.765025e-07 | 0.3692085 | 0.7119723 | -2.807722e-07 | 4.111047e-07 |
| 3 | studentYes | -0.01326965 | 0.004659183 | -2.8480647 | 0.004398599 | -0.02240148 | -0.004137823 |
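Because the AMEs for ***balance*** and ***income*** refer to a change of a single dollar, they are easier to read after rescaling. The sketch below copies the AME values from the table above and expresses them in percentage points for step sizes chosen here for illustration (USD 100 of balance, USD 10,000 of income). Scaling an AME linearly is only an approximation, since the logistic curve is not a straight line.

```r
# AME values copied from the margins() output above
ame <- c(balance = 0.0001232348, income = 6.516621e-08, studentYes = -0.01326965)

# Approximate change in default probability, in percentage points
pp_balance_100  <- 100 * ame[["balance"]] * 100    # per additional USD 100 of balance
pp_income_10000 <- 100 * ame[["income"]] * 10000   # per additional USD 10,000 of income
pp_student      <- 100 * ame[["studentYes"]]       # student versus non-student

round(c(balance_100 = pp_balance_100,
        income_10000 = pp_income_10000,
        student = pp_student), 3)
```

Read this way, an extra USD 100 of balance is associated with roughly a 1.2 percentage point higher probability of default on average, and being a student with roughly a 1.3 percentage point lower probability, holding the other predictors fixed. The income effect is close to zero, and its confidence interval in the table includes zero.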


### Step 4: Make the prediction

Now that the model has been fitted and interpreted, it can be used to make predictions for a hypothetical individual. Predict the probability that a student with a balance of USD 2,000 and an income of USD 30,000 defaults on their credit card repayments.

In [15]:
# One-row data frame whose column names match the model's predictors
newData <- data.frame(student = factor("Yes", levels = levels(credit_data$student)),
                      balance = 2000,
                      income  = 30000)
predict(logitReg, newData, type = "response")

The output of this cell shows that the predicted probability of this hypothetical individual defaulting on their credit card repayments is 51.2%.
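To make clear what `type = "response"` returns, the following sketch reproduces a prediction by hand: it computes the linear predictor on the log-odds scale and applies the inverse logit with `plogis()`. It uses simulated data with a single predictor (the data frame `sim` and the coefficients used to generate it are hypothetical), so the numbers do not correspond to the credit model.

```r
set.seed(42)  # hypothetical simulated data, not the credit data
n <- 500
sim <- data.frame(balance = runif(n, 0, 2500))
sim$default <- rbinom(n, size = 1, prob = plogis(-6 + 0.004 * sim$balance))

fit <- glm(default ~ balance, data = sim, family = binomial(link = "logit"))
new_obs <- data.frame(balance = 2000)

# Manual route: intercept + slope * balance gives the log-odds, then inverse logit
eta      <- sum(coef(fit) * c(1, new_obs$balance))
p_manual <- plogis(eta)

# Built-in route: predict() on the probability scale
p_pred <- unname(predict(fit, new_obs, type = "response"))

all.equal(p_manual, p_pred)  # TRUE: both routes give the same probability
```

Omitting `type = "response"` makes `predict()` return the log-odds (`eta` in the sketch) instead of a probability, which is a common source of confusing results.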