# Generalized Linear Models
## Author: Snigdhayan Mahanta

`Generalized linear models` or `GLMs` are useful in many data science use cases. A `GLM` extends the standard `linear regression` model by letting one specify a distribution from the `exponential family` for the response variable. Along with the family one specifies a `link function`, which relates the mean of the response to the linear predictor and whose choice can be a source of indecision for a data scientist. For example, the `logistic regression` model uses the `binomial family` with the `logit link function`. However, the `binomial family` also admits other `link functions` such as `probit` and `cloglog`. Instead of worrying about the right choice up front, one can fit a model with each link and compare the models with the help of `anova`.
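
To make the difference between the three links concrete, here is a minimal sketch that evaluates each `link function` at a few probabilities using `make.link` from base R's `stats` package. The probabilities in `p` are arbitrary illustrative values, not taken from the dataset.

```r
# Illustrative probabilities (arbitrary values, not from the credit data)
p <- c(0.1, 0.5, 0.9)

# The three binomial link functions discussed above
links <- c("logit", "probit", "cloglog")

# make.link() returns a list whose `linkfun` maps a probability
# to the scale of the linear predictor
eta <- sapply(links, function(l) make.link(l)$linkfun(p))
rownames(eta) <- paste0("p = ", p)

print(round(eta, 3))
```

`logit` and `probit` both map a probability of 0.5 to 0 and are symmetric around it, whereas `cloglog` is asymmetric, so the fitted models tend to differ most for probabilities near 0 or 1.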

I trained three `binomial regression` models, one per `link function`, on the same credit scoring dataset and compare them below.

In [1]:
# Read credit scoring dataset (label is "creditability")
df <- read.csv('CreditScoringData.csv') # the file must be available locally
df <- df[-1] # drop the first column so that it is not used as a predictor
head(df, 5)

,status_of_existing_checking_account,duration_in_month,credit_history,purpose,credit_amount,savings_account_and_bonds,present_employment_since,installment_rate_in_percentage_of_disposable_income,other_debtors_or_guarantors,property,age_in_years,other_installment_plans,housing,creditability
,<fct>,<int>,<fct>,<fct>,<int>,<fct>,<fct>,<int>,<fct>,<fct>,<int>,<fct>,<fct>,<int>
1,... < 0 DM,6,critical account/ other credits existing (not at this bank),radio/television,1169,unknown/ no savings account,... >= 7 years,4,none,real estate,67,none,own,0
2,0 <= ... < 200 DM,48,existing credits paid back duly till now,radio/television,5951,... < 100 DM,1 <= ... < 4 years,2,none,real estate,22,none,own,1
3,no checking account,12,critical account/ other credits existing (not at this bank),education,2096,... < 100 DM,4 <= ... < 7 years,2,none,real estate,49,none,own,0
4,... < 0 DM,42,existing credits paid back duly till now,furniture/equipment,7882,... < 100 DM,4 <= ... < 7 years,2,guarantor,building society savings agreement/ life insurance,45,none,for free,0
5,... < 0 DM,24,delay in paying off in the past,car (new),4870,... < 100 DM,1 <= ... < 4 years,3,none,unknown / no property,53,none,for free,1


In [2]:
# First model with link function "logit"
model1 <- glm(creditability ~ ., family = binomial(logit), data = df)
model1 <- update(model1, ~ . - property) # remove "property" as predictor

In [3]:
# Sequential analysis of deviance: chi-squared test for each term as it is added
anova(model1, test = "Chisq")

,Df,Deviance,Resid. Df,Resid. Dev,Pr(>Chi)
,<int>,<dbl>,<int>,<dbl>,<dbl>
NULL,,,999,1221.7286,
status_of_existing_checking_account,3.0,131.335922,996,1090.3927,2.787203e-28
duration_in_month,1.0,38.496739,995,1051.8959,5.484526e-10
credit_history,4.0,29.311001,991,1022.5849,6.758712e-06
purpose,9.0,33.508884,982,989.0761,0.0001088679
credit_amount,1.0,1.504216,981,987.5718,0.2200237
savings_account_and_bonds,4.0,19.067515,977,968.5043,0.0007622985
present_employment_since,4.0,12.496157,973,956.0082,0.014019
installment_rate_in_percentage_of_disposable_income,1.0,11.90661,972,944.1016,0.0005593509
other_debtors_or_guarantors,2.0,8.238029,970,935.8635,0.01626053
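
When reading such a table it helps to remember that `anova` on a single `glm` tests the terms sequentially: each row reports the drop in deviance when that term is added to the terms listed above it, so the rows depend on the order of the predictors. The following sketch illustrates this on simulated data. The data frame `sim` and its columns `y`, `x1` and `x2` are invented for illustration and are not the credit scoring data. It also shows `drop1`, which tests each term given all the others.

```r
set.seed(1)

# Simulated stand-in data with two correlated predictors (hypothetical)
n <- 400
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.6)
y <- rbinom(n, size = 1, prob = plogis(0.3 + 1.0 * x1))
sim <- data.frame(y, x1, x2)

# Same model, two different term orders
fit_a <- glm(y ~ x1 + x2, family = binomial(logit), data = sim)
fit_b <- glm(y ~ x2 + x1, family = binomial(logit), data = sim)

# Sequential tests: the rows change with the order of the terms
seq_a <- anova(fit_a, test = "Chisq")
seq_b <- anova(fit_b, test = "Chisq")
print(seq_a)
print(seq_b)

# drop1() tests each term given all other terms and is order-independent
print(drop1(fit_a, test = "Chisq"))
```

Both fits have the same residual deviance, but the deviance attributed to `x2` differs between the two sequential tables because `x2` is correlated with `x1`.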


In [4]:
# Second model with link function "probit"
model2 <- glm(creditability ~ ., family = binomial(probit), data = df)
model2 <- update(model2, ~ . - property) # remove "property" as predictor

In [5]:
# Sequential analysis of deviance: chi-squared test for each term as it is added
anova(model2, test = "Chisq")

,Df,Deviance,Resid. Df,Resid. Dev,Pr(>Chi)
,<int>,<dbl>,<int>,<dbl>,<dbl>
NULL,,,999,1221.7286,
status_of_existing_checking_account,3.0,131.335922,996,1090.3927,2.787203e-28
duration_in_month,1.0,37.995907,995,1052.3968,7.08932e-10
credit_history,4.0,29.926492,991,1022.4703,5.06601e-06
purpose,9.0,34.42479,982,988.0455,7.523205e-05
credit_amount,1.0,1.7542,981,986.2913,0.1853497
savings_account_and_bonds,4.0,17.878973,977,968.4123,0.00130315
present_employment_since,4.0,12.441058,973,955.9713,0.0143558
installment_rate_in_percentage_of_disposable_income,1.0,11.833466,972,944.1378,0.0005817559
other_debtors_or_guarantors,2.0,7.730122,970,936.4077,0.02096165


In [6]:
# Third model with link function "cloglog"
model3 <- glm(creditability ~ ., family = binomial(cloglog), data = df)
model3 <- update(model3, ~ . - property) # remove "property" as predictor

In [7]:
# Sequential analysis of deviance: chi-squared test for each term as it is added
anova(model3, test = "Chisq")

,Df,Deviance,Resid. Df,Resid. Dev,Pr(>Chi)
,<int>,<dbl>,<int>,<dbl>,<dbl>
NULL,,,999,1221.7286,
status_of_existing_checking_account,3.0,131.3359218,996,1090.3927,2.787203e-28
duration_in_month,1.0,39.781518,995,1050.6112,2.840201e-10
credit_history,4.0,26.5164929,991,1024.0947,2.489332e-05
purpose,9.0,29.7730798,982,994.3216,0.0004794703
credit_amount,1.0,0.2733114,981,994.0483,0.601119
savings_account_and_bonds,4.0,22.8328649,977,971.2154,0.0001367424
present_employment_since,4.0,12.4406551,973,958.7748,0.0143583
installment_rate_in_percentage_of_disposable_income,1.0,12.9505109,972,945.8242,0.0003198342
other_debtors_or_guarantors,2.0,7.9087576,970,937.9155,0.01917057


In [8]:
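# Compare the residual deviances of the three models
# (they have the same residual Df, so no chi-squared test is applied)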
anova(model1, model2, model3)

,Resid. Df,Resid. Dev,Df,Deviance
,<dbl>,<dbl>,<dbl>,<dbl>
1,965,918.7715,,
2,965,918.4379,0.0,0.3335978
3,965,920.9142,0.0,-2.4763173
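
Because the three models have the same residual degrees of freedom, they are not nested, and the deviance differences above cannot be turned into a chi-squared test. A common alternative for such models is to compare them by AIC. The sketch below illustrates the idea on simulated data. The data frame `sim`, its columns `x` and `y`, and the coefficients are invented for illustration and are not the credit scoring data.

```r
set.seed(42)

# Simulated stand-in data (hypothetical, not the credit scoring dataset)
n <- 500
sim <- data.frame(x = rnorm(n))
sim$y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.2 * sim$x))

# Fit one binomial model per link function
links <- c("logit", "probit", "cloglog")
fits <- lapply(links, function(l) {
  glm(y ~ x, family = binomial(link = l), data = sim)
})
names(fits) <- links

# Every model has the same number of parameters, so AIC ranks the
# models exactly as the residual deviance does
comparison <- data.frame(
  link     = links,
  deviance = sapply(fits, deviance),
  aic      = sapply(fits, AIC)
)
print(comparison[order(comparison$aic), ])
```

With the models fitted in this notebook, the corresponding call would be `AIC(model1, model2, model3)`; the model with the smallest value is preferred, although differences as small as those in the table above suggest that the choice of link matters little for this dataset.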
