# Generalised linear models

We use the [Bikeshare dataset](https://github.com/pykale/transparentML/blob/main/data/Bikeshare.csv) to illustrate generalised linear models (GLMs). The response is `bikers`, the number of hourly users of a bike sharing program in Washington, DC. This response value is neither qualitative nor quantitative: instead, it takes on non-negative integer values, or counts. We will consider predicting `bikers` using the covariates `mnth` (month of the year), `hr` (hour of the day, from 0 to 23), `workingday` (an indicator variable that equals 1 if it is neither a weekend nor a holiday), `temp` (the normalized temperature, in Celsius), and `weathersit` (a qualitative variable that takes on one of four possible values: clear; misty or cloudy; light rain or light snow; or heavy rain or heavy snow).


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

from statsmodels.formula.api import ols, poisson

%matplotlib inline

In [None]:
data_url = "https://github.com/pykale/transparentML/raw/main/data/Bikeshare.csv"

data_df = pd.read_csv(data_url, header=0, index_col=0)
data_df["mnth"] = data_df["mnth"].astype("category")
data_df["hr"] = data_df["hr"].astype("category")
data_df.head(3)

## Linear regression on the `Bikeshare` data

To begin, we consider predicting `bikers` using linear regression.

In [None]:
est = ols("bikers ~ mnth + hr + temp + workingday + weathersit", data_df).fit()
est.summary()

We see, for example, that a progression of weather from clear to cloudy is associated with, on average, 12.89 fewer bikers per hour; if the weather progresses further to light rain or snow, this is associated with a further 53.60 fewer bikers per hour.

In [None]:
# "April" is left out on purpose: it is the first category alphabetically,
# so it acts as the reference level and has no coefficient of its own.
months = [
    "Jan",
    "Feb",
    "March",
    "May",
    "June",
    "July",
    "Aug",
    "Sept",
    "Oct",
    "Nov",
    "Dec",
]
coef_mnth = [est.params["mnth[T.%s]" % _m] for _m in months]
coef_hr = [est.params["hr[T.%d]" % _h] for _h in range(1, 24)]

In [None]:
# Create plot
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
ax1.plot(months, coef_mnth, "bo-")
ax1.set_xlabel("Month")
ax2.plot(np.arange(1, 24), coef_hr, "bo-")
ax2.set_xlabel("Hour")

for ax in fig.axes:
    ax.set_ylabel("Coefficient")

plt.show()

The above figures display the coefficients associated with `mnth` and `hr`, respectively. We see that bike usage is highest in the spring and fall, and lowest during the winter months. Furthermore, bike usage is greatest around rush hour (9 AM and 6 PM), and lowest overnight. Thus, at first glance, fitting a linear regression model to the `Bikeshare` dataset seems to provide reasonable and intuitive results.
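One way to look past that first glance is to check whether a linear fit can produce negative fitted values, which cannot be valid counts. The sketch below is a minimal illustration on synthetic counts, not on the Bikeshare data; the seed, sample size, and the coefficients 0.5 and 1.5 are assumptions chosen only for illustration.

```python
import numpy as np

# Synthetic count data (an assumption for illustration, NOT the Bikeshare data):
# the true conditional mean grows exponentially with x.
rng = np.random.default_rng(0)
x = rng.uniform(-2.0, 2.0, size=500)
y = rng.poisson(np.exp(0.5 + 1.5 * x)).astype(float)

# Ordinary least squares fit of y on an intercept and x.
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta

# Share of fitted values below zero, which is impossible for a count.
frac_negative = np.mean(fitted < 0)
print(f"Fraction of negative fitted values: {frac_negative:.2f}")
```

The same check could be run on the linear model fitted above, for example with `(est.fittedvalues < 0).mean()`; that result is not shown here.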

## Poisson regression on the `Bikeshare` data

We now fit a Poisson regression model to the same data, using the same predictors.

In [None]:
est = poisson("bikers ~ mnth + hr + temp + workingday + weathersit", data_df).fit()
est.summary()

In [None]:
coef_mnth = [est.params["mnth[T.%s]" % _m] for _m in months]
coef_hr = [est.params["hr[T.%d]" % _h] for _h in range(1, 24)]

# Create plot
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
ax1.plot(months, coef_mnth, "bo-")
ax1.set_xlabel("Month")
ax2.plot(np.arange(1, 24), coef_hr, "bo-")
ax2.set_xlabel("Hour")

for ax in fig.axes:
    ax.set_ylabel("Coefficient")

plt.show()
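
Because Poisson regression models the log of the mean, its coefficients are read on a multiplicative scale: a one-unit increase in a predictor multiplies the expected count by $e^{\beta}$. A minimal sketch of this arithmetic follows; the values `beta_example` and `beta_0` are hypothetical and are not taken from the fitted Bikeshare model.

```python
import numpy as np

# Hypothetical coefficient, chosen only for illustration (NOT from the fitted model).
beta_example = -0.08
# Hypothetical intercept, used only to show that the ratio does not depend on it.
beta_0 = 2.0


def poisson_mean(x):
    """Mean of a Poisson regression with a single predictor x."""
    return np.exp(beta_0 + beta_example * x)


# Ratio of expected counts when x increases by one unit.
rate_ratio = poisson_mean(3.0) / poisson_mean(2.0)
percent_change = 100 * (rate_ratio - 1)

print(f"rate ratio: {rate_ratio:.4f}")
print(f"percent change in expected count: {percent_change:.2f}%")
```

For the Poisson model fitted above, `np.exp(est.params)` would give the corresponding ratios for every term.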

## Generalised linear models

Common characteristics of generalised linear models:

1. Each approach uses predictors $ x_1, x_2, \dots, x_D $ to predict a response $ y $. We assume that, conditional on $ x_1, x_2,\dots, x_D $, $ y $ belongs to a certain family of distributions. For linear regression, we typically assume that $ y $ follows a Gaussian or normal distribution. For logistic regression, we assume that $y$ follows a Bernoulli distribution. Finally, for Poisson regression, we assume that $y$ follows a Poisson distribution.
2. Each approach models the mean of $y$ as a function of the predictors. In linear regression, the mean of $y$ takes the form
    $$
    \mathbb{E}(y|x_1, x_2, \dots, x_D) = \beta_0 + \beta_1 x_1 + \cdots + \beta_D x_D.
    $$
    In logistic regression, the mean of $y$ takes the form
    $$
    \mathbb{E}(y|x_1, x_2, \dots, x_D) = \mathbb{P}(y=1|x_1, x_2, \dots, x_D) = \frac{e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_D x_D}}{1 + e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_D x_D}}.
    $$
    In Poisson regression, the mean of $y$ takes the form
    $$
    \mathbb{E}(y|x_1, x_2, \dots, x_D) = \mu = e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_D x_D}.
    $$
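
The three mean functions above differ only in how they transform the same linear predictor. A minimal NumPy sketch makes this concrete; the coefficient vector `beta` and the input `x` are hypothetical values chosen for illustration.

```python
import numpy as np


def linear_mean(eta):
    """Linear regression: the mean is the linear predictor itself."""
    return eta


def logistic_mean(eta):
    """Logistic regression: e^eta / (1 + e^eta), written as 1 / (1 + e^-eta)."""
    return 1.0 / (1.0 + np.exp(-eta))


def poisson_mean(eta):
    """Poisson regression: the mean is e^eta, so it is always positive."""
    return np.exp(eta)


# Hypothetical coefficients (beta_0, beta_1, beta_2) and one observation,
# with a leading 1 so that the intercept is included in the dot product.
beta = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 0.3, 0.2])
eta = x @ beta  # beta_0 + beta_1 * x_1 + beta_2 * x_2

print("linear predictor:", eta)
print("linear mean:     ", linear_mean(eta))
print("logistic mean:   ", logistic_mean(eta))
print("Poisson mean:    ", poisson_mean(eta))
```

The logistic mean always lies strictly between 0 and 1, and the Poisson mean is always positive, which matches the ranges of a probability and a count rate respectively.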

The Gaussian, Bernoulli and Poisson distributions are all members of a wider class of distributions, known as the exponential family. Other well known members of this family are the exponential distribution, the Gamma distribution, and the negative binomial distribution. In general, we can perform a regression by modelling the response $y$ as coming from a particular member of the exponential family, and then transforming the mean of the response so that the transformed mean is a linear function of the predictors. This transformation is called the link function, $\eta$, and satisfies $\eta(\mathbb{E}(y|x_1, x_2, \dots, x_D)) = \beta_0 + \beta_1 x_1 + \cdots + \beta_D x_D$. The link functions for linear, logistic, and Poisson regression are $\eta(\mu) = \mu$, $\eta(\mu) = \log(\mu/(1-\mu))$, and $\eta(\mu) = \log(\mu)$, respectively. Any regression approach that follows this very general recipe is known as a generalised linear model (GLM). Thus, linear regression, logistic regression, and Poisson regression are three examples of GLMs. Other examples not covered here include _Gamma regression_ and _negative binomial regression_.

## Exercises


