ESS245 Statsmodels
==================

**Author:** Ulrich G Wortmann



## Recap



-   color schemes: Avoid red and green!

    import numpy as np
    import matplotlib.pyplot as plt
    
    x = np.arange(0, np.pi * 2, 0.1)
    y1 = np.sin(x)  # calculate the sin(x)
    y2 = np.cos(x) ** 2 + 2  # calculate the cos(x)^2 + 2
    
    plt.style.use("uli")
    fig, ax = plt.subplots()
    ax.plot(x, y1, color="C0")
    ax.plot(x, y2, color="C1")
    plt.show()



### Twinx()



    import numpy as np
    import matplotlib.pyplot as plt
    
    x = np.arange(0, np.pi * 2, 0.1)
    y1 = np.sin(x)  # calculate the sin(x)
    y2 = np.cos(x) ** 2 + 2  # calculate the cos(x)^2 + 2
    
    plt.style.use("ulitwinx")
    fig, ax = plt.subplots()
    axt = ax.twinx()
    ax.plot(x, y1, color="C0")
    axt.plot(x, y2, color="C1")
    plt.show()



### Controlling Font size etc.



    import numpy as np
    import matplotlib.pyplot as plt
    
    x = np.arange(0, np.pi * 2, 0.1)
    y1 = np.sin(x)  # calculate the sin(x)
    y2 = np.cos(x) ** 2 + 2  # calculate the cos(x)^2 + 2
    
    plt.style.use("ulitwinx")
    fig, ax = plt.subplots()
    axt = ax.twinx()
    ax.plot(x, y1, color="C0")
    axt.plot(x, y2, color="C1")
    
    fig.set_size_inches(6, 4)
    # fig.set_size_inches(3, 2)
    # fig.set_size_inches(5, 4)
    
    fig.savefig("font_size_test.pdf")
    plt.show()



## Today



### Linear Regression



![img](./Ringed_white_stork.png)



#### Let's take a look at the data



    import matplotlib.pyplot as plt
    import pathlib as pl
    import pandas as pd
    
    fn: str = "storks_vs_birth_rate.csv"  # file name
    cwd: pl.Path = pl.Path.cwd()  # get the current working directory
    fqfn: pl.Path = cwd / fn  # fully qualified file name
    
    if not fqfn.exists():  # check if the file exists
        raise FileNotFoundError(f"Cannot find file {fqfn}")
    
    df: pd.DataFrame = pd.read_csv(fqfn)  # read csv data
    x = df.iloc[:, 1]  # Storks
    y = df.iloc[:, 0]  # Babies
    xlabel = df.columns[1]
    ylabel = df.columns[0]
    
    plt.style.use("uli")
    fig, ax = plt.subplots()
    ax.scatter(x, y, color="C0")
    ax.set_xlabel(xlabel)
    ax.set_ylabel(ylabel)
    fig.tight_layout()
    fig.savefig("stork1.png")
    plt.show()

-   we see some sort of correlation
-   why not run a linear regression on this?



#### Correlation versus Causation



-   **Correlation:** Correlation is a statistical measure (expressed as a number) that
    describes the size and direction of a relationship between two or more
    variables. E.g.:
    -   Smoking and alcoholism are often correlated, but smoking does not cause alcoholism.
    -   They correlate because they are not independent of each other; rather, both depend on a common third variable (addiction).

-   **Causation:** Causation indicates that one event is the result of the
    occurrence of the other event; i.e. there is a causal relationship between the
    two events. E.g., if you have a contract with an hourly wage, your wage will depend on the hours you work.
    -   This implies that we have an **independent** variable (hours worked), and a **dependent** variable (wage)

After: [https://www.abs.gov.au/statistics/understanding-statistics/statistical-terms-and-concepts/correlation-and-causation](https://www.abs.gov.au/statistics/understanding-statistics/statistical-terms-and-concepts/correlation-and-causation)
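
The common-third-variable case can be illustrated with a small simulation. This is only a sketch on made-up data: the variables `common`, `a`, and `b` are hypothetical stand-ins (not the smoking or stork data), and the noise levels are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # fixed seed so the sketch is reproducible
n = 5_000

# Hypothetical common cause; neither a nor b influences the other
common = rng.normal(size=n)
a = common + rng.normal(scale=0.5, size=n)
b = common + rng.normal(scale=0.5, size=n)

# Pearson correlation between a and b
r_ab = np.corrcoef(a, b)[0, 1]

# Remove the part of a and b that is explained by the common cause
# (residuals of a straight-line fit against `common`), then correlate again
a_res = a - np.polyval(np.polyfit(common, a, 1), common)
b_res = b - np.polyval(np.polyfit(common, b, 1), common)
r_res = np.corrcoef(a_res, b_res)[0, 1]

print(f"r(a, b) = {r_ab:.2f}")
print(f"r(a, b) after removing the common cause = {r_res:.2f}")
```

Although `a` and `b` never interact, they correlate strongly; once the shared dependence on `common` is removed, the correlation should be close to zero.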



#### Revisiting the storks story



![img](./stork1.png)

-   We know that storks and babies **correlate**
-   We do not know why
-   Hypothesis: Storks deliver babies. We thus predict that the number of babies scales linearly with the number of storks (aka **linear model**)
-   Mathematically, we propose that

$$ y_r = y_0 + cx$$ 

    from scipy.stats import linregress
    
    res = linregress(x, y)
    yr = res.intercept + res.slope * x
    
    fig, ax = plt.subplots()
    ax.scatter(x, y, color="C0")
    ax.plot(x, yr, color="C1")
    fig.savefig("simple_reg.png")
    plt.show()



#### Retrieving statistical parameters



    eq = f"y = {res.intercept:.2f} + {res.slope:.4f} * x"
    rsq = f"r^2 = {res.rvalue**2:.2f}"
    p = f"p = {res.pvalue:.3f}"
    print(f"{eq}\n{rsq}\n{p}")

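-   taken at face value, the slope says that each stork pair delivers about 28 babies
-   the model explains about 38% of the variance in the number of newborn babies/year ($r^2 = 0.38$)
-   $p = 0.008$: if storks and babies were uncorrelated (the null hypothesis), a correlation at least this strong would arise by chance in fewer than 1% of cases, so we reject the null hypothesis
-   but does this prove our hypothesis?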
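
To make the meaning of $r^2$ more concrete, the sketch below recomputes it by hand as $1 - SS_{res}/SS_{tot}$ and compares it with `rvalue**2` from `linregress`. The data are synthetic (the names `x_demo`, `y_demo` and the chosen slope, intercept, and noise level are assumptions for illustration only), so the numbers do not correspond to the stork data.

```python
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(seed=1)
x_demo = np.linspace(0, 10, 50)  # made-up independent variable
y_demo = 2.0 + 0.5 * x_demo + rng.normal(scale=1.0, size=x_demo.size)

res_demo = linregress(x_demo, y_demo)
y_fit = res_demo.intercept + res_demo.slope * x_demo

ss_res = np.sum((y_demo - y_fit) ** 2)  # scatter left over after the fit
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)  # total scatter around the mean
r_squared = 1 - ss_res / ss_tot  # fraction of the variance explained by the line

print(f"1 - SS_res/SS_tot = {r_squared:.4f}")
print(f"rvalue**2         = {res_demo.rvalue**2:.4f}")
```

Both numbers should agree, which is why $r^2$ is read as the fraction of the variance in `y` that the regression line accounts for.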



### Going beyond Excel-style regressions



We want to:

-   use R-like syntax to define regression models (`Babies ~ Storks`)
-   obtain a visual representation of the confidence we can have in the regression model
-   obtain a visual representation of the confidence we can have in the predictions we make based on the model

![img](./stork_new.png)



#### Defining the model with the statsmodels.formula.api



Re-using the previous data, where `x = Storks` and `y = Babies`

    import statsmodels.formula.api as smf
    
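    # x = independent variable (i.e., storks)
    # y = dependent variable (i.e., babies)
    # If df has no columns named x and y, the formula parser
    # falls back to the variables x and y defined earlier
    model = smf.ols(formula="y ~ x", data=df)
    
    results = model.fit()  # run the model
    
    display(results.summary())  # display() requires IPython/Jupyter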
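
The formula can also refer to DataFrame columns by name, which gives the R-like `Babies ~ Storks` form. The sketch below uses a synthetic DataFrame; the column names `Storks` and `Babies` and all numbers are assumptions for illustration, since the actual headers of `storks_vs_birth_rate.csv` are not shown here.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(seed=0)
storks = rng.uniform(0, 30_000, size=20)  # made-up stork counts
demo = pd.DataFrame(
    {
        "Storks": storks,
        "Babies": 200 + 0.03 * storks + rng.normal(scale=100, size=20),
    }
)

# Dependent variable on the left of "~", independent variable on the right
demo_results = smf.ols(formula="Babies ~ Storks", data=demo).fit()

# The parameter keys follow the names used in the formula
print(demo_results.params)
```

With this form, the slope is retrieved as `demo_results.params["Storks"]` rather than `params["x"]`. Column names that are not valid Python identifiers (for example, names containing spaces) would need to be wrapped as `Q("column name")` inside the formula.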



#### Extracting statistical parameters



    """ Retrieve parameters from the model results object. Note
    that the dictionary key 'x' must be equal to the
    name of the independent variable used in the model
    definition 
    """
    
    slope: float = results.params["x"]  # the slope
    y_0: float = results.params["Intercept"]  # the y-intercept
    r_square: float = results.rsquared  # rsquare
    p_value: float = results.pvalues["x"]  # the p-value
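
Beyond scalar parameters, the results object can also provide the confidence and prediction intervals listed among the goals above. The following is a minimal sketch on synthetic data (the DataFrame `demo`, the grid `new_x`, and all numbers are assumptions); it relies on `get_prediction()` and `summary_frame()`, where `alpha=0.05` requests 95% intervals.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(seed=3)
demo = pd.DataFrame({"x": np.linspace(0, 10, 30)})  # made-up data
demo["y"] = 1.0 + 2.0 * demo["x"] + rng.normal(scale=1.5, size=len(demo))

demo_results = smf.ols(formula="y ~ x", data=demo).fit()

# Evaluate the model on a grid of x values
new_x = pd.DataFrame({"x": np.linspace(0, 10, 5)})
bands = demo_results.get_prediction(new_x).summary_frame(alpha=0.05)

# mean_ci_*: confidence interval of the regression line itself
# obs_ci_*:  prediction interval for new observations
cols = ["mean", "mean_ci_lower", "mean_ci_upper", "obs_ci_lower", "obs_ci_upper"]
print(bands[cols])
```

The prediction interval is always wider than the confidence interval, because it includes the scatter of individual observations in addition to the uncertainty of the fitted line. For a plot, the lower and upper columns could be passed to `ax.fill_between()`.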



### Learning Outcomes



-   practice testing for normality
-   practice ordinary least squares model definitions
-   learn how to transform data when necessary
-   continue practicing plot-generation
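
As a preview of the first and third outcomes, the sketch below tests a skewed sample for normality before and after a log transform. It assumes the Shapiro-Wilk test (`scipy.stats.shapiro`) as the normality test and a log transform as the remedy; both are illustrative choices, and the data are synthetic.

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(seed=7)
raw = rng.lognormal(mean=0.0, sigma=1.0, size=200)  # strongly right-skewed sample
logged = np.log(raw)  # the log of log-normal data is normally distributed

# Null hypothesis of the Shapiro-Wilk test: the sample is normally distributed
p_raw = shapiro(raw).pvalue
p_log = shapiro(logged).pvalue

print(f"raw data:        p = {p_raw:.3g}")
print(f"log-transformed: p = {p_log:.3g}")
```

A small p-value means that normality is rejected, which is expected for the raw sample; the transformed sample should look far more compatible with a normal distribution.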

