# Python Libraries

For this tutorial, we are going to explore the Python libraries whose functionality corresponds to the material discussed in the course.

The primary package we will be using is:

* **Statsmodels:** a library that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests, exploring data, and constructing models.  

*__ATTN__: If you are not familiar with the following packages:*  

* **Numpy** is a library for working with arrays of data.  

* **Pandas** is a library for data management, manipulation, and analysis.  

* **Matplotlib** is a library for making visualizations.  

* **Seaborn** is a higher-level interface to Matplotlib that can be used to simplify many visualization tasks.  

We recommend you check out the first and second courses of the Statistics with Python specialization, **Understanding and Visualizing Data** and **Inferential Statistical Analysis with Python**.

*__Important__: While this notebook provides insight into the basics of these libraries, it is recommended that you dig into the documentation available online.*

## StatsModels

The StatsModels library is extensive and includes functionality ranging from basic statistical methods to advanced topics such as regression, time-series analysis, and multivariate statistics.

We will mainly be looking at the stats, OLS, and GLM sub-libraries. However, we will begin by reviewing some functionality that was referenced in earlier courses of the Statistics with Python specialization.

In [2]:
import statsmodels.api as sm
import numpy as np

### Stats

#### Descriptive Statistics

In [3]:
# Draw random variables from a normal distribution with numpy
normalRandomVariables = np.random.normal(0,1, 1000)

# Create object that has descriptive statistics as variables
x = sm.stats.DescrStatsW(normalRandomVariables)

print(x)

<statsmodels.stats.weightstats.DescrStatsW object at 0x0000013F1697BEB0>


As you can see from the above output, we have created an object with type: "statsmodels.stats.weightstats.DescrStatsW".  

This object stores various descriptive statistics, such as the mean, standard deviation, and variance, that we can access as attributes.

In [4]:
# Mean
print(x.mean)

# Standard deviation
print(x.std)

# Variance
print(x.var)

-0.020815098602874685
0.9899975259515813
0.9800951013902519


The output above shows the mean, standard deviation, and variance of the 1000 values we drew from the standard normal distribution above.

There are other interesting things you can do with this object, such as generating confidence intervals and hypothesis testing.

#### Confidence Intervals

`proportion_confint` is used to calculate the confidence interval for a proportion or a binary outcome
```python
# Syntax:
lower, upper = proportion_confint(count, nobs, alpha=0.05, method='normal')
```

`zconfint_mean()` is used to calculate the confidence interval for the mean of a continuous variable
```python
# Syntax:
lower, upper = zconfint_mean(x, std_mean, alpha=0.05)
```

In [5]:
# Generate confidence interval for a population proportion

# Observed sample proportion
p = .85

# Sample size
n = 659

# Construct confidence interval
sm.stats.proportion_confint(n * p, n)

(0.8227378265796143, 0.8772621734203857)

The output above shows the lower and upper bounds of a 95% confidence interval for the population proportion.
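To see where these bounds come from, the interval can be reproduced by hand. The sketch below assumes the default `method='normal'`, which is the usual normal-approximation interval (estimate plus or minus a critical value times the standard error); it uses only the Python standard library, and the variable names are our own.

```python
from math import sqrt
from statistics import NormalDist

p = 0.85   # observed sample proportion (from the cell above)
n = 659    # sample size (from the cell above)
alpha = 0.05

# Critical value of the standard normal distribution (about 1.96 for alpha = 0.05)
z_star = NormalDist().inv_cdf(1 - alpha / 2)

# Estimated standard error of the sample proportion
se = sqrt(p * (1 - p) / n)

lower, upper = p - z_star * se, p + z_star * se
print(round(lower, 4), round(upper, 4))
```

The rounded bounds should match the `proportion_confint` output above.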

In [6]:
import pandas as pd

# Import data that will be used to construct confidence interval of population mean
df = pd.read_csv("https://raw.githubusercontent.com/UMstatspy/UMStatsPy/master/Course_1/Cartwheeldata.csv")

# Generate confidence interval for a population mean
sm.stats.DescrStatsW(df["CWDistance"]).zconfint_mean()

(76.57715593233026, 88.38284406766975)

The output above shows the lower and upper bounds of a 95% confidence interval for the population mean.

These functions should be familiar; if not, we recommend you take course 2 of our specialization.

#### Hypothesis Testing

`proportions_ztest` is used to test hypotheses and calculate p-values for a proportion or for the difference in proportions between two groups
```python
# Syntax:
stat, pvalue = proportions_ztest(count, nobs, value=None, alternative='two-sided', prop_var=False)
```

`ztest` is used to test hypotheses and calculate p-values for means or mean differences between two groups

```python
# Syntax:
stat, pvalue = ztest(x1, x2=None, value=0, alternative='two-sided', usevar='pooled', ddof=1)
```

In [7]:
# One population proportion hypothesis testing

# Sample size
n = 1018

# Null hypothesis population proportion
pnull = .52

# Observed sample proportion
phat = .56

# Calculate test statistic and p-value
sm.stats.proportions_ztest(phat * n, n, pnull)

(2.571067795759113, 0.010138547731721065)

In [8]:
# Using the dataframe imported above, perform a hypothesis test for population mean
sm.stats.ztest(df["CWDistance"], value = 80, alternative = "larger")

(0.8234523266982029, 0.20512540845395266)

The outputs above are the test statistics and p-values from the respective hypothesis tests.
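The one-proportion result can be reproduced by hand, which makes explicit what the function computes. This sketch assumes the default `prop_var=False`, under which the standard error is based on the sample proportion rather than the null value; it uses only the standard library.

```python
from math import sqrt
from statistics import NormalDist

n = 1018      # sample size
pnull = 0.52  # proportion under the null hypothesis
phat = 0.56   # observed sample proportion

# Standard error based on the sample proportion (the prop_var=False default)
se = sqrt(phat * (1 - phat) / n)

z = (phat - pnull) / se
# Two-sided p-value from the standard normal distribution
pvalue = 2 * (1 - NormalDist().cdf(abs(z)))
print(round(z, 3), round(pvalue, 4))
```

The rounded values should agree with the `proportions_ztest` output shown earlier (a test statistic of about 2.571 and a p-value of about 0.0101).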

If you'd like to review these functions on your own, the stats sub-library documentation can be found at the following url: https://www.statsmodels.org/stable/stats.html

This concludes the review portion of this notebook. Next, we introduce the OLS and GLM sub-libraries and the functions you will be seeing throughout this course.

### OLS (Ordinary Least Squares), GLM (Generalized Linear Models), GEE (Generalized Estimating Equations), MIXEDLM (Multilevel Models)

| Method  | Definition                       | Use case                                                          | Strengths                                          | Weaknesses                                    | Syntax (`import statsmodels.api as sm`)                              |
|---------|----------------------------------|-------------------------------------------------------------------|----------------------------------------------------|-----------------------------------------------|----------------------------------------------------------------------|
| OLS     | Ordinary Least Squares           | Fits a linear model to a continuous outcome                        | Simple, easy to understand and implement           | Less flexible than GLM, sensitive to outliers | `sm.OLS.from_formula(formula, data=df).fit()`                        |
| GLM     | Generalized Linear Model         | Fits a model to outcomes from a wider range of distributions (e.g. binary or count data) | Flexible, supports many outcome distributions | More complex than OLS                         | `sm.GLM.from_formula(formula, data=df).fit()`                        |
| GEE     | Generalized Estimating Equations | Fits a marginal model to clustered data                            | Accounts for within-cluster correlation            | More complex than OLS and GLM                 | `sm.GEE.from_formula(formula, groups=group_col, data=df).fit()`      |
| MIXEDLM | Mixed-effects Linear Model       | Fits a linear model to data with fixed and random effects          | Accounts for both fixed and random effects         | More complex than OLS, GLM, and GEE           | `sm.MixedLM.from_formula(formula, groups=group_col, data=df).fit()`  |

The OLS, GLM, GEE, and MIXEDLM sub-libraries are the primary libraries in statsmodels that we will be utilizing in this course to create various models.

Below, we will give a brief description of each model and a skeleton of the functions you will see going forward in the course. This is simply for you to get familiar with these concepts and to prepare you for the coming weeks. If their application seems a bit ambiguous at this time, do not worry: they will be discussed in detail throughout this course.

Each of the following models follows a similar structure: we specify a dependent variable and one or more independent variables, with a few caveats that are noted below.

#### Ordinary Least Squares

Ordinary Least Squares is a method for estimating the unknown parameters in a linear regression model.  This is the function we will use when our target variable is continuous. 

(Bonus) OLS is a linear regression method used to estimate the relationship between a dependent variable and one or more independent variables. It assumes that the relationship is linear and that the errors are normally distributed and have constant variance. OLS minimizes the sum of squared differences between the observed and predicted values to obtain the best-fitting regression line. OLS is commonly used in various fields for predictive modeling and causal analysis.

In [9]:
da = pd.read_csv("nhanes_2015_2016.csv")

# Drop unused columns, drop rows with any missing values.
vars = ["BPXSY1", "RIDAGEYR", "RIAGENDR", "RIDRETH1", "DMDEDUC2", "BMXBMI",
        "SMQ020", "SDMVSTRA", "SDMVPSU"]
da = da[vars].dropna()

da["RIAGENDRx"] = da.RIAGENDR.replace({1: "Male", 2: "Female"})

model = sm.OLS.from_formula("BPXSY1 ~ RIDAGEYR + RIAGENDRx", data=da)
res = model.fit()
res.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | BPXSY1 | R-squared: | 0.215 |
| Model: | OLS | Adj. R-squared: | 0.214 |
| Method: | Least Squares | F-statistic: | 697.4 |
| Date: | Sun, 23 Jul 2023 | Prob (F-statistic): | 1.87e-268 |
| Time: | 23:09:07 | Log-Likelihood: | -21505.0 |
| No. Observations: | 5102 | AIC: | 43020.0 |
| Df Residuals: | 5099 | BIC: | 43040.0 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 100.6305 | 0.712 | 141.257 | 0.000 | 99.234 | 102.027 |
| RIAGENDRx[T.Male] | 3.2322 | 0.459 | 7.040 | 0.000 | 2.332 | 4.132 |
| RIDAGEYR | 0.4739 | 0.013 | 36.518 | 0.000 | 0.448 | 0.499 |

| | | | |
|---|---|---|---|
| Omnibus: | 706.732 | Durbin-Watson: | 2.036 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 1582.73 |
| Skew: | 0.818 | Prob(JB): | 0.0 |
| Kurtosis: | 5.184 | Cond. No. | 168.0 |


The code above fits a multiple linear regression in which the target variable is BPXSY1 and the two predictor variables are RIDAGEYR and RIAGENDRx.

Note that the target variable, BPXSY1, is a continuous variable that represents systolic blood pressure.

Bonus: `from sklearn.linear_model import LinearRegression`

In [10]:
import pandas as pd
from sklearn.linear_model import LinearRegression

# Read the data and drop unused columns and rows with missing values
da = pd.read_csv("nhanes_2015_2016.csv")
vars = ["BPXSY1", "RIDAGEYR", "RIAGENDR", "RIDRETH1", "DMDEDUC2", "BMXBMI", "SMQ020", "SDMVSTRA", "SDMVPSU"]
da = da[vars].dropna()

# Prepare the data for modeling
X = da[["RIDAGEYR", "RIAGENDR"]]  # Independent variables
y = da["BPXSY1"]                  # Dependent variable

# Fit the linear regression model
model = LinearRegression()
model.fit(X, y)

# Print the coefficients and intercept
print("Intercept:", model.intercept_)
print("Coefficients:", model.coef_)

Intercept: 107.09494236237379
Coefficients: [ 0.47389682 -3.23224502]
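The scikit-learn intercept and gender coefficient look different from the statsmodels ones because `RIAGENDR` enters this model as the numeric code 1 (male) or 2 (female), whereas the formula version used a Female baseline with a `[T.Male]` indicator. A small arithmetic check, using only the numbers printed in the two outputs, suggests the two fits describe the same model:

```python
# Coefficients copied from the two outputs above
sm_intercept, sm_male = 100.6305, 3.2322                     # statsmodels: Female is the baseline
sk_intercept, sk_gender = 107.09494236237379, -3.23224502    # scikit-learn: RIAGENDR coded 1/2

# Predicted blood pressure at age 0 for each gender under both parameterizations
female_sm, male_sm = sm_intercept, sm_intercept + sm_male
female_sk = sk_intercept + sk_gender * 2  # RIAGENDR == 2 is Female
male_sk = sk_intercept + sk_gender * 1    # RIAGENDR == 1 is Male

print(round(female_sm, 2), round(female_sk, 2))
print(round(male_sm, 2), round(male_sk, 2))
```

Each pair agrees to the precision shown in the summary table, and the age coefficient (about 0.474) is the same in both outputs.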


#### Generalized Linear Models

While generalized linear models are a broad topic, **in this course we will be using this suite of functions to carry out logistic regression.**  Logistic regression is used when our target variable is a binary outcome, or a classification of two groups, which can be denoted as group 0 and group 1.

(Bonus)
GLM is an extension of linear regression that allows for modeling relationships between dependent and independent variables when the response variable follows a distribution other than the normal distribution. GLM includes different types of regression models, such as 
- logistic regression for binary outcomes, 
- Poisson regression for count data, and 
- gamma regression for skewed continuous data. 

GLM incorporates a link function that relates the linear predictor to the response variable.

In [10]:
da["smq"] = da.SMQ020.replace({2: 0, 7: np.nan, 9: np.nan})
model = sm.GLM.from_formula("smq ~ RIAGENDRx", family=sm.families.Binomial(), data=da)
res = model.fit()
res.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | smq | No. Observations: | 5094 |
| Model: | GLM | Df Residuals: | 5092 |
| Model Family: | Binomial | Df Model: | 1 |
| Link Function: | Logit | Scale: | 1.0 |
| Method: | IRLS | Log-Likelihood: | -3350.6 |
| Date: | Sun, 23 Jul 2023 | Deviance: | 6701.2 |
| Time: | 23:10:53 | Pearson chi2: | 5090.0 |
| No. Iterations: | 4 | Pseudo R-squ. (CS): | 0.04557 |
| Covariance Type: | nonrobust | | |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -0.7547 | 0.042 | -18.071 | 0.000 | -0.837 | -0.673 |
| RIAGENDRx[T.Male] | 0.8851 | 0.058 | 15.227 | 0.000 | 0.771 | 0.999 |


Above is an example of fitting a logistic regression model where the target variable is smq, which indicates whether or not the person is a smoker. The predictor is RIAGENDRx, which is gender.
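The coefficients in a logistic regression are on the log-odds scale, which can be hard to read directly. As a rough sketch, the numbers from the summary above can be converted into an odds ratio and into predicted probabilities with the inverse logit function (the helper name `inverse_logit` is our own):

```python
from math import exp

# Coefficients copied from the summary above (log-odds scale)
intercept = -0.7547
coef_male = 0.8851

def inverse_logit(log_odds):
    """Convert log-odds to a probability."""
    return 1 / (1 + exp(-log_odds))

odds_ratio = exp(coef_male)                    # odds for males relative to females
p_female = inverse_logit(intercept)            # baseline group (Female)
p_male = inverse_logit(intercept + coef_male)
print(round(odds_ratio, 2), round(p_female, 2), round(p_male, 2))
```

In words, the fitted model puts the probability of `smq == 1` at roughly 0.32 for females and 0.53 for males, with the odds for males about 2.4 times those for females.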

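Bonus: `from sklearn.linear_model import LogisticRegression`. Note that the cell below passes the raw `SMQ020` codes (which can include 7 and 9 as well as 1 and 2) and the numeric `RIAGENDR` column, so its intercept and coefficient are not directly comparable to the GLM output above.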

In [24]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Prepare the data
X = da[['RIAGENDR']]
y = da['SMQ020']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Print the intercept and coefficients
intercept = model.intercept_[0]
coefficients = model.coef_[0]

print(f"Intercept: {intercept}")
print(f"Coefficient(s): {coefficients}")

# Print the accuracy score
accuracy = model.score(X_test, y_test)
print(f"Accuracy: {accuracy:.4f}")

Intercept: 3.6567624630909252
Coefficient(s): [-0.33794606]
Accuracy: 0.6347


#### Generalized Estimating Equations

Generalized Estimating Equations (GEE) estimate generalized linear models for panel, cluster, or repeated measures data when the observations are possibly correlated within a cluster but uncorrelated across clusters. They are used primarily when there is uncertainty regarding the correlation between outcomes. GEE fits marginal models and estimates the intraclass correlation.

(Bonus): GEE is a method used for analyzing correlated or clustered data. It is an extension of GLM that accounts for within-cluster correlation by estimating population-averaged effects. GEE is commonly used in longitudinal and clustered data analysis where observations within the same cluster are expected to be more correlated than observations from different clusters. GEE provides robust parameter estimates and standard errors, allowing for valid inference even when the correlation structure is misspecified.

In [11]:
da["group"] = 10*da.SDMVSTRA + da.SDMVPSU
model = sm.GEE.from_formula("BPXSY1 ~ 1", groups="group", cov_struct=sm.cov_struct.Exchangeable(), data=da)
res = model.fit()
res.cov_struct.summary()

'The correlation between two observations in the same cluster is 0.030'

Here we are creating a marginal linear model of BPXSY1 to determine the estimated ICC value, which would indicate whether or not there are correlated clusters of BPXSY1.

#### Multilevel Models

Similarly to GEEs, we use multilevel models when there is potential for outcomes to be grouped together, which is not uncommon when using various sampling methods to collect data.

(Bonus) MIXEDLM (Mixed Effects Linear Model): MIXEDLM, also known as mixed-effects regression or hierarchical linear regression, is a method used to analyze data with both fixed effects (population-level effects) and random effects (subject-specific effects). It is used when there is nested or clustered data with repeated measures or when there are hierarchical structures in the data. MIXEDLM estimates the fixed and random effects simultaneously and can handle unbalanced and missing data. It is commonly used in social sciences, biology, and other fields where data have complex structures.

In [27]:
for v in ["BPXSY1", "RIDAGEYR", "BMXBMI", "smq", "SDMVSTRA"]:
    model = sm.GEE.from_formula(v + " ~ 1", groups="group",
           cov_struct=sm.cov_struct.Exchangeable(), data=da)
    result = model.fit()
    print(v, result.cov_struct.summary())

BPXSY1 The correlation between two observations in the same cluster is 0.030
RIDAGEYR The correlation between two observations in the same cluster is 0.035
BMXBMI The correlation between two observations in the same cluster is 0.039
smq The correlation between two observations in the same cluster is 0.026
SDMVSTRA The correlation between two observations in the same cluster is 0.959


What's nice about the statsmodels library is that all the models follow a similar structure and syntax.


Documentation and examples of these models can be found at the following links:

* OLS: https://www.statsmodels.org/stable/generated/statsmodels.regression.linear_model.OLS.html

* GLM: https://www.statsmodels.org/stable/glm.html

* GEE: https://www.statsmodels.org/stable/gee.html

* MIXEDLM: https://www.statsmodels.org/stable/mixed_linear.html

Feel free to read up on these sub-libraries and their use cases. In week 2 you will see examples of OLS and GLM, while in week 3 we will be implementing GEE and MIXEDLM.