# Section 2 - Controlling for confounding factors

## Example 2.2
**Application 2.2**: Determining whether protein size explains the association between entanglements and disease in humans noted in **Example 1.2**

* Larger proteins may be more prone to misfolding, and thus to causing disease, regardless of their entanglement status
* In this example, you will use the code below to carry out a logistic regression analysis of the relationship between disease and entanglement while treating protein size as a confounding factor
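
As a sketch of the model this analysis fits, the adjusted logistic regression can be written as follows. The symbols $p$, $E$, $L$, and $\beta_0$, $\beta_1$, $\beta_2$ are notation introduced here for illustration and do not appear in the original notebook.

$$
\log\frac{p}{1-p} = \beta_0 + \beta_1 E + \beta_2 L
$$

Here $p$ is the probability that a protein is disease-linked, $E$ is 1 for an entangled protein and 0 otherwise, and $L$ is the protein length. The odds ratio for entanglement at fixed length is $e^{\beta_1}$. The unadjusted model drops the $\beta_2 L$ term, so comparing $e^{\beta_1}$ between the two fits indicates how much of the association protein length accounts for.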

### Step 0 - Load libraries

In [None]:
import pandas as pd
import statsmodels.api as sm
import numpy as np

### Step 1 - Load and explore the data
* We will reuse the same information from **Example 1.2** but load a new version that includes information about protein length

In [None]:
# "data5" is a pandas DataFrame object
data_path = "/home/jovyan/data-store/data/iplant/home/shared/NCEMS/BPS-training-2025/"
data5     = pd.read_csv(data_path + "entanglement-disease-association-length.csv")

# print summary information
print ("Create a quick summary of the DataFrame:\n")
data5.info()

print ("\nPrint the first 10 rows of the DataFrame:\n")
data5.head(10)

* This dataset uses `Yes` and `No` rather than binary `1` and `0`, so we need to recode the columns `entanglement` and `disease-linked` as binary integers
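
Before recoding, it can help to confirm that the columns contain only the expected `Yes` and `No` strings, because `.map()` silently returns `NaN` for any value missing from the mapping. The sketch below uses a small made-up DataFrame named `toy` (hypothetical, not the course dataset) to show one way of listing unexpected values.

```python
import pandas as pd

# Hypothetical toy data standing in for data5; the values are made up for illustration
toy = pd.DataFrame({"entanglement":   ["Yes", "No", "Yes", "no"],
                    "disease-linked": ["No", "No", "Yes", "Yes"]})

recode_map = {"Yes": 1, "No": 0}

# collect any values that the mapping would turn into NaN
unexpected = {col: sorted(set(toy[col].dropna()) - set(recode_map))
              for col in ["entanglement", "disease-linked"]}

print(unexpected)
```

A non-empty list for a column (here the lowercase `no`) means those entries should be cleaned before recoding.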

### Step 2 - Prepare for analysis

In [None]:
# create two new columns with values recoded from Yes and No strings to binary 1 and 0 integers
recode_map                         = {"Yes": 1, "No": 0}
data5['disease-linked-binary']     = data5['disease-linked'].map(recode_map)
data5['entanglement-binary']       = data5['entanglement'].map(recode_map)

# add column of 1's corresponding to the intercept
data5['intercept'] = 1

# print a summary of the updated DataFrame
print ("\nHere is the updated DataFrame:\n")
data5.head(10)

* With the three new columns `disease-linked-binary`, `entanglement-binary`, and `intercept`, we are ready to run the analysis

### Step 3 - Run the analysis

In [None]:
# make two X datasets, one including the confounder and one excluding it

# X1 includes only the feature
X1 = data5[['intercept', 'entanglement-binary']]

# X2 includes both the feature and the confounder
X2 = data5[['intercept', 'entanglement-binary', 'Length']]

# define the dependent variable (i.e., the outcome)
y = data5['disease-linked-binary']

# create two statsmodels Logit models, fit them, and compute odds ratios with confidence intervals

# model1 will not include the confounder
model1  = sm.Logit(y, X1)
result1 = model1.fit(disp = 0)

# print a header for the result1 output
print ("\nResults when confounding factor IS NOT included:\n")

# get a summary of the results
odds_ratios = pd.DataFrame({"Coefficient": result1.params,
                            "OR"         : np.exp(result1.params),  
                            "Lower CI"   : np.exp(result1.conf_int()[0]),  
                            "Upper CI"   : np.exp(result1.conf_int()[1]),
                            "p-value"    : result1.pvalues}).drop(index="intercept", errors="ignore")

# print the results
print (odds_ratios.round(3), "\n")

# model2 includes the confounder
model2  = sm.Logit(y, X2)
result2 = model2.fit(disp = 0)

# print a header for the result2 output
print ("\nResults when confounding factor IS included:\n")

# get a summary of the results
odds_ratios = pd.DataFrame({"Coefficient": result2.params,
                            "OR"         : np.exp(result2.params),  
                            "Lower CI"   : np.exp(result2.conf_int()[0]),  
                            "Upper CI"   : np.exp(result2.conf_int()[1]),
                            "p-value"    : result2.pvalues}).drop(index="intercept", errors="ignore")

# print the odds ratio
print (odds_ratios.round(3), "\n")
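
The tables above exponentiate the fitted coefficients and their confidence limits. As a minimal sketch of where those numbers come from, the example below computes an odds ratio and an approximate 95% interval from a single coefficient and its standard error. The values of `coef` and `std_err` are hypothetical and are not results from this dataset.

```python
import math

# Hypothetical values for illustration only; they are NOT results from this dataset
coef    = 0.50   # log-odds coefficient for a feature
std_err = 0.20   # standard error of that coefficient

z = 1.96         # approximate normal critical value for a 95% interval

# exponentiating moves from the log-odds scale to the odds-ratio scale
odds_ratio = math.exp(coef)
lower_ci   = math.exp(coef - z * std_err)
upper_ci   = math.exp(coef + z * std_err)

print(f"OR = {odds_ratio:.3f}, 95% CI = ({lower_ci:.3f}, {upper_ci:.3f})")
```

Because the interval is built on the log-odds scale and then exponentiated, it is not symmetric around the odds ratio. An interval that excludes 1 corresponds to a coefficient whose interval excludes 0.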

### Step 4 - Interpreting the results

* Use the quiz question at the QR code/link below to test your understanding

![](../images/section-2-example-2.png)

[Quiz Link](https://forms.gle/BYjF5YYfvvbL4NTSA)