**Initial Regression Analysis**

This is an initial look at specifications for our analysis of maternal mortality and political indicators.

Open questions/issues:
- We may need to get data from 2016 regarding state leadership
- Not all 50 states are represented in the maternal mortality data (Alaska and Louisiana are missing)
- Since maternal mortality is measured at the county level (giving us more observations), but policy is usually managed on a state level, is GOP trifecta (measuring Senate seats plus governor) the cleanest way to do this?
- The mortality rate is extremely small, so I included models that standardize it and ones that don't. I was also unsure how to read the standardized column: I expected values between -1 and 1, but z-scores are not bounded that way (they have mean 0 and standard deviation 1), so values outside that range are expected.
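To illustrate the last point about the standardized column, here is a minimal sketch using made-up, right-skewed rates (the numbers are hypothetical, not taken from our dataset). It applies the same (value - mean) / standard deviation formula and shows that the result has mean 0 and standard deviation 1 but can easily exceed 1 in absolute value.

```python
import pandas as pd

# Hypothetical, right-skewed rates (illustrative only, not from the dataset)
rates = pd.Series([1e-5, 2e-5, 2e-5, 3e-5, 3e-5, 4e-5, 2e-4])

# Same formula used for a standardized column: (x - mean) / std
z = (rates - rates.mean()) / rates.std()

print(z.round(2).tolist())
print("mean:", round(z.mean(), 10), "std:", round(z.std(), 10))
print("largest |z|:", round(z.abs().max(), 2))
```

A single unusually high county is enough to push a z-score well past 1, so values outside [-1, 1] are not a sign that the column was computed incorrectly.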

In [109]:
#setup
import pandas as pd
import numpy as np
import statsmodels.api as sm

In [111]:
#importing the file for maternal mortality rate (MMR) data
mmr_df = pd.read_csv("data/maternal_mortality_complete_dataset.csv")
mmr_df

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,measure_id,measure_name,location_id,location_name,fips,race_id,race_name,sex_id,...,obgyn_provider_rate_100k,prenatal_care_first_trimester_pct,obesity_pre_pregnancy_pct,diabetes_pre_pregnancy_pct,hypertension_pre_pregnancy_pct,State Abbreviation,State,guttmacher_index,state,gop_trifecta
0,0,3648030,1,Deaths,614,Autauga County (Alabama),1001.0,1,Total,2,...,3.5,71.8,32.1,1.1,3.0,AL,Alabama,most_restrictive,Alabama,1
1,1,3648036,1,Deaths,637,Baldwin County (Alabama),1003.0,1,Total,2,...,19.4,78.3,28.0,1.2,3.3,AL,Alabama,most_restrictive,Alabama,1
2,2,3648042,1,Deaths,624,Barbour County (Alabama),1005.0,1,Total,2,...,0.0,67.4,41.2,1.1,4.6,AL,Alabama,most_restrictive,Alabama,1
3,3,3648048,1,Deaths,603,Bibb County (Alabama),1007.0,1,Total,2,...,0.0,66.4,37.5,1.1,3.1,AL,Alabama,most_restrictive,Alabama,1
4,4,3648054,1,Deaths,588,Blount County (Alabama),1009.0,1,Total,2,...,3.4,72.3,32.8,1.2,2.9,AL,Alabama,most_restrictive,Alabama,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3005,3005,4044432,1,Deaths,3712,Sweetwater County (Wyoming),56037.0,1,Total,2,...,14.5,75.2,28.8,1.3,1.1,WY,Wyoming,restrictive,Wyoming,1
3006,3006,4044438,1,Deaths,3697,Teton County (Wyoming),56039.0,1,Total,2,...,70.7,79.3,8.7,0.7,1.0,WY,Wyoming,restrictive,Wyoming,1
3007,3007,4044444,1,Deaths,3714,Uinta County (Wyoming),56041.0,1,Total,2,...,10.0,80.1,25.1,1.0,1.0,WY,Wyoming,restrictive,Wyoming,1
3008,3008,4044450,1,Deaths,3700,Washakie County (Wyoming),56043.0,1,Total,2,...,0.0,76.0,26.0,1.1,1.2,WY,Wyoming,restrictive,Wyoming,1


Question: should we be looking at the lagged data for government control? Two ways of looking at this...

a) we can say that the leadership is a reflection of politics and policy in the state, or 

b) we can use control in 2016 as an indicator of future policy and leadership in the state, treating the 2017-2019 data as the result of the policies preceding the data collection.
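If we go with option (b), one way to attach a lagged indicator is sketched below. Everything here is an assumption for illustration: the `control` table, its `year` column, and the `gop_trifecta_2016` name are hypothetical, and the values are made up. The idea is just to keep the 2016 snapshot and merge it onto the county rows by state.

```python
import pandas as pd

# Hypothetical state-year table of government control (values are made up)
control = pd.DataFrame({
    "state": ["Alabama", "Alabama", "Colorado", "Colorado"],
    "year": [2016, 2018, 2016, 2018],
    "gop_trifecta": [1, 1, 0, 0],
})

# Hypothetical county rows standing in for mmr_df
counties = pd.DataFrame({
    "location_name": ["Autauga County (Alabama)", "Adams County (Colorado)"],
    "state": ["Alabama", "Colorado"],
})

# Keep only the 2016 snapshot and rename the column so the lag is explicit
lagged = (
    control.loc[control["year"] == 2016, ["state", "gop_trifecta"]]
    .rename(columns={"gop_trifecta": "gop_trifecta_2016"})
)

# validate="m:1" raises if a state appears more than once in the lagged table
merged = counties.merge(lagged, on="state", how="left", validate="m:1")
print(merged)
```

Using `how="left"` keeps every county even when a state has no 2016 record, so any gaps show up as NaN rather than silently dropping rows.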

In [113]:
#Noticed Alaska missing, so wanted to check if all 50 states are represented.  Looks like two are missing.
mmr_df["state"].nunique()

48

In [115]:
#Looks like Alaska and Louisiana are missing.
mmr_df["state"].unique()

array(['Alabama', 'Arizona', 'Arkansas', 'California', 'Colorado',
       'Connecticut', 'Delaware', 'Florida', 'Georgia', 'Hawaii', 'Idaho',
       'Illinois', 'Indiana', 'Iowa', 'Kansas', 'Kentucky', 'Maine',
       'Maryland', 'Massachusetts', 'Michigan', 'Minnesota',
       'Mississippi', 'Missouri', 'Montana', 'Nebraska', 'Nevada',
       'New_Hampshire', 'New_Jersey', 'New_Mexico', 'New_York',
       'North_Carolina', 'North_Dakota', 'Ohio', 'Oklahoma', 'Oregon',
       'Pennsylvania', 'Rhode_Island', 'South_Carolina', 'South_Dakota',
       'Tennessee', 'Texas', 'Utah', 'Vermont', 'Virginia', 'Washington',
       'West_Virginia', 'Wisconsin', 'Wyoming'], dtype=object)
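Rather than eyeballing the array above, the missing states can be found with a set difference. This is a small sketch: `ALL_STATES` and `missing_states` are helper names I made up, and the list uses underscores to match the naming in the `state` column. The toy series stands in for `mmr_df["state"]`.

```python
import pandas as pd

# All 50 state names, written with underscores to match the "state" column
ALL_STATES = [
    "Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado",
    "Connecticut", "Delaware", "Florida", "Georgia", "Hawaii", "Idaho",
    "Illinois", "Indiana", "Iowa", "Kansas", "Kentucky", "Louisiana",
    "Maine", "Maryland", "Massachusetts", "Michigan", "Minnesota",
    "Mississippi", "Missouri", "Montana", "Nebraska", "Nevada",
    "New_Hampshire", "New_Jersey", "New_Mexico", "New_York",
    "North_Carolina", "North_Dakota", "Ohio", "Oklahoma", "Oregon",
    "Pennsylvania", "Rhode_Island", "South_Carolina", "South_Dakota",
    "Tennessee", "Texas", "Utah", "Vermont", "Virginia", "Washington",
    "West_Virginia", "Wisconsin", "Wyoming",
]

def missing_states(state_column):
    """Return the states in ALL_STATES that never appear in state_column."""
    return sorted(set(ALL_STATES) - set(state_column.dropna().unique()))

# Toy stand-in for mmr_df["state"]: every state except the two noted above
toy = pd.Series([s for s in ALL_STATES if s not in ("Alaska", "Louisiana")])
print(missing_states(toy))  # → ['Alaska', 'Louisiana']
```

In the notebook this would be called as `missing_states(mmr_df["state"])`.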

I've broken these down into two theories, and then further into 2 or 3 specifications based on those theories.

**First Theory**: A GOP trifecta has a positive relationship with maternal mortality.
- Specification #1: The effect of a GOP trifecta on MMR.
- Specification #2: The effect of GOP trifecta on MMR, standardizing MMR.
- Specification #3: The effect of a GOP trifecta on MMR, controlling for health indicators.
Note: we can also add a specification for low-income if we want to.

**Second Theory**: States at the highest level of abortion restriction have the highest level of maternal mortality. The levels are Most restrictive, Very restrictive, Restrictive, Some restrictions/protections, Protective, Very protective, and Most protective.
- Specification #1: The effect on MMR of being in the "most restrictive" category versus not being in that category
- Specification #2: The effect of all levels of abortion restriction on MMR
- Specification #3: The effect of all levels of abortion restriction on MMR, controlling for health indicators.
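For the second-theory specifications, guttmacher_index is a text category, so it needs to be encoded before it goes into OLS. Below is a minimal sketch with toy rows. Only "most_restrictive" and "restrictive" appear in the output shown above; the other label spellings in `levels` are my guesses based on the category list and would need to be checked against the data.

```python
import pandas as pd

# Assumed label spellings, ordered from most protective to most restrictive
levels = [
    "most_protective", "very_protective", "protective",
    "some_restrictions_protections",
    "restrictive", "very_restrictive", "most_restrictive",
]

# Toy rows standing in for mmr_df
toy = pd.DataFrame({
    "guttmacher_index": ["most_restrictive", "restrictive",
                         "protective", "most_restrictive"],
})

# Specification #1: a single 0/1 indicator for the top category
toy["most_restrictive"] = (toy["guttmacher_index"] == "most_restrictive").astype(int)

# Specifications #2 and #3: one dummy per level, leaving out a reference level
as_category = toy["guttmacher_index"].astype(pd.CategoricalDtype(levels))
dummies = pd.get_dummies(as_category, prefix="gi", dtype=int)
dummies = dummies.drop(columns="gi_most_protective")  # reference category

print(toy)
print(dummies)
```

Dropping one level avoids perfect collinearity with the intercept; each remaining coefficient is then read relative to the omitted "most protective" group.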

**Specification #1:** The effect of a GOP trifecta on maternal mortality.

In [117]:
#making a clean copy without NAs in these columns.  This dropped about 30 observations.
mmr_df = mmr_df.dropna(subset=['val', 'gop_trifecta'])

In [119]:
mmr_df

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,measure_id,measure_name,location_id,location_name,fips,race_id,race_name,sex_id,...,obgyn_provider_rate_100k,prenatal_care_first_trimester_pct,obesity_pre_pregnancy_pct,diabetes_pre_pregnancy_pct,hypertension_pre_pregnancy_pct,State Abbreviation,State,guttmacher_index,state,gop_trifecta
0,0,3648030,1,Deaths,614,Autauga County (Alabama),1001.0,1,Total,2,...,3.5,71.8,32.1,1.1,3.0,AL,Alabama,most_restrictive,Alabama,1
1,1,3648036,1,Deaths,637,Baldwin County (Alabama),1003.0,1,Total,2,...,19.4,78.3,28.0,1.2,3.3,AL,Alabama,most_restrictive,Alabama,1
2,2,3648042,1,Deaths,624,Barbour County (Alabama),1005.0,1,Total,2,...,0.0,67.4,41.2,1.1,4.6,AL,Alabama,most_restrictive,Alabama,1
3,3,3648048,1,Deaths,603,Bibb County (Alabama),1007.0,1,Total,2,...,0.0,66.4,37.5,1.1,3.1,AL,Alabama,most_restrictive,Alabama,1
4,4,3648054,1,Deaths,588,Blount County (Alabama),1009.0,1,Total,2,...,3.4,72.3,32.8,1.2,2.9,AL,Alabama,most_restrictive,Alabama,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3005,3005,4044432,1,Deaths,3712,Sweetwater County (Wyoming),56037.0,1,Total,2,...,14.5,75.2,28.8,1.3,1.1,WY,Wyoming,restrictive,Wyoming,1
3006,3006,4044438,1,Deaths,3697,Teton County (Wyoming),56039.0,1,Total,2,...,70.7,79.3,8.7,0.7,1.0,WY,Wyoming,restrictive,Wyoming,1
3007,3007,4044444,1,Deaths,3714,Uinta County (Wyoming),56041.0,1,Total,2,...,10.0,80.1,25.1,1.0,1.0,WY,Wyoming,restrictive,Wyoming,1
3008,3008,4044450,1,Deaths,3700,Washakie County (Wyoming),56043.0,1,Total,2,...,0.0,76.0,26.0,1.1,1.2,WY,Wyoming,restrictive,Wyoming,1


In [121]:
#define y (MMR rate)
y = mmr_df["val"]

#define X (GOP trifecta)
X = mmr_df["gop_trifecta"]

#adding an intercept
X = sm.add_constant(X)

mmr_vs_trifecta_1 = sm.OLS(y,X).fit()

print(mmr_vs_trifecta_1.summary())

                            OLS Regression Results                            
Dep. Variable:                    val   R-squared:                       0.134
Model:                            OLS   Adj. R-squared:                  0.134
Method:                 Least Squares   F-statistic:                     462.3
Date:                Mon, 02 Dec 2024   Prob (F-statistic):           1.89e-95
Time:                        16:26:42   Log-Likelihood:                 29457.
No. Observations:                2979   AIC:                        -5.891e+04
Df Residuals:                    2977   BIC:                        -5.890e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
const         2.953e-05   3.65e-07     80.973   

**Comments:**
- These results are hard to interpret because the values are so small; it may be smart to standardize the "val" column (which represents the mortality rate value).
- A GOP trifecta seems to have a positive relationship with maternal mortality.
- The results are statistically significant for both the intercept and the gop_trifecta coefficient.
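Another option for readability, besides standardizing, is to rescale the rate by a constant (for example, expressing it per 100,000). This is a sketch on simulated data and assumes "val" is a rate per person, which I have not confirmed. Multiplying the outcome by a constant multiplies the intercept and slope by that same constant and leaves the fit otherwise unchanged.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated stand-ins: a binary regressor and a very small rate outcome
x = rng.integers(0, 2, size=200)
val = 3e-5 + 1e-5 * x + rng.normal(0, 5e-6, size=200)

def ols(y, x):
    """Return [intercept, slope] from a least-squares fit of y on x."""
    X = np.column_stack([np.ones(len(x)), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

raw = ols(val, x)
per_100k = ols(val * 100_000, x)

print("raw:     ", raw)
print("per 100k:", per_100k)
```

The per-100,000 coefficients are just the raw ones times 100,000, so the choice only affects how easy the table is to read.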

**Specification #2:** The effect of a GOP trifecta on MMR, standardizing MMR.

In [127]:
#calculating mean of MMR
mmr_mean = mmr_df['val'].mean()
#calculating standard deviation of MMR
mmr_std = mmr_df['val'].std()

#creating a new column for standardized MMR
#(working on an explicit copy avoids pandas' SettingWithCopyWarning,
# since mmr_df came from an earlier dropna)
mmr_df = mmr_df.copy()
mmr_df['standardized_mmr'] = (mmr_df['val'] - mmr_mean) / mmr_std


In [129]:
#define y (MMR rate standardized)
y = mmr_df["standardized_mmr"]

#define X (GOP trifecta)
X = mmr_df["gop_trifecta"]

#adding an intercept
X = sm.add_constant(X)

mmr_vs_trifecta_2 = sm.OLS(y,X).fit()

print(mmr_vs_trifecta_2.summary())

                            OLS Regression Results                            
Dep. Variable:       standardized_mmr   R-squared:                       0.134
Model:                            OLS   Adj. R-squared:                  0.134
Method:                 Least Squares   F-statistic:                     462.3
Date:                Mon, 02 Dec 2024   Prob (F-statistic):           1.89e-95
Time:                        16:27:29   Log-Likelihood:                -4011.5
No. Observations:                2979   AIC:                             8027.
Df Residuals:                    2977   BIC:                             8039.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
const           -0.4669      0.028    -16.913   

**Comments:**
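- This specification is MUCH easier to read. It tells us that counties in GOP-trifecta states have maternal mortality 0.7547 standard deviations higher, on average, than counties in states without one. The constant of -0.4669 is the average standardized MMR in states without a GOP trifecta (which are not necessarily Democratic-controlled), i.e. about 0.47 standard deviations below the overall mean.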
- All results were statistically significant.
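To make the reading of these two numbers concrete: with a single 0/1 regressor, the OLS intercept is the mean outcome in the 0 group and the slope is the difference between the two group means. The sketch below checks this on six made-up rows (hypothetical values, not our data), using the same standardization formula.

```python
import numpy as np
import pandas as pd

# Hypothetical rates for three non-trifecta and three trifecta counties
toy = pd.DataFrame({
    "val": [2.0e-5, 2.5e-5, 3.0e-5, 3.5e-5, 4.5e-5, 5.0e-5],
    "gop_trifecta": [0, 0, 0, 1, 1, 1],
})
toy["z"] = (toy["val"] - toy["val"].mean()) / toy["val"].std()

# OLS of standardized outcome on an intercept and the indicator
X = np.column_stack([np.ones(len(toy)), toy["gop_trifecta"]])
intercept, slope = np.linalg.lstsq(X, toy["z"].to_numpy(), rcond=None)[0]

# Group means of the standardized outcome
group_means = toy.groupby("gop_trifecta")["z"].mean()

print("intercept:", intercept, "| mean of group 0:", group_means.loc[0])
print("slope:    ", slope, "| difference in means:",
      group_means.loc[1] - group_means.loc[0])
```

So the coefficient is a difference in standard-deviation units between the two groups, not a multiplicative factor.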

In [133]:
#was getting NA values in the regression, so needed to filter out the ones I didn't want
na_counts = mmr_df.isnull().sum()
#filter to show only columns with NAs
columns_with_nas = na_counts[na_counts > 0].index.tolist()
print("Columns with NAs:", columns_with_nas)

Columns with NAs: ['county_fips', 'low_income_pct', 'obgyn_provider_rate_100k', 'prenatal_care_first_trimester_pct', 'obesity_pre_pregnancy_pct', 'diabetes_pre_pregnancy_pct', 'hypertension_pre_pregnancy_pct']


In [137]:
#dropping rows that have NAs in any of the health indicator columns
mmr_df = mmr_df.dropna(subset=['obgyn_provider_rate_100k', 'prenatal_care_first_trimester_pct', 'obesity_pre_pregnancy_pct', 'diabetes_pre_pregnancy_pct', 'hypertension_pre_pregnancy_pct'])

In [139]:
#define y (MMR rate standardized)
y = mmr_df["standardized_mmr"]

#define X (GOP trifecta and health indicator covariates)
X = mmr_df[['gop_trifecta', 'obgyn_provider_rate_100k', 
            'prenatal_care_first_trimester_pct', 'obesity_pre_pregnancy_pct',
            'diabetes_pre_pregnancy_pct', 'hypertension_pre_pregnancy_pct']]

#adding an intercept
X = sm.add_constant(X)

mmr_vs_trifecta_3 = sm.OLS(y,X).fit()

print(mmr_vs_trifecta_3.summary())

                            OLS Regression Results                            
Dep. Variable:       standardized_mmr   R-squared:                       0.484
Model:                            OLS   Adj. R-squared:                  0.483
Method:                 Least Squares   F-statistic:                     451.0
Date:                Mon, 02 Dec 2024   Prob (F-statistic):               0.00
Time:                        16:32:09   Log-Likelihood:                -3154.4
No. Observations:                2891   AIC:                             6323.
Df Residuals:                    2884   BIC:                             6365.
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
                                        coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------------------
const 

**Comments:**
- The GOP trifecta coefficient is somewhat smaller once we account for health indicators that may better explain the mortality rate. It is still substantively large: a GOP trifecta is associated with maternal mortality about 0.47 standard deviations higher, holding the health indicators constant.
- Few other health indicators appear to be as predictive of maternal mortality, with hypertension before pregnancy a distant second at 0.226. Because the covariates are measured on different scales, their coefficient sizes are not directly comparable.
- All coefficients except the intercept are statistically significant. The reason is unclear at this point; one possibility is that the intercept describes a county where every covariate equals zero, which is far outside the observed data. It is probably safer to interpret the Republican trifecta coefficient than the intercept.
- There is potential multicollinearity in our model, as mentioned in the notes below the regression.
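One way to follow up on the multicollinearity note is to compute variance inflation factors (VIFs) for the covariates. The sketch below computes them by hand with NumPy on simulated columns (`vif` is a helper I wrote for illustration, and the data are made up). statsmodels also provides `variance_inflation_factor` in `statsmodels.stats.outliers_influence`, which could be used instead.

```python
import numpy as np
import pandas as pd

def vif(df):
    """VIF for each column: 1 / (1 - R^2), where R^2 comes from regressing
    that column on all the other columns plus an intercept."""
    out = {}
    for col in df.columns:
        y = df[col].to_numpy(dtype=float)
        others = df.drop(columns=col).to_numpy(dtype=float)
        X = np.column_stack([np.ones(len(df)), others])
        fitted = X @ np.linalg.lstsq(X, y, rcond=None)[0]
        r2 = 1 - ((y - fitted) ** 2).sum() / ((y - y.mean()) ** 2).sum()
        out[col] = 1 / (1 - r2)
    return pd.Series(out)

# Simulated covariates: "b" is nearly a copy of "a", "c" is unrelated
rng = np.random.default_rng(2)
a = rng.normal(size=500)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=500),
    "c": rng.normal(size=500),
})

print(vif(toy).round(1))
```

A common rule of thumb treats VIFs above roughly 5 to 10 as a sign that a covariate is largely explained by the others. In the notebook this could be run on the health indicator columns used in Specification #3.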