# GPT-4.1 programming code responses test for comorbidities dataset

### Prompt 3

In [1]:
#Old code provided by LLM:
#Reweighting:
#sample_weight = total_samples / (n_groups * group_count[group_label])

#new code
import pandas as pd

# 1. Load your data
df = pd.read_csv("data_new.csv")

# 2. Pick your grouping. It could be race alone (df['race']);
#    here we use the race × gender interaction:
df['group'] = df['race'] + "_" + df['dem_female'].astype(str)

# 3. Compute counts
total_samples = len(df)
group_counts  = df['group'].value_counts()        # Series: index=group_label, value=count
n_groups      = group_counts.size                 # how many distinct groups

# 4. Build the weight map
#    total_samples / (n_groups * count_for_each_group)
weight_map = {
    grp: total_samples / (n_groups * cnt)
    for grp, cnt in group_counts.items()
}

# 5. Assign a sample_weight column
df['sample_weight'] = df['group'].map(weight_map)

# 6. Quick sanity check: each group’s total weight should be ≈ total_samples/n_groups
group_weight_sums = df.groupby('group')['sample_weight'].sum()
print(group_weight_sums)

group
black_0    12196.0
black_1    12196.0
white_0    12196.0
white_1    12196.0
Name: sample_weight, dtype: float64


## Output
All four groups sum to 12 196. The 48 784 total samples are split across four groups, so each group's weights add up to exactly 48 784 ÷ 4 = 12 196, and every group carries the same total weight regardless of how many rows it contains.
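The same check can be reproduced on a small example. The following is a minimal sketch using only the standard library; the toy group labels and sizes are hypothetical and are not taken from `data_new.csv`. It applies the same formula, `total_samples / (n_groups * count)`, and shows that each group's weights sum to `total_samples / n_groups`.

```python
from collections import Counter

# Hypothetical toy labels standing in for df['group'] (sizes are made up)
groups = ["black_0"] * 2 + ["black_1"] * 3 + ["white_0"] * 5 + ["white_1"] * 10

total_samples = len(groups)        # 20 rows in this toy example
group_counts = Counter(groups)     # label -> number of rows
n_groups = len(group_counts)       # 4 distinct groups

# Same formula as the notebook: total_samples / (n_groups * count_for_each_group)
weight_map = {
    grp: total_samples / (n_groups * cnt)
    for grp, cnt in group_counts.items()
}

# One weight per row, looked up by the row's group label
sample_weights = [weight_map[grp] for grp in groups]

# Per-group weight totals; each should be close to total_samples / n_groups
group_weight_sums = {
    grp: weight_map[grp] * cnt
    for grp, cnt in group_counts.items()
}

for grp, total in group_weight_sums.items():
    print(grp, round(total, 6))
```

Smaller groups receive larger per-row weights, and the weights across all rows still sum to `total_samples`, so the overall scale of the data is unchanged.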

In [2]:
#old code provided by LLM
# from imblearn.over_sampling import SMOTENC
#X_resampled, y_resampled = SMOTENC(categorical_features=[group_index], random_state=0).fit_resample(X, y)

#new code:
import pandas as pd
from imblearn.over_sampling import SMOTENC

# 1. Load and select features/target
df = pd.read_csv("data_new.csv")
X = df[['risk_score_t', 'race', 'dem_female']]   # risk_score_t is numeric; race & dem_female are categorical
y = df['program_enrolled_t']

# 2. Encode categorical columns as integer codes (SMOTENC takes one column per categorical feature, not one-hot columns)
categorical_cols = ['race', 'dem_female']
X_enc = X.copy()
for col in categorical_cols:
    X_enc[col] = X_enc[col].astype('category').cat.codes

# 3. Figure out the indices of those categorical columns
cat_indices = [X_enc.columns.get_loc(col) for col in categorical_cols]

# 4. Instantiate SMOTENC with those indices
smote_nc = SMOTENC(categorical_features=cat_indices, random_state=42)

# 5. Fit & resample
X_resampled, y_resampled = smote_nc.fit_resample(X_enc, y)

print("Original class counts:", y.value_counts().to_dict())
print("Resampled class counts:", pd.Series(y_resampled).value_counts().to_dict())

Original class counts: {0: 48332, 1: 452}
Resampled class counts: {0: 48332, 1: 48332}


## Output
SMOTENC started from 48 332 negatives (class 0) and only 452 positives (class 1) and generated 47 880 synthetic minority samples, so that both classes now have 48 332 examples. The classes are exactly balanced: the under-represented positive class was oversampled, with “race” and “dem_female” treated as categorical features rather than interpolated as numbers.
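The arithmetic behind these counts is worth spelling out, because it shows how much of the resampled positive class is synthetic. The following is a small sketch that uses only the class counts printed above; the variable names are illustrative.

```python
# Class counts copied from the printed output above
n_negative = 48332   # class 0, unchanged by resampling
n_positive = 452     # class 1, before resampling

# Rows SMOTENC had to create to match the majority class
n_synthetic = n_negative - n_positive

# Share of the resampled positive class that is synthetic
synthetic_share = n_synthetic / n_negative

# Imbalance before resampling (negatives per positive)
imbalance_ratio = n_negative / n_positive

print("Synthetic positives:", n_synthetic)
print("Synthetic share of positive class: {:.1%}".format(synthetic_share))
print("Original negatives per positive: {:.0f}".format(imbalance_ratio))
```

With roughly 107 negatives per positive in the original data, about 99% of the resampled positive class is synthetic, which is worth keeping in mind when evaluating a model trained on `X_resampled`.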

### Comment
This was the only code that was given.