# Hypertension-Associated Metabolites in Bread-Spikers

## Research Question
Using multi-omics profiling data, identify hypertension-associated metabolites that show a statistically significant positive association with systolic blood pressure within the 'Bread-spiker' phenotypic group, after adjusting for age and BMI.

## Analysis Strategy
1. **Identify Bread-Spikers**: Subjects with above-median glucose peak response to bread consumption
2. **Metabolite Processing**: Average metabolite abundances per subject across replicates
3. **Statistical Analysis**: Multiple linear regression for each metabolite:
   - Outcome: Systolic blood pressure
   - Predictors: Metabolite abundance, Age, BMI
   - Filter: p < 0.05 and positive coefficient


In [17]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
from collections import defaultdict
import warnings
warnings.filterwarnings('ignore')

print("Loading data...")
cgm = pd.read_csv('data_cgm.csv')
meta = pd.read_csv('data_meta.csv')
metab = pd.read_csv('data_metabolomics.csv', sep='\t', on_bad_lines='skip')

print(f"CGM data shape: {cgm.shape}")
print(f"Metadata shape: {meta.shape}")
print(f"Metabolomics data shape: {metab.shape}")


Loading data...
CGM data shape: (23520, 7)
Metadata shape: (74, 19)
Metabolomics data shape: (974, 172)


## Step 1: Identify Bread-Spikers

Calculate glucose peak response for each subject after bread consumption (without mitigators), then identify subjects with above-median peak responses.
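
As a minimal sketch of this definition on made-up readings (the column names mirror the CGM table used in this notebook; the values and the helper name `peak_delta` are hypothetical), the peak response for one subject-replicate is the largest post-meal glucose value minus the mean of the pre-meal readings, and a spiker is any subject whose average peak exceeds the cohort median:

```python
import pandas as pd

# Toy CGM readings: two subjects, one replicate each (values are invented).
toy = pd.DataFrame({
    "subject": ["A"] * 4 + ["B"] * 4,
    "rep": [1] * 8,
    "mins_since_start": [-10, -5, 0, 30, -10, -5, 0, 30],
    "glucose": [90, 100, 110, 160, 80, 90, 95, 105],
})

def peak_delta(group: pd.DataFrame) -> float:
    """Largest post-meal rise above the mean pre-meal glucose."""
    baseline = group.loc[group["mins_since_start"] < 0, "glucose"].mean()
    post = group.loc[group["mins_since_start"] >= 0, "glucose"]
    return float((post - baseline).max())

# One peak per (subject, rep), then average the replicates per subject.
rows = []
for (subject, rep), g in toy.groupby(["subject", "rep"]):
    rows.append({"subject": subject, "rep": rep, "peak": peak_delta(g)})
per_rep = pd.DataFrame(rows)
avg_peak = per_rep.groupby("subject")["peak"].mean()

# Spikers are the subjects strictly above the median average peak.
toy_spikers = avg_peak[avg_peak > avg_peak.median()].index.tolist()
print(avg_peak.to_dict(), toy_spikers)
```

Because the comparison is a strict `>`, a cohort with an odd number of subjects leaves the median subject outside the spiker group, which is why 37 subjects yield 18 spikers here.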


In [18]:
# Get bread consumption data (without mitigators)
bread = cgm[(cgm['foods'] == 'Bread') & (cgm['mitigator'].isna())]
groups = bread.groupby(['subject', 'rep'])

# Calculate peak glucose delta for each subject-rep combination
subject_deltas = {}
for name, group in groups:
    subject, rep = name
    # Baseline: mean glucose before bread consumption (negative minutes)
    negative = group[group['mins_since_start'] < 0]
    if len(negative) == 0:
        continue
    baseline = np.mean(negative['glucose'])
    
    # Post-consumption: glucose after bread consumption
    post = group[group['mins_since_start'] >= 0]
    if len(post) == 0:
        continue
    
    # Calculate maximum delta from baseline
    deltas = post['glucose'] - baseline
    max_delta = deltas.max()
    
    if subject not in subject_deltas:
        subject_deltas[subject] = []
    subject_deltas[subject].append(max_delta)

# Calculate average peak for each subject
subject_avg_peak = {}
for sub, deltas in subject_deltas.items():
    subject_avg_peak[sub] = np.mean(deltas)

# Identify spikers (above median peak response)
peaks = list(subject_avg_peak.values())
median = np.median(peaks)
spikers = [sub for sub, pk in subject_avg_peak.items() if pk > median]

print(f"Total subjects with bread data: {len(subject_avg_peak)}")
print(f"Median peak glucose delta: {median:.2f} mg/dL")
print(f"Number of bread-spikers (above median): {len(spikers)}")
print(f"\nBread-spiker subjects: {sorted(spikers)}")


Total subjects with bread data: 37
Median peak glucose delta: 56.46 mg/dL
Number of bread-spikers (above median): 18

Bread-spiker subjects: ['XB100', 'XB101', 'XB107', 'XB111', 'XB16', 'XB19', 'XB20', 'XB21', 'XB33', 'XB42', 'XB48', 'XB59', 'XB62', 'XB68', 'XB69', 'XB70', 'XB79', 'XB94']


## Step 2: Prepare Metadata for Bread-Spikers


In [19]:
# Set index to subject ID for easier lookup
meta = meta.set_index('id')
spiker_meta = meta.loc[spikers]

print(f"Metadata for {len(spiker_meta)} bread-spikers:")
print(f"  Systolic BP: {spiker_meta['Systolic bp'].describe()}")
print(f"  Age: {spiker_meta['age'].describe()}")
print(f"  BMI: {spiker_meta['BMI'].describe()}")


Metadata for 18 bread-spikers:
  Systolic BP: count     18.000000
mean     118.488147
std       15.186099
min       91.000000
25%      110.577750
50%      123.250000
75%      125.895625
max      146.000000
Name: Systolic bp, dtype: float64
  Age: count    18.000000
mean     57.622222
std      13.194944
min      25.200000
25%      55.250000
50%      59.150000
75%      66.475000
max      79.000000
Name: age, dtype: float64
  BMI: count    18.000000
mean     24.825059
std       4.552596
min      17.200000
25%      21.885520
50%      24.095000
75%      27.962500
max      34.438333
Name: BMI, dtype: float64


## Step 3: Process Metabolomics Data

Average metabolite abundances per subject across all replicates.
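
A compact way to express this averaging, shown here as a sketch on a made-up table (the abundance values and the names `toy_metab` and `toy_subject` are hypothetical; the `XB{subject}_{rep}` column pattern follows the metabolomics file), is to group the replicate columns by their subject prefix and take the mean, which skips missing replicates by default:

```python
import numpy as np
import pandas as pd

# Toy abundance table: rows are metabolites, columns are subject_replicate.
toy_metab = pd.DataFrame({
    "newname": ["Proline", "Creatinine"],
    "XB16_1": [10.0, 4.0],
    "XB16_2": [14.0, np.nan],   # missing replicate is ignored by mean()
    "XB19_1": [7.0, 6.0],
})

abund_cols = [c for c in toy_metab.columns if c.startswith("XB") and "_" in c]

# Transpose so replicates become rows, group them by subject prefix,
# average, and transpose back to metabolites x subjects.
toy_subject = (
    toy_metab[abund_cols]
    .T
    .groupby(lambda col: col.split("_")[0])
    .mean()
    .T
)
print(toy_subject)
```

The result is a float-typed metabolites-by-subjects table, so no later conversion with `pd.to_numeric` is needed for the averaged values.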


In [20]:
# Identify abundance columns (format: XB{subject}_{rep})
abund_cols = [col for col in metab.columns if col.startswith('XB') and '_' in col]

# Group columns by subject
subject_cols = defaultdict(list)
for col in abund_cols:
    sub = col.split('_')[0]
    subject_cols[sub].append(col)

subject_list = list(subject_cols.keys())

# Create averaged metabolite abundances per subject
metab_subject = pd.DataFrame(index=metab.index, columns=subject_list)

for i in metab.index:
    for sub in subject_list:
        cols = subject_cols[sub]
        values = metab.loc[i, cols]
        valid = values.dropna()
        if not valid.empty:
            metab_subject.at[i, sub] = valid.mean()

print(f"Processed {len(metab_subject)} metabolites for {len(subject_list)} subjects")
print(f"Metabolites with valid names: {metab['newname'].notna().sum()}")


Processed 974 metabolites for 38 subjects
Metabolites with valid names: 974


## Step 4: Statistical Analysis

For each metabolite, perform multiple linear regression:
- **Dependent variable**: Systolic blood pressure
- **Independent variables**: Metabolite abundance, Age, BMI
- **Criteria**: p-value < 0.05 AND positive coefficient
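
Written out, and assuming the ordinary least-squares setup that `sm.OLS` fits once a constant is added, the model for metabolite $j$ across bread-spiker subjects $i$ is as follows (the symbols are introduced here for illustration only):

```latex
\mathrm{SBP}_i = \beta_{0j} + \beta_{1j}\,\mathrm{met}_{ij} + \beta_{2j}\,\mathrm{age}_i + \beta_{3j}\,\mathrm{BMI}_i + \varepsilon_{ij},
\qquad
\text{keep metabolite } j \text{ if } \hat{\beta}_{1j} > 0 \text{ and } p(\hat{\beta}_{1j}) < 0.05 .
```

Here $p(\hat{\beta}_{1j})$ is the two-sided t-test p-value for the metabolite coefficient, and $\hat{\beta}_{1j}$ is the estimated change in systolic blood pressure (mmHg) per unit of metabolite abundance with age and BMI held fixed.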


In [21]:
results = []

for i in metab.index:
    name = metab.loc[i, 'newname']
    # Skip metabolites without valid names
    if pd.isna(name) or name == '-' or name == 'NA':
        continue
    
    try:
        # Prepare data frame for regression
        data = pd.DataFrame({
            'systolic': spiker_meta['Systolic bp'],
            'age': spiker_meta['age'],
            'bmi': spiker_meta['BMI'],
            'met': np.nan
        }, index=spikers)
        
        # Add metabolite abundance for each spiker
        for s in spikers:
            if s in subject_list:
                data.at[s, 'met'] = metab_subject.at[i, s]
        
        # Convert to numeric, handling errors
        data['met'] = pd.to_numeric(data['met'], errors='coerce')
        data['systolic'] = pd.to_numeric(data['systolic'], errors='coerce')
        data['age'] = pd.to_numeric(data['age'], errors='coerce')
        data['bmi'] = pd.to_numeric(data['bmi'], errors='coerce')
        
        # Remove rows with missing data
        data = data.dropna()
        
        # Require at least 5 complete observations; the model fits 4 parameters
        # (intercept, metabolite, age, BMI), so this is a bare minimum
        if len(data) < 5:
            continue
        
        # Multiple linear regression
        X = data[['met', 'age', 'bmi']]
        try:
            X = sm.add_constant(X)  # Add intercept
            y = data['systolic']
            model = sm.OLS(y, X).fit()
            
            # Extract coefficient and p-value for metabolite
            coef = model.params['met']
            p = model.pvalues['met']
            
            # Keep metabolites with a positive coefficient and nominal p < 0.05
            if p < 0.05 and coef > 0:
                results.append({
                    'metabolite': name,
                    'coefficient': coef,
                    'p_value': p,
                    'n_samples': len(data),
                    'r_squared': model.rsquared
                })
        except Exception:
            continue
    except KeyError:
        continue

print(f"\nAnalysis complete. Found {len(results)} metabolites with significant positive correlation.")



Analysis complete. Found 41 metabolites with significant positive correlation.


## Results: Hypertension-Associated Metabolites in Bread-Spikers

Metabolites with a positive regression coefficient and a nominal p < 0.05 for systolic blood pressure within the bread-spiker group, adjusted for age and BMI, sorted by p-value.
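
The p < 0.05 filter above is applied to each metabolite separately. A common follow-up, sketched here as an assumption rather than as part of the original analysis, is to adjust the p-values for the number of metabolites tested using the Benjamini-Hochberg procedure. The helper `benjamini_hochberg` and the example p-values are hypothetical; in this notebook the input would have to be the metabolite p-value from every fitted model, not only the ones kept in `results`:

```python
import numpy as np

def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    p = np.asarray(p_values, dtype=float)
    n = len(p)
    order = np.argsort(p)
    # Scale the k-th smallest p-value by n / k.
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity from the largest p-value downwards, cap at 1.
    adjusted_sorted = np.minimum(np.minimum.accumulate(scaled[::-1])[::-1], 1.0)
    adjusted = np.empty(n)
    adjusted[order] = adjusted_sorted
    return adjusted

# Hypothetical p-values, one per tested metabolite (not taken from the data).
all_p = [0.001, 0.008, 0.039, 0.041, 0.60]
q = benjamini_hochberg(all_p)
print(np.round(q, 4))
```

In this toy input only the first two p-values remain below 0.05 after adjustment.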


In [22]:
if len(results) > 0:
    results_df = pd.DataFrame(results)
    results_df = results_df.sort_values('p_value')
    
    #print(f"\n{'='*80}")
    print(f"FOUND {len(results_df)} METABOLITES WITH SIGNIFICANT POSITIVE CORRELATION")
    #print(f"{'='*80}\n")
    
    for idx, row in results_df.iterrows():
        print(f"Metabolite: {row['metabolite']}")
        print(f"  Coefficient: {row['coefficient']:.4f} (mmHg per unit increase)")
        print(f"  P-value: {row['p_value']:.4e}")
        print(f"  R-squared: {row['r_squared']:.3f}")
        #print(f"  Sample size: {row['n_samples']}")
        print()
    
    # Save to CSV for further analysis
    results_df.to_csv('hypertension_metabolites_results.csv', index=False)
    print(f"\nResults saved to 'hypertension_metabolites_results.csv'")
    
    # Summary statistics
    print(f"\nSummary Statistics:")
    print(f"  Mean coefficient: {results_df['coefficient'].mean():.4f}")
    print(f"  Median p-value: {results_df['p_value'].median():.4e}")
    print(f"  Mean R-squared: {results_df['r_squared'].mean():.3f}")
else:
    print("\nNo metabolites found with significant positive correlation (p < 0.05).")


FOUND 41 METABOLITES WITH SIGNIFICANT POSITIVE CORRELATION
Metabolite: cis-4-Hydroxyproline
  Coefficient: 0.0053 (mmHg per unit increase)
  P-value: 3.0948e-04
  R-squared: 0.813

Metabolite: Proline
  Coefficient: 0.0012 (mmHg per unit increase)
  P-value: 1.1773e-03
  R-squared: 0.758

Metabolite: Proline; CE10; ONIBWKKTOPOVIA-BYPYZUCNSA-N
  Coefficient: 0.0000 (mmHg per unit increase)
  P-value: 1.5002e-03
  R-squared: 0.747

Metabolite: Kynurenic acid
  Coefficient: 0.0033 (mmHg per unit increase)
  P-value: 2.4807e-03
  R-squared: 0.721

Metabolite: Deoxycholic acid 3-glucuronide
  Coefficient: 0.0002 (mmHg per unit increase)
  P-value: 2.6845e-03
  R-squared: 0.717

Metabolite: N-Acetylgalactosaminitol
  Coefficient: 0.0075 (mmHg per unit increase)
  P-value: 4.0528e-03
  R-squared: 0.694

Metabolite: Creatinine
  Coefficient: 0.0007 (mmHg per unit increase)
  P-value: 4.4042e-03
  R-squared: 0.689

Metabolite: Creatinine
  Coefficient: 0.0000 (mmHg per unit increase)
  P-value:

In [23]:
# Display results in a formatted table
if len(results) > 0:
    results_df = pd.DataFrame(results)
    results_df = results_df.sort_values('p_value')
    display(results_df)


| | metabolite | coefficient | p_value | n_samples | r_squared |
|---|---|---|---|---|---|
| 25 | cis-4-Hydroxyproline | 0.005251 | 0.000309 | 14 | 0.813441 |
| 14 | Proline | 0.001234 | 0.001177 | 14 | 0.75836 |
| 2 | Proline; CE10; ONIBWKKTOPOVIA-BYPYZUCNSA-N | 5e-06 | 0.0015 | 14 | 0.746836 |
| 4 | Kynurenic acid | 0.003291 | 0.002481 | 14 | 0.721254 |
| 40 | Deoxycholic acid 3-glucuronide | 0.000169 | 0.002684 | 14 | 0.717026 |
| 26 | N-Acetylgalactosaminitol | 0.007538 | 0.004053 | 14 | 0.693986 |
| 37 | Creatinine | 0.000686 | 0.004404 | 14 | 0.689131 |
| 16 | Creatinine | 3.3e-05 | 0.00667 | 14 | 0.663832 |
| 18 | Chenodeoxycholic acid 24-acyl-.beta.-D-glucuro... | 0.000908 | 0.007965 | 14 | 0.652467 |
| 9 | Asp-Arg | 0.000334 | 0.008265 | 14 | 0.650054 |
