# PCOSense: An Explainable & Patient-Centric PCOS Risk Intelligence System

**Author:** Raksha Mishra  
**Project:** CycleSense / PCOSense  
**Domain:** Women’s Health · Machine Learning · Tableau Analytics  

---

## Why This Study Exists

I was diagnosed with PCOS at the age of **15**.

What followed were years of acne flare-ups, hirsutism, irregular menstrual cycles,
weight fluctuations, and eventually thyroid imbalance. Treatment plans changed,
medications rotated, and explanations often remained incomplete.

Most clinical tools reduce PCOS to a *binary diagnosis*.

This project challenges that.

**PCOSense** is built to:
- Quantify PCOS risk on a **spectrum**
- Combine **clinical data + lived symptoms**
- Provide **explainable insights**, not black-box predictions
- Act as a **decision-support system**, not a diagnosis

> This notebook documents the full research, modeling, and interpretability pipeline.


## ⚠️ Ethical Disclaimer

This notebook and its models are created **solely for educational, analytical, and research purposes**.

- **Not** a medical diagnostic tool
- **Not** a replacement for professional healthcare
- Designed to **support understanding**, not to replace clinicians

All insights should be interpreted responsibly.


In [None]:
# Core libraries
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# ML & preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (
    classification_report,
    roc_auc_score,
    confusion_matrix,
    roc_curve
)

from sklearn.ensemble import (
    RandomForestClassifier,
    GradientBoostingClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression

# Imbalance handling
from imblearn.combine import SMOTETomek

# Visualization settings
sns.set(style="whitegrid")
plt.rcParams["figure.figsize"] = (10, 6)


## Dataset Description

The dataset used in this study contains:
- Hormonal markers (LH, FSH, AMH, TSH)
- Physical indicators (BMI, Waist-Hip Ratio)
- Ultrasound follicle measurements
- Symptom-based indicators (acne, hair growth, cycle irregularity)
- Binary PCOS label

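This mix of lab values, ultrasound measurements, and self-reported symptoms reflects the kind of information gathered under **real clinical screening conditions**.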


In [None]:
# Update this path based on your local folder
DATA_PATH = "PCOS_data.csv"  # or .xlsx

if DATA_PATH.endswith(".xlsx"):
    df = pd.read_excel(DATA_PATH, sheet_name="Full_new")
else:
    df = pd.read_csv(DATA_PATH)

print("Dataset shape:", df.shape)
df.head()


## Data Cleaning Strategy

Clinical datasets often contain:
- Missing values
- Mixed data types
- Identifier columns irrelevant to modeling

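To preserve medical meaning:
- Identifier columns are removed
- Every column is coerced to numeric, so unparseable entries become missing values instead of raising errors
- Missing values are filled with the column median, which is less sensitive to outliers than the mean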
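
As a minimal sketch of what coercion followed by median imputation does, the toy table below uses made-up values (the malformed string `"a"` and the numbers are illustrative assumptions, not rows from the dataset):

```python
import pandas as pd

# Hypothetical toy records; values and the malformed string are illustrative only
toy = pd.DataFrame({
    "AMH(ng/mL)": ["2.1", "a", "6.3", None],
    "BMI": [22.0, 31.5, None, 27.0],
})

# Coerce every column to numeric; unparseable strings become NaN
toy_num = toy.apply(pd.to_numeric, errors="coerce")

# Count the gaps per column before imputation
missing_before = toy_num.isna().sum()

# Fill each column's gaps with that column's median
toy_clean = toy_num.fillna(toy_num.median(numeric_only=True))

print(missing_before)
print(toy_clean)
```

Counting the gaps before filling them is a cheap way to see how much of each column ends up being imputed rather than measured.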


In [None]:
# Normalize column names
df.columns = df.columns.str.strip()

# Drop identifiers and the header-less "Unnamed: 44" column, if present
drop_cols = ['Sl. No', 'Patient File No.', 'Unnamed: 44']
df = df.drop(columns=drop_cols, errors="ignore")

# Coerce every column to numeric; unparseable entries become NaN
for col in df.columns:
    df[col] = pd.to_numeric(df[col], errors="coerce")

# Impute missing values with each column's median
df = df.fillna(df.median(numeric_only=True))

df.info()


## Exploratory Data Analysis

Before modeling, it is crucial to:
- Understand feature distributions
- Identify imbalance
- Observe correlations with PCOS
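
The imbalance and correlation checks listed above can also be done numerically. The sketch below runs on synthetic stand-in data (the columns `marker_a` and `marker_b` are hypothetical), so it only illustrates the mechanics:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in data: roughly 30% positives
label = (rng.random(n) < 0.3).astype(int)
toy = pd.DataFrame({
    "PCOS (Y/N)": label,
    "marker_a": label * 2.0 + rng.normal(size=n),  # shifted upward for positives
    "marker_b": rng.normal(size=n),                # unrelated noise
})

# Share of each class, to quantify imbalance
class_share = toy["PCOS (Y/N)"].value_counts(normalize=True)

# Pearson correlation of each feature with the label, strongest first
corr_with_target = (
    toy.corr()["PCOS (Y/N)"]
    .drop("PCOS (Y/N)")
    .sort_values(key=abs, ascending=False)
)

print(class_share)
print(corr_with_target)
```

Applied to the real table, the same two expressions would use `df` in place of `toy`. Pearson correlation only captures linear association, so a low value does not rule out a useful feature.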


In [None]:
sns.countplot(x="PCOS (Y/N)", data=df)
plt.title("PCOS Class Distribution")
plt.show()


In [None]:
sns.boxplot(x="PCOS (Y/N)", y="LH(mIU/mL)", data=df)
plt.title("LH Levels vs PCOS")
plt.show()

sns.boxplot(x="PCOS (Y/N)", y="FSH(mIU/mL)", data=df)
plt.title("FSH Levels vs PCOS")
plt.show()


## Feature Engineering (Medical Domain Logic)

Instead of generic transformations, features are engineered based on:
- Endocrinology literature
- PCOS diagnostic patterns
- Clinical relevance


In [None]:
df_fe = df.copy()

# Hormonal ratios
df_fe["LH_FSH_Ratio"] = df_fe["LH(mIU/mL)"] / (df_fe["FSH(mIU/mL)"] + 1e-6)

# Thyroid-Ovarian interaction
df_fe["Thyroid_Ovarian_Index"] = (
    df_fe["TSH (mIU/L)"] * df_fe["AMH(ng/mL)"]
)

# Symptom burden
symptoms = [
    'Weight gain(Y/N)', 'hair growth(Y/N)',
    'Skin darkening (Y/N)', 'Hair loss(Y/N)', 'Pimples(Y/N)'
]
df_fe["Symptom_Burden_Score"] = df_fe[symptoms].sum(axis=1)

# Follicle aggregation
df_fe["Total_Follicle_Count"] = (
    df_fe["Follicle No. (L)"] + df_fe["Follicle No. (R)"]
)

df_fe.head()
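
As a quick worked check of the ratio and aggregation features, here is a small sketch on two made-up patients. The helper name `add_hormone_features` and all values are assumptions for illustration; the column names follow the cell above.

```python
import pandas as pd

def add_hormone_features(frame: pd.DataFrame, eps: float = 1e-6) -> pd.DataFrame:
    """Return a copy of `frame` with the LH/FSH ratio and total follicle count added."""
    out = frame.copy()
    # eps guards against division by zero when FSH is recorded as 0
    out["LH_FSH_Ratio"] = out["LH(mIU/mL)"] / (out["FSH(mIU/mL)"] + eps)
    out["Total_Follicle_Count"] = out["Follicle No. (L)"] + out["Follicle No. (R)"]
    return out

# Two hypothetical patients; the numbers are illustrative only
toy = pd.DataFrame({
    "LH(mIU/mL)": [3.0, 12.0],
    "FSH(mIU/mL)": [6.0, 4.0],
    "Follicle No. (L)": [4, 11],
    "Follicle No. (R)": [5, 13],
})

toy_fe = add_hormone_features(toy)
print(toy_fe[["LH_FSH_Ratio", "Total_Follicle_Count"]].round(2))
```

The epsilon avoids a division error, but an FSH value of exactly 0 would still produce a very large ratio, so such rows may deserve a separate check.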


## Modeling Philosophy

The relationship between clinical markers and PCOS is **not linear**.

Hence, a **stacked ensemble** is used:
- Random Forest → captures non-linear interactions
- Gradient Boosting → fits trees sequentially, each one correcting the errors of those before it
- Logistic Regression → explainable meta-learner
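
To make the stacking idea concrete, the sketch below reproduces roughly what a stacked ensemble does by hand: the base models produce out-of-fold probabilities, and a logistic regression is trained on them. It uses synthetic data from `make_classification` and smaller models than the notebook, so it is an illustration of the mechanism rather than the PCOSense model itself.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in data (not the PCOS dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

base_models = {
    "rf": RandomForestClassifier(n_estimators=50, random_state=0),
    "gb": GradientBoostingClassifier(n_estimators=50, random_state=0),
}

# Out-of-fold probabilities: each row is scored by a model that never saw it
meta_features = np.column_stack([
    cross_val_predict(model, X_demo, y_demo, cv=5, method="predict_proba")[:, 1]
    for model in base_models.values()
])

# Passthrough-style input: base predictions followed by the original features
meta_input = np.hstack([meta_features, X_demo])

# The meta-learner weighs the base predictions and the raw features
meta_learner = LogisticRegression(max_iter=3000).fit(meta_input, y_demo)
print(dict(zip(base_models, meta_learner.coef_[0][:2].round(2))))
```

Using out-of-fold predictions matters: if the meta-learner were trained on predictions the base models made for their own training rows, it would learn to over-trust them.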


In [None]:
TARGET = "PCOS (Y/N)"
X = df_fe.drop(columns=[TARGET])
y = df_fe[TARGET]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=42
)

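# Standardize first: SMOTE interpolates between nearest neighbours,
# so distances should not be dominated by large-scale features
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Resample the training set only; the test set keeps its original class balance
smote = SMOTETomek(random_state=42)
X_train_scaled, y_train_bal = smote.fit_resample(X_train_std, y_train)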


In [None]:
rf = RandomForestClassifier(
    n_estimators=300, max_depth=7, class_weight="balanced", random_state=42
)

gb = GradientBoostingClassifier(
    n_estimators=250, learning_rate=0.05, random_state=42
)

stacked_model = StackingClassifier(
    estimators=[("rf", rf), ("gb", gb)],
    final_estimator=LogisticRegression(max_iter=3000),
    passthrough=True
)

stacked_model.fit(X_train_scaled, y_train_bal)


## Model Performance & Interpretability


In [None]:
# Predicted probability of the positive class, and labels at the default 0.5 threshold
y_prob = stacked_model.predict_proba(X_test_scaled)[:, 1]
y_pred = stacked_model.predict(X_test_scaled)

print("ROC-AUC:", roc_auc_score(y_test, y_prob))
print(classification_report(y_test, y_pred))

# ROC curve plotted against the chance diagonal
fpr, tpr, _ = roc_curve(y_test, y_prob)
plt.plot(fpr, tpr, label="PCOSense Model")
plt.plot([0, 1], [0, 1], linestyle="--", label="Chance")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.legend()
plt.show()
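
One possible way to add an interpretability step is permutation importance together with a confusion matrix. The sketch below is an assumption about how that could look, not something the notebook runs: it uses synthetic data and a plain random forest (`demo_model`) so that it stays self-contained.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in: with shuffle=False the first 3 columns are informative, the rest noise
X_demo, y_demo = make_classification(
    n_samples=400, n_features=6, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)
feature_names = [f"feat_{i}" for i in range(X_demo.shape[1])]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, test_size=0.25, random_state=0
)
demo_model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_te, demo_model.predict(X_te))
print(cm)

# Drop in held-out ROC-AUC when each feature is shuffled, averaged over repeats
perm = permutation_importance(
    demo_model, X_te, y_te, scoring="roc_auc", n_repeats=10, random_state=0
)
ranking = np.argsort(perm.importances_mean)[::-1]
for idx in ranking:
    print(f"{feature_names[idx]}: {perm.importances_mean[idx]:.3f}")
```

The same call should accept any fitted scikit-learn estimator, so in principle `stacked_model`, `X_test_scaled`, and `y_test` could be passed instead. Measuring importance on held-out data keeps the ranking tied to generalization rather than to what the model memorized.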


## Key Takeaways

- PCOS risk exists on a **continuum**
- Hormonal ratios such as LH/FSH can capture patterns that single markers miss
- Symptom burden is treated as a risk signal alongside lab values
- Explainable ensembles can improve trust
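
To illustrate the continuum idea, predicted probabilities can be mapped to bands instead of a yes/no label. The cut points and band names below are hypothetical placeholders chosen for the example, not clinically validated thresholds:

```python
import numpy as np
import pandas as pd

# Hypothetical predicted probabilities, standing in for the model's y_prob
demo_prob = np.array([0.05, 0.22, 0.48, 0.71, 0.93])

# Illustrative cut points only; NOT clinically validated thresholds
bins = [0.0, 0.33, 0.66, 1.0]
labels = ["lower", "moderate", "higher"]

# Assign each probability to a band; include_lowest keeps 0.0 inside the first band
risk_band = pd.cut(demo_prob, bins=bins, labels=labels, include_lowest=True)

for prob, band in zip(demo_prob, risk_band):
    print(f"{prob:.2f} -> {band}")
```

Probabilities from a model trained on resampled data are not necessarily calibrated, so any such bands would need checking against observed outcomes before being read as risk levels.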

---

## Why This Matters

This notebook is not just about prediction.

It is about **visibility**, **validation**, and **understanding** for millions
of individuals whose symptoms are often minimized.

PCOSense is my attempt to bridge data science with lived reality.
