# Data Cleaning & Standardization for Association Rule Mining (ARM)

This notebook prepares the raw dataset for association rule mining by **standardizing symptom text** and exporting:

- a **cleaned dataset** (`outputs/dataset_cleaned.csv`)
- **symptom-only transactions** (`outputs/transactions_symptoms.csv`)
- **symptom + disease transactions** (`outputs/transactions_symptoms_plus_disease.csv`)

The goal is to ensure that the same symptom is not counted as multiple different items due to formatting differences (underscores, stray spaces, inconsistent casing).

In [1]:
import pandas as pd
import numpy as np
import re
from pathlib import Path

# ---------- Paths ----------
# Raw input locations, checked in order
RAW_CANDIDATES = [
    Path("dataset.csv"),
    Path("/mnt/data/dataset.csv"),
]
# Fallback: a previously exported cleaned file (re-cleaning it is harmless)
CLEANED_FALLBACK = Path("outputs/dataset_cleaned.csv")

RAW_PATH = next((p for p in [*RAW_CANDIDATES, CLEANED_FALLBACK] if p.exists()), None)
if RAW_PATH is None:
    raise FileNotFoundError(
        "Could not find dataset.csv (or outputs/dataset_cleaned.csv). "
        "Put dataset.csv beside this notebook or in /mnt/data/."
    )

OUT_DIR = Path("outputs")
OUT_DIR.mkdir(parents=True, exist_ok=True)

OUT_CLEAN_DATASET = OUT_DIR / "dataset_cleaned.csv"
OUT_TRANSACTIONS_SYM = OUT_DIR / "transactions_symptoms.csv"
OUT_TRANSACTIONS_SYM_DISEASE = OUT_DIR / "transactions_symptoms_plus_disease.csv"

df = pd.read_csv(RAW_PATH)
symptom_cols = [c for c in df.columns if c.lower().startswith("symptom")]

print("Loaded:", RAW_PATH)
print("Rows:", df.shape[0], "| Diseases:", df["Disease"].nunique(), "| Symptom columns:", len(symptom_cols))
df.head()

Loaded: /mnt/data/dataset.csv
Rows: 4920 | Diseases: 41 | Symptom columns: 17


Unnamed: 0,Disease,Symptom_1,Symptom_2,Symptom_3,Symptom_4,Symptom_5,Symptom_6,Symptom_7,Symptom_8,Symptom_9,Symptom_10,Symptom_11,Symptom_12,Symptom_13,Symptom_14,Symptom_15,Symptom_16,Symptom_17
0,Fungal infection,itching,skin_rash,nodal_skin_eruptions,dischromic _patches,,,,,,,,,,,,,
1,Fungal infection,skin_rash,nodal_skin_eruptions,dischromic _patches,,,,,,,,,,,,,,
2,Fungal infection,itching,nodal_skin_eruptions,dischromic _patches,,,,,,,,,,,,,,
3,Fungal infection,itching,skin_rash,dischromic _patches,,,,,,,,,,,,,,
4,Fungal infection,itching,skin_rash,nodal_skin_eruptions,,,,,,,,,,,,,,


## Cleaning rules (what changes)

Each symptom cell is standardized to a consistent item label:

- trims leading/trailing whitespace
- converts underscores to spaces
- collapses runs of whitespace into a single space
- converts to lowercase

This prevents issues like `dischromic _patches` vs `dischromic_patches` being treated as different items.
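
As a quick illustration of why these rules matter, the sketch below applies the same four steps to a few hand-written variants of one label and counts distinct items before and after. The helper name `normalize_label` and the sample strings are illustrative assumptions, not values taken from the dataset.

```python
import re

def normalize_label(raw: str) -> str:
    """Apply the four rules above to one raw symptom string (illustrative helper)."""
    text = raw.strip()                 # trim leading/trailing whitespace
    text = text.replace("_", " ")      # underscores -> spaces
    text = re.sub(r"\s+", " ", text)   # collapse whitespace runs into one space
    return text.strip().lower()        # trim again (underscores at the edges), then lowercase

# Hypothetical variants of the same symptom, as they might appear in raw cells
variants = ["dischromic _patches", "dischromic_patches", " Dischromic  Patches "]

normalized = {normalize_label(v) for v in variants}
print(len(set(variants)), "raw items ->", len(normalized), "normalized item(s):", normalized)
```

Without normalization, an ARM algorithm would treat the three raw strings as three separate items and split their support between them.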

In [2]:
def clean_symptom_token(x):
    if pd.isna(x):
        return np.nan
    x = str(x).strip()
    x = x.replace("_", " ")
    x = re.sub(r"\s+", " ", x).strip()
    return x.lower()

df_clean = df.copy()

# Clean symptoms
for c in symptom_cols:
    df_clean[c] = df_clean[c].apply(clean_symptom_token)

# Clean disease labels (keep original casing, but strip spaces).
# .str.strip() keeps missing values as NaN instead of turning them into the string "nan".
df_clean["Disease"] = df_clean["Disease"].str.strip()

# Quick before/after snapshot (sample)
sample = df.loc[:12, ["Disease"] + symptom_cols[:5]].copy()
sample_after = df_clean.loc[:12, ["Disease"] + symptom_cols[:5]].copy()

display(pd.concat(
    {"Before": sample, "After": sample_after},
    axis=1
).head(8))

df_clean.head()

Unnamed: 0_level_0,Before,Before,Before,Before,Before,Before,After,After,After,After,After,After
Unnamed: 0_level_1,Disease,Symptom_1,Symptom_2,Symptom_3,Symptom_4,Symptom_5,Disease,Symptom_1,Symptom_2,Symptom_3,Symptom_4,Symptom_5
0,Fungal infection,itching,skin_rash,nodal_skin_eruptions,dischromic _patches,,Fungal infection,itching,skin rash,nodal skin eruptions,dischromic patches,
1,Fungal infection,skin_rash,nodal_skin_eruptions,dischromic _patches,,,Fungal infection,skin rash,nodal skin eruptions,dischromic patches,,
2,Fungal infection,itching,nodal_skin_eruptions,dischromic _patches,,,Fungal infection,itching,nodal skin eruptions,dischromic patches,,
3,Fungal infection,itching,skin_rash,dischromic _patches,,,Fungal infection,itching,skin rash,dischromic patches,,
4,Fungal infection,itching,skin_rash,nodal_skin_eruptions,,,Fungal infection,itching,skin rash,nodal skin eruptions,,
5,Fungal infection,skin_rash,nodal_skin_eruptions,dischromic _patches,,,Fungal infection,skin rash,nodal skin eruptions,dischromic patches,,
6,Fungal infection,itching,nodal_skin_eruptions,dischromic _patches,,,Fungal infection,itching,nodal skin eruptions,dischromic patches,,
7,Fungal infection,itching,skin_rash,dischromic _patches,,,Fungal infection,itching,skin rash,dischromic patches,,


Unnamed: 0,Disease,Symptom_1,Symptom_2,Symptom_3,Symptom_4,Symptom_5,Symptom_6,Symptom_7,Symptom_8,Symptom_9,Symptom_10,Symptom_11,Symptom_12,Symptom_13,Symptom_14,Symptom_15,Symptom_16,Symptom_17
0,Fungal infection,itching,skin rash,nodal skin eruptions,dischromic patches,,,,,,,,,,,,,
1,Fungal infection,skin rash,nodal skin eruptions,dischromic patches,,,,,,,,,,,,,,
2,Fungal infection,itching,nodal skin eruptions,dischromic patches,,,,,,,,,,,,,,
3,Fungal infection,itching,skin rash,dischromic patches,,,,,,,,,,,,,,
4,Fungal infection,itching,skin rash,nodal skin eruptions,,,,,,,,,,,,,,


## Quality checks

These checks give a quick profile of the cleaned dataset before it is used for ARM:

- number of unique symptoms after cleaning
- distribution of symptoms per transaction
- disease distribution (to understand class balance)


In [3]:
# Unique symptoms
sym_set = set()
for c in symptom_cols:
    sym_set.update([v for v in df_clean[c].dropna().unique().tolist() if isinstance(v, str) and v.strip()])
print("Unique symptoms (after cleaning):", len(sym_set))

# Symptoms per row (transaction size)
sizes = []
for _, row in df_clean[symptom_cols].iterrows():
    items = [x for x in row.values.tolist() if isinstance(x, str) and x.strip()]
    sizes.append(len(set(items)))

sizes_s = pd.Series(sizes, name="symptoms_per_row")
display(sizes_s.describe())

# Disease counts
disease_counts = df_clean["Disease"].value_counts()
print("Disease classes:", disease_counts.shape[0])
print("Min rows per disease:", int(disease_counts.min()), "| Max rows per disease:", int(disease_counts.max()))

disease_counts.head(10)

Unique symptoms (after cleaning): 131


count    4920.000000
mean        7.448780
std         3.592166
min         3.000000
25%         5.000000
50%         6.000000
75%        10.000000
max        17.000000
Name: symptoms_per_row, dtype: float64

Disease classes: 41
Min rows per disease: 120 | Max rows per disease: 120


Disease
Fungal infection                120
Hepatitis C                     120
Hepatitis E                     120
Alcoholic hepatitis             120
Tuberculosis                    120
Common Cold                     120
Pneumonia                       120
Dimorphic hemmorhoids(piles)    120
Heart attack                    120
Varicose veins                  120
Name: count, dtype: int64

## Export outputs

We export the cleaned dataset and two transaction files:

1) **Symptom-only transactions** (one basket per row; rows with fewer than two distinct symptoms are skipped)  
2) **Symptom + disease transactions** (adds an item like `disease::Diabetes`)

The second format is useful when mining rules of the form **Symptoms → Disease**.
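
The `disease::` prefix lets downstream code tell the label apart from the symptom items in the same basket. Below is a minimal sketch of splitting one `items` string back into symptoms and disease. It assumes the comma separator and `disease::` prefix used by the export cell in this notebook; the helper name `split_transaction` and the example string are hypothetical.

```python
DISEASE_PREFIX = "disease::"

def split_transaction(items_str: str):
    """Return (symptoms, disease) from one comma-separated items string."""
    items = [part.strip() for part in items_str.split(",") if part.strip()]
    symptoms = [it for it in items if not it.startswith(DISEASE_PREFIX)]
    diseases = [it[len(DISEASE_PREFIX):] for it in items if it.startswith(DISEASE_PREFIX)]
    # Each transaction is expected to carry at most one disease item
    disease = diseases[0] if diseases else None
    return symptoms, disease

# Hypothetical transaction string in the symptom + disease format
symptoms, disease = split_transaction("fatigue, weight loss, disease::Diabetes")
print(symptoms, "->", disease)
```

This sketch assumes that no symptom or disease label contains a comma; if one did, a different separator or a list-valued export would be safer.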


In [4]:
# Save cleaned dataset
df_clean.to_csv(OUT_CLEAN_DATASET, index=False)

# Build transactions
transactions_sym = []
transactions_sym_disease = []

for _, row in df_clean.iterrows():
    items = [x for x in row[symptom_cols].values.tolist() if isinstance(x, str) and x.strip()]
    items = sorted(set(items))
    # Symptom-only baskets need at least two items to support a rule
    if len(items) >= 2:
        transactions_sym.append(items)
    # Always include the disease-tagged transaction for symptom -> disease mining
    disease_item = f"disease::{row['Disease']}"
    transactions_sym_disease.append(items + [disease_item])

# Export as a single comma-separated string per transaction
df_sym = pd.DataFrame({
    "transaction_id": range(1, len(transactions_sym) + 1),
    "items": [", ".join(t) for t in transactions_sym]
})
df_sym.to_csv(OUT_TRANSACTIONS_SYM, index=False)

df_sym_dis = pd.DataFrame({
    "transaction_id": range(1, len(transactions_sym_disease) + 1),
    "items": [", ".join(t) for t in transactions_sym_disease]
})
df_sym_dis.to_csv(OUT_TRANSACTIONS_SYM_DISEASE, index=False)

print("Saved cleaned dataset:", OUT_CLEAN_DATASET.resolve())
print("Saved symptom transactions:", OUT_TRANSACTIONS_SYM.resolve())
print("Saved symptom+disease transactions:", OUT_TRANSACTIONS_SYM_DISEASE.resolve())

Saved cleaned dataset: /home/oai/outputs/dataset_cleaned.csv
Saved symptom transactions: /home/oai/outputs/transactions_symptoms.csv
Saved symptom+disease transactions: /home/oai/outputs/transactions_symptoms_plus_disease.csv


## Summary

- The dataset has been standardized so each symptom maps to a single consistent item label.
- Exports are saved in the `outputs/` folder for direct use in the ARM notebook.
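
For orientation, here is a minimal sketch of how a downstream notebook might read a transactions file and count item frequencies using only the standard library. The in-memory CSV text stands in for `outputs/transactions_symptoms.csv` and its rows are made up; only the `transaction_id` and `items` column names come from the export above.

```python
import csv
import io
from collections import Counter

# Stand-in for open("outputs/transactions_symptoms.csv"); rows are illustrative only
csv_text = (
    "transaction_id,items\n"
    '1,"itching, skin rash"\n'
    '2,"itching, nodal skin eruptions, skin rash"\n'
    '3,"nodal skin eruptions, skin rash"\n'
)

# Parse each quoted items field back into a list of item labels
transactions = [
    [item.strip() for item in row["items"].split(",")]
    for row in csv.DictReader(io.StringIO(csv_text))
]

# Support count = number of transactions that contain each item
support = Counter(item for basket in transactions for item in set(basket))
n = len(transactions)
for item, count in support.most_common():
    print(f"{item}: {count}/{n} = {count / n:.2f}")
```

Dividing each count by the number of transactions gives the item's support, which is the quantity a minimum-support threshold is compared against in ARM.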
