# LLM prompts

School of Epidemiology

Futuri Center, Białystok, October 22, 2025 

Wojciech Łaguna
 

## <span style="color: red;">Warning!</span>

The following materials are for data review purposes only. The results of the prompts **should not be used in scientific work under any circumstances**. The prompts should only be used for preliminary data review. Final statistics and data analysis should be performed by a professional statistician. 

## Prerequisites

1. [Anaconda](https://www.anaconda.com/download/success) (preferred) or standalone [Jupyter Notebook](https://jupyter.org/install) installed

2. Jupyter Lab selected (top-right corner of the image)
  
3. Access to **one of the following LLMs** (free or paid version). Some of them require registration:
   - [Claude](https://claude.ai/login?returnTo=%2F%3F)
   - [ChatGPT](https://chatgpt.com/)
   - [Gemini](https://gemini.google.com/)
   - [GROK](https://grok.com/)

## Data overview

**[Heart Disease Cleveland](https://www.kaggle.com/datasets/ritwikb3/heart-disease-cleveland/data)**

This is the Cleveland Heart Disease dataset from the UCI repository. It contains records for 303 individuals.


[Licence: CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/)

1. **Age**: Patient's age in years
2. **Sex**: Patient's sex (male: 1; female: 0)
3. **cp**: Type of chest pain experienced by the patient, in four categories:
    - 0: typical angina
    - 1: atypical angina
    - 2: non-anginal pain
    - 3: asymptomatic
4. **trestbps**: Resting blood pressure in mmHg
5. **chol**: Serum cholesterol in mg/dl
6. **fbs**: Fasting blood sugar > 120 mg/dl (1: true; 0: false)
7. **restecg**: Resting electrocardiogram result, in three categories:
   - 0: normal
   - 1: ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
   - 2: probable or definite left ventricular hypertrophy by Estes' criteria
8. **thalach**: Maximum heart rate achieved
9. **exang**: Exercise-induced angina (0: no; 1: yes)
10. **oldpeak**: ST depression induced by exercise relative to rest
11. **slope**: Slope of the ST segment at peak exercise
    - 0: upsloping
    - 1: flat
    - 2: downsloping
12. **ca**: Number of major vessels (0–3)
13. **thal**: A blood disorder called thalassemia
    - 0: NULL
    - 1: normal blood flow
    - 2: fixed defect (no blood flow in some part of the heart)
    - 3: reversible defect (blood flow is observed but it is not normal)
14. ***target***: The variable to predict. 1 means the patient has heart disease; 0 means the patient does not.

## Data loading

**PROMPT: I need to load a dataset into pandas from the path './Heart_disease_cleveland_new.csv'. Please assign it to the "df" variable.**

In [None]:
import pandas as pd

# Load the dataset from the local file path into the DataFrame 'df'
df = pd.read_csv('./Heart_disease_cleveland_new.csv')

# You can display the first few rows to verify the load (optional)
df.head()

**PROMPT: Can I see the table in a nicer layout?**

**PROMPT: What types of data do I have in my dataset?**

In [None]:
df.dtypes

## Data cleaning

**PROMPT: I know that "chol" should be a number, not "object." How can I check if this and other columns displaying the "object" type contain a random string?**

In [None]:
import numpy as np

# 1. Select all columns that are currently of 'object' dtype
object_cols = df.select_dtypes(include=['object']).columns

print("--- Inspection of 'Object' Columns for Non-Numeric Data ---")

if len(object_cols) == 0:
    print("No 'object' columns found in the DataFrame.")
else:
    # 2. Loop through each 'object' column and check for non-numeric values
    for col in object_cols:
        # Attempt to convert the column to a numeric type.
        # errors='coerce' turns any non-convertible string (like '?' or 'abc') into NaN.
        numeric_col = pd.to_numeric(df[col], errors='coerce')

        # Count the number of values that turned into NaN during conversion
        non_numeric_count = numeric_col.isnull().sum()

        if non_numeric_count > 0:
            print(f"\n🚨 Column '{col}' contains {non_numeric_count} non-numeric values.")

            # Find the original unique string values that caused the conversion to fail (NaN)
            # We filter the original column where the newly created numeric column is NaN
            non_numeric_values = df[col][numeric_col.isna()].unique()

            # The set should exclude 'np.nan' if it was already present as a string 'nan'
            cleaned_values = [
                val for val in non_numeric_values if str(val).lower() != 'nan' and pd.notna(val)
            ]

            if cleaned_values:
                 print(f"   Unique non-numeric strings found: {cleaned_values}")
            else:
                 print("   The non-numeric values appear to be empty strings or purely NaN values.")

        else:
            print(f"\n✅ Column '{col}' contains only numeric data, but is currently stored as 'object'.")
            print("   You can safely convert this column to a numeric type (float or integer).")

**PROMPT: How can I replace "unknown" by an empty value?**

In [None]:
import numpy as np

# Replace the string "unknown" with the standard pandas missing value NaN
df.replace("unknown", np.nan, inplace=True)

# You can verify the change by checking the number of missing values (optional)
print(df.isnull().sum())

In [None]:
df.dtypes

**PROMPT: I replaced, but "chol" is still an "object" in my dataframe.** 

In [None]:
import numpy as np

# Replace the common placeholder '?' with NaN
df.replace("?", np.nan, inplace=True)

# You should also re-run the inspection logic to find any *other* remaining non-numeric strings
print("\n--- Remaining Unique Values in 'chol' ---")
# Use the same logic as before to check for non-convertible strings
numeric_chol = pd.to_numeric(df['chol'], errors='coerce')
non_numeric_count = numeric_chol.isnull().sum()

if non_numeric_count > 0:
    print(f"🚨 'chol' still contains {non_numeric_count} non-numeric values (including NaNs).")
    non_numeric_values = df['chol'][numeric_chol.isna()].unique()
    
    # Filter out actual np.nan values to see the problematic strings
    problematic_strings = [
        val for val in non_numeric_values if pd.notna(val) and str(val).lower() != 'nan'
    ]

    if problematic_strings:
         print(f"   Remaining unique problematic strings: {problematic_strings}")
    else:
         print("   No other specific strings found; conversion should proceed with existing NaNs.")
else:
    print("✅ 'chol' now appears to contain only numeric values or standard NaNs.")

In [None]:
import pandas as pd
import numpy as np

# Convert 'chol' column to a numeric type (float64)
# errors='coerce' ensures that any remaining non-numeric strings are turned into np.nan
df['chol'] = pd.to_numeric(df['chol'], errors='coerce')

# Verify the data type
print(f"The new data type for 'chol' is: {df['chol'].dtype}")

# Optional: Check how many non-numeric values were converted to NaN
# This tells you how many problematic entries were cleaned up by the conversion
print(f"Number of missing values (NaNs) in 'chol' after conversion: {df['chol'].isnull().sum()}")

**PROMPT: How to discover outliers in my dataset?**

In [None]:
# 1. Calculate Q1, Q3, and IQR
Q1 = df['chol'].quantile(0.25)
Q3 = df['chol'].quantile(0.75)
IQR = Q3 - Q1

# 2. Define the outlier fences
lower_bound = Q1 - (1.5 * IQR)
upper_bound = Q3 + (1.5 * IQR)

# 3. Identify the outliers
outliers = df[(df['chol'] < lower_bound) | (df['chol'] > upper_bound)]

print(f"Number of outliers in 'chol': {len(outliers)}")
print("Outlier details (first 5):")
print(outliers[['chol']].head())

**PROMPT: I see that the maximum values for "chol" and "thalach" are outliers, as are the minimum values for "target." How do I replace them with an empty value?**

In [None]:
# Select all numeric columns
numeric_cols = df.select_dtypes(include=['int64', 'float64']).columns

In [None]:
import numpy as np
import pandas as pd

# Dictionary to store the count of outliers for each column
outlier_counts = {}

print("--- Outlier Discovery using 1.5 * IQR Rule ---")

# Loop through every numeric column
for col in numeric_cols:
    # Drop NaNs before calculating statistics to avoid errors
    data = df[col].dropna()

    if data.empty:
        print(f"Column '{col}' is empty or all NaN, skipping.")
        continue

    # Calculate Q1 (25th percentile) and Q3 (75th percentile)
    Q1 = data.quantile(0.25)
    Q3 = data.quantile(0.75)
    IQR = Q3 - Q1

    # Define the outlier fences (1.5 * IQR rule)
    lower_bound = Q1 - (1.5 * IQR)
    upper_bound = Q3 + (1.5 * IQR)

    # Count the number of outliers
    outliers = data[(data < lower_bound) | (data > upper_bound)]
    count = len(outliers)

    outlier_counts[col] = count

    if count > 0:
        print(f"🚨 Column '{col}': {count} outliers found.")
        # Optional: Print the min/max of the outliers to see their values
        print(f"   Min Outlier: {outliers.min():.2f}, Max Outlier: {outliers.max():.2f}")
    else:
        print(f"✅ Column '{col}': No outliers found.")

# Display a summary of all findings
print("\n--- Summary of Outliers per Column ---")
summary_df = pd.Series(outlier_counts, name='Outlier Count')
print(summary_df)

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# --- Assuming the DataFrame 'df' has been successfully loaded and cleaned ---

# 1. Select all numeric columns
numeric_cols = df.select_dtypes(include=np.number).columns
# Filter out columns that are all NaN after cleaning
final_cols_to_plot = [col for col in numeric_cols if not df[col].dropna().empty]

if final_cols_to_plot:
    num_cols = len(final_cols_to_plot)

    # Determine grid size for subplots
    n_rows = (num_cols + 2) // 3  # Roughly 3 columns per row
    n_cols = 3

    # Create figure and subplots
    fig, axes = plt.subplots(n_rows, n_cols, figsize=(5 * n_cols, 4 * n_rows))
    axes = axes.flatten() # Flatten the 2D array of axes for easy iteration

    for i, col in enumerate(final_cols_to_plot):
        # Use dropna() to ignore NaN values for the boxplot calculation
        axes[i].boxplot(df[col].dropna())
        axes[i].set_title(f'{col}', fontsize=12)
        # Remove x-axis tick labels for cleaner look on multiple plots
        axes[i].tick_params(axis='x', labelbottom=False)

    # Hide any unused subplots
    for j in range(len(final_cols_to_plot), len(axes)):
        fig.delaxes(axes[j])

    plt.suptitle('Box Plots for Outlier Visualization Across All Numeric Columns', fontsize=16, y=1.02)
    plt.tight_layout(rect=[0, 0.03, 1, 0.98])
    plt.show()  # Display the figure; use plt.savefig(...) instead to write it to a file

else:
    print("No suitable numeric columns found for plotting.")
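The cells above only detect and plot outliers; they do not yet replace them. A minimal sketch of the replacement step the prompt asks for, assuming the same 1.5 × IQR fences, could look like the following. The helper name `mask_iqr_outliers` and the demo values are hypothetical and are not taken from the workshop data.

```python
import numpy as np
import pandas as pd

def mask_iqr_outliers(frame, columns, k=1.5):
    """Return a copy of `frame` with IQR outliers in `columns` set to NaN.

    Hypothetical helper: `k` is the fence multiplier (1.5 is the usual rule).
    """
    result = frame.copy()
    for col in columns:
        q1 = result[col].quantile(0.25)
        q3 = result[col].quantile(0.75)
        iqr = q3 - q1
        lower, upper = q1 - k * iqr, q3 + k * iqr
        # Keep values inside the fences; everything else becomes NaN.
        # Existing NaNs stay NaN because between() returns False for them.
        result[col] = result[col].where(result[col].between(lower, upper))
    return result

# Tiny made-up example (not the workshop dataset)
demo = pd.DataFrame({
    "chol": [200, 210, 220, 230, 240, 900],
    "thalach": [150, 155, 160, 165, 170, 400],
})
demo_masked = mask_iqr_outliers(demo, ["chol", "thalach"])
print(demo_masked)
```

Listing the columns explicitly is a deliberate choice here: the IQR rule is not meaningful for coded variables such as a 0/1 target, so invalid codes in such a column are better handled by checking against the allowed values.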

**PROMPT: How many missing values are in my dataset?**

In [None]:
# 1. Check for null/missing values across the entire DataFrame
missing_values = df.isnull().sum()

print("--- Missing Values Count Per Column ---")
print(missing_values)

# 2. To get the total count across the entire dataset:
total_missing = missing_values.sum()
print(f"\nTotal missing values in the entire dataset: {total_missing}")

**PROMPT: How to remove all records that contain at least one empty value?**

In [None]:
# Create a new DataFrame containing only rows with NO missing values
df_cleaned = df.dropna(how='any')

# Optional: Print the original and new number of rows to see how many were removed
print(f"Original number of records: {len(df)}")
print(f"Number of records after removal of missing values: {len(df_cleaned)}")

# To apply the change directly to the original DataFrame 'df', use inplace=True
df.dropna(how='any', inplace=True)

In [None]:
df.info()

**PROMPT: How to check if there are any duplicates?**

In [None]:
# .duplicated() returns a boolean Series (True for duplicates)
# .sum() counts the number of 'True' values (which are 1)
total_duplicates = df.duplicated().sum()

print(f"Total number of duplicate rows found in the dataset: {total_duplicates}")

## Basic statistics

**PROMPT: How to obtain column names containing continuous and categorical values?**

In [None]:
# Identify columns by basic dtype
numeric_cols = df.select_dtypes(include=['int64', 'float64']).columns
object_cols = df.select_dtypes(include=['object']).columns

In [None]:
# Define the maximum number of unique values a variable can have to be considered categorical
# 20 is a common rule of thumb, but can be adjusted.
UNIQUE_VALUE_THRESHOLD = 20

# Initialize the final lists
continuous_vars = []
categorical_vars = list(object_cols) # All remaining object columns are assumed categorical

# Loop through all numeric columns to apply the heuristic
for col in numeric_cols:
    # nunique() ignores missing values by default, so NaNs do not count as a category
    if df[col].nunique() < UNIQUE_VALUE_THRESHOLD:
        categorical_vars.append(col)
    else:
        continuous_vars.append(col)

print("--- Final Classification of Variables ---")
print(f"Continuous Variables ({len(continuous_vars)}): \n{continuous_vars}")
print("-" * 30)
print(f"Categorical Variables ({len(categorical_vars)}): \n{categorical_vars}")

**PROMPT: How to display histograms for continuous data?**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# --- 1. Identify Continuous Variables (Based on Cleaning Steps) ---
# Assuming 'df' is the cleaned DataFrame where NaNs were dropped and types are correct.
UNIQUE_VALUE_THRESHOLD = 20
numeric_cols = df.select_dtypes(include=['int64', 'float64']).columns

continuous_vars = []
for col in numeric_cols:
    if df[col].nunique() >= UNIQUE_VALUE_THRESHOLD:
        continuous_vars.append(col)

print(f"Continuous Variables identified for plotting: {continuous_vars}")

# --- 2. Generate Histograms ---
if continuous_vars:
    num_vars = len(continuous_vars)
    # Set up a grid of subplots (e.g., 2 histograms per row)
    n_rows = (num_vars + 1) // 2
    n_cols = 2

    # Create the figure and axes
    fig, axes = plt.subplots(n_rows, n_cols, figsize=(12, 5 * n_rows))
    axes = axes.flatten() # Flattens the axes array for easy iteration

    for i, col in enumerate(continuous_vars):
        # Plot the histogram using 15 bins and a slight alpha for better viewing
        axes[i].hist(df[col], bins=15, edgecolor='black', alpha=0.7)
        axes[i].set_title(f'Distribution of {col}', fontsize=14)
        axes[i].set_xlabel(col)
        axes[i].set_ylabel('Frequency')

    # Hide any unused subplots
    for j in range(len(continuous_vars), len(axes)):
        fig.delaxes(axes[j])

    plt.suptitle('Histograms of Continuous Variables', fontsize=18, y=1.02)
    plt.tight_layout(rect=[0, 0.03, 1, 0.98])
    plt.savefig('continuous_histograms.png')
    # plt.show() # Use plt.show() to display the plot interactively

**PROMPT: How to get basic statistics for continuous data?**

In [None]:
import pandas as pd
import numpy as np

# --- 1. Identify Continuous Variables (Based on Cleaning Steps) ---
# Assuming 'df' is the cleaned DataFrame where NaNs were dropped and types are correct.
UNIQUE_VALUE_THRESHOLD = 20
numeric_cols = df.select_dtypes(include=['int64', 'float64']).columns

continuous_vars = []
for col in numeric_cols:
    if df[col].nunique() >= UNIQUE_VALUE_THRESHOLD:
        continuous_vars.append(col)

# --- 2. Calculate Basic Statistics ---
if continuous_vars:
    # Use .describe() on the subset of continuous columns
    continuous_data_stats = df[continuous_vars].describe()

    # Print the resulting table
    print("\n--- Basic Statistics for Continuous Variables ---")
    print(continuous_data_stats)
else:
    print("No continuous variables found for statistical analysis.")

**PROMPT: How to visualize columns with categorical values?**

In [None]:
# Assuming 'df' is the cleaned DataFrame
UNIQUE_VALUE_THRESHOLD = 20
numeric_cols = df.select_dtypes(include=['int64', 'float64']).columns

categorical_vars = []
for col in numeric_cols:
    if df[col].nunique() < UNIQUE_VALUE_THRESHOLD:
        categorical_vars.append(col)

# In your dataset, this list likely includes columns like 'sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'ca', 'thal', and 'target'.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Assuming 'categorical_vars' is the list generated above

if categorical_vars:
    num_vars = len(categorical_vars)
    # Set up a grid of subplots (e.g., 3 plots per row)
    n_rows = (num_vars + 2) // 3  # Ceiling division, so every variable gets a subplot
    n_cols = 3

    fig, axes = plt.subplots(n_rows, n_cols, figsize=(15, 5 * n_rows))
    axes = axes.flatten()

    for i, col in enumerate(categorical_vars):
        # Use seaborn's countplot for a clean bar visualization
        sns.countplot(x=col, data=df, ax=axes[i], palette='viridis')
        axes[i].set_title(f'Frequency of {col}', fontsize=14)
        axes[i].set_xlabel(col)
        axes[i].set_ylabel('Count')

    # Hide any unused subplots
    for j in range(len(categorical_vars), len(axes)):
        fig.delaxes(axes[j])

    plt.suptitle('Count Plots of Categorical Variables', fontsize=18, y=1.02)
    plt.tight_layout(rect=[0, 0.03, 1, 0.98])
    plt.show() # Display the plots

**PROMPT: How to get basic statistics for categorical variables?**

In [None]:
# Assuming 'categorical_vars' is the list of column names identified earlier
if categorical_vars:
    # Use include='all' to get descriptive stats for categorical columns
    categorical_stats = df[categorical_vars].describe(include='all')

    print("--- Descriptive Statistics for Categorical Variables ---")
    print(categorical_stats)
else:
    print("No categorical variables found for statistical analysis.")

In [None]:
print("\n--- Value Counts (Frequency Distribution) for Each Categorical Variable ---")

for col in categorical_vars:
    print(f"\n--- Column: {col} ---")

    # Get the count of each unique value
    counts = df[col].value_counts()
    print(counts)

    # Optional: Get the percentage distribution
    percentages = df[col].value_counts(normalize=True) * 100
    print("\nPercentage Distribution:")
    print(percentages.round(1))

**PROMPT: Can I get a table showing the numbers and percentages for each categorical variable?**
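One possible answer to this prompt is a single long table with one row per variable and category. The sketch below assumes the categorical columns hold small integer codes, as in the data overview; the helper name `categorical_summary` and the demo values are hypothetical.

```python
import pandas as pd

def categorical_summary(frame, columns):
    """Build one long table with columns: variable, category, n, percent."""
    parts = []
    for col in columns:
        # dropna=False keeps missing values visible as their own category
        counts = frame[col].value_counts(dropna=False).sort_index()
        parts.append(pd.DataFrame({
            "variable": col,
            "category": counts.index,
            "n": counts.to_numpy(),
            "percent": (counts.to_numpy() / len(frame) * 100).round(1),
        }))
    return pd.concat(parts, ignore_index=True)

# Tiny made-up example (not the workshop dataset)
demo = pd.DataFrame({"sex": [1, 1, 0, 1], "cp": [0, 2, 2, 3]})
summary = categorical_summary(demo, ["sex", "cp"])
print(summary)
```

In the notebook, the same function could be called as `categorical_summary(df, categorical_vars)`, assuming `categorical_vars` was built by one of the earlier cells.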

**PROMPT: Can I see a correlation matrix for my data?**
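A minimal sketch of what an answer might contain, using only pandas. The demo frame is randomly generated for illustration and is an assumption, not the workshop data; in the notebook the same call would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Made-up data: thalach is constructed to fall with age, chol is independent noise
rng = np.random.default_rng(0)
age = rng.integers(30, 80, size=50)
demo = pd.DataFrame({
    "age": age,
    "thalach": 220 - age + rng.normal(0, 5, size=50),
    "chol": rng.normal(240, 40, size=50),
})

# Restrict to numeric columns, then compute pairwise Pearson correlations
corr = demo.select_dtypes(include="number").corr(method="pearson").round(3)
print(corr)
```

Passing `method="spearman"` instead gives rank correlations, which may suit ordinal codes such as `cp` or `slope` better than Pearson's coefficient.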

**PROMPT: I do not see "sex" in the list (Male, Female), how to include them?**

## Statistics on groups

**PROMPT: Can you show basic statistics per "sex"?**

In [None]:
import pandas as pd
import numpy as np

# Assuming 'df' is the cleaned DataFrame from previous steps

# 1. Identify Continuous Variables (Heuristic)
UNIQUE_VALUE_THRESHOLD = 20
numeric_cols = df.select_dtypes(include=['int64', 'float64']).columns

continuous_vars = []
for col in numeric_cols:
    if df[col].nunique() >= UNIQUE_VALUE_THRESHOLD:
        continuous_vars.append(col)

# 2. Perform Grouped Descriptive Statistics
if 'sex' in df.columns and continuous_vars:
    # Select only the continuous columns and the grouping column ('sex')
    grouped_stats = df.groupby('sex')[continuous_vars].describe()

    print("\n--- Basic Statistics Grouped by 'sex' ---")
    # Display the resulting multi-level indexed table
    print(grouped_stats)
else:
    print("The 'sex' column or continuous variables are not available in the DataFrame.")

**PROMPT: Can you show how many Men have "chol" level above the median for whole dataset?**

In [None]:
import pandas as pd
import numpy as np

# --- 1. Load and Clean the DataFrame (Full Preprocessing) ---
try:
    df = pd.read_csv('./Heart_disease_cleveland_new.csv')
except FileNotFoundError:
    print("Error: The file './Heart_disease_cleveland_new.csv' was not found. Please ensure it is uploaded.")
    raise  # Stop this cell here; calling exit() would shut down the notebook kernel

# a. Replace missing value placeholders ("unknown" and "?") with NaN
df.replace(["unknown", "?"], np.nan, inplace=True)

# b. Convert columns to numeric where possible (like 'chol')
for col in df.columns:
    if df[col].dtype == 'object':
        df[col] = pd.to_numeric(df[col], errors='coerce')

# c. Remove all records containing at least one empty value (dropna)
df.dropna(how='any', inplace=True)

# --- 2. Calculate the Overall Median for 'chol' ---
median_chol = df['chol'].median()

# --- 3. Filter and Count ---
# Standard coding for 'sex' in this dataset is typically 1.0 for Men
men_sex_code = 1.0

# 3.1 Filter for Men (sex == 1.0)
men_df = df[df['sex'] == men_sex_code]

# 3.2 Filter for those with 'chol' > overall median
men_high_chol = men_df[men_df['chol'] > median_chol]

# 3.3 Get the final count
count_men_high_chol = len(men_high_chol)

# --- 4. Display Results ---
print("--- Cholesterol Analysis ---")
print(f"Overall Median Cholesterol (chol) for the cleaned dataset: {median_chol:.2f}")
print(f"Number of Men (sex={men_sex_code}) with cholesterol above the overall median: {count_men_high_chol} 👨")

**PROMPT: Show me the records of men who have "chol" levels above 200**

**PROMPT: How to save selected Men to a csv file?**
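The two prompts above (selecting records and saving them) could be answered roughly as follows. This is a sketch on a made-up frame; the output file name `men_high_chol_demo.csv` and the use of the system temporary directory are assumptions chosen so the example does not overwrite anything.

```python
import os
import tempfile
import pandas as pd

# Tiny made-up example (not the workshop dataset); sex == 1 means male
demo = pd.DataFrame({
    "age": [63, 41, 57, 50],
    "sex": [1, 0, 1, 1],
    "chol": [233, 204, 180, 260],
})

# Combine both conditions with & and wrap each one in parentheses
men_high_chol = demo[(demo["sex"] == 1) & (demo["chol"] > 200)]
print(men_high_chol)

# index=False leaves the pandas row index out of the file
out_path = os.path.join(tempfile.gettempdir(), "men_high_chol_demo.csv")
men_high_chol.to_csv(out_path, index=False)

# Read the file back to confirm what was written
reloaded = pd.read_csv(out_path)
print(reloaded)
```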

**PROMPT: Can you show a plot of "chol" vs "age" regarding "target"?**

## Statistical tests

**PROMPT: Can you show me how to check if there is statistical significance between men and women for each variable?**

In [None]:
import pandas as pd
import numpy as np
from scipy.stats import ttest_ind, chi2_contingency

# --- 1. Load and Clean the DataFrame (Full Preprocessing) ---
# Assuming 'df' is the cleaned DataFrame where NaNs were dropped and types are correct.
# (The code block above includes the necessary loading and cleaning steps.)

# --- 2. Identify Variables (Continuous vs. Categorical) ---
UNIQUE_VALUE_THRESHOLD = 20
numeric_cols = [col for col in df.select_dtypes(include=['int64', 'float64']).columns if col != 'sex']

continuous_vars = []
categorical_vars = []

for col in numeric_cols:
    if df[col].nunique() >= UNIQUE_VALUE_THRESHOLD:
        continuous_vars.append(col)
    else:
        categorical_vars.append(col)

# --- 3. Perform Statistical Tests ---
results = {}
alpha = 0.05
group_men = df[df['sex'] == 1.0] # Assuming 1.0 is Men
group_women = df[df['sex'] == 0.0] # Assuming 0.0 is Women

# A. T-test for Continuous Variables
for col in continuous_vars:
    # Compares the mean of the variable in the Men group vs. the Women group
    stat, p_value = ttest_ind(group_men[col], group_women[col], equal_var=True)

    significance = "Statistically Significant" if p_value < alpha else "Not Significant"
    results[col] = {
        'Test': 'T-test (Means)',
        'p-value': p_value,
        'Significance': significance
    }

# B. Chi-Square Test of Independence for Categorical Variables
for col in categorical_vars:
    # --- ADD THIS CHECK ---
    # Chi-Square requires at least 2 unique values for the column AND the 'sex' column
    if df[col].nunique() < 2:
        print(f"⚠️ Skipping Chi-Square test for '{col}': Only {df[col].nunique()} unique value(s) remain.")
        results[col] = {
            'Test': 'Chi-Square (Skipped)',
            'p-value': np.nan,
            'Significance': 'Not Enough Variation'
        }
        continue # Skip to the next column
    # -----------------------
    
    # Create the contingency table
    contingency_table = pd.crosstab(df['sex'], df[col])
    
    # Check for empty table (secondary check, mostly handled by the above nunique check)
    if contingency_table.size == 0:
        continue # Should not happen if nunique >= 2

    # Perform the chi-square test
    stat, p_value, dof, expected = chi2_contingency(contingency_table)

    significance = "Statistically Significant" if p_value < alpha else "Not Significant"
    results[col] = {
        'Test': 'Chi-Square (Association)',
        'p-value': p_value,
        'Significance': significance
    }
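The cell above fills the `results` dictionary but never displays it. One way to view it as a table is sketched below; the dictionary here is a hypothetical stand-in with invented p-values so that the example runs on its own.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the `results` dictionary built in the cell above
results = {
    "age": {"Test": "T-test (Means)", "p-value": 0.0912,
            "Significance": "Not Significant"},
    "cp": {"Test": "Chi-Square (Association)", "p-value": 0.0031,
           "Significance": "Statistically Significant"},
    "fbs": {"Test": "Chi-Square (Skipped)", "p-value": np.nan,
            "Significance": "Not Enough Variation"},
}

# One row per variable, smallest p-value first, skipped tests at the bottom
results_df = (
    pd.DataFrame.from_dict(results, orient="index")
    .rename_axis("variable")
    .sort_values("p-value", na_position="last")
)
results_df["p-value"] = results_df["p-value"].round(3)
print(results_df)
```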

**PROMPT: How to create two tables containing general statistics: total, men, women. Number and percentage for categorical variables, median and IQR for continuous variables? Calculate the "p-value" in each table. All floating point numbers should be rounded to three decimal places.**

**PROMPT: How to make a "p" correction? Justify your choice of method.**
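An LLM may suggest several correction methods here. As one illustration, the sketch below implements the Benjamini-Hochberg procedure with NumPy only. Benjamini-Hochberg controls the false discovery rate and is often chosen for exploratory screening of many variables, whereas Bonferroni and Holm control the family-wise error rate and are stricter. The function name and the example p-values are hypothetical.

```python
import numpy as np

def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)                       # indices of p-values, ascending
    scaled = p[order] * m / np.arange(1, m + 1)  # p_(i) * m / i
    # Enforce monotonicity: walk from the largest p-value down, keeping the running minimum
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(scaled, 0.0, 1.0)
    return adjusted

# Made-up p-values for illustration
raw_p = [0.01, 0.04, 0.03, 0.005]
adj_p = benjamini_hochberg(raw_p)
print(adj_p)
```

For these four values the adjusted p-values are 0.02, 0.04, 0.04 and 0.02, so all four would still fall below a 0.05 threshold.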

## Discovering risk factors

**PROMPT: How to discover risk factors for my "target"?**
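A first, very simple step an answer might include is an unadjusted odds ratio for one binary factor against the target. The sketch below assumes both columns are coded 0/1 and uses a Woolf (log-based) 95% confidence interval; the helper name `odds_ratio` and the demo counts are invented for illustration.

```python
import numpy as np
import pandas as pd

def odds_ratio(frame, exposure, outcome, z=1.96):
    """Unadjusted odds ratio with a Woolf confidence interval.

    Assumes both columns are coded 0/1 and that no cell of the
    2x2 table is zero (a zero cell would cause a division by zero).
    """
    table = pd.crosstab(frame[exposure], frame[outcome])
    a = table.loc[1, 1]  # exposed, outcome present
    b = table.loc[1, 0]  # exposed, outcome absent
    c = table.loc[0, 1]  # unexposed, outcome present
    d = table.loc[0, 0]  # unexposed, outcome absent
    estimate = (a * d) / (b * c)
    se_log = np.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    low = np.exp(np.log(estimate) - z * se_log)
    high = np.exp(np.log(estimate) + z * se_log)
    return estimate, low, high

# Made-up counts: 30/10 among the exposed, 20/40 among the unexposed
demo = pd.DataFrame({
    "exang": [1] * 40 + [0] * 60,
    "target": [1] * 30 + [0] * 10 + [1] * 20 + [0] * 40,
})
or_est, ci_low, ci_high = odds_ratio(demo, "exang", "target")
print(f"OR = {or_est:.2f}, 95% CI {ci_low:.2f} to {ci_high:.2f}")
```

An unadjusted odds ratio ignores confounding by the other variables, so it only ranks candidate factors for a later multivariable analysis.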

**PROMPT: How to discover feature importance regarding "target"?**

## BONUS. Advanced prompt (don't expect it to produce the expected outcome)

**PROMPT**:

I have a cleaned and prepared Heart Disease Cleveland dataset ("df") (303 patients, 14 clinical variables). Now I need to conduct exploratory data analysis to formulate evidence-based research hypotheses for epidemiological investigation.

Dataset context:
- Cleaned dataset ready for analysis
- Target: heart disease presence (binary)
- 14 clinical predictors including demographics, symptoms, and test results
- Balanced analysis needed for hypothesis generation

Please create Python code that performs comprehensive exploratory analysis to generate research hypotheses:

1. **Univariate Analysis by Heart Disease Status:**
   - **Demographics:**
     * Age distribution: histograms, box plots, statistical tests (t-test/Mann-Whitney)
     * Sex distribution: cross-tabulation, chi-square test
     * Calculate mean age difference between diseased/healthy
     * Assess age-sex interaction patterns
   
   - **Clinical Symptoms:**
     * Chest pain types (cp) distribution by heart disease status
     * Exercise-induced angina (exang) prevalence comparison
     * Statistical significance testing for symptom-disease associations
   
   - **Clinical Measurements:**
     * Blood pressure (trestbps): distribution analysis, clinical cut-points
     * Cholesterol (chol): distribution analysis, clinical categories
     * Maximum heart rate (thalach): age-adjusted analysis
     * Fasting blood sugar (fbs) prevalence comparison
   
   - **Diagnostic Tests:**
     * Resting ECG (restecg) abnormalities by disease status
     * ST depression (oldpeak) patterns and clinical significance
     * ST segment slope (slope) distribution analysis
     * Major vessels (ca) involvement patterns
     * Thalassemia (thal) defect types by disease status

2. **Bivariate Analysis and Correlation:**
   - Create comprehensive correlation matrix for numerical variables
   - Cross-tabulation tables for categorical variables
   - Point-biserial correlations for categorical-numerical pairs
   - Identify strongest associations with heart disease
   - Clinical interpretation of correlation patterns

3. **Advanced Visualization for Hypothesis Generation:**
   - **Age-Sex stratified analysis:**
     * Heart disease prevalence by age groups and sex
     * Age-sex interaction heatmaps
     * Stacked bar charts showing disease patterns
   
   - **Risk Factor Combination Analysis:**
     * Multiple chest pain symptoms (cp + exang)
     * Blood pressure and cholesterol combined effects
     * Age with clinical measurements interactions
   
   - **Diagnostic Test Patterns:**
     * ECG abnormalities with ST depression correlation
     * Vessel involvement (ca) with other cardiac markers
     * Thalassemia types with disease severity indicators

4. **Statistical Evidence Summary for Hypothesis Formulation:**
   - Calculate effect sizes (Cohen's d, Cramer's V, odds ratios)
   - Identify variables with strongest associations (p<0.05, effect size >0.3)
   - Rank variables by predictive potential
   - Document clinical plausibility of observed associations

5. **Generate Formal Research Hypotheses:**
   Based on the exploratory analysis, formulate:
   
   **Primary Hypothesis:**
   - Main research question with strongest evidence
   - Specify direction and magnitude of expected effect
   
   **Secondary Hypotheses (3-4):**
   - Additional hypotheses supported by exploratory findings
   - Include interaction hypotheses if patterns suggest them
   
   **Each hypothesis should specify:**
   - Null hypothesis (H₀)
   - Alternative hypothesis (H₁)
   - Expected effect direction and magnitude
   - Clinical/biological rationale
   - Statistical test to be used for verification
   - Required assumptions for the test

6. **Create Evidence Summary Tables:**
   - Summary table of all associations found (variable, effect size, p-value, clinical interpretation)
   - Risk factor ranking by strength of association
   - Suggested analysis plan for hypothesis testing
   - Power analysis estimates for main hypotheses

Please ensure all analyses are clinically grounded and provide epidemiologically meaningful interpretations. Include detailed visualizations that support hypothesis formation and statistical evidence for each proposed hypothesis.