# End-to-End Data Analysis in Python
## Healthcare Stroke Dataset

This notebook walks through a complete data analysis pipeline:
1. Load healthcare data from a CSV file
2. Inspect & clean the data
3. Perform exploratory data analysis (EDA)
4. Create visualizations
5. Save the cleaned data

Dataset: **Healthcare Stroke Prediction Dataset** (roughly 5,100 rows and 12 columns; Section 3 confirms the exact shape with `df.shape`).

## 1. Setup: Import Libraries

We import the core Python libraries used for data analysis:
- `numpy` for numerical operations
- `pandas` for data loading and manipulation
- `matplotlib.pyplot` and `seaborn` for visualizations

We also set some display options so DataFrames and plots are easier to read in Colab.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

pd.set_option('display.max_columns', 50)
pd.set_option('display.max_rows', 100)
pd.set_option('display.float_format', lambda x: f"{x:,.2f}")

# set_theme is the preferred name; sns.set is an older alias for it
sns.set_theme(style='whitegrid')
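
The display options above stay in effect for the whole session. As a small sketch of an alternative, `pd.option_context` applies an option only inside a `with` block and restores the previous value afterwards. The `demo` DataFrame below is made up for illustration.

```python
import pandas as pd

# Toy frame, made up for illustration only
demo = pd.DataFrame({"value": [1234.5678, 0.5]})

before = pd.get_option("display.float_format")

# The option set here reverts automatically when the block ends
with pd.option_context("display.float_format", lambda x: f"{x:,.1f}"):
    inside = demo.to_string()

after = pd.get_option("display.float_format")
print(inside)
```

This can be handy when one table needs different formatting without changing how every later DataFrame is shown.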

### Exercises – Section 1
1. Add another pandas display option to set the display width (hint: `display.width`).
2. Change the seaborn style from `'whitegrid'` to `'darkgrid'` and re-run a plot later.

## 2. Load the Healthcare Stroke Dataset

In Colab, we typically upload a CSV file from our local machine and then read it with `pandas.read_csv()`.
We assume the file is named **`healthcare-dataset-stroke-data.csv`**.

In [None]:
from google.colab import files

print('Please upload healthcare-dataset-stroke-data.csv')
uploaded = files.upload()

file_name = 'healthcare-dataset-stroke-data.csv'  # ensure this matches the uploaded file name
df = pd.read_csv(file_name)
df.head()
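
The cell above reads the file by name from Colab's working directory. As a hedged sketch of a more defensive variant, the helper below reads the CSV straight from the upload result, assuming `files.upload()` returns a dict that maps each uploaded file name to its contents as bytes. The helper name `read_uploaded_csv` and the stand-in dict `fake_upload` are hypothetical and exist only for this example.

```python
import io
import pandas as pd

def read_uploaded_csv(uploaded, expected_name):
    """Read one CSV from a {file_name: bytes} mapping (hypothetical helper)."""
    if expected_name not in uploaded:
        raise FileNotFoundError(
            f"{expected_name!r} not found; uploaded files: {sorted(uploaded)}"
        )
    return pd.read_csv(io.BytesIO(uploaded[expected_name]))

# Stand-in for the dict returned by files.upload(); the contents are made up
fake_upload = {"demo.csv": b"id,age\n1,67\n2,45\n"}
demo_df = read_uploaded_csv(fake_upload, "demo.csv")
print(demo_df.shape)
```

Checking the name first gives a clearer message than the default error when the uploaded file was renamed, for example by a browser that appended ` (1)` to a duplicate.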

### Exercises – Section 2
1. Show the first 10 rows instead of 5 using `df.head(10)`.
2. Use `df.sample(5)` to view 5 random rows. Why might this be useful?
3. Intentionally change `file_name` to something incorrect and run the cell. What error do you see?

## 3. First Look at the Data

We now inspect the basic structure of the dataset:
- Number of rows and columns (`df.shape`)
- Column names (`df.columns`)
- Data types and non-null counts (`df.info()`)
- Summary statistics for numeric columns (`df.describe()`).

In [None]:
print('Shape (rows, columns):', df.shape)

print('\nColumn names:')
print(df.columns)

print('\nData types and non-null counts:')
df.info()

print('\nSummary statistics (numeric columns):')
df.describe()

### Exercises – Section 3
1. How many rows and columns does the dataset have?
2. From `df.info()`, list which columns are numeric and which are `object` (string-like).
3. Run `df.describe(include='all')` in a new cell. What extra information do you get?

## 4. Inspect Raw Data & Categorical Columns

Before cleaning, it's helpful to look at actual values to spot issues such as strange strings or inconsistent categories.

In [None]:
# First 10 rows
df.head(10)

In [None]:
# Random sample of 5 rows
df.sample(5, random_state=42)

In [None]:
# Summary statistics for all columns (numeric + non-numeric)
df.describe(include='all')

### Exercises – Section 4
1. Use `df.tail(8)` to view the last 8 rows. Do they look similar to the first rows?
2. Check unique values of `gender` using `df['gender'].unique()`.
3. Check the frequency of `work_type` using `df['work_type'].value_counts()`.

## 5. Missing Values & Unique Values

We want to know where data is missing and how many unique values each column has.

In [None]:
print('Missing values per column:')
print(df.isna().sum())

print('\nPercentage of missing values per column:')
print((df.isna().mean() * 100).round(2))

print('\nNumber of unique values per column:')
print(df.nunique())
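
The three printouts above can also be combined into a single table, which makes columns easier to compare. The following is a minimal sketch on a made-up `toy` DataFrame; the column values are illustrative and not taken from the real dataset.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration only
toy = pd.DataFrame({
    "age": [67, 45, 80, 33],
    "bmi": [36.6, np.nan, 32.5, np.nan],
    "smoking_status": ["never smoked", "Unknown", "smokes", "Unknown"],
})

# One row per column: missing count, missing percentage, unique values
summary = pd.DataFrame({
    "missing": toy.isna().sum(),
    "missing_pct": (toy.isna().mean() * 100).round(2),
    "n_unique": toy.nunique(),
}).sort_values("missing_pct", ascending=False)
print(summary)
```

Note that `isna()` only counts real missing values. A placeholder string such as `'Unknown'` in the toy data is treated as an ordinary category, and `nunique()` ignores missing values by default.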

### Exercises – Section 5
1. Which column has the highest percentage of missing values?
2. Approximately what percentage of `bmi` values are missing?
3. How many unique categories does `smoking_status` have?

## 6. Basic Data Cleaning

We will:
1. Clean column names (lowercase, replace spaces with underscores)
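2. Make sure placeholder strings in `bmi` (e.g., 'N/A') become real missing values. `pandas.read_csv()` already treats 'N/A' as missing by default, so this step is mainly a safeguard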
3. Convert `bmi` and `avg_glucose_level` to numeric types.

In [None]:
# 1. Clean column names
df.columns = (
    df.columns
      .str.strip()
      .str.lower()
      .str.replace(' ', '_', regex=False)
      .str.replace('-', '_', regex=False)
)
df.columns
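
To see what each step of the chain does, here is a small sketch applied to made-up column names that show the problems the chain targets (stray spaces, mixed case, spaces and hyphens inside names). These names are illustrative, not the dataset's actual headers.

```python
import pandas as pd

# Made-up column names with the kinds of problems the chain targets
raw_columns = pd.Index([" Residence_type", "avg glucose-level", "BMI "])

cleaned = (
    raw_columns
      .str.strip()                          # drop leading/trailing spaces
      .str.lower()                          # consistent lowercase
      .str.replace(" ", "_", regex=False)   # inner spaces -> underscores
      .str.replace("-", "_", regex=False)   # hyphens -> underscores
)
print(list(cleaned))
# → ['residence_type', 'avg_glucose_level', 'bmi']
```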

In [None]:
# Inspect unique values in bmi
df['bmi'].unique()[:20]

In [None]:
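# Replace any remaining 'N/A' strings with NaN and convert to numeric.
# With default read_csv settings 'N/A' is already parsed as NaN, so the replace is a harmless safeguard.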
df['bmi'] = df['bmi'].replace('N/A', np.nan)
df['bmi'] = pd.to_numeric(df['bmi'], errors='coerce')
df['avg_glucose_level'] = pd.to_numeric(df['avg_glucose_level'], errors='coerce')
df[['bmi', 'avg_glucose_level']].info()
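
The following sketch shows why the `'N/A'` replacement often has nothing left to do: `'N/A'` is on pandas' built-in list of missing-value markers, so `read_csv` converts it to `NaN` unless the defaults are switched off with `keep_default_na=False`. The CSV text is made up for illustration.

```python
import io
import pandas as pd

# Made-up CSV text with an 'N/A' marker in the bmi column
csv_text = "id,bmi\n1,36.6\n2,N/A\n3,32.5\n"

# Default parsing: 'N/A' becomes NaN and bmi is read as a float column
parsed = pd.read_csv(io.StringIO(csv_text))

# With the default markers switched off, 'N/A' survives as a string
raw = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

# errors='coerce' turns anything non-numeric into NaN
coerced = pd.to_numeric(raw["bmi"], errors="coerce")

print(parsed["bmi"].isna().sum(), raw["bmi"].tolist(), coerced.isna().sum())
```

Either route ends with one missing value, which is why `pd.to_numeric(..., errors='coerce')` is a reasonable safety net even when the column already looks numeric.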

### Exercises – Section 6
1. Add another transformation to the column-name cleaning chain if needed.
2. Ensure `age` is numeric using `pd.to_numeric(df['age'], errors='coerce')`.
3. Create a new column `age_decade` using `df['age_decade'] = df['age'] // 10`.

## 7. Handling Missing Values

We will impute missing numeric values with the **median**, which is robust to outliers.

In [None]:
# Check missing values again
df.isna().sum()

In [None]:
# Median imputation for bmi and avg_glucose_level
bmi_median = df['bmi'].median()
agl_median = df['avg_glucose_level'].median()
df['bmi'] = df['bmi'].fillna(bmi_median)
df['avg_glucose_level'] = df['avg_glucose_level'].fillna(agl_median)
df.isna().sum()
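
To illustrate why the median is described as robust to outliers, here is a small sketch on made-up values with one extreme entry and one missing entry. The numbers are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up values with one extreme outlier and one missing entry
values = pd.Series([22.0, 25.0, 27.0, 30.0, 95.0, np.nan])

mean_value = values.mean()      # pulled upward by the outlier
median_value = values.median()  # middle of the sorted non-missing values

# Fill the gap with the median, as in the cell above
filled = values.fillna(median_value)
print(mean_value, median_value)
```

Here the single value of 95 drags the mean to about 39.8, while the median stays at 27.0, which is closer to what a typical row looks like.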

### Exercises – Section 7
1. Compare the number of missing `bmi` values before and after imputation.
2. Compute the mean and median of `bmi`. Why might we prefer the median?
3. For a categorical column like `smoking_status`, what imputation strategy would you use?

## 8. Converting Columns to Categorical

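For columns that represent categories (e.g., gender, work type, hypertension), we convert them to the `category` dtype. The 0/1 indicator columns (`hypertension`, `heart_disease`, `stroke`) are included, so after this step they are no longer treated as numeric columns.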

In [None]:
categorical_cols = [
    'gender', 'ever_married', 'work_type', 'residence_type',
    'smoking_status', 'hypertension', 'heart_disease', 'stroke'
]

for col in categorical_cols:
    df[col] = df[col].astype('category')

df[categorical_cols].dtypes
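
One side effect of the conversion is worth seeing on a small example: a 0/1 column stored as `category` is skipped when selecting numeric columns, and it has to be cast back before numeric operations such as correlations. The `toy` DataFrame below is made up for illustration.

```python
import pandas as pd

# Toy data, made up for illustration only
toy = pd.DataFrame({
    "age": [67.0, 45.0, 80.0],
    "stroke": [1, 0, 1],
})
toy["stroke"] = toy["stroke"].astype("category")

# A category column is no longer picked up as numeric
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

# Casting back to int restores the original 0/1 values for numeric work
stroke_as_int = toy["stroke"].astype(int)
print(numeric_cols, stroke_as_int.tolist())
# → ['age'] [1, 0, 1]
```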

### Exercises – Section 8
1. View the categories for `gender` using `df['gender'].cat.categories`.
2. Check how imbalanced `stroke` is using `df['stroke'].value_counts(normalize=True)`.
3. Convert your `age_decade` column (from Section 6) to categorical.

## 9. Descriptive Statistics – Numeric Columns

In [None]:
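# Keep numeric columns, leaving out the `id` identifier if present (its statistics are not meaningful)
numeric_df = df.select_dtypes(include='number').drop(columns='id', errors='ignore')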
numeric_df.describe()

In [None]:
print('Age summary:')
print(df['age'].describe())

print('\nAverage Glucose Level summary:')
print(df['avg_glucose_level'].describe())

print('\nBMI summary:')
print(df['bmi'].describe())

### Exercises – Section 9
1. Which numeric column has the largest standard deviation?
2. What are the minimum and maximum ages in the dataset?
3. Compute the IQR (Q3 - Q1) for `bmi`.

## 10. Descriptive Statistics – Categorical Columns

In [None]:
def show_category_distribution(col_name):
    counts = df[col_name].value_counts()
    percents = df[col_name].value_counts(normalize=True) * 100
    display(pd.DataFrame({'count': counts, 'percent': percents.round(2)}))

for col in ['gender', 'ever_married', 'work_type', 'residence_type', 'smoking_status', 'stroke']:
    print(f'--- {col} ---')
    show_category_distribution(col)
    print()

### Exercises – Section 10
1. Which `work_type` appears most frequently?
2. What percentage of people fall into each `smoking_status` category?
3. Add `hypertension` to the loop and inspect its distribution.

## 11. Univariate Visualizations – Numeric

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(18, 4))

sns.histplot(df['age'], bins=30, kde=True, ax=axes[0])
axes[0].set_title('Age Distribution')

sns.histplot(df['avg_glucose_level'], bins=30, kde=True, ax=axes[1])
axes[1].set_title('Average Glucose Level Distribution')

sns.histplot(df['bmi'], bins=30, kde=True, ax=axes[2])
axes[2].set_title('BMI Distribution')

plt.tight_layout()
plt.show()

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(18, 4))

sns.boxplot(y=df['age'], ax=axes[0])
axes[0].set_title('Age Boxplot')

sns.boxplot(y=df['avg_glucose_level'], ax=axes[1])
axes[1].set_title('Avg Glucose Level Boxplot')

sns.boxplot(y=df['bmi'], ax=axes[2])
axes[2].set_title('BMI Boxplot')

plt.tight_layout()
plt.show()
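
By default, the points drawn beyond the whiskers in these boxplots are values more than 1.5 × IQR away from the quartiles. The sketch below reproduces that rule by hand on made-up values; the exact quartiles a plotting library computes may differ slightly depending on its interpolation method.

```python
import pandas as pd

# Made-up values for illustration only
values = pd.Series([20.0, 22.0, 24.0, 26.0, 28.0, 30.0, 32.0, 34.0, 60.0])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Points outside these fences are the ones a boxplot marks individually
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]
print(q1, q3, iqr, outliers.tolist())
# → 24.0 32.0 8.0 [60.0]
```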

### Exercises – Section 11
1. Which numeric variable appears most skewed?
2. Which variable shows the most extreme outliers in the boxplots?
3. Create a single histogram of `age` with `bins=15` and add a title and axis labels.

## 12. Univariate Visualizations – Categorical

In [None]:
plt.figure(figsize=(6, 4))
sns.countplot(x='gender', data=df)
plt.title('Gender Distribution')
plt.show()

plt.figure(figsize=(8, 4))
sns.countplot(x='work_type', data=df)
plt.title('Work Type Distribution')
plt.xticks(rotation=30)
plt.show()

plt.figure(figsize=(8, 4))
sns.countplot(x='smoking_status', data=df)
plt.title('Smoking Status Distribution')
plt.xticks(rotation=30)
plt.show()

plt.figure(figsize=(4, 4))
sns.countplot(x='stroke', data=df)
plt.title('Stroke vs No Stroke')
plt.show()

### Exercises – Section 12
1. Which `work_type` category has the smallest count?
2. What do you notice about the class balance of `stroke` (0 vs 1)?
3. Create a countplot for `residence_type`.

## 13. Bivariate Analysis – Stroke vs Risk Factors

In [None]:
# Mean age, glucose, BMI by stroke status
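# observed=True keeps only categories that occur and avoids a FutureWarning in recent pandas versions
grouped_stroke = df.groupby('stroke', observed=True)[['age', 'avg_glucose_level', 'bmi']].mean().round(2)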
grouped_stroke

In [None]:
# Age distribution by stroke status
plt.figure(figsize=(8, 4))
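# fill=True replaces the deprecated shade=True; common_norm=False scales each group's curve separately
sns.kdeplot(data=df, x='age', hue='stroke', fill=True, common_norm=False)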
plt.title('Age Distribution by Stroke Status')
plt.show()

In [None]:
# BMI by stroke status
plt.figure(figsize=(6, 4))
sns.boxplot(x='stroke', y='bmi', data=df)
plt.title('BMI by Stroke Status')
plt.show()

In [None]:
# Hypertension vs stroke
ct_hyper = pd.crosstab(df['hypertension'], df['stroke'], normalize='index') * 100
ct_hyper.round(2)
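
To make the effect of `normalize='index'` concrete, here is a small sketch on made-up data. Each row of the result is divided by its own row total, so every row sums to 100%.

```python
import pandas as pd

# Toy data, made up for illustration only
toy = pd.DataFrame({
    "hypertension": [0, 0, 0, 0, 1, 1],
    "stroke":       [0, 0, 0, 1, 0, 1],
})

# normalize='index' divides each row by its row total
row_pct = pd.crosstab(toy["hypertension"], toy["stroke"], normalize="index") * 100
print(row_pct.round(2))
```

In the toy data, 1 of 4 rows without hypertension has a stroke (25%), compared with 1 of 2 rows with hypertension (50%). Reading across a row therefore gives the stroke rate within that hypertension group.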

In [None]:
ct_hyper.plot(kind='bar', stacked=True, figsize=(6, 4))
plt.title('Stroke Rate by Hypertension Status (%)')
plt.xlabel('Hypertension (0 = No, 1 = Yes)')
plt.ylabel('Percentage')
plt.legend(title='Stroke', labels=['No Stroke (0)', 'Stroke (1)'])
plt.show()

### Exercises – Section 13
1. How do the mean `age` and `avg_glucose_level` differ between stroke and no-stroke groups?
2. In the KDE plot, which group appears older on average?
3. Create a similar crosstab and stacked bar chart for `heart_disease` vs `stroke`.

## 14. Correlation Matrix – Numeric Features

In [None]:
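# hypertension, heart_disease and stroke became 'category' in Section 8, so cast them back to numbers
corr_cols = ['age', 'avg_glucose_level', 'bmi', 'hypertension', 'heart_disease', 'stroke']
corr_matrix = df[corr_cols].astype(float).corr()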
corr_matrix

In [None]:
plt.figure(figsize=(10, 6))
sns.heatmap(corr_matrix, annot=True, fmt='.2f', cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Matrix of Numeric Features')
plt.show()

### Exercises – Section 14
1. Which pair of numeric variables has the strongest positive correlation?
2. Which variable is most correlated with `age`?
3. Does any variable show surprisingly low correlation with `stroke`? Why might that be?

## 15. Save Cleaned Data

Finally, we save our cleaned dataset to a new CSV file.

In [None]:
output_file = 'healthcare-stroke-data-cleaned.csv'
df.to_csv(output_file, index=False)
print(f'Cleaned data saved to: {output_file}')
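
A CSV file stores plain text, so the `category` dtypes set in Section 8 are not preserved when the file is read back. The sketch below shows this round trip on a made-up `toy` DataFrame written to a temporary directory, and one way to restore the dtype on load with the `dtype` argument of `read_csv`.

```python
import os
import tempfile
import pandas as pd

# Toy data, made up for illustration only
toy = pd.DataFrame({"gender": ["Male", "Female"], "stroke": [0, 1]})
toy["gender"] = toy["gender"].astype("category")

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy-cleaned.csv")
    toy.to_csv(path, index=False)

    # Plain reload: gender comes back as ordinary strings
    plain = pd.read_csv(path)

    # dtype= re-applies the category dtype while reading
    restored = pd.read_csv(path, dtype={"gender": "category"})

print(plain.dtypes)
print(restored.dtypes)
```

If the cleaned file feeds a later notebook, it may be simplest to repeat the categorical conversion there after loading.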

In [None]:
# Download the cleaned file (Colab only)
files.download(output_file)

### Final Exercises – Putting It All Together
1. Create a new column `high_glucose_flag` which is 1 if `avg_glucose_level > 140`, else 0.
2. Use a crosstab to compare stroke rates for `high_glucose_flag = 0` vs `1`.
3. Create a bar chart showing the average `age` by `smoking_status`.
4. In a markdown cell, write a short EDA summary of key patterns you observed.