# DS637 - Homework 5 (msleep)
**Student:** Umair Ali

Tasks:
1. Load `msleep.csv` into a DataFrame.
2. Split into **good** (0 NaN), **bad** (exactly 1 NaN), **ugly** (2+ NaN) per row.
3. Fill NaN in **bad** using column mean (numeric) or mode (categorical).
4. On **good**, convert `order` to dummies with prefix `order_`.
5. On **good**, cut `bodywt` into 10 bins and show counts.
6. On **good**, cap `bodywt` at max 100.
7. On filled **bad**, cut `bodywt` into 10 bins and show counts.

> Put `msleep.csv` in the same folder as this notebook, or update `data_path`.

## 1) Load data

In [None]:
import pandas as pd
import numpy as np
from pathlib import Path

# If msleep.csv is in the same folder as this notebook:
data_path = Path('msleep.csv')

df = pd.read_csv(data_path)
print('Shape:', df.shape)
display(df.head())

## 2) Split into good / bad / ugly based on NaN count per row

In [None]:
nan_per_row = df.isna().sum(axis=1)

good = df.loc[nan_per_row == 0].copy()
bad  = df.loc[nan_per_row == 1].copy()
ugly = df.loc[nan_per_row >= 2].copy()

print('good rows (0 NaN):', len(good))
print('bad rows  (1 NaN):', len(bad))
print('ugly rows (2+ NaN):', len(ugly))

display(good.head())
display(bad.head())
display(ugly.head())
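
To sanity-check the split logic, here is a minimal sketch on a hypothetical four-row toy frame (the values are made up and are not from `msleep.csv`). It assumes the same rule as above: count NaNs per row with `isna().sum(axis=1)`, then filter on that count.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from msleep.csv.
toy = pd.DataFrame({
    'name': ['a', 'b', 'c', 'd'],
    'vore': ['carni', np.nan, 'herbi', np.nan],
    'bodywt': [1.0, 2.0, np.nan, np.nan],
})

# Number of missing values in each row.
toy_nan_per_row = toy.isna().sum(axis=1)

toy_good = toy.loc[toy_nan_per_row == 0]   # no missing values
toy_bad = toy.loc[toy_nan_per_row == 1]    # exactly one missing value
toy_ugly = toy.loc[toy_nan_per_row >= 2]   # two or more missing values

print(toy_nan_per_row.tolist())
print(len(toy_good), len(toy_bad), len(toy_ugly))
```

Because the three conditions are mutually exclusive and cover every possible count, the three pieces should always add up to the original number of rows.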

## 3) Fill NaN in the *bad* dataframe (mean for numeric, mode for categorical)

In [None]:
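# Per-column fill values, computed from the full dataset (df) so the
# statistics use every observed value, not only the values in the bad rows.
fill_values = {}
for col in df.columns:
    if pd.api.types.is_numeric_dtype(df[col]):
        fill_values[col] = df[col].mean(skipna=True)
    else:
        modes = df[col].mode(dropna=True)
        # Skip columns with no observed values: there is no mode to fill with.
        if len(modes):
            fill_values[col] = modes.iloc[0]

# fillna returns a new DataFrame, so bad itself is left unchanged.
filled_bad = bad.fillna(value=fill_values)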

print('NaNs remaining in filled_bad:', int(filled_bad.isna().sum().sum()))
display(filled_bad.head())
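
The mean/mode rule can be illustrated on a small hypothetical frame (made-up values, with column names borrowed only for readability). This is a sketch of the same idea, not a replacement for the cell above: numeric columns get their mean, and other columns get their most frequent value.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from msleep.csv.
toy = pd.DataFrame({
    'vore': ['herbi', 'herbi', 'carni', np.nan],
    'sleep_total': [10.0, 12.0, np.nan, 14.0],
})

toy_fill = {}
for col in toy.columns:
    if pd.api.types.is_numeric_dtype(toy[col]):
        # Mean ignores NaN by default.
        toy_fill[col] = toy[col].mean()
    else:
        modes = toy[col].mode()
        if len(modes):
            # mode() can return several values on ties; take the first.
            toy_fill[col] = modes.iloc[0]

toy_filled = toy.fillna(value=toy_fill)
print(toy_fill)
print(toy_filled)
```

Passing a dict to `fillna` applies each value only to the column with the matching name, which is why one call can handle both kinds of column.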

## 4) On *good*, convert column `order` into dummies with prefix `order_`

In [None]:
good_dummies = good.copy()

# prefix='order' combined with the default separator '_' gives columns named order_<value>.
order_dummies = pd.get_dummies(good_dummies['order'], prefix='order')
good_dummies = pd.concat([good_dummies.drop(columns=['order']), order_dummies], axis=1)

print('Original good shape:', good.shape)
print('After dummies shape:', good_dummies.shape)
display(good_dummies.head())
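
As a quick illustration of how the dummy column names are formed, the sketch below runs `pd.get_dummies` on a hypothetical three-row frame (the rows are invented for the example). With `prefix='order'` and the default `prefix_sep='_'`, each distinct value becomes a column named `order_<value>`.

```python
import pandas as pd

# Hypothetical toy data, not taken from msleep.csv.
toy = pd.DataFrame({
    'name': ['cat', 'dog', 'cow'],
    'order': ['Carnivora', 'Carnivora', 'Artiodactyla'],
})

toy_dummies = pd.get_dummies(toy['order'], prefix='order')

# Swap the original column for its indicator columns.
toy_encoded = pd.concat([toy.drop(columns=['order']), toy_dummies], axis=1)

print(toy_dummies.columns.tolist())
print(toy_encoded.shape)
```

Each row has exactly one indicator set, so the dummy columns sum to 1 across every row.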

## 5) On *good*, cut `bodywt` into 10 bins and return counts

In [None]:
good_bodywt = pd.to_numeric(good['bodywt'], errors='coerce')

bins_good = pd.cut(good_bodywt, bins=10)
counts_good = bins_good.value_counts().sort_index()

display(counts_good)
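
Note that `pd.cut(..., bins=10)` builds ten equal-width intervals between the minimum and maximum, so a heavily skewed column tends to pile up in the first bin. The sketch below shows this on a hypothetical series of weights (invented values, not the `msleep` data).

```python
import pandas as pd

# Hypothetical skewed weights, not taken from msleep.csv.
weights = pd.Series([0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 50.0, 100.0, 500.0, 1000.0])

# Ten equal-width intervals spanning min..max of the data.
toy_bins = pd.cut(weights, bins=10)

# value_counts on a categorical keeps empty bins; sort_index orders them by interval.
toy_counts = toy_bins.value_counts().sort_index()

print(toy_counts)
```

In this toy series, eight of the ten values fall in the lowest interval and most of the remaining bins are empty, which is the pattern to expect whenever a few very large values stretch the range.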

## 6) On *good*, cap `bodywt` to 100 max

In [None]:
good_capped = good.copy()
good_capped['bodywt'] = pd.to_numeric(good_capped['bodywt'], errors='coerce').clip(upper=100)

print('Max bodywt before cap:', float(pd.to_numeric(good['bodywt'], errors='coerce').max()))
print('Max bodywt after  cap:', float(good_capped['bodywt'].max()))
display(good_capped[['bodywt']].describe())
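
For reference, `clip(upper=100)` replaces every value above 100 with 100 and leaves the rest untouched. A minimal sketch on a hypothetical series (invented values):

```python
import pandas as pd

# Hypothetical weights, not taken from msleep.csv.
weights = pd.Series([0.5, 60.0, 100.0, 250.0, 6654.0])

# Values above the cap are set to the cap; others are unchanged.
capped = weights.clip(upper=100)

print(capped.tolist())
```

`clip` returns a new Series, so the original values remain available for the before/after comparison.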

## 7) On filled *bad*, cut `bodywt` into 10 bins and return counts

In [None]:
filled_bad_bodywt = pd.to_numeric(filled_bad['bodywt'], errors='coerce')

bins_bad = pd.cut(filled_bad_bodywt, bins=10)
counts_bad = bins_bad.value_counts().sort_index()

display(counts_bad)

## Summary

In [None]:
summary = pd.DataFrame({
    'dataset': ['good', 'bad', 'ugly', 'filled_bad'],
    'rows': [len(good), len(bad), len(ugly), len(filled_bad)],
    'total_NaNs': [
        int(frame.isna().sum().sum())
        for frame in (good, bad, ugly, filled_bad)
    ],
})
display(summary)