# Stress Level Dataset — Data Cleaning (Task 1)

This notebook cleans the raw dataset for the Skillytixs Data Analytics Internship Task 1.
It fills missing values, removes duplicate rows, standardizes text, fixes data types, renames columns to snake_case, and saves a cleaned CSV.


## Steps
1. Load the dataset
2. Explore: shape, null counts, duplicates
3. Handle missing values (if any)
4. Remove duplicates
5. Standardize text columns
6. Fix data types (numeric/datetime where possible)
7. Rename columns to snake_case
8. Save cleaned dataset


In [None]:
# Setup
import pandas as pd
import numpy as np

raw_path = "StressLevelDataset.csv"          # put the raw CSV next to this notebook
clean_path = "StressLevelDataset_Cleaned.csv"


In [None]:
# Load dataset
df = pd.read_csv(raw_path)
print("Initial shape:", df.shape)
display(df.head())

In [None]:
# Explore nulls and duplicates
print("\nNull values per column:\n", df.isnull().sum())
print("\nDuplicate rows:", df.duplicated().sum())
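
For a wider table, the null counts can be easier to read next to each column's dtype and number of distinct values. The sketch below is one way to build such a summary; `profile_columns` and the toy `demo` frame are hypothetical names used only for illustration, not part of the dataset.

```python
import pandas as pd

def profile_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, null count, null share, distinct values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "nulls": df.isnull().sum(),
        "null_pct": (df.isnull().mean() * 100).round(1),
        "unique": df.nunique(dropna=True),
    })

# Toy data for illustration only (not the real dataset)
demo = pd.DataFrame({"a": [1, 2, None, 2], "b": ["x", "x", "y", None]})
profile = profile_columns(demo)
print(profile)
```

In the notebook itself, `profile_columns(df)` would give the same overview for the loaded dataset.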

In [None]:
# Handle missing values (example strategy)
# If any nulls exist, forward-fill then back-fill as a simple demo.
# Adjust strategy as needed for your dataset.
if df.isnull().sum().sum() > 0:
    df = df.ffill().bfill()
print("Nulls after filling:", df.isnull().sum().sum())
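
Forward-fill and back-fill copy values from neighboring rows, so the result depends on row order. When the rows are independent observations, a per-column statistic may be a better fit. The sketch below shows one possible alternative, assuming the median is acceptable for numeric columns and the most frequent value for everything else; `impute_by_dtype` and the toy `demo` frame are hypothetical and for illustration only.

```python
import pandas as pd

def impute_by_dtype(df: pd.DataFrame) -> pd.DataFrame:
    """Fill numeric columns with the median and other columns with the mode."""
    out = df.copy()  # leave the caller's frame untouched
    for col in out.columns:
        if not out[col].isnull().any():
            continue  # nothing to fill in this column
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            modes = out[col].mode(dropna=True)
            if not modes.empty:  # an all-null column has no mode; leave it as-is
                out[col] = out[col].fillna(modes.iloc[0])
    return out

# Toy data for illustration only (not the real dataset)
demo = pd.DataFrame({
    "score": [1.0, None, 3.0, 100.0],
    "label": ["a", "b", None, "b"],
})
filled = impute_by_dtype(demo)
print(filled)
```

The median is used rather than the mean so that a single extreme value (100.0 in the toy data) does not dominate the fill value.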

In [None]:
# Remove duplicates
before_dups = df.duplicated().sum()
df = df.drop_duplicates()
print(f"Duplicates removed: {before_dups}")

In [None]:
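# Standardize text columns (trim whitespace, lowercase)
obj_cols = df.select_dtypes(include=['object']).columns
for c in obj_cols:
    # Transform only non-null values: astype(str) alone can turn NaN into the literal string "nan"
    df[c] = df[c].where(df[c].isna(), df[c].astype(str).str.strip().str.lower())
print("Standardized text columns:", list(obj_cols))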

In [None]:
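# Fix data types (try numeric, then datetime)
for c in df.columns:
    # Columns that are already numeric or datetime need no conversion
    if (pd.api.types.is_numeric_dtype(df[c])
            or pd.api.types.is_datetime64_any_dtype(df[c])):
        continue
    # Attempt numeric first; text that is not numeric raises and falls through
    try:
        df[c] = pd.to_numeric(df[c])
        continue
    except (ValueError, TypeError):
        pass
    # Attempt datetime if not numeric; on failure the column stays as text
    try:
        df[c] = pd.to_datetime(df[c], errors='raise')
    except (ValueError, TypeError, OverflowError):
        pass

print("Dtypes:\n", df.dtypes)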
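
The all-or-nothing conversion above leaves a column as text if even one value fails to parse (for example a stray `"n/a"`). A more tolerant variant is sketched below: it coerces unparseable values to `NaN` and accepts the conversion only when a chosen share of the non-null values parsed. The helper name `to_numeric_if_mostly_numeric`, the `min_share` threshold, and the toy data are assumptions made for illustration.

```python
import pandas as pd

def to_numeric_if_mostly_numeric(s: pd.Series, min_share: float = 0.9) -> pd.Series:
    """Convert s to numeric when at least min_share of its non-null values parse."""
    non_null = s.notna().sum()
    if non_null == 0:
        return s  # nothing to judge the column by
    converted = pd.to_numeric(s, errors="coerce")  # unparseable values become NaN
    share = converted.notna().sum() / non_null
    return converted if share >= min_share else s

# Toy data for illustration only (not the real dataset)
demo = pd.DataFrame({
    "age": ["21", "22", "n/a", "23", "24", "25", "26", "27", "28", "29"],
    "city": ["pune", "delhi", "agra", "surat", "kochi",
             "pune", "delhi", "agra", "surat", "kochi"],
})
for col in demo.columns:
    demo[col] = to_numeric_if_mostly_numeric(demo[col], min_share=0.8)
print(demo.dtypes)
```

Values that fail to parse become missing, so this kind of conversion would normally run before the missing-value step rather than after it.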

In [None]:
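# Rename columns to snake_case
# regex=False makes each replacement a literal match on every pandas version
df.columns = (df.columns
              .str.strip()
              .str.lower()
              .str.replace(' ', '_', regex=False)
              .str.replace('-', '_', regex=False)
              .str.replace('/', '_', regex=False))
print("Columns renamed:", list(df.columns))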
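
The chained replacements above cover spaces, hyphens, and slashes. If the raw headers might also contain camelCase names or other punctuation, a small regex helper is one option. This is a sketch under that assumption; `to_snake_case` and the example headers are hypothetical.

```python
import re

def to_snake_case(name: str) -> str:
    """Convert a column header to snake_case."""
    name = name.strip()
    # Split camelCase: put an underscore between a lowercase letter or digit
    # and a following uppercase letter, e.g. "sleepQuality" -> "sleep_Quality"
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)
    # Collapse every run of non-alphanumeric characters into one underscore
    name = re.sub(r"[^0-9a-zA-Z]+", "_", name)
    return name.strip("_").lower()

# Hypothetical headers for illustration only
examples = [" Heart  Rate ", "sleepQuality", "blood-pressure/avg", "Stress Level (%)"]
renamed = [to_snake_case(h) for h in examples]
print(renamed)
```

It could be applied with `df.columns = [to_snake_case(c) for c in df.columns]`. Two different headers can map to the same name, so checking `df.columns.is_unique` afterwards is a sensible precaution.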

In [None]:
# Save cleaned dataset
df.to_csv(clean_path, index=False)
print("Saved:", clean_path)
print("Final shape:", df.shape)
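
After saving, it can be worth reading the file back to confirm that it round-trips cleanly. The sketch below runs a few such checks on a toy frame written to a temporary directory; the column names and file name are hypothetical, and in the notebook the same checks would compare `pd.read_csv(clean_path)` against `df`.

```python
import os
import tempfile

import pandas as pd

# Toy data for illustration only (not the real dataset)
demo = pd.DataFrame({"anxiety_level": [3, 7, 5], "stress_level": [0, 2, 1]})

# Write to a temporary directory so the example leaves no files behind
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_cleaned.csv")
    demo.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

checks = {
    "same_shape": reloaded.shape == demo.shape,
    "same_columns": list(reloaded.columns) == list(demo.columns),
    "no_nulls": int(reloaded.isnull().sum().sum()) == 0,
    "no_duplicates": int(reloaded.duplicated().sum()) == 0,
}
print(checks)
```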