# Setup: Generate Sample Dataset

This cell creates the required folder structure (`data/raw/` and `data/processed/`) under the current working directory, which is assumed to be the project root, and builds a small sample dataset with missing values.
The dataset is written to `data/raw/sample_data.csv` so the cleaning functions have something to work on. If that file already exists, it is left untouched.

In [1]:
import os
import pandas as pd
import numpy as np

# --- Define Project Structure ---
# Get the current working directory (should be the project's root folder)
project_root = os.getcwd()
print(f"Project Root Directory: {project_root}")

# Define folder paths using os.path.join for cross-platform compatibility
raw_dir = os.path.join(project_root, 'data', 'raw')
processed_dir = os.path.join(project_root, 'data', 'processed')

# --- Create Folders if They Don't Exist ---
print(f"Ensuring directory exists: {raw_dir}")
os.makedirs(raw_dir, exist_ok=True)

print(f"Ensuring directory exists: {processed_dir}")
os.makedirs(processed_dir, exist_ok=True)

# --- Generate Sample Dataset ---
# Define the sample data
data = {
    'age': [34, 45, 29, 50, 38, np.nan, 41],
    'income': [55000, np.nan, 42000, 58000, np.nan, np.nan, 49000],
    'score': [0.82, 0.91, np.nan, 0.76, 0.88, 0.65, 0.79],
    'zipcode': ['90210', '10001', '60614', '94103', '73301', '12345', '94105'],
    'city': ['Beverly', 'New York', 'Chicago', 'SF', 'Austin', 'Unknown', 'San Francisco'],
    'extra_data': [np.nan, 42, np.nan, np.nan, np.nan, 5, np.nan]
}

# Create DataFrame
df = pd.DataFrame(data)

# Define the full path for the output CSV file
csv_path = os.path.join(raw_dir, 'sample_data.csv')

# Save to CSV in raw data folder
if not os.path.exists(csv_path):
    df.to_csv(csv_path, index=False)
    print(f'Sample dataset created and saved to {csv_path}')
else:
    print(f'File already exists at {csv_path}. Skipping CSV creation to avoid overwrite.')

Project Root Directory: /Users/souhil/bootcamp_souhil_khiat/homework/homework6
Ensuring directory exists: /Users/souhil/bootcamp_souhil_khiat/homework/homework6/data/raw
Ensuring directory exists: /Users/souhil/bootcamp_souhil_khiat/homework/homework6/data/processed
File already exists at /Users/souhil/bootcamp_souhil_khiat/homework/homework6/data/raw/sample_data.csv. Skipping CSV creation to avoid overwrite.


# Homework Starter — Stage 6: Data Preprocessing
Use this notebook to apply your cleaning functions and save processed data.

In [2]:
import pandas as pd
from src import cleaning
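
The `src/cleaning.py` module is not shown in this notebook. As a rough guide to what the later calls `cleaning.drop_missing(df, threshold=...)` and `cleaning.normalize_data(df, columns=...)` might do, here is a minimal sketch. The function bodies are assumptions (row-wise non-missing fraction for the threshold, min-max scaling for normalization), and the `demo` frame is invented for illustration; the actual module may differ.

```python
import pandas as pd


def drop_missing(df: pd.DataFrame, threshold: float = 0.7) -> pd.DataFrame:
    """Keep rows whose fraction of non-missing values is at least `threshold`."""
    non_missing_fraction = df.notna().mean(axis=1)
    return df.loc[non_missing_fraction >= threshold].copy()


def normalize_data(df: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Min-max scale the given columns to [0, 1]; constant columns become 0."""
    out = df.copy()
    for col in columns:
        col_min, col_max = out[col].min(), out[col].max()
        span = col_max - col_min
        out[col] = (out[col] - col_min) / span if span != 0 else 0.0
    return out


# Toy data (hypothetical): row 1 has only 2 of 4 values present.
demo = pd.DataFrame({
    "age": [20.0, 30.0, 40.0],
    "income": [1000.0, None, 3000.0],
    "score": [0.5, None, 0.9],
    "note": ["a", "b", None],
})

kept = drop_missing(demo, threshold=0.7)      # row 1 (50% present) is dropped
scaled = normalize_data(kept, columns=["age", "income"])
print(scaled)
```

Both helpers return a new DataFrame rather than modifying the input, so the raw `df` stays available for before/after comparisons.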

## Load Raw Dataset

In [4]:
# Path is relative to the project root, matching the save step at the end of the notebook
df = pd.read_csv('data/raw/sample_data.csv')
df.head()

    age   income  score  zipcode      city  extra_data
0  34.0  55000.0   0.82    90210   Beverly         NaN
1  45.0      NaN   0.91    10001  New York        42.0
2  29.0  42000.0    NaN    60614   Chicago         NaN
3  50.0  58000.0   0.76    94103        SF         NaN
4  38.0      NaN   0.88    73301    Austin         NaN
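
One thing worth noticing in the loaded data: the setup cell defined `zipcode` as strings, but `pd.read_csv` infers an integer column (the `df.info()` output later in the notebook reports `int64`). None of the sample ZIP codes start with a zero, so nothing is lost here, but a code such as `02134` would be. The sketch below uses a made-up two-row CSV (not the notebook's data) to show the difference when `dtype` is passed explicitly.

```python
import io

import pandas as pd

# Hypothetical CSV content with one ZIP code that has a leading zero.
csv_text = "zipcode,city\n02134,Boston\n90210,Beverly\n"

# Default type inference parses the column as integers.
inferred = pd.read_csv(io.StringIO(csv_text))

# Forcing a string dtype keeps the codes exactly as written.
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"zipcode": str})

print(inferred["zipcode"].tolist())  # integers, leading zero lost
print(as_text["zipcode"].tolist())   # strings, leading zero preserved
```

Whether to apply this to the homework dataset is a judgment call; it mainly matters if `zipcode` is treated as a categorical identifier rather than a number.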


## Apply Cleaning Functions

In [5]:
# First, let's look at the raw data's state
print("Original Data Info:")
df.info()
print("\nMissing values before cleaning:")
print(df.isna().sum())
print("\nOriginal DataFrame head:")
print(df.head())

# --- Define our cleaning strategy ---
# 1. Fill missing values for core numeric features. Median is robust to outliers.
numeric_cols_to_fill = ['age', 'income', 'score']
df_cleaned = cleaning.fill_missing_median(df, columns=numeric_cols_to_fill)

# 2. Drop rows that are mostly empty. The 'extra_data' column makes some rows very sparse.
# We'll set a threshold of 0.7, meaning we keep rows with at least 70% non-missing data.
df_cleaned = cleaning.drop_missing(df_cleaned, threshold=0.7)

# 3. Normalize numeric columns to a common scale (0-1) for potential modeling.
numeric_cols_to_normalize = ['age', 'income', 'score']
df_cleaned = cleaning.normalize_data(df_cleaned, columns=numeric_cols_to_normalize)


# --- Validate the cleaned data ---
print("\n" + "="*50)
print("Cleaned Data Info:")
df_cleaned.info()
print("\nMissing values after cleaning:")
print(df_cleaned.isna().sum())
print("\nCleaned DataFrame head:")
print(df_cleaned.head())

Original Data Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   age         6 non-null      float64
 1   income      4 non-null      float64
 2   score       6 non-null      float64
 3   zipcode     7 non-null      int64  
 4   city        7 non-null      object 
 5   extra_data  2 non-null      float64
dtypes: float64(4), int64(1), object(1)
memory usage: 468.0+ bytes

Missing values before cleaning:
age           1
income        3
score         1
zipcode       0
city          0
extra_data    5
dtype: int64

Original DataFrame head:
    age   income  score  zipcode      city  extra_data
0  34.0  55000.0   0.82    90210   Beverly         NaN
1  45.0      NaN   0.91    10001  New York        42.0
2  29.0  42000.0    NaN    60614   Chicago         NaN
3  50.0  58000.0   0.76    94103        SF         NaN
4  38.0      NaN   0.88    73301    Austin         NaN

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df_copy[col].fillna(median_val, inplace=True)
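
The warning above points at the line `df_copy[col].fillna(median_val, inplace=True)` inside `fill_missing_median`. Following the warning's own suggestion (`df[col] = df[col].method(value)`), a version that avoids the chained in-place call could look like the sketch below. The rest of the function body is an assumption, since `src/cleaning.py` is not shown, and the `sample` frame is toy data.

```python
import pandas as pd


def fill_missing_median(df: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Return a copy of `df` with NaNs in `columns` replaced by each column's median."""
    df_copy = df.copy()
    for col in columns:
        median_val = df_copy[col].median()
        # Assign the result back instead of calling fillna(..., inplace=True)
        # on df_copy[col], which is the pattern the warning flags.
        df_copy[col] = df_copy[col].fillna(median_val)
    return df_copy


# Toy data (hypothetical) with one gap in `age` and two in `income`.
sample = pd.DataFrame({
    "age": [34.0, 45.0, None, 50.0],
    "income": [55000.0, None, 42000.0, None],
})

filled = fill_missing_median(sample, columns=["age", "income"])
print(filled)
```

Because the function works on a copy, `sample` keeps its missing values and only `filled` is changed.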

## Save Cleaned Dataset

In [6]:
# Define the output path
output_path = 'data/processed/sample_data_cleaned.csv'

# Save the cleaned dataframe
df_cleaned.to_csv(output_path, index=False)

print(f"Cleaned data successfully saved to {output_path}")

Cleaned data successfully saved to data/processed/sample_data_cleaned.csv
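
The save step relies on `data/processed/` already existing and on the notebook running from the project root. A slightly more defensive variant, sketched below, creates the parent folder first and reloads the file as a quick sanity check. It writes to a temporary directory and uses a stand-in `df_cleaned` so that it runs on its own; both are assumptions for illustration, not part of the homework code.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in for the cleaned DataFrame produced earlier in the notebook.
df_cleaned = pd.DataFrame({"age": [0.0, 1.0], "city": ["Austin", "SF"]})

with tempfile.TemporaryDirectory() as tmp:
    output_path = Path(tmp) / "data" / "processed" / "sample_data_cleaned.csv"

    # Create data/processed/ if it is missing, then write the file.
    output_path.parent.mkdir(parents=True, exist_ok=True)
    df_cleaned.to_csv(output_path, index=False)

    # Reload and compare shapes to confirm the file was written as expected.
    reloaded = pd.read_csv(output_path)
    same_shape = reloaded.shape == df_cleaned.shape
    print(f"Round trip preserved shape: {same_shape}")
```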
