# Clothes Size Predictor 🧥

### ➤ Import Libraries and Set Up Paths

In [1]:
import os
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import logging
logging.basicConfig(level=logging.INFO, format='[%(levelname)s] %(message)s')
logger = logging.getLogger(__name__)

# Get the current working directory (the folder the notebook runs from)
current_dir = os.getcwd()

# Resolve the project root, one level above the working directory
project_root = os.path.abspath(os.path.join(current_dir, '..'))
logger.info("✅ Libraries uploaded")

[INFO] ✅ Libraries uploaded


⎙ Import the helpers from /src

In [2]:
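try:
    # Make the project root importable so the `src` package resolves
    if project_root not in sys.path:
        sys.path.append(project_root)
    from src.pipeline.data import DataProcessor
    from src.pipeline.cleaning import DataCleaner
    logger.info("✅ Libraries uploaded")
except ImportError as e:
    # Log at error level and re-raise so later cells do not fail with a NameError
    logger.error(f"❌ Error loading libraries: {e}")
    raise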

[INFO] ✅ Libraries uploaded


⏏ Import the Dataset

In [3]:
# Load the raw dataset and run the loading pipeline
file_path = os.path.abspath(os.path.join(project_root, 'data', 'raw', 'clothing_info.csv'))
processor = DataProcessor(file_path)
df = processor.run_pipeline()

✅ Data uploaded correctly. 119734 rows and 4 columns.

📊 General information about the dataset:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119734 entries, 0 to 119733
Data columns (total 4 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   weight  119734 non-null  int64  
 1   age     119477 non-null  float64
 2   height  119404 non-null  float64
 3   size    119734 non-null  object 
dtypes: float64(2), int64(1), object(1)
memory usage: 3.7+ MB
None

📈 Descriptive statistics:

               weight            age         height    size
count   119734.000000  119477.000000  119404.000000  119734
unique            NaN            NaN            NaN       7
top               NaN            NaN            NaN       M
freq              NaN            NaN            NaN   29712
mean        61.756811      34.027311     165.805794     NaN
std          9.944863       8.149447       6.737651     NaN
min         22.000000       0.000000     137.16000
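
The summary above has the shape of pandas' `DataFrame.info()` followed by `DataFrame.describe(include="all")`. The sketch below reproduces that kind of report on a tiny in-memory CSV. It is only an illustration: the internals of `DataProcessor` are not shown in this notebook, and the helper name `load_and_summarise` and the toy data are assumptions.

```python
import io

import pandas as pd


def load_and_summarise(source) -> pd.DataFrame:
    """Hypothetical helper: read a CSV and print a quick overview of it."""
    df = pd.read_csv(source)
    print(f"Loaded {df.shape[0]} rows and {df.shape[1]} columns.")
    df.info()  # dtypes and non-null counts per column
    print(df.describe(include="all"))  # numeric stats plus unique/top/freq
    return df


# Toy data (not the real dataset): one missing age on purpose
toy_csv = io.StringIO(
    "weight,age,height,size\n"
    "60,30,165.1,M\n"
    "72,,180.3,L\n"
    "55,41,158.0,M\n"
)
toy_df = load_and_summarise(toy_csv)
toy_summary = toy_df.describe(include="all")
```

With `include="all"`, text columns such as `size` get `unique`, `top`, and `freq` rows, while numeric columns show `NaN` there, which is the layout seen in the output above.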

## ➤ Data Cleaning

In [4]:
# --- Work on a copy so the loaded DataFrame stays untouched
clothes_deploy = df.copy()

In [5]:
clothes_deploy.shape

(119734, 4)

In [6]:
cleaner = DataCleaner(clothes_deploy)
clothes_deploy = cleaner.run_pipeline()


🚀 Starting Data Cleaning Pipeline...


🧹 Handling missing values...
⚠️ Columns with missing values detected:
 age       257
height    330
dtype: int64
➡️ Before dropna: (119734, 4)
✅ After dropna:  (119153, 4)

📑 Checking for duplicates...
⚠️ 92182 duplicate rows detected. Removing them...
✅ Remaining rows after drop: 26971

🔍 Checking data types consistency...
✅ Data type check complete.

📈 Checking for outliers (Z-score method)...
⚠️ Possible outliers detected:
 {'weight': 456, 'age': 131, 'height': 35}

✨ Data cleaning completed. Ready for feature engineering.
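
The first two steps in the log (dropping rows with missing values, then dropping exact duplicate rows) can be sketched with plain pandas. This is a minimal illustration, assuming `DataCleaner` does something equivalent to `dropna()` followed by `drop_duplicates()`; its real implementation lives in `src/pipeline/cleaning` and is not shown here. The function name and toy data are hypothetical.

```python
import numpy as np
import pandas as pd


def drop_missing_and_duplicates(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: drop rows with any NaN, then exact duplicate rows."""
    missing = df.isna().sum()
    print("Missing values per column:")
    print(missing[missing > 0])

    no_missing = df.dropna()
    n_dupes = int(no_missing.duplicated().sum())
    print(f"Duplicate rows detected: {n_dupes}")

    return no_missing.drop_duplicates().reset_index(drop=True)


# Toy data (not the real dataset): one NaN row and one duplicated row
toy = pd.DataFrame(
    {
        "weight": [60, 60, 72, 55],
        "age": [30.0, 30.0, np.nan, 41.0],
        "height": [165.0, 165.0, 180.0, 158.0],
        "size": ["M", "M", "L", "S"],
    }
)
cleaned = drop_missing_and_duplicates(toy)
```

With only four columns, an exact duplicate may simply be two different people with the same measurements, so dropping 92182 of 119153 rows is a modelling choice worth checking against the size distribution before training.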


### ➤ Checking Outliers

In [7]:
cleaner.handle_outliers()


📈 Checking for outliers (Z-score method)...
⚠️ Possible outliers detected:
 {'weight': 456, 'age': 131, 'height': 35}
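
For reference, a Z-score check of this kind can be sketched as follows: standardise each numeric column and count the values whose absolute Z-score exceeds a threshold. The threshold of 3 and the function name `count_zscore_outliers` are assumptions; the notebook does not show which threshold `DataCleaner` uses.

```python
import pandas as pd


def count_zscore_outliers(df: pd.DataFrame, threshold: float = 3.0) -> dict:
    """Hypothetical helper: count values with |z| > threshold per numeric column."""
    numeric = df.select_dtypes(include="number")
    # Population standard deviation (ddof=0); a sample std would also be a valid choice
    z = (numeric - numeric.mean()) / numeric.std(ddof=0)
    counts = (z.abs() > threshold).sum()
    # Report only the columns that actually contain outliers
    return {col: int(n) for col, n in counts.items() if n > 0}


# Toy data (not the real dataset): one extreme weight, evenly spread heights
toy = pd.DataFrame(
    {
        "weight": [60] * 20 + [200],
        "height": list(range(160, 181)),
        "size": ["M"] * 21,
    }
)
outlier_counts = count_zscore_outliers(toy)
print(outlier_counts)
```

Columns without any flagged value are left out of the result, which matches the way `height` disappears from the later outlier reports in this notebook.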


In [8]:
# --- Remove the rows flagged as outliers
cleaner.handle_outliers(action="remove")


📈 Checking for outliers (Z-score method)...
⚠️ Possible outliers detected:
 {'weight': 456, 'age': 130, 'height': 34}
🧹 Outlier rows removed.


In [9]:
cleaner.handle_outliers()


📈 Checking for outliers (Z-score method)...
⚠️ Possible outliers detected:
 {'weight': 265, 'age': 38}
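
A likely reason the check still reports outliers after removal is that the mean and standard deviation are recomputed on the reduced data: once the most extreme values are gone, the spread shrinks and values that used to sit inside the threshold can now fall outside it. The sketch below shows this effect on toy numbers, assuming a threshold of 3; the helper name is hypothetical and this is not the `DataCleaner` code.

```python
import pandas as pd


def zscore_mask(series: pd.Series, threshold: float = 3.0) -> pd.Series:
    """Hypothetical helper: True where |z| exceeds the threshold."""
    z = (series - series.mean()) / series.std(ddof=0)
    return z.abs() > threshold


# Toy data: a tight cluster, one moderate value (6) and one extreme value (100)
values = pd.Series([-1.0, 1.0] * 24 + [6.0, 100.0])

flagged_per_pass = []
current = values
while True:
    mask = zscore_mask(current)
    n_flagged = int(mask.sum())
    flagged_per_pass.append(n_flagged)
    if n_flagged == 0:
        break
    current = current[~mask]  # drop flagged rows, then re-evaluate

print(flagged_per_pass)
```

In the first pass only 100 is flagged, because it inflates the standard deviation enough to hide 6. After it is removed, 6 becomes an outlier relative to the tighter cluster. Repeating the removal until nothing is flagged can keep trimming legitimate data, so a single pass or a fixed domain-based range is often the safer choice.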


In [10]:
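# Note: the shape below is unchanged because `clothes_deploy` is only
# reassigned when `run_pipeline()` is called again in the next cell
clothes_deploy.shape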

(26971, 4)

In [11]:
clothes_deploy = cleaner.run_pipeline()


🚀 Starting Data Cleaning Pipeline...


🧹 Handling missing values...
✅ No missing values found.

📑 Checking for duplicates...
✅ No duplicate rows found.

🔍 Checking data types consistency...
✅ Data type check complete.

📈 Checking for outliers (Z-score method)...
⚠️ Possible outliers detected:
 {'weight': 265, 'age': 38}

✨ Data cleaning completed. Ready for feature engineering.


## ➤ Export the New CSV

In [12]:
output_path = os.path.join(project_root, 'data', 'processed', 'clothes_processed.csv')
clothes_deploy.to_csv(output_path, index=False)
logger.info("✅ File saved as 'clothes_processed.csv'.")

[INFO] ✅ File saved as 'clothes_processed.csv'.
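
A quick way to confirm an export like this is to read the file back and compare it with the DataFrame in memory. The sketch below does that round trip in a temporary directory with toy data, so it does not touch the project's `data/processed` folder; the variable names are illustrative.

```python
import os
import tempfile

import pandas as pd

# Toy data (not the real dataset) with the same four columns
frame = pd.DataFrame(
    {
        "weight": [60, 72],
        "age": [30.0, 41.0],
        "height": [165.0, 180.0],
        "size": ["M", "L"],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "clothes_processed.csv")
    frame.to_csv(out_path, index=False)  # index=False avoids an extra unnamed column
    reloaded = pd.read_csv(out_path)

# The reloaded frame should have the same shape and column order
same_shape = reloaded.shape == frame.shape
same_columns = list(reloaded.columns) == list(frame.columns)
print(same_shape, same_columns)
```

Writing with `index=False`, as the cell above does, keeps the row index out of the file, so reading it back does not produce an `Unnamed: 0` column.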
