## Outline for the Non-Image Model (Text + Metadata Only)

### Phase 1 — Load and Inspect Data
- Load the cleaned dataset from Google Drive.

### Phase 2 — Metadata Feature Engineering
- Encode categorical variables (brand, color, category).
- Decide between one-hot encoding or target encoding for high-cardinality fields.
- Standardize numeric variables as needed.
- Consolidate rare categories into an “other” class to improve model stability.
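
The rare-category consolidation in the last bullet can be sketched as follows. This is only an illustration: the helper `collapse_rare`, the threshold `min_count=2`, and the toy brand values are assumptions and do not appear in the notebook.

```python
import pandas as pd

def collapse_rare(series: pd.Series, min_count: int = 2, other_label: str = "Other") -> pd.Series:
    """Replace categories seen fewer than `min_count` times with `other_label`."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    # Keep values that are not rare; replace the rest with the catch-all label
    return series.where(~series.isin(rare), other_label)

# Toy example (made-up values, not taken from the ASOS dataset)
brands = pd.Series(["Nike", "Nike", "Topshop", "Topshop", "Topshop", "RareLabel"])
collapsed = collapse_rare(brands, min_count=2)
print(collapsed.tolist())
```

In a real pipeline the counts would presumably be computed on training rows only, so that the set of "rare" labels does not depend on the test data.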

### Phase 3 — Text Feature Engineering
- Clean and normalize product titles and descriptions.
- Build both TF-IDF features and BERT-based embeddings.
- Optionally apply dimensionality reduction (e.g., SVD) to compress the text representations.
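
As a rough sketch of the optional reduction step, `TruncatedSVD` from scikit-learn can compress a sparse TF-IDF matrix without densifying it first. The toy documents, the `_demo` names, and `n_components=3` are assumptions chosen for illustration; the notebook itself does not run this step.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy descriptions (made up for illustration)
docs = [
    "crew neck short sleeves plain design",
    "high neck long sleeves",
    "crew neck logo embroidery regular fit",
    "high rise belt loops functional pockets",
    "notch collar oversized fit side pockets",
]

tfidf_demo = TfidfVectorizer()
X_sparse = tfidf_demo.fit_transform(docs)        # sparse matrix, one column per term

# TruncatedSVD works directly on sparse input
svd = TruncatedSVD(n_components=3, random_state=42)
X_reduced = svd.fit_transform(X_sparse)          # dense array with 3 columns

print(X_sparse.shape, "->", X_reduced.shape)
```

With a train/test split, both the vectorizer and the SVD would be fitted on the training text and only applied to the test text.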

### Phase 4 — Non-Image Baseline Models
- Train baseline regressors using metadata and text features separately and in combination.
- Models include linear regression variants, random forests, gradient boosting, and shallow neural networks.
- Evaluate performance using MAE, RMSE, and R².
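
For reference, the three metrics can be written out by hand. The sketch below uses plain Python on made-up numbers; the function name `regression_metrics` is hypothetical, and the notebook itself relies on the scikit-learn implementations.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R^2) for two equal-length sequences of numbers."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)              # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return mae, rmse, r2

# Made-up prices and predictions
y_true_demo = [10.0, 20.0, 30.0, 40.0]
y_pred_demo = [12.0, 18.0, 33.0, 37.0]
mae_demo, rmse_demo, r2_demo = regression_metrics(y_true_demo, y_pred_demo)
print(mae_demo, round(rmse_demo, 4), round(r2_demo, 3))
```

MAE and RMSE are in the same units as the price, while R² is unitless, which is why the later result tables report all three side by side.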



## Phase 1 - Data loading and checking

In [None]:
#phase 1
import pandas as pd
import numpy as np

#phase 2
!pip install category_encoders
from category_encoders import TargetEncoder

#phase 3
!pip install -q sentence-transformers
import re
from sklearn.model_selection import train_test_split
from sentence_transformers import SentenceTransformer

#phase 4
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDRegressor




In [None]:
df = pd.read_csv("asos_final_clean.csv")
df.head()

Unnamed: 0,price,sku,description,images,brand,item_category,main_material,fit_type,color_simple
0,49.99,126704571,Coats & Jackets by New Look Low-key layering N...,https://images.asos-media.com/products/new-loo...,New Look,Coats & Jackets,Polyester,Regular,Beige
1,59.99,123650194,Coats & Jackets by Stradivarius Jacket upgrade...,https://images.asos-media.com/products/stradiv...,Stradivarius,Coats & Jackets,Polyester,Regular,Grey
2,45.0,125806824,Coats & Jackets by JDY Low-key layering Notch ...,https://images.asos-media.com/products/jdy-ove...,JDY,Coats & Jackets,Polyester,Oversized,Beige
3,84.95,121963507,Coats & Jackets by Nike Running Hit that new P...,https://images.asos-media.com/products/nike-ru...,Nike Running,Coats & Jackets,Polyester,Regular,Black
4,75.0,123053365,Coats & Jackets by ASOS Tall Throw-on appeal N...,https://images.asos-media.com/products/asos-de...,ASOS Tall,Coats & Jackets,Cotton,Regular,Beige


In [None]:
len(df)

29971

## Phase 2: Inspect Distinct Values per Column


In [None]:

# Show number of unique values for every column
unique_counts = df.nunique().sort_values(ascending=False)
print("Unique value counts per column:")
display(unique_counts)


Unique value counts per column:


Unnamed: 0,0
sku,29971
images,29971
description,28475
brand,4949
price,684
item_category,327
main_material,56
color_simple,14
fit_type,5


While inspecting the dataset, we noticed that the number of unique product descriptions (28,475) did not match the row count of the dataframe (29,971). To confirm, we ran the code below to find which descriptions appear more than once.

In [None]:
df['description'].value_counts()[df['description'].value_counts() > 1]

Unnamed: 0_level_0,count
description,Unnamed: 1_level_1
Tops by Topshop Welcome to the next phase of Topshop Plain design Crew neck Short sleeves,16
Tops by Topshop Welcome to the next phase of Topshop High neck Long sleeves,12
Tops by The North Face Exclusive to ASOS Branded design Crew neck Short sleeves,11
Tops by Topshop Welcome to the next phase of Topshop Plain design Crew neck Drop shoulders,10
Trousers & Leggings by Stradivarius Exclusive to ASOS High rise Belt loops Functional pockets,10
...,...
Tops by Converse Your new go-to Crew neck Logo embroidery to chest Regular fit Unisex style,2
T-shirt by COLLUSION Exclusive to ASOS Plain design Crew neck Short sleeves Cropped length,2
Tops by ASOS DESIGN Basket-worthy find High neck Sleeveless style,2
T-shirt by ASOS DESIGN Crew neck Short sleeves Cropped length Slim fit Designed to fit cup sizes DD-G,2


In [None]:
# Find descriptions that appear more than once
duplicate_desc_values = df['description'].value_counts()
duplicate_desc_values = duplicate_desc_values[duplicate_desc_values > 1].index.tolist()

# Remove ALL rows where the description is one of the duplicated ones
df2 = df[~df['description'].isin(duplicate_desc_values)].copy()

print("Original rows:", len(df))
print("Rows removed due to duplicate descriptions:", len(df) - len(df2))
print("Remaining rows:", len(df2))


Original rows: 29971
Rows removed due to duplicate descriptions: 2661
Remaining rows: 27310


Next, we encode the categorical metadata columns.

In [None]:
categorical_cols = ['brand', 'item_category', 'main_material', 'fit_type', 'color_simple']

print("Unique value counts per categorical column:")
for col in categorical_cols:
    print(f"{col}: {df2[col].nunique()}")

Unique value counts per categorical column:
brand: 4811
item_category: 323
main_material: 56
fit_type: 5
color_simple: 14


In [None]:
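# STEP 1: Target-encode the 'brand' column (high cardinality)
# Note: the encoder is fitted on all rows before the train/test split,
# so test-row prices leak into brand_te.
te = TargetEncoder(cols=['brand'])
df2['brand_te'] = te.fit_transform(df2['brand'], df2['price'])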

# STEP 2: One-hot encode remaining low-cardinality columns
low_cardinality_cols = ['item_category', 'main_material', 'fit_type', 'color_simple']

df2_encoded = pd.get_dummies(df2, columns=low_cardinality_cols, drop_first=False)

# STEP 3: Build the metadata feature matrix

exclude_cols = ['price', 'sku', 'images', 'description', 'brand']
metadata_cols = [col for col in df2_encoded.columns if col not in exclude_cols]

X_meta = df2_encoded[metadata_cols]
y = df2_encoded['price']


X_meta.head()

Unnamed: 0,brand_te,item_category_Accessories,item_category_Beach cover-up,item_category_Beach dress,item_category_Beach shirt,item_category_Bikini Bottoms,item_category_Bikini Tops,item_category_Bikini briefs,item_category_Bikini top,item_category_Blazer,...,color_simple_Green,color_simple_Grey,color_simple_Multi/Pattern,color_simple_Orange,color_simple_Other,color_simple_Pink,color_simple_Purple,color_simple_Red,color_simple_White,color_simple_Yellow
0,25.055133,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,29.472625,False,False,False,False,False,False,False,False,False,...,False,True,False,False,False,False,False,False,False,False
2,25.461724,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,50.276268,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,37.086971,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
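
The cell above fits the target encoder on every row before the train/test split, so each test row's `brand_te` already reflects test prices. One way to avoid this, sketched here under assumptions, is to learn the brand means from training rows only and map them onto the test rows. The helper `fit_target_means`, the smoothing weight `m`, and the toy frames are hypothetical; the additive smoothing shown is a simple scheme and not necessarily the one `category_encoders` uses.

```python
import pandas as pd

def fit_target_means(train: pd.DataFrame, col: str, target: str, m: float = 5.0):
    """Return (smoothed per-category means, global mean) learned from training rows only."""
    global_mean = train[target].mean()
    stats = train.groupby(col)[target].agg(["mean", "count"])
    # Shrink each category mean toward the global mean; rarer categories shrink more
    smoothed = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)
    return smoothed, global_mean

# Toy data (made up); brand "C" never appears in training
train_demo = pd.DataFrame({
    "brand": ["A", "A", "B", "B", "B"],
    "price": [10.0, 20.0, 40.0, 50.0, 60.0],
})
test_demo = pd.DataFrame({"brand": ["A", "B", "C"]})

means, fallback = fit_target_means(train_demo, "brand", "price", m=5.0)
# Unseen brands fall back to the global training mean
test_demo["brand_te"] = test_demo["brand"].map(means).fillna(fallback)
print(test_demo)
```

Applied to the notebook, this would mean splitting first and encoding afterwards, which would likely change the reported metrics.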


## Phase 3 - Text Analytics (BERT/TF-IDF Prep) / Train-Test Split

In [None]:
#Step 1  - text cleaning
df3 = df2
df3['description_clean'] = (
    df3['description']
        .str.lower()
        .str.strip()
        .str.replace(r'\s+', ' ', regex=True)
        .fillna('') # Fill NaN values with empty strings
)

#Step 2 - Train/test split
X_text = df3['description_clean']
y = df3['price']

X_train_text, X_test_text, y_train, y_test = train_test_split(
    X_text, y, test_size=0.2, random_state=42
)


#step 3 - Vectorization
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(
    max_features=10000,
    ngram_range=(1,2),
    min_df=3
)

X_train_tfidf = tfidf.fit_transform(X_train_text)
X_test_tfidf = tfidf.transform(X_test_text)

print("TF-IDF shapes:")
print(X_train_tfidf.shape, X_test_tfidf.shape)

#step 3b - BERT embedding

bert_model = SentenceTransformer('all-MiniLM-L6-v2')

X_train_bert = bert_model.encode(
    X_train_text.tolist(),
    batch_size=32,
    show_progress_bar=True
)

X_test_bert = bert_model.encode(
    X_test_text.tolist(),
    batch_size=32,
    show_progress_bar=True
)

print("BERT shapes:")
print(X_train_bert.shape, X_test_bert.shape)


TF-IDF shapes:
(21848, 10000) (5462, 10000)



BERT shapes:
(21848, 384) (5462, 384)


## Phase 4 - Non-Image Baseline Models

In [None]:

# Align metadata features with the same train/test split as text

# y_train, y_test came from train_test_split(X_text, y, ...)
idx_train = y_train.index
idx_test = y_test.index

X_meta_train = X_meta.loc[idx_train]
X_meta_test = X_meta.loc[idx_test]

print("Metadata shapes:", X_meta_train.shape, X_meta_test.shape)
print("TF-IDF shapes:", X_train_tfidf.shape, X_test_tfidf.shape)
print("BERT shapes:", X_train_bert.shape, X_test_bert.shape)

# Helper: evaluation function

def evaluate_model(name, model, X_tr, y_tr, X_te, y_te):
    model.fit(X_tr, y_tr)
    y_pred = model.predict(X_te)

    mae = mean_absolute_error(y_te, y_pred)
    # Calculate RMSE by taking the square root of mean_squared_error
    rmse = np.sqrt(mean_squared_error(y_te, y_pred))
    r2 = r2_score(y_te, y_pred)

    print(f"\n=== {name} ===")
    print(f"MAE:  {mae:.3f}")
    print(f"RMSE: {rmse:.3f}")
    print(f"R²:   {r2:.3f}")

    return {
        "model": name,
        "MAE": mae,
        "RMSE": rmse,
        "R2": r2
    }

results = []

# 1) Metadata-only model (Random Forest)

rf_meta = RandomForestRegressor(
    n_estimators=200,
    max_depth=None,
    random_state=42,
    n_jobs=-1
)

results.append(
    evaluate_model(
        "Metadata-only (RandomForest)",
        rf_meta,
        X_meta_train, y_train,
        X_meta_test, y_test
    )
)

# 2) TF-IDF text-only model (Ridge Regression)

ridge_tfidf = Ridge(alpha=1.0, random_state=42)

results.append(
    evaluate_model(
        "Text-only TF-IDF (Ridge)",
        ridge_tfidf,
        X_train_tfidf, y_train,
        X_test_tfidf, y_test
    )
)

# 3) BERT text-only model (MLP)

mlp_bert = MLPRegressor(
    hidden_layer_sizes=(128, 64),
    activation='relu',
    solver='adam',
    max_iter=50,
    random_state=42
)

results.append(
    evaluate_model(
        "Text-only BERT (MLP)",
        mlp_bert,
        X_train_bert, y_train.to_numpy(),
        X_test_bert, y_test.to_numpy()
    )
)

# 4) Metadata + BERT combined model (MLP)

# Combine metadata and BERT embeddings (dense -> use np.hstack)
X_train_meta_bert = np.hstack([X_meta_train.values, X_train_bert])
X_test_meta_bert = np.hstack([X_meta_test.values, X_test_bert])

mlp_meta_bert = MLPRegressor(
    hidden_layer_sizes=(256, 128),
    activation='relu',
    solver='adam',
    max_iter=50,
    random_state=42
)

results.append(
    evaluate_model(
        "Metadata + BERT (MLP)",
        mlp_meta_bert,
        X_train_meta_bert, y_train.to_numpy(),
        X_test_meta_bert, y_test.to_numpy()
    )
)

# Summary table of all 4 models

results_df = pd.DataFrame(results)
results_df

Metadata shapes: (21848, 399) (5462, 399)
TF-IDF shapes: (21848, 10000) (5462, 10000)
BERT shapes: (21848, 384) (5462, 384)

=== Metadata-only (RandomForest) ===
MAE:  10.870
RMSE: 18.917
R²:   0.694

=== Text-only TF-IDF (Ridge) ===
MAE:  11.733
RMSE: 19.350
R²:   0.680





=== Text-only BERT (MLP) ===
MAE:  12.864
RMSE: 20.511
R²:   0.641

=== Metadata + BERT (MLP) ===
MAE:  11.142
RMSE: 17.553
R²:   0.737




Unnamed: 0,model,MAE,RMSE,R2
0,Metadata-only (RandomForest),10.870389,18.917149,0.694499
1,Text-only TF-IDF (Ridge),11.732561,19.350352,0.680346
2,Text-only BERT (MLP),12.864487,20.511146,0.640845
3,Metadata + BERT (MLP),11.142339,17.553421,0.736958


### Metadata + TF-IDF + BERT (poor results, do not run)

In [None]:

import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
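# X_train_tfidf and X_test_tfidf are reused from Phase 3; refitting the vectorizer is not needed

# Convert TF-IDF sparse → dense (memory-heavy: the training matrix alone is about 1.7 GB as float64)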
X_train_tfidf_dense = X_train_tfidf.toarray()
X_test_tfidf_dense = X_test_tfidf.toarray()

# Combine all 3 feature sets
X_train_full = np.hstack([
    X_meta_train.values,
    X_train_tfidf_dense,
    X_train_bert
])

X_test_full = np.hstack([
    X_meta_test.values,
    X_test_tfidf_dense,
    X_test_bert
])

mlp_full = MLPRegressor(
    hidden_layer_sizes=(512, 256),
    activation='relu',
    solver='adam',
    max_iter=50,
    random_state=42
)

mlp_full.fit(X_train_full, y_train)
full_pred = mlp_full.predict(X_test_full)

mae_full = mean_absolute_error(y_test, full_pred)
rmse_full = np.sqrt(mean_squared_error(y_test, full_pred))
r2_full = r2_score(y_test, full_pred)

print("\n=== Metadata + TF-IDF + BERT (MLP) ===")
print(f"MAE:  {mae_full:.3f}")
print(f"RMSE: {rmse_full:.3f}")
print(f"R²:   {r2_full:.3f}")
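
Densifying the TF-IDF matrix is the expensive part of the cell above. A possible alternative, sketched below on toy data, keeps the combined matrix sparse with `scipy.sparse.hstack` and fits an estimator that accepts sparse input. Ridge is swapped in here purely for illustration, and all `_demo` names and values are made up.

```python
import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge

# Toy inputs (made up): four descriptions, three metadata columns, four prices
texts_demo = ["crew neck short sleeves", "high neck long sleeves",
              "crew neck logo", "high rise belt loops"]
meta_demo = np.array([[25.0, 1, 0], [29.5, 0, 1], [50.3, 1, 0], [37.1, 0, 1]])
y_demo = np.array([49.99, 59.99, 84.95, 75.0])

tfidf_part = TfidfVectorizer().fit_transform(texts_demo)  # sparse CSR matrix
meta_part = sparse.csr_matrix(meta_demo)                   # wrap the dense block as sparse

# Column-wise concatenation that never materializes a dense TF-IDF array
X_combined = sparse.hstack([meta_part, tfidf_part], format="csr")

model_demo = Ridge(alpha=1.0).fit(X_combined, y_demo)
print(X_combined.shape, sparse.issparse(X_combined))
```

The BERT embeddings are dense, so wrapping them as sparse saves nothing for that block; the benefit comes from leaving the 10,000 TF-IDF columns sparse.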


## SGD (poor results, do not run)


In [None]:

# Scale TF-IDF features because SGD is sensitive to feature scale
scaler_tfidf = StandardScaler(with_mean=False)  # with_mean=False for sparse matrices

X_train_tfidf_scaled = scaler_tfidf.fit_transform(X_train_tfidf)
X_test_tfidf_scaled = scaler_tfidf.transform(X_test_tfidf)

sgd_tfidf = SGDRegressor(
    loss='huber',
    penalty='l2',
    alpha=0.001,
    learning_rate='optimal',
    max_iter=2000,
    tol=1e-3,
    random_state=42
)

sgd_tfidf.fit(X_train_tfidf_scaled, y_train)
preds = sgd_tfidf.predict(X_test_tfidf_scaled)

print("MAE:", mean_absolute_error(y_test, preds))
print("RMSE:", np.sqrt(mean_squared_error(y_test, preds)))
print("R2:", r2_score(y_test, preds))


MAE: 12.178548679692259
RMSE: 21.102369604207503
R2: 0.6198419526748591


In [None]:


# Scale combined features
scaler_meta_bert = StandardScaler()
X_train_meta_bert_scaled = scaler_meta_bert.fit_transform(X_train_meta_bert)
X_test_meta_bert_scaled = scaler_meta_bert.transform(X_test_meta_bert)

sgd_meta_bert = SGDRegressor(
    loss='huber',
    penalty='l2',
    alpha=0.0005,
    learning_rate='optimal',
    max_iter=2000,
    tol=1e-3,
    random_state=42
)

sgd_meta_bert.fit(X_train_meta_bert_scaled, y_train)
preds2 = sgd_meta_bert.predict(X_test_meta_bert_scaled)

print("MAE:", mean_absolute_error(y_test, preds2))
print("RMSE:", np.sqrt(mean_squared_error(y_test, preds2)))
print("R2:", r2_score(y_test, preds2))

MAE: 12.504559503839102
RMSE: 21.602667254425707
R2: 0.601602605007953


## Extended Phase 4 — Bagging Regressor on Metadata + BERT

This section applies a Bagging Regressor to the combined feature set of metadata and BERT embeddings. Bagging trains multiple base estimators on bootstrap resamples of the training data and averages their predictions, which reduces variance. That is useful here because prices vary widely within categories and brands. We use a Random Forest as the base estimator to capture nonlinear relationships in both the metadata and the dense semantic features from BERT.

### Steps Performed

- **Combine Metadata and BERT Features:**  
  Metadata encodings and BERT embeddings are horizontally stacked to form a unified feature matrix.
  
- **Initialize Bagging Regressor:**  
  A Bagging Regressor with 10 Random Forest base models (each with 100 trees) is used to reduce prediction variance and improve robustness.

- **Train and Evaluate:**  
  The model is trained on the combined feature set and evaluated using MAE, RMSE, and R² to ensure consistency with earlier baselines.

- **Result Summary:**  
  A DataFrame is provided to present final evaluation metrics in a format consistent with other model results for comparison.

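In this run the bagged model reached an R² of 0.694, on par with the metadata-only Random Forest (0.694) and below the Metadata + BERT MLP (0.737). It therefore serves as a comparison point rather than the strongest non-image baseline before visual features are integrated in later phases of the project.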
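
To make the bagging mechanism concrete, here is a from-scratch sketch on synthetic data: draw bootstrap samples, fit one base estimator per sample, and average the predictions. The base estimator is a least-squares line rather than a Random Forest, and every name and value in the block is an assumption for illustration only.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic 1-D regression problem (not the ASOS data)
X_demo = rng.uniform(0, 10, size=200)
y_demo = 3.0 * X_demo + 5.0 + rng.normal(0, 2.0, size=200)

def fit_line(x, y):
    """Base estimator: ordinary least-squares line, returned as (slope, intercept)."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return slope, intercept

n_estimators = 10
models = []
for _ in range(n_estimators):
    # Bootstrap: sample row indices with replacement
    idx = rng.integers(0, len(X_demo), size=len(X_demo))
    models.append(fit_line(X_demo[idx], y_demo[idx]))

def bagged_predict(x_new):
    # Aggregate: average the predictions of all base estimators
    preds = np.array([s * x_new + b for s, b in models])
    return preds.mean(axis=0)

x_new = np.array([0.0, 5.0, 10.0])
print(bagged_predict(x_new))
```

`BaggingRegressor` in the next cell follows the same resample-and-average pattern, with each base estimator being a full Random Forest.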


In [None]:

#Bagging Regressor on Metadata + BERT


import numpy as np
import pandas as pd
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


# Construct the combined feature matrix (Metadata + BERT)


X_train_meta_bert = np.hstack([X_meta_train.values, X_train_bert])
X_test_meta_bert = np.hstack([X_meta_test.values, X_test_bert])

print("Combined train shape:", X_train_meta_bert.shape)
print("Combined test shape:", X_test_meta_bert.shape)


# Define the Bagging Regressor Model


bag_model = BaggingRegressor(
    estimator=RandomForestRegressor(
        n_estimators=100,
        random_state=42
    ),
    n_estimators=10,
    random_state=42,
    n_jobs=-1
)


# Train & Evaluate


bag_model.fit(X_train_meta_bert, y_train)
y_pred_bag = bag_model.predict(X_test_meta_bert)

mae_bag = mean_absolute_error(y_test, y_pred_bag)
rmse_bag = np.sqrt(mean_squared_error(y_test, y_pred_bag))
r2_bag = r2_score(y_test, y_pred_bag)

print("\n=== Bagging Regressor (Metadata + BERT) ===")
print(f"MAE:  {mae_bag:.3f}")
print(f"RMSE: {rmse_bag:.3f}")
print(f"R²:   {r2_bag:.3f}")


# Display as a DataFrame for consistency


bag_results = pd.DataFrame([{
    "model": "Bagging (Metadata + BERT)",
    "MAE": mae_bag,
    "RMSE": rmse_bag,
    "R2": r2_bag
}])

bag_results


Combined train shape: (21848, 783)
Combined test shape: (5462, 783)

=== Bagging Regressor (Metadata + BERT) ===
MAE:  11.135
RMSE: 18.941
R²:   0.694


Unnamed: 0,model,MAE,RMSE,R2
0,Bagging (Metadata + BERT),11.134635,18.940749,0.693736
