<a href="https://colab.research.google.com/github/sdbrgo/Glossi/blob/main/glossi.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Glossi, My Hair Care Buddy**




1. Import Necessary Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

2. Import Dataset

In [3]:
data = "https://raw.githubusercontent.com/sdbrgo/Glossi/main/glossi-dataset-v1.csv" #insert name/link of the dataset
df = pd.read_csv(data)
#df.head()

3. Preprocess Data

*   **Encode** nominal values.
*   Assign X and y features

In [53]:
le = LabelEncoder()
#encode the nominal features
df['shampoo_used_today'] = le.fit_transform(df['shampoo_used_today'])
df['conditioner_used_today'] = le.fit_transform(df['conditioner_used_today'])

# Create placeholder columns for the next 3 days' dryness and heaviness metrics
target_cols = []
for day in range(1,4): #values for next 3 days
  for metric in ['min_dryness','max_dryness','min_heaviness','max_heaviness']:
    new_col_name = f"{metric}_day{day}"
    df[new_col_name] = 0.0 # Initialize with a placeholder value (e.g., 0.0)
    target_cols.append(new_col_name)
    for day in range(1, 4):
        df[f'{metric}_day{day}'] = df[metric].shift(-day)

#drop the rows with missing values
df = df.dropna(subset=[f'{m}_day{d}' for m in ['min_dryness','max_dryness','min_heaviness','max_heaviness'] for d in range(1,4)])

#assign X and y features
X = df[['shampoo_used_today','conditioner_used_today','leave_in_amt','sweat','humidity','bath_intensity','wind_exposure']]
y = df[target_cols]

Unnamed: 0,date,min_dryness,max_dryness,min_heaviness,max_heaviness,shampoo_used_today,conditioner_used_today,leave_in_amt,sweat,humidity,...,min_heaviness_day1,max_heaviness_day1,min_dryness_day2,max_dryness_day2,min_heaviness_day2,max_heaviness_day2,min_dryness_day3,max_dryness_day3,min_heaviness_day3,max_heaviness_day3
98,2025-10-19,3.0,6.0,1.0,2.0,0,0,0.0,4.0,80,...,4.0,5.0,3.0,4.0,4.5,5.2,2.0,3.5,2.5,4.0
99,2025-10-20,2.0,3.0,4.0,5.0,0,0,5.0,3.0,68,...,4.5,5.2,2.0,3.5,2.5,4.0,1.5,3.0,2.0,3.0
100,2025-10-21,3.0,4.0,4.5,5.2,0,0,0.0,3.0,74,...,2.5,4.0,1.5,3.0,2.0,3.0,2.0,3.5,2.3,3.3
101,2025-10-22,2.0,3.5,2.5,4.0,1,1,5.0,5.0,70,...,2.0,3.0,2.0,3.5,2.3,3.3,2.3,3.0,2.7,3.0
102,2025-10-23,1.5,3.0,2.0,3.0,0,0,0.0,6.0,68,...,2.3,3.3,2.3,3.0,2.7,3.0,2.3,3.0,2.8,3.1


4. Train Model

* Use 70-15-15 Train-Validation-Test split   
* Use linear regression



In [48]:
#first split: train (70%) and temp (30%)
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)

#second split: validation (15%) and test (15%) from the 30%
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

#print values
print("Training set size:", len(X_train))
print("Validation set size:", len(X_val))
print("Test set size:", len(X_test))

#train model
lr = LinearRegression()
lr.fit(X_train, y_train)

Training set size: 74
Validation set size: 16
Test set size: 16


4.1 Predict on Validation Set

*   Fine-tune if necessary.



In [49]:
#predict on validation set
y_val_pred = lr.predict(X_val)

#evaluate performance
from sklearn.metrics import mean_squared_error, r2_score

r2_val = r2_score(y_val, y_val_pred)
rmse_val = np.sqrt(mean_squared_error(y_val, y_val_pred))

print(f"Validation R²: {r2_val:.3f}")
print(f"Validation RMSE: {rmse_val:.3f}")

Validation R²: -0.130
Validation RMSE: 1.050


In [50]:
#since the metric results are unsatisfactory,
#let's try FEATURE SCALING with linear regression
from sklearn.preprocessing import StandardScaler

#scale everything
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)

#retrain model using scaled data
lr.fit(X_train_scaled, y_train)
y_val_pred = lr.predict(X_val_scaled)

r2_val = r2_score(y_val, y_val_pred)
rmse_val = np.sqrt(mean_squared_error(y_val, y_val_pred))

print(f"Validation R²: {r2_val:.3f}")
print(f"Validation RMSE: {rmse_val:.3f}")

Validation R²: -0.130
Validation RMSE: 1.050


In [51]:
#let's try RANDOM FOREST without feature scaling
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(random_state=42)
rf.fit(X_train, y_train)

#predict on validation set
y_val_pred = rf.predict(X_val)

r2_val = r2_score(y_val, y_val_pred)
rmse_val = np.sqrt(mean_squared_error(y_val, y_val_pred))

print(f"Validation R²: {r2_val:.3f}")
print(f"Validation RMSE: {rmse_val:.3f}")

Validation R²: -0.004
Validation RMSE: 1.017


In [52]:
#let's try RANDOM FOREST and FEATURE SCALING

#retrain model using random forest & scaled data
rf.fit(X_train_scaled, y_train)
y_val_pred = rf.predict(X_val_scaled)

r2_val = r2_score(y_val, y_val_pred)
rmse_val = np.sqrt(mean_squared_error(y_val, y_val_pred))

print(f"Validation R²: {r2_val:.3f}")
print(f"Validation RMSE: {rmse_val:.3f}")

Validation R²: -0.025
Validation RMSE: 1.026


**4.2 Results and Observation**

---

*   The Linear Regression, Random Forest, and Feature Scaling failed to respect the nature of the data. A model that is more time-aware is necessary. Thus, models like **XGBoost** and others will be explored.   



**5. Redo initial Setup**
1.   Import dataset as `new_df`
2.   Encode nominal values
3. Create **lag features**

In [59]:
#reimport dataset
new_df = pd.read_csv(data)

#encode the nominal features
new_df['shampoo_used_today'] = le.fit_transform(new_df['shampoo_used_today'])
new_df['conditioner_used_today'] = le.fit_transform(new_df['conditioner_used_today'])

#new_df.head()

In [63]:
#make LAG FEATURES for XGBoost
for lag in range(1, 4):
    for metric in ['min_dryness', 'max_dryness', 'min_heaviness', 'max_heaviness']:
        new_df[f'{metric}_lag{lag}'] = new_df[metric].shift(lag)

#drop rows with null ('NaN') values
new_df = new_df.dropna().reset_index(drop=True)

**6. Train XGBoost Regressor**

In [64]:
!pip install xgboost



In [67]:
from xgboost import XGBRegressor

#features and targets
X = new_df[['shampoo_used_today','conditioner_used_today','leave_in_amt','sweat','humidity','bath_intensity','wind_exposure',
        'min_dryness_lag1','max_dryness_lag1','min_heaviness_lag1','max_heaviness_lag1',
        'min_dryness_lag2','max_dryness_lag2','min_heaviness_lag2','max_heaviness_lag2',
        'min_dryness_lag3','max_dryness_lag3','min_heaviness_lag3','max_heaviness_lag3']]
y = new_df[['min_dryness','max_dryness','min_heaviness','max_heaviness']]

#first split: train (70%) and temp (30%)
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, shuffle=False)

#second split: validation (15%) and test (15%) from the 30%
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, shuffle=False)

#train
xgb = XGBRegressor(n_estimators=200, learning_rate=0.05, max_depth=4, random_state=42)
xgb.fit(X_train, y_train)

**7. Validate**

In [68]:
# Validate
y_val_pred = xgb.predict(X_val)

r2 = r2_score(y_val, y_val_pred)
rmse = np.sqrt(mean_squared_error(y_val, y_val_pred))
print(f"Validation R²: {r2:.3f}, RMSE: {rmse:.3f}")

Validation R²: -0.058, RMSE: 0.925
