# Aims & Objectives

The main aim of working with this dataset is to determine which factors affect the price of backpacks, such as Brand, Material, Waterproof, Compartments, and Laptop Compartment.

# Key Takeaways

The dataset includes categorical and numerical features, requiring different preprocessing steps.
Handling missing values in categorical variables will be a priority.
Weight Capacity and Price have relatively wide distributions, suggesting possible feature scaling or transformation.
Feature engineering on categorical variables (like Brand and Material) may improve model performance.


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.metrics import root_mean_squared_error

In [2]:
train = pd.read_csv('train.csv')
train_extra = pd.read_csv('training_extra.csv')
test_df = pd.read_csv('test.csv')

# Training Data Columns

In [3]:
train

Unnamed: 0,id,Brand,Material,Size,Compartments,Laptop Compartment,Waterproof,Style,Color,Weight Capacity (kg),Price
0,0,Jansport,Leather,Medium,7.0,Yes,No,Tote,Black,11.611723,112.15875
1,1,Jansport,Canvas,Small,10.0,Yes,Yes,Messenger,Green,27.078537,68.88056
2,2,Under Armour,Leather,Small,2.0,Yes,No,Messenger,Red,16.643760,39.17320
3,3,Nike,Nylon,Small,8.0,Yes,No,Messenger,Green,12.937220,80.60793
4,4,Adidas,Canvas,Medium,1.0,Yes,Yes,Messenger,Green,17.749338,86.02312
...,...,...,...,...,...,...,...,...,...,...,...
299995,299995,Adidas,Leather,Small,9.0,No,No,Tote,Blue,12.730812,129.99749
299996,299996,Jansport,Leather,Large,6.0,No,Yes,Tote,Blue,26.633182,19.85819
299997,299997,Puma,Canvas,Large,9.0,Yes,Yes,Backpack,Pink,11.898250,111.41364
299998,299998,Adidas,Nylon,Small,1.0,No,Yes,Tote,Pink,6.175738,115.89080


# Training Extra Data Columns

In [4]:
train_extra

Unnamed: 0,id,Brand,Material,Size,Compartments,Laptop Compartment,Waterproof,Style,Color,Weight Capacity (kg),Price
0,500000,Under Armour,Canvas,Small,10.0,Yes,Yes,Tote,Blue,23.882052,114.11068
1,500001,Puma,Polyester,Small,4.0,No,Yes,Backpack,Green,11.869095,129.74972
2,500002,Jansport,Polyester,Small,8.0,Yes,Yes,Tote,Red,8.092302,21.37370
3,500003,Nike,Nylon,Large,7.0,No,No,Messenger,Pink,7.719581,48.09209
4,500004,Nike,Leather,Large,9.0,No,Yes,Tote,Green,22.741826,77.32461
...,...,...,...,...,...,...,...,...,...,...,...
3694313,4194313,Nike,Canvas,,3.0,Yes,Yes,Messenger,Blue,28.098120,104.74460
3694314,4194314,Puma,Leather,Small,10.0,Yes,Yes,Tote,Blue,17.379531,122.39043
3694315,4194315,Jansport,Canvas,Large,10.0,No,No,Backpack,Red,17.037708,148.18470
3694316,4194316,Puma,Canvas,,2.0,No,No,Backpack,Gray,28.783339,22.32269


In [5]:
train_df = pd.concat([train , train_extra] , axis=0)

# Combined Training Data

In [6]:
train_df

Unnamed: 0,id,Brand,Material,Size,Compartments,Laptop Compartment,Waterproof,Style,Color,Weight Capacity (kg),Price
0,0,Jansport,Leather,Medium,7.0,Yes,No,Tote,Black,11.611723,112.15875
1,1,Jansport,Canvas,Small,10.0,Yes,Yes,Messenger,Green,27.078537,68.88056
2,2,Under Armour,Leather,Small,2.0,Yes,No,Messenger,Red,16.643760,39.17320
3,3,Nike,Nylon,Small,8.0,Yes,No,Messenger,Green,12.937220,80.60793
4,4,Adidas,Canvas,Medium,1.0,Yes,Yes,Messenger,Green,17.749338,86.02312
...,...,...,...,...,...,...,...,...,...,...,...
3694313,4194313,Nike,Canvas,,3.0,Yes,Yes,Messenger,Blue,28.098120,104.74460
3694314,4194314,Puma,Leather,Small,10.0,Yes,Yes,Tote,Blue,17.379531,122.39043
3694315,4194315,Jansport,Canvas,Large,10.0,No,No,Backpack,Red,17.037708,148.18470
3694316,4194316,Puma,Canvas,,2.0,No,No,Backpack,Gray,28.783339,22.32269


In [7]:
# drop(..., inplace=True) modifies train_df and returns None, so there is nothing to assign
train_df.drop(columns='id', inplace=True)

# Summary Statistics

In [8]:
train_df.describe()

Unnamed: 0,Compartments,Weight Capacity (kg),Price
count,3994318.0,3992510.0,3994318.0
mean,5.43474,18.01042,81.36217
std,2.893043,6.973969,38.93868
min,1.0,5.0,15.0
25%,3.0,12.06896,47.47002
50%,5.0,18.05436,80.98495
75%,8.0,23.98751,114.855
max,10.0,30.0,150.0


In [9]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 3994318 entries, 0 to 3694317
Data columns (total 10 columns):
 #   Column                Dtype  
---  ------                -----  
 0   Brand                 object 
 1   Material              object 
 2   Size                  object 
 3   Compartments          float64
 4   Laptop Compartment    object 
 5   Waterproof            object 
 6   Style                 object 
 7   Color                 object 
 8   Weight Capacity (kg)  float64
 9   Price                 float64
dtypes: float64(3), object(7)
memory usage: 335.2+ MB


# Encoding Categorical Data 

Encoding Categorical Data refers to the process of converting categorical variables (e.g., text) into a numerical format so that they can be used in statistical models and machine learning algorithms.

Methods for Encoding Categorical Data
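1. Label Encoding
Each unique value in a column is replaced by an integer. The integers carry no meaning of their own (scikit-learn's LabelEncoder assigns them in sorted, i.e. alphabetical, order).
📌 Use case: Suitable for target labels or for features fed to tree-based models; for variables with a meaningful order, prefer Ordinal Encoding with an explicit order.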

2. One-Hot Encoding
Creates separate columns for each category with values 0 or 1.
📌 Use case: Good for nominal variables (e.g., Brand, Color).

3. Ordinal Encoding
Similar to Label Encoding but with values assigned in a specific order.
📌 Use case: Suitable for variables with a natural order (e.g., Size).

4. Binary Encoding
First applies Label Encoding and then converts to binary format.
📌 Use case: Good when there are many categories (e.g., Brand).

5. Target Encoding
Each category is replaced with the mean of the target variable (e.g., average price for each brand).
📌 Use case: Typically used in regression problems.
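
As a rough illustration of three of the methods listed above, here is a minimal pandas sketch. The `toy` frame and its rows are made up for the example; only the column names mirror the dataset. In practice, target encoding is usually computed on training folds only, since using the full target leaks information.

```python
import pandas as pd

# Toy frame with made-up rows (assumption: not taken from the real data).
toy = pd.DataFrame({
    "Brand": ["Nike", "Puma", "Nike", "Adidas"],
    "Size": ["Small", "Large", "Medium", "Small"],
    "Price": [80.0, 120.0, 100.0, 60.0],
})

# Ordinal encoding with an explicit order: Small < Medium < Large.
size_order = {"Small": 0, "Medium": 1, "Large": 2}
toy["Size_ord"] = toy["Size"].map(size_order)

# One-hot encoding: one 0/1 column per brand.
brand_onehot = pd.get_dummies(toy["Brand"], prefix="Brand", dtype=int)

# Target encoding: replace each brand with its mean price.
brand_mean_price = toy.groupby("Brand")["Price"].mean()
toy["Brand_te"] = toy["Brand"].map(brand_mean_price)

print(toy)
print(brand_onehot)
```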

In [10]:
object_columns = [column for column in train_df.columns if train_df[column].dtype == "object"]

for column in object_columns:
    print(f"Column name: {column} \n Unique values: {train_df[column].unique()}")

Column name: Brand 
 Unique values: ['Jansport' 'Under Armour' 'Nike' 'Adidas' 'Puma' nan]
Column name: Material 
 Unique values: ['Leather' 'Canvas' 'Nylon' nan 'Polyester']
Column name: Size 
 Unique values: ['Medium' 'Small' 'Large' nan]
Column name: Laptop Compartment 
 Unique values: ['Yes' 'No' nan]
Column name: Waterproof 
 Unique values: ['No' 'Yes' nan]
Column name: Style 
 Unique values: ['Tote' 'Messenger' nan 'Backpack']
Column name: Color 
 Unique values: ['Black' 'Green' 'Red' 'Blue' 'Gray' 'Pink' nan]


# Data Cleaning (Missing values)

The data_cleaning(df) function cleans and preprocesses a dataset by:

- Filling missing values (most_frequent for categorical, median for numerical).
- Creating a Weight_Class column based on Weight Capacity (kg).
- Converting data types (float64 for weight, object for class).
- Encoding the Size column numerically.


The Weight_Class column categorizes items based on their weight capacity (Weight Capacity (kg)). This is useful because:

- It simplifies classification of products/services based on their load capacity.
- It can be used as an additional feature in machine learning models.
- It enhances data interpretation by assigning meaningful categories instead of raw numerical values.
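
The same kind of binning can also be written with `pd.cut`. The sketch below is one possible alternative, not the notebook's implementation; it reuses the bin edges and labels from `data_cleaning`, and the `weights` values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up weight capacities (assumption: illustrative values only).
weights = pd.Series([5.0, 11.6, 16.6, 22.7, 27.1], name="Weight Capacity (kg)")

# Right-closed bins: (-inf, 5], (5, 15], (15, 20], (20, 25], (25, inf)
bins = [-np.inf, 5, 15, 20, 25, np.inf]
labels = ["Light", "Middle", "Light_heavy", "Middle_heavy", "Heavy"]

# pd.cut returns a categorical; cast to object to match the notebook's dtype.
weight_class = pd.cut(weights, bins=bins, labels=labels, right=True).astype("object")
print(weight_class.tolist())
```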

In [11]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder

In [12]:
def data_cleaning(df):
    categorical_col = df.select_dtypes(include=['object']).columns.tolist()

    # Most frequent value for categorical columns, median for weight capacity.
    # Note: the imputers are refit on whichever frame is passed in.
    imp_most = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
    imp_median = SimpleImputer(missing_values=np.nan, strategy="median")

    df[categorical_col] = imp_most.fit_transform(df[categorical_col])
    df[['Weight Capacity (kg)']] = imp_median.fit_transform(df[['Weight Capacity (kg)']])

    # Bin weight capacity into five classes; each interval is right-closed.
    conditions = [
        (df["Weight Capacity (kg)"] <= 5),
        (df["Weight Capacity (kg)"] > 5) & (df["Weight Capacity (kg)"] <= 15),
        (df["Weight Capacity (kg)"] > 15) & (df["Weight Capacity (kg)"] <= 20),
        (df["Weight Capacity (kg)"] > 20) & (df["Weight Capacity (kg)"] <= 25),
        (df["Weight Capacity (kg)"] > 25)
    ]

    choices = ['Light', 'Middle', 'Light_heavy', 'Middle_heavy', 'Heavy']
    df['Weight_Class'] = np.select(conditions, choices, default='')

    df["Weight Capacity (kg)"] = df["Weight Capacity (kg)"].astype("float64")
    df['Weight_Class'] = df['Weight_Class'].astype("object")

    # categories='auto' sorts the labels alphabetically: Large=0, Medium=1, Small=2.
    oe = OrdinalEncoder(categories='auto')
    df[['Size']] = oe.fit_transform(df[['Size']])

    return df

In [13]:
train_df = data_cleaning(train_df)
test_df = data_cleaning(test_df)
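
Because `data_cleaning` refits its imputers on every call, the test set is filled with statistics computed from the test set itself. A common alternative is to fit on the training data and reuse those statistics on the test data. The sketch below shows that pattern on two small made-up frames (`train_toy` and `test_toy` are hypothetical); it is an illustration under those assumptions, not a drop-in replacement for the function above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up frames (assumption: illustrative values only).
train_toy = pd.DataFrame({
    "Brand": pd.Series(["Nike", "Nike", np.nan, "Puma"], dtype="object"),
    "Weight Capacity (kg)": [10.0, np.nan, 20.0, 30.0],
})
test_toy = pd.DataFrame({
    "Brand": pd.Series([np.nan, "Puma"], dtype="object"),
    "Weight Capacity (kg)": [np.nan, 12.0],
})

cat_imputer = SimpleImputer(strategy="most_frequent")
num_imputer = SimpleImputer(strategy="median")

# Learn the fill values from the training data only.
cat_imputer.fit(train_toy[["Brand"]])
num_imputer.fit(train_toy[["Weight Capacity (kg)"]])

# Apply the training statistics to the test data.
test_brand = cat_imputer.transform(test_toy[["Brand"]]).ravel()
test_weight = num_imputer.transform(test_toy[["Weight Capacity (kg)"]]).ravel()

print(test_brand, test_weight)
```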

In [14]:
train_df

Unnamed: 0,Brand,Material,Size,Compartments,Laptop Compartment,Waterproof,Style,Color,Weight Capacity (kg),Price,Weight_Class
0,Jansport,Leather,1.0,7.0,Yes,No,Tote,Black,11.611723,112.15875,Middle
1,Jansport,Canvas,2.0,10.0,Yes,Yes,Messenger,Green,27.078537,68.88056,Heavy
2,Under Armour,Leather,2.0,2.0,Yes,No,Messenger,Red,16.643760,39.17320,Light_heavy
3,Nike,Nylon,2.0,8.0,Yes,No,Messenger,Green,12.937220,80.60793,Middle
4,Adidas,Canvas,1.0,1.0,Yes,Yes,Messenger,Green,17.749338,86.02312,Light_heavy
...,...,...,...,...,...,...,...,...,...,...,...
3694313,Nike,Canvas,1.0,3.0,Yes,Yes,Messenger,Blue,28.098120,104.74460,Heavy
3694314,Puma,Leather,2.0,10.0,Yes,Yes,Tote,Blue,17.379531,122.39043,Light_heavy
3694315,Jansport,Canvas,0.0,10.0,No,No,Backpack,Red,17.037708,148.18470,Light_heavy
3694316,Puma,Canvas,1.0,2.0,No,No,Backpack,Gray,28.783339,22.32269,Heavy


In [15]:
train_df.isna().sum()

Brand                   0
Material                0
Size                    0
Compartments            0
Laptop Compartment      0
Waterproof              0
Style                   0
Color                   0
Weight Capacity (kg)    0
Price                   0
Weight_Class            0
dtype: int64

In [16]:
test_df.isna().sum()

id                      0
Brand                   0
Material                0
Size                    0
Compartments            0
Laptop Compartment      0
Waterproof              0
Style                   0
Color                   0
Weight Capacity (kg)    0
Weight_Class            0
dtype: int64

In [18]:
categorical_col = train_df.select_dtypes(include=['object']).columns.tolist()
categorical_col

['Brand',
 'Material',
 'Laptop Compartment',
 'Waterproof',
 'Style',
 'Color',
 'Weight_Class']

In [19]:
for col in train_df[categorical_col]:
    print(f"Column name: {col} Values: {train_df[col].value_counts()}")

Column name: Brand Values: Brand
Under Armour    927793
Adidas          797000
Nike            764407
Puma            755778
Jansport        749340
Name: count, dtype: int64
Column name: Material Values: Material
Polyester    1171844
Leather       976186
Nylon         942656
Canvas        903632
Name: count, dtype: int64
Column name: Laptop Compartment Values: Laptop Compartment
Yes    2071470
No     1922848
Name: count, dtype: int64
Column name: Waterproof Values: Waterproof
Yes    2063529
No     1930789
Name: count, dtype: int64
Column name: Style Values: Style
Messenger    1433857
Tote         1297942
Backpack     1262519
Name: count, dtype: int64
Column name: Color Values: Color
Pink     821874
Gray     666110
Blue     638485
Red      630215
Black    620610
Green    617024
Name: count, dtype: int64
Column name: Weight_Class Values: Weight_Class
Middle          1420536
Light_heavy      853801
Middle_heavy     840557
Heavy            821337
Light             58087
Name: count, dtype: int64

# Scaling Weight Capacity

In [20]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler

In [21]:
train_df['Weight Capacity (kg)'].describe()

count    3.994318e+06
mean     1.801044e+01
std      6.972391e+00
min      5.000000e+00
25%      1.207231e+01
50%      1.805436e+01
75%      2.398552e+01
max      3.000000e+01
Name: Weight Capacity (kg), dtype: float64
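
If scaling is applied to this column, the usual pattern is to fit the scaler on the training data and only transform the test data, so that both are scaled with the same mean and standard deviation. Below is a minimal sketch on made-up arrays (`train_w` and `test_w` are hypothetical). Tree-based models such as CatBoost split on thresholds, so this kind of rescaling generally matters less for them than for linear or distance-based models.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up weight capacities (assumption: illustrative values only).
train_w = np.array([[5.0], [15.0], [25.0]])
test_w = np.array([[15.0], [30.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_w)  # learns mean and std from train only
test_scaled = scaler.transform(test_w)        # reuses the training statistics

print(scaler.mean_, scaler.scale_)
print(test_scaled.ravel())
```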

In [22]:
# Disabled: standardize weight capacity.
# The scaler is fit on the training data and only applied to the test data.
# scaler = StandardScaler()
# train_df[['Weight Capacity (kg)']] = scaler.fit_transform(train_df[['Weight Capacity (kg)']])
# test_df[['Weight Capacity (kg)']] = scaler.transform(test_df[['Weight Capacity (kg)']])

In [23]:
# Disabled: min-max scale weight capacity.
# The scaler is fit on the training data and only applied to the test data.
# min_max_scaler = MinMaxScaler()
# train_df[['Weight Capacity (kg)']] = min_max_scaler.fit_transform(train_df[['Weight Capacity (kg)']])
# test_df[['Weight Capacity (kg)']] = min_max_scaler.transform(test_df[['Weight Capacity (kg)']])

# Train test split

In [24]:
from sklearn.model_selection import train_test_split

In [25]:
y = train_df['Price']
X = train_df.drop(columns='Price')

In [26]:
X_train, X_test , y_train , y_test = train_test_split(X, y ,test_size=0.2 , shuffle=True)

In [27]:
X_train.shape,  X_test.shape

((3195454, 10), (798864, 10))

# CatBoost

## Grid Search

In [28]:
from catboost import CatBoostRegressor, Pool

In [29]:
model_grid = CatBoostRegressor(loss_function="RMSE",
                               eval_metric = "RMSE",
                               cat_features=categorical_col, 
                               task_type="GPU",
                               devices='0',
                               early_stopping_rounds=50,
                               verbose=False)

In [30]:
params = {'learning_rate': [0.05],
          'depth': [6],
          'l2_leaf_reg': [3],
          'iterations': [1000],
          'bagging_temperature':[1]
         }

In [31]:
clf_search = model_grid.grid_search(params, 
                                    X=X_train,
                                    y = y_train , 
                                    plot=True , 
                                    verbose=False,
                                    cv=2)

MetricVisualizer(layout=Layout(align_self='stretch', height='500px'))

bestTest = 38.85967031
bestIteration = 890
Training on fold [0/2]
bestTest = 38.90840002
bestIteration = 783
Training on fold [1/2]
bestTest = 38.90053371
bestIteration = 997


In [32]:
# Extract best parameters from grid search
best_params = clf_search['params']
print("Best Parameters Found: ", best_params)

Best Parameters Found:  {'bagging_temperature': 1, 'depth': 6, 'learning_rate': 0.05, 'l2_leaf_reg': 3, 'iterations': 1000}
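
Note that the grid above has a single value per parameter, so only one combination is evaluated. To see how quickly a wider grid grows, scikit-learn's `ParameterGrid` can enumerate the combinations before any training is run. The values in `wider_params` are hypothetical and not tuned for this dataset.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical wider grid (assumption: illustrative values, not tuned).
wider_params = {
    "learning_rate": [0.03, 0.05, 0.1],
    "depth": [4, 6, 8],
    "l2_leaf_reg": [1, 3],
}

grid = ParameterGrid(wider_params)
n_candidates = len(grid)  # product of the list lengths: 3 * 3 * 2

print(n_candidates)
for candidate in list(grid)[:3]:
    print(candidate)
```

Each candidate is trained once per cross-validation fold, so the total number of fits is the number of candidates times `cv`.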


## k-Fold Cross-Validation

- k-fold cross-validation is used to compare machine learning algorithms and to select a model for a given predictive modeling problem. With a single train/test split, the model is evaluated only once, so you cannot tell whether a good result is down to luck. Evaluating the model several times on different splits gives more confidence in the model design.

- The procedure has a single parameter, k, which is the number of groups the data sample is split into.

1. Dealing with the problem of splitting data into train and test

- Split the data into k groups.
- Then run k separate learning experiments. In each one:
  - pick one group as the testing set
  - train on the remaining k-1 groups
  - test on the testing set
- Summarize the skill of the model using the sample of model evaluation scores.
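
The steps above can be sketched on a small synthetic problem. The data, the `LinearRegression` model, and all variable names here are assumptions chosen to keep the example fast and self-contained; the notebook itself uses CatBoost on the backpack data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Synthetic regression data (assumption: made up for the example).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []

for train_idx, val_idx in cv.split(X_toy):
    # Train on k-1 groups, evaluate on the held-out group.
    model = LinearRegression().fit(X_toy[train_idx], y_toy[train_idx])
    pred = model.predict(X_toy[val_idx])
    rmse = float(np.sqrt(mean_squared_error(y_toy[val_idx], pred)))
    fold_scores.append(rmse)

# Summarize the k scores.
print(f"RMSE: {np.mean(fold_scores):.3f} +/- {np.std(fold_scores):.3f}")
```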

In [33]:
from sklearn.model_selection import KFold

best_model = CatBoostRegressor(
    **best_params,
    loss_function="RMSE",
    eval_metric="RMSE",
    cat_features=categorical_col, 
    task_type="GPU",
    devices='0',
    verbose=200,               # Set to 200 for logging
    early_stopping_rounds=200  # Make consistent
)

cv = KFold(5, shuffle=True, random_state=0)

scores = []
test_preds = []

X_test_pool = Pool(data = test_df, cat_features=categorical_col)

for train_idx, val_idx in cv.split(X, y): 
    
    X_train_fold, X_val_fold = X.iloc[train_idx], X.iloc[val_idx]
    y_train_fold, y_val_fold = y.iloc[train_idx], y.iloc[val_idx]
    
    X_train_pool = Pool(X_train_fold, y_train_fold, cat_features=categorical_col)
    X_valid_pool = Pool(X_val_fold, y_val_fold, cat_features=categorical_col)
    
    # Fit model using best parameters
    best_model.fit(X=X_train_pool, eval_set=X_valid_pool, verbose=200, early_stopping_rounds=200)
    
    # Predict and calculate RMSE for validation
    val_pred = best_model.predict(X_valid_pool)
    score = root_mean_squared_error(y_val_fold, val_pred)
    scores.append(score)

    # Predict for test set
    test_pred = best_model.predict(X_test_pool)
    
    if 'postprocess' in globals():
        test_pred = postprocess(y_val_fold, val_pred, test_pred)
    
    test_preds.append(test_pred)

# Print results
print(f'Cross-validated RMSE score: {np.mean(scores):.3f} +/- {np.std(scores):.3f}')
print(f'Max RMSE score: {np.max(scores):.3f}')
print(f'Min RMSE score: {np.min(scores):.3f}')

0:	learn: 38.9373722	test: 38.9331028	best: 38.9331028 (0)	total: 280ms	remaining: 4m 39s
200:	learn: 38.8994854	test: 38.8991582	best: 38.8991582 (200)	total: 22s	remaining: 1m 27s
400:	learn: 38.8937575	test: 38.8971871	best: 38.8971851 (396)	total: 43.8s	remaining: 1m 5s
600:	learn: 38.8889702	test: 38.8963550	best: 38.8963550 (600)	total: 1m 5s	remaining: 43.7s
800:	learn: 38.8845243	test: 38.8961779	best: 38.8961326 (708)	total: 1m 28s	remaining: 22.1s
999:	learn: 38.8804963	test: 38.8960976	best: 38.8960379 (931)	total: 1m 52s	remaining: 0us
bestTest = 38.89603785
bestIteration = 931
Shrink model to first 932 iterations.
0:	learn: 38.9289561	test: 38.9668309	best: 38.9668309 (0)	total: 205ms	remaining: 3m 25s
200:	learn: 38.8908510	test: 38.9330884	best: 38.9330637 (197)	total: 22.5s	remaining: 1m 29s
400:	learn: 38.8844625	test: 38.9316706	best: 38.9316603 (395)	total: 44.3s	remaining: 1m 6s
600:	learn: 38.8796391	test: 38.9313290	best: 38.9313146 (595)	total: 1m 6s	remaining: 4

# Submission

In [34]:
sample_submission = pd.read_csv('sample_submission.csv')
sample_submission['Price'] = np.mean(test_preds, axis=0)
sample_submission.to_csv('submission.csv', index=False)
sample_submission.head(10)

Unnamed: 0,id,Price
0,300000,81.245771
1,300001,81.884345
2,300002,82.537746
3,300003,81.150639
4,300004,77.320145
5,300005,81.579709
6,300006,82.367649
7,300007,84.421213
8,300008,83.269573
9,300009,79.292728
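
The submitted price is the average of the per-fold test predictions. As a small sketch of what `np.mean(test_preds, axis=0)` does, assuming three folds and four test rows with made-up values:

```python
import numpy as np

# Hypothetical predictions from 3 folds for 4 test rows (made-up values).
fold_preds = [
    np.array([80.0, 82.0, 79.0, 85.0]),
    np.array([81.0, 83.0, 78.0, 84.0]),
    np.array([82.0, 81.0, 80.0, 86.0]),
]

# axis=0 averages across folds, giving one value per test row.
ensemble = np.mean(fold_preds, axis=0)
print(ensemble)
```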


In [35]:
# Disabled: alternative submission using the grid-search model directly.
# cat_features = categorical_col
# test_pool = Pool(data=test_df, cat_features=cat_features)
#
# id_test = test_df['id']
#
# test_pred = model_grid.predict(test_pool)
# submission_df = pd.DataFrame({
#     'id' : id_test,
#     'Price': test_pred
# })
#
# submission_df.to_csv("submission.csv", index=False)
# submission_df.head()