# eCommerce Price Prediction

## Problem Statement
E-commerce platforms have existed for more than two decades. Their popularity as a common choice for buying and selling essential products has grown rapidly over the past few years, and e-commerce has had a large impact on everyday life. Many platforms compete for dominance by offering consumer goods at competitive prices. In this hackathon, we challenge data science enthusiasts to predict the price of commodities on an e-commerce platform.

Given are **7 distinguishing factors** that can influence the price of a product on an e-commerce platform. Your objective as a data scientist is to build a machine learning model that can accurately predict the price of a product based on the given factors.

## Input Data

The unzipped folder contains the following files:

* Train.csv: 2452 observations
* Test.csv: 1051 observations

Target variable: **Selling_Price**


## Import Libraries

In [None]:
# Basic libraries

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os

# Plot related libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Linear Regression Model
from sklearn.linear_model import LinearRegression, RidgeCV
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import TransformedTargetRegressor
from sklearn.utils import shuffle

## Load the dataset

The train and test sets for the eCommerce price prediction problem are provided as comma-separated (CSV) files.


In [None]:
TRAIN_FILE = "/kaggle/input/E-Commerce_Participants_Data/Train.csv"
TEST_FILE = "/kaggle/input/E-Commerce_Participants_Data/Test.csv"

# Using pandas read_csv method to import data
train_ecomm_df = pd.read_csv(TRAIN_FILE, header=0)
test_ecomm_df = pd.read_csv(TEST_FILE, header=0)

Let us check the info of the given dataset.

**Training Set**

In [None]:
train_ecomm_df.info()
print("=="*30)
train_ecomm_df.head()

**Test Set**

In [None]:
test_ecomm_df.info()
print("=="*30)
test_ecomm_df.head()

## Exploratory Data Analysis

In [None]:
train_ecomm_df.describe().T

**Check for null data**

The training set seems to have no null data. 
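The statement above can be checked directly. Below is a minimal sketch on a toy frame: the values are invented, and only the column names `Item_Rating`, `Subcategory_1` and `Selling_Price` come from the dataset. In the notebook the same calls would be applied to `train_ecomm_df`. Placeholder strings such as `'unknown'` are not flagged by `isnull()`, so they are counted separately.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train_ecomm_df; the values are invented for illustration.
toy_df = pd.DataFrame({
    "Item_Rating": [4.1, 3.5, np.nan, 2.0],
    "Subcategory_1": ["unknown", "shoes", "bags", "unknown"],
    "Selling_Price": [291.0, 897.0, 792.0, 837.0],
})

# Count true missing values (NaN/None) per column.
null_counts = toy_df.isnull().sum()

# Count placeholder strings, which isnull() does not detect.
unknown_counts = toy_df.eq("unknown").sum()

print(null_counts)
print(unknown_counts)
```

A column can therefore report zero nulls and still contain missing information encoded as text.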


In [None]:
train_ecomm_df.columns

In [None]:
sns.set(style='whitegrid', palette='muted')
fig, ax = plt.subplots(1,2, figsize=(12,6))

# histplot replaces distplot, which has been deprecated since seaborn 0.11
sns.histplot(train_ecomm_df['Selling_Price'], kde=True, ax=ax[0])
sns.scatterplot(x='Item_Rating', y='Selling_Price', data=train_ecomm_df, marker='o', color='r', ax=ax[1])

plt.tight_layout()
plt.show()

In [None]:
# Transform the target variable
y_target = np.log1p(train_ecomm_df['Selling_Price'])

In [None]:
fig, axes = plt.subplots(1,2,figsize=(10,5))
sns.histplot(train_ecomm_df['Selling_Price'], kde=True, ax=axes[0])
sns.histplot(y_target, kde=True, ax=axes[1])
axes[0].set_title("Skewed Y-Values")
axes[1].set_title("Log-Transformed Y-Values")
plt.show()
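
The effect of the log transform can also be checked numerically instead of only visually. The sketch below uses synthetic log-normal values as a stand-in for prices (an assumption, not the competition data) and compares skewness before and after `np.log1p`, then confirms that `np.expm1` inverts the transform.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "prices" (log-normal); not the competition data.
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=6.0, sigma=1.0, size=2000))

log_prices = np.log1p(prices)      # log(1 + x), well defined at x = 0
recovered = np.expm1(log_prices)   # exp(x) - 1, the inverse of log1p

print("skew before:", round(prices.skew(), 2))
print("skew after: ", round(log_prices.skew(), 2))
print("round trip ok:", np.allclose(recovered, prices))
```

A skewness close to zero after the transform suggests the log scale is a reasonable choice for a linear model.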

## Prepare data for model building

The dataset contains a date column and a few categorical columns. The categorical columns must be encoded as numbers before a model can be built.

In [None]:
# Merge train and test data
tempset = pd.concat([train_ecomm_df, test_ecomm_df], keys=[0,1])

# Treat 'unknown' as missing and fill it from neighboring rows (backward fill, then forward fill)
tempset['Subcategory_1'] = tempset['Subcategory_1'].replace('unknown', np.nan).bfill().ffill()
tempset['Subcategory_2'] = tempset['Subcategory_2'].replace('unknown', np.nan).bfill().ffill()

# Fall back to the mode for any value that is still missing
tempset['Subcategory_1'] = tempset['Subcategory_1'].fillna(tempset['Subcategory_1'].mode()[0])
tempset['Subcategory_2'] = tempset['Subcategory_2'].fillna(tempset['Subcategory_2'].mode()[0])
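
The cell above fills `'unknown'` entries from neighboring rows first and only then falls back to the mode. The two strategies can give different results, as this minimal sketch on a toy column shows (the category values are invented for illustration).

```python
import numpy as np
import pandas as pd

# Toy column; the category values are invented for illustration.
toy = pd.DataFrame({
    "Subcategory_1": ["shoes", "unknown", "bags", "shoes", "unknown"],
})

# Treat the 'unknown' placeholder as missing.
col = toy["Subcategory_1"].replace("unknown", np.nan)

# Option A: copy the value from the next row, then from the previous row.
filled_neighbors = col.bfill().ffill()

# Option B: fill every gap with the most frequent category.
filled_mode = col.fillna(col.mode()[0])

print(filled_neighbors.tolist())
print(filled_mode.tolist())
```

Neighbor filling depends on row order, so it only makes sense when adjacent rows are related; mode imputation is independent of ordering.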

In [None]:
tempset.drop(['Date', 'Product'], axis=1, inplace=True)

In [None]:
# Getting the categorical columns
cat_data = tempset.select_dtypes(include=['object'])

# One-hot encoding
X_encode = pd.get_dummies(tempset, columns=cat_data.columns)

# Getting back the Train and Test data
X_train, X_enc_test = X_encode.xs(0), X_encode.xs(1)
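
Encoding the combined frame, rather than each split separately, keeps the dummy columns aligned even when a category occurs in only one split. A minimal sketch of the `concat` / `get_dummies` / `xs` pattern on toy frames (the values are invented; only the column names come from the dataset):

```python
import pandas as pd

# Toy frames; 'bags' appears only in the test split.
toy_train = pd.DataFrame({
    "Subcategory_1": ["shoes", "toys"],
    "Selling_Price": [100.0, 250.0],
})
toy_test = pd.DataFrame({"Subcategory_1": ["bags", "shoes"]})

# Stack both splits under an outer index level: 0 = train, 1 = test.
combined = pd.concat([toy_train, toy_test], keys=[0, 1])

# Encode once so both splits share the same dummy columns.
encoded = pd.get_dummies(combined, columns=["Subcategory_1"])

# Select each split back by its outer index key.
enc_train, enc_test = encoded.xs(0), encoded.xs(1)

print(list(enc_train.columns))
print(list(enc_test.columns))
```

In this sketch the test rows carry `NaN` in `Selling_Price` after the concat, since the toy test frame has no target column.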

## Define X and Y 

In [None]:
# Prepare X and y for fitting the model
y = X_train['Selling_Price'].values
X = X_train.drop('Selling_Price', axis=1).values

# Convert the test features to a NumPy array as well, consistent with X
X_test = X_enc_test.drop('Selling_Price', axis=1).values

## Building Linear Regression Model

### Using TransformedTargetRegressor model

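`TransformedTargetRegressor` wraps a regressor, applies a transformation to the target before fitting, and inverts that transformation at prediction time. The wrapped regressor can itself be a cross-validated, regularized model such as RidgeCV or LassoCV.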
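
To illustrate what the wrapper does, here is a minimal sketch on synthetic data (an assumption, not the competition data) where the target is linear only on the log scale. The `alphas` grid is chosen arbitrarily for the example. The predictions of the wrapper come back on the original scale, and they equal `expm1` applied to the predictions of the inner regressor.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import RidgeCV

# Synthetic data: the target is exponential in the features, so it is
# linear only on the log scale. Not the competition data.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
linear_part = 5.0 + X_demo @ np.array([0.5, -0.3, 0.2])
y_demo = np.expm1(linear_part + rng.normal(scale=0.05, size=200))

model = TransformedTargetRegressor(
    regressor=RidgeCV(alphas=(0.1, 1.0, 10.0)),  # illustrative alpha grid
    func=np.log1p,          # applied to y before fitting
    inverse_func=np.expm1,  # applied to predictions
)
model.fit(X_demo, y_demo)

# Predictions on the original (price-like) scale.
preds = model.predict(X_demo[:5])

# The fitted inner regressor lives in `regressor_` and predicts on the log scale.
log_preds = model.regressor_.predict(X_demo[:5])

print(np.allclose(preds, np.expm1(log_preds)))
```

This keeps the transform and its inverse in one object, so the same wrapper can be used for fitting, cross-validation and prediction without transforming `y` by hand.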

### RidgeCV implementation

In [None]:
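from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# normalize=True was removed from RidgeCV in scikit-learn 1.2, so features are scaled in a
# pipeline instead (scores may differ slightly); gcv_mode only applies when cv=None.
ridge_cv = make_pipeline(StandardScaler(), RidgeCV(cv=10, scoring='neg_mean_squared_error'))

# Ridge-regularized linear regression (10-fold CV), fit on log1p(y) and inverted with expm1
ridge_reg = TransformedTargetRegressor(regressor=ridge_cv,
                                       func=np.log1p,
                                       inverse_func=np.expm1)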

In [None]:
ridge_reg.fit(X, y)

# Predict the test data
predictions = ridge_reg.predict(X_test)

In [None]:
final_df = pd.DataFrame({'Selling_Price': predictions})

# Round the predicted prices to two decimals
final_df['Selling_Price'] = final_df['Selling_Price'].round(2)
final_df = pd.concat([test_ecomm_df, final_df['Selling_Price']], axis=1)

In [None]:
final_df.head(20)

## Learnings

* The data is a mix of categorical, ordinal, numeric and date values.
* The target attribute **Selling_Price** is skewed, as its distribution plot shows.
* A transformation is needed to bring it closer to a normal distribution. Here, **np.log1p** is used. [Click to know more about the method](https://numpy.org/doc/1.18/reference/generated/numpy.log1p.html#numpy.log1p)
* It is always good to start with a linear model rather than ensembles or neural networks.
* The intention was to get exposure to real-world data, not to chase the leaderboard (pun intended).
* First tried a linear regression model with RidgeCV.
* During the iterations, tried a [QuantileTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.QuantileTransformer.html) with 300 quantiles, but the score was not converging towards 0.5, hence switched to the [log transformer](https://numpy.org/doc/1.18/reference/generated/numpy.log1p.html#numpy.log1p).
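
The last point can be sketched as a side-by-side comparison of the two target transforms. This is only an illustration on synthetic data, taking the 300 mentioned above as `n_quantiles` (an assumption); it does not reproduce the competition scores.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import RidgeCV
from sklearn.preprocessing import QuantileTransformer

# Synthetic data with a log-linear target; not the competition data.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 3))
linear_part = 5.0 + X_demo @ np.array([0.5, -0.3, 0.2])
y_demo = np.expm1(linear_part + rng.normal(scale=0.1, size=500))

# Target mapped to a normal distribution through its empirical quantiles.
# The quantile mapping is only approximately invertible, so the strict
# round-trip check is skipped.
quantile_model = TransformedTargetRegressor(
    regressor=RidgeCV(),
    transformer=QuantileTransformer(n_quantiles=300, output_distribution="normal"),
    check_inverse=False,
)

# Target mapped with log1p and inverted with expm1.
log_model = TransformedTargetRegressor(
    regressor=RidgeCV(), func=np.log1p, inverse_func=np.expm1
)

# R^2 on the original scale of the target (training data, for illustration only).
scores = {}
for name, model in [("quantile", quantile_model), ("log1p", log_model)]:
    model.fit(X_demo, y_demo)
    scores[name] = model.score(X_demo, y_demo)

print(scores)
```

On real data the comparison should use held-out folds and the competition metric rather than training R².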


## Conclusion

The final submission scores are as follows:

|Best Public Score | Final Score |
|------------------|-------------|
|0.67659 |**0.65363** |

These scores earned the **38th** position. The challenge was quite tough, solely because of the data.

Although feature scaling and feature engineering were not done extensively here, **linear regression** with RidgeCV seems to have done a pretty good job.
