<h1><font color = 'blue'> Mercari Price Suggestion Challenge</font></h1>

<h2> 1. Business Problem</h2>

<h3> 1.1 Problem Description:</h3>
<p>
It is hard to judge what a product is really worth, as small details can mean big differences in pricing.
</p>
<p>
Product pricing gets even harder at scale, considering just how many products are sold online. Clothing has strong seasonal pricing trends and is heavily influenced by brand names, while electronics have fluctuating prices based on product specs.
</p>
<p>
Mercari, Japan’s biggest community-powered shopping app, knows this problem deeply. They’d like to offer pricing suggestions to sellers, but this is tough because sellers can put just about anything, or any bundle of things, on Mercari's marketplace.
</p>
<p>
So, our task is to implement an algorithm which could automatically predict the prices of the products.
</p>

__Reference:__

https://www.kaggle.com/c/mercari-price-suggestion-challenge/overview



<h3>1.2 Problem Statement</h3>
<p>
We are given user-entered text descriptions of products, along with details such as product category name, brand name, and item condition. The objective is to build an algorithm that automatically suggests the right product prices.
</p>

<h3>1.3 Business Objective and constraint</h3>

__Objectives__:
1. Predict the right product prices based on product category, brand name, etc.
2. Minimize the RMSLE.

__Constraints__:
1. Some form of interpretability.


<h1> 2. Machine Learning Problem </h1>

<h2>2.1 Data </h2>

<h3> 2.1.1 Data Overview </h3>

<p> Get the data from : https://www.kaggle.com/c/mercari-price-suggestion-challenge/data </p>
<p> Data files : 
<ul> 
<li> train.tsv: It has 1,482,535 rows and 8 columns. </li>
<li> test.tsv: It has 693,359 rows and 7 columns ('price' is excluded). </li>
</ul>
<br>
The files consist of a list of product listings. These files are tab-delimited.
<br>
<pre>
1. train_id or test_id : the id of the listing

2. name - the title of the listing. Note that we have cleaned the data to remove text that looks like prices (e.g. $20) to avoid leakage. These removed prices are represented as [rm]

3. item_condition_id - the condition of the items provided by the seller

4. category_name - category of the listing

5. brand_name

6. price - the price that the item was sold for. This is the target variable that you will predict. The unit is USD. This column doesn't exist in test.tsv since that is what you will predict.

7. shipping - 1 if shipping fee is paid by seller and 0 by buyer

8. item_description - the full description of the item. Note that we have cleaned the data to remove text that looks like prices (e.g. $20) to avoid leakage. These removed prices are represented as [rm]
</pre>

<h2>2.2 Mapping the real world problem to a Machine Learning Problem </h2>

<h3> 2.2.1 Type of Machine Learning Problem </h3>



The task is to predict a real-valued price for each product sold online. Thus, this is a __Regression Problem__.

<h3> 2.2.2 Performance metric </h3>

Root Mean Squared Logarithmic Error: 
https://www.kaggle.com/c/mercari-price-suggestion-challenge/overview/evaluation
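
For reference, RMSLE compares predicted and actual prices on a log scale, $\sqrt{\frac{1}{n}\sum_{i}\left(\log(p_i + 1) - \log(a_i + 1)\right)^2}$, so it penalizes relative rather than absolute errors. A minimal NumPy sketch of the metric follows; the function name `rmsle` and the sample prices are illustrative and not part of the competition code.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root Mean Squared Logarithmic Error between two price arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # log1p(x) = log(1 + x), which keeps a price of 0 well defined
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

# Illustrative prices in USD (made up for this example)
actual = [10.0, 25.0, 100.0]
predicted = [12.0, 20.0, 110.0]
print(rmsle(actual, predicted))
```

Because the error is taken on logs, predicting 20 USD for a 10 USD item costs far more than predicting 110 USD for a 100 USD item.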

<h1> 3. Exploratory Data Analysis </h1>

<h3>Importing required libraries</h3>

In [None]:
import pandas as pd
import numpy as np
import matplotlib
matplotlib.use('nbagg')
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

import matplotlib.pyplot as plt
plt.rcParams.update({'figure.max_open_warning': 0})
from sklearn import preprocessing
from tqdm import tqdm
import seaborn as sns
sns.set_style('whitegrid')
import os
from sklearn.model_selection import train_test_split
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from bs4 import BeautifulSoup
import re
import scipy
from sklearn.preprocessing import LabelBinarizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Input, Dropout, Dense, concatenate, GRU, Embedding, Flatten, Activation
from keras.optimizers import Adam
from keras.models import Model
from keras import backend as K

from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
import math
import joblib

<h2> 3.1 Reading Data</h2> 

In [None]:
print("Reading Data")

train = pd.read_csv('../input/mercari-price-suggestion-challenge/train.tsv', sep='\t')
test = pd.read_csv('../input/mercari-price-suggestion-challenge/test_stg2.tsv', sep='\t')

print("Shape of train data: ",train.shape)
print("Shape of test data: ",test.shape)

In [None]:
y_train = np.log1p(train["price"])
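
Training on `log1p(price)` means that a model minimizing squared error on this target is minimizing squared log error on the raw price, which is the square of RMSLE. Predictions are mapped back to USD with `np.expm1`. The sketch below checks both facts on made-up numbers.

```python
import numpy as np

prices = np.array([3.0, 10.0, 45.0, 200.0])      # made-up true prices
pred_prices = np.array([4.0, 9.0, 50.0, 150.0])  # made-up predictions

# Work in log space, as the notebook does for y_train
y_true_log = np.log1p(prices)
y_pred_log = np.log1p(pred_prices)

# RMSE in log space ...
rmse_log = np.sqrt(np.mean((y_pred_log - y_true_log) ** 2))
# ... equals RMSLE computed directly from the raw prices
rmsle_raw = np.sqrt(np.mean((np.log(pred_prices + 1) - np.log(prices + 1)) ** 2))

# expm1 is the inverse of log1p, so log-space predictions convert back to USD
recovered = np.expm1(y_pred_log)
print(rmse_log, rmsle_raw, recovered)
```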

In [None]:
NUM_BRANDS = 2500
NAME_MIN_DF = 10
MAX_FEAT_DESCP = 50000

<h2>3.2 Data Cleaning</h2>

<h3> 3.2.1 Check for Duplicates</h3>


In [None]:
print('No of duplicates in train: {}'.format(sum(train.duplicated())))
print('No of duplicates in test : {}'.format(sum(test.duplicated())))

<h3> 3.2.2 Checking for NaN/null values</h3>

In [None]:
train.isnull().any()

In [None]:
print('We have {} NaN/Null values in train'.format(train.isnull().values.sum()))
print('We have {} NaN/Null values in test'.format(test.isnull().values.sum()))

In [None]:
train["category_name"] = train["category_name"].fillna("Other").astype("category")
train["brand_name"] = train["brand_name"].fillna("unknown")

test["category_name"] = test["category_name"].fillna("Other").astype("category")
test["brand_name"] = test["brand_name"].fillna("unknown")

top_brands = train["brand_name"].value_counts().index[:NUM_BRANDS]
train.loc[~train["brand_name"].isin(top_brands), "brand_name"] = "Other"
test.loc[~test["brand_name"].isin(top_brands), "brand_name"] = "Other"

train["item_description"] = train["item_description"].fillna("None")
train["brand_name"] = train["brand_name"].astype("category")

test["item_description"] = test["item_description"].fillna("None")
test["brand_name"] = test["brand_name"].astype("category")
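
The cleaning step above fills missing values and then keeps only the `NUM_BRANDS` most frequent training brands, mapping everything else to `Other`. A toy version of that logic, using a made-up frame and `num_brands = 2` as a stand-in for `NUM_BRANDS`, can be sketched as:

```python
import pandas as pd

# Made-up listings; None marks a missing brand
toy = pd.DataFrame(
    {"brand_name": ["Nike", "Nike", "Nike", "Apple", "Apple", None, "Zara"]}
)
num_brands = 2  # stand-in for NUM_BRANDS

# Missing brands get a placeholder label first
toy["brand_name"] = toy["brand_name"].fillna("unknown")

# value_counts() sorts by frequency, so the first entries are the top brands
top = toy["brand_name"].value_counts().index[:num_brands]

# Every brand outside the top list is collapsed into a single bucket
toy.loc[~toy["brand_name"].isin(top), "brand_name"] = "Other"
print(toy["brand_name"].tolist())
```

Note that the `unknown` placeholder competes for a top slot like any other brand; in this toy frame it is rare, so it ends up in `Other`.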

<h2> 3.3 Univariate Data Analysis </h2>

<h3> 3.3.1 Feature : item_condition_id </h3>

In [None]:
train_cond_id = Counter(list(train['item_condition_id']))
test_cond_id = Counter(list(test['item_condition_id']))

fig, (ax1, ax2) = plt.subplots(1,2, figsize=(10,4))

ax1.bar(train_cond_id.keys(), train_cond_id.values(), width=0.2, align='edge', label='Train')
ax1.set_xticks([1,2,3,4,5])
ax1.set_xlabel('item_condition_id')
ax1.legend()

ax2.bar(test_cond_id.keys(), test_cond_id.values(), width=-0.2, align='edge', label='Test')
ax2.set_xticks([1,2,3,4,5])
ax2.set_xlabel('item_condition_id')
ax2.legend()

fig.show()

#### Observation:
The distribution of item_condition_id is similar across the train and test data.

<h3> 3.3.2 Feature : category_name </h3>

In [None]:
print(train['category_name'].describe())
category_nam = Counter(list(train['category_name']))

In [None]:
print("Top 15 category in train data: ")
category_nam.most_common(15)

In [None]:
print(test['category_name'].describe())
category_nam = Counter(list(test['category_name']))

In [None]:
print("Top 15 category in test data: ")
category_nam.most_common(15)

<h3> 3.3.3 Feature : brand_name </h3>

In [None]:
print(train['brand_name'].describe())
brand_nam = Counter(list(train['brand_name']))


In [None]:
print("Top 15 brands in train data: ")
brand_nam.most_common(15)

In [None]:
print(test['brand_name'].describe())
brand_nam = Counter(list(test['brand_name']))


In [None]:
print("Top 15 brands in test data: ")
brand_nam.most_common(15)

<h3> 3.3.4 Feature : price </h3>

In [None]:
train.price.describe()

In [None]:
fig, ax = plt.subplots(figsize=(10, 5))
ax.hist(train.price, bins=100, color="blue")
ax.set_title("Distribution of Price", fontsize=15)
ax.set_xlabel("Price", fontsize=10)
ax.set_ylabel("Number of listings", fontsize=10)
plt.show()

<h3> 3.3.5 Feature : shipping </h3>

In [None]:
train_ship = Counter(list(train['shipping']))
test_ship = Counter(list(test['shipping']))

fig, (ax1, ax2) = plt.subplots(1,2, figsize=(10,4))

ax1.bar(train_ship.keys(), train_ship.values(), width=0.1, align='edge', label='Train')
ax1.set_xticks([0,1])
ax1.set_xlabel('shipping')
ax1.legend()

ax2.bar(test_ship.keys(), test_ship.values(), width=-0.1, align='edge', label='Test')
ax2.set_xticks([1,0])
ax2.set_xlabel('shipping')
ax2.legend()

fig.show()

<h2>3.4 Data Preprocessing</h2>

> <h3> 3.4.1 Preprocessing Textual Features</h3>

>> <h4> 3.4.1.1 Name feature</h4>

In [None]:
print("Encodings")
count_nm = CountVectorizer(min_df=NAME_MIN_DF)
train_name_vec = count_nm.fit_transform(train["name"])
test_name_vec = count_nm.transform(test["name"])
print("Shape of train Name feature: ",train_name_vec.shape)
print("Shape of test Name feature: ",test_name_vec.shape)
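
`min_df` controls how many listings a token must appear in before it becomes a feature, which is what keeps the name vocabulary manageable. The sketch below shows the effect on four made-up titles with `min_df=2`; the notebook itself uses `NAME_MIN_DF = 10`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up listing titles
names = ["nike running shoes", "nike air max", "vintage lamp", "nike socks"]

vec = CountVectorizer(min_df=2)   # keep tokens that occur in at least 2 titles
X = vec.fit_transform(names)

print(sorted(vec.vocabulary_))    # only 'nike' appears in two or more titles
print(X.shape)                    # one row per title, one column per kept token
```

A title made up entirely of rare tokens, such as "vintage lamp" here, becomes an all-zero row.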

>> <h4> 3.4.1.2 Description feature</h4>

In [None]:
print("Descp encoders")
count_desc = TfidfVectorizer(max_features = MAX_FEAT_DESCP, 
                              ngram_range = (1,3),
                              stop_words = "english")
train_desc_vec = count_desc.fit_transform(train["item_description"])
test_desc_vec = count_desc.transform(test["item_description"])
print("Shape of train Description feature: ",train_desc_vec.shape)
print("Shape of test Description feature: ",test_desc_vec.shape)
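
Two settings in the description encoder interact in a way that is easy to miss: scikit-learn removes stop words before it builds n-grams, so with `ngram_range=(1, 3)` words that were separated by a stop word can end up in the same n-gram. A small sketch on a made-up description:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["leather jacket with zipper"]  # made-up description

vec = TfidfVectorizer(ngram_range=(1, 3), stop_words="english")
vec.fit(docs)

# 'with' is dropped first, then 1- to 3-grams are built from the remaining tokens
print(sorted(vec.vocabulary_))
```

The vocabulary contains "jacket zipper" even though the two words are not adjacent in the original text.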

> <h3> 3.4.2 Preprocessing Categorical Features</h3>

>> <h4> category_name</h4>

In [None]:
print("Category Encoders")
unique_categories = pd.Series("/".join(train["category_name"].unique().astype("str")).split("/")).unique()
count_category = CountVectorizer()
encoder_cat_train = count_category.fit_transform(train["category_name"])
encoder_cat_test= count_category.transform(test["category_name"])

In [None]:
print(encoder_cat_train.shape)
print(encoder_cat_test.shape)

>> <h4> brand_name</h4>

In [None]:
from sklearn.preprocessing import LabelBinarizer

print("Brand encoders")
vect_brand = LabelBinarizer(sparse_output=True)

encoder_brnd_train = vect_brand.fit_transform(train["brand_name"])
encoder_brnd_test= vect_brand.transform(test["brand_name"])

In [None]:
print(encoder_brnd_train.shape)
print(encoder_brnd_test.shape)

In [None]:
X_train = scipy.sparse.hstack((
                         train_desc_vec,
                         encoder_brnd_train,
                         encoder_cat_train,
                         train_name_vec,
                         np.array(train['item_condition_id']).reshape(-1,1),
                         np.array(train['shipping']).reshape(-1,1)
                        )).tocsr()
print(X_train.shape)

In [None]:
X_test = scipy.sparse.hstack((
                         test_desc_vec,
                         encoder_brnd_test,
                         encoder_cat_test,
                         test_name_vec,
                         np.array(test['item_condition_id']).reshape(-1,1),
                         np.array(test['shipping']).reshape(-1,1)
)).tocsr()
print(X_test.shape)
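
The design matrix is a column-wise concatenation of the sparse text and category blocks with the two numeric columns. The sketch below shows the same stacking on tiny made-up inputs; the dense columns are wrapped in `csr_matrix` explicitly here so that every block is sparse.

```python
import numpy as np
import scipy.sparse as sp

# Stand-in for one of the vectorizer outputs (2 listings, 3 text features)
text_block = sp.csr_matrix(np.array([[1, 0, 2], [0, 0, 1]]))
# Stand-ins for item_condition_id and shipping, reshaped to column vectors
condition = sp.csr_matrix(np.array([3, 1]).reshape(-1, 1))
shipping = sp.csr_matrix(np.array([0, 1]).reshape(-1, 1))

# Blocks are joined left to right; the row counts must match
X = sp.hstack((text_block, condition, shipping)).tocsr()
print(X.shape)
print(X.toarray())
```

Train and test must stack their blocks in the same order, otherwise the columns seen by the model at prediction time no longer line up with those it was fitted on.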

<h1>4. Modelling</h1>

> <h3>4.1 XGB Regressor</h3>

In [None]:
from xgboost import XGBRegressor
# from sklearn.model_selection import GridSearchCV

# params = { 
#           'gamma':[i/10.0 for i in range(3,8,2)],  
#           'max_depth': [4,8,16]}

# xgb = XGBRegressor() 

# grid = GridSearchCV(estimator=xgb, param_grid=params, n_jobs=-1, cv=2, verbose=3)
# grid.fit(X_train, y_train)
# print("Best estimator : ", grid.best_estimator_)
# print("Best Score : ", grid.best_score_)

In [None]:
# gamma and max_depth are values from the grid searched above; verbosity=0 silences training logs
xgb = XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bytree=1, gamma=0.7, learning_rate=0.1, max_delta_step=0,
             max_depth=16, min_child_weight=1, n_estimators=100,
             n_jobs=-1, random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
             subsample=1, verbosity=0)

print("Fitting Model 1")
xgb.fit(X_train, y_train)

In [None]:
y_pred = xgb.predict(X_test)    

### Observation:
* This XGBRegressor model gave the LB Score of 0.52026 

> <h3>4.2 Ridge Regressor</h3>

In [None]:
from sklearn.linear_model import RidgeCV

model = RidgeCV(fit_intercept=True, alphas=[5.0], cv=2, scoring='neg_mean_squared_error')


print("Fitting Model")
model.fit(X_train, y_train)

In [None]:
preds = model.predict(X_test)

### Observation:
* This RidgeCV model gave the LB Score of 0.46610

> <h3>4.3 RNN Model</h3>

In [None]:
from sklearn.preprocessing import LabelBinarizer, LabelEncoder
from keras.preprocessing.text import Tokenizer

In [None]:
full_df = pd.concat([train, test])

In [None]:
print("Processing categorical data...")
le = LabelEncoder()

le.fit(full_df.category_name)
full_df.category_name = le.transform(full_df.category_name)

le.fit(full_df.brand_name)
full_df.brand_name = le.transform(full_df.brand_name)

del le

In [None]:
print("Transforming text data to sequences...")
raw_text = np.hstack([full_df.item_description.str.lower(), full_df.name.str.lower()])

print("   Fitting tokenizer...")
tok_raw = Tokenizer()
tok_raw.fit_on_texts(raw_text)

print("   Transforming text to sequences...")
full_df['seq_item_description'] = tok_raw.texts_to_sequences(full_df.item_description.str.lower())
full_df['seq_name'] = tok_raw.texts_to_sequences(full_df.name.str.lower())

del tok_raw

In [None]:
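# Define constants to use when defining the RNN model
MAX_NAME_SEQ = 10
MAX_ITEM_DESC_SEQ = 75
# Largest token id over all sequences. Series.max() on a column of lists compares
# the lists lexicographically, so it does not reliably return the largest id.
MAX_TEXT = max(
    max((max(seq) for seq in full_df.seq_name if seq), default=0),
    max((max(seq) for seq in full_df.seq_item_description if seq), default=0),
) + 4
MAX_CATEGORY = full_df.category_name.max() + 1
MAX_BRAND = full_df.brand_name.max() + 1
MAX_CONDITION = full_df.item_condition_id.max() + 1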

In [None]:
def get_keras_data(df):
    X = {
        'name': pad_sequences(df.seq_name, maxlen=MAX_NAME_SEQ),
        'item_desc': pad_sequences(df.seq_item_description, maxlen=MAX_ITEM_DESC_SEQ),
        'brand_name': np.array(df.brand_name),
        'category_name': np.array(df.category_name),
        'item_condition': np.array(df.item_condition_id),
        'num_vars': np.array(df[["shipping"]]),
    }
    return X
# Split the combined frame back into its train and test rows.
n_trains = train.shape[0]
n_tests = test.shape[0]

train = full_df[:n_trains]
test = full_df[n_trains:]

X_train = get_keras_data(train)
X_test = get_keras_data(test)
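
`pad_sequences` turns variable-length token lists into fixed-width rows. With its default settings, Keras pads with zeros at the front and, for sequences longer than `maxlen`, keeps the last `maxlen` tokens. The pure-Python function below mimics that default behaviour for illustration only; `pad_pre` is a hypothetical helper, not a Keras API.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad (and left-truncate) each token list to exactly maxlen items."""
    padded = []
    for seq in sequences:
        # Keep at most the last `maxlen` tokens
        kept = list(seq)[-maxlen:] if maxlen > 0 else []
        # Fill the remaining positions at the front with the padding value
        padded.append([value] * (maxlen - len(kept)) + kept)
    return padded

token_ids = [[5, 2], [7, 1, 9, 4, 3], []]   # made-up token id sequences
print(pad_pre(token_ids, maxlen=4))
```

Index 0 is used for padding because the Keras `Tokenizer` numbers words starting at 1, which is why the embedding input size has to be at least the largest token id plus one.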

In [None]:
from keras import optimizers
def new_rnn_model(lr=0.001, decay=0.0):    
    # Inputs
    name = Input(shape=[X_train["name"].shape[1]], name="name")
    item_desc = Input(shape=[X_train["item_desc"].shape[1]], name="item_desc")
    brand_name = Input(shape=[1], name="brand_name")
    category_name = Input(shape=[1], name="category_name")
    item_condition = Input(shape=[1], name="item_condition")
    num_vars = Input(shape=[X_train["num_vars"].shape[1]], name="num_vars")

    # Embeddings layers
    emb_name = Embedding(MAX_TEXT, 20)(name)
    emb_item_desc = Embedding(MAX_TEXT, 60)(item_desc)
    emb_brand_name = Embedding(MAX_BRAND, 10)(brand_name)
    emb_category_name = Embedding(MAX_CATEGORY, 10)(category_name)

    # rnn layers
    rnn_layer1 = GRU(16) (emb_item_desc)
    rnn_layer2 = GRU(8) (emb_name)

    # main layers
    main_l = concatenate([
        Flatten() (emb_brand_name),
        Flatten() (emb_category_name),
        item_condition,
        rnn_layer1,
        rnn_layer2,
        num_vars,
    ])

    main_l = Dense(256)(main_l)
    main_l = Activation('elu')(main_l)

    main_l = Dense(128)(main_l)
    main_l = Activation('elu')(main_l)

    main_l = Dense(64)(main_l)
    main_l = Activation('elu')(main_l)

    # the output layer.
    output = Dense(1, activation="linear") (main_l)

    model = Model([name, item_desc, brand_name , category_name, item_condition, num_vars], output)

    optimizer = optimizers.Adam(lr=lr, decay=decay)
    model.compile(loss="mse", optimizer=optimizer)

    return model

model = new_rnn_model()
model.summary()
del model

In [None]:
# Set hyper parameters for the model.
BATCH_SIZE = 1024
epochs = 2

# Calculate learning rate decay.
exp_decay = lambda init, fin, steps: (init/fin)**(1/(steps-1)) - 1
steps = int(n_trains / BATCH_SIZE) * epochs
lr_init, lr_fin = 0.007, 0.0005
lr_decay = exp_decay(lr_init, lr_fin, steps)

rnn_model = new_rnn_model(lr=lr_init, decay=lr_decay)

print("Fitting RNN model to training examples...")
rnn_model.fit(
        X_train, y_train, epochs=epochs, batch_size=BATCH_SIZE, verbose=2)
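
The `exp_decay` formula above solves for a per-step rate `d` that takes the learning rate from `lr_init` to `lr_fin` under an exponential schedule. Assuming the legacy Keras optimizers apply `decay` as `lr / (1 + decay * iterations)` (an assumption about the Keras version in use, worth checking against the installed release), the rate actually reached at the end of training would be higher than `lr_fin`. The sketch compares both schedules with the notebook's settings; the variable names are illustrative.

```python
def exp_decay(init, fin, steps):
    """Per-step rate d such that init / (1 + d) ** (steps - 1) == fin."""
    return (init / fin) ** (1 / (steps - 1)) - 1

steps = int(1482535 / 1024) * 2   # train rows / BATCH_SIZE * epochs, as above
lr_init, lr_fin = 0.007, 0.0005
d = exp_decay(lr_init, lr_fin, steps)

# Exponential schedule implied by the formula
lr_exponential_end = lr_init / (1 + d) ** (steps - 1)
# Assumed legacy-Keras schedule: lr / (1 + decay * iterations)
lr_inverse_time_end = lr_init / (1 + d * (steps - 1))

print(d, lr_exponential_end, lr_inverse_time_end)
```

If the intent is to finish at `lr_fin` under the inverse-time rule, the decay would instead need to be `(lr_init / lr_fin - 1) / (steps - 1)`.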

In [None]:
preds = rnn_model.predict(X_test, batch_size=BATCH_SIZE)

In [None]:
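test = test.copy()  # work on a copy so the slice of full_df is not modified in place
test["price"] = np.expm1(preds).ravel()  # invert log1p; flatten the (n, 1) Keras output
test["test_id"] = pd.to_numeric(test["test_id"], downcast='integer')
test[["test_id", "price"]].to_csv("submission.csv", index = False)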

In [None]:
res = pd.read_csv("submission.csv")
res.head()

In [None]:
type(res["test_id"][0])

<h1>5. Conclusion</h1>

In [None]:
from prettytable import PrettyTable
table = PrettyTable()
table.title = "Comparison of Models"
table.field_names = ["Model", "RMSLE"]
table.add_row(["XGBRegressor","0.52"])
table.add_row(["RidgeCV Regressor","0.46"])
table.add_row(["RNN Model","0.43"])
print(table)

<h2> <font color='blue'> Procedure to solve the problem </font></h2>

<h2> <font color='grey'>1.  Business Problem: </font></h2>
It covers the basic details which should be known before solving the case study.<br>
<p>
**1.1. Problem Description:** describes the background of the Mercari shopping app, which is needed to understand the problem.<br>
**1.2. Problem Statement:** describes the problem we intend to solve.<br>
**1.3. Business Objectives and Constraints:** describes the objectives to keep in mind while solving the problem. The constraints listed there need particular attention.
</p>

<h2> <font color='grey'>2. Machine Learning problem:</font></h2>
Looking into the problem as a Machine learning problem.
<p>
**2.1 Data Overview:** Understanding the data and the data fields.<br>
**2.2 Mapping the real-world problem to a Machine Learning Problem:** <br>
_2.2.1 Type of Machine Learning Problem:_ Identify the type of problem, i.e. classification (binary, multi-class, or multi-label), regression, etc. This is a Regression Problem.<br>
_2.2.2 Performance Metric:_ Choose the appropriate metric for this problem (RMSLE here).
</p>


<h2> <font color='grey'>3. Exploratory Data Analysis:</font></h2>
<p>
**3.1 Reading the Data**<br>
**3.2 Data Cleaning**<br>
*3.2.1 Checking for duplicates:* removing any duplicates if present<br>
*3.2.2 Checking for NaN/null values*<br>
</p>
<p>
**3.3 Univariate Data Analysis**<br>
*3.3.1 Feature : item_condition_id*<br>
*3.3.2 Feature : category_name* <br>
*3.3.3 Feature : brand_name* <br>
*3.3.4 Feature : price* <br>
*3.3.5 Feature : shipping* 
</p>
<p>
**3.4 Preprocessing** <br>
*3.4.1 Preprocessing Textual Features*<br>
*3.4.2 Preprocessing Categorical Features* <br>
*3.4.3 Numerical Features* <br>
</p>

<h2> <font color='grey'> 4. Machine Learning Models</font></h2>
<br>
*4.1. Applying GridSearchCV for XGBRegressor*<br>
*4.2. Applying RidgeCV Regressor*<br>
*4.3. Applying RNN Model*<br>

<h2> <font color='grey'> 5. Conclusion</font></h2>
* **RNN Model** has an RMSLE of 0.43176 and is therefore the best of the three models compared.<br>

