### Introduction

It can be hard to know how much something’s really worth. Small details can mean big differences in pricing.

Here we use the dataset provided by Mercari, Japan’s biggest community-powered shopping app, to build a model that can offer pricing suggestions to sellers, or help buyers check whether they are purchasing a product at a fair price.

Here, we will be using user-inputted text descriptions of the products, including details like product category name, brand name, and item condition.

### Data

The files consist of a list of product listings. These files are tab-delimited.

* train_id or test_id - the id of the listing
* name - the title of the listing. 
* item_condition_id - the condition of the items provided by the seller
* category_name - category of the listing
* brand_name
* price - the price that the item was sold for. This is the target variable that you will predict. The unit is USD. This column doesn't exist in test.tsv since that is what you will predict.
* shipping - 1 if shipping fee is paid by seller and 0 by buyer
* item_description - the full description of the item.

### Importing Libraries

In [None]:
import pandas as pd
import numpy as np
import os
import math

import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.preprocessing import LabelEncoder, MinMaxScaler, StandardScaler

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Input, Dropout, Dense, BatchNormalization, Activation, concatenate, GRU, Embedding, Flatten
from keras.models import Model
from keras.callbacks import ModelCheckpoint, Callback, EarlyStopping
from keras import backend as K

from sklearn.model_selection import train_test_split

### File Installation

Installing p7zip to extract the data from the train, test and sample submission .7z files

In [None]:
!apt-get install p7zip
!p7zip -d -f -k /kaggle/input/mercari-price-suggestion-challenge/train.tsv.7z
!p7zip -d -f -k /kaggle/input/mercari-price-suggestion-challenge/test.tsv.7z
!p7zip -d -f -k /kaggle/input/mercari-price-suggestion-challenge/sample_submission.csv.7z

### Unzipping dataset

Unzipping the test and sample submission datasets

In [None]:
!unzip /kaggle/input/mercari-price-suggestion-challenge/sample_submission_stg2.csv.zip
!unzip /kaggle/input/mercari-price-suggestion-challenge/test_stg2.tsv.zip

### Reading the Train and Test file

In [None]:
train_data = pd.read_csv('train.tsv', sep='\t')
train_copy = train_data.copy()
test_data = pd.read_csv('test_stg2.tsv', sep='\t')
test_copy = test_data.copy()
print(train_data.shape)
print(test_data.shape)
print(train_data.columns)
print(test_data.columns)

In [None]:
train_data.head()

In [None]:
test_data.head()

### Exploratory data analysis

In [None]:
train_data.isna().sum()

In [None]:
test_data.isna().sum()

In [None]:
train_data = train_data.dropna()
test_data = test_data.dropna()

In [None]:
# #HANDLE MISSING VALUES
# print("Handling missing values...")
# def handle_missing(dataset):
#     dataset.category_name.fillna(value="missing", inplace=True)
#     dataset.brand_name.fillna(value="missing", inplace=True)
#     dataset.item_description.fillna(value="missing", inplace=True)
#     return (dataset)

# train_data = handle_missing(train_data)
# test_data = handle_missing(test_data)
# print(train_data.shape)
# print(test_data.shape)

In [None]:
train_data = train_data.drop(['train_id'],axis=1)
test_data = test_data.drop(['test_id'],axis=1)

In [None]:
train_data.category_name.value_counts()[train_data.category_name.value_counts() > 10000]

In [None]:
train_data.brand_name.value_counts()[train_data.brand_name.value_counts() > 10000]

In [None]:
def split_cat(text):
    # Missing categories (NaN) are floats and have no .split method
    try: return text.split("/")
    except AttributeError: return ("No Label", "No Label", "No Label")

In [None]:
train_data['general_cat'], train_data['subcat_1'], train_data['subcat_2'] = \
zip(*train_data['category_name'].apply(lambda x: split_cat(x)))
train_data.head()

In [None]:
test_data['general_cat'], test_data['subcat_1'], test_data['subcat_2'] = \
zip(*test_data['category_name'].apply(lambda x: split_cat(x)))
test_data.head()

In [None]:
train_data = train_data.drop(['category_name'],axis=1)
test_data = test_data.drop(['category_name'],axis=1)

In [None]:
print("There are %d unique first sub-categories." % train_data['subcat_1'].nunique())

In [None]:
print("There are %d unique second sub-categories." % train_data['subcat_2'].nunique())

### EDA Through Visualization

In [None]:
plt.figure(figsize=(16,5))
sns.countplot(x = 'general_cat', data = train_data, order = train_data['general_cat'].value_counts().index, palette = 'Spectral_r')
plt.tight_layout()

In [None]:
plt.figure(figsize=(16,5))
sns.countplot(x = 'subcat_1', data = train_data, order = train_data['subcat_1'].value_counts().iloc[:10].index, palette = 'mako')
plt.tight_layout()

In [None]:
plt.figure(figsize=(16,5))
sns.countplot(x = 'subcat_2', data = train_data, order = train_data['subcat_2'].value_counts().iloc[:10].index, palette = 'rocket')
plt.tight_layout()

In [None]:
plt.figure(figsize=(20,5))
df = train_data.groupby('brand_name')['price'].mean().reset_index()
brands = list(train_data.brand_name.value_counts()[train_data.brand_name.value_counts() > 1000].index)
data = {'brand_name':  [df[df['brand_name']==i].values[0][0] for i in brands],
        'price': [df[df['brand_name']==i].values[0][1] for i in brands]}
df = pd.DataFrame(data).sort_values('price',ascending=False)[:10]
sns.barplot(data=df, x='brand_name', y='price', palette = 'icefire')

In [None]:
plt.figure(figsize=(20,5))
df = train_data.groupby('brand_name')['price'].mean().reset_index()
sns.barplot(data=df, x='brand_name', y='price', order = train_data['brand_name'].value_counts().iloc[:10].index, palette = 'Spectral')

In [None]:
plt.figure(figsize=(20,5))
len_desc = [len(i.split()) for i in train_data['item_description']]
sns.lineplot(data=train_data, x=len_desc, y='price')

In [None]:
plt.figure(figsize=(20,5))
df = train_data.groupby('general_cat')['price'].mean().reset_index()
df = df.sort_values('price')
sns.lineplot(data=df, x='general_cat', y='price')

* Women, Beauty and Kids are the best-selling general categories.
* Athletic Apparel, Shoes and Makeup make up the highest-selling items.
* Louis Vuitton, Gucci, Air Jordan, Tiffany & Co. and Tory Burch are the five most expensive brands on average among brands with more than 1000 products sold, which makes sense as they are well-known luxury fashion brands.
* We checked whether the length of the description affects the price of the product and could not find any relation between them, which is a good thing.
* From the general product category, we found that electronics items are the most expensive ones.
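
One way to put a number on the description-length observation is a correlation coefficient between word count and log price. The sketch below runs on a few made-up listings, not the Mercari data, and `desc_length_price_corr` is a hypothetical helper name introduced only for illustration.

```python
import numpy as np

def desc_length_price_corr(descriptions, prices):
    """Pearson correlation between description word count and log(price + 1)."""
    lengths = np.array([len(d.split()) for d in descriptions], dtype=float)
    log_prices = np.log1p(np.asarray(prices, dtype=float))
    # np.corrcoef returns a 2x2 matrix; the off-diagonal entry is the correlation.
    return float(np.corrcoef(lengths, log_prices)[0, 1])

# Made-up listings for illustration only.
toy_descriptions = [
    "brand new",
    "used once, still in the original box",
    "no description yet",
    "gently worn, small scratch on the left side, ships fast",
]
toy_prices = [25.0, 12.0, 40.0, 18.0]
corr = desc_length_price_corr(toy_descriptions, toy_prices)
print(f"correlation: {corr:.3f}")
```

On the notebook's data this could be called as `desc_length_price_corr(train_data['item_description'], train_data['price'])`; a value close to zero would agree with what the line plot suggests.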

### Processing categorical data

In [None]:
print("Handling categorical variables...")
le = LabelEncoder()

le.fit(np.hstack([train_data.general_cat, test_data.general_cat]))
train_data.general_cat = le.transform(train_data.general_cat)
test_data.general_cat = le.transform(test_data.general_cat)

le.fit(np.hstack([train_data.brand_name, test_data.brand_name]))
train_data.brand_name = le.transform(train_data.brand_name)
test_data.brand_name = le.transform(test_data.brand_name)

le.fit(np.hstack([train_data.subcat_1, test_data.subcat_1]))
train_data.subcat_1 = le.transform(train_data.subcat_1)
test_data.subcat_1 = le.transform(test_data.subcat_1)

le.fit(np.hstack([train_data.subcat_2, test_data.subcat_2]))
train_data.subcat_2 = le.transform(train_data.subcat_2)
test_data.subcat_2 = le.transform(test_data.subcat_2)

In [None]:
train_data.head(5)

### Processing the text

In [None]:
raw_text = np.hstack([train_data.item_description.str.lower(), train_data.name.str.lower()])

print("   Fitting tokenizer...")
tokenizer = Tokenizer()
tokenizer.fit_on_texts(raw_text)
print("   Transforming text to seq...")

train_data["seq_item_description"] = tokenizer.texts_to_sequences(train_data.item_description.str.lower())
test_data["seq_item_description"] = tokenizer.texts_to_sequences(test_data.item_description.str.lower())
train_data["seq_name"] = tokenizer.texts_to_sequences(train_data.name.str.lower())
test_data["seq_name"] = tokenizer.texts_to_sequences(test_data.name.str.lower())
train_data.head(3)

### Sequences variable analysis

In [None]:
max_name_seq = np.max([np.max(train_data.seq_name.apply(lambda x: len(x))), np.max(test_data.seq_name.apply(lambda x: len(x)))])
max_item_description_seq = np.max([np.max(train_data.seq_item_description.apply(lambda x: len(x)))
                                   , np.max(test_data.seq_item_description.apply(lambda x: len(x)))])
print("max name seq "+str(max_name_seq))
print("max item desc seq "+str(max_item_description_seq))

In [None]:
train_data.seq_name.apply(lambda x: len(x)).hist()

In [None]:
train_data.seq_item_description.apply(lambda x: len(x)).hist()

* Based on the histograms, we select the following maximum sequence lengths.
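
A histogram gives a visual impression; a quick numeric check is the share of sequences that fit under a candidate length without truncation. The following is a minimal sketch on made-up token counts, with `coverage` as a hypothetical helper.

```python
import numpy as np

def coverage(lengths, max_len):
    """Fraction of sequences that fit within max_len tokens without truncation."""
    lengths = np.asarray(lengths)
    return float(np.mean(lengths <= max_len))

# Made-up token counts; on the notebook's data these would come from
# train_data.seq_item_description.apply(len).
toy_lengths = [3, 5, 8, 12, 20, 40, 75, 150]
print(coverage(toy_lengths, 75))            # share of sequences kept whole
print(int(np.percentile(toy_lengths, 95)))  # length that covers about 95% of them
```

A shorter maximum length speeds up training at the cost of truncating the longest descriptions, so the cut-off is a trade-off rather than a fixed rule.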

In [None]:
#EMBEDDINGS MAX VALUE
MAX_NAME_SEQ = 10
MAX_ITEM_DESC_SEQ = 75
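# Vocabulary size for the text embeddings. Tokenizer indices start at 1 and
# index 0 is reserved for padding, so the embedding needs len(word_index) + 1 rows.
# (Calling .max() on a Series of lists compares the lists lexicographically,
# which does not reliably return the largest token index.)
MAX_TEXT = len(tokenizer.word_index) + 1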
MAX_GENERAL_CAT = np.max([train_data.general_cat.max(),test_data.general_cat.max()])+1
MAX_SUBCAT_1 = np.max([train_data.subcat_1.max(),test_data.subcat_1.max()])+1
MAX_SUBCAT_2 = np.max([train_data.subcat_2.max(),test_data.subcat_2.max()])+1
MAX_BRAND = np.max([train_data.brand_name.max(), test_data.brand_name.max()])+1
MAX_CONDITION = np.max([train_data.item_condition_id.max(), test_data.item_condition_id.max()])+1

### Normalizing target value

In [None]:
#SCALE target variable
train_data["target"] = np.log(train_data.price+1)
target_scaler = MinMaxScaler(feature_range=(-1, 1))
train_data["target"] = target_scaler.fit_transform(np.array(train_data.target).reshape(-1,1))
pd.DataFrame(train_data.target).hist()

* It is good practice to normalize the target variable to the range -1 to 1, especially for regression with sequential models such as LSTM, GRU or RNN.
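
The target goes through two steps, a log transform and a min-max scaling, and predictions have to be mapped back through both in reverse order. The sketch below reproduces that round trip with plain NumPy on made-up prices; `scale_target` and `unscale_target` are hypothetical helpers that apply the same formula as a min-max scaler with range (-1, 1).

```python
import numpy as np

def scale_target(price, log_min, log_max):
    """Map price -> log(price + 1) -> [-1, 1] using the given log range."""
    logp = np.log1p(price)
    return 2.0 * (logp - log_min) / (log_max - log_min) - 1.0

def unscale_target(scaled, log_min, log_max):
    """Invert scale_target: [-1, 1] -> log(price + 1) -> price."""
    logp = (scaled + 1.0) / 2.0 * (log_max - log_min) + log_min
    return np.expm1(logp)  # exp(x) - 1 undoes log(x + 1)

toy_prices = np.array([3.0, 10.0, 45.0, 200.0])  # made-up prices
log_min = np.log1p(toy_prices).min()
log_max = np.log1p(toy_prices).max()

scaled = scale_target(toy_prices, log_min, log_max)
restored = unscale_target(scaled, log_min, log_max)
print(scaled.min(), scaled.max())
print(np.allclose(restored, toy_prices))
```

The key detail is that the inverse of `log(price + 1)` is `exp(x) - 1`, so the scaler is undone first and the exponential second.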

### Splitting of the data

In [None]:
#splitting the data
train, valid = train_test_split(train_data, random_state=123, test_size=0.01)
print(train.shape)
print(valid.shape)

### Keras data definition and padding of the text sequences

In [None]:
def get_keras_data(dataset):
    X = {
        'name': pad_sequences(dataset.seq_name, maxlen=MAX_NAME_SEQ),
        'item_desc': pad_sequences(dataset.seq_item_description, maxlen=MAX_ITEM_DESC_SEQ),
        'brand_name': np.array(dataset.brand_name),
        'general_cat': np.array(dataset.general_cat),
        'subcat_1': np.array(dataset.subcat_1),
        'subcat_2': np.array(dataset.subcat_2),
        'item_condition': np.array(dataset.item_condition_id),
        'shipping': np.array(dataset[["shipping"]])
    }
    return X

X_train = get_keras_data(train)
X_valid = get_keras_data(valid)
X_test = get_keras_data(test_data)
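
With its default arguments, `pad_sequences` pads short sequences on the left with zeros and, for sequences longer than `maxlen`, drops tokens from the front. The following is a pure-Python sketch of that behaviour for a single sequence; `pad_pre` is a hypothetical helper, not part of Keras, and it assumes a positive `maxlen`.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` tokens."""
    seq = list(seq)[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq

print(pad_pre([4, 9, 2], 5))           # [0, 0, 4, 9, 2]
print(pad_pre([1, 2, 3, 4, 5, 6], 5))  # [2, 3, 4, 5, 6]
```

Because truncation removes the beginning of a text, descriptions longer than `MAX_ITEM_DESC_SEQ` tokens keep only their final tokens.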

### Keras model definition

In [None]:
def get_callbacks(filepath, patience=2):
    es = EarlyStopping('val_loss', patience=patience, mode="min")
    msave = ModelCheckpoint(filepath, save_best_only=True)
    return [es, msave]

def rmsle_cust(y_true, y_pred):
    first_log = K.log(K.clip(y_pred, K.epsilon(), None) + 1.)
    second_log = K.log(K.clip(y_true, K.epsilon(), None) + 1.)
    return K.sqrt(K.mean(K.square(first_log - second_log), axis=-1))

def get_model():
    #params
    dr_r = 0.1
    
    #Inputs
    name = Input(shape=[X_train["name"].shape[1]], name="name")
    item_desc = Input(shape=[X_train["item_desc"].shape[1]], name="item_desc")
    brand_name = Input(shape=[1], name="brand_name")
    general_cat = Input(shape=[1], name = 'general_cat')
    subcat_1 = Input(shape=[1], name = 'subcat_1')
    subcat_2 = Input(shape=[1], name = 'subcat_2')
    item_condition = Input(shape=[1], name="item_condition")
    shipping = Input(shape=[X_train["shipping"].shape[1]], name="shipping")
    
    #Embeddings layers
    emb_name = Embedding(MAX_TEXT, 50)(name)
    emb_item_desc = Embedding(MAX_TEXT, 50)(item_desc)
    emb_brand_name = Embedding(MAX_BRAND, 10)(brand_name)
    emb_general_cat = Embedding(MAX_GENERAL_CAT, 10)(general_cat)
    emb_subcat_1 = Embedding(MAX_SUBCAT_1, 10)(subcat_1)
    emb_subcat_2 = Embedding(MAX_SUBCAT_2, 10)(subcat_2)
    emb_item_condition = Embedding(MAX_CONDITION, 5)(item_condition)
    
    #rnn layer
    rnn_layer1 = GRU(16) (emb_item_desc)
    rnn_layer2 = GRU(8) (emb_name)
    
    #main layer
    main_l = concatenate([Flatten() (emb_brand_name), 
                          Flatten() (emb_general_cat),
                          Flatten() (emb_subcat_1),
                          Flatten() (emb_subcat_2),
                          Flatten() (emb_item_condition), 
                          rnn_layer1, 
                          rnn_layer2, 
                          shipping])
    main_l = Dropout(dr_r) (Dense(128) (main_l))
    main_l = Dropout(dr_r) (Dense(64) (main_l))
    
    #output
    output = Dense(1, activation="linear") (main_l)
    
    #model
    model = Model([name, item_desc, brand_name
                   , general_cat, subcat_1, subcat_2, item_condition, shipping], output)
    model.compile(loss="mse", optimizer="adam", metrics=["mae", rmsle_cust])
    
    return model

    
model = get_model()
model.summary()

### Fitting the model

In [None]:
BATCH_SIZE = 20000
epochs = 10

model = get_model()
model.fit(X_train, train.target, epochs=epochs, batch_size=BATCH_SIZE
          , validation_data=(X_valid, valid.target), verbose = 1)

### Root Mean Squared Logarithmic Error

In [None]:
def rmsle(y, y_pred):
    assert len(y) == len(y_pred)
    to_sum = [(math.log(y_pred[i] + 1) - math.log(y[i] + 1)) ** 2.0 for i,pred in enumerate(y_pred)]
    return (sum(to_sum) * (1.0/len(y))) ** 0.5
#Source: https://www.kaggle.com/marknagelberg/rmsle-function
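
The same metric can be written without an explicit loop. The following is a vectorized NumPy sketch, assuming non-negative prices and predictions; `rmsle_np` is a hypothetical name chosen so it does not shadow the function above.

```python
import numpy as np

def rmsle_np(y_true, y_pred):
    """Root mean squared error between log(y_pred + 1) and log(y_true + 1)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    assert y_true.shape == y_pred.shape
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

print(rmsle_np([10.0, 20.0], [10.0, 20.0]))  # 0.0 for a perfect prediction
```

Because the error is computed on logarithms, being off by a factor of two costs the same for a 5 USD item as for a 500 USD item, which suits prices that span several orders of magnitude.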

### Evaluating the model on the validation dataset 
How is it doing?

In [None]:
val_preds = model.predict(X_valid)
val_preds = target_scaler.inverse_transform(val_preds)
val_preds = np.expm1(val_preds)  # exp(x) - 1, the inverse of log(price + 1)

#mean_absolute_error, mean_squared_log_error
y_true = np.array(valid.price.values)
y_pred = val_preds[:,0]
v_rmsle = rmsle(y_true, y_pred)
print(" RMSLE error on dev set: "+str(v_rmsle))

In [None]:
prices = np.array(valid.price)
names = np.array(valid.name)
for i in range(10):
    actual = int(prices[i])
    predicted = int(val_preds[i, 0])
    # Percentage by which the actual price differs from the predicted price;
    # a zero actual price is reported as 0 to avoid dividing by zero.
    change_pct = (actual - predicted) / actual * 100 if actual else 0.0
    change = int(change_pct)

    print('')
    print('')
    if change >= -5 and change_pct <= 30:
        print('\033[92m Fair priced')
    elif change < -5:
        print('\033[93m Over priced')
    else:
        print('\033[91m Under priced')

    print('Product name: ', names[i])
    print('Product actual price: ', actual)
    print('Product predicted price: ', predicted)
    print('Change in price percentage:', change)
    print('*****************************************************************************')