# 1. Business Problem

#### **What is Mercari ?**

> Mercari is a marketplace platform where you can buy and sell almost anything, provided the item can be shipped. It currently operates in Japan and the United States, and is similar to Quikr or OLX in India.




#### **Problem Statement**

> Predicting the price of a product is a tough challenge, especially when the product is seasonal. The same kind of product can sell at very different prices depending on the brand, which makes it difficult for sellers to set a fair price that still earns a profit, while selling below the market price means a loss.


#### **Business Objective**



>* Mercari wants a model that suggests the best market price for a product listed online, so that the predicted price is fair and leads to a better experience for both the customer and the seller.<br>
>* The model should not take too long to predict the price.



#### **Data Overview**

> The files consist of a list of product listings, with fields separated by tabs.

    * train_id or test_id - the id of the listing

    * name - the title of the listing. Note that we have cleaned the data to remove text that looks like prices (e.g. \$20) to avoid leakage. 
      These removed prices are represented as [rm] 

    * item_condition_id - the condition of the items provided by the seller, range from 1-5. 1 being 'New' and 5 being 'Poor'.

    * category_name - category of the listing

    * brand_name - brand of the product. 

    * price - the price that the item was sold for. This is the target variable that you will predict. The unit is USD. 

    * shipping - 1 if shipping fee is paid by seller and 0 by buyer

    * item_description - the full description of the item. Note that we have cleaned the data to remove text that looks like prices
     (e.g. \$20) to avoid leakage. These removed prices are represented as [rm]

#### **Type of Machine Learning problem**

> The price is a target variable which is a continuous variable, therefore it is a Regression problem.




#### **Performance Metric -**

> Common error metrics for regression problems are the RMSE (root mean squared error), MAE (mean absolute error) and MSE (mean squared error). In this case study we use the RMSLE (Root Mean Squared Logarithmic Error) for the following reasons -<br>
>    * Robust to outliers.
>    * It is scale invariant: it measures relative rather than absolute error, so rescaling the predicted and actual values by the same factor barely changes the result.
>    * Biased penalty - it penalises underestimation more heavily than overestimation, which may increase the profit for the company and the seller as well.

**NOTE :** **The above comparisons are with respect to the RMSE only.**
<br><br>
The RMSLE is calculated as -
<img src="https://miro.medium.com/max/875/0*AUzyQ1rc6mpQVYfn">

reference - https://medium.com/analytics-vidhya/root-mean-square-log-error-rmse-vs-rmlse-935c6cc1802a
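
As a rough illustration of the metric, the sketch below computes RMSLE as the square root of the mean of `(log(1 + predicted) - log(1 + actual))^2` in plain Python. The function name `rmsle` and the sample prices are made up for this example; it is a minimal sketch, not the scoring code used by the competition.

```python
import math

def rmsle(actual, predicted):
    """Root Mean Squared Logarithmic Error for two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    # log1p(x) computes log(1 + x), which keeps a price of 0 well defined
    squared_log_errors = [
        (math.log1p(p) - math.log1p(a)) ** 2
        for a, p in zip(actual, predicted)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))

# the same absolute error of 50 on an actual price of 100
under = rmsle([100.0], [50.0])    # predicted below the actual price
over = rmsle([100.0], [150.0])    # predicted above the actual price
print(under, over)
```

Here `under` comes out larger than `over`, which is the biased penalty described above: for the same absolute error, an underestimate costs more than an overestimate.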

<br><br><br>

## Importing data

In [None]:
# import modules
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")

import os
import re
import math
import pickle
import shutil
import string
import datetime
from collections import Counter

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
from nltk.corpus import stopwords
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from wordcloud import WordCloud, STOPWORDS
from tqdm import tqdm

# register progress_apply on pandas objects
tqdm.pandas()

In [None]:
!wget http://nlp.stanford.edu/data/glove.6B.zip


In [None]:
!unzip glove.6B.zip

In [None]:
# importing the data in colab using CurlWget

! wget --header="Host: storage.googleapis.com" --header="User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36" --header="Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9" --header="Accept-Language: en-US,en;q=0.9" --header="Referer: https://www.kaggle.com/" "https://storage.googleapis.com/kagglesdsdata/competitions/7559/44327/train.tsv.7z?GoogleAccessId=web-data@kaggle-161607.iam.gserviceaccount.com&Expires=1626025725&Signature=WJlWIDJL0ILb0i0AmBY%2FBkDpfsXWgPLoODrqS%2B%2F%2FGnlmR0B%2F8d1vuuxjjfExL2utyqmH3h96Icw4iG5Xm5vXRGYSHz%2B2PEh0ZWZOkEhracCRzpa5Ymv4RKxJ3beC5rrnCU4Ug24%2FHKjR75H2tFRohRo1fDzGs1MR2VFUS%2BlFnc2tH6TlZVgGtExd0ft%2BhQbc2ywXFlpUsRF7NmPOYzsraoYK8zPTvwQG5w7n7TC5bwEL8ZhTgV9K321%2FND1TRcZdgdtjCA8wTo2qgDvBcoz28i800dO9lskNkTeCfxEghVo1wy6r0z5tn3kxFH8jyW57QbJlBsrJkeoHDjs7aFwHPg%3D%3D&response-content-disposition=attachment%3B+filename%3Dtrain.tsv.7z" -c -O 'train.tsv.7z'

--2021-07-08 17:50:26--  https://storage.googleapis.com/kagglesdsdata/competitions/7559/44327/train.tsv.7z?GoogleAccessId=web-data@kaggle-161607.iam.gserviceaccount.com&Expires=1626025725&Signature=WJlWIDJL0ILb0i0AmBY%2FBkDpfsXWgPLoODrqS%2B%2F%2FGnlmR0B%2F8d1vuuxjjfExL2utyqmH3h96Icw4iG5Xm5vXRGYSHz%2B2PEh0ZWZOkEhracCRzpa5Ymv4RKxJ3beC5rrnCU4Ug24%2FHKjR75H2tFRohRo1fDzGs1MR2VFUS%2BlFnc2tH6TlZVgGtExd0ft%2BhQbc2ywXFlpUsRF7NmPOYzsraoYK8zPTvwQG5w7n7TC5bwEL8ZhTgV9K321%2FND1TRcZdgdtjCA8wTo2qgDvBcoz28i800dO9lskNkTeCfxEghVo1wy6r0z5tn3kxFH8jyW57QbJlBsrJkeoHDjs7aFwHPg%3D%3D&response-content-disposition=attachment%3B+filename%3Dtrain.tsv.7z
Resolving storage.googleapis.com (storage.googleapis.com)... 216.58.196.112, 216.58.200.176, 216.58.200.208, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|216.58.196.112|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 77912192 (74M) [application/x-7z-compressed]
Saving to: ‘train.tsv.7z’


2021-07-08 17:50:27 (302 MB/s

In [None]:
# reference - https://stackoverflow.com/questions/49955814/unzip-a-7z-file-in-google-collab

!7z e train.tsv.7z


7-Zip [64] 9.20  Copyright (c) 1999-2010 Igor Pavlov  2010-11-18
p7zip Version 9.20 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,8 CPUs)

Processing archive: train.tsv.7z

Extracting  train.tsv

Everything is Ok

Size:       337809843
Compressed: 77912192


<br><br>

# Data preprocessing

In [None]:
# importing the data
# the file is tab-separated, so use '\t' as the separator
data = pd.read_csv('train.tsv', sep='\t')
data.head()

Unnamed: 0,train_id,name,item_condition_id,category_name,brand_name,price,shipping,item_description
0,0,MLB Cincinnati Reds T Shirt Size XL,3,Men/Tops/T-shirts,,10.0,1,No description yet
1,1,Razer BlackWidow Chroma Keyboard,3,Electronics/Computers & Tablets/Components & P...,Razer,52.0,0,This keyboard is in great condition and works ...
2,2,AVA-VIV Blouse,1,Women/Tops & Blouses/Blouse,Target,10.0,1,Adorable top with a hint of lace and a key hol...
3,3,Leather Horse Statues,1,Home/Home Décor/Home Décor Accents,,35.0,1,New with tags. Leather horses. Retail for [rm]...
4,4,24K GOLD plated rose,1,Women/Jewelry/Necklaces,,44.0,0,Complete with certificate of authenticity


In [None]:
print(data.shape)

(1482535, 8)


###### Observation - 
The data contains about 1.48 million rows and 8 columns.
<br>

In [None]:
# reference - https://stackoverflow.com/questions/19790188/expanding-english-language-contractions-in-python

def decontracted(phrase):

    '''  
        this function helps in expanding the given phrases.

        input: phrase/ word
        returns: expanded string 
    '''

    # specific
    phrase = re.sub(r"won\'t", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)

    # general
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    return phrase

In [None]:
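def category_split(category):
    '''
        function to split the category column into three subcategories

        input: category string, e.g. 'Men/Tops/T-shirts'
        returns: three subcategory strings 
    '''
    try:
        # split the string on '/' and keep the first three levels
        t = category.split('/')
        return t[0], t[1], t[2]
    except (AttributeError, IndexError):
        # non-string input (e.g. NaN) or fewer than three levels
        return 'unk_cat', 'unk_cat', 'unk_cat'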
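
Some category strings may contain more than three levels, in which case `category_split` keeps the first three and drops the rest. The sketch below is one possible alternative that uses `str.split` with a maximum of two splits, so that any extra levels stay inside the third field. The name `split_category_keep_rest`, the four-level example string and the padding behaviour are assumptions made for illustration; unlike `category_split`, it pads missing levels instead of marking all three as unknown.

```python
def split_category_keep_rest(category, fill="unk_cat"):
    """Split 'a/b/c/...' into exactly three parts, keeping extra levels in the third."""
    if not isinstance(category, str):
        # e.g. NaN coming from a missing category_name
        return fill, fill, fill
    parts = category.split("/", 2)       # at most 3 pieces
    # pad with the fill value when fewer than three levels are present
    parts += [fill] * (3 - len(parts))
    return tuple(parts)

print(split_category_keep_rest("Men/Tops/T-shirts"))
print(split_category_keep_rest("Electronics/Computers & Tablets/iPad/Tablet"))
print(split_category_keep_rest(float("nan")))
```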


In [None]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/rahul_rbbisht1050/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [None]:
# a set gives constant-time stopword lookups
stopwords_ = set(stopwords.words('english'))

def preprocess(text, col):
    """
        Function to clean strings containing special characters and convert them to lowercase.

        input: text - string to clean, col - name of the column the text comes from
        output: string containing only digits, lowercase letters and underscores
                (np.nan if the input is not a string)
    """

    try:
        # convert the string to lowercase
        text = text.lower()
        # decontraction - expanding the words like : i'll -> i will, he'd -> he would
        text = decontracted(text)
        # replace & and - character with _ . 
        text = re.sub(r'[&-]', '_', text)    #  Example : t-shirt -> t_shirt, horse&sweater -> horse_sweater
        # replace special characters except _
        text = re.sub(r'[^0-9a-z_]', ' ', text)
        text = re.sub(r'\s_\s', ' ', text)   #  replace strings like  ' _ ' with ' ' (string with a space)
        text = re.sub(r'\s+', ' ', text).strip()  # collapse runs of whitespace into a single space
        if col != 'name':
            # stopwords are removed from every column except 'name'
            text = ' '.join(i for i in text.split(' ') if i not in stopwords_)
    except AttributeError:
        # non-string input such as NaN
        text = np.nan
    return text
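
To make the order of the cleaning steps concrete, the sketch below replays the same regular expressions on two made-up listing titles. `clean_text` and `DEMO_STOPWORDS` are hypothetical names for this example; the stopword set is a tiny stand-in for the NLTK list, and the contraction step is left out.

```python
import re

# tiny stand-in for nltk's English stopword list (assumption for this demo)
DEMO_STOPWORDS = {"the", "and", "with", "a", "in"}

def clean_text(text, remove_stopwords=True):
    text = text.lower()
    text = re.sub(r"[&-]", "_", text)          # t-shirt -> t_shirt
    text = re.sub(r"[^0-9a-z_]", " ", text)    # drop every other special character
    text = re.sub(r"\s_\s", " ", text)         # a lone ' _ ' carries no information
    text = re.sub(r"\s+", " ", text).strip()   # collapse whitespace
    if remove_stopwords:
        text = " ".join(w for w in text.split(" ") if w not in DEMO_STOPWORDS)
    return text

print(clean_text("T-Shirt & Jeans!! Size: XL", remove_stopwords=False))
print(clean_text("New with the tags, in a BOX"))
```

The first call shows why the lone underscore is removed: the `&` between two words would otherwise survive as a standalone `_` token.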

In [None]:
# reference - https://albertauyeung.github.io/2018/06/03/generating-ngrams.html

def generate_ngrams(s, n):
    
    '''
        function to return the ngrams
        input: s: sentence, n- ngrams
        output: ngrams 
    '''
    
    # Break the sentence into tokens, dropping empty tokens
    tokens = [token for token in s.split(" ") if token != ""]
    
    # zip over n shifted copies of the token list to get every run of n consecutive tokens,
    # then join each run into a single space-separated string
    ngrams = zip(*[tokens[i:] for i in range(n)])
    return [" ".join(ngram) for ngram in ngrams]

In [None]:
def fill_missing_brands(row):
    '''
        function to fill the missing brands
        input: dataframe row (name, brand_name, item_description)
        output: brand_name if present, otherwise a known brand found in the
                name/description text, otherwise 'unk_brand'
    '''
    name, brand_name, item_description = row.iloc[0], row.iloc[1], row.iloc[2]
    if brand_name != 'unk_brand':
        return brand_name
    text = str(name) + ' ' + str(item_description)
    # try the longest n-grams first so multi-word brands win over single words
    for n in [4, 3, 2, 1]:
        for gram in generate_ngrams(text, n):
            # generate_ngrams already returns space-joined strings,
            # so the n-gram is compared to the brand list as-is
            if gram != 'unk_brand' and gram in all_unique_brands:
                return gram
    return 'unk_brand'
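
The brand lookup can be sketched on a toy example as follows. The brand set and listing texts are invented for illustration, and `ngrams` and `find_brand` are hypothetical helpers; the point is only that scanning longer n-grams first lets a multi-word brand win over a shorter one contained in it, and that a `set` keeps each membership check cheap.

```python
def ngrams(text, n):
    """Return all runs of n consecutive tokens as space-joined strings."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def find_brand(text, known_brands, max_n=4, default="unk_brand"):
    # try the longest n-grams first so multi-word brands are preferred
    for n in range(max_n, 0, -1):
        for gram in ngrams(text, n):
            if gram in known_brands:
                return gram
    return default

known_brands = {"under armour", "armour", "nike"}   # toy brand set
print(find_brand("under armour running shorts size m", known_brands))
print(find_brand("handmade wool scarf", known_brands))
```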


In [None]:
start_time = datetime.datetime.now()

print('keeping products whose price is greater than 0')
data = data[data['price'] > 0].reset_index(drop=True)
print('products filtering done!')
print('-' * 80)
print()


print('filling nan category_name values with "unk_cat/unk_subcat1/unk_subcat2"...')
# assign the result back instead of using inplace=True on a column accessor
data['category_name'] = data['category_name'].fillna('unk_cat/unk_subcat1/unk_subcat2')
print('filling nan category_name complete!')
print('-' * 80)
print()


print('converting brand_name to lower case...')
data['brand_name'] = data['brand_name'].str.lower()
print('brand_name converted to lower case!')
print('-' * 80)
print()

print('creating subcategory columns...')
data['main_category'], data['sub_category1'], data['sub_category2'] = zip(*data.category_name.apply(lambda x: category_split(x)))
print('subcategory columns creation completed!')
print('-' * 80)
print()

print('filling nan of item_description...')
data['item_description'] = data['item_description'].fillna('unk_desc')
print('item_description fillna complete !')
print('-' * 80)
print()

print('replacing "No description yet" with "unk_desc"...')
data['item_description'] = data['item_description'].str.replace('No description yet', 'unk_desc', regex=False)
print('item_description replacement of "No description yet" with "unk_desc" complete!')

print()
print('time taken to execute the cell : ', datetime.datetime.now()- start_time)

keeping products whose price is greater than 0
products filtering done!
--------------------------------------------------------------------------------

filling nan category_name values with "unk_cat/unk_subcat1/unk_subcat2"...
filling nan category_name complete!
--------------------------------------------------------------------------------

converting brand_name to lower case...
brand_name converted to lower case!
--------------------------------------------------------------------------------

creating subcategory columns...
subcategory columns creation completed!
--------------------------------------------------------------------------------

filling nan of item_description...
item_description fillna complete !
--------------------------------------------------------------------------------

replacing "No description yet" with "unk_desc"...
item_description replacement of "No description yet" with "unk_desc" complete!

time taken to execute the cell :  0:00:06.766591


In [None]:
print('preprocessing name...')
data['name'] = data['name'].progress_apply(lambda x: preprocess(x, 'name'))
print('preprocessing of name complete!')
print()


  0%|                                                                    | 2668/1482535 [00:00<00:55, 26677.74it/s]

preprocessing name...


100%|█████████████████████████████████████████████████████████████████| 1482535/1482535 [00:37<00:00, 39576.01it/s]


preprocessing of name complete!



In [None]:
print('preprocessing item_description...')
data['item_description'] = data['item_description'].progress_apply(lambda x: preprocess(x, 'item_description'))
print('preprocessing of item_description complete!')


preprocessing item_description...


100%|██████████████████████████████████████████████████████████████████| 1482535/1482535 [02:50<00:00, 8681.51it/s]

preprocessing of item_description complete!





In [None]:
print('filling nan brand values with "unk_brand"...')
data['brand_name'] = data['brand_name'].fillna('unk_brand')
print('\nfilled nan brand_name!')
print('-' * 80)
print()

no_brand_name_before = data[data.brand_name == 'unk_brand'].shape[0]

filling nan brand values with "unk_brand"...

filled nan brand_name!
--------------------------------------------------------------------------------



In [None]:
start_time = datetime.datetime.now()

print('filling missing brand_name with help of "name" feature...')
# a set makes the per-row membership checks in fill_missing_brands fast
all_unique_brands = set(data.brand_name.unique())
with open('all_unique_brands', "wb") as f:
    pickle.dump(all_unique_brands, f)
data['brand_name'] = data[['name','brand_name','item_description']].progress_apply(fill_missing_brands, axis=1)
print('\nfill missing brand name complete!')
print('-' * 80)
print()

print('time taken to execute the cell : ', datetime.datetime.now()- start_time)

  0%|                                                                                  | 0/1481661 [00:00<?, ?it/s]

filling missing brand_name with help of "name" feature...


 52%|██████████████████████████████████▌                               | 775019/1481661 [1:19:09<53:00, 222.21it/s]

In [None]:
print('number of unk_brand filled- {}'.format(no_brand_name_before - data[data.brand_name == 'unk_brand'].shape[0]))

In [None]:
data_copy = data.copy()

In [None]:
# after preprocessing, some names end up empty or NaN, e.g. names made up entirely of
# special characters or of non-Latin (such as Japanese) text
# note: str.replace('', ...) would insert the filler between every character,
# so empty strings are turned into NaN first and then filled
print('replacing the Nan values with "unk_name"')
data_copy['name'] = data_copy['name'].replace('', np.nan).fillna('unk_name')
print('filling of nan values complete')

In [None]:
# after preprocessing, some item descriptions end up empty or NaN once stopwords and
# special characters have been removed; empty strings are turned into NaN and then filled
print('replacing the Nan values with "unk_desc"')
data_copy['item_description'] = data_copy['item_description'].replace('', np.nan).fillna('unk_desc')
print('filling of nan values complete')

In [None]:
# dataframe to save into disk

data_copy.to_csv('mercari/mercari_dataframe_feature_engineering.csv', header=True, index=False)

In [None]:
data = pd.read_csv('/content/drive/MyDrive/mercari/mercari_dataframe.csv',)

# Feature Engineering

In [None]:
data_ = pd.read_csv('mercari/mercari_dataframe_feature_engineering.csv')

In [None]:
data = data_.copy()

In [None]:
print('\ncomputing word count of name feature...')
data['len_name'] = data['name'].apply(lambda x: len(str(x).split(' ')))
print('\nname_feature word count computation done!')
print('-' * 80)
print()

print('\ncomputing word count of item description...')
data['len_item_description'] = data['item_description'].apply(lambda x: len(str(x).split(' ')))
print('\nitem description word count computation done!')
print('-' * 80)
print()

print('\ncombining name with item descripiton with word count 10 ...')
data['name_desc'] = data['name'] + ' ' + data['item_description'].apply(lambda x: ' '.join(str(x).split(' ')[:10]))
print('\ncombining feature name and item_description done!')
print('-' * 80)
print()

print('\ncombining name, brand_name, subcategories together...')
data['name_brand_cat'] = 'name ' + data['name'] + ' ' + 'brand ' + data['brand_name'] + ' ' + 'main category ' + data['main_category'] + \
                        ' ' + 'sub category ' + data['sub_category1'] + ' ' + 'sub category ' + data['sub_category2']
print('\ncombining feature name, brand_name, subcategories together done!')
print('-' * 80)
print()

print('\nassigning the branded_products with value 1 and unknown_branded products with 0...')
data['brand_value'] = data['brand_name'].apply(lambda x: 1 if x != 'unk_brand' else 0)
print('\nassigning the branded products with 1 and non branded products with 0 done!')


computing word count of name feature...

name_feature word count computation done!
--------------------------------------------------------------------------------


computing word count of item description...

item description word count computation done!
--------------------------------------------------------------------------------


combining name with item descripiton with word count 10 ...

combining feature name and item_description done!
--------------------------------------------------------------------------------


combining name, brand_name, subcategories together...

combining feature name, brand_name, subcategories together done!
--------------------------------------------------------------------------------


assigning the branded_products with value 1 and unknown_branded products with 0...

assigning the branded products with 1 and non branded products with 0 done!


In [None]:
nltk.download('vader_lexicon')
sid = SentimentIntensityAnalyzer()

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /home/rahul_rbbisht1050/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


In [None]:
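def sentiment_analysis(sentence, sentiment):
    '''
        function to return one VADER polarity score of a sentence
        input: sentence, sentiment - one of 'pos', 'neg', 'neu' or 'compound'
    '''
    ss = sid.polarity_scores(sentence)
    return ss[sentiment]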

In [None]:
print('\ndoing sentiment_analysis for positive sentences...')
data['pos'] = data.item_description.astype(str).progress_apply(lambda x: sentiment_analysis(x, 'pos'))
print('\nsentiment analysis of positive sentences done!')
print('-' * 80)
print()

print('\ndoing sentiment_analysis for negative sentences...')
data['neg'] = data.item_description.astype(str).progress_apply(lambda x: sentiment_analysis(x, 'neg'))
print('\nsentiment analysis of negative sentences done!')
print('-' * 80)
print()

print('\ndoing sentiment_analysis for neutral sentences...')
data['neu'] = data.item_description.astype(str).progress_apply(lambda x: sentiment_analysis(x, 'neu'))
print('\nsentiment analysis of neutral sentences done!')


doing sentiment_analysis for positive sentences...


100%|██████████████████████████████████████████████████████████████████| 1481661/1481661 [06:32<00:00, 3771.29it/s]



sentiment analysis of positive sentences done!
--------------------------------------------------------------------------------


doing sentiment_analysis for negative sentences...


100%|██████████████████████████████████████████████████████████████████| 1481661/1481661 [06:33<00:00, 3767.80it/s]



sentiment analysis of negative sentences done!
--------------------------------------------------------------------------------


doing sentiment_analysis for neutral sentences...


100%|██████████████████████████████████████████████████████████████████| 1481661/1481661 [06:31<00:00, 3785.79it/s]


sentiment analysis of neutral sentences done!
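
Each of the three passes above calls the analyzer again on every description, although a single call already returns all of the scores. A possible single-pass variant is sketched below. `score_once` is a hypothetical helper and `fake_scorer` is a stand-in with made-up values so that the example runs without the VADER lexicon; in the notebook, `sid.polarity_scores` would be passed as the scorer instead.

```python
def score_once(texts, scorer, keys=("pos", "neg", "neu")):
    """Call scorer once per text and collect the requested scores column-wise."""
    columns = {key: [] for key in keys}
    for text in texts:
        scores = scorer(str(text))          # one call yields every score
        for key in keys:
            columns[key].append(scores[key])
    return columns

def fake_scorer(text):
    # stand-in with the same output shape as a polarity scorer (made-up values)
    pos = 1.0 if "great" in text else 0.0
    return {"pos": pos, "neg": 0.0, "neu": 1.0 - pos, "compound": pos}

columns = score_once(["great condition", "unk_desc"], fake_scorer)
print(columns)
```

The resulting lists could then be assigned to `data['pos']`, `data['neg']` and `data['neu']`, which should cut the scoring time to roughly a third of the three separate passes.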





In [None]:
data.shape

(1481661, 19)

In [None]:
# dataframe to save into disk

data.to_csv('mercari/train1.csv',header=True,index=False)

In [None]:
# load the 100-dimensional GloVe vectors into a word -> vector dictionary
words_dict = dict()
with open('glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        words_dict[word] = coefs


In [None]:
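def sentence_word2vec(sentence):
    '''
        function to turn a sentence into a single 100-dimensional vector
        by summing the GloVe vectors of the words found in words_dict.
        words missing from words_dict are skipped.
    '''
    vector = np.zeros(100)
    for word in sentence.split():
        if word in words_dict:
            vector += words_dict[word]
    
    return vector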
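
`sentence_word2vec` sums the word vectors, so a longer text tends to produce a vector with a larger norm. One common alternative, sketched here under the assumption that a length-independent representation is wanted, is to average over the words that were found. The function name `sentence_mean_vector` and the two-dimensional toy embeddings are illustrative only.

```python
import numpy as np

def sentence_mean_vector(sentence, embeddings, dim):
    """Average the vectors of the known words; all zeros if none is known."""
    vectors = [embeddings[w] for w in sentence.split() if w in embeddings]
    if not vectors:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)

# toy 2-dimensional embeddings (made-up values)
toy_embeddings = {
    "red": np.array([1.0, 0.0]),
    "shirt": np.array([0.0, 1.0]),
}
print(sentence_mean_vector("red shirt", toy_embeddings, dim=2))
print(sentence_mean_vector("unknown words", toy_embeddings, dim=2))
```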

# Data Vectorization

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder, LabelEncoder
from sklearn.feature_extraction.text import CountVectorizer
from scipy.sparse import csr_matrix
from sklearn.preprocessing import MinMaxScaler
import csv
from scipy import sparse
from sklearn.impute import SimpleImputer

In [None]:
data = pd.read_csv('mercari/train1.csv')

In [None]:
data.head(2)

Unnamed: 0,train_id,name,item_condition_id,category_name,brand_name,price,shipping,item_description,main_category,sub_category1,sub_category2,len_name,len_item_description,name_desc,name_brand_cat,brand_value,pos,neg,neu
0,0,mlb cincinnati reds t shirt size xl,3,Men/Tops/T-shirts,unk_brand,10.0,1,unk_desc,Men,Tops,T-shirts,7,1,mlb cincinnati reds t shirt size xl unk_desc,name_mlb cincinnati reds t shirt size xl brand...,0,0.0,0.0,1.0
1,1,razer blackwidow chroma keyboard,3,Electronics/Computers & Tablets/Components & P...,razer,52.0,0,keyboard great condition works like came box p...,Electronics,Computers & Tablets,Components & Parts,4,18,razer blackwidow chroma keyboard keyboard grea...,name_razer blackwidow chroma keyboard brandraz...,1,0.419,0.0,0.581


In [None]:
data.shape

(1481661, 19)

In [None]:
data.drop(columns=['train_id', 'name', 'item_description'],inplace=True)

In [None]:
X = data
y = data.price

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

In [None]:
# median training price of every brand
brand_dict_price = X_train.groupby('brand_name')['price'].median().to_dict()

In [None]:
# thresholds are computed on the training prices only, so the test split stays unseen
price_25_percentile = np.percentile(y_train, 25)
price_90_percentile = np.percentile(y_train, 90)

In [None]:
cheap_brand = set()
affordable_brand = set()
expensive_brand = set()

def categorise_brand_price(brand_name):
    
    '''
        function to categorise the brands as 'cheap', 'affordable' or 'expensive'
        based on the median training price of the brand.
        input: brand_name
        output: returns one category
    '''
    
    median_price = brand_dict_price[brand_name]
    # a single if / elif / else chain guarantees that brand is always assigned
    if median_price <= price_25_percentile:
        cheap_brand.add(brand_name)
        brand = 'cheap'
    elif median_price <= price_90_percentile:
        affordable_brand.add(brand_name)
        brand = 'affordable'
    else:
        expensive_brand.add(brand_name)   
        brand = 'expensive'
    
    return brand

In [None]:
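# saving the brand sets to disk
# note: the sets are filled by categorise_brand_price, so run this cell only after
# the categorise_brand column has been computed on X_train
with open('cheap_brand_set', "wb") as f:
    pickle.dump(cheap_brand, f)
with open('affordable_brand_set', "wb") as f:
    pickle.dump(affordable_brand, f)
with open('expensive_brand_set', "wb") as f:
    pickle.dump(expensive_brand, f)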

In [None]:
def fill_brand_category(brand_name):
    
    '''
        function to fill the brand_category which returns the 'cheap', 'affordable' and 'expensive'.
    '''
    
    try:
        if brand_name in cheap_brand:
            return 'cheap'
        elif brand_name in affordable_brand:
            return 'affordable'
        elif brand_name in expensive_brand:
            return 'expensive'
        else:
            return 'affordable'
    except:
        return 'affordable'

In [None]:
# categorise data in to cheap, affordable, expensive brands 
X_train['categorise_brand'] = X_train['brand_name'].apply(lambda x: categorise_brand_price(x))

In [None]:
X_test['categorise_brand'] = X_test['brand_name'].apply(lambda x: fill_brand_category(x))
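
The train/test asymmetry above (tiers are derived from training prices, then only looked up for the test split) can be sketched in plain Python. All names, prices and thresholds below are invented for the example; unseen brands fall back to 'affordable', mirroring `fill_brand_category`.

```python
from statistics import median

def fit_brand_tiers(brands, prices, low, high):
    """Map each training brand to a tier based on its median price."""
    by_brand = {}
    for brand, price in zip(brands, prices):
        by_brand.setdefault(brand, []).append(price)
    tiers = {}
    for brand, brand_prices in by_brand.items():
        m = median(brand_prices)
        if m <= low:
            tiers[brand] = "cheap"
        elif m <= high:
            tiers[brand] = "affordable"
        else:
            tiers[brand] = "expensive"
    return tiers

def lookup_tier(brand, tiers, default="affordable"):
    # brands never seen during fitting get the default tier
    return tiers.get(brand, default)

train_brands = ["a", "a", "b", "c", "c"]
train_prices = [5.0, 9.0, 20.0, 80.0, 120.0]
tiers = fit_brand_tiers(train_brands, train_prices, low=10.0, high=50.0)
print(tiers)
print(lookup_tier("never_seen", tiers))
```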

In [None]:
def ordinal_encoder(column_name, file_name):
    
    '''
        function to give each category a unique integer code.
        categories unseen during fit are encoded as -1 and then
        replaced with the most frequent training code.
    '''
    
    # converting the categorical values to integer codes
    encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
    train_ = encoder.fit_transform(X_train[column_name].astype(str).values.reshape(-1,1))
    test_  = encoder.transform(X_test[column_name].astype(str).values.reshape(-1,1))
    
    # replace the -1 placeholder (unseen category) with the most frequent training code
    imputer = SimpleImputer(missing_values=-1, strategy='most_frequent')
    train_impute = imputer.fit_transform(train_)
    test_impute  = imputer.transform(test_)
    
    # persist the fitted objects (the 'encoder/' and 'imputer/' directories must already exist)
    with open('encoder/' + file_name + '_ordinal_encoder.pkl', "wb") as f:
        pickle.dump(encoder, f)
    with open('imputer/' + file_name + '_imputer.pkl', "wb") as f:
        pickle.dump(imputer, f)

    return train_impute, test_impute
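
In plain terms, the encoder/imputer pair assigns each training category an integer code and maps any category that appears only in the test split to the most frequent training code. The sketch below mimics that behaviour without scikit-learn. `fit_codes` and `encode` are hypothetical helpers, the brand names are toy data, and sorting the categories is an assumption made to keep the codes deterministic.

```python
from collections import Counter

def fit_codes(train_values):
    """Return (category -> code mapping, code of the most frequent category)."""
    categories = sorted(set(train_values))
    codes = {cat: i for i, cat in enumerate(categories)}
    most_common_cat, _ = Counter(train_values).most_common(1)[0]
    return codes, codes[most_common_cat]

def encode(values, codes, fallback_code):
    # unseen categories take the most frequent training code
    return [codes.get(v, fallback_code) for v in values]

train = ["nike", "adidas", "nike", "puma"]
test = ["adidas", "gucci"]            # 'gucci' never appears in train
codes, fallback = fit_codes(train)
print(encode(train, codes, fallback))
print(encode(test, codes, fallback))
```

One consequence worth keeping in mind is that an unseen brand becomes indistinguishable from the most frequent training brand after this step.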

In [None]:
# giving each of the feature a unique label

train_brand_name, test_brand_name = ordinal_encoder('brand_name', 'brand_name')
train_category_brand, test_category_brand = ordinal_encoder('categorise_brand', 'categorise_brand')
train_category, test_category = ordinal_encoder('category_name', 'category_name')
train_main_category, test_main_category = ordinal_encoder('main_category', 'main_category')
train_sub_category1, test_sub_category1 = ordinal_encoder('sub_category1', 'sub_category1')
train_sub_category2, test_sub_category2 = ordinal_encoder('sub_category2', 'sub_category2')

In [None]:
# additional:- https://datascience.stackexchange.com/questions/44009/scaling-label-encoded-values-for-linear-algorithms
brand_scaler = MinMaxScaler()
train_brand_name = brand_scaler.fit_transform(train_brand_name)
test_brand_name =  brand_scaler.transform(test_brand_name)
pickle.dump(brand_scaler, open('scaler/brand_scaler.pkl', "wb"))

In [None]:
category_brand_scaler = MinMaxScaler()
train_category_brand = category_brand_scaler.fit_transform(train_category_brand)
test_category_brand  = category_brand_scaler.transform(test_category_brand)
pickle.dump(category_brand_scaler, open('scaler/category_brand_scaler.pkl', "wb"))

In [None]:
category_scaler = MinMaxScaler()
train_category = category_scaler.fit_transform(train_category)
test_category  = category_scaler.transform(test_category)
pickle.dump(category_scaler, open('scaler/category_scaler.pkl', "wb"))

In [None]:
main_category_scaler = MinMaxScaler()
train_main_category = main_category_scaler.fit_transform(train_main_category)
test_main_category  = main_category_scaler.transform(test_main_category)
pickle.dump(main_category_scaler, open('scaler/main_category_scaler.pkl', "wb"))

In [None]:
main_sub_category1_scaler = MinMaxScaler()
train_sub_category1 = main_sub_category1_scaler.fit_transform(train_sub_category1)
test_sub_category1  = main_sub_category1_scaler.transform(test_sub_category1)
pickle.dump(main_sub_category1_scaler, open('scaler/main_sub_category1_scaler.pkl', "wb"))

In [None]:
main_sub_category2_scaler = MinMaxScaler()
train_sub_category2 = main_sub_category2_scaler.fit_transform(train_sub_category2)
test_sub_category2  = main_sub_category2_scaler.transform(test_sub_category2)
pickle.dump(main_sub_category2_scaler, open('scaler/main_sub_category2_scaler.pkl', "wb"))

In [None]:
train_brand_name, test_brand_name = sparse.csr_matrix(train_brand_name), sparse.csr_matrix(test_brand_name)
train_category_brand, test_category_brand = sparse.csr_matrix(train_category_brand), sparse.csr_matrix(test_category_brand)
train_category, test_category = sparse.csr_matrix(train_category), sparse.csr_matrix(test_category)
train_main_category, test_main_category = sparse.csr_matrix(train_main_category), sparse.csr_matrix(test_main_category)
train_sub_category1, test_sub_category1 = sparse.csr_matrix(train_sub_category1), sparse.csr_matrix(test_sub_category1)
train_sub_category2, test_sub_category2 = sparse.csr_matrix(train_sub_category2), sparse.csr_matrix(test_sub_category2)

In [None]:
X_train.drop(columns=['brand_name','category_name', 'main_category', 'sub_category1', 'sub_category2'], inplace=True)
X_test.drop(columns=['brand_name','category_name', 'main_category', 'sub_category1', 'sub_category2'], inplace=True)

In [None]:
# scaling the item_condition_id feature 
X_train['item_condition_id'] = X_train['item_condition_id'] / 5.
X_test['item_condition_id'] = X_test['item_condition_id'] / 5.

In [None]:
name_scaler = MinMaxScaler()
X_train['len_name'] = name_scaler.fit_transform(X_train['len_name'].values.reshape(-1,1))
X_test['len_name'] = name_scaler.transform(X_test['len_name'].values.reshape(-1,1))
pickle.dump(name_scaler, open('scaler/len_name_scaler.pkl', "wb"))

In [None]:
item_desc_scaler = MinMaxScaler()
X_train['len_item_description'] = item_desc_scaler.fit_transform(X_train['len_item_description'].values.reshape(-1,1))
X_test['len_item_description']  = item_desc_scaler.transform(X_test['len_item_description'].values.reshape(-1,1))
pickle.dump(item_desc_scaler, open('scaler/len_item_description_scaler.pkl', "wb"))
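
Every scaler above is fitted on the training split and then reused on the test split. A minimal pure-Python sketch of that pattern follows (the function names and sample values are hypothetical). It also shows that a test value outside the training range maps outside [0, 1], which is what `MinMaxScaler` does with its default settings.

```python
def fit_min_max(train_values):
    """Learn the minimum and maximum from the training values only."""
    return min(train_values), max(train_values)

def apply_min_max(values, lo, hi):
    if hi == lo:
        # constant feature: map everything to 0.0 to avoid dividing by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

train_len = [1, 3, 5]
test_len = [2, 7]                     # 7 is larger than anything seen in training
lo, hi = fit_min_max(train_len)
print(apply_min_max(train_len, lo, hi))
print(apply_min_max(test_len, lo, hi))
```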

In [None]:
print('converting the concatenation of train_name_description column to its respective vector form...')
train_name_desc_vector = X_train.name_desc.astype(str).progress_apply(lambda x: sentence_word2vec(x))
print('conversion of train_name_desc to vector completed!')
print('-' * 80)
print()

print('converting the concatenation of test_name_description column to its respective vector form...')
test_name_desc_vector = X_test.name_desc.astype(str).progress_apply(lambda x: sentence_word2vec(x))
print('conversion of test_name_desc to vector completed!')
print('-' * 80)
print()

print('converting the concatenation of name, brand, sub_categories column to its respective vector form...')
train_name_brand_cat_vector = X_train.name_brand_cat.astype(str).progress_apply(lambda x: sentence_word2vec(x))
print('conversion of train_name_brand_cat to vector completed!')
print('-' * 80)
print()

print('converting the concatenation of name, brand, sub_categories column to its respective vector form...')
test_name_brand_cat_vector = X_test.name_brand_cat.astype(str).progress_apply(sentence_word2vec)
print('conversion of test_name_brand_cat to vector completed!')
print('-' * 80)
print()

converting the concatenation of train_name_description column to its respective vector form...


100%|█████████████████████████████████████████████████████████████████| 1333494/1333494 [00:40<00:00, 33120.65it/s]

conversion of train_name_desc to vector completed!
--------------------------------------------------------------------------------

converting the concatenation of test_name_description column to its respective vector form...


100%|███████████████████████████████████████████████████████████████████| 148167/148167 [00:04<00:00, 34053.67it/s]


conversion of test_name_desc to vector completed!
--------------------------------------------------------------------------------

converting the concatenation of name, brand, sub_categories column to its respective vector form...


100%|█████████████████████████████████████████████████████████████████| 1333494/1333494 [00:27<00:00, 48826.90it/s]

conversion of train_name_brand_cat to vector completed!
--------------------------------------------------------------------------------

converting the concatenation of name, brand, sub_categories column to its respective vector form...


100%|███████████████████████████████████████████████████████████████████| 148167/148167 [00:02<00:00, 50628.37it/s]

conversion of test_name_brand_cat to vector completed!
--------------------------------------------------------------------------------

In [None]:
X_train.drop(columns=['name_desc', 'name_brand_cat', 'price'], inplace=True)
X_test.drop(columns=['name_desc', 'name_brand_cat', 'price'], inplace=True)

In [None]:
# turn each Series of per-row vectors into a 2-D matrix (one row per listing)
train_name_desc_vector = sparse.csc_matrix(train_name_desc_vector.values.tolist())
test_name_desc_vector = sparse.csc_matrix(test_name_desc_vector.values.tolist())

train_name_brand_cat_vector = sparse.csc_matrix(train_name_brand_cat_vector.values.tolist())
test_name_brand_cat_vector = sparse.csc_matrix(test_name_brand_cat_vector.values.tolist())
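
Unlike the one-hot blocks, the embedding features are populated in every entry. The sketch below uses made-up 4-dimensional vectors (the real dimensionality depends on the word2vec model behind `sentence_word2vec`, which is not shown in this section). It illustrates that the sparse wrapper stores every value, so here it likely serves as a uniform container for the later stacking step rather than as a memory saving.

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
# Stand-in for the per-row sentence vectors (assumed: equal-length 1-D arrays)
row_vectors = [rng.normal(size=4) for _ in range(3)]

dense = np.vstack(row_vectors)        # 2-D array, one row per listing
wrapped = sparse.csc_matrix(dense)    # same values in a sparse container

print(dense.shape)   # rows x embedding dimension
print(wrapped.nnz)   # number of explicitly stored values
```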

In [None]:
# stacking the features horizontally
X_train_hstack = sparse.hstack((X_train['item_condition_id'].values.reshape(-1,1), X_train['shipping'].values.reshape(-1,1),
                                train_brand_name, train_category_brand, train_category, train_main_category,
                                train_sub_category1, train_sub_category2, X_train['len_name'].values.reshape(-1,1),
                                X_train['len_item_description'].values.reshape(-1,1), X_train['brand_value'].values.reshape(-1,1),
                                X_train['pos'].values.reshape(-1,1),
                                X_train['neg'].values.reshape(-1,1), X_train['neu'].values.reshape(-1,1), train_name_desc_vector,
                                train_name_brand_cat_vector)).tocsr()

In [None]:
# stacking the test features in the same order as the training features
X_test_hstack = sparse.hstack((X_test['item_condition_id'].values.reshape(-1,1), X_test['shipping'].values.reshape(-1,1),
                               test_brand_name, test_category_brand, test_category, test_main_category,
                               test_sub_category1, test_sub_category2, X_test['len_name'].values.reshape(-1,1),
                               X_test['len_item_description'].values.reshape(-1,1), X_test['brand_value'].values.reshape(-1,1),
                               X_test['pos'].values.reshape(-1,1),
                               X_test['neg'].values.reshape(-1,1), X_test['neu'].values.reshape(-1,1), test_name_desc_vector,
                               test_name_brand_cat_vector)).tocsr()
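
Because the train and test matrices are assembled in two separate cells, a quick shape check can catch a block that was omitted or has a different width in one of them. The sketch below uses toy blocks rather than the notebook's features, and `check_same_columns` is a hypothetical helper.

```python
import numpy as np
from scipy import sparse

def check_same_columns(train_matrix, test_matrix):
    """Raise if the two stacked matrices disagree on the number of feature columns."""
    if train_matrix.shape[1] != test_matrix.shape[1]:
        raise ValueError(
            f"column mismatch: {train_matrix.shape[1]} vs {test_matrix.shape[1]}"
        )
    return train_matrix.shape[1]

# Toy blocks: a dense numeric column plus a sparse one-hot block
train_toy = sparse.hstack((np.array([[0.2], [1.0], [0.6]]),
                           sparse.csr_matrix([[1, 0], [0, 1], [1, 0]]))).tocsr()
test_toy = sparse.hstack((np.array([[0.4], [0.8]]),
                          sparse.csr_matrix([[0, 1], [1, 0]]))).tocsr()

n_features = check_same_columns(train_toy, test_toy)
print(n_features)
```

With the notebook's variables, the equivalent call would be `check_same_columns(X_train_hstack, X_test_hstack)`. Equal widths do not prove that the blocks are in the same order, so this is only a first line of defence.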

In [None]:
sparse.save_npz('mercari/X_train_hstack.npz', X_train_hstack)
sparse.save_npz('mercari/X_test_hstack.npz', X_test_hstack)

In [None]:
# store the targets as sparse column vectors so they can be saved with sparse.save_npz
y_train = sparse.csc_matrix(y_train).reshape(-1,1)
y_test = sparse.csc_matrix(y_test).reshape(-1,1)

In [None]:
sparse.save_npz('mercari/y_train.npz', y_train)
sparse.save_npz('mercari/y_test.npz', y_test)
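
To read these files back in a later session, `sparse.load_npz` returns each matrix in the sparse format it was saved in. The sketch below round-trips a toy target through a temporary directory (not the notebook's `mercari/` folder) and flattens it back to a 1-D array, the shape most scikit-learn regressors expect for `y`.

```python
import os
import tempfile

import numpy as np
from scipy import sparse

y_toy = np.array([10.0, 0.0, 52.0])                 # toy prices, including a zero
y_sparse = sparse.csc_matrix(y_toy).reshape(-1, 1)  # column vector, as in the cell above

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'y_toy.npz')
    sparse.save_npz(path, y_sparse)
    y_loaded = sparse.load_npz(path)                # fully read into memory here

y_restored = y_loaded.toarray().ravel()             # back to a dense 1-D array
print(y_restored)
```

Zero-valued targets are not stored explicitly in the sparse file, but `toarray()` restores them, so the round trip preserves the original values.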