# [Mercari Price Suggestion Challenge](https://www.kaggle.com/c/mercari-price-suggestion-challenge)
Can you automatically suggest product prices to online sellers?

# Import packages
- pandas
- numpy
- TfidfVectorizer

In [1]:
import pandas as pd #data processing
import numpy as np #linear algebra
from sklearn.feature_extraction.text import TfidfVectorizer #to calculate Tf-idf

# Import data
Load the train and test data. The files are tab-separated (TSV), so the delimiter is set to a tab.

In [2]:
%%time
train_df = pd.read_csv("data/train.tsv", delimiter="\t", low_memory= True)
test_df = pd.read_csv("data/test.tsv", delimiter="\t", low_memory= True)

CPU times: user 9.53 ms, sys: 4.12 ms, total: 13.6 ms
Wall time: 13 ms


In [3]:
train_df.head()

Unnamed: 0,train_id,name,item_condition_id,category_name,brand_name,price,shipping,item_description
0,0,MLB Cincinnati Reds T Shirt Size XL,3,Men/Tops/T-shirts,,10.0,1,No description yet
1,1,Razer BlackWidow Chroma Keyboard,3,Electronics/Computers & Tablets/Components & P...,Razer,52.0,0,This keyboard is in great condition and works ...
2,2,AVA-VIV Blouse,1,Women/Tops & Blouses/Blouse,Target,10.0,1,Adorable top with a hint of lace and a key hol...
3,3,Leather Horse Statues,1,Home/Home Décor/Home Décor Accents,,35.0,1,New with tags. Leather horses. Retail for [rm]...
4,4,24K GOLD plated rose,1,Women/Jewelry/Necklaces,,44.0,0,Complete with certificate of authenticity


In [4]:
train_df.columns

Index(['train_id', 'name', 'item_condition_id', 'category_name', 'brand_name',
       'price', 'shipping', 'item_description'],
      dtype='object')

# data preprocessing

### converting column types

In [5]:
print(train_df.dtypes)
print("------------")
print(test_df.dtypes)

train_id               int64
name                  object
item_condition_id      int64
category_name         object
brand_name            object
price                float64
shipping               int64
item_description      object
dtype: object
------------
test_id               int64
name                 object
item_condition_id     int64
category_name        object
brand_name           object
shipping              int64
item_description     object
dtype: object


converting column data types to minimize memory

In [6]:
def col_type_conversion(data):
    data["item_condition_id"] = data["item_condition_id"].astype("int32")
    data["shipping"] = data["shipping"].astype("int8")

col_type_conversion(train_df)
col_type_conversion(test_df)

In [7]:
print(train_df.dtypes)
print("------------")
print(test_df.dtypes)

train_id               int64
name                  object
item_condition_id      int32
category_name         object
brand_name            object
price                float64
shipping                int8
item_description      object
dtype: object
------------
test_id               int64
name                 object
item_condition_id     int32
category_name        object
brand_name           object
shipping               int8
item_description     object
dtype: object
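To see what a dtype conversion like this saves, memory use can be compared before and after. The following is a minimal sketch on a made-up toy frame (not the competition data). It uses `pd.to_numeric(..., downcast="integer")`, which picks the smallest signed integer type that fits the observed values instead of hard-coding `int32` or `int8`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train_df; the values are made up for illustration.
toy = pd.DataFrame({
    "item_condition_id": np.array([1, 2, 3, 5] * 250, dtype="int64"),
    "shipping": np.array([0, 1, 1, 0] * 250, dtype="int64"),
})

before = toy.memory_usage(index=False).sum()

# downcast="integer" selects the smallest signed integer dtype that can hold
# the values actually present (int8 for both columns here).
for col in ["item_condition_id", "shipping"]:
    toy[col] = pd.to_numeric(toy[col], downcast="integer")

after = toy.memory_usage(index=False).sum()
print(toy.dtypes)
print(f"bytes before: {before}, after: {after}")
```

Automatic downcasting depends on the values present, so train and test could end up with different dtypes. The explicit `astype` calls above avoid that.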


explore some descriptive statistics

In [8]:
print("train_df shape: {}\ntest_df shape: {}".format(train_df.shape, test_df.shape))

train_df shape: (500, 8)
test_df shape: (500, 7)


In [9]:
pd.set_option("float_format", "{:f}".format)
train_df.describe()

Unnamed: 0,train_id,item_condition_id,price,shipping
count,500.0,500.0,500.0,500.0
mean,249.5,1.908,28.632,0.44
std,144.481833,0.925878,47.412112,0.496884
min,0.0,1.0,3.0,0.0
25%,124.75,1.0,10.0,0.0
50%,249.5,2.0,16.0,0.0
75%,374.25,3.0,31.0,1.0
max,499.0,5.0,650.0,1.0


checking the number of unique values in each column

In [10]:
train_df.apply(lambda x: x.nunique())

train_id             500
name                 499
item_condition_id      5
category_name        190
brand_name           151
price                 83
shipping               2
item_description     470
dtype: int64

### checking missing values

In [11]:
print(train_df.isnull().sum()[train_df.isnull().sum() != 0])
print("------------")
print(test_df.isnull().sum()[test_df.isnull().sum() != 0])

category_name      3
brand_name       205
dtype: int64
------------
category_name      1
brand_name       206
dtype: int64


Missing values show up in category_name and brand_name. item_description has none in the data loaded here, but it is handled as well in case it does:
- Fill products with no brand name with 'NoBrand'
- Fill products with no category name with 'No/No/No'
- Fill products with no item description with 'No description yet' (the placeholder the data already uses, as in the first training row)

In [12]:
test_df["item_description"].isnull().sum() != 0

False

In [13]:
def input_missing_values(data):
    if data["brand_name"].isnull().sum() != 0:
        data["brand_name"] = data["brand_name"].fillna("NoBrand")
    if data["category_name"].isnull().sum() != 0:
        data["category_name"] = data["category_name"].fillna("No/No/No")
    if data["item_description"].isnull().sum() != 0:
        data["item_description"] = data["item_description"].fillna("No description yet")
        
input_missing_values(train_df)
input_missing_values(test_df)

In [14]:
print(train_df.isnull().sum()[train_df.isnull().sum() != 0])
print("------------")
print(test_df.isnull().sum()[test_df.isnull().sum() != 0])

Series([], dtype: int64)
------------
Series([], dtype: int64)
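The same filling rules can also be written as a single `fillna` call with a column-to-value mapping. This is a sketch on a made-up toy frame. Unlike `input_missing_values`, it returns a new frame rather than modifying the input in place.

```python
import numpy as np
import pandas as pd

# Toy rows with one gap per column; the values are illustrative only.
toy = pd.DataFrame({
    "brand_name": ["Razer", np.nan],
    "category_name": [np.nan, "Women/Jewelry/Necklaces"],
    "item_description": ["Works perfectly", np.nan],
})

# One placeholder per column, matching the rules listed above.
fill_values = {
    "brand_name": "NoBrand",
    "category_name": "No/No/No",
    "item_description": "No description yet",
}

filled = toy.fillna(value=fill_values)
remaining = int(filled.isnull().sum().sum())
print(filled)
print("missing values left:", remaining)
```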


# feature engineering

extracting features from category_name and item_description

split category_name into:
- general_category
- subcategory_1
- subcategory_2

In [15]:
def split_category_name(data):
    # n=2 splits on the first two "/" only, giving at most three parts
    parts = data["category_name"].str.split("/", n=2, expand=True)
    data["general_category"] = parts[0]
    data["subcategory_1"] = parts[1]
    data["subcategory_2"] = parts[2]
    return data[["general_category", "subcategory_1", "subcategory_2"]].head()

split_category_name(train_df)
split_category_name(test_df)

Unnamed: 0,general_category,subcategory_1,subcategory_2
0,Women,Jewelry,Rings
1,Other,Office supplies,Shipping Supplies
2,Vintage & Collectibles,Bags and Purses,Handbag
3,Women,Sweaters,Cardigan
4,Other,Books,Religion & Spirituality
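The effect of `n=2` is easiest to see on a category path with more than three levels. The sketch below uses a hypothetical five-level path to show that everything after the second "/" stays together in the third column.

```python
import pandas as pd

categories = pd.Series([
    "Men/Tops/T-shirts",
    # Hypothetical deeper path, used only to illustrate the n=2 behaviour.
    "Electronics/Computers & Tablets/iPad/Tablet/eBook Readers",
    "No/No/No",  # placeholder used for missing categories
])

# At most two splits, so there are always exactly three columns.
parts = categories.str.split("/", n=2, expand=True)
parts.columns = ["general_category", "subcategory_1", "subcategory_2"]
print(parts)
```

Without `n=2`, the deeper path would produce five columns, and the shorter rows would be padded with missing values.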


dealing with item_description:
<br>[Extensive Text Data Feature Engineering](https://www.kaggle.com/shivamb/extensive-text-data-feature-engineering)
- character length
- word count
- word density
- punctuation count (listed in the reference; not computed in the cell below)

In [16]:
def text_data_fe(data):
    data["char_count"] = data["item_description"].apply(len)
    data["word_count"] = data["item_description"].apply(lambda x: len(x.split()))
    data["word_density"] = data["char_count"] / (data["word_count"]+1)
    return data[["char_count", "word_count", "word_density"]].head()

text_data_fe(train_df)
text_data_fe(test_df)

Unnamed: 0,char_count,word_count,word_density
0,6,2,2.0
1,251,38,6.435897
2,55,11,4.583333
3,67,10,6.090909
4,167,29,5.566667
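The punctuation count from the list above is not part of `text_data_fe`. One way it could be added is sketched here, assuming that "punctuation" means the ASCII characters in Python's `string.punctuation`. The helper name `punctuation_count` is hypothetical.

```python
import string

# ASCII punctuation characters: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
PUNCTUATION = set(string.punctuation)

def punctuation_count(text: str) -> int:
    """Count the characters of `text` that are ASCII punctuation."""
    return sum(1 for ch in text if ch in PUNCTUATION)

example = "New with tags. Leather horses. Retail for [rm] each!"
print(punctuation_count(example))
```

On a DataFrame this could be applied in the same way as the other features, for example `data["item_description"].apply(punctuation_count)`.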


[CountVectorizer, TfidfVectorizer, Predict Comments](https://www.kaggle.com/adamschroeder/countvectorizer-tfidfvectorizer-predict-comments)
<br>[Using TfidfVectorizer output to create columns in a pandas df](https://www.reddit.com/r/learnpython/comments/7aduzh/using_tfidfvectorizer_output_to_create_columns_in/)
<br>[Check if multiple strings exist in another string](https://stackoverflow.com/questions/3389574/check-if-multiple-strings-exist-in-another-string)
- top tfidf word
- top tfidf value

In [17]:
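# Caveat: stop_words is a set of literal tokens here, so "english" is just a word to drop;
# scikit-learn's built-in English stop word list is not applied.
vectorizer = TfidfVectorizer(min_df=3, max_features=2500, dtype=np.float32, 
                             strip_accents="unicode", analyzer="word", ngram_range=(1, 3), 
                             stop_words={"english", "rm", "co"})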
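Passing a collection to `stop_words` makes scikit-learn treat every entry as a literal token, so `"english"` in a collection does not switch on the built-in English list. If the intent was to drop common English words plus the custom tokens "rm" and "co", one option is to extend `ENGLISH_STOP_WORDS` explicitly. This is a sketch on two made-up sentences, not a rerun of the notebook, and applying it would change the vocabulary shown further down.

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer

# Built-in English list plus the two custom tokens, as a plain list.
custom_stop_words = list(ENGLISH_STOP_WORDS.union({"rm", "co"}))

# Made-up descriptions, for illustration only.
docs = [
    "the leather keyboard is in the box rm",
    "the leather statue is on the keyboard",
]

sketch_vectorizer = TfidfVectorizer(stop_words=custom_stop_words)
sketch_vectorizer.fit(docs)
vocab = set(sketch_vectorizer.vocabulary_)
print(sorted(vocab))
```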

In [18]:
%%time
description_text = list(train_df["item_description"].values)
tfidf_matrix = vectorizer.fit_transform(description_text)

CPU times: user 63.7 ms, sys: 4.79 ms, total: 68.5 ms
Wall time: 67.6 ms


In [19]:
tfidf_matrix

<500x1110 sparse matrix of type '<class 'numpy.float32'>'
	with 9363 stored elements in Compressed Sparse Row format>

In [20]:
tfidf_matrix.shape

(500, 1110)

the learned corpus vocabulary

In [21]:
vectorizer.vocabulary_

{'no': 621,
 'description': 244,
 'yet': 1094,
 'no description': 623,
 'description yet': 245,
 'no description yet': 624,
 'this': 971,
 'is': 460,
 'in': 433,
 'great': 384,
 'condition': 214,
 'and': 47,
 'works': 1084,
 'like': 511,
 'it': 468,
 'out': 686,
 'of': 645,
 'the': 948,
 'box': 140,
 'all': 37,
 'are': 68,
 'work': 1083,
 'perfectly': 715,
 'via': 1030,
 'on': 657,
 'your': 1105,
 'is in': 462,
 'in great': 439,
 'great condition': 385,
 'condition and': 215,
 'out of': 690,
 'of the': 649,
 'the box': 950,
 'all of': 39,
 'on your': 665,
 'in great condition': 440,
 'adorable': 31,
 'top': 996,
 'with': 1069,
 'lace': 492,
 'key': 484,
 'back': 84,
 'pink': 729,
 'also': 40,
 'have': 403,
 'available': 82,
 'white': 1057,
 'top with': 998,
 'in the': 446,
 'the back': 949,
 'also have': 42,
 'in the back': 447,
 'new': 603,
 'tags': 929,
 'leather': 497,
 'retail': 790,
 'for': 327,
 'each': 273,
 'stand': 902,
 'about': 22,
 'high': 415,
 'they': 967,
 'being': 113,


create a dictionary mapping each token to its IDF weight (vectorizer.idf_ holds corpus-level inverse document frequencies, not per-document tf-idf scores)

In [22]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
tfidf = dict(zip(vectorizer.get_feature_names_out(), vectorizer.idf_))
tfidf_df = pd.DataFrame.from_dict(tfidf, orient="index")
tfidf_df.columns = ["description_text_tfidf"]

In [23]:
tfidf_df.shape

(1110, 1)

In [24]:
tfidf_df.to_dict()

{'description_text_tfidf': {'10': 4.326234340667725,
  '100': 4.272167205810547,
  '100 authentic': 5.13716459274292,
  '11': 5.13716459274292,
  '12': 4.818710803985596,
  '13': 5.8303117752075195,
  '15': 5.607168197631836,
  '16': 5.607168197631836,
  '20': 5.607168197631836,
  '2017': 5.8303117752075195,
  '21': 5.424846649169922,
  '22': 5.8303117752075195,
  '24': 5.424846649169922,
  '25': 5.270695686340332,
  '2x': 5.8303117752075195,
  '32': 5.607168197631836,
  '34': 5.8303117752075195,
  '3rd': 5.607168197631836,
  '5c': 5.424846649169922,
  '5s': 5.8303117752075195,
  '6s': 5.607168197631836,
  '6s plus': 5.607168197631836,
  'about': 4.508555889129639,
  'about the': 5.8303117752075195,
  'above': 5.8303117752075195,
  'across': 5.424846649169922,
  'actual': 5.8303117752075195,
  'add': 5.8303117752075195,
  'addition': 5.8303117752075195,
  'addition to': 5.8303117752075195,
  'adjustable': 5.424846649169922,
  'adorable': 5.8303117752075195,
  'after': 4.508555889129639
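The repeated values in this output follow from how `idf_` is computed. With the default `smooth_idf=True`, scikit-learn uses idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) the number of documents containing the term. The sketch below evaluates that formula for n = 500. The helper name `smoothed_idf` is made up for illustration.

```python
import math

def smoothed_idf(n_documents: int, document_frequency: int) -> float:
    """IDF as computed by scikit-learn when smooth_idf=True (the default)."""
    return math.log((1 + n_documents) / (1 + document_frequency)) + 1.0

# 500 descriptions; terms that appear in exactly 3 or 4 of them.
print(round(smoothed_idf(500, 3), 6))  # 5.830312
print(round(smoothed_idf(500, 4), 6))  # 5.607168
```

The value 5.830312 is consistent with a term that occurs in exactly three descriptions, which is the minimum allowed by `min_df=3`. That is why so many terms share the same maximum weight.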

words or phrases with the 100 highest IDF weights (the rarest terms that still meet min_df=3)

In [25]:
top_n_tfidf = tfidf_df.sort_values(by=["description_text_tfidf"], ascending=False).head(100)
top_n_tfidf

Unnamed: 0,description_text_tfidf
slime,5.830312
wearing,5.830312
grey,5.830312
great shape,5.830312
please read,5.830312
...,...
for apple,5.830312
for any,5.830312
for an,5.830312
for all,5.830312


In [26]:
top_n_tfidf.index

Index(['slime', 'wearing', 'grey', 'great shape', 'please read', 'understand',
       'good used condition', 'good used', 'unique', 'polka dot', 'getting',
       'gently used', 'fun', 'update', 'ft', 'store', 'from my closet',
       'storage', 'friendly home', 'friendly', 'stones', 'used no', 'free pet',
       'used to', 'four', 'pro', 'profile', 'for spring', 'for sale',
       'guaranteed', 'had', 'hands', 'help', 'in days', 'in china', 'tried',
       'trim', 'ignored', 'if you want', 'superman', 'if you are',
       'perfect for spring', 'perfectly', 'photo', 'ultra', 'picture is',
       'happy', 'suede', 'head', 'he', 'have been', 'pink logo', 'plastic',
       'has pockets', 'under armour', 'hardware', 'hard to find', 'hard to',
       'stripes', 'for more', 'provide', 'for free shipping', 'during',
       'rarely', 'extremely', 're', 'every', 'read my', 'essential', 'elastic',
       'ready', 'easily', 'easier', 'star', 'during the', 'dunn', 'eyebrows',
       'reflect', 'wa

In [27]:
%%time
def text_data_tfidf_fe(data):
    top_tfidf_word = []
    top_tfidf_value = []
    for description in data["item_description"]:
        # First top-IDF term contained in the description, or None if there is none.
        # Caveat: `in` is a case-sensitive substring test, so "he" also matches "the".
        match = next((word for word in top_n_tfidf.index if word in description), None)
        if match is not None:
            top_tfidf_word.append(match)
            top_tfidf_value.append(float(top_n_tfidf.loc[match, "description_text_tfidf"]))
        else:
            top_tfidf_word.append("None")
            top_tfidf_value.append(0.0)
    data["top_tfidf_word"] = top_tfidf_word
    data["top_tfidf_value"] = top_tfidf_value
    return data[["top_tfidf_word", "top_tfidf_value"]].head()

text_data_tfidf_fe(train_df)
text_data_tfidf_fe(test_df)

CPU times: user 770 ms, sys: 6.42 ms, total: 776 ms
Wall time: 780 ms


Unnamed: 0,top_tfidf_word,top_tfidf_value
0,,0.0
1,ft,5.830312
2,,0.0
3,he,5.830312
4,ft,5.830312


view the data again

In [28]:
train_df.head()

Unnamed: 0,train_id,name,item_condition_id,category_name,brand_name,price,shipping,item_description,general_category,subcategory_1,subcategory_2,char_count,word_count,word_density,top_tfidf_word,top_tfidf_value
0,0,MLB Cincinnati Reds T Shirt Size XL,3,Men/Tops/T-shirts,NoBrand,10.0,1,No description yet,Men,Tops,T-shirts,18,3,4.5,,0.0
1,1,Razer BlackWidow Chroma Keyboard,3,Electronics/Computers & Tablets/Components & P...,Razer,52.0,0,This keyboard is in great condition and works ...,Electronics,Computers & Tablets,Components & Parts,188,36,5.081081,perfectly,5.830312
2,2,AVA-VIV Blouse,1,Women/Tops & Blouses/Blouse,Target,10.0,1,Adorable top with a hint of lace and a key hol...,Women,Tops & Blouses,Blouse,124,29,4.133333,he,5.830312
3,3,Leather Horse Statues,1,Home/Home Décor/Home Décor Accents,NoBrand,35.0,1,New with tags. Leather horses. Retail for [rm]...,Home,Home Décor,Home Décor Accents,173,32,5.242424,storage,5.830312
4,4,24K GOLD plated rose,1,Women/Jewelry/Necklaces,NoBrand,44.0,0,Complete with certificate of authenticity,Women,Jewelry,Necklaces,41,5,6.833333,he,5.830312


In [29]:
test_df.head()

Unnamed: 0,test_id,name,item_condition_id,category_name,brand_name,shipping,item_description,general_category,subcategory_1,subcategory_2,char_count,word_count,word_density,top_tfidf_word,top_tfidf_value
0,0,"Breast cancer ""I fight like a girl"" ring",1,Women/Jewelry/Rings,NoBrand,1,Size 7,Women,Jewelry,Rings,6,2,2.0,,0.0
1,1,"25 pcs NEW 7.5""x12"" Kraft Bubble Mailers",1,Other/Office supplies/Shipping Supplies,NoBrand,1,"25 pcs NEW 7.5""x12"" Kraft Bubble Mailers Lined...",Other,Office supplies,Shipping Supplies,251,38,6.435897,ft,5.830312
2,2,Coach bag,1,Vintage & Collectibles/Bags and Purses/Handbag,Coach,1,Brand new coach bag. Bought for [rm] at a Coac...,Vintage & Collectibles,Bags and Purses,Handbag,55,11,4.583333,,0.0
3,3,Floral Kimono,2,Women/Sweaters/Cardigan,NoBrand,0,-floral kimono -never worn -lightweight and pe...,Women,Sweaters,Cardigan,67,10,6.090909,he,5.830312
4,4,Life after Death,3,Other/Books/Religion & Spirituality,NoBrand,1,Rediscovering life after the loss of a loved o...,Other,Books,Religion & Spirituality,167,29,5.566667,ft,5.830312


checking the shape again

In [30]:
print('Train shape: {}\nTest shape: {}'.format(train_df.shape, test_df.shape))

Train shape: (500, 16)
Test shape: (500, 15)


# Output Data
output data as csv

In [31]:
train_df.to_csv("train_df.csv", index=False)
test_df.to_csv("test_df.csv", index=False)