# [Mercari Price Suggestion Challenge](https://www.kaggle.com/c/mercari-price-suggestion-challenge)
Can you automatically suggest product prices to online sellers?

# Import packages
- pandas
- numpy
- TfidfVectorizer

In [48]:
import pandas as pd #data processing
import numpy as np #linear algebra
from sklearn.feature_extraction.text import TfidfVectorizer #to calculate Tf-idf

# Import data
Load the train and test data. The files are tab-separated (TSV), so the delimiter is a tab.

In [49]:
%%time
train_df = pd.read_csv("data/train.tsv", delimiter="\t", low_memory= True)
test_df = pd.read_csv("data/test.tsv", delimiter="\t", low_memory= True)

CPU times: user 7.44 ms, sys: 4.67 ms, total: 12.1 ms
Wall time: 12.1 ms


In [50]:
train_df.head()

Unnamed: 0,train_id,name,item_condition_id,category_name,brand_name,price,shipping,item_description
0,0,MLB Cincinnati Reds T Shirt Size XL,3,Men/Tops/T-shirts,,10.0,1,No description yet
1,1,Razer BlackWidow Chroma Keyboard,3,Electronics/Computers & Tablets/Components & P...,Razer,52.0,0,This keyboard is in great condition and works ...
2,2,AVA-VIV Blouse,1,Women/Tops & Blouses/Blouse,Target,10.0,1,Adorable top with a hint of lace and a key hol...
3,3,Leather Horse Statues,1,Home/Home Décor/Home Décor Accents,,35.0,1,New with tags. Leather horses. Retail for [rm]...
4,4,24K GOLD plated rose,1,Women/Jewelry/Necklaces,,44.0,0,Complete with certificate of authenticity


In [51]:
train_df.columns

Index(['train_id', 'name', 'item_condition_id', 'category_name', 'brand_name',
       'price', 'shipping', 'item_description'],
      dtype='object')

# Data preprocessing

### converting column types

In [52]:
print(train_df.dtypes)
print("------------")
print(test_df.dtypes)

train_id               int64
name                  object
item_condition_id      int64
category_name         object
brand_name            object
price                float64
shipping               int64
item_description      object
dtype: object
------------
test_id               int64
name                 object
item_condition_id     int64
category_name        object
brand_name           object
shipping              int64
item_description     object
dtype: object


converting column data types to minimize memory

In [53]:
def col_type_conversion(data):
    data["item_condition_id"] = data["item_condition_id"].astype("int32")
    data["shipping"] = data["shipping"].astype("int8")

col_type_conversion(train_df)
col_type_conversion(test_df)
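
To get a feel for what the conversion saves, here is a minimal sketch on a small made-up frame (the `demo` values are illustrative and not taken from the dataset). It compares per-column memory before and after the same downcasts:

```python
import pandas as pd

# Hypothetical toy frame; both columns start as int64 (8 bytes per value).
demo = pd.DataFrame(
    {"item_condition_id": [3, 3, 1, 1, 1], "shipping": [1, 0, 1, 1, 0]},
    dtype="int64",
)
before = demo.memory_usage(index=False)  # bytes per column

# Same downcasts as in col_type_conversion.
demo["item_condition_id"] = demo["item_condition_id"].astype("int32")
demo["shipping"] = demo["shipping"].astype("int8")
after = demo.memory_usage(index=False)

print(before.to_dict())
print(after.to_dict())
```

An int32 value takes 4 bytes and an int8 value takes 1 byte, so the saving grows linearly with the number of rows.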

In [54]:
print(train_df.dtypes)
print("------------")
print(test_df.dtypes)

train_id               int64
name                  object
item_condition_id      int32
category_name         object
brand_name            object
price                float64
shipping                int8
item_description      object
dtype: object
------------
test_id               int64
name                 object
item_condition_id     int32
category_name        object
brand_name           object
shipping               int8
item_description     object
dtype: object


A quick look at the descriptive statistics

In [55]:
print("train_df shape: {}\ntest_df shape: {}".format(train_df.shape, test_df.shape))

train_df shape: (10, 8)
test_df shape: (10, 7)


In [56]:
pd.set_option("float_format", "{:f}".format)
train_df.describe()

Unnamed: 0,train_id,item_condition_id,price,shipping
count,10.0,10.0,10.0,10.0
mean,4.5,2.4,30.7,0.4
std,3.02765,0.966092,22.798879,0.516398
min,0.0,1.0,6.0,0.0
25%,2.25,1.5,10.0,0.0
50%,4.5,3.0,27.0,0.0
75%,6.75,3.0,50.0,1.0
max,9.0,3.0,64.0,1.0


Count the number of unique values in each column

In [57]:
train_df.apply(lambda x: x.nunique())

train_id             10
name                 10
item_condition_id     2
category_name         9
brand_name            5
price                 9
shipping              2
item_description     10
dtype: int64

### checking missing values

In [58]:
print(train_df.isnull().sum()[train_df.isnull().sum() != 0])
print("------------")
print(test_df.isnull().sum()[test_df.isnull().sum() != 0])

brand_name    5
dtype: int64
------------
brand_name    7
dtype: int64


Missing values are handled in category_name, brand_name, and item_description (in the sample output above, only brand_name has any):
- Fill products with no brand name with 'NoBrand'
- Fill products with no category name with 'No/No/No'
- Fill products with no item description with 'No description yet' (the placeholder the dataset already uses, as in the first row)

In [59]:
test_df["item_description"].isnull().sum() != 0

False

In [60]:
def input_missing_values(data):
    if data["brand_name"].isnull().sum() != 0:
        data["brand_name"] = data["brand_name"].fillna("NoBrand")
    if data["category_name"].isnull().sum() != 0:
        data["category_name"] = data["category_name"].fillna("No/No/No")
    if data["item_description"].isnull().sum() != 0:
        data["item_description"] = data["item_description"].fillna("No description yet")
        
input_missing_values(train_df)
input_missing_values(test_df)

In [61]:
print(train_df.isnull().sum()[train_df.isnull().sum() != 0])
print("------------")
print(test_df.isnull().sum()[test_df.isnull().sum() != 0])

Series([], dtype: int64)
------------
Series([], dtype: int64)


# Feature engineering

Extract features from category_name and item_description

split category_name into:
- general_category
- subcategory_1
- subcategory_2
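
As a rough illustration of the three-way split, the sketch below uses plain `str.split` with `maxsplit=2`, which mirrors `n=2` in pandas' `Series.str.split`. The helper name `split_category` and the four-level string `"A/B/C/D"` are made up for this example:

```python
def split_category(category_name):
    # maxsplit=2 produces at most three parts, like n=2 in Series.str.split.
    parts = category_name.split("/", 2)
    # Pad with None when fewer than three levels are present.
    parts += [None] * (3 - len(parts))
    return tuple(parts)

print(split_category("Women/Jewelry/Necklaces"))
print(split_category("A/B/C/D"))   # hypothetical category with four levels
print(split_category("No/No/No"))  # the fill value used for missing categories
```

Because the split is limited to two cuts, any further slashes stay inside the third part instead of creating extra columns.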

In [62]:
def split_category_name(data):
    # split into at most three parts; the local name no longer shadows the function
    parts = data["category_name"].str.split("/", n=2, expand=True)
    data["general_category"] = parts[0]
    data["subcategory_1"] = parts[1]
    data["subcategory_2"] = parts[2]
    return data[["general_category", "subcategory_1", "subcategory_2"]].head()

split_category_name(train_df)
split_category_name(test_df)

Unnamed: 0,general_category,subcategory_1,subcategory_2
0,Women,Jewelry,Rings
1,Other,Office supplies,Shipping Supplies
2,Vintage & Collectibles,Bags and Purses,Handbag
3,Women,Sweaters,Cardigan
4,Other,Books,Religion & Spirituality


dealing with item_description:
<br>[Extensive Text Data Feature Engineering](https://www.kaggle.com/shivamb/extensive-text-data-feature-engineering)
- character length
- word count
- word density
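
A minimal sketch of the three features on single strings, using the same definitions as the `text_data_fe` function in this notebook (word density is the character count divided by the word count plus one). The helper name `text_features` is only for illustration:

```python
def text_features(text):
    char_count = len(text)                        # characters, including spaces
    word_count = len(text.split())                # whitespace-separated tokens
    word_density = char_count / (word_count + 1)  # +1 avoids division by zero
    return char_count, word_count, word_density

print(text_features("Size 7"))
print(text_features("No description yet"))
```

Both strings appear in the sample data, so the results can be compared with the feature columns shown in the notebook output.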

In [63]:
train_df.head()

Unnamed: 0,train_id,name,item_condition_id,category_name,brand_name,price,shipping,item_description,general_category,subcategory_1,subcategory_2
0,0,MLB Cincinnati Reds T Shirt Size XL,3,Men/Tops/T-shirts,NoBrand,10.0,1,No description yet,Men,Tops,T-shirts
1,1,Razer BlackWidow Chroma Keyboard,3,Electronics/Computers & Tablets/Components & P...,Razer,52.0,0,This keyboard is in great condition and works ...,Electronics,Computers & Tablets,Components & Parts
2,2,AVA-VIV Blouse,1,Women/Tops & Blouses/Blouse,Target,10.0,1,Adorable top with a hint of lace and a key hol...,Women,Tops & Blouses,Blouse
3,3,Leather Horse Statues,1,Home/Home Décor/Home Décor Accents,NoBrand,35.0,1,New with tags. Leather horses. Retail for [rm]...,Home,Home Décor,Home Décor Accents
4,4,24K GOLD plated rose,1,Women/Jewelry/Necklaces,NoBrand,44.0,0,Complete with certificate of authenticity,Women,Jewelry,Necklaces


In [64]:
def text_data_fe(data, col):
    data[(col+"_char_count")] = data[col].apply(len)
    data[(col+"_word_count")] = data[col].apply(lambda x: len(x.split()))
    data[(col+"_word_density")] = data[(col+"_char_count")] / (data[(col+"_word_count")]+1)
    return data[[(col+"_char_count"), (col+"_word_count"), (col+"_word_density")]].head()

text_data_fe(train_df, "item_description")
text_data_fe(test_df, "item_description")

Unnamed: 0,item_description_char_count,item_description_word_count,item_description_word_density
0,6,2,2.0
1,251,38,6.435897
2,55,11,4.583333
3,67,10,6.090909
4,167,29,5.566667


In [65]:
text_data_fe(train_df, "name")
text_data_fe(test_df, "name")

Unnamed: 0,name_char_count,name_word_count,name_word_density
0,40,8,4.444444
1,40,7,5.0
2,9,2,3.0
3,13,2,4.333333
4,16,3,4.0


[CountVectorizer, TfidfVectorizer, Predict Comments](https://www.kaggle.com/adamschroeder/countvectorizer-tfidfvectorizer-predict-comments)
<br>[Using TfidfVectorizer output to create columns in a pandas df](https://www.reddit.com/r/learnpython/comments/7aduzh/using_tfidfvectorizer_output_to_create_columns_in/)
<br>[Check if multiple strings exist in another string](https://stackoverflow.com/questions/3389574/check-if-multiple-strings-exist-in-another-string)
- top tfidf word
- top tfidf value

In [66]:
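# stop_words is a list of literal tokens: "english" here is an ordinary word,
# not a request for scikit-learn's built-in English stop-word list.
vectorizer = TfidfVectorizer(min_df=3, max_features=2500, dtype=np.float32,
                             strip_accents="unicode", analyzer="word", ngram_range=(1, 3),
                             stop_words=["english", "rm", "co"])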

In [67]:
%%time
description_text = list(train_df["item_description"].values)
tfidf_matrix = vectorizer.fit_transform(description_text)

CPU times: user 3.15 ms, sys: 545 µs, total: 3.69 ms
Wall time: 3.24 ms


In [68]:
tfidf_matrix

<10x9 sparse matrix of type '<class 'numpy.float32'>'
	with 34 stored elements in Compressed Sparse Row format>

In [69]:
tfidf_matrix.shape

(10, 9)

the learned corpus vocabulary

In [70]:
vectorizer.vocabulary_

{'in': 2,
 'and': 0,
 'of': 3,
 'the': 7,
 'are': 1,
 'with': 8,
 'size': 4,
 'small': 6,
 'size small': 5}
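
The vocabulary above still contains common words such as 'the', 'of', and 'and', which suggests the built-in English stop-word list was not applied. If the intent was to combine that list with the extra tokens "rm" and "co", one possible approach is sketched below. It assumes `ENGLISH_STOP_WORDS` from `sklearn.feature_extraction.text`, and the three-document corpus is made up for illustration:

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer

# Built-in English list plus the two extra tokens used in this notebook.
custom_stop_words = sorted(ENGLISH_STOP_WORDS.union({"rm", "co"}))

# Hypothetical three-document corpus, for illustration only.
docs = ["the bag is small", "small size bag", "the size of the bag"]
vec = TfidfVectorizer(stop_words=custom_stop_words)
vec.fit(docs)
print(sorted(vec.vocabulary_))
```

Applying this to the notebook would change the learned vocabulary and every downstream output, so it is shown here only as an alternative.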

Create a dictionary mapping each token to its IDF weight (`vectorizer.idf_`), stored in the column description_text_tfidf

In [71]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
tfidf = dict(zip(vectorizer.get_feature_names_out(), vectorizer.idf_))
tfidf_df = pd.DataFrame.from_dict(tfidf, orient="index", columns=["description_text_tfidf"])

In [72]:
tfidf_df.shape

(9, 1)

In [73]:
tfidf_df.to_dict()

{'description_text_tfidf': {'and': 1.6061358451843262,
  'are': 2.011600971221924,
  'in': 1.7884573936462402,
  'of': 1.6061358451843262,
  'size': 2.011600971221924,
  'size small': 2.011600971221924,
  'small': 2.011600971221924,
  'the': 1.7884573936462402,
  'with': 1.7884573936462402}}
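
The three distinct values above are consistent with scikit-learn's default smoothed IDF, idf(t) = ln((1 + n) / (1 + df(t))) + 1, for n = 10 documents. The sketch below recomputes them. The document frequencies 5, 4, and 3 are inferred from the values rather than read from the data:

```python
import math

def smooth_idf(n_docs, doc_freq):
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

# Assumed document frequencies; min_df=3 means no term can have df below 3.
for doc_freq in (5, 4, 3):
    print(doc_freq, round(smooth_idf(10, doc_freq), 6))
```

Rarer terms get larger weights, which is why the terms that appear in only three descriptions rank highest.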

Words or phrases with the highest scores. `top_n` is 10% of the number of training rows, which is 1 for this 10-row sample.

In [74]:
top_n = int(round(len(train_df)*0.1))
top_n_tfidf = tfidf_df.sort_values(by=["description_text_tfidf"], ascending=False).head(top_n)
top_n_tfidf

Unnamed: 0,description_text_tfidf
are,2.011601


In [75]:
top_n_tfidf.index

Index(['are'], dtype='object')

In [76]:
%%time
def text_data_tfidf_fe(data, col):
    top_tfidf_word = []
    top_tfidf_value = []
    for text in data[col]:
        # first top-scoring term contained in the text (substring match), else None
        match = next((word for word in top_n_tfidf.index if word in text), None)
        if match is not None:
            top_tfidf_word.append(match)
            top_tfidf_value.append(float(top_n_tfidf.loc[match, "description_text_tfidf"]))
        else:
            top_tfidf_word.append("None")
            top_tfidf_value.append(0.0)
    data[(col+"_top_tfidf_word")] = top_tfidf_word
    data[(col+"_top_tfidf_value")] = top_tfidf_value
    return data[[(col+"_top_tfidf_word"), (col+"_top_tfidf_value")]].head()

text_data_tfidf_fe(train_df, "item_description")
text_data_tfidf_fe(test_df, "item_description")

CPU times: user 6.7 ms, sys: 701 µs, total: 7.4 ms
Wall time: 7.41 ms


Unnamed: 0,item_description_top_tfidf_word,item_description_top_tfidf_value
0,,0.0
1,,0.0
2,,0.0
3,,0.0
4,,0.0


In [77]:
text_data_tfidf_fe(train_df, "name")
text_data_tfidf_fe(test_df, "name")

Unnamed: 0,name_top_tfidf_word,name_top_tfidf_value
0,,0.0
1,,0.0
2,,0.0
3,,0.0
4,,0.0
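
One thing to keep in mind about `text_data_tfidf_fe`: it matches terms with Python's `in` operator, which is a case-sensitive substring test, so a term can match inside a longer word. The sketch below contrasts that with a whole-word match. The helper names and the example sentence are made up for illustration:

```python
import re

def first_substring_match(terms, text):
    # Same test as in text_data_tfidf_fe: `term in text`
    return next((t for t in terms if t in text), None)

def first_word_match(terms, text):
    # Whole-word alternative using regex word boundaries
    for t in terms:
        if re.search(r"\b" + re.escape(t) + r"\b", text):
            return t
    return None

terms = ["are"]
text = "handle with care"  # made-up example text
print(first_substring_match(terms, text))
print(first_word_match(terms, text))
```

Whether whole-word matching is preferable depends on the goal. It is offered here as an option, not as the notebook's method.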


view the data again

In [78]:
train_df.head()

Unnamed: 0,train_id,name,item_condition_id,category_name,brand_name,price,shipping,item_description,general_category,subcategory_1,...,item_description_char_count,item_description_word_count,item_description_word_density,name_char_count,name_word_count,name_word_density,item_description_top_tfidf_word,item_description_top_tfidf_value,name_top_tfidf_word,name_top_tfidf_value
0,0,MLB Cincinnati Reds T Shirt Size XL,3,Men/Tops/T-shirts,NoBrand,10.0,1,No description yet,Men,Tops,...,18,3,4.5,35,7,4.375,,0.0,,0.0
1,1,Razer BlackWidow Chroma Keyboard,3,Electronics/Computers & Tablets/Components & P...,Razer,52.0,0,This keyboard is in great condition and works ...,Electronics,Computers & Tablets,...,188,36,5.081081,32,4,6.4,are,2.011601,,0.0
2,2,AVA-VIV Blouse,1,Women/Tops & Blouses/Blouse,Target,10.0,1,Adorable top with a hint of lace and a key hol...,Women,Tops & Blouses,...,124,29,4.133333,14,2,4.666667,,0.0,,0.0
3,3,Leather Horse Statues,1,Home/Home Décor/Home Décor Accents,NoBrand,35.0,1,New with tags. Leather horses. Retail for [rm]...,Home,Home Décor,...,173,32,5.242424,21,3,5.25,are,2.011601,,0.0
4,4,24K GOLD plated rose,1,Women/Jewelry/Necklaces,NoBrand,44.0,0,Complete with certificate of authenticity,Women,Jewelry,...,41,5,6.833333,20,4,4.0,,0.0,,0.0


In [79]:
test_df.head()

Unnamed: 0,test_id,name,item_condition_id,category_name,brand_name,shipping,item_description,general_category,subcategory_1,subcategory_2,item_description_char_count,item_description_word_count,item_description_word_density,name_char_count,name_word_count,name_word_density,item_description_top_tfidf_word,item_description_top_tfidf_value,name_top_tfidf_word,name_top_tfidf_value
0,0,"Breast cancer ""I fight like a girl"" ring",1,Women/Jewelry/Rings,NoBrand,1,Size 7,Women,Jewelry,Rings,6,2,2.0,40,8,4.444444,,0.0,,0.0
1,1,"25 pcs NEW 7.5""x12"" Kraft Bubble Mailers",1,Other/Office supplies/Shipping Supplies,NoBrand,1,"25 pcs NEW 7.5""x12"" Kraft Bubble Mailers Lined...",Other,Office supplies,Shipping Supplies,251,38,6.435897,40,7,5.0,,0.0,,0.0
2,2,Coach bag,1,Vintage & Collectibles/Bags and Purses/Handbag,Coach,1,Brand new coach bag. Bought for [rm] at a Coac...,Vintage & Collectibles,Bags and Purses,Handbag,55,11,4.583333,9,2,3.0,,0.0,,0.0
3,3,Floral Kimono,2,Women/Sweaters/Cardigan,NoBrand,0,-floral kimono -never worn -lightweight and pe...,Women,Sweaters,Cardigan,67,10,6.090909,13,2,4.333333,,0.0,,0.0
4,4,Life after Death,3,Other/Books/Religion & Spirituality,NoBrand,1,Rediscovering life after the loss of a loved o...,Other,Books,Religion & Spirituality,167,29,5.566667,16,3,4.0,,0.0,,0.0


checking the shape again

In [80]:
print('Train shape: {}\nTest shape: {}'.format(train_df.shape, test_df.shape))

Train shape: (10, 21)
Test shape: (10, 20)


# Output Data
Write the processed data to CSV files

In [81]:
train_df.to_csv("train_df.csv", index=False)
test_df.to_csv("test_df.csv", index=False)