# [Mercari Price Suggestion Challenge](https://www.kaggle.com/c/mercari-price-suggestion-challenge)
Can you automatically suggest product prices to online sellers?

# Import packages
- pandas
- numpy
- TfidfVectorizer

In [1]:
import pandas as pd #data processing
import numpy as np #linear algebra
from sklearn.feature_extraction.text import TfidfVectorizer #to calculate Tf-idf

# Import data
Load the train and test data. The files are tab-separated (TSV), so the delimiter is set to `\t`.

In [2]:
%%time
train_df = pd.read_csv("data/train.tsv", delimiter="\t", low_memory=True)
test_df = pd.read_csv("data/test.tsv", delimiter="\t", low_memory=True)

CPU times: user 6.65 ms, sys: 3.86 ms, total: 10.5 ms
Wall time: 10.2 ms


In [3]:
train_df.head()

Unnamed: 0,train_id,name,item_condition_id,category_name,brand_name,price,shipping,item_description
0,0,MLB Cincinnati Reds T Shirt Size XL,3,Men/Tops/T-shirts,,10.0,1,No description yet
1,1,Razer BlackWidow Chroma Keyboard,3,Electronics/Computers & Tablets/Components & P...,Razer,52.0,0,This keyboard is in great condition and works ...
2,2,AVA-VIV Blouse,1,Women/Tops & Blouses/Blouse,Target,10.0,1,Adorable top with a hint of lace and a key hol...
3,3,Leather Horse Statues,1,Home/Home Décor/Home Décor Accents,,35.0,1,New with tags. Leather horses. Retail for [rm]...
4,4,24K GOLD plated rose,1,Women/Jewelry/Necklaces,,44.0,0,Complete with certificate of authenticity


In [4]:
train_df.columns

Index(['train_id', 'name', 'item_condition_id', 'category_name', 'brand_name',
       'price', 'shipping', 'item_description'],
      dtype='object')

# Data preprocessing

### converting column types

In [5]:
print(train_df.dtypes)
print("------------")
print(test_df.dtypes)

train_id               int64
name                  object
item_condition_id      int64
category_name         object
brand_name            object
price                float64
shipping               int64
item_description      object
dtype: object
------------
test_id               int64
name                 object
item_condition_id     int64
category_name        object
brand_name           object
shipping              int64
item_description     object
dtype: object


Convert the integer columns to smaller dtypes to reduce memory usage.

In [6]:
train_df["item_condition_id"] = train_df["item_condition_id"].astype("int32")
train_df["shipping"] = train_df["shipping"].astype("int8")

test_df["item_condition_id"] = test_df["item_condition_id"].astype("int32")
test_df["shipping"] = test_df["shipping"].astype("int8")
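To see what the conversion saves, the memory footprint can be compared before and after. The sketch below uses a small made-up frame (`demo_df` is a hypothetical stand-in, not the competition data) and applies the same two casts as the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data: 1,000 rows of two int64 columns.
demo_df = pd.DataFrame({
    "item_condition_id": np.tile(np.array([1, 2, 3, 4, 5], dtype="int64"), 200),
    "shipping": np.tile(np.array([0, 1], dtype="int64"), 500),
})

# Bytes used by the column data (index excluded).
bytes_before = demo_df.memory_usage(index=False).sum()

# Same casts as in the cell above.
demo_df["item_condition_id"] = demo_df["item_condition_id"].astype("int32")
demo_df["shipping"] = demo_df["shipping"].astype("int8")

bytes_after = demo_df.memory_usage(index=False).sum()
print(bytes_before, bytes_after)  # 16000 5000
```

The object (string) columns usually dominate memory in a frame like this one, so the savings on the real data are likely to be modest; `memory_usage(deep=True)` would include them in the measurement.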

In [7]:
print(train_df.dtypes)
print("------------")
print(test_df.dtypes)

train_id               int64
name                  object
item_condition_id      int32
category_name         object
brand_name            object
price                float64
shipping                int8
item_description      object
dtype: object
------------
test_id               int64
name                 object
item_condition_id     int32
category_name        object
brand_name           object
shipping               int8
item_description     object
dtype: object


Explore the data with some descriptive statistics.

In [8]:
print("train_df shape: {}\ntest_df shape: {}".format(train_df.shape, test_df.shape))

train_df shape: (10, 8)
test_df shape: (10, 7)


In [9]:
pd.set_option("float_format", "{:f}".format)
train_df.describe()

Unnamed: 0,train_id,item_condition_id,price,shipping
count,10.0,10.0,10.0,10.0
mean,4.5,2.4,30.7,0.4
std,3.02765,0.966092,22.798879,0.516398
min,0.0,1.0,6.0,0.0
25%,2.25,1.5,10.0,0.0
50%,4.5,3.0,27.0,0.0
75%,6.75,3.0,50.0,1.0
max,9.0,3.0,64.0,1.0


Count the unique values in each column.

In [10]:
train_df.apply(lambda x: x.nunique())

train_id             10
name                 10
item_condition_id     2
category_name         9
brand_name            5
price                 9
shipping              2
item_description     10
dtype: int64

### checking missing values

In [11]:
print(train_df.isnull().sum()[train_df.isnull().sum() != 0])
print("------------")
print(test_df.isnull().sum()[test_df.isnull().sum() != 0])

brand_name    5
dtype: int64
------------
brand_name    7
dtype: int64
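The check above calls `isnull().sum()` twice per frame. A small helper can compute it once; `missing_counts` and `demo_df` are names introduced here for illustration only.

```python
import pandas as pd

def missing_counts(df: pd.DataFrame) -> pd.Series:
    """Return the number of missing values per column, omitting complete columns."""
    counts = df.isnull().sum()
    return counts[counts != 0]

# Hypothetical two-column frame with two missing brand names.
demo_df = pd.DataFrame({
    "name": ["a", "b", "c"],
    "brand_name": ["Razer", None, None],
})
print(missing_counts(demo_df))
```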


Missing values can occur in category_name, brand_name, and item_description (in the output above only brand_name is affected).
- Fill products with no brand name with 'NoBrand'
- Fill products with no category name with 'No/No/No'
- Fill products with no item description with 'No description yet' (the placeholder already used in the data, as in the first row)

In [12]:
train_df["brand_name"] = train_df["brand_name"].fillna("NoBrand")
test_df["brand_name"] = test_df["brand_name"].fillna("NoBrand")

train_df["category_name"] = train_df["category_name"].fillna("No/No/No")
test_df["category_name"] = test_df["category_name"].fillna("No/No/No")

train_df["item_description"] = train_df["item_description"].fillna("No description yet")
test_df["item_description"] = test_df["item_description"].fillna("No description yet")
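The same fills can also be expressed as one mapping passed to `DataFrame.fillna`, which keeps train and test in sync. This is only a sketch: `FILL_VALUES` and `demo_df` are hypothetical names, and the fill strings are the ones listed above.

```python
import pandas as pd

# Column -> replacement value, taken from the bullets above.
FILL_VALUES = {
    "brand_name": "NoBrand",
    "category_name": "No/No/No",
    "item_description": "No description yet",
}

# Hypothetical frame with one missing value in each column.
demo_df = pd.DataFrame({
    "brand_name": ["Razer", None],
    "category_name": [None, "Women/Jewelry/Necklaces"],
    "item_description": ["Works well", None],
})

# fillna with a dict fills each listed column with its own value.
filled_df = demo_df.fillna(FILL_VALUES)
print(filled_df)
```

Applying `fillna(FILL_VALUES)` to both frames would avoid forgetting one column in one of them.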

In [13]:
print(train_df.isnull().sum()[train_df.isnull().sum() != 0])
print("------------")
print(test_df.isnull().sum()[test_df.isnull().sum() != 0])

Series([], dtype: int64)
------------
Series([], dtype: int64)


# Feature engineering

Extract features from category_name and item_description.

split category_name into:
- general_category
- subcategory_1
- subcategory_2

In [14]:
split_category_name = train_df["category_name"].str.split("/", n = 2, expand = True)
split_category_name

Unnamed: 0,0,1,2
0,Men,Tops,T-shirts
1,Electronics,Computers & Tablets,Components & Parts
2,Women,Tops & Blouses,Blouse
3,Home,Home Décor,Home Décor Accents
4,Women,Jewelry,Necklaces
5,Women,Other,Other
6,Women,Swimwear,Two-Piece
7,Sports & Outdoors,Apparel,Girls
8,Sports & Outdoors,Apparel,Girls
9,Vintage & Collectibles,Collectibles,Doll


In [15]:
train_df["general_category"] = split_category_name[0]
train_df["subcategory_1"] = split_category_name[1]
train_df["subcategory_2"] = split_category_name[2]

In [16]:
split_category_name_2 = test_df["category_name"].str.split("/", n = 2, expand = True)
split_category_name_2

Unnamed: 0,0,1,2
0,Women,Jewelry,Rings
1,Other,Office supplies,Shipping Supplies
2,Vintage & Collectibles,Bags and Purses,Handbag
3,Women,Sweaters,Cardigan
4,Other,Books,Religion & Spirituality
5,Electronics,Cell Phones & Accessories,"Cases, Covers & Skins"
6,Women,Jewelry,Necklaces
7,Women,Women's Accessories,Watches
8,Beauty,Fragrance,Women
9,Beauty,Tools & Accessories,Makeup Brushes & Tools


In [17]:
test_df["general_category"] = split_category_name_2[0]
test_df["subcategory_1"] = split_category_name_2[1]
test_df["subcategory_2"] = split_category_name_2[2]
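The split is repeated for train and test, so it could be wrapped in a function. The following is a sketch under that assumption; `add_category_columns`, `CATEGORY_COLUMNS`, and the demo rows (including the four-level string "A/B/C/D") are made up for illustration.

```python
import pandas as pd

CATEGORY_COLUMNS = ["general_category", "subcategory_1", "subcategory_2"]

def add_category_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with category_name split into three columns."""
    # n=2 splits on the first two slashes only, so anything after the
    # second slash stays together in the third part.
    parts = df["category_name"].str.split("/", n=2, expand=True)
    # Guarantee three columns even if no row contains two slashes.
    parts = parts.reindex(columns=range(3))
    out = df.copy()
    for position, column in enumerate(CATEGORY_COLUMNS):
        out[column] = parts[position]
    return out

demo_df = pd.DataFrame(
    {"category_name": ["Men/Tops/T-shirts", "No/No/No", "A/B/C/D"]}
)
result = add_category_columns(demo_df)
print(result)
```

With this, `train_df = add_category_columns(train_df)` and the same call on `test_df` would replace the four cells above.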

dealing with item_description:
<br>[Extensive Text Data Feature Engineering](https://www.kaggle.com/shivamb/extensive-text-data-feature-engineering)
- character length
- word count
- word density
- punctuation count

In [18]:
train_df["char_count"] = train_df["item_description"].apply(len)
train_df["word_count"] = train_df["item_description"].apply(lambda x: len(x.split()))
train_df["word_density"] = train_df["char_count"] / (train_df["word_count"]+1)

test_df["char_count"] = test_df["item_description"].apply(len)
test_df["word_count"] = test_df["item_description"].apply(lambda x: len(x.split()))
test_df["word_density"] = test_df["char_count"] / (test_df["word_count"]+1)
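Punctuation count is listed above but not computed in the cell. One possible way to add it is sketched below, assuming that "punctuation" means the ASCII characters in Python's `string.punctuation`; `count_punctuation` and `demo_df` are hypothetical names.

```python
import string

import pandas as pd

# ASCII punctuation characters, e.g. . , ! ? [ ] ( )
PUNCTUATION = set(string.punctuation)

def count_punctuation(text: str) -> int:
    """Count the characters of text that are ASCII punctuation."""
    return sum(1 for character in text if character in PUNCTUATION)

demo_df = pd.DataFrame(
    {"item_description": ["No description yet", "New with tags. Retail for [rm]!"]}
)
demo_df["punctuation_count"] = demo_df["item_description"].apply(count_punctuation)
print(demo_df["punctuation_count"].tolist())  # [0, 4]
```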

[CountVectorizer, TfidfVectorizer, Predict Comments](https://www.kaggle.com/adamschroeder/countvectorizer-tfidfvectorizer-predict-comments)
<br>[Using TfidfVectorizer output to create columns in a pandas df](https://www.reddit.com/r/learnpython/comments/7aduzh/using_tfidfvectorizer_output_to_create_columns_in/)
<br>[Check if multiple strings exist in another string](https://stackoverflow.com/questions/3389574/check-if-multiple-strings-exist-in-another-string)
- top tfidf word
- top tfidf value

In [19]:
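# Note: a collection passed to stop_words is used as the literal stop-word list,
# so "english" here is just one more token to drop. It does not enable the
# built-in English list, which is why "in", "and", "the" stay in the vocabulary.
vectorizer = TfidfVectorizer(min_df=3, max_features=2500, dtype=np.float32,
                             strip_accents="unicode", analyzer="word", ngram_range=(1, 3),
                             stop_words=["english", "rm", "co"])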
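The `stop_words` argument above lists "english" next to "rm" and "co". If the intent was to drop the built-in English stop words in addition to those two tokens, one way to do it is to merge them with scikit-learn's `ENGLISH_STOP_WORDS`. This is a sketch of that alternative, not what the notebook ran; the demo sentences and variable names are made up.

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer

# Built-in English stop words plus the two dataset-specific tokens.
custom_stop_words = list(ENGLISH_STOP_WORDS.union({"rm", "co"}))

vectorizer_with_english = TfidfVectorizer(stop_words=custom_stop_words)

# Hypothetical descriptions, only for demonstration.
demo_descriptions = [
    "This keyboard is in great condition",
    "New with tags and the original box",
    "Retail for rm in great condition",
]
vectorizer_with_english.fit(demo_descriptions)
vocabulary = set(vectorizer_with_english.vocabulary_)
print(sorted(vocabulary))
```

With the merged list, function words such as "in", "and", and "the" no longer enter the vocabulary, so the ranked terms would change compared with the outputs shown in this notebook.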

In [20]:
%%time
description_text = list(train_df["item_description"].values)
tfidf_matrix = vectorizer.fit_transform(description_text)

CPU times: user 4.52 ms, sys: 1.09 ms, total: 5.6 ms
Wall time: 5.3 ms


In [21]:
tfidf_matrix

<10x9 sparse matrix of type '<class 'numpy.float32'>'
	with 34 stored elements in Compressed Sparse Row format>

In [22]:
tfidf_matrix.shape

(10, 9)

the learned corpus vocabulary

In [23]:
vectorizer.vocabulary_

{'in': 2,
 'and': 0,
 'of': 3,
 'the': 7,
 'are': 1,
 'with': 8,
 'size': 4,
 'small': 6,
 'size small': 5}

Create a dictionary mapping each token to its IDF weight (`vectorizer.idf_` holds corpus-level inverse document frequencies, not per-document tf-idf scores).

In [24]:
# idf_ is aligned with the feature names; on scikit-learn < 1.0 use get_feature_names()
tfidf = dict(zip(vectorizer.get_feature_names_out(), vectorizer.idf_))
tfidf_df = pd.DataFrame.from_dict(tfidf, orient="index", columns=["description_text_tfidf"])

In [25]:
tfidf_df.shape

(9, 1)

In [26]:
tfidf_df.to_dict()

{'description_text_tfidf': {'and': 1.6061358451843262,
  'are': 2.011600971221924,
  'in': 1.7884573936462402,
  'of': 1.6061358451843262,
  'size': 2.011600971221924,
  'size small': 2.011600971221924,
  'small': 2.011600971221924,
  'the': 1.7884573936462402,
  'with': 1.7884573936462402}}

The 100 highest-scoring words or phrases (this sample has only 9 terms in total)

In [27]:
top_n_tfidf = tfidf_df.sort_values(by=["description_text_tfidf"], ascending=False).head(100)
top_n_tfidf

Unnamed: 0,description_text_tfidf
are,2.011601
size,2.011601
size small,2.011601
small,2.011601
in,1.788457
the,1.788457
with,1.788457
and,1.606136
of,1.606136


In [28]:
top_n_tfidf.index

Index(['are', 'size', 'size small', 'small', 'in', 'the', 'with', 'and', 'of'], dtype='object')

In [29]:
%%time
top_tfidf_word = []
top_tfidf_value = []
for description in train_df["item_description"]:
    # First term of the ranked list that occurs in the description.
    # Note: `in` is a substring test, so "in" also matches inside "shipping".
    match = next((word for word in top_n_tfidf.index if word in description), None)
    if match is not None:
        top_tfidf_word.append(match)
        top_tfidf_value.append(float(top_n_tfidf.loc[match, "description_text_tfidf"]))
    else:
        # No ranked term found: use placeholder values.
        top_tfidf_word.append("None")
        top_tfidf_value.append(0.0)

CPU times: user 2.69 ms, sys: 63 µs, total: 2.76 ms
Wall time: 2.79 ms


In [30]:
%%time
top_tfidf_word_2 = []
top_tfidf_value_2 = []
for description in test_df["item_description"]:
    # First term of the ranked list that occurs in the description.
    # Note: `in` is a substring test, so "in" also matches inside "shipping".
    match = next((word for word in top_n_tfidf.index if word in description), None)
    if match is not None:
        top_tfidf_word_2.append(match)
        top_tfidf_value_2.append(float(top_n_tfidf.loc[match, "description_text_tfidf"]))
    else:
        # No ranked term found: use placeholder values.
        top_tfidf_word_2.append("None")
        top_tfidf_value_2.append(0.0)

CPU times: user 2.21 ms, sys: 41 µs, total: 2.26 ms
Wall time: 2.27 ms


In [31]:
train_df["top_tfidf_word"] = top_tfidf_word
train_df["top_tfidf_value"] = top_tfidf_value

test_df["top_tfidf_word"] = top_tfidf_word_2
test_df["top_tfidf_value"] = top_tfidf_value_2
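The loops above pick, for each description, the first term of the ranked list that occurs in it, and attach that term's corpus-level weight. A different option, shown here only as a sketch and not as the notebook's method, is to read the highest per-document score directly from the tf-idf matrix. All names and the demo texts below are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical descriptions; the last one is empty on purpose.
demo_descriptions = [
    "size small shirt",
    "size large shirt",
    "",
]
demo_vectorizer = TfidfVectorizer()
demo_matrix = demo_vectorizer.fit_transform(demo_descriptions)

# Terms ordered by column index (works regardless of scikit-learn version).
terms = np.array(
    sorted(demo_vectorizer.vocabulary_, key=demo_vectorizer.vocabulary_.get)
)

# Dense conversion is fine for a tiny demo; keep the matrix sparse for full data.
dense = demo_matrix.toarray()
top_index = dense.argmax(axis=1)   # column of the highest score in each row
top_value = dense.max(axis=1)      # the highest score itself
# Rows without any known term have an all-zero score vector.
top_word = np.where(top_value > 0, terms[top_index], "None")
print(top_word.tolist())  # ['small', 'large', 'None']
```

Because the scores come from the matrix, this variant matches whole tokens as produced by the vectorizer rather than substrings, and the value reflects the individual description instead of the corpus as a whole.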

view the data again

In [32]:
train_df.head()

Unnamed: 0,train_id,name,item_condition_id,category_name,brand_name,price,shipping,item_description,general_category,subcategory_1,subcategory_2,char_count,word_count,word_density,top_tfidf_word,top_tfidf_value
0,0,MLB Cincinnati Reds T Shirt Size XL,3,Men/Tops/T-shirts,NoBrand,10.0,1,No description yet,Men,Tops,T-shirts,18,3,4.5,,0.0
1,1,Razer BlackWidow Chroma Keyboard,3,Electronics/Computers & Tablets/Components & P...,Razer,52.0,0,This keyboard is in great condition and works ...,Electronics,Computers & Tablets,Components & Parts,188,36,5.081081,are,2.011601
2,2,AVA-VIV Blouse,1,Women/Tops & Blouses/Blouse,Target,10.0,1,Adorable top with a hint of lace and a key hol...,Women,Tops & Blouses,Blouse,124,29,4.133333,in,1.788457
3,3,Leather Horse Statues,1,Home/Home Décor/Home Décor Accents,NoBrand,35.0,1,New with tags. Leather horses. Retail for [rm]...,Home,Home Décor,Home Décor Accents,173,32,5.242424,are,2.011601
4,4,24K GOLD plated rose,1,Women/Jewelry/Necklaces,NoBrand,44.0,0,Complete with certificate of authenticity,Women,Jewelry,Necklaces,41,5,6.833333,the,1.788457


In [33]:
test_df.head()

Unnamed: 0,test_id,name,item_condition_id,category_name,brand_name,shipping,item_description,general_category,subcategory_1,subcategory_2,char_count,word_count,word_density,top_tfidf_word,top_tfidf_value
0,0,"Breast cancer ""I fight like a girl"" ring",1,Women/Jewelry/Rings,NoBrand,1,Size 7,Women,Jewelry,Rings,6,2,2.0,,0.0
1,1,"25 pcs NEW 7.5""x12"" Kraft Bubble Mailers",1,Other/Office supplies/Shipping Supplies,NoBrand,1,"25 pcs NEW 7.5""x12"" Kraft Bubble Mailers Lined...",Other,Office supplies,Shipping Supplies,251,38,6.435897,in,1.788457
2,2,Coach bag,1,Vintage & Collectibles/Bags and Purses/Handbag,Coach,1,Brand new coach bag. Bought for [rm] at a Coac...,Vintage & Collectibles,Bags and Purses,Handbag,55,11,4.583333,and,1.606136
3,3,Floral Kimono,2,Women/Sweaters/Cardigan,NoBrand,0,-floral kimono -never worn -lightweight and pe...,Women,Sweaters,Cardigan,67,10,6.090909,the,1.788457
4,4,Life after Death,3,Other/Books/Religion & Spirituality,NoBrand,1,Rediscovering life after the loss of a loved o...,Other,Books,Religion & Spirituality,167,29,5.566667,in,1.788457


checking the shape again

In [34]:
print('Train shape: {}\nTest shape: {}'.format(train_df.shape, test_df.shape))

Train shape: (10, 16)
Test shape: (10, 15)


# Output Data
output data as csv

In [35]:
train_df.to_csv("data/train_df.csv", index=False)
test_df.to_csv("data/test_df.csv", index=False)