# Mercari - Exploratory Data Analysis

This is an initial analysis made on the Mercari dataset.  

In [None]:
import pandas as pd
import numpy as np
import missingno as mso
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

In [None]:
train = pd.read_csv("../input/train.tsv", sep="\t")
test = pd.read_csv("../input/test.tsv", sep="\t")

In [None]:
print("train: {:,} rows; {} columns".format(train.shape[0], train.shape[1]))
print("test: {:,} rows; {} columns".format(test.shape[0], test.shape[1]))

In [None]:
train.info()

In [None]:
test.info()

In [None]:
train.head()

In [None]:
test.head()

**First observations**  
There are not a lot of features to start with. They are mostly text.   
`item_condition_id` is the condition of the item as provided by the seller. We only have a categorical numeric value.  
`category_name` has three levels. We can split this feature into three new ones.  
`brand_name` seems to have a lot of missing values. We'll check this on the entire set.  
The `name` and `item_description` are text typed by the seller. Those will require some NLP.

## Missing values

In [None]:
train.isnull().sum()

In [None]:
test.isnull().sum()

In [None]:
mso.matrix(train.drop(["train_id", "price"], axis=1))
mso.matrix(test.drop("test_id", axis=1))

Except for `brand_name`, there are not a lot of missing values in this dataset.  
We can try to extract the brand from the item name and description to fill the missing values. This could be difficult and will most likely require some external data (a list of brands). However, the fact that the brand is missing could itself be a useful signal for predicting the price.
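
A minimal sketch of that idea, assuming a small hypothetical brand list and toy rows instead of the real data. The `known_brands` list and the `guess_brand` helper are illustrative names, not part of the dataset or of any library. The flag for the originally missing brand is kept so that this information is not lost when the gaps are filled.

```python
import pandas as pd

# Hypothetical brand list; in practice it could come from the non-null
# values of `brand_name` or from an external source.
known_brands = ["Nike", "Apple", "Victoria's Secret"]

def guess_brand(name, brands):
    """Return the first known brand found in `name` (case-insensitive), else None."""
    lowered = name.lower()
    for brand in brands:
        if brand.lower() in lowered:
            return brand
    return None

# Toy rows standing in for the real data.
toy = pd.DataFrame({
    "name": ["Nike running shoes", "apple iphone case", "Handmade scarf"],
    "brand_name": ["Nike", None, None],
})

# Keep a flag for the originally missing brand before filling anything.
toy["brand_missing"] = toy["brand_name"].isnull().astype(int)

# Fill only the missing brands with the guess taken from the name.
guessed = toy["name"].apply(lambda n: guess_brand(n, known_brands))
toy["brand_filled"] = toy["brand_name"].fillna(guessed)
```

Plain substring matching is crude (for example, "apple" also matches "pineapple"), so a real version would probably need word boundaries and a curated brand list.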

## Target
Let's now look at the target.  
This is the price at which the item has been sold.

In [None]:
train["price"].describe()

The distribution of the prices is very skewed. They range from \$0 to about \$2,000, but the median is only \$17.  
I'll look at the distribution of log10(price + 1).

In [None]:
price_log = np.log10(train["price"] + 1)
sns.distplot(price_log)
plt.show()

We can see a peak at \$10. Half of the prices are between \$10 and \$30.
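
As a small illustration of the transform used above, here is a sketch with made-up prices (not values from the dataset). Adding 1 before taking the logarithm keeps a price of 0 defined, and the transform can be inverted to get back to dollars.

```python
import numpy as np

# Illustrative prices only, chosen so that the logs are round numbers.
prices = np.array([0.0, 9.0, 99.0, 999.0])

log_prices = np.log10(prices + 1)   # same transform as in the cell above
restored = 10 ** log_prices - 1     # inverse transform, back to the price scale
```

On this scale, a value of 1 corresponds to a price of \$9 and a value of 2 to a price of \$99.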

Next, I check whether the data is shuffled. I plot the prices ordered by id and look for any pattern.  

In [None]:
fig = plt.figure(figsize=(10,10))
plt.scatter(train["train_id"], np.log10(train["price"] + 1), s=1, alpha=0.5)
plt.show()

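There doesn't seem to be any special order, so the data looks shuffled. We can, however, see clear horizontal lines, which suggests that some prices are much more common than others. 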
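
The visual check can be complemented with a number. The sketch below is one possible approach, not something taken from the original analysis: it computes a rank (Spearman-style) correlation between the id and the price on synthetic data. The `order_correlation` helper is a hypothetical name. A value close to 0 is consistent with shuffled rows, while a value close to 1 or -1 would indicate that the rows are sorted by price.

```python
import numpy as np
import pandas as pd

def order_correlation(ids, values):
    """Rank correlation between the row id and a value column."""
    return ids.rank().corr(values.rank())

# Synthetic data: random prices assigned to consecutive ids.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "train_id": np.arange(1000),
    "price": rng.integers(3, 100, size=1000).astype(float),
})
shuffled_corr = order_correlation(toy["train_id"], toy["price"])

# The same prices, but with ids assigned in order of increasing price.
sorted_toy = toy.sort_values("price").reset_index(drop=True)
sorted_toy["train_id"] = np.arange(len(sorted_toy))
sorted_corr = order_correlation(sorted_toy["train_id"], sorted_toy["price"])
```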


## Features

### Name
This is the name of the product entered by the seller.  
There are no missing values in this column. The name may be a mandatory field.

In [None]:
train["name_length"] = train["name"].str.len()
test["name_length"] = test["name"].str.len()

plt.figure(figsize=(7,7))
sns.distplot(train["name_length"], label="train")
sns.distplot(test["name_length"], label="test")
plt.legend()
plt.show()

The maximum length of the `name` field seems to be 40 characters, but there are some records beyond this. 
The distribution is the same in the train and test sets.

In [None]:
train[train["name"].str.len() > 40]

The lines with more than 40 characters seem to have the `[rm]` mark. The name has been modified to hide a price. Let's verify this.

In [None]:
def has_price(name):
    """Return 1 if the name contains the `[rm]` tag that replaces a price, else 0."""
    return int("[rm]" in name)

train["has_price"] = train["name"].apply(has_price)
test["has_price"] = test["name"].apply(has_price)

In [None]:
pd.pivot_table(
    data=train[train["name_length"] > 40],
    values="name",
    index=["name_length", "has_price"],
    aggfunc=lambda x: len(x.unique())
)

In [None]:
pd.pivot_table(
    data=test[test["name_length"] > 40],
    values="name",
    index=["name_length", "has_price"],
    aggfunc=lambda x: len(x.unique())
)

All the names that are more than 40 characters long have a price tag. They should have been at most 40 characters long before preprocessing.  
The fields at 43 and 44 characters are strange. The _[rm]_ tag has 4 characters that substitute the price. If we substitute _$1_ with _[rm]_, we should get only 2 extra characters, so the length of the name shouldn't be more than 42.  
Let's look at those items.

In [None]:
train.loc[train["name_length"] >= 43, ["name", "name_length"]]

Those long fields had multiple price indications. The number of price tags could also help in our task.  
I'll look later if there is any relationship with the price.  
First, let's go back to the name length and see if it's related to the price.

In [None]:
plt.figure(figsize=(7,7))
plt.scatter(train["name_length"], np.log(train["price"] + 1), s=1, alpha=0.5)
plt.xlabel("name length")
plt.ylabel("log price")
plt.show()

It seems that expensive items tend to have longer names.  
Also, the items with more than 40 characters are cheaper.
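
The number of price tags per name, mentioned earlier as a possible feature, could be counted along these lines. This is a sketch on made-up names, and `rm_count` is a hypothetical variable name. It also spells out the length arithmetic: replacing a two-character price such as `$1` with `[rm]` adds two characters.

```python
import pandas as pd

# Made-up names for illustration.
toy_names = pd.Series([
    "Blue dress",
    "Bundle for [rm]",
    "2 shirts [rm] each or both for [rm]",
])

# `[` and `]` are regex metacharacters, so they are escaped in the pattern.
rm_count = toy_names.str.count(r"\[rm\]")

# Extra characters introduced when "$1" is replaced by "[rm]".
extra_chars = len("[rm]") - len("$1")
```

A name with several replaced prices can therefore grow by more than two characters, which would explain lengths above 42.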

## Condition id
This is a categorical variable.  
This feature tells us if the item is in good condition or not.  
I expect this to have an influence on the price. 

In [None]:
sns.distplot(train["item_condition_id"], label="train")
sns.distplot(test["item_condition_id"], label="test")
plt.legend()
plt.show()

We also have a very similar distribution between the train and test sets.

In [None]:
plt.scatter(train["item_condition_id"], np.log(train["price"] + 1), s=1, alpha=0.3)
plt.xlabel("item condition id")
plt.ylabel("log price")

In [None]:
# One series of log prices per condition id (1 to 5).
# The default box positions 1..5 match the condition ids.
condition_price = [
    np.log(train.loc[train["item_condition_id"] == c, "price"] + 1)
    for c in range(1, 6)
]

plt.boxplot(condition_price, showmeans=True)
plt.xlabel("item condition id")
plt.ylabel("log price")
plt.show()

It seems that the price decreases as the condition id increases. 1 should be for an item in good condition and 5 for an item in bad condition.  
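
The same comparison can be read from a small table instead of a plot. The sketch below uses made-up prices, so the numbers themselves mean nothing. It only shows the `groupby` pattern that could be applied to `train`.

```python
import pandas as pd

# Made-up rows for illustration; not values from the dataset.
toy = pd.DataFrame({
    "item_condition_id": [1, 1, 2, 3, 3, 5],
    "price": [30.0, 20.0, 18.0, 15.0, 11.0, 6.0],
})

# Number of items, median price and mean price per condition id.
summary = toy.groupby("item_condition_id")["price"].agg(["count", "median", "mean"])
```

The median is usually the safer statistic here because the price distribution is skewed.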

## Category name
The category name has 3 levels separated by a /.  
I'll split it into 3 distinct categories.  
But first, I check that all categories in the test set exist in the train set.

In [None]:
train_categories = set(train["category_name"].unique())
test_categories = set(test["category_name"].unique())
test_categories - train_categories

Some categories are present in the test set but not in the train set. 

In [None]:
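def get_category(name, level):
    """Return the given level (1 to 3) of a category name, or "Missing"."""
    try:
        cat = name.split("/")[level - 1]
    except (AttributeError, IndexError):
        # AttributeError: the name is missing (NaN); IndexError: not enough levels
        cat = "Missing"

    return cat

for c in range(3):
    train["category_" + str(c + 1)] = train["category_name"].apply(lambda x: get_category(x, c + 1))
    test["category_" + str(c + 1)] = test["category_name"].apply(lambda x: get_category(x, c + 1))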
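
A vectorized alternative to the row-by-row split could look like the sketch below, shown on toy rows. It is not strictly equivalent to `get_category`: with `n=2`, any name with more than three levels would keep the remainder in the third column instead of dropping it.

```python
import pandas as pd

# Toy rows, including a missing category name.
toy = pd.DataFrame({
    "category_name": ["Men/Tops/T-shirts", None, "Women/Jewelry/Necklaces"],
})

# Fill missing names first, then split on "/" at most twice (three columns).
levels = (
    toy["category_name"]
    .fillna("Missing/Missing/Missing")
    .str.split("/", n=2, expand=True)
)
levels.columns = ["category_1", "category_2", "category_3"]
toy = toy.join(levels)
```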

In [None]:
print("train:")
print("category 1: {} items".format(len(train["category_1"].unique())))
print("category 2: {} items".format(len(train["category_2"].unique())))
print("category 3: {} items".format(len(train["category_3"].unique())))

print("\ntest:")
print("category 1: {} items".format(len(test["category_1"].unique())))
print("category 2: {} items".format(len(test["category_2"].unique())))
print("category 3: {} items".format(len(test["category_3"].unique())))

In [None]:
train_category_1 = set(train["category_1"].unique())
train_category_2 = set(train["category_2"].unique())
train_category_3 = set(train["category_3"].unique())

test_category_1 = set(test["category_1"].unique())
test_category_2 = set(test["category_2"].unique())
test_category_3 = set(test["category_3"].unique())

print("category 1 differences {}".format(len(test_category_1 - train_category_1)))
print("category 2 differences {}".format(len(test_category_2 - train_category_2)))
print("category 3 differences {}".format(len(test_category_3 - train_category_3)))

The differences between the two sets are in `category_3`: 12 values are present in the test set but not in the train set.
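
A model fitted on the train set has never seen those values. One simple way to handle them, sketched here on toy data, is to map every unseen test value to a common placeholder. The `"Unseen"` label is an arbitrary choice made for this example.

```python
import pandas as pd

# Toy category columns standing in for `category_3` in each set.
train_cat = pd.Series(["Tops", "Shoes", "Tops"])
test_cat = pd.Series(["Shoes", "Hats", "Tops"])

# Keep test values that exist in the train set, replace the others.
seen = set(train_cat.unique())
test_cat_clean = test_cat.where(test_cat.isin(seen), "Unseen")
```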

## Brand name
This is the variable with the most missing values.  
I'll try to find out why.


In [None]:
train["brand_name"].value_counts()

In [None]:
train["has_brand"] = train["brand_name"].notnull().astype(int)
train["has_brand"].value_counts()

In [None]:
plt.figure(figsize=(7, 7))
sns.distplot(np.log10(train[train["has_brand"] == 0]["price"] + 1), label="no brand")
sns.distplot(np.log10(train[train["has_brand"] == 1]["price"] + 1), label="has brand")
plt.legend()

Items with a brand name are more expensive than the ones with no brand name. 
Most items that were sold for less than \$10 don't have a brand.
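
The second statement can be checked with a cross-tabulation. The sketch below uses made-up rows and a hypothetical `under_10` column. Each row of the result gives the share of items without and with a brand in that price group.

```python
import pandas as pd

# Made-up rows for illustration; not values from the dataset.
toy = pd.DataFrame({
    "price": [5.0, 8.0, 9.0, 25.0, 40.0, 60.0],
    "has_brand": [0, 0, 1, 1, 1, 0],
})
toy["under_10"] = toy["price"] < 10

# Row-normalized shares: each row sums to 1.
share = pd.crosstab(toy["under_10"], toy["has_brand"], normalize="index")
```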

## Shipping
The `shipping` variable tells us if the shipping fee is paid by the seller (1) or the buyer (0).

In [None]:
# value_counts() sorts by frequency, so sort by value to match the x positions
plt.bar([0, 1], train["shipping"].value_counts().sort_index())
plt.xticks([0,1])
plt.show()

Most of the time, the shipping is paid by the buyer.  

In [None]:
plt.figure(figsize=(7, 7))
sns.distplot(np.log10(train[train["shipping"] == 0]["price"] + 1), label="buyer")
sns.distplot(np.log10(train[train["shipping"] == 1]["price"] + 1), label="seller")
plt.legend()
plt.show()

It seems that the seller will be more likely to pay the shipping fee on cheap items.

## Item description
This is another text field that is filled by the seller.  
I'll look at the length of the text like I did for the item's name.

In [None]:
# Treat a missing description as an empty string in both sets
train["desc_length"] = train["item_description"].fillna("").str.len()
test["desc_length"] = test["item_description"].fillna("").str.len()

plt.figure(figsize=(7,7))
sns.distplot(train["desc_length"], label="train")
sns.distplot(test["desc_length"], label="test")
plt.legend()
plt.show()

The `item_description` variable is much longer than the `name`. It goes up to 1,000 characters.  
The distribution is still the same between the train set and the test set.  
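
A rough numeric counterpart to comparing the two curves by eye is to put a few quantiles side by side. The sketch below uses synthetic lengths drawn from the same distribution. `compare_quantiles` is a hypothetical helper, and the chosen quantiles are arbitrary.

```python
import numpy as np
import pandas as pd

def compare_quantiles(a, b, qs=(0.25, 0.5, 0.75, 0.95)):
    """Put selected quantiles of two series side by side."""
    qs = list(qs)
    return pd.DataFrame({"train": a.quantile(qs), "test": b.quantile(qs)})

# Synthetic lengths drawn from the same distribution for both sets.
rng = np.random.default_rng(0)
toy_train = pd.Series(rng.integers(0, 1000, size=5000))
toy_test = pd.Series(rng.integers(0, 1000, size=5000))

table = compare_quantiles(toy_train, toy_test)
table["abs_diff"] = (table["train"] - table["test"]).abs()
```

Small differences across all quantiles are consistent with similar distributions. Large ones would be worth a closer look.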

## Features correlation


In [None]:
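txt_f = ["category_name", "brand_name", "category_1", "category_2", "category_3"]

# Replace each text column with integer codes (missing values become -1).
# Plain column assignment makes sure the column gets an integer dtype.
for f in txt_f:
    train[f] = pd.factorize(train[f])[0]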

In [None]:
corr = train.drop(["name", "item_description", "category_name"], axis=1).corr().mul(100).astype(int)

cg = sns.clustermap(data=corr, annot=True, fmt="d")
plt.setp(cg.ax_heatmap.yaxis.get_majorticklabels(), rotation=0)
plt.show()

There is no strong correlation between any pair of features. 
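
One caveat when reading this matrix: `pd.factorize` numbers the values in order of first appearance and encodes missing values as -1, so the codes carry no natural order. The toy example below illustrates this. A weak linear correlation for a factorized text column therefore does not necessarily mean that the column is uninformative.

```python
import pandas as pd

# Toy brand column with one missing value.
brands = pd.Series(["Nike", "Apple", None, "Nike"])

# codes follow the order of first appearance; the missing value gets -1
codes, uniques = pd.factorize(brands)
```

Here "Nike" gets code 0 only because it appears first, not because of anything related to its price.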

## Conclusion
This is it for now.   
The train and test sets seem to be balanced. We shouldn't have too much trouble with distribution differences.  

A big part of predicting the price will be about doing NLP on the `name` and `item_description` features. I didn't explore this aspect in this kernel. 

I'll update this kernel as I find new things I want to explore. 