# Introduction
In this notebook I will describe the exploration of the data for the Mercari competition.

# Set-up

### Importing modules

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from bokeh.io import output_notebook, show
from bokeh.plotting import figure

### Loading the data
As the data is tab delimited, we use the separator **\t**

In [None]:
train = pd.read_csv('../input/train.tsv', sep='\t', dtype={'item_condition_id':str, 'category_name':str})
test = pd.read_csv('../input/test.tsv', sep='\t', dtype={'item_condition_id':str, 'category_name':str})

### ID variables
The ID is not needed to train the model; it is only needed to create the submission file. We therefore drop it from train and set it aside for the test data.

In [None]:
train.drop(['train_id'], axis=1, inplace=True)
test_ids = test['test_id']
test.drop('test_id', axis=1, inplace=True)
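
The IDs set aside here are only needed again at submission time. As a rough sketch of how they might be recombined with predictions (the `predictions` array and the `test_id`/`price` column layout are assumptions for illustration, not taken from this notebook):

```python
import numpy as np
import pandas as pd

# Toy stand-ins: in the notebook, test_ids is the column set aside above.
test_ids = pd.Series([0, 1, 2], name='test_id')
predictions = np.array([10.0, 25.5, 7.0])  # hypothetical model output

# Assumed submission layout: one row per test_id with the predicted price.
submission = pd.DataFrame({'test_id': test_ids, 'price': predictions})
print(submission)
```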

# EDA

## General overview

In [None]:
print("Train shape:", train.shape)
print("Test shape:", test.shape)

Checking whether the same columns were used in train and test

In [None]:
assert list(train.drop('price', axis=1).columns) == list(test.columns)
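
A bare assert only reports that the column lists differ, not where. Should it ever fail, a set difference pinpoints the mismatch. A minimal sketch on hypothetical toy frames (the column names here are made up):

```python
import pandas as pd

# Toy frames with a deliberate mismatch in one column name.
toy_train = pd.DataFrame(columns=['name', 'shipping', 'price'])
toy_test = pd.DataFrame(columns=['name', 'shipping_flag'])

# The target column only exists in train, so it is excluded from the comparison.
train_cols = set(toy_train.columns) - {'price'}
test_cols = set(toy_test.columns)

only_in_train = sorted(train_cols - test_cols)
only_in_test = sorted(test_cols - train_cols)
print(only_in_train, only_in_test)
```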

No AssertionError, so the columns in both data sets are equal. Let's have a visual check on the top 5 rows in train.

In [None]:
train.head()

What can we learn from this?
* **name**: this is a free text variable containing a lot of information like the brand, type of clothing, material type, ...
* **item_condition_id**: here a numeric label is used to indicate the condition of the item. Not sure yet whether 2 is a better condition than 1, or the reverse.
* **category_name**: Here we notice some kind of hierarchy separated with a slash. The rightmost part is the most granular category level.
* **brand_name**: Not always filled in and sometimes duplicates what is already in the name variable.
* **shipping**: binary variable. 1 = shipping cost included in price; 0 = shipping cost not included.
* **price**: at first sight the prices seem to be rounded and rather low. For further investigation.
* **item_description**: free text variable. It seems that when this field was not filled in, *"No description yet"* is used. To be investigated.

In [None]:
train.info()

* **no null values**: name, item_condition_id, shipping, price
* **few missing values**: category_name, item_description. However, we need to look at how many rows contain *"No description yet"* in item_description as this can be considered as a missing value too.
* **many missing values**: brand_name
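
The counts behind a summary like this can also be computed directly instead of being read off `info()`. A minimal sketch on a toy frame (the values are made up; only the column names follow the data):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the train data (values are made up for illustration).
toy = pd.DataFrame({
    'name': ['shirt', 'phone', 'lamp', 'bag'],
    'brand_name': [np.nan, 'Apple', np.nan, np.nan],
    'item_description': ['No description yet', 'works fine', np.nan, 'used once'],
})

# Missing values per column, as a count and as a share of all rows.
missing = pd.DataFrame({
    'n_missing': toy.isnull().sum(),
    'share_missing': toy.isnull().mean(),
})
print(missing)

# Rows where the description is either NaN or the placeholder text.
placeholder = toy['item_description'].eq('No description yet')
n_effectively_missing = int((toy['item_description'].isnull() | placeholder).sum())
print(n_effectively_missing)
```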

## Item_description
### Missing values
We first check how many records contain *"No description yet"*

In [None]:
print("item_description = NaN in {} records".format(train[train.item_description.isnull()].shape[0]))
print("item_description = 'No description yet' in {} records".format(train[train.item_description == 'No description yet'].shape[0]))

There are 82,489 records that contain *"No description yet"* and only 4 records that are empty. 

Suppose we use a simple CountVectorizer or TfidfVectorizer: leaving *"No description yet"* unchanged would produce 3 separate words, although the three words combined have a single meaning. Therefore we replace the phrase with **[ndy]**, so that the CountVectorizer or TfidfVectorizer treats it as one word.

The missing values will be replaced by [ndy] as well because they represent the same meaning.

In [None]:
def replace_text(df, variable, text_to_replace, replacement):
    df.loc[df[variable] == text_to_replace, variable] = replacement
    
replace_text(train, 'item_description', 'No description yet', '[ndy]')
replace_text(test, 'item_description', 'No description yet', '[ndy]')

train.loc[train['item_description'].isnull(), 'item_description'] = '[ndy]'
test.loc[test['item_description'].isnull(), 'item_description'] = '[ndy]'
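
To see the effect of the replacement on tokenization, the sketch below mimics the default analyzer of scikit-learn's vectorizers with plain `re`: lowercase the text, then apply the documented default `token_pattern`. It assumes the vectorizer is used with default settings; the `tokenize` helper is hypothetical. Note that the square brackets are dropped, so the placeholder ends up as the single token `ndy`.

```python
import re

# scikit-learn's documented default token_pattern for its text vectorizers.
DEFAULT_TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    """Mimic the default analyzer: lowercase, then extract word tokens."""
    return DEFAULT_TOKEN_PATTERN.findall(text.lower())

print(tokenize("No description yet"))  # three separate tokens
print(tokenize("[ndy]"))               # a single token, without the brackets
```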

### Number of words in item_description

In [None]:
train['item_description_nb_words'] = train['item_description'].str.split().apply(len)
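
The word count above relies on the missing values having been filled first, because `len` fails on NaN. As an alternative sketch on toy data (not the notebook's approach), `.str.len()` on the split result leaves missing values as NaN instead of raising:

```python
import numpy as np
import pandas as pd

# Toy descriptions, including a missing value and a double space.
desc = pd.Series(['brand new with tags', '[ndy]', np.nan, 'good  condition'])

# .str.split() splits on any run of whitespace; .str.len() then counts the tokens.
# Missing values stay NaN instead of raising, unlike .apply(len).
nb_words = desc.str.split().str.len()
print(nb_words.tolist())
```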

In [None]:
output_notebook()
d = train['item_description_nb_words'].describe()

quartiles = ['Q1', 'Q2', 'Q3']
p = figure(x_axis_label='Nb words in item_description', y_axis_label='Quartile value', 
           x_range=quartiles, toolbar_location=None, tools="")
p.vbar(x=quartiles, top=[d['25%'],d['50%'],d['75%']], width=0.9, color='#EE9D31')
p.xgrid.grid_line_color = None
p.y_range.start = 0
show(p)

25% of all records have up to 7 words in item_description. Let's visualize this.

In [None]:
s = train['item_description_nb_words']

plt.figure(figsize=(12,8))
ax = sns.distplot(s, kde=False, bins=50)
ax.set(xlabel='Nb words in item_description', ylabel='Frequency')
ax.set(xticks=np.arange(0,s.max(),10))
plt.axvline(s.median(), color='r', linestyle='dashed', linewidth=1)  # vertical line at the median
yvals = ax.get_yticks()
ax.set_yticklabels(['{:,}'.format(y) for y in yvals])
plt.show();

## Name
50% of all records have up to 4 words

In [None]:
train['name'].str.split().apply(len).describe()

## Brand_name

In [None]:
train[~train.brand_name.isnull()]['brand_name'].str.split().apply(len).describe()

Most brand names consist of only 1 word

## Item_condition_id
This is probably an ordinal variable, meaning that 1 represents a higher (or lower) value than 2, etc.
First, let's have a look at the distinct values and their proportion in the train data.

In [None]:
train.item_condition_id.value_counts(normalize=True).sort_index()
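
A cumulative sum over the sorted proportions shows at a glance how much of the data the first few levels cover. A small sketch on made-up labels (stored as strings, as in this notebook):

```python
import pandas as pd

# Toy condition labels stored as strings (values are made up).
cond = pd.Series(['1', '1', '1', '2', '2', '3', '3', '3', '4', '5'])

# Proportion per level, ordered by level, then accumulated.
shares = cond.value_counts(normalize=True).sort_index()
cumulative = shares.cumsum()
print(cumulative)
```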

97% of all records have a value between 1 and 3. Item_condition_id 4 and 5 only represent a very small proportion. Now we'll look at the two extremes (1 and 5) and get a feeling what they might represent.

In [None]:
train[train.item_condition_id == '5'].head()

By simply looking at the item_description, we see that item_condition_id = 5 stands for a poor condition of the item. This is indicated by words like **broken, for parts, junk**

In [None]:
train[train.item_condition_id == '1'].head()

This confirms what we noticed for item_condition_id = 5. 1 is clearly a very good or new condition of the item. We see words like **new, complete**

The second question we may ask is: is the condition of the item reflected in the price? We expect to have a higher price for items in a better condition.

In [None]:
train.groupby('item_condition_id')['price'].describe()

We see that the median price (50% column) is indeed higher for item_condition_id = 1. The median price also goes down for the lower conditions, except for 5. Perhaps item_condition_id = 5 are vintage items which have a higher value, regardless of their condition? Or the seller perhaps offers more items of that kind and tries to get a better price? Hard to tell though...

Another reason might be that item_condition_id = 5 holds items of a specific category. So it might be interesting to include interaction variables of item_condition_id and category_name.
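
One simple way to build such an interaction variable is to concatenate the two labels into a single categorical level. This is only a sketch on toy data; the column name `cond_x_cat` and the choice of the top-level category are assumptions for illustration:

```python
import pandas as pd

# Toy rows (values are made up); item_condition_id is a string, as in the notebook.
toy = pd.DataFrame({
    'item_condition_id': ['1', '5', '5', '3'],
    'category_name': ['Women/Tops/Blouse', 'Electronics/Cell Phones/Parts',
                      'Electronics/Cell Phones/Parts', 'Men/Shoes/Boots'],
})

# Top-level category is the part before the first slash.
top_level = toy['category_name'].str.split('/').str[0]

# Interaction label: one categorical level per (condition, top-level category) pair.
toy['cond_x_cat'] = toy['item_condition_id'] + '_' + top_level
print(toy['cond_x_cat'].tolist())
```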

## Category_name
We will check how many levels (separated by the slash symbol) are present. These will be used to create categorical variables representing a level.

In [None]:
train['nb_cat_slashes'] = train['category_name'].str.count('/')
train.nb_cat_slashes.value_counts(normalize=True).sort_index()

Most of the records contain only 3 category levels (or 2 slashes). Only a very small percentage has 4 or 5 category levels. We'll split them in separate variables. We can then have a look at what different values occur.

In [None]:
train[['cat_1', 'cat_2', 'cat_3', 'cat_4', 'cat_5']] = train['category_name'].str.split('/', expand=True)
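
The assignment above works only if the split yields exactly five columns. The toy sketch below (category strings are made up) shows how `expand=True` pads shallower categories with missing values, and derives the column names from the result rather than hard-coding them:

```python
import pandas as pd

# Toy categories of different depth, plus a missing value.
cats = pd.Series(['Women/Tops/Blouse',
                  'Electronics/Computers/iPad/Tablet/eBook Readers',
                  None])

# One column per level; the number of columns follows the deepest category.
levels = cats.str.split('/', expand=True)
levels.columns = ['cat_{}'.format(i + 1) for i in range(levels.shape[1])]
print(levels)
```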

In [None]:
for c in ['cat_1', 'cat_2', 'cat_3', 'cat_4', 'cat_5']:
    print('{} has {} unique values'.format(c, len(train[c].unique())))

In [None]:
train.cat_1.value_counts()

In [None]:
# Top 10 of cat_2
train.cat_2.value_counts()[:10]

In [None]:
# Top 10 of cat_3
train.cat_3.value_counts()[:10]

In [None]:
train.cat_4.value_counts()

In [None]:
train.cat_5.value_counts()

cat_5 is only used for cat_4 = 'Tablet'

In [None]:
train[['cat_4','cat_5']].pivot_table(index='cat_4', columns='cat_5', aggfunc=len, fill_value=0)

In [None]:
train[['cat_1','cat_4']].pivot_table(index='cat_1', columns='cat_4', aggfunc=len, fill_value=0)
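
Count tables like the two above can also be built with `pd.crosstab`, which takes the two columns directly. A sketch on toy data (values are made up), offered as an alternative rather than a replacement for the cells above:

```python
import pandas as pd

# Toy rows; most records have no fourth category level.
toy = pd.DataFrame({
    'cat_1': ['Electronics', 'Electronics', 'Women', 'Electronics'],
    'cat_4': ['Tablet', 'Tablet', None, 'Other'],
})

# crosstab counts co-occurrences; a row with a missing value in either
# column does not contribute to any cell.
counts = pd.crosstab(toy['cat_1'], toy['cat_4'])
print(counts)
```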

When we now look at cat_1 by item_condition_id, we see that the percentage of *Electronics* is much higher for item_condition_id = 5. This may explain why the median price in that group is higher than for the neighbouring item_condition_id values.

In [None]:
condition_counts = (train.groupby('item_condition_id')['cat_1']
                    .value_counts(normalize=True)
                    .rename('Percentage')
                    .reset_index())

plt.figure(figsize=(15,10))
sns.set(font_scale = 1.3)
ax = sns.barplot(x="cat_1", y="Percentage", hue="item_condition_id", data=condition_counts)
loc, labels = plt.xticks()
ax.set_xticklabels(labels, rotation=45)
plt.show();

## Shipping

In [None]:
train.shipping.mean()

45% of all records have shipping equal to 1. 

## Target variable

In [None]:
train['price'].describe()

* There are even items that are given away for free. 
* 75% of all records have a price lower than $30. 
* A small portion of the data has very high prices, so the distribution of price will be right-skewed (a long tail towards high values).
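
A few very high prices pull the mean above the median, which shows up as positive skewness. The sketch below checks this on made-up prices and applies `np.log1p`, a common way to compress such a tail; whether to transform the target is a modelling choice this notebook does not make.

```python
import numpy as np
import pandas as pd

# Toy prices with a few large values (made up for illustration).
price = pd.Series([0, 5, 8, 10, 12, 15, 17, 20, 29, 250, 2000], dtype=float)

print(price.skew())                   # positive when the long tail is on the right
print(price.mean() > price.median())  # the tail pulls the mean above the median

# log1p handles zero prices (log1p(0) == 0) and compresses the tail.
log_price = np.log1p(price)
print(log_price.skew())
```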

In [None]:
train[train.price == 0].shape[0]

874 records have a price equal to zero