**Overview :**

In this kernel, I will perform exploratory data analysis (EDA) on the dataset of Mercari's Price Suggestion Challenge.

Let's start!
- Challenge Description: https://www.kaggle.com/c/mercari-price-suggestion-challenge
- Data Description: https://www.kaggle.com/c/mercari-price-suggestion-challenge/data

**About Mercari :**
- Mercari is a community-powered shopping app. In this competition, Mercari’s challenging you to build an algorithm that automatically suggests the right product prices. You’ll be provided user-inputted text descriptions of their products, including details like product category name, brand name, and item condition.
- train.tsv and test.tsv are training and test files respectively.

**Data Fields :**
- item_condition_id = the condition of the items provided by the seller
- category_name = category of the listing
- brand_name = name of the brand
- price = the price that the item was sold for (in USD). This is the target variable.
- shipping = 1 if shipping fee is paid by seller and 0 if paid by buyer
- item_description = the full description of the item. Note that the organizers have cleaned the data to remove text that looks like prices (e.g. \$20) to avoid leakage. These removed prices are represented as [rm].

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 
%matplotlib inline
import string
import seaborn as sns
sns.set_style("whitegrid")
import warnings
warnings.simplefilter("ignore")

In [None]:
# A leading exclamation mark runs the command in the shell.
!ls ../input/

Let's explore the data at high level.

In [None]:
# read_csv reads a delimited text file into a dataframe.
# These files are TSV, so the separator is a tab.
train_df = pd.read_csv("../input/train.tsv",sep='\t')
test_df = pd.read_csv("../input/test.tsv",sep='\t')

In [None]:
#return 5 data points sampled randomly from dataset.
train_df.sample(5)

**Result :**
This is what I've figured out from the output above.

- A high-level look gives me a general understanding of the data I'm using.
- train_id looks like a unique identifier.
- name is the title of the product.
- item_condition_id is a categorical variable.
- category_name might require feature engineering to separate the subcategories.
- price is a numerical variable. It's the target variable.
- brand_name seems to have a lot of NaN values.
- shipping is a binary categorical variable.
- item_description is unstructured text data that will need preprocessing.

Let's see the size of datasets.

In [None]:
# Report the sizes in millions of rows.
print("Length of Training Data: " + str(len(train_df)/1e6) + " million.")
print("Length of Testing Data: " + str(len(test_df)/1e6) + " million.")

**Result :** Around 1.5 million training data points and 0.7 million test data points. That feels huge. :-)

I assume all the train_id's are unique. Let's verify.

In [None]:
train_df.train_id.nunique()/len(train_df)

**Result :** Indeed, they are.
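The ratio above equals 1.0 only when no ID repeats. As a small sketch of an alternative check, pandas also exposes `is_unique` and `duplicated`. The frame `toy_df` below is a hypothetical stand-in with one deliberate duplicate, not the real training data.

```python
import pandas as pd

# Hypothetical toy frame standing in for train_df; the value 3 is repeated on purpose.
toy_df = pd.DataFrame({"train_id": [0, 1, 2, 3, 3]})

# is_unique is True only when no value occurs more than once.
ids_unique = toy_df["train_id"].is_unique

# duplicated(keep=False) flags every row whose ID is shared with another row.
duplicate_rows = toy_df[toy_df["train_id"].duplicated(keep=False)]

print(ids_unique)
print(duplicate_rows)
```

Listing the offending rows is handy when the check fails, because the ratio alone does not say which IDs repeat.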

Let's see the number of missing values in the dataset.

In [None]:
train_df.isnull().sum()

**Result :**
- item_description has only 4 missing values. I can safely drop those rows.
- category_name has around 6000 missing values. That is negligible next to 1.5 million rows.
- brand_name has a lot of missing values. I'll handle them later.
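Raw counts are easier to judge as a share of all rows. Here is a minimal sketch, assuming a hypothetical `toy_df` with made-up values in place of the real training data: the mean of a boolean mask is the fraction of missing entries.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are made up for illustration only.
toy_df = pd.DataFrame({
    "brand_name": ["Nike", np.nan, np.nan, "Apple"],
    "category_name": ["A/B/C", "A/B/D", np.nan, "E/F/G"],
    "item_description": ["New", "Worn once", "No description yet", "Works fine"],
})

# isnull() gives a boolean mask; its column-wise mean is the missing fraction.
missing_percent = toy_df.isnull().mean() * 100

print(missing_percent.sort_values(ascending=False))
```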

Let's see the different data types of columns.

In [None]:
train_df.dtypes

**Result :**
- train_id = unique integer identifier.
- shipping, item_condition_id = categorical variable.
- price = numerical variable
- category_name = requires feature engineering.
- item_description = unstructured text data, will require engineering.

Let's analyze each variable one by one starting with the price. Since price is a numerical variable, I'll plot a histogram.

In [None]:
plt.figure(figsize=(10,8))
plt.hist((train_df['price']), bins=30, range=[0,250], label='price')
plt.title('Price Distribution')
plt.xlabel("Price")
plt.ylabel("Count")
plt.show()

**Result :** 
- The distribution is heavily right-skewed: most of the mass sits on the left, with a long tail towards high prices.
- Most of the prices are in 0-50 range.
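The skew can also be put into a number. The sketch below uses a hypothetical series `toy_prices` with made-up values (not the real prices) and pandas' sample skewness: a positive value means a long right tail. It also shows how a log(1 + x) transform, one common remedy, pulls the tail in.

```python
import numpy as np
import pandas as pd

# Hypothetical prices: many small values plus a few very large ones.
toy_prices = pd.Series([0, 5, 8, 10, 12, 15, 17, 20, 29, 45, 120, 900], dtype=float)

# Sample skewness; positive means the right tail is the long one.
raw_skew = toy_prices.skew()

# log1p(x) = log(1 + x) stays finite for x = 0 and compresses large values.
log_skew = np.log1p(toy_prices).skew()

print("skew of raw prices:", raw_skew)
print("skew of log1p prices:", log_skew)
```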

Let's summarize the price data with some descriptive statistics.

In [None]:
train_df.price.describe()

**Result :**
- Average price is around 27.
- But the max price can be as high as 2009.
- A minimum price of 0 exists. This raises a question: how many such items are there?

Let's count.

In [None]:
len(train_df[train_df.price==0])

**Result :** 
- 874 products are free! Really? Strange.
- The price distribution was skewed. Let's plot the log of the prices to see what it looks like.

In [None]:
plt.figure(figsize=(10,8))
# log1p computes log(1 + price), so zero-priced items map to 0 instead of -inf.
plt.hist(np.log1p(train_df['price']), bins=30, range=[0,8], label='price')
plt.title('Log Price Distribution')
plt.xlabel("Log Price")
plt.ylabel("Count")
plt.show()

**Result :**
- A nice, approximately normal distribution!
- We can also use box plots to explore the price variable.
- Because the price variable is so highly skewed, I limit the x-axis to the range -10 to 100 so that the quartiles are properly visible.

In [None]:
plt.figure(figsize=(10,8))
sns.boxplot(x=train_df.price)
plt.xlim(-10,100)
plt.xlabel("Price")
plt.title("Box Plot of Price Variable")
plt.show()

**Result :**
- The median is 17.
- 25% of the prices are below 10.
- 75% of the prices are below 29.
- I read these values from the earlier describe call on the price variable.
- The same values are visible in the box plot too!
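To make the link between the quartiles and the box plot explicit, here is a small sketch on a hypothetical series `toy_prices` (made-up values chosen for illustration, not the real data). With the default whisker setting of 1.5, points beyond Q3 + 1.5 × IQR are drawn as individual outliers.

```python
import pandas as pd

# Hypothetical prices, made up for illustration.
toy_prices = pd.Series([3, 10, 10, 14, 17, 17, 20, 29, 29, 60, 150], dtype=float)

# The box spans Q1 to Q3 and the line inside it is the median.
q1, median, q3 = toy_prices.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Default rule: the upper whisker reaches at most Q3 + 1.5 * IQR.
upper_whisker_limit = q3 + 1.5 * iqr
outliers = toy_prices[toy_prices > upper_whisker_limit]

print(q1, median, q3, upper_whisker_limit)
print(outliers.tolist())
```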

Now, which other variables are left to analyze? Let's take another look at the dataframe to refresh.

In [None]:
train_df.sample(5)

Let's analyse the item_condition_id variable. It's categorical. So, let's plot a bar plot!

In [None]:
item_condn_percent = train_df.item_condition_id.value_counts()/len(train_df)*100
plt.figure(figsize=(10,8))
sns.barplot(x=item_condn_percent.index,y=item_condn_percent.values)
plt.xlabel("Item Condition ID")
plt.ylabel("Percentage of products")
plt.title("Bar Plot of Item Condition with Percentage of Products")
plt.show()
pd.concat([item_condn_percent],keys=["percentage"],axis=1)

**Result :**
- Most items are in condition 1: around 43%.
- Only a few items (2%) are in condition 4.
- Even fewer (0.16%) are in condition 5.
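The percentages can also be obtained without dividing by the length by hand. This is a minimal sketch on a hypothetical series `toy_conditions` with made-up values; `value_counts(normalize=True)` returns fractions that sum to 1.

```python
import pandas as pd

# Hypothetical condition IDs, made up for illustration.
toy_conditions = pd.Series([1, 1, 1, 1, 2, 2, 3, 3, 3, 5])

# normalize=True returns shares instead of counts; sort_index orders by condition ID.
condition_percent = toy_conditions.value_counts(normalize=True).sort_index() * 100

print(condition_percent)
```

Note that `value_counts` drops NaN by default, so with missing values the shares are relative to the non-missing rows rather than to all rows.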

Let's now analyze our other categorical variable: shipping.
Again, let's plot a bar plot.

In [None]:
shipping_percent = train_df.shipping.value_counts()/len(train_df)*100
plt.figure(figsize=(5,8))
sns.barplot(y=shipping_percent.values,x=shipping_percent.index)
plt.title("Bar Plot of Shipping with Percentage of Products")
plt.xlabel("Shipping Variable")
plt.ylabel("Percentage")
plt.show()
pd.concat([shipping_percent],keys=["percentage"],axis=1)

**Result :**
- Well, nicely balanced.

What other variables are left? Let's see

In [None]:
train_df.sample(5)

Let's analyze category_name now. I'll engineer it first. This is what I'm going to do.
- The category_name is in format word1/word2/word3.
- word1 I'll call the "general_cat" variable.
- word2 I'll call the "sub_cat1" variable.
- word3 I'll call "sub_cat2" variable.
- In the names above, "cat" means category.

But first, I'll remove the rows with a null category_name.

In [None]:
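# Keep rows that have a category; .copy() avoids a SettingWithCopyWarning on the assignments below.
train_df = train_df[train_df.category_name.notnull()].copy()
# zip(*...) transposes the per-row lists and truncates to the shortest list,
# so this assumes every category has at least three levels.
train_df['general_cat'], train_df['sub_cat1'], train_df['sub_cat2'] = \
zip(*train_df.category_name.apply(lambda x: x.split('/')))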
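As an alternative sketch (not what this kernel runs), pandas' vectorized `str.split` can do the same split. With `n=2` it splits at most twice, so any deeper levels stay together in the third column instead of being dropped. The frame `toy_df` and its category strings are hypothetical.

```python
import pandas as pd

# Hypothetical categories; the second one has four levels on purpose.
toy_df = pd.DataFrame({
    "category_name": [
        "Women/Tops/Blouse",
        "Electronics/Tablets/Readers/Covers",
        "Beauty/Makeup/Face",
    ]
})

# n=2 limits the split to two cuts; expand=True returns one column per part.
parts = toy_df["category_name"].str.split("/", n=2, expand=True)
parts.columns = ["general_cat", "sub_cat1", "sub_cat2"]

toy_df = toy_df.join(parts)
print(toy_df)
```

Whether the extra levels should be kept in sub_cat2 or discarded is a modeling choice; this sketch only shows the mechanics.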

Let's see what our dataframe looks like now!

In [None]:
train_df.sample(5)

Now the analysis of category_name is easier!
Let's explore the new features.

Let's see the number of unique values for general_cat variable

In [None]:
train_df.general_cat.nunique()

**Result :**
- 10? That's it? 
- Let's see the value counts and plot a bar plot

In [None]:
general_cat_percent = train_df.general_cat.value_counts().sort_values(ascending=False)/len(train_df)*100
plt.figure(figsize=(15,8))
sns.barplot(x=general_cat_percent.index,y=general_cat_percent.values)
plt.show()
pd.concat([general_cat_percent],keys=["percent"],axis=1)

**Result :** A lot of cool information!
- 45% of the products are from the Women category! That's almost half of all products!
- 14% are from Beauty.

Which sub-categories are most common within the Women category? Let's see.

In [None]:
sub_cat1_percent = train_df[train_df.general_cat=="Women"]['sub_cat1'].value_counts().\
    sort_values(ascending=False)[:10]/len(train_df[train_df.general_cat=="Women"])*100
plt.figure(figsize=(20,10))
sns.barplot(x=sub_cat1_percent.index,y=sub_cat1_percent.values)
plt.title("Percentage of various sub-categories INSIDE the WOMEN category")
plt.xlabel("Sub-category Name")
plt.ylabel("Percentage")
plt.show()
pd.concat([sub_cat1_percent],keys=["percentage"],axis=1)

**Result :**
- Well, Athletic Apparel, Tops and Blouses, Shoes and Jewelry lead the Women category!

Let's analyze the second major category: Beauty.

In [None]:
sub_cat1_percent = train_df[train_df.general_cat=="Beauty"]['sub_cat1'].value_counts().\
    sort_values(ascending=False)/len(train_df[train_df.general_cat=="Beauty"])*100
plt.figure(figsize=(20,10))
sns.barplot(x=sub_cat1_percent.index,y=sub_cat1_percent.values)
plt.title("Percentage of various sub-categories INSIDE the BEAUTY category")
plt.xlabel("Sub-category Name")
plt.ylabel("Percentage")
plt.show()
pd.concat([sub_cat1_percent],keys=["percentage"],axis=1)

**Result :**
- A whopping 60% is just makeup.
- Skin Care and Fragrance also account for a large share.

Let's see what Kids like to buy.

In [None]:
sub_cat1_percent = train_df[train_df.general_cat=="Kids"]['sub_cat1'].value_counts().\
    sort_values(ascending=False)[:10]/len(train_df[train_df.general_cat=="Kids"])*100
plt.figure(figsize=(20,10))
sns.barplot(x=sub_cat1_percent.index,y=sub_cat1_percent.values)
plt.title("Percentage of various sub-categories INSIDE the KIDS category")
plt.xlabel("Sub-category Name")
plt.ylabel("Percentage")
plt.show()
pd.concat([sub_cat1_percent],keys=["percentage"],axis=1)

**Results :**
- Toys. Huh!

Which variables are left to analyze? Let's have a look at the dataframe again!

In [None]:
train_df.sample(5)

We haven't analysed brand_name and item_description yet. Let's go with item_description first.
- Since item_description is unstructured text data, a quick way to get a feel for it is a word cloud!

In [None]:
import wordcloud as wc
desc_word_cloud = wc.WordCloud(width=1000,height=1000).generate(" ".join(train_df.sample(10000).item_description.astype(str)))
plt.figure(figsize = (20, 15))
plt.imshow(desc_word_cloud)
plt.show()

**Result :** The word cloud gives a general idea of what people usually write in their product descriptions!
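A word cloud sizes each word by how often it occurs, so the same information can be read as plain counts. Below is a rough sketch using only the standard library. The list `toy_descriptions` and the tiny `STOP_WORDS` set are hypothetical, and the tokenizer is a simple assumption rather than what the wordcloud package does internally.

```python
import re
from collections import Counter

# Hypothetical descriptions, made up for illustration.
toy_descriptions = [
    "Brand new with tags, never worn",
    "Never used, brand new in box",
    "No description yet",
]

# A tiny, hand-picked stop-word list (an assumption for this sketch).
STOP_WORDS = {"with", "in", "no", "yet"}

tokens = []
for text in toy_descriptions:
    # Lowercase, then keep runs of letters and apostrophes as words.
    for word in re.findall(r"[a-z']+", text.lower()):
        if word not in STOP_WORDS:
            tokens.append(word)

word_counts = Counter(tokens)
print(word_counts.most_common(3))
```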

Now, let's come to the brand_name variable. We saw that it has a lot of missing values. Instead of throwing away all that data, it's better to replace the missing values with an "Other" category. But first, let's have another bar plot, this time of the brand_name variable.

In [None]:
brand_name_percent = train_df['brand_name'].value_counts().\
    sort_values(ascending=False)[:10]/len(train_df)*100
plt.figure(figsize=(20,10))
sns.barplot(x=brand_name_percent.index,y=brand_name_percent.values)
plt.title("Bar plot of Brand Name")
plt.xlabel("Brand Name")
plt.ylabel("Percentage of Products")
plt.show()
pd.concat([brand_name_percent],keys=["percentage"],axis=1)
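Because `value_counts` skips NaN by default, the missing brands do not appear in the bar plot above. Here is a minimal sketch of the "Other" replacement mentioned earlier, assuming a hypothetical `toy_df` with made-up rows. After filling, "Other" shows up as a regular category.

```python
import numpy as np
import pandas as pd

# Hypothetical brands with two missing entries.
toy_df = pd.DataFrame({"brand_name": ["Nike", np.nan, "PINK", np.nan, "Apple"]})

# Replace missing brand names with a placeholder category.
toy_df["brand_name"] = toy_df["brand_name"].fillna("Other")

# Shares now include the former missing values.
brand_percent = toy_df["brand_name"].value_counts(normalize=True) * 100

print(brand_percent)
```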

**Results :**
- We see that Nike, PINK, Apple etc. are the most sold product brands!
- By now, we have completed what is known as univariate analysis. That is, the analysis of variables one at a time.
- Bivariate analysis: next, we will look at the relationship of each variable with our target: price.

Let's have a fresh look at our dataframe to recall the column names and types and get ready for bivariate analysis.

In [None]:
train_df.sample(5)

**Result :** 
- The only variables we can plot against price are categorical ones.
- Let's first analyze shipping variable and price together.
- I'll plot a box plot as well as a histogram of price variable for each of the two shipping categories.

In [None]:
plt.figure(figsize=(10,8))
# log1p computes log(1 + price), which keeps zero-priced items finite.
plt.hist(np.log1p(train_df[train_df.shipping==1]['price']), bins=30, range=[0,8],color='red',alpha=0.5,label="shipping=1")
plt.hist(np.log1p(train_df[train_df.shipping==0]['price']),bins=30,range=[0,8],color='blue',alpha=0.5,label="shipping=0")
plt.legend(loc='upper right')
plt.title("Log Price Distribution for different shipping categories.")
plt.ylabel("Count")
plt.xlabel("Log Price")
plt.show()

And Box Plot.

In [None]:
plt.figure(figsize=(10,8))
# log1p avoids storing -inf in log_price for zero-priced items.
train_df['log_price'] = np.log1p(train_df['price'])
sns.boxplot(x="shipping",y="log_price",data=train_df)
plt.ylim(0,6)
plt.title("Box Plot of Log Price for different shipping categories.")
plt.ylabel("Log Price")
plt.xlabel("Shipping")
plt.show()

**Result :**
- shipping doesn't seem to have much effect on the value of price.
- Let's look at the dataframe again to see what is left for bivariate analysis.

In [None]:
train_df.sample(5)

Let's do price vs general category! I'll plot only a box plot, as overlaid histograms for this many categories would be hard to read.

In [None]:
plt.figure(figsize=(20,10))
# log1p avoids storing -inf in log_price for zero-priced items.
train_df['log_price'] = np.log1p(train_df['price'])
sns.boxplot(x="general_cat",y="log_price",data=train_df)
plt.ylim(1,6)
plt.show()

Let's now visualize the relationship of price with item_condition!!

In [None]:
plt.figure(figsize=(20,10))
sns.boxplot(x="item_condition_id",y="log_price",data=train_df)
plt.ylim(1,6)
plt.show()

**Result :**
- item_condition_id doesn't seem to alter the price distribution much, except for item_condition_id = 5.
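What the box plots show can be cross-checked with a group-wise summary table. This is a small sketch on a hypothetical `toy_df`; the prices are made up purely to demonstrate the mechanics and say nothing about the real data. The same pattern works for shipping or general_cat as the grouping column.

```python
import numpy as np
import pandas as pd

# Hypothetical rows with made-up prices; not results from the real dataset.
toy_df = pd.DataFrame({
    "item_condition_id": [1, 1, 1, 2, 2, 3, 3, 5, 5],
    "price": [10.0, 20.0, 30.0, 12.0, 18.0, 9.0, 15.0, 40.0, 80.0],
})
toy_df["log_price"] = np.log1p(toy_df["price"])

# One row per condition: group size, median and mean of the log price.
summary = toy_df.groupby("item_condition_id")["log_price"].agg(["count", "median", "mean"])

print(summary)
```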

**Next:** Create TF-IDF vectors and feed them into machine learning algorithms.
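As a preview of that step, here is a hand-rolled sketch of the textbook TF-IDF weighting: term frequency times log(N / document frequency). It is only meant to illustrate the idea. The list `toy_docs` and the helper `tfidf` are hypothetical, and library implementations differ in detail (scikit-learn's TfidfVectorizer, for instance, uses a smoothed IDF and L2 normalization by default).

```python
import math
from collections import Counter

# Hypothetical item texts, made up for illustration.
toy_docs = [
    "brand new nike shoes",
    "used nike shirt",
    "brand new makeup palette",
]
tokenized = [doc.split() for doc in toy_docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf(tokens):
    """Return {term: tf * idf} with tf = count / length and idf = log(N / df)."""
    counts = Counter(tokens)
    return {
        term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
        for term, count in counts.items()
    }

vectors = [tfidf(tokens) for tokens in tokenized]
print(vectors[0])
```

Terms that occur in fewer documents, like "shoes" here, get a higher weight than terms shared across documents, like "nike".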