# Introduction
This is a draft exploratory data analysis (EDA) of the Mercari dataset, shared for comments. If you find it useful, please upvote it.

## Situation
Mercari, Japan’s biggest community-powered shopping app, knows the product pricing problem deeply. They’d like to offer pricing suggestions to sellers, but this is tough because their sellers can put just about anything, or any bundle of things, on Mercari's marketplace.

In this competition, Mercari’s challenging you to build an algorithm that automatically suggests the right product prices. You’ll be provided user-inputted text descriptions of their products, including details like product category name, brand name, and item condition.

## Actions
    
- **Sampling**: use only 10% of the total train/test data sets. In sample mode, a small computer can run the EDA as well. A KS test is used to check whether the sampled train and test sets still follow similar distributions.
- **Pricing**: https://en.wikipedia.org/wiki/Pricing#Pricing_strategies
    - Theoretical factors are manufacturing cost, market place, **competition**, market condition, and **quality of product**.
    - How these factors map to the features:
        - quality of product: item_condition, item_description, product, shipping
        - market place: category, brand
        - **competition**: the frequency of a single product name, category, or brand_name

- **Competition** factor (frequency by value_count): the frequency of a value within a feature is treated as the number of suppliers serving that demand. For example:
    - for each category, the higher the frequency, the more suppliers meet that category's demand (high competition).
    - for each product name, the higher the frequency, the more suppliers meet that product's demand (high competition).
    - for each brand name, the higher the frequency, the more suppliers meet that brand's demand.
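
The frequency idea above can be sketched on a toy table. This is only an illustration under assumptions: the `toy` rows and the helper name `add_supply_count` are made up here and are not part of the Mercari data or of this notebook's functions.

```python
import numpy as np
import pandas as pd

# Toy listings; the values are invented for illustration only.
toy = pd.DataFrame({
    "brand_name": ["Nike", "Nike", "Nike", "Apple", "Apple", "Coach"],
    "price": [20.0, 25.0, 18.0, 300.0, 250.0, 120.0],
})

def add_supply_count(df, col):
    """Return a copy of df with the number of rows sharing each value of col."""
    out = df.copy()
    # "Supply" of a value = how many listings carry that value.
    out[col + "_supply"] = out.groupby(col)[col].transform("count")
    # Log scale keeps rare and very common values on one readable axis.
    out[col + "_log_supply"] = np.log(out[col + "_supply"])
    return out

toy_supply = add_supply_count(toy, "brand_name")
print(toy_supply)
```

The same helper could be applied to `category_name` or `name` to get one supply column per feature.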



# Prepare the environment

In [None]:
import pandas as pd
import numpy as np
import scipy.stats as stats
import seaborn as sns

import nltk
%matplotlib inline
import matplotlib.pyplot as plt

Sample_Ratio = 0.1
all_cols = ["train_id","brand_name","category_name","item_condition_id","name","shipping","item_description"]
y_col =["price"]
SKIPMAP=False
GLOBAL_SEED=42
!free -h

In [None]:
def eda_plot(df,col="",target="log_price",order =None):
    # Usage: eda_plot(df, col="", target="log_price", order=None)
    # df: the dataframe
    # col: the feature column to explore
    # target: the target column to compare against
    # order: optional category order for the plots
    if SKIPMAP == False:
        plt.close("all") #clear all plt figure in buffers
           
        if col =="":
            print("No features need print\n")
        else:
            data = df
            if hasattr(data[col], 'cat') or data[col].dtype == object:  # np.object was removed in NumPy 1.24
                fig,ax =plt.subplots(1,2,figsize=(10,1*5),sharey=True)
                sns.countplot(data = data,y=col,ax=ax[0],order=order)
                sns.boxplot(data=data,y=col,x=target,ax=ax[1],orient="h",order=order)
                
            elif (np.issubdtype( data[col].dtype,np.number)):
                #print(data.dtype)
                fig,ax =plt.subplots(1,2,figsize=(10,1*5))
                
                sns.distplot(data[col], ax=ax[0],kde=False)
                sns.boxplot(data=data,x=col,y=target,ax=ax[1])               

            elif data[col].dtype == object:
                fig,ax =plt.subplots(1,2,figsize=(10,1*5),sharey=True)

                sns.countplot(data = data,y=col,ax=ax[0],order=order)
                sns.boxplot(data=data,y=col,x=target,ax=ax[1],orient="h",order=order)
            else:
                print(data[col].dtype)
        """
        else:
            fig,ax =plt.subplots(1,2,figsize=(10,1*5))

            for i in range(1):
                col=cols[i]
                data = df[col]

                if (np.issubdtype( data.dtype,np.number)):
                    sns.distplot(data[data.notnull()], ax=ax[0,i])
                    sns.regplot(data,target,ax=ax[1,i])
                    #ax[1,i].set_title(col)

                elif data.dtype== np.object:
                    sns.countplot(y=data[data.notnull()],ax=ax[0,i])
                    sns.boxplot(y=df[col],x=target,ax=ax[1,i])
                    #ax[1,i].set_title(col)
        """
        #plt.subplots_adjust(hspace=0.1,wspace=0.1)

In [None]:
def supply_price_curve(col,train):
    # Count how often each value of `col` occurs, i.e. how many products are supplied under each name (or category).
    # Take the log of the counts so both high and low frequencies remain visible.
    tmp = train.groupby(col)[col].transform("count")
    tmp=np.log(tmp)
    if train[col].count() ==0:
        print("{0}-->Empty DataFrame".format(col))
    else:
        fig,ax = plt.subplots(1,2,figsize=(10,5))
        sns.distplot(tmp[tmp.notnull()],kde=False,ax=ax[0])
        sns.regplot(x=tmp,y="log_price",ax=ax[1],data=train)
        ax[0].set_xlabel("Product Qty of each " + col)
        ax[0].set_ylabel("Frequency")
        ax[1].set_xlabel("Product Qty of each " + col)
        plt.subplots_adjust(hspace=0.5,wspace=0.5)
        
    plt.show()

# Load the train & test data
- The TSV files were previously converted to HDF5 (fixed format, per column), which lets the data load on a computer with under 2 GB of memory.
- The cell below reads the original TSV files directly and then samples them.

In [None]:
def get_csv(ratio = Sample_Ratio):
    datasets ={"train":"../input/train.tsv",
            "test":"../input/test.tsv"}
    train = pd.DataFrame()
    test = pd.DataFrame()
    for k in datasets.keys():
        file = datasets[k]
        print("\n")
        print("*"*20)
        print("reading {0} data set".format(file))
        data = pd.read_csv(file,delimiter="\t")
        

        data = data.sample(frac=ratio, random_state=GLOBAL_SEED)  # seeded so the sample is reproducible
        print("----->data set info")
        print(data.info())
        
        missing_sum = data.isnull().sum()
        print("--->missing data")
        print(missing_sum[missing_sum>0])
        
        if k =="train":
            train = data
        elif k == "test":
            test = data
    return train, test
    
train,test = get_csv()

## Statistical test for same distribution

In [None]:
all_cols = ['name', 'item_condition_id', 'category_name', 'brand_name', 'shipping', 'item_description'] 
# Exclude the id and price columns: the id is distributed differently in the train and test sets, and price is the target variable.

def KS_test(test_cols = all_cols,alpha=0.05):
    cols = train[test_cols].select_dtypes(include=[np.number]).columns
    cols_differ = {"ksvalue":[],"pvalue":[]}
    for col in cols:
            pvalue = None
            ksvalue = None
            ksvalue, pvalue = stats.ks_2samp(train[col],test[col])
            cols_differ["ksvalue"].append(ksvalue)
            cols_differ["pvalue"].append(pvalue)

    KStest_df = pd.DataFrame(cols_differ,columns=["ksvalue","pvalue"],index=cols).sort_values(by="pvalue")
    KStest_df["Same_distribution"] = KStest_df.pvalue>alpha
    print(KStest_df["Same_distribution"].head())
KS_test()
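
To show what the two-sample KS check reports, here is a small sketch on synthetic data. Everything in it is an assumption for illustration: the lognormal "population", the sample sizes, and the helper name `same_distribution` are not taken from the Mercari data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic skewed "population" and two random samples drawn from it.
population = rng.lognormal(mean=3.0, sigma=0.8, size=20000)
sample_a = rng.choice(population, size=2000, replace=False)
sample_b = rng.choice(population, size=2000, replace=False)
# A deliberately shifted copy, to see the test reject.
shifted = sample_b * 1.5

def same_distribution(a, b, alpha=0.05):
    """Return (not_rejected, ks_statistic, pvalue) for a two-sample KS test."""
    statistic, pvalue = stats.ks_2samp(a, b)
    # pvalue > alpha means the test found no evidence that a and b differ.
    return bool(pvalue > alpha), float(statistic), float(pvalue)

print(same_distribution(sample_a, sample_b))
print(same_distribution(sample_a, shifted))
```

A p-value above `alpha` only means no difference was detected; it does not prove the two samples share a distribution.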

## Missing Data

In [None]:
train = train.fillna("NA")

Result:

- Certain features have missing data. It should be ignorable when it is less than 10% of the total sample (rule of thumb 1).
- Missing values are filled with "NA" even though they are ignorable.

Note: MCAR, MAR, and MNAR are the categories of missing-data mechanisms, and each calls for a different handling strategy. No related Python package was found, so all nominal features are simply filled with "NA".
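
The 10% rule of thumb can be checked per column before filling. The sketch below is a minimal example on a made-up frame; `toy`, `missing_report`, and the `threshold` parameter are hypothetical names introduced here.

```python
import numpy as np
import pandas as pd

# Toy frame with invented values and a few missing entries.
toy = pd.DataFrame({
    "brand_name": ["Nike", np.nan, "Apple", np.nan, "Coach"],
    "category_name": ["Men/Shoes", "Women/Tops", np.nan, "Kids/Toys", "Women/Bags"],
    "shipping": [0, 1, 1, 0, 0],
})

def missing_report(df, threshold=0.10):
    """List columns with missing values and flag those above the threshold."""
    frac = df.isnull().mean()  # share of missing rows per column
    report = pd.DataFrame({
        "missing_frac": frac,
        "above_threshold": frac > threshold,
    })
    # Keep only columns that actually have missing data, worst first.
    return report[report["missing_frac"] > 0].sort_values("missing_frac", ascending=False)

report = missing_report(toy)
print(report)
```

Columns flagged as above the threshold may deserve a closer look rather than a blanket fill.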

# Target analysis (first things first)

In [None]:
train["price"].describe([.25, .5, .75,.99])

**Price**
- More than 1.48M samples 
- Price range: {0~2009}
    - Mean price: 26.7
    - 50% of sample prices are below 17
    - 99% of sample prices are below 170

In [None]:
fig,ax = plt.subplots(1,2,figsize=(10,5))
sns.distplot(train["price"],ax=ax[0])
sns.distplot(np.log1p(train["price"]),ax=ax[1])
#train["price"].hist(bins=50,ax=ax[0])
#train["price"].apply(np.log1p).hist(bins=50,ax=ax[1])

Description: The price distribution is extremely right-skewed in the left figure. After the log1p transform, it is roughly normal in shape.
 - Remember to transform the price with log1p before training.
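
A short sketch of the transform and its inverse, using a few hand-picked prices rather than the real data. If a model is trained on the log1p price, its predictions would need `np.expm1` to get back to the price scale.

```python
import numpy as np

# Hand-picked example prices, including zero (log1p handles 0 without error).
prices = np.array([0.0, 3.0, 17.0, 170.0, 2009.0])

log_prices = np.log1p(prices)    # log(1 + price)
recovered = np.expm1(log_prices) # exp(x) - 1, the inverse of log1p

print(log_prices.round(3))
print(recovered)
```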

In [None]:
train["log_price"] = train["price"].apply(np.log1p)
train["log_price"].head()

## Zero Price

In [None]:
def eda_zeroprice():
    zero_train = train[train.price==0]
    print("Zero Price item qty {0}".format(zero_train.shape[0]))

    for col in zero_train.columns:
        supply_price_curve(col,train=zero_train)

eda_zeroprice()

Report on zero price:

- Zero-price frequency decreased as more products were supplied under category (main, cat_1, cat_2), brand, and item_description.
- Zero-price frequency increased as more products were supplied under item_condition and brand_name.

# Exploratory Data Analysis

## Supply(competition) of one feature vs Price

In [None]:
train.shape

In [None]:
for col in ["brand_name","category_name","name"]:
    supply_price_curve(col,train =train)

**Assumption:**

- The assumption is that supply and demand influence the price: high supply with low demand pushes the price down, and low supply with high demand pushes the price up.
- So if a single category or brand has a great number of related products, the price cannot be high. If a single category or brand offers only a unique product, the price should be somewhat higher.
- A further statistical test might be needed to see whether the assumption is significant.
    
Report:

   - The product quantity per category_name showed the pattern above (the assumption).
   - The product quantity per brand_name did not show the pattern strongly. If the top frequencies are removed, it shows some pattern.
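
One possible way to run the further statistical test mentioned in the assumption is a Spearman rank correlation between log supply and log price. The sketch below uses synthetic data in which the price is built to fall with supply, so its numbers say nothing about Mercari; `sizes`, `toy`, and the price formula are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)

# Three synthetic categories of very different size (supply).
sizes = {"rare": 20, "medium": 200, "common": 2000}
frames = []
for name, n in sizes.items():
    # Assumed relationship: mean log price drops as the category gets larger.
    base = 4.0 - 0.3 * np.log(n)
    frames.append(pd.DataFrame({
        "category_name": name,
        "log_price": rng.normal(base, 0.5, size=n),
    }))
toy = pd.concat(frames, ignore_index=True)

# Same frequency feature as in supply_price_curve: log of the per-value count.
counts = toy.groupby("category_name")["category_name"].transform("count")
toy["log_supply"] = np.log(counts)

# Rank correlation: negative rho means higher supply goes with lower price.
rho, pvalue = stats.spearmanr(toy["log_supply"], toy["log_price"])
print("rho={0:.3f}, pvalue={1:.3g}".format(rho, pvalue))
```

Spearman is used here because it only assumes a monotonic relationship; a different test may suit the real data better.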

    

## Shipping feature

In [None]:
eda_plot(train,"shipping")

Report: 

- Shipping has two values (0, 1); 0 is more frequent than 1.
- Shipping vs. price:
    - The mean price with no shipping is higher than the mean price with shipping.
    - Outliers identified (0 and >7).


## item_condition_id

In [None]:
eda_plot(train,"item_condition_id")

Report: 

- item_condition_id has five values (1, 2, 3, 4, 5): 2 is the most frequent, 3 and 4 have moderate frequency, 5 is low, and 1 has almost zero frequency.
- item_condition vs. price:
    * All conditions have a similar median value of around 3 (though the median of condition 5 is the highest).
    * Conditions (1, 2, 3) have both upper and lower outliers. Conditions (4, 5) have only upper outliers.
    

## Brand

In [None]:
price_of_brand = train.groupby("brand_name")["price"].agg(["count","mean","std","max","min",])

### Top5 high frequency brand - price distribution vs categories_counter

In [None]:
def top5_plot():
    top5_frequency_brand = train.brand_name.value_counts().head().index

    for col in top5_frequency_brand:
        mask = train.brand_name == col
        train["cat_count"] =train[mask].groupby("category_name")["category_name"].transform("count")
        sns.boxplot(data = train[mask], y = "price",x = "cat_count")
        plt.show()
top5_plot()

### Pie chart of product quantity vs brand

In [None]:
def brand_count_price_plot():
    
    brand_count = train["brand_name"].value_counts()
    brand_NA_count =brand_count["NA"]
    print("Total {0} brands in this sampling".format(brand_count.shape[0]))
    print("{0:.2f}% of goods have an unknown brand name".format(brand_NA_count/brand_count.sum()*100))
    #brand_count.plot.pie()


    
    print(price_of_brand.sort_values(by="count").tail().sort_values(by="count",ascending=False))
    sns.distplot(np.log(brand_count),kde=False,bins=100)  # log scale keeps low-frequency brands visible; without it they are hard to see
brand_count_price_plot()

Report:

- 2500 brands have a small product group (fewer than 100 products), while some brands have a great number of products (around 3K~5K).
- Brand "NA": has more than 63K products. More than 40% of products have no brand, and their price varies from 0 to 1528.
- Other brands: the more product records, the higher the standard deviation and the gap between max and min.
- **Zero price: many products are sold at zero price, maybe as a promotion. This could be studied later.**
- **Brands with a high price standard deviation need further study.**

#### Add brand_name_NA feature 

In [None]:
def mark_NA_brand():

    mask = train["brand_name"] == "NA"
    train["brand_name_NA"] = 0
    train.loc[mask,"brand_name_NA"] = 1
mark_NA_brand()

### top10  stdev price - brand

In [None]:
def top10_stdev_price_brand():
    
    mask = price_of_brand["count"] >1
    brand_price_top10_std = price_of_brand.loc[mask].sort_values(by="std").tail(10).index
    mask = train["brand_name"].isin( brand_price_top10_std)
    eda_plot(train.loc[mask],"brand_name","price", order =brand_price_top10_std)
top10_stdev_price_brand()

###  top10 mean price -brand

In [None]:
def top10_mean_price_brand():
    brand_price_top10_mean =  price_of_brand.sort_values(by="mean").tail(10).index
    mask = train["brand_name"].isin( brand_price_top10_mean)
    eda_plot(train.loc[mask],"brand_name","price",order=brand_price_top10_mean)
top10_mean_price_brand()

### top10 most expensive price - brand

In [None]:
def top10_expensive_brand_price():
    brand_price_top10_max = price_of_brand.sort_values(by="max").tail(10).index
    mask = train["brand_name"].isin( brand_price_top10_max)
    eda_plot(train.loc[mask],"brand_name","price",order=brand_price_top10_max)
top10_expensive_brand_price()

###  Zero Price - top10 cheapest price -brand

In [None]:
def zero_price_brand():
    Brand_Zero_price= price_of_brand.sort_values(by="min").head(10)
    print(Brand_Zero_price[["min","count"]].sort_values(by="count"))
    brand_price_top10_min =Brand_Zero_price.index
    mask = train["brand_name"].isin( brand_price_top10_min)
    eda_plot(train.loc[mask],"brand_name","price",order=brand_price_top10_min)
zero_price_brand()

## Categories

### Overall categories info

In [None]:
def cat_summary():
    print("Total {0} categories were found in this sampling\n".format(train["category_name"].value_counts().shape[0]))
    cat_count = train.category_name.value_counts()
    sns.distplot(cat_count.apply(np.log),kde=False,bins=100)
    print(cat_count.head())

cat_summary()

Report:
    
- A total of 1000 category names were found in this sampling.
    - More than 700 category names have only around 10 products each.
    - Certain categories have more than 1000 related products. The largest product quantity was higher than 6000.
- Women/Beauty related categories have the highest count in the sample, more than 6000.
- The category_name can be split into levels, as it contains "/".

### Categories breakdown

In [None]:
def brk_categories(df):
    cat_split = df["category_name"].str.split("/",expand=True).\
    rename(columns={0:"cat_0",1:"cat_1",2:"cat_2",3:"cat_3",4:"cat_4"})
    cat_split = cat_split.apply(lambda x:x.astype("category"))
    if not hasattr(df,"cat_0"):
        df = pd.concat([df,cat_split],axis=1)
    return df

train = brk_categories(train)


Note:
    
- Breaking the main category into detail categories leads to missing data (NaN) at the lowest levels. This is by design and can be ignored.
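
The missing lowest levels can be seen on a tiny example. The three category strings below are illustrative inputs chosen here, not a sample drawn from the data; the third is the "NA" fill value used for missing categories.

```python
import pandas as pd

cats = pd.Series([
    "Women/Tops & Blouses/Blouse",
    "Electronics/Computers & Tablets/iPad/Tablet/eBook Readers",
    "NA",
], name="category_name")

# One column per level; shorter paths are padded with missing values.
levels = cats.str.split("/", expand=True)
levels.columns = ["cat_{0}".format(i) for i in range(levels.shape[1])]

print(levels)
print(levels.isnull().sum())  # missing count grows toward the deepest levels
```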

In [None]:
tmp = train.cat_1.value_counts()
#sns.distplot(tmp,kde=False,bins=10)
#sns.boxplot("cat_0",y="log_price",data=train,)
eda_plot(df=train, col="cat_0")

In [None]:
for col in ["cat_0","cat_1","cat_2","cat_3"]:
    supply_price_curve(col,train =train)

# Further exploration of pricing tactics

Explore the item_description to see whether the pricing tactic tags below could be extracted in the future:
    
    ARC/RRC Pricing
    Complementary Pricing
    Contingency Pricing
    Differential pricing
    Discrete pricing
    Discount pricing
    Diversionary pricing
    Everyday low prices
    Exit fees
    Experience curve pricing
    Geographic pricing
    Guaranteed pricing
    High-low pricing
    Honeymoon pricing
    Loss leader
    Offset pricing
    Parity pricing
    Price bundling
    Peak and off-peak pricing
    Price discrimination
    Price lining
    Penetration pricing
    Prestige pricing
    Price signalling
    Price skimming
    Promotional pricing
    Two-part pricing
    Psychological pricing
    Premium pricing
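
As a first, rough idea of how such tags might be pulled out of item_description, the sketch below matches a few keyword patterns. The patterns, the tag names, and the example descriptions are all guesses made up for illustration; they are not derived from the data, and real extraction would likely need a much richer vocabulary.

```python
import re
import pandas as pd

# Hypothetical keyword patterns for three of the tactics listed above.
TACTIC_PATTERNS = {
    "price_bundling": r"\bbundle[sd]?\b",
    "discount_pricing": r"\b(?:discount(?:ed)?|\d+\s*% off)\b",
    "promotional_pricing": r"\b(?:sale|promo(?:tion)?)\b",
}

def tag_tactics(descriptions):
    """Return one boolean column per tactic, True where a keyword matches."""
    tags = pd.DataFrame(index=descriptions.index)
    for tactic, pattern in TACTIC_PATTERNS.items():
        # na=False treats missing descriptions as "no match".
        tags[tactic] = descriptions.str.contains(
            pattern, flags=re.IGNORECASE, regex=True, na=False
        )
    return tags

# Invented descriptions, including one missing value.
toy_desc = pd.Series([
    "Bundle of 3 shirts, 20% off if you buy today",
    "Brand new, never used",
    "On SALE this week only",
    None,
])
tags = tag_tactics(toy_desc)
print(tags)
```

The resulting boolean columns could then be passed to `eda_plot` like any other feature to compare against log_price.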