# Dummy Variables vs Label Encoding Approach for Mercari Price Suggestion Challenge

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from collections import Counter
import seaborn as sns
from sklearn import preprocessing

Loading the data

In [None]:
train = pd.read_csv('../input/train.tsv', sep='\t')
test = pd.read_csv('../input/test.tsv', sep='\t')

## Let's explore the data first

In [None]:
train.head(5)

In [None]:
train.dtypes

In [None]:
test.head(5)

In [None]:
test.dtypes

In [None]:
# Checking for nulls: show the first rows with a missing value in any text (object) column
obj = train.select_dtypes(include=['object']).copy()
train[obj.isnull().any(axis=1)].head(5)

The data contains several categorical variables. The ones we are going to focus on are brand_name and category_name, because each of their values is shared by more than one item, as the code below shows.

In [None]:
train["category_name"].value_counts().head()

In [None]:
train["brand_name"].value_counts().head()

## Dummy Variables Approach

category_name is a categorical variable, so one option is to turn it into dummy (one-hot) variables and check the correlation between each of them and the price.

In [None]:
df_dummies = pd.get_dummies(train['category_name'])
df_dummies.head()

In [None]:
df_new = pd.concat([train['price'], df_dummies], axis=1)
df_new.head()

This doesn't seem very helpful, because the resulting table is very wide (one column per distinct category), so we are going to use the label encoding approach instead.
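
As an aside, the width of the dummy table can be kept manageable by encoding only the most frequent categories. The sketch below is not part of the original analysis: it uses a small made-up frame in place of `train`, and the names `toy`, `top_n` and the `"Other"` bucket are assumptions chosen for illustration. It shows how the indicator columns could be correlated with the price once the table is small enough to inspect.

```python
import pandas as pd

# Toy data standing in for train[['category_name', 'price']] (values are made up)
toy = pd.DataFrame({
    "category_name": ["A", "A", "B", "B", "C", "A"],
    "price": [10.0, 12.0, 30.0, 34.0, 5.0, 11.0],
})

# Keep only the top_n most frequent categories; everything else becomes "Other"
top_n = 2
top_categories = toy["category_name"].value_counts().nlargest(top_n).index
reduced = toy["category_name"].where(toy["category_name"].isin(top_categories), "Other")

# One indicator column per retained category, cast to int so it is treated as numeric
dummies = pd.get_dummies(reduced).astype(int)

# Pearson correlation of each indicator column with the price
correlations = dummies.corrwith(toy["price"])
print(dummies.shape)
print(correlations.sort_values())
```

On the real data, `top_n` would have to be chosen by looking at `value_counts()`; whether this gives a useful picture depends on how much of the data falls into the `"Other"` bucket.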

## Label Encoding

Scikit-learn provides an efficient tool for encoding the levels of a categorical feature as numeric values. LabelEncoder encodes labels with values between 0 and n_classes-1, all in a single column.
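
To make the mapping concrete, here is a minimal sketch on a handful of made-up brand names (the list and the name `toy_encoder` are illustrative, not taken from the dataset). It shows that the distinct labels are sorted, that each one receives its position in that sorted list as its code, and that the mapping can be reversed.

```python
from sklearn import preprocessing

toy_encoder = preprocessing.LabelEncoder()
codes = toy_encoder.fit_transform(["Nike", "Apple", "Nan", "Nike"])

# Distinct labels, stored in sorted order
print(list(toy_encoder.classes_))

# Each label is replaced by its index in classes_
print(list(codes))

# The original labels can be recovered from the codes
print(list(toy_encoder.inverse_transform(codes)))
```

Note that the codes follow alphabetical order only; a larger code does not mean a "larger" brand or category.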

In [None]:
train["category_name"].value_counts().head()

In [None]:
train["brand_name"].value_counts().head()

In [None]:
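# Use one encoder per column so that each mapping can be inverted later
brand_encoder = preprocessing.LabelEncoder()
category_encoder = preprocessing.LabelEncoder()
# Missing values are replaced with the placeholder string 'Nan' before encoding
train["brand_name"] = brand_encoder.fit_transform(train["brand_name"].fillna('Nan'))
train["category_name"] = category_encoder.fit_transform(train["category_name"].fillna('Nan'))
train.head()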
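
The cell above encodes only the training set. If the test set is to be encoded as well, keep in mind that `LabelEncoder.transform` raises a `ValueError` for labels it did not see during `fit`. One possible way around this, sketched below on made-up data, is to fit a single encoder on the labels of both sets. The names `toy_train`, `toy_test` and `shared_encoder` are assumptions for illustration, and this is not necessarily how the original analysis continues.

```python
import pandas as pd
from sklearn import preprocessing

toy_train = pd.Series(["Nike", None, "Apple"])
toy_test = pd.Series(["Apple", "Sony"])  # "Sony" never appears in toy_train

# Same placeholder for missing values as in the notebook
train_filled = toy_train.fillna("Nan")
test_filled = toy_test.fillna("Nan")

# An encoder fitted on the training labels alone cannot handle "Sony"
train_only_encoder = preprocessing.LabelEncoder().fit(train_filled)
try:
    train_only_encoder.transform(test_filled)
    unseen_label_failed = False
except ValueError:
    unseen_label_failed = True

# Fitting on the labels of both sets gives one consistent mapping
shared_encoder = preprocessing.LabelEncoder()
shared_encoder.fit(pd.concat([train_filled, test_filled]))
train_codes = shared_encoder.transform(train_filled)
test_codes = shared_encoder.transform(test_filled)

print(unseen_label_failed)
print(list(shared_encoder.classes_))
print(list(train_codes), list(test_codes))
```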

This is great! Now we can start our analysis and look for relationships between the price and the encoded categorical variables.
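
As a possible first step, the sketch below compares two views of the relationship on a made-up frame (the column name `category_code` and all values are hypothetical). Because label codes reflect only the sorted order of the labels, a single Pearson coefficient between code and price can be hard to interpret, so a per-code price summary is shown alongside it.

```python
import pandas as pd

# Toy stand-in for the encoded training data (values are made up)
toy = pd.DataFrame({
    "category_code": [0, 0, 1, 1, 2],
    "price": [10.0, 14.0, 30.0, 34.0, 5.0],
})

# Number of items and average price for each encoded category
price_by_category = toy.groupby("category_code")["price"].agg(["count", "mean"])
print(price_by_category)

# Pearson correlation between the arbitrary integer codes and the price
pearson = toy["category_code"].corr(toy["price"])
print(pearson)
```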