# Men's Shoe Prices 
The goal of this kernel is to complete the **[task](https://www.kaggle.com/sureshmecad/mens-shoe-prices/tasks?taskId=4437)** associated with the dataset. The specific goals include:
> - What is the average price of each distinct brand listed?
> - Which brands have the highest prices?
> - Which ones have the widest distribution of prices?
> - Is there a typical price distribution (e.g., normal) across brands or within specific brands?
> - Correlate specific product features with changes in price.

One thing to note before we begin: this dataset includes a lot of products that aren't actually shoes! Items such as watches and other accessories seem to have been mistakenly included during the original data gathering process (likely due to mislabeled categories). Fully correcting for this would require a decent amount of manual work or more advanced regex and NLP, which is outside the scope of this kernel. As a result, please take the final outputs with a grain of salt; they may not be fully reflective of reality beyond this dataset. That being said, the general code and methodologies used within should still be a useful learning experience!

Let's dive in!

# Imports & Settings

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
sns.set_theme(context='talk')

# Data Cleaning / Preprocessing

## Summary Info
To start, let's get a general feel for the data we're working with.

In [None]:
df = pd.read_csv('../input/mens-shoe-prices/train.csv')
df.info()

In [None]:
df.head()

In [None]:
df.describe()

In [None]:
round(df.isna().sum() / df.shape[0], 3)

## Dropping Columns
The data is fairly messy as it currently stands. There are lots of missing values (some columns are even entirely empty) and various data types need to be corrected. Given the limited number of questions being answered for this task, a good chunk of the available data is irrelevant. Right off the bat, let's eliminate any columns that are missing 80% or more of their values. This will help reduce the size of the data we're working with and make further cleaning a bit easier.

In [None]:
cols_to_drop = [col for col in df.columns if df[col].isna().sum() >= 0.8*df.shape[0]]
cols_to_drop

In [None]:
df.drop(columns=cols_to_drop, inplace=True)
df.columns
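
As a quick sanity check on the thresholding idea, here is a minimal sketch on a made-up toy frame (not the shoe data). It assumes the same 80% cutoff and uses `isna().mean()` to get the missing fraction per column; the names `toy`, `missing_frac` and `toy_reduced` are purely illustrative.

```python
import numpy as np
import pandas as pd

# Toy frame (illustrative only): column 'b' is entirely missing,
# column 'c' is missing one value out of five.
toy = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [np.nan] * 5,
    'c': ['x', None, 'y', 'z', 'w'],
})

# isna().mean() gives the fraction of missing values per column.
missing_frac = toy.isna().mean()

# Keep only the columns below the 80% threshold.
toy_reduced = toy.loc[:, missing_frac < 0.8]

print(missing_frac)
print(toy_reduced.columns.tolist())
```

Column `b` is dropped while `c` survives, since it is only 20% missing.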

Starting to get somewhere, but there are still more columns to drop:

In [None]:
df.drop(columns=['id', 'dateadded', 'dateupdated', 'descriptions', 'ean', 
                 'features', 'imageurls', 'keys', 'manufacturer',
                 'manufacturernumber', 'merchants', 'prices_condition',
                 'prices_dateadded', 'prices_dateseen', 'prices_issale',
                 'prices_merchant', 'prices_offer', 'prices_shipping',
                 'prices_sourceurls', 'sizes', 'skus', 'sourceurls', 'upc'],
       inplace=True)
df.info()

## Adjusting Data Types

Let's add some finishing touches by adjusting the data types:

In [None]:
df.prices_amountmin = pd.to_numeric(df.prices_amountmin, errors='coerce', downcast='float')
df.prices_amountmax = pd.to_numeric(df.prices_amountmax, errors='coerce', downcast='float')
df.info()
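
One thing worth knowing about `errors='coerce'` is that anything that can't be parsed silently becomes `NaN`. The sketch below shows this on a few made-up values (not taken from the dataset) and counts how many entries were lost to coercion; `raw`, `parsed` and `n_coerced` are illustrative names.

```python
import pandas as pd

# Illustrative raw values: two clean numbers, one stray string,
# and one entry that was already missing.
raw = pd.Series(['49.99', '120', 'USD', None])

# errors='coerce' turns anything unparseable into NaN instead of raising.
parsed = pd.to_numeric(raw, errors='coerce')

# Entries that were present before but are NaN after coercion.
n_coerced = parsed.isna().sum() - raw.isna().sum()

print(parsed)
print('values lost to coercion:', n_coerced)
```

Running a count like this on the real price columns would be one way to see how many rows the conversion affects.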

In [None]:
df.head()

Much better! 

## Cleaning Final Dataframe
### Prices
The two columns for price (`prices_amountmin` and `prices_amountmax`) caught my eye since the first five rows in the dataframe have the same value for both. To investigate this further, let's check what percentage of the rows have the same value in both columns.

In [None]:
# Fraction of rows where the min and max prices are identical
(df.prices_amountmin == df.prices_amountmax).mean()

Over 96%! As a result, I'm okay with combining the two columns into a single column called `price` by taking the average.

In [None]:
df['price'] = np.mean([df.prices_amountmin, df.prices_amountmax], axis=0)
df_cleaned = df.drop(columns=['prices_amountmin', 'prices_amountmax'])
df_cleaned.head()

Next, let's check the value counts for the `prices_currency` column.

In [None]:
df_cleaned.prices_currency.value_counts()

It appears that some data was incorrectly placed in the `prices_currency` column in the original dataset. The overwhelming majority is priced in USD though. We could either work on converting the other currencies to USD for accurate comparisons, or simply drop those rows. For the sake of simplicity in this kernel, let's drop the rows.

In [None]:
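# .copy() makes df_cleaned an independent frame, so later in-place edits
# don't trigger SettingWithCopyWarning
df_cleaned = df_cleaned[df_cleaned.prices_currency == 'USD'].copy()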
df_cleaned.prices_currency.value_counts()
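
For reference, the conversion route we skipped might look roughly like the sketch below. The exchange rates are hypothetical placeholders (not real quotes), the toy listings are made up, and `rates_to_usd` and `price_usd` are names introduced only for this illustration.

```python
import pandas as pd

# Toy listings; the currencies and prices are illustrative only.
toy = pd.DataFrame({
    'price': [100.0, 80.0, 50.0, 30.0],
    'prices_currency': ['USD', 'CAD', 'EUR', 'not a currency'],
})

# HYPOTHETICAL exchange rates to USD: placeholders, not real quotes.
rates_to_usd = {'USD': 1.0, 'CAD': 0.5, 'EUR': 2.0}

# map() returns NaN for codes missing from the table, so junk values
# in prices_currency end up with a NaN price_usd.
toy['price_usd'] = toy['price'] * toy['prices_currency'].map(rates_to_usd)

# Drop the rows whose currency could not be converted.
converted = toy.dropna(subset=['price_usd'])

print(converted)
```

A real version would also need rates that match the date each price was recorded, which is part of why dropping the rows is the simpler choice here.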

### Missing Values

In [None]:
df_cleaned.isna().sum()

The missing values in the `color` column are fine for now, but the 17 missing values in the `brand` column need to be handled.

In [None]:
df_cleaned.dropna(axis=0, subset=['brand'], inplace=True)
df_cleaned.isna().sum()

### Standardizing Brands
Finally, let's change all of the brand names to lowercase. This is to help prevent any variations in capitalization from being classified as different brands. While a more serious method of cleaning the brand names would need to be employed in any sort of production code, this will suffice for now.

In [None]:
# The vectorized .str accessor avoids apply() and its deprecated convert_dtype argument
df_cleaned.brand = df_cleaned.brand.str.lower()
df_cleaned.brand.head()
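
To give a flavor of what a slightly more serious cleanup could look like, here is a rough sketch. `normalize_brand` is a hypothetical helper, the example spellings are made up, and the set of corporate suffixes it strips is an assumption rather than anything derived from the data.

```python
import pandas as pd

# Illustrative brand spellings (not taken from the dataset).
brands = pd.Series(['Nike', 'NIKE ', 'nike, inc.', "Levi's", 'levis'])

def normalize_brand(s: pd.Series) -> pd.Series:
    """Hypothetical helper: lowercase, strip punctuation and a few
    corporate suffixes, then tidy up whitespace."""
    s = s.str.lower()
    s = s.str.replace(r'[^\w\s]', '', regex=True)           # drop punctuation
    s = s.str.replace(r'\b(inc|llc)\b', '', regex=True)     # drop common suffixes
    s = s.str.replace(r'\s+', ' ', regex=True).str.strip()  # tidy whitespace
    return s

cleaned = normalize_brand(brands)
print(cleaned.tolist())
print('distinct brands:', cleaned.nunique())
```

Rules like these can also merge brands that really are different, so the output would need a manual look before being trusted.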

# Answering the Questions
## What is the average price of each distinct brand listed?
To begin answering this question, let's first take a look at how many unique brands are in the data.

In [None]:
df_cleaned.brand.nunique()

Wow! That's way too many to effectively visualize. Let's limit the brands we look at to only those with more than 50 rows of data. This will help eliminate random, unheard of brands with only a couple entries as well as entries which aren't actually

In [None]:
brands_above_50 = df_cleaned.brand.value_counts()[df_cleaned.brand.value_counts() > 50]
brands_above_50

In [None]:
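# Select the price column explicitly: recent pandas versions raise an error
# when mean() is applied to non-numeric columns
df_brands = df_cleaned[df_cleaned.brand.isin(brands_above_50.index)].groupby('brand')[['price']].mean()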
df_brands.sort_values('price', ascending=False, inplace=True)
df_brands

In [None]:
fig = plt.figure(figsize=(8, 16))
sns.barplot(data=df_brands, x='price', y=df_brands.index, palette='crest')
plt.title('Average Price by Brand')
plt.xlabel('Price')
plt.ylabel('Brand')
plt.annotate(text='Note: only brands with more than 50 observations are included.',
             xy=(60, 50),
             fontsize=12);

## Which brands have the highest prices?

In [None]:
df_max_prices = df_cleaned[df_cleaned.brand.isin(brands_above_50.index)].sort_values('price', ascending=False)
df_max_prices = df_max_prices.drop_duplicates('brand')
df_max_prices[['brand', 'name', 'price']]

In [None]:
fig = plt.figure(figsize=(8, 16))
sns.barplot(data=df_max_prices, x='price', y='brand', palette='crest')
plt.title('Max Price by Brand')
plt.xlabel('Price')
plt.ylabel('Brand')
plt.annotate(text='Note: only brands with more than 50 observations are included.',
             xy=(300, 50),
             fontsize=12);


## Which ones have the widest distribution of prices?

In [None]:
df_medians = df_cleaned[df_cleaned.brand.isin(brands_above_50.index)]
# Select the price column explicitly so median() only sees numeric data
df_medians = df_medians.groupby('brand')[['price']].median()
df_medians = df_medians.sort_values('price', ascending=False)
df_medians.head()

The answer to this question is up for debate depending on the interpretation of the question. "Widest distribution" could be defined in a number of ways including:
- Standard deviation (absolute or as a percentage of the mean)
- The interquartile range (IQR)
- Difference between maximum and minimum values

Most of this information can be captured by utilizing boxplots. The boxplots could be ordered in a number of ways, but using the median tends to yield a fairly orderly result.

In [None]:
fig = plt.figure(figsize=(8, 16))
sns.boxplot(data=df_cleaned[df_cleaned.brand.isin(brands_above_50.index)],
            x='price',
            y='brand',
            order=df_medians.index,
            palette='crest',
            orient='h')
plt.title('Boxplots for Price by Brand')
plt.xlabel('Price')
plt.ylabel('Brand')
plt.annotate(text='Note: only brands with more than 50 observations are included.',
             xy=(300, 50),
             fontsize=12);
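
The three definitions of spread listed above can also be computed directly, which makes it easier to rank brands than reading them off the boxplots. Below is a minimal sketch on toy prices for two made-up brands; the helper names `iqr` and `value_range` and the `cv` column are assumptions introduced for this example.

```python
import pandas as pd

# Toy prices for two made-up brands (illustrative only).
toy = pd.DataFrame({
    'brand': ['a'] * 4 + ['b'] * 4,
    'price': [10.0, 20.0, 30.0, 40.0, 24.0, 25.0, 25.0, 26.0],
})

def iqr(x):
    """Interquartile range: 75th percentile minus 25th percentile."""
    return x.quantile(0.75) - x.quantile(0.25)

def value_range(x):
    """Difference between the maximum and minimum values."""
    return x.max() - x.min()

# Named aggregation: one output column per spread measure.
spread = toy.groupby('brand')['price'].agg(
    mean='mean', std='std', iqr=iqr, value_range=value_range
)

# Standard deviation as a fraction of the mean (coefficient of variation).
spread['cv'] = spread['std'] / spread['mean']

print(spread.sort_values('iqr', ascending=False))
```

In the notebook itself, swapping `toy` for the filtered `df_cleaned` would give one row per brand, and sorting by a different column shows how the ranking depends on the chosen definition.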

## Is there a typical price distribution (e.g., normal) across brands or within specific brands?

In [None]:
df_all_dist = df_cleaned[df_cleaned.brand.isin(brands_above_50.index)]
fig = plt.figure(figsize=(8, 6))
sns.kdeplot(df_all_dist['price'], clip=(0, None))
plt.title('Distribution of Prices')
plt.xlabel('Price');
plt.annotate(text='Note: only brands with more than 50 observations are included.',
             xy=(300, 0.0065),
             fontsize=12);

The distribution of all prices for brands with more than 50 observations is heavily skewed to the right. Let's take a look at what happens when we apply a log transformation:

In [None]:
fig = plt.figure(figsize=(8, 6))
sns.kdeplot(df_all_dist['price'], clip=(0, None), log_scale=True)
plt.title('Distribution of Prices')
plt.xlabel('Log Price');

So it appears that the prices are approximately log-normally distributed, at least judging visually from the KDE plot. We can check the distribution of an individual brand with more than 50 observations by creating a simple function.

In [None]:
def plot_brand_distribution(brand, log=False):
    '''
    Description: Takes a brand name and boolean value for log and plots a KDE plot for the distribution
    of that specific brand's prices (log prices if log == True).
    
    Inputs:
    - brand : str
         The name of the brand to visualize. Must have more than 50 observations in the original data set.
         Case insensitive as it will automatically be lowercased.
    
    - log : boolean, default = False
         Determines whether a log transformation should be applied to the prices.
    '''
    fig = plt.figure(figsize=(8, 6))
    sns.kdeplot(df_all_dist[df_all_dist.brand == brand.lower()]['price'], clip=(0, None), log_scale=log)
    plt.title(f'Distribution of {brand.title()}\'s Prices')
    plt.xlabel('Log Price' if log else 'Price')
    plt.show();

In [None]:
plot_brand_distribution('nike', False)

In [None]:
plot_brand_distribution('Adidas', True)
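
The KDE plots only give a visual impression, so a quick numeric check can help. One simple option is to compare skewness before and after taking logs. The sketch below uses a synthetic log-normal sample as a stand-in for the prices (it is not the shoe data, and the parameters are arbitrary); in the notebook, `df_all_dist['price']` would take its place, assuming all prices are positive.

```python
import numpy as np
import pandas as pd

# Synthetic log-normal sample standing in for prices (NOT the shoe data).
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=4.0, sigma=0.6, size=2000))

raw_skew = prices.skew()          # strongly positive for right-skewed data
log_skew = np.log(prices).skew()  # near zero if the data are log-normal

print(f'skewness of prices:     {raw_skew:.2f}')
print(f'skewness of log prices: {log_skew:.2f}')
```

A log skewness close to zero is consistent with a log-normal distribution but doesn't prove it; a Q-Q plot or a formal test such as `scipy.stats.shapiro` on the log prices would be a natural next step.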

# Conclusion
That's it for this kernel! Correlating prices with specific product attributes may be added in a future update, but would require significantly more data cleaning and preprocessing.

**If you liked this notebook or have any feedback, please let me know in the comments. I'm always looking to improve!**