## 1. Dataset information
### **1.1. Context**

Studying top products requires more than just product listings. You also need to know what sells well and what does not.

### **1.2. Content**
This dataset contains product listings as well as product ratings and sales performance, which you would not find in other datasets.

With this, you can finally start to look for correlations and patterns regarding the success of a product and the various components.

### **1.3. Inspiration**
* How about trying to validate the established idea of human sensitivity to price drops? (Compare the discounted price with the original retail_price.)
* You may look for the top product categories to learn what sells best.
* Do bad products sell? What is the relationship between the quality of a product (ratings) and its success? Does the price factor into this?

### **1.4. Collection Methodology**

The data comes from the Wish platform.
Basically, the products listed in the dataset are those that would appear if you type "summer" in the search field of the platform.

You can browse the Wish website or app to get a feel for the type of information available there and how it is presented. This might give you some ideas and a better understanding.

If you are confused about some columns, you can either look at the column descriptions, browse the Wish website/app, or you can ask in the comments.

The data was scraped with French settings, hence the presence of some non-ASCII Latin characters such as « é » and « à » in the title column.

### **1.5. Features and Columns**

Because the data was scraped with the French localisation, the title column contains some non-ASCII Latin characters such as « é » and « à ».

The title_orig column, on the other hand, contains the original title (the base title) that is displayed by default. When a translation is provided by the seller, it appears in the title column. When the title and title_orig columns are the same, it generally means that the seller did not specify a translation to display to users with French settings.
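
As a rough sketch of the rule above, a translated listing can be flagged by comparing the two columns. The two-row frame and the `has_translation` column name are made up for illustration and are not part of the dataset.

```python
import pandas as pd

# Toy rows standing in for the real listings (illustrative values only)
sample = pd.DataFrame({
    "title": ["Robe d'été pour femmes", "Summer Casual Shorts"],
    "title_orig": ["Women Summer Dress", "Summer Casual Shorts"],
})

# A differing title suggests the seller supplied a translation
sample["has_translation"] = sample["title"] != sample["title_orig"]
print(sample["has_translation"].tolist())
```

On the real data the same comparison would run on `df['title']` and `df['title_orig']`, before those columns are dropped.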

A picture is worth a thousand words. In the following screenshot you see some features and how to interpret them.

<img src="https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F1488294%2F308810459ae5232399672ba3eef228ef%2Fannotated-search-results-wish-website.jpg?generation=1598785563117062&alt=media"></img>

### **1.6. Data Dictionary**

* title: Title localized for European countries. May be the same as title_orig if the seller did not offer a translation
* title_orig: Original English title of the product
* price: price you would pay to get the product
* retail_price: reference price for similar articles on the market, or in other stores/places. Used by the seller to indicate a regular value or the price before discount.
* currency_buyer: currency of the prices
* units_sold: Number of units sold. Lower bound approximation by steps
* uses_ad_boosts: Whether the seller paid to boost their product within the platform (highlighting, better placement, or similar)
* rating: Mean product rating
* rating_count: Total number of ratings of the product
* rating_five_count: Number of 5-star ratings (rating_four_count, rating_three_count, rating_two_count and rating_one_count are defined the same way)
* badges_count: Number of badges the product or the seller has
* badge_local_product: A badge that denotes the product is a local product. Conditions may vary (being produced locally, or something else). Some people may prefer buying local products. 1 means yes, the product has the badge
* badge_product_quality: Whether the product has the quality badge
* badge_fast_shipping: Whether the product has the fast-shipping badge
* tags: tags set by the seller	
* product_color: Product's main color
* product_variation_size_id: One of the available size variation for this product	
* product_variation_inventory: Inventory the seller has. Max allowed quantity is 50
* shipping_option_name	
* shipping_option_price: shipping price
* shipping_is_express: whether the shipping is express or not. 1 for True
* countries_shipped_to: Number of countries this product is shipped to. Sellers may choose to limit where they ship a product to	
* inventory_total: Total inventory for all the product's variations (size/color variations for instance)	
* has_urgency_banner: whether there was an urgency banner with an urgency text
* urgency_text: A text banner that appears over some products in the search results.
* origin_country	
* merchant_title: Merchant's displayed name (shown in the UI as the seller's shop name)
* merchant_name: Merchant's canonical name. A name not shown publicly. Used by the website under the hood as a canonical name. Easier to process since it is all lowercase without white space
* merchant_info_subtitle: The subtitle text as shown in a seller's info section (raw, not preprocessed). The website shows this to give the user an overview of the seller's stats. Mostly consists of `% <positive_feedbacks> (<rating_count> reviews)` written in French
* merchant_rating_count: Number of ratings of this seller
* merchant_rating: merchant's rating
* merchant_id: merchant unique id
* merchant_has_profile_picture: Convenience boolean that says whether there is a `merchant_profile_picture` url
* merchant_profile_picture: Custom profile picture of the seller (if the seller has one). Empty otherwise.
* product_url: url to the product page. You may need to login to access it
* product_picture	
* product_id: product identifier. You can use this key to remove duplicate entries if you're not interested in studying them.
* theme: the event of the sales: the search term used in the search bar of the website to get these search results.
* crawl_month: time of the crawling: meta: for info only.

## **2. Preparing the tools**

### **2.1. Loading libraries**

In [None]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 
import plotly.express as px
import seaborn as sns
import re 

### **2.2. Load and Inspect Data**

In [None]:
df = pd.read_csv('../input/summer-products-and-sales-in-ecommerce-wish/summer-products-with-rating-and-performance_2020-08.csv')
print(df.shape)
df.head().T

Based on the dataset, a few interesting questions deserve further analysis: 
* What can we infer from price and retail price? We expect the percent discount calculated from these two variables to correlate positively with the number of units sold. 
* Does using ad boosts increase success? 
* Is there a correlation between badge counts and rating? 
* How does inventory total relate to units sold? 
* What can we learn from the merchant metadata? Does having a merchant profile picture increase badge counts, ratings, or success? 
* Do the shipping options differ in any meaningful way? 

In [None]:
df.info()

In [None]:
df.isnull().sum()

We can make a few observations from this summary of null values: 
* Although rating_count has no null values, the individual counts for five, four, three, two and one stars do. Do these nulls indicate a count of zero for that specific rating? 
* Although merchant_has_profile_picture has no null values, merchant_profile_picture has 1347 null values. 
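
One way to probe the first question is to check whether the rows with missing per-star counts are exactly the rows where rating_count is zero. The sketch below uses an invented three-row frame and only two of the five star columns. It illustrates the check and is not a result from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the pattern in question (values are invented)
toy = pd.DataFrame({
    "rating_count": [12, 0, 0],
    "rating_five_count": [7, np.nan, np.nan],
    "rating_one_count": [5, np.nan, np.nan],
})

star_cols = ["rating_five_count", "rating_one_count"]
missing_stars = toy[star_cols].isna().any(axis=1)

# If every row with missing star counts has rating_count == 0,
# reading the missing values as zero is a reasonable interpretation
nulls_mean_zero = bool((toy.loc[missing_stars, "rating_count"] == 0).all())
print(nulls_mean_zero)
```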

## 3. Exploratory Data Analysis (EDA) 

### 3.1. Clean and prepare data 

Drop duplicates 

In [None]:
df.duplicated().sum()

In [None]:
df.drop_duplicates(inplace=True) 

In [None]:
# Columns with null data 
columns_nan = df.columns[df.isnull().sum() > 0]
columns_nan

In [None]:
# Plot null data 
columns_null_pct = (df[columns_nan].isnull().sum() / df.shape[0]).sort_values(ascending=False) * 100
plt.figure(figsize=(10, 6))
sns.barplot(x=columns_null_pct, y=columns_null_pct.index)
plt.title("Percent null in columns with null values")
plt.show()

### Inspect each column

* merchant_profile_picture, product_picture, and product_url

In [None]:
df['merchant_profile_picture'].value_counts()

This column contains links to merchant profile pictures. They do not give us useful information unless we do some image processing, which would be too complicated at this stage. We already have a column that indicates whether a listing's merchant has a profile picture, so let's stick with that. Similarly, we can drop product_picture and product_url.  

In [None]:
df.drop(['merchant_profile_picture', 'product_picture', 'product_url'], axis=1, inplace=True) 

### has_urgency_banner and urgency_text

In [None]:
df['has_urgency_banner'].value_counts()

'has_urgency_banner' column by definition is a binary column to indicate whether a product listing has an urgency banner or not. We replace NaN values with 0 to indicate no urgency banner. 

In [None]:
df['has_urgency_banner'] = df['has_urgency_banner'].replace(np.nan, 0)

In [None]:
df['urgency_text'].value_counts()

In [None]:
df['urgency_text'] = df['urgency_text'].replace({
    'Quantité limitée !' : 'Limited Quantity', 
    'Réduction sur les achats en gros': 'Wholesale discount', 
    np.nan: 'None'
})
df['urgency_text'].value_counts()

### title, title_orig, crawl_month, theme 

In [None]:
df['title'].value_counts()

In [None]:
df['title_orig'].value_counts()

In [None]:
df['crawl_month'].value_counts()

In [None]:
df['theme'].value_counts()

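Since crawl_month has only one unique value, 2020-08 (the month the data was crawled), we can remove this column. The theme column holds the search term used for the crawl, and title and title_orig are free text that this analysis does not use, so we drop them in the same step. 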

In [None]:
df.drop(['title', 'title_orig', 'crawl_month', 'theme'], axis=1, inplace=True) 

In [None]:
df['shipping_option_name'].value_counts(normalize=True)

Since the column shipping_option_name has low variance (more than 95% of the values in the column belong to a specific category), we can drop this column. 

In [None]:
df.drop(['shipping_option_name'], axis=1, inplace=True)
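
The 95% rule used above can be wrapped in a small helper so the same check can be reused for other columns. This is only a sketch: `dominant_share` is a hypothetical name, the series is invented, and counting NaN as its own value is an assumption that differs from the plain `value_counts(normalize=True)` call in the notebook.

```python
import pandas as pd

def dominant_share(series: pd.Series) -> float:
    """Share of rows taken by the most frequent value (NaN counted as a value)."""
    return series.value_counts(normalize=True, dropna=False).iloc[0]

# Toy column: 97 of 100 rows share one value (invented data)
shipping = pd.Series(["Livraison standard"] * 97 + ["Standard Shipping"] * 3)

share = dominant_share(shipping)
is_low_variance = bool(share > 0.95)
print(round(share, 2), is_low_variance)
```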

### product_variation_size_id

In [None]:
pd.options.display.max_rows = 1000
df['product_variation_size_id'].value_counts()

In [None]:
size_col = 'product_variation_size_id'
# Assign the result back instead of using inplace=True on a column selection
df[size_col] = df[size_col].replace(['S', 'S.', 's', 'Size S', 'Size-S', 'Size S.', 'Suit-S', 'size S','S Pink', 'pants-S', 'US-S', 'SIZE S', 'S (waist58-62cm)', 'Size--S', '25-S', 'Size/S', 'S Diameter 30cm', 'S..', 'S(Pink & Black)'], 'S')
df[size_col] = df[size_col].replace(['XS', 'XS.', 'SIZE XS', 'Size-XS'], 'XS')
df[size_col] = df[size_col].replace(['XXS', 'XXXS', 'SIZE-XXS', 'Size -XXS', 'Size XXS', 'Size-XXS', 'SIZE XXS'], 'XXS+')
df[size_col] = df[size_col].replace(['M', 'M.', 'Size M'], 'M')
df[size_col] = df[size_col].replace(['L', 'SizeL', '32/L', 'L.', 'Size-L'], 'L')
# '2XL' is grouped with XXL+ below, and '1 PC - XL' stays here; each label appears in one list only
df[size_col] = df[size_col].replace(['XL', '1 PC - XL', 'X   L'], 'XL')
df[size_col] = df[size_col].replace(['XXL', '4XL', '2XL', 'Size4XL', '3XL', 'XXXXXL', 'SIZE-4XL', '04-3XL', 'Size-5XL', 'XXXXL', '5XL', 'XXXL'], 'XXL+')
size_val_counts = df[size_col].value_counts()
# Group sizes that appear 5 times or fewer, and missing sizes, under "Other"
to_change = size_val_counts[size_val_counts <= 5].index
df.loc[df[size_col].isin(to_change), size_col] = "Other"
df[size_col] = df[size_col].fillna("Other")

In [None]:
plt.figure(figsize=(10, 6))
ax = sns.countplot(x = 'product_variation_size_id',
                   order = df['product_variation_size_id'].value_counts().index,
                   palette= "Set2",
                   data=df)
ax.set(xlabel='Size', ylabel='Count')

plt.show()

In [None]:
df['product_variation_size_id'].value_counts()
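
The hand-written lists above could also be approximated with a regular expression. The sketch below is an alternative under stated assumptions, not the mapping used in this notebook: `normalize_size` and `SIZE_TOKEN` are hypothetical names, and labels such as 'X   L' (with spaces inside the token) would not be handled the same way as in the lists.

```python
import re

# Optional digit multiplier, optional run of X, then the base letter
SIZE_TOKEN = re.compile(r"\b(\d?)(X*)([SML])\b")

def normalize_size(raw: str) -> str:
    """Map a messy size label to one of the notebook's size buckets."""
    text = raw.upper().replace("SIZE", " ")
    match = SIZE_TOKEN.search(text)
    if match is None:
        return "Other"
    digit, xs, base = match.groups()
    # '3XL' and 'XXXL' both mean three X's
    n_x = int(digit) if digit else len(xs)
    if base == "M" or n_x == 0:
        return base
    if n_x == 1:
        return "X" + base
    return "XXS+" if base == "S" else "XXL+"

examples = ["Size-S", "SizeL", "04-3XL", "SIZE XS",
            "S (waist58-62cm)", "XXXXL", "Size -XXS", "10 ml"]
print({label: normalize_size(label) for label in examples})
```

A pattern like this trades explicitness for coverage, so it is worth comparing its output against `value_counts()` before relying on it.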

### product_color 

In [None]:
df['product_color'].value_counts()

In [None]:
color_to_change = {
'Black' : 'black',
'coolblack': 'black',
'White': 'white',
'offwhite': 'white', 
'bluue': 'blue',
'prussianblue': 'blue',
'navyblue': 'blue', 
'navy blue': 'blue',
'lightblue': 'blue',
'skyblue': 'blue',
'darkblue': 'blue',
'navy' : 'blue',
'bluee': 'blue',
'denimblue': 'blue', 
'lakeblue': 'blue', 
'Blue': 'blue',
'gold': 'yellow',
'lightyellow': 'yellow',
'winered': 'red',
'rosered': 'red',
'watermelonred': 'red',
'RED': 'red',
'wine red': 'red',
'rose': 'red',
'orange-red': 'red',
'Rose red': 'red',
'wine': 'red',
'coralred': 'red',
'burgundy': 'red', 
'lightred': 'red', 
'lightpink': 'pink',
'Pink': 'pink',
'dustypink': 'pink',
'armygreen':'green',
'khaki': 'green',
'lightgreen': 'green',
'fluorescentgreen': 'green',
'army green': 'green',
'applegreen': 'green',
'Army green': 'green',
'mintgreen': 'green',
'army': 'green', 
'lightkhaki': 'green', 
'darkgreen': 'green', 
'light green': 'green', 
'lightgray': 'grey', 
'apricot': 'orange',
'violet': 'purple',
'lightpurple': 'purple', 
'gray': 'grey',
'silver': 'grey',
'coffee': 'brown', 
'blackwhite': 'dual', 
 np.nan: 'other'}

In [None]:
def update_color(color): 
    # Handle missing values first so the string checks below never see NaN
    if pd.isna(color): 
        return 'other'
    if color in color_to_change: 
        return color_to_change[color]
    if color in color_to_change.values(): 
        return color
    if '&' in color: 
        return 'dual'
    return 'other'
df['product_color'] = df['product_color'].apply(update_color)

In [None]:
pd.options.display.max_rows = 50
df['product_color'].value_counts()

In [None]:
plt.figure(figsize=(15, 12))
ax = sns.countplot(x = 'product_color',
                   order = df['product_color'].value_counts().index,
                   palette= "Set2",
                   data=df)
ax.set(xlabel='Color', ylabel='Count')
plt.xticks(rotation=45, ha='right')
plt.show()

We will be inspecting: 
* rating_five_count
* rating_four_count
* rating_three_count
* rating_two_count
* rating_one_count



In [None]:
rating_cols = ['rating_five_count', 'rating_four_count', 'rating_three_count', 'rating_two_count', 'rating_one_count']
rating_na_df = df.loc[df[rating_cols].isna().any(axis=1), rating_cols]
rating_na_df.head()

It seems that some product listings have no ratings at all, so we will replace the NaN values with 0. 

In [None]:
# Fill missing per-star counts with 0 for all five rating columns at once
df[rating_cols] = df[rating_cols].fillna(0)

In [None]:
df.loc[rating_na_df.index, rating_cols].head()

### origin_country

In [None]:
df['origin_country'].value_counts(normalize=True)

Since the column origin_country has low variance (more than 95% of the values in the column belong to a specific category), we can drop this column. 

In [None]:
df.drop(['origin_country'], axis=1, inplace=True) 

### merchant_info_subtitle, merchant_title and merchant_name, merchant_rating, merchant_rating_count

* merchant_title: Merchant's displayed name (shown in the UI as the seller's shop name)
* merchant_name: Merchant's canonical name. A name not shown publicly. Used by the website under the hood as a canonical name. Easier to process since it is all lowercase without white space
* merchant_info_subtitle: The subtitle text as shown in a seller's info section (raw, not preprocessed). The website shows this to give the user an overview of the seller's stats. Mostly consists of % <positive_feedbacks> (<rating_count> reviews) written in French
* merchant_rating_count: Number of ratings of this seller
* merchant_rating: merchant's rating

In [None]:
df["merchant_id"].value_counts()

In [None]:
df["merchant_title"].value_counts()

In [None]:
df["merchant_info_subtitle"].value_counts()

In [None]:
def getPercentage(x): 
    match = re.search(r'\d+%', str(x))
    if match is None:
        return None
    else:
        return float(match.group().rstrip("%"))

In [None]:
df['merchant_info_subtitle'] = df['merchant_info_subtitle'].str.replace(' ', '')
df['merchant_positive_pct'] = df['merchant_info_subtitle'].apply(getPercentage)
df['merchant_positive_pct'].head()

In [None]:
# Assign back rather than calling fillna(inplace=True) on a column selection
df['merchant_positive_pct'] = df['merchant_positive_pct'].fillna(df['merchant_positive_pct'].mean())
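
The extraction and mean imputation above can also be written in vectorized form with `Series.str.extract`. This is a minimal sketch on invented subtitles. The exact wording of the real subtitles is an assumption; only the `<number>%` shape comes from the data dictionary.

```python
import pandas as pd

# Invented subtitles in the shape described by the data dictionary
subtitles = pd.Series([
    "83 % avis positifs (17,752 notes)",
    "90% avis positifs (295 notes)",
    None,
])

# Capture the digits before the percent sign, allowing optional spaces
positive_pct = subtitles.str.extract(r"(\d+)\s*%", expand=False).astype(float)

# Fill gaps with the column mean, mirroring the notebook's choice
positive_pct = positive_pct.fillna(positive_pct.mean())
print(positive_pct.tolist())
```

Allowing optional whitespace in the pattern removes the need to strip spaces from the subtitle beforehand.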

In [None]:
df.drop(['merchant_info_subtitle', 'merchant_title', 'merchant_name', 'merchant_id'], axis=1, inplace=True)

### currency_buyer

In [None]:
df['currency_buyer'].value_counts()

In [None]:
df.drop(['currency_buyer'], axis=1, inplace=True) 

### units_sold

It seems that units_sold is not a continuous variable, as we might have expected. 

In [None]:
df['units_sold'].value_counts()

In [None]:
df.loc[df['units_sold'] < 10, 'units_sold'] = 10

In [None]:
df['units_sold'].value_counts()

In [None]:
plt.figure(figsize=(10, 6))
ax = sns.countplot(x = 'units_sold',
                   order = df['units_sold'].value_counts().index,
                   palette= "Set2",
                   data=df)
ax.set(xlabel='Units sold', ylabel='Count')
plt.xticks(rotation=45, ha='right')
plt.show()

Let's define a product listing with units_sold greater than 1000 as successful. 

In [None]:
def is_success(units_sold):
    if units_sold > 1000: 
        return 1
    else: 
        return 0 

In [None]:
df['is_success'] = df['units_sold'].apply(is_success) 
df['is_success'].value_counts()
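
The same label can be computed without a helper function by comparing the whole column at once. The values below are invented; on the real data the comparison would be applied to `df['units_sold']`.

```python
import pandas as pd

units_sold = pd.Series([10, 100, 1000, 5000, 20000])

# Strictly greater than 1000, matching the definition above
is_success = (units_sold > 1000).astype(int)
print(is_success.tolist())  # [0, 0, 0, 1, 1]
```

Note that a listing with exactly 1000 units sold is not counted as successful under this definition.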

In [None]:
ax = sns.countplot(x = 'is_success',
                   palette= "Set2",
                   data=df)
plt.show()

### price and retail price

In [None]:
df[['price', 'retail_price']].describe()

In [None]:
df['percent_discount'] = (df['retail_price'] - df['price']) / df['retail_price'] * 100
df[['retail_price', 'price', 'percent_discount']].head()
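
Written out, the discount computed above is

$$
\text{percent\_discount} = \frac{\text{retail\_price} - \text{price}}{\text{retail\_price}} \times 100
$$

As a worked example with made-up numbers, a listing with retail_price = 20 and price = 8 gives (20 - 8) / 20 × 100 = 60. When price exceeds retail_price the value is negative: retail_price = 10 and price = 12 gives (10 - 12) / 10 × 100 = -20.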

In [None]:
sns.violinplot(data=df, y='percent_discount', x='is_success')

### tags

In [None]:
df['tags'].head()

In [None]:
category = pd.read_csv('../input/summer-products-and-sales-in-ecommerce-wish/unique-categories.sorted-by-count.csv')
print(category.shape)
category.head()

There are 2620 unique categories. 

In [None]:
fig_dims = (15, 10)
fig, ax = plt.subplots(figsize=fig_dims)
sns.barplot(x = 'keyword',
            y = 'count',
            data = category.iloc[:25],
            ax = ax)
ax.set(xlabel='Keyword', ylabel='Count')
plt.xticks(rotation=45, ha='right')
plt.show()

In [None]:
df['n_tags'] = df['tags'].apply(lambda x: len(x.split(",")))
df['n_tags'].head()

We could use an NLP library to implement category-based text tagging. However, since I am not yet familiar with NLP, I will skip this for now (and revisit it later).
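
Short of a full NLP pipeline, simply counting tags already gives a rough picture of the categories. The sketch below uses invented tag strings and the standard-library `Counter`. It is a starting point under those assumptions, not the tagging approach deferred above.

```python
from collections import Counter

import pandas as pd

# Invented tag strings in the comma-separated shape of the tags column
tags = pd.Series([
    "Summer,Fashion,dress,Women",
    "summer,Shorts,Fashion",
    "Beach,Summer",
])

# Lowercase, split on commas, and count every tag across all listings
tag_counts = Counter(
    tag.strip()
    for row in tags.str.lower()
    for tag in row.split(",")
)
print(tag_counts.most_common(2))
```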

In [None]:
id_counts = df['product_id'].value_counts()
duplicated_ids = id_counts[id_counts > 1].index 
df[df['product_id'].isin(duplicated_ids)].sort_values("product_id").head().T

In [None]:
df.drop(['product_id'], axis = 1, inplace=True)

The differences between listings that share a product_id seem to be in the has_urgency_banner and urgency_text columns.
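
If these repeated listings are not of interest, the data dictionary suggests using product_id as the key to remove them. A minimal sketch on invented rows follows. In this notebook it would have to run before product_id is dropped, and keeping the first occurrence is an arbitrary choice.

```python
import pandas as pd

# Invented rows: the same product scraped twice with a different banner
listings = pd.DataFrame({
    "product_id": ["a1", "a1", "b2"],
    "units_sold": [100, 100, 5000],
    "has_urgency_banner": [1.0, 0.0, 0.0],
})

# Keep the first occurrence of each product_id
deduped = listings.drop_duplicates(subset="product_id", keep="first")
print(len(listings), "->", len(deduped))
```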

In [None]:
df.isnull().sum()

### 3.2. Visualization

In [None]:
from wordcloud import WordCloud, STOPWORDS

word_string=" ".join(df['tags'].str.lower())
wordcloud = WordCloud(stopwords=STOPWORDS).generate(word_string)
plt.subplots(figsize=(15,15))
plt.clf()
plt.imshow(wordcloud)
plt.axis('off')
plt.show()

In [None]:
df.drop(['tags'], axis=1, inplace=True)

In [None]:
plt.subplots(figsize=(10,6))
sns.lineplot(data=df, x='n_tags', y='units_sold', ci=None)
plt.show()

In [None]:
px.histogram(df['rating'], nbins=10)

In [None]:
px.histogram(df, x='rating', color='uses_ad_boosts', nbins=10)

In [None]:
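df['bins_rating'] = df['rating'].apply(lambda x: 0.5*np.floor(x/0.5))
# Average only units_sold; averaging every column fails on text columns in recent pandas
avg_units = df.groupby('bins_rating')['units_sold'].mean().reset_index()
px.histogram(avg_units, x='bins_rating', y='units_sold', nbins=10,
            labels={'units_sold' : 'Average units sold'})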

In [None]:
sales_df = df[['product_variation_size_id','units_sold']]
sales_df = sales_df.groupby('product_variation_size_id').mean() 
sales_df.sort_values(by='units_sold', ascending=False, inplace=True) 
sales_df = sales_df.reset_index() 

px.bar(sales_df, x='product_variation_size_id', y='units_sold', title='Average sales by product size')

### Uses ad boosts vs unit_sold, retail_price, rating 

In [None]:
df[['units_sold', 'retail_price', 'rating', 'uses_ad_boosts']].groupby('uses_ad_boosts').mean()

### Use ad boosts vs success

In [None]:
sns.countplot(data=df, x='uses_ad_boosts', hue='is_success')

There does not seem to be much difference in success between products that use ad boosts and those that do not. In fact, the average number of units sold is lower for products using ad boosts. 

### Badge product quality vs is_success

In [None]:
sns.countplot(data=df, x='badge_product_quality', hue='is_success')

### Bar plot of discount range 

In [None]:
df['discount_pct_range'] = pd.cut(df['percent_discount'], bins=np.arange(-20, 110, 10), right=False)

In [None]:
plt.subplots(figsize=(10,5))
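# Use the bin categories directly; sorting unique() can fail if any discount falls outside the bins (NaN)
sns.countplot(data=df, x='discount_pct_range', order=list(df['discount_pct_range'].cat.categories))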
plt.show()

In [None]:
discount_df = df[['discount_pct_range', 'units_sold']].groupby('discount_pct_range').mean()
discount_df = discount_df.reset_index()
plt.subplots(figsize=(10,5))
sns.barplot(data=discount_df, x='discount_pct_range', y='units_sold')
plt.show()

There seems to be no strong correlation between percent discount and units_sold 
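
Because units_sold only takes a handful of stepped values, a rank-based measure such as Spearman's correlation is one way to put a number on this impression. The sketch below computes it as Pearson's correlation on ranks, using invented values. On the real data the two columns would be `df['percent_discount']` and `df['units_sold']`.

```python
import pandas as pd

# Invented values standing in for the two notebook columns
toy = pd.DataFrame({
    "percent_discount": [0, 10, 25, 50, 80],
    "units_sold": [100, 1000, 100, 5000, 1000],
})

# Spearman's rho is Pearson's r computed on ranks (ties get average ranks)
ranks = toy.rank()
rho = ranks["percent_discount"].corr(ranks["units_sold"])
print(round(rho, 3))
```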

In [None]:
df.describe().T

In [None]:
# Keep numeric columns only; corr() raises on text and category columns in recent pandas
features = df.select_dtypes(include="number").columns
features

### Correlation Heat Map

In [None]:
px.imshow(df[features].corr(), width=1000, height=1000)

In [None]:
corr = df[features].corr()
x = corr[['is_success']]
hm = sns.heatmap(x.sort_values(by='is_success', ascending=False), vmin=-1, vmax=1)
hm.set_title('Features Correlating with is_success')

### Correlation

* There is a strong positive correlation between units_sold and rating_count. Because the individual rating counts are also strongly correlated with it, we will remove those columns.

In [None]:
df.drop(['rating_five_count', 'rating_four_count', 'rating_three_count',
       'rating_two_count', 'rating_one_count', 'discount_pct_range'], axis=1, inplace=True)

In [None]:
df.columns

## 4. Modeling 

In [None]:
# get_dummies already inserts '_' after the prefix, so 'color' yields names like 'color_blue'
df = pd.get_dummies(df, columns=['product_color'], prefix='color', drop_first=True)
df = pd.get_dummies(df, columns=['product_variation_size_id'], prefix='size', drop_first=True)
df = pd.get_dummies(df, columns=['urgency_text'], drop_first=True)
df.head().T

In [None]:
from sklearn.model_selection import train_test_split 

X_train, X_test, y_train, y_test = train_test_split(df.drop(['is_success', 'units_sold'], axis=1), df['is_success'], test_size=0.2, random_state=1234, stratify=df['is_success'])

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(random_state=1)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

# scikit-learn expects the true labels first, then the predictions
accuracy_score(y_test, y_pred)
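
An accuracy score is easier to interpret next to the score of a model that always predicts the most frequent class. The sketch below shows that baseline on invented labels (7 unsuccessful, 3 successful). On the real split it would be computed from `y_test`.

```python
import pandas as pd

# Invented labels: 7 unsuccessful listings and 3 successful ones
y_test_toy = pd.Series([0] * 7 + [1] * 3)

# Accuracy of always predicting the most frequent class
baseline_accuracy = y_test_toy.value_counts(normalize=True).max()
print(baseline_accuracy)
```

A classifier is only informative to the extent that it beats this number.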

### Features importance

In [None]:
features = pd.DataFrame()
features['feature'] = X_train.columns
features['importance'] = clf.feature_importances_
features.sort_values(by=['importance'], ascending=True, inplace=True)
features.set_index('feature', inplace=True)

features.plot(kind='barh', figsize=(25, 25))

In [None]:
df.drop(['is_success'], axis = 1, inplace=True)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df.drop(['units_sold'], axis=1), df['units_sold'], test_size=0.2, random_state=1234, stratify=df['units_sold'])

### Compare multiple models 

In [None]:
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from xgboost.sklearn import XGBClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics

In [None]:
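import warnings
from sklearn.preprocessing import LabelEncoder

warnings.filterwarnings('ignore')

# Recent XGBoost releases expect class labels 0..n-1, so encode the units_sold buckets
encoder = LabelEncoder()
y_train_enc = encoder.fit_transform(y_train)
y_test_enc = encoder.transform(y_test)
bucket_ids = list(range(len(encoder.classes_)))
bucket_names = [str(c) for c in encoder.classes_]

models = [DecisionTreeClassifier(), XGBClassifier(),  
          GradientBoostingClassifier(), KNeighborsClassifier(), RandomForestClassifier()]

for model in models:
    model.fit(X_train, y_train_enc)
    y_pred = model.predict(X_test)

    print(model)
    print('---------------------------')
    # Report per-bucket metrics under the original units_sold labels
    print(metrics.classification_report(y_test_enc, y_pred,
                                        labels=bucket_ids, target_names=bucket_names))
    print('')
    print('')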

## 5. Conclusion

* In this dataset, using ad boosts shows no clear positive effect on the number of units sold, so sellers may lose money on running ads. 
* A more accurate analysis would be possible if the exact number of units sold were known. 
* Higher ratings tend to go with higher units sold. 
* Product quality badges seem to be associated with more successful products. 