# DATASET
# This dataset contains product listings as well as product ratings and sales performance, which you would not find in other datasets.

# With this, you can finally start to look for correlations and patterns regarding the success of a product and the various components.

# Features and Columns
* The data was scraped in the French localisation, hence some non-ASCII Latin characters such as « é » and « à » in the title column.
* The title_orig column, on the other hand, contains the original title (the base title) that is displayed by default. When a translation is provided by the seller, it appears in the title column. When the title and title_orig columns are the same, it generally means that the seller did not specify a translation to be displayed to users with French settings.
* A picture is worth a thousand words. In the following screenshot you see some features and how to interpret them.
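The relationship between title and title_orig can be checked directly by comparing the two columns. The following is a minimal sketch, assuming a small hypothetical frame named `sample` whose rows are made up for illustration and are not taken from the dataset:

```python
import pandas as pd

# Hypothetical rows standing in for the real listings (not taken from the dataset).
sample = pd.DataFrame({
    "title": ["Robe d'été pour femme", "Summer Beach Shorts", "Maillot de bain"],
    "title_orig": ["Women Summer Dress", "Summer Beach Shorts", "Swimsuit"],
})

# True where the seller supplied a translation (title differs from title_orig).
sample["has_translation"] = sample["title"] != sample["title_orig"]

# Share of listings with a translated title.
translated_share = sample["has_translation"].mean()
print(sample[["title", "has_translation"]])
print(f"Translated titles: {translated_share:.0%}")
```

On the real data the same comparison would be applied to `df['title']` and `df['title_orig']`.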

# ***THINGS YOU HAVE TO FOLLOW WHILE WALKING THROUGH THE WHOLE REPORT***

1. If you are familiar with Python, you can follow the code alone.
2. A simple explanation accompanies each visual or graph.
3. The report includes a brief conclusion.
4. Every explanation is presented below the output of the code it refers to.

***Enjoy***

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
!pip install pywaffle --quiet
from pywaffle import Waffle
from wordcloud import WordCloud

In [None]:
df= pd.read_csv("../input/summer-products-and-sales-in-ecommerce-wish/summer-products-with-rating-and-performance_2020-08.csv")

In [None]:
df.head()

In [None]:
df.describe()

In [None]:
df.info()

# WHAT WE CAN EXPECT

1. How different is 'price' from 'retail price', and what is the effect of the difference? We expect more units sold when the price is below the retail price, and fewer otherwise.
2. Do ad boosts increase success?
3. Is there any correlation between units sold and ratings?
4. What are badges? They look like 'awards' of some sort. Does success increase with the number of badges?
5. What is the effect of the different types of badges?
6. A brief analysis of product variations: do more variations lead to more success?
7. Shipping options analysis.
8. Analysis of inventory total and units sold.
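The first question above comes down to measuring the gap between the two price columns. One possible way to do that is sketched below; the column names match the dataset, but the frame `sample`, its values, and the derived columns `discount_pct` and `pricing` are assumptions made for illustration:

```python
import pandas as pd

# Hypothetical prices; the column names match the dataset, the values do not.
sample = pd.DataFrame({
    "price": [8.0, 12.0, 5.0],
    "retail_price": [10.0, 12.0, 4.0],
    "units_sold": [5000, 100, 1000],
})

# Positive values mean the listing is cheaper than its retail price.
sample["discount_pct"] = (
    (sample["retail_price"] - sample["price"]) / sample["retail_price"] * 100
)

# Label each listing so the two groups can be compared.
sample["pricing"] = sample["discount_pct"].gt(0).map(
    {True: "discounted", False: "not discounted"}
)

# Compare typical sales between discounted and non-discounted listings.
median_units = sample.groupby("pricing")["units_sold"].median()
print(median_units)
```

The median is used rather than the mean because units sold are heavily skewed in this kind of data.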

In [None]:
df.isnull().sum()

In [None]:
def plot_missing_data(df):
    columns_with_null = df.columns[df.isna().sum() > 0]
    null_pct = (df[columns_with_null].isna().sum() / df.shape[0]).sort_values(ascending=False) * 100
    plt.figure(figsize=(8,6));
    sns.barplot(y = null_pct.index, x = null_pct, orient='h')
    plt.title('% Na values in dataframe by columns');

In [None]:
plot_missing_data(df)

In [None]:
df['has_urgency_banner'] = df['has_urgency_banner'].replace(np.nan,0)
df['urgency_text']=df['urgency_text'].replace({'Quantité limitée !':'QuantityLimited',
                                               'Réduction sur les achats en gros':'WholesaleDiscount',
                                               np.nan:'noText'})


In [None]:
rating_columns = ['rating_one_count','rating_two_count','rating_three_count','rating_four_count','rating_five_count']
df[rating_columns] = df[rating_columns].fillna(value=-1)

In [None]:
df.loc[df['rating_five_count']==-1,'rating_count'].value_counts()
# rating_count is 0 wherever the per-star rating count columns are NaN, so fill those NaN values with 0

In [None]:
df[rating_columns]=df[rating_columns].replace(-1,0)

In [None]:
nan_cat_cols = ['origin_country','product_color','product_variation_size_id','merchant_name','merchant_info_subtitle']
df[nan_cat_cols] = df[nan_cat_cols].replace(np.nan,'Unknown')

In [None]:
df.columns[df.isna().sum()>0]

In [None]:
df= df.drop_duplicates()

In [None]:
print("Duplicate product_id :",df['product_id'].duplicated().sum())

In [None]:
plt.figure(figsize=(12,6))
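# distplot is deprecated in recent seaborn releases; histplot with kde=True is its replacement
sns.histplot(df['price'], kde=True, stat='density', color='red', label='Price')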

# right skewed distribution

In [None]:
plt.figure(figsize=(12,6))
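# distplot is deprecated in recent seaborn releases; histplot with kde=True is its replacement
sns.histplot(df['retail_price'], kde=True, stat='density', color='blue', label='Retail price')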


In [None]:
plt.figure(figsize=(12,6))
sns.boxplot(x=df["retail_price"])

In [None]:
plt.figure(figsize=(12,6))
sns.boxplot(x=df["price"])

# With box plots we can easily spot the outliers and quartiles

* The upper fence of price is at 18, i.e. most of the products are priced below 18.
* There is an item with a price of 49, clearly an outlier as it lies far beyond the interquartile range (Q3 - Q1).
* The box plot of retail price is much more spread out: there is a huge difference of 195 between the upper fence and the maximum data point.
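The fences and outliers read off the box plots can also be computed by hand. The sketch below applies the usual 1.5 × IQR rule to a hypothetical series named `prices`; the values are made up and only stand in for `df['price']`:

```python
import pandas as pd

# Hypothetical prices standing in for df['price'].
prices = pd.Series([2, 5, 6, 7, 8, 8, 9, 11, 12, 49], dtype=float)

q1 = prices.quantile(0.25)
q3 = prices.quantile(0.75)
iqr = q3 - q1

# Tukey's rule: points more than 1.5 * IQR beyond the quartiles count as outliers.
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# The upper whisker (the "upper fence") ends at the largest point inside the bound.
upper_fence = prices[prices <= upper_bound].max()
outliers = prices[(prices < lower_bound) | (prices > upper_bound)]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}, upper fence={upper_fence}")
print(outliers.tolist())
```

Note that plotting libraries may compute quartiles slightly differently, so the numbers can deviate a little from what a chart shows.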

In [None]:
country_price=df[['units_sold','origin_country']]
country_mean_price=country_price.groupby('origin_country')['units_sold'].mean().reset_index()
country_mean_price.rename(columns={'units_sold': 'units_sold_mean'},inplace=True)
to_codes={'CN':'CHN',
         'GB':'GBR',
         'SG':'SGP',
         'US':'USA',
         'VE':'VEN'}
country_mean_price['code']=country_mean_price['origin_country'].map(to_codes)
country_mean_price

# Singapore - China ==> higher average sales

In [None]:
color_sale=df.groupby('product_color')['units_sold'].sum()
color_sale=color_sale.reset_index().sort_values(by='units_sold',ascending=False)
color_sale

In [None]:
top_10_color_sale=color_sale.head(10)

In [None]:
fig=px.bar(data_frame=top_10_color_sale,
      x='product_color',
      y='units_sold')
fig.update_layout(title='Top 10 color sales')
fig.show()

In [None]:
px.scatter(df, x='units_sold', y='price',marginal_x='box', title='Price vs Units Sold')

# High price ==> fewer units sold
1. In some cases the price is low yet the units sold are still below average. Possibly the product did not meet buyers' expectations, or other factors we haven't touched yet are at play.
2. The median of units sold is 1000, so we can treat products with units sold up to 1000 (inclusive) as below average and products with units sold above 1000 as very successful.
# It totally depends on your business goals which price range you want to focus on.

In [None]:
from sklearn.cluster import KMeans

clusters = {}
for i in range(1,8):
    kmeans = KMeans(n_clusters=i).fit(df[['units_sold']])
    clusters[i] = kmeans.inertia_
    
plt.plot(list(clusters.keys()), list(clusters.values()));
plt.xlabel('no. of clusters');
plt.ylabel('kmeans inertia');

In [None]:
# Relabel clusters so that cluster ids follow the order of the target column's mean
def order_cluster(cluster_field_name, target_field_name,df,ascending):
    new_cluster_field_name = 'new_' + cluster_field_name
    df_new = df.groupby(cluster_field_name)[target_field_name].mean().reset_index()
    df_new = df_new.sort_values(by=target_field_name,ascending=ascending).reset_index(drop=True)
    df_new['index'] = df_new.index
    df_final = pd.merge(df,df_new[[cluster_field_name,'index']], on=cluster_field_name)
    df_final = df_final.drop([cluster_field_name],axis=1)
    df_final = df_final.rename(columns={"index":cluster_field_name})
    return df_final

In [None]:
df['units_sold_cluster'] = KMeans(n_clusters=3).fit(df[['units_sold']]).predict(df[['units_sold']])
df = order_cluster('units_sold_cluster','units_sold',df,True)
df.groupby(['units_sold_cluster'])['units_sold'].describe()


In [None]:
px.scatter(df,x='units_sold',y='rating', color='units_sold_cluster', marginal_y ='box',title='Rating vs units sold')

* The median rating is 3.85, and products in the top-selling cluster have ratings between 3.35 and 4.1, which seems very reasonable.
* Rating is very important in determining the potential of a product.
* Still, some products with a 5-star rating fail to get past the 100-1000 units sold range.
* There are some really badly performing products with ratings below 3.

In [None]:
px.scatter(df,x='retail_price', y='price',color='units_sold_cluster',marginal_y='box')

Most of the top-selling products seem to be concentrated on the left, where the price difference is much more significant.

In [None]:
px.scatter(df, x='price', y='shipping_option_price', color= 'units_sold_cluster', title='Shipping price vs Price')

People generally prefer to pay lower shipping charges; we can see that most top-selling products have low shipping charges.

In [None]:
df.groupby(['uses_ad_boosts'])['units_sold'].describe()

Consider these two groups of products: one uses ad boosts, the other doesn't.

* There is a very small difference between the means of the two groups.
* Does using ad boosts result in more successful products?
* How big is the difference between these two groups?
* Is the effect statistically significant?
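One way to approach the last two questions is a Mann-Whitney U test, which suits skewed data such as units sold because it does not assume normality. The sketch below is not part of the original analysis: it assumes SciPy 1.7 or later is available, and the arrays `with_boost` and `without_boost` are synthetic stand-ins, not the dataset's values.

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Synthetic, right-skewed "units sold" samples; these are NOT the dataset's values.
rng = np.random.default_rng(0)
with_boost = rng.lognormal(mean=7.0, sigma=1.5, size=300)
without_boost = rng.lognormal(mean=7.0, sigma=1.5, size=400)

# Two-sided test of whether one group tends to have larger values than the other.
stat, p_value = mannwhitneyu(with_boost, without_boost, alternative="two-sided")

# Rank-biserial correlation as an effect size (assumes stat is U for the first sample):
# 0 means no tendency either way, values near -1 or 1 mean a strong tendency.
effect_size = 2 * stat / (len(with_boost) * len(without_boost)) - 1

print(f"U={stat:.0f}, p={p_value:.3f}, effect size={effect_size:.3f}")
```

With the real data, the two samples would be `df.loc[df['uses_ad_boosts'] == 1, 'units_sold']` and `df.loc[df['uses_ad_boosts'] == 0, 'units_sold']`. A small p-value alone says little about practical importance, which is why the effect size is reported alongside it.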

In [None]:
rating_cols=['rating_count','rating_five_count','rating_four_count',
             'rating_three_count','rating_two_count','rating_one_count']
ratings_data=df[rating_cols+['uses_ad_boosts']]

ratings_data.groupby('uses_ad_boosts').describe()
fig = go.Figure()
for col in rating_cols:
    fig.add_trace(go.Box(x=ratings_data['uses_ad_boosts'],
                         y=ratings_data[col],
                         name=col,
                         boxmean=True,
                         boxpoints=False))
fig.update_traces(quartilemethod="exclusive")
fig.update_layout(boxmode='group',
                  title='Relations between ad boosts and rating',
                  xaxis = dict(
                  tickvals = [0,1],
                  ticktext = ['Without ad boosts','With ad boosts']))
fig.show()

By dividing the data into two groups, with and without ad boosts, we can see that, surprisingly, products without ad boosts gain a higher number of ratings on average; the same goes for the number of 5-, 4-, 3-, 2- and 1-star ratings.



In [None]:
def make_clusters(df,column):
    clusters = {}
    for i in range(1,8):
        kmeans = KMeans(n_clusters=i).fit(df[[column]])
        clusters[i] = kmeans.inertia_

    plt.plot(list(clusters.keys()), list(clusters.values()));
    plt.title(f'{column} clusters')
    plt.xlabel('no. of clusters');
    plt.ylabel('kmeans inertia'); 

In [None]:
df['rating_score'] = df['rating']*df['rating_count']
df['rating_score'] =df['rating_score']/df['rating_score'].max()
plt.figure(figsize=(12,6))
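# distplot is deprecated in recent seaborn releases; histplot with kde=True is its replacement
sns.histplot(df['rating_score'], kde=True, stat='density');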
plt.title('Distribution of Rating Score');

In [None]:
make_clusters(df,'rating_score')

In [None]:
kmeans = KMeans(n_clusters=3).fit(df[['rating_score']])
df['rating_score_cluster'] = kmeans.predict(df[['rating_score']])
df= order_cluster(df=df,cluster_field_name='rating_score_cluster',target_field_name='rating_score',ascending=True)
df.groupby('rating_score_cluster')[['rating','rating_count','units_sold']].describe().T

In [None]:
df['overall_score'] = df['rating_score_cluster'] + df['units_sold_cluster']
make_clusters(df,'overall_score');

In [None]:
kmeans= KMeans(n_clusters=2).fit(df[['overall_score']])
df['overall_score_cluster'] = kmeans.predict(df[['overall_score']])
df = order_cluster(df=df,target_field_name='overall_score', cluster_field_name='overall_score_cluster', ascending=True)
df.groupby('overall_score_cluster')[['rating_score','price','units_sold']].describe().T

* With this overall score we have identified two groups: top-selling, most-liked products, which are the ones generating high revenue, and products performing below average.
* There are 213 successful products with units sold ranging from 10K to 100K at a mean price of 8.45.
* In the other cluster the mean price is 8.34, but mean units sold are much lower.
* Another thing to notice is that people prefer a reasonable price: in the successful cluster the maximum price is 19, so products in this cluster must be worth the price.

In [None]:
badges_column = ['badges_count',
       'badge_local_product', 'badge_product_quality', 'badge_fast_shipping']

In [None]:
df[df['badges_count'] != 0][badges_column]

In [None]:
def is_successful(units_sold):
    if units_sold > 1000:
        return 1
    else:
        return 0


In [None]:
df['is_successful'] = df['units_sold'].apply(is_successful)
#df['is_successful'] = df['units_sold'].apply(is_successful).astype('category')
print('Percent of successful products: ', df['is_successful'].value_counts()[1] / len(df['is_successful'])*100)

In [None]:
for column in badges_column:
    sns.countplot(data=df, x=column, hue='is_successful')
    plt.title(column)
    plt.show()

# Almost 50% of the successful products have badges, especially the product quality badge.
# So let's assume badges contribute to success, although there are also many successful products without badges.
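The count plots show raw counts, so it can help to compare success rates between products with and without a badge. Note that this is the share of badge holders that are successful, which is a different quantity from the share of successful products that hold a badge. A minimal sketch using `pd.crosstab`, with a hypothetical frame `sample` whose values are illustrative only:

```python
import pandas as pd

# Hypothetical flags using the notebook's column names; values are illustrative only.
sample = pd.DataFrame({
    "badge_product_quality": [1, 1, 0, 0, 0, 0, 1, 0],
    "is_successful":         [1, 1, 1, 0, 0, 1, 0, 0],
})

# Row-normalised table: for each badge value, the share of (un)successful products.
success_by_badge = pd.crosstab(
    sample["badge_product_quality"],
    sample["is_successful"],
    normalize="index",
)
print(success_by_badge)

# Success rate among badge holders versus non-holders.
rate_with_badge = success_by_badge.loc[1, 1]
rate_without_badge = success_by_badge.loc[0, 1]
print(f"with badge: {rate_with_badge:.2f}, without badge: {rate_without_badge:.2f}")
```

On the real data the same table can be built from `df['badge_product_quality']` and `df['is_successful']`, and repeated for the other badge columns.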

In [None]:
plt.subplots(figsize=(20,35))
wordcloud = WordCloud(
                          background_color='Black',
                          width=1920,
                          height=1080
                         ).generate(" ".join(df.tags))
plt.imshow(wordcloud)
plt.axis('off')
plt.savefig('cast.png')
plt.show()

# THE END