# Analysis on Sales of summer clothes in E-commerce Wish
### This notebook is open for improvement

#### The goal of this notebook is to give a clear understanding of the data set and to help people build their own notebooks

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.preprocessing import LabelEncoder
from sklearn import metrics
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
import matplotlib
import matplotlib.pyplot as plt

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Loading the data

* **pd.read_csv** reads a .csv (comma-separated values) file and turns it into a data table.
* **data.head()** shows the first 5 rows of this data table.

#### You can copy a file path directly from the box that shows up when you hover your mouse over a file
#### Paste this file path between quotes so that pd.read_csv can load your data

In [None]:
data = pd.read_csv("../input/summer-products-and-sales-in-ecommerce-wish/summer-products-with-rating-and-performance_2020-08.csv")
data.head()
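
If you want to try **pd.read_csv** and **head()** without the Kaggle files, here is a minimal sketch. The CSV text and its column values are made up for illustration; only the two function calls come from the notebook.

```python
import io
import pandas as pd

# Made-up CSV text standing in for the real file
csv_text = """title,price,units_sold
Summer dress,8.0,100
Beach shorts,5.5,1000
Tank top,3.0,5000
"""

# read_csv accepts a file path or a file-like object
toy = pd.read_csv(io.StringIO(csv_text))
print(toy.shape)    # (rows, columns)
print(toy.head(2))  # head(n) returns the first n rows; the default is 5
```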

### Categories

Loading data about categories

Using **head(100)** to get the top 100 rows, which account for about 69% of the total **count** in this file (see the output of the cell below).
We will use this data to manipulate the tags of the original data set.

In [None]:
categoryData = pd.read_csv("../input/summer-products-and-sales-in-ecommerce-wish/unique-categories.sorted-by-count.csv")
categoryData['count'].head(100).sum()/categoryData['count'].sum()
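
The same "share of the top N" calculation on a tiny table, as a rough sketch. The keywords and counts below are invented; the real file is assumed to be sorted by count, as its name suggests.

```python
import pandas as pd

# Hypothetical keyword counts, already sorted from most to least frequent
toy_categories = pd.DataFrame({
    "keyword": ["Summer", "Fashion", "Women", "Casual", "Dress"],
    "count": [50, 30, 10, 6, 4],
})

top_n = 2
share = toy_categories["count"].head(top_n).sum() / toy_categories["count"].sum()
print(f"Top {top_n} keywords cover {share:.0%} of all counts")

# Cumulative share for every possible cut-off at once
cumulative_share = toy_categories["count"].cumsum() / toy_categories["count"].sum()
print(cumulative_share)
```

The cumulative version can help you pick a cut-off instead of guessing one.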

In [None]:
importantTags = categoryData.keyword.head(100).str.lower().tolist()

# Feature Selection & Generation
**data.columns** gives us the names of the columns; if we pass it to len(), we get the number of columns.

In [None]:
print("There are ",len(data.columns), " columns in this dataset.\nColumns:\n",data.columns)

Storing the important tags. Note that this cell keeps only the top 20 tags, while the 69% figure above was computed for the top 100. Don't forget about these, we will use them later!

In [None]:
importantTags = categoryData.keyword.head(20).str.lower().tolist()

## Dropping unnecessary columns

### Title and Title_orig
The **title** column is language dependent, while the **title_orig** column is in English.

Both columns definitely have an effect on the units sold, but it would require NLP and semantic analysis to turn them into meaningful features (sentiment, theme, or word popularity in the title could affect sales), so both columns will be dropped.

**data.drop** drops specific columns. We pass the columns we want to drop to the columns= parameter. It returns a new data table without these columns; you can store it in a different variable to preserve your original data table.

In [None]:
selectedFeatures = data.drop(columns=['title','title_orig'])

### What do the IDs represent?
A product gets its ID based on its merchant. In other words, two identical products have different IDs at different merchants, so we cannot use the product ID to measure a product's success, and we will drop that column. The cell below looks at the two ID columns.

In [None]:
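pid_dt = selectedFeatures['product_id']
mid_dt = selectedFeatures['merchant_id']
# len() is just the row count, so it is the same for both columns;
# nunique() counts distinct IDs, which is what tells us about repeats
print(len(pid_dt), len(mid_dt))
print(pid_dt.nunique(), mid_dt.nunique())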
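
To check whether any merchant sells more than one product, counting distinct values is more informative than the length of a column. A minimal sketch on invented IDs:

```python
import pandas as pd

# Made-up IDs: merchant m1 sells two products
toy = pd.DataFrame({
    "product_id": ["p1", "p2", "p3", "p4"],
    "merchant_id": ["m1", "m1", "m2", "m3"],
})

print(len(toy["merchant_id"]))       # number of rows, the same for every column
print(toy["merchant_id"].nunique())  # number of distinct merchants

# How many products each merchant has, keeping only merchants with more than one
products_per_merchant = toy["merchant_id"].value_counts()
multi_product_merchants = products_per_merchant[products_per_merchant > 1]
print(multi_product_merchants)
```

If `multi_product_merchants` is empty on the real data, every merchant appears exactly once.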


### Other unnecessary columns
* **merchant_id** is an identifier that means nothing to humans. We could use it to measure a merchant's popularity, but that only helps if merchants sell more than one item, and this notebook assumes they don't (the cell above is where you can check this). So we will drop this column and the other merchant identity columns (title, name, profile picture, info subtitle).
* **product_id**: see the explanation above; this will be dropped.
* **product_url** should not have an effect on sales numbers. Who knows, I *COULD* buy anything from a URL that contains ***THE BATMAN***, but we will drop this column too because it is not relevant.
* **product_picture** holds the link to the picture, not the picture itself. We could use the picture itself to define some kind of feature, but not its link.
* **urgency_text** has info about the product, but it is not categorical data: the merchant can write anything to describe the urgency of the product, so the values are too scattered to extract anything useful.
* **shipping_option_name** can vary from language to language, so we will use the shipping price instead.

In [None]:
selectedFeatures=selectedFeatures.drop(columns=['merchant_id','merchant_title',
                                                'merchant_name','merchant_profile_picture',
                                                'merchant_info_subtitle','product_id','product_url',
                                                'product_picture','shipping_option_name','urgency_text'])
selectedFeatures.head()

## Analyzing other columns
We will analyze the columns that we have not dropped yet and then generate new features from them.

### price to retail_price

Does the retail price compared to the price have an effect on total sales? As you can see below, people on Wish don't really care about the price they pay. Our newly generated column "ret_to_price_ratio" is the ratio of the retail price to the actual price.

The variable **rptp** is short for "retail price to price". selectedFeatures[[COLUMNS]] returns a data table with the specified columns, in our case "price" and "retail_price". Then we generate a new column on the selectedFeatures data table and name it "ret_to_price_ratio".

In [None]:
rptp = selectedFeatures[['price','retail_price']]
selectedFeatures['ret_to_price_ratio'] = rptp['retail_price']/rptp['price']
selectedFeatures[['ret_to_price_ratio','units_sold']].sort_values(by=['units_sold'],ascending = False)

### Does the rating to rating count ratio mean anything?
As you can see in the plot below, there is a negative correlation between the rating / rating count ratio and sales.

# Overfitting example
I wanted to show an example of overfitting and bad feature extraction practice.
As you can see below, we assign units_sold/rating and units_sold/rating_count as new columns. This is bad practice (it is usually called target leakage) and does not mean that you developed a "good" model. It means you cheated. Do not use the value you are going to predict to generate new features.

In [None]:
rtrc_columns = selectedFeatures[['rating','rating_count','units_sold']]
rtrc = rtrc_columns['rating']/(rtrc_columns['rating_count']+1) # +1 avoids division by zero
# These two lines leak the target (units_sold) into the features!!!!!
selectedFeatures['sales_to_rating'] = rtrc_columns['units_sold']/rtrc_columns['rating'] #Overfitting 1
selectedFeatures['sales_to_rating_count'] = rtrc_columns['units_sold']/(rtrc_columns['rating_count']+1) #Overfitting 2
# Change the two lines marked as "Overfitting" before trusting any score!!!!!
selectedFeatures['rating_to_rating_count'] = rtrc
plt.scatter(rtrc_columns['units_sold'],rtrc)
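
To see why a feature built from the target is cheating, here is a rough sketch on random data. Everything in it is synthetic: `rating` and `units_sold` are drawn independently, so an honest feature should show almost no correlation with the target, while the leaked ratio still tracks it closely.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "rating": rng.uniform(1, 5, size=200),                      # random ratings
    "units_sold": rng.choice([10, 100, 1000, 5000], size=200),  # random, unrelated sales
})

# Leaky feature: computed from the target itself
toy["sales_to_rating"] = toy["units_sold"] / toy["rating"]

leaky_corr = toy["sales_to_rating"].corr(toy["units_sold"])
honest_corr = toy["rating"].corr(toy["units_sold"])
print(f"leaky feature vs target:  {leaky_corr:.2f}")
print(f"honest feature vs target: {honest_corr:.2f}")
```

A model given the leaked column mostly reads the answer back, which is why its score says nothing about real predictive power.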

### Currency
Does the buyer's currency have an effect on total sales? As you can see below, there is only one currency in this data table, so it adds nothing to our analysis and will be dropped.

In [None]:
print(selectedFeatures['currency_buyer'].unique())
selectedFeatures = selectedFeatures.drop(columns=['currency_buyer'])

### uses_ad_boosts
Does using ad boosts really affect the number of sales? Let's see. As you can see below, using ad boosts doesn't have a meaningful effect on the sales number.

* We select the two necessary columns in line 1.
* Then we group the data by the unique values of "uses_ad_boosts" and take the mean of "units_sold".
* The column adds little information; note that the code below does not actually drop it from selectedFeatures.

Mean units sold for ad boost users is 4167.13
Mean units sold for non ad boost users is 4470.21

In [None]:
ad_boost_success = selectedFeatures[['uses_ad_boosts','units_sold']]
ad_boost_success = ad_boost_success.groupby(['uses_ad_boosts']).mean().sort_values(by=['units_sold'], ascending = False)
ad_boost_success
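
A mean alone can be misleading when sales are skewed or the groups have very different sizes. As a sketch, the same groupby can report the median and the group size too. The numbers below are invented and are not the dataset's results.

```python
import pandas as pd

# Made-up sales for products without (0) and with (1) ad boosts
toy = pd.DataFrame({
    "uses_ad_boosts": [0, 0, 0, 1, 1],
    "units_sold": [100, 1000, 10000, 100, 5000],
})

# Several statistics per group in one call
summary = toy.groupby("uses_ad_boosts")["units_sold"].agg(["mean", "median", "count"])
print(summary)
```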

### Rating and rating_count
Rating is between 1 and 5, which seems like a small range, but because it is a float it takes many distinct values, so it will be converted into categorical data. You can see the positive correlation between categorical rating and units sold in the graph.

In [None]:
conditions = [(selectedFeatures['rating']<2),
              ((selectedFeatures['rating']>=2) & (selectedFeatures['rating']<3)),
              ((selectedFeatures['rating']>=3) & (selectedFeatures['rating']<4)),
              ((selectedFeatures['rating']>=4) & (selectedFeatures['rating']<=5))]
tags = ['tag_1','tag_2','tag_3','tag_4']
selectedFeatures = selectedFeatures.assign(categorical_rating = np.select(conditions,tags))
#selectedFeatures['categorical_rating'] = round(selectedFeatures['rating'],1)

#These three lines generate the plot and will be repeated frequently
rating_to_sales = selectedFeatures[['units_sold','categorical_rating']]
rating_to_sales = rating_to_sales.groupby(['categorical_rating']).mean().sort_values(by=['categorical_rating'])
rating_to_sales.plot()
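
An alternative to writing the conditions by hand is **pd.cut**, which bins values by break points. This is a sketch of a swapped-in technique, not what the notebook uses; the sample ratings are made up and the labels reuse the tag names from the cell above.

```python
import numpy as np
import pandas as pd

ratings = pd.Series([1.5, 2.0, 3.7, 4.0, 5.0])  # made-up ratings

categorical_rating = pd.cut(
    ratings,
    bins=[-np.inf, 2, 3, 4, np.inf],  # same break points as the conditions above
    right=False,                      # intervals are [low, high), so 2.0 falls in tag_2
    labels=["tag_1", "tag_2", "tag_3", "tag_4"],
)
print(categorical_rating.tolist())
```

The result is an ordered categorical column, so sorting and grouping follow the tag order.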

As you can see in the output of the print statement below, rating counts range from 0 to 20744 with a mean of 890. This means that the data is heavily skewed and likely has many outliers; my solution is to convert it into categorical data. Again, you can observe the positive correlation in the graph below.

In [None]:
print("Min: ", min(selectedFeatures['rating_count']),"\nMax: ", max(selectedFeatures['rating_count']),"\nMean: ",selectedFeatures['rating_count'].mean())

In [None]:
low = selectedFeatures['rating_count'].quantile(0.3)
mid = selectedFeatures['rating_count'].quantile(0.6)
high = selectedFeatures['rating_count'].quantile(0.9)

conditions = [(selectedFeatures['rating_count']<low),
              ((selectedFeatures['rating_count']>=low) & (selectedFeatures['rating_count']<mid)),
              ((selectedFeatures['rating_count']>=mid) & (selectedFeatures['rating_count']<high)),
              (selectedFeatures['rating_count']>=high)
             ]
tags = ['tag0_low','tag2_mid','tag4_high','tag5_extreme']
selectedFeatures = selectedFeatures.assign(categorical_rating_count = np.select(conditions,tags))
rating_count_to_sales = selectedFeatures[['units_sold','categorical_rating_count']]
rating_count_to_sales = rating_count_to_sales.groupby(['categorical_rating_count']).mean().sort_values(by=['categorical_rating_count'])
rating_count_to_sales.plot()

Now it's time to convert "rating_five_count" into categorical data; just copy-paste and edit the code above.
This also has a positive correlation.

In [None]:
low =selectedFeatures['rating_five_count'].quantile(0.3)
mid = selectedFeatures['rating_five_count'].quantile(0.6)
high = selectedFeatures['rating_five_count'].quantile(0.9)

conditions = [(selectedFeatures['rating_five_count']<low),
              ((selectedFeatures['rating_five_count']>=low) & (selectedFeatures['rating_five_count']<mid)),
              ((selectedFeatures['rating_five_count']>=mid) & (selectedFeatures['rating_five_count']<high)),
              (selectedFeatures['rating_five_count']>=high)
             ]
tags = ['tag0_low','tag2_mid','tag4_high','tag5_extreme']
selectedFeatures = selectedFeatures.assign(categorical_rating_five_count = np.select(conditions,tags))
rating5_count_to_sales = selectedFeatures[['units_sold','categorical_rating_five_count']]
rating5_count_to_sales = rating5_count_to_sales.groupby(['categorical_rating_five_count']).mean().sort_values(by=['categorical_rating_five_count'])
rating5_count_to_sales.plot()

Repeat the same process for the other star counts

Four stars: Still a positive correlation. Let's check if this trend changes when ratings get lower.

In [None]:
low =selectedFeatures['rating_four_count'].quantile(0.3)
mid = selectedFeatures['rating_four_count'].quantile(0.6)
high = selectedFeatures['rating_four_count'].quantile(0.9)

conditions = [(selectedFeatures['rating_four_count']<low),
              ((selectedFeatures['rating_four_count']>=low) & (selectedFeatures['rating_four_count']<mid)),
              ((selectedFeatures['rating_four_count']>=mid) & (selectedFeatures['rating_four_count']<high)),
              (selectedFeatures['rating_four_count']>=high)
             ]
tags = ['tag0_low','tag2_mid','tag4_high','tag5_extreme']
selectedFeatures = selectedFeatures.assign(categorical_rating_four_count = np.select(conditions,tags))

rating4_count_to_sales = selectedFeatures[['units_sold','categorical_rating_four_count']]
rating4_count_to_sales = rating4_count_to_sales.groupby(['categorical_rating_four_count']).mean().sort_values(by=['categorical_rating_four_count'])
rating4_count_to_sales.plot()

Three stars: The trend continues.

In [None]:
low =selectedFeatures['rating_three_count'].quantile(0.3)
mid = selectedFeatures['rating_three_count'].quantile(0.6)
high = selectedFeatures['rating_three_count'].quantile(0.9)

conditions = [(selectedFeatures['rating_three_count']<low),
              ((selectedFeatures['rating_three_count']>=low) & (selectedFeatures['rating_three_count']<mid)),
              ((selectedFeatures['rating_three_count']>=mid) & (selectedFeatures['rating_three_count']<high)),
              (selectedFeatures['rating_three_count']>=high)
             ]
tags = ['tag0_low','tag2_mid','tag4_high','tag5_extreme']
selectedFeatures = selectedFeatures.assign(categorical_rating_three_count = np.select(conditions,tags))
rating3_count_to_sales = selectedFeatures[['units_sold','categorical_rating_three_count']]
rating3_count_to_sales = rating3_count_to_sales.groupby(['categorical_rating_three_count']).mean().sort_values(by=['categorical_rating_three_count'])
rating3_count_to_sales.plot()

Two stars:

In [None]:
low =selectedFeatures['rating_two_count'].quantile(0.3)
mid = selectedFeatures['rating_two_count'].quantile(0.6)
high = selectedFeatures['rating_two_count'].quantile(0.9)

conditions = [(selectedFeatures['rating_two_count']<low),
              ((selectedFeatures['rating_two_count']>=low) & (selectedFeatures['rating_two_count']<mid)),
              ((selectedFeatures['rating_two_count']>=mid) & (selectedFeatures['rating_two_count']<high)),
              (selectedFeatures['rating_two_count']>=high)
             ]
tags = ['tag0_low','tag2_mid','tag4_high','tag5_extreme']
selectedFeatures = selectedFeatures.assign(categorical_rating_two_count = np.select(conditions,tags))
rating2_count_to_sales = selectedFeatures[['units_sold','categorical_rating_two_count']]
rating2_count_to_sales = rating2_count_to_sales.groupby(['categorical_rating_two_count']).mean().sort_values(by=['categorical_rating_two_count'])
rating2_count_to_sales.plot()

One star:

In [None]:
low =selectedFeatures['rating_one_count'].quantile(0.3)
mid = selectedFeatures['rating_one_count'].quantile(0.6)
high = selectedFeatures['rating_one_count'].quantile(0.9)

conditions = [(selectedFeatures['rating_one_count']<low),
              ((selectedFeatures['rating_one_count']>=low) & (selectedFeatures['rating_one_count']<mid)),
              ((selectedFeatures['rating_one_count']>=mid) & (selectedFeatures['rating_one_count']<high)),
              (selectedFeatures['rating_one_count']>=high)
             ]
tags = ['tag0_low','tag2_mid','tag4_high','tag5_extreme']
selectedFeatures = selectedFeatures.assign(categorical_rating_one_count = np.select(conditions,tags))
rating1_count_to_sales = selectedFeatures[['units_sold','categorical_rating_one_count']]
rating1_count_to_sales = rating1_count_to_sales.groupby(['categorical_rating_one_count']).mean().sort_values(by=['categorical_rating_one_count'])
rating1_count_to_sales.plot()
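
The star-count cells above differ only in the column name, so the copy-paste could be replaced by a small helper. This is a sketch: `bin_by_quantiles` is a hypothetical name, the toy data is made up, and the explicit `default="unknown"` is my assumption (the notebook leaves the default at 0, which may raise a type error next to string tags on recent NumPy versions).

```python
import numpy as np
import pandas as pd

def bin_by_quantiles(values: pd.Series) -> np.ndarray:
    """Label values using the 30th, 60th and 90th percentiles as cut points."""
    low, mid, high = values.quantile([0.3, 0.6, 0.9])
    conditions = [
        values < low,
        (values >= low) & (values < mid),
        (values >= mid) & (values < high),
        values >= high,
    ]
    tags = ["tag0_low", "tag2_mid", "tag4_high", "tag5_extreme"]
    # An explicit string default keeps every output the same type
    return np.select(conditions, tags, default="unknown")

# Made-up counts standing in for two of the real columns
toy = pd.DataFrame({"rating_five_count": range(1, 11), "rating_one_count": range(10, 0, -1)})
for column in ["rating_five_count", "rating_one_count"]:
    toy["categorical_" + column] = bin_by_quantiles(toy[column])
print(toy)
```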

As can be observed above, the number of ratings matters more than how positive they are: Quantity > Quality in this case.


Now we have replacements for "rating", "rating_count", and all "rating_X_count" columns, so we no longer need the originals and can drop them.

In [None]:
selectedFeatures = selectedFeatures.drop(columns=['rating','rating_count','rating_five_count','rating_four_count',
                                                  'rating_three_count','rating_two_count','rating_one_count'])

Now we have to convert the merchant's rating into categorical data. The graph is nearly linear with a positive slope, which means a positive correlation between merchant rating and units sold.

In [None]:
conditions = [(selectedFeatures['merchant_rating']<2),
              ((selectedFeatures['merchant_rating']>=2) & (selectedFeatures['merchant_rating']<3)),
              ((selectedFeatures['merchant_rating']>=3) & (selectedFeatures['merchant_rating']<4)),
              ((selectedFeatures['merchant_rating']>=4) & (selectedFeatures['merchant_rating']<=5))]
tags = ['tag_1','tag_2','tag_3','tag_4']
#selectedFeatures['categorical_merchant_rating'] = round(selectedFeatures['merchant_rating'],1)
selectedFeatures = selectedFeatures.assign(categorical_merchant_rating = np.select(conditions,tags))


m_rating_to_sales = selectedFeatures[['units_sold','categorical_merchant_rating']]
m_rating_to_sales = m_rating_to_sales.groupby(['categorical_merchant_rating']).mean().sort_values(by=['categorical_merchant_rating'])
m_rating_to_sales.plot()

Now the rating count of the merchant. Again a positive correlation. I wonder if we can find anything BAD.

In [None]:
low = selectedFeatures['merchant_rating_count'].quantile(0.3)
mid = selectedFeatures['merchant_rating_count'].quantile(0.6)
high = selectedFeatures['merchant_rating_count'].quantile(0.9)

conditions = [(selectedFeatures['merchant_rating_count']<low),
              ((selectedFeatures['merchant_rating_count']>=low) & (selectedFeatures['merchant_rating_count']<mid)),
              ((selectedFeatures['merchant_rating_count']>=mid) & (selectedFeatures['merchant_rating_count']<high)),
              (selectedFeatures['merchant_rating_count']>=high)
             ]
# `tags` still holds ['tag_1','tag_2','tag_3','tag_4'] from the merchant rating cell above
selectedFeatures = selectedFeatures.assign(categorical_merchant_rating_count = np.select(conditions,tags))

m_rating_count_to_sales = selectedFeatures[['units_sold','categorical_merchant_rating_count']]
m_rating_count_to_sales = m_rating_count_to_sales.groupby(['categorical_merchant_rating_count']).mean().sort_values(by=['categorical_merchant_rating_count'])
m_rating_count_to_sales.plot()

Drop the merchant rating columns; we have already replaced them with categorical data.

In [None]:
selectedFeatures = selectedFeatures.drop(columns=['merchant_rating_count','merchant_rating'])
selectedFeatures

### Origin country
Some countries have a very small sample size (AT and VE). We can see that some countries are more successful, but because we have almost no data on the others, we will drop this column. In the pie chart below, which shows the mean units sold per country, more than 3/4 belongs to two countries.

In [None]:
origin_country_success = selectedFeatures[['origin_country','units_sold']]
origin_country_success = origin_country_success.groupby(['origin_country']).mean().sort_values(by=['units_sold'], ascending = False)
selectedFeatures = selectedFeatures.drop(columns=['origin_country'])
origin_country_success.plot.pie(subplots=True, figsize=(10,10))

### Theme
The only theme is "summer", so this column carries no information and we drop it.

In [None]:
themes = selectedFeatures['theme']
print(themes.unique())
selectedFeatures = selectedFeatures.drop(columns=['theme'])

### Duplicate data
Duplicate rows add no new information, so we have to drop them.

In [None]:
print(selectedFeatures.duplicated().sum())
selectedFeatures.drop_duplicates(inplace=True)
selectedFeatures = selectedFeatures.reset_index(drop=True)
selectedFeatures.info()
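
A minimal sketch of the same three steps (count, drop, reset the index) on a made-up table:

```python
import pandas as pd

# Rows 1 and 3 repeat row 0
toy = pd.DataFrame({"price": [5.0, 5.0, 8.0, 5.0], "units_sold": [100, 100, 50, 100]})

n_duplicates = toy.duplicated().sum()  # rows that repeat an earlier row
print(n_duplicates)

# Without reset_index the surviving rows would keep their old labels (0 and 2)
deduplicated = toy.drop_duplicates().reset_index(drop=True)
print(deduplicated)
```

Keep in mind that after `reset_index` the row labels no longer match the original table.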

# Label Encoding
We will convert data shown as "object" into integers or floats to make defining and training our model easier.
This notebook does not encode the "product_color" and "product_variation_size_id" columns, so we drop them.

In [None]:
selectedFeatures = selectedFeatures.drop(columns=['product_variation_size_id','product_color'])

encoder = LabelEncoder() #From sklearn.preprocessing library

selectedFeatures['categorical_merchant_rating_count'] = encoder.fit_transform(selectedFeatures['categorical_merchant_rating_count'])
selectedFeatures['categorical_merchant_rating'] = encoder.fit_transform(selectedFeatures['categorical_merchant_rating'])

selectedFeatures['categorical_rating_five_count'] = encoder.fit_transform(selectedFeatures['categorical_rating_five_count'])
selectedFeatures['categorical_rating_four_count'] = encoder.fit_transform(selectedFeatures['categorical_rating_four_count'])
selectedFeatures['categorical_rating_three_count'] = encoder.fit_transform(selectedFeatures['categorical_rating_three_count'])
selectedFeatures['categorical_rating_two_count'] = encoder.fit_transform(selectedFeatures['categorical_rating_two_count'])
selectedFeatures['categorical_rating_one_count'] = encoder.fit_transform(selectedFeatures['categorical_rating_one_count'])

selectedFeatures['categorical_rating_count'] = encoder.fit_transform(selectedFeatures['categorical_rating_count'])
selectedFeatures['categorical_rating'] = encoder.fit_transform(selectedFeatures['categorical_rating'])
selectedFeatures['crawl_month'] = encoder.fit_transform(selectedFeatures['crawl_month'])
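
LabelEncoder numbers the labels in sorted order, which is why tag names like 'tag0_low' and 'tag2_mid' carry a number: alphabetical order then matches the intended order. If your labels do not sort that way, you can state the order yourself. The sketch below uses pandas categoricals instead of LabelEncoder (a swapped-in technique), with made-up labels.

```python
import pandas as pd

labels = pd.Series(["mid", "low", "extreme", "high", "low"])  # made-up labels

# Codes in alphabetical order: extreme < high < low < mid
alphabetical_codes = labels.astype("category").cat.codes

# Codes that follow the intended order instead
ordered = pd.Categorical(labels, categories=["low", "mid", "high", "extreme"], ordered=True)
ordered_codes = pd.Series(ordered.codes)

print(alphabetical_codes.tolist())
print(ordered_codes.tolist())
```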

We convert float64 values into float32. float64 input is accepted, but scikit-learn's tree-based models convert features to float32 internally, so we do the conversion explicitly here.

In [None]:
selectedFeatures['ret_to_price_ratio'] = selectedFeatures['ret_to_price_ratio'].astype(np.float32)
selectedFeatures['sales_to_rating'] = selectedFeatures['sales_to_rating'].astype(np.float32)
selectedFeatures['price'] = selectedFeatures['price'].astype(np.float32)

Checking for null values

In [None]:
selectedFeatures.isnull().sum()

The has_urgency_banner column has NaN values, so I convert it manually: 1.0 becomes 1 and everything else (including NaN) becomes 0.

In [None]:

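# Read the column from selectedFeatures, not data: rows were dropped and the index was reset above
selectedFeatures = selectedFeatures.assign(has_urgency = (selectedFeatures['has_urgency_banner']==1.0).astype(int))
selectedFeatures = selectedFeatures.drop(columns=['has_urgency_banner','crawl_month'])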
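
pandas lines up rows by index label when you assign one column to another table. After dropping duplicates and resetting the index, labels in the cleaned table no longer point to the same rows as in the raw table. A small sketch with made-up values (`raw` and `cleaned` are hypothetical names):

```python
import numpy as np
import pandas as pd

raw = pd.DataFrame({"has_urgency_banner": [1.0, np.nan, np.nan, 1.0]})

# Pretend row 0 was removed as a duplicate and the index was reset
cleaned = raw.drop(index=0).reset_index(drop=True)

# Using the cleaned table's own column keeps rows and flags together
cleaned["has_urgency"] = cleaned["has_urgency_banner"].fillna(0).astype(int)

# Taking the flag from `raw` aligns on index labels, so cleaned row 0 gets raw row 0
misaligned = cleaned.assign(has_urgency=(raw["has_urgency_banner"] == 1.0).astype(int))

print(cleaned["has_urgency"].tolist())
print(misaligned["has_urgency"].tolist())
```

The two lists differ even though both were built from the same raw column.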

### Let's check our data table and columns
.T means "transpose of the matrix/data table"

In [None]:
selectedFeatures.T

### Tags
Tags are stored as a single string per product (a list of tags written as text). We score each product with the important tags, giving the most weight to the most popular tag, store the result in a new "weightedTags" column, and drop the original column.

As you can see below, there is no visible correlation between weighted tags and units sold.

In [None]:
# With 20 important tags the weights run from 100 (most popular) down to 5
tagColumn = selectedFeatures['tags'].str.lower()  # importantTags is lowercase too
weightedTags = []
for productTags in tagColumn:
    count = 0
    weight = 100
    for tag in importantTags:
        if tag in productTags:  # substring match on the whole tags string
            count += weight
        weight -= 5
    weightedTags.append(count)

selectedFeatures['weightedTags'] = weightedTags
selectedFeatures = selectedFeatures.drop(columns=['tags'])

weighted_tags_to_sales = selectedFeatures[['units_sold','weightedTags']]
weighted_tags_to_sales = weighted_tags_to_sales.groupby(['weightedTags']).mean().sort_values(by=['weightedTags'])
weighted_tags_to_sales.plot()
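
The `in` test on a string looks for substrings, so a short tag can match inside a longer one. If you want exact tag matches, one option is to split the string first. This sketch assumes the tags are separated by commas; the tag string and the tag list are made up.

```python
# Hypothetical tags string, assumed to be comma-separated
product_tags = "Summer,Women's Fashion,Dresses,sleeveless"
important_tags = ["summer", "dress", "fashion"]  # made-up list, most popular first

# Substring test on the whole string: 'dress' matches inside 'dresses'
substring_hits = [tag for tag in important_tags if tag in product_tags.lower()]

# Exact test on the individual tags
tag_set = {tag.strip().lower() for tag in product_tags.split(",")}
exact_hits = [tag for tag in important_tags if tag in tag_set]

print(substring_hits)
print(exact_hits)
```

Which behaviour you want depends on whether "dresses" should count as "dress"; neither is wrong, but it is worth choosing on purpose.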

In [None]:
selectedFeatures.info()

# Defining and training the model
Separating the data table into two: one contains the features used to predict sales, the other holds the sales.

In [None]:
features = selectedFeatures.drop(columns=['units_sold'])
sales = selectedFeatures['units_sold']

feature_train,feature_test,sale_train,sale_test=train_test_split(features,sales,test_size=0.2,random_state=0)

#### Training a model

In [None]:
regressor=RandomForestRegressor(n_estimators=10000)
regressor.fit(feature_train,sale_train)

# Make predictions

## Do not trust this prediction score: as stated previously, the leaked features (sales_to_rating and sales_to_rating_count) make it meaningless.

In [None]:
sale_pred=regressor.predict(feature_test)
# r2_score expects the true values first, then the predictions
print("Prediction score: ", r2_score(sale_test,sale_pred))
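
The R² score is not symmetric, so the order of the two arguments matters. Here is a sketch of the standard formula written out with NumPy; `r2` is a hypothetical helper for illustration and the numbers are made up.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot, with SS_tot taken around mean(y_true)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # squared prediction errors
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # spread of the true values
    return 1.0 - ss_res / ss_tot

actual = [100, 1000, 5000, 10000]
predicted = [900, 1100, 4000, 6000]

print(r2(actual, predicted))  # true values first, as r2_score expects
print(r2(predicted, actual))  # swapped arguments give a different number
```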

# Your Turn

Can you improve this notebook by defining new features or improving the features that we already generated?
Go on, copy this notebook and work on it!