# IMPORT DATA

In [None]:
import numpy as np 
import pandas as pd 

db = pd.read_csv('/kaggle/input/ecommerce-users-of-a-french-c2c-fashion-store/6M-0K-99K.users.dataset.public.csv')
db.head()

# Data Pre-visualization
Inspect the columns and their data types first; the attributes relevant for further analysis are identified later with a correlation matrix.

In [None]:
db.info()

# DATA CLEANING 
   1. Check missing values and the skewness of the data
   2. Data pre-visualization

## 1. Data checking

In [None]:
#checking if there is any missing value
db.isna().sum()

In [None]:
db.skew(numeric_only=True) # skew() measures the asymmetry of each numeric column; string columns are skipped
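
As a rough illustration of what the skewness check reports, the sketch below uses a small made-up frame (the column names `balanced`, `long_tail` and `label` are hypothetical and not part of the dataset). `numeric_only=True` restricts the calculation to numeric columns; recent pandas versions may raise an error when string columns are included.

```python
import pandas as pd

# Made-up data: one symmetric column, one with a long right tail, one string column.
toy = pd.DataFrame({
    "balanced": [1, 2, 3, 4, 5],
    "long_tail": [0, 0, 0, 1, 50],
    "label": ["a", "b", "c", "d", "e"],
})

# Skewness near 0 means roughly symmetric; a large positive value means a long right tail.
skewness = toy.skew(numeric_only=True)
print(skewness)
```

Count-like columns such as productsSold would be expected to look more like `long_tail` than `balanced`, but that has to be read off the real output.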

## 2. Data pre-visualization

Basic information of users 

In [None]:
%matplotlib inline
from matplotlib import pyplot

# take the labels from value_counts() itself so bars and labels always match
counts = db['hasAnyApp'].value_counts()
q=pyplot.bar(counts.index.astype(str), counts.values)

# set x/y labels and plot title
pyplot.xlabel("hasAnyApp")
pyplot.ylabel("Count")
pyplot.title("hasAnyApp Bins")

Most users do not use the mobile app.
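
To put a number on "most", `value_counts(normalize=True)` returns proportions instead of raw counts. A minimal sketch on made-up data (the four values below are invented, not taken from the dataset):

```python
import pandas as pd

# Hypothetical stand-in for db['hasAnyApp'].
has_app = pd.Series([False, False, False, True], name="hasAnyApp")

# normalize=True divides each count by the total number of rows.
shares = has_app.value_counts(normalize=True)
print(shares)
```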

In [None]:
df_group_two = db[['hasAnyApp','productsBought','productsSold']]
df_group_two = df_group_two.groupby(['hasAnyApp'],as_index=False).agg(['sum','mean'])
df_group_two

Users with the mobile app have slightly higher buying power;
users without the mobile app tend to have slightly higher selling power.
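
The comparison above rests on per-group sums and means. The sketch below repeats the idea on made-up numbers using named aggregation, which gives flat and readable column names; the output names such as `bought_mean` are illustrative choices, not names from the notebook.

```python
import pandas as pd

# Invented rows with the same column names as the notebook's subset.
toy = pd.DataFrame({
    "hasAnyApp": [True, True, False, False, False],
    "productsBought": [2, 0, 1, 0, 0],
    "productsSold": [0, 1, 3, 0, 0],
})

# One output column per (input column, aggregation) pair.
summary = toy.groupby("hasAnyApp").agg(
    bought_total=("productsBought", "sum"),
    bought_mean=("productsBought", "mean"),
    sold_total=("productsSold", "sum"),
    sold_mean=("productsSold", "mean"),
)
print(summary)
```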

In [None]:
%matplotlib inline
from matplotlib import pyplot

# take the labels from value_counts() itself so bars and labels always match
counts = db['civilityTitle'].value_counts()
pyplot.bar(counts.index, height=counts.values)

# set x/y labels and plot title
pyplot.xlabel("civilityTitle")
pyplot.ylabel("Count")
pyplot.title("civilityTitle Bins")

Married women (mrs) seem to be the largest group of users on this site.

In [None]:
df_group_one = db[['civilityTitle','productsBought','productsSold']]
df_group_one = df_group_one.groupby(['civilityTitle'],as_index=False).agg(['sum','mean'])
df_group_one

Although single women (miss) are the smallest group of users, their buying power is the highest.

In [None]:
%matplotlib inline
from matplotlib import pyplot

# take the labels from value_counts() itself so bars and labels always match
counts = db['language'].value_counts()
q=pyplot.bar(counts.index, counts.values)

# set x/y labels and plot title
pyplot.xlabel("language used on the site")
pyplot.ylabel("Count")
pyplot.title("language Bins")

In [None]:
df_group_three = db[['language','productsBought','productsSold']]
df_group_three = df_group_three.groupby(['language'],as_index=False).agg(['sum','mean'])
df_group_three

English is the most common language among users, followed by French.
However, these two groups do not have the highest average productsBought and productsSold.

In [None]:
import seaborn as sns
sns.countplot(x='language',data=db,hue='civilityTitle') 

The countplot tells us that married female users are the largest group globally.

# Segment the users
Determine the number of recommended clusters by using a dendrogram:
   1. Remove redundant variables:
      - identifierHash: unique for each row of data
      - type: the same for all rows
      - country: countryCode can be used for analysis instead
      - gender & civilityTitle: civilityGenderId can be used for analysis instead

   2. Encode the string variables language and countryCode, and the boolean variables

   3. Draw the correlation matrix of the filtered dataset

   4. Choose a sample for drawing the agglomerative dendrogram

   5. Define filter functions

## 1. Remove variables

In [None]:
repeat_columns = []
# unused and repeated metadata are dropped
repeat_columns += ['identifierHash', 'type','country','gender','civilityTitle']
db1=db.drop(repeat_columns,axis=1)
db1.head()

## 2. Encode variables

In [None]:
from sklearn.preprocessing import OrdinalEncoder
ordinal_encoder = OrdinalEncoder()

string_columns = ['language','countryCode','hasAnyApp','hasAndroidApp','hasIosApp','hasProfilePicture']

for var in string_columns:
    var_cat = db[[var]] # double brackets select a one-column DataFrame rather than a Series
    var_cat_encoded = ordinal_encoder.fit_transform(var_cat)
    var_cat_df = pd.DataFrame(var_cat_encoded)
    var_cat_df.columns = [var + '_encoded'] 
    db1 = db1.merge(var_cat_df, how = 'inner', left_index = True, right_index = True)

db2 = db1.drop(string_columns, axis = 1)
db2.head()
db2.info()
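
For illustration only, here is a pandas-only sketch of the same encoding idea on an invented frame. Category codes follow the sorted order of the distinct values, as `OrdinalEncoder` does with its default settings, but they are integers rather than floats. This is an alternative sketch, not the method used above.

```python
import pandas as pd

# Invented rows; the column names match two of the encoded columns.
toy = pd.DataFrame({
    "language": ["fr", "en", "de", "en"],
    "hasAnyApp": [True, False, False, True],
})

encoded = toy.copy()
for col in ["language", "hasAnyApp"]:
    # Each distinct value gets an integer code based on its sorted position.
    encoded[col + "_encoded"] = toy[col].astype("category").cat.codes

# Drop the original columns, keeping only the encoded versions.
encoded = encoded.drop(columns=["language", "hasAnyApp"])
print(encoded)
```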

## 3. Correlation matrix

In [None]:
import seaborn as sns
a=sns.heatmap(db2.corr()) 
a.set_title('Heatmap of Correlation Matrix for All Users', fontsize = 20)

The variables daysSinceLastLogin and hasProfilePicture seem to be negatively correlated with every other variable. However, only 1.95% of users do not have a profile picture.

The variables socialNbFollowers, socialNbFollows, socialProductsLiked, productsListed, productsSold, productsPassRate and productsWished seem to be positively correlated with each other.

The variable seniority seems to be uncorrelated with every other variable.

The variables language and countryCode seem to be negatively correlated with each other and weakly correlated with hasAnyApp, hasIosApp and hasAndroidApp, but show almost no correlation with the other variables. For now, we keep these variables for further analysis.
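
Reading correlations off a heatmap is approximate. As a hedged sketch, the strongest pairs can also be listed numerically. The data below is synthetic, and both the 0.8 threshold and the column names `a`, `b`, `c` are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "a": base,
    "b": base * 2 + rng.normal(scale=0.1, size=200),  # nearly a linear copy of a
    "c": rng.normal(size=200),                         # independent noise
})

corr = toy.corr()
# Keep only the upper triangle so each pair appears once, then flatten to a Series.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().sort_values(ascending=False)
# Filter by absolute value so strong negative correlations are kept as well.
strong = pairs[pairs.abs() > 0.8]
print(strong)
```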

In [None]:
#remove variables with no correlations
no_columns=['seniority','seniorityAsMonths','seniorityAsYears']

db3 = db2.drop(no_columns, axis = 1)

In [None]:
sns.pairplot(db3) # pairplot shows the pairwise relationship between all attributes in the dataset

## 4. Sample choosing for dendrogram
Since this dataset is too large for drawing a dendrogram, we randomly choose 30% of the data.

In [None]:
print("Original dataset before filtering", db.shape)
print("Remaining data after filtering variables with no correlations:\n",db3.shape)
db_final = db3.sample(frac = 0.3)
print("\n Columns of the final sample: \n",db_final.columns)
print("\n Shape of the final sample: \n",db_final.shape)
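
Because the sample is random, the dendrogram can change from run to run. A minimal sketch, assuming reproducibility is wanted, that fixes the seed through `random_state` (the seed value 42 and the toy frame are arbitrary):

```python
import pandas as pd

toy = pd.DataFrame({"value": range(10)})

# The same random_state yields the same rows on every call.
first = toy.sample(frac=0.3, random_state=42)
second = toy.sample(frac=0.3, random_state=42)
print(first.index.tolist())
```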

In [None]:
import matplotlib.pyplot as plt
import scipy.cluster.hierarchy as sch
from sklearn.cluster import AgglomerativeClustering
fig = plt.figure(figsize = (11, 8))
dendrogram = sch.dendrogram(sch.linkage(db_final,method = 'ward'))

According to the dendrogram plot, the recommended number of clusters is two.
We could then segment users into two main clusters (Active/Inactive).
However, the data could also be grouped into smaller clusters.
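
As a sketch of how a two-cluster reading of a dendrogram could be turned into labels, `fcluster` can cut a Ward linkage tree into a fixed number of clusters. The data below is synthetic, and the standardization step is an added assumption (Ward linkage uses Euclidean distances, so unscaled columns with large ranges would dominate); it is not part of the notebook's own pipeline.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
# Two well-separated synthetic blobs standing in for inactive and active users.
low = rng.normal(loc=0.0, scale=0.5, size=(20, 3))
high = rng.normal(loc=10.0, scale=0.5, size=(20, 3))
features = np.vstack([low, high])

# Standardize so that no single column dominates the Euclidean distances.
scaled = (features - features.mean(axis=0)) / features.std(axis=0)

# Build the Ward linkage tree and cut it into at most two clusters.
tree = linkage(scaled, method="ward")
labels = fcluster(tree, t=2, criterion="maxclust")
print(np.bincount(labels)[1:])  # cluster sizes
```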

## 5. Define filter function

In [None]:
# using Jeffrey's Helpers to filter dataframes 
import operator as op

# maps each supported comparison symbol to its function
COMPARATORS = {'>': op.gt, '>=': op.ge, '<=': op.le, '<': op.lt, '==': op.eq, '!=': op.ne}

def helper_has_fields_compared_to(df, columns, target, what, operator):
   
    #Helper to compare several columns to the same value.
    #what='all' combines the per-column masks with AND, what='any' with OR.

    compare = COMPARATORS[operator]
    res = compare(df[columns[0]], target)
    for col in columns[1:]:
        tmp = compare(df[col], target)
        if what == 'all':
            res = res & tmp
        elif what in ['any']:
            res = res | tmp
    return res

def helper_has_any_field_greater_than(df, columns, target):
    #Returns a boolean mask of rows where any of the specified columns is greater than the target.
    res = helper_has_fields_compared_to(df, columns, target, 'any', '>')
    return res

def helper_has_any_field_smaller_than(df, columns, target):
    #Returns a boolean mask of rows where any of the specified columns is smaller than the target.
    res = helper_has_fields_compared_to(df, columns, target, 'any', '<')
    return res

def helper_has_all_field_greater_than(df, columns, target):
    #Returns a boolean mask of rows where all of the specified columns are greater than the target.
    res = helper_has_fields_compared_to(df, columns, target, 'all', '>')
    return res

def helper_has_all_field_smaller_than(df, columns, target):
    #Returns a boolean mask of rows where all of the specified columns are smaller than the target.
    res = helper_has_fields_compared_to(df, columns, target, 'all', '<')
    return res

def helper_has_all_field_equal_to(df, columns, target):
    #Returns a boolean mask of rows where all of the specified columns are equal to the target.
    res = helper_has_fields_compared_to(df, columns, target, 'all', '==')
    return res
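
For comparison, the "any" and "all" filters can also be written with pandas' built-in row-wise reductions. The sketch below uses an invented four-row frame and is offered as an equivalent idiom, not as a replacement for the helpers.

```python
import pandas as pd

toy = pd.DataFrame({
    "productsSold": [0, 2, 0, 1],
    "productsBought": [0, 0, 3, 1],
})
cols = ["productsSold", "productsBought"]

# Compare the whole sub-frame at once, then reduce across columns per row.
any_positive = (toy[cols] > 0).any(axis=1)   # like what='any', operator='>'
all_below_one = (toy[cols] < 1).all(axis=1)  # like what='all', operator='<'

print(toy[any_positive])
print(toy[all_below_one])
```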

In [None]:
# Total Users
print(f"Total users: {db3.shape[0]} records with {db3.shape[1]} columns")

# Inactive Users
Inactive_db=db3[helper_has_all_field_smaller_than(db3,['socialProductsLiked', 'productsListed',
      'productsPassRate', 'productsWished', 'productsListed','productsSold','productsBought'],1)]
Inactive_db.dataframeName = "Inactive Users"
print(f"Inactive users: {Inactive_db.shape[0]} records with {Inactive_db.shape[1]} columns")
#Inactive_db.sample(12)

# Active Users
Active_db=db3[helper_has_any_field_greater_than(db3,['socialProductsLiked', 'productsListed',
      'productsPassRate', 'productsWished', 'productsListed','productsSold','productsBought'],0)]
Active_db.dataframeName = "Active Users"
print(f"Active users: {Active_db.shape[0]} records with {Active_db.shape[1]} columns")


### Active Users

In [None]:
# Actual Users with at least one bought or sold
Users_db = db3[helper_has_any_field_greater_than(db3,['productsSold','productsBought'],0)]
print(f"Actual Users: {Users_db.shape[0]} ")
#Users_db.sample(12)

# Active actual users: users with a transaction who also show product-related activity
AActive_db = Users_db[helper_has_any_field_greater_than(Users_db,['socialProductsLiked', 'productsListed',
       'productsPassRate', 'productsWished'], 0)]
AActive_db.dataframeName = "Active Actual Users"
print(f"Actual Active Users: {AActive_db.shape[0]}")
#Active_db.sample(12)

## Actual Buyers
buyers_db = db3[db3.productsBought > 0]
buyers_db.dataframeName = "Buyers"
print("Actual buyers: ", buyers_db.shape[0])


## Sellers
sellers_db = db3[(db3.productsListed > 0) | (db3.productsSold > 0)]
sellers_db.dataframeName = "Prospecting Sellers"
print("Prospecting sellers: ",sellers_db.shape[0])

### actual sellers (at least 1 product sold)
successful_sellers_db = db3[db3.productsSold > 0]
successful_sellers_db.dataframeName = "Actual sellers"
print("Actual sellers: ", successful_sellers_db.shape[0])

# Social Users with no transaction but social interaction
#by looking at the data, we could easily conclude that 
# each new account is automatically assigned 3 followers and 8 accounts to follow
social_db = db3[ (db3['socialNbFollowers'] != 3) | (db3['socialNbFollows'] != 8) ]
social_db1=social_db[helper_has_all_field_smaller_than(social_db,['productsSold','productsBought'],1)]
#Among those social users, filter only those active on products 
market_social_db = social_db1[helper_has_any_field_greater_than(social_db1, ['socialProductsLiked', 'productsListed',
       'productsPassRate', 'productsWished'], 0)]
print(f"Potential Social Users: {market_social_db.shape[0]}")
#market_social_db.sample(12)

In [None]:
print(f"""On average, buyers buy {buyers_db.productsBought.sum() / buyers_db.shape[0] :.2f} products. Details are as follows:""")

#successful buyers
Sbuyers_db = db3[db3.productsBought >= 3]
Sbuyers_db.dataframeName = "SBuyers"
print("Accordingly, Successful buyers: ", Sbuyers_db.shape[0])
buyers_db.productsBought.describe()

In [None]:
#include= ['socialNbFollowers','socialNbFollows', 'productsWished','socialProductsLiked']
Sbuyers_db.socialNbFollowers.describe()

In [None]:
print(f"""On average, actual sellers sell {successful_sellers_db.productsSold.sum() / successful_sellers_db.shape[0] :.2f} products. Details are as follows:""")
#successful sellers
Ssellers_db = db3[db3.productsSold >= 6]
Ssellers_db.dataframeName = "SSellers"
print("Accordingly, Successful sellers: ", Ssellers_db.shape[0])
successful_sellers_db.productsSold.describe()

Customer retention:
70% of actual users are active users with social interaction; this is the group the website should value most. It also suggests that social connections between sellers and buyers improve user loyalty.
3679 users have high potential to become actual users, since they already show strong social interaction.
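
The share and the count quoted above come from ratios of segment sizes. A hedged sketch of that calculation on an invented five-row frame (the numbers are made up and do not reproduce the 70% or 3679 figures):

```python
import pandas as pd

toy = pd.DataFrame({
    "productsSold":        [1, 0, 2, 0, 0],
    "productsBought":      [0, 3, 0, 1, 0],
    "socialProductsLiked": [4, 0, 1, 0, 7],
    "productsWished":      [0, 0, 2, 0, 1],
})

# Actual users: at least one product sold or bought.
transacted = (toy[["productsSold", "productsBought"]] > 0).any(axis=1)
# Engaged users: at least one like or wish.
engaged = (toy[["socialProductsLiked", "productsWished"]] > 0).any(axis=1)

# Share of actual users who are also engaged.
active_share = (transacted & engaged).sum() / transacted.sum()
# Engaged users without any transaction yet.
potential_users = toy[~transacted & engaged]

print(f"{active_share:.0%} of actual users are also active")
print(f"{len(potential_users)} potential users")
```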

### Products
Here, we used the variable productsPassRate, the percentage of a seller's products meeting the product description (the store's team reviews sold products before they are shipped to the buyer), as a critical metric. We defined four quality tiers:
   1. Highest quality: pass rate of at least 90%
   2. Medium-high quality: pass rate of at least 80% and below 90%
   3. Standard quality: pass rate of at least 60% and below 80%
   4. Low quality: pass rate below 60%
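
The tier boundaries can be sketched with `pd.cut` using left-closed intervals. The pass rates below are invented, the tier labels are shorthand, and a pass rate of 0 simply falls into the lowest tier here, which may differ from how the notebook counts sellers without a reviewed product.

```python
import numpy as np
import pandas as pd

# Invented pass rates chosen to hit each boundary.
pass_rate = pd.Series([95.0, 90.0, 85.0, 80.0, 60.0, 59.9, 0.0])

tiers = pd.cut(
    pass_rate,
    bins=[-np.inf, 60, 80, 90, np.inf],
    labels=["low", "standard", "medium-high", "highest"],
    right=False,  # intervals are closed on the left: [60, 80), [80, 90), [90, inf)
)
print(tiers.tolist())
```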

In [None]:
productsH_db = db3[db3.productsPassRate >= 90]
productsH_db.dataframeName = "Best quality's store"
print("Numbers of sellers with the highest quality: ", productsH_db.shape[0])

productsMh_db = db3[(db3.productsPassRate >= 80) & (db3.productsPassRate < 90)]
productsMh_db.dataframeName = "medium-high quality's store"
print("Numbers of sellers with the Medium-high quality: ", productsMh_db.shape[0])

productsS_db = db3[(db3.productsPassRate >= 60) &  (db3.productsPassRate < 80)]
productsS_db.dataframeName = "Standard quality's store"
print("Numbers of sellers with the standard quality: ", productsS_db.shape[0])

productsU_db = db3[(db3.productsPassRate < 60) &  (db3.productsPassRate > 0)]
productsU_db.dataframeName = "Unqualified store"
print("Numbers of sellers with low quality: ", productsU_db.shape[0]+
      (successful_sellers_db.shape[0]-productsH_db.shape[0]-productsMh_db.shape[0]-productsS_db.shape[0]-productsU_db.shape[0]))

In [None]:
# note: every total below is divided by the number of prospecting sellers (sellers_db),
# not by the number of sellers in the respective quality group
print(f"""On average, active low quality sellers sell {productsU_db.productsSold.sum() / sellers_db.shape[0] :.2f} """)
print(f"""On average, active standard quality sellers sell {productsS_db.productsSold.sum() / sellers_db.shape[0] :.2f} """)
print(f"""On average, active medium-high quality sellers sell {productsMh_db.productsSold.sum() / sellers_db.shape[0] :.2f} """)
print(f"""On average, active high quality sellers sell {productsH_db.productsSold.sum() / sellers_db.shape[0] :.2f} products. Details are as follows:""")
productsH_db.productsSold.describe()

We computed the average number of products sold for sellers in each quality tier.
The higher the products' pass rate, the higher the seller's sales tend to be.
It also suggests that the site's review of products is good enough for buyers to be
satisfied with the products that pass.
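
The averages printed above share one denominator, the number of prospecting sellers. As a point of comparison, a per-tier mean divides each tier's total by that tier's own seller count. The sketch below uses invented numbers and a hypothetical `tier` column, so it does not confirm or contradict the conclusion.

```python
import pandas as pd

toy = pd.DataFrame({
    "tier": ["highest", "highest", "standard", "low", "low"],
    "productsSold": [10, 6, 4, 1, 3],
})

# sum / count gives the mean with each tier's own number of sellers as denominator.
per_tier = toy.groupby("tier")["productsSold"].agg(["sum", "count", "mean"])
print(per_tier)
```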
