## Introduction

In this kernel, I explore a dataset I scraped from a **Customer-to-Customer (C2C)** second-hand website in order to give you *insights* and some *benchmark* results about that kind of business. These results can also be extended to other domains of the e-commerce landscape, so be sure to follow along.

Hopefully, it will also help you stay motivated by letting you compare your business with another one, especially the successful one studied here.

Feel free to reach out in the comments or with direct messages.
Have a nice walkthrough!

## Exploratory Analysis


In [None]:
# To begin this exploratory analysis, first import libraries and define functions and utilities to work with the data.

from mpl_toolkits.mplot3d import Axes3D
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt # plotting
import numpy as np # linear algebra
import os # accessing directory structure
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# for beautiful plots and some types of graphs
import seaborn as sns

In [None]:
# IMPORTING THE DATA
# There is 1 csv file in the current version of the dataset

kBaseDataDirectory = "/kaggle/input"  # on Kaggle
#kBaseDataDirectory = "./kaggle/input"  # when working offline with jupyter notebook

dataset_files = []

# This loop will import all dataset files in case we add more data in a next version of the dataset
for dirname, _, filenames in os.walk(kBaseDataDirectory):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        dataset_files.append(os.path.join(dirname, filename))


In [None]:
#######   Utility functions for statistics   #######

### Helpers to filter dataframes

def helper_has_fields_compared_to(df, columns, target, what, operator):
    """
    Helper to compare several columns to the same value.
    :param what: 'all' (every column must match) or 'any' (at least one column must match)
    :param operator: one of '>', '>=', '<=', '<', '==', '!='
    :return: a boolean Series (mask) aligned with df
    """
    comparisons = {
        '>': lambda s: s > target,
        '>=': lambda s: s >= target,
        '<=': lambda s: s <= target,
        '<': lambda s: s < target,
        '==': lambda s: s == target,
        '!=': lambda s: s != target,
    }
    compare = comparisons[operator]
    # Apply the requested operator to every column, including the first one
    res = compare(df[columns[0]])
    for col in columns[1:]:
        tmp = compare(df[col])
        # Combine the per-column masks
        if what == 'all':
            res = res & tmp
        elif what == 'any':
            res = res | tmp
    return res

def helper_has_any_field_greater_than(df, columns, target):
    """Returns a boolean mask selecting the rows of the dataframe where any of
    the specified columns has a value greater than the target.
    """
    res = helper_has_fields_compared_to(df, columns, target, 'any', '>')
    return res

def helper_has_all_field_greater_than(df, columns, target):
    res = helper_has_fields_compared_to(df, columns, target, 'all', '>')
    return res


### Other utilities for stats

def frequency(data, probabilities=False, sort=False, reverse=False):
    """Returns the frequency distribution of elements.
    This is a convenience method for effectif()'s most common use case, without all the more complicated parameters.
    :param data: A collection of elements you want to count.
    :param bool probabilities: Whether you want the result frequencies to sum up to 1. Default: False
    """
    xis, nis = effectif(data, returnSplitted=True, frequencies=probabilities, sort=sort, reverse=reverse)
    return xis, nis


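def frequences(data, returnSplitted=True, hashAsString=False, universe=None, frequenciesOverUniverse=None):
    """Returns the relative frequency of each element of data.
    When `universe` is given, the computation is delegated to effectifU(),
    which is not defined in this notebook.
    """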
    if universe is None:
        return effectif(data, returnSplitted, hashAsString, True)
    else:
        return effectifU(data, universe, returnSplitted, hashAsString, True, frequenciesOverUniverse)
    

def effectif(data, returnSplitted=True, hashAsString=False, frequencies=False, inputConverter=None, sort=False, reverse=False):
    """Computes the count (number of occurrences) of each distinct value.
    :param list data: a list of values
    :param bool returnSplitted: if True, return two lists (values, counts); otherwise return a dict
    :param bool hashAsString: whether we should convert the values in 'data' to
                string before comparing them
    :param bool frequencies: if True, return relative frequencies (summing to 1) instead of counts
    :param function inputConverter: a callable function that is used to convert
                the values within data into the class you want the values to be
                compared as. When not provided, the identity function is used.
                If used with parameter 'hashAsString', the hashed value will be
                the one returned by this function.
    :param bool sort: sort the result by count (only if returnSplitted)
    :param bool reverse: reverse the order (only if sort and returnSplitted)
    """
    inputConverter = (lambda x: x) if inputConverter is None else inputConverter
    effs = {}
    for val in data:
        val = inputConverter(val)
        key = str(val) if hashAsString else val
        # Increment the counter, starting from 0 for unseen keys
        effs[key] = effs.get(key, 0) + 1
    
    if frequencies:
        tot = sum(effs.values())
        for key in effs:
            effs[key] = effs[key]/tot
    
    if returnSplitted:
        xis = list(effs.keys())
        nis = list(effs.values())
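        if sort:
            # Sort values and counts together by count. sortBasedOn() is not defined
            # in this notebook, so this assumes it ordered both lists based on `nis`.
            pairs = sorted(zip(xis, nis), key=lambda pair: pair[1], reverse=reverse)
            xis, nis = [x for x, _ in pairs], [n for _, n in pairs]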
        return xis, nis
    
    return effs


In [None]:
# Distribution graphs (histogram/bar graph) of column data
def plotPerColumnDistribution(df, nGraphShown, nGraphPerRow):
    nunique = df.nunique()
    df = df[[col for col in df if nunique[col] > 1 and nunique[col] < 50]] # For displaying purposes, pick columns that have between 1 and 50 unique values
    nRow, nCol = df.shape
    columnNames = list(df)
    nGraphRow = (nCol + nGraphPerRow - 1) // nGraphPerRow  # integer (ceiling) division: plt.subplot needs an int
    plt.figure(num = None, figsize = (6 * nGraphPerRow, 8 * nGraphRow), dpi = 80, facecolor = 'w', edgecolor = 'k')
    for i in range(min(nCol, nGraphShown)):
        plt.subplot(nGraphRow, nGraphPerRow, i + 1)
        columnDf = df.iloc[:, i]
        if (not np.issubdtype(type(columnDf.iloc[0]), np.number)):
            valueCounts = columnDf.value_counts()
            valueCounts.plot.bar()
        else:
            columnDf.hist()
        plt.ylabel('counts')
        plt.xticks(rotation = 90)
        plt.title(f'{columnNames[i]} (column {i})')
    plt.tight_layout(pad = 1.0, w_pad = 1.0, h_pad = 1.0)
    plt.show()

# Correlation matrix
def plotCorrelationMatrix(df, graphWidth, segmentName=None):
    filename = segmentName if segmentName else getattr(df, "dataframeName", segmentName)
    df = df.dropna(axis='columns') # drop columns with NaN
    df = df[[col for col in df if df[col].nunique() > 1]] # keep columns where there are more than 1 unique values
    if df.shape[1] < 2:
        print(f'No correlation plots shown: The number of non-NaN or constant columns ({df.shape[1]}) is less than 2')
        return
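    # Correlations are only defined for numeric/boolean columns (text columns such as language are skipped)
    corr = df.select_dtypes(include=[np.number, 'bool']).corr()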
    plt.figure(num=None, figsize=(graphWidth, graphWidth), dpi=80, facecolor='w', edgecolor='k')
    corrMat = plt.matshow(corr, fignum = 1)
    plt.xticks(range(len(corr.columns)), corr.columns, rotation=90)
    plt.yticks(range(len(corr.columns)), corr.columns)
    plt.gca().xaxis.tick_bottom()
    plt.colorbar(corrMat)
    plt.title(f'Correlation Matrix for {filename}', fontsize=15)
    plt.show()

# Scatter and density plots
def plotScatterMatrix(df, plotSize, textSize):
    df = df.select_dtypes(include =[np.number]) # keep only numerical columns
    # Remove rows and columns that would lead to df being singular
    df = df.dropna(axis='columns')
    df = df[[col for col in df if df[col].nunique() > 1]] # keep columns where there are more than 1 unique values
    columnNames = list(df)
    if len(columnNames) > 10: # reduce the number of columns for matrix inversion of kernel density plots
        columnNames = columnNames[:10]
    df = df[columnNames]
    ax = pd.plotting.scatter_matrix(df, alpha=0.75, figsize=[plotSize, plotSize], diagonal='kde')
    corrs = df.corr().values
    for i, j in zip(*np.triu_indices_from(ax, k = 1)):
        ax[i, j].annotate('Corr. coef = %.3f' % corrs[i, j], (0.8, 0.2), xycoords='axes fraction', ha='center', va='center', size=textSize)
    plt.suptitle('Scatter and Density Plot')
    plt.show()


### The dataset file

In [None]:
nRowsRead = None  # integer: how many rows to read. Specify None to read the whole file
# loading the 1 dataset file of the kernel
# dataset_files entries are already full paths (built from kBaseDataDirectory by os.walk)
main_input_dataset_filepath = dataset_files[0]
df1 = pd.read_csv(main_input_dataset_filepath, delimiter=',', nrows=nRowsRead)

df1.dataframeName = 'users.dataset.public.csv'
orig_df1 = df1.copy()
orig_df1.dataframeName = 'users.dataset.public.csv'

nRow, nCol = df1.shape
print(f'There are {nRow} rows (i.e. users) and {nCol} columns (i.e. possible features)')
print('\nThese columns are: \n{}'.format(" - ".join(list(df1.columns))))

Let's take a quick look at what the data looks like. More explanations about each column are available on the [dataset page](https://www.kaggle.com/jmmvutu/ecommerce-users-of-a-french-c2c-fashion-store), so I am not going to review those, especially since the columns of this dataset are self-explanatory. I will assume you have familiarized yourself with the information presented on the dataset project page.

As a brief recap, the table represented here concerns users of a C2C website. People can sell *and* buy products from one another.
As such, we can look at this dataset to study either buyers or sellers. They can also follow other users/be followed and "like" products, just as in a social network.

Each row in the table represents a single user's statistics. Profiles have been anonymised for privacy concerns.

In [None]:
#orig_df1.head(10)
orig_df1.sample(10)

#### Removing duplicate columns

As you may have seen, **some columns describe the same thing** (they are precomputed so that insights can be read at a glance).
For instance, as described in the dataset description, the columns with `seniority` tell us how old the user account is.
Since we are exploring the data using a kernel, I will *remove such duplicate columns and create my own if need be*.
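
As a minimal sketch of how such redundancy could be checked before dropping a column, the snippet below compares two columns on a toy dataframe. The values are hypothetical, and so is the assumption that `seniorityAsYears` is simply `seniority` divided by 365; the real dataset may round or compute it differently.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from the dataset)
toy = pd.DataFrame({"seniority": [2857, 3196, 3204, 2854]})
# Assumption for illustration only: the yearly column is derived from the daily one
toy["seniorityAsYears"] = toy["seniority"] / 365

# A Pearson correlation of 1 means one column is a linear rescaling of the other,
# so keeping both adds no information to a correlation analysis.
redundancy = toy["seniority"].corr(toy["seniorityAsYears"])
print(f"Correlation between the two columns: {redundancy:.3f}")
```

When the coefficient is (almost) exactly 1, keeping a single representative of the group is usually enough.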

In [None]:
#print('Columns: \n{}'.format(" - ".join(list(df1.columns))))

useless_columns = []
# unused metadata are dropped
useless_columns += ["identifierHash", "type"]

# Duplicate columns are dropped in favor of their siblings
useless_columns += ["seniority", "seniorityAsYears", "civilityGenderId", "country"]

columns_unused_in_correlations = ["hasProfilePicture", "daysSinceLastLogin"]

# Unless one specifically wants to look at the difference between iOS users and Android users
# For instance, "Do iPhone/iOS users buy more than Android users, since iPhones are more pricey?"
# However, keep in mind you should study the average users, and this version of the dataset is too small
# to draw conclusions directly. Contact the author of the dataset for more dataset entries ;)
unused_mobile_app_columns = ["hasIosApp", "hasAndroidApp"]

# this way we can have cleaner graphs if needed.
all_unused_columns = useless_columns + unused_mobile_app_columns + columns_unused_in_correlations

## Dropping columns inplace to conserve the df1.dataframeName member
df1.drop(useless_columns, axis=1, errors='ignore', inplace=True)
#df1.drop(all_unused_columns, axis=1, errors='ignore', inplace=True)

## Reordering few columns for better visualization and concordance
cols = df1.columns.tolist()
_i = cols.index("socialProductsLiked")
_ = cols.pop(_i); cols.insert(cols.index("productsWished"), "socialProductsLiked")
_i_country = cols.index("countryCode"); cols.pop(_i_country); cols.insert(cols.index("language")+1, "countryCode")
df1 = df1[cols]

#print(df1.daysSinceLastLogin.describe(), "\n\n", df1.seniority.describe())

print(f"New shape: {df1.shape[0]} users (rows), {df1.shape[1]} features (columns)\n")
print("Remaining columns: \n{}".format(" - ".join(list(df1.columns))))
df1.sample(5)

For the curious eye, here is a statistical summary of the various numerical columns in the dataset so far.

In [None]:
df1.describe()

### **Segmenting the users**

#### **Studying non-passive people**

Currently, the dataset contains too many users who are inactive (users that only seem to have signed up to try the site out, maybe browse a few articles and do nothing else/drop out). The website actually strongly encourages users to sign up (otherwise they cannot browse the site for long), which may be a reason for such a proportion of passive users.

Since the website is a C2C platform whose business model is to earn money when products are sold, I will define and call `active` users whose activity on the platform directly participates toward that end, i.e. prospective buyers and prospective sellers.

Formally, **`active` users** are those that are either:

- prospective buyers: those who interacted with products of others through a like/wishlist/purchase
- prospective sellers: those having at least one product for sale.


Also, I define as **`social` users** those accounts that interacted with other users by following accounts or getting followers.


However, I **_exclude_** the following key metrics and will not create user segments related to:

- having or not a profile picture
- having or not installed any of the site's official app
- browsing the site regularly (i.e. users who drop out late)


The dataset file contains active, social and inactive users. Also note that one can belong to multiple segments simultaneously, like a seller buying someone else's product.
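
To make the overlap between segments concrete, here is a small sketch on toy data. The column names follow the dataset, but the values are hypothetical. It shows why the sizes of the buyer and seller segments cannot simply be added together.

```python
import pandas as pd

# Toy data (hypothetical values); column names follow the dataset
toy = pd.DataFrame({
    "productsBought": [0, 2, 0, 1, 0],
    "productsListed": [0, 0, 3, 1, 0],
    "productsSold":   [0, 0, 1, 0, 0],
})

# Boolean masks, one per segment
is_buyer = toy["productsBought"] > 0
is_seller = (toy["productsListed"] > 0) | (toy["productsSold"] > 0)

# A user can satisfy both masks at once, so segment sizes do not add up
n_buyers = int(is_buyer.sum())
n_sellers = int(is_seller.sum())
n_both = int((is_buyer & is_seller).sum())
n_either = int((is_buyer | is_seller).sum())
print(f"buyers={n_buyers}, sellers={n_sellers}, both={n_both}, either={n_either}")
```

The number of users in at least one segment equals buyers plus sellers minus those counted in both.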

We might want to study **_buyer_**-specific or **_seller_**-specific behaviour. Hence I will also filter these specific segments. 
While we're at it, let's also create a subset of **_social_** users so that their behavior can be studied.
The code below will create the aforementioned user segments.

In [None]:
### Hence, let's create specific subsets of users regarding all that.

### ACTIVE USERS ###
# By filtering out users that are completely passive from the dataframe
# (i.e. users who did not sell/upload/buy/like/wish/... any product), we can study the behavior of real active users.

# using df1 instead of orig_df1 in case we had removed some rows after the initial import

active_df = df1[helper_has_any_field_greater_than(df1, ['socialProductsLiked', 'productsListed', 'productsSold',
       'productsPassRate', 'productsWished', 'productsBought'], 0)]
active_df.dataframeName = "Active Users"


### BUYING/SELLING BEHAVIOUR ###

### Buyers
buyers_df = df1[df1.productsBought > 0]
#buyers_df.productsBought.describe()
buyers_df.dataframeName = "Buyers"

### Prospecting Sellers (has sold a product or is still trying to make his/her first sale)
sellers_df = df1[(df1.productsListed > 0) | (df1.productsSold > 0)]
sellers_df.dataframeName = "Prospecting Sellers"

### Successful sellers (at least 1 product sold)
successful_sellers_df = df1[df1.productsSold > 0]
successful_sellers_df.dataframeName = "Successful sellers"


### SOCIAL INTERACTIONS ###

# Users using the social media features of the service
# Tracks anyone who willingly un/-subscribed to someone.
# And it also includes accounts that are either good (increased their followers)
#   or bad (even basic followers unsubscribed from them).
# Since each new account is automatically assigned 3 followers and 8 accounts to follow
# I filter out those who differ from these default account settings.
social_df = df1[ (df1['socialNbFollowers'] != 3) | (df1['socialNbFollows'] != 8) ]
#asocial_users_df = df1[ (df1['socialNbFollows'] < 8) ]

# Among those social users, filter only those active on products (selling, buying or wishing for articles, ...)
market_social_df = social_df[helper_has_any_field_greater_than(social_df, ['socialProductsLiked', 'productsListed', 'productsSold',
       'productsPassRate', 'productsWished', 'productsBought'], 0)]

#print(socialUsers.shape, activeSocialUsers.shape)
#socialUsers.head(20)


### RESULTS / INFOS
print(f"""Out of the {orig_df1.shape[0]} users of the dataset sample, there are:""")
print()
print(f"""- {active_df.shape[0]} active users ({100*active_df.shape[0]/df1.shape[0]:.3}%). Among these prospective buyers and sellers""")

print(f"""  - {active_df.shape[0] - sellers_df.shape[0]} are prospective buyers only (active users who are not sellers)""")
print(f"""    while {buyers_df.shape[0]} users overall actually bought products (at least 1)""")
print(f"""  - {sellers_df.shape[0]} are prospective sellers""")
print(f"""    among which {successful_sellers_df.shape[0]} are successful sellers (>= 1 product successfully sold)""")
print()
print(f"""- {social_df.shape[0]} people using social network features""")
print(f"""  such as following accounts or getting followers""")
print("\nNote that among the above number of sellers, some may act as buyers and vice-versa")

With these filtered data, let's have a look at the data without all the inactive users. This way there will be more variability in the data and hopefully we can start seeing more interesting things.

Here is a sample of the `active` users.

In [None]:
print(f"Active users: {active_df.shape[0]} records with {active_df.shape[1]} columns")
active_df.sample(12)

In [None]:
#### Adding columns

## Period in which the user has not completely dropped out (in nbr of days)
## TODO: check if enough data for the formula
# df1["activityDays"] = df1.apply(lambda row: (row["seniority"] - row["daysSinceLastLogin"]), axis=1)

In [None]:
#df1["seniorityAsMonths"] = df1["seniorityAsMonths"].apply(lambda x: int(x))

#print("Months of seniority of the users in the dataset: \n- {} months".format(" months\n- ".join(map(str, df1["seniorityAsMonths"].unique()))))

#df1.head(10) #["seniorityAsMonths"]
#df1.seniorityAsMonths.describe()

In [None]:
### Distribution graphs (histogram/bar graph) of sampled columns:

#plotPerColumnDistribution(df1, 10, 5)

### Visualizing the data

Now that our data is segmented, let's have a look at how the features correlate with one another.

Here is a correlation matrix:

In [None]:
plotCorrelationMatrix(df1, 8, "All users")

This correlation matrix may look a bit messy.

Some aspects of the correlation matrix may be obvious or present lower interest, which is why I will remove less meaningful columns and create another correlation matrix.

> For instance, we can briefly note that `daysSinceLastLogin` seems slightly negatively correlated to almost everything. Practically, it just means that the more recently a user has logged in (i.e. `daysSinceLastLogin` getting smaller), the more likely that user is to have good activity or social network presence on the site (i.e. the other key metrics getting bigger).


> Similarly, we see that the boolean feature `hasProfilePicture` correlates negatively to almost all the other features. But almost all the users of the site have a profile picture, which makes that feature almost useless.
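
One way to spot such a nearly constant feature is to measure how much of the column is taken up by its most frequent value. The sketch below uses a toy series with hypothetical values; the real share of users with a profile picture is not reproduced here.

```python
import pandas as pd

# Toy data (hypothetical values): 19 users out of 20 have a profile picture
toy = pd.Series([True] * 19 + [False], name="hasProfilePicture")

# Share of the most frequent value; a share close to 1 means the column barely varies
dominant_share = toy.value_counts(normalize=True).iloc[0]
print(f"{toy.name}: most frequent value covers {dominant_share:.0%} of users")
```

A column dominated by a single value carries little information, so its correlations with other features should be read with caution.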

In [None]:
#correlation_df = df1.drop(["hasProfilePicture", "hasAndroidApp", "hasIosApp"], axis=1, errors="ignore")
correlation_df = active_df.drop(["hasProfilePicture", "hasAndroidApp", "hasIosApp"], axis=1, errors="ignore")
correlation_df.drop(["seniority", "daysSinceLastLogin"], axis=1, errors="ignore", inplace=True)
plotCorrelationMatrix(correlation_df, 8, "Active Users")


### Insights from the correlation matrix

The above matrix shows the following correlations:

  **Buyers**
- `productsBought` and `productsWished`
- `productsBought` and `socialProductsLiked` (though correlation is weaker than with `productsWished`).

  This is interesting to note in terms of user experience. A significant number of users seem to attach different meanings to `wishlist` and `like`. Offering both options may help you understand a user's tastes and intentions better. These different lists of products can then be used to recommend more appropriate products.
  - Thus this result should not only be limited to C2C platforms. Regular e-commerce shops (B2C) may also incorporate `like` and `wishlist` buttons to improve their knowledge of their users and prospects. By relying only on items added to the `cart`, a business would miss the greater potential of more specific targeting.
  

  **Sellers**
- `socialNbFollowers` (follow**ers**, not follows) and `productsSold`
- `socialNbFollowers` and `productsListed`

  This shows that people who publish more are more likely to get followers. This is not an absolute rule, since the correlation is only around 0.5.
  Practically, we can interpret it saying that some sellers may propose clothes that do not suit the dominant tastes.
  
  **How to use this insight from a business's point of view?**
  
  Right off the bat, we cannot reach any clear conclusion using only this aggregated data, i.e. there are several ways this correlation could be interpreted, and one would need specifically collected data to test which hypothesis is the most likely.

  For instance, the correlation between `followers` and `sold products` can be seen in two main ways (or a combination of both):
  
  1. a seller gets followers by selling products (and thus being granted a badge of "trusted" seller, which in turn will get him/her to sell more).
  
  2. It is also possible that in C2C platforms, successful sellers first get subscribers and then, by showing them quality products, they earn the client's trust and turn cold prospects into their clients and fans.
  
    For a C2C business, the latter hypothesis would be the most promising. It would imply that a C2C company could earn more by empowering their seller users to create attractive content / ad posts.
    This could explain why successful C2C companies like [eBay](https://www.ebay.com) and [Ricardo](https://www.similarweb.com/website/ricardo.ch) allow users to customize their ads with styles / HTML, while others like [Vestiaire Collective](https://www.similarweb.com/website/vestiairecollective.com) and [Vide Dressing](https://www.similarweb.com/website/videdressing.com) tackle this problem by turning their users' product pictures into professional-looking images. (\*Example companies successful in Europe)



**Social**

- In the shoes of a seller, we can hypothesize that having followers really DOES help a seller sell more, as discussed above. **Hence, for an individual uploading products on a C2C website, it would not be enough to simply add products online.**
  To maximize profit, a seller would also need to be proactive in getting followers, or in other words, advertise himself/herself.
  
  This is the approach taken by C2C businesses like [Vinted](https://www.similarweb.com/website/vinted.com), whose business model is solely based on having sellers promote their products through *ad boosting*, and they seem successful. The fact that *Vinted* charges neither *listing fees* nor *commission upon sale* is a clear advantage, since most of their competitors do charge commissions based on the article price.
  
  It seems that the idea that one has to advertise oneself in order to get more sales is deeply ingrained in people's minds. So much so that many sellers are willing to pay to increase their visibility. It seems to be the only / main source of revenue for [*Vinted*](https://www.vinted.com/how_it_works) and it seems they are doing well.
  
  This is especially true nowadays, since there are too many offers online. Sellers seem to think there are two ways to alleviate this issue: producing more quality content (better ad pictures, product descriptions, adequate categories, ...) or paying money to boost their products in the site's catalog.
  It would be interesting to see which way the market is leaning and which business model generates more revenue.
  
  Note that the path *eBay* chose is to combine both approaches to maximize their revenue: they allow sellers to customize their ads (eBay charges for some ad features like adding a subtitle) and they also allow sellers to boost the position of their ads in the listing. That gives eBay the benefit of both sources of revenue, which may explain why the company got more funds to develop itself.
  It would be very instructive to have such companies share the results of both approaches, so that we might know what drives more revenue.



There are other interesting things we might note, like the fact that `socialNbFollows` (the number of accounts a user follows) and `productsBought` are not correlated at all, but it would be wiser to look at this correlation using only the segment of data containing *buyers*. Sellers may have a different behaviour, such as using the "follow" feature to get some follow-back instead of showing their appreciation of the other user's content.

So how about the correlation matrix for buyers?


In [None]:
plotCorrelationMatrix(buyers_df.drop(all_unused_columns, axis=1, errors="ignore"), 8, "Buyers")

print(f"""On average, buyers buy {buyers_df.productsBought.sum() / buyers_df.shape[0] :.2f} products. Details are as follows:""")
buyers_df.productsBought.describe()

**_Buyers_ behaviour - Interesting fact**

- So even among the buyers we observe that the number of accounts a user follows (`socialNbFollows`) and the number of products that user bought (`productsBought`) are not correlated.
  
  - This somehow might seem **counter-intuitive** since one might expect buyers to follow accounts they wish to purchase an article from, or follow accounts of sellers with products they might intend to buy.

    One explanation would be that most users tend to "follow" profiles they share tastes with. This could be to keep up with the trends, or just to get style inspiration from influencers and other people in the social network.

    Regardless, considering only the ROI of developing a social network feature for your website, it might seem that giving your users the ability to follow inspiring users would not directly lead to an increase in the overall number of sales on your website.
    Thus in order to use the business' funds to optimize the ROI, there might be better ways than implementing that aspect.
    
    However, as the previous result shows, from the point of view of individual sellers, sellers compete with each other to gain more followers, and the sellers who get the most followers are more likely to grab the buyers' attention when those are looking for something new to buy.
    
    So a feature to "follow" accounts may not lead buyers to buy more from a C2C website as a whole, but rather this would help motivate the sellers of the platform, since they will be able to see positive results from taking the time to improve their presence on the platform. Sellers are also the oil of a C2C business, so pampering them may be one of the factors that set apart successful C2C businesses.
    
  
  - **_However_**, this weak correlation could be put into perspective if we consider that users with lower budgets might tend to "dream" more and follow more accounts, even though they cannot afford to buy from that many of the influencers they follow.
   
    After all, clients cannot simply buy a product from each seller they follow. No one has an infinite amount of money to spend, not to mention that some clients and prospects may even be on a tight budget. In general, buyers bought `3` products on `average`, while the `median` stands at `1`.
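
The gap between the average and the median is typical of a skewed distribution. As a rough sketch, the toy purchase counts below (hypothetical values chosen only to mimic that shape, not taken from the dataset) show how a few heavy buyers pull the mean up while the median stays low.

```python
from statistics import mean, median

# Toy purchase counts (hypothetical): most buyers purchase once,
# while a few heavy buyers pull the average upwards
products_bought = [1, 1, 1, 1, 1, 1, 2, 2, 5, 15]

avg = mean(products_bought)
med = median(products_bought)
print(f"mean = {avg}, median = {med}")
```

In such a case the median describes the typical buyer better than the mean does.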


In order to conclude whether a social network feature is worth it in terms of ROI, it would be necessary to look at other data and take into account parameters specific to your business.

For instance, the prices of the products. If most products on a website are "cheap" (cheap compared to the standard budget people would allocate for your type of products) and there is a very weak correlation between `the number of accounts users follow` and the `number of products bought`, then it would lean towards the first hypothesis, since we could reasonably assume users have enough money to buy more products.

Also, it is worth noting that it would be necessary to put the relationship between those two columns into perspective using other sources of data. The idea is that, for example, the relationship may be roughly linear, but with a slope so small compared to the noise in the data that the correlation appears weak or nonexistent.
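
One nuance is worth sketching here: Pearson's coefficient does not depend on the size of the slope itself, so a small slope only looks like a weak correlation when the noise is large in comparison. The snippet below illustrates this on synthetic data; the slope, noise level and sample size are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Synthetic data (hypothetical): y grows very slowly with x
x = rng.uniform(0, 100, size=1000)
slope = 0.01
y_clean = slope * x
y_noisy = y_clean + rng.normal(0, 1.0, size=x.size)

# Pearson's r ignores the size of the slope: a perfectly linear relation
# has r = 1 even when the slope is tiny
r_clean = np.corrcoef(x, y_clean)[0, 1]
# The same tiny slope is easily drowned by noise, which lowers r
r_noisy = np.corrcoef(x, y_noisy)[0, 1]
print(f"r without noise: {r_clean:.3f}, r with noise: {r_noisy:.3f}")
```

In practice, this means a weak coefficient may hide a real but small effect, which is why additional data would be needed before concluding.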


#### User retention

Finally, a little parting gift: a graph showing user retention over time.

A lot of users drop out quickly. The vast majority actually.
This is true for most websites/services/apps: a lot of people try it once, maybe twice and then a lot of them drop out really quickly.

Hopefully, the following graph should give you a basis for comparison of your engagement vs dropout curve.

In [None]:
dropout_after = lambda df, dayMax: dayMax - df.daysSinceLastLogin
max_last_login = df1.daysSinceLastLogin.max()


In [None]:
#df1 = df1[df1.daysSinceLastLogin <= df1.seniority]

plt.title(f"User retention: how users drop out over time [all user segments]")
#plt.xlabel("(present) <-- # days since last login  --> (past)")
#sns.kdeplot(np.array(df1.seniority), shade=True)
plt.xlabel("(past) <-- # days before dropout  --> (present)")
plt.ylabel("density")
sns.kdeplot(np.array(dropout_after(df1, max_last_login)), shade=True)
plt.show()


How to read this graph: the height of the curve represents the share of users whose last connection was N days ago.

The spike at ~700 shows that among all the users who signed up during the same week/day, the vast majority of them dropped right away.
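
Note that the plotting cell does not plot `daysSinceLastLogin` directly but `max_last_login - daysSinceLastLogin`. As a small sketch with hypothetical values, here is how that transformation flips the axis: a user whose last login was N days ago ends up at position `max_last_login - N`.

```python
import pandas as pd

# Toy data (hypothetical values): days elapsed since each user's last login
days_since_last_login = pd.Series([700, 699, 650, 30, 0])

# Same transformation as dropout_after(): the oldest last login maps to 0
# and a login made today maps to the maximum value
max_last_login = days_since_last_login.max()
days_before_dropout = max_last_login - days_since_last_login
print(days_before_dropout.tolist())
```

The left of the x-axis therefore corresponds to the past and the right to the present, as the axis label indicates.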


In [None]:
plt.title("User retention - how people who bought at least one product drop out")
#plt.xlabel("(present) <-- # days since last login  --> (past)")
#sns.kdeplot(np.array(buyers_df.daysSinceLastLogin), shade=True)
plt.xlabel("(past) <=  # days before dropout  => (present)")
plt.ylabel("density")
sns.kdeplot(np.array(dropout_after(buyers_df, max_last_login)), shade=True)
plt.show()

plt.title("User retention - how sellers keep visiting a C2C site")
#plt.xlabel("(present) <-- # days since last login  --> (past)")
plt.xlabel("(past) <=  # days before dropout  => (present)")
plt.ylabel("density")
sns.kdeplot(np.array(dropout_after(sellers_df, max_last_login)), shade=True)
plt.show()

plt.title("User retention - buyers and sellers dropout curve")
sns.kdeplot(np.array(dropout_after(sellers_df,max_last_login)), shade=True)
sns.kdeplot(np.array(dropout_after(buyers_df,max_last_login)), shade=True)
plt.xlabel("(past) <=  # days before dropout  => (present)")
plt.ylabel("density")
plt.legend(["sellers", "buyers"])
plt.show()

We see that even a one-time buyer can be expected to keep visiting you over a long period of time.

While a lot of new users can cease using your site rapidly, the ones you get to keep and bring into action will be an asset for the future.
Once someone has made even the slightest step to engage with you and really use your platform (either by buying or uploading products to sell), you should put effort into supporting the relationship. It may be obvious, but here is evidence provided by actual data.

Finally, we can observe the difference in the dropout curve between *buyers* and *sellers*. The latter seem to try the platform for a shorter amount of time before deciding whether or not they will turn to another one.

> This means that a *crucial* step in C2C is ensuring –among other things– that:
> - your platform is easy to use for your sellers/service providers
> - you provide the actual tools they want or need (example: analytics, a not too restrictive way to interact with their prospective buyers, ...)


Whereas *buyers* mainly show two distinct behaviors: purchasing once and then never coming back (as denoted by the first little bump) or staying faithful to the platform for a long while (the spike at the end representing the present).


## Conclusion


- Adding `wishlist` and `like` features on top of the basic `cart` feature allows your business to better understand your customers and their tastes. This should not be restricted to C2C businesses, and it could improve your product and service recommendations.

- Sellers of a C2C website stand to gain a lot by having the means to create attractive product or service pages to drive more sales. And if your sellers sell more, your business can earn more too.

- Adding a social network feature on a platform that connects sellers with buyers can greatly help your best sellers stand out. Seeing their efforts pay off will serve as a reward pushing them to invest more time and effort to build their community on *your* platform, which is to your benefit. After all, having sellers compete with one another leads to better products/contents/services for your customers and thus to a better reputation for you.
  
  And you should not forget that sellers are the oil that allows a C2C (or B2C) business to move forward. You should take interest in treating them well.

- Finally, have a look at the example user retention curve to learn what kind of different behaviour to expect from buyers and sellers.
