# E-COMMERCE company analysis

This project analyzes a transnational data set that contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retailer. <br>
The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers.
The aim is to extract useful insights about the customers of the online store.

In [None]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns 

import warnings
# current version of seaborn generates a bunch of warnings that we'll ignore
warnings.filterwarnings('ignore')
sns.set_style('whitegrid')

import missingno as msno # missing data visualization module for Python
#import pandas_profiling

import gc
import datetime

%matplotlib inline
color = sns.color_palette()

In [None]:
df = pd.read_csv('/kaggle/input/ecommerce-data/data.csv', header= 0, encoding= 'unicode_escape')

In [None]:
df.head()

# Data Cleaning

## Data format cleaning

To simplify the further analysis, the columns will be renamed as follows:

In [None]:
df.rename(columns={"InvoiceNo":"invoice_num", 
                   "StockCode":"stock_code", 
                  "Description":"description", 
                  "Quantity":"quantity", 
                  "InvoiceDate":"invoice_date", 
                  "UnitPrice":"unit_price", 
                  "CustomerID":"customer_id", 
                  "Country":"country"}, inplace=True)

In [None]:
df.info()

The transaction date column will be converted from 'object' (essentially a plain string) to the pandas datetime type, which makes it easy to extract the day, month, year and hour later on.

In [None]:
df['invoice_date']=pd.to_datetime(df.invoice_date, format='%m/%d/%Y %H:%M')
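
As a minimal sketch of what this conversion does, the toy example below parses month/day/year strings and extracts date parts. The sample strings are made up for illustration and only mimic the format of the `invoice_date` column.

```python
import pandas as pd

# Hypothetical sample values that mimic the raw invoice date strings
sample = pd.DataFrame({"invoice_date": ["12/1/2010 8:26", "12/9/2011 12:50"]})

# An explicit format avoids ambiguity between day-first and month-first dates
sample["invoice_date"] = pd.to_datetime(sample["invoice_date"], format="%m/%d/%Y %H:%M")

# After the conversion the .dt accessor exposes the date parts
months = sample["invoice_date"].dt.month.tolist()
days = sample["invoice_date"].dt.day.tolist()
print(sample.dtypes)
print(months, days)
```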

Moreover, the description column will be converted to lower case.

In [None]:
df['description']=df.description.str.lower()

In [None]:
df.head()

## Missing data analysis and handling

In [None]:
df.info()

There are some null values for description and customer id.<br>
In particular the exact number of missing values in each column is:

In [None]:
df.isnull().sum().sort_values(ascending=False)

The portion of dataframe where some values are missing is the following:

In [None]:
df_miss = df[df.isnull().any(axis=1)].copy()  # copy, because new columns are added to df_miss below
df_miss.head()
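
The row selection above relies on a boolean mask. A small sketch on invented data (the frame `toy` and the column `n_missing` are hypothetical and not part of the dataset) shows how the mask is built and why taking a copy can be useful before adding columns:

```python
import numpy as np
import pandas as pd

# Toy frame; the values are invented for illustration only
toy = pd.DataFrame({
    "description": ["mug", None, "lamp"],
    "customer_id": [17850.0, np.nan, np.nan],
})

# True for every row that has at least one missing value
has_missing = toy.isnull().any(axis=1)

# .copy() gives an independent frame, so adding columns later does not
# trigger pandas' SettingWithCopyWarning
toy_miss = toy[has_missing].copy()
toy_miss["n_missing"] = toy_miss.isnull().sum(axis=1)
print(toy_miss)
```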

## Is there any relationship between the missing data?

In [None]:
df_miss.head()

In [None]:
df_miss["day"] = df_miss['invoice_date'].dt.day
df_miss["month"] = df_miss['invoice_date'].dt.month
df_miss["year"] = df_miss['invoice_date'].dt.year

In [None]:
df_miss['daymonth']=df_miss['day'].astype(str)+'/'+df_miss['month'].astype(str)
df_miss['daymonthyear']=df_miss['daymonth'].astype(str)+'/'+df_miss['year'].astype(str)
df_miss['monthyear']=df_miss['month'].astype(str)+'/'+df_miss['year'].astype(str)

In [None]:
df_miss.head()

In [None]:
fig, ax = plt.subplots(figsize=(20,6))
sns.countplot(x='daymonthyear', data=df_miss, ax=ax)
ax.tick_params(axis='x', labelrotation=90);

This plot looks messy, but it's clear that some days have more missing values than others.<br>
The distribution of the number of rows with missing values per day is:

In [None]:
sns.displot(df_miss['daymonthyear'].value_counts())

We can see that on some days more than 2500 customer ids are missing. We will now check the days on which the most customer ids are missing.

In [None]:
df_miss['daymonthyear'].value_counts()[:20]

It could be interesting to check whether something happened on these days that caused the missing values.

For further analysis, the rows with missing values will be dropped and a new dataframe called 'df_new' will be defined.

In [None]:
df_new=df.dropna()

In [None]:
# check whether there are missing values in the new dataframe
df_new.isnull().sum().sort_values(ascending=False)

In [None]:
df_new.info()

Now the dataframe does not have any missing values

## Duplicated Values handling

In [None]:
df_new[df_new.duplicated()].head()

In [None]:
df_new.duplicated().sum()

There are 5225 duplicated transactions.<br>
These transactions will be dropped from the dataset.

In [None]:
df_new = df_new.drop_duplicates()

In [None]:
df_new.duplicated().sum()
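
For reference, a short sketch on invented rows illustrates how `duplicated()` and `drop_duplicates()` treat fully identical rows. The frame `toy` is hypothetical and only reuses the column names from above.

```python
import pandas as pd

# Toy rows; the first two are identical in every column
toy = pd.DataFrame({
    "invoice_num": ["536365", "536365", "536365"],
    "stock_code": ["85123A", "85123A", "71053"],
    "quantity": [6, 6, 6],
})

# duplicated() marks the second and later copies of a fully identical row
n_dup = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence by default
deduped = toy.drop_duplicates()
print(n_dup, len(deduped))
```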

## Change columns type

Next, the 'customer_id' column will be converted from 'float' to 'int', since customer IDs are integers.

In [None]:
df_new['customer_id']=df_new.customer_id.astype('int64')

In [None]:
df_new.describe()

Quantity has negative values and unit price has a minimum value of 0

In [None]:
np.sum(df_new['quantity'] < 0)

There are 8872 transactions with negative quantity. We will investigate if they are related to canceled orders or mistakes.

### Canceled orders analysis

In [None]:
canceled_orders = df_new[df_new['invoice_num'].str.startswith('C')]
canceled_orders.head()

Looking at the first 5 rows of the dataframe, we can see that the quantity is negative. Is this true for all canceled orders?

In [None]:
(canceled_orders['quantity'] < 0).sum()

Yes, as expected, all the transactions with negative quantities are canceled orders.
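
Comparing two counts shows that the groups have the same size. A stricter check, sketched here on invented rows (the frame `toy` is hypothetical), compares the two boolean masks row by row:

```python
import pandas as pd

# Toy invoices; a leading 'C' marks a cancellation
toy = pd.DataFrame({
    "invoice_num": ["536365", "C536379", "C536383", "536390"],
    "quantity": [6, -1, -12, 24],
})

is_canceled = toy["invoice_num"].str.startswith("C")
is_negative = toy["quantity"] < 0

# True only if the two groups of rows coincide exactly
same_rows = bool((is_canceled == is_negative).all())
print(same_rows)
```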

In [None]:
print('The percentage of canceled orders is: {} %'.format(round(canceled_orders.shape[0]/df_new.shape[0]*100,2)))

Moreover, it looks like there are some discounts among the canceled orders. They will be analyzed as well.

## Discounts

In [None]:
discounts = df_new[df_new['stock_code'] == 'D']
discounts.head()

In [None]:
discounts.shape

The company issued 77 discounts

Are there other discounts in the original dataset?

In [None]:
(df['stock_code'] == 'D').sum()

No, all the discounts are correctly included in the transactions with negative values

Now, all the canceled orders will be dropped.

In [None]:
df_new = df_new[df_new['quantity'] > 0].sort_values(by='stock_code', ascending=False)
df_new.head()

In [None]:
df_new.info()

## Check for transactions of special items:

The presence of special items will be checked through a regex

In [None]:
import re

spec_list = []
# iterate over the unique codes only; unique() keeps the order of first appearance
for code in df_new.stock_code.unique():
    x = re.findall(r"^\w{1}$|\D[A-Z]+\D|[A-Z]\d", code)
    if x and x not in spec_list:
        spec_list.append(x)
spec_list

['BANK ', 'CHARGES'] will be renamed into ['BANK CHARGES']:

In [None]:
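# replace the split match by value instead of relying on its position in the list
spec_list = [['BANK CHARGES'] if item == ['BANK ', 'CHARGES'] else item for item in spec_list]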

Then, spec_list will be flattened into a single list:

In [None]:
spec_list2=[item for sublist in spec_list for item in sublist]
spec_list2

Now it is possible to check all the transactions related to these special items:

In [None]:
df_new[df_new['stock_code'].isin(spec_list2)]

So there are other types of transactions included in the dataset. They will be dropped.<br>
The special transactions are: POST (postage), M (manual), BANK CHARGES and C2 (carriage).

In [None]:
df_new = df_new[~df_new['stock_code'].isin(spec_list2)]
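
As an alternative to collecting the special codes in a loop, a vectorized filter could be sketched as follows. It rests on an assumption that is not stated in the dataset description: that every regular product code begins with five digits. The series `codes` is a toy sample.

```python
import pandas as pd

# Toy stock codes; 'POST', 'M', 'C2' and 'BANK CHARGES' are the special codes named above
codes = pd.Series(["85123A", "22423", "POST", "M", "C2", "BANK CHARGES"])

# Assumption: a regular product code begins with five digits
is_regular = codes.str.match(r"\d{5}")

# Everything that does not look like a regular code is treated as special
special_codes = sorted(codes[~is_regular].unique())
print(special_codes)
```

Any code that the assumption misclassifies would need to be checked by hand, so this is best used as a cross-check of the list found with the regex above.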

In [None]:
print("The number of transactions is: ", df_new.shape[0])

392732 - 391183 = 1549 rows have been dropped.

# Feature Engineering

We will add a column 'amount_spent', the product of quantity and unit price:

In [None]:
df_new['amount_spent']=df_new['quantity'] * df_new['unit_price']

In [None]:
df_new.head()

We will reorder the columns for easier reference

In [None]:
df_new=df_new[['invoice_num', 'invoice_date', 'stock_code', 'description', 'quantity', 'unit_price', 'amount_spent', 'customer_id', 'country']]

In [None]:
df_new.head()

We will create columns for the year-month, month, day of the week and hour:

In [None]:
df_new.insert(loc=2, column='yearmonth', value=df_new.invoice_date.dt.year * 100 + df_new.invoice_date.dt.month)
df_new.insert(loc=3, column='month', value=df_new.invoice_date.dt.month)
# +1 makes Monday=1 ... Sunday=7
df_new.insert(loc=4, column='day', value=df_new.invoice_date.dt.dayofweek + 1)
df_new.insert(loc=5, column='hour', value=df_new.invoice_date.dt.hour)
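
To make the encoding of these columns concrete, here is a small sketch on two example timestamps. The series `dates` is invented for illustration.

```python
import pandas as pd

# Two example timestamps: a Wednesday in December 2010 and a Friday in December 2011
dates = pd.Series(pd.to_datetime(["2010-12-01 08:26", "2011-12-09 12:50"]))

yearmonth = dates.dt.year * 100 + dates.dt.month  # December 2010 becomes 201012
weekday = dates.dt.dayofweek + 1                  # Monday=1 ... Sunday=7
hour = dates.dt.hour

print(yearmonth.tolist(), weekday.tolist(), hour.tolist())
```

Encoding the year and month as a single integer keeps the months in chronological order when the values are sorted.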

In [None]:
df_new.head()

Now the dataset looks cleaner and is ready for EDA

# Exploratory Data Analysis (EDA)

## How many orders were placed by each customer?

In [None]:
orders=df_new.groupby(by=['customer_id','country'], as_index=False)['invoice_num'].count()
orders.head()

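The column invoice_num now holds the count of invoice_num rows for each customer. Note that this counts every invoice line, so an order containing several products is counted once per product.<br>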
The equivalent code in SQL would be:<br>
SELECT customer_id, country, count(invoice_num)<br>
FROM df_new<br>
GROUP BY customer_id, country<br>
ORDER BY customer_id;<br>
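
The difference between counting rows and counting distinct invoices can be sketched on a toy frame. The data below is invented; in SQL terms, `nunique()` would correspond to `COUNT(DISTINCT invoice_num)`.

```python
import pandas as pd

# Customer 17850 has two invoices, one of which contains two product lines
toy = pd.DataFrame({
    "customer_id": [17850, 17850, 17850, 13047],
    "invoice_num": ["536365", "536365", "536366", "536367"],
})

# count() counts rows (one per product line on an invoice)
lines_per_customer = toy.groupby("customer_id")["invoice_num"].count()

# nunique() counts distinct invoices, i.e. separate orders
orders_per_customer = toy.groupby("customer_id")["invoice_num"].nunique()

print(lines_per_customer)
print(orders_per_customer)
```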

We will plot the number of orders by customer_id:

In [None]:
plt.subplots(figsize=(15,6))
plt.plot(orders.customer_id, orders.invoice_num)
plt.xlabel('Customer ID')
plt.ylabel('Number of Orders')
plt.title('Number of Orders for Different Customers')
plt.show()

### The top 5 customers by number of orders:

In [None]:
orders.sort_values(by='invoice_num', ascending=False).head()

## How much money did the customers spend?

In [None]:
money_spent = df_new.groupby(by=['customer_id','country'], as_index=False)['amount_spent'].sum()
money_spent.head()

In [None]:
plt.subplots(figsize=(15,6))
plt.plot(money_spent.customer_id, money_spent.amount_spent)
plt.xlabel('Customer ID')
plt.ylabel('Money spent (GBP)')  # unit prices in this dataset are in pounds sterling
plt.title('Money Spent for different Customers')
plt.show()

### The top 5 customers by money spent

In [None]:
money_spent.sort_values(by='amount_spent', ascending=False).head()

# Discover Patterns

Number of orders for different Months (1st Dec 2010 - 9th Dec 2011)

In [None]:
# number of distinct invoices per month (the index is already sorted by yearmonth)
ax = df_new.groupby('yearmonth')['invoice_num'].nunique().plot(kind='bar',color=color[0],figsize=(15,6))
ax.set_xlabel('Month',fontsize=15)
ax.set_ylabel('Number of Orders',fontsize=15)
ax.set_title('Number of orders for different Months (1st Dec 2010 - 9th Dec 2011)',fontsize=15)
ax.set_xticklabels(('Dec_10','Jan_11','Feb_11','Mar_11','Apr_11','May_11','Jun_11','July_11','Aug_11','Sep_11','Oct_11','Nov_11','Dec_11'), rotation='horizontal', fontsize=13)
plt.show()

November looks like the month with most orders

## How many orders per day?

In [None]:
# number of distinct invoices per day of the week
ax = df_new.groupby('day')['invoice_num'].nunique().plot(kind='bar',color=color[0],figsize=(10,5))
ax.set_xlabel('Day',fontsize=15)
ax.set_ylabel('Number of Orders',fontsize=15)
ax.set_title('Number of orders for different Days',fontsize=15)
# six labels only: the tick labels assume there are no transactions on Saturday
ax.set_xticklabels(('Mon','Tue','Wed','Thu','Fri','Sun'), rotation='horizontal', fontsize=15)
plt.show()

It looks like Thursday is the day with most orders

In [None]:
#ax = df_new.groupby('invoice_num')['hour'].unique().value_counts().iloc[:-1].sort_index().plot(kind='bar',color=color[0],figsize=(15,6))
#ax.set_xlabel('Hour',fontsize=15)
#ax.set_ylabel('Number of Orders',fontsize=15)
#ax.set_title('Number of orders for different Hours',fontsize=15)
#ax.set_xticklabels(range(6,21), rotation='horizontal', fontsize=15)
#plt.show()

# Discover Patterns for Unit Price

In [None]:
df_new.unit_price.describe()

There are orders with 0 unit price (free items)

In [None]:
plt.subplots(figsize=(12,6))
sns.boxplot(x=df_new.unit_price)
plt.show()

It looks like the majority of products have a unit_price lower than 10. We will use 10 as a threshold to explore the unit_price in more detail.

In [None]:
plt.subplots(figsize=(12,6))
sns.boxplot(x=df_new[df_new['unit_price'] < 10].unit_price)
plt.show()

In [None]:
df_free=df_new[df_new['unit_price'] == 0]
df_free.head()

### How many free items are sold in each month?

In [None]:
df_free.yearmonth.value_counts().sort_index()

In [None]:
ax = df_free.yearmonth.value_counts().sort_index().plot(kind='bar',figsize=(12,6), color=color[0])
ax.set_xlabel('Month',fontsize=15)
ax.set_ylabel('Frequency',fontsize=15)
ax.set_title('Frequency for different Months (Dec 2010 - Dec 2011)',fontsize=15)
ax.set_xticklabels(('Dec_10','Jan_11','Feb_11','Mar_11','Apr_11','May_11','July_11','Aug_11','Oct_11','Nov_11'), rotation='horizontal', fontsize=13)
plt.show()

On average, the company gives away 2 items for free each month. No free items were given in June 2011 and September 2011.

# Discover Patterns for each Country

How many orders for each country?

In [None]:
group_country_orders = df_new.groupby('country')['invoice_num'].count().sort_values()
# plot the number of orders in each country (with UK)
plt.subplots(figsize=(15,8))
group_country_orders.plot(kind='barh', fontsize=12, color=color[0])
plt.xlabel('Number of Orders', fontsize=12)
plt.ylabel('Country', fontsize=12)
plt.title('Number of Orders for different Countries', fontsize=12)
plt.show()

The company is based in the UK, so it seems natural that the UK is the country with the most orders.<br>
For further analysis, the UK will be excluded.

In [None]:
group_country_orders = df_new.groupby('country')['invoice_num'].count().sort_values()
del group_country_orders['United Kingdom']

# plot the number of orders in each country (without UK)
plt.subplots(figsize=(15,8))
group_country_orders.plot(kind='barh', fontsize=12, color=color[0])
plt.xlabel('Number of Orders', fontsize=12)
plt.ylabel('Country', fontsize=12)
plt.title('Number of Orders for different Countries', fontsize=12)
plt.show()

Excluding the United Kingdom, Germany, France and EIRE are the three countries with the most orders.

## How much money was spent by each country?

In [None]:
group_country_amount_spent = df_new.groupby('country')['amount_spent'].sum().sort_values()
# plot the total money spent by each country (with UK)
plt.subplots(figsize=(15,8))
group_country_amount_spent.plot(kind='barh', fontsize=12, color=color[0])
plt.xlabel('Money Spent (GBP)', fontsize=12)  # unit prices in this dataset are in pounds sterling
plt.ylabel('Country', fontsize=12)
plt.title('Money Spent by different Countries', fontsize=12)
plt.show()

For the same reason as above, we will exclude the UK from this analysis.

In [None]:
group_country_amount_spent = df_new.groupby('country')['amount_spent'].sum().sort_values()
del group_country_amount_spent['United Kingdom']
# plot total money spent by each country (without UK)
plt.subplots(figsize=(15,8))
group_country_amount_spent.plot(kind='barh', fontsize=12, color=color[0])
plt.xlabel('Money Spent (GBP)', fontsize=12)  # unit prices in this dataset are in pounds sterling
plt.ylabel('Country', fontsize=12)
plt.title('Money Spent by different Countries', fontsize=12)
plt.show()

Excluding the UK, customers from the Netherlands, EIRE, Germany, France and Australia spent the most money on the website.

# Sold product Analysis

In [None]:
df_new.head()

In [None]:
df_new['stock_code'].nunique()

There are 3659 different sold products in the dataset

## Which products are the most sold?

In [None]:
most_sold_products=df_new.groupby(by=['stock_code','description'])['quantity'].sum().sort_values(ascending=False).iloc[:50]
df_top_prod=most_sold_products.to_frame().reset_index()
df_top_prod.head()

In [None]:
plt.subplots(figsize=(15,8))
most_sold_products.plot(kind='bar', fontsize=12, color=color[0])
plt.xlabel('Product ID', fontsize=12)
plt.ylabel('Amount sold', fontsize=12)
plt.title('Most sold products', fontsize=12)
plt.show()

# Most profitable products

### TOP 5 profitable products

In [None]:
most_profitable_product = df_new.groupby(by=['stock_code','description'])['amount_spent'].sum().sort_values(ascending=False).iloc[:50]
df_prof_prod = most_profitable_product.to_frame().reset_index().head()

In [None]:
plt.subplots(figsize=(15,8))
most_profitable_product.plot(kind='bar', fontsize=12, color=color[0])
plt.xlabel('Product ID', fontsize=12)
plt.ylabel('Total earning', fontsize=12)
plt.title('Most profitable products', fontsize=12)
plt.show()

## How much does the price per unit relate to quantity?

In [None]:
df_new.reset_index().head()

In [None]:
df_3=df_new.drop_duplicates(subset=['stock_code','unit_price','description'])

In [None]:
df_3.sort_values(by=['stock_code','quantity'], inplace=True, ascending=False)
df_3.head()

### I will create a dictionary of dictionaries to include the unit price and quantities for each item

In [None]:
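import collections

# maps stock_code -> {quantity: unit_price}
items_dict = collections.defaultdict(dict)

# access the columns by name instead of by position
for _, row in df_3.iterrows():
    items_dict[row['stock_code']][row['quantity']] = row['unit_price']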

In [None]:
df_4=pd.DataFrame(list(items_dict.items()),columns = ['stock_code','quantity_price'])
df_4.head()

### Check the items with multiple prices per unit

In [None]:
df_5=df_4[df_4.quantity_price.apply(lambda x: len(x.keys())>1)]
df_5.head()

In [None]:
# number of quantity/price entries stored for each product
df_5 = df_5.assign(counts=df_5.quantity_price.apply(len))
df_5 = df_5.sort_values(by='counts', ascending=False)

In [None]:
plt.hist(df_5['counts'], bins=100,color='#0504aa',alpha=0.7, rwidth=0.85)
plt.xlabel('Number of different unit prices')
plt.show()

In [None]:
df_5['counts'].value_counts()

Excluding products with just one unit_price, most of the products have 2 or 3 unit prices.
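
A more direct way to count distinct prices per product can be sketched with a group-by. This is an alternative to the dictionary of dictionaries used above, not the method of this notebook, and its counts may differ slightly because it looks only at `unit_price`. The frame `toy` is invented for illustration.

```python
import pandas as pd

# Toy transactions; product 22423 is sold at two different prices
toy = pd.DataFrame({
    "stock_code": ["22423", "22423", "22423", "85123A", "71053"],
    "unit_price": [12.75, 10.95, 12.75, 2.55, 3.39],
})

# Number of distinct unit prices observed for each product
n_prices = toy.groupby("stock_code")["unit_price"].nunique()

# Distribution over the products that have more than one price
multi_price = n_prices[n_prices > 1].value_counts()

print(n_prices)
print(multi_price)
```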

## How does the price change in relation to the purchased quantity?

In [None]:
for i in range(10):  # check the first 10 products, ordered by the number of different unit prices
    print('Number: ', i + 1)
    quantity_price = df_5.iloc[i].quantity_price
    plt.plot(list(quantity_price.keys()), list(quantity_price.values()))
    plt.xlabel('Quantity')
    plt.ylabel('Unit price')
    plt.show()

These plots show that the unit price generally decreases with increasing quantity, with some exceptions.

# Time series analysis for top sold products

In [None]:
#most sold products dataframe
df_top_prod.head()

In this dataframe, the quantity is the sum of all the quantities sold for each product. We need all the single transactions related to these top sold products.

In [None]:
#check the number of transactions related to these top sold products
df_top_50 = df_new[df_new['description'].isin(df_top_prod['description'])]
df_top_50.info()

In [None]:
print('The top 50 most sold products account for {} % of the total transactions'.format(np.round(len(df_top_50)/len(df_new)*100,2)))

In [None]:
df_top_50.stock_code.value_counts()

### There are indeed lots of transactions related to these top sold products

It could be interesting to see how the product with stock code '23843' is in the top 50 with just 1 transaction

In [None]:
df_new[df_new['stock_code'] == '23843']

This transaction is related to the most sold product.

In [None]:
#most profitable products
df_prof_prod.head()

In [None]:
df_top_50[df_top_50['stock_code'].isin(df_prof_prod.stock_code)].stock_code.value_counts()

Among the top 5 most profitable products, those with more than 1000 transactions have the stock codes 85123A, 22423 and 85099B.

In [None]:
# copies, because a new column is added to each of these frames below
df_top1 = df_new[df_new['stock_code'] == '85123A'].copy()
df_top2 = df_new[df_new['stock_code'] == '22423'].copy()
df_top3 = df_new[df_new['stock_code'] == '85099B'].copy()

In [None]:
# creating a purchase day feature
df_top1['order_purchase_date'] = df_top1.invoice_date.dt.date

# creating an aggregation
sales_per_purch_date = df_top1.groupby('order_purchase_date', as_index=False).quantity.sum()
ax = sns.lineplot(x="order_purchase_date", y="quantity", data=sales_per_purch_date)
ax.set_title('Sales per day for the Most sold product')

There are indeed some peaks in the quantity sold for this product.

In [None]:
# creating a purchase day feature
df_top2['order_purchase_date'] = df_top2.invoice_date.dt.date

# creating an aggregation
sales_per_purch_date = df_top2.groupby('order_purchase_date', as_index=False).quantity.sum()
ax = sns.lineplot(x="order_purchase_date", y="quantity", data=sales_per_purch_date)
ax.set_title('Sales per day for the Second most sold product')

In [None]:
# creating a purchase day feature
df_top3['order_purchase_date'] = df_top3.invoice_date.dt.date

# creating an aggregation
sales_per_purch_date = df_top3.groupby('order_purchase_date', as_index=False).quantity.sum()
ax = sns.lineplot(x="order_purchase_date", y="quantity", data=sales_per_purch_date)
ax.set_title('Sales per day for the Third most sold product')

These plots do not show a clear pattern in the data. It could be interesting to try to predict the future sales.
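
The three plotting cells above repeat the same daily aggregation. As a sketch, it could be wrapped in a small helper; the function name `daily_quantity` and the toy data are hypothetical and not part of the notebook.

```python
import pandas as pd

def daily_quantity(df, stock_code):
    """Return the total quantity sold per calendar day for one stock code."""
    product = df[df["stock_code"] == stock_code]
    return (
        product.groupby(product["invoice_date"].dt.date)["quantity"]
        .sum()
        .rename_axis("order_purchase_date")
        .reset_index()
    )

# Toy transactions, invented for illustration
toy = pd.DataFrame({
    "stock_code": ["85123A", "85123A", "85123A", "22423"],
    "invoice_date": pd.to_datetime(
        ["2010-12-01 08:26", "2010-12-01 09:02", "2010-12-02 10:00", "2010-12-01 11:00"]
    ),
    "quantity": [6, 4, 2, 1],
})

daily = daily_quantity(toy, "85123A")
print(daily)
```

The returned frame has the same two columns as `sales_per_purch_date`, so it could be passed to `sns.lineplot` in the same way.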