## E-Commerce Data Analysis
### By: Brandon McManus

#### Data Set Information:

This is a transnational data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retailer. The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers.


#### Attribute Information:
**InvoiceNo:** Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. If this code starts with the letter 'C', it indicates a cancellation.  
       
**StockCode:** Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product.         
       
**Description:** Product (item) name. Nominal.   
        
**Quantity:** The quantities of each product (item) per transaction. Numeric.    
        
**InvoiceDate:** Invoice Date and time. Numeric, the day and time when each transaction was generated.      
       
**UnitPrice:** Unit price. Numeric, Product price per unit in sterling.      
      
**CustomerID:** Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer.      
        
**Country:** Country name. Nominal, the name of the country where each customer resides.      

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import datetime as dt
import seaborn as sns

In [None]:
dat = pd.read_csv("../input/ecommerce-data/data.csv", encoding= 'unicode_escape')

pd.set_option('display.max_columns', 500)

dat.info()

### Data Exploration & Data Cleaning

In [None]:
dat.info()

In [None]:
dat.head(15)

From an initial look at the data, we can see that it is made up of transactions, with a separate row for each Description (product) on an invoice. The data types of the CustomerID and InvoiceDate columns need to be changed.
        
**InvoiceDate** should be *datetime64* rather than *object* Dtype        
**CustomerID** should be *object* type rather than *float64* Dtype

In [None]:
dat['InvoiceDate'] = pd.to_datetime(dat['InvoiceDate'])
dat.CustomerID = dat.CustomerID.astype(object)
dat.info() 

Now I want to see if there are any missing values in the dataset that may cause problems for our analysis.

In [None]:
dat.isnull().sum()

In [None]:
dat.isnull().sum() / dat.shape[0] * 100

We have almost 25% of the CustomerID column missing and less than 1% of the Description column missing. I will check to see if the rows with missing Description values are also missing CustomerID values to eliminate unnecessary work.

In [None]:
pd.set_option('display.max_rows', 1000)
null_data = dat[dat.isnull().any(axis=1)]
null_data.info()

It appears that all rows with a missing Description are also missing a CustomerID, so let's take a look at the rows where both values are missing.

In [None]:
null_data = dat[dat['Description'].isnull()]
null_data.sample(15)

It appears that where both Description and CustomerID are missing, UnitPrice is 0 and Quantity is either positive or negative. We can infer that these transactions are likely returns from customers and that the company has not developed a clear strategy for recording returned items. It would be wise for the company to develop some way of identifying returns and faulty transactions so they can be assessed more accurately. However, since there is no explanation for the occurrence of these transactions, it is best to drop all transactions with a missing Description and a UnitPrice of 0. It is also in our best interest to drop rows with a missing CustomerID, as they will not be of use if we want to draw accurate insights from this analysis.
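
The observations above come from a random sample of rows, so they can also be checked over every affected row. The following is a minimal sketch on a toy frame; the rows are made up for illustration and only the column names match the real data.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for the real data (values are made up for illustration).
toy = pd.DataFrame({
    "Description": ["MUG", None, None, "LAMP"],
    "CustomerID": [17850.0, np.nan, np.nan, np.nan],
    "Quantity": [6, -10, 24, 2],
    "UnitPrice": [2.55, 0.0, 0.0, 4.25],
})

missing_desc = toy[toy["Description"].isnull()]

# True if every row without a Description also lacks a CustomerID.
all_missing_customer = bool(missing_desc["CustomerID"].isnull().all())

# True if every row without a Description has a zero unit price.
all_zero_price = bool((missing_desc["UnitPrice"] == 0).all())

# Number of negative (-1) and positive (1) quantities among those rows.
sign_counts = np.sign(missing_desc["Quantity"]).value_counts()

print(all_missing_customer, all_zero_price)
print(sign_counts)
```

Running the same three checks on `dat` instead of `toy` would confirm or refute each claim for the whole dataset rather than for 15 sampled rows.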

In [None]:
dat = dat.dropna()
dat.isnull().sum()

Next I would like to determine how many transaction cancellations we have. A cancelled transaction is indicated by a C at the beginning of the InvoiceNo.

In [None]:
# Flag cancellations: InvoiceNo values that start with "C"
dat["IsCancelled"] = dat.InvoiceNo.astype(str).str.startswith("C")
dat.IsCancelled.value_counts() / dat.shape[0] * 100

In [None]:
dat.loc[dat.IsCancelled==True].describe()

The cancelled rows have negative quantities at every quartile but positive unit prices. Without further information on how cancellations relate to the original purchases, they are too difficult to interpret, so it is best to drop them from the dataset.

In [None]:
dat = dat.loc[dat.IsCancelled==False].copy()
dat = dat.drop("IsCancelled", axis=1)

### Stock Codes and Descriptions

In [None]:
dat.StockCode.nunique(), dat.Description.nunique()

We have 3665 unique StockCodes and 3877 unique Descriptions, which aligns with the fact that the retailer sells many different types of products. Let's take a look at the most common stock codes and descriptions being sold.

In [None]:
stockcode_frequency = dat.StockCode.value_counts().sort_values(ascending=False)
description_frequency = dat.Description.value_counts().sort_values(ascending=False)
fig, ax = plt.subplots(2,1,figsize=(20,15))
# iloc[0:20] keeps the top 20; seaborn >= 0.12 needs x and y as keyword arguments
sns.barplot(x=stockcode_frequency.iloc[0:20].index,
            y=stockcode_frequency.iloc[0:20].values,
            ax = ax[0], palette="Blues_r")
ax[0].set_ylabel("Frequency")
ax[0].set_xlabel("Stockcode")
ax[0].set_title("Which stockcodes are most common?");
sns.barplot(x=description_frequency.iloc[0:20].index,
            y=description_frequency.iloc[0:20].values,
            ax = ax[1], palette="Purples_r")
ax[1].set_ylabel("Frequency")
ax[1].set_xlabel("Description")
ax[1].tick_params(labelrotation=90)

ax[1].set_title("Which Descriptions are most common?");

We can see that our top 20 most frequent stock codes and descriptions generally match up with each other in terms of frequency. The majority of descriptions are therefore consistent with the stock codes; the few exceptions, where one stock code carries more than one description, explain why there are slightly more unique descriptions than stock codes.
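
One way to find those exceptions is to count the distinct descriptions attached to each stock code. This is a minimal sketch on a toy frame; the codes `A1`, `B2`, `C3` and their descriptions are hypothetical placeholders, not values from the dataset.

```python
import pandas as pd

# Hypothetical stock codes and descriptions, for illustration only.
toy = pd.DataFrame({
    "StockCode": ["A1", "A1", "B2", "B2", "C3"],
    "Description": [
        "WHITE CANDLE HOLDER",
        "CREAM CANDLE HOLDER",
        "CAKE STAND",
        "CAKE STAND",
        "PAPER BUNTING",
    ],
})

# Number of distinct descriptions recorded for each stock code.
desc_per_code = toy.groupby("StockCode")["Description"].nunique()

# Stock codes that appear with more than one description.
multi = desc_per_code[desc_per_code > 1]
print(multi)
```

Applied to `dat`, the length of `multi` would show how many stock codes account for the gap between the two unique counts.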

### Customers and Countries 

Next I would like to see which customers and which countries had the most transactions, and how our top customers are distributed across countries.

In [None]:
customer_frequency = dat.CustomerID.value_counts().sort_values(ascending=False).iloc[0:20]
plt.figure(figsize=(19,10))
customer_frequency.index = customer_frequency.index.astype('Int64')
# seaborn >= 0.12 needs x and y as keyword arguments
sns.barplot(x=customer_frequency.index, y=customer_frequency.values, order=customer_frequency.index, palette="Spectral_r")
plt.ylabel("Frequency")
plt.xlabel("CustomerID")
plt.title("Which customers are most common?");

In [None]:
country_frequency = dat.Country.value_counts().sort_values(ascending=False).iloc[0:20]
plt.figure(figsize=(20,5))
sns.barplot(x=country_frequency.index, y=country_frequency.values, palette="plasma_r")
plt.ylabel("Frequency")
plt.title("Which countries made the most transactions?");
plt.xticks(rotation=90);

It is clear that the vast majority of transactions take place in the United Kingdom. Let's see whether our top 20 customers purchase their items in the United Kingdom or in other countries.

In [None]:
x = dat.groupby(['CustomerID','Country']).size().sort_values(ascending=False).iloc[0:20]
pd.DataFrame(x)

In [None]:
customer_frequency = dat.CustomerID.value_counts().sort_values(ascending=False).iloc[0:20]
# Filter to UK rows first, then count transactions per customer
uk_customers = dat.loc[dat['Country'] == 'United Kingdom', 'CustomerID'].value_counts().iloc[0:20]
fig, ax = plt.subplots(2,1,figsize=(20,20))
sns.barplot(x=customer_frequency.index,
            y=customer_frequency.values,
            ax = ax[0], palette="Blues_r", order=customer_frequency.index)
ax[0].set_ylabel("Frequency")
ax[0].set_xlabel("CustomerID")
ax[0].set_title("Which Customers are most common?");
sns.barplot(x=uk_customers.index,
            y=uk_customers.values,
            ax = ax[1], palette="cool", order=uk_customers.index)
ax[1].set_ylabel("Frequency")
ax[1].set_xlabel("CustomerID")
ax[1].set_title("Which Customers are most common in the United Kingdom?");

It appears we have a few outliers in our top customers group, where the country is Ireland or the Netherlands. However, the majority are from the United Kingdom, which makes sense given the large difference in transactions between the United Kingdom and the rest of the countries in our dataset.

### Unit Price and Quantity 

Before I start doing any time-series analysis, I want to make sure my price and quantity features make sense and will be easy for us to analyze.

In [None]:
dat.UnitPrice.describe()

Before I graph the unit price, I want to make sure there are no unit prices equal to or below 0, as these would be a problem when we take the log of the unit price.

In [None]:
dat = dat.loc[dat.UnitPrice > 0].copy()

Now that there are no zero value unit prices we can graph the distribution

In [None]:
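fig, ax = plt.subplots(2,1,figsize=(15,15))
# histplot replaces the deprecated distplot; density + KDE mirrors its default output
sns.histplot(x=dat.UnitPrice, bins=50, stat="density", kde=True, ax=ax[0])
ax[0].set_ylabel('Density')
sns.histplot(x=np.log(dat.UnitPrice), bins=20, stat="density", kde=True, ax=ax[1])
ax[1].set_ylabel('Density')
ax[1].set_xlabel("Log-Unit-Price");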

From the graphs we can see that a large portion of the prices are quite small and that we have a few very large outliers. Because small prices are so frequent, I will focus on the range that holds most of the mass in the log-unit-price graph. Most prices fall between -2 and 3 log units, so I will take the exponential of -2 and of 3 to convert these bounds back to prices.

In [None]:
np.exp(-2),np.exp(3)

We can see that the majority of our distribution lies between roughly 0.1 and 20.1, so I will drop all outliers outside of this range, using the rounded cutoffs 0.1 and 20.
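
As a quick check on the rounding, the sketch below converts the log bounds to prices and then maps the rounded cutoffs back to log units. The cutoffs 0.1 and 20 are taken from the filter used in this notebook; the variable names are only illustrative.

```python
import numpy as np

# Bounds read off the log-unit-price histogram.
log_low, log_high = -2, 3

# Converting back to prices: exp(-2) is about 0.135 and exp(3) is about 20.09.
price_low, price_high = np.exp(log_low), np.exp(log_high)
print(round(price_low, 3), round(price_high, 3))

# Log values of the rounded cutoffs used in the filter (0.1 and 20).
cut_low, cut_high = np.log(0.1), np.log(20)
print(round(cut_low, 2), round(cut_high, 2))
```

The lower cutoff of 0.1 sits at about -2.3 log units, so it is slightly more permissive than exp(-2), while the upper cutoff of 20 sits just under 3 log units.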

In [None]:
dat = dat.loc[(dat.UnitPrice > 0.1) & (dat.UnitPrice < 20)].copy()

In [None]:
dat.UnitPrice.describe()

In [None]:
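fig, ax = plt.subplots(2,1,figsize=(15,15))
# histplot replaces the deprecated distplot; density + KDE mirrors its default output
sns.histplot(x=dat.UnitPrice, bins=50, stat="density", kde=True, ax=ax[0])
ax[0].set_ylabel('Density')
sns.histplot(x=np.log(dat.UnitPrice), bins=20, stat="density", kde=True, ax=ax[1])
ax[1].set_ylabel('Density')
ax[1].set_xlabel("Log-Unit-Price");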

Now our std is much smaller and we have a more evenly distributed graph. The distribution is still skewed to the right, which is worth noting and which we will deal with later. Now let's take a look at the Quantity column.

In [None]:
dat.Quantity.describe()

In [None]:
fig, ax = plt.subplots(2,1,figsize=(15,15))
# histplot replaces the deprecated distplot; counts only, no KDE
sns.histplot(x=dat.Quantity, bins=50, ax=ax[0])
ax[0].set_title("Quantity distribution")
ax[0].set_ylabel('Frequency')
ax[0].set_yscale("log")
sns.histplot(x=np.log(dat.Quantity), bins=20, ax=ax[1])
# This title belongs to the second (log) panel
ax[1].set_title("Log-Quantity distribution")
ax[1].set_ylabel('Frequency')
ax[1].set_xlabel("Log-Quantity");

From the graphs it looks like we have a small number of outliers, with quantities above 70,000. Let's take the exponential at Log-Quantity = 4, as most of our distribution lies below this value.

In [None]:
np.exp(4),np.quantile(dat.Quantity, 0.95)

Comparing the two values, it looks like we will be able to keep more than 95% of our data with a maximum quantity of 55. Let's take a look at our distribution after we drop the outliers.
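
The reasoning here is that if the 95% quantile is below the cutoff, then at least 95% of rows fall below the cutoff. The share retained can also be measured directly, as in this minimal sketch. It uses synthetic, right-skewed numbers as a stand-in for the Quantity column, so the printed figures say nothing about the real data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, right-skewed stand-in for the Quantity column (not the real data).
quantity = np.ceil(rng.lognormal(mean=1.5, sigma=1.0, size=10_000)).astype(int)

cutoff = 55  # roughly exp(4)

# Share of rows that would survive the filter `Quantity < cutoff`.
share_kept = (quantity < cutoff).mean()

# The 95% quantile, for comparison with the cutoff.
q95 = np.quantile(quantity, 0.95)

print(f"95% quantile: {q95:.1f}, share kept below {cutoff}: {share_kept:.3f}")
```

On the real data, `(dat.Quantity < 55).mean()` would give the exact fraction of rows kept.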

In [None]:
dat = dat.loc[dat.Quantity < 55].copy()

In [None]:
fig, ax = plt.subplots(2,1,figsize=(15,15))
# histplot replaces the deprecated distplot; counts only, no KDE
sns.histplot(x=dat.Quantity, bins=50, ax=ax[0])
ax[0].set_title("Quantity distribution")
ax[0].set_ylabel('Frequency')
ax[0].set_yscale("log")
sns.histplot(x=np.log(dat.Quantity), bins=20, ax=ax[1])
# This title belongs to the second (log) panel
ax[1].set_title("Log-Quantity distribution")
ax[1].set_ylabel('Frequency')
ax[1].set_xlabel("Log-Quantity");

In [None]:
dat.Quantity.describe()

We have a much better distribution now and a much smaller std. Now we can move on to our Time-Series Analysis.

### Which months had the highest Revenue?

Now I would like to explore what statistics and insights we can uncover with different time periods and dates in our data. I will start by creating new columns that represent different date ranges that we can use. I will also create a revenue column as that will be a good metric to look at to determine sales performance in different time ranges.

In [None]:
dat["Revenue"] = dat.Quantity * dat.UnitPrice

dat["Month"] = dat.InvoiceDate.dt.month

# numeric_only=True skips the text and datetime columns, which cannot be summed
dat.groupby('Month').sum(numeric_only=True).sort_values(by='Revenue', ascending=False)

In [None]:
plt.rcParams.update({'font.size': 12})
# numeric_only=True skips the text and datetime columns, which cannot be summed
z = dat.groupby('Month').sum(numeric_only=True).sort_values(by='Revenue',ascending=False)
x = z.index
y = z['Revenue'].values
plt.figure(figsize=(10,10))
# seaborn >= 0.12 needs x and y as keyword arguments
sns.barplot(x=x, y=y, order=x)
plt.ylabel("Revenue", fontsize=14)
plt.xlabel("Months", fontsize=14)
plt.title("Which Month had the highest Revenue?", fontsize=14);

We can see that November is the highest revenue month for the company, followed by October and then September. This could be because these months lead up to the holiday season, when people are more likely to be buying gifts and businesses are increasing their inventory. Since many of the company's customers are wholesalers, it is likely that they are preparing for the holiday season by purchasing more products.
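
One caveat: the data runs from 01/12/2010 to 09/12/2011, so grouping on the month number alone pools December 2010 with the first days of December 2011. Grouping on a year-month period keeps them apart. This is a minimal sketch with made-up dates and revenue values, not figures from the dataset.

```python
import pandas as pd

# Made-up invoices spanning two Decembers, for illustration only.
toy = pd.DataFrame({
    "InvoiceDate": pd.to_datetime(
        ["2010-12-01", "2010-12-15", "2011-11-20", "2011-12-05"]
    ),
    "Revenue": [10.0, 20.0, 50.0, 5.0],
})

# Month number only: both Decembers land in the same bucket (12).
by_month = toy.groupby(toy["InvoiceDate"].dt.month)["Revenue"].sum()

# Year-month period: December 2010 and December 2011 stay separate.
by_year_month = toy.groupby(toy["InvoiceDate"].dt.to_period("M"))["Revenue"].sum()

print(by_month)
print(by_year_month)
```

Plotting `by_year_month` in chronological order would also show the trend over time rather than only a ranking of calendar months.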

### What product contributed the most to revenue? Why?

In [None]:

df = dat[['StockCode','Revenue']].groupby('StockCode').sum().sort_values(by='Revenue', ascending=False).iloc[0:9]
df

We can see that StockCode 22423 contributed the most to revenue, at over £60,000 (unit prices are in sterling). Let's take a look at a sample of the transactions for StockCode 22423 to see if there are any clues that explain why its contribution to revenue is so high.

In [None]:
dat[dat['StockCode'] == '22423'].sample(10)

The only clue to explain its high contribution to revenue is its relatively high UnitPrice compared to the rest of the products, which allows it to generate more revenue from lower quantities sold. The other features seem to be relatively random, so without further information we cannot make any more inferences.
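
A random sample of 10 rows can only hint at this, so a per-product summary of revenue, quantity and price may be a more direct check. The sketch below uses a toy frame; the stock codes `P1` and `P2`, the values, and the summary column names are all hypothetical.

```python
import pandas as pd

# Hypothetical products: P1 is expensive and sells few units, P2 is cheap and sells many.
toy = pd.DataFrame({
    "StockCode": ["P1", "P1", "P2", "P2", "P2"],
    "Quantity": [2, 3, 10, 20, 10],
    "UnitPrice": [12.0, 12.0, 1.0, 1.0, 1.0],
})
toy["Revenue"] = toy["Quantity"] * toy["UnitPrice"]

# Per-product totals alongside the typical unit price and the number of rows.
summary = (
    toy.groupby("StockCode")
    .agg(
        total_revenue=("Revenue", "sum"),
        total_quantity=("Quantity", "sum"),
        median_unit_price=("UnitPrice", "median"),
        n_rows=("Quantity", "size"),
    )
    .sort_values("total_revenue", ascending=False)
)
print(summary)
```

In this toy case `P1` tops the revenue ranking despite selling far fewer units, which is the pattern suggested for StockCode 22423. Running the same aggregation on `dat` would show whether price or volume drives each top product.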

Let me know of any suggestions and critiques you have in the comments below!
Cheers!