## Exploratory Data Analysis of E-Commerce Data

The dataset was obtained from Kaggle (https://www.kaggle.com/carrie1/ecommerce-data).

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
df=pd.read_csv('../input/ecommerce-data/data.csv', encoding= 'unicode_escape')

In [None]:
df.head()

In [None]:
df.info()

The dtype of the InvoiceDate column is still object. Use pd.to_datetime to convert it to datetime.

In [None]:
df['InvoiceDate']=pd.to_datetime(df['InvoiceDate'])
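Passing an explicit format to pd.to_datetime makes the parsing unambiguous. The sketch below uses a few made-up timestamp strings; the `'%m/%d/%Y %H:%M'` format is an assumption about how the CSV stores dates and should be checked against the output of `df.head()` before use.

```python
import pandas as pd

# Hypothetical sample strings; the real column's format is assumed, not confirmed
sample = pd.Series(['12/1/2010 8:26', '12/9/2011 12:50'])

# An explicit format removes any day/month ambiguity during parsing
parsed = pd.to_datetime(sample, format='%m/%d/%Y %H:%M')
print(parsed.dtype)
print(parsed.dt.month.tolist())
```

If the assumed format does not match the data, pd.to_datetime raises an error instead of silently guessing, which is usually the safer failure mode.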

In [None]:
df.info()

In [None]:
df.shape

Create a total transaction column by multiplying quantity and unit price

In [None]:
df['total_transaction']=df['Quantity']*df['UnitPrice']

In [None]:
df.head()

Create new Month and Year columns for each transaction

In [None]:
df['Month']=df['InvoiceDate'].dt.month

In [None]:
df['Year']=df['InvoiceDate'].dt.year

Sort dataframe by year and month

In [None]:
df=df.sort_values(by=['Year','Month'])

In [None]:
mmap={1:'Jan11',2:'Feb11',3:'Mar11',4:'Apr11', 5:'May11', 6:'Jun11', 7:'Jul11',8:'Aug11',9:'Sep11',10:'Oct11',11:'Nov11',12:'Dec11'}

In [None]:
df['Month_name']=df['Month'].map(mmap)

In [None]:
df.head()

In [None]:
def my(x):
    # Every 2010 row is labelled 'Dec10'; other rows keep the mapped 2011 label
    if x['Year']==2010:
        return 'Dec10'
    return x['Month_name']

In [None]:
df['Month_name']=df[['Month_name','Year']].apply(my, axis=1)
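The month labels can also be derived directly from the datetime column. The following is a minimal sketch on toy dates (not the real data) that uses `Series.dt.strftime`; it assumes an English locale, since `%b` produces locale-dependent month abbreviations.

```python
import pandas as pd

# Toy dates standing in for the InvoiceDate column
dates = pd.Series(pd.to_datetime(['2010-12-01', '2011-01-15', '2011-12-09']))

# '%b%y' gives the abbreviated month name followed by the two-digit year
labels = dates.dt.strftime('%b%y')
print(labels.tolist())
```

This produces labels in the same 'Dec10' / 'Jan11' style as the mapping above without a separate dictionary or a row-wise apply.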

In [None]:
df.head()

## Total transaction by month

Compute the total transaction value for each month

In [None]:
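# numeric_only=True skips text and datetime columns, which cannot be summed in recent pandas
monthly=df.groupby(['Year','Month','Month_name']).sum(numeric_only=True)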

In [None]:
monthly

In [None]:
monthly.reset_index(inplace=True)

In [None]:
monthly
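As an alternative sketch of the same monthly aggregation, the example below groups a toy frame by a monthly period and sums only the column of interest. The toy values are invented for illustration.

```python
import pandas as pd

# Toy transactions; the values are made up for illustration
toy = pd.DataFrame({
    'InvoiceDate': pd.to_datetime(['2010-12-01', '2010-12-05', '2011-01-10']),
    'Description': ['A', 'B', 'A'],
    'total_transaction': [10.0, 5.0, 7.5],
})

# Selecting the column first avoids summing text or datetime columns
monthly_toy = (
    toy.groupby(toy['InvoiceDate'].dt.to_period('M'))['total_transaction']
       .sum()
)
print(monthly_toy)
```

Grouping by period keeps year and month in one sortable key, so no separate sort by Year and Month is needed.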

In [None]:
plt.figure(figsize=(10,5))
sns.set_style('whitegrid')
sns.barplot(x='Month_name', y='total_transaction', data=monthly, palette='viridis')
plt.title('Monthly Total Transaction', fontsize=14)
plt.xlabel('Month')
plt.ylabel('Total Transaction (mil)')

Month with highest transaction = November 2011 <br/>
Month with lowest transaction = December 2011 (a partial month, since the last InvoiceDate in the data is 2011-12-09)

## What happened in November 2011?

Find which products generated the highest total transaction value in November 2011

In [None]:
nov11=df[(df['Month']==11) & (df['Year']==2011)]

In [None]:
nov11.info()

Fill the missing values in the Description column with 'unknown' so that its non-null count matches the StockCode column; otherwise groupby would drop the rows with a missing description

In [None]:
# assign returns a new dataframe, which avoids modifying a slice of df in place
nov11=nov11.assign(Description=nov11['Description'].fillna('unknown'))

In [None]:
nov11.info()

Group by StockCode and Description, then sort by total transaction to see which products generate the most transaction value

In [None]:
nov11=nov11.groupby(['StockCode','Description']).sum(numeric_only=True).sort_values(by='total_transaction', ascending=False)

### Top 10 products sold in Nov 2011

In [None]:
nov11['total_transaction'].head(10)
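The group, sort, and head steps can also be written as a single chain with `nlargest`. This is a small sketch on invented rows, not the notebook's actual data.

```python
import pandas as pd

# Toy rows; stock codes, descriptions, and amounts are made up
toy = pd.DataFrame({
    'StockCode': ['A1', 'A1', 'B2', 'C3'],
    'Description': ['MUG', 'MUG', 'LAMP', 'BAG'],
    'total_transaction': [10.0, 15.0, 40.0, 5.0],
})

# Sum per product, then keep the two largest totals
top = (
    toy.groupby(['StockCode', 'Description'])['total_transaction']
       .sum()
       .nlargest(2)
)
print(top)
```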

Compare it with other months

In [None]:
df['Description']=df['Description'].fillna('unknown')

In [None]:
pivot=df.pivot_table(index=['StockCode','Description'], values='total_transaction', columns='Month_name', aggfunc='sum').sort_values(by='Nov11', ascending=False)

In [None]:
pivot.head(10)

As the pivot table above shows, the top 10 items of November 2011 sold significantly more in that month than in the other months. This might be because Christmas is around the corner, so people are buying gifts and new items in November.
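One way to put a number on "significantly more" is to divide each item's November total by its average over the other months. The sketch below does this on a toy pivot table with invented values; the column names follow the Month_name labels used in this notebook.

```python
import pandas as pd

# Toy pivot: rows are items, columns are months, values are made up
toy_pivot = pd.DataFrame(
    {'Sep11': [100.0, 50.0], 'Oct11': [200.0, 50.0], 'Nov11': [600.0, 55.0]},
    index=['ITEM A', 'ITEM B'],
)

# Average of every month except November, per item
other_mean = toy_pivot.drop(columns='Nov11').mean(axis=1)

# A ratio well above 1 means November was unusually strong for that item
ratio = toy_pivot['Nov11'] / other_mean
print(ratio)
```

On the real pivot table, months in which an item had no sales appear as NaN and are skipped by `mean`, so the ratio would compare November against the months with recorded sales only.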

## Most and Least Popular Items

Find the most and least popular products based on quantity sold

In [None]:
qty=df.pivot_table(index=['StockCode','Description'], values='Quantity', aggfunc='sum').sort_values(by='Quantity', ascending=False)

In [None]:
qty.reset_index(inplace=True)
qty.head()

In [None]:
sns.barplot(y='Description', x='Quantity', data=qty.head(10))
plt.title('Top 10 Most Popular Items', fontsize=14)
plt.ylabel('Item')

In [None]:
sns.barplot(y='Description', x='Quantity', data=qty.tail(10))
plt.title('Top 10 Least Popular Items', fontsize=14)
plt.ylabel('Item')

The plot above is not meaningful as a popularity ranking because the summed quantities are negative, probably due to returns or cancellations. <br>
Therefore we drop the items whose total quantity is <= 0

In [None]:
qty=qty[qty['Quantity']>0]
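To illustrate how a summed quantity can end up negative and what the filter does, here is a minimal sketch on toy rows. Treating negative rows as returns is an assumption made for the example.

```python
import pandas as pd

# Toy rows; negative quantities stand in for assumed returns
toy = pd.DataFrame({
    'Description': ['MUG', 'MUG', 'LAMP', 'LAMP', 'BAG'],
    'Quantity': [10, -2, 3, -5, 4],
})

# Net quantity per item: purchases and negative rows are summed together
net = toy.groupby('Description', as_index=False)['Quantity'].sum()

# Keep only items with a positive net quantity, largest first
positive = net[net['Quantity'] > 0].sort_values('Quantity', ascending=False)
print(positive)
```

In this toy frame LAMP nets to a negative total and is dropped, while MUG stays with its net quantity after the negative row is subtracted.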

In [None]:
sns.barplot(y='Description', x='Quantity', data=qty.tail(10))
plt.title('Top 10 Least Popular Items', fontsize=14)
plt.ylabel('Item')

### Countries with the most transactions

In [None]:
bycountry=df.groupby('Country').sum(numeric_only=True)

In [None]:
bycountry.reset_index(inplace=True)

In [None]:
bycountry.sort_values(by='total_transaction', ascending=False).head()

In [None]:
sns.barplot(x='total_transaction', y='Country', data=bycountry.sort_values(by='total_transaction', ascending=False).head())
plt.xlabel('Total Transaction')
plt.title('5 Countries with Most Transaction')

The graph above shows that the top 5 countries by total transaction value are the UK, Netherlands, EIRE, Germany, and France

### Most popular items in the top countries

In [None]:
indexed=df.pivot_table(index=['Country','StockCode','Description'], values='Quantity', aggfunc='sum').reset_index()

In [None]:
sns.barplot(y='Description', x='Quantity', data=indexed[indexed['Country']=='United Kingdom'].sort_values(by='Quantity', ascending=False).head(10))
plt.title('Top 10 Most Popular Items in UK', fontsize=14)
plt.ylabel('Item')

In [None]:
sns.barplot(y='Description', x='Quantity', data=indexed[indexed['Country']=='Netherlands'].sort_values(by='Quantity', ascending=False).head(10))
plt.title('Top 10 Most Popular Items in Netherlands', fontsize=14)
plt.ylabel('Item')

In [None]:
sns.barplot(y='Description', x='Quantity', data=indexed[indexed['Country']=='EIRE'].sort_values(by='Quantity', ascending=False).head(10))
plt.title('Top 10 Most Popular Items in EIRE', fontsize=14)
plt.ylabel('Item')

In [None]:
sns.barplot(y='Description', x='Quantity', data=indexed[indexed['Country']=='Germany'].sort_values(by='Quantity', ascending=False).head(10))
plt.title('Top 10 Most Popular Items in Germany', fontsize=14)
plt.ylabel('Item')

In [None]:
sns.barplot(y='Description', x='Quantity', data=indexed[indexed['Country']=='France'].sort_values(by='Quantity', ascending=False).head(10))
plt.title('Top 10 Most Popular Items in France', fontsize=14)
plt.ylabel('Item')
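The five cells above repeat the same filter and sort with a different country each time. A small helper can factor that out; `top_items` is a hypothetical name introduced here, and the toy frame only mimics the shape of `indexed`.

```python
import pandas as pd

def top_items(frame, country, n=10):
    """Return the n rows with the largest Quantity for one country."""
    subset = frame[frame['Country'] == country]
    return subset.nlargest(n, 'Quantity')

# Toy frame shaped like the aggregated table; values are made up
toy_indexed = pd.DataFrame({
    'Country': ['France', 'France', 'Germany', 'France'],
    'Description': ['MUG', 'LAMP', 'BAG', 'BAG'],
    'Quantity': [5, 12, 7, 1],
})

for country in ['France', 'Germany']:
    print(country, top_items(toy_indexed, country, n=2)['Description'].tolist())
```

The returned frame can be passed as `data` to sns.barplot inside the same loop to draw one chart per country.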

## Customer Churn

Assume that customers who have no transaction data in the last 6 months are categorized as churned.

First, figure out the last date of InvoiceDate as a point of reference

In [None]:
df.sort_values(by='InvoiceDate', ascending=False).head(1)

So the last InvoiceDate is 2011-12-09

That means customers with no transactions on or after 2011-06-09 (6 months before the last invoice date) are categorized as churned.
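The cutoff date can be computed rather than typed by hand. The sketch below subtracts six calendar months from the last invoice date with `pd.DateOffset`; the variable names are illustrative.

```python
import pandas as pd

# Last invoice date found in the data
last_invoice = pd.Timestamp('2011-12-09')

# Six calendar months earlier
cutoff = last_invoice - pd.DateOffset(months=6)
print(cutoff.date())
```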

Group the dataframe by CustomerID with max() as the aggregate function to get each customer's most recent invoice date

In [None]:
cust=df.groupby('CustomerID').max().sort_values(by='InvoiceDate', ascending=False)

In [None]:
cust.loc[cust['InvoiceDate'] < '2011-06-09', 'Churn']='Yes'
cust.loc[cust['InvoiceDate'] >= '2011-06-09', 'Churn']='No'
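The two `.loc` assignments can also be expressed as one `np.where` call, and the churn share read off with `value_counts`. This is a minimal sketch on four invented customers, using the same cutoff rule.

```python
import numpy as np
import pandas as pd

# Toy table of each customer's most recent invoice date (made-up values)
toy_cust = pd.DataFrame(
    {'InvoiceDate': pd.to_datetime(
        ['2011-03-01', '2011-11-20', '2011-06-09', '2010-12-15'])},
    index=pd.Index([1, 2, 3, 4], name='CustomerID'),
)

cutoff = pd.Timestamp('2011-06-09')

# 'Yes' if the last invoice is before the cutoff, otherwise 'No'
toy_cust['Churn'] = np.where(toy_cust['InvoiceDate'] < cutoff, 'Yes', 'No')

# Fraction of customers in each class
share = toy_cust['Churn'].value_counts(normalize=True)
print(share)
```

A customer whose last invoice falls exactly on the cutoff date counts as active, matching the `>=` comparison used above.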

In [None]:
cust

So there are 4372 customers.<br/>
All the customers categorized as churn are as follows:

In [None]:
cust[cust['Churn']=='Yes']

In [None]:
churn=cust.reset_index().groupby('Churn').count()

In [None]:
churn.head()

In [None]:
plt.pie(churn['CustomerID'], labels=['Active', 'Churn'], autopct='%1.0f%%')

Based on the counts above, 849 of 4372 customers, or approximately 19% of the total, are categorized as churned.

## Thank You