In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
'''Import the e-commerce csv to a pandas df'''
data = pd.read_csv('/kaggle/input/ecommerce-data/data.csv',encoding= 'ISO-8859-1')
# data = pd.read_csv('data.csv',encoding= 'ISO-8859-1')
# Divider printed between outputs; a module-level name needs no `global` declaration
separator = '\n***********************************************************************\n'

In [None]:
'''Basic info,describe and memory analysis over the dataset'''
print(separator)
data.info()  # info() prints its report itself and returns None, so no print() is needed
print(separator)
print(data.describe())
print(separator)
print(data.head())
print(separator)
print(data.memory_usage(deep=True))
print(separator)

In [None]:
'''Let's perform some memory optimisation by converting Country, StockCode and Description to categorical columns.'''

data["Country"] = data["Country"].astype("category")
data["StockCode"] = data["StockCode"].astype("category")
data["Description"] = data["Description"].astype("category")
print(separator)
print(data.memory_usage(deep=True))
print(separator)

Now that's a lot better!! :D
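
To put a number on "a lot better", here is a minimal sketch that measures the saving from the category conversion. It runs on a small made-up frame (`toy`, `before`, `after` and `saved_pct` are names invented for this illustration, not part of the notebook above), so the figures it prints say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for the real data: a few distinct strings repeated many times
toy = pd.DataFrame({
    "Country": ["United Kingdom", "Germany", "France"] * 1000,
    "StockCode": ["85123A", "71053", "84406B"] * 1000,
})

# Total memory in bytes before the conversion (deep=True counts the string contents)
before = toy.memory_usage(deep=True).sum()

# Convert the low-cardinality string columns to the category dtype
converted = toy.astype({"Country": "category", "StockCode": "category"})
after = converted.memory_usage(deep=True).sum()

saved_pct = round((1 - after / before) * 100, 2)
print(f"before: {before} bytes, after: {after} bytes, saved: {saved_pct}%")
```

The saving comes from storing each distinct string once and keeping only small integer codes per row, so it shrinks as the number of distinct values approaches the number of rows.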

In [None]:
print(separator)

This is an e-commerce order history dataset, so it should be safe to assume that InvoiceNo reflects the total number of invoices issued/orders placed/entries in the data.
There are 541909 non-null entries in total.
Description (540455 non-null) and CustomerID (406829 non-null) have some missing values.
Description is a metadata column, and we may assume that neither its values nor the fact that some of them are missing will affect our EDA. Nonetheless, we shall explore possibilities to ignore or impute it.
The missing CustomerID values are an interesting find; it looks like a data-entry limitation on the part of BSS personnel. Imputing them seems tough and not fruitful, since CustomerID is supposed to be unique to each customer. Having said that, we shall revisit this point after some analysis. We never know what pattern we may find. ;)

From describe() we see that we have 3 numerical columns: Quantity, UnitPrice and CustomerID. This was expected, though CustomerID is an identifier rather than a measurement.
The spread is vast for all 3: the means and medians are far apart, which suggests these data points aren't normally distributed.
Quantity and UnitPrice have negative minimums, which is an interesting fact. These most likely refer to 'return orders', for which the company has to pay out cash from its own pocket to that customer/CustomerID.
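
As a quick illustration of why a large gap between mean and median hints at a non-normal distribution, here is a small sketch on made-up quantities (the values and the names `quantity`, `mean_q`, `median_q` and `skew_q` are hypothetical, not taken from the dataset).

```python
import pandas as pd

# Hypothetical quantities: mostly small orders plus one very large one
quantity = pd.Series([1, 2, 2, 3, 4, 6, 12, 5000], name="Quantity")

mean_q = quantity.mean()      # pulled upwards by the single huge order
median_q = quantity.median()  # barely affected by the outlier
skew_q = quantity.skew()      # positive skew means a long right tail

print(f"mean={mean_q}, median={median_q}, skew={skew_q:.2f}")
```

For a roughly symmetric distribution the mean and median would sit close together and the skew would be near zero.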

In [None]:
'''Let's find the absolute number and % of missing (isnull) values per column.'''

print(data.isnull().sum().sort_values(ascending=False))
print(separator)
print(round(data.isnull().sum().sort_values(ascending=False) / len(data) * 100, 2))

Description has 0.27% missing data. I'm going to let it be and not do anything about it for now. Later, if we find it correlating with some other column, we may think again.

CustomerID has close to 25% missing data, we need to do something about it.

Let's dig deeper.

In [None]:
print(data[["InvoiceNo", "Country"]].groupby('Country').count().sort_values("InvoiceNo", ascending=False))

It's a UK-based e-commerce website, and the transaction data tell the same story: the highest number of transactions is in the UK, followed by the EU and then the rest of the world. OK, this confirms our intuition. The transactions span the period from 1st Dec 2010 to 9th Dec 2011.

In [None]:
sns.kdeplot(data['Quantity'], color="green")  # no print(): it would only echo the Axes object

In [None]:
sns.kdeplot(data['Quantity'], clip=(-20000, 20000), color="blue")
plt.figure()  # start a new figure so the histogram is not drawn over the KDE
data['Quantity'].plot(kind='hist', bins=5)

The Quantity values lie mainly within the -20000 to 20000 range.

CustomerID     135080
Description      1454

CustomerID     24.93%
Description     0.27%

These are the NA values in our dataset. Let's drop them: the share of missing Description values is low and the column doesn't matter much, so we need not try to reconstruct it. The missing CustomerID values, on the other hand, cannot be recovered, since there appears to be no direct relation with the other column(s).
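
One way to check whether missing CustomerID values relate to another column is to compare the missing share per group. The sketch below does this by Country on a tiny made-up frame (`toy` and `missing_share` are names invented here, and the values are hypothetical); the same pattern could be tried against the real `data` before dropping anything.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: some customers are unknown (NaN)
toy = pd.DataFrame({
    "Country": ["United Kingdom", "United Kingdom", "United Kingdom",
                "Germany", "Germany", "France"],
    "CustomerID": [17850.0, np.nan, np.nan, 12583.0, 12590.0, np.nan],
})

# The mean of a boolean flag per group is the fraction of missing values in that group
missing_share = (
    toy["CustomerID"].isnull()
    .groupby(toy["Country"])
    .mean()
    .mul(100)
    .round(2)
    .sort_values(ascending=False)
)
print(missing_share)
```

If the shares were similar across groups, that would support treating the gaps as unrelated to that column; large differences would be worth a closer look.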

In [None]:
data.dropna(inplace=True)
data.isnull().sum()

Perfect, now we have dropped the NA rows. Let's look back at the basic info once again on the new df.

In [None]:
'''Basic info and describe analysis over the dataset'''
data.info()  # info() prints its report itself and returns None
print(separator)
print(data.describe())
print(separator)
print(data.head())

In [None]:
data[(data['Quantity']<=0) | (data['UnitPrice']<0)].count()

8905 rows have a non-positive Quantity or a negative UnitPrice. These might be return orders, but we are not sure of the reason. That is about 9k rows against the roughly 5 lakh (500k) records we started with, so we can trim them out and be fine with the data we have left. Let's proceed that way.
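
The return-order guess could be probed before trimming. The sketch below assumes that cancelled invoices carry an InvoiceNo starting with "C"; this convention is not confirmed anywhere in this notebook and should be verified against the data first. The frame `toy` and the names `is_cancel`, `is_non_positive` and `summary` are hypothetical.

```python
import pandas as pd

# Hypothetical rows; the "C" prefix for cancellations is an ASSUMPTION, not a verified fact
toy = pd.DataFrame({
    "InvoiceNo": ["536365", "C536379", "536381", "C536383"],
    "Quantity": [6, -1, 2, -12],
    "UnitPrice": [2.55, 27.50, 1.65, 4.65],
})

# Flag rows whose invoice number starts with "C" (assumed cancellation marker)
is_cancel = toy["InvoiceNo"].astype(str).str.startswith("C")
# Flag rows with a non-positive quantity, as in the filter used above
is_non_positive = toy["Quantity"] <= 0

# How many non-positive rows are explained by the assumed cancellation marker?
summary = {
    "non_positive_rows": int(is_non_positive.sum()),
    "of_which_cancelled": int((is_non_positive & is_cancel).sum()),
}
print(summary)
```

If most non-positive rows turned out to be flagged this way on the real data, the "return orders" reading would be on firmer ground.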

In [None]:
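# Keep rows with positive Quantity and non-negative UnitPrice.
# .copy() makes the result an independent frame, so adding columns later does not trigger SettingWithCopyWarning.
data = data[(data['Quantity'] > 0) & (data['UnitPrice'] >= 0)].copy()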

In [None]:
print(data.shape)

In [None]:
data.head()

Let's add a column for the total amount of each row, i.e. each product line of an invoice/order. That would be Quantity * UnitPrice.

In [None]:
data['TotalAmount']=data['Quantity']*data['UnitPrice']
data.head()

Let's find the order with the largest amount.

In [None]:
data[data['TotalAmount']==data['TotalAmount'].max()]

So invoice number 581483 was the largest order received, amounting to 168469.6. The order was placed on 12-9-2011 (i.e. 9 Dec 2011) by a UK customer (16446), with the description PAPER CRAFT , LITTLE BIRDIE.
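
Note that TotalAmount is computed per row, so the lookup above finds the largest single product line. An invoice can span several rows, and a minimal sketch of invoice-level totals (on a made-up frame; `toy`, `invoice_totals`, `largest_invoice` and `largest_row` are hypothetical names) shows that the two answers can differ.

```python
import pandas as pd

# Hypothetical invoices: A3 has three modest lines, A2 has one big line
toy = pd.DataFrame({
    "InvoiceNo": ["A1", "A1", "A2", "A3", "A3", "A3"],
    "Quantity": [10, 5, 100, 20, 30, 40],
    "UnitPrice": [1.0, 2.0, 0.5, 1.0, 1.0, 1.0],
})
toy["TotalAmount"] = toy["Quantity"] * toy["UnitPrice"]

# Sum the row amounts per invoice to get true invoice totals
invoice_totals = toy.groupby("InvoiceNo")["TotalAmount"].sum().sort_values(ascending=False)

largest_invoice = invoice_totals.idxmax()                       # biggest invoice overall
largest_row = toy.loc[toy["TotalAmount"].idxmax(), "InvoiceNo"]  # invoice holding the biggest single line

print(invoice_totals)
print(largest_invoice, largest_row)
```

Running the same groupby on the real `data` would confirm whether the largest row and the largest invoice coincide there.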

In [None]:
data[data['CustomerID']==16446.0].sort_values(by='InvoiceDate', ascending=False)

Besides that huge order, this particular customer hasn't placed any significant orders.

Now, let's do some operations on the InvoiceDate column to help us study the dataset further.

In [None]:
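'''Convert InvoiceDate to datetime'''
data['InvoiceDate'] = pd.to_datetime(data['InvoiceDate'])

'''Grouping the data based on months to get a feel of the monthly sales data'''
# Sum only the numeric columns we plot; the categorical columns cannot be summed.
# freq='M' means month-end; newer pandas versions prefer the alias 'ME'.
data_new = data.groupby(pd.Grouper(key='InvoiceDate', freq='M'))[['Quantity', 'TotalAmount']].sum()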
data_new.reset_index(level=0, inplace=True)
data_new

Let's plot it out to get a better visual representation.

In [None]:
data_new.plot(x='InvoiceDate', y='TotalAmount', kind='bar')
data_new.plot(x='InvoiceDate', y='Quantity', kind='bar')

Thus, we conclude that Nov-2011 had the highest sales by both TotalAmount and Quantity (maybe because of Christmas shopping? ;) ), while Feb-2011 was the worst month for sales, and the last quarter of the year was the best of the 2011 quarters.
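
The quarterly claim can be checked directly instead of being read off the monthly bars. Here is a minimal sketch that rolls monthly totals up to quarters; the figures are made up and the names `monthly`, `quarterly` and `best_quarter` are hypothetical, so this only shows the mechanics.

```python
import pandas as pd

# Hypothetical monthly totals for the first half of a year
monthly = pd.DataFrame({
    "InvoiceDate": pd.to_datetime([
        "2011-01-31", "2011-02-28", "2011-03-31",
        "2011-04-30", "2011-05-31", "2011-06-30",
    ]),
    "TotalAmount": [100.0, 50.0, 120.0, 90.0, 110.0, 130.0],
})

# Map each date to its calendar quarter and sum the amounts per quarter
quarterly = monthly.groupby(monthly["InvoiceDate"].dt.to_period("Q"))["TotalAmount"].sum()

best_quarter = quarterly.idxmax()
print(quarterly)
print("best quarter:", best_quarter)
```

On the real data, keep in mind that a quarter covering only part of a month is not directly comparable with full quarters.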

In [None]:
'''Top 5 countries by sales count in the cleaned-up data.'''
data.Country.value_counts().head().plot(kind='bar')

In [None]:
'''Top 5 countries by total gross sales amount.'''
data_temp = data.groupby(['Country'])['TotalAmount'].agg('sum').reset_index().sort_values(by=['TotalAmount'],ascending=False).head()
print(data_temp)
data_temp.plot(x='Country', y='TotalAmount', kind='bar')

Let's check the top 5 order descriptions by the number of invoice rows recorded against them.

In [None]:
data.groupby(['Description']).size().reset_index(name='counts').sort_values(by=['counts'],ascending=False).head()

# EDA conclusion:

1. The top 5 countries by number of sales/invoice rows are: UK, Germany, France, Ireland, Spain.
2. The top 5 countries by total gross sales amount are: UK, Netherlands, Ireland, Germany, France.
3. The data had negative quantities/unit prices, which might have been return orders. Either way, we excluded those rows from the analysis.
4. Invoice number 581483 was the largest single order received, amounting to 168469.6. The order was placed on 12-9-2011 (i.e. 9 Dec 2011) by a UK customer (16446), with the description PAPER CRAFT , LITTLE BIRDIE.
5. Nov-2011 had the highest sales by both TotalAmount and Quantity (maybe because of Christmas shopping? ;) ), while Feb-2011 was the worst month for sales, and the last quarter of the year was the best of the 2011 quarters.
6. 'WHITE HANGING HEART T-LIGHT HOLDER' was the top order description by number of invoice rows against it.