# Introduction

This is the first part of https://www.kaggle.com/mustafacicek/detailed-marketing-analytics-cohort-pareto-rfm

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import datetime as dt
from matplotlib.ticker import PercentFormatter
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer
from itertools import combinations

pd.options.mode.chained_assignment = None

plt.rcParams["axes.facecolor"] = "#A2A2A2"
plt.rcParams["axes.grid"] = 1

<a id="section-one"></a>

# 1) General Info & Playing with Features

In [None]:
df = pd.read_csv("../input/ecommerce-data/data.csv")
display(df.head())
print(df.shape)

In [None]:
df.info()

The Description and CustomerID columns have missing values. Let's dig deeper.

In [None]:
df.isnull().sum()

In [None]:
df[df.Description.isnull()]

When the description is null, the unit price is 0 and the customer ID is missing. Let's check whether this holds for the whole dataset.

In [None]:
df[df.Description.isnull()].CustomerID.nunique()

In [None]:
df[df.Description.isnull()].UnitPrice.value_counts()

Whenever the description is null, the customer ID is missing and the unit price is zero, across the whole dataset. Let's drop these rows.

In [None]:
df = df[df.Description.notnull()]
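
The two checks above can also be written as explicit boolean tests. A minimal sketch on a toy frame (the rows are made up; only the column names come from the dataset):

```python
import pandas as pd

# Toy frame with the same column names as the dataset (values are made up).
toy = pd.DataFrame({
    "Description": ["white mug", None, None, "red lamp"],
    "UnitPrice": [2.5, 0.0, 0.0, 4.0],
    "CustomerID": [17850.0, None, None, None],
})

missing_desc = toy[toy["Description"].isnull()]

# Both conditions should hold for every row before a blanket drop is safe.
all_zero_price = (missing_desc["UnitPrice"] == 0).all()
no_customer = missing_desc["CustomerID"].isnull().all()

print(all_zero_price, no_customer)
```

If either flag came back False, dropping every row with a null description would also throw away rows that still carry a price or a customer.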

In [None]:
df[df.CustomerID.isnull()]

At first glance, the records with a missing customer ID show no specific characteristics.

StockCode contains non-numeric values such as DOT. That is a cue to examine the stock codes.

In [None]:
print("We have {} observations.".format(df.shape[0]))

df = df[df.CustomerID.notnull()]

print("We have {} observations after removing unknown customers.".format(df.shape[0]))

In [None]:
df.isnull().sum()

We are done with the systematically missing values. But let's go deeper.

Sometimes missing values are filled with placeholder tokens such as "NAN", "na", "?", or "Unknown". Let's check for them.

In [None]:
df[df.Description.str.len() < 5]
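
Besides looking at short descriptions, a direct token lookup is another option. This is only a sketch: the `PLACEHOLDERS` set and the `find_placeholders` helper are my own assumptions, not something the dataset documents.

```python
import pandas as pd

# Hypothetical placeholder tokens; extend the set for your own data.
PLACEHOLDERS = {"nan", "na", "n/a", "?", "unknown", "none", ""}

def find_placeholders(series: pd.Series) -> pd.Series:
    """Return a boolean mask marking values that look like disguised missing data."""
    normalized = series.astype(str).str.strip().str.lower()
    return normalized.isin(PLACEHOLDERS)

# Made-up sample values.
toy = pd.Series(["WHITE MUG", "?", " Unknown ", "RED LAMP", "NaN"])
mask = find_placeholders(toy)
print(toy[mask].tolist())
```

The comparison is done on a stripped, lowercased copy, so the original values stay untouched.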

In [None]:
df.InvoiceNo.value_counts()

InvoiceNo is coded as a 6-digit number. Some InvoiceNo values start with the letter C, which marks a cancellation.

In [None]:
df[df["InvoiceNo"].str.startswith("C")]

Cancelled invoices have negative quantities.

In [None]:
df["Cancelled"] = df["InvoiceNo"].str.startswith("C").astype(int)

Do we have both the cancellation record and the original record it cancels? For example, if we have C536379, do we also have 536379?

In [None]:
cancelled_invoiceNo = df[df.Cancelled == 1].InvoiceNo.tolist()
cancelled_invoiceNo = [x[1:] for x in cancelled_invoiceNo]

cancelled_invoiceNo[:5]

In [None]:
df[df["InvoiceNo"].isin(cancelled_invoiceNo)]

No matches: we only have the cancellation records.
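
The same lookup can be sketched on a toy frame, together with the negative-quantity check. The invoice numbers and quantities below are made up for illustration.

```python
import pandas as pd

# Made-up invoices: two regular ones and two cancellations.
toy = pd.DataFrame({
    "InvoiceNo": ["536365", "C536379", "536380", "C536383"],
    "Quantity": [6, -1, 12, -3],
})

is_cancel = toy["InvoiceNo"].str.startswith("C")

# Strip the leading "C" to get the invoice number a cancellation would refer to.
candidate_originals = toy.loc[is_cancel, "InvoiceNo"].str[1:]

# Rows whose InvoiceNo equals one of those stripped numbers.
matches = toy[toy["InvoiceNo"].isin(candidate_originals)]
print(len(matches))

# Do all cancellations carry a negative quantity?
cancel_negative = (toy.loc[is_cancel, "Quantity"] < 0).all()
print(cancel_negative)
```

An empty `matches` frame means no cancellation can be paired with an original invoice by number alone.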

Maybe InvoiceNo follows some other pattern as well. Let's check.

In [None]:
df[df.InvoiceNo.str.len() != 6]

No, InvoiceNo contains only regular invoices and cancellations. There is no other pattern.
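
The length check can also be phrased as a pattern check. A small sketch, assuming the format is an optional leading C followed by exactly six digits (the sample values and the `classify_invoice` helper are made up for illustration):

```python
import re

# Assumed pattern: an optional leading "C" followed by exactly six digits.
INVOICE_RE = re.compile(r"C?\d{6}")

def classify_invoice(invoice_no: str) -> str:
    """Label an invoice number as regular, cancellation, or unexpected."""
    if INVOICE_RE.fullmatch(invoice_no) is None:
        return "unexpected"
    return "cancellation" if invoice_no.startswith("C") else "regular"

# Made-up sample values, including two that break the assumed pattern.
samples = ["536365", "C536379", "A563185", "53636"]
labels = [classify_invoice(s) for s in samples]
print(labels)
```

Unlike a pure length check, this also flags values that have the right length but contain other characters.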

In [None]:
df = df[df.Cancelled == 0]

Stock codes are generally 5-digit numeric codes.

In [None]:
df[df.StockCode.str.contains("^[a-zA-Z]")].StockCode.value_counts()

In [None]:
df[df.StockCode.str.contains("^[a-zA-Z]")].Description.value_counts()

It looks like the data contains more than customer transactions. I will drop these records.

In [None]:
df[df.StockCode.str.len() > 5].StockCode.value_counts()

In [None]:
df[df.StockCode.str.len() > 5].Description.value_counts()

Some stock codes have a letter at the end. I don't know what these letters refer to, so I will keep them.

In [None]:
df = df[~df.StockCode.str.contains("^[a-zA-Z]")]
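
To make the three kinds of stock codes explicit, here is a rough sketch that sorts codes into buckets with regular expressions. The bucket names and the `stock_code_kind` helper are hypothetical, and "71053" and "POST" are illustrative values only.

```python
import re

def stock_code_kind(code: str) -> str:
    """Sort a stock code into one of a few hypothetical buckets."""
    if re.fullmatch(r"\d{5}", code):
        return "numeric"
    if re.fullmatch(r"\d{5}[a-zA-Z]+", code):
        return "numeric_with_suffix"
    if re.match(r"[a-zA-Z]", code):
        return "starts_with_letter"
    return "other"

# Illustrative codes; DOT and 16156L are mentioned in this notebook.
samples = ["71053", "16156L", "DOT", "POST"]
kinds = {code: stock_code_kind(code) for code in samples}
print(kinds)
```

Only the "starts_with_letter" bucket corresponds to the rows removed here; codes with a trailing letter are kept.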

In [None]:
df["Description"] = df["Description"].str.lower()

I standardize the descriptions by converting them to lowercase.

Stock Codes - Description

In [None]:
description_counts = df.groupby("StockCode")["Description"].nunique()
description_counts[description_counts != 1]

213 stock codes have more than one description. Let's check some of them.

In [None]:
df[df.StockCode == "16156L"].Description.value_counts()

In [None]:
df[df.StockCode == "17107D"].Description.value_counts()

In [None]:
df[df.StockCode == "90014C"].Description.value_counts()

It seems there are only small differences between them, e.g. "," or "/".
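
One way to test whether the variants differ only in punctuation is to strip punctuation and count again. This is a sketch on made-up descriptions, not a step of this notebook; the `normalize` helper and the `Normalized` column are my own names.

```python
import re
import pandas as pd

# Made-up stock codes and descriptions.
toy = pd.DataFrame({
    "StockCode": ["X1", "X1", "X2", "X2", "X3", "X3"],
    "Description": [
        "wrap, carousel", "wrap carousel",
        "candle holder/pink", "candle holder pink",
        "blue lamp", "blue desk lamp",
    ],
})

def normalize(text: str) -> str:
    """Replace punctuation with spaces, then collapse runs of whitespace."""
    no_punct = re.sub(r"[^\w\s]", " ", text)
    return " ".join(no_punct.split())

toy["Normalized"] = toy["Description"].map(normalize)

before = toy.groupby("StockCode")["Description"].nunique()
after = toy.groupby("StockCode")["Normalized"].nunique()
print(after[after != 1])
```

Codes that still have more than one normalized description differ in wording, not just in punctuation.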

In [None]:
df.CustomerID.value_counts()

In [None]:
customer_counts = df.CustomerID.value_counts().sort_values(ascending=False).head(25)

fig, ax = plt.subplots(figsize = (10, 8))

sns.barplot(y = customer_counts.index, x = customer_counts.values, orient = "h", 
            ax = ax, order = customer_counts.index, palette = "Reds_r")

plt.title("Customers that have most transactions")
plt.ylabel("Customers")
plt.xlabel("Transaction Count")

plt.show()

In [None]:
df.Country.value_counts()

In [None]:
country_counts = df.Country.value_counts().sort_values(ascending=False).head(25)

fig, ax = plt.subplots(figsize = (18, 10))

sns.barplot(x = country_counts.values, y = country_counts.index, orient = "h", 
            ax = ax, order = country_counts.index, palette = "Blues_r")
plt.title("Countries that have most transactions")
plt.xscale("log")
plt.show()

In [None]:
df["UnitPrice"].describe()

A unit price of 0?

In [None]:
df[df.UnitPrice == 0].head()

I didn't find any pattern, so I will remove them.

In [None]:
print("We have {} observations.".format(df.shape[0]))

df = df[df.UnitPrice > 0]

print("We have {} observations after removing records that have 0 unit price.".format(df.shape[0]))

In [None]:
fig, axes = plt.subplots(1, 3, figsize = (18, 6))

sns.kdeplot(df["UnitPrice"], ax = axes[0], color = "#195190").set_title("Distribution of Unit Price")
sns.boxplot(y = df["UnitPrice"], ax = axes[1], color = "#195190").set_title("Boxplot for Unit Price")
sns.kdeplot(np.log(df["UnitPrice"]), ax = axes[2], color = "#195190").set_title("Log Unit Price Distribution")

plt.show()

In [None]:
print("Lower limit for UnitPrice: " + str(np.exp(-2)))
print("Upper limit for UnitPrice: " + str(np.exp(3)))

In [None]:
np.quantile(df.UnitPrice, 0.99)
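
The limits are picked on the log scale and then mapped back to prices with the exponential. A short worked sketch, assuming the cut points -2 and 3 were chosen from the log plot (the sample prices are made up):

```python
import math

# Cut points on the log scale, assumed to be chosen from the log plot.
log_lower, log_upper = -2, 3

# Map them back to the original price scale.
lower = math.exp(log_lower)  # about 0.135
upper = math.exp(log_upper)  # about 20.09

def within_limits(price: float) -> bool:
    """True when the price lies strictly between the two limits."""
    return lower < price < upper

# Made-up prices.
prices = [0.04, 0.42, 2.55, 19.95, 649.5]
kept = [p for p in prices if within_limits(p)]
print(round(lower, 3), round(upper, 2), kept)
```

Rounding these limits gives roughly 0.1 and 20, which is a convenient pair of thresholds on the price scale.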

In [None]:
print("We have {} observations.".format(df.shape[0]))

df = df[(df.UnitPrice > 0.1) & (df.UnitPrice < 20)]

print("We have {} observations after removing unit prices of 0.1 or less and 20 or more.".format(df.shape[0]))

In [None]:
fig, axes = plt.subplots(1, 3, figsize = (18, 6))

sns.kdeplot(df["UnitPrice"], ax = axes[0], color = "#195190").set_title("Distribution of Unit Price")
sns.boxplot(y = df["UnitPrice"], ax = axes[1], color = "#195190").set_title("Boxplot for Unit Price")
sns.kdeplot(np.log(df["UnitPrice"]), ax = axes[2], color = "#195190").set_title("Log Unit Price Distribution")

fig.suptitle("Distribution of Unit Price (After Removing Outliers)")
plt.show()

In [None]:
df["Quantity"].describe()

The 75th percentile is 12, but the maximum is 80,995. Let's look at these outliers.

In [None]:
fig, axes = plt.subplots(1, 3, figsize = (18, 6))

sns.kdeplot(df["Quantity"], ax = axes[0], color = "#195190").set_title("Distribution of Quantity")
sns.boxplot(y = df["Quantity"], ax = axes[1], color = "#195190").set_title("Boxplot for Quantity")
sns.kdeplot(np.log(df["Quantity"]), ax = axes[2], color = "#195190").set_title("Log Quantity")
plt.show()

In [None]:
print("Upper limit for Quantity: " + str(np.exp(5)))

In [None]:
np.quantile(df.Quantity, 0.99)

In [None]:
print("We have {} observations.".format(df.shape[0]))

df = df[(df.Quantity < 150)]

print("We have {} observations after removing quantities of 150 or more.".format(df.shape[0]))
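
The print, filter, print pattern used for each cleaning step could be wrapped in a small helper. A possible sketch; `filter_and_report` is a hypothetical name and the toy quantities are made up.

```python
import pandas as pd

def filter_and_report(frame: pd.DataFrame, mask: pd.Series, reason: str) -> pd.DataFrame:
    """Keep rows where mask is True and print how many rows were dropped."""
    kept = frame[mask]
    dropped = len(frame) - len(kept)
    print(f"{len(frame)} -> {len(kept)} observations ({dropped} dropped: {reason})")
    return kept

# Made-up quantities around the 150 threshold.
toy = pd.DataFrame({"Quantity": [1, 12, 149, 150, 80995]})
filtered = filter_and_report(toy, toy["Quantity"] < 150, "quantity of 150 or more")
```

Note that the strict `<` comparison also drops a quantity of exactly 150.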

In [None]:
fig, axes = plt.subplots(1, 3, figsize = (18, 6))

sns.kdeplot(df["Quantity"], ax = axes[0], color = "#195190").set_title("Distribution of Quantity")
sns.boxplot(y = df["Quantity"], ax = axes[1], color = "#195190").set_title("Boxplot for Quantity")
sns.kdeplot(np.log(df["Quantity"]), ax = axes[2], color = "#195190").set_title("Log Quantity")

fig.suptitle("Distribution of Quantity (After Removing Outliers)")
plt.show()

In [None]:
df["TotalPrice"] = df["Quantity"] * df["UnitPrice"]
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])
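
When parsing dates, passing an explicit format can avoid ambiguity between day-first and month-first strings. A sketch on toy rows, assuming the file stores dates as month/day/year with hour and minute (that layout is my assumption; check it against the raw file):

```python
import pandas as pd

# Made-up rows; the date layout is an assumption.
toy = pd.DataFrame({
    "Quantity": [6, 2],
    "UnitPrice": [2.5, 7.25],
    "InvoiceDate": ["12/1/2010 8:26", "12/9/2011 12:50"],
})

# Revenue per line item.
toy["TotalPrice"] = toy["Quantity"] * toy["UnitPrice"]

# Explicit month/day/year format instead of letting pandas infer it.
toy["InvoiceDate"] = pd.to_datetime(toy["InvoiceDate"], format="%m/%d/%Y %H:%M")

print(toy.dtypes)
```

With the format spelled out, a string that does not fit raises an error instead of being silently misread.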

In [None]:
df.drop("Cancelled", axis = 1, inplace = True)
df.to_csv("online_retail_final.csv", index = False)

# Next

The cleaned data is available here: https://www.kaggle.com/mustafacicek/online-retail-final

Next chapters:

https://www.kaggle.com/mustafacicek/marketing-analytics-cohort-analysis

https://www.kaggle.com/mustafacicek/marketing-analytics-pareto-principle

https://www.kaggle.com/mustafacicek/marketing-analytics-rfm-analysis

Full work: https://www.kaggle.com/mustafacicek/detailed-marketing-analytics-cohort-pareto-rfm