# Exploratory Data Analysis

Our data exploration will mainly focus on answering questions about customers and their value. We also want some insight into what our client's customer base looks like before we start on market segmentation and, later, modeling.

Key Questions:

1. What is the purchasing history by country?
2. Is there any cyclical pattern in purchasing by country?
3. What other ways are there to segment our customers?

In [None]:
# import libraries
from warnings import filterwarnings
filterwarnings("ignore")

import pandas as pd
import numpy as np
import json

from sklearn.preprocessing import minmax_scale

import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d
import seaborn as sns
import plotly.express as px
from plotly.offline import init_notebook_mode

from ipywidgets import widgets, interact

%matplotlib inline
sns.set(font_scale=1.2)
init_notebook_mode(connected=True)

In [None]:
# load data
df = pd.read_csv("../data/interim/data.csv", index_col="invoice_date", parse_dates=True)
df.drop(columns=["Unnamed: 0"], inplace=True)
df.info()

In [None]:
# output first 5 rows
df.head()

## Grouping by Country

### Purchasing History

Our dataset has a country feature. I want to see the purchasing history by country over time and identify where the majority of purchasing comes from. This will help us segment the customer base, since country of origin is one possible segmentation variable.

In [None]:
# pivot table of quarterly gross invoice total by country
quarterly_gross_country = pd.pivot_table(df, index=pd.Grouper(freq="Q"), columns="country", values="total", aggfunc="sum")
fig = px.scatter(quarterly_gross_country, title="Quarterly Gross Purchasing by Country", labels={"value": "Gross", "invoice_date": "Quarter End Date", "country": "Country"})
fig.show()

Looking at the unfiltered quarterly gross, the United Kingdom clearly originates the most purchasing. Historically, UK customers have purchased between roughly 700k and 3M each quarter, outspending every other country by a large factor. I'll also want to inspect how many customers we have in each country.
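To put a number on "outspending by a large factor", one option is to convert the quarterly pivot into per-quarter shares. The sketch below uses a small hypothetical table (`toy` and its values are made up, not taken from the dataset) and groups by `to_period("Q")` rather than `pd.Grouper`, purely so that it does not depend on frequency-alias spelling across pandas versions.

```python
import pandas as pd

# Hypothetical toy invoices; the notebook's real frame is `df`, indexed by invoice_date.
toy = pd.DataFrame(
    {
        "country": ["United Kingdom", "United Kingdom", "France", "Germany"],
        "total": [700.0, 200.0, 60.0, 40.0],
    },
    index=pd.to_datetime(["2010-01-15", "2010-02-20", "2010-03-05", "2010-03-28"]),
)
toy.index.name = "invoice_date"

# One row per quarter, one column per country (same shape as the pivot above).
quarterly = (
    toy.groupby([toy.index.to_period("Q"), "country"])["total"]
    .sum()
    .unstack("country")
)

# Divide each row by its row total to get every country's share of that quarter.
quarterly_share = quarterly.div(quarterly.sum(axis=1), axis=0)
print(quarterly_share.round(2))
```

In this toy example the UK accounts for 90% of the single quarter shown; applying the same division to `quarterly_gross_country` would give the real shares.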

In [None]:
# import geojson
with open("../data/external/countries.geojson") as f:
    countries = json.load(f)

In [None]:
# do we have the geojson for all the countries in our dataframe
geojson_countries = set(map(lambda x: x['properties']['ADMIN'], countries['features']))
df_countries = set(df.country.unique())

df_countries - geojson_countries

In [None]:
# map the countries missing to one that is available
mapping = {
    'Channel Islands': "France",  # since they're closest to France
    'Czech Republic': "Czechia",
    'EIRE': "Ireland",
    'European Community': "European Union",
    'Hong Kong': "Hong Kong S.A.R.",
    'Korea': "South Korea",
    'RSA': "South Africa",
    'USA': "United States of America",
    'Unspecified': "Unspecified",
    'West Indies': "Cuba"  # west indies is 18 countries but this is close enough
}
df['updated_country'] = df.country.replace(mapping)
set(df['updated_country']) - geojson_countries  # left over missing countries

In [None]:
# plot choropleth map of total purchasing value by region
# select "total" before summing so non-numeric columns are never aggregated
total_by_country = np.log(df.groupby("updated_country")["total"].sum()).reset_index()
fig = px.choropleth_mapbox(
    data_frame=total_by_country,
    geojson=countries,
    featureidkey="properties.ADMIN",
    locations="updated_country",
    color="total",
    mapbox_style="carto-positron",
    zoom=1,
    title="Regional Sales Rating"
)

fig.show()

Looking at the map of total sales value by country (scaled down using the natural log), total value appears to decrease the further a region is from the UK. There are exceptions, but that is the overall picture, and the economic status of a region may also play a role in the purchasing power of its customers.
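The log transform matters here because country totals span several orders of magnitude, so a linear color scale would show the UK and little else. The sketch below illustrates this on hypothetical totals (the numbers and the `Refund-only` label are invented for illustration) and also shows one way to guard against non-positive totals, for which the logarithm is undefined.

```python
import numpy as np
import pandas as pd

# Hypothetical totals per country, spanning several orders of magnitude.
totals = pd.Series(
    {
        "United Kingdom": 14_000_000.0,
        "Germany": 400_000.0,
        "Bahrain": 500.0,
        "Refund-only": -50.0,  # a net-negative total, e.g. only returns
    },
    name="total",
)

# Mask non-positive totals first: np.log would otherwise produce NaN or -inf
# together with a runtime warning.
positive = totals.where(totals > 0)
log_totals = np.log(positive)

# The largest total is 28,000 times the smallest positive one, but on the log
# scale the two are only about 10 units apart.
print(log_totals.round(2))
```

Masked entries stay `NaN`, which a choropleth would typically leave uncolored rather than plotting a misleading value.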

### Customer Count

In [None]:
# unique customers by region
region_counts = (
    df.drop_duplicates(["customer_id", "country"])["country"]
    .value_counts()
    .rename_axis("country")
    .reset_index(name="count")
)
fig = px.bar(region_counts, x="country", y="count", title="Unique Customers by Region",
             labels={"country": "Region", "count": "# of Customers"})
fig.show()

The majority of our customer base is in the UK; every other region has between 1 and 107 customers. The runners-up, in descending order, are Germany, France, Spain, and Belgium. Segmenting solely by region would therefore be problematic, since nearly all marketing would end up aimed at the UK.
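An equivalent way to count unique customers per region is `groupby(...).nunique()`, which also makes it easy to express the UK's dominance as a share. This is a minimal sketch on a hypothetical frame (`toy` and its contents are invented); column names follow the notebook's `customer_id` and `country`.

```python
import pandas as pd

# Hypothetical invoice lines: customers 1 and 3 appear more than once.
toy = pd.DataFrame(
    {
        "customer_id": [1, 1, 2, 3, 3, 4],
        "country": [
            "United Kingdom", "United Kingdom", "United Kingdom",
            "Germany", "Germany", "France",
        ],
    }
)

# Count distinct customers in each country, largest first.
customers_per_country = (
    toy.groupby("country")["customer_id"].nunique().sort_values(ascending=False)
)

# Share of all customer-country pairs that fall in the UK.
uk_share = customers_per_country["United Kingdom"] / customers_per_country.sum()
print(customers_per_country)
print(f"UK share of customers: {uk_share:.0%}")
```

Note that a customer who appears under two countries is counted once in each, which matches the `drop_duplicates(["customer_id", "country"])` approach above.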

### Time Series Plot

Next, we look into the purchasing habits of each country by plotting the time series of its gross invoices weekly, monthly, or quarterly.

In [None]:
# function to plot the time series of invoice sum
from warnings import filterwarnings
filterwarnings("ignore")

periods = {"Month": "M", "Quarter": "Q", "Week": "W"}
def plot_country_time_series(country, period):
    _period = periods[period]
    _df = df.loc[df["country"] == country, "total"].groupby(pd.Grouper(freq=_period)).sum()
    _df.plot(figsize=(16, 9), title=f"{country} - {period} Gross Purchasing (in GBP)", style="o-")
    plt.xlabel("Date")
    plt.ylabel("Invoice Total (in GBP)")
    

# named country_dropdown so the `countries` GeoJSON loaded earlier is not overwritten
country_dropdown = widgets.Dropdown(options=df.country.unique(), description="Country:")
periods_ = widgets.Dropdown(options=list(periods.keys()), description="Period:")

interact(plot_country_time_series, country=country_dropdown, period=periods_)
plt.show()

Regionally, we can see that countries differ in how much time passes between purchases. This suggests that, at the customer level, some customers have large gaps between their purchases. We will need to engineer features that help us segment the customer base, especially over time.
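One way to check the customer-level gaps directly, rather than inferring them from country plots, is to compute the number of days between each customer's consecutive invoices. The sketch below does this on a hypothetical invoice table; `toy`, `days_since_prev`, and `gap_summary` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Hypothetical invoices: one row per invoice.
toy = pd.DataFrame(
    {
        "customer_id": [1, 1, 1, 2, 2],
        "invoice_date": pd.to_datetime(
            ["2010-01-01", "2010-01-08", "2010-03-09", "2010-02-01", "2010-08-01"]
        ),
    }
)

# Sort so that diff() compares each invoice with the customer's previous one.
toy = toy.sort_values(["customer_id", "invoice_date"])
toy["days_since_prev"] = toy.groupby("customer_id")["invoice_date"].diff().dt.days

# Summarise the gaps per customer; the first invoice has no gap (NaN) and is skipped.
gap_summary = toy.groupby("customer_id")["days_since_prev"].agg(["mean", "max"])
print(gap_summary)
```

Here customer 2 goes 181 days between purchases while customer 1 never exceeds 60, which is the kind of difference a recency feature is meant to capture.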

## Alternative Segmentation Methods

Looking through our data, we will have to generate new features for segmenting our customers. In marketing analysis, a common way to identify the best customers is the RFM model.

> Recency, frequency, monetary value is a marketing analysis tool used to identify a company's or an organization's best customers by using certain measures. The RFM model is based on three quantitative factors:

- Recency: How recently a customer has made a purchase
- Frequency: How often a customer makes a purchase
- Monetary Value: How much money a customer spends on purchases

You can read more about RFM here: https://www.investopedia.com/terms/r/rfm-recency-frequency-monetary-value.asp

We'll have to create the following features for each customer in each month:

- Recency = the number of months that have passed since the customer last purchased
- Frequency = the number of purchases by the customer
- Monetary = the highest value of all invoices by the customer
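To make the three definitions above concrete, here is a minimal sketch that computes them for a single snapshot date on a hypothetical invoice table. The table, the `snapshot` date, and the column names are assumptions for illustration; the notebook itself builds these values for every month rather than for one snapshot.

```python
import pandas as pd

# Hypothetical invoices: one row per invoice with its total value.
invoices = pd.DataFrame(
    {
        "customer_id": [1, 1, 2, 2, 2, 3],
        "invoice_date": pd.to_datetime(
            ["2010-01-10", "2010-05-03", "2010-02-14",
             "2010-03-01", "2010-06-20", "2010-01-05"]
        ),
        "value": [120.0, 80.0, 15.0, 40.0, 25.0, 300.0],
    }
)
snapshot = pd.Timestamp("2010-06-30")  # assumed "as of" date

per_customer = invoices.groupby("customer_id").agg(
    last_purchase=("invoice_date", "max"),
    frequency=("invoice_date", "count"),  # number of purchases
    monetary=("value", "max"),            # highest invoice value
)

# Recency: whole calendar months between the last purchase and the snapshot.
last = per_customer["last_purchase"]
per_customer["recency"] = (snapshot.year - last.dt.year) * 12 + (
    snapshot.month - last.dt.month
)

rfm = per_customer[["recency", "frequency", "monetary"]]
print(rfm)
```

Customer 3 illustrates why the three measures are kept separate: a single purchase five months ago, but the largest invoice of the group.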

In [None]:
# load raw dataset
raw_df = pd.read_csv("../data/raw/data.csv", parse_dates=True, index_col="InvoiceDate")

In [None]:
# split date_column into date and time and add value of purchase column
raw_df["invoice_date"] = raw_df.index.date
raw_df["invoice_time"] = raw_df.index.time
raw_df["value"] = raw_df.Quantity * raw_df.Price  # the value of the purchase
raw_df.head()

Our end result will be a dataframe, with a row for each customer each month, and the RFM values for that customer for that month.

In [None]:
# dataframe with invoices and their total value
# select "value" before summing so non-numeric columns (e.g. invoice_time) are never aggregated
invoices_df = raw_df.groupby(["Customer ID", "invoice_date", "Invoice"])["value"].sum().reset_index()
invoices_df["invoice_date"] = pd.to_datetime(invoices_df.invoice_date)
invoices_df.set_index("invoice_date", inplace=True)
monetary = pd.pivot_table(
    invoices_df,
    values="value",
    index=pd.Grouper(freq="M"),
    columns="Customer ID",
    aggfunc="max",
    fill_value=0
).applymap(lambda x: 0. if x <= 0 else x)
monetary_rolling = monetary.rolling(6, min_periods=1).sum().abs()  #ensure we have positive values

In [None]:
# create frequency dataframe: total quantity purchased per customer per month,
# used here as the frequency measure
frequency = pd.pivot_table(
    raw_df,
    values="Quantity",
    aggfunc="sum",
    index=pd.Grouper(freq="M"),
    columns="Customer ID",
    fill_value=0
).applymap(lambda x: 0.0 if x <= 0 else x)
frequency_rolling = frequency.rolling(6, min_periods=1).sum().abs()  #ensure we have positive values

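We now have each customer's monthly monetary value (their largest invoice that month) and monthly frequency (total quantity purchased), along with 6-month rolling sums of both. Next we need recency: how many months have passed since each customer's last purchase.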

In [None]:
from functools import reduce

In [None]:
def recency_func(sequence):
    """Given a series return an array with recency values.
    
    0 if an element has a value.
    n: where n is the number of cells away from the last 0"""
    purchases = sequence > 0  # array where true means a purchase was made, false means no
    array = []
    for pos, purchase in enumerate(purchases):
        if purchase:
            array.append(0)
        elif not purchase and pos == 0:
            array.append(1)  # if they didn't make a purchase, and it is the beginning of our data, put a 1
        else:
            array.append(array[-1] + 1)
    return array
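
As a quick check of the behaviour described in the docstring, the sketch below restates the loop in compact form (as `recency_loop`) and compares it with a vectorized variant on a hypothetical monthly series. `recency_vectorized` is an alternative offered for illustration, not the notebook's method.

```python
import pandas as pd


def recency_loop(sequence):
    """Compact restatement of the loop-based recency function."""
    values = []
    for pos, purchase in enumerate(sequence > 0):
        if purchase:
            values.append(0)  # a purchase this month resets recency
        elif pos == 0:
            values.append(1)  # no purchase in the first month of data
        else:
            values.append(values[-1] + 1)
    return values


def recency_vectorized(sequence):
    """Same result without a Python loop (illustrative alternative)."""
    purchased = sequence > 0
    # cumsum() increases at every purchase, so it labels the run of months
    # that starts at each purchase (label 0 = months before any purchase).
    block = purchased.cumsum()
    # Count the non-purchase months seen so far within each run.
    return (~purchased).astype(int).groupby(block).cumsum().tolist()


# Hypothetical monthly invoice counts for one customer.
monthly_counts = pd.Series([0, 0, 5, 0, 0, 2])
print(recency_loop(monthly_counts))        # [1, 2, 0, 1, 2, 0]
print(recency_vectorized(monthly_counts))  # [1, 2, 0, 1, 2, 0]
```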

In [None]:
# create the recency dataframe
recency = pd.pivot_table(
    invoices_df,
    values="value",
    aggfunc="count",
    index=pd.Grouper(freq="M"),
    columns="Customer ID",
    fill_value=0
).apply(recency_func)
recency

We now have our three dataframes for RFM modeling. Like an image, this is a three-dimensional dataset (month × customer × metric), and if we want to do any modeling with it we'll have to set it up correctly. One way to do this is to turn it into a two-dimensional dataset by unwinding the values for R, F, and M. Before doing that, we should scale the values within each month.

For the recency dataframe, we'll also invert the scaled values, so that a higher value means the customer purchased more recently.

All three measures will be on a scale of 0 to 1.
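The scale, invert, and unwind steps described above can be sketched on two tiny wide tables. Everything here is hypothetical: the month labels, customers `A` to `C`, and the helpers `scale_rows` and `to_long`. Plain pandas arithmetic stands in for `minmax_scale` so the example has no scikit-learn dependency, and the long tables are combined on explicit `(month, customer_id)` keys rather than by row position.

```python
import pandas as pd

months = pd.Index(["2010-01", "2010-02"], name="month")  # hypothetical labels
recency = pd.DataFrame({"A": [0, 1], "B": [2, 0], "C": [4, 5]}, index=months)
monetary = pd.DataFrame(
    {"A": [100.0, 0.0], "B": [0.0, 50.0], "C": [0.0, 0.0]}, index=months
)


def scale_rows(wide):
    """Min-max scale each row (month) to [0, 1]; constant rows become 0."""
    low = wide.min(axis=1)
    span = (wide.max(axis=1) - low).replace(0, 1)  # avoid dividing by zero
    return wide.sub(low, axis=0).div(span, axis=0)


def to_long(wide, name):
    """Unwind a month x customer table into a Series keyed by both."""
    long = wide.reset_index().melt(
        id_vars="month", var_name="customer_id", value_name=name
    )
    return long.set_index(["month", "customer_id"])[name]


# Invert recency after scaling so that 1 means "purchased most recently".
result = pd.concat(
    [
        to_long(1 - scale_rows(recency), "recency"),
        to_long(scale_rows(monetary), "monetary"),
    ],
    axis=1,
).reset_index()
print(result)
```

Aligning on the keys means the result stays correct even if the wide tables list customers in different orders, at the cost of a slightly longer helper.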

In [None]:
# unpivot our dataframes and then join them all together
_recency = pd.melt(recency.apply(minmax_scale, axis=1, raw=True).applymap(lambda x: 1-x), ignore_index=False, value_name="recency", var_name="customer_id")
_frequency = pd.melt(frequency.apply(minmax_scale, axis=1, raw=True), ignore_index=False, value_name="freq", var_name="customer_id")
_monetary = pd.melt(monetary.apply(minmax_scale, axis=1, raw=True), ignore_index=False, value_name="monetary", var_name="customer_id")

_recency["frequency"] = _frequency.freq
_recency["monetary"] = _monetary.monetary
_recency["rolling_freq_6mo"] = pd.melt(frequency_rolling, ignore_index=False, value_name="freq_rolling_6mo", var_name="customer_id").freq_rolling_6mo
_recency["rolling_monetary_6mo"] = pd.melt(monetary_rolling, ignore_index=False, value_name="monetary_rolling_6mo", var_name="customer_id").monetary_rolling_6mo


result = _recency.copy()
result.head()

Now that we have our data in the form we want, we can plot it over the course of the 25 months to see if there are any noticeable segments.

In [None]:
# function to plot the time series scatter plot
def plot_scatter(month):
    """Plot the 3d scatter of customers RFM for a given month"""
    data = result.loc[month[:-3]]
    fig = plt.figure(figsize=(10, 16))
    ax = plt.axes(projection="3d")
    ax.scatter(xs=data["recency"], ys=data["frequency"], zs=data["monetary"], zdir="z")
    plt.xlabel("Recency")
    plt.ylabel("Frequency")
    ax.set_zlabel("Monetary")
    plt.title(f"RFM {month}")

months = widgets.Dropdown(options=list(result.index.unique().date.astype(str)), description="Month End")

interact(plot_scatter, month=months)
plt.show()

Looking through the months, we can see a large number of customers who have made recent purchases (likely our customers in the UK), with varying frequency but mainly clustering in the 0 to 0.4 range. As expected, the customers who purchase frequently also generate more monetary value.

When we move over to modeling, we'll want to generate some more features to use, such as rolling features to segment our customers better.
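As an illustration of what such rolling features could look like, the sketch below computes a trailing 3-month sum and a count of active months from a hypothetical month-by-customer table. The window length, the table, and the names `rolling_3mo` and `active_months_3mo` are assumptions, not choices made in this notebook.

```python
import pandas as pd

# Hypothetical monthly monetary values, one column per customer.
monthly = pd.DataFrame(
    {"A": [10.0, 0.0, 0.0, 30.0], "B": [0.0, 5.0, 5.0, 5.0]},
    index=pd.Index(["2010-01", "2010-02", "2010-03", "2010-04"], name="month"),
)

# Trailing 3-month sum; min_periods=1 keeps the first months instead of NaN.
rolling_3mo = monthly.rolling(3, min_periods=1).sum()

# Number of months with any purchase inside the same trailing window.
active_months_3mo = (monthly > 0).astype(int).rolling(3, min_periods=1).sum()

print(rolling_3mo)
print(active_months_3mo)
```

Customer B's steady small purchases and customer A's occasional large ones end up with different window profiles, which is the distinction such features are meant to expose.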

In [None]:
# save our data
result.to_csv("../data/processed/data.csv")