Analytics can be categorized at a high level into three distinct types:
1. **Descriptive Analytics**, which uses data aggregation and data mining to provide insight into the past and answer: “What has happened?”
2. **Predictive Analytics**, which uses statistical models and forecasting techniques to understand the future and answer: “What could happen?”
3. **Prescriptive Analytics**, which uses optimization and simulation algorithms to advise on possible outcomes and answer: “What should we do?”

Let's start with Descriptive Analytics, which allows us to learn from past behavior and understand how it might influence future outcomes.

Note: We need to be careful about how we segment and target customers, so we will focus on **RFM Analysis** (Recency, Frequency, Monetary value), a widely used technique for segmenting customers.


In [None]:
# import libraries
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import pandas_profiling as pp
import seaborn as sns

# Input data files are available in the "../input/" directory.

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

In [None]:
#Loading the data

data=pd.read_csv("../input/sales_data_sample.csv",encoding='unicode_escape')
data.shape

In [None]:
data.info()

In [None]:
data.head()

## Quick Insight

* Some columns are unnecessary: 'ADDRESSLINE1', 'ADDRESSLINE2', 'POSTALCODE', 'TERRITORY', 'PHONE'. We can drop them.
* We can coarsen the location data: instead of 'CITY' and 'STATE', we can use 'COUNTRY' only. So we can drop 'CITY' and 'STATE' as well.
* We can regroup the 'PRODUCTCODE' column by keeping only its first three characters.
* We have 'CUSTOMERNAME', 'CONTACTFIRSTNAME' and 'CONTACTLASTNAME'. We can remove 'CONTACTFIRSTNAME' and 'CONTACTLASTNAME'.


In [None]:
# Dropping Unnecessary columns 
temp=['ADDRESSLINE1','ADDRESSLINE2','POSTALCODE', 'TERRITORY', 'PHONE', 'CITY' , 'STATE','CONTACTFIRSTNAME', 'CONTACTLASTNAME' ]
data.drop(temp,axis=1,inplace=True)

In [None]:
# Regrouping product code.
data['PRODUCTINITIAL'] = data['PRODUCTCODE'].str[:3]
data.drop('PRODUCTCODE',axis=1,inplace=True)

In [None]:
# Recheck columns
data.info()

In [None]:
# Let's plot the data to get more insight.

plt.rcParams['figure.figsize'] = [18, 16]
data.plot(kind="box",subplots=True,layout=(4,4),sharex=False,sharey=False)
plt.show()

In [None]:
plt.rcParams['figure.figsize'] = [18, 16]
data.plot(kind="density",subplots=True,layout=(4,4),sharex=False,sharey=False)
plt.show()

**More Insights.**

* Most of the data is from 2003 and 2004 ('YEAR_ID'). The 4th quarter has the most sales, followed by the 1st, 2nd and 3rd (4 > 1 > 2 > 3).
* Most sales fall within a particular price range, but 'SALES' and 'QUANTITYORDERED' contain some outliers.
* Some variables are skewed, such as 'PRICEEACH' and 'ORDERLINENUMBER'.
* Some variables have high variance, such as 'PRICEEACH', 'ORDERLINENUMBER' and 'MSRP'.
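
The skewness and outlier observations above were read off the plots. A minimal sketch of how they could be quantified, assuming the common 1.5 × IQR rule for outliers; the `toy` frame and the `iqr_outliers` helper are hypothetical stand-ins, not part of the dataset or the notebook:

```python
import pandas as pd

# Toy stand-in for two numeric columns; the values are invented for illustration.
toy = pd.DataFrame({
    "SALES": [1200.0, 1500.0, 1700.0, 1800.0, 2100.0, 2300.0, 14000.0],
    "QUANTITYORDERED": [20, 25, 30, 32, 35, 40, 97],
})

def iqr_outliers(series, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    mask = (series < q1 - k * iqr) | (series > q3 + k * iqr)
    return series[mask]

skewness = toy.skew()  # positive values indicate a long right tail
sales_outliers = iqr_outliers(toy["SALES"])
qty_outliers = iqr_outliers(toy["QUANTITYORDERED"])

print(skewness)
print(sales_outliers)
print(qty_outliers)
```

On the real data, the same helper could be applied to `data['SALES']` and `data['QUANTITYORDERED']`.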

## Detailed Data Exploration

In [None]:
# Checking null values
data.isnull().sum()

* We don't have any duplicates.
* There are no missing values.
* We will look at the data quarter by quarter. 'MONTH_ID' is highly correlated with 'QTR_ID' (ρ = 0.9793), so 'MONTH_ID' can be ignored.
* We have 92 unique customers, for whom we will do the RFM analysis.
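
The duplicate and correlation checks are not shown in the cells above. A minimal sketch of how such checks could be run, using a small synthetic frame (one row per month, an assumption made only for illustration) instead of the real data:

```python
import pandas as pd

# Synthetic stand-in: one row per calendar month, quarter derived from the month.
toy = pd.DataFrame({"MONTH_ID": list(range(1, 13))})
toy["QTR_ID"] = (toy["MONTH_ID"] - 1) // 3 + 1

# Pearson correlation between month and quarter.
rho = toy["MONTH_ID"].corr(toy["QTR_ID"])

# Number of fully duplicated rows.
n_duplicates = int(toy.duplicated().sum())

print(rho, n_duplicates)
```

On the notebook's frame, the equivalent calls would be `data['MONTH_ID'].corr(data['QTR_ID'])`, `data.duplicated().sum()` and `data['CUSTOMERNAME'].nunique()`.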


In [None]:
plt.rcParams['figure.figsize'] = [4, 4]
# Plot MONTH_ID against QTR_ID, the pair whose correlation is discussed above
sns.regplot(x="MONTH_ID",y="QTR_ID",data=data)
plt.show()

In [None]:
data['STATUS'].value_counts()

In [None]:
sns.countplot(y='STATUS',data=data,hue='YEAR_ID')

In [None]:
sns.countplot(y='STATUS',data=data,hue='QTR_ID')

* The Disputed, In Process and On Hold orders fall in the 2nd quarter of 2005.
* We also need clarification on whether Resolved means Shipped or not.

In [None]:
# Comparing sales for each year (quarter-wise)

data1=data.groupby(['YEAR_ID','QTR_ID']).agg({'SALES': 'sum'})
print(data1.info())
print(data1.head())

In [None]:
data1.reset_index(inplace=True)
data1.head()

In [None]:

# factorplot was renamed to catplot in seaborn 0.9 and later removed
sns.catplot(y='SALES', x='QTR_ID',data=data1,kind="bar" ,hue='YEAR_ID')

## RFM Analysis

In [None]:
import warnings
warnings.filterwarnings('ignore')


For RFM analysis, we need only four columns: 'CUSTOMERNAME', 'ORDERNUMBER', 'ORDERDATE' and 'SALES'.

In [None]:
temp=['CUSTOMERNAME', 'ORDERNUMBER', 'ORDERDATE', 'SALES']
RFM_data=data[temp].copy()  # copy so later column assignments do not act on a view of data
RFM_data.shape

In [None]:
RFM_data.head()

In [None]:
RFM_data['ORDERDATE'] = pd.to_datetime(RFM_data['ORDERDATE'])

In [None]:
RFM_data['ORDERDATE'].max()

**Create the RFM Table**

* The last order date in the dataset is May 31, 2005, which we will use as the reference date to calculate recency.
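
To make the three measures concrete, here is a small sketch on invented orders; the customer names, dates and amounts are hypothetical. Recency is the number of days between the reference date and a customer's last order, frequency is the number of distinct orders, and monetary value is the total sales:

```python
import datetime as dt
import pandas as pd

# Invented orders: one row per order line, as in the sales data.
orders = pd.DataFrame({
    "CUSTOMERNAME": ["Acme", "Acme", "Acme", "Birch", "Birch"],
    "ORDERNUMBER": [1, 1, 2, 3, 4],
    "ORDERDATE": pd.to_datetime(
        ["2005-05-01", "2005-05-01", "2005-05-21", "2004-12-31", "2005-01-31"]
    ),
    "SALES": [100.0, 50.0, 200.0, 300.0, 25.0],
})
snapshot = dt.datetime(2005, 5, 31)  # reference date for recency

rfm = orders.groupby("CUSTOMERNAME").agg(
    recency=("ORDERDATE", lambda d: (snapshot - d.max()).days),
    frequency=("ORDERNUMBER", "nunique"),
    monetary_value=("SALES", "sum"),
)
print(rfm)
```

Here "Acme" last ordered 10 days before the reference date across 2 distinct orders, so it has a lower recency value (more recent) than "Birch".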

In [None]:
import datetime as dt
NOW = dt.datetime(2005,5,31)

In [None]:
RFM_table=RFM_data.groupby('CUSTOMERNAME').agg({'ORDERDATE': lambda x: (NOW - x.max()).days, # Recency: days since last order
                                                'ORDERNUMBER': 'nunique', # Frequency: number of distinct orders
                                                'SALES': 'sum'})    # Monetary: total sales

RFM_table['ORDERDATE'] = RFM_table['ORDERDATE'].astype(int)

RFM_table.rename(columns={'ORDERDATE': 'recency', 
                         'ORDERNUMBER': 'frequency',
                         'SALES': 'monetary_value'}, inplace=True)

In [None]:
RFM_table.head()

**RFM Grouping**

In [None]:
quantiles = RFM_table.quantile(q=[0.25,0.5,0.75])
quantiles


In [None]:
# Converting quantiles to a dictionary, easier to use.
quantiles = quantiles.to_dict()
quantiles 

**RFM Segmentation**

In [None]:
RFM_Segment = RFM_table.copy()

In [None]:
# Arguments (x = value, p = column name: recency, monetary_value or frequency, d = quartiles dict)
# Lower recency (a more recent order) gets the higher score.
def R_Class(x,p,d):
    if x <= d[p][0.25]:
        return 4
    elif x <= d[p][0.50]:
        return 3
    elif x <= d[p][0.75]: 
        return 2
    else:
        return 1
    
# Arguments (x = value, p = column name: recency, monetary_value or frequency, d = quartiles dict)
# Higher frequency or monetary value gets the higher score.
def FM_Class(x,p,d):
    if x <= d[p][0.25]:
        return 1
    elif x <= d[p][0.50]:
        return 2
    elif x <= d[p][0.75]: 
        return 3
    else:
        return 4

In [None]:
RFM_Segment['R_Quartile'] = RFM_Segment['recency'].apply(R_Class, args=('recency',quantiles,))
RFM_Segment['F_Quartile'] = RFM_Segment['frequency'].apply(FM_Class, args=('frequency',quantiles,))
RFM_Segment['M_Quartile'] = RFM_Segment['monetary_value'].apply(FM_Class, args=('monetary_value',quantiles,))

In [None]:
RFM_Segment['RFMClass'] = RFM_Segment.R_Quartile.map(str) \
                            + RFM_Segment.F_Quartile.map(str) \
                            + RFM_Segment.M_Quartile.map(str)

RFM_Segment.head()
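
As an alternative sketch, and not the approach used in this notebook, `pd.qcut` can assign the same kind of quartile scores in one call per column. The toy values below are invented. Note that `qcut` raises an error when quartile edges coincide (which can happen with heavily tied columns such as frequency) unless `duplicates='drop'` is passed, in which case fewer labels are needed:

```python
import pandas as pd

# Invented recency (days) and monetary values for eight customers.
toy = pd.DataFrame({
    "recency": [5, 40, 90, 200, 10, 150, 60, 300],
    "monetary_value": [100, 900, 250, 50, 700, 400, 1200, 30],
})

# Recency: the lowest quartile (most recent) gets score 4.
toy["R_Quartile"] = pd.qcut(toy["recency"], 4, labels=[4, 3, 2, 1]).astype(int)
# Monetary value: the highest quartile gets score 4.
toy["M_Quartile"] = pd.qcut(toy["monetary_value"], 4, labels=[1, 2, 3, 4]).astype(int)

print(toy)
```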

**RFM segmentation readily answers these questions for your business:**
* Who are your best customers?
* Which customers are on the verge of churning?
* Who are the lost customers that you don’t need to pay much attention to?
* Who are your loyal customers?
* Which customers must you retain?
* Who has the potential to be converted into a more profitable customer?
* Which group of customers is most likely to respond to your current campaign?
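
One hypothetical way to turn the three quartile scores into named segments is a small rule function. The thresholds below mirror the filters used in this notebook (444 for best, 111 for lost, R ≤ 2 for at risk, F ≥ 3 for loyal), but the priority order and the labels themselves are assumptions made for illustration:

```python
import pandas as pd

def label_segment(r, f, m):
    """Map R, F, M quartile scores (1-4) to an illustrative segment name."""
    if (r, f, m) == (4, 4, 4):
        return "best"
    if (r, f, m) == (1, 1, 1):
        return "lost"
    if r <= 2:
        return "at risk"   # has not ordered recently
    if f >= 3:
        return "loyal"     # orders often
    return "other"

# Invented scores for five customers.
scores = pd.DataFrame({
    "R_Quartile": [4, 1, 2, 3, 3],
    "F_Quartile": [4, 1, 4, 3, 1],
    "M_Quartile": [4, 1, 4, 1, 1],
})
scores["segment"] = scores.apply(
    lambda row: label_segment(row["R_Quartile"], row["F_Quartile"], row["M_Quartile"]),
    axis=1,
)
print(scores)
```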

In [None]:
#Who are my best customers? (BY RFMClass = 444)
RFM_Segment[RFM_Segment['RFMClass']=='444'].sort_values('monetary_value', ascending=False).head(5)

In [None]:
#Which customers are on the verge of churning?
#Customers whose recency score is low (R_Quartile <= 2), i.e. who have not ordered recently

RFM_Segment[RFM_Segment['R_Quartile'] <= 2 ].sort_values('monetary_value', ascending=False).head(5)

In [None]:
#Who are lost customers?
#Customers whose recency, frequency and monetary scores are all low

RFM_Segment[RFM_Segment['RFMClass']=='111'].sort_values('recency',ascending=False).head(5)

In [None]:
#Who are your loyal customers?
#Customers with a high frequency score (F_Quartile >= 3)

RFM_Segment[RFM_Segment['F_Quartile'] >= 3 ].sort_values('monetary_value', ascending=False).head(5)