### Customer Profiling

This activity is meant to give you practice exploring data, including the use of visualizations with `matplotlib`, `seaborn`, and `plotly`. The dataset contains demographic information on the customers, information on customer purchases, customer engagement with promotions, and information on where customer purchases happened. A complete data dictionary can be found below.

Your task is to explore the data and use visualizations to inform answers to specific questions about it. The questions and resulting visualizations should be posted in the group discussion related to this activity. Some example questions to explore:

-----

- Does income differentiate customers who purchase wine? 
- Which customers are more likely to participate in the last promotional campaign?
- Are customers with children more likely to purchase products online?
- Do married people purchase more wine?
- What kinds of purchases led to customer complaints?

-----

### Data Dictionary

Attributes


```
ID: Customer's unique identifier
Year_Birth: Customer's birth year
Education: Customer's education level
Marital_Status: Customer's marital status
Income: Customer's yearly household income
Kidhome: Number of children in customer's household
Teenhome: Number of teenagers in customer's household
Dt_Customer: Date of customer's enrollment with the company
Recency: Number of days since customer's last purchase
Complain: 1 if customer complained in the last 2 years, 0 otherwise


MntWines: Amount spent on wine in last 2 years
MntFruits: Amount spent on fruits in last 2 years
MntMeatProducts: Amount spent on meat in last 2 years
MntFishProducts: Amount spent on fish in last 2 years
MntSweetProducts: Amount spent on sweets in last 2 years
MntGoldProds: Amount spent on gold in last 2 years

Promotion

AcceptedCmp1: 1 if customer accepted the offer in the 1st campaign, 0 otherwise
AcceptedCmp2: 1 if customer accepted the offer in the 2nd campaign, 0 otherwise
AcceptedCmp3: 1 if customer accepted the offer in the 3rd campaign, 0 otherwise
AcceptedCmp4: 1 if customer accepted the offer in the 4th campaign, 0 otherwise
AcceptedCmp5: 1 if customer accepted the offer in the 5th campaign, 0 otherwise
Response: 1 if customer accepted the offer in the last campaign, 0 otherwise


NumWebPurchases: Number of purchases made through the company’s web site
NumCatalogPurchases: Number of purchases made using a catalogue
NumStorePurchases: Number of purchases made directly in stores
NumWebVisitsMonth: Number of visits to company’s web site in the last month
```

In [8]:
import pandas as pd
import plotly.express as px

In [37]:
df = pd.read_csv('data/marketing_campaign.csv', sep = '\t')

In [91]:
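# First, compute an approximate age, using 2022 as the reference year
df['Age'] = 2022 - df['Year_Birth']
# Next, eliminate outliers (a few birth years imply ages well beyond 100);
# .copy() avoids a SettingWithCopyWarning when columns are added later
df = df.query('Age < 80').copy()

def converted_when(x):
    """Return the number (1-5) of the first campaign the customer accepted.

    Returns 0 if the customer accepted none of campaigns 1-5 and also declined
    the last campaign (Response == 0). Customers who accepted only the last
    campaign get None (NaN in the column), so calls to .dropna() exclude them.
    """
    for stage in range(1, 6):
        if x[f'AcceptedCmp{stage}'] == 1:
            return stage
    if x['Response'] == 0:
        return 0
    # Accepted only the last campaign; labelling this as stage 6 is left disabled
    return None

df['Conversion_Stage'] = df.apply(converted_when, axis=1)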

When do converted customers 'accept' the campaign offer?

---

I am curious where in the process customers are converted. To find out, I assigned each customer a stage equal to the number of the first campaign they accepted (`AcceptedCmp1` gives stage 1, `AcceptedCmp2` gives stage 2, and so on) and stored it in a new column. A histogram such as the one below then shows where conversions are most likely to occur.

Converted customers are far less likely to accept an offer in the second stage of the campaign: there is a staggering 88% drop in conversions between the first and second stages. Afterwards, the count jumps back up and then resumes a roughly logarithmic downward trend.
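
The size of the drop between stages can be checked numerically rather than read off the histogram. The following is a minimal sketch, assuming a `Conversion_Stage` column like the one built above; the helper name `stage_counts` and the toy values are hypothetical and are not taken from the dataset.

```python
import pandas as pd

def stage_counts(stages: pd.Series) -> pd.DataFrame:
    """Count customers per conversion stage and the percent change from the previous stage."""
    # Exclude stage 0 (never converted); value_counts() ignores NaN by default
    counts = stages[stages != 0].value_counts().sort_index()
    return pd.DataFrame({"count": counts, "pct_change": counts.pct_change() * 100})

# Toy values for illustration only, not the real data
toy_stages = pd.Series([1, 1, 1, 1, 2, 3, 3, 0, 0])
summary = stage_counts(toy_stages)
print(summary)
```

On the real data this would be called as `stage_counts(df['Conversion_Stage'])`; the `pct_change` value for stage 2 is the figure to compare with the drop quoted above.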


In [135]:
px.histogram(df.query('Conversion_Stage != 0'), x='Conversion_Stage')

How do income and age affect the conversion stage of the customer?

---

The questions here are whether income and age are correlated, and how those variables affect the distribution of conversion stage.

I chose to include the histogram for income because it falls quite nicely into a skewed Gaussian shape, sharply peaked in the 75-85k region. I cleaned the data by removing outliers beyond 6 sigma, which turned out to be a few people with ages well beyond 100 and a couple of high earners.
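
The notebook code only shows the `Age < 80` filter, so the sigma-based cut is not reproduced here. One possible way to apply such a cut is sketched below; the helper `drop_sigma_outliers`, its parameters, and the toy data are assumptions for illustration, not the method actually used.

```python
import pandas as pd

def drop_sigma_outliers(frame: pd.DataFrame, columns, n_sigma: float = 6.0) -> pd.DataFrame:
    """Keep rows whose values in `columns` lie within n_sigma standard deviations of the column mean."""
    keep = pd.Series(True, index=frame.index)
    for col in columns:
        z = (frame[col] - frame[col].mean()) / frame[col].std()
        # Missing values produce NaN z-scores; keep those rows so NaN handling stays a separate step
        keep &= z.abs().le(n_sigma) | z.isna()
    return frame[keep].copy()

# Toy data for illustration only: 49 typical ages and one implausible one
toy = pd.DataFrame({"Age": [40] * 49 + [120]})
cleaned = drop_sigma_outliers(toy, ["Age"], n_sigma=6.0)
print(len(toy), "->", len(cleaned))
```

Note that a single value can only exceed 6 sigma in a reasonably large sample (roughly 40 rows or more), which is why the toy frame has 50 rows.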

Age does indeed roughly correlate with income, with some interesting features. There appear to be two trendlines: one that begins rather low at younger ages and rises roughly linearly, and a second that begins high and largely stays there. This could reflect a divide between wealth classes.

High earners, with incomes above roughly 70k, are much more likely to purchase an item at the first and fifth conversion stages. They likely have the spare income to make the purchase, and are either convinced immediately or simply need time to overcome the inertia of making a purchase. Customers convinced in the last campaign are fairly evenly distributed across all ages and incomes.
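
To put a number on the high-earner observation, the share of converted customers above an income threshold can be computed per stage. This is a rough sketch: the 70k threshold comes from the text above, while the helper name `high_earner_share` and the toy rows are hypothetical.

```python
import pandas as pd

def high_earner_share(frame: pd.DataFrame, threshold: float = 70_000) -> pd.Series:
    """Fraction of converted customers in each stage whose income exceeds `threshold`."""
    converted = frame[frame["Conversion_Stage"] != 0].dropna(subset=["Income", "Conversion_Stage"])
    is_high = converted["Income"] > threshold
    # The mean of a boolean Series is the fraction of True values
    return is_high.groupby(converted["Conversion_Stage"]).mean()

# Toy rows for illustration only, not the real data
toy = pd.DataFrame({
    "Income": [80_000, 90_000, 30_000, 75_000, 40_000, 20_000],
    "Conversion_Stage": [1, 1, 1, 2, 2, 0],
})
share = high_earner_share(toy)
print(share)
```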

In [62]:
px.scatter(df.query('Conversion_Stage != 0').dropna(), x='Age', y='Income', color='Conversion_Stage',
          marginal_y='histogram')


Are frequent purchasers more likely to be persuaded earlier or later in the conversion process?

---

An interesting facet here is that the third campaign is extraordinarily effective for individuals with a single previous web purchase. Most online spenders are repeat customers with a few purchases already.
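
The counts behind a heatmap like this can also be tabulated directly, which makes individual cells easier to read. A minimal sketch, assuming the same filter as the plot (fewer than 13 web purchases, stage 0 excluded); the helper name `purchases_by_stage` and the toy rows are hypothetical.

```python
import pandas as pd

def purchases_by_stage(frame: pd.DataFrame, max_purchases: int = 12) -> pd.DataFrame:
    """Cross-tabulate conversion stage (rows) against number of web purchases (columns)."""
    subset = frame[(frame["NumWebPurchases"] <= max_purchases) & (frame["Conversion_Stage"] != 0)]
    return pd.crosstab(subset["Conversion_Stage"], subset["NumWebPurchases"])

# Toy rows for illustration only, not the real data
toy = pd.DataFrame({
    "NumWebPurchases": [1, 1, 3, 4, 1, 20],
    "Conversion_Stage": [3, 3, 1, 1, 0, 2],
})
table = purchases_by_stage(toy)
print(table)
```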

In [132]:
px.density_heatmap(df.query('NumWebPurchases<13 and Conversion_Stage!=0'), x='NumWebPurchases', 
                   y='Conversion_Stage', marginal_x='histogram', marginal_y='histogram')


---

In [96]:
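# Pairwise correlations between the numeric columns. In a notebook only the last
# expression in a cell is displayed, so this result is computed but not shown.
df[['Income', 'Kidhome', 'Teenhome', 'Recency', 'MntWines', 'MntFruits', 'MntMeatProducts', 
    'MntFishProducts', 'MntSweetProducts', 'MntGoldProds', 'Complain', 'Age', 'Conversion_Stage']].corr()
# Distribution of the number of children per household
df['Kidhome'].value_counts()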

0    1289
1     898
2      48
Name: Kidhome, dtype: int64

In [103]:
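def IncomeDiv(x):
    """Divide household income by an estimated household size."""
    # Start with the customer, then add a partner if married or living together
    ct = 1
    if x['Marital_Status'] in ['Married', 'Together']:
        ct += 1
    # Add the teenagers and children living at home
    ct += x['Teenhome']
    ct += x['Kidhome']
    return x['Income'] / ct

df['IncomePerHouseholdMember'] = df.apply(IncomeDiv, axis=1)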
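
The same per-member income can be computed without a row-wise `apply`, which is usually faster on larger frames. This is a sketch of an equivalent vectorized form, not the code that produced the plot; the function name `income_per_household_member` and the toy rows are hypothetical.

```python
import pandas as pd

def income_per_household_member(frame: pd.DataFrame) -> pd.Series:
    """Household income divided by customer + partner (if any) + children + teenagers."""
    has_partner = frame["Marital_Status"].isin(["Married", "Together"]).astype(int)
    household_size = 1 + has_partner + frame["Kidhome"] + frame["Teenhome"]
    return frame["Income"] / household_size

# Toy rows for illustration only, not the real data
toy = pd.DataFrame({
    "Marital_Status": ["Married", "Single", "Together"],
    "Income": [60_000, 50_000, 80_000],
    "Kidhome": [1, 0, 1],
    "Teenhome": [0, 0, 1],
})
per_member = income_per_household_member(toy)
print(per_member)
```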

In [106]:
px.scatter(df.query('Conversion_Stage != 0').dropna(), x='Age', y='IncomePerHouseholdMember', color='Conversion_Stage',
          marginal_y='histogram')
