# **Marketing EDA**

**Content:**
1. [Introduction:](#1)
    1. [Purpose](#2)
    1. [Import Packages](#3)
1. [Data Cleaning:](#4)
    1. [Read Dataset:](#5)
    1. [Data Titles and Format:](#6)
    1. [Duplicate Records:](#7)
    1. [Null Values:](#8)
    1. [Categorical Fields:](#9)
    1. [Numerical Fields:](#10)
1. [Feature Engineering:](#11)
    1. [Amount Spent:](#12)
    1. [Products Purchased:](#13)
    1. [Total Children:](#14)
    1. [Age Demographic:](#15)
1. [Customer Demographics:](#16)
    1. [Number of Customers by Demographic:](#17)
    1. [Insights:](#18)
    1. [Relationship Between Demographics and Amount Spent:](#19)
    1. [Insights:](#20)
1. [Product Performance:](#21)
    1. [Total Spent by Product Category:](#22)
    1. [Insights:](#23)
    1. [Relationship Between Product Sales and Demographics:](#24)
    1. [Insights:](#25)
1. [Sales Channel Performance:](#26)
    1. [Total Purchases by Sales Channel:](#27)
    1. [Insights:](#28)
    1. [Relationship Between Sales Channel Purchases and Demographics:](#29)
    1. [Insights:](#30)
1. [Correlations:](#31)
    1. [Correlation Heatmap:](#32)
    2. [Insights:](#33)
     
    


<a id="1"></a> <br>
### **Introduction**

<a id="2"></a> <br>
**Purpose:**


This notebook analyzes the sales data of a supermarket chain to find insights that can inform and optimize marketing decisions.

Examples of questions that this project will try to answer include:
1. Which sales channel has the most product purchases in each country?
2. Which product categories generate the most revenue?
3. In which countries are the majority of our customers located?

The analysis is split into 4 major parts: Customer Demographics, Product Performance, Sales Channel Performance, and Correlations.

<a id="3"></a> <br>
**Import packages**

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

<a id="4"></a> <br>
### **Data Cleaning**

<a id="5"></a> <br>
**Read Dataset**

In [None]:
data=pd.read_csv("../input/marketing-data/marketing_data.csv", sep=',')
# View dataset in table format
data.head()

In [None]:
# Get general information about the dataset
data.info()

1. The original dataset has 2,240 records with 28 columns.
2. The column title for Income has a typo: an extra space in " Income". This needs to be corrected because it can cause problems during the analysis.
3. The "Income" variable needs its datatype changed to float, and the "Dt_Customer" variable needs its datatype changed to datetime.


<a id="6"></a> <br>
**Data Titles and Format**

In [None]:
# Isolate the column titles into a list
column_titles = list(data.columns)

In [None]:
# Rename the 'Income' title
data = data.rename(columns={column_titles[4]:'Income'})

In [None]:
# Change the Dt_Customer field data type to datetime
data["Dt_Customer"] = pd.to_datetime(data["Dt_Customer"], format='%m/%d/%y')
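# Change the Income field data type to float
# regex=False treats "$" as a literal character instead of a regex end-of-string anchor
data["Income"] = data["Income"].str.replace("$","", regex=False).str.replace(",","", regex=False)
data["Income"] = data["Income"].astype(float)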
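
The currency clean-up above can be checked on a few toy values. This is a minimal sketch: the sample strings (including the trailing space) are an assumed format for the raw Income column, not values taken from the dataset, and `raw_income` and `clean_income` are hypothetical names.

```python
import pandas as pd

# Toy values in an assumed "$84,835.00 " style; None stands in for a missing income
raw_income = pd.Series(["$84,835.00 ", "$57,091.00 ", None])

# Strip the literal "$" and "," characters, then any surrounding whitespace
stripped = (
    raw_income.str.replace("$", "", regex=False)
              .str.replace(",", "", regex=False)
              .str.strip()
)

# Convert the cleaned strings to floats; the missing entry becomes NaN
clean_income = pd.to_numeric(stripped)
print(clean_income.tolist())
```

Passing `regex=False` keeps the behavior the same across pandas versions, since "$" would otherwise be open to interpretation as a regular-expression anchor.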

<a id="7"></a> <br>
**Duplicate Records**

In [None]:
# Flag the duplicate records (True marks a repeat of an earlier row)
duplicates = data.duplicated()
# Count the number of duplicate records
duplicate_count = duplicates.sum()
# Print the number of duplicate records
print(duplicate_count)

There are no duplicate records

<a id="8"></a> <br>
**Null Values**

In [None]:
# Count null values for each field
data.isnull().sum()

There are 24 records with missing "Income" values

In [None]:
# Impute missing income values using the median income
data["Income"] = data["Income"].fillna(value=data["Income"].median())
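
One plausible reason to impute with the median rather than the mean is that the median is less sensitive to extreme incomes. The sketch below illustrates this on made-up numbers; the values and the names `income`, `median_income`, `mean_income`, and `imputed` are illustrative assumptions only.

```python
import numpy as np
import pandas as pd

# Toy incomes with one missing value and one extreme value
income = pd.Series([20000.0, 40000.0, np.nan, 60000.0, 1000000.0])

# Both statistics skip NaN by default
median_income = income.median()  # middle of the four known values
mean_income = income.mean()      # pulled upward by the extreme value

# Fill the missing entry with the median
imputed = income.fillna(median_income)
print(median_income, mean_income)
print(imputed.tolist())
```

Here the median stays close to a typical income, while the mean is dragged far above most of the values by the single large entry.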

<a id="9"></a> <br>
**Categorical Fields**

In [None]:
# Make list of categorical variables 
cat_var = ["Education", "Marital_Status", "Country"]

In [None]:
# Obtain all unique values for each categorical variable to identify errors
for i in cat_var:
    print(f"{i} Unique Values: {data[i].unique()}")

1. The Education values '2n Cycle' and 'Master' have the same meaning. The '2n Cycle' values should be merged into 'Master'.
2. The Marital_Status values 'YOLO', 'Alone', and 'Absurd' all mean 'Single', so these values should be merged into 'Single'.
3. The 'Country' variable does not require changes.

In [None]:
# Convert '2n Cycle' values to 'Master'
data["Education"] = data["Education"].replace(["2n Cycle"], value="Master")
# Convert 'YOLO', 'Alone', and 'Absurd' values to 'Single'
data["Marital_Status"] = data["Marital_Status"].replace(["YOLO", "Alone", "Absurd"], value="Single")

<a id="10"></a> <br>
**Numerical Fields**

In [None]:
# Group numerical variables into a new dataframe
num = ['Year_Birth','Income', 'Recency', 'MntWines', 'MntFruits',
       'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts',
       'MntGoldProds', 'NumDealsPurchases', 'NumWebPurchases',
       'NumCatalogPurchases', 'NumStorePurchases', 'NumWebVisitsMonth']
data_num = data[num]

In [None]:
# View basic stats of the numerical variables
data_num.describe()

In [None]:
# Obtain a boxplot for each numerical field to identify outliers
for col in data_num.columns:
    plt.figure()
    data_num.boxplot([col])
    plt.title(f'{col} Box Plot')

After observing the boxplots, the birth year outliers should be removed, as they are likely typos. The income variable also needs to be corrected. The other outliers should not be removed, as they are reasonable values given the context.

In [None]:
# Only keep records with birth years over 1900
data = data[data["Year_Birth"]>1900]
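
The points drawn beyond the whiskers of a default boxplot follow the 1.5 × IQR rule, so the same outliers can be listed numerically. This is a rough sketch on made-up birth years; `iqr_outliers` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]

# Toy birth years, two of which are implausibly early
birth_years = pd.Series([1893, 1899, 1955, 1962, 1970, 1975, 1978, 1984, 1990])
flagged = iqr_outliers(birth_years)
print(flagged.tolist())
```

A flagged value is only a candidate for removal; as noted above, whether to drop it still depends on whether it is plausible in context.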

In [None]:
# Plot the income distribution (histplot replaces the deprecated distplot)
sns.histplot(data["Income"], kde=True)
plt.title("Income Distribution")

As shown by the distribution plot and box plot for income, the income range is very large, which can cause problems in the analysis. For this reason, log(income) will be added to the dataset.

In [None]:
# Add a field for log(income)
data["log(Income)"] = np.log(data["Income"])
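
Note that `np.log` returns negative infinity for an income of exactly 0. If the data could contain zeros, one common alternative (not what this notebook does) is `np.log1p`, which computes log(1 + x). The sketch below uses made-up incomes; the names are illustrative.

```python
import numpy as np
import pandas as pd

# Toy incomes, including a zero that plain np.log would map to -inf
income = pd.Series([0.0, 1730.0, 51381.5, 666666.0])

# log1p(x) = log(1 + x) stays finite at zero and is close to log(x) for large x
log_income = np.log1p(income)
print(log_income.round(3).tolist())
```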

<a id="11"></a> <br>
### **Feature Engineering**

<a id="12"></a> <br>
**Amount Spent**

In [None]:
# Create amount spent field
data["Amount_Spent"] = data["MntWines"] + data["MntFruits"] + data["MntMeatProducts"] + data["MntFishProducts"] + data["MntSweetProducts"] + data["MntGoldProds"]

<a id="13"></a> <br>
**Products Purchased**

In [None]:
# Create products purchased field
data["Products_Purchased"] = data["NumWebPurchases"]+data["NumCatalogPurchases"]+data["NumStorePurchases"]

<a id="14"></a> <br>
**Total Children**

In [None]:
# Create total children field
data["Total_Children"] = data["Kidhome"] + data ["Teenhome"]

<a id="15"></a> <br>
**Age Demographic**

In [None]:
# Create age demographic field
data["Age_Demographic"] = pd.cut(data["Year_Birth"], bins=[1900,1945,1964,1980,1996,2012], labels=["Silent Gen", "Baby Boomer", "Gen X","Millennial", "Gen Z"])
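
By default `pd.cut` builds right-inclusive intervals, so a boundary year such as 1964 falls into the earlier generation. The sketch below applies the same bins and labels to a few made-up boundary years to show where each one lands.

```python
import pandas as pd

# Same bin edges and labels as the cell above
bins = [1900, 1945, 1964, 1980, 1996, 2012]
labels = ["Silent Gen", "Baby Boomer", "Gen X", "Millennial", "Gen Z"]

# Toy birth years chosen to sit on or next to the bin edges
years = pd.Series([1945, 1946, 1964, 1965, 1980, 1981, 1996])

# Intervals are (1900, 1945], (1945, 1964], ... because right=True is the default
generations = pd.cut(years, bins=bins, labels=labels)
print(pd.DataFrame({"Year_Birth": years, "Age Demographic": generations}))
```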

<a id="16"></a> <br>
### **Customer Demographics**

<a id="17"></a> <br>
**Number of Customers by Demographic**

In [None]:
# Rename columns to get cleaner graph labels
data = data.rename(columns={"Amount_Spent":"Amount Spent", "Marital_Status":"Marital Status", "Products_Purchased":"Products Purchased", "Total_Children": "Total Children", "Age_Demographic": "Age Demographic"})

In [None]:
# Make a list of customer demographic variables
dem_variables = ["Country", "Marital Status","Age Demographic", "Total Children", "Education"]

In [None]:
# Make a bar chart that visualizes the number of customers for each demographic using a for loop
for i in dem_variables:
    plt.figure()
    sns.countplot(x=data[i])
    plt.title(f"Number of Customers by {i}")

<a id="18"></a> <br>
**Insights:**

1. The majority of customers live in Spain, followed by South Africa.
2. Most customers are married, followed by those in the 'Together' category.
3. Most customers are Gen X, followed by the Baby Boomer generation.
4. The majority of customers have 1 child, followed by no children.
5. The customer base is highly educated, with an undergraduate education being the most common, followed by a master's degree.

<a id="19"></a> <br>
**Relationship Between Demographics and Amount Spent**

In [None]:
# Make a table showing total and average amount spent grouped by each demographic using a for loop
for dem in dem_variables:
    c_table = data.groupby(dem)["Amount Spent"].agg(["sum", "mean"])
    c_table.columns = ["Amount Spent", "Average Spent"]
    # Make a bar graph of the total and the average for the current demographic
    for col in c_table.columns:
        plt.figure()
        sns.barplot(x=c_table.index, y=c_table[col])
        plt.title(f"{col} by {dem}")

<a id="20"></a> <br>
**Insights:**

1. Spanish customers spent the most overall, but were in 6th place for average amount spent.
2. Married customers spent the most overall, but spent the least on average.
3. Gen X customers spent the most overall, but spent the least on average.
4. Customers without children spent the most overall and on average.
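
A group can rank first on total spend and last on average spend simply because it has more customers. The sketch below shows this on made-up data with two placeholder groups, "A" and "B"; adding a `count` column next to the sum and mean makes the effect visible.

```python
import pandas as pd

# Toy data: group A has more customers, group B has higher spenders
toy = pd.DataFrame({
    "Country": ["A"] * 4 + ["B"] * 2,
    "Amount Spent": [100, 200, 300, 400, 500, 400],
})

# Customer count, total spend, and average spend per group
summary = toy.groupby("Country")["Amount Spent"].agg(["count", "sum", "mean"])
print(summary)
```

In this toy table, A leads on the total while B leads on the average, which mirrors the pattern described in the insights above.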

<a id="21"></a> <br>
### **Product Performance**

<a id="22"></a> <br>
**Total Spent by Product Category**

In [None]:
# Rename columns to get cleaner graph labels
data = data.rename(columns={"MntWines" : "Wines",
                     "MntFruits" : "Fruits",
                     "MntMeatProducts": "Meats",
                     "MntFishProducts" : "Fish",
                     "MntSweetProducts" : "Sweets",
                     "MntGoldProds" : "Gold"})

In [None]:
# Make a list of product variables
products = ["Wines", "Fruits", "Meats", "Fish", "Sweets", "Gold"]

In [None]:
# Make table that shows the total amount spent per product category
product_sum = data[products].sum(axis=0)
product_sum = pd.DataFrame(product_sum, columns=["Amount Spent"])
# Graph the table
sns.barplot(x=product_sum.index,y=product_sum["Amount Spent"])
plt.title("Total Spent Per Product Category")
plt.xlabel('Product Type')

<a id="23"></a> <br>
**Insights:**

1. The majority of revenue came from Wines followed by Meats
2. The least amount of revenue came from fruits

<a id="24"></a> <br>
**Relationship Between Product Sales and Demographics**

In [None]:
# Make a stacked bar chart showing the percentage of sales attributed to each product for every demographic variable
for i in dem_variables:
    df = data[[i,'Wines','Fruits', 'Meats', 'Fish', 'Sweets', 'Gold']].groupby(i).sum()
    # Divide each row by its total so the product shares sum to 100%
    df = df.div(df.sum(axis=1), axis=0)*100
    df = df.reset_index()
    df.plot(
        x=i,
        kind='barh',
        stacked=True,
        mark_right=True)
    plt.title(f"Percent of Sales Per Product by {i}")
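
The row normalization used in the loop above can be seen in isolation on a small table. This is a sketch with made-up totals for two education levels; `totals` and `shares` are illustrative names.

```python
import pandas as pd

# Toy spending totals per product for two groups
totals = pd.DataFrame(
    {"Wines": [300, 50], "Meats": [150, 30], "Fruits": [50, 20]},
    index=["Graduation", "Basic"],
)

# Divide every row by its own total, then scale to percentages
shares = totals.div(totals.sum(axis=1), axis=0) * 100
print(shares)
```

Because each row is scaled by its own total, groups of very different sizes become directly comparable in the stacked chart.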

<a id="25"></a> <br>
**Insights:**

1. The proportion of revenue by product is very similar in each country, except that Montenegro is the only country where customers did not purchase any gold.
2. Among customers with 0 children, Meats performed better and Wines performed worse than among customers with 1, 2, or 3 children.
3. There is a positive relationship between Wine performance and the education level of customers.
4. Gold, Fish, Sweets, and Fruits performed much better among customers with a Basic education level. Conversely, Wines and Meats performed the worst among customers with a Basic education level.
5. Interestingly, the performance of Sweets doesn't increase as the number of children increases.

<a id="26"></a> <br>
### **Sales Channel Performance**

<a id="27"></a> <br>
**Total Purchases by Sales Channel**

In [None]:
# Rename columns to make clearer graph labels
data = data.rename(columns={"NumWebPurchases" : "Website",
                     "NumCatalogPurchases" : "Catalog",
                     "NumStorePurchases": "Store",
                     })

In [None]:
# Make list of sales channel variables
sales_c = ["Website", "Catalog", "Store"]

In [None]:
# Make a table showing the number of purchases across each sales channel
sales_channel = data[sales_c].sum(axis=0)
sales_channel = pd.DataFrame(sales_channel, columns=["Number of Purchases"])
# Graph the table
sns.barplot(x=sales_channel.index, y=sales_channel["Number of Purchases"])
plt.title("Number of Purchases Per Sales Channel")
plt.xlabel("Sales Channel")

<a id="28"></a> <br>
**Insights:**

1. The majority of purchases were made through the store followed by the website and the catalog

<a id="29"></a> <br>
**Relationship Between Sales Channel Purchases and Demographics**

In [None]:
# Make a stacked bar chart showing the percentage of purchases attributed to each sales channel for every demographic variable
for i in dem_variables:
    df = data[[i,"Website", "Catalog", "Store"]].groupby(i).sum()
    # Divide each row by its total so the channel shares sum to 100%
    df = df.div(df.sum(axis=1), axis=0)*100
    df = df.reset_index()
    df.plot(
        x=i,
        kind='barh',
        stacked=True,
        mark_right=True)
    plt.title(f"Percent of Purchases Per Sales Channel by {i}")

<a id="30"></a> <br>
**Insights:**

1. Sales channel performance is very similar across the countries, except that the Catalog performed better and the Store performed worse in Montenegro than in the other countries.
2. Interestingly, the Website performed worst among Millennials and best among the Silent Gen.
3. There is a negative relationship between the number of children customers have and Store performance.
4. The Website performed much worse and the Catalog performed much better among customers with 0 children than among customers with 1, 2, or 3 children.
5. The Store performed better among customers with a Basic education, while the Catalog performed much worse.

<a id="31"></a> <br>
### **Correlations**

<a id="32"></a> <br>
**Correlation Heatmap**

In [None]:
# Make a list of variables to include in the correlation graph
corr_var = ["Year_Birth", "Kidhome", "Teenhome", "Recency", "Wines", "Fruits", "Meats", "Fish", "Sweets", "Gold", "NumDealsPurchases", "Website", "Catalog", "Store", "NumWebVisitsMonth", "log(Income)", "Amount Spent", "Products Purchased", "Total Children"]

In [None]:
# Select the variables of interest and compute their pairwise correlations
corr_data = data[corr_var]
# Graph the correlations
plt.figure(figsize=(20,10))
sns.heatmap(corr_data.corr(),annot=True)
plt.title("Correlation Heatmap", size=20)

<a id="33"></a> <br>
**Insights:**

1. Wines is the only product type with a strong positive relationship between the amount spent on the product and the total number of products purchased.
2. Wines and Meats are the only product types with a strong positive correlation between the amount spent on the product and the total amount spent.
3. The amount spent on Meats has a strong positive relationship with the number of products purchased through the Catalog.
4. The number of products purchased through the catalog has a strong positive relationship with the total amount spent.
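
Strong relationships can also be pulled out of the correlation matrix programmatically instead of being read off the heatmap. The sketch below uses made-up numbers, and the 0.7 cutoff for "strong" is an assumption made here for illustration; the notebook does not define a threshold.

```python
import pandas as pd

# Toy data: Wines tracks total spend closely, Fruits does not
toy = pd.DataFrame({
    "Wines": [10, 200, 400, 650, 900],
    "Fruits": [30, 5, 40, 10, 25],
    "Amount Spent": [60, 260, 520, 800, 1100],
})

# Pearson correlations of every variable with the total amount spent
corr_with_total = toy.corr()["Amount Spent"].drop("Amount Spent")

# Keep only correlations at or above the assumed 0.7 cutoff (in absolute value)
strong = corr_with_total[corr_with_total.abs() >= 0.7]
print(strong)
```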