# Marketing Lab (Advanced Visualization)

**Learning Objectives:**
  * Develop advanced visualizations
  * Gain exposure to marketing-related datasets

## Context of the Analysis

### Context
A superstore is planning its year-end sale. It wants to launch a new offer, a gold membership that gives a 20% discount on all purchases, for only $499 (the price on other days is $999). The offer will be valid only for existing customers, and a phone-call campaign targeting them is currently being planned. Management believes that the best way to reduce the cost of the campaign is to build a predictive model that identifies the customers who are likely to purchase the offer.

### Objective
The superstore wants to predict the likelihood that a customer responds positively and to identify the factors that affect the customer's response. You need to analyze the data provided to identify these factors and infer the purchasing patterns that determine the propensity to accept the offer.

### About this file
This data was gathered during last year's campaign.
#### Data description

* Response (target) - 1 if customer accepted the offer in the last campaign, 0 otherwise
* Id - unique ID of each customer
* Year_Birth - customer's year of birth
* Complain - 1 if the customer complained in the last 2 years, 0 otherwise
* Dt_Customer - date of customer's enrollment with the company
* Education - customer's level of education
* Marital_Status - customer's marital status
* Kidhome - number of small children in customer's household
* Teenhome - number of teenagers in customer's household
* Income - customer's yearly household income
* MntFishProducts - the amount spent on fish products in the last 2 years
* MntMeatProducts - the amount spent on meat products in the last 2 years
* MntFruits - the amount spent on fruit products in the last 2 years
* MntSweetProducts - the amount spent on sweet products in the last 2 years
* MntWines - the amount spent on wine products in the last 2 years
* MntGoldProds - the amount spent on gold products in the last 2 years
* NumDealsPurchases - number of purchases made with discount
* NumCatalogPurchases - number of purchases made using catalog (buying goods to be shipped through the mail)
* NumStorePurchases - number of purchases made directly in stores
* NumWebPurchases - number of purchases made through the company's website
* NumWebVisitsMonth - number of visits to company's website in the last month
* Recency - number of days since the last purchase

## 1. Library Import

In [1]:
from sklearn.datasets import fetch_openml
from sklearn.linear_model import SGDClassifier
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_score, recall_score

from sklearn.metrics import precision_recall_curve

import matplotlib.pyplot as plt

In [2]:

import warnings

import numpy as np
import pandas as pd
import seaborn as sns
from scipy import stats

In [3]:
warnings.simplefilter('ignore')

## 2. Data loading and DataFrame creation

In [4]:
Data=pd.read_csv("https://raw.githubusercontent.com/thousandoaks/Maths4DS-III/refs/heads/main/datasets/superstore_data.csv")


In [5]:
Data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2240 entries, 0 to 2239
Data columns (total 22 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Id                   2240 non-null   int64  
 1   Year_Birth           2240 non-null   int64  
 2   Education            2240 non-null   object 
 3   Marital_Status       2240 non-null   object 
 4   Income               2216 non-null   float64
 5   Kidhome              2240 non-null   int64  
 6   Teenhome             2240 non-null   int64  
 7   Dt_Customer          2240 non-null   object 
 8   Recency              2240 non-null   int64  
 9   MntWines             2240 non-null   int64  
 10  MntFruits            2240 non-null   int64  
 11  MntMeatProducts      2240 non-null   int64  
 12  MntFishProducts      2240 non-null   int64  
 13  MntSweetProducts     2240 non-null   int64  
 14  MntGoldProds         2240 non-null   int64  
 15  NumDealsPurchases    2240 non-null   i
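The `Data.info()` output shows 2216 non-null `Income` values out of 2240 rows, so 24 incomes are missing. A minimal sketch of how such gaps can be counted and excluded, using a small toy frame (the values below are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (illustrative values only)
toy = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "Income": [58000.0, np.nan, 42000.0, np.nan],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()

# Keep only the rows where Income is known
toy_known_income = toy.dropna(subset=["Income"])

print(missing_per_column)
print(len(toy_known_income), "rows with a known income")
```

Whether to drop or impute the missing incomes is a modelling decision; plots that use `Income` will simply omit those rows.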

In [6]:
Data.sample(3).T

Unnamed: 0,1343,1694,533
Id,5186,11010,1162
Year_Birth,1955,1984,1987
Education,PhD,PhD,PhD
Marital_Status,Married,Single,Single
Income,58482.0,16269.0,42000.0
Kidhome,0,1,0
Teenhome,1,0,0
Dt_Customer,12/3/2014,8/30/2013,10/1/2013
Recency,59,75,23
MntWines,576,19,124


## 3. Data Transformation

In [7]:
## We convert the date-related columns to datetimes (see the next cell)




In [8]:
# prompt: convert Year_Birth and Dt_Customer to datetimes

# Convert 'Year_Birth' (a four-digit year) to a datetime; each value becomes January 1 of that year
Data['Year_Birth'] = pd.to_datetime(Data['Year_Birth'], format='%Y', errors='coerce')

# Convert 'Dt_Customer' to a datetime; the sample above shows month/day/year strings such as 8/30/2013
# errors='coerce' turns any value that cannot be parsed into NaT instead of raising an error
Data['Dt_Customer'] = pd.to_datetime(Data['Dt_Customer'], errors='coerce')
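Because `errors='coerce'` silently turns unparseable strings into `NaT`, it can help to state the expected format explicitly and then count the failures. A minimal sketch, assuming the month/day/year layout seen in the sample rows (`12/3/2014`, `8/30/2013`); the last toy value is deliberately malformed:

```python
import pandas as pd

# Toy strings in the same layout as the Dt_Customer sample, plus one bad value
raw_dates = pd.Series(["12/3/2014", "8/30/2013", "10/1/2013", "not a date"])

# Explicit month/day/year format; values that do not match become NaT
parsed = pd.to_datetime(raw_dates, format="%m/%d/%Y", errors="coerce")

# Count how many values failed to parse
n_failed = int(parsed.isna().sum())

print(parsed)
print("unparseable values:", n_failed)
```

A non-zero count is a signal to inspect the raw strings before continuing.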


In [9]:
Data.sample(3).T

Unnamed: 0,438,1005,1461
Id,9596,8812,3439
Year_Birth,1980-01-01 00:00:00,1979-01-01 00:00:00,1972-01-01 00:00:00
Education,PhD,2n Cycle,Graduation
Marital_Status,Single,Divorced,Married
Income,65295.0,13533.0,56721.0
Kidhome,0,1,1
Teenhome,0,0,1
Dt_Customer,2013-12-23 00:00:00,2013-10-03 00:00:00,2012-10-31 00:00:00
Recency,19,45,64
MntWines,365,12,157


In [10]:
## Customers born before 1940 are probably outliers; this mask is applied in cell 15 to remove them

AgeFilter=Data['Year_Birth']>='1940-01-01'

In [11]:
Data[~AgeFilter]

Unnamed: 0,Id,Year_Birth,Education,Marital_Status,Income,Kidhome,Teenhome,Dt_Customer,Recency,MntWines,...,MntFishProducts,MntSweetProducts,MntGoldProds,NumDealsPurchases,NumWebPurchases,NumCatalogPurchases,NumStorePurchases,NumWebVisitsMonth,Response,Complain
513,11004,1893-01-01,2n Cycle,Single,60182.0,0,1,2014-05-17,23,8,...,7,0,2,1,1,0,2,4,0,0
827,1150,1899-01-01,PhD,Together,83532.0,0,0,2013-09-26,36,755,...,104,64,224,1,4,6,4,1,0,0
2233,7829,1900-01-01,2n Cycle,Divorced,36640.0,1,0,2013-09-26,99,15,...,7,4,25,1,2,1,2,5,0,1


In [12]:
# We sort the Dt_Customer column to find the most recent enrollment date, which guides the choice of reference date
Data['Dt_Customer'].sort_values()

Unnamed: 0,Dt_Customer
1509,2012-01-08
2222,2012-01-08
455,2012-01-08
1398,2012-01-08
2239,2012-01-09
...,...
667,2014-12-05
52,2014-12-05
434,2014-12-05
1569,2014-12-05


In [13]:
# We compute how long each customer has been with the company as of the reference date 2015-01-01,
# chosen just after the most recent enrollment date (2014-12-05); a month is approximated as 30 days
Data['TimeWithUs']=pd.Timestamp('2015-01-01')-Data['Dt_Customer']
Data['MonthsWithUs']=Data['TimeWithUs']/np.timedelta64(30, "D")

In [14]:
## We compute the age of each customer as of the same reference date, 2015-01-01; a year is approximated as 365 days

Data['Age']=pd.Timestamp('2015-01-01')-Data['Year_Birth']
Data['AgeYears']=Data['Age']/np.timedelta64(365, "D")
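Dividing a timedelta by 365 days gives slightly fractional ages (45.03 rather than 45) because leap days are counted. If whole years are preferred, the age can be taken directly from the year component. A minimal sketch on toy rows; the name `DaysWithUs` is introduced here for illustration and is not a column of the notebook:

```python
import pandas as pd

REFERENCE_DATE = pd.Timestamp("2015-01-01")  # same reference date as in the notebook

# Toy rows mirroring the first customers shown in the outputs
toy = pd.DataFrame({
    "Year_Birth": pd.to_datetime(["1970", "1961", "1989"], format="%Y"),
    "Dt_Customer": pd.to_datetime(["2014-06-16", "2014-06-15", "2014-08-04"]),
})

# Age in whole years, computed from the year component only
toy["AgeYears"] = REFERENCE_DATE.year - toy["Year_Birth"].dt.year

# Tenure in whole days as a plain integer column (hypothetical column name)
toy["DaysWithUs"] = (REFERENCE_DATE - toy["Dt_Customer"]).dt.days

print(toy)
```

Integer columns like these are also easier to plot than raw timedeltas.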

In [15]:
SuperStore=Data[AgeFilter].copy()  # copy, so that columns added later do not write to a slice of Data
SuperStore.sample(3).T

Unnamed: 0,1352,91,1430
Id,8969,4002,2406
Year_Birth,1977-01-01 00:00:00,1960-01-01 00:00:00,1949-01-01 00:00:00
Education,Graduation,PhD,Graduation
Marital_Status,Married,Married,Together
Income,71855.0,77037.0,54591.0
Kidhome,0,0,0
Teenhome,1,1,1
Dt_Customer,2013-01-16 00:00:00,2013-10-13 00:00:00,2013-05-08 00:00:00
Recency,59,3,63
MntWines,548,463,376


## 4. Product Analysis

In [16]:
SuperStore.columns

Index(['Id', 'Year_Birth', 'Education', 'Marital_Status', 'Income', 'Kidhome',
       'Teenhome', 'Dt_Customer', 'Recency', 'MntWines', 'MntFruits',
       'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts',
       'MntGoldProds', 'NumDealsPurchases', 'NumWebPurchases',
       'NumCatalogPurchases', 'NumStorePurchases', 'NumWebVisitsMonth',
       'Response', 'Complain', 'TimeWithUs', 'MonthsWithUs', 'Age',
       'AgeYears'],
      dtype='object')

In [17]:
# prompt: sum values in columns 'MntWines', 'MntFruits',
#        'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts',
#        'MntGoldProds'

# Calculate the sum of values in specified columns
SuperStore['TotalSpent'] = SuperStore[[
    'MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts',
    'MntSweetProducts', 'MntGoldProds'
]].sum(axis=1)
SuperStore.head()

Unnamed: 0,Id,Year_Birth,Education,Marital_Status,Income,Kidhome,Teenhome,Dt_Customer,Recency,MntWines,...,NumCatalogPurchases,NumStorePurchases,NumWebVisitsMonth,Response,Complain,TimeWithUs,MonthsWithUs,Age,AgeYears,TotalSpent
0,1826,1970-01-01,Graduation,Divorced,84835.0,0,0,2014-06-16,0,189,...,4,6,1,1,0,199 days,6.633333,16436 days,45.030137,1190
1,1,1961-01-01,Graduation,Single,57091.0,0,0,2014-06-15,0,464,...,3,7,5,1,0,200 days,6.666667,19723 days,54.035616,577
2,10476,1958-01-01,Graduation,Married,67267.0,0,1,2014-05-13,0,134,...,2,5,2,0,0,233 days,7.766667,20819 days,57.038356,251
3,1386,1967-01-01,Graduation,Together,32474.0,1,1,2014-11-05,0,10,...,0,2,7,0,0,57 days,1.9,17532 days,48.032877,11
4,5371,1989-01-01,Graduation,Single,21474.0,1,0,2014-08-04,0,6,...,1,2,7,1,0,150 days,5.0,9496 days,26.016438,91


In [18]:
# prompt: a scatter plot in plotly relating income and totalspent, limit x-axis to 100k, add a regression curve second order

import plotly.express as px
import plotly.graph_objects as go

# SuperStore must already contain the 'Income' and 'TotalSpent' columns

# trendline="lowess" draws a locally weighted (LOWESS) smoothing curve, not a second-order polynomial fit
# frac is the fraction of the data used for each local fit: larger values give a smoother curve
fig = px.scatter(SuperStore, x="Income", y="TotalSpent",
                 trendline="lowess", trendline_options=dict(frac=0.1),
                 trendline_color_override="red")

fig.update_layout(
    xaxis_range=[0, 100000]  # Limit x-axis to 100k
)

fig.show()
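The prompt for this cell asks for a second-order regression curve, whereas LOWESS is a local smoother. One way to obtain an actual quadratic fit is `numpy.polyfit` with `deg=2`. The sketch below uses synthetic income and spending values (not the SuperStore data) and rescales income to thousands, which is an assumption made here to keep the fit well conditioned:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: spending grows quadratically with income, plus noise
income = np.linspace(10_000, 100_000, 50)
total_spent = 2e-7 * income**2 + 0.001 * income + 50 + rng.normal(0, 20, income.size)

# Work in thousands of dollars to avoid very large powers of income
x = income / 1000.0

# Least-squares fit of y = a*x^2 + b*x + c; coefficients are returned highest degree first
coeffs = np.polyfit(x, total_spent, deg=2)

# Evaluate the fitted curve at the observed incomes
fitted = np.polyval(coeffs, x)

print("quadratic, linear, constant terms:", coeffs)
```

The fitted values could then be drawn over the scatter plot as a line, for example with `fig.add_scatter(x=income, y=fitted, mode="lines")`.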

In [19]:
# Same scatter plot, coloured by the campaign response (no trendline this time)
fig = px.scatter(SuperStore, x="Income", y="TotalSpent",color='Response')

fig.update_layout(
    xaxis_range=[0, 100000]  # Limit x-axis to 100k
)

fig.show()

In [20]:
# Distribution of total spending by age; each distinct AgeYears value gets its own box
fig = px.box(SuperStore, x="AgeYears", y="TotalSpent")


fig.show()
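With one box per distinct age the plot can become crowded. Grouping ages into bands first usually gives a more readable comparison. A minimal sketch on toy data; the band edges and labels are arbitrary choices made for illustration:

```python
import pandas as pd

# Toy ages and spending (illustrative values only)
toy = pd.DataFrame({
    "AgeYears": [26.0, 38.0, 45.0, 54.0, 57.0, 66.0],
    "TotalSpent": [91, 55, 1190, 577, 251, 800],
})

# Hypothetical age bands; right=False makes each band include its lower edge
bins = [18, 30, 45, 60, 80]
labels = ["18-29", "30-44", "45-59", "60-79"]
toy["AgeBand"] = pd.cut(toy["AgeYears"], bins=bins, labels=labels, right=False)

# Median spending per band, the statistic a box plot draws as its centre line
median_spent = toy.groupby("AgeBand", observed=True)["TotalSpent"].median()

print(median_spent)
```

The new `AgeBand` column could then be passed as `x` to `px.box` in place of `AgeYears`.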

In [21]:
SuperStore.head(3)

Unnamed: 0,Id,Year_Birth,Education,Marital_Status,Income,Kidhome,Teenhome,Dt_Customer,Recency,MntWines,...,NumCatalogPurchases,NumStorePurchases,NumWebVisitsMonth,Response,Complain,TimeWithUs,MonthsWithUs,Age,AgeYears,TotalSpent
0,1826,1970-01-01,Graduation,Divorced,84835.0,0,0,2014-06-16,0,189,...,4,6,1,1,0,199 days,6.633333,16436 days,45.030137,1190
1,1,1961-01-01,Graduation,Single,57091.0,0,0,2014-06-15,0,464,...,3,7,5,1,0,200 days,6.666667,19723 days,54.035616,577
2,10476,1958-01-01,Graduation,Married,67267.0,0,1,2014-05-13,0,134,...,2,5,2,0,0,233 days,7.766667,20819 days,57.038356,251


In [22]:

fig = px.parallel_categories(SuperStore, dimensions=['Education', 'Marital_Status','Kidhome','Teenhome'],
                color="Response", color_continuous_scale=px.colors.sequential.Inferno,
                )
fig.show()
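The parallel categories plot shows how responses flow across category combinations. A numeric companion is the response rate per category, which is the mean of the 0/1 `Response` column within each group. A minimal sketch on toy rows (illustrative values only):

```python
import pandas as pd

# Toy rows with the same column names as the notebook
toy = pd.DataFrame({
    "Education": ["PhD", "PhD", "Graduation", "Graduation", "Graduation", "2n Cycle"],
    "Response": [1, 0, 0, 0, 1, 0],
})

# The mean of a 0/1 column is the share of positive responses; count gives the group size
response_rate = toy.groupby("Education")["Response"].agg(["mean", "count"])

print(response_rate)
```

Reporting the group size next to the rate helps avoid over-reading categories with very few customers.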

## 5. Customer analysis

In [23]:
SuperStore

Unnamed: 0,Id,Year_Birth,Education,Marital_Status,Income,Kidhome,Teenhome,Dt_Customer,Recency,MntWines,...,NumCatalogPurchases,NumStorePurchases,NumWebVisitsMonth,Response,Complain,TimeWithUs,MonthsWithUs,Age,AgeYears,TotalSpent
0,1826,1970-01-01,Graduation,Divorced,84835.0,0,0,2014-06-16,0,189,...,4,6,1,1,0,199 days,6.633333,16436 days,45.030137,1190
1,1,1961-01-01,Graduation,Single,57091.0,0,0,2014-06-15,0,464,...,3,7,5,1,0,200 days,6.666667,19723 days,54.035616,577
2,10476,1958-01-01,Graduation,Married,67267.0,0,1,2014-05-13,0,134,...,2,5,2,0,0,233 days,7.766667,20819 days,57.038356,251
3,1386,1967-01-01,Graduation,Together,32474.0,1,1,2014-11-05,0,10,...,0,2,7,0,0,57 days,1.900000,17532 days,48.032877,11
4,5371,1989-01-01,Graduation,Single,21474.0,1,0,2014-08-04,0,6,...,1,2,7,1,0,150 days,5.000000,9496 days,26.016438,91
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2235,10142,1976-01-01,PhD,Divorced,66476.0,0,1,2013-07-03,99,372,...,2,11,4,0,0,547 days,18.233333,14245 days,39.027397,689
2236,5263,1977-01-01,2n Cycle,Married,31056.0,1,0,2013-01-22,99,5,...,0,3,8,0,0,709 days,23.633333,13879 days,38.024658,55
2237,22,1976-01-01,Graduation,Divorced,46310.0,1,0,2012-03-12,99,185,...,1,5,8,0,0,1025 days,34.166667,14245 days,39.027397,309
2238,528,1978-01-01,Graduation,Married,65819.0,0,0,2012-11-29,99,267,...,4,10,3,0,0,763 days,25.433333,13514 days,37.024658,1383


### We need to encode categorical variables

In [24]:
# Identify categorical features
categorical_features = SuperStore.select_dtypes(include=['object']).columns # Select columns with 'object' dtype
categorical_features

Index(['Education', 'Marital_Status'], dtype='object')

In [25]:
from sklearn.preprocessing import OrdinalEncoder
# Create an OrdinalEncoder
enc = OrdinalEncoder()

# Fit and transform the categorical features
SuperStore[categorical_features] = enc.fit_transform(SuperStore[categorical_features])
SuperStore

Unnamed: 0,Id,Year_Birth,Education,Marital_Status,Income,Kidhome,Teenhome,Dt_Customer,Recency,MntWines,...,NumCatalogPurchases,NumStorePurchases,NumWebVisitsMonth,Response,Complain,TimeWithUs,MonthsWithUs,Age,AgeYears,TotalSpent
0,1826,1970-01-01,2.0,2.0,84835.0,0,0,2014-06-16,0,189,...,4,6,1,1,0,199 days,6.633333,16436 days,45.030137,1190
1,1,1961-01-01,2.0,4.0,57091.0,0,0,2014-06-15,0,464,...,3,7,5,1,0,200 days,6.666667,19723 days,54.035616,577
2,10476,1958-01-01,2.0,3.0,67267.0,0,1,2014-05-13,0,134,...,2,5,2,0,0,233 days,7.766667,20819 days,57.038356,251
3,1386,1967-01-01,2.0,5.0,32474.0,1,1,2014-11-05,0,10,...,0,2,7,0,0,57 days,1.900000,17532 days,48.032877,11
4,5371,1989-01-01,2.0,4.0,21474.0,1,0,2014-08-04,0,6,...,1,2,7,1,0,150 days,5.000000,9496 days,26.016438,91
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2235,10142,1976-01-01,4.0,2.0,66476.0,0,1,2013-07-03,99,372,...,2,11,4,0,0,547 days,18.233333,14245 days,39.027397,689
2236,5263,1977-01-01,0.0,3.0,31056.0,1,0,2013-01-22,99,5,...,0,3,8,0,0,709 days,23.633333,13879 days,38.024658,55
2237,22,1976-01-01,2.0,2.0,46310.0,1,0,2012-03-12,99,185,...,1,5,8,0,0,1025 days,34.166667,14245 days,39.027397,309
2238,528,1978-01-01,2.0,3.0,65819.0,0,0,2012-11-29,99,267,...,4,10,3,0,0,763 days,25.433333,13514 days,37.024658,1383
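By default `OrdinalEncoder` numbers the categories in sorted order, which is why `2n Cycle` becomes 0.0 and `PhD` becomes 4.0 in the output above. If the codes should follow a meaningful order instead, the order can be stated explicitly. The sketch below uses `pandas.Categorical` on toy data; the ordering of education levels, and the `Basic` and `Master` labels, are assumptions that should be checked against the values actually present in the column:

```python
import pandas as pd

# Assumed ordering from lowest to highest level (verify against the real categories)
education_order = ["Basic", "2n Cycle", "Graduation", "Master", "PhD"]

toy = pd.DataFrame({"Education": ["PhD", "2n Cycle", "Graduation", "PhD"]})

# Ordered categorical: integer codes follow the position in education_order
education_cat = pd.Categorical(toy["Education"], categories=education_order, ordered=True)
toy["Education_code"] = education_cat.codes

# Lookup table to translate codes back into labels when reading a plot
code_to_label = dict(enumerate(education_cat.categories))

print(toy)
print(code_to_label)
```

Keeping the code-to-label mapping around makes the numeric axes of later plots easier to interpret. `OrdinalEncoder` accepts a comparable explicit ordering through its `categories` parameter.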


In [26]:
SuperStore.groupby('Response').count()

Unnamed: 0_level_0,Id,Year_Birth,Education,Marital_Status,Income,Kidhome,Teenhome,Dt_Customer,Recency,MntWines,...,NumWebPurchases,NumCatalogPurchases,NumStorePurchases,NumWebVisitsMonth,Complain,TimeWithUs,MonthsWithUs,Age,AgeYears,TotalSpent
Response,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0,1903,1903,1903,1903,1880,1903,1903,1903,1903,1903,...,1903,1903,1903,1903,1903,1903,1903,1903,1903,1903
1,334,334,334,334,333,334,334,334,334,334,...,334,334,334,334,334,334,334,334,334,334
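The counts above show that the two classes are imbalanced: 1903 customers did not respond and 334 did. The short calculation below turns those counts into a response rate and into the accuracy a trivial "always predict 0" rule would reach, which is a useful baseline for any later classifier:

```python
# Counts taken from the groupby output above
no_response = 1903
positive_response = 334

total = no_response + positive_response

# Share of customers who accepted the offer
response_rate = positive_response / total

# Accuracy obtained by always predicting "no response"
baseline_accuracy = no_response / total

print(f"Customers: {total}")
print(f"Response rate: {response_rate:.1%}")
print(f"Accuracy of always predicting 0: {baseline_accuracy:.1%}")
```

With a baseline this high, accuracy alone says little about a model, which is one reason to look at precision and recall as well.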


In [27]:
SuperStore.columns

Index(['Id', 'Year_Birth', 'Education', 'Marital_Status', 'Income', 'Kidhome',
       'Teenhome', 'Dt_Customer', 'Recency', 'MntWines', 'MntFruits',
       'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts',
       'MntGoldProds', 'NumDealsPurchases', 'NumWebPurchases',
       'NumCatalogPurchases', 'NumStorePurchases', 'NumWebVisitsMonth',
       'Response', 'Complain', 'TimeWithUs', 'MonthsWithUs', 'Age', 'AgeYears',
       'TotalSpent'],
      dtype='object')

In [32]:
# prompt: rotate categories plotly parallel coordinates plot

# Parallel coordinates need numeric axes, so the datetime/timedelta columns
# ('Dt_Customer', 'TimeWithUs', 'Age') are left out in favour of their numeric
# counterparts 'MonthsWithUs' and 'AgeYears'
fig = px.parallel_coordinates(SuperStore, color="Response",

                              dimensions=['Education', 'Marital_Status', 'Income', 'Kidhome',
                                          'Teenhome', 'Recency', 'NumDealsPurchases', 'NumWebPurchases',
                                          'NumCatalogPurchases', 'NumStorePurchases', 'NumWebVisitsMonth',
                                          'MonthsWithUs', 'AgeYears', 'TotalSpent'],

                              labels={'Education':'Education',
                                      'Marital_Status':'Marital Status',
                                      'Income':'Income',
                                      'Kidhome':'Kids at home',
                                      'Teenhome':'Teens at home',
                                      'Recency':'Recency',

                                      'NumDealsPurchases':'Deal Purchases',
                                      'NumWebPurchases':'Web Purchases',
                                      'NumCatalogPurchases':'Catalog Purchases',
                                      'NumStorePurchases':'Store Purchases',
                                      'NumWebVisitsMonth':'Web Visits',

                                      'MonthsWithUs':'Months With Us',
                                      'AgeYears':'Age (years)',
                                      'TotalSpent':'Total Spent'},
                              )

# A parallel-coordinates figure has no Cartesian x/y axes, so xaxis/yaxis titles and
# xaxis_tickangle are not expected to change it; the dimension labels are rotated
# through the trace's labelangle attribute instead
fig.update_traces(labelangle=-45)

fig.show()
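Parallel coordinates need numeric axes, so datetime and timedelta columns such as `Dt_Customer`, `TimeWithUs` and `Age` have to be converted to numbers before they can appear as dimensions. A minimal sketch of that conversion on toy rows; `DaysWithUs` is a hypothetical column name introduced here:

```python
import pandas as pd

# Toy rows with one datetime, one timedelta and one numeric column
toy = pd.DataFrame({
    "Dt_Customer": pd.to_datetime(["2014-06-16", "2013-07-03"]),
    "TimeWithUs": pd.to_timedelta([199, 547], unit="D"),
    "Income": [84835.0, 66476.0],
})

# Timedelta to whole days (hypothetical column name)
toy["DaysWithUs"] = toy["TimeWithUs"].dt.days

# Keep only candidate dimensions whose dtype is integer, unsigned or float
candidates = ["Dt_Customer", "TimeWithUs", "Income", "DaysWithUs"]
numeric_dims = [c for c in candidates if toy[c].dtype.kind in "iuf"]

print(numeric_dims)
```

The resulting list could be passed as `dimensions` to `px.parallel_coordinates`, so that every requested axis is numeric and nothing is left out unexpectedly.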