# Starbucks: Analyze-a-Coffee

## Business Questions

This analysis examines how Starbucks’ customers respond to an offer, whether it is a BOGO (buy one, get one) or a Discount offer. Not all customers have the same incentive to view an offer and then make a transaction that completes it. Many factors influence purchasing decisions; for instance, some customers prefer offers that let them collect more stars toward exclusive perks or even free products. Customers in one age group may prefer a different offer than customers in another. We should also keep in mind that female customers may react to an offer differently than male customers do. Answering such questions would help Starbucks target its customers and personalize the offers it sends to each audience. Here are the questions we are going to investigate:

1. How many customers received at least one offer?
2. Who usually spends more at Starbucks, female or male customers?
3. Among the customers who spend more, who earns a higher income per year?
4. How old are most Starbucks customers, by gender?
5. How much do customers spend at any given time after the start of an offer?
6. Can we find the most popular offer for an age group or a gender, then compare it to other offers or to another age group?
7. Which offer has made the most for Starbucks? Is there a difference between BOGO offers and Discount offers? If so, do male customers react the same way as female customers to either offer type?

In [1]:
# importing libraries
import pandas as pd
import numpy as np
import json
import matplotlib.pyplot as plt

# magic word for producing visualizations in notebook
%matplotlib inline

import plotly.plotly as py #for creating interactive data visualizations
import plotly.graph_objs as go
import plotly.tools as tls
py.sign_in('salitr', 'YOUR_API_KEY') #API key has been removed for security; supply your own
from plotly.offline import download_plotlyjs,init_notebook_mode,plot,iplot #to work with data visualization offline
init_notebook_mode(connected=True)
import cufflinks as cf #connects Plotly with pandas to produce the interactive data visualizations
cf.go_offline()

from IPython.display import Image

In [4]:
# reading the saved files from PART 1

profile_clean = pd.read_csv('profile_clean.csv', sep=';')
transactions = pd.read_csv('transactions.csv', sep=';')
offers = pd.read_csv('offers.csv', sep=';')
offers_comparsion = pd.read_csv('offers_comparsion.csv', sep=';')

### 2.2 Part 2


In [7]:
# merging the four datasets (profile, transactions, offers, offers comparison) to have a complete, clean dataset

full_data = pd.merge(pd.merge(pd.merge(profile_clean, 
                                  transactions, on='customer_id'), 
                         offers, on='customer_id'), 
                offers_comparsion, on='offer_id')

In [8]:
# looking at the full dataset

full_data

Unnamed: 0,customer_id,age,gender,income,membership_start,membership_period,transaction_amount,time(hours)_x,event,offer_id,...,offer_completion(%),completed(not_viewed),offer_type,difficulty($),duration(hours),reward,email,mobile,social,web
0,0610b486422d4921ae7d2bf64640c50b,55,F,112000.0,2017-07-15,2017-07,21.51,18,offer received,9b98b8c7a33c4b65b9aebfe6a799e6d9,...,62.647719,689,bogo,5,168,5,1,1,0,1
1,0610b486422d4921ae7d2bf64640c50b,55,F,112000.0,2017-07-15,2017-07,21.51,18,offer completed,9b98b8c7a33c4b65b9aebfe6a799e6d9,...,62.647719,689,bogo,5,168,5,1,1,0,1
2,0610b486422d4921ae7d2bf64640c50b,55,F,112000.0,2017-07-15,2017-07,32.28,144,offer received,9b98b8c7a33c4b65b9aebfe6a799e6d9,...,62.647719,689,bogo,5,168,5,1,1,0,1
3,0610b486422d4921ae7d2bf64640c50b,55,F,112000.0,2017-07-15,2017-07,32.28,144,offer completed,9b98b8c7a33c4b65b9aebfe6a799e6d9,...,62.647719,689,bogo,5,168,5,1,1,0,1
4,0610b486422d4921ae7d2bf64640c50b,55,F,112000.0,2017-07-15,2017-07,23.22,528,offer received,9b98b8c7a33c4b65b9aebfe6a799e6d9,...,62.647719,689,bogo,5,168,5,1,1,0,1
5,0610b486422d4921ae7d2bf64640c50b,55,F,112000.0,2017-07-15,2017-07,23.22,528,offer completed,9b98b8c7a33c4b65b9aebfe6a799e6d9,...,62.647719,689,bogo,5,168,5,1,1,0,1
6,78afa995795e4d85b5d9ceeca43f5fef,75,F,100000.0,2017-05-09,2017-05,19.89,132,offer received,9b98b8c7a33c4b65b9aebfe6a799e6d9,...,62.647719,689,bogo,5,168,5,1,1,0,1
7,78afa995795e4d85b5d9ceeca43f5fef,75,F,100000.0,2017-05-09,2017-05,19.89,132,offer viewed,9b98b8c7a33c4b65b9aebfe6a799e6d9,...,62.647719,689,bogo,5,168,5,1,1,0,1
8,78afa995795e4d85b5d9ceeca43f5fef,75,F,100000.0,2017-05-09,2017-05,19.89,132,offer completed,9b98b8c7a33c4b65b9aebfe6a799e6d9,...,62.647719,689,bogo,5,168,5,1,1,0,1
9,78afa995795e4d85b5d9ceeca43f5fef,75,F,100000.0,2017-05-09,2017-05,17.78,144,offer received,9b98b8c7a33c4b65b9aebfe6a799e6d9,...,62.647719,689,bogo,5,168,5,1,1,0,1


In [9]:
# renaming the transaction time column to distinguish it from the offer time column
full_data.rename(columns={'time(hours)_x': 'time_of_transaction(hours)'}, inplace=True)

In [10]:
# a function that re-encodes an id column as sequential integers that are easier to use and communicate

def id_mapper(df, column):
    coded_dict = dict()
    cter = 1
    id_encoded = []
    
    for val in df[column]:
        if val not in coded_dict:
            coded_dict[val] = cter
            cter+=1
        
        id_encoded.append(coded_dict[val])
    return id_encoded

cid_encoded = id_mapper(full_data, 'customer_id')
full_data['customer_id'] = cid_encoded

oid_encoded = id_mapper(full_data, 'offer_id')
full_data['offer_id'] = oid_encoded

To make the customer and offer ids easier to communicate, they have been mapped to sequential integers instead of hashes or codes.
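The same first-appearance numbering can be sketched with `pd.factorize`, shown here as an alternative to the loop above rather than the notebook's own method. The id strings are made up for illustration.

```python
import pandas as pd

# Toy stand-in for a hashed id column (values are made up for illustration).
ids = pd.Series(["b7f", "a1c", "b7f", "e9d", "a1c"])

# factorize assigns 0, 1, 2, ... in order of first appearance;
# adding 1 gives 1-based ids, like the loop-based id_mapper.
codes, uniques = pd.factorize(ids)
encoded = codes + 1

print(encoded.tolist())  # [1, 2, 1, 3, 2]
print(list(uniques))     # ['b7f', 'a1c', 'e9d']
```

Keeping `uniques` around makes it possible to map an integer id back to its original hash when needed.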

In [11]:
# looking at the full dataset after recreating the ids

full_data

Unnamed: 0,customer_id,age,gender,income,membership_start,membership_period,transaction_amount,time_of_transaction(hours),event,offer_id,...,offer_completion(%),completed(not_viewed),offer_type,difficulty($),duration(hours),reward,email,mobile,social,web
0,1,55,F,112000.0,2017-07-15,2017-07,21.51,18,offer received,1,...,62.647719,689,bogo,5,168,5,1,1,0,1
1,1,55,F,112000.0,2017-07-15,2017-07,21.51,18,offer completed,1,...,62.647719,689,bogo,5,168,5,1,1,0,1
2,1,55,F,112000.0,2017-07-15,2017-07,32.28,144,offer received,1,...,62.647719,689,bogo,5,168,5,1,1,0,1
3,1,55,F,112000.0,2017-07-15,2017-07,32.28,144,offer completed,1,...,62.647719,689,bogo,5,168,5,1,1,0,1
4,1,55,F,112000.0,2017-07-15,2017-07,23.22,528,offer received,1,...,62.647719,689,bogo,5,168,5,1,1,0,1
5,1,55,F,112000.0,2017-07-15,2017-07,23.22,528,offer completed,1,...,62.647719,689,bogo,5,168,5,1,1,0,1
6,2,75,F,100000.0,2017-05-09,2017-05,19.89,132,offer received,1,...,62.647719,689,bogo,5,168,5,1,1,0,1
7,2,75,F,100000.0,2017-05-09,2017-05,19.89,132,offer viewed,1,...,62.647719,689,bogo,5,168,5,1,1,0,1
8,2,75,F,100000.0,2017-05-09,2017-05,19.89,132,offer completed,1,...,62.647719,689,bogo,5,168,5,1,1,0,1
9,2,75,F,100000.0,2017-05-09,2017-05,17.78,144,offer received,1,...,62.647719,689,bogo,5,168,5,1,1,0,1


In [12]:
# making sure the dataset contains only the transactions that appear before the end of an offer

full_data = full_data[full_data['time_of_transaction(hours)'] <= full_data['duration(hours)']]
full_data

Unnamed: 0,customer_id,age,gender,income,membership_start,membership_period,transaction_amount,time_of_transaction(hours),event,offer_id,...,offer_completion(%),completed(not_viewed),offer_type,difficulty($),duration(hours),reward,email,mobile,social,web
0,1,55,F,112000.0,2017-07-15,2017-07,21.51,18,offer received,1,...,62.647719,689,bogo,5,168,5,1,1,0,1
1,1,55,F,112000.0,2017-07-15,2017-07,21.51,18,offer completed,1,...,62.647719,689,bogo,5,168,5,1,1,0,1
2,1,55,F,112000.0,2017-07-15,2017-07,32.28,144,offer received,1,...,62.647719,689,bogo,5,168,5,1,1,0,1
3,1,55,F,112000.0,2017-07-15,2017-07,32.28,144,offer completed,1,...,62.647719,689,bogo,5,168,5,1,1,0,1
6,2,75,F,100000.0,2017-05-09,2017-05,19.89,132,offer received,1,...,62.647719,689,bogo,5,168,5,1,1,0,1
7,2,75,F,100000.0,2017-05-09,2017-05,19.89,132,offer viewed,1,...,62.647719,689,bogo,5,168,5,1,1,0,1
8,2,75,F,100000.0,2017-05-09,2017-05,19.89,132,offer completed,1,...,62.647719,689,bogo,5,168,5,1,1,0,1
9,2,75,F,100000.0,2017-05-09,2017-05,17.78,144,offer received,1,...,62.647719,689,bogo,5,168,5,1,1,0,1
10,2,75,F,100000.0,2017-05-09,2017-05,17.78,144,offer viewed,1,...,62.647719,689,bogo,5,168,5,1,1,0,1
11,2,75,F,100000.0,2017-05-09,2017-05,17.78,144,offer completed,1,...,62.647719,689,bogo,5,168,5,1,1,0,1


In [13]:
print(f"The number of transactions in the full data: {full_data.shape[0]}")
print(f"The number of variables: {full_data.shape[1]}")

The number of transactions in the full data: 238356
The number of variables: 25


As mentioned earlier, our goal is to have a dataset that contains only viewed and completed offers where the transactions made by any customer took place before the end of the offer's duration. That is, the code above ensures that transactions made after the end of an offer are not included in the final, clean data.
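To illustrate the filter, here is a minimal sketch on made-up rows that reuse the notebook's column names. Like the cell above, it assumes the transaction time and the offer duration are measured from the same starting point.

```python
import pandas as pd

# Toy rows; column names follow the notebook, values are made up.
toy = pd.DataFrame({
    "transaction_amount": [21.51, 32.28, 23.22, 5.63],
    "time_of_transaction(hours)": [18, 144, 528, 120],
    "duration(hours)": [168, 168, 168, 120],
})

# Comparing two columns row by row yields a boolean mask.
within_offer = toy["time_of_transaction(hours)"] <= toy["duration(hours)"]
kept = toy[within_offer]

print(kept["time_of_transaction(hours)"].tolist())  # [18, 144, 120]
```

The row at hour 528 is dropped because it falls after the 168-hour duration. The row at hour 120 is kept because the comparison is inclusive.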

In [14]:
# showing the number of transactions by offer_type for the final and clean data

full_data['offer_type'].value_counts()

discount    139360
bogo         98996
Name: offer_type, dtype: int64

We discussed the popularity of the 10 offers earlier and found that two discount offers were the most popular. Here, we can see that most transactions were made for discount offers.
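As a quick worked check, the two counts printed above can be turned into percentage shares. This sketch simply re-enters those two numbers by hand.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
counts = pd.Series({"discount": 139360, "bogo": 98996}, name="offer_type")

# Share of each offer type as a percentage of all rows.
shares = (counts / counts.sum() * 100).round(1)

print(int(counts.sum()))   # 238356, matching the row count reported earlier
print(shares.to_dict())    # {'discount': 58.5, 'bogo': 41.5}
```

On the real data, `full_data['offer_type'].value_counts(normalize=True)` returns the same shares as fractions.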

In [15]:
# functions to return subsets of full_data filtered by the value of a column

def full_dataset(df, column, ev):
    # keep only the rows where `column` equals `ev`
    return df[df[column] == ev]

def offer_dataset(df, offer_num):
    # keep only the rows that belong to the given offer_id
    return full_dataset(df, 'offer_id', offer_num)


#### <font color=blue> 2.2.1 Events</font>

In [16]:
# creating datasets of each event for further analysis

df_received = full_dataset(full_data, 'event', 'offer received')
df_viewed = full_dataset(full_data, 'event', 'offer viewed')
df_completed = full_dataset(full_data, 'event', 'offer completed')

In [17]:
print(f"The number of received offers: {df_received.shape[0]}")

df_received.sample(3)

The number of received offers: 96626


Unnamed: 0,customer_id,age,gender,income,membership_start,membership_period,transaction_amount,time_of_transaction(hours),event,offer_id,...,offer_completion(%),completed(not_viewed),offer_type,difficulty($),duration(hours),reward,email,mobile,social,web
686193,8175,84,F,64000.0,2016-10-24,2016-10,6.93,138,offer received,5,...,75.210463,-1404,discount,10,240,2,1,1,1,1
1122175,4661,40,M,63000.0,2015-11-09,2015-11,1.65,12,offer received,8,...,50.204763,-3019,bogo,10,120,10,1,1,1,1
648439,13210,34,M,42000.0,2017-12-18,2017-12,2.83,102,offer received,5,...,75.210463,-1404,discount,10,240,2,1,1,1,1


In [18]:
# overview of the transactions where the customers viewed an offer

print(f"The number of viewed offers: {df_viewed.shape[0]}")

df_viewed

The number of viewed offers: 73103


Unnamed: 0,customer_id,age,gender,income,membership_start,membership_period,transaction_amount,time_of_transaction(hours),event,offer_id,...,offer_completion(%),completed(not_viewed),offer_type,difficulty($),duration(hours),reward,email,mobile,social,web
7,2,75,F,100000.0,2017-05-09,2017-05,19.89,132,offer viewed,1,...,62.647719,689,bogo,5,168,5,1,1,0,1
10,2,75,F,100000.0,2017-05-09,2017-05,17.78,144,offer viewed,1,...,62.647719,689,bogo,5,168,5,1,1,0,1
37,4,65,M,53000.0,2018-02-09,2018-02,9.54,60,offer viewed,1,...,62.647719,689,bogo,5,168,5,1,1,0,1
39,4,65,M,53000.0,2018-02-09,2018-02,9.54,60,offer viewed,1,...,62.647719,689,bogo,5,168,5,1,1,0,1
77,7,56,F,88000.0,2018-04-28,2018-04,19.91,162,offer viewed,1,...,62.647719,689,bogo,5,168,5,1,1,0,1
79,7,56,F,88000.0,2018-04-28,2018-04,19.91,162,offer viewed,1,...,62.647719,689,bogo,5,168,5,1,1,0,1
99,9,59,M,41000.0,2015-01-21,2015-01,0.67,60,offer viewed,1,...,62.647719,689,bogo,5,168,5,1,1,0,1
151,12,40,M,33000.0,2016-07-09,2016-07,5.47,12,offer viewed,1,...,62.647719,689,bogo,5,168,5,1,1,0,1
156,12,40,M,33000.0,2016-07-09,2016-07,6.18,30,offer viewed,1,...,62.647719,689,bogo,5,168,5,1,1,0,1
161,12,40,M,33000.0,2016-07-09,2016-07,1.54,72,offer viewed,1,...,62.647719,689,bogo,5,168,5,1,1,0,1


In [19]:
# overview of the transactions where the customers completed an offer

print(f"The number of completed offers: {df_completed.shape[0]}")

df_completed

The number of completed offers: 68627


Unnamed: 0,customer_id,age,gender,income,membership_start,membership_period,transaction_amount,time_of_transaction(hours),event,offer_id,...,offer_completion(%),completed(not_viewed),offer_type,difficulty($),duration(hours),reward,email,mobile,social,web
1,1,55,F,112000.0,2017-07-15,2017-07,21.51,18,offer completed,1,...,62.647719,689,bogo,5,168,5,1,1,0,1
3,1,55,F,112000.0,2017-07-15,2017-07,32.28,144,offer completed,1,...,62.647719,689,bogo,5,168,5,1,1,0,1
8,2,75,F,100000.0,2017-05-09,2017-05,19.89,132,offer completed,1,...,62.647719,689,bogo,5,168,5,1,1,0,1
11,2,75,F,100000.0,2017-05-09,2017-05,17.78,144,offer completed,1,...,62.647719,689,bogo,5,168,5,1,1,0,1
40,4,65,M,53000.0,2018-02-09,2018-02,9.54,60,offer completed,1,...,62.647719,689,bogo,5,168,5,1,1,0,1
53,5,57,M,42000.0,2017-12-31,2017-12,4.33,42,offer completed,1,...,62.647719,689,bogo,5,168,5,1,1,0,1
142,11,96,F,89000.0,2017-11-17,2017-11,12.03,132,offer completed,1,...,62.647719,689,bogo,5,168,5,1,1,0,1
150,12,40,M,33000.0,2016-07-09,2016-07,5.47,12,offer completed,1,...,62.647719,689,bogo,5,168,5,1,1,0,1
155,12,40,M,33000.0,2016-07-09,2016-07,6.18,30,offer completed,1,...,62.647719,689,bogo,5,168,5,1,1,0,1
160,12,40,M,33000.0,2016-07-09,2016-07,1.54,72,offer completed,1,...,62.647719,689,bogo,5,168,5,1,1,0,1


After writing some functions that return the full data filtered by a specific variable, we will next look at the top customers: those who completed any offer and paid the largest total amount.
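The ranking used next follows a deduplicate-then-aggregate pattern. Here is a minimal sketch of it on made-up rows, using the notebook's column names.

```python
import pandas as pd

# Toy "offer completed" rows; customer 1 and customer 2 each have a repeated row.
toy = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "offer_id": [1, 1, 1, 1, 1],
    "time_of_transaction(hours)": [18, 18, 144, 132, 132],
    "transaction_amount": [21.51, 21.51, 32.28, 19.89, 19.89],
})

# A row counts as a duplicate when customer, time and offer all match.
key = ["customer_id", "time_of_transaction(hours)", "offer_id"]
deduped = toy.drop_duplicates(subset=key, keep="first")

# Sum what each customer paid, largest total first.
totals = (deduped.groupby("customer_id")["transaction_amount"]
          .sum()
          .sort_values(ascending=False))

print(totals.round(2).to_dict())  # {1: 53.79, 2: 19.89}
```

Because `offer_id` is part of the key, a transaction matched to two different offers still counts once per offer.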

In [20]:
# making sure we have only unique_customers even if the customer made transaction more than once for an offer
unique_customers = df_completed.drop_duplicates(subset=['customer_id', 'time_of_transaction(hours)', 'offer_id'], keep='first')

# finding the top customers based on the sum of their transaction_amount of all offers completed 

top_customers = unique_customers.groupby('customer_id')['transaction_amount'].sum().sort_values(ascending=False).head(10)
top_customers

customer_id
8658     5056.42
4307     3343.89
12886    3127.56
2896     3030.39
6759     2955.24
10196    2870.32
11936    2857.77
6307     2833.60
1828     2608.13
10487    2427.42
Name: transaction_amount, dtype: float64

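It looks like some customers spent thousands of dollars at Starbucks. Keep in mind that a transaction is counted once for every offer it is matched to in the merged data, so these totals can overstate what a customer actually paid. Next, we will see the total amount of money made for each offer.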

In [21]:
# finding the top offers based on the sum of the transaction_amount made by all customers who completed the offer

top_offers = unique_customers.groupby('offer_id')['transaction_amount'].sum().sort_values(ascending=False).head(10)
top_offers

offer_id
5    194632.34
7    149296.90
6    122381.83
2    110113.80
1    107063.15
4    106886.66
3     84649.89
8     77249.56
Name: transaction_amount, dtype: float64

Offers 5 and 7 have the largest total transaction amounts across all customers.
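For reference, the two leading offers can be read off programmatically. This sketch re-enters the totals printed above by hand and uses `Series.nlargest`.

```python
import pandas as pd

# Totals copied from the top_offers output above.
top_offers = pd.Series(
    {5: 194632.34, 7: 149296.90, 6: 122381.83, 2: 110113.80,
     1: 107063.15, 4: 106886.66, 3: 84649.89, 8: 77249.56},
    name="transaction_amount",
)

# The two offers with the largest totals, in descending order.
leaders = top_offers.nlargest(2)
print(leaders.index.tolist())  # [5, 7]
```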

In [22]:
# finding the average transaction_amount over all transactions and offers where customers completed the offers

df_completed['transaction_amount'].mean()

16.55074431346271

In [23]:
# creating a dataframe grouped by the time when each transaction takes a place from the start of an offer
# showing a description of the amount spent by that time

transcation_by_time = df_completed.groupby('time_of_transaction(hours)').describe()['transaction_amount'].reset_index()
transcation_by_time = transcation_by_time.drop(['std', '25%', '75%'], axis=1)
transcation_by_time

Unnamed: 0,time_of_transaction(hours),count,mean,min,50%,max
0,0,1421.0,15.365721,0.05,14.07,195.24
1,6,1854.0,16.203981,0.07,14.44,448.97
2,12,1972.0,21.03321,0.06,15.86,871.51
3,18,2127.0,20.125129,0.05,14.04,962.1
4,24,2219.0,15.679117,0.07,13.95,674.48
5,30,2430.0,15.904683,0.05,13.6,575.23
6,36,2353.0,17.465168,0.05,14.16,947.43
7,42,2485.0,16.747191,0.05,13.47,657.26
8,48,2473.0,16.036284,0.09,14.15,475.2
9,54,2391.0,16.191409,0.08,14.1,845.01


In [24]:
# a function that filters the data by a column value and then returns a description of all transactions grouped by their time

def transcation_by_time(df, col, target):
    transcations = df[df[col] == target]
    transcations = transcations.groupby('time_of_transaction(hours)').describe()['transaction_amount'].reset_index()
    transcations = transcations.drop(['std', '25%', '75%'], axis=1)
    return transcations

In [25]:
# plotting a trend that shows the average Amount Spent since a Start of an Offer at a specific time

f_transcations = transcation_by_time(df_completed, 'gender', 'F')
m_transcations = transcation_by_time(df_completed, 'gender', 'M')

trace_mean1 = go.Scatter(
    x=f_transcations['time_of_transaction(hours)'],
    y=f_transcations['mean'],
    name = "Female",
    line = dict(color = 'pink'),
    opacity = 0.6)

trace_mean2 = go.Scatter(
    x=m_transcations['time_of_transaction(hours)'],
    y=m_transcations['mean'],
    name = "Male",
    line = dict(color = 'cornflowerblue'),
    opacity = 0.6)

data1 = [trace_mean1, trace_mean2]

layout = {
    'title': 'Average Amount Spent Since a Start of an Offer Trend',
    'xaxis': {'title': 'Time Since a Start of an Offer (hours)'},
    'yaxis': {'title': 'Average Amount ($)', 
              "range": [
                10,
                30
            ]},
               
    'shapes': [
        # Line Horizontal, average
        {
            'type': 'line',
            'x0': 0,
            'y0': 16.55,
            'x1': 243,
            'y1': 16.55,
            'line': {
                'color': 'black',
                'width': 1,
                'dash': 'dashdot',
            }
        },
        
        # 1st highlight above average amount
        {
            'type': 'rect',
            # x-reference is assigned to the plot paper [0,1]
            'xref': 'paper',
            # y-reference is assigned to the y-values
            'yref': 'y',
            'x0': 0,
            'y0': 16.55,
            'x1': 1,
            'y1': 30,
            'fillcolor': 'olive',
            'opacity': 0.1,
            'line': {
                'width': 0,
            }
        },
        
        # 2nd highlight below average amount
        {
            'type': 'rect',
            'xref': 'paper',
            'yref': 'y',
            'x0': 0,
            'y0': 16.55,
            'x1': 1,
            'y1': 0,
            'fillcolor': 'tomato',
            'opacity': 0.1,
            'line': {
                'width': 0,
            }
        }
    ]
}

# a plain dict is used for the annotation because go.Annotation is deprecated
layout.update(dict(annotations=[dict(text="Overall Average Amount ($16.55) Spent After the Start of an Offer", 
                                     x=150, 
                                     y=16.55,
                                     ax=10, 
                                     ay=-120)]))
        
fig = dict(data=data1, layout=layout)
py.iplot(fig, filename = "Amount Spent since a Start of an Offer Trend")



The trend of transactions made since the start of an offer shows that, on average, female customers paid more than male customers at any time from the beginning of an offer up to 10 days. Returning to our concern in the first chart: even though the number of male customers is higher, female customers paid more than male customers did. The only exception is transactions made more than 228 hours after the start of an offer. On average, customers paid $16.55 per transaction at any time since an offer started. We can see that female customers paid more than the average at any time, whereas male customers paid less than the average most of the time. We can also observe some peaks over time for both genders where customers, on average, paid more than usual. These could fall on weekends or at specific times of day.
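One way to probe those peaks is to aggregate the hourly series by day since the offer started. The sketch below does this on made-up rows. The `day` column is a hypothetical helper, and mapping it to actual weekdays would need a calendar start date, which the data shown here does not include.

```python
import pandas as pd

# Toy completed-offer rows; column names follow the notebook, values are made up.
toy = pd.DataFrame({
    "gender": ["F", "F", "M", "M", "F", "M"],
    "time_of_transaction(hours)": [0, 18, 6, 30, 30, 54],
    "transaction_amount": [20.0, 24.0, 12.0, 14.0, 18.0, 10.0],
})

# Integer-divide hours by 24 to get the day index since the offer started.
toy["day"] = toy["time_of_transaction(hours)"] // 24

# Mean amount per day and gender, with one column per gender.
daily_mean = (toy.groupby(["day", "gender"])["transaction_amount"]
              .mean()
              .unstack("gender"))
print(daily_mean)
```

Days on which a gender has no transactions show up as `NaN` rather than zero, which keeps missing data from being mistaken for low spending.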


#### <font color=blue> 2.2.2 Offer Type</font>

In [26]:
# creating datasets of each offer_type for further analysis

df_bogo = full_dataset(full_data, 'offer_type', 'bogo')
df_discount = full_dataset(full_data, 'offer_type', 'discount')

In [27]:
# overview of the transactions where the offer is BOGO

df_bogo.sample(3)

Unnamed: 0,customer_id,age,gender,income,membership_start,membership_period,transaction_amount,time_of_transaction(hours),event,offer_id,...,offer_completion(%),completed(not_viewed),offer_type,difficulty($),duration(hours),reward,email,mobile,social,web
33058,1432,74,F,102000.0,2016-02-26,2016-02,23.36,162,offer completed,1,...,62.647719,689,bogo,5,168,5,1,1,0,1
1063041,11885,26,M,63000.0,2016-06-09,2016-06,12.33,6,offer received,8,...,50.204763,-3019,bogo,10,120,10,1,1,1,1
1053085,9892,38,M,44000.0,2015-09-17,2015-09,5.63,30,offer received,8,...,50.204763,-3019,bogo,10,120,10,1,1,1,1


In [28]:
df_discount.sample(3)

Unnamed: 0,customer_id,age,gender,income,membership_start,membership_period,transaction_amount,time_of_transaction(hours),event,offer_id,...,offer_completion(%),completed(not_viewed),offer_type,difficulty($),duration(hours),reward,email,mobile,social,web
860656,4328,22,M,32000.0,2015-12-24,2015-12,1.63,120,offer viewed,6,...,73.418482,-1493,discount,7,168,3,1,1,1,1
746562,658,67,M,42000.0,2017-10-27,2017-10,0.5,66,offer received,6,...,73.418482,-1493,discount,7,168,3,1,1,1,1
802155,11895,58,M,93000.0,2016-07-29,2016-07,19.79,18,offer viewed,6,...,73.418482,-1493,discount,7,168,3,1,1,1,1


In [29]:
# function to return top customers or offers for a specific offer type
# based on the sum of their transaction_amount of all offers completed 

def tops(df, col):
    # making sure we have only unique_customers who completed an offer even if the customer made transaction more than once for that offer
    completed_offers = df[df['event'] == 'offer completed']
    unique_customers = completed_offers.drop_duplicates(subset=['customer_id', 'time_of_transaction(hours)', 'offer_id'], keep='first')

    # finding the tops based on the sum of their transaction_amount of all offers completed for that specific offer type 

    top1 = unique_customers.groupby(col)['transaction_amount'].sum().sort_values(ascending=False).head(10)
    
    return top1

def tops_by_gender(df):
    # making sure we have only unique_customers who completed an offer even if the customer made transaction more than once for that offer
    completed_offers = df[df['event'] == 'offer completed']
    unique_customers = completed_offers.drop_duplicates(subset=['customer_id', 'time_of_transaction(hours)', 'offer_id'], keep='first')

    # finding the amount of all completed offer by gender

    amount = unique_customers.groupby(['offer_id', 'gender'])['transaction_amount'].sum()
    
    return amount

def customer_report(cid, df=df_discount):
    report = df[df.event == 'offer completed']
    report = report[report.customer_id == cid]
    
    return report

In [30]:
# finding the top customers based on the sum of their transaction_amount of all BOGO offers completed 

tops(df_bogo, 'customer_id')

customer_id
4307     1990.88
1828     1724.50
261      1573.74
10822    1515.80
2896     1507.78
621      1465.94
6307     1416.80
10080    1408.48
6624     1301.02
10487    1184.38
Name: transaction_amount, dtype: float64

As we now focus on transactions made only for BOGO offers, we can see the top 10 customers who paid the most across all BOGO offers.

In [31]:
# finding the top customers based on the sum of their transaction_amount of all discount offers completed 

tops(df_discount, 'customer_id')

customer_id
8658     4057.18
12886    3127.56
6759     2216.43
9904     2049.56
12108    2019.64
10196    1933.38
11936    1928.68
10048    1853.52
11791    1765.39
7923     1740.66
Name: transaction_amount, dtype: float64

Here, we find the top 10 customers with respect to the discount offers.

In [32]:
# reporting more info about a customer

customer_report(8658, df_discount)

Unnamed: 0,customer_id,age,gender,income,membership_start,membership_period,transaction_amount,time_of_transaction(hours),event,offer_id,...,offer_completion(%),completed(not_viewed),offer_type,difficulty($),duration(hours),reward,email,mobile,social,web
537915,8658,60,M,106000.0,2016-07-02,2016-07,24.23,30,offer completed,4,...,58.980546,451,discount,10,168,2,1,1,0,1
537918,8658,60,M,106000.0,2016-07-02,2016-07,947.43,36,offer completed,4,...,58.980546,451,discount,10,168,2,1,1,0,1
537921,8658,60,M,106000.0,2016-07-02,2016-07,27.58,120,offer completed,4,...,58.980546,451,discount,10,168,2,1,1,0,1
707286,8658,60,M,106000.0,2016-07-02,2016-07,24.23,30,offer completed,5,...,75.210463,-1404,discount,10,240,2,1,1,1,1
707289,8658,60,M,106000.0,2016-07-02,2016-07,947.43,36,offer completed,5,...,75.210463,-1404,discount,10,240,2,1,1,1,1
707292,8658,60,M,106000.0,2016-07-02,2016-07,27.58,120,offer completed,5,...,75.210463,-1404,discount,10,240,2,1,1,1,1
707295,8658,60,M,106000.0,2016-07-02,2016-07,30.11,180,offer completed,5,...,75.210463,-1404,discount,10,240,2,1,1,1,1
877717,8658,60,M,106000.0,2016-07-02,2016-07,24.23,30,offer completed,6,...,73.418482,-1493,discount,7,168,3,1,1,1,1
877720,8658,60,M,106000.0,2016-07-02,2016-07,947.43,36,offer completed,6,...,73.418482,-1493,discount,7,168,3,1,1,1,1
877723,8658,60,M,106000.0,2016-07-02,2016-07,27.58,120,offer completed,6,...,73.418482,-1493,discount,7,168,3,1,1,1,1
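The report above repeats the same amounts (24.23, 947.43, 27.58) under offers 4, 5 and 6. A minimal sketch of how a merge on `customer_id` alone can produce this, assuming one customer with three transactions and three offers (the id `c1` is made up):

```python
import pandas as pd

# Toy data: one customer with 3 transactions and 3 offers.
transactions = pd.DataFrame({
    "customer_id": ["c1"] * 3,
    "transaction_amount": [24.23, 947.43, 27.58],
})
offers = pd.DataFrame({
    "customer_id": ["c1"] * 3,
    "offer_id": [4, 5, 6],
})

# A many-to-many merge pairs every transaction with every offer.
merged = pd.merge(transactions, offers, on="customer_id")
print(len(merged))  # 9

# Summing per offer and then across offers counts each transaction 3 times.
per_offer = merged.groupby("offer_id")["transaction_amount"].sum()
actual = transactions["transaction_amount"].sum()
print(round(per_offer.sum(), 2), round(actual, 2))  # 2997.72 999.24
```

Totals computed per customer across offers should therefore be read as upper bounds. Passing `validate="one_to_many"` to `pd.merge` is one way to make such key duplication raise an error instead of passing silently.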


In [33]:
# finding the top offers based on the sum of the transaction_amount made by all customers who completed the offer

tops(df_bogo, 'offer_id')

offer_id
2    110113.80
1    107063.15
3     84649.89
8     77249.56
Name: transaction_amount, dtype: float64

As mentioned earlier, we have 4 BOGO offers and 4 Discount offers. Here we see that offer 2 had the highest total transaction amount among the BOGO offers.

In [34]:
# finding the top offers based on the sum of the transaction_amount made by all customers who completed the offer

tops(df_discount, 'offer_id')

offer_id
5    194632.34
7    149296.90
6    122381.83
4    106886.66
Name: transaction_amount, dtype: float64

Offer 5 made the most among the Discount offers, and the most overall!

In [35]:
# finding the total amount for each BOGO offer by gender 

tops_by_gender(df_bogo)

offer_id  gender
1         F         55421.53
          M         49555.68
          O          2085.94
2         F         56392.20
          M         51419.22
          O          2302.38
3         F         43190.65
          M         40197.47
          O          1261.77
8         F         42923.16
          M         33526.68
          O           799.72
Name: transaction_amount, dtype: float64

These are the total transaction amounts for each BOGO offer, broken down by customer gender. Overall, female customers paid more than male customers. Both female and male customers paid the most for offer 2.

In [36]:
# finding the total amount for each discount offer by gender 

tops_by_gender(df_discount)

offer_id  gender
4         F         56113.92
          M         48823.54
          O          1949.20
5         F         98573.24
          M         93458.66
          O          2600.44
6         F         63224.15
          M         57367.13
          O          1790.55
7         F         76970.11
          M         69990.11
          O          2336.68
Name: transaction_amount, dtype: float64

With respect to Discount offers, female customers again paid more than male customers. Both paid the most for offer 5.
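To make the female and male comparison explicit, the gender-indexed totals can be pivoted into one row per offer. This sketch re-enters the Discount totals printed above by hand. The `F_minus_M` column is a hypothetical helper, and the totals are not normalized by the number of customers in each group.

```python
import pandas as pd

# Discount totals copied from the output above, ordered by offer then gender.
totals = pd.Series(
    [56113.92, 48823.54, 1949.20,
     98573.24, 93458.66, 2600.44,
     63224.15, 57367.13, 1790.55,
     76970.11, 69990.11, 2336.68],
    index=pd.MultiIndex.from_product([[4, 5, 6, 7], ["F", "M", "O"]],
                                     names=["offer_id", "gender"]),
    name="transaction_amount",
)

# One row per offer, one column per gender.
table = totals.unstack("gender")

# Difference between female and male totals for each offer.
table["F_minus_M"] = (table["F"] - table["M"]).round(2)
print(table)
```

A positive `F_minus_M` in every row matches the statement that female customers paid more for each Discount offer.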


#### <font color=blue> 2.2.3 Gender</font>

In [37]:
# creating datasets for each gender for further analysis

df_male = full_dataset(full_data, 'gender', 'M')
df_female = full_dataset(full_data, 'gender', 'F')
df_other = full_dataset(full_data, 'gender', 'O')

In [38]:
# overview of the transactions made by male customers
df_male.sample(3)

Unnamed: 0,customer_id,age,gender,income,membership_start,membership_period,transaction_amount,time_of_transaction(hours),event,offer_id,...,offer_completion(%),completed(not_viewed),offer_type,difficulty($),duration(hours),reward,email,mobile,social,web
1016520,614,31,M,39000.0,2017-09-02,2017-09,7.05,60,offer viewed,8,...,50.204763,-3019,bogo,10,120,10,1,1,1,1
793610,9954,53,M,75000.0,2016-02-16,2016-02,13.01,162,offer viewed,6,...,73.418482,-1493,discount,7,168,3,1,1,1,1
20940,903,54,M,56000.0,2017-06-26,2017-06,16.76,60,offer viewed,1,...,62.647719,689,bogo,5,168,5,1,1,0,1


In [39]:
# overview of the transactions made by female customers

df_female.sample(3)

Unnamed: 0,customer_id,age,gender,income,membership_start,membership_period,transaction_amount,time_of_transaction(hours),event,offer_id,...,offer_completion(%),completed(not_viewed),offer_type,difficulty($),duration(hours),reward,email,mobile,social,web
53765,2292,65,F,85000.0,2015-08-02,2015-08,18.65,150,offer received,1,...,62.647719,689,bogo,5,168,5,1,1,0,1
876785,11065,28,F,57000.0,2017-10-16,2017-10,8.7,24,offer completed,6,...,73.418482,-1493,discount,7,168,3,1,1,1,1
206106,2942,26,F,69000.0,2015-10-30,2015-10,25.77,66,offer completed,2,...,54.720934,-2244,bogo,10,168,10,1,1,1,0


In [40]:
# finding the top male customers based on the sum of their transaction_amount

tops(df_male, 'customer_id')

customer_id
8658     5056.42
4307     3343.89
6307     2833.60
261      2360.61
10048    2316.90
10822    2273.70
8260     2107.81
7862     2030.46
8976     1751.50
7923     1740.66
Name: transaction_amount, dtype: float64

Earlier, we looked at the overall top customers. Here, we find the top 10 male customers based on the total amount they paid across all offers.

In [41]:
# finding the top female customers based on the sum of their transaction_amount

tops(df_female, 'customer_id')

customer_id
12886    3127.56
2896     3030.39
6759     2955.24
10196    2870.32
11936    2857.77
1828     2608.13
10487    2427.42
11791    2344.68
3342     2166.70
9904     2124.15
Name: transaction_amount, dtype: float64

The top 10 female customers.

In [42]:
# reporting more info about a customer

customer_report(9904, df_female)

Unnamed: 0,customer_id,age,gender,income,membership_start,membership_period,transaction_amount,time_of_transaction(hours),event,offer_id,...,offer_completion(%),completed(not_viewed),offer_type,difficulty($),duration(hours),reward,email,mobile,social,web
332417,9904,58,F,64000.0,2015-11-16,2015-11,14.43,6,offer completed,3,...,62.393552,-2207,bogo,5,120,5,1,1,1,1
332420,9904,58,F,64000.0,2015-11-16,2015-11,25.08,30,offer completed,3,...,62.393552,-2207,bogo,5,120,5,1,1,1,1
332423,9904,58,F,64000.0,2015-11-16,2015-11,21.71,48,offer completed,3,...,62.393552,-2207,bogo,5,120,5,1,1,1,1
332426,9904,58,F,64000.0,2015-11-16,2015-11,13.37,108,offer completed,3,...,62.393552,-2207,bogo,5,120,5,1,1,1,1
615311,9904,58,F,64000.0,2015-11-16,2015-11,14.43,6,offer completed,5,...,75.210463,-1404,discount,10,240,2,1,1,1,1
615314,9904,58,F,64000.0,2015-11-16,2015-11,25.08,30,offer completed,5,...,75.210463,-1404,discount,10,240,2,1,1,1,1
615317,9904,58,F,64000.0,2015-11-16,2015-11,21.71,48,offer completed,5,...,75.210463,-1404,discount,10,240,2,1,1,1,1
615320,9904,58,F,64000.0,2015-11-16,2015-11,13.37,108,offer completed,5,...,75.210463,-1404,discount,10,240,2,1,1,1,1
615323,9904,58,F,64000.0,2015-11-16,2015-11,11.2,132,offer completed,5,...,75.210463,-1404,discount,10,240,2,1,1,1,1
615326,9904,58,F,64000.0,2015-11-16,2015-11,570.78,138,offer completed,5,...,75.210463,-1404,discount,10,240,2,1,1,1,1


In [43]:
# finding the top offers based on the sum of the transaction_amount made by male customers who completed that offer

tops(df_male, 'offer_id')

offer_id
5    93458.66
7    69990.11
6    57367.13
2    51419.22
1    49555.68
4    48823.54
3    40197.47
8    33526.68
Name: transaction_amount, dtype: float64

In [44]:
# finding the top offers based on the sum of the transaction_amount made by female customers who completed that offer

tops(df_female, 'offer_id')

offer_id
5    98573.24
7    76970.11
6    63224.15
2    56392.20
4    56113.92
1    55421.53
3    43190.65
8    42923.16
Name: transaction_amount, dtype: float64

In [45]:
# function to return plots with respect to different distributions by gender

def age_distributions(df, df2, color, color2):
    df = df.drop_duplicates(subset='customer_id', keep='first')
    df2 = df2.drop_duplicates(subset='customer_id', keep='first')
    x = df.age
    x2 = df2.age

    trace1 = go.Histogram(
        x=x,
        name='Female',
        opacity=0.6,
        nbinsx = 7,
        marker=dict(
            color=color)
    )
    
    trace2 = go.Histogram(
        x=x2,
        name='Male',
        opacity=0.6,
        nbinsx = 7,
        marker=dict(
            color=color2)
    )

    data1 = [trace1, trace2]
    layout = go.Layout(
        barmode='stack',
        bargap=0.1,
        title = 'Age by Gender',
        xaxis=dict(
            title='Age'),
        yaxis=dict(
            title='Total number of Customers'))
    
    updatemenus = list([
    dict(active=0,
         buttons=list([   
            dict(label = 'All',
                 method = 'update',
                 args = [{'visible': [True, True]},
                         {'title': 'Age Distribution by Gender'}]),
            dict(label = 'Female',
                 method = 'update',
                 args = [{'visible': [True, False]},
                         {'title': 'Age Distribution of Female Customers'}]),
            dict(label = 'Male',
                 method = 'update',
                 args = [{'visible': [False, True]},
                         {'title': 'Age Distribution of Male Customers'}])
            ]),
        )
    ])

    layout.update(dict(updatemenus=updatemenus))

    fig = go.Figure(data=data1, layout=layout)
    plot = py.iplot(fig, filename='Age by Gender')
    
    return plot

def income_distributions(df, df2, color, color2):
    df = df.drop_duplicates(subset='customer_id', keep='first')
    df2 = df2.drop_duplicates(subset='customer_id', keep='first')
    x = df.income
    x2 = df2.income

    trace1 = go.Histogram(
        x=x,
        name='Female',
        opacity=0.6,
        nbinsx = 7,
        marker=dict(
            color=color)
    )
    
    trace2 = go.Histogram(
        x=x2,
        name='Male',
        opacity=0.6,
        nbinsx = 7,
        marker=dict(
            color=color2)
    )

    data2 = [trace1, trace2]
    layout = go.Layout(
        barmode='stack',
        bargap=0.1,
        title = 'Income by Gender',
        xaxis=dict(
            title='Income'),
        yaxis=dict(
            title='Total number of Customers'))
    
        
    updatemenus = list([
    dict(active=0,
         buttons=list([
            
            dict(label = 'All',
                 method = 'update',
                 args = [{'visible': [True, True]},
                         {'title': 'Income Distribution by Gender'}]),

            dict(label = 'Female',
                 method = 'update',
                 args = [{'visible': [True, False]},
                         {'title': 'Income Distribution of Female Customers'}]),
            dict(label = 'Male',
                 method = 'update',
                 args = [{'visible': [False, True]},
                         {'title': 'Income Distribution of Male Customers'}])

            ]),
        )
    ])

    layout.update(dict(updatemenus=updatemenus))

    fig = go.Figure(data=data2, layout=layout)
    plot = py.iplot(fig, filename='Income by Gender')
    
    return plot

# function to return an analysis of the numbers and percentage of all transaction by each event

def analysis(df1, col1, col2, col3):
    v = df1[df1['event'] == 'offer viewed']
    v = v.drop_duplicates(subset=['customer_id', 'offer_id'], keep='first')

    c = df1[df1['event'] == 'offer completed']
    c = c.drop_duplicates(subset=['customer_id', 'offer_id'], keep='first')

    received = df1.drop_duplicates(subset=['customer_id', 'offer_id'], keep='first')

    viewed = pd.Series(v.offer_id).value_counts().reset_index().rename(columns={'index': 'offer_id_v', 'offer_id': 'viewed'})
    completed = pd.Series(c.offer_id).value_counts().reset_index().rename(columns={'index': 'offer_id_c', 'offer_id': 'completed'})
    received = pd.Series(received.offer_id).value_counts().reset_index().rename(columns={'index': 'offer_id_r', 'offer_id': 'received'})


    analysis = pd.concat([received, viewed, completed], axis=1).rename(columns={'offer_id_r': 'offer_id', 'viewed': col2, 'completed': col3, 'received': col1})
    analysis[col2+'(%)'] = round(analysis[col2]/analysis[col1]*100, 2)
    analysis[col3+'(%)'] = round(analysis[col3]/analysis[col1]*100, 2)

    analysis = analysis.drop(['offer_id_c', 'offer_id_v'], axis=1)
    
    return analysis
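
One thing worth double-checking in `analysis`: after `reset_index()`, `pd.concat(..., axis=1)` aligns the three frames on their positional index, so row *i* holds the *i*-th largest received, viewed and completed counts, and these need not belong to the same offer. A minimal sketch of an alternative that aligns on `offer_id` instead (the function name `analysis_by_offer` and the toy data are hypothetical; the column names follow the ones used above):

```python
import pandas as pd

def analysis_by_offer(df, col1='received', col2='viewed', col3='completed'):
    """Count distinct customers per offer for each event, aligned on offer_id."""
    pairs = ['customer_id', 'offer_id']
    received = df.drop_duplicates(subset=pairs).groupby('offer_id').size()
    viewed = (df[df['event'] == 'offer viewed']
              .drop_duplicates(subset=pairs).groupby('offer_id').size())
    completed = (df[df['event'] == 'offer completed']
                 .drop_duplicates(subset=pairs).groupby('offer_id').size())

    # building the frame from Series aligns rows on the shared offer_id index
    out = pd.DataFrame({col1: received, col2: viewed, col3: completed})
    out = out.fillna(0).astype(int)
    out[col2 + '(%)'] = (out[col2] / out[col1] * 100).round(2)
    out[col3 + '(%)'] = (out[col3] / out[col1] * 100).round(2)
    return out.sort_values(col1, ascending=False)

# toy event log, made up for illustration only
toy = pd.DataFrame({
    'customer_id': [1, 1, 2, 2, 3, 3, 3],
    'offer_id':    [1, 1, 1, 2, 2, 2, 1],
    'event': ['offer received', 'offer completed', 'offer received',
              'offer viewed', 'offer received', 'offer completed',
              'offer viewed'],
})
result = analysis_by_offer(toy)
print(result)
```

Because every column is indexed by `offer_id` before the frame is assembled, each row is guaranteed to describe a single offer regardless of how the counts rank.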

In [46]:
# plotting the age_distributions 

age_distributions(df_female, df_male, 'pink', 'cornflowerblue')

Another look at the age distribution by gender. The histogram confirms that the majority of customers of both genders are between 40 and 60 years old.

In [49]:
# plotting the income_distributions 

income_distributions(df_female, df_male, 'pink', 'cornflowerblue')

Here, we confirm the previous findings about the income distributions. Across all genders, most customers make between 40k and 60k a year. Most female customers make between 60k and 80k, while most male customers make between 40k and 60k.
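
The bracket claims above can also be checked numerically rather than read off the histogram. A small sketch, assuming one row per customer with `gender` and `income` columns as in `df_female` and `df_male` (the toy values and the bin edges are made up for illustration):

```python
import pandas as pd

# toy per-customer data; real input would be the de-duplicated customer rows
customers = pd.DataFrame({
    'gender': ['F', 'F', 'F', 'M', 'M', 'M', 'M'],
    'income': [62000, 75000, 48000, 41000, 55000, 59000, 83000],
})

edges = [20000, 40000, 60000, 80000, 100000, 120000]
labels = ['20k-40k', '40k-60k', '60k-80k', '80k-100k', '100k-120k']
# right=False makes each bin include its lower edge and exclude its upper edge
customers['bracket'] = pd.cut(customers['income'], bins=edges,
                              labels=labels, right=False)

# rows: income bracket, columns: gender, cells: number of customers
counts = pd.crosstab(customers['bracket'], customers['gender'])
most_common = counts.idxmax()   # most frequent bracket per gender
print(counts)
print(most_common)
```

The same pattern works for `age` by swapping the column and the bin edges.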

In [50]:
# showing an analysis of each offer for both genders; male and female customers

overall_analysis = analysis(full_data, 'received', 'overall_views', 'overall_completion')
overall_analysis

Unnamed: 0,offer_id,received,overall_views,overall_completion,overall_views(%),overall_completion(%)
0,5,4533,4402,3729,97.11,82.26
1,7,4451,3711,3169,83.37,71.2
2,2,3904,3552,2773,90.98,71.03
3,6,3837,3276,2592,85.38,67.55
4,1,3802,3248,2573,85.43,67.67
5,4,3776,2175,2499,57.6,66.18
6,3,3385,2132,2374,62.98,70.13
7,8,3349,1607,1917,47.98,57.24


In [51]:
# showing an analysis of each offer for male customers

male_analysis = analysis(df_male, 'received', 'male_views', 'male_completion')
male_analysis

Unnamed: 0,offer_id,received,male_views,male_completion,male_views(%),male_completion(%)
0,5,2639,2554,2052,96.78,77.76
1,7,2554,2174,1755,85.12,68.72
2,2,2275,2106,1441,92.57,63.34
3,6,2250,1884,1354,83.73,60.18
4,1,2194,1878,1324,85.6,60.35
5,4,2166,1196,1265,55.22,58.4
6,8,1944,1156,1143,59.47,58.8
7,3,1941,861,917,44.36,47.24


In [52]:
# showing an analysis of each offer for female customers

female_analysis = analysis(df_female, 'received', 'female_views', 'female_completion')
female_analysis

Unnamed: 0,offer_id,received,female_views,female_completion,female_views(%),female_completion(%)
0,5,1836,1790,1626,97.49,88.56
1,7,1835,1483,1371,80.82,74.71
2,2,1569,1393,1281,88.78,81.64
3,4,1556,1353,1261,86.95,81.04
4,1,1541,1324,1197,85.92,77.68
5,6,1531,934,1188,61.01,77.6
6,3,1396,929,1134,66.55,81.23
7,8,1365,711,970,52.09,71.06


In [53]:
# preparing a dataframe shows the completion percentages of each offer by gender

completion_perc = pd.merge(pd.merge(female_analysis, 
                                    male_analysis, on='offer_id'),
                           overall_analysis, on='offer_id')
col_list = ['offer_id', 'female_completion(%)', 'male_completion(%)', 'overall_completion(%)']
completion_perc = completion_perc[col_list].sort_values(by='offer_id').set_index('offer_id')

completion_perc

offer_id,female_completion(%),male_completion(%),overall_completion(%)
1,77.68,60.35,67.67
2,81.64,63.34,71.03
3,81.23,47.24,70.13
4,81.04,58.4,66.18
5,88.56,77.76,82.26
6,77.6,60.18,67.55
7,74.71,68.72,71.2
8,71.06,58.8,57.24


After the additional cleaning in the previous steps, here is a comparison of completion rates by gender for each offer. The overall completion rate accounts for all customers (male, female, and other), and offer 5 is again the most popular offer. Female customers clearly respond to offers more than male customers do. The following chart illustrates these numbers and shows a clear difference between how female and male customers act on offers.

In [54]:
# plotting the Percentages of Completed offers by Gender

f_completed = completion_perc['female_completion(%)']
m_completed = completion_perc['male_completion(%)']
overall_completed = completion_perc['overall_completion(%)']
offers = completion_perc.index

x = offers
y = f_completed
y2 = m_completed
y3 = overall_completed

trace1 = go.Bar(
    x=x,
    y=y,
    name='Female',
    #hoverinfo = 'y',
    hovertemplate = '<i>Percentage of all Female Customers who completed the Offer</i>: %{y:.2f}%'
                    '<br><b>Offer</b>: %{x}<br>',
    marker=dict(
        color='pink',
        line=dict(
            color='grey',
            width=1.5),
        ),
    opacity=0.6
)

trace2 = go.Bar(
    x=x,
    y=y2,
    name='Male',
    #hoverinfo = 'y',
    hovertemplate = '<i>Percentage of all Male Customers who completed the Offer</i>: %{y:.2f}%'
                    '<br><b>Offer</b>: %{x}<br>',
    marker=dict(
        color='cornflowerblue',
        line=dict(
            color='grey',
            width=1.5),
        ),
    opacity=0.6
)



trace3 = go.Scatter(
    x=x,
    y=y3,
    name='Overall',
    #hoverinfo= 'y',
    hovertemplate = '<i>Percentage of all Customers, Male, Female, and Other who completed the Offer</i>: %{y:.2f}%'
                    '<br><b>Offer</b>: %{x}<br>',
    marker=dict(
        color='grey',
        )
)

data1 = [trace1, trace2, trace3]

layout = go.Layout(
    title = "Percentage of Completed offers by Gender",
    xaxis=dict(title = 'Offers',
    type='category'),
    barmode='group',
    yaxis = dict(title = 'Percentage of Completed offers'
        #hoverformat = '.2f'
    )
)

fig = go.Figure(data=data1, layout=layout)
py.iplot(fig, filename='Percentage of Completed offers by Gender')





The bar chart illustrates the numbers found earlier, where we can see a clear difference between how female and male customers behave toward offers. Overall, offer 8 is the least popular offer, and offer 5 is the most popular offer for both genders. Offer 8 is the least popular offer for female customers, and offer 3 is the least popular for male customers.
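
The most and least popular offers named in the paragraph above can be read directly from the `completion_perc` table with `idxmax` and `idxmin`. A short sketch that rebuilds the table from the printed values (the variable names `most_popular`, `least_popular` and `gap` are introduced here for illustration):

```python
import pandas as pd

# completion percentages copied from the completion_perc table above
completion_perc = pd.DataFrame({
    'female_completion(%)':  [77.68, 81.64, 81.23, 81.04, 88.56, 77.60, 74.71, 71.06],
    'male_completion(%)':    [60.35, 63.34, 47.24, 58.40, 77.76, 60.18, 68.72, 58.80],
    'overall_completion(%)': [67.67, 71.03, 70.13, 66.18, 82.26, 67.55, 71.20, 57.24],
}, index=pd.Index(range(1, 9), name='offer_id'))

most_popular = completion_perc.idxmax()    # offer_id with the highest rate per column
least_popular = completion_perc.idxmin()   # offer_id with the lowest rate per column

# female minus male completion rate, in percentage points, per offer
gap = completion_perc['female_completion(%)'] - completion_perc['male_completion(%)']

print(most_popular)
print(least_popular)
print(gap.round(2))
```

In this table the gap is positive for every offer, which is the numeric counterpart of the difference visible in the bar chart.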


#### <font color=blue> 2.2.4 Offers </font>

In [55]:
offers_list = full_data.drop_duplicates(subset=['offer_id'], keep='first')
offers_list = offers_list.drop(['customer_id', 
                                'age', 
                                'gender', 
                                'membership_start', 'membership_period', 
                                'transaction_amount', 'time_of_transaction(hours)',
                                'event', 
                                'income', 
                                'time(hours)_y',
                                'offer_views(%)',
                                'completed(not_viewed)'], axis=1)
offers_list.set_index('offer_id')

offer_id,received,viewed,completed,offer_completion(%),offer_type,difficulty($),duration(hours),reward,email,mobile,social,web
1,6685,3499,4188,62.647719,bogo,5,168,5,1,1,0,1
2,6683,5901,3657,54.720934,bogo,10,168,10,1,1,1,0
3,6576,6310,4103,62.393552,bogo,5,120,5,1,1,1,1
4,6631,3460,3911,58.980546,discount,10,168,2,1,1,0,1
5,6652,6407,5003,75.210463,discount,10,240,2,1,1,1,1
6,6655,6379,4886,73.418482,discount,7,168,3,1,1,1,1
7,6726,2215,3386,50.341957,discount,20,240,5,1,0,0,1
8,6593,6329,3310,50.204763,bogo,10,120,10,1,1,1,1


In [56]:
offers_list[offers_list.offer_type == 'bogo'].set_index('offer_id')

offer_id,received,viewed,completed,offer_completion(%),offer_type,difficulty($),duration(hours),reward,email,mobile,social,web
1,6685,3499,4188,62.647719,bogo,5,168,5,1,1,0,1
2,6683,5901,3657,54.720934,bogo,10,168,10,1,1,1,0
3,6576,6310,4103,62.393552,bogo,5,120,5,1,1,1,1
8,6593,6329,3310,50.204763,bogo,10,120,10,1,1,1,1


In [57]:
offers_list[offers_list.offer_type == 'discount'].set_index('offer_id')

offer_id,received,viewed,completed,offer_completion(%),offer_type,difficulty($),duration(hours),reward,email,mobile,social,web
4,6631,3460,3911,58.980546,discount,10,168,2,1,1,0,1
5,6652,6407,5003,75.210463,discount,10,240,2,1,1,1,1
6,6655,6379,4886,73.418482,discount,7,168,3,1,1,1,1
7,6726,2215,3386,50.341957,discount,20,240,5,1,0,0,1
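
To compare the two offer types as a whole rather than offer by offer, the counts in the offers table can be pooled per `offer_type`. A minimal sketch using the `received` and `completed` values printed above (pooling the counts before dividing is a choice made here, not something the notebook does):

```python
import pandas as pd

# received/completed counts copied from the offers table above
offers = pd.DataFrame({
    'offer_id':   [1, 2, 3, 4, 5, 6, 7, 8],
    'offer_type': ['bogo', 'bogo', 'bogo', 'discount',
                   'discount', 'discount', 'discount', 'bogo'],
    'received':   [6685, 6683, 6576, 6631, 6652, 6655, 6726, 6593],
    'completed':  [4188, 3657, 4103, 3911, 5003, 4886, 3386, 3310],
})

# pool the counts per offer type, then compute one completion rate per type
by_type = offers.groupby('offer_type')[['received', 'completed']].sum()
by_type['completion(%)'] = (by_type['completed'] / by_type['received'] * 100).round(2)
print(by_type)
```

With these counts, discount offers come out ahead of BOGO offers on the pooled completion rate.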


In [58]:
# function to return an analysis of the numbers and percentage of all transaction by offer
def analysis2(df1, col1, col2, col3, offer):
    v = df1[df1['event'] == 'offer viewed']
    v = v.drop_duplicates(subset=['customer_id', 'offer_id'], keep='first')

    c = df1[df1['event'] == 'offer completed']
    c = c.drop_duplicates(subset=['customer_id', 'offer_id'], keep='first')

    received = df1.drop_duplicates(subset=['customer_id', 'offer_id'], keep='first')

    viewed = pd.Series(v.age).value_counts().reset_index().rename(columns={'index': 'age_v', 'age': 'viewed'})
    completed = pd.Series(c.age).value_counts().reset_index().rename(columns={'index': 'age_c', 'age': 'completed'})
    received = pd.Series(received.age).value_counts().reset_index().rename(columns={'index': 'age_r', 'age': 'received'})


    analysis2 = pd.concat([viewed, completed, received], axis=1).rename(columns={'age_v': 'age', 'viewed': col2, 'completed': col3, 'received': col1})
    analysis2['offer'+offer+'_'+col2+'(%)'] = round(analysis2[col2]/analysis2[col1]*100, 2)
    analysis2['offer'+offer+'_'+col3+'(%)'] = round(analysis2[col3]/analysis2[col1]*100, 2)

    analysis2 = analysis2.drop(['age_r', 'age_c'], axis=1).sort_values(by='age')
    
    return analysis2

# function to return a dataframe grouped by the time at which each transaction takes place after the start of an offer,
# showing a description of the amount spent at that time
def offer_transcations(offer):
    offer_trans = transcation_by_time(df_completed, 'offer_id', offer)
    
    return offer_trans

# function to plot a trend of any two offers that shows the average Amount Spent since a Start of an Offer at a specific time
def plot2(offer1, offer2):
    offer_trans1 = offer_transcations(offer1)
    offer_trans2 = offer_transcations(offer2)
    
    trace_mean1 = go.Scatter(
        x=offer_trans1['time_of_transaction(hours)'],
        y=offer_trans1['mean'],
        name = offer1,
        opacity = 0.6)
    trace_mean2 = go.Scatter(
        x=offer_trans2['time_of_transaction(hours)'],
        y=offer_trans2['mean'],
        name = offer2,
        opacity = 0.6)


    data1 = [trace_mean1, trace_mean2]

    layout = {
        'title': 'Average Amount Spent Since a Start of an Offer Trend',
        'xaxis': {'title': 'Time Since a Start of an Offer (hours)'},
        'yaxis': {'title': 'Average Amount ($)', 
                  "range": [
                    10,
                    35
                ]}}
        
    fig = dict(data=data1, layout=layout)
    plot2 = py.iplot(fig, filename = "Amount Spent since a Start of an Offer Trend by Offer")
    
    return plot2

In [59]:
# showing a description of the amount spent by the time of transactions

offer_transcations(1)

Unnamed: 0,time_of_transaction(hours),count,mean,min,50%,max
0,0,176.0,16.762386,0.16,15.185,195.24
1,6,190.0,18.077684,0.07,14.78,448.97
2,12,260.0,18.776923,0.21,15.565,639.59
3,18,252.0,15.69127,0.13,13.245,366.65
4,24,261.0,15.152414,0.07,13.77,42.3
5,30,308.0,15.146201,0.12,12.87,323.88
6,36,289.0,16.060796,0.05,13.99,400.74
7,42,325.0,18.911662,0.18,13.1,657.26
8,48,310.0,17.156097,0.09,14.725,475.2
9,54,267.0,14.922097,0.43,12.74,347.37
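
`transcation_by_time` is defined earlier in the notebook and is not shown here. Judging only by the columns in the output above, a helper of this kind could be sketched as a filter followed by `groupby(...).describe()`; the name `spend_by_time_sketch` and the toy data are hypothetical:

```python
import pandas as pd

def spend_by_time_sketch(df, col, value):
    """Describe transaction_amount per transaction time for rows where df[col] == value."""
    subset = df[df[col] == value]
    stats = (subset.groupby('time_of_transaction(hours)')['transaction_amount']
                   .describe()[['count', 'mean', 'min', '50%', 'max']])
    return stats.reset_index()

# toy data, made up for illustration only
toy = pd.DataFrame({
    'offer_id': [1, 1, 1, 1, 2],
    'time_of_transaction(hours)': [0, 0, 6, 6, 0],
    'transaction_amount': [10.0, 20.0, 5.0, 7.0, 99.0],
})
summary = spend_by_time_sketch(toy, 'offer_id', 1)
print(summary)
```

The `50%` column is the median, which is less sensitive than the mean to single very large transactions such as the maxima visible in the table above.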


In [60]:
# plotting and comparing the trend of any two offers

plot2(5, 7)


In [61]:
# creating a dataframe that contains only transactions by a specific offer

offer1 = offer_dataset(full_data, 1)
offer1.sample(3)

Unnamed: 0,customer_id,age,gender,income,membership_start,membership_period,transaction_amount,time_of_transaction(hours),event,offer_id,...,offer_completion(%),completed(not_viewed),offer_type,difficulty($),duration(hours),reward,email,mobile,social,web
21683,929,29,F,61000.0,2017-09-18,2017-09,16.05,66,offer completed,1,...,62.647719,689,bogo,5,168,5,1,1,0,1
22075,946,82,F,62000.0,2016-09-19,2016-09,5.76,36,offer received,1,...,62.647719,689,bogo,5,168,5,1,1,0,1
120747,5201,24,M,36000.0,2016-05-25,2016-05,3.08,72,offer received,1,...,62.647719,689,bogo,5,168,5,1,1,0,1


In [62]:
# showing the numbers and percentages of all transactions made by a specific age for a specific offer

offer1_analysis = analysis2(offer1, 'received', 'viewed', 'completed', '1')
offer1_analysis

Unnamed: 0,age,viewed,completed,received,offer1_viewed(%),offer1_completed(%)
68,18.0,9.0,12.0,16,56.25,75.00
60,19.0,13.0,19.0,30,43.33,63.33
56,20.0,15.0,22.0,33,45.45,66.67
57,21.0,14.0,21.0,31,45.16,67.74
58,22.0,14.0,20.0,30,46.67,66.67
61,23.0,13.0,17.0,28,46.43,60.71
52,24.0,18.0,25.0,37,48.65,67.57
62,25.0,12.0,17.0,28,42.86,60.71
49,26.0,19.0,27.0,38,50.00,71.05
42,27.0,21.0,29.0,42,50.00,69.05


In [63]:
offer2 = offer_dataset(full_data, 2)
offer2.sample(3)

Unnamed: 0,customer_id,age,gender,income,membership_start,membership_period,transaction_amount,time_of_transaction(hours),event,offer_id,...,offer_completion(%),completed(not_viewed),offer_type,difficulty($),duration(hours),reward,email,mobile,social,web
165927,6404,53,F,34000.0,2017-03-11,2017-03,8.67,18,offer completed,2,...,54.720934,-2244,bogo,10,168,10,1,1,1,0
143716,666,61,F,50000.0,2016-08-28,2016-08,14.24,24,offer completed,2,...,54.720934,-2244,bogo,10,168,10,1,1,1,0
263125,8734,30,M,51000.0,2015-11-28,2015-11,18.58,54,offer received,2,...,54.720934,-2244,bogo,10,168,10,1,1,1,0


In [64]:
offer2_analysis = analysis2(offer2, 'received', 'viewed', 'completed', '2')
offer2_analysis

Unnamed: 0,age,viewed,completed,received,offer2_viewed(%),offer2_completed(%)
69,18,16,8,18,88.89,44.44
57,19,32,17,33,96.97,51.52
42,20,42,25,44,95.45,56.82
53,21,33,18,34,97.06,52.94
55,22,33,17,33,100.00,51.52
49,23,35,19,39,89.74,48.72
47,24,38,20,41,92.68,48.78
39,25,46,27,47,97.87,57.45
40,26,45,26,47,95.74,55.32
54,27,33,18,33,100.00,54.55


In [65]:
offer3 = offer_dataset(full_data, 3)
offer3_analysis = analysis2(offer3, 'received', 'viewed', 'completed', '3')
offer3_analysis

Unnamed: 0,age,viewed,completed,received,offer3_viewed(%),offer3_completed(%)
68,18,17,13,17,100.00,76.47
41,19,37,27,38,97.37,71.05
44,20,35,25,36,97.22,69.44
40,21,37,27,38,97.37,71.05
55,22,29,21,30,96.67,70.00
43,23,36,25,38,94.74,65.79
52,24,32,22,33,96.97,66.67
47,25,33,23,35,94.29,65.71
28,26,46,34,47,97.87,72.34
45,27,35,25,36,97.22,69.44


In [66]:
offer4 = offer_dataset(full_data, 4)
offer4_analysis = analysis2(offer4, 'received', 'viewed', 'completed', '4')
offer4_analysis

Unnamed: 0,age,viewed,completed,received,offer4_viewed(%),offer4_completed(%)
71,18.0,6.0,9,12,50.00,75.00
64,19.0,11.0,13,23,47.83,56.52
54,20.0,15.0,20,33,45.45,60.61
50,21.0,16.0,21,37,43.24,56.76
59,22.0,13.0,17,29,44.83,58.62
49,23.0,16.0,22,38,42.11,57.89
44,24.0,18.0,24,42,42.86,57.14
60,25.0,12.0,17,28,42.86,60.71
58,26.0,13.0,17,29,44.83,58.62
57,27.0,13.0,17,29,44.83,58.62


In [67]:
offer5 = offer_dataset(full_data, 5)
offer5_analysis = analysis2(offer5, 'received', 'viewed', 'completed', '5')
offer5_analysis

Unnamed: 0,age,viewed,completed,received,offer5_viewed(%),offer5_completed(%)
66,18,21,19,22,95.45,86.36
53,19,40,31,41,97.56,75.61
61,20,31,25,32,96.88,78.12
41,21,49,40,53,92.45,75.47
49,22,42,33,44,95.45,75.00
48,23,42,34,44,95.45,77.27
55,24,38,30,40,95.00,75.00
44,25,46,38,49,93.88,77.55
39,26,54,43,56,96.43,76.79
40,27,53,43,55,96.36,78.18


In [68]:
offer6 = offer_dataset(full_data, 6)
offer6_analysis = analysis2(offer6, 'received', 'viewed', 'completed', '6')
offer6_analysis

Unnamed: 0,age,viewed,completed,received,offer6_viewed(%),offer6_completed(%)
66,18,21,19.0,23,91.30,82.61
45,19,43,34.0,44,97.73,77.27
56,20,33,25.0,33,100.00,75.76
47,21,40,33.0,41,97.56,80.49
50,22,38,31.0,39,97.44,79.49
51,23,37,29.0,38,97.37,76.32
48,24,39,32.0,41,95.12,78.05
52,25,36,28.0,38,94.74,73.68
43,26,44,36.0,45,97.78,80.00
38,27,46,39.0,47,97.87,82.98


In [69]:
offer7 = offer_dataset(full_data, 7)
offer7_analysis = analysis2(offer7, 'received', 'viewed', 'completed', '7')
offer7_analysis

Unnamed: 0,age,viewed,completed,received,offer7_viewed(%),offer7_completed(%)
68,18.0,6.0,11,19,31.58,57.89
72,19.0,5.0,8,12,41.67,66.67
58,20.0,7.0,17,34,20.59,50.00
65,21.0,6.0,12,26,23.08,46.15
66,22.0,6.0,11,24,25.00,45.83
45,23.0,11.0,23,47,23.40,48.94
44,24.0,12.0,24,48,25.00,50.00
50,25.0,10.0,20,44,22.73,45.45
62,26.0,7.0,13,31,22.58,41.94
67,27.0,6.0,11,22,27.27,50.00


In [70]:
offer8 = offer_dataset(full_data, 8)
offer8_analysis = analysis2(offer8, 'received', 'viewed', 'completed', '8')
offer8_analysis

Unnamed: 0,age,viewed,completed,received,offer8_viewed(%),offer8_completed(%)
64,18,20,10.0,20,100.00,50.00
41,19,38,18.0,39,97.44,46.15
55,20,29,14.0,30,96.67,46.67
38,21,39,22.0,40,97.50,55.00
57,22,28,13.0,29,96.55,44.83
40,23,38,19.0,40,95.00,47.50
45,24,35,16.0,38,92.11,42.11
46,25,35,16.0,37,94.59,43.24
39,26,38,22.0,40,95.00,55.00
32,27,42,26.0,42,100.00,61.90


In [71]:
# merging all the analyses created for each offer and then creating a dataframe that shows the completion percentage 
# of each offer based on the age group
from functools import reduce

offer_analyses = [offer1_analysis, offer2_analysis, offer3_analysis, offer4_analysis,
                  offer5_analysis, offer6_analysis, offer7_analysis, offer8_analysis]

# fold the list left to right, merging each offer's analysis on 'age'
completion_perc_o = reduce(lambda left, right: pd.merge(left, right, on='age'), offer_analyses)
col_list = ['age', 
            'offer1_completed(%)', 
            'offer2_completed(%)', 
            'offer3_completed(%)', 
            'offer4_completed(%)', 
            'offer5_completed(%)', 
            'offer6_completed(%)', 
            'offer7_completed(%)', 
            'offer8_completed(%)']
completion_perc_o = completion_perc_o[col_list]
completion_perc_o

Unnamed: 0,age,offer1_completed(%),offer2_completed(%),offer3_completed(%),offer4_completed(%),offer5_completed(%),offer6_completed(%),offer7_completed(%),offer8_completed(%)
0,18.0,75.00,44.44,76.47,75.00,86.36,82.61,57.89,50.00
1,19.0,63.33,51.52,71.05,56.52,75.61,77.27,66.67,46.15
2,20.0,66.67,56.82,69.44,60.61,78.12,75.76,50.00,46.67
3,21.0,67.74,52.94,71.05,56.76,75.47,80.49,46.15,55.00
4,22.0,66.67,51.52,70.00,58.62,75.00,79.49,45.83,44.83
5,23.0,60.71,48.72,65.79,57.89,77.27,76.32,48.94,47.50
6,24.0,67.57,48.78,66.67,57.14,75.00,78.05,50.00,42.11
7,25.0,60.71,57.45,65.71,60.71,77.55,73.68,45.45,43.24
8,26.0,71.05,55.32,72.34,58.62,76.79,80.00,41.94,55.00
9,27.0,69.05,54.55,69.44,58.62,78.18,82.98,50.00,61.90
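
As an alternative to building eight per-offer frames and merging them, the same kind of table can be sketched in one pass with `groupby` and `unstack`. This is only an illustration under stated assumptions: it expects an event-level frame with `customer_id`, `age`, `offer_id` and `event` columns, and the function name `completion_by_age` and the toy data are hypothetical:

```python
import pandas as pd

def completion_by_age(df):
    """Completion percentage per (age, offer_id), one column per offer."""
    # one row per customer, offer and event type
    events = df.drop_duplicates(subset=['customer_id', 'offer_id', 'event'])
    received = (events.drop_duplicates(subset=['customer_id', 'offer_id'])
                      .groupby(['age', 'offer_id']).size())
    completed = (events[events['event'] == 'offer completed']
                 .groupby(['age', 'offer_id']).size())
    # align completed counts on the received index; missing pairs count as 0
    rate = (completed.reindex(received.index, fill_value=0) / received * 100).round(2)
    table = rate.unstack('offer_id')
    table.columns = ['offer{}_completed(%)'.format(c) for c in table.columns]
    return table

# toy event log, made up for illustration only
toy = pd.DataFrame({
    'customer_id': [1, 1, 2, 3, 3, 4],
    'age':         [25, 25, 25, 40, 40, 40],
    'offer_id':    [1, 1, 1, 1, 1, 2],
    'event': ['offer received', 'offer completed', 'offer received',
              'offer received', 'offer completed', 'offer received'],
})
table = completion_by_age(toy)
print(table)
```

Ages with no customers for a given offer show up as `NaN` rather than being dropped, which differs from an inner merge on `age`.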


A complete analysis of each offer was created to show the completion rate of each age group with respect to an offer. The following function will allow us to report the completion rates per age group. Then, we will plot a trend of the completion rates by each age group for each offer.

In [72]:
# a function to return a report of an age group that contains all completion percentages by offer

def age_report(a, df=completion_perc_o):
    report = df[df.age == a]
    
    return report

In [73]:
age_report(35)

Unnamed: 0,age,offer1_completed(%),offer2_completed(%),offer3_completed(%),offer4_completed(%),offer5_completed(%),offer6_completed(%),offer7_completed(%),offer8_completed(%)
17,35.0,67.57,56.82,64.29,68.42,72.5,82.61,51.35,44.83


In [74]:
age_report(40)

Unnamed: 0,age,offer1_completed(%),offer2_completed(%),offer3_completed(%),offer4_completed(%),offer5_completed(%),offer6_completed(%),offer7_completed(%),offer8_completed(%)
22,40.0,73.68,63.24,78.85,67.27,81.82,84.21,56.96,64.29


In [75]:
age_report(25)

Unnamed: 0,age,offer1_completed(%),offer2_completed(%),offer3_completed(%),offer4_completed(%),offer5_completed(%),offer6_completed(%),offer7_completed(%),offer8_completed(%)
7,25.0,60.71,57.45,65.71,60.71,77.55,73.68,45.45,43.24


In [76]:
# creating a copy of the completion reports 

plot1 = completion_perc_o.copy()

# plotting the Percentages of Completed offers by each Age group for each offer

plot1 = plot1.set_index('age')

trace1 = go.Scatter(
    x=plot1.index,
    y=plot1['offer1_completed(%)'],
    name = "Offer 1",
    opacity = 0.8)

trace2 = go.Scatter(
    x=plot1.index,
    y=plot1['offer2_completed(%)'],
    name = "Offer 2",
    opacity = 0.8)

trace3 = go.Scatter(
    x=plot1.index,
    y=plot1['offer3_completed(%)'],
    name = "Offer 3",
    opacity = 0.8)

trace4 = go.Scatter(
    x=plot1.index,
    y=plot1['offer4_completed(%)'],
    name = "Offer 4",
    opacity = 0.8)

trace5 = go.Scatter(
    x=plot1.index,
    y=plot1['offer5_completed(%)'],
    name = "Offer 5",
    opacity = 0.8)

trace6 = go.Scatter(
    x=plot1.index,
    y=plot1['offer6_completed(%)'],
    name = "Offer 6",
    opacity = 0.8)

trace7 = go.Scatter(
    x=plot1.index,
    y=plot1['offer7_completed(%)'],
    name = "Offer 7",
    opacity = 0.8)

trace8 = go.Scatter(
    x=plot1.index,
    y=plot1['offer8_completed(%)'],
    name = "Offer 8",
    opacity = 0.8)

data1 = [trace1, trace2, trace3, trace4, trace5, trace6, trace7, trace8]

layout = {
    'title': 'Percentage of Completed offers by Age',
    'xaxis': {'title': 'Age'},
    'yaxis': {'title': 'Percentage Completed (%)'}}

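# update the existing xaxis dict in place; layout.update(dict(xaxis=...)) would replace it and drop the axis title
layout['xaxis'].update(dict(rangeslider=dict(visible = True), type='linear'))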

updatemenus = list([
    dict(active=0,
         buttons=list([   
            dict(label = 'All',
                 method = 'update',
                 args = [{'visible': [True, True, True, True, True, True, True, True]},
                         {'title': 'Percentage of Each Completed offers by Age'}]),
            dict(label = 'Offer 1',
                 method = 'update',
                 args = [{'visible': [True, False, False, False, False, False, False, False]},
                         {'title': 'Percentage of Completed offers by Age for Offer 1'}]),
            dict(label = 'Offer 2',
                 method = 'update',
                 args = [{'visible': [False, True, False, False, False, False, False, False]},
                         {'title': 'Percentage of Completed offers by Age for Offer 2'}]),
            dict(label = 'Offer 3',
                 method = 'update',
                 args = [{'visible': [False, False, True, False, False, False, False, False]},
                         {'title': 'Percentage of Completed offers by Age for Offer 3'}]),
            dict(label = 'Offer 4',
                 method = 'update',
                 args = [{'visible': [False, False, False, True, False, False, False, False]},
                         {'title': 'Percentage of Completed offers by Age for Offer 4'}]),
            dict(label = 'Offer 5',
                 method = 'update',
                 args = [{'visible': [False, False, False, False, True, False, False, False]},
                         {'title': 'Percentage of Completed offers by Age for Offer 5'}]),
            dict(label = 'Offer 6',
                 method = 'update',
                 args = [{'visible': [False, False, False, False, False, True, False, False]},
                         {'title': 'Percentage of Completed offers by Age for Offer 6'}]),
            dict(label = 'Offer 7',
                 method = 'update',
                 args = [{'visible': [False, False, False, False, False, False, True, False]},
                         {'title': 'Percentage of Completed offers by Age for Offer 7'}]),
            dict(label = 'Offer 8',
                 method = 'update',
                 args = [{'visible': [False, False, False, False, False, False, False, True]},
                         {'title': 'Percentage of Completed offers by Age for Offer 8'}])
        ]),
         
    )
    
])


layout.update(dict(updatemenus=updatemenus))

fig = go.Figure(data=data1, layout=layout)

              
py.iplot(fig, filename = "Percentage of Completed offers by Age")





The plot illustrates the percentage of completed offers for each age, and it can be filtered by offer. For example, among 35-year-old customers, offer 6 has the highest completion rate at 82.61%, while 25-year-old customers prefer offer 5 the most, with a completion rate of 77.55%.
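
The preferred offer per age can also be pulled out programmatically instead of being read off the plot. A short sketch using the three `age_report` rows printed above (the names `reports`, `best_offer` and `best_rate` are introduced here for illustration):

```python
import pandas as pd

cols = ['offer{}_completed(%)'.format(i) for i in range(1, 9)]

# rows copied from the age_report outputs above (ages 25, 35 and 40)
reports = pd.DataFrame(
    [[60.71, 57.45, 65.71, 60.71, 77.55, 73.68, 45.45, 43.24],
     [67.57, 56.82, 64.29, 68.42, 72.50, 82.61, 51.35, 44.83],
     [73.68, 63.24, 78.85, 67.27, 81.82, 84.21, 56.96, 64.29]],
    index=pd.Index([25, 35, 40], name='age'), columns=cols)

best_offer = reports.idxmax(axis=1)   # column with the highest rate for each age
best_rate = reports.max(axis=1)       # the corresponding completion rate

print(pd.DataFrame({'best_offer': best_offer, 'rate(%)': best_rate}))
```

Applied to the full `completion_perc_o` table (with `age` set as the index), the same two calls would give the top offer for every age at once.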

In [77]:
fig.update_layout(
    updatemenus=[
        go.layout.Updatemenu(
            buttons=list([
                dict(
                    args=["colorscale", "Viridis"],
                    label="Viridis",
                    method="restyle"
                ),
                dict(
                    args=["colorscale", "Cividis"],
                    label="Cividis",
                    method="restyle"
                ),
                dict(
                    args=["colorscale", "Blues"],
                    label="Blues",
                    method="restyle"
                ),
                dict(
                    args=["colorscale", "Greens"],
                    label="Greens",
                    method="restyle"
                ),
            ]),
            direction="down",
            pad={"r": 10, "t": 10},
            showactive=True,
            x=0.1,
            xanchor="left",
            y=button_layer_1_height,
            yanchor="top"
        ),
        go.layout.Updatemenu(
            buttons=list([
                dict(
                    args=["reversescale", False],
                    label="False",
                    method="restyle"
                ),
                dict(
                    args=["reversescale", True],
                    label="True",
                    method="restyle"
                )
            ]),
            direction="down",
            pad={"r": 10, "t": 10},
            showactive=True,
            x=0.37,
            xanchor="left",
            y=button_layer_1_height,
            yanchor="top"
        ),
        go.layout.Updatemenu(
            buttons=list([
                dict(
                    args=[{"contours.showlines": False, "type": "contour"}],
                    label="Hide lines",
                    method="restyle"
                ),
                dict(
                    args=[{"contours.showlines": True, "type": "contour"}],
                    label="Show lines",
                    method="restyle"
                ),
            ]),
            direction="down",
            pad={"r": 10, "t": 10},
            showactive=True,
            x=0.58,
            xanchor="left",
            y=button_layer_1_height,
            yanchor="top"
        ),
    ]
)

AttributeError: 'Figure' object has no attribute 'update_layout'
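The `AttributeError` above suggests that the installed plotly release predates `Figure.update_layout`. One possible workaround, sketched below, is a small helper that uses `update_layout` when the figure has it and otherwise falls back to updating `fig.layout` directly. The helper name `apply_layout` is hypothetical, and the fallback assumes the older release exposes `fig.layout.update(...)` with keyword arguments.

```python
def apply_layout(fig, **layout_kwargs):
    """Apply layout settings with whichever update API the figure offers."""
    if hasattr(fig, "update_layout"):
        # Releases that provide Figure.update_layout
        fig.update_layout(**layout_kwargs)
    else:
        # Assumed fallback for older releases: update the layout object itself
        fig.layout.update(**layout_kwargs)
    return fig
```

With this helper, the cell above would call `apply_layout(fig, updatemenus=[...])` with the same list of `go.layout.Updatemenu` objects. Upgrading plotly is the other option, although the `plotly.plotly` import used in this notebook may then need to change as well.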

### Bonus

In [78]:
# plotting a sunburst chart that shows the number of customers
# across all transactions in which the customer completed an offer

trace = go.Sunburst(
    labels=["Transactions",
            "BOGO", "Discount",
            "Offer 1", "Offer 2", "Offer 3", "Offer 4", "Offer 5", 
            "Offer 6", "Offer 7", "Offer 8",
            "Female", "Male", "Other", 
            "Female", "Male", "Other",
            "Female", "Male", "Other", 
            "Female", "Male", "Other", 
            "Female", "Male", "Other", 
            "Female", "Male", "Other", 
            "Female", "Male", "Other", 
            "Female", "Male", "Other"],
    parents=["", 
             "Transactions", "Transactions", 
             "BOGO", "BOGO", "BOGO", "Discount", "Discount", 
             "Discount", "Discount", "BOGO",
             "Offer 1", "Offer 1", "Offer 1",
             "Offer 2", "Offer 2", "Offer 2", 
             "Offer 3", "Offer 3", "Offer 3",
             "Offer 4", "Offer 4", "Offer 4",
             "Offer 5", "Offer 5", "Offer 5",
             "Offer 6", "Offer 6", "Offer 6",
             "Offer 7", "Offer 7", "Offer 7",
             "Offer 8", "Offer 8", "Offer 8"],
    values=[26226, 
            9937, 12089,
            2573, 2773, 2374, 2499, 3729, 
            2592, 3269, 1917,
            1197, 1324, 52,
            1281, 1441, 51,
            1134, 917, 323,
            1247, 1252, 0,
            1626, 2052, 51,
            1188, 1354, 50,
            1371, 1755, 141,
            872, 1045, 0],
    branchvalues="total",
    outsidetextfont = {"size": 15, "color": "#377eb8"},
    marker = {"line": {"width": 2}})

layout = go.Layout(
    title = 'Completed offers by offer type, offer, and gender',
    margin = go.layout.Margin(t=0, l=0, r=0, b=0))

py.iplot(go.Figure([trace], layout), filename='basic_sunburst_chart_total_branchvalues')
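The trace repeats the labels "Female", "Male", and "Other" under every offer. Plotly's sunburst matches `parents` against `labels` when no `ids` are given and expects those labels to be unique, so repeated labels can be ambiguous. The sketch below builds unique, path-like ids from the same parallel lists. The helper name `sunburst_ids` is hypothetical, and it assumes each parent is listed before its children and that only leaf labels repeat.

```python
def sunburst_ids(labels, parents):
    """Build unique ids and id-based parents for parallel sunburst lists.

    Assumes every parent appears in `labels` before its children and that
    only leaf labels (for example "Female", "Male", "Other") are repeated.
    """
    ids, id_parents, id_of_label = [], [], {}
    for label, parent in zip(labels, parents):
        parent_id = id_of_label.get(parent, "")  # "" marks the root
        node_id = f"{parent_id}/{label}" if parent_id else label
        ids.append(node_id)
        id_parents.append(parent_id)
        id_of_label[label] = node_id  # repeated leaf labels overwrite harmlessly
    return ids, id_parents


# A shortened version of the hierarchy, for illustration only
labels = ["Transactions", "BOGO", "Discount", "Offer 1", "Offer 4",
          "Female", "Male", "Female", "Male"]
parents = ["", "Transactions", "Transactions", "BOGO", "Discount",
           "Offer 1", "Offer 1", "Offer 4", "Offer 4"]
ids, id_parents = sunburst_ids(labels, parents)
print(ids[5], "<-", id_parents[5])
```

The results would then be passed to the trace as `go.Sunburst(ids=ids, labels=labels, parents=id_parents, ...)`, which keeps the displayed labels unchanged.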

The interactive sunburst chart gives a more detailed breakdown of the transactions in which a customer received an offer, viewed it through one of its channels, and then made a purchase that completed the offer before it expired. The inner ring splits these transactions by offer type (BOGO or Discount), the middle ring by offer, and the outer ring by gender.
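Because the chart uses `branchvalues="total"`, each parent value is meant to cover the sum of its children. A quick consistency check on the hard-coded numbers can be sketched as follows. The helper name `check_branch_totals` is hypothetical, and the lists reproduce the values from the cell above in a more compact form.

```python
from collections import defaultdict


def check_branch_totals(labels, parents, values):
    """Return {parent: (stated_value, sum_of_children)} where the two differ."""
    stated = {}
    child_sum = defaultdict(int)
    for label, parent, value in zip(labels, parents, values):
        stated.setdefault(label, value)  # parent labels are unique; leaf repeats are unused
        child_sum[parent] += value
    return {p: (stated[p], total) for p, total in child_sum.items()
            if p and stated[p] != total}


offers = [f"Offer {i}" for i in range(1, 9)]
genders = ["Female", "Male", "Other"]
labels = ["Transactions", "BOGO", "Discount"] + offers + genders * 8
parents = ([""] + ["Transactions"] * 2
           + ["BOGO"] * 3 + ["Discount"] * 4 + ["BOGO"]
           + [offer for offer in offers for _ in genders])
values = [26226, 9937, 12089,
          2573, 2773, 2374, 2499, 3729, 2592, 3269, 1917,
          1197, 1324, 52, 1281, 1441, 51, 1134, 917, 323, 1247, 1252, 0,
          1626, 2052, 51, 1188, 1354, 50, 1371, 1755, 141, 872, 1045, 0]

mismatches = check_branch_totals(labels, parents, values)
print(mismatches)
```

On the values as listed, the check reports three parents whose stated value exceeds the sum of their children: "Transactions" (26226 against 22026), "BOGO" (9937 against 9637), and "Offer 7" (3269 against 3267). These are worth re-checking against the underlying data.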