### Will the Customer Accept the Coupon?

**Context**

Imagine driving through town and a coupon is delivered to your cell phone for a restaurant near where you are driving. Would you accept that coupon and take a short detour to the restaurant? Would you accept the coupon but use it on a subsequent trip? Would you ignore the coupon entirely? What if the coupon was for a bar instead of a restaurant? What about a coffee house? Would you accept a bar coupon with a minor passenger in the car? What about if it was just you and your partner in the car? Would weather impact the rate of acceptance? What about the time of day?

Obviously, proximity to the business is a factor on whether the coupon is delivered to the driver or not, but what are the factors that determine whether a driver accepts the coupon once it is delivered to them? How would you determine whether a driver is likely to accept a coupon?


**Data**

This data is from the UCI Machine Learning repository and was collected via a survey on Amazon Mechanical Turk. The survey describes different driving scenarios, including the destination, current time, weather, passenger, etc., and then asks the person whether they would accept the coupon if they were the driver. Answers that the user will drive there ‘right away’ or ‘later before the coupon expires’ are labeled as ‘Y = 1’ and answers ‘no, I do not want the coupon’ are labeled as ‘Y = 0’. There are five different types of coupons -- less expensive restaurants (under \$20), coffee houses, carry out & take away, bar, and more expensive restaurants (\$20 - \$50).

**Goal**

A brief report to highlight the differences between customers who did and did not accept the coupons.





### Data Description

The attributes of this data set include:
1. User attributes
    -  Gender: male, female
    -  Age: below 21, 21 to 25, 26 to 30, etc.
    -  Marital Status: single, married partner, unmarried partner, or widowed
    -  Number of children: 0, 1, or more than 1
    -  Education: high school, bachelors degree, associates degree, or graduate degree
    -  Occupation: architecture & engineering, business & financial, etc.
    -  Annual income: less than \$12500, \$12500 - \$24999, \$25000 - \$37499, etc.
    -  Number of times that he/she goes to a bar: 0, less than 1, 1 to 3, 4 to 8 or greater than 8
    -  Number of times that he/she buys takeaway food: 0, less than 1, 1 to 3, 4 to 8 or greater
    than 8
    -  Number of times that he/she goes to a coffee house: 0, less than 1, 1 to 3, 4 to 8 or
    greater than 8
    -  Number of times that he/she eats at a restaurant with average expense less than \$20 per
    person: 0, less than 1, 1 to 3, 4 to 8 or greater than 8
    -  Number of times that he/she eats at a restaurant with average expense of \$20 to \$50 per person: 0, less than 1, 1 to 3, 4 to 8 or greater than 8
    

2. Contextual attributes
    - Driving destination: home, work, or no urgent destination
    - Location of user, coupon and destination: we provide a map to show the geographical
    location of the user, destination, and the venue, and we mark the distance between each
    two places with time of driving. The user can see whether the venue is in the same
    direction as the destination.
    - Weather: sunny, rainy, or snowy
    - Temperature: 30F, 55F, or 80F
    - Time: 10AM, 2PM, or 6PM
    - Passenger: alone, partner, kid(s), or friend(s)


3. Coupon attributes
    - time before it expires: 2 hours or one day
    - coupon type: 'Restaurant(<20)', 'Coffee House', 'Bar', 'Carry out & Take away', 'Restaurant(20-50)'

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
import plotly.express as px

# Would you accept that coupon and take a short detour to the restaurant?
# Would you accept the coupon but use it on a subsequent trip?
# Would you ignore the coupon entirely?
# What if the coupon was for a bar instead of a restaurant?
# What about a coffee house? Would you accept a bar coupon with a minor passenger in the car?
# What about if it was just you and your partner in the car?
# Would weather impact the rate of acceptance?
# What about the time of day?


### Problems

Use the prompts below to get started with your data analysis.  

1. Read in the `coupons.csv` file.




In [None]:
data = pd.read_csv('data/coupons.csv')

In [None]:
data.head()

2. Investigate the dataset for missing or problematic data.

In [None]:
# Total # of records / rows
len(data)

In [None]:
# Count the NA values in each column
data.isnull().sum()
# data['CoffeeHouse'].value_counts()
# data['has_children'].unique()


3. Decide what to do about your missing data -- drop, replace, other...

In [None]:
# Drop the 'car' column: only 108 of the 12684 rows (12684 - 12576 missing) have a value.
data_clean = data.drop('car', axis=1)
# Drop the remaining rows that have NA values, which occur in Bar (107 NAs), CoffeeHouse (217), CarryAway (151), RestaurantLessThan20 (130), Restaurant20To50 (189)
data_clean = data_clean.dropna()
data_clean.isnull().sum()
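
Before dropping anything, it can help to look at the share of missing values per column rather than only the raw counts. The following is a minimal sketch on a made-up frame (the values are illustrative, not taken from `coupons.csv`), and the 50% cut-off for "mostly empty" is an assumption chosen for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for coupons.csv; the values are made up.
toy = pd.DataFrame({
    "car": [np.nan, np.nan, np.nan, "Scooter"],
    "Bar": ["never", np.nan, "1~3", "gt8"],
    "Y": [1, 0, 1, 0],
})

# Fraction of missing values in each column.
missing_share = toy.isnull().mean()

# Drop columns that are mostly empty (assumed threshold: more than 50% missing),
# then drop the remaining rows that still contain a missing value.
mostly_empty = missing_share[missing_share > 0.5].index
toy_clean = toy.drop(columns=mostly_empty).dropna()

print(missing_share)
print(len(toy), "rows before,", len(toy_clean), "rows after")
```

Dropping a nearly empty column first keeps far more rows than calling `dropna()` on the full frame would.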

4. What proportion of the total observations chose to accept the coupon?



In [None]:
proportion = len(data_clean[data_clean['Y']==1]) / len(data_clean)
print("Proportion of acceptance: ", proportion)

# Analyzing various dependencies across features

5. Use a bar plot to visualize the `coupon` column.

In [None]:
px.bar(data_frame=data_clean, x='coupon', title="Stacking different types of Coupons")

# This plot shows that most coupons issued were for "Coffee House", followed by "Restaurant(<20)" (restaurants under $20),
# "Carry out & Take away", "Restaurant(20-50)", and finally "Bar".

6. Use a histogram to visualize the temperature column.

In [None]:
# Temperature is recorded as one of three fixed values (30, 55, 80), not as an interval. For analysis purposes, each value could be read as a lower bound.
# The qualitative attribute "weather" shows a similar pattern, so temperature itself is used here, as the prompt asks.
# Each ratio below is accepted / rejected coupons (the odds of acceptance) at that temperature.
print(len(data_clean.query('temperature == 80 & Y==1')) / len(data_clean.query('temperature == 80 & Y==0')))
print(len(data_clean.query('temperature == 55 & Y==1')) / len(data_clean.query('temperature == 55 & Y==0')))
print(len(data_clean.query('temperature == 30 & Y==1')) / len(data_clean.query('temperature == 30 & Y==0')))
px.histogram(data_frame=data_clean, x='temperature', color='Y', text_auto=True, title="Impact of Temperature to Coupon Redemption")
# --- Take Aways : NOTE the pattern is the same if 'temperature' is replaced with 'weather'
# Many more coupons were issued on hotter days (80F), where accepted coupons outnumber rejected ones by about 1.5 to 1.
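
The ratios printed above divide accepted by rejected coupons, which gives the odds of acceptance rather than the acceptance rate. The sketch below shows both side by side on made-up data; the toy frame and the `by_temp` name are assumptions for illustration, and in the notebook the same `groupby` could be applied to `data_clean`.

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration.
toy = pd.DataFrame({
    "temperature": [80, 80, 80, 80, 80, 55, 55, 30, 30, 30],
    "Y":           [1,  1,  1,  0,  0,  1,  0,  1,  0,  0],
})

# Count accepted coupons and all coupons per temperature.
by_temp = toy.groupby("temperature")["Y"].agg(accepted="sum", total="count")

# Acceptance rate: accepted / all coupons issued at that temperature.
by_temp["rate"] = by_temp["accepted"] / by_temp["total"]
# Odds: accepted / rejected, which is what the prints above compute.
by_temp["odds"] = by_temp["accepted"] / (by_temp["total"] - by_temp["accepted"])

print(by_temp)
```

Odds of 1.5 correspond to a rate of 0.6, so the two numbers should not be read interchangeably.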


In [None]:
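# To run a correlation, convert the age buckets to numbers first.
data_age = data_clean.copy(deep=True)
# 'age' holds strings such as '21' and '26', so map the two text buckets and then cast the column to int.
# Note: 'below21' is mapped to 21 here, so it shares a value with the '21' bucket.
data_age['age'] = data_age['age'].replace({'50plus': '50', 'below21': '21'}).astype(int)
# data_age.age.unique()
data_age['age'].corr(data_age['Y'])
# --- Take Away :
# Age and coupon redemption are negatively correlated, but the correlation is negligible.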

In [None]:
# This plot compares user behavior for different types of coupons.
px.histogram(data_frame=data_clean, x='Y', color='coupon', text_auto=True, title="Comparison of different types of Coupons that are redeemed vs not")

# --- Take away :
# "Carry out & Take away" coupons are more likely to be redeemed.
# Similarly, "Restaurant(<20)" coupons (restaurants under $20 per person) are more likely to be redeemed.
# "Bar" coupons are less likely to be redeemed.

In [None]:
# How does coupon redemption vary with the customer's education level?
px.density_heatmap(data_frame=data_clean, x='education', y='Y', text_auto=True, title="Heatmap of Education to Coupon Redemption")
## -- Take aways
# The data is dominated by customers with a Bachelor's degree or some college. The raw counts alone do not tell us much, so the proportion per education level still needs to be calculated.
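
The take-away above notes that raw counts need to be turned into proportions. One possible way to do that is a row-normalized cross-tabulation, sketched here on made-up data. The education labels and values are illustrative assumptions; in the notebook the same call could take `data_clean['education']` and `data_clean['Y']`.

```python
import pandas as pd

# Hypothetical toy data; labels and values are made up for illustration.
toy = pd.DataFrame({
    "education": ["Bachelors degree"] * 4 + ["High School Graduate"] * 2,
    "Y": [1, 1, 1, 0, 1, 0],
})

# normalize="index" divides each row by its row total, so every education
# level sums to 1 and groups of different sizes become comparable.
shares = pd.crosstab(toy["education"], toy["Y"], normalize="index")

print(shares)
```

The column labelled `1` is then the acceptance rate for each education level, independent of how many customers fall into it.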


In [None]:
px.histogram(data_frame=data_clean, x='Y', color='has_children', title="Children impact on Coupon Redemption")
## -- Take aways
# The plot below tells us that customers without children are more likely to redeem the coupon.

**Investigating the Bar Coupons**

Exploration of just the bar related coupons.  




In [None]:
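# Take an explicit copy so later column assignments on data_bar do not trigger SettingWithCopyWarning.
data_bar = data_clean.query('coupon == "Bar"').copy()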
#data_bar = data_clean[data_clean['coupon'] == 'Bar']
data_bar.reset_index(inplace=True)
data_bar.head()

2. What proportion of bar coupons were accepted?


In [None]:
# len(data_bar.query('Y == 1')) / len(data_bar) - using len()
bar_coupon_redeemed = data_bar.query('Y == 1')['Y'].sum() / len(data_bar)  # using query
print("Proportion of bar coupons accepted: ", bar_coupon_redeemed)

## --- Take Away :
# About 41% of Bar coupons were accepted

3. Compare the acceptance rate between those who went to a bar 3 or fewer times a month to those who went more.


In [None]:
## @TODO : Tech Debt. Revisit the code
# print('unique values: ', data_bar.Bar.unique(), " # of records= ", len(data_bar))

# Flag customers who go to a bar 4 or more times a month. The data codes these buckets as '4~8' and 'gt8';
# every other bucket ('never', 'less1', '1~3') gets 0. assign() returns a new frame, so pandas raises no warning.
data_bar = data_bar.assign(Bar_4orMore=data_bar['Bar'].isin(['gt8', '4~8']).astype(int))

# ---- verifying the data ----
# print("Bar 4 or more= ", data_bar['Bar'].isin(['gt8', '4~8']).sum() )
# print("Bar less than 4= ", len(data_bar[(data_bar['Bar'] == 'never') | (data_bar['Bar'] == 'less1') | (data_bar['Bar'] == '1~3') ] ))
# print("Go to bar 4 or more times: ", (data_bar[data_bar['Bar_4orMore'] == 1]['Bar_4orMore'].sum() ) )
# print("Go to bar less than 4 times: ", len(data_bar[data_bar['Bar_4orMore'] == 0]) )
# print('Go to bar 4 or more times AND redeem coupon: ', (data_bar.query('Bar_4orMore == 1 & Y == 1')['Bar_4orMore'].sum()) )
# print('Go to bar less than 4 times AND redeem coupon: ', len(data_bar.query('Bar_4orMore == 0 & Y == 1')) )

# ---- calculate acceptance rate ----
bar_4orMore_acceptancerate = len(data_bar.query('Bar_4orMore == 1 & Y == 1')) / len(data_bar[data_bar['Bar_4orMore'] == 1] )
bar_lessThan4_acceptancerate = len(data_bar.query('Bar_4orMore == 0 & Y == 1')) / len(data_bar[data_bar['Bar_4orMore'] == 0])

print('Coupon Acceptance Rate of Customers who go to:')
print('\t Bar 4 or more times ', bar_4orMore_acceptancerate)
print('\t Bar less than 4 times ', bar_lessThan4_acceptancerate)

## ---- Take Aways :
# Customers who go to a bar 4 or more times a month are more likely to redeem the coupon than customers who go less often
# (compare the two rates printed above).


4. Compare the acceptance rate between drivers who go to a bar more than once a month and are over the age of 25 to the all others.  Is there a difference?


In [None]:
## @TODO : Tech Debt. Revisit the code
# Take an explicit copy so the column assignments below do not trigger SettingWithCopyWarning.
data_bar_age = data_clean.loc[data_clean['coupon'] == 'Bar'].copy()
# -- NOTE : The Bar intervals are "less than 1" and "1 to 3", so we treat "1 to 3" as more than once a month.
data_bar_age['Bar_1orMore'] = data_bar_age['Bar'].isin(['1~3', '4~8', 'gt8']).astype(int)
# Map the two text buckets to numbers, then cast the whole column to int.
# Since these are bar coupons, we assume they are only delivered to people aged 18 or older. # of below21 rows = 81
data_bar_age['age'] = data_bar_age['age'].replace({'50plus': '50', 'below21': '18'}).astype(int)

# ---- validate the data ----
# print("Age unique values:", data_bar_age.age.unique())
# print('Total # of records in dataframe: ', len(data_bar_age))
# print(data_bar_age.groupby(['Bar_1orMore', data_bar_age['age'] > 25]).size())

# Boolean masks for the two conditions being compared.
goes_1orMore = data_bar_age['Bar_1orMore'] == 1
above25 = data_bar_age['age'] > 25

# 'Y' is 0/1, so its mean within a group is that group's acceptance rate.
# Dividing two value_counts() results only works when their index labels match;
# dividing counts of 'Y' (label 1) by counts of 'Bar_1orMore' (label 0) is what returned NaN.

# -- Customer over 25 yrs AND goes to Bar 1 or more times --
bar1orMore_Above25_rate = data_bar_age.loc[goes_1orMore & above25, 'Y'].mean()
print('Customers > 25yrs, go to bar 1 or more times and redeem coupon: ', bar1orMore_Above25_rate)

# -- Customer 25 yrs or younger AND goes to Bar 1 or more times --
bar1orMore_less25_rate = data_bar_age.loc[goes_1orMore & ~above25, 'Y'].mean()
print('Customers <= 25yrs, go to bar 1 or more times and redeem coupon: ', bar1orMore_less25_rate)

# -- Customer over 25 yrs AND goes to Bar less than once --
barLessThan1_Above25_rate = data_bar_age.loc[~goes_1orMore & above25, 'Y'].mean()
print('Customer > 25yrs, go to bar less than 1 and redeem coupon: ', barLessThan1_Above25_rate)

# -- Customer 25 yrs or younger AND goes to Bar less than once --
barLessThan1_less25_rate = data_bar_age.loc[~goes_1orMore & ~above25, 'Y'].mean()
print('Customer <= 25yrs, go to bar less than 1 and redeem coupon: ', barLessThan1_less25_rate)

## ---- Take Aways :
# Customers who go to a bar 1 or more times a month, irrespective of their age, are more likely to redeem the coupon (~68%) than customers who go less than once a month.
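
The prompt asks for one target group (bar more than once a month and over 25) compared against all others, which can be expressed with a single boolean mask and its negation. This is a minimal sketch on made-up data; the toy frame and the `target` name are assumptions for illustration, and it assumes the age column has already been converted to integers.

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration.
toy = pd.DataFrame({
    "Bar": ["1~3", "gt8", "never", "less1", "4~8", "never"],
    "age": [26, 31, 41, 21, 21, 50],
    "Y":   [1, 1, 0, 0, 1, 0],
})

# Target group: goes to a bar at least once a month AND is older than 25.
target = toy["Bar"].isin(["1~3", "4~8", "gt8"]) & (toy["age"] > 25)

# 'Y' is 0/1, so the mean within each group is its acceptance rate.
target_rate = toy.loc[target, "Y"].mean()
others_rate = toy.loc[~target, "Y"].mean()

print("Target group:", target_rate)
print("All others:  ", others_rate)
```

Using `~target` guarantees that the two groups are disjoint and together cover every row, which is harder to verify when each subgroup is filtered by hand.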

5. Use the same process to compare the acceptance rate between drivers who go to bars more than once a month and had passengers that were not a kid and had occupations other than farming, fishing, or forestry.


In [None]:
# data_clean.occupation.unique()
# data_clean.passanger.unique()
data5 = data_bar_age.query('Bar_1orMore == 1 and passanger != "Kid(s)" and occupation != "Farming Fishing & Forestry"')
# 'Y' is 0/1, so summing it counts the accepted coupons in this group.
data5_accepted = data5['Y'].sum()
# print(data5_accepted)

print("Customers who visit bar 1 or more times, have no kid passengers, and have an occupation other than Farming, Fishing or Forestry : ", (data5_accepted / len(data5) ) )

## --- Take Aways
# Customers with 1 or more visits to the Bar, who are not driving with kids, AND whose occupation is not Farming, Fishing & Forestry,
# redeem the coupon about 71% of the time.

6. Compare the acceptance rates between those drivers who:

- go to bars more than once a month, had passengers that were not a kid, and were not widowed *OR*
- go to bars more than once a month and are under the age of 30 *OR*
- go to cheap restaurants more than 4 times a month and income is less than 50K.



In [None]:
# data_bar_age.maritalStatus.unique()
# ---- Goes to Bar more than once a month, no kid passengers, and not widowed
data6a = data_bar_age.query('Bar_1orMore == 1 and passanger != "Kid(s)" and maritalStatus != "Widowed"')
# print('6a - Total count: ', len(data6a))
# 'Y' is 0/1, so summing it counts the accepted coupons.
data6a_accepted = data6a['Y'].sum()
# print('6a - data accepted: ', data6a_accepted)
print('Customers who visit Bar 1 or more times, are not riding with a child, and are not widowed : ', (data6a_accepted / len(data6a) ))

# ---- Goes to Bar more than once a month and age < 30
data6b = data_bar_age.query('Bar_1orMore == 1 and age < 30')
# print('6b - Total count: ', len(data6b))
data6b_accepted = data6b['Y'].sum()
# print('6b - data accepted: ', data6b_accepted)
print('Customers who visit Bar 1 or more times, AND <30yrs : ', (data6b_accepted / len(data6b) ))

# ---- Goes to cheap restaurants more than 4 times a month and income < 50K (all coupon types, from data_clean)
# data_clean.RestaurantLessThan20.unique()
data_rest_50k = data_clean.query(
    '(RestaurantLessThan20 == "4~8" or RestaurantLessThan20 == "gt8") and (income == "Less than $12500" or income == "$12500 - $24999" or income == "$25000 - $37499" or income == "$37500 - $49999")')
# print('6c - Total count: ', len(data_rest_50k))
data_rest_50k_accepted = data_rest_50k['Y'].sum()
# print('6c - Accepted: ', data_rest_50k_accepted)
print('Customers who visit Restaurants with bill <20 more than 4 times a month and whose income is less than 50K: ', (data_rest_50k_accepted / len(data_rest_50k) ))

## -- Take Aways :
# Frequent bar-goers who are not driving with kids and are not widowed, and frequent bar-goers under 30, are more likely to redeem the coupon (about 70%).
# Customers with income below 50K who frequent restaurants costing less than $20 are also likely to redeem coupons.

7.  Based on these observations, what do you hypothesize about drivers who accepted the bar coupons?

In [None]:
## Hypothesis :
# Customers who go to a bar frequently are more likely to redeem the bar coupon than those who do not.
# This group skews towards people under 30 who are not widowed and are not driving with kids. Note that the analysis only looked at
# one marital status (widowed) and one socializing signal (whether kids are in the car).
# Similarly, customers whose income is < 50K and who frequent restaurants with a bill under $20 are highly likely (~60%) to redeem coupons.

### Further Investigation

Similar to the bar coupon, exploring one of the other coupon groups to determine the characteristics of passengers who accept the coupons.  

In [None]:
## Based on the plot "Comparison of different types of Coupons that are redeemed vs not" further above,
# the 2 coupon types redeemed most often are "Restaurant(<20)" and "Carry out & Take away", so these 2 are explored further.
# data_restaurants = data_clean.query('coupon == "Restaurant(<20)" or coupon == "Carry out & Take away"')
data_restaurants = data_clean.query('coupon == "Restaurant(<20)"')
#print(data_restaurants_lean.income.unique())
px.histogram(data_frame=data_restaurants, x='Y', color='income', title="Restaurant < 20 coupon redemption across income ranges")
## --- Take Away
# The plot below shows that people with income levels "$12500 - $24999", "$25000 - $37499", and "$50000 - $62499" are more likely to redeem the coupon,
# along with, surprisingly, customers with "$100000 or more" income.


In [None]:
data_carryout = data_clean.query('coupon == "Carry out & Take away"')
print(len(data_carryout))
px.histogram(data_frame=data_carryout, x='Y', color='income', title="Carryout & Takeaway coupon redemption across income ranges")
## ---- Take Aways :
# Customers with income ranges "$12500 - $24999" and "$25000 - $37499" are more likely to redeem coupons at Carry out & Take away places.

In [None]:
#px.histogram(data_frame=data_carryout, x='Y', color='time', title="Carryout & Takeaway coupon redemption across income ranges")
#px.scatter(data_frame=data_carryout, x='income', y='time', color='time', title="Carryout & Takeaway coupon redemption across income ranges")
px.bar(data_frame=data_carryout, x='time', color='Y', title="Spread of Coupons across Time")

## -- Take away :
# A good chunk of the coupons were issued for 7AM
# TODO - find the proportion of coupons accepted in each time slot
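
The TODO above could be approached with a `groupby` on the time slot that reports both how many coupons were issued and what share was accepted. The sketch below uses made-up data. The slot labels follow the values mentioned in this notebook, their ordering is an assumption, and in the notebook the same aggregation could be run on `data_carryout`.

```python
import pandas as pd

# Assumed chronological order of the time-slot labels mentioned in this notebook.
slot_order = ["7AM", "10AM", "2PM", "6PM"]

# Hypothetical toy data; the values are made up for illustration.
toy = pd.DataFrame({
    "time": ["7AM", "7AM", "7AM", "10AM", "2PM", "2PM", "6PM", "6PM"],
    "Y":    [1, 1, 0, 1, 0, 1, 0, 0],
})

# An ordered categorical keeps the slots in chronological rather than alphabetical order.
toy["time"] = pd.Categorical(toy["time"], categories=slot_order, ordered=True)

# Per slot: number of coupons issued and the acceptance rate (mean of the 0/1 column).
by_time = toy.groupby("time", observed=True)["Y"].agg(issued="count", rate="mean")

print(by_time)
```

Reporting the count next to the rate makes it clear when a high or low rate rests on only a handful of coupons.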
