### Will a Customer Accept the Coupon?

**Context**

Imagine driving through town when a coupon is delivered to your cell phone for a restaurant near where you are driving. Would you accept that coupon and take a short detour to the restaurant? Would you accept the coupon but use it on a subsequent trip? Would you ignore the coupon entirely? What if the coupon was for a bar instead of a restaurant? What about a coffee house? Would you accept a bar coupon with a minor passenger in the car? What about if it was just you and your partner in the car? Would weather impact the rate of acceptance? What about the time of day?

Proximity to the business clearly affects whether the coupon is delivered to the driver at all, but which factors determine whether a driver accepts the coupon once it has been delivered? How would you determine whether a driver is likely to accept a coupon?

**Overview**

The goal of this project is to use what you know about visualizations and probability distributions to distinguish between customers who accepted a driving coupon versus those that did not.

**Data**

This data comes to us from the UCI Machine Learning repository and was collected via a survey on Amazon Mechanical Turk. The survey describes different driving scenarios, including the destination, current time, weather, passenger, etc., and then asks each respondent whether they would accept the coupon if they were the driver. Answers that the user will drive there ‘right away’ or ‘later before the coupon expires’ are labeled as ‘Y = 1’ and answers ‘no, I do not want the coupon’ are labeled as ‘Y = 0’. There are five different types of coupons: less expensive restaurants (under \\$20), coffee houses, carry out & take away, bar, and more expensive restaurants (\\$20 - \\$50).

**Deliverables**

Your final product should be a brief report that highlights the differences between customers who did and did not accept the coupons. To explore the data, you will use your knowledge of plotting, statistical summaries, and visualization in Python. You will publish your findings in a public-facing GitHub repository as your first portfolio piece.





### Data Description
Keep in mind that these values mentioned below are average values.

The attributes of this data set include:
1. User attributes
    -  Gender: male, female
    -  Age: below 21, 21 to 25, 26 to 30, etc.
    -  Marital Status: single, married partner, unmarried partner, or widowed
    -  Number of children: 0, 1, or more than 1
    -  Education: high school, bachelors degree, associates degree, or graduate degree
    -  Occupation: architecture & engineering, business & financial, etc.
    -  Annual income: less than \\$12500, \\$12500 - \\$24999, \\$25000 - \\$37499, etc.
    -  Number of times that he/she goes to a bar: 0, less than 1, 1 to 3, 4 to 8 or greater than 8
    -  Number of times that he/she buys takeaway food: 0, less than 1, 1 to 3, 4 to 8 or greater than 8
    -  Number of times that he/she goes to a coffee house: 0, less than 1, 1 to 3, 4 to 8 or greater than 8
    -  Number of times that he/she eats at a restaurant with average expense less than \\$20 per person: 0, less than 1, 1 to 3, 4 to 8 or greater than 8
    -  Number of times that he/she eats at a restaurant with average expense of \\$20 to \\$50 per person: 0, less than 1, 1 to 3, 4 to 8 or greater than 8
    

2. Contextual attributes
    - Driving destination: home, work, or no urgent destination
    - Location of user, coupon and destination: we provide a map to show the geographical
    location of the user, destination, and the venue, and we mark the distance between each
    two places with time of driving. The user can see whether the venue is in the same
    direction as the destination.
    - Weather: sunny, rainy, or snowy
    - Temperature: 30F, 55F, or 80F
    - Time: 10AM, 2PM, or 6PM
    - Passenger: alone, partner, kid(s), or friend(s)


3. Coupon attributes
    - time before it expires: 2 hours or one day

In [169]:
!pip3 install pandas
!pip3 install plotly
!pip3 install numpy
!pip3 install seaborn
!pip3 install matplotlib


In [171]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
import plotly.express as px

### Problems

Use the prompts below to get started with your data analysis.  

1. Read in the `coupons.csv` file.




In [172]:
data = pd.read_csv('coupons.csv')

In [173]:
data.head(100)

Unnamed: 0,destination,passanger,weather,temperature,time,coupon,expiration,gender,age,maritalStatus,...,CoffeeHouse,CarryAway,RestaurantLessThan20,Restaurant20To50,toCoupon_GEQ5min,toCoupon_GEQ15min,toCoupon_GEQ25min,direction_same,direction_opp,Y
0,No Urgent Place,Alone,Sunny,55,2PM,Restaurant(<20),1d,Female,21,Unmarried partner,...,never,,4~8,1~3,1,0,0,0,1,1
1,No Urgent Place,Friend(s),Sunny,80,10AM,Coffee House,2h,Female,21,Unmarried partner,...,never,,4~8,1~3,1,0,0,0,1,0
2,No Urgent Place,Friend(s),Sunny,80,10AM,Carry out & Take away,2h,Female,21,Unmarried partner,...,never,,4~8,1~3,1,1,0,0,1,1
3,No Urgent Place,Friend(s),Sunny,80,2PM,Coffee House,2h,Female,21,Unmarried partner,...,never,,4~8,1~3,1,1,0,0,1,0
4,No Urgent Place,Friend(s),Sunny,80,2PM,Coffee House,1d,Female,21,Unmarried partner,...,never,,4~8,1~3,1,1,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,No Urgent Place,Friend(s),Sunny,80,2PM,Restaurant(<20),1d,Male,21,Single,...,less1,1~3,less1,1~3,1,1,0,0,1,1
96,No Urgent Place,Friend(s),Sunny,80,6PM,Coffee House,2h,Male,21,Single,...,less1,1~3,less1,1~3,1,0,0,0,1,1
97,No Urgent Place,Friend(s),Sunny,80,6PM,Restaurant(<20),2h,Male,21,Single,...,less1,1~3,less1,1~3,1,1,0,0,1,1
98,No Urgent Place,Friend(s),Sunny,55,2PM,Coffee House,2h,Male,21,Single,...,less1,1~3,less1,1~3,1,0,0,0,1,1


2. Investigate the dataset for missing or problematic data.

In [174]:
# Looking at the data, we see that the 'car' column is almost entirely empty (NaN).
print("number of null car entries:", data['car'].isnull().sum())

# CarryAway, RestaurantLessThan20, and Restaurant20To50 are frequency columns we may compare, so check them as well.
print("number of null CarryAway entries:", data['CarryAway'].isnull().sum())
print("number of null RestaurantLessThan20 entries:", data['RestaurantLessThan20'].isnull().sum())
print("number of null Restaurant20To50 entries:", data['Restaurant20To50'].isnull().sum())

# Each of these columns has a small number of NaNs. We keep them as-is: a NaN is an unanswered question, which is not equivalent to "never".

# Other notes:
# - the "(<20)" suffix in the coupon type "Restaurant(<20)" could be stripped, but this is not crucial
# - the frequency columns hold ranges (e.g. "1~3") rather than exact counts, so they are best treated as categories


number of null car entries: 12576
number of null CarryAway entries: 151
number of null RestaurantLessThan20 entries: 130
number of null Restaurant20To50 entries: 189
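
The same check can be run over every column at once instead of one column at a time. The following is a minimal sketch on a small made-up frame (`toy` and its values are hypothetical stand-ins, not rows from `coupons.csv`); it reports both the count and the share of missing entries per column, sorted so that the emptiest column comes first.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the survey data (hypothetical values, not the real file)
toy = pd.DataFrame({
    "car": [np.nan, np.nan, np.nan, "Scooter"],
    "Bar": ["never", np.nan, "1~3", "gt8"],
    "Y": [1, 0, 1, 1],
})

# Count and share of missing entries per column, emptiest column first
missing = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "share_missing": toy.isna().mean(),
}).sort_values("n_missing", ascending=False)

print(missing)
```

A column whose share of missing entries is close to 1 is a candidate for dropping, while a column with only a few gaps can usually be kept.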


3. Decide what to do about your missing data -- drop, replace, other...

In [175]:
# We will drop the 'car' column: nearly all of its entries are null (12576, per the check above), so it carries almost no information
# This also assumes only a weak link between car type and the decision to use the coupon
data = data.drop(columns=['car'])

4. What proportion of the total observations chose to accept the coupon? 



In [176]:
total = data["Y"].dropna().count()
accepted = len(data[data["Y"] == 1])
percentage_accepted = (accepted / total) * 100

print("total: ", total)
print("accepted: ", accepted)
print("percentage accepted: ", percentage_accepted)

total:  12684
accepted:  7210
percentage accepted:  56.84326710816777


5. Use a bar plot to visualize the `coupon` column.

In [177]:
# Plot the number of observations per coupon type (one bar per category)
px.bar(data["coupon"].value_counts())

6. Use a histogram to visualize the temperature column.

In [178]:
px.histogram(data["temperature"])

**Investigating the Bar Coupons**

Now, we will lead you through an exploration of just the bar related coupons.  

1. Create a new `DataFrame` that contains just the bar coupons.


In [179]:
only_bar = data[data["coupon"] == "Bar"]

2. What proportion of bar coupons were accepted?


In [180]:
len(only_bar.query("Y == 1"))

827
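
The count above answers "how many", while the question asks for a proportion. Because `Y` is coded as 0 or 1, its mean is the share of accepted coupons. A minimal sketch on made-up data (`toy_bar` is a hypothetical frame, not the real bar subset):

```python
import pandas as pd

# Hypothetical bar-coupon rows: 1 = accepted, 0 = declined
toy_bar = pd.DataFrame({"coupon": ["Bar"] * 5, "Y": [1, 0, 0, 1, 0]})

accepted_count = int(toy_bar["Y"].sum())   # number of accepted coupons
accepted_share = toy_bar["Y"].mean()       # accepted / all bar coupons

print(f"{accepted_count} of {len(toy_bar)} accepted ({accepted_share:.1%})")
```

Applied to the real subset, the same two lines would use `only_bar["Y"]` in place of `toy_bar["Y"]`.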

3. Compare the acceptance rate between those who went to a bar 3 or fewer times a month to those who went more.


In [181]:
# Denominator used below: bar-coupon rows with no missing value in any column
# (missing answers are stored as NaN, not as empty strings)
total = len(only_bar.dropna())
print("total: ", total)

# Accepted coupons from drivers who go to a bar 3 or fewer times a month
accepted_3_or_fewer = len(only_bar.query('Bar in ["never", "less1", "1~3"] and Y == 1'))
print("accepted, 3 or fewer bar visits: ", accepted_3_or_fewer)

# Accepted coupons from drivers who go to a bar more than 3 times a month
accepted_more_than_3 = len(only_bar.query('Bar in ["4~8", "gt8"] and Y == 1'))
print("accepted, more than 3 bar visits: ", accepted_more_than_3)

# Both figures divide by `total`, not by the size of each group, so they are shares of
# all complete bar-coupon rows rather than within-group acceptance rates
print("Accepted, 3 or fewer bar visits, as % of complete bar-coupon rows: ", (accepted_3_or_fewer/total)*100, "%")
print("Accepted, more than 3 bar visits, as % of complete bar-coupon rows: ", (accepted_more_than_3/total)*100, "%")

total:  1913
accepted, 3 or fewer bar visits:  666
accepted, more than 3 bar visits:  153
Accepted, 3 or fewer bar visits, as % of complete bar-coupon rows:  34.81442760062728 %
Accepted, more than 3 bar visits, as % of complete bar-coupon rows:  7.99790904338735 %
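
To compare acceptance rates between two groups, each group's accepted count has to be divided by that group's own size. One way to do this is to label every row with its group and take the mean of `Y` per group. The sketch below uses a made-up frame (`toy` and the `bar_group` column are hypothetical), so its numbers say nothing about the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical bar-coupon rows, including one unanswered frequency question
toy = pd.DataFrame({
    "Bar": ["never", "less1", "1~3", "1~3", "4~8", "gt8", "gt8", np.nan],
    "Y":   [0,       0,       1,     0,     1,     1,     0,     1],
})

# Rows with an unknown bar frequency cannot be assigned to either group
known = toy.dropna(subset=["Bar"]).copy()
known["bar_group"] = np.where(
    known["Bar"].isin(["4~8", "gt8"]), "more than 3", "3 or fewer"
)

# Group size, accepted count, and acceptance rate within each group
rates = known.groupby("bar_group")["Y"].agg(n="size", accepted="sum", rate="mean")
print(rates)
```

Here the `rate` column is directly comparable across groups because each row of `rates` has its own denominator `n`.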


4. Compare the acceptance rate between drivers who go to a bar more than once a month and are over the age of 25 to the all others.  Is there a difference?


In [182]:
# Work on an explicit copy so that converting the age does not write into a slice of `data`
copy_only_bar = data[data["coupon"] == "Bar"].copy()
# Keep only the digits of each age label: "below21" becomes 21 and "50plus" becomes 50
copy_only_bar['age'] = copy_only_bar['age'].str.replace('[^0-9]', '', regex=True).astype('int64')

# Accepted coupons from drivers over 25 (this query does not filter on bar frequency)
accepted_over_25 = copy_only_bar.query('age > 25 and Y == 1')
# Accepted coupons from drivers under 25 who never go to a bar (only part of the remaining drivers)
accepted_under_25_never = copy_only_bar.query('Bar == "never" and age < 25 and Y == 1')

# Both figures divide by `total` (complete bar-coupon rows), not by the size of each group
print("Accepted coupons from drivers over 25, as % of complete bar-coupon rows: ", (len(accepted_over_25) / total)*100, "%")
print("Accepted coupons from drivers under 25 who never go to a bar, as % of complete bar-coupon rows: ", (len(accepted_under_25_never) / total)*100, "%")


Accepted coupons from drivers over 25, as % of complete bar-coupon rows:  30.318870883429167 %
Accepted coupons from drivers under 25 who never go to a bar, as % of complete bar-coupon rows:  3.0841610036591742 %
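
A comparison of one group against "all others" can be written with a single boolean mask and its complement, which guarantees that every row lands in exactly one of the two groups. The sketch below runs on made-up data (`toy` is hypothetical) and assumes that "more than once a month" corresponds to the brackets `1~3`, `4~8`, and `gt8`.

```python
import pandas as pd

# Hypothetical bar-coupon rows
toy = pd.DataFrame({
    "Bar": ["1~3", "gt8", "never", "less1", "4~8", "never"],
    "age": ["26", "50plus", "21", "below21", "31", "41"],
    "Y":   [1, 1, 0, 0, 0, 1],
})

# Keep only the digits of each age label ("below21" -> 21, "50plus" -> 50)
age_num = toy["age"].str.replace(r"[^0-9]", "", regex=True).astype(int)

# Target group: goes to a bar more than once a month AND is over 25
in_group = toy["Bar"].isin(["1~3", "4~8", "gt8"]) & (age_num > 25)

# Each rate is computed within its own group; ~in_group is everyone else
group_rate = toy.loc[in_group, "Y"].mean()
others_rate = toy.loc[~in_group, "Y"].mean()

print(f"target group: {group_rate:.1%}, all others: {others_rate:.1%}")
```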



5. Use the same process to compare the acceptance rate between drivers who go to bars more than once a month and had passengers that were not a kid and had occupations other than farming, fishing, or forestry. 


In [183]:
# Accepted coupons where the passenger was not a kid and the occupation is not farming, fishing, or forestry
# (this query does not filter on bar frequency)
accepted_no_kids_not_fff = only_bar.query('Y == 1 and passanger != "Kid(s)" and occupation != "Farming Fishing & Forestry"')
# Accepted coupons from drivers who never go to a bar, had a kid passenger, and work in farming, fishing, or forestry
accepted_never_kids_fff = copy_only_bar.query('Bar == "never" and Y == 1 and passanger == "Kid(s)" and occupation == "Farming Fishing & Forestry"')

# Both figures divide by `total` (complete bar-coupon rows), not by the size of each group
print("Accepted, passenger not a kid and occupation not farming, fishing, or forestry, as % of complete bar-coupon rows: ", (len(accepted_no_kids_not_fff) / total)*100, "%")
print("Accepted, never goes to a bar, kid passenger, farming, fishing, or forestry occupation, as % of complete bar-coupon rows: ", (len(accepted_never_kids_fff) / total)*100, "%")


Accepted, passenger not a kid and occupation not farming, fishing, or forestry, as % of complete bar-coupon rows:  40.82592786199687 %
Accepted, never goes to a bar, kid passenger, farming, fishing, or forestry occupation, as % of complete bar-coupon rows:  0.0 %


6. Compare the acceptance rates between those drivers who:
- go to bars more than once a month, had passengers that were not a kid, and were not widowed *OR*
- go to bars more than once a month and are under the age of 30 *OR*
- go to cheap restaurants more than 4 times a month and income is less than 50K. 




In [184]:
# Work on an explicit copy so that converting the age does not write into a slice of `data`
copy_of_only_bar = data[data["coupon"] == "Bar"].copy()
# Keep only the digits of each age label: "below21" becomes 21 and "50plus" becomes 50
copy_of_only_bar['age'] = copy_of_only_bar['age'].str.replace('[^0-9]', '', regex=True).astype('int64')

# Rows where the driver is widowed (no bar-frequency or passenger filter)
widowed = copy_of_only_bar.query('maritalStatus == "Widowed"')

# Rows where the driver is under 30 (no bar-frequency filter)
under_30 = copy_of_only_bar.query('age < 30')

# Rows where the driver eats at cheap restaurants 4 or more times a month (no income filter)
cheap_rest_4_or_more = copy_of_only_bar.query('RestaurantLessThan20 in ["4~8", "gt8"]')

# None of these queries filters on Y, so the figures are group sizes relative to `total`, not acceptance rates
print("widowed, as % of complete bar-coupon rows: ", (len(widowed)/total)*100, "%")
print("under 30, as % of complete bar-coupon rows: ", (len(under_30)/total)*100, "%")
print("cheap restaurants 4 or more times a month, as % of complete bar-coupon rows: ", (len(cheap_rest_4_or_more)/total)*100, "%")


widowed, as % of complete bar-coupon rows:  1.097752221641401 %
under 30, as % of complete bar-coupon rows:  46.99424986931521 %
cheap restaurants 4 or more times a month, as % of complete bar-coupon rows:  39.41453214845792 %
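
The three criteria in this question are joined by *OR*, so one way to express them is as three boolean masks combined with `|`. The sketch below is an illustration on made-up rows (`toy` is hypothetical). It assumes that "more than once a month" means the brackets `1~3`, `4~8`, and `gt8`, that the income labels under 50K are the four listed in `under_50k`, and that `"$50000 - $62499"` is how a higher bracket is labeled.

```python
import pandas as pd

# Hypothetical bar-coupon rows
toy = pd.DataFrame({
    "Bar": ["1~3", "never", "gt8", "less1", "never"],
    "passanger": ["Alone", "Kid(s)", "Friend(s)", "Alone", "Partner"],
    "maritalStatus": ["Single", "Married partner", "Widowed", "Single", "Single"],
    "age": ["21", "31", "26", "50plus", "36"],
    "RestaurantLessThan20": ["less1", "4~8", "1~3", "gt8", "never"],
    "income": ["$50000 - $62499", "$25000 - $37499", "$50000 - $62499",
               "$50000 - $62499", "Less than $12500"],
    "Y": [1, 1, 0, 0, 0],
})

age_num = toy["age"].str.replace(r"[^0-9]", "", regex=True).astype(int)
frequent_bar = toy["Bar"].isin(["1~3", "4~8", "gt8"])
under_50k = ["Less than $12500", "$12500 - $24999",
             "$25000 - $37499", "$37500 - $49999"]

# One mask per criterion
cond_a = frequent_bar & (toy["passanger"] != "Kid(s)") & (toy["maritalStatus"] != "Widowed")
cond_b = frequent_bar & (age_num < 30)
cond_c = toy["RestaurantLessThan20"].isin(["4~8", "gt8"]) & toy["income"].isin(under_50k)

# A driver qualifies if any one of the three criteria holds
in_any = cond_a | cond_b | cond_c

rate_any = toy.loc[in_any, "Y"].mean()
rate_rest = toy.loc[~in_any, "Y"].mean()
print(f"meets any criterion: {rate_any:.1%}, meets none: {rate_rest:.1%}")
```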



7.  Based on these observations, what do you hypothesize about drivers who accepted the bar coupons?

In [187]:
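# Some observations:
# - the percentages above all divide by the same total of bar-coupon rows, so they reflect
#   how large each group is as much as how often it accepts
# - drivers who go to bars 3 or fewer times a month account for more of the accepted coupons
#   (666 vs. 153), which may simply mean there are more of them
# - hypothesis to test with within-group acceptance rates: acceptance depends on how often a
#   driver already goes to bars, with drivers under 30 who go more than once a month accepting most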
### Independent Investigation

Using the bar coupon example as motivation, you are to explore one of the other coupon groups and try to determine the characteristics of drivers who accept the coupons.

In [None]:
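# look at coffee house coupons
# .copy() gives an independent frame, so later column assignments do not write into a slice of `data`
only_coffee_house = data[data["coupon"] == "Coffee House"].copy()
# keep only the rows where the coffee house coupon was accepted
total_coffee_accepted = only_coffee_house[only_coffee_house["Y"] == 1]
total_c = len(total_coffee_accepted)
print("total coffee coupons accepted: ", total_c)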

In [193]:
# number of coffee house coupons per occupation
px.histogram(only_coffee_house, x="occupation", color="occupation")

In [None]:
# visit coffee house more than once a month, employed in Architecture & Engineering, and under age 50

# Keep only the digits of each age label: "below21" becomes 21 and "50plus" becomes 50
only_coffee_house = only_coffee_house.copy()
only_coffee_house['age'] = only_coffee_house['age'].str.replace('[^0-9]', '', regex=True).astype('int64')

# Assumption: "more than once a month" is taken to mean the brackets 1~3, 4~8, and gt8
frequent = ["1~3", "4~8", "gt8"]

# Acceptance rate within each group: the mean of Y over that group's own rows
more_than_once_a_month_and_eng_and_under_50 = only_coffee_house.query('CoffeeHouse in @frequent and age < 50 and occupation == "Architecture & Engineering"')
print("acceptance for more_than_once_a_month_and_eng_and_under_50:", more_than_once_a_month_and_eng_and_under_50["Y"].mean()*100, "%")

more_than_once_a_month_and_student_and_under_50 = only_coffee_house.query('CoffeeHouse in @frequent and age < 50 and occupation == "Student"')
print("acceptance for more_than_once_a_month_and_student_and_under_50:", more_than_once_a_month_and_student_and_under_50["Y"].mean()*100, "%")
