### Will a Customer Accept the Coupon?

**Overview**

The goal of this project is to use what you know about visualizations and probability distributions to distinguish between customers who accepted a driving coupon and those who did not.

**Data**


This data comes to us from the UCI Machine Learning Repository and was collected via a survey on Amazon Mechanical Turk. The survey describes different driving scenarios, including the destination, current time, weather, passenger, etc., and then asks the respondent whether they would accept the coupon if they were the driver. Answers that the user will drive there ‘right away’ or ‘later before the coupon expires’ are labeled ‘Y = 1’, and answers of ‘no, I do not want the coupon’ are labeled ‘Y = 0’. There are five different types of coupons: less expensive restaurants (under \\$20), coffee houses, carry out & take away, bar, and more expensive restaurants (\\$20 - \\$50).

**Deliverables**

Your final product should be a brief report that highlights the differences between customers who did and did not accept the coupons. To explore the data, you will use your knowledge of plotting, statistical summaries, and visualization in Python. You will publish your findings in a public-facing GitHub repository as your first portfolio piece.





### Data Description

The attributes of this data set include:
1. User attributes
    -  Gender: male, female
    -  Age: below 21, 21 to 25, 26 to 30, etc.
    -  Marital Status: single, married partner, unmarried partner, or widowed
    -  Number of children: 0, 1, or more than 1
    -  Education: high school, bachelors degree, associates degree, or graduate degree
    -  Occupation: architecture & engineering, business & financial, etc.
    -  Annual income: less than \\$12500, \\$12500 - \\$24999, \\$25000 - \\$37499, etc.
    -  Number of times that he/she goes to a bar: 0, less than 1, 1 to 3, 4 to 8 or greater than 8
    -  Number of times that he/she buys takeaway food: 0, less than 1, 1 to 3, 4 to 8 or greater
    than 8
    -  Number of times that he/she goes to a coffee house: 0, less than 1, 1 to 3, 4 to 8 or
    greater than 8
    -  Number of times that he/she eats at a restaurant with average expense less than \\$20 per
    person: 0, less than 1, 1 to 3, 4 to 8 or greater than 8
    -  Number of times that he/she eats at a restaurant with average expense of \\$20 to \\$50 per person: 0, less than 1, 1 to 3, 4 to 8 or greater than 8

2. Contextual attributes
    - Driving destination: home, work, or no urgent destination
    - Location of user, coupon and destination: we provide a map to show the geographical
    location of the user, destination, and the venue, and we mark the distance between each
    two places with time of driving. The user can see whether the venue is in the same
    direction as the destination.
    - Weather: sunny, rainy, or snowy
    - Temperature: 30F, 55F, or 80F
    - Time: 10AM, 2PM, or 6PM
    - Passenger: alone, partner, kid(s), or friend(s)

3. Coupon attributes
    - Time before it expires: 2 hours or one day

In [1]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
import plotly.express as px

### Problems

Use the prompts below to get started with your data analysis.  

1. Read in the `coupons.csv` file.




In [2]:
data = pd.read_csv('data/coupons.csv')

In [3]:
data.head()

Unnamed: 0,destination,passanger,weather,temperature,time,coupon,expiration,gender,age,maritalStatus,...,CoffeeHouse,CarryAway,RestaurantLessThan20,Restaurant20To50,toCoupon_GEQ5min,toCoupon_GEQ15min,toCoupon_GEQ25min,direction_same,direction_opp,Y
0,No Urgent Place,Alone,Sunny,55,2PM,Restaurant(<20),1d,Female,21,Unmarried partner,...,never,,4~8,1~3,1,0,0,0,1,1
1,No Urgent Place,Friend(s),Sunny,80,10AM,Coffee House,2h,Female,21,Unmarried partner,...,never,,4~8,1~3,1,0,0,0,1,0
2,No Urgent Place,Friend(s),Sunny,80,10AM,Carry out & Take away,2h,Female,21,Unmarried partner,...,never,,4~8,1~3,1,1,0,0,1,1
3,No Urgent Place,Friend(s),Sunny,80,2PM,Coffee House,2h,Female,21,Unmarried partner,...,never,,4~8,1~3,1,1,0,0,1,0
4,No Urgent Place,Friend(s),Sunny,80,2PM,Coffee House,1d,Female,21,Unmarried partner,...,never,,4~8,1~3,1,1,0,0,1,0


2. Investigate the dataset for missing or problematic data.

In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12684 entries, 0 to 12683
Data columns (total 26 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   destination           12684 non-null  object
 1   passanger             12684 non-null  object
 2   weather               12684 non-null  object
 3   temperature           12684 non-null  int64 
 4   time                  12684 non-null  object
 5   coupon                12684 non-null  object
 6   expiration            12684 non-null  object
 7   gender                12684 non-null  object
 8   age                   12684 non-null  object
 9   maritalStatus         12684 non-null  object
 10  has_children          12684 non-null  int64 
 11  education             12684 non-null  object
 12  occupation            12684 non-null  object
 13  income                12684 non-null  object
 14  car                   108 non-null    object
 15  Bar                   12577 non-null

In [83]:
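# Cast the text columns used in later filters and plots from 'object' to the pandas 'string' dtype,
# and map the two non-numeric age labels to placeholder numbers so that age can be compared as an integer.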
data['coupon'] = data['coupon'].astype('string')
data['Bar'] = data['Bar'].astype('string')
data['age'] = data['age'].replace('below21',"15")
data['age'] = data['age'].replace('50plus',"55")
data['age'] = data['age'].astype(int)
data['income'] = data['income'].astype('string')
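
The age conversion above can also be written with a single mapping. This is only a minimal sketch on a made-up `Series`; the placeholder values 15 and 55 are the ones chosen in the cell above, and the name `age_map` is hypothetical.

```python
import pandas as pd

# Hypothetical sample of raw age labels, mirroring the labels used in the survey data
ages = pd.Series(["21", "below21", "50plus", "26", "46"])

# Map the two non-numeric labels to placeholder numbers, then convert everything to int
age_map = {"below21": "15", "50plus": "55"}
ages_numeric = ages.replace(age_map).astype(int)

print(ages_numeric.tolist())
```

Keeping the labels in one dictionary makes it easy to see, and later change, which placeholder stands in for each open-ended age bracket.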


In [6]:
#Here we will simply search for any columns that have NaN values
data.isna().any()

destination             False
passanger               False
weather                 False
temperature             False
time                    False
coupon                  False
expiration              False
gender                  False
age                     False
maritalStatus           False
has_children            False
education               False
occupation              False
income                  False
car                      True
Bar                      True
CoffeeHouse              True
CarryAway                True
RestaurantLessThan20     True
Restaurant20To50         True
toCoupon_GEQ5min        False
toCoupon_GEQ15min       False
toCoupon_GEQ25min       False
direction_same          False
direction_opp           False
Y                       False
dtype: bool
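
Beyond the True/False flags, it can help to see how many values are missing in each column. The following is a small sketch on a made-up frame; the values are hypothetical and only the column names mirror `coupons.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the survey data
toy = pd.DataFrame({
    "car": [np.nan, np.nan, np.nan, "example"],
    "Bar": ["never", np.nan, "1~3", "less1"],
    "Y": [1, 0, 1, 0],
})

# Number of missing values per column, and the same as a percentage of all rows
missing_count = toy.isna().sum()
missing_pct = toy.isna().mean() * 100

summary = pd.DataFrame({"missing": missing_count, "percent": missing_pct})
# Show only the columns that actually have missing values
print(summary[summary["missing"] > 0])
```

A column that is missing in almost every row is a different problem from one with a handful of gaps, and the percentages make that distinction visible.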

3. Decide what to do about your missing data -- drop, replace, other...

In [7]:
# In this case, we can't drop the rows that contain NaN values, because those rows may still hold other pertinent data.
# Instead, we exclude missing values through the filter criteria of each individual analysis.

4. What proportion of the total observations chose to accept the coupon? 



In [128]:
# Total number of rows in the DataFrame
df_total = len(data)
# Number of rows where the coupon was accepted
df_accepted = len(data.query(' Y == 1 '))
# Acceptance rate as a percentage
df_acceptedRate = ( df_accepted /df_total ) * 100
print(f'The proportion of the total observations that chose to accept a coupon was: {df_acceptedRate} %')

The proportion of the total observations that chose to accept a coupon was: 56.84326710816777 %
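
Because `Y` only holds 0 and 1, the same proportion can be read off as the mean of the column. A minimal sketch on made-up answers (the values are hypothetical):

```python
import pandas as pd

# Hypothetical answers: 1 = accepted, 0 = declined
toy = pd.DataFrame({"Y": [1, 0, 1, 1, 0]})

# The mean of a 0/1 column is the fraction of ones
accept_rate = toy["Y"].mean() * 100
print(f"Acceptance rate: {accept_rate:.2f} %")
```

Formatting with `:.2f` also keeps the printed percentage to two decimals.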


5. Use a bar plot to visualize the `coupon` column.

In [126]:
# Here we use a bar plot to visualize the coupon types. It's very easy on the eye!
px.bar(data, x="coupon", title="Visualization based on coupon type" , labels={'coupon':'Coupon Type', 'count':'Number of Coupons'})
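
An alternative is to count the coupon types first and plot the aggregated table, which gives one bar per type. This sketch uses pandas only, on a made-up column; the resulting `coupon_counts` frame could then be passed to a plotting function with `x="coupon"` and `y="count"`.

```python
import pandas as pd

# Hypothetical coupon column
toy = pd.DataFrame(
    {"coupon": ["Bar", "Coffee House", "Bar", "Coffee House", "Coffee House"]}
)

# One row per coupon type with the number of times it was offered
coupon_counts = (
    toy["coupon"]
    .value_counts()
    .rename_axis("coupon")
    .reset_index(name="count")
)
print(coupon_counts)
```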

6. Use a histogram to visualize the temperature column.

In [127]:
#here we will create a histogram to look at the temperature column amongst our data
px.histogram(data, x='temperature', text_auto=True)

**Investigating the Bar Coupons**

Now, we will lead you through an exploration of just the bar related coupons.  

1. Create a new `DataFrame` that contains just the bar coupons.


In [10]:
dfBarCoupon = data.query('coupon == "Bar"')

2. What proportion of bar coupons were accepted?


In [11]:
# The Y values are either 0 or 1, so summing the Y column of the bar-coupon DataFrame
# counts only the rows where the coupon was accepted.
totalBarCouponsAccepted = dfBarCoupon['Y'].sum()
print(f"Number of Bar coupons accepted: {totalBarCouponsAccepted}")
# We can see that 827 bar coupons were accepted.
# As a proportion, expressed as a percentage:
dflen = len(dfBarCoupon)
barCouponpercent = (totalBarCouponsAccepted / dflen) * 100
print(f'Proportion of bar coupons accepted: {barCouponpercent} %')

Number of Bar coupons accepted: 827
Proportion of bar coupons accepted: 41.00148735746158 %


3. Compare the acceptance rate between those who went to a bar 3 or fewer times a month to those who went more.


In [12]:
# First, find the drivers who went to a bar 3 or fewer times a month
dfBarAttendanceLTET3 = dfBarCoupon.query(' (Bar == "never") or (Bar == "less1") or (Bar == "1~3")')

# Then find the drivers who went to a bar more than 3 times a month
dfBarAttendanceGT3 = dfBarCoupon.query(' (Bar == "4~8") or (Bar == "gt8") ')

# Keep only the rows where the coupon was accepted
dfBarAttendanceLTET3 = dfBarAttendanceLTET3.query('Y == 1')
dfBarAttendanceGT3 = dfBarAttendanceGT3.query('Y == 1')

# Note: the percentage below is a group's share of all accepted bar coupons,
# not the acceptance rate within that group (which would divide by the group size before the Y filter).
totalAccepted = len(dfBarAttendanceLTET3) + len(dfBarAttendanceGT3)
print(f'The number of acceptances among those who went to a bar 3 or fewer times a month was {len(dfBarAttendanceLTET3)} ')
print(f'The number of acceptances among those who went to a bar more than 3 times a month was {len(dfBarAttendanceGT3)} ')
print(f'The share of accepted bar coupons that came from those who went to a bar 3 or fewer times was {(len(dfBarAttendanceLTET3) / totalAccepted) *100} %')

The number of acceptances among those who went to a bar 3 or fewer times a month was 666 
The number of acceptances among those who went to a bar more than 3 times a month was 153 
The share of accepted bar coupons that came from those who went to a bar 3 or fewer times was 81.31868131868131 %
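
To compare acceptance rates rather than counts, each group's acceptances need to be divided by that group's own size. The sketch below shows one way to do this on a made-up frame; the data and the group labels are hypothetical, and rows with a missing `Bar` answer are excluded as an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical bar-coupon rows
toy = pd.DataFrame({
    "Bar": ["never", "less1", "1~3", "4~8", "gt8", "1~3", "4~8", None],
    "Y":   [0,       0,       1,     1,     1,     0,     0,     1],
})

# Exclude rows where the bar frequency is unknown
known = toy.dropna(subset=["Bar"])

# Label each row by visit frequency, then take the mean of Y within each label
group = np.where(known["Bar"].isin(["4~8", "gt8"]), "more than 3", "3 or fewer")
rates = known.groupby(group)["Y"].mean() * 100

rate_3_or_fewer = rates["3 or fewer"]
rate_more_than_3 = rates["more than 3"]
print(rates)
```

With this layout the two percentages are directly comparable, because each one is measured against the drivers who were actually in that group.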


4. Compare the acceptance rate between drivers who go to a bar more than once a month and are over the age of 25 to the all others.  Is there a difference?


In [13]:
# First, take the bar-coupon rows for drivers who go to a bar at least once a month and are over 25.
# Note: writing .query(' Y == 1 ' and 'age > 25') would not filter on Y, because Python's `and`
# returns the second string; only the age filter is applied here.
dfBarAttendanceOnceMonthOver25 = dfBarCoupon.query(' (Bar == "1~3") or (Bar == "gt8") or (Bar == "4~8") ')
dfBarAttendanceOnceMonthOver25 = dfBarAttendanceOnceMonthOver25.query('age > 25')

# Next, take the same frequent bar-goers who are under 25
dfBarAttendanceOnceMonthUnder25 = dfBarCoupon.query(' (Bar == "1~3") or (Bar == "gt8") or (Bar == "4~8") ')
dfBarAttendanceOnceMonthUnder25 = dfBarAttendanceOnceMonthUnder25.query('age < 25')

totalWhoWentMoreThanOncePerMonth = len(dfBarAttendanceOnceMonthOver25) + len(dfBarAttendanceOnceMonthUnder25)
over25Share =  (len(dfBarAttendanceOnceMonthOver25) / totalWhoWentMoreThanOncePerMonth) * 100
under25Share =  (len(dfBarAttendanceOnceMonthUnder25) / totalWhoWentMoreThanOncePerMonth) * 100
print(f'Of those who go to a bar at least once a month, the share over the age of 25 is {over25Share} %')
print(f'Of those who go to a bar at least once a month, the share under the age of 25 is {under25Share} %')
print('These are group shares, not acceptance rates, so they do not yet show a difference in acceptance.')

Of those who go to a bar at least once a month, the share over the age of 25 is 70.46979865771812 %
Of those who go to a bar at least once a month, the share under the age of 25 is 29.53020134228188 %
These are group shares, not acceptance rates, so they do not yet show a difference in acceptance.
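
When two conditions must both hold, they belong inside a single query string. The sketch below, on made-up data, shows what Python does with `'Y == 1' and 'age > 25'` and how a combined filter and a within-group acceptance rate could be written instead.

```python
import pandas as pd

# Hypothetical rows: age as an integer, Y as 0/1
toy = pd.DataFrame({
    "age": [21, 26, 31, 46, 21, 36],
    "Y":   [1,  1,  0,  1,  0,  0],
})

# Python evaluates this expression before pandas sees it;
# for two non-empty strings, `and` returns the second one.
expr = 'Y == 1' and 'age > 25'
print(expr)  # prints: age > 25

only_age = toy.query(expr)                # filters on age only
both = toy.query('Y == 1 and age > 25')   # filters on both conditions

# Acceptance rate within the over-25 group: mean of Y among those rows
over25_rate = toy.query('age > 25')['Y'].mean() * 100
print(len(only_age), len(both), over25_rate)
```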


5. Construct a null and alternative hypothesis for the difference between groups of drivers who go to a bar more than once a month and are over the age of 25 to all other drivers. 

In [14]:
## skipped per instructions on Canvas ##

6. Using alpha at 0.05 test your hypothesis and state your conclusion.

In [25]:
## skipped per instructions on Canvas ##
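
For reference, one common way to test a difference between two acceptance rates is a two-proportion z-test, with the null hypothesis that both groups accept at the same rate. The sketch below is not part of this analysis: it uses only the standard library, and the counts passed in are made-up numbers, not results from `coupons.csv`. The function name `two_proportion_z_test` is hypothetical.

```python
import math

def two_proportion_z_test(x1, n1, x2, n2):
    """Return (z, two-sided p-value) for H0: both groups share one acceptance rate."""
    p1, p2 = x1 / n1, x2 / n2
    # Pooled rate under the null hypothesis
    p_pool = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Made-up counts: 60 of 100 accepted in one group, 40 of 100 in the other
z, p_value = two_proportion_z_test(60, 100, 40, 100)
alpha = 0.05
print(f"z = {z:.3f}, p = {p_value:.4f}, reject H0: {p_value < alpha}")
```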

7. Use the same process to compare the acceptance rate between drivers who go to bars more than once a month and had passengers that were not a kid and had occupations other than farming, fishing, or forestry. 


In [42]:
#here we will query for those folks who go to the bar at least once per month
dfBarAttendanceGTOPM = dfBarCoupon.query(' (Bar == "1~3") or (Bar == "gt8") or (Bar == "4~8") ')
# Narrow down to those who did not accept and did not have kids as passengers. The occupation filter is already met by this subset, so it is not applied explicitly.
dfBarAttendanceGTOPM0 = dfBarAttendanceGTOPM.query(' (Y ==0 ) and (passanger !=  "Kid(s)") ')
# Narrow down to those who accepted and did not have kids as passengers. The occupation filter is already met by this subset, so it is not applied explicitly.
dfBarAttendanceGTOPM1 = dfBarAttendanceGTOPM.query(' (Y ==1 ) and (passanger !=  "Kid(s)") ')
#evaluated the length
dfBarAttendanceGTOPM0_len = len(dfBarAttendanceGTOPM0)
#evaluated the length
dfBarAttendanceGTOPM1_len = len(dfBarAttendanceGTOPM1)
#get the total
dfBarAttendanceGTOPM_total = dfBarAttendanceGTOPM0_len + dfBarAttendanceGTOPM1_len
#acceptance rate
aRatedfBarAttendanceGTOPM = (dfBarAttendanceGTOPM1_len /dfBarAttendanceGTOPM_total) * 100
print(f'The acceptance rate of drivers who go to bars more than once a month, had passengers that were not a kid, and had occupations other than farming, fishing, or forestry was: {aRatedfBarAttendanceGTOPM} %')
#non-acceptance rate
aRatedfBarAttendanceGTOPM0 = (dfBarAttendanceGTOPM0_len /dfBarAttendanceGTOPM_total) * 100
print(f'Those who did not accept was: {aRatedfBarAttendanceGTOPM0} %')

The acceptance rate of drivers who go to bars more than once a month, had passengers that were not a kid, and had occupations other than farming, fishing, or forestry was: 71.32486388384754 %
Those who did not accept was: 28.67513611615245 %


8. Compare the acceptance rates between those passengers who:

- go to bars more than once a month, had passengers that were not a kid, and were not widowed *OR*
- go to bars more than once a month and are under the age of 30 *OR*
- go to cheap restaurants more than 4 times a month and income is less than 50K. 



In [99]:
#The goal here is to find: go to bars more than once a month, had passengers that were not a kid, and were not widowed
#here we apply our first filter to get those who went to a bar more than once a month
dfBarAttendanceGTOPMq8a = dfBarCoupon.query(' (Bar == "1~3") or (Bar == "gt8") or (Bar == "4~8") ')
# Now keep the drivers who did not have a kid as a passenger. The widowed filter was not necessary, as all widowed respondents went to a bar less than once a month or never.
dfBarAttendanceGTOPMq8a_total = len(dfBarAttendanceGTOPMq8a.query('(passanger !=  "Kid(s)") '))
dfBarAttendanceGTOPMq8a = dfBarAttendanceGTOPMq8a.query(' (Y ==1) and (passanger !=  "Kid(s)") ')
dfBarAttendanceGTOPMq8a_len = len(dfBarAttendanceGTOPMq8a)
dfBarAttendanceGTOPMq8aRate = (dfBarAttendanceGTOPMq8a_len / dfBarAttendanceGTOPMq8a_total) * 100
print(f'The acceptance rate of those who go to bars more than once a month and had passengers that were not a kid and were not widowed was: {dfBarAttendanceGTOPMq8aRate} %')

The acceptance rate of those who go to bars more than once a month and had passengers that were not a kid and were not widowed was: 71.32486388384754 %


In [101]:
#The goal here is to find: go to bars more than once a month and are under the age of 30
#here we apply our first filter to get those who went to a bar more than once a month
dfBarAttendanceGTOPMq8b = dfBarCoupon.query(' (Bar == "1~3") or (Bar == "gt8") or (Bar == "4~8") ')
#next apply age filters
dfBarAttendanceGTOPMq8b = dfBarAttendanceGTOPMq8b.query('(age == 21) or (age == 26) or (age == 15) ')
dfBarAttendanceGTOPMq8b_total = len(dfBarAttendanceGTOPMq8b)
#next apply filter of acceptance
dfBarAttendanceGTOPMq8b = dfBarAttendanceGTOPMq8b.query('Y == 1')
dfBarAttendanceGTOPMq8bRate = (len(dfBarAttendanceGTOPMq8b) / dfBarAttendanceGTOPMq8b_total  ) * 100


print(f'The acceptance rate of those who go to bars more than once a month and under the age of 30 is: {dfBarAttendanceGTOPMq8bRate} %')

The acceptance rate of those who go to bars more than once a month and under the age of 30 is: 72.17391304347827 %


In [95]:
# The goal here is to find: go to cheap restaurants more than 4 times a month and income is less than 50K
# Note: this filter runs on the full `data` DataFrame (all coupon types), not on `dfBarCoupon`.
cheapRestMT4ILT50k = data.query(' (RestaurantLessThan20 =="4~8" ) or (RestaurantLessThan20 =="gt8") ')
cheapRestMT4ILT50k = cheapRestMT4ILT50k.query(' (income == "Less than $12500") or (income == "$25000 - $37499") or (income == "$37500 - $49999") or (income == "$12500 - $24999") ')
cheapRestMT4ILT50k_total = len(cheapRestMT4ILT50k)
cheapRestMT4ILT50kAccepted = cheapRestMT4ILT50k.query (' Y==1 ')
cheapRestMT4ILT50kAccepted = len(cheapRestMT4ILT50kAccepted)
cheapRestMT4ILT50kRate = (cheapRestMT4ILT50kAccepted /  cheapRestMT4ILT50k_total) * 100
print(f'The acceptance rate of those who go to cheap restaurants more than 4x per month and have an income less than 50k is: {cheapRestMT4ILT50kRate} %')

The acceptance rate of those who go to cheap restaurants more than 4x per month and have an income less than 50k is: 60.07020623080298 %
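
The three comparisons above repeat the same pattern: filter, count, divide. A small helper can express that once. This is only a sketch on made-up rows; `acceptance_rate` is a hypothetical name, and the column names (including the `passanger` spelling) follow the data set.

```python
import pandas as pd

def acceptance_rate(df: pd.DataFrame, condition: str) -> float:
    """Percentage of rows matching `condition` that have Y == 1 (NaN if none match)."""
    subset = df.query(condition)
    return subset["Y"].mean() * 100

# Hypothetical bar-coupon rows
toy = pd.DataFrame({
    "Bar": ["1~3", "gt8", "never", "4~8", "less1", "1~3"],
    "passanger": ["Alone", "Kid(s)", "Alone", "Friend(s)", "Alone", "Partner"],
    "age": [21, 31, 26, 26, 46, 36],
    "Y": [1, 0, 0, 0, 1, 1],
})

# Shared condition for drivers who go to a bar at least once a month
frequent = 'Bar in ["1~3", "4~8", "gt8"]'

rate_no_kids = acceptance_rate(toy, f'{frequent} and passanger != "Kid(s)"')
rate_under_30 = acceptance_rate(toy, f'{frequent} and age < 30')
print(rate_no_kids, rate_under_30)
```

Keeping each condition in one query string also makes it easier to check that every filter named in the question is actually applied.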


9.  Based on these observations, what do you hypothesize about passengers who accepted the bar coupons?

Based on these observations, I hypothesize that customers who accepted the bar coupons were not widowed, and are more likely to accept if they are under 30 and do not have kids with them. They also appear to be typically male and to accept more often when the weather is sunny.

### Independent Investigation

Using the bar coupon example as motivation, you are to explore one of the other coupon groups and try to determine the characteristics of passengers who accept the coupons.  

I chose the coffee house coupon group.


In [107]:
# Let's create a DataFrame of the coffee house coupons that were accepted
dfCoffeeCoupon = data.query('(coupon == "Coffee House") and (Y ==1)')

In [129]:
# Now let's create a bar plot to look at where our customers were headed when they decided to redeem the coupon
px.bar(dfCoffeeCoupon, x="destination", title="Redemption By Destination")

From our chart, we can see that most people redeemed their coupon when they had no urgent place to go. We might expect the most redemptions to occur on the way to work, but that is not the case.

In [120]:
px.bar(dfCoffeeCoupon, x="gender", title="Redemption By Gender and Marital Status", color="maritalStatus")

From this chart, we can see that females redeemed the most, and they were typically married. This could be because they were buying for themselves and their spouses. Males redeemed less, and were typically single. This is a very interesting insight, and it makes me wonder whether married males do not buy coffee for their spouses.

In [130]:
# Here we use a histogram to show variations across occupation and gender
px.histogram(dfCoffeeCoupon, x="occupation", title="Redemption By Gender and Occupation", color="gender")

Lastly, we looked at who redeemed the most coffee house coupons by gender and occupation, with very interesting results. The occupations with the highest number of redemptions were Student, Unemployed, and Computer & Mathematical. Within the largest group, students, males redeemed more coupons than females. The acceptance rate here was 63.46%.
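
The charts above count accepted coupons, so larger groups naturally show taller bars. To compare groups of different sizes, an acceptance rate per category could be computed instead. The sketch below does this by destination on made-up rows; the data are hypothetical and the column labels `offered`, `accepted`, and `rate` are names chosen for this example.

```python
import pandas as pd

# Hypothetical rows covering two coupon types
toy = pd.DataFrame({
    "coupon": ["Coffee House"] * 6 + ["Bar"] * 2,
    "destination": ["Work", "Work", "Home", "No Urgent Place",
                    "No Urgent Place", "No Urgent Place", "Home", "Work"],
    "Y": [1, 0, 0, 1, 1, 0, 1, 1],
})

# Keep every coffee house coupon, accepted or not, so rates can be computed
coffee = toy[toy["coupon"] == "Coffee House"]

# Offers, acceptances, and acceptance rate per destination
by_destination = (
    coffee.groupby("destination")["Y"]
    .agg(offered="count", accepted="sum", rate="mean")
    .sort_values("rate", ascending=False)
)
by_destination["rate"] = by_destination["rate"] * 100
print(by_destination)
```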