### Required Assignment 5.1: Will the Customer Accept the Coupon?

**Context**

Imagine driving through town and a coupon is delivered to your cell phone for a restaurant near where you are driving. Would you accept that coupon and take a short detour to the restaurant? Would you accept the coupon but use it on a subsequent trip? Would you ignore the coupon entirely? What if the coupon was for a bar instead of a restaurant? What about a coffee house? Would you accept a bar coupon with a minor passenger in the car? What about if it was just you and your partner in the car? Would weather impact the rate of acceptance? What about the time of day?

Obviously, proximity to the business is a factor in whether the coupon is delivered to the driver, but which factors determine whether a driver accepts the coupon once it is delivered? How would you determine whether a driver is likely to accept a coupon?

**Overview**

The goal of this project is to use what you know about visualizations and probability distributions to distinguish between customers who accepted a driving coupon versus those that did not.

**Data**

This data comes to us from the UCI Machine Learning repository and was collected via a survey on Amazon Mechanical Turk. The survey describes different driving scenarios including the destination, current time, weather, passenger, etc., and then asks respondents whether they would accept the coupon if they were the driver. Answers that the user will drive there ‘right away’ or ‘later before the coupon expires’ are labeled as ‘Y = 1’ and answers ‘no, I do not want the coupon’ are labeled as ‘Y = 0’. There are five different types of coupons: less expensive restaurants (under \$20), coffee houses, carry out & take away, bar, and more expensive restaurants (\$20 - \$50).

**Deliverables**

Your final product should be a brief report that highlights the differences between customers who did and did not accept the coupons. To explore the data you will use your knowledge of plotting, statistical summaries, and visualization using Python. You will publish your findings in a public-facing GitHub repository as your first portfolio piece.





### Data Description
Keep in mind that these values mentioned below are average values.

The attributes of this data set include:
1. User attributes
    -  Gender: male, female
    -  Age: below 21, 21 to 25, 26 to 30, etc.
    -  Marital Status: single, married partner, unmarried partner, or widowed
    -  Number of children: 0, 1, or more than 1
    -  Education: high school, bachelors degree, associates degree, or graduate degree
    -  Occupation: architecture & engineering, business & financial, etc.
    -  Annual income: less than \$12500, \$12500 - \$24999, \$25000 - \$37499, etc.
    -  Number of times that he/she goes to a bar: 0, less than 1, 1 to 3, 4 to 8 or greater than 8
    -  Number of times that he/she buys takeaway food: 0, less than 1, 1 to 3, 4 to 8 or greater than 8
    -  Number of times that he/she goes to a coffee house: 0, less than 1, 1 to 3, 4 to 8 or greater than 8
    -  Number of times that he/she eats at a restaurant with average expense less than \$20 per person: 0, less than 1, 1 to 3, 4 to 8 or greater than 8
    -  Number of times that he/she eats at a restaurant with average expense of \$20 to \$50 per person: 0, less than 1, 1 to 3, 4 to 8 or greater than 8
    

2. Contextual attributes
    - Driving destination: home, work, or no urgent destination
    - Location of user, coupon and destination: we provide a map to show the geographical
    location of the user, destination, and the venue, and we mark the distance between each
    two places with time of driving. The user can see whether the venue is in the same
    direction as the destination.
    - Weather: sunny, rainy, or snowy
    - Temperature: 30F, 55F, or 80F
    - Time: 10AM, 2PM, or 6PM
    - Passenger: alone, partner, kid(s), or friend(s)


3. Coupon attributes
    - time before it expires: 2 hours or one day

In [126]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
import plotly.express as px


### Problems

Use the prompts below to get started with your data analysis.  

1. Read in the `coupons.csv` file.




In [127]:
data = pd.read_csv('data/coupons.csv')

In [128]:
data.head()
data.info()
data.describe()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12684 entries, 0 to 12683
Data columns (total 26 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   destination           12684 non-null  object
 1   passanger             12684 non-null  object
 2   weather               12684 non-null  object
 3   temperature           12684 non-null  int64 
 4   time                  12684 non-null  object
 5   coupon                12684 non-null  object
 6   expiration            12684 non-null  object
 7   gender                12684 non-null  object
 8   age                   12684 non-null  object
 9   maritalStatus         12684 non-null  object
 10  has_children          12684 non-null  int64 
 11  education             12684 non-null  object
 12  occupation            12684 non-null  object
 13  income                12684 non-null  object
 14  car                   108 non-null    object
 15  Bar                   12577 non-null

| | temperature | has_children | toCoupon_GEQ5min | toCoupon_GEQ15min | toCoupon_GEQ25min | direction_same | direction_opp | Y |
|---|---|---|---|---|---|---|---|---|
| count | 12684.0 | 12684.0 | 12684.0 | 12684.0 | 12684.0 | 12684.0 | 12684.0 | 12684.0 |
| mean | 63.301798 | 0.414144 | 1.0 | 0.561495 | 0.119126 | 0.214759 | 0.785241 | 0.568433 |
| std | 19.154486 | 0.492593 | 0.0 | 0.496224 | 0.32395 | 0.410671 | 0.410671 | 0.495314 |
| min | 30.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 55.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 50% | 80.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| 75% | 80.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| max | 80.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


2. Investigate the dataset for missing or problematic data.

In [129]:
data.info()
print(data.isnull().sum().sort_values(ascending=False))

# Optional: check for unexpected values in key columns
for col in ['coupon', 'weather', 'temperature']:
    print(f"{col}: {data[col].unique()[:10]}")

print("coffee house:", data['CoffeeHouse'].unique())

#print unique values for every column
for col in data.columns:
    print(f"{col}: {data[col].unique()[:10]}")



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12684 entries, 0 to 12683
Data columns (total 26 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   destination           12684 non-null  object
 1   passanger             12684 non-null  object
 2   weather               12684 non-null  object
 3   temperature           12684 non-null  int64 
 4   time                  12684 non-null  object
 5   coupon                12684 non-null  object
 6   expiration            12684 non-null  object
 7   gender                12684 non-null  object
 8   age                   12684 non-null  object
 9   maritalStatus         12684 non-null  object
 10  has_children          12684 non-null  int64 
 11  education             12684 non-null  object
 12  occupation            12684 non-null  object
 13  income                12684 non-null  object
 14  car                   108 non-null    object
 15  Bar                   12577 non-null
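
The `info()` and `describe()` outputs above suggest two kinds of problem columns: `car` is almost entirely empty (108 non-null values out of 12684) and `toCoupon_GEQ5min` never varies (standard deviation 0). A minimal sketch of how both checks could be automated follows. The `toy` frame and its values are hypothetical stand-ins for `data`, not rows from the survey.

```python
import pandas as pd

# Hypothetical toy frame standing in for `data`; values are illustrative only.
toy = pd.DataFrame({
    "car": [None, None, "Scooter", "Mazda5"],
    "Bar": ["never", None, "1~3", "less1"],
    "toCoupon_GEQ5min": [1, 1, 1, 1],
    "Y": [1, 0, 1, 1],
})

# Share of missing values per column, largest first.
missing_share = toy.isnull().mean().sort_values(ascending=False)

# Columns with at most one distinct non-null value cannot separate the groups.
constant_cols = [c for c in toy.columns if toy[c].nunique(dropna=True) <= 1]

print(missing_share)
print("constant columns:", constant_cols)
```

Running the same two expressions on `data` would give a ranked list of candidates to drop or impute before any plotting.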

3. Decide what to do about your missing data -- drop, replace, other...

In [130]:
#print unique values for every column
for col in data.columns:
    print(f"{col}: {data[col].unique()[:10]}")

destination: ['No Urgent Place' 'Home' 'Work']
passanger: ['Alone' 'Friend(s)' 'Kid(s)' 'Partner']
weather: ['Sunny' 'Rainy' 'Snowy']
temperature: [55 80 30]
time: ['2PM' '10AM' '6PM' '7AM' '10PM']
coupon: ['Restaurant(<20)' 'Coffee House' 'Carry out & Take away' 'Bar'
 'Restaurant(20-50)']
expiration: ['1d' '2h']
gender: ['Female' 'Male']
age: ['21' '46' '26' '31' '41' '50plus' '36' 'below21']
maritalStatus: ['Unmarried partner' 'Single' 'Married partner' 'Divorced' 'Widowed']
has_children: [1 0]
education: ['Some college - no degree' 'Bachelors degree' 'Associates degree'
 'High School Graduate' 'Graduate degree (Masters or Doctorate)'
 'Some High School']
occupation: ['Unemployed' 'Architecture & Engineering' 'Student'
 'Education&Training&Library' 'Healthcare Support'
 'Healthcare Practitioners & Technical' 'Sales & Related' 'Management'
 'Arts Design Entertainment Sports & Media' 'Computer & Mathematical']
income: ['$37500 - $49999' '$62500 - $74999' '$12500 - $24999' '$75000 - $8

In [131]:
# Remove 'car' column
data1 = data.drop(columns=['car']).copy()

# Fix passenger column values and spelling
data1.rename(columns={'passanger': 'passenger'}, inplace=True)
data1['passenger'] = data1['passenger'].replace({
    'Friend(s)': 'Friends',
    'Kid(s)': 'Kids'
})

# Convert age labels to integers ('below21' is coded as 18, '50plus' as 50)
data1['age'] = data1['age'].replace({
    'below21': 18,
    '21': 21,
    '26': 26,
    '31': 31,
    '36': 36,
    '41': 41,
    '46': 46,
    '50plus': 50
}).astype(int)

# Convert income to numeric midpoint
def income_to_midpoint(x):
    if 'Less than' in x:
        return 6250
    elif 'or More' in x:
        return 110000
    else:
        nums = [int(n.replace('$', '')) for n in x.split(' - ')]
        return sum(nums) / 2

data1['income_num'] = data1['income'].apply(income_to_midpoint)

# Fix education column values
data1['education'] = data1['education'].replace({
    'High School Graduate': 'High School',
    'Some college - no degree': 'Some College',
    'Associates degree': 'Associate',
    'Bachelors degree': 'Bachelor',
    'Graduate degree (Masters or Doctorate)': 'Graduate',
    'Some High School': 'High School (Some)'
})

# Create occupation groups
data1['occupation_group'] = data1['occupation'].replace({
    'Healthcare Practitioners & Technical': 'Healthcare',
    'Healthcare Support': 'Healthcare',
    'Education&Training&Library': 'Education',
    'Arts Design Entertainment Sports & Media': 'Arts & Media',
    'Architecture & Engineering': 'Engineering'
})

# Fix coupon column values
data1['coupon'] = data1['coupon'].replace({
    'Restaurant(<20)': 'Restaurant (<$20)',
    'Restaurant(20-50)': 'Restaurant ($20-$50)',
    'Carry out & Take away': 'Carryout/Takeaway'
})

# Fix frequency columns; missing answers are treated as 'Never'
freq_map = {
    'never': 'Never',
    'less1': '<1x/month',
    '1~3': '1–3x/month',
    '4~8': '4–8x/month',
    'gt8': '>8x/month'
}
for col in ['Bar', 'CoffeeHouse', 'CarryAway', 'RestaurantLessThan20', 'Restaurant20To50']:
    data1[col] = data1[col].replace(freq_map).fillna('Never')

# Relabel the three temperature values as descriptive categories
temp_map = {
    30: "Cold (30°F)",
    55: "Moderate (55°F)",
    80: "Hot (80°F)"
}
data1['temperature'] = data1['temperature'].map(temp_map)


Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`
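
The warning above is raised by `replace`, most likely by the `age` conversion, where string labels are swapped for integers. One possible way to sidestep it is `Series.map`, which builds a new Series from the dictionary instead of replacing values in place. The sketch below uses a hypothetical four-row sample of age labels. The explicit check is there because `map` returns NaN for any label missing from the dictionary.

```python
import pandas as pd

# Hypothetical sample of the raw `age` labels.
age_raw = pd.Series(["21", "below21", "50plus", "36"])

age_map = {"below21": 18, "21": 21, "26": 26, "31": 31,
           "36": 36, "41": 41, "46": 46, "50plus": 50}

# map() yields NaN for labels that are not keys of age_map, so verify first.
age_num = age_raw.map(age_map)
assert age_num.notna().all(), "unmapped age label"
age_num = age_num.astype(int)

print(age_num.tolist())  # [21, 18, 50, 36]
```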



In [132]:
#data1.info()
#print(data1.isnull().sum().sort_values(ascending=False))

print("passenger:", data1['passenger'].unique())

#print unique values for every column
for col in data1.columns:
    print(f"{col}: {data1[col].unique()[:10]}")

passenger: ['Alone' 'Friends' 'Kids' 'Partner']
destination: ['No Urgent Place' 'Home' 'Work']
passenger: ['Alone' 'Friends' 'Kids' 'Partner']
weather: ['Sunny' 'Rainy' 'Snowy']
temperature: ['Moderate (55°F)' 'Hot (80°F)' 'Cold (30°F)']
time: ['2PM' '10AM' '6PM' '7AM' '10PM']
coupon: ['Restaurant (<$20)' 'Coffee House' 'Carryout/Takeaway' 'Bar'
 'Restaurant ($20-$50)']
expiration: ['1d' '2h']
gender: ['Female' 'Male']
age: [21 46 26 31 41 50 36 18]
maritalStatus: ['Unmarried partner' 'Single' 'Married partner' 'Divorced' 'Widowed']
has_children: [1 0]
education: ['Some College' 'Bachelor' 'Associate' 'High School' 'Graduate'
 'High School (Some)']
occupation: ['Unemployed' 'Architecture & Engineering' 'Student'
 'Education&Training&Library' 'Healthcare Support'
 'Healthcare Practitioners & Technical' 'Sales & Related' 'Management'
 'Arts Design Entertainment Sports & Media' 'Computer & Mathematical']
income: ['$37500 - $49999' '$62500 - $74999' '$12500 - $24999' '$75000 - $87499'
 '$5

4. What proportion of the total observations chose to accept the coupon?



In [133]:
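# Keep a reference to the figure and show it explicitly; a bare px.histogram() call
# is only rendered when it is the last expression in the cell
fig = px.histogram(data1, x='Y', title='Distribution of Y')
fig.show()
accept_rate = data1['Y'].mean()
print(f"accepted percent: {accept_rate:.2%}")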


accepted percent: 56.84%
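
Because `Y` is coded 0/1, its mean is the share of accepted coupons, which is why `data1['Y'].mean()` answers the question. The short sketch below, on five hypothetical responses, shows that `value_counts(normalize=True)` gives the same figure along with the share of rejections.

```python
import pandas as pd

# Five hypothetical responses (1 = accepted, 0 = rejected).
y = pd.Series([1, 0, 1, 1, 0], name="Y")

accept_rate = y.mean()                     # mean of a 0/1 column = share of 1s
shares = y.value_counts(normalize=True)    # share of each distinct value

print(f"{accept_rate:.2%}")  # 60.00%
print(shares)
```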


5. Use a bar plot to visualize the `coupon` column.

In [134]:
#chart the total number of coupons accepted and not accepted by coupon type
fig_c = px.histogram(
    data1, x='coupon', color='Y', text_auto='.0f',
    color_discrete_sequence=px.colors.qualitative.Set1_r,
    title='Coupon Acceptance by Coupon Type'
)

#chart the percent of coupons accepted and not accepted by coupon type
fig_p = px.histogram(
    data1, x='coupon', color='Y', barnorm='percent', text_auto='.1f',
    color_discrete_sequence=px.colors.qualitative.Set1_r,
    title='Percent Coupon Acceptance by Coupon Type'
)
fig_c.show()
fig_p.show()

6. Use a histogram to visualize the temperature column.

In [135]:
fig = px.histogram(
    data1,
    x="temperature",
    title="Distribution of Temperature",
    color = 'Y',
    text_auto='.0f',
    color_discrete_sequence=px.colors.qualitative.Set1_r
)
fig.show()

**Investigating the Bar Coupons**

Now, we will lead you through an exploration of just the bar-related coupons.  

1. Create a new `DataFrame` that contains just the bar coupons.


In [136]:
#create DF with just bar coupons
bar_data = data1[data1['coupon'] == 'Bar']

2. What proportion of bar coupons were accepted?


In [137]:
#proportions of bar coupons accepted vs not accepted
bar_accept_rate = bar_data['Y'].mean()
print(f"Bar coupon acceptance: {bar_accept_rate:.2%}")

Bar coupon acceptance: 41.00%


3. Compare the acceptance rate between those who went to a bar 3 or fewer times a month to those who went more.


In [138]:
#compare acceptance rate between those who went to a bar 3 or fewer times a month vs more than 3 times a month
#assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning on the filtered slice
bar_data = bar_data.assign(Bar_freq=bar_data['Bar'].replace({
    'Never': '3 or fewer',
    '<1x/month': '3 or fewer',
    '1–3x/month': '3 or fewer',
    '4–8x/month': 'More than 3',
    '>8x/month': 'More than 3'
}))
freq_accept_rate = bar_data.groupby('Bar_freq')['Y'].mean()
print(freq_accept_rate.apply(lambda x: f"{x:.2%}"))

Bar_freq
3 or fewer     37.07%
More than 3    76.88%
Name: Y, dtype: object



4. Compare the acceptance rate between drivers who go to a bar more than once a month and are over the age of 25 to all others.  Is there a difference?


In [139]:
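#Compare the acceptance rate between frequent bar-goers over the age of 25 and all others
#Note: the filter below uses the 'More than 3' group (4 or more visits a month), which is stricter than the prompt's "more than once a month"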
group_accept_rate = bar_data.groupby((bar_data['Bar_freq'] == 'More than 3') & (bar_data['age'] > 25))['Y'].mean()
print(group_accept_rate.apply(lambda x: f"{x:.2%}"))

False    38.38%
True     77.21%
Name: Y, dtype: object
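
The cell above reuses the `Bar_freq` grouping, so its "frequent" group starts at four visits a month. A sketch that follows the prompt's wording more literally is shown below. It assumes that "more than once a month" corresponds to the `1–3x/month`, `4–8x/month` and `>8x/month` labels, which is an interpretation rather than something the survey states. The `toy` rows are hypothetical.

```python
import pandas as pd

# Hypothetical rows using the cleaned labels from this notebook.
toy = pd.DataFrame({
    "Bar": ["Never", "1–3x/month", "4–8x/month", ">8x/month", "<1x/month", "1–3x/month"],
    "age": [21, 31, 26, 21, 46, 41],
    "Y":   [0, 1, 1, 1, 0, 0],
})

# Assumption: these three labels together mean "more than once a month".
more_than_once = ["1–3x/month", "4–8x/month", ">8x/month"]
target = toy["Bar"].isin(more_than_once) & (toy["age"] > 25)

# Turn the boolean mask into readable group labels before grouping.
group = target.map({True: "frequent, over 25", False: "all others"})
rates = toy.groupby(group)["Y"].mean()
print(rates)
```

Swapping `toy` for `bar_data` would produce the comparison the prompt asks for; the resulting rates will differ from the ones printed above because the group definition is broader.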


5. Use the same process to compare the acceptance rate between drivers who go to bars more than once a month and had passengers that were not a kid and had occupations other than farming, fishing, or forestry.


In [140]:
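#Acceptance rate for drivers who go to bars more than once a month, had passengers that were not a kid, and had occupations other than farming, fishing, or forestry
#Note: the age > 25 filter from the previous question is also applied below, and only the target group's rate is computed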

customaudience1 = bar_data[
    ((bar_data['Bar'] != 'Never') & (bar_data['Bar'] != '<1x/month'))   # Bar filter
    & (bar_data['age'] > 25)                                           # Age filter
    & (bar_data['passenger'] != 'Kids')                                # Passenger filter
    & (bar_data['occupation'] != 'Farming Fishing & Forestry')          # Occupation filter
]

# Calculate acceptance rate
customaudience1_accept_rate = customaudience1['Y'].mean()

# Print as percentage
print(f"Custom Audience 1 acceptance: {customaudience1_accept_rate:.4%}")


Custom Audience 1 acceptance: 73.4748%


6. Compare the acceptance rates between those drivers who:

- go to bars more than once a month, had passengers that were not a kid, and were not widowed *OR*
- go to bars more than once a month and are under the age of 30 *OR*
- go to cheap restaurants more than 4 times a month and income is less than 50K.



In [141]:
bar_data[bar_data['passenger'] !='Kids'].groupby('occupation').size().sort_values(ascending=False)

occupation
Unemployed                                   273
Student                                      243
Computer & Mathematical                      196
Sales & Related                              169
Education&Training&Library                   113
Management                                   103
Office & Administrative Support              101
Arts Design Entertainment Sports & Media      86
Business & Financial                          74
Retired                                       69
Food Preparation & Serving Related            45
Healthcare Support                            43
Community & Social Services                   42
Healthcare Practitioners & Technical          34
Legal                                         33
Transportation & Material Moving              32
Protective Service                            24
Architecture & Engineering                    24
Personal Care & Service                       21
Construction & Extraction                     21
Life Phys
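
The occupation counts above do not yet answer prompt 6. A minimal sketch of how the three OR-ed conditions could be expressed as boolean masks follows. It runs on hypothetical rows and rests on several assumptions: "more than once a month" is read as the `1–3x/month` label or higher, "more than 4 times a month" is approximated by the `4–8x/month` and `>8x/month` labels, and income is compared through the `income_num` midpoint column created earlier.

```python
import pandas as pd

# Hypothetical rows using the cleaned labels and the income_num midpoints.
toy = pd.DataFrame({
    "Bar": ["1–3x/month", "Never", "4–8x/month", "<1x/month", "Never"],
    "passenger": ["Alone", "Kids", "Friends", "Alone", "Partner"],
    "maritalStatus": ["Single", "Married partner", "Widowed", "Single", "Single"],
    "age": [26, 41, 21, 31, 36],
    "RestaurantLessThan20": ["1–3x/month", ">8x/month", "Never", "4–8x/month", "<1x/month"],
    "income_num": [31249.5, 43749.5, 93749.5, 18749.5, 56249.5],
    "Y": [1, 1, 1, 0, 0],
})

bar_often = toy["Bar"].isin(["1–3x/month", "4–8x/month", ">8x/month"])

# One mask per bullet in the prompt.
cond_a = bar_often & (toy["passenger"] != "Kids") & (toy["maritalStatus"] != "Widowed")
cond_b = bar_often & (toy["age"] < 30)
cond_c = (
    toy["RestaurantLessThan20"].isin(["4–8x/month", ">8x/month"])
    & (toy["income_num"] < 50000)
)

# A driver belongs to the target group if any one of the conditions holds.
in_any = cond_a | cond_b | cond_c
group = in_any.map({True: "any condition", False: "none"})
rates = toy.groupby(group)["Y"].mean()
print(rates)
```

Keeping each condition in its own variable also makes it easy to report the three acceptance rates separately, for example with `toy.loc[cond_a, "Y"].mean()`.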

7.  Based on these observations, what do you hypothesize about drivers who accepted the bar coupons?

### Independent Investigation

Using the bar coupon example as motivation, you are to explore one of the other coupon groups and try to determine the characteristics of passengers who accept the coupons.  

In [142]:
bar_data['occupation'].unique()

array(['Unemployed', 'Architecture & Engineering', 'Student',
       'Education&Training&Library', 'Healthcare Support',
       'Healthcare Practitioners & Technical', 'Sales & Related',
       'Management', 'Arts Design Entertainment Sports & Media',
       'Computer & Mathematical', 'Life Physical Social Science',
       'Personal Care & Service', 'Community & Social Services',
       'Office & Administrative Support', 'Construction & Extraction',
       'Legal', 'Retired', 'Installation Maintenance & Repair',
       'Transportation & Material Moving', 'Business & Financial',
       'Protective Service', 'Food Preparation & Serving Related',
       'Production Occupations',
       'Building & Grounds Cleaning & Maintenance',
       'Farming Fishing & Forestry'], dtype=object)

In [None]:
#create a df with just carryout/takeaway coupons (label set by the coupon cleanup above)
carry_data = data1[data1['coupon'] == 'Carryout/Takeaway']

In [165]:
#perform some exploratory analysis on this new dataframe
#carry_data.info()
carry_data['occupation'].value_counts(normalize=True).head(10).apply(lambda x: f"{x:.2%}")

occupation
Unemployed                                  14.83%
Student                                     13.08%
Computer & Mathematical                     10.78%
Sales & Related                              8.44%
Education&Training&Library                   7.52%
Management                                   6.10%
Office & Administrative Support              5.14%
Arts Design Entertainment Sports & Media     4.81%
Business & Financial                         4.35%
Retired                                      3.76%
Name: proportion, dtype: object

In [169]:
carry_data['Y'].value_counts(normalize=True).apply(lambda x: f"{x:.2%}")

Y
1    73.55%
0    26.45%
Name: proportion, dtype: object

In [174]:
carry_pass = (
    carry_data.groupby('passenger')['Y']
    .mean()
    .reset_index(name='acceptance_rate')
)
carry_pass['acceptance_rate'] *= 100

fig = px.bar(carry_pass, x='passenger', y='acceptance_rate', text_auto='.1f',
             title="Carryout/Takeaway Acceptance by Passenger Type")
fig.show()

In [177]:
carry_age = (
    carry_data.groupby('age')['Y']
    .mean()
    .reset_index(name='acceptance_rate')
)
carry_age['acceptance_rate'] *= 100

fig = px.bar(
    carry_age, 
    x='age', 
    y='acceptance_rate',
    text_auto='.1f',   # show percentage on bars (1 decimal place)
    title="Carryout/Takeaway Acceptance by Age"
)
fig.update_yaxes(title="Acceptance Rate (%)")
fig.show()

In [178]:
carry_pass = (
    carry_data.groupby('passenger')['Y']
    .mean()
    .reset_index(name='acceptance_rate')
)
carry_pass['acceptance_rate'] *= 100

fig = px.bar(
    carry_pass, 
    x='passenger', 
    y='acceptance_rate',
    text_auto='.1f',
    title="Carryout/Takeaway Acceptance by Passenger"
)
fig.update_yaxes(title="Acceptance Rate (%)")
fig.show()

In [179]:
carry_occ = (
    carry_data.groupby('occupation')['Y']
    .mean()
    .reset_index(name='acceptance_rate')
)
carry_occ['acceptance_rate'] *= 100

fig = px.bar(
    carry_occ, 
    x='occupation', 
    y='acceptance_rate',
    text_auto='.1f',
    title="Carryout/Takeaway Acceptance by Occupation"
)
fig.update_yaxes(title="Acceptance Rate (%)")
fig.show()

In [180]:
carry_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2393 entries, 2 to 12680
Data columns (total 27 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   destination           2393 non-null   object 
 1   passenger             2393 non-null   object 
 2   weather               2393 non-null   object 
 3   temperature           2393 non-null   object 
 4   time                  2393 non-null   object 
 5   coupon                2393 non-null   object 
 6   expiration            2393 non-null   object 
 7   gender                2393 non-null   object 
 8   age                   2393 non-null   int64  
 9   maritalStatus         2393 non-null   object 
 10  has_children          2393 non-null   int64  
 11  education             2393 non-null   object 
 12  occupation            2393 non-null   object 
 13  income                2393 non-null   object 
 14  Bar                   2393 non-null   object 
 15  CoffeeHouse           239

In [183]:
#loop through all categorical columns and print the acceptance rate for each category

cat_cols = [
    'destination','passenger','weather','temperature','time','expiration',
    'gender','maritalStatus','education','occupation','income','occupation_group'
]

for col in cat_cols:
    rates = (
        carry_data.groupby(col)['Y']
        .mean()
        .reset_index(name='accept_rate')
    )
    rates['accept_rate'] *= 100
    print(f"\n=== {col} ===")
    print(rates.sort_values('accept_rate', ascending=False))


=== destination ===
       destination  accept_rate
0             Home    78.866769
1  No Urgent Place    76.278119
2             Work    65.485564

=== passenger ===
  passenger  accept_rate
1   Friends    75.778078
3   Partner    73.195876
0     Alone    72.740214
2      Kids    70.394737

=== weather ===
  weather  accept_rate
2   Sunny    76.287493
1   Snowy    70.684039
0   Rainy    61.128527

=== temperature ===
       temperature  accept_rate
0      Cold (30°F)    75.632490
1       Hot (80°F)    72.983114
2  Moderate (55°F)    71.875000

=== time ===
   time  accept_rate
2   2PM    86.697248
3   6PM    82.528736
1  10PM    75.921909
0  10AM    70.212766
4   7AM    65.485564

=== expiration ===
  expiration  accept_rate
0         1d    78.159341
1         2h    66.382070

=== gender ===
   gender  accept_rate
1    Male    75.888985
0  Female    71.370968

=== maritalStatus ===
       maritalStatus  accept_rate
4            Widowed    84.615385
2             Single    74.676724
1

In [None]:
cat_cols = [
    'destination','passenger','weather','temperature','time','expiration',
    'gender','maritalStatus','education','occupation','income','occupation_group']

for col in cat_cols:
    rates = (
        carry_data.groupby(col)['Y']
        .mean()
        .reset_index(name='accept_rate')
    )
    rates['accept_rate'] *= 100
    
    fig = px.bar(
        rates, x=col, y='accept_rate', text_auto='.1f',
        title=f"Carryout/Takeaway Acceptance by {col}"
    )
    fig.update_yaxes(title="Acceptance Rate (%)")
    fig.show()
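
The group-by, mean and rescale pattern is repeated in several cells above. One way to factor it out is a small helper such as the hypothetical `acceptance_by` below, which also returns the number of rows per category. The count is an addition that the cells above do not compute; it is included because a high rate in a small category (the `Widowed` row in the printed output, for instance) may rest on only a few responses. The `toy` data is illustrative.

```python
import pandas as pd

def acceptance_by(df: pd.DataFrame, col: str, target: str = "Y") -> pd.DataFrame:
    """Acceptance rate (%) and row count per category of `col`, highest rate first."""
    out = (
        df.groupby(col)[target]
        .agg(accept_rate="mean", n="count")
        .reset_index()
    )
    out["accept_rate"] *= 100
    return out.sort_values("accept_rate", ascending=False, ignore_index=True)

# Hypothetical rows standing in for carry_data.
toy = pd.DataFrame({
    "weather": ["Sunny", "Sunny", "Rainy", "Rainy", "Sunny", "Snowy"],
    "Y":       [1, 1, 0, 1, 0, 1],
})

rates = acceptance_by(toy, "weather")
print(rates)
```

With this helper, the loops above could call `acceptance_by(carry_data, col)` and pass the result straight to `px.bar`.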