# Starbucks Capstone Challenge

### Introduction

This data set contains simulated data that mimics customer behavior on the Starbucks rewards mobile app. Once every few days, Starbucks sends out an offer to users of the mobile app. An offer can be merely an advertisement for a drink or an actual offer such as a discount or BOGO (buy one get one free). Some users might not receive any offer during certain weeks. 

Not all users receive the same offer, and that is the challenge to solve with this data set.

Your task is to combine transaction, demographic and offer data to determine which demographic groups respond best to which offer type. This data set is a simplified version of the real Starbucks app because the underlying simulator only has one product whereas Starbucks actually sells dozens of products.

Every offer has a validity period before the offer expires. As an example, a BOGO offer might be valid for only 5 days. You'll see in the data set that informational offers have a validity period even though these ads are merely providing information about a product; for example, if an informational offer has 7 days of validity, you can assume the customer is feeling the influence of the offer for 7 days after receiving the advertisement.

You'll be given transactional data showing user purchases made on the app including the timestamp of purchase and the amount of money spent on a purchase. This transactional data also has a record for each offer that a user receives as well as a record for when a user actually views the offer. There are also records for when a user completes an offer. 

Keep in mind as well that someone using the app might make a purchase through the app without having received an offer or seen an offer.

### Example

To give an example, a user could receive a discount offer, "buy 10 dollars get 2 dollars off", on Monday. The offer is valid for 10 days from receipt. If the customer accumulates at least 10 dollars in purchases during the validity period, the customer completes the offer.

However, there are a few things to watch out for in this data set. Customers do not opt into the offers that they receive; in other words, a user can receive an offer, never actually view the offer, and still complete the offer. For example, a user might receive the "buy 10 dollars get 2 dollars off" offer, but the user never opens the offer during the 10-day validity period. The customer spends 15 dollars during those ten days. There will be an offer completion record in the data set; however, the customer was not influenced by the offer because the customer never viewed the offer.
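One way to separate influenced completions from incidental ones is to require an "offer viewed" event for the same customer and offer between receipt and completion, all inside the validity window. The sketch below is only an illustration: it assumes the events have already been flattened into dicts with `person`, `event`, `time` (hours) and `offer_id` keys, and the function name `influenced_completions` and the toy events are hypothetical.

```python
def influenced_completions(events, duration_days):
    """Return (person, offer_id, time) for completions preceded by a view.

    events: list of dicts with keys person, event, time (hours), offer_id.
    duration_days: dict mapping offer_id -> validity period in days.
    """
    received = {}   # (person, offer_id) -> time of the latest receipt
    viewed = {}     # (person, offer_id) -> time of the latest view
    influenced = []
    for e in sorted(events, key=lambda ev: ev["time"]):
        key = (e["person"], e["offer_id"])
        if e["event"] == "offer received":
            received[key] = e["time"]
            viewed.pop(key, None)  # a new receipt resets the view state
        elif e["event"] == "offer viewed":
            viewed[key] = e["time"]
        elif e["event"] == "offer completed" and key in received:
            start = received[key]
            end = start + 24 * duration_days[e["offer_id"]]  # days -> hours
            view = viewed.get(key)
            if view is not None and start <= view <= e["time"] <= end:
                influenced.append((e["person"], e["offer_id"], e["time"]))
    return influenced


# Toy data: p1 views the offer before completing it, p2 never views it.
events = [
    {"person": "p1", "event": "offer received", "time": 0, "offer_id": "d1"},
    {"person": "p1", "event": "offer viewed", "time": 6, "offer_id": "d1"},
    {"person": "p1", "event": "offer completed", "time": 48, "offer_id": "d1"},
    {"person": "p2", "event": "offer received", "time": 0, "offer_id": "d1"},
    {"person": "p2", "event": "offer completed", "time": 30, "offer_id": "d1"},
]
result = influenced_completions(events, {"d1": 10})
```

Only p1's completion is kept here; p2's completion record exists but is treated as incidental.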

### Cleaning

This makes data cleaning especially important and tricky.

You'll also want to take into account that some demographic groups will make purchases even if they don't receive an offer. From a business perspective, if a customer is going to make a 10 dollar purchase without an offer anyway, you wouldn't want to send a buy 10 dollars get 2 dollars off offer. You'll want to try to assess what a certain demographic group will buy when not receiving any offers.
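A rough baseline can be estimated from customers who never received an offer. The sketch below assumes a flattened event table with `customer_id`, `gender`, `event` and `amount` columns (toy values, not taken from the data set). It is a simplification: a fuller version would also count transactions that fall outside every validity window.

```python
import pandas as pd

events = pd.DataFrame({
    "customer_id": ["a", "a", "b", "b", "c", "c"],
    "gender":      ["F", "F", "M", "M", "F", "F"],
    "event": ["transaction", "transaction", "offer received",
              "transaction", "transaction", "transaction"],
    "amount": [4.0, 6.0, None, 12.0, 10.0, 20.0],
})

# Customers with at least one "offer received" event
had_offer = events.loc[events["event"] == "offer received", "customer_id"].unique()

# Transactions from customers who never received an offer
baseline = events[(events["event"] == "transaction")
                  & ~events["customer_id"].isin(had_offer)]

# Average spend per transaction for each group without any offer
baseline_spend = baseline.groupby("gender")["amount"].mean()
```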

### Final Advice

Because this is a capstone project, you are free to analyze the data any way you see fit. For example, you could build a machine learning model that predicts how much someone will spend based on demographics and offer type. Or you could build a model that predicts whether or not someone will respond to an offer. Or, you don't need to build a machine learning model at all. You could develop a set of heuristics that determine what offer you should send to each customer (i.e., 75 percent of women customers who were 35 years old responded to offer A vs 40 percent from the same demographic to offer B, so send offer A).
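The heuristic approach can be sketched as a response-rate table with one row per demographic group and one column per offer. The group labels, offer names and `responded` flags below are made-up toy values, and how a "response" is defined is left as an assumption.

```python
import pandas as pd

offers = pd.DataFrame({
    "group":     ["F 35-44"] * 4 + ["M 35-44"] * 4,
    "offer":     ["A", "A", "B", "B", "A", "A", "B", "B"],
    "responded": [1, 1, 1, 0, 0, 0, 1, 1],
})

# Response rate for every (group, offer) pair, offers as columns
rates = offers.groupby(["group", "offer"])["responded"].mean().unstack()

# Offer with the highest response rate for each group
best_offer = rates.idxmax(axis=1)
```

With real data, each group should have enough observations per offer before its rate is trusted.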

# Ref 
[link](https://github.com/hamadalaqeel/Starbuck-s-Capstone-Project/blob/master/Starbucks_Capstone_notebook.ipynb)

# Data Sets

The data is contained in three files:

* portfolio.json - containing offer ids and meta data about each offer (duration, type, etc.)
* profile.json - demographic data for each customer
* transcript.json - records for transactions, offers received, offers viewed, and offers completed

Here is the schema and explanation of each variable in the files:

**portfolio.json**
* id (string) - offer id
* offer_type (string) - type of offer ie BOGO, discount, informational
* difficulty (int) - minimum required spend to complete an offer
* reward (int) - reward given for completing an offer
* duration (int) - time for offer to be open, in days
* channels (list of strings)

**profile.json**
* age (int) - age of the customer 
* became_member_on (int) - date when customer created an app account
* gender (str) - gender of the customer (note some entries contain 'O' for other rather than M or F)
* id (str) - customer id
* income (float) - customer's income

**transcript.json**
* event (str) - record description (ie transaction, offer received, offer viewed, etc.)
* person (str) - customer id
* time (int) - time in hours since start of test. The data begins at time t=0
* value - (dict of strings) - either an offer id or transaction amount depending on the record
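Because `value` holds a different key depending on the record, it helps to unpack it into separate columns. This is a small sketch on toy rows; it assumes the offer key can appear as either `offer id` or `offer_id`, and that completion records may carry extra keys such as `reward`.

```python
import pandas as pd

transcript_sample = pd.DataFrame({
    "event": ["offer received", "offer completed", "transaction"],
    "value": [{"offer id": "abc"},
              {"offer_id": "abc", "reward": 2},
              {"amount": 9.5}],
})

def offer_id_of(value):
    # Look up either spelling of the key; None when the record has no offer
    return value.get("offer_id", value.get("offer id"))

transcript_sample["offer_id"] = transcript_sample["value"].map(offer_id_of)
transcript_sample["amount"] = transcript_sample["value"].map(lambda v: v.get("amount"))
```

Looking keys up by name avoids relying on the order of keys inside each dict.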


# Business Understanding

The objective is to find patterns in the data that show which offer to send to which customer, and when.

In [219]:
# Import relevant modules to project
import pandas as pd
import numpy as np
import math
import json

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler

%matplotlib inline

import warnings
warnings.filterwarnings("ignore")

In [220]:
# Read the json files
portfolio = pd.read_json('../input/udacity-starbucks-capstone-project/portfolio.json', orient='records', lines=True)
profile = pd.read_json('../input/udacity-starbucks-capstone-project/profile.json', orient='records', lines=True)
transcript = pd.read_json('../input/udacity-starbucks-capstone-project/transcript.json', orient='records', lines=True)


# portfolio = pd.read_json('../input/udacity-starbucks-capstone-project/portfolio.json', orient='records')
# profile = pd.read_json('../input/udacity-starbucks-capstone-project/profile.json', orient='records')
# transcript = pd.read_json('../input/udacity-starbucks-capstone-project/transcript.json', orient='records')

# 1. Data Cleaning
## 1.1. Portfolio
- Update the name of the id column to offer_id.
- Divide the channels into a number of columns.
- Offer_type should be split across different columns.


In [221]:
portfolio.head()

In [222]:
portfolio.describe()

In [223]:
portfolio.info()

In [224]:
# Update the name of the id column to offer_id.
portfolio.rename(columns={'id':'offer_id'},inplace=True)

# Divide the channels into a number of columns.
# sum(level=0) was removed in pandas 2.0; group on the first index level instead.
channel_dummies = pd.get_dummies(portfolio['channels'].apply(pd.Series).stack(), prefix='channel').groupby(level=0).sum()
portfolio = pd.concat([portfolio,channel_dummies],axis=1)
portfolio.drop(columns=['channels'],inplace=True)

# Result:
portfolio.head()
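An alternative way to split a list column into indicator columns is `explode` followed by `get_dummies`. The sketch below runs on a toy frame; the offer ids and channel lists are illustrative, not the real portfolio.

```python
import pandas as pd

offers = pd.DataFrame({
    "offer_id": ["o1", "o2"],
    "channels": [["email", "mobile"], ["web", "email", "social"]],
})

# One row per (offer, channel), one-hot encode, then collapse back per offer
exploded = offers["channels"].explode()
channel_flags = (pd.get_dummies(exploded, prefix="channel")
                   .groupby(level=0).max()
                   .astype(int))

offers = offers.drop(columns="channels").join(channel_flags)
```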

## 1.2. Profile
- Replace the name "id" with "customer_id".
- Convert "became_member_on" to a datetime.
- Flag the irregular ages in the "age" column.
- In the gender and income columns, there are 17,000 - 14,825 = 2,175 missing values.

In [225]:
profile.head()

In [226]:
profile.info()

In [227]:
profile.describe()

In [228]:
# Replace the name "id" with "customer_id".
profile.rename(columns={'id':'customer_id'},inplace=True)

# Fix the date: convert the YYYYMMDD integers to datetimes in one vectorized call.
profile['became_member_on'] = pd.to_datetime(profile['became_member_on'].astype(str), format='%Y%m%d')

# Handle missing values in the "gender" & "income" columns.
# gender_dummies = pd.get_dummies(profile['gender'],prefix='gender')
# profile = pd.concat([profile.drop(columns=['gender']),gender_dummies],axis=1)
# profile['gender'].fillna('NA', inplace=True)
# Missing incomes are filled with the mean income (assignment instead of a chained inplace fillna).
profile['income'] = profile['income'].fillna(profile['income'].mean())

# Some customers have irregular ages. Flag them in a new column, "valid", so they can be excluded later.
ages = profile['age'].unique()
perct_old_ages = profile['age'][profile['age'] > 100].count()/profile['age'].count() * 100
print('''
Unique ages in the df: {},
% of customers with age > 100: {} %
'''.format(ages, round(perct_old_ages,2)))
profile['valid'] = profile['age'].apply(lambda x: 1 if x <= 100 else 0)

# Result:
profile.head()
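If the irregular ages are a placeholder rather than real values, they should sit on the same rows that lack gender and income. The sketch below checks this on toy rows; the value 118 and the other column contents are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

profile_sample = pd.DataFrame({
    "age":    [118, 55, 118, 42],
    "gender": [None, "F", None, "M"],
    "income": [np.nan, 72000.0, np.nan, 51000.0],
})

irregular = profile_sample["age"] > 100
missing_demo = profile_sample["gender"].isna() & profile_sample["income"].isna()

# True when the irregular ages and the missing demographics are the same rows
same_rows = bool((irregular == missing_demo).all())

# Treat the placeholder as missing so it does not distort the mean age
profile_sample["age"] = profile_sample["age"].where(~irregular)
mean_age = profile_sample["age"].mean()
```

If the check holds on the real profile, the average age is better computed after masking these rows.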


**TODO:** Missing values in "income" are currently filled with the column mean, and missing values in "gender" are left as they are. Revisit this choice.

## 1.3. Transcript
- Rename the "person" column to "customer_id".
- Get dummies for "event" column.
- Unlist the values in "value" column.



In [229]:
transcript.head()

In [230]:
transcript.info()

In [231]:
transcript.describe()

In [232]:
# Rename the "person" column to "customer_id".
transcript.rename(columns={'person':'customer_id'},inplace=True)

# Get dummies for "event" column.
# transcript['event'] = transcript['event'].apply(lambda x: x.replace(' ','_'))
# event_dummies = pd.get_dummies(transcript['event'], prefix='event')
# transcript = pd.concat([transcript.drop(columns=['event']), event_dummies], axis=1)

# Unlist the values in "value" column.
transcript['offer_id'] = [list(x.values())[0]  if (list(x.keys())[0] in ['offer_id', 'offer id']) else np.nan for x in transcript['value']]
transcript['amount'] = [list(x.values())[0]  if (list(x.keys())[0] in ['amount']) else np.nan for x in transcript['value']]
transcript.drop(columns=['value'],inplace=True)


# Result:
transcript.head()

## 1.4. Merge datasets

In [233]:
df = pd.merge(transcript, profile, on='customer_id', how="left")
df = pd.merge(df, portfolio, on='offer_id', how="left")
df.head()

In [234]:
# Simplify the offer_id (NaN is skipped so that transactions keep a missing offer_id):
offer_ids = df['offer_id'].dropna().unique()
offer_list = {offer: 'X'+str(cnt) for cnt, offer in enumerate(offer_ids, start=1)}
df['offer_id'] = df['offer_id'].apply(lambda x: offer_list.get(x, x))

# Simplify the customer_id:
customer_ids = profile['customer_id'].unique()
customer_list = {cus: 'A'+str(count) for count, cus in enumerate(customer_ids, start=1)}
df['customer_id'] = df['customer_id'].apply(lambda x: customer_list.get(x, x))

# Add "age_group" for analysis purpose
df['age_group'] = pd.cut(df['age'], bins=[0, 12, 18, 21, 64, 200], 
                        labels=['child', 'teen', 'young adult', 'adult', 'elderly'])

df.head()

# Example:
# offer_list = {'ae264e3637204a6fb9bb56bc8210ddfd': 'X1',
#                 '4d5c57ea9a6940dd891ad53e9dbe8da0': 'X2',
#                 '9b98b8c7a33c4b65b9aebfe6a799e6d9': 'B3',
#                 'f19421c1d4aa40978ebb69ca19b0e20d': 'B4',
#                 '0b1e1539f2cc45b7b9fa7c272da2e1d7': 'D1',
#                 '2298d6c36e964ae4a3e7e9706d1fb8c2': 'D2',
#                 'fafdcd668e3743c1bb461111dcafc2a4': 'D3',
#                 '2906b810c7d4411798c6938adc9daaa5': 'D4',
#                 '3f207df678b143eea3cee63160fa8bed': 'I1',
#                 '5a8bc65990b245e5a138643cd4eb9837': 'I2'}
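The same kind of short labels can be produced with `pd.factorize`, which numbers values in order of first appearance and gives missing values the code -1. This is a sketch on placeholder ids, not the real offer ids.

```python
import pandas as pd

ids = pd.Series(["id_a", "id_b", None, "id_a", "id_c"])

# codes: 0, 1, -1, 0, 2 (missing values get -1)
codes, uniques = pd.factorize(ids)

# Build "X1", "X2", ... and keep missing entries missing
short_ids = pd.Series(codes, index=ids.index).map(
    lambda c: f"X{c + 1}" if c >= 0 else None
)
```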

# 2. Analyze
## 2.1. Univariate Exploration

1. What is the average income of a Starbucks customer?
2. What is the average Starbucks customer's age?
3. Which of the following promotions is the most common?
4. What are the most common values in each column of each dataframe?
5. In terms of transcripts, who is the most loyal customer?

Let's start with the first question:

**1. What is the average income of a Starbucks customer?**

In [235]:
print('The average income for Starbucks customers: ', round(profile['income'].mean(),2))

**2. What is the average Starbucks customer's age?**

In [236]:
print('The average age for Starbucks customers: ', round(profile['age'].mean(),2))

**3. Which of the following promotions is the most common?**

BOGO and discount offers are the most common. Their counts are close to each other, with BOGO slightly higher.

In [237]:
def addlabels(x,y,rotation='horizontal'):
    '''
    INPUT:
    - x: an array of x labels
    - y: an array of y values
    - rotation: text rotation; the default is 'horizontal', could be changed to 'vertical' or a number of degrees.
    OUTPUT: None. Each value is written as a text label at half the height of its bar.
    '''
    for i in range(len(x)):
        plt.text(i,y[i]//2,y[i],horizontalalignment='center',rotation=rotation)

In [238]:
# Left panel: completed offers only. Right panel: all offer events by offer type.

plt.subplot(121)
count_offer_id = df[df['event'] == 'offer completed']['offer_id'].value_counts()
count_offer_id.plot(kind='bar',figsize=(15, 5), rot=45)
plt.xlabel('Offer ID')
plt.ylabel('Count')
plt.title('Distribution of Completed Promotions for each offer')
addlabels(count_offer_id.index, count_offer_id.values);

plt.subplot(122)
count_offer_type = df['offer_type'].value_counts()
count_offer_type.plot(kind='bar',figsize=(15, 5), rot=45)
# plt.xlabel('Offer Type')
plt.ylabel('Count')
plt.title('Offer Type Distribution')
addlabels(count_offer_type.index, count_offer_type.values);



**4. What are the most common values in each column of each dataframe?**

In [239]:
plt.subplot(121)
df['age'].hist()
plt.xlabel('Age')
plt.ylabel('Count')
plt.title('Age Distribution');

plt.subplot(122)
count_age_group = df['age_group'].value_counts()
count_age_group.plot(kind='bar',figsize=(15, 5), rot=45)
# plt.xlabel('Age Group')
# plt.ylabel('Count')
plt.title('Age Group Distribution')
addlabels(count_age_group.index, count_age_group.values);



In [240]:
gender_count = df['gender'].value_counts()
gender_count.plot(kind='bar',figsize=(15, 5), rot=0)
plt.xlabel('Gender')
plt.ylabel('Count')
plt.title('Gender Distribution')
addlabels(gender_count.index, gender_count.values);

Most records belong to adult and male customers.

**5. In terms of transcripts, who is the most loyal customer?**

In [241]:
loyal_customer_count = df[(df['event'] == 'offer completed') | (df['event'] == 'transaction')].groupby(['customer_id', 'event'])['amount'].sum().reset_index()
loyal_customer_count = loyal_customer_count.sort_values('amount',ascending=False).head()

# Visualize
loyal_cus = loyal_customer_count.set_index('customer_id')
loyal_cus.plot(kind='bar',figsize=(15, 5), rot=0)
plt.xlabel('Customer ID')
plt.ylabel('Total Amount Spent')
plt.title('Top 5 Customers by Total Amount Spent')
# Pass a plain array so that y[i] inside addlabels is a positional lookup
addlabels(loyal_cus.index, loyal_cus['amount'].round(2).values);

for cus in loyal_customer_count['customer_id']:
    print('''
    Profile ID: {},
    Number of Completed Offers: {},
    Amount: {}
    '''.format(cus
               ,df[df['event'] == 'offer completed'].groupby('customer_id')['offer_id'].count().loc[cus]
               ,round(loyal_customer_count[loyal_customer_count['customer_id']==cus]['amount'].values[0], 2))
         )
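The per-customer totals can also be built in a single pass with named aggregation. This is a sketch on toy rows; the column names follow the merged frame above, while the values and the `summary` name are illustrative.

```python
import pandas as pd

events = pd.DataFrame({
    "customer_id": ["A1", "A1", "A1", "A2", "A2"],
    "event": ["transaction", "offer completed", "transaction",
              "transaction", "offer completed"],
    "amount": [10.0, None, 5.5, 30.0, None],
})

# Total spend and number of completed offers per customer, highest spend first
summary = (events.groupby("customer_id")
                 .agg(total_spent=("amount", "sum"),
                      completed_offers=("event",
                                        lambda s: (s == "offer completed").sum()))
                 .sort_values("total_spent", ascending=False))
```

Customers without any completed offer simply get a count of 0 instead of raising a lookup error.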

    

## 2.2. Multivariate Exploration

1. What is the most popular promotion among children, teenagers, young adults, adults, and the elderly?
2. Which gender has the higher income in the profiles, male or female?
3. What kinds of promotions do each gender prefer?

**1. What is the most popular promotion among children, teenagers, young adults, adults, and the elderly?**

In [242]:
plt.figure(figsize=(15, 5))
sns.countplot(data=df, x='age_group', hue='offer_type')
plt.title('Offer Distribution by Age Group & Offer Type')
plt.ylabel('Count')
plt.xticks(rotation = 0)
plt.legend(title='Offer Type')
plt.show();

**2. Which gender has the higher income in the profiles, male or female?**

Note: Customers who did not state their gender are excluded.

In [243]:
plt.figure(figsize=(15, 5))
# Missing genders are NaN (not the string 'NA'), so drop them explicitly
sns.violinplot(data=df.dropna(subset=['gender']), x='gender', y='income')
plt.title('Income vs. Gender')
plt.ylabel('Income')
plt.xlabel('Gender')
plt.xticks(rotation = 0)
plt.show();

**Note:** The median is shown by the `white dot` in each graph.

**3. What kinds of promotions do each gender prefer?**

In [244]:
plt.figure(figsize=(15, 5))
# Missing genders are NaN (not the string 'NA'), so drop them explicitly
sns.countplot(data=df.dropna(subset=['gender']), x='gender', hue='offer_type')
plt.title('Offer Type Distribution by Gender')
plt.ylabel('Count')
plt.xlabel('Gender')
plt.xticks(rotation = 0)
plt.show();

# 3. Simple Linear Regression Machine Learning Model

Based on customer age, income, and gender, the model attempts to predict the **amount spent per transaction**.

In [245]:
df.head()

In [246]:
# Extract df for the model: transactions only, excluding gender 'O'
df_ml = df.loc[(df['event'] == 'transaction') & (df['gender'] != 'O'),
               ['amount','gender','age','income']]

# One-hot encode gender (columns F and M); rows with a missing gender get 0 in both
df_ml = pd.concat([pd.get_dummies(df_ml['gender']), df_ml.drop(columns=['gender'])], axis=1)

df_ml.head()

In [247]:
# Check NaN values
df_ml.isnull().mean()

In [248]:
# Define features and target as well as split train/test data
X = df_ml.drop('amount', axis=1)
y = df_ml['amount']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .30, random_state=42)

scores = dict()

In [249]:
# No Scaling/Normalization Approach

# Instantiate, Fit & Predict
# The normalize argument was removed in scikit-learn 1.2 and would contradict this baseline
lr_dumb = LinearRegression()
lr_dumb.fit(X_train, y_train) 
y_test_preds = lr_dumb.predict(X_test) 

scores['No Scaling/Normalization'] = round(r2_score(y_test, y_test_preds),2)

print(
'''
No Scaling/Normalization: 
r-square score: {} on {} values.'''.format(round(r2_score(y_test, y_test_preds),2), len(y_test))
)

In [250]:
# Normalization approach

# Fit scaler on the training data
norm = MinMaxScaler().fit(X_train)

# Transform
X_train_norm = norm.transform(X_train)
X_test_norm = norm.transform(X_test)

# Instantiate, Fit & Predict
# The features are already scaled above; normalize was removed in scikit-learn 1.2
lr_norm = LinearRegression()
lr_norm.fit(X_train_norm, y_train) 
y_test_preds = lr_norm.predict(X_test_norm) 

scores['Normalization'] = round(r2_score(y_test, y_test_preds),2)

print(
'''
Normalization: 
r-square score: {} on {} values.'''.format(round(r2_score(y_test, y_test_preds),2), len(y_test))
)


In [251]:
# Standardization approach

# Work on copies so the unscaled split stays untouched
X_train_stand = X_train.copy()
X_test_stand = X_test.copy()

# Apply standardization on numerical features
num_cols = ['age','income']
for i in num_cols:
    # Fit on training data column
    scale = StandardScaler().fit(X_train_stand[[i]])
    
    # Transform (ravel turns the (n, 1) output into a 1-D column)
    X_train_stand[i] = scale.transform(X_train_stand[[i]]).ravel()
    X_test_stand[i] = scale.transform(X_test_stand[[i]]).ravel()
    
# Instantiate, Fit & Predict
lr_stand = LinearRegression()
lr_stand.fit(X_train_stand, y_train) 
y_test_preds = lr_stand.predict(X_test_stand)

scores['Standardization'] = round(r2_score(y_test, y_test_preds),2)

print(
'''
Standardization: 
r-square score: {} on {} values.'''.format(round(r2_score(y_test, y_test_preds),2), len(y_test))
)



In [252]:
score_df = pd.DataFrame()
score_df['model type'] = scores.keys()
score_df['r-square value'] = scores.values()
score_df

## Conclusion

The r-squared score is the same for all three approaches, and this is expected. Scikit-learn's `LinearRegression` fits ordinary least squares directly rather than by gradient descent, and rescaling a feature only rescales its coefficient (and shifts the intercept) without changing the fitted predictions. Unlike in a distance-based method, features are not given more weight because of their magnitude. Scaling does matter for models that are trained by gradient descent or that use regularization, where unscaled features can slow convergence or be penalized unevenly, so it remains a reasonable default. For the plain linear regression above, however, it has no effect on the score.
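The identical scores can be reproduced on synthetic data with a plain least-squares solve. The sketch below uses NumPy only; the simulated `age`, `income` and target values are made up, and r-squared is computed on the fitted data for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
age = rng.uniform(18, 90, n)
income = rng.uniform(30_000, 120_000, n)
y = 0.05 * age + 1e-4 * income + rng.normal(0, 1, n)

def ols_r2(X, y):
    # Least-squares fit with an intercept column, then R^2 on the same data
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    centered = y - y.mean()
    return 1 - (resid @ resid) / (centered @ centered)

X_raw = np.column_stack([age, income])
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)
X_minmax = (X_raw - X_raw.min(axis=0)) / (X_raw.max(axis=0) - X_raw.min(axis=0))

r2_raw, r2_std, r2_minmax = (ols_r2(X, y) for X in (X_raw, X_std, X_minmax))
```

The three values agree up to floating-point error, because each scaling is an invertible linear change of the features that the intercept and coefficients absorb.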


The second point is the r-squared value itself. R-squared measures the share of the variance in the target that the model explains: 1 is a perfect fit, 0 is no better than always predicting the mean, and on held-out data the value can even be negative. The higher the value, the better the model predicts. The low score obtained here indicates that age, gender, and income explain little of the amount a customer spends per transaction.
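The range of r-squared can be checked by computing it by hand as one minus the ratio of the residual sum of squares to the total sum of squares. This is a small sketch in plain Python; the helper name `r_squared` and the toy values are illustrative.

```python
def r_squared(y_true, y_pred):
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total sum of squares
    return 1 - ss_res / ss_tot

y_true = [2.0, 4.0, 6.0, 8.0]
perfect = r_squared(y_true, y_true)               # exact predictions
mean_only = r_squared(y_true, [5.0] * 4)          # always predict the mean
worse = r_squared(y_true, [8.0, 6.0, 4.0, 2.0])   # worse than predicting the mean
```

The last case comes out negative, which is why r-squared is not strictly a 0 to 100 percent scale.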

## Improvements

I believe the analysis has reached a point where the data is reasonably well understood. To improve the results, I would improve the data collection and resolve the remaining issues with NaN values. I would also try to obtain further information, such as the branch where each transaction was made and the time of day. This information could help determine when and where offers should be sent.