# Index  
[Project Overview](#Project-Overview)  
[Problem Statement](#Problem-Statement)  
[Metrics](#Metrics)  
[Data Exploration](#Data-Exploration)  
[Data Visualization](#Data-Visualization)  
[Data Preprocessing](#Data-Preprocessing)  
[Implementation](#Implementation)  
[Model Evaluation and Validation pt.1](#Model-Evaluation-and-Validation-pt.1)  
[Refinement](#Refinement)  
[Model Evaluation and Validation pt.2](#Model-Evaluation-and-Validation-pt.2)  
[Justification](#Justification)  
[Reflection](#Reflection)  
[Improvement](#Improvement)  
[Conclusion](#Conclusion)


# Project Definition

## Project Overview  
This project tackles the problem of sending offers to the right customers in order to increase revenue.  
The data is simulated and was provided by [Starbucks](https://starbucks.com).  

Data Dictionary:  

profile.csv

Rewards program users (17000 users x 5 fields)

    gender: (categorical) M, F, O, or null
    age: (numeric) missing value encoded as 118
    id: (string/hash)
    became_member_on: (date) format YYYYMMDD
    income: (numeric)

portfolio.csv

Offers sent during 30-day test period (10 offers x 6 fields)

    reward: (numeric) money awarded for the amount spent
    channels: (list) web, email, mobile, social
    difficulty: (numeric) money required to be spent to receive reward
    duration: (numeric) time for offer to be open, in days
    offer_type: (string) bogo, discount, informational
    id: (string/hash)

transcript.csv

Event log (306648 events x 4 fields)

    person: (string/hash)
    event: (string) offer received, offer viewed, transaction, offer completed
    value: (dictionary) different values depending on event type
        offer id: (string/hash) not associated with any "transaction"
        amount: (numeric) money spent in "transaction"
        reward: (numeric) money gained from "offer completed"
    time: (numeric) hours after start of test


## Problem Statement  
The problem is: how do we know whether an offer is suitable for a customer? To keep it simple, we will use the 'offer completed' event as an indicator that a customer is happy with the offer they received. For offers that have no completion condition, such as informational offers, the 'offer viewed' event will indicate that a customer is happy (even though changes in spending after receiving an informational offer would be a better indicator of the customer's response).  

## Metrics  
Since this is a classification problem, both precision and accuracy will be used to judge the model's performance.
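To make the two metrics concrete, here is a minimal sketch that computes them by hand for binary labels. The function name `precision_and_accuracy` and the sample labels are made up for illustration; the notebook itself relies on scikit-learn's `classification_report` later on.

```python
def precision_and_accuracy(y_true, y_pred, positive=1):
    """Return (precision, accuracy) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if p == positive and t == positive)
    fp = sum(1 for t, p in pairs if p == positive and t != positive)
    correct = sum(1 for t, p in pairs if t == p)
    # precision is undefined when nothing is predicted positive; report 0.0 then
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    accuracy = correct / len(pairs)
    return precision, accuracy

# hypothetical labels: 1 = offer completed, 0 = not completed
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
precision, accuracy = precision_and_accuracy(y_true, y_pred)
print(precision, accuracy)
```

Precision answers "of the customers we predicted would respond, how many did?", which matters here because every wrongly targeted offer is wasted.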




[Github repo](https://github.com/FancyWhale69/Predict_customer_responce_SB) for the web app 



# Analysis

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
#load data
portfolio= pd.read_csv('../input/starbucks-customer-data/portfolio.csv')
profile= pd.read_csv('../input/starbucks-customer-data/profile.csv')
transcript= pd.read_csv('../input/starbucks-customer-data/transcript.csv')

## Data Exploration

Check the statistical information for each file.

In [None]:
portfolio.describe()

In [None]:
profile.describe()

In [None]:
transcript.describe()

From the [Project Overview](#Project-Overview) we know that 118 in the age column stands for NaN; we will have to deal with it later. The average age of program members appears to be 62, which is old, although the 118 placeholders pull this average upward.

Check for NaNs to deal with.

In [None]:

portfolio.isnull().sum()

In [None]:
profile.isnull().sum()

In [None]:
transcript.isnull().sum()

Impute NaN values.

In [None]:
#in the gender column group NaNs with 'other', since we don't know their gender
#(assigning the result back avoids calling inplace on a column selection)
profile['gender'] = profile['gender'].fillna("O")

In [None]:
#from the data dictionary, 118 == null. impute with the average for each gender group
#find which groups have null (118) in their age
profile[profile['age']==118]['gender'].value_counts()

In [None]:
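#find average age for 'other' group, leaving out the 118 placeholder so it does not inflate the mean
known_other_age = profile.loc[(profile['gender'] == 'O') & (profile['age'] != 118), 'age']
average_age = np.round(known_other_age.mean())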

In [None]:
#since all missing ages are from the 'other' group, use the average for that group
profile['age']= profile['age'].replace(118, average_age)

In [None]:
#find which groups have null in their income
#(count() skips nulls, so count the rows per gender instead)
profile[profile['income'].isnull()]['gender'].value_counts()

In [None]:
#average income for 'other' (mean() skips NaN values)
average_income= np.round(profile.groupby('gender')['income'].mean().loc['O'])

In [None]:
#impute NaN with the average
profile['income']= profile['income'].fillna(average_income)
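The imputation steps above can also be written in a group-wise form that handles every gender group at once. The following is only a sketch on a made-up frame (`toy` and its values are hypothetical), not the notebook's exact procedure.

```python
import numpy as np
import pandas as pd

# hypothetical toy data; 118 is the placeholder for a missing age
toy = pd.DataFrame({
    'gender': ['F', 'M', 'O', 'O', 'O'],
    'age':    [40, 50, 30, 118, 118],
    'income': [60000.0, 50000.0, 40000.0, np.nan, np.nan],
})

# turn the placeholder into a real NaN so that mean() ignores it
toy['age'] = toy['age'].replace(118, np.nan)

# fill each missing value with the rounded mean of its own gender group
for col in ['age', 'income']:
    group_mean = toy.groupby('gender')[col].transform('mean').round()
    toy[col] = toy[col].fillna(group_mean)

print(toy)
```

Converting 118 to NaN first keeps the placeholder from leaking into the group average.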

Drop 'Unnamed: 0' since it just repeats the index.

In [None]:
#drop unnamed:0 column from all dataframes
profile.drop('Unnamed: 0', axis=1, inplace=True)
transcript.drop('Unnamed: 0', axis=1, inplace=True)
portfolio.drop('Unnamed: 0', axis=1, inplace=True)

## Data Visualization

In [None]:
#analyse the profile file first
#see gender count (pass the column as x so seaborn does not treat it as the data argument)
sns.countplot(x=profile['gender'])

It looks like there are more male than female customers. Let's see which group has more income.

In [None]:
sns.displot(profile, x='income', hue='gender', bins=30)
plt.title('Income per gender')

Remember that all the missing incomes were in the 'other' group, because I labeled every person with no gender as 'other'.

In [None]:
#select the income column before summing so non-numeric columns are not aggregated
profile.groupby('gender')['income'].sum().plot(kind='bar')
plt.title('Total income per gender')
plt.ylabel('income')

Males' total income is higher, even though there are more females than males in the higher income range (78k and above).

In [None]:
#extract only the year from 'became_member_on' and create column 'year'
profile['year']= profile['became_member_on'].apply(lambda x : str( x )[:4])
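As an alternative to slicing the string, the date can be parsed properly first. This is a small sketch on a hypothetical `sample` frame; note that it yields the year as an integer, whereas the cell above keeps it as a string.

```python
import pandas as pd

# hypothetical sample of the became_member_on column (YYYYMMDD stored as integers)
sample = pd.DataFrame({'became_member_on': [20170212, 20180426, 20151231]})

# parse as real dates, then pull out the year as an integer
joined = pd.to_datetime(sample['became_member_on'].astype(str), format='%Y%m%d')
sample['year'] = joined.dt.year

print(sample)
```

Parsing also fails loudly on malformed dates, which plain string slicing would silently accept.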

In [None]:
#see membership registration through the years
profile['year'].value_counts().sort_index().plot()
plt.xlabel('year')
plt.ylabel('new members')
plt.title('loyalty program registration per year')

Registrations peaked in 2017 and then went down the following year.

## Data Preprocessing

Before merging profile with transcript, note that transcript has no 'id' column; its 'person' column holds the same values as the 'id' column in profile, so rename 'person' to 'id'.  

In [None]:

#transcript['person'] == profile['id']
#change column name from person to id, to merge them
transcript.rename(columns={'person':'id'}, inplace=True)


In [None]:
#merge profile and transcript on 'id' column
df= profile.merge(transcript, on='id', how='outer')
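An outer merge keeps rows that exist on only one side, and those rows end up with NaNs. One way to check for them is the `indicator` option of `merge`, sketched here on made-up miniature frames (`profile_toy` and `transcript_toy` are hypothetical).

```python
import pandas as pd

# hypothetical miniature versions of profile and transcript
profile_toy = pd.DataFrame({'id': ['a', 'b', 'c'], 'age': [30, 40, 50]})
transcript_toy = pd.DataFrame({
    'id': ['a', 'a', 'b', 'z'],
    'event': ['offer received', 'transaction', 'offer received', 'transaction'],
})

# indicator=True adds a '_merge' column saying where each row came from
merged = profile_toy.merge(transcript_toy, on='id', how='outer', indicator=True)

# rows found on only one side would carry NaNs after an outer merge
unmatched = merged[merged['_merge'] != 'both']
print(unmatched[['id', '_merge']])
```

If `unmatched` is empty on the real data, the outer merge behaves exactly like an inner one.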

Using the 'value' column (df), create an 'offer' column (df) that contains the offer name in the format offerName_difficulty_duration, by cross-referencing with portfolio's 'id' column.  

In [None]:
#map the offer_id from df with id in portfolio

#create dict such that offers['offer_id']= offerName_difficulty_duration
offers=dict()
for offer, i, diff, dur in portfolio[['offer_type', 'id', 'difficulty', 'duration']].values:
    offers[i]= f'{offer}_{diff}_{dur}'


def value_col(col):
    """
    extract offer_id from the value column and map it to the offer name
    
    input- value column
    
    output-  mapped offer names
    """
    value_type= col.split(':')[0].replace("'", "").replace('{', "")
    
    if value_type == 'offer id':
        value= col.split(':')[1].replace("'", "").replace('}', "").strip()
        return offers[value]
    elif value_type == 'offer_id':
        value= col.split(':')[1].split(',')[0].replace("'", "").strip()
        return offers[value]
    else:
        return 'None'

In [None]:
#get offer names
df['offer']= df['value'].apply(value_col)

After that, create the 'offer_id' column (df), which contains the id of the offer; it matches portfolio's 'id' column. 

In [None]:
def value_col_id(col):
    """
    extract offer_id from value column
    
    input- value column
    
    output- offer_ids
    """
    value_type= col.split(':')[0].replace("'", "").replace('{', "")
    
    if value_type == 'offer id':
        value= col.split(':')[1].replace("'", "").replace('}', "").strip()
        return value
    elif value_type == 'offer_id':
        value= col.split(':')[1].split(',')[0].replace("'", "").strip()
        return value
    else:
        return 'None'

In [None]:
#get offer ids
df['offer_id']= df['value'].apply(value_col_id)

Add an 'amount' column (df) that holds the amount spent in each transaction event.

In [None]:
def value_col_trans(col):
    """
    get transaction amount from value column
    
    input- value column
    
    output- transaction amount
    """
    value_type= col.split(':')[0].replace("'", "").replace('{', "")
    
    if value_type == 'amount':
        value= col.split(':')[1].replace("'", "").replace('}', "").strip()
        return np.round(float(value), 2)
    else:
        return np.nan

In [None]:
#get transaction amount
df['amount']= df['value'].apply(value_col_trans)
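The three helpers above pick the cell apart with string splitting. Since each cell is the text form of a Python dictionary, another option is `ast.literal_eval` from the standard library. The sketch below uses a hypothetical helper `parse_value` and made-up cell contents; it returns `None` for a missing field rather than the string `'None'` or NaN used above.

```python
import ast

def parse_value(raw):
    """Turn the string form of a 'value' cell into (offer_id, amount)."""
    record = ast.literal_eval(raw)  # safely parses a Python literal, no code execution
    # the log uses both 'offer id' and 'offer_id' as keys
    offer_id = record.get('offer id', record.get('offer_id'))
    amount = record.get('amount')
    return offer_id, amount

# hypothetical cells in the three shapes handled above
print(parse_value("{'offer id': 'abc123'}"))
print(parse_value("{'offer_id': 'abc123', 'reward': 2}"))
print(parse_value("{'amount': 0.83}"))
```

Parsing once and reading several keys avoids repeating the same splitting logic in each helper.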

In [None]:
#'year' is the year the customer became a member
df.groupby('year')['amount'].sum().plot()
plt.ylabel("amount")
plt.title('total transactions per year')

Both total transactions and the number of registrations follow the same trend, climbing until 2017 and then dropping.

In [None]:
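#total amount spent by each gender group; labels come from the groupby index
amount_per_gender = df.groupby('gender')['amount'].sum()
sns.barplot(x=list(amount_per_gender.index), y=amount_per_gender.values)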
plt.xlabel('gender')
plt.ylabel('total')
plt.title('total amount spent per gender')

In [None]:
#total spending per gender per membership year, computed once
amount_by_gender_year = df.groupby(['gender', 'year'])['amount'].sum()
for code, label in [('O', 'Other'), ('M', 'Male'), ('F', 'Female')]:
    sns.lineplot(x=amount_by_gender_year.loc[code].index, y=amount_by_gender_year.loc[code].values, label=label)
plt.ylabel('total transactions')
plt.title('Total transactions per year per gender')

From around 2015, females spent more at Starbucks than males, followed by a sharp drop for 2018. Note that 'year' here is the year each customer became a member.

Now let's deal with the categorical data in 'event' by making dummies.

In [None]:
# get dummies for event column
#dtype=int keeps the dummy columns as 0/1 instead of booleans
df=pd.concat([df, pd.get_dummies(df['event'], dtype=int)], axis=1)

Informational offers are a special kind of offer: they have no spending condition to complete, but should instead influence the customer's spending behavior. For simplicity, I assume that if a customer views the offer then it is completed.

In [None]:
def infor(offer, viewed, complete):
    """
    if an informational offer is viewed then it is counted as completed
    
    input:
    offer column
    offer viewed column
    offer completed column
    
    output:
    0- not completed
    1- completed
    """
    if offer == 'informational_0_3' or offer == 'informational_0_4':
        if viewed == 1:
            return 1
    
    return complete

In [None]:
df['offer completed']=df.apply( lambda x : infor( x['offer'], x['offer viewed'], x['offer completed'] ) , axis=1)
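The same rule can be expressed without a row-by-row `apply`. The sketch below runs on a made-up frame (`toy` is hypothetical) and matches any offer whose name starts with `informational`, which is a slightly broader test than the two hard-coded names above.

```python
import pandas as pd

# hypothetical rows with the same columns used by infor()
toy = pd.DataFrame({
    'offer': ['informational_0_3', 'informational_0_4', 'bogo_5_7', 'None'],
    'offer viewed': [1, 0, 1, 0],
    'offer completed': [0, 0, 0, 0],
})

# an informational offer that was viewed is treated as completed
is_info_view = toy['offer'].str.startswith('informational') & (toy['offer viewed'] == 1)
toy.loc[is_info_view, 'offer completed'] = 1

print(toy)
```

On a log with hundreds of thousands of rows, a boolean mask like this is usually much faster than `apply` with `axis=1`.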

Store the total amount spent per customer, then delete duplicate rows so that only the final result of each (customer, offer) interaction remains. For example,  
if we have the following data:  

cus_id, offer_id, event      , completed  
1     , qq1     , received   , 0  
1     , qq1     , viewed     , 0  
1     , qq1     , completed  , 1  
2     , qq1     , received   , 0  
2     , qq1     , viewed     , 0  

the output should be:  
1     , qq1     , completed  , 1  
2     , qq1     , viewed     , 0  
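The reduction in the example above can be sketched directly on the toy table. This is an illustration under the assumption that the log is already in chronological order, so that the last row of each pair is its final state; it is not a copy of the notebook's own cell.

```python
import pandas as pd

# the toy log from the example above
log = pd.DataFrame({
    'cus_id':    [1, 1, 1, 2, 2],
    'offer_id':  ['qq1'] * 5,
    'event':     ['received', 'viewed', 'completed', 'received', 'viewed'],
    'completed': [0, 0, 1, 0, 0],
})

# keep only the last row of each (customer, offer) pair
final = (log.drop_duplicates(subset=['cus_id', 'offer_id'], keep='last')
            .reset_index(drop=True))

print(final)
```

With `keep='last'` each pair appears exactly once, so a completed offer never also shows up as a non-completed row.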


In [None]:
#store total spending per person
total_amount=df.groupby('id')['amount'].sum()

In [None]:
temp= df.sort_values('event') #sort values
new_df= temp[temp['offer completed']== 1] #extract completed offers to new_df
temp= temp[temp['offer completed'] != 1] #delete completed offers from temp
temp= temp[temp.event != 'transaction'] #delete transactions from temp
temp= temp.drop_duplicates(subset=['id', 'offer_id']) #drop duplicated rows
new_df= pd.concat([new_df, temp], ignore_index=True) #concat temp and new_df
new_df= pd.concat([new_df, pd.get_dummies(new_df['offer'], drop_first=True)], axis=1) #concat new_df and dummies for offer

Add a 'total_transaction' column which holds the total spending for each customer.

In [None]:
#map total transaction amount to person id
new_df['total_transaction']= new_df['id'].map(total_amount)

Finally, make dummies for gender and year, then drop all unneeded columns. With this, the data is ready for the ML model.

In [None]:
#drop unneeded columns
new_df.drop(['became_member_on', 'event', 'time', 'amount', 'offer_id', 'id', 'value', 'offer received', 'offer viewed', 'transaction', 'offer'], axis=1, inplace=True)

In [None]:
# make dummies for gender and year and concat with new_df
new_df = pd.concat([new_df, pd.get_dummies(new_df['gender']), pd.get_dummies(new_df['year'], drop_first=True)], axis=1)

In [None]:
#drop year and gender
new_df.drop(['gender', 'year', 'O'], axis=1, inplace=True)

In [None]:
#rename target column to completed
new_df.rename(columns= {'offer completed': 'completed'}, inplace=True)

In [None]:
#get class distribution
new_df['completed'].value_counts()

## Implementation  

The ML pipeline consists of normalizing the data with min-max scaling and then feeding it to a KNN algorithm.  
If two customers are similar and one of them responded to an offer, then it is likely that the other customer will respond to it as well; that is why KNN was used.
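To see why scaling comes before KNN, here is a minimal NumPy sketch of a one-nearest-neighbor vote with and without min-max scaling. The customers, the helper names `knn_predict` and `min_max` and all values are invented for illustration; the notebook itself uses scikit-learn's `MinMaxScaler` and `KNeighborsClassifier`.

```python
import numpy as np

def knn_predict(train_x, train_y, query, k=1):
    """Majority vote among the k training rows closest to query (Euclidean)."""
    dist = np.linalg.norm(train_x - query, axis=1)
    nearest = np.argsort(dist)[:k]
    return int(train_y[nearest].mean() >= 0.5)

# hypothetical customers: columns are age and income, label 1 = responded
train = np.array([[25, 50000], [30, 52000], [70, 50500], [65, 80000]], dtype=float)
labels = np.array([1, 1, 0, 0])
query = np.array([27, 50600], dtype=float)

# min-max scaling fitted on the training rows only, then applied to the query
lo, hi = train.min(axis=0), train.max(axis=0)

def min_max(x):
    return (x - lo) / (hi - lo)

raw_pred = knn_predict(train, labels, query)                       # income dominates the distance
scaled_pred = knn_predict(min_max(train), labels, min_max(query))  # both features on the same scale
print(raw_pred, scaled_pred)
```

Without scaling, the 70-year-old with a similar income counts as the closest customer; after scaling, the 25-year-old does, and the prediction flips.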

In [None]:
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler

In [None]:
#create train test data
x=new_df.drop('completed', axis=1)
y= new_df['completed']

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)

In [None]:
#normalize data then use knn
pipeline= Pipeline([
    ('scaler', MinMaxScaler()),
    ('cls',  KNeighborsClassifier())
])


In [None]:
#adjust how many neighbors affect the selection
params={'cls__n_neighbors': [1, 5, 10, 15, 20, 25]}

cv= GridSearchCV(pipeline, params, verbose=3)

In [None]:
cv.fit(x_train, y_train)#train

## Model Evaluation and Validation pt.1  

It looks like the model predicts the '0' class better than the '1' class; this could be because the data is unbalanced.

In [None]:
pred = cv.predict(x_test)#predict

In [None]:
#evaluate
from sklearn.metrics import classification_report, confusion_matrix
print("classification report:")
print(classification_report(y_test, pred))
print("-------------------------------------")
print('confusion matrix:')
print(confusion_matrix(y_test, pred))
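When the classes are unbalanced, it helps to compare the accuracy against what a model would get by always predicting the most frequent class. A small sketch with made-up labels follows (`majority_baseline_accuracy` and `y_test_toy` are hypothetical names).

```python
from collections import Counter

def majority_baseline_accuracy(y_true):
    """Accuracy of always predicting the most frequent class."""
    counts = Counter(y_true)
    return max(counts.values()) / len(y_true)

# hypothetical test labels with a 60/40 split between the classes
y_test_toy = [0] * 6 + [1] * 4
baseline = majority_baseline_accuracy(y_test_toy)
print(baseline)
```

An accuracy close to this baseline suggests the model has learned little beyond the class proportions.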

## Refinement

Since the data is unbalanced, let's try undersampling, which shrinks the dominant class, and see if there is an improvement.

In [None]:
new_df.completed.value_counts()#data unbalanced

In [None]:
#use downsampling
from sklearn.utils import resample
df_majority = new_df[new_df.completed==0]
df_minority = new_df[new_df.completed==1]
 
# Downsample majority class
df_majority_downsampled = resample(df_majority, 
                                 replace=False,    # sample without replacement
                                 n_samples=len(df_minority),     # to match minority class
                                 random_state=123) # reproducible results
 
# Combine minority class with downsampled majority class
df_downsampled = pd.concat([df_majority_downsampled, df_minority])
 
# Display new class counts
df_downsampled.completed.value_counts()

In [None]:
#create train test data using balanced data
x=df_downsampled.drop('completed', axis=1)
y= df_downsampled['completed']

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)

In [None]:
cv.fit(x_train, y_train)#train

## Model Evaluation and Validation pt.2  

An increase in the precision score for the '1' label can be observed, as well as an increase in accuracy.

In [None]:
pred = cv.predict(x_test)

In [None]:
print("classification report:")
print(classification_report(y_test, pred))
print("-------------------------------------")
print('confusion matrix:')
print(confusion_matrix(y_test, pred))

## Justification
By simply balancing the data we gained an increase in precision for the '1' class from 44% to 56% and a rise in accuracy from 55% to 58%. This could indicate that the model was biased toward the '0' class before the data was balanced.

## Reflection
The current solution is limited in the data it captures. A better solution would also capture the behavior of customers after they view an offer. Using deep learning could also capture more than is possible with this solution.  


## Improvement
Instead of predicting whether a customer will respond to an offer or not, a prediction could be made about which offers this customer will respond to.

## Conclusion
We reached the goal of predicting whether a customer will respond to an offer or not. The solution leaves a lot to be desired, but it does the job. I would have loved to explore how users' spending behavior changes after reading an offer, but it is quite difficult to do at my current level.  
Also, deep learning or better feature engineering could be used to achieve better results.  

From a business point of view, it seems that male customers are spending less even though they are the majority; some steps are needed to encourage them to spend more.

## Save the Model & Data

In [None]:
import pickle
#the with-block closes the file even if pickling fails
with open('model.pkl', 'wb') as f:
    pickle.dump(cv, f)

new_df.to_csv('df.csv')
df_downsampled.to_csv('downsampled.csv')
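For completeness, this is roughly how a saved object can be read back, for example by the web app. The sketch uses a plain dictionary as a stand-in for the fitted model and a hypothetical file name `model_demo.pkl` in the temp directory, so it runs without scikit-learn.

```python
import os
import pickle
import tempfile

# a plain dictionary stands in for the fitted GridSearchCV object here
stand_in_model = {'n_neighbors': 15}

path = os.path.join(tempfile.gettempdir(), 'model_demo.pkl')
with open(path, 'wb') as f:
    pickle.dump(stand_in_model, f)

# load it back, as an app would do at start-up
with open(path, 'rb') as f:
    restored = pickle.load(f)

print(restored == stand_in_model)
```

Loading the real `model.pkl` requires scikit-learn to be installed in the loading environment, ideally in the same version that was used to train the model.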