# Problem Statement

* We will be exploring the Starbuck’s Dataset which simulates how people make purchasing decisions and how those decisions are influenced by promotional offers. 
* There are three **offers_types** that can be sent: buy-one-get-one (BOGO), discount, and informational. 
* We will segment the customer data on different parameters and check its behavious on different **offer_tyoes** using both supervised and unsupervised learning
* We will analyse the data in the Exploratory Data Analysis part of this section and answer the following questions related to customer segmentation and its buying behavious.


- 1. What is the Gender Distribution of Starbucks Customers?
- 2. What is the Age Distribution and average age of Starbucks Customers?
- 3. What is the Income Distribution and average Income of Starbucks Customers?
- 4. How many customers enrolled yearly?
- 5. Which gender has the highest yearly membership?
- 6. Which gender has the highest Annual income?
- 7. What is the distribution of event  in  transcripts?
- 8. What is the percent of trasactions and offers in the event?
- 9. What are the types of offers : received,views, completed ?
- 10. What is the Income Distribution for the Offer Events?
- 11. What are the Offer types amongst ages, gender and income groups?
- 12. What is the highest completed offer?
- 13. What is the lowest completed offer?


#### 1.Data Preparation
#### 2.Data understanding
#### 3.Cleaning the data
#### 4.Exploratory Data Analysis
#### 5.Data Modelling ( Unsupervised ans Supervised Learning)
#### 6.Hyper parameter tuning
#### 7.Evalute the model accuracy
#### 8.Conclusion
#### 9.Results

## Data Preparing

In [1]:
import pandas as pd
import numpy as np
import math
import json
import matplotlib.pyplot as plt
import seaborn as sns
#% matplotlib inline

In [None]:
# read in the json files
portfolio = pd.read_json('data/portfolio.json', orient='records', lines=True)
profile = pd.read_json('data/profile.json', orient='records', lines=True)
transcript = pd.read_json('transcript.json', orient='records', lines=True)

## Data understanding
The data is provided by Starbucks with three files containing the following information:
* 3 .json files.
    * portfolio.json - containing offer ids and meta data about each offer (duration, type, etc.)
    * profile.json - demographic data for each customer
    * transcript.json - records for transactions, offers received, offers viewed, and offers completed


**portfolio.json**
* id (string) - offer id
* offer_type (string) - type of offer ie BOGO, discount, informational
* difficulty (int) - minimum required spend to complete an offer
* reward (int) - reward given for completing an offer
* duration (int) - time for offer to be open, in days
* channels (list of strings)

In [None]:
portfolio.head()

**profile.json**
* age (int) - age of the customer 
* became_member_on (int) - date when customer created an app account
* gender (str) - gender of the customer (note some entries contain 'O' for other rather than M or F)
* id (str) - customer id
* income (float) - customer's income

In [None]:
profile.head()


**transcript.json**
* event (str) - record description (ie transaction, offer received, offer viewed, etc.)
* person (str) - customer id
* time (int) - time in hours since start of test. The data begins at time t=0
* value - (dict of strings) - either an offer id or transaction amount depending on the record

In [None]:
transcript.head()

### Cleaning each dataset
we will go though the each data diles and clean them and convert them into the formats to use the data for futhur analysis

Rename the id columns of every dataset to its respective tags to avoid confusion

In [None]:
# rename the id columns for ease of understanding
portfolio.rename(columns={"id":"offer_id"}, inplace=True)
profile.rename(columns={"id":"customer_id"}, inplace=True)
transcript.rename(columns={"person":"customer_id"}, inplace=True)

### Clean Portfolio Data:

- One-hot encode channels
- One-hot encode offer_type column

In [None]:
portfolio

We can see from above that the portfolio dataset consists of 10 not-null entries that contains information about the offers provided by starbucks

We can see a histogram containing a distribution of 3 types of offers in 10 entries

In [None]:
# One-hot encode : channels column
channels = portfolio["channels"].str.join(sep="*").str.get_dummies(sep="*")
    
# One-hot encode : offer_type column
offer_type = pd.get_dummies(portfolio['offer_type'])
    
# Concat one-hot into a portfolio_df
portfolio_df = pd.concat([portfolio, channels, offer_type], axis=1, sort=False)

# Remove channels and offer_type
portfolio = portfolio_df.drop(['channels'], axis=1)
portfolio_df

### Clean Profile Data
- Check for null values
- check the age column for extreme values (118)
- Drop rows with no gender, income, age of 118
- Create readable date format in became_member_on column
- Extract its year from became_member_on column add start_year columns (for further analysis)

In [None]:
profile

In [None]:
profile.info()

In [None]:
profile.isnull().sum()

In [None]:
profile.age.describe()

We have age as 118 which isnt practical ans is an outlier value. Lets check the count of 118 values in the column

In [None]:
profile.where(profile.age==118).count()

There are 2175 null values in gender and income columns and age column has 118 has  2175 values. 
Since we have the same number of null values and 118 are same these columns, we need to check do they lie in the same row.

In [None]:
# Check if NaN values for  gender & income and 118 value of age columnalways occur in same rows
profile[(profile.age == 118) & (profile.gender.isnull()) & (profile.income.isnull())]

These NaN values and 118  occur in the same rows, resulting in 2,175 out of 17,000 customers without any demographic data. 
There wont be any means to keep this data which could hamper the accuracy of the model,
Although this means dropping more than 10 percent of the customer data, We will have to drop these rows.

In [None]:
# profile: drop rows with no gender, income, age data
profile = profile.drop(profile[profile['gender'].isnull()].index)
profile.isnull().sum()

Hence, cleared all the null values

In [None]:
# Convert to datetime
profile.became_member_on = pd.to_datetime(profile.became_member_on, format = '%Y%m%d')
profile['start_year'] = profile.became_member_on.dt.year

In [None]:
profile.head()

### Clean Transcript Data

- Create separate columns for amount and offer_id from value column dictionary
- merge the three datasets with common columns
- transcript: segregate offer and transaction data 
- Label the columns - offer_id. offer_type, gender, and the unique customer_ids to convert them into integer data tpe
- Create a offers dataframe by seperating it from transaction in the event column

In [None]:
transcript.head()

In [None]:
# Functions to create offer id and amount columns from the transcript table.
def create_offer_id_column(val):
    if list(val.keys())[0] in ['offer id', 'offer_id']:
        return list(val.values())[0]
    
def create_amount_column(val):
    if list(val.keys())[0] in ["amount"]:
        return list(val.values())[0]


In [None]:
# Create separate columns for amount and offer_id and reward from value col dictionary.
transcript['offer_id'] = transcript.value.apply(create_offer_id_column)
transcript['amount'] = transcript.value.apply(create_amount_column)


# change amount and reward column type to float
transcript.amount.astype('float')
transcript.info()

In [None]:
transcript.drop(columns=['value'], inplace=True)
transcript.head()

## Merge the three data sets with common columns into one for futhur analysis

In [None]:
# merge the transcript and profile dataframes on customer_id column
transcript = transcript.merge(profile, on=['customer_id'])
transcript.head(3)

In [None]:
# merge the transcript and portfolio  on customer_id column using left join
# To maintain all the offer_ids from the transcript column
transcript = transcript.merge(portfolio, on=['offer_id'], how='left')

In [None]:
transcript

### Data Labelling
- Label the columns - offer_id. offer_type, gender, and the unique customer_ids

In [None]:
#Label Encoding the category columns- 
from sklearn import preprocessing
le = preprocessing.LabelEncoder()

#label encoding - offer_id (10 different IDs) from the portfolio data set
le1 = preprocessing.LabelEncoder()
le1.fit(portfolio.offer_id)
transcript['offer_id'] = le1.fit_transform(transcript['offer_id'].astype(str))


#label encoding - offer_type from the portfolio data set (3 different types, bogo-discount-informational)
le2 = preprocessing.LabelEncoder()
le2.fit(portfolio.offer_type)
transcript['offer_type'] = le2.fit_transform(transcript['offer_type'].astype(str))


# label encoding for gender from the profile data set(4 different types, male-female-other)
le3 = preprocessing.LabelEncoder()
le3.fit(profile.gender)
transcript['gender'] = le3.fit_transform(transcript['gender'].astype(str))

In [None]:
transcript.head()

In [None]:
#To retrive its original value we can use its inverse function
le3.inverse_transform([0,1,2])

In [None]:
transcript.info()

In [None]:
# label the unique customer ids, create a mapper function to avoid duplication.
def id_label(customer_id):
    """
    
    Description:
    This function will label ;ong values of customer_ids '912b9f623b9e4b4eb99b6dc919f09a93' to unique integers.
    
    INPUT: 
    customer_id (str): transcript column to be labeled whose values are to be changed
    
    OUTPUT:
    coded_id (list): list of the labelled integers for each value
     
    """
    coded_dict = dict()
    counter = 1
    col_name=str(customer_id)
    coded_id = []
    
    for val in transcript[customer_id]:
        try: 
            if isinstance(val, str):
                if val not in coded_dict:
                    coded_dict[val] = counter
                    counter+=1

                coded_id.append(coded_dict[val])
            else:
                coded_dict[val] = np.nan
                coded_id.append(coded_dict[val])
        except:
            pass
    del transcript[customer_id]
    return coded_id

In [None]:
transcript['customer_id'] = id_label("customer_id")

In [None]:
transcript.customer_id.nunique()

In [None]:
transcript.head()

- Create a offers dataframe by seperating it from transaction in the event column
- Offer dataframe consist of all the offer types- offer_received, offer_viewed, offer_completed 

In [None]:
# Seperate the three offer columns from the transaction column
transaction_df = transcript[transcript.event == "transaction"]
transaction_df.head()

In [None]:
# Seperate the three offer columns from the transaction column
offers_df = transcript[transcript.event != "transaction"]
offers_df.head()

## Exploratory Data Analysis

###  Analysis:

- 1. What is the Gender Distribution of Starbucks Customers?
- 2. What is the Age Distribution and average age of Starbucks Customers?
- 3. What is the Income Distribution and average Income of Starbucks Customers?
- 4. How many customers enrolled yearly?
- 5. Which gender has the highest yearly membership?
- 6. Which gender has the highest Annual income?
- 7. What is the distribution of event  in  transcripts?
- 8. What is the percent of trasactions and offers in the event?
- 9. What are the types of offers : received,views, completed ?
- 10. What is the Income Distribution for the Offer Events?
- 11. What are the Offer types amongst ages, gender and income groups?
- 12. What is the highest completed offer?
- 13. What is the lowest completed offer?
    

In [None]:
profile.head()

#### Ques: 1 What is the Gender Distribution of Starbucks Customers?

In [None]:
#Creating Subplots for distribution based on Gender,Age,Income and start year of membership for the cleaned Profile data
fig, ax = plt.subplots(2, 2, figsize=(13, 12))
fig.suptitle('Demographics of Customer Data of Starbucks', fontsize=15, weight='bold')

# GENDER BASED SUBPLOT
plt.subplot(2, 2, 1)
plt.hist(profile['gender']);
plt.style.use('seaborn');
plt.title('Gender Distribution of Starbucks Customers');
plt.xlabel("Gender");
plt.ylabel("Frequency");


# AGE BASED SUBPLOT
plt.subplot(2, 2, 2)
plt.hist(profile['age']);
plt.style.use('seaborn')
plt.title("Age Distribution of Starbucks Customers" );
plt.xlabel("Age");
plt.ylabel("Frequency");

# INCOME BASED  SUBPLOT
plt.subplot(2, 2, 3)
plt.hist(profile['income'] * 1E-3 );
plt.style.use('seaborn')
plt.title("Income Distribution of Starbucks Customers");
plt.xlabel("Income");
plt.ylabel("Frequency");


# BECAME A MEMBER OF STARBUCKS ON(YEAR) SUBPLOT
plt.subplot(2, 2, 4)
profile["start_year"].value_counts().plot(kind = 'bar'); 
plt.style.use('seaborn')
plt.title("Became a member of Starbucks Customers in the year");
plt.xlabel("Yearly Membership");
plt.ylabel("Frequency");

plt.show()

- Ans1: The proprtion of males(around 9000) is slightly more than those of the females(around 6000)and very small amount of others

#### Ques2. What is the Age Distribution and average age of Starbucks Customers?

In [None]:
profile['age'].describe()['mean']

- Ans2: Age group range from 40-70 frequently visit starbucksrbucks, the reason can be steady life after 40.
- with an average of 54 years.

#### Ques3: What is the Income Distribution and average Income of Starbucks Customers?

In [None]:
profile['income'].describe()['mean']

- Ans 3: There is a decrease in the number of customers as after 70K, 
    mentioning as the income increases people spend less on coffe.
- with an average income of 65k.

#### Ques4. How many customers enrolled yearly?

In [None]:
profile["start_year"].value_counts()

- Ans4: Members of the starbucks increased exponentially from 2013 and reached its highest in 2017 which later declines steadily
- 5599 customers enrolled in 2017

#### Ques5 : Which gender has the highest yearly membership? 

In [None]:
# groupby start_year and gender to plot a graph
membership_year = profile.groupby(['start_year', 'gender'])["age"].count().reset_index()
membership_year.head()

In [None]:
#plot a bar graph for membership program as a function of gender 
plt.figure(figsize=(15, 5))
sns.barplot(x='start_year', y='age', hue='gender', data=membership_year);
plt.xlabel('Membership Start Year',fontsize = 12);
plt.ylabel('Count',fontsize = 12);
plt.title("Gender distribution of yearly membership", fontsize = 15)
plt.show()

- Ans5: With the increase in popularity of starbucks, people have joined starbucks yerly exponentially and reached its zenith in 2017.
- more men have joined than the female and very few from others every year. 


#### Ques6 :Which gender has the highest Annual income?

In [None]:
plt.figure(figsize=(14, 5))
sns.violinplot(x=profile['gender'], y=profile['income'])
plt.title('Gender distribution of Annual Income')
plt.ylabel('Income')
plt.xlabel('Gender')
plt.xticks(rotation = 0)
plt.show();

- The highest and the lowest income for both male and female are approximately same and for others it is less on both the sides.
- The median income (the white dot) for females (around **70k**) is higher than males and others (around **60k)**
- for females the income spreads from **40k** to **100k**. 
- For males most the spread is around **40k** to **70k** which close to median.
- for others the spread is around **60K**
- The count of male customers in low-income level is slightly higher than that of female  and other customers

In [None]:
ax = portfolio["offer_type"].value_counts().plot.bar(figsize=(5,5),fontsize=14,)
ax.set_title("What are the offer types?", fontsize=20)
ax.set_xlabel("Offers", fontsize=15)
ax.set_ylabel("Frequency", fontsize=15)
sns.despine(bottom=True, left=True)

#### Ques7: What is the distribution of event  in  transcripts?

In [None]:
sns.countplot(transcript['event'])
plt.title('Number of events In Transcripts')
plt.ylabel('Number of Transcripts')
plt.xlabel('Transcript type')
plt.xticks(rotation = 0)
plt.show();

- Ans7: We can see that most of the transcripts are transactions. 
- Around **75%** of the offer received were viewed. And nearly **50%** of the viewed offers were completed.

#### Ques8: What is the percent of trasactions and offers in the event?

In [None]:
event_counts = transcript['event'].value_counts()
event_counts

In [None]:
#tranaction percent and offer percent

transactions_percent = 100 * event_counts[0] / event_counts.sum()
offers_percent = 100 * event_counts[1:].sum() / event_counts.sum()

(transactions_percent, offers_percent)

- Ans 8: Nearly 45.5% are trasactions and 54.5% are offers.

#### Ques9: What are the types of  offers  : received,views, completed ?

In [None]:
offers_df.event.value_counts()

In [None]:
offer_received = offers_df[offers_df["event"] == "offer received"]
offer_viewed = offers_df[offers_df["event"]== "offer viewed"]
offer_completed = offers_df[offers_df["event"] == "offer completed"]


In [None]:
# Visualize distribution of membership days grouped by success
fig, ax = plt.subplots(1, 3, figsize=(15, 5))
fig.suptitle('Offer types : received, viewed and completed', fontsize=15, weight='bold')

# Subplot for bogo offers
plt.subplot(1, 3, 1)
sns.countplot(x=offer_received['offer_type'])
plt.title('Number of types of offers received ', fontsize=13)
plt.xlabel('Offer Received')
plt.xticks(rotation = 45, fontsize=13)


# Subplot for discount offers
plt.subplot(1, 3, 2)
sns.countplot(x=offer_viewed['offer_type'])
plt.title('Number of Viewed Promotions for each Offer', fontsize=13)
plt.xlabel('Offer Viewed')
plt.xticks(rotation = 45, fontsize=13)

# Subplot for informational offers
plt.subplot(1, 3, 3)
sns.countplot(x=offer_completed['offer_type'])
plt.title('Number of Viewed Promotions for each Offer', fontsize=13)
plt.xlabel('Offer Completed')
plt.xticks(rotation = 45, fontsize=13)
plt.show()

In [None]:
le2.inverse_transform([0, 1, 2, 3])

- Ans9: More of Bogo and Dicount offers were received by the customers than that of informational.
- More Bogo offers have been viewed
- Most of the discount offers have been completed and no informational offer completed.
- Hence, in order to make a offer complete, more of discount offers must be sent to the customers.
- Here, bogo has also been a good offer since high number of customers view such offers.m

#### Ques10: What is the Income Distribution for the Offer Events?

In [None]:
#Create a age group Column cleaning by  segregation
offers_df['age_groups'] = pd.cut(offers_df.age, bins=[11, 20, 30, 40, 50, 60, 70, 80, 110], 
                               labels=['11-19', '20-29', '30-39', '40-49', '50-59', '60-69', '70-79', '80+'])


#Create a Income group Column cleaning by  segregation
offers_df['income_groups'] = pd.cut(x=profile["income"],
                                    bins=[30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000, 110000,  120000],
                                   labels =['30-40K','40-50K','50-60K','60-70K','70-80K','80-90K','90-100K','100-110K','110-120K'])

In [None]:
plt.figure(figsize=(14, 6))
sns.countplot(x=offers_df['income_groups'], hue="event", data=offers_df)
plt.title("Income Distribution for the Offer Events")
plt.ylabel('Total')
plt.xlabel('Income ')
plt.xticks(rotation = 30)
plt.legend(title='Offer Event')
plt.show();


- **Ans10:** Highest Offer is received by income group of 50-60k with the least of 110-120k.
- The highest offer completed is also from 50-60k and decreses on either side with a larger slope on the higher income groups.
- starbucks have  lesser higher income group customers.

In [None]:
fig, ax = plt.subplots(1, 3, figsize=(13, 12))
fig.suptitle('Offer types amongst ages, gender and income groups', fontsize=18, weight='bold')

plt.subplots_adjust(left=0.1,
                    bottom=0.1, 
                    right=0.9, 
                    top=1.9, 
                    wspace=0.4, 
                    hspace=0.4)

fig.tight_layout()

#fig.tight_layout(pad=5.0)
plt.subplot(3, 1, 1)
#plt.figure(figsize=(14, 5))
sns.countplot(x="age_groups", hue="offer_type", data=offers_df);
plt.ylabel('Total',fontsize=15);
plt.xlabel('Age Group',fontsize=15);
plt.xticks(rotation = 0);
plt.legend(title='Offer Type');

plt.subplot(3, 1, 2)
sns.countplot(x=offers_df['gender'], hue = 'offer_type', data=offers_df);
plt.ylabel('Total',fontsize=15);
plt.xlabel('Gender',fontsize=15);
plt.xticks(rotation = 0);

plt.subplot(3, 1, 3)
sns.countplot(x=offers_df['income_groups'], hue = 'offer_type', data=offers_df)
plt.ylabel('Total',fontsize=15)
plt.xlabel('Income Group',fontsize=15)
plt.xticks(rotation = 45)
plt.show();



In [None]:
le2.inverse_transform([0, 1, 2, 3])

In [None]:
le3.inverse_transform([0, 1, 2])

- **Ans11:** We can see from the above graphs that, Bogo is slightly more popular amongst the ages,gender and income groups.
- 50-59 age group is more respondent to these offers than the otherer groups
- Also, for the income distribution, the informational offer is almost round 50% than the other two.
- Most male are respondents of these offers than the females with BOGO its leading type
- To sumup it up,  the active starbucks customer respondents are from the age group of 50-59, with higher male percentage having and annual income of 50-60k.

In [None]:
#completed_off_count = transcript[transcript['event'] == 'offer completed']
plt.figure(figsize=(14, 5))
offer_completed = offers_df[offers_df["event"] == "offer completed"]
sns.countplot(y=offer_completed['offer_id'])
plt.title('Number of Completed offers for each Offer')
plt.ylabel('Offer ID')
plt.xticks(rotation = 45)
plt.show();


#### Ques 12: What is the highest completed offer?

In [None]:
print("Number of Completion: {}" .format(offer_completed.offer_id.value_counts().values[0]))
print("Offer ID with maximum offers completed:{}".format(offer_completed.offer_id.value_counts().index[0]))

In [None]:
le1.inverse_transform([9])

- Ans12: Out of the orders completed,The offer_id which was a gained higher success rate is 'fafdcd668e3743c1bb461111dcafc2a4'
- with a total of 4957 completions

#### Ques13:What is the lowest completed offer??

In [None]:
print("Number of Completion: {}" .format(offer_completed.offer_id.value_counts().values[-1]))
print("Offer ID with minimum offers completed:{}".format(offer_completed.offer_id.value_counts().index[-1]))

In [None]:
le1.inverse_transform([4])

- Ans13: Out of the orders completed,The offer_id which was a gained least success rate is '4d5c57ea9a6940dd891ad53e9dbe8da0'
- with a total of 3281 completions

In [None]:
offers_df.info()

In [None]:
# dropping these columns because with null values, datetime,object,category datatypes
cols_to_drop = ['age_groups','income_groups','amount','became_member_on' ,'event']
offers_df = offers_df.drop(columns= cols_to_drop)

In [None]:
offers_df.head()

## Data Modelling
#### Unsupervised Learning

In [None]:
from sklearn.cluster import KMeans

In [None]:
from sklearn.cluster import KMeans
wss = [] #within the sum of squares
for k in range(1,11): #take the range of Kvalues 1-10
    kmeans = KMeans(n_clusters=k, init="k-means++")
    kmeans.fit(offers_df) #Fit the subset of data
    wss.append(kmeans.inertia_) #inertia_ : Sum of squared distances of samples to their closest cluster center.

#Plot the Figure
plt.figure(1, figsize=(14,5))
plt.plot(range(1,11), wss, color='green', linewidth=2.0 , marker = "o")
plt.xlabel("K values")
plt.ylabel("wss (Within Sum of square)")
plt.show()

- The Graph makes an elbow at 2.
- Number of optimal clusters for the dataset is 2

In [None]:
#Take the number of clusters as 2
kmodel = KMeans(n_clusters=2,random_state=10)
#Fit the model to predict the labels
cluster_labels =kmodel.fit_predict(offers_df) 
cluster_labels #view the labels of the cluster

In [None]:
offers_df["cluster"] = cluster_labels

In [None]:
#Check the centroids for the above cluster model
kmodel.cluster_centers_

In [None]:
fig, ax = plt.subplots(1,2, figsize=(15, 10))

plt.subplots_adjust(left=0.1,
                    bottom=0.1, 
                    right=0.9, 
                    top=1.9, 
                    wspace=0.4, 
                    hspace=0.4)
fig.tight_layout()
plt.subplot(2, 1, 1)
sns.countplot(x="income", hue="offer_type", data=offers_df[cluster_labels==0]);
plt.title('K-Means Clustering of cluster labels 1',fontsize=15);
plt.ylabel('Total');
plt.xlabel('Income');
plt.xticks(rotation = 45,fontsize= 15);
plt.legend(title='Offer Type');

plt.subplot(2, 1, 2)
sns.countplot(x="income", hue="offer_type", data=offers_df[cluster_labels==1]);
plt.title('K-Means Clustering of cluster labels 2',fontsize=15);
plt.ylabel('Total');
plt.xlabel('Income');
plt.xticks(rotation = 45, fontsize= 15);
plt.legend(title='Offer Type');

plt.show();


- Compared to BOGO and Discount offer, the informational offers are very less popular.
- Few cases the Discout Offers are used more than the BOGO offers:
- In **cluster1** at income=51000, income=52000  and
- In **cluster2** at income=76000, income=77000
- Since the income is unevenly distributed,it can also be concluded that the annual income is indepedent of the purchasing behaviour

In [None]:
# Plot graph to find Most Popular Offers Type Gender wise for cluster 1
fig, ax = plt.subplots(1,2, figsize=(15, 10))

fig.tight_layout()

plt.subplot(2, 1, 1)
sns.countplot(x="gender", hue="offer_type", data=offers_df[cluster_labels==0])
plt.title('K-Means Clustering for cluster 1 ')
plt.ylabel('Total')
plt.xlabel('Gender')
plt.xticks(rotation = 0)
plt.legend(title='Offer Type')


plt.subplot(2, 1, 2)
sns.countplot(x="gender", hue="offer_type", data=offers_df[cluster_labels==1])
plt.title('K-Means Clustering for cluster 2')
plt.ylabel('Total')
plt.xlabel('Gender')
plt.xticks(rotation = 0)
plt.legend(title='Offer Type')
plt.show();

In [None]:
print("Income Range for Cluster 1:", offers_df[cluster_labels==0]['income'].min(), 
      "to", offers_df[cluster_labels==0]['income'].max())

print("Income Range for Cluster 2:", offers_df[cluster_labels==1]['income'].min(), 
      "to", offers_df[cluster_labels==1]['income'].max())

-  Compared to BOGO and Discount offer, the informational offers are less popular.
-  For **Cluster 1**, the income ranges from 30000.0 to 68000.0. 
-  It can thus be concluded  that Males with the above income range tend to spend more than Females and Other Genders for the BOGO and Discount Offers.
-  For **Cluster 2**, the income ranges from 69000.0 to 120000.0. 
-  It can thus be concluded  that Females with income range 71000.0 to 120000.0 tend to spend more than Males and Other Genders for the BOGO and Discount Offers.


## Supervised Learning
#### The target column is offer_type.  It will help to predict the correct offer_type to send to each customer.

In [None]:
offers_df.info()

In [None]:
# Split the Data into Target and Features variables

target = offers_df['offer_type']
features = offers_df.drop(columns=['offer_type', 'customer_id', 'offer_id','cluster'])

In [None]:
features.head()

In [None]:
#Create training and testing sets
from sklearn.model_selection import train_test_split, GridSearchCV

X_train, X_test, y_train, y_test = train_test_split(features, target,test_size=0.2, random_state=10)

print('Training Features Shape:', X_train.shape)
print('Training Labels Shape:', y_train.shape)
print('Testing Features Shape:', X_test.shape)
print('Testing Labels Shape:', y_test.shape)

### Metrics:
- Since it is a classification problem,
- we will use accuracy to evaluate my models.
- Comapre the correct predictions and total number of predicitons to determine the accuracy of the model and choose the best.
- **Five different ML algorithms** can be test on the datset :  
1. Decision Trees  
2. Logistic Regression  
3. Nearest Neighbours (KNN)  
4. Naive Bayes  
5. Random Forest  



In [None]:
def train_predict(model, X_train, y_train, X_test, y_test): 
    '''
    Description: Train the dataset and predict its accuracy using differnt ML algorithms
              for testing take first 300 training samples (X_train[:300],y_train[:300])
    INPUT:
       - model: the learning algorithm to be trained and predicted
       - X_train: features training set
       - y_train: income training set
       - X_test: features testing set
       - y_test: income testing set
    OUTPUT:
        - Accuracy Scores and F Scores of the models
    '''
    
    results = {}
    
    # Fit the model to the training data
    model.fit(X_train, y_train)
   
    # Predict on the X_test,
    predictions_test =model.predict(X_test)
    
    # Predict on the first 300 training samples(X_train)
    predictions_train = model.predict(X_train[:300]) 
    
    # Accuracy on  y_train[:300]
    results['acc_train'] = accuracy_score(y_train[:300], predictions_train)
        
    # Accuracy on test set using accuracy_score()
    results['acc_test'] = accuracy_score(y_test, predictions_test)
    
    # F-score on y_train[:300]  using fbeta_score()
    results['f_train'] = fbeta_score(y_train[:300], predictions_train, beta = 0.5, average='weighted') # average = weighted because multiclass classification
        
    #  F-score on  y_test
    results['f_test'] = fbeta_score(y_test, predictions_test, beta = 0.5, average='weighted')
       
    print("{} trained.".format(model.__class__.__name__))
        
    # Return the results
    return results

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB 
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import fbeta_score
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV

In [None]:
# Initialize the models

lr = LogisticRegression(random_state=10)
rf = RandomForestClassifier(random_state=10)
knn = KNeighborsClassifier()
gnb = GaussianNB() 
dt = DecisionTreeClassifier()

In [None]:
# Collect results on the models
results = {}
for model in [lr, rf, knn, gnb, dt]:
    model_name = model.__class__.__name__
    results[model_name] = {}
    results[model_name] = train_predict(model, X_train, y_train, X_test, y_test)

In [None]:
for i in results.items():
    print (i[0])
    display(pd.DataFrame(i[1], index=range(1)))

- Accuracy score is 100% for  training and testing datasets for **RandomForestClassifier,GaussianNB,DecisionTreeClassifier** which can lead to **overfitting.
- Since logistic Regression has a very low train accuracy of 0.50 and test accuracy of 0.52.
- So, we choose **KNeighborsClassifier.**
- It has good results **0.93 on training and 0.82 on testing datasets.**
- Since we have few binomial outcomes ( BOGO = 1, discount = 2, informational = 3 ) we can use  **KNeighborsClassifier.**.

## Hyper parameter tuning of KNeighborsClassifier to increase the acuuracy

- It is possible to **improve the performance of the model** from it base instance by **tuning hyperparameters** of that algorithm.
- We will define **a range of values** that would be evaluated in the hyper parameter space of the for **KNeighborsClassifier** model using **RandomizedSearchCV.**

In [None]:
#Randomly assign the values to the parameters
para_grid = {"n_neighbors" :list(range(20,30)), 
             "leaf_size" :list(range(1,6)),
             "p" : [1,2]}

In [None]:
knn_randomcv = RandomizedSearchCV(estimator= knn, param_distributions= para_grid , n_iter=10, cv=3, 
                                  verbose=2,random_state=100,n_jobs=-1)
knn_randomcv.fit(X_train, y_train)

In [None]:
#find the best parameter values
best_parameters = knn_randomcv.best_params_
best_parameters

In [None]:
# instantiate model with best parameters
train_predict(KNeighborsClassifier(leaf_size=4, n_neighbors=21, p=1),X_train, y_train, X_test, y_test)

- The best scores achieved after tuning,its essential hyper-parameters{'p': 1, 'n_neighbors': 21, 'leaf_size': 4} byKNeighborsClassifier : **training accuracy : 0.93 and testing accuracy :0.93**  .
- testing accuracy has increased after hyperparameter tunning

## Evalute the model accuracy

Lets pick a random customer from our data n test its accuracy

In [None]:
features.columns

In [None]:
features.iloc[27275,:]

In [None]:
target.iloc[27275]

In [None]:
le2.inverse_transform([0, 1, 2, 3])

The above taken cutsomer has responded to **BOGO** type of offer.

#### Now lets evaluate our model to check its accuracy

If we have the customer data with the above features we would be able to preduct its offer type using our above tested model

In [None]:
customer_data = [588, 1.0, 51.0, 61000.0, 2015, 10.0, 10.0, 7.0, 1.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0]

In [None]:
clf = KNeighborsClassifier(leaf_size=4, n_neighbors=21, p=1)
clf.fit(features, target)
clf.score(features, target)

In [None]:
clf.predict([customer_data])

- The model has correctly predicted that the customer will likely respond tor **BOGO offer** type with an **accuracy of 93 %.**
- Hence our model has good accuracy for prediction,

## Conclusion:
### Segmentation of startbucks Customers:
- The customers can be segmented depending on various parameters according to the campaign chosen
- On analysis the data using supervised and unsupervised learning(Kmeans), we can conclude that:
- Different segments of customers react to offers differently.
- The count of male customers in low-income level is slightly higher than that of female and other customers
- Though the aveage salary of femal is greater than that of the male, female spend less on starbucks than male
- Starbucks has more of the young crowd than those of the aged once.
- The result of the offer_type was prediced by training a supervised classifier. 

## Results:- 
- Customers are attracted to **BOGO and Discount** offers more as compared to Informational Offers
- **The buying behaviour of a customer is indepemdent of its annual income**
- **Starbucks have more male customers than females and other gender.**
- **KNeighborsClassifier** turned out to be the best algorithm for this task and predicts customer response with an accuracy rate of almost 93% after hyperarameter tuning. Given the fact that also the same customer will react differently the same offer.