<p align="center" draggable="false"><img src="https://user-images.githubusercontent.com/37101144/161836199-fdb0219d-0361-4988-bf26-48b0fad160a3.png" 
     width="200px"
     height="auto"/>
</p>

# **Bonus Challenge! - Taking it to the Next Level 📶**

## :coffee: Starbucks Promotional Offers

Starbucks Corporation is the leading roaster, retailer, and marketer of specialty coffee in the world.  Starbucks' mobile app routinely pushes offers to customers in the hope of enticing them to visit one of its many local coffeehouses.  An offer may be promotional, such as a **discount** on a seasonal beverage or pastry or a **BOGO (Buy One Get One free)** deal, or **informational**, which is simply an advertisement for a product.  Some users may receive offers more frequently than others.

In this challenge:

1. We want to predict the responsiveness of a customer to an offer - i.e. how likely they are to make a purchase after receiving an offer.

2. We will use Spark and will introduce the Pandas on Spark API - this will help out with your Exploratory Data Analysis (EDA) and Feature Engineering.

3. You will be responsible for the majority of the coding in this section, but you will not be left alone!  If you get stuck, look at the dropdown hints (🤔 **Hints**).  The hints start off vague and get increasingly specific to help guide you through the challenge.

4. You will use **multiclass classification** techniques in Spark to classify customer responsiveness to promotional offer campaigns.


# Multiclass Classification

[Multiclass classification](https://en.wikipedia.org/wiki/Multiclass_classification) extends binary classification to problems with more than two labels.  One common approach trains a separate binary classification model for each label; the label whose model produces the highest probability score becomes the prediction for the input.
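The "highest score wins" rule can be sketched in a few lines of plain Python. This is only an illustration: the label names and raw scores below are made up, and the helper `predict_one_vs_rest` is hypothetical rather than part of any library.

```python
import math
from typing import Dict


def sigmoid(z: float) -> float:
    """Squash a raw score into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_one_vs_rest(scores_by_label: Dict[str, float]) -> str:
    """Return the label whose binary model gives the highest probability."""
    probabilities = {label: sigmoid(score) for label, score in scores_by_label.items()}
    return max(probabilities, key=probabilities.get)


# Hypothetical raw scores from three independent binary models.
raw_scores = {"unresponsive": -1.2, "somewhat_responsive": 0.4, "very_responsive": 2.1}
predicted_label = predict_one_vs_rest(raw_scores)
print(predicted_label)  # very_responsive
```

Each binary model only answers "this label or not?"; comparing their probabilities is what turns several binary answers into one multiclass prediction.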

# Spark Environment
### Make sure to run this notebook in your Spark environment!

# **Get the Data**

There are three datasets in this challenge.  **Portfolio** contains the different offer campaigns, **Profile** contains the customer profile, and **Transcript** contains the user interactions with the offer campaigns.  Have a look 👀 for yourself!

Love 🐼 Pandas?  Love Spark? Let's combine them!  Born out of a Databricks project called 🐨 Koalas, [Pandas on Spark](https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/index.html) allows us to use the Pandas API distributed on Spark!  Check out this [article](https://towardsdatascience.com/run-pandas-as-fast-as-spark-f5eefe780c45) for a nice quickstart as well as the official [API Documentation](https://spark.apache.org/docs/3.3.0/api/python/reference/index.html)!

In [None]:
import pyspark.pandas as ps

psdf_portfolio = ps.read_csv("../dat/portfolio.csv")
psdf_profile = ps.read_csv("../dat/profile.csv")
psdf_transcript = ps.read_csv("../dat/transcript.csv")

In [None]:
psdf_portfolio.head()

In [None]:
psdf_profile.head()

In [None]:
psdf_transcript.head()

# **Explore the Data - EDA**

It's your turn to practice EDA!  Perform either manual EDA or use automated EDA tools like SweetViz!  Get to know the data.  What do you notice?  What stands out?

## **🤔 Hints**
<details>
<summary>Hint 1</summary>
Take a look at each dataset independently.  What do you notice?  Are there any missing values? Are there values that don't make sense? What are the unique events?
<br></br>
</details>

<details>
<summary>Hint 2</summary>
Is the dataset balanced?  
<br></br>
</details>

<details>
<summary>Hint 3</summary>
Notice that the challenge is to predict customer responsiveness. How can responsiveness be represented in the data?
</details>

In [None]:
# YOUR CODE GOES HERE!

# **Set Up the ETL Pipeline + Feature Engineering (The Art 🎨 🖌 of ML)**

Bring the datasets together, perform feature engineering, rename columns, and drop columns that you don't need for predictions.

**Required**: Create a feature for customer responsiveness using the event column in the transcript table.  

  *   Create **five** labels (one for each event + unresponsive) for customer responsiveness using a 5-point Likert scale:
  
    1. Unresponsive
    2. Slightly Responsive
    3. Somewhat Responsive
    4. Moderately Responsive
    5. Very Responsive

## **🤔 Hints**
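One possible way to turn events into a 1-5 label is sketched below in plain Python. The event names and their ranking are assumptions made for illustration, not something the dataset dictates; check the actual event values during EDA and choose the ordering that makes sense to you. The helper `responsiveness_label` is hypothetical.

```python
from typing import Iterable

# Assumed event names and ranking, from least to most engaged.
# Both are illustrative guesses; adjust them to what your EDA shows.
EVENT_RANK = {
    "offer received": 2,   # Slightly Responsive
    "offer viewed": 3,     # Somewhat Responsive
    "transaction": 4,      # Moderately Responsive
    "offer completed": 5,  # Very Responsive
}


def responsiveness_label(events: Iterable[str]) -> int:
    """Map all events seen for one customer to a label from 1 to 5."""
    ranks = [EVENT_RANK[event] for event in events if event in EVENT_RANK]
    # A customer with no recognised events is labelled 1 (Unresponsive).
    return max(ranks, default=1)


print(responsiveness_label([]))                                  # 1
print(responsiveness_label(["offer received", "offer viewed"]))  # 3
```

Taking the maximum rank means a customer is labelled by the deepest level of engagement they ever reached, which is only one of several reasonable design choices.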
<details>
<summary>Hint 1</summary>
Take a look at the AGE, GENDER, and INCOME variables.  Do you notice anything unusual?  Do the values make sense?  How will you handle this?
<br></br>
</details>

<details>
<summary>Hint 2</summary>
Are there any other variables that you could create features for?  
<br></br>
</details>

<details>
<summary>Hint 3</summary>
As you bring the datasets together, what do you notice about the relationship between the customer and transcript tables?  Is it a one-to-one or a one-to-many relationship?
<br></br>
</details>

<details>
<summary>Hint 4</summary>
Since the data has a one-to-many relationship between the customer and transcript tables, you will need to flatten your dataset (i.e. into one row per customer).
</details>

<details>
<summary>Hint 5</summary>
Create binary features (dummy variables) for the channels, offer_type, and event variables.  Use the event dummy variables to create your customer responsiveness labels.
<br></br>
</details>
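For readers who want to see the flattening idea in isolation, here is a minimal plain-Python sketch that collapses many event rows per customer into one row of 0/1 flags. The customer ids, the event names, and the helper `flatten_events` are all hypothetical; in the challenge itself you would express the same idea with the Pandas on Spark API.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Assumed event names; verify them against the transcript table.
EVENTS = ["offer received", "offer viewed", "offer completed", "transaction"]

# Made-up transcript rows: (customer id, event).
transcript_rows = [
    ("c1", "offer received"),
    ("c1", "offer viewed"),
    ("c2", "offer received"),
    ("c1", "offer completed"),
]


def flatten_events(rows: Iterable[Tuple[str, str]]) -> Dict[str, Dict[str, int]]:
    """Collapse many event rows per customer into one row of binary flags."""
    flags = defaultdict(lambda: {event: 0 for event in EVENTS})
    for customer_id, event in rows:
        if event in EVENTS:
            flags[customer_id][event] = 1
    return dict(flags)


flat = flatten_events(transcript_rows)
print(flat["c2"])
```

The resulting flags play the role of dummy variables: one column per event, one row per customer.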

### Here's a freebie! Let's create a row-level function to parse out the offer id value. This probably won't be the last time that you'll need it! 😉

In [None]:
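import pandas as pd

# Row-level function to clean the offer id in the transcript table.
# The return type hint lets pandas-on-Spark skip schema inference.
def get_offer_id(offer_id) -> str:
  # Take the third whitespace-separated token, e.g. "'<id>'}".
  offer_id = offer_id.split()[2]
  # Strip the leading quote and the trailing "'}".
  return offer_id[1:-2]

psdf_transcript["offer_id"] = psdf_transcript["value"].apply(get_offer_id)

# Drop the original value column (explicit `columns=` avoids dropping by index label).
psdf_transcript = psdf_transcript.drop(columns="value")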

psdf_transcript.head()
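The freebie relies on every `value` string splitting into at least three whitespace-separated tokens. If some rows have a different shape, a more defensive variant could parse the string as a Python literal instead. The sketch below assumes the values look like dictionary strings and that the key is spelled `offer id` or `offer_id`; both are assumptions to confirm during EDA, and `parse_offer_id` is a hypothetical helper.

```python
import ast
from typing import Optional


def parse_offer_id(value: str) -> Optional[str]:
    """Return the offer id from a dict-like string, or None if it has none."""
    try:
        record = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return None
    if not isinstance(record, dict):
        return None
    # Assumed key spellings; check the real keys in the transcript table.
    for key in ("offer id", "offer_id"):
        if key in record:
            return record[key]
    return None


print(parse_offer_id("{'offer id': 'abc123'}"))  # abc123
print(parse_offer_id("{'amount': 0.83}"))        # None
```

Returning `None` instead of raising keeps rows without an offer id in the table, so you can decide later whether to drop or impute them.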

In [None]:
# YOUR CODE GOES HERE!

# **Modeling**

## Prepare your dataset for ML!

In [None]:
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler

# YOUR CODE GOES HERE!

## Train-Test-Split

In [None]:
# YOUR CODE GOES HERE!

There are a couple of different ways to set up a multiclass classifier as an extension of Logistic Regression in PySpark: 

1.   Multinomial Logistic Regression (softmax)
2.   One vs Rest (OVR) / One vs All (OVA)


### Multinomial Logistic Regression (softmax)

Logistic Regression can be adapted to a multiclass task by setting `family = "multinomial"`.  
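With the multinomial family, a single model produces one raw score per class, and the softmax function converts those scores into probabilities that sum to 1. A minimal sketch of that last step, using made-up scores for five classes:

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Convert raw class scores into probabilities that sum to 1."""
    # Subtracting the maximum keeps the exponentials numerically stable.
    highest = max(scores)
    exps = [math.exp(score - highest) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


# Hypothetical raw scores for five responsiveness classes.
class_scores = [0.5, 1.5, -0.3, 2.0, 0.0]
probabilities = softmax(class_scores)
predicted_class = probabilities.index(max(probabilities))
print(predicted_class)  # 3
```

Unlike the one-vs-rest approach, the class probabilities here come from one jointly trained model, so they are directly comparable with each other.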

In [None]:
from pyspark.ml.classification import LogisticRegression

# Instantiate the classifier
# YOUR CODE GOES HERE!

# Fit the model 
# YOUR CODE GOES HERE!

# Make Predictions
# YOUR CODE GOES HERE!

# Evaluate the model
# YOUR CODE GOES HERE!


### One vs Rest (OVR) / One vs All (OVA)

In [None]:
from pyspark.ml.classification import LogisticRegression, OneVsRest
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Instantiate the base Logistic Regression classifier (assign it to `lr`).
# YOUR CODE GOES HERE!

# Instantiate the One Vs Rest classifier.
ovr = OneVsRest(classifier=lr)

# Fit the model 
# YOUR CODE GOES HERE!

# Make Predictions
# YOUR CODE GOES HERE!

# Evaluate the model
# YOUR CODE GOES HERE!


## Set Up Your Modeling Pipeline with Hyperparameter Tuning

In [None]:
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

# YOUR CODE GOES HERE!

Additionally, you can solve Multiclass Classification problems using:

1.   Random Forest
2.   Decision Trees
3.   Naïve Bayes

You can also extend the idea of multiclass classification to [multi-label classification](https://en.wikipedia.org/wiki/Multi-label_classification), where each input can carry several labels at once.

But that's another lesson...

## Review Questions

1.  What are two other applications you could use multiclass classification for?
2.  What do you think the drawbacks of multiclass classification are?