# Predict Visitor Purchases with a Classification Model

Build a machine learning model for an ecommerce business that wants to understand its customers' purchasing habits.

### 1. Connecting BigQuery to a Jupyter Notebook

Set the environment variable that lets the notebook authenticate to BigQuery.

In [1]:
import os 
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'C:/Users/Rakyan Prajnagra/Documents/Data Engineer/GCP-DataEngineerLearningPath/Quest-MachineLearning/Quest-3-Predict-Visitor-Purchases/qwiklabs-gcp-01-acfb5f19a1a7-6583f36c2676.json'
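
Before loading the extension, it can help to confirm that the key file actually exists, because a wrong path typically only surfaces once the first query runs. A minimal sketch; the helper name `credentials_file_exists` is hypothetical and not part of any Google library:

```python
import os

def credentials_file_exists(env=None):
    """Return True if GOOGLE_APPLICATION_CREDENTIALS points at an existing file.

    `env` defaults to os.environ; passing a dict makes the check easy to test.
    """
    env = os.environ if env is None else env
    path = env.get("GOOGLE_APPLICATION_CREDENTIALS", "")
    return bool(path) and os.path.isfile(path)
```

Calling `credentials_file_exists()` right after the cell above should return `True` if the path is correct.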

Load the BigQuery magics extension by running the command below.

In [2]:
%load_ext google.cloud.bigquery

### 2. Explore Google Merchandise Store Data

Before building a model, we explore the ecommerce data we will use: the Google Merchandise Store data in BigQuery.

The data is in the `web_analytics` table in the `ecommerce` dataset. First, check the schema of the table.

In [4]:
%%bigquery
SELECT column_name, data_type
FROM `data-to-insights.ecommerce.INFORMATION_SCHEMA.COLUMNS`
WHERE table_name = 'web_analytics'
ORDER BY ordinal_position

Unnamed: 0,column_name,data_type
0,visitorId,INT64
1,visitNumber,INT64
2,visitId,INT64
3,visitStartTime,INT64
4,date,STRING
5,totals,"STRUCT<visits INT64, hits INT64, pageviews INT..."
6,trafficSource,"STRUCT<referralPath STRING, campaign STRING, s..."
7,device,"STRUCT<browser STRING, browserVersion STRING, ..."
8,geoNetwork,"STRUCT<continent STRING, subContinent STRING, ..."
9,customDimensions,"ARRAY<STRUCT<index INT64, value STRING>>"


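Some fields have ARRAY and STRUCT data types. These nested fields need care when querying: STRUCT members are accessed with dot notation (for example `totals.transactions`), and ARRAY fields must be flattened with `UNNEST`.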
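
To see why nested types need care, here is a plain-Python sketch of what flattening an ARRAY of STRUCTs does, similar in spirit to `UNNEST` in BigQuery. The session rows and the `flatten_custom_dimensions` helper are made up for illustration:

```python
def flatten_custom_dimensions(sessions):
    """Yield one flat row per (session, customDimensions element) pair."""
    for session in sessions:
        for dim in session["customDimensions"]:
            yield {
                "visitId": session["visitId"],
                "index": dim["index"],
                "value": dim["value"],
            }

# Hypothetical sessions: one with two array elements, one with an empty array.
sessions = [
    {"visitId": 1, "customDimensions": [{"index": 4, "value": "EMEA"},
                                        {"index": 5, "value": "member"}]},
    {"visitId": 2, "customDimensions": []},
]
flat_rows = list(flatten_custom_dimensions(sessions))
```

Session 2 does not appear in `flat_rows` because its array is empty, which mirrors how a comma cross join with `UNNEST` drops rows whose array has no elements.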

Now find out how many visitors made a purchase.

In [5]:
%%bigquery
WITH visitors AS(
  SELECT
  COUNT(DISTINCT fullVisitorId) AS total_visitors
  FROM `data-to-insights.ecommerce.web_analytics`),
purchasers AS(
  SELECT
  COUNT(DISTINCT fullVisitorId) AS total_purchasers
  FROM `data-to-insights.ecommerce.web_analytics`
  WHERE totals.transactions IS NOT NULL)
SELECT
  total_visitors,
  total_purchasers,
  total_purchasers / total_visitors AS conversion_rate
FROM visitors, purchasers

Unnamed: 0,total_visitors,total_purchasers,conversion_rate
0,741721,20015,0.026985


Of all the visitors to the website, only about 2.7% made a purchase.
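
As a quick sanity check, the conversion rate can be recomputed from the two counts in the output above. This sketch only repeats the arithmetic of the query; the variable names are chosen for illustration:

```python
# Counts copied from the query output above.
total_visitors = 741721
total_purchasers = 20015

# Same division the query performs.
conversion_rate = total_purchasers / total_visitors
# Expressed as a percentage with two decimals.
conversion_pct = round(conversion_rate * 100, 2)
```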

Let's check the top 5 selling products on the website.

In [6]:
%%bigquery
SELECT
  p.v2ProductName,
  p.v2ProductCategory,
  SUM(p.productQuantity) AS units_sold,
  ROUND(SUM(p.localProductRevenue/1000000),2) AS revenue
FROM `data-to-insights.ecommerce.web_analytics`,
UNNEST(hits) AS h,
UNNEST(h.product) AS p
GROUP BY 1, 2
ORDER BY revenue DESC
LIMIT 5

Unnamed: 0,v2ProductName,v2ProductCategory,units_sold,revenue
0,Nest® Learning Thermostat 3rd Gen-USA - Stainl...,Nest-USA,17651,870976.95
1,Nest® Cam Outdoor Security Camera - USA,Nest-USA,16930,684034.55
2,Nest® Cam Indoor Security Camera - USA,Nest-USA,14155,548104.47
3,Nest® Protect Smoke + CO White Wired Alarm-USA,Nest-USA,6394,178937.6
4,Nest® Protect Smoke + CO White Battery Alarm-USA,Nest-USA,6340,178572.4


Products in the Nest-USA category dominate the top sellers.

Next, how many visitors bought on a subsequent visit to the website?

In [8]:
%%bigquery
WITH all_visitor_stats AS(
  SELECT
    fullvisitorid,
    IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0) AS will_buy_on_return_visit
  FROM `data-to-insights.ecommerce.web_analytics`
  GROUP BY fullvisitorid)
SELECT
  COUNT(DISTINCT fullvisitorid) AS total_visitors,
  will_buy_on_return_visit
FROM all_visitor_stats
GROUP BY will_buy_on_return_visit

Unnamed: 0,total_visitors,will_buy_on_return_visit
0,729848,0
1,11873,1


According to the result, about 1.6% of all visitors return and make a purchase on the website. This includes the subset of visitors who bought on their very first session and then came back and bought again. One possible reason purchases happen on a later visit is that visitors compare prices with other ecommerce sites before buying.
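
The share of returning buyers follows directly from the two rows of the output above. A small sketch of that arithmetic, with variable names chosen for illustration:

```python
# Visitor counts per label value, copied from the query output above.
visitors_by_label = {0: 729848, 1: 11873}

# The two groups together should cover every distinct visitor.
total_visitors = sum(visitors_by_label.values())
# Fraction of visitors who bought on a return visit.
return_buyer_share = visitors_by_label[1] / total_visitors
```

The total matches the 741,721 distinct visitors counted in the earlier conversion-rate query.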

Our objective is to predict whether a new user is likely to purchase in the future. Identifying these users helps the marketing team target promotions or ad campaigns.

Now it is time to select features that help a machine learning model learn the relationship between a visitor's first session on the website and whether they will return and make a purchase. We test whether these 2 fields are good inputs:
1. Whether the visitor left the website immediately (`bounces`)
2. How long the visitor was on our website (`time_on_site`)

The label that we will use is `will_buy_on_return_visit`.
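
The label logic used in the queries (`COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0`) can be sketched in plain Python. The session records and the `label_return_buyers` helper below are hypothetical and only mirror the SQL condition:

```python
def label_return_buyers(sessions):
    """Map each visitor id to 1 if any non-first session has a transaction, else 0."""
    labels = {}
    for s in sessions:
        # A NULL transaction count (None here) never satisfies "> 0".
        bought_on_return = (s["transactions"] or 0) > 0 and s["newVisits"] is None
        visitor = s["fullVisitorId"]
        labels[visitor] = int(labels.get(visitor, 0) or bought_on_return)
    return labels

# Hypothetical sessions for three visitors.
sessions = [
    {"fullVisitorId": "A", "newVisits": 1, "transactions": None},
    {"fullVisitorId": "A", "newVisits": None, "transactions": 1},
    {"fullVisitorId": "B", "newVisits": 1, "transactions": 1},
    {"fullVisitorId": "C", "newVisits": 1, "transactions": None},
]
labels = label_return_buyers(sessions)
```

Visitor B bought only on the first visit, so the label stays 0: the label captures purchases on return visits, not purchases in general.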

In [9]:
%%bigquery
SELECT * EXCEPT(fullVisitorId)
FROM(
  SELECT
    fullVisitorId,
    IFNULL(totals.bounces, 0) AS bounces,
    IFNULL(totals.timeOnSite, 0) AS time_on_site
  FROM `data-to-insights.ecommerce.web_analytics`
  WHERE totals.newVisits = 1)
  JOIN(
    SELECT
      fullvisitorid,
      IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0) AS will_buy_on_return_visit
    FROM `data-to-insights.ecommerce.web_analytics`
    GROUP BY fullvisitorid)
USING (fullVisitorId)
ORDER BY time_on_site DESC
LIMIT 10

Unnamed: 0,bounces,time_on_site,will_buy_on_return_visit
0,0,15047,0
1,0,12136,0
2,0,11201,0
3,0,10046,0
4,0,9974,0
5,0,9564,0
6,0,9520,0
7,0,9275,1
8,0,9138,0
9,0,8872,0


In these initial results, only 1 of the 10 visitors with the highest `time_on_site` returned to buy, which is not very promising. Still, we can't tell whether these features are good indicators until we train and evaluate the model.

### 3. Create a Model

From the exploration above, we now have the inputs for the model (the features and the label). Try to create a model with our 2 features.

First, create a dataset to store the model and the table we will use later.

In [10]:
%%bigquery
CREATE SCHEMA ecommerce

Ecommerce dataset has been created.

There are many model types in machine learning, and we have to choose one to build our model. Since we are bucketing visitors into "will buy" or "will not buy", we use `logistic_reg`, a classification model.
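
For intuition, logistic regression turns a weighted sum of the features into a probability between 0 and 1 and then applies a cut-off. The sketch below assumes a 0.5 cut-off and made-up weights; the function names are illustrative and the weights are not taken from the trained model:

```python
import math

def predict_purchase_probability(features, weights, bias):
    """Apply the logistic (sigmoid) function to a weighted sum of features."""
    z = bias + sum(weights[name] * value for name, value in features.items())
    # Two branches keep exp() from overflowing for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def classify(probability, threshold=0.5):
    """Bucket a probability into 1 ("will buy") or 0 ("will not buy")."""
    return 1 if probability >= threshold else 0

# Hypothetical weights: bouncing lowers the score, time on site raises it.
weights = {"bounces": -1.0, "time_on_site": 0.001}
p_bounced = predict_purchase_probability({"bounces": 1, "time_on_site": 0}, weights, bias=0.0)
```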

In [11]:
%%bigquery
CREATE OR REPLACE MODEL `ecommerce.classification_model`
OPTIONS(
    model_type='logistic_reg',
    labels = ['will_buy_on_return_visit'])
AS
    SELECT * EXCEPT(fullVisitorId)
    FROM(
        SELECT
            fullVisitorId,
            IFNULL(totals.bounces, 0) AS bounces,
            IFNULL(totals.timeOnSite, 0) AS time_on_site
        FROM `data-to-insights.ecommerce.web_analytics`
        WHERE 
            totals.newVisits = 1 AND 
            date BETWEEN '20160801' AND '20170430')
    JOIN(
        SELECT
            fullvisitorid,
            IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0) AS will_buy_on_return_visit
        FROM `data-to-insights.ecommerce.web_analytics`
        GROUP BY fullvisitorid)
    USING (fullVisitorId)

We do not train on all the available data, because we need to hold out the rest for evaluating and testing the model. We train only on the first 9 months.
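
The `date` column is a STRING, and filtering it with `BETWEEN` works because zero-padded `YYYYMMDD` strings sort in chronological order. The sketch below applies the three date windows used in this notebook's queries; the function name `assign_split` and the window names are assumptions for illustration:

```python
def assign_split(date):
    """Assign a 'YYYYMMDD' date string to the window used in this notebook."""
    # Plain string comparison is enough for zero-padded YYYYMMDD dates.
    if "20160801" <= date <= "20170430":
        return "train"
    if "20170501" <= date <= "20170630":
        return "evaluate"
    if "20170701" <= date <= "20170801":
        return "predict"
    return "unused"
```

For example, `assign_split("20161231")` falls in the training window, while `assign_split("20170615")` falls in the evaluation window.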

After creating the model, evaluate its performance. For a classification model, we want to minimize the false positive rate and maximize the true positive rate, which means maximizing the area under the ROC curve (AUC).
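
One way to build intuition for AUC: it equals the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one. A minimal pure-Python sketch of that pairwise definition follows. The function name `roc_auc` and the toy labels and scores are illustrative, and this is not necessarily how BigQuery ML computes the metric internally:

```python
def roc_auc(labels, scores):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))

# Toy data: 3 of the 4 positive/negative pairs are ordered correctly.
toy_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

A model that ranks every buyer above every non-buyer scores 1.0, and random scoring lands around 0.5. The pairwise loop is quadratic, so it is only suitable for small illustrations.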

In [12]:
%%bigquery
SELECT
  roc_auc,
  CASE
    WHEN roc_auc > .9 THEN 'good'
    WHEN roc_auc > .8 THEN 'fair'
    WHEN roc_auc > .7 THEN 'decent'
    WHEN roc_auc > .6 THEN 'not great'
  ELSE 'poor' END AS model_quality
FROM
  ML.EVALUATE(MODEL ecommerce.classification_model, (
    SELECT * EXCEPT(fullVisitorId)
    FROM(
      SELECT
        fullVisitorId,
        IFNULL(totals.bounces, 0) AS bounces,
        IFNULL(totals.timeOnSite, 0) AS time_on_site
      FROM `data-to-insights.ecommerce.web_analytics`
      WHERE 
        totals.newVisits = 1 AND
        date BETWEEN '20170501' AND '20170630')
    JOIN(
      SELECT
      fullvisitorid,
      IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0) AS will_buy_on_return_visit
      FROM `data-to-insights.ecommerce.web_analytics`
    GROUP BY fullvisitorid)
    USING (fullVisitorId)))

Unnamed: 0,roc_auc,model_quality
0,0.723834,decent


We evaluate the model on different data (the period that was not used for training). The evaluation gives a `roc_auc` of 0.72, which the query labels only "decent". The goal is to get the AUC as close to 1.0 as possible, so we have to improve the model.
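
The `CASE` expression in the evaluation query maps an AUC value to a quality label. The same thresholds can be sketched in Python; the function name `model_quality` is an assumption for illustration:

```python
def model_quality(roc_auc):
    """Mirror the CASE expression used in the evaluation queries."""
    # Comparisons are strict, exactly as in the SQL (e.g. 0.9 itself is 'fair').
    if roc_auc > 0.9:
        return "good"
    if roc_auc > 0.8:
        return "fair"
    if roc_auc > 0.7:
        return "decent"
    if roc_auc > 0.6:
        return "not great"
    return "poor"
```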

Model performance can be improved by adding new features:
1. How far the visitor got in the checkout process on their first visit
2. Where the visitor came from
3. Device category
4. Geographic information
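
The first feature in the list is derived in the query below as `MAX(CAST(h.eCommerceAction.action_type AS INT64))` over all hits of a session. A plain-Python sketch of the same idea, using made-up hits and a hypothetical helper name:

```python
def latest_ecommerce_progress(hits):
    """Return the highest ecommerce action type seen in a session's hits."""
    # action_type is stored as a string, so cast before taking the maximum.
    return max(
        (int(h["eCommerceAction"]["action_type"]) for h in hits),
        default=None,
    )

# Hypothetical hits from one session.
hits = [
    {"eCommerceAction": {"action_type": "0"}},
    {"eCommerceAction": {"action_type": "2"}},
    {"eCommerceAction": {"action_type": "3"}},
]
progress = latest_ecommerce_progress(hits)
```

Casting matters here: comparing the raw strings would order them alphabetically rather than numerically.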

In [13]:
%%bigquery
CREATE OR REPLACE MODEL `ecommerce.classification_model_2`
OPTIONS(
  model_type='logistic_reg', labels = ['will_buy_on_return_visit']) 
AS
  WITH all_visitor_stats AS(
    SELECT
      fullvisitorid,
      IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0) AS will_buy_on_return_visit
    FROM `data-to-insights.ecommerce.web_analytics`
    GROUP BY fullvisitorid)
  SELECT * EXCEPT(unique_session_id) 
  FROM(
    SELECT
      CONCAT(fullvisitorid, CAST(visitId AS STRING)) AS unique_session_id,
      will_buy_on_return_visit,
      MAX(CAST(h.eCommerceAction.action_type AS INT64)) AS latest_ecommerce_progress,
      IFNULL(totals.bounces, 0) AS bounces,
      IFNULL(totals.timeOnSite, 0) AS time_on_site,
      IFNULL(totals.pageviews, 0) AS pageviews,
      trafficSource.source,
      trafficSource.medium,
      channelGrouping,
      device.deviceCategory,
      IFNULL(geoNetwork.country, "") AS country
    FROM `data-to-insights.ecommerce.web_analytics`,
    UNNEST(hits) AS h
    JOIN all_visitor_stats USING(fullvisitorid)
    WHERE 
      1=1 AND 
      totals.newVisits = 1 AND 
      date BETWEEN '20160801' AND '20170430'
    GROUP BY
      unique_session_id,
      will_buy_on_return_visit,
      bounces,
      time_on_site,
      totals.pageviews,
      trafficSource.source,
      trafficSource.medium,
      channelGrouping,
      device.deviceCategory,
      country)

We still train on the same first 9 months of data, so that the two models can be compared fairly.

Evaluate the performance of the second model.

In [14]:
%%bigquery
SELECT
  roc_auc,
  CASE
    WHEN roc_auc > .9 THEN 'good'
    WHEN roc_auc > .8 THEN 'fair'
    WHEN roc_auc > .7 THEN 'decent'
    WHEN roc_auc > .6 THEN 'not great'
  ELSE 'poor' END AS model_quality
FROM
  ML.EVALUATE(MODEL ecommerce.classification_model_2, (
    WITH all_visitor_stats AS (
      SELECT
        fullvisitorid,
        IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0) AS will_buy_on_return_visit
      FROM `data-to-insights.ecommerce.web_analytics`
      GROUP BY fullvisitorid)
    SELECT * EXCEPT(unique_session_id)
    FROM(
        SELECT
          CONCAT(fullvisitorid, CAST(visitId AS STRING)) AS unique_session_id,
          will_buy_on_return_visit,
          MAX(CAST(h.eCommerceAction.action_type AS INT64)) AS latest_ecommerce_progress,
          IFNULL(totals.bounces, 0) AS bounces,
          IFNULL(totals.timeOnSite, 0) AS time_on_site,
          totals.pageviews,
          trafficSource.source,
          trafficSource.medium,
          channelGrouping,
          device.deviceCategory,
          IFNULL(geoNetwork.country, "") AS country
        FROM `data-to-insights.ecommerce.web_analytics`,
        UNNEST(hits) AS h
        JOIN all_visitor_stats USING(fullvisitorid)
        WHERE 
          1=1 AND
          totals.newVisits = 1 AND 
          date BETWEEN '20170501' AND '20170630'
        GROUP BY
          unique_session_id,
          will_buy_on_return_visit,
          bounces,
          time_on_site,
          totals.pageviews,
          trafficSource.source,
          trafficSource.medium,
          channelGrouping,
          device.deviceCategory,
          country)))

Unnamed: 0,roc_auc,model_quality
0,0.909458,good


With the second model, we get a `roc_auc` of 0.91, a clear improvement over the 0.72 of the first model.

We can now make predictions with the trained second model. We want to predict which new visitors will come back and make a purchase.

In [15]:
%%bigquery
SELECT *
FROM 
  ML.PREDICT(MODEL `ecommerce.classification_model_2`, (
    WITH all_visitor_stats AS (
      SELECT
        fullvisitorid,
        IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0) AS will_buy_on_return_visit
      FROM `data-to-insights.ecommerce.web_analytics`
      GROUP BY fullvisitorid)
    SELECT
      CONCAT(fullvisitorid, '-',CAST(visitId AS STRING)) AS unique_session_id,
      will_buy_on_return_visit,
      MAX(CAST(h.eCommerceAction.action_type AS INT64)) AS latest_ecommerce_progress,
      IFNULL(totals.bounces, 0) AS bounces,
      IFNULL(totals.timeOnSite, 0) AS time_on_site,
      totals.pageviews,
      trafficSource.source,
      trafficSource.medium,
      channelGrouping,
      device.deviceCategory,
      IFNULL(geoNetwork.country, "") AS country
    FROM `data-to-insights.ecommerce.web_analytics`,
    UNNEST(hits) AS h
    JOIN all_visitor_stats USING(fullvisitorid)
    WHERE
      totals.newVisits = 1 AND
      date BETWEEN '20170701' AND '20170801'
    GROUP BY
      unique_session_id,
      will_buy_on_return_visit,
      bounces,
      time_on_site,
      totals.pageviews,
      trafficSource.source,
      trafficSource.medium,
      channelGrouping,
      device.deviceCategory,
      country))
    ORDER BY predicted_will_buy_on_return_visit DESC

Unnamed: 0,predicted_will_buy_on_return_visit,predicted_will_buy_on_return_visit_probs,unique_session_id,will_buy_on_return_visit,latest_ecommerce_progress,bounces,time_on_site,pageviews,source,medium,channelGrouping,deviceCategory,country
0,1,"[{'label': 1, 'prob': 0.575517231848732}, {'la...",7344038342439021225-1501120635,0,6,0,2515,61,mall.googleplex.com,referral,Referral,desktop,United States
1,1,"[{'label': 1, 'prob': 0.5595825429791841}, {'l...",0070452995371344990-1499834445,0,6,0,573,14,sites.google.com,referral,Referral,desktop,United States
2,1,"[{'label': 1, 'prob': 0.5807093147444004}, {'l...",4766623641504040239-1500335893,1,6,0,1305,19,gdeals.googleplex.com,referral,Referral,desktop,United States
3,1,"[{'label': 1, 'prob': 0.5355234954905926}, {'l...",7774240957927896789-1499442290,0,6,0,331,10,sites.google.com,referral,Referral,desktop,United States
4,1,"[{'label': 1, 'prob': 0.5230611566616062}, {'l...",7019345936998504727-1499907549,0,6,0,608,11,gdeals.googleplex.com,referral,Referral,desktop,United States
...,...,...,...,...,...,...,...,...,...,...,...,...,...
59597,0,"[{'label': 1, 'prob': 0.00528786658973396}, {'...",2577005515154248957-1501177138,0,0,0,121,3,youtube.com,referral,Social,desktop,Romania
59598,0,"[{'label': 1, 'prob': 0.007090356957853785}, {...",7582261539527099659-1500270143,0,0,0,85,3,m.facebook.com,referral,Social,mobile,United States
59599,0,"[{'label': 1, 'prob': 0.009842112431380218}, {...",1671504134842863799-1500187178,0,0,0,72,3,youtube.com,referral,Social,desktop,United States
59600,0,"[{'label': 1, 'prob': 0.005286459217437}, {'la...",1516264646025877273-1500100000,0,0,0,117,3,youtube.com,referral,Social,desktop,India


The prediction uses the last month of data. As we can see, the output adds 3 new fields:
1. `predicted_will_buy_on_return_visit`: whether the model thinks the visitor will buy later
2. `predicted_will_buy_on_return_visit_probs.label`: the binary class (yes / no)
3. `predicted_will_buy_on_return_visit_probs.prob`: the confidence the model has in its prediction
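
In the notebook output, the `_probs` column arrives as a list of `{'label': ..., 'prob': ...}` entries, one per class. A small sketch of pulling out the probability for one class; the helper name and the example values are hypothetical, not rows from the result:

```python
def probability_for_label(probs, label=1):
    """Return the predicted probability for the given class label."""
    for entry in probs:
        if entry["label"] == label:
            return entry["prob"]
    raise KeyError(f"label {label!r} not found")

# Hypothetical value of predicted_will_buy_on_return_visit_probs for one row.
example_probs = [{"label": 1, "prob": 0.6}, {"label": 0, "prob": 0.4}]
p_buy = probability_for_label(example_probs)
```

Looking up the entry by label is safer than relying on its position in the list.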

Save the result to a table by creating a `predict_purchases_result` table.

In [16]:
%%bigquery
CREATE OR REPLACE TABLE ecommerce.predict_purchases_result 
AS
SELECT *
FROM 
  ML.PREDICT(MODEL `ecommerce.classification_model_2`, (
    WITH all_visitor_stats AS (
      SELECT
        fullvisitorid,
        IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0) AS will_buy_on_return_visit
      FROM `data-to-insights.ecommerce.web_analytics`
      GROUP BY fullvisitorid)
    SELECT
      CONCAT(fullvisitorid, '-',CAST(visitId AS STRING)) AS unique_session_id,
      will_buy_on_return_visit,
      MAX(CAST(h.eCommerceAction.action_type AS INT64)) AS latest_ecommerce_progress,
      IFNULL(totals.bounces, 0) AS bounces,
      IFNULL(totals.timeOnSite, 0) AS time_on_site,
      totals.pageviews,
      trafficSource.source,
      trafficSource.medium,
      channelGrouping,
      device.deviceCategory,
      IFNULL(geoNetwork.country, "") AS country
    FROM `data-to-insights.ecommerce.web_analytics`,
    UNNEST(hits) AS h
    JOIN all_visitor_stats USING(fullvisitorid)
    WHERE
      totals.newVisits = 1 AND
      date BETWEEN '20170701' AND '20170801'
    GROUP BY
      unique_session_id,
      will_buy_on_return_visit,
      bounces,
      time_on_site,
      totals.pageviews,
      trafficSource.source,
      trafficSource.medium,
      channelGrouping,
      device.deviceCategory,
      country))
    ORDER BY predicted_will_buy_on_return_visit DESC

We can analyze the result of the prediction. 

In [20]:
%%bigquery
WITH predicted AS
(
  SELECT
  COUNT (*) AS total,
  COUNTIF(predicted_will_buy_on_return_visit=1) AS predicted_buy,
  COUNTIF(predicted_will_buy_on_return_visit=0) AS predicted_not_buy
  FROM `ecommerce.predict_purchases_result`
)
SELECT 
  predicted.total AS total_predicted,
  predicted.predicted_buy,
  predicted.predicted_not_buy,
  ROUND((predicted.predicted_buy/predicted.total),3)*100 AS percentage
FROM predicted

Unnamed: 0,total_predicted,predicted_buy,predicted_not_buy,percentage
0,59602,368,59234,0.6


Overall, the model predicts that only 0.6% of these first-time visitors will make a purchase on a later visit.
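
The percentage can be checked against the counts in the output above. This sketch only repeats the arithmetic, with variable names chosen for illustration:

```python
# Counts copied from the query output above.
total_predicted = 59602
predicted_buy = 368
predicted_not_buy = 59234

# Share of first-time visitors predicted to buy later, in percent.
predicted_buy_pct = round(predicted_buy / total_predicted * 100, 1)
```

The two predicted groups add up to the total, so every row in the table received a prediction.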