## Description:

This file contains:
* Predictions on unseen GA data
* Match rate between the ZoomInfo DM model and the ZoomInfo C-level model
* Additional match rates against Bombora DMs and Cordial BDM predictions (just out of curiosity)

In [1]:
"""Helper"""
import pandas as pd
pd.set_option('display.float_format', lambda x: '%.2f' % x)
import numpy as np
from numpy import mean
import joblib

"""BQ"""
import google.auth
from google.cloud import bigquery
from google.cloud import bigquery_storage
import time

bqclient = bigquery.Client()
bqstorageclient = bigquery_storage.BigQueryReadClient()

### Checking overlap of predictions from ZoomInfo-DM model with ZoomInfo-C-level model 

* Data comes from the ml_data.user_pred_data table for Tuesday, 7/6/2021 (1,344,513 rows)

In [2]:
# current ZoomInfo C-level model

start_time = time.time()

query_string = """ SELECT * FROM `api-project-901373404215.lookalike.zoom_info_c_level` WHERE date='2021-07-06'  """

zoom_clevel = (
    bqclient.query(query_string)
    .result()
    .to_dataframe(bqstorage_client=bqstorageclient)
)

print("--- %s seconds ---" % (time.time() - start_time))
print("Shape: ", zoom_clevel.shape)

--- 2.3320841789245605 seconds ---
Shape:  (1344513, 3)


In [3]:
# ZoomInfo DM model

query_string = """ SELECT * FROM `api-project-901373404215.lookalike.zoom_info_dm`"""

zoom_dm = (
    bqclient.query(query_string)
    .result()
    .to_dataframe(bqstorage_client=bqstorageclient)
)

print(zoom_dm.shape)

# rename column
zoom_dm.rename(columns={'managementLevel':'dmLevel'}, inplace=True)

(1344513, 3)


In [4]:
overlap = pd.merge(zoom_clevel, zoom_dm, on="GA_fullVisitorId", how="inner")
print(overlap.shape)
overlap.head()

(1344513, 5)


Unnamed: 0,date_x,GA_fullVisitorId,managementLevel,date_y,dmLevel
0,2021-07-06,999970127319638241,C-level,2021-07-06,DM
1,2021-07-06,9999956768411702819,C-level,2021-07-06,Non-DM
2,2021-07-06,9999145376852336812,C-level,2021-07-06,DM
3,2021-07-06,9999047611199527304,Non-Clevel,2021-07-06,DM
4,2021-07-06,9999248213606623070,Non-Clevel,2021-07-06,DM


In [5]:
pd.crosstab(overlap.managementLevel, overlap.dmLevel)

dmLevel,DM,Non-DM
managementLevel,Unnamed: 1_level_1,Unnamed: 2_level_1
C-level,50247,21225
Non-Clevel,445784,827257


In [6]:
total_clevels = overlap[overlap.managementLevel=='C-level'].shape[0]
clevels_as_dms = overlap[(overlap.managementLevel=='C-level') & (overlap.dmLevel=='DM')].shape[0]
clevels_as_nondms = overlap[(overlap.managementLevel=='C-level') & (overlap.dmLevel=='Non-DM')].shape[0]

* 70% of predicted C-levels (50,247 of 71,472) are also labelled as DMs

In [7]:
round((clevels_as_dms/total_clevels)*100)

70

* 30% of predicted C-levels (21,225 of 71,472) are labelled as Non-DMs, so the two models disagree on them
    * Note: the misclassification of these 21,225 users could come from either model:
        * Either the C-level model wrongly predicted them as C-levels and the DM model is right to label them Non-DMs, or the C-level model is right and the DM model is wrong. Something to improve on in the future.
        * TODO: perform EDA on these users' GA data to get an idea of which model is at fault

In [8]:
round((clevels_as_nondms/total_clevels)*100)

30
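
The two percentages above can be read off the crosstab in one step. The sketch below rebuilds the crosstab from the four counts printed earlier (so it runs without BigQuery access) and normalises each row to 100%. On the real `overlap` frame, `pd.crosstab(overlap.managementLevel, overlap.dmLevel, normalize="index")` should give the same row shares as fractions.

```python
import pandas as pd

# Counts copied from the crosstab output above; no new data is introduced.
counts = pd.DataFrame(
    {"DM": [50247, 445784], "Non-DM": [21225, 827257]},
    index=pd.Index(["C-level", "Non-Clevel"], name="managementLevel"),
)

# Divide every row by its own total to get row-wise percentages.
row_pct = counts.div(counts.sum(axis=1), axis=0) * 100

print(row_pct.round(1))
```

Row-wise normalisation also shows how the predicted Non-Clevels split between DM and Non-DM, which the two `round(...)` cells do not report.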

In [None]:
# @Rob - is this an acceptable match rate?

* If acceptable, **Next steps:**
    * In the deployment script, update the SQL query to:
        * read from the prediction pipeline table "ml_data.user_pred_data", and
        * exclude all fullvids present in "lookalike.zoom_info_c_level" whose predicted managementLevel is 'C-level', as in the query below
    * Make sure the deployed DM VM runs after the C-level VM

In [10]:
sql = """
    SELECT 
        * 
    FROM 
        `api-project-901373404215.ml_data.user_pred_data` 
    WHERE
        fullvid_pred NOT IN 
        (
        -- Get users we already predicted as C-levels with the latest date the prediction was made  
          SELECT
              fullvid
          FROM 
              (
                SELECT
                  date,
                    GA_fullVisitorId as fullvid,
                  RANK() OVER (PARTITION BY GA_fullVisitorId ORDER BY date DESC) AS mostrecent
                FROM
                    `api-project-901373404215.lookalike.zoom_info_c_level`
                WHERE
                  managementLevel = 'C-level'
                )
              WHERE 
                mostrecent = 1 
        )
        """

# reset the timer so the elapsed time covers only this query
start_time = time.time()

updated_dm_pool = (
    bqclient.query(sql)
    .result()
    .to_dataframe(bqstorage_client=bqstorageclient)
)

print("--- %s seconds ---" % (time.time() - start_time))
print("Shape: ", updated_dm_pool.shape)

--- 140.4542498588562 seconds ---
Shape:  (1245630, 710)
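
The exclusion logic of the query above can be sanity-checked in pandas on a toy frame. This is only a sketch: the IDs and rows are made up for illustration, and only the column names come from the query. Because the query filters on `managementLevel = 'C-level'` before ranking, every user with at least one C-level prediction ends up in the exclusion list, so the pandas equivalent is a plain anti-join on the set of such IDs.

```python
import pandas as pd

# Hypothetical stand-in for ml_data.user_pred_data (only the ID column).
pred = pd.DataFrame({"fullvid_pred": ["a", "b", "c", "d"]})

# Hypothetical stand-in for lookalike.zoom_info_c_level.
clevel = pd.DataFrame({
    "date": ["2021-07-05", "2021-07-06", "2021-07-06", "2021-07-06"],
    "GA_fullVisitorId": ["a", "a", "c", "d"],
    "managementLevel": ["C-level", "C-level", "Non-Clevel", "C-level"],
})

# IDs that were predicted as C-level on any date.
clevel_ids = clevel.loc[
    clevel.managementLevel == "C-level", "GA_fullVisitorId"
].unique()

# Anti-join: keep only users that are not in the C-level set.
dm_pool = pred[~pred.fullvid_pred.isin(clevel_ids)]

print(dm_pool)
```

Users "a" and "d" are dropped, while "c" stays because its only prediction is Non-Clevel.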


In [10]:
# I had also calculated some match rates like David did for C-level model earlier - just out of curiosity. They are below: 

* Number of unique Bombora-defined DMs between '2021-06-01' and '2021-07-06': 2,879,509

| Metric | Cordial BDM (existing): <br>predictions for July 6, '21 | ZoomInfo DM (new): <br>predictions for July 6, '21 <br>(ml_data.user_pred_data) |
| :---: | :---: | :---: |
| Number of unique predicted DMs | 1,864 | 496,031 |
| Matching with 1-month Bombora DMs | 182 | 20,897 |
| Matching with Cordial BDM's predictions | N/A | 897 |


**Note:** I assumed that the "bombora_decisionMaker" column in the GA DataMart holds the Bombora-defined DMs, so I used it below. Please correct me if that is wrong.

### Checking overlap of DM model predictions with Bombora

In [9]:
# just for a quick check, fetch 1-month of GA data where Bombora has given a Decision Maker label for folks

query_string = """
    SELECT 
        GA_date, 
        GA_fullVisitorId, 
        bombora_decisionMaker  
    FROM 
        `api-project-901373404215.DataMart.DataMart6` 
    WHERE 
        bombora_decisionMaker IS NOT NULL 
        AND GA_date BETWEEN '2021-06-01' AND '2021-07-06'"""

bombora_bdm = (
    bqclient.query(query_string)
    .result()
    .to_dataframe(bqstorage_client=bqstorageclient)
)

bombora_bdm.shape

(5908160, 3)

In [10]:
# the same visitor appears on several dates - keep one row (the first) per GA_fullVisitorId

bombora_bdm.drop_duplicates('GA_fullVisitorId', 
                            inplace=True)

print("Any dups present? ", bombora_bdm.duplicated('GA_fullVisitorId', keep='first').any())
print("Bombora folks shape: ", bombora_bdm.shape)

print()
print("Bombora labels:", list(bombora_bdm.bombora_decisionMaker.unique()))

Any dups present?  False
Bombora folks shape:  (2879509, 3)

Bombora labels: ['hr decision maker', 'it decision maker', 'finance decision maker', 'marketing decision maker', 'healthcare decision maker', 'small business decision maker']
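
`drop_duplicates` keeps whichever row comes first for each visitor. That is fine for counting unique IDs, but if a visitor carries different labels on different dates (not verified here), the retained label is arbitrary. A minimal sketch of keeping the most recent row instead, on a made-up toy frame that only reuses the column names from the query:

```python
import pandas as pd

# Hypothetical rows; visitor "111" appears twice with different labels.
toy = pd.DataFrame({
    "GA_date": ["2021-06-07", "2021-07-03", "2021-06-16"],
    "GA_fullVisitorId": ["111", "111", "222"],
    "bombora_decisionMaker": [
        "hr decision maker",
        "it decision maker",
        "finance decision maker",
    ],
})

# Sort by date, then keep the last (most recent) row per visitor.
latest = (
    toy.sort_values("GA_date")
       .drop_duplicates("GA_fullVisitorId", keep="last")
       .reset_index(drop=True)
)

print(latest)
```

The number of unique visitors is unchanged by this variant. Only the date and label retained for each visitor differ.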


In [11]:
zoom_dm_folks = zoom_dm[zoom_dm.dmLevel=='DM']
zoom_dm_folks.shape

(496031, 3)

In [12]:
pd.merge(bombora_bdm, zoom_dm_folks, on= 'GA_fullVisitorId', how='inner')

Unnamed: 0,GA_date,GA_fullVisitorId,bombora_decisionMaker,date,dmLevel
0,2021-06-07,1937290557177945965,hr decision maker,2021-07-06,DM
1,2021-06-07,1884844058712228487,hr decision maker,2021-07-06,DM
2,2021-06-07,1985555381035920670,hr decision maker,2021-07-06,DM
3,2021-06-07,1848442613152469042,hr decision maker,2021-07-06,DM
4,2021-06-07,1809696562125660594,hr decision maker,2021-07-06,DM
...,...,...,...,...,...
20892,2021-07-03,8344789284982684704,small business decision maker,2021-07-06,DM
20893,2021-06-16,12845895644105350517,small business decision maker,2021-07-06,DM
20894,2021-06-26,15425365506454076311,small business decision maker,2021-07-06,DM
20895,2021-06-26,15509543153359656714,small business decision maker,2021-07-06,DM


* 20,897 predicted DMs match Bombora-defined decision makers
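
An inner merge only reports the matches. To see the non-matching users on both sides in the same pass, one option is an outer merge with `indicator=True`, which adds a `_merge` column marking each row as `both`, `left_only`, or `right_only`. The sketch below uses made-up IDs and only borrows the column names from the cells above. `validate="one_to_one"` makes pandas raise if either side still contains duplicate IDs.

```python
import pandas as pd

# Hypothetical stand-ins for bombora_bdm and zoom_dm_folks.
bombora = pd.DataFrame({
    "GA_fullVisitorId": ["1", "2", "3"],
    "bombora_decisionMaker": ["hr decision maker"] * 3,
})
zoom = pd.DataFrame({
    "GA_fullVisitorId": ["2", "3", "4"],
    "dmLevel": ["DM", "DM", "DM"],
})

merged = bombora.merge(
    zoom,
    on="GA_fullVisitorId",
    how="outer",
    indicator=True,          # adds the _merge column
    validate="one_to_one",   # fails loudly if IDs are duplicated
)

# Count matches and non-matches on each side.
print(merged["_merge"].value_counts())
```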

### Checking overlap of ZoomInfo DM model predictions with Cordial DMs 

* Caveat: the Cordial BDM model was trained on very different data, so the number below is only directional

In [13]:
query_string = """ SELECT * FROM `api-project-901373404215.lookalike.cordial_bdm` WHERE date='2021-07-06'  """

cordial_bdm = (
    bqclient.query(query_string)
    .result()
    .to_dataframe(bqstorage_client=bqstorageclient)
)

cordial_bdm.shape

(3843, 4)

In [14]:
cordial_bdm.drop_duplicates('fullvid', inplace=True)
print(cordial_bdm[cordial_bdm.fullvid.duplicated(keep='first')].any())
print(cordial_bdm.shape)

date            False
fullvid         False
bdm             False
GA_visitorId    False
dtype: bool
(1864, 4)


* 897 of the DMs predicted by the ZoomInfo DM model were also predicted as DMs by the Cordial BDM model

In [15]:
# matching

print(zoom_dm_folks[zoom_dm_folks.GA_fullVisitorId.isin(cordial_bdm.fullvid)].shape)

(897, 3)


In [16]:
# non-matching
non_match_cordials = cordial_bdm[~cordial_bdm.fullvid.isin(zoom_dm_folks.GA_fullVisitorId)]
print(non_match_cordials.shape)

(967, 4)


* Intersection of the Cordial BDM predictions for July 6th with the 1-month Bombora DMs

In [17]:
pd.merge(bombora_bdm, cordial_bdm, left_on= 'GA_fullVisitorId', right_on='fullvid', how='inner').shape

(182, 7)
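
The raw match counts are easier to compare as rates. The sketch below uses only the counts printed in this notebook. The helper `pct` and the dictionary keys are names introduced here for illustration.

```python
# Counts copied from the outputs in this notebook.
counts = {
    "zoom_dm_predicted": 496031,
    "cordial_bdm_predicted": 1864,
    "zoom_dm_in_bombora": 20897,
    "cordial_in_bombora": 182,
    "zoom_dm_in_cordial": 897,
    "cordial_not_in_zoom_dm": 967,
}

def pct(part, whole):
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100 * part / whole, 2)

rates = {
    "ZoomInfo DMs also in Bombora (%)": pct(
        counts["zoom_dm_in_bombora"], counts["zoom_dm_predicted"]
    ),
    "Cordial BDMs also in Bombora (%)": pct(
        counts["cordial_in_bombora"], counts["cordial_bdm_predicted"]
    ),
    "Cordial BDMs also predicted by ZoomInfo DM (%)": pct(
        counts["zoom_dm_in_cordial"], counts["cordial_bdm_predicted"]
    ),
}

for name, value in rates.items():
    print(f"{name}: {value}")
```

As a consistency check, the matching and non-matching Cordial counts (897 and 967) add up to the 1,864 unique Cordial BDMs.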