# Yankees - Event Propensity - Next Event Buyer Scoring
* Ryan Kazmerik, Nakisa Rad, Joey Lai, Shawn Sutherland, Matt Bahler, Pat Faith
* Feb 22, 2022

In [1]:
import boto3
import json
import matplotlib.pyplot as plt
import pandas as pd
import warnings

from pandas_profiling import ProfileReport
from pycaret.classification import *

boto3.setup_default_session(profile_name='Legacy-DataScienceAdmin')

warnings.filterwarnings('ignore')
%config InlineBackend.figure_format = 'retina'

### Let's get some data from S3 and convert the CSV files into a single dataframe. There should be 4 files per game, with the game date included in the filename:

In [10]:
s3 = boto3.client('s3')
obj = s3.get_object(Bucket='stellar-redshift-etl', Key='hold/ml_data/game2022-04-070000_part_00')
df_prev_games = pd.read_csv(obj['Body'])

In [11]:
df_prev_games.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 573332 entries, 0 to 573331
Data columns (total 12 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   daysOut              573332 non-null  object 
 1   eventDate            573332 non-null  object 
 2   tenure               573332 non-null  int64  
 3   dimCustomerMasterId  573332 non-null  int64  
 4   events_purchased     573332 non-null  int64  
 5   frequency_opponent   129392 non-null  float64
 6   frequency_eventDay   85552 non-null   float64
 7   frequency_eventTime  291792 non-null  float64
 8   inMarket             554600 non-null  object 
 9   distanceToVenue      490400 non-null  float64
 10  recent_openRate      0 non-null       float64
 11  recent_clickRate     0 non-null       float64
dtypes: float64(6), int64(3), object(3)
memory usage: 52.5+ MB


### We need to replace some NaN values with 0, except for the distanceToVenue column:

In [12]:
for col in df_prev_games.columns:
    if col != 'distanceToVenue':
        df_inference[col].fillna(0, inplace=True)

### Let's load up the saved event propensity model from the model directory:

In [13]:
saved_model= load_model('./models/MLB Yankees - Event Propensity (22Feb2022)')

Transformation Pipeline and Model Successfully Loaded


### Now we can model the data using a binary classification prediction for the target field to see how likely a customer is to purchase:

In [15]:
df_inference = predict_model(saved_model, data=df_prev_scores, raw_score=True)
df_inference.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 573332 entries, 0 to 573331
Data columns (total 15 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   daysOut              573332 non-null  object 
 1   eventDate            573332 non-null  object 
 2   tenure               573332 non-null  int64  
 3   dimCustomerMasterId  573332 non-null  int64  
 4   events_purchased     573332 non-null  int64  
 5   frequency_opponent   129392 non-null  float64
 6   frequency_eventDay   85552 non-null   float64
 7   frequency_eventTime  291792 non-null  float64
 8   inMarket             554600 non-null  object 
 9   distanceToVenue      490400 non-null  float64
 10  recent_openRate      0 non-null       float64
 11  recent_clickRate     0 non-null       float64
 12  Label                573332 non-null  int64  
 13  Score_0              573332 non-null  float64
 14  Score_1              573332 non-null  float64
dtypes: float64(8), in

### Included are some metrics on did and did_not purchase:

In [16]:
did_purchase = df_inference["Label"].value_counts()[1]
did_not_purchase = df_inference["Label"].value_counts()[0]
total_rows = df_inference["Label"].count()
purchase_percentage = round((did_purchase / total_rows), 2) * 100

print("Would purchase:", did_purchase)
print("Would not purchase:", did_not_purchase)
print("Purchase percentage:", purchase_percentage)

Would purchase: 193672
Would not purchase: 379660
Purchase percentage: 34.0


### Let's sort the scores to find the top prospects for each fan:

In [18]:
scoring_dict = df_inference.to_dict('records')

In [19]:
max_dict = {}
for record in scoring_dict:
    if record['dimCustomerMasterId'] not in max_dict:
        max_dict[record['dimCustomerMasterId']] = 0
    
    if record['Score_1'] > max_dict[record['dimCustomerMasterId']]:
        max_dict[record['dimCustomerMasterId']] = record['Score_1']
        
max_scores = [{'id': k, 'score': v} for k, v in max_dict.items()]
max_scores.sort(key = lambda v: -v["score"])

### Exporting to CSV for further analysis against past purchase percentages:

In [20]:
df_max_scores = pd.DataFrame(max_scores)
df_max_scores.head()

df_max_scores.to_csv('./data/game2022-04-070000_part_00_results.csv')

## Observations
Here you can talk about next steps, did the experiment work? If yes, what to do next? If no, why?