# Introduction
The goal of a data-driven casino is to maximize the value of every player by encouraging them to deposit money and place bets, typically by sending promos and giving away bonuses. Smart and efficient casinos relentlessly search for the ideal distribution of bonus money by developing various models and algorithms. This can be done in multiple ways: customer segmentation, early lifetime value prediction, churn prevention, bonus hunter detection, etc.

I now explore the dataset and propose a possible solution for how the casino could increase its net revenue.

## Data
The source data is 25k rows of player transaction activity between each player's first deposit date and 3 months after that deposit. It contains:
- Player's information:
    - customer_id
    - acqusition_source
    - country
    - gender
    - date_first_deposit_id
    - date_registration_id
- Transactional data (all the metrics are aggregated per player and per transaction date):
    - date_transaction_id
    - count_deposit
    - sum_deposit
    - sum_free_spin
    - sum_bonus_cost
    - sum_ngr
    - sum_bet_real
    - sum_bet_bonus
    - sum_win_real
    - sum_win_bonus

# 1_get_data
This first notebook is all about loading and checking the provided file, ```task_data_set.CSV```. To do so, I:
- load the data to ensure usability,
- apply very quick descriptive statistics to explore the contents,
- check the data quality of the different data objects, and
- enhance the original dataset with third-party country data, in particular the latitude and longitude of each country, since these are required later at the modelling stage.

In the end, I export the dataset to a pickle file for use in later notebooks.
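
The steps above could be sketched roughly as follows. This is a minimal illustration rather than the notebook's actual code: the inline CSV stands in for ```task_data_set.CSV``` and shows only a few of its columns, and the `country_geo` lookup with its approximate coordinates is a hypothetical placeholder for the third-party country data.

```python
import io
import os
import tempfile

import pandas as pd

# Tiny stand-in for task_data_set.CSV; only a few of the real columns are shown.
raw_csv = io.StringIO(
    "customer_id,country,sum_deposit,sum_ngr\n"
    "1,MT,50.0,12.5\n"
    "1,MT,20.0,-4.0\n"
    "2,SE,100.0,30.0\n"
)
df = pd.read_csv(raw_csv)

# Quick descriptive statistics and basic quality checks.
summary = df.describe()
missing_per_column = df.isna().sum()
duplicate_rows = int(df.duplicated().sum())

# Hypothetical third-party lookup: one approximate coordinate pair per country.
country_geo = pd.DataFrame(
    {"country": ["MT", "SE"], "lat": [35.9, 60.1], "long": [14.4, 18.6]}
)
# A left join keeps every transaction row; validate guards against duplicate countries.
df = df.merge(country_geo, on="country", how="left", validate="many_to_one")

# Persist for the later notebooks and confirm the round trip.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "data.pkl")
    df.to_pickle(path)
    reloaded = pd.read_pickle(path)
```

A left join is used so that a country missing from the lookup shows up as a missing `lat`/`long` value instead of silently dropping rows.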

# 2_EDA_WhichModel_FeatureEngineering
In this notebook, I perform an EDA to understand which model is best suited to the provided data. The problem at hand is to **increase casino net revenue**, i.e. to maximise the NGR (net gaming revenue) per player, where NGR is defined as $$\text{NGR} = \text{Bets} - \text{Wins} - \text{Bonus Cost} - \text{Tax}$$
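
As a small worked example of the NGR definition, the function below evaluates it for one set of amounts. The function name and the figures are hypothetical; in the dataset itself NGR is already provided as `sum_ngr`.

```python
def net_gaming_revenue(bets: float, wins: float, bonus_cost: float, tax: float) -> float:
    """NGR = Bets - Wins - Bonus Cost - Tax, all in the same currency."""
    return bets - wins - bonus_cost - tax


# Hypothetical amounts for a single player.
example_ngr = net_gaming_revenue(bets=1000.0, wins=900.0, bonus_cost=30.0, tax=20.0)
print(example_ngr)  # 50.0
```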

I consider different visualisations, employ feature engineering and look at the data from different perspectives to come up with a strategy for this problem.

I assume that a customer's last transaction date within the provided 3-month span since first deposit marks their churn date. I therefore looked at the distribution of activity duration in days per customer, which produced the figure below: ![fig](./figs/EDA_player_diration_transaction_min_max.png)
This highlights that:
- A large proportion of customers have an activity span of 0 days, meaning they register, deposit, bet and churn straight away. These could potentially be bonus hunters.
- Within 10 days, we are losing 25.14% of our customers, or **37%** when excluding Day 0 players. We should thus have a model that identifies the customers offering the best NGR at an early stage, and employ measures to ensure they stick around.
- At the 30-40 day mark, we see an increase in frequency. This could potentially be a point at which to inject bonuses or game recommendations to encourage repeat play.
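
The duration analysis behind these observations could be reproduced along these lines. This is a sketch on toy data: it assumes `date_transaction_id` has already been parsed to datetimes, and it treats "within 10 days" as a duration of at most 10 days, which may differ from the cut-off used in the notebook.

```python
import pandas as pd

# Toy transactions; the real *_id date columns may need their own parsing step.
tx = pd.DataFrame(
    {
        "customer_id": [1, 1, 2, 3, 3, 4],
        "date_transaction_id": pd.to_datetime(
            ["2020-01-01", "2020-01-06", "2020-01-02",
             "2020-01-01", "2020-02-05", "2020-01-03"]
        ),
    }
)

# Days between each player's first and last transaction (last date = assumed churn date).
span = tx.groupby("customer_id")["date_transaction_id"].agg(["min", "max"])
duration_days = (span["max"] - span["min"]).dt.days

# Share of players whose last transaction falls within 10 days of their first.
share_within_10 = (duration_days <= 10).mean()

# Same share, ignoring players who were active on a single day only (Day 0 players).
multi_day = duration_days[duration_days > 0]
share_within_10_excl_day0 = (multi_day <= 10).mean()
```

Plotting `duration_days.value_counts().sort_index()` would give a frequency-by-duration view similar in spirit to the figure.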

### Strategy
With this in mind, I propose the following strategy to increase NGR per player.
Given a customer who has transacted with us for the first time:
1. **10 day mark:** Ten days after the player's first transaction, run the customer through a churn prediction model, which predicts whether the player has churned within those 10 days. I chose 10 days since (i) it is longer than a week, so we can gauge a player's weekly pattern and at least one weekend, and (ii) it captures **25%** of all our customers, or **37%** when excluding Day 0 players.

    From the predicted churn outcome,
    - **if the customer did not churn or will not churn within 10 days**:
        - either do not apply any change to the current bonuses, promos, etc., or
        - implement bonus savings on these players. As a first pass of the model, I would advise against this for now, since the current focus is on increasing player retention.
    - **if the customer is going to churn or has already churned**: 
        - pass the customer through an NGR prediction (early value classification) model to identify leads. Promos and bonuses are then sent to the high-value players to try to retain or reactivate them.
2. **30 day mark:**
    - Pass the players through a churn prediction model tuned for days 29, 30 and 31.
    - Apply the same logic as above to decide how promos and discounts are distributed.
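
The decision logic at each checkpoint can be summarised as a small routing function. This is only a sketch: the function name, the action labels and the choice of which value classes count as leads are assumptions made for illustration.

```python
# Assumption: the two highest value classes are treated as leads worth a promo.
LEAD_CLASSES = {2, 3}


def next_action(predicted_churn: bool, predicted_value_class: int) -> str:
    """Map the churn and value predictions at a checkpoint to a CRM action."""
    if not predicted_churn:
        # First pass of the strategy: leave current bonuses and promos unchanged.
        return "no_change"
    if predicted_value_class in LEAD_CLASSES:
        # Likely churner with high predicted value: try to retain or reactivate.
        return "send_retention_promo"
    # Likely churner with low predicted value: do not spend bonus money.
    return "no_promo"


actions = [next_action(churn, value) for churn, value in [(False, 3), (True, 3), (True, 0)]]
print(actions)  # ['no_change', 'send_retention_promo', 'no_promo']
```

The same function would be applied at the 10 day and 30 day marks, each time fed by the models trained for that window.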
    
### Feature Engineering
To implement the above-mentioned strategy, I engineered different metrics to capture different dimensions of the data. I grouped the features based on:
- Behavior:
    - Timing differences and days when active,
    - Sums, Averages and Frequency of interaction for the customer financials within the timespan,
    - iGaming KPIs in the form of NGR_Deposit_ratio and Bets_Deposit_ratio, and
    - Modified RFM Analysis metric
- Geographic and Demographic details
- Gross NGR per player
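
To illustrate the behavioural group, the sketch below builds a few window-limited aggregates and the two ratio KPIs. The helper `window_features` is hypothetical and is not the notebook's `getCustomerMetric`; the exact ratio definitions (NGR and real-money bets divided by deposits) and the inclusive day cut-off are assumptions.

```python
import pandas as pd


def window_features(tx: pd.DataFrame, days: int) -> pd.DataFrame:
    """Per-player aggregates over the first `days` days after first deposit."""
    age = (tx["date_transaction_id"] - tx["date_first_deposit_id"]).dt.days
    in_window = tx[age <= days]
    out = in_window.groupby("customer_id").agg(
        active_days=("date_transaction_id", "nunique"),
        sum_deposit=("sum_deposit", "sum"),
        sum_bet_real=("sum_bet_real", "sum"),
        sum_ngr=("sum_ngr", "sum"),
    )
    # Avoid division by zero: a player with no deposits gets NaN ratios.
    deposits = out["sum_deposit"].where(out["sum_deposit"] > 0)
    out["NGR_Deposit_ratio"] = out["sum_ngr"] / deposits
    out["Bets_Deposit_ratio"] = out["sum_bet_real"] / deposits
    return out


# Toy data: player 1 has one transaction outside the 10 day window.
tx = pd.DataFrame(
    {
        "customer_id": [1, 1, 1, 2],
        "date_first_deposit_id": pd.to_datetime(["2020-01-01"] * 3 + ["2020-01-02"]),
        "date_transaction_id": pd.to_datetime(
            ["2020-01-01", "2020-01-05", "2020-01-20", "2020-01-02"]
        ),
        "sum_deposit": [100.0, 0.0, 50.0, 40.0],
        "sum_bet_real": [200.0, 50.0, 100.0, 40.0],
        "sum_ngr": [20.0, 5.0, -10.0, 10.0],
    }
)
features_10 = window_features(tx, days=10)
```

Calling the same helper with `days=31` would give the equivalent frame for the 30 day checkpoint.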

### Output
The output of this notebook is 5 dataframes. In summary, I have:
- Dataframes that have metrics for individual players based on the whole dataset
    - **df_players** : Timing differences between Registration, Deposit and Min_Max Transaction dates. 
    - **df_NGR** : NGR metric per player, split into 3 quantiles plus a class for negative values
    - **df_geo_gen** : Geographic and demographic details per player

- Time-limited data at the critical checkpoints
    - **df_10** : Behavioral, Timing Difference, KPIs, Averages, Ratios and other metrics defined within the functions ```getCustomerMetric``` and ```getCustomerMetricDF```, limited to transactions occurring within 10 days of first deposit.
    - **df_30** : As above, but with transactions limited to within 31 days of first deposit
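
The NGR split into three quantiles plus a negative class could be produced as sketched below. The class numbering (0 for negative NGR, 1 to 3 for the lower, middle and upper tercile of the remaining players) is an assumption chosen to match a four-class target, and the toy values are invented.

```python
import pandas as pd

# Toy total NGR per player.
ngr = pd.Series(
    [-50.0, -5.0, 10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
    index=pd.Index(range(1, 9), name="customer_id"),
    name="sum_ngr",
)

# Start everyone in class 0, which is kept for players with negative NGR.
cltv = pd.Series(0, index=ngr.index, name="CLTV")

# Terciles of the non-negative players become classes 1 (low) to 3 (high).
non_negative = ngr[ngr >= 0]
cltv.loc[non_negative.index] = pd.qcut(non_negative, q=3, labels=[1, 2, 3]).astype(int)

print(cltv.tolist())  # [0, 0, 1, 1, 2, 2, 3, 3]
```

Computing the quantile edges on the non-negative players only keeps the three positive classes roughly equal in size, regardless of how many players have negative NGR.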

# 3_PreProcessing
In this notebook I implement the strategy suggested in ```2_EDA_WhichModel_FeatureEngineering```. Within this notebook, I also (i) scale continuous data, (ii) encode categorical data, and (iii) impute any missing data. The strategy involves the following 4 models.
1. **Churn Prediction for 10 day history**: 

| MODEL | Churn_10 |
| --- | ----------- |
| *Model Type* | Binary classifier |
| *Datasets* | <ul><li>Geographic and Demographic ```df_geo_gen```</li><li>Behavioral ```df_10```</li><li>Player details ```df_players``` </li></ul>|
| *Target* | ```df_players[churn_10]```: {0,1}   |

2. **Customer Value Prediction for 10 day history**:
        
| MODEL | CVP_10 |
| --- | ----------- |
| *Model Type* | Multi-class classifier |
| *Datasets* | <ul><li>Geographic and Demographic ```df_geo_gen```</li><li>Behavioral ```df_10```</li><li>NGR values ```df_ngr```</li></ul>|        
| *Target* | ```df_ngr[CLTV]```: {0, 1, 2, 3} |

3. **Churn Prediction for 30 day history**: 

| MODEL | Churn_30 |
| --- | ----------- |
| *Model Type* | Binary classifier |
| *Datasets* | <ul><li>Geographic and Demographic ```df_geo_gen```</li><li>Behavioral ```df_30```</li><li>Player details ```df_players``` </li></ul>|        
| *Target* | ```df_players[churn_30]```: {0,1}   |

4. **Customer Value Prediction for 30 day history**:
        
| MODEL | CVP_30 |
| --- | ----------- |
| *Model Type* | Multi-class classifier |
| *Datasets* | <ul><li>Geographic and Demographic ```df_geo_gen```</li><li>Behavioral ```df_30```</li><li>NGR values ```df_ngr```</li></ul>|        
| *Target* | ```df_ngr[CLTV]```: {0, 1, 2, 3} |
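
The three preprocessing steps named at the top of this section (scaling, encoding and imputation) could be combined in a single scikit-learn transformer, as sketched below. The column choices, toy values and imputation strategies (median for numeric, most frequent for categorical) are assumptions, not necessarily what the notebook uses.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy feature frame with missing values in numeric and categorical columns.
X = pd.DataFrame(
    {
        "sum_deposit": [100.0, 40.0, np.nan, 250.0],
        "Bets_Deposit_ratio": [2.5, 1.0, 3.0, np.nan],
        "gender": ["M", "F", np.nan, "F"],
        "country": ["MT", "SE", "MT", "DE"],
    }
)
numeric_cols = ["sum_deposit", "Bets_Deposit_ratio"]
categorical_cols = ["gender", "country"]

# Numeric columns: fill gaps with the median, then standardise.
numeric_pipe = Pipeline(
    [("impute", SimpleImputer(strategy="median")), ("scale", StandardScaler())]
)
# Categorical columns: fill gaps with the most frequent value, then one-hot encode.
categorical_pipe = Pipeline(
    [
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]
)
preprocess = ColumnTransformer(
    [("num", numeric_pipe, numeric_cols), ("cat", categorical_pipe, categorical_cols)],
    sparse_threshold=0,  # always return a dense array
)
X_ready = preprocess.fit_transform(X)
```

In practice the transformer would be fitted on the training split only and then applied to the test split, so that no information leaks from the held-out data.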

# 4_Modelling_*
The models identified above have been implemented in the following set of notebooks. These are provided together with respective results.
- **4_Modelling_Churn10**:

| Model Type | Classifier | Label | Recall | ROC_AUC|
| --- | --- | --- | --- | ---|
| Linear | Logistic Regression | lr | 0.79078 | 0.753103|
| SVM | SVC | svc | 0.684397 | 0.775709|
| Tree | Decision Tree | tr | 0.602837 | 0.748227|
| Tree | Random Forest | rf | 0.599291 | 0.765071|
| Tree | Gradient Boosting | gb | 0.673759 | 0.791667|
| Neural Network | Multi-layer Perceptron | nn | 0.535461 | 0.719193|
| Probabilistic | Gaussian Naive Bayes | NB | 0.815603 | 0.602615|
| Ensemble | [gb, rf, tr, NB] | [gb, rf, tr, NB] | 0.751773 | 0.800089|

The chosen model was an ```Ensemble``` model because it achieved the highest ROC_AUC (0.800). It combines the ```gb, rf, tr, NB``` models using soft voting.
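
A soft-voting ensemble of this kind can be sketched with scikit-learn's `VotingClassifier`. The data here is synthetic and the hyperparameters are placeholders, so the scores it produces are unrelated to the table above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    VotingClassifier,
)
from sklearn.metrics import recall_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed features and the binary churn target.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

ensemble = VotingClassifier(
    estimators=[
        ("gb", GradientBoostingClassifier(random_state=0)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("tr", DecisionTreeClassifier(max_depth=4, random_state=0)),
        ("NB", GaussianNB()),
    ],
    voting="soft",  # average the predicted class probabilities of the members
)
ensemble.fit(X_train, y_train)

# Evaluate with the same two metrics reported in the tables.
churn_proba = ensemble.predict_proba(X_test)[:, 1]
recall = recall_score(y_test, ensemble.predict(X_test))
roc_auc = roc_auc_score(y_test, churn_proba)
```

With soft voting and no weights, the ensemble probability is the plain mean of the members' probabilities, which tends to smooth out the over-confident predictions of any single model.
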
- **4_Modelling_Churn30**:

| Model Type | Classifier | Label | Recall | ROC_AUC|
| --- | --- | --- | --- | ---|
| Linear | Logistic Regression | lr | 0.979167 | 0.955608|
| Tree | Decision Tree | tr | 0.791667 | 0.893298|
| Tree | Random Forest | rf | 0.75 | 0.873479|
| Tree | Gradient Boosting | gb | 0.854167 | 0.924041|
| Neural Network | Multi-layer Perceptron | nn | 0.729167 | 0.862048|
| Probabilistic | Gaussian Naive Bayes | NB | 0.895833 | 0.938789|
| Ensemble | [gb, rf, lr, NB] | [gb, rf, lr, NB] | 0.916667 | 0.954784|

The chosen model was an ```Ensemble``` model due to its strong performance on both metrics, although logistic regression alone scored slightly higher on each (Recall 0.979, ROC_AUC 0.956). It combines the ```gb, rf, lr, NB``` models using soft voting.
- **4_Modelling_CVP10**:

| Model Type | Classifier | Label | Recall | F1_Score | Accuracy |
| --- | --- | --- | --- | --- | --- |
| Linear | Logistic Regression | lr | 0.77176 | 0.771741 | 0.77176 |
| Tree | Decision Tree | tr | 0.860735 | 0.859882 | 0.860735 |
| Tree | Random Forest | rf | 0.868472 | 0.868298 | 0.868472 |
| Tree | Gradient Boosting | gb | 0.859768 | 0.859762 | 0.859768 |
| Neural Network | Multi-layer Perceptron | nn | 0.766925 | 0.767642 | 0.766925 |
| Probabilistic | Gaussian Naive Bayes | NB | 0.692456 | 0.680394 | 0.692456 |
| Ensemble | [gb, rf, tr, lr] | [gb, rf, tr, lr] | 0.865571 | 0.865294 | 0.865571 |

The chosen model was an ```Ensemble``` model due to its generally good performance across the 3 metrics and because an ensemble is less susceptible to the local minima of any single model. It combines the ```gb, rf, tr, lr``` models using soft voting.
- **4_Modelling_CVP30**:

| Model Type | Classifier | Label | Recall | F1_Score | Accuracy |
| --- | --- | --- | --- | --- | --- |
| Linear | Logistic Regression | lr | 0.876209 | 0.875918 | 0.876209 |
| Tree | Decision Tree | tr | 0.928433 | 0.928568 | 0.928433 |
| Tree | Random Forest | rf | 0.930368 | 0.930304 | 0.930368 |
| Tree | Gradient Boosting | gb | 0.928433 | 0.928428 | 0.928433 |
| Neural Network | Multi-layer Perceptron | nn | 0.886847 | 0.887157 | 0.886847 |
| Probabilistic | Gaussian Naive Bayes | NB | 0.787234 | 0.782324 | 0.787234 |
| Ensemble | [gb, rf, tr, lr] | [gb, rf, tr, lr] | 0.933269 | 0.933171 | 0.933269 |

The chosen model was an ```Ensemble``` model since it improves on all 3 metrics. It combines the ```gb, rf, tr, lr``` models using soft voting.


Overall, the 30 day models performed better than the 10 day models. This is as expected, since the longer history reduces sparsity in the features.

# 5_Deploying_the_model 
In this notebook I first walk through an example of how to use the models, then calculate the potential return of the strategy, and finally outline the next stages needed to get the models ready for production and beyond.

In the example, I randomly selected 1000 customers.
- Within 10 days, 316 customers (32%) were identified as churnable.
    - 53 were identified as High value potential
    - 94 were identified as Medium value potential
- Within 30 days, 50 customers (5%) were identified as churnable. (Reassuringly, 5% is the same share identified in *2_EDA_WhichModel_FeatureEngineering*.)
    - 21 were identified as High value potential
    - 14 were identified as Medium value potential
    
### Potential Return of Strategy
To this end, I passed the data through the strategy to compare the returns at different conversion rates.
![fig](./figs/Pot_Return_NGR.png)
Looking at the above, even a conversion rate as low as 50% would lead to returns of 100,000 CUR.
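
The arithmetic behind such a comparison could look like the sketch below. The formula (retained NGR minus bonus spend), the function name and every number are hypothetical, and the output does not reproduce the figure above.

```python
def potential_return(expected_ngr_per_lead, conversion_rate, bonus_cost_per_lead=0.0):
    """Expected net return if `conversion_rate` of the targeted leads are retained."""
    recovered = conversion_rate * sum(expected_ngr_per_lead)
    spend = bonus_cost_per_lead * len(expected_ngr_per_lead)
    return recovered - spend


# Hypothetical expected NGR of four targeted high-value leads.
leads = [400.0, 250.0, 150.0, 200.0]
returns = {
    rate: potential_return(leads, rate, bonus_cost_per_lead=10.0)
    for rate in (0.25, 0.5, 0.75)
}
print(returns)  # {0.25: 210.0, 0.5: 460.0, 0.75: 710.0}
```

The bonus spend is charged for every targeted lead, whether or not the player converts, which is why the return grows linearly with the conversion rate from a negative starting point.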