# Question
Based on the 3 datasets provided, perform EDA and put together an Early Customer Classification Model
## Datasets
The data provided consisted of 3 datasets:
### 1. Transactions
This dataset contains hourly aggregated transactions (player activity) for the period 2017-01 to 2017-03. Transaction types are wager, winning, deposit and withdraw. Amounts for deposits and winnings are positive, while amounts for wagers and withdrawals are negative.

|field|Comments|
|---|---|
|playerid||
|created_date_time|time when the transaction was made (hourly aggregated)|
|account_currency||
|gameid|game that the player played|
|cashier_method_id|method that the player used for payments|
|account_type|values like 'CASH%' are cash transactions; values like 'BONUS%' are bonus transactions|
|trans_type|wager, winning, deposit or withdraw|
|payment_status|accepted = successful; any other status means the transaction was not successful|
|payment_result_code|detailed status|
|payment_channel|mobile or desktop|
|transaction_count|number of transactions aggregated|
|account_amount|amount in account currency|
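
As a minimal sketch of how the sign convention and the status fields might be used together, the snippet below keeps only accepted cash transactions and nets the amounts per player. The rows are invented for illustration; only the column names come from the table above, and the exact string values (`'accepted'`, `'CASH_MAIN'`, `'BONUS_X'`) are assumptions about the encoding.

```python
import pandas as pd

# Toy rows, invented for illustration; column names follow the table above.
transactions = pd.DataFrame({
    "playerid": [1, 1, 1, 2, 2],
    "account_type": ["CASH_MAIN", "CASH_MAIN", "BONUS_X", "CASH_MAIN", "CASH_MAIN"],
    "trans_type": ["deposit", "wager", "wager", "deposit", "withdraw"],
    "payment_status": ["accepted", "accepted", "accepted", "declined", "accepted"],
    "account_amount": [100.0, -30.0, -5.0, 50.0, -20.0],
})

# Keep successful cash transactions only (assumed value encodings).
is_cash = transactions["account_type"].str.startswith("CASH")
is_accepted = transactions["payment_status"] == "accepted"
cash_ok = transactions[is_cash & is_accepted]

# Wagers and withdrawals are negative, so a plain sum gives the net flow.
net_per_player = cash_ok.groupby("playerid")["account_amount"].sum()
print(net_per_player)
```

Because the signs are already encoded in `account_amount`, no per-type sign handling is needed before aggregating.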

### 2. Player Details
|field|Comments|
|---|---|
|playerid|unique playerid|
|signup_completed|date when player signed up |
|first_deposit|date when the player successfully deposited for the first time (null values = '1900-01-01')|
|first_deposit_amt|amount of the first deposit|
|city||
|birth_date||
|gender||
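
The `'1900-01-01'` sentinel in `first_deposit` is easy to mistake for a real date. A minimal sketch, using invented rows and the column names from the table above, of turning it into a proper missing value before computing date differences (the `days_to_first_deposit` column is a hypothetical name):

```python
import pandas as pd

# Toy rows, invented for illustration; column names follow the table above.
players = pd.DataFrame({
    "playerid": [1, 2, 3],
    "signup_completed": ["2017-01-05", "2017-01-10", "2017-02-01"],
    "first_deposit": ["2017-01-06", "1900-01-01", "2017-02-01"],
})

for col in ["signup_completed", "first_deposit"]:
    players[col] = pd.to_datetime(players[col])

# Replace the documented sentinel with a proper missing value (NaT).
sentinel = pd.Timestamp("1900-01-01")
players["first_deposit"] = players["first_deposit"].mask(
    players["first_deposit"] == sentinel
)

# Days from signup to first deposit; NaN for players who never deposited.
players["days_to_first_deposit"] = (
    players["first_deposit"] - players["signup_completed"]
).dt.days
print(players)
```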

### 3. Games
Game information and game characteristics.

|field|Comments|
|---|---|
|gameid|unique gameid|
|gameid_root|If a game x is available on both desktop and mobile, the gameid will differ but gameid_root indicates that it is the same game|
|categoryid|id of the game category |
|rtp|theoretical Return To Player|
|channel|desktop or mobile|
|categorycode|type of the game - slots,tablegames etc.|
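
To illustrate what `gameid_root` is for, the sketch below joins transactions to the games table and aggregates by root, so that the desktop and mobile variants of a game are counted together. All ids and counts are invented; only the column names come from the tables above.

```python
import pandas as pd

# Toy rows, invented for illustration.
games = pd.DataFrame({
    "gameid": [101, 102, 201],
    "gameid_root": [100, 100, 200],  # 101 and 102 are the same game
    "channel": ["desktop", "mobile", "desktop"],
    "categorycode": ["slots", "slots", "tablegames"],
})
transactions = pd.DataFrame({
    "gameid": [101, 102, 102, 201],
    "trans_type": ["wager"] * 4,
    "transaction_count": [3, 2, 5, 1],
})

# A left join keeps every transaction, even if its game is missing from `games`.
merged = transactions.merge(games, on="gameid", how="left")

# Aggregate on the root id so channel variants are counted as one game.
wagers_per_game = merged.groupby("gameid_root")["transaction_count"].sum()
print(wagers_per_game)
```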

# Task
Given the 3 datasets **(transactions, player_details, games)**, I was asked to provide insights that could improve business results. To answer this problem I put together 4 notebooks:
## **[1_get_data](https://github.com/zerafachris/playGround/blob/master/published/iGamingAnalytics/1_get_data.ipynb)**
In this notebook, the datasets are cleaned and augmented. The resulting files are available in the directory *./cleanedData/*. In particular, the following was applied to the data:
- Player Dataset: 
    - Renaming of 1 city from 'r	ejmyre' to 'rejmyre' as this resulted in an extended row.
    - Adding of *city* and *country* information based on third-party datasets
    - Addition of player age related features
- Transaction Dataset:
    - All 3 csv files were aggregated
    - Conversion of FX to EUR
    - Addition of time features
- Games Data:
    - No alterations were made to the dataset
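
The transaction steps listed above can be sketched roughly as follows. This is not the notebook's code: the three small frames stand in for the monthly CSV files, the FX rates in `fx_to_eur` are placeholder numbers rather than real rates, and the derived column names (`amount_eur`, `hour`, `day_of_week`) are hypothetical.

```python
import pandas as pd

# Stand-ins for the three monthly CSV files (rows invented for illustration).
jan = pd.DataFrame({"playerid": [1], "created_date_time": ["2017-01-02 14:00:00"],
                    "account_currency": ["EUR"], "account_amount": [10.0]})
feb = pd.DataFrame({"playerid": [2], "created_date_time": ["2017-02-03 09:00:00"],
                    "account_currency": ["SEK"], "account_amount": [-100.0]})
mar = pd.DataFrame({"playerid": [3], "created_date_time": ["2017-03-04 21:00:00"],
                    "account_currency": ["GBP"], "account_amount": [5.0]})

# 1. Combine the monthly files into a single frame.
transactions = pd.concat([jan, feb, mar], ignore_index=True)
transactions["created_date_time"] = pd.to_datetime(transactions["created_date_time"])

# 2. Convert to EUR. Placeholder rates, NOT real FX rates.
fx_to_eur = {"EUR": 1.0, "SEK": 0.1, "GBP": 1.2}
transactions["amount_eur"] = (
    transactions["account_amount"] * transactions["account_currency"].map(fx_to_eur)
)

# 3. Add simple time features derived from the hourly timestamp.
transactions["hour"] = transactions["created_date_time"].dt.hour
transactions["day_of_week"] = transactions["created_date_time"].dt.dayofweek  # Monday = 0
print(transactions)
```

A currency missing from the rate mapping would produce `NaN` in `amount_eur`, which makes unmapped currencies easy to spot.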

## **[2_viz_data](https://github.com/zerafachris/playGround/blob/master/published/iGamingAnalytics/2_viz_data.ipynb)**
In this notebook, exploratory analysis was carried out and plots were produced to surface ideas that could add business value. In particular, the following were investigated:
1. Daily Volumes,
2. Busiest game types, and
3. Age of active players and their playing habits

## **[3_player_classification_data_prep](https://github.com/zerafachris/playGround/blob/master/published/iGamingAnalytics/3_early_player_classification_data_prep.ipynb)**
In this notebook, a dataset for the different players was created with features relating to:
- Geographic & Demographic Traits,
- Behavioral Traits, and
- Value Traits

The final dataset had *17313* rows, one per player profile.
    

## **[4_early_player_classification_modelling](https://github.com/zerafachris/playGround/blob/master/published/iGamingAnalytics/4_early_player_classification_modelling.ipynb)**
In this final notebook I proceed as follows:
1. **Creation of player clusters:** Using *sklearn*'s implementation of *k-means*, I produced clusters grouping the different players. The resulting clusters were characterised via the Silhouette Coefficient (SC); together with the cluster mean and standard deviation, this was used to identify the optimal number of clusters.

2. **Considered the player morphology within the clusters:**
    - Produced radar charts, with the different features at the different angles, to help classify the different players.
    - Visualised the clustering with t-SNE.
    
3. **Build a Classification model:** Using the cluster labels as the target, I tested a group of candidate models (precision in brackets):
    - Support Vector Machine Classifier (SVC) (98.49%)
    - Logistic Regression (LR) (98.72%)
    - k-Nearest Neighbours (KNN) (97.43%)
    - Decision Tree (TR) (89.35%)
    - Random Forest Classifier (RF) (98.97%)
    - Gradient Boost (GB) (99.42%)
    - Ensemble model combining GB, RF, LR, SVC via SoftVoting
    
    To do this, I split the dataset into 3 datasets as follows:

    1. 10% = *X,Y_validation*, held out to check inference once modelling is complete
    2. 90% × 80% (72% of the data) = *X,Y_train*, the training dataset
    3. 90% × 20% (18% of the data) = *X,Y_testing*, the testing dataset
    
4. **Prediction on a Validation Dataset:** The models above were validated against *X,Y_validation*. The resulting precisions were:

Model | Precision
--- | --- 
Support Vector Machine|98.50%
Logistic Regression|98.96%
k-Nearest Neighbors|97.75%
Decision Tree|90.01%
Random Forest|98.73%
Gradient Boosting|99.25%
Voting Classifier|99.19%
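
For readers who want the shape of steps 1, 3 and 4 in one place, here is a minimal sketch on synthetic data. It is not the notebook's code, and it rests on several assumptions: the features come from `make_blobs` rather than the player dataset, the number of clusters is picked from the mean silhouette score alone (the notebook also looked at the cluster mean and standard deviation), the splits are stratified, only Gradient Boosting is fitted, and precision is averaged with `average="weighted"`, which the write-up does not specify.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import precision_score, silhouette_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the per-player feature table (NOT the real data).
X_raw, _ = make_blobs(n_samples=600, centers=4, n_features=5, random_state=0)
X = StandardScaler().fit_transform(X_raw)

# Step 1: pick the number of clusters by mean Silhouette Coefficient (assumption).
scores = {}
for k in range(2, 7):
    labels_k = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels_k)
best_k = max(scores, key=scores.get)
y = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(X)

# Step 3: hold out 10% for validation, then split the rest 80/20 into train/test.
X_rest, X_validation, y_rest, y_validation = train_test_split(
    X, y, test_size=0.10, stratify=y, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X_rest, y_rest, test_size=0.20, stratify=y_rest, random_state=0
)

# Fit one candidate model to predict the cluster label from the features.
model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# Step 4: report precision on the test split and on the untouched validation split.
precisions = {}
for name, X_part, y_part in [("test", X_test, y_test),
                             ("validation", X_validation, y_validation)]:
    precisions[name] = precision_score(
        y_part, model.predict(X_part), average="weighted"
    )
    print(f"{name}: precision = {precisions[name]:.4f}")
```

Keeping the validation split out of both model fitting and model comparison is what lets it act as a check on inference once modelling is complete.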
