[![CyVers](https://i.imgur.com/yyhmZET.png)](https://www.cyvers.ai/)

# Solidus Blind Test - Exploratory Data Analysis (EDA)

> Notebook by:
> - Hakan UNAL Hakan@cyvers.ai
> - Royi Avital Royi@cyvers.ai

## Revision History

| Version | Date       | Content / Changes                      |
|---------|------------|----------------------------------------|
| 1.0.000 | 03/06/2022 | First version                          |
|         |            |                                        |

In [None]:
# General Tools
import numpy as np
import scipy as sp
import pandas as pd

# Misc
import random

import os
import datetime
from platform import python_version

# EDA Tools
import ppscore as pps #<! See https://github.com/8080labs/ppscore -> pip install git+https://github.com/8080labs/ppscore.git

# Ensemble Engines
import lightgbm
import xgboost

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
from bokeh.plotting import figure, show

# Jupyter
from ipywidgets import interact, Dropdown, Layout

In [None]:
# Configuration
%matplotlib inline

seedNum = 512
np.random.seed(seedNum)
random.seed(seedNum)

sns.set_theme() #<! Apply Seaborn theme

In [None]:
# Constants

DATA_FOLDER_NAME    = 'DataSet'
DATA_FOLDER_NAME    = 'DataSet/Test'
DATA_FILE_EXT       = 'csv'

In [None]:
# Parameters

csvFileName = 'Dataset Bitmart.csv'
csvFileName = 'All.csv'

In [None]:
# Loading / Generating Data

dfData = pd.read_csv(os.path.join(DATA_FOLDER_NAME, csvFileName))
numRows, numCols = dfData.shape

print(f'The number of rows (Samples): {numRows}, The number of columns: {numCols}')

In [None]:
# Convert time data into Pandas format
dfData['Transaction Time'] = pd.to_datetime(dfData['Transaction Time'], infer_datetime_format = True) #<! Stable time format

In [None]:
dfData.head(20)

In [None]:
dfData.info()

In [None]:
dfData.describe()

## Feature Engineering

This section adds features and engineers them.  
It is assumed the files have a single unique `Sender`. Hence all analysis is done on the receivers.


The features are:

 1. 

Remarks:

 *  Features x-y are time / frequency related.
 *  Features z-t are transaction related.
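
The per receiver features in this section follow one pattern: group by the receiver, aggregate, and broadcast the result back to every row with `transform`. Below is a minimal sketch of that pattern on a toy table. The data and the `Num Valid Amount` column are made up for illustration; only the column names `Receiver ID` and `Amount [USD]` come from the notebook. It also shows the difference between `size` (counts rows) and `count` (counts non NaN values).

```python
import numpy as np
import pandas as pd

# Toy data (Hypothetical values), one NaN amount for receiver `A`
dfToy = pd.DataFrame({
    'Receiver ID':  ['A', 'A', 'B', 'A', 'B'],
    'Amount [USD]': [10.0, np.nan, 5.0, 30.0, 15.0],
})

dfGrpToy = dfToy.groupby('Receiver ID')

# `transform` returns a result aligned to the original rows (One value per row)
dfToy['Num Trns User']    = dfGrpToy['Receiver ID'].transform('size')   #<! Counts rows, NaN included
dfToy['Num Valid Amount'] = dfGrpToy['Amount [USD]'].transform('count') #<! Counts non NaN values only
dfToy['Sum Value User']   = dfGrpToy['Amount [USD]'].transform('sum')   #<! NaN values are skipped
dfToy['Mean Value User']  = dfGrpToy['Amount [USD]'].transform('mean')  #<! NaN values are skipped

print(dfToy)
```

Every row of a receiver carries the same aggregate values, which is what allows using them later as per transaction features.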


In [None]:
# Data grouped by user as most operations work on users
dfGrpUser = dfData.sort_values('Transaction Time').groupby('Receiver ID')

In [None]:
# General Features (Transactions)
dfData['Num Trns User'] = dfGrpUser['Receiver ID'].transform('size') #<! We should use `count` instead of `size` to remove NaN
dfData

In [None]:
# Amount Related Features

# dfData['Sum Value User'] = dfGrpUser['Amount [USD]'].transform(lambda x: x.sum()) #<! Using a lambda function is much slower
dfData['Sum Value User'] = dfGrpUser['Amount [USD]'].transform('sum') #<! NaN values are skipped by `sum`
dfData['Mean Value User'] = dfGrpUser['Amount [USD]'].transform('mean') 
dfData['STD Value User'] = dfGrpUser['Amount [USD]'].transform('std') 
dfData['Max Value User'] = dfGrpUser['Amount [USD]'].transform('max') 
dfData['Min Value User'] = dfGrpUser['Amount [USD]'].transform('min')
dfData

In [None]:
# Time Related Features

dfData['First Action User'] = dfGrpUser['Transaction Time'].transform('min')
dfData['Last Action User'] = dfGrpUser['Transaction Time'].transform('max')
dfData['Active Duration User'] = np.maximum((dfData['Last Action User'] - dfData['First Action User']).dt.total_seconds(), 1) #<! Limiting the minimal period to 1 [Sec] (Will be used for division)
dfData['Frequency Trns. / Days'] = dfData['Num Trns User'] / (dfData['Active Duration User'] / (24 * 3600))

# dfData['Diff Trns Time'] = dfGrpUser['Transaction Time'].rolling(2).apply(lambda x: x[1] - x[0], raw = True)
# dfData['Diff Trns Time'] = np.NaN
# for grpName, dfGroup in dfGrpUser:

#     vIndx = dfGroup.index

#     for ii, idxVal in enumerate(vIndx):
#         if ii == 0:
#             continue

#         dfData.loc[idxVal, 'Diff Trns Time'] = (dfData.loc[idxVal, 'Transaction Time'] - dfData.loc[vIndx[ii - 1], 'Transaction Time']).total_seconds()

dfData['Diff Trns Time'] = dfGrpUser['Transaction Time'].diff().dt.total_seconds()

#<! Since we use newly created column we can't use the groups and apply
for grpName, dfGroup in dfGrpUser:

    vIndx = dfGroup.index

    dfData.loc[vIndx, 'Mean Time Diff'] = dfData.loc[vIndx, 'Diff Trns Time'].mean()
    dfData.loc[vIndx, 'STD Time Diff']  = dfData.loc[vIndx, 'Diff Trns Time'].std()
    dfData.loc[vIndx, 'Max Time Diff']  = dfData.loc[vIndx, 'Diff Trns Time'].max()
    dfData.loc[vIndx, 'Min Time Diff']  = dfData.loc[vIndx, 'Diff Trns Time'].min()

dfData
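
As a possible alternative to the explicit loop over the groups, the statistics of the time differences can be broadcast with `transform`, as long as the grouping is created after the `Diff Trns Time` column exists. The sketch below runs on toy data (Hypothetical values); `dfToy` and `dfGrpDiff` are names introduced here for illustration only.

```python
import pandas as pd

# Toy data (Hypothetical values), rows are intentionally not sorted by time
dfToy = pd.DataFrame({
    'Receiver ID': ['A', 'B', 'A', 'B', 'A'],
    'Transaction Time': pd.to_datetime([
        '2022-01-01 00:00:00', '2022-01-01 00:00:10', '2022-01-01 00:01:00',
        '2022-01-01 00:00:40', '2022-01-01 00:03:00',
    ]),
})

# Time [Sec] since the previous transaction of the same receiver (NaN for the first one)
# The assignment aligns by index, so sorting does not change the row order of `dfToy`
dfToy['Diff Trns Time'] = (
    dfToy.sort_values('Transaction Time')
         .groupby('Receiver ID')['Transaction Time']
         .diff()
         .dt.total_seconds()
)

# Group again now that the column exists, then broadcast each statistic to the rows
dfGrpDiff = dfToy.groupby('Receiver ID')['Diff Trns Time']
dfToy['Mean Time Diff'] = dfGrpDiff.transform('mean')
dfToy['STD Time Diff']  = dfGrpDiff.transform('std')  #<! NaN for receivers with a single difference
dfToy['Max Time Diff']  = dfGrpDiff.transform('max')
dfToy['Min Time Diff']  = dfGrpDiff.transform('min')

print(dfToy)
```

On large data this is expected to be faster than looping over the groups, since the aggregation is done inside Pandas.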

## Data Analysis & Visualization

This section visualizes the data and the features.

In [None]:
# Selected features for analysis
lSlctdFeatures  = ['Amount [USD]', 'Num Trns User', 'Sum Value User', 'Mean Value User', 'STD Value User', 'Max Value User', 'Min Value User', 'Active Duration User', 'Frequency Trns. / Days', 'Mean Time Diff', 'STD Time Diff', 'Max Time Diff', 'Min Time Diff']
# lSlctdFeatures  = ['Amount [USD]', 'Num Trns User', 'Sum Value User', 'STD Value User', 'Max Value User', 'Min Value User', 'Active Duration User', 'Frequency Trns. / Days', 'STD Time Diff', 'Max Time Diff', 'Min Time Diff']
numFeatures     = len(lSlctdFeatures)

### Predictive Power Score (PPS)

This analysis shows the relation between the different features.  
The idea is to estimate one feature from another one. Compared to correlation, it can also reveal links which are non linear or asymmetric.
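
To make the idea concrete, here is a minimal sketch of a PPS like score for two numeric features. It is a simplified stand in and not the `ppscore` implementation: the hypothetical helper `BinnedPredictiveScore()` predicts `y` by the median of `y` within quantile bins of `x`, and it is evaluated in sample (No cross validation). The score is `1 - MAE(model) / MAE(naive)`, where the naive model always predicts the median of `y`.

```python
import numpy as np

def BinnedPredictiveScore( vX, vY, numBins = 20 ):
    # Toy score (Assumption, not the `ppscore` algorithm): 1 - MAE(model) / MAE(naive)
    vX = np.asarray(vX, dtype = float)
    vY = np.asarray(vY, dtype = float)

    # Assign each sample to a quantile bin of x
    vEdges  = np.quantile(vX, np.linspace(0, 1, numBins + 1))
    vBinIdx = np.clip(np.searchsorted(vEdges, vX, side = 'right') - 1, 0, numBins - 1)

    # Model: Predict y by the median of y within the bin of the sample
    vPred = np.empty_like(vY)
    for binIdx in np.unique(vBinIdx):
        vMask        = vBinIdx == binIdx
        vPred[vMask] = np.median(vY[vMask])

    maeModel = np.mean(np.abs(vY - vPred))
    maeNaive = np.mean(np.abs(vY - np.median(vY))) #<! Naive model: Always predict the median

    return max(0.0, 1.0 - (maeModel / maeNaive))

# y is fully determined by x, yet x is not determined by y (The sign is lost)
oRng = np.random.default_rng(0)
vX   = oRng.uniform(-1, 1, 2000)
vY   = vX ** 2

corrVal = np.corrcoef(vX, vY)[0, 1]
scoreXY = BinnedPredictiveScore(vX, vY) #<! Predict y by x
scoreYX = BinnedPredictiveScore(vY, vX) #<! Predict x by y

print(f'Correlation: {corrVal:0.3f}, Score x -> y: {scoreXY:0.3f}, Score y -> x: {scoreYX:0.3f}')
```

In this toy case the correlation is close to zero while the score of `x -> y` is high and the score of `y -> x` is low. This is the kind of non linear and asymmetric relation the PPS matrix is meant to expose.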

In [None]:
# Predictive Power Score (PPS)

mPPS = pps.matrix(dfData[lSlctdFeatures + ['Label']], **{'cross_validation': 20, 'random_seed': 1234})[['x', 'y', 'ppscore']].pivot(columns = 'x', index = 'y', values = 'ppscore')

In [None]:
# Visualization of PPS
hF, hA = plt.subplots(figsize = (30, 30))
sns.heatmap(mPPS, annot = True, fmt = '.2f', cmap = plt.get_cmap('coolwarm'), cbar = False, vmin = 0, vmax = 1, ax = hA) #<! Each cell: The column feature (x, Horizontal) predicts the row feature (y, Vertical)

plt.setp(hA.get_xticklabels(), ha = "center", rotation = 45)
plt.setp(hA.get_yticklabels(), rotation = 'horizontal')
hA.set_title('Predictive Power Score (PPS)')

In [None]:
valA = pps.score(dfData, 'Num Trns User', 'Label') #<! Predict y by x (pps.score(dfData, 'x', 'y'))
valA

In [None]:
valA = pps.score(dfData, 'Label', 'Num Trns User') #<! Predict y by x (pps.score(dfData, 'x', 'y'))
valA

In [None]:
dfGrpLabel = dfData.groupby('Label')

### Scatter Plot per Feature

In [None]:
# Working on a DataFrameGroupBy
# Below is Royi's approach, another approach: https://stackoverflow.com/questions/25279810
numColsFig = 5
numRowsFig = np.ceil(numFeatures / numColsFig).astype('int')

hF, mHA = plt.subplots(nrows = numRowsFig, ncols = numColsFig, figsize = (7 * numColsFig, 5 * numRowsFig))

def JitterXData( xVal, numSamples, jitterLvl = 0.01 ):
    return xVal + jitterLvl * np.random.randn(numSamples)

kk = 0
vShift = [-0.3, 0.3]
vC = [{'c': 'b'}, {'c': 'r'}]
vC = [{'c': ['b'], 'label': '0'}, {'c': ['r'], 'label': '1'}]
for ii, colName in enumerate(lSlctdFeatures):
    for jj, (grpName, dfGroup) in enumerate(dfGrpLabel):
        # sns.scatterplot(data = dfGroup, x = ii + vShift[jj], y = colName, ax = mHA.flat[kk], **vC[jj]) #<! mA.flat[kk] Allows linear indexing for non 1D arrays
        sns.scatterplot(data = dfGroup, x = JitterXData(ii + vShift[jj], dfGroup.shape[0]), y = colName, ax = mHA.flat[kk], **vC[jj]) #<! Added manual jitter
        mHA.flat[kk].tick_params(top = False, bottom = False, labelbottom = False)
        mHA.flat[kk].legend(title = 'Suspicious')
    kk += 1

In [None]:
# hF, hA = plt.subplots(figsize = (20, 10))

def DisplayFeature( dfData, xColName, yColName, hA ):
    hF, hA = plt.subplots(figsize = (20, 10))
    
    sns.scatterplot(data = dfData, x = xColName, y = yColName, hue = xColName, ax = hA)
    hA.tick_params(top = False, bottom = False, labelbottom = False)
    hA.legend(title = 'Suspicious')
    plt.show()


oDropdwon = Dropdown(
    options     = lSlctdFeatures,
    value       = 'Amount [USD]',
    description = 'Select Feature:',
    style       = {'description_width' : 'initial'}
)

interact(lambda yColName: DisplayFeature(dfData, 'Label', yColName, hA), yColName = oDropdwon)

# DisplayFeature(dfData, 'Label', 'Amount [USD]', hA)

In [None]:
# Plot the distribution
hF, hA = plt.subplots(figsize = (32, 12))

for ii, colName in enumerate(lSlctdFeatures):
    sns.scatterplot(data = dfData, x = JitterXData(ii, dfData.shape[0]), y = colName, hue = 'Label', ax = hA) #<! Too crowded
    
hLegHandles, hLegLabels = hA.get_legend_handles_labels()
hA.legend(hLegHandles[:2], hLegLabels[:2], title = 'Suspicious')

hA.set_xticks(range(numFeatures), lSlctdFeatures)
plt.setp(hA.get_xticklabels(), ha = "right", rotation = 45)

hA.set_xlabel('Variable')
hA.set_ylabel('Value')

In [None]:
# Plot the distribution
hF, hA = plt.subplots(figsize = (32, 12))

sns.scatterplot(data = dfData, x = 'Amount [USD]', y = 'Frequency Trns. / Days', hue = 'Label', ax = hA)

### Violin Plot per Feature

In [None]:
# Pre Process data:
# 1. Flatten all feature columns into a single long column (vA).
# 2. Per element set its feature name (in vB) and its Label (in vC)
vA = dfData.loc[:, lSlctdFeatures].to_numpy().flatten(order = 'F')
vB = np.tile(np.reshape(lSlctdFeatures, (numFeatures, 1)), (1, numRows))
vB = vB.flatten(order = 'C')
vC = np.tile(dfData.loc[:, 'Label'], (numFeatures,))
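
The manual flatten / tile above builds what is usually called a long format table. As a hedged alternative, `DataFrame.melt()` can produce the same layout in a single call. The sketch below uses toy data (Hypothetical values) and checks that both approaches agree; the names `dfToy`, `lFeatures`, `dfLong`, `Feature` and `Value` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (Hypothetical values)
dfToy = pd.DataFrame({
    'Amount [USD]':  [10.0, 20.0, 30.0],
    'Num Trns User': [1, 2, 2],
    'Label':         [0, 1, 0],
})
lFeatures = ['Amount [USD]', 'Num Trns User']

# Long format: One row per (Sample, Feature) pair, the Label is repeated per feature
dfLong = dfToy.melt(id_vars = 'Label', value_vars = lFeatures, var_name = 'Feature', value_name = 'Value')

# The manual approach for comparison
vA = dfToy.loc[:, lFeatures].to_numpy().flatten(order = 'F') #<! Values, column by column
vB = np.repeat(lFeatures, dfToy.shape[0])                    #<! Feature name per value
vC = np.tile(dfToy.loc[:, 'Label'], (len(lFeatures),))       #<! Label per value

isSame = (np.allclose(dfLong['Value'].to_numpy(dtype = float), vA)
          and (dfLong['Feature'].tolist() == vB.tolist())
          and (dfLong['Label'].tolist() == vC.tolist()))
print(isSame)
```

With such a table the plot can be written with named columns, for example `sns.violinplot(data = dfLong, x = 'Feature', y = 'Value', hue = 'Label', split = True)`.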

In [None]:
# Plot the distribution
hF, hA = plt.subplots(figsize = (32, 12))

# for ii, colName in enumerate(dfData.columns[featuresFirstIdx:]):
#     sns.scatterplot(data = dfData, x = ii, y = colName, hue = dfData.columns[2], ax = hA) #<! Too crowded
    # sns.violinplot(data = dfData, y = colName, x = ii * np.ones(dfData.shape[0]), hue = 'Cancer', split = True, ax = hA) #<! Doesn't work
    # sns.swarmplot(data = dfData, x = ii * np.ones(dfData.shape[0]), y = colName, hue = 'Cancer', ax = hA) #<! Doesn't work
    # sns.stripplot(data = dfData, x = ii * np.ones(dfData.shape[0]), y = colName, hue = 'Cancer', ax = hA) #<! Doesn't work
    


# hLegHandles, hLegLabels = hA.get_legend_handles_labels()
# hA.legend(hLegHandles[:2], hLegLabels[:2], title = 'Cancer')

# hA.set_xlabel('Variable')
# hA.set_ylabel('Value')

# Using the Pre Process data works!
sns.violinplot(x = vB, y = vA, hue = vC, inner = None, split = True, ax = hA)
hA.legend(title = 'Suspicious')
plt.setp(hA.get_xticklabels(), ha = "right", rotation = 45)
hA.set_ylabel('Value')



The long tails in the negative direction in some of the features above (For example, feature `Var025`) are due to the `log10()` transform. In order to prevent `-Inf` values a value of `1e-6` was added to all values.

Most of the features have large overlap and there is no feature which can, on its own, predict the objective very well.

## Distribution per Variable

In this section we'll review the distribution of the classes (`Label`: 0 / 1) per feature.  
Good features are those for which the overlap between the class distributions is small (More accurately, the mass under the overlapping part of the curves).
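
The overlap mass can be estimated numerically. Below is a minimal sketch, assuming a histogram based estimate on a shared grid of bins: the hypothetical helper `OverlapCoefficient()` sums the bin wise minimum of the two normalized histograms, so a value near 1 means the classes are nearly indistinguishable by this feature and a value near 0 means they are well separated. The samples are synthetic and only illustrate the two extremes.

```python
import numpy as np

def OverlapCoefficient( vA, vB, numBins = 50 ):
    # Histogram based estimate of the shared mass of 2 distributions (In [0, 1])
    vA = np.asarray(vA, dtype = float)
    vB = np.asarray(vB, dtype = float)

    # Shared bin edges so both histograms are comparable bin by bin
    vEdges = np.histogram_bin_edges(np.concatenate([vA, vB]), bins = numBins)
    vHA, _ = np.histogram(vA, bins = vEdges)
    vHB, _ = np.histogram(vB, bins = vEdges)

    # Normalize the counts to probability mass per bin
    vPA = vHA / vHA.sum()
    vPB = vHB / vHB.sum()

    return float(np.sum(np.minimum(vPA, vPB)))

# Synthetic samples (Not the notebook's data)
oRng = np.random.default_rng(0)
vNeg = oRng.normal(0.0, 1.0, 5000)

overlapSame = OverlapCoefficient(vNeg, oRng.normal(0.0, 1.0, 5000)) #<! Same distribution
overlapFar  = OverlapCoefficient(vNeg, oRng.normal(6.0, 1.0, 5000)) #<! Well separated classes

print(f'Overlap (Same): {overlapSame:0.3f}, Overlap (Separated): {overlapFar:0.3f}')
```

Applied per feature, with `vA` and `vB` being the values of the two `Label` groups, it gives a single number to rank features by. The estimate depends on the number of bins and on the sample size, so it should be read as a rough guide.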

In [None]:
# Currently not working!

# numColsFig = 5
# numRowsFig = np.ceil(numFeatures / numColsFig).astype('int')

# hF, mHA = plt.subplots(nrows = numRowsFig, ncols = numColsFig, figsize = (5 * numColsFig, 5 * numRowsFig))
# # hF.suptitle('Distribution of Features by Classes', fontsize = 16)
# # hF.supylabel('Density')

# kk = 0
# for ii in range(numFeatures):
#     sns.kdeplot(data = dfData, x = lSlctdFeatures[ii], hue = 'Label', ax = mHA.flat[kk]) #<! mA.flat[kk] Allows linear indexing for non 1D arrays
#     kk += 1


## Analysis of the Age

One of the most interesting questions is how the feature (Protein) profile changes with age.  
The idea is to understand whether a feature is a good predictor of the cancer or of the age (Which is by itself a predictor of the cancer in the general population).

Some remarks:

1. The _positive_ and _negative_ groups have different age profiles. For instance, there is no _positive_ case with age below 36.  
2. Most of the features show little / no correlation with age, which means they have the potential to predict a phenomenon regardless of the age.


> It is advised to keep the meta data fragmentation low while keeping the positive / negative ratio equal at each fragment.

In [None]:
# Pair Plot: Age vs. Feature

numColsFig = 6
numRowsFig = np.ceil(numFeatures / numColsFig).astype('int')

hF, mHA = plt.subplots(nrows = numRowsFig, ncols = numColsFig, figsize = (30, 60))

kk = 0
for ii, colName in enumerate(lSlctdFeatures):
    sns.scatterplot(data = dfData, x = 'Age', y = colName, hue = 'Label', ax = mHA.flat[kk]) #<! mHA.flat[kk] allows linear indexing for non 1D arrays; Assumes `dfData` has an `Age` column
    kk += 1

Even though no single feature can discriminate the 2 groups, in higher dimension a classifier might be able to do so.

## Classification / Anomaly Detection

In this section we'll apply a classifier on the data using the `LightGBM` package, a widely used package for ensembles of decision trees (Gradient boosting).  
For tabular data, an ensemble of decision trees is usually among the most effective models to apply.

In [None]:
# Pre Process

