# Creating synthetic SAMUeL data with SMOTE

## Description of SMOTE

SMOTE stands for Synthetic Minority Oversampling Technique [1]. SMOTE is more commonly used to create additional data to improve model fitting, especially when one or more classes have low prevalence in the data set, hence the description of the method as *oversampling*.

SMOTE works by finding near-neighbor points in the original data, and creating new data points from interpolating between two near-neighbor points.
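The interpolation step can be illustrated with a minimal NumPy sketch. This is not the `imbalanced-learn` implementation; the function name `smote_point`, the toy array `X_minority`, and the choice of `k=3` are assumptions made purely for illustration.

```python
import numpy as np

def smote_point(X, k=3, rng=None):
    """Create one synthetic point from a single-class array X
    (shape: n_samples x n_features). Illustrative sketch only."""
    rng = np.random.default_rng(rng)
    # Pick a point from the class at random
    i = rng.integers(len(X))
    # Euclidean distance from that point to every point in the class
    distances = np.linalg.norm(X - X[i], axis=1)
    # Indices of the k nearest neighbours (position 0 is the point itself)
    neighbours = np.argsort(distances)[1:k + 1]
    # Select one of the near neighbours at random
    j = rng.choice(neighbours)
    # Place the new point a random fraction of the way from point i to point j
    gap = rng.uniform(0, 1)
    return X[i] + gap * (X[j] - X[i])

# Hypothetical toy data: four points of one class, two features
X_minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 2.0]])
new_point = smote_point(X_minority, k=3, rng=42)
```

Because the new point lies on the line segment between two existing points, it always falls within the range of the original data for each feature.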

Here we remove the real data used to create the synthetic data, leaving only the synthetic data. After generating synthetic data we remove any data points that, by chance, are identical to original real data points, and also remove the 10% of points that are closest to the original data points. We measure 'closeness' by the Euclidean distance between standardised data values.
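A minimal sketch of this filtering step is shown below, using NumPy only. It is an illustration under stated assumptions rather than the code used in this project: the function name `remove_close_points` is hypothetical, features are standardised using the mean and standard deviation of the real data, distance is the straight-line distance to the nearest real point, and the 10% is taken from the synthetic points that remain after exact duplicates are dropped.

```python
import numpy as np

def remove_close_points(X_real, X_synth, fraction=0.1):
    """Drop synthetic points identical to a real point, then drop the
    closest `fraction` of the remainder. Illustrative sketch only."""
    # Standardise both sets using the real data (assumption)
    mean = X_real.mean(axis=0)
    std = X_real.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant features
    real_std = (X_real - mean) / std
    synth_std = (X_synth - mean) / std
    # Distance from each synthetic point to its nearest real point
    diff = synth_std[:, np.newaxis, :] - real_std[np.newaxis, :, :]
    nearest = np.sqrt((diff ** 2).sum(axis=2)).min(axis=1)
    # Remove synthetic points that are identical to a real point
    keep = nearest > 0
    # Remove the closest `fraction` of the remaining synthetic points
    n_drop = int(fraction * keep.sum())
    order = np.argsort(nearest)
    closest_kept = order[keep[order]][:n_drop]
    keep[closest_kept] = False
    return X_synth[keep]

# Hypothetical toy data: the first synthetic row duplicates a real row
rng = np.random.default_rng(0)
X_real = rng.normal(size=(50, 3))
X_synth = np.vstack([X_real[:1], rng.normal(size=(20, 3))])
X_filtered = remove_close_points(X_real, X_synth)
```

The pairwise distance array has one entry per (synthetic, real) pair, so for large data sets a nearest-neighbour search structure would be a more memory-efficient choice.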

![](./images/smote.png)

*Demonstration of SMOTE method. (a) Data points with two features (shown on x and y axes) are represented. Points are colour-coded by class label. (b) A data point from a class is picked at random, shown by the black point, and then the closest neighbours of the same class are identified, as shown by yellow points. Here we show 3 closest neighbours, but the default in the `imbalanced-learn` SMOTE implementation is 5 (`k_neighbors=5`). One of those near-neighbour points is selected at random (shown by the second black point). A new data point, shown in red, is created at a random distance between the two selected data points.*

### Handling integer, binary, and categorical data

The standard SMOTE method generates floating point (non-integer) values between data points. There are alternative ways of handling integer, binary, and categorical data using the SMOTE method. Here the methods we use are:

* *Integer* values: Round the resulting synthetic data point value to the closest integer.

* *Binary*: Code the value as 0 or 1, and round the resulting synthetic data point value to the closest integer (0 or 1).

* *Categorical*: One-hot encode the categorical feature. Generate the synthetic data for each category value. Identify the category with the highest value and set to 1 while setting all others to 0.
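The three rules above can be illustrated on a small made-up array. The column layout and values here are hypothetical and chosen only to show the effect of each rule.

```python
import numpy as np

# Hypothetical synthetic rows:
# [integer feature, binary feature, one-hot category A, B, C]
synthetic = np.array([[67.4, 0.62, 0.20, 0.70, 0.10],
                      [54.6, 0.31, 0.55, 0.00, 0.45]])
processed = synthetic.copy()

# Integer: round to the closest integer
processed[:, 0] = np.round(processed[:, 0])

# Binary: round to 0 or 1
processed[:, 1] = np.round(processed[:, 1])

# Categorical: find the category with the highest value in each row,
# then set that category to 1 and all others to 0
highest = processed[:, 2:5].argmax(axis=1)
processed[:, 2:5] = 0.0
processed[np.arange(len(processed)), 2 + highest] = 1.0

print(processed)
# [[67.  1.  0.  1.  0.]
#  [55.  0.  1.  0.  0.]]
```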

### Implementation with imbalanced-learn

Here we use the SMOTE implementation in the `imbalanced-learn` library (imported as `imblearn`) [2].

[1] Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P. “SMOTE: Synthetic minority over-sampling technique,” Journal of Artificial Intelligence Research, vol. 16, pp. 321-357, 2002.

[2] Lemaitre, G., Nogueira, F., Aridas, C. “Imbalanced-learn: A Python Toolbox to Tackle the Curse of Imbalanced Datasets in Machine Learning,” arXiv:1609.06570, 2016 (https://pypi.org/project/imbalanced-learn/, `pip install imbalanced-learn`).

## Load packages

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 

from imblearn.over_sampling import SMOTE

# Turn warnings off for notebook publication
import warnings
warnings.filterwarnings("ignore")

## Load data (k-fold split 0)

In [2]:
data = pd.read_csv('./../data/sam_1/kfold_5fold/train_0.csv')
original_col_list = list(data)

In [3]:
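def make_one_hot(x):
    """
    Takes an array or series and returns a copy with 1 for the highest value
    and 0 for all others. A pandas Series is returned with its original index.
    
    """
    values = np.asarray(x, dtype=float)
    # Start with all values set to zero
    one_hot = np.zeros_like(values)
    # Set argmax to one
    one_hot[np.argmax(values)] = 1.0
    # Preserve the index when given a pandas Series
    if isinstance(x, pd.Series):
        return pd.Series(one_hot, index=x.index)
    
    return one_hot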

In [4]:
col_list = [
    'MoreEqual80y', 
    'S1Gender',
    'S1Ethnicity',
    'S1OnsetInHospital',
    'S1OnsetTimeType',
    'S1OnsetDateType',
    'S1ArriveByAmbulance',
    'S1AdmissionHour',
    'S1AdmissionDay',
    'S1AdmissionQuarter',
    'S1AdmissionYear',
    'CongestiveHeartFailure',
    'Hypertension',
    'AtrialFibrillation',
    'Diabetes',
    'StrokeTIA',
    'AFAntiplatelet',
    'AFAnticoagulentVitK',
    'AFAnticoagulentDOAC',
    'AFAnticoagulentHeparin',
    'S2NewAFDiagnosis',
    'S2StrokeType',
    'S2TIAInLastMonth']

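# For each categorical feature, collect the names of its one-hot encoded
# columns (columns whose names start with the feature name)
X_col_names = list(data)
one_hot_cols = []
for col in col_list:
    one_hot = [x for x in X_col_names if x.startswith(col)]
    one_hot_cols.append(one_hot)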

In [5]:
integer_cols = [
    'S1AgeOnArrival',
    'S2RankinBeforeStroke',
    'Loc',
    'LocQuestions',
    'LocCommands',
    'BestGaze',
    'Visual',
    'FacialPalsy',
    'MotorArmLeft',
    'MotorArmRight',
    'MotorLegLeft',
    'MotorLegRight',
    'LimbAtaxia',
    'Sensory',
    'BestLanguage',
    'Dysarthria',
    'ExtinctionInattention',
    'S2NihssArrival']

integer_min_max = dict()
for col in integer_cols:
    col_min = int(data[col].min())
    col_max = int(data[col].max())
    integer_min_max[col] = (col_min, col_max)
    
# Manually clip age to 30 - 100 to avoid using extremes
integer_min_max['S1AgeOnArrival'] = (30, 100)

In [6]:
synthetic_dfs = []

groups = data.groupby('StrokeTeam') # creates a new object of groups of data
count = 0 

for index, group_df in groups: # each group has an index and a dataframe of data

    count += 1
    print (f'\rSynthesising {count} out of {len(groups)}', end='')
    
    # Split data in X and y
    X = group_df.drop(['S2Thrombolysis', 'StrokeTeam'], axis=1)
    y = group_df['S2Thrombolysis']
    
    X_col_names = list(X)
    X = X.values
    y = y.values
    
    # Count number in each class
    count_label_0 = np.sum(y==0)
    count_label_1 = np.sum(y==1)
    
    # Skip hospitals with fewer than 10 in either class
    if min(count_label_0, count_label_1) < 10:
        continue
        
    # Resample each class to 2x its original size, so the number of
    # synthetic points in each class equals the number of original points
    n_class_0 = 2 * count_label_0
    n_class_1 = 2 * count_label_1

    X_resampled, y_resampled = SMOTE(
        sampling_strategy = {0:n_class_0, 1:n_class_1}).fit_resample(X, y)

    # Get just synthetic data (ignore original data). The resampled arrays
    # hold the original samples first, followed by the synthetic samples.
    X_synthetic = X_resampled[len(X):]
    y_synthetic = y_resampled[len(y):]
        
    # Reconstruct dataframe
    df = pd.DataFrame(X_synthetic, columns=X_col_names)
    df['S2Thrombolysis'] = y_synthetic
    df['StrokeTeam'] = index
    
    # Make one hot as necessary
    for one_hot_list in one_hot_cols:
        for row_index, row in df.iterrows():
            x = row[one_hot_list]
            x_one_hot = make_one_hot(x)
            row[x_one_hot.index] = x_one_hot.values
            df.loc[row_index] = row
    
    # Round and clip integer columns
    for col in integer_cols:
        df[col] = df[col].round(0)
        # Clip
        df[col] = np.clip(
            df[col], integer_min_max[col][0], integer_min_max[col][1])
    
    # Add to list
    synthetic_dfs.append(df)
    
# Concatenate results and shuffle
synthetic_data = pd.concat(synthetic_dfs)
synthetic_data = synthetic_data.sample(frac=1.0)
synthetic_data = synthetic_data[original_col_list]

Synthesising 132 out of 132

In [7]:
# Save data
filename = './../data/sam_1/kfold_5fold/synth_train_0.csv'
synthetic_data.to_csv(filename, index=False)

(Other k-fold data sets are created in other notebooks running at the same time.)