# Example 06: How to use synthetic data to enable unsupervised learning

-------------------------------------------

## Overview



 - AitiaExplorer uses synthetic data to enable unsupervised learning.
 - This is achieved using a BayesianGaussianMixture (BGMM).
 - A BGMM can be used for clustering but it can also be used to model the data distribution that best represents the data.
 - This means that a BGMM can be used to model a data distribution and provide samples from that distribution, allowing the creation of synthetic data.
 - This synthetic data can then be combined with the real data, along with an extra label that separates the synthetic data from the real data.
 - This new dataset can then allow a classifier to be trained to recognise the real data in an unsupervised manner. 
 - The code below in the method `get_synthetic_training_data` creates such a dataset.
 - These classifiers are used internally in AitiaExplorer to select the most important features in a dataset.
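
The two uses of a BGMM described above (clustering and sampling from the fitted distribution) can be sketched on toy data. This is only an illustration: the two-blob dataset and the variable names are made up here and are not part of AitiaExplorer.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)

# Toy data (illustrative only): two separated blobs in two dimensions.
blob_a = rng.normal(loc=-5.0, scale=1.0, size=(150, 2))
blob_b = rng.normal(loc=5.0, scale=1.0, size=(150, 2))
data = np.vstack([blob_a, blob_b])

bgmm = BayesianGaussianMixture(n_components=2,
                               covariance_type="full",
                               random_state=0).fit(data)

# Use 1: clustering, one component index per input row.
labels = bgmm.predict(data)

# Use 2: generation, new rows drawn from the fitted distribution.
# sample() returns the rows and the component each row was drawn from.
samples, components = bgmm.sample(5)
```

The `samples` array has the same number of columns as the training data, which is what lets it be stacked with the real rows later on.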

### Imports

In [18]:
import os
import sys
import numpy as np
import pandas as pd
import math
import matplotlib.pyplot as plt 
from sklearn import mixture
from sklearn.linear_model import LinearRegression, SGDClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score  

module_path = os.path.abspath(os.path.join('../src'))
if module_path not in sys.path:
    sys.path.append(module_path)
from aitia_explorer.app import App

# stop the warning clutter
import warnings
warnings.filterwarnings('ignore')

### Define Utility Methods

- `get_gmm_sample_data` creates sample data from a BGMM.
- `get_synthetic_training_data` creates training data from synthetic and real data.
- These methods are taken from the internals of AitiaExplorer.

In [2]:
def get_gmm_sample_data(incoming_df, column_list, sample_size):
    """
    Unsupervised Learning in the form of BayesianGaussianMixture to create sample data.
    """
    gmm = mixture.BayesianGaussianMixture(n_components=2,
                                          covariance_type="full",
                                          n_init=100,
                                          random_state=42).fit(incoming_df)
    clustered_data = gmm.sample(sample_size)
    clustered_df = pd.DataFrame(clustered_data[0], columns=column_list)
    return clustered_df

In [3]:
def get_synthetic_training_data(incoming_df):
    """
    Creates synthetic training data by sampling from a BayesianGaussianMixture supplied distribution.
    Synthetic data is then labelled differently from the original data.
    """
    # number of records in df
    number_records = len(incoming_df.index)

    # get sample data from the unsupervised BayesianGaussianMixture
    df_bgmm = get_gmm_sample_data(incoming_df, list(incoming_df), number_records)

    # set the class on the samples
    df_bgmm['original_data'] = 0

    # add the class to a copy of incoming df, stops weird errors due to changed dataframes
    working_df = incoming_df.copy(deep=True)
    working_df['original_data'] = 1

    # concatenate the two dataframes (DataFrame.append was removed in pandas 2.0)
    df_combined = pd.concat([working_df, df_bgmm], ignore_index=True)

    # shuffle the data
    df_combined = df_combined.sample(frac=1)

    # get the X and y
    x = df_combined.drop(['original_data'], axis=1).values
    y = df_combined['original_data'].values
    y = y.ravel()

    return x, y

### Set up training 

- Now we will set up for training the classifiers by creating an AitiaExplorer instance and using it to load the [HEPAR II](https://www.bnlearn.com/bnrepository/#hepar2) dataset.
- This data will then be divided into training and test datasets.

In [6]:
aitia = App()

In [7]:
df = aitia.data.hepar2_10k_data()

In [9]:
# get the synthetic data
X, y = get_synthetic_training_data(df)

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

### Train the classifiers

- Now we will train some classifiers that normally need labelled data i.e. for supervised learning. 
- However, as we have created a synthetic training set, we can use these classifiers in an unsupervised manner to learn the real data.
- The scores will then be displayed.

In [11]:
models = [LinearRegression, 
          SGDClassifier, 
          RandomForestClassifier, 
          GradientBoostingClassifier, 
          XGBClassifier]

In [19]:
model_results = dict()

def to_class_labels(y_pred):
    """
    Thresholds predictions at 0.5 so that continuous output (e.g. from LinearRegression)
    becomes 0/1 labels. Predictions that are already 0/1 are left unchanged.
    """
    return (np.asarray(y_pred) >= 0.5).astype(int)

for model in models:
    current_model = model()
    # fit the model
    current_model.fit(X_train, y_train)
    # predict
    y_pred = current_model.predict(X_test)
    # store the accuracy; accuracy_score needs discrete labels, not continuous values
    model_results[type(current_model).__name__] = accuracy_score(y_test, to_class_labels(y_pred))

# one row per model; a dict of scalars cannot be passed directly to pd.DataFrame
model_df = pd.Series(model_results, name='accuracy').to_frame()
model_df

## Observations

- Several of the classifiers have an almost perfect score on the synthetic dataset.
- Even though the `LinearRegression` model does very poorly, it is still useful for feature selection.
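
As a rough illustration of how a real-versus-synthetic classifier can rank features, the sketch below trains a `RandomForestClassifier` on toy data and reads its `feature_importances_`. It is not the AitiaExplorer internals: the data is hand-made (the "synthetic" rows are a stand-in rather than BGMM samples), and every name in it is hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 500

# Hypothetical "real" data: three features, with feature 0 shifted to a mean of 3.
real = rng.normal(size=(n, 3))
real[:, 0] += 3.0

# Hypothetical stand-in for the synthetic data: all features centred on 0.
synthetic = rng.normal(size=(n, 3))

# Label real rows 1 and synthetic rows 0, as in get_synthetic_training_data.
X_toy = np.vstack([real, synthetic])
y_toy = np.concatenate([np.ones(n, dtype=int), np.zeros(n, dtype=int)])

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_toy, y_toy)

# Features that help separate real from synthetic rows get higher importance.
importances = clf.feature_importances_
ranking = np.argsort(importances)[::-1]
```

Here `ranking` lists the feature indices from most to least important; in this toy setup only feature 0 differs between the two groups, so it carries most of the importance.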