# Data Distribution Evaluation of AutoMap
We evaluate whether we can improve the accuracy of the most probable predicted class by adjusting the predicted probabilities for parameter names based on how similar their data are to already mapped concepts.

This requires retrieving the original mapped data from disk. These data have already been processed by hand to ensure uniformity, and result in a table with at least the following columns:
- `parameter_name`, `concept_label`
- `p25`, `p50`, `p75`, `num_records` for numeric data
- an entry in `atc` for medication records if available

### Concept labels
Concept labels are retrieved from the general parameters table. A concept table is also loaded which contains the category structure for the labels for stratified analysis.
Output: dictionary concept_groups: {concept_label: category}
Output: dictionary concept_label_super_groups {concept_label: concept_label_super}


### Reference groups
A reference group is a set of previously mapped parameters which describes the expected content of a concept. Instead of defining extreme values for concepts, we retrieve reference values from actual measurements. Extreme values would need to be defined for each concept, which is tedious work and open to debate. Actual measurements, in contrast, may help to sort on subtle differences in measured values, such as those observed between inspiratory and expiratory tidal volumes. For each concept's reference group, matching parameters are retrieved and their data are pooled by taking an average of the percentiles, weighted by the number of records. This way, a parameter with a small sample size has less influence on the overall shape of the data distribution.
Note: another approach may be to take the minimum of the p25, the weighted average for p50, and the max of the p75, for parameters with at least N records.
Output: nested dictionary of concept labels: {p25: float, p50: float, p75: float, N: int}
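
The pooling step described above can be sketched as follows. This is a minimal illustration on made-up toy data, not the notebook's actual implementation: the helper name `pool_reference_group` and the example values are assumptions, and the output only mirrors the nested dictionary described above.

```python
import numpy as np
import pandas as pd

# Toy data (made up): three previously mapped parameters for two concepts.
mapped = pd.DataFrame({
    "concept_label": ["heart_rate", "heart_rate", "tidal_volume"],
    "p25": [60.0, 70.0, 400.0],
    "p50": [80.0, 90.0, 450.0],
    "p75": [100.0, 110.0, 500.0],
    "num_records": [900, 100, 50],
})

def pool_reference_group(group: pd.DataFrame) -> dict:
    """Pool the percentiles of one concept, weighting each parameter by its record count."""
    weights = group["num_records"]
    pooled = {p: float(np.average(group[p], weights=weights)) for p in ("p25", "p50", "p75")}
    pooled["N"] = int(weights.sum())
    return pooled

# Nested dictionary: {concept_label: {p25, p50, p75, N}}
reference_groups = {
    label: pool_reference_group(group)
    for label, group in mapped.groupby("concept_label")
}
print(reference_groups["heart_rate"])
```

In this toy example the parameter with 900 records dominates the pooled heart rate percentiles, so the pooled median stays close to 80 rather than moving halfway towards 90.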

### Distribution
The distribution of the underlying data is compared based on the 25th, 50th and 75th percentiles, also known as the lower quartile, median and upper quartile. In an earlier attempt, we used a modified T-statistic to calculate similarity and increased the probability of predicted labels if the data distribution was similar. The lower the T-statistic, the larger the increase, so that concepts with closer reference groups would become more visible. However, this led to many false positives and degraded the predictions. Therefore, we will now attempt not to select on similarity, but to deselect on dissimilarity, by discounting predictions where the reference group is deemed to be dissimilar.
Output: Function(parameter_id, predicted_concept_label, parameter_distributions, reference_groups) -> similarity statistic
Output: Function(similarity_statistic, probability_of_predicted_concept_label_for_this_parameter) -> probability_of_predicted_concept_label_for_this_parameter || 0
Output: Evaluation of performance change compared to base model at major group level
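
The two functions listed above could look roughly like the sketch below. The statistic used here (distance between medians in units of the reference interquartile range) and the threshold of `3.0` are assumptions chosen for illustration only; the text above does not fix a specific statistic or cut-off. The function names are hypothetical as well.

```python
from typing import Dict

def dissimilarity(parameter: Dict[str, float], reference: Dict[str, float]) -> float:
    """Hypothetical statistic: median distance expressed in reference IQR units."""
    iqr = reference["p75"] - reference["p25"]
    if iqr <= 0:
        # Degenerate reference distribution: only an identical median counts as similar.
        return 0.0 if parameter["p50"] == reference["p50"] else float("inf")
    return abs(parameter["p50"] - reference["p50"]) / iqr

def discount_probability(statistic: float, probability: float, threshold: float = 3.0) -> float:
    """Keep the probability when the distributions are close enough, otherwise return 0."""
    return probability if statistic <= threshold else 0.0

# Toy reference group and two toy parameters (values are made up).
reference = {"p25": 60.0, "p50": 80.0, "p75": 100.0}
close_parameter = {"p25": 70.0, "p50": 85.0, "p75": 100.0}
far_parameter = {"p25": 400.0, "p50": 450.0, "p75": 500.0}

kept = discount_probability(dissimilarity(close_parameter, reference), 0.6)
dropped = discount_probability(dissimilarity(far_parameter, reference), 0.6)
print(kept, dropped)
```

Because the probability is only ever kept or set to zero, a prediction can never gain rank from a similar distribution, which is the intended difference from the earlier T-statistic approach.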

### Record type
For medication, we can set the probability for non-medication labels to `0` if the parameter contains an `ATC` value. This requires `concept_labels` to contain a grouping structure which marks concepts as ATC-linked. This can be achieved either through manual labeling of each concept (ideal, but a lot of work), or by labeling concepts based on earlier mapped parameters. For practical purposes, we assume all concept_labels starting with `med_` to be medication `concept_labels`. Using the ATC code to look up the concept_label is not preferred, as ATC codes are provided by hospitals and may not be validated, as was regularly seen during the mapping for the COVID project.
Output: Function(parameter_id, predicted_concept_label, parameter_distributions) -> parameter_has_atc_and_predicted_concept_is_medication as bool
Output: Function(parameter_has_atc_and_predicted_concept_is_medication, probability_of_predicted_concept_label_for_this_parameter) -> probability_of_predicted_concept_label_for_this_parameter || 0
Output: Evaluation of performance change compared to base model at major group level
Output: Evaluation of performance change compared to base model + distribution at major group level
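
A minimal sketch of the record type rule, assuming predictions in long format with `id`, `label` and `value` columns and a hypothetical lookup `atc_by_id` from parameter id to ATC code. The helper names and the toy data are illustrative assumptions. Only the `med_` prefix convention comes from the text above.

```python
import pandas as pd

# Toy predictions in long format: one row per (parameter id, candidate label).
predictions = pd.DataFrame({
    "id": [0, 0, 1, 1],
    "label": ["med_propofol", "heart_rate", "med_propofol", "heart_rate"],
    "value": [0.4, 0.6, 0.3, 0.7],
})
# Hypothetical lookup: parameter 0 carries an ATC code, parameter 1 does not.
atc_by_id = {0: "N01AX10", 1: None}

def is_medication_label(label: str) -> bool:
    """Practical assumption from the text: medication concepts start with 'med_'."""
    return label.startswith("med_")

def apply_atc_filter(predictions: pd.DataFrame, atc_by_id: dict) -> pd.DataFrame:
    """Set the probability of non-medication labels to 0 for parameters with an ATC code."""
    out = predictions.copy()
    has_atc = out["id"].map(atc_by_id).notna()
    non_medication = ~out["label"].map(is_medication_label)
    out.loc[has_atc & non_medication, "value"] = 0.0
    return out

filtered = apply_atc_filter(predictions, atc_by_id)
print(filtered)
```

Parameters without an ATC code are left untouched, so the rule only removes candidates and never promotes a medication label for a parameter that lacks one.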


### Evaluation
To evaluate the performance of this model, we collect the accuracy, precision, recall and F1-score at the concept_label level (e.g. `tidal_volume_measured_ventilator`) without stratifying. We also collect the accuracy stratified by relevance (relevant/irrelevant records), by major group (hemodynamic, respiratory, medication), and by minor group (tidal volume, heart rate) where these are available in the grouping structure. The accuracy is plotted as a cumulative accuracy over the number of predicted labels sorted by their probability, with lines for the concept_label level and for the relevant/irrelevant parameters, alongside the number of parameters which still receive a predicted label at label position N. The precision, recall and F1-scores are reported as tables with rows (y-values) corresponding to the stratified labels.
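
The cumulative accuracy over ranked predictions can be illustrated with a small example. This is a sketch on made-up data, assuming long-format predictions with `id`, `label` and `value` columns; the `truth` dictionary is a hypothetical stand-in for the true concept labels.

```python
import pandas as pd

# Toy predictions: three parameters, each with three candidate labels.
predictions = pd.DataFrame({
    "id":    [0, 0, 0, 1, 1, 1, 2, 2, 2],
    "label": ["a", "b", "c", "b", "a", "c", "c", "b", "a"],
    "value": [0.6, 0.3, 0.1, 0.5, 0.4, 0.1, 0.7, 0.2, 0.1],
})
truth = {0: "a", 1: "a", 2: "a"}  # hypothetical true label per parameter id

# Rank candidates per parameter by descending probability (rank 1 = most probable).
ranked = predictions.sort_values(["id", "value"], ascending=[True, False]).copy()
ranked["rank"] = ranked.groupby("id").cumcount() + 1
ranked["correct"] = ranked["label"] == ranked["id"].map(truth)

# Fraction of parameters whose true label appears within the top-k predictions.
hits_per_rank = ranked.groupby("rank")["correct"].sum()
cumulative_accuracy = hits_per_rank.cumsum() / ranked["id"].nunique()
print(cumulative_accuracy)
```

Here the true label is found at rank 1, 2 and 3 for the three parameters respectively, so the curve rises by one third at each rank.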

In [None]:
import os
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
from typing import Tuple, List, Dict, Union

import sklearn

sns.set_style('white')
sns.set_context("paper", font_scale = 1)


## Definitions

In [None]:
TRAIN_HOSPITALS = ['vumc', 'amc', 'erasmus', 'olvg']
EHR_SYSTEMS = ['epic', 'mv', 'hix']
SOURCE = 'parameter_name'
TARGET = 'concept_label'
HOSPITAL_COLUMN = 'hospital_name'
DATA_DISTRIBUTION_COLUMNS = ['amin', 'amax', 'p25', 'p50', 'p75', 'p50_over_iqr', 'iqr_over_p50', 'skewed']
DATA_DISTRIBUTION_WEIGHTS = 'num_records'

In [None]:
def calculate_accuracy(y_true, y_pred):
    return np.sum(y_true == y_pred) / len(y_true)

def calculate_performance(y_true, y_pred):
    return {
        'accuracy': sklearn.metrics.accuracy_score(y_true, y_pred),
        'precision': sklearn.metrics.precision_score(y_true, y_pred, average='weighted'),
        'recall': sklearn.metrics.recall_score(y_true, y_pred, average='weighted'),
        'f1_score': sklearn.metrics.f1_score(y_true, y_pred, average='weighted'),
    }


In [None]:
def transform_predictions_to_proportions(predictions: pd.DataFrame,
                                         original_data: pd.DataFrame,
                                         cumulative_score: bool = False,
                                         ) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """
    Transform probability predictions from the AutoMap class to proportions of correct predictions per rank stratified over relevance, irrelevance and overall scores to be used for plotting.

    :param predictions: pandas DataFrame of probability predictions as output by the AutoMap class, where each row represents a prediction for a concept for a document and its corresponding probability.
    :param original_data: pandas DataFrame of the original data as input to the AutoMap class, where each row represents a document and its corresponding true concept labels.
    :param cumulative_score: boolean indicating whether to calculate cumulative scores over the top X predictions.
    :return: tuple of pandas DataFrames where the first returns the probability while the second returns the number of parameters for which predictions were made
    """

    _predictions = predictions.sort_values(['id', 'value'], ascending=[True, False])
    _predictions[TARGET] = _predictions['id'].map(original_data.set_index('id')[TARGET])
    _predictions['rank'] = _predictions.groupby('id').cumcount()
    _predictions['rank_correct'] = (_predictions['label'] == _predictions[TARGET]).astype(int)
    _predictions['relevance'] = (_predictions[TARGET] == 'unmapped').map({True: 'irrelevant',
                                                                          False: 'relevant'})
    # calculate scores
    scores = _predictions.groupby(['relevance', 'rank'])['rank_correct'].sum().reset_index()
    scores_plot = scores.pivot(index=['rank'], columns=['relevance'], values=['rank_correct']).fillna(0)
    scores_plot.columns = scores_plot.columns.droplevel()
    scores_plot['overall'] = scores_plot.sum(axis=1)
    if cumulative_score:
        scores_plot = scores_plot.cumsum()
    scores_plot = scores_plot[sorted(scores_plot.columns)]

    # get the number of parameter in each group of relevance
    parameter_count = _predictions[['id', 'relevance']].groupby(['relevance'])['id'].nunique()
    parameter_count['overall'] = parameter_count.sum()
    parameter_count = parameter_count.sort_index()

    scores_plot_ratio = scores_plot / parameter_count
    print(scores_plot_ratio)

    parameter_count = _predictions[['id', 'relevance', 'rank']].groupby(['relevance', 'rank'])['id'].count()
    parameter_count = parameter_count.reset_index().pivot(index='rank', columns='relevance', values='id')
    parameter_count['overall'] = parameter_count['irrelevant'].fillna(0) + parameter_count['relevant'].fillna(0)
    print(parameter_count)

    return scores_plot_ratio, parameter_count

In [None]:
# merge predictions with original data to retrieve grouping categories
def merge_predictions_with_original_data(predictions: pd.DataFrame,
                                         original_data: pd.DataFrame,
                                         grouping_categories: Dict[str, Dict[str, str]] = None,
                                         ) -> pd.DataFrame:
    """
    Get grouping categories for predictions

    :param predictions: pandas DataFrame of probability predictions as output by the AutoMap class, where each row represents a prediction for a concept for a document and its corresponding probability.
    :param original_data: pandas DataFrame of the original data as input to the AutoMap class, where each row represents a document and its corresponding true concept labels.
    :param grouping_categories: dictionary of column name followed by dictionary to map source to target values
    :return: pandas DataFrame with predictions and grouping categories
    """
    _predictions = predictions.sort_values(['id', 'value']).copy()
    _predictions = _predictions.merge(original_data, on='id')
    if grouping_categories:
        for key, values in grouping_categories.items():
            _predictions[f"{key}_groups"] = _predictions[key].map(values)
    return _predictions

# assign correct flag to predictions
def assign_correct_flag(predictions: pd.DataFrame,
                        target_column: str = TARGET,
                        ) -> pd.DataFrame:
    """
    Assign correct flag to predictions

    :param predictions: pandas DataFrame of probability predictions as output by the AutoMap class, where each row represents a prediction for a concept for a document and its corresponding probability.
    :param target_column: column name of the target column
    :return: pandas DataFrame with predictions and correct flag
    """
    _predictions = predictions.copy()
    _predictions['correct'] = _predictions[target_column] == _predictions['label']
    return _predictions

# assign ranks to predictions
def assign_ranks(predictions: pd.DataFrame,
                 target_column: str = 'label',
                 ) -> pd.DataFrame:
    """
    Assign ranks to predictions

    :param predictions: pandas DataFrame of probability predictions as output by the AutoMap class, where each row represents a prediction for a concept for a document and its corresponding probability.
    :param target_column: column name of the target column
    :return: pandas DataFrame with predictions and ranks
    """
    _predictions = predictions.sort_values(['id', 'value'], ascending=[True, False]).copy()
    _predictions['rank'] = _predictions.groupby('id').cumcount() + 1
    return _predictions


def assign_relevance(predictions: pd.DataFrame,
                     target_column: str = TARGET,
                     ) -> pd.DataFrame:
    """
    Assign relevance to predictions

    :param predictions: pandas DataFrame of probability predictions as output by the AutoMap class, where each row represents a prediction for a concept for a document and its corresponding probability.
    :param target_column: column name of the target column
    :return: pandas DataFrame with predictions and relevance
    """
    _predictions = predictions.copy()
    _predictions['relevance'] = (_predictions[target_column] == 'unmapped').map({True: 'irrelevant', False: 'relevant'})
    return _predictions

def get_processed_data(predictions,
                       original_data,
                       grouping_categories,
                       ):

    _predictions = merge_predictions_with_original_data(predictions=predictions,
                                                        original_data=original_data,
                                                        grouping_categories=grouping_categories)
    _predictions = assign_correct_flag(predictions=_predictions)
    _predictions = assign_ranks(predictions=_predictions)
    _predictions = assign_relevance(predictions=_predictions)
    return _predictions


def calculate_scores_for_groups(
        data: pd.DataFrame,
        label_true: str = TARGET,
        label_pred: str = 'label',
        ) -> pd.DataFrame:
    """
    Calculate scores for the data passed in.
    :param data: pandas DataFrame with at least columns for true labels and predicted labels
    :param label_true: string with the name of the column containing the true labels
    :param label_pred: string with the name of the column containing the predicted labels
    :return: pandas DataFrame with scores for each true label
    """

    result = pd.DataFrame(sklearn.metrics.precision_recall_fscore_support(
        y_true=data[label_true],
        y_pred=data[label_pred],
        labels=data[label_true].unique(),
        average=None, #average='weighted',
        beta=1,
        zero_division=0,
    )).transpose().set_index(data[label_true].unique())
    result.columns = ['precision', 'recall', 'f1', 'support']
    result.index.name = label_true
    result = result.reset_index()

    result_accuracy = data.groupby([label_true]).apply(lambda x: sklearn.metrics.accuracy_score(
        y_true=x[label_true],
        y_pred=x[label_pred],
        normalize=True,
        sample_weight=None,
    )).to_dict()
    result['accuracy'] = result[label_true].map(result_accuracy)

    result_num_records = data.groupby([label_true]).apply(lambda x: np.sum(x['num_records'])).to_dict()
    result['num_records'] = result[label_true].map(result_num_records)

    return result

def weighted_average(x: pd.DataFrame,
                     score_types: List[str] = None,
                     ) -> Dict[str, float]:
    """
    Calculate the weighted average for each score type present in the dataframe columns. Expects 'support' to contain the counts used as weights.
    :param x: pandas DataFrame
    :param score_types: list of strings with score types to calculate weighted average for
    :return: dictionary with weighted average for each score type
    """
    score_types = ['accuracy', 'precision', 'recall', 'f1'] if score_types is None else score_types
    sum_cols = ['num_records', 'support', 'group_count']
    return_dict = {score_type: np.average(x[score_type], weights=x['support']) for score_type in score_types if score_type in x}
    for col in sum_cols:
        if col in x:
            return_dict[col] = int(np.sum(x[col]))
        elif col == 'group_count':
            return_dict[col] = len(x)
    return return_dict

def calculate_scores(predictions: pd.DataFrame,
                     original_data: pd.DataFrame,
                     grouping_categories: dict = None,
                     rank: int = 1,
                     ) -> Dict[str, pd.DataFrame]:
    """
    Retrieves various grouped scores for the publication.
    :param predictions: pandas DataFrame of the predictions with at least the columns 'id', 'label', 'value'
    :param original_data: pandas DataFrame of the original data being predicted on. Must contain the columns 'id', TARGET concept label, ehr_name and hospital_name.
    :param grouping_categories: dictionary of column names and a corresponding dictionary to map values to. Groups will be written to {key}_groups column.
    :param rank: integer of the rank to calculate scores for, default is 1 for the first prediction
    :return: dictionary of various grouping structures and the respective table for accuracy/precision/recall/f1-scores and support
    """
    proc = get_processed_data(predictions=predictions,
                              original_data=original_data,
                              grouping_categories=grouping_categories)
    # Table with scores for each concept label --> allows for grouping over data categories and relevance
    result = calculate_scores_for_groups(data=proc.loc[proc['rank'] == rank], label_true=TARGET, label_pred='label')
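    # NOTE: relies on the module-level `concept_category_groups` dictionary (see 'Get concept groupings')
    result['concept_label_group'] = result['concept_label'].map(concept_category_groups)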
    result['relevance'] = (result['concept_label'] == 'unmapped').map({True: 'irrelevant', False: 'relevant'})
    result = result.loc[result['support'] > 0].copy() # remove concepts that were not available in the test set as they cannot be evaluated

    # Table with scores for each concept label group
    result_per_concept_label_group = result.groupby(['concept_label_group']).apply(lambda x: weighted_average(x)
            ).apply(pd.Series).sort_values('support', ascending=False)

    # Table with scores for each relevance group
    result_per_relevance_group = result.groupby(['relevance']).apply(lambda x: weighted_average(x)).apply(pd.Series).sort_values('support', ascending=False)
    result_prg_overall = weighted_average(result_per_relevance_group.reset_index())
    result_prg_overall = pd.DataFrame(result_prg_overall, index=['zoverall'])
    result_per_relevance_group = pd.concat([result_prg_overall, result_per_relevance_group]).sort_index(ascending=False)

    # Table with scores for each EHR system and Relevance groups --> specifically for table in manuscript
    result_ehr = proc.loc[proc['rank'] == rank].groupby(['ehr_name']).apply(lambda x: calculate_scores_for_groups(data=x, label_true=TARGET, label_pred='label'))
    result_ehr['relevance'] = (result_ehr['concept_label'] == 'unmapped').map({True: 'irrelevant', False: 'relevant'})
    result_ehr_prg = result_ehr.reset_index().groupby(['ehr_name', 'relevance']).apply(lambda x: weighted_average(x)).apply(pd.Series).sort_values('support', ascending=False)
    # combine relevant and irrelevant into a weighted average overall score
    result_ehr_overall = result_ehr_prg.reset_index().groupby(['ehr_name']).apply(
        lambda x: weighted_average(x)
            ).apply(pd.Series).sort_values('support', ascending=False)
    result_ehr_overall = result_ehr_overall.reset_index()
    result_ehr_overall['relevance'] = 'zoverall'
    result_ehr_overall.set_index(['ehr_name', 'relevance'], inplace=True)
    result_ehr_prg = pd.concat([result_ehr_prg, result_ehr_overall]).sort_values(['ehr_name', 'relevance'], ascending=[True, False])

    # Table with scores for each EHR, Hospital Name groups, and relevance groups --> specifically for table in manuscript supplementary file
    result_ehr_hosp = proc.loc[proc['rank'] == rank].groupby(['ehr_name', 'hospital_name']).apply(lambda x: calculate_scores_for_groups(data=x, label_true=TARGET, label_pred='label'))
    result_ehr_hosp['relevance'] = (result_ehr_hosp['concept_label'] == 'unmapped').map({True: 'irrelevant', False: 'relevant'})
    result_ehr_hosp_prg = result_ehr_hosp.reset_index().groupby(['ehr_name', 'hospital_name', 'relevance']).apply(lambda x: weighted_average(x)).apply(pd.Series).sort_values('support', ascending=False)
    # combine relevant and irrelevant into a weighted average overall score
    result_ehr_hosp_overall = result_ehr_hosp_prg.reset_index().groupby(['ehr_name', 'hospital_name']).apply(
        lambda x: weighted_average(x)
            ).apply(pd.Series).sort_values('support', ascending=False)
    result_ehr_hosp_overall = result_ehr_hosp_overall.reset_index()
    result_ehr_hosp_overall['relevance'] = 'zoverall'
    result_ehr_hosp_overall.set_index(['ehr_name', 'hospital_name', 'relevance'], inplace=True)
    result_ehr_hosp_prg = pd.concat([result_ehr_hosp_prg, result_ehr_hosp_overall]).sort_values(['ehr_name', 'hospital_name', 'relevance'], ascending=[True, True, False])

    return {'label': result,
            'label_group': result_per_concept_label_group,
            'relevance': result_per_relevance_group,
            'ehr_relevance': result_ehr_prg,
            'ehr_hosp_relevance': result_ehr_hosp_prg,
            }



In [None]:
def get_plot_data(results: Dict[int, Dict[str, pd.DataFrame]],
                  dataset: str = 'relevance',
                  score_type='recall',
                  ) -> Tuple[pd.DataFrame, pd.DataFrame]:

    y_values = pd.concat([results[i][dataset][score_type] for i in range(1, 11)], axis=1)
    y_values.columns = list(range(1,11))

    s_values = pd.concat([results[i][dataset]['support'] for i in range(1, 11)], axis=1)
    s_values.columns = list(range(1,11))
    s_values = s_values.astype(int)
    return y_values.transpose(), s_values.transpose()

def plot_results(results: Dict[int, Dict[str, pd.DataFrame]],
                 dataset: str = 'relevance',
                 score_type: str = 'recall',
                 N: int = 10,
                 cumulative: bool = True,
                 plot_order=None,
                 color_palette=None,
                 save_loc=None,
                 ):
    """
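    Plots the results of the evaluation.
    :param results: dictionary keyed by rank, where each value is a dictionary of score tables such as returned by calculate_scores
    :param dataset: key of the score table to plot, e.g. 'relevance'
    :param score_type: name of the score column to plot, e.g. 'recall'
    :param N: highest rank shown on the x-axis
    :param cumulative: whether to sum the scores cumulatively over the ranks
    :param plot_order: order of the columns to plot, defaults to reverse alphabetical order
    :param color_palette: list of colors, one per plotted column; defaults to black for all lines
    :param save_loc: path prefix used for the CSV and figure files
    :return: None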
    """
    score_data, count_data = get_plot_data(results=results, dataset=dataset, score_type=score_type)

    if cumulative:
        score_data = score_data.cumsum()
    if save_loc is not None:
        score_data.to_csv(f'{save_loc}score_data__{dataset}__{score_type}.csv')
        count_data.to_csv(f'{save_loc}count_data__{dataset}__{score_type}.csv')
    plot_order = sorted(score_data.columns, reverse=True) if plot_order is None else plot_order
    plot_order_rename_dict = {x: x.replace('zoverall', 'overall').capitalize() for x in plot_order}
    score_data.rename(columns=plot_order_rename_dict, inplace=True)
    count_data.rename(columns=plot_order_rename_dict, inplace=True)
    plot_order = plot_order_rename_dict.values()
    c_palette = ['black'] * len(plot_order) if color_palette is None else color_palette[0:len(plot_order)] #['black'] * count_values.shape[1]
    fig, (ax1, ax2) = plt.subplots(2, 1,
                                   sharex=True,
                                   figsize=(6,6),
                                   gridspec_kw={'height_ratios': [6,2],
                                                },
                                   )

    # Plot scores
    sns.lineplot(data=score_data[plot_order],
                 palette=c_palette,
                 legend=True,
                 ax=ax1,
                 )
    ax1.set_xlabel('Rank of predicted labels')
    ax1.set_ylabel(f"{score_type}".capitalize(), labelpad=25)
    ax1.set_ylim(0, 1.01)
    ax1.set_xlim(1, N)
    ax1.legend(
        loc='lower right',
        bbox_to_anchor=(1.0, 0.0),
        ncol=1,
    )

    # Plot parameter counts
    sns.lineplot(data=count_data[plot_order],
                 palette=c_palette,
                 legend=False,
                 ax=ax2,
                 )
    ax2.set_ylim(0, count_data.max().max()*1.05)
    ax2.set_xlim(1, N)
    ax2.set_xticks(list(range(1, N+1)))
    ax2.set_xticklabels(list(range(1, N+1)))
    ax2.set_xlabel('Number of predicted labels')
    ax2.set_ylabel('Parameter\ncount')

    plt.tight_layout()
    if save_loc is not None:
        plt.savefig(f"{save_loc}plot__{dataset}__{score_type}.png", dpi=1200)
        plt.savefig(f"{save_loc}plot__{dataset}__{score_type}.pdf", dpi=1200)
    plt.show()
    return

## Get concept groupings

Concept groups are dictionaries which translate individual concept labels to their respective groups to enable reporting on performance stratified over major categories.

In [None]:
def create_concept_grouping(data, source, target) -> dict:
    assert data[source].duplicated().sum() == 0, 'source column is not unique'
    return data.set_index(source)[target].to_dict()

concepts = pd.read_csv('../data/input/concepts.csv')
concept_category_groups = create_concept_grouping(concepts, 'concept_label', 'category')
concept_label_super_groups = create_concept_grouping(concepts, 'concept_label', 'concept_label_super')

## Get reference groups

Reference groups are concept labels which use the training hospitals' mapped parameters to retrieve an average value for their underlying data content. Reference groups contain averages for p25, p50 and p75, weighted by the number of records per parameter, as well as normalized values for their distribution, such as p50_over_iqr and iqr_over_p50.
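
The normalized values can be illustrated as below. The exact definitions are not given in this notebook section, so this sketch assumes that `p50_over_iqr` is the median divided by the interquartile range (p75 minus p25) and that `iqr_over_p50` is its inverse. The toy values are made up.

```python
import numpy as np
import pandas as pd

# Toy pooled percentiles per concept (values are made up).
reference = pd.DataFrame(
    {"p25": [60.0, 400.0], "p50": [80.0, 450.0], "p75": [100.0, 500.0]},
    index=["heart_rate", "tidal_volume"],
)

# Assumed definitions: IQR = p75 - p25; zero denominators become NaN instead of inf.
iqr = reference["p75"] - reference["p25"]
reference["p50_over_iqr"] = reference["p50"] / iqr.replace(0, np.nan)
reference["iqr_over_p50"] = iqr / reference["p50"].replace(0, np.nan)
print(reference)
```

Both ratios are unit-free, which makes them comparable between parameters that record the same concept in different units.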

In [None]:
df_train = pd.read_csv('../data/input/static/train_set.csv')
print(df_train.shape)
df_test = pd.read_csv('../data/input/static/test_set.csv')
print(df_test.shape)

In [None]:
print('Train:', df_train['hospital_name'].nunique(), df_train['hospital_name'].unique())

print('Test:', df_test['hospital_name'].nunique(), df_test['hospital_name'].unique())


In [None]:
df_train['id'] = df_train.reset_index(drop=True).index
df_test['id'] = df_test.reset_index(drop=True).index

In [None]:
def create_reference_groups(data: pd.DataFrame,
                            group_by: Union[List[str], str],
                            filter_by: str,
                            filter_on: List[str],
                            values: List[str],
                            weights: str,
                            ) -> Dict[str, Dict[str, float]]:
    """
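    Create reference groups for a given set of parameters.
    :param data: pandas DataFrame with one row per mapped parameter, containing the grouping, filter, value and weight columns
    :param group_by: column name(s) to group on, e.g. the concept label
    :param filter_by: column name used to select rows, e.g. the hospital name
    :param filter_on: values of `filter_by` to keep, e.g. the training hospitals
    :param values: distribution columns to aggregate; 'amin' and 'amax' take the group minimum and maximum, all others a weighted average
    :param weights: column name holding the weights, e.g. the number of records
    :return: dictionary with keys corresponding to concept labels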
    """
    var = data.loc[data[filter_by].isin(filter_on)].groupby(group_by, group_keys=False)
    d = dict()
    for value in values:
        if value == 'amin':
            d[value] = var[value].min().to_dict()
        elif value == 'amax':
            d[value] = var[value].max().to_dict()
        else:
            d[value] = var.apply(lambda x: calculate_average_weight(x=x, value=value, weights=weights)).to_dict()
    if 'p50' in values and 'p25' in values and 'p75' in values:
        d['p75_max'] = var['p75'].max().to_dict()
        d['p25_min'] = var['p25'].min().to_dict()
    d[weights] = var.apply(lambda x: x[weights].sum()).to_dict()
    return pd.DataFrame(d).transpose().to_dict()

def calculate_average_weight(x: pd.DataFrame,
                             value: str,
                             weights: str) -> float:
    return np.average(x[value], weights=x[weights])

### Get base predictions

In [None]:
""""
AutoMap class
"""

# import
import os
import joblib
import numpy as np  # used in predict_proba_transformed
import pandas as pd
from datetime import datetime
from sklearn.preprocessing import LabelEncoder

from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer


class AutoMap:
    """
    :params:
    train_files: list of files to load for training the prediction pipeline on
    predict_files: list of files to read and predict the parameters for
    output_files: list of files to save the predictions to, where length equals length of predict_files
    pipe_file: Previously created compatible pickled tuple of fitted (Pipeline, LabelEncoder) objects


    :args:



    """

    def __init__(self):
        nltk.download('punkt')
        nltk.download('wordnet')
        nltk.download('omw-1.4')  # hidden in Pipe creation

        # column names to use in training/predicting
        self.source = 'parameter_name'
        self.target = 'pacmed_subname'
        self.pred = 'predicted_subname'

        # if no label is given, impute with index[0] (unmapped)
        self.unlabeled = ['unmapped', 'microbiology']  # used for validation to filter out unlabeled

        # initial r'\w+' but 5% performance gain when underscores are omitted
        self.preprocess_text_regex_expression = r'[a-zA-Z0-9]+'
        return

    def preprocess_text(self, text):
        # Tokenise words while ignoring punctuation
        tokeniser = RegexpTokenizer(self.preprocess_text_regex_expression)
        tokens = tokeniser.tokenize(text)

        # Lowercase and lemmatise
        lemmatiser = WordNetLemmatizer()
        lemmas = [lemmatiser.lemmatize(token.lower(), pos='v') for token in tokens]

        # Remove stop words
        # keywords= [lemma for lemma in lemmas if lemma not in stopwords.words('english')]
        # return keywords
        return lemmas

    def create_pipe(self,
                    X: pd.Series,
                    y: pd.Series,
                    estimator=SGDClassifier(random_state=123),
                    grid: dict = None,
                    cv: int = 10,
                    n_jobs: int = None,
                    save: bool = False,
                    prefix=None):
        """
        Create the pipe object used to train and test text data
        """

        if y.isna().sum() > 0:
            y = y.fillna(self.unlabeled[0])

        # ensure labels are encoded
        self.le = LabelEncoder()
        self.le.fit(y=y.unique())

        # Create an instance of TfidfVectorizer
        vectoriser = TfidfVectorizer(analyzer=self.preprocess_text)

        # Fit to the data and transform to feature matrix
        X_train_tfidf = vectoriser.fit_transform(X)

        # try an initial accuracy before hyperparameter optimization
        clf = estimator
        # clf = SGDClassifier(random_state=123)
        # clf_scores = cross_val_score(clf, X_train_tfidf, self.y_train, cv=10)
        # print(clf_scores)
        # print("SGDClassfier Accuracy: %0.2f (+/- %0.2f)" % (clf_scores.mean(), clf_scores.std() * 2))

        if grid is None:
            grid = {'fit_intercept': [True, False],
                    'early_stopping': [True, False],
                    'loss': ['log', 'modified_huber', 'perceptron', 'huber', 'squared_loss', 'epsilon_insensitive',
                             'squared_epsilon_insensitive'],
                    # ['hinge', 'log', 'squared_hinge'], #PM squared_loss --> squared_error in v1.2
                    'penalty': ['l2', 'l1', 'none']}

            # Reduce to optimal grid for rerunning code
            grid = {'fit_intercept': [True],
                    'early_stopping': [False],
                    'loss': ['modified_huber'],
                    'penalty': ['l2']}

        # retry the SGDClassifier training with param_grid
        search = GridSearchCV(estimator=clf, param_grid=grid, cv=cv, n_jobs=n_jobs)
        search.fit(X_train_tfidf, y)

        # grid_sgd_clf_scores = cross_val_score(search.best_estimator_, X_train_tfidf, self.y_train, cv=5)
        # print(grid_sgd_clf_scores)
        # print("SGDClassifier optimal grid Accuracy: %0.2f (+/- %0.2f)" % (
        # grid_sgd_clf_scores.mean(), grid_sgd_clf_scores.std() * 2))

        # create Pipeline with vectoriser and optimal classifier
        self.pipe = Pipeline([('vectoriser', vectoriser),
                              ('classifier', search)])  # clf

        # fit the pipeline to the full training data
        self.pipe.fit(X, self.le.transform(y.values))

        # save pipe to file to prevent rerunning the same pipelines
        if prefix is None:
            prefix = ''
        if save:
            f_name = f'./data/pipes/{prefix}__{datetime.now().strftime("%Y%m%d%H%M%S")}.pipe'
            # make sure the target directory exists before writing
            os.makedirs(os.path.dirname(f_name), exist_ok=True)
            joblib.dump((self.pipe, self.le),
                        f_name,
                        compress=('gzip', 3),
                        protocol=5)
            print(f"Pipeline saved to: {f_name}")

        return self.pipe

    def save_pipe(self, f_name):
        joblib.dump((self.pipe, self.le), f_name)
        print(f"Pipeline saved to: {f_name}")

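    def load_pipe(self, f_name):
        # load a previously saved (pipeline, label encoder) tuple if the file
        # exists; otherwise fall back to an unfitted state
        if os.path.isfile(f_name):
            self.pipe, self.le = joblib.load(f_name)
            print(f"Pipeline loaded from: {f_name}")
        else:
            self.pipe = None
            self.le = LabelEncoder()
            print(f"No pipeline found at: {f_name}")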

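    def predict_proba_transformed(self, X, **predict_proba_params):
        """Return the predicted class probabilities in long format.

        X is either a Series of parameter names, or a DataFrame whose second
        column holds the parameter names (e.g. ['id', SOURCE]). Zero
        probabilities are replaced by NaN so they can be dropped when stacking.
        """
        if isinstance(X, pd.Series):
            probs = self.pipe.predict_proba(X, **predict_proba_params)
            id_vars = [X.name]
            X = pd.DataFrame(X)
        else:
            # the text to classify is expected in the second column
            probs = self.pipe.predict_proba(X[X.columns[1]], **predict_proba_params)
            id_vars = list(X.columns)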

        c = pd.concat(
            [X.reset_index(drop=True),
             pd.DataFrame(probs, columns=self.le.classes_),
             ],
            axis=1)
        c.loc[:, self.le.classes_] = c.loc[:, self.le.classes_].replace(0, np.nan)
        return (c
            .set_index(id_vars)
            .stack()
            .reset_index()
            .rename(columns={
                "level_1": "label",
                "level_2": "label",
                0: "value",
            }
            )
        )
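
The long format returned by `predict_proba_transformed` can be illustrated without pandas. The sketch below is a plain-Python stand-in, not the method itself: the helper `to_long_format`, the ids, and the label names are made up for illustration. It mimics two steps: dropping zero probabilities and emitting one row per (id, label) pair. It also sorts the rows by descending probability within each id, which is how the output is ordered before scoring.

```python
def to_long_format(ids, probs, classes):
    """Reshape a wide probability table into (id, label, value) rows.

    Hypothetical helper: mimics the zero-dropping and stacking steps of
    predict_proba_transformed using plain Python lists.
    """
    rows = []
    for row_id, row_probs in zip(ids, probs):
        for label, p in zip(classes, row_probs):
            if p == 0:  # zero probabilities are left out of the long table
                continue
            rows.append({"id": row_id, "label": label, "value": p})
    # sort by id ascending, then by probability descending
    rows.sort(key=lambda r: (r["id"], -r["value"]))
    return rows


# illustrative data, not taken from the real data set
classes = ["heart_rate", "resp_rate", "tidal_volume"]
ids = [101, 102]
probs = [[0.7, 0.0, 0.3],
         [0.1, 0.6, 0.3]]

long_rows = to_long_format(ids, probs, classes)
for r in long_rows:
    print(r)
```

Keeping only non-zero probabilities keeps the long table compact when the number of classes is large, since most classes receive no probability mass for a given parameter name.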

In [None]:
os.makedirs('./output/static/overlapping', exist_ok=True)

In [None]:
am = AutoMap()

train_data = df_train.copy()
test_data = df_test.copy()

am.create_pipe(X=train_data[SOURCE],
               y=train_data[TARGET],
               cv=2, n_jobs=-1,
               )

#### Base performance

In [None]:
predictions = am.pipe.predict_proba(test_data[SOURCE])
predictions

In [None]:
am.le.classes_
y_true = am.le.transform(test_data[TARGET])
y_true

In [None]:
from sklearn.metrics import top_k_accuracy_score

# top-1 accuracy: the true label must be the most probable class
top_k_accuracy_score(y_true=y_true,
                     y_score=predictions,
                     k=1,
                     labels=range(0, len(am.le.classes_)),
                     )

In [None]:
# top-5 accuracy: the true label must be among the five most probable classes
top_k_accuracy_score(y_true=y_true,
                     y_score=predictions,
                     k=5,
                     labels=range(0, len(am.le.classes_)),
                     )
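
Top-k accuracy counts a prediction as correct when the true class is among the k classes with the highest predicted probability. The following is a minimal plain-Python sketch of that idea, not scikit-learn's implementation. The function `top_k_accuracy` and the toy scores are hypothetical, and ties are broken by class index here, which may differ from how scikit-learn handles them.

```python
def top_k_accuracy(y_true, y_score, k):
    """Fraction of samples whose true class index is among the k highest scores."""
    hits = 0
    for true_idx, scores in zip(y_true, y_score):
        # class indices ordered by descending score (ties: lower index first)
        ranked = sorted(range(len(scores)), key=lambda c: (-scores[c], c))
        if true_idx in ranked[:k]:
            hits += 1
    return hits / len(y_true)


# toy example with three classes; values are illustrative only
y_true = [0, 1, 2, 2]
y_score = [[0.6, 0.3, 0.1],   # true class 0 is ranked first
           [0.5, 0.4, 0.1],   # true class 1 is ranked second
           [0.2, 0.5, 0.3],   # true class 2 is ranked second
           [0.7, 0.2, 0.1]]   # true class 2 is ranked third

print(top_k_accuracy(y_true, y_score, k=1))
print(top_k_accuracy(y_true, y_score, k=2))
```

With k equal to the number of classes the score is always 1, so the useful range is small values of k relative to the number of concept labels.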

In [None]:
predicted_labels = (am.predict_proba_transformed(test_data[['id', SOURCE]])
                    .sort_values(['id', 'value'], ascending=[True, False]))
predicted_labels[TARGET + '_original'] = predicted_labels.id.map(test_data.set_index('id')[TARGET].to_dict())

save_loc = './output/static/overlapping/'

predicted_labels.to_csv(f'{save_loc}predicted_labels.csv')
test_data.to_csv(f'{save_loc}test_data.csv')
train_data.to_csv(f'{save_loc}train_data.csv')
joblib.dump(am, f'{save_loc}am.joblib')

# score the predictions at ranks 1 to 10 and save every resulting table
results = dict()
score_cols = ['id', 'hospital_name', 'ehr_name', 'concept_label', 'num_records']
for i in range(1, 11):
    results[i] = calculate_scores(predicted_labels, test_data[score_cols],
                                  {TARGET: concept_category_groups}, rank=i)
    for key, value in results[i].items():
        if isinstance(value, pd.DataFrame):
            # store whole-number float columns as integers for cleaner CSV output;
            # columns with NaN or inf are skipped because they cannot be cast to int
            for col in value.columns:
                if value[col].dtype == 'float64':
                    if np.isfinite(value[col]).all() and (value[col] == value[col].astype(int).astype(float)).all():
                        value[col] = value[col].astype(int)
            value.to_csv(f'{save_loc}aprf__rank_{i}__{key}.csv')
            value.round(3).to_csv(f'{save_loc}aprf__rank_{i}__{key}__round3.csv', float_format='%.3f')
            if key == 'relevance' and i == 1:
                print(value.round(3))

plot_results(results, score_type='precision', save_loc=save_loc)
plot_results(results, score_type='recall', save_loc=save_loc)
plot_results(results, score_type='f1', save_loc=save_loc)