## Example/test of `get_train_df_with_fixed_PII_offsets()` and `calc_PII_offsets()`    


NOTES:
- Part of the examples for the `feedback-prize-2021-lib` [library](https://www.kaggle.com/sentinel1/feedback-prize-2021-lib)
- For details of the PII masking issue, see the [discussion](https://www.kaggle.com/c/feedback-prize-2021/discussion/297688)

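To make the idea of a PII offset concrete, here is a minimal sketch on a toy essay. It assumes the mismatch arises because masking changed the length of the text, so a recorded `discourse_start` no longer points at the first character of `discourse_text`. The helper `find_start_offset`, its `window` parameter, and the `PROPER_NAME` placeholder are hypothetical illustrations; `calc_PII_offsets()` in the library may work differently.

```python
def find_start_offset(essay_text, discourse_text, recorded_start, window=20):
    """Return (actual_start - recorded_start) for the nearest match, or None.

    Hypothetical helper: looks for discourse_text within +/- window
    characters of the recorded start position.
    """
    best = None
    lo = max(0, recorded_start - window)
    hi = recorded_start + window
    pos = essay_text.find(discourse_text, lo)
    while pos != -1 and pos <= hi:
        offset = pos - recorded_start
        if best is None or abs(offset) < abs(best):
            best = offset
        pos = essay_text.find(discourse_text, pos + 1)
    return best


# Toy data: the recorded start (25) is 3 characters past the true start (22).
essay = "Dear Mr. PROPER_NAME, phones should be allowed in class."
discourse = "phones should be allowed in class."
recorded_start = 25

offset = find_start_offset(essay, discourse, recorded_start)
print(offset)  # -3
```

Adding the offset to the recorded position gives an index at which slicing the essay reproduces the discourse text exactly.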
In [None]:
from pathlib import Path

if Path.cwd() == Path('/kaggle/working'):
    # Kaggle
    import sys
    LIB_PATH = (Path.cwd()/".."/"input"/"feedback-prize-2021-lib").resolve()
    assert LIB_PATH.is_dir(), ("Use the '+ Add data' feature to add the 'Notebook Output Files' of the 'sentinel1/feedback-prize-2021-lib' notebook "
                               "so that its utilities become importable (a one-time restart is required after adding it).")
    sys.path.insert(0, str(LIB_PATH))
else:
    # Local machine
    assert (Path.cwd()/"lib"/"feedback_util.py").is_file(), ("Run the 'sentinel1/feedback-prize-2021-lib' notebook locally "
                                                             "in order to generate the importable library on your machine")

In [None]:
from lib.feedback_util import color_print_essay, read_train_csv, calc_PII_offsets, get_train_df_with_fixed_PII_offsets

In [None]:
from IPython.display import display

# Set to False to skip the cells marked as optional
RUN_OPTIONALS = True

## Reading original train.csv

In [None]:
train_df = read_train_csv()
train_df.head(2)

## Run the `self_test` of `calc_PII_offsets()` (optional)

In [None]:
%%time

if RUN_OPTIONALS:
    df_with_offsets = calc_PII_offsets(train_df, self_test=True)

## Compute PII offsets using `calc_PII_offsets()` (optional)

In [None]:
%%time

if RUN_OPTIONALS:
    df_with_offsets = calc_PII_offsets(train_df)

## Print two essays BEFORE and AFTER fixing the PII masking offsets    

### Essay  `3A5D35053D40` BEFORE

In [None]:
if RUN_OPTIONALS:
    display(df_with_offsets[df_with_offsets['discourse_start_PII_offset'] > 2].head(1))

In [None]:
print("\n(1) *BEFORE* fixing PII offsets:\n")
color_print_essay('3A5D35053D40', train_df)

### Essay  `1F20005BEBB6` BEFORE

In [None]:
if RUN_OPTIONALS:
    display(df_with_offsets[df_with_offsets['discourse_start_PII_offset'] > 7].head(1))

In [None]:
print("\n(2) *BEFORE* fixing PII offsets:\n")
color_print_essay('1F20005BEBB6', train_df)

## Fix the PII masking offsets using `get_train_df_with_fixed_PII_offsets()`

In [None]:
%%time

if RUN_OPTIONALS:
    # NOTE: `df_with_offsets` has already been computed above, so pass it as the optional argument to save time:
    train_df = get_train_df_with_fixed_PII_offsets(df_with_offsets)
else:
    train_df = get_train_df_with_fixed_PII_offsets()
train_df.head()

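As a rough illustration of what fixing the offsets means, the sketch below applies per-row offsets to a toy DataFrame and checks that the corrected positions slice the essay correctly. It assumes each offset is defined as the actual position minus the recorded position. The `discourse_end_PII_offset` column name and the toy data are assumptions made for this example; this is not the implementation of `get_train_df_with_fixed_PII_offsets()`.

```python
import pandas as pd

toy_essay = "Dear Mr. PROPER_NAME, phones should be allowed in class."

# One toy discourse whose recorded positions are shifted by 3 characters.
toy_df = pd.DataFrame({
    "discourse_text": ["phones should be allowed in class."],
    "discourse_start": [25],
    "discourse_end": [59],
    "discourse_start_PII_offset": [-3],
    "discourse_end_PII_offset": [-3],  # assumed column name
})

# Shift the recorded positions by their offsets.
toy_fixed_df = toy_df.assign(
    discourse_start=toy_df["discourse_start"] + toy_df["discourse_start_PII_offset"],
    discourse_end=toy_df["discourse_end"] + toy_df["discourse_end_PII_offset"],
)

# The corrected span should reproduce the discourse text.
toy_start = int(toy_fixed_df.loc[0, "discourse_start"])
toy_end = int(toy_fixed_df.loc[0, "discourse_end"])
toy_recovered = toy_essay[toy_start:toy_end]
print(toy_recovered == toy_fixed_df.loc[0, "discourse_text"])
```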
### Essay  `3A5D35053D40` AFTER

In [None]:
print("\n(1) *AFTER* fixing PII offsets:\n")
color_print_essay('3A5D35053D40', train_df)

### Essay  `1F20005BEBB6` AFTER

In [None]:
print("\n(2) *AFTER* fixing PII offsets:\n")
color_print_essay('1F20005BEBB6', train_df)

## Some statistics on the PII masking offsets

In [None]:
train_df_orig = read_train_csv()

def print_stats(feature):
    """Print how many rows differ in `feature` between the original and the fixed DataFrame."""
    # Element-wise comparison; assumes both frames share the same index
    num_changed = (train_df_orig[feature] != train_df[feature]).sum()
    num_total = len(train_df_orig)
    percent_changed = num_changed / num_total * 100
    quoted_feature = f'"{feature}"'
    print(f"Out of {num_total} discourses in total, {quoted_feature:<17} was changed in {num_changed:<3} discourse(s), which is {percent_changed:.2f}% of the training data.")

In [None]:
num_essays = train_df_orig['id'].nunique()
num_essays_affected = (
    train_df_orig[['id', 'discourse_text', 'discourse_start', 'discourse_end']]
    .join(train_df[['discourse_text', 'discourse_start', 'discourse_end']], rsuffix='_changed')
    .groupby('id')
    .apply(lambda group:
           any((group['discourse_text'] != group['discourse_text_changed']) |
               (group['discourse_start'] != group['discourse_start_changed']) |
               (group['discourse_end'] != group['discourse_end_changed']))
          )
    .sum()
)
percent_essays_affected = num_essays_affected / num_essays * 100

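The per-essay count above can also be written without `groupby().apply()`. The following is a minimal sketch on toy data, offered as an alternative formulation rather than the notebook's method: it flags each changed row with a vectorized comparison and then reduces the flags per essay `id`. All `toy_` names and values are made up for illustration, and both frames are assumed to share the same index.

```python
import pandas as pd

cols = ["discourse_text", "discourse_start", "discourse_end"]

# Toy "original" frame: essay A has two discourses, B and C have one each.
toy_orig = pd.DataFrame({
    "id": ["A", "A", "B", "C"],
    "discourse_text": ["x", "y", "z", "w"],
    "discourse_start": [0, 5, 0, 3],
    "discourse_end": [4, 9, 7, 8],
})

# Toy "fixed" frame: one row of essay A and the only row of essay C differ.
toy_fixed = toy_orig.copy()
toy_fixed.loc[1, "discourse_start"] = 4
toy_fixed.loc[1, "discourse_end"] = 8
toy_fixed.loc[3, "discourse_text"] = "w!"

# True for every row where at least one of the three columns differs.
toy_row_changed = (toy_orig[cols] != toy_fixed[cols]).any(axis=1)

toy_rows_changed = int(toy_row_changed.sum())
toy_essays_affected = int(toy_row_changed.groupby(toy_orig["id"]).any().sum())
toy_num_essays = toy_orig["id"].nunique()

print(toy_rows_changed, toy_essays_affected, toy_num_essays)  # 2 2 3
```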
In [None]:
for feature in ['discourse_text', 'discourse_start', 'discourse_end']:
    print_stats(feature)

In [None]:
print(f"Out of {num_essays} essays in total, {num_essays_affected} essays are affected, which is {percent_essays_affected:.2f}% of the training data.")