# Email Parsing for Relevant Information Using BERT

 Adapted from https://www.tensorflow.org/tutorials/text/classify_text_with_bert


This notebook contains the complete code to preprocess email data and to train a BERT-based classification network that decides which sentences in an email are irrelevant. It is the final product of the Extract Relevant Data from Raw Email project for UC Berkeley's Data Science Discovery Program, Sp'21.

This notebook covers the following:

- Preprocess the data
- Load a BERT model from TensorFlow Hub
- Build a model by combining BERT with a classifier



### Special Notes

If you want to preprocess and annotate data and train the network, run the following sections and follow their instructions:
1. Utility functions
2. Setup
3. Preprocess
4. Creating and Training Model

If you just want to use a trained model to filter emails, run the following sections and follow their instructions:
1. Utility functions
2. Setup
3. Using Model to Clean Email


*So far, we have only been able to train on the sentences themselves, not on any of the additional features available in the dataset.*

*If you run into a module import error, please check https://www.tensorflow.org/tutorials/text/classify_text_with_bert for updates on which libraries to import.*

## Setup


In [None]:
# A dependency of the preprocessing for BERT inputs
!pip install tf-nightly
!pip install -q tensorflow-text-nightly
!pip install -q tf-models-official

Collecting grpcio<2.0,>=1.37.0
  Using cached https://files.pythonhosted.org/packages/31/d8/1bfe90cc49c166dd2ec1be46fa4830c254ce702004a110830c74ec1df0c0/grpcio-1.37.1-cp37-cp37m-manylinux2014_x86_64.whl
Collecting keras-nightly~=2.6.0.dev
  Using cached https://files.pythonhosted.org/packages/b1/f9/9366cd47fc47f2a2910881b4209864c0b08e7238c7ea568447322a88cefc/keras_nightly-2.6.0.dev2021051800-py2.py3-none-any.whl
ERROR: tensorflow 2.5.0 has requirement grpcio~=1.34.0, but you'll have grpcio 1.37.1 which is incompatible.
ERROR: tensorflow 2.5.0 has requirement keras-nightly~=2.5.0.dev, but you'll have keras-nightly 2.6.0.dev2021051800 which is incompatible.
ERROR: tensorflow-text 2.4.3 has requirement tensorflow<2.5,>=2.4.0, but you'll have tensorflow 2.5.0 which is incompatible.
Installing collected packages: grpcio, keras-nightly
  Found existing installation: grpcio 1.34.1
    Uninstalling grpcio-1.34.1:
      Successfully uninstalled grpcio-1.34.1
  Found e

In [None]:
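import math
import os
import shutil

import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text as text  # registers the ops used by the BERT preprocessing model
from official.nlp import optimization  # to create the AdamW optimizer

from tensorflow.keras import layers
from tensorflow.keras.layers.experimental import preprocessing

import matplotlib.pyplot as plt

import numpy as np
import pandas as pd
from spacy.lang.en import English
from spacy.lang.en.stop_words import STOP_WORDS
import spacy
import re

# Used by the feature-extraction cell in the Preprocess section.
# The sentiment analyzer needs nltk.download('vader_lexicon') once per environment.
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.tokenize.treebank import TreebankWordDetokenizer

nlp = spacy.load('en_core_web_sm')
# For better accuracy on a local machine, load the transformer pipeline instead:
# nlp = spacy.load('en_core_web_trf')

tf.get_logger().setLevel('ERROR')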

TensorFlow Addons offers no support for the nightly versions of TensorFlow. Some things might work, some other might not. 
If you encounter a bug, do not file an issue on GitHub.


In [None]:
# Run this cell only if you are on Google Colab
from google.colab import drive
drive.mount('/gdrive')

Mounted at /gdrive


In [None]:
# cd to the directory where this notebook is located in your drive
%cd /gdrive/My\ Drive/Colab\ Notebooks/DSDP

/gdrive/My Drive/Colab Notebooks/DSDP


## Utility functions

In [None]:
def print_file(fname):
    with open(fname) as file:
        body = file.read()
    print(body)

def read_email(fname):
    with open(fname, 'r') as email:
        text = email.read()
    return text

def create_dir(dir):
    if not os.path.exists(dir):
        os.makedirs(dir)
        print("Created Directory : ", dir)
    else:
        print("Directory already existed : ", dir)
    return dir

In [None]:
# Rough filter: drop whole paragraphs that contain To:, Subject:, etc.
def rough_check(paragraph):
    # Strings that denote a header/footer paragraph
    flags = ['To: ', 'Subject: ', 'cc: ', 'Sent by: ']
    for f in flags:
        if f in paragraph:
            return False
    return True

In [None]:
# filter things within lines (fine filter)
def fine_check(line):
    flags = ['From: ', '-----', 'Dear', 'Sincerely, ', 'Best Regards, ', 'Re: ', '---', '___', '> > ','>> ']
    cleaned_line = line.strip()

    if cleaned_line.isspace() or not cleaned_line:
        return False

    for f in flags:
        if f in cleaned_line:
            return False
    return True

In [None]:
# Clean a document, split it into sentences and, when annotation=True,
# write one copy to each of the two annotation folders
def clean_summarize(file_name, fname, annotation=True):
    main_text = read_email(file_name).strip()
    paragraphs = main_text.split(sep='\n\n')[1:]
    content = list(filter(rough_check, paragraphs))

    cleaned_content = []
    for p in content:
        lines = p.splitlines()
        cleaned_content.append('\n'.join(list(filter(fine_check, lines))))

    output_file = fname + '_out'
    output_text = ''.join(cleaned_content)
    output_text = ''.join(output_text.splitlines())
    doc = nlp(output_text)
    output_text = ''
    for sent in list(doc.sents):
        output_text += str(sent) + '\n'

    if annotation:
        with open(output_path_cleaned + output_file + '.txt', "w") as new_file:
            new_file.write(output_text)

        with open(output_path_manual + output_file + '.txt', "w") as new_file:
            new_file.write(output_text)
    return output_text

## Preprocess

### What it does

*Filtering*
- A first pass over each email removes obviously unwanted information, such as the header, the footer, and lines containing words such as "forwarded".
- Two copies of the cleaned file are written to two different directories for annotation: one is kept as the original for comparison, the other is the copy to be altered.
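
The two filtering passes can be sketched on a toy email. This is a minimal, self-contained illustration rather than the notebook's actual pipeline: the flag lists are shortened copies of those in `rough_check` and `fine_check`, and `filter_email` and `toy_email` are made-up names for this example. It also skips the spaCy sentence splitting that `clean_summarize` performs.

```python
# Toy sketch of the two-stage filter (shortened flag lists; names are illustrative).
ROUGH_FLAGS = ['To: ', 'Subject: ', 'cc: ', 'Sent by: ']
FINE_FLAGS = ['From: ', '-----', 'Dear', 'Sincerely, ', '>> ']

def filter_email(raw_text):
    """Return the lines that survive the paragraph-level and line-level filters."""
    # The first paragraph is treated as the header block and dropped,
    # mirroring the [1:] slice in clean_summarize.
    paragraphs = raw_text.strip().split('\n\n')[1:]
    kept = []
    for paragraph in paragraphs:
        # Rough filter: drop the whole paragraph if it contains a header flag.
        if any(flag in paragraph for flag in ROUGH_FLAGS):
            continue
        for line in paragraph.splitlines():
            line = line.strip()
            # Fine filter: drop empty lines and lines containing a flag.
            if line and not any(flag in line for flag in FINE_FLAGS):
                kept.append(line)
    return kept

toy_email = (
    "Message-ID: <123>\nDate: Mon, 1 Jan 2001\n\n"
    "To: bob@example.com\nSubject: lunch\n\n"
    "Dear Bob,\nCan we move lunch to noon?\n>> old quoted text\n\n"
    "The report is attached."
)
print(filter_email(toy_email))
```

Only the two content lines should survive here: the header paragraph, the `To:`/`Subject:` paragraph, the greeting, and the quoted reply line are all filtered out.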

*Annotating*
- To annotate, go into the directory that holds the copies to be altered and delete the lines you think are useless.
- The script compares the files in the two directories. A sentence that appears in the original file but not in the altered file gets a label of 0 (the line is useless); a sentence that appears in both gets a label of 1.
- The script then computes features for each sentence (position of the sentence within the email, length of the sentence, etc.).
- The labels for each file are stored in a CSV, and the per-file CSVs are concatenated into one large CSV that serves as the input to the neural network.
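
The labeling rule and a few of the per-sentence features can be sketched without pandas. This is a simplified illustration under stated assumptions: `label_and_featurize` is a hypothetical helper, the punctuation set here leaves out the space character, and the sentiment feature is omitted.

```python
# Sketch of the labeling rule and simple per-sentence features (illustrative names).
PUNCTUATION = set('!()-[]{};:\'"\\,<>./?@#$%^&*_~')
DIGITS = set('0123456789')

def label_and_featurize(original_lines, edited_lines):
    """Label each original line 1 if the annotator kept it, else 0, and add features."""
    kept = set(edited_lines)
    rows = []
    for position, line in enumerate(original_lines):
        rows.append({
            'Sentence': line,
            'Label': 1 if line in kept else 0,
            'pos__sentence': position,
            'len_sentence': len(line),
            'num_punc': sum(ch in PUNCTUATION for ch in line),
            'num_digits': sum(ch in DIGITS for ch in line),
        })
    return rows

# The annotator deleted the middle line from the altered copy.
original = ['Call me at 555-0100.', 'Forwarded by admin', 'See you Monday!']
edited = ['Call me at 555-0100.', 'See you Monday!']
rows = label_and_featurize(original, edited)
for row in rows:
    print(row)
```

Because the comparison is an exact string match per line, editing part of a line instead of deleting it would also produce a label of 0 for that line.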

### Instructions
1. Define where the inputs and outputs will reside:
    - ```outputs_path```: where the outputs reside
    - ```inputs_path```: where the raw input email files to be cleaned reside
2. Run the code below until "Instructions to Annotate"
3. Follow instructions under "Instructions to Annotate"

### Code to Generate Data to Label

In [None]:
# Path for the directory of data to be processed
inputs_path = './Inputs/arora-h/'
outputs_path = './outputs'


# Path for cleaned outputs
output_path_cleaned = outputs_path + '/original/'
output_path_manual = outputs_path + '/alter/'

# TODO: Generate necessary file structure
create_dir(output_path_cleaned)
create_dir(output_path_manual)
create_dir(outputs_path + '/labels')
create_dir(output_path_manual + 'done')

Directory already existed :  ./outputs/original/
Directory already existed :  ./outputs/alter/
Directory already existed :  ./outputs/labels
Directory already existed :  ./outputs/alter/done


'./outputs/alter/done'

In [None]:
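# Clean every file in the inputs directory.
# os.listdir returns names in arbitrary order, so slicing with [1:] would drop
# an arbitrary file; skip hidden files (e.g. .DS_Store) explicitly instead.
for fname in os.listdir(inputs_path):
    if fname.startswith('.'):
        continue
    clean_summarize(inputs_path + fname, fname)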

### Instructions to Annotate
1. Go to your ```<outputs_path>/alter``` directory.
2. Open any file and delete EVERY WHOLE LINE that you think does not contain necessary information for the email. Remember to delete the whole line, not just the part of the line you think is trivial.
3. Move all the files that you have changed into the ```<outputs_path>/alter/done``` directory.
4. Run the cell labeled ***Run this to get labels and features!*** below. Your labeled data will be in CSV files in the ```<outputs_path>/labels``` directory.
5. After every annotation session, or when you are done annotating and want to train the network on your labeled data, run the cell labeled ***Run Code cell below once all emails are annotated*** to concatenate all your labels into one CSV to be fed into the network. The CSV is written to your ```outputs_path``` directory as ```all_labeled_data.csv```.
6. Proceed to create and train your network.

**Run this to get labels and features!**

In [None]:
# Characters counted as punctuation (same set as before; note it includes the space)
punctuation = set(r'''!()-[]{};:'"\, <>./?@#$%^&*_~''')
digits = set('1234567890')
sia = SentimentIntensityAnalyzer()

done_dir = output_path_manual + 'done/'
for fname in os.listdir(done_dir):
    if fname.startswith('.'):
        continue  # skip hidden files such as .DS_Store
    original = read_email(output_path_cleaned + fname).splitlines()
    edited = set(read_email(done_dir + fname).splitlines())

    # Label 1 if the annotator kept the line, 0 if it was deleted
    labeled = [[ln, 1 if ln in edited else 0] for ln in original]
    data = pd.DataFrame(labeled, columns=['Sentence', 'Label'])

    # Position of each sentence within the email
    data['pos__sentence'] = list(range(len(original)))
    # Length of each sentence in characters
    data['len_sentence'] = [len(ln) for ln in original]
    # Number of punctuation characters and digits
    data['num_punc'] = [sum(ch in punctuation for ch in ln) for ln in original]
    data['num_digits'] = [sum(ch in digits for ch in ln) for ln in original]
    # Sentiment scores. The sentences are plain strings, so they are passed to
    # VADER directly; detokenizing a string would split it into characters.
    data['polarity_score'] = [sia.polarity_scores(ln) for ln in original]

    data.to_csv(outputs_path + '/labels/' + fname[:-4] + '.csv', index=False)

**Run Code cell below once all emails are annotated**

In [None]:
# Concatenate every per-file label CSV into one dataset.
# pd.concat is used because DataFrame.append was removed in pandas 2.0.
label_dir = outputs_path + '/labels/'
frames = [pd.read_csv(label_dir + f) for f in os.listdir(label_dir) if f.endswith('.csv')]
all_labeled = pd.concat(frames, ignore_index=True)
all_labeled.to_csv(outputs_path + '/all_labeled_data.csv', index=False)


## Creating and Training Model

### Loading Data into Tensorflow to be trained

In [None]:

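seed = 42

# Each dataset element is a batch of 5 rows, so take/skip below count batches, not rows.
ds = tf.data.experimental.make_csv_dataset(
    outputs_path + '/all_labeled_data.csv',
    batch_size=5,
    label_name='Label',
    num_epochs=1,
    shuffle_seed=seed,
    ignore_errors=True)

# The first 2000 batches are held out: 1000 for validation and 1000 for testing.
non_train_ds = ds.take(2000)
train_ds = ds.skip(2000)
val_ds = non_train_ds.take(1000)
test_ds = non_train_ds.skip(1000)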
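
Because `take` and `skip` operate on dataset elements, and each element here is a batch of 5 rows, the split sizes are easiest to reason about in batches. The arithmetic can be sketched in plain Python; `split_sizes` is a hypothetical helper and the row counts passed to it are made up for illustration.

```python
import math

def split_sizes(num_rows, batch_size=5, holdout_batches=2000, val_batches=1000):
    """Return approximate (train, val, test) row counts for the take/skip split."""
    total_batches = math.ceil(num_rows / batch_size)
    holdout = min(holdout_batches, total_batches)  # ds.take(2000)
    val = min(val_batches, holdout)                # non_train_ds.take(1000)
    test = holdout - val                           # non_train_ds.skip(1000)
    train = total_batches - holdout                # ds.skip(2000)
    # The last batch may hold fewer rows, so these are upper bounds.
    return train * batch_size, val * batch_size, test * batch_size

print(split_sizes(20_000))  # made-up dataset size
print(split_sizes(6_000))   # a smaller made-up dataset
```

With fewer than 10,000 labeled rows the training split would be empty under this scheme, so the `take`/`skip` counts may need to be scaled to the size of the annotated dataset.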

### Loading models from TensorFlow Hub

Here you can choose which BERT model you will load from TensorFlow Hub and fine-tune. There are multiple BERT models available.

  - [BERT-Base](https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/3), [Uncased](https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/3) and [seven more models](https://tfhub.dev/google/collections/bert/1) with trained weights released by the original BERT authors.
  - [Small BERTs](https://tfhub.dev/google/collections/bert/1) have the same general architecture but fewer and/or smaller Transformer blocks, which lets you explore tradeoffs between speed, size and quality.
  - [ALBERT](https://tfhub.dev/google/collections/albert/1): four different sizes of "A Lite BERT" that reduces model size (but not computation time) by sharing parameters between layers.
  - [BERT Experts](https://tfhub.dev/google/collections/experts/bert/1): eight models that all have the BERT-base architecture but offer a choice between different pre-training domains, to align more closely with the target task.
  - [Electra](https://tfhub.dev/google/collections/electra/1) has the same architecture as BERT (in three different sizes), but gets pre-trained as a discriminator in a set-up that resembles a Generative Adversarial Network (GAN).
  - BERT with Talking-Heads Attention and Gated GELU [[base](https://tfhub.dev/tensorflow/talkheads_ggelu_bert_en_base/1), [large](https://tfhub.dev/tensorflow/talkheads_ggelu_bert_en_large/1)] has two improvements to the core of the Transformer architecture.

The model documentation on TensorFlow Hub has more details and references to the
research literature. Follow the links above, or click on the [`tfhub.dev`](http://tfhub.dev) URL
printed after the next cell execution.

The suggestion is to start with a Small BERT (with fewer parameters) since they are faster to fine-tune. If you like a small model but with higher accuracy, ALBERT might be your next option. If you want even better accuracy, choose
one of the classic BERT sizes or their recent refinements like Electra, Talking Heads, or a BERT Expert.

Aside from the models available below, there are [multiple versions](https://tfhub.dev/google/collections/transformer_encoders_text/1) of the models that are larger and can yield even better accuracy, but they are too big to be fine-tuned on a single GPU. You will be able to do that on the [Solve GLUE tasks using BERT on a TPU colab](https://www.tensorflow.org/tutorials/text/solve_glue_tasks_using_bert_on_tpu).

You'll see in the code below that switching the tfhub.dev URL is enough to try any of these models, because all the differences between them are encapsulated in the SavedModels from TF Hub.

In [None]:
#@title Choose a BERT model to fine-tune

bert_model_name = 'bert_en_uncased_L-12_H-768_A-12'  #@param ["bert_en_uncased_L-12_H-768_A-12", "bert_en_cased_L-12_H-768_A-12", "bert_multi_cased_L-12_H-768_A-12", "small_bert/bert_en_uncased_L-2_H-128_A-2", "small_bert/bert_en_uncased_L-2_H-256_A-4", "small_bert/bert_en_uncased_L-2_H-512_A-8", "small_bert/bert_en_uncased_L-2_H-768_A-12", "small_bert/bert_en_uncased_L-4_H-128_A-2", "small_bert/bert_en_uncased_L-4_H-256_A-4", "small_bert/bert_en_uncased_L-4_H-512_A-8", "small_bert/bert_en_uncased_L-4_H-768_A-12", "small_bert/bert_en_uncased_L-6_H-128_A-2", "small_bert/bert_en_uncased_L-6_H-256_A-4", "small_bert/bert_en_uncased_L-6_H-512_A-8", "small_bert/bert_en_uncased_L-6_H-768_A-12", "small_bert/bert_en_uncased_L-8_H-128_A-2", "small_bert/bert_en_uncased_L-8_H-256_A-4", "small_bert/bert_en_uncased_L-8_H-512_A-8", "small_bert/bert_en_uncased_L-8_H-768_A-12", "small_bert/bert_en_uncased_L-10_H-128_A-2", "small_bert/bert_en_uncased_L-10_H-256_A-4", "small_bert/bert_en_uncased_L-10_H-512_A-8", "small_bert/bert_en_uncased_L-10_H-768_A-12", "small_bert/bert_en_uncased_L-12_H-128_A-2", "small_bert/bert_en_uncased_L-12_H-256_A-4", "small_bert/bert_en_uncased_L-12_H-512_A-8", "small_bert/bert_en_uncased_L-12_H-768_A-12", "albert_en_base", "electra_small", "electra_base", "experts_pubmed", "experts_wiki_books", "talking-heads_base"]

map_name_to_handle = {
    'bert_en_uncased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/3',
    'bert_en_cased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_cased_L-12_H-768_A-12/3',
    'bert_multi_cased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_multi_cased_L-12_H-768_A-12/3',
    'small_bert/bert_en_uncased_L-2_H-128_A-2':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-128_A-2/1',
    'small_bert/bert_en_uncased_L-2_H-256_A-4':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-256_A-4/1',
    'small_bert/bert_en_uncased_L-2_H-512_A-8':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-512_A-8/1',
    'small_bert/bert_en_uncased_L-2_H-768_A-12':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-768_A-12/1',
    'small_bert/bert_en_uncased_L-4_H-128_A-2':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-128_A-2/1',
    'small_bert/bert_en_uncased_L-4_H-256_A-4':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-256_A-4/1',
    'small_bert/bert_en_uncased_L-4_H-512_A-8':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-512_A-8/1',
    'small_bert/bert_en_uncased_L-4_H-768_A-12':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-768_A-12/1',
    'small_bert/bert_en_uncased_L-6_H-128_A-2':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-6_H-128_A-2/1',
    'small_bert/bert_en_uncased_L-6_H-256_A-4':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-6_H-256_A-4/1',
    'small_bert/bert_en_uncased_L-6_H-512_A-8':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-6_H-512_A-8/1',
    'small_bert/bert_en_uncased_L-6_H-768_A-12':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-6_H-768_A-12/1',
    'small_bert/bert_en_uncased_L-8_H-128_A-2':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-8_H-128_A-2/1',
    'small_bert/bert_en_uncased_L-8_H-256_A-4':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-8_H-256_A-4/1',
    'small_bert/bert_en_uncased_L-8_H-512_A-8':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-8_H-512_A-8/1',
    'small_bert/bert_en_uncased_L-8_H-768_A-12':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-8_H-768_A-12/1',
    'small_bert/bert_en_uncased_L-10_H-128_A-2':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-10_H-128_A-2/1',
    'small_bert/bert_en_uncased_L-10_H-256_A-4':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-10_H-256_A-4/1',
    'small_bert/bert_en_uncased_L-10_H-512_A-8':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-10_H-512_A-8/1',
    'small_bert/bert_en_uncased_L-10_H-768_A-12':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-10_H-768_A-12/1',
    'small_bert/bert_en_uncased_L-12_H-128_A-2':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-12_H-128_A-2/1',
    'small_bert/bert_en_uncased_L-12_H-256_A-4':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-12_H-256_A-4/1',
    'small_bert/bert_en_uncased_L-12_H-512_A-8':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-12_H-512_A-8/1',
    'small_bert/bert_en_uncased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-12_H-768_A-12/1',
    'albert_en_base':
        'https://tfhub.dev/tensorflow/albert_en_base/2',
    'electra_small':
        'https://tfhub.dev/google/electra_small/2',
    'electra_base':
        'https://tfhub.dev/google/electra_base/2',
    'experts_pubmed':
        'https://tfhub.dev/google/experts/bert/pubmed/2',
    'experts_wiki_books':
        'https://tfhub.dev/google/experts/bert/wiki_books/2',
    'talking-heads_base':
        'https://tfhub.dev/tensorflow/talkheads_ggelu_bert_en_base/1',
}

map_model_to_preprocess = {
    'bert_en_uncased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'bert_en_cased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_cased_preprocess/3',
    'small_bert/bert_en_uncased_L-2_H-128_A-2':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-2_H-256_A-4':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-2_H-512_A-8':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-2_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-4_H-128_A-2':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-4_H-256_A-4':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-4_H-512_A-8':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-4_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-6_H-128_A-2':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-6_H-256_A-4':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-6_H-512_A-8':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-6_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-8_H-128_A-2':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-8_H-256_A-4':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-8_H-512_A-8':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-8_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-10_H-128_A-2':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-10_H-256_A-4':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-10_H-512_A-8':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-10_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-12_H-128_A-2':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-12_H-256_A-4':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-12_H-512_A-8':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'bert_multi_cased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_multi_cased_preprocess/3',
    'albert_en_base':
        'https://tfhub.dev/tensorflow/albert_en_preprocess/3',
    'electra_small':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'electra_base':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'experts_pubmed':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'experts_wiki_books':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'talking-heads_base':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
}

tfhub_handle_encoder = map_name_to_handle[bert_model_name]
tfhub_handle_preprocess = map_model_to_preprocess[bert_model_name]

print(f'BERT model selected           : {tfhub_handle_encoder}')
print(f'Preprocess model auto-selected: {tfhub_handle_preprocess}')

BERT model selected           : https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/3
Preprocess model auto-selected: https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3


In [None]:
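# A sample sentence to check that the preprocessing model and the encoder load and run
text_test = ['However, my leg is feeling better, thank you.']

bert_preprocess_model = hub.KerasLayer(tfhub_handle_preprocess)
text_preprocessed = bert_preprocess_model(text_test)
bert_model = hub.KerasLayer(tfhub_handle_encoder)
bert_results = bert_model(text_preprocessed)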

### Define your model

You will create a very simple fine-tuned model, with the preprocessing model, the selected BERT model, one Dense and a Dropout layer.

Note: for more information about the base model's input and output, follow the model's URL to its documentation. Here specifically, you don't need to worry about it because the preprocessing model takes care of that for you.


In [None]:
def build_classifier_model():
  text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='Sentence')
  preprocessing_layer = hub.KerasLayer(tfhub_handle_preprocess, name='preprocessing')
  encoder_inputs = preprocessing_layer(text_input)
  encoder = hub.KerasLayer(tfhub_handle_encoder, trainable=True, name='BERT_encoder')
  outputs = encoder(encoder_inputs)
  net = outputs['pooled_output']
  net = tf.keras.layers.Dropout(0.1)(net)
  net = tf.keras.layers.Dense(1, activation=None, name='classifier')(net)
  return tf.keras.Model(text_input, net)

In [None]:
classifier_model = build_classifier_model()

In [None]:
# Define loss function
loss = tf.keras.losses.BinaryCrossentropy(from_logits=True)
metrics = tf.metrics.BinaryAccuracy()
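
The classifier's final Dense layer has no activation, so it outputs a raw logit, which is why the loss is created with `from_logits=True` and why `tf.sigmoid` is applied later to obtain scores. As a rough sketch of what that means for a single example, the following plain-Python functions (illustrative names, not TensorFlow APIs) compute the sigmoid and the binary cross-entropy directly from a logit.

```python
import math

def sigmoid(logit):
    """Map a raw logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))

def bce_from_logit(logit, label):
    """Binary cross-entropy for one example, computed from the raw logit."""
    # Numerically stable form: max(x, 0) - x * y + log(1 + exp(-|x|))
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))

for logit, label in [(3.0, 1), (3.0, 0), (0.0, 1)]:
    prob = sigmoid(logit)
    loss = bce_from_logit(logit, label)
    print(f'logit={logit:+.1f} label={label} prob={prob:.3f} loss={loss:.3f}')
```

A confident correct prediction (large positive logit, label 1) gives a small loss, while the same logit with label 0 is penalized heavily.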

In [None]:
# Define optimizer
epochs = 5
steps_per_epoch = 300
num_train_steps = steps_per_epoch * epochs
num_warmup_steps = int(0.1*num_train_steps)

init_lr = 3e-5
optimizer = optimization.create_optimizer(init_lr=init_lr,
                                          num_train_steps=num_train_steps,
                                          num_warmup_steps=num_warmup_steps,
                                          optimizer_type='adamw')
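
With the values above, training runs for 5 × 300 = 1500 steps, of which the first 10% (150 steps) are warmup. The schedule can be approximated in plain Python as a linear warmup followed by a linear decay to zero. This is a sketch under that assumption: `sketch_lr` is a hypothetical helper, and the exact curve produced by `optimization.create_optimizer` may differ between library versions.

```python
def sketch_lr(step, init_lr=3e-5, num_train_steps=1500, num_warmup_steps=150):
    """Approximate learning rate at a step: linear warmup, then linear decay to 0."""
    if step < num_warmup_steps:
        # Warmup: ramp linearly from 0 up towards init_lr.
        return init_lr * step / num_warmup_steps
    # Decay: fall linearly so the rate reaches 0 at the final training step.
    remaining = max(num_train_steps - step, 0)
    return init_lr * remaining / num_train_steps

for step in (0, 75, 150, 750, 1500):
    print(step, sketch_lr(step))
```

Note that `num_train_steps` depends on the hard-coded `steps_per_epoch`; if the real number of training batches per epoch is different, the decay reaches zero too early or too late.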

### Training the model

In [None]:
classifier_model.compile(optimizer=optimizer,
                         loss=loss,
                         metrics=metrics)

In [None]:
print(f'Training model with {tfhub_handle_encoder}')
history = classifier_model.fit(x=train_ds,
                               validation_data=val_ds,
                               epochs=epochs)

Training model with https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/3
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


### Evaluate the model

Let's see how the model performs. Two values will be returned. Loss (a number which represents the error, lower values are better), and accuracy.

In [None]:
loss, accuracy = classifier_model.evaluate(test_ds)

print(f'Loss: {loss}')
print(f'Accuracy: {accuracy}')

Loss: 0.31804439425468445
Accuracy: 0.9039999842643738


### Export for inference

Now save your fine-tuned model for later use.
- Define the name of your model by setting ```dataset_name```
- Define the path where the model is saved by changing ```saved_model_path```

For more information, see https://www.tensorflow.org/tutorials/keras/save_and_load



In [None]:
dataset_name = 'DSDP'
saved_model_path = './Models/{}_bert'.format(dataset_name.replace('/', '_'))

# Saved model format
classifier_model.save(saved_model_path)

# HDF5 format
# classifier_model.save('./'+saved_model_path+'.h5')

Here you can test your model on any sentence you want. Add sentences to the ```examples``` variable below to see what a good threshold is.

In [None]:
def print_my_examples(inputs, results):
  result_for_printing = \
    [f'input: {inputs[i]:<30} : score: {results[i][0]:.6f}'
                         for i in range(len(inputs))]
  print(*result_for_printing, sep='\n')
  print()


examples = [
    'That is too much work!!!',
    'However, my leg is feeling better, thank you.',
    'Howis your day going?',
    'Message sent by:> eric (ebass@enron.com)>',
    '>',
    'AAA',
    'Encrypted message:>> QTDHF AYTPE KHYYN WSJYE TBXNQ PJRRH GDVDX QHNNK QBKOD ZVQEU LTRMQ WDHURACFGZ VJDPV UFIZA M',
    ', visit http://www.bestfares.com/view.asp?id=3D10106401FLY BETWEEN THE MIDWEST AND EAST COAST FOR $198 TO $218',
    'Thank You for Your CooperationRT'

]

original_results = tf.sigmoid(classifier_model(tf.constant(examples)))
print('Results from the model in memory:')
print_my_examples(examples, original_results)

Results from the saved model:
input: That is too much work!!!       : score: 0.999104
input: However, my leg is feeling better, thank you. : score: 0.999778
input: Howis your day going?          : score: 0.996103
input: Message sent by:> eric (ebass@enron.com)> : score: 0.010370
input: >                              : score: 0.922872
input: AAA                            : score: 0.646321
input: Encrypted message:>> QTDHF AYTPE KHYYN WSJYE TBXNQ PJRRH GDVDX QHNNK QBKOD ZVQEU LTRMQ WDHURACFGZ VJDPV UFIZA M : score: 0.873582
input: , visit http://www.bestfares.com/view.asp?id=3D10106401FLY BETWEEN THE MIDWEST AND EAST COAST FOR $198 TO $218 : score: 0.008605
input: Thank You for Your CooperationRT : score: 0.652361

Results from the model in memory:


## Using Model to Clean Email

### Instructions
1. Define where the inputs and outputs will reside:
    - ```outputs```: where the filtered outputs reside
    - ```inputs```: where the raw input email files to be cleaned reside
2. Define the path of the model, ```model_path```, to reload the model.
3. Run the code below and check the cleaned results in your ```outputs``` directory.
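
The filtering step keeps only sentences whose sigmoid score reaches a threshold (0.8 in the cell below). To get a feel for how the threshold changes what is kept, here is a small plain-Python sketch that uses a few of the scores from the example output earlier in this notebook; `keep_above` is an illustrative helper, not part of the notebook.

```python
# Scores copied from the example output in this notebook; the helper name is illustrative.
scored = [
    ('That is too much work!!!', 0.999104),
    ('Message sent by:> eric (ebass@enron.com)>', 0.010370),
    ('>', 0.922872),
    ('AAA', 0.646321),
    ('Thank You for Your CooperationRT', 0.652361),
]

def keep_above(scored_sentences, threshold):
    """Return the sentences whose score is at least the threshold."""
    return [sentence for sentence, score in scored_sentences if score >= threshold]

for threshold in (0.5, 0.8, 0.95):
    print(threshold, keep_above(scored, threshold))
```

A higher threshold removes more borderline lines such as `AAA`, at the risk of also dropping short but meaningful sentences, so it is worth checking a few thresholds against your own emails.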

In [None]:
# model_path = saved_model_path
model_path = './Models/DSDP_bert'
reloaded_model = tf.saved_model.load(model_path)

In [None]:
# Clean every file in the inputs directory and filter its sentences with the model
outputs = create_dir('./Filtered/arnold-j/')
inputs = './Inputs/arnold-j/'
threshold = 0.8  # keep sentences whose score is at least this value

for fname in os.listdir(inputs):
    corpus = clean_summarize(inputs + fname, fname, annotation=False).splitlines()

    # Predict a relevance score for each sentence
    try:
        scores = tf.sigmoid(reloaded_model(tf.constant(corpus))).numpy().ravel()
    except Exception as err:
        # Skip this file so that results from a previous file are never reused
        print(f'{fname} failed predicting: {err}')
        continue

    # Keep only the sentences that reach the threshold
    kept = [sentence for sentence, score in zip(corpus, scores) if score >= threshold]

    with open(outputs + fname[:-4] + '.txt', "w") as new_file:
        new_file.write('\n'.join(kept))
