# Preprocessing Notebook
## Reads the unprocessed text from a file and outputs the processed string

**Uses the synonyms JSON file as input for synonym generation and loads a pretrained word2vec model to generate the synonyms.**

In [1]:
import json
import os
import gensim
from gensim.models import KeyedVectors
from gensim.models import Phrases
import spacy
import re
from thesaurus import synonym_expander, thesaurus_generator
from normalize import normalizer

# spaCy model used for lemmatization and for stop-word and punctuation removal
nlp = spacy.load("en_core_web_sm")

**The directory of the pretrained word2vec weights and the model filename are set here.**

In [2]:
# Build the path with os.path.join so it works on any operating system
preload_path = os.path.join('..', 'pretrained_weights', 'glove')
w2v_weights = 'glove.6B.200d.txt.word2vec'
w2v_path = os.path.join(preload_path, w2v_weights)

**The path to the text file holding the unprocessed input string is set here.**

In [3]:
text_path = os.path.join('..', 'data', 'test_text')
preprocess_filename = 'pre_processed.txt'
preprocess_path = os.path.join(text_path, preprocess_filename)

**The location of the synonyms JSON file is set here.**

In [4]:
data_path = os.path.join('..', 'data')
synonym_filename = 'specified_synonyms.json'

In [5]:
# Load the pretrained vectors; the file is in word2vec text format, hence binary=False
w2v_model = KeyedVectors.load_word2vec_format(w2v_path, binary=False)
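
The filename `glove.6B.200d.txt.word2vec` suggests GloVe vectors that were already converted to the word2vec text format, which is what `load_word2vec_format` expects. In text form, the word2vec layout starts with a header line giving the vocabulary size and vector dimension, which a plain GloVe file lacks. The sketch below illustrates that conversion with the standard library only; `add_word2vec_header` is a hypothetical helper written for this example, not a gensim function or part of this project, and the vectors are made up.

```python
import io

def add_word2vec_header(glove_lines):
    """Return word2vec-format text lines for GloVe-format input lines.

    Hypothetical helper for illustration: it prepends the
    "<vocab_size> <dimension>" header used by the word2vec text format.
    """
    rows = [line.rstrip("\n") for line in glove_lines if line.strip()]
    if not rows:
        raise ValueError("no vectors found")
    # Each row is "<token> <v1> <v2> ... <vn>", so dimension = fields - 1.
    dimension = len(rows[0].split(" ")) - 1
    return [f"{len(rows)} {dimension}"] + rows

# Tiny in-memory stand-in for a GloVe file (made-up values).
glove_text = io.StringIO("court 0.1 0.2 0.3\nshare 0.4 0.5 0.6\n")
converted = add_word2vec_header(glove_text)
print(converted[0])
```

For a file the size of the 200-dimensional GloVe vectors, a streaming version that counts rows first and then writes line by line would avoid holding everything in memory.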

In [6]:
# Read the input text (assumed to be UTF-8) and lowercase it
with open(preprocess_path, encoding='utf-8') as text_file:
    preprocess_string = text_file.read().lower()
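
When `open` is called without an `encoding` argument, Python falls back to a platform-dependent default, which on Windows is often cp1252. If the text file is actually UTF-8, typographic characters such as curly quotes then decode into sequences beginning with `â€`. The sketch below reproduces the effect on a made-up sample string; whether this applies to `pre_processed.txt` depends on how that file was saved, which is an assumption here.

```python
# Made-up sample containing curly quotes, as often found in legal documents.
sample = "\u201cConfidential Information\u201d means"

raw = sample.encode("utf-8")

# Decoding UTF-8 bytes as cp1252 yields mojibake; errors="replace" keeps
# the demo from raising on bytes that cp1252 leaves undefined.
wrong = raw.decode("cp1252", errors="replace")
right = raw.decode("utf-8")

print(wrong)
print(right)
```

Passing `encoding='utf-8'` explicitly to `open` makes the result independent of the machine the notebook runs on.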

In [7]:
# Parse the synonyms JSON file (assumed to be UTF-8)
with open(os.path.join(data_path, synonym_filename), encoding='utf-8') as synonym_file:
    synonym_json = json.load(synonym_file)
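
The next cell indexes `synonym_json` as a list whose entries each carry a `content` field. The sketch below shows one JSON shape consistent with that access pattern, plus an early check for the field. The `priority` key and everything inside `content` are assumptions for illustration, since the real `specified_synonyms.json` is not shown here.

```python
import json

# Hypothetical file contents: a list of priority levels, each with a
# "content" field. Only the list-of-objects-with-"content" shape is implied
# by the notebook; everything inside "content" is an assumption.
synonym_text = """
[
  {"priority": 1, "content": {"agreement": ["contract", "accord"]}},
  {"priority": 2, "content": {"shareholder": ["stockholder"]}}
]
"""

synonym_json = json.loads(synonym_text)

# Fail early with a clear message if an entry lacks the expected field.
for index, entry in enumerate(synonym_json):
    if "content" not in entry:
        raise KeyError(f"entry {index} has no 'content' field")

contents = [entry["content"] for entry in synonym_json]
print(len(contents))
```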

**Synonyms are generated from the synonyms JSON, and a legal dictionary is built for specific legal terms.**

In [8]:
legal_dict = {}
synonyms_expanded = {}

# Accumulate results from the first two priority levels of the synonyms file
for priority in synonym_json[:2]:
    content = priority['content']
    thesaurus_split = thesaurus_generator(content)
    synonyms_dict = thesaurus_split['synonyms_dict']
    legal_dict.update(thesaurus_split['legal_dict'])
    synonyms_expanded.update(synonym_expander(synonyms_dict, nlp, w2v_model))
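
Accumulating per-priority results with `dict.update` means that when the same key appears at two priority levels, the later level overwrites the earlier one. The sketch below contrasts that with a first-level-wins merge, in case the opposite precedence is wanted. The dictionaries are made up and stand in for the thesaurus output, whose exact contents are not shown in this notebook.

```python
# Made-up per-priority results standing in for thesaurus output.
level_one = {"stockholder": "shareholder", "accord": "agreement"}
level_two = {"accord": "contract", "pact": "agreement"}

# dict.update: later levels overwrite earlier ones on key clashes.
later_wins = {}
for level in (level_one, level_two):
    later_wins.update(level)

# setdefault: the first level to define a key keeps it.
first_wins = {}
for level in (level_one, level_two):
    for key, value in level.items():
        first_wins.setdefault(key, value)

print(later_wins["accord"], first_wins["accord"])
```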

**Synonyms found in the input text are replaced with their base word.**
**The text is then normalized: it is lemmatized, and punctuation and stop words are removed.**

In [9]:
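# str.replace returns a new string, so the result must be assigned back
for key in legal_dict:
    if key in preprocess_string:
        preprocess_string = preprocess_string.replace(key, legal_dict[key])
for key in synonyms_expanded:
    if key in preprocess_string:
        preprocess_string = preprocess_string.replace(key, synonyms_expanded[key])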
        
post_process = normalizer(preprocess_string)  
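
Substring replacement with `str.replace` rewrites matches inside longer words too, for example `act` inside `transaction`. The sketch below shows a whole-word alternative built on `re`. `replace_terms` is a hypothetical helper, the mapping is made up, and the direction of the mapping (variant to base word) is assumed from the description of this step.

```python
import re

def replace_terms(text, mapping):
    """Replace whole-word occurrences of each key in `mapping` with its value.

    Hypothetical helper: keys are assumed to be variants, values base words.
    """
    if not mapping:
        return text
    # Longest keys first so multi-word phrases win over their sub-phrases.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(k) for k in keys) + r")\b")
    return pattern.sub(lambda match: mapping[match.group(0)], text)

mapping = {"act": "statute", "stock holder": "shareholder"}
text = "the stock holder approved the transaction under the act"
result = replace_terms(text, mapping)
print(result)
```

Doing all replacements in a single pass also prevents a replacement value from being rewritten again by a later key.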

In [10]:
print(post_process)

strictly private confidential placeholder non disclosure agreement dear sir understand contemplate potential investment placeholder register commercial register local court placeholder have business address placeholder company way acquire share company current shareholder and/or subscription new share issue company potential transaction write confirm basis company advisor provide placeholder receive party confidential information define company advisor shall obligation reserve right discontinue provision confidential information time give reason cardinal definition purpose letter affiliate mean respect company affiliate meaning org cardinal norp stock corporation act aktg confidential information mean information form relate potential transaction company shareholder affiliate obtain provide connection potential transaction include discussion negotiation investigation relate date letter mark confidential copy reproduction information form information document record time contai