## Model Evaluation on Test Data

In [8]:
import sys
sys.path.append('../')
import os
import pandas as pd
import numpy as np
from datetime import datetime
from src.dataloader import DataLoader
from src.logger import Logger
from src.text_preprocessor import TextPreprocessor
from textblob import TextBlob
import spacy

In [2]:
# Load testing labels
df_test_labels = pd.read_csv(os.path.join('..','Data','test_labels.csv'))

In [3]:
df_test_labels.head(10)

Unnamed: 0,doc_id,phrase,ric
0,0563_20171129_nL3N1NZ1MX_1,Aussie,AUD
1,0563_20171129_nL3N1NZ1MX_1,New Zealand dollar,NZD
2,0563_20171129_nL3N1NZ1MX_1,Australian dollar,AUD
3,0563_20171129_nL3N1NZ1MX_1,kiwi dollar,NZD
4,0563_20171129_nL3N1NZ1MX_1,Australian dollar,AUD
5,0563_20171129_nL3N1NZ1MX_1,U.S. dollar,USD
6,0730_20180329_nFCT29YLPY_1,USD,USD
7,0730_20180329_nFCT29YLPY_1,yen,JPY
8,0730_20180329_nFCT29YLPY_1,USD,USD
9,0730_20180329_nFCT29YLPY_1,JPY,JPY


In [4]:
# Let's group by document ID and collect the unique entity labels per document
df_collected_entities = df_test_labels.groupby('doc_id')['ric'].agg(lambda x: set(x)).reset_index()

In [5]:
df_collected_entities.head()

Unnamed: 0,doc_id,ric
0,0002_20171221_nFCT20VZKW_1,"{NZD, SEK, AUD, GBP, NOK, JPY, EUR, USD, CAD}"
1,001_yahoo,"{TWD, PHP, SGD, THB, MYR, IDR, EUR, INR, CNY, ..."
2,0243_20170120_nDJMS0289D_1,"{PLN, HUF, EUR, CZK}"
3,0244_20180416_nFCT16ZCHB_1,"{NZD, AUD, GBP, JPY, EUR, CHF, USD, CAD}"
4,0245_20180416_nNRA5ww5ev_1,"{TWD, PHP, MYR, THB, KRW, IDR, JPY, INR, CNY, ..."


In [6]:
df_collected_entities.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70 entries, 0 to 69
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   doc_id  70 non-null     object
 1   ric     70 non-null     object
dtypes: object(2)
memory usage: 1.2+ KB


__We have to extract entities for 70 documents__

In [9]:
# Load Model
model = spacy.load(os.path.join('..','model'))

In [33]:
# Load the test files one by one
test_files_path = os.path.join('..','Data','test')
time_created = datetime.now()
logger = Logger(f'Evaluation_logs_{time_created.date()}_{time_created.strftime("%H%M%S")}.log')
data_loader = DataLoader(logger,test_files_path)
text_preprocessor = TextPreprocessor(logger)

files = []
entities = []
for _, row in df_collected_entities.iterrows():
    text = data_loader.read_file(row['doc_id'])
    processed_text = TextBlob(text)

    # Check sentence-wise sentiment and extract entities
    entities_found = []
    for sentence in processed_text.sentences:
        # Skip near-neutral sentences (|polarity| <= 0.05)
        if abs(sentence.sentiment.polarity) > 0.05:
            cleaned_sentence = text_preprocessor.clean_text(sentence)
            doc = model(cleaned_sentence)
            for ent in doc.ents:
                # The entity label is the currency code (e.g. USD)
                entities_found.append(ent.label_)

    # Documents with no detected entities are left out of the results
    if entities_found:
        files.append(row['doc_id'])
        entities.append(set(entities_found))

In [38]:
file_entity = {'file':files,'detected_entities':entities}


In [39]:
df_file_entity = pd.DataFrame(file_entity)

In [41]:
df_file_entity.head()

Unnamed: 0,file,detected_entities
0,0002_20171221_nFCT20VZKW_1,"{NZD, SEK, AUD, GBP, SGD, NOK, JPY, EUR, USD, ..."
1,001_yahoo,"{CNY, USD, IDR}"
2,0243_20170120_nDJMS0289D_1,"{PLN, EUR, USD, NOK}"
3,0244_20180416_nFCT16ZCHB_1,"{NZD, AUD, GBP, JPY, EUR, CHF, USD, CAD}"
4,0245_20180416_nNRA5ww5ev_1,"{TWD, IDR, RUB, INR, USD}"


In [45]:
df_true_pred = df_file_entity.merge(df_collected_entities, left_on = 'file',right_on='doc_id',how='right')

In [48]:
df_true_pred.head()

Unnamed: 0,file,detected_entities,doc_id,ric
0,0002_20171221_nFCT20VZKW_1,"{NZD, SEK, AUD, GBP, SGD, NOK, JPY, EUR, USD, ...",0002_20171221_nFCT20VZKW_1,"{NZD, SEK, AUD, GBP, NOK, JPY, EUR, USD, CAD}"
1,001_yahoo,"{CNY, USD, IDR}",001_yahoo,"{TWD, PHP, SGD, THB, MYR, IDR, EUR, INR, CNY, ..."
2,0243_20170120_nDJMS0289D_1,"{PLN, EUR, USD, NOK}",0243_20170120_nDJMS0289D_1,"{PLN, HUF, EUR, CZK}"
3,0244_20180416_nFCT16ZCHB_1,"{NZD, AUD, GBP, JPY, EUR, CHF, USD, CAD}",0244_20180416_nFCT16ZCHB_1,"{NZD, AUD, GBP, JPY, EUR, CHF, USD, CAD}"
4,0245_20180416_nNRA5ww5ev_1,"{TWD, IDR, RUB, INR, USD}",0245_20180416_nNRA5ww5ev_1,"{TWD, PHP, MYR, THB, KRW, IDR, JPY, INR, CNY, ..."
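
The merged frame puts the detected and the labelled entity sets side by side, but no score is computed from it. One way to quantify the comparison is per-document precision, recall and F1 over the two sets. The block below is only a minimal sketch under stated assumptions: `set_prf` is a hypothetical helper that is not part of this repository, the three rows are toy data (the first mirrors the `0243_20170120_nDJMS0289D_1` row shown above), and a missing prediction is assumed to appear as a non-set value such as NaN after the right merge.

```python
def set_prf(predicted, expected):
    """Precision, recall and F1 for one document, comparing label sets.

    Hypothetical helper for illustration; not part of the project code.
    """
    # After a right merge, a document with no detections holds NaN, not a set
    predicted = predicted if isinstance(predicted, set) else set()
    true_positives = len(predicted & expected)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Toy (detected, labelled) pairs shaped like the two set columns of df_true_pred
toy_rows = [
    ({'PLN', 'EUR', 'USD', 'NOK'}, {'PLN', 'HUF', 'EUR', 'CZK'}),
    ({'AUD', 'NZD'}, {'AUD', 'NZD'}),
    (float('nan'), {'JPY'}),  # document for which nothing was detected
]

scores = [set_prf(detected, labelled) for detected, labelled in toy_rows]

# Macro-averaged F1: every document counts equally
macro_f1 = sum(f1 for _, _, f1 in scores) / len(scores)
```

On the notebook's own data, the same helper could be applied row by row to the `detected_entities` and `ric` columns of `df_true_pred`. Macro-averaging is one possible design choice here; pooling the counts across documents (micro-averaging) would weight documents with many currencies more heavily.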


_It can be concluded that the performance can be improved even further by using a dataset annotated with_ __Inside–outside–beginning (IOB) tagging__.

1.  __I have currently trained the spaCy NER model for only 10 iterations; it could have been trained further.__
2.  __I would have used a Bi-LSTM or BERT NER model to understand the context better.__
3.  __I have tried to stick to proper coding conventions for the most part, but the code could be better given more time. I might also have used Docker for easy reproducibility.__
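
For readers unfamiliar with the Inside–outside–beginning scheme mentioned above: each token receives `B-<label>` if it begins an entity, `I-<label>` if it continues one, and `O` otherwise. The block below is a minimal sketch of that conversion, not the format of any dataset used here. `spans_to_iob` is a hypothetical helper, and the spans are assumed to be non-overlapping token-index triples `(start, end_exclusive, label)`.

```python
def spans_to_iob(tokens, spans):
    """Convert token-level entity spans to a list of IOB tags.

    Hypothetical helper; spans are (start, end_exclusive, label) and are
    assumed not to overlap.
    """
    tags = ['O'] * len(tokens)
    for start, end, label in spans:
        tags[start] = f'B-{label}'      # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f'I-{label}'      # remaining tokens of the entity
    return tags


# Illustrative sentence using phrases of the kind seen in test_labels.csv
tokens = ['The', 'New', 'Zealand', 'dollar', 'fell', 'against', 'the', 'yen', '.']
spans = [(1, 4, 'NZD'), (7, 8, 'JPY')]

tags = spans_to_iob(tokens, spans)
# ['O', 'B-NZD', 'I-NZD', 'I-NZD', 'O', 'O', 'O', 'B-JPY', 'O']
```

The `B-` prefix is what lets a model separate two adjacent entities of the same type, which plain per-token labels cannot express.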

## Assumptions

I have made some assumptions while creating the training dataset, cleaning the text, and running inference.

1. To create the training data, I used spaCy's Matcher object to match tokens against the lexicon phrases and tag each match with an ID, which is then used as the entity label for training. I studied the patterns in a few of the documents in the `Exploration.ipynb` file, and I also took the POS tags into account while matching. This could of course be structured better, perhaps by using PhraseMatcher or by adding more matching rules, such as an additional dependency tag.

2. During cleaning, I performed some basic tasks such as removing URLs, extra spaces, and text between brackets. Text cleaning alone could take days, and there is no limit to the number of cleaning steps one could apply, so I decided to stick with the basics. I did not convert the text to lower case because in most cases the entities are case-sensitive.

3. During inference, I chose to split the document into sentences and calculate the sentiment of each one (TextBlob polarity). Only when a sentence shows a hint of positive or negative sentiment (polarity above 0.05 or below -0.05) do I extract the entities from that sentence.

4. For the most part, the code (variables, classes, logs, functions, etc.) is self-explanatory.
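
The basic cleaning steps listed in assumption 2 can be sketched with regular expressions. This is not the project's `TextPreprocessor.clean_text`, whose implementation is not shown here. `clean_text_sketch` and the three patterns are hypothetical, and it is an assumption that the brackets themselves are removed together with their content and that both round and square brackets are meant.

```python
import re

# Hypothetical patterns, chosen for illustration only
URL_RE = re.compile(r'https?://\S+|www\.\S+')
BRACKET_RE = re.compile(r'\([^()]*\)|\[[^\[\]]*\]')  # non-nested (...) or [...]
SPACE_RE = re.compile(r'\s+')


def clean_text_sketch(text):
    """Remove URLs and bracketed text, then collapse whitespace.

    Case is left untouched because the entities are case-sensitive.
    """
    text = URL_RE.sub(' ', text)
    text = BRACKET_RE.sub(' ', text)
    return SPACE_RE.sub(' ', text).strip()


sample = 'The Aussie (AUD) rose,   see https://example.com/fx for details.'
cleaned = clean_text_sketch(sample)
# 'The Aussie rose, see for details.'
```

Removing URLs before brackets keeps a URL that happens to contain parentheses from being cut in half by the bracket pattern.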