<h1><span style="color:red">Named Entity Recognition for SuAVE</span></h1>

This notebook uses spaCy to generate named entity tags by parsing a selected text field in a SuAVE dataset.
See https://spacy.io/ for more information.

The following tags (from https://spacy.io/api/annotation#named-entities) are generated and added as #multi variables to the survey data file when matching entities are found in the text:

   * PERSON:	People, including fictional.
   * NORP:	Nationalities or religious or political groups.
   * FAC:	Buildings, airports, highways, bridges, etc.
   * ORG:	Companies, agencies, institutions, etc.
   * GPE:	Countries, cities, states.
   * LOC:	Non-GPE locations, mountain ranges, bodies of water.
   * PRODUCT:	Objects, vehicles, foods, etc. (Not services.)
   * EVENT:	Named hurricanes, battles, wars, sports events, etc.
   * WORK_OF_ART:	Titles of books, songs, etc.
   * LAW:	Named documents made into laws.
   * LANGUAGE:	Any named language.
   * DATE:	Absolute or relative dates or periods.
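
To make the output format concrete: the extraction cell in step 5 joins the unique entity texts found for a label with `|` to form one #multi value. The sketch below shows only that formatting, using hand-written (text, label) pairs in place of spaCy output. The helper name `to_multi` and the sample pairs are illustrative assumptions, and the sorting is added here just to make the result deterministic.

```python
def to_multi(entities, label):
    """Join the unique entity texts carrying `label` into one '|'-separated string."""
    texts = {text for text, ent_label in entities if ent_label == label}
    return "|".join(sorted(texts))

# Hand-written stand-ins for (ent.text, ent.label_) pairs from a parsed document
entities = [
    ("San Diego", "GPE"),
    ("UCSD", "ORG"),
    ("California", "GPE"),
    ("San Diego", "GPE"),   # duplicates collapse into one value
]

gpe_value = to_multi(entities, "GPE")   # both place names, joined with '|'
law_value = to_multi(entities, "LAW")   # empty string: no LAW entities present
print(gpe_value)
```

A label with no matches yields an empty string, so every record still gets a value for each generated column.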


Additionally, users have an option to add user-defined dictionaries of terms, and add custom #multi variables with terms from the dictionary. These cells are optional. Users can also load larger pre-trained NER models.



For testing:

http://localhost:8888/notebooks/Downloads/jupyter-suave/operations/tagger/NER.ipynb?surveyurl=http://suave-dev.sdsc.edu/main/file=spatialsuave_Russian_FB_Ads_w_Concepts.csv&views=1110001&view=grid&user=spatialsuave&csv=spatialsuave_Russian_FB_Ads_w_Concepts.csv&params=none&dzc=https://maxim.ucsd.edu/dzgen/lib-staging-uploads/bea6f8abb86c98ef168775a159612828/content.dzc&activeobject=null

## 1. Retrieve survey parameters from the URL
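
For readers more comfortable with Python than with the JavaScript regex used in the next cell, the same query-string lookup can be sketched with the standard library. This is an illustration only and not part of the notebook's workflow: the helper name `get_query_value` is hypothetical, the sample URL is a shortened form of the test URL above, and unlike the JavaScript version this lookup is case-sensitive.

```python
from urllib.parse import urlsplit, parse_qs

def get_query_value(url, key, default=""):
    """Return the first value of `key` in the URL's query string, or `default`."""
    query = urlsplit(url).query
    # keep_blank_values=True so that "params=" yields "" instead of being dropped
    values = parse_qs(query, keep_blank_values=True).get(key)
    return values[0] if values else default

# Shortened version of the test URL (assumption: same parameter names)
sample_url = (
    "http://localhost:8888/notebooks/operations/tagger/NER.ipynb"
    "?surveyurl=http://suave-dev.sdsc.edu/main/file=spatialsuave_Russian_FB_Ads_w_Concepts.csv"
    "&views=1110001&view=grid&user=spatialsuave&params=none"
)

survey_url = get_query_value(sample_url, "surveyurl")
views = get_query_value(sample_url, "views")
print(survey_url)
print(views)
```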

In [None]:
%%javascript
function getQueryStringValue (key)
{  
    return unescape(window.location.search.replace(new RegExp("^(?:.*[&\\?]" + escape(key).replace(/[\.\+\*]/g, "\\$&") + "(?:\\=([^&]*))?)?.*$", "i"), "$1"));
}
IPython.notebook.kernel.execute("survey_url='".concat(getQueryStringValue("surveyurl")).concat("'"));
IPython.notebook.kernel.execute("views='".concat(getQueryStringValue("views")).concat("'"));
IPython.notebook.kernel.execute("view='".concat(getQueryStringValue("view")).concat("'"));
IPython.notebook.kernel.execute("user='".concat(getQueryStringValue("user")).concat("'"));
IPython.notebook.kernel.execute("csv_file='".concat(getQueryStringValue("csv")).concat("'")); 
IPython.notebook.kernel.execute("dzc_file='".concat(getQueryStringValue("dzc")).concat("'")); 
IPython.notebook.kernel.execute("params='".concat(getQueryStringValue("params")).concat("'")); 
IPython.notebook.kernel.execute("active_object='".concat(getQueryStringValue("activeobject")).concat("'")); 
IPython.notebook.kernel.execute("full_notebook_url='" + window.location + "'"); 

## 2. Setting up the environment (if needed) and importing libraries

<h2><span style="color:red">Skip this cell if the spaCy environment is already set up. Otherwise, uncomment and run the following commands to set up the environment.</span></h2>
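
If it is unclear whether the environment is already set up, one possible check is to ask Python's import system whether the packages can be found. This is a sketch, not part of the original notebook; the helper name `is_installed` is hypothetical, and the check only confirms that a package is importable, not which version is installed.

```python
import importlib.util

def is_installed(module_name):
    """Return True if `module_name` can be found by Python's import system."""
    return importlib.util.find_spec(module_name) is not None

# spaCy itself, plus the small English model that this notebook imports
for name in ("spacy", "en_core_web_sm"):
    status = "found" if is_installed(name) else "missing"
    print(name + ": " + status)
```

If either line prints `missing`, uncomment the matching install command in the next cell.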

In [None]:
#### Install the main module (see https://spacy.io/)
# ! pip install spacy

#### lemmatization - only needed if creating a model from scratch
# !pip install -U spacy-lookups-data

####  Need to install one of these models
# !python -m spacy download en_core_web_lg   # 789 mb
# !python -m spacy download en_core_web_md   # 91 mb
# !python -m spacy download en_core_web_sm   # 11 mb

#### Installing these models via pip (see https://pypi.org/project/spacy/)
    
# !pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.5/en_core_web_sm-2.2.5.tar.gz    
# !pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_md-2.2.5/en_core_web_md-2.2.5.tar.gz    
# !pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-2.2.5/en_core_web_lg-2.2.5.tar.gz    



### Import spacy and other libraries, load the default pre-trained spacy model (small)

In [None]:
from __future__ import print_function
import ipywidgets as widgets
import pandas as pd
from IPython.display import Markdown, display

import numpy as np
from tqdm.notebook import tqdm_notebook
tqdm_notebook.pandas()

# Importing additional libraries
import panel as pn
import requests
import re

# Loading extensions
pn.extension()
def printmd(string):
    display(Markdown(string))

absolutePath = "../../temp_csvs/"
url_partitioned = full_notebook_url.partition('/operations')
base_url = url_partitioned[0];


In [None]:
import spacy

# The currently installed model is en_core_web_sm version 2.2.5. Its size is 11 MB
# To update the small model (en_core_web_sm), uncomment and run 
# !python -m spacy download en_core_web_sm

# load the small model:
import en_core_web_sm
nlp = en_core_web_sm.load()

In [None]:
def slider(data):
    """
    slider creates an interactive display of a
    data frame.
    
    :param data: data frame
    :returns: interactive dataframe
    """
    
    ## Row Selector widget
    row_selection = pn.widgets.IntSlider(name='Navigate Rows', width=350, 
                                         margin=(0,50,-15,0), end=len(data)-1)

    # Column Selector widget
    col_selection = pn.widgets.IntSlider(name='Navigate Columns', width=350, 
                                         margin=(0,0,5,0), end=len(data.columns)-1)
    
    @pn.depends(row_selection.param.value, col_selection.param.value)
    def navigate_data(row=0, col=0):
        return data.iloc[row:row+5, col:col+10]
    
    sliders = pn.Row(row_selection, col_selection, margin=(0,0,0,10))
    full_widget = pn.Column(sliders, navigate_data)
    return full_widget

def extract_data(path):
    """
    extract_data reads tab-delimited (.txt, .tsv) or comma-delimited (.csv) files
    
    :param path: string representing path to file
    :returns: data frame of file, or None for an unsupported extension
    """

    # Choose the delimiter from the file extension
    if path.endswith(('.txt', '.tsv')):
        sep = '\t'
    elif path.endswith('.csv'):
        sep = ','
    else:
        return None

    # latin-1 and ISO-8859-1 are the same encoding, so try UTF-8 first
    # and fall back to latin-1, which accepts any byte sequence
    try:
        data = pd.read_csv(path, sep=sep, encoding="utf-8")
    except UnicodeDecodeError:
        data = pd.read_csv(path, sep=sep, encoding="latin-1")
    
    return data

<h2><span style="color:red">Optionally, uncomment the lines below to install and import larger pretrained NER models. Otherwise, skip to step 3</span></h2>

In [None]:
#### Installing medium or large models will take a bit longer:

#### For the medium model:
# !python -m spacy download en_core_web_md   # 91 mb
#### and load it using
# import en_core_web_md
# nlp = en_core_web_md.load()

#### Or, for the large model:
# !python -m spacy download en_core_web_lg   # 789 mb
#### and load it using
# import en_core_web_lg
# nlp = en_core_web_lg.load()



## 3. Select a survey file from SuAVE or import a local CSV file

In [None]:
data_select = pn.widgets.RadioBoxGroup(name='Select notebook', options=['Load survey file from SuAVE', 
                                                                        'Import a local CSV file'], 
                                       inline=False)
data_select

In [None]:
data_input = pn.widgets.FileInput()
    
def check_selection():
    if data_select.value == 'Load survey file from SuAVE':
        global fname
        fname = absolutePath + csv_file
        printmd("<b><span style='color:red'>SuAVE survey will be loaded. Continue to step 4.</span></b>")

    else:
        message = pn.pane.HTML("<b><span style='color:red'>Upload data and continue to step 4.</span></b>")
        return pn.Column(message, data_input)
    
check_selection()

## 4. Visualize the data and select a text variable to parse

In [None]:
if not pd.isnull(data_input.filename):
    fname = absolutePath + data_input.filename
    data_input.save(fname)

df = extract_data(fname)

slider(df)


<h2><span style="color:red">4a. Optionally, in the cell below, remove those groups of terms that you don't want to extract from text</span></h2>
Alternatively, skip to step 5
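
Because `ent_labels` and `col_labels` are matched by position, removing a group means removing the entry at the same index from both lists. One way to avoid hand-editing two lists is to filter them together. In this sketch the helper name `drop_groups`, the shortened lists, and the choice of groups to drop are all examples rather than recommendations.

```python
def drop_groups(ent_labels, col_labels, drop):
    """Remove the entity groups in `drop`, keeping labels and column names aligned."""
    kept = [(ent, col) for ent, col in zip(ent_labels, col_labels) if ent not in drop]
    return [ent for ent, _ in kept], [col for _, col in kept]

# Shortened example lists (the real ones are defined in the next cell)
example_ents = ['PERSON', 'NORP', 'ORG', 'GPE', 'DATE']
example_cols = ['nerPerson#multi', 'nerPopulation Group#multi',
                'nerOrganization#multi', 'nerAdministrative Area#multi',
                'nerDate#multi']

kept_ents, kept_cols = drop_groups(example_ents, example_cols, drop={'NORP', 'DATE'})
print(kept_ents)
print(kept_cols)
```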

In [None]:
ent_labels = ['PERSON', 'NORP', 'FAC', 'ORG', 'GPE', 'LOC','PRODUCT', 'EVENT', 'WORK_OF_ART','LAW','LANGUAGE', 'DATE']
col_labels = ['nerPerson#multi', 'nerPopulation Group#multi', 'nerFacility#multi', 'nerOrganization#multi', 'nerAdministrative Area#multi', 'nerLocation#multi','nerProduct#multi', 'nerEvent#multi', 'nerWork of Art#multi','nerLegal Document#multi','nerLanguage#multi', 'nerDate#multi']


<h2><span style="color:red">4b. Optionally, add a user-defined dictionary to the pipeline and use an entity matcher to generate an additional #multi variable</span></h2>
Alternatively, skip to step 5
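
The idea behind the dictionary step can be shown without spaCy. The sketch below is a plain regular-expression stand-in, not the `PhraseMatcher` used in the next cell: it looks for whole-word, case-insensitive occurrences of each dictionary term and returns the hits as a `|`-separated value. The function name `match_terms` and the sample sentence are illustrative, and matching here works on characters rather than on spaCy tokens.

```python
import re

def match_terms(text, terms):
    """Return the dictionary terms found in `text` as a '|'-separated string."""
    found = []
    for term in terms:
        # \b anchors keep "cat" from matching inside "catalog"
        pattern = r"\b" + re.escape(term) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(term)
    return "|".join(found)

terms = ("cat", "dog", "tree kangaroo", "giant sea spider", "monkey")
value = match_terms("A Tree Kangaroo chased the dog; the catalog was lost.", terms)
print(value)
```

Terms are reported in dictionary order, and a text with no hits yields an empty string.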

In [None]:
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span
class EntityMatcher(object):
    name = "entity_matcher"

    def __init__(self, nlp, terms, label):
        patterns = [nlp.make_doc(text) for text in terms]
        self.matcher = PhraseMatcher(nlp.vocab)
        self.matcher.add(label, None, *patterns)

    def __call__(self, doc):
        matches = self.matcher(doc)
        for match_id, start, end in matches:
            span = Span(doc, start, end, label=match_id)
            doc.ents = list(doc.ents) + [span]
        return doc


In [None]:
added_column = "nerAnimal#multi"
added_group = "ANIMAL"
terms = ("cat", "dog", "tree kangaroo", "giant sea spider", "monkey")

entity_matcher = EntityMatcher(nlp, terms, added_group)
nlp.add_pipe(entity_matcher, after="ner")
ent_labels.append(added_group)
col_labels.append(added_column)




## 5. Generate pre-defined #multi variables by doing NER over the selected text variable
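
Before parsing, the extraction cell in this step merges the selected columns of each row into a single string, treating missing values as empty. A minimal pure-Python sketch of that merging, assuming each row is a dict; the helper name `row_text` and the sample row are made up for illustration.

```python
import math

def row_text(row, selected):
    """Join the selected fields of one row with spaces, treating missing values as ''."""
    parts = []
    for col in selected:
        value = row.get(col)
        # None and float NaN both count as missing
        if value is None or (isinstance(value, float) and math.isnan(value)):
            value = ""
        parts.append(str(value))
    return " ".join(parts)

row = {"Title": "Rally in Texas", "Body": float("nan"), "Clicks#number": 42}
text = row_text(row, ["Title", "Body"])
print(repr(text))
```

A missing field still contributes its separator, so the merged string can carry a stray space; that has no effect on which entities are found.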

In [None]:
varcols = df.columns.tolist()
# remove any variable names that are unlikely to contain parsable text 
varcols = [x for x in varcols if '#number' not in x and '#date' not in x and '#img' not in x and '#href' not in x and '#link' not in x]

# Left panel
left_text = pn.Row("####Select Variables for NER", margin=(0,0,-15,270))
binary_selector = pn.widgets.CrossSelector(options=varcols, width=630)
left_panel = pn.Column(left_text, binary_selector, css_classes=['widget-box'], margin=(0,30,0,0))

remap_text = pn.pane.Markdown('####      Make selections and run the next cell ', width=650)

# Display widgets
# Named selector_row so it does not shadow the ipywidgets module imported as `widgets`
selector_row = pn.Row(left_panel)
full_display = pn.Column(selector_row, remap_text)
full_display

In [None]:
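def properize(txt):
    """Title-case entity text longer than 3 characters; leave shorter strings as-is."""
    if len(txt) > 3:
        txt = txt.title()
    return txt

def extract_entity(doc, label):
    """Return the unique entities of the given label as a '|'-separated string."""
    # sorted() makes the order of values reproducible between runs
    return '|'.join(sorted({properize(ent.text) for ent in doc.ents if ent.label_ == label}))

def extract_all(doc):
    """Build one #multi value per output column for a parsed document."""
    data = {}
    for col_label, ent_label in zip(col_labels, ent_labels):
        data[col_label] = extract_entity(doc, ent_label)
    return pd.Series(data)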
#     return pd.Series({
#       'person': extract_entity(doc, 'PERSON'),
#       'locs': extract_entity(doc, 'LOC'),
#     })

# Replace NA with empty in each row
# Convert row to string
# Join row with spaces
concatted = df[binary_selector.value].fillna('').astype(str).dropna().apply(lambda row: ' '.join(row), axis=1)

# Apply nlp and then extract
# extracted_df = concatted.head().apply(nlp).apply(extract_all)
extracted_df = concatted.progress_apply(nlp).apply(extract_all)

df_new = pd.concat([df, extracted_df], axis=1)
print('Dimensions:\n --- The original df: ' +str(df.shape) +'\n --- The ner-generated df: '+ str(extracted_df.shape)+'\n --- The concatenated df:' +str(df_new.shape))


## 6. Visualize the generated dataframe

In [None]:
slider(df_new)

In [None]:
# now write this back, or upload to SuAVE.

# df_new.to_csv('test_multi.csv', index=None)
# df = df_new.copy().fillna('')
#  or
# df_new.to_csv('test_2multi2.csv', index=None)
df = df_new.copy().fillna('')


## 7. Generate a new survey and open it in SuAVE
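
The last cell of this step builds the new survey's URL by replacing every character outside `[0-9a-zA-Z_]` in the survey name with an underscore and appending the result to the user name. The sketch below isolates that string assembly; the function name `build_survey_url` and the sample values are illustrative, and no request is sent.

```python
import re

def build_survey_url(base, user, survey_name, views, view):
    """Assemble a survey URL from its parts, without any network call."""
    # Replace anything that is not a letter, digit, or underscore
    safe_name = re.sub(r'[^0-9a-zA-Z_]', '_', survey_name)
    return base + user + "_" + safe_name + ".csv" + "&views=" + views + "&view=" + view

url = build_survey_url(
    base="http://suave-dev.sdsc.edu/main/file=",
    user="spatialsuave",
    survey_name="FB Ads (NER)",
    views="1110001",
    view="grid",
)
print(url)
```

Spaces and punctuation each become one underscore, so two different survey names can map to the same URL.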

In [None]:
# new filename

if data_select.value == 'Import a local CSV file':
    csv_file = data_input.filename

new_file = absolutePath + csv_file[:-4]+'_v1.csv'
printmd("<b><span style='color:red'>A new temporary file will be created at: </span></b>")
print(new_file)
df.to_csv(new_file, index=None)

In [None]:
import ipywidgets as widgets

In [None]:
#Input survey name

from IPython.display import display
input_text = widgets.Text()
output_text = widgets.Text()

def bind_input_to_output(sender):
    output_text.value = input_text.value

# Tell the text input widget to call bind_input_to_output() on submit
input_text.on_submit(bind_input_to_output)

printmd("<b><span style='color:red'>Input survey name here, press Enter, and then run the next cell:</span></b>")
# Display input text box widget for input
display(input_text)

display(output_text)

In [None]:
#Print survey name
survey_name = output_text.value
printmd("<b><span style='color:red'>Survey Name is: </span></b>" + survey_name)

In [None]:
referer = survey_url.split("/main")[0] +"/"
upload_url = referer + "uploadCSV"
new_survey_url_base = survey_url.split(user)[0]

import requests
import re

if data_select.value == 'Import a local CSV file':
    dzc_file = ''
    views = '1110001'
    view='grid'

upload_data = {
    'name': input_text.value,
    'dzc': dzc_file,
    'user':user
}
headers = {
    'User-Agent': 'suave user agent',
    'referer': referer
}

# Open the CSV in a context manager so the file handle is closed after the upload
with open(new_file, "rb") as csv_handle:
    r = requests.post(upload_url, files={"file": csv_handle}, data=upload_data, headers=headers)

if r.status_code == 200:
    printmd("<b><span style='color:red'>New survey created successfully</span></b>")
    regex = re.compile('[^0-9a-zA-Z_]')
    s_url = survey_name
    s_url =  regex.sub('_', s_url)

    url = new_survey_url_base + user + "_" + s_url + ".csv" + "&views=" + views + "&view=" + view
    print(url)
    printmd("<b><span style='color:red'>Click the URL to open the new survey</span></b>")
else:
    printmd("<b><span style='color:red'>Error creating new survey. Check if a survey with this name already exists.</span></b>")
    printmd("<b><span style='color:red'>Reason: </span></b>"+ str(r.status_code) + " " + r.reason)