<a href="https://colab.research.google.com/github/woldemarg/ds_tests/blob/master/nlp/company_7/task_solution/scripts/holomb_nlp.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import json
import re
import spacy

import pandas as pd
import numpy as np

from spacy.pipeline import EntityRuler
from keywords import keywords

Loading data

In [2]:
df = pd.read_csv("/content/data.csv.gz",
                 index_col="id")

As stated in the task description, to get the input text we have to concatenate the two parts of the JSON stored in the corresponding column.

In [3]:
df.head()

id,url,email,json,title,first_name,last_name,academic_title,department,school,processed,created_at,updated_at
1,https://www.abac.edu/,vfenn@abac.edu,"{""left"": "" the winner. The number on each ball...",,,,,,,,2019-09-16 11:37:24,2020-02-06 03:33:24
2,https://www.abac.edu/,bray@abac.edu,"{""left"": ""er person and can be purchased onlin...",,,,,,,,2019-09-16 11:37:24,2020-02-06 03:33:24
3,https://www.abac.edu/,admissions@abac.edu,"{""left"": ""ty, Prince Automotive Group, Rotary ...",,,,,,,,2019-09-16 11:37:24,2020-02-06 03:33:24
4,https://www.abac.edu/,webmaster@abac.edu,"{""left"": ""mics\nRegistrar\nTranscript Request\...",,,,,,,,2019-09-16 11:37:24,2020-02-06 03:33:24
5,https://www.alu.edu/,admissions@alu.edu,"{""left"": ""Abraham Lincoln University & Online ...",,,,,,,,2019-09-16 11:37:24,2020-02-06 03:33:24


In [4]:
strings = []

for jsn in df["json"]:
    d = json.loads(jsn)
    s = "".join(v.strip()
                .replace("\n", " ") #minor preprocessing
                .replace("\t", " ")
                for v in d.values())
    strings.append(s)

s_strings = pd.Series(strings, index=df.index)  # keep the original row ids
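
To make the concatenation step concrete, here is a minimal sketch on a made-up record. The `right` key, the sample text and the helper name `join_parts` are assumptions for illustration (only the `left` key is visible in the data preview); the cleaning itself mirrors the cell above.

```python
import json

# Hypothetical record: the text on either side of the extracted e-mail address.
# The "right" key name is an assumption; only "left" is visible in the preview.
raw = json.dumps({"left": "contact Fenn,\nemail her at ",
                  "right": ", or register\tonline."})


def join_parts(jsn):
    """Strip each part, turn newlines/tabs into spaces, and glue the parts together."""
    parts = json.loads(jsn)
    return "".join(v.strip().replace("\n", " ").replace("\t", " ")
                   for v in parts.values())


joined = join_parts(raw)
print(joined)
```

Because the parts are joined with an empty string, the words on either side of the split end up directly adjacent, which is presumably where fragments such as "email her at, or register" in the real data come from.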

One of our inputs is shown below. Normally, we would consider preprocessing the inputs by replacing artefacts and removing stop-words. In this particular case, however, it turned out that hasty text preprocessing hurts the subsequent named entity recognition step. E.g., once odd characters and stop-words are removed, the fragment Boeingâ€\x9d by Mark Camolett tends to be recognised as a single PROPN entity, Boeing Mark Camolett. So, for now, I leave all the inputs as is.

Note also that the string below includes two *names* and one *job* title (hereinafter "academic title"). Further on, I show how I deal with this challenge.

In [5]:
s_strings[1]

'the winner. The number on each ball will be associated with an individual that purchased chances to win. The grand prize is a check with a designated value equal to 50 per cent (nearest $10) of the numbered golf balls sold. To participate in the tournament or the ball drop event, interested persons can contact Fenn at (229) 391-5067, email her at, or register online at https://www.abac.edu/academics/sanr-classic/ . ### Baldwin Players Announce Cast for ABAC Fall Production September 3 2019 Baldwin Players Announce Cast for ABAC Fall Production TIFTONâ€”Baldwin Playersâ€™ Director Brian Ray has announced the cast for the theatre troupeâ€™s upcoming production of â€œBoeing, Boeingâ€\x9d by Mark Camolett'

For NER I tested several models, and *_lg* seems to perform slightly better, though it still mislabels e.g. *Hoffman Estates* and *Wilshire Highway*, as well as other combinations of 3-5 words, as person names.

That is why I add a step to the *nlp_main* pipeline that discards *PERSON* entities, firstly, if they do not have the form *Xxxx (X.) Xxxx*, and, secondly, if they include common *nouns* (as in *Wilshire Highway*). Since POS tagging seems to be highly sensitive to context, I use *nlp_helper* and feed it only the words of *doc.ents[i].text*, one at a time, without any confusing context.

In [6]:
nlp_main = spacy.load("en_core_web_lg", disable=["tagger", "parser"])

#to tag POS for words within recognised named entity
nlp_helper = spacy.load("en_core_web_sm", disable=["parser", "ner"]) 

In [7]:
def is_all_propn(st):
    propns = []
    for w in st.split(" "):
        w_doc = nlp_helper(w)
        for t in w_doc:
            propns.append(t.pos_)
    return all((el == "PROPN" for el in propns))


def correct_person_entities(nlp_doc):
    new_ents = []
    for ent in nlp_doc.ents:
        if ent.label_ == "PERSON":
            # a name is assumed to have the form Xxxx Xxxx, with or without X. in between
            if (re.search(r"^([A-Z][\w]+\s[A-Z]?\.?\s?[A-Z][\w]+)$", 
                          ent.text) and is_all_propn(ent.text)):
                new_ents.append(ent)
        else:
            new_ents.append(ent)
    nlp_doc.ents = new_ents
    return nlp_doc
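
As a quick check of what the name pattern alone accepts, the sketch below runs the same regular expression over a few sample strings. The samples other than *Brian Ray* and *Wilshire Highway* are made up for illustration, and no spaCy model is involved here.

```python
import re

# Same pattern as in correct_person_entities: "Xxxx Xxxx" with an optional
# middle initial ("X" or "X.") in between.
NAME_RE = re.compile(r"^([A-Z][\w]+\s[A-Z]?\.?\s?[A-Z][\w]+)$")

samples = ["Brian Ray", "John F. Kennedy", "Mary Ann Smith", "Fenn",
           "brian ray", "Wilshire Highway"]

# True means the string has the expected shape of a person name.
shape_ok = {s: bool(NAME_RE.search(s)) for s in samples}
for s, ok in shape_ok.items():
    print(f"{s!r}: {ok}")
```

Three-word names and single surnames are rejected by the pattern, while *Wilshire Highway* passes it. That is why the shape test is combined with the POS check in *is_all_propn*.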

The task description suggested using *Matcher* to match *academic titles*, but I found *EntityRuler* much more convenient for adding named entities based on pattern dictionaries. As an alternative, the problem could be solved by manually building several sets of n-grams (n being the number of words in the compound terms of the keyword dictionary) and looping through those sets sequentially for (fuzzy) matches.

In [8]:
#reconstructing dictionary in a form needed for EntityRuler 
patterns = []

for k, v in keywords().items():
    for s in v:
        new = {}
        new["label"] = k
        new["pattern"] = [{"LOWER": w.lower()} for w in s.split(" ")]
        patterns.append(new)
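
For illustration, this is the shape of the resulting patterns for a small made-up dictionary. The contents of `sample_keywords` are an assumption, since the real `keywords()` module is not shown here; the only thing taken from the notebook is that it maps a label to a list of phrases.

```python
# Hypothetical stand-in for keywords(): label -> list of phrases.
sample_keywords = {"academic_title": ["Associate Professor", "Director"]}

# One token-level pattern per phrase; "LOWER" makes the match case-insensitive.
patterns = [
    {"label": label,
     "pattern": [{"LOWER": word.lower()} for word in phrase.split(" ")]}
    for label, phrases in sample_keywords.items()
    for phrase in phrases
]

for p in patterns:
    print(p)
```

Each phrase becomes a list with one dictionary per word, so a two-word title only matches when both tokens appear next to each other in the text.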

In [9]:
ruler = EntityRuler(nlp_main)
ruler.add_patterns(patterns)

In [10]:
nlp_main.add_pipe(ruler, before="ner")
nlp_main.add_pipe(correct_person_entities, after="ner")

**Important note!** I understand the task as extracting data from strings that contain BOTH a person name and an academic title. That is why inputs with only partial data (a name without a title, or vice versa) were skipped.

Coming back to the cases where an input string contains several names and/or titles, possibly in unequal numbers: I assume that a name and a title placed closer to each other are more likely to belong together. For that reason, I build a matrix of distances between the recognised entities and keep, for a given input, only the pair at the minimal distance.

In [11]:
data_ls = []

for i, el in s_strings.items():
    doc = nlp_main(el)
    #only those with both name and title
    if (any([n.label_ == "academic_title" for n in doc.ents]) and 
            any([n.label_ == "PERSON" for n in doc.ents])):

        names = [n.text for n in doc.ents if n.label_ == "PERSON"]
        nx = np.asarray([n.start_char for n
                         in doc.ents if n.label_ == "PERSON"])
        titles = [n.text for n
                  in doc.ents if n.label_ == "academic_title"]
        ty = np.asarray([n.start_char for n
                         in doc.ents if n.label_ == "academic_title"])

        # matrix of distances: one row per name, one column per title
        diff_arr = np.abs(ty - nx[:, np.newaxis])
        # positions of the minimal value within the matrix
        min_vals = np.where(diff_arr == np.amin(diff_arr))
        indices = list(zip(min_vals[0], min_vals[1]))

        # keep the row id so the result can be matched back to df
        data_ls.append((i, names[indices[0][0]], titles[indices[0][1]]))
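
The pairing rule can also be checked in isolation. The sketch below re-implements it in plain Python on made-up character offsets; the function name `closest_pair` and the offsets are hypothetical, and the NumPy version in the cell above remains the one actually used.

```python
def closest_pair(names, name_starts, titles, title_starts):
    """Return the (name, title) pair whose start offsets are closest."""
    # One row per name, one column per title, as in the distance matrix.
    dist = [[abs(t - n) for t in title_starts] for n in name_starts]
    # Ties resolve to the first minimum in row-major order, which is also
    # the first position that np.where reports.
    _, i, j = min((d, i, j)
                  for i, row in enumerate(dist)
                  for j, d in enumerate(row))
    return names[i], titles[j]


# Made-up offsets, loosely modelled on the sample string.
pair = closest_pair(["Brian Ray", "Mark Camolett"], [620, 735],
                    ["Director"], [611])
print(pair)
```

Only one pair is kept per input, so a string mentioning two people with two different titles yields just the tightest pair.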

In [12]:
data_df = pd.DataFrame(data_ls, columns=["id", "name", "academic_title"])
data_df.set_index("id", inplace=True)


def remove_middle_name(nn):
    n = nn.split(" ")
    return " ".join((n[0], n[-1]))


data_df.loc[:, "name"] = data_df["name"].map(remove_middle_name)
data_df[["first_name", "last_name"]] = data_df["name"].str.split(expand=True)

data_df.loc[:, "academic_title"] = data_df["academic_title"].str.title()

In [13]:
df.update(data_df)
df.head()

id,url,email,json,title,first_name,last_name,academic_title,department,school,processed,created_at,updated_at
1,https://www.abac.edu/,vfenn@abac.edu,"{""left"": "" the winner. The number on each ball...",,Brian,Ray,Director,,,,2019-09-16 11:37:24,2020-02-06 03:33:24
2,https://www.abac.edu/,bray@abac.edu,"{""left"": ""er person and can be purchased onlin...",,,,,,,,2019-09-16 11:37:24,2020-02-06 03:33:24
3,https://www.abac.edu/,admissions@abac.edu,"{""left"": ""ty, Prince Automotive Group, Rotary ...",,,,,,,,2019-09-16 11:37:24,2020-02-06 03:33:24
4,https://www.abac.edu/,webmaster@abac.edu,"{""left"": ""mics\nRegistrar\nTranscript Request\...",,,,,,,,2019-09-16 11:37:24,2020-02-06 03:33:24
5,https://www.alu.edu/,admissions@alu.edu,"{""left"": ""Abraham Lincoln University & Online ...",,,,,,,,2019-09-16 11:37:24,2020-02-06 03:33:24


So far I was able to fill in about 56% of the rows with missing data.

In [14]:
print(1 - df["first_name"].isna().mean())

0.5609062887794725


In [15]:
df.reset_index(inplace=True)
df.to_csv("/content/data_new.csv.gz")

Measuring the time for processing a single text parcel

In [16]:
def get_data(line):
    l_doc = nlp_main(line)
    if (any([n.label_ == "academic_title" for n in l_doc.ents]) and
            any([n.label_ == "PERSON" for n in l_doc.ents])):

        names = [n.text for n in l_doc.ents if n.label_ == "PERSON"]
        nx = np.asarray([n.start_char for n
                         in l_doc.ents if n.label_ == "PERSON"])
        titles = [n.text for n
                  in l_doc.ents if n.label_ == "academic_title"]
        ty = np.asarray([n.start_char for n
                         in l_doc.ents if n.label_ == "academic_title"])

        diff_arr = np.abs(ty - nx[:, np.newaxis])
        min_vals = np.where(diff_arr == np.amin(diff_arr))
        indices = list(zip(min_vals[0], min_vals[1]))
        return (names[indices[0][0]], titles[indices[0][1]])


%timeit get_data(s_strings[1])

100 loops, best of 3: 18 ms per loop


### What could be done differently:
1. Consider a flexible strategy of text preprocessing (removing artefacts and stop-words)
2. Consider using less heavy models
3. Consider building n-grams for fuzzy matching (also testing third-party libraries like [PhuzzyMatcher](https://github.com/jackmen/PhuzzyMatcher)) or applying other advanced algorithms such as Aho-Corasick
4. Consider which is more efficient: *docs = list(nlp.pipe(texts))* or iterating through each input
5. Consider dealing with triple named entities in *correct_person_entities*
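
As a rough idea of what the n-gram approach from item 3 could look like, here is a minimal sketch that uses only the standard library. It relies on `difflib.get_close_matches` as a stand-in for a dedicated fuzzy matcher (it is neither PhuzzyMatcher nor Aho-Corasick), and the function names, the keyword list and the 0.85 cutoff are all assumptions chosen for illustration.

```python
from difflib import get_close_matches


def ngrams(tokens, n):
    """All runs of n consecutive tokens, joined back into strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def fuzzy_titles(text, keywords, cutoff=0.85):
    """Return (text fragment, matched keyword) pairs, longest keywords first."""
    # Naive whitespace tokenisation; punctuation stays attached to words.
    tokens = text.lower().split()
    hits = []
    # One pass per keyword length, so n-grams are compared only with
    # keywords that have the same number of words.
    for n in sorted({len(k.split()) for k in keywords}, reverse=True):
        candidates = [k for k in keywords if len(k.split()) == n]
        for gram in ngrams(tokens, n):
            match = get_close_matches(gram, candidates, n=1, cutoff=cutoff)
            if match:
                hits.append((gram, match[0]))
    return hits


# Hypothetical lowercase keyword list and a sentence with a misspelled title.
keywords = ["associate professor", "director"]
hits = fuzzy_titles("Jane Doe is an Asociate Professor of biology", keywords)
print(hits)
```

Unlike the exact *LOWER* patterns used with *EntityRuler*, this tolerates small typos, at the cost of comparing every n-gram against every keyword of the same length.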
