## spaCy Introduction
We start with spaCy. This first part explores the spaCy functions for natural language processing, in particular the token and entity attributes that it provides out of the box. We also look at some functions that are essential for training a custom named-entity recognition (NER) model.

In [165]:
# import required modules
import spacy
from spacy.matcher import PhraseMatcher
import pandas as pd

import warnings
warnings.filterwarnings("ignore")

In [28]:
# Load the small English model and its vocabulary
nlp = spacy.load("en_core_web_sm")

In [157]:
def print_nlp_details(doc):
    """Print the main linguistic attributes of every token in the doc."""
    for token in doc:
        print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_, token.shape_, token.is_alpha, token.is_stop)

In [158]:
def print_ner_details(doc):
    """Print text, character offsets and label of every named entity in the doc."""
    for ent in doc.ents:
        print(ent.text, ent.start_char, ent.end_char, ent.label_)

In [159]:
# check a small sentence
# understand their NLP tags
# understand their NER tags

doc = nlp(u'Barrack Obama was the President of United States of America.')
print_nlp_details(doc)
print_ner_details(doc)

Barrack barrack PROPN NNP compound Xxxxx True False
Obama obama PROPN NNP nsubj Xxxxx True False
was be VERB VBD ROOT xxx True True
the the DET DT det xxx True True
President president PROPN NNP attr Xxxxx True False
of of ADP IN prep xx True True
United united PROPN NNP compound Xxxxx True False
States states PROPN NNP pobj Xxxxx True False
of of ADP IN prep xx True True
America america PROPN NNP pobj Xxxxx True False
. . PUNCT . punct . False False
Barrack Obama 0 13 PERSON
United States of America 35 59 GPE


In [30]:
# English vocabulary
len(nlp.vocab)

57852

We can declare our own label and use it for a custom named-entity mapping. For example, we create a label called `USPREZ`. spaCy's built-in `PhraseMatcher` finds the token sequences in a document that match a set of phrases registered under a label, so we first need to add the label and its phrases to our matcher. The matcher returns each match as a (match id, start token, end token) triple. We then create a utility function, `offsetter`, that converts these token positions into character indices in the string. This is required to train our custom NER model.

In [31]:
# declare a custom label - USPREZ for US president
# add it to a matcher created on top of the initial English vocabulary
label = "USPREZ"
matcher = PhraseMatcher(nlp.vocab)
for i in ["Barack", "Obama", "Barack Obama"]:
    matcher.add(label, None, nlp(i))

After adding the label to the matcher, let's apply it to one sentence. The output gives the token indices of the matched phrases in the sentence.

In [32]:
# test the output of matcher on a custom sentence
one = nlp('Barack Obama was the President of United States of America.')
matches = matcher(one)
print([match for match in matches])

[(2934372180660327885, 0, 1), (2934372180660327885, 0, 2), (2934372180660327885, 1, 2)]


In [33]:
"""
Utility function to get start and end indices of phrases that match in the doc.
"""
def offsetter(label, doc, matchitem):
    o_one = len(str(doc[:matchitem[1]]))
    subdoc = doc[matchitem[1]:matchitem[2]]
    o_two = o_one + len(str(subdoc))
    return o_one, o_two, label

Let's test it for the Barack Obama string.

In [34]:
# test utility function on custom matcher
print([offsetter(label, one, match) for match in matches])

[(0, 6, 'USPREZ'), (0, 12, 'USPREZ'), (6, 11, 'USPREZ')]
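Note that the last tuple above is `(6, 11)`, while "Obama" actually occupies characters 7 to 12 of the sentence. `offsetter` measures `str(doc[:start])`, and the string form of a span does not include the whitespace after its last token, so any match that does not start at token 0 comes out one character early. The sketch below reproduces the arithmetic in plain Python under the assumption that each token is stored together with its trailing whitespace; `char_offsets` and the token list are hypothetical names used only for illustration. spaCy spans also expose `start_char` and `end_char` attributes, which may be the simpler route in practice.

```python
def char_offsets(tokens_with_ws, start, end, label):
    """Character span covered by tokens[start:end].

    tokens_with_ws: each token followed by its own trailing whitespace
    (an assumption made for this sketch).
    """
    # everything before the match, including the separating whitespace
    begin = sum(len(tok) for tok in tokens_with_ws[:start])
    # the match itself, without whitespace after its last token
    span_text = "".join(tokens_with_ws[start:end]).rstrip()
    return begin, begin + len(span_text), label


tokens = ["Barack ", "Obama ", "was ", "the ", "President", "."]
text = "".join(tokens)

for start, end in [(0, 1), (0, 2), (1, 2)]:
    begin, stop, label = char_offsets(tokens, start, end, "USPREZ")
    print((begin, stop, label), repr(text[begin:stop]))
```

With offsets computed this way, slicing the original text with `text[begin:stop]` gives back the matched phrase without any extra adjustment.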


## Spacy applied to pandas dataframe
One way to fetch data from Wikipedia is to use the database dumps provided by Wikimedia. However, the full dumps are very large, and parsing them for the specific data we need can be tedious. Instead, let's see if we can parse similar data from one of the smaller datasets, the Wikibooks dump, available from https://dumps.wikimedia.org/backup-index.html. Let's create a script to parse data from the enwikibooks dump.

In [35]:
# The data has already been extracted from the XML dump with wikiextractor into files on my local machine.
# Sample data is shown below.
# Each doc block contains the content of one page.
"""
<doc id="5" url="https://en.wikibooks.org/wiki?curid=5" title="Organic Chemistry/Cover">
Organic Chemistry/Cover

Welcome to the world's foremost open content<br>Organic Chemistry Textbook<br>on the web!

Organic chemistry is primarily devoted to the unique properties of the carbon atom and its compounds. These compounds play a critical role in biology and ecology, Earth sciences and geology, physics, industry, medicine and — of course — chemistry. At first glance, the new material that organic chemistry brings to the table may seem complicated and daunting, but all it takes is concentration and perseverance. Millions of students before you have successfully passed this course and you can too!

This field of chemistry is based less on formulas and more on reactions between various molecules under different conditions. Whereas a typical general chemistry question may ask a student to compute an answer with an equation from the chapter that they memorized, a more typical organic chemistry question is along the lines of "what product will form when substance X is treated with solution Y and bombarded by light". The key to learning organic chemistry is to "understand" it rather than cram it in the night before a test. It is all well and good to memorize the mechanism of Michael addition, but a superior accomplishment would be the ability to explain "why" such a reaction would take place.

As in all things, it is easier to build up a body of new knowledge on a foundation of solid prior knowledge. Students will be well served by much of the knowledge brought to this subject from the subject of General Chemistry. Concepts with particular importance to organic chemists are covalent bonding, Molecular Orbit theory, VSEPR Modeling, understanding acid/base chemistry vis-a-vis pKa values, and even trends of the periodic table. This is by no means a comprehensive list of the knowledge you should have gained already in order to fully understand the subject of organic chemistry, but it should give you some idea of the things you need to know to succeed in an organic chemistry test or course.

Organic Chemistry is one of the subjects which are very useful and close to our daily life. We always try to figure out some of the unknown mysteries of our daily life through our factious thinking habit, which generates superstitions. Through the help of chemistry we can help ourselves to get out of this kind of superstition.
We always try to find the ultimate truth through our own convenience. In the ancient past we had struggled to make things to go as per our need. In that context we have found fire, house, food, transportation, etc…

Now the burning question is: "how can chemistry help our daily life?" To find the answer of this questions, we have to know the subject thoroughly. Let us start it from now.


</doc>
"""
fname = "/Users/saurabh/workspace/datasets/wiki/AA/wiki_00"

In [36]:
from nltk.tokenize import sent_tokenize, WordPunctTokenizer
import uuid

In [37]:
wtkz = WordPunctTokenizer()

In [38]:
"""
We assign each document with a unique page id.
We assign every line in the document with unique id.
"""
def process_file(fname):
    # flat list of (doc_id, line_id, text) records, one per retained line
    records = []
    doc_id = None
    in_doc = False
    with open(fname, encoding="utf-8") as f:
        for line in f:
            if line.startswith('</doc'):
                # end of the current document block
                in_doc = False
            elif line.startswith('<doc'):
                # start of a new block: read the page id from the opening tag
                doc_id = line.split('<doc id="')[1].split('"')[0]
                in_doc = True
            elif in_doc and len(line) > 20:
                # keep only lines long enough to carry content
                records.append((doc_id, uuid.uuid4(), line.strip()))
    return records

In [39]:
data = process_file(fname)
print(data[0])
raw_df = pd.DataFrame(data, columns=["ids", "txt_id", "text"])

('5', UUID('9d53b591-7309-418d-ac36-bde4b11c7902'), 'Organic Chemistry/Cover')


In [40]:
raw_df.head()

Unnamed: 0,ids,txt_id,text
0,5,9d53b591-7309-418d-ac36-bde4b11c7902,Organic Chemistry/Cover
1,5,51b182e3-b3e9-4454-8a55-3808b54a1c1b,Welcome to the world's foremost open content<b...
2,5,031c71b5-4db7-4e77-84ef-1b3dff096b49,Organic chemistry is primarily devoted to the ...
3,5,2e31e4ca-f43c-4d63-9fd1-6e14e7b55e64,This field of chemistry is based less on formu...
4,5,6ab99378-e42d-47bc-a25b-12909b1d0d0b,"As in all things, it is easier to build up a b..."


In [41]:
print(raw_df["ids"].count())

4542


In [42]:
# Apply the spaCy pipeline to every row; named entities are extracted from the resulting Doc objects below.
raw_df["nlp_txt"] = raw_df["text"].apply(nlp)

In [43]:
raw_df.head()

Unnamed: 0,ids,txt_id,text,nlp_txt
0,5,9d53b591-7309-418d-ac36-bde4b11c7902,Organic Chemistry/Cover,"(Organic, Chemistry, /, Cover)"
1,5,51b182e3-b3e9-4454-8a55-3808b54a1c1b,Welcome to the world's foremost open content<b...,"(Welcome, to, the, world, 's, foremost, open, ..."
2,5,031c71b5-4db7-4e77-84ef-1b3dff096b49,Organic chemistry is primarily devoted to the ...,"(Organic, chemistry, is, primarily, devoted, t..."
3,5,2e31e4ca-f43c-4d63-9fd1-6e14e7b55e64,This field of chemistry is based less on formu...,"(This, field, of, chemistry, is, based, less, ..."
4,5,6ab99378-e42d-47bc-a25b-12909b1d0d0b,"As in all things, it is easier to build up a b...","(As, in, all, things, ,, it, is, easier, to, b..."


In [44]:
# get named entities
def get_ners(doc):
    val = []
    for ent in doc.ents:
#         return ent.text, ent.start_char, ent.end_char, ent.label_
        val.append((ent.text, ent.label_))
    return val

In [45]:
data_new = raw_df["nlp_txt"].apply(get_ners)

In [46]:
data_new.loc[4]

[('General Chemistry', 'ORG'),
 ('Concepts', 'PERSON'),
 ('Molecular Orbit', 'PRODUCT'),
 ('Modeling', 'GPE')]
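The labels above suggest that the general-purpose model is not a good fit for this text: "Concepts" is tagged as PERSON and "Modeling" as GPE. One quick way to see which labels dominate is to count them across rows. The following is a minimal sketch, assuming rows shaped like the output of `get_ners`; `count_labels` is a hypothetical helper, the first sample row copies the output above, and the second row is made up for illustration.

```python
from collections import Counter


def count_labels(rows):
    """Count entity labels across rows of (text, label) pairs."""
    return Counter(label for row in rows for _, label in row)


rows = [
    # copied from the output of data_new.loc[4]
    [("General Chemistry", "ORG"), ("Concepts", "PERSON"),
     ("Molecular Orbit", "PRODUCT"), ("Modeling", "GPE")],
    # made-up row, only to show aggregation over several rows
    [("Michael", "PERSON")],
]

print(count_labels(rows).most_common())
```

On the real data this would be called as `count_labels(data_new)`, since iterating over a pandas Series yields its values.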

In [47]:
raw_df.loc[2, "text"]

'Organic chemistry is primarily devoted to the unique properties of the carbon atom and its compounds. These compounds play a critical role in biology and ecology, Earth sciences and geology, physics, industry, medicine and — of course — chemistry. At first glance, the new material that organic chemistry brings to the table may seem complicated and daunting, but all it takes is concentration and perseverance. Millions of students before you have successfully passed this course and you can too!'

## Create Seed Words
Fetch the seed words from the Wikipedia page that lists musical instruments.

In [106]:
import requests
from bs4 import BeautifulSoup

list_mi_url = "https://en.wikipedia.org/wiki/List_of_musical_instruments"
resp = requests.get(list_mi_url).text

soup = BeautifulSoup(resp, 'lxml')

def get_links(lnks):
    # collect the title attribute of every link that has one
    mis = []
    for lnk in lnks:
        title = lnk.get('title')
        if title is not None:
            mis.append(title)
    return mis

# the instrument lists are rendered as sortable wiki tables
tbls = soup.find_all('table', {'class': 'wikitable sortable'})

mi_seeds = []
for tbl in tbls:
    lnks = tbl.find_all('a')
    mi_seeds.extend(get_links(lnks))

print(len(mi_seeds))
mi_seeds[:5]

640


['Agung a Tamlang', 'Bamboo slit drum', 'Enlarge', 'Balafon', 'Cajón']

## Create Music Instrument Corpus

We can run NER on these Wikipedia dumps. However, let's try capturing our own custom data and then train an NER model on it. Before that we need annotated data, which we create using the following algorithm.

This script is a starter for named-entity recognition on Wikipedia data.
We have to do the following tasks in sequence:
1. Extract data from Wikipedia.<br />
2. Annotate the data according to an algorithm. The algorithm currently employed:<br />
    2.1 Start with the seed words and call `wikipedia.page(seed_word)` for each.<br />
    2.2 For every page, take the title and the first line. Since the seeds are musical terms, if the title is present in the first line of the page, we mark it as a MUSIC token.<br />
    2.3 Navigate to the first 5 links, which are most probably also musical terms. Starting with the first link, apply 2.2 again.<br />
    2.4 Continue to the specified depth.<br />
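
The steps above can be sketched as a depth-limited, breadth-first walk. To keep the example runnable without network access, it uses a small in-memory dictionary in place of the `wikipedia` package; `FAKE_WIKI`, `crawl`, and the page texts are all made up for illustration and are not part of the notebook's pipeline.

```python
from collections import deque

# Hypothetical stand-in for Wikipedia: title -> (first line, linked titles).
FAKE_WIKI = {
    "Drum": ("The drum is a percussion instrument", ["Slit drum", "Kakko"]),
    "Slit drum": ("A slit drum is a hollow percussion instrument", ["Drum"]),
    "Kakko": ("A small double-headed instrument", ["Drum"]),
}


def crawl(seeds, max_depth=2, num_links=5):
    """Walk from the seeds and label pages whose title occurs in their first line."""
    seen = set()
    queue = deque((seed, 0) for seed in seeds)
    annotated = []
    while queue:
        title, depth = queue.popleft()
        if title in seen or title not in FAKE_WIKI:
            continue
        seen.add(title)
        line1, links = FAKE_WIKI[title]
        # step 2.2: mark the title as MUSIC only if the first line contains it
        label = "MUSIC" if title.lower() in line1.lower() else None
        annotated.append((title, line1, label))
        # steps 2.3 and 2.4: follow the first few links until the depth limit
        if depth + 1 < max_depth:
            queue.extend((link, depth + 1) for link in links[:num_links])
    return annotated


for row in crawl(["Drum"], max_depth=2):
    print(row)
```

Tracking visited titles in `seen` matters here because pages link back to each other, as "Slit drum" and "Drum" do in the toy data.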

In [48]:
import wikipedia

In [107]:
# seed_words = ["musical instruments"]
seed_words = set(mi_seeds)

In [214]:
mi_seeds[50:60]

['Bass Drum',
 'Goblet drum',
 'Hira-daiko (page does not exist)',
 'Idakka',
 'Ilimba drum',
 'Janggu',
 "Jew's harp",
 'Kakko (instrument)',
 'Kanjira',
 'Kendang']
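
The scraped titles contain some noise: the earlier output includes 'Enlarge', which comes from an image link, and the slice above includes 'Hira-daiko (page does not exist)'. A minimal sketch of a cleanup step follows, assuming these two patterns are the only ones worth dropping; `clean_seeds` is a hypothetical helper and is not applied in the cells below, so the counts reported in this notebook are unaffected.

```python
def clean_seeds(titles):
    """Drop link titles that are not usable as seed words."""
    cleaned = set()
    for title in titles:
        # image-viewer links and red links do not correspond to real pages
        if title == "Enlarge" or title.endswith("(page does not exist)"):
            continue
        cleaned.add(title)
    return cleaned


sample = ["Agung a Tamlang", "Enlarge", "Hira-daiko (page does not exist)",
          "Kakko (instrument)", "Kanjira"]
print(sorted(clean_seeds(sample)))
```

Disambiguating qualifiers such as "(instrument)" are kept here, because they are part of the real page title and are needed to fetch the page.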

In [50]:
def fetch_page_data(wiki_page, indent=0):
    data_map = {}
    print('\t'*indent ,wiki_page.title)
#     print('\t' ,wiki_page.content.split('.')[0])
#     print('\t' ,wiki_page.links[:4])
    data_map["title"] = wiki_page.title
    data_map["line1"] = wiki_page.content.split('.')[0]
    return data_map

In [55]:
def fetch_music_data(seed, data, depth=-1):
    if depth == 1:
        return
    search_pages = wikipedia.search(seed)
    for sp in search_pages:
        wiki_page = wikipedia.page(sp)
        data.append(fetch_page_data(wiki_page))
        for wpl in wiki_page.links[:1]:
            lwpage = wikipedia.page(wpl)
            data.append(fetch_page_data(lwpage, 1))
    return data

In [101]:
def fetch_wiki_page_data(wiki_page, num_links=5, indent=0, print_title=True):
    if print_title:
        print('\t'*indent ,wiki_page.title)
    title = wiki_page.title
    line1 = wiki_page.content.split('.')[0]
    other_content = wiki_page.content
    linked_titles = wiki_page.links[:num_links]
    return title, line1, other_content, linked_titles

In [99]:
def fetch_music_related_titles(seed, data, depth=-1, fetch_associations=False):
    if depth == 1:
        return data
    search_pages = wikipedia.search(seed)
    for sp in search_pages:
        try:
            wiki_page = wikipedia.page(sp)
            title, line1, other_content, linked_titles = fetch_wiki_page_data(wiki_page)
            # collect into the list passed in by the caller, not a global
            data.append((title, line1, other_content, linked_titles))
            # fetch pages of associated links on the website
            if fetch_associations:
                for wpl in linked_titles:
                    try:
                        lwpage = wikipedia.page(wpl)
                        data.append(fetch_wiki_page_data(lwpage, indent=1))
                    except Exception:
                        print(wpl, "No match")
        except Exception:
            print(sp, "No match")
    return data

In [122]:
def get_wiki_page(title, print_title=False):
    wiki_page = wikipedia.page(title)
    # pass print_title by keyword; positionally it would be taken as num_links
    return fetch_wiki_page_data(wiki_page, print_title=print_title)

In [None]:
music_corpus = []
for idx, seed in enumerate(seed_words):
    print(idx)
    try:
        music_corpus.append(get_wiki_page(seed, print_title=False))
    except Exception:
        print(seed, "**** No match or ambiguous ****")

In [115]:
music_df = pd.DataFrame(music_corpus, columns=["title", "line1", "content", "links"])
music_df.head()

Unnamed: 0,title,line1,content,links
0,Agung a tamlang,The Agung a Tamlang is a type of Philippine sl...,The Agung a Tamlang is a type of Philippine sl...,"[Acme siren, Agung, Babendil, Bass drum, Bell ..."
1,Slit drum,A slit drum is a hollow percussion instrument,A slit drum is a hollow percussion instrument....,"['Aparima, 'ote'a, 'upa'upa, Aboriginal dugout..."
2,Balafon,The balafon is a kind of xylophone or percussi...,The balafon is a kind of xylophone or percussi...,"[Acme siren, African Rumba, Afro Celt Sound Sy..."
3,Cajón,"A cajón (Spanish: [kaˈxon]; ""box"", ""crate"" or ...","A cajón (Spanish: [kaˈxon]; ""box"", ""crate"" or ...","[Acme siren, Acoustic guitar, Afro-Peruvian mu..."
4,Castanets,"Castanets, also known as clackers or palillos,...","Castanets, also known as clackers or palillos,...","[Acme siren, African dance, Ahenk, Ajoblanco, ..."


In [121]:
# save the crawled data in a csv file
# f_name = "/Users/saurabh/workspace/datasets/wikimusic/instruments.csv"
# music_df.to_csv(f_name, index=None, header=True, sep='|', na_rep='-')

# save title and first line of wikipage
f_name = "/Users/saurabh/workspace/datasets/wikimusic/instruments_line1.csv"
music_df.loc[:, ["title", "line1"]].to_csv(f_name, index=None, header=True, sep='|', na_rep='-')

## Spacy on Music Instruments data corpus

In [175]:
import numpy as np
from nltk import ngrams
from nltk.corpus import stopwords
import re

In [168]:
stopset = set(stopwords.words('english'))

In [124]:
# load instruments_line1.csv
mi_f_name = "/Users/saurabh/workspace/datasets/wikimusic/instruments_line1.csv"
mi_df = pd.read_csv(mi_f_name, delimiter='|')
mi_df.head(2)

Unnamed: 0,title,line1
0,Agung a tamlang,The Agung a Tamlang is a type of Philippine sl...
1,Slit drum,A slit drum is a hollow percussion instrument


In [141]:
flatten = lambda l: [item for sublist in l for item in sublist]

In [176]:
in_brackets = r'\([^)]*\)'

In [197]:
def title_combinations(title):
    # normalise: lowercase and drop parenthesised qualifiers such as "(instrument)"
    title = title.lower()
    title = re.sub(in_brackets, '', title).strip()
    nn_grams = [title]
    # note: str.split treats r'\s+' as a literal separator, not a regex, so
    # t_split is [title] and the n-gram loop below adds nothing
    t_split = title.split(r'\s+')
    # hack to handle ngrams generated that start with stopwords
    if any(word in stopset for word in t_split):
        return nn_grams
    for i in range(2, len(title)-1):
        the_grams = ngrams(t_split, i)
        str_grams = [" ".join(words) for words in the_grams]
        nn_grams.extend(str_grams)
    return list(set(nn_grams))

In [198]:
t = "triange (musical instrument)"
doc = nlp(t)
print_ner_details(doc)
print("")
print_nlp_details(doc)
title_combinations(t)


triange triange ADJ JJ ROOT xxxx True False
( ( PUNCT -LRB- punct ( False False
musical musical ADJ JJ amod xxxx True False
instrument instrument NOUN NN appos xxxx True False
) ) PUNCT -RRB- punct ) False False


['triange']
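
As written, `title_combinations` returns only the cleaned title, because `str.split(r'\s+')` looks for the literal characters `\s+` instead of splitting on whitespace. If shorter n-grams of multi-word titles are wanted, a variant along the following lines could be used. This is a sketch under assumptions: `title_variants` and `word_ngrams` are hypothetical names, the stopword set is a tiny illustrative one and not NLTK's list, and swapping it in would change the vocabulary size reported below.

```python
import re

IN_BRACKETS = re.compile(r'\([^)]*\)')
# Tiny illustrative stopword set (assumption); the notebook uses NLTK's English list.
STOPWORDS = {"a", "an", "the", "of", "and"}


def word_ngrams(words, n):
    """All runs of n consecutive words, joined by single spaces."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def title_variants(title):
    """Cleaned title plus its n-grams of 2 up to (word count - 1) words."""
    title = IN_BRACKETS.sub('', title.lower()).strip()
    words = re.split(r'\s+', title)
    variants = [title]
    # same guard as the original: skip n-grams when the title contains a stopword
    if any(word in STOPWORDS for word in words):
        return variants
    for n in range(2, len(words)):
        variants.extend(word_ngrams(words, n))
    return sorted(set(variants))


print(title_variants("Triangle (musical instrument)"))
print(title_variants("Bamboo slit drum"))
print(title_variants("Agung a Tamlang"))
```

Single words are deliberately not generated, matching the original loop that starts at 2, so a generic word such as "drum" only enters the vocabulary if it is itself a page title.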

In [206]:
mi_raw_vocab = mi_df["title"].apply(title_combinations)
mi_raw_vocab[-10:]
mi_vocab = flatten(mi_raw_vocab.values)

In [229]:
print(len(mi_vocab))
print(mi_vocab[:5])

nlp_vocab = nlp.vocab
print(len(nlp_vocab))

604
['agung a tamlang', 'slit drum', 'balafon', 'cajón', 'castanets']
59096


In [236]:
label = "MUSIC"
matcher = PhraseMatcher(nlp_vocab)
for mi_word in mi_vocab:
    matcher.add(label, None, nlp(mi_word))

In [240]:
test_string = "The ashiko  is a drum, shaped like a tapered cylinder (or truncated cone) with the head on the wide end, and the narrow end open"
one = nlp(test_string.lower())
matches = matcher(one)
print([match for match in matches])
offsets = [offsetter(label, one, match) for match in matches]
print(offsets)
print([test_string[offset[0]:offset[1]+1] for offset in offsets])

[(9502756511836460881, 1, 2), (9502756511836460881, 5, 6)]
[(3, 9, 'MUSIC'), (16, 20, 'MUSIC')]
[' ashiko', ' drum']


In [148]:
def append_to_data(text, ret_data):
    tkns = wtkz.tokenize(text)
    for tkn in tkns:
        ret_data.append((tkn, 'O'))
    return ret_data

def split_and_mark(row):
    ret_data = []
    # look the columns up by name so extra columns and column order do not matter
    title, line1 = row["title"], row["line1"]
    # case-insensitive search for the title inside the first line
    idx = line1.lower().find(title.lower())
    if idx == -1:
        ret_data = append_to_data(line1, ret_data)
    else:
        end_idx = idx+len(title)
        # text before the title, the title itself, then the text after it
        sd = line1[:idx]
        ret_data = append_to_data(sd, ret_data)
        ret_data.append((line1[idx:end_idx], 'MUSIC'))
        sd = line1[end_idx:]
        ret_data = append_to_data(sd, ret_data)
    return ret_data

In [157]:
marked_data = music_df.apply(split_and_mark, axis=1)
marked_data.head(5)

0    [(A, O), (musical instrument, MUSIC), (is, O),...
1    [(During, O), (the, O), (20th, O), (century, O...
2    [(21st-century classical music, MUSIC), (is, O...
3    [(The, O), (terms, O), (A-side and B-side, MUS...
4    [(This, O), (is, O), (a, O), (list of musical ...
dtype: object
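
The same title lookup can also produce character-offset annotations, which is the shape spaCy v2 training examples take: `(text, {"entities": [(start, end, label)]})`. The sketch below assumes that format and marks only the first occurrence of the title, as `split_and_mark` does; `to_spacy_example` is a hypothetical helper and not part of the notebook.

```python
def to_spacy_example(title, line1, label="MUSIC"):
    """Build a (text, annotations) pair with character offsets for the title."""
    # case-insensitive search, first occurrence only
    idx = line1.lower().find(title.lower())
    entities = []
    if idx != -1:
        entities.append((idx, idx + len(title), label))
    return line1, {"entities": entities}


text, ann = to_spacy_example("Slit drum", "A slit drum is a hollow percussion instrument")
print(ann)
start, end, _ = ann["entities"][0]
print(repr(text[start:end]))
```

Rows where the title does not occur in the first line yield an empty entity list, so they can still serve as negative examples.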

In [162]:
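# build the output as a list of lines and join once at the end
out_lines = []
for dlist in marked_data:
    for token, tag in dlist:
        out_lines.append(token + '\t' + tag + '\n')
    # blank line between sentences
    out_lines.append('\n')
write_data = ''.join(out_lines)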

In [164]:
with open("output.txt", "w") as f:
    f.write(write_data)

In [173]:
# sample data from output.txt
"""
The     O
ANS synthesizer MUSIC
is      O
a       O
photoelectronic O
musical O
instrument      O
created O
by      O
Russian O
engineer        O
Evgeny  O
Murzin  O
from    O
1937    O
to      O
1957    O

ARP Instruments MUSIC
,       O
Inc     O

The     O
ARP Odyssey     MUSIC
is      O
an      O
analog  O
synthesizer     O
introduced      O
in      O
1972    O
"""

""

''
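
To check the written file, or to feed it to a later training step, the tab-separated format can be read back into one list of (token, tag) pairs per sentence. This is a minimal sketch, assuming each non-blank line holds exactly one tab between token and tag and that blank lines separate sentences, as in the writer above; `read_tagged` is a hypothetical helper.

```python
def read_tagged(text):
    """Parse 'token<TAB>tag' lines into a list of sentences."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # a blank line closes the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        # split on the last tab so multi-word tokens stay intact
        token, tag = line.rsplit("\t", 1)
        current.append((token, tag))
    if current:
        sentences.append(current)
    return sentences


sample = "The\tO\nARP Odyssey\tMUSIC\nis\tO\n\nARP Instruments\tMUSIC\n,\tO\n"
for sentence in read_tagged(sample):
    print(sentence)
```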