## CodeLab 1.1: Exploring Spacy's Entity and Dependency Parsers--NLTweets

Using the same workflow as in CodeLab1Spacy, we start by listing the working directory's contents to verify the file we need is available to future cells:

In [1]:
ls

 Volume in drive C has no label.
 Volume Serial Number is 44BF-F5C5

 Directory of C:\Users\Joshf\Documents\nltweets\codelabs

04/24/2019  11:00 PM    <DIR>          .
04/24/2019  11:00 PM    <DIR>          ..
04/24/2019  10:05 PM    <DIR>          .ipynb_checkpoints
03/17/2019  01:54 PM            79,235 ClassificationExperiment.ipynb
03/16/2019  12:43 PM            32,883 CodeLab0TwitterAPI.ipynb
04/22/2019  03:13 PM            17,559 CodeLab1Spacy.ipynb
04/24/2019  11:00 PM            72,662 CodeLab2Spacy.ipynb
04/22/2019  03:13 PM           326,119 corpus.txt
03/17/2019  01:54 PM               178 creds.txt
03/03/2019  05:21 PM           145,906 labeled_tweets.csv
               7 File(s)        674,542 bytes
               3 Dir(s)  66,688,348,160 bytes free


This time, so that we can pull either the first or the last lines of the file for review, we'll store the number of lines in the text file in the variable "num." Then we'll proceed as before: open the file as "f" and print the lines we want by slicing with a 0-based index.

In [2]:
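file = 'corpus.txt'
# Count the lines once so we can slice from the end of the file
with open(file) as f:
    num = len(f.readlines())
# Print the last ten lines
with open(file) as f:
    for line in f.readlines()[num-10:]:
        print(line)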

b"It's #NYE, celebrate with family - not with us. Plan to take public transportation, taxi service or designate a sober driver. Find out more about taking public transportation @sfmta_muni. #SFPD #DontDrinkAndDrive #SF #Newyear #happynewyear2019 https://t.co/yD7iGMc1z0"

b'Reminder: You can now Ride #SFMuni Free until 5 a.m. tomorrow. Remember, do not tag your Clipper Card or activate your #MuniMobile ticket during the free service period. https://t.co/fIhbjdEXkg https://t.co/qtSuigqqWR'

b"It's #NYE, celebrate with family - not with us. Plan to take public transportation, taxi service or designate a sober driver. Find out more about taking public transportation @sfmta_muni. #SFPD #DontDrinkAndDrive #SF #Newyear #happynewyear2019 https://t.co/yD7iGMc1z0"

b'.@sfmta_muni = FREE\n@rideact = FREE\n@Caltrain = FREE\n@SFBART = NOT FREE (but there will be special service until 3 a.m.)\n\nhttps://t.co/wZWiCFsHhS'

b"It's #NYE, celebrate with family - not with us. Plan to take public transport
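
As an aside, the last few lines of a file can also be collected in a single pass, without counting the lines first. The following is a minimal sketch using only the standard library; `last_lines` is a hypothetical helper name, not part of the original notebook.

```python
from collections import deque

def last_lines(source, n=10):
    """Return the last n items of any iterable of lines (e.g. an open file)."""
    # A deque with maxlen=n discards the oldest line whenever a new one
    # arrives, so at most n lines are held in memory at any time.
    return list(deque(source, maxlen=n))

# Intended use with the corpus file from this CodeLab:
#     with open('corpus.txt') as f:
#         for line in last_lines(f, 10):
#             print(line)
print(last_lines(["first\n", "second\n", "third\n"], n=2))
```

Because an open file is itself an iterable of lines, the same helper works on `f` directly and avoids reading the file twice.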

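We'll read every line of the file into the list "lines," strip the bytes-literal wrapper from each one (the leading b' and the trailing quote and newline), and print the last few.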

In [3]:
lines = []
with open(file) as f:
    for line in f.readlines():
        lines.append(line)
        
# Drop the leading b' (or b") and the trailing quote + newline from each line
for index, line in enumerate(lines):
    lines[index] = line[2:-2]

for line in lines[num-5:]:
    print(line)

Have fun, stay warm, &amp; be safe on New Year\xe2\x80\x99s Eve. Dress in layers for the chilly &amp; windy San Francisco weather. Consider wearing a warm hat &amp; a pair of gloves. Take public transit such as @sfmta_muni or @SFBART to watch the fireworks at the Embarcadero. https://t.co/flgGFfSFGh
Reminder: You can now Ride #SFMuni Free until 5 a.m. tomorrow. Remember, do not tag your Clipper Card or activate your #MuniMobile ticket during the free service period. https://t.co/fIhbjdEXkg https://t.co/qtSuigqqWR
It's #NYE, celebrate with family - not with us. Plan to take public transportation, taxi service or designate a sober driver. Find out more about taking public transportation @sfmta_muni. #SFPD #DontDrinkAndDrive #SF #Newyear #happynewyear2019 https://t.co/EJOnYI17qZ
Reminder: You can now Ride #SFMuni Free until 5 a.m. tomorrow. Remember, do not tag your Clipper Card or activate your #MuniMobile ticket during the free service period. https://t.co/fIhbjdEXkg https://t.co/qtSuig
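
The output above still contains escaped bytes such as `\xe2\x80\x99` and HTML entities such as `&amp;`, because slicing only removes the wrapper characters. A possible alternative, sketched below, is to parse each line as the Python bytes literal it appears to be and decode it. This assumes every line in the file is a complete bytes literal encoded as UTF-8; `decode_tweet` is a hypothetical helper, not part of the original notebook.

```python
import ast
import html

def decode_tweet(raw_line):
    """Convert one line holding a bytes literal into readable text."""
    # literal_eval safely parses the literal without executing any code.
    raw_bytes = ast.literal_eval(raw_line.strip())
    # Assumption: the tweets were UTF-8 encoded before being written out.
    text = raw_bytes.decode("utf-8", errors="replace")
    # Turn HTML entities such as &amp; back into plain characters.
    return html.unescape(text)

sample = r"b'New Year\xe2\x80\x99s Eve: hat &amp; gloves'"
print(decode_tweet(sample))
```

A line that is not a valid literal (for example, one cut off partway) would raise an error in `ast.literal_eval`, so a real run over the corpus may need a fallback for such lines.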

This time we're going to use a different language model in order to take advantage of spacy's [dependency parser](https://spacy.io/usage/linguistic-features).

In [4]:
!python -m spacy download en_core_web_sm

symbolic link created for C:\Users\Joshf\AppData\Local\Continuum\anaconda3\envs\nltweets\lib\site-packages\spacy\data\en_core_web_sm <<===>> C:\Users\Joshf\AppData\Local\Continuum\anaconda3\envs\nltweets\lib\site-packages\en_core_web_sm

    Linking successful
    C:\Users\Joshf\AppData\Local\Continuum\anaconda3\envs\nltweets\lib\site-packages\en_core_web_sm
    -->
    C:\Users\Joshf\AppData\Local\Continuum\anaconda3\envs\nltweets\lib\site-packages\spacy\data\en_core_web_sm

    You can now load the model via spacy.load('en_core_web_sm')



In [5]:
import spacy
from spacy import displacy

nlp = spacy.load('en_core_web_sm')

In [6]:
nlp.pipeline

[('tagger', <spacy.pipeline.Tagger at 0x1fadaaf0f60>),
 ('parser', <spacy.pipeline.DependencyParser at 0x1fadab7bfc0>),
 ('ner', <spacy.pipeline.EntityRecognizer at 0x1fadac680a0>)]

Let's look at one example from the dataset, the line at index 11:

In [23]:
lines[11]

'@sfmta_muni pls send out the 31.'

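Passing the text to the spacy pipeline object "nlp" runs the tagger, parser, and entity recognizer over it and returns a processed document whose tokens we can inspect: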

In [24]:
doc = nlp(lines[11])

In [25]:
for token in doc:
    print(token.text, token.pos_, token.tag_, token.dep_,
            token.shape_, token.is_alpha, token.is_stop)

@sfmta_muni VERB VB punct @xxxx_xxxx False False
pls INTJ UH intj xxx True False
send VERB VB ROOT xxxx True False
out PART RP prt xxx True True
the DET DT det xxx True True
31 NUM CD dobj dd False False
. PUNCT . punct . False False
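
Of the columns printed above, the shape column (`token.shape_`) is probably the least self-explanatory. The sketch below is only an approximation of how such a shape string can be derived, written to reproduce the values in this output; it is not spacy's own implementation, and `approx_shape` is a hypothetical name.

```python
def approx_shape(text):
    """Approximate a word shape: letters become x/X, digits become d."""
    shape = []
    last = ""
    run = 0
    for char in text:
        if char.isalpha():
            shape_char = "X" if char.isupper() else "x"
        elif char.isdigit():
            shape_char = "d"
        else:
            shape_char = char  # punctuation and symbols are kept as-is
        if shape_char == last:
            run += 1
        else:
            run = 0
            last = shape_char
        # Keep at most four consecutive copies of the same shape character.
        if run < 4:
            shape.append(shape_char)
    return "".join(shape)

for word in ["@sfmta_muni", "pls", "send", "31", "."]:
    print(word, approx_shape(word))
```

Capping each run at four characters is why "@sfmta_muni" collapses to "@xxxx_xxxx" rather than keeping one "x" per letter.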


An interesting tool for exploring entities is displacy's "ent" style. It's far from perfect, as you can see in the next cell, but with a domain-specific dictionary (perhaps in the next CodeLab) it could be made more robust.

In [26]:
displacy.render(doc, style="ent", jupyter=True)

At first glance, it looks like the part-of-speech (POS) tagger and the entity recognizer use different nomenclature, since the number 31 is identified as "NUM" by the tagger and "CARDINAL" by the entity recognizer. In fact, the entity recognizer is classifying the "NUM" further, choosing among other possibilities such as "PERCENT" and "DATE."

This is where domain context becomes important. The [OpenTransit](https://codeforsanfrancisco.org/projects/opentransit/) project at Code for San Francisco is focused on improving transit performance, and they are interested in which routes people are talking about. Most of these routes are identified by a number, such as the "31" in the tweet above. As volunteers spend time labeling tweets, they will become subject matter experts (SMEs) to some degree, and be familiar with all the different ways a rider might refer to a transit route, e.g. "the 24" or "a IB 30" or "another outbound 14R."

We are beginning to see how a learning function might help improve both precision and recall by looking for words like "the" and "a" next to a number like "24" or "30." To be more specific, we want to find DET words that immediately precede CARDINAL words. We'd probably have greater confidence if the CARDINAL words were followed by words like "route" or "line." Other words like "IB" \[inbound\] or "northbound" would also be of interest, and would likely appear between the words we otherwise require to be adjacent.
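
To make the idea concrete, here is a minimal hand-written sketch of such a rule. It is not a learned function and not the project's actual method. It assumes the input is a list of (text, POS) pairs like those printed from `token.text` and `token.pos_` earlier, and it uses the "NUM" tag as a stand-in for CARDINAL. The names `find_route_mentions`, `DIRECTION_WORDS`, and `ROUTE_NOUNS`, the word lists, and the score values are all hypothetical placeholders.

```python
# Hypothetical word lists; a real project would curate these with SMEs.
DIRECTION_WORDS = {"ib", "ob", "inbound", "outbound",
                   "northbound", "southbound", "eastbound", "westbound"}
ROUTE_NOUNS = {"route", "line"}

def find_route_mentions(tagged):
    """Return (number_text, score) pairs for numbers that look like routes.

    tagged is a list of (text, pos) pairs. The scores are arbitrary
    placeholders for "confidence," not learned weights.
    """
    mentions = []
    for i, (text, pos) in enumerate(tagged):
        if pos != "NUM":
            continue
        # Walk backwards over any direction words between DET and the number.
        j = i - 1
        while j >= 0 and tagged[j][0].lower() in DIRECTION_WORDS:
            j -= 1
        # Require a determiner ("the", "a", "another") before the number.
        if j < 0 or tagged[j][1] != "DET":
            continue
        score = 1
        if j < i - 1:
            score += 1  # a direction word such as "IB" was present
        if i + 1 < len(tagged) and tagged[i + 1][0].lower() in ROUTE_NOUNS:
            score += 1  # the number is followed by "route" or "line"
        mentions.append((text, score))
    return mentions

example = [("@sfmta_muni", "VERB"), ("pls", "INTJ"), ("send", "VERB"),
           ("out", "PART"), ("the", "DET"), ("31", "NUM"), (".", "PUNCT")]
print(find_route_mentions(example))
```

A rule like this would miss route names such as "14R" whenever the tagger does not label them "NUM," which is one reason a domain-specific dictionary or a learned component may still be needed.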

A great tool for exploring the syntactic relationships between words in a sentence is the "displacy.render" method with the "dep" style. This visualization draws labeled arrows from one word to another in addition to labeling the POS.

In [27]:
displacy.render(doc, style="dep", jupyter=True)

If it's a little cramped inline within a notebook, spacy can also spin up a local server to display visualizations in a dedicated browser tab:

In [None]:
sentence_spans = list(doc.sents)
displacy.serve(sentence_spans, style="dep")


    Serving on port 5000...
    Using the 'dep' visualizer



127.0.0.1 - - [24/Apr/2019 22:14:35] "GET / HTTP/1.1" 200 40829
127.0.0.1 - - [24/Apr/2019 22:14:35] "GET /favicon.ico HTTP/1.1" 200 40829


To view it, open a browser tab to that port: [http://localhost:5000/](http://localhost:5000/)

In the next CodeLab, we'll implement custom pipeline components in spacy similar to [this example](https://spacy.io/usage/processing-pipelines#component-example2) and create some learning functions based on the dependency parser information we have access to.