In [2]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pathlib import Path

import spacy

import torch
from torchtext import data, datasets

In [3]:
data_path = Path("../data/processed/") / "aclImdb"

In [5]:
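# 'en' is a spaCy 2.x shortcut link; spaCy 3+ needs a full model name such as 'en_core_web_sm'.
spacy_en = spacy.load('en')
def tokenizer(text):
    """Return the spaCy token strings for `text` (tokenizer only, no tagging or parsing)."""
    return [tok.text for tok in spacy_en.tokenizer(text)]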

In [14]:
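# No `tokenize` argument is passed, so TEXT uses torchtext's default whitespace split.
TEXT = data.Field(lower=True, include_lengths=True, batch_first=True)
# sequential=False: each label is a single value, not a token sequence.
LABEL = data.Field(sequential=False)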

train_ds = datasets.IMDB(str(data_path / "train"), TEXT, LABEL)
test_ds = datasets.IMDB(str(data_path / "test"), TEXT, LABEL)

## Tokenization

By default, torchtext splits text on whitespace only. The cells below compare that output with the spaCy tokenizer on the first training example.

In [17]:
comment = train_ds.examples[0].text
comment_text = " ".join(comment)
comment_text

'bromwell high is a cartoon comedy. it ran at the same time as some other programs about school life, such as "teachers". my 35 years in the teaching profession lead me to believe that bromwell high\'s satire is much closer to reality than is "teachers". the scramble to survive financially, the insightful students who can see right through their pathetic teachers\' pomp, the pettiness of the whole situation, all remind me of the schools i knew and their students. when i saw the episode in which a student repeatedly tried to burn down the school, i immediately recalled ......... at .......... high. a classic line: inspector: i\'m here to sack one of your teachers. student: welcome to bromwell high. i expect that many adults of my age think that bromwell high is far fetched. what a pity that it isn\'t!'

In [20]:
doc = spacy_en(comment_text)
" ".join([tok.text for tok in doc])

'bromwell high is a cartoon comedy . it ran at the same time as some other programs about school life , such as " teachers " . my 35 years in the teaching profession lead me to believe that bromwell high \'s satire is much closer to reality than is " teachers " . the scramble to survive financially , the insightful students who can see right through their pathetic teachers \' pomp , the pettiness of the whole situation , all remind me of the schools i knew and their students . when i saw the episode in which a student repeatedly tried to burn down the school , i immediately recalled ......... at .......... high . a classic line : inspector : i \'m here to sack one of your teachers . student : welcome to bromwell high . i expect that many adults of my age think that bromwell high is far fetched . what a pity that it is n\'t !'

The spaCy tokenizer splits punctuation, apostrophes, and quotation marks into separate tokens. As a result, '"teachers"' and 'teachers' both produce the same word token, `teachers`. The quotation marks are kept as tokens of their own, so a model can still take that context into account.
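
The effect on the vocabulary can be sketched without spaCy. The snippet below is only a rough stand-in and not spaCy's actual algorithm: `simple_tokenize` and `TOKEN_RE` are hypothetical names, and the regular expression does not reproduce spaCy's handling of contractions (spaCy splits "isn't" into "is" and "n't", while this sketch does not).

```python
import re

# Hypothetical stand-in for a punctuation-aware tokenizer. This is NOT
# spaCy's algorithm: it only separates runs of word characters from
# individual punctuation marks.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def simple_tokenize(text):
    """Return the word and punctuation tokens found in ``text``."""
    return TOKEN_RE.findall(text)


sample = 'such as "teachers". than is "teachers".'

whitespace_tokens = sample.split()       # whitespace-only splitting
regex_tokens = simple_tokenize(sample)   # punctuation split into own tokens

print(whitespace_tokens)
print(regex_tokens)

# With whitespace splitting the quotes and full stop stay glued to the
# word, so the bare word never appears as a token of its own.
print("teachers" in whitespace_tokens)
print("teachers" in regex_tokens)
```

With whitespace splitting, the vocabulary would need a separate entry for every combination of word and attached punctuation. Splitting punctuation out keeps one entry per word.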

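spaCy can also report the part of speech of each token. Note that it mis-tags the proper noun "Bromwell High": `bromwell` is tagged VERB and `high` is tagged ADJ. The text was lowercased when the dataset was loaded, which removes the capitalization cue and may contribute to the error.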

In [21]:
for token in doc[:20]:
    print(token.text, token.pos_)

bromwell VERB
high ADJ
is VERB
a DET
cartoon NOUN
comedy NOUN
. PUNCT
it PRON
ran VERB
at ADP
the DET
same ADJ
time NOUN
as ADP
some DET
other ADJ
programs NOUN
about ADP
school NOUN
life NOUN


In [22]:
for chunk in doc.noun_chunks:
    print(chunk.text)

a cartoon comedy
it
the same time
some other programs
school life
"teachers
my 35 years
the teaching profession
me
bromwell high's satire
reality
teachers
the scramble
the insightful students
who
their pathetic teachers' pomp
the pettiness
the whole situation
me
the schools
i
their students
i
the episode
a student
the school
i
a classic line
inspector
i
your teachers
student
i
many adults
my age
what a pity
it


In [23]:
for ent in doc.ents:
    print(ent.text)

35 years
one


In [29]:
sent = "I love to go to new york city"
doc = spacy_en(sent)
for ent in doc.ents:
    print(ent.text)

In [31]:
vocab = spacy_en.vocab

In [33]:
vocab.length

58635

In [37]:
vectors = vocab.vectors

In [51]:
'Boise' in vocab

True

## NER

How is it done?

How would I train my own?



## Lemmatization

## POS Tagging


## Sentiment

## Topic Modeling