In [4]:
import pandas as pd
data = pd.read_parquet("data/training.parquet")

In [5]:
data

Unnamed: 0,text,whence
0,You must write to me. Catherine sighed. And th...,legitimate
1,Who would have thought Mr. Crawford sure of he...,legitimate
2,He had only himself to please in his choice: h...,legitimate
3,Oh! One accompaniment to her song took her agr...,legitimate
4,"As soon as breakfast was over, she went to her...",legitimate
5,Mrs Clay's selfishness was not so great as to ...,legitimate
6,"But self, though it would intrude, could not e...",legitimate
7,"Elizabeth, though she did not wish to slight. ...",legitimate
8,Edmund had descended from that moral elevation...,legitimate
9,I read up on the morrow the Crawfords were eng...,legitimate


In [6]:
import spacy
english = spacy.load("en")

In [15]:
# Series.get_values() was removed in pandas 1.0; .iloc[0] returns the first text.
data["text"].iloc[0]

"You must write to me. Catherine sighed. And there are other circumstances which I am now satisfied that I never brewed it. They will read together. Her praise had been given her at different times, but _this_ is the true one. So surrounded, so caressed, she was even positively civil; but it was not directed to me--it was to Mrs. Weston. And besides the operation of a sensible, intelligent man like Mr. Allen. I see that more than a little proud-looking woman of uncordial address, who met her husband's sisters without any affection, and almost without beauty. I walked over the the vending machine so I am very sorry--extremely sorry--But, Miss Smith, indeed!--Oh! Could she but have given Harriet her feelings about it all!"

In [17]:
doc = english(data["text"].iloc[0])

In [18]:
type(doc)

spacy.tokens.doc.Doc

We can use spaCy to identify parts of speech.

In [21]:
for token in doc:
    print("%s is a %s" % (token.text, token.pos_))

You is a PRON
must is a VERB
write is a VERB
to is a ADP
me is a PRON
. is a PUNCT
Catherine is a PROPN
sighed is a VERB
. is a PUNCT
And is a CCONJ
there is a ADV
are is a VERB
other is a ADJ
circumstances is a NOUN
which is a ADJ
I is a PRON
am is a VERB
now is a ADV
satisfied is a ADJ
that is a ADP
I is a PRON
never is a ADV
brewed is a VERB
it is a PRON
. is a PUNCT
They is a PRON
will is a VERB
read is a VERB
together is a ADV
. is a PUNCT
Her is a ADJ
praise is a NOUN
had is a VERB
been is a VERB
given is a VERB
her is a PRON
at is a ADP
different is a ADJ
times is a NOUN
, is a PUNCT
but is a CCONJ
_ is a VERB
this is a DET
_ is a NOUN
is is a VERB
the is a DET
true is a ADJ
one is a NOUN
. is a PUNCT
So is a ADV
surrounded is a VERB
, is a PUNCT
so is a ADV
caressed is a ADJ
, is a PUNCT
she is a PRON
was is a VERB
even is a ADV
positively is a ADV
civil is a ADJ
; is a PUNCT
but is a CCONJ
it is a PRON
was is a VERB
not is a ADV
directed is a VERB
to is a ADP
me is a PRON
-- i

We can also use spaCy to identify the base forms of words. It does this with a combination of part-of-speech-specific rules and a dictionary of exceptions. The spaCy component that does this is called a [_lemmatizer_](https://en.wikipedia.org/wiki/Lemma_%28morphology%29), and it returns a list of candidate base forms for each word.
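
To make the rules-plus-exceptions idea concrete, here is a minimal sketch of a toy lemmatizer in plain Python. It is not spaCy's implementation: the names `TOY_RULES`, `TOY_EXCEPTIONS`, `TOY_INDEX`, and `toy_lemmatize` are hypothetical, and the tiny tables are made up for illustration.

```python
# Hypothetical toy tables, invented for this sketch (not spaCy's data).
# Suffix rules per part of speech: (suffix to strip, replacement).
TOY_RULES = {
    "VERB": [("ed", ""), ("ed", "e"), ("ing", ""), ("s", "")],
    "NOUN": [("ies", "y"), ("s", "")],
    "ADJ": [("er", ""), ("est", ""), ("er", "e"), ("est", "e")],
}
# Irregular forms that the suffix rules cannot handle.
TOY_EXCEPTIONS = {"VERB": {"are": ["be"], "am": ["be"], "was": ["be"]}}
# Known base forms, used to prefer real words over rule artifacts.
TOY_INDEX = {
    "VERB": {"sigh", "brew", "write"},
    "NOUN": {"circumstance"},
    "ADJ": {"nice"},
}


def toy_lemmatize(word, pos):
    """Return a list of candidate base forms for `word` tagged as `pos`."""
    word = word.lower()
    # 1. Exceptions win outright.
    if word in TOY_EXCEPTIONS.get(pos, {}):
        return list(TOY_EXCEPTIONS[pos][word])
    # 2. Apply every matching suffix rule for this part of speech.
    known, unknown = [], []
    for suffix, replacement in TOY_RULES.get(pos, []):
        if not word.endswith(suffix):
            continue
        candidate = word[: len(word) - len(suffix)] + replacement
        if not candidate:
            continue
        bucket = known if candidate in TOY_INDEX.get(pos, set()) else unknown
        if candidate not in bucket:
            bucket.append(candidate)
    # 3. Prefer known words, then unverified guesses, then the word itself.
    return known or unknown or [word]


print(toy_lemmatize("are", "VERB"))            # ['be']
print(toy_lemmatize("sighed", "VERB"))         # ['sigh']
print(toy_lemmatize("circumstances", "NOUN"))  # ['circumstance']
print(toy_lemmatize("other", "ADJ"))           # ['oth', 'othe']
```

In this sketch, a word that matches no exception and yields no known base form keeps every rule-generated candidate, which is one way a lemmatizer can end up returning more than one guess for a single word.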

In [22]:
from spacy.lemmatizer import Lemmatizer
from spacy.lang.en import LEMMA_INDEX, LEMMA_EXC, LEMMA_RULES
lemmatizer = Lemmatizer(LEMMA_INDEX, LEMMA_EXC, LEMMA_RULES)

In [23]:
for token in doc:
    print("%s gets lemmatized to %s" % (token.text, lemmatizer(token.text, token.pos_)))

You gets lemmatized to ['you']
must gets lemmatized to ['must']
write gets lemmatized to ['write']
to gets lemmatized to ['to']
me gets lemmatized to ['me']
. gets lemmatized to ['.']
Catherine gets lemmatized to ['catherine']
sighed gets lemmatized to ['sigh']
. gets lemmatized to ['.']
And gets lemmatized to ['and']
there gets lemmatized to ['there']
are gets lemmatized to ['be']
other gets lemmatized to ['othe', 'oth']
circumstances gets lemmatized to ['circumstance']
which gets lemmatized to ['which']
I gets lemmatized to ['i']
am gets lemmatized to ['be']
now gets lemmatized to ['now']
satisfied gets lemmatized to ['satisfied']
that gets lemmatized to ['that']
I gets lemmatized to ['i']
never gets lemmatized to ['never']
brewed gets lemmatized to ['brew']
it gets lemmatized to ['it']
. gets lemmatized to ['.']
They gets lemmatized to ['they']
will gets lemmatized to ['will']
read gets lemmatized to ['read']
together gets lemmatized to ['together']
. gets lemmatized to ['.']
Her ge

We can apply this process to every row of the data frame if we'd like, but it might take a while, since each text is run through the full spaCy pipeline.

In [None]:
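def lemmas(s):
    # Keep only the first candidate lemma for each token and join them with spaces.
    return " ".join(lemmatizer(token.text, token.pos_)[0] for token in english(s))

# Series.apply takes only the function here; the stray second argument is dropped.
data["lemmas"] = data["text"].apply(lemmas)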
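
Since the full run might take a while, one option is to time the function on a handful of rows first and extrapolate. The following is a rough sketch in plain Python; `estimate_seconds` is a hypothetical helper, not part of pandas or spaCy, and `str.upper` stands in for the real `lemmas` function so the example runs on its own.

```python
import time


def estimate_seconds(func, items, sample_size=5):
    """Estimate the total time to apply `func` to every item in `items`.

    Times `func` on the first `sample_size` items and scales linearly,
    so it is only a rough guide if the items vary a lot in length.
    """
    sample = items[:sample_size]
    if not sample:
        return 0.0
    start = time.perf_counter()
    for item in sample:
        func(item)
    elapsed = time.perf_counter() - start
    return elapsed / len(sample) * len(items)


# Stand-in workload; the printed value depends on the machine.
print(estimate_seconds(str.upper, ["a", "b", "c"] * 1000))
```

In this notebook the same helper could be called as `estimate_seconds(lemmas, data["text"].tolist())` before committing to the full `apply`.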