## Part-of-Speech

For many NLP tasks it is useful to extract information about the part-of-speech structure of a text. In this notebook we compute statistics over part-of-speech trigrams, such as the probability that a noun follows a verb and a preposition.

## Data Processing

First we will read data from [the Reuters dataset](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/XDB74W&version=2.0).

In [None]:
import pandas as pd

reuters = pd.read_csv('../input/fake-news-data/reuters-newswire-2017.v5.csv')
data = reuters.drop(['publish_time'], axis=1).rename(columns={'headline_text': 'headline'})
headlines = data['headline']

Then we define the special start and end tokens that mark the boundaries of each headline.

In [None]:
START = '÷'
END = '■'

## Part-of-Speech Extraction

We import the required libraries and load a spaCy language model trained on web data.

In [None]:
from collections import defaultdict
import spacy
nlp = spacy.load('en_core_web_sm')

The information we extract is based on trigrams over each headline. Specifically, we use spaCy to compute three label sequences per headline: coarse-grained part-of-speech labels (`pos_`), fine-grained POS tags (`tag_`), and dependency relations (`dep_`). To capture how headlines begin and end, we pad each label sequence with two start tokens at the front and two end tokens at the back. Finally, we count trigrams over each padded sequence for all three label types, keyed by the first two labels of the trigram.

In [None]:
# counts[(label_1, label_2)][label_3] = number of times the trigram was seen
trigrams_pos = defaultdict(lambda: defaultdict(int))
trigrams_tag = defaultdict(lambda: defaultdict(int))
trigrams_dep = defaultdict(lambda: defaultdict(int))
for h in headlines:
    parsed = nlp(h)
    # Pad with two START and two END markers so boundary trigrams are counted.
    pos = [START, START] + [token.pos_ for token in parsed] + [END, END]
    tag = [START, START] + [token.tag_ for token in parsed] + [END, END]
    dep = [START, START] + [token.dep_ for token in parsed] + [END, END]
    # All three sequences have the same length, so one index serves them all.
    for i in range(2, len(pos)):
        trigrams_pos[(pos[i-2], pos[i-1])][pos[i]] += 1
        trigrams_tag[(tag[i-2], tag[i-1])][tag[i]] += 1
        trigrams_dep[(dep[i-2], dep[i-1])][dep[i]] += 1
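
The padding and counting scheme can be tried without spaCy on a couple of hand-written label sequences. The sketch below is only an illustration: `count_trigrams` is a hypothetical helper and the two toy sequences are made up, not taken from the Reuters data.

```python
from collections import defaultdict

START, END = '÷', '■'  # same boundary markers as in the notebook


def count_trigrams(sequences):
    """Count label trigrams as counts[(a, b)][c] over padded sequences."""
    counts = defaultdict(lambda: defaultdict(int))
    for labels in sequences:
        padded = [START, START] + list(labels) + [END, END]
        for i in range(2, len(padded)):
            counts[(padded[i - 2], padded[i - 1])][padded[i]] += 1
    return counts


# Hypothetical POS sequences standing in for two parsed headlines.
toy = [['NOUN', 'VERB', 'NOUN'], ['NOUN', 'VERB', 'ADP', 'NOUN']]
toy_counts = count_trigrams(toy)
```

Both toy sequences open with a noun, so the context `(START, START)` is followed by `'NOUN'` twice, and the context `('NOUN', 'VERB')` is followed once by `'NOUN'` and once by `'ADP'`.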

Next we convert the counts to conditional probabilities and store them in plain dictionaries.

In [None]:
trigrams_pos = {k: {k_: v_/sum(v.values()) for k_, v_ in v.items()} for k, v in trigrams_pos.items()}
trigrams_tag = {k: {k_: v_/sum(v.values()) for k_, v_ in v.items()} for k, v in trigrams_tag.items()}
trigrams_dep = {k: {k_: v_/sum(v.values()) for k_, v_ in v.items()} for k, v in trigrams_dep.items()}
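
Each probability here is the count of a trigram divided by the total count of its two-label context, that is P(c | a, b) = count(a, b, c) / Σ count(a, b, c') summed over all followers c'. An equivalent formulation that computes each context total only once might look like the sketch below; `normalize` and the toy counts are hypothetical and chosen only for illustration.

```python
def normalize(counts):
    """Turn counts[context][label] into conditional probabilities."""
    probs = {}
    for context, followers in counts.items():
        total = sum(followers.values())  # computed once per context
        probs[context] = {label: n / total for label, n in followers.items()}
    return probs


# Made-up counts, not results from the Reuters headlines.
toy_counts = {('NOUN', 'VERB'): {'NOUN': 3, 'ADP': 1}, ('÷', '÷'): {'NOUN': 4}}
toy_probs = normalize(toy_counts)
```

After normalization the values for every context sum to 1, which is a cheap sanity check to run on the real tables as well.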

To check what we are storing, we print the probabilities of the part-of-speech labels that follow a noun and then a verb.

In [None]:
trigrams_pos[('NOUN', 'VERB')]
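
Because the probability tables are plain dictionaries at this point, looking up a context that never occurred raises a `KeyError`. A minimal sketch of a tolerant lookup, assuming a hypothetical helper `most_likely_next` and made-up probabilities:

```python
def most_likely_next(probs, context):
    """Return the most probable label after `context`, or None if unseen."""
    followers = probs.get(context)
    if not followers:
        return None
    return max(followers, key=followers.get)


# Made-up probabilities, not results from the Reuters headlines.
toy_probs = {('NOUN', 'VERB'): {'NOUN': 0.75, 'ADP': 0.25}}
best = most_likely_next(toy_probs, ('NOUN', 'VERB'))
unseen = most_likely_next(toy_probs, ('ADJ', 'ADJ'))
```

Returning `None` for unseen contexts is one possible design choice; a downstream model might prefer a fallback to bigram or unigram statistics instead.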

We `pickle` the dictionaries so that we can reuse them in other notebooks.

In [None]:
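import pickle

# Context managers make sure each file is closed once its dump completes.
with open("trigrams_pos.pkl", "wb") as f:
    pickle.dump(trigrams_pos, f)
with open("trigrams_tag.pkl", "wb") as f:
    pickle.dump(trigrams_tag, f)
with open("trigrams_dep.pkl", "wb") as f:
    pickle.dump(trigrams_dep, f)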
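
Another notebook can read these files back with `pickle.load`. The round trip below is a small sketch that writes a made-up table to a temporary directory, so the file name `toy_trigrams.pkl` is hypothetical and none of the real output files are touched.

```python
import os
import pickle
import tempfile

# Made-up table with the same shape as the trigram dictionaries.
toy_probs = {('NOUN', 'VERB'): {'NOUN': 0.75, 'ADP': 0.25}}

path = os.path.join(tempfile.mkdtemp(), 'toy_trigrams.pkl')
with open(path, 'wb') as f:
    pickle.dump(toy_probs, f)

with open(path, 'rb') as f:
    restored = pickle.load(f)
```

The tuple keys survive the round trip unchanged, so lookups such as `restored[('NOUN', 'VERB')]` work exactly as before.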

In many use cases we will need a map from words to their labels, so we build these maps here from the vocabularies of some pre-trained tokenizers. Each word is analyzed on its own, without sentence context, and we keep the labels of its first token.

In [None]:
def load_pickle(path):
    # Open the file in a context manager so it is closed after loading.
    with open(path, 'rb') as f:
        return pickle.load(f)

tokenizer_1000 = load_pickle('../input/tokenization/tokenizer_1000.pkl')
tokenizer_1500 = load_pickle('../input/tokenization/tokenizer_1500.pkl')
tokenizer_2000 = load_pickle('../input/tokenization/tokenizer_2000.pkl')
tokenizer_2500 = load_pickle('../input/tokenization/tokenizer_2500.pkl')
tokenizer_2750 = load_pickle('../input/tokenization/tokenizer_2750.pkl')

In [None]:
pos_map_1000, tag_map_1000, dep_map_1000 = {}, {}, {}  # word -> label string
for k in tokenizer_1000.word_index.keys():
    tokens = nlp(k)
    pos_map_1000[k] = tokens[0].pos_
    tag_map_1000[k] = tokens[0].tag_
    dep_map_1000[k] = tokens[0].dep_

pos_map_1000[START], pos_map_1000[END] = START, END
tag_map_1000[START], tag_map_1000[END] = START, END
dep_map_1000[START], dep_map_1000[END] = START, END

In [None]:
pos_map_1500, tag_map_1500, dep_map_1500 = {}, {}, {}  # word -> label string
for k in tokenizer_1500.word_index.keys():
    tokens = nlp(k)
    pos_map_1500[k] = tokens[0].pos_
    tag_map_1500[k] = tokens[0].tag_
    dep_map_1500[k] = tokens[0].dep_

pos_map_1500[START], pos_map_1500[END] = START, END
tag_map_1500[START], tag_map_1500[END] = START, END
dep_map_1500[START], dep_map_1500[END] = START, END

In [None]:
pos_map_2000, tag_map_2000, dep_map_2000 = {}, {}, {}  # word -> label string
for k in tokenizer_2000.word_index.keys():
    tokens = nlp(k)
    pos_map_2000[k] = tokens[0].pos_
    tag_map_2000[k] = tokens[0].tag_
    dep_map_2000[k] = tokens[0].dep_

pos_map_2000[START], pos_map_2000[END] = START, END
tag_map_2000[START], tag_map_2000[END] = START, END
dep_map_2000[START], dep_map_2000[END] = START, END

In [None]:
pos_map_2500, tag_map_2500, dep_map_2500 = {}, {}, {}  # word -> label string
for k in tokenizer_2500.word_index.keys():
    tokens = nlp(k)
    pos_map_2500[k] = tokens[0].pos_
    tag_map_2500[k] = tokens[0].tag_
    dep_map_2500[k] = tokens[0].dep_

pos_map_2500[START], pos_map_2500[END] = START, END
tag_map_2500[START], tag_map_2500[END] = START, END
dep_map_2500[START], dep_map_2500[END] = START, END

In [None]:
pos_map_2750, tag_map_2750, dep_map_2750 = {}, {}, {}  # word -> label string
for k in tokenizer_2750.word_index.keys():
    tokens = nlp(k)
    pos_map_2750[k] = tokens[0].pos_
    tag_map_2750[k] = tokens[0].tag_
    dep_map_2750[k] = tokens[0].dep_

pos_map_2750[START], pos_map_2750[END] = START, END
tag_map_2750[START], tag_map_2750[END] = START, END
dep_map_2750[START], dep_map_2750[END] = START, END
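
The five cells above differ only in the tokenizer they read from, so the shared logic could be factored into one function. The sketch below is one possible shape, not the notebook's own code: `build_label_maps` and `fake_analyze` are hypothetical names, and the analyzer is passed in as a callable so the example runs without loading spaCy.

```python
START, END = '÷', '■'  # same boundary markers as in the notebook


def build_label_maps(words, analyze):
    """Map each word to its POS, tag and dependency label.

    `analyze(word)` must return a (pos, tag, dep) tuple, or None to skip.
    """
    pos_map, tag_map, dep_map = {}, {}, {}
    for word in words:
        labels = analyze(word)
        if labels is None:
            continue  # nothing to record for this word
        pos_map[word], tag_map[word], dep_map[word] = labels
    # The boundary markers map to themselves in every table.
    for table in (pos_map, tag_map, dep_map):
        table[START], table[END] = START, END
    return pos_map, tag_map, dep_map


def fake_analyze(word):
    """Stand-in for a real tagger; the labels are hand-written."""
    table = {'oil': ('NOUN', 'NN', 'ROOT'), 'rises': ('VERB', 'VBZ', 'ROOT')}
    return table.get(word)


pos_map, tag_map, dep_map = build_label_maps(['oil', 'rises', 'zzz'], fake_analyze)
```

With the notebook's `nlp` object, `analyze` could be a small function that runs `nlp(word)` and returns `(t.pos_, t.tag_, t.dep_)` for the first token `t`, or `None` when the parsed result is empty.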

Finally, we pickle all of these maps for future use.

In [None]:
def save_pickle(obj, path):
    # Open the file in a context manager so it is closed after writing.
    with open(path, 'wb') as f:
        pickle.dump(obj, f)

save_pickle(dict(pos_map_1000), "pos_map_1000.pkl")
save_pickle(dict(tag_map_1000), "tag_map_1000.pkl")
save_pickle(dict(dep_map_1000), "dep_map_1000.pkl")

save_pickle(dict(pos_map_1500), "pos_map_1500.pkl")
save_pickle(dict(tag_map_1500), "tag_map_1500.pkl")
save_pickle(dict(dep_map_1500), "dep_map_1500.pkl")

save_pickle(dict(pos_map_2000), "pos_map_2000.pkl")
save_pickle(dict(tag_map_2000), "tag_map_2000.pkl")
save_pickle(dict(dep_map_2000), "dep_map_2000.pkl")

save_pickle(dict(pos_map_2500), "pos_map_2500.pkl")
save_pickle(dict(tag_map_2500), "tag_map_2500.pkl")
save_pickle(dict(dep_map_2500), "dep_map_2500.pkl")

save_pickle(dict(pos_map_2750), "pos_map_2750.pkl")
save_pickle(dict(tag_map_2750), "tag_map_2750.pkl")
save_pickle(dict(dep_map_2750), "dep_map_2750.pkl")