# Extracting terminology (key words and phrases)

To build a glossary, we first need to know which words and phrases qualify as terms. To find them, I will process the source (EN) side of the parallel corpus and later look for matches on the target (RU) side.

In [1]:
%run utility_file     # handles module imports and loading .csv files
from utility_file import Preprocess     # custom class for preprocessing text

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Sveta\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [26]:
path = 'pi2.csv'
source_lang = 'English'
target_lang = 'Russian'

source_list, target_list = load_separate_corpora_from_csv(path, source_lang, target_lang)

In [27]:
exceptions = [
    'flowerbed',
    'building',
    ]

# only keeping strings of up to 80 characters: longer strings are less likely to contain terms and would clutter the corpus
clean_en_list = [Preprocess(sentence).preprocess_no_lemmatization(exceptions)
                 for sentence in source_list if len(sentence) <= 80]
clean_en_str = ' '.join(clean_en_list)

## Extracting unigrams (one-word terms)

The first step is simple: I look at the entire corpus and collect the lemmas of all nouns and verbs using the [spaCy morphology analyzer](https://spacy.io/api/morphology).

In [28]:
# creating sets of all noun and verb lemmas using spaCy

doc = nlp(clean_en_str)
nouns = set()
verbs = set()
# contraction fragments that may get tagged as verbs
verb_fragments = {"'ve", "s", "'m", "'re"}
for token in doc:
    if token.pos_ == "NOUN":
        nouns.add(token.lemma_)
    elif token.pos_ == "VERB":
        # skip contraction fragments and tokens of one or two characters
        if token.text not in verb_fragments and len(token.text) > 2:
            verbs.add(token.lemma_)
all_pos = nouns.union(verbs)
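
The filtering logic in the cell above can be tried out without loading a spaCy model. The sketch below is only an illustration: `Token` is a hypothetical stand-in that carries the three attributes the filter reads (`text`, `lemma_`, `pos_`), and the helper name `collect_nouns_and_verbs` is mine, not part of spaCy or of the notebook.

```python
from collections import namedtuple

# Hypothetical stand-in for a spaCy Token: only the attributes the filter reads.
Token = namedtuple("Token", ["text", "lemma_", "pos_"])


def collect_nouns_and_verbs(tokens, skip=("'ve", "s", "'m", "'re"), min_verb_len=3):
    """Return (nouns, verbs) as sets of lemmas, mirroring the filter above."""
    nouns, verbs = set(), set()
    for token in tokens:
        if token.pos_ == "NOUN":
            nouns.add(token.lemma_)
        elif token.pos_ == "VERB":
            # drop contraction fragments and tokens shorter than min_verb_len
            if token.text not in skip and len(token.text) >= min_verb_len:
                verbs.add(token.lemma_)
    return nouns, verbs


sample = [
    Token("islands", "island", "NOUN"),
    Token("'ve", "have", "VERB"),       # contraction fragment, skipped
    Token("upgraded", "upgrade", "VERB"),
    Token("go", "go", "VERB"),          # too short, skipped
    Token("great", "great", "ADJ"),     # neither noun nor verb
]
nouns, verbs = collect_nouns_and_verbs(sample)
print(sorted(nouns), sorted(verbs))
```

Note that the sets hold lemmas, not surface forms, which matters later when they are compared with the vectorizer vocabulary.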

Next, I extract candidate one-word terms using the [tf-idf](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) weighting provided by [sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html).

In [29]:
# Extracting candidate words using TfidfVectorizer

stop_words = "english"
tfidf = TfidfVectorizer(min_df=5, stop_words=stop_words).fit(clean_en_list)
tfidf_matrix = tfidf.transform(clean_en_list)
sum_words = tfidf_matrix.sum(axis=0)      # summed tf-idf weight of each word over all strings
words_freq = [(word, sum_words[0, idx]) for word, idx in tfidf.vocabulary_.items()]
words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
# only keeping the 700 words with the highest summed tf-idf weight
tfidf_candidates = [word for word, _ in words_freq[:700]]
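
To make the ranking criterion concrete, here is a minimal pure-Python sketch of what "summed tf-idf weight" means. It assumes the smoothed idf, ln((1 + n) / (1 + df)) + 1, and L2 row normalisation, which are the documented defaults of `TfidfVectorizer`. It tokenises with a plain `split()` and ignores `min_df` and stop words, so it is an approximation for illustration, not a replacement for the cell above. The function name `summed_tfidf` and the toy strings are hypothetical.

```python
import math
from collections import Counter


def summed_tfidf(docs):
    """Sum L2-normalised tf-idf weights per word over all documents."""
    tokenised = [doc.split() for doc in docs]
    n_docs = len(tokenised)
    # document frequency: in how many documents each word occurs
    df = Counter(word for tokens in tokenised for word in set(tokens))
    # smoothed idf: ln((1 + n) / (1 + df)) + 1
    idf = {w: math.log((1 + n_docs) / (1 + c)) + 1 for w, c in df.items()}

    totals = Counter()
    for tokens in tokenised:
        weights = {w: tf * idf[w] for w, tf in Counter(tokens).items()}
        norm = math.sqrt(sum(v * v for v in weights.values()))
        if norm == 0:          # empty document, nothing to add
            continue
        for w, v in weights.items():
            totals[w] += v / norm
    return totals


docs = ["collect reward", "collect gift", "open chest collect reward"]
scores = summed_tfidf(docs)
ranking = [word for word, _ in scores.most_common()]
print(ranking)
```

In this toy corpus `collect` occurs in every string and still ranks first: because the weights are summed over all strings, the score rewards words that are both frequent and reasonably distinctive, rather than rare words alone.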

As I am only interested in nouns and verbs at this stage, I create the final list of unigram terms by intersecting the list of tf-idf candidates with the set of nouns and verbs created previously.

In [30]:
# Intersecting with the list of nouns and verbs
candidates = [x for x in tfidf_candidates if x in all_pos]

# taking a look
print(candidates[:100])

['island', 'level', 'gift', 'open', 'upgrade', 'build', 'collect', 'hotel', 'complete', 'reward', 'set', 'time', 'hold', 'profit', 'building', 'union', 'season', 'play', 'holiday', 'bonus', 'help', 'game', 'buy', 'paradise', 'day', 'chest', 'need', 'event', 'beach', 'love', 'received', 'send', 'receive', 'want', 'christmas', 'use', 'magic', 'treasure', 'easter', 'energy', 'win', 'party', 'trade', 'rhee', 'make', 'mano', 'talk', 'snowball', 'year', 'box', 'water', 'great', 'place', 'sea', 'leprechaun', 'offer', 'flag', 'spring', 'workshop', 'good', 'start', 'santa', 'ice', 'look', 'come', 'store', 'friend', 'ghost', 'speed', 'turtilliada', 'pack', 'return', 'like', 'bank', 'festival', 'search', 'let', 'way', 'know', 'journey', 'earn', 'wondershop', 'park', 'sandy', 'competition', 'house', 'achievement', 'free', 'surprise', 'stage', 'castle', 'power', 'flower', 'battle', 'basket', 'kit', 'sweet', 'try', 'mini', 'fail']
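
Two properties of this intersection are easy to miss, so here is a small sketch with made-up data (both collections below are hypothetical). First, filtering the ranked list with a comprehension keeps the tf-idf order, which a plain set intersection would not guarantee. Second, the noun and verb set holds lemmas while the vectorizer vocabulary holds surface forms, so an inflected form survives only if it also happens to be a lemma in the set.

```python
# hypothetical tf-idf ranking (surface forms, best first)
ranked = ["island", "great", "buildings", "upgrade", "hotel"]
# hypothetical POS-filtered lemmas
nouns_and_verbs = {"hotel", "island", "upgrade", "building"}

# list comprehension: keeps the ranking order of the first list
ordered = [word for word in ranked if word in nouns_and_verbs]

# set intersection: same members, but iteration order is not guaranteed
unordered = set(ranked) & nouns_and_verbs

print(ordered)
```

Here `buildings` drops out because only its lemma `building` is in the set, and `great` drops out because it is neither a noun nor a verb.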


## Extracting n-grams (key phrases)

The idea here is the same as with unigrams. First, I extract potential terms with a linguistic tool (the spaCy [noun chunker](https://spacy.io/usage/linguistic-features#noun-chunks) in this case).

In [31]:
# extracting n-grams using spacy chunker

noun_chunks = set(chunk.text.strip().lower() for chunk in doc.noun_chunks)

Then I extract n-gram key phrases. Here I use sklearn's [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) rather than TfidfVectorizer, as it seems to give better results.

In [32]:
# Extracting candidate n-grams with CountVectorizer

n_gram_range = (2, 3)
count = CountVectorizer(ngram_range=n_gram_range, stop_words=stop_words).fit([clean_en_str])
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
ngrams = count.get_feature_names_out().tolist()
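
When `stop_words` and `ngram_range` are both set, CountVectorizer drops the stop words first and only then builds the n-grams, so a candidate can join two words that were not adjacent in the original string. The sketch below imitates that behaviour in plain Python. It is a simplification: it tokenises with `split()` instead of sklearn's default token pattern, and `word_ngrams` and `toy_stop_words` are hypothetical names, not sklearn's own.

```python
def word_ngrams(text, n_range=(2, 3), stop_words=frozenset()):
    """Build word n-grams after dropping stop words."""
    tokens = [t for t in text.lower().split() if t not in stop_words]
    lo, hi = n_range
    grams = []
    for n in range(lo, hi + 1):
        # slide a window of n tokens over the filtered token list
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


toy_stop_words = frozenset({"the", "to", "of"})   # hypothetical, not sklearn's list
grams = word_ngrams("upgrade the hotel to level five", stop_words=toy_stop_words)
print(grams)
```

The bigram `upgrade hotel` appears even though `the` stood between the two words. This is one reason to check the raw n-grams against the noun chunks afterwards.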

As before, I create the final list of n-gram terms by intersecting the two lists.

In [33]:
ngram_candidates = list(filter(lambda candidate: candidate in noun_chunks, ngrams))

print(ngram_candidates[:50])

['aaah eek ook', 'able fight', 'able stop', 'abracadabra hocus pocus', 'abyss speed construction', 'access wondershop', 'accessories buildings', 'accessory building accessory', 'accessory demolish building', 'accessory warehouse', 'accommodate guests', 'accommodate thinking nature', 'accumulated profit', 'accumulation speed arrival', 'achievement achievem', 'achievement points', 'achievement reward', 'achievements kinds achievements', 'acorn luck', 'acorn rich greenes', 'active event', 'active event demolish', 'active members', 'active members players', 'active players', 'actual locket', 'ad win prize', 'additional bonuses', 'adjacent ones', 'administration building', 'administration building staff', 'administrative building', 'advance buy boxes', 'aeronaut cafe', 'aeronaut cafes', 'afraid hammers', 'afraid heights', 'aiden forrester', 'air castle', 'air essence aaaa', 'air essences', 'alalaz celestial chest', 'alex bootman', 'alien athletes', 'alien chest ufologist', 'alien chests', '

## Combining + Pickling

In [34]:
keywords = set(candidates + ngram_candidates)
print(len(keywords), 'is the total number of extracted terms')

3820 is the total number of extracted terms


In [35]:
# pickling the keyword set

import pickle
with open('keywords.pkl', 'wb') as f:
    pickle.dump(keywords, f)
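
A later notebook can load the set back with `pickle.load`. The round trip below is a minimal sketch that writes to a temporary directory so it does not touch the real `keywords.pkl`; the sample set `keywords_demo` is hypothetical.

```python
import os
import pickle
import tempfile

keywords_demo = {"island", "achievement reward", "air castle"}   # hypothetical sample

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "keywords.pkl")
    # write the set in binary mode
    with open(path, "wb") as f:
        pickle.dump(keywords_demo, f)
    # read it back, also in binary mode
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == keywords_demo)
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.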