# Processing Reddit Posts to find mentions of Nootropics

The posts we downloaded from Reddit are plain text, and unlike Wikipedia articles, they contain nothing that lets us detect a "link". Nootropics are mentioned in free text, under many different names referring to the same substance, and with inconsistent casing and formatting.

To be able to build a graph from the posts, the first step is thus to detect mentions of nootropics in them. For this we use [spaCy](https://spacy.io/), a library that handles a wide range of NLP tasks at very high speed (for comparison, tokenizing our data took around 20 minutes with spaCy, and would have taken around 6 hours with NLTK).

In [None]:
import spacy
import json
from json import JSONDecodeError

try:
    from library_functions.config import Config
except ModuleNotFoundError:
    from project.library_functions.config import Config
from tqdm.auto import tqdm
from spacy.matcher import PhraseMatcher
from pathlib import Path

spaCy relies on a *language model* object, which holds all the static language-related data such as tokenization rules, statistical models, etc.

If you want to replicate this at home, you will need to install the model as described [here](https://spacy.io/usage).

In [None]:
# Load the small English language model (tokenizer, vocabulary, statistical models)
nlp = spacy.load("en_core_web_sm")

## Named Entity Recognition with PhraseMatchers

What we want to do is called, in NLP parlance, *Named Entity Recognition (NER)*. As the name implies, the task is to detect *named entities* (in this case, nootropics) in text. Nowadays, most NER engines are based on statistical, machine-learning models, which provide much greater sensitivity: given enough training data, they can recognize entities of a specific type without having been given an explicit list of such entities, and they are also much better at handling alternative spellings, synonyms, badly formatted text and typos.

Due to the limited scope of this project, we don't have the time or resources to train a model that recognizes nootropics. Instead, we use another powerful feature of spaCy: [Rule-Based Matchers](https://spacy.io/usage/rule-based-matching), more specifically [Phrase Matchers](https://spacy.io/usage/rule-based-matching#phrasematcher).

A PhraseMatcher can be given a list of entities, and for each entity, a list of phrases that will be recognized as an instance of that entity. The fact that it can recognize phrases rather than single words is essential in our case, as many nootropics have compound names (and, for instance, *caffeinated drink* needs to be resolved to *caffeine*). It is also possible to choose which token attribute the matcher compares: the literal text (`ORTH`), the lowercased text (`LOWER`), the lemma (`LEMMA`), and more. In hindsight, it would probably have made more sense to match on lemmas, as they are much less sensitive to inflections. We actually noticed very late that we were instead using the `LOWER` attribute, and by that time, changing such an early step would have meant rewriting many aspects of our analysis, so we left it as-is.

Once it's initialized, a sequence of texts can be piped into the matcher, which will return for each text the entities that were recognized.
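To make the idea concrete, here is a minimal sketch of a PhraseMatcher resolving several phrases to one entity. The entity names and phrases are made up for illustration, and the `toy_` variables are hypothetical (they are not used elsewhere in this notebook). A blank English pipeline is used, so no model download is needed.

```python
import spacy
from spacy.matcher import PhraseMatcher

# A blank pipeline only contains a tokenizer, which is enough for LOWER matching
toy_nlp = spacy.blank("en")
toy_matcher = PhraseMatcher(toy_nlp.vocab, attr="LOWER")

# Hypothetical entity -> phrases mapping, for illustration only
toy_patterns = {
    "Caffeine": ["caffeine", "caffeinated drink"],
    "L-Theanine": ["l-theanine", "theanine"],
}
for entity, phrases in toy_patterns.items():
    toy_matcher.add(entity, [toy_nlp.make_doc(phrase) for phrase in phrases])

toy_doc = toy_nlp.make_doc("I had a Caffeinated Drink with some THEANINE this morning.")

# Each match is a (match_id, start, end) tuple; the ID resolves back to the entity name
found = sorted({toy_nlp.vocab.strings[match_id] for match_id, start, end in toy_matcher(toy_doc)})
print(found)
```

Because the matcher compares lowercased tokens, the inconsistent casing in the text does not prevent either phrase from being found.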

Let's initialize the PhraseMatcher:

In [None]:
# Initialize a spacy matcher. We want to match on lowercase tokens.
matcher = PhraseMatcher(nlp.vocab, attr="LOWER", validate=True)

Wikipedia pages store a list of *redirects*, which Wikipedia uses internally to send people to the right article when they search for one of its alternative names. This is very convenient, as it effectively provides us with a list of synonyms that we can use. Let's load that list (which we saved in a previous step) as well as the list of all substance names:

In [None]:
# Load the list of synonyms and titles from file
with open(Config.Path.synonym_mapping) as f:
    synonyms = json.load(f)

with open(Config.Path.substance_names) as f:
    names = json.load(f)
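The contents of the two files are not shown here. Judging from how they are used in the next cell, `synonyms` appears to map each redirect to the title of the page it points to, and `names` appears to be a list of page titles. Under that assumption, the grouping step can be sketched with made-up data as follows (all `example_` names and values are hypothetical):

```python
from collections import defaultdict

# Hypothetical contents, shaped the way the two files are used in this notebook
example_synonyms = {
    "Theine": "Caffeine",
    "1,3,7-Trimethylxanthine": "Caffeine",
    "Theanine": "L-Theanine",
}
example_names = ["Caffeine", "L-Theanine", "Piracetam"]

# Group every phrase under the substance it should resolve to
phrases_per_substance = defaultdict(list)
for redirect, title in example_synonyms.items():
    phrases_per_substance[title].append(redirect)
for title in example_names:
    # The title itself is also a phrase worth matching
    phrases_per_substance[title].append(title)

print(dict(phrases_per_substance))
```

A substance without any redirect, such as the hypothetical "Piracetam" entry, still ends up with one phrase: its own name.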

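Before a phrase can be added to the PhraseMatcher, it needs to be converted into a spaCy `Doc`. `nlp.make_doc` does this by running only the tokenizer, which is all the matcher needs when matching on the `LOWER` attribute: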

In [None]:
# Make a dictionary that maps each substance to a list of spaCy docs, one per synonym
words_to_patterns = {synonyms[i]: [] for i in synonyms}
words_to_patterns.update({name: [] for name in names})

for synonym in tqdm(synonyms):
    word = synonyms[synonym]
    words_to_patterns[word].append(nlp.make_doc(synonym))

for name in tqdm(names):
    words_to_patterns[name].append(nlp.make_doc(name))

Finally, add all patterns to the matcher:

In [None]:
for word in tqdm(words_to_patterns):
    matcher.add(word, words_to_patterns[word])

Get a list of all the files we need to process. Note that as before, we also wrote code for processing the comments - but we do not use it.

In [None]:
submissions_path = Path.cwd() / "private_data" / "reddit_data" / "submissions"
# comments_path = Config.Path.private_data_folder / "reddit_data" / "comments"

# The recursive glob also returns directories, so keep regular files only
submission_files = [p for p in submissions_path.glob("**/*") if p.is_file()]
# comments_files = [p for p in comments_path.glob("**/*") if p.is_file()]

To process the posts efficiently, we feed them to the PhraseMatcher through a *generator* - which is like a list, but whose elements are produced on the fly, here by reading them from disk.
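As a small aside, the laziness of generators can be seen in the toy sketch below. The function `lazy_posts` and the `produced` list are hypothetical and only serve the illustration: nothing is produced when the generator is created, and each element is produced only when it is requested.

```python
produced = []  # records which elements have actually been generated


def lazy_posts(n):
    """Yield n fake posts one at a time, logging each as it is produced."""
    for i in range(n):
        produced.append(i)
        yield {"id": i}


posts = lazy_posts(1000)
created_before_iteration = list(produced)  # still empty: no work has been done yet

first = next(posts)  # only now is the first post produced
created_after_first = list(produced)

print(created_before_iteration, first, created_after_first)
```

With real posts, this means only one file's content needs to be held in memory at a time instead of the whole dataset.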

In [None]:
# Generator function that yields raw Reddit posts together with their file path
def get_submissions_generator(submission_files):
    for file in tqdm(submission_files):
        with open(file, "r") as f:
            try:
                yield (json.load(f), file)
            except JSONDecodeError:
                pass
    # If comments are used as well, continue to feed those
    try:
        for file in tqdm(comments_files):
            with open(file, "r") as f:
                try:
                    yield (json.load(f), file)
                except JSONDecodeError:
                    pass
    except NameError:
        pass

The results from that generator are passed into a new generator, which uses spaCy to tokenize them:

In [None]:
def get_submission_doc_generator(submissions_generator):
    for submission, path in submissions_generator:
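        try:
            # Submissions have a title and a content field
            text = submission["title"] + " " + submission["content"]
        except KeyError:
            # Otherwise fall back to the body field (presumably a comment)
            text = submission["body"]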
        yield (nlp.make_doc(text), submission, path)

Instantiate the two generators:

In [None]:
submission_generator = get_submissions_generator(submission_files=submission_files)
submission_doc_generator = get_submission_doc_generator(submission_generator)

Finally, one last generator that produces *matches* for each post - that is, a list of substances that were detected in each post.

In [None]:
match_generator = (
    (matcher(doc), submission, path, doc)
    for doc, submission, path in submission_doc_generator
)

Now let's run the processing, writing each file back to disk with the added match information, and collecting all the data in one large dict:

In [None]:
submissions_dict = {}
for matches, submission, path, doc in match_generator:
    # Resolve each match ID back to the name of the substance it stands for
    matches_resolved = [matcher.vocab[match[0]].text for match in matches]
    # Eliminate duplicates
    matches_unique = list(set(matches_resolved))
    if not matches_unique:
        continue
    # Add to the submission and save back to file. Also add to the large Reddit dictionary
    submission["matches"] = matches_unique
    submission["n_of_words"] = len(doc)
    with open(path, "w") as f:
        json.dump(submission, f)
    submissions_dict[submission["id"]] = submission

And save the large dict to file:

In [None]:
# Save full reddit data to file:
with open(Config.Path.reddit_data_with_NER, "w+") as f:
    json.dump(submissions_dict, f)

As a convenience for further analysis, let's save a mapping between substances and the posts in which they appear:

In [None]:
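posts_per_substance = {substance: [] for substance in names}
for post_id, submission in submissions_dict.items():
    for substance in submission["matches"]:
        # setdefault guards against matched names that are missing from `names`
        posts_per_substance.setdefault(substance, []).append(post_id)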

with open(Config.Path.posts_per_substance, "w+") as f:
    json.dump(posts_per_substance, f)
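As a rough sketch of why this mapping is convenient, the example below uses a made-up miniature version of it (the `toy_` data and post IDs are hypothetical) to rank substances by number of mentioning posts and to find the posts that mention two substances together.

```python
# Hypothetical miniature version of the substance -> post IDs mapping
toy_posts_per_substance = {
    "Caffeine": ["a1", "b2", "c3"],
    "L-Theanine": ["b2"],
    "Piracetam": [],
}

# Number of posts mentioning each substance
mention_counts = {
    substance: len(post_ids)
    for substance, post_ids in toy_posts_per_substance.items()
}

# Substances ordered from most to least mentioned
ranking = sorted(mention_counts, key=mention_counts.get, reverse=True)

# Posts in which two given substances are mentioned together
shared_posts = set(toy_posts_per_substance["Caffeine"]) & set(
    toy_posts_per_substance["L-Theanine"]
)

print(ranking, shared_posts)
```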