
### 2.2.2 Detecting mentions of nootropics in Reddit posts

And now, finally, some text analysis! Although this is still preparatory processing, it uses NLP techniques that differ from those we saw during the year, so they are worth describing in a bit more detail.

Once more, the full process can be found in [TODO-add link] - we present the most important points here. 

The posts we downloaded from Reddit are plain text. To build a graph from them, the first step is to detect mentions of nootropics in each post. For this we use [spaCy](https://spacy.io/), an open-source NLP library.

#### Named Entity Recognition with PhraseMatchers

What we want to do is called, in NLP parlance, *Named Entity Recognition (NER)*. As the name implies, the task is to detect *named entities* (in this case, nootropics) in text. Nowadays, most NER engines are based on statistical machine-learning models, which provide much greater sensitivity than list-based approaches: given enough training data, they can recognize entities of a given type without being handed an explicit list of such entities, and they cope much better with alternative spellings, synonyms, badly formatted text and typos.

Due to the limited scope of this project, we have neither the time nor the resources to train a model that recognizes nootropics. Instead, we use another powerful feature of spaCy: [Rule-Based Matchers](https://spacy.io/usage/rule-based-matching), more specifically [Phrase Matchers](https://spacy.io/usage/rule-based-matching#phrasematcher).

A PhraseMatcher can be given a list of entities, and for each entity, a list of phrases that will be recognized as an instance of that entity.

Once it is initialized, a sequence of tokenized texts can be run through the matcher, which returns, for each text, the entities it recognized.
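
As a rough illustration of the idea (not the code we actually ran), a PhraseMatcher built on a blank English pipeline could look like the sketch below. It assumes spaCy v3; the synonym lists and the helper name `find_substances` are made up for this example.

```py
import spacy
from spacy.matcher import PhraseMatcher

# Blank English pipeline: tokenizer only, so no trained model is required.
nlp = spacy.blank("en")

# Hypothetical synonym lists, for illustration only.
synonyms = {
    "Caffeine": ["caffeine", "coffee"],
    "Piracetam": ["piracetam", "nootropil"],
}

# attr="LOWER" makes the matching case-insensitive.
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
for substance, phrases in synonyms.items():
    # Each entity is registered under its name, with one pattern per phrase.
    matcher.add(substance, [nlp.make_doc(phrase) for phrase in phrases])


def find_substances(texts):
    """Return, for each text, the sorted names of the entities that matched."""
    results = []
    for doc in nlp.pipe(texts):
        # Each match is a (match_id, start_token, end_token) tuple.
        found = {nlp.vocab.strings[match_id] for match_id, _, _ in matcher(doc)}
        results.append(sorted(found))
    return results


print(find_substances(["I took Piracetam with my morning coffee.", "Nothing relevant here."]))
```

The match id is the hash of the entity name, which is why it is looked up in `nlp.vocab.strings` to get the name back.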

#### Wikipedia redirects as synonyms

One of the nice insights from this part was that the *redirects* provided by Wikipedia are a great source for building synonym lists: each Wikipedia page comes with a list of alternative titles that redirect to it, and these titles can be used directly as the input phrases of the PhraseMatcher.
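
To make this concrete, here is a minimal sketch of turning redirects into synonym lists. It assumes the redirects were fetched through the MediaWiki API (`action=query` with `prop=redirects` and `format=json`), which may differ from how we actually collected them. The function name and the sample response are hypothetical, and the page ids are invented.

```py
def synonyms_from_redirects(api_response):
    """Map each page title to its own title plus all titles redirecting to it."""
    synonyms = {}
    for page in api_response["query"]["pages"].values():
        # Pages without any redirect have no "redirects" key at all.
        redirect_titles = [r["title"] for r in page.get("redirects", [])]
        synonyms[page["title"]] = [page["title"]] + redirect_titles
    return synonyms


# Hand-written response in the shape of a MediaWiki API answer.
# The values are illustrative only, not real API output.
sample_response = {
    "query": {
        "pages": {
            "1": {
                "pageid": 1,
                "title": "Piracetam",
                "redirects": [{"pageid": 2, "ns": 0, "title": "Nootropil"}],
            },
            "3": {"pageid": 3, "title": "Caffeine"},
        }
    }
}

print(synonyms_from_redirects(sample_response))
```

Including the page title itself in the list means the canonical name is matched as well as its alternative titles.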

#### The process

We won't put the code here, as it is not terribly interesting - for that, please refer to the relevant [notebook - TODO]. Instead, we'll mention the broad steps of the process:

1. Initialize the PhraseMatcher
2. Use spaCy to parse the synonyms, and feed them to the PhraseMatcher in the form:

    ```py
    {
        <substance_name>: [<synonym_1>, <synonym_2>, ...]
    }
    ```

3. Iterate through all posts:
    1. Tokenize the post with spaCy
    2. Feed the post to the PhraseMatcher
    3. Save the resulting matches, if any, to file
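
The overall shape of this loop can be sketched with the standard library alone. Note that this is not our actual implementation: the spaCy PhraseMatcher is replaced here by a simple whole-word regular-expression search, and the function names, the output format and the sample posts are assumptions made for the example. Only the `title` and `content` fields come from our data.

```py
import json
import os
import re
import tempfile


def naive_matches(text, synonyms):
    """Stand-in for the PhraseMatcher: whole-word, case-insensitive search."""
    found = []
    for substance, phrases in synonyms.items():
        pattern = r"\b(" + "|".join(re.escape(p) for p in phrases) + r")\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(substance)
    return found


def detect_and_save(posts, synonyms, out_path):
    """Return {post_id: [substances]} for posts with matches, and save it as JSON."""
    matches = {}
    for post_id, post in posts.items():
        found = naive_matches(post["title"] + " " + post["content"], synonyms)
        if found:  # "if any": posts without mentions are skipped
            matches[post_id] = found
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(matches, f)
    return matches


# Made-up posts and synonyms, for illustration only.
posts = {
    "a": {"title": "Coffee and piracetam", "content": "Any experience?"},
    "b": {"title": "Hello", "content": "Nothing relevant here"},
}
synonyms = {"Caffeine": ["caffeine", "coffee"], "Piracetam": ["piracetam", "nootropil"]}

with tempfile.TemporaryDirectory() as tmp_dir:
    print(detect_and_save(posts, synonyms, os.path.join(tmp_dir, "matches.json")))
```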




```py
from library_functions.imports_explainer_notebook import *
```

### 2.3 Preliminary data analysis

Let's now look at some of the basic properties of our data. Note that the same information is available, in a more esthetically pleasing form, at [TODO - link website].

First let's see the size of our datasets:

```py
reddit_data = lf.load_data_reddit()
wiki_data = lf.load_data_wiki()

number_of_posts = len(reddit_data)
number_of_pages = len(wiki_data["name"])

printmd(f"Number of posts on Reddit: \t {number_of_posts}")
printmd(f"Number of WikiPedia Pages: \t {number_of_pages}")
```

Number of posts on Reddit: 	 108588

Number of WikiPedia Pages: 	 1502

Let's also look at the average lengths:

```py
# Total number of characters: title + body for posts, page text for Wikipedia
total_length_reddit = sum(
    len(p["title"]) + len(p["content"]) for p in reddit_data.values()
)
total_length_wiki = sum(len(p) for p in wiki_data["content"])

average_length_reddit = total_length_reddit / number_of_posts
average_length_wiki = total_length_wiki / number_of_pages

printmd(f"Average post length (in characters) on Reddit: \t {average_length_reddit}")
printmd(f"Average page length (in characters) on Wikipedia: \t {average_length_wiki}")
```


Average post length (in characters) on Reddit: 	 436.927634729436

Average page length (in characters) on Wikipedia: 	 4798.645139813582

The average page is about 11 times longer than the average post (4799 vs. 437 characters), but there are roughly 72 times as many posts as pages (108588 vs. 1502).

Let's see how many nootropics are mentioned, on average, per Reddit post:

```py
total_number_of_matches = sum(len(p["matches"]) for p in reddit_data.values())
average_matches_per_post = total_number_of_matches / number_of_posts

printmd(f"There are, on average, {average_matches_per_post:.2f} nootropics per reddit post.")
```

There are, on average, 1.00 nootropics per reddit post.

That's a surprisingly round number! Let's see how many posts contain two or more nootropics (those are the posts from which a link can be created):

```py
n_of_posts_with_links = sum(1 for p in reddit_data.values() if len(p["matches"]) >= 2)
print(f"There are {n_of_posts_with_links} posts that contain two or more nootropics.")
```

There are 23361 posts that contain two or more nootropics.
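
To illustrate why a post needs at least two nootropics to produce a link, here is a small sketch of how co-mentions could be turned into weighted links. It is only an assumption about how the links might be derived, not a description of our graph construction. It also assumes that `matches` is a list of substance names, and the function name and sample posts are made up.

```py
from collections import Counter
from itertools import combinations


def co_mention_links(posts):
    """Count, for each unordered pair of substances, the posts mentioning both."""
    links = Counter()
    for post in posts.values():
        # Deduplicate and sort so that each pair has one canonical ordering.
        unique = sorted(set(post["matches"]))
        # A post with fewer than two substances yields no pairs at all.
        links.update(combinations(unique, 2))
    return links


# Made-up posts, for illustration only.
posts = {
    "a": {"matches": ["Caffeine", "Theanine"]},
    "b": {"matches": ["Theanine", "Caffeine", "Piracetam"]},
    "c": {"matches": ["Caffeine"]},
}

for pair, weight in sorted(co_mention_links(posts).items()):
    print(pair, weight)
```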


Finally, let's see how many links (towards other nootropics) a Wikipedia page has on average.
Note that this uses a graph that is only built in the next section, but we found the number interesting enough to include here.

```py
wiki_graph = lf.create_graph_wiki()
# Out-degree of a page = number of links from it to other nootropic pages
outgoing_links = [d for _, d in wiki_graph.out_degree]
average_links = np.mean(outgoing_links)
printmd(f"There are on average {average_links:.2f} links to other nootropics on each WikiPedia page.")
```

There are on average 2.78 links to other nootropics on each WikiPedia page.

This is all well and good, but averages tell us nothing about the spread of these metrics. For that, let's look at some histograms instead:

```py
printmd("Reddit Dataset:")
lf.get_reddit_plots_figure()
```

Reddit Dataset:

```py
printmd("WikiPedia Dataset:")
lf.get_wiki_plots_figure()
```

WikiPedia Dataset:

Because we think these plots are of interest to a wider audience, we discuss them in greater detail on our [TODO-website], and we invite the reader to take a look there.