In [3]:
from library_functions.imports_explainer_notebook import *

# Text Analysis: Wordclouds, Sentiments, and More

We already touched on text analysis when preparing our data; now it is time to revisit our pages and posts and see what more we can extract from them. This was one of the more exciting parts of the assignment from a practical perspective, since text analysis gave us a lot of insight into our dataset, but the techniques we use are very similar to what we did during the year.

With that in mind, we will keep this section of the notebook brief, and you are encouraged to explore our results on the website instead.

## Wordcloud generation and comparison

Similar to what we did during the semester, we generated wordclouds based on the TF-IDF scores of both pages and posts. For posts, we aggregated all posts mentioning a given nootropic and treated each aggregate as one "document".


First, we need to load our graphs. This time, when generating the Reddit graph, we include as a node attribute the text of all posts that mention that node, and as a link attribute the text of all posts that mention both nodes:


In [4]:
graph_reddit = lf.create_graph_reddit(
    max_drugs_in_post=6,  # Ignore posts that have too many substances in them, as they are likely noise
    min_edge_occurrences_to_link=2,  # Only link two substances that are mentioned together at least twice
    include_link_contents=True,
    include_node_contents=True,
    min_content_length_in_characters=30,
)
graph_wiki = lf.create_graph_wiki()
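To make the role of these parameters more concrete, here is a minimal sketch of how post text could be attached to nodes and links. It is not the code behind `lf.create_graph_reddit`: the helper `collect_contents`, the `(text, substances)` input format, and the exact way each threshold is applied (per post, inclusive bounds) are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def collect_contents(
    posts,
    max_drugs_in_post=6,
    min_edge_occurrences_to_link=2,
    min_content_length_in_characters=30,
):
    """Hypothetical helper: group post texts by substance and by substance pair.

    `posts` is an iterable of (text, set_of_substance_names) tuples.
    """
    node_contents = defaultdict(list)
    link_contents = defaultdict(list)

    for text, substances in posts:
        # Assumption: very short posts are dropped before anything else.
        if len(text) < min_content_length_in_characters:
            continue
        # Assumption: posts mentioning too many substances are treated as noise.
        if not substances or len(substances) > max_drugs_in_post:
            continue

        for substance in substances:
            node_contents[substance].append(text)
        # Sorting makes (a, b) and (b, a) map to the same undirected link.
        for pair in combinations(sorted(substances), 2):
            link_contents[pair].append(text)

    # Keep only links supported by enough co-mentions.
    kept_links = {
        pair: texts
        for pair, texts in link_contents.items()
        if len(texts) >= min_edge_occurrences_to_link
    }
    return dict(node_contents), kept_links


# Made-up example posts, not taken from the dataset.
posts = [
    ("Caffeine with L-theanine smooths out the jitters for me.", {"caffeine", "l-theanine"}),
    ("I stack caffeine and L-theanine every single morning.", {"caffeine", "l-theanine"}),
    ("Piracetam plus caffeine gave me a headache yesterday.", {"piracetam", "caffeine"}),
    ("too short", {"caffeine"}),
]
node_contents, link_contents = collect_contents(posts)
```

In this toy example, the caffeine/piracetam pair is mentioned together only once, so it does not become a link, while its text still counts towards both nodes.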

Then, we use spaCy to generate a list of lemmas associated with each node. The commented code can be found in `library_functions/text_analysis.py`:

In [7]:
lf.assign_lemmas(graph_reddit)
lf.assign_lemmas(graph_wiki)

100%|██████████| 1502/1502 [00:46<00:00, 32.43it/s]
100%|██████████| 1502/1502 [00:09<00:00, 155.34it/s]


From there, we can compute term frequencies (TFs) and inverse document frequencies (IDFs). Again, the functions are in `library_functions/text_analysis.py`, and they reuse the TF and IDF functions we implemented during the year:

In [8]:
lf.assign_tfs(graph_reddit)
lf.assign_tfs(graph_wiki)
lf.assign_idfs(graph_reddit)
lf.assign_idfs(graph_wiki)

100%|██████████| 1502/1502 [00:06<00:00, 235.97it/s]
100%|██████████| 1502/1502 [00:00<00:00, 2061.67it/s]
  8%|▊         | 120/1502 [00:46<08:51,  2.60it/s]


KeyboardInterrupt: 

With this done, we can calculate the TF-IDFs:

In [10]:
lf.assign_tf_idfs(graph_reddit)
lf.assign_tf_idfs(graph_wiki)

  8%|▊         | 120/1502 [00:00<00:04, 334.13it/s]


KeyError: 'idfs'
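For readers who do not have `library_functions/text_analysis.py` at hand, the TF, IDF, and TF-IDF steps can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, and the weighting used here (relative-frequency TF and `log(N / df)` IDF) is one common variant that may differ from the one implemented in the library.

```python
import math
from collections import Counter


def term_frequencies(lemmas):
    """Relative frequency of each lemma within one document."""
    counts = Counter(lemmas)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}


def inverse_document_frequencies(documents):
    """log(N / df) for every lemma, where df is the number of documents containing it."""
    n_docs = len(documents)
    doc_freq = Counter()
    for lemmas in documents.values():
        doc_freq.update(set(lemmas))  # count each lemma once per document
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def tf_idfs(documents):
    """Map each document name to a {lemma: TF-IDF score} dictionary."""
    idfs = inverse_document_frequencies(documents)
    return {
        name: {term: tf * idfs[term] for term, tf in term_frequencies(lemmas).items()}
        for name, lemmas in documents.items()
    }


# Made-up lemma lists standing in for the per-node lemmas.
documents = {
    "caffeine": ["coffee", "energy", "sleep", "coffee"],
    "melatonin": ["sleep", "night", "dose"],
}
scores = tf_idfs(documents)
```

With this variant, a lemma that occurs in every document (here `sleep`) gets a score of zero, which is what keeps generic words out of the wordclouds.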

With the TF-IDFs assigned, we can generate a wordcloud for any substance:

In [13]:
lf.wordcloud_from_node(graph_reddit, "caffeine")

TypeError: wordcloud_from_node() missing 1 required positional argument: 'color_func'
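The error above shows that `wordcloud_from_node` expects a `color_func` argument. What the function requires exactly is not shown in this notebook; if it follows the convention of the `wordcloud` package, a colour function receives the word plus layout details as keyword arguments and returns a colour string. A minimal sketch under that assumption, with the hypothetical helper `make_color_func` shading heavier words darker:

```python
def make_color_func(weights, hue=210):
    """Hypothetical helper: build a colour function that darkens high-weight words.

    `weights` maps words to scores (for example TF-IDFs).
    """
    max_weight = max(weights.values(), default=0) or 1.0

    def color_func(word, font_size=None, position=None, orientation=None,
                   random_state=None, **kwargs):
        # Unknown words get weight 0 and therefore the lightest shade.
        share = weights.get(word, 0.0) / max_weight
        lightness = round(70 - 40 * share)  # 70% (light) down to 30% (dark)
        return f"hsl({hue}, 80%, {lightness}%)"

    return color_func


# Made-up weights for illustration.
weights = {"coffee": 0.35, "energy": 0.17}
color_func = make_color_func(weights)
darkest = color_func("coffee")
lightest = color_func("unknown")
```

Whether `lf.wordcloud_from_node(graph_reddit, "caffeine", color_func)` accepts such a function is an assumption that should be checked against `library_functions/text_analysis.py`.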

Similarly, we also implemented functions that generate wordclouds from either a collection of nodes or a link. The code and idea behind them are almost identical to the above, so we will not repeat them. For a showcase, playground, and discussion of our results, please see [TODO- insert link]

## Sentiment analysis - Wojciech