# Store Reddit Posts

This notebook iterates through Reddit posts for the given subreddit and stores them in MongoDB.

For each post it calls a second API to fetch the comments and stores them as an embedded array.

In [25]:
import requests
import time
import pprint
from pymongo import MongoClient

# Connect to Database

- Make sure you have started MongoDB with `docker-compose up -d`.
- The MongoDB instance running in the container is exposed on port 37017 rather than the default 27017, to avoid a clash if you already have MongoDB installed and running locally.

In [26]:
database_name = 'reddit-migraine'

client = MongoClient('mongodb://localhost:37017')
db = client[database_name]

# Create Indexes

We create indexes on the fields below so that lookups stay fast as the collection grows. The unique index on `id` also prevents the same post from being stored twice.


In [27]:
posts_collection = 'posts'

db[posts_collection].create_index('id', unique=True)
db[posts_collection].create_index('selftext', unique=False)
db[posts_collection].create_index('title', unique=False)
db[posts_collection].create_index('author', unique=False)

'author_1'

# Helper Methods

These functions fetch Reddit posts and comments from the pushshift.io server.

In [28]:
def get_posts(pushshift_url, subreddit_name, before_time, max_size=100):
    # Fetch one page of posts older than before_time, newest first.
    should_retry = True
    while should_retry:
        try:
            # Use the max_size argument rather than the global max_entries.
            req = requests.get(f'{pushshift_url}/?subreddit={subreddit_name}&sort=desc&sort_type=created_utc&before={before_time}&size={max_size}')
            output = req.json()
            should_retry = False
        except Exception:
            # A bare except would also swallow KeyboardInterrupt and block Ctrl-C.
            print('retrying post...')
            time.sleep(5)
            should_retry = True
    return output

def get_comments(pushshift_url, comment_id, max_size=100):
    # comment_id is the id of the post whose comments are requested (sent as link_id).
    should_retry = True
    while should_retry:
        try:
            req = requests.get(f'{pushshift_url}/?link_id={comment_id}&limit={max_size}')
            output = req.json()
            should_retry = False
        except Exception:
            print(f'retrying comment {comment_id}...')
            time.sleep(5)
            should_retry = True
    return output
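
The helpers above retry indefinitely with a fixed delay. A bounded variant can be sketched as follows; `fetch_with_retry` and its parameters are hypothetical names introduced for illustration and are not part of the notebook.

```python
import time


def fetch_with_retry(fetch, max_attempts=5, delay_seconds=5.0, sleep=time.sleep):
    """Call fetch() until it succeeds or max_attempts is reached."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except Exception as error:  # narrow this to the errors you expect
            last_error = error
            print(f'attempt {attempt} failed: {error}')
            if attempt < max_attempts:
                sleep(delay_seconds)
    # Give up and surface the last failure instead of looping forever.
    raise RuntimeError(f'gave up after {max_attempts} attempts') from last_error
```

It could wrap a request as `fetch_with_retry(lambda: requests.get(url, timeout=30).json())`, so that a server that stays down eventually raises instead of hanging the notebook.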

# URLs and Constants

Define the API endpoints, the subreddit name, and the page size.

In [29]:
pushshift_post_url = 'https://api.pushshift.io/reddit/search/submission'
pushshift_comment_url = 'https://api.pushshift.io/reddit/comment/search'
subreddit_name = 'migraine'
max_entries = 10

# Get Posts

Fetch posts and the comments associated with each post. For each post, the comments are stored in a new `comments` field as a MongoDB embedded array.


In [None]:
before_time = int(time.time())  # current epoch time
total_posts = 0

def get_page_of_posts(before_time):
    posts = get_posts(pushshift_post_url, subreddit_name, before_time, max_entries)
    data = posts.get('data', [])

    for entry in data:
        comment_id = entry['id']
        # print(f'comment id: {comment_id}')
        comments = get_comments(pushshift_comment_url, comment_id)
        entry['comments'] = comments['data']
        time.sleep(1)
    return data

# try:
for _ in range(1000):
    posts = get_page_of_posts(before_time)
    if not posts:
        # No older posts left; indexing an empty page would raise IndexError.
        break
    for post in posts:
        db['posts'].insert_one(post)
    total_posts += len(posts)
    print(f'Inserted {len(posts)} posts... Total posts so far: {total_posts}')
    print(f'Last before time: {before_time}')
    # The oldest post on this page is the upper bound for the next page.
    before_time = posts[-1]['created_utc']
    print(f'Next before time: {before_time}')
    time.sleep(1)
# except Exception as e:
#     print(f'ERROR - {e}')
#     print(f'Last before_time: {before_time}')

print(f'Done.  Total new posts: {total_posts}')
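
The loop above pages backwards in time by reusing the oldest `created_utc` of each page as the next `before` value. The same idea, isolated from the network and the database, might look like this sketch; `iterate_pages`, `fake_fetch`, and the fake posts are made up for illustration.

```python
def iterate_pages(fetch_page, start_before, max_pages=1000):
    """Yield pages of posts, newest first, until a page comes back empty."""
    before = start_before
    for _ in range(max_pages):
        page = fetch_page(before)
        if not page:
            break
        yield page
        # The oldest post on this page becomes the upper bound for the next one.
        before = page[-1]['created_utc']


# Stand-in for the API: five fake posts with descending timestamps.
fake_posts = [{'id': str(t), 'created_utc': t} for t in (50, 40, 30, 20, 10)]


def fake_fetch(before, size=2):
    return [p for p in fake_posts if p['created_utc'] < before][:size]


pages = list(iterate_pages(fake_fetch, start_before=60))
print([[p['id'] for p in page] for page in pages])
# prints [['50', '40'], ['30', '20'], ['10']]
```

Because the cursor is a timestamp rather than a page number, the run can be resumed later from the last printed `before_time`.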

# About the Run Above

- Inserted 10 posts... Total posts so far: 10000
- Last before time: 1610935950
- Next before time: 1610921799
- Done.  Total new posts: 10000


# Into CSV Format

Next we take the posts and comments and transform them into CSV format.

The desired format of this file is:

| Type | Parent | Author | Text | Title |
| ---- | ------ | ------ | ---- | ----- |
| P/C  |   id   | userid | text | text  |

- Type - `P` for a post or `C` for a comment
- Parent - the post's own id for a post; for a comment, the id of the post it belongs to (for a top-level comment this is `parent_id` without the `t3_` prefix)
- Author - userid
- Text - the value of `selftext` for a post and the value of `body` for a comment
- Title - title of the post (comments repeat the title of their post)
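
To make the layout concrete, here is a small sketch that flattens one made-up post with a single embedded comment into this row format. The post contents are invented, and the output goes to an in-memory buffer rather than a file.

```python
import csv
import io

# A made-up post with one embedded comment, shaped like the stored documents.
post = {
    'id': 'abc123',
    'author': 'poster',
    'title': 'Example title',
    'selftext': 'Post body',
    'comments': [{'author': 'replier', 'body': 'Comment body'}],
}

# One 'P' row for the post, then one 'C' row per comment.
rows = [{'Type': 'P', 'Parent': post['id'], 'Author': post['author'],
         'Text': post['selftext'], 'Title': post['title']}]
rows += [{'Type': 'C', 'Parent': post['id'], 'Author': c['author'],
          'Text': c['body'], 'Title': post['title']} for c in post['comments']]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['Type', 'Parent', 'Author', 'Text', 'Title'])
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

The printed result has a header line followed by `P,abc123,poster,Post body,Example title` and `C,abc123,replier,Comment body,Example title`.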



In [39]:
import csv
from tqdm import tqdm

In [41]:
migraine_file_name = 'reddis_migraine_posts.csv'

In [42]:
def process_comments(comments, title, parent_id):
    return [{
            'Type': 'C',
            'Parent': parent_id,
            'Author': comment['author'],
            'Text': comment['body'],
            'Title': title
            } for comment in comments]
        

def process_post(post):
    entries = [{
        'Type': 'P',
        'Parent': post['id'],
        'Author': post['author'],
        'Text': post.get('selftext', ''),
        'Title': post['title']
    }]
    entries.extend(process_comments(
        post['comments'],
        post['title'],
        post['id']))
    return entries

created_header = False
posts_count = db['posts'].count_documents({})
with tqdm(total=posts_count, desc="Progress") as pbar:
    with open(f'data/{migraine_file_name}', 'w', newline='') as posts_file:
        field_names = ['Type', 'Parent', 'Author', 'Text', 'Title']
        csv_writer = csv.DictWriter(posts_file, fieldnames=field_names)
        for post in db['posts'].find():
            entries = process_post(post)
            if not created_header and len(entries) > 0:
                csv_writer.writeheader()
                created_header = True
            for entry in entries:
                csv_writer.writerow(entry)
                pbar.update(1)


Progress: 106818it [00:03, 29117.76it/s]


# Topic Modeling

This section attempts topic modeling with [Latent Dirichlet Allocation (LDA)](http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation/). Topic modeling is a statistical technique for discovering the abstract topics that occur in a collection of documents. LDA builds a topics-per-document model and a words-per-topic model, both modeled as Dirichlet distributions.

References:
[Topic Modeling and Latent Dirichlet Allocation (LDA) in Python](https://towardsdatascience.com/topic-modeling-and-latent-dirichlet-allocation-in-python-9bf156893c24)

# Read Data

Load just `Text` from the migraine posts into a list.

In [62]:
import numpy as np

migraine_texts = []
header = []

read_header = False
with open(f'data/{migraine_file_name}', 'r') as csv_file:
    csv_reader = csv.reader(csv_file)
    for entry in csv_reader:
        if not read_header:
            header = entry
            read_header = True
        else:
            migraine_texts.append(entry[header.index('Text')])

print(migraine_texts[:10])

["I've been awake the entire night with the worst migraine I have ever had. Im a long time sufferer but this one is different. My thinking is more impaired than usual and my blood pressure is 150/111. I need to call out sick from work but I'm so afraid of getting in trouble. I'm afraid to go to hospital emergency room with the number of COVID cases in my area. I don't know what to do.", 'Hey y’all, I got a referral for a neurologist and while I’m waiting on it, I’ve decided to trial another preventative. I’ve tried Topomax (was on it 2 weeks, couldn’t handle the side effects), currently on Sandomigran (been on it for 19 months, stopped working 9 months in but I continued w it because I didn’t want to accept it wasn’t working 😭 when it did work, it was bloody amazing. While I can increase the dosage for effectiveness, I can barely handle the fatigue it gives). \n\nI have asthma and take venlafaxine for depression &amp; anxiety btw. My GP said this means my options are more limited for p
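
The manual header tracking in the cell above can also be delegated to `csv.DictReader`, which reads the header row itself. This is a sketch on an in-memory sample with made-up contents, not the real data file.

```python
import csv
import io

# In-memory stand-in for the CSV file written earlier (contents are made up).
sample_csv = io.StringIO(
    'Type,Parent,Author,Text,Title\r\n'
    'P,abc123,poster,Post body,Example title\r\n'
    'C,abc123,replier,"Comment, with a comma",Example title\r\n'
)

# DictReader consumes the header row and keys every later row by column name.
texts = [row['Text'] for row in csv.DictReader(sample_csv)]
print(texts)
# prints ['Post body', 'Comment, with a comma']
```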

# Data Pre-processing

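- Tokenization - text is lowercased and split into words.
- Short words - words of 3 characters or fewer are removed.
- Stopwords - all stopwords are removed.
- Lemmatization and stemming - a helper is defined for this, but `preprocess` as written does not call it, so tokens keep their original form.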
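
The first three steps can be illustrated with the standard library alone. This is only a sketch: `simple_tokens` and the tiny `EXAMPLE_STOPWORDS` set are hypothetical stand-ins, and the notebook itself relies on gensim's `simple_preprocess` and `STOPWORDS`.

```python
import re

# Tiny illustrative stopword list; gensim's STOPWORDS is far larger.
EXAMPLE_STOPWORDS = {'have', 'with', 'this', 'been', 'from', 'that'}


def simple_tokens(text, min_length=4):
    """Lowercase, split on non-letters, drop stopwords and short tokens."""
    words = re.findall(r'[a-z]+', text.lower())
    return [w for w in words if len(w) >= min_length and w not in EXAMPLE_STOPWORDS]


print(simple_tokens("I've been awake the entire night with the worst migraine"))
# prints ['awake', 'entire', 'night', 'worst', 'migraine']
```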


In [50]:
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *

import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /home/robsliwa/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [37]:
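# The stemmer was never defined in the original cell; SnowballStemmer is imported above.
stemmer = SnowballStemmer('english')

def lemmatize_stemming(text):
    # Lemmatize as a verb, then stem. Defined here but not called by preprocess.
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

def preprocess(text):
    # Keep tokens longer than 3 characters that are not stopwords.
    return [token for token in simple_preprocess(text) if token not in STOPWORDS and len(token) > 3]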

In [68]:
processed_posts = [elem for elem in map(preprocess, migraine_texts) if len(elem)]

print(processed_posts[:10])

[['awake', 'entire', 'night', 'worst', 'migraine', 'long', 'time', 'sufferer', 'different', 'thinking', 'impaired', 'usual', 'blood', 'pressure', 'need', 'sick', 'work', 'afraid', 'getting', 'trouble', 'afraid', 'hospital', 'emergency', 'room', 'number', 'covid', 'cases', 'area', 'know'], ['referral', 'neurologist', 'waiting', 'decided', 'trial', 'preventative', 'tried', 'topomax', 'weeks', 'couldn', 'handle', 'effects', 'currently', 'sandomigran', 'months', 'stopped', 'working', 'months', 'continued', 'want', 'accept', 'wasn', 'working', 'work', 'bloody', 'amazing', 'increase', 'dosage', 'effectiveness', 'barely', 'handle', 'fatigue', 'gives', 'asthma', 'venlafaxine', 'depression', 'anxiety', 'said', 'means', 'options', 'limited', 'preventatives', 'suggestions', 'preventatives', 'check', 'happy', 'prescribe'], ['night', 'migraine', 'maxed', 'ubrevly', 'angry', 'want', 'hospital', 'underlying', 'health', 'conditions', 'covid', 'needing', 'vent', 'fall', 'asleep'], ['fucked', 'position'

In [85]:
word_dict = gensim.corpora.Dictionary(processed_posts)

count = 0
for k, v in word_dict.items():
    print(k, v)
    count += 1
    if count > 10:
        break

0 afraid
1 area
2 awake
3 blood
4 cases
5 covid
6 different
7 emergency
8 entire
9 getting
10 hospital


# Filter out Tokens

Remove tokens that appear in:

- fewer than 15 documents (an absolute number), or
- more than 50% of the documents (a fraction of the corpus size, not an absolute number).

After these two steps, keep only the 1000 most frequent remaining tokens.

In [70]:
word_dict.filter_extremes(no_below=15, no_above=0.5, keep_n=1000)
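
The effect of these three parameters can be sketched in plain Python on a toy corpus. `filter_tokens` is a hypothetical helper that mimics the filtering rules described above. It is not gensim's implementation, and the alphabetical tie-breaking is an assumption of this sketch.

```python
from collections import Counter


def filter_tokens(docs, no_below, no_above, keep_n):
    """Return the tokens that survive document-frequency filtering."""
    # Document frequency: the number of documents containing each token.
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)
    kept = [t for t, df in doc_freq.items() if no_below <= df <= max_docs]
    # Keep the keep_n most frequent survivors (ties broken alphabetically).
    kept.sort(key=lambda t: (-doc_freq[t], t))
    return kept[:keep_n]


docs = [
    ['migraine', 'pain'],
    ['migraine', 'pain', 'aura'],
    ['migraine', 'aura'],
    ['migraine', 'pain'],
    ['migraine', 'botox'],
    ['sleep'],
]
print(filter_tokens(docs, no_below=2, no_above=0.5, keep_n=10))
# prints ['pain', 'aura']
```

Here `migraine` is dropped for being too common (5 of 6 documents), while `botox` and `sleep` are dropped for being too rare.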

# Gensim doc2bow

For each document, record which dictionary words it contains and how many times each appears, as a list of `(token_id, count)` pairs.
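
As a rough illustration of what a bag-of-words vector holds, the sketch below counts token ids with `collections.Counter`. `to_bow` and the three-word vocabulary are hypothetical; in the notebook the real work is done by the dictionary's `doc2bow` method.

```python
from collections import Counter


def to_bow(tokens, token_to_id):
    """Return sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token_to_id[t] for t in tokens if t in token_to_id)
    return sorted(counts.items())


token_to_id = {'afraid': 0, 'area': 1, 'blood': 2}  # hypothetical vocabulary
print(to_bow(['afraid', 'blood', 'afraid', 'trouble'], token_to_id))
# prints [(0, 2), (2, 1)]
```

Tokens that are not in the vocabulary, such as `trouble` here, simply do not appear in the result.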

In [72]:
bow_posts = [word_dict.doc2bow(doc) for doc in processed_posts]

print(bow_posts[0])

for i in range(len(bow_posts[0])):
    print(f'Word {bow_posts[0][i]}  ("{word_dict[bow_posts[0][i][0]]}") appears {bow_posts[0][i][1]} time(s).')

[(0, 2), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1)]
Word (0, 2)  ("afraid") appears 2 time(s).
Word (1, 1)  ("area") appears 1 time(s).
Word (2, 1)  ("blood") appears 1 time(s).
Word (3, 1)  ("covid") appears 1 time(s).
Word (4, 1)  ("different") appears 1 time(s).
Word (5, 1)  ("emergency") appears 1 time(s).
Word (6, 1)  ("entire") appears 1 time(s).
Word (7, 1)  ("getting") appears 1 time(s).
Word (8, 1)  ("hospital") appears 1 time(s).
Word (9, 1)  ("know") appears 1 time(s).
Word (10, 1)  ("long") appears 1 time(s).
Word (11, 1)  ("migraine") appears 1 time(s).
Word (12, 1)  ("need") appears 1 time(s).
Word (13, 1)  ("night") appears 1 time(s).
Word (14, 1)  ("number") appears 1 time(s).
Word (15, 1)  ("pressure") appears 1 time(s).
Word (16, 1)  ("room") appears 1 time(s).
Word (17, 1)  ("sick") appears 1 time(s).
Word

# TF-IDF

TF-IDF (term frequency-inverse document frequency) is a statistical measure that evaluates how relevant a word is to a document in a collection of documents. This is done by multiplying two metrics: how many times a word appears in a document, and the inverse document frequency of the word across a set of documents.
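
The multiplication can be worked through by hand on a toy corpus. The sketch below assumes one common variant: raw term count times `log2(N / df)`, followed by L2 normalisation of each document vector. `tfidf_weights` is a hypothetical helper, and library implementations may differ in details such as smoothing or how zero weights are handled.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Return one {token: weight} dict per document, L2-normalised."""
    n_docs = len(docs)
    # Document frequency: the number of documents containing each token.
    doc_freq = Counter(token for doc in docs for token in set(doc))
    weighted = []
    for doc in docs:
        raw = {t: count * math.log2(n_docs / doc_freq[t])
               for t, count in Counter(doc).items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        weighted.append({t: w / norm for t, w in raw.items()} if norm else {})
    return weighted


docs = [['migraine', 'aura', 'aura'], ['migraine', 'sleep'], ['migraine', 'botox']]
for weights in tfidf_weights(docs):
    print(weights)
```

In this toy corpus `migraine` occurs in every document, so its inverse document frequency is `log2(3 / 3) = 0` and it gets zero weight everywhere, while the word unique to each document carries all of that document's weight.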

In [74]:
from gensim import corpora, models

tfidf = models.TfidfModel(bow_posts)
posts_tfidf = tfidf[bow_posts]

for doc in posts_tfidf:
    pprint.pprint(doc)
    break

[(0, 0.47493475065658564),
 (1, 0.21115117180368562),
 (2, 0.1703657422979135),
 (3, 0.1955409353505147),
 (4, 0.14473119896123687),
 (5, 0.24500438221230278),
 (6, 0.2168788948413073),
 (7, 0.12595581002458722),
 (8, 0.20849570694048),
 (9, 0.09904807887858504),
 (10, 0.1330117860984265),
 (11, 0.06221695940442934),
 (12, 0.13038766004752086),
 (13, 0.16277642733814046),
 (14, 0.22581638263129158),
 (15, 0.15582283948886846),
 (16, 0.19008084161414196),
 (17, 0.19004990309436387),
 (18, 0.24696751178480564),
 (19, 0.19155983088238432),
 (20, 0.09543082720522261),
 (21, 0.23350827903668087),
 (22, 0.22581638263129158),
 (23, 0.10898930384768368),
 (24, 0.16928728332871606)]


# Running LDA using Bag of Words


In [75]:
lda_model = gensim.models.LdaMulticore(bow_posts, num_topics=10, id2word=word_dict, passes=2, workers=2)

In [80]:
for idx, topic in lda_model.print_topics():
    print(f'Topic {idx}: Words: {topic}')

Topic 0: Words: 0.064*"thank" + 0.051*"thanks" + 0.041*"https" + 0.018*"migraine" + 0.018*"glad" + 0.017*"look" + 0.016*"know" + 0.013*"reddit" + 0.013*"good" + 0.013*"post"
Topic 1: Words: 0.066*"migraines" + 0.023*"blood" + 0.023*"trigger" + 0.022*"pressure" + 0.018*"triggers" + 0.013*"food" + 0.013*"stress" + 0.013*"weather" + 0.012*"high" + 0.012*"think"
Topic 2: Words: 0.053*"like" + 0.038*"feel" + 0.035*"migraine" + 0.025*"head" + 0.019*"pain" + 0.015*"feeling" + 0.013*"time" + 0.011*"usually" + 0.011*"right" + 0.010*"feels"
Topic 3: Words: 0.032*"hope" + 0.031*"sleep" + 0.023*"helps" + 0.022*"better" + 0.020*"work" + 0.020*"light" + 0.019*"help" + 0.019*"migraine" + 0.017*"soon" + 0.015*"time"
Topic 4: Words: 0.067*"migraine" + 0.035*"migraines" + 0.024*"drink" + 0.024*"water" + 0.016*"like" + 0.016*"caffeine" + 0.015*"years" + 0.014*"people" + 0.013*"control" + 0.013*"time"
Topic 5: Words: 0.030*"good" + 0.022*"work" + 0.018*"know" + 0.015*"insurance" + 0.015*"time" + 0.013*"lu

# Running LDA using TF-IDF


In [81]:
lda_model_tfidf = gensim.models.LdaMulticore(posts_tfidf, num_topics=10, id2word=word_dict, passes=2, workers=4)

In [82]:
for idx, topic in lda_model_tfidf.print_topics():
    print(f'Topic {idx}: Word: {topic}')

Topic 0: Word: 0.017*"hope" + 0.014*"glad" + 0.012*"right" + 0.011*"soon" + 0.011*"haha" + 0.010*"better" + 0.010*"migraine" + 0.010*"works" + 0.009*"feel" + 0.008*"like"
Topic 1: Word: 0.073*"thank" + 0.015*"light" + 0.015*"sorry" + 0.015*"love" + 0.014*"smell" + 0.011*"look" + 0.010*"glasses" + 0.010*"interesting" + 0.009*"like" + 0.009*"migraine"
Topic 2: Word: 0.015*"https" + 0.010*"migraine" + 0.010*"feel" + 0.010*"people" + 0.009*"migraines" + 0.008*"like" + 0.008*"know" + 0.007*"medical" + 0.007*"nice" + 0.006*"need"
Topic 3: Word: 0.055*"deleted" + 0.022*"like" + 0.018*"sounds" + 0.014*"know" + 0.013*"hear" + 0.012*"heard" + 0.011*"definitely" + 0.011*"people" + 0.010*"sorry" + 0.010*"doctor"
Topic 4: Word: 0.014*"botox" + 0.013*"insurance" + 0.010*"tried" + 0.009*"work" + 0.009*"nurtec" + 0.009*"month" + 0.009*"aimovig" + 0.008*"meds" + 0.008*"sumatriptan" + 0.008*"triptans"
Topic 5: Word: 0.012*"magnesium" + 0.011*"migraines" + 0.011*"dose" + 0.011*"taking" + 0.010*"migraine"

# Topic Extraction from Titles Only

The output above lists the top words per topic when entire posts and comments are used. No clear pattern emerges. Let's see what happens if we use only the post titles instead.


# Create Title Dataset

Here we create a list of post titles.


In [84]:
migraine_titles = []
header = []

read_header = False
with open(f'data/{migraine_file_name}', 'r') as csv_file:
    csv_reader = csv.reader(csv_file)
    for entry in csv_reader:
        if not read_header:
            header = entry
            read_header = True
        else:
            # comments have the same titles as posts so just ignore them
            if entry[header.index('Type')] == 'P':
                migraine_titles.append(entry[header.index('Title')])

print(len(migraine_titles))
print(migraine_titles[:10])

10000
["Worst I've ever had/calling in sick", 'What preventative to trial next? (Asthmatic &amp; take venlafaxine)', 'Pain', 'Pain vs Relationship', 'Reminder: Birth Control With Estrogen Causes Increased Risk of Stroke', 'New to this and wondering if others experience similar symptoms, including tinnitus?', 'Relatable or should I be worried', 'Does anyone else struggle with shaking hands?', 'Just upset', 'Non standard remedies for migraines']


In [87]:
processed_titles = [elem for elem in map(preprocess, migraine_titles) if len(elem)]

print(processed_titles[:10])

[['worst', 'calling', 'sick'], ['preventative', 'trial', 'asthmatic', 'venlafaxine'], ['pain'], ['pain', 'relationship'], ['reminder', 'birth', 'control', 'estrogen', 'causes', 'increased', 'risk', 'stroke'], ['wondering', 'experience', 'similar', 'symptoms', 'including', 'tinnitus'], ['relatable', 'worried'], ['struggle', 'shaking', 'hands'], ['upset'], ['standard', 'remedies', 'migraines']]


In [88]:
title_dict = gensim.corpora.Dictionary(processed_titles)

count = 0
for k, v in title_dict.items():
    print(k, v)
    count += 1
    if count > 10:
        break

0 calling
1 sick
2 worst
3 asthmatic
4 preventative
5 trial
6 venlafaxine
7 pain
8 relationship
9 birth
10 causes


# Filter out Extremes


In [90]:
title_dict.filter_extremes(no_below=15, no_above=0.5, keep_n=1000)

# Gensim doc2bow for titles


In [91]:
bow_titles = [title_dict.doc2bow(doc) for doc in processed_titles]

print(bow_titles[0])

for i in range(len(bow_titles[0])):
    print(f'Word {bow_titles[0][i]}  ("{title_dict[bow_titles[0][i][0]]}") appears {bow_titles[0][i][1]} time(s).')

[(0, 1), (1, 1)]
Word (0, 1)  ("sick") appears 1 time(s).
Word (1, 1)  ("worst") appears 1 time(s).


# TF-IDF for titles


In [92]:
from gensim import corpora, models

tfidf = models.TfidfModel(bow_titles)
titles_tfidf = tfidf[bow_titles]

for doc in titles_tfidf:
    pprint.pprint(doc)
    break

[(0, 0.7177268263227299), (1, 0.6963247825380781)]


# BOW LDA for titles


In [93]:
lda_model_titles = gensim.models.LdaMulticore(bow_titles, num_topics=10, id2word=title_dict, passes=2, workers=2)

In [94]:
for idx, topic in lda_model_titles.print_topics():
    print(f'Topic {idx}: Words: {topic}')

Topic 0: Words: 0.344*"migraine" + 0.026*"like" + 0.024*"feel" + 0.014*"days" + 0.013*"tried" + 0.013*"headache" + 0.010*"symptoms" + 0.010*"know" + 0.010*"aura" + 0.010*"relief"
Topic 1: Words: 0.057*"time" + 0.055*"migraine" + 0.040*"advice" + 0.034*"like" + 0.032*"topamax" + 0.031*"sumatriptan" + 0.031*"aura" + 0.028*"question" + 0.025*"migraines" + 0.020*"experience"
Topic 2: Words: 0.048*"migraines" + 0.042*"aimovig" + 0.032*"effects" + 0.029*"good" + 0.027*"head" + 0.025*"migraine" + 0.023*"headache" + 0.016*"amitriptyline" + 0.015*"rant" + 0.013*"advice"
Topic 3: Words: 0.105*"migraine" + 0.087*"headaches" + 0.033*"migraines" + 0.021*"week" + 0.020*"medication" + 0.019*"long" + 0.019*"auras" + 0.019*"tension" + 0.018*"headache" + 0.018*"anybody"
Topic 4: Words: 0.062*"migraine" + 0.036*"nurtec" + 0.026*"control" + 0.023*"migraines" + 0.023*"works" + 0.022*"help" + 0.022*"best" + 0.022*"weird" + 0.020*"pain" + 0.020*"birth"
Topic 5: Words: 0.123*"pain" + 0.051*"migraine" + 0.041*

# TF-IDF LDA for titles


In [95]:
lda_model_titles_tfidf = gensim.models.LdaMulticore(titles_tfidf, num_topics=10, id2word=title_dict, passes=2, workers=4)

In [96]:
for idx, topic in lda_model_titles_tfidf.print_topics():
    print(f'Topic {idx}: Word: {topic}')

Topic 0: Word: 0.152*"migraines" + 0.064*"migraine" + 0.039*"feel" + 0.036*"like" + 0.022*"head" + 0.021*"having" + 0.015*"people" + 0.014*"food" + 0.013*"related" + 0.012*"silent"
Topic 1: Word: 0.038*"migraine" + 0.035*"tried" + 0.031*"migraines" + 0.029*"topamax" + 0.027*"today" + 0.027*"life" + 0.025*"headaches" + 0.024*"emgality" + 0.023*"doctor" + 0.019*"helped"
Topic 2: Word: 0.040*"headaches" + 0.034*"neurologist" + 0.030*"treatment" + 0.026*"migraine" + 0.022*"post" + 0.020*"triptans" + 0.018*"eyes" + 0.018*"migraines" + 0.017*"chronic" + 0.016*"pain"
Topic 3: Word: 0.117*"migraines" + 0.090*"headache" + 0.024*"worse" + 0.021*"tension" + 0.020*"migraine" + 0.016*"chronic" + 0.014*"causing" + 0.013*"loss" + 0.013*"headaches" + 0.013*"tired"
Topic 4: Word: 0.204*"migraine" + 0.044*"feeling" + 0.016*"week" + 0.015*"like" + 0.014*"free" + 0.013*"think" + 0.012*"month" + 0.012*"looking" + 0.012*"good" + 0.012*"migraines"
Topic 5: Word: 0.047*"botox" + 0.038*"trigger" + 0.035*"aimov