### Resources

vegan subreddits:
- https://www.reddit.com/r/vegan/
- https://www.reddit.com/r/animalrights/
- https://www.reddit.com/r/animalwelfare/
- https://www.reddit.com/r/veg/
- https://www.reddit.com/r/vegetarian/
- https://www.reddit.com/r/vegetarianism/
- https://www.reddit.com/r/dietaryvegan/
- https://www.reddit.com/r/veganrecipes/
- https://www.reddit.com/r/vegproblems/

location-based vegan subreddits:
- https://www.reddit.com/r/VeganDenver/
- https://www.reddit.com/r/vegan/wiki/localvegansubreddits

### Next Steps

1. Get some Reddit data into a database
    - python scrape_reddit.py
    - eventually: keep a script running continuously that scrapes new subreddit data as it arrives and updates stored threads with new comments
2. Create a pipeline for:
    - reading in the data
    - training a model on the data
    - saving the model out to disk
    - transforming the input data to topics and topic space
    - saving the transformed data out to disk
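
Step 1 could start from something like the sketch below. It is not the contents of `scrape_reddit.py`, which is not shown here. The table name `comments`, its columns, the comment ids, and the helpers `open_db`, `upsert_comments`, and `load_documents` are all hypothetical. The idea is that re-scraping a thread should replace comments already stored rather than duplicate them, and that the modeling pipeline in step 2 can read documents straight back out.

```python
import sqlite3

def open_db(path=':memory:'):
    """Open the database and create the (hypothetical) comments table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        'CREATE TABLE IF NOT EXISTS comments ('
        ' id TEXT PRIMARY KEY,'
        ' subreddit TEXT NOT NULL,'
        ' body TEXT NOT NULL)'
    )
    return conn

def upsert_comments(conn, rows):
    """Insert new comments; replace any row whose id is already stored."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            'INSERT OR REPLACE INTO comments (id, subreddit, body) VALUES (?, ?, ?)',
            rows,
        )

def load_documents(conn, subreddit):
    """Return one subreddit's comment bodies, ordered by id."""
    cur = conn.execute(
        'SELECT body FROM comments WHERE subreddit = ? ORDER BY id', (subreddit,))
    return [body for (body,) in cur]

# Made-up rows standing in for scraped comments.
conn = open_db()
upsert_comments(conn, [('c1', 'vegan', 'first draft'), ('c2', 'vegan', 'tofu scramble')])
upsert_comments(conn, [('c1', 'vegan', 'edited comment')])  # same id, so it is replaced
docs = load_documents(conn, 'vegan')
print(docs)
```

An in-memory database keeps the sketch self-contained; a file path would be needed for the data to outlive the script.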

# MVP Time

A quick-and-dirty scrape of some subreddit data, followed by topic modeling.

In [44]:
# Graphing / Plotting / Printing
%matplotlib inline

import matplotlib.pyplot as plt
from pprint import pprint  # pretty-printer
import seaborn as sns

# Standard Imports
from collections import defaultdict
import os
import string

# Text Analysis Packages
import gensim
from gensim import corpora, similarities, models
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

In [67]:
def get_models(file_name='', documents=None, num_topics=5):
    if not documents:
        with open(file_name, 'r') as f:
            documents = f.read().split('\n')

    # set up a translation table that deletes every punctuation character
    translator = str.maketrans('', '', string.punctuation)

    # remove punctuation, lowercase all words, remove stop words, stem words
    # (stop words are matched against the punctuation-free, unstemmed tokens)
    other_stop_words = {'theyr', 'ive', 'there', 'im', 'he', 'dont', 'id'}
    stop_words = other_stop_words.union(ENGLISH_STOP_WORDS)
    stemmer = PorterStemmer()
    texts = [[stemmer.stem(word)
              for word in document.lower().translate(translator).split()
              if word not in stop_words]
             for document in documents]

    # remove words that appear only once
    frequency = defaultdict(int)
    for text in texts:
        for token in text:
            frequency[token] += 1

    texts = [[token for token in text if frequency[token] > 1]
             for text in texts]

    dictionary = corpora.Dictionary(texts)

    #pprint("Dictionary: {}".format(dictionary))
    #print('\n')
    #pprint("Dictionary, Token To ID: {}".format(dictionary.token2id))

    corpus = [dictionary.doc2bow(text) for text in texts]
    tfidf = models.TfidfModel(corpus)
    corpus_tfidf = tfidf[corpus]
    
    # Latent Dirichlet Allocation
    model = models.LdaModel(corpus, id2word=dictionary, num_topics=num_topics)
    print(model)

    model_tfidf = models.LdaModel(corpus_tfidf, id2word=dictionary, num_topics=num_topics)
    print(model_tfidf)
    
    return model, model_tfidf
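
To make the preprocessing in `get_models` easier to follow, here is a standard-library-only sketch of the same steps on toy input. It deliberately skips stemming and uses a tiny made-up stop-word set instead of `ENGLISH_STOP_WORDS`, so its tokens are not what `get_models` would produce. The helper name `tokenize` and the three sentences are hypothetical.

```python
import string
from collections import Counter

STOP_WORDS = {'i', 'a', 'the', 'is', 'dont', 'im'}  # toy list, an assumption
TRANSLATOR = str.maketrans('', '', string.punctuation)

def tokenize(document):
    """Lowercase, strip punctuation, split on whitespace, drop stop words."""
    words = document.lower().translate(TRANSLATOR).split()
    return [w for w in words if w not in STOP_WORDS]

documents = [
    "I don't eat meat.",
    "Tofu is great!",
    "I'm cooking tofu; the meat stays out.",
]
texts = [tokenize(doc) for doc in documents]

# Drop tokens that occur only once across the whole corpus.
frequency = Counter(token for text in texts for token in text)
texts = [[t for t in text if frequency[t] > 1] for text in texts]
print(texts)
```

Because punctuation is stripped before the stop-word check, "don't" arrives as `dont` and "I'm" as `im`, which is why the extra stop words in `get_models` are written without apostrophes.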

In [52]:
# Latent Semantic Indexing (LSI, or sometimes LSA)
#lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2) # initialize an LSI transformation
#corpus_lsi = lsi[corpus_tfidf]
#model = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=300)
#print(model)

In [53]:
# Note: range(num_topics - 1) stops one short, so the last topic is not printed.
for i in range(model.num_topics - 1):
    print("{0}: {1}\n".format(i, model.print_topic(i)))

0: 0.019*make + 0.014*vegan + 0.012*anim + 0.012*lot + 0.011*just + 0.011*peopl + 0.009*eat + 0.009*mean + 0.008*realiz + 0.008*time

1: 0.026*make + 0.020*think + 0.017*your + 0.014*peopl + 0.013*vegan + 0.013*food + 0.012*just + 0.011*realli + 0.011*eat + 0.010*time

2: 0.023*vegan + 0.022*just + 0.020*make + 0.014*feel + 0.014*like + 0.014*know + 0.013*thing + 0.012*peopl + 0.012*tell + 0.011*food

3: 0.038*eat + 0.024*vegan + 0.018*vegetarian + 0.015*like + 0.014*want + 0.011*peopl + 0.010*anim + 0.010*restaur + 0.010*meat + 0.009*friend

4: 0.039*vegan + 0.017*vegetarian + 0.016*time + 0.014*look + 0.013*year + 0.013*good + 0.011*come + 0.010*like + 0.010*bad + 0.010*your

5: 0.028*meat + 0.022*eat + 0.014*tri + 0.013*vegan + 0.013*like + 0.012*peopl + 0.012*good + 0.010*famili + 0.010*tribe + 0.009*friend

6: 0.033*like + 0.015*just + 0.012*cook + 0.011*make + 0.011*live + 0.010*almond + 0.009*anim + 0.008*peopl + 0.008*go + 0.008*sure

7: 0.023*vegan + 0.023*eat + 0.016*just + 0

In [54]:
# Note: range(num_topics - 1) stops one short, so the last topic is not printed.
for i in range(model_tfidf.num_topics - 1):
    print("{0}: {1}\n".format(i, model_tfidf.print_topic(i)))

0: 0.009*do + 0.009*know + 0.008*egg + 0.008*base + 0.007*like + 0.007*2 + 0.007*cup + 0.007*luck + 0.006*recip + 0.006*vegan

1: 0.013*tofu + 0.011*hummu + 0.010*vegan + 0.010*product + 0.009*problem + 0.008*peopl + 0.008*funni + 0.008*error + 0.008*chees + 0.008*lentil

2: 0.018*eat + 0.011*food + 0.008*point + 0.008*thing + 0.007*meat + 0.007*hummu + 0.007*ill + 0.007*vegan + 0.006*buy + 0.006*mean

3: 0.010*ugh + 0.009*restaur + 0.009*make + 0.009*your + 0.009*got + 0.007*know + 0.007*time + 0.006*annoy + 0.006*that + 0.006*anim

4: 0.013*vegan + 0.010*food + 0.008*make + 0.008*hummu + 0.008*eat + 0.007*feel + 0.006*like + 0.006*just + 0.006*meat + 0.006*think

5: 0.019*vegan + 0.009*ye + 0.007*ask + 0.007*tofu + 0.007*chees + 0.006*egg + 0.006*live + 0.006*milk + 0.006*mean + 0.006*famili

6: 0.011*look + 0.009*recip + 0.007*suck + 0.007*make + 0.007*pretti + 0.007*vegan + 0.007*eat + 0.007*didnt + 0.007*nurs + 0.007*go

7: 0.009*time + 0.009*complaint + 0.007*order + 0.006*brand 
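
The `print_topic` strings above are `weight*word` terms joined by ` + `. If only the words are needed from saved output like this, a small parser is enough. This is a sketch: `parse_topic` is a hypothetical helper, not part of gensim. It also strips double quotes around words, on the assumption that some gensim versions print them quoted.

```python
def parse_topic(topic_string):
    """Turn '0.019*make + 0.014*vegan' into [('make', 0.019), ('vegan', 0.014)]."""
    pairs = []
    for term in topic_string.split(' + '):
        weight, _, word = term.partition('*')
        pairs.append((word.strip().strip('"'), float(weight)))
    return pairs

# First three terms of topic 0 from the plain LDA output above.
topic_0 = '0.019*make + 0.014*vegan + 0.012*anim'
pairs = parse_topic(topic_0)
words = [word for word, _ in pairs]
print(words)
```

The resulting `(word, weight)` pairs have the same shape that the later cells index with `x[0]` when they call `show_topic`.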

In [57]:
file_name_template = os.path.join('data', 'mvp-{}-subreddit.txt')
subreds = [
        'animalrights',
        'animalwelfare',
        'veg',
        'vegetarian',
        'vegetarianism',
        'dietaryvegan',
        'veganrecipes',
        'vegproblems',
    ]
file_names = [file_name_template.format(subred) for subred in subreds]

for file_name in file_names:
    model, model_tfidf = get_models(file_name)
    print('SUBRED {} MODEL'.format(file_name))
    # Note: range(num_topics - 1) stops one short, so only four of the five
    # topics are printed for each model.
    for i in range(model.num_topics - 1):
        print('Topic {0}: {1}'.format(i, ', '.join([x[0] for x in model.show_topic(i)])))
    print('\n')
    print('SUBRED {} MODEL_TFIDF'.format(file_name))
    for i in range(model_tfidf.num_topics - 1):
        print('Topic {0}: {1}'.format(i, ', '.join([x[0] for x in model_tfidf.show_topic(i)])))
    print('\n')
    print('***************************')

LdaModel(num_terms=517, num_topics=5, decay=0.5, chunksize=2000)
LdaModel(num_terms=517, num_topics=5, decay=0.5, chunksize=2000)
SUBRED data/mvp-animalrights-subreddit.txt MODEL
Topic 0: vegan, anim, peopl, meat, think, pig, help, like, eat, ban
Topic 1: anim, like, abus, cat, report, sub, hope, peopl, good, feel
Topic 2: anim, right, human, feel, cat, like, activist, test, just, help
Topic 3: anim, peopl, right, live, make, kill, help, new, want, vegan


SUBRED data/mvp-animalrights-subreddit.txt MODEL_TFIDF
Topic 0: anim, abus, kill, hunt, think, use, read, peopl, meat, wild
Topic 1: anim, right, welfar, report, realli, hope, world, group, good, extinct
Topic 2: human, anim, peopl, judg, right, milk, protect, fight, action, vegan
Topic 3: vegan, anim, feel, right, ranimalwelfar, like, check, cat, big, free


***************************
LdaModel(num_terms=520, num_topics=5, decay=0.5, chunksize=2000)
LdaModel(num_terms=520, num_topics=5, decay=0.5, chunksize=2000)
SUBRED data/mvp-ani

# Comparing topics across subreddits

Here each subreddit's entire corpus is treated as a single document.
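
One consequence of this setup is worth spelling out. With one document per subreddit, the corpus has only as many documents as there are subreddits. Under an inverse-document-frequency weight of the form log2(N / df), any word that appears in all of them gets weight zero. The sketch below computes that weight with the standard library only. It is a simplified illustration of the idea, not gensim's `TfidfModel` (which, as far as I can tell, uses an idf of this form by default and then normalizes each document vector). The helper `idf_weights` and the three toy documents are made up.

```python
import math

def idf_weights(texts):
    """Map each token to log2(N / df), where df is its document frequency."""
    n_docs = len(texts)
    doc_freq = {}
    for text in texts:
        for token in set(text):  # count each token once per document
            doc_freq[token] = doc_freq.get(token, 0) + 1
    return {token: math.log2(n_docs / df) for token, df in doc_freq.items()}

# Toy stand-ins for three subreddits, each treated as one document.
texts = [
    ['vegan', 'eat', 'tofu'],
    ['vegan', 'eat', 'shelter'],
    ['vegan', 'cat', 'shelter'],
]
weights = idf_weights(texts)
print(weights['vegan'])  # in every document, so log2(3 / 3) = 0.0
print(weights['cat'])    # in one document, so log2(3 / 1)
```

This may help explain why TF-IDF-weighted topics lean toward subreddit-specific words, while plain bag-of-words topics are dominated by words common to every subreddit.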

In [68]:
documents = []
for file_name in file_names:
    with open(file_name, 'r') as f:
        documents.append(f.read().replace('\n', ' '))

model, model_tfidf = get_models(documents=documents, num_topics=10)

# Note: range(num_topics - 1) stops one short, so only nine of the ten topics
# are printed for each model.
for i in range(model.num_topics - 1):
    print('Topic {0}: {1}'.format(i, ', '.join([x[0] for x in model.show_topic(i)])))
print('\n')
for i in range(model_tfidf.num_topics - 1):
    print('Topic {0}: {1}'.format(i, ', '.join([x[0] for x in model_tfidf.show_topic(i)])))

LdaModel(num_terms=2952, num_topics=10, decay=0.5, chunksize=2000)
LdaModel(num_terms=2952, num_topics=10, decay=0.5, chunksize=2000)
Topic 0: eat, vegetarian, anim, vegan, just, like, meat, make, good, food
Topic 1: like, anim, vegan, eat, vegetarian, make, just, meat, peopl, help
Topic 2: meat, vegan, eat, like, vegetarian, just, make, peopl, good, anim
Topic 3: vegan, eat, meat, anim, like, just, food, make, vegetarian, peopl
Topic 4: meat, eat, vegan, vegetarian, like, make, anim, food, just, realli
Topic 5: vegan, eat, anim, vegetarian, just, make, like, meat, recip, tri
Topic 6: vegan, anim, eat, make, like, meat, vegetarian, just, peopl, food
Topic 7: vegan, eat, meat, vegetarian, like, just, anim, make, peopl, food
Topic 8: anim, just, meat, like, peopl, make, vegetarian, eat, vegan, food


Topic 0: meat, vegan, eat, just, anim, butter, hummu, diet, like, dad
Topic 1: shelter, cat, vet, veterinari, anim, dog, bunni, uc, pet, stray
Topic 2: vegan, anim, wool, diet, meat, sweater
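
The plain LDA topics above look nearly interchangeable. One rough way to put a number on that is the Jaccard overlap between two topics' top-word lists, copied here from the output above. This is only a sketch: `jaccard` is a hypothetical helper, and comparing top-10 word sets ignores the word weights entirely.

```python
def jaccard(a, b):
    """Size of the intersection divided by size of the union of two word lists."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# Top words copied from the output above.
lda_topic_0 = 'eat, vegetarian, anim, vegan, just, like, meat, make, good, food'.split(', ')
lda_topic_7 = 'vegan, eat, meat, vegetarian, like, just, anim, make, peopl, food'.split(', ')
tfidf_topic_0 = 'meat, vegan, eat, just, anim, butter, hummu, diet, like, dad'.split(', ')
tfidf_topic_1 = 'shelter, cat, vet, veterinari, anim, dog, bunni, uc, pet, stray'.split(', ')

print(jaccard(lda_topic_0, lda_topic_7))      # 9 shared words out of 11 distinct
print(jaccard(tfidf_topic_0, tfidf_topic_1))  # 1 shared word out of 19 distinct
```

On these two pairs, the plain LDA topics share almost all of their top words, while the TF-IDF topics share only `anim`.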