<div style="display: block; width: 100%; height: 120px;">

<p style="float: left;">
    <span style="font-weight: bold; line-height: 24px; font-size: 16px;">
        DIGHUM160 - Critical Digital Humanities
        <br />
        Digital Hermeneutics 2019
    </span>
    <br >
    <span style="line-height: 22px; font-size: 14px; margin-top: 10px;">
        Week 3-3: Topic modeling <br />
        Created by Tom van Nuenen (tom.van_nuenen@kcl.ac.uk)
    </span>
</p>
</div>


# Topic Modeling

Topic modeling of submission titles in the subreddit /r/UpliftingNews

# Sentiment Analysis

Sentiment analysis on the comments for each topic. 


## Importing libraries

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction import text
import re
import pandas as pd
import numpy as np
from more_itertools import chunked
import nltk
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
from nltk import word_tokenize
from nltk.corpus import brown
!pip install pyLDAvis
import pyLDAvis.sklearn 
%matplotlib inline
from matplotlib import pyplot as plt
plt.rcParams['figure.figsize'] = [10, 5]

[nltk_data] Downloading package punkt to /Users/rajesh/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/rajesh/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/rajesh/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/rajesh/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


## Importing data

I am using PRAW to get the subreddit topics and replies. Once I get them, I save them locally, so I can use them next time without hitting the API. If you have uplifting.csv and replies.csv, then you can skip Steps 1-6. 
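
The save-then-reuse pattern described above can be sketched as follows. This is a minimal illustration rather than the notebook's own code: `load_or_fetch` and the `fetch_rows` callable are hypothetical names, and the standard-library `csv` module stands in for pandas.

```python
import csv
import os

def load_or_fetch(path, fetch_rows, fieldnames):
    """Return rows from `path` if it exists; otherwise call `fetch_rows()` and cache the result."""
    if os.path.exists(path):
        # Cached copy found: no API call needed. Values read back from CSV are strings.
        with open(path, newline="", encoding="utf-8") as handle:
            return list(csv.DictReader(handle))
    rows = fetch_rows()  # hypothetical callable that queries the API
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

On the first call the rows are fetched and written to disk; every later call with the same path reads the file instead.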

In [32]:
import os
env_vars = !cat .env
env_dic = {}
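for var in env_vars:
    # Skip blank lines and comments
    if not var.strip() or var.lstrip().startswith('#'):
        continue
    # Split on the first '=' only, since values may themselves contain '='
    key, value = var.split('=', 1)
    env_dic[key.strip()] = value.strip()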

In [33]:
import praw
reddit = praw.Reddit(client_id=env_dic['clientid'],
                     client_secret=env_dic['secret'], password=env_dic['password'],
                     user_agent=env_dic['user_agent'], username=env_dic['username'])


In [34]:
subreddit = reddit.subreddit('UpliftingNews')

# Hot, Top and Rising
(See https://www.reddit.com/r/help/comments/32eu8w/what_is_the_difference_between_newrising_hot_top/) 
I am getting the Hot, Top and Rising Uplifting news. 

Hot is what's been getting a lot of upvotes/comments recently.

New sorts posts by the time of submission with the newest at the top of the page. It sorts new posts in the area of reddit where you clicked the 'new' tab. So all.reddit.com/new gives you the latest posts from the entirety (almost) of reddit, while clicking new in askreddit will give you their latest posts. This applies to all the tabs I explain below.

Rising is what is getting a lot of activity (comments/upvotes) right now. This is the category you are looking for.

Controversial is what's getting multiple downvotes and upvotes.

Top is what has gotten the most upvotes over the set period.

Gilded are just comments which have been given reddit gold by someone. Typically comments are gilded for being exceptionally informative, or funny, but someone can gild a comment for any reason at all. You could give gold (gild) a comment full of hate speech if you wanted to.


# Data Preparation and Reading

In [5]:
# id,title,score,author,created ,num_comments,distinguished,edited


from datetime import datetime, timezone
datetime.now(timezone.utc).strftime("%Y%m%d")
hot_uplift = []
sub_columns = ['id','title','score','author','created' ,'num_comments','distinguished','edited']
hot_uplift_sub = subreddit.hot(limit = 100)
for submission in hot_uplift_sub:
    created = datetime.fromtimestamp(submission.created_utc).strftime("%m/%d/%Y")
    hot_uplift.append([submission.id,submission.title,submission.score,submission.author,created ,submission.num_comments,submission.distinguished,submission.edited])
hot_uplift_df = pd.DataFrame(hot_uplift,columns=sub_columns)

In [6]:
# Fetch up to 100 of the top submissions
top_uplift_sub = subreddit.top(limit = 100)
top_uplift = []
for submission in top_uplift_sub:
    created = datetime.fromtimestamp(submission.created_utc).strftime("%m/%d/%Y")
    top_uplift.append([submission.id,submission.title,submission.score,submission.author,created,submission.num_comments,submission.distinguished,submission.edited])
top_uplift_df = pd.DataFrame(top_uplift,columns=sub_columns)


In [7]:
# Fetch up to 100 of the rising submissions
rising_uplift_sub = subreddit.rising(limit = 100)
rising_uplift = []
for submission in rising_uplift_sub:
    created = datetime.fromtimestamp(submission.created_utc).strftime("%m/%d/%Y")
    rising_uplift.append([submission.id,submission.title,submission.score,submission.author,created,submission.num_comments,submission.distinguished,submission.edited])
rising_uplift_df = pd.DataFrame(rising_uplift,columns=sub_columns)


In [8]:
subreddit_all = [hot_uplift_df,top_uplift_df,rising_uplift_df]
subreddit_all_df = pd.concat(subreddit_all)
len(subreddit_all_df)

217

In [9]:
# Write all the news to a file
subreddit_all_df.to_csv('uplifting.csv', index=False)


In [5]:
# Read all the news from a file. 
# You can start with this, if the uplifting.csv already exists. 

subreddit_all_df = pd.read_csv("uplifting.csv")


In [11]:
conversedict = {}
for id in subreddit_all_df.id:
    submission = reddit.submission(id=id)
    submission.comments.replace_more(limit=0)
    for comment in submission.comments.list():
        if comment.id not in conversedict:
            conversedict[comment.id] = [comment.body,{}]
            if comment.parent() != submission.id:
                parent = str(comment.parent())
                conversedict[parent][1][comment.id] = [comment.ups, comment.body]

In [13]:
converse_df = pd.DataFrame.from_dict(conversedict, orient="index")
converse_df.to_csv("comments.csv")
converse_df.head()

Unnamed: 0,0,1
g1354qo,I cant believe they found all this out in only...,"{'g136my3': [167, 'I would have thought it wou..."
g131ym2,"That's great news, except for ""gallons per kil...","{'g133lcy': [145, 'Tbh anytime I see these sor..."
g135e6r,"Interesting. Also because a cheap, effective f...","{'g137kkk': [19, 'Reverse osmosis also has ins..."
g136yyo,If anyone wanted to be saved the 30 seconds: \...,"{'g13jbp2': [43, 'Hm, I know some of those wor..."
g13os4y,Watch how we are never going to use it,"{'g14p7a5': [1, 'In fact, this is the last tim..."


In [17]:
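## You can use the saved comments.csv for the corresponding saved uplifting.csv
import ast
converse_df_new = pd.read_csv("comments.csv", index_col=0)
# Restore the integer column labels that were turned into strings by the CSV round trip
converse_df_new.columns = [0, 1]
# The reply dictionaries are stored as text in the CSV, so parse them back into dicts
converse_df_new[1] = converse_df_new[1].apply(ast.literal_eval)
conversedict_new = converse_df_new.to_dict(orient="index")

converse_df_new.head()
conversedict_new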


nedryl and it takes soooooo long but if they rush me, my throat closes up.',
  1: {'g11fgmd': [2,
    "My only reaction is I'm in a state of haze for like 24 hours afterwards. I have all of my mental faculties, they're just slower for a bit."],
   'g10rsg0': [1,
    'Uuuuugh I’m so sorry that happens to you! By some miracle, i don’t have much of a reaction to it. How many doses have you had? I’ve heard that it can slowly get better with each infusion. Also my neuro has me take a bunch of pre-meds to help with any reaction. You take Pepcid and Claritin for three days before, the day of, and two days after. And you take Benadryl the night before. If you haven’t tried pre-meds - May be worth talking to your doc about!']}},
 'g10htr0': {0: 'Good comment with great context; thanks for adding to the conversation. The only thing I can add is that I’ve read that, because of the difficulty with marketing and rebranding something that was discontinued, drug companies think it may be better to wa

In [18]:
for post_id in conversedict:
    message = conversedict[post_id][0]
    replies = conversedict[post_id][1]
    if len(replies) > 1:
        print('Original Message: {}'.format(message))
        print(35*'_')
        print('Replies:')
        for reply in replies:
            print(replies[reply])

 it feels, but not in your oxygen supply is the point."]
[9, 'I also wear a mask all the time and I don\'t know what most people here are talking about. I\'m not sure if the people saying "but oxygen is still at 98%" conduct vigorous exercise. If most people did sprints with a mask on they would either have to stop much sooner than they would without a mask because they couldn\'t catch their breath or they would pass out trying to prove a point. Spreading obviously wrong information isn\'t helping. The correct approach is, "hey these will hurt performance in physical activities but will save lives so they are worth it."']
Original Message: >stop doing HIIT while shopping.

Hell no, how else am I going to get my 6 24-packs of TP in these trying times?!
___________________________________
Replies:
[89, 'Ha this person is posting from March, get with the times!']
[10, 'The new TP is paper towels, judging by my shopping trip earlier.']
[2, 'Bidet though?']
Original Message: If you take you

In [19]:
# Export the comments to a list, for a dataframe. 

subreddit_comments = []
for post_id in conversedict:
    message = conversedict[post_id][0]
    replies = conversedict[post_id][1]
    if len(replies) > 1:
        for reply in replies:
            subreddit_comments.append(replies[reply])

In [20]:
len(conversedict)

56367

In [27]:
# Messages with some replies. 

len(subreddit_comments)

30181

In [22]:
subreddit_comments_df = pd.DataFrame(subreddit_comments, columns = ["Upvotes", "Text"])
subreddit_comments_df.head(50)


Unnamed: 0,Upvotes,Text
0,167,I would have thought it would take hours to re...
1,11,That’s all the time they had funding for
2,1,thank u for this
3,1,Hahahahahhahaha
4,145,Tbh anytime I see these sort of stories I just...
5,74,"Per cycle. They mention ""stability"" and ""multi..."
6,10,This kind of oversight is what ruined that mus...
7,2,You should try reading the actual research pap...
8,1,While the energy and material costs improving ...
9,1,120 gallons/kg isn't something to scoff at.


In [24]:
len(subreddit_comments_df)

30181

In [38]:
# All comments, with or without replies. 
# There are many comments which don't get any replies or upvotes. 

messages_output = []

for post_id in conversedict:
    message = conversedict[post_id][0]
    replies = conversedict[post_id][1]
    if len(message) > 1:
        messages_output.append([message])
messages_df = pd.DataFrame(messages_output, columns = ["Text"])
messages_df.head(50)


Unnamed: 0,Text
0,I cant believe they found all this out in only...
1,"That's great news, except for ""gallons per kil..."
2,"Interesting. Also because a cheap, effective f..."
3,If anyone wanted to be saved the 30 seconds: \...
4,Watch how we are never going to use it
5,What is the waste material?
6,Read that as 'per kilogram of flirtation mater...
7,I don't understand how this is not major world...
8,This sounds like really good news.
9,How long before we drain the oceans?


In [39]:
len(messages_df)

56318

In [69]:
subreddit_comments_df.to_csv('replies.csv', index=False)
subreddit_comments_df = pd.read_csv("replies.csv")


In [24]:
messages_df.to_csv('messages.csv', index=False)
messages_df = pd.read_csv("messages.csv")

In [27]:
len(subreddit_comments_df)

30181

In [25]:
len(messages_df)

56319

# Preprocessing
Let's have a look at the text in our `subreddit_comments_df['Text']` column.

### Data cleaning using RegEx
Regular Expressions are often used to clean up data – special characters, newlines, and so on. Here we use RegEx to collapse newlines and other runs of whitespace into single spaces, remove single quotes, and replace backslashes with spaces in our `subreddit_comments_df['Text']` column:

In [28]:
# Collapse newlines and other runs of whitespace into a single space
subreddit_comments_df['Text'] = [re.sub(r'\s+', ' ', sent) for sent in subreddit_comments_df['Text']]
# Remove distracting single quotes
subreddit_comments_df['Text'] = [re.sub(r"\'", "", sent) for sent in subreddit_comments_df['Text']]
# Replace backslashes with a space
subreddit_comments_df['Text'] = [re.sub(r'\\', ' ', sent) for sent in subreddit_comments_df['Text']]
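
To see what these three substitutions do to a single string, here is a small self-contained sketch. The helper name `clean_comment` and the sample text are made up for illustration; the patterns are the ones used in the cell above.

```python
import re

def clean_comment(text):
    """Apply the same three substitutions as the cell above to one string."""
    text = re.sub(r"\s+", " ", text)   # collapse whitespace runs, including newlines
    text = re.sub(r"\'", "", text)     # drop ASCII single quotes
    text = re.sub(r"\\", " ", text)    # turn backslashes into spaces
    return text

sample = "I can't\nbelieve\\it"
print(clean_comment(sample))  # I cant believe it
```

Note that only the ASCII quote `'` is removed. Typographic apostrophes such as the one in "That’s" are left untouched, which is why they still appear in the output below.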


In [29]:
subreddit_comments_df['Text'][:100]

0     I would have thought it would take hours to re...
1              That’s all the time they had funding for
2                                      thank u for this
3                                       Hahahahahhahaha
4     Tbh anytime I see these sort of stories I just...
                            ...                        
95    It’s funny when all the news shit on China for...
96    They have been in the facility for over a year...
97    Chinese fleets are too busy fucking up other c...
98    Can they just send the coast guard to arrest t...
99    Its in the article. A company bought the zoo t...
Name: Text, Length: 100, dtype: object

### Next, let's do sentiment analysis and find the frequent n-grams 

___Sentiment Analysis___

TextBlob's `sentiment` property returns a namedtuple of the form Sentiment(polarity, subjectivity). The polarity score is a float within the range [-1.0, 1.0], where negative values indicate negative sentiment and positive values indicate positive sentiment. The subjectivity is a float within the range [0.0, 1.0], where 0.0 is very objective and 1.0 is very subjective.
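
As a rough sketch of how such a tuple can be used, the snippet below unpacks the two scores and maps polarity to a coarse label. The `Sentiment` namedtuple here is a stand-in with the same field names, the helper `label_polarity` is hypothetical, and the 0.1 threshold is an arbitrary assumption, not something TextBlob prescribes.

```python
from collections import namedtuple

# Stand-in with the same field names as the tuple described above;
# in the notebook the values come from TextBlob.
Sentiment = namedtuple("Sentiment", ["polarity", "subjectivity"])

def label_polarity(sentiment, threshold=0.1):
    """Map a polarity score to a coarse label; the threshold is an arbitrary choice."""
    if sentiment.polarity > threshold:
        return "positive"
    if sentiment.polarity < -threshold:
        return "negative"
    return "neutral"

example = Sentiment(polarity=-0.25, subjectivity=0.7166666666666667)
polarity, subjectivity = example  # a namedtuple unpacks like a regular tuple
print(label_polarity(example), polarity, subjectivity)
```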



In [31]:
from textblob import TextBlob

subreddit_comments_df['Sentiment'] = subreddit_comments_df['Text'].apply(lambda comment: TextBlob(comment).sentiment)
subreddit_comments_df.head(50)

Unnamed: 0,Upvotes,Text,Sentiment
0,167,I would have thought it would take hours to re...,"(0.0, 0.0)"
1,11,That’s all the time they had funding for,"(0.0, 0.0)"
2,1,thank u for this,"(0.0, 0.0)"
3,1,Hahahahahhahaha,"(0.0, 0.0)"
4,145,Tbh anytime I see these sort of stories I just...,"(-0.25, 0.7166666666666667)"
5,74,"Per cycle. They mention ""stability"" and ""multi...","(0.175, 0.3625)"
6,10,This kind of oversight is what ruined that mus...,"(0.3, 0.45)"
7,2,You should try reading the actual research pap...,"(0.0, 0.1)"
8,1,While the energy and material costs improving ...,"(0.8, 0.75)"
9,1,120 gallons/kg isnt something to scoff at.,"(0.0, 0.0)"


In [61]:
## n-grams

subreddit_comments_df['N-grams'] = subreddit_comments_df['Text'].apply(lambda comment: TextBlob(comment).ngrams(n=3))
subreddit_comments_df.head(50)

Unnamed: 0,Upvotes,Text,Sentiment,N-grams
0,167,I would have thought it would take hours to re...,"(0.0, 0.0)","[[I, would, have], [would, have, thought], [ha..."
1,11,That’s all the time they had funding for,"(0.0, 0.0)","[[That, ’, s], [’, s, all], [s, all, the], [al..."
2,1,thank u for this,"(0.0, 0.0)","[[thank, u, for], [u, for, this]]"
3,1,Hahahahahhahaha,"(0.0, 0.0)",[]
4,145,Tbh anytime I see these sort of stories I just...,"(-0.25, 0.7166666666666667)","[[Tbh, anytime, I], [anytime, I, see], [I, see..."
5,74,"Per cycle. They mention ""stability"" and ""multi...","(0.175, 0.3625)","[[Per, cycle, They], [cycle, They, mention], [..."
6,10,This kind of oversight is what ruined that mus...,"(0.3, 0.45)","[[This, kind, of], [kind, of, oversight], [of,..."
7,2,You should try reading the actual research pap...,"(0.0, 0.1)","[[You, should, try], [should, try, reading], [..."
8,1,While the energy and material costs improving ...,"(0.8, 0.75)","[[While, the, energy], [the, energy, and], [en..."
9,1,120 gallons/kg isnt something to scoff at.,"(0.0, 0.0)","[[120, gallons/kg, isnt], [gallons/kg, isnt, s..."
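
The table above lists every trigram per comment, but it does not yet tell us which trigrams are frequent. A minimal sketch of that counting step, in plain Python, could look like this. The helper names `word_ngrams` and `most_common_ngrams` are hypothetical, and splitting on whitespace is a simplification of TextBlob's tokenizer.

```python
from collections import Counter

def word_ngrams(tokens, n=3):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def most_common_ngrams(token_lists, n=3, top=5):
    """Count n-grams across many tokenized comments and return the most frequent ones."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(word_ngrams(tokens, n))
    return counts.most_common(top)

print(word_ngrams("thank u for this".split()))
# [('thank', 'u', 'for'), ('u', 'for', 'this')]
print(word_ngrams(["Hahahahahhahaha"]))
# []
```

A comment shorter than `n` tokens yields an empty list, which matches the empty N-grams cell for the one-word comment in the table.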


### POS tagging & filtering

POS stands for "Part Of Speech". English is traditionally described as having eight parts of speech: noun, pronoun, verb, adjective, adverb, preposition, conjunction, and interjection. A word's part of speech indicates how it functions, both in meaning and grammatically, within the sentence.

Use spaCy to get the most frequent adjectives and verbs. 
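
One possible way to count the most frequent adjectives and verbs, once each token has a part-of-speech tag, is sketched below. The helper `most_frequent_by_pos` is hypothetical and the sample is tagged by hand; with spaCy, the pairs could be built from each token's `text` and `pos_` attributes.

```python
from collections import Counter

def most_frequent_by_pos(tagged_tokens, wanted=("ADJ", "VERB"), top=3):
    """Count lower-cased tokens per part-of-speech tag and return the most common ones."""
    counts = {tag: Counter() for tag in wanted}
    for word, tag in tagged_tokens:
        if tag in counts:
            counts[tag][word.lower()] += 1
    return {tag: counter.most_common(top) for tag, counter in counts.items()}

# Hand-tagged sample; with spaCy these pairs could come from
# [(token.text, token.pos_) for token in nlp(comment)].
tagged = [("She", "PRON"), ("deserves", "VERB"), ("a", "DET"), ("medal", "NOUN"),
          ("Good", "ADJ"), ("news", "NOUN"), ("good", "ADJ"), ("life", "NOUN")]
print(most_frequent_by_pos(tagged))
# {'ADJ': [('good', 2)], 'VERB': [('deserves', 1)]}
```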


In [62]:
import spacy
nlp = spacy.load('en_core_web_lg')


In [63]:
def lemmatizer(doc):
    # This takes in a doc of tokens from the NER and lemmatizes them. 
    # Pronouns (like "I" and "you") get lemmatized to '-PRON-' in spaCy 2.x, so I'm removing those.
    doc = [token.lemma_ for token in doc if token.lemma_ != '-PRON-']
    doc = ' '.join(doc)
    return nlp.make_doc(doc)
    
def remove_stopwords(doc):
    # This will remove stopwords and punctuation.
    # Use token.text to return a list of strings rather than spaCy tokens.
    doc = [token.text for token in doc if not token.is_stop and not token.is_punct]
    return doc

# The add_pipe function appends our functions to the default pipeline.
# Note: passing a function directly to add_pipe is the spaCy 2.x API; spaCy 3 expects the name of a registered component.
if not nlp.has_pipe("lemmatizer"):
    nlp.add_pipe(lemmatizer, name='lemmatizer', after='ner')
if not nlp.has_pipe("stopwords"):
    nlp.add_pipe(remove_stopwords, name="stopwords", last=True)

In [64]:
subreddit_comments_df['POS'] = subreddit_comments_df['Text'].apply(lambda x: nlp(x))


In [70]:
subreddit_comments_df.head(10)

Unnamed: 0,Upvotes,Text,Sentiment
0,167,I would have thought it would take hours to re...,"Sentiment(polarity=0.0, subjectivity=0.0)"
1,11,That’s all the time they had funding for,"Sentiment(polarity=0.0, subjectivity=0.0)"
2,1,thank u for this,"Sentiment(polarity=0.0, subjectivity=0.0)"
3,1,Hahahahahhahaha,"Sentiment(polarity=0.0, subjectivity=0.0)"
4,145,Tbh anytime I see these sort of stories I just...,"Sentiment(polarity=-0.25, subjectivity=0.71666..."
5,74,"Per cycle. They mention ""stability"" and ""multi...","Sentiment(polarity=0.175, subjectivity=0.3625)"
6,10,This kind of oversight is what ruined that mus...,"Sentiment(polarity=0.3, subjectivity=0.45)"
7,2,You should try reading the actual research pap...,"Sentiment(polarity=0.0, subjectivity=0.1)"
8,1,While the energy and material costs improving ...,"Sentiment(polarity=0.8, subjectivity=0.75)"
9,1,120 gallons/kg isnt something to scoff at.,"Sentiment(polarity=0.0, subjectivity=0.0)"


In [71]:
## Sort the comments by highest upvotes and inspect their sentiment
subreddit_comments_df.sort_values('Upvotes',ascending=False).head(50)

Unnamed: 0,Upvotes,Text,Sentiment
14236,13179,She deserves a medal. There are plenty of grea...,"Sentiment(polarity=0.5, subjectivity=0.475)"
26568,10995,[removed],"Sentiment(polarity=0.0, subjectivity=0.0)"
16723,9516,I hope the guy actually lives a good life.,"Sentiment(polarity=0.35, subjectivity=0.350000..."
9710,9136,"Just think, one little thing he did that day m...","Sentiment(polarity=-0.060357142857142866, subj..."
7448,9042,Not sure why this is hard to understand for so...,"Sentiment(polarity=-0.030902777777777807, subj..."
14291,8672,>Smitherman will most likely end up paying a f...,"Sentiment(polarity=0.22916666666666669, subjec..."
4917,8646,"“Gotta have opposites, light and dark and dark...","Sentiment(polarity=0.24464285714285713, subjec..."
21514,7860,7 in 10 minutes?? Grandma was pounding them!,"Sentiment(polarity=0.0, subjectivity=0.0)"
19800,7841,Good news! we save 14th of them! 14? But there...,"Sentiment(polarity=1.0, subjectivity=0.6000000..."
9714,7476,Also kudos to the police taking this seriously...,"Sentiment(polarity=0.08333333333333334, subjec..."


In [72]:
## Sort the comments by lowest upvotes and inspect their sentiment
subreddit_comments_df.sort_values('Upvotes',ascending=True).head(50)

Unnamed: 0,Upvotes,Text,Sentiment
757,-155,Bill Gates is one scary-ass megalomaniac. He w...,"Sentiment(polarity=0.11636904761904762, subjec..."
758,-120,Only if it works and its safer than both the d...,"Sentiment(polarity=0.034374999999999996, subje..."
759,-81,Name definitely checks out.,"Sentiment(polarity=0.0, subjectivity=0.5)"
3126,-68,[removed],"Sentiment(polarity=0.0, subjectivity=0.0)"
3183,-68,"There is ""Lyme disease"" and then there is ""chr...","Sentiment(polarity=-0.008333333333333337, subj..."
350,-52,"Lyme disease isn’t a real threat, the chronic ...","Sentiment(polarity=0.0031250000000000028, subj..."
3023,-46,I dont know what supports my view more. Your s...,"Sentiment(polarity=0.48333333333333334, subjec..."
3069,-43,"No. Im sour over her statement. > ""I think rep...","Sentiment(polarity=0.125, subjectivity=0.55)"
29910,-41,JaGuIrS wIlL dIe BeCaUsE tRuMp Is PrEsIdEnT,"Sentiment(polarity=0.0, subjectivity=0.0)"
2379,-41,or like calll someboday mabay idk that mighttt...,"Sentiment(polarity=0.25, subjectivity=0.25)"


## Topic modeling
Time to build our topic model! Before we do so, we need to turn our corpus into word counts.

### Using `CountVectorizer`
We use scikit-learn's `CountVectorizer` from last week's tf-idf exercise again – but this time, we only look at term frequencies. The result is a matrix with one row per document and one column per word in (almost) the entire vocabulary of our corpus, holding the count of each word in each document. 

Note that we set `max_features` to 1000, which means we only use the top 1000 words by term frequency across the corpus. We also remove words that occur in fewer than two documents (`min_df=2`), and words that occur in more than 95% of the documents (`max_df=0.95`).
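
To make the three thresholds concrete, here is a plain-Python illustration of the filtering logic. It is a sketch of the idea, not scikit-learn's implementation: `select_vocabulary` is a hypothetical helper, and breaking frequency ties alphabetically is an assumption.

```python
from collections import Counter

def select_vocabulary(token_lists, min_df=2, max_df=0.95, max_features=1000):
    """Keep terms by document frequency, then cap the vocabulary by total term frequency."""
    n_docs = len(token_lists)
    doc_freq = Counter()   # number of documents containing each term
    term_freq = Counter()  # total occurrences of each term
    for tokens in token_lists:
        term_freq.update(tokens)
        doc_freq.update(set(tokens))
    kept = [term for term in doc_freq
            if doc_freq[term] >= min_df and doc_freq[term] <= max_df * n_docs]
    # Most frequent terms first; ties broken alphabetically (an assumption)
    kept.sort(key=lambda term: (-term_freq[term], term))
    return sorted(kept[:max_features])

docs = [["dog", "saved", "the"], ["dog", "school", "the"], ["elephant", "the"]]
print(select_vocabulary(docs))  # ['dog']
```

In this toy corpus "the" appears in all three documents and is dropped by `max_df`, while "saved", "school" and "elephant" each appear in only one document and are dropped by `min_df`.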

In [73]:
subreddit_all_df

Unnamed: 0,id,title,score,author,created,num_comments,distinguished,edited
0,i8td9j,Stray dog who visited car dealership every day...,22605,cyanocittaetprocyon,08/13/2020,217,,False
1,i8lrm1,First-grader starts foundation to help feed th...,6500,hildebrand_rarity,08/12/2020,127,,False
2,i90mk2,Kenya's Elephant Numbers Double Over Three Dec...,187,TherealTushsar,08/13/2020,6,,False
3,i8q8d3,New treatments spur sharp reduction in lung ca...,1831,DeliciousBowler2041,08/12/2020,37,,False
4,i8ytet,'Extinct' large blue butterfly successfully re...,86,Lt_Quill,08/13/2020,1,,False
...,...,...,...,...,...,...,...,...
212,i89zbi,Elephant chained in a Pakistan zoo for 35 year...,1798,robbiekhan,08/12/2020,27,,False
213,i83e37,Indian company to offer leave for tough times ...,6927,OrangeMonkey42,08/11/2020,339,,False
214,i8kahy,Firefighter fosters dog he saved from flames,144,kojima100,08/12/2020,4,,False
215,i7z8pz,Cities cannot fine homeless people for living ...,25583,TrumpDumper,08/11/2020,838,,False


In [74]:
# We're only training for 1000 features (i.e., most-occurring words). Feel free to change this.
no_features = 1000

# Using TF vectorizer to get top terms
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, max_features=no_features, stop_words='english')
tf = tf_vectorizer.fit_transform(subreddit_all_df['title'])
# Note: get_feature_names() was removed in scikit-learn 1.2; on newer versions use get_feature_names_out()
tf_feature_names = tf_vectorizer.get_feature_names()


In [75]:
len(tf_feature_names)

437

In [76]:
# Sort a list of tuples in place by the second item
def Sort_Tuple(tup):  
  
    # key selects the second element of each tuple;
    # reverse=True sorts in descending order
    tup.sort(key = lambda x: x[1], reverse=True)  
    return tup 
  

In [12]:
## Vocabulary terms from the Count Vectorizer, paired with their column index in the matrix (not their counts)
from itertools import islice
feature_list = list(islice(tf_vectorizer.vocabulary_.items(), 100))
Sort_Tuple(feature_list)


[('york', 435),
 ('woolly', 432),
 ('visited', 424),
 ('viral', 423),
 ('video', 422),
 ('uk', 415),
 ('treatments', 408),
 ('tony', 404),
 ('tattoo', 396),
 ('support', 390),
 ('successfully', 386),
 ('stray', 382),
 ('stimulus', 378),
 ('starts', 375),
 ('st', 372),
 ('spur', 371),
 ('small', 364),
 ('sharp', 357),
 ('share', 356),
 ('school', 348),
 ('scholarship', 347),
 ('saving', 344),
 ('reintroduced', 325),
 ('reduction', 322),
 ('rate', 320),
 ('rain', 318),
 ('protection', 315),
 ('promise', 312),
 ('powerful', 309),
 ('park', 295),
 ('orphaned', 287),
 ('onassis', 286),
 ('numbers', 280),
 ('nigerian', 278),
 ('new', 276),
 ('mrs', 269),
 ('mortality', 268),
 ('money', 266),
 ('michigan', 256),
 ('lung', 242),
 ('louis', 241),
 ('list', 234),
 ('large', 222),
 ('lands', 221),
 ('lambs', 219),
 ('lambkin', 218),
 ('knits', 215),
 ('kenya', 211),
 ('kennedy', 210),
 ('jumpers', 206),
 ('job', 202),
 ('jacqueline', 200),
 ('inhalable', 195),
 ('homeless', 184),
 ('help', 177),


In [77]:
from wordcloud import WordCloud

def draw_wordcloud(frequencies): 
    plt.figure()
    wordcloud_doc = WordCloud(max_words = 25, background_color = "white", collocations=False).generate_from_frequencies(frequencies)
    plt.imshow(wordcloud_doc, interpolation='bilinear')
    plt.axis("off")
    plt.title('Top 25 Words')
    plt.show()
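
The `draw_wordcloud` function expects a mapping from each word to its frequency, whereas `vocabulary_` maps each word to a column index. One way to build such a mapping is to sum each column of the document-term counts. The sketch below uses plain lists and a hypothetical helper `term_frequencies` with made-up numbers; in the notebook the column totals would come from the `tf` matrix instead.

```python
def term_frequencies(vocabulary, count_rows):
    """Sum each column of a document-term count table and key the totals by term."""
    column_totals = [sum(column) for column in zip(*count_rows)]
    return {term: column_totals[index] for term, index in vocabulary.items()}

vocabulary = {"dog": 0, "help": 1, "school": 2}   # term -> column index
count_rows = [[1, 0, 2], [0, 1, 1], [3, 0, 0]]    # one row per document
print(term_frequencies(vocabulary, count_rows))
# {'dog': 4, 'help': 1, 'school': 3}
```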

### Topic modeling using `LatentDirichletAllocation` 

Next, we run scikit-learn's `LatentDirichletAllocation` class. Note that we can choose how many topics we want to find – by far the most important parameter to set when creating a topic model. Let's start with 10.

Some other parameters to understand:
- `max_iter` determines the maximum number of iterations to be performed when fitting the model.
- Setting `random_state` to 0 fixes the seed of the random number generator used by scikit-learn. This results in reproducible topics.

For more info, see [here](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html).


In [78]:
# We're only training for 10 topics in our topic model (feel free to change this)
no_topics = 10
numWords=8
# Run LDA
lda_model = LatentDirichletAllocation(n_components=no_topics, max_iter=5, learning_method='online', learning_offset=50.,random_state=0).fit(tf)
lda_W = lda_model.transform(tf)
lda_H = lda_model.components_
for i, topic in enumerate(lda_model.components_):
    print("Topic {}".format(i))
    print(" ".join([tf_feature_names[j] for j in topic.argsort()[:-numWords - 1:-1]]))

Topic 0
leave people indian homeless company children offer 4m
Topic 1
day new year dog help old saved homeless
Topic 2
elephant blood dealership left years boom baby zoo
Topic 3
marijuana donates goes police study use coronavirus wearing
Topic 4
years home million dogs use 20 year help
Topic 5
son birthday food uk help metal extinct making
Topic 6
set missing department abuse illegal stop year university
Topic 7
new boy rescued hour 10 york 000 rain
Topic 8
million new kenya people look just properly numbers
Topic 9
year school old trash girl student stop light


In [15]:
len(lda_W)

217

Okay, what variables have we created here? 
- `lda_model` sets up our model and its parameters, after which we apply `.fit(tf)` to *fit* it to our TF matrix; 
- `lda_W` is a topics-to-documents matrix (the distribution of the topics present in each document, or in our case, submission title): it has 217 rows, one per title, and 10 columns, one per topic.
- `lda_H` is a words-to-topics matrix (the weight of each word within each topic): it has 10 rows, one per topic we decided to infer, and one column per word in the vocabulary.

We can use these two matrices to print out the most significant words for each topic in the next step.
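
A toy example may help to read the two matrices. The numbers below are invented and far smaller than the real `lda_W` and `lda_H`. Dividing each topic row by its sum is the usual way to read the word weights as a probability distribution.

```python
import numpy as np

# Hypothetical weights for 3 documents, 2 topics and 4 vocabulary words.
W = np.array([[0.9, 0.1],
              [0.2, 0.8],
              [0.5, 0.5]])            # documents x topics
H = np.array([[4.0, 1.0, 0.5, 0.5],
              [0.5, 0.5, 3.0, 2.0]])  # topics x words

dominant_topic = W.argmax(axis=1)               # most prominent topic per document
word_probs = H / H.sum(axis=1, keepdims=True)   # each topic row rescaled to sum to 1

print(W.shape, H.shape)       # (3, 2) (2, 4)
print(dominant_topic)         # [0 1 0]
print(word_probs.round(2))
```

When two topics tie, as in the third document, `argmax` returns the first of them.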

### Displaying the topics

So we have a topic model. But how to display it?
We'll write a `display_topics()` function, which takes both the words-to-topics matrix (`H`) and the `feature_names` as parameters.

Our `display_topics` function prints out a numerical index as the topic name, and prints the top words in the topic. NumPy's `argsort()` method returns the indices that would sort a row of the matrix in ascending order of weight; the slice `[:-no_top_words - 1:-1]` then takes the last `no_top_words` of those indices in reverse, so the highest-weighted words come first.
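
The slicing idiom is easier to follow on a tiny array. In this sketch the weights and words are invented for illustration.

```python
import numpy as np

weights = np.array([0.1, 2.5, 0.7, 1.9, 0.3])     # hypothetical weights of five words in one topic
words = ["dog", "school", "zoo", "help", "rain"]  # hypothetical vocabulary

ascending = weights.argsort()             # indices from smallest to largest weight
top_two = weights.argsort()[:-2 - 1:-1]   # last two indices, largest first

print(ascending)                    # [0 4 2 3 1]
print([words[i] for i in top_two])  # ['school', 'help']
```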

In [79]:
def display_topics(H, feature_names,no_top_words):
    for topic_idx, topic in enumerate(H):
        print("Topic %d:" % (topic_idx))
        print(" ".join([feature_names[i] for i in topic.argsort()[:-no_top_words - 1:-1]]))

In [80]:
# Now print out the top words for each topic
no_top_words = 10
display_topics(lda_H, tf_feature_names, no_top_words)

Topic 0:
leave people indian homeless company children offer 4m paid living
Topic 1:
day new year dog help old saved homeless foundation granted
Topic 2:
elephant blood dealership left years boom baby zoo person amboseli
Topic 3:
marijuana donates goes police study use coronavirus wearing quietly new
Topic 4:
years home million dogs use 20 year help plastic people
Topic 5:
son birthday food uk help metal extinct making butterfly 000
Topic 6:
set missing department abuse illegal stop year university navy female
Topic 7:
new boy rescued hour 10 york 000 rain video jacqueline
Topic 8:
million new kenya people look just properly numbers coronavirus skater
Topic 9:
year school old trash girl student stop light collecting nicknamed


Here we have our 10 topics and the most-associated words. While these topics are probably not very accurate (we used only a very small dataset), we could derive some insights from this already: for instance, topic 2 (elephant, zoo, baby, amboseli) seems to gather posts about elephants and zoos. 

You can see that, in order to make sense of the topics you create, you have to understand the lingo and logic of a particular community – hence the annotations you've been doing!

### Retrieving top documents per topic

The output of our `display_topics` function involved assigning a numeric label to the topics and printing out the top words in each topic. This is common practice. However, just displaying the top words in a topic may not help us to understand what each topic is about or determine the *context* in which these words are used.

So, let's define a function that gets both the topics and the associated top document.

This function now also needs to take the original "document" collection (our `subreddit_all_df['title']` column) and the number of top documents (`no_top_documents`), as well as the words (`feature_names`) and the number of top words (`no_top_words`). It then prints the top documents in the topic. The top words and top documents have the highest weights in the returned matrices. 

The function prints our 10 topics again, but this time also prints the associated top documents.

In [81]:
def display_topic_docs(H, W, feature_names, documents, no_top_words, no_top_documents):
    for topic_idx, topic in enumerate(H):
        print("Topic %d:" % (topic_idx))
        print(" ".join([feature_names[i] for i in topic.argsort()[:-no_top_words - 1:-1]]))
        top_doc_indices = np.argsort( W[:,topic_idx] )[::-1][0:no_top_documents]
        for doc_index in top_doc_indices:
            print(documents[doc_index])

# We're printing 10 top words per topic, and 3 "most representative" documents per topic. Feel free to change this.
no_top_words = 10
no_top_documents = 3
display_topic_docs(lda_H, lda_W, tf_feature_names, subreddit_all_df['title'], no_top_words, no_top_documents)

Topic 0:
leave people indian homeless company children offer 4m paid living
A Tanzanian small-scale miner, who became an overnight millionaire in June for selling two rough Tanzanite stones valued at $3.4m, has sold another gem for $2m. on Monday he said the money will be used to build a school & health facility in his community.
A Tanzanian small-scale miner, who became an overnight millionaire in June for selling two rough Tanzanite stones valued at $3.4m, has sold another gem for $2m. on Monday he said the money will be used to build a school & health facility in his community.
Cities cannot fine homeless people for living outside, U.S. judge rules in Grants Pass case
Topic 1:
day new year dog help old saved homeless foundation granted
Mrs Lambkin knits woolly jumpers for orphaned lambs while saving herself from COVID-19 boredom
Mrs Lambkin knits woolly jumpers for orphaned lambs while saving herself from COVID-19 boredom
Good boy! UK police dog helps find missing woman, 1-year-old 

### Visualizing topics using pyLDAvis
We can visualize our topic model using pyLDAvis:

In [82]:
# Note: in pyLDAvis 3.4 and later this module is named pyLDAvis.lda_model
import pyLDAvis.sklearn 

pyLDAvis.enable_notebook()
panel = pyLDAvis.sklearn.prepare(lda_model, tf, tf_vectorizer, mds='tsne')
pyLDAvis.save_html(panel, 'lda.html')
panel

To the left, you see your topics, represented as bubbles. To the right, you see the top words based on overall term frequency. You can click on the bubbles to see the most-prevalent words within particular topics.

Using the λ slider, you can rank the terms according to term relevance. By default, the terms of a topic are ranked in decreasing order according to their topic-specific probability ( λ = 1 ). Moving the slider allows you to adjust the rank of terms based on how discriminatory (or "relevant") they are for the specific topic. The suggested “optimal” value of λ is 0.6.
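
To make the effect of the slider concrete, here is a minimal sketch of the relevance score as defined in the LDAvis paper (Sievert and Shirley, 2014): relevance(w | t) = λ · log p(w | t) + (1 − λ) · log(p(w | t) / p(w)). The terms and probabilities below are invented for illustration and do not come from our model; the `relevance` helper is likewise hypothetical, not a pyLDAvis function.

```python
import numpy as np

def relevance(p_w_given_t, p_w, lam):
    # lam = 1 ranks purely by the within-topic probability p(w | t);
    # lower values give more weight to the lift p(w | t) / p(w).
    return lam * np.log(p_w_given_t) + (1 - lam) * np.log(p_w_given_t / p_w)

# Hypothetical numbers for one topic.
terms = ["people", "tanzanite", "like"]
p_w_given_t = np.array([0.50, 0.30, 0.20])  # probability of each term within the topic
p_w = np.array([0.40, 0.02, 0.30])          # overall probability of each term in the corpus

rankings = {}
for lam in (1.0, 0.6):
    order = np.argsort(relevance(p_w_given_t, p_w, lam))[::-1]
    rankings[lam] = [terms[i] for i in order]
    print(lam, rankings[lam])
```

With λ = 1 the common word "people" ranks first; at λ = 0.6 the rarer but more topic-specific "tanzanite" moves to the top, which is the behavior the slider lets you explore.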

*Note: a "good" topic model tends to show fairly large, non-overlapping bubbles for each topic.*


## Topic Modeling and comments
If you have time left, have another look at your original two DataFrames. Try to think of a way to use the IDs of both comments and submissions to create a list in which the original submission and comments are put together. Tip: look into the `pd.merge()` method.
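
As a starting point, here is a small sketch of how `pd.merge()` can join two DataFrames on a shared ID. The DataFrames, column names (`submission_id`, `body`), and values are all hypothetical; your own comments DataFrame may name its columns differently.

```python
import pandas as pd

# Hypothetical submissions and comments, for illustration only.
submissions = pd.DataFrame({
    "id": ["a1", "b2", "c3"],
    "title": ["Dog gets a job", "Elephants return", "Butterfly is back"],
})
comments = pd.DataFrame({
    "submission_id": ["a1", "a1", "b2"],
    "body": ["Good dog!", "What a story", "Great news"],
})

# A left join keeps every submission, even those without comments.
merged = pd.merge(
    submissions, comments,
    left_on="id", right_on="submission_id", how="left",
)

# Collect the comment bodies of each submission into one list.
comments_per_submission = merged.groupby("id")["body"].apply(
    lambda bodies: bodies.dropna().tolist()
)
print(comments_per_submission)
```

The `how` argument decides which rows survive: `"left"` keeps all submissions, `"inner"` keeps only submissions that have at least one comment, and `"outer"` keeps rows from both sides.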



In [83]:
def preprocess_text(text):
    # Lower the text
    lower_text = text.lower()

    # tokenize the text into a list of words
    tokens = nltk.tokenize.word_tokenize(lower_text)
    return tokens

In [84]:
def flat_tokenized(text):
    # Tokenize a single string and return a flat list of lowercase tokens.
    # preprocess_text already returns a flat list, so no extra flattening is needed.
    return preprocess_text(text)

In [58]:
from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize 
nltk.download("punkt")
from collections import Counter

def get_sorted_token_dict(id):
    # Return a dict of token counts for all comments on a submission, most frequent first.
    # `reddit` is assumed to be the Reddit API client created earlier in the notebook.
    stop_words = set(stopwords.words('english'))

    submission = reddit.submission(id=id)
    # Drop the "load more comments" placeholders so only actual comments remain
    submission.comments.replace_more(limit=0)
    comment_tokens = []
    for comment in submission.comments.list():
        comment_tokens.extend(flat_tokenized(comment.body))

    # Keep alphanumeric tokens that are not stop words
    filtered_tokens = [w for w in comment_tokens if w not in stop_words and w.isalnum()]

    # most_common() orders the tokens by descending count
    return dict(Counter(filtered_tokens).most_common())


[nltk_data] Downloading package punkt to /Users/rajesh/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
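
If you want to try the counting logic without calling the Reddit API, the sketch below applies the same steps (lowercase, tokenize, drop stop words, count, sort) to a few invented comment strings. Note the assumptions: `count_tokens` is a hypothetical helper, the regular expression is a simple stand-in for `word_tokenize` and will not split text identically, and the stop-word set is a tiny placeholder for NLTK's English list.

```python
import re
from collections import Counter

# Tiny placeholder stop-word set (NLTK's English list is much longer).
STOP_WORDS = {"the", "a", "is", "and", "this", "so"}

def count_tokens(comments):
    # Lowercase each comment and split it into alphanumeric tokens.
    tokens = []
    for body in comments:
        tokens.extend(re.findall(r"[a-z0-9]+", body.lower()))
    # Remove stop words, then count and order by descending frequency.
    filtered = [t for t in tokens if t not in STOP_WORDS]
    return dict(Counter(filtered).most_common())

# Invented comments, for illustration only.
sample_comments = ["The dog is a good dog!", "This dog got a job.", "Good job, dog"]
counts = count_tokens(sample_comments)
print(counts)
```

The resulting dict has the same shape as the values stored in `Comments_Tokens` below: tokens as keys, counts as values, most frequent first.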


In [85]:
comments_col = ['id', 'Comments_Tokens']
comments_list = []
# Each submission is fetched from the Reddit API, so this loop can take a while
for submission_id in subreddit_all_df.id:
    comment_tokens = get_sorted_token_dict(id=submission_id)
    comments_list.append([submission_id, comment_tokens])

comments_df = pd.DataFrame(comments_list, columns=comments_col)


In [88]:
comments_df

Unnamed: 0,id,Comments_Tokens
0,i8td9j,"{'dog': 52, 'like': 28, 'job': 26, 'people': 2..."
1,i8lrm1,"{'people': 47, 'homeless': 35, 'like': 22, 'gi..."
2,i90mk2,"{'elephants': 32, 'like': 20, 'elephant': 12, ..."
3,i8q8d3,"{'cancer': 34, 'lung': 22, 'drugs': 12, 'would..."
4,i8ytet,"{'ants': 3, 'love': 2, 'ant': 2, 'stuff': 2, '..."
...,...,...
212,i89zbi,"{'zoo': 7, 'years': 6, 'animal': 5, 'time': 5,..."
213,i83e37,"{'women': 152, 'work': 93, 'days': 91, 'get': ..."
214,i8kahy,"{'love': 2, 'good': 1, 'man': 1, 'gooood': 1, ..."
215,i7z8pz,"{'people': 199, 'homeless': 177, 'like': 81, '..."


In [89]:
subreddit_all_df

Unnamed: 0,id,title,score,author,created,num_comments,distinguished,edited
0,i8td9j,Stray dog who visited car dealership every day...,22605,cyanocittaetprocyon,08/13/2020,217,,False
1,i8lrm1,First-grader starts foundation to help feed th...,6500,hildebrand_rarity,08/12/2020,127,,False
2,i90mk2,Kenya's Elephant Numbers Double Over Three Dec...,187,TherealTushsar,08/13/2020,6,,False
3,i8q8d3,New treatments spur sharp reduction in lung ca...,1831,DeliciousBowler2041,08/12/2020,37,,False
4,i8ytet,'Extinct' large blue butterfly successfully re...,86,Lt_Quill,08/13/2020,1,,False
...,...,...,...,...,...,...,...,...
212,i89zbi,Elephant chained in a Pakistan zoo for 35 year...,1798,robbiekhan,08/12/2020,27,,False
213,i83e37,Indian company to offer leave for tough times ...,6927,OrangeMonkey42,08/11/2020,339,,False
214,i8kahy,Firefighter fosters dog he saved from flames,144,kojima100,08/12/2020,4,,False
215,i7z8pz,Cities cannot fine homeless people for living ...,25583,TrumpDumper,08/11/2020,838,,False


In [95]:

subreddit_all_df.columns

Index(['id', 'title', 'score', 'author', 'created', 'num_comments',
       'distinguished', 'edited'],
      dtype='object')

In [96]:
comments_df.columns

Index(['id', 'Comments_Tokens'], dtype='object')

In [107]:
# Merge the submissions with their comment-token dicts on the shared submission id.
subreddit_merge_df = pd.merge(subreddit_all_df, comments_df, on='id', how='outer')
# Remove repeated submissions, then sort by score and comment count, highest first
subreddit_merge_df = subreddit_merge_df.drop_duplicates(subset=['id'])
subreddit_merge_df = subreddit_merge_df.sort_values(['score', 'num_comments'], ascending=[False, False])


In [116]:

s_df = subreddit_merge_df[['title','score','num_comments','Comments_Tokens']]
with pd.option_context('display.max_rows', None, 'display.max_columns', None):  # temporarily show all rows and columns
    display(s_df)


Unnamed: 0,title,score,num_comments,Comments_Tokens
157,"Man falsely imprisoned for 10 years, uses pris...",131681,2335,"{'get': 83, 'people': 69, 'prison': 61, 'like'..."
158,First paralyzed human treated with stem cells ...,126193,2665,"{'stem': 197, 'cells': 161, 'cell': 91, 'resea..."
159,Over a Million People Sign Petition Calling Fo...,112694,5294,"{'people': 130, 'kkk': 120, 'group': 110, 'ter..."
160,Hollywood Superstar Keanu Reeves Has Secretly ...,111742,2219,"{'keanu': 89, 'like': 72, 'people': 70, 'good'..."
161,Amazon tribe wins legal battle against oil com...,107276,1636,"{'oil': 101, 'people': 71, 'like': 62, 'amazon..."
162,Chattanooga's Police Chief has updated his dep...,96808,1850,"{'police': 158, 'people': 82, 'like': 71, 'law..."
163,No children died in traffic accidents in Norwa...,92673,2153,"{'driving': 104, 'norway': 90, 'speed': 80, 'r..."
164,President Trump signs animal cruelty bill into...,90362,6155,"{'animal': 127, 'animals': 81, 'crushing': 65,..."
165,Man finds $24 million lottery ticket in an old...,87031,2438,"{'would': 92, 'like': 68, 'money': 67, 'time':..."
166,Police say a teenager who attached uplifting m...,85798,1257,"{'people': 83, 'like': 62, 'bridge': 48, 'some..."
