Analyzing post content and sentiment - Matthew Vorsteg

In this section I am using NLTK, Scikit-Learn and a few other libraries to categorize all of the posts into different categories. The first step is to install all of NLTK, and also VADER (Valence Aware Dictionary and sEntiment Reasoner) which will be used for sentiment analysis.

Firstly, I will be breaking categorizing all of r/UMD's posts into different groups. No posts are labeled already, and in fact the categories are not defined yet. We will be using Scikit-Learn's KMeans algorithm, which requires us to properly prepare our text dataset and create a TF-IDF Matrix. This will be an example of unsupervised machine learning, as we do not have a labeled dataset to test the KMeans model.

In [59]:
!pip install nltk
!pip install vaderSentiment
!pip install plotly

Collecting plotly
[?25l  Downloading https://files.pythonhosted.org/packages/8e/ce/6ea5683c47b682bffad39ad41d10913141b560b1b875a90dbc6abe3f4fa9/plotly-4.4.1-py2.py3-none-any.whl (7.3MB)
[K     |████████████████████████████████| 7.3MB 4.5MB/s eta 0:00:01
[?25hCollecting retrying>=1.3.3
  Downloading https://files.pythonhosted.org/packages/44/ef/beae4b4ef80902f22e3af073397f079c96969c69b2c7d52a57ea9ae61c9d/retrying-1.3.3.tar.gz
Building wheels for collected packages: retrying
  Building wheel for retrying (setup.py) ... [?25ldone
[?25h  Created wheel for retrying: filename=retrying-1.3.3-cp37-none-any.whl size=11429 sha256=d2db92fdab5819290512ba13432a589e682277b597829b8e94a611cdc3d003de
  Stored in directory: /home/jovyan/.cache/pip/wheels/d7/a9/33/acc7b709e2a35caa7d4cae442f6fe6fbf2c43f80823d46460c
Successfully built retrying
Installing collected packages: retrying, plotly
Successfully installed plotly-4.4.1 retrying-1.3.3


In [60]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
import re
import pandas as pd
import plotly.express as px
import sqlite3
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.font_manager
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from random import randint
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [4]:
#connect to the sql db
conn = sqlite3.connect('R_UMD.db')
df_post = pd.read_sql('SELECT * FROM post', conn)

We need to define a function to clean up our text. This function combines tokenization into stemming, as well as removing our stopwords. This effectively 'sanitizes' our text and makes it about as uniform as possible.

In [3]:
#create stemmer and stopwords
ps = PorterStemmer()
words = stopwords.words('english')

#breaks words into stems, forces them into lowercase, tokenizes based on whitespace, and removes stopwords
def clean(x) :
    return ' '.join([ps.stem(i) for i in re.sub('[^a-zA-Z]', ' ', x).split() if i not in words]).lower()

Now that we have defined our tokenization function, we can use it to clean up our dataset.
Let's go ahead and apply it to both the title and the body of the post, and save the 'clean' versions as new columns.
While we're at it, we can add a new column which contains the title appended to the body. For most of our analysis, we will treat the combination of title and body as the text of each post.

In [5]:
#clean the title and text by applying the clean function as stated above
title_clean = df_post['title'].apply(clean)
text_clean = df_post['selftext'].apply(clean)

#concatenates the cleaned title and text to create a column with all words per post
df_post['doc'] = title_clean.map(str) + text_clean

df_post.head()

Unnamed: 0,id,name,url,title,selftext,score,created_utc,permalink,link_flair_text,doc
0,dxv1c4,Baking-and-books,https://www.reddit.com/r/UMD/comments/dxv1c4/r...,Re-Leasing Apartment!,Re-Leasing my room in Commons 6. Amazing room...,1,1574036000.0,/r/UMD/comments/dxv1c4/releasing_apartment/,Housing,re leas apartre leas room common amaz roommat ...
1,dxuxzp,cdrgnvrk,https://www.reddit.com/r/UMD/comments/dxuxzp/e...,Eduroam ACTUALLY sucks dick,Fuck the division of IT for allowing this bull...,1,1574035000.0,/r/UMD/comments/dxuxzp/eduroam_actually_sucks_...,,eduroam actual suck dickfuck divis it allow bu...
2,dxuwpy,TonyChen616,https://v.redd.it/z0rvpzqi4cz31,OG Legends Strikes Again,,4,1574035000.0,/r/UMD/comments/dxuwpy/og_legends_strikes_again/,Discussion,og legend strike again
3,dxu8we,Shalleycat,https://www.reddit.com/r/UMD/comments/dxu8we/s...,Sustainable turtle sticker,Anyone know where I can get one of those susta...,1,1574032000.0,/r/UMD/comments/dxu8we/sustainable_turtle_stic...,,sustain turtl stickeranyon know i get one sust...
4,dxttl0,Rooser1212,https://www.reddit.com/r/UMD/comments/dxttl0/s...,Spring/Summer 2020 Sublease,I am studying abroad next semester and am look...,1,1574030000.0,/r/UMD/comments/dxttl0/springsummer_2020_suble...,Housing,spring summer subleasi studi abroad next semes...


In order to create a TF-IDF matrix, we will use Scikit-Learn's TfidfVectorizer package.
Since we have already cleaned our data, all we have to do is create a new TfidfVectorizer, convert the post texts to a list and fit the vectorizer, and construct a new dataframe from the result.

In [6]:
titles = df_post['title'].tolist()
corpus = df_post['doc'].tolist()

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(corpus)
df_post_tfidf = pd.DataFrame(X.T.todense(), index=vectorizer.get_feature_names(), columns = titles)
df_post_tfidf

Unnamed: 0,Re-Leasing Apartment!,Eduroam ACTUALLY sucks dick,OG Legends Strikes Again,Sustainable turtle sticker,Spring/Summer 2020 Sublease,Umd italian club?,Tool Concert 11/25,COMM PR or Public Health FS?,Open mic at Milkboy!,Orgo 1: Stocker or Dixon?,...,Thoughts on the closing of Campus Drive?,"Since there seems to be a few terp redditors, would you all want to have a meet up in the fall?",What groups/clubs do you belong to?,What is you major and year?,Easy Electives?,"All of you people that are ahead of me on the wait list, I'm going to need you to go ahead and back out so I can get in my class - Thanks",Welcome to the UMD subreddit!,Poop? In my McKeldin?,Poop? In my McKeldin?.1,The Camera that Sees Sound!
aa,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
aaa,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
aaaaaaaaa,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
aaaaaaaaaaaa,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
zze,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
zzfrahjlgz,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
zzzsleepytim,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
zzzz,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Now that we have a TF-IDF matrix, we can use Scikit-Learn's KMeans function to split the data up into clusters based on similarities in text from the TF-IDF matrix.

Unfortunately, KMeans is a non-deterministic algorithm, meaning that it will give a different result if run multiple times. This can lead to some interesting results, as the clusters change when the algorithm is run. This can be alleviated by providing an integer for a random seed, which will cause the KMeans to give the same result every time.

Here we are specifying k = 13 to categorize the data into 13 clusters. We need to manually inspect each cluster to see what the posts in each cluster have in common, and we can give names to our clusters.

In [7]:
#using KMeans, cluster the data into a set number of categories
true_k = 15
r = 971 #KMeans is non-deterministic unless we specify the random seed
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1, random_state = r)
print("RANDOM STATE",model.random_state)

#fit the model
model.fit(X)
print("Top terms per cluster:")
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names()
for i in range(true_k):
    print("Cluster %d:" % i),
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind]),

RANDOM STATE 971
Top terms per cluster:
Cluster 0:
 major
 cs
 scienc
 comput
 engin
 doubl
 minor
 program
 school
 career
Cluster 1:
 transfer
 umd
 student
 school
 credit
 gpa
 appli
 fall
 colleg
 semest
Cluster 2:
 math
 class
 calc
 exam
 major
 stat
 semest
 cours
 placement
 anyon
Cluster 3:
 ticket
 game
 student
 michigan
 basketbal
 anyon
 sell
 extra
 guest
 state
Cluster 4:
 maryland
 like
 terp
 student
 look
 time
 need
 help
 peopl
 final
Cluster 5:
 park
 colleg
 lot
 permit
 free
 car
 campu
 dot
 overnight
 summer
Cluster 6:
 room
 hous
 look
 apart
 common
 live
 roommat
 leas
 rent
 sublet
Cluster 7:
 cmsc
 class
 semest
 anyon
 summer
 math
 exam
 taken
 cs
 cours
Cluster 8:
 campu
 place
 south
 live
 job
 know
 hous
 look
 best
 anyon
Cluster 9:
 umd
 student
 edu
 http
 school
 connect
 like
 know
 email
 use
Cluster 10:
 talk
 week
 happen
 promot
 jpg
 sport
 upcom
 com
 http
 event
Cluster 11:
 cours
 credit
 class
 semest
 onlin
 taken
 anyon
 summer
 leve

After looking at the clusters, I have decided on some appropirate titles for each group

In [8]:
subjects = {0 : 'major requirements', 1 : 'admissions / trasfer', 2 : 'math', 3 : 'sports', 4 : 'general umd',
            5 : 'parking', 6 : 'housing', 7 : 'cs classes', 8 : 'housing', 9 : 'events / internet', 10 : 'weekly posts',
            11 : 'registration', 12 : 'course / campus questions'}

In [9]:
header_str = '~~~~~~~~~~~~~~~'

Now our KMeans model is trained! we can test it out by taking a random sample of the data and predicting each post's category by using the model's predict() function. I have printed this output below so you can see how accurate the groupings are.

In [10]:
def classify(post) :
    Y = vectorizer.transform([post])
    prediction = model.predict(Y)[0]
    if prediction == 12 :
        prediction = 11
    if prediction > 12 :
        prediction = 12
    return prediction

#create random sample of dataframe
sample = df_post.sample(n=100)
#sample = df
pred = []
#add column for the prediction to the dataframe
for row in sample.iterrows() : 
    pred.append(classify(row[1]['doc']))
sample['pred'] = pred
#display sample posts by subject
for i in range(0,13) :
        print()
        print(header_str,subjects[i],header_str)
        sub = sample[sample['pred'] == i]
        for row in sub.iterrows() :
            print(row[1]['title'])


~~~~~~~~~~~~~~~ major requirements ~~~~~~~~~~~~~~~
Class Comparison
My biggest regret since coming here is not having a better idea of what I wanted to study, and what I wanted to do after graduation.
In light of peoples recent complaints about the difficulty of their majors

~~~~~~~~~~~~~~~ admissions / trasfer ~~~~~~~~~~~~~~~
Has anyone taken upper level computer science courses and transferred them in?

~~~~~~~~~~~~~~~ math ~~~~~~~~~~~~~~~
Will I get be guaranteed to get into major coruses
MATH241 Professor
Could I take math 120 at a different institution or do I have to take it here?

~~~~~~~~~~~~~~~ sports ~~~~~~~~~~~~~~~
Anyone got an extra football ticket for the Howard game? 🙏
If I am not a student can I use a student ticket at the basketball game?

~~~~~~~~~~~~~~~ general umd ~~~~~~~~~~~~~~~
What's the fastest way I can get a pdf version of my unofficial transcript?
Discussion for STAT400?
What to expect in the summer?
Slipped under my door in my dorm. Fuck off.
AV Williams C

Since we can see that the categorization is accurate, we can go ahead and append a new column to our df_post with their classification, and we are done with classifying the posts of r/UMD!

In [11]:
classifications = []

for row in df_post.iterrows() :
    classifications.append(subjects[classify(row[1]['doc'])])
df_post['class'] = classifications
df_post.drop(['doc'], axis = 1) #don't need this anymore
df_post

Unnamed: 0,id,name,url,title,selftext,score,created_utc,permalink,link_flair_text,doc,class
0,dxv1c4,Baking-and-books,https://www.reddit.com/r/UMD/comments/dxv1c4/r...,Re-Leasing Apartment!,Re-Leasing my room in Commons 6. Amazing room...,1,1.574036e+09,/r/UMD/comments/dxv1c4/releasing_apartment/,Housing,re leas apartre leas room common amaz roommat ...,housing
1,dxuxzp,cdrgnvrk,https://www.reddit.com/r/UMD/comments/dxuxzp/e...,Eduroam ACTUALLY sucks dick,Fuck the division of IT for allowing this bull...,1,1.574035e+09,/r/UMD/comments/dxuxzp/eduroam_actually_sucks_...,,eduroam actual suck dickfuck divis it allow bu...,general umd
2,dxuwpy,TonyChen616,https://v.redd.it/z0rvpzqi4cz31,OG Legends Strikes Again,,4,1.574035e+09,/r/UMD/comments/dxuwpy/og_legends_strikes_again/,Discussion,og legend strike again,general umd
3,dxu8we,Shalleycat,https://www.reddit.com/r/UMD/comments/dxu8we/s...,Sustainable turtle sticker,Anyone know where I can get one of those susta...,1,1.574032e+09,/r/UMD/comments/dxu8we/sustainable_turtle_stic...,,sustain turtl stickeranyon know i get one sust...,general umd
4,dxttl0,Rooser1212,https://www.reddit.com/r/UMD/comments/dxttl0/s...,Spring/Summer 2020 Sublease,I am studying abroad next semester and am look...,1,1.574030e+09,/r/UMD/comments/dxttl0/springsummer_2020_suble...,Housing,spring summer subleasi studi abroad next semes...,housing
...,...,...,...,...,...,...,...,...,...,...,...
42580,cscgl,Ares__,https://www.reddit.com/r/UMD/comments/cscgl/al...,All of you people that are ahead of me on the ...,,3,1.279777e+09,/r/UMD/comments/cscgl/all_of_you_people_that_a...,,all peopl ahead wait list i go need go ahead b...,course / campus questions
42581,cj6m4,maxpericulosus,https://www.reddit.com/r/UMD/comments/cj6m4/we...,Welcome to the UMD subreddit!,Welcome to the University of Maryland subreddi...,4,1.277528e+09,/r/UMD/comments/cj6m4/welcome_to_the_umd_subre...,,welcom umd subredditwelcom univers maryland su...,events / internet
42582,cj564,chrisg90,https://www.reddit.com/r/UMD/comments/cj564/po...,Poop? In my McKeldin?,[Defecator Banned from McKeldin](http://www.di...,6,1.277516e+09,/r/UMD/comments/cj564/poop_in_my_mckeldin/,,poop in mckeldindefec ban mckeldin http www di...,general umd
42583,cj526,,https://www.reddit.com/r/UMD/comments/cj526/po...,Poop? In my McKeldin?,[deleted],1,1.277515e+09,/r/UMD/comments/cj526/poop_in_my_mckeldin/,,poop in mckeldindelet,general umd


In [13]:
subject_count = {'major requirements' : 0, 'admissions / trasfer' : 0, 'math' :0, 'sports' : 0, 'general umd' : 0,
            'parking' : 0, 'housing' : 0, 'cs classes' : 0, 'housing' : 0, 'events / internet' : 0, 'weekly posts' : 0,
            'registration' : 0, 'course / campus questions' : 0}

for row in df_post.iterrows() :
    subject_count[row[1]['class']] += 1

We can also use pyplot to create pie chart, which can be a nice visual aid to show the breakdown of the post categories across the entire subreddit

In [88]:
df_temp = pd.DataFrame()
df_temp['classification'] = subject_count.keys()
df_temp['count'] = subject_count.values()
fig = px.pie(df_temp, values = 'count', names='classification', title='Classification of r/UMD Posts by Percent')
fig.show()

Next, I will be using the VADER (Valence Aware Dictionary and sEntiment Reasoner) algorithm, which is a pre-trained model that specializes in sentiment analysis of social media posts. 

VADER takes in a string and returns 4 scores: positive, neutral, negative, and compound. The first 3 reflect the percent of the string made up of positive, negative, and neutral keywords. These scores always add up to 1. Compound score is a composite of the first 3, between -1 and 1, which is normalized to account for context, length, and emphasis of the words. We define a score of >= 0.05 as positive, < 0.05 && > -0.05 as neutral, and <= 0.05 as negative, in accordance with the VADER guidelines.

Firstly, we define a simple function to return 'pos', 'neg', or 'neu' based on the composite score

In [44]:
analyzer = SentimentIntensityAnalyzer()

def classify_sentiment(sentence) :
    score = analyzer.polarity_scores(sentence)
    if score['pos'] >= 0.05 :
        return 'positive'
    if score['neg'] <= 0.05 :
        return 'negative'
    return 'neutral'

We need to iterate throuhh all posts and run this function to get a sentiment for each post
We can also add a new column to the table corresponding to the sentiment of each post

In [45]:
sentiments = []

for row in df_post.iterrows() :
    p = classify_sentiment(row[1]['title'] + ' ' + row[1]['selftext'])
    sentiments.append(p)
    
df_post['sentiment'] = sentiments
df_post

Unnamed: 0,id,name,url,title,selftext,score,created_utc,permalink,link_flair_text,doc,class,sentiment
0,dxv1c4,Baking-and-books,https://www.reddit.com/r/UMD/comments/dxv1c4/r...,Re-Leasing Apartment!,Re-Leasing my room in Commons 6. Amazing room...,1,1.574036e+09,/r/UMD/comments/dxv1c4/releasing_apartment/,Housing,re leas apartre leas room common amaz roommat ...,housing,positive
1,dxuxzp,cdrgnvrk,https://www.reddit.com/r/UMD/comments/dxuxzp/e...,Eduroam ACTUALLY sucks dick,Fuck the division of IT for allowing this bull...,1,1.574035e+09,/r/UMD/comments/dxuxzp/eduroam_actually_sucks_...,,eduroam actual suck dickfuck divis it allow bu...,general umd,neutral
2,dxuwpy,TonyChen616,https://v.redd.it/z0rvpzqi4cz31,OG Legends Strikes Again,,4,1.574035e+09,/r/UMD/comments/dxuwpy/og_legends_strikes_again/,Discussion,og legend strike again,general umd,neutral
3,dxu8we,Shalleycat,https://www.reddit.com/r/UMD/comments/dxu8we/s...,Sustainable turtle sticker,Anyone know where I can get one of those susta...,1,1.574032e+09,/r/UMD/comments/dxu8we/sustainable_turtle_stic...,,sustain turtl stickeranyon know i get one sust...,general umd,positive
4,dxttl0,Rooser1212,https://www.reddit.com/r/UMD/comments/dxttl0/s...,Spring/Summer 2020 Sublease,I am studying abroad next semester and am look...,1,1.574030e+09,/r/UMD/comments/dxttl0/springsummer_2020_suble...,Housing,spring summer subleasi studi abroad next semes...,housing,positive
...,...,...,...,...,...,...,...,...,...,...,...,...
42580,cscgl,Ares__,https://www.reddit.com/r/UMD/comments/cscgl/al...,All of you people that are ahead of me on the ...,,3,1.279777e+09,/r/UMD/comments/cscgl/all_of_you_people_that_a...,,all peopl ahead wait list i go need go ahead b...,course / campus questions,positive
42581,cj6m4,maxpericulosus,https://www.reddit.com/r/UMD/comments/cj6m4/we...,Welcome to the UMD subreddit!,Welcome to the University of Maryland subreddi...,4,1.277528e+09,/r/UMD/comments/cj6m4/welcome_to_the_umd_subre...,,welcom umd subredditwelcom univers maryland su...,events / internet,positive
42582,cj564,chrisg90,https://www.reddit.com/r/UMD/comments/cj564/po...,Poop? In my McKeldin?,[Defecator Banned from McKeldin](http://www.di...,6,1.277516e+09,/r/UMD/comments/cj564/poop_in_my_mckeldin/,,poop in mckeldindefec ban mckeldin http www di...,general umd,neutral
42583,cj526,,https://www.reddit.com/r/UMD/comments/cj526/po...,Poop? In my McKeldin?,[deleted],1,1.277515e+09,/r/UMD/comments/cj526/poop_in_my_mckeldin/,,poop in mckeldindelet,general umd,negative


Now, just as before, we can create a pie chart to illustrate the sentiment distribution of r/UMD using pyplot

In [90]:
sent_count = {'positive' : 0, 'negative' : 0, 'neutral' : 0}
for row in df_post.iterrows() :
    sent_count[row[1]['sentiment']] += 1

In [91]:
df_temp = pd.DataFrame()
df_temp['sentiment'] = sent_count.keys()
df_temp['count'] = sent_count.values()
fig = px.pie(df_temp, values = 'count', names='sentiment', title='Sentiment of r/UMD Posts by Percent')
fig.show()

We could easily stop there, but since it was so straightforward to perform a sentiment analysis on the posts, I am going to do the same process for the comments.

In [47]:
df_comment = pd.read_sql('SELECT * FROM comment', conn)
df_comment.head()

Unnamed: 0,id,name,body,score,parent_id,link_id,created_utc
0,f7wxk0a,DeltaHex106,lol nooiiiccee.,4,t3_dxuwpy,t3_dxuwpy,1574041000.0
1,f7x0i9l,The_Joker_07,"It says occurred on October 17, what????",1,t3_dxuwpy,t3_dxuwpy,1574043000.0
2,f7x0mwv,YaBoiAtUMD,💀💀💀,1,t3_dxuwpy,t3_dxuwpy,1574043000.0
3,f7w5tlk,Thedaniel4999,I never had Dixon but can confirm that Stocker...,1,t3_dxtdkl,t3_dxtdkl,1574030000.0
4,f7w95pd,lordkaramat,"Yea, you need to show a student ID when you ge...",4,t3_dxt72p,t3_dxt72p,1574031000.0


In [52]:
sentiments = []
for row in df_comment.iterrows() :
    p = classify_sentiment(row[1]['body'])
    sentiments.append(p)
    
df_comment['sentiment'] = sentiments

In [56]:
sent_count = {'positive' : 0, 'negative' : 0, 'neutral' : 0}
for row in df_comment.iterrows() :
    sent_count[row[1]['sentiment']] += 1

In [92]:
df_temp = pd.DataFrame()
df_temp['sentiment'] = sent_count.keys()
df_temp['count'] = sent_count.values()
fig = px.pie(df_temp, values = 'count', names='sentiment', title = 'Sentiment of r/UMD Comments by Percent')
fig.show()