## Sentiment Analysis
Modified: 10/12/2020


# RUBRIC FOR 8B
Total possible 10 points

* 0 pt - must have - Repeat the analysis from 8A for the original 8A data source **AND** a second similar data source.
* 1 pt - must have - Create a combined dataframe that has separate columns for each source. You can choose to add columns to the 8A data frame for the new data or add rows. If adding rows, be sure to have a column that will allow you to differentiate between the two data sources.
* 4 pt - Create a multi-plot scatter diagram that displays polarity vs subjectivity for the two sources. Be sure they are different colors and a legend is shown.
* 3 pt - Produce some basic statistics for the two sources including overall sentence count, overall sentiment for each source, basic descriptive (e.g. describe) stats for each source, and the correlation between polarity and subjectivity for each source.
* 2 pt - Write a few sentences of key observations that you made regarding the analysis. Include additional analysis you would suggest doing based on the initial results.

Deductions for:
* Crashing code
* Logic errors producing incorrect results
* Lack of attempt for some or all components.

### Objectives:
* Retrieve text data from an article, speech, story, debate or some other web-based source.
* Clean the data then perform sentiment analysis.


### Requirements:
* Provide a direct link to the web site used to retrieve the data, though it is OK to save the text to a locally stored file.
* Locally stored files will be saved to and loaded from the same folder as the notebook so the professor can run your code.
* Text will not include substantial amounts of leftover HTML code. (No HTML is preferred!)
* Drop some lines of text if the web site delivers text that should not be included in the sentiment analysis. For example, if the text is a commencement speech, the introductory paragraph would likely be dropped so only the keynote speech is analyzed.
* Use any methods you like to process the data and perform the sentiment analysis. You do not need to use the examples from the text book or the live class.

### Special Note:
* Assignment 8B will perform a similar analysis on two opposite views of a topic. Think ahead. Choose a topic here that will allow you to reuse this code and text on the next assignment.

Potential examples: 
* Two essays on different views on climate change.
* Two public domain e-books or essays that represent different historical views on racial equality or women's rights.
* A debate that includes two individuals debating on several topics. (This assignment would show results for just one individual. Assignment 8B would have both individuals.)
* Similar-venue speeches from two individuals with different ideologies, for example: two commencement speeches from different US presidents, two acceptance speeches from political party conventions, two newspaper opinion pieces on similar topics, etc.
* A speech that led to a successful outcome (e.g. JFK - "We choose to go to the moon.") and one speech that fell flat and didn't unite people to action.

### Reminder: 
* Assignment 8A will only analyze one of the two. Assignment 8B will recreate 8A for the first speech/article/book and add and compare the second speech/article.

In [151]:
# Import dependencies and required magics
import requests        # import from web
from bs4 import BeautifulSoup      # clean up text
from wordcloud import WordCloud    # create word clouds
from textblob import TextBlob      # basic NLP, install first

from pathlib import Path    # for quick import of text file for NLP
from newspaper import Article 
import pandas as pd

# Magics
%config InlineBackend.figure_format = 'retina'
%matplotlib inline


# Parsing with `Newspaper3K` and `Article` 

In [181]:
# Load text from web-page, save to local file
# Biden Speech 10/13/2020
url = 'https://www.rev.com/blog/transcripts/joe-biden-campaign-speech-pembroke-pines-florida-transcript-october-13-talks-social-security'
article = Article(url)
article.download()
article.parse()
text = article.text

event = '-10-13-20'   

# write the text to a temp file
# text is a single string, so write() is used rather than writelines()
with open('speech.txt', 'w') as f:
    f.write(text)

In [182]:
# Load from saved file, review it, 
# drop lines as needed, perform necessary processing.

with open('speech.txt', 'r') as f:
    speech = f.readlines()

tmp = []
speaker = []
time = []
words = []

for cnt, line in enumerate(speech):
    if cnt % 2 == 0:
        tmp.append(line.rstrip())   # temp list of just the text lines 0,2

# last line of text is not part of speech, alter range in next line
# of code to stop short

for i in range(0,len(tmp)-1,2):
    speaker.append(tmp[i].split(': ')[0])  #split speaker line into 2 parts
    time.append(tmp[i].split(': ')[1])
    words.append(tmp[i+1])    # words from speaker

set(speaker)

{'Audience', 'Joe Biden', 'Speaker 1', 'Speaker 2', 'Toby Feuer'}
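
The alternating-line logic above depends on the transcript keeping a strict header / blank / text / blank layout. As a hedged alternative, the sketch below pairs each header with the text that follows it by matching a regular expression. The header pattern `Name: (MM:SS)` and the helper name `parse_transcript` are assumptions for illustration, not something the source site guarantees.

```python
import re

# Assumed header layout: "Speaker Name: (MM:SS)" or "Speaker Name: (HH:MM:SS)"
HEADER = re.compile(r'^(?P<speaker>[^:]+): \((?P<time>\d{1,2}(?::\d{2}){1,2})\)\s*$')

def parse_transcript(lines):
    """Return a list of (speaker, time, words) tuples from transcript lines."""
    rows = []
    current = None          # (speaker, time) of the most recent header
    for raw in lines:
        line = raw.strip()
        if not line:
            continue        # skip blank separator lines
        match = HEADER.match(line)
        if match:
            current = (match['speaker'], match['time'])
        elif current is not None:
            rows.append((current[0], current[1], line))
            current = None  # one text block per header, as in the loop above
    return rows

sample = [
    'Joe Biden: (00:00)\n',
    '\n',
    'Hello, hello, hello.\n',
    '\n',
    'Audience: (00:05)\n',
    '\n',
    'Joe! Joe!\n',
    '\n',
    'Read the full transcript here.\n',   # trailing non-speech line is ignored
]
rows = parse_transcript(sample)
print(rows)
```

Because text is only kept when it follows a header, a trailing non-speech line is dropped without adjusting the loop range by hand.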

In [183]:
spkr = 'Joe Biden'
file = spkr.split()[-1] + event + '.txt'   # last name + event date, e.g. Biden-10-13-20.txt
file

# use write instead of writelines since we don't want entire list
# remember to add new line

with open(file,'w') as f:
    for i in range(0,len(speaker)):
        if speaker[i] == spkr:
            f.write(words[i]+'\n')
            
text = Path(file).read_text()
text

'Hello, hello, hello.\nGood to see you all. Please, please, take a seat if you have one. Thank you so very, very much for… It’s good to be back in Florida. I want to thank my good friend Debbie Wasserman Schultz, congresswoman. She’s been a tireless fighter. Where’s Debbie? There she is. Thank you, Debbie. You’ve been a friend a long time, and thank you for all the encouragement.\nToby, I also want to thank you. Toby moved from a high rent district in Brooklyn. That’s the highest rent district in America right now these days. [inaudible 00:09:48], good to see you. I want to thank you for that kind introduction. It’s wonderful to be here with all of you to hear the stories, talk about how we’re going to get through these tough times, the difficult times. Today’s-\n… tough times and difficult times. Today’s story is a familiar one here in South Florida. We’re all living in some version of it right now with some of the most important parts of our lives being put on hold and the same story

In [184]:
# Perform sentiment analysis
blob = TextBlob(text)
blob.sentiment

Sentiment(polarity=0.10319898019459418, subjectivity=0.4083973928930071)

In [185]:
# Save sentiment data to dataframe

# Create sentiment dataframe
pd.set_option('max_colwidth', 400)

p = []
s = []
txt = []
spk = []
for sentence in blob.sentences:
    p.append(sentence.sentiment.polarity)
    s.append(sentence.sentiment.subjectivity)
    txt.append(str(sentence))
    spk.append(spkr.split()[-1])   # speaker's last name

df_sent = pd.DataFrame(p,columns=['polarity'])
df_sent['subjectivity'] = s
df_sent['text'] = txt
df_sent['speaker'] = spk
df_sent['order'] = df_sent.index.values
df_sent.head()

Unnamed: 0,polarity,subjectivity,text,speaker,order
0,0.0,0.0,"Hello, hello, hello.",Biden,0
1,0.7,0.6,Good to see you all.,Biden,1
2,0.0,0.0,"Please, please, take a seat if you have one.",Biden,2
3,0.32,0.286667,"Thank you so very, very much for… It’s good to be back in Florida.",Biden,3
4,0.7,0.6,"I want to thank my good friend Debbie Wasserman Schultz, congresswoman.",Biden,4


In [186]:
df_sent.shape

(322, 5)

In [187]:
# Output key sentiment analysis results including:
#   Overall sentiment analysis scores for the document
#   Correlation of polarity and subjectivity scores across sentences

sent_polarity = blob.polarity
sent_subjectivity = blob.subjectivity
sent_corr = df_sent.polarity.corr(df_sent.subjectivity)
sent_words = sum(blob.word_counts.values())
sent_sentences = len(blob.sentences)
sent_avewords = sent_words/sent_sentences

print(f'Overall polarity: {sent_polarity:.2f}')
print(f'Overall subjectivity: {sent_subjectivity:.2f}')
print(f'Correlation: {sent_corr:.3f}')
print(f'Words: {sent_words:,}')
print(f'Sentences: {sent_sentences:,}')
print(f'Words/Sentence: {sent_avewords:.2f}')

Overall polarity: 0.10
Overall subjectivity: 0.41
Correlation: 0.366
Words: 4,797
Sentences: 322
Words/Sentence: 14.90


In [188]:
# Save to combined data frame and candidate variables

biden_polarity = sent_polarity
biden_subjectivity = sent_subjectivity
biden_corr = sent_corr
biden_words = sent_words
biden_sentences = sent_sentences
biden_avewords = sent_avewords 

df_combined = df_sent.copy()
df_combined.head()

Unnamed: 0,polarity,subjectivity,text,speaker,order
0,0.0,0.0,"Hello, hello, hello.",Biden,0
1,0.7,0.6,Good to see you all.,Biden,1
2,0.0,0.0,"Please, please, take a seat if you have one.",Biden,2
3,0.32,0.286667,"Thank you so very, very much for… It’s good to be back in Florida.",Biden,3
4,0.7,0.6,"I want to thank my good friend Debbie Wasserman Schultz, congresswoman.",Biden,4


## Process Second Data Source


In [189]:
# Load text from web-page, save to local file
# Trump Speech 10/12/2020
url = 'https://www.rev.com/blog/transcripts/donald-trump-campaign-rally-sanford-florida-transcript-october-12-first-rally-since-diagnosis'
article = Article(url)
article.download()
article.parse()
text = article.text

event = '-10-12-20'   

# write the text to a temp file
# text is a single string, so write() is used rather than writelines()
with open('speech.txt', 'w') as f:
    f.write(text)

In [190]:
# Load from saved file, review it, 
# drop lines as needed, perform necessary processing.

with open('speech.txt', 'r') as f:
    speech = f.readlines()

tmp = []
speaker = []
time = []
words = []

for cnt, line in enumerate(speech):
    if cnt % 2 == 0:
        tmp.append(line.rstrip())   # temp list of just the text lines 0,2

# last line of text is not part of speech, alter range in next line
# of code to stop short

for i in range(0,len(tmp)-1,2):
    speaker.append(tmp[i].split(': ')[0])  #split speaker line into 2 parts
    time.append(tmp[i].split(': ')[1])
    words.append(tmp[i+1])    # words from speaker

set(speaker)

{'Audience', 'Donald Trump'}

In [191]:
spkr = 'Donald Trump'
file = spkr.split()[-1] + event + '.txt'   # last name + event date, e.g. Trump-10-12-20.txt
file

# use write instead of writelines since we don't want entire list
# remember to add new line

with open(file,'w') as f:
    for i in range(0,len(speaker)):
        if speaker[i] == spkr:
            f.write(words[i]+'\n')
            
text = Path(file).read_text()
text

'Hello everybody. Hello Orlando. Hello Sanford. It’s great to be with you. Thank you. It’s great to be back. That’s a lot of people. Our competitor sleepy Joe, he had a rally today and practically nobody showed up. I don’t know what’s going on. Sleepy Joe. But it’s great to be back in my home state, Florida to make my official return to the campaign trail. I am so energized by your prayers and humbled by your support. We’ve had such incredible support and here we are. Here we are. But we’re going to finish. We’re going to make this country greater than ever before.\nThank you. Thank you. Thank you very much. Thank you very much. We’ve made tremendous progress. If you look at what we’re doing with therapeutics and frankly cures, we’ve made tremendous progress. And I said to my people, we are going to take whatever the hell they gave me and we’re going to distribute it around to hospitals and everyone’s going to have the same damn thing. We’ve all endured a lot together. And we are doing

In [192]:
# Perform sentiment analysis
blob = TextBlob(text)
blob.sentiment

Sentiment(polarity=0.18057246202315747, subjectivity=0.5082357300656914)

In [193]:
# Save sentiment data to dataframe

# Create sentiment dataframe
pd.set_option('max_colwidth', 400)

p = []
s = []
txt = []
spk = []
for sentence in blob.sentences:
    p.append(sentence.sentiment.polarity)
    s.append(sentence.sentiment.subjectivity)
    txt.append(str(sentence))
    spk.append(spkr.split()[-1])   # speaker's last name

df_sent = pd.DataFrame(p,columns=['polarity'])
df_sent['subjectivity'] = s
df_sent['text'] = txt
df_sent['speaker'] = spk
df_sent['order'] = df_sent.index.values
df_sent.head()

Unnamed: 0,polarity,subjectivity,text,speaker,order
0,0.0,0.0,Hello everybody.,Trump,0
1,0.0,0.0,Hello Orlando.,Trump,1
2,0.0,0.0,Hello Sanford.,Trump,2
3,0.8,0.75,It’s great to be with you.,Trump,3
4,0.0,0.0,Thank you.,Trump,4


In [194]:
df_sent.shape

(856, 5)

In [195]:
# Output key sentiment analysis results including:
#   Overall sentiment analysis scores for the document
#   Correlation of polarity and subjectivity scores across sentences

sent_polarity = blob.polarity
sent_subjectivity = blob.subjectivity
sent_corr = df_sent.polarity.corr(df_sent.subjectivity)
sent_words = sum(blob.word_counts.values())
sent_sentences = len(blob.sentences)
sent_avewords = sent_words/sent_sentences

print(f'Overall polarity: {sent_polarity:.2f}')
print(f'Overall subjectivity: {sent_subjectivity:.2f}')
print(f'Correlation: {sent_corr:.3f}')
print(f'Words: {sent_words:,}')
print(f'Sentences: {sent_sentences:,}')
print(f'Words/Sentence: {sent_avewords:.2f}')

Overall polarity: 0.18
Overall subjectivity: 0.51
Correlation: 0.313
Words: 10,374
Sentences: 856
Words/Sentence: 12.12


In [196]:
# Save to combined data frame and candidate variables

trump_polarity = sent_polarity
trump_subjectivity = sent_subjectivity
trump_corr = sent_corr
trump_words = sent_words
trump_sentences = sent_sentences
trump_avewords = sent_avewords 


print('Before:',df_combined.shape) 
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the rows the same way
df_combined = pd.concat([df_combined, df_sent]) # keep both indexes to give time estimate for animation
print('After:',df_combined.shape) 

Before: (322, 5)
After: (1178, 5)


# Combined Data Frame Complete
Show results

In [197]:
print('Statistic\tBiden\t\tTrump')
print(f'Polarity\t{biden_polarity:.2f}\t\t{trump_polarity:.2f}')
print(f'Subjectivity\t{biden_subjectivity:.2f}\t\t{trump_subjectivity:.2f}')
print(f'Total Words\t{biden_words:,}\t\t{trump_words:,}')
print(f'Sentences\t{biden_sentences:,}\t\t{trump_sentences:,}')
print(f'Words/Sentence\t{biden_avewords:.2f}\t\t{trump_avewords:.2f}')

Statistic	Biden		Trump
Polarity	0.10		0.18
Subjectivity	0.41		0.51
Total Words	4,797		10,374
Sentences	322		856
Words/Sentence	14.90		12.12
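
The rubric also asks for basic descriptive statistics and the polarity/subjectivity correlation for each source. A minimal sketch of one way to get both from the combined frame follows. It uses a small made-up frame in place of `df_combined` so that it runs on its own; the values are illustrative only and are not results from the speeches.

```python
import pandas as pd

# Tiny stand-in for df_combined; the numbers are invented for illustration
df_demo = pd.DataFrame({
    'speaker':      ['Biden', 'Biden', 'Biden', 'Trump', 'Trump', 'Trump'],
    'polarity':     [0.0, 0.7, 0.32, 0.0, 0.8, -1.0],
    'subjectivity': [0.0, 0.6, 0.29, 0.0, 0.75, 1.0],
})

# Descriptive stats (count, mean, std, quartiles) per speaker
stats = df_demo.groupby('speaker')[['polarity', 'subjectivity']].describe()

# Pearson correlation between polarity and subjectivity, per speaker
corr = {
    name: grp['polarity'].corr(grp['subjectivity'])
    for name, grp in df_demo.groupby('speaker')
}

print(stats)
print(corr)
```

Swapping `df_demo` for `df_combined` should give the per-speaker tables directly, since the `speaker` column already separates the two sources.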


In [198]:
# Print out the top-ranked sentences and their scores including:
#    10 most negative sentences including polarity and subjectivity
#    10 most positive sentences including polarity and subjectivity
#    10 most subjective sentences including polarity and subjectivity
#    10 most objective sentences including polarity and subjectivity


In [199]:
# Most negative
df_combined.sort_values(by=['polarity']).head(10)

Unnamed: 0,polarity,subjectivity,text,speaker,order
388,-1.0,1.0,"He may be the worst presidential candidate in history, and I got him, I got him.",Trump,388
717,-1.0,1.0,What a horrible deal.,Trump,717
623,-1.0,1.0,It’s so horrible.,Trump,623
596,-0.928571,1.0,"If you’re a murderer, if you’re a rapist, if you’re very, very sick with a disease that can spread all over, just come on in.",Trump,596
38,-0.91,0.866667,He had a very bad day.,Trump,38
616,-0.8,0.9,"And by the way, Mexico is paying, they hate to say it.",Trump,616
627,-0.714286,0.857143,He shouldn’t do that.” They are sick people.,Trump,627
774,-0.7,0.666667,He’s actually a bad guy.,Trump,774
304,-0.7,0.666667,It was so bad.,Trump,304
43,-0.7,0.666667,They didn’t have to be this bad.,Biden,43


In [200]:
# Most positive
df_combined.sort_values(by=['polarity']).tail(10)

Unnamed: 0,polarity,subjectivity,text,speaker,order
284,0.9,0.9,These guys are incredible.,Trump,284
565,0.9,0.9,"Yeah, they’ve been incredible.",Trump,565
557,0.91,0.78,"The one thing we know about Florida, you’re very good at this stuff.",Trump,557
594,1.0,1.0,"Oh, that’s wonderful, where’s our border?",Trump,594
695,1.0,1.0,Because we built the greatest economy in history.,Trump,695
225,1.0,0.3,But it was Florida’s best year.,Trump,225
372,1.0,0.3,"I’m not going to say the best, but I’m just about the best thing that ever happened to Puerto Rico.",Trump,372
337,1.0,1.0,"I know one thing, I was very happy to take it, that I can tell you.",Trump,337
641,1.0,1.0,"Some of them are here, they’re very proud people.",Trump,641
90,1.0,1.0,"When we inherited the largest recession, the greatest recession since the depression, what happened?",Biden,90


In [201]:
# Most subjective
df_combined.sort_values(by=['subjectivity']).tail(10)

Unnamed: 0,polarity,subjectivity,text,speaker,order
20,0.333333,1.0,We’ve made tremendous progress.,Trump,20
339,0.0,1.0,They’re waiting for final approval.,Trump,339
596,-0.928571,1.0,"If you’re a murderer, if you’re a rapist, if you’re very, very sick with a disease that can spread all over, just come on in.",Trump,596
95,-0.5,1.0,"I’m sorry, at the time I thought it was.",Trump,95
337,1.0,1.0,"I know one thing, I was very happy to take it, that I can tell you.",Trump,337
594,1.0,1.0,"Oh, that’s wonderful, where’s our border?",Trump,594
36,-0.3,1.0,He’s not a nice guy.,Trump,36
194,-0.1,1.0,"But the Trump campaign, which is not unusual because I’ve had a piece of it, the trump campaign has deliberately lied.",Biden,194
301,0.6,1.0,"And his own person, his chief of staff, said that it was a disaster the way they ran it.",Trump,301
612,0.85,1.0,"It’s going to be finished and so beautiful, wait’ll you see that.",Trump,612


In [202]:
# Most objective
df_combined.sort_values(by=['subjectivity']).head(10)

Unnamed: 0,polarity,subjectivity,text,speaker,order
0,0.0,0.0,"Hello, hello, hello.",Biden,0
332,0.0,0.0,"Johnson & Johnson, Moderna, Pfizer.",Trump,332
335,0.0,0.0,And maybe it’s why I’m here with you.,Trump,335
340,0.0,0.0,"And that vaccine will end the pandemic, but we’re also launching a historic effort to bring your medical supply chains back home.",Trump,340
341,0.0,0.0,"In 1996, Joe Biden voted to obliterate Puerto Rico’s thriving pharmaceutical industry.",Trump,341
343,0.0,0.0,He cut it out.,Trump,343
344,0.0,0.0,"And when he cut it out, he sent Puerto Rico into a nosedive like nobody’s ever seen before.",Trump,344
345,0.0,0.0,"So we’re bringing it all back, and we’re bringing it back to Florida too.",Trump,345
346,0.0,0.0,"We’re bringing it all the way back, taking our jobs away from China.",Trump,346
347,0.0,0.0,We’re bringing it back from China.,Trump,347


# Plot it

In [203]:
import plotly.graph_objects as go
import plotly.express as px

In [204]:
df_biden = df_combined[df_combined['speaker']=='Biden']
df_trump = df_combined[df_combined['speaker']=='Trump']
df_combined.shape

(1178, 5)

In [205]:

fig = px.scatter(df_combined,
                 x = 'polarity' ,
                 y = 'subjectivity',
                 color='speaker',
                 hover_data = ['speaker','text']
                )
fig.show()

In [207]:
# animated by time
fig = px.scatter(df_trump,
                 x = 'polarity' ,
                 y = 'subjectivity',
                 color='speaker',
                 hover_data = ['speaker','text'],
                 animation_frame = 'order'
                )
fig.show()

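* Overall sentiment scores are similar for the two speakers (Biden vs Trump: polarity 0.10 vs 0.18, subjectivity 0.41 vs 0.51), though the sentence-level distributions differ.
* Trump's sentences tend to have a wider distribution, with more extremes on the polarity scale.
* The scatter plot has an interesting T/V shape.
* The T shape indicates that the most subjective sentences run the entire polarity scale, while the most objective statements tend to be somewhat neutral.
* The V shape tends to come more from Trump's sentences.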
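
The visual impression of a wider polarity spread can be checked numerically. The sketch below again uses a small invented frame standing in for `df_combined`; the 0.7 cutoff for an "extreme" sentence is an arbitrary assumption, as are the helper column names.

```python
import pandas as pd

# Stand-in for df_combined; values are invented for illustration
df_demo = pd.DataFrame({
    'speaker':  ['Biden'] * 4 + ['Trump'] * 4,
    'polarity': [0.0, 0.1, 0.3, -0.2, 1.0, -1.0, 0.0, 0.8],
})

EXTREME = 0.7   # assumed cutoff for an "extreme" sentence

# Flag neutral and extreme sentences, then average the flags per speaker
df_demo['neutral'] = df_demo['polarity'] == 0
df_demo['extreme'] = df_demo['polarity'].abs() >= EXTREME

spread = df_demo.groupby('speaker').agg(
    polarity_std=('polarity', 'std'),
    share_neutral=('neutral', 'mean'),
    share_extreme=('extreme', 'mean'),
)
print(spread)
```

A larger `polarity_std` and `share_extreme` for one speaker would support the observation about wider spread; the real values depend on running this against `df_combined`.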

Additional Analysis:
* It would be interesting to animate the plot by transcript time stamp rather than by sentence order.
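
As a starting point, the `time` list collected during parsing could be converted to elapsed seconds and used as the animation frame. A minimal sketch, assuming the stamps look like `(MM:SS)` or `(HH:MM:SS)`; the helper name `to_seconds` is hypothetical.

```python
def to_seconds(stamp):
    """Convert '(MM:SS)' or '(HH:MM:SS)' to elapsed seconds."""
    parts = [int(p) for p in stamp.strip().strip('()').split(':')]
    seconds = 0
    for part in parts:
        seconds = seconds * 60 + part   # base-60 accumulate, left to right
    return seconds

stamps = ['(00:00)', '(09:48)', '(01:02:03)']
elapsed = [to_seconds(s) for s in stamps]
print(elapsed)   # [0, 588, 3723]
```

Each sentence would then need to inherit the stamp of the speaker block it came from, since sentences are split only after the blocks are joined.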

In [150]:
df_combined.head()

Unnamed: 0,polarity,subjectivity,text,speaker
0,0.0,0.0,"Hello, hello, hello.",Biden
1,0.7,0.6,Good to see you all.,Biden
2,0.0,0.0,"Please, please, take a seat if you have one.",Biden
3,0.32,0.286667,"Thank you so very, very much for… It’s good to be back in Florida.",Biden
4,0.7,0.6,"I want to thank my good friend Debbie Wasserman Schultz, congresswoman.",Biden
