In [None]:
from IPython.display import HTML
display(HTML('<marquee direction="down" width=100% height=300 behavior="alternate" >\
             <marquee height=300 style = "color: red; font-size : 100px;" behavior="alternate" >\
    <b>Trump vs. Biden!</b>\
             </marquee>\
             </marquee>'))

# Analysis of First and Second Presidential Debates in 2020

In this notebook we analyze a hot dataset that's filled with political topics and mysteries! As someone who was both confused and intrigued while watching the debates, I found nothing more interesting than trying to figure out the character and personality of each debater by analyzing their speech!

So I picked up my laptop and searched the internet for the transcripts of both the first and second debates. Fortunately, a great soul had already put them in neat csv files on [Kaggle](https://www.kaggle.com/headsortails/us-election-2020-presidential-debates), so a bit of the hassle was already saved!

In this notebook I want to present what I found interesting as I went through their talk. I used different packages and methods to analyze the data as best as I could, so I'll talk about libraries like TextBlob, Transformers, nltk and so many others and I'll give you an example of how each can be used and what advantages or disadvantages each has.

First, we need to install and import some packages and libraries.

In [None]:
%%HTML
<a id="Analysis"></a>
<center>
<iframe width="700" height="315" src="https://www.youtube.com/embed/kyuDlnYGGQI" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" style="position: relative;top: 0;left: 0;" allowfullscreen ng-show="showvideo"></iframe>
</center>

In [None]:
# For math and analyzing dataframes
import numpy as np
import pandas as pd


# For analyzing text
import spacy
from spacy import displacy
import en_core_web_sm
nlp = en_core_web_sm.load()
import nltk
import string
import regex as re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

from collections import Counter 

# TextBlob and its classifier
from textblob import TextBlob 
from textblob.classifiers import NaiveBayesClassifier 

# For vis
import matplotlib.pyplot as plt
import seaborn as sns
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import plotly.express as px
import plotly.figure_factory as ff
import warnings
import os
%matplotlib inline

# For time column
import datetime

# For word cloud
from PIL import Image
from wordcloud import WordCloud 

In [None]:
pip install -U textblob

In [None]:
!python -m textblob.download_corpora

In [None]:
!python -m pip install -U pip
!pip install wheel

In [None]:
pip install plotly==4.12.0

Take a look at the datasets:

In [None]:
path = '../input/us-election-2020-presidential-debates'

first = pd.read_csv(path + '/us_election_2020_1st_presidential_debate.csv')
second = pd.read_csv(path + '/us_election_2020_2nd_presidential_debate.csv')

In [None]:
first[100:105]

In [None]:
second[50:55]

# Analysis Begins!

### **Null Values**
From the code below, we can see how many null values exist in each column of each dataset. There is only one null value: it is in the first dataset, in the minute column, which holds the starting time of each person's utterance. Looking at the rows around it, we can see that it marks the start of the second part of the debate, so we can simply replace it with 00:00.

Also for more coherence and simplicity we change the names of debaters to '**Donald Trump**' and '**Joe Biden**' and the mediators to '**mediator_1**' and '**mediator_2**'.

In [None]:
null_df = pd.DataFrame(pd.concat([first.isnull().sum(), second.isnull().sum()], axis = 1))
null_df.columns = ['first', 'second']
null_df

In [None]:
first.iloc[178:181,:]

In [None]:
first.loc[first.minute.isnull(), 'minute'] = '00:00'

In [None]:
print('names in the first dataset:', (first.speaker.unique()))
print('names in the second dataset:', (second.speaker.unique()))

first.loc[first.speaker.str.contains('Chris Wallace:'), 'speaker'] = 'Chris Wallace' # correcting the typo in the name

# changing their names for more simplicity and coherence in two datasets
first.loc[first.speaker.str.contains('Vice President Joe Biden'), 'speaker'] = 'Joe Biden'
first.loc[first.speaker.str.contains('President Donald J. Trump'), 'speaker'] = 'Donald Trump'

first.loc[first.speaker.str.contains('Chris Wallace'), 'speaker'] = 'mediator_1'
second.loc[second.speaker.str.contains('Kristen Welker'), 'speaker'] = 'mediator_2'

print('Modified names in the first dataset:', (first.speaker.unique()))
print('Modified names in the second dataset:', (second.speaker.unique()))

# **The Minute Column**

In this part we build one consistent timeline that covers all of their speeches, instead of two parts that each have their own clock. To do that, we parse the minute column into hours, minutes and seconds and then give every row the same format.

For each cell (paragraph) we calculate the time difference between this paragraph and the previous one, and then add this difference to the previous time to get the current time in seconds. Because the first debate has two parts and the second part starts at 00:00, we start counting from zero again at the beginning of the second part. We also keep the last value of the first part (its overall length in seconds) and the index of the row where the second part begins, so that we know from which row on we have to add the length of the first part to the values of the second part.

This was my way of building the continuous timeline, but if you have any better ideas please let me know. I'd love to hear your comments and suggestions!
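
One possible simplification, sketched in plain Python: convert every timestamp to seconds and, whenever the clock jumps backwards (the start of a new part), add the last time of the previous part as an offset. The helpers `to_seconds` and `continuous_seconds` are my own illustrative names, not part of the dataset or any library, and unlike the cell below this sketch keeps the first row's own timestamp instead of forcing it to 0.

```python
def to_seconds(stamp):
    """Convert 'mm:ss' or 'hh:mm:ss' into a number of seconds."""
    total = 0
    for part in str(stamp).split(':'):
        total = total * 60 + int(part)
    return total


def continuous_seconds(stamps):
    """Turn timestamps that restart at 00:00 into one running timeline."""
    result, offset, previous = [], 0, None
    for stamp in stamps:
        current = to_seconds(stamp)
        # the clock went backwards, so a new part of the debate has started
        if previous is not None and current < previous:
            offset += previous
        result.append(current + offset)
        previous = current
    return result


demo = ['01:20', '02:10', '01:01:05', '00:00', '00:45']
print(continuous_seconds(demo))  # [80, 130, 3665, 3665, 3710]
```

Assuming the `minute` column holds strings like the ones in `demo`, something like `first['seconds'] = continuous_seconds(first.minute)` should produce the same kind of column.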

In [None]:
# making the time consecutive

# First Debate

first['seconds'] = 0 # we assume we start from 0
                  # and then add the values accordingly


for i, tm in enumerate(first.minute[1:],1):
    timeParts = [int(s) for s in str(tm).split(':')]
    
    # when we have hour like 01:10:50
    if (len(timeParts)>2) and (i<len(first)):
        
        current = (timeParts[0] * 60 + timeParts[1]) * 60 + timeParts[2]
        difference = current - first.loc[i-1, 'seconds']
        first.loc[i, 'seconds'] = first.loc[i-1, 'seconds'] + difference
    # when we get to the second half of the debate
    elif str(tm) == '00:00' :
        first.loc[i, 'seconds'] = 0
        second_round_idx = i
        second_round_final_time = first.loc[i-1, 'seconds']

    # when there's only minute and seconds like 10:50
    elif (i<len(first)):
        current = timeParts[0] * 60 + timeParts[1]
        difference = current - first.loc[i-1, 'seconds']
        first.loc[i, 'seconds'] = first.loc[i-1, 'seconds'] + difference

first.loc[second_round_idx:, 'seconds'] += second_round_final_time


# Second Debate

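second['seconds'] = 0
# reset the markers so that values left over from the first debate are not
# reused if this transcript has no '00:00' restart
second_round_idx, second_round_final_time = len(second), 0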

for i, tm in enumerate(second.minute[1:],1):
    timeParts = [int(s) for s in str(tm).split(':')]
    
    # when we have hour like 01:10:50
    if (len(timeParts)>2) and (i<len(second)):
        
        current = (timeParts[0] * 60 + timeParts[1]) * 60 + timeParts[2]
        difference = current - second.loc[i-1, 'seconds']
        second.loc[i, 'seconds'] = second.loc[i-1, 'seconds'] + difference

    # when we get to the second half of the debate
    elif str(tm) == '00:00' :
        second.loc[i, 'seconds'] = 0
        second_round_idx = i
        second_round_final_time = second.loc[i-1, 'seconds']
    # when there's only minute and seconds like 10:50
    elif (i<len(second)):
        current = timeParts[0] * 60 + timeParts[1]
        difference = current - second.loc[i-1, 'seconds']
        second.loc[i, 'seconds'] = second.loc[i-1, 'seconds'] + difference

second.loc[second_round_idx:, 'seconds'] += second_round_final_time



first['minutes'] = first.seconds.apply(lambda x:x//60)
second['minutes'] = second.seconds.apply(lambda x:x//60)

# We use this format of %h:%m:%s by using the following command
first['time'] = first.seconds.apply(lambda x:str(datetime.timedelta(seconds=x)))
second['time'] = second.seconds.apply(lambda x:str(datetime.timedelta(seconds=x)))

In [None]:
first[55:60]

# Heat of The Discussion with <span style ="color :red">Heat Map</span>!

This part came to mind when I saw how many times the candidates were interrupting each other's speech. To see in which parts the candidates were most eager to talk or to interrupt each other, I plotted a heatmap for each debate. I found another cool heatmap representation in [this article](https://towardsdatascience.com/1st-presidential-debate-by-the-numbers-dee50b35f4ac), although the author uses multiple tools (and not just Python) to create the plot.

In the following plots, the darker the color, the more times a speaker started talking (or interrupted the other) during that minute. As you can see, there are times when both speakers start talking more often than usual. I consider a discussion normal when there is no color (silence, or NaN) for one speaker while the other one is colored (is speaking).

Generally speaking, in both debates there seem to be **three parts** where the candidates start firing at each other: one after the introductions and warming up, another in the middle, and one around 15 minutes before the end, where they may be trying to prove their points by talking faster and presumably finish off strong!

Also, after each heated discussion we can see that they cool down and talk normally for about 20 minutes.
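
If you'd rather get the heated minutes as a list instead of reading them off the colors, here is a rough sketch. It counts turns per minute and speaker and keeps the minutes in which every candidate started talking at least `min_turns` times. The function name `heated_minutes` and the threshold of 2 are my own assumptions, not something taken from the dataset.

```python
from collections import defaultdict


def heated_minutes(turns, candidates, min_turns=2):
    """turns: iterable of (minute, speaker) pairs, one pair per utterance."""
    counts = defaultdict(lambda: defaultdict(int))
    for minute, speaker in turns:
        counts[minute][speaker] += 1
    # keep the minutes where every candidate reaches the threshold
    return sorted(
        minute for minute, per_speaker in counts.items()
        if all(per_speaker.get(name, 0) >= min_turns for name in candidates)
    )


demo_turns = [(0, 'mediator_1'), (1, 'Donald Trump'), (2, 'Joe Biden'),
              (3, 'Donald Trump'), (3, 'Joe Biden'), (3, 'Donald Trump'),
              (3, 'Joe Biden'), (4, 'Joe Biden')]
print(heated_minutes(demo_turns, ['Donald Trump', 'Joe Biden']))  # [3]
```

With the dataframes from this notebook, the input could be built with `zip(first.minutes, first.speaker)`.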

In [None]:
heat = first.groupby(['minutes', 'speaker']).count().reset_index()
fig = go.Figure(data=go.Heatmap(
                z=heat.minute,
                x=heat.minutes,
                y=heat.speaker,
                colorscale='Viridis_r',
                colorbar=dict(
                title="Heat of the discussion",
                titleside="top",
                tickmode="array",
                tickvals=[1, 4, 10],
                ticktext=["very cool", "normal", "Hot!"],
                ticks="outside"
    )
        ))

fig.update_layout(title='First Debate: # of times each one talks in each minute',
                 xaxis_nticks=36)


fig.show()

# Create and show figure


In [None]:
heat = second.groupby(['minutes', 'speaker']).count().reset_index()
fig = go.Figure(data=go.Heatmap(
        z=heat.minute,
        x=heat.minutes,
        y=heat.speaker,
        colorscale='Viridis_r',
        colorbar=dict(
        title="Heat of the discussion",
        titleside="top",
        tickmode="array",
        tickvals=[2, 5, 10],
        ticktext=["very cool", "normal", "Hot!"],
        ticks="outside"
    )
        ))

fig.update_layout(title='Second Debate: # of times each one talks in each minute',
                 xaxis_nticks=36)

fig.show()

# Create and show figure


# Go Sentence Level

Now let's use a sentence tokenizer and analyze the sentences used in each debate. Below I use NLTK's Punkt sentence tokenizer (`sent_detector`) for this task and add another field called number_of_sents, which holds the number of sentences a speaker uses each time they start speaking.

In [None]:
# we want to analyze the debate based on the sentences
# so we use a sentence level tokenizer from nltk
sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')

# number of sentences used by each person, each time they're allowed to talk
first['number_of_sents'] = first.text.apply(lambda x:len(sent_detector.tokenize(x)))
second['number_of_sents'] = second.text.apply(lambda x:len(sent_detector.tokenize(x)))

In [None]:
# summing up the number of sentences per speaker
# (select the column before summing so the text columns are not aggregated)
number_of_sentences_1 = first.groupby('speaker')[['number_of_sents']].sum().reset_index()
number_of_sentences_2 = second.groupby('speaker')[['number_of_sents']].sum().reset_index()
# total number of sentences in each debate
total_num_sents_1 = number_of_sentences_1.number_of_sents.sum()
total_num_sents_2 = number_of_sentences_2.number_of_sents.sum()
# share of the conversation taken by each speaker,
# based on the number of sentences they used
number_of_sentences_1['percentage'] = (number_of_sentences_1.number_of_sents / total_num_sents_1).round(2)
number_of_sentences_2['percentage'] = (number_of_sentences_2.number_of_sents / total_num_sents_2).round(2)

number_of_sentences_1

In [None]:
fig = make_subplots(rows=2, cols=2,
                    specs=[[{"rowspan": 2}, {}]
                            ,[None,         {}]],
                    subplot_titles=("# of sentences in total","First Debate", "Second Debate"))

fig.add_trace(go.Bar(x=['First Debate', 'Second Debate'], 
                     y=[total_num_sents_1, total_num_sents_2],
                     text =[total_num_sents_1, total_num_sents_2]),
                     row=1, col=1)

fig.add_trace(go.Bar(x=number_of_sentences_1.speaker,
                     y=number_of_sentences_1.number_of_sents,
                     text =number_of_sentences_1.percentage),
                     row=1, col=2)
fig.add_trace(go.Bar(x=number_of_sentences_2.speaker,
                     y=number_of_sentences_2.number_of_sents,
                     text =number_of_sentences_2.percentage),
                     row=2, col=2)

fig.update_traces(textposition='outside', textfont_size=14)
fig.update_layout(showlegend=False, title_text="Number of sentences in Both debates")
fig.update_yaxes(title_text='count')
fig.show()

While analyzing this plot, I thought that the number of sentences alone does not show all the characteristics of the debaters, because you don't know how many words they used in each sentence. For instance, if you often say "what?!" or "Oh, really?!", you are more likely someone who interrupts a lot and uses fewer words per sentence. So next we look at the distribution of the number of sentences per turn and see how each character presents himself!
For a better understanding of the debaters you can click on the mediator_1 and mediator_2 to make them disappear and look at what is left.
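
The point about sentence length can be checked with a rough sketch like the one below, which computes the average number of words per sentence for one piece of text. To keep it self-contained I use two simple regular expressions as stand-ins for the NLTK sentence and word tokenizers, so the counts may differ slightly from what NLTK would give. `words_per_sentence` is a hypothetical helper of mine.

```python
import re


def words_per_sentence(text):
    """Average number of words per sentence, using naive regex splitting."""
    # split after ., ! or ? when it is followed by whitespace
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]
    if not sentences:
        return 0.0
    n_words = sum(len(re.findall(r"[A-Za-z0-9']+", s)) for s in sentences)
    return n_words / len(sentences)


print(words_per_sentence("What?! Oh, really?!"))                           # 1.5
print(words_per_sentence("I think we should look at the numbers first."))  # 9.0
```

Applied to the `text` column (for example with `first.text.apply(words_per_sentence)`), low values would point to short interjections and high values to longer statements.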

In [None]:
fig = make_subplots(rows=2, cols=2,
                    specs=[[{"colspan": 2}, None]
                            ,[{"colspan": 2}, None]],
                    subplot_titles=("First Debate", "Second Debate"))
#-------------------------------------------------- First Debate -----------------------------------------------------#
fig.add_trace(go.Histogram(
    x=first[first.speaker == 'Donald Trump'].number_of_sents,
    name='Trump_1',  xbins=dict(start=-1, end=24, size=1),
    marker_color='red', opacity=0.75),
    row = 1, col = 1)

fig.add_trace(go.Histogram(
    x=first[first.speaker == 'Joe Biden'].number_of_sents,
     name='Biden_1', xbins=dict(start=-1, end=24, size=1),
    marker_color='#3498DB', opacity=0.75),
    row = 1, col = 1)

fig.add_trace(go.Histogram( 
    x=first[first.speaker == 'mediator_1'].number_of_sents,
    name='mediator_1', xbins=dict(start=-1, end=24, size=1),
    marker_color='#5D6D7E', opacity=0.75),
    row = 1, col = 1)
#-------------------------------------------------- Second Debate -----------------------------------------------------#

fig.add_trace(go.Histogram(
    x=second[second.speaker == 'Donald Trump'].number_of_sents,
    name='Trump_2',  xbins=dict(start=-1, end=24, size=1),
    marker_color='red', opacity=0.75),
    row = 2, col = 1)

fig.add_trace(go.Histogram(
    x=second[second.speaker == 'Joe Biden'].number_of_sents,
     name='Biden_2', xbins=dict(start=-1, end=24, size=1),
    marker_color='#3498DB', opacity=0.75),
    row = 2, col = 1)

fig.add_trace(go.Histogram( 
    x=second[second.speaker == 'mediator_2'].number_of_sents,
    name='mediator_2', xbins=dict(start=-1, end=24, size=1),
    marker_color='#5D6D7E', opacity=0.75),
    row = 2, col = 1)




fig.update_layout(
    title_text='Both Debates: Histogram of # of Sentences Each One Used Each Time',
    yaxis_title_text='Count', 
    bargap=0.1, bargroupgap=0.1)

fig.show()

Well, as we can see, Trump is the one who uses both the smallest and the largest number of sentences each time he starts talking. Compared to Biden, he uses a single sentence 20 to 30 more times, which suggests that he tends to interrupt his opponent or to answer briefly with one or two sentences. (I don't think short answers are the whole story, though, because he also uses long paragraphs more often than Biden.)

# Go Word Level

In order to explore the lexicon used by the candidates, we take all the text and remove punctuation and unnecessary words like 'is', 'in', 'at', etc. as much as possible, to get to the gist of what they're trying to say. Below you'll see a function called clean(), and you may notice arguments like 'http' although we don't have any links in this dataset.

This is because it is a general custom function that I built for myself, and I usually use it with a few tweaks here and there to adapt it to the dataset. I wanted to share the whole thing here so you can use it for cleaning other datasets as well. So don't get confused; just set each argument to **True**, **False**, or whatever you wish.

In [None]:
# Spacy packages
sp = spacy.load('en_core_web_sm') 
spacy_st = list(nlp.Defaults.stop_words) # 362 stop words 
# NLTK packages
nltk.download('stopwords')
nltk.download('punkt')
nltk_st = stopwords.words('english') # 179 stop words

# some additional punctuation marks observed in the dataset (curly quotes and ellipsis)
punc = string.punctuation + '’”“…'


# a general cleaning function
def clean(t, lower = False, http = False, punct = False,
          lem = False, stop_w = False, num = False,
          custom_st = ['a','the', 'and', 'there', 'that', 'this', 'am', 'on',
                       'if', 'it', 'to', 'at', 'a', 'of', 'in', 'out', 'were',
                       'was', 'do', 'did', "don't","didn't", 'be', 'are', 'is',
                       'being', "it's", 'have', 'had', 's', 'j', 't', 're',
                       'at', 'with', 'just', 'now', "can't", 'can', 'up',
                       'as', 'from', 'thing', 'by', 'so', 'here', 'will', 'for']):

    if lower:
        t = t.lower()
    
    if http:
        t = re.sub(r"https?://t\.co/[A-Za-z0-9]*", '', t)

    # lemmitizing
    if lem:
        # spacy replaces pronouns with '-PRON-' and we don't want that to happen
        # so we lemmatize everything exept words that are recognized as pronouns
        lemmatized = [word.lemma_ if word.lemma_ !='-PRON-' else word.text for word in sp(t)]
        t = ' '.join(lemmatized)

    # stop words
    if stop_w == 'nltk':
        t = [word for word in word_tokenize(t) if not word.lower() in nltk_st]
        t = ' '.join(t)

    elif stop_w == 'spacy':
        t = [word for word in word_tokenize(t) if not word.lower() in spacy_st]
        t = ' '.join(t)
        
    elif stop_w == 'custom':
        t = [word for word in word_tokenize(t) if not word.lower() in custom_st]
        t = ' '.join(t)

    # punctuation removal
    if punct:
        t = t.translate(str.maketrans('', '', punc))
    if num:
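        t = re.sub(r"[0-9]", "", t)
    # removing extra spaces and letters
    t = re.sub(r"\s+", ' ', t)
    # note: without a raw string, "\b" is a backspace character, so this pattern
    # matches nothing in normal text; r"\b\w\b" would drop every single-letter
    # word (including 'i'), which would change the word counts below
    t = re.sub("\b\w\b", '', t)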
    return t

first['cleaned_text'] = first.text.apply(lambda x: clean(x, lower = True, http = False,
                                                         punct = True, lem = True, 
                                                         stop_w = 'custom', num = True))
second['cleaned_text'] = second.text.apply(lambda x: clean(x, lower = True, http = False, 
                                                           punct = True, lem = True, 
                                                           stop_w = 'custom', num = True))

Because we want to find the most common words, we split the cleaned_text column into words and then, with the most_common() function, count and sort the words each candidate uses most frequently. I applied most_common() to the concatenated first and second dataframes, but you can do the same for each debate separately. ***(I did so at first but there wasn't much difference. It turns out people have a propensity to use the same lexicon and structure across different topics and situations.)***

In [None]:
first['words'] = first['cleaned_text'].apply(lambda x:x.split())
second['words'] = second['cleaned_text'].apply(lambda x:x.split())

def most_common(df, name, top_n):
    list_ = []
    for w in df[df.speaker == name].words:
        list_.extend(w)
    Counter_1 = Counter(list_) 
    most_occured = Counter_1.most_common(top_n) 
    return most_occured

Trump_w = most_common(pd.concat([first, second], axis = 0), 'Donald Trump', 60)
Biden_w = most_common(pd.concat([first, second], axis = 0), 'Joe Biden', 60)


If you look closely at the treemap below, you'll notice that each person has a different way of referring to the other. Trump is more likely to address his opponent directly, while Biden tends to talk more to the mediator and use the pronoun 'he' to refer to Trump. From what I see, they mostly use the same words, but in different ways. For example, of all the top 60 words Trump uses, around 10 percent is the pronoun 'I' (nearly twice Biden's share) and another 10 percent is the pronoun 'you', while Biden spreads his words more broadly: no single word makes up more than 7 percent of his most common words.
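
The percentages quoted here are shares within each candidate's top words, which is also what the "percent parent" text in a treemap shows. A minimal sketch of that calculation, with a made-up token list and a hypothetical helper name `word_shares`:

```python
from collections import Counter


def word_shares(tokens, top_n):
    """Share of each of the top_n words among the top_n words only."""
    top = Counter(tokens).most_common(top_n)
    total = sum(count for _, count in top)
    return {word: round(count / total, 3) for word, count in top}


demo_tokens = "i you i they you i jobs".split()
shares = word_shares(demo_tokens, 3)
print(shares['i'], shares['you'])  # 0.5 0.333
```

With the lists built above, the same idea would be `total = sum(v for k, v in Trump_w)` followed by `v / total` for each word.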

In [None]:
fig = make_subplots(
    cols = 2, rows = 1,
    column_widths = [0.5, 0.5],
    subplot_titles = ('most common words used by: <b>Trump</b>', 'most common words used by: <b>Biden</b>'),
    specs = [[{'type': 'treemap', 'rowspan': 1}, {'type': 'treemap'}]]
)

fig.add_trace(go.Treemap(
    labels = [k for k,v in Trump_w],
    parents = ['Trump'] * len(Trump_w),  # one parent entry per label
    values = [v for k,v in Trump_w],
    textinfo = "label+value+percent parent",
    ),
              row = 1, col = 1)

fig.add_trace(go.Treemap(

    labels = [k for k,v in Biden_w],
    parents = ['Biden'] * len(Biden_w),  # one parent entry per label
    values = [v for k,v in Biden_w],
    textinfo = "label+value+percent parent",
    outsidetextfont = {"size": 20, "color": "darkblue"},
    marker = {"line": {"width": 2}}),
              row = 1, col = 2)

fig.show()

## Word Cloud

Having all these words, we can use the most common words used by each candidate to make a word cloud for each. For a word cloud in a picture(or a mask) you have to have two things:
 1. Provide the words 
 2. Provide the mask(the picture)

You can use the images named 'Trump.png' and 'Biden.png' from my repository on github or make your own image. Just remember that they have to have transparent backgrounds.

In [None]:
tweet_mask = np.array(Image.open("../input/analysis-of-trump-biden-debates/Biden.png"))
wc = WordCloud(collocations=False,
               background_color="black",
               max_words=200,
               mask = tweet_mask,
               contour_color='yellow',
               contour_width=20,)

# Generate a wordcloud
wc.generate(' '.join([k for k,v in Biden_w]))


# show
plt.figure(figsize=[20,10])
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.show()

In [None]:

tweet_mask = np.array(Image.open("../input/analysis-of-trump-biden-debates/Trump.png"))
wc = WordCloud(collocations=False,
               background_color="black",
               max_words=200,
               mask = tweet_mask,
               contour_color='yellow',
               contour_width=10,)

# Generate a wordcloud
wc.generate(' '.join([k for k,v in Trump_w]))


# show
plt.figure(figsize=[20,10])
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.show()

# Now we get to the Sentiment Analysis

We want to analyze each sentence, so once again we use sent_detector to split paragraphs into sentences, but this time we put each sentence in its own row to get a better understanding of our data. In this part we use packages like TextBlob and 🤗 Transformers pipelines and models.


In [None]:
# number of sentences in each cell
lens_1 = first.number_of_sents
lens_2 = second.number_of_sents

# making a long list of all sentences
list_1 =[]
list_2 = []
for x in first.text.apply(lambda x:sent_detector.tokenize(x)):
    list_1.extend(x)
for x in second.text.apply(lambda x:sent_detector.tokenize(x)):
    list_2.extend(x)

# create new dataframes, repeating as appropriate
first_sent = pd.DataFrame({'speaker': np.repeat(first.speaker, lens_1),
                            'time': np.repeat(first.time, lens_1),
                            'sent': list_1})
second_sent = pd.DataFrame({'speaker': np.repeat(second.speaker, lens_2),
                            'time': np.repeat(second.time, lens_2),
                            'sent': list_2})
first_sent.head()

In [None]:
display(HTML('<dev direction="down" width=100% height="40" behavior="still" >\
<dev  style = "  font-size : 50px;" behavior="still" >\
  <b>üëà  Polarity and Subjectivity üëâ </b>\
  </dev>\
</dev>'))

### **TextBlob**
TextBlob is a library designed for processing textual data and for common natural language processing tasks. In this notebook, we only use its sentiment property to see how each candidate expresses their feelings, and we'll also see whether this is the best package for doing so.
\
**Polarity** is a value between -1 and 1 that shows how negative or positive the sentiment of a sentence is. TextBlob first identifies the words that are in its [lexicon](https://github.com/sloria/TextBlob/blob/eb08c120d364e908646731d60b4e4c6c1712ff63/textblob/en/en-sentiment.xml) and averages the polarity of all the different meanings of each word. It then adjusts these scores for modifiers such as negations and intensifiers (for example 'not' and 'very') and averages the results to get the polarity of the whole sentence. [This is a great blog post](https://planspace.org/20150607-textblob_sentiment/) where you can read more on how it is calculated.
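
To make the idea of lexicon-based scoring concrete, here is a toy sketch: look up each known word, flip and dampen the score after a negation, and average whatever was found. The tiny `TOY_LEXICON` and its values are made up for illustration, and this is a heavy simplification of the general approach, not TextBlob's actual implementation.

```python
# hypothetical mini lexicon: word -> polarity (values invented for this demo)
TOY_LEXICON = {'great': 0.8, 'good': 0.7, 'bad': -0.7, 'terrible': -1.0}


def toy_polarity(sentence):
    """Average the polarity of known words; unknown words are ignored."""
    scores, negate = [], False
    for word in sentence.lower().split():
        word = word.strip('.,!?')
        if word == 'not':
            negate = True          # affects the next known word
            continue
        if word in TOY_LEXICON:
            score = TOY_LEXICON[word]
            scores.append(-0.5 * score if negate else score)
            negate = False
    return sum(scores) / len(scores) if scores else 0.0


print(toy_polarity('That was a good answer'))            # 0.7
print(toy_polarity('Fewer people are dying every day'))  # 0.0
```

The second call returns 0.0 simply because none of its words are in the toy lexicon, which is the kind of blind spot to keep in mind with any lexicon-based score.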


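But because polarity is a continuous value and not a concrete label, I decided to divide the spectrum into 5 groups and give a label to each:

0 - negative (polarity of -0.6 or lower)

1 - somewhat negative (above -0.6, up to -0.2)

2 - neutral (above -0.2, up to 0.2)

3 - somewhat positive (above 0.2, up to 0.6)

4 - positive (above 0.6)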
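
As a small sketch of this bucketing, the cut points -0.6, -0.2, 0.2 and 0.6 (the ones used in the sentiment cell further down) can be stored in a sorted list and looked up with `bisect_left` from the standard library. `EDGES`, `LABELS` and `sentiment_label` are names I made up for this example.

```python
from bisect import bisect_left

EDGES = [-0.6, -0.2, 0.2, 0.6]
LABELS = ['negative', 'somewhat negative', 'neutral',
          'somewhat positive', 'positive']


def sentiment_label(polarity):
    """Return 0..4: the number of cut points strictly below the polarity."""
    return bisect_left(EDGES, polarity)


for p in (-1.0, -0.2, 0.0, 0.6, 0.7):
    print(p, sentiment_label(p), LABELS[sentiment_label(p)])
```

A value that sits exactly on a cut point (for example 0.6) falls into the lower group, which matches the strict `>` comparisons in the nested conditional used later.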
 
**Subjectivity** is a value between 0 and 1 that shows how subjective a word or phrase is: 0 is very objective and 1 is very subjective.

(Pretty obvious! I know.)

Advantages of TextBlob: TextBlob provides fast and easy-to-use tools for analyzing the overall sentiment and subjectivity of the sentences and phrases in your data. But its lexicon-based approach is pretty naive compared to deep learning models. Let's see an example of how it may disappoint us. Take a look at the following sentence:


***Fewer people are dying every day***

As human beings, we understand that this sentence evidently conveys a positive message. But remember that TextBlob only takes into account the words and phrases that are in its lexicon and then combines their polarities. Well, it turns out that its vocabulary does not include words like "fewer" or "dying", and even after lemmatization, which gives the following sentence:

***Few people are die every day***

TextBlob won't be able to recognize this sentence as a positive one because, again, "die" is not included in its vocabulary.
So for sentences that don't have a definite negative or positive tone and read a little like facts, polarity doesn't do a great job. Subjectivity, on the other hand, can show good results when the sentence is short and the tone is clear, as you can see in the table below. Go ahead and experiment with different words and sentences yourself to see how it performs.

### more info:
You may say that we can change this, and it is true that we can do something rather different. Fortunately, TextBlob provides a NaiveBayesClassifier that you can train on your own labeled data and then ask to classify your text.

But even this classifier isn't our best bet, as it won't help us with polarity, and we would have to provide a lot of data to a model that is not the best model for text classification.
You can find some datasets online ([like these two from IBM datasets](https://www.research.ibm.com/cgi-bin/haifa/vst/debating_ds19.pl)) that have a large lexicon with specified polarity, and maybe train your naive Bayes classifier on them to improve your classification. But as there are much better models out there for sentiment classification, I personally won't get into its details.




In [None]:
TextBlob('Fewer people are dying everyday').sentiment.polarity

In [None]:
# first df
first_sent['polarity'] = first_sent.sent.apply(lambda x: TextBlob(x).polarity)
first_sent['subjectivity'] = first_sent.sent.apply(lambda x: TextBlob(x).subjectivity)
first_sent['sentiment'] = first_sent.polarity.apply(lambda x: 4 if x>0.6 else 3 if x>0.2 else 2 if x>-0.2 else 1 if x>-0.6  else 0)

# second df
second_sent['polarity'] = second_sent.sent.apply(lambda x: TextBlob(x).polarity)
second_sent['subjectivity'] = second_sent.sent.apply(lambda x: TextBlob(x).subjectivity)
second_sent['sentiment'] = second_sent.polarity.apply(lambda x: 4 if x>0.6 else 3 if x>0.2 else 2 if x>-0.2 else 1 if x>-0.6  else 0)

first_sent.reset_index(drop = True, inplace = True)
second_sent.reset_index(drop = True, inplace = True)

In [None]:
both = pd.concat([first_sent, second_sent], axis = 0)
fig = go.Figure()
fig.add_trace(go.Histogram(
    x=both[both.speaker == 'Donald Trump'].subjectivity,
    name='Trump',  xbins=dict(start=-1, end=2, size=0.1),
    marker_color='red', opacity=0.75))

fig.add_trace(go.Histogram(
    x=both[both.speaker == 'Joe Biden'].subjectivity,
     name='Biden', xbins=dict(start=-1, end=2, size=0.1),
    marker_color='#3498DB', opacity=0.75))

fig.update_layout(
    title_text="Number of Sentences used by Debaters with different Subjectivities",
    yaxis_title_text='Number of Sentences', 
    xaxis_title_text='Subjectivity',
    bargap=0.1, bargroupgap=0.1)

What we see here is that Trump has more sentences with 0 subjectivity (he uses more neutral statements). But we cannot conclude that he talks less subjectively, because as we saw earlier he accounts for around 40 percent of the conversation, so it is normal to see more neutral sentences from him. Interestingly, there is only one bin where Biden wins over Trump: around 0.8 subjectivity, on the right-hand side of the graph, he uses twice as many sentences as Trump.
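
One way to take the difference in speaking volume out of the comparison is to look at shares instead of raw counts. The sketch below does this for a list of subjectivity values by rounding them to one decimal and dividing each count by the speaker's total. The demo numbers are invented and `share_per_bin` is a hypothetical helper.

```python
from collections import Counter


def share_per_bin(values, ndigits=1):
    """Fraction of values that fall on each rounded subjectivity level."""
    counts = Counter(round(v, ndigits) for v in values)
    total = len(values)
    return {level: counts[level] / total for level in sorted(counts)}


trump_demo = [0.0, 0.0, 0.0, 0.5, 0.8, 0.0]   # made-up values
biden_demo = [0.0, 0.5, 0.8]                  # made-up values
print(share_per_bin(trump_demo))
print(share_per_bin(biden_demo))
```

In this toy data Trump has four times as many sentences at 0.0, but in relative terms that is two thirds of his sentences against one third of Biden's. In Plotly, passing `histnorm='probability'` to `go.Histogram` gives a similarly normalized view directly in the plot.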

In [None]:
cmap = sns.diverging_palette(5, 250, as_cmap=True)

first_sent.loc[(first_sent['polarity']>0.6) | (first_sent['polarity']<-0.6),['speaker', 'sent', 'polarity']].head(15).style.background_gradient(cmap, subset=['polarity'])

In [None]:
fig = make_subplots(rows=2, cols=2,
                    specs=[[{"rowspan": 2}, {}]
                            ,[None,         {}]],
                    subplot_titles=("Both Debates and all sentences",
                                    "First Debate individual candidates", 
                                    "Second Debate individual candidates"))

#-------------------------------------------------- Both Debates -----------------------------------------------------#
fig.add_trace(go.Histogram(
    x=first_sent['sentiment'],
    name='First Debate',  xbins=dict(start=-1, end=5, size=1),
    marker_color='purple', opacity=0.75),
    row = 1, col = 1)

fig.add_trace(go.Histogram(
    x=second_sent['sentiment'],
     name='Second Debate', xbins=dict(start=-1, end=5, size=1),
    marker_color='#ba8cd7', opacity=0.75),
    row = 1, col = 1)
#-------------------------------------------------- First Debate -----------------------------------------------------#


fig.add_trace(go.Histogram(
    x=first_sent[first_sent.speaker == 'Donald Trump']['sentiment'],
    name='Trump_1',  xbins=dict(start=-1, end=5, size=1),
    marker_color='red', opacity=0.75),
    row = 1, col = 2)


fig.add_trace(go.Histogram(
    x=first_sent[first_sent.speaker == 'Joe Biden']['sentiment'],
    name='Biden_1', xbins=dict(start=-1, end=5, size=1),
    marker_color='#3498DB', opacity=0.75),
    row = 1, col = 2)

fig.add_trace(go.Histogram( 
    x=first_sent[first_sent.speaker == 'mediator_1']['sentiment'],
    name='mediator_1', xbins=dict(start=-1, end=5, size=1),
    marker_color='#5D6D7E', opacity=0.75),
    row = 1, col = 2)
#-------------------------------------------------- Second Debate -----------------------------------------------------#

fig.add_trace(go.Histogram(
    x=second_sent[second_sent.speaker == 'Donald Trump']['sentiment'],
    name='Trump_2',  xbins=dict(start=-1, end=5, size=1),
    marker_color='red', opacity=0.75),
    row = 2, col = 2)

fig.add_trace(go.Histogram(
    x=second_sent[second_sent.speaker == 'Joe Biden']['sentiment'],
     name='Biden_2', xbins=dict(start=-1, end=5, size=1),
    marker_color='#3498DB', opacity=0.75),
    row = 2, col = 2)

fig.add_trace(go.Histogram( 
    x=second_sent[second_sent.speaker == 'mediator_2']['sentiment'],
    name='mediator_2', xbins=dict(start=-1, end=5, size=1),
    marker_color='#5D6D7E', opacity=0.75),
    row = 2, col = 2)




fig.update_layout(
    title_text='Both Debates: Histogram of Sentence Sentiment by Speaker',
    yaxis_title_text='Count', 
    bargap=0.1, bargroupgap=0.1)

fig.show()

This is a visualization of the polarity produced by TextBlob. I wasn't satisfied with this version of the sentiment analysis after seeing the flaws of the package and the naive approach it uses to score the text. So I took another step and looked for what could classify this text as well as possible with the tools that I have!
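
To make the "naive" point concrete, here is a deliberately simplified sketch of lexicon averaging, the general idea behind this kind of polarity score. It is not TextBlob's actual implementation. The lexicon, its scores and the function name are all invented for illustration.

```python
# Toy word-level polarity lexicon; the words and scores are invented.
TOY_LEXICON = {"great": 0.8, "good": 0.7, "bad": -0.7, "terrible": -1.0}

def average_polarity(sentence, lexicon=TOY_LEXICON):
    """Average the scores of known words; unknown words are ignored."""
    words = sentence.lower().split()
    scores = [lexicon[w] for w in words if w in lexicon]
    return sum(scores) / len(scores) if scores else 0.0

single = average_polarity("a great plan")                   # one known positive word
mixed = average_polarity("great for them terrible for us")  # opposite words nearly cancel
unknown = average_polarity("nobody believes that")          # no known words at all

print(single, mixed, unknown)
```

The mixed sentence ends up close to zero and the last one scores exactly zero, even though a human reader would not call either of them neutral. That is the kind of flaw that pushed me toward a model that reads the whole sentence.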

In [None]:
display(HTML('<dev width=100% height="40" behavior="still" >\
             <dev  style = "  font-size : 50px;" >\
             <b>🤗 Transformers </b></dev></dev>'))

Now we want to start using some legit tools that are designed for text classification and sentiment analysis! I'll introduce pipelines and also a trained model, again to find the best way of classifying this debate.
1. Pipelines:

**Advantages:** Pipelines are tools provided by the transformers library that let us do a wide range of tasks on textual data with only one or two lines of code! For example, by calling the pipeline function with the name of the task we want to perform and then passing in our data, we get the result we want without worrying about constructing a model or training it. Take a look at the code below to get the intuition.

You may wonder how this function works. Well, the default model used by the transformers pipeline for sentiment analysis is a pretrained "distilbert-base-uncased" model that has been fine-tuned on SST-2 (a movie reviews dataset). So you are actually using transfer learning: you predict on new data with weights that were trained by another person or organization.

You can see a live demo of sentiment classification [here](https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english). The available models are very varied, and you can try the live demo of each one on the [Hugging Face website](https://huggingface.co/models?filter=text-classification&search=roberta-large).

**Disadvantages:** All these models are pretty useful, but remember that they are not all equally accurate! For example, if you go with the default (the DistilBERT model, which is not the most accurate among transformers models), you will see that it identifies some sentences like

 **"The audience here in the hall has promised to remain silent"**

 as Negative with a score of 0.92 (that is, 92 percent certainty)!

 Also, because these models were trained on a particular dataset with fixed labels, you can't get anything different from what they offer. In this example, if we want 5 labels we won't be able to get them, because SST-2 has only 2 labels: Negative and Positive. There is not even a Neutral class.

 Although there are a few flaws when you classify your text with DistilBERT, it is still a great way (I think better than polarity) to do sentiment analysis, as it does not simply average word scores and takes into account the meaning of each word in context.
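
A sentiment pipeline returns a list with one dict per input text, each holding a `label` and a `score`, and it also accepts a whole list of sentences in one call. Here is a small sketch of unpacking that structure. To keep it runnable without downloading a model, the predictions are hand-written stand-ins with made-up values, and `unpack_predictions` is a hypothetical helper name.

```python
def unpack_predictions(predictions):
    """Split pipeline-style output into parallel lists of labels and scores."""
    labels = [p["label"] for p in predictions]
    scores = [p["score"] for p in predictions]
    return labels, scores

# Stand-in for the output of a sentiment pipeline called on two sentences.
# The values are made up; a real call returns the same shape.
fake_predictions = [
    {"label": "POSITIVE", "score": 0.99},
    {"label": "NEGATIVE", "score": 0.92},
]

labels, scores = unpack_predictions(fake_predictions)
print(labels)
print(scores)
```

Note that only two label values ever appear, which is exactly the limitation described above.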

In [None]:
!pip install transformers
from transformers import pipeline

In [None]:
sentimentAnalysis = pipeline("sentiment-analysis")
print(sentimentAnalysis("I have to say this is the coolest kernel ever"))

In [None]:
# first dataset
first_sent['pipeline_sentiment'] = first_sent.sent.apply(lambda x: sentimentAnalysis(x))
first_sent['pipeline_score'] = first_sent.pipeline_sentiment.apply(lambda x:x[0]['score'])
first_sent['pipeline_sentiment'] = first_sent.pipeline_sentiment.apply(lambda x:x[0]['label'])
# second dataset
second_sent['pipeline_sentiment'] = second_sent.sent.apply(lambda x: sentimentAnalysis(x))
second_sent['pipeline_score'] = second_sent.pipeline_sentiment.apply(lambda x:x[0]['score'])
second_sent['pipeline_sentiment'] = second_sent.pipeline_sentiment.apply(lambda x:x[0]['label'])

In [None]:
second_sent.head(3)

In [None]:
display(HTML('<dev direction="down" width=100% height="40" behavior="still" >\
<dev  style = "font-size : 50px;" behavior="alternate" >\
<b>Optional but Cool and Accurate results with RoBERTa ;) </b></dev></dev>'))


So, as I said, I really wanted to get the most accurate results possible and have 5 labeled outputs to distinguish different sentiments as well as I could. After searching a bit, I found [this dataset on Kaggle](https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews/data?select=sampleSubmission.csv). Although it contains classified text based on movie reviews, its training set includes individual words and phrases as separate entries, so we can learn the sentiment of each word, phrase and sentence using a partly customized model!
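
For reference, that Kaggle dataset labels each phrase from 0 to 4. The sketch below maps those integers to their names and counts a handful of predictions. The `predicted` list is made up for illustration, and I am assuming that a model trained on this dataset outputs the same 0 to 4 codes.

```python
from collections import Counter

# Label scheme of the Kaggle movie-review dataset (0 to 4).
LABEL_NAMES = {
    0: "negative",
    1: "somewhat negative",
    2: "neutral",
    3: "somewhat positive",
    4: "positive",
}

# Made-up model outputs for a handful of sentences, for illustration only.
predicted = [2, 2, 3, 0, 4, 2, 1]

counts = Counter(LABEL_NAMES[label] for label in predicted)
print(counts.most_common())
```

Compared with the two-label pipeline, this scheme has an explicit neutral class plus two strengths of positive and negative.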

The best model that we can get our hands on is a large RoBERTa model that outperforms large BERT, XLNet and DistilBERT. (I actually experimented with all these models to see if it really is the best, and it turns out that it can give us great results even with a small number of epochs.)

Because this notebook is dedicated to analysis and visualization, I won't write the RoBERTa model code here; instead I'll present the results. [Here is the Colab notebook](https://colab.research.google.com/drive/16F3RoWOGY1knLRdqMw2yHw68TdxpaImf?usp=sharing) with all you need to use any TensorFlow transformers model. You can see how I trained these models and use it to train your own models too!

In [None]:
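both = pd.read_csv('../input/analysis-of-trump-biden-debates/Trump_Biden_debates_sentiments.csv')
# Split by position; iloc excludes the end position, so the boundary row is not in both halves.
n_first = first.shape[0]
first_sent = both.iloc[:n_first, :]
second_sent = both.iloc[n_first:, :]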


In [None]:
fig = make_subplots(rows=2, cols=2,
                    specs=[[{"rowspan": 2}, {}]
                            ,[None,         {}]],
                    subplot_titles=("Both Debates and all sentences",
                                    "First Debate individual candidates",
                                    "Second Debate individual candidates"))

#-------------------------------------------------- Both Debates -----------------------------------------------------#
fig.add_trace(go.Histogram(
    x=first_sent['sentiment'],
    name='First Debate',  xbins=dict(start=-1, end=5, size=1),
    marker_color='purple', opacity=0.75),
    row = 1, col = 1)

fig.add_trace(go.Histogram(
    x=second_sent['sentiment'],
     name='Second Debate', xbins=dict(start=-1, end=5, size=1),
    marker_color='#ba8cd7', opacity=0.75),
    row = 1, col = 1)
#-------------------------------------------------- First Debate -----------------------------------------------------#


fig.add_trace(go.Histogram(
    x=first_sent[first_sent.speaker == 'Donald Trump']['sentiment'],
    name='Trump_1',  xbins=dict(start=-1, end=5, size=1),
    marker_color='red', opacity=0.75),
    row = 1, col = 2)


fig.add_trace(go.Histogram(
    x=first_sent[first_sent.speaker == 'Joe Biden']['sentiment'],
    name='Biden_1', xbins=dict(start=-1, end=5, size=1),
    marker_color='#3498DB', opacity=0.75),
    row = 1, col = 2)

fig.add_trace(go.Histogram( 
    x=first_sent[first_sent.speaker == 'mediator_1']['sentiment'],
    name='mediator_1', xbins=dict(start=-1, end=5, size=1),
    marker_color='#5D6D7E', opacity=0.75),
    row = 1, col = 2)
#-------------------------------------------------- Second Debate -----------------------------------------------------#

fig.add_trace(go.Histogram(
    x=second_sent[second_sent.speaker == 'Donald Trump']['sentiment'],
    name='Trump_2',  xbins=dict(start=-1, end=5, size=1),
    marker_color='red', opacity=0.75),
    row = 2, col = 2)

fig.add_trace(go.Histogram(
    x=second_sent[second_sent.speaker == 'Joe Biden']['sentiment'],
     name='Biden_2', xbins=dict(start=-1, end=5, size=1),
    marker_color='#3498DB', opacity=0.75),
    row = 2, col = 2)

fig.add_trace(go.Histogram( 
    x=second_sent[second_sent.speaker == 'mediator_2']['sentiment'],
    name='mediator_2', xbins=dict(start=-1, end=5, size=1),
    marker_color='#5D6D7E', opacity=0.75),
    row = 2, col = 2)




fig.update_layout(
    title_text='Both Debates: Histogram of Sentence Sentiment by Speaker',
    yaxis_title_text='Count', 
    bargap=0.1, bargroupgap=0.1)

fig.show()

Well, now this is something! We can see that Trump outdoes Biden when it comes to positive comments in the second debate, although in the first debate they were roughly equal. Overall, I think Biden tends to use more general and neutral words in his speech, whereas Trump sometimes tries to show his emotions with expressive, very positive or very negative comments!

Everything you read in this notebook is what I could understand from the data. I do not oppose or endorse any candidate or political group. I'd also love to hear from you and see what you think of the plots above! So leave your comments here or on YouTube or Medium!

<h1 style="border:2px solid blue; text-align: center">Thanks for reading! Don't forget to upvote!</h1>