<a href="https://colab.research.google.com/github/sljm12/machine_learning_notebooks/blob/master/2020USElection/2nd_Presidential_Debate.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip -q install newspaper3k


In [2]:
import spacy

In [3]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use("ggplot")

import pandas as pd
import numpy as np

import tensorflow as tf

from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.preprocessing.text import Tokenizer, text_to_word_sequence, tokenizer_from_json
from tensorflow.keras.models import Model
from tensorflow.keras.layers import LSTM, Embedding, Dense, TimeDistributed, Dropout, Bidirectional, Input

from tensorflow.keras.optimizers import Adam
from tensorflow.keras.preprocessing.sequence import pad_sequences
import json
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from pathlib import Path

# Processing the data

In [4]:
from bs4 import BeautifulSoup
import requests

In [5]:
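# Name the parser explicitly; otherwise BeautifulSoup warns and picks whichever parser is installed
soup = BeautifulSoup(requests.get("https://www.rev.com/blog/transcripts/donald-trump-joe-biden-final-presidential-debate-transcript-2020").text, "lxml")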

In [6]:
text=soup.find("div",{"class":"fl-callout-text"})

In [7]:
rows=text.find_all("p")


In [8]:
len(rows)

512

In [9]:
rows_text = [r.text for r in rows]
full_text = ''.join(rows_text)

In [10]:
rows[2].text

'Kristen Welker: (07:58)\nAnd I do want to say a very good evening to both of you. This debate will cover six major topics. At the beginning of each section, each candidate will have two minutes, uninterrupted, to answer my first question. The Debate Commission will then turn on their microphone only when it is their turn to answer. And the Commission will turn it off exactly when the two minutes have expired. After that, both microphones will remain on. But on behalf of the voters, I’m going to ask you to please speak one at a time.'

In [11]:
rows[511].text

'Joe Biden: (27:16)\nThank you.'

In [12]:
lines = []
for r in rows:
  # Each paragraph reads "Speaker: (mm:ss)\ntext"; split once into header and text
  lines.extend(r.text.split('\n', 1))

In [13]:
lines[0:10]

['Kristen Welker: (00:18)',
 'Good evening, everyone. Good evening. Thank you so much for being here. It is such an honor for me to moderate this debate tonight, the final debate. I want to welcome the first family and the first lady. We’re so glad and thankful that you are feeling better. I want to welcome the Biden family, Dr. Jill Biden. Thank you all for being here tonight. We are so excited. We’re looking forward to a really robust discussion. And the only thing I would reiterate are the CPD guidelines that when the candidates are talking, please hold any applause or any other reactions. Except of course, when they walk out, make sure you cheer and loud and applause so that everyone can hear you. Thank you for having me. This is really the honor of a lifetime. I am going to sit down and just get organized and get settled and the show will start very soon. Thank you for being here. (silence). Good evening from Belmont University in Nashville, Tennessee. I’m Kristen Welker of NBC Ne

In [14]:
len(lines)

1024

In [15]:
lines[1023]

'Thank you.'

In [16]:
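def get_name_time(s):
  # Split a header such as "Kristen Welker: (00:18)" into name and timestamp
  name, ti = s.split(':',1)
  ti = ti.replace('(',"")
  ti = ti.replace(')',"")
  return name,ti

rows = []
current_data={
    "name":'',
    'time': '',
    "text":''
}

for i in lines:
  i=i.strip()

  # A line ending in ')' is treated as a "Name: (mm:ss)" header
  if i.endswith(')'):
    # Store the previous turn; on the first header this appends the empty placeholder
    rows.append(current_data)
    current_data={
      "name":'',
      'time': '',
      "text":''
    }
    n, t = get_name_time(i)
    current_data["name"]=n
    current_data["time"]=t.strip()
  else:
    current_data["text"]=i

# Note: the final turn is never appended, because a turn is only stored when the next header arrives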

In [17]:
rows[0:5]

[{'name': '', 'text': '', 'time': ''},
 {'name': 'Kristen Welker',
  'text': 'Good evening, everyone. Good evening. Thank you so much for being here. It is such an honor for me to moderate this debate tonight, the final debate. I want to welcome the first family and the first lady. We’re so glad and thankful that you are feeling better. I want to welcome the Biden family, Dr. Jill Biden. Thank you all for being here tonight. We are so excited. We’re looking forward to a really robust discussion. And the only thing I would reiterate are the CPD guidelines that when the candidates are talking, please hold any applause or any other reactions. Except of course, when they walk out, make sure you cheer and loud and applause so that everyone can hear you. Thank you for having me. This is really the honor of a lifetime. I am going to sit down and just get organized and get settled and the show will start very soon. Thank you for being here. (silence). Good evening from Belmont University in Na
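A minimal sketch of an alternative parser that accumulates text per turn, skips the empty placeholder, and keeps the final turn. It assumes every header looks like `Name: (mm:ss)` or `Name: (hh:mm:ss)`; the name `parse_turns` and the regular expression are illustrative and not part of the original notebook.

```python
import re

# Assumed header shape: "Speaker Name: (mm:ss)" or "Speaker Name: (hh:mm:ss)".
HEADER_RE = re.compile(r"^(?P<name>[^:]+):\s*\((?P<time>\d{1,2}:\d{2}(?::\d{2})?)\)$")

def parse_turns(lines):
    """Group alternating header/text lines into one record per speaker turn."""
    turns = []
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        match = HEADER_RE.match(line)
        if match:
            if current is not None:
                turns.append(current)
            current = {"name": match["name"].strip(), "time": match["time"], "text": ""}
        elif current is not None:
            # Append instead of overwriting, so multi-line turns are kept whole.
            current["text"] = (current["text"] + " " + line).strip()
    if current is not None:
        turns.append(current)  # keep the final turn as well
    return turns

sample = [
    "Kristen Welker: (00:18)",
    "Good evening, everyone.",
    "Joe Biden: (27:16)",
    "Thank you.",
]
turns = parse_turns(sample)
```

Swapping this in would shift every index by one (there is no empty first record) and add the last turn, so the outputs recorded in this notebook would no longer match exactly.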

In [18]:
# Split each turn into sentences with spaCy, keeping the speaker name
nlp = spacy.load("en_core_web_sm")
sent_data = []
for i in rows:
  name = i["name"]
  sentences = nlp(i["text"]).sents
  for s in sentences:
    sent_data.append({
        "name":name,
        "text": s.text
    })

In [19]:
for s in nlp(rows[1]["text"]).sents:
  print(s.text)

Good evening, everyone.
Good evening.
Thank you so much for being here.
It is such an honor for me to moderate this debate tonight, the final debate.
I want to welcome the first family and the first lady.
We’re so glad and thankful that you are feeling better.
I want to welcome the Biden family, Dr. Jill Biden.
Thank you all for being here tonight.
We are so excited.
We’re looking forward to a really robust discussion.
And the only thing I would reiterate are the CPD guidelines that when the candidates are talking, please hold any applause or any other reactions.
Except of course, when they walk out, make sure you cheer and loud and applause so that everyone can hear you.
Thank you for having me.
This is really the honor of a lifetime.
I am going to sit down and just get organized and get settled and the show will start very soon.
Thank you for being here.
(silence).
Good evening from Belmont University in Nashville, Tennessee.
I’m Kristen Welker of NBC News.
And I welcome you to the f

In [20]:
len(sent_data)

1846

In [21]:
sent_data[0]

{'name': 'Kristen Welker', 'text': 'Good evening, everyone.'}
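The sentence splitter also emits transcript notes such as `(silence).` as sentences, and these end up being classified like speech. One possible way to drop them is sketched below; the helper `drop_annotations` and its pattern are assumptions for illustration, not something the notebook does.

```python
import re

# Assumed pattern: a sentence consisting only of a parenthesised note, e.g. "(silence)."
ANNOTATION_RE = re.compile(r"^\([^()]*\)\.?$")

def drop_annotations(records):
    """Return records whose text is not just a bracketed transcript note."""
    return [r for r in records if not ANNOTATION_RE.match(r["text"].strip())]

sample = [
    {"name": "Kristen Welker", "text": "Thank you for being here."},
    {"name": "Kristen Welker", "text": "(silence)."},
    {"name": "Kristen Welker", "text": "Good evening from Belmont University in Nashville, Tennessee."},
]
kept = drop_annotations(sample)
```

Applying such a filter would change the sentence count, so the numbers reported in this notebook refer to the unfiltered data.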

In [22]:
rows[419]

{'name': 'Joe Biden',
 'text': 'Because what it does, it will create millions of new good paying jobs, we’re going to invest in, for example, 500,000… Excuse me, 50,000 charging stations on our highways so that we can own the electric car market of the future. In the meantime, China is doing that. We’re going to be in a position where we’re going to see to it that we’re going to take 4 million existing buildings and 2 million existing homes and retrofit them so they don’t leak as much energy, saving hundreds of millions of barrels of…',
 'time': '15:32'}

# Prediction

In [29]:
max_len =100
max_features = 20000
batch_size=64
dims=50

Load the tokenizer used in the Trump Biden Kamala Classifier, so that words map to the same integer indices the model was trained on.

In [23]:
j = json.loads(Path("/content/drive/My Drive/Machine Learning/2020USElectionModel/token.json").read_text())
tokenizer = tokenizer_from_json(j)

Load the LSTM model that was trained for 30 epochs.

In [24]:
model = tf.keras.models.load_model("/content/drive/My Drive/Machine Learning/2020USElectionModel/LSTM30epochs")

Tokenize the sentences, pad them to max_len tokens, and predict.

In [26]:
sent_data_df=pd.DataFrame(data=sent_data)

In [27]:
X = tokenizer.texts_to_sequences(sent_data_df["text"])

In [30]:
x_train_pad=pad_sequences(maxlen=max_len, sequences=X, padding="post", value=0)

In [31]:
predict = model.predict(x_train_pad)

In [32]:
predict.shape

(1846, 5)
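To make the padding step concrete, here is a small pure-Python sketch of what `pad_sequences` does with `padding="post"` and its default `truncating="pre"`. The helper `pad_post` is hypothetical and only mimics the behaviour on plain lists.

```python
def pad_post(seq, maxlen, value=0):
    """Mimic pad_sequences(padding="post") with its default truncating="pre"."""
    seq = list(seq)[-maxlen:]  # default truncation keeps the last maxlen tokens
    return seq + [value] * (maxlen - len(seq))  # fill the tail with the pad value

short = pad_post([4, 8, 15], maxlen=5)
long_ = pad_post([1, 2, 3, 4, 5, 6, 7], maxlen=5)

print(short)  # [4, 8, 15, 0, 0]
print(long_)  # [3, 4, 5, 6, 7]
```

With `max_len = 100`, a sentence longer than 100 tokens therefore loses its beginning rather than its end.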

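Let's convert the predictions to names. Each prediction row holds five class scores; take the index of the largest score and look up the matching name.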

In [34]:
answer = ['Bernie Sanders', 'Joe Biden', 'Kamala Harris', 'Donald Trump',
       'Mike Pence']
results = []
for i in predict:
  pos = np.argmax(i)
  results.append(answer[pos])
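The same lookup can be done without a Python loop. The sketch below uses made-up scores for three sentences (not real model output) and assumes the order of `answer` matches the label order used during training, which this notebook does not show.

```python
import numpy as np

answer = ['Bernie Sanders', 'Joe Biden', 'Kamala Harris', 'Donald Trump', 'Mike Pence']

# Hypothetical softmax outputs for three sentences (not real model output).
demo_predict = np.array([
    [0.05, 0.10, 0.05, 0.75, 0.05],
    [0.10, 0.60, 0.10, 0.15, 0.05],
    [0.30, 0.25, 0.20, 0.15, 0.10],
])

class_ids = demo_predict.argmax(axis=1)      # index of the largest score per row
demo_results = np.array(answer)[class_ids]   # map indices to speaker names
demo_confidence = demo_predict.max(axis=1)   # score of the winning class
```

Keeping the winning score alongside the name makes it possible to separate confident predictions from near ties such as the third row.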

Add the predicted names to the dataframe.

In [35]:
sent_data_df["results"]=results

In [45]:
# Results for Donald Trump: the model identifies him well, 701 of 853 sentences (about 82%)
sent_data_df[sent_data_df["name"]=="Donald Trump"].groupby("results").count()

                name  text
results
Bernie Sanders    16    16
Donald Trump     701   701
Joe Biden        119   119
Kamala Harris      9     9
Mike Pence         8     8


In [47]:
# Joe Biden is still the most frequent prediction (304 of 590, about 52%), but the signal is weaker than for Donald Trump
sent_data_df[sent_data_df["name"]=="Joe Biden"].groupby("results").count()

                name  text
results
Bernie Sanders    14    14
Donald Trump     256   256
Joe Biden        304   304
Kamala Harris      8     8
Mike Pence         8     8
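The per-speaker hit rate can be worked out directly from the two tables of counts. This is a small sketch using only those numbers; the helper name `per_speaker_accuracy` is illustrative.

```python
# Counts copied from the two groupby tables above.
counts = {
    "Donald Trump": {"Bernie Sanders": 16, "Donald Trump": 701, "Joe Biden": 119,
                     "Kamala Harris": 9, "Mike Pence": 8},
    "Joe Biden": {"Bernie Sanders": 14, "Donald Trump": 256, "Joe Biden": 304,
                  "Kamala Harris": 8, "Mike Pence": 8},
}

def per_speaker_accuracy(table):
    """Fraction of each speaker's sentences attributed to that same speaker."""
    return {name: preds[name] / sum(preds.values()) for name, preds in table.items()}

accuracy = per_speaker_accuracy(counts)
for name, acc in accuracy.items():
    print(f"{name}: {acc:.1%}")
# Donald Trump: 82.2%
# Joe Biden: 51.5%
```

Sentences from any other speaker, such as the moderator Kristen Welker, are necessarily assigned to one of the five names, because the model has no class for them.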
