# Data Processing

In this notebook we combine the podcast and interview transcripts from both candidates into a single list of dictionaries, one per transcript, so the data is organized for later analysis. Each dictionary has the following fields:

- **title**: A string naming the candidate (Harris or Trump) and the interview or podcast the transcript comes from, for example `Harris_60_Minutes`.
- **medium**: A string classifying the transcript as one of `harris_interviews`, `harris_podcasts`, `trump_interviews`, or `trump_podcasts`.
- **transcript**: A list of `(speaker, text)` tuples, one per speaker turn. Keeping the turns in an ordered list preserves the sequence of the conversation.
- **tokens**: A list of every word the candidate says in the transcript, lowercased and with whitespace and punctuation removed.
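
As a rough illustration of the fields above, one record might look like the following. The transcript text and token list are abbreviated excerpts chosen for this sketch, and `example_record` is a name introduced here, not one used by the notebook.

```python
# Illustrative record with the four fields described above.
# The transcript and tokens are abbreviated excerpts, not full data.
example_record = {
    "title": "Harris_60_Minutes",
    "medium": "harris_interviews",
    "transcript": [
        ("Host", "What can the U.S. do at this point?"),
        ("Harris", "Well, let's start with October 7th."),
    ],
    "tokens": ["well", "lets", "start", "with", "october", "7th"],
}

# Each transcript entry unpacks into a speaker and that speaker's text.
for speaker, text in example_record["transcript"]:
    print(f"{speaker}: {text}")
```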

### Setup

In [1]:
import string
import json
from collections import Counter
import glob

In [2]:
%run functions.ipynb

In [3]:
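def load_transcript(transcript):
    """Parse 'Speaker: text' lines into a list of (speaker, text) tuples.

    Consecutive lines from the same speaker are merged into one entry.
    Lines without a colon are skipped.
    """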
    lines = transcript.strip().split("\n")
    conversation = []
    current_speaker = None
    accumulated_text = ""
    
    for line in lines:
        # Only process lines with a colon
        if ":" in line:
            speaker, text = line.split(":", 1)
            speaker = speaker.strip()
            text = text.strip()
            
            if speaker == current_speaker:
                accumulated_text += " " + text
            else:
                if current_speaker is not None:
                    conversation.append((current_speaker, accumulated_text))
                current_speaker = speaker
                accumulated_text = text

    # Append the final accumulated text for the last speaker
    if current_speaker is not None:
        conversation.append((current_speaker, accumulated_text))

    return conversation
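
To make the expected input format concrete, here is a small self-contained demo. `merge_turns` is a compact stand-in written for this sketch (it uses `itertools.groupby` instead of the explicit accumulator) and should behave the same way as `load_transcript` on input like this; the sample text is invented.

```python
from itertools import groupby

def merge_turns(raw):
    """Compact stand-in for load_transcript, used only for this demo."""
    pairs = []
    for line in raw.strip().split("\n"):
        if ":" in line:  # lines without a colon are skipped
            speaker, text = line.split(":", 1)
            pairs.append((speaker.strip(), text.strip()))
    # groupby merges runs of consecutive lines from the same speaker
    return [
        (speaker, " ".join(text for _, text in group))
        for speaker, group in groupby(pairs, key=lambda pair: pair[0])
    ]

# Invented sample: two Host lines in a row, and one line with no speaker label
sample = """Host: Welcome back.
Host: Let's begin.
Harris: Thank you.
(applause)
Harris: It's good to be here."""

turns = merge_turns(sample)
print(turns)
# [('Host', "Welcome back. Let's begin."), ('Harris', "Thank you. It's good to be here.")]
```

Two behaviors are worth keeping in mind. The `(applause)` line is dropped because it has no colon, and any unlabeled line that happens to contain a colon (a timestamp, for instance) would be read as a new speaker, since everything before the first colon is treated as the speaker name.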

In [4]:
TRANSCRIPTION_DIR = '../data/final_transcriptions'

text_files = glob.glob(f'{TRANSCRIPTION_DIR}/*/*.txt')
len(text_files)

24

In [5]:
text_files

['../data/final_transcriptions/harris_interviews/Harris_60_Minutes.txt',
 '../data/final_transcriptions/harris_interviews/Harris_CNN.txt',
 '../data/final_transcriptions/harris_interviews/Harris_Fox.txt',
 '../data/final_transcriptions/harris_interviews/Harris_NBC.txt',
 '../data/final_transcriptions/harris_podcasts/Harris_All_The_Smoke.txt',
 '../data/final_transcriptions/harris_podcasts/Harris_Call_Her_Daddy.txt',
 '../data/final_transcriptions/harris_podcasts/Harris_Club_Shay_Shay.txt',
 '../data/final_transcriptions/harris_podcasts/Harris_Howard_Stern.txt',
 '../data/final_transcriptions/harris_podcasts/Harris_The_Breakfast_Club.txt',
 '../data/final_transcriptions/trump_interviews/Trump_Bloomberg.txt',
 '../data/final_transcriptions/trump_interviews/Trump_Fox_News_Faulkner.txt',
 '../data/final_transcriptions/trump_interviews/Trump_Fox_News_Ingram.txt',
 '../data/final_transcriptions/trump_interviews/Trump_NABJ.txt',
 '../data/final_transcriptions/trump_podcasts/Trump_Adin_Ross.tx

In [6]:
text_files[0].split('/')[4].replace('.txt','')

'Harris_60_Minutes'
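
Indexing into `split('/')` depends on the exact depth of `TRANSCRIPTION_DIR`. A possible alternative, sketched here with a hypothetical helper name `medium_and_title`, reads the same two pieces from the end of the path using `pathlib`, so it keeps working if the base directory moves.

```python
from pathlib import PurePosixPath

def medium_and_title(file_path):
    """Return (medium, title) from a path shaped like .../<medium>/<title>.txt."""
    path = PurePosixPath(file_path)
    # parent.name is the containing folder; stem is the file name without .txt
    return path.parent.name, path.stem

example = "../data/final_transcriptions/harris_interviews/Harris_60_Minutes.txt"
print(medium_and_title(example))
# ('harris_interviews', 'Harris_60_Minutes')
```

`PurePosixPath` is used because the paths listed above use forward slashes; `pathlib.Path` would be the natural choice when the separator should follow the operating system.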

In [7]:
master_list = []

for file_path in text_files:
    # Read the raw transcript and close the file afterwards
    with open(file_path) as f:
        transcript = f.read()

    processed_transcript = load_transcript(transcript)
    # Path layout: ../data/final_transcriptions/<medium>/<title>.txt
    medium = file_path.split('/')[3]
    title = file_path.split('/')[4].replace('.txt', '')
    tokens = []

    # Tokenize only the candidate's own lines; hosts and other speakers are skipped
    for speaker, text in processed_transcript:
        if speaker in ('Trump', 'Harris'):
            toks = tokenize(text, lowercase=True, strip_chars=string.punctuation)
            tokens.extend(toks)

    master_list.append({'title': title, 'medium': medium, 'transcript': processed_transcript, 'tokens': tokens})
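
The `tokenize` helper comes from `functions.ipynb`, which is not shown here. The sketch below is a hypothetical stand-in, `tokenize_sketch`, written under the assumption that the helper deletes every character in `strip_chars`, lowercases the text, and splits on whitespace. That assumption is consistent with tokens such as `'lets'` and `'1200'` in the output further down, but the real implementation may differ.

```python
import string

def tokenize_sketch(text, lowercase=True, strip_chars=string.punctuation):
    """Hypothetical stand-in for tokenize(); the real helper may differ."""
    # Delete every character in strip_chars, including ones inside words
    cleaned = text.translate(str.maketrans("", "", strip_chars))
    if lowercase:
        cleaned = cleaned.lower()
    # split() with no arguments splits on any run of whitespace
    return cleaned.split()

print(tokenize_sketch("Well, let's start with October 7th. 1,200 people were massacred"))
# ['well', 'lets', 'start', 'with', 'october', '7th', '1200', 'people', 'were', 'massacred']
```

One consequence of removing apostrophes this way is that contractions collapse into other words: "we're" becomes `'were'`, which can no longer be told apart from the past-tense verb in later counts.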

In [8]:
len(master_list)

24

In [9]:
master_list[0]['title']

'Harris_60_Minutes'

In [10]:
master_list[0]['medium']

'harris_interviews'

In [11]:
master_list[0]['transcript']

[('Host',
  "Kamala Harris has been a candidate for president for just two and a half months, and the post-convention honeymoon is over. With the election just 29 days away, Harris and her running mate, Minnesota Governor Tim Walz, face unrelenting attacks from Donald Trump, and the race remains extremely close. We met the 59-year-old vice president this past week on the campaign trail, and later  at the vice president's residence in Washington, D.C. We spoke about the economy and immigration, Ukraine and China, but we began with the escalating war in the Middle East one year after the Hamas terror attack on Israel. The events of the past few weeks have pushed us to the brink, if not into, an all-out regional war in the Middle East. What can the U.S. do at this point to stop this from spinning out of control?"),
 ('Harris',
  "Well, let's start with October 7th. 1,200 people were massacred, 250 hostages were taken, including Americans. Women were brutally raped. And as I said then, I m

In [12]:
master_list[0]['transcript'][0]

('Host',
 "Kamala Harris has been a candidate for president for just two and a half months, and the post-convention honeymoon is over. With the election just 29 days away, Harris and her running mate, Minnesota Governor Tim Walz, face unrelenting attacks from Donald Trump, and the race remains extremely close. We met the 59-year-old vice president this past week on the campaign trail, and later  at the vice president's residence in Washington, D.C. We spoke about the economy and immigration, Ukraine and China, but we began with the escalating war in the Middle East one year after the Hamas terror attack on Israel. The events of the past few weeks have pushed us to the brink, if not into, an all-out regional war in the Middle East. What can the U.S. do at this point to stop this from spinning out of control?")

In [13]:
master_list[0]['transcript'][0][0]

'Host'

In [14]:
master_list[0]['transcript'][0][1]

"Kamala Harris has been a candidate for president for just two and a half months, and the post-convention honeymoon is over. With the election just 29 days away, Harris and her running mate, Minnesota Governor Tim Walz, face unrelenting attacks from Donald Trump, and the race remains extremely close. We met the 59-year-old vice president this past week on the campaign trail, and later  at the vice president's residence in Washington, D.C. We spoke about the economy and immigration, Ukraine and China, but we began with the escalating war in the Middle East one year after the Hamas terror attack on Israel. The events of the past few weeks have pushed us to the brink, if not into, an all-out regional war in the Middle East. What can the U.S. do at this point to stop this from spinning out of control?"

In [15]:
master_list[0]['tokens']

['well',
 'lets',
 'start',
 'with',
 'october',
 '7th',
 '1200',
 'people',
 'were',
 'massacred',
 '250',
 'hostages',
 'were',
 'taken',
 'including',
 'americans',
 'women',
 'were',
 'brutally',
 'raped',
 'and',
 'as',
 'i',
 'said',
 'then',
 'i',
 'maintain',
 'israel',
 'has',
 'a',
 'right',
 'to',
 'defend',
 'itself',
 'we',
 'would',
 'and',
 'how',
 'it',
 'does',
 'so',
 'matters',
 'far',
 'too',
 'many',
 'innocent',
 'palestinians',
 'have',
 'been',
 'killed',
 'this',
 'war',
 'has',
 'to',
 'end',
 'the',
 'work',
 'that',
 'we',
 'do',
 'diplomatically',
 'with',
 'the',
 'leadership',
 'of',
 'israel',
 'is',
 'an',
 'ongoing',
 'pursuit',
 'around',
 'making',
 'clear',
 'our',
 'principles',
 'were',
 'not',
 'going',
 'to',
 'stop',
 'pursuing',
 'what',
 'is',
 'necessary',
 'for',
 'the',
 'united',
 'states',
 'to',
 'be',
 'clear',
 'about',
 'where',
 'we',
 'stand',
 'on',
 'the',
 'need',
 'for',
 'this',
 'war',
 'to',
 'end',
 'i',
 'think',
 'with',


In [16]:
Counter(master_list[0]['tokens']).most_common(10)

[('the', 44),
 ('and', 39),
 ('to', 35),
 ('of', 32),
 ('that', 30),
 ('i', 26),
 ('a', 23),
 ('have', 22),
 ('you', 22),
 ('we', 21)]

In [17]:
Counter(master_list[1]['tokens']).most_common(10)

[('the', 120),
 ('to', 102),
 ('and', 90),
 ('of', 79),
 ('i', 74),
 ('that', 74),
 ('we', 61),
 ('a', 53),
 ('in', 53),
 ('have', 45)]

In [18]:
with open('../data/master_list.json', 'w') as json_file:
    json.dump(master_list, json_file, indent=4)
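
JSON has no tuple type, so `json.dump` writes each `(speaker, text)` tuple as a two-element list, and a later `json.load` returns lists. Code that only indexes or unpacks the turns is unaffected. If tuples are needed again, a small helper like the hypothetical `restore_tuples` below can convert them back; the round trip uses a tiny invented record held in memory, not the real file.

```python
import json

def restore_tuples(records):
    """Convert each transcript entry from a JSON list back to a (speaker, text) tuple."""
    for record in records:
        record["transcript"] = [tuple(turn) for turn in record["transcript"]]
    return records

# Tiny invented record, round-tripped through JSON in memory
original = [{
    "title": "demo",
    "medium": "demo_medium",
    "transcript": [("Host", "Hello."), ("Guest", "Hi.")],
    "tokens": ["hi"],
}]
reloaded = json.loads(json.dumps(original))

# Before restoring, each turn is a list, not a tuple
was_list = isinstance(reloaded[0]["transcript"][0], list)

restored = restore_tuples(reloaded)
print(was_list, restored == original)
```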