# Clean Up Press Briefing Transcripts

Import libraries

In [9]:
import json
import pandas as pd

Read the stored JSON documents from the previous scrape, one file per year

In [2]:
jsonDocumentList = []
for year in range(1993,2018,1):
    with open('/Users/cooldude/DataScience/Metis/classProjects/Project_4/whiteHousePressBriefings_Data/whiteHousePressBriefings.'+str(year)+'.json') as json_data:
        jsonDocumentList.extend(json.load(json_data))
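The loading step can also be written as a small helper that takes the data directory as a parameter rather than a hard-coded absolute path. This is only a sketch: `load_briefings` and `data_dir` are hypothetical names, and skipping missing years is an assumption about the desired behavior, not something the original cell does.

```python
import json
from pathlib import Path


def load_briefings(data_dir, years):
    """Load and concatenate the per-year JSON files found in data_dir."""
    documents = []
    for year in years:
        # Same file naming scheme as the cell above
        path = Path(data_dir) / f"whiteHousePressBriefings.{year}.json"
        if not path.exists():
            # Assumption: a year without a file is skipped instead of raising
            continue
        with path.open() as json_data:
            documents.extend(json.load(json_data))
    return documents
```

Calling `load_briefings(some_dir, range(1993, 2018))` would cover the same years, 1993 through 2017, as the loop above.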

Inspect a document

In [3]:
jsonDocumentList[0]

{'briefing': [{'paragraph': 'The Briefing Room '},
  {'paragraph': ' 5:02 P.M. EST '},
  {'paragraph': " MS. MYERS: -- (in progress) -- we're not currently planning to do an op. That's all I have to report. "},
  {'paragraph': ' Q: How long -- '},
  {'paragraph': ' MS. MYERS: However long it takes. I would assume it would go at least an hour. '},
  {'paragraph': ' Q: Are you going to -- come in this way, or are you going to sneak him in the back? '},
  {'paragraph': " MS. MYERS: We're going to sneak him through the underground passage. No, I don't know which way they're coming in. I assume they'll come into the front, as have all the other congressional members who have met with the President today. "},
  {'paragraph': ' Q: More importantly, which way will they go out? '},
  {'paragraph': ' Q: -- Aspen and Christopher. '},
  {'paragraph': ' MS. MYERS: Oh, did they -- '},
  {'paragraph': ' Q: -- Southwest Gate. '},
  {'paragraph': " MS. MYERS: Most of them -- most of the members have co

The press briefing transcript is stored in the *briefing* attribute

In [4]:
jsonDocumentList[0]['briefing']

[{'paragraph': 'The Briefing Room '},
 {'paragraph': ' 5:02 P.M. EST '},
 {'paragraph': " MS. MYERS: -- (in progress) -- we're not currently planning to do an op. That's all I have to report. "},
 {'paragraph': ' Q: How long -- '},
 {'paragraph': ' MS. MYERS: However long it takes. I would assume it would go at least an hour. '},
 {'paragraph': ' Q: Are you going to -- come in this way, or are you going to sneak him in the back? '},
 {'paragraph': " MS. MYERS: We're going to sneak him through the underground passage. No, I don't know which way they're coming in. I assume they'll come into the front, as have all the other congressional members who have met with the President today. "},
 {'paragraph': ' Q: More importantly, which way will they go out? '},
 {'paragraph': ' Q: -- Aspen and Christopher. '},
 {'paragraph': ' MS. MYERS: Oh, did they -- '},
 {'paragraph': ' Q: -- Southwest Gate. '},
 {'paragraph': " MS. MYERS: Most of them -- most of the members have come in through the front,

Press briefing text is in the form of questions and answers. At times briefings start with a monologue describing current issues and things that the president is working on and then move on to questions and answers. Our focus here is questions and answers. 

The following function attempts to parse the speaker for each paragraph. In most cases the reporter's question is prefixed with *"Q:"* and the answer is prefixed with the press secretary's name. For paragraphs with no speaker prefix, the previous speaker is assigned using forward fill. The idea is that such paragraphs are probably part of a lengthy answer given by the press secretary. Paragraphs that come before any speaker has been identified are labeled *SPEECH*.

In [5]:
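def tranformToQandADf(document):
    # One row per paragraph of the briefing
    df = pd.DataFrame(document['briefing'])

    df['paragraph'] = df.paragraph.str.strip()

    # Find the speaker's name at the start of the paragraph.
    # With several capture groups, extract returns a DataFrame;
    # column 0 holds the outermost group, i.e. the whole prefix.
    tmp = df.paragraph.str.extract(r"""^((((MR|MS|MRS)\.\s+[A-Z]+)|Q:|Q ))""", expand=False)
    df['speaker'] = tmp[0]

    # Paragraphs without a prefix inherit the previous speaker;
    # anything before the first speaker is labeled SPEECH
    df['speaker'] = df.speaker.ffill()
    df['speaker'] = df.speaker.fillna('SPEECH')

    # Remove the speaker prefix from the start of the paragraph only,
    # so the same text appearing later in the paragraph is left alone
    df['paragraph'] = df.apply(
        lambda x: x.paragraph[len(x.speaker):] if x.paragraph.startswith(x.speaker) else x.paragraph,
        axis=1)
    df['speaker'] = df.speaker.str.replace(':', '', regex=False)

    df['date'] = document['date']
    df['host'] = document['heldBy']

    # Strip the colon left over after the speaker name
    # (this also strips a trailing colon, if any)
    df['paragraph'] = df.paragraph.str.strip(':')

    return df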
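To see how the speaker extraction and forward fill behave, here is a minimal sketch on a toy frame. The four paragraphs are made up for illustration (loosely modeled on the sample document above), and only the speaker logic is reproduced, not the whole function.

```python
import pandas as pd

# Toy paragraphs: a header line, an answer, its continuation, and a question
toy = pd.DataFrame({"paragraph": [
    "The Briefing Room",
    "MS. MYERS: However long it takes.",
    "I would assume it would go at least an hour.",
    "Q: How long --",
]})

# Same pattern as in the function above; column 0 is the outermost group
pattern = r"^((((MR|MS|MRS)\.\s+[A-Z]+)|Q:|Q ))"
speaker = toy.paragraph.str.extract(pattern, expand=False)[0]

# Forward fill gives the continuation paragraph the previous speaker;
# the header line has no earlier speaker, so it falls back to SPEECH
toy["speaker"] = speaker.ffill().fillna("SPEECH").str.replace(":", "", regex=False)
```

In this toy case the third paragraph is attributed to `MS. MYERS` even though it carries no prefix, which is the intended effect of the forward fill.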

Run through all documents and transform each press briefing, yielding a new *speaker* column and removing the speaker prefix from the *paragraph* column

In [6]:
dfs = pd.concat([tranformToQandADf(document) for document in jsonDocumentList])
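Because each per-briefing frame starts its index at 0, `pd.concat` keeps those repeated labels, which is why index values such as 0 recur across briefings in the outputs further down. The sketch below uses two made-up frames to show the difference `ignore_index=True` makes; whether a unique index is wanted here is a design choice the notebook does not state.

```python
import pandas as pd

# Two toy "briefings", each with its own 0-based index
first = pd.DataFrame({"paragraph": ["x", "y"]})
second = pd.DataFrame({"paragraph": ["z"]})

# Default behavior: the original index labels are kept, so 0 appears twice
stacked = pd.concat([first, second])

# With ignore_index=True the result gets a fresh 0..n-1 index
renumbered = pd.concat([first, second], ignore_index=True)
```

Keeping the per-briefing index preserves each paragraph's position within its briefing, while a fresh index makes label-based selection with `.loc` unambiguous.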

In [11]:
dfs.speaker.value_counts().head(10)

Q                330444
MR. EARNEST       60886
MR. CARNEY        51242
MR. MCCURRY       48395
MR. GIBBS         35471
MR. M             34158
MR. FLEISCHER     30720
MR. LOCKHART      19938
MS. MYERS         17534
SPEECH            15685
Name: speaker, dtype: int64
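The `MR. M` entry looks like a truncated name. One plausible cause, not verified against the transcripts here, is a surname written in mixed case, where `[A-Z]+` stops at the first lowercase letter. The sketch below shows a hypothetical variant of the pattern that tolerates such names. `SPEAKER_RE`, `find_speaker`, and the example name `McCLELLAN` are illustrative assumptions, as is the requirement that a colon follows the name.

```python
import re

# Hypothetical variant: after the first capital, allow lowercase letters,
# apostrophes, and hyphens, and require a colon right after the name
SPEAKER_RE = re.compile(r"^(?:(?:MR|MS|MRS)\.\s+[A-Z][A-Za-z'\-]*(?=:)|Q:|Q )")


def find_speaker(paragraph):
    """Return the speaker prefix of a paragraph, or None if there is none."""
    match = SPEAKER_RE.match(paragraph)
    return match.group(0) if match else None


mixed_case = find_speaker("MR. McCLELLAN: Good afternoon.")
question = find_speaker("Q: How long --")
no_speaker = find_speaker("The Briefing Room")
```

Applying a pattern like this to the full data set would change the speaker counts shown above, so it would need to be checked against the actual transcripts first.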

In [41]:
dfs[dfs.speaker == 'Q'].shape

(330444, 4)

In [42]:
dfs[(dfs.speaker != 'Q') & (dfs.speaker != 'SPEECH')].shape

(422351, 4)

In [43]:
dfs[(dfs.speaker == 'SPEECH')].shape

(15685, 4)
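The three counts above can also be obtained in one pass by mapping each speaker label to a coarse role. This is a sketch on a made-up frame; the role names `question`, `answer`, and `speech` are hypothetical labels, not part of the original data.

```python
import pandas as pd

# Toy speaker column with the three kinds of labels seen above
toy = pd.DataFrame({"speaker": ["Q", "MS. MYERS", "SPEECH", "Q", "MR. GIBBS"]})

# Labels missing from the dict map to NaN, which is then filled with
# "answer": everything that is neither a question nor unattributed text
role = toy.speaker.map({"Q": "question", "SPEECH": "speech"}).fillna("answer")
role_counts = role.value_counts()
```

On the real `dfs` frame the same two lines would be applied to `dfs.speaker`.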

Save the new dataframe as a pickle

In [39]:
dfs.to_pickle('whiteHousePressBriefings_Data/whiteHouseBriefings.DataFrame.pkl')

In [35]:
dfs[(dfs.speaker=='SPEECH') & (dfs.paragraph.str.len() > 20)]

| | paragraph | speaker | date | host |
|---|---|---|---|---|
| 0 | Room 479 Old Executive Office Building | SPEECH | February 23, 1993 | William J. Clinton |
| 2 | DR. GIBBONS: Good afternoon. I'm delighted to ... | SPEECH | February 23, 1993 | William J. Clinton |
| 3 | I presume all of you have copies of the docume... | SPEECH | February 23, 1993 | William J. Clinton |
| 4 | We are pleased to talk about this plan today, ... | SPEECH | February 23, 1993 | William J. Clinton |
| 5 | You understand that this is part of the broade... | SPEECH | February 23, 1993 | William J. Clinton |
| 6 | My associates from the OSTP that were involved... | SPEECH | February 23, 1993 | William J. Clinton |
| 7 | One of the innovations in this plan is, I thin... | SPEECH | February 23, 1993 | William J. Clinton |
| 8 | It is an evolving process. You'll probably hav... | SPEECH | February 23, 1993 | William J. Clinton |
| 9 | It does represent, I believe, a fairly major d... | SPEECH | February 23, 1993 | William J. Clinton |
| 10 | It represents a step in setting new priorities... | SPEECH | February 23, 1993 | William J. Clinton |


In [38]:
dfs.head(100)

| | paragraph | speaker | date | host |
|---|---|---|---|---|
| 0 | The Briefing Room | SPEECH | January 27, 1993 | William J. Clinton |
| 1 | 5:02 P.M. EST | SPEECH | January 27, 1993 | William J. Clinton |
| 2 | -- (in progress) -- we're not currently plann... | MS. MYERS | January 27, 1993 | William J. Clinton |
| 3 | How long -- | Q | January 27, 1993 | William J. Clinton |
| 4 | However long it takes. I would assume it woul... | MS. MYERS | January 27, 1993 | William J. Clinton |
| 5 | Are you going to -- come in this way, or are ... | Q | January 27, 1993 | William J. Clinton |
| 6 | We're going to sneak him through the undergro... | MS. MYERS | January 27, 1993 | William J. Clinton |
| 7 | More importantly, which way will they go out? | Q | January 27, 1993 | William J. Clinton |
| 8 | -- Aspen and Christopher. | Q | January 27, 1993 | William J. Clinton |
| 9 | Oh, did they -- | MS. MYERS | January 27, 1993 | William J. Clinton |
