The following code lists the top-level files and directories of the Stanford Congress Data on M2. For a description of the files, refer to the codebook in the README.

In [5]:
import glob

path_to_congress = "/scratch/group/oit_research_data/stanford_congress"

glob.glob('{}/*'.format(path_to_congress))

['/scratch/group/oit_research_data/stanford_congress/__MACOSX',
 '/scratch/group/oit_research_data/stanford_congress/speakermap_stats',
 '/scratch/group/oit_research_data/stanford_congress/keywords.txt',
 '/scratch/group/oit_research_data/stanford_congress/topic_phrases.txt',
 '/scratch/group/oit_research_data/stanford_congress/congress_download.sh',
 '/scratch/group/oit_research_data/stanford_congress/vocabulary',
 '/scratch/group/oit_research_data/stanford_congress/hein-daily',
 '/scratch/group/oit_research_data/stanford_congress/false_matches.txt',
 '/scratch/group/oit_research_data/stanford_congress/party_full',
 '/scratch/group/oit_research_data/stanford_congress/partisan_phrases',
 '/scratch/group/oit_research_data/stanford_congress/hein-bound',
 '/scratch/group/oit_research_data/stanford_congress/audit']
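
The listing above does not say which entries are directories and which are plain files. A minimal sketch of one way to label them follows. The helper name `list_entries` is hypothetical, and the demo runs on a throwaway temporary directory with two made-up entries rather than on the real M2 path.

```python
import tempfile
from pathlib import Path


def list_entries(root):
    """Return sorted (name, kind) pairs for the immediate children of root."""
    entries = []
    for child in sorted(Path(root).iterdir()):
        kind = "dir" if child.is_dir() else "file"
        entries.append((child.name, kind))
    return entries


# Demo on a temporary directory (stand-in for path_to_congress).
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "hein-bound").mkdir()
    (Path(tmp) / "keywords.txt").write_text("placeholder\n")
    demo_entries = list_entries(tmp)

print(demo_entries)
```

On M2, the same helper could be called as `list_entries(path_to_congress)`.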

One file can be read like so:

In [49]:
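import csv
import pandas as pd

path = '/scratch/group/oit_research_data/stanford_congress/hein-bound/descr_096.txt'
# Fields are pipe-delimited; QUOTE_NONE treats quote characters as ordinary text.
one_file = pd.read_csv(path, sep='|', encoding="ISO-8859-1", error_bad_lines=False, quoting=csv.QUOTE_NONE)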

The data is divided into many .txt files.

Speeches are stored in separate .txt files (labeled speeches_) from metadata such as dates (labeled descr_). The following code iterates over every speeches_ file and every descr_ file and concatenates each file type into its own DataFrame.

pandas also prints a warning for each malformed line it skips.

In [3]:
import glob
import os
import csv
import pandas as pd

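directory = '/scratch/group/oit_research_data/stanford_congress/hein-bound/'
file_type = 'txt'
separator = '|'

# Note: error_bad_lines was deprecated in pandas 1.3 and removed in 2.0;
# on newer versions, on_bad_lines='warn' is the equivalent setting.

# Read every speeches_ file and stack the results into one DataFrame.
speeches_df = pd.concat([pd.read_csv(f, sep=separator, encoding="ISO-8859-1", error_bad_lines=False, quoting=csv.QUOTE_NONE) for f in glob.glob(directory + "speeches_*." + file_type)])

# Do the same for the descr_ (metadata) files.
descr_df = pd.concat([pd.read_csv(f, sep=separator, encoding="ISO-8859-1", error_bad_lines=False, quoting=csv.QUOTE_NONE) for f in glob.glob(directory + "descr_*." + file_type)])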

b'Skipping line 207724: expected 2 fields, saw 3\nSkipping line 208494: expected 2 fields, saw 5\n'
b'Skipping line 45205: expected 2 fields, saw 3\nSkipping line 96589: expected 2 fields, saw 3\n'
b'Skipping line 9177: expected 2 fields, saw 3\nSkipping line 9232: expected 2 fields, saw 3\nSkipping line 10391: expected 2 fields, saw 3\nSkipping line 10767: expected 2 fields, saw 3\nSkipping line 19439: expected 2 fields, saw 3\nSkipping line 20135: expected 2 fields, saw 3\nSkipping line 38635: expected 2 fields, saw 3\nSkipping line 46625: expected 2 fields, saw 3\nSkipping line 67408: expected 2 fields, saw 3\nSkipping line 96433: expected 2 fields, saw 3\nSkipping line 111918: expected 2 fields, saw 3\nSkipping line 126420: expected 2 fields, saw 3\nSkipping line 127531: expected 2 fields, saw 3\nSkipping line 142222: expected 2 fields, saw 3\n'
b'Skipping line 7466: expected 2 fields, saw 3\nSkipping line 73461: expected 2 fields, saw 3\nSkipping line 107105: expected 2 fields, sa
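
The warnings report lines with more fields than expected, which is what happens when a row contains an extra `|` character. A minimal sketch of how such lines could be located before loading follows. The helper `find_malformed_lines` and the sample text are made up for illustration, and its line numbers count from 1 including the header, which may not match the numbering pandas reports.

```python
import io


def find_malformed_lines(lines, sep="|", expected_fields=2):
    """Return (line_number, field_count) for lines with an unexpected field count."""
    bad = []
    for line_number, line in enumerate(lines, start=1):
        # With quoting disabled, every separator starts a new field.
        n_fields = line.rstrip("\n").count(sep) + 1
        if n_fields != expected_fields:
            bad.append((line_number, n_fields))
    return bad


# Made-up sample in the same two-column, pipe-delimited layout.
sample = io.StringIO(
    "speech_id|speech\n"
    "1|A made-up speech.\n"
    "2|A made-up speech with a stray | character.\n"
    "3|Another made-up speech.\n"
)
malformed = find_malformed_lines(sample)
print(malformed)  # [(3, 3)]
```

For a real file, the same helper could be given an open file handle, for example `open(path, encoding="ISO-8859-1")`.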

In [5]:
speeches_df

Unnamed: 0,speech_id,speech
0,740000001,The Chair lays before the Senate the credentia...
1,740000002,(John C. Crockett) proceeded to read the certi...
2,740000003,Mr. President. I suggest that credentials foun...
3,740000004,Is there objection to the request? The Chair h...
4,740000005,Secretary of State.
...,...,...
382520,940382525,Mr. Speaker. it is a great personal honor for ...
382521,940382526,Mr. Speaker. given the fact that Chairman MADD...
382522,940382527,Mr. Speaker. taie 94th Congress has officially...
382523,940382528,designateApril 24 as a National Day of Remembr...


Now we can create a large DataFrame that combines the speeches with the metadata.

In [6]:
# Inner join on speech_id (the default for pd.merge); remaining missing values become 0.
all_data = pd.merge(speeches_df, descr_df, on='speech_id').fillna(0)
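
Because `pd.merge` defaults to an inner join, any `speech_id` present in only one of the two DataFrames is dropped without notice. A minimal sketch of how to check for that follows, using an outer join with `indicator=True`. The two toy DataFrames are made-up stand-ins for `speeches_df` and `descr_df`.

```python
import pandas as pd

# Toy stand-ins for speeches_df and descr_df (all values are made up).
toy_speeches = pd.DataFrame({"speech_id": [1, 2, 3], "speech": ["a", "b", "c"]})
toy_descr = pd.DataFrame({"speech_id": [2, 3, 4], "chamber": ["S", "H", "S"]})

# Default inner join: only speech_ids found in both frames survive.
inner = pd.merge(toy_speeches, toy_descr, on="speech_id")

# Outer join with an indicator column that records where each row came from.
audit = pd.merge(toy_speeches, toy_descr, on="speech_id", how="outer", indicator=True)
merge_counts = audit["_merge"].value_counts().to_dict()

print(len(inner))
print(merge_counts)
```

Rows marked `left_only` or `right_only` in the `_merge` column are the ones an inner join would discard.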

In [7]:
all_data

Unnamed: 0,speech_id,speech,chamber,date,number_within_file,speaker,first_name,last_name,state,gender,line_start,line_end,file,char_count,word_count
0,740000001,The Chair lays before the Senate the credentia...,S,19350103,1,The VICE PRESIDENT,Unknown,Unknown,Unknown,Special,48,51,01031935.txt,184,32
1,740000002,(John C. Crockett) proceeded to read the certi...,S,19350103,2,The Chief Clerk,Unknown,Unknown,Unknown,Special,52,54,01031935.txt,124,21
2,740000003,Mr. President. I suggest that credentials foun...,S,19350103,3,Mr. ROBINSON,Unknown,ROBINSON,Unknown,M,55,57,01031935.txt,153,30
3,740000004,Is there objection to the request? The Chair h...,S,19350103,4,The VICE PRESIDENT,Unknown,Unknown,Unknown,Special,58,62,01031935.txt,238,44
4,740000005,Secretary of State.,S,19350103,5,Mrs. MARGUERITE P. BACA,MARGUERITE P.,BACA,Unknown,F,273,275,01031935.txt,19,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17394636,940382525,Mr. Speaker. it is a great personal honor for ...,E,19761001,4819,Mr. BIAGGI,Unknown,BIAGGI,Unknown,M,383002,383082,10011976.txt,2782,453
17394637,940382526,Mr. Speaker. given the fact that Chairman MADD...,E,19761001,4820,Mr. PHILLIP BURTON,PHILLIP,BURTON,Unknown,M,383088,383104,10011976.txt,337,60
17394638,940382527,Mr. Speaker. taie 94th Congress has officially...,E,19761001,4821,Mr. JOHNSON of California,Unknown,JOHNSON,California,M,383111,383127,10011976.txt,586,102
17394639,940382528,designateApril 24 as a National Day of Remembr...,E,19761001,4822,For.-......To,Unknown,......TO,Unknown,M,383248,383342,10011976.txt,6066,892


We could also explore the file hierarchy some more if we were so inclined. 

In [53]:
all_speech_file_paths = glob.glob('/scratch/group/oit_research_data/stanford_congress/hein-bound/speeches_*.txt')

In [46]:
for file in all_speech_file_paths:
    print(file)

['/scratch/group/oit_research_data/stanford_congress/hein-bound/speeches_043.txt',
 '/scratch/group/oit_research_data/stanford_congress/hein-bound/speeches_044.txt',
 '/scratch/group/oit_research_data/stanford_congress/hein-bound/speeches_045.txt',
 '/scratch/group/oit_research_data/stanford_congress/hein-bound/speeches_046.txt',
 '/scratch/group/oit_research_data/stanford_congress/hein-bound/speeches_047.txt',
 '/scratch/group/oit_research_data/stanford_congress/hein-bound/speeches_048.txt',
 '/scratch/group/oit_research_data/stanford_congress/hein-bound/speeches_049.txt',
 '/scratch/group/oit_research_data/stanford_congress/hein-bound/speeches_050.txt',
 '/scratch/group/oit_research_data/stanford_congress/hein-bound/speeches_051.txt',
 '/scratch/group/oit_research_data/stanford_congress/hein-bound/speeches_052.txt',
 '/scratch/group/oit_research_data/stanford_congress/hein-bound/speeches_053.txt',
 '/scratch/group/oit_research_data/stanford_congress/hein-bound/speeches_054.txt',
 '/s