# 01-congress-prep
**Purpose**: extract social media and biographical information from the [congress-legislators](https://github.com/unitedstates/congress-legislators) repo for legislators who served in the 114th, 115th, and/or 116th Congress. For each Congress, use the last commit made before the next Congress began that does not yet contain changes for that next Congress:
- 114th started on Jan 3, 2015. Last commit before Jan 3, 2017 unrelated to the 115th: [(Nov 15, 2016)](https://github.com/unitedstates/congress-legislators/tree/a35d649180d55a0b7d1e381e1774d315371a9188)
- 115th started on Jan 3, 2017. Last commit before Jan 3, 2019 unrelated to the 116th: [(Jan 1, 2019)](https://github.com/unitedstates/congress-legislators/tree/b36de263d6d9dbf40b95f0d1dfc1e0fcbb764bd6)
- 116th started on Jan 3, 2019. Last commit before Jan 3, 2021 unrelated to the 117th: [(Dec 7, 2020)](https://github.com/unitedstates/congress-legislators/tree/47223a5a7b96976f2eadfdbbb9646ca0da2b5927)
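
The start dates listed above follow a fixed two-year cycle, so the window each snapshot must fall in can be computed rather than looked up. This is a minimal sketch, assuming the Jan 3 start date that has applied since the 74th Congress (1935). The helper name `congress_window` is hypothetical and not part of the notebook.

```python
from datetime import date

def congress_window(number):
    """Return (start, next_start) for a Congress numbered 74 or later.

    Assumption: every Congress since the 74th (1935) begins on Jan 3 of an
    odd year, so the Nth Congress starts in year 1787 + 2 * N.
    """
    if number < 74:
        raise ValueError('formula only holds from the 74th Congress onward')
    start_year = 1787 + 2 * number
    return date(start_year, 1, 3), date(start_year + 2, 1, 3)

for n in (114, 115, 116):
    start, next_start = congress_window(n)
    print(f'{n}th: started {start}, snapshot must predate {next_start}')
```

The second date is only an upper bound. Choosing a commit that is also unrelated to the next Congress still requires reading the commit log.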

### 1. Clone the repo and navigate to it (i.e., make it the working directory)
### 2. Copy the .yaml files into the associated data subdirectory

```
# social media info
git show a35d649180d55a0b7d1e381e1774d315371a9188:legislators-social-media.yaml > legislators-social-media-114.yaml
git show b36de263d6d9dbf40b95f0d1dfc1e0fcbb764bd6:legislators-social-media.yaml > legislators-social-media-115.yaml
git show 47223a5a7b96976f2eadfdbbb9646ca0da2b5927:legislators-social-media.yaml > legislators-social-media-116.yaml
mv legislators-social-media-114.yaml ../us-right-media/data/01-raw/01-congress-legislators
mv legislators-social-media-115.yaml ../us-right-media/data/01-raw/01-congress-legislators
mv legislators-social-media-116.yaml ../us-right-media/data/01-raw/01-congress-legislators

# biographical info
# note: the most recent commit hash at the time of this project is `1dfb8b8c73c3f06ff411930fd94a5d945dd4ae94`.
cp legislators-historical.yaml ../us-right-media/data/01-raw/01-congress-legislators/legislators-historical.yaml
cp legislators-current.yaml ../us-right-media/data/01-raw/01-congress-legislators/legislators-current.yaml
```
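
The three `git show` / `mv` pairs above differ only in the commit hash and the Congress number, so they could also be driven from a single mapping. This is a sketch rather than the workflow actually used here: `git_show_args` and `export_snapshot` are hypothetical helpers, and `export_snapshot` assumes `git` is on the PATH and that `repo_dir` points at the cloned congress-legislators repo.

```python
import os
import subprocess

# commit hashes taken from the snapshot list at the top of this notebook
SNAPSHOTS = {
    '114': 'a35d649180d55a0b7d1e381e1774d315371a9188',
    '115': 'b36de263d6d9dbf40b95f0d1dfc1e0fcbb764bd6',
    '116': '47223a5a7b96976f2eadfdbbb9646ca0da2b5927',
}

def git_show_args(commit, path='legislators-social-media.yaml'):
    """Build the argv for `git show <commit>:<path>`."""
    return ['git', 'show', f'{commit}:{path}']

def export_snapshot(congress, out_dir, repo_dir='.'):
    """Write one snapshot of the social media file straight into out_dir."""
    out_path = os.path.join(out_dir, f'legislators-social-media-{congress}.yaml')
    with open(out_path, 'wb') as fh:
        subprocess.run(git_show_args(SNAPSHOTS[congress]),
                       cwd=repo_dir, stdout=fh, check=True)
    return out_path

print(' '.join(git_show_args(SNAPSHOTS['114'])))
```

Calling `export_snapshot(c, out_dir)` for each key in `SNAPSHOTS` would replace both the redirect and the `mv` step.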

### 3. Load libraries and specify input/output locations

In [1]:
import os
dir_inp = os.path.join('..', '..', 'data', '01-raw', '01-congress-legislators')
dir_out = os.path.join('..', '..', 'data', '02-intermediate', '01-congress-legislators')

import pandas as pd
import re
from ruamel.yaml import YAML
yaml = YAML(typ='base')
# use typ='base' so every scalar loads as a string; otherwise twitter_id is read as a float first, which corrupts its digits
# "If you load after doing `yaml = YAML(typ='base')`, you will get the baseloader and every scalar loads as a string"
# https://stackoverflow.com/questions/54820256/how-to-read-load-yaml-parameters-with-leading-zeros-as-a-string
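
To see why a float round trip is harmful for `twitter_id`, the sketch below pushes one of the IDs from the 116th snapshot through `float` and back. It uses only the standard library and makes no claim about which layer (the YAML loader or pandas) performs the conversion in practice. It only shows that the digits do not survive it.

```python
twitter_id = '1176522535531360257'  # RepDanBishop's ID as listed in the 116th snapshot

as_int = int(twitter_id)
via_float = int(float(twitter_id))  # value after a round trip through a 64-bit float

print(as_int == via_float)  # False: the trailing digits were rounded away
print(as_int - via_float)   # size of the rounding error
```

Keeping the ID as a string, as the base loader does, sidesteps the problem and also preserves leading zeros in fields such as `thomas`.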

### 4. Extract data from .yaml files into dataframes

In [2]:
def extract_social_media(fn):
    """Extract social media information for congressional legislators.
     
    Args:
        fn (str): name of .yaml file
        
    Returns:
        A dataframe where each row corresponds to a legislator.
    
    """

    with open(os.path.join(dir_inp, fn)) as stream:
        data = yaml.load(stream)
        df = pd.DataFrame(data)
    print(f'The {fn} file contains {len(df)} legislators.')
    
    df_soc = pd.DataFrame({'bioguide': [x.get('bioguide') for x in df['id']],
                            'thomas': [x.get('thomas') for x in df['id']],
                            'govtrack': [x.get('govtrack') for x in df['id']],
                            'twitter': [x.get('twitter') for x in df['social']],
                            'twitter_id': [x.get('twitter_id') for x in df['social']],
                            'facebook': [x.get('facebook') for x in df['social']],
                            'facebook_id': [x.get('facebook_id') for x in df['social']],
                            'youtube_id': [x.get('youtube_id') for x in df['social']],
                            'congress': re.findall(r'\d+', fn)[0]
                            })
    
    return df_soc

In [3]:
soc_114 = extract_social_media('legislators-social-media-114.yaml')
soc_115 = extract_social_media('legislators-social-media-115.yaml')
soc_116 = extract_social_media('legislators-social-media-116.yaml')

# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
soc = pd.concat([soc_114, soc_115, soc_116], ignore_index=True)
soc.to_pickle(os.path.join(dir_out, 'legislators-social-media.pkl'))

print(f'The combined dataframe contains {len(soc)} rows. There are duplicate entries when legislators serve more than 1 term.')

The legislators-social-media-114.yaml file contains 539 legislators.
The legislators-social-media-115.yaml file contains 529 legislators.
The legislators-social-media-116.yaml file contains 532 legislators.
The combined dataframe contains 1600 rows. There are duplicate entries when legislators serve more than 1 term.


In [4]:
# show dataframe
soc

Unnamed: 0,bioguide,thomas,govtrack,twitter,twitter_id,facebook,facebook_id,youtube_id,congress
0,R000600,02222,412664,RepAmata,3026622545,congresswomanaumuaamata,1537155909907320,UCGdrLQbt1PYDTPsampx4t1A,114
1,H001070,02260,412645,RepHardy,2964222544,RepCresentHardy,320612381469421,UCc8E6NWCdgrXjBVI2NNPYdA,114
2,Y000064,02019,412428,RepToddYoung,234128524,RepToddYoung,186203844738421,UCuknj4PGn91gHDNAfboZEgQ,114
3,E000295,02283,412667,SenJoniErnst,2856787757,senjoniernst,351671691660938,UCLwrmtF_84FIcK3TyMs4MIw,114
4,T000476,02291,412668,senthomtillis,2964174789,SenatorThomTillis,1576257352609470,UCUD9VGV4SSGWjGdbn37Ea2w,114
...,...,...,...,...,...,...,...,...,...
1595,B001311,,412844,RepDanBishop,1176522535531360257,repdanbishop,,,116
1596,G000061,,456792,repmikegarcia,1262531473057423361,RepMikeGarcia,,,116
1597,M000687,,407672,RepKweisiMfume,1276209702322438148,RepKweisiMfume,,,116
1598,M001210,,412845,RepGregMurphy,1173978070535024642,RepGregMurphy,,,116


In [5]:
# show terms served by legislator
soc.groupby(['bioguide', 'congress']).size()

bioguide  congress
A000055   114         1
          115         1
          116         1
A000360   114         1
          115         1
                     ..
Y000066   115         1
Z000017   114         1
          115         1
          116         1
Z000018   114         1
Length: 1600, dtype: int64

In [6]:
def extract_bio_info(fn):
    """Extract biographical information for congressional legislators.
     
    Args:
        fn (str): name of .yaml file
        
    Returns:
        A dataframe where each row corresponds to a legislator.
        
    """
     
    with open(os.path.join(dir_inp, fn)) as stream:
        data = yaml.load(stream)
        df = pd.DataFrame(data)
        print(f'The {fn} file contains {len(df)} legislators.')
    
    df_bio = pd.DataFrame({'bioguide': [x.get('bioguide') for x in df['id']],
                            'thomas': [x.get('thomas') for x in df['id']],
                            'lis': [x.get('lis') for x in df['id']],
                            'govtrack': [x.get('govtrack') for x in df['id']],
                            'opensecrets': [x.get('opensecrets') for x in df['id']],
                            'votesmart': [x.get('votesmart') for x in df['id']],
                            'fec': [x.get('fec') for x in df['id']],
                            'cspan': [x.get('cspan') for x in df['id']],
                            'wikipedia': [x.get('wikipedia') for x in df['id']],
                            'house_history': [x.get('house_history') for x in df['id']],
                            'ballotpedia': [x.get('ballotpedia') for x in df['id']],
                            'maplight': [x.get('maplight') for x in df['id']],
                            'icpsr': [x.get('icpsr') for x in df['id']],
                            'wikidata': [x.get('wikidata') for x in df['id']],
                            'google_entity_id': [x.get('google_entity_id') for x in df['id']],
                            'first': [x.get('first') for x in df['name']],
                            'last': [x.get('last') for x in df['name']],
                            'official_full': [x.get('official_full') for x in df['name']],
                            'birthday': [x.get('birthday') for x in df['bio']],
                            'gender': [x.get('gender') for x in df['bio']],
                            'terms': [x for x in df['terms']],
                            'leadership_roles': [x for x in df['leadership_roles']],
                            'other_names': [x for x in df['other_names']],
                            'family': [x for x in df['family']]}
                            )
                            
    return df_bio

In [7]:
bio_curr = extract_bio_info('legislators-current.yaml')
bio_hist = extract_bio_info('legislators-historical.yaml')

# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
bio = pd.concat([bio_curr, bio_hist], ignore_index=True)
bio.to_pickle(os.path.join(dir_out, 'legislators-biographical.pkl'))

print(f'The combined dataframe contains {len(bio)} legislators.')

The legislators-current.yaml file contains 538 legislators.
The legislators-historical.yaml file contains 12045 legislators.
The combined dataframe contains 12583 legislators.


### 5. Merge social media and biographical dataframes
- "`bioguide`: The alphanumeric ID for this legislator in [http://bioguide.congress.gov](http://bioguide.congress.gov). Note that at one time some legislators (women who had changed their name when they got married) had two entries on the bioguide website. Only one bioguide ID is included here. This is the best field to use as a primary key." [(source: congress-legislators)](https://github.com/unitedstates/congress-legislators)

In [8]:
# check that bioguide as primary key makes sense
# use bio as base dataframe and join soc to it
for key in ['bioguide', 'thomas', 'govtrack']:
    print(f"{key}: missing {soc[key].isna().sum()} entries")

print(f"every row in `bio` has a unique bioguide: {len(bio)==bio.bioguide.nunique()}")
print(f"every row in `soc` has a unique bioguide: {len(soc)==soc.bioguide.nunique()}")

bioguide: missing 0 entries
thomas: missing 247 entries
govtrack: missing 94 entries
every row in `bio` has a unique bioguide: True
every row in `soc` has a unique bioguide: False


In [9]:
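# sort by bioguide and congress, then keep only the last (most recent) row per legislator to get the most up-to-date social media information
# note: `congress` is a string here, so the sort is lexicographic; that is safe for 114-116 because all values have three digits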
soc_sorted = soc.sort_values(by=['bioguide','congress'])
soc_latest = soc_sorted.drop_duplicates('bioguide', keep='last').reset_index(drop=True)
soc_latest

Unnamed: 0,bioguide,thomas,govtrack,twitter,twitter_id,facebook,facebook_id,youtube_id,congress
0,A000055,01460,400004,Robert_Aderholt,76452765,RobertAderholt,,UC71CAgpg1gbLTew_pfTneGA,116
1,A000360,01695,300002,SenAlexander,76649729,senatorlamaralexander,,UChDLBjn5RWqgMmCSswT05IQ,116
2,A000367,02029,412438,,,repjustinamash,,UCeg6HhoCXrS8xpON9dxtZgA,116
3,A000368,02075,412493,KellyAyotte,229592356,kellyayottenh,123436097729198,UCe_jD6bQuBwAo4CxwUm_ztw,114
4,A000369,02090,412500,MarkAmodeiNV2,402719755,MarkAmodeiNV2,,UCjOGx2iqSn1r3BQxaVgYhYw,116
...,...,...,...,...,...,...,...,...,...
710,Y000064,02019,412428,SenToddYoung,234128524,SenatorToddYoung,,UCuknj4PGn91gHDNAfboZEgQ,116
711,Y000065,02115,412525,RepTedYoho,1071900114,CongressmanTedYoho,,UCGmDQgEgP2Z0NjavmnS9Hwg,116
712,Y000066,02242,412628,RepDavidYoung,314205957,RepDavidYoung,,UCD2SzM1kn4iZfHccN4bocQA,115
713,Z000017,02261,412646,RepLeeZeldin,2750127259,RepLeeZeldin,,UCHzZuesCPDka2NhZO8icqzA,116
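
The sort-then-deduplicate step depends on `congress` ordering correctly. Since `congress` comes from `re.findall` it is a string, and string sorting would put a two-digit Congress after a three-digit one. This is a small sketch on made-up rows (all IDs and handles are hypothetical) that sorts on an integer copy instead. It is a defensive variant and not what the cell above does.

```python
import pandas as pd

soc_demo = pd.DataFrame({
    'bioguide': ['X000001', 'X000001', 'X000002', 'X000001'],
    'congress': ['115', '114', '114', '116'],
    'twitter':  ['RepOld', 'RepOlder', 'RepOnly', 'SenNew'],
})

# sort on a numeric copy so the order does not depend on string length
soc_demo['congress_num'] = soc_demo['congress'].astype(int)

latest_demo = (soc_demo.sort_values(['bioguide', 'congress_num'])
                       .drop_duplicates('bioguide', keep='last')
                       .drop(columns='congress_num')
                       .reset_index(drop=True))
print(latest_demo)
```

Each legislator keeps exactly one row, taken from the latest Congress in which they appear.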


In [10]:
# check: before + after filtering
# example ID
display(soc.loc[soc['bioguide']=='K000386'])
display(soc_latest.loc[soc_latest['bioguide']=='K000386'])

# count of legislators is the same for 116th (i.e., most recent) and decreases for 114th and 115th since some legislators also served in 116th
display(soc.groupby('congress').size())
display(soc_latest.groupby('congress').size())

Unnamed: 0,bioguide,thomas,govtrack,twitter,twitter_id,facebook,facebook_id,youtube_id,congress
494,K000386,2264,412649,RepJohnKatko,2966765501,RepJohnKatko,,,114
957,K000386,2264,412649,RepJohnKatko,2966765501,RepJohnKatko,,,115
1403,K000386,2264,412649,RepJohnKatko,2966765501,RepJohnKatko,,,116


Unnamed: 0,bioguide,thomas,govtrack,twitter,twitter_id,facebook,facebook_id,youtube_id,congress
357,K000386,2264,412649,RepJohnKatko,2966765501,RepJohnKatko,,,116


congress
114    539
115    529
116    532
dtype: int64

congress
114     81
115    102
116    532
dtype: int64

- `bio` also contains legislators who served only before the 114th Congress. We exclude them from the merged dataframe because we are only interested in the 114th Congress onwards.
- To do so, we inner-join `bio` and `soc_latest` on `bioguide`, which keeps only legislators from the 114th-116th Congresses.
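
As a toy illustration of the two points above, the sketch below joins a three-row `bio`-like frame to a two-row `soc`-like frame. All names and IDs are made up. It also shows what `validate='one_to_one'` guards against: a duplicated key on either side raises `pandas.errors.MergeError` instead of silently multiplying rows.

```python
import pandas as pd

bio_demo = pd.DataFrame({'bioguide': ['A1', 'B2', 'C3'],
                         'last': ['Adams', 'Baker', 'Clark']})
soc_latest_demo = pd.DataFrame({'bioguide': ['B2', 'C3'],
                                'twitter': ['RepBaker', None]})

# inner join: A1 has no social media row, so it is dropped
merged_demo = pd.merge(bio_demo, soc_latest_demo,
                       how='inner', on='bioguide', validate='one_to_one')
print(merged_demo)

# a duplicated key on the right side violates the one-to-one check
dup_demo = pd.concat([soc_latest_demo, soc_latest_demo.iloc[[0]]], ignore_index=True)
try:
    pd.merge(bio_demo, dup_demo, on='bioguide', validate='one_to_one')
    raised = False
except pd.errors.MergeError:
    raised = True
print(raised)
```

This is why the duplicates in `soc` have to be collapsed into `soc_latest` before merging.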

In [11]:
# because there are overlapping keys (thomas, govtrack) between `soc` and `bio`,
# first select the keys we want to merge from `soc` into `bio`. This avoids creating thomas_x, thomas_y, govtrack_x, govtrack_y.
# get the difference between the two dataframes' columns. Add `bioguide` back in since it is the key that is used for joining.
merge_cols = soc_latest.columns.difference(bio.columns).to_list() + ['bioguide']
merge_cols

['congress',
 'facebook',
 'facebook_id',
 'twitter',
 'twitter_id',
 'youtube_id',
 'bioguide']

In [12]:
pol = pd.merge(left=bio,
               right=soc_latest[merge_cols],
               how='inner',            # keep only bioguide IDs present in both dataframes
               on='bioguide',          # join key; must exist in both dataframes
               validate='one_to_one')  # raise an error unless the key is unique on both sides

display(pol[['official_full', 'twitter', 'twitter_id']])
print(pol.columns)

Unnamed: 0,official_full,twitter,twitter_id
0,Sherrod Brown,SenSherrodBrown,43910797
1,Maria Cantwell,SenatorCantwell,117501995
2,Benjamin L. Cardin,SenatorCardin,109071031
3,Thomas R. Carper,SenatorCarper,249787913
4,"Robert P. Casey, Jr.",SenBobCasey,171598736
...,...,...,...
710,Ben McAdams,RepBenMcAdams,196362083
711,Denver Riggleman,RepRiggleman,1080504024695222273
712,Cedric L. Richmond,RepRichmond,267854863
713,Kamala D. Harris,SenKamalaHarris,803694179079458816


Index(['bioguide', 'thomas', 'lis', 'govtrack', 'opensecrets', 'votesmart',
       'fec', 'cspan', 'wikipedia', 'house_history', 'ballotpedia', 'maplight',
       'icpsr', 'wikidata', 'google_entity_id', 'first', 'last',
       'official_full', 'birthday', 'gender', 'terms', 'leadership_roles',
       'other_names', 'family', 'congress', 'facebook', 'facebook_id',
       'twitter', 'twitter_id', 'youtube_id'],
      dtype='object')


In [13]:
# Extract information about term(s) served
pol['terms_party'] = pol['terms'].map(lambda x: [term.get('party', None) for term in x])
pol['latest_party'] = pol['terms_party'].map(lambda x: x[-1])
pol['terms_type'] = pol['terms'].map(lambda x: [term.get('type', None) for term in x])
pol['terms_state'] = pol['terms'].map(lambda x: [term.get('state', None)  for term in x])
pol['latest_state'] = pol['terms_state'].map(lambda x: x[-1])
pol['terms_district'] = pol['terms'].map(lambda x: [term.get('district', None) for term in x])
# pair each term's state with its district: (state, district number if applicable)
pol['terms_region'] = [list(zip(states, districts))
                       for states, districts in zip(pol['terms_state'], pol['terms_district'])]
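
To make the per-term extraction concrete, here is the same logic applied to a single hand-written `terms` list in plain Python. The two terms are illustrative and not copied from the YAML files. Values are strings because the base loader reads every scalar as a string.

```python
terms = [
    {'type': 'rep', 'state': 'OH', 'district': '13', 'party': 'Democrat'},
    {'type': 'sen', 'state': 'OH', 'party': 'Democrat'},  # Senate terms carry no district
]

terms_party = [term.get('party') for term in terms]
terms_state = [term.get('state') for term in terms]
terms_district = [term.get('district') for term in terms]  # None when the key is absent

latest_party = terms_party[-1]  # terms are assumed to be listed oldest first
terms_region = list(zip(terms_state, terms_district))

print(latest_party)
print(terms_region)
```

Note that `latest_party` and `latest_state` are only "latest" if the terms in the source file are in chronological order, which this sketch assumes.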

In [14]:
pol

Unnamed: 0,bioguide,thomas,lis,govtrack,opensecrets,votesmart,fec,cspan,wikipedia,house_history,...,twitter,twitter_id,youtube_id,terms_party,latest_party,terms_type,terms_state,latest_state,terms_district,terms_region
0,B000944,00136,S307,400050,N00003535,27018,"[H2OH13033, S6OH00163]",5051,Sherrod Brown,9996,...,SenSherrodBrown,43910797,UCgy8jfERh-t_ixkKKoCmglQ,"[Democrat, Democrat, Democrat, Democrat, Democ...",Democrat,"[rep, rep, rep, rep, rep, rep, rep, sen, sen, ...","[OH, OH, OH, OH, OH, OH, OH, OH, OH, OH]",OH,"[13, 13, 13, 13, 13, 13, 13, None, None, None]","[(OH, 13), (OH, 13), (OH, 13), (OH, 13), (OH, ..."
1,C000127,00172,S275,300018,N00007836,27122,"[S8WA00194, H2WA01054]",26137,Maria Cantwell,10608,...,SenatorCantwell,117501995,UCN52UDqKgvHRk39ncySrIMw,"[Democrat, Democrat, Democrat, Democrat, Democ...",Democrat,"[rep, sen, sen, sen, sen]","[WA, WA, WA, WA, WA]",WA,"[1, None, None, None, None]","[(WA, 1), (WA, None), (WA, None), (WA, None), ..."
2,C000141,00174,S308,400064,N00001955,26888,"[H6MD03177, S6MD03177]",4004,Ben Cardin,10629,...,SenatorCardin,109071031,UCiQaJnMzlfzzG3VESgyZChA,"[Democrat, Democrat, Democrat, Democrat, Democ...",Democrat,"[rep, rep, rep, rep, rep, rep, rep, rep, rep, ...","[MD, MD, MD, MD, MD, MD, MD, MD, MD, MD, MD, M...",MD,"[3, 3, 3, 3, 3, 3, 3, 3, 3, 3, None, None, None]","[(MD, 3), (MD, 3), (MD, 3), (MD, 3), (MD, 3), ..."
3,C000174,00179,S277,300019,N00012508,22421,[S8DE00079],663,Tom Carper,10671,...,SenatorCarper,249787913,UCgLnvbKwu4B3navofj6Qvvw,"[Democrat, Democrat, Democrat, Democrat, Democ...",Democrat,"[rep, rep, rep, rep, rep, sen, sen, sen, sen]","[DE, DE, DE, DE, DE, DE, DE, DE, DE]",DE,"[0, 0, 0, 0, 0, None, None, None, None]","[(DE, 0), (DE, 0), (DE, 0), (DE, 0), (DE, 0), ..."
4,C001070,01828,S309,412246,N00027503,2541,[S6PA00217],47036,Bob Casey Jr.,,...,SenBobCasey,171598736,UCtVssXhx-KuZa-hSvnsnJ0A,"[Democrat, Democrat, Democrat]",Democrat,"[sen, sen, sen]","[PA, PA, PA]",PA,"[None, None, None]","[(PA, None), (PA, None), (PA, None)]"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
710,M001209,,,412829,N00042013,117512,[H8UT04053],,Ben McAdams,,...,RepBenMcAdams,196362083,,[Democrat],Democrat,[rep],[UT],UT,[4],"[(UT, 4)]"
711,R000611,,,412831,N00043541,181445,[H8VA05171],,Denver Riggleman,,...,RepRiggleman,1080504024695222273,,[Republican],Republican,[rep],[VA],VA,[5],"[(VA, 5)]"
712,R000588,02023,,412432,N00030184,35384,[H8LA02054],62391,Cedric Richmond,20816,...,RepRichmond,267854863,UCsB66Dq8sCGwr5K3g1kqTrA,"[Democrat, Democrat, Democrat, Democrat, Democ...",Democrat,"[rep, rep, rep, rep, rep, rep]","[LA, LA, LA, LA, LA, LA]",LA,"[2, 2, 2, 2, 2, 2]","[(LA, 2), (LA, 2), (LA, 2), (LA, 2), (LA, 2), ..."
713,H001075,,S387,412678,N00036915,120012,[S6CA00584],1018696,Kamala Harris,,...,SenKamalaHarris,803694179079458816,UCe1ciA1TDa5F9K6Ufr_6Fsw,[Democrat],Democrat,[sen],[CA],CA,[None],"[(CA, None)]"


### 6. Manually check/add Twitter accounts if account information isn't available from congress-legislators

In [15]:
# members with no Twitter info (the same set results when checking 'twitter_id')
bioguides_no_tw = pol.loc[pol['twitter'].isna(), 'bioguide'].tolist()

pol.loc[pol['bioguide'].isin(bioguides_no_tw), ['bioguide', 'official_full', 'twitter', 'terms_state', 'latest_party']]

Unnamed: 0,bioguide,official_full,twitter,terms_state,latest_party
272,K000384,Tim Kaine,,"[VA, VA]",Democrat
484,V000127,David Vitter,,"[LA, LA, LA, LA, LA]",Republican
509,M000689,John L. Mica,,"[FL, FL, FL, FL, FL, FL, FL, FL, FL, FL, FL, FL]",Republican
510,M001144,Jeff Miller,,"[FL, FL, FL, FL, FL, FL, FL, FL]",Republican
522,G000556,Alan Grayson,,"[FL, FL, FL]",Democrat
548,D000604,Charles W. Dent,,"[PA, PA, PA, PA, PA, PA, PA]",Republican
560,B001245,Madeleine Z. Bordallo,,"[GU, GU, GU, GU, GU, GU, GU, GU]",Democrat
656,A000367,Justin Amash,,"[MI, MI, MI, MI, MI]",Libertarian
658,C001049,Wm. Lacy Clay,,"[MO, MO, MO, MO, MO, MO, MO, MO, MO, MO]",Democrat
695,G000584,Greg Gianforte,,"[MT, MT]",Republican


[Finding Twitter User IDs](https://gist.github.com/kentbrew/8942accb5c584f11a775af02d097dd40)
- search for the legislator's name on Google and look for verified accounts
- the Twitter ID appears after `profile_banners` in the profile page's HTML
- to confirm an ID, open the account page at `https://twitter.com/intent/user?user_id=` followed by the ID
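
The last two steps can be sketched in a few lines. Both helper names are hypothetical, and the sample HTML is a made-up fragment that assumes the numeric ID directly follows `profile_banners/` in the banner image URL. Real pages may differ, so treat the pattern as a starting point.

```python
import re

def twitter_id_from_html(html):
    """Return the digits that follow 'profile_banners/', or None if absent."""
    match = re.search(r'profile_banners/(\d+)', html)
    return match.group(1) if match else None

def intent_url(user_id):
    """Build the URL that opens the account page for a numeric user ID."""
    return f'https://twitter.com/intent/user?user_id={user_id}'

# hypothetical fragment of a profile page
sample_html = '<img src="https://pbs.twimg.com/profile_banners/172858784/1500000000/1500x500">'

user_id = twitter_id_from_html(sample_html)
print(user_id)
print(intent_url(user_id))
```

Opening the printed URL and checking that it lands on the expected account is a quick sanity check before hard-coding an ID.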

In [16]:
# https://stackoverflow.com/a/38467449:
# conditional lookup using .loc, don't use chained indexing since it can return either a view of the original or a separate copy

# 272 Tim Kaine
pol.loc[pol['official_full']=='Tim Kaine', 'twitter'] = 'timkaine'
pol.loc[pol['official_full']=='Tim Kaine', 'twitter_id'] = '172858784'

# 484 David Vitter - if adding df_trp in 2-congress-prep.ipynb, this legislator is already excluded from dataset
pol.loc[pol['official_full']=='David Vitter', 'twitter'] = 'DavidVitter'
pol.loc[pol['official_full']=='David Vitter', 'twitter_id'] = '19028248'

# 509 John L. Mica
# pol.loc[pol['official_full']=='John L. Mica', 'twitter'] = 'RepJohnMica' # private
# pol.loc[pol['official_full']=='John L. Mica', 'twitter_id'] = 

# 510 Jeff Miller
# pol.loc[pol['official_full']=='Jeff Miller', 'twitter'] = 'RepJeffMiller' # does not exist anymore
# pol.loc[pol['official_full']=='Jeff Miller', 'twitter_id'] = 

# 522 Alan Grayson
pol.loc[pol['official_full']=='Alan Grayson', 'twitter'] = 'AlanGrayson'
pol.loc[pol['official_full']=='Alan Grayson', 'twitter_id'] = '41017380'

# 548 Charles W. Dent
pol.loc[pol['official_full']=='Charles W. Dent', 'twitter'] = 'RepCharlieDent'
pol.loc[pol['official_full']=='Charles W. Dent', 'twitter_id'] = '242376736'

# 560 Madeleine Z. Bordallo - Guam delegate; no account
# pol.loc[pol['official_full']=='Madeleine Z. Bordallo', 'twitter'] = 
# pol.loc[pol['official_full']=='Madeleine Z. Bordallo', 'twitter_id'] = 

# 656 Justin Amash
pol.loc[pol['official_full']=='Justin Amash', 'twitter'] = 'justinamash'
pol.loc[pol['official_full']=='Justin Amash', 'twitter_id'] = '233842454'

# 658 Wm. Lacy Clay
pol.loc[pol['official_full']=='Wm. Lacy Clay', 'twitter'] = 'LacyClayMO1'
pol.loc[pol['official_full']=='Wm. Lacy Clay', 'twitter_id'] = '584912320'

# 695 Greg Gianforte
pol.loc[pol['official_full']=='Greg Gianforte', 'twitter'] = 'GregForMontana'
pol.loc[pol['official_full']=='Greg Gianforte', 'twitter_id'] = '3420965229'
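
The assignments above match on `official_full`. Since `bioguide` is the primary key, an alternative is to keep the manual additions in one mapping keyed by `bioguide` and apply them in a loop. This is a sketch on a three-row stand-in for `pol`, not the code used here. The bioguide IDs, handles, and Twitter IDs are the ones shown in this section, while `manual_twitter` and `pol_demo` are hypothetical names.

```python
import pandas as pd

# bioguide -> (handle, user ID as a string)
manual_twitter = {
    'K000384': ('timkaine', '172858784'),
    'G000556': ('AlanGrayson', '41017380'),
}

pol_demo = pd.DataFrame({
    'bioguide': ['K000384', 'G000556', 'B001245'],
    'twitter': [None, None, None],
    'twitter_id': [None, None, None],
})

for bioguide, (handle, user_id) in manual_twitter.items():
    mask = pol_demo['bioguide'] == bioguide
    if mask.sum() != 1:
        # fail loudly instead of silently updating zero or several rows
        raise ValueError(f'expected exactly one row for {bioguide}')
    pol_demo.loc[mask, 'twitter'] = handle
    pol_demo.loc[mask, 'twitter_id'] = user_id

print(pol_demo)
```

Legislators without a usable account (here `B001245`) are simply left out of the mapping and keep their missing values.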

### 7. Export datasets

In [17]:
# print all unique party affiliations across legislators
parties = list({party for terms_party in pol['terms_party'] for party in terms_party})
parties

['Democrat', 'Libertarian', 'Independent', 'Republican']

In [18]:
# all politicians together regardless of party affiliation
pol.to_pickle(os.path.join(dir_out, 'politicians-all-parties.pkl'))

# by party affiliation
for party in parties:
    pol_party = pol.loc[pol['latest_party']==party].reset_index(drop=True)
    pol_party.to_pickle(os.path.join(dir_out, f'politicians-{party.lower()}.pkl'))

In [19]:
pol.groupby('latest_party',dropna=False).size()

latest_party
Democrat       332
Independent      3
Libertarian      1
Republican     379
dtype: int64

In [20]:
# misc fun fact: politicians who switched parties
switched = (pol['terms_party'].map(set) # get the set of unique party values 
                  .map(len) # count how many are in the set 
                  .sort_values() # optional (just easier for checking)
                  .where(lambda x : x>1).dropna().index # remove if the party affiliation has stayed constant (i.e., equals 1)
                  .tolist()) # convert to list for looping

with pd.option_context('display.max_colwidth', None):
    for member_idx in switched:
        print('---------------------------------------------------------------------------------------------------------------------------------------')
        # `switched` holds index labels, so look rows up with .loc rather than positional .iloc
        display(pol.loc[member_idx, ['official_full', 'twitter', 'terms_party']])

---------------------------------------------------------------------------------------------------------------------------------------


official_full                                                     Justin Amash
twitter                                                            justinamash
terms_party      [Republican, Republican, Republican, Republican, Libertarian]
Name: 656, dtype: object

---------------------------------------------------------------------------------------------------------------------------------------


official_full                                                                                                 Richard C. Shelby
twitter                                                                                                               SenShelby
terms_party      [Democrat, Democrat, Democrat, Democrat, Democrat, Republican, Republican, Republican, Republican, Republican]
Name: 186, dtype: object

---------------------------------------------------------------------------------------------------------------------------------------


official_full                Paul Mitchell
twitter                    RepPaulMitchell
terms_party      [Republican, Independent]
Name: 694, dtype: object

---------------------------------------------------------------------------------------------------------------------------------------


official_full                                               Gregorio Kilili Camacho Sablan
twitter                                                                      Kilili_Sablan
terms_party      [Democrat, Democrat, Democrat, Independent, Democrat, Democrat, Democrat]
Name: 173, dtype: object
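
The same check can be written as a small predicate in plain Python. The party lists below are copied from the outputs in this notebook, and the function name `switched_party` is hypothetical. One deliberate difference from the cell above is that missing party values (`None`) are ignored, so a legislator with one unrecorded term is not reported as having switched.

```python
def switched_party(terms_party):
    """True when more than one distinct, non-missing party appears."""
    return len({party for party in terms_party if party is not None}) > 1

examples = {
    'Justin Amash': ['Republican', 'Republican', 'Republican', 'Republican', 'Libertarian'],
    'Paul Mitchell': ['Republican', 'Independent'],
    'Robert P. Casey, Jr.': ['Democrat', 'Democrat', 'Democrat'],
}

switchers = [name for name, parties in examples.items() if switched_party(parties)]
print(switchers)
```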