# Lecture 24 - Text Data  

<font size = "5">

In this lecture we will work with text data.

<font size = "5">

Import Libraries

In [1]:
import pandas as pd

<font size = "5">

Import Data

- Congressional bills in the United States

In [2]:
bills_actions = pd.read_csv("data_raw/bills_actions.csv")
bills_actions.dtypes

Congress        int64
bill_number     int64
bill_type      object
action         object
main_action    object
category       object
member_id       int64
dtype: object

# II. Basic Text Operations 

<font size = "5">

Count Frequency

In [3]:
bills_actions["category"]

0         amendment
1         amendment
2         amendment
3       senate bill
4       senate bill
           ...     
3298      amendment
3299      amendment
3300      amendment
3301      amendment
3302      amendment
Name: category, Length: 3303, dtype: object

In [4]:
bills_actions["category"].value_counts()

category
amendment                       1529
house bill                       902
senate bill                      514
house resolution                 234
senate resolution                 60
house joint resolution            22
house concurrent resolution       20
senate concurrent resolution      14
senate joint resolution            8
Name: count, dtype: int64
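
Frequencies are often easier to compare as shares of the total. The following is a minimal sketch on a small made-up series (the values in `toy` are illustrative, not the real data), using the `normalize=True` option of `.value_counts()`:

```python
import pandas as pd

# Toy data standing in for the "category" column (values are illustrative)
toy = pd.Series(["amendment", "amendment", "house bill", "senate bill"])

counts = toy.value_counts()                # absolute frequencies
shares = toy.value_counts(normalize=True)  # relative frequencies that sum to 1

print(counts)
print(shares)
```

The same call should work on `bills_actions["category"]` if you prefer proportions to raw counts.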

<font size = "5">

Subset text categories

In [5]:
# For this analysis we are only interested in bills. With ".query()" ...
#     - We select all rows whose value in the column "category"
#       is contained in "list_categories"
#     - "in" tests whether a value belongs to a list
#     - "@" is the syntax to reference a Python variable that is
#       defined outside the query string

list_categories = ["house bill","senate bill"]
bills = bills_actions.query('category in @list_categories')

# Verify that the code worked:
bills["category"].value_counts()


category
house bill     902
senate bill    514
Name: count, dtype: int64
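
As a cross-check, the same subset can be written with boolean indexing and `.isin()`. This is a sketch on a made-up data frame (`toy` and its values are assumptions for illustration), showing that both approaches select the same rows:

```python
import pandas as pd

# Toy data frame; the values are illustrative only
toy = pd.DataFrame({
    "category": ["house bill", "amendment", "senate bill", "house resolution"],
    "bill_number": [1, 2, 3, 4],
})
list_categories = ["house bill", "senate bill"]

# Option 1: ".query()" with "@" to reference the Python list
subset_query = toy.query("category in @list_categories")

# Option 2: boolean indexing with ".isin()"
subset_isin = toy[toy["category"].isin(list_categories)]

# Both options keep the same rows
print(subset_query.equals(subset_isin))
```

Which form to use is mostly a matter of taste; `.query()` tends to read more like a sentence, while `.isin()` avoids writing code inside a string.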

<font size = "5">

Data manipulation with sentences

In [6]:
# What share of the rows in "bills" mention the word "Senator"?
# The mean of a True/False column is the proportion of True values
bool_contains = bills["action"].str.contains("Senator")
print(bool_contains.mean())

0.3199152542372881
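
Note that `.str.contains()` is case-sensitive by default and interprets its pattern as a regular expression. The sketch below uses made-up sentences (the names and text are invented for illustration) to show the `case`, `regex`, and `na` options, and the difference between `.sum()` (a count) and `.mean()` (a share):

```python
import pandas as pd

# Made-up action texts, including a lowercase variant and a missing value
toy_actions = pd.Series([
    "Reported by Senator Smith without amendment.",
    "Referred to the House Committee on Ways and Means.",
    "senator Jones requested a vote.",
    None,
])

# regex=False treats the pattern as literal text
# na=False counts missing values as "no match"
exact = toy_actions.str.contains("Senator", regex=False, na=False)
any_case = toy_actions.str.contains("Senator", case=False, regex=False, na=False)

print(exact.sum(), exact.mean())        # count and share of exact matches
print(any_case.sum(), any_case.mean())  # count and share ignoring case
```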


In [7]:
bills[bills["action"].str.contains("Senator")]

Unnamed: 0,Congress,bill_number,bill_type,action,main_action,category,member_id
3,116,1199,s,"Committee on Health, Education, Labor, and Pen...",senate committee/subcommittee actions,senate bill,1561
4,116,1208,s,Committee on the Judiciary. Reported by Senato...,senate committee/subcommittee actions,senate bill,1580
5,116,1231,s,Committee on the Judiciary. Reported by Senato...,senate committee/subcommittee actions,senate bill,1580
6,116,1228,s,"Committee on Commerce, Science, and Transporta...",senate committee/subcommittee actions,senate bill,1002
7,116,123,s,Committee on Veterans' Affairs. Reported by Se...,senate committee/subcommittee actions,senate bill,1490
...,...,...,...,...,...,...,...
2944,116,617,hr,Committee on Energy and Natural Resources. Rep...,senate committee/subcommittee actions,house bill,1581
3081,116,762,hr,Committee on Energy and Natural Resources. Rep...,senate committee/subcommittee actions,house bill,1581
3142,116,828,hr,Committee on Homeland Security and Governmenta...,senate committee/subcommittee actions,house bill,1701
3150,116,829,hr,Committee on Homeland Security and Governmenta...,senate committee/subcommittee actions,house bill,1701


In [8]:
# Replace the word "Senator" with "Sen." and store the result in a new column
bills["action_custom"] = bills["action"].str.replace("Senator","Sen.")

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  bills["action_custom"] = bills["action"].str.replace("Senator","Sen.")


In [9]:
# To avoid this warning, work on an explicit copy of the subset
# and then assign the new column with ".loc":
bills = bills.copy()
bills.loc[:, "action_custom"] = bills["action"].str.replace("Senator", "Sen.")

<font size = "5">

Try it yourself!

- Obtain a new dataset called "resolutions" <br>
 that keeps only the rows whose "category" value is one of:

 ``` ["house resolution","senate resolution"] ```

In [10]:
# Write your own code

# III. Regular Expressions 

<font size = "5">

Regular expressions enable advanced searching <br>
for string data.
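
Before applying patterns to a whole column, it can help to try them on a single string. The following is a small sketch using Python's built-in `re` module on made-up text (the strings are assumptions, not rows from the dataset). It also shows why the period is escaped as `\.` in the patterns used later:

```python
import re

# Made-up example text, not taken from the dataset
text = "S.Amdt.1274 was proposed, then S.Amdt.98 and H.Amdt.5 followed."

# "\." matches a literal period; "\d+" matches one or more digits
escaped = re.findall(r"Amdt\.\d+", text)

# An unescaped "." matches any single character, so it also accepts "Amdt 12"
loose = re.findall(r"Amdt.\d+", "Amdt 12 and Amdt.34")
strict = re.findall(r"Amdt\.\d+", "Amdt 12 and Amdt.34")

print(escaped)
print(loose)
print(strict)
```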

In [11]:
dataset = pd.read_csv("data_raw/bills_actions.csv")
senate_bills = dataset.query('category == "senate bill"')
amendments = dataset.query('category == "amendment"')

In [12]:
dataset[dataset['action'].str.contains('to reconsider')]

Unnamed: 0,Congress,bill_number,bill_type,action,main_action,category,member_id
38,116,1,s,Motion by Senator McConnell to reconsider the ...,senate floor actions,senate bill,858
39,116,1,s,Motion by Senator McConnell to reconsider the ...,senate floor actions,senate bill,858
40,116,1,s,Motion by Senator McConnell to reconsider the ...,senate floor actions,senate bill,858
41,116,1,s,Motion by Senator McConnell to reconsider the ...,senate floor actions,senate bill,858
268,116,2657,s,Motion by Senator McConnell to reconsider the ...,senate floor actions,senate bill,858
269,116,2657,s,S.Amdt.1407 Motion by Senator McConnell to rec...,other senate amendment actions,amendment,858
400,116,3985,s,Motion by Senator McConnell to reconsider the ...,senate floor actions,senate bill,858
548,116,50,sres,Motion by Senator McConnell to reconsider the ...,senate floor actions,senate resolution,858
823,116,28,hjres,VITIATION OF EARLIER PROCEEDINGS - Mr. Hoyer a...,house floor actions,house joint resolution,1065
1023,116,758,hres,Mr. Nadler moved to table the motion to recons...,house floor actions,house resolution,546


<font size = "5">

Search word

In [13]:
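# We use the ".str.findall()" method, which returns a list of all matches per row
# The argument is a regular expression: "Amdt\." is the literal text "Amdt.",
# "\d+" is one or more digits, and "\D" is one non-digit character (here a space)
import re

amendments["action"].str.findall(r"Amdt\.\d+\D")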


0       [Amdt.1274 ]
1       [Amdt.2698 ]
2       [Amdt.2659 ]
8       [Amdt.2424 ]
11      [Amdt.1275 ]
            ...     
3298     [Amdt.172 ]
3299     [Amdt.171 ]
3300     [Amdt.170 ]
3301              []
3302     [Amdt.169 ]
Name: action, Length: 1529, dtype: object
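
If only the amendment number is needed, one option is a capture group with `.str.extract()`, which returns the text inside the parentheses rather than the whole match. This is a sketch on made-up strings (`toy` is an assumption for illustration):

```python
import pandas as pd

# Made-up action texts; the last one contains no amendment number
toy = pd.Series(["S.Amdt.1274 proposed", "H.Amdt.172 agreed to", "Motion to table"])

# ".str.findall()" returns every full match, including the trailing non-digit
with_trailing = toy.str.findall(r"Amdt\.\d+\D")

# ".str.extract()" returns only the captured group "(\d+)" of the first match
# expand=False gives a Series instead of a one-column DataFrame
numbers = toy.str.extract(r"Amdt\.(\d+)", expand=False)

print(with_trailing)
print(numbers)
```

Rows without a match come back as an empty list from `.str.findall()` and as a missing value from `.str.extract()`.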

<font size = "5">

Wildcards

$\quad$ <img src="figures/wildcards_regex1.png" alt="drawing" width="300"/>

In [14]:
# Get digits after string
amendments["action"].str.findall(r"Amdt\.\d+")

0       [Amdt.1274]
1       [Amdt.2698]
2       [Amdt.2659]
8       [Amdt.2424]
11      [Amdt.1275]
           ...     
3298     [Amdt.172]
3299     [Amdt.171]
3300     [Amdt.170]
3301             []
3302     [Amdt.169]
Name: action, Length: 1529, dtype: object

In [15]:
# Match one word character (letter, digit, or underscore) before "mdt."
amendments["action"].str.findall(r"\wmdt\.")

0       [Amdt.]
1       [Amdt.]
2       [Amdt.]
8       [Amdt.]
11      [Amdt.]
         ...   
3298    [Amdt.]
3299    [Amdt.]
3300    [Amdt.]
3301         []
3302    [Amdt.]
Name: action, Length: 1529, dtype: object

In [16]:
# Match two word characters before "dt." and exactly four word characters after it
# Amendment numbers with fewer than four digits therefore return an empty list
amendments["action"].str.findall(r"\w{2}dt\.\w{4}")

0       [Amdt.1274]
1       [Amdt.2698]
2       [Amdt.2659]
8       [Amdt.2424]
11      [Amdt.1275]
           ...     
3298             []
3299             []
3300             []
3301             []
3302             []
Name: action, Length: 1529, dtype: object
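
The empty lists at the bottom of the output above appear because `{4}` demands exactly four word characters. A range quantifier such as `{3,4}` is one way to accept both three-digit and four-digit numbers. A sketch on made-up strings:

```python
import pandas as pd

# Made-up action texts with a four-digit and a three-digit amendment number
toy = pd.Series(["S.Amdt.1274 proposed", "H.Amdt.172 agreed to"])

# Exactly four word characters after "dt."
fixed = toy.str.findall(r"\w{2}dt\.\w{4}")

# Between three and four word characters after "dt."
flexible = toy.str.findall(r"\w{2}dt\.\w{3,4}")

print(fixed)
print(flexible)
```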

<font size = "5">

Wildcards + Quantifiers

$\quad$ <img src="figures/wildcards_regex2.png" alt="drawing" width="300"/>

In [17]:
# Match any characters (including none) before "Amdt" followed by non-whitespace
amendments["action"].str.findall(r".*Amdt\S*")

0       [S.Amdt.1274]
1       [S.Amdt.2698]
2       [S.Amdt.2659]
8       [S.Amdt.2424]
11      [S.Amdt.1275]
            ...      
3298     [H.Amdt.172]
3299     [H.Amdt.171]
3300     [H.Amdt.170]
3301               []
3302     [H.Amdt.169]
Name: action, Length: 1529, dtype: object
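
One caveat: `.*` is greedy, so it grabs as much text as it can. When a string mentions more than one amendment, the pattern above returns a single long match. The sketch below uses a made-up sentence to illustrate this, and shows `\S*` as one possible alternative that stops at whitespace:

```python
import re

# Made-up text that mentions two amendments
text = "S.Amdt.1274 replaced S.Amdt.98 today"

# Greedy ".*" runs to the last "Amdt" in the string, giving one long match
greedy = re.findall(r".*Amdt\S*", text)

# "\S*" only spans non-whitespace characters, so each amendment is matched separately
per_word = re.findall(r"\S*Amdt\S*", text)

print(greedy)
print(per_word)
```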

In [18]:
# Get all consecutive digits after string
amendments["action"].str.findall(r"Amdt\.\d+")

0       [Amdt.1274]
1       [Amdt.2698]
2       [Amdt.2659]
8       [Amdt.2424]
11      [Amdt.1275]
           ...     
3298     [Amdt.172]
3299     [Amdt.171]
3300     [Amdt.170]
3301             []
3302     [Amdt.169]
Name: action, Length: 1529, dtype: object

<font size = "5">

Try it yourself

- Practice using the ```senate_bills``` dataset
- Use ```.str.findall()``` to find the word "Senator"
- Use the regular expression ```"Senator \S"``` to extract the first letter of each senator's name
- Use the ```*``` quantifier to extract the senators' full names

In [19]:
# Write your own code