In [1]:
import cudf

In [2]:
df = cudf.read_parquet('../data/american_stories_1938_1945.parquet')

In [3]:
!nvidia-smi

Fri Feb  7 21:11:43 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA RTX 5000 Ada Gene...    Off |   00000000:01:00.0 Off |                  Off |
| 30%   32C    P2             55W /  250W |    5256MiB /  32760MiB |      7%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

In [4]:
df.head(1)

Unnamed: 0,article_id,newspaper_name,edition,date,page,headline,byline,article
0,1_1938-11-08_p3_sn82014085_00393347429_1938110...,The Waterbury Democrat.,1,1938-11-08,p3,Fear Heavy Toll Among Civilians In Next Conflict,,Recognition of a probable heavy toll among non...


## Changing a Column Type

In this section, we'll convert the 'date' column from strings to datetime64 values using cuDF's to_datetime() function. With a proper datetime dtype, we can use date-based operations such as extracting the year or grouping by time period.

In [5]:
# Convert date column to datetime
df['date'] = cudf.to_datetime(df['date'])
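
To illustrate what the conversion makes possible, here is a minimal CPU-only sketch that uses Python's standard datetime module as a stand-in for cuDF. The sample string is copied from the first row shown above, and the variable names are illustrative only. Once a date string is parsed, its components become directly accessible, which is the same idea behind the .dt.year accessor used later in this notebook.

```python
from datetime import datetime

raw_date = "1938-11-08"  # same ISO-style format as the 'date' column

# Parse the string into a datetime object
parsed = datetime.strptime(raw_date, "%Y-%m-%d")

# Date components are now available as attributes
print(parsed.year, parsed.month, parsed.day)
```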


In [6]:
df.info()

<class 'cudf.core.dataframe.DataFrame'>
RangeIndex: 4368788 entries, 0 to 4368787
Data columns (total 8 columns):
 #   Column          Dtype
---  ------          -----
 0   article_id      object
 1   newspaper_name  object
 2   edition         object
 3   date            datetime64[ns]
 4   page            object
 5   headline        object
 6   byline          object
 7   article         object
dtypes: datetime64[ns](1), object(7)
memory usage: 3.9+ GB


## Using str.findall() for Pattern Matching

The str.findall() method returns all matches of a regular expression in each string as a list. Here we use it to collect the capitalized words in each article's text.

The regex pattern '\b[A-Z][a-zA-Z]{4,}\b' breaks down as:
- \b: Word boundary
- [A-Z]: First letter must be a capital
- [a-zA-Z]{4,}: Followed by 4 or more letters (upper or lowercase), so a match has at least 5 letters in total
- \b: Word boundary

This gives a rough list of proper nouns and other prominent terms. It also captures ordinary words that are capitalized at the start of a sentence, such as "Recognition" or "Conforming" in the output below.
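
The pattern's behavior can be checked on a small example. The sketch below uses Python's built-in re module rather than cuDF, on the assumption that this simple pattern behaves the same way in both engines; the sample sentence is made up for illustration.

```python
import re

# Same pattern as in the cell below: a capital letter plus 4 or more letters
pattern = re.compile(r"\b[A-Z][a-zA-Z]{4,}\b")

sample = "There was heavy fog in Paris, said Mr. Smith of Ohio."
matches = pattern.findall(sample)

print(matches)  # ['There', 'Paris', 'Smith']
```

"Ohio" and "Mr" are skipped because they have fewer than five letters, while the sentence-initial "There" is matched even though it is not a proper noun.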


In [7]:
# Add capitalized words as a new column (words of 5 or more letters)
df['capitalized_words'] = df['article'].str.findall(r'\b[A-Z][a-zA-Z]{4,}\b')
df

Unnamed: 0,article_id,newspaper_name,edition,date,page,headline,byline,article,capitalized_words
0,1_1938-11-08_p3_sn82014085_00393347429_1938110...,The Waterbury Democrat.,01,1938-11-08,p3,Fear Heavy Toll Among Civilians In Next Conflict,,Recognition of a probable heavy toll among non...,"[Recognition, International, Conference, Paris..."
1,3_1938-11-08_p3_sn82014085_00393347429_1938110...,The Waterbury Democrat.,01,1938-11-08,p3,,,"Conforming to tradition, the Democratic candid...","[Conforming, Democratic, Bridge, Attorney, Pat..."
2,4_1938-11-08_p3_sn82014085_00393347429_1938110...,The Waterbury Democrat.,01,1938-11-08,p3,Audience Thrilled By\n\n Early Masters Works\n...,,second by Kasper Ferdinand Fisch- CT.\n\n FOlk...,"[Kasper, Ferdinand, Fisch, FOlkSOngS, Trapp, C..."
3,5_1938-11-08_p3_sn82014085_00393347429_1938110...,The Waterbury Democrat.,01,1938-11-08,p3,Democrats Institute Court Action T oday\n\n To...,,in behalf of Charles Maloney of Se4 East Main ...,"[Charles, Maloney, Attorney, Lynch, Democratic..."
4,6_1938-11-08_p3_sn82014085_00393347429_1938110...,The Waterbury Democrat.,01,1938-11-08,p3,q WOMEN SEEK\n\n ELECTION JOBS\n\nNone Candida...,BY RUBY A. BLACK nifAd preea Sfaff Corresnovad...,NUnlte0 freSS Stam COrreSpOn0ent)\n\n Washingt...,"[Washington, Twenty, United, States, Senate, H..."
...,...,...,...,...,...,...,...,...,...
4368783,28_1945-12-05_p7_sn88063294_00340589130_194512...,Detroit evening times.,01,1945-12-05,p7,,,Here's a contest you win! And what prize!\nA b...,"[Chevrolet, Victory, Bonds, Victory]"
4368784,10_1945-04-18_p35_sn83045462_00280604082_19450...,Evening star.,01,1945-04-18,p35,"ADVERTISEMENT,\n\nADNL n lSLhLt>1\nTorment OF ...",,II you can't get your feet of your mind\nbecau...,"[Presto, Scholl, Department, Stores, Toiletry,..."
4368785,3_1945-04-18_p35_sn83045462_00280604082_194504...,Evening star.,01,1945-04-18,p35,Jury Rules Chaplin\nIs.Father;; Conference\nOn...,By the Associated Press.,"By the Associated Press.\n\n\nLOS ANGELES, Apr...","[Associated, Press, ANGELES, April, Court, Cha..."
4368786,21_1945-04-18_p35_sn83045462_00280604082_19450...,Evening star.,01,1945-04-18,p35,Use Your\nBeIdqet Account,,"It's easy as A-B-C to open a\nCharge, Budget o...","[Charge, Budget, Coupon, Goldenberg, Floor, Cr..."


## str.contains() for Simple Pattern Matching

The str.contains() method checks whether each string contains a match for a pattern and returns a boolean Series. In our case, we're using it to:

1. Check whether the article text contains "war" or "War"

2. Create a new column to store the result

Note that this is a substring match, not a whole-word match: an article containing "toward" or "award" also counts as a hit.
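
The difference between substring and whole-word matching can be sketched with Python's built-in re module as a CPU stand-in. The sample sentences and the word-boundary pattern are illustrative assumptions, not part of the original analysis.

```python
import re

texts = [
    "The war ended.",
    "He walked toward town.",
    "War bonds on sale.",
    "A quiet day.",
]

# Substring match, as in the cell below: "toward" counts as a hit
substring_hits = [bool(re.search("war|War", t)) for t in texts]

# Whole-word variant using word boundaries: "toward" no longer matches
word_hits = [bool(re.search(r"\b[Ww]ar\b", t)) for t in texts]

print(substring_hits)  # [True, True, True, False]
print(word_hits)       # [True, False, True, False]
```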

In [8]:
df['contains_war'] = df['article'].str.contains('war|War', regex=True)
df["contains_war"].value_counts()

contains_war
False    3240769
True     1128019
Name: count, dtype: int64

## str.lower() for Case-Insensitive Matching

Lowercasing the text with str.lower() before matching makes the search case-insensitive, so it also catches variants such as "WAR" that the pattern 'war|War' misses. In our case, we're using it to:

1. Convert the article text to lowercase

2. Check if the lowercased text contains "war"

In [9]:
df['contains_war_lower'] = df['article'].str.lower().str.contains('war')
df["contains_war_lower"].value_counts()

contains_war_lower
False    3217171
True     1151617
Name: count, dtype: int64
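
The lowercase approach reports more matches than the 'war|War' pattern did. A likely reason is text in other capitalizations, such as all-caps headlines. The sketch below reproduces the effect on made-up strings, using plain Python instead of cuDF.

```python
import re

samples = ["WAR DECLARED", "War bonds", "the war", "peace talks"]

# Pattern from the earlier cell: only two capitalizations are covered
regex_hits = [bool(re.search("war|War", s)) for s in samples]

# Lowercasing first makes the check case-insensitive
lower_hits = ["war" in s.lower() for s in samples]

print(regex_hits)  # [False, True, True, False]
print(lower_hits)  # [True, True, True, False]
```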

## Comparing Performance of Different Approaches

In this section, we'll compare the execution time of the two approaches for checking whether the article text contains "war". We'll use Python's time module to measure each approach with a single run, so the numbers are only a rough guide.


In [10]:
import time

# Benchmark regex approach (perf_counter is a high-resolution timer suited to measuring intervals)
start = time.perf_counter()
df['contains_war_regex'] = df['article'].str.contains('war|War', regex=True)
regex_time = time.perf_counter() - start

# Benchmark lowercase approach
start = time.perf_counter()
df['contains_war_lower'] = df['article'].str.lower().str.contains('war')
lower_time = time.perf_counter() - start

print(f"Regex time: {regex_time:.4f} seconds")
print(f"Lowercase time: {lower_time:.4f} seconds")

Regex time: 0.5435 seconds
Lowercase time: 0.1004 seconds
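
A single timed run can be skewed by one-time costs, so repeating each measurement and taking the median gives a steadier comparison. The helper below is a hypothetical sketch (time_call is not part of cuDF or this notebook), shown on a pure-Python workload so it runs without a GPU. To apply it to the cells above, the lambdas would wrap the cuDF expressions instead.

```python
import re
import statistics
import time

def time_call(fn, repeats=5):
    """Return the median wall-clock time of fn() over several runs."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

# Small made-up workload standing in for the 'article' column
texts = ["The war ended.", "He walked toward town.", "A quiet day."] * 1000

regex_time = time_call(lambda: [bool(re.search("war|War", t)) for t in texts])
lower_time = time_call(lambda: ["war" in t.lower() for t in texts])

print(f"Regex time: {regex_time:.4f} seconds")
print(f"Lowercase time: {lower_time:.4f} seconds")
```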


## Exploding a List Column into Multiple Rows

In this section, we'll explode the 'capitalized_words' column, which contains a list of words, into separate rows. This allows us to analyze each word individually and count their frequencies.

In [11]:
# Explode the capitalized_words list into separate rows
exploded_words = df['capitalized_words'].explode()
exploded_words.head(5)

0      Recognition
0    International
0       Conference
0            Paris
0        Countries
Name: capitalized_words, dtype: object
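
Conceptually, explode turns every list element into its own row and repeats the original row index, which is why the index 0 appears five times above. A minimal pure-Python sketch of that idea follows, with made-up data. Note that this sketch simply drops empty lists, which may differ from how a DataFrame library represents them.

```python
# Each inner list stands for one article's capitalized words
rows = [["Recognition", "Paris"], ["Democratic"], []]

# Pair every word with the index of the row it came from
exploded = [(index, word) for index, words in enumerate(rows) for word in words]

print(exploded)  # [(0, 'Recognition'), (0, 'Paris'), (1, 'Democratic')]
```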

We can check the type of exploded_words to confirm that it is a cuDF Series. The object dtype shown above indicates that its elements are strings.

In [12]:
type(exploded_words)

cudf.core.series.Series

Let's calculate the word frequencies for the exploded_words column.

In [13]:
# Calculate word frequencies
word_frequencies = exploded_words.value_counts()

# Display top 20 most frequent capitalized words
print("Top 20 most frequent capitalized words:")
print(word_frequencies.head(20))

Top 20 most frequent capitalized words:
capitalized_words
American      534776
Washington    505441
United        454882
States        449000
State         332361
District      326188
George        325751
Sunday        320644
William       295381
There         288402
National      258777
President     257932
German        253964
British       251564
Church        250112
James         249267
Charles       229777
Press         225028
House         209311
Saturday      206893
Name: count, dtype: int64
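
For intuition, value_counts() tallies how often each distinct value occurs and lists the most frequent first. The same idea can be sketched on the CPU with collections.Counter from the standard library; the word list here is made up for illustration.

```python
from collections import Counter

words = ["American", "Washington", "American", "United", "Washington", "American"]

# Tally occurrences of each distinct word
frequencies = Counter(words)

# most_common(n) returns the n most frequent words, highest count first
top_two = frequencies.most_common(2)
print(top_two)  # [('American', 3), ('Washington', 2)]
```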


## Real-World Example: Tracking Historical Figures

In this section, we'll use the str.contains() method to analyze how frequently certain historical figures were mentioned in newspaper articles over time. We'll focus on major World War II leaders: Roosevelt, Hitler, Churchill, and Stalin.

The analysis has three steps:
1. Create a boolean column for each figure indicating whether each article mentions that surname
2. Extract the year from the article dates
3. Group the data by year and sum the boolean columns to count mentions per year

This analysis can help us understand:
- Which leaders received more media attention
- How coverage of different leaders changed throughout the war
- Potential correlations between historical events and media coverage



In [14]:
# Extract year from date and track mentions of historical figures by year
historical_figures = ['Roosevelt', 'Hitler', 'Churchill', 'Stalin']

# Create columns for each figure
for figure in historical_figures:
    df[f'contains_{figure}'] = df['article'].str.contains(figure)

# Extract year and group by it
df['year'] = df['date'].dt.year

# Create a summary dataframe for all figures by year
yearly_summary = df.groupby('year')[[f'contains_{figure}' for figure in historical_figures]].sum()
print("\nYearly mention counts:")
print(yearly_summary)



Yearly mention counts:
      contains_Roosevelt  contains_Hitler  contains_Churchill  contains_Stalin
year                                                                          
1941               15981            10071                3167              628
1943               10558             4814                3443             1230
1939               14384             7686                 591              337
1938               18187             6352                 516              181
1942               11829             6410                2498             2265
1945                9359             4812                3480             1210
1940               14811             6300                1664              284
1944               12081             4205                2452              741
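
The rows above are not in chronological order because the grouped result was not sorted by year. In cuDF, calling yearly_summary.sort_index() should arrange them chronologically. The grouping logic itself can be sketched in plain Python on a few made-up articles. The data and variable names below are illustrative assumptions, not taken from the dataset.

```python
from collections import defaultdict

# (year, article text) pairs standing in for the DataFrame rows
articles = [
    (1941, "Churchill met Roosevelt at sea."),
    (1938, "Roosevelt spoke today."),
    (1938, "Hitler and Roosevelt exchanged notes."),
]
figures = ["Roosevelt", "Hitler", "Churchill", "Stalin"]

# One counter per year, starting at zero for every figure
counts = defaultdict(lambda: dict.fromkeys(figures, 0))
for year, text in articles:
    for figure in figures:
        # A boolean adds as 0 or 1, mirroring sum() over boolean columns
        counts[year][figure] += figure in text

# Iterate in sorted order so the years appear chronologically
for year in sorted(counts):
    print(year, counts[year])
```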
