In [1]:
import pandas as pd

## Loading Speed Test

In [2]:
%%time

df = pd.read_parquet('../data/american_stories_1938_1945.parquet')

CPU times: user 6.67 s, sys: 4.7 s, total: 11.4 s
Wall time: 7.68 s


## Using the cudf.pandas extension

In [3]:
%load_ext cudf.pandas

In [4]:
import pandas as pd

In [5]:
%%time

df = pd.read_parquet('../data/american_stories_1938_1945.parquet')

CPU times: user 673 ms, sys: 648 ms, total: 1.32 s
Wall time: 799 ms


In [6]:
del df

## The cuDF Profiler

In [7]:
%%cudf.pandas.profile

df = pd.read_parquet('../data/american_stories_1938_1945.parquet')

In [8]:
!nvidia-smi

Wed Apr 30 19:02:27 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA RTX 5000 Ada Gene...    Off |   00000000:01:00.0  On |                  Off |
| 30%   33C    P8             24W /  250W |   11223MiB /  32760MiB |     15%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

## Selecting a Column by its Label

You can select a single column from a DataFrame by putting its label (the column name) in square brackets, much like looking up a dictionary value by its key. The result is a Series containing just that column's data. For example, `df['newspaper_name']` returns the newspaper name for every article in the dataset.
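
As a minimal sketch of column selection, the snippet below uses a small made-up DataFrame (`toy`, not the American Stories data) so that it runs on its own with plain pandas. It also shows that passing a list of labels gives back a DataFrame rather than a Series.

```python
import pandas as pd

# Toy data standing in for the real dataset (values are illustrative only)
toy = pd.DataFrame({
    "newspaper_name": ["Evening star.", "The Waterbury Democrat.", "Evening star."],
    "page": ["p21", "p4", "p35"],
})

# A single label in square brackets returns a Series
names = toy["newspaper_name"]
print(type(names).__name__)

# A list of labels returns a DataFrame, even with only one column
subset = toy[["newspaper_name"]]
print(type(subset).__name__)
```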



## Selecting Rows by Value and Boolean Indexing

You can filter rows in a DataFrame based on column values using boolean conditions. In the example below, the condition `df['newspaper_name'] == 'Evening star.'` selects only the rows whose `newspaper_name` is exactly `'Evening star.'` (the trailing period is part of the name as stored in this dataset). The result is a new DataFrame containing just the matching rows.


More generally, boolean indexing works in three steps:

1. Create boolean masks using comparison operators (==, !=, >, <, etc.)
2. Combine multiple conditions using logical operators (&, |, ~)
3. Use the mask to select matching rows from the DataFrame
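
The three steps above can be sketched on a small made-up DataFrame (`toy` and its values are assumptions for illustration, not rows from the real dataset):

```python
import pandas as pd

# Toy data standing in for the real dataset (values are illustrative only)
toy = pd.DataFrame({
    "newspaper_name": ["Evening star.", "The Waterbury Democrat.", "Evening star."],
    "date": ["1938-07-18", "1938-12-03", "1945-04-18"],
})

# Step 1: build boolean masks, one True/False value per row
is_star = toy["newspaper_name"] == "Evening star."
is_1938 = toy["date"].str.startswith("1938")

# Step 2: combine masks with & (and), | (or), ~ (not)
both = is_star & is_1938
either = is_star | is_1938
not_star = ~is_star

# Step 3: use a mask inside square brackets to keep only the True rows
print(toy[both])
print(toy[either])
print(toy[not_star])
```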


In [9]:
%%cudf.pandas.profile

df[df['newspaper_name'] == 'Evening star.']

Unnamed: 0,article_id,newspaper_name,edition,date,page,headline,byline,article
28,3_1938-07-18_p21_sn83045462_00280601809_193807...,Evening star.,01,1938-07-18,p21,DEATHS REPORTED,,"Carrie F. Mason, 30, Portner Apartments.\nWill..."
29,4_1938-07-18_p21_sn83045462_00280601809_193807...,Evening star.,01,1938-07-18,p21,CITY NEWS IN BRIEF.,TODAY.,"TO-DAY.\n\n\nExcursion, st. Elizabeth's Hospit..."
30,5_1938-07-18_p21_sn83045462_00280601809_193807...,Evening star.,01,1938-07-18,p21,A|RLINES' SAFETY\n' IS FORUM THEME\n\nClinton ...,,Safety on the airlines is tonight's\nRadio For...
31,6_1938-07-18_p21_sn83045462_00280601809_193807...,Evening star.,01,1938-07-18,p21,,,on q par but the soldiers have COM\nplanned fo...
32,8_1938-07-18_p21_sn83045462_00280601809_193807...,Evening star.,01,1938-07-18,p21,BIRTHS REPORTED,,"Clarence and Helen Norris, boy.\nJohn and Phyl..."
...,...,...,...,...,...,...,...,...
4368774,39_1945-11-24_p20_sn83045462_0028060463A_19451...,Evening star.,01,1945-11-24,p20,,,80 - 1'Ae' iish oho <eiiiid\n\n\nTfiYfPiIss;TE...
4368784,10_1945-04-18_p35_sn83045462_00280604082_19450...,Evening star.,01,1945-04-18,p35,"ADVERTISEMENT,\n\nADNL n lSLhLt>1\nTorment OF ...",,II you can't get your feet of your mind\nbecau...
4368785,3_1945-04-18_p35_sn83045462_00280604082_194504...,Evening star.,01,1945-04-18,p35,Jury Rules Chaplin\nIs.Father;; Conference\nOn...,By the Associated Press.,"By the Associated Press.\n\n\nLOS ANGELES, Apr..."
4368786,21_1945-04-18_p35_sn83045462_00280604082_19450...,Evening star.,01,1945-04-18,p35,Use Your\nBeIdqet Account,,"It's easy as A-B-C to open a\nCharge, Budget o..."


## Indexing by Row with Loc

The `.loc` accessor selects rows by their index labels.
You can use it to:
- Select a single row by label
- Select a range of rows with `start:end` labels
- Select a specific set of rows by passing a list of labels

In the example below, we use `.loc` to select the rows labeled 2000 through 2004.
Unlike ordinary Python slices, a `.loc` slice includes both the start and the end label, so five rows are returned.
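
To make the three uses of `.loc` and the inclusive end label concrete, here is a small sketch on a made-up DataFrame (the `toy` frame and its `headline` values are hypothetical). It also contrasts `.loc` with the position-based `.iloc`, whose end position is exclusive.

```python
import pandas as pd

# Toy frame whose index labels run from 2000 to 2007
toy = pd.DataFrame({"headline": list("abcdefgh")}, index=range(2000, 2008))

# Single row by label: returns a Series for that row
single = toy.loc[2000]

# Label slice: both 2000 and 2004 are included, so 5 rows
label_slice = toy.loc[2000:2004]

# List of labels: returns exactly those rows, in the order given
picked = toy.loc[[2001, 2006]]

# Positional slice with .iloc: end position is excluded, so 4 rows
positional = toy.iloc[0:4]

print(len(label_slice), len(positional))
```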


In [10]:
%%cudf.pandas.profile

df.loc[2000:2004]

Unnamed: 0,article_id,newspaper_name,edition,date,page,headline,byline,article
2000,11_1938-12-03_p4_sn82014085_00393347429_193812...,The Waterbury Democrat.,1,1938-12-03,p4,Christ. Science,CHRISTIAN SCIENCE SERVICES,CHRISTIAN SCIENCE SERVICES\n\n God the Only Ca...
2001,12_1938-12-03_p4_sn82014085_00393347429_193812...,The Waterbury Democrat.,1,1938-12-03,p4,ROTARY BOWLERS\n\n DEFEAT KIWANIS\n\nHarold Po...,,Four out of six games were won by Rotary Club ...
2002,13_1938-12-03_p4_sn82014085_00393347429_193812...,The Waterbury Democrat.,1,1938-12-03,p4,UNION COUNCIL\n\n SUPPORTS TONE\n\ndisappointm...,,State Labor Commissioner Joseph I'M. Tones rea...
2003,14_1938-12-03_p4_sn82014085_00393347429_193812...,The Waterbury Democrat.,1,1938-12-03,p4,ORDER OF VASA\n\n PICKS CRANDELL\n\nGota Lejon...,,"Sixty members of Gota Lejon, Or der of Vasa at..."
2004,15_1938-12-03_p4_sn82014085_00393347429_193812...,The Waterbury Democrat.,1,1938-12-03,p4,ERNEST PAKUL\n\n GIVEN PERMIT\n\nPoliceman to ...,,There were IA one-family dwell lings built in ...


## Multi-Condition Example
 
In this example, we combine two filtering conditions to find specific articles:

1. Articles published on December 3rd, 1938 ('1938-12-03')
2. Articles from 'The Waterbury Democrat.' newspaper

The `&` operator combines the two conditions so that only rows matching both are returned. Each comparison must be wrapped in parentheses, because `&` binds more tightly than `==` in Python. The output shows the dataset narrowed down to articles from The Waterbury Democrat published on that date.
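
Before running the filter on the full dataset, it can help to see what the combined mask looks like on a few rows. The sketch below uses a made-up three-row DataFrame (`toy` is an assumption for illustration) and names each condition separately, which is one way to keep the parentheses manageable.

```python
import pandas as pd

# Toy data standing in for the real dataset (values are illustrative only)
toy = pd.DataFrame({
    "date": ["1938-12-03", "1938-12-03", "1945-04-18"],
    "newspaper_name": [
        "The Waterbury Democrat.",
        "Evening star.",
        "The Waterbury Democrat.",
    ],
})

# Name each condition, then combine them with &
on_date = toy["date"] == "1938-12-03"
from_paper = toy["newspaper_name"] == "The Waterbury Democrat."
mask = on_date & from_paper

# The mask is a boolean Series; summing it counts the matching rows
print(mask.tolist())
n_matches = int(mask.sum())
filtered = toy[mask]
print(n_matches)
```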


In [11]:
%%cudf.pandas.profile

mask = (df['date'] == '1938-12-03') & (df['newspaper_name'] == 'The Waterbury Democrat.')
filtered_df = df[mask]

# Show the first few rows
filtered_df.head(5)


Unnamed: 0,article_id,newspaper_name,edition,date,page,headline,byline,article
1992,2_1938-12-03_p4_sn82014085_00393347429_1938120...,The Waterbury Democrat.,1,1938-12-03,p4,,,ganizations will be represented at the program...
1993,4_1938-12-03_p4_sn82014085_00393347429_1938120...,The Waterbury Democrat.,1,1938-12-03,p4,Rabbi Liebnaan\n\n ToT Tour Schools,,Nine of the most famous Protestant theological...
1994,5_1938-12-03_p4_sn82014085_00393347429_1938120...,The Waterbury Democrat.,1,1938-12-03,p4,|DE MILLEPICKS l GREATEST FOOLS\n\nNoted Autho...,,Can you name the ten greatest fools in history...
1995,6_1938-12-03_p4_sn82014085_00393347429_1938120...,The Waterbury Democrat.,1,1938-12-03,p4,,,Rabbi Moses D. Sheinkopf will de- liver the pr...
1996,7_1938-12-03_p4_sn82014085_00393347429_1938120...,The Waterbury Democrat.,1,1938-12-03,p4,l WATR Rearranges Schedule For New Radio Highl...,(BY CHARLES CUTLERY,It's beyond the capabilities of even the best ...


In [12]:
filtered_df

Unnamed: 0,article_id,newspaper_name,edition,date,page,headline,byline,article
1992,2_1938-12-03_p4_sn82014085_00393347429_1938120...,The Waterbury Democrat.,01,1938-12-03,p4,,,ganizations will be represented at the program...
1993,4_1938-12-03_p4_sn82014085_00393347429_1938120...,The Waterbury Democrat.,01,1938-12-03,p4,Rabbi Liebnaan\n\n ToT Tour Schools,,Nine of the most famous Protestant theological...
1994,5_1938-12-03_p4_sn82014085_00393347429_1938120...,The Waterbury Democrat.,01,1938-12-03,p4,|DE MILLEPICKS l GREATEST FOOLS\n\nNoted Autho...,,Can you name the ten greatest fools in history...
1995,6_1938-12-03_p4_sn82014085_00393347429_1938120...,The Waterbury Democrat.,01,1938-12-03,p4,,,Rabbi Moses D. Sheinkopf will de- liver the pr...
1996,7_1938-12-03_p4_sn82014085_00393347429_1938120...,The Waterbury Democrat.,01,1938-12-03,p4,l WATR Rearranges Schedule For New Radio Highl...,(BY CHARLES CUTLERY,It's beyond the capabilities of even the best ...
...,...,...,...,...,...,...,...,...
598946,18_1938-12-03_p3_sn82014085_00393347429_193812...,The Waterbury Democrat.,01,1938-12-03,p3,KILLED\n\nINJURED\n\nN. Y. BROKERS FUNERAL,,"Greenwich. Conn, Dec. 3-(UP),- Funeral service..."
598947,19_1938-12-03_p3_sn82014085_00393347429_193812...,The Waterbury Democrat.,01,1938-12-03,p3,"CHIMNEY TOPPLES,\n\n FIREMAN INJURED\n\nPrivat...",,Leg injuries were suffered by Private Maurice ...
598948,20_1938-12-03_p3_sn82014085_00393347429_193812...,The Waterbury Democrat.,01,1938-12-03,p3,,,avenue. JUNE\n\n g. Captain James Mcdonald suf...
598949,29_1938-12-03_p3_sn82014085_00393347429_193812...,The Waterbury Democrat.,01,1938-12-03,p3,Outstanding EDents\n\n OF Year ChronicLed,,"Six deaths resulting from fire, accidental dea..."
