In [1]:
import cudf

In [2]:
df = cudf.read_parquet('../data/american_stories_1938_1945.parquet')

In [3]:
!nvidia-smi

Fri Feb  7 20:50:27 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA RTX 5000 Ada Gene...    Off |   00000000:01:00.0 Off |                  Off |
| 30%   32C    P2             55W /  250W |    5172MiB /  32760MiB |      7%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

In [4]:
df.head(1)

Unnamed: 0,article_id,newspaper_name,edition,date,page,headline,byline,article
0,1_1938-11-08_p3_sn82014085_00393347429_1938110...,The Waterbury Democrat.,1,1938-11-08,p3,Fear Heavy Toll Among Civilians In Next Conflict,,Recognition of a probable heavy toll among non...


## Creating Numeric Columns

In this section, we'll create two numeric columns to help analyze our articles:
- word_count: The approximate number of words in each article (the number of space characters plus one)
- text_length: The total number of characters in each article

We will use these columns to calculate the average word count and text length of our articles, as well as other statistics.

In [5]:
df['word_count'] = df['article'].str.count(' ') + 1
# Get text lengths
df['text_length'] = df['article'].str.len()
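
Counting spaces only approximates the word count: repeated spaces inflate it, and words separated only by newlines are missed. A minimal plain-Python sketch of the difference, using a string shaped like one of the short OCR fragments in this dataset (the helper names are illustrative and not part of cuDF):

```python
def count_words_by_spaces(text: str) -> int:
    # Same rule as the notebook cell: number of space characters plus one
    return text.count(' ') + 1


def count_words_by_whitespace(text: str) -> int:
    # Split on any run of whitespace (spaces, tabs, newlines)
    return len(text.split())


# Two words separated only by newlines, no space characters at all
sample = "\n\ncapital\n\n\n\n\n\n\nNavy"

print(count_words_by_spaces(sample))      # 1
print(count_words_by_whitespace(sample))  # 2
print(len(sample))                        # 20
```

For summary statistics over millions of articles the space-based count is probably close enough, but it is worth keeping in mind when inspecting very short rows.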

In [6]:
df.head(5)

Unnamed: 0,article_id,newspaper_name,edition,date,page,headline,byline,article,word_count,text_length
0,1_1938-11-08_p3_sn82014085_00393347429_1938110...,The Waterbury Democrat.,1,1938-11-08,p3,Fear Heavy Toll Among Civilians In Next Conflict,,Recognition of a probable heavy toll among non...,233,1506
1,3_1938-11-08_p3_sn82014085_00393347429_1938110...,The Waterbury Democrat.,1,1938-11-08,p3,,,"Conforming to tradition, the Democratic candid...",256,1624
2,4_1938-11-08_p3_sn82014085_00393347429_1938110...,The Waterbury Democrat.,1,1938-11-08,p3,Audience Thrilled By\n\n Early Masters Works\n...,,second by Kasper Ferdinand Fisch- CT.\n\n FOlk...,124,766
3,5_1938-11-08_p3_sn82014085_00393347429_1938110...,The Waterbury Democrat.,1,1938-11-08,p3,Democrats Institute Court Action T oday\n\n To...,,in behalf of Charles Maloney of Se4 East Main ...,224,1278
4,6_1938-11-08_p3_sn82014085_00393347429_1938110...,The Waterbury Democrat.,1,1938-11-08,p3,q WOMEN SEEK\n\n ELECTION JOBS\n\nNone Candida...,BY RUBY A. BLACK nifAd preea Sfaff Corresnovad...,NUnlte0 freSS Stam COrreSpOn0ent)\n\n Washingt...,280,1746


## Mean Values
 
The mean values below show that:
- The average word count is around 120 words per article
- The average text length is around 811 characters per article

This suggests that articles in our dataset are relatively short on average, perhaps being news briefs or small updates rather than long-form articles.


In [7]:
df.mean(numeric_only=True)

word_count     120.27740
text_length    810.95049
dtype: float64

## Median Values

The median is less sensitive to a few very long articles than the mean, so it provides a more representative measure of the central tendency of our data:
- The median word count is around 72 words per article
- The median text length is around 488 characters per article



In [8]:
df.median(numeric_only=True)


word_count      72.0
text_length    488.0
dtype: float64
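
A mean well above the median is typical of right-skewed data, where a few long articles pull the average up. The toy example below uses made-up word counts, chosen only to reproduce that pattern, and the standard-library `statistics` module rather than cuDF:

```python
from statistics import mean, median

# Hypothetical word counts: mostly short items plus one long article
word_counts = [4, 4, 19, 72, 80, 95, 566]

avg = mean(word_counts)    # pulled upward by the single long article
mid = median(word_counts)  # middle value of the sorted list

print(avg, mid)  # mean is 120, median is 72
```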

## Mode Values

The mode values represent the most frequently occurring values in our data:
- The most common word count is 4 words per article
- The most common text length is 22 characters per article

In [9]:
df.mode(numeric_only=True)

Unnamed: 0,word_count,text_length
0,4,22
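
The mode is computed for each column independently, so the most common word count and the most common text length need not describe the same articles. A small plain-Python illustration with made-up values:

```python
from statistics import mode

# Hypothetical (word_count, text_length) pairs for six short snippets
rows = [(4, 19), (4, 21), (4, 23), (3, 22), (5, 22), (2, 22)]

word_counts = [w for w, _ in rows]
text_lengths = [t for _, t in rows]

print(mode(word_counts))   # 4
print(mode(text_lengths))  # 22

# No single row combines both modal values
print((4, 22) in rows)     # False
```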


In [10]:
df[df['word_count'] < 5]

Unnamed: 0,article_id,newspaper_name,edition,date,page,headline,byline,article,word_count,text_length
105,21_1938-10-16_p35_sn83045462_00280601858_19381...,Evening star.,01,1938-10-16,p35,,,(see HEALEY. Page C-3.),4,23
106,24_1938-10-16_p35_sn83045462_00280601858_19381...,Evening star.,01,1938-10-16,p35,,,"(see EDGERToN, Page C-s0.",4,25
172,56_1938-08-02_p8_sn83045499_00393342353_193808...,The Daily Alaska empire.,01,1938-08-02,p8,,,Today's News Today.-Empire.,3,27
173,57_1938-08-02_p8_sn83045499_00393342353_193808...,The Daily Alaska empire.,01,1938-08-02,p8,,,Try an Empire ad.,4,17
175,60_1938-08-02_p8_sn83045499_00393342353_193808...,The Daily Alaska empire.,01,1938-08-02,p8,,,Empire classifieds pay.,3,23
...,...,...,...,...,...,...,...,...,...,...
4368496,88_1945-11-16_p1_sn82014085_00393346838_194511...,The Waterbury Democrat.,01,1945-11-16,p1,,,Continued on Page D,4,19
4368497,90_1945-11-16_p1_sn82014085_00393346838_194511...,The Waterbury Democrat.,01,1945-11-16,p1,,,tcOntinued on Page ID,4,21
4368498,91_1945-11-16_p1_sn82014085_00393346838_194511...,The Waterbury Democrat.,01,1945-11-16,p1,,,Continued on Page D,4,19
4368505,35_1945-02-25_p6_sn83045462_00280603843_194502...,Evening star.,01,1945-02-25,p6,,,\n\ncapital\n\n\n\n\n\n\nNavy,1,20


## Standard Deviation

The standard deviation measures the amount of variation or dispersion in our data:
- The word count standard deviation is around 138 words
- The text length standard deviation is around 930 characters



In [11]:
df.std(numeric_only=True)

word_count     137.986893
text_length    930.291942
dtype: float64
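
pandas-style `std()` reports the sample standard deviation (dividing by n − 1) by default, and cuDF is assumed here to follow the same convention. The sketch below contrasts the sample and population versions on made-up numbers using the standard library:

```python
from statistics import pstdev, stdev

# Hypothetical word counts with mean 5
word_counts = [2, 4, 4, 4, 5, 5, 7, 9]

population_sd = pstdev(word_counts)  # divides the squared deviations by n
sample_sd = stdev(word_counts)       # divides the squared deviations by n - 1

print(population_sd)  # 2.0
print(sample_sd > population_sd)  # True
```

With millions of rows the two versions are practically identical, so the distinction matters mainly for small samples.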

## Minimum and Maximum Values

The minimum and maximum values give us the smallest and largest values in our data:
- The minimum word count is 1 word
- The minimum text length is 1 character
- The maximum word count is 1920 words
- The maximum text length is 8969 characters

In [12]:
df.min(numeric_only=True)

word_count     1
text_length    1
dtype: int32

In [13]:
df.max(numeric_only=True)

word_count     1920
text_length    8969
dtype: int32

## Unique Values

The number of unique values in our data:

In [14]:
df.nunique()

article_id        4368788
newspaper_name         90
edition                 2
date                 2922
page                  152
headline          2284264
byline             201573
article           4304503
word_count           1212
text_length          7270
dtype: int64
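
The gap between the number of unique `article_id` values and unique `article` texts gives a rough estimate of how many rows repeat a text that already appears elsewhere. The arithmetic below is a sketch that assumes one row per `article_id` and no missing article text:

```python
# Values copied from the nunique() output above
unique_article_ids = 4_368_788
unique_article_texts = 4_304_503

# Assumption: one row per article_id and no missing article text
repeated_rows = unique_article_ids - unique_article_texts
share = repeated_rows / unique_article_ids

print(repeated_rows)   # 64285
print(f"{share:.2%}")  # 1.47%
```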

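## Cleaning the DataFrame

The most frequent article texts are mostly page-continuation notices rather than real articles, so we remove texts that repeat many times.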

In [15]:
df['article'].value_counts().head(50)

article
Continued from Page D                                                                 2093
(Continued from Page D                                                                1939
Continued from Page One)                                                               710
(Continued on Page 4)                                                                  670
Continued on Page D                                                                    640
(Continued on Page D                                                                   596
Continued on Page 9                                                                    576
Continued on Page A                                                                    575
(Continued on Page 9                                                                   571
(Continued on Page A                                                                   559
Continued From First Page.)                                                       

In [16]:
# Count how often each article text occurs (computed once and reused)
article_counts = df['article'].value_counts()
# Get article texts that appear more than 4 times
frequent_articles = article_counts[article_counts > 4].index

# Remove rows with those frequent articles
df = df[~df['article'].isin(frequent_articles)]
df


Unnamed: 0,article_id,newspaper_name,edition,date,page,headline,byline,article,word_count,text_length
0,1_1938-11-08_p3_sn82014085_00393347429_1938110...,The Waterbury Democrat.,01,1938-11-08,p3,Fear Heavy Toll Among Civilians In Next Conflict,,Recognition of a probable heavy toll among non...,233,1506
1,3_1938-11-08_p3_sn82014085_00393347429_1938110...,The Waterbury Democrat.,01,1938-11-08,p3,,,"Conforming to tradition, the Democratic candid...",256,1624
2,4_1938-11-08_p3_sn82014085_00393347429_1938110...,The Waterbury Democrat.,01,1938-11-08,p3,Audience Thrilled By\n\n Early Masters Works\n...,,second by Kasper Ferdinand Fisch- CT.\n\n FOlk...,124,766
3,5_1938-11-08_p3_sn82014085_00393347429_1938110...,The Waterbury Democrat.,01,1938-11-08,p3,Democrats Institute Court Action T oday\n\n To...,,in behalf of Charles Maloney of Se4 East Main ...,224,1278
4,6_1938-11-08_p3_sn82014085_00393347429_1938110...,The Waterbury Democrat.,01,1938-11-08,p3,q WOMEN SEEK\n\n ELECTION JOBS\n\nNone Candida...,BY RUBY A. BLACK nifAd preea Sfaff Corresnovad...,NUnlte0 freSS Stam COrreSpOn0ent)\n\n Washingt...,280,1746
...,...,...,...,...,...,...,...,...,...,...
4368783,28_1945-12-05_p7_sn88063294_00340589130_194512...,Detroit evening times.,01,1945-12-05,p7,,,Here's a contest you win! And what prize!\nA b...,51,321
4368784,10_1945-04-18_p35_sn83045462_00280604082_19450...,Evening star.,01,1945-04-18,p35,"ADVERTISEMENT,\n\nADNL n lSLhLt>1\nTorment OF ...",,II you can't get your feet of your mind\nbecau...,55,379
4368785,3_1945-04-18_p35_sn83045462_00280604082_194504...,Evening star.,01,1945-04-18,p35,Jury Rules Chaplin\nIs.Father;; Conference\nOn...,By the Associated Press.,"By the Associated Press.\n\n\nLOS ANGELES, Apr...",221,1616
4368786,21_1945-04-18_p35_sn83045462_00280604082_19450...,Evening star.,01,1945-04-18,p35,Use Your\nBeIdqet Account,,"It's easy as A-B-C to open a\nCharge, Budget o...",23,142


In [17]:
df['article'].value_counts().head(5)

article
Pacific Coast League                                                           4
BROOM apartment, hot and cold water, steam heat, electric range. Phone 569.    4
Want to make Hitler unhappy?\nBuy Defense stamps and bonds.                    4
Let the race driver do the speeding\ndrive sanely.                             4
(See CONGRESS, Page A-3)                                                       4
Name: count, dtype: int64
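
The frequency-based filtering used in this section can be sketched in plain Python with `collections.Counter`. The article texts and the `MAX_REPEATS` name are hypothetical; only the threshold of 4 mirrors the notebook cell:

```python
from collections import Counter

# Hypothetical article texts: a boilerplate notice, a repeated ad, one story
articles = (
    ["Continued on Page D"] * 6
    + ["Pacific Coast League"] * 4
    + ["A real story"]
)
MAX_REPEATS = 4  # texts occurring more often than this are dropped

counts = Counter(articles)
kept = [text for text in articles if counts[text] <= MAX_REPEATS]

print(len(kept))                     # 5
print(Counter(kept).most_common(1))  # [('Pacific Coast League', 4)]
```

This approach drops every copy of a frequent text rather than keeping one representative, and texts repeated up to the threshold remain, which matches the counts of 4 in the output above.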