In [2]:
import cudf

## Single GPU NLP Capabilities

Let's analyze some coronavirus-related tweets from April 1st, 2020.

In [249]:
path = "/raid/vjawa/string_exp/tweets/2020-04-01 Coronavirus Tweets.CSV"
df = cudf.read_csv(path)
df = df.loc[df.lang == 'en']
df.shape

(341697, 22)

In [250]:
df.head(2)

Unnamed: 0,status_id,user_id,created_at,screen_name,text,source,reply_to_status_id,reply_to_user_id,reply_to_screen_name,is_quote,...,retweet_count,country_code,place_full_name,place_type,followers_count,friends_count,account_lang,account_created_at,verified,lang
3,1245138808921874432,131872671,2020-04-01T00:00:00Z,i3health,A records review of patients with #cancer at a...,Buffer,,,,False,...,0,,,,616,4571,,2010-04-11T16:04:57Z,False,en
12,1245138811404685313,824565311437410305,2020-04-01T00:00:00Z,ViralTabNews,This American missionary has been accused of p...,TweetDeck,,,,False,...,0,,,,90,519,,2017-01-26T10:31:04Z,False,en


In [251]:
df.text.head(3)

3     A records review of patients with #cancer at a...
12    This American missionary has been accused of p...
15    Deadliest day for Europe virus hotspots, as US...
Name: text, dtype: object

Let's tokenize the data.

In [252]:
df.text.str.tokenize()

0                                A
1                          records
2                           review
3                               of
4                         patients
                    ...           
9125619               #coronavirus
9125620                   #science
9125621                      #nerd
9125622         #TestingForCovid19
9125623    https://t.co/OHNPQXNPgd
Name: text, Length: 9125624, dtype: object

What are the most common tokens?

In [147]:
df.text.str.tokenize().value_counts()

the                                         284343
to                                          248970
of                                          159108
and                                         145431
#COVID19                                    126944
                                             ...  
#coronavirus.\n\nhttps://t.co/9hquPvTLp3         1
Malpractice,                                     1
https://t.co/dOxzRkqLDf                          1
ups.\n\n#telemedicine                            1
526\nJDC                                         1
Name: text, Length: 995725, dtype: int32
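
As a rough CPU-only illustration of what the two calls above compute, the sketch below flattens whitespace-separated tokens from every string into one list and counts them. It uses only the Python standard library; the helper name `tokenize` and the sample tweets are made up for this example, and cuDF's exact delimiter handling may differ.

```python
from collections import Counter

def tokenize(texts):
    """Split every string on whitespace and flatten the result into one list."""
    return [tok for text in texts for tok in text.split()]

# Two made-up tweets standing in for the `text` column.
tweets = ["Stay home and stay safe", "Stay safe #COVID19"]

tokens = tokenize(tweets)   # one flat list, like the flattened Series above
counts = Counter(tokens)    # plays the role of value_counts()

for token, count in counts.most_common(3):
    print(token, count)
```

Note that `Stay` and `stay` are counted separately, which is the same case-sensitivity issue that shows up in the real data.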

The top of the list is dominated by stopwords such as "the", "to", and "of", so we need to filter them out.

In [253]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/nfs/nicholasb/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [254]:
STOPWORDS = nltk.corpus.stopwords.words('english')

(df
 .text
 .str.replace_tokens(STOPWORDS, "")
 .str.tokenize()
).value_counts()

#COVID19                                             126944
#coronavirus                                         104465
I                                                     40590
&amp;                                                 37712
The                                                   28515
                                                      ...  
School)                                                   1
https://t.co/7PUVnbJcw6                                   1
#System\n\nhttps://t.co/rzOfHXvMrU\n\n#WallStreet         1
https://t.co/mj6LDW0u7c                                   1
REQUIRED!                                                 1
Name: text, Length: 994026, dtype: int32

Capitalized stopwords such as "I" and "The" are still there because the match is case-sensitive, so let's lowercase the text first.

In [255]:
(df
 .text
 .str.lower()
 .str.replace_tokens(STOPWORDS, "")
 .str.tokenize()
).value_counts()

#covid19                   160309
#coronavirus               133624
&amp;                       37712
people                      26287
-                           23337
                            ...  
23-min                          1
#mumbailover                    1
@sandboy39                      1
here!scientist                  1
https://t.co/v0wymwdfrl         1
Name: text, Length: 910908, dtype: int32
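
To see why lowercasing has to come before stopword removal, here is a small pure-Python sketch. `DEMO_STOPWORDS` is a tiny hypothetical stand-in for the NLTK list, and `remove_stopwords` only approximates `replace_tokens` (it drops matching tokens rather than replacing them with an empty string).

```python
# Hypothetical mini stopword list; the notebook uses NLTK's English list instead.
DEMO_STOPWORDS = {"the", "to", "of", "and", "i"}

def remove_stopwords(text, stopwords):
    """Drop whole tokens found in `stopwords`; the comparison is case-sensitive."""
    return " ".join(tok for tok in text.split() if tok not in stopwords)

tweet = "The virus and I"

# Without lowercasing, "The" and "I" do not match the lowercase stopwords.
without_lowering = remove_stopwords(tweet, DEMO_STOPWORDS)

# Lowercasing first lets every stopword match.
with_lowering = remove_stopwords(tweet.lower(), DEMO_STOPWORDS)

print(repr(without_lowering))
print(repr(with_lowering))
```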

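Punctuation is also fragmenting the counts: tokens such as `&amp;`, `-`, and `here!scientist` show up in the results. Let's replace punctuation characters with spaces.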

In [256]:
PUNCTUATION = [ '!', '"', '#', '$', '%', '&', '(', ')', '*', '+', '-', '.', '/',  '\\', ':', ';', '<', '=', '>',
           '?', '@', '[', ']', '^', '_', '`', '{', '|', '}', '\~', '\t','\\n',"'",",",'~' , '—']

In [257]:
(df
 .text
 .str.lower()
 .str.replace(PUNCTUATION, [" "]*len(PUNCTUATION), regex=False)
 .str.replace_tokens(STOPWORDS, "")
 .str.tokenize()
).value_counts()



co                 280872
https              237614
covid19            201009
coronavirus        178962
\n                  74188
                    ...  
robotovelords           1
nieuwemarlean           1
5d6jlh8mhd              1
dq4pqdnj4z              1
geraldinemorris         1
Name: text, Length: 641145, dtype: int32

Fragments of web addresses (`co` and `https`) are now the most common tokens. That makes sense, since replacing punctuation with spaces splits URLs into pieces. We should explicitly include these in our `STOPWORDS`.

In [258]:
STOPWORDS += ["co", "https", "com"]

In [259]:
(df
 .text
 .str.lower()
 .str.replace(PUNCTUATION, [" "]*len(PUNCTUATION), regex=False)
 .str.replace_tokens(STOPWORDS, "")
 .str.tokenize()
).value_counts()

covid19        201009
coronavirus    178962
\n              86816
\n\n            63948
amp             40612
                ...  
litgleam            1
c7d84llqs7          1
yf1ewiusay          1
vkd5hayimf          1
ezzjadvbg5          1
Name: text, Length: 638516, dtype: int32

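Newline tokens (`\n` and `\n\n`) now rank near the top. The `'\\n'` entry in `PUNCTUATION` is a literal backslash followed by `n`, so real newline characters pass through the replacement step untouched. Handling newlines and normalizing whitespace with `normalize_spaces()` is generally a good idea.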

In [260]:
results = (df
 .text
 .str.lower()
 .str.replace(PUNCTUATION, [" "]*len(PUNCTUATION), regex=False)
 .str.replace_tokens(STOPWORDS, "")
 .str.normalize_spaces()
 .str.tokenize()
).value_counts()

results.head(10)

covid19        207801
coronavirus    184641
amp             40612
covid           37620
people          31137
19              31019
us              24577
cases           24087
pandemic        23828
new             22103
Name: text, dtype: int32
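
The cleaning chain can be mimicked on a single string in plain Python. This is only a sketch under simplifying assumptions: `DEMO_PUNCTUATION` is a shortened, hypothetical version of the list above, and `replace_all` and `normalize_spaces` are local helpers written for this example, not cuDF functions.

```python
# Shortened, hypothetical punctuation list for the demo.
DEMO_PUNCTUATION = ["#", ".", ":", "/", "!", ","]

def replace_all(text, targets, replacement=" "):
    """Replace every occurrence of each target substring with `replacement`."""
    for target in targets:
        text = text.replace(target, replacement)
    return text

def normalize_spaces(text):
    """Collapse any run of whitespace (spaces, tabs, newlines) into one space."""
    return " ".join(text.split())

raw = "stay home!\n\n#covid19 https://t.co/abc"
spaced = replace_all(raw, DEMO_PUNCTUATION)   # newlines are still present here
clean = normalize_spaces(spaced)              # newlines and extra spaces are gone

print(repr(spaced))
print(repr(clean))
```

The URL ends up as the separate tokens `https`, `t`, `co`, and `abc`, which is why URL fragments were added to `STOPWORDS` earlier.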

We've got the most common tokens. What about bigrams or trigrams?

In [261]:
(df
 .text
 .str.lower()
 .str.replace(PUNCTUATION, [" "]*len(PUNCTUATION), regex=False)
 .str.replace_tokens(STOPWORDS, "")
 .str.normalize_spaces()
 .str.ngrams_tokenize(n=2, separator=" ")
 .value_counts()
).head(10)

covid 19                29379
coronavirus covid19     14559
covid19 coronavirus     13482
coronavirus pandemic     6181
covidー19 coronavirus     6130
covid19 pandemic         6027
stay home                5948
coronavirus covidー19     4193
social distancing        3866
new cases                3381
Name: text, dtype: int32

This makes sense. These sound like they could be terms used commonly in hashtags. What about trigrams?

In [262]:
(df
 .text
 .str.lower()
 .str.replace(PUNCTUATION, [" "]*len(PUNCTUATION), regex=False)
 .str.replace_tokens(STOPWORDS, "")
 .str.normalize_spaces()
 .str.ngrams_tokenize(n=3, separator=" ")
 .value_counts()
).head(10)

keep safe coronavirus        1995
ppe frontline nhs            1981
nhs keep safe                1980
frontline nhs keep           1978
provide ppe frontline        1978
govt provide ppe             1967
uk govt provide              1966
coronavirus covid 19         1866
coronavirus sign petition    1791
safe coronavirus sign        1761
Name: text, dtype: int32
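
For intuition, an n-gram tokenizer slides a window of `n` tokens over each string and joins the tokens in the window with a separator. The `ngrams` helper below is a pure-Python sketch of that idea, not cuDF's implementation.

```python
def ngrams(text, n, separator=" "):
    """Return all runs of `n` consecutive whitespace-separated tokens."""
    tokens = text.split()
    return [separator.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

phrase = "uk govt provide ppe"
bigrams = ngrams(phrase, n=2)
trigrams = ngrams(phrase, n=3)

print(bigrams)
print(trigrams)
```

A string with fewer than `n` tokens yields no n-grams in this sketch.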

RAPIDS covers a lot of NLP functionality on a single GPU, and what's particularly powerful is that the same string operations carry over to Dask for multi-GPU work.

# Expanding to Larger, More Complex Tasks using Dask

Let's revisit the previous example, and then move to something more complex: document search.

In [3]:
import IPython

IPython.Application.instance().kernel.do_shutdown(True)

{'status': 'ok', 'restart': True}

In [2]:
import nltk

from dask.distributed import Client
import dask.array as da

from dask_cuda import LocalCUDACluster
import cudf
import dask_cudf
import cupy as cp

from cuml.dask.feature_extraction.text import TfidfTransformer
from cuml.feature_extraction.text import HashingVectorizer as CumlHashVect

In [3]:
cluster = LocalCUDACluster(
    CUDA_VISIBLE_DEVICES="0,1,2,3",
    
)
client = Client(cluster)
client

Client: Scheduler: tcp://127.0.0.1:39239, Dashboard: http://127.0.0.1:8787/status
Cluster: Workers: 4, Cores: 4, Memory: 1.48 TiB


In [4]:
path = "/raid/vjawa/string_exp/tweets/*.CSV"
df = dask_cudf.read_csv(path)

df = df.loc[df.lang == 'en'].persist()
print(len(df))

4827372


In [5]:
df['text'].head(5)

2     “People are just storing up. They are staying ...
6     .@PatriceHarrisMD spoke with @YahooFinance abo...
7     First medical team aiding #Wuhan in fight agai...
9     .@KathyGriffin: @realDonaldTrump Is 'Lying' Ab...
14    #CoronaUpdate | Johns Hopkins University has s...
Name: text, dtype: object

## Tokenization (Again)

We can do all the same processing we did before, this time spread across all four GPUs in the cluster.

In [6]:
STOPWORDS = nltk.corpus.stopwords.words('english')
STOPWORDS += ["co", "https", "com"]

PUNCTUATION = [ '!', '"', '#', '$', '%', '&', '(', ')', '*', '+', '-', '.', '/',  '\\', ':', ';', '<', '=', '>',
           '?', '@', '[', ']', '^', '_', '`', '{', '|', '}', '\~', '\t','\\n',"'",",",'~' , '—']

In [7]:
# Same code, this time on a dask_cudf DataFrame so the work is spread across the GPUs in the cluster

results = (df
 .text
 .str.lower()
 .str.replace(PUNCTUATION, [" "]*len(PUNCTUATION), regex=False)
 .str.replace_tokens(STOPWORDS, "")
 .str.normalize_spaces()
 .str.tokenize()
).value_counts()

results.head(10)



covid19        2744517
coronavirus    2350495
covid           832122
19              741327
amp             568308
people          445454
cases           334449
us              333554
pandemic        328756
new             308146
Name: text, dtype: int64
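
Conceptually, a distributed count like this can be thought of as counting tokens within each partition and then merging the partial counts. The sketch below shows that pattern with the standard library only; the partitions are made-up data, and Dask's actual task graph is more involved than this.

```python
from collections import Counter
from functools import reduce

# Made-up partitions, standing in for chunks of tweets held by different workers.
partitions = [
    ["covid19 cases rise", "stay home"],
    ["covid19 pandemic", "new cases"],
]

def count_partition(texts):
    """Count tokens inside one partition (the per-worker step)."""
    return Counter(tok for text in texts for tok in text.split())

# Map step: one Counter per partition.
partial_counts = [count_partition(p) for p in partitions]

# Reduce step: add the partial counts together.
total = reduce(lambda a, b: a + b, partial_counts, Counter())

print(total.most_common(2))
```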

## Distributed TF-IDF Based Document Search

Now that we know we can do these kinds of NLP operations with Dask, let's build a search tool using TF-IDF that lets us find tweets corresponding to our search query.

In [8]:
vectorizer = CumlHashVect(stop_words='english')
multi_gpu_transformer = TfidfTransformer()

Note that the HashingVectorizer has a `preprocessor` argument that takes a callable. Let's redefine the vectorizer with our own function, reusing the core cleaning logic from above.

In [9]:
def our_preprocessor(s):
    processed = (s
                .str.lower()
                .str.replace(PUNCTUATION, [" "]*len(PUNCTUATION), regex=False)
                .str.replace_tokens(STOPWORDS, "")
                .str.normalize_spaces()
                )
    return processed

vectorizer = CumlHashVect(stop_words='english', preprocessor=our_preprocessor)

In [10]:
meta = da.from_array(cp.sparse.csr_matrix(cp.zeros(1, dtype=cp.float32)))
X = df["text"].map_partitions(vectorizer.fit_transform, meta=meta).astype(cp.float32)
X = X.persist()
X.compute_chunk_sizes()

Unnamed: 0,Array,Chunk
Bytes,18.41 TiB,1.44 TiB
Shape,"(4827372, 1048576)","(378267, 1048576)"
Count,18 Tasks,18 Chunks
Type,float32,numpy.ndarray
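
The sizes in this summary can be reproduced from the shape alone, assuming they are computed as rows × columns × 4 bytes for `float32`. That would be the footprint of a dense array of this shape; since the vectorizer output is sparse, the memory actually used is presumably far smaller.

```python
rows, cols = 4_827_372, 1_048_576   # array shape from the summary
chunk_rows = 378_267                # chunk row count from the summary
itemsize = 4                        # bytes per float32 value
TIB = 2 ** 40                       # bytes per tebibyte

array_tib = rows * cols * itemsize / TIB
chunk_tib = chunk_rows * cols * itemsize / TIB

print(f"array: {array_tib:.2f} TiB, chunk: {chunk_tib:.2f} TiB")
```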


In [11]:
X_transformed = multi_gpu_transformer.fit_transform(X).persist()
X_transformed.compute_chunk_sizes()

[I] [09:08:26.630553] [Delayed('_merge_stats_to_model-69103aae-1dfc-4d20-a10a-586f1437f4fd')]
[I] [09:08:26.633220] [Delayed('_merge_stats_to_model-fd49d712-627f-4fe2-b820-c5c31de4ba7e')]
[I] [09:08:26.634659] [Delayed('_merge_stats_to_model-059176a8-7c15-4ff2-b55c-e78b1316b230')]
[I] [09:08:26.636136] [Delayed('_merge_stats_to_model-80cd2d5c-f960-402c-9a20-449ee0811fe7')]
[I] [09:08:26.777757] [<Future: finished, type: cuml.TfidfTransformer, key: _merge_stats_to_model-69103aae-1dfc-4d20-a10a-586f1437f4fd>]
[I] [09:08:26.777856] [<Future: finished, type: cuml.TfidfTransformer, key: _merge_stats_to_model-fd49d712-627f-4fe2-b820-c5c31de4ba7e>]
[I] [09:08:26.778428] [<Future: finished, type: cuml.TfidfTransformer, key: _merge_stats_to_model-059176a8-7c15-4ff2-b55c-e78b1316b230>]
[I] [09:08:26.778530] [<Future: finished, type: cuml.TfidfTransformer, key: _merge_stats_to_model-80cd2d5c-f960-402c-9a20-449ee0811fe7>]
[I] [09:08:26.781680] [Delayed('_merge_stats_to_model-07c3c5b0-effa-4a1e-ba2

Unnamed: 0,Array,Chunk
Bytes,18.41 TiB,1.44 TiB
Shape,"(4827372, 1048576)","(378267, 1048576)"
Count,18 Tasks,18 Chunks
Type,float32,cupy.ndarray
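
As a reminder of what a TF-IDF transform does to the count matrix, here is a small pure-Python sketch. It assumes the scikit-learn-style smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row; whether cuML's defaults match exactly is an assumption here, and the `tfidf` helper and sample rows are hypothetical.

```python
import math

def tfidf(count_rows):
    """Weight term counts by smoothed IDF, then L2-normalize each document."""
    n_docs = len(count_rows)

    # Document frequency: in how many documents does each term appear?
    doc_freq = {}
    for row in count_rows:
        for term in row:
            doc_freq[term] = doc_freq.get(term, 0) + 1

    # Smoothed inverse document frequency (assumed scikit-learn-style formula).
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

    weighted = []
    for row in count_rows:
        w = {t: c * idf[t] for t, c in row.items()}
        norm = math.sqrt(sum(v * v for v in w.values())) or 1.0
        weighted.append({t: v / norm for t, v in w.items()})
    return weighted

# Two made-up documents as {term: count} rows.
docs = [{"covid19": 1, "cases": 1}, {"covid19": 1, "gpu": 1}]
weights = tfidf(docs)

print(weights[0])
```

The term that appears in every document (`covid19`) ends up with a lower weight than the rarer term in the same row. In the distributed setting the document frequencies have to be combined across partitions before the weights can be applied.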


For simplicity, we'll collect our corpus and the sparse TF-IDF matrix onto a single GPU and run the search there. This is not the most optimized approach, but it's simple and easy to walk through.

In [12]:
corpus = df[["text", "status_id"]].compute()
X_transformed_singlegpu = X_transformed.compute()

Using cuML's NearestNeighbors with `metric="cosine"`, we can find the records most similar to a query on the sparse TF-IDF matrix.

In [13]:
from cuml.neighbors import NearestNeighbors

nn = NearestNeighbors(n_neighbors=5, metric="cosine")
nn.fit(X_transformed_singlegpu)

def search(haystack, needle):
    query_vector = vectorizer.transform(cudf.Series(needle))
    distances, indices = nn.kneighbors(query_vector)
    return haystack.iloc[indices.ravel()]
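
To make the ranking step concrete, the sketch below does a brute-force cosine nearest-neighbor search over sparse rows stored as dictionaries. It is a CPU toy with hypothetical helpers (`cosine_distance`, `nearest`) and made-up rows, not how cuML performs the search.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity for two sparse vectors stored as {key: weight}."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an empty vector as maximally distant
    return 1.0 - dot / (norm_a * norm_b)

def nearest(query, rows, k):
    """Return the indices of the k rows closest to the query."""
    ranked = sorted(range(len(rows)), key=lambda i: cosine_distance(query, rows[i]))
    return ranked[:k]

# Made-up sparse document rows.
rows = [
    {"nvidia": 1.0, "ai": 1.0},
    {"stay": 1.0, "home": 1.0},
    {"ai": 1.0, "research": 1.0},
]
top = nearest({"nvidia": 1.0, "ai": 1.0}, rows, k=2)

print(top)
```

The returned indices play the same role as `indices` in `search`, where they are used to look up the matching tweets in the corpus.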

In [14]:
search(corpus, "NVIDIA AI")



Unnamed: 0,text,status_id
229594,Nvidia and IBM provide further AI solutions to...,1247496980529586177
233595,Nvidia and IBM provide further AI solutions to...,1247500757982724098
404210,"NVIDIA Brings GPU, HPC and AI Expertise to COV...",1247217527362478080
420528,"Yes! #STEM in action, #NVIDIA assisting in sci...",1247224492826464257
134358,NVIDIA is contributing its AI smarts to help f...,1247427803836035074


In [15]:
search(corpus, "distributed computing")



Unnamed: 0,text,status_id
374874,Distributed computing for #COVID19 https://t.c...,1246135241712041988
352253,"If you have spare computing power, this is a w...",1246500181635477507
165023,We've all developed computing infrastructure i...,1249742404342575104
443602,In an effort to help in the #coronavirus crisi...,1244700923739418626
254327,Proud to be the part of @foldingathome #distri...,1248251854636515331


In [16]:
search(corpus, "python programming gpu")



Unnamed: 0,text,status_id
36738,How To Track Coronavirus In Your Country with ...,1248813665484079104
37296,How To Track Coronavirus In Your Country with ...,1248814560133300224
491122,How To Track COVID-19 Cases in the United Stat...,1248007352080404482
397256,I am offering FREE course of Python Programmin...,1246532658210852864
23752,Real-World Programming for Kids with Python - ...,1246971643425181704


We've only scratched the surface of the NLP capabilities that Dask and RAPIDS make possible. We encourage you to look at the [RAPIDS](https://docs.rapids.ai/) and [Dask](https://docs.dask.org/en/latest/) documentation to learn more!