# Gutenberg NLP Analysis:

### Blog Link:
https://medium.com/rapids-ai/show-me-the-word-count-3146e1173801


### Objective: Showcase the NLP capabilities of nvstrings + cuDF

### Pre-Processing :
* filter punctuation
* to_lower
* remove stop words (from nltk corpus)
* replace multiple spaces with one
* remove leading and trailing spaces    
    
### Word Count: 
* Get Frequency count for the whole dataset
* Compare word count for two authors (Albert Einstein vs Charles Dickens )
* Get Word counts for all the authors

### Encode the word-count for all authors into a count-vector

We do this in two steps:

1. Encode the string Series using the `top 20k` most used `words` in the dataset, which we calculated earlier.
    * We encode anything not in the top 20k to string_id = `20_000` (`threshold`)


2. With the encoded count series for all authors, we create an aligned word-count vector for them, where:
    * Each column corresponds to a `word_id` from the `top 20k words`
    * Each row corresponds to the `count vector` for that author
    
    
### Find the nearest authors using the count-vector:
* Fit a knn
* Find the authors nearest to each other in the count vector space
* Reduce dimensionality using UMAP
* Find the authors nearest to each other in the latent space

### Data Download Links:

Download the data from: https://web.eecs.umich.edu/~lahiri/gutenberg_dataset.html

You can also run the commands below:
```
!pip install gdown
!gdown https://drive.google.com/uc?id=0B2Mzhc7popBga2RkcWZNcjlRTGM
!unzip Gutenberg.zip
```

### Import libraries

In [1]:
import cudf
import nvcategory
import os
import numpy as np
import nvtext
import cuml
import nvstrings
import nltk
from numba import cuda

### Set Data Dir 

In [2]:
data_dir = 'Gutenberg/txt'

## Read Text Frame

#### Read helper functions

In [3]:
def get_non_empty_lines(lines):
    """
        returns non empty lines from a list of lines
    """
    clean_lines = []
    for line in lines:
        str_line = line.strip()
        if str_line:
            clean_lines.append(str_line)            
    return clean_lines

def get_txt_lines(data_dir):
    """
        Read text lines from gutenberg texts
        returns (text_ls,fname_ls) where 
        text_ls= input_text_lines and fname_ls = list of fnames
    """
    text_ls = []
    fname_ls = []
    for fn in os.listdir(data_dir):
        full_fn = os.path.join(data_dir,fn)
        with open(full_fn,encoding="utf-8") as f:
            content = f.readlines()
            content = get_non_empty_lines(content)
            text_ls += content
            # drop the ".txt" extension from the file name
            fname_ls += [fn[:-4]]*len(content)
    
    return text_ls,fname_ls

### Read text lines into a cudf dataframe

In [4]:
print("File Read Time:")
%time txt_ls,fname_ls = get_txt_lines(data_dir)
df = cudf.DataFrame()

print("\nCUDF  Creation Time:")
%time df['text'] = nvstrings.to_device(txt_ls)

df['label'] = nvstrings.to_device(fname_ls)
title_label_df = df['label'].str.split('___')
df['author'] = title_label_df[0]

df['title'] = title_label_df[1]
df = df.drop(labels=['label'])

print("Number of lines in the DF = {:,}".format(len(df)))
df.head(5).to_pandas()

File Read Time:
CPU times: user 7.81 s, sys: 1.08 s, total: 8.89 s
Wall time: 8.9 s

CUDF  Creation Time:
CPU times: user 2.05 s, sys: 893 ms, total: 2.95 s
Wall time: 2.98 s
Number of lines in the DF = 19,259,957


| | text | author | title |
|---|---|---|---|
| 0 | THE STORY OF THE CHAMPIONS OF THE ROUND TABLE | Howard Pyle | The Story of the Champions of the Round Table |
| 1 | Written and Illustrated by | Howard Pyle | The Story of the Champions of the Round Table |
| 2 | HOWARD PYLE. | Howard Pyle | The Story of the Champions of the Round Table |
| 3 | In 1902 the distinguished American artist Howa... | Howard Pyle | The Story of the Champions of the Round Table |
| 4 | and illustrate the legend of King Arthur and t... | Howard Pyle | The Story of the Champions of the Round Table |
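
The `author` and `title` columns above come from splitting each file name on `___`. As a minimal CPU-side sketch of that step (the helper name `split_label` and the example file name are assumptions reconstructed from the table, not taken from the dataset listing), the same split can be written with the standard library:

```python
import os


def split_label(filename):
    """Split a file name of the form 'Author___Title.txt' into (author, title)."""
    # Drop the extension instead of slicing a fixed number of characters.
    stem, _ext = os.path.splitext(filename)
    # partition() splits on the first '___' only.
    author, _sep, title = stem.partition("___")
    return author, title


# Hypothetical file name, reconstructed from the author/title columns shown above.
example = "Howard Pyle___The Story of the Champions of the Round Table.txt"
print(split_label(example))
```

Using `os.path.splitext` instead of `fn[:-4]` is a design choice of this sketch: it avoids assuming that the extension is exactly four characters long.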


## NLP Preprocessing

In almost every workflow involving textual data, we'll want to do some kind of preprocessing before running our analysis. We might want to remove punctuation, standardize to all lowercase characters, and potentially dozens of other small tasks. RAPIDS makes developing GPU accelerated preprocessing pipelines smooth.

Let's start by removing all the punctuation, since we don't want those characters to cloud our analysis. We could replace them one by one in many calls to replace. More realistically, we might generate a large regular expression pattern that looks for `!`, `,`, `%` and all of our other patterns and replaces them. It might look something like this: `(!)|(,)...|(%)`.

A longer regex is not necessarily less efficient on the GPU. If an instruction within the regex fails to match the current character of the string, the rest of the expression does not need to be evaluated and we can move on to the next character. However, a regex with many alternations, as in our case, may evaluate the same character against many more instructions before continuing. An alternation can be explicit, as in `(\bone\b)|(\b1\b)`, or implicit, as in `[aA]`.

Building such a pattern by hand is tedious, and it isn't well suited to the GPU. Overall, avoiding regex can be more efficient, because the richness of regex features makes the matching algorithm complex.

For cases like removing multiple `characters` or `stop words`, a `general regex` can be overkill and `nvtext` provides some alternative methods which make this computation much faster. 
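
To make the alternation point concrete, here is a minimal CPU-side sketch using Python's standard `re` module rather than the GPU libraries. The punctuation subset and the sample sentence are assumptions chosen for illustration. It builds the explicit alternation, the equivalent character class (an implicit alternation), and a regex-free translation table, then checks that all three give the same result:

```python
import re

# Hypothetical subset of the punctuation we want to remove.
punctuation = ["!", ",", "%", ".", "?"]

# Explicit alternation, e.g. (!)|(,)|(%)|(\.)|(\?)
alternation = re.compile("|".join("({})".format(re.escape(ch)) for ch in punctuation))

# Character class, e.g. [!,%\.\?], which is an implicit alternation.
char_class = re.compile("[{}]".format("".join(re.escape(ch) for ch in punctuation)))

# No regex at all: a direct character-to-character mapping.
table = str.maketrans({ch: " " for ch in punctuation})

text = "Hello, world! 100% sure?"
via_alternation = alternation.sub(" ", text)
via_char_class = char_class.sub(" ", text)
via_translate = text.translate(table)

assert via_alternation == via_char_class == via_translate
print(repr(via_translate))
```

The translation table plays the role that a multi-replace call plays in this workflow: it never enters a regex engine, so no pattern has to be compiled or evaluated per character.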

In this workflow we use the following `nvstrings` and `nvtext` functions:
* `nvstrings.replace_multi`: To replace each punctuation character with a blank space.
* `nvtext.replace_tokens`: To replace each stop-word token with a blank space.

Please check out `https://docs.rapids.ai/api/nvstrings/nightly/`; we are adding more features every day.


#### Now back to our workflow:

##### Removing Punctuation:
First, we need to define our list of filter characters.

In [5]:
# punctuation and special characters to strip from the text column
filters = [ '!', '"', '#', '$', '%', '&', '(', ')', '*', '+', '-', '.', '/',  '\\', ':', ';', '<', '=', '>',
           '?', '@', '[', ']', '^', '_', '`', '{', '|', '}', '\\~', '\t','\\n',"'",",",'~' , '—']

Next, we can simply pass `filters` to the string processing methods inside cuDF and apply them to our Series. We'll eventually make a helper function to let us execute this on every text column in the DataFrame. But first, let's take a quick look at a sample of our text data.

In [6]:
text_col_sample = df.head(5)
text_col_sample['text'].to_pandas()

0        THE STORY OF THE CHAMPIONS OF THE ROUND TABLE
1                           Written and Illustrated by
2                                         HOWARD PYLE.
3    In 1902 the distinguished American artist Howa...
4    and illustrate the legend of King Arthur and t...
Name: text, dtype: object

In [7]:
text_col_sample['text_clean'] = text_col_sample['text'].str.replace_multi(filters, ' ', regex=False)
text_col_sample['text_clean'].to_pandas()

0        THE STORY OF THE CHAMPIONS OF THE ROUND TABLE
1                           Written and Illustrated by
2                                         HOWARD PYLE 
3    In 1902 the distinguished American artist Howa...
4    and illustrate the legend of King Arthur and t...
Name: text_clean, dtype: object

With one method we removed all of the symbols in our `filters` list. Next, we'll convert the text to lowercase with `str.lower()`, which is called the same way as `replace_multi`.

##### To Lower

In [8]:
text_col_sample['text_clean'] = text_col_sample['text_clean'].str.lower()
text_col_sample['text_clean'].to_pandas()

0        the story of the champions of the round table
1                           written and illustrated by
2                                         howard pyle 
3    in 1902 the distinguished american artist howa...
4    and illustrate the legend of king arthur and t...
Name: text_clean, dtype: object

We can also remove stopwords with `replace_tokens`. The `nvtext` library makes this easy. We can pass the default list of English stopwords that ships with the `nltk` library. We'll replace each of our stopwords with a single space.
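
As a rough CPU-side illustration of token replacement (a sketch only, not the `nvtext` implementation; the function name and the two-word stopword list are assumptions), the key property is that whole tokens are replaced, never substrings:

```python
def replace_tokens_cpu(text, stopwords, replacement=" "):
    """Replace every space-delimited token that appears in `stopwords`."""
    stop = set(stopwords)  # set membership is O(1) per token
    tokens = text.split(" ")
    return " ".join(replacement if tok in stop else tok for tok in tokens)


# Hypothetical two-word stopword list, for illustration only.
print(repr(replace_tokens_cpu("the story of the champions of the round table", ["the", "of"])))

# "then" contains "the" but is a different token, so it is left alone.
print(repr(replace_tokens_cpu("then the end", ["the"])))
```

Because each removed token leaves its replacement space behind in addition to the original separators, the result contains runs of spaces, which is why a whitespace-collapsing step is still needed afterwards.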

##### Remove Stop Words

In [9]:
nltk.download('stopwords')
STOPWORDS = nltk.corpus.stopwords.words('english')
STOPWORDS = nvstrings.to_device(STOPWORDS)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [10]:
text_col_sample['text_clean'] = nvtext.replace_tokens(text_col_sample['text_clean'].data, STOPWORDS, ' ')
text_col_sample['text_clean'].to_pandas()

0                  story     champions     round table
1                              written   illustrated  
2                                         howard pyle 
3      1902   distinguished american artist howard ...
4      illustrate   legend   king arthur     knight...
Name: text_clean, dtype: object

##### Replacing Multiple White Spaces

This looks great, but we'll probably want to replace multiple spaces in a row with a single space and strip leading and trailing spaces. We can do that easily, too.

Replacing multiple spaces with a single space is a common operation, so we're making this even faster than the regex used here with a new feature coming soon (keep an eye on [this GitHub issue](https://github.com/rapidsai/custrings/issues/374) for the latest info).

In [11]:
text_col_sample['text_clean'] = text_col_sample['text_clean'].str.replace(r"\s+", ' ',regex=True)
text_col_sample['text_clean'] = text_col_sample['text_clean'].str.strip(' ')
text_col_sample['text_clean'].to_pandas()

0                          story champions round table
1                                  written illustrated
2                                          howard pyle
3    1902 distinguished american artist howard pyle...
4          illustrate legend king arthur knights round
Name: text_clean, dtype: object
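
For comparison, the same clean-up can be sketched on the CPU in two ways: with the `\s+` regex used in this step, and without any regex. This is a plain-Python illustration with hypothetical function names; it says nothing about how the GPU version is implemented.

```python
import re


def collapse_spaces_regex(text):
    """Collapse whitespace runs with a regex, then strip surrounding spaces."""
    return re.sub(r"\s+", " ", text).strip(" ")


def collapse_spaces_split(text):
    """Regex-free variant: split on whitespace runs and re-join with one space."""
    return " ".join(text.split())


sample = "  story     champions     round table "
print(repr(collapse_spaces_regex(sample)))
print(repr(collapse_spaces_split(sample)))
```

The regex-free variant works because `str.split()` with no arguments already treats any run of whitespace as one separator and ignores leading and trailing whitespace.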

With that, we've finished our basic preprocessing steps on a tiny sample of our text column. We'll wrap this into a function for portability, and run it on the entire data. We'll rewrite our code to create our filter list and stopwords again for clarity.

#### Full Preprocessing Pipeline

In [12]:
STOPWORDS = nltk.corpus.stopwords.words('english')

filters = [ '!', '"', '#', '$', '%', '&', '(', ')', '*', '+', '-', '.', '/',  '\\', ':', ';', '<', '=', '>',
           '?', '@', '[', ']', '^', '_', '`', '{', '|', '}', '\\~', '\t','\\n',"'",",",'~' , '—']

def preprocess_text(input_strs , filters=None , stopwords=STOPWORDS):
    """
        * filter punctuation
        * to_lower
        * remove stop words (from nltk corpus)
        * replace multiple spaces with one
        * remove leading and trailing spaces
    """
    
    # filter punctuation and case conversion
    input_strs = input_strs.str.replace_multi(filters, ' ', regex=False)
    input_strs = input_strs.str.lower()
        
    # remove stopwords
    stopwords_gpu = nvstrings.to_device(stopwords)
    input_strs = nvtext.replace_tokens(input_strs.data, stopwords_gpu, ' ')
    input_strs = cudf.Series(input_strs)
        
    # replace multiple spaces with single one and strip leading/trailing spaces
    input_strs = input_strs.str.replace(r"\s+", ' ', regex=True)
    input_strs = input_strs.str.strip(' ')
    
    return input_strs

def preprocess_text_df(df, text_cols=['text'], **kwargs):
    for col in text_cols:
        df[col] = preprocess_text(df[col], **kwargs)
    return  df

With our function defined, we can execute it to preprocess the entire dataset.

In [13]:
%time df = preprocess_text_df(df, filters=filters)

CPU times: user 1.6 s, sys: 790 ms, total: 2.39 s
Wall time: 2.39 s


In [14]:
df['text'].head(5).to_pandas()

0                          story champions round table
1                                  written illustrated
2                                          howard pyle
3    1902 distinguished american artist howard pyle...
4          illustrate legend king arthur knights round
Name: text, dtype: object

## Word Count

Let's find the top words used:
* in the whole dataset
* by Albert Einstein
* by Charles Dickens

In [15]:
## Getting a frequency count for Strings

def get_word_count(str_col):
    """
        returns a DataFrame with the count of each word in the input strings
    """ 
    ## Tokenize: convert sentences into one long list of words
    ## Get counts: group by each token to get value counts

    df = cudf.DataFrame()
    # tokenize the sentences with nvtext.tokenize(), which flattens
    # all tokens into a single tall column
    df['string'] = nvtext.tokenize(str_col.data)
    
    # Use a groupby to do a value count on the string column
    # This will be natively supported soon
    # See: https://github.com/rapidsai/cudf/issues/1951

    df['counts'] = np.dtype('int32').type(0)
    
    res = df.groupby('string').count()
    res = res.reset_index(drop=False).sort_values(by='counts', ascending=False)
    return res.rename({'index':'string'})
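
The tokenize-then-count idea can be sketched on the CPU with `collections.Counter`. This is only an illustration of the logic; the function name and the three sample lines are assumptions, and the input is expected to be already preprocessed as above.

```python
from collections import Counter


def get_word_count_cpu(lines):
    """Return (word, count) pairs sorted by descending count."""
    counts = Counter()
    for line in lines:
        # Tokenize: after preprocessing, words are separated by single spaces.
        counts.update(line.split())
    return counts.most_common()


# Hypothetical preprocessed lines.
lines = ["story champions round table", "knights round table", "round"]
print(get_word_count_cpu(lines)[:2])
```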

### Top Words  Across the dataset

In [16]:
%%time 

count_df = get_word_count(df['text'])
count_df.head(5).to_pandas()

CPU times: user 3.03 s, sys: 699 ms, total: 3.73 s
Wall time: 3.73 s


### Now let's compare Charles Dickens and Albert Einstein

####  Albert Einstein

In [17]:
einstein_df = df[df['author'].str.contains('Einstein')]
einstein_count_df = get_word_count(einstein_df['text'])
einstein_count_df.head(5).to_pandas()

| | string | counts |
|---|---|---|
| 3002 | theory | 248 |
| 2517 | relativity | 223 |
| 2790 | space | 190 |
| 415 | body | 168 |
| 3030 | time | 160 |


#### Charles Dickens

In [18]:
charles_dickens_df = df[df['author'].str.contains('Charles Dickens')]
charles_dickens_count_df = get_word_count(charles_dickens_df['text'])
charles_dickens_count_df.head(5).to_pandas()

| | string | counts |
|---|---|---|
| 28485 | mr | 34408 |
| 36881 | said | 32643 |
| 29906 | one | 17684 |
| 47972 | would | 15266 |
| 45593 | upon | 14657 |



So Einstein is talking about relativity, with words like `relativity`, `theory`, and `body`,
while Charles Dickens is telling stories, with words like `mr`, `said`, and `upon`.

Our Word Count seems to be working :-D

### Word Counts for all the authors

#### Let's get the list of authors in our dataframe

In [19]:
df['author'].unique().to_pandas().head(5)

0    Abraham Lincoln
1    Agatha Christie
2    Albert Einstein
3      Aldous Huxley
4     Alexander Pope
Name: author, dtype: object

#### Calculate the word count for each author and collect the results in a list

In [20]:
%%time
author_wc_ls = []
author_name_ls = []
for author_name in df['author'].unique():
    df_auth = df[df['author']==author_name]
    author_wc = get_word_count(df_auth['text'])
    author_wc_ls.append(author_wc)
    author_name_ls.append(author_name)

CPU times: user 40.7 s, sys: 16.7 s, total: 57.5 s
Wall time: 58.3 s


## Encode the word-count `series` list for all authors into a count-vector

We do this in two steps:

1. Encode the string Series using the `top 20k` most used `words` in the dataset, which we calculated earlier.
    * We encode anything not in the top 20k to string_id = `20_000` (threshold)


2. With the encoded count series for all authors, we create an aligned word-count vector for them, where:
    * Each column corresponds to a `word_id` from the `top 20k words`
    * Each row corresponds to the `count vector` for that author
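
The two steps above can be sketched in plain Python on a toy vocabulary. Everything here is an assumption for illustration: the helper names, the three-word vocabulary, and the use of list positions as ids (the ids assigned on the GPU need not follow this order).

```python
def encode_counts(word_counts, vocab, out_of_dict_id):
    """Step 1: map words to ids; all out-of-vocabulary counts are summed into one id."""
    word_to_id = {word: i for i, word in enumerate(vocab)}
    encoded = {}
    for word, count in word_counts.items():
        word_id = word_to_id.get(word, out_of_dict_id)
        encoded[word_id] = encoded.get(word_id, 0) + count
    return encoded


def to_count_vector(encoded, size):
    """Step 2: place each count at the column given by its id."""
    vector = [0] * size
    for word_id, count in encoded.items():
        vector[word_id] = count
    return vector


vocab = ["said", "one", "mr"]  # toy vocabulary, so the threshold is 3 instead of 20_000
th = len(vocab)
author_counts = {"said": 4, "mr": 2, "tuppence": 5, "tommy": 1}  # made-up counts

encoded = encode_counts(author_counts, vocab, th)
vector = to_count_vector(encoded, th + 1)  # one extra column for "other"
print(encoded, vector)
```

Because every author's vector uses the same column for the same word, the rows can be stacked into one array and compared directly.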

#### Categorize the `string series` from the `word count series` into an `integer series` for all the authors

In [21]:
def str_to_cat(str_s,keys):    
    """
        Cast string column to category (int) using nvcategory
        Codes are the index of each string in keys
        any string not in keys is encoded to -1
    """
    from librmm_cffi import librmm
    
    cat = nvcategory.from_strings(str_s.data).set_keys(keys)
    device_array = librmm.device_array(str_s.data.size(), dtype=np.int32)    
    cat.values(devptr=device_array.device_ctypes_pointer.value)
    
    return cudf.Series(device_array)

def encode_count_df(auth_wc_df,keys,out_of_dict_id):
    """
        Encode the count series for all authors by using the index provided in keys
        All strings not in keys are mapped to out_of_dict_id and their count is summed
    """
    # any string not in keys is encoded to -1
    auth_wc_df['encoded_str_id'] = str_to_cat(auth_wc_df['string'],keys)
    
    # sub df which  contains words that are in the dictionary
    in_dict_wc_df = auth_wc_df[auth_wc_df['encoded_str_id']!=-1]
    
    # sum of `count series` of words not in dictionary 
    out_of_dict_wcount = auth_wc_df[auth_wc_df['encoded_str_id']==-1]['counts'].sum()
    
    # map the summed count of out-of-dictionary words to out_of_dict_id
    out_of_dict_df = cudf.DataFrame({'encoded_str_id':out_of_dict_id,'counts': out_of_dict_wcount,'string':'other'})
    
    # by default cudf creates 64 bit arrays from dict
    # remap them to 32 bits to  line up with in_dict_wc_df
    out_of_dict_df['encoded_str_id'] = out_of_dict_df['encoded_str_id'].astype(np.int32)
    out_of_dict_df['counts'] = out_of_dict_df['counts'].astype(np.int32)
    
    return cudf.concat([in_dict_wc_df,out_of_dict_df])

In [22]:
%%time
# keep only top 20k words in the dataset
th = 20_000
keys = count_df['string'][:th].data
encoded_wc_ls = []

for auth_wc_df in author_wc_ls:
    encoded_count_df = encode_count_df(auth_wc_df,keys,th)
    encoded_wc_ls.append(encoded_count_df)

CPU times: user 6.85 s, sys: 184 ms, total: 7.04 s
Wall time: 7.04 s


##### Now let's check if the encoding worked!

##### Agatha Christie Counts

In [23]:
author_id = author_name_ls.index('Agatha Christie') 
print(author_name_ls[author_id])
encoded_wc_ls[author_id].head(5).to_pandas()

Agatha Christie


| | string | counts | encoded_str_id |
|---|---|---|---|
| 6490 | said | 738 | 15455 |
| 7862 | tuppence | 626 | 18503 |
| 7718 | tommy | 597 | 18137 |
| 5159 | one | 505 | 12506 |
| 4894 | mr | 455 | 11880 |


##### Charles Dickens Counts

In [24]:
author_id = author_name_ls.index('Charles Dickens') 
print(author_name_ls[author_id])
encoded_wc_ls[author_id].head(5).to_pandas()

Charles Dickens


| | string | counts | encoded_str_id |
|---|---|---|---|
| 28485 | mr | 34408 | 11880 |
| 36881 | said | 32643 | 15455 |
| 29906 | one | 17684 | 12506 |
| 47972 | would | 15266 | 19819 |
| 45593 | upon | 14657 | 18861 |


In [25]:
encoded_wc_ls[author_id].head(5).to_pandas()

| | string | counts | encoded_str_id |
|---|---|---|---|
| 28485 | mr | 34408 | 11880 |
| 36881 | said | 32643 | 15455 |
| 29906 | one | 17684 | 12506 |
| 47972 | would | 15266 | 19819 |
| 45593 | upon | 14657 | 18861 |


##### We can see that the encoded_str_id for `said` is `15455` for both `Charles Dickens` and `Agatha Christie`. Yay, the encoding worked!

## Create an aligned word-count vector for each author:

We create an array where each row represents an `author` and each column contains the count of the `word` represented by that column.

#### Create a numba nd-array of shape (`num_authors`, `Vocabulary Size + 1`)

In [26]:
num_authors = len(encoded_wc_ls)
count_ary = np.zeros(shape = (num_authors,th+1), dtype=np.int32)
count_dary = cuda.to_device(count_ary)

Fill the count array using a Numba function:

We apply a Numba kernel that fills each `author_count_array` with the counts of the words used by that `author`.

`Numba Function`: See https://numba.pydata.org/numba-doc/0.13/CUDAJit.html for more `info` on how to write `cuda-jit` functions.

In [27]:
%%time

@cuda.jit('void(int32[:], int32[:], int32[:])')
def count_vec_func(author_token_id_array,author_token_count_array,author_count_array):
    
    pos = cuda.grid(1)
    if pos < author_token_id_array.size:
        token_id = author_token_id_array[pos]
        token_count = author_token_count_array[pos]
        author_count_array[token_id] = token_count        
        
for author_id,encoded_wc_df in enumerate(encoded_wc_ls):    
    count_sr = encoded_wc_df['counts']
    token_id_sr =  encoded_wc_df['encoded_str_id']
    
    count_ar = count_sr.data.to_gpu_array()
    token_id_ar = token_id_sr.data.to_gpu_array()
    author_ar = count_dary[author_id]
    
    # See https://numba.pydata.org/numba-doc/0.13/CUDAJit.html
    threadsperblock = 36
    blockspergrid = (count_ar.size + (threadsperblock - 1)) // threadsperblock
    count_vec_func[blockspergrid, threadsperblock](token_id_ar,count_ar,author_ar)

CPU times: user 200 ms, sys: 7.98 ms, total: 208 ms
Wall time: 206 ms
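
To see what the launch configuration and the bounds check are doing, here is a small CPU emulation in plain Python. It is a sketch only: the function names are hypothetical, and a sequential loop stands in for the GPU threads.

```python
def blocks_per_grid(n_items, threads_per_block):
    """Ceiling division: the smallest number of blocks that covers n_items threads."""
    return (n_items + threads_per_block - 1) // threads_per_block


def scatter_counts(token_ids, token_counts, row, threads_per_block=36):
    """Emulate the kernel body, with one loop iteration per launched thread."""
    n_threads = blocks_per_grid(len(token_ids), threads_per_block) * threads_per_block
    for pos in range(n_threads):
        # The last block usually has spare threads; the guard makes them do nothing.
        if pos < len(token_ids):
            row[token_ids[pos]] = token_counts[pos]
    return row


print(blocks_per_grid(100, 36))                     # 3 blocks give 108 threads for 100 items
print(scatter_counts([2, 0], [7, 5], [0, 0, 0]))    # counts land at columns 2 and 0
```

The plain assignment in the kernel is safe here because each `encoded_str_id` appears at most once per author after encoding, so no two threads write to the same column. If ids could repeat, an atomic add would be the usual alternative.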


#### Now, let's check if creating the count vectors worked!

In [28]:
author_id = author_name_ls.index('Agatha Christie')  

print(author_name_ls[author_id])
top_word_ids = encoded_wc_ls[author_id]['encoded_str_id'].head(5).to_pandas()
for word_id in top_word_ids:
    print("{} : {}".format(word_id,count_dary[author_id][word_id]))

Agatha Christie
15455 : 738
18503 : 626
18137 : 597
12506 : 505
11880 : 455


## Let's find the Nearest Authors

Now our count array is ready for ML.

Let's train a KNN model on the count vectors and see if we can find any interesting patterns. `Euclidean distance` is not the best measure for such high-dimensional spaces, but it still works for a small toy example.


#### Normalize Counts

In [29]:
normalized_count_array = count_dary/np.sum(count_dary,axis=1)[:,None]
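
Dividing each row by its sum turns raw counts into relative frequencies, so authors with very different amounts of text become comparable. A plain-Python sketch of the same operation follows; the function name is hypothetical, and the guard for an all-zero row is an addition of this sketch that the cell above does not have.

```python
def normalize_rows(matrix):
    """Divide every row by its sum so that each non-empty row sums to 1."""
    normalized = []
    for row in matrix:
        total = sum(row)
        # Guard against an all-zero row to avoid dividing by zero (sketch-only addition).
        normalized.append([value / total if total else 0.0 for value in row])
    return normalized


print(normalize_rows([[1, 1, 2], [0, 0, 0]]))
```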

#### Train and find nearest neighbours in the non-embedded space

In [30]:
%%time
nn_model = cuml.neighbors.NearestNeighbors(n_neighbors = 5)
nn_model.fit(normalized_count_array)
ouput_mat,output_indices_count_sp = nn_model.kneighbors(X=normalized_count_array)

CPU times: user 1.98 s, sys: 314 ms, total: 2.29 s
Wall time: 961 ms
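
For intuition, a brute-force nearest-neighbour query with Euclidean distance can be sketched in a few lines of plain Python. This only illustrates the idea on three made-up points; it is not how cuML computes neighbours, and the function names are hypothetical.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def k_nearest(points, k):
    """For every point, return the indices of its k nearest points."""
    result = []
    for p in points:
        distances = sorted((euclidean(p, q), j) for j, q in enumerate(points))
        result.append([j for _distance, j in distances[:k]])
    return result


# Three made-up 2-D points.
points = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
print(k_nearest(points, 2))
```

Because the query set is the same as the fitted set, every point is its own nearest neighbour at distance zero, which is why each author appears first in their own neighbour list.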


#### Nearest authors to Albert Einstein in the count vector space

In [31]:
author_id = author_name_ls.index('Albert Einstein') 
for index in output_indices_count_sp[author_id]:
    print(author_name_ls[int(index)])

Albert Einstein
Thomas Carlyle
Lord Byron
James Russell Lowell
Michael Faraday


#### Nearest authors to Charles Dickens in the count vector space

In [32]:
author_id = author_name_ls.index('Charles Dickens') 
for index in output_indices_count_sp[author_id]:
    print(author_name_ls[int(index)])

Charles Dickens
Winston Churchill
Charlotte Mary Yonge
Harriet Elizabeth Beecher Stowe
William Dean Howells


#### Encode the count vectors to a lower dimension using UMAP

In [33]:
embedding_ar_gpu =  cuml.UMAP(n_neighbors=100,n_components=3).fit_transform(normalized_count_array)

#### KNN in the lower dimensional space

In [34]:
%%time
nn_model = cuml.neighbors.NearestNeighbors(n_neighbors = 5)
nn_model.fit(embedding_ar_gpu)
ouput_mat,output_indices_umap = nn_model.kneighbors(X=embedding_ar_gpu)

CPU times: user 1.05 s, sys: 155 ms, total: 1.21 s
Wall time: 121 ms


#### Nearest authors to Albert Einstein in the embedded space

In [35]:
author_id = author_name_ls.index('Albert Einstein') 
for index in output_indices_umap[author_id]:
    print(author_name_ls[int(index)])

Albert Einstein
Thomas Crofton Croker
Ezra Pound
Thomas Carlyle
Robert Hooke


#### Nearest authors to Charles Dickens in the embedded space

In [36]:
author_id = author_name_ls.index('Charles Dickens') 
for index in output_indices_umap[author_id]:
    print(author_name_ls[int(index)])

Charles Dickens
Hamlin Garland
Mary Shelley
William Dean Howells
William Penn


Want to get started with RAPIDS? Check out [`cuDF`](https://github.com/rapidsai/cudf) on GitHub and let us know what you think! You can download pre-built Docker containers for our 0.8 release from [NGC](https://ngc.nvidia.com/catalog/landing) or [Dockerhub](https://hub.docker.com/r/rapidsai/rapidsai/) to get started, or install it yourself via Conda. Need something even easier? You can quickly get started with RAPIDS in [Google Colab](https://colab.research.google.com/drive/1XTKHiIcvyL5nuldx0HSL_dUa8yopzy_Y#forceEdit=true&offline=true&sandboxMode=true) and try out all the new things we've added with just a single push of a button.

Don't want to wait for the next release to use upcoming features? You can download our nightly containers from [Dockerhub](https://hub.docker.com/r/rapidsai/rapidsai-nightly) or install via [Conda](https://anaconda.org/rapidsai-nightly) to stay at the tip of our development branch.