# Gutenberg NLP Analysis using RAPIDS

### Blog Link:
https://medium.com/rapids-ai/show-me-the-word-count-3146e1173801


### Objective: Showcase the NLP capabilities of cuDF

### Pre-Processing :
* filter punctuation
* to_lower
* remove stop words (from nltk corpus)
* replace multiple spaces with a single space
* remove leading and trailing spaces
    
### Word Count:
* Get the frequency count for the whole dataset
* Compare word counts for two authors (Albert Einstein vs. Charles Dickens)
* Get word counts for all the authors

### Encode the word-count for all authors into a count-vector

We do this in two steps:

1. Encode the string Series using the `top 20k` most used `words` in the dataset, which we calculate in the word count step.
    * We encode anything not in the series to string_id = `20_000` (`threshold`)


2. With the encoded count series for all authors, we create an aligned word-count vector for them, where:
    * Each column corresponds to a `word_id` from the `top 20k words`
    * Each row corresponds to the `count vector` for that author
    
    
### Find the nearest authors using the count-vector:
* Fit a KNN model
* Find the authors nearest to each other in the count vector space
* Reduce dimensionality using UMAP
* Find the authors nearest to each other in the latent space

### Data Download Links:

Download the data from: https://web.eecs.umich.edu/~lahiri/gutenberg_dataset.html

You can also run the commands below:

In [None]:
!pip install gdown
!gdown https://drive.google.com/uc?id=0B2Mzhc7popBga2RkcWZNcjlRTGM
!apt update
!apt install unzip
!unzip Gutenberg.zip

### Import libraries

In [None]:
import cudf
import os
import numpy as np
import cuml
try:
    import nltk
except ModuleNotFoundError:
    os.system('pip install nltk')
    import nltk
from numba import cuda
from dask.utils import parse_bytes

### Setting the RMM Pool
RAPIDS Memory Manager (RMM) allows sharing a memory pool between RAPIDS libraries and CuPy.
This allows us to use a single device memory pool on the entire GPU, providing significant performance gains by reducing the cost of dynamically allocating and freeing memory.

In [None]:
cudf.set_allocator(pool=True, initial_pool_size=parse_bytes("8GB"))

### Set Data Dir 

In [None]:
data_dir = 'Gutenberg/txt'

## Read Text Frame

#### Read helper functions

In [None]:
def get_non_empty_lines(lines):
    """
        returns non empty lines from a list of lines
    """
    clean_lines = []
    for line in lines:
        str_line = line.strip()
        if str_line:
            clean_lines.append(str_line)            
    return clean_lines

def get_txt_lines(data_dir):
    """
        Read text lines from gutenberg texts
        returns (text_ls,fname_ls) where 
        text_ls= input_text_lines and fname_ls = list of fnames
    """
    text_ls = []
    fname_ls = []
    for fn in os.listdir(data_dir):
        full_fn = os.path.join(data_dir,fn)
        with open(full_fn,encoding="utf-8",errors="ignore") as f:
            content = f.readlines()
            content = get_non_empty_lines(content)
            text_ls += content
            # strip the ".txt" extension from the file name
            fname_ls += [fn[:-4]]*len(content)
    
    return text_ls, fname_ls

### Read text lines into a cudf dataframe

In [None]:
print("File Read Time:")
%time txt_ls,fname_ls = get_txt_lines(data_dir)
df = cudf.DataFrame()

print("\nCUDF  Creation Time:")
%time df['text'] = cudf.Series(txt_ls,dtype='str')

df['label'] = cudf.Series(fname_ls,dtype='str')
title_label_df = df['label'].str.split('___')
df['author'] = title_label_df[0]
df['title'] = title_label_df[1]
df = df.drop(labels=['label'])
print("Number of lines in the DF = {:,}".format(len(df)))
df.head(5)
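
The `label` split above relies on file names of the form `Author___Title.txt`. As a minimal plain-Python sketch of the same parsing (the file name below is a hypothetical example, not read from the dataset):

```python
# Hypothetical file name following the `Author___Title.txt` pattern
fname = 'Charles Dickens___A Tale of Two Cities.txt'

label = fname[:-4]                    # drop the ".txt" suffix, as in get_txt_lines
author, title = label.split('___')    # same separator as the cuDF str.split call

print(author, '|', title)
```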

## NLP Preprocessing

In almost every workflow involving textual data, we'll want to do some kind of preprocessing before running our analysis. We might want to remove punctuation, standardize to all lowercase characters, and potentially dozens of other small tasks. RAPIDS makes developing GPU accelerated preprocessing pipelines smooth.

Let's start by removing all the punctuation, since we don't want those characters to cloud our analysis. We could replace them one by one in many calls to replace. More realistically, we might generate a large regular expression pattern that looks for `!`, `,`, `%` and all of our other patterns and replaces them. It might look something like this: `(!)|(,)...|(%)`.

A longer regex is not necessarily less efficient on the GPU. If an instruction within the regex fails to match the current character of the string, the rest of the expression does not need to be evaluated and we can move on to the next character. However, regexes with many alternations, as in our case, may mean evaluating the same character against many more instructions before continuing. An alternation can be explicit, as in `(\bone\b)|(\b1\b)`, or implicit, as in `[aA]`.

Building such a pattern can be tedious, and it isn't well suited to the GPU.

Overall, avoiding regex can be more efficient, since the regex algorithm is complex due to the richness of its features.

For cases like removing multiple `characters` or `stop words`, a `general regex` can be overkill, and `cudf.str` provides alternative methods that make this computation much faster.

In this workflow we use the following `cudf.Series.str` functions:
* `str_ser.str.translate`: Allows passing a dict to replace multiple punctuation characters with blank spaces.
* `str_ser.str.replace_tokens`: Replaces the given tokens with a single space.
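
To make the contrast concrete, here is a small CPU-only sketch using Python's built-in `str.translate` and `re`. It only illustrates the two approaches on one string and says nothing about GPU performance. The shortened `filters` list and the sample sentence are assumptions made for the example.

```python
import re

# Small stand-in for the notebook's `filters` list (assumption: a subset is enough here)
filters = ['!', ',', '.', '%', '?']

# Approach 1: a translation table maps each code point to a space,
# so every character needs a single table lookup
translation_table = {ord(char): ord(' ') for char in filters}

# Approach 2: one regex with an explicit alternation per character, e.g. (!)|(,)|...
pattern = re.compile('|'.join('(' + re.escape(char) + ')' for char in filters))

line = "Hello, world! 100% sure?"
via_translate = line.translate(translation_table)
via_regex = pattern.sub(' ', line)

print(repr(via_translate))
print(via_translate == via_regex)
```

Both approaches give the same string here. The translation table simply avoids stepping through the alternation for each character.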

Please check out `https://docs.rapids.ai/api/cudf/nightly/`; we are adding more features every day.


#### Now back to our workflow:

##### Removing Filters:
First, we need to define our list of filter characters.

In [None]:
# remove the following punctuations/characters from cudf
filters = [ '!', '"', '#', '$', '%', '&', '(', ')', '*', '+', '-', '.', '/',  '\\', ':', ';', '<', '=', '>',
           '?', '@', '[', ']', '^', '_', '`', '{', '|', '}', '\t','\n',"'",",",'~' , '—']

Next, we can simply pass `filters` to the string processing methods inside cuDF and apply them to our Series. We'll eventually make a helper function to let us execute this on every column in the DataFrame. But, let's just take a quick look now at a sample of our text data.

In [None]:
text_col_sample = df.head(5)
text_col_sample['text']

In [None]:
translation_table = {ord(char): ord(' ') for char in filters}
text_col_sample['text_clean'] = text_col_sample['text'].str.translate(translation_table)
text_col_sample['text_clean'].to_pandas()

With one method we replaced all of the symbols in our `filters` list with spaces. Next, we'll want to convert to lowercase with `str.lower()`.

##### To Lower

In [None]:
text_col_sample['text_clean'] = text_col_sample['text_clean'].str.lower()
text_col_sample['text_clean'].to_pandas()

We can also remove stopwords with `replace_tokens`. We can pass the default list of English stopwords that ships with the `nltk` library. We'll replace each of our stopwords with a single space.

##### Remove Stop Words

In [None]:
nltk.download('stopwords')
STOPWORDS = nltk.corpus.stopwords.words('english')
STOPWORDS = cudf.Series(STOPWORDS)

In [None]:
text_col_sample['text_clean'] = text_col_sample['text_clean'].str.replace_tokens(STOPWORDS, ' ')
text_col_sample['text_clean'].to_pandas()

##### Replacing Multiple White Spaces

This looks great, but we'll probably want to replace multiple spaces in a row with a single space and strip leading and trailing spaces. We can do that easily, too.

In [None]:
text_col_sample['text_clean'] = text_col_sample['text_clean'].str.normalize_spaces( )
text_col_sample['text_clean'] = text_col_sample['text_clean'].str.strip(' ')
text_col_sample['text_clean'].to_pandas()

With that, we've finished our basic preprocessing steps on a tiny sample of our text column. We'll wrap this into a function for portability, and run it on the entire data. We'll rewrite our code to create our filter list and stopwords again for clarity.

#### Full Pre-processing Pipeline

##### CPU
- `5 min 2 s` with pure `Pandas`
- `Dask CPU Time` = `15.25 s` (on a dual 16-core CPU, 64 virtual cores)

##### GPU (RAPIDS)
- `2.94 s` on a `Tesla T4 GPU`

In [None]:
STOPWORDS = nltk.corpus.stopwords.words('english')

filters = [ '!', '"', '#', '$', '%', '&', '(', ')', '*', '+', '-', '.', '/',  '\\', ':', ';', '<', '=', '>',
           '?', '@', '[', ']', '^', '_', '`', '{', '|', '}', '\t','\n',"'",",",'~' , '—']

def preprocess_text(input_strs, filters=None, stopwords=STOPWORDS):
    """
        * filter punctuation
        * to_lower
        * remove stop words (from nltk corpus)
        * replace multiple spaces with one
        * remove leading and trailing spaces
    """

    # filter punctuation (skipped when no filters are passed)
    if filters:
        translation_table = {ord(char): ord(' ') for char in filters}
        input_strs = input_strs.str.translate(translation_table)

    # case conversion
    input_strs = input_strs.str.lower()

    # remove stopwords, using the `stopwords` argument rather than the global list
    stopwords_gpu = cudf.Series(stopwords)
    input_strs = input_strs.str.replace_tokens(stopwords_gpu, ' ')

    # replace multiple spaces with single one and strip leading/trailing spaces
    input_strs = input_strs.str.normalize_spaces()
    input_strs = input_strs.str.strip(' ')

    return input_strs

def preprocess_text_df(df, text_cols=['text'], **kwargs):
    for col in text_cols:
        df[col] = preprocess_text(df[col], **kwargs)
    return  df

With our function defined, we can execute it to preprocess the entire dataset.

In [None]:
%time df = preprocess_text_df(df, filters=filters)

In [None]:
df['text'].head(5)
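
For intuition, the same five steps can be sketched on a single Python string. This is a CPU-only illustration, not the code behind the timings above. The helper name `preprocess_line`, the tiny `filters` list and the stop word set are assumptions made for the example.

```python
# Tiny stand-ins for the notebook's `filters` and the NLTK stop word list (assumptions)
filters = ['!', ',', '.', "'"]
stopwords = {'the', 'a', 'of', 'and'}

def preprocess_line(line, filters, stopwords):
    """CPU sketch of the same five steps applied to one Python string."""
    # 1. punctuation -> spaces
    table = {ord(char): ord(' ') for char in filters}
    line = line.translate(table)
    # 2. lowercase
    line = line.lower()
    # 3. drop stop word tokens
    # 4./5. split() followed by join() collapses repeated spaces and strips both ends
    tokens = [tok for tok in line.split() if tok not in stopwords]
    return ' '.join(tokens)

cleaned = preprocess_line("  The Theory of Relativity, and a Tale!  ", filters, stopwords)
print(cleaned)
```

The cuDF version replaces stop words with spaces and normalizes whitespace afterwards, while this sketch drops the tokens directly. The resulting text is the same for this input.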

## Word Count

Let's find the top words used:
* in the whole dataset
* by Albert Einstein
* by Charles Dickens

In [None]:
## Getting a frequency count for Strings

def get_word_count(str_col):
    """
        returns the count of input strings
    """ 
    ## Tokenize: convert sentences into a long list of words
    ## Get counts: Groupby each token to get value counts

    df = cudf.DataFrame()
    # tokenize the sentences with nvtext.tokenize(), which flattens
    # them into a single tall column with one token per row
    df['string'] = str_col.str.tokenize()
    
    # Using Group by to do a value count for string columns
    # This will be natively supported soon
    # See: issue https://github.com/rapidsai/cudf/issues/1951
    df['counts'] = np.dtype('int32').type(0)
    res = df.groupby('string').count()
    res = res.reset_index(drop=False).sort_values(by='counts', ascending=False)
    return res.rename(columns={'index':'string'})
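
The tokenize-then-group-by pattern used in `get_word_count` has a direct CPU analogue in the standard library. The sketch below uses `collections.Counter` on a few made-up lines. `get_word_count_cpu` is a hypothetical name, and the function is only meant to show what the GPU version computes.

```python
from collections import Counter

def get_word_count_cpu(lines):
    """Hypothetical CPU counterpart: tokenize on whitespace, then count each token."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    # most_common() sorts by descending count, like sort_values(ascending=False)
    return counts.most_common()

sample = ["theory relativity body", "relativity theory", "relativity"]
word_counts = get_word_count_cpu(sample)
print(word_counts)
```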

### Top Words  Across the dataset

In [None]:
%%time 

count_df = get_word_count(df['text'])
count_df.head(5).to_pandas()

### Now let's compare Charles Dickens and Albert Einstein

####  Albert Einstein

In [None]:
einstein_df = df[df['author'].str.contains('Einstein')]
einstein_count_df = get_word_count(einstein_df['text'])
einstein_count_df.head(5).to_pandas()

#### Charles Dickens

In [None]:
charles_dickens_df = df[df['author'].str.contains('Charles Dickens')]
charles_dickens_count_df = get_word_count(charles_dickens_df['text'])
charles_dickens_count_df.head(5).to_pandas()


So Einstein is talking about relativity, with words like `relativity`, `theory`, `body`,
while Charles Dickens is telling stories with `one`, `upon`, `time`, `old`.

Our Word Count seems to be working :-D

### Word Counts for all the authors

#### Let's get the list of authors in our dataframe

In [None]:
df['author'].unique().to_pandas().head(5)

#### Calculate the word count for each author and collect the results in a list

In [None]:
%%time
author_wc_ls = []
author_name_ls = []
for author_name in df['author'].unique():
    df_auth = df[df['author']==author_name]
    author_wc = get_word_count(df_auth['text'])
    author_wc_ls.append(author_wc)
    author_name_ls.append(author_name)

## Encode the word-count `series` list for all authors into a count-vector

We do this in two steps:

1. Encode the string Series using the `top 20k` most used `words` in the dataset, which we calculated earlier.
    * We encode anything not in the series to string_id = `20_000` (threshold)


2. With the encoded count series for all authors, we create an aligned word-count vector for them, where:
    * Each column corresponds to a `word_id` from the `top 20k words`
    * Each row corresponds to the `count vector` for that author
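
The two steps can be sketched on a toy vocabulary in plain Python. Everything here is illustrative: a three-word vocabulary stands in for the top 20k words, the per-author counts are made up, and `to_count_vector` is a hypothetical helper rather than part of the notebook.

```python
# Toy vocabulary: the "top 3" words stand in for the notebook's top 20k (assumption)
top_words = ['said', 'one', 'time']
th = len(top_words)                       # out-of-dictionary id, like th = 20_000
word_to_id = {word: i for i, word in enumerate(top_words)}

# Hypothetical per-author word counts
author_counts = {
    'Author A': {'said': 5, 'time': 2, 'relativity': 4, 'ether': 1},
    'Author B': {'one': 3, 'said': 7},
}

def to_count_vector(word_counts):
    # One slot per vocabulary word plus a final "other" bucket
    vector = [0] * (th + 1)
    for word, count in word_counts.items():
        # Step 1: encode the word, mapping unknown words to the threshold id
        word_id = word_to_id.get(word, th)
        # Step 2: place the count in the aligned column (unknown words are summed)
        vector[word_id] += count
    return vector

count_matrix = {name: to_count_vector(wc) for name, wc in author_counts.items()}
for name, vector in count_matrix.items():
    print(name, vector)
```

Because every author's vector uses the same column order, the rows can be compared directly, which is what the nearest-neighbour step later relies on.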

#### Categorize the `string series` from the `word count series` into an `integer series` for all the authors

In [None]:
def encode_count_df(auth_wc_df,keys,out_of_dict_id):
    """
        Encode the count series for all authors by using the index provided in keys
        All strings not in keys are mapped to out_of_dict_id and their count is summed
    """
    auth_wc_df['encoded_str_id'] = auth_wc_df['string'].astype('category')
    auth_wc_df['encoded_str_id'] = auth_wc_df['encoded_str_id'].cat.set_categories(keys)._column.codes
    auth_wc_df['encoded_str_id'] = auth_wc_df['encoded_str_id'].fillna(out_of_dict_id)
    
    # sub df which  contains words that are in the dictionary
    in_dict_wc_df = auth_wc_df[auth_wc_df['encoded_str_id']!=out_of_dict_id]
    
    # sum of `count series` of words not in dictionary 
    out_of_dict_wcount = auth_wc_df[auth_wc_df['encoded_str_id']==out_of_dict_id]['counts'].sum()
    
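    # single-row frame that maps the summed out-of-dictionary count to out_of_dict_id
    # (values are wrapped in lists so the frame gets exactly one row)
    out_of_dict_df = cudf.DataFrame({'encoded_str_id':[out_of_dict_id],'counts': [out_of_dict_wcount],'string':['other']})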
    
    out_of_dict_df['encoded_str_id'] = out_of_dict_df['encoded_str_id'].astype(np.int32)
    out_of_dict_df['counts'] = out_of_dict_df['counts'].astype(np.int32)
    
    return cudf.concat([in_dict_wc_df,out_of_dict_df])


In [None]:
%%time
# keep only top 20k words in the dataset
th = 20_000
keys = count_df['string'][:th]
encoded_wc_ls = []

for auth_wc_df in author_wc_ls:
    encoded_count_df = encode_count_df(auth_wc_df,keys,th)
    encoded_wc_ls.append(encoded_count_df)

##### Now let's check if the encoding worked!

##### Agatha Christie Counts

In [None]:
author_id = author_name_ls.index('Agatha Christie') 
print(author_name_ls[author_id])
encoded_wc_ls[author_id].head(5).to_pandas()

##### Charles Dickens Counts

In [None]:
author_id = author_name_ls.index('Charles Dickens') 
print(author_name_ls[author_id])
encoded_wc_ls[author_id].head(5).to_pandas()

##### We can see that the encoded_str_id for `said` is `0` for both `Charles Dickens` and `Agatha Christie`. Yay, the encoding worked!

## Create an aligned word-count vector for each author:

We create an array where each row represents an `author` and each column contains the count of the `word` represented by that `column`.

#### Create a numba nd-array of shape (`num_authors`, `Vocabulary Size + 1`)

In [None]:
num_authors = len(encoded_wc_ls)
count_ary = np.zeros(shape = (num_authors,th+1), dtype=np.int32)
count_dary = cuda.to_device(count_ary)

Fill the count array using a numba function:

We apply the numba function to fill the `author_count_array` with the count of each word used by the `author`.

`Numba Function`: See https://numba.pydata.org/numba-doc/0.13/CUDAJit.html for more `info` on how to write `cuda-jit` functions.

In [None]:
%%time

@cuda.jit('void(int32[:], int32[:], int32[:])')
def count_vec_func(author_token_id_array,author_token_count_array,author_count_array):
    pos = cuda.grid(1)
    if pos < author_token_id_array.size:
        token_id = author_token_id_array[pos]
        token_count = author_token_count_array[pos]
        author_count_array[token_id] = token_count        
        
for author_id,encoded_wc_df in enumerate(encoded_wc_ls):    
    count_sr = encoded_wc_df['counts']
    token_id_sr =  encoded_wc_df['encoded_str_id']
    
    count_ar = count_sr._column.data_array_view
    token_id_ar = token_id_sr._column.data_array_view
    author_ar = count_dary[author_id]
    
    # See https://numba.pydata.org/numba-doc/0.13/CUDAJit.html
    threadsperblock = 36
    blockspergrid = (count_ar.size + (threadsperblock - 1)) // threadsperblock
    count_vec_func[blockspergrid, threadsperblock](token_id_ar,count_ar,author_ar)
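
The launch configuration and the bounds check in the kernel can be emulated on the CPU. The sketch below is plain Python, not CUDA. `blocks_per_grid` and `scatter_counts_cpu` are hypothetical names, and the inner loops stand in for GPU threads that would run in parallel.

```python
def blocks_per_grid(n_items, threads_per_block):
    # Ceiling division: enough blocks so that blocks * threads >= n_items
    return (n_items + threads_per_block - 1) // threads_per_block

def scatter_counts_cpu(token_ids, token_counts, row, threads_per_block=36):
    """Emulate the kernel: each (block, thread) pair handles one position."""
    n_blocks = blocks_per_grid(len(token_ids), threads_per_block)
    for block in range(n_blocks):
        for thread in range(threads_per_block):
            pos = block * threads_per_block + thread   # the value cuda.grid(1) would return
            if pos < len(token_ids):                   # guard for the partly filled last block
                row[token_ids[pos]] = token_counts[pos]
    return row

n_blocks_for_100 = blocks_per_grid(100, 36)
row = scatter_counts_cpu([0, 2, 3], [5, 2, 5], [0, 0, 0, 0])
print(n_blocks_for_100, row)
```

The plain assignment into `row` is only safe under the assumption that each token id appears at most once per author, which the encoding step above appears to ensure. With repeated ids, a real kernel would need atomic adds.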

#### Now, let's check if creating the count vectors worked!

In [None]:
author_id = author_name_ls.index('Agatha Christie')  

print(author_name_ls[author_id])
top_word_ids = encoded_wc_ls[author_id]['encoded_str_id'].head(5).to_pandas()
for word_id in top_word_ids:
    print("{} : {}".format(word_id,count_dary[author_id][word_id]))

## Let's find the Nearest Authors

Now our count array is ready for ML.

Let's train a KNN on the count vectors and see if we can find any interesting patterns. Although `Euclidean distance` is not the best measure for such high-dimensional spaces, it still works for a small toy example.
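
As a rough CPU illustration of what the next cells compute, the sketch below normalizes a few made-up count vectors so each row sums to 1, then finds nearest neighbours by brute-force Euclidean distance. The author names, the counts and the `nearest` helper are assumptions for the example. The notebook itself uses `cuml.neighbors.NearestNeighbors`.

```python
import math

# Hypothetical count vectors (last slot is the out-of-dictionary bucket)
count_vectors = {
    'Author A': [5, 0, 2, 5],
    'Author B': [7, 3, 0, 0],
    'Author C': [10, 1, 4, 9],
}

def normalize(vector):
    # Divide by the row total so authors with more text are comparable
    total = sum(vector)
    return [value / total for value in vector]

normalized = {name: normalize(vec) for name, vec in count_vectors.items()}

def nearest(name, k=2):
    """Brute-force k nearest authors by Euclidean distance (the query itself comes first)."""
    distances = [(math.dist(normalized[name], vec), other)
                 for other, vec in normalized.items()]
    return [other for _, other in sorted(distances)[:k]]

print(nearest('Author A'))
```

Without the normalization step, `Author C` would look far from `Author A` simply because it has about twice as many words, even though the word proportions are similar.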


#### Normalize Counts

In [None]:
normalized_count_array = count_dary/np.sum(count_dary,axis=1)[:,None]

#### Train and find nearest neighbours in the non-embedded space

In [None]:
%%time
nn_model = cuml.neighbors.NearestNeighbors(n_neighbors = 5)
nn_model.fit(normalized_count_array)
output_mat,output_indices_count_sp = nn_model.kneighbors(X=normalized_count_array)

#### Nearest authors to Albert Einstein in the count vector space

In [None]:
author_id = author_name_ls.index('Albert Einstein') 
for index in output_indices_count_sp[author_id]:
    print(author_name_ls[int(index)])

#### Nearest authors to Charles Dickens in the count vector space

In [None]:
author_id = author_name_ls.index('Charles Dickens') 
for index in output_indices_count_sp[author_id]:
    print(author_name_ls[int(index)])

#### Encode the count vectors to a lower dimension using UMAP

In [None]:
embedding_ar_gpu =  cuml.UMAP(n_neighbors=100,n_components=3).fit_transform(normalized_count_array)

#### KNN in the lower dimensional space

In [None]:
%%time
nn_model = cuml.neighbors.NearestNeighbors(n_neighbors = 5)
nn_model.fit(embedding_ar_gpu)
output_mat,output_indices_umap = nn_model.kneighbors(X=embedding_ar_gpu)

#### Nearest authors to Albert Einstein in the embedded space

In [None]:
author_id = author_name_ls.index('Albert Einstein') 
for index in output_indices_umap[author_id]:
    print(author_name_ls[int(index)])

#### Nearest authors to Charles Dickens in the embedded space

In [None]:
author_id = author_name_ls.index('Charles Dickens') 
for index in output_indices_umap[author_id]:
    print(author_name_ls[int(index)])

Want to get started with RAPIDS? Check out [`cuDF`](https://github.com/rapidsai/cudf) on GitHub and let us know what you think! You can download pre-built Docker containers for our 0.8 and newer releases from [NGC](https://ngc.nvidia.com/catalog/landing) or [Dockerhub](https://hub.docker.com/r/rapidsai/rapidsai/) to get started, or install it yourself via Conda. Need something even easier? You can quickly get started with RAPIDS in [Google Colab](https://colab.research.google.com/drive/1XTKHiIcvyL5nuldx0HSL_dUa8yopzy_Y#forceEdit=true&offline=true&sandboxMode=true) and try out all the new things we've added with just a single push of a button.

Don't want to wait for the next release to use upcoming features? You can download our nightly containers from [Dockerhub](https://hub.docker.com/r/rapidsai/rapidsai-nightly) or install via [Conda](https://anaconda.org/rapidsai-nightly) to stay at the tip of our development branch.

### Other Examples of NLP workflows:
- [Q18](https://github.com/rapidsai/tpcx-bb/tree/master/tpcx_bb/queries/q18): Identify the stores with flat or declining sales in 4 consecutive months, check if there are any negative reviews regarding these stores available online.

- [Q19](https://github.com/rapidsai/tpcx-bb/tree/master/tpcx_bb/queries/q19): Retrieve the items with the highest number of returns where the number of returns was approximately equivalent across all store and web channels (within a tolerance of +/- 10%), within the week ending given dates. Analyse the online reviews for these items to see if there are any negative reviews.

- [Q27](https://github.com/rapidsai/tpcx-bb/tree/master/tpcx_bb/queries/q27): For a given product, find "competitor" company names in the product reviews. Display review id, product id, "competitor’s" company name and the related sentence from the online review.

- [Q28](https://github.com/rapidsai/tpcx-bb/tree/master/tpcx_bb/queries/q28): Build text classifier for online review sentiment classification (Positive, Negative, Neutral), using 90% of available reviews for training and the remaining 10% for testing. Display classifier accuracy on testing data
and classification result for the 10% testing data: `<reviewSK>`,`<originalRating>`,`<classificationResult>`  

- [cyBERT](https://medium.com/rapids-ai/cybert-28b35a4c81c4) Click-streams work for cyber log parsing

### Upcoming NLP work

- [Count vectorizer in cuml](https://github.com/rapidsai/cuml/pull/2267)
- [GPU accelerated Bert tokenizer](https://github.com/rapidsai/cudf/issues/4981)