# Natural Language Processing

Natural Language Processing (NLP) is a vast subject with many different specializations. Here we are going to discuss two important topics.

* Sentiment analysis
* Language modelling

<table align="left">
    <td>
        <a target="_blank" href="https://colab.research.google.com/github/thushv89/manning_tf2_in_action/blob/master/Ch09/9.1.Natural_Language_Processing.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
    </td>
</table>



## Importing libraries and other housekeeping

In [21]:
import tensorflow as tf
#import tensorflow_hub as hub
import requests
print(tf.__version__)
import zipfile
import requests
import os
import time
import pandas as pd
import random
import shutil
from tensorflow.keras.preprocessing.image import ImageDataGenerator
import os
import tensorflow.keras.layers as layers
import tensorflow.keras.models as models
import tensorflow_addons as tfa
from tensorflow.keras.losses import CategoricalCrossentropy
import tensorflow.keras.backend as K
from tensorflow.keras.callbacks import EarlyStopping, CSVLogger
import numpy as np
from PIL import Image
import pickle
from tensorflow.keras.models import load_model, Model
from PIL import Image
from PIL.PngImagePlugin import PngImageFile
import matplotlib.pyplot as plt
import glob
from functools import partial
import nltk

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        # Currently, memory growth needs to be the same across GPUs
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
    except:
        print("Couldn't set memory_growth")
        pass
    
    
def fix_random_seed(seed):
    """ Setting the random seed of various libraries """
    try:
        np.random.seed(seed)
    except NameError:
        print("Warning: Numpy is not imported. Setting the seed for Numpy failed.")
    try:
        tf.random.set_seed(seed)
    except NameError:
        print("Warning: TensorFlow is not imported. Setting the seed for TensorFlow failed.")
    try:
        random.seed(seed)
    except NameError:
        print("Warning: random module is not imported. Setting the seed for random failed.")

# Fixing the random seed
random_seed=4321
fix_random_seed(random_seed)

2.2.1


## Downloading data

In [6]:
# Downloading the data
# http://deepyeti.ucsd.edu/jianmo/amazon/categoryFilesSmall/Video_Games_5.json.gz

import os
import requests
import gzip
import shutil

# Retrieve the data
if not os.path.exists(os.path.join('data','Video_Games_5.json.gz')):
    url = "http://deepyeti.ucsd.edu/jianmo/amazon/categoryFilesSmall/Video_Games_5.json.gz"
    # Get the file from web
    r = requests.get(url)

    if not os.path.exists('data'):
        os.mkdir('data')
    
    # Write to a file
    with open(os.path.join('data','Video_Games_5.json.gz'), 'wb') as f:
        f.write(r.content)
else:
    print("The tar file already exists.")
    
if not os.path.exists(os.path.join('data', 'Video_Games_5.json')):
    with gzip.open(os.path.join('data','Video_Games_5.json.gz'), 'rb') as f_in:
        with open(os.path.join('data','Video_Games_5.json'), 'wb') as f_out:
            shutil.copyfileobj(f_in, f_out)
else:
    print("The extracted data already exists")


## Loading review data

In [10]:
import pandas as pd

review_df = pd.read_json(os.path.join('data', 'Video_Games_5.json'), lines=True, orient='records')
review_df = review_df[["overall", "verified", "reviewTime", "reviewText"]]
review_df.head()

Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,reviewerName,reviewText,summary,unixReviewTime,vote,style,image
0,5,True,"10 17, 2015",A1HP7NVNPFMA4N,700026657,Ambrosia075,"This game is a bit hard to get the hang of, bu...",but when you do it's great.,1445040000,,,
1,4,False,"07 27, 2015",A1JGAP0185YJI6,700026657,travis,I played it a while but it was alright. The st...,"But in spite of that it was fun, I liked it",1437955200,,,
2,3,True,"02 23, 2015",A1YJWEXHQBWK2B,700026657,Vincent G. Mezera,ok game.,Three Stars,1424649600,,,
3,2,True,"02 20, 2015",A2204E1TH211HT,700026657,Grandma KR,"found the game a bit too complicated, not what...",Two Stars,1424390400,,,
4,5,True,"12 25, 2014",A2RF5B5H74JLPE,700026657,jon,"great game, I love it and have played it since...",love this game,1419465600,,,


In [40]:
print("Before cleaning up: {}".format(review_df.shape))
review_df = review_df[~review_df["reviewText"].isna()]
print("After cleaning up: {}".format(review_df.shape))

Before cleaning up: (497577, 13)
After cleaning up: (497419, 13)


## Checking verified vs non-verified reviews

In [41]:
review_df["verified"].value_counts()

True     332504
False    164915
Name: verified, dtype: int64

## Check number of reviews for each rating

In [42]:
verified_df = review_df.loc[review_df["verified"], :]
verified_df["overall"].value_counts()

5    222335
4     54878
3     27973
1     15200
2     12118
Name: overall, dtype: int64

In [43]:
verified_df["label"]=verified_df["overall"].map({5:1, 4:1, 3:0, 2:0, 1:0})
verified_df["label"].value_counts()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


1    277213
0     55291
Name: label, dtype: int64

## Preprocessing the text

* Lower case (Spacy)
* Remove stop words (Spacy)
* Lemmatize (Spacy)
* Remove punctuation (Keras)
* Only keep most commmon n-worods (Keras)
* Most common n-grams (Keras)


In [44]:
verified_df = verified_df.sample(frac=1.0, random_state=random_seed)
data, labels = verified_df["reviewText"], verified_df["label"]

In [64]:
import nltk
nltk.download('averaged_perceptron_tagger', download_dir='nltk')
nltk.download('wordnet', download_dir='nltk')
nltk.data.path.append(os.path.abspath('nltk'))

from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import re
import string

lemmatizer = WordNetLemmatizer()

EN_STOPWORDS = set(stopwords.words('english')) - {'not'}

def clean_text(doc):
    doc = doc.lower()
    doc = doc.replace("n\'t ", ' not ')
    doc = re.sub(r"(?:\'ll |\'re |\'d |\'ve )", " ", doc)
    doc = re.sub(r"/d+","", doc)
    tokens = [w for w in word_tokenize(doc) if w not in EN_STOPWORDS and w not in string.punctuation]  
    pos_tags = nltk.pos_tag(tokens)
    clean_text = [
        lemmatizer.lemmatize(w, pos=p[0].lower()) \
        if p[0]=='N' or p[0]=='V' else w \
        for (w, p) in pos_tags
    ]

    return clean_text

sample_doc = 'She sells seashells by the seashore.'
print("Before clean: {}".format(sample_doc))
print("After clean: {}".format(' '.join(clean_text(sample_doc))))
print("\nProcessing all the review data ...")
data = data.apply(lambda x: clean_text(x))
print("\tDone")

[nltk_data] Downloading package averaged_perceptron_tagger to nltk...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to nltk...
[nltk_data]   Package wordnet is already up-to-date!


Before clean: She sells seashells by the seashore.
After clean: sell seashell seashore

Processing all the review data ...
	Done


In [70]:
from collections import Counter
data_list = [w for doc in data for w in doc]
cnt = Counter(data_list)
freq_df = pd.Series(list(cnt.values()), index=list(cnt.keys())).sort_values(ascending=False)
freq_df.head(n=10)

game     443725
not      276928
play     143060
's       139472
get      119875
like     109808
great    102651
one       97790
``        91834
good      83159
dtype: int64

In [77]:
print(freq_df.median())
freq_df.describe(percentiles=[0.25,0.5,0.75,0.9])

1.0


count    137945.000000
mean         79.732154
std        1894.467706
min           1.000000
25%           1.000000
50%           1.000000
75%           4.000000
90%          19.000000
max      443725.000000
dtype: float64

In [84]:
seq_length_ser = data.str.len()
print("Median length: {}\n".format(seq_length_ser.median()))
seq_length_ser.describe(percentiles=[0.25,0.5,0.75,0.8])

Median length: 12.0



count    332504.000000
mean         33.078255
std          74.746220
min           0.000000
25%           4.000000
50%          12.000000
75%          30.000000
80%          39.000000
max        3158.000000
Name: reviewText, dtype: float64

In [83]:
n_vocab = (freq_df >= 80).sum()
n_seq = 39
print("Using a vocabulary of size: {}".format(n_vocab))
print("Using a sequence length: {}".format(n_seq))

Using a vocabulary of size: 6496
Using a sequence length: 76


## Defining a Keras tokenizer

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words=n_vocab, oov_token='unk', lower=False)
tokenizer.fit_on_texts(data.tolist())

## Transforming text to numbers

Here we are going to transform the preprocessed text to number sequences and pad them to a fixed length