### What is NLP?

*Reading material: https://medium.com/@dharamshikrupa/nlp-landscape-from-1960s-to-2020s-4055a89f7b47*

NLP stands for Natural Language Processing, a field at the intersection of Computer Science, Linguistics, and Artificial Intelligence. It is the technology machines use to understand, analyse, manipulate, and interpret human language. It helps developers organize knowledge for tasks such as translation, automatic summarization, Named Entity Recognition (NER), speech recognition, relationship extraction, and topic segmentation.

![](https://static.javatpoint.com/tutorial/nlp/images/what-is-nlp.png)

### Real world applications?
Contextual advertisements

Email spam classification

Removing unintended content, hate speech 

Election analysis through social media discussions

Search engines

Chatbots

### Common NLP Tasks
Text/Document classification

Sentiment analysis

Information retrieval from text/documents

Parts of speech tagging

Language detection and machine translation

Conversational agents (text and speech based)

Knowledge graphs (Neo4j)

Text summarization

Topic modelling (LDA)

Text generation or word prediction

Spell check and grammar corrections (Grammarly)

Text parsing

Speech to text

### Approaches to NLP
**Heuristic models**: Quick approach. Regular expressions, Wordnet (lexical dictionary), Open Mind Common Sense

**ML Based models**: Rules for open-ended problems are learned by the machine from data. The downside is that most of these algorithms do not take sequential information into account, and feature generation has to be programmed by hand. Example algorithms are Naive Bayes, Logistic Regression, SVM, LDA, and Hidden Markov Models.

**DL based models**: Feature generation is automatic. DL architectures for NLP include RNN/LSTM/GRU, CNN, Transformers (e.g. BERT), and Autoencoders

### Challenges in NLP
Spelling errors

Synonyms

Diversity of languages

Ambiguity: I have never tasted a pizza quite like this before

Contextual words: I ran to the store because we ran out of bread

Colloquial/Slangs: Piece of cake, Pulling your leg

Irony/Sarcasm/tone difference: yeah

Creativity: Poems, scripts

In [None]:
import pandas as pd # to load dataset
import numpy as np
import re
import nltk
nltk.download('punkt')
nltk.download('stopwords')
from nltk.corpus import stopwords # to get collection of stopwords
from nltk.tokenize import word_tokenize
import string
#import gensim
import matplotlib.pyplot as plt

In [None]:
from keras.layers import LSTM
from keras.layers import Dense, Dropout
from keras.layers import Embedding
from keras.layers import Bidirectional

In [None]:
from keras.models import Sequential   
from keras.utils import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer

In [None]:
movie_reviews = pd.read_csv("IMDB_Dataset.csv")

movie_reviews.head()

#print(movie_reviews.shape)

#print(movie_reviews.head())

# Check for null values
#movie_reviews.isnull().values.any()

In [None]:
movie_reviews.info()

In [None]:
movie_reviews.groupby("sentiment").sentiment.count().plot.bar(ylim=0);

## NLP Pipeline is as follows

*Reference: https://www.analyticsvidhya.com/blog/2022/06/an-end-to-end-guide-on-nlp-pipeline/*

![Data Collection -> Text Preparation -> Feature Engineering -> Modelling -> Deployment
](https://miro.medium.com/max/944/1*dWY7adQ62NDn_w_sc4lAKw.png)

#### **a) Data Acquisition:**
    
#####    Available with me: 
        -Ready in CSV files
        -in database. Pull data
        -less data: Use data augmentation techniques without changing the sentiment. e.g. synonym replacement, bigram flip, back translate, add noise
    
#####    Available with others: 
        -Public dataset: Open public datasets 
        -Web scraping: beautifulsoup library (https://pypi.org/project/beautifulsoup4/) 
        -API: https://rapidapi.com/hub will help discover and connect thousands of APIs
        -PDF/Image/Audio: Use python techniques

#####    Unavailable:
        -Create data with help of domain specialists
        -Collect data through customer surveys

#### **b) Text Preparation:**

#####    Basic preprocessing and cleanup tasks: 
**Lowercasing** Avoid duplicating tokens

**Remove HTML tags** Use a regular expression to remove HTML tags; regex101.com is handy for testing the pattern
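
As a minimal sketch of the idea, the snippet below strips tags with a regular expression. The helper name `remove_html_tags` and the sample string are assumptions made for illustration; a regex like this is fine for simple review text but is not a full HTML parser.

```python
import re

# Matches anything between '<' and '>' (hypothetical pattern name)
TAG_PATTERN = re.compile(r'<[^>]+>')

def remove_html_tags(text):
    """Replace each HTML tag with a space, then collapse repeated whitespace."""
    without_tags = TAG_PATTERN.sub(' ', text)
    return ' '.join(without_tags.split())

print(remove_html_tags('A <b>great</b> movie.<br />Loved it!'))
# A great movie. Loved it!
```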

**Remove URLs** 
```python
import re

def remove_url(text):
    # Match http(s):// links and bare www. links up to the next whitespace
    pattern = re.compile(r'https?://\S+|www\.\S+')
    return pattern.sub('', text)
```
**Removing punctuation and digits**: Punctuation marks can create additional tokens and cause confusion. If punctuation (`` !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ ``) or digits do not contribute to the meaning, you may remove them

*Example: `Hey! How is it going?` can become `'Hey', '!', 'How', 'is', 'it', 'going', '?'` or `'Hey!', 'How', 'is', 'it', 'going?'`*

##### Assignment: Write efficient Python code to remove punctuation using translate().

**Chat word treatment**: GM -> Good Morning, GN -> Good Night, AFK -> Away from keyboard etc. 

*Reference: https://github.com/rishabhverma17/sms_slang_translator/blob/master/slang.txt*
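
Chat word treatment can be sketched as a dictionary lookup. The three-entry table and the function name `expand_chat_words` below are illustrative assumptions; a real project would load a much fuller slang list such as the one referenced above.

```python
# Tiny illustrative slang table (assumption: real lists are far longer)
CHAT_WORDS = {
    'GM': 'Good Morning',
    'GN': 'Good Night',
    'AFK': 'Away from keyboard',
}

def expand_chat_words(text):
    """Replace whitespace-separated tokens found in CHAT_WORDS with their expansion."""
    return ' '.join(CHAT_WORDS.get(tok.upper(), tok) for tok in text.split())

print(expand_chat_words('gm everyone, I am AFK for an hour'))
# Good Morning everyone, I am Away from keyboard for an hour
```

Tokens with punctuation attached (for example `AFK.`) would not match in this sketch, which is one reason to run punctuation handling first.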

**Spelling checks**: 
```python
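from textblob import TextBlob

incorrect_text = 'I havv goood speling'  # sample input (illustrative)
textblb = TextBlob(incorrect_text)
print(textblb.correct())  # correct() returns a new TextBlob with suggested spelling fixes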
```

**Stopword removal**: Words that mainly serve sentence formation, e.g. 'and', 'or', 'the', 'an', and do not contribute much to the meaning. However, they are retained for **POS tagging**. NLTK and spaCy are 2 popular packages for removing stopwords

*Reference: https://www.analyticsvidhya.com/blog/2019/08/how-to-remove-stopwords-text-normalization-nltk-spacy-gensim-python/*
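
A minimal sketch of stopword removal with a hand-picked stopword set (an assumption for illustration; NLTK and spaCy ship much larger lists). The `KEEP` set shows one possible design choice for sentiment tasks: negations such as 'not' are retained even when a stopword list contains them.

```python
# Hand-picked stopword set for illustration only
STOP_WORDS = {'a', 'an', 'and', 'or', 'the', 'is', 'was', 'it', 'this', 'to', 'of', 'not'}
# Words kept even though they are in the stopword set, e.g. negations
KEEP = {'not', 'no'}

def remove_stopwords(tokens, stop_words=STOP_WORDS, keep=KEEP):
    """Drop tokens that are stopwords unless they are explicitly kept."""
    return [t for t in tokens if t.lower() in keep or t.lower() not in stop_words]

tokens = 'The movie was not good and the plot is thin'.split()
print(remove_stopwords(tokens))
# ['movie', 'not', 'good', 'plot', 'thin']
```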


**Replace emojis**: Either remove or replace with Unicode normalization. 

*Reference: https://carpedm20.github.io/emoji/docs/index.html*

**Tokenization**: **An important step** in NLP. Text can be tokenized into sentences or words. Prefixes (*$10*), suffixes (*10km*), infixes (*auto-encoder*), and exceptions such as abbreviations that contain punctuation (*U.S.A*) make it hard to decide where to split and where to prevent a split

*Reference: https://www.analyticsvidhya.com/blog/2019/07/how-get-started-nlp-6-unique-ways-perform-tokenization/*

What are the challenges when using 
        a) the split() function on sentences like: Incredible India!, Where do you want to go today? etc.
        b) regular expressions (RE), which are slightly better than the split() function
        c) NLTK packages on sentences like: Auto-rickshaw ride in Bangalore costs Rs.30.00/km, my email address is test@e.gov, Dr. Rajkumar received a Hon. Doctorate from Mysore
        d) spaCy, which is great for advanced/complex NLP and gives better results. Try the example sentences from the above 3 options using spaCy
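
The first two options can be compared with the standard library alone. This is only a sketch: the helper name `regex_tokenize` and its pattern are assumptions, chosen to show both the improvement over `split()` and the over-splitting a naive regex causes.

```python
import re

def regex_tokenize(text):
    """Runs of word characters, or any single non-space, non-word character."""
    return re.findall(r'\w+|[^\w\s]', text)

sentence = 'Where do you want to go today?'

# str.split() only breaks on whitespace, so punctuation stays glued to words
print(sentence.split())
# ['Where', 'do', 'you', 'want', 'to', 'go', 'today?']

# The regex separates the trailing question mark
print(regex_tokenize(sentence))
# ['Where', 'do', 'you', 'want', 'to', 'go', 'today', '?']

# ...but it over-splits tokens that should stay whole
print(regex_tokenize('Rs.30.00/km'))
# ['Rs', '.', '30', '.', '00', '/', 'km']
```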


**Stemming and Lemmatization**: End result is to get the root word 

*Inflection* is a change in the form of a word (typically the ending) to express a grammatical function or attribute such as tense, mood, person, number, case, and gender. Tense inflection: walk-walking-walked, Plural inflection: dog-dogs

*Stemming* is a process that strips inflection from a word, even if the resulting stem is not a valid or correctly spelled word in the language. It is fast, so it is used for information retrieval over large datasets. For instance, a crude suffix-stripping stemmer would reduce ‘Caring’ to ‘Car’. 

*Lemmatization* considers the context and converts the word to its meaningful base form, which is called the **Lemma**. Lemmatization is computationally more expensive since it relies on dictionary look-ups and part-of-speech information. It is used in chat applications. For instance, lemmatizing the word ‘Caring’ would return ‘Care’.

*Reference: https://www.analyticsvidhya.com/blog/2022/06/stemming-vs-lemmatization-in-nlp-must-know-differences/*
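
The contrast can be illustrated with a toy example. Both the suffix-stripping rule and the small lemma table below are invented for this sketch; they are not the Porter algorithm or a real lemmatizer, and library implementations (NLTK, spaCy) behave differently on many words.

```python
# Toy rules for illustration only (assumptions, not a real stemmer or lemmatizer)
SUFFIXES = ('ing', 'ed', 's')
LEMMAS = {'caring': 'care', 'walked': 'walk', 'walking': 'walk', 'dogs': 'dog'}

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in the lemma table; fall back to the lowercased word."""
    return LEMMAS.get(word.lower(), word.lower())

for w in ['Caring', 'walked', 'dogs']:
    print(w, '->', toy_stem(w), '|', toy_lemmatize(w))
# Caring -> car | care
# walked -> walk | walk
# dogs -> dog | dog
```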


**Language detection**: Identify the language of the text so that language-specific steps (stopword lists, stemmers, models) can be applied

#####    Advanced Preprocessing tasks: (Park for now)
**POS Tagging**: Each word will be assigned a part of speech

**Parsing**: 

**Coreference resolution**:
        
![https://](https://shubhangidabral13.github.io/Bits-and-Bytes-of-NLP/images/copied_from_nb/my_icons/topic_02.b.2.png)

In [None]:
STOP_WORDS = set(stopwords.words('english'))  # build once instead of on every call
PUNCT_TABLE = str.maketrans('', '', string.punctuation)  # deletes all punctuation characters

def preprocess_text(sen):
    """Clean one review and return it as a single space-separated string."""
    sen = re.sub(r'<.*?>', ' ', sen)  # remove HTML tags

    tokens = word_tokenize(sen)  # tokenize into words

    tokens = [w.lower() for w in tokens]  # lower case

    stripped = [w.translate(PUNCT_TABLE) for w in tokens]  # remove punctuation

    words = [word for word in stripped if word.isalpha()]  # keep alphabetic tokens only

    words = [w for w in words if w not in STOP_WORDS]  # remove stop words
    words = [w for w in words if len(w) > 2]  # keep only words with 3 or more characters

    return ' '.join(words)

In [None]:
# convert the sentiment from string to a binary form of 1 and 0, 1 is ‘positive’ and 0 is ‘negative’.
y = movie_reviews['sentiment']
y = np.array(list(map(lambda x: 1 if x=="positive" else 0, y)))

In [None]:
# Store the preprocessed reviews in a new list
review_lines = []
sentences = list(movie_reviews['review'])

In [None]:
for sen in sentences:
    # preprocess each sentence of the review text
    review_lines.append(preprocess_text(sen))

print(len(review_lines))

print(review_lines[1])

#### **c) Feature Extraction or Vectorization:**

Converting text to vector representation is not a direct process like it is for Image or Speech data. Techniques used to convert text to vector representation should retain the semantic information of the words

##### Common Terms

**Corpus** (C): Concatenation of all the strings

**Vocabulary** (V): Unique words in the Corpus

**Document** (D): Each sample is a document

**Word** (W): Words of the document

Techniques include

**OHE**: One-hot encoding converts each word of a document into a V-dimensional vector that has a 1 at the word's vocabulary index and 0 elsewhere. The technique is intuitive and easy to implement, but it is not popular because of several flaws: the vectors are sparse, which can lead to overfitting; documents of different lengths produce representations of different sizes; out-of-vocabulary (OOV) words cannot be encoded; and it **does not capture semantic information**, e.g. Fruit, Orange, and Shoe are all equally distant from one another
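
A minimal pure-Python sketch of one-hot encoding on a made-up three-document corpus (the documents and helper names are assumptions for illustration). It makes the varying document size and the OOV problem visible.

```python
docs = ['people watch movies', 'movies watch people', 'people write comments']

# Vocabulary V: sorted unique words across the corpus
vocab = sorted({w for d in docs for w in d.split()})
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Return a V-dimensional vector with a single 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[index[word]] = 1  # raises KeyError for out-of-vocabulary words
    return vec

def encode(doc):
    """One vector per word, so longer documents give larger representations."""
    return [one_hot(w) for w in doc.split()]

print(vocab)
# ['comments', 'movies', 'people', 'watch', 'write']
print(encode(docs[0]))
# [[0, 0, 1, 0, 0], [0, 0, 0, 1, 0], [0, 1, 0, 0, 0]]
```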

#### 1. Bag Of Words Model

*Reference: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html*

**BOW**: It is one of the most used text vectorization techniques. A bag-of-words is a representation of text that records how often each vocabulary word occurs in a document; word order does not matter. Context is not modelled explicitly, although documents that share many words end up with similar vectors. Performance is often good in text classification. **Advantages**: It is simple and intuitive, the vector size is the same across documents, and, in a way, some semantic information is captured. **Disadvantages**: Sparsity; OOV words are silently ignored; the order of words in a sentence is not considered; and when a small change alters the meaning of a sentence drastically, BOW still treats both sentences as nearly the same, e.g. I like orange juice; I don't like orange juice
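
The orange juice example can be checked with a hand-rolled bag-of-words. This is a sketch using plain whitespace tokenization (an assumption; `CountVectorizer` tokenizes differently), but it shows that the two sentences differ in only one count.

```python
from collections import Counter

def bag_of_words(text, vocab):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocab]

s1 = "I like orange juice"
s2 = "I don't like orange juice"
vocab = sorted(set((s1 + ' ' + s2).lower().split()))

print(vocab)
# ["don't", 'i', 'juice', 'like', 'orange']
print(bag_of_words(s1, vocab))
# [0, 1, 1, 1, 1]
print(bag_of_words(s2, vocab))
# [1, 1, 1, 1, 1]
```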


In [None]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(max_features=2500)

X1 = cv.fit_transform(review_lines).toarray()

print(cv.get_feature_names_out())

print(X1)

In [None]:
y1 = pd.get_dummies(movie_reviews['sentiment'])
y1 = y1.iloc[:,1].values

In [None]:
from sklearn.model_selection import train_test_split
X_train1, X_test1, y_train1, y_test1 = train_test_split(X1, y1, test_size = 0.20, random_state=0)

In [None]:
print(X_train1.shape)
print(y_train1.shape)

In [None]:
from sklearn.naive_bayes import MultinomialNB

model1 = MultinomialNB().fit(X_train1, y_train1)

In [None]:
y_pred1 = model1.predict(X_test1)

In [None]:
from sklearn.metrics import accuracy_score, classification_report

print(accuracy_score(y_test1, y_pred1))
print(classification_report(y_test1, y_pred1))  # scikit-learn expects (y_true, y_pred)

In [None]:
new_review = 'This is an average MoviE. I will not see it again'

new_review = re.sub('[^a-zA-Z]', ' ', new_review)

new_review = new_review.lower()

new_review = new_review.split()

new_review = ' '.join(new_review)

new_X_test = cv.transform([new_review]).toarray()
new_y_pred = model1.predict(new_X_test)
print(new_y_pred)

#### 2. n-grams

*Reference: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html*

A bag-of-n-grams model represents a text document as an unordered collection of its n-grams.

**bi-gram** — using **two** words of the document

**tri-gram** — using **three** words of the document

**n-gram** — using **n** number of words of the document.

**Advantages**: Captures some local word order and context, so part of the sentence's meaning is retained

**Disadvantages**: As we move from unigrams to n-grams, the vocabulary and hence the vector dimension grows, which slows down the algorithm; there is still no solution for OOV

e.g. food today was very good, food today was not good
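
A short sketch of how bigrams separate the two example sentences. The helper `ngrams` is a hypothetical name and uses whitespace tokenization for simplicity.

```python
def ngrams(text, n):
    """Return the list of n-grams (as strings) of a whitespace-tokenized text."""
    tokens = text.split()
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

a = ngrams('food today was very good', 2)
b = ngrams('food today was not good', 2)

print(a)
# ['food today', 'today was', 'was very', 'very good']
print(b)
# ['food today', 'today was', 'was not', 'not good']

# Bigrams that appear in only one of the two sentences
print(sorted(set(a) ^ set(b)))
# ['not good', 'very good', 'was not', 'was very']
```

With unigrams the sentences would differ in a single token ('very' vs 'not'); with bigrams the negation shows up as the distinct feature 'not good'.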

In [None]:
cv = CountVectorizer(max_features=2500, ngram_range=(2,2))

X1 = cv.fit_transform(review_lines).toarray()

print(cv.get_feature_names_out())

print(X1)

#### 3. Tf-Idf

*Reference: https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html*

TF-IDF (term frequency-inverse document frequency) is a statistical measure that evaluates how relevant a word is to a document in a collection of documents. **Advantages**: Words that appear in many documents are down-weighted, so distinctive words carry more weight.

*Term Frequency (TF)*: The number of times a word appears in a document divided by the total number of words in that document, so 0 ≤ TF ≤ 1

![
](https://editor.analyticsvidhya.com/uploads/33409tf1.png)

*Inverse Document Frequency (IDF)*: The logarithm of the ratio between the number of documents in the corpus (N) and the number of documents in which the term appears (ni). Scikit-learn uses log(N/ni) + 1 when `smooth_idf=False`; with the default `smooth_idf=True` it uses log((1 + N)/(1 + ni)) + 1.

![
](https://editor.analyticsvidhya.com/uploads/95671idf.png)
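
The two definitions can be worked through by hand on a made-up corpus. This sketch uses the unsmoothed log(N/ni) + 1 form of IDF and a length-normalized TF; the toy documents and function names are assumptions. scikit-learn's `TfidfVectorizer` uses raw counts, smoothing, and L2 normalization by default, so its numbers will differ.

```python
import math

# Toy corpus: each document is already tokenized
docs = [['good', 'movie'], ['bad', 'movie'], ['good', 'good', 'plot']]

def tf(term, doc):
    """Occurrences of the term divided by the document length."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """log(N / ni) + 1, assuming the term occurs in at least one document."""
    n_docs = len(docs)
    doc_freq = sum(1 for d in docs if term in d)
    return math.log(n_docs / doc_freq) + 1

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# 'movie' occurs in 2 of 3 documents, 'plot' in only 1, so 'plot' gets the higher IDF
print(idf('movie', docs), idf('plot', docs))
print(tf_idf('good', docs[2], docs))
```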


In [None]:
## Creating tf-idf model
from sklearn.feature_extraction.text import TfidfVectorizer
tv = TfidfVectorizer(max_features=2500)
X2 = tv.fit_transform(review_lines).toarray()

In [None]:
y2 = pd.get_dummies(movie_reviews['sentiment'])
y2 = y2.iloc[:,1].values

In [None]:
from sklearn.model_selection import train_test_split
X_train2, X_test2, y_train2, y_test2 = train_test_split(X2, y2, test_size=0.20, random_state=0)

In [None]:
print(X_train2.shape)
print(y_train2.shape)

In [None]:
from sklearn.naive_bayes import MultinomialNB
model2 = MultinomialNB().fit(X_train2, y_train2)

y_pred2 = model2.predict(X_test2)

In [None]:
print(accuracy_score(y_test2,y_pred2))
print(classification_report(y_test2, y_pred2))  # scikit-learn expects (y_true, y_pred)

In [None]:
new_review = "Please do not buy this product. The oats get cooked faster and the Quinoa does not cook at all. The end result is gooey porridge"
new_review = re.sub('[^a-zA-Z]', ' ', new_review)
new_review = new_review.lower()
new_review = new_review.split()
new_review = ' '.join(new_review)
print(new_review)
new_X_test = tv.transform([new_review]).toarray()  # model2 was trained on TF-IDF features, so use tv (not cv)
new_y_pred = model2.predict(new_X_test)
print(new_y_pred)

#### Custom

### Word Embeddings

In natural language processing (NLP), a word embedding is a representation of a word that is used in text analysis. Typically, the representation is a real-valued vector that encodes the meaning of the word in such a way that words that are closer in the vector space are expected to be similar in meaning. Word embeddings can be obtained using language modeling and feature learning techniques, where words or phrases from the vocabulary are mapped to vectors of real numbers.
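
"Closer in the vector space" is usually measured with cosine similarity. The sketch below uses made-up 3-dimensional vectors (an assumption for illustration; real embeddings have 100 or more dimensions) to show that similar words score higher.

```python
import math

# Made-up vectors for illustration only
embeddings = {
    'happy': [0.9, 0.8, 0.1],
    'joy':   [0.8, 0.9, 0.2],
    'shoe':  [0.1, 0.2, 0.9],
}

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# 'happy' should be much closer to 'joy' than to 'shoe'
print(cosine_similarity(embeddings['happy'], embeddings['joy']))
print(cosine_similarity(embeddings['happy'], embeddings['shoe']))
```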

There are **2** types: a) **Frequency based** (BOW, TF-Idf, Glove) b) **Prediction based** (Word2Vec)

#### Word2Vec

**Word2Vec** is one of the most popular techniques for learning word embeddings. Although it is often grouped with **Deep Learning** methods, it uses a shallow neural network. It is capable of capturing the context of a word in a document, semantic and syntactic similarity, relations with other words, etc. A word embedding is a learned representation for text where words that have the same meaning have a similar representation.

The core assumption is that two words sharing similar contexts also share a similar meaning and consequently get similar vector representations from the model.

##### Advantages

It can help extract semantic meaning (Happy, Joy are similar)

Representation is based on dense vectors 

Dense vectors (mostly non-zero) are relatively small, which helps computations run faster and helps prevent overfitting during training

There are 2 types of algorithms, namely **CBOW - Continuous Bag Of Words** and **Skip-Gram**, both developed using shallow networks

##### CBOW

![
](https://towardsmachinelearning.org/wp-content/uploads/2022/04/CBOW2.png)


![
](https://towardsmachinelearning.org/wp-content/uploads/2022/04/CBOW1.png)


##### CBOW and Skip Gram Architectures

![
](https://www.researchgate.net/profile/Elena-Tutubalina/publication/318507923/figure/fig2/AS:613947946319904@1523388005889/Illustration-of-the-word2vec-models-a-CBOW-b-skip-gram-16-33.png)

CBOW is faster to train and does slightly better on frequent words, whereas Skip-gram works well with smaller amounts of training data and represents rare words better
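
The difference between the two architectures comes down to how training pairs are built from a sliding window. The sketch below only generates the pairs (no network is trained); the function name, sentence, and window size are assumptions for illustration.

```python
def context_windows(tokens, window=2):
    """Yield (context_words, target_word) for every position in the sentence."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield left + right, target

tokens = 'the quick brown fox jumps'.split()

# CBOW: predict the target word from its whole context
cbow_pairs = list(context_windows(tokens, window=1))
print(cbow_pairs[1])
# (['the', 'brown'], 'quick')

# Skip-gram: predict each context word from the target word
skipgram_pairs = [(target, c) for ctx, target in cbow_pairs for c in ctx]
print(skipgram_pairs[:3])
# [('the', 'quick'), ('quick', 'the'), ('quick', 'brown')]
```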

*Reference: https://medium.com/@zafaralibagh6/a-simple-word2vec-tutorial-61e64e38a6a1*

*Reference: https://jalammar.github.io/illustrated-word2vec/*

*Word embeddings **visual inspector**: https://ronxin.github.io/wevi/*

![
](https://jalammar.github.io/images/word2vec/queen-woman-girl-embeddings.png)


Word2Vec is available as a **pre-trained** model, in addition to the option of **training your own model**

**Improve performance of Word2Vec**

Increase training data set

Increase vector dimension

Increase window size

The pretrained Word2Vec model was trained on the GoogleNews corpus containing about 100 billion words. The model contains 3 million words and phrases, each represented as a 300-dimensional vector

In [None]:
import gensim
from gensim.models import Word2Vec, KeyedVectors

In [None]:
model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True)

In [None]:
print(model['man'].shape)

print(model['man'])

In [None]:
model.most_similar('man')

In [None]:
model.most_similar('ipl')

In [None]:
model.most_similar('facebook')

In [None]:
model.similarity('man', 'woman')

In [None]:
model.similarity('man', 'java')

In [None]:
model.doesnt_match(['man', 'woman', 'java'])

In [None]:
model.most_similar('horrible')

![(https://](https://www.researchgate.net/publication/358432453/figure/fig1/AS:1121209337020446@1644328545976/Analogical-reasoning-on-vectors-a-king-man-womanqueen-and-b.jpg)

In [None]:
vec = model['king'] - model['man'] + model['woman']
model.most_similar([vec])

#### 4. Word2Vec

In [None]:
from nltk import sent_tokenize
from gensim.utils import simple_preprocess

In [None]:
words = []
for review in review_lines:
    # split each review into sentences, then each sentence into a list of tokens
    for sentence in sent_tokenize(review):
        words.append(simple_preprocess(sentence))

In [None]:
words

In [None]:
import gensim
model3 = gensim.models.Word2Vec(words, window=5, min_count=2)

In [None]:
model3.wv.index_to_key

In [157]:
model3.corpus_count

10378

In [158]:
model3.epochs

5

In [159]:
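def avg_word2vec(doc):
    # Average the vectors of in-vocabulary words (dict lookup); zeros if none are known
    vecs = [model3.wv[word] for word in doc if word in model3.wv.key_to_index]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model3.wv.vector_size)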

In [160]:
from tqdm import tqdm
X3 = []
for i in tqdm(range(len(words))):
    X3.append(avg_word2vec(words[i]))

100%|██████████| 10378/10378 [01:46<00:00, 97.10it/s] 


In [161]:
X_new = np.array(X3)

In [162]:
X_new.shape

(10378, 100)

In [163]:
y3=pd.get_dummies(movie_reviews['sentiment'])
y3=y3.iloc[:,1].values

In [164]:
from sklearn.model_selection import train_test_split
X_train3, X_test3, y_train3, y_test3 = train_test_split(X_new,y3, test_size=0.20, random_state=0)

In [165]:
X_train3.shape

(8302, 100)

In [166]:
y_train3.shape

(8302,)

In [167]:
from sklearn.svm import SVC
model4 = SVC(kernel='rbf', random_state=0).fit(X_train3, y_train3)

In [168]:
y_pred3 = model4.predict(X_test3)

In [169]:
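print(accuracy_score(y_test3,y_pred3))
# Note: scikit-learn's signature is classification_report(y_true, y_pred); with the arguments
# in this order, the precision and recall columns in the report are swapped
print(classification_report(y_pred3,y_test3))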

0.8063583815028902
              precision    recall  f1-score   support

       False       0.78      0.83      0.80       987
        True       0.84      0.78      0.81      1089

    accuracy                           0.81      2076
   macro avg       0.81      0.81      0.81      2076
weighted avg       0.81      0.81      0.81      2076



In [180]:
new_review = 'The Dr.Strange MOM movie was not great.'
new_review = re.sub('[^a-zA-Z]', ' ', new_review)
new_review = new_review.lower()
new_review = new_review.split()

print(new_review)
#all_stopwords = stopwords.words('english')
#all_stopwords.remove('not')
#new_review = [lemmatizer.lemmatize(word) for word in new_review if not word in set(all_stopwords)]
new_review = ' '.join(new_review)
new_corpus = [new_review]
print(new_corpus)

new_words=[]
for sent in new_corpus:
    sent_token = sent_tokenize(sent)
    for sent in sent_token:
        new_words.append(simple_preprocess(sent))
        
new_X3 = []
for i in range(len(new_words)):
    new_X3.append(avg_word2vec(new_words[i]))

new_X = np.array(new_X3)  # define new_X before printing it
print(new_X)

new_y_pred = model4.predict(new_X)

print(new_y_pred)

['the', 'dr', 'strange', 'mom', 'movie', 'was', 'not', 'great']
['the dr strange mom movie was not great']
[[-4.25027549e-01  3.76102239e-01 -1.17754415e-02  2.19848812e-01
   9.62777995e-03 -9.79634583e-01  1.71377391e-01  1.49202728e+00
  -1.26582170e+00 -3.83862674e-01 -2.68357009e-01 -8.35936248e-01
  -5.61337888e-01  6.53094172e-01  3.23007882e-01 -6.55304134e-01
   9.10625905e-02 -2.15287060e-01  2.48783261e-01 -7.99174070e-01
   3.82544130e-01  3.41346234e-01  2.08509699e-01 -1.97810084e-01
   8.06450844e-04  2.62502208e-03 -1.66353941e-01  4.00541760e-02
  -7.20776796e-01 -4.74611342e-01 -1.08326040e-01 -4.10183191e-01
   1.32701054e-01 -3.95946622e-01  5.54357842e-02  8.97217214e-01
   5.82306981e-01 -6.55133665e-01 -4.89395976e-01 -6.37313575e-02
   2.07061261e-01 -4.86977220e-01 -6.03871405e-01  1.23925321e-01
   4.23423290e-01 -1.98031664e-01 -9.20679450e-01 -1.58006132e-01
   6.66205287e-01  1.79108322e-01 -6.37286752e-02 -1.17133513e-01
  -1.40488252e-01  1.04062617e-01 -