In [1]:
# Imports
import nltk
nltk.download('punkt')
nltk.download("wordnet")
nltk.download('omw-1.4')

import pandas as pd

from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer

[nltk_data] Downloading package punkt to /home/suyog/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /home/suyog/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /home/suyog/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


## Table of Contents
- **What is NLP and how it correlates with aspects of AI?**
- **Corpus**
- **Vocabulary**
- **Tokens and Tokenization**
- **Stemming and Lemmatization**
- **Vectorization/Embedding**
- **Model Training for text-classification**
- **Bonus: for the people who support till the end :D**

# What is a Corpus?
<br/>

**Fundamental unit of NLP:** `TEXT`

**Simply put, a collection of text is a** `Corpus`**. We analyse the corpus to extract insights, so the corpus serves as the dataset for building models.**

**Examples:**
- Newspaper articles
- Essay books
- Reviews on Daraz
- Information in invoices

# But how do computers understand TEXT?

- **That is where `Vectorization` comes into play :D**
- **But before vectorizing the text, how many words can a computer remember?**

**Quick Note:** ***We will discuss vectorization in more detail soon.***

## Vocabulary

- **Analogous to the dictionary we use for a language**
- **A set of words** (***tokens***) **chosen based on the number of times each one appears in the corpus**
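
A minimal sketch of this idea, assuming a toy corpus and a hypothetical `max_size` cutoff (neither comes from the workshop material): count how often each word appears, then keep only the most frequent ones.

```python
from collections import Counter

# Toy corpus, made up for illustration
toy_corpus = [
    "today we are learning NLP",
    "we are learning tokenization",
    "NLP is fun",
]

# Count how many times each word appears across the whole corpus
word_counts = Counter(word for doc in toy_corpus for word in doc.split())

# Keep only the `max_size` most frequent words (hypothetical cutoff)
max_size = 4
vocabulary = [word for word, _ in word_counts.most_common(max_size)]

print(word_counts.most_common(max_size))
print(vocabulary)
```

Words that appear only once (`today`, `tokenization`, `is`, `fun`) fall outside the cutoff here, which is how a vocabulary stays at a manageable size.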

## Tokens and Tokenization

- **Tokens are the smaller units of text to which meaning can be assigned more easily**

**Example: For the sentence `Today we are learning NLP`, the tokens are `[Today, we, are, learning, NLP]`**

**As a dictionary,**

```python
{
    "Today": 0, 
    "we": 1, 
    "are": 2,
    "learning": 3,
    "NLP": 4
}
```

**Splitting the corpus into individual tokens is known as tokenization. Each token is then assigned a numeric value, as in the dictionary above.**

In [2]:
text = "Today we are learning NLP"
tokens = text.split()    # Creating tokens using the .split() method
token2id = {token: idx for idx, token in enumerate(tokens)}    # Mapping each token to a numeric ID

print(f"The tokens for above text are: {tokens}")
print("In dictionary, ", token2id, sep="\n")

The tokens for above text are: ['Today', 'we', 'are', 'learning', 'NLP']
In dictionary, 
{'Today': 0, 'we': 1, 'are': 2, 'learning': 3, 'NLP': 4}


## Problem with the .split() Method for Tokenization

- **Punctuation stays attached to the neighbouring word instead of becoming its own token**

**Example:**

**Tokens for `Wow! Tokens seems to be interesting. What do you think?` are**

**`['Wow!', 'Tokens', 'seems', 'to', 'be', 'interesting.', 'What', 'do', 'you', 'think?']`**

## Solution
- **Use a library that handles this issue.** ***E.g. NLTK***

In [3]:
text = "Wow! Tokens seems to be interesting. What do you think?"
tokens_with_split = text.split()
tokens_with_nltk = word_tokenize(text)     # word_tokenize is NLTK's built-in word tokenizer

print("The tokens using split are: ", tokens_with_split)
print("The tokens using NLTK are: ", tokens_with_nltk)

The tokens using split are:  ['Wow!', 'Tokens', 'seems', 'to', 'be', 'interesting.', 'What', 'do', 'you', 'think?']
The tokens using NLTK are:  ['Wow', '!', 'Tokens', 'seems', 'to', 'be', 'interesting', '.', 'What', 'do', 'you', 'think', '?']
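
To see roughly why the two outputs differ, here is a regular-expression sketch that separates punctuation from words. It only approximates the idea and is not how `word_tokenize` is implemented; the pattern is an assumption chosen for this example.

```python
import re

text = "Wow! Tokens seems to be interesting. What do you think?"

# \w+      : a run of letters, digits or underscores (a word)
# [^\w\s]  : a single character that is neither a word character nor whitespace (punctuation)
tokens_with_regex = re.findall(r"\w+|[^\w\s]", text)

print(tokens_with_regex)
```

A pattern this simple breaks down on contractions: it would split `can't` into `can`, `'` and `t`, which is one reason to rely on a library tokenizer.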


## We are on our way to a good vocabulary. Punctuation is now handled.

**But wait, we are still missing something. What if the words `play, played, playing, plays` appear in the corpus more often than any other words?**

**Our vocabulary would then fill up with words that all mean nearly the same thing.**

**What could be the solution?**

## Stemming and Lemmatization

- **Stemming and Lemmatization convert the inflectional forms of a word into a common base or root.**

**Example :**

**`play, played, playing, plays` => `play`**

**Stemming: cuts off the end of the word based on common suffixes, so the result is not always a real word. It is best suited when context is not important.**

**Example :**

| Word     | Stem  |
|----------|-------|
| Studies  | studi |
| Studying | studi |

**Lemmatization: uses morphological analysis to map a word to its dictionary form (the lemma). It is preferred when context matters.**

**Example :**

| Word  | Lemma |
|-------|-------|
| likes | like  |
| like  | like  |

In [4]:
stem_words = ["Studies", "Studying"]
lemma_words = ["likes", "like"]

# Defining Stemmer and Lemmatizer
stemmer = PorterStemmer()    
lemmatizer = WordNetLemmatizer()

words_after_stemming = [stemmer.stem(word) for word in stem_words]
words_after_lemmatizing = [lemmatizer.lemmatize(word) for word in lemma_words]

print("The words before stemming: ", stem_words)
print("The words after stemming: ", words_after_stemming)
print("The words before lemmatization: ", lemma_words)
print("The words after lemmatization: ", words_after_lemmatizing)

The words before stemming:  ['Studies', 'Studying']
The words after stemming:  ['studi', 'studi']
The words before lemmatization:  ['likes', 'like']
The words after lemmatization:  ['like', 'like']
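
Applying the same stemmer to the `play` family from earlier shows the four forms collapsing into a single vocabulary entry. This is a small sketch; the word list is simply the example from the text.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["play", "played", "playing", "plays"]

# Map every inflected form to its stem
stems = [stemmer.stem(word) for word in words]

# A set keeps only the distinct stems
unique_stems = set(stems)

print(stems)
print(unique_stems)
```

Four surface forms now occupy one slot in the vocabulary instead of four.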


## Vectorization / Embedding

- **Assigning a numerical representation to text.**

- **Usually done at one of the following levels:**
    - **Document/Corpus level**
    - **Token level**
    - **Sub-token level**

***Quick Tip: Recent progress in NLP, such as ChatGPT, comes from improvements in embeddings and in the attention mechanism, which weighs each embedding against the others.***

## Document Level Vectorization

**Bag of Words (BoW)**

- **Bag of Words is a method for representing a piece of text/corpus as a collection of individual words, without considering the order in which they appear.**
- **It is called "Bag of Words" because it treats the text as a "bag" of individual words, where the order of the words doesn't matter, just like the order of items in a bag doesn't matter.**
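
A hand-rolled sketch of the idea, using only the standard library. The helper names `tokenize` and `bag_of_words` are made up for this example, and the tokenization rule (lowercase, keep tokens of two or more characters) is an assumption chosen to resemble scikit-learn's `CountVectorizer` defaults.

```python
import re
from collections import Counter

toy_corpus = ["Hello", "Buy Crypto", "Meet me", "Send money", "Hi there", "Get Rich"]

def tokenize(text):
    # Lowercase, then keep runs of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

# Vocabulary: every distinct token in the corpus, sorted alphabetically
vocabulary = sorted({token for doc in toy_corpus for token in tokenize(doc)})

def bag_of_words(text):
    # Count each token, then read the counts off in vocabulary order;
    # words missing from the vocabulary are simply ignored
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

print(vocabulary)
print(bag_of_words("send money"))
print(bag_of_words("money send"))
```

Both calls print the same vector, because only the word counts are recorded and not their positions.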

In [5]:
# Creating a sample dataset for spam/ham classification.
corpus = ["Hello", "Buy Crypto", "Meet me", "Send money", "Hi there", "Get Rich"]
labels = [0, 1, 0, 1, 0, 1] # 1: Spam, 0: Ham

# Let's view it as a DataFrame
data = pd.DataFrame({"Corpus": corpus, "Spam": labels})
data.head(6)

|   | Corpus     | Spam |
|---|------------|------|
| 0 | Hello      | 0    |
| 1 | Buy Crypto | 1    |
| 2 | Meet me    | 0    |
| 3 | Send money | 1    |
| 4 | Hi there   | 0    |
| 5 | Get Rich   | 1    |


In [6]:
# BoW using Sklearn

# CountVectorizer is an sklearn class to create BoW
vectorizer = CountVectorizer()
vectorizer.fit(corpus)
vector = vectorizer.transform(corpus)

In [7]:
features = vectorizer.get_feature_names_out()
print("The words in the bags(Vocabulary) are: ", features)

# Sample document in BoW vector
idx = 0
text = corpus[idx]
sample_vector = vector[idx]
print(f"\nText: {text}")
print(f"Vector: {sample_vector.toarray()[0]}")

The words in the bags(Vocabulary) are:  ['buy' 'crypto' 'get' 'hello' 'hi' 'me' 'meet' 'money' 'rich' 'send'
 'there']

Text: Hello
Vector: [0 0 0 1 0 0 0 0 0 0 0]


In [8]:
# How vectorization is done for new text
random_text = "I will send you money"
random_vector = vectorizer.transform([random_text])

print("The text is: ", random_text)
print("The vector is: ", random_vector.toarray()[0])

The text is:  I will send you money
The vector is:  [0 0 0 0 0 0 0 1 0 1 0]


## Training a simple spam-ham classifier

**We will use a simple Logistic Regression classifier, because the algorithm is simple and our dataset/corpus is very small.**

**The underlying math of Logistic Regression is outside the scope of today's workshop.**

In [9]:
# Creating a classifier
classifier = LogisticRegression()

# Defining input and output for training
X = vector.toarray()
y = labels

# Training the classifier
classifier.fit(X, y)

In [10]:
# Testing the classifier on new text. For now, let's reuse the random text we defined in the previous slides.
prediction = classifier.predict(random_vector.toarray())
print("The text is: ", random_text)
print("Prediction: ", prediction[0])

The text is:  I will send you money
Prediction:  1


In [11]:
# Let's try another text
random_text = "Hi, I am Samip. Nice to meet you"

prediction = classifier.predict(vectorizer.transform([random_text]).toarray())
print("Text: ", random_text)
print("Prediction: ", prediction[0])

Text:  Hi, I am Samip. Nice to meet you
Prediction:  0


In [12]:
data = [
  {
    "text": "I love this product, it works great!",
    "label": 1
  },
  {
    "text": "This movie was terrible, I would not recommend it to anyone.",
    "label": 0
  },
  {
    "text": "The customer service was amazing, they were so helpful.",
    "label": 1
  },
  {
    "text": "I am so disappointed with this restaurant, the food was cold and the service was slow.",
    "label": 0
  },
  {
    "text": "I can't believe how good this book is, I couldn't put it down.",
    "label": 1
  },
  {
    "text": "This hotel was a nightmare, there were bugs in the bed and the staff was rude.",
    "label": 0
  },
  {
    "text": "I'm really happy with my new phone, it has all the features I wanted.",
    "label": 1
  },
  {
    "text": "The traffic was so bad, it took me an hour to get to work.",
    "label": 0
  },
  {
    "text": "I had an amazing time at the concert, the band was fantastic.",
    "label": 1
  },
  {
    "text": "This product is terrible, it doesn't work at all.",
    "label": 0
  },
  {
    "text": "The service at this restaurant was excellent, the staff was very attentive.",
    "label": 1
  },
  {
    "text": "I was very disappointed with the hotel, the room was dirty and the staff was unhelpful.",
    "label": 0
  },
  {
    "text": "I'm so happy with my new car, it drives like a dream.",
    "label": 1
  },
  {
    "text": "This movie was fantastic, I would highly recommend it.",
    "label": 1
  },
  {
    "text": "I had a terrible experience at the hair salon, the stylist didn't listen to me at all.",
    "label": 0
  },
  {
    "text": "The food at this restaurant was delicious, I can't wait to go back.",
    "label": 1
  },
  {
    "text": "I regret buying this product, it doesn't work as advertised.",
    "label": 0
  },
  {
    "text": "The customer service at this store was terrible, the staff was rude and unhelpful.",
    "label": 0
  },
  {
    "text": "I had a great time at the party, the music was fantastic and the food was delicious.",
    "label": 1
  },
  {
    "text": "I'm so disappointed with this phone, it keeps freezing and the battery life is terrible.",
    "label": 0
  },
  {
    "text": "The service at this hotel was excellent, the staff was very friendly and helpful.",
    "label": 1
  },
  {
    "text": "I was very impressed with this product, it exceeded my expectations.",
    "label": 1
  },
  {
    "text": "This book was a waste of money, I couldn't get past the first chapter.",
    "label": 0
  },
  {
    "text": "This hotel was the worst, the room was dirty and the staff was unhelpful.",
    "label": 0
  },
  {
    "text": "I had a great experience with customer service, they were so helpful and kind.",
    "label": 1
  },
  {
    "text": "This movie was just okay, it wasn't great but it wasn't terrible either.",
    "label": 0
  },
  {
    "text": "I was very impressed with this restaurant, the food was delicious and the service was excellent.",
    "label": 1
  },
  {
    "text": "I was very disappointed with this product, it didn't work at all.",
    "label": 0
  },
  {
    "text": "The staff at this store were very helpful, they went above and beyond to assist me.",
    "label": 1
  },
  {
    "text": "This car is amazing, it has all the features I could ever want.",
    "label": 1
  },
  {
    "text": "I had a terrible experience with this company, their customer service was terrible and their product didn't work.",
    "label": 0
  },
  {
    "text": "The food at this restaurant was terrible, I would not recommend it to anyone.",
    "label": 0
  },
  {
    "text": "I'm so happy with my new laptop, it's so fast and efficient.",
    "label": 1
  },
  {
    "text": "This book was amazing, I couldn't put it down.",
    "label": 1
  },
  {
    "text": "I had a terrible experience at this hotel, the room was dirty and the staff was rude.",
    "label": 0
  },
  {
    "text": "The service at this restaurant was terrible, the staff was unhelpful and the food was cold.",
    "label": 0
  },
  {
    "text": "I was very impressed with this service, they were so helpful and professional.",
    "label": 1
  },
  {
    "text": "This movie was fantastic, I loved every minute of it.",
    "label": 1
  },
  {
    "text": "I had a terrible experience at this store, the staff was rude and unhelpful.",
    "label": 0
  },
  {
    "text": "The food at this restaurant was amazing, I can't wait to go back.",
    "label": 1
  },
  {
    "text": "I was very disappointed with this product, it didn't work as advertised.",
    "label": 0
  },
  {
    "text": "The customer service at this company was terrible, they were unresponsive and unhelpful.",
    "label": 0
  },
  {
    "text": "I had a great time at the party, the atmosphere was amazing and the people were friendly.",
    "label": 1
  },
  {
    "text": "This phone is terrible, it keeps freezing and the battery life is terrible.",
    "label": 0
  },
  {
    "text": "The service at this hotel was terrible, the staff was rude and unhelpful.",
    "label": 0
  },
  {
    "text": "I was very impressed with this product, it worked better than I expected.",
    "label": 1
  }

]

# Creating a new dataset: separate the texts from their labels
corpus_new = [item["text"] for item in data]
labels_new = [item["label"] for item in data]

In [13]:
# New dataset for sentiment analysis

df = pd.DataFrame({"Corpus": corpus_new, "Sentiment": labels_new})
df.head()

|   | Corpus                                             | Sentiment |
|---|----------------------------------------------------|-----------|
| 0 | I love this product, it works great!               | 1         |
| 1 | This movie was terrible, I would not recommend...  | 0         |
| 2 | The customer service was amazing, they were so...  | 1         |
| 3 | I am so disappointed with this restaurant, the...  | 0         |
| 4 | I can't believe how good this book is, I could...  | 1         |


In [14]:
# Let's preprocess the data

lemmatizer = WordNetLemmatizer()

def preprocess(text):
    tokens = word_tokenize(text)  # Tokenizing
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]  # Lemmatizing each token
    return " ".join(lemmatized_tokens)  # Joining back into one string for the vectorizer

processed_corpus = [preprocess(text) for text in corpus_new]

In [15]:
# Vectorizing the corpus

vectorizer = CountVectorizer()  # Bag Of Words
vectorizer.fit(processed_corpus)
vectors = vectorizer.transform(processed_corpus)

In [16]:
# Looking at the vocabulary of the new vectorizer

features = vectorizer.get_feature_names_out()
print("The words in the bags(Vocabulary) are: ", features)

# Sample document in BoW vector
idx = 0
text = processed_corpus[idx]
sample_vector = vectors[idx]
print(f"\nText: {text}")
print(f"Vector: {sample_vector.toarray()[0]}")

# We can control the size of the vocabulary by setting max_features in CountVectorizer()

The words in the bags(Vocabulary) are:  ['above' 'advertised' 'all' 'am' 'amazing' 'an' 'and' 'anyone' 'assist'
 'at' 'atmosphere' 'attentive' 'back' 'bad' 'band' 'battery' 'bed'
 'believe' 'better' 'beyond' 'book' 'bug' 'but' 'buying' 'ca' 'car'
 'chapter' 'cold' 'company' 'concert' 'could' 'customer' 'delicious' 'did'
 'dirty' 'disappointed' 'doe' 'down' 'dream' 'drive' 'efficient' 'either'
 'ever' 'every' 'exceeded' 'excellent' 'expectation' 'expected'
 'experience' 'fantastic' 'fast' 'feature' 'first' 'food' 'freezing'
 'friendly' 'get' 'go' 'good' 'great' 'ha' 'had' 'hair' 'happy' 'helpful'
 'highly' 'hotel' 'hour' 'how' 'impressed' 'in' 'is' 'it' 'just' 'keep'
 'kind' 'laptop' 'life' 'like' 'listen' 'love' 'loved' 'me' 'minute'
 'money' 'movie' 'music' 'my' 'new' 'nightmare' 'not' 'of' 'okay' 'party'
 'past' 'people' 'phone' 'product' 'professional' 'put' 'really'
 'recommend' 'regret' 'restaurant' 'room' 'rude' 'salon' 'service' 'slow'
 'so' 'staff' 'store' 'stylist' 'terrible' 

In [17]:
## Training the sentiment analysis model. Once again we use Logistic Regression for its simplicity.

# Creating a classifier
classifier = LogisticRegression()

# Defining input and output for training
X = vectors.toarray()
y = labels_new

# Training the classifier
classifier.fit(X, y)

In [18]:

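# Let's try another text (note: the sentence below also appears in the training data)
# random_text = "The movie is fantastic. I enjoyed every moment of it."
random_text = "I was very disappointed with this product, it didn't work as advertised."

# Apply the same preprocessing that was used for training before vectorizing
prediction = classifier.predict(vectorizer.transform([preprocess(random_text)]).toarray())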
print("Text: ", random_text)
print("Prediction: ", prediction[0])

Text:  I was very disappointed with this product, it didn't work as advertised.
Prediction:  0
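
A correct prediction on a sentence the model has already seen says little about new text. One common check is to hold out part of the data before training. The sketch below uses a toy dataset made up for illustration, not the workshop corpus, and the split size and `random_state` are arbitrary choices.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy reviews, made up for illustration (1: positive, 0: negative)
texts = [
    "great product, works well", "terrible product, broke quickly",
    "amazing service, very helpful", "rude staff, terrible service",
    "delicious food, great time", "cold food, slow service",
    "fantastic movie, loved it", "boring movie, waste of money",
]
targets = [1, 0, 1, 0, 1, 0, 1, 0]

# Hold out a quarter of the data; stratify keeps both classes in each split
train_texts, test_texts, y_train, y_test = train_test_split(
    texts, targets, test_size=0.25, stratify=targets, random_state=0
)

# Fit the vocabulary on the training split only
holdout_vectorizer = CountVectorizer()
X_train = holdout_vectorizer.fit_transform(train_texts)
X_test = holdout_vectorizer.transform(test_texts)

holdout_classifier = LogisticRegression()
holdout_classifier.fit(X_train, y_train)

# Mean accuracy on sentences the model never saw during training
holdout_accuracy = holdout_classifier.score(X_test, y_test)
print("Held-out accuracy:", holdout_accuracy)
```

With so few examples the number is very noisy; the point is only the workflow of splitting first and fitting afterwards.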


### Day 5 contd ...

Now let's save the classifier and vectorizer we built yesterday.

In [20]:
import os
import pickle

os.makedirs("data", exist_ok=True)

# Save the classifier and vectorizer as pickle files
with open("data/classifier.pkl", "wb") as f:
    pickle.dump(classifier, f)

with open("data/vectorizer.pkl", "wb") as f:
    pickle.dump(vectorizer, f)
