In [1]:
import os
os.chdir("D:/NLP/NLP-a-day-keeps-doctors-away")

In [2]:
import pandas as pd

# Introduction to NLP
Natural Language Processing (NLP) is a field at the intersection of computer science, artificial intelligence, and linguistics. It involves understanding, interpreting, and manipulating human language using algorithms and computational techniques.

# Key Concepts in NLP
**Tokenization**: This is the process of breaking down text into smaller units called tokens, which can be words, phrases, or symbols.

**Text Cleaning and Preprocessing**: Techniques such as converting text to lowercase, removing punctuation, and removing stopwords (common words that usually contribute little meaning).

**Stemming and Lemmatization**: Methods for reducing words to their root form. Stemming is the cruder approach, typically chopping off word endings, while lemmatization considers the context and converts the word to its meaningful base form.

**Bag of Words (BoW)**: A simple yet powerful way to represent text data in machine learning. It represents a document by how often each word occurs in it, ignoring word order.

**Term Frequency-Inverse Document Frequency (TF-IDF)**: A statistical measure of how important a word is to a document relative to the rest of the corpus.

**Regular Expressions (Regex)**: Useful for searching, matching, and manipulating text.
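
As a minimal sketch of the cleaning steps listed above, the snippet below uses Python's built-in `re` module to strip HTML tags, lowercase the text, and drop punctuation. The sample string and the helper name `clean_text` are made up for illustration and are not part of the dataset or of any library.

```python
import re

def clean_text(text):
    """Strip HTML tags, lowercase, and remove punctuation (illustrative helper)."""
    text = re.sub(r"<[^>]+>", " ", text)       # replace HTML tags such as <br /> with a space
    text = text.lower()                         # normalise case
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # replace everything except letters, digits, whitespace
    return re.sub(r"\s+", " ", text).strip()    # collapse repeated whitespace

sample = "A GREAT film!<br /><br />Loved it, 10/10."
cleaned = clean_text(sample)
print(cleaned)
```

Replacing unwanted characters with a space, rather than deleting them, keeps neighbouring words from being glued together.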

In [3]:
df=pd.read_csv("IMDB_Dataset.csv")

In [4]:
df.head()

Unnamed: 0,review,sentiment
0,I grew up (b. 1965) watching and loving the Th...,negative
1,"When I put this movie in my DVD player, and sa...",negative
2,Why do people who do not know what a particula...,negative
3,Even though I have great interest in Biblical ...,negative
4,Im a die hard Dads Army fan and nothing will e...,positive


**What is Tokenization?**
Tokenization is the process of splitting a string of text into a list of smaller units called tokens. Tokens are typically words or subwords, but the unit depends on the level you work at: a word is a token in a sentence, and a sentence is a token in a paragraph. Tokenization is a critical first step in many NLP tasks, including text processing, language modelling, and machine translation.

![image.png](attachment:image.png)



**Why Tokenization Is Needed**
Tokenization is a crucial step in text processing and natural language processing (NLP) for several reasons.


1. Effective Text Processing: Tokenization breaks raw text into small, uniform units that are easier to process and analyse.
2. Feature extraction: Text data can be represented numerically for algorithmic comprehension by using tokens as features in machine learning models.
3. Language Modelling: Tokenization in NLP facilitates the creation of organised representations of language, which is useful for tasks like text generation and language modelling.
4. Information Retrieval: Tokenization is essential for indexing and searching in systems that store and retrieve information efficiently based on words or phrases.
5. Text Analysis: Tokenization is used in many NLP tasks, including sentiment analysis and named entity recognition, to determine the function and context of individual words in a sentence.
6. Vocabulary Management: By generating a list of distinct tokens that stand in for words in the dataset, tokenization helps manage a corpus’s vocabulary.
7. Task-Specific Adaptation: Tokenization can be customised to the needs of a particular NLP task, such as summarization or machine translation, so that the resulting tokens suit that application.
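
The feature-extraction and vocabulary-management points above can be sketched in a few lines of plain Python. The toy corpus below is made up for illustration; the idea is only that distinct tokens receive integer ids, which turns text into numbers a model can consume.

```python
# Toy corpus of already-tokenized documents (made up for illustration).
tokenized_docs = [
    ["the", "movie", "was", "great"],
    ["the", "plot", "was", "thin"],
]

# Vocabulary: every distinct token, mapped to an integer id in first-seen order.
vocab = {}
for doc in tokenized_docs:
    for token in doc:
        if token not in vocab:
            vocab[token] = len(vocab)

# Encode each document as a list of ids.
encoded = [[vocab[token] for token in doc] for doc in tokenized_docs]

print(vocab)
print(encoded)
```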


Some terms that will be used frequently:
1. Corpus – A body of text. The plural is corpora.
2. Lexicon – Words and their meanings.
3. Token – Each “entity” produced when text is split up according to some rule. For example, each word is a token when a sentence is tokenized into words, and each sentence is a token when a paragraph is tokenized into sentences.

**Word tokenization**: splitting a sentence into words.

In [5]:

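import nltk
from nltk.tokenize import word_tokenize

# Download the Punkt models used by word_tokenize (skipped if already present).
# Recent NLTK releases may ask for 'punkt_tab' instead.
nltk.download('punkt')

# Example text: the first 30 characters of the first review.
# Slicing by characters can cut the last word short.
example_text = df['review'][0][0:30]

# Tokenization
tokens = word_tokenize(example_text)
print(tokens)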


['I', 'grew', 'up', '(', 'b', '.', '1965', ')', 'watching', 'a']


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\taaha\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


**Sentence tokenization**: splitting a paragraph into sentences.

In [6]:
from nltk.tokenize import sent_tokenize


sent_tokenize(df['review'][3])


['Even though I have great interest in Biblical movies, I was bored to death every minute of the movie.',
 'Everything is bad.',
 'The movie is too long, the acting is most of the time a Joke and the script is horrible.',
 'I did not get the point in mixing the story about Abraham and Noah together.',
 'So if you value your time and sanity stay away from this horror.']

NLTK also provides other tokenizers, for example:
1. WordPunctTokenizer – Separates punctuation from words, so punctuation becomes its own token.
2. PunktWordTokenizer – Does not separate punctuation from words. It belongs to older NLTK releases and may not be available in current ones.
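
To make the difference concrete, here is a small sketch using only the standard library. The regular expression is an assumption chosen to mimic the behaviour described for `WordPunctTokenizer` (runs of word characters and runs of punctuation become separate tokens); it is not taken from the notebook, and the sample sentence is made up.

```python
import re

text = "Wait... it's over?"

# Whitespace split: punctuation stays attached to the neighbouring word.
whitespace_tokens = text.split()
print(whitespace_tokens)

# Regex split: runs of word characters and runs of punctuation become separate tokens.
punct_tokens = re.findall(r"\w+|[^\w\s]+", text)
print(punct_tokens)
```

Note that the regex version also splits contractions such as “it's” into three tokens, which may or may not be what a downstream task wants.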
    

# 2. Stemming

Stemming is the process of reducing words to their word stem or root form, mainly by stripping suffixes. For instance, “fishing” and “fished” both reduce to the stem “fish”. Programs that do this are called stemming algorithms or stemmers. A stem does not have to be a dictionary word: the Porter stemmer, for example, reduces “retrieval”, “retrieved”, and “retrieves” to “retriev”. Because a stemmer only removes or rewrites endings, it cannot expand a clipped form such as “choco” to “chocolate”. The input to a stemmer is a list of tokenized words.

Stemming normalizes text so that related word forms are counted as the same term. It is a common text pre-processing step in information retrieval and text mining applications.

The Porter stemmer is one of the most widely used stemmers for English.

Stemming is different from lemmatization. Lemmatization also reduces a word to its base form, but it takes the context of the word into account and produces a valid word, whereas stemming may produce a non-word as the root form.

In [7]:
from nltk.stem import PorterStemmer

# Initialize the Stemmer
stemmer = PorterStemmer()

# Stemming
stemmed_tokens = [stemmer.stem(token) for token in tokens]
print(stemmed_tokens)


['i', 'grew', 'up', '(', 'b', '.', '1965', ')', 'watch', 'a']
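
The short token list above only changes “watching”. As a further sketch, the snippet below runs the same `PorterStemmer` on a few word families chosen for illustration (they are not from the dataset). The stemmer itself needs no NLTK data download.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Word families chosen for illustration.
words = ["fishing", "fished", "retrieval", "retrieved", "retrieves"]
stems = {word: stemmer.stem(word) for word in words}

for word, stem in stems.items():
    print(word, "->", stem)
```

With NLTK's default Porter implementation the “retrieve” family should collapse to a single stem that is not itself an English word, which is the trade-off stemming makes for speed and simplicity.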



# 3. Lemmatization

**What is Lemmatization?**
Lemmatization goes further than stemming. Instead of only trimming endings, it uses a language’s full vocabulary and a morphological analysis of each word to remove inflectional endings and return the base or dictionary form of the word, which is known as the lemma.

The following examples illustrate this:
![image.png](attachment:image.png)

In [10]:
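import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')

# Initialize the Lemmatizer
lemmatizer = WordNetLemmatizer()

# Lowercase the tokens, then lemmatize each one.
# lemmatize() treats every word as a noun unless a pos argument is given,
# so verb forms such as 'grew' and 'watching' are left unchanged here.
lower_tokens = [word.lower() for word in tokens]
lemmatized_tokens = [lemmatizer.lemmatize(token) for token in lower_tokens]
print(lemmatized_tokens)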

['i', 'grew', 'up', '(', 'b', '.', '1965', ')', 'watching', 'a']
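
The output leaves “grew” and “watching” untouched because NLTK's `lemmatize` treats a word as a noun unless a part of speech is supplied. The toy sketch below illustrates why the part of speech matters. It uses a tiny hand-written lookup table (`TOY_LEMMAS` and `toy_lemmatize` are hypothetical names, not NLTK or WordNet) so that it runs without any downloads.

```python
# Hypothetical mini-lexicon: (word, part of speech) -> lemma.
TOY_LEMMAS = {
    ("grew", "v"): "grow",
    ("watching", "v"): "watch",
    ("mice", "n"): "mouse",
}

def toy_lemmatize(word, pos="n"):
    """Look the word up under the given part of speech; fall back to the word itself."""
    word = word.lower()
    return TOY_LEMMAS.get((word, pos), word)

print(toy_lemmatize("watching"))        # looked up as a noun: no entry, returned unchanged
print(toy_lemmatize("watching", "v"))   # looked up as a verb
print(toy_lemmatize("Grew", "v"))
```

A real lemmatizer replaces the lookup table with a full lexicon and morphological rules, but the dependence on the part of speech is the same.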


[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\taaha\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!



# Stop Words Removal

**What are Stop Words?**
A stop word is a commonly used word (such as “the”, “a”, “an”, “in”) that a search engine has been programmed to ignore, both when indexing entries for searching and when retrieving them as the result of a search query.

We do not want these words to take up space in our database or valuable processing time. They are easy to remove: keep a list of the words you consider stop words and filter them out. NLTK (Natural Language Toolkit) in Python ships stop word lists for 16 different languages (the count may differ between NLTK data versions). You can find them in the nltk_data directory.

In [11]:
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
print(stopwords.words('english'))


['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\taaha\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [16]:
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

# Keep only the tokens whose lowercase form is not in the stop word list
filtered_sentence = [w for w in lemmatized_tokens if w.lower() not in stop_words]

print(lemmatized_tokens)
print(filtered_sentence)


['i', 'grew', 'up', '(', 'b', '.', '1965', ')', 'watching', 'a']
['grew', '(', 'b', '.', '1965', ')', 'watching']
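
The English list printed earlier contains negations such as “no”, “nor”, and “not”. For sentiment data, removing them can change what a sentence appears to say. The sketch below shows one way to keep negations. It uses a small hand-written stop list (an illustrative subset, not the full NLTK list) and a made-up sentence so that it runs without downloads.

```python
# Illustrative subset of a stop word list (not the full NLTK list).
stop_words_subset = {"i", "the", "a", "is", "was", "this", "not", "no", "nor"}

# Negations to keep because they carry sentiment.
negations = {"not", "no", "nor"}
sentiment_stop_words = stop_words_subset - negations

sample_tokens = ["this", "movie", "was", "not", "good"]

plain = [t for t in sample_tokens if t not in stop_words_subset]
keep_negation = [t for t in sample_tokens if t not in sentiment_stop_words]

print(plain)          # negation dropped
print(keep_negation)  # negation preserved
```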


# Parts of Speech Tagging

**What is Part-of-Speech (POS) Tagging?** POS tagging assigns a part-of-speech tag to every word in a sentence. The input is a list of words and the output is a list of tuples of the form (word, tag), where the tag signifies whether the word is a noun, adjective, verb, and so on.

![image.png](attachment:image.png)

In [25]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

nltk.download('averaged_perceptron_tagger')
stop_words = set(stopwords.words('english'))

# sent_tokenize uses a pre-trained PunktSentenceTokenizer
# from the nltk.tokenize.punkt module.
# Only the first 50 characters of the review are used,
# so the last word may be cut off.
tokenized = sent_tokenize(df['review'][8][0:50])
for sentence in tokenized:

    # Split the sentence into words and punctuation
    words_list = word_tokenize(sentence)

    # Remove stop words
    words_list = [w for w in words_list if w.lower() not in stop_words]

    # Tag each remaining word with its part of speech
    tagged = nltk.pos_tag(words_list)

    print(tagged)

[('may', 'MD'), ('remake', 'VB'), ('1987', 'CD'), ('Autumn', 'NNP'), ("'s", 'POS'), ('Tale', 'NNP'), ('e', 'NN')]


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\taaha\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


**Tagging Abbreviations**

CC coordinating conjunction

CD cardinal number

DT determiner

EX existential there (like “there is”; think of it as “there exists”)

FW foreign word

IN preposition/subordinating conjunction

JJ adjective – ‘big’

JJR adjective, comparative – ‘bigger’

JJS adjective, superlative – ‘biggest’

LS list item marker – ‘1)’

MD modal – ‘could’, ‘will’

NN noun, singular – ‘desk’

NNS noun, plural – ‘desks’

NNP proper noun, singular – ‘Harrison’

NNPS proper noun, plural – ‘Americans’

PDT predeterminer – ‘all’ in ‘all the kids’

POS possessive ending – ‘parent’s’

PRP personal pronoun – ‘I’, ‘he’, ‘she’

PRP$ possessive pronoun – ‘my’, ‘his’, ‘hers’

RB adverb – ‘very’, ‘silently’

RBR adverb, comparative – ‘better’

RBS adverb, superlative – ‘best’

RP particle – ‘up’ in ‘give up’

TO to – ‘to’ in ‘go to the store’

UH interjection – ‘errrrrrrrm’

VB verb, base form – ‘take’

VBD verb, past tense – ‘took’

VBG verb, gerund/present participle – ‘taking’

VBN verb, past participle – ‘taken’

VBP verb, non-3rd person singular present – ‘take’

VBZ verb, 3rd person singular present – ‘takes’

WDT wh-determiner – ‘which’

WP wh-pronoun – ‘who’, ‘what’

WP$ possessive wh-pronoun – ‘whose’

WRB wh-adverb – ‘where’, ‘when’
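
Many of these tags share a prefix (NN for nouns, VB for verbs, JJ for adjectives, RB for adverbs), which makes it easy to collapse them into coarse classes. A minimal sketch, using the tagged output printed by the cell above as its input; the `coarse` mapping is an illustrative simplification, not part of NLTK.

```python
from collections import defaultdict

# Tagged output copied from the cell above.
tagged = [("may", "MD"), ("remake", "VB"), ("1987", "CD"), ("Autumn", "NNP"),
          ("'s", "POS"), ("Tale", "NNP"), ("e", "NN")]

# Coarse classes keyed by the first two letters of the tag (illustrative mapping).
coarse = {"NN": "noun", "VB": "verb", "JJ": "adjective", "RB": "adverb"}

groups = defaultdict(list)
for word, tag in tagged:
    groups[coarse.get(tag[:2], "other")].append(word)

print(dict(groups))
```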

# 4. Bag of Words

The Bag of Words (BoW) model converts text into a numerical representation in which each document is represented by a vector of word counts over a shared vocabulary. Word order is discarded.
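
Before turning to a library, the idea can be sketched with the standard library alone. The two documents below are made up for illustration, and whitespace splitting stands in for a real tokenizer.

```python
from collections import Counter

# Two toy documents (made up for illustration), tokenized by whitespace.
docs = ["the movie was good good", "the plot was bad"]
tokenized_docs = [doc.split() for doc in docs]

# Shared, sorted vocabulary so every document maps to a vector of the same length.
vocab = sorted({token for doc in tokenized_docs for token in doc})

# One count vector per document; position i holds the count of vocab[i].
vectors = []
for doc in tokenized_docs:
    counts = Counter(doc)
    vectors.append([counts[token] for token in vocab])

print(vocab)
print(vectors)
```

Libraries usually store the same information sparsely, as (word id, count) pairs that list only the words actually present in a document.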

In [59]:
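from gensim import corpora
from gensim.utils import simple_preprocess

# A single review (row 66), kept as a list of documents
doc_list = df['review'][66:67]

# Lowercase and tokenize each document
doc_tokenized = [simple_preprocess(doc) for doc in doc_list]

# Start with an empty dictionary; allow_update=True adds new words as they are seen
dictionary = corpora.Dictionary()

# Each document becomes a list of (word id, count) pairs
BoW_corpus = [dictionary.doc2bow(doc, allow_update=True) for doc in doc_tokenized]
print(BoW_corpus)
print('\n')

# Replace the ids with the words they stand for
id_words = [[(dictionary[word_id], count) for word_id, count in line] for line in BoW_corpus]
print(id_words)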

[[(0, 1), (1, 1), (2, 2), (3, 1), (4, 1), (5, 1), (6, 9), (7, 1), (8, 1), (9, 1), (10, 2), (11, 6), (12, 1), (13, 2), (14, 1), (15, 2), (16, 1), (17, 1), (18, 2), (19, 6), (20, 1), (21, 2), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 2), (34, 1), (35, 1), (36, 1), (37, 1), (38, 1), (39, 1), (40, 1), (41, 1), (42, 1), (43, 3), (44, 6), (45, 1), (46, 1), (47, 1), (48, 1), (49, 1), (50, 1), (51, 2), (52, 1), (53, 5), (54, 1), (55, 2), (56, 3), (57, 1), (58, 1), (59, 1), (60, 1), (61, 4), (62, 2), (63, 3), (64, 6), (65, 1), (66, 1), (67, 1), (68, 1), (69, 1), (70, 6), (71, 1), (72, 2), (73, 1), (74, 1), (75, 1), (76, 2), (77, 1), (78, 1), (79, 2), (80, 2), (81, 2), (82, 1), (83, 1), (84, 1), (85, 2), (86, 1), (87, 3), (88, 1), (89, 1), (90, 1), (91, 1), (92, 17), (93, 1), (94, 2), (95, 1), (96, 2), (97, 2), (98, 1), (99, 1), (100, 1), (101, 1), (102, 1), (103, 1), (104, 1), (105, 1), (106, 1), (107, 1), (108, 1), (109, 1), (110, 1

# Understanding TF-IDF (Term Frequency-Inverse Document Frequency)

TF-IDF stands for Term Frequency-Inverse Document Frequency. It measures how relevant a word is to a document within a collection of documents (a corpus). The weight increases in proportion to the number of times the word appears in the document, but it is offset by how many documents in the corpus contain the word, so words that appear everywhere receive low weights.
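
One common way to write this down is shown below. It is a sketch of a typical formulation; libraries differ in the logarithm base, in smoothing, and in whether the resulting vector is normalised.

$$
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \log \frac{N}{\mathrm{df}(t)}
$$

Here $\mathrm{tf}(t, d)$ is the number of times term $t$ occurs in document $d$, $N$ is the number of documents in the corpus, and $\mathrm{df}(t)$ is the number of documents that contain $t$. A term that occurs in every document has $\mathrm{df}(t) = N$, so its weight is $\log 1 = 0$.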

In [71]:
from gensim import corpora
from gensim.models import TfidfModel

# Sample documents
documents = [
    "Machine learning is a subset of artificial intelligence.",
    "Natural language processing is used in various applications.",
    "Machine learning and NLP are essential in modern AI systems."]

# Tokenize the documents (lowercase, split on whitespace) and create a dictionary.
# Splitting on whitespace leaves punctuation attached, e.g. "intelligence."
text_tokens = [doc.lower().split() for doc in documents]
dictionary = corpora.Dictionary(text_tokens)

# Create a bag-of-words (BoW) representation for each document
corpus = [dictionary.doc2bow(text) for text in text_tokens]

# Create a TfidfModel
tfidf_model = TfidfModel(corpus, normalize=True)

# Transform the BoW representation into Tfidf representation
tfidf_representation = tfidf_model[corpus]

# Print the Tfidf representation for each document
for i, doc in enumerate(tfidf_representation):
    print(f"Document {i + 1}: {doc}")

Document 1: [(0, 0.42998768831312806), (1, 0.42998768831312806), (2, 0.42998768831312806), (3, 0.1586956620869655), (4, 0.1586956620869655), (5, 0.1586956620869655), (6, 0.42998768831312806), (7, 0.42998768831312806)]
Document 2: [(3, 0.1473639561879945), (8, 0.39928430322959224), (9, 0.1473639561879945), (10, 0.39928430322959224), (11, 0.39928430322959224), (12, 0.39928430322959224), (13, 0.39928430322959224), (14, 0.39928430322959224)]
Document 3: [(4, 0.13559379987212375), (5, 0.13559379987212375), (9, 0.13559379987212375), (15, 0.36739293179076876), (16, 0.36739293179076876), (17, 0.36739293179076876), (18, 0.36739293179076876), (19, 0.36739293179076876), (20, 0.36739293179076876), (21, 0.36739293179076876)]
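
To see where these numbers come from, the weights can be recomputed by hand. The sketch below is not gensim's code. It assumes raw term counts, a base-2 logarithm of N / df, and scaling each document vector to unit Euclidean length; under those assumptions it gives about 0.43 and 0.159 for Document 1, in line with the output above. The names `doc_freq` and `tfidf_vector` are made up for this example.

```python
import math
from collections import Counter

documents = [
    "Machine learning is a subset of artificial intelligence.",
    "Natural language processing is used in various applications.",
    "Machine learning and NLP are essential in modern AI systems.",
]
tokenized_docs = [doc.lower().split() for doc in documents]
n_docs = len(tokenized_docs)

# Document frequency: number of documents that contain each term.
doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))

def tfidf_vector(doc):
    """Raw count times log2(N / df), then scaled to unit Euclidean length."""
    counts = Counter(doc)
    weights = {t: c * math.log2(n_docs / doc_freq[t]) for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0  # avoid dividing by zero
    return {t: w / norm for t, w in weights.items()}

vec = tfidf_vector(tokenized_docs[0])
print(round(vec["subset"], 4), round(vec["machine"], 4))
```

Terms that occur in only one document (“subset”) get the larger weight, while terms shared by two of the three documents (“machine”, “learning”, “is”) get the smaller one.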
