
**Natural Language Processing (NLP) Introduction:** Natural Language Processing (NLP) is an interdisciplinary field merging computer science, linguistics, and machine learning. Its main objective is to allow computers to comprehend and handle human language naturally and effectively. NLP tasks are generally divided into two key categories:

**(i) Natural Language Understanding (NLU):** This focuses on interpreting and understanding human language, including tasks like speech recognition, text classification, sentiment analysis, and extracting information from text.

**(ii) Natural Language Generation (NLG):** This focuses on producing human-readable text from structured data, involving tasks like text summarization, dialogue generation, and language translation.

**Examples:** NLP is applied in many fields, such as customer support (through chatbots and virtual assistants), content analysis (like sentiment analysis and topic modeling), and information retrieval (such as search engines and question-answering systems), among others.

**Important NLP Libraries in Python:** Python provides a vast ecosystem of libraries and frameworks for performing NLP tasks. Below are some of the most popular and frequently used libraries:

**(i) NLTK (Natural Language Toolkit):** NLTK is a highly versatile and widely used Python library for NLP tasks. It offers a comprehensive set of tools and resources for text processing, including functionalities like tokenization, stemming, lemmatization, part-of-speech tagging, and more, making it a go-to library for NLP practitioners.

**(ii) spaCy:** spaCy is a high-performance library designed for advanced NLP tasks, offering powerful models for tokenization, part-of-speech tagging, named entity recognition, dependency parsing, and more. It is well-regarded for its speed and ability to be easily integrated into production environments.

**(iii) TextBlob:** TextBlob is a user-friendly library built on top of NLTK and Pattern, designed to streamline common NLP tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, and more. It simplifies these processes, making it easier for users to implement NLP functionality in their projects.

**(iv) Gensim:** Gensim is a powerful library designed for topic modeling, offering optimized algorithms like Latent Dirichlet Allocation (LDA), Word2Vec, and Doc2Vec. It excels in tasks such as topic discovery, text similarity analysis, and generating word embeddings for better text representation.

**1. Data Preprocessing and Text Cleaning**

Before applying NLP techniques, it's crucial to preprocess and clean the text data to achieve accurate and dependable outcomes. Below are some commonly used preprocessing steps.

Here is a Python example of basic data preprocessing and text cleaning using the **Natural Language Toolkit (NLTK)** and **regular expressions (re)**.

Steps in the code:

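**(i) Lowercasing:** Converts the text to lowercase.

**(ii) Removing Punctuation:** Strips out any punctuation or special characters.

**(iii) Tokenization:** Splits the text into individual words.

**(iv) Removing Stopwords:** Filters out common words like "is", "and", "the" that are not essential for NLP tasks.

**(v) Lemmatization:** Reduces words to their base form (e.g., "running" → "run" when the word is treated as a verb). NLTK's WordNetLemmatizer treats every word as a noun unless a part of speech is passed to it, so verb forms such as "running" are left unchanged by the code below.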

In [3]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# Download NLTK resources (run only once)
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Example text
text = "This is an example sentence, to demonstrate text preprocessing! We'll clean it and tokenize."

# Function for text preprocessing
def preprocess_text(text):
    # 1. Lowercase the text
    text = text.lower()

    # 2. Remove punctuation and special characters
    text = re.sub(r'[^\w\s]', '', text)

    # 3. Tokenize the text
    tokens = word_tokenize(text)

    # 4. Remove stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]

    # 5. Lemmatization
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(word) for word in tokens]

    return tokens

# Process the example text
cleaned_text = preprocess_text(text)
print("Cleaned Tokens:", cleaned_text)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


Cleaned Tokens: ['example', 'sentence', 'demonstrate', 'text', 'preprocessing', 'well', 'clean', 'tokenize']


This script covers basic preprocessing techniques, which are essential for preparing raw text data for more advanced NLP tasks.
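The token `well` in the output above comes from "We'll" losing its apostrophe when punctuation is stripped. One possible workaround, sketched below with the standard library only, is to expand contractions before removing punctuation. The `CONTRACTIONS` map and the helper names are hypothetical and deliberately tiny; they are not part of NLTK.

```python
import re

# Hypothetical, deliberately small contraction map (an assumption for illustration);
# a real project would need a much fuller list and handling of curly apostrophes.
CONTRACTIONS = {
    "we'll": "we will",
    "it's": "it is",
    "don't": "do not",
}

def expand_contractions(text):
    """Lowercase the text and expand known contractions on word boundaries."""
    text = text.lower()
    for short, full in CONTRACTIONS.items():
        text = re.sub(r"\b" + re.escape(short) + r"\b", full, text)
    return text

def strip_punctuation(text):
    """Same punctuation rule as the preprocessing function above."""
    return re.sub(r"[^\w\s]", "", text)

raw = "We'll clean it and tokenize."

# Stripping punctuation first merges "we'll" into the unrelated word "well".
naive = strip_punctuation(raw.lower()).split()

# Expanding first keeps "we" and "will" as separate tokens.
expanded = strip_punctuation(expand_contractions(raw)).split()

print(naive)     # ['well', 'clean', 'it', 'and', 'tokenize']
print(expanded)  # ['we', 'will', 'clean', 'it', 'and', 'tokenize']
```

After expansion, "we" and "will" are ordinary stopwords, so the stopword filter would remove them instead of leaving a misleading `well` token.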

**2. Part-of-speech tagging and Named Entity Recognition**

Here is a Python example using the **spaCy** library for **Part-of-Speech (POS) tagging** and **Named Entity Recognition (NER)**.

Background concepts and steps in the code:

**(i) Load spaCy’s model:** en_core_web_sm is a lightweight model that includes tokenization, POS tagging, and NER.

**(ii) Part-of-Speech Tagging:** Each word in the text is tagged with its grammatical role, such as noun, verb, adjective, etc. Here’s a simple example:

Sentence: "The quick brown fox jumps over the lazy dog."

**PoS Tags:**

The - Determiner (DT)

quick - Adjective (JJ)

brown - Adjective (JJ)

fox - Noun (NN)

jumps - Verb (VBZ)

over - Preposition (IN)

the - Determiner (DT)

lazy - Adjective (JJ)

dog - Noun (NN)

Here's a brief overview of other common PoS tags, grouped by category:

**Pronoun:**

PRP (Personal Pronoun): I, you, he, she, it, we, they

PRP$ (Possessive Pronoun): my, your, his, her, its, our, their

**Number:**

There is no PoS tag named "number"; numerals are tagged as:

CD (Cardinal Number): one, two, three, 42

**Adverb:**

RB (Adverb): quickly, very, well

RBR (Adverb, Comparative): better, faster

RBS (Adverb, Superlative): best, fastest

**Punctuation:**

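. (Sentence-final punctuation): period, question mark, exclamation mark

, (Comma)

PUNCT is the coarse-grained tag that spaCy assigns to all punctuation, alongside the fine-grained tags above.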

**Sign:**

SYM (Symbol): $, %, &, @

**(iii) Named Entity Recognition:** The named entities in the text are identified and classified.

Here is an example:

Sentence: "Barack Obama visited the Eiffel Tower in Paris last summer."

**NER Tags:**

Barack Obama - Person (PER)

Eiffel Tower - Location (LOC)

Paris - Location (LOC)

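In this example, NER helps identify and categorize specific names and locations in the text. The labels shown here are generic; spaCy's English models use their own label set (for example PERSON and GPE), as the output of the code below shows.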

In [4]:
import spacy

# Load spaCy's English model
nlp = spacy.load("en_core_web_sm")

# Sample text
text = "Apple is looking at buying U.K. startup for $1 billion. Steve Jobs founded Apple in 1976."

# Process the text using spaCy
doc = nlp(text)

# Part-of-Speech (POS) tagging
print("Part-of-Speech Tagging:")
for token in doc:
    print(f"{token.text}: {token.pos_} ({token.tag_})")

# Named Entity Recognition (NER)
print("\nNamed Entities:")
for ent in doc.ents:
    print(f"{ent.text}: {ent.label_} ({spacy.explain(ent.label_)})")

Part-of-Speech Tagging:
Apple: PROPN (NNP)
is: AUX (VBZ)
looking: VERB (VBG)
at: ADP (IN)
buying: VERB (VBG)
U.K.: PROPN (NNP)
startup: NOUN (NN)
for: ADP (IN)
$: SYM ($)
1: NUM (CD)
billion: NUM (CD)
.: PUNCT (.)
Steve: PROPN (NNP)
Jobs: PROPN (NNP)
founded: VERB (VBD)
Apple: PROPN (NNP)
in: ADP (IN)
1976: NUM (CD)
.: PUNCT (.)

Named Entities:
Apple: ORG (Companies, agencies, institutions, etc.)
U.K.: GPE (Countries, cities, states)
$1 billion: MONEY (Monetary values, including unit)
Steve Jobs: PERSON (People, including fictional)
Apple: ORG (Companies, agencies, institutions, etc.)
1976: DATE (Absolute or relative dates or periods)


This example demonstrates how to use spaCy for two key NLP tasks, POS tagging and NER, which are essential for understanding the structure and meaning of text.
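Once tokens and entities are tagged, a common next step is to aggregate them. The sketch below is a minimal illustration using only the standard library: the `tagged` and `entities` lists are copied by hand from the output above (first sentence only for the tags), so no spaCy model is needed to run it.

```python
from collections import Counter, defaultdict

# (token, coarse POS tag) pairs copied from the first sentence of the output above.
tagged = [
    ("Apple", "PROPN"), ("is", "AUX"), ("looking", "VERB"), ("at", "ADP"),
    ("buying", "VERB"), ("U.K.", "PROPN"), ("startup", "NOUN"), ("for", "ADP"),
    ("$", "SYM"), ("1", "NUM"), ("billion", "NUM"), (".", "PUNCT"),
]

# (entity text, label) pairs copied from the named-entity output above.
entities = [
    ("Apple", "ORG"), ("U.K.", "GPE"), ("$1 billion", "MONEY"),
    ("Steve Jobs", "PERSON"), ("Apple", "ORG"), ("1976", "DATE"),
]

# Count how often each coarse POS tag occurs.
pos_counts = Counter(pos for _, pos in tagged)

# Group distinct entity strings under their label, keeping first-seen order.
entities_by_label = defaultdict(list)
for text, label in entities:
    if text not in entities_by_label[label]:
        entities_by_label[label].append(text)

print(pos_counts)
print(dict(entities_by_label))
```

With a live spaCy `doc`, the same counts could be built from `token.pos_` and the same grouping from `ent.text` and `ent.label_`, the attributes used in the cell above.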

**3. Sentiment Analysis**

Here is an example of **sentiment analysis** in **Python** using the **TextBlob** library.

In [5]:
from textblob import TextBlob

# Sample text
text = "I love this product! It's fantastic and works like a charm."

# Create a TextBlob object
blob = TextBlob(text)

# Perform sentiment analysis
sentiment = blob.sentiment

# Output sentiment polarity and subjectivity
print(f"Polarity: {sentiment.polarity}")    # Polarity: -1 (negative) to 1 (positive)
print(f"Subjectivity: {sentiment.subjectivity}")    # Subjectivity: 0 (objective) to 1 (subjective)

Polarity: 0.5125
Subjectivity: 0.75


**Explanation:**

**(i) Polarity:** Measures the sentiment's positivity or negativity (-1 for negative, 0 for neutral, 1 for positive).

**(ii) Subjectivity:** Ranges from 0 (objective) to 1 (subjective), indicating whether the text is based on fact or personal opinion.

In this case, the **polarity score** of 0.5125 suggests that the text is **moderately positive** but not overwhelmingly so. The language conveys a generally favorable tone, though it may not be excessively enthusiastic or optimistic.

With a **subjectivity score** of 0.75, the text likely contains a **strong personal bias, subjective expressions, or opinions**, rather than being purely based on facts or objective analysis.

This code shows that the sample text is moderately positive with a high degree of subjectivity. **TextBlob** is simple yet effective for basic sentiment analysis tasks.
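In practice, a polarity score is often turned into a coarse label. The sketch below shows one way to do that; the function name `label_polarity` and the 0.1 cut-off are assumptions chosen for illustration, not TextBlob defaults, and a sensible threshold depends on the data.

```python
def label_polarity(polarity, threshold=0.1):
    """Map a polarity score in [-1, 1] to a coarse sentiment label.

    `threshold` is a hypothetical cut-off: scores within
    [-threshold, threshold] are treated as neutral.
    """
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"

# 0.5125 is the polarity reported for the sample text above.
print(label_polarity(0.5125))  # positive
print(label_polarity(0.0))     # neutral
```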

**4. Topic Modeling and Document Clustering**

Here is a Python example for **Topic Modeling and Document Clustering** using the **Gensim library** and **Latent Dirichlet Allocation (LDA)**. **Topic modeling** is a method used to identify underlying themes or topics within a collection of documents, while **document clustering** focuses on grouping documents that share similar content or themes.

**Steps:**

**(i) Preprocessing:** Tokenization, lowercasing, and removal of stopwords.

**(ii) Dictionary and Document-Term Matrix:** Created using Gensim for converting text data into numerical form.

**(iii) LDA Model:** A Latent Dirichlet Allocation model is trained with 2 topics.

**(iv) Document Clustering:** Full clustering is not performed here. Instead, the example computes the cosine similarity between two documents, the kind of pairwise measure that clustering methods typically build on.
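To make step (ii) concrete, here is a minimal pure-Python sketch of what a dictionary and a bag-of-words conversion do. It is an illustration, not Gensim's implementation: the helper names `build_vocabulary` and `to_bow` are hypothetical, and Gensim may assign token ids in a different order.

```python
def build_vocabulary(tokenized_docs):
    """Assign an integer id to every distinct token, in order of first appearance."""
    token2id = {}
    for doc in tokenized_docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs; tokens not in the vocabulary are skipped."""
    counts = {}
    for token in tokens:
        if token in token2id:
            token_id = token2id[token]
            counts[token_id] = counts.get(token_id, 0) + 1
    return sorted(counts.items())

docs = [["machine", "learning", "ai"], ["deep", "learning", "learning"]]
vocab = build_vocabulary(docs)

print(vocab)                                                # {'machine': 0, 'learning': 1, 'ai': 2, 'deep': 3}
print(to_bow(["learning", "learning", "quantum"], vocab))   # [(1, 2)]
```

The second call shows two properties that matter later: repeated words are counted, and a word that never appeared in the training documents ("quantum" here) is silently dropped.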

In [6]:
import gensim
from gensim import corpora
from gensim.models import LdaModel
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import nltk

# Download NLTK tokenizer data and stopwords (run only once)
nltk.download('punkt')
nltk.download('stopwords')

# Sample documents
documents = [
    "Artificial intelligence is transforming the technology industry.",
    "Machine learning and AI are shaping the future of automation.",
    "Deep learning algorithms are a subset of machine learning.",
    "Quantum computing will revolutionize industries like AI.",
    "Healthcare is benefiting from AI and machine learning advances.",
]

# Preprocess the documents
def preprocess(doc):
    stop_words = set(stopwords.words('english'))
    tokens = word_tokenize(doc.lower())
    return [word for word in tokens if word.isalpha() and word not in stop_words]

processed_docs = [preprocess(doc) for doc in documents]

# Create a dictionary and document-term matrix
dictionary = corpora.Dictionary(processed_docs)
doc_term_matrix = [dictionary.doc2bow(doc) for doc in processed_docs]

# Train the LDA model (specifying 2 topics)
lda_model = LdaModel(doc_term_matrix, num_topics=2, id2word=dictionary, passes=15)

# Print the topics with associated words
print("Topics discovered by LDA:")
topics = lda_model.print_topics(num_words=5)
for topic in topics:
    print(topic)

# Document similarity (clustering example)
doc1_bow = dictionary.doc2bow(preprocess("AI and machine learning are advancing rapidly"))
doc2_bow = dictionary.doc2bow(preprocess("Healthcare is benefiting from AI advances"))

similarity = gensim.matutils.cossim(doc1_bow, doc2_bow)
print("\nDocument Similarity (cosine):", similarity)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Topics discovered by LDA:
(0, '0.113*"ai" + 0.065*"like" + 0.065*"quantum" + 0.065*"industries" + 0.065*"computing"')
(1, '0.129*"learning" + 0.091*"machine" + 0.053*"algorithms" + 0.053*"deep" + 0.053*"subset"')

Document Similarity (cosine): 0.2886751345948129


**Explanation:**

In this example, the **Gensim library** is used for topic modeling and for a simple document-similarity check. We begin by preparing a list of sample documents, then create a dictionary and a document-term matrix from them. After that, a Latent Dirichlet Allocation (LDA) model is trained with the number of topics we want to identify. The print_topics method displays the discovered topics along with their highest-weighted words.

Document similarity is then demonstrated by converting two documents into bag-of-words vectors and calculating their cosine similarity using gensim.matutils.cossim.

These examples provide a basic introduction to NLP with Python, setting the stage for more advanced techniques such as text generation, machine translation, and question answering.

The **coefficient values** (e.g., 0.113, 0.065, etc.) in the topics discovered by **LDA** are the probabilities the model assigns to each word within the corresponding topic. Higher values indicate words that are more characteristic of that topic. For example, in Topic 0, "ai" has the highest weight (0.113), while "like", "quantum", "industries" and "computing" are equally important, each with a weight of 0.065. In Topic 1, "learning" has a weight of 0.129, which makes it the most central word of that topic, followed by "machine" at 0.091. Because LDA training is randomized, the exact topics and weights can differ between runs unless a seed is fixed (LdaModel accepts a random_state parameter for this).
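To work with these weights programmatically rather than reading them off the printed string, the string can be parsed into (word, weight) pairs. The sketch below does this with a regular expression; `parse_topic` is a hypothetical helper, and the input is Topic 0 exactly as printed above. Gensim's LdaModel also provides a show_topic method that returns such pairs directly from a trained model.

```python
import re

def parse_topic(topic_string):
    """Turn '0.113*"ai" + 0.065*"like"' into [('ai', 0.113), ('like', 0.065)]."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]

# Topic 0 as printed in the output above.
topic_0 = '0.113*"ai" + 0.065*"like" + 0.065*"quantum" + 0.065*"industries" + 0.065*"computing"'

weights = parse_topic(topic_0)
top_word, top_weight = max(weights, key=lambda pair: pair[1])

print(weights)
print(top_word, top_weight)  # ai 0.113
```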

A **cosine similarity** score of approximately 0.289 indicates a fairly low level of similarity between the two documents. They share some vocabulary (here, only the word "ai"), but they are more different than similar.

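**Remember:** Cosine similarity is a measure of how similar two documents are, based on the angle between their vector representations in a multi-dimensional space. In general it ranges from -1 to 1, but for bag-of-words vectors, whose counts are never negative, it stays between 0 and 1:

(i) 1 indicates that the vectors point in the same direction, i.e., the documents use the same words in the same proportions.

(ii) 0 means the vectors are orthogonal, i.e., the documents share no terms.

(iii) Negative values, down to -1, can only occur with representations that have negative components (such as dense word embeddings) and indicate vectors pointing in opposite directions.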
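The score of about 0.289 reported above can be reproduced by hand. After preprocessing, the first query keeps only "ai", "machine" and "learning" ("advancing" and "rapidly" are not in the dictionary, so doc2bow drops them), while the second keeps "healthcare", "benefiting", "ai" and "advances". The only shared term is "ai", so the similarity is 1 / (√3 · √4) ≈ 0.2887. The sketch below is a plain-Python version of that calculation; the function name and the token ids are illustrative assumptions, not the ids Gensim assigns.

```python
import math

def cosine_similarity(bow_a, bow_b):
    """Cosine similarity between two sparse vectors given as (token_id, count) pairs."""
    vec_a, vec_b = dict(bow_a), dict(bow_b)
    # Only ids present in both vectors contribute to the dot product.
    dot = sum(count * vec_b.get(token_id, 0) for token_id, count in vec_a.items())
    norm_a = math.sqrt(sum(count * count for count in vec_a.values()))
    norm_b = math.sqrt(sum(count * count for count in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)

# Hypothetical token ids, chosen for illustration only.
doc1 = [(0, 1), (1, 1), (2, 1)]          # ai, machine, learning
doc2 = [(0, 1), (3, 1), (4, 1), (5, 1)]  # ai, healthcare, benefiting, advances

print(round(cosine_similarity(doc1, doc2), 4))  # 0.2887
```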

**Conclusions:**

Natural Language Processing (NLP) using Python empowers machines to understand and interpret human language. With Python's vast array of NLP libraries, one can perform various tasks, including cleaning and tokenizing text, analyzing sentiment, modeling topics, and clustering documents. These tools offer immense capabilities for developing intelligent language-based applications.