# NLP Data Pipeline

### Introduction
This Jupyter Notebook introduces key concepts in text preprocessing and sentiment analysis using Natural Language Processing (NLP) techniques. It leverages Python libraries like NLTK and scikit-learn to demonstrate various NLP tasks and build a basic sentiment classification model. The notebook covers text tokenization at both the word and sentence levels, providing practical examples of processing English sentences and handling input text. It explores the removal of stopwords, with examples in English and French, and demonstrates how to view and work with predefined lists of stopwords in NLTK. The workflow for building a sentiment analysis model involves vectorizing text data with CountVectorizer, splitting datasets into training and testing subsets with varying train-test ratios, and training a Naive Bayes classifier using scikit-learn's MultinomialNB.

### Key Stages of the Notebook
<b>1. Importing the necessary Libraries</b><br>
Essential libraries for this implementation are:
- NLTK: A comprehensive library for working with human language data, providing tools for text processing, tokenization, stemming, lemmatization, and more.
- Scikit-learn: A versatile machine learning library for building, training, and evaluating models, with support for classification, regression, clustering, and preprocessing tasks.

<b>2. Text Pre-Processing</b><br>
- *Word Tokenization*: Splitting a text into individual words for analysis.
- *Sentence Tokenization*: Dividing a text into individual sentences for processing.
- *Lower Casing*: Converting all characters in a text to lowercase to ensure uniformity.
- *Stopwords Removal*: Eliminating commonly used words (e.g., "the", "and") that add little meaning to text analysis.
- *Stemming*: Reducing words to their root form by removing suffixes.
- *Lemmatization*: Reducing words to their base or dictionary form based on linguistic rules.

<b>3. Splitting the Train and Test Data</b><br> 
The purpose of splitting a dataset into training and testing subsets is to evaluate the performance of a machine learning model on unseen data. This ensures that the model performs well not only on the data it was trained on but also on new, unseen data, reducing the risk of overfitting and providing a more reliable estimate of its real-world performance.

### Learning Outcome
Upon completion of this Notebook, students will be able to:
- Tokenize text into words and sentences using NLTK for further text analysis.
- Identify and remove stopwords in multiple languages to improve text processing.
- Apply text normalization techniques such as stemming and lemmatization to standardize textual data.
- Convert text into numerical features using CountVectorizer.
- Train and test a sentiment analysis model using the Naive Bayes algorithm with scikit-learn.
- Explain how varying the train-test split affects measured model performance and what that says about generalization.

## What kind of AI Projects would this Jupyter Notebook extend to?
This Jupyter Notebook (JN) can be extended to a variety of AI projects involving text data and Natural Language Processing (NLP):
- Language Translation: Integrate tokenization, stemming, and lemmatization with translation APIs or models to preprocess and translate text efficiently.
- Social Media Analytics: Analyze Twitter or Facebook data for trends, opinions, and sentiment using preprocessing and classification techniques.

### Content Flow
An outline of the tasks performed in this Python implementation:
1. [Import the Libraries](#import-the-libraries)
2. [Text Pre-Processing](#text-pre-processing)
    - [Word Tokenization](#word-tokenization)
    - [Sentence Tokenization](#sentence-tokenization)
    - [Lower Casing](#lower-casing)
    - [Stopwords Removal](#stopwords-removal)
    - [Stemming](#stemming)
    - [Lemmatization](#lemmatization)
3. [Splitting Train and Test Data](#splitting-train-and-test-data)

### Time Required
It would take about an hour to complete the process discussed in this notebook. Follow the instructions and go through the additional explanations in this Notebook for easier execution.

### Hardware Requirement:
Any computer with internet access and a web browser.

# Import the Libraries
The following libraries are imported:
- NLTK: A Python library for natural language processing, offering tools for text analysis, tokenization, and linguistic tasks.
- PorterStemmer: A stemming tool in NLTK that reduces words to their root form by removing suffixes.
- WordNetLemmatizer: A lemmatization tool in NLTK that reduces words to their base form based on linguistic rules.
- CountVectorizer: A scikit-learn tool that converts text data into a bag-of-words numerical representation for machine learning.
- train_test_split: A scikit-learn function to split datasets into training and testing subsets for model evaluation.
- MultinomialNB: A Naive Bayes classifier in scikit-learn suitable for text data and discrete features.
- accuracy_score: A scikit-learn metric that calculates the ratio of correct predictions to total predictions for model evaluation.

In [1]:
#Import the necessary Library
import nltk

#Libraries for stopwords removal and tokenization
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

#Libraries for Stemming and Lemmatization
from nltk.stem import PorterStemmer, WordNetLemmatizer  

#Libraries for modeling and data split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Text Pre-Processing

### Word Tokenization

In [2]:
#Input text
data = "Let us talk about Python"

# Word Tokenization
nltk_tokens = nltk.word_tokenize(data)
print(nltk_tokens)

['Let', 'us', 'talk', 'about', 'Python']


### Sentence Tokenization

In [3]:
#Input text
sentence_data = "Let's talk about Python. Let's not talk about Python."

In [4]:
#Sentence Tokenization
nltk_tokens = nltk.sent_tokenize(sentence_data)
print(nltk_tokens)

["Let's talk about Python.", "Let's not talk about Python."]


### Lower Casing

In [5]:
sentence = "Books are on the table."
sentence = sentence.lower()
sentence

'books are on the table.'

### Stopwords Removal

In [6]:
#To view all the stopwords in English
print(stopwords.words('english'))
print("The total number of stopwords in English: ",len(stopwords.words('english')))

['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', "he'd", "he'll", 'her', 'here', 'hers', 'herself', "he's", 'him', 'himself', 'his', 'how', 'i', "i'd", 'if', "i'll", "i'm", 'in', 'into', 'is', 'isn', "isn't", 'it', "it'd", "it'll", "it's", 'its', 'itself', "i've", 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan', "shan't", 'she

Similarly, we can view the stopwords for other languages supported by NLTK.
Let's try this out with French!

In [7]:
#To view all the stopwords in French
print(stopwords.words('french'))
print("The total number of stopwords in French: ",len(stopwords.words('french')))

['au', 'aux', 'avec', 'ce', 'ces', 'dans', 'de', 'des', 'du', 'elle', 'en', 'et', 'eux', 'il', 'ils', 'je', 'la', 'le', 'les', 'leur', 'lui', 'ma', 'mais', 'me', 'même', 'mes', 'moi', 'mon', 'ne', 'nos', 'notre', 'nous', 'on', 'ou', 'par', 'pas', 'pour', 'qu', 'que', 'qui', 'sa', 'se', 'ses', 'son', 'sur', 'ta', 'te', 'tes', 'toi', 'ton', 'tu', 'un', 'une', 'vos', 'votre', 'vous', 'c', 'd', 'j', 'l', 'à', 'm', 'n', 's', 't', 'y', 'été', 'étée', 'étées', 'étés', 'étant', 'étante', 'étants', 'étantes', 'suis', 'es', 'est', 'sommes', 'êtes', 'sont', 'serai', 'seras', 'sera', 'serons', 'serez', 'seront', 'serais', 'serait', 'serions', 'seriez', 'seraient', 'étais', 'était', 'étions', 'étiez', 'étaient', 'fus', 'fut', 'fûmes', 'fûtes', 'furent', 'sois', 'soit', 'soyons', 'soyez', 'soient', 'fusse', 'fusses', 'fût', 'fussions', 'fussiez', 'fussent', 'ayant', 'ayante', 'ayantes', 'ayants', 'eu', 'eue', 'eues', 'eus', 'ai', 'as', 'avons', 'avez', 'ont', 'aurai', 'auras', 'aura', 'aurons', 'aur

Now, let's continue with removing stopwords in a given input text in English.

In [8]:
#Input Text
sentence = "He is the only person in the library"

In [9]:
#Load the English stopwords and tokenize the sentence into words
stop_words = set(stopwords.words('english'))
word_tokens = word_tokenize(sentence)
word_tokens

['He', 'is', 'the', 'only', 'person', 'in', 'the', 'library']

In [10]:
#Removing Stopwords (the comparison is case-sensitive, so 'He' is kept)
filtered_sentence = [w for w in word_tokens if w not in stop_words]
print(filtered_sentence)

['He', 'person', 'library']
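
The output above still contains 'He' because the stopword list is lowercase and the membership test is case-sensitive. A minimal sketch of a case-insensitive filter follows; to keep it self-contained it uses a small hand-written stopword set (an assumption for illustration, not the full NLTK list).

```python
# Hypothetical mini stopword set for illustration; NLTK's English list is also lowercase
stop_words = {"he", "is", "the", "only", "in"}

word_tokens = ['He', 'is', 'the', 'only', 'person', 'in', 'the', 'library']

# Case-sensitive test: 'He' does not match 'he', so it is kept
case_sensitive = [w for w in word_tokens if w not in stop_words]

# Case-insensitive test: compare the lowercased token against the list
case_insensitive = [w for w in word_tokens if w.lower() not in stop_words]

print(case_sensitive)    # ['He', 'person', 'library']
print(case_insensitive)  # ['person', 'library']
```

Lowercasing only for the comparison keeps the original casing of the tokens that survive the filter.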


### Stemming

Stemming is the process of reducing a word to its root or base form by stripping affixes. It chops off word endings using predefined rules without considering the meaning of the word, so the result is not always a dictionary word.

PorterStemmer is NLTK's implementation of the Porter stemming algorithm, a rule-based stemmer for English that works by removing suffixes.

We begin by creating an instance of PorterStemmer.

In [11]:
#Create an instance of PorterStemmer
ps = PorterStemmer()

In [12]:
#Input Text
sentence = "cats mice learning"

In [13]:
for word in sentence.split():
    print(ps.stem(word))

cat
mice
learn


Observe how 'cats' and 'learning' are reduced to their base forms, while the irregular plural 'mice' is left unchanged.
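
To make the idea of rule-based suffix stripping concrete, here is a toy stemmer. It is a deliberately simplified sketch with three hypothetical rules, not the Porter algorithm, and `toy_stem` is a name invented for this example.

```python
def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

for word in "cats mice learning news".split():
    print(word, "->", toy_stem(word))
```

The rules never look at meaning: 'mice' has no matching suffix and passes through unchanged, while 'news' loses its final 's' even though it is not a plural. A real stemmer has many more rules, but it shares this meaning-blind behaviour.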

### Lemmatization

In [14]:
# Download necessary NLTK resources
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\fyzan\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\fyzan\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [15]:
# Initialize the WordNet Lemmatizer
lemmatizer = WordNetLemmatizer()

In [16]:
#Input Text
sentence = "cats mice"

In [17]:
# Tokenize the sentence into words
words = word_tokenize(sentence)
words

['cats', 'mice']

In [18]:
# Lemmatize each word
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
lemmatized_words

['cat', 'mouse']

Observe the difference in how stemming and lemmatization handle words like 'mice'. <br>
Stemming 'mice' returned the word unchanged. <br>
Lemmatizing 'mice' returned its dictionary form, 'mouse'. 

# Splitting Train and Test Data

Splitting the dataset into training and test subsets does not by itself improve the model; it makes the evaluation trustworthy by measuring performance on data the model has not seen. A model that is evaluated on the same data it was trained on can simply memorize that data, which hides how poorly it may generalize. A separate test set shows how well the model performs on unseen examples.


Let's experiment with different train-test splits and train a Naive Bayes model on a small sample dataset. Naive Bayes is a probabilistic machine learning algorithm used primarily for classification tasks.
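
To show what "probabilistic" means here, the following sketch computes multinomial Naive Bayes scores by hand with add-one (Laplace) smoothing. It illustrates the idea on four made-up training sentences and is not scikit-learn's implementation; `fit` and `predict` are hypothetical helper names.

```python
import math
from collections import Counter

# Made-up training data for illustration (1: Positive, 0: Negative)
train = [("love this product", 1), ("highly recommended", 1),
         ("waste of money", 0), ("hate this product", 0)]

def fit(samples, alpha=1.0):
    vocab = sorted({w for text, _ in samples for w in text.split()})
    class_docs = Counter(label for _, label in samples)
    word_counts = {c: Counter() for c in class_docs}
    for text, label in samples:
        word_counts[label].update(text.split())
    # Log prior: fraction of training documents in each class
    log_prior = {c: math.log(n / len(samples)) for c, n in class_docs.items()}
    # Log likelihood of each word given the class, smoothed by alpha
    log_like = {}
    for c in class_docs:
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_like[c] = {w: math.log((word_counts[c][w] + alpha) / total) for w in vocab}
    return log_prior, log_like

def predict(text, log_prior, log_like):
    # Score = log prior + sum of log likelihoods; unknown words are skipped
    scores = {c: log_prior[c] + sum(log_like[c][w] for w in text.split() if w in log_like[c])
              for c in log_prior}
    return max(scores, key=scores.get)

log_prior, log_like = fit(train)
print(predict("love this", log_prior, log_like))    # 1
print(predict("waste money", log_prior, log_like))  # 0
```

The class with the highest score wins. Smoothing keeps a word that never appeared in a class from driving that class's probability to zero.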

In [19]:
# Sample text data and corresponding labels (0: Negative, 1: Positive)
texts = [
    "I love this product!", "This is the best thing ever!", 
    "Absolutely amazing experience.", "Not worth the price.", 
    "Terrible customer service.", "I hate this so much.", 
    "Would buy again.", "Highly recommended!", 
    "Waste of money.", "Awful quality!"
]
labels = [1, 1, 1, 0, 0, 0, 1, 1, 0, 0]

In [20]:
# Convert text data to numerical features using CountVectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
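
To see what a bag-of-words representation looks like, the sketch below builds one by hand for two of the sample sentences. It imitates CountVectorizer's default behaviour (lowercasing, and keeping only tokens of two or more word characters) using the standard library; it is an approximation for illustration, not the library's code.

```python
import re
from collections import Counter

docs = ["I love this product!", "I hate this so much."]

# Tokens of two or more word characters, so the single-letter "I" is dropped
token_pattern = re.compile(r"\b\w\w+\b")
tokenized = [token_pattern.findall(doc.lower()) for doc in docs]

# The vocabulary is the sorted set of all tokens; each column is one term
vocab = sorted({token for tokens in tokenized for token in tokens})

# Each row holds the count of every vocabulary term in one document
counts = [Counter(tokens) for tokens in tokenized]
matrix = [[c[term] for term in vocab] for c in counts]

print(vocab)   # ['hate', 'love', 'much', 'product', 'so', 'this']
print(matrix)  # [[0, 1, 0, 1, 0, 1], [1, 0, 1, 0, 1, 1]]
```

Word order is discarded: each sentence becomes a row of counts, which is the numerical form the classifier consumes.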

Let's consider the following train-test ratios:
- 90:10
- 80:20
- 70:30
- 60:40

In [21]:
# Experiment with different train-test splits
splits = [0.9, 0.8, 0.7, 0.6]

In [22]:
for split in splits:
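    # Split the dataset. Note: 1 - split is computed in floating point
    # (1 - 0.7 == 0.30000000000000004), so the 70-30 split holds out 4 samples, not 3
    X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=1-split, random_state=42)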
    
    # Train a Naive Bayes model
    model = MultinomialNB()
    model.fit(X_train, y_train)
    
    # Test the model
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    
    print(f"Train-Test Split: {split*100:.0f}-{(1-split)*100:.0f}, Test Accuracy: {accuracy:.2f}")


Train-Test Split: 90-10, Test Accuracy: 1.00
Train-Test Split: 80-20, Test Accuracy: 0.00
Train-Test Split: 70-30, Test Accuracy: 0.25
Train-Test Split: 60-40, Test Accuracy: 0.25


Different train-test splits give different accuracy figures. <br>
For this sample data, the 90:10 split attains 100% accuracy, but its test set contains a single sentence, so that figure reflects one correct prediction. <br>
The other splits score between 0% and 25% on test sets of two to four sentences. With only ten examples, these numbers are too noisy to rank the split ratios.<br>
The same split ratios may also behave differently on a different dataset.<br>
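
The accuracy of 0.25 reported for the 70-30 split is only possible with four test samples, not three. The sketch below reproduces the arithmetic, assuming a fractional `test_size` is rounded up to a whole number of samples. The `round(..., 2)` call is a suggested workaround, not part of the original notebook.

```python
import math

n_samples = 10  # size of the sample dataset
test_counts = {}

for split in [0.9, 0.8, 0.7, 0.6]:
    raw = 1 - split                # what the loop passes as test_size
    rounded = round(1 - split, 2)  # hypothetical fix: round away the float error
    # Assumption: the fractional size is rounded up to whole samples
    n_raw = math.ceil(raw * n_samples)
    n_rounded = math.ceil(rounded * n_samples)
    test_counts[split] = (n_raw, n_rounded)
    print(f"{split:.1f}: test_size={raw!r} -> {n_raw} test samples, rounded -> {n_rounded}")
```

For 0.7 the raw value is 0.30000000000000004, which rounds up to 4 test samples, while the rounded value gives the intended 3. The other three ratios are unaffected.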

Let's test the model with a new text sample. Note that `model` here is the one trained in the last loop iteration (the 60:40 split).

In [23]:
# Test the trained model with a single sentence
test_sentence = "This product is not worth the money!"
#Uncomment the following line to test the model with a different input.
#test_sentence = "This product awesome!"

# Transform the test sentence using the trained vectorizer
test_features = vectorizer.transform([test_sentence])

# Predict the sentiment
predicted_label = model.predict(test_features)[0]

# Map label to sentiment
label_map = {0: "Negative", 1: "Positive"}
predicted_sentiment = label_map[predicted_label]

# Display the prediction
print(f"Test Sentence: {test_sentence}\nPredicted Sentiment: {predicted_sentiment}")

Test Sentence: This product is not worth the money!
Predicted Sentiment: Negative
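
In the workflow above the vectorizer is fitted on all ten sentences before the split, so the vocabulary also reflects the test sentences. A possible variant, sketched here under the assumption that scikit-learn is installed, splits the raw texts first, keeps the class balance with `stratify`, and fits the vectorizer on the training texts only. The 80:20 ratio is chosen purely for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

texts = [
    "I love this product!", "This is the best thing ever!",
    "Absolutely amazing experience.", "Not worth the price.",
    "Terrible customer service.", "I hate this so much.",
    "Would buy again.", "Highly recommended!",
    "Waste of money.", "Awful quality!"
]
labels = [1, 1, 1, 0, 0, 0, 1, 1, 0, 0]

# Split the raw texts first; stratify keeps both classes in each subset
train_texts, test_texts, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

# Learn the vocabulary from the training texts only, then reuse it for the test texts
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

model = MultinomialNB()
model.fit(X_train, y_train)

print(len(train_texts), len(test_texts), sorted(y_test))
print("Test accuracy:", model.score(X_test, y_test))
```

With only two test sentences the accuracy is still a rough indication. The point of the sketch is the order of operations, not the score.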


# Observations:
- This notebook demonstrates essential tasks of text preprocessing such as tokenization, stopword removal, stemming, and lemmatization using NLTK.
- Text data is vectorized using CountVectorizer, transforming text into a bag-of-words representation suitable for modeling.
- A Naive Bayes classifier (MultinomialNB) is trained and evaluated for sentiment classification using sample labeled data.