# Lab - NLP Pipeline

## Lab Summary:
In this lab, we will build and compare simple NLP pipelines for text classification.

## Lab Goal:
Upon completion of this lab, the student should be able to:
<ul>
    <li> Design a simple NLP pipeline for a problem statement </li>
    
</ul>

## Import Packages and Classes (Initial)
We will be using the following libraries:
<ol>
    <li> NLTK </li>
    <li> Pandas </li>
    <li> Matplotlib </li>
    <li> Gensim </li>
    <li> scikit-learn </li>
</ol>



In [5]:
! pip install nltk pandas matplotlib gensim scikit-learn

Collecting scikit-learn
  Downloading scikit_learn-1.7.2-cp311-cp311-macosx_12_0_arm64.whl.metadata (11 kB)
Collecting threadpoolctl>=3.1.0 (from scikit-learn)
  Using cached threadpoolctl-3.6.0-py3-none-any.whl.metadata (13 kB)
Downloading scikit_learn-1.7.2-cp311-cp311-macosx_12_0_arm64.whl (8.6 MB)
Using cached threadpoolctl-3.6.0-py3-none-any.whl (18 kB)
Installing collected packages: threadpoolctl, scikit-learn
Successfully installed scikit-learn-1.7.2 threadpoolctl-3.6.0


## Dataset

The twenty newsgroup dataset from the sklearn library is used for this lab.


![Image](https://research.cs.aalto.fi/pml/software/ne/20newsgroups_wtsne.png)

The dataset contains over 18,000 messages with assigned topic labels and is split into training and test sets. The training set contains around 11,300 labelled messages and the test set contains around 7,500. The test labels are available too, so we hold them back while predicting and then use them to score the predictions. The task is to predict each message's topic label.

The categories include:

 -  'alt.atheism'
 -  'comp.graphics'
 -  'comp.os.ms-windows.misc'
 -  'comp.sys.ibm.pc.hardware'
 -  'comp.sys.mac.hardware'
 -  'comp.windows.x'
 -  'misc.forsale'
 -  'rec.autos'
 -  'rec.motorcycles'
 -  'rec.sport.baseball'
 -  'rec.sport.hockey'
 -  'sci.crypt'
 -  'sci.electronics'
 -  'sci.med'
 -  'sci.space'
 -  'soc.religion.christian'
 -  'talk.politics.guns'
 -  'talk.politics.mideast'
 -  'talk.politics.misc'
 -  'talk.religion.misc'

## Load the dataset "fetch_20newsgroups" and inspect it.

In [None]:
from sklearn.datasets import fetch_20newsgroups
twenty_train = fetch_20newsgroups(subset='train', shuffle=True)

See all the categories:

In [7]:
twenty_train.target_names #prints all the categories

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

In [8]:
# Print the first three lines of the first training document.
print("\n".join(twenty_train.data[0].split("\n")[:3]))

From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu


### Apply <b>Bag of Words</b>.

We learned about the Bag of Words model in a previous discussion.

<b><code>CountVectorizer</code></b> calculates the BoW model automatically with a few lines of code.

What <code>CountVectorizer</code> does:

<ul>
    <li>Converts a collection of text documents into a matrix of token counts.</li>
    <li>Tokenizes the text (splits it into words).</li>
    <li>Builds a vocabulary of known words.</li>
    <li>Encodes each document as a vector where each element counts how many times a word appears in that document.</li>
</ul>
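
To make these steps concrete, here is a minimal hand-written sketch of a Bag of Words on two toy sentences. It is not scikit-learn's implementation: the `tokenize` helper and the toy `docs` are assumptions made for illustration, and the token rule (lowercase, words of two or more characters) only roughly mirrors `CountVectorizer`'s defaults.

```python
import re
from collections import Counter

def tokenize(text):
    # Assumption: lowercase the text and keep words of 2+ characters,
    # which roughly mirrors CountVectorizer's default behaviour.
    return re.findall(r"\b\w\w+\b", text.lower())

# Toy corpus (hypothetical, for illustration only).
docs = ["The car is fast", "The car is a red car"]

# Build a sorted vocabulary of every known word.
vocabulary = sorted({word for doc in docs for word in tokenize(doc)})

# Encode each document as a row of counts, one column per vocabulary word.
matrix = []
for doc in docs:
    counts = Counter(tokenize(doc))
    matrix.append([counts[word] for word in vocabulary])

print(vocabulary)  # ['car', 'fast', 'is', 'red', 'the']
print(matrix)      # [[1, 1, 1, 0, 1], [2, 0, 1, 1, 1]]
```

Each row is one document and each column is one vocabulary word, which is the same layout as the matrix that `CountVectorizer` produces (there it is stored as a sparse matrix).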

Reference: https://towardsdatascience.com/what-and-why-behind-fit-transform-vs-transform-in-scikit-learn-78f915cf96fe

In [9]:
# Create a CountVectorizer object.
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()

<b><code>fit_transform</code></b> does two things:
1. <i>fit</i>: Learns the vocabulary from twenty_train.data (which is a list of text documents).
2. <i>transform</i>: Transforms the documents into a sparse matrix where rows correspond to documents and columns to words from the vocabulary. Each value in the matrix is the count of a word in a document.
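
The difference between the two steps matters most when new data arrives. The sketch below uses a hypothetical `TinyCountVectorizer` class (not part of scikit-learn) to show the idea: the vocabulary is learned once in `fit`, and `transform` reuses it, so words that were never seen during fitting are simply ignored.

```python
import re

class TinyCountVectorizer:
    """Hypothetical stand-in used only to illustrate fit vs. transform."""

    def _tokens(self, text):
        return re.findall(r"\b\w\w+\b", text.lower())

    def fit(self, docs):
        # Learn the vocabulary: map each known word to a column index.
        words = sorted({w for d in docs for w in self._tokens(d)})
        self.vocabulary_ = {w: i for i, w in enumerate(words)}
        return self

    def transform(self, docs):
        # Count only words already in the learned vocabulary.
        rows = []
        for d in docs:
            row = [0] * len(self.vocabulary_)
            for w in self._tokens(d):
                if w in self.vocabulary_:
                    row[self.vocabulary_[w]] += 1
            rows.append(row)
        return rows

    def fit_transform(self, docs):
        return self.fit(docs).transform(docs)

vec = TinyCountVectorizer()
train_rows = vec.fit_transform(["space shuttle launch", "car engine"])
test_rows = vec.transform(["rocket launch launch"])  # "rocket" is unseen

print(train_rows)  # [[0, 0, 1, 1, 1], [1, 1, 0, 0, 0]]
print(test_rows)   # [[0, 0, 2, 0, 0]]
```

This is why the lab later calls `fit_transform` on the training data but only `transform` on the test data: the test matrix must have the same columns as the training matrix.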

We'll read the .shape attribute of the X_train_counts object to see the dimensions of the matrix (documents, vocabulary words).

In [10]:
# Fit CountVectorizer to the training data, store the result in X_train_counts, and show its shape.
X_train_counts = count_vect.fit_transform(twenty_train.data)
X_train_counts.shape

(11314, 130107)

### Apply <b>TF-IDF</b>.

We also learned the long way of calculating TF-IDF.  

This can also be done with a few lines of code, using <code>TfidfTransformer</code> from <code>sklearn.feature_extraction.text</code>.
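
As a reminder of what the transformer computes, here is a small hand-written sketch that turns a count matrix into TF-IDF weights. It assumes the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization of each row, which is our understanding of `TfidfTransformer`'s default settings. The `tfidf` function and the toy counts are illustrative assumptions, not scikit-learn code.

```python
import math

def tfidf(counts):
    """Convert rows of raw counts into L2-normalized TF-IDF weights."""
    n_docs = len(counts)
    n_terms = len(counts[0])
    # Document frequency: in how many documents does each term appear?
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    # Smoothed inverse document frequency (assumed default formula).
    idf = [math.log((1 + n_docs) / (1 + df[j])) + 1 for j in range(n_terms)]
    result = []
    for row in counts:
        weighted = [row[j] * idf[j] for j in range(n_terms)]
        norm = math.sqrt(sum(v * v for v in weighted))
        result.append([v / norm if norm else 0.0 for v in weighted])
    return result

# Toy counts; columns stand for: car, fast, is, red, the
counts = [[1, 1, 1, 0, 1],
          [2, 0, 1, 1, 1]]
weights = tfidf(counts)

for row in weights:
    print([round(v, 3) for v in row])
```

In the first document, "fast" and "car" both occur once, but "fast" ends up with the larger weight because it appears in only one of the two documents.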

In [11]:
# Create a TfidfTransformer object.
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()

# Fit the data to TfidfTransformer and store it in a variable X_train_tfidf, and print the shape 
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

(11314, 130107)

### <b>Naive Bayes</b> classifier, also available in the sklearn library.

#### What is Naive Bayes?

Naive Bayes classifies text into categories by:
1. Learning how often each word appears in each category from your training data.
2. Calculating the probabilities that a new document belongs to each category based on the words it contains.
3. Selecting the category with the highest probability.

It is called "naive" because it assumes that every word in a document is independent of the others, given the category. This assumption rarely holds for real text, but the classifier works well in practice.

##### Why Naive Bayes?

<ul>
    <li><b>Fast</b>: Trains quickly even on large datasets.</li>
    <li><b>Simple</b>: Easy to understand and implement.</li>
    <li><b>Surprisingly good</b>: Works well on many text classification problems.</li>
</ul>

##### Common Use Cases for Naive Bayes in NLP:
<ul>
    <li>Spam detection</li>
    <li>Sentiment analysis</li>
    <li>News categorization</li>
    <li>Language identification</li>
</ul>

Silly video describing Naive Bayes in more detail: 
https://www.youtube.com/watch?v=O2L2Uv9pdDA


In [12]:
# Import the Multinomial Naive Bayes class and fit a classifier on the TF-IDF features.
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)
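
To see roughly what happens inside a Naive Bayes classifier, the sketch below implements the three steps described earlier by hand on a toy spam/ham example. It is a simplified illustration, not `MultinomialNB` itself: the function names, the toy messages, and the choice of add-one (Laplace) smoothing with `alpha=1.0` are assumptions.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    # Step 1: learn how often each word appears in each category.
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    vocab = set()
    for doc, label in zip(docs, labels):
        words = doc.lower().split()
        word_counts[label].update(words)
        vocab.update(words)
    return word_counts, class_counts, vocab, alpha

def predict_nb(model, doc):
    word_counts, class_counts, vocab, alpha = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        # Step 2: log prior plus the sum of per-word log likelihoods.
        # Adding the logs treats the words as independent (the "naive" part).
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in doc.lower().split():
            if word not in vocab:
                continue  # ignore words never seen in training
            p = (word_counts[label][word] + alpha) / (total_words + alpha * len(vocab))
            score += math.log(p)
        scores[label] = score
    # Step 3: select the category with the highest score.
    return max(scores, key=scores.get)

# Toy training data (hypothetical).
docs = ["win cash prize now", "cheap cash offer",
        "meeting agenda for monday", "project meeting notes"]
labels = ["spam", "spam", "ham", "ham"]
model = train_nb(docs, labels)

print(predict_nb(model, "cash prize offer"))  # spam
print(predict_nb(model, "monday meeting"))    # ham
```

Working with log probabilities avoids multiplying many tiny numbers together, which would otherwise underflow on long documents.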

Let's make a prediction on the Test set.

In [13]:
# Import some measurement libraries:
from sklearn.metrics import accuracy_score, classification_report

# Load some Test data.
twenty_test = fetch_20newsgroups(subset='test', shuffle=True)
X_test_counts = count_vect.transform(twenty_test.data)
X_test_tfidf = tfidf_transformer.transform(X_test_counts)

# Make predictions
predicted = clf.predict(X_test_tfidf)
print(f"Accuracy: {accuracy_score(twenty_test.target, predicted):.2f}")
print(classification_report(twenty_test.target, predicted, target_names=twenty_test.target_names))


Accuracy: 0.77
                          precision    recall  f1-score   support

             alt.atheism       0.80      0.52      0.63       319
           comp.graphics       0.81      0.65      0.72       389
 comp.os.ms-windows.misc       0.82      0.65      0.73       394
comp.sys.ibm.pc.hardware       0.67      0.78      0.72       392
   comp.sys.mac.hardware       0.86      0.77      0.81       385
          comp.windows.x       0.89      0.75      0.82       395
            misc.forsale       0.93      0.69      0.80       390
               rec.autos       0.85      0.92      0.88       396
         rec.motorcycles       0.94      0.93      0.93       398
      rec.sport.baseball       0.92      0.90      0.91       397
        rec.sport.hockey       0.89      0.97      0.93       399
               sci.crypt       0.59      0.97      0.74       396
         sci.electronics       0.84      0.60      0.70       393
                 sci.med       0.92      0.74      0.82     

#### Look back: What did we just do?

The table above shows the performance of our Naive Bayes classifier, which was trained on the training data and then used to classify the test data.

Accuracy is the number of correct classifications divided by the total number of classifications.

<b>Accuracy = 0.77</b> tells us that the classifier assigned 77% of the test messages to their correct news category.
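
The report columns can be reproduced by hand on a tiny example. The sketch below is illustrative only: the helper functions and the five toy labels are assumptions, and in the lab itself these numbers come from `accuracy_score` and `classification_report`.

```python
def accuracy(y_true, y_pred):
    # Fraction of predictions that match the true label.
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def precision_recall(y_true, y_pred, label):
    # Precision: of everything predicted as `label`, how much was right?
    # Recall: of everything that truly is `label`, how much did we find?
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    predicted = sum(1 for p in y_pred if p == label)
    actual = sum(1 for t in y_true if t == label)
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall

def f1(precision, recall):
    # Harmonic mean of precision and recall.
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

# Toy labels (hypothetical).
y_true = ["space", "space", "autos", "autos", "med"]
y_pred = ["space", "autos", "autos", "autos", "med"]

print(accuracy(y_true, y_pred))                   # 0.8
print(precision_recall(y_true, y_pred, "space"))  # (1.0, 0.5)
```

The support column in the report is simply the number of true examples of each category in the test data.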

## Our First Pipeline

The ML process traditionally involves a common set of steps, like the ones above.

<b>Pipelines</b> streamline those steps and help you build models more quickly.

For example, an ML workflow often starts with preprocessing, followed by some transformation, followed by the machine learning algorithm.

The <code>Pipeline</code> class is available in the sklearn library. With a <code>Pipeline</code>, you can chain all the key steps into a single object. In this example, we chain these three steps:

- ('vect', CountVectorizer())
- ('tfidf', TfidfTransformer())
- ('clf', MultinomialNB())
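
Conceptually, a pipeline passes the data through each transformer in order and hands the result to the final estimator. The sketch below shows that idea with a hypothetical `TinyPipeline` and three toy steps. All class names here are assumptions for illustration; the real `sklearn.pipeline.Pipeline` does considerably more (parameter handling, caching, and so on).

```python
from collections import Counter, defaultdict

class TinyPipeline:
    """Hypothetical, simplified stand-in for sklearn.pipeline.Pipeline."""

    def __init__(self, steps):
        self.steps = steps  # list of (name, object) pairs

    def fit(self, X, y):
        # Fit and apply every transformer, then fit the final estimator.
        for _, transformer in self.steps[:-1]:
            X = transformer.fit_transform(X)
        self.steps[-1][1].fit(X, y)
        return self

    def predict(self, X):
        # Apply the already-fitted transformers, then predict.
        for _, transformer in self.steps[:-1]:
            X = transformer.transform(X)
        return self.steps[-1][1].predict(X)

class Lowercase:
    def fit_transform(self, X):
        return self.transform(X)
    def transform(self, X):
        return [text.lower() for text in X]

class ContainsWord:
    def __init__(self, word):
        self.word = word
    def fit_transform(self, X):
        return self.transform(X)
    def transform(self, X):
        return [self.word in text.split() for text in X]

class LookupClassifier:
    def fit(self, X, y):
        # Remember the most common label for each feature value.
        votes = defaultdict(Counter)
        for feature, label in zip(X, y):
            votes[feature][label] += 1
        self.table_ = {f: c.most_common(1)[0][0] for f, c in votes.items()}
        self.default_ = Counter(y).most_common(1)[0][0]
        return self
    def predict(self, X):
        return [self.table_.get(f, self.default_) for f in X]

pipe = TinyPipeline([
    ("lower", Lowercase()),
    ("has_nasa", ContainsWord("nasa")),
    ("clf", LookupClassifier()),
])
pipe.fit(["NASA launches probe", "Used car for sale", "NASA budget grows"],
         ["sci.space", "misc.forsale", "sci.space"])

print(pipe.predict(["nasa plans mission", "bike for sale"]))
```

The benefit is the same as with the real class: one `fit` call and one `predict` call replace the manual sequence of vectorizing, transforming, and classifying.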

In [14]:
# Import the Pipeline class and create a Pipeline object.
from sklearn.pipeline import Pipeline

text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB())
])

In [15]:
# Use the pipeline object. Fit the training data against the target labels.
text_clf = text_clf.fit(twenty_train.data, twenty_train.target)

In [16]:
# See how well the prediction fits the test data.
import numpy as np

# Recreate the test dataset.
twenty_test = fetch_20newsgroups(subset='test', shuffle=True)

# Calculate the prediction on the Test data.
predicted = text_clf.predict(twenty_test.data)

# Print the performance measures:
print(f"Accuracy: {accuracy_score(twenty_test.target, predicted):.2f}")
print(classification_report(twenty_test.target, predicted, target_names=twenty_test.target_names))

# Another way to calculate the accuracy:
np.mean(predicted == twenty_test.target)

Accuracy: 0.77
                          precision    recall  f1-score   support

             alt.atheism       0.80      0.52      0.63       319
           comp.graphics       0.81      0.65      0.72       389
 comp.os.ms-windows.misc       0.82      0.65      0.73       394
comp.sys.ibm.pc.hardware       0.67      0.78      0.72       392
   comp.sys.mac.hardware       0.86      0.77      0.81       385
          comp.windows.x       0.89      0.75      0.82       395
            misc.forsale       0.93      0.69      0.80       390
               rec.autos       0.85      0.92      0.88       396
         rec.motorcycles       0.94      0.93      0.93       398
      rec.sport.baseball       0.92      0.90      0.91       397
        rec.sport.hockey       0.89      0.97      0.93       399
               sci.crypt       0.59      0.97      0.74       396
         sci.electronics       0.84      0.60      0.70       393
                 sci.med       0.92      0.74      0.82     

0.7738980350504514

### Classify a new headline

Classify new headlines with the method used here.

In [17]:
# Store the new headline into a variable.
headline = ["Celebrity faces charges over gambling"]

# Predict the category using text_clf.predict
predicted_category_index = text_clf.predict(headline)[0]

# Retrieve and print the human-readable category name
predicted_category_name = twenty_train.target_names[predicted_category_index]
print(f"Predicted category: {predicted_category_name}")

Predicted category: talk.politics.misc


# Practice

Classify a new article headline using the model we created with the pipeline.

Using the following headline, predict the category it belongs to.

#### Headline: <i>NASA is planning a new mission to Jupiter.</i>

In [18]:
# Your Code Here
# Store the new headline into a variable.
headline = ["NASA is planning a new mission to Jupiter."]

# Predict the category using text_clf.predict
predicted_category_index = text_clf.predict(headline)[0]

# Retrieve and print the human-readable category name
predicted_category_name = twenty_train.target_names[predicted_category_index]
print(f"Predicted category: {predicted_category_name}")

Predicted category: sci.space


# Support Vector Machines (SVM)

A Support Vector Machine (SVM) is a different kind of supervised machine learning model. 

SVM identifies the boundary that best separates two categories: the one that leaves the widest margin between them. The training points closest to that boundary are called support vectors.

SVM then classifies observations according to which side of the boundary they fall on.
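
For a linear boundary, "location relative to the boundary" reduces to the sign of a simple formula. The sketch below assumes a hypothetical, already-known boundary in two dimensions (the line x1 - x2 = 0). The weights, bias, and function names are made up for illustration and are not learned from data.

```python
import math

def decision_value(weights, bias, x):
    # Signed score: positive on one side of the boundary, negative on the other.
    return sum(w * xi for w, xi in zip(weights, x)) + bias

def classify(weights, bias, x):
    return 1 if decision_value(weights, bias, x) >= 0 else -1

def distance_to_boundary(weights, bias, x):
    # Geometric distance from the point to the boundary.
    norm = math.sqrt(sum(w * w for w in weights))
    return abs(decision_value(weights, bias, x)) / norm

# Hypothetical boundary: x1 - x2 = 0
weights, bias = [1.0, -1.0], 0.0

print(classify(weights, bias, [3.0, 1.0]))  # 1
print(classify(weights, bias, [1.0, 3.0]))  # -1
print(round(distance_to_boundary(weights, bias, [3.0, 1.0]), 3))  # 1.414
```

Training an SVM is the search for the weights and bias that keep the training points of both categories as far from the boundary as possible.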

![Image](https://miro.medium.com/max/921/1*06GSco3ItM3gwW2scY6Tmg.png)

More about SVM: https://scikit-learn.org/stable/modules/svm.html

### SVM Pipeline Objects:

<code>CountVectorizer()</code>

<code>TfidfTransformer()</code>

<code>SGDClassifier()</code>

### SGDClassifier Hyperparameters:

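We'll use the following hyperparameters for our example. With <code>loss='hinge'</code>, <code>SGDClassifier</code> trains a linear SVM using stochastic gradient descent.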

- ('clf-svm', SGDClassifier(loss='hinge', penalty='l2',alpha=1e-3, max_iter=5, random_state=42))
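
To give a feel for what the `loss`, `penalty`, and `alpha` settings mean, here is a simplified sketch of stochastic gradient descent on the hinge loss with an L2 penalty, for two classes labelled +1 and -1. It is not how `SGDClassifier` is implemented: the constant `learning_rate`, the toy data, and the function names are assumptions, and scikit-learn uses its own learning-rate schedule and handles the twenty categories with a one-versus-rest scheme.

```python
import random

def train_hinge_sgd(X, y, alpha=1e-3, learning_rate=0.1, epochs=20, seed=42):
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    b = 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the samples in random order
        for i in order:
            margin = y[i] * (sum(wj * xj for wj, xj in zip(w, X[i])) + b)
            # L2 penalty: shrink the weights a little on every step.
            w = [wj * (1 - learning_rate * alpha) for wj in w]
            # Hinge loss: only samples inside the margin trigger an update.
            if margin < 1:
                w = [wj + learning_rate * y[i] * xj for wj, xj in zip(w, X[i])]
                b += learning_rate * y[i]
    return w, b

def predict_side(w, b, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

# Toy, linearly separable data (hypothetical).
X = [[2.0, 2.0], [3.0, 3.0], [2.5, 3.0],
     [-2.0, -2.0], [-3.0, -3.0], [-2.5, -3.0]]
y = [1, 1, 1, -1, -1, -1]

w, b = train_hinge_sgd(X, y)
print([predict_side(w, b, x) for x in X])
```

A larger `alpha` shrinks the weights more strongly, and `max_iter` in the real classifier caps the number of passes over the training data, much like `epochs` here.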

## Classify Headlines with SVM:

In [20]:
from sklearn.linear_model import SGDClassifier

# Store the pipeline object in a variable 'text_clf_svm':
text_clf_svm = Pipeline([('vect', CountVectorizer()), 
                         ('tfidf', TfidfTransformer()),
                         ('clf-svm', SGDClassifier(loss='hinge', penalty='l2',alpha=1e-3, max_iter=5, random_state=42))
                        ])

In [21]:
# Call the fit function on the twenty_train data that we created previously.
text_clf_svm = text_clf_svm.fit(twenty_train.data, twenty_train.target)

# Predict labels for twenty_test data.
predicted_svm = text_clf_svm.predict(twenty_test.data)

# Print the performance measures:
print(f"Accuracy: {accuracy_score(twenty_test.target, predicted_svm):.2f}")
print(classification_report(twenty_test.target, predicted_svm, target_names=twenty_test.target_names))



Accuracy: 0.82
                          precision    recall  f1-score   support

             alt.atheism       0.73      0.71      0.72       319
           comp.graphics       0.78      0.72      0.75       389
 comp.os.ms-windows.misc       0.73      0.78      0.75       394
comp.sys.ibm.pc.hardware       0.74      0.67      0.70       392
   comp.sys.mac.hardware       0.81      0.83      0.82       385
          comp.windows.x       0.84      0.76      0.80       395
            misc.forsale       0.84      0.90      0.87       390
               rec.autos       0.91      0.90      0.90       396
         rec.motorcycles       0.93      0.96      0.95       398
      rec.sport.baseball       0.88      0.90      0.89       397
        rec.sport.hockey       0.88      0.99      0.93       399
               sci.crypt       0.84      0.96      0.90       396
         sci.electronics       0.83      0.62      0.71       393
                 sci.med       0.87      0.86      0.87     

## Let's predict our old headline using SVM.

In [22]:
# Store the new headline into a variable.
headline = ["Celebrity faces charges over gambling"]

# Predict the category using text_clf_svm.predict
predicted_category_index = text_clf_svm.predict(headline)[0]

# Retrieve and print the human-readable category name
predicted_category_name = twenty_train.target_names[predicted_category_index]
print(f"Predicted category: {predicted_category_name}")

Predicted category: misc.forsale


## Discussion: What happened?

The headline category from SVM is different from Naive Bayes. What are some reasons why this might be?

Which one do you think is a better classification?

# Decision Trees

Decision trees are popular for classification and prediction. 

A Decision tree is a flowchart-like tree structure, where:

1. Each internal node denotes a test on an attribute
2. Each branch represents an outcome of the test
3. Each leaf node (terminal node) holds a class label.
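
The structure above can be sketched as a nested dictionary that we walk from the root to a leaf. The tree below is hand-made and hypothetical: its tests and labels are assumptions chosen to match the lab's categories, whereas a real `DecisionTreeClassifier` learns its tests from the training data.

```python
# Internal nodes hold a test ("does the text contain this word?");
# branches are the outcomes ("yes"/"no"); leaves hold a class label.
tree = {
    "test": "nasa",
    "yes": {"label": "sci.space"},
    "no": {
        "test": "sale",
        "yes": {"label": "misc.forsale"},
        "no": {"label": "talk.politics.misc"},
    },
}

def predict(node, words):
    # Follow one branch per test until a leaf is reached.
    while "label" not in node:
        branch = "yes" if node["test"] in words else "no"
        node = node[branch]
    return node["label"]

headline_words = set("nasa is planning a new mission".split())
print(predict(tree, headline_words))  # sci.space
```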

Reference: https://www.geeksforgeeks.org/decision-tree/

![Image](https://media.geeksforgeeks.org/wp-content/cdn-uploads/Decision_Tree-2.png)

### Decision Tree Pipeline

Our convenient Pipeline can do the same work we have done before with just a quick change to our code.

In [23]:
# Create the Decision Tree pipeline, starting with vectorizing and transformation.
from sklearn.tree import DecisionTreeClassifier

# Pipeline: CountVectorizer, TfidfTransformer, DecisionTreeClassifier
text_clf_decisiontree = Pipeline([
    ('vect', CountVectorizer()), 
    ('tfidf', TfidfTransformer()),
    ('clf', DecisionTreeClassifier())
])

In [24]:
# Predict test data using the decision tree classifier:
text_clf_decisiontree = text_clf_decisiontree.fit(twenty_train.data, twenty_train.target)
predicted_decisiontree = text_clf_decisiontree.predict(twenty_test.data)

# Print the performance measures:
print(f"Accuracy: {accuracy_score(twenty_test.target, predicted_decisiontree):.2f}")
print(classification_report(twenty_test.target, predicted_decisiontree, target_names=twenty_test.target_names))

Accuracy: 0.55
                          precision    recall  f1-score   support

             alt.atheism       0.50      0.47      0.49       319
           comp.graphics       0.41      0.43      0.42       389
 comp.os.ms-windows.misc       0.52      0.54      0.53       394
comp.sys.ibm.pc.hardware       0.44      0.42      0.43       392
   comp.sys.mac.hardware       0.53      0.58      0.55       385
          comp.windows.x       0.47      0.45      0.46       395
            misc.forsale       0.62      0.73      0.67       390
               rec.autos       0.62      0.59      0.61       396
         rec.motorcycles       0.74      0.74      0.74       398
      rec.sport.baseball       0.55      0.55      0.55       397
        rec.sport.hockey       0.63      0.67      0.65       399
               sci.crypt       0.75      0.71      0.73       396
         sci.electronics       0.32      0.35      0.33       393
                 sci.med       0.56      0.44      0.49     

## Decision Tree was far less accurate.

SVM and Naive Bayes classified the test messages more accurately: 0.82 and 0.77, compared with 0.55 for the Decision Tree.

Try your hand at classifying a headline using the Decision Tree.

# Practice: Decision Tree Classifier

Classify the headline we used above using the Decision Tree model.

Using the following headline, predict the category it belongs to.

#### Headline: <i>NASA is planning a new mission to Jupiter.</i>

In [25]:
# Your Code Here:
# Store the new headline into a variable.
headline = ["NASA is planning a new mission to Jupiter."]

# Predict the category using text_clf_decisiontree.predict
predicted_category_index = text_clf_decisiontree.predict(headline)[0]

# Retrieve and print the human-readable category name
predicted_category_name = twenty_train.target_names[predicted_category_index]
print(f"Predicted category: {predicted_category_name}")

Predicted category: comp.os.ms-windows.misc


# Random Forest

A Random Forest is a set of decision trees, each trained on a randomly selected subset of the training set. It collects the "votes" from the individual trees to decide the final prediction.
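
The two ingredients, random subsets and voting, can be sketched in a few lines. The helper names and toy data below are assumptions for illustration. Note also that, as far as we know, scikit-learn's `RandomForestClassifier` averages the trees' predicted probabilities rather than counting hard votes, so treat this as the textbook idea rather than the library's exact method.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    # Draw len(rows) items with replacement: some repeat, some are left out.
    return [rng.choice(rows) for _ in rows]

def majority_vote(votes):
    # The label predicted by the most trees wins.
    return Counter(votes).most_common(1)[0][0]

rng = random.Random(0)
training_rows = ["doc_a", "doc_b", "doc_c", "doc_d", "doc_e"]
sample = bootstrap_sample(training_rows, rng)
print(sample)  # each tree would be trained on a different sample like this

# Hypothetical predictions from three trees for one headline.
tree_votes = ["sci.space", "sci.space", "rec.autos"]
print(majority_vote(tree_votes))  # sci.space
```

Because each tree sees a slightly different view of the data, their individual mistakes tend to differ, and the vote smooths them out.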

Reference: https://www.geeksforgeeks.org/random-forest-classifier-using-scikit-learn/

![Image](https://onestopdataanalysis.com/wp-content/uploads/2020/01/2-Most-Use-for-Random-Forest-748x421.png)

# Practice

## Predict the same headline using Random Forest.

This time, you will create the pipeline, perform the training, and then predict the headline's category.

The initial library import is provided for you.

Use the same headline as before:

#### Headline: <i>NASA is planning a new mission to Jupiter.</i>

In [26]:
# Import Random Forest Classifier:
from sklearn.ensemble import RandomForestClassifier

# Create the RF pipeline.
text_clf_randomforest = Pipeline([
    ('vect', CountVectorizer()), 
    ('tfidf', TfidfTransformer()),
    ('clf', RandomForestClassifier())
])


In [27]:
# Train your model and evaluate using the Test data.
text_clf_randomforest = text_clf_randomforest.fit(twenty_train.data, twenty_train.target)
predicted_randomforest = text_clf_randomforest.predict(twenty_test.data)

# Print the performance measures:
print(f"Accuracy: {accuracy_score(twenty_test.target, predicted_randomforest):.2f}")
print(classification_report(twenty_test.target, predicted_randomforest, target_names=twenty_test.target_names))

Accuracy: 0.76
                          precision    recall  f1-score   support

             alt.atheism       0.73      0.64      0.68       319
           comp.graphics       0.54      0.68      0.60       389
 comp.os.ms-windows.misc       0.66      0.78      0.72       394
comp.sys.ibm.pc.hardware       0.65      0.65      0.65       392
   comp.sys.mac.hardware       0.72      0.77      0.74       385
          comp.windows.x       0.75      0.70      0.73       395
            misc.forsale       0.75      0.91      0.83       390
               rec.autos       0.83      0.78      0.80       396
         rec.motorcycles       0.91      0.91      0.91       398
      rec.sport.baseball       0.78      0.89      0.83       397
        rec.sport.hockey       0.90      0.92      0.91       399
               sci.crypt       0.88      0.91      0.90       396
         sci.electronics       0.68      0.47      0.55       393
                 sci.med       0.85      0.67      0.75     

In [28]:
# Store the new headline into a variable.
headline = ["NASA is planning a new mission to Jupiter."]

# Predict the category using text_clf_randomforest.predict
predicted_category_index = text_clf_randomforest.predict(headline)[0]

# Retrieve and print the human-readable category name
predicted_category_name = twenty_train.target_names[predicted_category_index]
print(f"Predicted category: {predicted_category_name}")



Predicted category: sci.space
