## Natural Language Processing for Text Classification with NLTK and Scikit-learn

### Presented by Eduonix!

In the project, Getting Started With Natural Language Processing in Python, we learned the basics of tokenizing, part-of-speech tagging, stemming, chunking, and named entity recognition; furthermore, we dove into machine learning and text classification using a simple support vector classifier and a dataset of positive and negative movie reviews. 

In this tutorial, we will expand on this foundation and explore different ways to improve our text classification results. We will cover and use:

* Regular Expressions
* Feature Engineering
* Multiple scikit-learn Classifiers
* Ensemble Methods

### 1. Import Necessary Libraries

To ensure the necessary libraries are installed correctly and up-to-date, print the version numbers for each library.  This will also improve the reproducibility of our project.

In [36]:
import sys
import nltk
import sklearn
import pandas
import numpy

print('Python: {}'.format(sys.version))
print('NLTK: {}'.format(nltk.__version__))
print('Scikit-learn: {}'.format(sklearn.__version__))
print('Pandas: {}'.format(pandas.__version__))
print('Numpy: {}'.format(numpy.__version__))

Python: 3.9.16 (main, Mar  8 2023, 04:29:44) 
[Clang 14.0.6 ]
NLTK: 3.8.1
Scikit-learn: 1.2.2
Pandas: 2.0.2
Numpy: 1.23.5


### 2. Load the Dataset

Now that we have ensured that our libraries are installed correctly, let's load the data set as a Pandas DataFrame. Furthermore, let's extract some useful information such as the column information and class distributions. 

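The data set we will be using is a collection of 4,528 observing project records hosted in the NRAO_Capstone repository on GitHub. Each record has a project code, a title, an abstract, and an fs_type label (line or continuum), which is the class we will try to predict from the text. The workflow itself mirrors one commonly applied to the SMS Spam Collection from the UCI Machine Learning Repository, which is available at the following URL:

https://archive.ics.uci.edu/ml/datasets/sms+spam+collection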

In [37]:
import pandas as pd
import numpy as np

# load the dataset of project titles and abstracts
df = pd.read_csv('https://raw.githubusercontent.com/rlipps/NRAO_Capstone/main/data/nrao_projects.csv')

In [38]:
# print useful information about the dataset
print(df.info())
print(df.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4528 entries, 0 to 4527
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   project_code      4528 non-null   object
 1   project_title     4528 non-null   object
 2   project_abstract  4527 non-null   object
 3   fs_type           4528 non-null   object
dtypes: object(4)
memory usage: 141.6+ KB
None
     project_code                                      project_title  \
0  2018.1.01205.L  Fifty AU STudy of the chemistry in the disk/en...   
1  2022.1.00316.L  COMPASS: Complex Organic Molecules in Protosta...   
2  2017.1.00161.L  ALCHEMI: the ALMA Comprehensive High-resolutio...   
3  2021.1.01616.L  ALMA JELLY - Survey of Nearby Jellyfish and Ra...   
4  2021.1.00869.L  Bulge symmetry or not? The hidden dynamics of ...   

                                    project_abstract fs_type  
0  The huge variety of planetary systems discover...    line  
1  The emerg

In [39]:
df.head()

| | project_code | project_title | project_abstract | fs_type |
|---|---|---|---|---|
| 0 | 2018.1.01205.L | Fifty AU STudy of the chemistry in the disk/en... | The huge variety of planetary systems discover... | line |
| 1 | 2022.1.00316.L | COMPASS: Complex Organic Molecules in Protosta... | The emergence of complex organic molecules in ... | line |
| 2 | 2017.1.00161.L | ALCHEMI: the ALMA Comprehensive High-resolutio... | A great variety in gas composition is observed... | line |
| 3 | 2021.1.01616.L | ALMA JELLY - Survey of Nearby Jellyfish and Ra... | We propose the first ever statistical survey o... | line |
| 4 | 2021.1.00869.L | Bulge symmetry or not? The hidden dynamics of ... | A radio survey of red giant SiO sources in the... | line |


In [40]:
# check class distribution
df.fs_type.value_counts()

fs_type
line         3628
continuum     900
Name: count, dtype: int64
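The classes are imbalanced, so it helps to know what accuracy a classifier would reach by always predicting the majority class. The following is a minimal sketch of that baseline; the counts are copied by hand from the output above, and the variable names are illustrative rather than part of the notebook.

```python
# Class counts copied from the value_counts() output above.
class_counts = {"line": 3628, "continuum": 900}

total = sum(class_counts.values())

# The majority class is the label with the largest count.
majority_class = max(class_counts, key=class_counts.get)

# Accuracy obtained by predicting the majority class for every record.
baseline_accuracy = class_counts[majority_class] / total

print("Majority class: {}".format(majority_class))
print("Baseline accuracy: {:.1%}".format(baseline_accuracy))
```

Any accuracy reported later in the notebook is easier to interpret when compared against this figure.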

In [41]:
classes = df.fs_type

In [42]:
classes

0       line
1       line
2       line
3       line
4       line
        ... 
4523    line
4524    line
4525    line
4526    line
4527    line
Name: fs_type, Length: 4528, dtype: object

### 2. Preprocess the Data

Preprocessing the data is an essential step in natural language processing. In the following cells, we will convert our class labels to binary values using the LabelEncoder from sklearn, combine each title and abstract into a single text field, use regular expressions to strip punctuation and replace numbers with a placeholder, convert the text to lower case, remove stop words, and reduce words to their stems.

In [43]:
from sklearn.preprocessing import LabelEncoder

# convert class labels to binary values, 0 = continuum and 1 = line
encoder = LabelEncoder()
Y = encoder.fit_transform(classes)

print(Y[:10])

[1 1 1 1 1 1 1 1 1 1]
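LabelEncoder assigns integer codes in the sorted order of the class names, which is why every `line` record above is encoded as 1. A small sketch on toy labels (the `demo_` names are hypothetical, not from the notebook) shows how to inspect the mapping and how to reverse it:

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for the fs_type column.
demo_labels = ["line", "continuum", "line", "line"]

demo_encoder = LabelEncoder()
encoded = demo_encoder.fit_transform(demo_labels)

# classes_ is sorted, so index 0 is 'continuum' and index 1 is 'line'.
mapping = {str(label): code for code, label in enumerate(demo_encoder.classes_)}
print(mapping)
print(encoded)

# inverse_transform recovers the original string labels from the codes.
print(demo_encoder.inverse_transform(encoded))
```

In the notebook itself, the same check would be `encoder.classes_` on the fitted `encoder`.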


In [44]:
df['both'] = df.project_title + ' ' + df.project_abstract
df.head()

| | project_code | project_title | project_abstract | fs_type | both |
|---|---|---|---|---|---|
| 0 | 2018.1.01205.L | Fifty AU STudy of the chemistry in the disk/en... | The huge variety of planetary systems discover... | line | Fifty AU STudy of the chemistry in the disk/en... |
| 1 | 2022.1.00316.L | COMPASS: Complex Organic Molecules in Protosta... | The emergence of complex organic molecules in ... | line | COMPASS: Complex Organic Molecules in Protosta... |
| 2 | 2017.1.00161.L | ALCHEMI: the ALMA Comprehensive High-resolutio... | A great variety in gas composition is observed... | line | ALCHEMI: the ALMA Comprehensive High-resolutio... |
| 3 | 2021.1.01616.L | ALMA JELLY - Survey of Nearby Jellyfish and Ra... | We propose the first ever statistical survey o... | line | ALMA JELLY - Survey of Nearby Jellyfish and Ra... |
| 4 | 2021.1.00869.L | Bulge symmetry or not? The hidden dynamics of ... | A radio survey of red giant SiO sources in the... | line | Bulge symmetry or not? The hidden dynamics of ... |


In [45]:
df.both[1]

'COMPASS: Complex Organic Molecules in Protostars with ALMA Spectral Surveys The emergence of complex organic molecules in the interstellar medium is a fundamental puzzle of astrochemistry. Targeted observations with ALMA have opened the door to high-sensitivity spectral surveys over wide bandwidths to elucidate the chemical complexity of young stars in a systematic manner. We propose a Large Program to perform unbiased line surveys in the 279 to 312 GHz frequency range of 11 nearby Solar-type protostars. The targeted protostars are known hosts of complex organic molecules and sample different natal environments and evolutionary stages. The proposed spectral coverage will allow us to unambiguously identify complex organic molecules and their isotopologues and to accurately derive the abundances for species with abundances down to 0.01% relative to methanol. The concerted effort will provide a deep understanding of the complex organic inventories and isotopic ratios depending on the phy

In [46]:
# store the combined title and abstract text
abstracts = df.both
print(abstracts[:10])

0    Fifty AU STudy of the chemistry in the disk/en...
1    COMPASS: Complex Organic Molecules in Protosta...
2    ALCHEMI: the ALMA Comprehensive High-resolutio...
3    ALMA JELLY - Survey of Nearby Jellyfish and Ra...
4    Bulge symmetry or not? The hidden dynamics of ...
5    The ALMA survey to Resolve exoKuiper belt Subs...
6    UNveiling the Initial Conditions of high-mass ...
7    REBELS: An ALMA Large Program to Discover the ...
8    ACES: The ALMA CMZ Exploration Survey The extr...
9    The COSMOS High-z ALMA-MIRI Population Survey ...
Name: both, dtype: object


#### 2.1 Regular Expressions



In [47]:
# cast every entry to str; the one record with a missing abstract
# (NaN after concatenation) becomes the string 'nan'
processed = abstracts.astype(str)

In [48]:
processed[200]

'Properties of the most distant star-forming GMC in the Milky Way The properties of molecular clouds and star formation are expected to be subject to Galaxy-scale variations. In the Outer Galaxy, the importance of the spiral structure is thought to be diminished, with a metallicity gradient observed with galactocentric radius. Despite the importance of location in determining the outcome(s) of the ISM life-cycle, these effects are poorly understood. This leaves extrapolation from our current understanding to molecular cloud and star formation properties in even nearby galaxies uncertain, let alone to the conditions under which the bulk of stars in the Universe formed. We propose to comprehensively map and characterise the molecular gas and dust in the most distant star forming region found to date within our Galaxy, where the metallicity is expected to be significantly below that in the Inner Galaxy and supernovae may be more important than spiral arms. With these data, we will constra

In [49]:
processed[4525]

'Ultra-high resolution imaging of 3C84 3C84 is a prime target for high angular resolution studies of jet formation, due to its proximity and large SMBH mass, which provides a spatial resolution of 20 Rs. Previous 1.3cm RadioAstron space-VLBI imaging and 3mm GMVA maps show a prominent two rail jet, which is anchored in an E-W oriented compact component, perpendicular to the outer jet. With the proposed EHT+ALMA observation we address the question of the physical nature of this elongated jet base and the "true" location of the jet apex. EHT imaging with ultra-high angular resolution will allow for a number of questions to be answered: the Faraday depth, magnetic field and particle density can be estimated through the rotation measure. The location of the jet base will be precisely pinpointed. The transverse jet width and nozzle (profile) will allow for a discrimination between magnetic and/or pressure confinement in the jet launching region. Finally, the magnetic field topology and orien

In [51]:

import re

def remove_punctuation(text):
    punctuation_pattern = r'[^\w\s]'
    return re.sub(punctuation_pattern, '', text)

# Apply the function to every entry of the processed Series
processed = processed.apply(remove_punctuation)

In [52]:
# use regular expressions to replace numbers
    
def replace_numbers_with_word(text, placeholder='numbr'):
    # Use a pattern that matches digits followed by any combination of letters
    number_pattern = r'\b\d+[A-Za-z]*\b'
    
    return re.sub(number_pattern, placeholder, text)

# Apply the function to every entry of the processed Series
processed = processed.apply(replace_numbers_with_word)
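To see what the number pattern does and does not match, here is a short sketch that applies the same regular expression to three made-up phrases in the style of the abstracts. The sample strings are assumptions for illustration only.

```python
import re

# Same pattern as above: digits, optionally followed by letters, as a whole token.
number_pattern = r'\b\d+[A-Za-z]*\b'

samples = [
    "surveys in the 279 to 312 GHz range",  # plain integers
    "previous 13cm and 3mm maps",           # digits followed by letters
    "imaging of 3C84",                      # digits, letters, then digits again
]

replaced = [re.sub(number_pattern, 'numbr', s) for s in samples]

for before, after in zip(samples, replaced):
    print("{!r} -> {!r}".format(before, after))
```

Tokens such as `3C84` are left untouched, because the trailing digits prevent the closing word boundary from matching. That is consistent with `3c84` surviving in the lower-cased output further down.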

In [53]:
def remove_extra_spaces(text):
    # collapse runs of whitespace into a single space and trim both ends
    return re.sub(r'\s+', ' ', text).strip()

processed = processed.apply(remove_extra_spaces)

In [54]:
# change words to lower case - Hello, HELLO, hello are all the same word
processed = processed.str.lower()
print(processed)

0       fifty au study of the chemistry in the diskenv...
1       compass complex organic molecules in protostar...
2       alchemi the alma comprehensive highresolution ...
3       alma jelly  survey of nearby jellyfish and ram...
4       bulge symmetry or not the hidden dynamics of t...
                              ...                        
4523    a detailed study of the subpc jet of bl lacert...
4524    disorder vs order discerning the nature of the...
4525    ultrahigh resolution imaging of 3c84 3c84 is a...
4526    imaging jet and magnetic field near the spinni...
4527    first subparsecscale imaging of the new tev ga...
Name: both, Length: 4528, dtype: object
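The cleaning steps in this subsection can also be chained into a single helper, which makes it easier to test them on one string before applying them to the whole Series. This is only a sketch; `clean_text` is a hypothetical name and the example sentence is made up.

```python
import re

def clean_text(text, placeholder='numbr'):
    """Apply the cleaning steps from this section to a single string."""
    text = re.sub(r'[^\w\s]', '', text)                     # strip punctuation
    text = re.sub(r'\b\d+[A-Za-z]*\b', placeholder, text)   # replace numbers
    text = re.sub(r'\s+', ' ', text).strip()                # collapse whitespace
    return text.lower()                                     # normalise case

example = "ALMA JELLY - Survey of 11 Nearby   Jellyfish!"
cleaned = clean_text(example)
print(cleaned)
```

Assuming the same `abstracts` Series as above, the whole column could then be processed with `abstracts.astype(str).apply(clean_text)`.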


In [55]:
from nltk.corpus import stopwords

# remove stop words from the abstracts

stop_words = set(stopwords.words('english'))

processed = processed.apply(lambda x: ' '.join(
    term for term in x.split() if term not in stop_words))

In [56]:
# Reduce each word to its stem using a Porter stemmer
ps = nltk.PorterStemmer()

processed = processed.apply(lambda x: ' '.join(
    ps.stem(term) for term in x.split()))

### 3. Generating Features

Feature engineering is the process of using domain knowledge of the data to create features for machine learning algorithms. In this project, the words in each abstract will be our features. For this purpose, it will be necessary to tokenize each abstract into words. We will use 1500 words from the resulting vocabulary as features.

In [57]:
from nltk.tokenize import word_tokenize

# create bag-of-words
all_words = []

for message in processed:
    words = word_tokenize(message)
    for w in words:
        all_words.append(w)
        
all_words = nltk.FreqDist(all_words)

In [58]:
# print the total number of words and the 15 most common words
print('Number of words: {}'.format(len(all_words)))
print('Most common words: {}'.format(all_words.most_common(15)))

Number of words: 13814
Most common words: [('numbr', 21167), ('observ', 9054), ('galaxi', 7878), ('ga', 6711), ('disk', 6040), ('star', 5897), ('propos', 5313), ('alma', 5109), ('format', 5013), ('molecular', 4600), ('mass', 3464), ('line', 3374), ('dust', 3210), ('studi', 3110), ('emiss', 3097)]


In [59]:
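# use the first 1500 distinct words encountered as features
# (FreqDist keys are in insertion order; all_words.most_common(1500)
# would select the 1500 most frequent words instead)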
word_features = list(all_words.keys())[:1500]
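The difference between taking the first keys and taking the most frequent words is easy to miss. NLTK's FreqDist is a subclass of `collections.Counter`, so the sketch below uses a plain Counter on a toy token list (an assumption for illustration) to show both selections side by side.

```python
from collections import Counter

# Toy token stream standing in for the stemmed abstracts.
tokens = ["alma", "disk", "star", "disk", "ga", "star", "disk"]
counts = Counter(tokens)

# keys() follows insertion order: the first distinct words seen.
first_two = list(counts.keys())[:2]

# most_common() follows frequency order: the most frequent words.
top_two = [word for word, _ in counts.most_common(2)]

print("first seen:   ", first_two)
print("most frequent:", top_two)
```

Which selection works better as a feature set is an empirical question; the point here is only that the two are not the same list.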

In [60]:
# The find_features function will determine which of the 1500 word features are contained in an abstract
def find_features(message):
    words = word_tokenize(message)
    features = {}
    for word in word_features:
        features[word] = (word in words)

    return features

# Let's see an example!
features = find_features(processed[0])
for key, value in features.items():
    if value:
        print(key)

fifti
au
studi
chemistri
diskenvelop
system
solarlik
protostar
faust
huge
varieti
planetari
discov
recent
decad
like
depend
earli
histori
format
propos
larg
program
focus
specif
chemic
divers
scale
numbr
planet
expect
form
particular
goal
project
reveal
quantifi
composit
envelopedisk
sampl
class
repres
observ
larger
sourc
spatial
resolut
set
molecul
abl
disentangl
compon
characteris
organ
complex
probe
ioniz
structur
measur
molecular
deuter
output
homogen
databas
thousand
imag
differ
line
speci
ie
unpreced
sourcesurvey
provid
commun
legaci
dataset
mileston
astrochemistri
star
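Because the membership test runs once per feature word for every abstract, looking words up in a list can become slow. A possible variant, sketched here under the assumption that the text is already cleaned so that splitting on whitespace is an acceptable stand-in for word_tokenize, converts the tokens to a set first. The name `find_features_fast` and the toy vocabulary are hypothetical.

```python
def find_features_fast(message, vocabulary):
    """Return {word: bool} marking which vocabulary words occur in message."""
    # A set gives constant-time membership tests instead of scanning a list.
    words = set(message.split())
    return {word: (word in words) for word in vocabulary}

# Toy vocabulary and message standing in for word_features and one abstract.
demo_vocabulary = ["alma", "disk", "galaxi", "numbr"]
demo_features = find_features_fast("alma observ numbr disk numbr", demo_vocabulary)

present = [word for word, hit in demo_features.items() if hit]
print(present)
```

The returned dictionary has the same shape as the one produced by find_features, so it could be passed to the NLTK classifiers in the same way.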


In [61]:
# Now let's do it for all the abstracts
messages = zip(processed, Y)
messages = list(messages)
# define a seed for reproducibility
seed = 1
np.random.seed(seed)
np.random.shuffle(messages)

# call find_features function for each abstract
featuresets = [(find_features(text), label) for (text, label) in messages]
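Seeding only has an effect when `np.random.seed` is called as a function; assigning a value to it silently replaces the function and leaves the shuffle unseeded. The sketch below, with a hypothetical helper `shuffled_copy` and toy data, checks that two shuffles with the same seed produce the same order.

```python
import numpy as np

def shuffled_copy(items, seed):
    """Return a shuffled copy of items, reproducible for a given seed."""
    np.random.seed(seed)       # call the function; do not assign to it
    items = list(items)        # copy so the caller's list is left unchanged
    np.random.shuffle(items)   # in-place shuffle of the copy
    return items

# Toy (text, label) pairs standing in for the messages list.
pairs = [("text {}".format(i), i % 2) for i in range(10)]

first = shuffled_copy(pairs, seed=1)
second = shuffled_copy(pairs, seed=1)

print(first == second)
```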

In [62]:
# we can split the featuresets into training and testing datasets using sklearn
from sklearn import model_selection

# split the data into training and testing datasets
training, testing = model_selection.train_test_split(featuresets, test_size = 0.25, random_state=seed)

In [63]:
print(len(training))
print(len(testing))

3396
1132
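With imbalanced classes, a purely random split can leave the test set with a slightly different class ratio than the training set. One option, shown here as a sketch on toy labels rather than on the notebook's featuresets, is the `stratify` argument of train_test_split, which keeps the class proportions the same in both parts.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy data: 80 'line' (1) and 20 'continuum' (0) labels.
demo_labels = [1] * 80 + [0] * 20
demo_items = list(range(len(demo_labels)))

train_items, test_items, train_labels, test_labels = train_test_split(
    demo_items, demo_labels,
    test_size=0.25, random_state=1,
    stratify=demo_labels)  # preserve the 80/20 ratio in both splits

print("train:", Counter(train_labels))
print("test: ", Counter(test_labels))
```

For the featuresets used in this notebook, the labels to stratify on would be the second element of each (features, label) pair.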


### 4. Scikit-Learn Classifiers with NLTK

Now that we have our dataset, we can start building algorithms! Let's start with a simple linear support vector classifier, then expand to other algorithms. We'll need to import each algorithm we plan on using from sklearn.  We also need to import some performance metrics, such as accuracy_score and classification_report.

In [64]:
# We can use sklearn algorithms in NLTK
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.svm import SVC

model = SklearnClassifier(SVC(kernel = 'linear'))

# train the model on the training data
model.train(training)

# and test on the testing dataset!
accuracy = nltk.classify.accuracy(model, testing)*100
print("SVC Accuracy: {}".format(accuracy))

SVC Accuracy: 81.97879858657244


In [65]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix

In [66]:
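import random
# seeds Python's built-in generator only; scikit-learn estimators draw from
# NumPy's random state unless a random_state argument is passed to them
random.seed(2002)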

In [67]:


# Define models to train
#names = ["K Nearest Neighbors", "Decision Tree", "Random Forest", "Gradient Boosting","Logistic Regression", "SGD Classifier",
       #  "Naive Bayes", "SVM Linear"]
names = ["Random Forest", "Gradient Boosting", "Logistic Regression", "SVM Linear"]

classifiers = [
    #KNeighborsClassifier(),
   #DecisionTreeClassifier(),
    RandomForestClassifier(),
    GradientBoostingClassifier(),
    LogisticRegression(),
    #SGDClassifier(max_iter = 100),
    #MultinomialNB(),
    SVC(kernel = 'linear')
]

models = zip(names, classifiers)

for name, model in models:
    nltk_model = SklearnClassifier(model)
    nltk_model.train(training)
    accuracy = nltk.classify.accuracy(nltk_model, testing)*100
    print("{} Accuracy: {}".format(name, accuracy))

Random Forest Accuracy: 85.68904593639576
Gradient Boosting Accuracy: 87.63250883392226


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Logistic Regression Accuracy: 85.15901060070671
SVM Linear Accuracy: 81.97879858657244
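The warning printed for logistic regression means the solver stopped at its iteration limit before converging. As the message itself suggests, one remedy is a larger `max_iter`. The sketch below demonstrates this on synthetic data from make_classification, which is an assumption for illustration and not the notebook's feature matrix.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification problem used only for demonstration.
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=1)

# Raise the iteration cap well above the default of 100.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_demo, y_demo)

# n_iter_ reports how many iterations the solver actually used.
print("iterations used:", clf.n_iter_[0])
print("training accuracy:", clf.score(X_demo, y_demo))
```

In this notebook the equivalent change would be to pass `LogisticRegression(max_iter=1000)` in the classifiers list; whether 1000 iterations is enough for the 1500-feature data would need to be checked by rerunning the cell.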


In [68]:
# Ensemble methods - Voting classifier
from sklearn.ensemble import VotingClassifier

# names must line up one-to-one with the classifiers listed below
names = ["Decision Tree", "Random Forest", "Logistic Regression", "SVM Linear"]

classifiers = [
    # KNeighborsClassifier(),
    DecisionTreeClassifier(),
    RandomForestClassifier(),
    LogisticRegression(),
    # SGDClassifier(max_iter = 100),
    # MultinomialNB(),
    SVC(kernel = 'linear')
]

models = list(zip(names, classifiers))
nltk_ensemble = SklearnClassifier(VotingClassifier(estimators = models, voting = 'hard', n_jobs = -1))
nltk_ensemble.train(training)
# score the ensemble itself (nltk_model is the last model from the previous loop)
accuracy = nltk.classify.accuracy(nltk_ensemble, testing)*100
print("Voting Classifier: Accuracy: {}".format(accuracy))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Voting Classifier: Accuracy: 85.68904593639576
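Hard voting means each classifier casts one vote per sample and the most common label wins. The following pure-Python sketch illustrates the idea on made-up votes. The tie-break toward the smallest label is written to mirror what scikit-learn documents for VotingClassifier (ascending sort order of the class labels), but treat that detail as an assumption to verify against the installed version.

```python
from collections import Counter

def hard_vote(predictions):
    """Return the majority label; ties go to the smallest label."""
    counts = Counter(predictions)
    best = max(counts.values())
    return min(label for label, n in counts.items() if n == best)

# Each row holds the votes of four classifiers for one sample.
votes_per_sample = [
    [1, 1, 0, 1],  # clear majority for 1
    [0, 0, 1, 0],  # clear majority for 0
    [1, 0, 1, 0],  # tie between 0 and 1
]

ensemble_prediction = [hard_vote(votes) for votes in votes_per_sample]
print(ensemble_prediction)
```

With an even number of classifiers, as in the cell above, ties like the third row can occur, so an odd number of voters is a common design choice.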


In [69]:
# make class label prediction for testing set
txt_features, labels = zip(*testing)

prediction = nltk_ensemble.classify_many(txt_features)

In [70]:
# print a confusion matrix and a classification report
print(classification_report(labels, prediction))

pd.DataFrame(
    confusion_matrix(labels, prediction),
    index = [['actual', 'actual'], ['continuum', 'line']],
    columns = [['predicted', 'predicted'], ['continuum', 'line']])

              precision    recall  f1-score   support

           0       0.63      0.55      0.59       210
           1       0.90      0.93      0.91       922

    accuracy                           0.86      1132
   macro avg       0.77      0.74      0.75      1132
weighted avg       0.85      0.86      0.85      1132



| actual \ predicted | continuum | line |
|---|---|---|
| continuum | 116 | 94 |
| line | 68 | 854 |
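The figures in the classification report can be recomputed by hand from the confusion matrix, which is a useful sanity check. This sketch copies the four counts from the matrix above; the variable names are illustrative.

```python
# Confusion matrix from above: outer keys are actual, inner keys are predicted.
confusion = {
    "continuum": {"continuum": 116, "line": 94},
    "line": {"continuum": 68, "line": 854},
}
class_names = list(confusion)

total = sum(sum(row.values()) for row in confusion.values())
accuracy = sum(confusion[c][c] for c in class_names) / total

metrics = {}
for c in class_names:
    predicted_as_c = sum(confusion[actual][c] for actual in class_names)
    actually_c = sum(confusion[c].values())
    metrics[c] = {
        "precision": confusion[c][c] / predicted_as_c,  # correct / predicted as c
        "recall": confusion[c][c] / actually_c,         # correct / actually c
    }

print("accuracy: {:.2f}".format(accuracy))
for c in class_names:
    print("{}: precision {:.2f}, recall {:.2f}".format(
        c, metrics[c]["precision"], metrics[c]["recall"]))
```

The recall for the continuum class is much lower than for the line class, which the overall accuracy alone does not reveal.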
