In [1]:
import re

import nltk
nltk.download('punkt')
from nltk.classify import NaiveBayesClassifier
from nltk.classify.util import accuracy


[nltk_data] Downloading package punkt to
[nltk_data]     /usr3/graduate/jack7z/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Natural Language Processing (NLP) is the field of computer science focused on enabling computers to understand, analyze, and generate human language. The evolution of NLP can be categorized into three major eras: symbolic, statistical, and neural. The advancements made in each era continue to be relevant and valuable today. In this lecture, we explore one tool in each era that can be useful in your research. We start with regular expression, a symbolic NLP tool. The first example is the match function, which finds the first match in a string. 

In [43]:
#Eg 1 
# pattern = "^The.*Spain$"
# pattern = "The.*Spain"
pattern = "The.*?Spain"

txt1 = "The rain in Spain"
txt2 = 'Yes, indeed. The rain in Spain is precious. Yes, indeed. The people in Spain speak Spanish.'
txt3 = "The inflammation causes pains"

In [44]:
#Eg 1.5

x = re.search("\s", txt2)
print("The first white-space character is located in position:", x.start())

The first white-space character is located in position: 4


As is shown in the example above, re.match() finds the first match in a string and stops. If we want to find multiple matches in a string, you can use re.findall() instead. It returns a list of all non-overlapping matches of the pattern in the string.

In [2]:
#Eg 2





Just as another example, re.findall() can be used to search for numbers, and much more if we have the correct expressions. 

In [4]:
#Eg 3

pattern = r'\d+'  # This pattern matches one or more digits

text = "I have 10 apples and 5 oranges."

# Print all the matches

print('==================================')

pattern = r'\d+\s+\w+' # This pattern matches one or more digits, followed by one or more space, followed by one or more letters
# pattern = r'(\d+)\s+(\w+)' # This pattern matches one or more digits, followed by one or more space, followed by one or more letters





And we can revise the strings using re.sub(). For example, suppose we need to convert from inch to cm.  

In [46]:
#Eg 4


text = "The height, width, and depth of these 6 identical boxes are 10 inches, 19 inches, and 8 inches respectively."



print(result)




re is a powerful tool in machine learning. Here are a few examples:

1. Text Preprocessing: Regular expressions are commonly used for text cleaning and preprocessing tasks. They can help remove or replace specific patterns, such as URLs, email addresses, special characters, or punctuation marks. This can be useful for improving the quality of input data before training an ML model.

2. Feature Extraction: Regular expressions can be employed to extract specific patterns or features from text data. For instance, you can use regex to identify and extract dates, phone numbers, or specific keywords from a document. These extracted features can then be used as inputs to an ML model.

3. Text Classification: Regular expressions can aid in creating rules or patterns for text classification tasks. For example, you can define regex patterns to identify certain types of documents or topics based on specific keywords or patterns present in the text. These patterns can be used as features to train a classification model.

4. Named Entity Recognition (NER): NER is a task that involves identifying and classifying named entities (such as person names, locations, or organizations) in text. Regular expressions can be utilized to define patterns that match specific types of named entities and extract them from the text.

5. Text Generation and Language Modeling: Regular expressions can be helpful in generating or manipulating text data. They can be used to define rules for generating text based on specific patterns or templates. For example, you can use regex to replace placeholders in a text template with dynamically generated content.

There are more tool in regular expressions. Check out https://docs.python.org/3/howto/regex.html


Symbolic NLP are very useful. However, they are still very rigid. To increase the flexibility of processing natural languages, statistical NLP was developed. We will look at one example of statistical NLP method, namely using Bayesian statistics to classify sentences into positive and negative categories. To do so, we need to tokenize our input. Here is an example of tokenization. 

In [42]:
def format_sentence(sent):
    return({word: True for word in nltk.word_tokenize(sent)})

format_sentence("Life is beautiful so enjoy everymoment you have.")

{'Life': True,
 'is': True,
 'beautiful': True,
 'so': True,
 'enjoy': True,
 'everymoment': True,
 'you': True,
 'have': True,
 '.': True}

Now we download some training and testing data. 

In [4]:
pos = []
with open("./pos_tweets.txt") as f:
    for i in f: 
        pos.append([format_sentence(i), 'pos'])
        
neg = []
with open("./neg_tweets.txt") as f:
    for i in f: 
        neg.append([format_sentence(i), 'neg'])

In [5]:

eg_num = 138

print('Positive example')

with open("./pos_tweets.txt") as f:
    lines = f.readlines()
    print(lines[eg_num])
print('Tokenized')

print(pos[eg_num])

print('\n============================\n')

print('Negative example')

with open("./neg_tweets.txt") as f:
    lines = f.readlines()
    print(lines[eg_num])
print('Tokenized')
print(neg[eg_num])

Positive example
"@Lakers ready to win tonight!!! "

Tokenized
[{'``': True, '@': True, 'Lakers': True, 'ready': True, 'to': True, 'win': True, 'tonight': True, '!': True}, 'pos']


Negative example
"@hillaryrachel oh i know how you feel. i took a leap of faith and asked Taylor Swift to be my BFFL ... she didnt reply "

Tokenized
[{'``': True, '@': True, 'hillaryrachel': True, 'oh': True, 'i': True, 'know': True, 'how': True, 'you': True, 'feel': True, '.': True, 'took': True, 'a': True, 'leap': True, 'of': True, 'faith': True, 'and': True, 'asked': True, 'Taylor': True, 'Swift': True, 'to': True, 'be': True, 'my': True, 'BFFL': True, '...': True, 'she': True, 'didnt': True, 'reply': True}, 'neg']


In [6]:
#train (90%) test (10%) split

training = pos[:int((.9)*len(pos))] + neg[:int((.9)*len(neg))]
test = pos[int((.1)*len(pos)):] + neg[int((.1)*len(neg)):]


Here we train the a Naive Baye's classifier . 

In [7]:

classifier = NaiveBayesClassifier.train(training)

In [38]:
classifier.show_most_informative_features()


Most Informative Features
                      no = True              neg : pos    =     21.2 : 1.0
                 awesome = True              pos : neg    =     18.7 : 1.0
                headache = True              neg : pos    =     18.3 : 1.0
               beautiful = True              pos : neg    =     14.2 : 1.0
                    love = True              pos : neg    =     14.2 : 1.0
                      Hi = True              pos : neg    =     12.7 : 1.0
                   Thank = True              pos : neg    =      9.7 : 1.0
                     fan = True              pos : neg    =      9.7 : 1.0
                    glad = True              pos : neg    =      9.7 : 1.0
                    been = True              neg : pos    =      9.3 : 1.0


We can do some testing on the classifier.

In [8]:
example1 = "This workshop is likely going to prepare the students for their upcoming projects."
example2 = "Students need far more than this workshop to get prepared for real life questions."

print(classifier.classify(format_sentence(example1)))
print(classifier.classify(format_sentence(example2)))

pos
neg


In [12]:
print('The accuracy on the test set is')
print(accuracy(classifier, test))

The accuracy on the test set is
0.9562326869806094


The accuracy is in fact not the most informative metric to look at. Better alternatives are the precision and recall. 

precision = true_positives / predicted_positives <br>
recall = true_positives / actual_positives

In this case, our precision is low, meaning many negative examples are marked as positive. 

In [16]:
# Calculate precision and recall
predictions = [classifier.classify(features) for features, _ in test]
ground_truth = [label for _, label in test]

true_positives = sum(1 for pred, truth in zip(predictions, ground_truth) if pred == 'pos' and truth == 'pos')
predicted_positives = sum(1 for pred in predictions if pred == 'pos')
actual_positives = sum(1 for truth in ground_truth if truth == 'pos')

precision = true_positives / predicted_positives
recall = true_positives / actual_positives

print(precision)
print(recall)

0.8981636060100167
0.9676258992805755


Neural NLP: ChatGPT