# WEP24-MLB: Text Mining

Textual data is everywhere:
* Websites (e.g., news), social media (e.g., Twitter), databases (e.g., doctors’ notes), digital scans of printed materials, …
* Applications in industry: search, machine translation, sentiment analysis, question answering, …
* Applications in science: cognitive modelling, understanding bias in language, automated systematic literature reviews, …



## Regular Expressions

Using the `re` library, run some of the examples from the slides. Try the functions `re.search()`, `re.match()`, and `re.findall()`, and compare them with `str.startswith()` and `str.endswith()`.
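
As a quick orientation before the exercises, the sketch below contrasts the three `re` functions on a single word. The variable names are illustrative only.

```python
import re

word = "banana"

# re.match only tries the pattern at the very start of the string.
at_start = re.match(r"an", word)     # None: "banana" does not start with "an"
# re.search scans the whole string and returns the first match it finds.
first = re.search(r"an", word)       # match object covering positions 1..3
# re.findall returns every non-overlapping match as a list of strings.
every = re.findall(r"an", word)      # ['an', 'an']

print(at_start, first.span(), every)
# For a fixed prefix or suffix, the string methods need no regex at all.
print(word.startswith("ba"), word.endswith("na"))
```

In short, `re.match()` behaves like `re.search()` with an implicit `^` anchor, while `str.startswith()` and `str.endswith()` only test literal text.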

In [None]:
import re
'''
TODO: given the following data, try some regex queries,
e.g. find the words that start with 'a' and have at least two characters,
or the words that contain a specific set of characters.
'''
text = ['application', 'apple', 'banana', 'pear']
matchings = [x for x in text if re.search('^a.', x)]
print(matchings)

In [None]:
matchings = [x for x in text if x.startswith('a')]
matchings

In [None]:
r1 = re.compile(r".a$")
matchings = [x for x in text if r1.search(x)]
matchings

In [None]:
r1 = re.compile(r"ap|an")
matchings = [x for x in text if r1.search(x)]
matchings

In [None]:
r2 = re.compile(r"Bat[wo]man")
text2 = ['Batman', 'Batoman', 'Batwoman', 'Batwman', 'Batwowoman']
matchings = [x for x in text2 if r2.search(x)]
matchings

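It doesn't work as you expected, right? The character class `[wo]` matches exactly one character, either `w` or `o`. Can you write a pattern that treats `wo` as an optional unit instead?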

In [None]:
r2 = re.compile(r"Bat(wo)?man")
text2 = ['Batman', 'Batoman', 'Batwoman', 'Batwman', 'Batwowoman']
matchings = [x for x in text2 if r2.search(x)]
matchings
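
To make the difference explicit, here is a small side-by-side sketch of the two patterns on the same list. The variable names are illustrative.

```python
import re

names = ['Batman', 'Batoman', 'Batwoman', 'Batwman', 'Batwowoman']

# [wo] is a character class: exactly ONE character, either 'w' or 'o'.
char_class = re.compile(r"Bat[wo]man")
# (wo)? is an optional group: the sequence "wo", zero or one time.
optional_group = re.compile(r"Bat(wo)?man")

with_class = [n for n in names if char_class.search(n)]
with_group = [n for n in names if optional_group.search(n)]

print(with_class)  # ['Batoman', 'Batwman']
print(with_group)  # ['Batman', 'Batwoman']
```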

In [None]:
r3 = re.compile(r"[b-f](at|ot)")
text3 = ['bat', 'cat', 'hat', 'eat', 'nat', 'oat', 'pat', 'Pat', 'ot', 'bot', 'got', '-at', '&at']
matchings = [x for x in text3 if r3.search(x)]
matchings

In [None]:
r4 = re.compile(r"^color$")
text4 = ['color', 'colour', 'colourhat', 'colormat', 'colournat', '123oat', 'pat', 'Pat', 'ot', 'bot', 'mot', '-at', '&at']
matchings = [x for x in text4 if r4.search(x)]
matchings

In [None]:
r5 = re.compile(r"[0-9]{2}")
text5 = 'My 2 favorite numbers are 19 and 4222'
matchings = r5.findall(text5)
matchings

In [None]:
text6 = 'You want to find sub-words that have three letters'
# inside [...] parentheses and | are literal characters, so list the ranges directly
matchings = re.findall(r'[a-zA-Z]{3}', text6)
matchings

Try the same text but remove the spaces.

In [None]:
text = 'You want to find sub-words that have three letters'
nst = text.replace(' ', '')
matchings = re.findall(r'[a-zA-Z]{3}', nst)
matchings
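
The two cells above cut the text into non-overlapping chunks of three letters. If the goal is instead to find whole words of exactly three letters, one possible approach is to add the word-boundary anchor `\b`. This is a sketch, not part of the original exercise.

```python
import re

sentence = 'You want to find sub-words that have three letters'

# Without anchors, {3} consumes letters in non-overlapping chunks of three,
# so longer words are cut into pieces.
chunks = re.findall(r'[a-zA-Z]{3}', sentence)
# \b matches at a word boundary, so only whole three-letter words are kept.
whole_words = re.findall(r'\b[a-zA-Z]{3}\b', sentence)

print(chunks)
print(whole_words)  # ['You', 'sub']
```

Note that `sub` counts as a word here because the hyphen is not a word character.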

## Textual Data Preprocessing

We will start by installing the `nltk` library and downloading its set of stop words.

In [None]:
!pip install nltk
import nltk
nltk.download('stopwords')
nltk.download('punkt')

Import the required libraries.

In [None]:
import pandas as pd
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize, RegexpTokenizer
import re

We perform the same tasks that were discussed in the slides: tokenization, stemming, and removal of stop words and punctuation marks.

### Tokenization

In [None]:
from nltk.tokenize import sent_tokenize, word_tokenize, RegexpTokenizer
'''
TODO: Tokenize the text and display the tokens.
Assume that tokens are separated by spaces and special characters.
'''
text = 'Morning and afternoon are parts of the day.'
word_tokenize(text)


In [None]:
'''
TODO: Tokenize the text and display the tokens.
Assume that tokens are separated by spaces and special characters.

Use a regular-expression tokenizer this time.
'''
tokenizer = RegexpTokenizer(r'\w+|\$[\d\.]+|\S')
tokenizer.tokenize(text)

### Stop-word and Punctuation Marks Removal

In [None]:
stop_words = nltk.corpus.stopwords.words('english')
# text with no stop words
nsw_text = [word for word in word_tokenize(text) if word not in stop_words]
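# text with no punctuation marks: re.sub() returns '' (falsy) for tokens
# that consist only of punctuation, so those tokens are dropped
filtered_text = [word for word in nsw_text if re.sub(r'[^\w\s]', '', word)]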
filtered_text
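
One detail worth keeping in mind: the `nltk` English stop-word list is lowercase, so a capitalized token such as `The` passes a plain `not in stop_words` check. The sketch below shows one way to handle this by lowercasing first. `STOP_WORDS` is a hypothetical mini list used only so the example is self-contained, and `preprocess` is an illustrative helper name.

```python
import re

# Hypothetical mini stop-word list; nltk's English list is much longer.
STOP_WORDS = {"the", "and", "are", "of"}

def preprocess(sentence):
    """Lowercase, tokenize on word characters, and drop stop words."""
    # \w+ never captures punctuation, so no separate punctuation filter is needed.
    tokens = re.findall(r"\w+", sentence.lower())
    return [t for t in tokens if t not in STOP_WORDS]

cleaned = preprocess("The morning and the afternoon are parts of the day.")
print(cleaned)  # ['morning', 'afternoon', 'parts', 'day']
```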

### Stemming

In [None]:
'''
TODO: Run the Porter stemmer to find the stem of each word in the filtered sentence
'''
text = "Morning and afternoon are parts of the day."
ps = PorterStemmer()
stemmed_words = [ps.stem(w) for w in filtered_text]

print("Original text:", text)
print("Stemmed text:", stemmed_words)

## Textual Data Representation

In [None]:
'''
TODO: Create a dictionary that contains the term and its frequency in the document.

Display the TF matrix
'''
from sklearn.feature_extraction.text import CountVectorizer
D1 = 'the cat sits on the bed'
D2 = 'the dog sits on the bed'
corpus = [D1,D2]
rows = ['D1', 'D2']
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
terms = vectorizer.get_feature_names_out()
frequencies = X.toarray()
dtm = pd.DataFrame(frequencies, columns = terms, index = rows)
dtm

In [None]:
'''
TODO: repeat the same exercise but remove the stop words this time.
'''
from sklearn.feature_extraction.text import CountVectorizer
D1 = 'the cat sits on the bed'
D2 = 'the dog sits on the bed'
corpus = [D1,D2]
rows = ['D1', 'D2']
vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(corpus)
terms = vectorizer.get_feature_names_out()
X = X.toarray()
dtm = pd.DataFrame(X, columns = terms, index = rows)
dtm

In [None]:
'''
TODO: Create a dictionary that contains the terms and their TF-IDF scores.

Display the TF-IDF matrix
'''
from sklearn.feature_extraction.text import TfidfVectorizer

D1 = 'the cat sits on the bed'
D2 = 'the dog sits on the bed'
corpus = [D1,D2]
rows = ['D1', 'D2']
# vectorizer = TfidfVectorizer(stop_words='english')
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
terms = vectorizer.get_feature_names_out()
X = X.toarray()
dtm = pd.DataFrame(X, columns = terms, index = rows)
dtm = dtm.round(2)
dtm
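
To see where these numbers come from, the following sketch recomputes the scores by hand. It assumes the `TfidfVectorizer` defaults (raw term counts, smoothed IDF, L2 normalization) and uses whitespace splitting in place of the library's tokenizer, which gives the same tokens for this corpus.

```python
import math
from collections import Counter

corpus = ['the cat sits on the bed', 'the dog sits on the bed']
docs = [doc.split() for doc in corpus]
vocab = sorted({term for doc in docs for term in doc})
n_docs = len(docs)

# Document frequency: in how many documents each term appears.
df = {t: sum(t in doc for doc in docs) for t in vocab}
# Smoothed IDF: ln((1 + n) / (1 + df)) + 1
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

def tfidf(doc):
    """Return the L2-normalized TF-IDF vector of one tokenized document."""
    counts = Counter(doc)
    weights = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(w * w for w in weights))
    return [round(w / norm, 2) for w in weights]

print(vocab)
for doc in docs:
    print(tfidf(doc))
```

Terms that occur in both documents get an IDF of exactly 1, so only `cat` and `dog` are weighted up.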

In [None]:
'''
TODO: Create a dictionary that contains the terms and their word-word
co-occurrence matrix (WWCM) scores.

Display the WWCM matrix
'''
from sklearn.feature_extraction.text import CountVectorizer
D1 = 'the cat sits on the bed'
D2 = 'the dog sits on the bed'
corpus = [D1,D2]

# Convert a collection of text documents to a matrix of token counts
cv = CountVectorizer(ngram_range=(1,1), stop_words = 'english')
# matrix of token counts
X = cv.fit_transform(corpus)
Xc = (X.T @ X) # term-by-term matrix: how often two terms occur in the same document
Xc.setdiag(0) # zero the diagonal: a term's co-occurrence with itself is not informative
names = cv.get_feature_names_out()
df = pd.DataFrame(data = Xc.toarray(), columns = names, index = names)
df
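
The matrix above counts two terms as co-occurring whenever they appear in the same document. A common alternative counts only words that fall within a small window of each other. The sketch below is one minimal way to do that in plain Python. The `window` value is an arbitrary choice for illustration.

```python
from collections import Counter

corpus = ['cat sits bed', 'dog sits bed']  # stop words already removed
window = 1  # assumption: count only directly adjacent words

cooc = Counter()
for doc in corpus:
    tokens = doc.split()
    for i, word in enumerate(tokens):
        # Look at the neighbours to the right and store each pair symmetrically.
        for neighbour in tokens[i + 1 : i + 1 + window]:
            cooc[(word, neighbour)] += 1
            cooc[(neighbour, word)] += 1

print(cooc[('sits', 'bed')])  # 2
print(cooc[('cat', 'bed')])   # 0: same document, but not adjacent
```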

## Word Embedding
### gensim

In [None]:
!pip install gensim

We will also need to preprocess the text using `nltk`.

In [None]:
import nltk
nltk.download('punkt')

Download text to be used for training the model. 

In [None]:
!wget -c https://www.gutenberg.org/files/11/11-0.txt
!mv 11-0.txt sample_data

In [None]:
# Python program to generate word vectors using Word2Vec

# importing all necessary modules
from gensim.models import Word2Vec
import gensim
from nltk.tokenize import sent_tokenize, word_tokenize

# Read the text file (assumed to be UTF-8 encoded) and close it afterwards
with open("sample_data/11-0.txt", encoding="utf-8") as sample:
    s = sample.read()

# Replace newline characters with spaces
f = s.replace("\n", " ")

data = []
vs = 300        # The size of the word vector

# iterate through each sentence in the file
for i in sent_tokenize(f):
    temp = []

    # tokenize the sentence into words
    for j in word_tokenize(i):
        temp.append(j.lower())

    data.append(temp)

# Create CBOW model
m1 = gensim.models.Word2Vec(data, min_count=1, vector_size=vs, window=5)

# Compute the similarity and print it
print("CBOW('alice', 'computer') = ", m1.wv.similarity('alice', 'computer'))
print("CBOW('alice', 'wonderland') = ", m1.wv.similarity('alice', 'wonderland'))


# Create Skip Gram model
m2 = gensim.models.Word2Vec(data, min_count=1, vector_size=vs, window=5, sg=1)

# Compute the similarity and print it
print("SG('alice', 'computer') = ", m2.wv.similarity('alice', 'computer'))
print("SG('alice', 'wonderland') = ", m2.wv.similarity('alice', 'wonderland'))

In [None]:
'''
TODO: Use the model's most_similar() function to print the words that
are closest to a given word such as 'beginning' or 'computer'
'''
m1.wv.most_similar("beginning")

In [None]:
m1.wv.most_similar("computer")

Before printing the vector representation of a word, consider reducing the vector size in the training step so that the output stays readable.


In [None]:
# Python program to generate word vectors using Word2Vec

# importing all necessary modules
from gensim.models import Word2Vec
import gensim
from nltk.tokenize import sent_tokenize, word_tokenize



# Read the text file (assumed to be UTF-8 encoded) and close it afterwards
with open("sample_data/11-0.txt", encoding="utf-8") as sample:
    s = sample.read()

# Replace newline characters with spaces
f = s.replace("\n", " ")

data = []
vs = 5        # The size of the word vector

# iterate through each sentence in the file
for i in sent_tokenize(f):
    temp = []

    # tokenize the sentence into words
    for j in word_tokenize(i):
        temp.append(j.lower())

    data.append(temp)

# Create CBOW model
m1 = gensim.models.Word2Vec(data, min_count=1, vector_size=vs, window=5)

# Compute the similarity and print it
print("CBOW('alice', 'computer') = ", m1.wv.similarity('alice', 'computer'))
print("CBOW('alice', 'wonderland') = ", m1.wv.similarity('alice', 'wonderland'))


# Create Skip Gram model
m2 = gensim.models.Word2Vec(data, min_count=1, vector_size=vs, window=5, sg=1)

# Compute the similarity and print it
print("SG('alice', 'computer') = ", m2.wv.similarity('alice', 'computer'))
print("SG('alice', 'wonderland') = ", m2.wv.similarity('alice', 'wonderland'))

In [None]:
'''
TODO: Use the model's get_vector() function to print the vector representation
of a word that exists in the vocabulary of the model
'''
m1.wv.get_vector('computer')
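
The `similarity()` scores printed earlier are cosine similarities between two such vectors. The sketch below computes the same measure by hand. The three vectors are made-up toy values, not output from any trained model.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional vectors for illustration only.
a = [1.0, 2.0, 0.0]
b = [2.0, 4.0, 0.0]   # same direction as a
c = [0.0, 0.0, 3.0]   # orthogonal to a

print(cosine_similarity(a, b))  # close to 1: same direction
print(cosine_similarity(a, c))  # 0.0: no shared direction
```

Because only the direction matters, scaling a vector does not change its cosine similarity to other vectors.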

### gensim with pretrained models

We will start by importing the downloader of the pretrained models and list the set of available models.

In [None]:
import gensim.downloader
print(list(gensim.downloader.info()['models'].keys()))

Let us try a small model `glove-twitter-25`.

In [None]:
gtv = gensim.downloader.load('glove-twitter-25')

In [None]:
'''
TODO: Use the pretrained model's most_similar() function to print the words that
are very close to a given word such as 'beginning', 'computer'
Compare the results with those that were obtained using the model that we trained earlier.
'''
gtv.most_similar('beginning')

In [None]:
gtv.most_similar('computer')

### fastText

In [None]:
!pip install fasttext

In [None]:
import fasttext

In [None]:
ftm1 = fasttext.train_unsupervised('sample_data/11-0.txt', minn=2, maxn=5, dim=300)

In [None]:
ftm1.get_dimension()

In [None]:
import fasttext.util
fasttext.util.reduce_model(ftm1, 5)
ftm1.get_dimension()

In [None]:
'''
TODO: print the vector representation of the word "language"
'''
ftm1.get_word_vector("language").round(4)

In [None]:
'''
TODO: display the 5, 10 and 20 closest neighbors to the word "language"
'''
n = 5  # change to 10 or 20 to see more neighbors
ftm1.get_nearest_neighbors("language", k=n)
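
The `minn` and `maxn` arguments used when training control the lengths of the character n-grams that fastText builds for each word, after wrapping the word in `<` and `>` boundary markers. The sketch below lists those n-grams for a short word. `char_ngrams` is an illustrative helper, not a fastText function, and it leaves out the hashing of n-grams into buckets that the library performs.

```python
def char_ngrams(word, minn=2, maxn=5):
    """List the character n-grams of a word wrapped in boundary markers."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n]
            for n in range(minn, maxn + 1)
            for i in range(len(wrapped) - n + 1)]

grams = char_ngrams("cat")
print(grams)
print(len(grams))  # 10
```

Sharing n-grams such as `<ca` between words is what lets fastText produce a vector even for a word it has not seen during training.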

### fastText with pretrained models
The download step takes a long time: the compressed file is nearly 4.5 GB, and it is extracted to 7.24 GB.

In [None]:
import fasttext.util
fasttext.util.download_model('en', if_exists='ignore')  # English
ft = fasttext.load_model('cc.en.300.bin')
ft.get_dimension()

Now, try to find the words closest to `computer` or `beginning`.

In [None]:
ft.get_nearest_neighbors('computer')

In [None]:
ft.get_word_vector('hello')

## Sentiment Analysis

### Using Transformers

A nice tutorial on using transformers for sentiment analysis can be found [here](https://huggingface.co/blog/sentiment-analysis-python).

In [None]:
from transformers import pipeline

data = ["KAUST was established to become world class university", "KAU is an old university"]
sentiment_pipeline = pipeline("sentiment-analysis")
sentiment_pipeline(data)

### Using nltk

In [None]:
import nltk
nltk.download('vader_lexicon')

In [None]:
from nltk.sentiment import SentimentIntensityAnalyzer
sia = SentimentIntensityAnalyzer()
sia.polarity_scores("Wow, NLTK is really powerful!")

In [None]:
'''
TODO: display the 'polarity_scores' of the following two sentences and 
compare the results with those obtained using the transformers
'''
data = ["KAUST was established to become world class university",
        "KAU is an old university"]
[sia.polarity_scores(x) for x in data]

Why do you think that the results are different?

In [None]:
'''
TODO: Try with these sentences.
'''
data = ["KAUST is a good university",
        "Students are feeling happy"]
[sia.polarity_scores(x) for x in data]
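
One way to reason about these outputs: VADER is lexicon-based, so a sentence without any word from its lexicon tends to come out neutral. The toy scorer below illustrates only that basic idea. `LEXICON` and its values are hypothetical and are not VADER's real lexicon or scoring rules.

```python
# Hypothetical toy lexicon; VADER's real lexicon and rules are far richer.
LEXICON = {"good": 1.0, "happy": 1.5, "bad": -1.0, "sad": -1.5}

def lexicon_score(sentence):
    """Sum the valence of known words; unknown words contribute nothing."""
    return sum(LEXICON.get(w, 0.0) for w in sentence.lower().split())

print(lexicon_score("KAU is an old university"))    # 0.0: no lexicon word present
print(lexicon_score("KAUST is a good university"))  # 1.0
print(lexicon_score("Students are feeling happy"))  # 1.5
```

A trained classifier, by contrast, always has to pick one of its output labels, even for a sentence that a lexicon would treat as neutral.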