## Lab 8 NLP basics

In this notebook, you will learn how some basic NLP methods work. You will:
* extract words from a given text using word tokenisation,
* create n-grams from files in a given directory, and
* generate a word-meaning vector using a Word2Vec model.

In [4]:
import nltk
nltk.download("popular")

AttributeError: module 'nltk' has no attribute 'internals'

In [5]:
from heading import *
from nltk.tokenize import *
from nltk.corpus import stopwords

## Tokenisation
Tokenisation splits a text into meaningful units called tokens. Sentence tokenisation produces sentences as tokens, and word tokenisation produces words as tokens.

Here you will try one example of sentence tokenisation and one example of word tokenisation.

First, define the text to be tokenised.

In [6]:
EXAMPLE_TEXT = "Hello Mr. Smith, how are you doing today? The weather is great, and Python is awesome. The sky is pinkish-blue. You shouldn't eat cardboard."

Then use the function `sent_tokenize()` to split the text into sentences and print the result.

In [7]:
sent_tokenize_result = sent_tokenize(EXAMPLE_TEXT)
print(sent_tokenize_result)

['Hello Mr. Smith, how are you doing today?', 'The weather is great, and Python is awesome.', 'The sky is pinkish-blue.', "You shouldn't eat cardboard."]


You can also use the function `word_tokenize()` to split the text into words and print the result.

In [8]:
word_tokenize_result = word_tokenize(EXAMPLE_TEXT)
print(word_tokenize_result)

['Hello', 'Mr.', 'Smith', ',', 'how', 'are', 'you', 'doing', 'today', '?', 'The', 'weather', 'is', 'great', ',', 'and', 'Python', 'is', 'awesome', '.', 'The', 'sky', 'is', 'pinkish-blue', '.', 'You', 'should', "n't", 'eat', 'cardboard', '.']
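
To make the idea of word tokenisation concrete, here is a minimal regular-expression sketch that uses only the standard library. The function name `simple_word_tokenize` is hypothetical, and this is not how NLTK's `word_tokenize()` works internally: for instance, the sketch keeps "shouldn't" as a single token, whereas the NLTK output above splits it into 'should' and "n't".

```python
import re

# A word is a run of letters/digits, optionally joined by internal hyphens or
# apostrophes; any other non-space character becomes a token of its own.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+(?:[-'][A-Za-z0-9]+)*|[^\sA-Za-z0-9]")

def simple_word_tokenize(text):
    """Split text into word and punctuation tokens (illustrative only)."""
    return TOKEN_PATTERN.findall(text)

sample_tokens = simple_word_tokenize("The sky is pinkish-blue. You shouldn't eat cardboard.")
print(sample_tokens)
```

A pattern this simple has no knowledge of abbreviations or contractions, which is one reason a trained tokeniser such as NLTK's is usually preferred for real text.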


In [9]:
import string
all_punctuation = string.punctuation
print(f"All punctuation: {all_punctuation}")

All punctuation: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


In [10]:
# Choose the stop word list for the language you need; this example uses English
stops = set(stopwords.words('english'))

LookupError: 
**********************************************************************
  Resource stopwords not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('stopwords')

  For more information see: https://www.nltk.org/data.html

  Attempted to load corpora/stopwords

  Searched in:
    - 'C:\\Users\\John/nltk_data'
    - 'c:\\Users\\John\\AppData\\Local\\Programs\\Python\\Python311\\nltk_data'
    - 'c:\\Users\\John\\AppData\\Local\\Programs\\Python\\Python311\\share\\nltk_data'
    - 'c:\\Users\\John\\AppData\\Local\\Programs\\Python\\Python311\\lib\\nltk_data'
    - 'C:\\Users\\John\\AppData\\Roaming\\nltk_data'
    - 'C:\\nltk_data'
    - 'D:\\nltk_data'
    - 'E:\\nltk_data'
**********************************************************************


In [11]:
# Filter out stop words, punctuation marks and the "n't" token from the word list
removals = stops | set(all_punctuation) | {"n't"}
wordsWithoutStopWords = []

for w in word_tokenize_result:
    # Compare in lower case so that capitalised stop words such as "The" are removed too
    if w.lower() not in removals:
        wordsWithoutStopWords.append(w)

print(wordsWithoutStopWords)

NameError: name 'stops' is not defined
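
Because the stop word corpus was not available in the run above, the sketch below illustrates the same filtering step without NLTK. It assumes a tiny hand-picked stop word set (`toy_stops`, a hypothetical name); NLTK's English list is much longer, so the result from the real list may differ. The tokens are copied from the `word_tokenize()` output shown earlier.

```python
import string

# Tokens copied from the word_tokenize() output shown earlier in this notebook.
tokens = ['Hello', 'Mr.', 'Smith', ',', 'how', 'are', 'you', 'doing', 'today', '?',
          'The', 'weather', 'is', 'great', ',', 'and', 'Python', 'is', 'awesome', '.',
          'The', 'sky', 'is', 'pinkish-blue', '.', 'You', 'should', "n't", 'eat',
          'cardboard', '.']

# ASSUMPTION: a tiny hand-picked stop word set for illustration only,
# not the list returned by stopwords.words('english').
toy_stops = {"how", "are", "you", "doing", "the", "is", "and", "should"}

# Everything to drop: stop words, single punctuation characters, and "n't".
toy_removals = toy_stops | set(string.punctuation) | {"n't"}

filtered = [t for t in tokens if t.lower() not in toy_removals]
print(filtered)
```

Note that 'Mr.' survives the filter because only single punctuation characters are removed, not tokens that merely contain one.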

## N-grams

A sequence of written items of length N is called an N-gram. The unigram (or 1-gram), bigram (or 2-gram), and trigram (or 3-gram) are sequences of one, two, and three items, respectively.

In a character N-gram, the items are characters; in a word N-gram, the items are words. Here we practise generating word N-grams in a simple application example.
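
The definition can be sketched in a few lines of plain Python. The helper name `make_ngrams` is hypothetical and is not part of the lab's helper module; it slides a window of length N over any sequence, so the same function covers both word and character N-grams.

```python
def make_ngrams(items, n):
    """Return every run of n consecutive items as a tuple."""
    # zip over n shifted views of the sequence; zip stops at the shortest
    # view, which yields exactly len(items) - n + 1 n-grams.
    return list(zip(*(items[i:] for i in range(n))))

words_example = "the sky is pinkish-blue".split()
print(make_ngrams(words_example, 2))  # word bigrams
print(make_ngrams("sky", 2))          # character bigrams
```

A sequence shorter than N simply produces an empty list.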

Example: Extract the number of occurrences of N-grams in a given document.

First you need to specify two pieces of information:
* the path to a folder containing the given document files and
* the N in N-grams, which is the number of consecutive words.

In [12]:
file_directory = "text_data/holmes"
num_words = 2

Then, you need to create an instance of `Ngrams` using the directory path and the N value.

In [13]:
ngrams = Ngrams(file_directory,num_words)

Now you can get the frequencies of the N-grams in the given corpus of documents. Here we display the top 10 N-grams with their frequencies.

In [14]:
ngrams.top_term_frequencies(10)

1158: ('of', 'the')
879: ('in', 'the')
521: ('it', 'was')
498: ('to', 'the')
463: ('it', 'is')
457: ('i', 'have')
405: ('that', 'i')
378: ('at', 'the')
370: ('and', 'i')
332: ('and', 'the')
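
The `Ngrams` class comes from the lab's helper module and its implementation is not shown in this notebook. As a rough sketch of how such frequencies could be computed, the hypothetical function `count_ngrams` below counts word N-grams over a list of document strings with `collections.Counter`. Lower-casing the text and the simple word pattern are assumptions, not a description of what `Ngrams` does.

```python
import re
from collections import Counter

def count_ngrams(texts, n):
    """Count word n-grams over an iterable of document strings."""
    counts = Counter()
    for text in texts:
        # ASSUMPTION: lower-case the text and treat runs of letters and
        # apostrophes as words.
        words = re.findall(r"[a-z']+", text.lower())
        # N-grams are formed within each document, never across documents.
        counts.update(zip(*(words[i:] for i in range(n))))
    return counts

toy_docs = ["It was the best of times", "it was the worst of times"]
bigram_counts = count_ngrams(toy_docs, 2)
for ngram, freq in bigram_counts.most_common(3):
    print(f"{freq}: {ngram}")
```

To apply this to a folder, the strings in `toy_docs` would be replaced by the contents of the files read from that folder.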


You can also generate n-grams directly with `nltk.ngrams()`. Here is an example that counts the 2-grams of the test string tokenised earlier:

In [15]:
ngrams_test = Counter(nltk.ngrams(wordsWithoutStopWords,num_words))
print(ngrams_test)

NameError: name 'wordsWithoutStopWords' is not defined

In [16]:
for ngram, freq in ngrams_test.most_common(5):
    print(f"{freq}: {ngram}")

NameError: name 'ngrams_test' is not defined

## Generate a word vector to represent the meaning of a word using a Word2Vec model

Using a Word2Vec model, you can generate a word vector that represents the meaning of a word.

First, you need to define the path of the text file `words.txt` that contains the Word2Vec model.
In the following example, the file is in a local folder `text_data`.

In [17]:
file_directory = "text_data/words.txt"

Then, you can create an instance of the `Vectors` class by calling `Vectors(location-of-your-word2vec-model-file)`.

In [18]:
vectors = Vectors(file_directory)

Lastly, you can get the word model through the `words` attribute of the `Vectors` instance.

In [19]:
words = vectors.words

The individual elements of a word vector are not meaningful on their own. However, you can use word vectors to:
* find the distance between two words,
* find the closest words to a given word, and
* find the relationship between related words.

Here are some examples of word vectors and how to use them:
* the word vector of "city"
* the distance between "city" and "book"
* the top 10 closest words to "book"
* a new word vector calculated from "paris" - "france" + "england", and the closest word to it



In [20]:
words["city"]

array([ 0.231087, -0.238098,  0.584713, -0.524351,  0.40278 ,  0.148448,
        0.386096, -0.493994, -0.198922, -0.411161,  0.556962,  0.220978,
       -0.304637, -0.499713, -0.092555,  0.262613,  0.752704,  0.463667,
        0.054477,  0.155809, -0.195134, -0.009269,  0.378139, -0.651306,
       -0.029372, -0.563472,  0.024709,  0.366842, -0.476904, -0.42565 ,
       -0.094642, -0.052822,  0.124612,  0.296046, -0.244881,  0.195957,
        0.223666,  0.064116,  0.577874,  0.083096, -0.378262,  0.196044,
       -0.220993, -0.630213, -0.311214,  0.435611,  0.351486,  0.342794,
       -0.229961, -0.157521,  0.204315,  0.253944, -0.562277,  0.534482,
       -0.4158  ,  0.120161,  0.649395, -0.227012, -0.130488, -0.332326,
        0.691952, -0.400436,  0.410125,  0.026237, -0.408483,  0.188236,
        0.130957, -0.320686,  0.225932, -0.171665, -0.335107, -0.009982,
        0.680831, -0.023788, -0.165798,  0.345986, -0.232295,  0.021137,
        0.08515 , -0.24387 , -0.142469, -0.058325, 

You can calculate the distance between the vectors of two different words with the `distance()` method of `Vectors`.

In [21]:
vectors.distance(words["city"], words["book"])

0.7886748166595728
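
The notebook does not state which distance measure `Vectors.distance()` uses. One common choice for word vectors is cosine distance, sketched below in plain Python under that assumption. The function name `cosine_distance` is hypothetical.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Perpendicular vectors are far apart; vectors pointing the same way are close.
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))
print(cosine_distance([1.0, 2.0], [2.0, 4.0]))
```

Cosine distance depends only on the direction of the vectors, not their length, so a value near 0 means the two words point in almost the same direction in the vector space.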

You can also get the 10 closest words with the `closest_words()` method of `Vectors`.

In [22]:
vectors.closest_words(words["book"])[:10]

['book',
 'books',
 'essay',
 'memoir',
 'essays',
 'novella',
 'anthology',
 'blurb',
 'autobiography',
 'audiobook']
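
A nearest-word lookup like this can be sketched as sorting the vocabulary by distance to the query vector. The example below assumes cosine distance and uses made-up 2-dimensional vectors rather than values from `words.txt`. The names `toy_words` and `rank_by_distance` are hypothetical and do not describe how `closest_words()` is implemented.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# ASSUMPTION: made-up 2-dimensional vectors for illustration only.
toy_words = {
    "book": [1.0, 0.1],
    "books": [0.9, 0.2],
    "essay": [0.7, 0.5],
    "city": [-0.2, 1.0],
}

def rank_by_distance(vector, vocabulary):
    """Return all vocabulary words sorted from nearest to farthest."""
    return sorted(vocabulary, key=lambda w: cosine_distance(vocabulary[w], vector))

ranking = rank_by_distance(toy_words["book"], toy_words)
print(ranking[:3])
```

As in the output above, the query word itself comes first because its distance to its own vector is zero.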

In [23]:
words["paris"] - words["france"] + words["england"]

array([-1.874760e-01, -3.692510e-01,  3.075000e-01, -4.181100e-01,
        5.498840e-01,  6.352010e-01, -4.005000e-03, -4.593450e-01,
       -3.470230e-01,  1.659260e-01,  4.006800e-02,  4.340700e-01,
       -8.286800e-01,  6.503120e-01, -3.537860e-01, -1.202400e-02,
        6.322300e-02, -2.357520e-01,  2.798800e-01,  5.046000e-02,
       -7.738000e-03,  5.448990e-01,  2.280140e-01, -1.064991e+00,
        1.754930e-01,  1.058160e-01,  1.875940e-01,  4.105300e-02,
        4.010000e-04, -2.287470e-01, -3.982840e-01,  2.528880e-01,
       -3.941240e-01,  4.546850e-01,  1.792500e-02,  3.963680e-01,
        7.318080e-01, -7.724200e-02, -3.517500e-02, -4.509550e-01,
       -5.403160e-01,  1.168860e-01, -7.841200e-02, -3.544150e-01,
       -3.480260e-01, -6.697200e-02,  1.598300e-02,  3.642560e-01,
        3.123690e-01,  2.153870e-01,  2.149050e-01, -5.187360e-01,
       -1.973870e-01,  1.257630e-01,  2.832430e-01, -1.324610e-01,
        5.708990e-01, -4.308300e-02, -2.368190e-01, -6.645990e

You can also pass in a vector and get its closest word with the `closest_word()` method.

In [24]:
vectors.closest_word(words["paris"] - words["france"] + words["england"])

'london'
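
The arithmetic behind this result can be illustrated with a toy example. The sketch below assumes made-up 2-dimensional vectors in which the first coordinate loosely stands for the country and the second for "being a capital"; real Word2Vec dimensions have no such readable meaning. Cosine distance is also an assumption here, as are the names `toy` and `nearest`.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# ASSUMPTION: made-up vectors chosen so that the analogy works exactly.
toy = {
    "paris": [1.0, 1.0],
    "france": [1.0, 0.0],
    "england": [-1.0, 0.0],
    "london": [-1.0, 1.0],
    "book": [0.0, -1.0],
}

# Element-wise "paris" - "france" + "england".
target = [p - f + e for p, f, e in zip(toy["paris"], toy["france"], toy["england"])]

# The closest word is the one whose vector has the smallest distance to target.
nearest = min(toy, key=lambda w: cosine_distance(toy[w], target))
print(target, nearest)
```

Subtracting "france" removes the country component from "paris" and adding "england" puts a different country in its place, which leaves a vector close to the capital of England.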