# NLP for Beginners using NLTK and spaCy

NLTK (Natural Language Toolkit) is a powerful Python package that provides a diverse set of natural language processing algorithms. It is free, open source, easy to use, well documented, and has a large community. NLTK covers the most common NLP tasks, such as tokenizing, part-of-speech tagging, stemming, sentiment analysis, topic segmentation, and named entity recognition. It helps the computer to analyse, preprocess and understand written text.

You can install it by running the following command:

In [1]:
pip install nltk

Collecting nltk
  Downloading nltk-3.8.1-py3-none-any.whl (1.5 MB)
Collecting click (from nltk)
  Downloading click-8.1.7-py3-none-any.whl.metadata (3.0 kB)
Collecting regex>=2021.8.3 (from nltk)
  Downloading regex-2023.12.25-cp38-cp38-win_amd64.whl.metadata (41 kB)

In [2]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\rafen\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\rafen\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\rafen\AppData\Roaming\nltk_data...


True

## 1. Tokenizing

When we deal with text, we need to break it down into smaller pieces for analysis. This is
where tokenization comes into the picture. It is the process of dividing the input text into a
set of pieces like words or sentences. These pieces are called tokens.

In [3]:
from nltk.tokenize import sent_tokenize, word_tokenize

# define input text
input_text = "Do you know how tokenization works? It's actually quite interesting! Let's analyze a couple of sentences and figure it out."

# sentence tokenizer
print("\nSentence tokenizer:")
print(sent_tokenize(input_text))

# word tokenizer
print("\nWord tokenizer:")
print(word_tokenize(input_text))


Sentence tokenizer:
['Do you know how tokenization works?', "It's actually quite interesting!", "Let's analyze a couple of sentences and figure it out."]

Word tokenizer:
['Do', 'you', 'know', 'how', 'tokenization', 'works', '?', 'It', "'s", 'actually', 'quite', 'interesting', '!', 'Let', "'s", 'analyze', 'a', 'couple', 'of', 'sentences', 'and', 'figure', 'it', 'out', '.']
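To make the idea concrete, here is a minimal pure-Python sketch of tokenization with regular expressions. It is only a rough approximation and not how NLTK's `sent_tokenize` and `word_tokenize` work internally: the patterns and the helper names `simple_sent_tokenize` and `simple_word_tokenize` are assumptions made for this illustration.

```python
import re

def simple_sent_tokenize(text):
    """Split after '.', '!' or '?' when whitespace follows (naive: also splits after abbreviations)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def simple_word_tokenize(text):
    """Return runs of word characters (keeping inner apostrophes) and single punctuation marks."""
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", text)

sample = "Do you know how tokenization works? It's actually quite interesting!"
print(simple_sent_tokenize(sample))
print(simple_word_tokenize(sample))
```

Unlike NLTK's word tokenizer, this sketch keeps a contraction such as "It's" as one token. Choices like that are exactly what distinguishes one tokenizer from another.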


### Stopwords

Stopwords are very common words that carry little meaning on their own, such as *is, am, are, this, a, an, the,* etc. They are usually treated as noise in the text.

Before you can remove these words, you need a list of stopwords. NLTK provides one for English:

In [4]:
# create stopwords
from nltk.corpus import stopwords
stop_words=set(stopwords.words("english"))
print(stop_words)

{"don't", 'then', 'this', 'here', 'aren', 'with', 'those', 'was', 'why', 've', 'hasn', 'but', 'can', 'her', 'myself', 'its', 'if', 'been', "mightn't", 'before', "haven't", 'over', 'd', 'theirs', 'itself', 'so', "doesn't", 'after', 'wouldn', 'yours', 'or', 'up', 'didn', 'haven', 'when', 're', 'such', 'by', 'from', "it's", 'be', 'having', 'mustn', 'm', 'as', 'in', 'hadn', 'all', "hasn't", 'ours', 'because', 'have', 'other', 'down', 'some', "hadn't", 'at', 's', 'am', 'against', 'both', 'does', 'are', 'our', 'what', 'won', 'under', 'your', 'do', "she's", 'y', 'him', 'now', 'these', 'shouldn', "couldn't", "won't", 'it', 'which', 'about', 'should', 'there', 'a', 'to', 'doing', 'once', "shouldn't", 'most', 'ma', 'on', 'during', "you'll", 'more', 'out', 'weren', 'o', 'will', 'were', 'again', 'for', 'not', 'had', 'ourselves', 'through', 'too', 'shan', 'each', 'them', 'nor', 'just', "shan't", 'wasn', "you're", 'no', "should've", 'who', 'the', 'she', 'between', 'only', "mustn't", 'my', "didn't", 

### Print all Dutch stopwords - Exercise

Write a little program that prints all Dutch stopwords.

In [5]:
# print stopwords in dutch


### Remove stopwords and punctuation - Exercise

Now write a function `words()` that takes a string as input and returns all the words in that string, in lowercase and without the stopwords. Also get rid of punctuation. For the `input_text` above, the output should be:

```
Words without stopwords:  ['know', 'tokenization', 'works', 'actually', 'quite', 'interesting', 'let', 'analyze', 'couple', 'sentences', 'figure']
```

In [6]:
from nltk.tokenize import RegexpTokenizer

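def words(input_text):
    # \w+ matches runs of letters, digits and underscores, so punctuation is dropped
    tokenizer = RegexpTokenizer(r'\w+')
    output = []
    for word in tokenizer.tokenize(input_text):
        # compare in lowercase, because the stopword list is lowercase
        if word.lower() not in stop_words:
            output.append(word.lower())
    return output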

print("Words without stopwords: ", words(input_text))

Words without stopwords:  ['know', 'tokenization', 'works', 'actually', 'quite', 'interesting', 'let', 'analyze', 'couple', 'sentences', 'figure']
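One reason to strip stopwords is that frequency counts then reflect content words rather than filler. The sketch below illustrates this with `collections.Counter`. To stay self-contained it uses a tiny hand-picked stopword set (`TINY_STOP_WORDS`, an assumption for this example) instead of NLTK's list, and a plain regular expression instead of `RegexpTokenizer`.

```python
import re
from collections import Counter

# Tiny hand-picked stopword set, for illustration only (NLTK's English list is much longer)
TINY_STOP_WORDS = {"the", "a", "is", "it", "and", "of", "on"}

def content_word_counts(text, stop_words=TINY_STOP_WORDS):
    """Lowercase the text, keep alphanumeric tokens, drop stopwords, and count what is left."""
    tokens = re.findall(r"\w+", text.lower())
    return Counter(token for token in tokens if token not in stop_words)

text = "The cat sat on the mat and the cat slept."
counts = content_word_counts(text)
print(counts.most_common(2))
```

Without the filter, "the" would be the most frequent token in this sentence. With it, "cat" comes out on top.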


## 2. Stemming

When working with text, we have to deal with different forms of the same word. For example, the word *sing* can appear in many forms such as *sang, singer, singing,* and so on. When we analyze text, it's useful to reduce the different forms of a word to a common base form. This enables us to extract useful statistics from the input text.

Stemming is one way to achieve this. It is basically a process that cuts off the ends of words to extract their base forms. Let's see how to do it using NLTK.

In [7]:
from nltk.stem.porter import PorterStemmer
from nltk.stem.lancaster import LancasterStemmer
from nltk.stem.snowball import SnowballStemmer

input_words = ['writing', 'connections', 'connected', 'connecting', 'horse', 'randomize', 'possibly', 'provision', 'hospital', 'kept', 'scratchy', 'calves']

# create various stemmer objects
porter = PorterStemmer()
lancaster = LancasterStemmer()
snowball = SnowballStemmer('english')

# create a list of stemmer names for display
stemmer_names = ['PORTER', 'SNOWBALL', 'LANCASTER']
formatted_text = '{:>16}' * (len(stemmer_names) + 1)
print('\n', formatted_text.format('INPUT WORD', *stemmer_names), '\n', '='*68)

# stem each word and display the output
for word in input_words:
    output = [word, porter.stem(word), snowball.stem(word), lancaster.stem(word)]
    print(formatted_text.format(*output))


       INPUT WORD          PORTER        SNOWBALL       LANCASTER 
         writing           write           write            writ
     connections         connect         connect         connect
       connected         connect         connect         connect
      connecting         connect         connect         connect
           horse            hors            hors            hors
       randomize          random          random          random
        possibly         possibl         possibl            poss
       provision          provis          provis          provid
        hospital          hospit          hospit          hospit
            kept            kept            kept            kept
        scratchy        scratchi        scratchi        scratchy
          calves            calv            calv            calv


The difference between the three stemmers above is how strictly (aggressively) they trim a word to arrive at the base form. The Porter stemmer is the least strict ("possibly" becomes "possibl") and Lancaster is the strictest ("possibly" becomes "poss").

Note that the result might not be an actual word. All three stemmers give "calv" as the base form of "calves", which is not a real word.

On the other hand, all three stemmers reduced "connections", "connected" and "connecting" to the same real word, "connect".
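The idea of cutting off the ends of words can be illustrated with a deliberately naive suffix stripper. This is a toy sketch, not the Porter, Snowball or Lancaster algorithm: the suffix list and the minimum stem length are assumptions chosen for this example.

```python
# Suffixes to strip, checked in this order (an illustrative list, not taken from a real stemmer)
SUFFIXES = ("ions", "ing", "ion", "ed", "es", "s")

def naive_stem(word, min_stem_length=3):
    """Strip the first matching suffix, as long as enough of the word remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_length:
            return word[:-len(suffix)]
    return word

for word in ["connections", "connected", "connecting", "calves", "kept"]:
    print(f"{word:>12} -> {naive_stem(word)}")
```

Even this toy version collapses the three *connect* forms into one, and it produces the non-word "calv" for the same reason the real stemmers do: it only looks at the characters at the end of the word.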

## 3. Lemmatization - Exercise

Lemmatization is another way of reducing words to their base form. The lemmatization process uses a vocabulary and morphological analysis of words, normally aiming to remove only inflectional endings such as *ing* or *ed* and to return the dictionary form of a word. This base form is known as a lemma. If you lemmatize the word "calves", you should get "calf" as the output. One thing to note is that the output depends on whether the word is treated as a verb or a noun.

Lemmatization in NLTK relies on WordNet, a large lexical database of English, which we already downloaded above with `nltk.download('wordnet')`.

In [8]:
import nltk
from nltk.stem import WordNetLemmatizer

Now write a little program to lemmatize the same `input_words` as above. Use the `lemmatize` method of the `WordNetLemmatizer` class. This method has two parameters: the first is the word to be lemmatized, the second is the part of speech to assume (`pos='n'` for a noun lemma, `pos='v'` for a verb lemma). The output should be something like this:

```
               INPUT WORD         NOUN LEMMATIZER         VERB LEMMATIZER 
 ===========================================================================
                 writing                 writing                   write
             connections              connection             connections
               connected               connected                 connect
```

In [9]:

input_words = ['writing', 'connections', 'connected', 'connecting', 'horse', 'randomize', 'possibly', 'provision', 'hospital', 'kept', 'scratchy', 'calves']

# Create lemmatizer object
lemmatizer = WordNetLemmatizer()

# Create a list of lemmatizer names for display
lemmatizer_names = ['NOUN LEMMATIZER', 'VERB LEMMATIZER']
formatted_text = '{:>24}' * (len(lemmatizer_names) + 1)
print('\n', formatted_text.format('INPUT WORD', *lemmatizer_names), '\n', '='*75)

# Lemmatize each word and display the output
for word in input_words:
    output = [word, lemmatizer.lemmatize(word, pos='n'), lemmatizer.lemmatize(word, pos='v')]
    print(formatted_text.format(*output))


               INPUT WORD         NOUN LEMMATIZER         VERB LEMMATIZER 


                 writing                 writing                   write
             connections              connection             connections
               connected               connected                 connect
              connecting              connecting                 connect
                   horse                   horse                   horse
               randomize               randomize               randomize
                possibly                possibly                possibly
               provision               provision               provision
                hospital                hospital                hospital
                    kept                    kept                    keep
                scratchy                scratchy                scratchy
                  calves                    calf                   calve


We can see that the noun lemmatizer gives different results from the verb lemmatizer for words like *writing* or *calves*. If you compare these outputs to the stemmer outputs, you will see differences too. The lemmatizer outputs are all real words, whereas stemmer outputs may or may not be.
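Why the part of speech matters can be shown with a toy lookup table. The sketch below is not how WordNet works internally: `TOY_LEMMAS` is a hypothetical dictionary filled with a few entries from the output above, and words without an entry are returned unchanged, just as *hospital* is in that output.

```python
# (word, part of speech) -> lemma; a handful of entries copied from the output above
TOY_LEMMAS = {
    ("calves", "n"): "calf",
    ("calves", "v"): "calve",
    ("writing", "v"): "write",
    ("kept", "v"): "keep",
    ("connections", "n"): "connection",
}

def toy_lemmatize(word, pos="n"):
    """Look up the lemma for (word, pos); fall back to the word itself when there is no entry."""
    return TOY_LEMMAS.get((word, pos), word)

print(toy_lemmatize("calves", pos="n"))
print(toy_lemmatize("calves", pos="v"))
print(toy_lemmatize("hospital"))
```

The same surface form maps to different lemmas depending on the key, which is why a lemmatizer needs to be told (or has to guess) the part of speech.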

## 4. POS Tagging

The goal of Part-of-Speech (POS) Tagging is to identify the grammatical category of a given word (noun, pronoun, adjective, verb, adverb, etc.) based on its context. POS Tagging looks at the relationships within the sentence and assigns a corresponding tag to each word.

We will use spaCy, a different Python library for NLP, because it generally gives better results than NLTK for POS Tagging and Named Entity Recognition. First install spaCy.

In [10]:
pip install spacy

Collecting spacy
  Downloading spacy-3.7.2-cp38-cp38-win_amd64.whl.metadata (26 kB)
Collecting spacy-legacy<3.1.0,>=3.0.11 (from spacy)
  Using cached spacy_legacy-3.0.12-py2.py3-none-any.whl (29 kB)
Collecting spacy-loggers<2.0.0,>=1.0.0 (from spacy)
  Using cached spacy_loggers-1.0.5-py3-none-any.whl.metadata (23 kB)
Collecting murmurhash<1.1.0,>=0.28.0 (from spacy)
  Downloading murmurhash-1.0.10-cp38-cp38-win_amd64.whl.metadata (2.0 kB)
Collecting cymem<2.1.0,>=2.0.2 (from spacy)
  Downloading cymem-2.0.8-cp38-cp38-win_amd64.whl.metadata (8.6 kB)
Collecting preshed<3.1.0,>=3.0.2 (from spacy)
  Downloading preshed-3.0.9-cp38-cp38-win_amd64.whl.metadata (2.2 kB)
Collecting thinc<8.3.0,>=8.1.8 (from spacy)
  Downloading thinc-8.2.2-cp38-cp38-win_amd64.whl.metadata (15 kB)
Collecting wasabi<1.2.0,>=0.9.1 (from spacy)
  Using cached wasabi-1.1.2-py3-none-any.whl.metadata (28 kB)
Collecting srsly<3.0.0,>=2.4.3 (from spacy)
  Downloading srsly-2.4.8-cp38-cp38-win_amd64.whl.metadata (20 kB



Now we'll download a trained language model (we need a 'model', a 'brain', to do POS tagging and NER). We'll download the English and Dutch versions of a small language model that is pretrained on written text such as blogs, news and comments.
Restart the kernel afterwards.

In [11]:
# spacy download takes one pipeline name per call; extra arguments are passed on to pip
!py -m spacy download en_core_web_sm
!py -m spacy download nl_core_news_sm


Next, we'll try to load the language model (if this fails, see below).

In [12]:
import spacy
sp = spacy.load("en_core_web_sm")

OSError: [E050] Can't find model 'en_core_web_sm'. It doesn't seem to be a Python package or a valid path to a data directory.

If loading the language model fails, the model is probably not installed in the Python environment that your notebook kernel is using (your Python environment isn't aware that you've downloaded a spaCy model).
So let's use a shortcut that fixes this (at least it did for me):

In [None]:
!py -m spacy download en 

⚠ As of spaCy v3.0, shortcuts like 'en' are deprecated. Please use the full
pipeline package name 'en_core_web_sm' instead.
Collecting en-core-web-sm==3.4.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.4.0/en_core_web_sm-3.4.0-py3-none-any.whl (12.8 MB)
     --------------------------------------- 12.8/12.8 MB 11.5 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
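Since spaCy models are installed as regular Python packages, one way to check whether the notebook kernel can see a model is to ask the import system directly. The sketch below uses only the standard library; the helper names `package_is_importable` and `spacy_download_command` are made up for this example. Building the command from `sys.executable` targets the interpreter the kernel is running, which may differ from the one that `py` launches.

```python
import importlib.util
import sys

def package_is_importable(name):
    """Return True if the current interpreter can find a top-level package with this name."""
    return importlib.util.find_spec(name) is not None

def spacy_download_command(model_name):
    """Build (but do not run) a download command bound to the current interpreter."""
    return [sys.executable, "-m", "spacy", "download", model_name]

model = "en_core_web_sm"
if package_is_importable(model):
    print(f"{model} is visible to this kernel")
else:
    print("Not installed here; try running:", " ".join(spacy_download_command(model)))
```

The list returned by `spacy_download_command` can be passed to `subprocess.run` if you prefer to trigger the download from Python.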



After loading the core spaCy English model, we'll create a small spaCy document that we will use to perform Part-of-Speech tagging.

In [None]:
import spacy
sp = spacy.load("en_core_web_sm")

sen = sp("I like to play football. I hated it in my childhood though.")

The spaCy document object has several attributes that can be used to perform a variety of tasks. For instance, to print the text of the document, use the `text` attribute.

In [None]:
print(sen.text)

I like to play football. I hated it in my childhood though.


Similarly, the `pos_` attribute of a token returns its coarse-grained POS tag, while `tag_` returns the more detailed, fine-grained tag. To get a description of a tag, pass the tag name to `spacy.explain()`.

In [None]:
print(sen[7])
print(sen[7].pos_)
print(spacy.explain(sen[7].tag_))

hated
VERB
verb, past tense


We can print all the POS tags. To improve readability, the text is padded to a width of 12 characters and the POS tag to a width of 10, followed by the explanation.

In [None]:
for word in sen:
    print(f'{word.text:{12}} {word.pos_:{10}} {spacy.explain(word.tag_)}')

I            PRON       pronoun, personal
like         VERB       verb, non-3rd person singular present
to           PART       infinitival "to"
play         VERB       verb, base form
football     NOUN       noun, singular or mass
.            PUNCT      punctuation mark, sentence closer
I            PRON       pronoun, personal
hated        VERB       verb, past tense
it           PRON       pronoun, personal
in           ADP        conjunction, subordinating or preposition
my           PRON       pronoun, possessive
childhood    NOUN       noun, singular or mass
though       ADV        adverb
.            PUNCT      punctuation mark, sentence closer
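Once every token carries a tag, selecting or counting tokens by part of speech is simple. The sketch below works on plain `(text, pos)` pairs copied from the output above so that it runs without a model. In the notebook the same pairs could be built with `[(word.text, word.pos_) for word in sen]`.

```python
from collections import Counter

# (text, coarse POS tag) pairs copied from the output above
tagged = [
    ("I", "PRON"), ("like", "VERB"), ("to", "PART"), ("play", "VERB"),
    ("football", "NOUN"), (".", "PUNCT"), ("I", "PRON"), ("hated", "VERB"),
    ("it", "PRON"), ("in", "ADP"), ("my", "PRON"), ("childhood", "NOUN"),
    ("though", "ADV"), (".", "PUNCT"),
]

# Count how often each tag occurs
tag_counts = Counter(pos for _, pos in tagged)
print(tag_counts.most_common(3))

# Keep only the nouns and verbs
content_words = [text for text, pos in tagged if pos in {"NOUN", "VERB"}]
print(content_words)
```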


Another cool thing about spaCy is that you can use the displaCy dependency visualizer to show Part-of-Speech tags and syntactic dependencies. Try visualising some other sentences as well.

In [None]:
import spacy
from spacy import displacy

sp = spacy.load("en_core_web_sm")
sen = sp("I like to play football. I hated it in my childhood though.")
displacy.render(sen, style="dep", jupyter=True)

### POS Tagging in Dutch - Exercise

POS Tagging can be done in Dutch as well. You will probably have to install the Dutch model first.

Use this sentence as input: "De concentratie broeikasgassen die bijdragen aan de verandering van het klimaat, heeft opnieuw een recordhoogte bereikt." The output should look something like this (the exact tags depend on the version of the Dutch model):

```
De           DET        Art|bep|zijdofmv|neut__Definite=Def|PronType=Art
concentratie NOUN       N|soort|ev|neut__Number=Sing
broeikasgassen ADP        Prep|voor__AdpType=Prep
die          PRON       Pron|aanw|neut|attr__PronType=Dem
bijdragen    NOUN       N|soort|mv|neut__Number=Plur
aan          ADP        Prep|voor__AdpType=Prep
de           DET        Art|bep|zijdofmv|neut__Definite=Def|PronType=Art
verandering  NOUN       N|soort|ev|neut__Number=Sing
van          ADP        Prep|voor__AdpType=Prep
het          DET        Art|bep|onzijd|neut__Definite=Def|Gender=Neut|PronType=Art
klimaat      NOUN       N|soort|ev|neut__Number=Sing
,            PUNCT      Punc|komma__PunctType=Comm
heeft        VERB       V|hulp|ott|3|ev__Aspect=Imp|Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin
opnieuw      ADV        Adv|gew|geenfunc|stell|onverv__Degree=Pos
een          DET        Art|onbep|zijdofonzijd|neut__Definite=Ind|Number=Sing|PronType=Art
recordhoogte NOUN       N|soort|ev|neut__Number=Sing
bereikt      VERB       V|trans|verldw|onverv__Subcat=Tran|Tense=Past|VerbForm=Part
.            PUNCT      Punc|punt__PunctType=Peri

```

`spacy.explain()` cannot explain the Dutch fine-grained tags, so just print `tag_` in the third column.

In [None]:
!py -m spacy download nl

⚠ As of spaCy v3.0, shortcuts like 'nl' are deprecated. Please use the full
pipeline package name 'nl_core_news_sm' instead.
Collecting nl-core-news-sm==3.4.0
  Using cached https://github.com/explosion/spacy-models/releases/download/nl_core_news_sm-3.4.0/nl_core_news_sm-3.4.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('nl_core_news_sm')



In [None]:
import spacy
sp = spacy.load("nl_core_news_sm")

sen = sp('De concentratie broeikasgassen die bijdragen aan de verandering van het klimaat, heeft opnieuw een recordhoogte bereikt.')

In [None]:
for word in sen:
    print(f'{word.text:{12}} {word.pos_:{10}} {(word.tag_)}')

De           DET        LID|bep|stan|rest
concentratie NOUN       N|soort|ev|basis|zijd|stan
broeikasgassen NOUN       N|soort|mv|basis
die          PRON       VNW|betr|pron|stan|vol|persoon|getal
bijdragen    NOUN       N|soort|mv|basis
aan          ADP        VZ|init
de           DET        LID|bep|stan|rest
verandering  NOUN       N|soort|ev|basis|zijd|stan
van          ADP        VZ|init
het          DET        LID|bep|stan|evon
klimaat      NOUN       N|soort|ev|basis|onz|stan
,            PUNCT      LET
heeft        AUX        WW|pv|tgw|met-t
opnieuw      ADV        BW
een          DET        LID|onbep|stan|agr
recordhoogte NOUN       N|soort|ev|basis|zijd|stan
bereikt      VERB       WW|vd|vrij|zonder
.            PUNCT      LET
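The row for *broeikasgassen* is out of line because the width in a format specification is a minimum: shorter strings are padded, longer ones are printed in full. Here is a small sketch, using `(text, pos)` pairs copied from the output above, that derives the column width from the longest token instead of hard-coding 12. The helper name `format_rows` is an assumption for this example.

```python
def format_rows(pairs):
    """Left-align the first column to the longest token so that every row lines up."""
    width = max(len(text) for text, _ in pairs)
    return [f"{text:{width}} {pos}" for text, pos in pairs]

pairs = [("De", "DET"), ("concentratie", "NOUN"), ("broeikasgassen", "NOUN"), ("die", "PRON")]

# A fixed width of 12 pads short tokens but cannot shrink a 14-character one
print(len(f"{'De':12}"), len(f"{'broeikasgassen':12}"))

rows = format_rows(pairs)
for row in rows:
    print(row)
```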


## 5. Named Entity Recognition

Named Entity Recognition (NER) refers to identifying the words in a sentence that name an entity, e.g. a person, place or organization. Let's see how the spaCy library performs Named Entity Recognition. Look at the following script:

In [None]:
import spacy
sp = spacy.load('en_core_web_sm')

sen = sp('Manchester United is looking to sign Harry Kane for $90 million.')

print(sen.ents)

(Manchester United, Harry Kane, $90 million)


You can see that three named entities were identified. To see the details of each named entity, use its `text` and `label_` attributes, and pass the label to `spacy.explain()` to get a description.

In [None]:
for entity in sen.ents:
    print(entity.text + ' - ' + entity.label_ + ' - ' + str(spacy.explain(entity.label_)))

Manchester United - ORG - Companies, agencies, institutions, etc.
Harry Kane - PERSON - People, including fictional
$90 million - MONEY - Monetary values, including unit
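To collect entities by type, the labels can be used as dictionary keys. The sketch below uses the `(text, label)` pairs from the output above so that it runs without a model. In the notebook they could be built with `[(entity.text, entity.label_) for entity in sen.ents]`.

```python
from collections import defaultdict

# (entity text, label) pairs copied from the output above
entities = [
    ("Manchester United", "ORG"),
    ("Harry Kane", "PERSON"),
    ("$90 million", "MONEY"),
]

# Group the entity texts under their label
by_label = defaultdict(list)
for text, label in entities:
    by_label[label].append(text)

for label, texts in by_label.items():
    print(f"{label:8} {texts}")
```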


Like the POS tags, named entities can also be visualized with displaCy.

In [None]:
from spacy import displacy

sen = sp('Manchester United is looking to sign Harry Kane for $90 million. David wants 100 Million Dollars.')
displacy.render(sen, style='ent', jupyter=True)