# Basic Natural Language Processing (NLP)

The goal of NLP is to bridge the gap between human communication and computer understanding. It is used in a wide range of applications such as text classification, sentiment analysis, machine translation, chatbots, and more.

One of the main steps in NLP is transforming text data into a structured format that can be used for analysis. This is often done by converting the text into a tabular format. The main steps involved in this transformation are:

- Tokenization: Breaking the text down into smaller units, such as words or phrases, known as tokens.

- Whitespace and delimiter removal: Removing whitespace and delimiters from the text or list of tokens.

- Stop word removal: Removing commonly used words that do not carry much meaning, such as "the", "a", "an", etc.

- Stemming or Lemmatization: Reducing words to their root form, to simplify the analysis.

- Part-of-speech (POS) tagging: Identifying the grammatical role of each word in a sentence.

- Named entity recognition (NER): Identifying and classifying entities in the text, such as people, organizations, and locations.

- Creating a table: The final step is to organize the extracted information into a structured format, such as a spreadsheet or database table. The rows of the table represent individual documents, while the columns represent the extracted features, such as words or entities.

Overall, transforming text into a tabular format is a crucial step in NLP, as it enables the application of a wide range of analytical techniques to better understand and make use of the data.


### Document representation

![tabular](../images/text_tabular.png)


While humans can read text directly, machines cannot work with raw text as it is. A classic way to represent text data mathematically is the [vector space model](https://en.wikipedia.org/wiki/Vector_space_model). Specifically, each document is stored as a vector, where each element is a ___weight___ for one term. A simple and commonly used approach is to tokenize documents and use term frequencies or [term frequency-inverse document frequency (TF-IDF)](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) as weights.


There are various representations of text in tabular form:

- **binary (0,1)**: present or not
- **frequency**: count or proportion
- **(0,1,2)**: absent, present once, or present more than once
- **tf-idf**:

$$\text{tf-idf}_j = tf_j \cdot \log\left(\frac{N}{df_j}\right)$$

where

- $j$: a specific term
- $tf_j$: term frequency (number of occurrences of term $j$ in a document)
- $df_j$: document frequency (number of documents containing term $j$)
- $N$: number of documents
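As a rough illustration of the four representations listed above, the following sketch builds each one for the first document of a tiny made-up corpus, using only the standard library. The corpus, the helper name `represent`, and the choice of the natural logarithm are assumptions for the example; the formula above does not fix the base of the logarithm.

```python
import math
from collections import Counter

# Hypothetical toy corpus: three already-tokenized documents.
docs = [
    ["cat", "sat", "mat", "cat"],
    ["dog", "sat", "log"],
    ["cat", "dog", "dog", "dog"],
]

vocab = sorted({term for doc in docs for term in doc})
N = len(docs)

# Document frequency: number of documents that contain each term.
df = {term: sum(term in doc for doc in docs) for term in vocab}

def represent(doc):
    """Return the binary, count, capped (0,1,2) and tf-idf vectors for one document."""
    tf = Counter(doc)
    binary = [int(tf[t] > 0) for t in vocab]
    count = [tf[t] for t in vocab]
    capped = [min(tf[t], 2) for t in vocab]
    tfidf = [tf[t] * math.log(N / df[t]) for t in vocab]  # natural log assumed
    return binary, count, capped, tfidf

binary, count, capped, tfidf = represent(docs[0])
print(vocab)
print(binary)
print(count)
print(capped)
print([round(w, 3) for w in tfidf])
```

Note that with this formula a term occurring in every document would get the weight $tf_j \cdot \log(1) = 0$, which is the intended effect: terms that appear everywhere do not help to tell documents apart.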


##### Definitions

- **corpus**: collection of text documents
- **tokens**: words, numbers, acronyms, symbols, fixed-length character strings etc. (in a document, corpus etc.)
- **types**: the possible (distinct) words in a document, corpus etc.
- **delimiter**: “.”, “,”, “!” etc. (can be tokens too)
- **whitespace**: “ ”, “\n”, “\t”
- **dictionary**: collection of words mapping onto e.g. definitions, synonyms etc.
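The difference between tokens and types can be made concrete with a small sketch. The sentence below is made up for the example, and splitting on whitespace is a simplification of real tokenization.

```python
# Hypothetical example sentence, split on whitespace for simplicity.
text = "the cat sat on the mat and the dog sat on the log"

tokens = text.split()   # every occurrence counts as a token
types = set(tokens)     # distinct tokens only

print(len(tokens))  # 13
print(len(types))   # 8
```

The word "the" contributes four tokens but only one type.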



We will be using the library [nltk](https://www.nltk.org/) for working with natural language.

`pip install nltk`


In [None]:
import nltk

### Bag-of-Words (BOW)
While more sophisticated approaches are available, the BOW representation is still very popular due to its simplicity. In this representation, we ignore word order and treat each document as ___a bag of words___. Although this is a very strong assumption, it is often reasonable: we can frequently understand a sentence even if the words are randomly ordered. For example, we can easily understand the following sentence:

> sitting a chair is cat there on

In Python, we can use a `numpy.ndarray` or `list` to save this information:

In [None]:
sentence = ['sitting', 'a', 'chair', 'is', 'cat', 'there', 'on']
sentence
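Because a bag of words ignores order, two different orderings of the same words give the same bag. A minimal sketch of this idea, assuming `collections.Counter` as the bag and one possible reordering of the example words:

```python
from collections import Counter

scrambled = ['sitting', 'a', 'chair', 'is', 'cat', 'there', 'on']
# One possible reordering of the same words (an assumption for illustration).
reordered = ['there', 'is', 'a', 'cat', 'sitting', 'on', 'chair']

bag_scrambled = Counter(scrambled)
bag_reordered = Counter(reordered)

# The lists differ, but the bags (word -> count) are identical.
print(scrambled == reordered)          # False
print(bag_scrambled == bag_reordered)  # True
```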

## Tokenization

Tokenization is the process of splitting a text into smaller units called tokens. These tokens are usually words or sub-words, although they can also be punctuation marks, numbers, or other meaningful units of a text.

In English text, a simple approach is to split each document on spaces. We usually also want to remove punctuation, because it rarely provides valuable information. Finally, we may want to exclude some very common words (e.g., `the`, `we`, `you`, `is`), called [___stop words___](https://en.wikipedia.org/wiki/Stop_words).
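Before turning to NLTK, the three steps just described can be sketched with the standard library alone. The function name `simple_tokenize` and the very short stop word list are assumptions made for this example; NLTK ships a much longer list.

```python
import string

# Small illustrative stop word list (an assumption; NLTK ships a much longer one).
STOP_WORDS = {"the", "we", "you", "is", "a", "on"}

def simple_tokenize(document):
    """Lowercase, split on whitespace, strip punctuation, drop stop words."""
    tokens = []
    for raw in document.lower().split():
        token = raw.strip(string.punctuation)  # remove leading/trailing punctuation
        if token and token not in STOP_WORDS:
            tokens.append(token)
    return tokens

print(simple_tokenize("The cat is sitting on a chair, we think!"))
# ['cat', 'sitting', 'chair', 'think']
```

Splitting on whitespace is crude: it does not separate contractions such as "don't" into parts, which is one reason to prefer a dedicated tokenizer like the one used in the cells that follow.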

In [None]:
nltk.download('punkt')

In [None]:
text = "The first step in handling text is to break the stream \
of characters into words or, precisely, tokens. This is fundamental \
to further analysis. Without identifying the tokens, it is difficult \
to imagine extracting higher-level information from the document. \
Each token is an instance of a type, so the number of tokens is much \
higher than the number of types"

In [None]:
from nltk.tokenize import word_tokenize

tokens = word_tokenize(text.lower())
print(tokens)

In [None]:
from nltk.tokenize import sent_tokenize

sentences = sent_tokenize(text)
print(sentences)

#### Delimiter/punctuation removal

In [None]:
import string
string.punctuation

In [None]:
# Filter out the punctuation marks
tokens = [token for token in tokens if token not in string.punctuation]

print(tokens)


#### Stop words

In [None]:
nltk.download('stopwords')

In [None]:
from nltk.corpus import stopwords

# Use a separate name so the imported `stopwords` module is not overwritten
stop_words = stopwords.words("english")
print(stop_words)

In [None]:
tokens = [token for token in tokens if token not in stop_words]
print(tokens)

## Stemming and Lemmatization

Stemming and lemmatization are two common techniques used in Natural Language Processing (NLP) to reduce words to their base or root form. Both of these techniques are used to normalize words and reduce the dimensionality of the feature space, which can improve the performance of text classification and information retrieval models.

- **Stemming** is the process of reducing a word to its base form by removing the suffixes, prefixes, or inflectional endings of the word. The resulting word, called a stem, may not be a valid word in the language, but it represents the root of the original word. For example, the word "jumping" would be stemmed to "jump", and the word "cats" would be stemmed to "cat".

- **Lemmatization** is the process of reducing a word to its base form, called a lemma, by using the context and part of speech of the word. The resulting lemma is a valid word in the language and represents the base meaning of the original word. For example, the word "jumping" would be lemmatized to "jump", and the word "cats" would be lemmatized to "cat".

The main difference between stemming and lemmatization is that stemming is a heuristic, rule-based process that simply removes the suffixes or prefixes of a word to obtain the stem, whereas lemmatization uses more complex rules and the context of the word to determine the base form of the word.

Suppose we have the sentence "The cats were jumping over the fences". After tokenizing the sentence into individual words, we can apply stemming and lemmatization to each word:

- Stemming: The stems of the words in the sentence would be "the", "cat", "were", "jump", "over", "the", and "fenc". As you can see, some of the resulting stems, such as "fenc", are not valid words.
- Lemmatization: The lemmas of the words in the sentence would be "the", "cat", "be", "jump", "over", "the", and "fence". All of the resulting lemmas are valid words, and the verb form "were" is correctly lemmatized to "be".

In general, lemmatization tends to produce more accurate results than stemming, but it can be slower and more computationally expensive. The choice between stemming and lemmatization depends on the specific application and the trade-off between accuracy and efficiency.
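The contrast between the two techniques can be sketched with toy versions of each. Neither function below is the Porter stemmer or the WordNet lemmatizer: the suffix list, the minimum stem length, and the mini-dictionary are all assumptions chosen so that the example sentence behaves as described above.

```python
# Toy rules only: a real stemmer (e.g. Porter) has many more rules and conditions.
SUFFIXES = ("ing", "es", "s")

# Hypothetical mini-dictionary standing in for a lexical resource such as WordNet.
LEMMAS = {"cats": "cat", "were": "be", "jumping": "jump", "fences": "fence"}

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in the dictionary; fall back to the word itself."""
    return LEMMAS.get(word, word)

words = ["the", "cats", "were", "jumping", "over", "the", "fences"]
print([toy_stem(w) for w in words])
# ['the', 'cat', 'were', 'jump', 'over', 'the', 'fenc']
print([toy_lemmatize(w) for w in words])
# ['the', 'cat', 'be', 'jump', 'over', 'the', 'fence']
```

The stemmer needs no vocabulary but produces "fenc", while the lemmatizer returns valid words only for entries it knows about.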

In [None]:
from nltk.stem import PorterStemmer

ps = PorterStemmer()

stems = [ps.stem(token) for token in tokens]
print(stems)

In [None]:
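# Download the WordNet data and import the lemmatizer
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# lemmatize() treats every token as a noun unless a `pos` argument is given
lemma = [lemmatizer.lemmatize(token) for token in tokens]
print(lemma)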

## Counting Words

In [None]:
# Count the tokens and print those that occur more than once
freq_dist = nltk.FreqDist(tokens)
for word, count in freq_dist.items():
    if count > 1:
        print(word, "---", count)

In [None]:
# Same count on the lemmas, so inflected forms are merged
freq_dist = nltk.FreqDist(lemma)
for word, count in freq_dist.items():
    if count > 1:
        print(word, "---", count)
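The same kind of count can be sketched with `collections.Counter` from the standard library, which offers a similar interface for this purpose. The token list below is hypothetical and only stands in for the `tokens` built earlier.

```python
from collections import Counter

# Hypothetical token list standing in for the `tokens` built earlier.
example_tokens = ["token", "type", "token", "text", "token", "type"]

counts = Counter(example_tokens)

# Keep only the terms that occur more than once, most frequent first.
repeated = [(term, n) for term, n in counts.most_common() if n > 1]
print(repeated)  # [('token', 3), ('type', 2)]
```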

## Word groups

In [None]:
# Bigrams
print(list(nltk.bigrams(tokens)))

In [None]:
# Trigrams
print(list(nltk.trigrams(tokens)))

In [None]:
# N-grams
print(list(nltk.ngrams(tokens, 4)))
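To make clear what these helpers return, here is a minimal standard-library sketch of the same idea: an n-gram is every run of `n` consecutive tokens. The function name `ngrams` and the word list are chosen for this example only.

```python
def ngrams(sequence, n):
    """Return all runs of n consecutive items as tuples."""
    return list(zip(*(sequence[i:] for i in range(n))))

words = ["the", "cat", "sat", "on", "the", "mat"]
print(ngrams(words, 2))
print(ngrams(words, 3))
```

A sequence of `m` tokens yields `m - n + 1` n-grams, so the six words above give five bigrams and four trigrams.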


## POS Taggers

POS (Part-of-Speech) tagging, also known as grammatical tagging, is the process of labeling each word in a text with its corresponding part of speech, such as noun, verb, adjective, or adverb. 

[POS Tagging](https://www.nltk.org/book/ch05.html)

In [None]:
nltk.download('averaged_perceptron_tagger')

pos = nltk.pos_tag(lemma)
print(pos)
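`nltk.pos_tag` returns a list of `(token, tag)` tuples, which makes it easy to filter by word class. The sketch below uses a hand-written list in that shape; the tags are illustrative Penn Treebank tags, not actual tagger output.

```python
# Hand-written (token, tag) pairs in the shape that nltk.pos_tag returns;
# the tags are illustrative Penn Treebank tags, not actual tagger output.
tagged = [
    ("The", "DT"), ("cats", "NNS"), ("were", "VBD"), ("jumping", "VBG"),
    ("over", "IN"), ("the", "DT"), ("fences", "NNS"),
]

# Penn Treebank noun tags start with "NN", verb tags with "VB".
nouns = [word for word, tag in tagged if tag.startswith("NN")]
verbs = [word for word, tag in tagged if tag.startswith("VB")]

print(nouns)  # ['cats', 'fences']
print(verbs)  # ['were', 'jumping']
```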

## Named Entity Recognition

Named Entity Recognition (NER) involves identifying and categorizing named entities in a text into predefined categories such as people, organizations, locations, products, or dates. Named entities are typically proper nouns or noun phrases that refer to specific entities in the real world.

In [None]:
nltk.download('maxent_ne_chunker')
nltk.download('words')

In [None]:
Text = "The Russian president Vladimir Putin is in the Kremlin"
Tokenize = nltk.word_tokenize(Text)
POS_tags = nltk.pos_tag(Tokenize)
NameEn = nltk.ne_chunk(POS_tags)
print(NameEn)

We can also use the library [spacy](https://spacy.io/api/data-formats#section-named-entities).

Install spaCy with `pip install spacy` and download the model [en_core_web_sm](https://spacy.io/models/en) with `python -m spacy download en_core_web_sm`.

In [None]:
import spacy
from spacy import displacy
from collections import Counter
import en_core_web_sm

nlp = en_core_web_sm.load()

In [None]:
doc = nlp(Text)
print([(X.text, X.label_) for X in doc.ents])

## Text in tabular form

In [None]:
with open('data/anmeldelser.txt', 'r') as file:
    anmeldelser = file.readlines()

In [None]:
anmeldelser

In [None]:
import re
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download('wordnet')

# Use a separate name so the imported `stopwords` module is not overwritten
danish_stop_words = set(stopwords.words("danish"))

# Note: WordNetLemmatizer relies on the English WordNet, so most Danish words pass through unchanged
lemmatizer = WordNetLemmatizer()

def tokenize(text: str, remove_num: bool = True, custom_remove: tuple = ('"', '``', "''")) -> list:
    """
    Tokenizes and lemmatizes the input text.

    Args:
        text (str): The input text to tokenize and lemmatize.
        remove_num (bool): Whether to remove numbers from the text. Defaults to True.
        custom_remove (tuple): Custom tokens to remove from the text. Defaults to ('"', '``', "''").

    Returns:
        list: A list of lemmatized tokens.

    Example (the output shown assumes an English stop word list):
        >>> text = "This is an example sentence with punctuation, numbers 123, and stop words."
        >>> tokenize(text)
        ['example', 'sentence', 'punctuation', 'number', 'stop', 'word']
    """
    if remove_num:
        text = re.sub(r'[0-9]+', '', text)
    tokens = [token.lower() for token in word_tokenize(text)]
    tokens = [token for token in tokens if token not in danish_stop_words
              and token not in string.punctuation and token not in custom_remove]
    lemmatized = [lemmatizer.lemmatize(item) for item in tokens]
    return lemmatized

In [None]:
# Initialize a TFIDF object, applying some settings
tfidf = TfidfVectorizer(analyzer='word',
                        sublinear_tf=True,
                        max_features=5000,
                        tokenizer=tokenize)

In [None]:
# TF-IDF matrix
X = tfidf.fit_transform(anmeldelser)

# extracting feature names
tfidf_tokens = tfidf.get_feature_names_out()

In [None]:
import pandas as pd

result = pd.DataFrame(
    data=X.toarray(), 
    columns=tfidf_tokens
)

In [None]:
result
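The weights in this table do not follow the plain tf-idf formula given earlier exactly. With its default settings, scikit-learn's `TfidfVectorizer` uses a smoothed idf, $\ln\frac{1+N}{1+df_j} + 1$, and scales each document vector to unit length (L2 norm); with `sublinear_tf=True`, as set above, the term frequency is replaced by $1 + \ln(tf_j)$. The sketch below reproduces that calculation by hand for one document. The counts and the two Danish example terms are made up for illustration.

```python
import math

N = 3                        # hypothetical number of documents
tf = {"god": 2, "film": 1}   # hypothetical term counts in one document
df = {"god": 1, "film": 3}   # hypothetical document frequencies

def sklearn_style_weight(term):
    """Unnormalised weight: sublinear tf times smoothed idf."""
    sublinear_tf = 1 + math.log(tf[term])
    smooth_idf = math.log((1 + N) / (1 + df[term])) + 1
    return sublinear_tf * smooth_idf

raw = {term: sklearn_style_weight(term) for term in tf}

# L2-normalise so the document vector has unit length.
norm = math.sqrt(sum(w * w for w in raw.values()))
weights = {term: w / norm for term, w in raw.items()}
print(weights)
```

Because of the added 1 in the smoothed idf, a term that occurs in every document (here "film") still gets a non-zero weight, unlike in the plain formula.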

Return to [overview](../00_overview.ipynb)