# Introduction
In this session, I will introduce you to the basics of natural language processing in Python. We will touch on several approaches and models, though we will not go too in-depth. This tutorial builds on the basics of text data introduced in tutorial session 4.

## Load data
First, let's retrieve the publication data we have been using in this tutorial series.

In [None]:
import pandas as pd
df = pd.read_csv('../data/publications.txt', sep='\t', encoding='utf-8', dtype={'authors': 'string', 'journal_title': 'string', 'paper_title': 'string', 'abstract': 'string'})

We need text to process, so we drop any publication without an abstract. We will only be using abstracts for this exercise, but you could easily include paper titles, too.

In [None]:
df = df.loc[df['abstract'].notna()].reset_index(drop=True)
df

Let's also fetch the abstract from Vincent's publication that we used in tutorial 4, so we have a clear example to work with.

In [None]:
vincent_pub = df.loc[df['authors'].str.contains('Traag')]
vincent_abstract = vincent_pub['abstract'].tolist()[0]
vincent_abstract

# Text vectorization
Many natural language processing techniques rely on the ability to represent documents as a series of numbers, or vector. These vectors can then be compared to each other. For instance, one can compare the directions they point in to determine whether the original documents contain similar words.

## Bag-of-words models
A common approach in natural language processing is to treat text as simply a collection (bag) of words - ignoring their order, meaning, or grammatical structures within the text. Each word in the set of texts (the corpus) is treated as a column of data, or a dimension in a vector space, on which texts are then scored.

The most straightforward way of doing so is simply to count the number of times each word occurs in a document.
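
To make the idea concrete before turning to real data, here is a minimal sketch of a count-based bag-of-words representation. The two sentences in `toy_corpus` are invented for illustration and are not taken from the publication data, and the whitespace tokenization is deliberately naive.

```python
from collections import Counter

# Hypothetical toy corpus (not from the publication data)
toy_corpus = [
    "science maps science",
    "maps of citation networks",
]

# Tokenize naively on whitespace and count the terms in each document
toy_counts = [Counter(doc.split()) for doc in toy_corpus]

# The vocabulary is every unique term in the corpus, sorted alphabetically
toy_vocab = sorted(set(term for counts in toy_counts for term in counts))

# Each document becomes a vector with one dimension per vocabulary term;
# a Counter returns 0 for terms that do not occur in the document
toy_vectors = [[counts[term] for term in toy_vocab] for counts in toy_counts]

print(toy_vocab)
for vector in toy_vectors:
    print(vector)
```

Both vectors have the same length, one entry per term in the corpus vocabulary, even though the documents themselves differ in length and wording.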

Remember the end of tutorial 4, in which we split Vincent's abstract into individual tokens and removed stop-words and non-word tokens from the list. We repeat the same process here. If you did not attend tutorial 4, first download the NLTK "popular" packages (for instance with `nltk.download('popular')`) before running the cell below.

In [None]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import re
vincent_words = word_tokenize(vincent_abstract)
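english_stopwords = set(stopwords.words("english"))  # build the stop-word set once, not once per token
vincent_words = [w.lower() for w in vincent_words if w.lower() not in english_stopwords]
vincent_words = [w for w in vincent_words if not re.search('[^a-z]', w)]  # keep purely alphabetic tokens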
vincent_words

Some of these terms are clearly variants of others. The plural 'countries', for instance, can be rendered as 'country' instead, because the meaning of the two terms is identical (or close enough to make no difference). This is where lemmatization comes into play again. We use `WordNetLemmatizer` to simplify our terms.

In [None]:
from nltk.stem.wordnet import WordNetLemmatizer
wnl = WordNetLemmatizer()  # create the lemmatizer once and reuse it for every token
vincent_words = [wnl.lemmatize(w) for w in vincent_words]
vincent_words

Now all that remains is to count the number of times each remaining term occurs in the text. `Counter` is a useful little class that does exactly this. Let's also order the words from most-common to least-common.

In [None]:
from collections import Counter
Counter(vincent_words).most_common()

We have now successfully reduced Vincent's abstract to just a bag of words, completely ignoring any subtlety or grammar in the original text. However, this new list is far easier for a computer to interpret.

Of course, we can repeat the same process for our entire corpus. Let's write a quick class that combines the tokenization and lemmatization steps.

In [None]:
class LemmaTokenizer(object):
    def __init__(self):
        self.wnl = WordNetLemmatizer()
    def __call__(self, text):
        text = word_tokenize(text.lower())
        text = [w for w in text if not re.search('[^a-z]', w)]
        return [self.wnl.lemmatize(w) for w in text]

One of the reasons Python is such a useful tool is that many libraries exist that do the hard work for you. In our case, `sklearn` includes a handy class called `CountVectorizer` that can quickly apply the cleaning and tokenization steps and then vectorize our entire corpus. This produces a sparse matrix object, in which each row represents a document and each column represents a term.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
tf_vectorizer = CountVectorizer(tokenizer=LemmaTokenizer(),
                             strip_accents = 'unicode',
                             stop_words = 'english')
X = tf_vectorizer.fit_transform(df['abstract'].tolist())

<div class="alert alert-info">
    How many unique terms are in our resulting model?
</div>

[//]: # "X.shape[1]"

In [None]:
X.shape[1]

The matrix does not store the actual terms themselves, only the data contents. Documents (rows) are in the original order, while terms are listed alphabetically. The `vectorizer` object stores the terms that were extracted from the corpus.

In [None]:
terms = tf_vectorizer.get_feature_names_out().tolist()
terms[:10]

The complete vocabulary is stored in a dictionary, along with indices for each term. For instance, here's how to find the index for `'science'`.

In [None]:
vocab=tf_vectorizer.vocabulary_
vocab['science']

We can use this index to retrieve the number of times the word `'science'` features in the documents of our corpus.

In [None]:
X[:,vocab['science']].toarray()

<div class="alert alert-info">
    Which publication mentions the term <code>'science'</code> most often?
</div>

[//]: # "print(f'The term science occurs a maximum of {X[:,vocab['science']].max()} times at index {X[:,vocab['science']].argmax()}:')"
[//]: # "df.iloc[[X[:,vocab['science']].argmax()]]"

In [None]:
print(f"The term 'science' occurs a maximum of {X[:,vocab['science']].max()} times at index {X[:,vocab['science']].argmax()}:")
df.iloc[[X[:,vocab['science']].argmax()]]

But why go through all this trouble? What is this good for?

Well, we can use these vectors to compare documents. For instance, if we take the angle between two vectors, we can see how similar the documents are with regard to the words they use. An often-used measure for this is cosine similarity (or its complement, cosine distance, which is one minus the cosine similarity).
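
As a rough illustration of what the measure computes, the sketch below implements cosine similarity by hand: the dot product of two vectors divided by the product of their lengths. The function name `cosine` and the three toy vectors are made up for this example, and the function assumes neither vector is all zeros.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical term-count vectors over a three-term vocabulary
doc_a = [2, 1, 0]
doc_b = [4, 2, 0]  # same direction as doc_a, but twice as long
doc_c = [0, 0, 3]  # shares no terms with doc_a

print(cosine(doc_a, doc_b))  # close to 1: same mix of words
print(cosine(doc_a, doc_c))  # 0: no words in common
```

Because only the direction matters, a long document and a short document that use the same words in the same proportions are treated as (almost) identical.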

Let's see which abstract is most similar to Vincent's, according to this measure.

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

similarity_scores = cosine_similarity(X[vincent_pub.index], X)[0]

# we're not interested in Vincent's paper's similarity to itself, so let's set that to 0
similarity_scores[vincent_pub.index] = 0

print(f'paper {similarity_scores.argmax()} with cosine similarity {similarity_scores.max()}')

df.iloc[[similarity_scores.argmax()]]

In [None]:
df.iloc[[similarity_scores.argmax()]]['abstract'].tolist()[0]

## Term weighting
You do not have to use raw term counts. Sometimes it makes sense to weight term occurrences to increase or reduce the importance of certain terms. After all, in natural language, common terms are *far* more common than rare terms (see Zipf's law), but those very common terms are hardly ever the most interesting ones.

### Logarithmic scaling
Some researchers prefer to use logarithmic weighting to reduce the importance of overly common terms:

In [None]:
import numpy as np
X_ln = X.copy()
X_ln.data = np.log(X_ln.data+1) # log(1 + count): a count of 1 keeps a positive weight, and absent terms stay at 0

In [None]:
# list each term, its raw count, and its logarithmic weight
[(terms[t], X[vincent_pub.index[0],t], round(X_ln[vincent_pub.index[0],t], 3)) for d,t in zip(*X[vincent_pub.index].nonzero())]
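
To get a feel for how strongly this transformation compresses large counts, here is a small standalone sketch. The values in `raw_counts` are arbitrary examples, not counts from the corpus.

```python
import math

# Arbitrary example counts spanning two orders of magnitude
raw_counts = [1, 2, 10, 100]

# Same transformation as above: natural log of (count + 1)
log_weights = [round(math.log(c + 1), 3) for c in raw_counts]

for count, weight in zip(raw_counts, log_weights):
    print(count, weight)
```

A term that occurs a hundred times ends up with a weight only a few times larger than a term that occurs once, which is exactly the dampening effect we are after.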

### Tf-idf
Another popular way of weighting terms is tf-idf: the term frequency `tf` (the raw term count divided by the total number of terms in the document, a measure of the term's importance in the document) multiplied by the inverse document frequency `idf` (the logarithm of the number of documents divided by the number of documents containing the term, a measure of the term's specificity in the corpus).
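
The sketch below works through this textbook definition on a tiny, made-up corpus. The documents in `toy_docs` and the helper `tf_idf` are hypothetical and only serve to illustrate the formula. Note that scikit-learn's `TfidfVectorizer` by default uses a smoothed variant of idf and normalizes each document vector, so its scores will not match this plain version exactly.

```python
import math
from collections import Counter

# Hypothetical, already-tokenized toy corpus
toy_docs = [
    ["citation", "network", "citation"],
    ["citation", "journal"],
    ["citation", "model", "model"],
]

n_docs = len(toy_docs)

# Document frequency: in how many documents does each term occur?
doc_freq = Counter(term for doc in toy_docs for term in set(doc))

def tf_idf(term, doc):
    """Textbook tf-idf: relative term frequency times log inverse document frequency."""
    tf = doc.count(term) / len(doc)
    idf = math.log(n_docs / doc_freq[term])
    return tf * idf

for term in ["citation", "network"]:
    print(term, round(tf_idf(term, toy_docs[0]), 3))
```

Although `citation` is the most frequent term in the first document, it occurs in every document, so its idf (and therefore its tf-idf score) is zero. The rarer term `network` receives the higher score.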

tf-idf is a very practical way of weighting terms, even if it isn't very theoretically informed. Just as there is a `CountVectorizer`, `sklearn` offers a convenient `TfidfVectorizer` class we can use.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(tokenizer=LemmaTokenizer(), 
                                   strip_accents = 'unicode',
                                   stop_words = 'english')
X_tfidf = tfidf_vectorizer.fit_transform(df['abstract'].tolist())

In [None]:
[(terms[t], X[vincent_pub.index[0],t], round(X_tfidf[vincent_pub.index[0],t], 3)) for d,t in zip(*X[vincent_pub.index].nonzero())]

Note that terms with identical counts in Vincent's abstract now have different tf-idf scores.

<div class="alert alert-info">
    Judging just by these scores, which of the three terms <code>research</code>, <code>uk</code> and <code>uncertainty</code> occurs in most documents in the corpus?
</div>

One of the main benefits of tf-idf is that its main feature, the penalizing of terms that are ubiquitous throughout the corpus, allows us to get away with less pre-processing of the data. Stop-words are common words by definition, and will be heavily penalized by tf-idf, reducing their scores across the board. This means that it makes little difference whether you remove stop-words or not when using tf-idf, and that common words that are not on any stop-word lists are similarly reduced in importance.

<div class="alert alert-info">
    Using cosine similarity of <code>tf-idf</code> instead of raw term counts, which paper abstract is most similar to Vincent's abstract?
</div>

[//]: # "similarity_scores_tfidf = cosine_similarity(X_tfidf[vincent_pub.index], X_tfidf)[0]"

[//]: # "# we're not interested in Vincent's paper's similarity to itself, so let's set that to 0"
[//]: # "similarity_scores_tfidf[vincent_pub.index] = 0"

[//]: # "print(f'paper {similarity_scores_tfidf.argmax()} with cosine similarity {similarity_scores_tfidf.max()}')"

[//]: # "df.iloc[[similarity_scores_tfidf.argmax()]]"

[//]: # "df.iloc[[similarity_scores_tfidf.argmax()]]['abstract'].tolist()[0]"

In [None]:
similarity_scores_tfidf = cosine_similarity(X_tfidf[vincent_pub.index], X_tfidf)[0]

# we're not interested in Vincent's paper's similarity to itself, so let's set that to 0
similarity_scores_tfidf[vincent_pub.index] = 0

print(f'paper {similarity_scores_tfidf.argmax()} with cosine similarity {similarity_scores_tfidf.max()}')
print(df.iloc[[similarity_scores_tfidf.argmax()]]['abstract'].tolist()[0])
df.iloc[[similarity_scores_tfidf.argmax()]]

## Dimensionality reduction
One of the main downsides of bag-of-words approaches, weighted or not, is that the length of your vectors depends on the number of unique terms in your corpus after preprocessing. For large corpora, this can quickly become burdensome. Moreover, terms can mean similar things, yet never occur in the same document. Imagine that one paper mentions `'tf-idf'` in its abstract, while another only ever mentions it without the dash, `'tfidf'`. If we look purely at term co-occurrence in these two documents, `'tf-idf'` and `'tfidf'` appear unrelated, even though they stand for the same thing.

Dimensionality reduction can help alleviate both these issues, through a variety of techniques. Some find the principal components of the vector space, others try to find latent structures, while yet others use neural networks to optimize text prediction problems. In essence though, what all these techniques have in common is that they attempt to reduce the dimensionality of the problem.

### Topic modeling with LDA
A topic model is a statistical model used to discover abstract semantic structures in a corpus. It uses the bag-of-words representation of documents in a corpus to find latent "topics" - essentially, distributions - from which documents draw their terms. Very much simplified, each "topic" contains a distribution of different terms, and the assumption is that documents draw terms from a limited number of topics. Optimizing these distributions for a given number of topics then results in your topic model.

Latent Dirichlet allocation (LDA) is the most popular approach for this, though multiple alternatives exist. Typically, the user has to specify the number of topics they want to find, after which their preferred topic modeling approach tries to determine which distributions of terms over topics are optimal.
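
The assumption that documents "draw terms from topics" can be sketched as a small generative procedure. Everything below is invented for illustration: the two topics, their word probabilities, and the document's topic mixture. It only shows the generative story that a topic model assumes. Fitting a model works in the opposite direction, inferring such distributions from observed documents, and full LDA additionally draws the distributions themselves from Dirichlet priors, which this sketch leaves out.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical topics: each is a probability distribution over terms
topics = {
    "bibliometrics": {"citation": 0.5, "journal": 0.3, "impact": 0.2},
    "networks": {"cluster": 0.5, "node": 0.3, "community": 0.2},
}

# Hypothetical document: mostly about bibliometrics, a little about networks
doc_topic_mix = {"bibliometrics": 0.8, "networks": 0.2}

def generate_document(n_words):
    """Generate a bag of words by repeatedly picking a topic, then a term from it."""
    words = []
    for _ in range(n_words):
        # Step 1: pick a topic according to the document's topic mixture
        topic = rng.choices(list(doc_topic_mix), weights=list(doc_topic_mix.values()))[0]
        # Step 2: pick a term according to that topic's term distribution
        word_dist = topics[topic]
        words.append(rng.choices(list(word_dist), weights=list(word_dist.values()))[0])
    return words

toy_document = generate_document(10)
print(toy_document)
```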

Gensim is a useful library for topic modeling, though it uses its own corpus files. Let's try and find five topics in our corpus.

In [None]:
# first, let's make our corpus smaller. There are some non-English publications in our set, and they are causing trouble.
get_journals = df['journal_title'].str.contains('Scientometrics|Journal of Information Science')
get_authors = df['authors'].str.contains('Traag|van Eck|Waltman')
df_small = df.copy().loc[get_journals|get_authors].reset_index(drop=True)
vincent_pub_s = df_small.loc[df_small['authors'].str.contains('Traag')]

In [None]:
tf_vectorizer = CountVectorizer(tokenizer=LemmaTokenizer(),
                             strip_accents = 'unicode',
                             stop_words = 'english')
X_small = tf_vectorizer.fit_transform(df_small['abstract'].tolist())
vocab = tf_vectorizer.vocabulary_

In [None]:
from gensim import matutils
from gensim.models import LdaModel

corpus = matutils.Sparse2Corpus(X_small)
lda = LdaModel(corpus, num_topics=5, id2word={i: w for w, i in vocab.items()})

We can print the resultant 'topics' by their most relevant (highest-probability) words.

In [None]:
lda.show_topics(num_words=10)

Note that these probabilities are topic->word estimates, so the probabilities of all words within a topic sum to one. Similarly, if we pass a word vector to the LDA model, it will return a distribution of topics (summing to one) that it thinks this word vector is composed of.

In [None]:
vincent_vector = [[i, X_small[vincent_pub_s.index[0],i]] for i in X_small[vincent_pub_s.index].nonzero()[1]]
lda.get_document_topics(vincent_vector, minimum_probability=0)

Note that if we retrieve topics for terms, these do not sum to one, as they report the probability the term is drawn from each topic, not the distribution of topics over terms.

In [None]:
lda.get_term_topics(vocab['collaboration'], minimum_probability=0)

While topics generated by topic modeling are supposedly human-interpretable, this is not always easy or straightforward. In any case, we have now reduced our several thousands of terms to five topics - a large step down in dimensionality. In this case though, I would not call it an improvement.

In [None]:
# for reference, an attempt at running cosine similarity on Vincent's abstract's topic distribution vs the other abstracts in the set:
all_docs = [[[i, X_small[d,i]] for i in X_small[d].nonzero()[1]] for d in range(X_small.shape[0])]
all_docs = [lda.get_document_topics(w, minimum_probability=0) for w in all_docs]
all_docs = [[i[1] for i in w] for w in all_docs]

similarity_scores_lda = cosine_similarity([all_docs[vincent_pub_s.index.tolist()[0]]], all_docs)[0]
similarity_scores_lda[vincent_pub_s.index.tolist()[0]] = 0

print(f'paper {similarity_scores_lda.argmax()} with cosine similarity {similarity_scores_lda.max()}')
print(df_small.iloc[[similarity_scores_lda.argmax()]]['abstract'].tolist()[0])
df_small.iloc[[similarity_scores_lda.argmax()]]

## Word & document embedding
So far, all the approaches to natural language processing we tried have treated text as simply an unordered collection of words, completely ignoring any kind of grammar or semantic structure. Word embeddings, and subsequent approaches, try to improve on this by using words' neighbors in their representation. They do so by training a shallow neural network to predict words from their context, or context from given words. After training, the resulting neuron weights can function as a vector space.
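
To illustrate what "predicting context from given words" means in practice, the sketch below builds the (word, neighbor) pairs such a network would be trained on. The helper `context_pairs` and the toy sentence are hypothetical, and this is only the data preparation step, not the neural network itself.

```python
def context_pairs(tokens, window=2):
    """Pair every token with each neighbor at most `window` positions away."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

# Hypothetical, already-tokenized sentence
toy_tokens = ["citation", "network", "reveal", "research", "community"]
toy_pairs = context_pairs(toy_tokens, window=1)

for pair in toy_pairs:
    print(pair)
```

Words that tend to appear with the same neighbors produce similar training pairs and therefore end up close together in the resulting vector space. A larger window captures broader, more topical context, while a smaller one focuses on immediate neighbors.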

Doc2vec is a technique based on this, which includes the ability to tag documents along with their contents, and also generate embeddings for these tags.

In [None]:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

In [None]:
lemmatokenize = LemmaTokenizer()
# Series.iteritems() was removed in pandas 2.0; .items() is the supported equivalent
documents = [TaggedDocument(lemmatokenize(abstract), ['doc'+str(i)]) for i, abstract in df['abstract'].items()]
d2v = Doc2Vec(documents, vector_size=100, window=5, min_count=1, workers=4)

In [None]:
# which documents are most similar to Vincent's abstract?
most_similar = d2v.dv.most_similar('doc'+str(vincent_pub.index[0]))
i, s = most_similar[0]
i = int(i[3:])
print(f'document {i} is most similar with a cosine of {s}')
print(df.iloc[i]['abstract'])
df.iloc[[i]]

Embeddings also allow one to compare the embedded terms themselves. For instance, we can easily find the terms most similar to a given term:

In [None]:
d2v.wv.most_similar('science')

## Language models
The downside of embedding models is that they are limited to the data you provide to them. Language models, on the other hand, are pre-trained and encapsulate some understanding of language. These, in turn, can be used to generate vector representations of terms and documents. Many different language models are available. We will focus on SciBERT, a variant of BERT trained on scientific text.

In [None]:
from sentence_transformers import SentenceTransformer
from sentence_transformers import models
sbert=SentenceTransformer('allenai/scibert_scivocab_uncased')