<img src="http://certificate.tpq.io/tpq_logo.png" alt="The Python Quants" width="35%" align="right" border="0"><br>

# AI in Finance

**Workshop at Texas State University (October 2023)**

**_Natural Language Processing_**

Dr. Yves J. Hilpisch | The Python Quants GmbH | http://tpq.io

## Basic Imports

In [None]:
import nlp
import nltk
import requests
import numpy as np
import pandas as pd

## Sample Text

In [None]:
text = 'This is a short text. The text is used to illustrate NLP techniques with Python and the nltk package.'

## Python String Operations

In [None]:
len(text)

In [None]:
text.lower()

In [None]:
text.upper()

In [None]:
text.count('text')

In [None]:
text.replace('a', '|')

In [None]:
from collections import Counter

In [None]:
Counter(text.split())
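
`text.split()` splits on whitespace only, so `'text.'` and `'text'`, or `'The'` and `'the'`, are counted as different tokens. The following is a minimal sketch of a count that ignores case and punctuation, using only the standard library. The regular expression is an illustrative choice and not part of the workshop code.

```python
import re
from collections import Counter

text = ('This is a short text. The text is used to illustrate NLP '
        'techniques with Python and the nltk package.')

# Lowercase the text and keep runs of letters only, which drops punctuation
words = re.findall(r'[a-z]+', text.lower())

counts = Counter(words)
print(counts.most_common(3))
```

With this normalization, both occurrences of "text" fall under the same key.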

## `nltk` Operations

### Vocabulary

In [None]:
t = nltk.word_tokenize(text)
t

In [None]:
t = [w.lower() for w in t if len(w) > 3]
t

In [None]:
t = sorted(t)
t

In [None]:
# A set does not preserve order, so sort again after removing duplicates
t = sorted(set(t))
t

### Part-of-Speech Tagging

In [None]:
# nltk.download('averaged_perceptron_tagger')

In [None]:
t = nltk.word_tokenize(text)

In [None]:
# nltk.pos_tag?

In [None]:
nltk.pos_tag(t)

### Stemming 

In [None]:
text = 'I was running through the green fields. Later I was sitting on the green grass.'

In [None]:
from nltk.stem import PorterStemmer

In [None]:
t = nltk.word_tokenize(text)

In [None]:
stemmer = PorterStemmer(PorterStemmer.MARTIN_EXTENSIONS)

In [None]:
[stemmer.stem(w) for w in t]

### Lemmatization

In [None]:
from nltk.stem import WordNetLemmatizer

In [None]:
lemmatizer = WordNetLemmatizer()

In [None]:
[lemmatizer.lemmatize(w) for w in t]

### Removing Stop Words

In [None]:
nlp.stop_words[:5]

In [None]:
t = [w.lower() for w in nltk.word_tokenize(text)
         if w.lower() not in nlp.stop_words
             and len(w) > 3]

In [None]:
t

In [None]:
t = [stemmer.stem(w) for w in t]
t

In [None]:
t = [lemmatizer.lemmatize(w) for w in t]
t
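
The cells above rely on `nlp.stop_words`, which is assumed here to be a plain list of lowercase words provided by a local helper module. As a rough, self-contained sketch of the same filtering step, the example below uses a tiny hand-written stop-word set (hypothetical, for illustration only) and simple whitespace splitting instead of `nltk.word_tokenize`.

```python
# A tiny hand-written stop-word set, for illustration only
stop_words = {'i', 'was', 'the', 'on', 'through', 'later'}

text = ('I was running through the green fields. '
        'Later I was sitting on the green grass.')

# Split on whitespace, strip surrounding punctuation, and lowercase
words = [w.strip('.,').lower() for w in text.split()]

# Keep words that are not stop words and have more than three characters
filtered = [w for w in words if w not in stop_words and len(w) > 3]
print(filtered)
```

For a fuller list, `nltk.corpus.stopwords.words('english')` is available after `nltk.download('stopwords')`.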

## Topic Modeling

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

In [None]:
# Simple short text snippets
texts = [
    'I love Apple MacBooks.',
    'Grappling is a wonderful sport.',
    'Walking the dogs in nature nurtures your soul.'
]

In [None]:
vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(texts)

In [None]:
# Create an LDA topic model with 3 topics
lda = LatentDirichletAllocation(n_components=3, random_state=100)

In [None]:
%%time
# Train the LDA model on the text data
l = lda.fit_transform(X)

In [None]:
# Extract the topics and their corresponding words
feature_names = vectorizer.get_feature_names_out()

In [None]:
feature_names

In [None]:
for topic_idx, topic in enumerate(lda.components_):
    print(f"Topic {topic_idx}:")
    print(" ".join([feature_names[i] for i in topic.argsort()[:-5:-1]]))
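
The slice `[:-5:-1]` in the loop above is easy to misread: it selects the last four entries of the sorted index array, in reverse order, so four (not five) words are printed per topic. The sketch below reproduces the idiom in plain Python. The weights and the sorted-index helper are made up for illustration and stand in for one row of `lda.components_` and for `argsort()`.

```python
# Made-up topic weights for six vocabulary entries (illustration only)
feature_names = ['apple', 'dogs', 'love', 'nature', 'soul', 'sport']
weights = [0.3, 1.2, 0.4, 1.1, 0.9, 0.2]

# Indices ordered by ascending weight (plain-Python analogue of argsort())
order = sorted(range(len(weights)), key=weights.__getitem__)

# [:-5:-1] walks backwards from the end and stops before position -5,
# so it yields the four largest entries, largest first
top = order[:-5:-1]
print([feature_names[i] for i in top])
```

To print the top `n` words instead, the slice would be `[:-(n + 1):-1]`.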

## Word Embeddings

In [None]:
# %conda install -y gensim 

In [None]:
from gensim.models import Word2Vec

In [None]:
tokens = nltk.word_tokenize(text)

In [None]:
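# Passing the sentences to the constructor builds the vocabulary and trains the model
model = Word2Vec([tokens], min_count=1, vector_size=100, window=5, sg=1)
# Optional additional training pass over the same single sentence
model.train([tokens], total_examples=1, epochs=1)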

In [None]:
print(model.wv['green'].round(4))

In [None]:
print(model.wv['sitting'].round(4))

## Appendix: Word2Vec Embeddings

When generating word embeddings with Word2Vec, the output is a set of word vectors that represent the meaning of each word in the text. The output is typically a matrix or ndarray object that contains the word vectors, where each row corresponds to a word and each column corresponds to a dimension of the vector space. The number of dimensions is a hyperparameter that can be set when creating the Word2Vec model, and typically ranges from 50 to 300.

The output of Word2Vec can be interpreted in various ways, depending on the specific application and context. Some common ways to interpret the output include:

- Similarity: The cosine similarity between two word vectors can be used to measure the semantic similarity between the corresponding words. Words that are semantically similar will have similar word vectors and high cosine similarity values.
- Clustering: The word vectors can be clustered using unsupervised learning algorithms such as k-means or hierarchical clustering to group similar words together. This can be useful for tasks such as topic modeling or text classification.
- Visualization: The word vectors can be visualized in a lower-dimensional space using techniques such as principal component analysis (PCA) or t-SNE. This can help to identify patterns or relationships between words that are not apparent in the high-dimensional vector space.
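
The similarity measure mentioned in the list above can be sketched in a few lines. The three-dimensional vectors are made up for illustration and are not real embeddings. With a trained `gensim` model, `model.wv.similarity(w1, w2)` computes the same quantity.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up 3-dimensional "embeddings", for illustration only
vectors = {
    'dog': [0.9, 0.1, 0.0],
    'puppy': [0.8, 0.2, 0.1],
    'car': [0.0, 0.1, 0.9],
}

sim_close = cosine_similarity(vectors['dog'], vectors['puppy'])
sim_far = cosine_similarity(vectors['dog'], vectors['car'])
print(round(sim_close, 3), round(sim_far, 3))
```

Vectors pointing in nearly the same direction give a value close to 1, while nearly orthogonal vectors give a value close to 0.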

It is important to note that the output of Word2Vec is not always interpretable or meaningful, and may require further processing or analysis to be useful for downstream tasks. Additionally, the choice of hyperparameters such as the number of dimensions or the size of the training corpus can affect the quality and interpretability of the word embeddings.

In general, the output of Word2Vec should be treated as a set of numerical representations of words that capture their meaning in a high-dimensional vector space. These representations can be used as input to various machine learning models or algorithms to perform tasks such as text classification, sentiment analysis, or information retrieval.

## Appendix: Embedding Algorithms

### 1. Skip Gram

**Explanation:** 
Skip Gram is one of the architectures of the Word2Vec model. It works by using a current word to predict the context words (words surrounding the current word). Given a specific word, the Skip Gram model tries to maximize the probability of predicting surrounding words in a certain window size.
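
To make the prediction task concrete, the sketch below lists the (current word, context word) pairs that a window of a given size produces for one sentence. The helper `skipgram_pairs` is hypothetical and written for illustration only. Actual Word2Vec implementations add further steps (such as subsampling and negative sampling) that are not shown here.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for a symmetric context window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

tokens = ['the', 'dog', 'barks', 'loudly']
pairs = skipgram_pairs(tokens, window=1)
print(pairs)
```

Each pair is one training example: given the first word, the model should assign a high probability to the second.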

**Use Cases:** 
- Word similarity and analogy tasks.
- Feeding word vectors into downstream NLP tasks such as sentiment analysis, named entity recognition, etc.
- Visualizing word relationships.

**Example:**

Let's use the `gensim` library again to illustrate this with a simple example.

```python
from gensim.models import Word2Vec

# Data
sentences = [['dog', 'barks'], ['cat', 'meows'], ['bird', 'sings']]

# Training a Skip Gram model
model = Word2Vec(sentences, vector_size=100, window=2, min_count=1, sg=1, workers=4)
model.train(sentences, total_examples=len(sentences), epochs=1000)

# Finding similar words to 'dog'
similar_words = model.wv.most_similar('dog')
print(similar_words)
```

### 2. Continuous Bag of Words (CBOW)

**Explanation:** 
CBOW is the other architecture of the Word2Vec model. Instead of predicting the context from a word (as in Skip Gram), CBOW predicts a word from its context. Given a set of context words, CBOW tries to predict the word that is most likely to appear with those context words.
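
As a rough illustration of this setup, the sketch below builds the (context words, target word) examples for one sentence. The helper `cbow_examples` is hypothetical and only shows how the training examples are formed, not how the model is trained.

```python
def cbow_examples(tokens, window=2):
    """Return (context_words, target) examples for a symmetric window."""
    examples = []
    for i, target in enumerate(tokens):
        # Words before and after the target, up to `window` on each side
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        if context:
            examples.append((context, target))
    return examples

tokens = ['the', 'dog', 'barks', 'loudly']
examples = cbow_examples(tokens, window=1)
for context, target in examples:
    print(context, '->', target)
```

Each sentence position yields a single example in which all context words jointly predict the target.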

**Use Cases:** 
- Like Skip Gram, CBOW's embeddings can be used for word similarity tasks, visualizations, and as features in downstream NLP tasks.
- It is generally faster to train than Skip Gram, since it makes one prediction per context window rather than one per (word, context word) pair.

**Example:**

```python
from gensim.models import Word2Vec

# Data
sentences = [['dog', 'barks'], ['cat', 'meows'], ['bird', 'sings']]

# Training a CBOW model
model = Word2Vec(sentences, vector_size=100, window=2, min_count=1, sg=0, workers=4)
model.train(sentences, total_examples=len(sentences), epochs=1000)

# Finding similar words to 'dog'
similar_words = model.wv.most_similar('dog')
print(similar_words)
```

Both examples use a very small dataset for illustrative purposes only, so the similarities they report are not meaningful. The `sg` parameter of the `Word2Vec` constructor determines the architecture: `sg=1` selects Skip Gram and `sg=0` selects CBOW.

## Appendix: Tags

The tags have the following meaning:

- CC: coordinating conjunction
- CD: cardinal number
- DT: determiner
- EX: existential there
- FW: foreign word
- IN: preposition or subordinating conjunction
- JJ: adjective
- JJR: adjective, comparative
- JJS: adjective, superlative
- LS: list item marker
- MD: modal
- NN: noun, singular or mass
- NNS: noun, plural
- NNP: proper noun, singular
- NNPS: proper noun, plural
- PDT: predeterminer
- POS: possessive ending
- PRP: personal pronoun
- PRP\$: possessive pronoun
- RB: adverb
- RBR: adverb, comparative
- RBS: adverb, superlative
- RP: particle
- SYM: symbol
- TO: to
- UH: interjection
- VB: verb, base form
- VBD: verb, past tense
- VBG: verb, gerund or present participle
- VBN: verb, past participle
- VBP: verb, non-3rd person singular present
- VBZ: verb, 3rd person singular present
- WDT: wh-determiner
- WP: wh-pronoun
- WP\$: possessive wh-pronoun
- WRB: wh-adverb


<br><br><a href="http://tpq.io" target="_blank">http://tpq.io</a> | <a href="http://twitter.com/dyjh" target="_blank">@dyjh</a> | <a href="mailto:team@tpq.io">ai@tpq.io</a>