<img src="http://hilpisch.com/tpq_logo.png" alt="The Python Quants" width="35%" align="right" border="0"><br>

# NLP Basics

**Word Embeddings**

&copy; Dr. Yves J. Hilpisch

<a href="http://tpq.io" target="_blank">http://tpq.io</a> | <a href="http://twitter.com/dyjh" target="_blank">@dyjh</a> | <a href="mailto:team@tpq.io">team@tpq.io</a>

## Imports

In [None]:
!git clone https://github.com/tpq-classes/natural_language_processing.git
import sys
sys.path.append('natural_language_processing')


In [None]:
!pip install html2text

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# import cufflinks as cf

In [None]:
# cf.go_offline()
plt.style.use('seaborn-v0_8')
np.set_printoptions(suppress=True)
%config InlineBackend.figure_format = 'svg'

_Texts from ChatGPT._

## Embeddings

Embeddings in Natural Language Processing (NLP) are a way to represent words or phrases as numerical vectors in a continuous vector space. This representation allows machine learning models to understand and work with text data in a mathematical form. Embeddings capture the semantic meaning of words so that words with similar meanings are close to each other in this vector space.

Here's a simple example to help understand embeddings:

**Traditional Word Representation**: If we represent words using one-hot encoding, each word is represented as a binary vector where only one element is 1, and all other elements are 0. This representation doesn't capture any semantic relationship between words. For example:

* "cat" might be represented as [1, 0, 0]
* "dog" might be represented as [0, 1, 0]
* "fish" might be represented as [0, 0, 1]
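
The claim that one-hot vectors carry no semantic information can be checked numerically. The following is a minimal sketch using only the standard library; the dictionary `one_hot` and the variable `distances` are illustrative names, not part of the original notebook. It shows that every pair of distinct one-hot vectors is exactly the same distance apart.

```python
import math
from itertools import combinations

# One-hot vectors for a three-word vocabulary (values taken from the list above).
one_hot = {
    'cat': (1, 0, 0),
    'dog': (0, 1, 0),
    'fish': (0, 0, 1),
}

# Euclidean distance between every pair of distinct words.
distances = {
    (a, b): math.dist(one_hot[a], one_hot[b])
    for a, b in combinations(one_hot, 2)
}

for pair, dist in distances.items():
    print(pair, round(dist, 4))
```

Each pair differs in exactly two positions, so all three distances equal $\sqrt{2}$: "cat" is no closer to "dog" than it is to "fish".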

**Word Embeddings**: In embeddings, words are represented as dense vectors of real numbers, capturing semantic relationships. For instance:

* "cat" might be represented as [0.2, 0.8]
* "dog" might be represented as [0.1, 0.9]
* "fish" might be represented as [0.9, 0.1]

In this space, "cat" and "dog" would be closer to each other than to "fish", reflecting their semantic similarity.



In [None]:
def euclidean_distance(x, y):
    ''' Returns the Euclidean distance between two NumPy vectors of equal shape. '''
    return np.sqrt(((x - y) ** 2).sum())

In [None]:
cat = np.array((0.2, 0.8))

In [None]:
dog = np.array((0.1, 0.9))

In [None]:
fish = np.array((0.9, 0.1))

In [None]:
euclidean_distance(cat, dog)

In [None]:
euclidean_distance(cat, fish)

In [None]:
euclidean_distance(dog, fish)

In [None]:
c = np.array((cat, dog, fish))

In [None]:
c

In [None]:
d = pd.DataFrame(c, columns=['x', 'y'], index=['cat', 'dog', 'fish'])

In [None]:
d.plot.scatter(x='x', y='y');

## TF-IDF

The `TfidfVectorizer` class in scikit-learn converts a collection of raw documents into a matrix of TF-IDF (Term Frequency-Inverse Document Frequency) features. It combines two concepts from text processing: Term Frequency (TF) and Inverse Document Frequency (IDF).

### Term Frequency (TF)
Term Frequency measures how frequently a term occurs in a document. The raw count of a term in a document is divided by the total number of terms in the document. This helps to normalize the term frequency across documents of different lengths. The formula for term frequency $ TF(t,d) $ of term $ t $ in document $ d $ is:

$$ TF(t,d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d} $$
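
As a rough illustration of the formula above, the sketch below computes term frequencies for a single document. The helper `term_frequency`, the example sentence, and the plain whitespace tokenization are assumptions made for this illustration.

```python
from collections import Counter


def term_frequency(document):
    """Return TF(t, d) for every term t in a whitespace-tokenized document."""
    tokens = document.lower().split()
    counts = Counter(tokens)
    total = len(tokens)
    # Raw count of each term divided by the total number of terms.
    return {term: count / total for term, count in counts.items()}


tf = term_frequency('the cat saw the dog')
print(tf)
```

In this five-token sentence, "the" occurs twice and gets a term frequency of 2/5, while every other term gets 1/5. The frequencies of all terms in a document sum to one.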

### Inverse Document Frequency (IDF)
Inverse Document Frequency measures how informative a term is across the entire set of documents. It lowers the weight of terms that appear in many documents and raises the weight of terms that appear in only a few. The formula for inverse document frequency $ IDF(t,D) $ of term $ t $ in the document set $ D $ is:

$$ IDF(t,D) = \log \left( \frac{\text{Total number of documents in } D}{\text{Number of documents containing term } t} \right) $$
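
The formula above can be sketched in a few lines. The helper `inverse_document_frequency` is a hypothetical name, the three short documents are sample data, and the natural logarithm is an assumption, since the formula does not fix the base.

```python
import math


def inverse_document_frequency(documents):
    """Return IDF(t, D) for every term t that occurs in the document set D."""
    tokenized = [set(doc.lower().split()) for doc in documents]
    n_docs = len(tokenized)
    vocabulary = sorted(set().union(*tokenized))
    idf = {}
    for term in vocabulary:
        # Number of documents that contain the term at least once.
        doc_freq = sum(term in doc for doc in tokenized)
        idf[term] = math.log(n_docs / doc_freq)
    return idf


documents = ['this is a short sentence',
             'this is another one',
             'and yet another one']
idf = inverse_document_frequency(documents)
for term, value in idf.items():
    print(f'{term:10s} {value:.4f}')
```

Terms that occur in two of the three documents (such as "this" or "another") receive $\log(3/2)$, while terms that occur in only one document (such as "short") receive the larger value $\log(3)$.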

### TF-IDF
TF-IDF is the product of Term Frequency and Inverse Document Frequency. It balances the term frequency with the document frequency, highlighting words that are important in a document but not common in the entire corpus. The formula for TF-IDF score of term $ t $ in document $ d $ within document set $ D $ is:

$$ \text{TF-IDF}(t,d,D) = TF(t,d) \times IDF(t,D) $$
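
Putting the two parts together, the following sketch evaluates the product for individual terms. The function `tf_idf`, the sample documents, whitespace tokenization, and the natural logarithm are assumptions for illustration; the function also assumes that the term occurs in at least one document.

```python
import math

documents = ['this is a short sentence',
             'this is another one',
             'and yet another one']
tokenized = [doc.split() for doc in documents]
n_docs = len(tokenized)


def tf_idf(term, doc_index):
    """Return TF(t, d) * IDF(t, D) for one term and one document."""
    tokens = tokenized[doc_index]
    tf = tokens.count(term) / len(tokens)
    # Document frequency must be positive, i.e. the term occurs somewhere in D.
    doc_freq = sum(term in doc for doc in tokenized)
    idf = math.log(n_docs / doc_freq)
    return tf * idf


score_short = tf_idf('short', 0)  # occurs in one document only
score_this = tf_idf('this', 0)    # occurs in two documents
print(round(score_short, 4), round(score_this, 4))
```

Both terms appear once in the first document, so they share the same term frequency. The term "short" still scores higher because it appears in fewer documents overall.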

### What `TfidfVectorizer()` Does

1. **Tokenization**: Splits the text into individual terms (tokens).
2. **Building Vocabulary**: Constructs a vocabulary of terms from the documents.
3. **Calculating TF-IDF**: Computes the TF-IDF score for each term in each document.
4. **Output Matrix**: Produces a sparse matrix where each row represents a document and each column represents a term. The values in the matrix are the TF-IDF scores.
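
The four steps can be sketched by hand in plain Python. Note that scikit-learn's default settings differ from the textbook formulas above: the term frequency is the raw count, the IDF is smoothed as $\ln\left(\frac{1 + n}{1 + df(t)}\right) + 1$, and each document row is scaled to unit Euclidean length. The sketch below follows those documented defaults; it is an approximation for illustration, not the library's implementation, and it returns a dense list of lists instead of a sparse matrix. All variable names are illustrative.

```python
import math
import re
from collections import Counter

documents = ['this is a short sentence',
             'this is another one',
             'and yet another one']

# Step 1: tokenization (lowercase, tokens of at least two word characters).
token_pattern = re.compile(r'\b\w\w+\b')
tokenized = [token_pattern.findall(doc.lower()) for doc in documents]

# Step 2: vocabulary in alphabetical order.
vocabulary = sorted({token for tokens in tokenized for token in tokens})

# Step 3: smoothed IDF, then raw count times IDF for each document.
n_docs = len(documents)
idf = {}
for term in vocabulary:
    doc_freq = sum(term in tokens for tokens in tokenized)
    idf[term] = math.log((1 + n_docs) / (1 + doc_freq)) + 1

# Step 4: one row per document, one column per term, L2-normalized.
matrix = []
for tokens in tokenized:
    counts = Counter(tokens)
    row = [counts[term] * idf[term] for term in vocabulary]
    norm = math.sqrt(sum(value * value for value in row))
    matrix.append([value / norm for value in row])

print(vocabulary)
for row in matrix:
    print([round(value, 3) for value in row])
```

The single-character token "a" is dropped by the token pattern, so the vocabulary has eight terms. Because of the normalization, every row has Euclidean length one.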

Here's an example to demonstrate how `TfidfVectorizer()` works:

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
text = 'this is a short sentence. this is another one. and yet another one.'

In [None]:
# text = 'the dog is barking. the cat is sitting on a tree. the fish is swimming in the water.'

In [None]:
documents = text.split('.')
documents = [d.strip(' ') for d in documents if len(d) > 0]
documents

In [None]:
vectorizer = TfidfVectorizer()

In [None]:
tfidf = vectorizer.fit_transform(documents)

In [None]:
tfidf

In [None]:
print(tfidf)

In [None]:
tfidf.toarray().round(3)

In [None]:
pd.DataFrame(tfidf.toarray(), columns=vectorizer.get_feature_names_out())

In [None]:
d = pd.DataFrame(tfidf.toarray(), columns=vectorizer.get_feature_names_out()).T
d

In [None]:
# d.iplot(kind='scatter3d', x=0, y=1, z=2)

## Key Words

In [None]:
import pickle
import html2text

In [None]:
url = "https://certificate.tpq.io/apple_news_06_to_10_02_2024.pkl"

In [None]:
!wget $url

In [None]:
# Load the pickled news data from the downloaded file and close the file afterwards
with open(url.split('/')[-1], 'rb') as f:
    news = pickle.load(f)

In [None]:
type(news)

In [None]:
len(news)

In [None]:
news[0][:150]

In [None]:
converter = html2text.HTML2Text()

In [None]:
cnews = [converter.handle(n) for n in news[:10]]

In [None]:
print(cnews[0][:490])

In [None]:
vectorizer = TfidfVectorizer(min_df=4, stop_words='english')

In [None]:
tfidf = vectorizer.fit_transform(cnews)

In [None]:
tfidf

In [None]:
# print(tfidf)

In [None]:
vectorizer.get_feature_names_out()

In [None]:
# idf_ holds one inverse document frequency weight per term (not a TF-IDF score)
words = pd.DataFrame({'words': vectorizer.get_feature_names_out(),
                     'idf': vectorizer.idf_})

In [None]:
words.sort_values('idf').head()

In [None]:
[w for w in words.sort_values('idf')['words'].values if len(w) > 4][:10]
