<img src="http://hilpisch.com/tpq_logo.png" alt="The Python Quants" width="35%" align="right" border="0"><br>

# NLP Basics

**Word Embeddings**

&copy; Dr. Yves J. Hilpisch

<a href="http://tpq.io" target="_blank">http://tpq.io</a> | <a href="http://twitter.com/dyjh" target="_blank">@dyjh</a> | <a href="mailto:team@tpq.io">team@tpq.io</a>

## Imports

In [None]:
!git clone https://github.com/tpq-classes/natural_language_processing.git
import sys
sys.path.append('natural_language_processing')


In [None]:
import numpy as np
import pandas as pd

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

## Text Classification (1)

In [None]:
python_snippets = [
    "Python is a versatile language for web development, data analysis, and automation.",
    "Use Python's libraries like NumPy and Pandas for efficient data manipulation.",
    "Python supports multiple programming paradigms, including procedural, object-oriented, and functional programming.",
    "The Python community offers extensive documentation and a wealth of online resources.",
    "Python's syntax is designed to be readable and straightforward, making it beginner-friendly.",
    "Django and Flask are popular frameworks for developing web applications in Python.",
    "Automate repetitive tasks with Python scripts and save time in your workflow."
]

In [None]:
nlp_snippets = [
    "Natural Language Processing (NLP) enables computers to understand and process human language.",
    "NLP is used in applications like sentiment analysis, chatbots, and machine translation.",
    "Tokenization is a fundamental step in NLP, breaking text into meaningful units.",
    "Named Entity Recognition (NER) identifies proper nouns in text, such as names and locations.",
    "Vectorization converts text data into numerical form for machine learning models.",
    "Popular NLP libraries include NLTK, SpaCy, and Hugging Face Transformers.",
    "NLP combines computational linguistics and machine learning for language understanding."
]

In [None]:
llm_snippets = [
    "Large Language Models (LLMs) are advanced neural networks trained on vast text corpora.",
    "LLMs like GPT-3 generate human-like text based on input prompts.",
    "Applications of LLMs include content creation, code generation, and conversational agents.",
    "LLMs utilize transformers, a deep learning architecture, for efficient processing.",
    "Training LLMs requires substantial computational resources and large datasets.",
    "Fine-tuning LLMs on specific tasks enhances their performance and accuracy.",
    "Ethical considerations in LLMs include bias, misinformation, and data privacy."
]

In [None]:
# combine all snippets into one list of training texts
X = python_snippets + nlp_snippets + llm_snippets

In [None]:
len(X)

In [None]:
y = np.array(7 * [0] + 7 * [1] + 7 * [2])  # labels: 0 = Python, 1 = NLP, 2 = LLM
y

In [None]:
vectorizer = TfidfVectorizer(min_df=2, stop_words='english')

In [None]:
tfidf = vectorizer.fit_transform(X)
tfidf

In [None]:
X_ = tfidf.toarray()

In [None]:
X_.shape

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

In [None]:
model = LogisticRegression(C=1)
# model = MLPClassifier(hidden_layer_sizes=[128, 128],
#                       max_iter=1000)

In [None]:
model.fit(X_, y)

In [None]:
p = model.predict(X_)
p

In [None]:
accuracy_score(y, p)  # in-sample accuracy

In [None]:
# test snippets
test_snippets = [
    "Python's extensive standard library supports many common programming tasks.",
    "Jupyter notebooks are widely used for interactive Python development and data visualization.",
    "Python's dynamic typing and garbage collection simplify memory management.",
    "Sentiment analysis in NLP determines the emotional tone of text.",
    "Text classification categorizes text into predefined labels using NLP techniques.",
    "Word embeddings represent words as dense vectors for better machine learning performance.",
    "Transfer learning is often used in LLMs to adapt pre-trained models to new tasks.",
    "LLMs can summarize long documents, extracting key information efficiently.",
    "Prompt engineering tailors inputs to guide LLM outputs more effectively."
]

# Labels for the test snippets
new_labels = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])

In [None]:
tfidf_test = vectorizer.transform(test_snippets)
tfidf_test

In [None]:
p_test = model.predict(tfidf_test)
p_test

In [None]:
accuracy_score(new_labels, p_test)  # out-of-sample accuracy
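One reason out-of-sample accuracy can drop with such a small training set is that `transform` only counts terms already seen during `fit`. A minimal sketch of that behavior, assuming hypothetical texts (`train_texts`, `new_texts`, and `small_vectorizer` are illustrative names, not part of the notebook):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# hypothetical training texts, for illustration only
train_texts = [
    "python scripts automate tasks",
    "python libraries simplify data tasks",
]
small_vectorizer = TfidfVectorizer(stop_words='english')
small_vectorizer.fit(train_texts)

# the first text shares terms with the training texts, the second does not
new_texts = ["python tasks", "quantum physics"]
small_test = small_vectorizer.transform(new_texts)

# number of non-zero features per text
nonzero_per_text = small_test.getnnz(axis=1)
print(nonzero_per_text)
```

A text whose row is entirely zero gives the classifier nothing to work with beyond its intercept.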

## Text Classification (2)

In [None]:
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split

In [None]:
data = fetch_20newsgroups(categories=['sci.med', 'sci.crypt', 'sci.space'])

In [None]:
data['data'][0]

In [None]:
X, y = data.data, data.target

In [None]:
len(X)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=100)

In [None]:
pipeline = make_pipeline(TfidfVectorizer(stop_words='english'),
                         LogisticRegression())

In [None]:
pipeline.fit(X_train, y_train)

In [None]:
p_train = pipeline.predict(X_train)
p_train

In [None]:
accuracy_score(y_train, p_train)  # in-sample accuracy

In [None]:
p_test = pipeline.predict(X_test)

In [None]:
accuracy_score(y_test, p_test)  # out-of-sample accuracy

In [None]:
len(y_test)
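A single accuracy figure does not show which classes get confused with each other. The sketch below uses `confusion_matrix` from scikit-learn on hypothetical labels and predictions (`y_true` and `y_pred` are made up for illustration and are not results from the pipeline above).

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# hypothetical true labels and predictions, NOT results from the pipeline above
y_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 1, 2, 0, 2])

# rows: true class, columns: predicted class
cm = confusion_matrix(y_true, y_pred)
acc = accuracy_score(y_true, y_pred)
print(cm)
print(acc)
```

In the notebook, the same call would take `y_test` and `p_test` as arguments.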

## Word2Vec

_From ChatGPT._

Word2Vec is a popular natural language processing (NLP) technique for creating word embeddings, which are dense vector representations of words. Developed in 2013 by a team of researchers at Google led by Tomas Mikolov, Word2Vec models learn to map words into a continuous vector space, typically with a few hundred dimensions, in which semantically similar words are located close to each other.

### Key Concepts of Word2Vec

1. **Distributed Representations**: Unlike traditional one-hot encoding, which represents words as sparse vectors with many dimensions (one per unique word) and no meaningful distances between them, Word2Vec creates dense vectors where the semantic relationships between words are captured in the vector space.

2. **Training Approaches**:
   - **Continuous Bag of Words (CBOW)**: Predicts the target word from the words surrounding it, using a window of context words around the target position.
   - **Skip-gram**: Predicts the surrounding context words based on the target word. Given a word, it tries to predict the words in its neighborhood.
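The contrast between one-hot and dense representations can be illustrated with cosine similarity. This is a minimal sketch, assuming a three-word vocabulary and hand-picked dense vectors (the values in `dense` are hypothetical and not trained by Word2Vec).

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity of two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

vocab = ["cat", "dog", "car"]

# one-hot vectors: one dimension per vocabulary word
one_hot = np.eye(len(vocab))

# hypothetical dense vectors, chosen by hand for illustration (not trained)
dense = np.array([
    [0.9, 0.1],  # cat
    [0.8, 0.2],  # dog
    [0.1, 0.9],  # car
])

print(cosine(one_hot[0], one_hot[1]))  # cat vs dog, one-hot
print(cosine(one_hot[0], one_hot[2]))  # cat vs car, one-hot
print(cosine(dense[0], dense[1]))      # cat vs dog, dense
print(cosine(dense[0], dense[2]))      # cat vs car, dense
```

All distinct one-hot vectors are orthogonal, so every pair of words looks equally unrelated, whereas dense vectors can place "cat" closer to "dog" than to "car".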

### How Word2Vec Works

1. **Input**: A large corpus of text.
2. **Training**: The model is trained on the corpus to predict context words from a target word (skip-gram) or a target word from context words (CBOW).
3. **Output**: A set of word vectors where each word is represented by a dense vector of real numbers.
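How the two approaches turn a sentence into training examples can be sketched in plain Python. The helper names `skipgram_pairs` and `cbow_examples` are hypothetical, and the sketch only shows how examples are formed; real implementations add further steps such as negative sampling and the actual gradient updates.

```python
def skipgram_pairs(tokens, window):
    """Return (target, context) pairs for every position in the sentence."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

def cbow_examples(tokens, window):
    """Return (context_words, target) examples for every position in the sentence."""
    examples = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, target))
    return examples

sentence = ["cat", "sat", "on", "the", "mat"]
print(skipgram_pairs(sentence, window=1))
print(cbow_examples(sentence, window=1))
```

Skip-gram yields one example per (target, context) pair, while CBOW yields one example per position with all context words grouped together.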

### Benefits of Word2Vec

- **Semantic Relationships**: Words with similar meanings are close together in the vector space, and vector offsets can capture relations between words. For example, the vector for "king" - "man" + "woman" is close to the vector for "queen".
- **Efficient**: Word2Vec can be trained efficiently on large corpora using stochastic gradient descent combined with approximations such as negative sampling or hierarchical softmax, which avoid computing a full softmax over the vocabulary.
- **Generalization**: The vectors can be reused in various downstream NLP tasks, often improving performance by providing meaningful word representations.
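The "king" - "man" + "woman" arithmetic can be sketched with NumPy. The vectors below are hypothetical and constructed by hand so that the relation holds exactly; trained embeddings only satisfy it approximately.

```python
import numpy as np

# hypothetical 3-d embeddings, chosen by hand for illustration (not trained);
# the axes loosely stand for "royalty", "male", and "female"
vectors = {
    "king":  np.array([1.0, 1.0, 0.0]),
    "queen": np.array([1.0, 0.0, 1.0]),
    "man":   np.array([0.0, 1.0, 0.0]),
    "woman": np.array([0.0, 0.0, 1.0]),
    "apple": np.array([0.2, 0.2, 0.2]),
}

def cosine(a, b):
    """Cosine similarity of two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

query = vectors["king"] - vectors["man"] + vectors["woman"]

# rank the remaining words by cosine similarity to the query vector
candidates = [w for w in vectors if w not in ("king", "man", "woman")]
best = max(candidates, key=lambda w: cosine(query, vectors[w]))
print(best)
```

With a trained Gensim model, a comparable query is `model.wv.most_similar(positive=['king', 'woman'], negative=['man'])`, provided all three words are in the vocabulary.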

### Example in Python using Gensim

Here's how you can use the Gensim library to create Word2Vec embeddings:

1. **Install Gensim**:
   ```bash
   pip install gensim
   ```

2. **Train a Word2Vec Model**:
   ```python
   from gensim.models import Word2Vec

   # Sample sentences
   sentences = [
       ["cat", "sat", "on", "the", "mat"],
       ["dog", "barked", "at", "the", "mailman"],
       ["fish", "swims", "in", "the", "water"]
   ]

   # Train Word2Vec model
   model = Word2Vec(sentences, vector_size=10, window=2, min_count=1, sg=0)

   # Get embeddings for a word
   cat_vector = model.wv['cat']
   print("Embedding for 'cat':", cat_vector)
   
   # Find similar words (on a corpus this small, the ranking is not meaningful)
   similar_to_cat = model.wv.most_similar('cat')
   print("Words similar to 'cat':", similar_to_cat)
   ```

### Explanation

- **Training Data**: We use a small set of sample sentences.
- **Word2Vec Model**: We create a Word2Vec model using these sentences.
  - `vector_size`: The size of the word vectors.
  - `window`: The maximum distance between the current and predicted word within a sentence.
  - `min_count`: Ignores all words with a total frequency lower than this.
  - `sg`: Training algorithm, 0 for CBOW (Continuous Bag of Words), and 1 for skip-gram.
- **Get Embeddings**: We retrieve the embeddings for specific words like 'cat'.
- **Find Similar Words**: We find words similar to 'cat' based on the trained embeddings.

### Applications of Word2Vec

1. **Text Classification**: Dense embedding features can improve classification performance compared with sparse bag-of-words features.
2. **Semantic Analysis**: Understanding relationships between words.
3. **Machine Translation**: Capturing the semantic meaning of words helps in translating sentences more accurately.
4. **Recommendation Systems**: Finding similar items or content based on word embeddings.
5. **Information Retrieval**: Improving search results by understanding the semantic context of queries.
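For text classification, one common and simple way to turn word vectors into document features is to average them. This is a minimal sketch under stated assumptions: the `embeddings` table is hypothetical and hand-made, and `document_vector` is an illustrative helper, not a Gensim function.

```python
import numpy as np

# hypothetical word vectors, chosen by hand for illustration (not trained)
embeddings = {
    "python": np.array([0.9, 0.1, 0.0]),
    "code":   np.array([0.8, 0.2, 0.1]),
    "text":   np.array([0.1, 0.9, 0.2]),
    "tokens": np.array([0.0, 0.8, 0.3]),
}

def document_vector(tokens, embeddings, dim=3):
    """Average the vectors of all known tokens; return zeros if none are known."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

doc_a = document_vector(["python", "code", "rocks"], embeddings)  # "rocks" is unknown
doc_b = document_vector(["text", "tokens"], embeddings)
print(doc_a)
print(doc_b)
```

The resulting fixed-length vectors could replace the TF-IDF matrix as input to a classifier such as `LogisticRegression`.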

Word2Vec has been a foundational technique in NLP, leading to more advanced embedding methods like GloVe, FastText, and contextual embeddings such as BERT.
