MODULE 4 | LESSON 1


---


# **Similarity Measures**


|  |  |
|:---|:---|
|**Reading Time** | 45 minutes  |
|**Prior Knowledge** |  Basic statistics: Familiarity with concepts like mean, standard deviation, correlation, and distributions. <br>Linear algebra: Basic knowledge of vectors, matrices, and operations like dot product. <br>Probability: A basic understanding of probability concepts to aid in interpreting statistical measures and their significance. <br>Financial and Investing Knowledge: Basic understanding of financial markets, instruments, and investment strategies; <br>Familiarity with risk assessment and management concepts; Basic knowledge of sentiment analysis and its role in understanding market trends. |
|**Keywords** | Euclidean Distance, Manhattan Distance, Cosine Similarity, Jaccard Index, Word Mover's Distance (WMD), Dynamic Time Warping, Pearson Correlation, <br>Spearman Rank Correlation, Term Frequency-Inverse Document Frequency (TF-IDF), Text Analysis, Natural Language Processing (NLP), Information Retrieval |

---

*In this lesson we explore similarity measures and their use in financial markets, with a focus on Term Frequency-Inverse Document Frequency (TF-IDF) for analyzing text data such as news and social media to gain insights for investment decisions. We cover different types of similarity measures, the benefits of TF-IDF, and its applications in sentiment analysis, risk management, and other investment strategies.*

In [1]:
# Load libraries
import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer


## **1 Similarity Measures**

Similarity measures are methods used to quantify how alike two data objects are. They are essential in various fields, including data mining, machine learning, and information retrieval. The choice of similarity measure depends on the type of data and the desired interpretation of similarity. Common types of Similarity Measures are:

**Distance Measures:** These measures quantify dissimilarity. Smaller distances indicate higher similarity:

 - Euclidean distance
 - Manhattan distance
 - Minkowski distance
 - Word Mover's Distance (WMD)
 - Dynamic Time Warping

**Similarity Coefficients:** These measures directly quantify similarity. Higher values indicate greater similarity:

 - Cosine similarity
 - Jaccard index
 - Pearson correlation coefficient
 - Spearman Rank Correlation


Below we consider each in further detail.











### **1.1 Distance Measures**

**1. Euclidean distance**

 - **Type:** Distance measure
 - **Data Type:** Numerical data
 - **Description:** Measures the straight-line distance between two points in Euclidean space.
 - **Formula:** $\sqrt{\sum (x_i - y_i)^2}$ (where $x$ and $y$ are the data points, and $i$ represents the dimensions)
 - **Interpretation:** Smaller distance = higher similarity
 - **Use Cases:**
   - K-means clustering
   - K-nearest neighbors
   - Finding similar data points


**2. Manhattan distance**

 - **Type:** Distance measure
 - **Data Type:** Numerical data
 - **Description:** Measures the distance between two points by summing the absolute differences of their coordinates. Also known as city block distance or L1 distance.
 - **Formula:** $\sum |x_i - y_i|$
 - **Interpretation:** Smaller distance = higher similarity
 - **Use Cases:**
   - Recommender systems
   - Feature selection
   - Comparisons where robustness to outliers matters


**3. Minkowski distance**

 - **Type:** Distance measure
 - **Data Type:** Numerical data
 - **Description:** A generalized distance metric that includes Euclidean and Manhattan distances as special cases.
 - **Formula:** $(\sum |x_i - y_i|^p)^{(1/p)}$ (where $p$ is the order of the distance)
   - $p = 1$: Manhattan distance
   - $p = 2$: Euclidean distance
 - **Interpretation:** Smaller distance = higher similarity
 - **Use Cases:**
   - Adaptable to different data distributions
   - Experimenting with different distance metrics
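
The relationship between the three distances above can be checked numerically. The following is a minimal sketch; the helper name `minkowski_distance` and the sample vectors are illustrative assumptions, not part of the lesson's data.

```python
import numpy as np

def minkowski_distance(x, y, p):
    """Minkowski distance of order p between two equal-length vectors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

# Illustrative vectors; coordinate differences are 3, 4 and 0
x = [1.0, 2.0, 3.0]
y = [4.0, 6.0, 3.0]

manhattan = minkowski_distance(x, y, p=1)   # 3 + 4 + 0 = 7.0
euclidean = minkowski_distance(x, y, p=2)   # sqrt(9 + 16) = 5.0
order_3 = minkowski_distance(x, y, p=3)     # a higher order weights large differences more

print(manhattan, euclidean, order_3)
```

Setting `p=1` reproduces the Manhattan distance and `p=2` the Euclidean distance, which is why Minkowski distance is convenient when experimenting with different metrics.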


**4. Word Mover's Distance (WMD)**

 - **Type:** Distance measure
 - **Data Type:** Text data (documents, sentences)
 - **Description:** Measures the dissimilarity between two text documents based on the "travel cost" of moving words from one document to match the words in the other document. Uses word embeddings to represent words in a semantic space.
 - **Interpretation:** Lower WMD indicates higher similarity.
 - **Use Cases:**
   - Document classification
   - Text similarity tasks where semantic meaning is important
   - Clustering documents


**5. Dynamic Time Warping**

 - **Type:** Distance measure
 - **Data Type:** Time series data
 - **Description:** Measures the similarity between two time series that may vary in speed or have time shifts. It finds the optimal alignment between the time series by "warping" the time axis.
 - **Interpretation:** Lower DTW distance indicates higher similarity.
 - **Use Cases:**
   - Speech recognition
   - Gesture recognition
   - Financial time series analysis
   - Anomaly detection in time series
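
To make the idea of "warping" concrete, here is a minimal sketch of the classic dynamic-programming formulation of DTW. It assumes the absolute difference as the local cost and applies no warping-window constraint; the function name `dtw_distance` and the sample series are illustrative, and dedicated libraries offer faster and more configurable implementations.

```python
import numpy as np

def dtw_distance(x, y):
    """DTW distance between two 1-D series using |x_i - y_j| as local cost."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n, m = len(x), len(y)

    # cost[i, j] = minimal accumulated cost of aligning x[:i] with y[:j]
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(x[i - 1] - y[j - 1])
            cost[i, j] = local + min(
                cost[i - 1, j],      # x advances, y repeats
                cost[i, j - 1],      # y advances, x repeats
                cost[i - 1, j - 1],  # both advance
            )
    return cost[n, m]

# The second series is the first one delayed by one step
series_a = [0, 1, 2, 1, 0]
series_b = [0, 0, 1, 2, 1, 0]

print(dtw_distance(series_a, series_b))  # 0.0: the shapes match after alignment
```

A point-by-point comparison would not even be defined here because the series have different lengths, whereas DTW aligns them and finds no remaining difference.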


### **1.2 Similarity Coefficients**

**1. Cosine similarity**

 - **Type:** Similarity coefficient
 - **Data Type:** Numerical vectors (often used for text data after vectorization)
 - **Description:** Measures the cosine of the angle between two vectors.
 - **Formula:** $(A \cdot B) / (\lVert A \rVert \lVert B \rVert)$ (where $A$ and $B$ are vectors, $\cdot$ is the dot product, and $\lVert \rVert$ represents magnitude)
 - **Range:** -1 to 1 (1 = same direction, 0 = orthogonal vectors, -1 = opposite directions). For vectors with only non-negative entries, such as TF-IDF weights, the range is 0 to 1.
 - **Use Cases:**
   - Document similarity
   - Information retrieval
   - Recommender systems


**2. Jaccard index**

 - **Type:** Similarity coefficient
 - **Data Type:** Sets (categorical data)
 - **Description:** Measures the similarity between two sets by dividing the number of elements they have in common by the number of distinct elements across both sets (the size of their union).
 - **Formula:** $|A \cap B| / |A \cup B|$ (where $\cap$ is intersection and $\cup$ is union)
 - **Range:** 0 to 1 (1 = perfect similarity, 0 = no similarity)
 - **Use Cases:**
   - Text analysis (comparing documents based on word sets)
   - Image recognition (comparing image features)
   - Recommender systems
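
As a quick illustration of the Jaccard formula applied to word sets, consider the sketch below. The two headlines are made up for this example, and treating two empty sets as identical is a convention assumed here rather than something the formula prescribes.

```python
def jaccard_index(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two collections treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumed convention: two empty sets are considered identical
    return len(a & b) / len(a | b)

# Hypothetical headlines, tokenized by whitespace
headline_1 = "fed raises interest rates".split()
headline_2 = "fed holds interest rates steady".split()

# Shared words: fed, interest, rates (3); distinct words overall: 6
print(jaccard_index(headline_1, headline_2))  # 0.5
```

Note that the Jaccard index ignores how often a word occurs and in which order; only membership in the set matters.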


**3. Pearson correlation coefficient**

 - **Type:** Similarity coefficient (but can be interpreted as a distance measure when considering 1 - correlation)
 - **Data Type:** Numerical data
 - **Description:** Measures the linear relationship between two variables.
 - **Range:** -1 to 1 (1 = perfect positive correlation, 0 = no correlation, -1 = perfect negative correlation)
 - **Use Cases:**
   - Finding relationships between variables
   - Feature selection
   - Stock market analysis


**4. Spearman Rank Correlation**

 - **Type:** Similarity coefficient (but can be interpreted as a distance measure when considering 1 - correlation).
 - **Data Type:** Ordinal data or numerical data where the rank order is more important than the actual values.
 - **Description:** Measures the monotonic relationship between two variables. It assesses how well the relationship between two variables can be described by a monotonic function (always increasing or always decreasing).
 - **Range:** -1 to 1, similar to Pearson correlation.
 - **Use Cases:**
   - Assessing correlations when data doesn't meet the assumptions of Pearson correlation (linearity, normality).
   - Analyzing ranked data (e.g., customer preferences, survey responses).
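
The difference between the two correlation coefficients shows up clearly on a relationship that is monotonic but not linear. The sketch below uses made-up data and computes Spearman correlation as the Pearson correlation of the ranks, which is one standard way to define it.

```python
import numpy as np
import pandas as pd

# Made-up data: y grows monotonically with x, but not linearly
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = x ** 3

# Pearson: strength of the linear relationship
pearson = np.corrcoef(x, y)[0, 1]

# Spearman: Pearson correlation applied to the ranks of the values
spearman = np.corrcoef(x.rank(), y.rank())[0, 1]

print(f"Pearson:  {pearson:.3f}")   # about 0.943
print(f"Spearman: {spearman:.3f}")  # 1.000
```

Spearman reports a perfect relationship because the rank order of `x` and `y` is identical, while Pearson stays below 1 because the points do not lie on a straight line.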





### **1.3 Applications of similarity measures**

Now let's expand on the context of similarity measures within the broader scope of data science and machine learning, as well as their relevance to financial applications.

**Broader Context of Similarity Measures:**

 - **Data Mining and Machine Learning:** Similarity measures are fundamental building blocks in many data mining and machine learning algorithms. They are used for tasks like clustering (grouping similar data points), classification (assigning data points to categories based on similarity to existing categories), and anomaly detection (identifying unusual data points that differ significantly from others).
 - **Information Retrieval:** In information retrieval systems, such as search engines, similarity measures are used to rank documents or web pages based on their relevance to a user's query. Documents that are more similar to the query are ranked higher in the search results.
 - **Recommender Systems:** Recommender systems leverage similarity measures to suggest products, movies, or other items that are similar to those a user has liked or interacted with in the past.

**Relevance to Financial Applications:**

 - **Financial Time Series Analysis:** Similarity measures like DTW are crucial for analyzing financial time series data, such as stock prices, exchange rates, or economic indicators. DTW can help identify patterns, trends, and correlations between different time series, even if they have variations in their temporal alignment. This information can be used for forecasting, risk management, and investment strategies.
 - **Risk Management:** Similarity measures can be used to assess the similarity between different financial instruments or portfolios to identify potential risks and diversification opportunities. By measuring the similarity of risk profiles, investors can make more informed decisions about portfolio construction and risk mitigation.
 - **Algorithmic Trading:** Similarity measures can be incorporated into algorithmic trading strategies to identify trading opportunities based on patterns and relationships between different financial assets. For example, a trading algorithm might identify stocks that have historically moved in similar ways and use this information to make trading decisions.


Overall, this lesson provides a foundation for understanding the importance of similarity measures in data science and machine learning applications, with a specific focus on their relevance to financial analysis. Looking at the broader context and potential applications of these measures helps us appreciate their role in extracting insights and making informed decisions in the financial domain. Here are a few examples of how similarity measures can be applied in financial engineering scenarios:

 - **Clustering Financial Instruments:** Euclidean distance or correlation coefficients can be used to cluster stocks or other financial instruments based on their historical price movements or other financial metrics. This can help investors identify groups of assets that behave similarly and diversify their portfolios accordingly.
 - **Detecting Anomalies in Financial Data:** Similarity measures can be used to detect anomalies or outliers in financial data, such as unusual trading activity or sudden price movements. These anomalies might indicate fraudulent activity or market disruptions.
 - **Predicting Stock Price Movements:** Similarity measures can be used to build predictive models for stock price movements by identifying patterns and relationships between different financial indicators or news sentiment.
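
As a starting point for the clustering use case, the sketch below builds a pairwise correlation-distance matrix (1 - correlation) and picks the most similar pair of assets. The tickers and return values are hypothetical, and a real analysis would pass such a matrix to a clustering algorithm rather than stop at the closest pair.

```python
import numpy as np

# Hypothetical daily returns (rows = assets, columns = days); values are made up
tickers = ["AAA", "BBB", "CCC"]
returns = np.array([
    [0.010, -0.020, 0.015, 0.005, -0.010],
    [0.012, -0.018, 0.010, 0.007, -0.008],
    [-0.005, 0.015, -0.010, 0.000, 0.012],
])

corr = np.corrcoef(returns)   # pairwise Pearson correlations between rows
corr_distance = 1.0 - corr    # 0 = perfectly correlated, 2 = perfectly anti-correlated

# Find the closest pair of distinct assets by ignoring the diagonal
masked = corr_distance.copy()
np.fill_diagonal(masked, np.inf)
i, j = np.unravel_index(np.argmin(masked), masked.shape)

print(np.round(corr_distance, 3))
print(f"Most similar pair: {tickers[i]} and {tickers[j]}")
```

Assets that end up close together in this matrix behave similarly, so holding both adds little diversification.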




### **1.4 Simplified examples**

Let's now consider a simplified example using financial data to illustrate the calculation of similarity using Euclidean distance, Manhattan distance, and cosine similarity. We will also look at the geometry of these similarity measures.

**Scenario:** We have two stocks, Stock A and Stock B, and we want to measure their similarity based on their daily returns over a period of 5 days.

|  Day  |  Stock A Returns  |  Stock B Returns  |
| :---: | :---: | :---: |
|  1  |   0.02  |  0.03  |  
|  2  |  -0.01  |  0.01  |  
|  3  |   0.03  |  0.02  |  
|  4  |   0.01  |  0.00  |  
|  5  |  -0.02  | -0.01  |  


Let's represent this scenario data in code using Python lists or NumPy arrays, which will allow for programmatic calculations and easier manipulation in follow-up code snippets:


In [2]:
# Stock returns data
stock_a_returns = [0.02, -0.01, 0.03, 0.01, -0.02]
stock_b_returns = [0.03, 0.01, 0.02, 0.00, -0.01]

# Convert to NumPy arrays for easier calculations
stock_a = np.array(stock_a_returns)
stock_b = np.array(stock_b_returns)


Now, we have the daily returns of Stock A and Stock B stored in the variables `stock_a` and `stock_b`, respectively. These variables can be directly used in subsequent calculations.




#### **Euclidean Distance:**

Euclidean distance represents the straight-line distance between two points in a multi-dimensional space. It's the shortest distance between the points, as if we were to draw a straight line connecting them. Imagine two points on a Cartesian plane. The Euclidean distance is the length of the line segment connecting these points. In higher dimensions, the concept extends similarly, but the line segment exists in a higher-dimensional space.

The Euclidean distance between the two stocks is calculated as follows:

$$\begin{align*}
    & \text{Euclidean Distance (A, B)} = \sqrt{ \Sigma (\text{ReturnA}_i - \text{ReturnB}_i)^2} = \\
    & = \sqrt{(0.02 - 0.03)^2 + (-0.01 - 0.01)^2 + (0.03 - 0.02)^2 + (0.01 - 0.00)^2 + (-0.02 - (-0.01))^2} = \\
    & = \sqrt{0.0001 + 0.0004 + 0.0001 + 0.0001 + 0.0001} = \sqrt{0.0008} \approx 0.0283
\end{align*}$$

Here
 - $\text{ReturnA}_i$ is the return of Stock A on day $i$.
 - $\text{ReturnB}_i$ is the return of Stock B on day $i$.
 - And $\Sigma$ is the summation over all days ($i$ = 1 to 5).

We can compute Euclidean distance between Stock A and Stock B programmatically as follows:




In [3]:
# Euclidean Distance
euclidean_distance = np.sqrt(np.sum((stock_a - stock_b)**2))
print(f"Euclidean Distance: {euclidean_distance}")


Euclidean Distance: 0.0282842712474619


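**Interpretation:** The Euclidean distance between Stock A and Stock B is approximately 0.0283. Taken on its own, this suggests that the daily returns of the two stocks are fairly close to each other. Because Euclidean distance depends on the scale of the returns, the value is best judged against the distances between other pairs of stocks over the same period.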



#### **Manhattan Distance:**

Manhattan distance, also known as L1 distance or city block distance, represents the distance between two points if we could only travel along grid lines or city blocks. We can only move horizontally or vertically, not diagonally. Think of navigating a grid-like city layout. The Manhattan distance is the total distance we would travel along the streets to get from one point to another.

The Manhattan distance is calculated by summing the absolute differences of the daily returns:

$$\begin{align}
    & \text{Manhattan Distance (A, B)} = \Sigma|\text{ReturnA}_i - \text{ReturnB}_i| = \\
    &= |0.02 - 0.03| + |-0.01 - 0.01| + |0.03 - 0.02| + |0.01 - 0.00| + |-0.02 - (-0.01)| = \\
    &= 0.01 + 0.02 + 0.01 + 0.01 + 0.01 = \\
    &= 0.06
\end{align}$$

Programmatic computation of Manhattan distance between our stocks is given in the following code:




In [4]:
# Manhattan Distance
manhattan_distance = np.sum(np.abs(stock_a - stock_b))
print(f"Manhattan Distance: {manhattan_distance}")


Manhattan Distance: 0.06


**Interpretation:** The Manhattan distance between Stock A and Stock B is 0.06. This represents the total absolute difference in their daily returns over the 5-day period.

#### **Cosine Similarity:**

Cosine similarity measures the cosine of the angle between two vectors. It focuses on the direction of the vectors rather than their magnitudes. Imagine two vectors originating from the same point. The cosine similarity is related to the angle between these vectors. If the vectors point in the same direction, the angle is 0 degrees, and the cosine similarity is 1 (perfect similarity). If the vectors are perpendicular, the angle is 90 degrees, and the cosine similarity is 0 (no similarity). Higher cosine similarity values indicate vectors that are closer in direction, while lower values indicate vectors that are more divergent in direction.

We can measure the similarity between the two stocks using the cosine similarity formula:

$$\text{Cosine Similarity (A, B)} = \frac{A \cdot B}{\Vert A \Vert \Vert B \Vert}$$

Here $A \cdot B$ is the dot product of the daily return vectors of Stock A and Stock B.
$\Vert A \Vert$ and $\Vert B \Vert$ are the magnitudes of the daily return vectors of Stock A and Stock B, respectively.

Let's complete the calculation of cosine similarity using the provided values. We first compute dot product and magnitudes:

$$\begin{align}
    A \cdot B &= (0.02 \times 0.03) + ((-0.01) \times 0.01) + (0.03 \times 0.02) + (0.01 \times 0.00) + ((-0.02) \times (-0.01)) = 0.0013 \\
    \Vert A \Vert &= \sqrt{0.02^2 + (-0.01)^2 + 0.03^2 + 0.01^2 + (-0.02)^2} \approx 0.0436 \\
    \Vert B \Vert &= \sqrt{0.03^2 + 0.01^2 + 0.02^2 + 0.00^2 + (-0.01)^2} \approx 0.0387
\end{align}$$

We can now plug these into the cosine similarity formula:

$$\text{Cosine Similarity (A, B)} = \frac{A \cdot B}{\Vert A \Vert \Vert B \Vert} = \frac{0.0013}{0.0436 \times 0.0387} \approx 0.77$$

The following code computes cosine similarity for the given stocks:




In [5]:
# Compute dot product and magnitudes
dot_product = np.dot(stock_a, stock_b)
magnitude_a = np.linalg.norm(stock_a)
magnitude_b = np.linalg.norm(stock_b)

# Compute Cosine similarity
cosine_similarity = dot_product / (magnitude_a * magnitude_b)
print(f"Cosine Similarity: {cosine_similarity}")


Cosine Similarity: 0.7700535410868199


**Interpretation:** The cosine similarity value of 0.77 indicates a relatively high similarity between the daily returns of Stock A and Stock B. This suggests that the two stocks tend to move in similar directions, although not necessarily with the same magnitude.

#### **Summing up:**

Based on the calculations above, we can summarize the similarity measures for Stock A and Stock B as follows:

 - **Euclidean and Manhattan Distances:** In financial data analysis, these distances can be used to measure the overall dissimilarity between time series or vectors of financial metrics. For instance, we could use them to compare the historical returns of two stocks or to cluster stocks based on their financial characteristics. For our example, both distances are relatively small ($\approx$0.0283 for Euclidean distance and 0.06 for Manhattan distance), indicating that the daily returns of Stock A and Stock B are quite similar in terms of their overall magnitude of differences. The smaller the distance, the more similar the stocks are considered to be.
 - **Cosine Similarity:** Cosine similarity is often used in finance to assess the co-movement between assets; it coincides with the Pearson correlation when the return vectors are mean-centered. It can help identify stocks that tend to move in similar directions, even if their returns have different magnitudes. This information is valuable for portfolio diversification and risk management. For our example, the cosine similarity value of 0.77 indicates a relatively high similarity in the direction of movement between the two stocks. This suggests that they tend to move in the same direction (up or down) on most days, although the magnitude of their returns might differ.

These results suggest that Stock A and Stock B exhibit a considerable degree of similarity in their price movements. They tend to move in the same direction, and the overall differences in their daily returns are relatively small.

We should remark that these similarity measures are based on a limited dataset of only 5 days. To draw more robust conclusions about the relationship between the two stocks, it's essential to analyze a larger dataset over a more extended period. Additionally, it is important to consider using other relevant metrics and techniques to gain a comprehensive understanding of their relationship.
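
One such complementary metric is the Pearson correlation, which is closely related to cosine similarity: it is the cosine similarity of the mean-centered return vectors. The sketch below checks this on the same five-day returns; the helper name `cosine` is introduced here only for illustration.

```python
import numpy as np

# Daily returns of Stock A and Stock B from the scenario table
stock_a = np.array([0.02, -0.01, 0.03, 0.01, -0.02])
stock_b = np.array([0.03, 0.01, 0.02, 0.00, -0.01])

def cosine(u, v):
    """Cosine of the angle between vectors u and v."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

raw_cosine = cosine(stock_a, stock_b)

# Subtract each stock's mean return before measuring the angle
centered_cosine = cosine(stock_a - stock_a.mean(), stock_b - stock_b.mean())

# Pearson correlation as computed by NumPy
pearson = np.corrcoef(stock_a, stock_b)[0, 1]

print(f"Cosine similarity:          {raw_cosine:.4f}")       # 0.7701
print(f"Cosine of centered returns: {centered_cosine:.4f}")  # 0.7625
print(f"Pearson correlation:        {pearson:.4f}")          # 0.7625
```

The two figures differ slightly here because the average returns of the stocks are not zero; cosine similarity on raw returns is influenced by that common level, while Pearson correlation removes it.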

### **1.5 Choice of similarity measure**

It is important to understand the factors influencing the choice of similarity measure. The choice depends on the type of data, the desired interpretation of similarity, the characteristics of the data, and the specifics of the application. Below we consider these factors in detail:

**Data Type:** The nature of data fundamentally guides the choice of similarity measure.

 - **Numerical Data:** If data consists of continuous numerical values, we typically use distance measures like Euclidean, Manhattan, or Minkowski, or correlation coefficients like Pearson or Spearman for assessing relationships between numerical variables. These measures quantify the magnitude of differences or the strength of relationships between numerical variables.
 - **Categorical Data:** When data falls into distinct categories and can be represented as sets, the Jaccard index is a common choice. It focuses on the overlap, or shared elements, between the sets.
 - **Text Data:** Word Mover's Distance (WMD) is a way to measure the semantic similarity between text documents. WMD considers the meaning of words and the "effort" required to transform one document into another.
 - **Time Series Data:** For data that changes over time, Dynamic Time Warping (DTW) is well suited because it aligns time series that differ in speed or are shifted in time before comparing them.

**Desired Interpretation:** The way we want to interpret the similarity results and the overall goal of analysis also play a role.

 - **Distance vs. Similarity:** Whether we quantify dissimilarity (distance) or similarity directly depends on the task's objective. Distance measures are useful for identifying differences or outliers, while similarity coefficients are better for finding similar items or grouping data points together. Distance measures yield smaller values for higher similarity, while similarity coefficients yield larger values. We need to choose the measure that aligns with our interpretation needs.
 - **Linear vs. Non-linear Relationships:** When working with numerical data, we need to consider the type of relationship we want to capture. Pearson correlation is suitable for linear relationships, while Spearman rank correlation is more appropriate for relationships that are monotonic but not necessarily linear.

**Data Characteristics:** Certain characteristics of data can influence the choice of similarity measure.

 - **Outliers:** If data contains outliers (extreme or unusual values), Manhattan distance is worth considering because it is less sensitive to outliers than Euclidean distance, which squares each difference.
 - **Time Shifts (for Time Series Data):** When comparing time series, variations in speed or temporal shifts can significantly affect similarity calculations. DTW is specifically designed to address this issue by finding the optimal alignment between the time series before calculating the distance.
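
The point about outliers can be illustrated with a small calculation. In the sketch below, the vector of coordinate-wise differences is made up: four small differences and one outlier. We compare how much of each distance is driven by the outlier.

```python
import numpy as np

# Made-up coordinate-wise differences between two vectors; the last one is an outlier
differences = np.array([0.01, 0.01, 0.01, 0.01, 0.10])

# Share of the Manhattan distance contributed by the outlier
manhattan_share = np.abs(differences[-1]) / np.sum(np.abs(differences))

# Share of the squared Euclidean distance contributed by the outlier
euclidean_share = differences[-1] ** 2 / np.sum(differences ** 2)

print(f"Outlier share of Manhattan distance:         {manhattan_share:.1%}")  # 71.4%
print(f"Outlier share of squared Euclidean distance: {euclidean_share:.1%}")  # 96.2%
```

Because Euclidean distance squares each difference, a single extreme value dominates the result far more than it does under Manhattan distance.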


By carefully considering the data type, desired interpretation, data characteristics, and the specifics of the application, we can make an informed decision about the most appropriate similarity measure for a specific task and dataset.




## **2 Textual Data**

Alternative data refers to non-traditional data sources that are used to gain insights into investment opportunities or market trends. This type of data isn't typically found in traditional financial datasets like market prices. Alternative data is becoming increasingly important in finance as investors seek new ways to gain insights and make better investment decisions. While there are challenges associated with using alternative data, it has the potential to provide a significant advantage for those who can effectively leverage it.

**What is Textual Data?** Textual data, in simple terms, is data that is in the form of text. This includes things like: news articles, social media posts, earnings call transcripts, company filings, regulatory documents, customer reviews, and many other types of text-based information.

Textual data contains a wealth of information that can be valuable for investors and financial analysts. It can provide insights into:

 - Market sentiment: By analyzing the tone and language used in news articles and social media posts, we can gauge how the market feels about specific companies, industries, or assets. This can help to predict potential market movements or identify emerging trends.
 - Company performance and strategies: Textual data like earnings call transcripts and company filings can provide information about a company's financial performance, strategic direction, and future plans.
 - Risk assessment: By analyzing news and social media for mentions of risks or potential threats, investors can identify emerging risks or monitor market sentiment related to specific risk factors.
 - Alternative data analysis: Textual data is a rich source of "alternative data" that can complement traditional financial data and provide a more comprehensive view of the market or specific investments.






### **2.1 Text analysis techniques**

There are several valuable approaches that can be used to analyse textual data depending on the specific task and the nature of the textual data. Here are some of the commonly used techniques:

**Term Frequency-Inverse Document Frequency (TF-IDF):** TF-IDF is a statistical measure that evaluates the importance of a word to a document within a collection of documents. It works by calculating two metrics: Term Frequency (TF), which measures how often a word appears in a document, and Inverse Document Frequency (IDF), which measures how rare the word is across the entire collection. By combining TF and IDF, TF-IDF highlights words that are frequent within a specific document but relatively rare across the entire corpus, effectively identifying the most distinctive and meaningful words for each document.

**Word Embeddings (Word2Vec, GloVe, FastText):** Word embeddings represent words as dense vectors in a high-dimensional space, where words with similar meanings are located closer to each other. This allows capturing semantic relationships between words and understanding their contextual meaning. Word embeddings are trained on large text corpora and can be used for various tasks, including sentiment analysis, document similarity, and building features for machine learning models. They enable capturing the essence of word meanings and relationships, going beyond simple word frequency counts.

**Sentiment Analysis (Lexicon-based, Machine Learning-based):** Sentiment analysis aims to determine the emotional tone or opinion expressed in text. It can be rule-based using sentiment lexicons, which are lists of words with associated sentiment scores, or machine learning-based using classifiers trained on labeled data. Sentiment analysis is widely used in finance to gauge market sentiment, identify customer feedback, and predict stock price movements based on news sentiment. It helps understand the overall positivity or negativity expressed in text, providing valuable insights for investment decisions.

**Topic Modeling (LDA, NMF):** Topic modeling uncovers hidden thematic structures in a collection of documents by identifying groups of words that frequently co-occur. This helps understand the main topics or themes discussed in the text data. Techniques like Latent Dirichlet Allocation (LDA) and Non-negative Matrix Factorization (NMF) are commonly used for topic modeling. In finance, topic modeling can be applied to identify industry trends, categorize news articles, and analyze customer reviews to understand common themes or concerns.

**Named Entity Recognition (NER):** Named Entity Recognition (NER) identifies and classifies named entities in text, such as people, organizations, locations, dates, and monetary values. It extracts structured information from unstructured text, making it easier to analyze and understand the relationships between different entities. In finance, NER can be used to identify key players in financial news, extract company names from regulatory filings, and understand the relationships between different entities mentioned in text.

**Text Classification:** Text classification assigns predefined categories or labels to text documents based on their content. It can be rule-based, using predefined rules to categorize documents, or machine learning-based, using classifiers trained on labeled data. In finance, text classification can be used for tasks like categorizing financial news articles, identifying spam or fraudulent emails, and classifying customer reviews based on sentiment. It helps automate the process of categorizing text data, making it easier to analyze and understand large volumes of textual information.

Often, a combination of these techniques is used to achieve the desired results. For example, we might use TF-IDF to identify important words, word embeddings to capture semantic relationships, and sentiment analysis to determine the overall sentiment of a document. Later in this module we go into more detail and look at applications of some of these techniques.



### **2.2 Definition of Term Frequency-Inverse Document Frequency**

Term Frequency-Inverse Document Frequency (TF-IDF) is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. It's widely used in information retrieval and text mining to help identify the most relevant words in a document.

 - **Term Frequency (TF):**
   - This component measures how frequently a term appears in a document. The basic intuition is that a word that occurs more often within a document is likely to be more important to that document's meaning. There are different ways to calculate TF, but a common one is the relative frequency, i.e. the raw count normalized by the document length:
   
   $$\text{TF(term, document)} = \frac{ \text{(Number of times term appears in document)}}{\text{(Total number of terms in document)}}$$

 - **Inverse Document Frequency (IDF):**
   - IDF measures how important a term is within the whole corpus. Words that occur rarely in the corpus have a higher IDF score. IDF scales down the weight of terms that occur very frequently across the corpus and are therefore less informative about a specific document. It gives higher weight to terms that are rare. A common way to calculate IDF is:

   $$\text{IDF(term, corpus)} = \log \Big[ \frac{\text{(Total number of documents in corpus)}}{\text{(Number of documents containing the term)}} \Big]$$

   - If a term appears in all documents, its IDF score is 0. If a term is very rare, its IDF score is high.

 - **TF-IDF Calculation:**
   - The TF-IDF score is simply the product of the TF and IDF scores:

   $$\text{TF-IDF(term, document, corpus)} = \text{TF(term, document)} \times \text{IDF(term, corpus)}$$
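
The three formulas above translate almost directly into code. The following is a minimal pure-Python sketch under a few assumptions: the corpus is made up, tokenization is a simple whitespace split, the logarithm is the natural log, and a term that appears in no document is given an IDF of 0 to avoid division by zero.

```python
import math

# Made-up mini corpus of headlines
corpus = [
    "markets rally as earnings beat expectations",
    "markets fall as earnings miss",
    "markets await central bank decision",
]
docs = [text.split() for text in corpus]

def tf(term, doc):
    """Relative frequency of the term within one tokenized document."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Log of (number of documents / number of documents containing the term)."""
    n_containing = sum(1 for doc in docs if term in doc)
    if n_containing == 0:
        return 0.0  # assumed convention for terms absent from the corpus
    return math.log(len(docs) / n_containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

for term in ["markets", "earnings", "rally"]:
    print(f"{term:>8}: {tf_idf(term, docs[0], docs):.4f}")
```

In the first headline, "markets" scores 0 because it appears in every document, "earnings" scores higher because it appears in two of the three, and "rally" scores highest because it is unique to that headline.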




### **2.3 Applications of TF-IDF**

While still an emerging area, the application of TF-IDF in financial investment holds potential for leveraging textual data to gain insights, manage risk, and potentially improve investment strategies. Some applications of TF-IDF specifically within the financial domain and potential use cases could be:

**Sentiment Analysis for Financial Markets:**

 - **Gauging Market Sentiment:** TF-IDF can be used to analyze financial news articles, social media posts, earnings call transcripts, and other textual data to gauge market sentiment towards specific companies, industries, or assets. By identifying terms strongly associated with positive or negative sentiment, investors can get a sense of the overall market sentiment towards a particular investment.
 - **Predicting Market Movements:** TF-IDF can contribute to building predictive models for market movements by identifying terms that are indicative of future price changes. For example, an increase in the TF-IDF scores of terms related to positive sentiment might suggest an upcoming price increase for a particular stock.
 - **Example:** A hedge fund might use TF-IDF to analyze news articles and social media sentiment to predict short-term stock price movements. By identifying shifts in sentiment towards a company, they can make informed trading decisions.
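
One way this idea could be prototyped is to feed TF-IDF features into a simple classifier. The sketch below is illustrative only: the training headlines and their labels are hypothetical and hand-made, and logistic regression is just one possible choice of model, not a recommended trading signal.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical, hand-labeled training headlines (illustrative only)
train_texts = [
    "record profits surge",
    "strong earnings growth",
    "analysts optimistic upgrade",
    "regulatory scrutiny shares decline",
    "losses widen after warning",
    "fraud investigation lawsuit",
]
train_labels = ["positive", "positive", "positive",
                "negative", "negative", "negative"]

# TF-IDF turns each headline into a weighted word vector;
# logistic regression then learns which words signal each label
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_texts, train_labels)

# Classify an unseen (hypothetical) headline
new_headline = ["profits surge on strong growth"]
predicted = model.predict(new_headline)[0]
print(predicted)
```

A realistic model would need a far larger labeled dataset and out-of-sample evaluation before its predictions could be trusted.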

**Alternative Data Analysis:**

 - **Unstructured Data Insights:** TF-IDF can be applied to analyze unstructured data sources like earnings call transcripts, company filings, regulatory documents, and even website content to extract key information that might not be readily apparent from traditional financial data. This can help investors gain a deeper understanding of a company's performance, strategy, and future prospects.
 - **Extracting Key Information:** By identifying high TF-IDF terms, investors can extract crucial information, such as mentions of new products, partnerships, or potential risks. This information can be used to make more informed investment decisions.
 - **Example:** An investment firm could use TF-IDF to identify companies with innovative products or services by analyzing patent filings and research papers. By identifying terms related to innovation and technological advancements, they can identify companies with high growth potential.

**Risk Management:**

 - **Identifying Emerging Risks:** Analyzing news and social media using TF-IDF can help identify emerging risks or potential threats to investments. By tracking the TF-IDF scores of terms related to risk factors, such as regulatory changes, economic downturns, or natural disasters, investors can get early warnings of potential market disruptions.
 - **Monitoring Market Sentiment:** Tracking changes in the TF-IDF scores of specific terms related to risk factors can provide early warnings of potential market downturns. For example, an increase in the TF-IDF scores of terms related to economic uncertainty might suggest an upcoming market correction.
 - **Example:** A risk manager might use TF-IDF to monitor news and social media for mentions of potential risks to their investments. By identifying emerging risks early on, they can take steps to mitigate potential losses.

**Portfolio Optimization:**

 - **Text-Based Diversification:** TF-IDF can be used to analyze the textual descriptions of assets (e.g., company profiles, product descriptions) and diversify portfolios based on the semantic similarity or dissimilarity of their underlying businesses. This can help investors build more diversified portfolios that are less susceptible to market shocks.
 - **Identifying Overlapping Exposures:** By analyzing the TF-IDF profiles of different assets, investors can identify potential overlaps in their investments. This can help them avoid overexposure to specific industries or risk factors.
 - **Example:** An investment advisor might use TF-IDF to analyze the textual descriptions of different stocks in a client's portfolio to identify potential areas of overlap. This can help them diversify the portfolio and reduce risk.
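
To make the overlap idea concrete, the sketch below compares TF-IDF vectors of business descriptions using cosine similarity. The three company names and descriptions are hypothetical, and cosine similarity is one reasonable choice of measure rather than the only one.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical one-line business descriptions
descriptions = {
    "BankA": "retail banking loans mortgages and deposits",
    "BankB": "commercial banking loans deposits and credit cards",
    "MinerC": "copper and gold mining exploration",
}
names = list(descriptions.keys())

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(list(descriptions.values()))

# Pairwise cosine similarity between all descriptions
similarity = cosine_similarity(tfidf)
print(pd.DataFrame(similarity, index=names, columns=names).round(2))
```

The two banks share several terms and therefore show a positive similarity, while the miner shares no non-stop-word terms with either bank and scores zero against them, suggesting less textual overlap in the underlying businesses.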

**Algorithmic Trading:**

 - **News-Based Trading:** TF-IDF can be incorporated into algorithmic trading strategies that react to news events. By identifying the sentiment and relevance of news articles, trading algorithms can make automated trading decisions based on real-time information.
 - **Sentiment-Driven Trading:** Trading decisions can be automated based on real-time sentiment analysis using TF-IDF. For example, a trading algorithm might buy a stock if the sentiment towards the company suddenly becomes more positive.
 - **Example:** A high-frequency trading firm might use TF-IDF to analyze news headlines and social media posts to identify trading opportunities. By reacting quickly to news events, they can potentially profit from short-term market movements.

These examples highlight the versatility and potential of TF-IDF in various financial applications. As textual data becomes increasingly important in the financial domain, TF-IDF and other text analysis techniques will continue to play a significant role in helping investors gain insights, manage risk, and improve investment strategies.






### **2.4 Benefits of TF-IDF**

The benefits of using TF-IDF for text analysis, particularly in the context of finance, can be summarised as follows:

**Enhanced Relevance and Focus on Important Words:**

 - Distinctive Words: TF-IDF highlights words that are distinctive and meaningful within a specific document relative to a larger corpus. It goes beyond simple word frequency by taking the entire collection into account.
 - Discriminatory Power: By emphasizing terms that are frequent in a document but relatively rare across the entire collection, TF-IDF effectively distinguishes between common words (like "the", "a", "is") and more informative words that characterize the content of a document. This helps focus on the words that truly matter for understanding the document's meaning.
 - Improved Search Results: In financial applications like searching for relevant news articles or research reports, TF-IDF helps retrieve documents that are more relevant to a query by prioritizing documents containing high TF-IDF terms matching the query. This ensures that the most relevant information is presented to the user.

**Dimensionality Reduction and Feature Selection:**

 - Feature Selection: TF-IDF assigns a weight to every term, and these weights can guide feature selection: terms with higher scores are treated as more important for representing a document, while terms that score low throughout the corpus can be dropped. This reduces the number of features used for analysis.
 - Reduced Computational Cost: TF-IDF by itself does not shrink the vocabulary, but keeping only the most informative terms lowers the dimensionality of the data, leading to faster processing and analysis, especially for large datasets. This matters in finance, where large volumes of textual data are often analyzed.

**Improved Information Retrieval and Semantic Similarity:**

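 - Semantic Similarity: Documents with similar TF-IDF profiles are likely to be topically related, because their most distinctive words overlap. Note that TF-IDF compares exact terms, so two documents that express the same idea with different vocabulary (for example, "stock" versus "shares") will not appear similar unless additional techniques such as stemming or synonym handling are applied.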
 - Better Document Ranking: Search engines and recommendation systems leverage TF-IDF to rank documents or items based on their relevance to a query or user profile. In finance, this can be used to rank news articles, research reports, or other textual data based on their relevance to a specific investment topic or company.
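
The ranking idea can be sketched as follows: documents and a query are mapped into the same TF-IDF space and ordered by cosine similarity. The documents and the query are hypothetical, and real search systems typically add many refinements on top of this.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical document collection and query
documents = [
    "Central bank raises rates to fight inflation",
    "Tech company unveils new smartphone",
    "Inflation data pushes bond yields higher",
]
query = "inflation and rates"

vectorizer = TfidfVectorizer(stop_words="english")
doc_vectors = vectorizer.fit_transform(documents)

# Reuse the fitted vocabulary and IDF weights for the query
query_vector = vectorizer.transform([query])

# Cosine similarity between the query and every document
scores = cosine_similarity(query_vector, doc_vectors).ravel()
ranking = scores.argsort()[::-1]  # indices ordered from best to worst match

for i in ranking:
    print(f"{scores[i]:.3f}  {documents[i]}")
```

The first document matches both query terms and ranks highest, the third matches only "inflation", and the second shares no terms with the query and scores zero.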

**Versatility and Wide Applicability:**

 - Wide Applicability: TF-IDF can be applied to various text analysis tasks, including document classification, clustering, summarization, and topic modeling. This makes it a versatile tool for a wide range of financial applications.
 - Language Agnostic: The core principles of TF-IDF are applicable to different languages, making it a versatile tool for multilingual text analysis. This is important in finance, where information is often gathered from sources in different languages.


In summary, TF-IDF is a valuable technique for enhancing the relevance of terms, reducing data dimensionality, and improving information retrieval in various text analysis applications.



### **2.5 Simplified example of applying TF-IDF**

Let's illustrate the application of TF-IDF with a simplified example in the context of financial sentiment analysis.

  > **Scenario:** Suppose we have a small collection of financial news headlines related to a company called "Acme Corp.":

  > 1. "Acme Corp. announces record profits, stock surges."
  > 2. "Acme Corp. faces regulatory scrutiny, shares decline."
  > 3. "Market volatility impacts Acme Corp. earnings."
  > 4. "Acme Corp. expands into new markets, analysts optimistic."

  > **Goal:** We want to use TF-IDF to identify the most important words in each headline and understand the overall sentiment towards Acme Corp.

**Steps:**

**Step 1. Tokenization and Preprocessing:**

 - Break down each headline into individual words (tokens).
 - Remove stop words (common words like "the", "a", "is") and punctuation.
 - Convert words to lowercase.

**Step 2. Calculating Term Frequency (TF):**

 - For each headline, count the occurrences of each word.
 - Divide the word count by the total number of words in the headline to get the TF.

**Step 3. Calculating Inverse Document Frequency (IDF):**

 - For each word, count the number of headlines it appears in.
 - Divide the total number of headlines by the number of headlines containing the word.
 - Take the logarithm of the result to get the IDF.

**Step 4. Calculating TF-IDF:**

Multiply the TF and IDF scores for each word in each headline to get the TF-IDF score.
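
The four steps above can be sketched in plain Python as follows. This is a minimal illustration, not a production implementation: the tiny `STOP_WORDS` set is an assumption made for this example (real stop-word lists are much longer), and the natural logarithm is assumed for the IDF.

```python
import math
import re
from collections import Counter

headlines = [
    "Acme Corp. announces record profits, stock surges.",
    "Acme Corp. faces regulatory scrutiny, shares decline.",
    "Market volatility impacts Acme Corp. earnings.",
    "Acme Corp. expands into new markets, analysts optimistic.",
]

# Small illustrative stop-word list (an assumption for this sketch)
STOP_WORDS = {"the", "a", "is", "into"}

# Step 1: lowercase, strip punctuation, drop stop words
def tokenize(text):
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

docs = [tokenize(h) for h in headlines]

# Step 2: term frequency = count of the term / number of terms in the headline
def term_frequency(tokens):
    counts = Counter(tokens)
    return {term: count / len(tokens) for term, count in counts.items()}

# Step 3: inverse document frequency = log(number of headlines / headlines containing the term)
n_docs = len(docs)
doc_freq = Counter(term for tokens in docs for term in set(tokens))
idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}

# Step 4: TF-IDF = TF * IDF, computed per headline
tfidf = [
    {term: tf * idf[term] for term, tf in term_frequency(tokens).items()}
    for tokens in docs
]

# Show the scores for the first headline, highest first
for term, score in sorted(tfidf[0].items(), key=lambda kv: -kv[1]):
    print(f"{term:10s} {score:.4f}")
```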




#### **Illustrative Example:**

Let's walk through these steps for Headline 1: "Acme Corp. announces record profits, stock surges."

 - Tokenization and Preprocessing:

   - Tokens: ["acme", "corp", "announces", "record", "profits", "stock", "surges"]

 - Calculating TF:

   - TF("acme") = 1/7
   - TF("corp") = 1/7
   - TF("announces") = 1/7
   - TF("record") = 1/7
   - TF("profits") = 1/7
   - TF("stock") = 1/7
   - TF("surges") = 1/7

 - Calculating IDF:

   - IDF("acme") = log(4/4) = 0 (appears in all headlines)
   - IDF("corp") = log(4/4) = 0 (appears in all headlines)
   - IDF("announces") = log(4/1) = log(4)
   - IDF("record") = log(4/1) = log(4)
   - IDF("profits") = log(4/1) = log(4)
   - IDF("stock") = log(4/1) = log(4)
   - IDF("surges") = log(4/1) = log(4)

 - Calculating TF-IDF:

   - TF-IDF("acme") = (1/7) * 0 = 0
   - TF-IDF("corp") = (1/7) * 0 = 0
   - TF-IDF("announces") = (1/7) * log(4)
   - TF-IDF("record") = (1/7) * log(4)
   - TF-IDF("profits") = (1/7) * log(4)
   - TF-IDF("stock") = (1/7) * log(4)
   - TF-IDF("surges") = (1/7) * log(4)

**Interpretation:** Words with higher TF-IDF scores are more important for this headline. Here "announces", "record", "profits", "stock", and "surges" each appear only in this headline and share the highest score, (1/7) * log(4), while "acme" and "corp" appear in every headline and score 0. The high-scoring terms, particularly "record", "profits", and "surges", convey the positive sentiment of the headline.

We can then apply the same technique to the remaining headlines. By repeating this process for all headlines, we can identify the most important words in each and understand the overall sentiment towards Acme Corp.



#### **Python code:**

Let's demonstrate how to apply TF-IDF in Python with the sklearn library. The following code snippet takes a set of financial news headlines and converts them into a numerical representation (a TF-IDF matrix) that captures the importance of each word in each headline. We do this with `TfidfVectorizer`, which converts a collection of raw documents into a matrix of TF-IDF features, automating the calculation of TF-IDF scores for each word in each document:

In [7]:
# Sample headlines
headlines = [
    "Acme Corp. announces record profits, stock surges.",
    "Acme Corp. faces regulatory scrutiny, shares decline.",
    "Market volatility impacts Acme Corp. earnings.",
    "Acme Corp. expands into new markets, analysts optimistic."
]

# Create a TfidfVectorizer object
vectorizer = TfidfVectorizer(stop_words='english')

# Fit the vectorizer to the headlines
vectorizer.fit(headlines)


# Transform the headlines into a TF-IDF matrix
tfidf_matrix = vectorizer.transform(headlines)

# Get the feature names (words)
feature_names = vectorizer.get_feature_names_out()


After defining the sample headlines in the `headlines` variable, we create a `TfidfVectorizer` object. The line `vectorizer = TfidfVectorizer(stop_words='english')` creates an instance of the `TfidfVectorizer` class and assigns it to the variable `vectorizer`. The `stop_words='english'` argument tells the vectorizer to remove common English words (like "the", "a", "is") that typically carry little meaning in text analysis, which helps focus on more informative words.

Next, the code fits the vectorizer. The line `vectorizer.fit(headlines)` "fits" the vectorizer to the headlines: it tokenizes them, builds the vocabulary of unique words, and computes the IDF weight of each word.

Now we are ready to transform the headlines into a TF-IDF matrix. The line `tfidf_matrix = vectorizer.transform(headlines)` is the step where the actual TF-IDF transformation happens. The `transform` method takes the headlines as input and converts them into a matrix called `tfidf_matrix`, in which:
 - Rows represent the headlines;
 - Columns represent the unique words (tokens) in the vocabulary learned by the vectorizer;
 - Values are the TF-IDF scores for each word in each headline. Higher TF-IDF scores indicate that a word is more important or distinctive within a particular headline.

The last line, `feature_names = vectorizer.get_feature_names_out()`, retrieves the list of words (features) from the vocabulary learned during fitting, one per column of the TF-IDF matrix, and assigns it to the variable `feature_names`.

Finally, we can print the list of feature names (words) that were extracted from headlines and used to create the TF-IDF matrix:

In [9]:
# Print the feature names
print(feature_names)


['acme' 'analysts' 'announces' 'corp' 'decline' 'earnings' 'expands'
 'faces' 'impacts' 'market' 'markets' 'new' 'optimistic' 'profits'
 'record' 'regulatory' 'scrutiny' 'shares' 'stock' 'surges' 'volatility']


This output represents the vocabulary of unique words that the `TfidfVectorizer` extracted from headlines after preprocessing (removing stop words). The words in the list are usually sorted alphabetically. Each word in this list corresponds to a column in the TF-IDF matrix (`tfidf_matrix`). The values in that column represent the TF-IDF scores for that particular word in each headline.

We can use `feature_names` to look up specific words. To analyze the TF-IDF score of a particular word, we find its index in the `feature_names` list. The following code accesses the TF-IDF score for a specific headline and word: it selects the first headline and the word "profits", finds the corresponding row and column in the TF-IDF matrix, retrieves the score at that location, and prints it:

In [10]:
# Accessing TF-IDF scores for a specific headline and word
headline_index = 0  # Index of the first headline
word_index = feature_names.tolist().index('profits')  # Get the index of the word "profits"

# Find TF-IDF score and print
tfidf_score = tfidf_matrix[headline_index, word_index]
print(f"TF-IDF score for 'profits' in the first headline: {tfidf_score}")


TF-IDF score for 'profits' in the first headline: 0.42468159315633897
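
The score printed above (about 0.4247) differs from the hand calculation, (1/7) * log(4) ≈ 0.198, because `TfidfVectorizer` does not use the textbook formulas by default: it uses raw counts for TF, a smoothed IDF of the form $\ln\frac{1 + N}{1 + \text{df}} + 1$, and then scales each row to unit L2 length. The following sketch reproduces the scores for Headline 1 by hand under those defaults; the document frequencies are read off the four headlines.

```python
import numpy as np

n_docs = 4  # number of headlines

# Document frequencies of the seven terms kept from Headline 1:
# "acme" and "corp" occur in all four headlines, the other five in one each
doc_freq = np.array([4, 4, 1, 1, 1, 1, 1])
counts = np.ones(7)  # each term occurs once in Headline 1

# Smoothed IDF (the scikit-learn default, smooth_idf=True)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Raw count times IDF, then L2 normalization of the row (norm='l2')
raw = counts * idf
normalized = raw / np.linalg.norm(raw)

print(normalized.round(6))
```

The first two values correspond to "acme" and "corp" and the remaining five to the terms unique to Headline 1. Because the smoothed IDF never reaches zero, terms that appear in every headline still receive a small positive score.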


The TF-IDF matrix stores the calculated TF-IDF score for each word in each document. The following line of code prints it as a dense numerical array:



In [11]:
# Print the TF-IDF matrix
print(tfidf_matrix.toarray())


[[0.22161647 0.         0.42468159 0.22161647 0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.42468159 0.42468159 0.         0.         0.
  0.42468159 0.42468159 0.        ]
 [0.22161647 0.         0.         0.22161647 0.42468159 0.
  0.         0.42468159 0.         0.         0.         0.
  0.         0.         0.         0.42468159 0.42468159 0.42468159
  0.         0.         0.        ]
 [0.24478737 0.         0.         0.24478737 0.         0.46908376
  0.         0.         0.46908376 0.46908376 0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.46908376]
 [0.22161647 0.42468159 0.         0.22161647 0.         0.
  0.42468159 0.         0.         0.         0.42468159 0.42468159
  0.42468159 0.         0.         0.         0.         0.
  0.         0.         0.        ]]


While `print(tfidf_matrix.toarray())` gives the numerical representation, it lacks context. We can make the output more readable by adding row and column headings. The most convenient way to do this is to create a DataFrame with the Pandas library, which lets us label the rows and columns and makes the output much clearer.

In [12]:
# Create a list of generic headline labels
headline_labels = ["Headline1", "Headline2", "Headline3", "Headline4"]

# Create a Pandas DataFrame
pd.DataFrame(tfidf_matrix.toarray(), columns=feature_names, index=headline_labels)


|  | acme | analysts | announces | corp | decline | earnings | expands | faces | impacts | market | ... | new | optimistic | profits | record | regulatory | scrutiny | shares | stock | surges | volatility |
|:---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| Headline1 | 0.221616 | 0.0 | 0.424682 | 0.221616 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.424682 | 0.424682 | 0.0 | 0.0 | 0.0 | 0.424682 | 0.424682 | 0.0 |
| Headline2 | 0.221616 | 0.0 | 0.0 | 0.221616 | 0.424682 | 0.0 | 0.0 | 0.424682 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.424682 | 0.424682 | 0.424682 | 0.0 | 0.0 | 0.0 |
| Headline3 | 0.244787 | 0.0 | 0.0 | 0.244787 | 0.0 | 0.469084 | 0.0 | 0.0 | 0.469084 | 0.469084 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.469084 |
| Headline4 | 0.221616 | 0.424682 | 0.0 | 0.221616 | 0.0 | 0.0 | 0.424682 | 0.0 | 0.0 | 0.0 | ... | 0.424682 | 0.424682 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


Now, instead of just seeing an array of numbers, we have a nicely formatted table with the headlines as row labels and the words as column labels. Each cell in the table contains the TF-IDF score for that word in that headline.

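This simplified example demonstrates the basic steps involved in applying TF-IDF. Note that the scores differ from the hand calculation above because `TfidfVectorizer`, by default, uses a smoothed IDF and L2-normalizes each row; this is also why "acme" and "corp" receive non-zero scores. In real-world applications, we might need to perform more advanced preprocessing, use different parameters for the `TfidfVectorizer`, and apply more sophisticated analysis techniques to extract meaningful insights from financial text data.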



## **Conclusion**

In this lesson we focused on similarity measures and their applications in financial markets. We introduced the concept of similarity measures and explained how they quantify the likeness between data objects. We learned that there are two main categories of similarity measures: Distance Measures and Similarity Coefficients.

We then shifted focus to Term Frequency-Inverse Document Frequency (TF-IDF), a statistical measure used to evaluate the importance of words in a document within a collection of documents. We learned how TF-IDF works, its benefits, and its applications in financial analysis. These applications include sentiment analysis for financial markets, alternative data analysis, risk management, algorithmic trading, and portfolio optimization.

In the next lesson we will explore how to apply these techniques in practice.



**References**

 - Daniel Jurafsky and James H. Martin. (2024). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition with Language Models, 3rd edition. Online manuscript released August 20, 2024. https://web.stanford.edu/~jurafsky/slp3

 - Wikipedia Contributors (2024). Similarity measure. [online] Wikipedia. Available at: https://en.wikipedia.org/wiki/Similarity_measure

---
Copyright 2024 WorldQuant University. This
content is licensed solely for personal use. Redistribution or
publication of this material is strictly prohibited.
