MODULE 4 | LESSON 2


---


# **Search Engine Data**

|  |  |
|:---|:---|
| **Reading Time**  |  60 minutes  |
| **Prior Knowledge**  |  Basic Python, Basic concepts of Natural Language Processing, Financial Markets  |
| **Keywords**  | Machine Learning, Google Trends, Pytrends, Sentiment Analysis, TF-IDF, LSA (Latent Semantic Analysis), NLP (Natural Language Processing) |

---

*In this lesson we explore how to leverage data science and machine learning for financial analysis, with a focus on analyzing search terms from Google Trends data. It covers techniques like TF-IDF and LSA to extract insights from text, and demonstrates how to gather and analyze Google Trends data using Python and the `pytrends` library. The notebook also discusses the application of these techniques for sentiment analysis, risk management, and portfolio optimization.*

In [3]:
# Load libraries
import matplotlib.pyplot as plt
import nltk
import pandas as pd
import plotly.graph_objects as go
import re
import time

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from pytrends.request import TrendReq
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Download 'punkt' resource
nltk.download('punkt')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

## **1 What is Search Engine Data**

Search engine data refers to the information collected and generated by search engines as users interact with them. This data encompasses various aspects of user behavior, preferences, and interactions within the search engine ecosystem. Search engine data is a valuable alternative data source that can complement traditional financial data and provide a unique perspective on market trends and investor behavior.

In financial engineering, using search engine data is a more specialized area. We generally have two options: use APIs from major search engines, such as the Google Trends API or the Bing API, or turn to alternative data providers. Several companies specialize in providing alternative data for financial analysis, including search engine data.

In this lesson we will focus on Google Trends data. Here are some suggestions on how Google Trends data can be used for financial engineering tasks:

 - **Sentiment analysis:** Google Trends data can be used to measure the sentiment of the market towards a particular security or asset. This can be done by tracking the search volume for keywords related to the security or asset. For example, if the search volume for "Tesla stock price" is increasing, this could be a sign that the market is becoming more bullish on Tesla.
 - **Predictive modeling:** Google Trends data can be used to build predictive models for financial markets. This can be done by using machine learning algorithms to train models on historical Google Trends data and other relevant data sources. For example, a model could be trained to predict the price of a stock based on the search volume for keywords related to the stock.
 - **Risk management:** Google Trends data can be used to identify and manage risks in financial markets. This can be done by tracking the search volume for keywords related to risk factors, such as "economic recession" or "market volatility." For example, if the search volume for "economic recession" is increasing, this could be a sign that the market is becoming more concerned about the possibility of a recession.
 - **Algorithmic trading:** Google Trends data can be used to develop algorithmic trading strategies. This can be done by using Google Trends data to identify trading opportunities. For example, a trading strategy could be developed to buy a stock when the search volume for keywords related to the stock is increasing.


Search engine data can be valuable for investors, and TF-IDF can play a role in its analysis. Here's how:

**How Search Engine Data Helps Investors:**

 - Gauging Public Interest: Search engine data, like Google Trends, reveals the relative popularity of search terms over time. This can indicate public interest in specific companies, products, industries, or economic trends.
 - Identifying Emerging Trends: Increases in search volume for specific terms can signal emerging trends or shifts in consumer behavior, providing early investment opportunities.
 - Sentiment Analysis: Analyzing search queries related to a company or asset can provide insights into public sentiment, which can influence investment decisions.
 - Predictive Modeling: Search engine data can be combined with other data sources to build predictive models for market movements or company performance.

**Role of TF-IDF in Analyzing Search Engine Data:**

 - Keyword Analysis: TF-IDF can be applied to analyze the search queries themselves, identifying the most relevant and informative terms related to an investment topic.
 - Document Analysis: TF-IDF can be used to analyze the content of web pages returned in search results, helping investors understand the context and sentiment surrounding a particular search term.
 - Trend Identification: Tracking changes in the TF-IDF scores of specific terms over time can help identify emerging trends or shifts in public interest.

 - Examples:

   - An investor might analyze Google Trends data for search terms related to electric vehicles to assess the growing interest in this industry and identify potential investment opportunities.
   - A hedge fund could use TF-IDF to analyze news articles and social media posts related to a company to gauge public sentiment and predict potential stock price movements.

**Benefits for Investors:**

 - Early Signals: Search engine data can provide early signals of market trends or changes in investor sentiment.
 - Alternative Data Source: It offers an alternative data source that can complement traditional financial data.
 - Improved Decision-Making: By incorporating search engine data analysis, investors can make more informed investment decisions.

**Considerations:**

 - Data Quality: Search engine data can be noisy and may not always accurately reflect market sentiment.
 - Privacy Concerns: Ethical considerations and data privacy regulations need to be taken into account when using search engine data.

Overall, search engine data can be a valuable tool for investors, and TF-IDF can be effectively used to analyze this data and extract meaningful insights for investment decisions.



### **1.1 Introduction to Google Trends data**

Google Trends data is just one of many data sources that can be used for financial engineering tasks, and combining a variety of sources gives a more complete picture of the market.

In this lesson, we will explore how to use Python to fetch and analyze data from Google Trends, a powerful tool that reveals how often a particular search term is entered relative to the total search volume across various regions of the world and in various languages. We will use the `pytrends` library, a Python interface for Google Trends that lets us automate the downloading of reports. We will learn how to fetch the interest over time for a specific search term, how to analyze this data using Python, and how to visualize the findings. The library allows us to:

 - Fetch Interest Over Time: See how search interest for specific keywords has changed over a period.
 - Analyze Interest by Region: Discover where in the world people are searching for particular terms.
 - Find Related Queries: Identify top and rising search queries related to our keywords.

By the end of this section, we will have a solid understanding of how to leverage Python and Google Trends data to uncover insights about search trends. This knowledge can be particularly useful in fields like market research, where understanding trends can provide valuable insights into consumer behavior.


### **1.2 Fetching and Comparing Interest Over Time**

Now let's define the search terms that we want to compare. In this case, we'll compare “AAPL”, “TSLA”, and “NVDA”. We fetch the interest over the last 12 months for these search terms from Google Trends and visualize the data. The following code snippet initializes an instance of the `TrendReq` class from the `pytrends` library, defines a list of keywords and a timeframe, builds the payload for the Google Trends request, and retrieves the interest-over-time data. The `interest_over_time_df` DataFrame then contains the Google Trends data for the specified keywords and timeframe:

In [5]:
# Initialize TrendReq: hl is the host language (US English), tz is the timezone offset in minutes (360 = US Central)
pytrends = TrendReq(hl='en-US', tz=360)

# Define the list of keywords and timeframe
kw_list = ["AAPL", "TSLA", "NVDA"]
timeframe = 'today 12-m'  # latest 12 months; an explicit range such as '2023-09-01 2024-09-01' also works

# Build the payload and get the interest over time data
pytrends.build_payload(kw_list, timeframe=timeframe)
interest_over_time_df = pytrends.interest_over_time()

# Create a figure and axes and plot the data
plt.figure(figsize=(16, 6))
for term in kw_list:
  plt.plot(interest_over_time_df.index, interest_over_time_df[term], label=term)

plt.xlabel('Date')
plt.ylabel('Trends Index')
plt.title('Interest Over Time')
plt.legend(loc='upper left')
plt.grid(True)
plt.show()


TooManyRequestsError: The request failed: Google returned a response with code 429
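
A 429 response means Google has rate-limited the request. One common way to cope is to retry with exponentially increasing pauses. The helper below is a generic sketch, not part of `pytrends`: the function name `fetch_with_backoff` and its parameters are hypothetical, and the failing call is simulated so that the example runs without network access.

```python
import time

def fetch_with_backoff(fetch, max_retries=4, base_delay=1.0,
                       retry_on=(Exception,), sleep=time.sleep):
    """Call fetch(); on a retryable error wait base_delay * 2**attempt seconds and retry."""
    for attempt in range(max_retries + 1):
        try:
            return fetch()
        except retry_on:
            if attempt == max_retries:
                raise  # out of retries: surface the original error
            sleep(base_delay * 2 ** attempt)

# Stand-in for a rate-limited request: fails twice, then succeeds
calls = {"n": 0}

def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("simulated HTTP 429")
    return "data"

waits = []  # record the pauses instead of actually sleeping
result = fetch_with_backoff(flaky_fetch, base_delay=60,
                            retry_on=(RuntimeError,), sleep=waits.append)
print(result, waits)  # data [60, 120]
```

With `pytrends`, `fetch` could be a small function that calls `build_payload` and then `interest_over_time()`, and `retry_on` could be the `TooManyRequestsError` class named in the output above (assumed here to be importable from `pytrends.exceptions`).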

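Once the request succeeds and the plot is rendered (the 429 response above means Google rate-limited the request, so the cell may need to be rerun after a pause), here are some key things to pay attention to and analyze: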

**Overall Trends:**
 - Upward/Downward: Are the trends generally increasing, decreasing, or remaining stable over the year? This can indicate growing or waning interest in the companies.
 - Seasonality: Do we notice any recurring patterns or spikes at certain times of the year? This could be related to product releases, financial reports, or other events.

**Relative Popularity:**
 - Comparison: Which keyword shows the highest overall interest? Which one shows the lowest? This can give us an idea of the relative popularity of these companies in terms of search volume.
 - Crossovers: Are there points where the trends for different keywords intersect? This could signal shifts in public attention or perception.

**Volatility:**
 - Fluctuations: How much do the trends fluctuate over time? Large swings might suggest sensitivity to news or market events.
 - Outliers: Are there any significant spikes or dips that stand out? Try to investigate what might have caused them (e.g., news announcements, product launches, controversies).

**Correlation:**
 - Relationship: Do the trends for different keywords seem to move together or independently? This can indicate whether they are influenced by similar factors or not.

By carefully examining these aspects of the plot, we can gain insights into the search behavior related to these companies and potentially draw conclusions about their public perception and market performance.
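
The checks listed above can also be computed rather than read off the chart. The sketch below uses synthetic weekly values as a stand-in for `interest_over_time_df`: the numbers are random, with one artificial spike added to NVDA, and the 4-week window and z-score threshold of 2 are arbitrary illustrative choices.

```python
import numpy as np
import pandas as pd

# Synthetic weekly index values standing in for interest_over_time_df
rng = np.random.default_rng(0)
dates = pd.date_range("2023-09-03", periods=52, freq="W")
demo_df = pd.DataFrame(
    {
        "AAPL": rng.integers(20, 40, size=52),
        "TSLA": rng.integers(25, 50, size=52),
        "NVDA": rng.integers(30, 80, size=52),
    },
    index=dates,
)
demo_df.loc[dates[30], "NVDA"] = 100  # artificial spike, for illustration

smoothed = demo_df.rolling(window=4).mean()   # overall trend (4-week average)
volatility = demo_df.std()                    # size of fluctuations per keyword
correlation = demo_df.corr()                  # do the series move together?

# Outliers: weeks where any keyword is more than 2 standard deviations from its mean
z_scores = (demo_df - demo_df.mean()) / demo_df.std()
outliers = demo_df[(z_scores.abs() > 2).any(axis=1)]

print(volatility.round(1))
print(correlation.round(2))
print(outliers)
```

On real data, the dates that appear in `outliers` are the ones worth cross-checking against news, earnings reports, or product launches.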

For our example specifically, peaks and troughs can be observed. NVIDIA (NVDA) experiences a significant peak around March 2024, nearly reaching a trend index of 80, and another in the most recent period, toward the end of August 2024. Such a high index could coincide with significant events related to NVIDIA, such as product launches, earnings reports, or market developments. In fact, in March 2024, NVIDIA made significant announcements related to its next-generation GPU architecture called Blackwell. This architecture boasts an impressive 208 billion transistors and is designed to run real-time generative AI models. The anticipation around Blackwell likely contributed to the increased interest in NVIDIA during that time. Additionally, in August 2024, NVIDIA continued to unveil innovations related to AI and data center performance at the Hot Chips conference, further fueling interest in the company. These events and technological advancements likely explain the observed spikes in interest for NVIDIA during those months.

Apple (AAPL) and Tesla (TSLA) exhibit varying trends over the same period. While both companies generally maintain interest, their fluctuations are less pronounced compared to NVIDIA. Analyzing news or corporate announcements during these timeframes might shed light on the reasons behind these trends.

We should also explore correlations between the interest trends in these companies and broader market indices (e.g., S&P 500). A strong positive correlation could indicate that overall market sentiment influences interest in specific stocks.

We should also look for recurring patterns across different months or quarters. Seasonal factors (e.g., holiday seasons, earnings seasons) might impact investor sentiment and interest levels.

 - TSLA Peaks: Tesla (TSLA) usually experiences noticeable peaks around April and July each year. These could align with quarterly earnings reports or significant announcements related to Tesla's products or business developments.
 - AAPL Trends: Apple (AAPL) usually shows a consistent upward trend around September and January. These months often coincide with new product launches (e.g., iPhone releases) and holiday seasons.
 - NVDA Fluctuations: NVIDIA (NVDA) exhibits less predictable patterns, but there are fluctuations around March and August. Investigate whether industry-specific events or conferences occur during these months.
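
Recurring monthly patterns like these can be checked by averaging the index by calendar month across several years. A single 12-month window cannot establish yearly seasonality, so a longer timeframe is needed. The sketch below uses a synthetic three-year series with a September bump built in on purpose. It illustrates the grouping technique only and is not evidence about any real ticker.

```python
import numpy as np
import pandas as pd

# Synthetic multi-year weekly series with an artificial bump every September
dates = pd.date_range("2021-01-03", "2023-12-31", freq="W")
values = np.full(len(dates), 40.0)
values[dates.month == 9] += 25  # built-in seasonal effect, for illustration
series = pd.Series(values, index=dates, name="AAPL")

# Average index value per calendar month, pooled across all years
monthly_profile = series.groupby(series.index.month).mean()
peak_month = int(monthly_profile.idxmax())

print(monthly_profile)
print("Month with the highest average interest:", peak_month)
```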

### **1.3 Analyzing Interest by other metrics**

Google Trends also allows us to fetch other details related to our search query:

 - **Interest by subregion** lets us see how interest in a particular search term varies across different geographical areas within a larger region.
 - **Related topics** are broader subjects or concepts that are associated with our keyword, and which were also searched by other users.
 - **Related queries** are specific search terms that people have used in conjunction with our keyword. These are actual queries entered into the search engine by other users during the same time period.



#### **Analyzing Interest by Region**

Now, let's fetch the interest by region for the search term NVDA and identify the top 10 regions with the highest interest:

In [6]:
# Initialize an instance of the TrendReq class and build the payload
pytrends = TrendReq(hl='en-US', tz=360)
pytrends.build_payload(["NVDA"], timeframe='2023-09-01 2024-09-01')

# Get the interest by region data
interest_by_region_df = pytrends.interest_by_region()
interest_by_region_df.sort_values(by='NVDA', ascending=False).head(10)


TooManyRequestsError: The request failed: Google returned a response with code 429

This output shows the top 10 regions with the highest search interest for "NVDA" during the specified period ('2023-09-01 2024-09-01'). The values represent the relative popularity of the search term in each region, where 100 is the highest value.

For example, Hong Kong has the highest search interest for "NVDA" with a value of 100, followed by Taiwan and Israel with a value of 79 each.

This data can be useful for businesses to understand the geographic distribution of interest in their products or services. It can also be used for market research and to identify potential target markets. Please note that further analysis and correlation with other data sources would be needed to draw definitive conclusions.



#### **Related Queries and Topics**

Top and rising related topics and queries can be extremely useful for financial engineering tasks because they provide insights into the collective "mind of the market" and reveal trends that might not be obvious from traditional financial data.

**Related Topics** are broader themes associated with NVDA. This can help to identify:
 - Industry trends: Are there broader trends in gaming, AI, or data centers that are influencing interest in NVDA?
 - Investment themes: Are there specific investment themes (e.g., growth stocks, tech stocks) that are driving interest in NVDA?

**Related Queries** are other terms people search for in conjunction with the NVDA keyword. This can reveal:
 - Emerging trends: Are there new or rising keywords that indicate shifting interest or new areas of focus for NVDA?
 - Competitor analysis: Are people searching for NVDA in comparison to its competitors?
 - Public perception: What are the main concerns or interests people have regarding NVDA?

Both Related Topics and Related Queries can be filtered to get Top and Rising terms for each:

 - **Top** - these are the most popular search topics/queries. Scoring is on a relative scale, with 100 for the most commonly searched topic/query, 50 for a topic/query searched half as often, and so on.
 - **Rising** - these are topics/queries with the biggest increase in search frequency since the last time period. Results marked "Breakout" had a tremendous increase, probably because these topics/queries are new and had few (if any) prior searches.
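
The two scoring schemes can be sketched in a few lines. Google does not publish raw counts, so the numbers below are hypothetical, and the 5000% cut-off used for "Breakout" is an assumption made for this illustration.

```python
def top_scores(counts):
    """Scale raw counts so that the most searched query scores 100."""
    peak = max(counts.values())
    return {query: round(100 * count / peak) for query, count in counts.items()}

def rising_label(previous, current, breakout_pct=5000):
    """Percentage growth between two periods, or 'Breakout' for extreme growth."""
    if previous == 0:
        return "Breakout"  # no prior searches: growth is undefined
    growth = (current - previous) / previous * 100
    return "Breakout" if growth > breakout_pct else f"{growth:+,.0f}%"

# Hypothetical raw search counts
print(top_scores({"stock nvda": 200, "nvda price": 46, "amd": 14}))
# {'stock nvda': 100, 'nvda price': 23, 'amd': 7}
print(rising_label(10, 330))   # +3,200%
print(rising_label(1, 100))    # Breakout
```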


In this lesson we will focus on Related Queries. Both top and rising related queries offer valuable insights, but they serve different purposes, and the best choice depends on our goals and the type of analysis we are conducting. From a financial engineering perspective, we can give a more tailored recommendation:

 - **For Algorithmic Trading and Sentiment Analysis:** Rising queries are generally more valuable. They can provide real-time signals for sentiment-based trading strategies. We could build a system that automatically adjusts portfolio positions or generates trading signals based on changes in the sentiment expressed in rising queries. For example, if we observe a sudden surge in negative sentiment related to a company we hold in our portfolio, the algorithm could automatically reduce exposure to that stock.

 - **For Risk Management:** Rising queries are crucial as early warning signs. A spike in searches for terms related to lawsuits, regulatory investigations, or product recalls could indicate potential risks for a company or industry. We can integrate rising query analysis into a risk management system to identify and assess potential threats proactively.

 - **For Portfolio Optimization and Asset Allocation:** Both top and rising queries can be useful.
   - Top queries: Can help us understand the main themes and drivers of different assets or industries, allowing us to diversify the portfolio accordingly.
   - Rising queries: Can help us identify emerging themes or potential risks that might warrant adjustments to the portfolio.

 - **For Quantitative Research and Modeling:** Both types of queries can be valuable as alternative data sources.
    - Rising queries: Can be used to create sentiment indicators or event detection signals that can be incorporated into quantitative models.
    - Top queries: Can be used to understand the relationships between different assets or industries based on the themes and concerns expressed in search queries.

Overall, while both have their uses, rising queries tend to be more valuable for financial engineering applications that require real-time insights and a focus on short-term trends and sentiment shifts. Top queries provide a broader understanding of market perception, which can be useful for longer-term strategies and fundamental analysis. Remember to consider the challenges of noise, data quality, and the need for sophisticated analysis techniques when working with both types of queries.
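
As a minimal sketch of the risk-management idea, the snippet below scans a list of rising queries for terms from a watch list. The watch list, the company name "acme", the growth values, and the threshold are all hypothetical and only illustrate the mechanics.

```python
RISK_TERMS = {"lawsuit", "recall", "investigation", "fraud"}  # hypothetical watch list

def flag_risk_queries(rising, min_growth=100):
    """Return (query, growth) pairs that mention a risk term and grew by at least min_growth percent."""
    flagged = []
    for query, growth in rising:
        words = set(query.lower().split())
        if words & RISK_TERMS and growth >= min_growth:
            flagged.append((query, growth))
    return flagged

# Invented rising queries with percentage growth
rising = [
    ("acme recall", 450),
    ("acme stock", 300),
    ("acme lawsuit update", 80),
    ("acme SEC investigation", 1200),
]

print(flag_risk_queries(rising))
# [('acme recall', 450), ('acme SEC investigation', 1200)]
```

Simple keyword matching like this produces false positives and misses paraphrases, so in practice it would serve as a first filter ahead of more careful analysis.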

Please also note that narrowing the time period for rising related queries can be very beneficial, especially when our goal is to detect short-term trends or capture immediate market reactions. A shorter timeframe generally provides a good balance between sensitivity, data availability, and actionability. We can experiment with different timeframes to find what works best for our specific needs and the volatility of the assets we are analyzing.

Let's now fetch the top and rising related queries for the 'NVDA' search term for the period from 2024-05-01 to 2024-09-01 and analyze the results. This time we manually downloaded the data in CSV format and saved it locally as 'NVDA-relatedQueries.csv'. A Google Trends CSV file containing related queries typically presents the data in a stacked format: the "Rising" queries are placed below the "Top" queries in the same file, and there might be a blank row or a visual separator (like a line of dashes) between the two sections. The code below handles this structure. First we read the CSV file into a DataFrame, keeping each line as a single column. Then we find the start and end rows of each section. Finally we extract each section, splitting the column on the comma and converting the values to numbers:

In [7]:
# Read the CSV file
df = pd.read_csv('NVDA-relatedQueries.csv', header=None, delimiter='\t')

# Find the start and end rows for each section.
# pd.read_csv skips blank lines by default, so the 'RISING' marker row directly
# follows the last "Top" row; iloc slicing already excludes the end position.
top_start = df[df.iloc[:, 0] == 'TOP'].index[0] + 1
top_end = df[df.iloc[:, 0] == 'RISING'].index[0]
rising_start = top_end + 1
rising_end = df.shape[0]  # End of the DataFrame


In [8]:
# Extract top queries
queries_top = df.iloc[top_start:top_end].copy()
queries_top = queries_top.rename(columns={queries_top.columns[0]: 'top_queries'})

# Split the 'top_queries' column on the first comma and convert 'value' to numeric
queries_top[['top_query', 'value']] = queries_top['top_queries'].str.split(',', n=1, expand=True)
queries_top = queries_top[['top_query', 'value']]
# Assign the column directly; assigning through .loc[:, 'value'] can leave the dtype as object
queries_top['value'] = pd.to_numeric(queries_top['value'])
queries_top


Unnamed: 0,top_query,value
3,stock nvda,100
4,nvda price,23
5,stock price nvda,20
6,tsla,7
7,amd,7
8,amd stock,7
9,aapl,7
10,tsla stock,6
11,aapl stock,6
12,amzn,5


In [9]:
# Extract rising queries
queries_rising = df.iloc[rising_start:rising_end].copy()
queries_rising = queries_rising.rename(columns={queries_rising.columns[0]: 'rising_queries'})

# Split the 'rising_queries' column on the first comma
queries_rising[['rising_query', 'value']] = queries_rising['rising_queries'].str.split(',', n=1, expand=True)
queries_rising = queries_rising[['rising_query', 'value']]

# Clean the 'value' column
queries_rising['value'] = queries_rising['value'].str.replace(r'[^0-9.-]', '', regex=True)  # Keep only digits, dots, and minus signs
queries_rising['value'] = pd.to_numeric(queries_rising['value'], errors='coerce')  # Convert to numeric; non-numeric labels such as "Breakout" become NaN

# Display 'rising_queries'
queries_rising

Unnamed: 0,rising_query,value
29,ffie stock,3200
30,ffie,3150
31,serv stock,2800
32,asts,750
33,asts stock,700
34,nvda split date,700
35,nvda stock split date,550
36,gme stock,550
37,gme,550
38,gamestop stock,400


The rising queries for "NVDA" provide some interesting insights, particularly regarding the upcoming earnings announcement. Here's what we can gather:

 - Earnings Dominate: The overwhelming focus is on NVDA's earnings. Terms related to earnings are all experiencing massive increases in search interest. This signifies a high level of anticipation and interest in the company's financial results.

 - Real-Time Information: The "Breakout" trend for "nvda earnings time today" suggests that people are actively seeking real-time information about the earnings announcement, possibly during the day of the release.

 - Global Interest: The presence of Japanese and Chinese characters indicates that interest in NVDA's earnings extends beyond English-speaking regions, highlighting the company's global reach.

 - Investor Focus: Terms like "nvda investor relations" and "when does nvda report earnings" suggest that investors are actively seeking information about the earnings release and possibly preparing for potential market reactions.

 - Financials and Nasdaq: The rise of "nasdaq nvda financials" suggests that people are looking for detailed financial information about NVDA, possibly on the Nasdaq website or other financial platforms.

 - Other Stocks and Potential Noise: While earnings-related terms dominate, there are some outliers like "lunr stock," "pdd stock," "crm stock," "chwy stock," and "afrm stock." These might be related to other companies or market trends and could be noise in the context of analyzing NVDA. Further investigation would be needed to understand their relevance.
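
A first pass at separating relevant queries from potential noise is to check whether a query mentions the company at all. The sketch below applies this to the ten rising queries displayed in the table above. The set of anchor terms is an assumption, and a query that omits the ticker can of course still be relevant.

```python
# The ten rising queries displayed above, with their cleaned values
rising = [
    ("ffie stock", 3200), ("ffie", 3150), ("serv stock", 2800),
    ("asts", 750), ("asts stock", 700), ("nvda split date", 700),
    ("nvda stock split date", 550), ("gme stock", 550),
    ("gme", 550), ("gamestop stock", 400),
]

ANCHOR_TERMS = {"nvda", "nvidia"}  # assumed identifiers for the company

def mentions_anchor(query):
    """True if any word of the query is one of the anchor terms."""
    return bool(ANCHOR_TERMS & set(query.lower().split()))

related = [(q, v) for q, v in rising if mentions_anchor(q)]
other = [(q, v) for q, v in rising if not mentions_anchor(q)]

print("Mentions NVDA:", related)
print("Other tickers / potential noise:", other)
```

In these ten rows only the two stock-split queries mention NVDA directly. The rest refer to other tickers that users searched for alongside it.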


### **1.4 Considerations and Limitations when using Google Trends data**

When using Google Trends data for financial analysis, it is crucial to understand its limitations, particularly regarding data quality. Here are some key aspects to keep in mind:

 - **Sampling:** Google Trends data is based on a sample of searches, not the entire search volume, and the exact sampling methodology isn't publicly disclosed. There is therefore inherent sampling error: the data contains some noise and might not perfectly reflect actual search interest.
 - **Data Updates:** Google Trends data might be updated or revised over time as Google refines its algorithms and data processing techniques. This means that historical data we accessed previously could change slightly in the future.
 - **Normalization:** Google Trends data is normalized to a scale of 0 to 100, relative to the highest point in the selected timeframe. A value of 100 represents peak popularity, and other values are scaled proportionally. While normalization helps in comparing trends, it also masks the absolute search volume, which might be relevant in some cases.
 - **Geographic Granularity:** The level of geographic detail available in Google Trends data varies depending on the search term and the time period. For some terms or regions, the data might be more granular, while for others, it might be more aggregated. This can affect the accuracy of regional analysis.
 - **External Factors:** Search behavior can be influenced by various external factors, such as media coverage, news events, seasonality, and even the popularity of other search terms. These factors can introduce noise and bias into the data, making it challenging to isolate the true underlying trends.
 - **Privacy:** Google Trends data is anonymized and aggregated, but it's still important to be mindful of privacy considerations, especially when dealing with sensitive topics.
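
The effect of normalization can be made concrete with a simplified sketch. It assumes the index is the raw count divided by the peak in the selected window and ignores the further scaling by total search volume that Google applies. The raw counts are hypothetical.

```python
def to_trends_index(raw):
    """Scale values so that the peak of the selected window equals 100."""
    peak = max(raw)
    return [round(100 * value / peak) for value in raw]

raw = [120, 150, 300, 240, 600, 450]   # hypothetical raw weekly search counts

first_window = to_trends_index(raw[:4])  # only the first four weeks selected
full_window = to_trends_index(raw)       # all six weeks selected

print(first_window)  # [40, 50, 100, 80]
print(full_window)   # [20, 25, 50, 40, 100, 75]
```

The third week scores 100 in the shorter window but only 50 once a later, larger peak enters the timeframe, which is why index values from different downloads should not be compared directly.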


By understanding these data quality considerations and taking appropriate precautions, we can use Google Trends data more effectively and draw more reliable conclusions for financial analysis. Remember that context is key: we need to consider the context surrounding the data, such as external factors that might be influencing search behavior.

## **2 Latent Semantic Analysis**

Up until now we have explored Google Trends data and simple examples of using `pytrends` to gather it. While `pytrends` itself provides valuable insights into search trends, incorporating machine learning techniques like Latent Semantic Analysis (LSA) can offer several advantages.

Before introducing LSA, we need to understand some terminology. In the context of LSA, a "**document**" refers to a unit of text that we want to analyze. It can be anything from a single sentence to a whole article, a book, or even a collection of related texts. The key is that it represents a distinct piece of information that we want to compare and analyze in relation to other documents. Here are some examples of what could be considered a "document" in different LSA applications:

 - **Topic Modeling:** Each document could be a news article, a blog post, or a social media post. LSA can help identify the main topics discussed across these documents.
 - **Information Retrieval:** Each document could be a web page, a research paper, or a product description. LSA can help find documents that are semantically similar to a given query.
 - **Document Classification:** Each document could be an email, a customer review, or a legal contract. LSA can help categorize these documents based on their content.

Latent Semantic Analysis (LSA) is a technique in natural language processing that analyzes relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms. LSA assumes that words that are close in meaning will occur in similar pieces of text. Incorporating machine learning techniques like LSA can offer several advantages:

 - **Uncovering Hidden Relationships:** LSA can reveal hidden relationships between terms and concepts that might not be apparent from simply looking at raw search data. This can help us understand the underlying semantic structure of the search queries related to our documents.

 - **Dimensionality Reduction:** LSA can reduce the dimensionality of our data by representing it in a lower-dimensional concept space. This can make it easier to analyze and visualize the data, especially when dealing with a large number of terms.

 - **Improved Search and Recommendation:** By understanding the relationships between terms and concepts, we can build more intelligent search engines and recommendation systems. For example, we could suggest related searches or recommend documents based on semantic similarity rather than just keyword matching.

 - **Topic Modeling:** LSA can be used for topic modeling, which allows us to discover the main topics or themes present in a set of documents or search queries. This can help us understand the broader context of the information and potentially categorize or cluster documents more effectively.

In a nutshell, LSA is a method for extracting and representing the contextual-usage meaning of words by statistical computations applied to a large corpus of text. It is a powerful tool for analyzing textual data, but it has some limitations. For example, it handles polysemy (words with multiple meanings) poorly, because each word gets a single representation regardless of context, and it captures synonymy (different words with the same meaning) only to the extent that the synonyms occur in similar contexts in the corpus.
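
A minimal sketch of the LSA pipeline, assuming TF-IDF weighting followed by a truncated SVD, might look as follows. The five short texts are invented for illustration, and the choice of two concepts is arbitrary.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical search queries acting as documents
texts = [
    "nvda stock price today",
    "nvidia share price forecast",
    "nvda earnings report date",
    "tesla recall news",
    "tesla model recall lawsuit",
]

tfidf = TfidfVectorizer().fit_transform(texts)        # documents x terms
svd = TruncatedSVD(n_components=2, random_state=0)    # keep 2 latent concepts
concepts = svd.fit_transform(tfidf)                   # documents x concepts

# Pairwise similarity of documents in the reduced concept space
similarity = cosine_similarity(concepts)
print(similarity.round(2))
```

Documents that share vocabulary should end up close to each other in the concept space, although with a corpus this small the result is only indicative.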

In this lesson we will explore a basic example of how we could use LSA to analyze related search terms and improve our understanding of user search behavior. The choice of what constitutes a document depends on our specific application and the kind of analysis we want to perform. It's important to ensure that our documents are meaningful and relevant to the task at hand. A collection of documents is called a **corpus**, and LSA operates on its "**term-document matrix**", which records how often each term occurs in each document.
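
To make the term-document matrix concrete, the sketch below builds one by hand for three tiny invented documents, with one row per term and one column per document. Libraries such as scikit-learn usually store the transpose (documents as rows), which carries the same information.

```python
from collections import Counter

# Three tiny hypothetical documents
corpus = [
    "stock price today",
    "stock split date",
    "earnings date today",
]

vocabulary = sorted({word for doc in corpus for word in doc.split()})
counts = [Counter(doc.split()) for doc in corpus]

# Rows = terms, columns = documents, entries = number of occurrences
term_document = [[c[term] for c in counts] for term in vocabulary]

for term, row in zip(vocabulary, term_document):
    print(f"{term:10s} {row}")
```

For instance, the row for "stock" is `[1, 1, 0]` because the word occurs once in each of the first two documents and not at all in the third.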

Financial engineers might approach document composition for LSA or similar techniques in different ways:

 - Domain Expertise: Financial engineers often have deep domain expertise and carefully select keywords and phrases based on their understanding of financial markets, specific assets, and relevant events.

 - Data Sources: They may use a wider range of data sources beyond just Google Trends, such as news articles and sentiment analysis, social media data, company filings and earnings call transcripts, and proprietary datasets.

 - Sophisticated Preprocessing: They might employ more advanced natural language processing techniques for preprocessing, including named entity recognition (identifying companies, people, locations), sentiment analysis (assigning sentiment scores to text), and topic modeling (identifying underlying topics in documents).

 - Ensemble Methods: They may combine LSA with other machine learning techniques or use ensemble methods to improve the accuracy and robustness of their models.

 - Backtesting and Validation: Financial engineers rigorously backtest and validate their models using historical data to ensure they are reliable and can potentially generate profitable trading signals or investment strategies.

For our example we will create a term-document matrix from the following set of documents:

 > Document 1: "Risk management in derivatives trading", \
 > Document 2: "Portfolio optimization techniques and strategies", \
 > Document 3: "Algorithmic trading and its impact on market efficiency", \
 > Document 4: "Quantitative methods for financial modeling", \
 > Document 5: "High-frequency trading and liquidity provision", \
 > Document 6: "Integrating Derivatives into Portfolio Optimization Strategies", \
 > Document 7: "High-Frequency Trading Strategies: Algorithmic Approaches and Market Impact", \
 > Document 8: "Quantitative Methods for Algorithmic Trading and Portfolio Optimization" \

Before proceeding with code, let's observe that documents in our term-document matrix contain such words as "in", "and", "for", "on". These words are called **stop words**. They are common words that generally don't carry much specific meaning in the context of text analysis. It's often beneficial to remove stop words from text data before creating the term-document matrix. This can help to:

 - **Reduce noise:** Stop words can add noise to our data and obscure the more important terms that convey the main topics or themes.
 - **Improve efficiency:** Removing stop words can reduce the size of our matrix, making computations faster and more efficient.
 - **Focus on relevant terms:** By eliminating common words, we can focus on the more meaningful terms that are specific to our documents and application.
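
To make the effect concrete, here is a minimal sketch that filters stop words from Document 3. The set `TOY_STOP_WORDS` and the helper `remove_stop_words` are illustrative assumptions; the set is far smaller than the NLTK English stop-word list used later in this lesson, and the whitespace split is a simplification of proper tokenization.

```python
# Illustrative stop-word set (an assumption; NLTK's English list is much larger)
TOY_STOP_WORDS = {"in", "and", "for", "on", "its", "into"}


def remove_stop_words(text, stop_words=TOY_STOP_WORDS):
    """Lowercase the text, split on whitespace, and drop stop words."""
    return [token for token in text.lower().split() if token not in stop_words]


doc3 = "Algorithmic trading and its impact on market efficiency"
filtered_tokens = remove_stop_words(doc3)
print(filtered_tokens)
```

Only the five content-bearing tokens remain, so Document 3 would contribute five non-zero counts to the matrix instead of eight.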

We will also apply lemmatization. **Stemming and lemmatization** are two text preprocessing techniques used in Natural Language Processing (NLP) that help to reduce words to their base form.
 - **Stemming** reduces words to their stems, which are not necessarily real words, by stripping affixes (mostly suffixes). For example, "playing" would become "play", "studies" would become "studi", and "happily" would become "happi" (the exact output depends on the stemming algorithm).
 - **Lemmatization** reduces words to their dictionary or base form, known as the lemma. For example, "playing" would become "play", "studies" would become "study", and "better" would become "good" (provided the lemmatizer is given the right part of speech).

In our example we consider lemmatization: Given the specific and potentially nuanced language of finance, lemmatization is likely to be more beneficial than stemming. It can help group different forms of words (e.g., "analyze," "analyzing," "analysis") while preserving their core meaning.

Now that we have covered the basic terminology and the definition of LSA, let's proceed with preparing our data. The following Python code performs exactly that:






In [None]:
# Sample documents
documents = [
    "Risk management in derivatives trading",
    "Portfolio optimization techniques and strategies",
    "Algorithmic trading and its impact on market efficiency",
    "Quantitative methods for financial modeling",
    "High-frequency trading and liquidity provision",
    "Integrating Derivatives into Portfolio Optimization Strategies",
    "High-Frequency Trading Strategies: Algorithmic Approaches and Market Impact",
    "Quantitative Methods for Algorithmic Trading and Portfolio Optimization"
]

# Download stopwords and WordNet resources and then create a Lemmatizer Object
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
lemmatizer = WordNetLemmatizer()

# construct text preprocessing function
def preprocess_text(text):
    text = re.sub(r'[^\w\s-]', '', text)  # Remove punctuation except hyphens
    tokens = nltk.word_tokenize(text.lower())
    tokens = [lemmatizer.lemmatize(token) for token in tokens] # Lemmatize
    stop_words = set(stopwords.words('english')) # define stop_words as a set of english stop words
    tokens = [token for token in tokens if token not in stop_words] # Remove stop words
    return tokens

# Create a CountVectorizer object
vectorizer = CountVectorizer(tokenizer=preprocess_text, token_pattern=None)

# Create the term-document matrix, then convert to a pandas DataFrame
term_document_matrix = vectorizer.fit_transform(documents)
df = pd.DataFrame(term_document_matrix.toarray(), columns = vectorizer.get_feature_names_out())
term_document_matrix = df.T
print(term_document_matrix)

This matrix shows the relationships between the terms (rows) and documents (columns). For example:

 - The term "algorithmic" appears (value of 1) in documents 3, 7 and 8.
 - The term "trading" appears in documents 1, 3, 5, 7 and 8.
 - Document 1 contains the terms "derivative" (the lemma of "derivatives"), "management", "risk" and "trading".

This matrix is the foundation for applying LSA or other dimensionality reduction techniques.



### **2.1 How LSA relates to SVD**

LSA uses Singular Value Decomposition (SVD) at its core to find the hidden relationships between terms and concepts. Singular Value Decomposition is a fundamental concept in linear algebra and a powerful technique used in many areas, including machine learning and natural language processing. Here's how it works:

 1. Document-Term Matrix: LSA starts with a document-term matrix, where each row represents a document and each column represents a term (word). The cells contain the frequency of each term in each document. (The code in this lesson works with the transpose, a term-document matrix, so there the roles of $U$ and $V$ described below are swapped.)

 2. SVD Application or Matrix Factorization: SVD is applied to the document-term matrix. It factorizes (breaks down) the matrix into three separate matrices:

   - $U$: Represents document-to-concept similarity.
   - $\Sigma$: A diagonal matrix containing singular values, representing the importance of each concept.
   - $V$: Represents term-to-concept similarity.

   - The Formula: The relationship between the original matrix ($A$) and the decomposed matrices is: $A = U \Sigma V^T$ (where $V^T$ is the transpose of $V$).

 3. Dimensionality Reduction: By keeping only the largest singular values (and corresponding columns/rows in $U$ and $V$), we reduce the number of dimensions (concepts) while preserving the most important information. This is similar to Principal Component Analysis (PCA).

 4. Concept Space: The resulting matrices represent documents and terms in a new "concept space". Words with similar meanings will be closer together in this space, even if they don't appear in the same documents.
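
The steps above can be sketched with NumPy on a tiny matrix. This is a minimal illustration under stated assumptions: the counts in `A` are made up, and `np.linalg.svd` stands in for the SVD step (it is not the `TruncatedSVD` class used later in this lesson).

```python
import numpy as np

# Toy document-term count matrix (rows = documents, columns = terms).
# The numbers are made up for illustration only.
A = np.array([
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 0, 1, 1],
], dtype=float)

# Thin SVD: A = U @ diag(s) @ Vt, with singular values s sorted in descending order
U, s, Vt = np.linalg.svd(A, full_matrices=False)
reconstruction = U @ np.diag(s) @ Vt
print("Exact reconstruction:", np.allclose(A, reconstruction))

# Dimensionality reduction: keep only the k largest singular values
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Coordinates in the k-dimensional concept space
doc_coords = U[:, :k] * s[:k]        # one row per document
term_coords = Vt[:k, :].T * s[:k]    # one row per term

approx_error = np.linalg.norm(A - A_k)  # Frobenius norm of what was discarded
print("Rank-k approximation error:", approx_error)
```

With only one singular value discarded, the approximation error equals that discarded singular value, which shows how $\Sigma$ quantifies the information lost by truncation.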

In essence, SVD in LSA helps to:

 - **Uncover latent relationships:** Identify underlying concepts that are not explicitly present in the text.
 - **Reduce noise:** Filter out less important information and focus on the core meaning.
 - **Improve information retrieval:** Represent documents and queries in a way that captures semantic similarity.

This process allows LSA to go beyond simple keyword matching and understand the meaning of text at a deeper level.

Let's now proceed with Python code where we will use `TruncatedSVD` from `scikit-learn`, which performs Singular Value Decomposition (SVD) under the hood. `TruncatedSVD` is an efficient implementation of SVD specifically designed for dimensionality reduction. It keeps only the top `n_components` singular values and their corresponding vectors. This truncated version of the SVD is what produces the `lsa_matrix`, which represents our data in a lower-dimensional concept space.

**Important:** The output of LSA can be different each time we run the code. This is due to the random initialization of the `TruncatedSVD` algorithm. The variation is related to the nature of Singular Value Decomposition (SVD) and how LSA uses it. Here's a breakdown of the key factors:

 - Rotation in Concept Space: SVD finds the directions that capture the most structure in the original data and uses them as the axes of a new "concept space." When two or more singular values are equal (or nearly equal), the corresponding components are only determined up to a rotation within the subspace they span. Think of it like rotating a 3D object -- the object remains the same, but its projection onto the axes changes.

 - Sign Ambiguity: The signs (positive or negative) of the values in the LSA output can flip without changing the underlying meaning. This is because the direction of a component is arbitrary. Imagine a component representing "financial modeling" -- it doesn't matter if it points in one direction or the opposite direction; it still captures the same concept.

 - Sensitivity to Small Changes: SVD can be sensitive to small changes in the input data. Even minor variations in word frequencies or the addition/removal of a few documents can lead to some degree of rotation in the concept space, resulting in different term loadings.

 - Random Initialization: As mentioned before, `TruncatedSVD` uses a randomized solver by default (`algorithm='randomized'`), which starts from random vectors and can therefore contribute to variations in the output.

While these variations might seem concerning, they usually don't drastically change the overall interpretation of the LSA results. The relative positions of terms in the concept space and the general patterns of similarity are generally preserved. To ensure consistent results, we will set the `random_state` parameter to a fixed value when creating the `TruncatedSVD` object. This will produce the same output each time we run the code.
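
The sign ambiguity can be checked numerically. The sketch below is an illustration under stated assumptions: the matrix `A` is made up, `np.linalg.svd` is used in place of `TruncatedSVD`, and `cosine_matrix` is a hypothetical helper written for this example.

```python
import numpy as np

# Made-up 4 x 3 matrix; every row is non-zero
A = np.array([[2.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 2.0, 1.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Flip the sign of the first component in both U and Vt
U_flipped, Vt_flipped = U.copy(), Vt.copy()
U_flipped[:, 0] *= -1
Vt_flipped[0, :] *= -1

# The product is unchanged, so both factorizations are equally valid
same_product = np.allclose(U @ np.diag(s) @ Vt,
                           U_flipped @ np.diag(s) @ Vt_flipped)
print("Same matrix after sign flip:", same_product)


def cosine_matrix(X):
    """Pairwise cosine similarity between the rows of X (hypothetical helper)."""
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    return unit @ unit.T


# Row coordinates in concept space, before and after the flip
coords = U * s
coords_flipped = U_flipped * s
same_similarities = np.allclose(cosine_matrix(coords), cosine_matrix(coords_flipped))
print("Same pairwise similarities:", same_similarities)
```

Individual coordinates change sign, but pairwise similarities between rows do not, which is why the interpretation of the concept space survives a flip.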



In [None]:
# Get the terms (words) from the term-document matrix
terms = term_document_matrix.index

# Create an LSA model and fit to term_document_matrix
lsa = TruncatedSVD(n_components=2, random_state=42)
lsa_matrix = lsa.fit_transform(term_document_matrix)

# Create DataFrame for LSA results with term names and print it
df_lsa = pd.DataFrame(lsa_matrix, index=terms)
print(df_lsa)

Interpreting the results of LSA involves understanding the new semantic space created by the dimensionality reduction and analyzing the relationships between terms and documents in this space. Here is how to interpret the results:
 - Interpreting the columns: These are also known as components, concepts or dimensions. Think of each component as representing a hidden "concept" or "topic" within the documents.
 - Interpreting the rows: Each row in this matrix represents a term from our original matrix, but now projected onto a 2-dimensional concept space (because we chose `n_components=2`).
 - Interpreting the values:
   - The two values in each row represent the coordinates of that term in the new concept space. Terms with similar coordinates are closer together in the concept space, indicating a semantic relationship. In our example, the first row represents the term "algorithmic". Its coordinates are approximately (1.488, -0.458). If another term has similar coordinates, it suggests that these terms are related in the context of our documents.
   - The values represent the loading of that term on the corresponding component. The loadings tell us how much each term contributes to that concept. For example, "algorithmic" and "trading" have high loadings on component 0, suggesting that component 0 represents the concept of "algorithmic trading". A higher absolute value (positive or negative) indicates a stronger association between the term and that component.

From the output of our code we can interpret the relationships between terms by looking at their coordinates and draw some preliminary observations:

 - **Distinct Clusters:**
   - Trading Cluster: "algorithmic", "high-frequency", "impact", "market", and "trading" have high values for component 0 and mostly negative values for component 1. This reinforces the idea that component 0 represents concepts related to automated and high-frequency trading, and their potential impact on the market.

   - Portfolio and Optimization Cluster: "optimization" and "portfolio" form a clear cluster with high positive values for both components. This suggests a strong association between these terms, likely related to portfolio optimization techniques.

   - Derivatives and Method-Quantitative Cluster: Grouping "method", "quantitative", and "derivative" together suggests a potential connection between quantitative methods and their application to derivatives.

   - Management-Risk-Liquidity-Provision Cluster: Combining "management", "risk", "liquidity", and "provision" suggests a cluster related to managing financial risk and ensuring liquidity. This also seems plausible, as these concepts are often intertwined in financial contexts, particularly when dealing with derivatives or trading activities.

 - **Other Observations:**
   - "Trading" as a central concept: "trading" has the highest value for component 0, indicating its importance and potential connection to various forms of trading discussed in the documents.
   - Distinct Pairs: The terms "financial" and "modeling" form a very distinct pair because their coordinates are identical. Further analysis is needed to determine their relationship to other terms and clusters.
   - Overlap between clusters: Some terms, like "strategy" and "technique", have moderate values in multiple components, suggesting potential overlap or connections between different clusters.

This quick analysis gives a reasonably good overview of the relationships between terms based on the LSA results.

We can now plot the coordinates to visualise them and confirm if these proposed clusters appear close together in visual representation. To achieve this, we will use `plotly.graph_objects` module imported as `go`:



In [None]:
# Create a list of hover texts with labels and coordinates
hover_texts = []
for row in df_lsa.values:
    # Collect every term that shares these exact coordinates (overlapping markers)
    matching_terms = df_lsa.index[(df_lsa[0] == row[0]) & (df_lsa[1] == row[1])]
    terms_str = ', '.join([f'"{term}"' for term in matching_terms])
    coord = f"({row[0]:.5f}, {row[1]:.5f})"
    hover_texts.append(f"Terms: {terms_str}<br>{coord}")

# Create the scatter plot
fig = go.Figure(go.Scatter(
    x=df_lsa[0],
    y=df_lsa[1],
    mode='markers',
    marker=go.scatter.Marker(
        size=10,
        color='blue',
        opacity=0.8
    ),
    hovertemplate='%{text}<extra></extra>',
    text=hover_texts
))

# Set axis labels and title
fig.update_layout(
    xaxis_title="Component 0",
    yaxis_title="Component 1",
    title="LSA Results Scatter Plot"
)

fig.show()

Plotly plots are interactive by default. We can use the interactive features to pan, zoom, and hover over points to see tooltips with term labels and coordinates.

Looking at this plot, the clusters are not that easy to see. It seems that we focused too much on rough ranges of the term coordinates without paying enough attention to the precise distances between them. To determine clusters more accurately, we should use more formal methods. These let us avoid subjective interpretation and define clusters based on quantitative measures of similarity or distance. We will consider such methods later in this lesson.

### **2.2 Important considerations with LSA and SVD**

#### **Choosing the optimal number of components for LSA**

Choosing the optimal number of components (`n_components`) for LSA is a crucial step. Here are some considerations:

 - **Start small:** Begin with a small number of components (e.g., 2, 3, or 5) to see if it gives a good initial understanding of the relationships between terms and documents. This can help to get a feel for the data and identify potential clusters or topics.
 - **Iteratively increase:** Gradually increase the number of components and observe how the interpretation of the results changes. Look for a point where adding more components doesn't significantly improve the interpretability or reveal new meaningful relationships.
 - **Consider the size of corpus:** For smaller corpora, fewer components might suffice to avoid overfitting. For larger corpora, more components could capture more subtle semantic relationships.
 - **Focus on semantic coherence:** Evaluate the semantic coherence of the resulting topics or clusters. Do the terms within each cluster make sense together? Are the clusters distinct and meaningful? If the coherence decreases as we add more components, it might be a sign that we are starting to overfit.
 - **Balance dimensionality reduction with information loss:** LSA involves a trade-off between reducing dimensionality and preserving information. We should try to find a sweet spot where we have a manageable number of components while still capturing the most important semantic relationships.

In summary, there is no one-size-fits-all answer for the optimal number of components in LSA. It depends on specific data and goals. We would need to experiment with different values, carefully evaluate the results, and use domain knowledge to guide our choice.
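
One quantitative aid for this choice is to look at how much of the matrix each additional component retains. The sketch below is a heuristic under stated assumptions: the counts in `A` are made up, and the measure used is the share of the squared Frobenius norm captured by the top `k` singular values. scikit-learn's `TruncatedSVD` also exposes an `explained_variance_ratio_` attribute, which is a related but not identical measure.

```python
import numpy as np

# Toy term-document counts (rows = terms, columns = documents); made-up numbers
A = np.array([
    [1, 0, 1, 0, 1],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 0, 1, 1],
    [0, 0, 1, 0, 1],
    [1, 1, 0, 0, 0],
], dtype=float)

# Singular values only (no need for U and V here)
singular_values = np.linalg.svd(A, compute_uv=False)

# Share of the squared Frobenius norm carried by each component
energy = singular_values**2 / np.sum(singular_values**2)
cumulative_energy = np.cumsum(energy)

for k, share in enumerate(cumulative_energy, start=1):
    print(f"k={k}: {share:.1%} of the squared Frobenius norm retained")
```

A common rule of thumb is to look for the point where the cumulative share flattens out, then confirm that the resulting components are still interpretable.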


#### **Understanding limitations of LSA and SVD**

As we can see from the example above, some terms cluster together because they appear in the same document. For example, "algorithmic", "high-frequency", "impact", "market", and "trading" form one cluster, and all of them occur in document 7.

This is a crucial point about Latent Semantic Analysis (LSA) and its potential limitations. If LSA consistently clusters terms solely based on their co-occurrence within the same document, it raises questions about its effectiveness in capturing broader semantic relationships.

Here's a breakdown of why this might happen and what it means for LSA:

 - **Document-Specific Context:** LSA primarily relies on word co-occurrence patterns within a corpus. If terms frequently appear together in a single document but not across the entire corpus, LSA might interpret them as semantically related even if their broader contextual meanings differ significantly.
 - **Limited Corpus Size or Diversity:** A small or homogenous corpus can exacerbate this issue. If documents within the corpus lack diversity in topics and language use, LSA might struggle to identify broader semantic relationships beyond document-level co-occurrences.
 - **Data Sparsity:** In LSA, the term-document matrix can be very sparse, especially with a large vocabulary. This sparsity can sometimes lead to the model focusing on local co-occurrence patterns within documents rather than global semantic relationships.

So, what's the point of LSA if it primarily identifies document-level clusters? While the scenario we encounter isn't ideal, LSA can still offer valuable insights:

 - **Document Summarization:** LSA can be effective in identifying key terms and concepts within individual documents, which is useful for summarization and topic extraction.
 - **Dimensionality Reduction:** Even if clusters are document-specific, LSA effectively reduces the dimensionality of the data, making it easier to analyze and visualize.
 - **Basis for Other Techniques:** LSA can serve as a foundation for more advanced topic modeling techniques that address its limitations, such as Latent Dirichlet Allocation (LDA).

To improve LSA's performance and capture broader semantic relationships:

 - **Increase Corpus Size and Diversity:** Include a wider range of documents on different topics with diverse language use.
 - **Preprocessing and Feature Engineering:** Carefully preprocess the text (e.g., removing stop words, stemming/lemmatization) and consider using different weighting schemes to emphasize more informative terms. One example is Term Frequency-Inverse Document Frequency (TF-IDF), a numerical statistic used in information retrieval and text mining to reflect how important a word is to a document in a collection or corpus.
 - **Explore Other Techniques:** Consider using alternative topic modeling techniques like LDA, which explicitly models the distribution of topics within documents.

In summary, while LSA can be useful for document-level analysis, its limitations in capturing broader semantic relationships should be considered. Ensuring a diverse and representative corpus and exploring alternative techniques can help overcome these limitations and unlock the full potential of topic modeling.

## **3 Compute Similarity Measures**

In this section we will apply several similarity measures to our LSA results to further analyse terms in the LSA space.

It is generally not best practice to rely solely on one similarity measure, especially when exploring complex datasets or tasks. Different similarity measures have different strengths and weaknesses and may capture different aspects of similarity.











### **3.1 Cosine similarity measure between terms**

Cosine similarity is a common way to measure the similarity between vectors in the LSA space. The following code helps to quantify the semantic similarity between terms within the first distinct cluster (terms: "algorithmic", "high-frequency", "impact", "market", "trading") based on their representations in the LSA concept space:

In [None]:
# Extract term vectors for first distinct cluster
terms_cl1 = ["algorithmic", "high-frequency", "impact", "market", "trading"]
term_vectors_cl1 = df_lsa.loc[terms_cl1]

# Calculate cosine similarity
similarity_matrix = cosine_similarity(term_vectors_cl1)

# Convert to DataFrame for better visualization
similarity_df = pd.DataFrame(similarity_matrix, index=terms_cl1, columns=terms_cl1)

# Print the results for the first distinct cluster with labels
print("Cosine similarity for first distinct cluster:")
print(similarity_df)

This matrix shows the pairwise cosine similarity between the terms in the first distinct cluster. In general, cosine similarity ranges from -1 to 1: a value of 1 means the vectors point in the same direction, 0 means they are orthogonal (no similarity), and negative values mean they point in opposing directions. Here are some observations:

 - High similarity within the cluster: Most of the values in the matrix are close to 1, indicating that the terms in this cluster are highly semantically similar. This is expected since they were identified as a cluster in the LSA analysis.
 - "Impact" and "market" are perfectly similar: The cosine similarity between "impact" and "market" is 1. Both terms occur in exactly the same documents (3 and 7), so their LSA vectors coincide; this reflects identical usage in this small corpus rather than true synonymy.
 - "Algorithmic" and "trading" are very similar: The cosine similarity between "algorithmic" and "trading" is very high (0.998), indicating a strong semantic connection between these terms. This makes sense as "algorithmic trading" is a common concept.
 - "High-frequency" is slightly less similar: "High-frequency" has slightly lower similarity scores compared to other terms, suggesting that it might have a slightly different meaning or context within this cluster.

This analysis confirms that the terms in the first distinct cluster are indeed semantically related based on their cosine similarity.
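
To see what the measure computes, here is a minimal pure-Python sketch of the cosine formula, $\cos(u, v) = \frac{u \cdot v}{\|u\| \|v\|}$. The helper `cosine` and the 2-D coordinates are made up for illustration; they are not taken from `df_lsa`.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors given as sequences of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up 2-D coordinates in a concept space
a = (1.5, -0.5)
b = (3.0, -1.0)   # same direction as a, twice as long
c = (0.5, 1.5)    # orthogonal to a
d = (-1.5, 0.5)   # opposite direction to a

print(f"same direction: {cosine(a, b):.3f}")
print(f"orthogonal:     {cosine(a, c):.3f}")
print(f"opposite:       {cosine(a, d):.3f}")
```

Because the formula divides by both vector lengths, only direction matters: `b` is twice as long as `a` yet is treated as perfectly similar to it.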

Let's now compute cosine similarity for terms in the fourth cluster:

In [None]:
# Extract term vectors for fourth cluster
terms_cl4 = ["management", "risk", "liquidity", "provision"]
term_vectors_cl4 = df_lsa.loc[terms_cl4]

# Calculate cosine similarity
similarity_matrix = cosine_similarity(term_vectors_cl4)

# Convert to DataFrame for better visualization
similarity_df = pd.DataFrame(similarity_matrix, index=terms_cl4, columns=terms_cl4)

# Print the results for the fourth cluster with labels
print("Cosine similarity for fourth cluster:")
print(similarity_df)

Here we have the following:

 - Perfect similarity: "Management" and "risk" have a cosine similarity of 1, indicating they are treated as highly related or even synonymous within this context. Similarly, "liquidity" and "provision" also show perfect similarity. This is not surprising because cosine similarity measures the angle between two vectors. Since "Management" and "risk" have identical coordinates, they point in the same direction, resulting in 0-degree angle and cosine similarity of 1. Similarly, "liquidity" and "provision" also form another pair of identical vectors.

 - Strong but not perfect similarity: Across the two pairs (for example, "risk" versus "liquidity") the similarity is strong but not perfect (cosine similarity of 0.82). This suggests a degree of semantic overlap but also some distinction in their meanings within the documents.

This cluster clearly indicates a focus on managing risk, potentially highlighting the importance of liquidity provision in mitigating financial risks.

We can continue this approach of selecting specific clusters and analyzing cosine similarity within those groups. This allows us to focus on terms that are already known to be related and uncover finer-grained semantic connections.

By using a more targeted approach, we can gain more meaningful insights from cosine similarity analysis and better understand the semantic structure of our data.

One way to explore the relationship between clusters is to calculate the centroid of each cluster and then measure the cosine similarity between the centroids. This can help us understand how different topics or concepts relate to each other.

Earlier we identified four distinct clusters and one distinct pair: the terms "financial" and "modeling". Let's analyse this pair further to determine its relationship to the clusters.

In [None]:
# Extract term vectors for distinct pair
terms_pair = ["financial", "modeling"]
term_vectors_pair = df_lsa.loc[terms_pair]

# Extract term vectors for second and third cluster
term_vectors_cl2 = df_lsa.loc[["optimization", "portfolio"]]
term_vectors_cl3 = df_lsa.loc[["derivative", "method", "quantitative"]]


# Calculate centroid for the distinct pair
centroid_pair = term_vectors_pair.mean()

# Store cluster term vectors in a list and calculate centroids for each cluster
cluster_term_vectors = [term_vectors_cl1, term_vectors_cl2, term_vectors_cl3, term_vectors_cl4]
centroids = [term_vectors.mean() for term_vectors in cluster_term_vectors]


# Calculate cosine similarity between the distinct pair and each cluster centroid
similarities = [cosine_similarity(centroid_pair.values.reshape(1, -1), centroid.values.reshape(1, -1)) for centroid in centroids]

# Print the similarities (each result is a 1x1 array, so extract the scalar)
for i, similarity in enumerate(similarities):
    print(f"Cosine similarity between distinct pair and cluster {i+1}: {similarity[0, 0]:.3f}")


These results show the cosine similarity between the distinct pair and each of the four clusters. Here's how to interpret them:

 - Strongest similarity with cluster 2: The distinct pair has the highest cosine similarity (0.997) with cluster 2. This suggests that the pair is very closely related to the concepts or terms represented by cluster 2.
 - High similarity with cluster 3: The similarity with cluster 3 is also high (0.961), only slightly below that of cluster 2, indicating substantial overlap in meaning or context.
 - Weak similarity with clusters 1 and 4: The pair shows very weak similarity with clusters 1 (0.084) and 4 (0.056), suggesting that these clusters represent distinct concepts that are not closely related to the pair.

Conclusion: Based on these results, the distinct pair is most strongly associated with cluster 2 and almost as strongly with cluster 3, while it is nearly orthogonal to clusters 1 and 4.

### **3.2 Cosine similarity between documents**

Measuring cosine similarity between entire documents is one of the common applications of cosine similarity in natural language processing and information retrieval.

To do this we first need to obtain **document vectors**. There are a few techniques for this, such as LSA vectors, word embeddings, and Term Frequency-Inverse Document Frequency (TF-IDF).

In our lesson we will focus on TF-IDF vectors as these capture the importance of words in a document relative to a collection of documents. The following code calculates the cosine similarity between the documents using TF-IDF and presents the output in a dataframe.





In [None]:
# Create a TfidfVectorizer object and fit it to the documents
tfidf_vectorizer = TfidfVectorizer(tokenizer=preprocess_text, token_pattern=None)
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)

# Calculate the cosine similarity matrix
cosine_sim = cosine_similarity(tfidf_matrix)

# Present the cosine similarity matrix in DataFrame form
short_labels = ['Doc ' + str(i+1) for i in range(len(documents))]
df = pd.DataFrame(cosine_sim, index=short_labels, columns=short_labels)
df


Here `TfidfVectorizer` is initialized with the same `preprocess_text` function we used previously for preprocessing document texts during LSA analysis. This function is passed as an argument to the tokenizer parameter inside `TfidfVectorizer`. By using the `preprocess_text` function as the tokenizer, we ensure that the TF-IDF calculation is based on the lemmatized and cleaned tokens, which can lead to more meaningful and accurate results. The `TfidfVectorizer` will apply this function to each document in the corpus to convert the raw text into a sequence of tokens (words or terms).

Both `TfidfVectorizer` and `CountVectorizer` that we used earlier are used in `scikit-learn` for text analysis to convert a collection of text documents into a matrix of numerical features. However, they differ in how they represent the importance of words in the documents:

 - `CountVectorizer`: Creates a matrix where each entry represents the number of times a word appears in a document (term frequency). It's a simple way to represent text data numerically, but it doesn't consider the relative importance of words across the entire corpus. `CountVectorizer` focuses on the local importance of words within a document.

 - `TfidfVectorizer`: Creates a matrix where each entry represents the TF-IDF score of a word in a document. TF-IDF (Term Frequency-Inverse Document Frequency) takes into account both the frequency of a word in a document and its rarity across the entire corpus. This means that words that are frequent in a document but rare in the corpus will have higher TF-IDF scores, indicating their importance for that specific document. `TfidfVectorizer` focuses on the global importance of words relative to the entire corpus.
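
The difference can be illustrated by computing TF-IDF weights by hand. This is a minimal sketch under stated assumptions: the tokenized `corpus` is made up, `tfidf_vector` is a hypothetical helper, and the weighting uses the smoothed formula $\text{idf}(t) = \ln\frac{1 + n}{1 + \text{df}(t)} + 1$ followed by L2 normalization, which is how scikit-learn documents the default `TfidfVectorizer` settings.

```python
import math
from collections import Counter

# Pre-tokenized toy corpus (tokens written by hand for illustration)
corpus = [
    ["portfolio", "optimization", "strategy"],
    ["algorithmic", "trading", "strategy"],
    ["quantitative", "method", "trading"],
]
n_docs = len(corpus)

# Document frequency: number of documents that contain each term
doc_freq = Counter(term for doc in corpus for term in set(doc))


def tfidf_vector(tokens):
    """Smoothed TF-IDF weights with L2 normalization for one document."""
    counts = Counter(tokens)  # raw term frequencies
    weights = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


doc1_weights = tfidf_vector(corpus[0])
for term, weight in doc1_weights.items():
    print(f"{term:>12}: {weight:.3f}")
```

All three terms have a raw count of 1 in the first document, so `CountVectorizer` would treat them equally. Under TF-IDF, "strategy" receives a lower weight because it also appears in another document.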

Let's now analyse the code output. The above DataFrame shows the cosine similarity between the documents. Here are some observations based on the cosine similarity scores:

 - Document 6 and Document 2 are strongly similar (0.54). This makes sense as both documents discuss portfolio optimization.
 - Document 7 and Document 3 are the most similar pair (0.59), which is expected as they both cover algorithmic trading and market impact.
 - Document 8 is moderately similar to several documents, including Document 2 (0.36), Document 6 (0.32), and Document 4 (0.42). This is because it covers a broader range of topics, including quantitative methods, algorithmic trading and portfolio optimization.
 - Document 1 and Document 5 have low similarity (around 0.10) with most other documents. They discuss risk management and liquidity provision, respectively, and share few terms with the rest of the corpus (mainly "trading").

Here are some potential insights we could draw:

 - Documents related to portfolio optimization (2 and 6) form a cluster of related topics.
 - Algorithmic trading (3 and 7) is another cluster of related concepts.
 - Document 8 acts as a bridge between these two clusters, indicating some overlap between the topics.
 - Risk management (1) and liquidity provision (5) are relatively independent concepts within this dataset.

Please note that these are just initial observations. One can delve deeper into the documents themselves to understand the nuances of their similarity and how the specific terms contribute to the cosine similarity scores.

Remember that cosine similarity is just one measure of similarity. Depending on specific needs and the nature of data, we might need to explore other similarity measures as well.

### **3.3 Jaccard index**

Let's now consider the Jaccard index as a measure of similarity between our documents. For two sets of tokens $A$ and $B$, it is the size of their intersection divided by the size of their union: $J(A, B) = |A \cap B| / |A \cup B|$. In the following code snippet we construct a `jaccard_similarity` function that calculates the Jaccard similarity between two pre-processed documents. Inside this function we apply the built-in `intersection` and `union` methods of Python set objects and then compute the index as `intersection / union`. We apply this function to all possible pairs in our document corpus to compute all pairwise indexes:

In [None]:
# Sample documents
documents = [
    "Risk management in derivatives trading",
    "Portfolio optimization techniques and strategies",
    "Algorithmic trading and its impact on market efficiency",
    "Quantitative methods for financial modeling",
    "High-frequency trading and liquidity provision",
    "Integrating Derivatives into Portfolio Optimization Strategies",
    "High-Frequency Trading Strategies: Algorithmic Approaches and Market Impact",
    "Quantitative Methods for Algorithmic Trading and Portfolio Optimization"
]

# Download stopwords, WordNet resources and create a Lemmatizer Object
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
lemmatizer = WordNetLemmatizer()

# Build the stop-word set once instead of on every function call
stop_words = set(stopwords.words('english'))

# Preprocessing function
def preprocess_text(text):
    text = re.sub(r'[^\w\s-]', '', text)  # Remove punctuation except hyphens
    tokens = nltk.word_tokenize(text.lower())  # Lowercase and tokenize
    tokens = [lemmatizer.lemmatize(token) for token in tokens]  # Lemmatize
    tokens = [token for token in tokens if token not in stop_words]  # Remove stop words
    return tokens


# Function to calculate Jaccard similarity between two documents
def jaccard_similarity(doc1, doc2):
    # Preprocess documents and keep only the unique tokens
    tokens1 = set(preprocess_text(doc1))
    tokens2 = set(preprocess_text(doc2))

    # Jaccard index = size of intersection / size of union
    intersection = len(tokens1.intersection(tokens2))
    union = len(tokens1.union(tokens2))
    return intersection / union if union else 0.0  # Guard against two empty token sets

# Calculate Jaccard similarity matrix
jaccard_sim = [[jaccard_similarity(doc1, doc2) for doc2 in documents] for doc1 in documents]

# Present Jaccard similarity matrix in DataFrame form
short_labels = ['Doc ' + str(i+1) for i in range(len(documents))]
df = pd.DataFrame(jaccard_sim, index=short_labels, columns=short_labels)
df

Here we observe:
 - Strong similarity: Documents 2 and 6 have the highest similarity score (0.50), tied with Documents 3 and 7 (also 0.50).
 - Moderate similarity: Document 8 again shows moderate similarity with Documents 2 (0.25), 4 (0.25) and 6 (0.22).
 - Low similarity: Documents 1 and 5 again exhibit low similarity.
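
The scores above can be checked by hand with plain Python sets. This is a minimal sketch, assuming the preprocessing step (lowercasing, lemmatizing, stop-word removal) reduces the three titles to the token sets written out below. The `jaccard` helper is a hypothetical stand-in for the lesson's `jaccard_similarity` that skips preprocessing.

```python
def jaccard(set_a, set_b):
    """Jaccard index: size of intersection / size of union (0.0 if both are empty)."""
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

# Assumed token sets for Documents 2, 6, and 8 after preprocessing
doc_2 = {"portfolio", "optimization", "technique", "strategy"}
doc_6 = {"integrating", "derivative", "portfolio", "optimization", "strategy"}
doc_8 = {"quantitative", "method", "algorithmic", "trading",
         "portfolio", "optimization"}

shared = sorted(doc_2 & doc_6)
score_2_6 = jaccard(doc_2, doc_6)  # 3 shared terms out of 6 distinct terms
score_2_8 = jaccard(doc_2, doc_8)  # 2 shared terms out of 8 distinct terms

print(shared)     # ['optimization', 'portfolio', 'strategy']
print(score_2_6)  # 0.5
print(score_2_8)  # 0.25
```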

Comparison with cosine similarity: The general patterns of similarity are relatively consistent with the cosine similarity results. However, the Jaccard index tends to produce lower similarity scores overall, because it only measures the overlap between sets of unique terms and does not consider term frequency or inverse document frequency.
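
The point about term frequency can be illustrated with a toy case. The sketch below compares the Jaccard index with a cosine similarity computed on raw term counts (a simplification, not the weighting used in the lesson) for two hypothetical token lists that share the same vocabulary but repeat one term a different number of times. It isolates the term-frequency effect only and is not meant to reproduce the lower Jaccard scores seen in the matrices above.

```python
import math
from collections import Counter

def jaccard(tokens_a, tokens_b):
    """Jaccard index on the sets of unique tokens."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    return len(set_a & set_b) / len(set_a | set_b)

def count_cosine(tokens_a, tokens_b):
    """Cosine similarity between raw term-count vectors."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(counts_a[term] * counts_b[term] for term in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    return dot / (norm_a * norm_b)

# Hypothetical token lists: same vocabulary, different emphasis
heavy = ["trading", "trading", "trading", "risk"]
light = ["trading", "risk"]

jaccard_score = jaccard(heavy, light)      # identical term sets
cosine_score = count_cosine(heavy, light)  # sensitive to the repeated term

print(f"Jaccard: {jaccard_score:.2f}")
print(f"Cosine:  {cosine_score:.2f}")
```

The Jaccard index treats the two lists as identical because repetition disappears when tokens are converted to sets, while the count-based cosine falls below 1 because the term proportions differ.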

## **Conclusion**

By the end of this lesson, you should have a solid understanding of how to use Python and the `pytrends` library to gather and analyze Google Trends data. You should be able to fetch and compare interest over time for multiple search terms, analyze interest by region, and discover top and rising related queries. We also explored how to leverage data science and machine learning for financial analysis, covering techniques such as TF-IDF and LSA for extracting insights from text, and cosine similarity and the Jaccard index for comparing documents. Finally, we discussed how these techniques apply to sentiment analysis, risk management, and portfolio optimization.

**References**

 - Vrba, G. (2018). Timing The Market With Google Trends Search Volume Data. [online] Seeking Alpha. Available at: https://seekingalpha.com/article/4202781-timing-market-google-trends-search-volume-data.










---
Copyright 2024 WorldQuant University. This
content is licensed solely for personal use. Redistribution or
publication of this material is strictly prohibited.
