# DSA4265 Assignment 2: RAG Generation

With so much news published every day by different agencies, it is increasingly difficult for investors to read through every article to find the answers they are looking for.

Therefore, the goal of this assignment is to build a search engine that summarises key information from recent news on a stock, giving investors better context so that they can make more informed decisions about the stock's performance.

## Part 1: Data Extraction

This section describes the data extraction process and the generation of the labelled dataframe. The tickers analysed are Apple Inc. (AAPL) and Tesla (TSLA). The data was sourced from Refinitiv Workspace, and the code used to extract the dataframes was copied from its built-in CodeBook. News headlines were limited to Reuters top news on corporate finance deals (`TOP/DEALS`), digital finance (`TOP/DIGFIN`), and overall top news (`TOPALL`) about each stock.

In [2]:
import refinitiv.data as rd
from refinitiv.data.content import news
from IPython.display import HTML
import pandas as pd
import numpy as np
from datetime import datetime,timedelta
import time
import warnings
import refinitiv.data.eikon as ek
warnings.filterwarnings("ignore")



In [2]:
rd.open_session()

<refinitiv.data.session.Definition object at 0x2969ae560e0 {name='workspace'}>

In [None]:
def fetch_full_story(story_id):
    try:
        story = rd.news.get_story(story_id, format=rd.news.Format.TEXT)
        return story if story else "Story not available"
    except Exception as e:
        print(f"Error fetching {story_id}: {e}")
        return "Error retrieving story"

dNow = datetime.now().date()
maxenddate = dNow - timedelta(days=90) #upto months=15
compNews = pd.DataFrame()
riclist = ['TSLA.O','AAPL.O'] # can also use Peers, Customers, Suppliers, Monitor, Portfolio to build universe

for ric in riclist:
    try:
        cHeadlines = rd.news.get_headlines("R:" + ric + " AND Language:LEN AND Source:RTRS AND (Topic:TOP/DEALS OR Topic:TOP/DIGFIN)", 
                                           start= str(dNow), 
                                           end = str(maxenddate), count = 100)
        cHeadlines['ric'] = ric
        # Corporate Finance: TOP/DEALS, Broker Research / Recommendation: RCH
        if len(compNews):
            compNews = pd.concat([compNews,cHeadlines])
        else:
            compNews = cHeadlines
    except Exception:
        pass

# Apply to all rows
compNews["full_story"] = compNews["storyId"].apply(fetch_full_story)

compNews2 = pd.DataFrame()
riclist = ['TSLA.O','AAPL.O'] # can also use Peers, Customers, Suppliers, Monitor, Portfolio to build universe

for ric in riclist:
    try:
        cHeadlines = rd.news.get_headlines("R:" + ric + " AND Language:LEN AND Source:RTRS AND (Topic:TOPALL)", 
                                           start= str(dNow), 
                                           end = str(maxenddate), count = 100)
        cHeadlines['ric'] = ric
        # Corporate Finance: TOP/DEALS, Broker Research / Recommendation: RCH
        if len(compNews2):
            compNews2 = pd.concat([compNews2,cHeadlines])
        else:
            compNews2 = cHeadlines
    except Exception:
        pass

compNews2["full_story"] = compNews2["storyId"].apply(fetch_full_story)
combined_df = pd.concat([compNews, compNews2], axis = 0)
combined_df.to_csv('combined_news_updated.csv')
combined_df

| versionCreated | headline | storyId | sourceCode | ric | full_story |
|---|---|---|---|---|---|
| 2025-02-27 13:00:00 | RPT-BREAKINGVIEWS-GM illuminates good times be... | urn:newsml:reuters.com:20250227:nL3N3PH1P3:5 | NS:RTRS | TSLA.O | (The author is a Reuters Breakingviews columni... |
| 2025-02-25 12:52:47 | Tesla to acquire parts of insolvent German par... | urn:newsml:reuters.com:20250225:nL5N3PG0Y9:7 | NS:RTRS | TSLA.O | * \n Acquisition includes 300 staff, excl... |
| 2025-02-25 01:10:32 | RPT-BREAKINGVIEWS-Nissan offers suitors daunti... | urn:newsml:reuters.com:20250225:nL3N3PG036:3 | NS:RTRS | TSLA.O | (The author is a Reuters Breakingviews columni... |
| 2025-02-24 12:00:00 | RPT-BREAKINGVIEWS-Nissan offers suitors daunti... | urn:newsml:reuters.com:20250224:nL3N3PF0DU:4 | NS:RTRS | TSLA.O | (The author is a Reuters Breakingviews columni... |
| 2025-02-21 13:55:05 | UPDATE 6-Japan seeks Tesla investment in Nissa... | urn:newsml:reuters.com:20250221:nL3N3PC0GR:2 | NS:RTRS | TSLA.O | * \n Japanese group draws up plans for Te... |
| ... | ... | ... | ... | ... | ... |
| 2025-02-14 15:03:57 | US STOCKS-Wall St subdued as markets await tar... | urn:newsml:reuters.com:20250214:nL4N3P515Y:5 | NS:RTRS | AAPL.O | (For a Reuters live blog on U.S., UK and Europ... |
| 2025-02-14 14:04:36 | US STOCKS-Wall St set for subdued open as mark... | urn:newsml:reuters.com:20250214:nL4N3P512O:5 | NS:RTRS | AAPL.O | (For a Reuters live blog on U.S., UK and Europ... |
| 2025-02-14 12:31:59 | US STOCKS-Futures slip as markets await tariff... | urn:newsml:reuters.com:20250214:nL4N3P50XC:5 | NS:RTRS | AAPL.O | (For a Reuters live blog on U.S., UK and Europ... |
| 2025-02-14 11:50:37 | REFILE-FACTBOX-China's AI firms take spotlight... | urn:newsml:reuters.com:20250214:nL1N3P50BI:1 | NS:RTRS | AAPL.O | (Refiles to correct transposed letters in Doub... |


## Part 2: Building of RAG Model

The RAG model was built from four components, each of which is described in more detail below:

### Feature 1: Chunking of Documents

To split the documents into distinct chunks, LangChain's `RecursiveCharacterTextSplitter` was used with a chunk size of 500 characters and an overlap of 100 characters. The overlap preserves context across chunk boundaries, so that each chunk can be better understood on its own.
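
To illustrate what an overlap between chunks means, here is a minimal sketch in plain Python. It is a simplification and not the splitter used in this notebook: it cuts fixed-size character windows, whereas the LangChain splitter also tries to break on separators such as paragraphs and sentences. The function name `chunk_with_overlap` and the sample text are hypothetical.

```python
def chunk_with_overlap(text: str, chunk_size: int = 500, chunk_overlap: int = 100) -> list[str]:
    """Split text into fixed-size character windows that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks

# Hypothetical 1,200-character sample text
sample = "".join(str(i % 10) for i in range(1200))
chunks = chunk_with_overlap(sample)
print([len(c) for c in chunks])             # [500, 500, 400]
print(chunks[0][-100:] == chunks[1][:100])  # True: the windows share 100 characters
```

The last 100 characters of one chunk reappear at the start of the next, which is how a sentence that straddles a boundary stays readable in at least one chunk.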

### Feature 2: Sentence Embeddings

To convert text into numerical vectors that can be compared by machine, a sentence transformer model (`sentence-transformers/all-MiniLM-L6-v2`) was selected to support semantic similarity search.
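
As a rough illustration of how vectors support similarity search, the sketch below computes cosine similarity between toy vectors. The three-dimensional vectors are made up for illustration only; real sentence embeddings have hundreds of dimensions and come from the embedding model, not from hand-written numbers.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction, 0.0 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings" (assumed values, not produced by any model)
query_vec = [1.0, 0.0, 1.0]
similar_vec = [0.9, 0.1, 0.8]
unrelated_vec = [0.0, 1.0, 0.0]

print(cosine_similarity(query_vec, similar_vec) > cosine_similarity(query_vec, unrelated_vec))  # True
```

Texts with similar meaning are expected to map to vectors that point in similar directions, so a higher score is treated as a closer semantic match.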

### Feature 3: Vector Store

Facebook AI Similarity Search (FAISS) was used to store the embeddings for quick retrieval.
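
Conceptually, retrieval from a vector store means finding the stored vectors closest to the query vector. The sketch below does this by brute force in plain Python, purely to show the idea; FAISS performs the same kind of search far more efficiently and also offers approximate indexes. The function name `top_k_nearest` and the two-dimensional vectors are hypothetical.

```python
def top_k_nearest(query, vectors, k=3):
    """Return the indices of the k stored vectors closest to the query (squared L2 distance)."""
    distances = [
        (sum((q - v) ** 2 for q, v in zip(query, vec)), idx)
        for idx, vec in enumerate(vectors)
    ]
    # Sort by distance (ties broken by index) and keep the k nearest
    return [idx for _, idx in sorted(distances)[:k]]

# Hypothetical stored vectors standing in for chunk embeddings
stored = [[0.0, 0.0], [1.0, 1.0], [0.9, 1.1], [5.0, 5.0]]
print(top_k_nearest([1.0, 1.0], stored, k=2))  # [1, 2]
```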

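### Feature 4: Language Model

A small generative model, `bigscience/bloom-560m`, was loaded through a Hugging Face `text-generation` pipeline (limited to 100 new tokens) to generate answers from the retrieved chunks.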

In [None]:
import pandas as pd
import faiss
import torch
from transformers import AutoTokenizer, AutoModel
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.llms import HuggingFacePipeline
from langchain.chains import RetrievalQA
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import DataFrameLoader
from langchain.schema import Document
from transformers import pipeline
import re



In [2]:
combined_news_df = pd.read_csv("combined_news_updated.csv")
tsla_news = combined_news_df[combined_news_df['ric'] == 'TSLA.O']
aapl_news = combined_news_df[combined_news_df['ric'] == 'AAPL.O']

In [3]:
tsla_news

|  | versionCreated | headline | storyId | sourceCode | ric | full_story |
|---|---|---|---|---|---|---|
| 0 | 2025-02-27 13:00:00 | RPT-BREAKINGVIEWS-GM illuminates good times be... | urn:newsml:reuters.com:20250227:nL3N3PH1P3:5 | NS:RTRS | TSLA.O | (The author is a Reuters Breakingviews columni... |
| 1 | 2025-02-25 12:52:47 | Tesla to acquire parts of insolvent German par... | urn:newsml:reuters.com:20250225:nL5N3PG0Y9:7 | NS:RTRS | TSLA.O | * \n Acquisition includes 300 staff, excl... |
| 2 | 2025-02-25 01:10:32 | RPT-BREAKINGVIEWS-Nissan offers suitors daunti... | urn:newsml:reuters.com:20250225:nL3N3PG036:3 | NS:RTRS | TSLA.O | (The author is a Reuters Breakingviews columni... |
| 3 | 2025-02-24 12:00:00 | RPT-BREAKINGVIEWS-Nissan offers suitors daunti... | urn:newsml:reuters.com:20250224:nL3N3PF0DU:4 | NS:RTRS | TSLA.O | (The author is a Reuters Breakingviews columni... |
| 4 | 2025-02-21 13:55:05 | UPDATE 6-Japan seeks Tesla investment in Nissa... | urn:newsml:reuters.com:20250221:nL3N3PC0GR:2 | NS:RTRS | TSLA.O | * \n Japanese group draws up plans for Te... |
| ... | ... | ... | ... | ... | ... | ... |
| 157 | 2025-02-18 10:56:15 | Netherlands to build 1.4 GW battery storage fa... | urn:newsml:reuters.com:20250218:nL6N3P90C3:2 | NS:RTRS | TSLA.O | AMSTERDAM, Feb 18 - Dutch energy storage firm ... |
| 158 | 2025-02-18 06:29:17 | Tesla steps up India hiring after Musk-Modi me... | urn:newsml:reuters.com:20250218:nL3N3P90AQ:1 | NS:RTRS | TSLA.O | Feb 18 (Reuters) - Elon Musk's Tesla <TSLA.O> ... |
| 159 | 2025-02-18 04:44:51 | Tesla begins mass production of revamped Model... | urn:newsml:reuters.com:20250218:nP8N3O50DP:2 | NS:RTRS | TSLA.O | BEIJING, Feb 18 (Reuters) - U.S. automaker Tes... |
| 160 | 2025-02-18 03:27:45 | Complaints targeting BYD flood Chinese consume... | urn:newsml:reuters.com:20250218:nL3N3P80MF:5 | NS:RTRS | TSLA.O | BEIJING, Feb 18 (Reuters) - Complaints about B... |


In [46]:
# Convert dataframe to documents
# The date is stored under `metadata`, which is where Document keeps extra fields
documents = [
    Document(page_content=row['full_story'],
             metadata={'date': row['versionCreated']})
    for _, row in tsla_news.iterrows()
]

# Chunk documents
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
split_docs = text_splitter.split_documents(documents)

# Initialize embedding model for documents and queries
embedding_model = "sentence-transformers/all-MiniLM-L6-v2"
embeddings = HuggingFaceEmbeddings(model_name=embedding_model)

# Create FAISS vector database
vector_db = FAISS.from_documents(split_docs, embeddings)

# Use a generative model for text generation
# llm_model = "EleutherAI/gpt-neo-2.7B"  # You can try other models as needed
llm_model = "bigscience/bloom-560m"
llm_pipeline = pipeline("text-generation", model=llm_model, device=0 if torch.cuda.is_available() else -1, 
                        max_new_tokens=100)
llm = HuggingFacePipeline(pipeline=llm_pipeline)

# Create the RetrievalQA chain, passing in the LLM and vector database retriever
qa_chain = RetrievalQA.from_chain_type(
    llm=llm, retriever=vector_db.as_retriever()
)

In [82]:
def ask_question(query):
    """Function to ask a question using RAG model with extracted context chunks."""
    # Retrieve relevant documents
    retrieved_docs = vector_db.similarity_search(query, k=3)
    
    # Extract the context chunks from the retrieved documents
    prompt_context = [doc.page_content for doc in retrieved_docs]
    # return prompt_context
    # Prepare the prompt
    context_str = "\n\n".join(prompt_context)
    prompt = f"""
        You are an AI system. Below are relevant news articles with potential relevance:
        {context_str}

        Based on these excerpts, if the information is insufficient, say "I do not have enough information." Otherwise, answer the following:
        
        Question: {query}

        Answer:
    """.strip()
    def remove_consecutive_duplicates(text: str):
        # Split the text into sentences using regex
        sentences = re.split(r'(?<=[.!?])\s+', text)
        
        # Create a list to store non-duplicate sentences
        unique_sentences = []
        
        # Iterate over the sentences and add them to unique_sentences if not a duplicate
        for i in range(len(sentences)):
            current_sentence = sentences[i].strip()
            
            # If it's the first sentence or not a duplicate of the previous one, keep it
            if i == 0 or current_sentence != sentences[i - 1].strip():
                unique_sentences.append(current_sentence)
        
        # Join the unique sentences back into a single text
        return ' '.join(unique_sentences)
    
    # Generate the model's response (by default the pipeline returns the prompt followed by the continuation)
    response = llm_pipeline(prompt)[0]['generated_text']

    # Locate the 'Answer:' marker in the generated text
    answer_start = response.lower().find('answer:')

    # If 'Answer:' is found, keep everything after it and drop consecutive repeats
    if answer_start != -1:
        answer = response[answer_start + len('answer:'):].strip()
        new_answer = remove_consecutive_duplicates(answer)
    else:
        # Fall back to a fixed message so that new_answer is always defined
        new_answer = "No answer found."

    final_response = f"Question:{query}\nAnswer:{new_answer}"
    return final_response
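
`ask_question` locates the answer by searching the generated text for the first `Answer:` marker. Since the retrieved news chunks are part of the prompt, a chunk that happens to contain that marker would shift the cut-off point. One possible alternative, sketched below under the assumption that the pipeline echoes the prompt at the start of its output, is to strip the prompt itself. The helper name `extract_answer` and the example strings are hypothetical.

```python
def extract_answer(generated_text: str, prompt: str) -> str:
    """Return only the model's continuation, given text that may start with the prompt."""
    if generated_text.startswith(prompt):
        # Assumed case: the output is the prompt followed by the continuation
        continuation = generated_text[len(prompt):]
    else:
        # Fallback: keep the text after the last 'Answer:' marker, if there is one
        marker = "answer:"
        pos = generated_text.lower().rfind(marker)
        continuation = generated_text[pos + len(marker):] if pos != -1 else generated_text
    return continuation.strip() or "No answer found."

# Hypothetical prompt and generated text
prompt = "Context: Tesla builds cars.\n\nQuestion: What does Tesla build?\n\nAnswer:"
generated = prompt + " Tesla builds cars."
print(extract_answer(generated, prompt))  # Tesla builds cars.
```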

In [83]:
# Example usage
query = "Is Tesla the biggest company selling electric vehicles?"
response = ask_question(query)
print(response)

Question:Is Tesla the biggest company selling electric vehicles?
Answer:Yes. The company is the largest in the world in terms of sales. The company has a market share of about 50% in the U.S. and Europe. The company has a market share of about 50% in the U.S. and Europe. The company has a market share of about 50% in the U.S. and Europe. The company has a market share of about 50% in the U.S. and Europe. The company has a market share of about 50% in the U.


In [90]:
def remove_consecutive_duplicates(text: str):
    # Split the text into sentences using regex
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())  # Split on punctuation followed by space
    
    # Create a list to store non-duplicate sentences
    unique_sentences = []
    
    # Iterate over the sentences and add them to unique_sentences if not a duplicate
    for i in range(len(sentences)):
        current_sentence = sentences[i].strip()
        
        # If it's the first sentence or not a duplicate of the previous one, keep it
        if i == 0 or current_sentence.lower() != sentences[i - 1].strip().lower():
            unique_sentences.append(current_sentence)
    
    # Join the unique sentences back into a single text
    return ' '.join(unique_sentences)

text = "The company has a market share of about 50% in the U.S. and Europe. The company has a market share of about 50% in the U.S. and Europe. Claire."
remove_consecutive_duplicates(text)

'The company has a market share of about 50% in the U.S. and Europe. The company has a market share of about 50% in the U.S. and Europe. Claire.'
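
The text comes back unchanged because the regex also splits after the abbreviation "U.S.", so the fragments alternate between "The company ... in the U.S." and "and Europe." and no two consecutive fragments are ever equal. A possible workaround, sketched below, avoids sentence splitting altogether and collapses any span of text that is immediately repeated. The function name `collapse_repeated_spans` and the `min_len` threshold are assumptions for illustration; the approach is only suited to short texts, and it would also collapse a span that is repeated legitimately.

```python
import re

def collapse_repeated_spans(text: str, min_len: int = 20) -> str:
    """Collapse spans of at least `min_len` characters that repeat back to back."""
    # (.{N,}?) captures a candidate span; (?:\s+\1)+ matches one or more immediate repeats of it
    pattern = re.compile(r'(.{%d,}?)(?:\s+\1)+' % min_len)
    return pattern.sub(r'\1', text.strip())

text = ("The company has a market share of about 50% in the U.S. and Europe. "
        "The company has a market share of about 50% in the U.S. and Europe. Claire.")
print(collapse_repeated_spans(text))
# The company has a market share of about 50% in the U.S. and Europe. Claire.
```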

In [None]:
# from transformers import pipeline
# import torch
# # from transformers import HuggingFacePipeline

# # Set up the text generation pipeline
# llm_model = "bigscience/bloom-560m"
# llm_pipeline = pipeline("text-generation", model=llm_model, device=0 if torch.cuda.is_available() else -1, 
#                         max_new_tokens=100)
# llm = HuggingFacePipeline(pipeline=llm_pipeline)

# # Function to ask a question given context
# def ask_question_with_context(context: str, question: str):
#     # Format the prompt with context and question
#     prompt = f"You are an AI bot who is given the following:\nContext: {context}\n\nQuestion: {question}\nAnswer:"
#     response = llm(prompt)
    
#     # Find where the 'Question:' part starts
#     qn_start = response.lower().find('question:')
    
#     # Find where the 'A:' part starts (indicating the end of the answer)
#     answer_end = response.lower().find('a:', qn_start)
    
#     # If 'Question:' and 'A:' are found, return the text between them
#     if qn_start != -1 and answer_end != -1:
#         answer = response[qn_start:answer_end].strip()
#     elif qn_start != -1:
#         # If only 'Question:' is found, return everything from 'Question:' onward
#         answer = response[qn_start:].strip()
#     else:
#         # If no 'Answer:' part is found, return the whole generated text
#         answer = response.strip()
    
#     return answer

# # Example context and question
# context = "The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris, France. It was named after the engineer Gustave Eiffel, whose company designed and built the tower."
# question = "Where is Eiffel Tower?"

# # Ask the question based on the context
# answer = ask_question_with_context(context, question)

In [None]:
# print(answer)

Question: Where is Eiffel Tower?
Answer: The Eiffel Tower is located in the city of Paris, France. The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris, France. It was named after the engineer Gustave Eiffel, whose company designed and built the tower.


In [42]:
def find_question_q(text: str):
    # Find the position of 'Q' in 'Question:'
    q_position = text.lower().find('question:')
    
    # If 'Question:' is found, return the index of the first 'Q'
    if q_position != -1:
        return q_position  # Returns the index of 'Q' in 'Question:'
    else:
        return None  # Return None if 'Question:' is not found

# Example string
example_text = "You are an AI bot who is given the following:\nContext: Some context here\n\nQuestion: Where is Eiffel Tower?"

# Find the position of the 'Q' in 'Question:'
q_index = find_question_q(example_text)

# Print the result
print(f"The position of 'Q' in 'Question:' is: {q_index}")

example_text[q_index:]

The position of 'Q' in 'Question:' is: 74


'Question: Where is Eiffel Tower?'

## Part 3: Evaluation of RAG Model

Th

### Evaluation of Strengths and Weaknesses

#### Strengths

#### Weaknesses

### Potential Future Work

### Conclusion