# **RAG-based Financial Risk Assessment Tool Demo**

This notebook demonstrates a Retrieval-Augmented Generation (RAG) pipeline for financial risk assessment using a dataset of S&P 500 companies. We'll walk through data loading, preprocessing, model setup, and three sample queries that produce risk assessments for a single company, a sector, and a pair of companies.

---

## **1. Setup**

### **1.1. Import Necessary Libraries**

In [None]:
# Standard Libraries
import os
import sys
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Add the src directory to the Python path
sys.path.append('../src')

In [None]:
# Custom Modules
from retriever import DataRetriever
from generator import TextGenerator
from pipelines.risk_assessment_pipeline import RiskAssessmentPipeline
from utils.data_processing import clean_data
from utils.logging_utils import setup_logging

# Configuration
from config import Config

In [None]:
# For displaying plots inline
%matplotlib inline

### **1.2. Initialize Logger**

In [None]:
logger = setup_logging()
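
The body of `setup_logging` is not shown in this notebook. As a rough sketch of what such a helper could look like, the hypothetical `setup_logging_sketch` below uses only the standard `logging` module; its name, parameters, and format string are assumptions, not the project's actual code. It guards against attaching a second handler when a notebook cell is re-run, which would otherwise print every message twice.

```python
import logging
import sys


def setup_logging_sketch(name: str = "risk_demo", level: int = logging.INFO) -> logging.Logger:
    """Hypothetical stand-in for utils.logging_utils.setup_logging."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    # Re-running a notebook cell would otherwise attach duplicate handlers.
    if not logger.handlers:
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
    # Keep messages from also being emitted by the root logger.
    logger.propagate = False
    return logger


demo_logger = setup_logging_sketch()
demo_logger.info("Logger ready")
```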

---

## **2. Data Loading and Exploration**


### **2.1. Load the Dataset**

In [None]:
# Define the path to the dataset
data_path = '../data/raw/financials.csv'  # Ensure this path is correct

# Initialize the DataRetriever
retriever = DataRetriever(file_path=data_path)

# Load the data
df = retriever.load_data()

# Check if data is loaded
if df.empty:
    logger.error("DataFrame is empty. Check if the dataset exists at the specified path.")
else:
    logger.info(f"Data loaded successfully with shape: {df.shape}")
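
The `df.empty` check above is only reached if `load_data` returns an empty DataFrame, rather than raising, when the file is missing. That is an assumption about `DataRetriever`, whose source is not shown here. A minimal sketch of a loader with that behavior, using a hypothetical `load_financials_sketch` function and a made-up two-row CSV:

```python
import io
from pathlib import Path

import pandas as pd


def load_financials_sketch(source) -> pd.DataFrame:
    """Hypothetical loader: return an empty DataFrame when the file is missing."""
    if isinstance(source, (str, Path)) and not Path(source).exists():
        return pd.DataFrame()
    return pd.read_csv(source)


# In-memory CSV with invented rows, so the sketch runs without the real dataset.
sample_csv = io.StringIO(
    "Symbol,Name,Sector,Price\n"
    "AAA,Alpha Corp,Industrials,10.5\n"
    "BBB,Beta Inc,Energy,20.0\n"
)
demo_df = load_financials_sketch(sample_csv)
missing_df = load_financials_sketch("no_such_file.csv")
print(demo_df.shape, missing_df.empty)
```

If the real loader raises `FileNotFoundError` instead, the cell above would need a `try`/`except` around `load_data()` for the error message to be logged.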

### **2.2. Preview the Data**

In [None]:
# Display the first 5 rows
df.head()

### **2.3. Basic Data Statistics**

In [None]:
# Get basic statistics
df.describe()

### **2.4. Check for Missing Values**


In [None]:
# Sum of missing values per column
df.isnull().sum()
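
Raw counts are easier to judge next to the share of rows they represent. The sketch below builds a small summary with both, sorted so the most incomplete columns come first. It runs on an invented toy frame (`toy`), not on `financials.csv`.

```python
import pandas as pd

# Toy frame with invented values; the column names mirror the dataset's.
toy = pd.DataFrame(
    {
        "Price": [10.0, None, 30.0, 40.0],
        "Price/Earnings": [15.0, 20.0, None, None],
        "Sector": ["A", "B", "B", "A"],
    }
)

missing_summary = pd.DataFrame(
    {
        "missing": toy.isnull().sum(),
        # The mean of a boolean mask is the fraction of missing rows.
        "percent": (toy.isnull().mean() * 100).round(1),
    }
).sort_values("missing", ascending=False)
print(missing_summary)
```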

---


## **3. Data Preprocessing**


### **3.1. Data Cleaning**


In [None]:
# Clean the data using the clean_data function
clean_df = clean_data(df)

# Check the shape after cleaning
logger.info(f"Data shape after cleaning: {clean_df.shape}")
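
What `clean_data` does internally is not shown in this notebook. One plausible shape for such a step, offered only as a sketch, is: drop exact duplicates, coerce numeric columns, and discard rows that lack the fields later cells depend on. The function name `clean_data_sketch`, the `NUMERIC_COLUMNS` list, and the choice of required fields are all assumptions.

```python
import pandas as pd

# Assumed subset of numeric columns; the real list may differ.
NUMERIC_COLUMNS = ["Price", "Price/Earnings", "Market Cap"]


def clean_data_sketch(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical cleaning step, not the project's clean_data."""
    cleaned = frame.drop_duplicates().copy()
    for column in NUMERIC_COLUMNS:
        if column in cleaned.columns:
            # Unparseable entries such as "n/a" become NaN.
            cleaned[column] = pd.to_numeric(cleaned[column], errors="coerce")
    # Rows without a name or a price cannot be queried later on.
    cleaned = cleaned.dropna(subset=["Name", "Price"])
    return cleaned.reset_index(drop=True)


raw = pd.DataFrame(
    {
        "Name": ["Alpha Corp", "Alpha Corp", "Beta Inc", None],
        "Price": ["10.5", "10.5", "n/a", "7.0"],
        "Price/Earnings": [12.0, 12.0, 8.0, 9.0],
        "Market Cap": [1e9, 1e9, 2e9, 3e9],
    }
)
cleaned = clean_data_sketch(raw)
print(cleaned)
```

Comparing `df.shape` with `clean_df.shape` after the real call shows how many rows a step like this removed.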

### **3.2. Data Visualization**

Let's visualize some key financial metrics to understand the data better.

In [None]:
# Setting up the plot style
sns.set_theme(style="whitegrid")  # set_theme is the current name for sns.set

# Plot Market Capitalization Distribution
plt.figure(figsize=(10,6))
sns.histplot(clean_df['Market Cap'], bins=50, kde=True)
plt.title('Market Capitalization Distribution')
plt.xlabel('Market Cap (in billions)')
plt.ylabel('Frequency')
plt.show()

In [None]:
# Plot Price vs. Earnings/Share
plt.figure(figsize=(10,6))
sns.scatterplot(data=clean_df, x='Price', y='Earnings/Share', hue='Sector')
plt.title('Price vs. Earnings/Share by Sector')
plt.xlabel('Price (in USD)')
plt.ylabel('Earnings/Share (in USD)')
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left', borderaxespad=0.)
plt.show()

---

## **4. Setting Up the RAG Pipeline**


### **4.1. Initialize the Text Generator**

We'll use the `TextGenerator` class, which wraps LangChain and OpenAI's GPT-3.5-turbo model.

In [None]:
# Initialize the TextGenerator
api_key = Config.API_KEY  # Ensure your OpenAI API key is set in the Config
generator = TextGenerator(api_key=api_key)
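
How `Config.API_KEY` gets its value is not shown here. A common pattern, sketched below as an assumption rather than the project's actual approach, is to read the key from an environment variable and fail early with a clear message if it is unset. The helper `resolve_api_key` and the `DEMO_API_KEY` variable are hypothetical; `OPENAI_API_KEY` is the variable name the OpenAI Python client reads by default.

```python
import os


def resolve_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Hypothetical helper: read an API key from the environment or fail loudly."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable before creating the generator."
        )
    return key


# Placeholder value for the demo only; never hard-code a real key in a notebook.
os.environ["DEMO_API_KEY"] = "sk-demo-placeholder"
print(len(resolve_api_key("DEMO_API_KEY")))
```

Failing at this point keeps a missing key from surfacing later as an authentication error in the middle of a query.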

### **4.2. Define the Risk Assessment Pipeline**

The `RiskAssessmentPipeline` integrates the retriever and generator to process queries.

In [None]:
# Initialize the RiskAssessmentPipeline
pipeline = RiskAssessmentPipeline(retriever=retriever, generator=generator)
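
The internals of `RiskAssessmentPipeline` are not shown in this notebook. To illustrate the general retrieve-then-generate flow, here is a self-contained sketch with stand-ins: `PipelineSketch`, its methods, and `echo_generator` are all hypothetical. The generator is a stub, so no API call is made.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Record = Dict[str, object]


@dataclass
class PipelineSketch:
    """Hypothetical retrieve-then-generate flow, not the project's pipeline."""

    records: List[Record]            # stands in for the retriever's data
    generate: Callable[[str], str]   # stands in for generator.generate_text

    def retrieve(self, company: str) -> List[Record]:
        # Case-insensitive exact match on the company name.
        wanted = company.strip().lower()
        return [
            r for r in self.records
            if str(r.get("Name", "")).strip().lower() == wanted
        ]

    def build_prompt(self, company: str, rows: List[Record]) -> str:
        lines = [f"Assess the financial risk of {company} using only the data below."]
        for row in rows:
            lines.extend(f"{key}: {value}" for key, value in row.items() if key != "Name")
        return "\n".join(lines)

    def run(self, company: str) -> str:
        rows = self.retrieve(company)
        if not rows:
            return f"No data found for {company}."
        return self.generate(self.build_prompt(company, rows))


def echo_generator(prompt: str) -> str:
    # Stub in place of a language model call.
    return f"[stub response to {len(prompt.splitlines())} prompt lines]"


sketch = PipelineSketch(
    records=[{"Name": "Alpha Corp", "Price": 10.5, "Price/Earnings": 12.0}],
    generate=echo_generator,
)
print(sketch.run("alpha corp"))
print(sketch.run("Gamma Ltd"))
```

Passing the generator in as a plain callable makes it possible to test the retrieval and prompt-building logic without network access.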

---

## **5. Running Sample Queries**


### **5.1. Sample Query 1: Assessing a Specific Company's Risk**

Let's assess the financial risk of a specific company, say "Apple Inc."

In [None]:
# Define the query
query_company = "Apple Inc."

# Retrieve data related to the company
company_data = clean_df[clean_df['Name'] == query_company]

if company_data.empty:
    logger.warning(f"No data found for {query_company}.")
else:
    # Prepare the prompt from the first matching row
    row = company_data.iloc[0]
    prompt = (f"Based on the following financial data, assess the financial risk of {query_company}:\n"
              f"Price: {row['Price']} USD\n"
              f"Price/Earnings: {row['Price/Earnings']}\n"
              f"Dividend Yield: {row['Dividend Yield']}%\n"
              f"Earnings/Share: {row['Earnings/Share']} USD\n"
              f"52 Week Low: {row['52 Week Low']} USD\n"
              f"52 Week High: {row['52 Week High']} USD\n"
              f"Market Cap: {row['Market Cap']} billion USD\n"
              f"EBITDA: {row['EBITDA']} billion USD\n"
              f"Price/Sales: {row['Price/Sales']}\n"
              f"Price/Book: {row['Price/Book']}\n"
              f"SEC Filings: {row['SEC Filings']}")
    
    # Generate the assessment
    assessment = generator.generate_text(prompt)
    
    # Display the assessment
    print(f"Financial Risk Assessment for {query_company}:\n")
    print(assessment)
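
The prompt above repeats the same pattern for every metric. One way to factor that out, sketched here with a hypothetical `build_company_prompt` helper and an abbreviated `PROMPT_FIELDS` list, is to drive the prompt from (column, unit) pairs and skip values that are missing, so the model is not handed `nan` as if it were data.

```python
from typing import Mapping, Sequence, Tuple

# (column name, unit suffix) pairs; an abbreviated, assumed subset of the metrics.
PROMPT_FIELDS: Sequence[Tuple[str, str]] = (
    ("Price", " USD"),
    ("Price/Earnings", ""),
    ("Dividend Yield", "%"),
    ("Earnings/Share", " USD"),
)


def build_company_prompt(
    name: str,
    row: Mapping[str, object],
    fields: Sequence[Tuple[str, str]] = PROMPT_FIELDS,
) -> str:
    """Hypothetical helper: one prompt line per available metric."""
    lines = [f"Based on the following financial data, assess the financial risk of {name}:"]
    for column, unit in fields:
        value = row.get(column)
        # Skip absent or NaN values (NaN is the only value not equal to itself).
        if value is None or value != value:
            continue
        lines.append(f"{column}: {value}{unit}")
    return "\n".join(lines)


example_row = {"Price": 10.5, "Price/Earnings": 12.0, "Dividend Yield": float("nan")}
print(build_company_prompt("Alpha Corp", example_row))
```

With a pandas row, `company_data.iloc[0].to_dict()` yields a mapping that can be passed as `row`.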

### **5.2. Sample Query 2: Sector-Wide Risk Analysis**

Next, assess the financial risk associated with the "Technology" sector.

In [None]:
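# Define the sector. The label must match a value in the 'Sector' column exactly;
# check clean_df['Sector'].unique() (some S&P 500 datasets use "Information Technology").
query_sector = "Technology"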

# Retrieve data related to the sector
sector_data = clean_df[clean_df['Sector'] == query_sector]

if sector_data.empty:
    logger.warning(f"No data found for sector: {query_sector}.")
else:
    # Aggregate key metrics
    avg_price = sector_data['Price'].mean()
    avg_pe_ratio = sector_data['Price/Earnings'].mean()
    avg_div_yield = sector_data['Dividend Yield'].mean()
    avg_earnings = sector_data['Earnings/Share'].mean()
    avg_market_cap = sector_data['Market Cap'].mean()
    
    # Prepare the prompt
    prompt = (f"Assess the financial risk associated with the {query_sector} sector based on the following average metrics:\n"
              f"Average Price: {avg_price:.2f} USD\n"
              f"Average Price/Earnings: {avg_pe_ratio:.2f}\n"
              f"Average Dividend Yield: {avg_div_yield:.2f}%\n"
              f"Average Earnings/Share: {avg_earnings:.2f} USD\n"
              f"Average Market Cap: {avg_market_cap:.2f} billion USD")
    
    # Generate the assessment
    assessment = generator.generate_text(prompt)
    
    # Display the assessment
    print(f"Financial Risk Assessment for {query_sector} Sector:\n")
    print(assessment)
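
Sector averages can be pulled far from the typical company by a single extreme value, such as a very high Price/Earnings ratio. As a sketch on invented numbers (the `toy_sector` frame is not real data), computing the median next to the mean makes that skew visible before the figures go into a prompt.

```python
import pandas as pd

# Invented values; one Tech row has an extreme Price/Earnings ratio.
toy_sector = pd.DataFrame(
    {
        "Sector": ["Tech", "Tech", "Tech", "Energy"],
        "Price/Earnings": [20.0, 30.0, 400.0, 10.0],
        "Dividend Yield": [1.0, 2.0, 0.0, 4.0],
    }
)

# One row per sector, with mean and median for each metric.
summary = (
    toy_sector.groupby("Sector")[["Price/Earnings", "Dividend Yield"]]
    .agg(["mean", "median"])
)
print(summary.loc["Tech"])
```

Here the Tech mean Price/Earnings is 150.0 while the median is 30.0, so a prompt that reports only the mean would overstate the sector's typical valuation.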

### **5.3. Sample Query 3: Comparing Two Companies**

Compare the financial risks of "Microsoft Corporation" and "Tesla Inc."

In [None]:
# Define the companies (names must match the 'Name' column exactly)
company_1 = "Microsoft Corporation"
company_2 = "Tesla Inc."

# Retrieve data for both companies
data_company_1 = clean_df[clean_df['Name'] == company_1]
data_company_2 = clean_df[clean_df['Name'] == company_2]

if data_company_1.empty or data_company_2.empty:
    logger.warning(f"Data missing for one or both companies: {company_1}, {company_2}.")
else:
    # Prepare the prompt
    prompt = (f"Compare the financial risks of the following two companies based on their financial data:\n\n"
              f"{company_1} Data: {data_company_1.to_dict(orient='records')[0]}\n\n"
              f"{company_2} Data: {data_company_2.to_dict(orient='records')[0]}")
    
    # Generate the comparison
    comparison = generator.generate_text(prompt)
    
    # Display the comparison
    print(f"Financial Risk Comparison between {company_1} and {company_2}:\n")
    print(comparison)
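
The prompt above passes each company's full record as a Python dict, including fields that may not matter for the comparison. An alternative, sketched here with a hypothetical `comparison_table` helper and invented values, lays out a chosen set of metrics side by side, which keeps the prompt shorter and pairs the two values for each metric.

```python
from typing import Mapping, Sequence


def comparison_table(
    name_a: str,
    row_a: Mapping[str, object],
    name_b: str,
    row_b: Mapping[str, object],
    fields: Sequence[str],
) -> str:
    """Hypothetical helper: one line per metric, both companies side by side."""
    lines = [f"Metric | {name_a} | {name_b}"]
    for field in fields:
        # Fall back to 'n/a' when a company lacks the metric.
        lines.append(f"{field} | {row_a.get(field, 'n/a')} | {row_b.get(field, 'n/a')}")
    return "\n".join(lines)


alpha = {"Price": 10.5, "Price/Earnings": 12.0}
beta = {"Price": 20.0}
table = comparison_table("Alpha Corp", alpha, "Beta Inc", beta, ["Price", "Price/Earnings"])
print(table)
```

Whether a table or a raw dict gives better model output is an empirical question; this sketch only shows the formatting step.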

---

## **6. Conclusion**

In this notebook, we demonstrated how to set up and run a Retrieval-Augmented Generation pipeline for financial risk assessment. Each query selects the relevant rows and columns from the S&P 500 dataset, formats them into a prompt, and passes that prompt to the language model, which returns a risk assessment for a company, a sector, or a pair of companies.


---


**Note:** Before running this notebook, ensure that:

1. The dataset `financials.csv` is placed in the path `../data/raw/` relative to the notebook's location.
2. The `src` directory is correctly structured and contains all the necessary modules (e.g., `retriever.py`, `generator.py`, etc.).
3. The API key for OpenAI is correctly set in the `Config` class.