<a href="https://colab.research.google.com/github/sheldonkemper/bank_of_england/blob/main/notebooks/modelling/RB_JP_Morgan_Summarisation_v1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install bertopic umap-learn hdbscan sentence-transformers
!pip install transformers torch
!pip install rouge_score
!pip install evaluate
!pip install --upgrade protobuf
!pip install tensorboard

Collecting bertopic
  Downloading bertopic-0.16.4-py3-none-any.whl.metadata (23 kB)
Collecting umap-learn
  Downloading umap_learn-0.5.7-py3-none-any.whl.metadata (21 kB)
Collecting hdbscan
  Downloading hdbscan-0.8.40-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (15 kB)
Collecting pynndescent>=0.5 (from umap-learn)
  Downloading pynndescent-0.5.13-py3-none-any.whl.metadata (6.8 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch>=1.11.0->sentence-transformers)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch>=1.11.0->sentence-transformers)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch>=1.11.0->sentence-transformers)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12=

In [16]:
import time
import torch
from google.colab import drive
import os
import sys
import pandas as pd
import re
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer,Seq2SeqTrainingArguments, Seq2SeqTrainer, DataCollatorForSeq2Seq
import evaluate
from rouge_score import rouge_scorer

In [3]:
# creating a pdf reader object
df_qna = pd.read_csv('/content/sample_data/jpmorgan_qna_df_preprocessed_final.csv', header=0)
df_mgmt = pd.read_csv('/content/sample_data/jpmorgan_management_df_preprocessed_final.csv', header=0)

print("Q&A DataFrame:")
display(df_qna.head())

print("\nManagement Discussion DataFrame:")
display(df_mgmt.head())

Q&A DataFrame:


Unnamed: 0,Index,Quarter-Year,Question,Question_cleaned,Asked By,Role of the person asked the question,Answer,Answer_cleaned,Answered By,Role of the person answered the question
0,1,1Q23,"So, Jamie, I was actually hoping to get your p...",['so jamie actually hoping get perspective see...,Steven Chubak,"Analyst, Wolfe Research LLC","Well, I think you were already kind of complet...",['well think already kind complete answering q...,Jamie Dimon,"Chairman & Chief Executive Officer, JPMorgan C..."
1,2,1Q23,"Hey, thanks. Good morning. Hey, Jeremy, I was ...",['hey thanks good morning hey jeremy wondering...,Ken Usdin,"Analyst, Jefferies LLC","Yeah, sure. So let me just summarize the drive...",['yeah sure let summarize drivers change outlo...,Jeremy Barnum,"Chief Financial Officer, JPMorgan Chase & Co."
2,3,1Q23,"Hi, thanks. Jeremy, wanted to follow up again ...",['hi thanks jeremy wanted follow drivers nii r...,John McDonald,"Analyst, Autonomous Research","Yeah. John, it's a really good question, and w...",['yeah john really good question weve obviousl...,Jeremy Barnum,"Chief Financial Officer, JPMorgan Chase & Co."
3,4,1Q23,My first question is you mentioned that your r...,['first question mentioned reserve build drive...,Erika Najarian,"Analyst, UBS Securities LLC","Yeah. So, Erika, as you know, we take \n not g...",['yeah so erika know take going go lot detail ...,Jeremy Barnum,"Chief Financial Officer, JPMorgan Chase & Co."
4,5,1Q23,Hey. Good morning. Maybe just a little bit on ...,['hey good morning maybe little bit deposit th...,Jim Mitchell,"Analyst, Seaport Global Securities LLC","Yeah. A couple things there. So, first of all,...",['yeah couple things there so first all know r...,"Jeremy Barnum, Jamie Dimon","Chief Financial Officer, JPMorgan Chase & Co.;..."



Management Discussion DataFrame:


Unnamed: 0,Index,Quarter-Year,Text,Text_cleaned
0,,4Q24,MANAGEMENT DISCUSSION SECTION \n \nOperator : ...,['management discussion section operator : goo...
1,,3Q24,MANAGEMENT DISCUSSION SECTION \n \n...,['management discussion section operator : goo...
2,,2Q24,MANAGEMENT DISCUSSION SECTION \n \n...,['management discussion section operator : goo...
3,,1Q24,MANAGEMENT DISCUSSION SECTION \n \n...,['management discussion section operator : goo...
4,,4Q23,MANAGEMENT DISCUSSION SECTION \n \n...,['management discussion section operator : goo...


###**Data Preparation**

In [4]:
# Drop Unnecessary Columns
df_qna.drop(columns=["Index"], inplace=True, errors='ignore')
df_mgmt.drop(columns=["Index"], inplace=True, errors='ignore')

# Standardize Column Names
df_qna.rename(columns={
    "Quarter-Year": "Quarter",
    "Asked By": "Analyst",
    "Answer": "Response",
    "Answered By": "Executive",
    "Role of the person asked the question": "Analyst Role",
    "Role of the person answered the question": "Executive Role"
}, inplace=True)

df_mgmt.rename(columns={
    "Quarter-Year": "Quarter",
    "Text": "Transcript"
}, inplace=True)

# Drop Missing Q&A Entries (2 rows in the Q&A transcript)
df_qna.dropna(subset=["Question", "Response"], inplace=True)

# Format `Quarter` Properly
def format_quarter(quarter_str):
    match = re.search(r'(\d)Q(\d{2})', quarter_str)
    if match:
        return f"20{match.group(2)}-Q{match.group(1)}"
    return quarter_str

df_qna["Quarter"] = df_qna["Quarter"].astype(str).apply(format_quarter)
df_mgmt["Quarter"] = df_mgmt["Quarter"].astype(str).apply(format_quarter)

# Standardize Executive & Analyst Roles
role_mapping = {
    "Chief Executive Officer": "CEO",
    "Chairman & Chief Executive Officer": "CEO",
    "Chief Financial Officer": "CFO",
    "Chief Operating Officer": "COO",
    "President": "President",
    "Vice Chairman": "Vice Chairman",
    "Head of Investor Relations": "Head of IR",
    "Managing Director": "Managing Director",
    "Analyst, Wolfe Research LLC": "Analyst",
    "Analyst, Jefferies LLC": "Analyst",
    "Analyst, Autonomous Research": "Analyst",
    "Analyst, UBS Securities LLC": "Analyst",
    "Analyst, Seaport Global Securities LLC": "Analyst"
}

# Apply role mapping (handles cases where multiple roles are listed)
def standardize_role(role):
    if pd.isna(role):
        return None
    for key, value in role_mapping.items():
        if key.lower() in role.lower():
            return value
    return role

df_qna["Executive Role"] = df_qna["Executive Role"].apply(standardize_role)
df_qna["Analyst Role"] = df_qna["Analyst Role"].apply(standardize_role)

# Add `Type` Column
df_qna["Type"] = "Q&A"
df_mgmt["Type"] = "Management Discussion"

print("Q&A DataFrame:")
display(df_qna.head())

print("\nManagement Discussion DataFrame:")
display(df_mgmt.head())


Q&A DataFrame:


Unnamed: 0,Quarter,Question,Question_cleaned,Analyst,Analyst Role,Response,Answer_cleaned,Executive,Executive Role,Type
0,2023-Q1,"So, Jamie, I was actually hoping to get your p...",['so jamie actually hoping get perspective see...,Steven Chubak,Analyst,"Well, I think you were already kind of complet...",['well think already kind complete answering q...,Jamie Dimon,CEO,Q&A
1,2023-Q1,"Hey, thanks. Good morning. Hey, Jeremy, I was ...",['hey thanks good morning hey jeremy wondering...,Ken Usdin,Analyst,"Yeah, sure. So let me just summarize the drive...",['yeah sure let summarize drivers change outlo...,Jeremy Barnum,CFO,Q&A
2,2023-Q1,"Hi, thanks. Jeremy, wanted to follow up again ...",['hi thanks jeremy wanted follow drivers nii r...,John McDonald,Analyst,"Yeah. John, it's a really good question, and w...",['yeah john really good question weve obviousl...,Jeremy Barnum,CFO,Q&A
3,2023-Q1,My first question is you mentioned that your r...,['first question mentioned reserve build drive...,Erika Najarian,Analyst,"Yeah. So, Erika, as you know, we take \n not g...",['yeah so erika know take going go lot detail ...,Jeremy Barnum,CFO,Q&A
4,2023-Q1,Hey. Good morning. Maybe just a little bit on ...,['hey good morning maybe little bit deposit th...,Jim Mitchell,Analyst,"Yeah. A couple things there. So, first of all,...",['yeah couple things there so first all know r...,"Jeremy Barnum, Jamie Dimon",CEO,Q&A



Management Discussion DataFrame:


Unnamed: 0,Quarter,Transcript,Text_cleaned,Type
0,2024-Q4,MANAGEMENT DISCUSSION SECTION \n \nOperator : ...,['management discussion section operator : goo...,Management Discussion
1,2024-Q3,MANAGEMENT DISCUSSION SECTION \n \n...,['management discussion section operator : goo...,Management Discussion
2,2024-Q2,MANAGEMENT DISCUSSION SECTION \n \n...,['management discussion section operator : goo...,Management Discussion
3,2024-Q1,MANAGEMENT DISCUSSION SECTION \n \n...,['management discussion section operator : goo...,Management Discussion
4,2023-Q4,MANAGEMENT DISCUSSION SECTION \n \n...,['management discussion section operator : goo...,Management Discussion


In [5]:
# Recheck for short, non-substantive responses as indicated by EDA (separate notebook)

# convert Answer_cleaned from string to a list of words
df_qna["Answer_cleaned"] = df_qna["Answer_cleaned"].apply(lambda x: str(x).lower().split() if isinstance(x, str) else [])

# define a threshold for what is considered a "short" response
SHORT_RESPONSE_THRESHOLD = 5

# filter for responses that contain very few words
short_responses = df_qna[df_qna["Answer_cleaned"].apply(lambda x: isinstance(x, list) and len(x) <= SHORT_RESPONSE_THRESHOLD)]

print("Examples of Short Responses:")
print(short_responses[["Quarter", "Answer_cleaned"]].head())

print(f"\nTotal number of short responses: {len(short_responses)}")

Examples of Short Responses:
    Quarter                       Answer_cleaned
11  2023-Q1  [['excellent, folks, thank, much']]
25  2023-Q2               [['thank, you, guys']]
37  2023-Q3                    [['thank, much']]
48  2023-Q4   [['okay, thanks, much, everyone']]
79  2024-Q3      [['yeah, hear, you, hear, us']]

Total number of short responses: 5


In [6]:
# Remove short, non-informative responses

# flatten nested lists
def flatten_list(nested_list):
    if isinstance(nested_list, list) and len(nested_list) == 1 and isinstance(nested_list[0], list):
        return nested_list[0]
    return nested_list

df_qna["Answer_cleaned"] = df_qna["Answer_cleaned"].apply(flatten_list)
df_qna_filtered = df_qna[df_qna["Answer_cleaned"].apply(lambda x: isinstance(x, list) and len(x) >= SHORT_RESPONSE_THRESHOLD)]

print(f"Removed {len(df_qna) - len(df_qna_filtered)} short non-informative responses.")
df_qna = df_qna_filtered

Removed 4 short non-informative responses.


In [7]:
# ensure cleaned text is a proper string
df_qna["Question_cleaned"] = df_qna["Question_cleaned"].apply(lambda x: " ".join(x) if isinstance(x, list) else str(x))
df_qna["Answer_cleaned"] = df_qna["Answer_cleaned"].apply(lambda x: " ".join(x) if isinstance(x, list) else str(x))
df_mgmt["Text_cleaned"] = df_mgmt["Text_cleaned"].apply(lambda x: " ".join(x) if isinstance(x, list) else str(x))

# convert text into tokenized lists (split by space)
df_qna["Question_tokens"] = df_qna["Question_cleaned"].apply(lambda x: x.split())
df_qna["Answer_tokens"] = df_qna["Answer_cleaned"].apply(lambda x: x.split())
df_mgmt["Text_tokens"] = df_mgmt["Text_cleaned"].apply(lambda x: x.split())


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_qna["Question_cleaned"] = df_qna["Question_cleaned"].apply(lambda x: " ".join(x) if isinstance(x, list) else str(x))
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_qna["Answer_cleaned"] = df_qna["Answer_cleaned"].apply(lambda x: " ".join(x) if isinstance(x, list) else str(x))
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#re

In [8]:
# Remove artifacts

def clean_tokens(token_list):
    if isinstance(token_list, list):
        refined_tokens = []
        for token in token_list:
            token = re.sub(r"â", "", token)
            token = re.sub(r"heldtomaturity", "held to maturity", token)
            token = re.sub(r"yearonyear", "year on year", token)
            token = re.sub(r"cohead", "co-head", token)
            token = re.sub(r"typesize", "type size", token)
            token = re.sub(r"[^\w$%&-]", "", token)
            if token.strip():
                refined_tokens.append(token)
        return refined_tokens
    return token_list

df_qna["Question_tokens"] = df_qna["Question_tokens"].apply(clean_tokens)
df_qna["Answer_tokens"] = df_qna["Answer_tokens"].apply(clean_tokens)
df_mgmt["Text_tokens"] = df_mgmt["Text_tokens"].apply(clean_tokens)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_qna["Question_tokens"] = df_qna["Question_tokens"].apply(clean_tokens)


In [9]:
# Remove operator text from management discussion

operator_phrases = {
    "operator", "good morning", "ladies", "gentlemen", "welcome",
    "muted", "duration", "call", "please", "refer", "stand", "turn",
    "line", "available", "website", "ahead", "go"
}

def remove_operator_text(tokens):
    if isinstance(tokens, list):
        return [word for word in tokens if word.lower() not in operator_phrases]
    return tokens

df_mgmt["Text_tokens"] = df_mgmt["Text_tokens"].apply(remove_operator_text)


In [10]:
# Convert token lists back to full sentences
df_mgmt["Text_processed"] = df_mgmt["Text_tokens"].apply(lambda x: " ".join(x))

management_discussion = df_mgmt["Text_processed"].tolist()


## **Running summarisation model Flan-T5**

###**Loading the model**

###**Management Discussion**

In [17]:
# Load FLAN-T5 Large model and tokenizer
model_name = "google/flan-t5-large"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def chunk_text(text, chunk_size=400, overlap=50):
    """Splits text into overlapping chunks of size chunk_size"""
    # Check if text is a list and join if necessary
    if isinstance(text, list):
        text = " ".join(text)
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        end = min(start + chunk_size, len(words))
        chunk = " ".join(words[start:end])
        chunks.append(chunk)
        start += chunk_size - overlap  # Overlap to maintain context
    return chunks

def summarize_text(text, max_new_tokens=100):
    """Summarizes text using FLAN-T5"""
    if pd.isna(text) or text.strip() == "":
        return ""

    prompt = f"Summarize: {text}"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)
    outputs = model.generate(inputs.input_ids, max_new_tokens=max_new_tokens, do_sample=False)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

def summarize_long_text(text, chunk_size=400, overlap=50):
    """Handles long texts by summarizing in chunks and then summarizing the summaries"""
    chunks = chunk_text(text, chunk_size, overlap)
    chunk_summaries = [summarize_text(chunk) for chunk in chunks]

    # If multiple summaries, summarize again to get final summary
    final_summary = summarize_text(" ".join(chunk_summaries)) if len(chunk_summaries) > 1 else chunk_summaries[0]
    return final_summary

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/2.54k [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.42M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/2.20k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/662 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/3.13G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

In [18]:
# Running the model
# Summarizing the long text
summary_mgmt = summarize_long_text(management_discussion)

print("Final Summary:", summary_mgmt)

Final Summary: The financial sector reported a mixed quarter for the year, with a sluggish start to the year and a slowdown in the second quarter.


In [19]:
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

In [20]:
# Join the list of management discussion strings into a single string
management_discussion_str = " ".join(management_discussion)

# Calculate ROUGE scores
scores = scorer.score(management_discussion_str, summary_mgmt)
for key in scores:
    print(f'{key}: {scores[key]}')

rouge1: Score(precision=0.7083333333333334, recall=0.0015434901035046304, fmeasure=0.0030802681645225585)
rouge2: Score(precision=0.043478260869565216, recall=9.080177971488241e-05, fmeasure=0.00018122508155128667)
rougeL: Score(precision=0.6666666666666666, recall=0.001452696568004358, fmeasure=0.002899075919550643)


ROUGE Scores:

**ROUGE-1:**Precision is good (70.8%), but recall is very low (0.15%), leading to a low F1 score (0.0031).

**ROUGE-2:**Both precision (4.3%) and recall (
9.08
×
10
−
5
9.08×10
−5
 ) are very low, with an F1 score of 0.00018.

**ROUGE-L:**Precision is 66.7%, but recall is very low (0.15%), leading to a low F1 score (0.0029).
Interpretation:

With low recall the generated summary misses most of the content in the reference summary and low F1 scores indicate poor balance between precision and recall, suggesting the model's summary is far from the reference.

**Improvement Suggestions:**
The model coult be improved by adding different prompts to return more detailed summaries, it could also benefit from fine-tuning and parameters adjustiments, including the chunking overlap strategy.

The model captures a partial text summary, returing an accurate result and a core information, however it lacks depth of information. The FLAN-T5 model produced a short and weak summary, likely due to insufficient token limits, or the model's tendency to generate brief responses. Adjusting the prompt for more detailed extraction, increasing max_new_tokens, or summarising in multiple stages could help improve the output."

### **Analysing Jim Mitchell data Q2-2024 data**

In [21]:
filtered_df = df_qna[(df_qna["Analyst"] == 'Jim Mitchell')& (df_qna["Quarter"] == "2024-Q2")]

# Display results
print(filtered_df)

    Quarter                                           Question  \
69  2024-Q2  Oh, hey, good morning. Maybe just one last que...   

                                     Question_cleaned       Analyst  \
69  ['oh hey good morning maybe one last question ...  Jim Mitchell   

   Analyst Role                                           Response  \
69      Analyst  Thanks, Jim. Good questions. So, yeah, trading...   

                                       Answer_cleaned      Executive  \
69  ['thanks jim good questions so yeah trading as...  Jeremy Barnum   

   Executive Role Type                                    Question_tokens  \
69            CFO  Q&A  [oh, hey, good, morning, maybe, one, last, que...   

                                        Answer_tokens  
69  [thanks, jim, good, questions, so, yeah, tradi...  


Creating dialogue data

In [49]:
# Ensure 'Question' and 'Answer' are formatted as a dialogue
filtered_df = filtered_df.dropna(subset=["Question", "Response"])  # Remove rows with missing values
dialogue_text = " ".join([f"Q: {q} A: {a}" for q, a in zip(filtered_df["Question"], filtered_df["Response"])])


In [50]:
# Running the model
# Summarizing the long text
summary = summarize_long_text(dialogue_text)

print("Final Summary:", summary)

Final Summary: The financial transcript is a summary of the company's financial results for the quarter ended December 31, 2014, as reported by the company.


In [51]:
# Join the list of management discussion strings into a single string
dialogue_text_str = " ".join(dialogue_text)

# Calculate ROUGE scores
scores1 = scorer.score(dialogue_text_str, summary)
for key in scores1:
    print(f'{key}: {scores1[key]}')

rouge1: Score(precision=0.08333333333333333, recall=0.0009583133684714902, fmeasure=0.001894836570345808)
rouge2: Score(precision=0.0, recall=0.0, fmeasure=0.0)
rougeL: Score(precision=0.08333333333333333, recall=0.0009583133684714902, fmeasure=0.001894836570345808)


### **Analysing Steven Chubak data Q2-2024 data**

In [26]:
filtered_df2 = df_qna[(df_qna["Analyst"] == 'Steven Chubak')& (df_qna["Quarter"] == "2024-Q2")]

# Display results
print(filtered_df2)

    Quarter                                           Question  \
60  2024-Q2  So, wanted to start off with a question on cap...   

                                     Question_cleaned        Analyst  \
60  ['so wanted start question capital given indic...  Steven Chubak   

   Analyst Role                                           Response  \
60      Analyst  Right. Okay. Thanks, Steve. And actually, befo...   

                                       Answer_cleaned      Executive  \
60  ['right okay thanks steve actually answering q...  Jeremy Barnum   

   Executive Role Type                                    Question_tokens  \
60            CFO  Q&A  [so, wanted, start, question, capital, given, ...   

                                        Answer_tokens  
60  [right, okay, thanks, steve, actually, answeri...  


Creating dialogue data

In [27]:
# Ensure 'Question' and 'Answer' are formatted as a dialogue
filtered_df2 = filtered_df2.dropna(subset=["Question", "Response"])  # Remove rows with missing values
dialogue_text2 = " ".join([f"Q: {q} A: {a}" for q, a in zip(filtered_df2["Question"], filtered_df2["Response"])])

In [28]:
# Running the model
# Summarizing the long text
summary2 = summarize_long_text(dialogue_text2)

print("Final Summary:", summary2)

Final Summary: The question of the deployment of capital is a matter of when, not if.


In [47]:
# Join the list of management discussion strings into a single string
dialogue_text2_str = " ".join(dialogue_text2)

# Calculate ROUGE scores
scores2 = scorer.score(dialogue_text2_str, summary2)
for key in scores2:
    print(f'{key}: {scores2[key]}')

rouge1: Score(precision=0.07142857142857142, recall=0.00033057851239669424, fmeasure=0.0006581112207963147)
rouge2: Score(precision=0.0, recall=0.0, fmeasure=0.0)
rougeL: Score(precision=0.07142857142857142, recall=0.00033057851239669424, fmeasure=0.0006581112207963147)


###**Updating model prompt to be more specific**

In [None]:
def reset_session():
    tf.keras.backend.clear_session()
    np.random.seed(42)
    random.seed(42)
    tf.random.set_seed(42)

reset_session()

In [45]:
# Load FLAN-T5 Large model and tokenizer
model_name = "google/flan-t5-large"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def chunk_text(text, chunk_size=400, overlap=50):
    """Splits text into overlapping chunks of size chunk_size"""
    # Check if text is a list and join if necessary
    if isinstance(text, list):
        text = " ".join(text)
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        end = min(start + chunk_size, len(words))
        chunk = " ".join(words[start:end])
        chunks.append(chunk)
        start += chunk_size - overlap  # Overlap to maintain context
    return chunks

def summarize_text(text, max_new_tokens=100):
    """Summarizes text using FLAN-T5"""
    if pd.isna(text) or text.strip() == "":
        return ""

    prompt = f"Summarize the following financial transcript, highlighting key financial results, performance trends, major developments, and any projections or forecasts mentioned:\n\n{text}"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)
    outputs = model.generate(inputs.input_ids, max_new_tokens=max_new_tokens, do_sample=False)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

def summarize_long_text(text, chunk_size=400, overlap=50):
    """Handles long texts by summarizing in chunks and then summarizing the summaries"""
    chunks = chunk_text(text, chunk_size, overlap)
    chunk_summaries = [summarize_text(chunk) for chunk in chunks]

    # If multiple summaries, summarize again to get final summary
    final_summary = summarize_text(" ".join(chunk_summaries)) if len(chunk_summaries) > 1 else chunk_summaries[0]
    return final_summary


In [52]:
# Running the model
# Summarizing the long text
summary_mgmt1 = summarize_long_text(management_discussion)

print("Final Summary:", summary_mgmt1)

Final Summary: Jpmorgan Chase reported strong fourth quarter earnings on thursday, citing strong performance in the fourth quarter, a record number of new card accounts, record revenue, and strong earnings from markets, payments, securities services, and awm.


In [53]:
# Join the list of management discussion strings into a single string
management_discussion_str = " ".join(management_discussion)

# Calculate ROUGE scores
scores = scorer.score(management_discussion_str, summary_mgmt1)
for key in scores:
    print(f'{key}: {scores[key]}')

rouge1: Score(precision=0.02857142857142857, recall=0.0004791566842357451, fmeasure=0.000942507068803016)
rouge2: Score(precision=0.0, recall=0.0, fmeasure=0.0)
rougeL: Score(precision=0.02857142857142857, recall=0.0004791566842357451, fmeasure=0.000942507068803016)


### **Adding new prompt.**

In [13]:
# Load FLAN-T5 Large model and tokenizer
model_name = "google/flan-t5-large"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def chunk_text(text, chunk_size=400, overlap=50):
    """Splits text into overlapping chunks of size chunk_size"""
    # Check if text is a list and join if necessary
    if isinstance(text, list):
        text = " ".join(text)
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        end = min(start + chunk_size, len(words))
        chunk = " ".join(words[start:end])
        chunks.append(chunk)
        start += chunk_size - overlap  # Overlap to maintain context
    return chunks

def summarize_text(text, max_new_tokens=300):
    """Summarizes a financial performance management transcript for a bank using FLAN-T5"""
    if pd.isna(text) or text.strip() == "":
        return ""

    prompt = f"""
    Summarize the content of the following report in detail, focusing on key insights and highlights. Please address the following points:
    - Overall performance and key outcomes
    - Major achievements and challenges
    - Any important changes or developments discussed in the report
    - Insights or recommendations based on the report’s content
    - Future outlook or potential areas of focus

    Please provide a detailed, clear, and concise summary that includes multiple key takeaways from the report.

    Report Text:
    {text}
    """
    inputs = tokenizer(prompt, return_tensors="pt", max_length=1024, padding="max_length")
    outputs = model.generate(inputs.input_ids, max_new_tokens=max_new_tokens, do_sample=False)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)


def summarize_long_text(text, chunk_size=400, overlap=50):
    """Handles long texts by summarizing in chunks and then summarizing the summaries"""
    chunks = chunk_text(text, chunk_size, overlap)
    chunk_summaries = [summarize_text(chunk) for chunk in chunks]

    # If multiple summaries, summarize again to get final summary
    final_summary = summarize_text(" ".join(chunk_summaries)) if len(chunk_summaries) > 1 else chunk_summaries[0]
    return final_summary


In [14]:
# Running the model
# Summarizing the long text
summary_mgmt3 = summarize_long_text(management_discussion)

print("Final Summary:", summary_mgmt3)

Final Summary: The report provides a comprehensive overview of the financial services industry and highlights key trends and developments.


In [17]:
# Initialize the RougeScorer
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

In [18]:
# Join the list of management discussion strings into a single string
management_discussion_str = " ".join(management_discussion)

# Calculate ROUGE scores
scores = scorer.score(management_discussion_str, summary_mgmt3)
for key in scores:
    print(f'{key}: {scores[key]}')

rouge1: Score(precision=0.7647058823529411, recall=0.001180315961503541, fmeasure=0.0023569939262079592)
rouge2: Score(precision=0.0, recall=0.0, fmeasure=0.0)
rougeL: Score(precision=0.5882352941176471, recall=0.0009079353550027238, fmeasure=0.0018130722509291995)
