<a href="https://colab.research.google.com/github/sheldonkemper/bank_of_england/blob/tidy_up_preprocessing_notebook/notebooks/processed/ct_preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [16]:
"""
===================================================
Author: Chiaki Tachikawa
Role: Data Science Lead, Bank of England Employer Project (Quant Collective)
LinkedIn: https://www.linkedin.com/in/chiaki-tachikawa
Date: 2025-02-13
Version: 1.1

Description:
    This notebook implements a system for cleaning and exporting transcript data for the Bank of England project. The workflow includes:
    - Importing necessary libraries and downloading NLTK data.
    - Defining and applying a `preprocessor` function to clean and tokenize text data.
    - Reading and preprocessing various CSV files containing transcript data.
    - Exporting the preprocessed data to new CSV files for further analysis.

===================================================
"""



# **Library**

In [17]:
import regex as re  # third-party `regex` package, used here as a drop-in for `re`
from collections import Counter

import nltk
import numpy as np
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize, sent_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Download the NLTK resources used by the preprocessor.
nltk.download("stopwords")
nltk.download("punkt")
nltk.download("punkt_tab")
nltk.download("wordnet")



[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


# **Function**

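`preprocessor` function: modifies the DataFrame in place, adding two new columns (`col1` and `col2`) that hold preprocessed versions of the text in `col`. It returns nothing.


Input:
  - `data`: the DataFrame to modify
  - `col`: name of the existing column that contains the text to clean
  - `col1`: name of the new column that holds the tokenised and lemmatised text
  - `col2`: name of the new column that holds the cleaned text (lowercased, with stop words and selected punctuation removed)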

In [18]:
#create function to preprocess data
def preprocessor(data, col, col1, col2):
  """Add a tokenised column (col1) and a cleaned column (col2) to data, in place."""
  stop_words = set(stopwords.words("english"))
  lemmatizer = WordNetLemmatizer()

  #Copy the source column into both new columns
  data[col1] = data[col]
  data[col2] = data[col]

  #Adding column1 (tokenised text)
  #Lowercase the text
  data[col1] = data[col1].str.lower()

  #Remove stop words
  data[col1] = data[col1].apply(lambda x: " ".join([word for word in str(x).split() if word not in stop_words]))

  #Tokenize into words
  data[col1] = data[col1].apply(word_tokenize)

  #Remove tokens made up only of digits
  data[col1] = data[col1].apply(lambda x: [word for word in x if not word.isdigit()])

  #Remove short words (one or two characters)
  data[col1] = data[col1].apply(lambda x: [word for word in x if len(word) > 2])

  #Remove symbols (any character outside a-z)
  data[col1] = data[col1].apply(lambda x: [re.sub(r"[^a-z]", "", word) for word in x])

  #Drop tokens left empty by symbol removal
  data[col1] = data[col1].apply(lambda x: [word for word in x if word != ""])

  #Lemmatization
  data[col1] = data[col1].apply(lambda x: [lemmatizer.lemmatize(word) for word in x])

  #Adding column2 (cleaned text)
  #Lowercase the text
  data[col2] = data[col2].str.lower()

  #Remove stop words
  data[col2] = data[col2].apply(lambda x: " ".join([word for word in str(x).split() if word not in stop_words]))

  #Remove selected punctuation; the result is stored as a one-element list
  data[col2] = data[col2].apply(lambda x: [re.sub(r"[.,'?]", "", x)])

  return
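
To make the order of operations concrete, the sketch below replays the tokenised-column steps on a single string in plain Python. It is an illustration under assumptions, not the notebook's implementation: `DEMO_STOP_WORDS` is a tiny hypothetical stand-in for NLTK's English stop-word list, `str.split` stands in for `word_tokenize`, lemmatisation is omitted, and the sketch ends with a filter for tokens that symbol stripping leaves empty.

```python
import re

# Hypothetical, deliberately tiny stand-in for stopwords.words("english").
DEMO_STOP_WORDS = {"we", "can", "to", "the", "in", "is", "a"}


def demo_tokens(text):
    """Replay the tokenised-column steps on one string (illustration only)."""
    # Lowercase, then drop stop words before tokenising.
    words = [w for w in text.lower().split() if w not in DEMO_STOP_WORDS]
    # Drop tokens that consist only of digits.
    words = [w for w in words if not w.isdigit()]
    # Drop tokens of one or two characters.
    words = [w for w in words if len(w) > 2]
    # Strip every character outside a-z, then discard tokens left empty.
    words = [re.sub(r"[^a-z]", "", w) for w in words]
    return [w for w in words if w]


print(demo_tokens("We can go to the next question in 2025 ..."))
```

Note that "go" disappears because of the length filter, not the stop-word list, and that "..." only vanishes thanks to the final empty-token filter.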


## **Data**

JP Morgan QA section

In [19]:
#Clone the project repository and move to the cleansed data folder
!git clone https://github.com/sheldonkemper/bank_of_england.git
!git switch Preprocessing
%cd bank_of_england/data/cleansed
%ls

Cloning into 'bank_of_england'...
remote: Enumerating objects: 1057, done.
remote: Counting objects: 100% (215/215), done.
remote: Compressing objects: 100% (151/151), done.
remote: Total 1057 (delta 143), reused 78 (delta 64), pack-reused 842 (from 2)
Receiving objects: 100% (1057/1057), 9.84 MiB | 20.32 MiB/s, done.
Resolving deltas: 100% (507/507), done.
fatal: invalid reference: Preprocessing
/content/bank_of_england/data/cleansed/bank_of_england/data/cleansed
jpmorgan_management_discussion.csv  santander_management_discussion.csv  ubs_qna_section.csv
jpmorgan_qa_section.csv             ubs_management_discussion.csv
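
The output above shows two symptoms of re-running this cell: `git switch Preprocessing` fails because that reference does not exist in the repository the command ran in, and the working directory ends up nested (`.../cleansed/bank_of_england/data/cleansed`), which suggests the clone was made from inside an earlier checkout. One possible way to avoid depending on the current directory is sketched below. `ensure_repo` is a hypothetical helper, not part of the notebook, and it assumes `git` is available on the path.

```python
import subprocess
from pathlib import Path

REPO_URL = "https://github.com/sheldonkemper/bank_of_england.git"


def ensure_repo(root, repo_url=REPO_URL):
    """Return <root>/bank_of_england/data/cleansed, cloning only if needed."""
    repo_dir = Path(root) / "bank_of_england"
    if not repo_dir.exists():
        # Clone into an explicit target so the result never depends on
        # the notebook's current working directory.
        subprocess.run(["git", "clone", repo_url, str(repo_dir)], check=True)
    return repo_dir / "data" / "cleansed"
```

In Colab this might be called as `cleansed_dir = ensure_repo("/content")`, after which files can be read with `pd.read_csv(cleansed_dir / "jpmorgan_qa_section.csv")` without any `%cd`.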


In [20]:
#Defining qa_data
qa_data = pd.read_csv("jpmorgan_qa_section.csv")
qa_data.head()

Unnamed: 0,question_speaker,question_job_title,question,answer_speaker,answer_job_title,answer,filename,financial_quarter,call_date
0,"Jeremy Barnum, Chief Financial Officer, JPMorg...","Chief Financial Officer, JPMorganChase",Very good. We can go to the next question. Tha...,,,,4q24-earnings-transcript.pdf,4Q24,2025-01-15
1,"Jim Mitchell, Analyst, Seaport Global Securiti...","Analyst, Seaport Global Securities LLC",Okay. Great. Thanks. Operator: Thank you. Next...,"Jeremy Barnum, Chief Financial Officer, JPMorg...","Chief Financial Officer, JPMorganChase","Right, Erika. Okay. You are tempting me with m...",4q24-earnings-transcript.pdf,4Q24,2025-01-15
2,"Jim Mitchell, Analyst, Seaport Global Securiti...","Analyst, Seaport Global Securities LLC","Yeah. No, that makes sense. And maybe just as ...","Jeremy Barnum, Chief Financial Officer, JPMorg...","Chief Financial Officer, JPMorganChase","Yeah, it's a good question. And I think given ...",4q24-earnings-transcript.pdf,4Q24,2025-01-15
3,"Jamie Dimon, Chairman & Chief Executive Office...","Chairman & Chief Executive Officer, JPMorganChase","Can I just add, no that's great. Jeremy gave i...",,,,4q24-earnings-transcript.pdf,4Q24,2025-01-15
4,"Jeremy Barnum, Chief Financial Officer, JPMorg...","Chief Financial Officer, JPMorganChase","everything, more capital, more liquidity, that...",,,,4q24-earnings-transcript.pdf,4Q24,2025-01-15


In [21]:
#preprocessing data
preprocessor(qa_data, "question", "question_tokenised_data", "question_cleaned_data")
preprocessor(qa_data,"answer","answer_tokenised_data","answer_cleaned_data")

In [22]:
#present preprocessed dataframe
qa_data.head()

Unnamed: 0,question_speaker,question_job_title,question,answer_speaker,answer_job_title,answer,filename,financial_quarter,call_date,question_tokenised_data,question_cleaned_data,answer_tokenised_data,answer_cleaned_data
0,"Jeremy Barnum, Chief Financial Officer, JPMorg...","Chief Financial Officer, JPMorganChase",Very good. We can go to the next question. Tha...,,,,4q24-earnings-transcript.pdf,4Q24,2025-01-15,"[good, next, question, thanks, yeah, operator,...",[good go next question thanks yeah operator: t...,[nan],[nan]
1,"Jim Mitchell, Analyst, Seaport Global Securiti...","Analyst, Seaport Global Securities LLC",Okay. Great. Thanks. Operator: Thank you. Next...,"Jeremy Barnum, Chief Financial Officer, JPMorg...","Chief Financial Officer, JPMorganChase","Right, Erika. Okay. You are tempting me with m...",4q24-earnings-transcript.pdf,4Q24,2025-01-15,"[okay, great, thanks, operator, thank, you, ne...",[okay great thanks operator: thank you next go...,"[right, erika, okay, tempting, many, rabbit, h...",[right erika okay tempting many rabbit holes d...
2,"Jim Mitchell, Analyst, Seaport Global Securiti...","Analyst, Seaport Global Securities LLC","Yeah. No, that makes sense. And maybe just as ...","Jeremy Barnum, Chief Financial Officer, JPMorg...","Chief Financial Officer, JPMorganChase","Yeah, it's a good question. And I think given ...",4q24-earnings-transcript.pdf,4Q24,2025-01-15,"[yeah, make, sense, maybe, followup, loan, gro...",[yeah no makes sense maybe follow-up loan grow...,"[yeah, good, question, think, given, significa...",[yeah good question think given significant im...
3,"Jamie Dimon, Chairman & Chief Executive Office...","Chairman & Chief Executive Officer, JPMorganChase","Can I just add, no that's great. Jeremy gave i...",,,,4q24-earnings-transcript.pdf,4Q24,2025-01-15,"[add, that, great, jeremy, gave, all, let, add...",[add thats great jeremy gave all let add three...,[nan],[nan]
4,"Jeremy Barnum, Chief Financial Officer, JPMorg...","Chief Financial Officer, JPMorganChase","everything, more capital, more liquidity, that...",,,,4q24-earnings-transcript.pdf,4Q24,2025-01-15,"[everything, capital, liquidity, us, data, bal...",[everything capital liquidity uses data balanc...,[nan],[nan]
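
The `[nan]` entries in the answer columns come from rows with no answer: the missing value is cast to the string `"nan"` by `str(x)` and then survives every later filter. A minimal sketch of one way to guard against this, assuming an empty string is an acceptable stand-in for a missing utterance; `text_or_empty` is a hypothetical helper, not part of the notebook.

```python
import math


def text_or_empty(value):
    """Return value as text, mapping None and float NaN to an empty string."""
    if value is None:
        return ""
    if isinstance(value, float) and math.isnan(value):
        return ""
    return str(value)


# pandas represents a missing CSV cell as float("nan").
print(repr(text_or_empty(float("nan"))))
print(repr(text_or_empty("Right, Erika.")))
```

Applied to the source column before lowercasing, for example with `data[col].apply(text_or_empty)`, this would give unanswered rows an empty token list in the tokenised column rather than `[nan]`.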


JP Morgan management discussion

In [23]:
%ls

jpmorgan_management_discussion.csv  santander_management_discussion.csv  ubs_qna_section.csv
jpmorgan_qa_section.csv             ubs_management_discussion.csv


In [27]:
#defining JP Morgan management discussion dataframe
jpmorgan_body_df=pd.read_csv("jpmorgan_management_discussion.csv")
jpmorgan_body_df.head()

Unnamed: 0,speaker,utterance,filename,financial_quarter,call_date
0,"Jeremy Barnum, Chief Financial Officer, JPMorg...","Chief Financial Officer, JPMorganChase Thank y...",4q24-earnings-transcript.pdf,4Q24,2025-01-15
1,"Jamie Dimon, Chairman & Chief Executive Office...","Chairman & Chief Executive Officer, JPMorganCh...",4q24-earnings-transcript.pdf,4Q24,2025-01-15
2,"Jeremy Barnum, Chief Financial Officer, JPMorg...","Chief Financial Officer, JPMorganChase Great. ...",4q24-earnings-transcript.pdf,4Q24,2025-01-15
3,"Jeremy Barnum, Chief Financial Officer, JPMorg...","Chief Financial Officer, JPMorgan Chase & Co. ...",jpmc-third-quarter-2024-earnings-conference-ca...,3Q24,2024-10-11
4,"Jeremy Barnum, Chief Financial Officer, JPMorg...","Chief Financial Officer, JPMorgan Chase & Co. ...",jpm-2q24-earnings-call-transcript-final.pdf,2Q24,2024-07-12


In [None]:
#preprocess data
preprocessor(jpmorgan_body_df, "utterance", "tokenized_data","cleaned_data")

In [None]:
jpmorgan_body_df.head()

UBS QA section

In [None]:
%ls

In [None]:
#define ubs q&a data
ubs_qna_df=pd.read_csv("ubs_qna_section.csv")

In [None]:
#preprocessing ubs Q&A data
preprocessor(ubs_qna_df, "utterance", "tokenized_data","cleaned_data")

In [None]:
ubs_qna_df.head()

UBS management discussion

In [None]:
%ls

In [None]:
#defining ubs management discussion
ubs_manag_df=pd.read_csv("ubs_management_discussion.csv")
ubs_manag_df.head()

In [None]:
#preprocessing ubs management discussion
preprocessor(ubs_manag_df,"utterance", "tokenized_data","cleaned_data")
ubs_manag_df.head()

# **Export the output as a csv file**

JP Morgan QA section

In [28]:
#export preprocessed data
preprocessed_qa_csv_path1 = "/content/bank_of_england/data/preprocessed_data/jpmorgan_qna_df_preprocessed_ver2.csv"
qa_data.to_csv(preprocessed_qa_csv_path1, index=False)

JP Morgan management discussion

In [None]:
#export preprocessed data
preprocessed_qa_csv_path2 = "/content/sample_data/jpmorgan_management_df_preprocessed.csv"
jpmorgan_body_df.to_csv(preprocessed_qa_csv_path2, index=False)

UBS QA section

In [None]:
#export preprocessed data
preprocessed_qa_csv_path3 = "/content/sample_data/ubs_qa_df_preprocessed.csv"
ubs_qna_df.to_csv(preprocessed_qa_csv_path3, index=False)

UBS management discussion

In [None]:
#export preprocessed data
preprocessed_qa_csv_path4 = "/content/sample_data/ubs_management_df_preprocessed.csv"
ubs_manag_df.to_csv(preprocessed_qa_csv_path4, index=False)
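
One consequence of exporting with `to_csv` is worth keeping in mind: the tokenised columns hold Python lists, which CSV stores as their string representation, so a later reader gets strings such as `"['good', 'next']"` back. A minimal sketch of recovering the lists with `ast.literal_eval` follows. It uses the standard `csv` module and an in-memory buffer so that it runs without the project files, and the sample row is made up for illustration.

```python
import ast
import csv
import io

# Made-up row standing in for one exported record.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["question_tokenised_data"])
writer.writerow([str(["good", "next", "question"])])  # list stored as text

buffer.seek(0)
row = next(csv.DictReader(buffer))
raw = row["question_tokenised_data"]  # a plain string after the round trip
tokens = ast.literal_eval(raw)        # parsed back into a list of strings

print(type(raw).__name__, type(tokens).__name__)
```

With pandas, the same idea could be applied on load through the `converters` argument, for example `pd.read_csv(path, converters={"question_tokenised_data": ast.literal_eval})`.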