### FinraGPT
A question-answering system for the FINRA rulebook.

In [3]:
import json
import pandas as pd
import nltk
# nltk.download('stopwords')
# nltk.download('popular')

*FINRA API.* Programmatic access to the rulebook requires a paid subscription; see the source documentation [here](https://developer.finra.org/docs#query_api-finra_content-finra_rulebook). In this notebook, I am testing with a manually collected subset of rules instead.

In [4]:
# read rulebook.xlsx as json
rulebook = pd.read_excel('rulebook.xlsx', sheet_name='rulebook', engine='openpyxl')
rulebook = rulebook.to_json(orient="records")
rulebook = json.loads(rulebook)

In [5]:
# convert json to pandas dataframe
rulebook_df = pd.DataFrame(rulebook)

# drop rows with empty rules
rulebook_df = rulebook_df.dropna(subset=['ruleTextAscii'])

rulebook_df.head(5)

Unnamed: 0,effectiveEndDate,ruleParent,ruleTitle,detailedTopics,summaryTopics,ruleNumber,effectiveStartDate,ruleTextAscii,ruleTextHtml
0,,2000. DUTIES AND CONFLICTS > 2090. Know Your C...,2090. Know Your Customer,['Defined Terms within the Rule or Rule Series...,-,2090,1341792000000.0,Every member shall use reasonable diligence...,"<div class=""indent_firstpara""> <span class=""..."
1,,2000. DUTIES AND CONFLICTS > 2090. Know Your C...,2081. Prohibited Conditions Relating to Expung...,,,2081,1406678000000.0,No member or associated person shall condition...,
2,,2000. DUTIES AND CONFLICTS > 2090. Know Your C...,2080. Obtaining an Order of Expungement of Cus...,,,2080,1250467000000.0,(a) Members or associated persons seeking to e...,
3,,"7000. CLEARING, TRANSACTION AND ORDER DATA REQ...",7110. Definitions,,,7110,,"(a) The term ""ADF-eligible security"" means an ...",
4,,"7000. CLEARING, TRANSACTION AND ORDER DATA REQ...",7230A. Trade Report Input,,,7230A,,(a) Reportable Transactions Members shall comp...,


In [6]:
# preprocessing helpers: tokenize, lowercase, remove stop words, stem (defined but not used in preprocess), lemmatize, drop non-alphabetic tokens

def tokenize(text):
    tokens = nltk.word_tokenize(text)
    return tokens

def normalize(tokens):
    normalized_tokens = [w.lower() for w in tokens]
    return normalized_tokens

def stop_words(tokens):
    # the local set has its own name so it does not shadow the function name
    english_stop_words = set(nltk.corpus.stopwords.words('english'))
    filtered_tokens = [w for w in tokens if w not in english_stop_words]
    return filtered_tokens

def stem(tokens):
    stemmer = nltk.stem.PorterStemmer()
    stemmed_tokens = [stemmer.stem(w) for w in tokens]
    return stemmed_tokens

def lemmatize(tokens):
    lemmatizer = nltk.stem.WordNetLemmatizer()
    lemmatized_tokens = [lemmatizer.lemmatize(w) for w in tokens]
    return lemmatized_tokens

def remove_punc(tokens):
    tokens = [w for w in tokens if w.isalpha()]
    return tokens

def preprocess(text):
    tokens = tokenize(text)
    normalized_tokens = normalize(tokens)
    filtered_tokens = stop_words(normalized_tokens)
    lemmatized_tokens = lemmatize(filtered_tokens)
    tokens = remove_punc(lemmatized_tokens)
    return tokens
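
One side effect of `remove_punc` is worth noting: `str.isalpha()` is false for hyphenated tokens, so a term such as "day-trading" is dropped entirely rather than cleaned (this assumes the tokenizer keeps hyphenated words as one token, which matches the preprocessed text for Rule 2130 in the ranking output, where "promoting" is followed directly by "strategy"). The sketch below contrasts the current filter with one possible alternative that splits on hyphens first. `split_hyphens_keep_alpha` is a hypothetical helper, not part of the notebook, and the token list is made up for illustration.

```python
def split_hyphens_keep_alpha(tokens):
    """Split hyphenated tokens into parts, then keep only alphabetic parts."""
    kept = []
    for token in tokens:
        for part in token.split("-"):
            if part.isalpha():
                kept.append(part)
    return kept

# made-up tokens resembling the tokenizer output for a rule sentence
sample_tokens = ["promoting", "a", "day-trading", "strategy", ",", "2130"]

# same filter as remove_punc: "day-trading" fails isalpha() and disappears
strict = [t for t in sample_tokens if t.isalpha()]

# alternative: "day-trading" survives as "day" and "trading"
relaxed = split_hyphens_keep_alpha(sample_tokens)

print(strict)
print(relaxed)
```

Whether the split helps retrieval on the real rulebook is untested; it mainly keeps question terms like "day-trading" available to the similarity step.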


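*Sentence similarity.* Rank the rules by their similarity to the question to find the one most likely to contain the answer. Here, the ranking uses TF-IDF vectors compared by cosine similarity; a kNN or classification model could fill the same role.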

In [7]:
# rank the rules by TF-IDF cosine similarity to the input text
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def get_similarity_score(input_text, rulebook_df):

    # preprocess rulebook
    rulebook_df['preprocessed_ruleTextAscii'] = rulebook_df['ruleTextAscii'].apply(preprocess)
    rulebook_df['preprocessed_ruleTextAscii'] = rulebook_df['preprocessed_ruleTextAscii'].apply(lambda x: ' '.join(x))

    # tfidf vectorizer
    tfidf_vectorizer = TfidfVectorizer()
    tfidf_matrix = tfidf_vectorizer.fit_transform(rulebook_df['preprocessed_ruleTextAscii'])

    # preprocess input text
    input_text = preprocess(input_text)
    input_text = ' '.join(input_text)
    input_tfidf = tfidf_vectorizer.transform([input_text])

    cosine_similarities = cosine_similarity(input_tfidf, tfidf_matrix).flatten()

    # append score to dataframe
    rulebook_df['score'] = cosine_similarities

    # sort by score
    rulebook_df = rulebook_df.sort_values(by='score', ascending=False)

    return rulebook_df
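
To make the scoring less of a black box, here is a minimal pure-Python sketch of the same idea on a toy corpus: weight each term by TF-IDF, then compare the question vector to each document vector by cosine similarity. The IDF formula is meant to mirror the smoothed default of `TfidfVectorizer` (an assumption; check it against the scikit-learn documentation). The documents, their labels `A`/`B`/`C`, and the helper names are all made up for illustration and are not actual rule text.

```python
import math
from collections import Counter

def weigh(tokens, idf):
    """Turn tokens into a sparse TF-IDF vector; unknown terms are ignored."""
    counts = Counter(t for t in tokens if t in idf)
    return {t: c * idf[t] for t, c in counts.items()}

def fit_tfidf(docs):
    """Compute smoothed IDF weights and TF-IDF vectors for tokenized docs."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    return idf, [weigh(doc, idf) for doc in docs]

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# toy, already-preprocessed documents (invented, not real rule text)
toy_docs = {
    "A": "member approve customer account day trading strategy".split(),
    "B": "member reasonable basis recommendation suitable customer".split(),
    "C": "term adf eligible security mean".split(),
}
toy_query = "approve customer account day trading".split()

idf, doc_vectors = fit_tfidf(list(toy_docs.values()))
query_vector = weigh(toy_query, idf)
scores = {label: cosine(query_vector, vec) for label, vec in zip(toy_docs, doc_vectors)}
ranked = sorted(scores, key=scores.get, reverse=True)

print(ranked)
```

Document `A` shares most of the query's terms and ranks first, `B` shares only the common word "customer" and scores low, and `C` shares nothing and scores zero.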

In [8]:
question = "when should a member approve a customer's account for a day-trading strategy?"

top_rules = get_similarity_score(question, rulebook_df).head(5)

top_rule_ascii = top_rules.iloc[0]['ruleTextAscii']

top_rules

Unnamed: 0,effectiveEndDate,ruleParent,ruleTitle,detailedTopics,summaryTopics,ruleNumber,effectiveStartDate,ruleTextAscii,ruleTextHtml,preprocessed_ruleTextAscii,score
59,,2100. TRANSACTIONS WITH CUSTOMERS,2130. Approval Procedures for Day-Trading Acco...,,,2130,1359936000000.0,(a) No member that is promoting a day-trading ...,,member promoting strategy directly indirectly ...,0.563338
115,,2300. SPECIAL PRODUCTS,2370. Security Futures,,,2370,1557274000000.0,"(a) For purposes of this Rule, the term ""secur...",,purpose rule term security future shall defini...,0.248708
63,,2110. RECOMMENDATIONS,2111. Suitability,,,2111,1593475000000.0,(a) A member or an associated person must have...,,member associated person must reasonable basis...,0.223836
94,,2200. COMMUNICATIONS AND DISCLOSURES,2270. Day-Trading Risk Disclosure Statement,,,2270,1386115000000.0,"(a) Except as provided in paragraph (b), no me...",,except provided paragraph b member promoting s...,0.223502
77,,2230. CUSTOMER ACCOUNT STATEMENTS AND CONFIRMA...,2231. Customer Account Statements,,,2231,1704067000000.0,(a) General Except as otherwise provided by pa...,,general except otherwise provided paragraph b ...,0.211602


![](https://cdn.sanity.io/images/vr8gru94/production/baf3d22b6f9639858614098473625063abf2853a-1920x890.png)

*Extractive QA.* Extractive question answering (QA) pulls the answer directly out of a given text, so it can only return information that the text contains. Here, we test with Hugging Face's [BERT SQuAD](https://huggingface.co/bert-large-uncased-whole-word-masking-finetuned-squad) model. Because the answer is a verbatim span of the source, it closely follows the rule's own wording.

In [9]:
from transformers import pipeline, logging
logging.set_verbosity_error()

# initialize the extractive QA model pipeline
ext_qa_model = pipeline("question-answering", model="bert-large-uncased-whole-word-masking-finetuned-squad")

def answer_question(question, context):
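    # split the context into fixed-size chunks; the size is in characters, not tokens,
    # so a chunk boundary can fall in the middle of a word or of the answer itself
    max_chunk_size = 200
    context_chunks = [context[i:i+max_chunk_size] for i in range(0, len(context), max_chunk_size)]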
    
    answers = []
    for chunk in context_chunks:
        # perform question answering on each chunk
        answer = ext_qa_model(question=question, context=chunk)
        answers.append(answer)
    
    # keep the answer with the highest model score across all chunks (other selection criteria are possible)
    best_answer = max(answers, key=lambda x: x['score'])
    return best_answer

extractive_answer = answer_question(question, top_rule_ascii)['answer']
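
Because `answer_question` cuts the context every 200 characters, an answer that straddles a boundary can be split across two chunks. One common mitigation is to chunk on whole words and let consecutive chunks overlap. The sketch below shows that idea; `chunk_words`, `chunk_size`, and `overlap` are hypothetical names and values, not part of the notebook, and the effect on answer quality here is untested.

```python
def chunk_words(text, chunk_size=120, overlap=30):
    """Split text into chunks of whole words, with `overlap` words shared
    between consecutive chunks."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks

# tiny demonstration with placeholder words instead of rule text
demo_text = " ".join(f"w{i}" for i in range(10))
demo_chunks = chunk_words(demo_text, chunk_size=4, overlap=2)
for chunk in demo_chunks:
    print(chunk)
```

Each chunk could then be passed to `ext_qa_model` in place of the character slices, keeping the highest-scoring answer as before.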




*Open Generative QA.* Generative QA produces free text conditioned on the context. It is a form of [text generation](https://huggingface.co/tasks/text-generation); common models include GPT-2, Llama, and Google's FLAN. Unlike the extractive model, a generative model can answer in full sentences.

In [10]:
from transformers import T5Tokenizer, T5ForConditionalGeneration

# initialize
tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-base")
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-base")
max_new_tokens = 150  # cap on the number of generated tokens

max_length = 512  # cap on the number of input tokens; longer inputs are truncated
input_text = question + " " + top_rule_ascii  # combine question and rule text into a single string
input_ids = tokenizer.encode(input_text, return_tensors="pt", truncation=True, max_length=max_length)

# Generate the output
outputs = model.generate(input_ids, max_new_tokens=max_new_tokens)
# skip_special_tokens drops markers such as <pad> and </s> from the decoded text
decoded_output = tokenizer.decode(outputs[0], skip_special_tokens=True).strip()

# summarize the output
def summarize(input_text):
    summ_results = []
    summarizer = pipeline("summarization", model="google/pegasus-xsum")
    summ_results.append(summarizer(input_text, min_length=20, max_length=35))
    return summ_results

generative_answer = summarize(decoded_output)[0][0]['summary_text']
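
The cell above feeds FLAN-T5 the question and the rule text as one plain concatenated string. Instruction-tuned models are often given an explicit instruction instead, so here is a minimal sketch of such a prompt template. `build_prompt` and its wording are hypothetical, not part of the notebook, and I have not tested whether it improves the answers. The question is placed before the context on purpose: if the input is truncated at `max_length` tokens, the tail of the rule is cut rather than the question.

```python
def build_prompt(question: str, context: str) -> str:
    """Format a question and its supporting text as one instruction prompt."""
    # collapse runs of whitespace so stray newlines do not waste input tokens
    question = " ".join(question.split())
    context = " ".join(context.split())
    return (
        "Answer the question using only the context.\n"
        f"Question: {question}\n"
        f"Context: {context}"
    )

# placeholder context; in the notebook this would be top_rule_ascii
example_prompt = build_prompt(
    "when should a member approve a customer's account\n for a day-trading strategy?",
    "<rule text   goes here>",
)
print(example_prompt)
```

The result could replace `input_text` before tokenization; the rest of the cell stays the same.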


In [18]:
# df with question, extracted answer, summarized answer
qa_df = pd.DataFrame([{'question': question, 'extractive_answer': extractive_answer, 'generative_answer': generative_answer}])

qa_df


In [23]:
print('extractive answer:', qa_df.extractive_answer[0])
print('generative answer:', qa_df.generative_answer[0])

extractive answer: prior to opening the account
generative answer: customer has provided the member with the risk disclosure statement set forth in Rule 2270 and has: (1) approved the customer's account for a day-trading strategy in
