### Fine Tuning Bert

In [1]:
from datasets import load_dataset, ClassLabel
from transformers import AutoTokenizer, AutoModelForSequenceClassification, DataCollatorWithPadding, Trainer, TrainingArguments
from peft import LoraConfig, TaskType, PeftModel, get_peft_model
from pathlib import Path
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import pandas as pd

mdl_tok_name = "distilbert-base-uncased"

  from .autonotebook import tqdm as notebook_tqdm


#### Lora Configuration for Bert model

1. **Target Modules is set to:**
    - `q_lin` (query), `k_lin` (key), and `v_lin` (value): The fundamental layers in the attention mechanism crucial for capturing relationships in the inputs.
    - `out_lin`: This layer outputs the processed attention values, making it an important target for parameter-efficient fine-tuning as well.

By including all four (`q_lin`, `k_lin`, `v_lin`, and `out_lin`), I can ensure full coverage of attention-related parameters while keeping the number of learnable parameters minimal due to low-rank adaptation (`r=8`).

In [2]:
lora_config = \
	LoraConfig(
		r = 8,  # Low-rank dimension: Start with 8 for a compact model like DistilBERT
		target_modules = ["q_lin", "k_lin", "v_lin", "out_lin"],  # Correct target modules for attention layers in DistilBERT
		task_type = TaskType.SEQ_CLS,  # Task type, e.g., Sequence Classification
		lora_alpha = 16,  # Scaling factor: smaller due to the lightweight architecture
		lora_dropout = 0.1  # Dropout, increase slightly to prevent overfitting on smaller models
	)

#### Loading the filtered dataset

In [3]:
# Define the file path to the dataset
file_path = Path("data/filtered_dataset.csv")

# Load the dataset using Hugging Face's `load_dataset`
dataset = load_dataset('csv', data_files = str(file_path))

# Inspect the unique values in the 'labels' column
product_classes = dataset["train"].unique("Product")

# Convert the 'Product' column to a ClassLabel feature
product_label = ClassLabel(names=product_classes)
dataset = dataset.cast_column("Product", product_label)

# Rename the columns: "Product" to "labels", and "Consumer complaint narrative" to "complaint"
dataset = dataset.rename_column("Product", "labels")
dataset = dataset.rename_column("Consumer complaint narrative", "complaint")

# Extract the features (columns) we want
dataset = \
    dataset["train"].select_columns(
        ["complaint", "labels"]
    ).train_test_split(
        test_size=0.2,
        shuffle=True,
        seed=23,
        stratify_by_column="labels"
    )

splits = ["train", "test"]

# View the resulting dataset
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['complaint', 'labels'],
        num_rows: 800
    })
    test: Dataset({
        features: ['complaint', 'labels'],
        num_rows: 200
    })
})


#### Inspecting the labels

Credit card is labeled as 0 and  Mortgage is labeled as 1

In [4]:
product_label

ClassLabel(names=['Mortgage', 'Credit card or prepaid card'], id=None)

#### Preprocess dataset

Tokenizing 'Consumer complaint narrative' feature values

In [5]:
tokenizer = AutoTokenizer.from_pretrained(mdl_tok_name)

# Let's use a lambda function to tokenize all the examples
tokenized_dataset = {}
for split in splits:
    tokenized_dataset[split] = dataset[split].map(
        lambda x: tokenizer(x["complaint"],
                            truncation=True,
                            padding=True,
                            return_tensors = "pt"
                            ),
	    batched=True,

    )


# Inspect the available columns in the dataset
tokenized_dataset["train"]

Map: 100%|██████████| 200/200 [00:00<00:00, 1851.88 examples/s]


Dataset({
    features: ['complaint', 'labels', 'input_ids', 'attention_mask'],
    num_rows: 800
})

#### Loading Model

In [6]:
model = AutoModelForSequenceClassification.from_pretrained(
    mdl_tok_name,
    num_labels=2,
    id2label={0: "Mortgage", 1: "Credit card or prepaid card"},
    label2id={"Credit card or prepaid card": 0, "Mortgage": 1},
)
print(model)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): DistilBertSdpaAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)


In [7]:
# If you added new tokens, resize the model's embeddings accordingly
#model.resize_token_embeddings(len(tokenizer))

In [8]:
peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()

trainable params: 887,042 || all params: 67,842,052 || trainable%: 1.3075


#### Defining Evaluation Metrics as a function

In [9]:
def compute_metrics(eval_pred):
	# Unpack predictions and labels
	predictions, labels = eval_pred
	# Get the predicted class (argmax selects the class with the highest score)
	predictions = np.argmax(predictions, axis = 1)
	# Compute metrics
	accuracy = accuracy_score(y_true = labels, y_pred = predictions)
	precision = precision_score(y_true =labels, y_pred =predictions)
	recall = recall_score(y_true = labels, y_pred = predictions)
	f1 = f1_score(y_true = labels, y_pred = predictions)
	# Return all metrics
	return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


#### Define Trainer to fine-tuning the foundation model

The HuggingFace Trainer class handles the training and eval loop for PyTorch for us.

You can find more at this [link](https://huggingface.co/docs/transformers/main_classes/trainer).

In [10]:
trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir= "./data/creditc_mortg",
        # Learning rate
        learning_rate= 2e-5,  #2e-5
        # Train/Validate batch size
        per_device_train_batch_size= 4,  #16 # Reduce batch size to avoid memory crashes
        per_device_eval_batch_size= 4, #16 # Same for evaluation
        # Evaluate and save the model after each epoch
        eval_strategy= "epoch", # Evaluate at the end of each epoch
        save_strategy= "epoch", # Save model checkpoint every epoch
	    # Epochs and weight decay
        num_train_epochs= 1, # Start with 1 epoch, increase as needed
        weight_decay= 0.01,  #Standard weight decay
	    # Resource management
		gradient_accumulation_steps= 4,  # Simulate larger batches with accumulation
	    #
        load_best_model_at_end= True,
	    use_cpu= True, # Ensure no GPU usage
    ),
    train_dataset= tokenized_dataset["train"],
    eval_dataset= tokenized_dataset["test"],
    tokenizer= tokenizer,
    data_collator= DataCollatorWithPadding(tokenizer=tokenizer),
    compute_metrics= compute_metrics,
)

  trainer = Trainer(


#### Start fine-tuning

In [11]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,No log,0.680192,0.725,0.71028,0.76,0.7343


TrainOutput(global_step=50, training_loss=0.6887635040283203, metrics={'train_runtime': 615.9872, 'train_samples_per_second': 1.299, 'train_steps_per_second': 0.081, 'total_flos': 108153913344000.0, 'train_loss': 0.6887635040283203, 'epoch': 1.0})

#### Validate fine-tuned model

In [12]:
trainer.evaluate()

{'eval_loss': 0.6801918148994446,
 'eval_accuracy': 0.725,
 'eval_precision': 0.7102803738317757,
 'eval_recall': 0.76,
 'eval_f1': 0.7342995169082126,
 'eval_runtime': 51.9587,
 'eval_samples_per_second': 3.849,
 'eval_steps_per_second': 0.962,
 'epoch': 1.0}

In [13]:
peft_model.save_pretrained("./vtsoumpris/fnc-distilbert-lora")

In [15]:
# Make a dataframe with the predictions and the text and the labels
items_for_manual_review = tokenized_dataset["test"].select(
    [0, 1, 22, 31, 43, 199, 150, 40]
)

results = trainer.predict(items_for_manual_review)
df = pd.DataFrame(
    {
        "complaint": [item["complaint"] for item in items_for_manual_review],
        "predictions": results.predictions.argmax(axis=1),
        "labels": results.label_ids,
    }
)
# Show all the cell
pd.set_option("display.max_colwidth", None)
df

Unnamed: 0,complaint,predictions,labels
0,"I reported fraudulent activity on my visa credit card with Bank America. There were two fraudulent charges made on XX/XX/XXXX and XX/XX/XXXX at the same location in amount of {$39.00}. I filed a claim with Bank of America and they denied the claim with no explanation other than saying I have too many accounts with this pizza place, which makes no sense. I contacted them again and they said they'd have to file an appeal it would be another 45-90 days. In the meantime while I'm waiting for appeal, they added the two charges back into my account which is continue to accrue interest. Please help me with these fraudulent charges and unfair treatment from Bank of America",1,1
1,"I had decided to cancel my homeowners insurance with one company, and go with another company 2 months prior to the end date of the policy. I had gone to a local branch to inform of the change, and make sure the company that I was leaving would not be paid and the new one would be paid for the upcoming year renewal. The representative took down all the information, and stated that she would make sure the now old insurance company would n't be paid, and would only release a payment to the new insurance company. I received a notice this week to inform that both insurance companies have been paid in a total amount over {$2400.00}. After paying two different insurance company Bank of America ran an out-of-cycle escrow analysis report knowing that two insurance companies had been paid ; which will cause my mortgage payment to increase effective XX/XX/XXXX, 2018 close to {$200.00}. I have contacted the branch representative who stated that she had forgot to get the information over to the mortgage company, and it was my responsibility to contact the insurance company to get the refund back on a cancelled policy. I was informed by the insurance company that a refund could take up to 7 weeks. In the meantime, the mortgage company knowing that an error was made on there end will not remove the escrow analysis which will cause the mortgage payment to increase.",0,0
2,"There are already multiple complaints that Bank of America ignored. Their regulatory complaint staff refuse to return phone calls. On XXXX XXXX, XXXX, I got scammed and thought I was speaking to Bank of America corporate. Unfortunately, it was a criminal enterprise unknown to me when I was trying to verify my information with the crook to get help. As the fraud was happening, I contacted Bank of America right when the fraud was occurring ( the thief without my knowledge or consent obtained cash advances on 4 of my credit cards and somehow sent the money out on one of my checking accounts. ) The fraud dept sadistically insisted that as a woman age XXXX and high risk for covid go into a branch and refused to speak to me. No one was allowed to talk to me. No fraud forms were mailed to me asking for an explanation and a signature which proves they weren't doing anything. Finally, I lucked out with the TN call center and the total freeze was removed. I was able to close my demand deposit accounts. Previous complaints were ignored and their regulatory complaints rep refuses to call me back. Due to the compassion of one fraud person who made an appointment, I did go into the XXXX XXXX and XXXX XXXX was masked and we were in his office. ( A very nasty and sadistic creature named XXXX at the XXXX fraud center stooped so low she took away my online banking when I called when I saw a {$2000.00} fraud credit card balance. ) At that time, only my card ending in XXXX hat was reissued and Iused, had a legitimate balance of {$200.00} ) XXXX showed me that the other credit cards had 0 balances. I sat in his office for an hour while he was on hold with the fraud dept, I left and he continued to hold for them for another hour. XXXX called me -his number is XXXX and advised me that the fraud claims were settled per his conversation ). I believe XXXX but he was given incorrect information. Sadly, the following XXXX claims are denied on XX/XX/XXXX and XX/XX/XXXX : Account ending in XXXX Claim # XXXX amount {$2000.00} Account ending in XXXX Claim # XXXX amount {$1800.00} When the nightmare fraud was reported on XX/XX/XXXX, I was given reference # XXXX XXXX XXXX XXXX XXXX. Again, those fraudulent cash advances- which the bank fraud dept should have realized was not within any pattern I ever made. I never made a cash advance in my life!!!!!! It would have helped if they would have spoken to me. \n\nI am attaching a copy of my police report and even tho I had given the initial person XXXX the police report number, I was told on XX/XX/XXXX that it would help. Apparently, the bank was too inept to even ask for it or try to obtain it. I am attaching the police report. The investigator had to obtain a subpoena and I had called me to say the funds were transferred somehow to an entiry or person in Texas so she has no jurisdiction. I also contend that the fraud dept should have and could have stopped those transfers from going out. I am also attaching screen shots that show the bank knew there was fraud from the start and failed to do due dilligence. \n\nAnd ... ..apparently refusing or neglecting to investigate fraud is standard operating procedure with Bank of America and it is a widespread practice. I am attaching the civil complaint : XXXX XXXX XXXX on behalf of herself and all Plaintiff, Civil No. XXXX BANK OF AMERICA, N.A. , ) Defendant. \nCLASS ACTION COMPLAINT Apparently,, I am not the only one whose fraud credit card claims are being ignored and they are holding the individual responsible - even tho the fraud was reported. \n\nI do not owe Bank of America the money and they totally neglected to investigate those fraud claims on 2 of my credit cards. I am also attaching the most recent screen shot showing those 2 balances. Who knows what else will show up.",1,1
3,"The fraudulent billings on my Bank of America Mastercard XXXX came to my attention on XXXX XX/XX/XXXX after review of the prior year 's Bank of America charge card statements for XXXX. The review was done to gather information to prepare my federal tax return for XXXX. \nThese fraudulent billings to this credit card all had one thing in common - XXXX ( merchants name ) '. After researching these items with a call to XXXX on XX/XX/XXXX, it became clear they were billed from a rogue account. This account had been opened by a relative of one of my employees. This employee had been hired a few years earlier to help me manage my sole proprietor business. I provide eye care services and products. \n\nI called Bank of America on XX/XX/XXXX, with the list of fraudulent billings from this rogue account per XXXX fraud department. Bank of America 's fraud department closed the account, and issued a new credit card XXXX. They investigated the matter. They agreed, and found all the fraudulent billings. They issued credits of {$68000.00}. I agreed this was correct, and they closed the case - per letter from Bank of America XX/XX/XXXX. \n\nI had also filed a report with the XXXX Police Department on XX/XX/XXXX. The report # XXXX was investigated by Detective XXXX. He did confirm the XXXX account is fraudulent after issuing subpoenas. The fraud case is ongoing against those involved as other issues arose from this investigation. All this information had been given to Bank of America. \n\nSurprisingly, on XX/XX/XXXX, I received a letter from Bank of America charging back /denying all my credits. This letter stated the merchant ( fraudster ) reported I received product and had a subscription that was delivered to my address. Again, I called Bank of America to complain the fraudulent billings were unauthorized. They had been paid by my accounting staff at the time ( rogue employee ) with billings from her relative 's rogue XXXX account. \n\nI wrote a letter on XXXX XXXX to Bank of America in reply. However, they denied my request. Thus, I closed the Bank of America account ending XXXX. I have not paid any of the fraud charges to date. Thus, my credit score was ruined since Bank of America reported this. They continue to refuse to issue back unauthorized billing credits. \n\nI believe XXXX had a reponsibility in detecting fraudulent activity, and the opening of this rogue account. If their systems detected numerous billings going out of this rogue account to just my Bank of America account -- it could have been prevented. Bank of America should have contacted Bank of America to confirm this account was indeed fraudulent -- as the XXXX Police Department found through subpoenas. \n\nI also believe Bank of America should adhere to their written policy in not having consumers be responsible to pay for fraudulent billings. If they investigated the matter thoroughly by contacting XXXX, this matter would have been resolved correclty. \n\nIn summary, I have exhausted all other avenues to have my Bank of America account credited back appropriately. I now hope to have CFPB to investigate this matter to apply back credits due to fraud.",0,1
4,"We applied for a home mortgage refinancing with BoA XX/XX/XXXX. After we locked in a market rate, they continually reappraised our home. They continually rejected appraisals until they received a low appraisal and then charged us {$6300.00} of closing cost points because we had become a "" high-risk '' asset mortgage. \nThe appraisal they eventually accepted was XXXX - XXXX % below the other appraisals that they rejected until they received one appraisal that forced us to pay points. Since interest rates increased over loan processing time, we had to pay the {$6300.00} of points. \nBoA said they have no influence over appraisal selection due to Dodd Frank, but they kept rejecting "" independent '' appraisals until they found one low enough to force us to pay points. \nI spoke to the appraisers whose appraisals were rejected and they told me the appraisals were over XXXX % higher than the one low-ball that BoA eventually accepted. From BoA here was no explanation of the rejections, only constant streams of appraisers flowing through our house until they finally found an appraisal XXXX % below market that forced us to pay points. I tried to speak with BoA management. They never called to explain the multiple rejected appraisals or the assignment of points.",0,0
5,"I was in process of completing a loan for the purchase of my first home with XXXX. I went into where I had my current checking account at Bank if America here in XXXX XXXX. When I went to send the wire the banker stated they would not charge me for the wire if I spoke with their mortgage banker to review their offer. I then sat with XXXX XXXX XXXX XXXX. He then described how if I wrote an email to him stating I had a certain rate and points from XXXX he would be able to match that quote. He then proceeded to write the verbiage I needed to say. He told me to copy paste what he wrote and sign it for him. The rate he gave me initially was at a 1 point cost. He said with this letter he could waive that point cost. I later found out that due to the type of loan program and amount I was putting down that Bank of America didnt even truly offer that product for my situation. I felt very uncomfortable, almost like I was committing fraud. I called my mortgage banker and explained how I felt. I realized I wanted to write this report to explain how I felt and did my own research realizing that the process of XXXX writing the letter for me and using an letter to quote match is not actually the correct process to complete a quote match.",1,0
6,"I began receiving the following documents from Bank of America ( BofA ) and XXXX XXXX, XXXX ( Atty ) : 1. XXXX/XXXX/2016 : US Bankruptcy Court Statement In Response To Notice Of Final Cure Payment ( Filed XXXX/XXXX/2016 ) stating Pre-Petition Default Payments that BofA, N.A., "" Agrees that Debtor ( s ) has paid in full the amount required to cure the default on Creditor 's claim '' and Post-Petition Default Payments that BofA, N.A., "" Disagrees that Debtor ( s ) is current with respect to all payments consistent with 1322 ( b ) ( 5 ), and states that the total amount due to cure post petition arrears is : Total Amount Due : {$8200.00} '' ( Atty ) XXXX. XXXX/XXXX/2016 : NOTICE OF THE RIGHT TO CURE THE DEFAULT AND INTENT TO ACCELERATE ( dtd XXXX 2016 ) stating, "" The home loan is in serious default because the required payments have not been made. Bank of America , N.A . has the right to begin the process of foreclosing on the debt and may initiate foreclosure at any time after forty ( 40 ) days from the date of this notice ... '' ( BofA ) XXXX. XXXX/XXXX/2016 : Bank of America Home Loans Statement ( dtd XXXX/XXXX/2016 ) indicating the Total amount due is {$9200.00}. \nXXXX. XXXX/XXXX/2016 : Bank of America Home Loans Borrower Response Package ( dtd XXXX/XXXX/2016 ) stating, "" Our records indicate you have not made your last four or more regularly scheduled payments. Subject to applicable law, foreclosure activities typically begin after four missed payments, so it is important that you take action on this issue quickly. Ignoring the situation and continuing to let your payments become past due will put you at risk of losing your home to foreclosure ... '' XXXX. XXXX/XXXX/2016 : NOTICE OF FORECLOSURE SALE ( dtd XXXX/XXXX/2016 ) stating, "" By letter dated XXXX XXXX, 2016 ( the "" Initial Communication Letter '' ) we notified you that the above-referenced loan had been referred to this law firm for handling ... '' ( Atty ) XXXX. XXXX/XXXX/2016 : "" EXCEPT AS MAY BE NOTED HEREIN, THIS IS AN ATTEMPT TO COLLECT A DEBT. ANY INFORMATION OBTAINED WILL BE USED FOR THAT PURPOSE. '' ( dtd XXXX/XXXX/2016 ) ( Atty ) XXXX. XXXX/XXXX/2016 : Bank of America Home Loans ( dtd XXXX/XXXX/2016 ) stating, "" Based on a careful review of your loan, we are offering you an opportunity to enter into a Trial Period for a loan modification ... '' I have in fact made payments for the entire time period that BofA and XXXX XXXX alleges that I have not made in all of the overwhelming notices that I have received thus far. For over three weeks I spoke to numerous Bank of America personnel in various departments ( See communication notes ) who transferred me back and forth between them with no resolution, and would not accept my XXXX 2016 payment. From the customer view, it appears that the left hand is not aligned with the right hand and that the head has been cut off! \nAlso, on XXXX/XXXX/2016, I requested my payment history from BofA, including the time they allege I missed payments. The transaction details clearly show where they have received my payments during this same timeframe. There were a lot of unexplained reversals that I have been awaiting an explanation for from BofA 's Ledger and Balance Department since XXXX/XXXX/2016. \nIn addition, I responded to all of the notices of allegations ; hand delivered to XXXX XXXX, XXXX ( XXXX/XXXX/2016 ) ; and mailed via US Postal Service certified return receipt to BofA Home Loans in XXXX, XXXX and XXXX, XXXX ( XXXX/XXXX/2016 ) ( See attached ) Lastly, this distressing ordeal, dealing with BofA and XXXX XXXX have interfered with my ability to obtain a suitable consumer credit for a car loan. My life is on hold and I have been in "" WAIT MODE '' since I started receiving these defamatory allegations. I have acted in good faith and BofA and XXXX XXXX have failed to conduct due diligence, appeared to have failed in meeting their legal obligations, and are operating with broken business practices. \nThis entire ordeal has caused me extreme undue stress to no avail!",0,0
7,"My mortgage is serviced by Bank of America. I have a reoccurring payment set up and have for a number of years. This year I got a letter saying my payment was late and they were going to charge me a fee. \n\nI never canceled this autopayment. \n\nSo after calling them and nobody could tell me why this happened, they credited me the fee, and I made a manual payment and they said the auto payments will resume. \n\nI checked for my XXXX payment, and the auto paymets show "" Canceled '' I called them, and nobody knows why. They keep canceling my auto pay. \n\n\nI believe they are committing fraud by coming up with scenarios to charge fees, hoping we don't find out, like XXXX XXXX has done.",1,0
