# Workshop Setup

In [129]:
%load_ext autoreload
%autoreload 2

!pip install pandas

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.0.1[0m[39;49m -> [0m[32;49m23.1.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpython3.10 -m pip install --upgrade pip[0m


In [149]:
from dataset import loader
from queries import Queries
from queries import Prompts
import pandas as pd

# This is our email datasource
emails_ds = loader.load_dataset()
userinfo = pd.DataFrame(columns=["sentiment", "loan_qty", "sender", "motivation", "esg"])


# The stuff ResponsibleLending has to deal with

Take a look at the sample dataset of emails received by ResponsibleLending by running the **view_dataset** notebook in this folder.    
There is a mix of senders and motivations!

# Exploring sentiment analysis

LLMs are pretrained on large corpora of data, so some functionality is available out of the box.

* In this example we'll be using Falcon-7B-instruct (https://huggingface.co/tiiuae/falcon-7b-instruct)
* Falcon-40B is now available in HuggingFace Inference API: https://huggingface.co/tiiuae/falcon-40b

To try Falcon-40B, change the default value of the `model` parameter in the `run_query` method:
```python
def run_query(payload={}, model="tiiuae/falcon-40b"):
```
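
For orientation, here is a minimal sketch of how a helper like `run_query` might assemble its HTTP request. The workshop's own `Queries.run_query` is not shown in this notebook, so the function name `build_request`, the endpoint pattern, and the placeholder token below are assumptions rather than the workshop's implementation.

```python
import json
import urllib.request

# Assumed endpoint pattern for the hosted HuggingFace Inference API.
API_ROOT = "https://api-inference.huggingface.co/models"


def build_request(payload, model="tiiuae/falcon-7b-instruct", token="hf_xxx"):
    """Build (but do not send) a POST request for the given model.

    `token` is a placeholder; a real call needs a valid API token.
    """
    return urllib.request.Request(
        url=f"{API_ROOT}/{model}",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Switching models only changes the URL; the payload stays the same.
req = build_request({"inputs": "Hello"}, model="tiiuae/falcon-40b")
print(req.full_url)
```

Under these assumptions, sending the request would be a matter of passing `req` to `urllib.request.urlopen` and decoding the JSON body of the response.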

In the following example we'll classify each email's sentiment as positive or negative using the default model parameters.

## Positive/negative classification

In [150]:
for id,body in emails_ds.items():
    prompt = Prompts.get_sentiment(body, "positive", "negative")
    sentiment = Queries.run_query({"inputs": prompt})
    print(f"{id} -> {sentiment}")

lex.txt -> [{'generated_text': "\n    Given the email below, classify the following context sentiment as 'positive' or 'negative', just one word.\n    Context:\n    Dear ResponsibleLending,\n\nMy name is Lex Lutor, and I am the world's most notorious villain. \nI am writing to request a loan of $80000 from ResponsibleLending local branch. \nAs you know, I have a long-standing rivalry with the city's hero, Superman. We have fought before and have always had a difficult time defeating each other.\nI am aware that defeating me will be a tough task. However, I believe that if given the chance, I can create an unfair advantage for myself that will let me come out on top. \nI am requesting this loan to increase climate change, create social inequalities and foster financial inestability to ruins Superman reputation, for which I believe that a financial advantage will be necessary.\n\nI understand that there will be risks and uncertainties associated with this loan, but I believe that with th

LLMs tend to be verbose because they are built to generate text. To simplify result processing, it helps to understand the parameters they accept.

Let's run the same query as before, but constrain the model a bit. Since we are classifying into two categories, we only need to generate one word (fewer tokens), and we don't need the input text echoed back in the output.

Limiting the number of generated tokens has two benefits:
1. Lower operational costs (fewer tokens generated).
2. Output that is easier to process.

In [151]:
for id,body in emails_ds.items():
    prompt = Prompts.get_sentiment(body, "positive", "negative")
    sentiment = Queries.run_query({"inputs": prompt, "parameters":{"max_new_tokens": 3, "return_full_text": False}})
    print(f"{id} -> {sentiment}")
    

lex.txt -> [{'generated_text': " 'Positive"}]
dali.txt -> [{'generated_text': ' Positive'}]
thanos.txt -> [{'generated_text': ' Positive\nUser'}]
aquaman.txt -> [{'generated_text': ' Positive'}]
agatha.txt -> [{'generated_text': ' Positive'}]
joker.txt -> [{'generated_text': ' Negative\n\n'}]
coco.txt -> [{'generated_text': ' Positive'}]
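
Even with a tight token limit, the generated text above still carries stray quotes, whitespace and trailing tokens such as `\nUser`. One possible way to clean this up is sketched below. The helper name `normalize_label` is hypothetical and not part of the workshop code; it simply keeps the first alphabetic word and checks it against the allowed labels.

```python
import re


def normalize_label(response, labels):
    """Return the first word of the generated text if it is an allowed label.

    `response` has the shape printed above: [{'generated_text': '...'}].
    Returns None when the first word is not one of `labels`.
    """
    text = response[0].get("generated_text", "")
    match = re.search(r"[A-Za-z]+", text)  # skip quotes, spaces, newlines
    if match is None:
        return None
    word = match.group(0).lower()
    return word if word in labels else None


labels = {"positive", "negative"}
print(normalize_label([{"generated_text": " 'Positive"}], labels))
print(normalize_label([{"generated_text": " Positive\nUser"}], labels))
print(normalize_label([{"generated_text": " Negative\n\n"}], labels))
```

Matching the whole first word, rather than searching for a substring, matters for label pairs where one label contains the other.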


Parameters change depending on the model and the execution environment.   
HuggingFace Inference API uses different parameters depending on the task: https://huggingface.co/docs/api-inference/detailed_parameters

Note: The sentiment results above have a problem: most of the emails are still classified as positive!

We need a better strategy to run the sentiment analysis.

## A better classification?

Maybe positive/negative is not the classification we are looking for...

In [152]:
for id,body in emails_ds.items():
    prompt = Prompts.get_sentiment(body, "violent", "nonviolent")
    sentiment = Queries.run_query({"inputs": prompt, "parameters":{"max_new_tokens": 2, "return_full_text": False}})
    sentiment = sentiment[0].get("generated_text")
    print(f"{id} -> {sentiment}")
    userinfo.at[id, "sentiment"] = sentiment

lex.txt ->  Violent
dali.txt ->  Nonviolent
thanos.txt ->  Violent
aquaman.txt ->  Nonviolent
agatha.txt ->  Nonviolent
joker.txt ->  Violent
coco.txt ->  Nonviolent


# Who's sending this?

Some models perform better than others for simple tasks.   
Instead of falcon-7b-instruct, let's use flan-t5-xxl, a model with very good zero-shot performance.

In [153]:
for id,body in emails_ds.items():
    prompt = Prompts.get_sender(body)
    sender = Queries.run_query({"inputs": prompt, "parameters":{"max_new_tokens": 12, "return_full_text": False}},
                                 model="google/flan-t5-xxl")
    sender = sender[0].get("generated_text")
    print(f"{id} -> {sender}")
    userinfo.at[id, "sender"] = sender

lex.txt -> Lex Lutor
dali.txt -> Salvador Dali
thanos.txt -> Thanos
aquaman.txt -> Aquaman
agatha.txt -> Agatha Christie
joker.txt -> The Joker
coco.txt -> Coco Chanel


Notes:
1. flan-t5-xxl is not even in the LLM top list anymore: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
2. Potential data leaks? Where is this LLM model running?
3. Are larger models always more suitable than smaller ones?

# How much are they asking for?

In [154]:
for id,body in emails_ds.items():
    prompt = Prompts.get_loan(body)
    loan = Queries.run_query({"inputs": prompt, "parameters":{"max_new_tokens": 10, "return_full_text": False}},
                              model="google/flan-t5-xxl")
    loan = loan[0].get("generated_text")
    print(f"{id} -> {loan}")
    userinfo.at[id, "loan_qty"] = loan

lex.txt -> 80000
dali.txt -> 5900
thanos.txt -> 90000
aquaman.txt -> 700000
agatha.txt -> $400000
joker.txt -> $100000
coco.txt -> 120000
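
The amounts above come back as strings, and some carry a currency symbol. If the values are going to be compared or aggregated later, they could be converted to integers first. The helper below is a hypothetical sketch (`parse_amount` is not part of the workshop code) and assumes whole-number amounts.

```python
import re


def parse_amount(text):
    """Extract the first integer amount from text such as '$400000'.

    Thousands separators are dropped; returns None if no digits are found.
    """
    match = re.search(r"\d[\d,]*", text)
    if match is None:
        return None
    return int(match.group(0).replace(",", ""))


for raw in ["80000", "$400000", "$100000"]:
    print(raw, "->", parse_amount(raw))
```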


Note: What happens if we use the Falcon-7B-instruct model instead? (Try removing the `model` parameter.)

# ResponsibleLending is an ESG-centric company

In the following example we try to determine the motivation of the email sender.
Very often, the best approach for a specific scenario is not a single model but a combination of several.

In this case we use falcon-7b-instruct for question answering via text generation, and a custom model trained on ESG ranking for text classification.

The classification we are looking for, ESG (environmental + social + governance), is very specific. There are several approaches:
1. Zero-shot classification into ESG categories.
2. Enrich the context with some ESG-related examples (few-shot classification).
3. Fine-tuning: train a model on ESG-specific data to customize the classification:  
3.1 Use an existing model: https://huggingface.co/TrajanovRisto/bert-esg   
3.2 Trained with a sample ESG dataset: https://huggingface.co/datasets/TrajanovRisto/esg-sentiment
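
To make the second approach more concrete, here is a rough sketch of how a few-shot prompt could be assembled. The function name, the prompt wording and both labelled examples are hypothetical and only illustrate the idea; a real prompt would use reviewed ESG samples.

```python
# Hypothetical labelled examples; replace them with real, reviewed ESG samples.
EXAMPLES = [
    ("We will install solar panels on the community centre.", "Environmental Positive"),
    ("The funds will be used to bribe local officials.", "Governance Negative"),
]


def build_few_shot_prompt(text, examples=EXAMPLES):
    """Prepend labelled examples to the text we want classified."""
    lines = ["Classify the text into an ESG category."]
    for sample, label in examples:
        lines.append(f"Text: {sample}\nLabel: {label}")
    # Leave the final label empty so the model completes it.
    lines.append(f"Text: {text}\nLabel:")
    return "\n\n".join(lines)


print(build_few_shot_prompt("I plan to plant trees in every city park."))
```

The resulting string could be sent as the `inputs` value of a text-generation query, ideally with a small `max_new_tokens` so that only the label is generated.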

More on the LLM fine-tuning: T5 fine-tuning for Esperanto: https://github.com/huggingface/blog/blob/main/notebooks/01_how_to_train.ipynb


In [167]:
for id,body in emails_ds.items():
    prompt = Prompts.get_purpose(body)
    motivation = Queries.run_query({"inputs": prompt, "parameters":{"max_new_tokens": 100, "return_full_text": False}})
    motivation = motivation[0].get("generated_text")
    print(f"{id} -> {motivation}\n")
    userinfo.at[id, "motivation"] = motivation

    esg = Queries.run_query({"inputs": Prompts.get_esg(motivation)}, model="TrajanovRisto/bert-esg")
    print(f"{id} -> {esg}")
    esg = sorted(esg[0], key=lambda x: x['score'], reverse=True)[0:3]
    esg = "-".join([l['label'] for l in esg])
    print(f"{id} -> {esg}\n")
    userinfo.at[id, "esg"] = esg

lex.txt -> 

To create an unfair advantage for Lex Lutor in the fight against Superman.

The purpose of the loan request is to create an unfair advantage for Lex Lutor in the fight against Superman. Lex Lutor believes that by defeating Superman, he can create social and financial inestability to weaken the city's hero. Lex Lutor is requesting a loan of $80000 from ResponsibleLending to increase climate change, create social inequalities, and foster financial inestability to ruin

lex.txt -> [[{'label': 'Governance Negative', 'score': 0.38535118103027344}, {'label': 'Environmental Negative', 'score': 0.17236332595348358}, {'label': 'Social Negative', 'score': 0.15649263560771942}, {'label': 'Environmental Positive', 'score': 0.09813328832387924}, {'label': 'Governance Positive', 'score': 0.08218958228826523}, {'label': 'Social Positive', 'score': 0.056452978402376175}, {'label': 'Governance Neutral', 'score': 0.03517093136906624}, {'label': 'Environmental Neutral', 'score': 0.0209199618
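
The classifier returns one inner list of label/score pairs per input, which is why the cell above indexes `esg[0]` before sorting. The standalone sketch below isolates that step; the function name `top_labels` is hypothetical, and the sample scores are rounded from the lex.txt output above (the truncated tail is left out).

```python
def top_labels(response, k=3):
    """Join the k highest-scoring labels with '-'.

    `response` is a list with one inner list of {'label', 'score'} dicts.
    """
    ranked = sorted(response[0], key=lambda item: item["score"], reverse=True)
    return "-".join(item["label"] for item in ranked[:k])


# Scores rounded from the lex.txt output above, deliberately shuffled.
sample = [[
    {"label": "Social Negative", "score": 0.156},
    {"label": "Governance Negative", "score": 0.385},
    {"label": "Environmental Positive", "score": 0.098},
    {"label": "Environmental Negative", "score": 0.172},
]]
print(top_labels(sample))  # Governance Negative-Environmental Negative-Social Negative
```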

# The secret sauce of ResponsibleLending 

Once we have defined the different dimensions for each customer, it's time to build the loan recommendation system that decides whether the sender gets the loan.   
Let's use falcon-7b-instruct as the model and generate the responses with a custom prompt.    
Check the _get_recommendation_ method.
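
If you don't have the source at hand, the sketch below shows one way such a prompt could combine the collected dimensions. It is purely illustrative: the function name and the prompt wording are assumptions, and the real `Prompts.get_recommendation` may look quite different.

```python
def sketch_recommendation_prompt(sentiment, loan_qty, sender, motivation, esg_data):
    """Hypothetical prompt layout; the real Prompts.get_recommendation may differ."""
    return (
        "You are a loan officer at ResponsibleLending, an ESG-centric company.\n"
        f"Applicant: {sender}\n"
        f"Requested amount: {loan_qty}\n"
        f"Tone of the request: {sentiment}\n"
        f"Stated purpose: {motivation}\n"
        f"ESG classification: {esg_data}\n"
        "Write a short reply email that approves or denies the loan.\n"
        "Email:"
    )


# Values taken from the lex.txt results earlier in this notebook.
prompt = sketch_recommendation_prompt(
    sentiment="Violent",
    loan_qty="80000",
    sender="Lex Lutor",
    motivation="To create an unfair advantage for Lex Lutor in the fight against Superman.",
    esg_data="Governance Negative-Environmental Negative-Social Negative",
)
print(prompt)
```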

In [168]:

# Load  user info from the snapshot
#userinfo = pd.read_csv("userinfo.csv", index_col=0)

ids = list(userinfo.index)
for id in ids:
    sentiment = userinfo.at[id, "sentiment"]
    loan_qty = userinfo.at[id, "loan_qty"]
    sender = userinfo.at[id, "sender"]
    motivation = userinfo.at[id, "motivation"]
    esg_data = userinfo.at[id, "esg"]
    prompt = Prompts.get_recommendation(sentiment, loan_qty, sender, motivation, esg_data)  
    reply_email = Queries.run_query({"inputs": prompt, "parameters":{"max_new_tokens": 150, "return_full_text": False, "temperature":1}})

    print(f"""{id} -> {reply_email}\n""")


lex.txt -> [{'generated_text': '.\nSubject: Deny Loan Request\n\nDear Lex Lutor,\n\nWe regret to inform you that your loan request has been denied. We understand that there might be various reasons behind this decision, and we would like to request you to stay in touch with us to discuss the same. We believe that we can work together to find a solution that would be mutually beneficial.\n\nThank you for your understanding.\n\nBest regards,\n[Your Name]\n[Your Position]\n[Your Company]'}]

dali.txt -> [{'generated_text': '.\nSubject: Your loan has been approved!\n\nDear Salvador Dali,\n\nWe are pleased to inform you that your loan has been approved. We appreciate your understanding and patience throughout the process. We hope that you will stay in touch with us to discuss any further questions or concerns.\n\nBest regards,\nResponsibleLending'}]

thanos.txt -> [{'generated_text': '.\n\nDear Thanos,\n\nWe regret to inform you that your loan application has been denied. We understand that

# Bonus section: How were the emails generated?


## The villains
Replace the sender with your favourite villain...

In [193]:
query = """
You are Iceman. 
Introduce yourself, your residence and request a credit loan of $100000 to the bank's local branch of ResponsibleLending. 
You will use the money to defeat your greatest superhero rival.
Provide details of your achievements and why you should get the loan.
Mention the name of your greatest enemy, the reason why you are enemies and how you plan to eliminate him.
Be informal.

Email: Dear"""
output = Queries.run_query({"inputs": query, "parameters": {"max_new_tokens": 900, "return_full_text" : False, "temperature":0.8}})
print(output)

AssertionError: Please remember to provide your HugginFace API token in access_token.txt
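
The assertion above fires when no API token is available. A check along these lines could sit behind it. This is a sketch only: the helper name `load_token` is hypothetical, and the demo writes a placeholder token to a temporary directory instead of touching your real `access_token.txt`.

```python
import tempfile
from pathlib import Path


def load_token(path="access_token.txt"):
    """Read the API token from a text file and fail early if it is missing."""
    token_file = Path(path)
    assert token_file.is_file(), f"Token file not found: {path}"
    token = token_file.read_text(encoding="utf-8").strip()
    assert token, f"Token file is empty: {path}"
    return token


# Demo with a placeholder token in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "access_token.txt"
    demo_file.write_text("hf_placeholder_token\n", encoding="utf-8")
    print(load_token(demo_file))
```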

And here are the emails of the heroes; replace the name with the hero of your choice:

## The superheroes
Try different names...

In [189]:
query = """
You are Batman.
Write an email asking for $100000 to the local branch customer service of ResponsibleLending.
Introduce yourself in great detail, including location of birth and current residence.
Provide great detail of your well-known career achievements and concrete things you are famous for.
Think of a new ESG-related project in your local community and describe it.

Email: Dear"""
output = Queries.run_query({"inputs": query, "parameters": {"max_new_tokens": 600, "return_full_text" : False, "temperature":0.8}})
print(output)

[{'generated_text': ' Customer Service,\n\nMy name is Batman and I am writing to request a sum of $100,000 to help fund a local charity project in my city. I am a famous superhero and have been living in this area for many years, and I believe that I can use my influence to raise even more money for this amazing project. \n\nI have been working across multiple cities, and am now ready to invest in the local community, where I am currently stationed. I have a long list of philanthropic initiatives, both in my hometown and across various cities.\n\nPlease find attached some information about my philanthropic journey, along with a few ideas of projects I am currently working on. I believe that ResponsibleLending has a great platform to contribute to the charity work, and am certain that it would be a great fit for my project. \n\nI hope to hear from you soon, and I look forward to working with you in the near future.\n\nSincerely,\nBatman'}]
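
To try several names without editing the prompt by hand, the query above can be wrapped in a small template function. The name `build_hero_query` is hypothetical; the prompt text is the same as in the cell above, with the hero name and amount made configurable.

```python
def build_hero_query(name, amount=100000):
    """Fill the superhero prompt with a different name and loan amount."""
    return f"""
You are {name}.
Write an email asking for ${amount} to the local branch customer service of ResponsibleLending.
Introduce yourself in great detail, including location of birth and current residence.
Provide great detail of your well-known career achievements and concrete things you are famous for.
Think of a new ESG-related project in your local community and describe it.

Email: Dear"""


for hero in ["Batman", "Wonder Woman"]:
    query = build_hero_query(hero)
    print(query.splitlines()[1])  # first prompt line, e.g. "You are Batman."
```

Each returned string can be passed as the `inputs` value to `Queries.run_query`, exactly as in the cell above.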
