#Introduction

https://cloud.google.com/blog/products/data-analytics/genai-and-google-cloud-ml-to-get-actionable-insight

#Prep

In [None]:
PROJECT_ID = "empik-ga360" # @param {type:"string"}

! gcloud config set project {PROJECT_ID}

Updated property [core/project].


In [None]:
REGION = "US"  # @param {type: "string"}

#Models

* **Embeddings for Text (textembedding-gecko)** - the name of the model that produces text embeddings. Text embedding is an NLP technique that converts text into numeric vectors that machine learning algorithms, large models in particular, can process. These vector representations are designed to capture the semantic meaning and context of the words they represent.

* **BigQuery ML K-means model** - a clustering model for data segmentation. K-means is an unsupervised learning technique, so training the model requires neither labels nor splitting the data into training and evaluation sets.

* **text-bison foundation model** - a large language model trained on a massive dataset of text and code. It can generate text, translate between languages, write many kinds of creative content, and answer a wide range of questions. It is part of Generative AI on Vertex AI.
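
To make the embedding idea from the first bullet more concrete, here is a minimal sketch of how vector representations can express semantic closeness. The three-dimensional vectors and the texts are hypothetical values chosen for illustration (real embeddings from the model have hundreds of dimensions), and cosine similarity is only one common way to compare them.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings", not produced by textembedding-gecko
toy_embeddings = {
    "debt collector keeps calling": [0.9, 0.1, 0.0],
    "collection agency harassing me": [0.8, 0.2, 0.1],
    "cannot withdraw money from account": [0.1, 0.9, 0.2],
}

# Compare every text against the first one
query = toy_embeddings["debt collector keeps calling"]
similarities = {
    text: cosine_similarity(query, vec) for text, vec in toy_embeddings.items()
}
for text, score in similarities.items():
    print(f"{score:.3f}  {text}")
```

In this toy setup the two debt-collection texts score much closer to each other than to the bank-account text, which is the property the clustering step relies on.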



#Text embedding (textembedding-gecko)

In [None]:
import bigframes.pandas as bf

bf.options.bigquery.project = PROJECT_ID
bf.options.bigquery.location = REGION

In [None]:
%%bigquery

SELECT
  *
FROM
  `bigquery-public-data.cfpb_complaints.complaint_database`
LIMIT
  3

Unnamed: 0,date_received,product,subproduct,issue,subissue,consumer_complaint_narrative,company_public_response,company_name,state,zip_code,tags,consumer_consent_provided,submitted_via,date_sent_to_company,company_response_to_consumer,timely_response,consumer_disputed,complaint_id
0,2014-03-05,Bank account or service,Other bank product/service,"Account opening, closing, or management",,,,ERC,AR,72336.0,Servicemember,,Postal mail,2014-03-10,Closed with non-monetary relief,True,False,743665
1,2014-01-21,Bank account or service,Other bank product/service,"Account opening, closing, or management",,,,ERC,UT,84046.0,,,Referral,2014-01-29,Closed with non-monetary relief,True,False,678608
2,2020-12-31,Debt collection,Other debt,Attempts to collect debt not owed,Debt was paid,I moved to my new home in XX/XX/2018 and we ha...,,ERC,VA,22602.0,Servicemember,Consent provided,Web,2021-01-14,Closed with explanation,True,,4041190


In [None]:
input_df = bf.read_gbq("bigquery-public-data.cfpb_complaints.complaint_database")

In [None]:
input_df.head(3)

Unnamed: 0,date_received,product,subproduct,issue,subissue,consumer_complaint_narrative,company_public_response,company_name,state,zip_code,tags,consumer_consent_provided,submitted_via,date_sent_to_company,company_response_to_consumer,timely_response,consumer_disputed,complaint_id
0,2022-06-24,"Credit reporting, credit repair services, or o...",Credit reporting,Improper use of your report,Reporting company used your report improperly,,Company has responded to the consumer and the ...,"TRANSUNION INTERMEDIATE HOLDINGS, INC.",AL,35022,,Consent not provided,Web,2022-06-24,Closed with non-monetary relief,True,,5707282
1,2021-06-30,"Credit reporting, credit repair services, or o...",Credit reporting,Incorrect information on your report,Information belongs to someone else,,,"EQUIFAX, INC.",GA,30228,,Consent not provided,Web,2021-06-30,Closed with explanation,True,,4503873
2,2017-06-05,Debt collection,Other debt,False statements or representation,Attempted to collect wrong amount,COLLECTION BUREAU OF AMERICA ACCOUNT NO. XXXX...,,Collection Bureau of America Ltd.,CA,92241,,Consent provided,Web,2017-06-05,Closed with explanation,False,,2539953


In [None]:
input_df.dtypes

date_received                   date32[day][pyarrow]
product                                       string
subproduct                                    string
issue                                         string
subissue                                      string
consumer_complaint_narrative                  string
company_public_response                       string
company_name                                  string
state                                         string
zip_code                                      string
tags                                          string
consumer_consent_provided                     string
submitted_via                                 string
date_sent_to_company            date32[day][pyarrow]
company_response_to_consumer                  string
timely_response                              boolean
consumer_disputed                            boolean
complaint_id                                  string
dtype: object

In [None]:
issues_df = input_df[["consumer_complaint_narrative"]].dropna()
issues_df.head(n=5)

Unnamed: 0,consumer_complaint_narrative
2,COLLECTION BUREAU OF AMERICA ACCOUNT NO. XXXX...
3,"Despite multiple written requests, the unverif..."
6,Once again you guys have not provided me with ...
9,XX/XX/XXXX {$350.00} I received a outstating d...
10,Im am unable to withdraw money from my account...


In [None]:
downsampled_issues_df = issues_df.sample(n=10000)

Embeddings for Text (textembedding-gecko)

In [None]:
from bigframes.ml.llm import PaLM2TextEmbeddingGenerator

model = PaLM2TextEmbeddingGenerator()

The complaints and their text embeddings appear as two columns (`content` and `text_embedding`) in the `predicted_embeddings` DataFrame.

In [None]:
predicted_embeddings = model.predict(downsampled_issues_df)

predicted_embeddings.head()

Unnamed: 0,text_embedding,statistics,ml_embed_text_status,content
558,"[0.002242599381133914, -0.017059965059161186, ...","{""token_count"":231,""truncated"":false}",,We were with Owen we did a loan modification. ...
1161,"[0.011317512020468712, -0.03817551210522652, -...","{""token_count"":184,""truncated"":false}",,XXXX is reporting a small loan tradeline that ...
2075,"[0.013658484444022179, -0.048017676919698715, ...","{""token_count"":176,""truncated"":false}",,XXXXXXXX XXXX keeps contacting me about a deb...
2238,"[0.014644364826381207, -0.019486676901578903, ...","{""token_count"":456,""truncated"":false}",,"Back in XX/XX/XXXX, I was making a deposit at ..."
2275,"[0.005265800282359123, -0.04704391956329346, -...","{""token_count"":318,""truncated"":false}",,I have a a unverified account from XXXX XXXX X...
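
The `statistics` column in the output above reports, for each complaint, the token count and whether the input was truncated before embedding. A minimal sketch of how those values could be inspected follows. It assumes the column is available as JSON strings exactly as displayed, and it copies the five rows shown above; in a live session the column type may differ.

```python
import json

# Values copied from the `statistics` column in the output above
statistics_column = [
    '{"token_count":231,"truncated":false}',
    '{"token_count":184,"truncated":false}',
    '{"token_count":176,"truncated":false}',
    '{"token_count":456,"truncated":false}',
    '{"token_count":318,"truncated":false}',
]

parsed = [json.loads(s) for s in statistics_column]
token_counts = [p["token_count"] for p in parsed]
# Positions of rows whose text was cut off before embedding
truncated_rows = [i for i, p in enumerate(parsed) if p["truncated"]]

print("max tokens:", max(token_counts))
print("mean tokens:", sum(token_counts) / len(token_counts))
print("truncated rows:", truncated_rows)
```

A non-empty `truncated_rows` list would indicate complaints whose embeddings cover only part of the text.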


#k-means BigQuery (KMEANS)

In [None]:
from bigframes.ml.cluster import KMeans

cluster_model = KMeans(n_clusters=10)

The `CENTROID_ID` column of the `clustered_result` DataFrame holds an ID from 1 to 10 that indicates which semantic group each complaint belongs to.

In [None]:
cluster_model.fit(predicted_embeddings[["text_embedding"]])
clustered_result = cluster_model.predict(predicted_embeddings)
clustered_result.head(n=5)

Unnamed: 0,CENTROID_ID,NEAREST_CENTROIDS_DISTANCE,text_embedding,statistics,ml_embed_text_status,content
558,6,"[{'CENTROID_ID': 6, 'DISTANCE': 0.484309679224...","[0.002242599381133914, -0.017059965059161186, ...","{""token_count"":231,""truncated"":false}",,We were with Owen we did a loan modification. ...
1161,8,"[{'CENTROID_ID': 8, 'DISTANCE': 0.471695489693...","[0.011317512020468712, -0.03817551210522652, -...","{""token_count"":184,""truncated"":false}",,XXXX is reporting a small loan tradeline that ...
2075,10,"[{'CENTROID_ID': 10, 'DISTANCE': 0.41308172465...","[0.013658484444022179, -0.048017676919698715, ...","{""token_count"":176,""truncated"":false}",,XXXXXXXX XXXX keeps contacting me about a deb...
2238,7,"[{'CENTROID_ID': 7, 'DISTANCE': 0.497820901064...","[0.014644364826381207, -0.019486676901578903, ...","{""token_count"":456,""truncated"":false}",,"Back in XX/XX/XXXX, I was making a deposit at ..."
2275,8,"[{'CENTROID_ID': 8, 'DISTANCE': 0.450831634508...","[0.005265800282359123, -0.04704391956329346, -...","{""token_count"":318,""truncated"":false}",,I have a a unverified account from XXXX XXXX X...
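
The relationship between `CENTROID_ID` and `NEAREST_CENTROIDS_DISTANCE` in the output above can be sketched as a nearest-centroid lookup. This is an illustration only, not the BigQuery ML implementation: it assumes Euclidean distance, and the 2-D centroids, the point, and the `nearest_centroids` helper are hypothetical.

```python
import math

def nearest_centroids(point, centroids):
    """Return centroids sorted by Euclidean distance to `point`, closest first."""
    distances = [
        {"CENTROID_ID": cid, "DISTANCE": math.dist(point, center)}
        for cid, center in centroids.items()
    ]
    return sorted(distances, key=lambda d: d["DISTANCE"])

# Hypothetical 2-D centroids keyed by 1-based ID, mirroring the CENTROID_ID convention
toy_centroids = {1: (0.0, 0.0), 2: (1.0, 1.0), 3: (0.0, 2.0)}
toy_point = (0.9, 0.8)

ranking = nearest_centroids(toy_point, toy_centroids)
# The assigned cluster is the first (closest) entry in the ranking
assigned_id = ranking[0]["CENTROID_ID"]
print(assigned_id, ranking)
```

In the rows above, the first entry of `NEAREST_CENTROIDS_DISTANCE` likewise carries the same ID as the `CENTROID_ID` column.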


In [None]:
cluster_1_result = clustered_result[clustered_result["CENTROID_ID"] == 1][["content"]]
cluster_1_result_pandas = cluster_1_result.head(5).to_pandas()

In [None]:
cluster_2_result = clustered_result[clustered_result["CENTROID_ID"] == 2][["content"]]
cluster_2_result_pandas = cluster_2_result.head(5).to_pandas()

cluster_2_result_pandas.head(3)

Unnamed: 0,content
2623,I have not found a XXXX XXXX for Credit Accept...
9674,In XX/XX/XXXX I took a loan to buy a vehicle a...
11813,XXXX XXXX XXXX XXXX promised transportation an...


##Example complaints for clusters 1 and 2

`cluster_1_result_pandas["content"].iloc[i]`: this expression uses `iloc` to index the "content" column of the `cluster_1_result_pandas` DataFrame by position. The parameter `i` is the zero-based position of the row to retrieve, independent of the index labels.
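
Because the index labels of the cluster DataFrames are not contiguous (for example 2623, 9674, 11813 in the output above), positional access matters here. The following sketch contrasts `iloc` with `loc` on a hypothetical stand-in DataFrame. The complaint texts are placeholders, and only the index labels are taken from the output above.

```python
import pandas as pd

# Hypothetical stand-in for cluster_2_result_pandas with non-contiguous labels
toy_cluster = pd.DataFrame(
    {"content": ["first complaint", "second complaint", "third complaint"]},
    index=[2623, 9674, 11813],
)

# iloc selects by position, regardless of the index label
first_by_position = toy_cluster["content"].iloc[0]
# loc selects by label; label 2623 happens to be the same row
first_by_label = toy_cluster["content"].loc[2623]

print(first_by_position, "|", first_by_label)
```

Using `iloc` in the loop therefore works for any cluster, whatever row labels survive the filtering.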

In [None]:
prompt1 = 'comment list 1:\n'
for i in range(5):
    prompt1 += str(i + 1) + '. ' + cluster_1_result_pandas["content"].iloc[i] + '\n'

print(prompt1)

comment list 1:
1. In accordance with the Fair Credit Reporting act XXXX XXXX Account XXXX XXXX XXXX XXXX XXXX XXXX XXXX Account XXXX XXXX has violated my rights. 

15 U.S.C 1681 section 602 A. States I have the right to privacy.

15 U.S.C 1681 Section 604 A Section 2 : It also states a consumer reporting agency can not furnish a account without my written instructions
2. The problem that I have is according to 15 Usc 1681A ( XXXX ) There is a need to insure that consumer reporting agencies exercise their grave responsibilities with fairness, impartiality, and a respect for the consumers right to privacy.
3. In accordance with The Fair credit reporting act the list of accounts below has violated my Federal protected consumer rights to privacy and confidentiality under 15 USC 1681. 

XXXX : Account # XXXX XXXXXXXX XXXX XXXX  : Account # XXXX XXXX XXXX : Account # XXXX XXXX XXXX XXXX Account # XXXX XXXX XXXX : Account # XXXX XXXX XXXX XXXX XXXX Account # XXXX XXXX : Account # XXXX XXXX X

#LLM (text-bison foundation model)

##Prompt prep

In [None]:
prompt2 = 'comment list 2:\n'
for i in range(5):
    prompt2 += str(i + 1) + '. ' + cluster_2_result_pandas["content"].iloc[i] + '\n'

print(prompt2)

comment list 2:
1. I have not found a XXXX XXXX for Credit Acceptance nor XXXX XXXX XXXX. 

XXXX XXXX XXXX XXXX XXXX XXXX XXXX XXXX XXXX XXXX XXXX  requires the licensing and regulation of finance lenders and brokers making and brokering consumer and commercial loans, except as specified ; prohibits misrepresentations, fraudulent and deceptive acts in connection with making and brokering of loans ; and provides administrative, civil ( injunction and ancillary relief ) and criminal remedies for violations of the law. 

Only active license Credit Acceptance have is out of state non in California state as required by law.
2. In XX/XX/XXXX I took a loan to buy a vehicle and in XXXX of the same year I had an accident and the car was lost after talking to the insurance company. I called the company to explain what had happened, at most companies what happens is they pause credit while you get insurance information, but here I didn't call them and they always gave me different information and

In [None]:
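# The instruction is written in Polish so that the model answers in Polish. In English:
# "Please highlight, in Polish, the most obvious difference between these two lists of comments:"
prompt = (
    "Proszę podkreślić w języku polskim najbardziej oczywistą różnicę pomiędzy "
    "tymi dwiema listami komentarzy:\n" + prompt1 + prompt2
)
print(prompt)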

Proszę podkreślić w języku polskim najbardziej oczywistą różnicę pomiędzy tymi dwiema listami komentarzy:
comment list 1:
1. In accordance with the Fair Credit Reporting act XXXX XXXX Account XXXX XXXX XXXX XXXX XXXX XXXX XXXX Account XXXX XXXX has violated my rights. 

15 U.S.C 1681 section 602 A. States I have the right to privacy.

15 U.S.C 1681 Section 604 A Section 2 : It also states a consumer reporting agency can not furnish a account without my written instructions
2. The problem that I have is according to 15 Usc 1681A ( XXXX ) There is a need to insure that consumer reporting agencies exercise their grave responsibilities with fairness, impartiality, and a respect for the consumers right to privacy.
3. In accordance with The Fair credit reporting act the list of accounts below has violated my Federal protected consumer rights to privacy and confidentiality under 15 USC 1681. 

XXXX : Account # XXXX XXXXXXXX XXXX XXXX  : Account # XXXX XXXX XXXX : Account # XXXX XXXX XXXX XXXX

In [None]:
from bigframes.ml.llm import PaLM2TextGenerator

q_a_model = PaLM2TextGenerator()

In [None]:
df = bf.DataFrame({"prompt": [prompt]})

In [None]:
major_difference = q_a_model.predict(df)

major_difference["ml_generate_text_llm_result"].iloc[0]

' **Lista 1:**\n\n* **Komentarze są bardzo szczegółowe i odnoszą się do konkretnych przepisów prawnych.**\n* **Komentarze są napisane w języku prawniczym i zawierają wiele cytatów z przepisów prawnych.**\n* **Komentarze są bardzo długie i trudne do zrozumienia dla osób, które nie mają wykształcenia prawniczego.**\n\n**Lista 2:**\n\n* **Komentarze są bardziej ogólne i nie odnoszą się do konkretnych przepisów prawnych.**\n* **Komentarze są napisane w języku potocznym i są łatwe'

#Results

**List 1:**

* **The comments are very detailed and refer to specific legal provisions.**
* **The comments are written in legal language and contain many quotations from legal provisions.**
* **The comments are very long and hard to understand for people without legal training.**

**List 2:**

* **The comments are more general and do not refer to specific legal provisions.**
* **The comments are written in colloquial language and are easy**