# Knowledge Extraction from Chat Conversations #

This notebook explores several ways to extract knowledge about an individual from a chat transcript. The extracted knowledge is stored in different formats, including a free-text summary, a JSON list of facts, and a JSON list of subject-predicate-object triples. A subsequent notebook explores which format works best for retrieving information about the user in order to simulate them.
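
As a rough preview, the formats explored here can be pictured as the Python values below. This is only an illustrative sketch: the variable names are hypothetical, and the field names mirror the ones requested in the prompts later in this notebook.

```python
# Free-text summary: easy to read, harder to query programmatically
summary = "John is a farmer. He is married to Jane and has two children, Jack and Jill."

# List of discrete facts, each with a confidence score
facts = [
    {"fact": "The user's name is John", "confidence_score": 1.0},
    {"fact": "The user is a farmer", "confidence_score": 1.0},
]

# Subject-predicate-object triples, closer to a knowledge-graph representation
triples = [
    {
        "subject": "John",
        "predicate": "has a wife named",
        "object": "Jane",
        "confidence_score": 1.0,
    },
]
```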

In [1]:
!pip install --quiet --upgrade torch

In [2]:
# !pip install --quiet --upgrade transformers
!pip install --quiet --upgrade git+https://github.com/huggingface/transformers

In [3]:
!pip install --quiet --upgrade bitsandbytes accelerate

In [4]:
!pip install --quiet --upgrade sentencepiece

In [5]:
!pip install --quiet --upgrade protobuf

In [6]:
conversation = [
    {"role": "assistant", "content": "what is your name?"},
    {"role": "user", "content": "my name is John"},
    {"role": "assistant", "content": "how are you today?"},
    {"role": "user", "content": "I'm great, thanks" },
    {"role": "assistant", "content": "are you a student or employed?"},
    {"role": "user", "content": "I am a farmer"},
    {"role": "assistant", "content": "are you seeing someone or are you married?"},
    {"role": "user", "content": "yes, my wife's name is Jane"},
    {"role": "assistant", "content": "how long have you been married?"},
    {"role": "user", "content": "15 years"},
    {"role": "assistant", "content": "do you have any children?"},
    {"role": "user", "content": "yes, two"},
    {"role": "assistant", "content": "what are their names?"},
    {"role": "user", "content": "Jack and Jill"},
    {"role": "assistant", "content": "how old are they?"},
    {"role": "user", "content": "10 and 12, respectively"},
]

# Flatten the conversation into "role: content" lines for use in prompts
conversation_str = "".join(
    f"{message['role']}: {message['content']}\n" for message in conversation
)

In [7]:
# model_id = "mistralai/Mistral-Nemo-Instruct-2407"

## Quantized Gemma model, stored locally
# model_id = "models/google__gemma-2-27b-it"

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"

## Quantized Llama model, stored locally
# model_id = "models/meta-llama__Meta-Llama-3-70B-Instruct"

model_kwargs = {
    "low_cpu_mem_usage": True,
    "device_map": "sequential", # load the model into GPUs sequentially, to avoid memory allocation issues with balancing
    "torch_dtype": "auto"
}

In [8]:
from transformers import AutoTokenizer
from transformers import AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained(model_id)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    **model_kwargs
)

In [9]:
import utils
utils.print_model_info(model)

model : size : 15316 MB
model : is quantized : False
model : on GPU : True
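
The `utils` module is local to this project and is not shown in the notebook. For readers without it, a rough stand-in might look like the sketch below. This is an assumption about what the helper does, not the author's implementation; it relies on `get_memory_footprint()`, which `transformers` models provide, and on the `is_quantized` attribute that quantized models carry.

```python
def print_model_info(model, name="model"):
    """Print the size, quantization status, and device placement of a model."""
    # get_memory_footprint() returns the size of parameters and buffers in bytes
    size_mb = model.get_memory_footprint() / 1024**2
    print(f"{name} : size : {size_mb:.0f} MB")

    # Assumption: models without an is_quantized attribute are not quantized
    print(f"{name} : is quantized : {getattr(model, 'is_quantized', False)}")

    # Only the first parameter is checked; a model sharded across
    # devices may have other parameters elsewhere
    on_gpu = next(model.parameters()).device.type == "cuda"
    print(f"{name} : on GPU : {on_gpu}")
```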


In [10]:
generate_kwargs = {
    "max_new_tokens": 1024,
    "do_sample": True,
    "temperature": 0.7,
    "top_k": 50,
    "top_p": 0.95,
    "bos_token_id": tokenizer.bos_token_id,
    "pad_token_id": tokenizer.eos_token_id
}

if 'llama' in model_id.lower():
    generate_kwargs["eos_token_id"] = [
        tokenizer.eos_token_id,
        tokenizer.convert_tokens_to_ids("<|eot_id|>")
    ]
else:
    generate_kwargs["eos_token_id"] = tokenizer.eos_token_id

In [11]:
%%time

prompt = """
Create a paragraph summary of facts about the user from the following conversation:
"""

messages = [
    {"role": "user", "content": prompt + conversation_str}
]

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

outputs = model.generate(
    input_ids,
    **generate_kwargs
)

response = outputs[0][input_ids.shape[-1]:]

print(tokenizer.decode(response, skip_special_tokens=True) + '\n')

The attention mask is not set and cannot be inferred from input because pad token is same as eos token.As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


Here's a paragraph summary of the facts about the user, John:

John is a 15-year married farmer who resides with his wife, Jane, and their two children, Jack and Jill. He has been married to Jane for 15 years and has two children, a 10-year-old son named Jack and a 12-year-old daughter named Jill.

CPU times: user 1.83 s, sys: 239 ms, total: 2.07 s
Wall time: 2.18 s
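
The attention-mask warning in the output above appears because only `input_ids` is passed to `generate`. One possible way to avoid it is to ask `apply_chat_template` for a dictionary (`return_dict=True`), which also carries the `attention_mask`, and unpack that into `generate`. The helper name `generate_reply` is hypothetical; this is a sketch and was not run as part of this notebook.

```python
def generate_reply(model, tokenizer, messages, **generate_kwargs):
    """Generate a reply to a chat, passing the attention mask explicitly."""
    # return_dict=True yields both input_ids and attention_mask
    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_tensors="pt",
        return_dict=True,
    ).to(model.device)

    outputs = model.generate(**inputs, **generate_kwargs)

    # Drop the prompt tokens so that only the newly generated text is decoded
    prompt_length = inputs["input_ids"].shape[-1]
    return tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
```

With the objects defined earlier, a call would look like `print(generate_reply(model, tokenizer, messages, **generate_kwargs))`.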


In [12]:
import transformers
from transformers import TextStreamer

# tokenizer = AutoTokenizer.from_pretrained(model_id)
streamer = TextStreamer(tokenizer, skip_prompt=True)

pipe = transformers.pipeline(
    "text-generation",
    # model=model_id,
    model=model,
    model_kwargs=model_kwargs,
    tokenizer=tokenizer,
    streamer=streamer
)

In [13]:
%%time

prompt = """
Create a paragraph summary of facts about the user from the following conversation:
"""

messages = [
    {"role": "user", "content": prompt + conversation_str}
]

_ = pipe(
    messages,
    **generate_kwargs
)

Based on the conversation, here is a paragraph summary of facts about John:

John is a 15-year married farmer who lives with his wife, Jane. He has two children, Jack and Jill, who are 10 and 12 years old, respectively. John is currently in great spirits and enjoys his life as a farmer.<|eot_id|>
CPU times: user 1.38 s, sys: 4.17 ms, total: 1.38 s
Wall time: 1.37 s


In [14]:
%%time

prompt = """
You are an AI assistant that lists facts about the user from chat conversations.
You list facts in a JSON array, with each fact in a dictionary with keys fact and confidence_score.
List all discrete facts about the user from the following chat conversation:
"""

messages = [
    {"role": "user", "content": prompt + conversation_str}
]

_ = pipe(
    messages,
    **generate_kwargs
)

Based on the provided chat conversation, the following are the discrete facts about the user:

```json
[
  {
    "fact": "The user's name is John",
    "confidence_score": 1.0
  },
  {
    "fact": "The user is a farmer",
    "confidence_score": 1.0
  },
  {
    "fact": "The user is married",
    "confidence_score": 1.0
  },
  {
    "fact": "The user's wife's name is Jane",
    "confidence_score": 1.0
  },
  {
    "fact": "The user has been married for 15 years",
    "confidence_score": 1.0
  },
  {
    "fact": "The user has two children",
    "confidence_score": 1.0
  },
  {
    "fact": "The user's children's names are Jack and Jill",
    "confidence_score": 1.0
  },
  {
    "fact": "The user's children are 10 and 12 years old",
    "confidence_score": 1.0
  }
]
```<|eot_id|>
CPU times: user 4.92 s, sys: 18.5 ms, total: 4.94 s
Wall time: 4.91 s
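
The response above wraps the JSON array in introductory text and a Markdown code fence, so it cannot be passed to `json.loads` directly. A minimal sketch of one way to recover the list of facts is shown below. The function name `extract_json_facts` is hypothetical, and the sketch assumes the response contains a single JSON array.

```python
import json
import re

# Build the fence marker programmatically (three backticks)
FENCE = "`" * 3
FENCED_BLOCK = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def extract_json_facts(response_text):
    """Return the list parsed from the first JSON array in a model response."""
    # Prefer the contents of a fenced block if the model produced one
    fenced = FENCED_BLOCK.search(response_text)
    candidate = fenced.group(1) if fenced else response_text

    # Keep only the outermost [...] span to skip any surrounding chatter
    start, end = candidate.find("["), candidate.rfind("]")
    if start == -1 or end < start:
        raise ValueError("no JSON array found in response")

    return json.loads(candidate[start:end + 1])
```

Because the cells in this notebook discard the pipeline result (`_ = pipe(...)`) and rely on the streamer for display, the return value would need to be kept in a variable before its generated text could be passed to a function like this.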


In [15]:
%%time

prompt = """
You are an AI assistant that identifies facts about the user from chat conversations.
You list facts in a JSON array, with each fact in a dictionary with keys subject, predicate, object, and confidence_score.
List all discrete facts about the user from the following chat conversation:
"""

messages = [
    {"role": "user", "content": prompt + conversation_str}
]

_ = pipe(
    messages,
    **generate_kwargs
)

```
[
  {
    "subject": "John",
    "predicate": "has the name",
    "object": "John",
    "confidence_score": 1.0
  },
  {
    "subject": "John",
    "predicate": "is great",
    "object": "emotion",
    "confidence_score": 0.9
  },
  {
    "subject": "John",
    "predicate": "is a farmer",
    "object": "profession",
    "confidence_score": 1.0
  },
  {
    "subject": "John",
    "predicate": "is married",
    "object": "status",
    "confidence_score": 1.0
  },
  {
    "subject": "John",
    "predicate": "has a wife named",
    "object": "Jane",
    "confidence_score": 1.0
  },
  {
    "subject": "John",
    "predicate": "has been married for",
    "object": "15 years",
    "confidence_score": 1.0
  },
  {
    "subject": "John",
    "predicate": "has two children",
    "object": "number of children",
    "confidence_score": 1.0
  },
  {
    "subject": "John",
    "predicate": "has children named",
    "object": "Jack and Jill",
    "confidence_score": 1.0
  },
  {
    "subject": "J