# Extracting people's names from text using Mixtral 8x7b and returning them as JSON

Recently, [mistral.ai](https://mistral.ai) - the (real) open-source AI company from France - released their new open-source "mixture of experts (MoE)" large language model, called "Mixtral 8x7b". To simplify, it's basically an ensemble of 8 7b models, where each token gets dynamically routed to 2 of these 8 models.
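
To make the routing idea a bit more concrete, here is a toy sketch of top-2 gating in plain Python. It is not Mixtral's actual implementation: the `experts`, the gate scores, and the function name `top2_route` are all made up for illustration, and in the real model this selection happens inside every layer rather than once per input.

```python
import math

def top2_route(gate_logits, experts, x):
    """Toy top-2 mixture-of-experts step for a single input x."""
    # pick the two experts with the highest gate scores
    top2 = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)[:2]
    # softmax over just the two selected scores so their weights sum to 1
    exps = [math.exp(gate_logits[i]) for i in top2]
    weights = [e / sum(exps) for e in exps]
    # weighted sum of the two selected experts' outputs; the other six are never run
    output = sum(w * experts[i](x) for w, i in zip(weights, top2))
    return output, top2

# eight hypothetical "experts": expert k just multiplies its input by k
experts = [lambda x, k=k: k * x for k in range(8)]
gate_logits = [0.1, 2.0, -1.0, 0.3, 2.0, 0.0, -0.5, 0.2]  # made-up router scores
output, chosen = top2_route(gate_logits, experts, 1.0)
```

Only the two selected experts do any work per step, which is why such a model can have many parameters in total while using only a fraction of them for each token.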

Also recently, I got interested in using LLMs for something other than just chatbots. I stumbled upon a [repo](https://github.com/abacaj/openhermes-function-calling) from [Anton Bacaj](https://twitter.com/abacaj), where he gets a fine-tuned version of Mistral's 7b model to mimic the "function calling" API from OpenAI.

This got me thinking about what I can use LLMs for. One idea was to extract structured data from unstructured text like podcast transcripts. 

This is my first take on it using Mixtral 8x7b through the [together.ai](https://together.ai) API. 

In [15]:
from tqdm import tqdm
from time import sleep
import json
import os
from pydantic import BaseModel
from typing import List

In [14]:
sessionKey = os.environ['TOGETHER_SESSION_KEY']
auth = os.environ['TOGETHER_AUTH']

This is the function that calls the model through [together.ai](https://together.ai/)'s API with the prompt and returns the response as JSON.

I bet I could optimize the whole idea further by tuning the `temperature` and other parameters, but it works for now. 

In [9]:
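import requests

def query_mistral(prompt):
    """Send the prompt to Mixtral via the together.ai endpoint and return the parsed JSON response."""
    endpoint = 'https://api.together.xyz/inference'
    res = requests.post(endpoint, json={
        "model": "mistralai/Mixtral-8x7B-Instruct-v0.1",
        "max_tokens": 512,
        # Mixtral's instruct format expects the prompt wrapped in [INST] ... [/INST]
        "prompt": f"[INST] {prompt} [/INST]",
        "request_type": "language-model-inference",
        "temperature": 0.7,
        "top_p": 0.7,
        "top_k": 50,
        "repetition_penalty": 1,
        "stop": [
            "[/INST]",
            "</s>"
        ],
        "negative_prompt": "",
        "sessionKey": sessionKey
    }, headers={
        "Authorization": f"Bearer {auth}",
    }, timeout=60)
    # fail loudly on HTTP errors instead of trying to parse an error response
    res.raise_for_status()
    return res.json()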
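
Since the sampling settings are hard-coded in the function above, one way to experiment with them is to build the request body in a small helper. This is a sketch only: `build_payload` is a hypothetical name, the keys mirror a subset of the ones already used in `query_mistral`, and the lower default temperature is an assumption that extraction tasks benefit from less randomness, not a measured result.

```python
def build_payload(prompt, temperature=0.2, top_p=0.7, top_k=50, max_tokens=512):
    """Build the request body for the inference call with tunable sampling settings."""
    return {
        "model": "mistralai/Mixtral-8x7B-Instruct-v0.1",
        "max_tokens": max_tokens,
        "prompt": f"[INST] {prompt} [/INST]",
        "temperature": temperature,
        "top_p": top_p,
        "top_k": top_k,
        "repetition_penalty": 1,
        "stop": ["[/INST]", "</s>"],
    }

# build the same prompt with a few temperatures to compare (no request is sent here)
payloads = [
    build_payload("List the names in: 'Ada met Alan.'", temperature=t)
    for t in (0.0, 0.2, 0.7)
]
```

Each payload could then be posted to the endpoint in turn and the extracted names compared side by side.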

In [10]:
with open("transcript_housel_perell.txt", "r") as f:
    transcript = f.read()

I want to use `pydantic` to validate the returned JSON, so I define two simple `pydantic` models here to validate it against.

I know that you can serialize the JSON schema from a pydantic model directly and put it in the LLM prompt, but that didn't work so well for me. I'll have to look further into it.
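
For reference, putting the schema into the prompt could look roughly like the sketch below. It assumes pydantic v2 (where `model_json_schema()` is available); `get_schema_prompt` and its wording are hypothetical, and the two model classes are the same `Person` and `People` used in this post, repeated so the snippet runs on its own.

```python
import json
from typing import List
from pydantic import BaseModel

class Person(BaseModel):
    name: str

class People(BaseModel):
    names: List[Person]

# pydantic v2: model_json_schema() returns the JSON schema as a plain dict
schema = json.dumps(People.model_json_schema(), indent=2)

def get_schema_prompt(text):
    # hypothetical prompt wording; only the schema itself comes from pydantic
    return (
        "Extract all people's names from the text below.\n"
        "Answer only with a JSON object that conforms to this JSON schema:\n"
        f"{schema}\n\n"
        f"Text: '{text}'"
    )

prompt = get_schema_prompt("Ada Lovelace wrote to Charles Babbage.")
```

The generated schema is noticeably longer than a hand-written example, which may be one reason a model follows it less reliably.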

In [11]:
class Person(BaseModel):
    name: str

class People(BaseModel):
    names: List[Person]

Below is the main code block. I loop over the text file in blocks of 16384 characters and extract the people's names as JSON.

Then I validate them against my `People` model from above, and if the output is valid I put it in my `result` list.

There are clever things you could do to handle invalid JSON, like calling the LLM again or asking the LLM to turn the invalid JSON string into valid JSON, but again, it works for now.

I also track the error rate for invalid JSON here, but with Mixtral I never got invalid JSON. It does sometimes return more than just the JSON string, but that's easily fixed with the `.split()` call.
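
As a rough illustration of the two ideas above (cutting the JSON out of a longer answer, and calling the model again when that fails), here is a minimal sketch using only the standard library. `extract_json` and `query_with_retry` are hypothetical helpers, and `call_llm` stands for any function that takes a prompt and returns the model's text.

```python
import json

def extract_json(text):
    """Return the first JSON object found in text, or None if there is none."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value and ignores whatever follows it
            obj, _ = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            start = text.find("{", start + 1)
    return None

def query_with_retry(call_llm, prompt, max_attempts=3):
    """Call the model until its answer contains parseable JSON, up to max_attempts times."""
    for _ in range(max_attempts):
        parsed = extract_json(call_llm(prompt))
        if parsed is not None:
            return parsed
    return None

# stand-in for the real API call: fails once, then answers with trailing chatter
answers = iter(["Sorry, here you go:", '{"names": [{"name": "Ada"}]}\n\nHope this helps!'])
parsed = query_with_retry(lambda prompt: next(answers), "some prompt")
```

Unlike splitting on the first double newline, this also copes with text that comes before the JSON, at the cost of a few more lines.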

In [18]:
def get_prompt(text):
    return f"""Given the following text, identify all the people's names mentioned and return them in a valid JSON format. 

    Text: '{text}' 

    Please extract the names and format them in JSON like this: 

    {{"names": [{{"name": "Name1"}}, {{"name": "Name2"}}, ...]}}.

    Answer only with the json string, nothing else. This is very important!
    
    """

In [19]:
from pydantic import ValidationError

result = []
error = 0
block_size = 16384
for i in tqdm(range(0, len(transcript), block_size)):
    prompt = get_prompt(transcript[i:i+block_size])

    # get the output text and split it at the first double newline to get only the returned JSON string
    outp = query_mistral(prompt)["output"]["choices"][0]["text"].split("\n\n")[0]

    # validate the returned JSON string and, if it's valid, append it to the result list
    try:
        People.model_validate_json(outp)
        result.append(json.loads(outp))
    except ValidationError:
        print("Invalid json")
        print(outp)
        error = error + 1
    sleep(1) # to avoid rate limiting

# error rate over all processed blocks (valid + invalid), guarding against an empty transcript
total = len(result) + error
print(f"Error rate: {error / total if total else 0.0}")


100%|██████████| 9/9 [00:26<00:00,  2.95s/it]

Error rate: 0.0
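
One caveat of the loop above is that slicing at fixed character offsets can cut a word, and therefore a name, in half at a block boundary; fragments such as 'ting' in the final output may be a symptom of this. A minimal sketch of a splitter that moves each cut back to the nearest whitespace follows. `split_on_whitespace` is a hypothetical helper, not part of the original notebook.

```python
def split_on_whitespace(text, block_size):
    """Split text into blocks of at most block_size characters, breaking at whitespace where possible."""
    blocks = []
    start = 0
    while start < len(text):
        end = min(start + block_size, len(text))
        if end < len(text):
            # move the cut back to the last space or newline inside the block, if there is one
            cut = max(text.rfind(" ", start, end), text.rfind("\n", start, end))
            if cut > start:
                end = cut
        blocks.append(text[start:end])
        start = end
    return blocks

text = "Warren Buffett met Charlie Munger"
blocks = split_on_whitespace(text, 10)
```

This only keeps single words intact. A full name can still be separated between first and last name, so overlapping the blocks by a few hundred characters would be a further option.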





In [20]:
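# flatten the per-block results into one list of {"name": ...} dicts
flat_result = [person for r in result for person in r["names"]]

# deduplicate by name, keeping the first occurrence and the original order
seen = set()
deduplicated_data = []
for person in flat_result:
    if person["name"] not in seen:
        seen.add(person["name"])
        deduplicated_data.append(person)
deduplicated_data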

[{'name': 'People'},
 {'name': 'you'},
 {'name': 'Michael Lewis'},
 {'name': 'Bethany McLean'},
 {'name': 'Joe Weisenthal'},
 {'name': 'Motley Fool'},
 {'name': 'The Molly Fool'},
 {'name': 'ting'},
 {'name': 'Marshall McLuhan'},
 {'name': 'Bill Clinton'},
 {'name': 'Keith Rabois'},
 {'name': 'Marc Andreessen'},
 {'name': 'Frederick Lewis Allen'},
 {'name': 'Winston Churchill'},
 {'name': 'John Maynard Keynes'},
 {'name': 'David McCullough'},
 {'name': 'LeBron James'},
 {'name': 'Barry Ritholtz'},
 {'name': 'Mark Twain'},
 {'name': 'Bill Bonner'},
 {'name': 'Felix Salmon'},
 {'name': 'Robert Curson'},
 {'name': 'ny lawyer'},
 {'name': 'Steve Jobs'},
 {'name': 'Warren Buffett'},
 {'name': 'Howard Marks'},
 {'name': 'Kafka'},
 {'name': 'Tom Stoppard'},
 {'name': 'Ken Burns'},
 {'name': 'Rick Burns'},
 {'name': 'Bezos'},
 {'name': 'Will'},
 {'name': 'Stephen Pressfield'},
 {'name': "Patrick O'Shaughnessy"},
 {'name': 'Tim Ferriss'},
 {'name': 'Charlie Munger'},
 {'name': 'Nassim Taleb'},
