# Entity Extraction with Palmyra Instruct

This example demonstrates how to use Writer's Palmyra Instruct LLM to extract specific information from an input. In our case, we'll pull video game titles from the titles of Reddit posts. This type of task is useful for, among other things, measuring how often people are talking about specific topics.

### Dependencies

In [7]:
%pip install -q --disable-pip-version-check \
    requests writerai python-dotenv

Note: you may need to restart the kernel to use updated packages.


First, we need to set up the Writer security object. Make sure you have a `.env` file in the same directory with the following lines:
```
WRITER_ORG_ID=<your org ID>
WRITER_API_KEY=<your API key>
```

Alternatively, set the corresponding variables directly in the code block below.

In [8]:
from dotenv import load_dotenv
from writer import Writer
from writer.models.shared import Security
import os

load_dotenv()
org_id = os.environ.get("WRITER_ORG_ID")
api_key = os.environ.get("WRITER_API_KEY")

writer = Writer(
    security=Security(
        api_key=api_key
    ),
    organization_id=org_id
)
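
If the `.env` file is missing or misnamed, `os.environ.get` silently returns `None` and the failure only surfaces later, when the API call is made. A minimal sketch of failing early instead is shown here; `require_env` is a hypothetical helper name, not part of the Writer SDK or `python-dotenv`.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, raising if it is unset or empty.

    Hypothetical helper for illustration; not part of any library used above.
    """
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Missing environment variable: {name}")
    return value
```

With this in place, `api_key = require_env("WRITER_API_KEY")` could replace the plain `os.environ.get` call, so a missing key is reported by name before any request is sent.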

### Gathering data

For this example, we'll use the [Pushshift API](https://github.com/pushshift/api) to gather a batch of Reddit post titles. Since we're looking for video game titles, we'll pull from the "games" subreddit.

In [9]:
import requests
import datetime

url = "https://api.pushshift.io/reddit/submission/search"
params = {
    "subreddit": "games",
    "size": "25",
    "sort_type": "score",
    "after": int(datetime.datetime(2023,5,1).timestamp()),
    "before": int(datetime.datetime(2023,5,6).timestamp())
}

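def get_titles():
    """Fetch matching submissions and return their titles."""
    # A timeout keeps the notebook from hanging if the API is unresponsive.
    res = requests.get(url, params=params, timeout=30)
    # Raise on HTTP errors instead of failing later while parsing the body.
    res.raise_for_status()
    titles = [post['title'] for post in res.json()['data']]
    return titles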
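
The list comprehension above assumes every post in the response carries a `title` key. As a hedged sketch, the same parsing step can be written defensively and tried offline against a hand-made payload. The `{"data": [{"title": ...}]}` shape is taken from the code above; `extract_titles` and the sample payload are made up for illustration.

```python
from typing import Any, Dict, List


def extract_titles(payload: Dict[str, Any]) -> List[str]:
    """Collect non-empty string titles from a {"data": [...]} style payload."""
    titles = []
    for post in payload.get("data", []):
        title = post.get("title")
        # Skip posts with a missing, non-string, or blank title.
        if isinstance(title, str) and title.strip():
            titles.append(title.strip())
    return titles


# Hand-made sample payload, not real API output.
sample = {
    "data": [
        {"title": "I need help installing SimCity"},
        {"id": "abc"},
        {"title": "   "},
    ]
}
print(extract_titles(sample))
```

Only the first entry survives, since the other two have no usable title.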

### Extracting titles

The Palmyra Instruct model is already quite good at following instructions, so this task mostly boils down to sending the right prompt. Creating a prompt involves a bit of trial and error; for example, changing the capitalization of the word "none" has given different results.

In this example, the first prompt I tried had a couple of issues. One was that the model would sometimes give the name of a more well-known game that sounded similar or was in the same franchise. You can see where I tried to fix that in the first commented-out line. This kind of worked, but then it would sometimes respond with "none" when there was clearly a correct answer. Similarly, at first it would format the name of the game differently than it appeared in the title (e.g. Street Fighter 6 -> Street Fighter VI). This isn't a problem per se, but it indicated that the model was drawing more on its prior knowledge than on the input. Both of these issues were mitigated by using the word "extract" instead of "output" or "return", which shows how much a small wording difference can affect the result. I also added the line about unrecognized games to account for titles that hadn't been released when the model was trained. This works reasonably well as far as I can tell, but it still misses some cases, as I'll discuss shortly.
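
To make this kind of wording comparison repeatable, the verb can be turned into a parameter. The sketch below only builds the prompt strings and does not call the model; `build_prompt` is a hypothetical helper, and the prompt text mirrors the one used in this example.

```python
def build_prompt(title: str, verb: str = "extract") -> str:
    """Build the extraction prompt, with the instruction verb as a parameter."""
    return (
        f"Given a post, please {verb} the title of any video game mentioned. "
        "If something looks like a video game title, output it even if you don't recognize it. "
        'If there was no specific game mentioned, just respond with "none".\n'
        f"Post: {title}\nAnswer:\n"
    )


# Print the first line of each variant to compare the wording side by side.
for verb in ("extract", "output", "return"):
    print(build_prompt("Dead island 2 live", verb).splitlines()[0])
```

Each variant could then be sent to the model for the same set of titles and the answers compared.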

In [10]:
from writer.models import operations, shared

def get_answer(title: str):
    prompt = (
        f"Given a post, please extract the title of any video game mentioned. "
        # f"Do not respond with the name of a different game in the same franchise. "
        f"If something looks like a video game title, output it even if you don't recognize it. "
        # f"Try to format the title of the game in the same way as the original post. "
        f"If there was no specific game mentioned, just respond with \"none\".\n"
        f"Post: {title}\nAnswer:\n"
    )

    req = operations.CreateCompletionRequest(
        completion_request=shared.CompletionRequest(
            prompt=prompt, max_tokens=50, temperature=0
        ),
        model_id="palmyra-instruct",
    )

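    res = writer.completions.create(req)
    if res.completion_response is not None:
        # With a single completion requested, the answer is the first choice.
        return res.completion_response.choices[0].text
    else:
        # Surface the API error; callers receive None for this title.
        print(res.fail_response)
        return None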

Now we can test it out:

In [11]:
titles = get_titles()

for title in titles:
    answer = get_answer(title)
    print(f"{title}\n - {answer}")

Diablo IV | Welcome to the Server Slam
 - Diablo IV
I need help installing SimCity
 - SimCity
Dead island 2 live
 - Dead Island 2
is SpongeBob SquarePants: Battle for Bikini Bottom Rehydrated a good games for core gamers
 - SpongeBob SquarePants: Battle for Bikini Bottom Rehydrated
Gametype: Rocket Hell addon - Arena Unlimited mod for OpenArena
 - OpenArena
CONVERGENCE: A League of Legends Story | Official Teaser Rewind
 - League of Legends
A Pathfinder Abomination Vaults ARPG coming to Kickstarter!
 - Pathfinder Abomination Vaults
I'm making a game 2D, combat based, platformer. Here's the first trailer, let me know what you think!
 - None
Street Fighter 6 Was Developed With eSports In Mind, Capcom Confirms
 - Street Fighter 6
ArenaNet Studio Update: Spring 2023 - Guild Wars 2
 - Guild Wars 2
Riot Forge: CONV/RGENCE teaser
 - CONV/RGENCE
BKOM Studios and @paizo are proud to reveal Pathfinder: Abomination Vaults, cooperative hack and slash ARPG!
 - Pathfinder: Abomination Vaults
A Pathf

As you can see, it got most of the titles right and never returned anything that wasn't a title. It did give a false "none" for a few, though. "Dredge" is understandable: it's an ordinary word in the middle of a sentence, so even a person would struggle to spot it without already knowing the game. For some reason I could never get "Star Wars Jedi Survivor" to register; testing with different prompts alternated between "none" and "Star Wars Jedi: Fallen Order". The last miss was Atari's "E.T.", which is over 40 years old now and only two letters long, so that's another understandable miss.
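
Since one use for this kind of extraction is counting how often each game comes up, here is a minimal sketch of tallying the answers. It treats "none" case-insensitively (the model returned both spellings during testing) and skips `None` for failed requests. `count_mentions` and the sample answers are made up for illustration.

```python
from collections import Counter
from typing import Iterable, Optional


def count_mentions(answers: Iterable[Optional[str]]) -> Counter:
    """Tally extracted titles, ignoring failed requests and "none" answers."""
    counts = Counter()
    for answer in answers:
        if answer is None:
            continue  # The request failed for this post.
        cleaned = answer.strip()
        if not cleaned or cleaned.lower() == "none":
            continue  # The model found no game in this post.
        counts[cleaned] += 1
    return counts


# Hand-made sample answers, not real model output.
answers = ["Diablo IV", "None", "none", None, " Guild Wars 2", "Diablo IV"]
print(count_mentions(answers).most_common())
```

Note that this counts exact strings, so "Pathfinder Abomination Vaults" and "Pathfinder: Abomination Vaults" would be tallied separately unless the titles are normalized further.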

Just for fun, we can change up the titles slightly to see what it would take to get them properly recognized and gain a better understanding:

In [12]:
title = "\"Star Wars Jedi Survivor\" - DF Tech Review - PS5 vs Xbox Series X/S - Ambitious But Compromised"
print(f"{title}\n - {get_answer(title)}")

title = "\"Star Wars Jedi: Survivor\" - DF Tech Review - PS5 vs Xbox Series X/S - Ambitious But Compromised"
print(f"{title}\n - {get_answer(title)}")

title = "Get ready to \"Dredge\" more eldritch aberrations from the deep"
print(f"{title}\n - {get_answer(title)}")

title = "20 Years After Atari's \"E.T.\", Another Company Made The Same Mistake"
print(f"{title}\n - {get_answer(title)}")

"Star Wars Jedi Survivor" - DF Tech Review - PS5 vs Xbox Series X/S - Ambitious But Compromised
 - Star Wars Jedi: Fallen Order
"Star Wars Jedi: Survivor" - DF Tech Review - PS5 vs Xbox Series X/S - Ambitious But Compromised
 - Star Wars Jedi: Survivor
Get ready to "Dredge" more eldritch aberrations from the deep
 - Dredge
20 Years After Atari's "E.T.", Another Company Made The Same Mistake
 - E.T.


All it took to recognize most of them was some quotes, which makes sense: clarity is one of the biggest reasons we humans use quotation marks. Notably, though, Jedi: Survivor needed the colon too. Maybe the model just does not want to acknowledge its existence.
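
The "Fallen Order" answer is a case where the model responded from prior knowledge rather than from the post. One cheap way to flag such answers, sketched here under the assumption that a correct extraction should appear in the post text, is a case-insensitive substring check. `is_grounded` is a hypothetical helper name.

```python
def is_grounded(title: str, answer: str) -> bool:
    """Return True if the answer appears in the post title, ignoring case."""
    cleaned = answer.strip()
    if not cleaned:
        return False
    return cleaned.casefold() in title.casefold()


print(is_grounded("Dead island 2 live", "Dead Island 2"))
print(is_grounded(
    '"Star Wars Jedi Survivor" - DF Tech Review',
    "Star Wars Jedi: Fallen Order",
))
```

The first check passes despite the capitalization difference, while the second is flagged. An answer that only reformats the title (e.g. "Street Fighter VI" for "Street Fighter 6") would also be flagged, which may or may not be what you want.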

If these results aren't accurate enough for your purposes, consider looking into creating a [customized model](https://dev.writer.com/docs/custom-models). These allow for more task-specific accuracy, but may require a lot of training data in the form of [example prompt] -> [ideal output] pairs.
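
As a rough sketch of what collecting such pairs might look like, the snippet below serializes (prompt, ideal output) pairs as JSON Lines. The field names `prompt` and `completion` and the JSONL layout are assumptions made for illustration, not Writer's documented format; check the custom model documentation for the format it actually expects.

```python
import json
from typing import Iterable, Tuple


def to_jsonl(pairs: Iterable[Tuple[str, str]]) -> str:
    """Serialize (prompt, ideal output) pairs, one JSON object per line.

    The "prompt" and "completion" field names are hypothetical.
    """
    return "\n".join(
        json.dumps({"prompt": prompt, "completion": ideal})
        for prompt, ideal in pairs
    )


# Hand-written examples based on the posts above.
pairs = [
    ("Post: Dead island 2 live\nAnswer:\n", "Dead island 2"),
    ("Post: I need help installing SimCity\nAnswer:\n", "SimCity"),
]
print(to_jsonl(pairs))
```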