Add sentence embedding tool #124

Open

jhehemann wants to merge 33 commits into main
Conversation

jhehemann
Contributor

Sentence Embedding for extracting relevant information

Summary

The prediction_sentence_embedding.py tool is a component of an AI Mech service registered on the Autonolas Mech Hub. The tool executes prediction requests submitted on the Mech Hub and returns a prediction estimate as output. To do this, the tool performs the following tasks:

  1. Generate search engine queries based on the prediction request using the OpenAI API.
  2. Submit the queries using the Google Custom Search API and extract URLs.
  3. Iterate over the URLs, extract and clean the text.
  4. Apply a spaCy NLP pipeline to the text and extract sentences.
  5. Generate sentence embeddings and calculate similarity scores between the prediction request question and each sentence.
  6. Concatenate and return the top sentences from each website along with release dates.
  7. Use the extracted information as additional information within a prediction prompt template and pass the prompt to the OpenAI API.
  8. Return the obtained probability estimations along with confidence and additional-information utility in JSON format to the user.
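
For illustration, the JSON returned to the user might look like the following; the exact field names here are an assumption based on the description above, not taken from the PR's code:

```python
# Hypothetical shape of the tool's JSON response; field names are assumptions.
import json

response = {
    "p_yes": 0.65,        # estimated probability of the "Yes" outcome
    "p_no": 0.35,         # estimated probability of the "No" outcome
    "confidence": 0.7,    # confidence in the estimation
    "info_utility": 0.5,  # utility of the additional information used
}
print(json.dumps(response))
```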

Improvements

The following paragraphs explain relevant changes to the original code.

Generate search engine queries

  • Use the gpt-3.5-turbo model, as it is cheaper and query generation does not require a more advanced model (see the sketch below)
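
A minimal sketch of this step, assuming the modern openai client and a hypothetical query prompt; the PR's actual prompt and response parsing may differ:

```python
# Minimal sketch (not the PR's exact code): ask gpt-3.5-turbo for search queries.
import json
from openai import OpenAI

client = OpenAI()

def generate_queries(prediction_request: str, num_queries: int = 3) -> list[str]:
    prompt = (
        f"Provide {num_queries} short web search queries, as a JSON list of strings, "
        f"that would help answer the question: {prediction_request}"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",   # cheaper model is sufficient for query generation
        messages=[{"role": "user", "content": prompt}],
        temperature=0.8,         # higher temperature for more diverse queries
        max_tokens=200,
    )
    return json.loads(response.choices[0].message.content)
```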

Submit queries and extract URLs

  • Fetch up to 10 URLs from each query
  • Add up to three of the fetched URLs to a common set
  • Only add a URL if it is not already present in the common set and if it does not end with ".pdf" (this prevents downloading and processing large files)
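
A minimal sketch of the URL collection described above; the Google Custom Search API endpoint and parameters are real, but the function structure and key handling are assumptions:

```python
# Minimal sketch: fetch up to 10 results per query, keep at most three new,
# non-PDF URLs per query in a common set.
import requests

SEARCH_URL = "https://www.googleapis.com/customsearch/v1"

def collect_urls(queries: list[str], api_key: str, engine_id: str) -> set[str]:
    urls: set[str] = set()
    for query in queries:
        params = {"key": api_key, "cx": engine_id, "q": query, "num": 10}
        items = requests.get(SEARCH_URL, params=params, timeout=10).json().get("items", [])
        added = 0
        for item in items:
            link = item.get("link", "")
            # skip duplicates and PDFs to avoid downloading large files
            if link and not link.endswith(".pdf") and link not in urls:
                urls.add(link)
                added += 1
            if added == 3:
                break
    return urls
```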

Extract and clean text from each website

Extract text

  • Add request session with user agent header
  • Allow redirects
  • Submit a HEAD request before the GET request to verify that the URL's content type is 'text/html'
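
A minimal sketch of the fetching logic described above; the user-agent string and timeout values are illustrative assumptions:

```python
# Minimal sketch: HEAD request first to check the content type, then GET.
import requests

def fetch_html(url: str) -> str | None:
    session = requests.Session()
    session.headers.update({"User-Agent": "Mozilla/5.0 (compatible; mech-tool)"})
    head = session.head(url, allow_redirects=True, timeout=10)
    if "text/html" not in head.headers.get("Content-Type", ""):
        return None                      # skip non-HTML resources
    response = session.get(url, allow_redirects=True, timeout=10)
    response.raise_for_status()
    return response.text
```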

Clean text

  • Remove irrelevant HTML tags and replace them with " " to prevent surrounding text from being concatenated without separation
  • Remove consecutive occurrences of "." characters
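
A minimal sketch of the cleaning step; whether the PR uses BeautifulSoup or another parser is an assumption, and the tag list and regexes are illustrative:

```python
# Minimal sketch: strip irrelevant tags, separate text with spaces,
# and collapse runs of "." characters.
import re
from bs4 import BeautifulSoup

def clean_text(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    # drop tags whose content is irrelevant for prediction
    for tag in soup(["script", "style", "header", "footer", "nav", "aside"]):
        tag.decompose()
    # use " " as separator so adjacent elements do not run together
    text = soup.get_text(separator=" ")
    text = re.sub(r"\.{2,}", ".", text)   # remove consecutive "." characters
    return re.sub(r"\s+", " ", text).strip()
```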

Apply NLP pipeline and extract relevant sentences and release dates

  • Process the resulting text using a spaCy NLP pipeline and separate it into sentences
  • Remove duplicate sentences and discard sentences that are too short to carry meaning
  • Extract the prediction deadline (market closing) date from the prediction request and look for its occurrences within text that may have been discarded due to the sentence-length filter. As these passages could still carry relevant insights, concatenate them with some surrounding context and extract them.
  • Generate embeddings for the prediction request question and the extracted sentences using the spaCy embeddings model provided by LangChain
  • Calculate the similarity scores (cosine similarity) between the question embedding and each sentence embedding
  • Sort the sentences descending by similarity score and concatenate them
  • Extract the release and modification dates of each website's text by looking for commonly known date identifiers within the HTML
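
A minimal sketch of the sentence extraction and similarity ranking described above; it uses spaCy's vectors directly rather than the LangChain SpacyEmbeddings wrapper the PR mentions, and the minimum sentence length and model name are illustrative assumptions:

```python
# Minimal sketch: spaCy sentence segmentation plus cosine-similarity ranking.
import numpy as np
import spacy

nlp = spacy.load("en_core_web_md")   # medium English model ships with word vectors

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def top_sentences(question: str, text: str, top_k: int = 5) -> str:
    doc = nlp(text)
    # segment into sentences, de-duplicate, and drop very short ones
    sentences = {s.text.strip() for s in doc.sents if len(s.text.strip()) > 30}
    q_vec = nlp(question).vector
    scored = [(cosine(q_vec, nlp(s).vector), s) for s in sentences]
    scored.sort(reverse=True)        # sort descending by similarity score
    return " ".join(s for _, s in scored[:top_k])
```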

Generate and submit prediction prompt

  • Concatenate the processed relevant information from each URL and pass it to the prediction prompt template
  • Add a variable to the template that represents the current timestamp as a reference time for the OpenAI model
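
A minimal sketch of how the concatenated information and the current-timestamp variable could be substituted into the prediction prompt template; the template wording here is illustrative, not the PR's actual prompt:

```python
# Minimal sketch: build the prediction prompt with a reference timestamp.
from datetime import datetime, timezone

PREDICTION_PROMPT = """
You are an expert forecaster. The current time is {timestamp}.
Question: {question}
Additional information:
{additional_information}
Answer with probabilities for the outcomes "Yes" and "No" in JSON.
"""

def build_prompt(question: str, info_per_url: list[str]) -> str:
    return PREDICTION_PROMPT.format(
        timestamp=datetime.now(timezone.utc).isoformat(),
        question=question,
        additional_information="\n".join(info_per_url),
    )
```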

Additional settings and features

OpenAI model settings

The maximum number of completion tokens is reduced to leave more room for additional information. The temperature for the URL-query task is set higher, as the results may be more diverse, whereas the temperature for the prediction task is set lower, as the results should be less variable and more predictable.

Prediction prompt

The prediction prompt is very specific and explains different use cases and how the prediction should be adjusted for each of them. This leaves less room for unknown use cases and user prompt variations. The prompt is specifically designed for predictions with the binary outcomes "Yes" and "No".

Automatically cap additional information length

The tool calculates the sum of the tokens in the prediction prompt template, the tokens in the received prediction request, and the tokens reserved for the OpenAI chat completion response, and divides the remaining token budget available for additional information by the number of URLs that are scraped. This gives a dynamic token cap for each website's extracted information and prevents model errors caused by exceeding the maximum allowed number of tokens.
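
A minimal sketch of the token-budget calculation, assuming tiktoken for counting and illustrative constants for the context window and reserved completion tokens:

```python
# Minimal sketch: split the remaining token budget evenly across scraped URLs.
import tiktoken

MAX_CONTEXT_TOKENS = 4096   # model context window (illustrative)
COMPLETION_TOKENS = 500     # tokens reserved for the chat completion (illustrative)

def tokens_per_url(prompt_template: str, prediction_request: str, num_urls: int) -> int:
    enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
    used = len(enc.encode(prompt_template)) + len(enc.encode(prediction_request))
    remaining = MAX_CONTEXT_TOKENS - used - COMPLETION_TOKENS
    return max(remaining // num_urls, 0)
```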

Performance and Limitations

The tool has only been tested manually with limited resources. Here are some qualitative observations comparing the performance of the GPT-4 and GPT-3.5-turbo models. A quantitative data analysis has not been conducted yet, so the observations are not representative of production use.

GPT-4

  • Higher accuracy predicting the outcome of a prediction market question
  • Generates higher probability and confidence values for both outcomes → False positives lead to higher losses

GPT-3.5-turbo

  • Lower accuracy predicting the outcome of a prediction market question
  • Generates lower probability and confidence values for both outcomes → False negatives lead to lower wins

jhehemann and others added 30 commits October 22, 2023 22:41
…additional information indicate that the event will happen after closing date, the probability of the event occurring within the remaining time is low.
…narios and make reasonable estimations: The event has already happened; the event has not happened, but will happen before the market closing time; the event will happen, but after market closing time.
@0xArdi
Collaborator

0xArdi commented May 22, 2024

Any suggestion for this PR @Adamantios ? It's been here for a long time, should we close it?

@0xArdi
Collaborator

0xArdi commented May 22, 2024

There is a lot of good work on it, ideally we address issues and merge.

@Adamantios
Collaborator

> There is a lot of good work on it, ideally we address issues and merge.

Yes, let's merge.

@dvilelaf
Collaborator

@jhehemann do you think we can merge main and bring this PR up to date?
