Add sentence embedding tool #124

Open

jhehemann wants to merge 33 commits into main
Conversation

jhehemann
Contributor

Sentence Embedding for extracting relevant information

Summary

The prediction_sentence_embedding.py tool is a component of an AI Mech service registered on the Autonolas Mech Hub. The tool executes prediction requests submitted on the Mech Hub and returns a prediction estimate as output. To do this, the tool performs the following tasks:

  1. Generate search engine queries based on the prediction request using the OpenAI API.
  2. Submit the queries using the Google Custom Search API and extract URLs.
  3. Iterate over the URLs, extract and clean the text.
  4. Apply a spaCy NLP pipeline to the text and extract sentences.
  5. Generate sentence embeddings and calculate similarity scores between the prediction request question and each sentence.
  6. Concatenate and return the top sentences from each website along with release dates.
  7. Use the extracted information as additional information within a prediction prompt template and pass the prompt to the OpenAI API.
  8. Return the obtained probability estimations along with confidence and additional-information utility in JSON format to the user.
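
For illustration, the JSON returned to the user might look like the following; the exact field names here are an assumption based on the description above, not taken from the PR's code:

```python
# Hypothetical shape of the tool's JSON response; field names are assumptions.
import json

response = {
    "p_yes": 0.65,        # estimated probability of the "Yes" outcome
    "p_no": 0.35,         # estimated probability of the "No" outcome
    "confidence": 0.7,    # confidence in the estimation
    "info_utility": 0.5,  # utility of the additional information used
}
print(json.dumps(response))
```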

Improvements

The following paragraphs explain relevant changes to the original code.

Generate search engine queries

  • Use the gpt-3.5-turbo model, as it is cheaper and query generation does not require a more advanced model (see the sketch below)
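
A minimal sketch of this step, assuming the modern openai client and a hypothetical query prompt; the PR's actual prompt and response parsing may differ:

```python
# Minimal sketch (not the PR's exact code): ask gpt-3.5-turbo for search queries.
import json
from openai import OpenAI

client = OpenAI()

def generate_queries(prediction_request: str, num_queries: int = 3) -> list[str]:
    prompt = (
        f"Provide {num_queries} short web search queries, as a JSON list of strings, "
        f"that would help answer the question: {prediction_request}"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",   # cheaper model is sufficient for query generation
        messages=[{"role": "user", "content": prompt}],
        temperature=0.8,         # higher temperature for more diverse queries
        max_tokens=200,
    )
    return json.loads(response.choices[0].message.content)
```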

Submit queries and extract URLs

  • Fetch up to 10 URLs from each query
  • Add up to three of the fetched URLs to a common set
  • Only add a URL if it is not already present in the common set and if it does not end with ".pdf" (this prevents downloading and processing large files)
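
A minimal sketch of the URL collection described above; the Google Custom Search API endpoint and parameters are real, but the function structure and key handling are assumptions:

```python
# Minimal sketch: fetch up to 10 results per query, keep at most three new,
# non-PDF URLs per query in a common set.
import requests

SEARCH_URL = "https://www.googleapis.com/customsearch/v1"

def collect_urls(queries: list[str], api_key: str, engine_id: str) -> set[str]:
    urls: set[str] = set()
    for query in queries:
        params = {"key": api_key, "cx": engine_id, "q": query, "num": 10}
        items = requests.get(SEARCH_URL, params=params, timeout=10).json().get("items", [])
        added = 0
        for item in items:
            link = item.get("link", "")
            # skip duplicates and PDFs to avoid downloading large files
            if link and not link.endswith(".pdf") and link not in urls:
                urls.add(link)
                added += 1
            if added == 3:
                break
    return urls
```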

Extract and clean text from each website

Extract text

  • Add request session with user agent header
  • Allow redirects
  • Submit a HEAD request before the GET request to verify that the URL's content type is 'text/html'
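
A minimal sketch of the fetching logic described above; the user-agent string and timeout values are illustrative assumptions:

```python
# Minimal sketch: HEAD request first to check the content type, then GET.
import requests

def fetch_html(url: str) -> str | None:
    session = requests.Session()
    session.headers.update({"User-Agent": "Mozilla/5.0 (compatible; mech-tool)"})
    head = session.head(url, allow_redirects=True, timeout=10)
    if "text/html" not in head.headers.get("Content-Type", ""):
        return None                      # skip non-HTML resources
    response = session.get(url, allow_redirects=True, timeout=10)
    response.raise_for_status()
    return response.text
```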

Clean text

  • Remove irrelevant HTML tags and replace them with " " to prevent surrounding text from being concatenated without separation
  • Remove consecutive occurrences of "." characters
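
A minimal sketch of the cleaning step; whether the PR uses BeautifulSoup or another parser is an assumption, and the tag list and regexes are illustrative:

```python
# Minimal sketch: strip irrelevant tags, separate text with spaces,
# and collapse runs of "." characters.
import re
from bs4 import BeautifulSoup

def clean_text(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    # drop tags whose content is irrelevant for prediction
    for tag in soup(["script", "style", "header", "footer", "nav", "aside"]):
        tag.decompose()
    # use " " as separator so adjacent elements do not run together
    text = soup.get_text(separator=" ")
    text = re.sub(r"\.{2,}", ".", text)   # remove consecutive "." characters
    return re.sub(r"\s+", " ", text).strip()
```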

Apply NLP pipeline and extract relevant sentences and release dates

  • Process the resulting text using a spaCy NLP pipeline and separate it into sentences
  • Remove duplicate sentences and discard sentences that are too short to carry meaning
  • Extract the prediction deadline (market closing) date from the prediction request and look for its occurrences within text that may have been discarded due to the sentence-length filter. As these passages could still carry relevant insights, concatenate them with some surrounding context and extract them.
  • Generate embeddings for the prediction request question and the extracted sentences using the spaCy embeddings model provided by LangChain
  • Calculate the similarity scores (cosine similarity) between the question embedding and each sentence embedding
  • Sort the sentences descending by similarity score and concatenate them
  • Extract the release and modification dates of each website's text by looking for commonly known date identifiers within the HTML
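
A minimal sketch of the sentence extraction and similarity ranking described above; it uses spaCy's vectors directly rather than the LangChain SpacyEmbeddings wrapper the PR mentions, and the minimum sentence length and model name are illustrative assumptions:

```python
# Minimal sketch: spaCy sentence segmentation plus cosine-similarity ranking.
import numpy as np
import spacy

nlp = spacy.load("en_core_web_md")   # medium English model ships with word vectors

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def top_sentences(question: str, text: str, top_k: int = 5) -> str:
    doc = nlp(text)
    # segment into sentences, de-duplicate, and drop very short ones
    sentences = {s.text.strip() for s in doc.sents if len(s.text.strip()) > 30}
    q_vec = nlp(question).vector
    scored = [(cosine(q_vec, nlp(s).vector), s) for s in sentences]
    scored.sort(reverse=True)        # sort descending by similarity score
    return " ".join(s for _, s in scored[:top_k])
```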

Generate and submit prediction prompt

  • Concatenate the processed relevant information from each URL and pass it to the prediction prompt template
  • Add a variable to the template that represents the current timestamp as a reference time for the OpenAI model
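
A minimal sketch of how the concatenated information and the current-timestamp variable could be substituted into the prediction prompt template; the template wording here is illustrative, not the PR's actual prompt:

```python
# Minimal sketch: build the prediction prompt with a reference timestamp.
from datetime import datetime, timezone

PREDICTION_PROMPT = """
You are an expert forecaster. The current time is {timestamp}.
Question: {question}
Additional information:
{additional_information}
Answer with probabilities for the outcomes "Yes" and "No" in JSON.
"""

def build_prompt(question: str, info_per_url: list[str]) -> str:
    return PREDICTION_PROMPT.format(
        timestamp=datetime.now(timezone.utc).isoformat(),
        question=question,
        additional_information="\n".join(info_per_url),
    )
```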

Additional settings and features

OpenAI model settings

The maximum number of completion tokens is reduced to leave more room for additional information. The temperature for the URL-query task is set higher, as the results may be more diverse, whereas the temperature for the prediction task is set lower, as the results should be less variable and more predictable.

Prediction prompt

The prediction prompt is very specific and explains different use cases and how the prediction should be adjusted for each of them. This leaves less room for unknown use cases and user prompt variations. The prompt is specifically designed for predictions with the binary outcomes "Yes" and "No".

Automatically cap additional information length

The tool calculates the sum of the tokens in the prediction prompt template, the tokens in the received prediction request, and the tokens reserved for the OpenAI chat completion response, and divides the remaining token budget available for additional information by the number of URLs that are scraped. This gives a dynamic token cap for each website's extracted information and prevents model errors caused by exceeding the maximum allowed number of tokens.
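
A minimal sketch of the token-budget calculation, assuming tiktoken for counting and illustrative constants for the context window and reserved completion tokens:

```python
# Minimal sketch: split the remaining token budget evenly across scraped URLs.
import tiktoken

MAX_CONTEXT_TOKENS = 4096   # model context window (illustrative)
COMPLETION_TOKENS = 500     # tokens reserved for the chat completion (illustrative)

def tokens_per_url(prompt_template: str, prediction_request: str, num_urls: int) -> int:
    enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
    used = len(enc.encode(prompt_template)) + len(enc.encode(prediction_request))
    remaining = MAX_CONTEXT_TOKENS - used - COMPLETION_TOKENS
    return max(remaining // num_urls, 0)
```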

Performance and Limitations

The tool has only been tested manually with limited resources. Here are some qualitative observations comparing the performance of the GPT-4 and GPT-3.5-turbo models. A quantitative data analysis has not been conducted yet, so the observations are not representative of production use.

GPT-4

  • Higher accuracy predicting the outcome of a prediction market question
  • Generates higher probability and confidence values for both outcomes → False positives lead to higher losses

GPT-3.5-turbo

  • Lower accuracy predicting the outcome of a prediction market question
  • Generates lower probability and confidence values for both outcomes → False negatives lead to lower wins

jhehemann and others added 30 commits October 22, 2023 22:41
…additional information indicate that the event will happen after closing date, the probability of the event occurring within the remaining time is low.
…narios and make reasonable estimations: The event has already happened; the event has not happened, but will happen before the market closing time; the event will happen, but after market closing time.
@0xArdi
Collaborator

0xArdi commented May 22, 2024

Any suggestion for this PR @Adamantios ? It's been here for a long time, should we close it?

@0xArdi
Collaborator

0xArdi commented May 22, 2024

There is a lot of good work on it, ideally we address issues and merge.

@Adamantios
Collaborator

> There is a lot of good work on it, ideally we address issues and merge.

Yes, let's merge.

@dvilelaf
Collaborator

@jhehemann do you think we can merge main and bring this PR up to date?
