Add sentence embedding tool #124
Open
jhehemann wants to merge 33 commits into valory-xyz:main from jhehemann:feat/add-sentence-embedding
Conversation
…additional information indicate that the event will happen after closing date, the probability of the event occurring within the remaining time is low.
… To evaluate if it yields better results.
…narios and make reasonable estimations: The event has already happened; the event has not happened, but will happen before the market closing time; the event will happen, but after market closing time.
…edundant sentence cap
…mbedding # Conflicts: # poetry.lock # pyproject.toml # tox.ini
Any suggestions for this PR, @Adamantios? It's been here for a long time; should we close it?
There is a lot of good work in it; ideally we address the issues and merge.
Yes, let's merge.
@jhehemann do you think we can merge main and bring this PR up to date?
Sentence Embedding for extracting relevant information
Summary
The prediction_sentence_embedding.py tool is a component of an AI Mech service registered on the Autonolas Mech Hub. The tool executes prediction requests submitted via the Mech Hub and returns a prediction estimate as output. To do so, it performs the tasks described in the sections below.
Improvements
The following paragraphs explain relevant changes to the original code.
Generate search engine queries
Submit queries and extract URLs
Extract and clean text from each website
Extract text
Clean text
Apply NLP pipeline and extract relevant sentences and release dates
Generate and submit prediction prompt
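The sentence-selection step above can be sketched as embedding-based ranking: each extracted sentence is scored against the prediction question by cosine similarity, and only the best matches are kept. A minimal pure-Python illustration follows; the function names and toy vectors are hypothetical, and the actual tool would use real sentence embeddings from an embedding model.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_sentences(query_vec, sentences, sentence_vecs, k=3):
    """Rank sentences by similarity to the query embedding, keep the k best."""
    scored = sorted(
        zip(sentences, sentence_vecs),
        key=lambda pair: cosine_similarity(query_vec, pair[1]),
        reverse=True,
    )
    return [sentence for sentence, _ in scored[:k]]
```

Ranking by similarity rather than keeping all scraped text is what lets the tool forward only the most relevant sentences to the prediction prompt.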
Additional settings and features
OpenAI model settings
The maximum number of completion tokens is reduced to leave more room for additional information. The temperature for the URL-query task is set higher, since diverse results are desirable there, whereas the temperature for the prediction task is set lower, since its results should be less variable and more predictable.
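A per-task configuration of this kind might look as follows. The dictionary keys and the exact temperature and token values are illustrative assumptions, not the tool's actual settings:

```python
# Illustrative per-task OpenAI generation settings; keys and values
# are assumptions, not the tool's actual configuration.
GENERATION_SETTINGS = {
    # Higher temperature: diverse search-engine queries are desirable.
    "url_query": {"temperature": 0.9, "max_completion_tokens": 500},
    # Lower temperature: the prediction should be stable and reproducible.
    "prediction": {"temperature": 0.0, "max_completion_tokens": 200},
}


def settings_for(task: str) -> dict:
    """Look up the generation settings for a given task."""
    return GENERATION_SETTINGS[task]
```

Keeping the settings in one mapping makes the temperature asymmetry between the two tasks explicit and easy to tune.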
Prediction prompt
The prediction prompt is very specific and explains different use cases and how the prediction should be adjusted for each of them. This leaves less room for unknown use cases and user prompt variations. The prompt is specifically designed for a prediction with the binary outcomes "Yes" and "No".
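A sketch of such a template is shown below. The wording is hypothetical, but it reflects the three scenarios the commits mention: the event has already happened; it has not happened yet but will before market closing; or it will happen only after market closing, in which case the probability within the remaining time is low.

```python
# Hypothetical prompt template; the tool's actual prompt is more detailed,
# but it distinguishes the same three scenarios.
PREDICTION_PROMPT_TEMPLATE = """\
You estimate the probability that an event occurs before a market's
closing date. The outcome is binary: "Yes" or "No". Consider three
scenarios and make a reasonable estimation:
1. The event has already happened.
2. The event has not happened yet, but will happen before market closing.
3. The event will happen, but only after market closing; then the
   probability of it occurring within the remaining time is low.

USER_PROMPT: {user_prompt}
ADDITIONAL_INFORMATION: {additional_information}
"""


def build_prediction_prompt(user_prompt: str, additional_information: str) -> str:
    """Fill the template with the request and the scraped information."""
    return PREDICTION_PROMPT_TEMPLATE.format(
        user_prompt=user_prompt,
        additional_information=additional_information,
    )
```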
Automatically cap additional information length
The tool sums the tokens in the prediction prompt template, the received prediction request, and the tokens reserved for the OpenAI chat-completion response, then divides the remaining token budget by the number of URLs that are scraped. This yields a dynamic token cap for each website's extracted information and prevents model errors caused by exceeding the maximum allowed number of tokens.
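The arithmetic behind this cap is straightforward and can be sketched as follows. The parameter names are illustrative; in practice the token counts would be measured with a tokenizer such as tiktoken:

```python
def tokens_per_url(model_limit: int, prompt_template_tokens: int,
                   request_tokens: int, completion_tokens: int,
                   num_urls: int) -> int:
    """Split the remaining token budget evenly across the scraped URLs.

    model_limit: the model's maximum context size in tokens.
    completion_tokens: tokens reserved for the chat-completion response.
    """
    remaining = (
        model_limit
        - prompt_template_tokens
        - request_tokens
        - completion_tokens
    )
    # Never return a negative cap if the fixed parts already exceed the limit.
    return max(remaining // num_urls, 0)
```

For example, with a 4096-token limit, 500 template tokens, 100 request tokens, 500 reserved completion tokens, and 3 URLs, each URL's extract is capped at 998 tokens.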
Performance and Limitations
The tool has only been tested manually with limited resources. Here are some qualitative observations comparing the performance of the GPT-4 and GPT-3.5-turbo models. A quantitative data analysis has not been conducted yet, so the observations are not representative of production use.
GPT-4
GPT-3.5-turbo