GitHub - touhi99/tags-generation

Branches Tags

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
.gitignore		.gitignore
LICENSE.txt		LICENSE.txt
README.md		README.md
automatic-tagger.ipynb		automatic-tagger.ipynb
load_data.py		load_data.py
requirements.txt		requirements.txt

Repository files navigation

Auto-tagger

This is a simple prototype for automatically tag generator for programming tasks on publicly available datasets. In this repository, it is approached to showcase a tag generator with an open-sourced LLM (Mistral-7b) with prompt engineering. The case can be extended with better accuracy by fine-tuning the model with given data.

Requirements

To run the repository, Python => 3.9 is needed. The required packages needed to run the notebook can be installed with the following command:

pip install -r requirements.txt

Data

Data is used from https://www.kaggle.com/datasets/stackoverflow/pythonquestions

Usage

The repository contains a notebook that can be run with Jupyter Notebook or Jupyter Lab. The notebook contains all the steps needed to run the tag generator. The notebook expects the data folder to contain the questions and tag dataset. The notebook can be run with the following command from the terminal:

jupyter notebook

or

Jupyter lab

then run the automatic-tagger.ipynb notebook.

Approaches & Observations

Here the notebook performs the following steps:

Data is loaded, preprocessed (removing HTML tags, special characters, etc.), and concatenated (body, title)
Some data analyses are explored to understand tag, frequency, and the nature of how to proceed with the problem
For the small case scenario, a Large language model type open-sourced model is chosen to generate tags based on the prompt/few-shot prompt.
- An LLM model is selected (mistral-7b-instruct-quantized gguf version) for the case of a small prototype and fine-tuning with a smaller scale
- The model is loaded and prompt engineering is applied to generate tags for a subset of data
- Generated tags are then post-processed to remove other characters and compared with the actual tags The limitation of this approach is that it is truly a prototype and not specialized for the task/data. Context window limited. Model selection and quality of the model is important for the case of production/usable case. Fine-tune or better model selection can be done to improve the performance.
Fine-tuning is not done here but this model and approach can be fine-tuned with the given data to improve the performance

Results

Results are not great with base prompt and enhanced prompt. It showed an accuracy of around 0.2 and 0.33 on a small subset of data with zero-shot learning. The results can be improved by fine-tuning the model with the given data.

Alternate possible approach

Fine-tune LLM / Try with GPT4/similar model

When a case of true prototyping is needed, prompt engineering with GPT-4 can give some glimpse of possible approaches for quickly setting up
In case of the availability of sufficient data and resources, fine-tuning an available LLM model, can boost performance for task-specific cases. For example, here the model can be fine-tuned with the given data to generate tags for the given task.
Limitation: The cost of API / hosted model can be high

Fine-tune the NLP model as a supervised multi-label classification problem

If the tags are fixed or no new tags are expected and minimal frequent tags are negligible, then the problem can be tackled with a traditional NLP approach too with a BERT/DistilBERT model where the task will be classifying multiple labels for each text. Data needs to be split to train/test/Val and the model needs to fine-tune given training data and then can be tested to evaluate
Limitation: Manual effort for fine-tuning and may only be limited to a fixed set of tags

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

.gitignore

.gitignore

LICENSE.txt

LICENSE.txt

README.md

README.md

automatic-tagger.ipynb

automatic-tagger.ipynb

load_data.py

load_data.py

requirements.txt

requirements.txt

Repository files navigation

Auto-tagger

Requirements

Data

Usage

Approaches & Observations

Results

Alternate possible approach

About

Releases

Packages

Languages

License

touhi99/tags-generation

Folders and files

Latest commit

History

Repository files navigation

Auto-tagger

Requirements

Data

Usage

Approaches & Observations

Results

Alternate possible approach

About

Resources

License

Stars

Watchers

Forks

Languages