# How to extract particular data from text files

In [20]:
from transformers import pipeline
from datasets import load_dataset
import pandas as pd

In [30]:

# Define your question and context (text passage)
question = "What is the capital of France?"
context = """France is a country located in Western Europe. 
             The capital of France is Paris. It is a major global city 
             and a center for finance, diplomacy, and culture."""

# Load the QA pipeline with a pre-trained model
qa_pipeline = pipeline("question-answering", model="Intel/dynamic_tinybert")

# Perform the QA task
answer = qa_pipeline({"question": question, "context": context})

# Print the answer
print(f"Answer: {answer['answer']}")
print(f"Confidence score: {answer['score']}")

Answer: Paris
Confidence score: 0.9864407777786255


In [25]:
dataset_models = load_dataset("librarian-bots/model_cards_with_metadata")['train']

In [26]:
context = dataset_models["card"][0]
print(context)

---
language: 
  - es
library_name: pysentimiento

tags:
  - twitter
  - sentiment-analysis

---
# Sentiment Analysis in Spanish
## robertuito-sentiment-analysis

Repository: [https://github.com/pysentimiento/pysentimiento/](https://github.com/finiteautomata/pysentimiento/)


Model trained with TASS 2020 corpus (around ~5k tweets) of several dialects of Spanish. Base model is [RoBERTuito](https://github.com/pysentimiento/robertuito), a RoBERTa model trained in Spanish tweets.

Uses `POS`, `NEG`, `NEU` labels.

## Usage

Use it directly with [pysentimiento](https://github.com/pysentimiento/pysentimiento)

```python
from pysentimiento import create_analyzer
analyzer = create_analyzer(task="sentiment", lang="es")

analyzer.predict("Qué gran jugador es Messi")
# returns AnalyzerOutput(output=POS, probas={POS: 0.998, NEG: 0.002, NEU: 0.000})
```


## Results

Results for the four tasks evaluated in `pysentimiento`. Results are expressed as Macro F1 scores


| model         | emotion       |
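
The card printed above opens with a metadata block fenced by `---` lines, followed by the prose that the QA model actually reads. A minimal sketch of separating the two is given here, assuming the block starts on the first line and ends at the next line that is exactly `---`. The helper name `split_front_matter` and the shortened sample card are illustrative, not part of the original notebook.

```python
def split_front_matter(card: str) -> tuple[str, str]:
    """Split a model card into (front_matter, body).

    Assumption: the metadata block, if present, starts on the first line
    with '---' and ends at the next line that is exactly '---'.
    """
    lines = card.splitlines()
    if not lines or lines[0].strip() != "---":
        return "", card
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            front = "\n".join(lines[1:i])
            body = "\n".join(lines[i + 1:])
            return front, body
    # No closing delimiter found: treat the whole text as body.
    return "", card


# Shortened, illustrative version of the card shown above.
sample_card = """---
language:
  - es
library_name: pysentimiento
---
# Sentiment Analysis in Spanish
Model trained with TASS 2020 corpus."""

front_matter, body = split_front_matter(sample_card)
print(front_matter)
print(body)
```

Passing only `body` as the context keeps the metadata keys out of the text the model searches, while `front_matter` can be inspected separately.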

In [23]:
question = 'What datasets was the model trained on?'

# Perform the QA task using the pipeline
predicted_answer = qa_pipeline({"question": question, "context": context})

# Print the predicted answer with its confidence score
print(f"Predicted Answer: {predicted_answer['answer']}")
print(f"Confidence score: {predicted_answer['score']}")

Predicted Answer: finance, diplomacy, and culture
Confidence score: 0.4139981269836426


In [23]:
question = 'Who are the authors of the model?'

# Perform the QA task using the pipeline
predicted_answer = qa_pipeline({"question": question, "context": context})

# Print the predicted answer with its confidence score
print(f"Predicted Answer: {predicted_answer['answer']}")
print(f"Confidence score: {predicted_answer['score']}")

Predicted Answer: Juan Manuel Pérez and Juan Carlos Giudici and Franco Luque
Confidence score: 0.861379861831665
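
The two runs above return very different confidence scores (about 0.41 and 0.86), so it may help to discard low-scoring answers before storing them. The sketch below is one way to do that. The helper `accept_answer` and the `0.5` cut-off are assumptions chosen for illustration, not values taken from the notebook.

```python
from typing import Optional


def accept_answer(result: dict, min_score: float = 0.5) -> Optional[str]:
    """Return the answer text if its score reaches min_score, else None."""
    if result["score"] >= min_score:
        return result["answer"]
    return None


# The two results printed above, copied as plain dicts.
low = {"answer": "finance, diplomacy, and culture",
       "score": 0.4139981269836426}
high = {"answer": "Juan Manuel Pérez and Juan Carlos Giudici and Franco Luque",
        "score": 0.861379861831665}

print(accept_answer(low))   # None: score is below the 0.5 cut-off
print(accept_answer(high))  # the author names
```

A suitable threshold depends on the model and the question, so it is worth tuning on a handful of cards with known answers.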


## Creating a processing pipeline

Process the data from the HF metadata dataset.

* Get the questions we want to answer from the Hugging Face metadata.

In [14]:
questions = pd.read_csv("questions.tsv",sep='\t').values.tolist()
for q in questions:
    print(q)

['What is the name of the model?']
['Who created/uploaded the model?']
['When was the model created/uploaded?']
['What tasks can the model solve?']
['What datasets was the model trained on?']
['What was the split distribution for the datasets?']
['What datasets were used to finetune the model?']
['What datasets were used to retrain the model?']
['What model is used as the base model?']
['What evaluation metrics were used?']
['What were the values of the evaluation metrics?']
['What hyperparameters were optimized during the training process?']
['What hyperparameters were selected?']
['What publications are related to the model?']
['Where is the model deployed?']
['What use license is associated with the model?']


In [9]:
dataset = load_dataset("librarian-bots/model_cards_with_metadata")['train']
HF_df = dataset.to_pandas()
print(HF_df.head())

                                            modelId          author  \
0       pysentimiento/robertuito-sentiment-analysis   pysentimiento   
1  cardiffnlp/twitter-roberta-base-sentiment-latest      cardiffnlp   
2     jonatasgrosman/wav2vec2-large-xlsr-53-english  jonatasgrosman   
3                     openai/clip-vit-large-patch14          openai   
4                     google-bert/bert-base-uncased     google-bert   

              last_modified  downloads  likes   library_name  \
0 2024-02-27 20:46:41+00:00  164331420     49  pysentimiento   
1 2023-05-28 05:45:10+00:00  132882703    345   transformers   
2 2023-03-25 10:56:55+00:00   56836440    391   transformers   
3 2023-09-15 15:49:35+00:00   44628976   1036   transformers   
4 2024-02-19 11:06:12+00:00   38927964   1440   transformers   

                                                tags  \
0  [pysentimiento, pytorch, tf, safetensors, robe...   
1  [transformers, pytorch, tf, roberta, text-clas...   
2  [transformers, py

* Get tag information to extract data and answer some of the questions 

In [10]:
# Count the number of unique tags
all_tags_set = set()
for tags in HF_df['tags']:
    all_tags_set.update(tags)
print(len(all_tags_set))

40585
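
With tens of thousands of distinct tags, it can also be useful to know how often each one occurs, not just how many there are. A small sketch with `collections.Counter` follows. The helper `count_tags` and the three sample tag lists are made up for illustration.

```python
from collections import Counter


def count_tags(tag_lists) -> Counter:
    """Count how many models carry each tag."""
    counts = Counter()
    for tags in tag_lists:
        # set() so a tag repeated within one model is counted once
        counts.update(set(tags))
    return counts


# Illustrative sample; not taken from the dataset.
sample = [
    ["transformers", "pytorch", "en"],
    ["transformers", "tf"],
    ["pysentimiento", "pytorch"],
]
tag_counts = count_tags(sample)
print(len(tag_counts))            # number of unique tags
print(tag_counts.most_common(2))  # the two most frequent tags
```

On the real data, the same function could be called as `count_tags(HF_df["tags"])`.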


From the tags we can get the following information:
* The task they are performing
* The libraries that were used to train the model
* The datasets used to train the model
* The languages that the model can handle
* The licenses under which the model can be used
* Other types of tags
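
The categories listed above can be sketched as a single lookup function that sorts one model's tags into groups. This is a rough sketch under two assumptions that should be checked against the data: that license and dataset tags carry `license:` and `dataset:` prefixes, and that display names such as "Text Classification" correspond to raw tags such as `text-classification` after lowercasing and replacing spaces with hyphens. The function name `categorize_tags` and the example tags are illustrative.

```python
def categorize_tags(tags, task_tags, library_tags, language_tags):
    """Group one model's tags into the categories listed above."""
    def norm(name):
        # Assumption: display names such as "Text Classification" match
        # raw tags such as "text-classification" after this normalization.
        return name.lower().replace(" ", "-")

    tasks = {norm(t) for t in task_tags}
    libraries = {norm(t) for t in library_tags}
    groups = {"task": [], "library": [], "dataset": [],
              "language": [], "license": [], "other": []}
    for tag in tags:
        if tag.startswith("license:"):      # assumed prefix convention
            groups["license"].append(tag.split(":", 1)[1])
        elif tag.startswith("dataset:"):    # assumed prefix convention
            groups["dataset"].append(tag.split(":", 1)[1])
        elif norm(tag) in tasks:
            groups["task"].append(tag)
        elif norm(tag) in libraries:
            groups["library"].append(tag)
        elif tag in language_tags:
            groups["language"].append(tag)
        else:
            groups["other"].append(tag)
    return groups


# Illustrative tags and lookup sets; the real sets come from the TSV files.
example = categorize_tags(
    ["transformers", "pytorch", "text-classification", "en",
     "license:mit", "dataset:tweet_eval", "autotrain_compatible"],
    task_tags={"Text Classification", "Question Answering"},
    library_tags={"Transformers", "PyTorch"},
    language_tags={"en", "es"},
)
print(example)
```

Checking the prefixed tags first avoids misreading a dataset or license name that happens to match a language code or library name.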


In [38]:
# Get the other tags
tags_other = [val[0] for val in pd.read_csv("tags_other.tsv",sep='\t').values.tolist()]
tags_other = set(tags_other)
for tag in tags_other:
    print(tag)

tags
custom_code
Merge
Inference Endpoints
Carbon Emissions
Eval Results
text-generation-inference
8-bit precision
4-bit precision
AutoTrain Compatible
Has a Space
Mixture of Experts


In [44]:
# Get the language tags
tags_language = [val[0] for val in pd.read_csv("tags_language.tsv",sep='\t').values.tolist()]
tags_language = set(tags_language)
for tag in tags_language:
    print(tag)

as
tr
bs
ja
fi
ar
sn
be
mk
te
mr
id
tt
mg
cs
tl
en
kn
th
ba
hi
ru
bn
ha
bo
is
uk
az
ta
eu
ro
da
vi
am
hu
ko
ln
it
ps
fo
km
ms
ka
lt
es
br
ne
uz
cy
sv
sr
he
haw
sl
de
jw
mi
sw
pl
la
yue
pa
kk
sd
hr
el
sk
su
nl
mt
ca
si
ml
gu
sa
ur
mn
yi
tg
lv
et
so
af
gl
fr
oc
ht
pt
fa
my
zh
nn
tk
lb
bg
yo
lo
sq
hy


In [46]:
# Get the library tags
tags_libraries = [val[0] for val in pd.read_csv("tags_libraries.tsv",sep='\t').values.tolist()]
tags_libraries = set(tags_libraries)
for tag in tags_libraries:
    print(tag)

Fairseq
PEFT
setfit
Adapters
timm
MLX
SpanMarker
Rust
PyTorch
TensorFlow
ml-agents
Keras
Flair
PaddlePaddle
NeMo
Safetensors
Stanza
OpenCLIP
Habana
Graphcore
fastText
Diffusers
JAX
TensorBoard
Transformers.js
ESPnet
pythae
pyannote.audio
stable-baselines3
BERTopic
Scikit-learn
sample-factory
Core ML
Joblib
AllenNLP
fastai
ONNX
speechbrain
OpenVINO
paddlenlp
TF Lite
Transformers
spaCy
sentence-transformers
Asteroid
unity-sentis
GGUF


In [47]:
# Get the task tags
tags_task = [val[0] for val in pd.read_csv("tags_task.tsv",sep='\t').values.tolist()]
tags_task = set(tags_task)
for tag in tags_task:
    print(tag)

Table Question Answering
Image Feature Extraction
Text-to-Speech
Document Question Answering
Image-to-Text
Text-to-Image
Sentence Similarity
Image-to-Video
Translation
Graph Machine Learning
Image Segmentation
Audio Classification
Object Detection
Text2Text Generation
Unconditional Image Generation
Depth Estimation
Image-to-3D
Summarization
Zero-Shot Object Detection
Text-to-Video
Audio-to-Audio
Token Classification
Reinforcement Learning
Automatic Speech Recognition
Tabular Classification
Text Generation
Text-to-3D
Image-Text-to-Text
Mask Generation
Image Classification
Zero-Shot Classification
Fill-Mask
Robotics
Zero-Shot Image Classification
Tabular Regression
Text Classification
Visual Question Answering
Question Answering
Feature Extraction
Video Classification
Voice Activity Detection
Text-to-Audio


In [39]:
# Get the dataset tags and preview five of them
from itertools import islice

dataset_datasets = load_dataset("librarian-bots/dataset_cards_with_metadata")['train']
tags_dataset = set(dataset_datasets["datasetId"])
for tag in islice(tags_dataset, 5):
    print(tag)

itinerai/us_places
Qqcf16426/mangaupdates
CATIE-AQ/bisect_fr_prompt_textual_merging
CyberHarem/sheffield_kantaicollection
LexiconShiftInnovations/SinhalaWikipediaArticles


* Now let's set up the questions we want the model to answer