## Testing models to extract specific data from text files

In [103]:
from transformers import pipeline
from datasets import load_dataset
import pandas as pd

In [104]:

# Define your question and context (text passage)
question = "What is the capital of France?"
context = """France is a country located in Western Europe. 
             The capital of France is Paris. It is a major global city 
             and a center for finance, diplomacy, and culture."""

# Load the QA pipeline with a pre-trained model
qa_pipeline = pipeline("question-answering", model="Intel/dynamic_tinybert")

# Perform the QA task
answer = qa_pipeline({"question": question, "context": context})

# Print the answer
print(f"Answer: {answer['answer']}")
print(f"Confidence score: {answer['score']}")

Answer: Paris
Confidence score: 0.9864407777786255


In [105]:
dataset_models = load_dataset("librarian-bots/model_cards_with_metadata")['train']


In [106]:
context = dataset_models["card"][0]
print(context)

---
language: 
  - es
library_name: pysentimiento

tags:
  - twitter
  - sentiment-analysis

---
# Sentiment Analysis in Spanish
## robertuito-sentiment-analysis

Repository: [https://github.com/pysentimiento/pysentimiento/](https://github.com/finiteautomata/pysentimiento/)


Model trained with TASS 2020 corpus (around ~5k tweets) of several dialects of Spanish. Base model is [RoBERTuito](https://github.com/pysentimiento/robertuito), a RoBERTa model trained in Spanish tweets.

Uses `POS`, `NEG`, `NEU` labels.

## Usage

Use it directly with [pysentimiento](https://github.com/pysentimiento/pysentimiento)

```python
from pysentimiento import create_analyzer
analyzer = create_analyzer(task="sentiment", lang="es")

analyzer.predict("Qué gran jugador es Messi")
# returns AnalyzerOutput(output=POS, probas={POS: 0.998, NEG: 0.002, NEU: 0.000})
```


## Results

Results for the four tasks evaluated in `pysentimiento`. Results are expressed as Macro F1 scores


| model         | emotion       |

In [107]:
question = 'What datasets was the model trained on?'

# Perform the QA task using the pipeline
predicted_answer = qa_pipeline({"question": question, "context": context})

# Print the predicted answer with its confidence score
print(f"Predicted Answer: {predicted_answer['answer']}")
print(f"Confidence score: {predicted_answer['score']}")

Predicted Answer: TASS 2020 corpus
Confidence score: 0.2816251814365387


In [108]:
question = 'Who are the authors of the model?'

# Perform the QA task using the pipeline
predicted_answer = qa_pipeline({"question": question, "context": context})

# Print the predicted answer with its confidence score
print(f"Predicted Answer: {predicted_answer['answer']}")
print(f"Confidence score: {predicted_answer['score']}")

Predicted Answer: Juan Manuel Pérez and Juan Carlos Giudici and Franco Luque
Confidence score: 0.861379861831665


## Creating a processing pipeline

Process the data from the Hugging Face (HF) model card metadata dataset.

* Load the questions we want to answer using the Hugging Face metadata.

In [149]:
questions = pd.read_csv("questions.tsv",sep='\t').values.tolist()
for q in questions:
    print(q)

['What is the name of the model?']
['What user uploaded the model?']
['When was the model created/uploaded?']
['What tasks can the model solve?']
['What datasets was the model trained on?']
['What was the split distribution for the datasets?']
['What datasets were used to finetune the model?']
['What datasets were used to retrain the model?']
['What model is used as the base model?']
['What evaluation metrics were used?']
['What were the values of the evaluation metrics?']
['What hyperparameters were optimized during the training process?']
['What hyperparameters values were selected?']
['What publications are related to the model?']
['Where is the model deployed?']
['What use license is associated with the model?']
['What languages does the model work with?']
['What software libraries were used for the model to work?']
['Who created the model?']
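
The processing cells in this notebook answer these questions from three sources: metadata columns, tags, and the model card text. A minimal sketch of that routing follows. The index sets are taken from those cells; the names `question_sources` and `source_for` are illustrative and not part of the notebook.

```python
# Illustrative mapping from question index to the source that answers it.
# The index sets mirror the ones used in the processing cells of this notebook.
question_sources = {
    "metadata": {0, 1, 2},              # modelId, author, createdAt columns
    "tags": {3, 4, 13, 15, 16, 17},     # tasks, datasets, arXiv ids, license, languages, libraries
    "card": {5, 8, 9, 10, 11, 14, 18},  # extractive QA over the model card text
}

def source_for(q_idx):
    """Return the name of the source that answers question q_idx, or None."""
    for source, indices in question_sources.items():
        if q_idx in indices:
            return source
    return None

# Questions that no step fills in (their q_id_* columns stay empty)
unanswered = [i for i in range(19) if source_for(i) is None]
print(unanswered)
```

Listing the uncovered indices this way makes it easy to see which questions still need an extraction strategy.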


In [162]:
dataset = load_dataset("librarian-bots/model_cards_with_metadata")['train']
HF_df = dataset.to_pandas()
print(HF_df.head())

                                            modelId          author  \
0       pysentimiento/robertuito-sentiment-analysis   pysentimiento   
1  cardiffnlp/twitter-roberta-base-sentiment-latest      cardiffnlp   
2     jonatasgrosman/wav2vec2-large-xlsr-53-english  jonatasgrosman   
3                     openai/clip-vit-large-patch14          openai   
4                     google-bert/bert-base-uncased     google-bert   

              last_modified  downloads  likes   library_name  \
0 2024-02-27 20:46:41+00:00  127797475     50  pysentimiento   
1 2023-05-28 05:45:10+00:00  107275926    356   transformers   
2 2023-03-25 10:56:55+00:00   54407155    394   transformers   
3 2023-09-15 15:49:35+00:00   48829033   1050   transformers   
4 2024-02-19 11:06:12+00:00   41541158   1454   transformers   

                                                tags  \
0  [pysentimiento, pytorch, tf, safetensors, robe...   
1  [transformers, pytorch, tf, roberta, text-clas...   
2  [transformers, py

* Use the tag information to answer some of the questions.

In [163]:
#Counting the number of unique tags
# all_tags_set = set()
# for index, row in HF_df.iterrows():
#     tags = row['tags']
#     for tag in tags:
#         all_tags_set.add(tag)
# print(len(all_tags_set))
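
The commented-out loop above can also be expressed as a set union. The sketch below runs on a small, hand-written sample (`rows_of_tags` is illustrative; on the real data the input would be the `tags` column of `HF_df`) and also counts how often each tag occurs.

```python
from collections import Counter

# Illustrative sample: a few tags per model
rows_of_tags = [
    ["pysentimiento", "pytorch", "tf", "es"],
    ["transformers", "pytorch", "tf", "en"],
    ["transformers", "pytorch", "jax", "en"],
]

# Number of unique tags across all rows
all_tags_set = set().union(*rows_of_tags)
print(len(all_tags_set))

# Frequency of each tag, useful for spotting the most common ones
tag_counts = Counter(tag for tags in rows_of_tags for tag in tags)
print(tag_counts.most_common(3))
```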

From the tags we can get the following information:
* The task the model performs
* The libraries that were used to train the model
* The datasets used to train the model
* The languages that the model can handle
* The licenses under which the model can be used
* Other types of tags
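
The categories listed above can be told apart either by a prefix (`dataset:`, `arxiv:`, `license:`) or by membership in a known vocabulary. A minimal sketch under simplifying assumptions: the `sample_*` sets are tiny illustrative stand-ins for the full tag vocabularies, `categorize_tag` is a hypothetical helper, and each tag is assigned only to the first category that matches.

```python
# Illustrative stand-ins for the full task, library and language vocabularies
sample_tasks = {"text classification", "fill mask"}
sample_libraries = {"transformers", "pytorch", "safetensors"}
sample_languages = {"en", "es"}

def categorize_tag(tag):
    """Return a (category, value) pair for a single tag string."""
    # Prefixed tags carry their category in the tag itself
    for prefix in ("dataset:", "arxiv:", "license:"):
        if tag.startswith(prefix):
            return prefix.rstrip(":"), tag[len(prefix):]
    # Task tags use hyphens, while the task vocabulary uses spaces
    task = tag.replace("-", " ")
    if task in sample_tasks:
        return "task", task
    if tag in sample_libraries:
        return "library", tag
    if tag in sample_languages:
        return "language", tag
    return "other", tag

for tag in ["text-classification", "en", "dataset:tweet_eval",
            "license:apache-2.0", "region:us"]:
    print(categorize_tag(tag))
```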


In [164]:
#Getting the language tags
tags_language = [val[0].lower() for val in pd.read_csv("tags_language.tsv",sep='\t').values.tolist()]
tags_language = set(tags_language)
i = 0
for tag in tags_language:
    if i == 5:
        break
    i+=1
    print(tag)

am
tr
yi
sv
fi


In [165]:
#Getting the library tags
tags_libraries = [val[0].lower() for val in pd.read_csv("tags_libraries.tsv",sep='\t').values.tolist()]
tags_libraries = set(tags_libraries)
for tag in tags_libraries:
    print(tag)

core ml
asteroid
unity-sentis
nemo
spanmarker
peft
sample-factory
gguf
ml-agents
timm
espnet
joblib
flair
pyannote.audio
bertopic
transformers.js
setfit
sentence-transformers
fasttext
openclip
speechbrain
diffusers
transformers
rust
tensorboard
tensorflow
paddlepaddle
openvino
habana
mlx
scikit-learn
adapters
graphcore
paddlenlp
fastai
tf lite
stanza
keras
jax
stable-baselines3
pythae
spacy
onnx
safetensors
fairseq
allennlp
pytorch


In [166]:
#Getting the other tags
tags_other = [val[0].lower() for val in pd.read_csv("tags_other.tsv",sep='\t').values.tolist()]
tags_other = set(tags_other)
for tag in tags_other:
    print(tag)

carbon emissions
tags
eval results
mixture of experts
8-bit precision
inference endpoints
autotrain compatible
custom_code
merge
4-bit precision
text-generation-inference
has a space


In [197]:
#Getting the task tags
tags_task = [val[0].lower().replace('-'," ") for val in pd.read_csv("tags_task.tsv",sep='\t').values.tolist()]
tags_task = set(tags_task)
for tag in tags_task:
    print(tag)

text to audio
visual question answering
unconditional image generation
zero shot object detection
feature extraction
mask generation
depth estimation
image to 3d
text2text generation
text classification
voice activity detection
video classification
image feature extraction
reinforcement learning
zero shot classification
image classification
object detection
robotics
translation
image to text
sentence similarity
text to speech
table question answering
image text to text
audio to audio
audio classification
document question answering
text to 3d
text to video
automatic speech recognition
text generation
image to video
token classification
text to image
tabular regression
fill mask
image segmentation
question answering
tabular classification
graph machine learning
summarization
zero shot image classification


In [198]:
#Getting the dataset tags
dataset_datasets = load_dataset("librarian-bots/dataset_cards_with_metadata")['train']
tags_dataset = set(dataset_datasets["datasetId"])
i = 5
for tag in tags_dataset:
    if(i==0):
        break
    print(tag)
    i-=1

ineoApp/factures
mattwiner/autotrain-data-baseball-jersey-detection
MThonar/link
misshimichka/flower_faces_dataset_v2
andyflinn/heidiland


* Now let's set up the questions we want the model to answer

In [199]:
# Function to perform QA for a question-context pair
def answer_question(question, context):
  answer = qa_pipeline({"question": question, "context": context})
  print("Question:", question)
  print("Answer:", answer, "\n")  # "\n" adds a blank line after each answer
  return answer['answer']

In [200]:
#Create new columns in the dataframe
HF_df_small = HF_df.iloc[0:10] 

new_columns = {}

for idx in range(len(questions)):
    q_id = "q_id_"+str(idx)
    new_columns[q_id] = [None for _ in range(len(HF_df_small))]

HF_df_small = HF_df_small.assign(**new_columns)

print(HF_df_small.head())

                                            modelId          author  \
0       pysentimiento/robertuito-sentiment-analysis   pysentimiento   
1  cardiffnlp/twitter-roberta-base-sentiment-latest      cardiffnlp   
2     jonatasgrosman/wav2vec2-large-xlsr-53-english  jonatasgrosman   
3                     openai/clip-vit-large-patch14          openai   
4                     google-bert/bert-base-uncased     google-bert   

              last_modified  downloads  likes   library_name  \
0 2024-02-27 20:46:41+00:00  127797475     50  pysentimiento   
1 2023-05-28 05:45:10+00:00  107275926    356   transformers   
2 2023-03-25 10:56:55+00:00   54407155    394   transformers   
3 2023-09-15 15:49:35+00:00   48829033   1050   transformers   
4 2024-02-19 11:06:12+00:00   41541158   1454   transformers   

                                                tags  \
0  [pysentimiento, pytorch, tf, safetensors, robe...   
1  [transformers, pytorch, tf, roberta, text-clas...   
2  [transformers, py

In [201]:
#Process basic default information
HF_df_small.loc[:, "q_id_0"] = HF_df_small["modelId"]
HF_df_small.loc[:, "q_id_1"] = HF_df_small["author"]
HF_df_small.loc[:, "q_id_2"] = HF_df_small["createdAt"]

In [202]:
print(HF_df_small[['q_id_0', 'q_id_1', 'q_id_2']])

                                              q_id_0                 q_id_1  \
0        pysentimiento/robertuito-sentiment-analysis          pysentimiento   
1   cardiffnlp/twitter-roberta-base-sentiment-latest             cardiffnlp   
2      jonatasgrosman/wav2vec2-large-xlsr-53-english         jonatasgrosman   
3                      openai/clip-vit-large-patch14                 openai   
4                      google-bert/bert-base-uncased            google-bert   
5   CAMeL-Lab/bert-base-arabic-camelbert-mix-pos-egy              CAMeL-Lab   
6                      tohoku-nlp/bert-base-japanese             tohoku-nlp   
7             sentence-transformers/all-MiniLM-L6-v2  sentence-transformers   
8  mrm8488/distilroberta-finetuned-financial-news...                mrm8488   
9                 distilbert/distilbert-base-uncased             distilbert   

                      q_id_2  
0  2022-03-02 23:29:05+00:00  
1  2022-03-15 01:21:58+00:00  
2  2022-03-02 23:29:05+00:00  
3  202

In [204]:
# Process tags

def append_answer(index, q_id, value):
    """Append value to the list in column q_id at row index, creating the list if needed."""
    current = HF_df_small.at[index, q_id]
    if current is None:
        current = []
    current.append(value)
    # .at writes to a single cell and avoids chained assignment
    HF_df_small.at[index, q_id] = current

# Tag prefixes that answer questions 4 (datasets), 13 (publications) and 15 (license)
prefix_to_q_id = {"dataset:": "q_id_4", "arxiv:": "q_id_13", "license:": "q_id_15"}

for index, row in HF_df_small.iterrows():
    for tag in row['tags']:
        print(index, tag)
        # Check question 3 (tasks)
        tag_for_q3 = tag.replace('-', " ")
        if tag_for_q3 in tags_task:
            append_answer(index, "q_id_3", tag_for_q3)
        # Check questions 4, 13 and 15 (prefixed tags)
        for prefix, q_id in prefix_to_q_id.items():
            if prefix in tag:
                append_answer(index, q_id, tag.replace(prefix, ""))
        # Check question 16 (languages)
        if tag in tags_language:
            append_answer(index, "q_id_16", tag)
        # Check question 17 (libraries)
        if tag in tags_libraries:
            append_answer(index, "q_id_17", tag)
    # Check question 3 again, using the pipeline tag
    if row['pipeline_tag'] is not None:
        tag_for_q3 = row['pipeline_tag'].replace("-", " ")
        current = HF_df_small.at[index, "q_id_3"]
        if current is None or tag_for_q3 not in current:
            append_answer(index, "q_id_3", tag_for_q3)

0 pysentimiento
0 pytorch
0 tf
0 safetensors
0 roberta
0 twitter
0 sentiment-analysis
0 es
0 arxiv:2106.09462
0 has_space
0 region:us
1 transformers
1 pytorch
1 tf
1 roberta
1 text-classification
1 en
1 dataset:tweet_eval
1 arxiv:2202.03829
1 autotrain_compatible
1 endpoints_compatible
1 has_space
1 region:us
2 transformers
2 pytorch
2 jax
2 safetensors
2 wav2vec2
2 automatic-speech-recognition
2 audio
2 en
2 hf-asr-leaderboard
2 mozilla-foundation/common_voice_6_0
2 robust-speech-event
2 speech
2 xlsr-fine-tuning-week
2 dataset:common_voice
2 dataset:mozilla-foundation/common_voice_6_0
2 license:apache-2.0
2 model-index
2 endpoints_compatible
2 has_space
2 region:us
3 transformers
3 pytorch
3 tf
3 jax
3 safetensors
3 clip
3 zero-shot-image-classification
3 vision
3 arxiv:2103.00020
3 arxiv:1908.04913
3 endpoints_compatible
3 has_space
3 region:us
4 transformers
4 pytorch
4 tf
4 jax
4 rust
4 coreml
4 onnx
4 safetensors
4 bert
4 fill-mask
4 exbert
4 en
4 dataset:bookcorpus
4 dataset:wik


In [205]:
print(HF_df_small[['q_id_3', 'q_id_4','q_id_13','q_id_15','q_id_16','q_id_17']])

                                      q_id_3  \
0                                       None   
1                      [text classification]   
2             [automatic speech recognition]   
3           [zero shot image classification]   
4                                [fill mask]   
5                     [token classification]   
6                                [fill mask]   
7  [feature extraction, sentence similarity]   
8                      [text classification]   
9                                [fill mask]   

                                              q_id_4  \
0                                               None   
1                                       [tweet_eval]   
2  [common_voice, mozilla-foundation/common_voice...   
3                                               None   
4                            [bookcorpus, wikipedia]   
5                                               None   
6                                        [wikipedia]   
7  [s2orc, flax-sentenc

In [211]:
# Process card
# device=0 runs the pipeline on the first GPU; use device=-1 to run on CPU
qa_pipeline = pipeline("question-answering", model="Intel/dynamic_tinybert", device=0)

# Indices of the questions that are answered from the model card text
questions_to_process = {5, 8, 9, 10, 11, 14, 18}

# Process each row of the DataFrame
for index, row in HF_df_small.iterrows():
  context = row["card"]  # The "card" column holds the model card text

  # Collect the answers for this row, keyed by question column name
  answers = {}
  for q_cnt, question in enumerate(questions):
    if q_cnt not in questions_to_process:
      continue
    answers["q_id_" + str(q_cnt)] = answer_question(question[0], context)

  # Write each answer into its question column for this row
  for q_id, answer in answers.items():
    HF_df_small.at[index, q_id] = answer

  print("Answers added for row", index)  # Optional progress indicator

Question: What was the split distribution for the datasets?
Answer: {'score': 1.1775950881087738e-08, 'start': 1277, 'end': 1292, 'answer': '| 0.670 ± 0.006'} /n
Question: What model is used as the base model?
Answer: {'score': 0.4569787085056305, 'start': 440, 'end': 447, 'answer': 'RoBERTa'} /n
Question: What evaluation metrics were used?
Answer: {'score': 0.003498514648526907, 'start': 949, 'end': 964, 'answer': 'Macro F1 scores'} /n
Question: What were the values of the evaluation metrics?
Answer: {'score': 8.506896847393364e-05, 'start': 924, 'end': 964, 'answer': 'Results are expressed as Macro F1 scores'} /n
Question: What hyperparameters were optimized during the training process?
Answer: {'score': 2.9859681944799377e-06, 'start': 3407, 'end': 3425, 'answer': '500 million tweets'} /n
Question: Where is the model deployed?
Answer: {'score': 0.0011761118657886982, 'start': 465, 'end': 479, 'answer': 'Spanish tweets'} /n
Question: Who created the model?
Answer: {'score': 0.0006037


Question: Where is the model deployed?
Answer: {'score': 0.0015904817264527082, 'start': 4104, 'end': 4119, 'answer': 'Dublin, Ireland'} /n
Question: Who created the model?
Answer: {'score': 0.0062449295073747635, 'start': 3819, 'end': 3880, 'answer': 'Francesco  and\n      Neves, Leonardo  and\n      Espinosa Anke'} /n
Answers added for row 1
Question: What was the split distribution for the datasets?
Answer: {'score': 0.2762574851512909, 'start': 1466, 'end': 1482, 'answer': 'Common Voice 6.1'} /n
Question: What model is used as the base model?
Answer: {'score': 0.0021992120891809464, 'start': 1257, 'end': 1264, 'answer': 'XLSR-53'} /n
Question: What evaluation metrics were used?
Answer: {'score': 8.480218821205199e-05, 'start': 991, 'end': 1005, 'answer': 'en\n    metrics'} /n
Question: What were the values of the evaluation metrics?
Answer: {'score': 0.02585451304912567, 'start': 4477, 'end': 4493, 'answer': '## Evaluation\n\n1'} /n
Question: What hyperparameters were optimized dur

In [212]:
print(HF_df_small[['q_id_5', 'q_id_8','q_id_9','q_id_10','q_id_11','q_id_14','q_id_18']])

                                    q_id_5                             q_id_8  \
0                          | 0.670 ± 0.006                            RoBERTa   
1                              t = '@user'                       RoBERTa-base   
2                         Common Voice 6.1                            XLSR-53   
3  ~93% for racial classification and ~63%  ViT-L/14 Transformer architecture   
4                                    [CLS]                              [CLS]   
5                                       12              CAMeLBERT-Mix POS-EGY   
6                      WordPiece algorithm                    BERT base model   
7                               128 tokens              Sentence-Transformers   
8         agreement rate of 5-8 annotators                 RoBERTa-base model   
9                                    [CLS]            distilbert-base-uncased   

                          q_id_9  \
0                Macro F1 scores   
1          System Demonstrations   
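
Many of the scores printed above are close to zero. With its default settings, the question-answering pipeline returns its best-scoring span even when the card does not contain an answer, so low-scoring answers are often noise. One option is to discard answers below a cutoff. A minimal sketch, assuming a hypothetical helper `keep_if_confident` and an arbitrary threshold of 0.1 that would need tuning; the sample dictionaries reuse values from the output above.

```python
def keep_if_confident(result, threshold=0.1):
    """Return the answer text if its score reaches the threshold, else None."""
    if result["score"] >= threshold:
        return result["answer"]
    return None

# Results in the shape returned by the question-answering pipeline
results = [
    {"score": 0.4569787085056305, "start": 440, "end": 447, "answer": "RoBERTa"},
    {"score": 0.0011761118657886982, "start": 465, "end": 479, "answer": "Spanish tweets"},
]

filtered = [keep_if_confident(result) for result in results]
print(filtered)
```

Storing `None` for rejected answers keeps the column layout unchanged while marking the cells that still need another extraction method.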


In [213]:
print(HF_df_small.columns)

Index(['modelId', 'author', 'last_modified', 'downloads', 'likes',
       'library_name', 'tags', 'pipeline_tag', 'createdAt', 'card', 'q_id_0',
       'q_id_1', 'q_id_2', 'q_id_3', 'q_id_4', 'q_id_5', 'q_id_6', 'q_id_7',
       'q_id_8', 'q_id_9', 'q_id_10', 'q_id_11', 'q_id_12', 'q_id_13',
       'q_id_14', 'q_id_15', 'q_id_16', 'q_id_17', 'q_id_18'],
      dtype='object')


In [209]:
#Removing unused columns

HF_df_small_clean = HF_df_small.drop(columns=['modelId', 'author', 'last_modified', 'downloads', 'likes',
       'library_name', 'tags', 'pipeline_tag', 'createdAt', 'card'])

print(HF_df_small_clean.head())

                                             q_id_0          q_id_1  \
0       pysentimiento/robertuito-sentiment-analysis   pysentimiento   
1  cardiffnlp/twitter-roberta-base-sentiment-latest      cardiffnlp   
2     jonatasgrosman/wav2vec2-large-xlsr-53-english  jonatasgrosman   
3                     openai/clip-vit-large-patch14          openai   
4                     google-bert/bert-base-uncased     google-bert   

                      q_id_2                            q_id_3  \
0  2022-03-02 23:29:05+00:00                              None   
1  2022-03-15 01:21:58+00:00             [text classification]   
2  2022-03-02 23:29:05+00:00    [automatic speech recognition]   
3  2022-03-02 23:29:05+00:00  [zero shot image classification]   
4  2022-03-02 23:29:04+00:00                       [fill mask]   

                                              q_id_4  \
0                                               None   
1                                       [tweet_eval]   
2  [comm