# How to extract particular data from txt files. 

In [68]:
from transformers import pipeline
from datasets import load_dataset
import pandas as pd

In [69]:

# Define your question and context (text passage)
question = "What is the capital of France?"
context = """France is a country located in Western Europe. 
             The capital of France is Paris. It is a major global city 
             and a center for finance, diplomacy, and culture."""

# Load the QA pipeline with a pre-trained model
qa_pipeline =  pipeline("question-answering", model="Intel/dynamic_tinybert")

# Perform the QA task
answer = qa_pipeline({"question": question, "context": context})

# Print the answer
print(f"Answer: {answer['answer']}")
print(f"Confidence score: {answer['score']}")

Answer: Paris
Confidence score: 0.9864407777786255


In [70]:
dataset_models = load_dataset("librarian-bots/model_cards_with_metadata")['train']

In [71]:
context = dataset_models["card"][0]
print(context)

---
language: 
  - es
library_name: pysentimiento

tags:
  - twitter
  - sentiment-analysis

---
# Sentiment Analysis in Spanish
## robertuito-sentiment-analysis

Repository: [https://github.com/pysentimiento/pysentimiento/](https://github.com/finiteautomata/pysentimiento/)


Model trained with TASS 2020 corpus (around ~5k tweets) of several dialects of Spanish. Base model is [RoBERTuito](https://github.com/pysentimiento/robertuito), a RoBERTa model trained in Spanish tweets.

Uses `POS`, `NEG`, `NEU` labels.

## Usage

Use it directly with [pysentimiento](https://github.com/pysentimiento/pysentimiento)

```python
from pysentimiento import create_analyzer
analyzer = create_analyzer(task="sentiment", lang="es")

analyzer.predict("Qué gran jugador es Messi")
# returns AnalyzerOutput(output=POS, probas={POS: 0.998, NEG: 0.002, NEU: 0.000})
```


## Results

Results for the four tasks evaluated in `pysentimiento`. Results are expressed as Macro F1 scores


| model         | emotion       |

In [72]:
question = 'What datasets was the model trained on?'

# Perform the QA task using the pipeline
predicted_answer = qa_pipeline({"question": question, "context": context})

# Print original answer and predicted answer with confidence score 
# print(f"Original Answer: {answer['text']}")
print(f"Predicted Answer: {predicted_answer['answer']}")
print(f"Confidence score: {predicted_answer['score']}")

Predicted Answer: TASS 2020 corpus
Confidence score: 0.2816251814365387


In [73]:
question = 'Who are the authors of the model?'

# Perform the QA task using the pipeline
predicted_answer = qa_pipeline({"question": question, "context": context})

# Print original answer and predicted answer with confidence score 
# print(f"Original Answer: {answer['text']}")
print(f"Predicted Answer: {predicted_answer['answer']}")
print(f"Confidence score: {predicted_answer['score']}")

Predicted Answer: Juan Manuel Pérez and Juan Carlos Giudici and Franco Luque
Confidence score: 0.861379861831665


## Creating a processing pipeline

Process the data from the HF metadata dataset.

* Get the questions we want to answer from Huggingface metadata.

In [74]:
questions = pd.read_csv("questions.tsv",sep='\t').values.tolist()
for q in questions:
    print(q)

['What is the name of the model?']
['Who created/uploaded the model?']
['When was the model created/uploaded?']
['What tasks can the model solve?']
['What datasets was the model trained on?']
['What was the split distribution for the datasets?']
['What datasets were used to finetune the model?']
['What datasets were used to retrain the model?']
['What model is used as the base model?']
['What evaluation metrics were used?']
['What were the values of the evaluation metrics?']
['What hyperparameters were optimized during the training process?']
['What hyperparameters values were selected?']
['What publications are related to the model?']
['Where is the model deployed?']
['What use license is associated with the model?']
['What languages does the model work with?']
['What software libraries were used for the model to work?']


In [75]:
dataset = load_dataset("librarian-bots/model_cards_with_metadata")['train']
HF_df = dataset.to_pandas()
print(HF_df.head())

                                            modelId          author  \
0       pysentimiento/robertuito-sentiment-analysis   pysentimiento   
1  cardiffnlp/twitter-roberta-base-sentiment-latest      cardiffnlp   
2     jonatasgrosman/wav2vec2-large-xlsr-53-english  jonatasgrosman   
3                     openai/clip-vit-large-patch14          openai   
4                     google-bert/bert-base-uncased     google-bert   

              last_modified  downloads  likes   library_name  \
0 2024-02-27 20:46:41+00:00  164336016     49  pysentimiento   
1 2023-05-28 05:45:10+00:00  132880138    348   transformers   
2 2023-03-25 10:56:55+00:00   57980766    391   transformers   
3 2023-09-15 15:49:35+00:00   45378724   1039   transformers   
4 2024-02-19 11:06:12+00:00   40564287   1443   transformers   

                                                tags  \
0  [pysentimiento, pytorch, tf, safetensors, robe...   
1  [transformers, pytorch, tf, roberta, text-clas...   
2  [transformers, py

* Get tag information to extract data and answer some of the questions 

In [76]:
#Counting the unique number of tags
all_tags_set = set()
for index, row in HF_df.iterrows():
    tags = row['tags']
    for tag in tags:
        all_tags_set.add(tag)
print(len(all_tags_set))

40747


From the tags we can get the following information:
* The task they are performing
* The libraries that were used to train the model
* The datasets used to train the model
* The languages that the model can handle
* The Licenses under which the model can be use. 
* Other type of tags.


In [86]:
#Getting the language tags
tags_language = [val[0] for val in pd.read_csv("tags_language.tsv",sep='\t').values.tolist()]
tags_language = set(tags_language)
i = 0
for tag in tags_language:
    if i == 5:
        break
    i+=1
    print(tag)

am
tr
yi
sv
fi


In [78]:
#Getting the library tags
tags_libraries = [val[0] for val in pd.read_csv("tags_libraries.tsv",sep='\t').values.tolist()]
tags_libraries = set(tags_libraries)
for tag in tags_libraries:
    print(tag)

MLX
unity-sentis
TensorBoard
Graphcore
Rust
sample-factory
AllenNLP
Core ML
JAX
Diffusers
PaddlePaddle
OpenVINO
ml-agents
Keras
timm
ESPnet
NeMo
Scikit-learn
SpanMarker
setfit
fastText
sentence-transformers
speechbrain
BERTopic
Fairseq
Safetensors
Habana
spaCy
TensorFlow
PEFT
Joblib
Transformers
Adapters
OpenCLIP
paddlenlp
fastai
Flair
Stanza
GGUF
PyTorch
stable-baselines3
TF Lite
pythae
Transformers.js
pyannote.audio
Asteroid
ONNX


In [79]:
#Getting the other tags
tags_other = [val[0] for val in pd.read_csv("tags_other.tsv",sep='\t').values.tolist()]
tags_other = set(tags_other)
for tag in tags_other:
    print(tag)

Merge
AutoTrain Compatible
Has a Space
tags
Eval Results
8-bit precision
Inference Endpoints
Carbon Emissions
Mixture of Experts
custom_code
text-generation-inference
4-bit precision


In [80]:
#Getting the task tags
tags_task = [val[0] for val in pd.read_csv("tags_task.tsv",sep='\t').values.tolist()]
tags_task = set(tags_task)
for tag in tags_task:
    print(tag)

Text-to-Audio
Feature Extraction
Depth Estimation
Text-to-3D
Image-to-Video
Robotics
Token Classification
Zero-Shot Classification
Object Detection
Automatic Speech Recognition
Image Segmentation
Text-to-Video
Audio-to-Audio
Question Answering
Image-to-Text
Image-Text-to-Text
Zero-Shot Object Detection
Text-to-Image
Visual Question Answering
Unconditional Image Generation
Zero-Shot Image Classification
Image Classification
Graph Machine Learning
Video Classification
Translation
Image-to-3D
Tabular Regression
Text-to-Speech
Summarization
Mask Generation
Document Question Answering
Text Generation
Fill-Mask
Voice Activity Detection
Text2Text Generation
Table Question Answering
Sentence Similarity
Reinforcement Learning
Text Classification
Audio Classification
Tabular Classification
Image Feature Extraction


In [81]:
#Getting the dataset tags
dataset_datasets = load_dataset("librarian-bots/dataset_cards_with_metadata")['train']
tags_dataset = set(dataset_datasets["datasetId"])
i = 5
for tag in tags_dataset:
    if(i==0):
        break
    print(tag)
    i-=1

ineoApp/factures
mattwiner/autotrain-data-baseball-jersey-detection
MThonar/link
misshimichka/flower_faces_dataset_v2
andyflinn/heidiland


* Now let's set up the questions we want the model to answer

In [82]:
# Function to perform QA for a question-context pair
def answer_question(question, context):
  answer = qa_pipeline({"question": question, "context": context})
  print("Question:", question)
  print("Answer:", answer,"/n")
  return answer['answer']+''

In [95]:
#Create new columns in the dataframe
HF_df_small = HF_df.iloc[0:10] 

new_columns = {}

for idx in range(len(questions)):
    q_id = "q_id_"+str(idx)
    new_columns[q_id] = [None for _ in range(len(HF_df_small))]

HF_df_small = HF_df_small.assign(**new_columns)

print(HF_df_small.head())

                                            modelId          author  \
0       pysentimiento/robertuito-sentiment-analysis   pysentimiento   
1  cardiffnlp/twitter-roberta-base-sentiment-latest      cardiffnlp   
2     jonatasgrosman/wav2vec2-large-xlsr-53-english  jonatasgrosman   
3                     openai/clip-vit-large-patch14          openai   
4                     google-bert/bert-base-uncased     google-bert   

              last_modified  downloads  likes   library_name  \
0 2024-02-27 20:46:41+00:00  164336016     49  pysentimiento   
1 2023-05-28 05:45:10+00:00  132880138    348   transformers   
2 2023-03-25 10:56:55+00:00   57980766    391   transformers   
3 2023-09-15 15:49:35+00:00   45378724   1039   transformers   
4 2024-02-19 11:06:12+00:00   40564287   1443   transformers   

                                                tags  \
0  [pysentimiento, pytorch, tf, safetensors, robe...   
1  [transformers, pytorch, tf, roberta, text-clas...   
2  [transformers, py

In [97]:
# Process tags

for index, row in HF_df_small.iterrows():
    for tag in row['tags']:
        print(index,tag)
        #Check question 3
        if tag in tags_task:
            if(HF_df_small["q_id_3"][index] == None):
                HF_df_small["q_id_3"][index] = []
            HF_df_small["q_id_3"][index].append(tag)
        #Check question 4
        if "dataset:" in tag:
            if(HF_df_small["q_id_4"][index] == None):
                HF_df_small["q_id_4"][index] = []
            HF_df_small["q_id_4"][index].append(tag.replace("dataset:",""))
        #Check question 13
        if "arxiv:" in tag:
            if(HF_df_small["q_id_13"][index] == None):
                HF_df_small["q_id_13"][index] = []
            HF_df_small["q_id_13"][index].append(tag.replace("arxiv:",""))
         #Check question 15
        if "license:" in tag:
            if(HF_df_small["q_id_15"][index] == None):
                HF_df_small["q_id_15"][index] = []
            HF_df_small["q_id_15"][index].append(tag.replace("license:",""))
        #Check question 16
        if tag in tags_language:
            if(HF_df_small["q_id_16"][index] == None):
                HF_df_small["q_id_16"][index] = []
            HF_df_small["q_id_16"][index].append(tag)
        #Check question 17
        if tag in tags_libraries:
            if(HF_df_small["q_id_17"][index] == None):
                HF_df_small["q_id_17"][index] = []
            HF_df_small["q_id_17"][index].append(tag)


0 pysentimiento
0 pytorch
0 tf
0 safetensors
0 roberta
0 twitter
0 sentiment-analysis
0 es
0 arxiv:2106.09462
0 has_space
0 region:us
1 transformers
1 pytorch
1 tf
1 roberta
1 text-classification
1 en
1 dataset:tweet_eval
1 arxiv:2202.03829
1 autotrain_compatible
1 endpoints_compatible
1 has_space
1 region:us
2 transformers
2 pytorch
2 jax
2 safetensors
2 wav2vec2
2 automatic-speech-recognition
2 audio
2 en
2 hf-asr-leaderboard
2 mozilla-foundation/common_voice_6_0
2 robust-speech-event
2 speech
2 xlsr-fine-tuning-week
2 dataset:common_voice
2 dataset:mozilla-foundation/common_voice_6_0
2 license:apache-2.0
2 model-index
2 endpoints_compatible
2 has_space
2 region:us
3 transformers
3 pytorch
3 tf
3 jax
3 safetensors
3 clip
3 zero-shot-image-classification
3 vision
3 arxiv:2103.00020
3 arxiv:1908.04913
3 endpoints_compatible
3 has_space
3 region:us
4 transformers
4 pytorch
4 tf
4 jax
4 rust
4 coreml
4 onnx
4 safetensors
4 bert
4 fill-mask
4 exbert
4 en
4 dataset:bookcorpus
4 dataset:wik

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  HF_df_small["q_id_16"][index] = []
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  HF_df_small["q_id_13"][index] = []
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  HF_df_small["q_id_4"][index] = []
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  HF_df_small["q_id_15"][index] = []
A value is trying to be set o

In [102]:
print(HF_df_small[['q_id_3', 'q_id_4','q_id_13','q_id_15','q_id_16','q_id_17']])

  q_id_3                                             q_id_4  \
0   None                                               None   
1   None                                       [tweet_eval]   
2   None  [common_voice, mozilla-foundation/common_voice...   
3   None                                               None   
4   None                            [bookcorpus, wikipedia]   
5   None                             [financial_phrasebank]   
6   None                                               None   
7   None  [s2orc, flax-sentence-embeddings/stackexchange...   
8   None                                        [wikipedia]   
9   None                            [bookcorpus, wikipedia]   

                                             q_id_13         q_id_15 q_id_16  \
0                       pysentimiento and RoBERTuito            None    [es]   
1          Association for Computational Linguistics            None    [en]   
2                    grosman2021xlsr53-large-english    [apache-2.

In [100]:
# Process card
qa_pipeline = pipeline("question-answering", model="Intel/dynamic_tinybert")

questions_to_process = {5,8,9,10,11,14}

# Process each row of the DataFrame
for index, row in HF_df_small.iterrows():
  context = row["card"]  # Assuming "card" column contains the context

  # Create an empty dictionary to store answers for each question
  answers = {}
  q_cnt = -1
  for question in questions:
    q_cnt+=1
    if(q_cnt not in questions_to_process): continue
    q = question[0]
    answer = answer_question(q, context)
    q_id = "q_id_"+str(q_cnt)
    answers[q_id] = answer  # Store answer for each question
  
  # Add a new column for each question and populate with answers
  for question, answer in answers.items():
    HF_df_small[question][index] = answer
    

  print("Answers added for row", index)  # Optional progress indicator

Question: Who created/uploaded the model?
Answer: {'score': 0.0015278777573257685, 'start': 3876, 'end': 3886, 'answer': 'RoBERTuito'} /n
Question: What was the split distribution for the datasets?
Answer: {'score': 1.177582831246582e-08, 'start': 1277, 'end': 1292, 'answer': '| 0.670 ± 0.006'} /n
Question: What model is used as the base model?
Answer: {'score': 0.45697927474975586, 'start': 440, 'end': 447, 'answer': 'RoBERTa'} /n
Question: What evaluation metrics were used?
Answer: {'score': 0.0034985386300832033, 'start': 949, 'end': 964, 'answer': 'Macro F1 scores'} /n
Question: What were the values of the evaluation metrics?
Answer: {'score': 8.506868471158668e-05, 'start': 924, 'end': 964, 'answer': 'Results are expressed as Macro F1 scores'} /n
Question: What hyperparameters were optimized during the training process?
Answer: {'score': 2.9859666028642096e-06, 'start': 3407, 'end': 3425, 'answer': '500 million tweets'} /n
Question: What publications are related to the model?
Answ

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  HF_df_small[question][index] = answer


Question: Who created/uploaded the model?
Answer: {'score': 3.0807539133093087e-06, 'start': 169, 'end': 181, 'answer': 'RoBERTa-base'} /n
Question: What was the split distribution for the datasets?
Answer: {'score': 0.005795312579721212, 'start': 1601, 'end': 1612, 'answer': "t = '@user'"} /n
Question: What model is used as the base model?
Answer: {'score': 0.032325588166713715, 'start': 169, 'end': 181, 'answer': 'RoBERTa-base'} /n
Question: What evaluation metrics were used?
Answer: {'score': 1.642551796976477e-05, 'start': 4029, 'end': 4050, 'answer': 'System Demonstrations'} /n
Question: What were the values of the evaluation metrics?
Answer: {'score': 0.031056080013513565, 'start': 4293, 'end': 4301, 'answer': '251--260'} /n
Question: What hyperparameters were optimized during the training process?
Answer: {'score': 4.464515313884476e-06, 'start': 200, 'end': 204, 'answer': '124M'} /n
Question: What publications are related to the model?
Answer: {'score': 0.00018329665181227028, 

In [87]:
print(HF_df_small.head())

                                             modelId          author  \
1   cardiffnlp/twitter-roberta-base-sentiment-latest      cardiffnlp   
2      jonatasgrosman/wav2vec2-large-xlsr-53-english  jonatasgrosman   
3                      openai/clip-vit-large-patch14          openai   
4                      google-bert/bert-base-uncased     google-bert   
5  mrm8488/distilroberta-finetuned-financial-news...         mrm8488   

              last_modified  downloads  likes  library_name  \
1 2023-05-28 05:45:10+00:00  132880138    348  transformers   
2 2023-03-25 10:56:55+00:00   57980766    391  transformers   
3 2023-09-15 15:49:35+00:00   45378724   1039  transformers   
4 2024-02-19 11:06:12+00:00   40564287   1443  transformers   
5 2024-01-21 15:17:58+00:00   34522962    213  transformers   

                                                tags  \
1  [transformers, pytorch, tf, roberta, text-clas...   
2  [transformers, pytorch, jax, safetensors, wav2...   
3  [transformers, py