## About This Notebook
This is the official Google Colab notebook for the repository [llm-evaluation-suit](https://github.com/yigithakverdi/llm-evaluation-suit). The goal is to reframe the given tasks from different evaluation campaigns, as instructed in the `README.md` files found in the related task folders. The datasets from the two sources, SemEval and EVALITA, are not suited for LLM testing as distributed, so we need to reframe them to make them suitable for evaluating LLMs.

Reframe each task so that an LLM can receive a sentence and a prompt as input and produce a single answer as output. The tasks differ from one another, but in general each dataset should follow the format below:

```json
  // dataset.jsonl
  {
    "sentence": str,
    "answer" : list[str],
    "label" : int
  }  
```

Prompt engineering, the meticulous design of the input provided to an LLM to solve a task, is where the actual value might come in. The aim is to improve the performance and manage the behavior of LLMs. Along with the dataset reformatting, we have to provide at most 5 prompts for each task, in a separate file named `prompt.json`, in the following format:

```json
  //prompt.json
  {
      "prompt": str
  }
```

Datasets assigned ↪ 8, 15 \
Distractor ↪ 23


In [1]:
## Default imports
import pandas as pd
import numpy as np
import pprint
import json
import xml.etree.ElementTree as ET
import re
import random

from google.colab import drive
from tqdm import tqdm

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [2]:
## Configs, environment variables etc.
drive.mount('/content/drive')

DATASET_DIR = "drive/MyDrive/Datasets/NLP Tasks (Italian)"
DRIVE_DIR = "drive/MyDrive"

print('\n')
!ls 'drive/MyDrive/Datasets/NLP Tasks (Italian)'

Mounted at /content/drive


task-0-emotivita       task-19-absita	  task-25-textentail	      task-5-discotex
task-12-tagit	       task-1-hodi	  task-26-hypernym_discovery  task-7-ami
task-13-concretext     task-20-itamoji	  task-27-haspeede3	      task-8-sardistance
task-14-ghigliottinAI  task-21-ironita	  task-28-pretens	      task-9-haspeede
task-15-prelearn       task-22-gxg	  task-29-tweet_intimacy
task-17-diacrita       task-23-postwita   task-2-nermud
task-18-accomplit      task-24-sentipolc  task-4-wic


In [192]:
## Some utility functions
def convert_to_jsonl(data, filename):
    with open(filename, 'w') as f:
        for item in data:
            json.dump(item, f)
            f.write('\n')

## XML parser, it returns a list of (doc_id, title, text) tuples
def parse_xml(xml_file):
    tree = ET.parse(xml_file)
    root = tree.getroot()
    documents = []
    for doc in root.findall('doc'):
        doc_id = doc.get('id')
        title = doc.find('title').text.strip()
        text = doc.find('text').text.strip()
        documents.append((doc_id, title, text))
    return documents

## Simple exact (case-insensitive) title match, returns '' when nothing matches
def concept_to_wiki(concept, wiki_docs):
  for doc in wiki_docs:
    if(concept.lower() == doc[1].lower()):
      return doc
  return ''

## Defining function for formatting datasets into jsonl
def format_prelearn_datasets(dataset, description="Formatting"):
  df_json = []
  pbar = tqdm(dataset.iterrows(), desc=description)
  for index, row in pbar:

    ## Obtaining wiki concepts using concept_to_wiki function
    wiki_A = concept_to_wiki(row['concept_A'], documents)
    wiki_B = concept_to_wiki(row['concept_B'], documents)
    df_json.append(
        {
            "wikipedia_passage_concept_A": wiki_A,
            "concept_A": row["concept_A"],
            "wikipedia_passage_concept_B": wiki_B,
            "concept_B": row["concept_B"],
            "choices": ["False", "True"],
            "label": row["label"]
        }
    )
  return df_json

## Randomizing labels
def randomize_labels(dataset_json):
  pass

## Function drops the \t<tag> part of every line (up to \n, or up to the end of
## the string for the last line, which has no trailing \n) and replaces it with
## a space, so that the words are joined back into a single sentence.
def get_sentence(input_string):
  input_string = re.sub(r'\t.*?(\n|$)', ' ', input_string).rstrip()
  return input_string

## Function for parsing the whole sentence into (word, PoS tag) tuples in order
## to do a refined target word and distractor selection. Once the sentence is
## split into (word, pos-tag) tuples, target word and distractor selection is done
## ...
## ...
## Target word choice and distractor choice
## --> Brute force approach: for now a brute force approach is used: the target
##                           word is chosen at random among the words whose tag
##                           is in a static list of preferred tags, and the
##                           distractors are drawn at random from the remaining
##                           tags of that list. If no word in the sentence has a
##                           preferred tag, the target word is chosen at random
##
def brute_force_select(word_tag_lst, preferred_pos_tags):
  ## Candidate targets: words whose tag is in the preferred list, kept
  ## together with their real index in the sentence
  candidates = [(real_index, word, pos_tag)
                for real_index, (word, pos_tag) in enumerate(word_tag_lst)
                if pos_tag in preferred_pos_tags]
  if candidates:
    target_index, target_word, target_pos = random.choice(candidates)
  else:
    ## No word with a preferred tag, fall back to any word of the sentence
    target_index = random.randrange(len(word_tag_lst))
    target_word, target_pos = word_tag_lst[target_index]

  ## Three distinct distractor tags, never equal to the correct tag;
  ## the list passed in by the caller is left untouched
  pool = [tag for tag in preferred_pos_tags if tag != target_pos]
  distiractors = random.sample(pool, 3)

  return target_index, target_word, target_pos, distiractors

## --> Statistical approach: <TO BE DETERMINED>
## ...
## ...
## ...
def statistical_select(word_tag_lst):
  pass

def select_target_word_and_distractions(input_string):
  word_tag_lst = []
  for elements in input_string.split('\n'):
    word, tag = elements.split('\t')
    word_tag_lst.append((word, tag))
  preferred_pos_tags = ['NOUN', 'VERB', 'ADJ', 'ADP', 'PRON', 'AUX', 'PROPN']
  target_index, target_word, target_pos, distiractors = brute_force_select(word_tag_lst, preferred_pos_tags)
  return target_index, target_word, target_pos, distiractors


## Using TF-IDF vectorizer with cosine similarity to match
## wiki concepts efficiently with the concept pairs
# def concept_to_wiki(concept, wiki_docs):

#     corpus = [doc[2] for doc in wiki_docs]
#     corpus.append(concept)

#     tfidf_vectorizer = TfidfVectorizer()
#     tfidf_matrix = tfidf_vectorizer.fit_transform(corpus)
#     # cos_sim_scores = cosine_similarity(tfidf_matrix[-1], tfidf_matrix[:-1])

#     # best_match_index = cos_sim_scores.argmax()
#     # best_match_score = cos_sim_scores[0][best_match_index]
#     # best_match = wiki_docs[best_match_index]

#     # return best_match, best_match_score
#     return tfidf_matrix, corpus

## SardiStance (Task 8)
Stance Detection (SD) has emerged as a significant area of research, focusing on identifying people's attitudes towards specific topics, which is crucial for public administration, policy-making, marketing, and security. By monitoring public opinions and reactions, administrators can better address the needs of the population, such as identifying extremist tendencies or preventing dissatisfaction. The task of SD involves determining an author's stance (in favor, against, or neutral) towards a specific target. SemEval introduced the first shared task on SD in 2016, focusing on tweets related to various subjects like political figures and social issues. Subsequent initiatives such as IberEval expanded SD research to include languages like Catalan and Spanish and introduced multimodal approaches to stance detection. In addition, the broader SDQC framework classifies attitudes in rumors as Support, Deny, Query, or Comment; it was further explored in SemEval tasks in 2017 and 2019, incorporating analysis of social media content like tweets and Reddit posts.

## Task Description and Expected Output
With this task proposal we would like to invite participants to explore features based on the textual content of the tweet, such as structural, stylistic, and affective features, but also features based on contextual information that does not emerge directly from the text, such as knowledge about the domain of the political debate or information about the user's community. Overall, we propose two different subtasks:

  - Task A **_(Task that is assigned)_** - Textual Stance Detection: The first task is a three-class classification task where the system has to predict whether a tweet is in favour, against or neutral/none towards the given target (following the guidelines below), exploiting only textual information, i.e. the text of the tweet.

  - <del>Task B - Contextual Stance Detection: The second task is the same as the first one: a three-class classification task where the system has to predict whether a tweet is in favour, against or neutral/none towards the given target. Here participants will have access to a wider range of contextual information based on the post such as: the number of retweets, the number of favours, the type of posting source (e.g. iOS or Android), and date of posting.</del>

  <del>Furthermore we will share (and encourage its exploitation) contextual information related to the user, such as: number of tweets ever posted, user's bio (only emojis), user's number of followers, user's number of friends. Additionally we will share users' contextual information about their social network, such as: friends, replies, retweets, and quotes' relations. The personal ids of the users will be anonymized but their network structures will be maintained intact.</del>

The dataset comes with three `.csv` files: `TRAIN`, `TEST` and `GOLD`. We need to merge `TEST` and `GOLD` and produce two `.jsonl` files, `SardiStance-train.jsonl` and `SardiStance-test.jsonl`. Each line in our output files should be in the format
  
  ```json
  {
      "text": str,
      "choices": list[str],
      "label": int
  }
  ```

Furthermore, we need to create a `prompt.jsonl` file reporting the prompts designed for the task. Each line in this file (1 line per prompt, max 5 lines) must be a JSON object like the one below:

  ```json
  {
      "prompt": str
  }
  ```
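
A minimal sketch of how such a file could be written, one JSON object per line. The prompt texts below are hypothetical placeholders for illustration only (the real prompts still have to be designed), and `write_prompts` and `MAX_PROMPTS` are names introduced here, not part of the assignment.

```python
import json

# Hypothetical prompt texts, for illustration only; the real prompts
# still have to be designed for the task.
prompts = [
    "What is the stance of the following tweet towards the Sardines movement?",
    "Does the author of this tweet support, oppose, or take no position on the Sardines movement?",
]

MAX_PROMPTS = 5  # limit stated in the task description


def write_prompts(prompts, filename):
    """Write one {"prompt": ...} JSON object per line."""
    if len(prompts) > MAX_PROMPTS:
        raise ValueError(f"at most {MAX_PROMPTS} prompts are allowed")
    with open(filename, "w", encoding="utf-8") as f:
        for prompt in prompts:
            # ensure_ascii=False keeps accented characters readable in the file
            f.write(json.dumps({"prompt": prompt}, ensure_ascii=False) + "\n")


write_prompts(prompts, "prompt.jsonl")
```

Keeping the limit check inside the writer means a sixth prompt is rejected before anything is written to disk.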

## Further Resources

  - The official documentation of the campaign, in PDF format, can be found at the following link ➡ [SardiStance PDF Document](https://s3.cbk.cloud.syseleven.net/elg-public/a354fc6ec78040ca8bf67a6d0425120a_u2235_paper159-2.pdf)

  - A web archive of the campaign can be found on the following link ➡ [Web Archive Link to Official Campaign](https://web.archive.org/web/20200809224021/https://www.di.unito.it/~tutreeb/sardistance-evalita2020/index.html)

In [4]:
## Loading datasets
df_train = pd.read_csv(f'{DATASET_DIR}/task-8-sardistance/dataset/SardiStance4ELG/development/TRAIN_anonymized.csv')
df_test = pd.read_csv(f'{DATASET_DIR}/task-8-sardistance/dataset/SardiStance4ELG/TEST_anonymized.csv')
df_gold = pd.read_csv(f'{DATASET_DIR}/task-8-sardistance/dataset/SardiStance4ELG/TEST-GOLD.csv')

In [5]:
df_train

Unnamed: 0,tweet_id,user_id,text,label
0,699725964326915,246324638513337,CONTRORDINE COMPAGNI. \nFATE GIRARE LA VOCE CH...,AGAINST
1,935933812070316,102639953724275,A proposito dell'incontro delle @6000sardine c...,NONE
2,645018490992395,827980738531521,Care sardine che richiamate in continuazione i...,AGAINST
3,544253932762359,550094690571751,"Domanda ad una sardina "" secondo te Qual è il ...",AGAINST
4,250599450823911,596243198626016,A salvini non gli frega nulla delle sardine no...,NONE
...,...,...,...,...
2127,699350542668374,837943029514384,"Pidini non montatevi la testa, non avete perso...",NONE
2128,285922246863380,255983118202161,Volete sapere perché il pd nonostante tutto di...,NONE
2129,645814852346899,203006039273884,#Omnibusla7 Telese in piena fregola per il cap...,NONE
2130,254474324611529,152960836276941,Noto con piacere che le librerie si stanno rip...,NONE


In [6]:
df_test

Unnamed: 0,tweet_id,user_id,text
0,100333271264733,502726142324871,Quando vedo sardine pieddini e pidiessini mi c...
1,100802632657232,207046357942323,Sardine pidini sinistri di vario tipo cinque s...
2,101176390580663,404518296438980,#Salvinivergognati ha scatenato la macchina de...
3,102223067642419,436334476725993,Oggi a San Giovanni con le sardine ce stato un...
4,103369223683140,217958117453818,"È sabato tutti a #reggioemilia ore 18,30 piazz..."
...,...,...,...
1105,997817227475980,432164115674436,Sono nella #menaide del Pd. #Sardine. Ps: mena...
1106,998106237052447,760855778876687,NE CARNE NE PESCE !\nSARDINE O MORTADELLA ? \n...
1107,998463405443867,928031026240529,#iostoconsalvini ottimo hastag per bloccare #S...
1108,998495737101996,760856860830982,"La ""sardina"" Jasmine scrive a Salvini: ""Combat..."


In [7]:
df_gold

Unnamed: 0,tweet_id,label
0,100333271264733,AGAINST
1,100802632657232,AGAINST
2,101176390580663,FAVOR
3,102223067642419,AGAINST
4,103369223683140,FAVOR
...,...,...
1105,997817227475980,AGAINST
1106,998106237052447,AGAINST
1107,998463405443867,AGAINST
1108,998495737101996,FAVOR


In [8]:
## Merging test and gold sets
df_test = pd.merge(df_test, df_gold, on='tweet_id', how='inner')
df_test

Unnamed: 0,tweet_id,user_id,text,label
0,100333271264733,502726142324871,Quando vedo sardine pieddini e pidiessini mi c...,AGAINST
1,100802632657232,207046357942323,Sardine pidini sinistri di vario tipo cinque s...,AGAINST
2,101176390580663,404518296438980,#Salvinivergognati ha scatenato la macchina de...,FAVOR
3,102223067642419,436334476725993,Oggi a San Giovanni con le sardine ce stato un...,AGAINST
4,103369223683140,217958117453818,"È sabato tutti a #reggioemilia ore 18,30 piazz...",FAVOR
...,...,...,...,...
1105,997817227475980,432164115674436,Sono nella #menaide del Pd. #Sardine. Ps: mena...,AGAINST
1106,998106237052447,760855778876687,NE CARNE NE PESCE !\nSARDINE O MORTADELLA ? \n...,AGAINST
1107,998463405443867,928031026240529,#iostoconsalvini ottimo hastag per bloccare #S...,AGAINST
1108,998495737101996,760856860830982,"La ""sardina"" Jasmine scrive a Salvini: ""Combat...",FAVOR


In [10]:
## Initially turning datasets into asked JSON format
##    {
##       "text": str,
##       "choices": list[str],
##       "label": int
##    }

df_test_json = []
df_train_json = []

print("------ Formatting test set")
for index, row in tqdm(df_test.iterrows()):
  df_test_json.append(
      {
          "text" : f"{row['text']}",
          "choices": [f"{row['label']}"],
          "label" : 0
      }
  )

print("\n")
print("------ Formatting training set")
for index, row in tqdm(df_train.iterrows()):
  df_train_json.append(
      {
          "text" : f"{row['text']}",
          "choices": [f"{row['label']}"],
          "label" : 0
      }
  )

print("\n")
pprint.pp(df_test_json[:5])

------ Formatting test set


1110it [00:00, 12268.86it/s]




------ Formatting training set


2132it [00:00, 15798.26it/s]



[{'text': 'Quando vedo sardine pieddini e pidiessini mi chiedo..\n'
          '\n'
          'MA ESISTE UN DIRIGENTE DEL PCI DEGLI ANNI 80/90 MORTO DI FAME ?',
  'choices': ['AGAINST'],
  'label': 0},
 {'text': 'Sardine pidini sinistri di vario tipo cinque stalle e diarrea e '
          'succedanee\n'
          "Piu' vi agitate e più mi fate godere ogni volta che scrivete e fate "
          'minchiate è come essere fidanzato alla più bella donna del mondo\n'
          'Andate a cagare e pulitevi con le ortiche',
  'choices': ['AGAINST'],
  'label': 0},
 {'text': '#Salvinivergognati ha scatenato la macchina del fango contro le '
          '#Sardine, quattro post da stamattina. Bene. Vuol dire che ha paura, '
          'e fa bene! Sono i movimenti dal basso che distruggono le bugie '
          'sovraniste della Lega. Avanti tutta!',
  'choices': ['FAVOR'],
  'label': 0},
 {'text': 'Oggi a San Giovanni con le sardine ce stato un ritorno al passato , '
          'la nascita dei 5S , movi




In [11]:
## Next adding the distractors
## ...
## since there are only 3 categorical answers, we can just add all of them to the
## choices lists (per the assignment requirements at most 4 answers are allowed,
## one of them being correct and the rest being distractors)
choices = ["AGAINST", "FAVOR", "NONE"]

## Adding missing choices to the choice list in each JSON object using set
## differences
for item in tqdm(df_test_json):
  item['choices'].extend(list(set(choices) - set(item['choices'])))

for item in tqdm(df_train_json):
  item['choices'].extend(list(set(choices) - set(item['choices'])))
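
In the records built above the correct answer always sits at index 0 of `choices`, so `label` is always 0. The following is a minimal sketch of one way to randomize its position, assuming each record keeps the correct answer at `choices[label]`. The helper name `shuffle_choices` is hypothetical; it could serve as a starting point for the `randomize_labels` stub defined earlier.

```python
import random


def shuffle_choices(record, rng=random):
    """Return a copy of `record` with `choices` shuffled and `label`
    updated to the new index of the correct answer."""
    correct = record["choices"][record["label"]]
    shuffled = list(record["choices"])  # copy, the input record is not modified
    rng.shuffle(shuffled)
    return {**record, "choices": shuffled, "label": shuffled.index(correct)}


# Toy record in the same shape as the SardiStance entries
example = {"text": "tweet text", "choices": ["AGAINST", "FAVOR", "NONE"], "label": 0}
shuffled_example = shuffle_choices(example, random.Random(0))
```

Passing a seeded `random.Random` instance makes the shuffle reproducible across runs, which helps when the same files have to be regenerated.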

In [None]:
## Create prompt engineering logic here suited for the dataset context
## ...
## at most 5 prompts needed for this specific dataset and required to be
## recorded in the prompt.json file



In [13]:
!mkdir "SardiStance"
convert_to_jsonl(df_train_json, "SardiStance/SardiStance-train.jsonl")
convert_to_jsonl(df_test_json, "SardiStance/SardiStance-test.jsonl")

## PreLearn (Prerequisite Relation Learning) (Task 15)
The proliferation of e-learning platforms, electronic textbooks and educational applications has shed light on the need of developing systems able to identify educational relations between learning concepts in order to develop intelligent agents to support both students and teachers in distant learning. Prerequisite relations are the most relevant among all educational relations since they establish which sequence of concepts allows students to have a full understanding of a subject. The need of inferring prerequisite relations from educational texts inspired PRELEARN (Prerequisite Relation Learning), the first shared task on Automatic Prerequisite Learning. We invite participants to build models that identify prerequisite relations between pairs of concepts. We will challenge these models proposing different experimental settings and scenarios.

## Dataset Creation
The dataset was built upon the AL-CPL dataset (Liang et al. 2018), a collection of binary-labelled concept pairs extracted from textbooks on four domains: data mining, geometry, physics and precalculus. In AL-CPL, for each domain, relevant concepts were extracted from a textbook and matched with pages from English Wikipedia if the title and the concept name corresponded. Then, domain experts were asked to manually annotate if pairs of concepts showed a prerequisite relation or not, therefore the dataset consists of both positive and negative concept pairs. In ITA-PREREQ we took the Italian version of the Wikipedia pages considered for AL-CPL, excluding from the dataset those concepts (and the relations where they were involved) for which an Italian page was not available. Finally, we mapped both positive and negative relations between pairs of the remaining concepts from AL-CPL to ITA-PREREQ.

## Task Description and Expected Output
There are 4 different sets, one for each domain: data mining, geometry, physics and precalculus. Each domain should have its own training and test set, adding up to 8 individual `.jsonl` files (the exact name format is `PRELEARN-<name of the set>-<train/test>.jsonl`). The expected output is as follows:

  ```json
  {
      "wikipedia_passage_concept_A": str,
      "concept_A": str,
      "wikipedia_passage_concept_B": str,
      "concept_B": str,
      "choices": list[str],
      "label": int
  }
  ```

The Wikipedia passage could either be reduced to only the `<text>` section, or we could preserve `<title>`, `<text>` and the rest of the XML fields in an individual JSON object such as:

  ```json
  {
    "title": str,
    "text": str,
    "url": str,
    "id": int
  }
  ```
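
As a rough sketch of the second option, the `(doc_id, title, text)` tuples returned by `parse_xml` could be turned into such an object. The helper name `document_to_dict` is hypothetical, and `url` is left out because `parse_xml` does not extract it.

```python
def document_to_dict(document):
    """Convert a (doc_id, title, text) tuple into a JSON-serializable dict."""
    doc_id, title, text = document
    return {"id": int(doc_id), "title": title, "text": text}


# Toy tuple in the same shape as the entries of `documents`
example_document = ("109852", "Triangolo rettangolo", "Il triangolo rettangolo è un triangolo ...")
passage = document_to_dict(example_document)
```

Note that `concept_to_wiki` returns an empty string when no page matches, so a caller would have to handle that case before unpacking the tuple.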

All the domain train and test sets should be in the same format. Furthermore, the prompt file should follow the same format as the previous prompt files, with at most 5 prompts:

  ```json
  {
    "prompt": str
  }
  ```

## Further Resources
- The official website can be accessed at the following link ↪ [Official Website](https://sites.google.com/view/prelearn20)
- The dataset is from ELG as well ↪ [ELG Dataset Link](https://live.european-language-grid.eu/catalogue/corpus/8084/download/)

In [66]:
## Loading training datasets
df_physics_train = pd.read_csv(f'{DATASET_DIR}/task-15-prelearn/PRELEARN_dataset/PRELEARN_dataset/PRELEARN_training_data/physics-pairs_train.csv')
df_precalculus_train = pd.read_csv(f'{DATASET_DIR}/task-15-prelearn/PRELEARN_dataset/PRELEARN_dataset/PRELEARN_training_data/precalculus-pairs_train.csv')
df_dataminig_train = pd.read_csv(f'{DATASET_DIR}/task-15-prelearn/PRELEARN_dataset/PRELEARN_dataset/PRELEARN_training_data/data_mining-pairs_train.csv')
df_geometry_train = pd.read_csv(f'{DATASET_DIR}/task-15-prelearn/PRELEARN_dataset/PRELEARN_dataset/PRELEARN_training_data/geometry-pairs_train.csv')

## Loading test datasets
df_physics_test = pd.read_csv(f'{DATASET_DIR}/task-15-prelearn/PRELEARN_dataset/PRELEARN_dataset/PRELEARN_test-data/physics-pairs_test.txt')
df_precalculus_test = pd.read_csv(f'{DATASET_DIR}/task-15-prelearn/PRELEARN_dataset/PRELEARN_dataset/PRELEARN_test-data/precalculus-pairs_test.txt')
df_dataminig_test = pd.read_csv(f'{DATASET_DIR}/task-15-prelearn/PRELEARN_dataset/PRELEARN_dataset/PRELEARN_test-data/data_mining-pairs_test.txt')
df_geometry_test = pd.read_csv(f'{DATASET_DIR}/task-15-prelearn/PRELEARN_dataset/PRELEARN_dataset/PRELEARN_test-data/geometry-pairs_test.txt')

## Loading wiki documents
df_wiki_documents = f'{DATASET_DIR}/task-15-prelearn/PRELEARN_dataset/PRELEARN_dataset/PRELEARN_training_data/ITA_prereq-pages.xml'

In [67]:
## Adding the headers to dataset for training set
df_physics_train.columns = ['concept_A', 'concept_B', 'label']
df_precalculus_train.columns = ['concept_A', 'concept_B', 'label']
df_dataminig_train.columns = ['concept_A', 'concept_B', 'label']
df_geometry_train.columns = ['concept_A', 'concept_B', 'label']

## Adding the headers to dataset for test set
df_physics_test.columns = ['concept_A', 'concept_B', 'label']
df_precalculus_test.columns = ['concept_A', 'concept_B', 'label']
df_dataminig_test.columns = ['concept_A', 'concept_B', 'label']
df_geometry_test.columns = ['concept_A', 'concept_B', 'label']

In [68]:
df_geometry_test

Unnamed: 0,concept_A,concept_B,label
0,Poligono,Segmento,1
1,Pendenza topografica,Geometria,1
2,Terna pitagorica,Teorema di Pitagora,1
3,Circonferenza,Geometria,1
4,Terna pitagorica,Triangolo rettangolo,1
...,...,...,...
194,Vertice (geometria),Segmento,0
195,Perpendicolarità,Rombo (geometria),0
196,Angolo,Parallelismo (geometria),0
197,Bidimensionalità,Perpendicolarità,0


In [69]:
## Turning the whole datasets into lists of JSON objects
## ...
## ...
## note: format_prelearn_datasets reads the global `documents` list, so the
## Wikipedia pages must be parsed with parse_xml before running this cell
df_physics_train_json = format_prelearn_datasets(df_physics_train, description="Formatting physics train dataset")
df_precalculus_train_json = format_prelearn_datasets(df_precalculus_train, description="Formatting precalculus train dataset")
df_datamining_train_json = format_prelearn_datasets(df_dataminig_train, description="Formatting datamining train dataset")
df_geometry_train_json = format_prelearn_datasets(df_geometry_train, description="Formatting geometry train dataset")

df_physics_test_json = format_prelearn_datasets(df_physics_test, description="Formatting physics test dataset")
df_precalculus_test_json = format_prelearn_datasets(df_precalculus_test, description="Formatting precalculus test dataset")
df_datamining_test_json = format_prelearn_datasets(df_dataminig_test, description="Formatting datamining test dataset")
df_geometry_test_json = format_prelearn_datasets(df_geometry_test, description="Formatting geometry test dataset")

## df_geometry_train_json = format_prelearn_datasets(dataset, description="Description")
print("\n\n")
pprint.pp(df_geometry_train_json[:1])

Formatting physics train dataset: 2219it [00:00, 5566.52it/s]
Formatting precalculus train dataset: 1715it [00:00, 4755.91it/s]
Formatting datamining train dataset: 423it [00:00, 7283.60it/s]
Formatting geometry train dataset: 1547it [00:00, 10265.24it/s]
Formatting physics test dataset: 199it [00:00, 5845.41it/s]
Formatting precalculus test dataset: 199it [00:00, 4828.32it/s]
Formatting datamining test dataset: 98it [00:00, 6030.81it/s]
Formatting geometry test dataset: 199it [00:00, 6583.11it/s]




[{'wikipedia_passage_concept_A': ('1140445',
                                  'Decagono',
                                  'In geometria, un decagono è un poligono con '
                                  'dieci lati e dieci angoli. In un decagono '
                                  'regolare tutti i lati hanno lunghezza '
                                  'uguale e tutti gli angoli sono di 144º. '
                                  "L'area di un decagono regolare con lato "
                                  'lungo formula_1 è data da:\n'
                                  'Un decagono regolare può essere costruito '
                                  'con riga e compasso. Qui sotto ne è '
                                  "mostrata un'animazione:\n"
                                  'Il perimetro di un decagono regolare si '
                                  'trova moltiplicando un suo lato per 10\n'
                                  'P= a • 10'),
  'concept_A': 'Decagono',
  'wikipe




In [70]:
## Parsing Wikipedia content to be inserted into JSON objects
## ...
## ...
documents = parse_xml(df_wiki_documents)
documents[:1]

[('109852',
  'Triangolo rettangolo',
  'Il triangolo rettangolo è un triangolo in cui l\'angolo formato da due lati, detti cateti, è retto, ovvero di 90° (o radianti). Il lato opposto all\'angolo retto si chiama ipotenusa. L\'ipotenusa è per il teorema di Pitagora uguale alla radice quadrata della somma dei quadrati dei cateti. \nIl triangolo rettangolo rappresenta un caso particolare di triangolo generico, per cui molte relazioni fondamentali si semplificano. Il caso più particolare è quello del triangolo rettangolo isoscele, caso per il quale \nAggiungendo a un triangolo rettangolo il triangolo ottenuto con la sua riflessione rispetto all\'ipotenusa si ottiene un aquilone. Aggiungendogli il triangolo ottenuto sottoponendolo alla rotazione di π intorno al punto medio dell\'ipotenusa si ottiene il rettangolo per il quale l\'ipotenusa è diagonale principale.\nDal triangolo rettangolo isoscele con entrambe le costruzioni si ottiene il quadrato di lato formula_2.\nOgni similitudine trasf

In [71]:
!mkdir "PreLearn"
convert_to_jsonl(df_physics_train_json, "PreLearn/PRELEARN-physics-train.jsonl")
convert_to_jsonl(df_precalculus_train_json, "PreLearn/PRELEARN-precalculus-train.jsonl")
convert_to_jsonl(df_datamining_train_json, "PreLearn/PRELEARN-datamining-train.jsonl")
convert_to_jsonl(df_geometry_train_json, "PreLearn/PRELEARN-geometry-train.jsonl")

convert_to_jsonl(df_physics_test_json, "PreLearn/PRELEARN-physics-test.jsonl")
convert_to_jsonl(df_precalculus_test_json, "PreLearn/PRELEARN-precalculus-test.jsonl")
convert_to_jsonl(df_datamining_test_json, "PreLearn/PRELEARN-datamining-test.jsonl")
convert_to_jsonl(df_geometry_test_json, "PreLearn/PRELEARN-geometry-test.jsonl")

## PoSTWITA Task (Distractor - Task 23)
Work on Part-of-Speech (PoS) tagging has mainly concentrated on standardized texts for many years. However, the interest in automatic evaluation of social media texts, in particular microblogging texts such as tweets, is growing considerably: information found on Twitter has already been shown to be useful for a variety of applications for identifying trends and upcoming events in various fields. As the nature of social media texts is clearly different from standardized texts, both regarding the nature of lexical items and their distributional properties (short messages, emoticons and mentions, threaded messages, etc.), Natural Language Processing methods need to be adapted to obtain reliable processing. The basis for such an adaptation is a tagged social media text corpus [Neunerdt et al. 2013] for training and testing automatic procedures. Various attempts to produce such specialised tools are available in the literature (e.g. [Gimpel et al. 2011; Derczynski et al. 2013; Neunerdt et al. 2013; Owoputi et al. 2013]) for other languages, but Italian completely lacks such resources, both regarding annotated corpora and specific PoS-tagging tools.
For all the above mentioned reasons, we proposed a task for EVALITA 2016 concerning the domain adaptation of PoS-taggers to Twitter texts.

## Task and Expected Output
The files we are interested in are `goldDEVset-2016_09_05_anon_rev.txt` and `goldTESTset-2016_09_05_anon_rev.txt` in the `postwita/` folder. There is one dataset for the PoS task: create `postwita-train.jsonl` (corresponding to the original DEV file) and `postwita-test.jsonl`. Each line in the output files must be a JSON object like the one below:

  ```json
  {
      "sentence_id": ...,
      "sentence": ...,
      "target_word": ...,
      "word_idx": ...,
      "choices": [...],
      "label": ...
  }
  ```

We are expected to design a strategy to include distractors among the choices: three different incorrect labels are selected besides the correct one, and these three labels must be challenging for the word in the given context. As mentioned in the assignment document, there are at most 4 choices, 3 of them distractors; since PoS tagging has more than 4 possible labels, we need to choose reasonable distractors.
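
One possible way to make the distractors more challenging than a random draw is sketched below. This is an assumption for illustration, not the strategy used in this notebook. The idea is to prefer tags that the same word form also receives elsewhere in the corpus, then fall back to the most frequent tags overall. The function names and the toy corpus are hypothetical.

```python
from collections import Counter, defaultdict


def build_tag_statistics(tagged_sentences):
    """Count, for every lower-cased word form, how often each tag was seen,
    and how often each tag occurs overall."""
    tags_per_word = defaultdict(Counter)
    tag_counts = Counter()
    for sentence in tagged_sentences:
        for word, tag in sentence:
            tags_per_word[word.lower()][tag] += 1
            tag_counts[tag] += 1
    return tags_per_word, tag_counts


def pick_distractors(word, gold_tag, tags_per_word, tag_counts, k=3):
    """Prefer tags the same word form received elsewhere in the corpus,
    then fall back to the most frequent tags overall."""
    ranked = [tag for tag, _ in tags_per_word.get(word.lower(), Counter()).most_common()]
    ranked += [tag for tag, _ in tag_counts.most_common()]
    distractors = []
    for tag in ranked:
        if tag != gold_tag and tag not in distractors:
            distractors.append(tag)
        if len(distractors) == k:
            break
    return distractors


# Toy corpus of (word, tag) sentences, for illustration only
toy_corpus = [
    [("la", "DET"), ("porta", "NOUN"), ("è", "AUX"), ("aperta", "ADJ")],
    [("lui", "PRON"), ("porta", "VERB"), ("la", "DET"), ("borsa", "NOUN")],
    [("la", "PRON"), ("vedo", "VERB")],
]
tags_per_word, tag_counts = build_tag_statistics(toy_corpus)
distractors = pick_distractors("porta", "NOUN", tags_per_word, tag_counts)
```

In the toy corpus "porta" appears both as a noun and as a verb, so `VERB` is ranked first among the distractors for the noun reading. The function can return fewer than `k` tags when the corpus contains fewer than `k + 1` distinct tags.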

## Further Resources
- The official website can be found at the following link ↪ [Official PoSTWITA Website](https://corpora.ficlit.unibo.it/PoSTWITA/)
- The download link is from the European Language Grid (ELG) ↪ [Dataset Download Link](https://live.european-language-grid.eu/catalogue/corpus/7481/download/)

In [196]:
## Parsing the txt file in order to format it into expected JSON output
## using regex to, match the header of each entry or post
## ...
## ...
train_json = []
test_json = []

with open(f'{DATASET_DIR}/task-23-postwita/postwita/postwita/goldTESTset-2016_09_12_anon_rev.txt', 'r', encoding='utf-8')  as file:
  data = file.read()
  entries = re.findall(r'_{5}(\d+)_{5}\n(.*?)\n\n', data, re.DOTALL)
  pbar = tqdm(entries, desc="Formatting test set")
  for entry in pbar:
    choices = []
    sentence_id, sentence_raw = entry
    target_index, target_word, target_pos, distiractors = select_target_word_and_distractions(sentence_raw)
    choices.append(target_pos)
    choices.extend(distiractors)
    test_json.append({
        "sentence_id": sentence_id,
        "sentence": get_sentence(sentence_raw),
        "target_word": target_word,
        "word_idx": target_index,
        "choices": choices,
        "label": 0
    })

print("\n")
pprint.pp(test_json[:2])
print("\n")

with open(f'{DATASET_DIR}/task-23-postwita/postwita/postwita/fixed_golddev.txt', 'r', encoding='utf-8')  as file:
  data = file.read()
  entries = re.findall(r'_{5}(\d+)_{5}\n(.*?)\n\n', data, re.DOTALL)
  pbar = tqdm(entries, desc="Formatting training set")
  for entry in pbar:
    choices = []
    sentence_id, sentence_raw = entry
    target_index, target_word, target_pos, distiractors = select_target_word_and_distractions(sentence_raw)
    choices.append(target_pos)
    choices.extend(distiractors)
    train_json.append({
        "sentence_id": sentence_id,
        "sentence": get_sentence(sentence_raw),
        "target_word": target_word,
        "word_idx": target_index,
        "choices": choices,
        "label": 0
    })

print("\n")
pprint.pp(train_json[:2])

Formatting test set: 100%|██████████| 301/301 [00:00<00:00, 24452.56it/s]




[{'sentence_id': '563811619483176961',
  'sentence': '“ @mention_1 : @mention_2 solo un dm la pregooo ... se nn le '
              'piaccio mi def ok ? 😜😂 ” vita mia',
  'target_word': 'mi',
  'word_idx': 14,
  'choices': ['PRON', 'NOUN', 'VERB', 'ADJ'],
  'label': 0},
 {'sentence_id': '562679456184426497',
  'sentence': '" E non è detto che non ci si riunisca , nella vita non si può '
              'mai dire " <3 #sacre http://anonymized.url.com',
  'target_word': 'è',
  'word_idx': 3,
  'choices': ['AUX', 'PROPN', 'ADJ', 'ADP'],
  'label': 0}]




Formatting training set: 100%|██████████| 6438/6438 [00:00<00:00, 30548.10it/s]



[{'sentence_id': '162545185920778240',
  'sentence': 'Governo Monti : decreto in cdm per approvazione ! '
              'http://anonymized.url.com',
  'target_word': 'cdm',
  'word_idx': 5,
  'choices': ['PROPN', 'ADP', 'PRON', 'ADJ'],
  'label': 0},
 {'sentence_id': '192902763032743936',
  'sentence': '#Ferrara critica #Grillo perché dice cose che dicevano '
              'Berlusconi e Bossi . E che non hanno fatto .',
  'target_word': 'Berlusconi',
  'word_idx': 8,
  'choices': ['PROPN', 'VERB', 'PRON', 'NOUN'],
  'label': 0}]





In [197]:
!mkdir "PoSTWITA"
convert_to_jsonl(train_json, "PoSTWITA/postwita-train.jsonl")
convert_to_jsonl(test_json, "PoSTWITA/postwita-test.jsonl")

## Testing
This section provides a few cells for confirming that the JSONL files are correctly parsed and in the desired format. We load the JSONL data files and examine the first few records, then perform some basic checks on the data, such as handling missing values, converting data types, and cleaning text fields.
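
A minimal sketch of such a check is given below. The helper names `read_jsonl` and `check_records` are hypothetical, and the demo runs on a small file written on the spot rather than on the actual outputs of this notebook.

```python
import json


def read_jsonl(filename):
    """Read a JSONL file into a list of dicts, one per non-empty line."""
    with open(filename, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def check_records(records, required_keys):
    """Return a list of (index, problem) pairs; an empty list means the
    records pass these checks."""
    problems = []
    for i, record in enumerate(records):
        missing = set(required_keys) - set(record)
        if missing:
            problems.append((i, f"missing keys: {sorted(missing)}"))
            continue
        choices, label = record["choices"], record["label"]
        if not isinstance(label, int) or not 0 <= label < len(choices):
            problems.append((i, "label is not a valid index into choices"))
        if len(set(choices)) != len(choices):
            problems.append((i, "duplicate choices"))
    return problems


# Small self-contained demo on a sample file
sample = [{"text": "esempio", "choices": ["AGAINST", "FAVOR", "NONE"], "label": 0}]
with open("sample.jsonl", "w", encoding="utf-8") as f:
    for item in sample:
        f.write(json.dumps(item, ensure_ascii=False) + "\n")

records = read_jsonl("sample.jsonl")
problems = check_records(records, {"text", "choices", "label"})
```

In the notebook the same two calls could be pointed at files such as `SardiStance/SardiStance-train.jsonl`, with the set of required keys adjusted per task.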
