# TL;DR – Too Long, Doctor

TL;DR is an ML model designed to summarize and cluster scientific papers. Aimed at students and researchers who want to make the most of their study time, it offers a quick way to grasp the essence of complex scientific material, or to get a concise overview of a paper before reading it in detail.

# Importing libraries

In [1]:
# Import libraries for XML parsing, tabular data handling, and file access
import xml.etree.ElementTree as ET
import pandas as pd
import os

In [2]:
import nltk
import spacy
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# Download necessary NLTK data
nltk.download('punkt')
nltk.download('stopwords')

# Load English tokenizer, tagger, parser, NER and word vectors
nlp = spacy.load('en_core_web_sm')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/simonebellavia/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/simonebellavia/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [19]:
import torch
from transformers import BertTokenizer

# Global Functions

In [3]:
def extract_data(file):
    """
    Extract data from an XML file and return it as a dictionary.

    Args:
        file (str): The path to the XML file.

    Returns:
        dict: A dictionary containing the extracted data.
    """
    # Create a dictionary to store the data
    data = {}
    # Parse the XML file
    tree = ET.parse(file)
    # Get the root of the XML file
    root = tree.getroot()

    # Initialize abstract data
    data['abstract'] = {}

    # Initialize body data
    data['body'] = []

    # Initialize keywords data
    data['keywords'] = []

    # Initialize the title so the key exists even without <article-meta>
    data['title'] = None

    # Extract title and abstract
    article_meta = root.find('.//article-meta')
    if article_meta is not None:
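        title_group = article_meta.find('title-group')
        # Guard against a missing <title-group> or <article-title>
        title_elem = title_group.find('article-title') if title_group is not None else None
        data['title'] = title_elem.text if title_elem is not None else None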

        abstract_section = article_meta.find('abstract')
        if abstract_section is not None:
            for section in abstract_section.findall('sec'):
                section_title = section.find('title').text if section.find('title') is not None else ''
                section_text = section.find('p').text if section.find('p') is not None else ''
                if 'simple summary' in section_title.lower():
                    data['abstract']['simple_summary'] = section_text
                elif 'abstract' in section_title.lower():
                    data['abstract']['abstract'] = section_text

        # Extract keywords
        kwd_group = article_meta.find('kwd-group')
        if kwd_group is not None:
            data['keywords'] = [kwd.text for kwd in kwd_group.findall('kwd') if kwd.text]

    # Extract body sections
    body_section = root.find('body')
    if body_section is not None:
        for sec in body_section.findall('sec'):
            section_data = {
                'title': sec.find('title').text if sec.find('title') is not None else None,
                'content': [p.text for p in sec.findall('p') if p.text]
            }
            data['body'].append(section_data)

    # Return the extracted data
    return data
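
One caveat about `.text` in the function above: in ElementTree, `element.text` holds only the text that comes before the element's first child. A title or paragraph containing inline markup (an italicized species name, a citation link) is therefore cut off at that point, which likely explains the truncated titles and the paragraphs ending in `[` in the outputs further down. The sketch below uses a made-up XML snippet and a hypothetical helper, `full_text`, to show the difference; it is one possible way to recover the full text, not a change to the notebook's pipeline.

```python
import xml.etree.ElementTree as ET

# Made-up snippet: a title containing an inline <italic> child element
snippet = "<article-title>Detection of <italic>Salmonella</italic> in Pigs</article-title>"
elem = ET.fromstring(snippet)

def full_text(element):
    """Hypothetical helper: join every text node under `element`, including child tails."""
    return ''.join(element.itertext())

print(repr(elem.text))        # only the text before the first child element
print(repr(full_text(elem)))  # the whole title, with inline markup flattened
```

Using `itertext()` would change the extracted titles and paragraphs, so the outputs shown in this notebook would no longer match if it were swapped in.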

# Dataset Generation

In [4]:
# Extract data from the XML files
# Create a list to store the data
data = []
# Get the path of the XML files
path = './data'

# Get the list of the XML files
files = os.listdir(path)

# Loop through the XML files
for file in files:
    # Extract data from the XML file
    data.append(extract_data(os.path.join(path, file)))
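
The loop above passes every entry of the folder to the XML parser, so a stray non-XML file or a malformed article would stop it with an exception. A minimal sketch of a more defensive listing step follows. It assumes the articles use a `.xml` extension, which the notebook does not confirm, and the helper names `list_xml_files` and `parse_or_none` are hypothetical.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

def list_xml_files(folder):
    """Hypothetical helper: return the sorted paths of files ending in .xml."""
    return sorted(
        os.path.join(folder, name)
        for name in os.listdir(folder)
        if name.lower().endswith('.xml')
    )

def parse_or_none(path):
    """Hypothetical helper: return the parsed root, or None if the XML is malformed."""
    try:
        return ET.parse(path).getroot()
    except ET.ParseError:
        return None

# Demo on a temporary folder: one valid file, one broken file, one non-XML file
with tempfile.TemporaryDirectory() as folder:
    samples = [('a.xml', '<article/>'), ('b.xml', '<article>'), ('.DS_Store', 'x')]
    for name, content in samples:
        with open(os.path.join(folder, name), 'w', encoding='utf-8') as fh:
            fh.write(content)
    paths = list_xml_files(folder)
    roots = [parse_or_none(p) for p in paths]

names = [os.path.basename(p) for p in paths]
tags = [r.tag if r is not None else None for r in roots]
print(names, tags)
```

Sorting the paths makes the row order reproducible across machines, since `os.listdir` returns entries in arbitrary order.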

In [5]:
# Convert the list of dictionaries to a Pandas DataFrame
df = pd.DataFrame(data)
# Save the DataFrame as a JSON file
df.to_json('data.json', orient='records')
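
With `orient='records'`, the file holds one JSON object per article, and the nested `abstract` dictionaries and `keywords` lists are written as nested JSON. The sketch below uses made-up records (not the notebook's data) and an in-memory string instead of `data.json` to show what the serialized form looks like when read back.

```python
import json
import pandas as pd

# Made-up records mimicking the structure returned by extract_data
records = [
    {'title': 'Paper A', 'abstract': {'abstract': 'Text A'}, 'keywords': ['x', 'y']},
    {'title': 'Paper B', 'abstract': {}, 'keywords': []},
]
df_demo = pd.DataFrame(records)

# Without a path, to_json returns the JSON text instead of writing a file
json_text = df_demo.to_json(orient='records')

# The records layout is a plain list of objects, readable with the json module
loaded = json.loads(json_text)
print(loaded[0]['keywords'])
```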

In [6]:
# Iterate over the data list and print the title of each article
for article in data:
    print(article['title'])

A Case of Elephantiasis—Successful Recovery
Owners’ Beliefs regarding the Emotional Capabilities of Their Dogs and Cats
Hepatitis E Virus (HEV) Spreads from Pigs and Sheep in Mongolia
Goose Meat as a Source of Dietary Manganese—A Systematic Review
Feral Kinetics and Cattle Research Within Planetary Boundaries
Rotational Grazing Modifies 
Spatial and Temporal Variability of Trace and Macro Elements in the Red Crab 
Effects of Different Bedding Materials on Production Performance, Lying Behavior and Welfare of Dairy Buffaloes
Preweaning Nutrition and Its Effects on the Growth, Immune Competence and Metabolic Characteristics of the Dairy Calf
PolarBearVidID: A Video-Based Re-Identification Benchmark Dataset for Polar Bears
Establishment of a Real-Time PCR Assay for the Detection of 
Dietary Protein Requirement of Juvenile Dotted Gizzard Shad 
Differential Impacts of Cereal and Protein Sources Fed to Pigs after Weaning on Diarrhoea and Faecal Shedding of 
Selection of Appropriate Dogs to B

In [7]:
# Print the abstract of each article
for article in data:
    print(article['abstract'])

{}
{'simple_summary': 'Understanding how pet dogs and cats are feeling is very difficult and getting it wrong could result in welfare issues for the animals and the risk of injury for humans. Scientific research on pet emotion is in its early stages and pet owners are currently one of the best sources of information because they spend so much time with their animals. In this online survey, 438 owners were asked whether their dogs and/or cats could express 22 different emotions. If they answered ‘yes’, they were then asked how they identify that emotion in their pet. Owners believed dogs could feel more emotions than cats, and that they could use different sets of behavioral signs to identify different dog/cat emotions. The number of reported dog emotions tended to increase with the owner’s increased personal experience with dogs but decreased with the owner’s increased professional experience with dogs. Owners who owned both cats and dogs believed that cats could feel fewer emotions th

In [8]:
# Print the body of each article
for article in data:
    print(article['body'])

[]
[{'title': '1. Introduction', 'content': ['Dogs and cats may provide a wide range of health, emotional, behavioral, cognitive, educational, and social benefits to humans, and support economic growth [', 'Considering the importance of understanding emotion in human–animal interactions, we know surprisingly little about what extent dogs and cats express a typical range of emotions and what behavioral signs are indicative to humans for potentially detecting these emotions. This lack of information may be due to inherent challenges in studying emotion in nonhuman animals, as their behavior and communicative signals vary greatly from our own and we cannot rely on language to help bridge the gap. Furthermore, human emotion perception and processes seem to be preferentially adapted to conspecifics, but we may adopt an identical or similar process to perceive and interpret both conspecific and heterospecific emotional cues [', 'However, regardless of their ability or performance in identify

In [9]:
# Walk through body to see how its paragraphs can be combined into a single text
# Loop over all the dictionaries inside body, which are the sections
# Each section is a dictionary with the keys title and content
# title holds a string with the section title
# content is a list of strings that must be combined into a single text
# Loop over content and join all the strings into a single text

for article in data:
    # Print each element of body inside the article
    for section in article["body"]:
        # Print the title of the section
        print(section["title"])
        # Print the content of the section
        print(section["content"])

1. Introduction
['Dogs and cats may provide a wide range of health, emotional, behavioral, cognitive, educational, and social benefits to humans, and support economic growth [', 'Considering the importance of understanding emotion in human–animal interactions, we know surprisingly little about what extent dogs and cats express a typical range of emotions and what behavioral signs are indicative to humans for potentially detecting these emotions. This lack of information may be due to inherent challenges in studying emotion in nonhuman animals, as their behavior and communicative signals vary greatly from our own and we cannot rely on language to help bridge the gap. Furthermore, human emotion perception and processes seem to be preferentially adapted to conspecifics, but we may adopt an identical or similar process to perceive and interpret both conspecific and heterospecific emotional cues [', 'However, regardless of their ability or performance in identifying animal emotions, pet own

In [10]:
def combine_body_content(body_list):
    combined_content = []

    # Check that body is a list
    if isinstance(body_list, list):
        # Loop over all the dictionaries inside body, which are the sections
        for section in body_list:
            # Get the title and the content of the section, if present
            title = section.get('title')
            content = ' '.join(section.get('content', []))
            # Join title and content with a space and append to the combined content
            combined_section = ' '.join(filter(None, [title, content])).strip()
            combined_content.append(combined_section)
    # Join all sections into a single space-separated string
    return ' '.join(combined_content)

# Apply the function to the "body" column of the DataFrame
df['combined_body'] = df['body'].apply(combine_body_content)

In [11]:
# Check the content of the new "combined_body" column
print(df['combined_body'].head())

0                                                     
1    1. Introduction Dogs and cats may provide a wi...
2    1. Introduction The hepatitis E virus (family ...
3    1. Introduction Manganese (Mn) is one of the e...
4    1. A Commentary on Feed Additives as Solutions...
Name: combined_body, dtype: object
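
Row 0 is blank because that article has an empty body list, and joining an empty list yields an empty string. Within each section, `filter(None, ...)` drops a missing title or empty content before joining, so no stray spaces or `None` values end up in the text. A small sketch of that idiom, with a hypothetical helper `join_nonempty` and made-up inputs:

```python
def join_nonempty(parts):
    """Hypothetical helper: join the truthy items of `parts` with single spaces."""
    return ' '.join(filter(None, parts)).strip()

with_title = join_nonempty(['1. Introduction', 'Dogs and cats ...'])
no_title = join_nonempty([None, 'Untitled section text'])
nothing = join_nonempty([None, ''])

print(repr(with_title))
print(repr(no_title))
print(repr(nothing))
```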


In [12]:
# Print the keywords of each article
for article in data:
    print(article['keywords'])

[]
['emotion', 'expression', 'dog', 'cat', 'owner perception', 'welfare']
['hepatitis E', 'pig', 'sheep', 'prevalence', 'Mongolia', 'phylogenetic analysis']
['goose', 'meat', 'manganese', 'thermal treatment', 'adequate intake', 'reference values-requirements', 'PRISMA']
['critical kinetics', 'planetary boundaries', 'technical solutions', 'feed additives', 'livestock', 'Jevons’ paradox', 'Denmark', 'anthropology']
['cattle', 'ectoparasites', 'control', 'grasslands', 'ticks']
['crustacean', 'marine economic resource', 'health risk', 'marine pollution', 'El Niño Southern Oscillation']
['bedding material', 'buffalo', 'behaviororistics', 'milk yeild', 'animal welfare']
['immune challenge', 'accelerated milk feeding', 'heifer calf development']
['deep learning', 're-identification', 'computer vision', 'animal identification', 'animal welfare', 'automated behavior analysis', 'motion features', 'video-based method', 'dataset']
['polymerase chain reaction', 'bearded dragon', 'lizard', 'reptile'

In [13]:
print(type(df.loc[0, 'abstract']))

<class 'dict'>


In [14]:
print(df.dtypes)

abstract         object
body             object
keywords         object
title            object
combined_body    object
dtype: object


In [15]:
# Combine "simple summary" and "abstract", handling None values
def combine_abstract(abstract):
    if isinstance(abstract, dict):
        simple_summary = abstract.get('simple_summary') or ''  # Fall back to an empty string if the value is None
        abstract_text = abstract.get('abstract') or ''  # Fall back to an empty string if the value is None
        return ' '.join([simple_summary, abstract_text]).strip()
    return ''

# Apply the function to each row of the "abstract" column
df['combined_abstract'] = df['abstract'].apply(combine_abstract)

# Check the result
print(df['combined_abstract'].head())

0                                                     
1    Understanding how pet dogs and cats are feelin...
2    Hepatitis E virus (HEV) is a zoonotic pathogen...
3    Manganese is a trace element with many critica...
4    This commentary advocates a note of caution wi...
Name: combined_abstract, dtype: object
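
The `or ''` fallback in `combine_abstract` matters because a key can be present with the value `None`, for instance when the `<p>` element of an abstract section has no leading text. In that case the default argument of `dict.get` is not used. A short illustration with a made-up dictionary:

```python
# Made-up abstract entry where the key exists but its value is None
abstract = {'simple_summary': None, 'abstract': 'Main abstract text.'}

with_default = abstract.get('simple_summary', '')  # key exists, so None is returned
with_or = abstract.get('simple_summary') or ''     # None is replaced by ''

# Joining works only with the second form; the first would raise a TypeError
combined = ' '.join([with_or, abstract['abstract']]).strip()
print(repr(with_default), repr(with_or), repr(combined))
```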


In [16]:
print(df.dtypes)

abstract             object
body                 object
keywords             object
title                object
combined_body        object
combined_abstract    object
dtype: object


In [17]:
# Print the first elements of the "combined_abstract" column
print(df['combined_abstract'].head())

# Check the type of the first element of the "combined_abstract" column
print(type(df.loc[0, 'combined_abstract']))

0                                                     
1    Understanding how pet dogs and cats are feelin...
2    Hepatitis E virus (HEV) is a zoonotic pathogen...
3    Manganese is a trace element with many critica...
4    This commentary advocates a note of caution wi...
Name: combined_abstract, dtype: object
<class 'str'>


# Text Classification

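For the text classification task we use BERT (`bert-base-uncased`). Each combined abstract is tokenized and padded or truncated to a fixed length of 512 tokens.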

## Pre-Processing

In [20]:
# Load the BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Tokenize a text with BERT
def tokenize_with_bert(text):
    return tokenizer.encode_plus(
        text,
        add_special_tokens=True,  # Add '[CLS]' and '[SEP]'
        max_length=512,  # Maximum sequence length in tokens
        padding='max_length',  # Pad up to the maximum length
        truncation=True,  # Truncate tokens beyond the maximum length
        return_attention_mask=True,  # Return the attention mask
        return_tensors='pt'  # Return PyTorch tensors
    )

# Tokenize the combined abstract
df['bert_input'] = df['combined_abstract'].apply(tokenize_with_bert)

# Extract the token IDs and attention masks for training
input_ids = torch.cat([item['input_ids'] for item in df['bert_input']])
attention_masks = torch.cat([item['attention_mask'] for item in df['bert_input']])

tokenizer_config.json: 100%|██████████| 28.0/28.0 [00:00<00:00, 8.68kB/s]
vocab.txt: 100%|██████████| 232k/232k [00:00<00:00, 3.58MB/s]
tokenizer.json: 100%|██████████| 466k/466k [00:00<00:00, 1.20MB/s]
config.json: 100%|██████████| 570/570 [00:00<00:00, 482kB/s]
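
Each call in the cell above returns tensors of shape `(1, 512)`, so after `torch.cat` both `input_ids` and `attention_masks` have shape `(number of articles, 512)`. To make the roles of `padding='max_length'` and `truncation=True` concrete, here is a simplified pure-Python illustration with a hypothetical function `pad_or_truncate` and made-up token IDs. It is not the tokenizer's implementation: the real tokenizer also keeps the closing `[SEP]` token when it truncates.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Illustrative only: fix a list of token IDs to max_length and build its attention mask."""
    kept = token_ids[:max_length]          # truncation: drop tokens past max_length
    n_pad = max_length - len(kept)         # padding needed to reach max_length
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask

ids = [101, 7592, 2088, 102]  # made-up sequence of four token IDs

padded_ids, padded_mask = pad_or_truncate(ids, max_length=6)
short_ids, short_mask = pad_or_truncate(ids, max_length=3)

print(padded_ids, padded_mask)
print(short_ids, short_mask)
```

The attention mask tells the model which positions hold real tokens, so the padding added to short abstracts does not influence the result.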


## Modelling