<a href="https://colab.research.google.com/github/traopia/KGNarrative/blob/master/esperiments.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Mining News' Descriptions

In [1]:
!pip install transformers 
!pip install datasets
!pip install pynvml
!pip install evaluate 
!pip install sentencepiece

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [2]:
import os
import sys

import evaluate
import nltk
import numpy as np
import pandas as pd
import torch
import transformers
# `load_metric` is no longer imported: it was removed from recent `datasets`
# releases, and metrics are loaded through the `evaluate` package instead.
from datasets import Dataset, DatasetDict, load_dataset
from pynvml import nvmlInit, nvmlDeviceGetHandleByIndex, nvmlDeviceGetMemoryInfo
from sklearn.model_selection import train_test_split
from transformers import AutoTokenizer, AutoModelForSequenceClassification

In [3]:
#! git clone https://github.com/traopia/KGNarrative.git

In [4]:
# Utility functions for checking GPU availability, selecting a device, and reporting GPU memory

def check_gpu_availability():
    # Check if CUDA is available
    print(f"Cuda is available: {torch.cuda.is_available()}")

def getting_device(gpu_prefence=True) -> torch.device:
    """
    This function gets the torch device to be used for computations, 
    based on the GPU preference specified by the user.
    """
    
    # If GPU is preferred and available, set device to CUDA
    if gpu_prefence and torch.cuda.is_available():
        device = torch.device('cuda')
    # If GPU is not preferred or not available, set device to CPU
    else: 
        device = torch.device("cpu")
    
    # Print the selected device
    print(f"Selected device: {device}")
    
    # Return the device
    return device

# Define a function to print GPU memory utilization
def print_gpu_utilization():
    # Initialize the PyNVML library
    nvmlInit()
    # Get a handle to the first GPU in the system
    handle = nvmlDeviceGetHandleByIndex(0)
    # Get information about the memory usage on the GPU
    info = nvmlDeviceGetMemoryInfo(handle)
    # Print the GPU memory usage in MB
    print(f"GPU memory occupied: {info.used//1024**2} MB.")

# Define a function to print training summary information
def print_summary(result):
    # Print the total training time in seconds
    print(f"Time: {result.metrics['train_runtime']:.2f}")
    # Print the number of training samples processed per second
    print(f"Samples/second: {result.metrics['train_samples_per_second']:.2f}")
    # Print the GPU memory utilization
    print_gpu_utilization()

In [5]:
# CHECK IF GPU IS UP
check_gpu_availability()



Cuda is available: True


In [6]:
# SAVE THE DEVICE WE ARE WORKING WITH
device = getting_device(gpu_prefence=True)

Selected device: cuda


#### Importing documents


In [7]:
path2file = "/content/DWIE_train_topic_news.csv"
df = pd.read_csv(path2file)


In [8]:
df.head(10)

Unnamed: 0.1,Unnamed: 0,story,Instances Knowledge Graph,Types Knowledge Graph,Subclass Knowledge Graph,predicted_label1
0,0,Multi-lingual in the World Wide Web Profession...,Personal Translator - product_of - Linguatec |...,Personal Translator - type - entity | Personal...,misc - subclass_of - entity | product - subcla...,tech
1,1,German Know-How For China's Energy Sector Ener...,Shangdong Linuo Paradigma - based_in0 - German...,China - type - entity | China - type - gpe | C...,gpe - subclass_of - location | gpe0 - subclass...,business
2,2,Trash crisis forces Lebanon's environmental aw...,Kassem Kazak - citizen_of - Lebanon | Kassem K...,Kareem Chehayeb - type - entity | Kareem Cheha...,journalist - subclass_of - per | manager - sub...,tech
3,3,"Iran calls for end to Saudi air campaign, as U...",Houthi - based_in0 - Yemen | Houthi - based_in...,Medecins Sans Frontieres - type - entity | Med...,ngo - subclass_of - org | gpe - subclass_of - ...,business
4,4,Ai Weiwei Drifting Ai Weiwei: Uncomfortable cr...,DW (Deutsch+) - based_in0 - Germany | Ai Weiwe...,DW (Deutsch+) - type - entity | DW (Deutsch+) ...,media - subclass_of - org | activist - subclas...,entertainment
5,5,"Journal Interview with Günter Nooke, The Chanc...",Günter Nooke - agent_of - Germany | Günter Noo...,Günter Nooke - type - entity | Günter Nooke - ...,gov_per - subclass_of - per | location - subcl...,politics
6,6,Turkey summons German envoy for second time in...,Martin Erdmann - agent_of - Germany | Martin E...,European Union - type - entity | European Unio...,igo - subclass_of - org | so - subclass_of - i...,politics
7,7,Medvedev seeks investment at Davos economic fo...,Dmitry Medvedev - agent_of - Russia | Dmitry M...,Dmitry Medvedev - type - entity | Dmitry Medve...,head_of_gov - subclass_of - politician | polit...,business
8,8,Ex-CIA fugitive Robert Seldon Lady detained in...,Osama Moustafa Hassan Nasr - citizen_of - Egyp...,Osama Moustafa Hassan Nasr - type - clergy | O...,offender - subclass_of - per | agency - subcla...,business
9,9,"Nearly 10,000 migrants rescued on Mediterranea...",Rome - in0 - Italy | Rome - in0-x - Italian | ...,Strait of Sicily - type - entity | Strait of S...,location - subclass_of - entity | waterbody - ...,business
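The knowledge-graph columns shown above appear to store triples as `subject - predicate - object`, with triples separated by `|`. The following is a minimal sketch for turning one such cell into Python tuples, assuming that layout holds for the whole file. `parse_triples` is a hypothetical helper and not part of the original notebook.

```python
def parse_triples(kg_text):
    """Split a 'subject - predicate - object | ...' string into 3-tuples.

    Assumption: fields are separated by ' - ' and triples by '|'.
    Chunks that do not split into exactly three non-empty fields
    (for example an entity name that itself contains ' - ') are skipped.
    """
    triples = []
    for chunk in kg_text.split("|"):
        parts = [part.strip() for part in chunk.strip().split(" - ")]
        if len(parts) == 3 and all(parts):
            triples.append(tuple(parts))
    return triples


example = "Rome - in0 - Italy | Rome - in0-x - Italian"
print(parse_triples(example))
```

On the real data this could be applied to a single cell, for example `parse_triples(df["Instances Knowledge Graph"][0])`.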


#### News Summarization


In [9]:
model_nm1 = 'google/pegasus-multi_news'

In [10]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, AutoConfig

# Load the tokenizer and model from Hugging Face
tokenizer = AutoTokenizer.from_pretrained(model_nm1)
model = AutoModelForSeq2SeqLM.from_pretrained(model_nm1).to(device)

In [11]:
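# print_gpu_utilization() prints its own message and returns None,
# so it is called directly rather than wrapped in print()
print_gpu_utilization()

GPU memory occupied: 3017 MB.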


In [12]:
# Create a list to hold the summaries
summaries = []


In [13]:
# Loop through the stories
for story in df["story"]:
    # Tokenize the story
    inputs = tokenizer.encode(story, return_tensors="pt", max_length=1024, truncation=True).to(device)

    # Generate the summary
    outputs = model.generate(inputs, max_length=50, min_length=1, length_penalty=15.0, num_beams=4, early_stopping=True)
    
    # Decode the summary and add it to the list
    summary = tokenizer.decode(outputs[0], skip_special_tokens=True)
    summaries.append(summary)
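The loop above calls `generate` once per story. Summarizing several stories per call may improve GPU throughput, at the cost of more memory per step. The sketch below shows one way this could look, reusing the same generation settings. `chunked` and `summarize_in_batches` are hypothetical helpers, and the default `batch_size=4` is an assumption that would need tuning to the available GPU memory.

```python
def chunked(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def summarize_in_batches(stories, tokenizer, model, device, batch_size=4):
    """Summarize stories in batches (sketch; not the notebook's original loop)."""
    summaries = []
    for batch in chunked(list(stories), batch_size):
        # Pad to the longest story in the batch and truncate to 1024 tokens
        inputs = tokenizer(
            batch,
            return_tensors="pt",
            max_length=1024,
            truncation=True,
            padding=True,
        ).to(device)
        # Same generation settings as the per-story loop
        outputs = model.generate(
            **inputs,
            max_length=50,
            min_length=1,
            length_penalty=15.0,
            num_beams=4,
            early_stopping=True,
        )
        summaries.extend(tokenizer.batch_decode(outputs, skip_special_tokens=True))
    return summaries
```

With the objects defined earlier, the call would be `summaries = summarize_in_batches(df["story"], tokenizer, model, device)`.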

In [14]:
# Add the summaries to the dataframe
df["core description"] = summaries

In [15]:
df["story"][0]

'Multi-lingual in the World Wide Web Professional translators and interpreters beware! There’s a new product out at CeBIT this year and it’s set to make your job obsolete - or at least the more mundane portion of business correspondence. Imagine sitting in an office in Berlin, Paris or New York, typing an e-mail to an international colleague and getting a reply - all in one language. Sure, it’s no problem if everyone speaks the same language. But what happens when three different languages are involved? You get a translator, right? Or you spend a little more time flipping through an English-German-French dictionary. Too expensive, too time consuming? Well, there’s always the option of learning a foreign language. In the global business community, bilungualism is not a bad investment. But for those of you who didn’t memorize the German irregular verbs in school and who still can’t be bothered to remember the gender of French nouns, there’s a technical solution: the Personal Translator b

In [20]:
df["core description"][100]

"– The Syrian Observatory for Human Rights says the death toll from a bus bombing yesterday that targeted evacuees from two towns in the country's north has risen to 126, with 68 children among the dead. The number of wounded"
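The summary above starts with a dash and stops mid-sentence, which is consistent with generation being capped at `max_length=50` tokens. If complete sentences are preferred, one option is a small post-processing step. The sketch below is an assumption about the desired cleanup, not something the original notebook does, and `clean_summary` is a hypothetical name. It cuts at the last `.`, `!`, or `?`, so a trailing abbreviation such as "U.S." would be treated as a sentence end.

```python
import re


def clean_summary(text):
    """Strip a leading dash and drop a trailing incomplete sentence."""
    # Remove leading whitespace and dash characters (hyphen, en dash, em dash)
    text = re.sub(r"^[\s\u2013\u2014-]+", "", text)
    # Index of the last sentence-ending mark, or -1 if there is none
    last_end = max(text.rfind(mark) for mark in ".!?")
    if last_end == -1:
        # No complete sentence found: return the text unchanged
        return text.strip()
    return text[: last_end + 1]
```

It could be applied to the whole column with `df["core description"].map(clean_summary)`.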

In [22]:
# Save the updated dataframe
df.to_csv("your_updated_dataframe.csv", index=False)
