In [4]:
import pandas as pd

# Read the CSV file; on_bad_lines='skip' drops lines with an irregular
# structure instead of raising an error.
df = pd.read_csv('papers.csv', sep=",", engine='python', on_bad_lines='skip')
df.head()  # display the first 5 rows

,id,year,title,event_type,pdf_name,abstract,paper_text
0,1,1987,Self-Organization of Associative Database and ...,,1-self-organization-of-associative-database-an...,Abstract Missing,767\n\nSELF-ORGANIZATION OF ASSOCIATIVE DATABA...
1,10,1987,A Mean Field Theory of Layer IV of Visual Cort...,,10-a-mean-field-theory-of-layer-iv-of-visual-c...,Abstract Missing,683\n\nA MEAN FIELD THEORY OF LAYER IV OF VISU...
2,100,1988,Storing Covariance by the Associative Long-Ter...,,100-storing-covariance-by-the-associative-long...,Abstract Missing,394\n\nSTORING COVARIANCE BY THE ASSOCIATIVE\n...
3,1000,1994,Bayesian Query Construction for Neural Network...,,1000-bayesian-query-construction-for-neural-ne...,Abstract Missing,Bayesian Query Construction for Neural\nNetwor...
4,1001,1994,"Neural Network Ensembles, Cross Validation, an...",,1001-neural-network-ensembles-cross-validation...,Abstract Missing,"Neural Network Ensembles, Cross\nValidation, a..."
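
Because `on_bad_lines='skip'` drops rows silently, it can help to know how many rows are affected. pandas treats a line with too many fields as a bad line, so a rough count can be made with the standard `csv` module. This is only a sketch: the helper name `count_overlong_rows` and the in-memory sample are made up for illustration, and the count may not match pandas exactly on files with unusual quoting.

```python
import csv
import io


def count_overlong_rows(handle):
    """Count data rows that have more fields than the header row."""
    reader = csv.reader(handle)
    header = next(reader)  # first row defines the expected field count
    return sum(1 for row in reader if len(row) > len(header))


# Hypothetical sample: the second data row has one field too many.
sample = io.StringIO("id,title\n1,ok\n2,too,many\n3,fine\n")
print(count_overlong_rows(sample))
```

For the real file, the same helper could be called on `open('papers.csv', newline='', encoding='utf-8')`, assuming the file is UTF-8 encoded.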


In [5]:
df['paper_text'] #accessing paper text column

,paper_text
0,767\n\nSELF-ORGANIZATION OF ASSOCIATIVE DATABA...
1,683\n\nA MEAN FIELD THEORY OF LAYER IV OF VISU...
2,394\n\nSTORING COVARIANCE BY THE ASSOCIATIVE\n...
3,Bayesian Query Construction for Neural\nNetwor...
4,"Neural Network Ensembles, Cross\nValidation, a..."
...,...
7679,Single Transistor Learning Synapses\n\nPaul Ha...
7680,"Bias, Variance and the Combination of\nLeast S..."
7681,A Real Time Clustering CMOS\nNeural Engine\nT....
7682,Learning direction in global motion: two\nclas...


In [6]:
# Retrieve the paper text at a specific row index
paper_index = 4
text = df.loc[paper_index, 'paper_text']
text

'Neural Network Ensembles, Cross\nValidation, and Active Learning\n\nAnders Krogh"\nNordita\nBlegdamsvej 17\n2100 Copenhagen, Denmark\n\nJesper Vedelsby\nElectronics Institute, Building 349\nTechnical University of Denmark\n2800 Lyngby, Denmark\n\nAbstract\nLearning of continuous valued functions using neural network ensembles (committees) can give improved accuracy, reliable estimation of the generalization error, and active learning. The ambiguity\nis defined as the variation of the output of ensemble members averaged over unlabeled data, so it quantifies the disagreement among\nthe networks. It is discussed how to use the ambiguity in combination with cross-validation to give a reliable estimate of the ensemble\ngeneralization error, and how this type of ensemble cross-validation\ncan sometimes improve performance. It is shown how to estimate\nthe optimal weights of the ensemble members using unlabeled data.\nBy a generalization of query by committee, it is finally shown how\nthe am

In [7]:
#importing necessary libraries
import re
import nltk
import spacy
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

#loading spacy model
nlp = spacy.load("en_core_web_sm")

nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [8]:
def txt_processing(doc_text):
    """Clean raw text and return (tokens, cleaned string)."""
    doc_text = doc_text.lower()  # lowercase to eliminate case variations
    doc_text = BeautifulSoup(doc_text, "html.parser").text  # strip HTML tags
    doc_text = re.sub(r'\d+', '', doc_text)  # remove digits
    doc_text = re.sub(r'http\S+|www\S+', '', doc_text)  # remove links (http\S+ also covers https)
    # Remove a single character plus period that is directly followed by another
    # word character (the "e." in "e.g"); initials followed by a space are kept.
    doc_text = re.sub(r'\b\w\.\b', '', doc_text)
    doc_text = re.sub(r'\(.*?\)', '', doc_text)  # remove parentheses and their same-line contents

    tokens = word_tokenize(doc_text)  # tokenization

    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]  # drop stop (filler) words

    cleaned_doc = ' '.join(tokens)  # rejoin the tokens into one string

    return tokens, cleaned_doc

In [9]:
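# Clean the raw paper text; `tokens` is the token list and `doc` the cleaned string.
# No spaCy pass is needed here, since txt_processing works on the plain string.
tokens, doc = txt_processing(text)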

In [10]:
print(doc)

neural network ensembles , cross validation , active learning anders krogh '' nordita blegdamsvej copenhagen , denmark jesper vedelsby electronics institute , building technical university denmark lyngby , denmark abstract learning continuous valued functions using neural network ensembles give improved accuracy , reliable estimation generalization error , active learning . ambiguity defined variation output ensemble members averaged unlabeled data , quantifies disagreement among networks . discussed use ambiguity combination cross-validation give reliable estimate ensemble generalization error , type ensemble cross-validation sometimes improve performance . shown estimate optimal weights ensemble members using unlabeled data . generalization query committee , finally shown ambiguity used select new training data labeled active learning scheme . introduction well known combination many different predictors improve predictions . neural networks community `` ensembles '' neural networks 
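
The regex steps inside `txt_processing` can be tried in isolation on a short string. The sketch below uses only the standard library and is not the function above: the name `strip_noise` and the sample sentence are made up, links are removed before digits, and a whitespace-collapsing step is added as an assumption about what a reader would want to see.

```python
import re


def strip_noise(text):
    """Lowercase text and remove links, digits and parenthesised spans."""
    text = text.lower()
    text = re.sub(r'http\S+|www\S+', '', text)  # links first, while they are intact
    text = re.sub(r'\d+', '', text)             # digits
    text = re.sub(r'\(.*?\)', '', text)         # parentheses and their contents
    return re.sub(r'\s+', ' ', text).strip()    # collapse leftover whitespace


sample = "See (Fig. 2) at http://example.com for 42 Results"
print(strip_noise(sample))  # see at for results
```

Stop-word removal is left out here because it depends on the NLTK corpus downloaded earlier.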

In [11]:
# Parse the cleaned string with spaCy, then print a token / part-of-speech table
doc = nlp(doc)
print(f"{'Token':<15} {'Part-of-Speech':<15}")
print("-" * 30)
for token in doc[:50]:  # first 50 tokens only
    print(f"{token.text:<15} {token.pos_:<15}")

Token           Part-of-Speech 
------------------------------
neural          ADJ            
network         NOUN           
ensembles       NOUN           
,               PUNCT          
cross           VERB           
validation      NOUN           
,               PUNCT          
active          ADJ            
learning        VERB           
anders          NOUN           
krogh           PROPN          
''              PUNCT          
nordita         PROPN          
blegdamsvej     PROPN          
copenhagen      PROPN          
,               PUNCT          
denmark         PROPN          
jesper          PROPN          
vedelsby        PROPN          
electronics     PROPN          
institute       PROPN          
,               PUNCT          
building        VERB           
technical       ADJ            
university      PROPN          
denmark         NOUN           
lyngby          NOUN           
,               PUNCT          
denmark         VERB           
abstract 

In [12]:
from string import punctuation

punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [13]:

allowed_pos = ['ADJ', 'VERB', 'NOUN', 'INTJ']  # part-of-speech tags to keep
tokens_1 = []

for token in doc:  # iterate over every token in the parsed document
    if token.text in punctuation:  # skip punctuation (substring test against the string above)
        continue
    elif token.pos_ in allowed_pos:
        tokens_1.append(token.text)

print(tokens_1, end=" ")  # print the filtered tokens

['neural', 'network', 'ensembles', 'cross', 'validation', 'active', 'learning', 'anders', 'building', 'technical', 'denmark', 'lyngby', 'denmark', 'learning', 'continuous', 'valued', 'functions', 'using', 'neural', 'network', 'ensembles', 'give', 'improved', 'accuracy', 'reliable', 'estimation', 'generalization', 'error', 'active', 'learning', 'ambiguity', 'defined', 'variation', 'output', 'ensemble', 'members', 'averaged', 'unlabeled', 'data', 'quantifies', 'disagreement', 'networks', 'discussed', 'use', 'ambiguity', 'combination', 'cross', 'validation', 'give', 'reliable', 'estimate', 'ensemble', 'generalization', 'error', 'type', 'ensemble', 'cross', 'validation', 'improve', 'performance', 'shown', 'estimate', 'optimal', 'weights', 'ensemble', 'members', 'using', 'unlabeled', 'data', 'generalization', 'committee', 'shown', 'ambiguity', 'used', 'select', 'new', 'training', 'data', 'labeled', 'active', 'learning', 'scheme', 'introduction', 'known', 'combination', 'many', 'different', 
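
One detail of the filter above: `token.text in punctuation` is a substring test against a single string, not a membership test against a collection of characters. In this notebook the part-of-speech check catches whatever slips through, but the difference is easy to see in isolation. The helper `is_all_punct` below is a hypothetical alternative, not part of the original code.

```python
from string import punctuation

PUNCT_CHARS = set(punctuation)


def is_all_punct(text):
    """True if text is non-empty and every character is ASCII punctuation."""
    return bool(text) and all(ch in PUNCT_CHARS for ch in text)


# Substring test: the two-apostrophe token is missed, the empty string matches.
print("''" in punctuation)  # False
print("" in punctuation)    # True

# Character-wise test handles both cases.
print(is_all_punct("''"))   # True
print(is_all_punct(""))     # False
```

spaCy tokens also expose an `is_punct` attribute, which could be used instead of either test.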

In [14]:
#importing counter to count frequency
from collections import Counter

word_freq=Counter(tokens_1)
print(word_freq,end="\t") #printing word frequency

Counter({'ensemble': 45, 'generalization': 42, 'error': 40, 'networks': 37, 'ambiguity': 26, 'learning': 22, 'training': 22, 'weights': 21, 'individual': 19, 'neural': 18, 'cross': 18, 'validation': 18, 'active': 18, 'examples': 18, 'set': 16, 'network': 14, 'input': 14, 'average': 13, 'ensembles': 12, 'shown': 12, 'errors': 12, 'e': 11, 'figure': 11, 'function': 10, 'sets': 10, 'size': 10, 'unlabeled': 9, 'used': 9, 'weighted': 9, 'using': 8, 'data': 8, 'use': 8, 'estimate': 8, 'optimal': 8, 'different': 8, 'estimated': 8, 'output': 7, 'possible': 7, 'variance': 7, 'results': 7, 'equation': 7, 'test': 7, 'line': 7, 'shows': 7, 'members': 6, 'combination': 6, 'trained': 6, 'term': 6, 'see': 5, 'simple': 5, 'distribution': 5, 'random': 5, 'good': 5, 'described': 5, 'estimates': 5, 'passive': 5, 'anders': 4, 'disagreement': 4, 'known': 4, 'many': 4, 'several': 4, 'disagree': 4, 'information': 4, 'j': 4, 'v': 4, 'find': 4, 'correlation': 4, 'example': 4, 'way': 4, 'large': 4, 'square': 4,
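
Raw counts depend on the length of the paper. A common variant, shown here only as a sketch and not used in the cells that follow, divides every count by the largest count so that all values fall between 0 and 1. The function name `relative_freq` and the toy token list are assumptions for illustration.

```python
from collections import Counter


def relative_freq(tokens):
    """Map each token to its count divided by the highest count."""
    counts = Counter(tokens)
    top = max(counts.values(), default=1)  # default avoids an error on empty input
    return {word: count / top for word, count in counts.items()}


toy_tokens = ["ensemble", "ensemble", "error", "ensemble", "error", "data"]
rel = relative_freq(toy_tokens)
print(rel["ensemble"])  # 1.0

# most_common gives a shorter view than printing the whole Counter
print(Counter(toy_tokens).most_common(2))  # [('ensemble', 3), ('error', 2)]
```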

In [15]:
# Store the sentences of the document as a list
# (doc is already a parsed spaCy Doc, so it does not need to be parsed again)
sent_token = [sent.text for sent in doc.sents]
len(sent_token)

191

In [16]:
sent_score = {}
for sent in sent_token:
    sent_score[sent] = 0  # Initialize the sentence score
    for word in sent.split():
        if word in word_freq:
            sent_score[sent] += word_freq[word]  # Incrementing score by word's frequency

print(sent_score)

{"neural network ensembles , cross validation , active learning anders krogh '' nordita blegdamsvej copenhagen , denmark jesper vedelsby electronics institute , building technical university denmark lyngby , denmark abstract learning continuous valued functions using neural network ensembles give improved accuracy , reliable estimation generalization error , active learning .": 345, 'ambiguity defined variation output ensemble members averaged unlabeled data , quantifies disagreement among networks .': 148, 'discussed use ambiguity combination cross-validation give reliable estimate ensemble generalization error , type ensemble cross-validation sometimes improve performance .': 235, 'shown estimate optimal weights ensemble members using unlabeled data .': 125, 'generalization query committee , finally shown ambiguity used select new training data labeled active learning scheme .': 168, 'introduction well known combination many different predictors improve predictions .': 29, "neural ne

In [17]:
pd.DataFrame(list(sent_score.items()),columns=["sentence",'score']) #converting sentence score dictionary to a data frame

,sentence,score
0,"neural network ensembles , cross validation , ...",345
1,ambiguity defined variation output ensemble me...,148
2,discussed use ambiguity combination cross-vali...,235
3,shown estimate optimal weights ensemble member...,125
4,"generalization query committee , finally shown...",168
...,...,...
186,proceedings fifth workshop computational learn...,28
187,"[ ] y. freund , s. seung , e. shamir , n. tish...",0
188,"information , prediction , query committee .",8
189,advances neural information processing systems...,27
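
Summing word frequencies favours long sentences: the first sentence, which runs the title, author block and start of the abstract together, gets the highest score shown above (345). One possible adjustment, sketched here under the assumption that an average per word is an acceptable score, divides each sum by the sentence length. `score_sentences` and the toy data are hypothetical.

```python
from collections import Counter


def score_sentences(sentences, word_freq, normalize=True):
    """Score each sentence by summed word frequency, optionally per word."""
    scores = {}
    for sent in sentences:
        words = sent.split()
        total = sum(word_freq.get(word, 0) for word in words)
        # Divide by length only when requested and the sentence is not empty
        scores[sent] = total / len(words) if normalize and words else total
    return scores


toy_freq = Counter({"ensemble": 3, "error": 2})
toy_sents = ["ensemble error", "ensemble error of the many long words here"]

print(score_sentences(toy_sents, toy_freq, normalize=False))  # both sentences score 5
print(score_sentences(toy_sents, toy_freq))                   # 2.5 versus 0.625
```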


In [18]:
from heapq import nlargest  # nlargest returns the top N items by key

num_of_sent = 3
top_sentences = nlargest(num_of_sent, sent_score, key=sent_score.get)
" ".join(top_sentences)  # join the top sentences into one string

"holding examples generalization errors individual members ensemble , e ( x , increase , conjecture good choice size ensemble test set size , ambiguity increase thus one get decrease overall generalization error . find ensemble generalization error first term right weighted average generalization errors individual networks , second weighted average ambiguities , refer ensemble ambiguity . neural network ensembles , cross validation , active learning anders krogh '' nordita blegdamsvej copenhagen , denmark jesper vedelsby electronics institute , building technical university denmark lyngby , denmark abstract learning continuous valued functions using neural network ensembles give improved accuracy , reliable estimation generalization error , active learning ."

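`nlargest` returns the sentences from highest to lowest score, which is why the opening sentence of the paper appears last in the summary above. If the original reading order matters, the selected sentences can be sorted by position afterwards. This is a minimal sketch with toy data; `top_sentences_in_order` is a hypothetical helper.

```python
from heapq import nlargest


def top_sentences_in_order(sentences, scores, n):
    """Pick the n best-scoring sentences, then restore document order."""
    best = nlargest(n, range(len(sentences)), key=lambda i: scores[sentences[i]])
    return [sentences[i] for i in sorted(best)]


toy_sents = ["a", "b", "c", "d"]
toy_scores = {"a": 3, "b": 1, "c": 9, "d": 2}

print(nlargest(2, toy_scores, key=toy_scores.get))       # ['c', 'a']
print(top_sentences_in_order(toy_sents, toy_scores, 2))  # ['a', 'c']
```
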
In [42]:
#importing and creating a summarization pipeline
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn", tokenizer="facebook/bart-large-cnn", framework="tf")


All PyTorch model weights were used when initializing TFBartForConditionalGeneration.

All the weights of TFBartForConditionalGeneration were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBartForConditionalGeneration for predictions without further training.


In [43]:
# Join the filtered tokens into one string and display the result
result_text = " ".join(tokens_1)
result_text

"neural network ensembles cross validation active learning anders building technical denmark lyngby denmark learning continuous valued functions using neural network ensembles give improved accuracy reliable estimation generalization error active learning ambiguity defined variation output ensemble members averaged unlabeled data quantifies disagreement networks discussed use ambiguity combination cross validation give reliable estimate ensemble generalization error type ensemble cross validation improve performance shown estimate optimal weights ensemble members using unlabeled data generalization committee shown ambiguity used select new training data labeled active learning scheme introduction known combination many different predictors improve predictions neural networks community ensembles neural networks investigated several authors see instance networks ensemble trained individually predictions combined combination done majority simple averaging use weighted combination networks

In [49]:
from transformers import BartTokenizer
tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
# Tokenize and truncate to 1024 tokens, the maximum input length of this model
inputs = tokenizer(result_text, return_tensors="tf", truncation=True, max_length=1024)

In [None]:
# Decode the truncated input IDs back to text so the pipeline receives a string
# that fits within the model's input limit
input_text = tokenizer.decode(inputs['input_ids'][0], skip_special_tokens=True)

# Generate the summary from the decoded text (greedy decoding, 30 to 150 tokens)
summary = summarizer(input_text, max_length=150, min_length=30, do_sample=False)

# Print the summary
print(summary[0]['summary_text'])
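
With `truncation=True`, only the first 1024 tokens reach the model, so the rest of the paper has no influence on the summary. One possible workaround is to split the text into chunks and summarize each chunk separately. The sketch below covers only the splitting step and rests on a loud assumption: it counts whitespace-separated words as a rough stand-in for BART tokens, so `max_words` would need to be set well below 1024 in practice. `chunk_words` is a hypothetical helper.

```python
def chunk_words(text, max_words):
    """Split text into consecutive chunks of at most max_words words."""
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


sample = "one two three four five six seven"
for chunk in chunk_words(sample, max_words=3):
    print(chunk)
```

Each chunk could then be passed to `summarizer` in turn and the partial summaries joined, at the cost of one model call per chunk.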