# Keyword Extraction using PositionRank

In [1]:
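import re
import subprocess

import pandas as pd
import spacy
import pytextrank  # importing registers the "positionrank" pipeline component
from fuzzywuzzy import process

# Uncomment on first run to download the spaCy model:
# subprocess.run(["spacy","download","en_core_web_sm"])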


## Dataset

In [2]:
# Reading the data 
dataset_csv = "ICMLA_2014_2015_2016_2017.csv"
encoding = "ISO-8859-1"
data_df = pd.read_csv(dataset_csv, encoding=encoding).set_index("paper_id")
data_df.head()

paper_id,title,keywords,abstract,session,year
1,Ensemble Statistical and Heuristic Models for ...,"statistical word alignment, ensemble learning,...",Statistical word alignment models need large a...,Ensemble Methods,2014
2,Improving Spectral Learning by Using Multiple ...,"representation, spectral learning, discrete fo...",Spectral learning algorithms learn an unknown ...,Ensemble Methods,2014
3,Applying Swarm Ensemble Clustering Technique f...,"software defect prediction, particle swarm opt...",Number of defects remaining in a system provid...,Ensemble Methods,2014
4,Reducing the Effects of Detrimental Instances,"filtering, label noise, instance weighting",Not all instances in a data set are equally be...,Ensemble Methods,2014
5,Concept Drift Awareness in Twitter Streams,"twitter, adaptation models, time-frequency ana...",Learning in non-stationary environments is not...,Ensemble Methods,2014


## PositionRank Example

In [3]:
data_df["text"] = data_df["title"] + " " + data_df["abstract"]
corpus = data_df["text"].values

In [4]:
title = data_df["title"].iloc[0]
abstract = data_df["abstract"].iloc[0]
text = f"{title} {abstract}"

In [5]:
# Load a spaCy model (choose one that suits the language and scale)
nlp = spacy.load("en_core_web_sm")
stopwords = nlp.Defaults.stop_words
# Add PyTextRank's PositionRank component to the spaCy pipeline
nlp.add_pipe("positionrank")

<pytextrank.positionrank.PositionRankFactory at 0x1863bdaf790>
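
PositionRank, as originally proposed, biases a PageRank-style ranking toward words that appear early and often: each word is weighted by the sum of the inverse positions of its occurrences. The sketch below illustrates only that weighting on a toy token list. `position_weights` is a hypothetical helper written for this note; it is not pytextrank's implementation, which works on spaCy tokens and a lemma graph.

```python
from collections import defaultdict


def position_weights(tokens):
    """Return normalized inverse-position weights for each distinct token.

    A token at positions 1 and 5 gets a raw weight of 1/1 + 1/5.
    The raw weights are then scaled so that they sum to 1.
    """
    raw = defaultdict(float)
    for position, token in enumerate(tokens, start=1):
        raw[token] += 1.0 / position
    total = sum(raw.values())
    return {token: weight / total for token, weight in raw.items()}


# Toy input (an assumption for illustration, not taken from the dataset)
tokens = "word alignment models need word alignment data".split()
weights = position_weights(tokens)

# Print tokens from highest to lowest weight
for token, weight in sorted(weights.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{token}: {weight:.3f}")
```

Words that occur early and repeatedly, such as "word" here, receive the largest weight, which is why the title is placed before the abstract when building the input text.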

In [6]:
# Load the English stopword list (one word per line) into a set for fast lookup
with open("stopwords-en.txt", encoding="utf-8") as sw:
    STOPWORDS_EN = {line.strip() for line in sw if line.strip()}

In [7]:
def pre_process(text):
    """Lowercase the text, drop stopwords, and keep only letters and single spaces."""
    words = text.lower().split()
    # Remove stopwords before punctuation, because many stopwords
    # contain an apostrophe (e.g. "isn't")
    cleaned_words = [word for word in words if word not in STOPWORDS_EN]
    text = " ".join(cleaned_words)
    # Replace punctuation and digits with spaces, then collapse repeated spaces
    text = re.sub("[^a-zA-Z]", " ", text)
    text = re.sub(" +", " ", text)
    return text
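
The order of the two cleaning steps matters. The sketch below contrasts the order used above with the reverse order on one sentence. `DEMO_STOPWORDS`, `stopwords_first`, and `punctuation_first` are hypothetical names introduced for this illustration; the notebook itself uses the full list loaded from `stopwords-en.txt`.

```python
import re

# Hypothetical miniature stopword list, for illustration only
DEMO_STOPWORDS = {"this", "a", "of", "the", "isn't"}


def stopwords_first(text, stopwords=DEMO_STOPWORDS):
    """Same order as pre_process: drop stopwords, then strip non-letters."""
    words = [w for w in text.lower().split() if w not in stopwords]
    text = re.sub("[^a-zA-Z]", " ", " ".join(words))
    return re.sub(" +", " ", text)


def punctuation_first(text, stopwords=DEMO_STOPWORDS):
    """Reverse order: strip non-letters first, then drop stopwords."""
    text = re.sub("[^a-zA-Z]", " ", text.lower())
    return " ".join(w for w in text.split() if w not in stopwords)


sample = "This isn't a model of IBM Model 1"
print(repr(stopwords_first(sample)))    # "isn't" is removed as a whole word
print(repr(punctuation_first(sample)))  # "isn't" is split into "isn" and "t" and survives
```

Because the text is split on whitespace only, a stopword with punctuation attached (for example "the.") would not be filtered by either variant.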

In [8]:
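def extract_keywords_positionrank(text):
    """Return the top five deduplicated PositionRank phrases as a comma-separated string."""
    doc = nlp(text)
    # doc._.phrases is ordered by rank; keep phrases of at most three words
    doc_keywords = [
        keyword.text for keyword in doc._.phrases if len(keyword.text.split()) <= 3
    ]
    # Drop near-duplicate phrases using fuzzy string matching
    deduplicated_doc_keywords = list(process.dedupe(doc_keywords, threshold=70))
    final_keywords = ", ".join(deduplicated_doc_keywords[:5])
    return final_keywords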
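
The deduplication step removes phrases that are near-copies of one another. To show the idea without the fuzzywuzzy dependency, the sketch below uses the standard library's `difflib.SequenceMatcher` as a stand-in. It is a different similarity measure with a simpler keep-first policy, so its results will not necessarily match `process.dedupe`. `dedupe_keep_first` is a hypothetical helper, and the example phrases are made up.

```python
from difflib import SequenceMatcher


def dedupe_keep_first(phrases, threshold=0.7):
    """Keep a phrase only if it is not too similar to one already kept.

    Similarity is SequenceMatcher.ratio(), a character-level score in [0, 1].
    """
    kept = []
    for phrase in phrases:
        is_duplicate = any(
            SequenceMatcher(None, phrase, other).ratio() >= threshold
            for other in kept
        )
        if not is_duplicate:
            kept.append(phrase)
    return kept


# Made-up phrases in ranked order (not actual pytextrank output)
candidates = ["word alignment", "word alignments", "training data"]
unique_phrases = dedupe_keep_first(candidates)
print(unique_phrases)
```

Keeping the first occurrence preserves the ranking order, so slicing the result afterwards still returns the highest-ranked phrases.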

In [9]:
print(text)
print("================================================================")
print(pre_process(text))
print("================================================================")
print(extract_keywords_positionrank(pre_process(text)))

Ensemble Statistical and Heuristic Models for Unsupervised Word Alignment Statistical word alignment models need large amount of training data while they are weak in small-size corpora. This paper proposes a new approach of unsupervised hybrid word alignment technique using ensemble learning method. This algorithm uses three base alignment models in several rounds to generate alignments. The ensemble algorithm uses a weighed scheme for resampling training data and a voting score to consider aggregated alignments. The underlying alignment algorithms used in this study include IBM Model 1, 2 and a heuristic method based on Dice measurement. Our experimental results show that by this approach, the alignment error rate could be improved by at least %15 for the base alignment models.
ensemble statistical heuristic models unsupervised word alignment statistical word alignment models training data weak small size corpora paper proposes approach unsupervised hybrid word alignment technique ens

In [10]:
# Apply PositionRank to the whole dataset
data_df["text"] = data_df["text"].apply(pre_process)
data_df["extracted_keywords"] = data_df["text"].apply(extract_keywords_positionrank)
data_df["extracted_keywords"]

paper_id
1      alignment statistical word, generate alignment...
2      spectral representations situations, spectral ...
3      clustering algorithms, cluster ensembles, dete...
4      noisy detrimental instances, accurate estimate...
5      learning presence drift, drift awareness twitt...
                             ...                        
444    machine learning tool, chess grandmasters, stu...
445    application stochastic generator, statistical ...
446    challenging behaviors, autism spectrum disorde...
447    psychosis experience, method, mobile apps, sma...
448    bnp cluster analysis, patients, psychotherapy ...
Name: extracted_keywords, Length: 448, dtype: object

In [11]:
data_df["extracted_keywords"].iloc[2]

'clustering algorithms, cluster ensembles, detection systems, robustness stability accuracy, prominent method'

In [12]:
data_df["keywords"].iloc[2]

'software defect prediction, particle swarm optimization, cluster data, ensemble clustering'
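
One way to quantify the comparison between the two cells above is exact-match precision and recall over the comma-separated keyword lists. This is a minimal sketch under that assumption; `keyword_overlap` is a hypothetical helper, and exact matching is a strict criterion that gives no credit for partial matches such as "cluster ensembles" versus "ensemble clustering".

```python
def keyword_overlap(extracted, reference):
    """Return (precision, recall, matched) using exact, case-insensitive matching."""
    ext = {k.strip().lower() for k in extracted.split(",") if k.strip()}
    ref = {k.strip().lower() for k in reference.split(",") if k.strip()}
    matched = ext & ref
    precision = len(matched) / len(ext) if ext else 0.0
    recall = len(matched) / len(ref) if ref else 0.0
    return precision, recall, matched


# The two strings shown above for the third paper
extracted = ("clustering algorithms, cluster ensembles, detection systems, "
             "robustness stability accuracy, prominent method")
reference = ("software defect prediction, particle swarm optimization, "
             "cluster data, ensemble clustering")

precision, recall, matched = keyword_overlap(extracted, reference)
print(precision, recall, matched)
```

For this paper no extracted phrase matches an author keyword exactly, so both scores are zero even though several phrases are topically related.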