# Yet Another Keyword Extractor (YAKE)

In [1]:
import pandas as pd
import yake
from fuzzywuzzy import process

## Dataset

In [2]:
# Reading the data 
dataset_csv = "ICMLA_2014_2015_2016_2017.csv"
encoding = "ISO-8859-1"
data_df = pd.read_csv(dataset_csv, encoding=encoding).set_index("paper_id")
data_df.head()

paper_id,title,keywords,abstract,session,year
1,Ensemble Statistical and Heuristic Models for ...,"statistical word alignment, ensemble learning,...",Statistical word alignment models need large a...,Ensemble Methods,2014
2,Improving Spectral Learning by Using Multiple ...,"representation, spectral learning, discrete fo...",Spectral learning algorithms learn an unknown ...,Ensemble Methods,2014
3,Applying Swarm Ensemble Clustering Technique f...,"software defect prediction, particle swarm opt...",Number of defects remaining in a system provid...,Ensemble Methods,2014
4,Reducing the Effects of Detrimental Instances,"filtering, label noise, instance weighting",Not all instances in a data set are equally be...,Ensemble Methods,2014
5,Concept Drift Awareness in Twitter Streams,"twitter, adaptation models, time-frequency ana...",Learning in non-stationary environments is not...,Ensemble Methods,2014


## YAKE Example 

In [3]:
data_df["text"] = data_df["title"] + " " + data_df["abstract"]
corpus = data_df["text"].values

In [4]:
title = data_df["title"].iloc[0]
abstract = data_df["abstract"].iloc[0]
text = f"{title} {abstract}"

In [32]:
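def extract_keywords_yake(text):
    """Return the top five YAKE keywords for `text` as a comma-separated string."""
    deduplication_threshold = 0.7
    deduplication_algo = 'seqm'
    num_of_keywords = 20
    extractor = yake.KeywordExtractor(
        n=3,  # maximum n-gram size
        dedupLim=deduplication_threshold,  # deduplication threshold
        dedupFunc=deduplication_algo,  # deduplication algorithm
        top=num_of_keywords,  # candidates kept before the fuzzy deduplication pass
        features=None)
    # Each result is a (keyword, score) pair; keep only the keyword
    doc_keywords = [keyword[0] for keyword in extractor.extract_keywords(text)]
    # Second deduplication pass with fuzzywuzzy to drop near-identical phrases
    deduplicated_doc_keywords = list(process.dedupe(doc_keywords, threshold=70))
    final_keywords = ", ".join(deduplicated_doc_keywords[:5])
    return final_keywords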

In [33]:
print(text)
print("================================================================")
print(extract_keywords_yake(text))

Ensemble Statistical and Heuristic Models for Unsupervised Word Alignment Statistical word alignment models need large amount of training data while they are weak in small-size corpora. This paper proposes a new approach of unsupervised hybrid word alignment technique using ensemble learning method. This algorithm uses three base alignment models in several rounds to generate alignments. The ensemble algorithm uses a weighed scheme for resampling training data and a voting score to consider aggregated alignments. The underlying alignment algorithms used in this study include IBM Model 1, 2 and a heuristic method based on Dice measurement. Our experimental results show that by this approach, the alignment error rate could be improved by at least %15 for the base alignment models.
Alignment Statistical word, Unsupervised Word Alignment, small-size corpora, large amount, Statistical and Heuristic
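
The function above filters near-duplicate phrases twice: once inside YAKE (`dedupLim`, `dedupFunc='seqm'`) and once with `process.dedupe`. The idea behind threshold-based deduplication can be sketched with the standard library alone. This is only an illustration, not the implementation used by YAKE or fuzzywuzzy, which apply their own scorers and selection rules; the helper name `greedy_dedupe` and the choice of `difflib.SequenceMatcher` are assumptions made for this sketch.

```python
from difflib import SequenceMatcher


def greedy_dedupe(phrases, threshold=0.7):
    """Keep a phrase only if it is not too similar to any phrase already kept.

    Hypothetical helper: similarity is difflib's SequenceMatcher ratio in [0, 1].
    """
    kept = []
    for phrase in phrases:
        is_duplicate = any(
            SequenceMatcher(None, phrase.lower(), other.lower()).ratio() >= threshold
            for other in kept
        )
        if not is_duplicate:
            kept.append(phrase)
    return kept


# Phrases are assumed to arrive in ranked order, so the first variant wins.
candidates = ["word alignment", "word alignments", "ensemble learning"]
deduplicated = greedy_dedupe(candidates)
print(deduplicated)
```

A lower threshold merges more aggressively, which can discard keywords that are distinct but share many characters.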


In [7]:
%%timeit
# Apply YAKE to the whole dataset
data_df["extracted_keywords"] = data_df["text"].apply(extract_keywords_yake)

59.4 s ± 9.04 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [16]:
# extract_keywords_yake expects a single string, so call it once per document
# rather than passing the whole array of texts.
data_df["extracted_keywords"] = [extract_keywords_yake(text) for text in data_df["text"].values]

In [8]:
data_df["extracted_keywords"].iloc[2]

'Number of defects, Ensemble Clustering Technique, defect prediction software, Software Metrics Number, data mining techniques'

In [9]:
data_df["keywords"].iloc[2]

'software defect prediction, particle swarm optimization, cluster data, ensemble clustering'
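
The two outputs above overlap only loosely: "defect prediction software" and "Ensemble Clustering Technique" resemble two of the author keywords but match none exactly. One way to quantify this is a fuzzy recall score, sketched below. The notebook itself performs no such evaluation; the function names, the use of `difflib.SequenceMatcher`, and the 0.6 cutoff are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def split_keywords(keyword_string):
    """Split a comma-separated keyword string into lowercase phrases."""
    return [k.strip().lower() for k in keyword_string.split(",") if k.strip()]


def best_match_ratio(phrase, candidates):
    """Highest SequenceMatcher ratio between `phrase` and any candidate."""
    return max(
        (SequenceMatcher(None, phrase, c).ratio() for c in candidates),
        default=0.0,
    )


def fuzzy_recall(gold, extracted, cutoff=0.6):
    """Fraction of gold keywords with a sufficiently similar extracted keyword.

    Hypothetical metric: the cutoff of 0.6 is an arbitrary illustrative value.
    """
    gold_list = split_keywords(gold)
    extracted_list = split_keywords(extracted)
    if not gold_list:
        return 0.0
    hits = sum(best_match_ratio(g, extracted_list) >= cutoff for g in gold_list)
    return hits / len(gold_list)


# Strings copied from the two cell outputs above (paper at position 2).
gold = ("software defect prediction, particle swarm optimization, "
        "cluster data, ensemble clustering")
extracted = ("Number of defects, Ensemble Clustering Technique, "
             "defect prediction software, Software Metrics Number, "
             "data mining techniques")
recall = fuzzy_recall(gold, extracted)
print(f"fuzzy recall: {recall:.2f}")
```

A character-based ratio is sensitive to word order, so a token-based scorer may be a better fit for reordered phrases such as "defect prediction software".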

In [10]:
data_df.to_csv("yake_keywords.csv")