# Keyword Extraction using TF-IDF

In [120]:
import pandas as pd
import numpy as np
import re

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from fuzzywuzzy import process

stop_words = set(stopwords.words("english"))
wordnet = WordNetLemmatizer()
# Initialize the TfidfVectorizer to score unigrams through 4-grams
vectorizer = TfidfVectorizer(ngram_range=(1, 4))

## Dataset

In [121]:
# Read the data, indexed by paper_id
dataset_csv = "ICMLA_2014_2015_2016_2017.csv"
encoding = "ISO-8859-1"
data_df = pd.read_csv(dataset_csv, encoding=encoding).set_index("paper_id")
data_df.head()

paper_id,title,keywords,abstract,session,year
1,Ensemble Statistical and Heuristic Models for ...,"statistical word alignment, ensemble learning,...",Statistical word alignment models need large a...,Ensemble Methods,2014
2,Improving Spectral Learning by Using Multiple ...,"representation, spectral learning, discrete fo...",Spectral learning algorithms learn an unknown ...,Ensemble Methods,2014
3,Applying Swarm Ensemble Clustering Technique f...,"software defect prediction, particle swarm opt...",Number of defects remaining in a system provid...,Ensemble Methods,2014
4,Reducing the Effects of Detrimental Instances,"filtering, label noise, instance weighting",Not all instances in a data set are equally be...,Ensemble Methods,2014
5,Concept Drift Awareness in Twitter Streams,"twitter, adaptation models, time-frequency ana...",Learning in non-stationary environments is not...,Ensemble Methods,2014


## Data Pre-processing (Data Cleaning)

In [122]:
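def pre_process(text):
    """Lowercase, strip non-letters, drop stop words, and lemmatize."""
    text = re.sub("[^a-zA-Z]", " ", text)  # replace digits and punctuation with spaces
    tokens = text.lower().split()
    # WordNetLemmatizer.lemmatize defaults to pos="n", so every token is treated as a noun
    tokens = [wordnet.lemmatize(word) for word in tokens if word not in stop_words]
    return " ".join(tokens)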

In [123]:
# Applying pre_process to single example text
title = data_df["title"].iloc[0]
abstract = data_df["abstract"].iloc[0]
text = f"{title} {abstract}"
print(text)
print("================================")
pre_process(text)

Ensemble Statistical and Heuristic Models for Unsupervised Word Alignment Statistical word alignment models need large amount of training data while they are weak in small-size corpora. This paper proposes a new approach of unsupervised hybrid word alignment technique using ensemble learning method. This algorithm uses three base alignment models in several rounds to generate alignments. The ensemble algorithm uses a weighed scheme for resampling training data and a voting score to consider aggregated alignments. The underlying alignment algorithms used in this study include IBM Model 1, 2 and a heuristic method based on Dice measurement. Our experimental results show that by this approach, the alignment error rate could be improved by at least %15 for the base alignment models.


'ensemble statistical heuristic model unsupervised word alignment statistical word alignment model need large amount training data weak small size corpus paper proposes new approach unsupervised hybrid word alignment technique using ensemble learning method algorithm us three base alignment model several round generate alignment ensemble algorithm us weighed scheme resampling training data voting score consider aggregated alignment underlying alignment algorithm used study include ibm model heuristic method based dice measurement experimental result show approach alignment error rate could improved least base alignment model'

In [124]:
# Applying preprocessing to entire dataset
data_df["text"] = data_df["title"] + " " + data_df["abstract"]
data_df["text"] = data_df["text"].apply(pre_process)
corpus = data_df["text"].values

In [125]:
# Fit and transform the text
tfidf = vectorizer.fit_transform(corpus)

# Get the feature names
feature_names = vectorizer.get_feature_names_out()

tfidf.shape

(448, 131538)
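
Each of the 131538 columns above is one n-gram, and each cell holds that n-gram's TF-IDF weight in one paper. As a rough sketch of how such a weight is computed, the snippet below reimplements the weighting in plain Python, assuming the `TfidfVectorizer` defaults as I understand them (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization per document). The function name `tfidf_weights` and the toy corpus are hypothetical, and the sketch handles unigrams only with whitespace tokenization, so it is not a drop-in replacement for the vectorizer.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Return one {term: weight} dict per document (hypothetical helper)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears
    df = Counter(term for tokens in tokenized for term in set(tokens))

    # Smoothed inverse document frequency (assumed to match smooth_idf=True)
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
           for term, count in df.items()}

    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        raw = {term: count * idf[term] for term, count in tf.items()}
        # L2-normalize so every non-empty document vector has unit length
        norm = math.sqrt(sum(w * w for w in raw.values()))
        weights.append({t: w / norm for t, w in raw.items()} if norm else {})
    return weights


# Toy corpus (made up for illustration, not taken from the dataset)
toy_corpus = [
    "word alignment model",
    "ensemble learning model",
    "ensemble model data",
]
weights = tfidf_weights(toy_corpus)
print(weights[0])
```

In the first toy document, "model" occurs in every document and so receives a lower weight than "word" or "alignment", which occur only once in the corpus. This is the effect that pushes paper-specific phrases to the top of the ranking.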

In [126]:
author_keywords = data_df["keywords"].iloc[0]
author_keywords

'statistical word alignment, ensemble learning, heuristic word alignment'

In [127]:
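# Densify only the first paper's row instead of the whole 448 x 131538 matrix
first_row = tfidf[0].toarray().ravel()
keyword_list = feature_names[np.argsort(first_row)[-10:][::-1]]
keyword_list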

array(['alignment', 'alignment model', 'word alignment',
       'base alignment model', 'base alignment', 'word', 'ensemble',
       'algorithm us', 'heuristic', 'model'], dtype=object)
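
The indexing expression in the cell above packs three steps into one line: sort the indices by weight, take the last ten, and reverse them so the highest weight comes first. A minimal sketch of the same idea on a plain Python list may make the steps easier to follow. The helper name `top_k_indices` and the toy weights are hypothetical and not part of the notebook.

```python
def top_k_indices(scores, k):
    """Return the indices of the k largest scores, highest first."""
    if k <= 0:
        return []  # guard: a slice like [-0:] would return everything
    # Indices ordered by ascending score, analogous to np.argsort
    ascending = sorted(range(len(scores)), key=scores.__getitem__)
    # The last k indices are the largest; reverse them for descending order
    return ascending[-k:][::-1]


# Toy vocabulary and one row of made-up TF-IDF weights
features = ["alignment", "model", "word", "ensemble", "data"]
row = [0.61, 0.12, 0.35, 0.27, 0.05]

top = [features[i] for i in top_k_indices(row, 3)]
print(top)  # ['alignment', 'word', 'ensemble']
```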

## Reducing duplication of keywords

In [128]:
# Group keywords whose fuzzy similarity score exceeds 70 and keep the longest of each group
deduplicated_keywords = process.dedupe(keyword_list, threshold=70)
deduplicated_keywords

dict_keys(['base alignment model', 'word alignment', 'ensemble', 'algorithm us', 'heuristic'])
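
To build intuition for why shorter phrases such as 'alignment' and 'model' disappear, here is a simplified sketch that drops any keyword whose tokens are all contained in a longer kept keyword. This is an exact containment check, not the fuzzy scoring that `process.dedupe` performs, so it is only an approximation of that behavior. The function name `drop_contained_keywords` is hypothetical.

```python
def drop_contained_keywords(keywords):
    """Drop keywords whose token set is contained in a longer kept keyword."""
    unique = list(dict.fromkeys(keywords))  # remove exact repeats, keep order
    kept = []
    # Visit longer phrases first so they absorb their shorter sub-phrases
    for keyword in sorted(unique, key=len, reverse=True):
        tokens = set(keyword.split())
        if not any(tokens <= set(other.split()) for other in kept):
            kept.append(keyword)
    # Report the survivors in their original ranking order
    return [keyword for keyword in unique if keyword in kept]


# The ten keywords extracted for the first paper
keywords = [
    "alignment", "alignment model", "word alignment",
    "base alignment model", "base alignment", "word", "ensemble",
    "algorithm us", "heuristic", "model",
]
deduplicated = drop_contained_keywords(keywords)
print(deduplicated)
```

On these ten keywords the sketch keeps the same five phrases as the fuzzy deduplication above, though the two approaches can disagree on inputs where phrases are similar without one containing the other.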