## Keyword Extraction

`Task:`
- Develop a model that identifies and extracts keywords and key phrases from an input sequence.
- Use Named Entity Recognition (NER) and part-of-speech (PoS) tagging as features that contribute to the extraction.

### Import the necessary libraries and load the dataset

In [9]:
import os
import pandas as pd
import numpy as np
import re
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from nltk.tokenize import word_tokenize
from keybert import KeyBERT
from keyphrase_vectorizers import KeyphraseCountVectorizer
from collections import defaultdict

In [8]:
# !pip install keybert
# !pip install keyphrase_vectorizers

In [12]:
files = os.listdir('docsutf8')

In [38]:
data = defaultdict(list)

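for f in files:
    # each document in docsutf8/ has a matching .key file in keys/ with one key phrase per line
    with open(f"docsutf8/{f}", encoding="utf8") as fi, open(f"keys/{f.split('.')[0] + '.key'}", encoding="utf8") as key:
        data['file'].append(fi.read())                # full document text
        data['key'].append(key.read().splitlines())   # list of reference key phrases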

df = pd.DataFrame(data)

### Understand the data and preprocess the text

In [39]:
df.head()

| | file | key |
|---|---|---|
| 0 | Complex Langevin (CL) dynamics [1,2] provides... | [CL, complexified configuration space, Complex... |
| 1 | Nuclear theory devoted major efforts since 4 d... | [C60, combining quantum features, field of clu... |
| 2 | The next important step might be the derivatio... | [continuum space-time, Dirac equation, future ... |
| 3 | This work shows how our approach based on the ... | [class virial expansions, field partition func... |
| 4 | A fluctuating vacuum is a general feature of q... | [a collection of fermionic fields describing c... |


In [40]:
df.shape

(493, 2)

In [41]:
print(df['file'][0])

Complex Langevin (CL) dynamics  [1,2] provides an approach to circumvent the sign problem in numerical simulations of lattice field theories with a complex Boltzmann weight, since it does not rely on importance sampling. In recent years a number of stimulating results has been obtained in the context of nonzero chemical potential, in both lower and four-dimensional field theories with a severe sign problem in the thermodynamic limit  [3–8] (for two recent reviews, see e.g. Refs.  [9,10]). However, as has been known since shortly after its inception, correct results are not guaranteed  [11–16]. This calls for an improved understanding, relying on the combination of analytical and numerical insight. In the recent past, the important role played by the properties of the real and positive probability distribution in the complexified configuration space, which is effectively sampled during the Langevin process, has been clarified  [17,18]. An important conclusion was that this distribution 

'''
Complex Langevin (CL) dynamics  [1,2] provides an approach to circumvent the sign problem in numerical simulations of lattice field theories with a complex Boltzmann weight, since it does not rely on importance sampling. In recent years a number of stimulating results has been obtained in the context of nonzero chemical potential, in both lower and four-dimensional field theories with a severe sign problem in the thermodynamic limit  [3–8] (for two recent reviews, see e.g. Refs.  [9,10]). However, as has been known since shortly after its inception, correct results are not guaranteed  [11–16]. This calls for an improved understanding, relying on the combination of analytical and numerical insight. In the recent past, the important role played by the properties of the real and positive probability distribution in the complexified configuration space, which is effectively sampled during the Langevin process, has been clarified  [17,18]. An important conclusion was that this distribution should be sufficiently localised in order for CL to yield valid results. Importantly, this insight has recently also led to promising results in nonabelian gauge theories, with the implementation of SL(N,C) gauge cooling  [8,10].
'''

In [45]:
def remove_brackets(text):
    return re.sub(r'\[.*?\]', '', text)

df['file'] = df['file'].apply(remove_brackets)
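To make the effect of the pattern concrete, here is a minimal sketch on a made-up sentence (the `sample` string is an assumption, modelled on the citation style of the first document). It also contrasts the non-greedy `.*?` used above with a greedy `.*`, which would swallow everything between the first `[` and the last `]`.

```python
import re

def remove_brackets(text):
    # Non-greedy match: every [...] group is removed on its own
    return re.sub(r'\[.*?\]', '', text)

# Hypothetical sample in the style of the corpus
sample = "dynamics [1,2] provides results [3-8] (see Refs. [9,10])."

cleaned = remove_brackets(sample)

# Greedy variant for comparison: deletes from the first '[' to the last ']'
greedy = re.sub(r'\[.*\]', '', sample)

print(repr(cleaned))
print(repr(greedy))
```

The cleaned text keeps the words between citations, while the greedy variant loses them. Removing the markers leaves extra spaces behind, which is why double spaces show up in the printed documents further down.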

In [48]:
# remove stopwords (needs the NLTK stopword corpus: run nltk.download('stopwords') once)
stop_words = set(stopwords.words('english'))
def remove_stopwords(text):
    return ' '.join([word for word in text.split() if word.lower() not in stop_words])

df['file'] = df['file'].apply(remove_stopwords)
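One detail of the whitespace-based filter is worth illustrating: a token with attached punctuation, such as `However,`, does not match the stop list, so it survives (this is visible in the printed document below). The sketch uses a small hypothetical stop list so that it runs without the NLTK data, and `remove_stopwords_strip_punct` is an illustrative variant, not part of the notebook.

```python
import string

# Hypothetical mini stop list so the sketch runs without NLTK data
stop_words = {"however", "as", "has", "been", "since", "are", "not"}

def remove_stopwords(text):
    # Same logic as the cell above: whitespace split, case-insensitive lookup
    return ' '.join(w for w in text.split() if w.lower() not in stop_words)

def remove_stopwords_strip_punct(text):
    # Variant: strip surrounding punctuation before the lookup
    return ' '.join(
        w for w in text.split()
        if w.lower().strip(string.punctuation) not in stop_words
    )

sample = "However, as has been known, correct results are not guaranteed."
basic = remove_stopwords(sample)
strict = remove_stopwords_strip_punct(sample)

print(basic)
print(strict)
```

Whether the stricter variant is preferable depends on the downstream vectorizer; `CountVectorizer` tokenizes on its own and would still see `however` as a term unless it is given a stop list too.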

In [51]:
print(df['file'][0])

Complex Langevin  dynamics provides approach circumvent sign problem numerical simulations lattice field theories complex Boltzmann weight, since rely importance sampling. recent years number stimulating results obtained context nonzero chemical potential, lower four-dimensional field theories severe sign problem thermodynamic limit . However, known since shortly inception, correct results guaranteed . calls improved understanding, relying combination analytical numerical insight. recent past, important role played properties real positive probability distribution complexified configuration space, effectively sampled Langevin process, clarified . important conclusion distribution sufficiently localised order CL yield valid results. Importantly, insight recently also led promising results nonabelian gauge theories, implementation SL gauge cooling .


'''Complex Langevin (CL) dynamics provides approach circumvent sign problem numerical simulations lattice field theories complex Boltzmann weight, since rely importance sampling. recent years number stimulating results obtained context nonzero chemical potential, lower four-dimensional field theories severe sign problem thermodynamic limit (for two recent reviews, see e.g. Refs. ). However, known since shortly inception, correct results guaranteed . calls improved understanding, relying combination analytical numerical insight. recent past, important role played properties real positive probability distribution complexified configuration space, effectively sampled Langevin process, clarified . important conclusion distribution sufficiently localised order CL yield valid results. Importantly, insight recently also led promising results nonabelian gauge theories, implementation SL(N,C) gauge cooling .
'''

In [50]:
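# remove parenthesised text such as "(CL)"
# note: this cell (In [50]) was executed before the print in In [51] above, so that output already lacks parentheses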
def remove_parentheses(text):
    return re.sub(r'\(.*?\)', '', text)

df['file'] = df['file'].apply(remove_parentheses)
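The parenthesis pattern has a side effect that shows up in the output above: `SL(N,C)` becomes `SL`, and stray spaces remain before punctuation. The following sketch reproduces this on a made-up sentence; `collapse_whitespace` is a hypothetical helper that is not part of the notebook, shown only as one possible way to tidy the result.

```python
import re

def remove_parentheses(text):
    # Non-greedy: each (...) group is removed separately
    return re.sub(r'\(.*?\)', '', text)

def collapse_whitespace(text):
    # Hypothetical helper: squeeze whitespace runs, drop spaces before punctuation
    text = re.sub(r'\s+', ' ', text)
    return re.sub(r'\s+([.,;:])', r'\1', text).strip()

sample = "Complex Langevin (CL) dynamics and SL(N,C) gauge cooling ."
stripped = remove_parentheses(sample)
tidy = collapse_whitespace(stripped)

print(repr(stripped))
print(repr(tidy))
```

Since abbreviations such as `(CL)` often appear among the reference key phrases, dropping them may remove candidate keywords; that trade-off is left as it is in the notebook.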

In [52]:
# lower case
df['file'] = df['file'].str.lower()

### Embeddings

In [53]:
doc = df['file'].tolist()
doc[:2]

['complex langevin  dynamics provides approach circumvent sign problem numerical simulations lattice field theories complex boltzmann weight, since rely importance sampling. recent years number stimulating results obtained context nonzero chemical potential, lower four-dimensional field theories severe sign problem thermodynamic limit . however, known since shortly inception, correct results guaranteed . calls improved understanding, relying combination analytical numerical insight. recent past, important role played properties real positive probability distribution complexified configuration space, effectively sampled langevin process, clarified . important conclusion distribution sufficiently localised order cl yield valid results. importantly, insight recently also led promising results nonabelian gauge theories, implementation sl gauge cooling .',
 'nuclear theory devoted major efforts since 4 decades describe thermalization nuclear reactions, predominantly using semi-classical met

`BOW`

In [54]:
cv = CountVectorizer(max_df=0.85, max_features=10000)
# max_df=0.85 - ignore terms that occur in more than 85% of the documents
# max_features=10000 - keep at most the 10,000 most frequent terms across the corpus

In [55]:
bow = cv.fit_transform(doc)
bow_array = bow.toarray()
bow_array.shape

(493, 9213)

`TF-IDF`

In [56]:
tfidf = TfidfTransformer(smooth_idf=True, use_idf=True)
# smooth_idf - smooth idf weights by adding one to document frequencies
# use_idf - enable inverse-document-frequency reweighting
# by default, smooth_idf=True and use_idf=True

In [58]:
tfidf.fit(bow) # fit the transformer on the bag of words
tfidf.idf_.shape

(9213,)