<b><center><font size="4">Keyword Extraction from Political Party Programmes - Portuguese Legislative Elections 2022</font></center></b>
<hr>

**Notebook Developed by**: [Ricardo Campos](http://www.ccc.ipt.pt/~ricardo)<br>
**email:**  ricardo.campos@ipt.pt<br>
**Affiliation:** *Assistant Professor* @ [Polytechnic Institute of Tomar](http://portal2.ipt.pt/en/);
*Researcher* @ [LIAAD](https://www.inesctec.pt/en/centres/liaad)-[INESC TEC](https://www.inesctec.pt/en)

<hr>

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#YAKE!-Installation" data-toc-modified-id="YAKE!-Installation-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>YAKE! Installation</a></span></li><li><span><a href="#Keyword-Extraction" data-toc-modified-id="Keyword-Extraction-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Keyword Extraction</a></span></li><li><span><a href="#Text2WordCloud" data-toc-modified-id="Text2WordCloud-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Text2WordCloud</a></span></li><li><span><a href="#Text-Analsyis" data-toc-modified-id="Text-Analsyis-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Text Analysis</a></span><ul class="toc-item"><li><span><a href="#Getting-common-keywords-across-the-entire-collection" data-toc-modified-id="Getting-common-keywords-across-the-entire-collection-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Getting common keywords across the entire collection</a></span></li><li><span><a href="#Comparing-left-wing-vs-right-wing-political-parties" data-toc-modified-id="Comparing-left-wing-vs-right-wing-political-parties-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>Comparing left-wing vs right-wing political parties</a></span></li><li><span><a href="#Comparing-two-Political-Parties" data-toc-modified-id="Comparing-two-Political-Parties-4.3"><span class="toc-item-num">4.3&nbsp;&nbsp;</span>Comparing two Political Parties</a></span></li><li><span><a href="#Determining-relevant-keywords-across-the-entire-collection" data-toc-modified-id="Determining-relevant-keywords-across-the-entire-collection-4.4"><span class="toc-item-num">4.4&nbsp;&nbsp;</span>Determining relevant keywords across the entire collection</a></span></li></ul></li><li><span><a href="#References" data-toc-modified-id="References-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>References</a></span></li></ul></div>

# Keyword Extraction from Political Party Programmes using YAKE. Portuguese Legislative Elections 2022

Check out our tutorial on [Medium](https://medium.com/p/dd7fdcd671c9).

## YAKE! Installation

Begin by installing YAKE!:

In [None]:
!pip install git+http://github.com/LIAAD/yake

<hr>

## Keyword Extraction

The code below begins by defining the list of political parties. For each party, it reads the corresponding programme and applies YAKE! to extract the top-200 keywords. The keywords are stored in a dictionary where the key is the name of the political party and the value is a list of 200 tuples, each consisting of a keyword and a relevance score; the lower the score, the more relevant the keyword. Output example: `{'ADN': [('Alternativa Democrática', 0.0010326142301984532), ('estado', 0.0021829409873160214),…..]}`. The parsed texts obtained from GitHub are assumed to be in a folder named `data/PoliticalPartiesProposals-Parsed`.

In [None]:
import yake

ListOfPoliticalParties = ["ADN", "BE", "CDS", "Chega", "ErgueTe", "IL", "Livre", "MAS", "MPT", "NosCid", "PAN", "PCP", "PS", "PSD", "RIR", "Volt"]

dictOfKeywords = {}

for PoliticalParty in ListOfPoliticalParties:
    path = 'data/PoliticalPartiesProposals-Parsed'
    # Use a context manager so the file handle is closed after reading
    with open(f'{path}/Prog_{PoliticalParty}.txt', encoding="utf8") as file:
        text = file.read()

    language = "pt"
    max_ngram_size = 3
    numOfKeywords = 200

    custom_kw_extractor = yake.KeywordExtractor(lan=language, n=max_ngram_size, top=numOfKeywords, features=None)
    keywords = custom_kw_extractor.extract_keywords(text)
    
    dictOfKeywords[PoliticalParty] = keywords
    
    print(f"Done for {PoliticalParty}")
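
To make the shape of `dictOfKeywords` concrete, the following minimal sketch works on a hand-made dictionary with the same structure (party name mapped to a list of `(keyword, score)` tuples). The first two entries are taken from the output example above; the third entry and the helper name `top_keywords` are assumptions introduced only for illustration.

```python
# Toy dictionary with the same shape as dictOfKeywords.
toy_keywords = {
    "ADN": [
        ("exemplo", 0.05),  # hypothetical keyword and score, for illustration only
        ("Alternativa Democrática", 0.0010326142301984532),
        ("estado", 0.0021829409873160214),
    ],
}

def top_keywords(keywords_by_party, party, n):
    """Return the n most relevant keywords (lowest score first)."""
    ranked = sorted(keywords_by_party[party], key=lambda pair: pair[1])
    return [keyword for keyword, _score in ranked[:n]]

print(top_keywords(toy_keywords, "ADN", 2))
```

Sorting in ascending order reflects the convention described above: the lower the score, the more relevant the keyword.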

Next, we can list the top-200 keywords for each of the political parties considered in this tutorial in a pandas DataFrame:

In [None]:
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('max_colwidth',40)
pd.set_option('display.max_rows', None)

dictOfKeywords2Pandas = {}

for PoliticalParty in dictOfKeywords:
    listOfKeywords = [kw[0] for kw in dictOfKeywords[PoliticalParty]]
    dictOfKeywords2Pandas[PoliticalParty] = listOfKeywords

pd.DataFrame(dictOfKeywords2Pandas)

<hr>

## Text2WordCloud

The next step is to transform the text into word clouds. To accomplish this, we will make use of the wordcloud Python package, as the code below exemplifies. In the code, we begin by defining the path where we want the word clouds to be saved (in our case, a folder named `Figures` under the `data` folder, `data/Figures`), the path where the parsed texts can be found (`data/PoliticalPartiesProposals-Parsed`) and the filename (`wc_flag.jpg`) of the background image that shapes the word cloud. Note that this image should be in the `data/Figures` folder.

You should also be aware that word clouds convey the relevance of keywords through font size: the higher the relevance of a keyword, the larger its font size. In YAKE!, however, the more relevant the keyword, the lower its score. So before we move on, we need to adapt YAKE!'s score (by inverting it) so that it is ready for the wordcloud Python package. The code below makes this adaptation, then generates, shows and saves the word clouds under the specified folder.

In [None]:
!pip install wordcloud

In [None]:
import yake
import numpy as np
from PIL import Image
from wordcloud import WordCloud, ImageColorGenerator
import matplotlib.pyplot as plt

path_figures = "data/Figures"
path_texts = "data/PoliticalPartiesProposals-Parsed"
background_image = "wc_flag.jpg"

for PoliticalParty in dictOfKeywords:
    keywords = dictOfKeywords[PoliticalParty]
    
    #Invert the scores of YAKE
    keyword2WordCloud = {}
    for keyword in keywords:
        if keyword[1] < 0:
            keyword2WordCloud[keyword[0]] = 1 
        else:
            keyword2WordCloud[keyword[0]] = 1 - keyword[1]

    
    mask = np.array(Image.open(f"{path_figures}/{background_image}"))
    # Object that generates the word cloud (here from the keyword weights rather than from raw text)
    wordcloud = WordCloud(background_color="white",contour_color='firebrick', max_font_size=100,width = 1520, height = 535, mask=mask).generate_from_frequencies(keyword2WordCloud)
    image_colors = ImageColorGenerator(mask)
    plt.figure(figsize=(16,9))
    plt.imshow(wordcloud.recolor(color_func=image_colors), interpolation="bilinear") # imshow plots images derived from arrays
    plt.axis("off")
    plt.savefig(f"{path_figures}/{PoliticalParty}.png", format="png")
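
The cell above inverts each score with `1 - score`, which would yield a negative weight if a score were ever larger than 1. As a hedged alternative, the sketch below rescales the scores with a min-max transformation so that every weight stays positive. The function name `invert_scores`, the `floor` parameter and the sample scores are assumptions made for illustration; this is not part of YAKE! or of the wordcloud package.

```python
def invert_scores(keywords, floor=0.1):
    """Map (keyword, score) pairs to {keyword: weight}; a lower score gives a larger weight.

    The best (lowest) score maps to 1.0 and the worst (highest) score maps to `floor`.
    """
    if not keywords:
        return {}
    scores = [score for _keyword, score in keywords]
    lowest, highest = min(scores), max(scores)
    span = highest - lowest
    weights = {}
    for keyword, score in keywords:
        if span == 0:
            # All scores are equal, so every keyword gets the same weight
            weights[keyword] = 1.0
        else:
            weights[keyword] = 1.0 - (1.0 - floor) * (score - lowest) / span
    return weights

# Hypothetical scores, including one above 1
sample = [("saúde", 0.001), ("estado", 0.5), ("exemplo", 2.0)]
weights = invert_scores(sample)
```

The resulting dictionary could be passed to `generate_from_frequencies` in place of `keyword2WordCloud`; whether the rescaled font sizes look better than the `1 - score` version is a matter of taste.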

<hr>

## Text Analysis

### Getting common keywords across the entire collection

In [None]:
ListOfKeywords = []
for PoliticalParty in dictOfKeywords2Pandas:
    ListOfKeywords.append(set(keyword.lower() for keyword in dictOfKeywords2Pandas[PoliticalParty]))

print(ListOfKeywords[0].intersection(*ListOfKeywords))
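
The expression `ListOfKeywords[0].intersection(*ListOfKeywords)` keeps only the keywords present in every set. A minimal sketch on three hand-made sets (the contents are hypothetical, not taken from the programmes) shows the idea, together with the equivalent `set.intersection(*...)` spelling:

```python
# Hypothetical keyword sets standing in for three programmes
toy_sets = [
    {"saúde", "educação", "habitação"},
    {"saúde", "educação", "segurança"},
    {"saúde", "economia", "educação"},
]

# Same pattern as in the cell above: intersect the first set with all sets
common = toy_sets[0].intersection(*toy_sets)

# Equivalent spelling that does not single out the first set
common_alt = set.intersection(*toy_sets)

print(sorted(common))
```

Both forms assume the list contains at least one set.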

### Comparing left-wing vs right-wing political parties

Next, we can compare the keywords used by left-wing and right-wing political parties. To do so, we begin by intersecting the relevant keywords extracted from the programmes of the left-wing parties:

In [None]:
ListOfLeftPoliticalParties = ["BE", "Livre", "PAN", "PCP", "PS"]

ListOfLeftKeywords = []

for PoliticalParty in ListOfLeftPoliticalParties:
    ListOfLeftKeywords.append(set(keyword.lower() for keyword in dictOfKeywords2Pandas[PoliticalParty]))

ListOfLeftKeywordsIntersected = ListOfLeftKeywords[0].intersection(*ListOfLeftKeywords)
print(ListOfLeftKeywordsIntersected)

Then, we do the same for right-wing ones:

In [None]:
ListOfRightPoliticalParties = ["CDS", "Chega", "IL", "PSD"]

ListOfRightKeywords = []

for PoliticalParty in ListOfRightPoliticalParties:
    ListOfRightKeywords.append(set(keyword.lower() for keyword in dictOfKeywords2Pandas[PoliticalParty]))

ListOfRightKeywordsIntersected = ListOfRightKeywords[0].intersection(*ListOfRightKeywords)
print(ListOfRightKeywordsIntersected)

We can now compute the difference between the two sets. First, we take the left difference and find that keywords such as "administração pública" (public administration) or "habitação" (housing) are relevant in all of the programmes of the left-wing political spectrum, but not in all of the right-wing ones.

In [None]:
LeftDifference = ListOfLeftKeywordsIntersected - ListOfRightKeywordsIntersected
LeftDifference

Likewise, keywords such as "segurança" (security) or "qualidade" (quality) were found in all of the programmes of the right-wing parties, but not in all of the left-wing ones.

In [None]:
RightDifference = ListOfRightKeywordsIntersected - ListOfLeftKeywordsIntersected
RightDifference
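
Because both differences are computed between two intersections, a keyword in `LeftDifference` is missing from at least one right-wing programme, not necessarily from all of them. The sketch below illustrates the distinction on hypothetical sets; the variable names and contents are assumptions made for illustration.

```python
# Hypothetical: keywords shared by every left-wing programme
left_common = {"saúde", "habitação"}
# Hypothetical: keywords shared by every right-wing programme
right_common = {"saúde", "segurança"}
# Hypothetical: keywords used by at least one right-wing programme
right_any = {"saúde", "segurança", "habitação"}

# Same operation as LeftDifference: not shared by ALL right-wing programmes
left_vs_all_right = left_common - right_common

# Stricter variant: not used by ANY right-wing programme
left_vs_any_right = left_common - right_any

print(left_vs_all_right, left_vs_any_right)
```

Here "habitação" survives the first difference even though one right-wing programme uses it, and disappears under the stricter variant.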

### Comparing two Political Parties

One can also compare two political parties (e.g., PS and PSD) to see the differences between them.

In [None]:
ListOfLeftPoliticalParties = ["PS"]

ListOfLeftKeywords = []

for PoliticalParty in ListOfLeftPoliticalParties:
    for keyword in dictOfKeywords2Pandas[PoliticalParty]:
        ListOfLeftKeywords.append(keyword.lower())

setOfLeftKeywords = set(ListOfLeftKeywords)

In [None]:
ListOfRightPoliticalParties = ["PSD"]

ListOfRightKeywords = []

for PoliticalParty in ListOfRightPoliticalParties:
    for keyword in dictOfKeywords2Pandas[PoliticalParty]:
        ListOfRightKeywords.append(keyword.lower())

setOfRightKeywords = set(ListOfRightKeywords)

In [None]:
setOfRightKeywords.difference(setOfLeftKeywords)

In [None]:
setOfLeftKeywords.intersection(setOfRightKeywords)  

<hr>

### Determining relevant keywords across the entire collection

Another interesting thing to do here is to count the number of times a word appears across the sixteen texts. Thus, instead of valuing words that occur a lot in a specific document and little in the rest of the collection (as TF.IDF does), we are interested in valuing words that occur frequently across the various texts considered. We use YAKE! as a filtering step in this process, which lets us focus only on keywords worth looking at, and on that basis we try to identify the most relevant keywords across the entire collection of texts. To this end, we devise a simple formula that multiplies the term frequency (TF) of a keyword in the entire collection of documents D by the log of the number of documents in which the keyword appears (|{d ∈ D : keyword ∈ d}|).
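
Written out, the formula described above takes the following form. The symbol $k$ for a keyword and the notation $\mathrm{TF}(k, D)$ are assumptions introduced here for readability; `GlobalWeight` is the name used later in this notebook, and the logarithm is taken to be the natural one, as in the `math.log` call of the corresponding cell.

```latex
\mathrm{GlobalWeight}(k) = \mathrm{TF}(k, D) \times \log \left| \{\, d \in D : k \in d \,\} \right|
```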

To compute this, we begin by determining the entire list of keywords (removing duplicates, after transforming each keyword to lowercase):

In [None]:
ListOfAllKeywords = []
for PoliticalParty in dictOfKeywords:
    for kw in dictOfKeywords[PoliticalParty]:
        ListOfAllKeywords.append(kw[0].lower())

SetOfAllKeywords = set(ListOfAllKeywords)
SetOfAllKeywords

Next, we count the term frequency of each keyword together with the number of documents where the keyword appears and save this information in a dictionary (dictOfOccurrences) of the form {"public":[4,2],…} meaning that the keyword "public" appears 4 times in 2 documents.

In [None]:
import nltk
from nltk import word_tokenize
import collections

ListOfPoliticalParties = ["ADN", "BE", "CDS", "Chega", "ErgueTe", "IL", "Livre", "MAS", "MPT", "NosCid", "PAN", "PCP", "PS", "PSD", "RIR", "Volt"]

dictOfOccurrences = {}

for PoliticalParty in ListOfPoliticalParties:
    # Use a context manager so the file handle is closed after reading
    with open(f'{path_texts}/Prog_{PoliticalParty}.txt', encoding="utf8") as file:
        text = file.read().lower()

    tokens = nltk.word_tokenize(text)

    # Frequency of every unigram, bigram and trigram in this programme
    uni = nltk.FreqDist(tokens)
    bi = nltk.FreqDist(nltk.bigrams(tokens))
    tri = nltk.FreqDist(nltk.trigrams(tokens))

    ListOfDicts = []
    ListOfDicts.append({k:v for k,v in uni.items()})
    ListOfDicts.append({" ".join(k):v for k,v in bi.items()})
    ListOfDicts.append({" ".join(k):v for k,v in tri.items()})

    counter = collections.Counter()
    for d in ListOfDicts:
        counter.update(d)

    # Named ngramCounts so that the per-party dictOfKeywords built earlier is not overwritten
    ngramCounts = dict(counter)

    for keyword in SetOfAllKeywords:
        if keyword in ngramCounts:
            if keyword in dictOfOccurrences:
                dictOfOccurrences[keyword][0] += ngramCounts[keyword]
                dictOfOccurrences[keyword][1] += 1
            else:
                dictOfOccurrences[keyword] = [ngramCounts[keyword], 1]
    
dictOfOccurrences
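
The counting step above can be illustrated without NLTK. The following minimal sketch, which assumes plain whitespace splitting as a stand-in for `nltk.word_tokenize`, counts unigrams, bigrams and trigrams in a single pass; the function name `ngram_counts` and the sample sentence are hypothetical.

```python
from collections import Counter

def ngram_counts(tokens, max_n=3):
    """Count all n-grams of sizes 1..max_n, keyed by the space-joined n-gram."""
    counts = Counter()
    for n in range(1, max_n + 1):
        # Zipping n shifted views of the token list yields consecutive n-grams
        for gram in zip(*(tokens[i:] for i in range(n))):
            counts[" ".join(gram)] += 1
    return counts

# Whitespace splitting is a simplification of a real tokenizer
tokens = "cuidados de saúde e mais cuidados de saúde".split()
counts = ngram_counts(tokens)

print(counts["saúde"], counts["cuidados de"], counts["cuidados de saúde"])
```

Keying every n-gram by its space-joined string is what allows a keyword such as "cuidados de saúde" to be looked up directly, as the cell above does for each entry of `SetOfAllKeywords`.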

Having this information available, we are now ready to compute the `GlobalWeight` for each of the keywords.

In [None]:
import math
dictOfOccurrencesWeight = {}

for keyword in dictOfOccurrences:
    dictOfOccurrencesWeight[keyword] = dictOfOccurrences[keyword][0] * (math.log(dictOfOccurrences[keyword][1]))

dictOfOccurrencesWeight
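
One consequence of this weighting is worth spelling out: since `math.log(1)` is 0, a keyword that appears in a single document receives a weight of 0, however frequent it is there. The sketch below shows this on hypothetical counts in the same `[term frequency, number of documents]` form as `dictOfOccurrences`; the figures are made up for illustration.

```python
import math

# Hypothetical occurrences: keyword -> [term frequency, number of documents]
toy_occurrences = {
    "saúde": [120, 16],
    "cuidados de saúde": [30, 8],
    "alternativa democrática": [45, 1],
}

# Same computation as above: term frequency times the log of the document count
toy_weights = {
    keyword: tf * math.log(df)
    for keyword, (tf, df) in toy_occurrences.items()
}

for keyword, weight in toy_weights.items():
    print(f"{keyword}: {weight:.2f}")
```

This behaviour matches the stated goal of favouring keywords that are spread across the collection rather than concentrated in one programme.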

The following code creates a word cloud from these global weights:

In [None]:
import yake
import numpy as np
from PIL import Image
from wordcloud import WordCloud, ImageColorGenerator
import matplotlib.pyplot as plt

background_image = "wc_map.jpg"

mask = np.array(Image.open(f"{path_figures}/{background_image}"))
# Object that generates the word cloud (here from the global weights rather than from raw text)
wordcloud = WordCloud(background_color="white",contour_color='firebrick', max_font_size=100,width = 1520, height = 535, mask=mask).generate_from_frequencies(dictOfOccurrencesWeight)
image_colors = ImageColorGenerator(mask)
plt.figure(figsize=(16,9))
plt.imshow(wordcloud.recolor(color_func=image_colors), interpolation="bilinear") # imshow plots images derived from arrays
plt.axis("off")
plt.savefig(f"{path_figures}/GlobalWordCloud.png", format="png")

The following code sorts the values in descending order as a means to feed a bar plot.

In [None]:
dictOfOccurrencesWeight_sortedByValue = {k: dictOfOccurrencesWeight[k] for k in sorted(dictOfOccurrencesWeight, key=dictOfOccurrencesWeight.get, reverse=True)}
dictOfOccurrencesWeight_sortedByValue
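
When only the top entries are needed, a full sort can be avoided. As a hedged alternative to the sorted dictionary above, the sketch below uses `heapq.nlargest` from the standard library on hypothetical weights; the names and values are illustrative assumptions.

```python
import heapq

# Hypothetical global weights
toy_global_weights = {
    "saúde": 332.7,
    "educação": 210.4,
    "habitação": 95.0,
    "segurança": 150.2,
}

# nlargest returns the n (keyword, weight) pairs with the highest weight, in descending order
top_two = heapq.nlargest(2, toy_global_weights.items(), key=lambda item: item[1])

print(top_two)
```

For a dictionary of this size the difference is negligible; the sorted dictionary remains convenient when several slices of the ranking are needed.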

We are now ready to get a bar plot of the top-50 YAKE! keywords across all the documents.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

labels = list(dictOfOccurrencesWeight_sortedByValue.keys())[:50]
y = list(dictOfOccurrencesWeight_sortedByValue.values())[:50]

plt.figure(figsize=(18,12))

plt.bar(labels, y)
plt.title("YAKE! Top-50 keywords across the entire collection")
plt.xlabel("keywords")
plt.ylabel("Global Weight")
plt.xticks(range(len(labels)), labels, rotation=90)  # one tick position per label
plt.yticks([]);
plt.savefig(f"{path_figures}/GlobalWeight.png", format="png")

From the plot, we can also observe that the keywords occupying the top positions of the dictionary are mostly single terms (e.g., "saúde"). This is easily explained by the fact that 1-term keywords are more likely to be found across all the documents than 2-term or 3-term ones. As a matter of fact, all the programmes may have the keyword "saúde" (health), but not all of them may have the keyword "cuidados de saúde" (health care). It also suggests that more elaborate solutions, not based on term frequency alone, should be studied in the future (perhaps including YAKE!'s score in the process). For now, we work around this problem by creating two further plots, for 2-term and 3-term keywords, obtained by filtering the dictionary to the specific number of terms.

To begin with, we determine the `bigramKeywords` and `trigramKeywords` lists and, based on them, build the dictionaries of bigram and trigram weights.

In [None]:
bigramKeywords = []
trigramKeywords = []

for keyword in SetOfAllKeywords:
    if len(keyword.split()) == 2:
        bigramKeywords.append(keyword)
    elif len(keyword.split()) == 3:
        trigramKeywords.append(keyword)
        
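# Keep only keywords that were actually counted in the texts, to avoid a KeyError
dictOfTrigramOccurrencesWeight = {}
for keyword in trigramKeywords:
    if keyword in dictOfOccurrencesWeight:
        dictOfTrigramOccurrencesWeight[keyword] = dictOfOccurrencesWeight[keyword]

dictOfBigramOccurrencesWeight = {}
for keyword in bigramKeywords:
    if keyword in dictOfOccurrencesWeight:
        dictOfBigramOccurrencesWeight[keyword] = dictOfOccurrencesWeight[keyword]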

dictOfTrigramOccurrencesWeight_sortedByValue = {k: dictOfTrigramOccurrencesWeight[k] for k in sorted(dictOfTrigramOccurrencesWeight, key=dictOfTrigramOccurrencesWeight.get, reverse=True)}
dictOfBigramOccurrencesWeight_sortedByValue = {k: dictOfBigramOccurrencesWeight[k] for k in sorted(dictOfBigramOccurrencesWeight, key=dictOfBigramOccurrencesWeight.get, reverse=True)}
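
The filtering and sorting steps above follow the same pattern for bigrams and trigrams, so they can be folded into one helper. The following is a minimal sketch under that assumption; the function name `weights_for_ngram_size` and the sample weights are hypothetical.

```python
def weights_for_ngram_size(weights, size):
    """Keep keywords with exactly `size` whitespace-separated terms, sorted by descending weight."""
    selected = {k: w for k, w in weights.items() if len(k.split()) == size}
    return dict(sorted(selected.items(), key=lambda item: item[1], reverse=True))

# Hypothetical global weights mixing 1-, 2- and 3-term keywords
toy_ngram_weights = {
    "saúde": 300.0,
    "serviço nacional": 80.0,
    "cuidados de saúde": 60.0,
    "administração pública": 120.0,
}

bigram_weights = weights_for_ngram_size(toy_ngram_weights, 2)
trigram_weights = weights_for_ngram_size(toy_ngram_weights, 3)

print(list(bigram_weights))
```

Because the helper iterates over the weight dictionary itself, it never looks up a keyword that has no weight.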

Plotting the bigrams:

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

labels = list(dictOfBigramOccurrencesWeight_sortedByValue.keys())[:50]
y = list(dictOfBigramOccurrencesWeight_sortedByValue.values())[:50]

plt.figure(figsize=(18,8))

plt.bar(labels, y)
plt.title("YAKE! Top-50 keywords (2-terms) across the entire collection")
plt.xlabel("keywords")
plt.ylabel("Global Weight")
plt.xticks(range(len(labels)), labels, rotation=90)  # one tick position per label
plt.yticks([]);
plt.savefig(f"{path_figures}/GlobalWeight_2terms.png", format="png")

Plotting the trigrams:

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

labels = list(dictOfTrigramOccurrencesWeight_sortedByValue.keys())[:50]
y = list(dictOfTrigramOccurrencesWeight_sortedByValue.values())[:50]

plt.figure(figsize=(18,8))

plt.bar(labels, y)
plt.title("YAKE! Top-50 keywords (3-terms) across the entire collection")
plt.xlabel("keywords")
plt.ylabel("Global Weight")
plt.xticks(range(len(labels)), labels, rotation=90)  # one tick position per label
plt.yticks([]);
plt.savefig(f"{path_figures}/GlobalWeight_3terms.png", format="png")

<hr>

## References

Please cite the following works when using YAKE:

**In-depth journal paper at Information Sciences Journal**

- Campos, R., Mangaravite, V., Pasquali, A., Jorge, A., Nunes, C. and Jatowt, A. (2020). YAKE! Keyword Extraction from Single Documents using Multiple Local Features. In Information Sciences Journal. Elsevier, Vol 509, pp 257-289. [pdf](https://doi.org/10.1016/j.ins.2019.09.013)

**ECIR'18 Best Short Paper**

- Campos R., Mangaravite V., Pasquali A., Jorge A.M., Nunes C., and Jatowt A. (2018). A Text Feature Based Automatic Keyword Extraction Method for Single Documents. In: Pasi G., Piwowarski B., Azzopardi L., Hanbury A. (eds). Advances in Information Retrieval. ECIR 2018 (Grenoble, France. March 26 – 29). Lecture Notes in Computer Science, vol 10772, pp. 684 - 691. [pdf](https://link.springer.com/chapter/10.1007/978-3-319-76941-7_63)

- Campos R., Mangaravite V., Pasquali A., Jorge A.M., Nunes C., and Jatowt A. (2018). YAKE! Collection-independent Automatic Keyword Extractor. In: Pasi G., Piwowarski B., Azzopardi L., Hanbury A. (eds). Advances in Information Retrieval. ECIR 2018 (Grenoble, France. March 26 – 29). Lecture Notes in Computer Science, vol 10772, pp. 806 - 810. [pdf](https://link.springer.com/chapter/10.1007/978-3-319-76941-7_80)