Characteristic words in Twitter (TT) descriptions
=================================================

This experiment looks at the Twitter account descriptions of Polish Sejm MPs.
The question is whether the descriptions contain words that are characteristic of particular parties.

This is a first toy example: the number of descriptions is small (fewer than 400), so the results should not be read as statistically significant. Only the biggest parties were included, and characteristic words were found for only some of them. Still, it is a useful first step toward more challenging tasks.

The solution vectorizes each description with TF-IDF and builds a simple classifier (a random forest) on top of it.
Descriptions were preprocessed so that only word-like tokens remained, and each token was reduced to its canonical form (lemmatized) to get rid of inflection.
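
To make the TF-IDF step concrete, here is a minimal pure-Python sketch. It uses the smoothed IDF and L2 normalisation that scikit-learn documents as the `TfidfVectorizer` defaults, but it is not the library's implementation. The three toy descriptions and the function name `tfidf_vectors` are made up for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document (smoothed IDF, L2-normalised)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    # Smoothed inverse document frequency: rare terms get larger values
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1.0
           for term, df in doc_freq.items()}

    vectors = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        raw = {term: count * idf[term] for term, count in term_freq.items()}
        # L2 normalisation (fall back to 1.0 for an empty document)
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        vectors.append({term: w / norm for term, w in raw.items()})
    return vectors


# Hypothetical, already lemmatized toy descriptions
toy_docs = [
    "poseł na sejm rp pis",
    "poseł na sejm rp platforma obywatelski",
    "poseł na sejm rp lewica",
]
vectors = tfidf_vectors(toy_docs)
for term, weight in sorted(vectors[0].items(), key=lambda item: -item[1]):
    print(f"{term:10s} {weight:.3f}")
```

In this toy corpus the words shared by every description ("poseł", "na", "sejm", "rp") receive a lower weight than the party name, which is the property the classifier relies on.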

Although the classifier is a very simple model, it achieved a reasonable result on the test set (a weighted F1-score of 0.73 with 6 labels). The task was fairly easy, though: in most cases the description directly states the MP's affiliation, and such words were identified as the top features used in classification. Still, some other words were also selected as important features, and those are the interesting ones to check.

Data collection: around 2023-01-11


In [13]:
from aipolit.utils.text import read_tsv

In [231]:
# You need to create this file yourself
polit_data = read_tsv('../local_data/politycy-dane.tsv')

In [275]:
# For pretty printing of table data
import pandas as pd
from IPython.display import display, HTML

def show_pretty_table(raw_data, header):
    df = pd.DataFrame(raw_data, columns=header)
    display(HTML(df.to_html()))

In [325]:
# Source for parts of this code: 
# https://datascience.stackexchange.com/questions/103735/methods-for-finding-characteristic-words-for-a-group-of-documents-in-comparison

import numpy as np
from collections import defaultdict
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier

In [233]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, precision_recall_fscore_support

In [234]:
import random

In [235]:
# Used for lemmatization
import spacy

nlp_pl = spacy.load("pl_core_news_sm")

In [236]:
# Remove emojis preprocessing step
# Source: https://stackoverflow.com/questions/33404752/removing-emojis-from-a-string-in-python
import re
def remove_emojis(data):
    emoj = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\U00002500-\U00002BEF"  # box drawing, shapes, misc symbols, dingbats, arrows
        u"\U00002702-\U000027B0"  # dingbats
        u"\U000024C2-\U0001F251"
        u"\U0001f926-\U0001f937"
        u"\U00010000-\U0010ffff"
        u"\u2640-\u2642" 
        u"\u2600-\u2B55"
        u"\u200d"
        u"\u23cf"
        u"\u23e9"
        u"\u231a"
        u"\ufe0f"  # variation selector-16 (emoji presentation)
        u"\u3030"
                      "]+", re.UNICODE)
    return re.sub(emoj, '', data)
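
Some of the ranges above are very broad, so it is worth checking that they cannot touch Polish letters. The sketch below copies only two of the broad ranges plus the zero-width joiner (the lowest code point in the pattern) into a hypothetical helper, `strip_broad_ranges`, and shows that Polish diacritics lie far below them, while CJK characters do fall inside and get removed.

```python
import re

# Two broad ranges from remove_emojis plus U+200D, the lowest code point it lists
BROAD_RANGES = re.compile("[\U000024C2-\U0001F251\U00010000-\U0010FFFF\u200d]+")

POLISH_LETTERS = "ąćęłńóśźżĄĆĘŁŃÓŚŹŻ"


def strip_broad_ranges(text):
    """Remove every character covered by the broad ranges above."""
    return BROAD_RANGES.sub("", text)


# The highest Polish letter is far below U+200D, so diacritics are safe
highest_polish = max(ord(ch) for ch in POLISH_LETTERS)
print(hex(highest_polish), "<", hex(0x200D))

# The flag (two regional-indicator code points) is removed, the text is kept
print(repr(strip_broad_ranges("Zażółć gęślą jaźń 🇵🇱")))

# CJK characters fall inside U+24C2..U+1F251 and are removed as well
print(repr(strip_broad_ranges("漢字 poseł")))
```

For Polish-only descriptions this side effect is harmless, but it would matter if the same helper were reused on text in other scripts.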

In [266]:
# Remove urls
# Source:
# https://stackoverflow.com/questions/11331982/how-to-remove-any-url-within-a-string-in-python

def remove_urls(text):
    text = re.sub(r'(https|http)?:\/\/(\w|\.|\/|\?|\=|\&|\%)*\b', '', text, flags=re.MULTILINE)
    return text

In [270]:
def remove_emails(text):
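    # '-' goes last in the class so it is a literal hyphen, not a range;
    # the TLD class has no '|', which would otherwise match a literal pipe
    text = re.sub(r'([A-Za-z0-9]+[._-])*[A-Za-z0-9]+@[A-Za-z0-9-]+(\.[A-Za-z]{2,})+', "", text)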
    return text
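
A detail that is easy to get wrong in patterns like this: inside a character class, a hyphen between two characters denotes a range, and `|` is an ordinary character rather than alternation. The short check below illustrates both points; the helper name `matched_chars` is hypothetical.

```python
import re

RANGE_CLASS = re.compile(r"[.-_]")    # '-' between '.' and '_' defines a range
LITERAL_CLASS = re.compile(r"[._-]")  # '-' placed last is a literal hyphen
PIPE_CLASS = re.compile(r"[A-Z|a-z]")  # '|' inside a class is a literal pipe


def matched_chars(pattern, candidates):
    """Return the candidate characters that the one-character pattern accepts."""
    return [ch for ch in candidates if pattern.fullmatch(ch)]


candidates = list(".-_A7@")
# The range covers U+002E..U+005F: digits, '@' and uppercase letters, but not '-'
print(matched_chars(RANGE_CLASS, candidates))
# The literal class accepts exactly '.', '_' and '-'
print(matched_chars(LITERAL_CLASS, candidates))
print(bool(PIPE_CLASS.fullmatch("|")))
```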

In [271]:
# Keep only word-like tokens:
# strip emojis, URLs and e-mail addresses, collapse non-word characters,
# then lemmatize every token
def preprocess_sentence(nlp, text):
    text = remove_emojis(text)
    text = remove_urls(text)
    text = remove_emails(text)
    text = re.sub(r"[^\w]+", " ", text)
    
    text = re.sub(r"^\s+", "", text)
    text = re.sub(r"\s+$", "", text)
    text = re.sub(r"\s+", " ", text)
    
    result = []
    doc = nlp(text)  # use the pipeline passed in, not the global one
    for token in doc:
        #print(token.text, token.pos_, token.dep_, token.lemma_)
        result.append(token.lemma_)

    return " ".join(result)

In [272]:
# Demonstrating how preprocessing works
sample_sentence = "2022 🇵🇱 Minister ds. Unii Europejskiej #babieslivesmatter http://example.com s@a.com"

preprocessed_sentence = preprocess_sentence(nlp_pl, sample_sentence)

print("Input:\n  " + sample_sentence)
print("")
print("After preprocessing:\n  " + preprocessed_sentence)


Input:
  2022 🇵🇱 Minister ds. Unii Europejskiej #babieslivesmatter http://example.com s@a.com

After preprocessing:
  2022 minister do spraw unia europejski babieslivesmatter


In [341]:
# Loading sample data

# not enough data for all the parties :(
processed_parties = ['PiS', 'PO', 'Lewica', 'Solidarna Polska', 'PSL', 'Polska2050']

#processed_parties = ['PiS', 'PO']

descriptions_texts = []
descriptions_labels = []

for e in polit_data:
    party = e['party']
    desc = e['description']
    if not desc:
        continue

    if party in processed_parties:
        desc = preprocess_sentence(nlp_pl, desc)
        # Keep only descriptions with at least 10 characters after preprocessing
        if len(desc) < 10:
            continue
        descriptions_texts.append(desc)
        descriptions_labels.append(party)  
        
print("Descriptions:", len(descriptions_texts))
print("Labels:", len(descriptions_labels))

Descriptions: 319
Labels: 319


In [357]:

from collections import Counter

def show_label_distribution():
    print("Label distribution")

    # Counter keeps first-seen order, so rows follow the order of the data
    label_count = Counter(descriptions_labels)
    raw_data = list(label_count.items())
    show_pretty_table(raw_data, ["Label", "Count"])
    
show_label_distribution()

Label distribution


index,Label,Count
0,PiS,146
1,PO,94
2,PSL,21
3,Lewica,32
4,Solidarna Polska,19
5,Polska2050,7
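
The distribution is heavily skewed, so a majority-class baseline is a useful yardstick for the classifier scores reported later. The sketch below uses the counts from the table above; the function name `majority_baseline` is hypothetical, and the figure refers to the whole dataset rather than to the test split.

```python
# Label counts copied from the table above
label_counts = {
    "PiS": 146,
    "PO": 94,
    "PSL": 21,
    "Lewica": 32,
    "Solidarna Polska": 19,
    "Polska2050": 7,
}


def majority_baseline(counts):
    """Return the most frequent label and the accuracy of always predicting it."""
    total = sum(counts.values())
    label, count = max(counts.items(), key=lambda item: item[1])
    return label, count / total


baseline_label, baseline_accuracy = majority_baseline(label_counts)
print(f"Always predicting {baseline_label}: accuracy {baseline_accuracy:.3f}")
```

Always answering "PiS" would be right for 146 of the 319 descriptions, roughly 46%.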


In [358]:
print("Sample descriptions:")
desc_and_labels = list(zip(descriptions_texts, descriptions_labels))

def show_desc_sample():
    data = []
    for entry in random.sample(desc_and_labels, k=10):
        desc = entry[0]
        label = entry[1]
        data.append((label, desc))
    show_pretty_table(data, ['label', 'tt description'])  

show_desc_sample()

Sample descriptions:


index,label,tt description
0,PiS,wiceminister sport i Turystyki poseł na sejm RP VI VII VIII oraz IX kadencja Facebook
1,PO,poseł z ca sekretarz generalny pO przewodniczaca sejmowy komisja Infrastruktury wiceminister Infrastruktury i rozwój w Rząda platforma obywatelski
2,PiS,pełnomocnik rząd do spraw centralny port Komunikacyjn wiceminister Fundusz i polityka regionalny poseł na sejm RP VIII i IX kadencja
3,PiS,marszałek senior sejm RP Wiceprezes PiS poseł Ziemia piotrkowski przewodniczący podkomisja Smoleński minister obrona narodowy w rok 2015 2018
4,PO,poseł na sejm RP wiceprzewodniczący platforma obywatelski minister obrona narodowy 2011 2015 wiceprezes rada minister 2014 2015
5,PiS,poseł na sejm RP klub parlamentarny prawo i sprawiedliwość
6,PiS,poseł na sejm RP sekretarz stan Cyfryzacja w kancelaria prezes rada minister MorawieckiM stowarzyszenie DlaPolski
7,PO,poseł na sejm mąż ani tata Ignacy i Ksawer
8,PO,poseł członek komisja sprawa zagraniczny i do spraw unia europejski
9,Lewica,poseł na sejm RP IX Kadencja Lider _ _ Lewica w województwo łódzki ZawszeBliskoŁodzić


In [380]:
# Build classifier

tfidf = TfidfVectorizer(
    min_df=0.001, max_df=0.2, max_features=10_000, ngram_range=(1, 3),
    token_pattern=r"(?u)\b\w+\b") # one char tokens are also valid


X_train, X_test, y_train, y_test = train_test_split(
    descriptions_texts,
    descriptions_labels, 
    test_size=0.2, 
    random_state=42)


# 1. Fit the vectorizer on the training descriptions only
X_train_tfidf_matrix = tfidf.fit_transform(X_train)

# 2. Train classifier (no random_state is set, so scores vary slightly between runs)
clf = RandomForestClassifier()
clf.fit(X_train_tfidf_matrix, y_train)
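
With only 7 descriptions for Polska2050, a purely random split can leave a small party with very few or no test examples. scikit-learn's `train_test_split` accepts a `stratify` argument for this purpose; the pure-Python sketch below shows the idea behind it. It is an illustration, not what the notebook ran: the function name `stratified_split` and the placeholder texts are hypothetical, and only the label counts come from the notebook.

```python
import random
from collections import Counter, defaultdict


def stratified_split(texts, labels, test_size=0.2, seed=42):
    """Split so that every label keeps roughly the same share in train and test."""
    indices_by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        indices_by_label[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(indices_by_label):
        indices = indices_by_label[label]
        rng.shuffle(indices)
        # Reserve at least one test example per label
        n_test = max(1, round(len(indices) * test_size))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])

    def pick(seq, idxs):
        return [seq[i] for i in idxs]

    return (pick(texts, train_idx), pick(texts, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))


# Placeholder data with the label counts reported in the notebook
counts = {"PiS": 146, "PO": 94, "PSL": 21, "Lewica": 32,
          "Solidarna Polska": 19, "Polska2050": 7}
toy_labels = [label for label, n in counts.items() for _ in range(n)]
toy_texts = [f"description {i}" for i in range(len(toy_labels))]

X_tr, X_te, y_tr, y_te = stratified_split(toy_texts, toy_labels)
print(Counter(y_te))
```

Note that a label with a single example would end up only in the test set under this sketch, so very rare classes still need separate handling.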


In [394]:
# Sample classifier predictions on TEST set

def show_sample_predictions():
    MAX_SAMPLE_TO_SHOW = 100

    X_test_tfidf_matrix = tfidf.transform(X_test[0:MAX_SAMPLE_TO_SHOW])
    y_test_actual = clf.predict(X_test_tfidf_matrix)

    raw_data = []
    for desc, pred_actual, pred_expected in zip(
        X_test[0:MAX_SAMPLE_TO_SHOW],
        y_test_actual,
        y_test[0:MAX_SAMPLE_TO_SHOW],
    ):
        PRED_STATUS = "FAIL"
        if pred_actual == pred_expected:
            PRED_STATUS = "SUCCESS"
        raw_data.append((PRED_STATUS, pred_actual, pred_expected, desc))
    show_pretty_table(raw_data, ['Result', "Actual", "Expected", "TT Description"])
    
        
show_sample_predictions()


index,Result,Actual,Expected,TT Description
0,SUCCESS,PiS,PiS,poseł na sejm RP sekretarz stan w MI_GOV_PL b minister gospodarka morski i żegluga śródlądowy b poseł do PE 2009 2015 pisorgpl
1,SUCCESS,PiS,PiS,minister sprawa zagraniczny MSZ_RP minister of Foreign Affairs of poland polandMFAć tweets as Chairperson in Office of the OSCE OSCE2022POL
2,SUCCESS,PiS,PiS,poseł na sejm RP VIII i IX kadencja prezydium Republikanie _
3,SUCCESS,PiS,PiS,b minister transport 05 07 poseł na sejm RP III IV V VI VII VIII kadencja PiS górny Śląsk Polska
4,SUCCESS,PiS,PiS,poseł na sejm RP sekretarz stan w ministerstwo rolnictwo i rozwój wieś
5,SUCCESS,PSL,PSL,poseł na sejm RP Wiceprezes PSL oraz szef łódzki struktura wojewódzki prezes forum młody Ludowców w rok 2005 2013 pasjonat piłka nożny
6,FAIL,PiS,PO,poseł na sejm RP
7,SUCCESS,PO,PO,mama Weronik i Mikołaj socjolożka posłanka na sejm RP Wiceprzewodnicząca PO RP Wiceprzewodnicząca sejmowy komisja polityka społeczny i rodzina
8,SUCCESS,PiS,PiS,poseł na sejm RP PiS sekretarz stan w min Klimat i środowisko główny konserwator przyroda
9,FAIL,PiS,PSL,poseł na sejm RP IX kadencja prezes podlaskiy zarząd wojewódzki PSL Mors


In [391]:
# Score of the classifier
from sklearn.metrics import classification_report


def show_score(clf, X_input, y_expected, dataset_name):
    X_tfidf_matrix = tfidf.transform(X_input)

    y_actual = clf.predict(X_tfidf_matrix)
    score = precision_recall_fscore_support(y_expected, y_actual, average='weighted')
    print(f"Score for {dataset_name} is:")
    print(f"  precision: {score[0]}")
    print(f"  recall: {score[1]}")
    print(f"  f-score: {score[2]}")
    print(f"  Support: {score[3]}")
    print("")
    
    print("Classification report")
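    # Pass labels explicitly so target_names stay aligned even if a class is missing
    print(classification_report(
        y_expected, y_actual, labels=clf.classes_, target_names=clf.classes_))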
    
show_score(clf, X_train, y_train, "TRAIN SET")
show_score(clf, X_test, y_test, "TEST SET")

Score for TRAIN SET is:
  precision: 0.9771352608840146
  recall: 0.9764705882352941
  f-score: 0.97638002395614
  Support: None

Classification report
                  precision    recall  f1-score   support

          Lewica       1.00      0.97      0.98        29
              PO       1.00      0.99      0.99        71
             PSL       1.00      0.88      0.94        17
             PiS       0.96      0.99      0.97       114
      Polska2050       1.00      1.00      1.00         6
Solidarna Polska       0.94      0.94      0.94        18

        accuracy                           0.98       255
       macro avg       0.98      0.96      0.97       255
    weighted avg       0.98      0.98      0.98       255

Score for TEST SET is:
  precision: 0.7909064440993789
  recall: 0.765625
  f-score: 0.734665194040194
  Support: None

Classification report
                  precision    recall  f1-score   support

          Lewica       1.00      1.00      1.00         3
    

  _warn_prf(average, modifier, msg_start, len(result))
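
The test-set f-score is lower than both the precision and the recall. That is possible because, with `average='weighted'`, the f-score is the support-weighted mean of the per-class F1 values, not the harmonic mean of the averaged precision and recall. The pure-Python sketch below reproduces the effect on made-up predictions; the function names are hypothetical, and undefined ratios are set to 0, which is the convention scikit-learn warns about.

```python
def per_class_prf(y_true, y_pred):
    """Return {label: (precision, recall, f1, support)}."""
    stats = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        stats[label] = (precision, recall, f1, tp + fn)
    return stats


def weighted_average(stats):
    """Average precision, recall and F1 with class support as the weight."""
    total = sum(s[3] for s in stats.values())
    return tuple(sum(s[i] * s[3] for s in stats.values()) / total
                 for i in range(3))


# Made-up example: PO is rarely predicted, PiS is over-predicted
y_true = ["PO"] * 5 + ["PiS"] * 5
y_pred = ["PO"] + ["PiS"] * 4 + ["PiS"] * 5

precision, recall, f1 = weighted_average(per_class_prf(y_true, y_pred))
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Here PO has perfect precision but poor recall and PiS the reverse, so each per-class F1 is dragged down and the weighted F1 ends up below both averaged figures.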


In [383]:
feature_names = tfidf.get_feature_names_out()
print("Feature count:", len(feature_names))
# stop_words_ holds terms dropped by min_df / max_df / max_features, not a linguistic stop list
print("Number of stopwords:", len(tfidf.stop_words_))
#print("Stopwords:", tfidf.stop_words_)
#print("Features:", feature_names)

Feature count: 4986
Number of stopwords: 12


In [384]:
TOP_N_FEATURES = 50 # Top N features to be considered
MIN_CONFIDENCE = 0.6 # Take only results with high confidence

# 3. Get feature importances
feature_importances = clf.feature_importances_

# 4. Sort and get important features
word_indices = np.argsort(feature_importances)[::-1] # using argsort we get indices of important features

top_words_per_class = defaultdict(list)

for word_idx in word_indices[:TOP_N_FEATURES]:
    word = feature_names[word_idx]
    # Probe the classifier with a document that contains only this feature
    # and see which class it points to
    clf_input = [word]
    class_probs = clf.predict_proba(tfidf.transform(clf_input))[0]
    class_idx = np.argmax(class_probs)
    class_prob = class_probs[class_idx]
    if class_prob < MIN_CONFIDENCE:
        continue
    word_class = clf.classes_[class_idx]    
    top_words_per_class[word_class].append(word)

In [385]:
for label, top_words in top_words_per_class.items():
    print(f"Top words characteristic for class: {label}")
    for word in top_words:
        print(f"  {word}")
    print("")

Top words characteristic for class: PO
  platforma obywatelski
  posłanka na sejm
  koalicja obywatelski
  rp platforma obywatelski

Top words characteristic for class: Lewica
  nowy lewica
  _ lewica
  _ _ lewica

Top words characteristic for class: PiS
  prawo i sprawiedliwość
  prawo i
  pis
  sprawiedliwość
  prawo
  i sprawiedliwość
  stan
  sekretarz stan
  minister
  komisja
  ix
  viii

Top words characteristic for class: Polska2050
  poseł rp polska2050

