# Twitter and politics

## Setup

In [None]:
%load_ext autoreload
%autoreload 2
!python -m pip install -r requirements.txt
from utils.initialization import *


The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
You should consider upgrading via the '/root/venv/bin/python -m pip install --upgrade pip' command.

## Project description

This project investigates how Danish politicians act on the social media platform [twitter.com](https://twitter.com).

<b><u> Data </u></b>

The data set used for this project is a collection of tweets from the top 3 politicians by vote count from each party with at least 5 mandates as of the 2022 Danish election.

<b><u> Problem formulation </u></b>

This project will investigate the following questions:

1. What types of issues do the politicians tweet about, and how does this reflect their "mærkesager" (signature issues, i.e. the specific ideological or political goals they promote)?
2. How are politicians clustered based on their tweets and how does this compare to the clustering based on their party? 
3. How can we create a candidate test based on the politicians' tweets?

<b><u> Methods for solving the problem </u></b>

The project will use the following methods to solve the problems stated above:

1. Word clouds for each party/politician reveal the words they use the most, and using <b>frequent itemsets</b> we may recognize a "mærkesag" (signature issue).
2. Different <b>clustering</b> techniques taught in the course will be investigated and used to cluster the politicians.
3. A <b>recommender system</b> is created based on a user tweet and the tweets of politicians to form a candidate test. Here <b>similar items</b> can be used to obtain a <b>similarity score</b> used by the recommender system.

<b><u> Code </u></b>

Most of the code used for the project is written directly in this notebook. However, to avoid cluttering the notebook, some of the code is written as functions in separate files and imported into the notebook. These files can be found in the `utils/` folder of the GitHub repository for this project, available [here](https://github.com/s180820/02807_dkpol).

<b><u> Contribution </u></b>

All members of the group contributed equally to the project both in terms of coding and writing. The group worked together on the project and discussed the results and the code. 

## Data

### Getting the data

Based on https://www.dr.dk/nyheder/politik/folketingsvalg/valgte, we create a list of the top 3 politicians by vote count from each party with at least 5 mandates as of the 2022 Danish election.

Data is collected using the <font color="red"> TODO </font> Twitter API.

As the Twitter API needs the Twitter user ID to collect tweets, the user IDs are collected manually using the ["Find Twitter ID"](https://www.codeofaninja.com/tools/find-twitter-id/) tool by [CodeOfaNinja](https://www.codeofaninja.com/).

The resulting politicians and their twitter ids are seen in the dataframe below.

We observe that we have 12 different parties and 36 politicians in total.

In [None]:
from Data.twitter_ids import twitter_ids

# Flatten the nested {party: {name: twitter_id}} mapping into one row per politician
rows = [
    (person, party, twitter_id)
    for party, people in twitter_ids.items()
    for person, twitter_id in people.items()
]
data = pd.DataFrame(rows, columns=["name", "party", "twitter_id"])

display(data)


|  | name | party | twitter_id |
|---|---|---|---|
| 0 | magnus_heunicke | socialdemokratiet | 22695562 |
| 1 | nicolai_wammen | socialdemokratiet | 2803948786 |
| 2 | mattias_tesfaye | socialdemokratiet | 546254893 |
| 3 | jakob_ellemann | venstre | 155584627 |
| 4 | soren_gade | venstre | 975064362359623680 |
| 5 | sophie_lohde | venstre | 44611200 |
| 6 | lars_lokke | moderaterne | 26201346 |
| 7 | henrik_frandsen | moderaterne | 1249019841924734977 |
| 8 | rosa_eriksen | moderaterne | 1560192117858861056 |
| 9 | jacob_mark | sf | 2373406198 |


The data set is then created with the Twitter API using these IDs. The code used for this is collected in the file `data_request.ipynb`. <font color="red"> TODO </font> collect the data retrieval code in a .py file and explain it here.

This results in one or more CSV files per politician, each looking like this:

In [None]:
print("Example of data frame: ")
example_df = pd.read_csv("Data/tweets/alternativet/christina_olumeko_0.csv", index_col=0)
display(example_df.head())

print("\033[1mExample of tweet: \033[0m")
example_tweet = example_df.loc[0, "text"]
print(example_tweet)

Example of data frame: 


|  | edit_history_tweet_ids | id | text |
|---|---|---|---|
| 0 | ['1566066637933088769'] | 1566066637933088769 | Rigtig ærgerligt, at Socialdemokratiet dropper... |
| 1 | ['1566055972031922176'] | 1566055972031922176 | “Kommissionen for den glemte kvindekamp” under... |
| 2 | ['1565978593318129665'] | 1565978593318129665 | @Mhvid @SophieHAndersen Ja og særligt når komm... |
| 3 | ['1565956735612985347'] | 1565956735612985347 | Grineren video fra @Vejdirektoratet. Lad os ... |
| 4 | ['1565947952501334018'] | 1565947952501334018 | @AFreltoft @politiken Dejligt at Københavns go... |


Example of tweet: 
Rigtig ærgerligt, at Socialdemokratiet dropper at arbejde for mindre gadeparkering i København🚗Det optager nemlig meget plads, og er samtidig nødvendigt for at nå Københavns klimamål. Alternativet arbejder videre 🌱 #dkgreen https://t.co/lUB0dMk5ud


### Cleaning the data

As we are only interested in the text of each tweet, we collect the text for each politician as a list of strings (tweets) and add it as a new column to the data. Before doing this, we clean the text as follows:

- only keep words and numbers, i.e. no emojis
- remove punctuation, stopwords and urls
- make all words lowercase

<font color="red"> Maybe we can use mrjob to do this? </font>

In [None]:
print("Original tweet: \n", example_tweet)
print("\n")
print("Clean tweet: \n" + clean_tweet(example_tweet))

Original tweet: 
 Rigtig ærgerligt, at Socialdemokratiet dropper at arbejde for mindre gadeparkering i København🚗Det optager nemlig meget plads, og er samtidig nødvendigt for at nå Københavns klimamål. Alternativet arbejder videre 🌱 #dkgreen https://t.co/lUB0dMk5ud


Clean tweet: 
rigtig ærgerligt socialdemokratiet dropper arbejde mindre gadeparkering københavn optager nemlig plads samtidig nødvendigt nå københavns klimamål alternativet arbejder videre dkgreen
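
The cleaning steps listed above can be sketched roughly as follows. This is a minimal illustration and not the project's actual `clean_tweet`: the function name `clean_tweet_sketch`, the regular expressions and the tiny `STOPWORDS` set are assumptions made for this example, and a real Danish stopword list is much longer.

```python
import re

# Tiny illustrative stopword set (assumption); a real Danish list is much longer.
STOPWORDS = {"at", "for", "i", "og", "er", "det", "meget"}

URL_RE = re.compile(r"https?://\S+")
# Anything that is not a letter, digit or whitespace (emojis, punctuation, '#', '@', '_')
NON_WORD_RE = re.compile(r"[^\w\s]|_")


def clean_tweet_sketch(text, stopwords=STOPWORDS):
    """Remove URLs, emojis and punctuation, lowercase, and drop stopwords."""
    text = URL_RE.sub(" ", text)       # 1. strip URLs before touching punctuation
    text = NON_WORD_RE.sub(" ", text)  # 2. keep only words and numbers
    tokens = text.lower().split()      # 3. lowercase and split on whitespace
    return " ".join(t for t in tokens if t not in stopwords)


cleaned = clean_tweet_sketch(
    "København🚗Det optager meget plads, og er nødvendigt! https://t.co/abc #dkgreen"
)
print(cleaned)
```

Removing URLs first matters: once punctuation has been stripped, a link would fall apart into ordinary-looking tokens that can no longer be recognized as a URL.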


In [None]:
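import ast

filename = "Data/cleaned_data.csv"
if not os.path.exists(filename):
    # Run the mrjob cleaning script over all raw tweet files
    os.system("python utils/clean_data_mrjob.py Data/tweets > Data/tmp_cleaned_data.txt")
    rows = []
    with open("Data/tmp_cleaned_data.txt", "rb") as f:
        for line in f:
            # Each output line is a one-entry dict literal: {name: [tweets]}
            # ast.literal_eval parses literals only, unlike eval, which runs arbitrary code
            record = ast.literal_eval(line.decode())
            name, tweets = next(iter(record.items()))
            rows.append({"name": name, "tweets": tweets})
    pd.DataFrame(rows, columns=["name", "tweets"]).to_csv(filename, index=False)

data_ = pd.read_csv(filename)
data = data.merge(data_, on="name")
# The CSV stores each tweet list as its string representation; parse it back into a list
data["tweets"] = [ast.literal_eval(t) for t in data.tweets]
# Tokenize each politician's tweets and keep alphanumeric tokens only
data["tokens"] = [[w for w in word_tokenize(" ".join(tweets)) if w.isalnum()] for tweets in data["tweets"]]
data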

|  | name | party | twitter_id | tweets |
|---|---|---|---|---|
| 0 | magnus_heunicke | socialdemokratiet | 22695562 | [afsætter året styrke hjælpen børn pårørende a... |
| 1 | nicolai_wammen | socialdemokratiet | 2803948786 | [dage siden sagde nyt ejendomsvurderingssystem... |
| 2 | mattias_tesfaye | socialdemokratiet | 546254893 | [this is literally the same logic many th c am... |
| 3 | jakob_ellemann | venstre | 155584627 | [tide få fleksibel genåbning vores børn ældre ... |
| 4 | soren_gade | venstre | 975064362359623680 | [kære marianne synes burde læse lovforslaget i... |
| 5 | sophie_lohde | venstre | 44611200 | [flertallet veto dermed røde partier stort set... |
| 6 | lars_lokke | moderaterne | 26201346 | [mon ikke sjov form argumentation mangler lidt... |
| 7 | jacob_mark | sf | 2373406198 | [slår fast syvtommersøm kom så godt igennem fo... |
| 8 | pia_dyhr | sf | 65025162 | [stemmer nok selvom synes gør godt klaus, brug... |
| 9 | kirsten_andersen | sf | 235646319 | [arbejde få medarbejdere ser virkeligheden sun... |


## Similarity
In this section we explore the similarity between Danish politicians based on how they communicate on Twitter. The similarity measure is the Jaccard similarity, estimated using minhashing on the TF-IDF words of all tweets each politician has posted, and the result is visualized as a heatmap. We expect politicians from the same party to be more similar than politicians from different parties. It is also possible that two politicians with opposing views on the same topic end up similar, which is interesting to investigate.

### Jaccard similarity

The similarity measure we apply is the Jaccard similarity, which defines the similarity of two sets as the relative size of their intersection: the Jaccard similarity of sets $A$ and $B$ is $J(A, B) = |A \cap B| / |A \cup B|$. We use it to find textually similar sets of tweets in the corpus of collected tweets. The comparison is purely textual: it looks at the text that is used, not at the meaning it conveys, so no semantic information is extracted. Consequently, when irony or sarcasm is used, i.e. when the intended meaning differs from the literal meaning, the similarity score will be higher than it should be. As irony and sarcasm are widely used in Danish, we expect this to negatively affect the accuracy of our similarity scores and recommendations.
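
As a small worked example of the definition above, the exact Jaccard similarity of two word sets can be computed directly. The function name `jaccard` and the example words are illustrative assumptions, not taken from the project code.

```python
def jaccard(a, b):
    """Exact Jaccard similarity |A n B| / |A u B| of two collections of words."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Convention chosen for this sketch: two empty sets have similarity 0
        return 0.0
    return len(a & b) / len(union)


# Two hypothetical politicians sharing 2 of 4 distinct words
words_a = {"klima", "skat", "børn"}
words_b = {"klima", "skat", "sundhed"}
print(jaccard(words_a, words_b))  # 2 shared words / 4 distinct words = 0.5
```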


We have chosen to compute the similarity on the TF-IDF words of each politician's collected tweets, in order to work at a content-based rather than a context-based level.

### TF-IDF
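
As a rough illustration of the idea (not the notebook's actual implementation), the sketch below scores each word by term frequency times inverse document frequency, treating all tokens of one politician as one document, and keeps the highest-scoring words per politician. The function name `top_tfidf_words`, the parameter `k` and the unsmoothed `log(N / df)` weighting are assumptions made for this example.

```python
import math
from collections import Counter


def top_tfidf_words(docs, k=3):
    """docs: {name: list of tokens}. Returns {name: set of the k highest TF-IDF words}."""
    n_docs = len(docs)
    # Document frequency: in how many politicians' token lists each word occurs
    df = Counter(word for tokens in docs.values() for word in set(tokens))
    result = {}
    for name, tokens in docs.items():
        tf = Counter(tokens)
        scores = {
            word: (count / len(tokens)) * math.log(n_docs / df[word])
            for word, count in tf.items()
        }
        # Sort by descending score; break ties alphabetically for determinism
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        result[name] = set(ranked[:k])
    return result


docs = {
    "a": ["klima", "klima", "skat"],
    "b": ["skat", "sundhed"],
}
print(top_tfidf_words(docs, k=1))
```

Words used by every politician get a score of zero under this weighting, so the selected words are those that distinguish one politician from the others.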










### Minhashing and the signature matrix

The goal of MinHash is to estimate $J(A, B)$ quickly, without explicitly computing the intersection and union. Our collected data set of tweets is not particularly large, but nonetheless, for the sake of theoretical application and hypothetical scalability, we use minhashing to convert large sets into much smaller representations, called signatures. We can then compare the signatures of two sets and estimate the Jaccard similarity of the underlying sets from the signatures alone.

Each signature consists of the results of 300 calculations, each of which is a “minhash” of the collected TF-IDF words. ...
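
A minimal sketch of how such a signature could be computed is shown below. It is an illustration under stated assumptions rather than the project's implementation: words are mapped to integers with `zlib.crc32`, and the 300 "permutations" are simulated with random linear hash functions `(a * x + b) % PRIME`. All names (`PRIME`, `make_hash_params`, `minhash_signature`, `estimate_jaccard`) are hypothetical.

```python
import random
import zlib

PRIME = 4294967311  # a prime just above 2**32, the range of crc32 values


def make_hash_params(num_hashes=300, seed=0):
    """Draw (a, b) coefficients for num_hashes random linear hash functions."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(num_hashes)]


def minhash_signature(words, params):
    """Signature = for each hash function, the minimum hash value over the set.

    Assumes `words` is non-empty.
    """
    base = [zlib.crc32(w.encode("utf-8")) for w in set(words)]
    return [min((a * x + b) % PRIME for x in base) for a, b in params]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree; estimates the Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


params = make_hash_params(num_hashes=300)
set_a = [f"w{i}" for i in range(0, 60)]
set_b = [f"w{i}" for i in range(30, 90)]  # true Jaccard similarity is 30 / 90
sig_a = minhash_signature(set_a, params)
sig_b = minhash_signature(set_b, params)
print(estimate_jaccard(sig_a, sig_b))
```

The same `params` must be used for every set, since signatures are only comparable position by position when they come from the same hash functions.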




In [None]:
# # Load data 
# TODO 

# # make a dataframe of the excluded names
# exclude = ['Anders_Bjarklev', 'Anders_Lund_Madsen', 'DTU', 'Michael_Kristiansen',
#             'Peter_Mogensen', 'Selma_Montgomery']
# excluded_df = data[data.name.isin(exclude)]
# data = data[~data.name.isin(exclude)] 

## Clustering

## Topic modelling 

## Recommendation system 
Additionally, we will create a recommendation system using Locality-Sensitive Hashing (LSH), imitating the well-known Danish ‘candidate test’ (https://www.dr.dk/nyheder/politik/folketingsvalg/kandidattest). The candidate test is widely used by citizens before parliamentary elections to help them decide which politician to vote for: based on a questionnaire covering numerous topics, it outputs a list of the politicians they agree with the most. Our recommendation system will return the top 3 most similar politicians for a test subject with a known political standpoint. (Open point: we do not know whether the test subjects' Twitter profiles reflect their political standpoint, and sarcasm and irony, which are widely used in Danish, may affect the result.)
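
One way such a recommender could work is sketched below, assuming every politician and the test subject are already represented by MinHash signatures of equal length. The signatures are split into bands, politicians sharing at least one identical band with the user become candidates, and the candidates are ranked by signature agreement. The function names, the `bands` parameter and the toy signatures are hypothetical and only meant to illustrate the banding technique.

```python
from collections import defaultdict


def lsh_candidates(signatures, query_sig, bands):
    """Return names whose signature matches the query in at least one band.

    signatures: {name: list[int]}, all of the same length as query_sig,
    and that length must be divisible by `bands`.
    """
    rows = len(query_sig) // bands
    buckets = defaultdict(set)
    for name, sig in signatures.items():
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].add(name)
    candidates = set()
    for b in range(bands):
        key = (b, tuple(query_sig[b * rows:(b + 1) * rows]))
        candidates |= buckets.get(key, set())
    return candidates


def recommend(signatures, query_sig, top_n=3, bands=100):
    """Rank LSH candidates by the fraction of agreeing signature positions."""
    def agreement(name):
        sig = signatures[name]
        return sum(x == y for x, y in zip(sig, query_sig)) / len(query_sig)

    candidates = lsh_candidates(signatures, query_sig, bands)
    return sorted(candidates, key=lambda n: (-agreement(n), n))[:top_n]


# Toy signatures of length 6, split into 3 bands of 2 rows each
toy = {
    "a": [1, 2, 3, 4, 5, 6],
    "b": [1, 2, 9, 9, 5, 6],
    "c": [7, 7, 7, 7, 7, 7],
    "d": [1, 2, 3, 4, 0, 0],
}
print(recommend(toy, [1, 2, 3, 4, 5, 6], top_n=3, bands=3))
```

With only a few dozen politicians, comparing the user against everyone directly would be just as feasible; the banding step mainly demonstrates how the approach would scale to many more candidates.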
