# Debate analysis example

In this notebook we present an example of how to evaluate the aspects we defined on short debates. As an example we use the topic of Universal Basic Income.
We will use three different debates: 
* one from [Wikidebate](https://en.wikiversity.org/wiki/Category:Wikidebates) titled ['_Should universal basic income be established?_'](https://en.wikiversity.org/wiki/Should_universal_basic_income_be_established%3F)
* one from [Kialo](https://www.kialo.com) titled ['_Should governments provide a universal basic income?_'](https://www.kialo.com/should-governments-provide-a-universal-basic-income-14053)
* one from [r/changemyview](https://www.reddit.com/r/changemyview/) titled ['_CMV: Universal basic income is the way of the future._'](https://www.reddit.com/r/changemyview/comments/tdmuae/cmv_universal_basic_income_is_the_way_of_the/).

For further general information on data collection, see the README file; the code is in the `src/data_collection/` folder.

We start by importing all the functions we will use to evaluate the defined metrics.

In [None]:
from src.complexity_utils import *
from src.disagreement_utils import *
from src.equality_engagement_utils import *
from src.reason_utils import *
from src.sentiment_utils import *
from src.sourcing_utils import *
from src.topic_distance_utils import *

from sentence_transformers import SentenceTransformer,util,SimilarityFunction
from transformers import AutoTokenizer

import pandas as pd

We proceed to import the data we need. In this notebook we have three separate CSV files. Each row of these files contains a post, which comes with the following information:
* **id**: a unique id identifying the post
* **page_id**: an id identifying the topic 
* **item**: the content of the post
* **parent_id**: the id of the parent post; for root posts this value is 0
* **title**: a title identifying the topic
* **debate_id**: an id identifying the debate (as Kialo and CMV in the original dataset may have different debates for the same topic)
* **length**: post length in characters
* **level**: depth level of the post (root posts have level 0)
* **thread_id**: an id identifying the thread each post belongs to
* **author**: author(s) of the post; in the case of Wikidebate more than one user could be involved. In the case of Kialo we cannot associate each post with its authors, so this column contains NaN values, and we use unmatched statistics when needed.
* **platform**: short name of the platform the post was published in. 

Moreover, Change My View has:

* **original_item**: post content before preprocessing, used to extract references. 
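
To make the schema more concrete, the following minimal sketch rebuilds the `level` column from `id` and `parent_id` on a few made-up rows. It assumes, as described above, that root posts have a `parent_id` of 0 and a level of 0. The toy rows and the helper name `compute_levels` are hypothetical and not part of the repository.

```python
def compute_levels(posts):
    """Return {post id: depth}, where root posts (parent_id == 0) have depth 0."""
    parent_of = {post["id"]: post["parent_id"] for post in posts}
    levels = {}

    def depth(post_id):
        # Walk up to the root once per post and cache the result.
        if post_id not in levels:
            parent = parent_of[post_id]
            levels[post_id] = 0 if parent == 0 else depth(parent) + 1
        return levels[post_id]

    for post_id in parent_of:
        depth(post_id)
    return levels


# Hypothetical toy thread: post 1 is the root, 2 and 4 reply to it, 3 replies to 2.
toy_posts = [
    {"id": 1, "parent_id": 0},
    {"id": 2, "parent_id": 1},
    {"id": 3, "parent_id": 2},
    {"id": 4, "parent_id": 1},
]
levels = compute_levels(toy_posts)
```

A check of this kind can be useful for confirming that `parent_id` and `level` are consistent after loading the files. For very deep threads an iterative version would avoid Python's recursion limit.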

In [None]:
platforms=['wiki','kialo','cmv']
wiki_data=pd.read_csv('data/UBI_wiki.csv',index_col=0)
kialo_data=pd.read_csv('data/UBI_kialo.csv',index_col=0)
cmv_data=pd.read_csv('data/UBI_cmv.csv',index_col=0)
merged_data = pd.concat([wiki_data,kialo_data,cmv_data],ignore_index=True)

### Engagement & Equality
We now evaluate the engagement and equality scores for our data. The function returns two lists: the first contains the engagement scores for Wikidebate, Kialo and CMV, respectively, while the second contains the equality scores in the same order.

In [None]:
engagement_value,equality_value=engagement_and_equality_assignment(merged_data,platforms)

### Sourcing 
To evaluate the sourcing score we first get the number of references contained in each debate using the function *get_platforms_reference_number*, which returns a list containing the total number of references for the Wikidebate, Kialo and CMV debates.
The actual score is then obtained by dividing this number by the total number of posts in the debate.

In [None]:
number_of_references=get_platforms_reference_number(merged_data,platforms)
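
The division step described above is not part of the cell, so here is a minimal sketch of it on made-up numbers. The helper name `sourcing_scores` and the toy counts are hypothetical. The sketch only assumes that the reference counts and the post counts are given in the same platform order.

```python
def sourcing_scores(reference_counts, post_counts):
    """Divide each debate's reference count by its number of posts."""
    scores = []
    for refs, posts in zip(reference_counts, post_counts):
        # Guard against an empty debate to avoid a division by zero.
        scores.append(refs / posts if posts else 0.0)
    return scores


# Hypothetical counts, ordered as Wikidebate, Kialo, CMV.
toy_reference_counts = [12, 30, 5]
toy_post_counts = [48, 120, 50]
toy_scores = sourcing_scores(toy_reference_counts, toy_post_counts)
```

In the notebook, the post counts might be obtained with something like `[(merged_data['platform'] == p).sum() for p in platforms]`, assuming the `platform` column holds the same short names as the `platforms` list.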

### Topic diversity

To evaluate topic diversity we employ the `SentenceTransformers` and `transformers` libraries. In particular, we use the `all-MiniLM-L6-v2` model. The first step is thus to load the model and the tokenizer.

In [None]:
model_sbert_name = "all-MiniLM-L6-v2"
model_sbert = SentenceTransformer(model_sbert_name)
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")


First we collect the thread ids in a matrix (`threads_id_matrices`), which we will use later to keep track of the thread each post belongs to.
Then we obtain matrices containing the embeddings of each post (`embeddings_matrices`), computed with the model we imported. The `debate_matrices` contain the `debate_id` of each post and keep track of the debate each post belongs to.

In [None]:
threads_id_matrices=get_thread_id_claims(merged_data,platforms)
embeddings_matrices, debate_matrices, documents_matrices = get_sentence_transformers_claim_embeddings(merged_data, platforms, model_sbert, tokenizer)

The embeddings of the posts belonging to the same thread are aggregated by averaging them. The distance between threads of the same debate is then obtained as 1 - cosine similarity.
It is recommended to set a `treshold` that controls the maximum number of posts to consider for each thread. In this case, `treshold` random posts are selected from each thread, and the operation is repeated `num_runs` times to ensure reliability.
The function `obtain_debate_wise_topic_distance` returns a vector containing the average of the topic distances between threads of each debate.
In our case it is a list containing, respectively, the topic distances of the Wikidebate, Kialo and CMV debates.

In [None]:
emb_dir = 'data/st_embeddings'
deb_dir = 'data/st_debates'
treshold=11

debate_wise_topic_distance = obtain_debate_wise_topic_distance(merged_data,platforms,treshold,emb_dir,deb_dir,
                                                            threads_id_matrices,model_sbert,num_runs=10)
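
The procedure described above (averaging post embeddings per thread, taking 1 - cosine similarity between threads, and repeating a random subsample several times) can be illustrated with a minimal pure-Python sketch. This is not the implementation of `obtain_debate_wise_topic_distance`, whose internals may differ. The function names, the toy two-dimensional "embeddings" and the `seed` parameter are all assumptions made for illustration.

```python
import math
import random
from itertools import combinations


def mean_vector(vectors):
    """Component-wise average of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def cosine_distance(u, v):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def debate_topic_distance(threads, threshold, num_runs, seed=0):
    """Average pairwise distance between thread centroids over several runs.

    `threads` maps a thread id to the list of its post embeddings and must
    contain at least two threads.
    """
    rng = random.Random(seed)
    run_means = []
    for _ in range(num_runs):
        centroids = []
        for posts in threads.values():
            # Keep at most `threshold` randomly chosen posts per thread.
            sample = rng.sample(posts, threshold) if len(posts) > threshold else posts
            centroids.append(mean_vector(sample))
        distances = [cosine_distance(u, v) for u, v in combinations(centroids, 2)]
        run_means.append(sum(distances) / len(distances))
    return sum(run_means) / len(run_means)


# Hypothetical toy debate with two threads pointing in orthogonal directions.
toy_threads = {
    "thread_a": [[1.0, 0.0], [1.0, 0.0]],
    "thread_b": [[0.0, 1.0], [0.0, 1.0]],
}
toy_distance = debate_topic_distance(toy_threads, threshold=1, num_runs=3)
```

Repeating the subsampling and averaging over the runs reduces the dependence of the result on any single random draw, which is presumably the role of `num_runs` in the cell above.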

In [None]:
polarity=VADER_sentiment(merged_data)
sentiment=tex_blob_sentiment(merged_data,platforms)
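# Use distinct variable names so the imported functions are not shadowed if the cell is re-run
mltd_value=mltd(merged_data)
readability_value=readability_score(merged_data)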