<a href="https://colab.research.google.com/github/victorknox/rude-mood/blob/main/politeness_classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

 Downloading the libraries and dependencies

 ---

In [None]:
!python -m spacy download en_core_web_sm
!pip3 install convokit

In [None]:
!python -m nltk.downloader all

In [8]:
import pandas as pd
from csv import reader
import numpy as np
from tqdm import tqdm
from collections import defaultdict
import spacy

We use the Cornell Conversational Analysis Toolkit ([Convokit](https://convokit.cornell.edu/)) to run the politeness analysis.

In [12]:
import convokit

In [13]:
from convokit import Corpus, Speaker, Utterance
from convokit import download
from pandas import DataFrame
from typing import List, Dict, Set
import nltk

In [14]:
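# Load the first 100 comments and add the identifier columns used to build the corpus.
df = pd.read_csv('/content/dataset/lin_comments.csv', nrows=100)
ids = list(df.index)
df['id'] = ids
df['timestamp'] = ids        # placeholder: the row index stands in for a real timestamp
df['conversation_id'] = ids  # each comment is treated as its own conversation
df['reply_to'] = ids
modif_df = df.rename(columns={'body': 'text', 'author': 'speaker'})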

In [15]:
modif_df

Unnamed: 0,title,text,votes,subreddit name,speaker,politeness score,id,timestamp,conversation_id,reply_to
0,Why can Linux run on most desktops but not mos...,"It _can_ run on most phones, but requires a lo...",134,linux,PureTryOut,,0,0,0,0
1,Why can Linux run on most desktops but not mos...,There's Android. There's [Sailfish OS](https:/...,175,linux,da_peda,,1,1,1,1
2,Why can Linux run on most desktops but not mos...,Two reasons. Most phones have locked boot load...,117,linux,1_p_freely,,2,2,2,2
3,Why can Linux run on most desktops but not mos...,Most android devices run on downstream kernels...,25,linux,Worldly_Topic,,3,3,3,3
4,Why can Linux run on most desktops but not mos...,Phone kernels are very far downstream from mai...,9,linux,Atemu12,,4,4,4,4
...,...,...,...,...,...,...,...,...,...,...
95,Let us introduce you to the secure open-source...,Why do open source projects choose such terrib...,8,linux,RootHouston,,95,95,95,95
96,Let us introduce you to the secure open-source...,This is going to sound petty (because it is). ...,115,linux,fuckEAinthecloaca,,96,96,96,96
97,Let us introduce you to the secure open-source...,Looks interesting! The deployment is kinda wei...,32,linux,ABotelho23,,97,97,97,97
98,Let us introduce you to the secure open-source...,how does this compare to RocketChat?,18,linux,limeunderground,,98,98,98,98


Generating a [Convokit Corpus](https://convokit.cornell.edu/documentation/corpus.html) from the above pandas dataframe

In [16]:
new_corpus=Corpus.from_pandas(modif_df)

100it [00:00, 4413.85it/s]
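
Before building the corpus it can help to confirm that the dataframe carries the columns the conversion relies on. The sketch below is a plain-Python check, not part of Convokit. The list of required names is an assumption based on the columns prepared above and should be checked against the Convokit documentation. `missing_corpus_columns` is a hypothetical helper, and the raw column names are inferred from the `rename` call.

```python
# Column names that Corpus.from_pandas is assumed to require (verify against the docs).
REQUIRED_COLUMNS = ("id", "timestamp", "text", "speaker", "reply_to", "conversation_id")


def missing_corpus_columns(columns):
    """Return the required column names that are absent from `columns`."""
    present = set(columns)
    return [name for name in REQUIRED_COLUMNS if name not in present]


# Raw CSV columns, inferred from the rename call ('body' -> 'text', 'author' -> 'speaker').
raw_columns = ["title", "body", "votes", "subreddit name", "author", "politeness score"]

# Columns of modif_df as displayed in the table above.
prepared_columns = ["title", "text", "votes", "subreddit name", "speaker",
                    "politeness score", "id", "timestamp", "conversation_id", "reply_to"]

print(missing_corpus_columns(raw_columns))       # all six required names
print(missing_corpus_columns(prepared_columns))  # []
```

In the notebook itself the same check could be run as `missing_corpus_columns(modif_df.columns)`.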


- Printing basic information about the corpus

In [17]:
new_corpus.print_summary_stats()

Number of Speakers: 72
Number of Utterances: 100
Number of Conversations: 100


Importing the [Convokit Classifier](https://convokit.cornell.edu/documentation/classifier.html) and its dependencies to run our politeness detection analysis

In [18]:
import random
from sklearn import svm
from scipy.sparse import csr_matrix
from sklearn.metrics import classification_report

In [19]:
from convokit import Classifier

- Convokit offers many corpora on which the classifier can be trained.

- Here we use the [wikipedia-politeness-corpus](https://convokit.cornell.edu/documentation/wiki_politeness.html) as training data.

In [20]:
wiki_corpus = Corpus(download("wikipedia-politeness-corpus"))
# binary_corpus = Corpus(utterances=[utt for utt in wiki_corpus.iter_utterances() if utt.meta["Binary"] != 0])

Downloading wikipedia-politeness-corpus to /root/.convokit/downloads/wikipedia-politeness-corpus
Downloading wikipedia-politeness-corpus from http://zissou.infosci.cornell.edu/convokit/datasets/wikipedia-politeness-corpus/wikipedia-politeness-corpus.zip (1.7MB)... Done


Importing the text parser and Politeness Strategies features to annotate the wikipedia corpus

In [23]:
from convokit import TextParser
parser = TextParser()

In [24]:
from convokit import PolitenessStrategies
ps = PolitenessStrategies()

Annotating the corpus with politeness strategies

In [25]:
wiki_corpus = ps.transform(wiki_corpus, markers=True)

We take a subset of the corpus, keeping only utterances with a nonzero `Binary` label, since we are only interested in the clearly polite and impolite examples.

In [28]:
binary_corpus = Corpus(utterances=[utt for utt in wiki_corpus.iter_utterances() if utt.meta["Binary"] != 0])
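
The filter above and the labeller used in the classifier cells together turn the `Binary` field into a boolean target. The sketch below reproduces that logic on toy dictionaries rather than real Convokit utterances. It assumes `Binary` takes the values 1, 0 and -1, and the IDs are made up for illustration.

```python
# Toy stand-ins for utterance metadata; real utterances are Convokit objects.
toy_meta = [
    {"id": "a", "Binary": 1},
    {"id": "b", "Binary": 0},
    {"id": "c", "Binary": -1},
    {"id": "d", "Binary": 1},
]

# Same filter as the cell above: drop records whose Binary label is 0.
kept = [m for m in toy_meta if m["Binary"] != 0]

# Same rule as the labeller in the classifier cells: True only when Binary == 1.
labels = {m["id"]: m["Binary"] == 1 for m in kept}

print(labels)  # {'a': True, 'c': False, 'd': True}
```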

In [None]:
# clf_cv = Classifier(obj_type="utterance", 
#                         pred_feats=["politeness_strategies"], 
#                         labeller=lambda utt: utt.meta['Binary'] == 1)

# clf_cv.evaluate_with_cv(binary_corpus)

**Training the Classifier**
1. Generating the `train_corpus`

In [31]:
# clf.summarize(test_pred)
test_ids = binary_corpus.get_utterance_ids()[-100:]
train_corpus = Corpus(utterances=[utt for utt in binary_corpus.iter_utterances() if utt.id not in test_ids])
# test_corpus = Corpus(utterances=[utt for utt in binary_corpus.iter_utterances() if utt.id in test_ids])
print("train size = {}".format(len(train_corpus.get_utterance_ids())))


train size = 2078


  2. Initializing the classifier object and training over `train_corpus`

In [32]:
clf = Classifier(obj_type="utterance", 
                        pred_feats=["politeness_strategies"], 
                        labeller=lambda utt: utt.meta['Binary'] == 1)
clf.fit(train_corpus)

Initialized default classification model (standard scaled logistic regression).


<convokit.classifier.classifier.Classifier at 0x7f960f086910>
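
The hold-out split used to build `train_corpus` keeps everything except the last 100 utterance IDs. A minimal sketch of that split on plain lists follows. `split_holdout` is a hypothetical helper, the IDs are synthetic, and the total of 2178 is inferred from the printed train size (2078) plus the 100 held out. Converting the held-out IDs to a set makes each membership test constant time.

```python
def split_holdout(ids, n_test=100):
    """Hold out the last `n_test` IDs for testing; return (train_ids, test_ids)."""
    test_ids = ids[-n_test:]
    test_set = set(test_ids)  # set lookup avoids scanning a list for every ID
    train_ids = [i for i in ids if i not in test_set]
    return train_ids, test_ids


# Synthetic IDs standing in for binary_corpus.get_utterance_ids().
all_ids = [f"utt_{n}" for n in range(2178)]
train_ids, test_ids = split_holdout(all_ids)

print(len(train_ids), len(test_ids))  # 2078 100
```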

**Politeness prediction in another corpus using the same classifier**
1. Initializing the `test_corpus`. Note that `new_corpus` is the corpus we built from `lin_comments.csv`.

In [34]:
test_ids_new = new_corpus.get_utterance_ids()[0:]
test_corpus = Corpus(utterances=[utt for utt in new_corpus.iter_utterances() if utt.id in test_ids_new])
print("train size = {}, test size = {}".format(len(train_corpus.get_utterance_ids()),
                                               len(test_corpus.get_utterance_ids())))

train size = 2078, test size = 100


  2. Running the classifier on each [utterance](https://convokit.cornell.edu/documentation/utterance.html) of the `test_corpus`

In [35]:
list_utt = []
for utt in new_corpus.iter_utterances():
  if utt.id in test_ids_new:
    try:
      # Parse the text, then annotate it with politeness strategies.
      utt = parser.transform_utterance(utt)
      utt = ps.transform_utterance(utt)
      list_utt.append(utt)
    except Exception:
      # Skip utterances that fail to parse or annotate.
      continue
test_pred = clf.transform_objs(list_utt)
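
Because failing utterances are skipped, the number of predictions can be smaller than the size of the test corpus. The results table below ends at index 91 although 100 comments were loaded. One way to keep track of what was dropped is to collect failures alongside the results. The sketch below shows the pattern on plain strings. `transform_all` and `toy_transform` are hypothetical names, not Convokit functions.

```python
def transform_all(items, transform):
    """Apply `transform` to each item; return (results, failures)."""
    results, failures = [], []
    for item in items:
        try:
            results.append(transform(item))
        except Exception as exc:
            # Record what failed and why instead of skipping silently.
            failures.append((item, repr(exc)))
    return results, failures


def toy_transform(text):
    """Hypothetical annotator that rejects blank strings."""
    if not text.strip():
        raise ValueError("empty text")
    return text.upper()


done, failed = transform_all(["thanks a lot", "   ", "could you help?"], toy_transform)
print(len(done), len(failed))  # 2 1
```

In the notebook, `transform` would wrap the parser and politeness annotation calls, and `failed` would list the utterances missing from the results.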

In [67]:
records= []
for i in test_pred:
  text=i.text
  prediction=i.retrieve_meta('prediction')
  score=i.retrieve_meta('pred_score')
  # string =f'prediction: {prediction}  score: {score}\n'
  records.append([str(text),str(prediction),str(score)])


Storing the results in a CSV file

In [68]:
import csv
fields = ['text', 'prediction', 'score']
# newline='' stops the csv module from writing blank rows on some platforms.
with open('results.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(fields)
    writer.writerows(records)

Results:

In [69]:
results=pd.read_csv('results.csv')
results

Unnamed: 0,text,prediction,score
0,"It _can_ run on most phones, but requires a lo...",0,0.280605
1,Two reasons. Most phones have locked boot load...,0,0.170371
2,Most android devices run on downstream kernels...,0,0.249747
3,Phone kernels are very far downstream from mai...,0,0.190903
4,It's more accurate to say you can run the same...,0,0.144272
...,...,...,...
88,Why do open source projects choose such terrib...,0,0.102888
89,This is going to sound petty (because it is). ...,1,0.540058
90,Looks interesting! The deployment is kinda wei...,0,0.091368
91,how does this compare to RocketChat?,0,0.119554
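
One way to read the table is that `prediction` is 1 whenever `score` reaches 0.5. That threshold is an assumption about the default model and is not stated in the notebook. The sketch below checks it against the rows displayed above only, not the full results file.

```python
# (prediction, score) pairs copied from the rows displayed above.
shown = [
    (0, 0.280605), (0, 0.170371), (0, 0.249747), (0, 0.190903), (0, 0.144272),
    (0, 0.102888), (1, 0.540058), (0, 0.091368), (0, 0.119554),
]

THRESHOLD = 0.5  # assumed decision boundary, not confirmed by the notebook


def label_from_score(score, threshold=THRESHOLD):
    """Return 1 when the score is at or above the threshold, else 0."""
    return int(score >= threshold)


consistent = all(label_from_score(score) == pred for pred, score in shown)
print(consistent)  # True
```

The displayed rows are consistent with the assumed threshold, though this does not establish how the classifier derives its predictions.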
