### Politeness prediction with ConvoKit

This notebook demonstrates how to use ConvoKit to train a simple classifier that predicts the politeness level of a request from the politeness strategies it uses, following the paper [A computational approach to politeness with application to social factors](https://www.cs.cornell.edu/~cristian/Politeness.html). Note that this notebook is *not* intended to reproduce the paper's results: legacy code for reproducibility is available at this [repository](https://github.com/sudhof/politeness).

In [3]:
import pandas as pd
import numpy as np
from tqdm import tqdm

import sys
sys.path.insert(0, "../Cornell-Conversational-Analysis-Toolkit/")
import convokit
from convokit import Corpus, User, Utterance

print(convokit.__file__)

../Cornell-Conversational-Analysis-Toolkit/convokit/__init__.py


In [4]:
from pandas import DataFrame
from typing import List, Dict, Set

#### 1: load annotated dataset

We will be using the Wikipedia annotations from the [Stanford Politeness Corpus](https://www.cs.cornell.edu/~cristian/Politeness.html).

The code below demonstrates how to convert the original CSV file into the corpus format expected by ConvoKit, but the resulting corpus can also be downloaded directly using the helper function `download("wiki-politeness-annotated")`.

In [109]:
# you may need to modify the filepath depending on where your downloaded version is stored 
df = pd.read_csv("/kitchen/experimental_liye/perception/Stanford_politeness_corpus/wikipedia.annotated.csv")

- to see how the data looks:

In [16]:
df.head(2)

Unnamed: 0,Community,Id,Request,Score1,Score2,Score3,Score4,Score5,TurkId1,TurkId2,TurkId3,TurkId4,TurkId5,Normalized Score
0,Wikipedia,629705,Where did you learn English? How come you're t...,13,9,11,11,5,A2UFD1I8ZO1V4G,A2YFPO0N4GIS25,AYG3MF094634L,A38WUWONC7EXTO,A15DM9BMKZZJQ6,-1.120049
1,Wikipedia,244336,Thanks very much for your edit to the <url> ar...,23,16,24,21,25,A2QN0EGBRGJU1M,A2GSW5RBAT5LQ5,AO5E3LWBYM72K,A2ULMYRKQMNNFG,A3TFQK7QK8X6LM,1.313955


First, we need to convert the data to the format ConvoKit expects. The following helper function does the job.

In [32]:
def convert_df_to_corpus(df: DataFrame, id_col: str, text_col: str, meta_cols: List[str]) -> Corpus:
    
    """ Helper function to convert data to Corpus format
     
    Arguments:
        df {DataFrame} -- the actual data, in a pandas DataFrame
        id_col {str} -- name of the column that holds the utterance ids
        text_col {str} -- name of the column that holds the utterance texts
        meta_cols {List[str]} -- names of the columns that hold relevant metadata
    
    Returns:
        Corpus -- the converted corpus
    """
    
    # in this particular case, user, reply_to, and timestamp information are all not applicable 
    # and we will simply either create a placeholder entry, or leave it as None 
        
    user = User("wiki_user")
    time = "NOT_RECORDED"

    utterance_list = []    
    for index, row in tqdm(df.iterrows()):
        
        # extract the metadata for this row
        metadata = {meta_col: row[meta_col] for meta_col in meta_cols}
        
        # positional arguments: id, user, root, reply_to, timestamp, text
        utterance_list.append(Utterance(row[id_col], user, row[id_col], None, time,
                                        row[text_col], meta=metadata))
    
    return Corpus(utterances = utterance_list)

For metadata, we will include the normalized score, its corresponding binary label, and all original annotations with turker information. The binary label is based on a 75th vs. 25th percentile cutoff: technically there are three classes, but we will only look at the two ends, hence "binary".

- detailed annotations information 

In [30]:
# for simplicity, we will condense the turker information together
df["Annotations"] = [dict(zip([df.iloc[i]["TurkId{}".format(j)] for j in range(1,6)], \
                             [df.iloc[i]["Score{}".format(j)] for j in range(1,6)])) for i in tqdm(range(len(df)))]

100%|██████████| 4353/4353 [00:10<00:00, 430.54it/s]
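
To make the comprehension above concrete, here is a minimal sketch that builds the same kind of turker-to-score mapping for the first row shown by `df.head(2)`. The plain `row` dict is only a stand-in for a DataFrame row, introduced for illustration.

```python
# A plain dict standing in for one DataFrame row
# (values copied from the first row of the df.head(2) output above).
row = {
    "TurkId1": "A2UFD1I8ZO1V4G", "TurkId2": "A2YFPO0N4GIS25",
    "TurkId3": "AYG3MF094634L", "TurkId4": "A38WUWONC7EXTO",
    "TurkId5": "A15DM9BMKZZJQ6",
    "Score1": 13, "Score2": 9, "Score3": 11, "Score4": 11, "Score5": 5,
}

# Pair the j-th turker id with the j-th score, for j = 1..5.
turk_ids = [row["TurkId{}".format(j)] for j in range(1, 6)]
scores = [row["Score{}".format(j)] for j in range(1, 6)]
annotations = dict(zip(turk_ids, scores))

print(annotations)
```

Because the result is a dict keyed by turker id, a row in which the same turker id appeared twice would keep only the last score for that id.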


- polite vs. impolite label (note that we are only interested in labels that are either +1 or -1)

In [35]:
# computing the binary label based on Normalized score
top = np.percentile(df['Normalized Score'], 75)
bottom = np.percentile(df["Normalized Score"], 25)
df['Binary'] = [int(score >= top) - int(score <= bottom) for score in df['Normalized Score']]
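
The labeling rule above can be checked on a handful of numbers. The sketch below uses made-up scores and the standard library's `statistics.quantiles` in place of `np.percentile`; with `method="inclusive"` it interpolates linearly between data points, which should match NumPy's default behavior.

```python
from statistics import quantiles

# Toy normalized scores (made up for illustration, not taken from the corpus).
scores = [0.3, -1.5, 2.0, -0.2, 0.9, -0.8, 1.4, 0.0]

# quantiles(..., n=4) returns the three quartile cut points;
# we keep the 25th and 75th percentiles and ignore the median.
bottom, _, top = quantiles(scores, n=4, method="inclusive")

# +1 at or above the 75th percentile, -1 at or below the 25th, 0 otherwise.
labels = [int(s >= top) - int(s <= bottom) for s in scores]

for s, label in zip(scores, labels):
    print("{:5.1f} -> {:2d}".format(s, label))
```

Scores strictly between the two cutoffs receive the label 0, which is the middle class that the prediction task later discards.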

In [39]:
wiki_corpus = convert_df_to_corpus(df, "Id", "Request", ["Normalized Score", "Binary", "Annotations"])

4353it [00:01, 3687.95it/s]


In [48]:
# if you were to download the data directly, here is how: 
# from convokit import download
# wiki_corpus = Corpus(download("wiki-politeness-annotated"))

In [5]:
wiki_corpus = Corpus(filename="../.convokit/saved-corpora/wiki-politeness-annotated/")

#### 2: annotate the corpus with politeness strategies

To get politeness strategies for each utterance, we will first obtain dependency parses for the utterances, and then check for strategy use. 

In [13]:
from convokit import Parser, PolitenessStrategies

- adding dependency parses

In [14]:
annotator = Parser()
wiki_corpus = annotator.fit_transform(wiki_corpus)

- adding strategy information

In [15]:
ps = PolitenessStrategies()
wiki_corpus = ps.transform(wiki_corpus)

Below is an example of how a processed utterance now looks. Dependency parses are stored in `parsed`, and politeness strategies are stored in `politeness_strategies`.

In [16]:
wiki_corpus.get_utterance(629705)

Utterance({'id': 629705, 'user': User([('name', 'wiki_user')]), 'root': 629705, 'reply_to': None, 'timestamp': 'NOT_RECORDED', 'text': "Where did you learn English? How come you're taking on a third language?", 'meta': {'Normalized Score': -1.1200492637766977, 'Binary': -1, 'Annotations': {'A2UFD1I8ZO1V4G': 13, 'A2YFPO0N4GIS25': 9, 'AYG3MF094634L': 11, 'A38WUWONC7EXTO': 11, 'A15DM9BMKZZJQ6': 5}, 'parsed': Where did you learn English? How come you're taking on a third language?, 'politeness_strategies': {'feature_politeness_==Please==': 0, 'feature_politeness_==Please_start==': 0, 'feature_politeness_==Indirect_(btw)==': 0, 'feature_politeness_==Hedges==': 0, 'feature_politeness_==Factuality==': 0, 'feature_politeness_==Deference==': 0, 'feature_politeness_==Gratitude==': 0, 'feature_politeness_==Apologizing==': 0, 'feature_politeness_==1st_person_pl.==': 0, 'feature_politeness_==1st_person==': 0, 'feature_politeness_==1st_person_start==': 0, 'feature_politeness_==2nd_person==': 1, 'fea
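
The `politeness_strategies` field is a dict of binary indicators, one per strategy. To illustrate only the shape of such a representation, here is a heavily simplified sketch based on word lists. The function `toy_strategy_features`, its word lists, and its feature names are all hypothetical; this is not how ConvoKit's `PolitenessStrategies` detects strategies (it relies on the dependency parses added above).

```python
import re

def toy_strategy_features(text):
    """Hypothetical word-list stand-in; NOT ConvoKit's strategy detection."""
    tokens = re.findall(r"[a-z']+", text.lower())
    second_person = {"you", "your", "you're"}
    gratitude = {"thanks", "thank"}
    return {
        "toy_second_person": int(any(t in second_person for t in tokens)),
        "toy_gratitude": int(any(t in gratitude for t in tokens)),
    }

# Text of utterance 629705, as shown in the output above.
print(toy_strategy_features(
    "Where did you learn English? How come you're taking on a third language?"
))
```

Each utterance thus becomes a fixed set of 0/1 indicators, which is the form the classifier in the next section consumes.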

You may want to save the corpus by doing `wiki_corpus.dump("wiki-politeness-annotated")` for further exploration. Note that if you do not specify a base path, data will be saved to `.convokit/saved-corpora` in your home directory by default. 

#### 3: predict politeness

We will see how a simple classifier that considers the use of politeness strategies performs. Note that this is only for demonstration and is not geared towards achieving the best performance.

(Most of the code below is adapted from [here](https://github.com/sudhof/politeness/blob/master/scripts/train_model.py).)

In [19]:
import random
from sklearn import svm
from scipy.sparse import csr_matrix
from sklearn.metrics import classification_report

For this prediction task, we will only consider the polite vs. impolite group (i.e., those with "Binary" field being either +1 or -1)

In [17]:
binary_corpus = Corpus(utterances=[utt for utt in wiki_corpus.iter_utterances() if utt.meta["Binary"] != 0])

In [9]:
# adapted from "documents2feature_vectors"
def corpus2feature_vectors(corpus: Corpus, ids: List[int]):
    
    """
    Arguments:
        corpus {Corpus} -- The corpus being converted. 
                        Requires pre-computed politeness_strategies in the utterance meta field. 
                        
        ids {List[int]} -- ids being considered 
    """
    
    fks = False
    
    X, y = [], []
    
    for utt_id in ids:
        
        utt = corpus.get_utterance(utt_id)
        fs = utt.meta["politeness_strategies"]
        if not fks:
            fks = sorted(fs.keys())
        fv = [fs[f] for f in fks]
        
        # the utterance is regarded as polite if its score is at or above the 75th percentile (i.e., Binary = 1)
        # and as impolite if its score is at or below the 25th percentile (i.e., Binary = -1)
        label = 1 if utt.meta["Binary"] == 1 else 0
        
        X.append(fv)
        y.append(label)
        
    X = csr_matrix(np.asarray(X))
    y = np.asarray(y)
    
    return X, y

In [94]:
# adapted from "train_svm"
def train_svm(corpus: Corpus, ntesting: int = 500, rseed:int = 123):
    
    """
    Arguments:
        corpus: annotated training corpus
        ntesting: number of docs to reserve for testing
        rseed: random seed used to shuffle the utterance ids before splitting
    """

    # copy into a list so that the ids can be shuffled in place
    utt_ids = list(corpus.get_utterance_ids())
    random.seed(rseed)
    random.shuffle(utt_ids)

    testing_ids = utt_ids[-ntesting:]
    training_ids = utt_ids[:-ntesting]
    
    print("sample test_ids:", testing_ids[-5:])

    X, y = corpus2feature_vectors(corpus, training_ids)
    Xtest, ytest = corpus2feature_vectors(corpus, testing_ids)

    print("Fitting")
    clf = svm.SVC(C=0.02, kernel='linear', probability=True)
    clf.fit(X, y)

    # Test
    y_pred = clf.predict(Xtest)
    print("accuracy = {}".format(np.mean(y_pred == ytest)))
    print(classification_report(ytest, y_pred, labels = [1, 0], target_names=["polite", "impolite"]))

    return clf
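
The split inside `train_svm` shuffles the ids with a fixed seed and holds out the last `ntesting` of them. The sketch below isolates that step on toy ids. The helper `split_ids` is hypothetical, and unlike the function above it uses a local `random.Random` instance so that the global random state is left untouched.

```python
import random

def split_ids(ids, ntesting, rseed=123):
    """Shuffle a copy of ids with a fixed seed, then hold out the last ntesting."""
    if not 0 < ntesting < len(ids):
        # ids[-0:] is the whole list and ids[:-0] is empty,
        # so ntesting == 0 would silently put everything in the test set.
        raise ValueError("ntesting must be between 1 and len(ids) - 1")
    shuffled = list(ids)
    random.Random(rseed).shuffle(shuffled)
    return shuffled[:-ntesting], shuffled[-ntesting:]

toy_ids = list(range(100, 110))  # made-up ids for illustration
train_ids, held_out_ids = split_ids(toy_ids, ntesting=3)
print(len(train_ids), len(held_out_ids))
```

Fixing the seed makes the split reproducible across runs, which is why the same sample of test ids can be reused later in the notebook.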

In [95]:
X, y = corpus2feature_vectors(binary_corpus, binary_corpus.get_utterance_ids())

In [96]:
clf = train_svm(binary_corpus)

sample test_ids: [485293, 252109, 623560, 328144, 627508]
Fitting
accuracy = 0.736
             precision    recall  f1-score   support

     polite       0.75      0.70      0.73       249
   impolite       0.72      0.77      0.75       251

avg / total       0.74      0.74      0.74       500
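
For readers less familiar with the report's columns, the sketch below recomputes accuracy, precision, recall, and F1 by hand on a few made-up labels. The helper `precision_recall_f1` is hypothetical and only mirrors the standard definitions; it is not scikit-learn's implementation.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Per-class precision, recall, and F1, treating `positive` as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy labels (1 = polite, 0 = impolite), made up for illustration.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 1]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print("accuracy = {:.3f}".format(accuracy))
for name, cls in [("polite", 1), ("impolite", 0)]:
    p, r, f = precision_recall_f1(y_true, y_pred, positive=cls)
    print("{:>8}  precision={:.2f}  recall={:.2f}  f1={:.2f}".format(name, p, r, f))
```

The `support` column in the report is simply the number of true examples of each class in the test set (249 polite and 251 impolite above).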



We can then use this classifier to predict politeness labels for utterances. As an example, we will use some of the test utterances, but you can also consider using this classifier to predict labels for new utterances.

In [97]:
test_ids = [485293, 252109, 623560, 328144, 627508]

In [98]:
# For unlabeled utterances, you will need to featurize slightly differently, as y wouldn't be available
Xtest, y = corpus2feature_vectors(binary_corpus, test_ids)
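
As the comment in the cell above notes, unlabeled utterances need a featurizer that does not read the `Binary` field. One possible shape for it is sketched below. The function `strategies2feature_vectors` is hypothetical, and treating a missing strategy as absent (0) is an assumption of this sketch. The important detail is to reuse the same column order that was used at training time.

```python
def strategies2feature_vectors(strategy_dicts, feature_keys=None):
    """Hypothetical label-free counterpart of corpus2feature_vectors.

    strategy_dicts: one {strategy_name: 0/1} dict per utterance.
    feature_keys: column order used at training time;
                  defaults to the sorted keys of the first dict.
    """
    if feature_keys is None:
        feature_keys = sorted(strategy_dicts[0].keys())
    # Assumption of this sketch: a strategy missing from a dict counts as absent (0).
    X = [[fs.get(key, 0) for key in feature_keys] for fs in strategy_dicts]
    return X, feature_keys

# Two toy strategy dicts, using feature names that appear in the output above.
toy = [
    {"feature_politeness_==Please==": 0, "feature_politeness_==2nd_person==": 1},
    {"feature_politeness_==Please==": 1, "feature_politeness_==2nd_person==": 0},
]
X_new, keys = strategies2feature_vectors(toy)
print(keys)
print(X_new)
```

The resulting rows can then be wrapped with `csr_matrix(np.asarray(X_new))`, as `corpus2feature_vectors` does, before being passed to `clf.predict`.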

- predicting for test utterances

In [99]:
ypred = clf.predict(Xtest)
yprob = clf.predict_proba(Xtest)

- checking the predicted politeness labels

In [101]:
pred2label = {1: "polite", 0: "impolite"}

for i, idx in enumerate(test_ids):
    print(i)
    test_utt = binary_corpus.get_utterance(idx)
    print("test utterance:\n{}".format(test_utt.text))
    print("------------------------")
    print("Result: {}, probability estimates = {}\n".format(pred2label[ypred[i]], yprob[i]))

0
test utterance:
Blocked, templated.  Next?
------------------------
Result: impolite, probability estimates = [0.81650082 0.18349918]

1
test utterance:
Stephan, what did you mean by ''"Is English your native language? You seem to fill in a lot of things not said with your assumptions."'' on my talk?
------------------------
Result: impolite, probability estimates = [0.7367566 0.2632434]

2
test utterance:
I see you created a nonsense article yesterday because you were bored. If I unblock you will you disrupt more?
------------------------
Result: polite, probability estimates = [0.37764129 0.62235871]

3
test utterance:
I have no need to search the interwebs, all that matters is it offends people and is a violation of NPOV and MoS. "All Wikipedia articles and other encyclopedic content must be written from a neutral point of view (NPOV), representing fairly and without bias all significant views (that have been published by reliable sources)" - ess-eff is a bias term for a minority 
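
When reading the probability estimates above, note that the columns returned by `predict_proba` follow the order of the classifier's `classes_` attribute, which scikit-learn keeps sorted. With labels 0 and 1 this means the first number is the estimate for "impolite" and the second for "polite". The sketch below makes that mapping explicit; the `classes` list is written out by hand here as an assumption, whereas in the notebook it would come from `clf.classes_`.

```python
# Assumed class order: labels 0 and 1, sorted, as clf.classes_ would report them.
classes = [0, 1]
pred2label = {1: "polite", 0: "impolite"}

def describe(prob_row):
    """Map one row of probability estimates to {label_name: probability}."""
    return {pred2label[c]: p for c, p in zip(classes, prob_row)}

# Probability estimates copied from the first test utterance above.
print(describe([0.81650082, 0.18349918]))
```

Since `SVC(probability=True)` obtains these estimates through a separate calibration step, they may occasionally disagree with the label returned by `predict`, so it is worth treating them as rough confidence indicators.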

Note that this politeness classifier is trained on a specific dataset (Wikipedia) and on a specific binarization of politeness classes. Depending on your scenario, you might find it preferable to use the politeness strategies directly, as exemplified in the [conversations gone awry example](https://github.com/CornellNLP/Cornell-Conversational-Analysis-Toolkit/blob/master/examples/conversations-gone-awry/Conversations_Gone_Awry_Prediction.ipynb), rather than a politeness label/score.