# 1 - Data Pre-filtering
This notebook is the first step in implementing the SVM model proposed by Esaú Villatoro-Tello et al. in the paper "A Two-step Approach for Effective Detection of Misbehaving Users in Chats".

### Pre-filtering Training Data
Any conversation that satisfies at least one of the following conditions is removed:
1. conversations with only one participant
2. conversations with fewer than 6 interventions per user
3. conversations with long sequences of unrecognized characters

In [3]:
import xml.etree.ElementTree as ET
import datetime

train_data_path = ".../.../data/pan12-sexual-predator-identification-training-corpus-2012-05-01/"
test_data_path = ".../.../data/pan12-sexual-predator-identification-test-corpus-2012-05-21/"

training_xml = ET.parse(train_data_path + 'pan12-sexual-predator-identification-training-corpus-2012-05-01.xml')
root = training_xml.getroot()        

root now holds the parsed XML as a tree of elements that can be iterated like a nested list. First, we flag conversations that meet the first criterion: conversations with only one participant.
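As a minimal sketch of this first check, the snippet below runs on a tiny hypothetical XML sample instead of the real corpus. The sample content and the names `SAMPLE_XML`, `sample_root`, and `participants` are assumptions made for illustration; only the element names (`conversation`, `message`, `author`, `text`) follow the code used in this notebook.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-conversation sample that mirrors the element names used here.
SAMPLE_XML = """
<conversations>
  <conversation id="c1">
    <message line="1"><author>a1</author><text>hello</text></message>
    <message line="2"><author>a1</author><text>anyone there?</text></message>
  </conversation>
  <conversation id="c2">
    <message line="1"><author>a1</author><text>hi</text></message>
    <message line="2"><author>a2</author><text>hey</text></message>
  </conversation>
</conversations>
"""


def participants(conversation):
    """Return the set of distinct author ids in one <conversation> element."""
    return {message.find("author").text for message in conversation}


sample_root = ET.fromstring(SAMPLE_XML.strip())

# Flag conversations in which at most one author ever speaks.
single_participant_ids = [
    conv.get("id") for conv in sample_root if len(participants(conv)) <= 1
]
print(single_participant_ids)  # ['c1']
```

Collecting authors in a set removes duplicates automatically, so no explicit membership check is needed.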

In [4]:
conv_2_remove = []
authors = []
init_len = len(root)

for conversation in root:
    authors.clear()
    
    # find all unique authors in this conversation
    for message in conversation:
        author = message.find('author').text
        if author not in authors:
            authors.append(author)
    
    # mark for removal if the conversation has one or fewer authors
    if len(authors) <= 1 and conversation.get('id') not in conv_2_remove:
        conv_2_remove.append(conversation.get('id'))

print("Removing {} out of {} conversations".format(len(conv_2_remove), init_len))

Removing 12773 out of 66927 conversations


So far, 12773 conversations are marked for removal. Next, we mark all conversations that meet the second criterion: conversations in which every user has 5 or fewer messages.
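The per-user message count can also be sketched with `collections.Counter`. This is an illustrative alternative under stated assumptions, not the notebook's own cell: `all_users_below_threshold` and `make_conversation` are hypothetical helpers, and the conversations are built in memory rather than read from the corpus.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def all_users_below_threshold(conversation, min_messages=6):
    """True when every author wrote fewer than min_messages messages."""
    counts = Counter(message.find("author").text for message in conversation)
    return all(n < min_messages for n in counts.values())


def make_conversation(conv_id, authors):
    """Build a hypothetical <conversation> with one message per listed author."""
    conv = ET.Element("conversation", id=conv_id)
    for author in authors:
        msg = ET.SubElement(conv, "message")
        ET.SubElement(msg, "author").text = author
        ET.SubElement(msg, "text").text = "hi"
    return conv


# a1 writes 3 messages and a2 writes 5: nobody reaches 6, so it is flagged.
short = make_conversation("short", ["a1"] * 3 + ["a2"] * 5)
# a1 writes 6 messages: one user reaches the threshold, so it is kept.
active = make_conversation("active", ["a1"] * 6 + ["a2"] * 2)

print(all_users_below_threshold(short))   # True
print(all_users_below_threshold(active))  # False
```

Note that a conversation is kept as soon as a single user reaches 6 messages, which matches the loop in the next cell.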

In [5]:
for conversation in root:
    if conversation.get('id') in conv_2_remove:
        continue
    authors = {}
    for message in conversation:
        author = message.find('author').text
        if author in authors:
            authors[author] = authors[author] + 1
        else:
            authors[author] = 1
    # mark for removal only if no author wrote more than 5 messages
    remove = all(count <= 5 for count in authors.values())

    if remove and conversation.get('id') not in conv_2_remove:
        conv_2_remove.append(conversation.get('id'))

print("Removing {} out of {} conversations".format(len(conv_2_remove), init_len))

Removing 51827 out of 66927 conversations


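A total of 51827 of the original 66927 conversations are now marked for removal. Lastly, we mark any conversation with a message containing a long sequence of unrecognized characters. Here, a message of at least 20 characters counts as such when more than 60% of its characters are non-alphanumeric (whitespace and underscores included).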
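To make the character heuristic concrete, here is a small sketch of the ratio computed on a few hand-written strings. The helper names `non_alnum_ratio` and `looks_garbled` are hypothetical; the pattern, the 20-character minimum, and the 0.6 cutoff are taken from the cell that follows.

```python
import re

# [\W_] matches anything that is not a letter or digit,
# which includes whitespace, punctuation, and underscores.
NON_ALNUM = re.compile(r"[\W_]")


def non_alnum_ratio(text):
    """Fraction of characters in text that are not letters or digits."""
    if not text:
        return 0.0
    return len(NON_ALNUM.findall(text)) / len(text)


def looks_garbled(text, min_length=20, max_ratio=0.6):
    """Apply the length and ratio thresholds to a single message."""
    if text is None or len(text) < min_length:
        return False
    return non_alnum_ratio(text) > max_ratio


print(non_alnum_ratio("abcdefghij"))   # 0.0
print(non_alnum_ratio("@@@@@@@@ab"))   # 0.8
print(non_alnum_ratio("a b c"))        # 0.4 (spaces count as non-alphanumeric)
print(looks_garbled("#" * 30))         # True
print(looks_garbled("this is an ordinary chat message"))  # False
print(looks_garbled("@@@@"))           # False: shorter than 20 characters
```

Because spaces count toward the ratio, ordinary sentences already score above zero; the 0.6 cutoff leaves room for that.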

In [6]:
import re
for conversation in root:
    if conversation.get('id') in conv_2_remove:
        continue
    remove = False
    for message in conversation:
        text = message.find("text").text
        if text is None or len(text) < 20:
            continue
        # raw string so that \W is passed to the regex engine unchanged
        match_str = re.findall(r"[\W_]", text)
        if len(match_str) / len(text) > 0.6:
            remove = True
            break

    if remove and conversation.get('id') not in conv_2_remove:
        conv_2_remove.append(conversation.get('id'))

print("Removing {} out of {} conversations".format(len(conv_2_remove), init_len))

Removing 52224 out of 66927 conversations


In total, 52224 of the 66927 conversations are marked for removal. Next, we remove the marked conversations from root itself and write the filtered XML to another file.
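The removal and write-back step can be sketched end to end on an in-memory example. Everything here is illustrative: the three-conversation document, `sample_root`, and `ids_to_remove` are hypothetical, and the result is written to an `io.BytesIO` buffer rather than to a file on disk. Using a set for the ids is a suggested variation, since membership tests on a set do not slow down as the collection grows.

```python
import io
import xml.etree.ElementTree as ET

sample_root = ET.fromstring(
    '<conversations>'
    '<conversation id="c1"/><conversation id="c2"/><conversation id="c3"/>'
    '</conversations>'
)
ids_to_remove = {"c1", "c3"}  # hypothetical ids flagged by the earlier filters

# findall() returns a list, so removing children while looping over it is safe.
for conversation in sample_root.findall("conversation"):
    if conversation.get("id") in ids_to_remove:
        sample_root.remove(conversation)

# Serialize to an in-memory buffer, then parse it back to check the round trip.
buffer = io.BytesIO()
ET.ElementTree(sample_root).write(buffer)
reloaded = ET.fromstring(buffer.getvalue())

remaining_ids = [conv.get("id") for conv in reloaded]
print(remaining_ids)  # ['c2']
```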

In [7]:
for conversation in root.findall('conversation'):
    if conversation.get('id') in conv_2_remove:
        root.remove(conversation)
print("The new root has a length of {}.".format(len(root)))


The new root has a length of 14703.


In [8]:
from xml.etree.ElementTree import ElementTree
tree = ElementTree(root)
# use a context manager so the output file is closed after writing
with open('.../.../data/svm_training_data/training_data.xml', 'wb') as out_file:
    tree.write(out_file)
print("Filtered data written!")

Filtered data written!
