# Social Honeypot Dataset Preparation

Preparation of the social honeypot dataset.

The dataset can be downloaded from:
http://infolab.tamu.edu/data/

## Initialization

Import Python packages and initialize the parameters used throughout the notebook.

### Imports

Import the required Python packages.

In [1]:
import pandas as pd
import numpy as np
import string

from nltk.tokenize import TweetTokenizer
from bs4 import BeautifulSoup
from tqdm import tqdm
from sklearn.model_selection import train_test_split

tqdm.pandas()

### Parameters

Initialize variables for the dataset headers and for the input and output file locations.

In [2]:
user_dataset_csv = '../../data/external/social_honeypot_icwsm_2011/legitimate_users_tweets.txt'
bot_dataset_csv = '../../data/external/social_honeypot_icwsm_2011/content_polluters_tweets.txt'
dataset_headers = ['userId', 'tweetId', 'text', 'createdAt']

train_clean_csv = '../../data/interim/social_honeypot_train_clean.csv'
test_clean_csv = '../../data/interim/social_honeypot_test_clean.csv'

## Prepare datasets

Load datasets, create type columns and split them into training and test sets.

### Load datasets

Load the legitimate-user and content-polluter tweet files from disk (tab-separated, no header row) and drop the columns that are not needed.

In [3]:
df_user = pd.read_csv(user_dataset_csv, header=None, names=dataset_headers, sep='\t')
df_bot = pd.read_csv(bot_dataset_csv, header=None, names=dataset_headers, sep='\t')
df_user.drop(['userId', 'tweetId', 'createdAt'], axis=1, inplace=True)
df_bot.drop(['userId', 'tweetId', 'createdAt'], axis=1, inplace=True)
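
Tweet text can contain double-quote characters, which pandas treats as field quoting by default. The sketch below shows one way to read tab-separated rows so that quotes are kept as literal characters. The two sample rows are made up for illustration, and whether the original files actually contain such quotes is an assumption that has not been checked here.

```python
import csv
import io

import pandas as pd

# Hypothetical sample rows in the same four-column, tab-separated layout.
# The first tweet starts with an unbalanced double quote.
sample = (
    '1\t10\t"Hello world\t2010-01-01 00:00:00\n'
    '2\t11\tplain tweet\t2010-01-02 00:00:00\n'
)

headers = ['userId', 'tweetId', 'text', 'createdAt']

# quoting=csv.QUOTE_NONE tells the parser to treat '"' as an ordinary character
df_sample = pd.read_csv(
    io.StringIO(sample),
    header=None,
    names=headers,
    sep='\t',
    quoting=csv.QUOTE_NONE,
)

print(len(df_sample))
print(df_sample.loc[0, 'text'])
```

With this option both rows are parsed separately and the leading quote stays part of the tweet text.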

### Create tweet type columns

Create a new column in both dataframes signifying the type of the tweet: 0 for normal user and 1 for content polluter.

In [4]:
df_user['type'] = 0
df_bot['type'] = 1

### Split datasets

Split datasets into training and test sets.

In [5]:
df_user_train, df_user_test = train_test_split(df_user, test_size=20000)
df_bot_train, df_bot_test = train_test_split(df_bot, test_size=20000)
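
An integer `test_size` is interpreted by `train_test_split` as an absolute number of rows, so each test set above holds 20,000 tweets. The split is random, and the cell above does not fix a seed. The following is a minimal sketch, on a made-up ten-row dataframe, of how passing `random_state` makes the split repeatable. The seed value 42 and the toy data are assumptions for illustration only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy dataframe standing in for df_user or df_bot (hypothetical data)
df_toy = pd.DataFrame({
    'text': [f'tweet {i}' for i in range(10)],
    'type': 0,
})

# test_size=3 means exactly 3 rows go to the test set
toy_train, toy_test = train_test_split(df_toy, test_size=3, random_state=42)

# Repeating the call with the same seed yields the same rows
toy_train_again, toy_test_again = train_test_split(df_toy, test_size=3, random_state=42)

print(len(toy_train), len(toy_test))
print(list(toy_test.index) == list(toy_test_again.index))
```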

### Combine datasets

Combine the user and bot dataframes into one dataframe each for the training set and the test set.

In [6]:
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent call
df_train = pd.concat([df_bot_train, df_user_train])
df_test = pd.concat([df_bot_test, df_user_test])

## Cleanup

The following cleanup steps will be applied to the datasets:

1. Convert to lowercase
2. Tokenize using nltk's Twitter tokenizer
3. Replace all http links with the string 'http'
4. Filter out all mention tokens
5. Filter out all hashtag tokens
6. Filter out all tokens containing characters other than lowercase ASCII letters and apostrophes
7. Filter out tokens that consist of a lone apostrophe
8. Join tokens back together

### Process text function

Define a function that will be applied to all texts in the datasets.

In [7]:
# Set of all lowercase ASCII letters plus the apostrophe
letters = set(string.ascii_lowercase + "'")
tokenizer = TweetTokenizer()

def process_text(text):
    # Non-string values (e.g. missing text) are converted to their string form
    text = str(text)
    text = text.lower()
    tokens = tokenizer.tokenize(text)
    # Replace every link token with the placeholder 'http'
    tokens = ['http' if t.startswith('http') else t for t in tokens]
    # Drop mentions and hashtags
    tokens = list(filter(lambda t: not t.startswith('@'), tokens))
    tokens = list(filter(lambda t: not t.startswith('#'), tokens))
    # Keep only tokens made of lowercase letters and apostrophes
    tokens = list(filter(lambda t: set(t).issubset(letters), tokens))
    # Drop tokens that are a lone apostrophe
    tokens = list(filter(lambda t: t != "'", tokens))
    return " ".join(tokens)
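
To illustrate what each filter removes, the sketch below applies the same filter steps to a hand-written token list. The list is hypothetical and only resembles what a tweet tokenizer might produce for a lowercased tweet. It is not real `TweetTokenizer` output, so the example runs without nltk.

```python
import string

letters = set(string.ascii_lowercase + "'")

# Hypothetical tokens, written by hand for illustration
tokens = ['@bob', 'check', 'this', 'out', '!', '!', 'http://t.co/abc',
          '#spam', 'i', "can't", 'wait', '2nite', "'"]

# Links collapse to the placeholder 'http'
tokens = ['http' if t.startswith('http') else t for t in tokens]
# Mentions and hashtags are dropped
tokens = [t for t in tokens if not t.startswith('@')]
tokens = [t for t in tokens if not t.startswith('#')]
# Punctuation and tokens with digits fail the letters-only check
tokens = [t for t in tokens if set(t).issubset(letters)]
# A lone apostrophe passes the letters check, so it is removed separately
tokens = [t for t in tokens if t != "'"]

result = ' '.join(tokens)
print(result)  # check this out http i can't wait
```

Because `@` and `#` are not in `letters`, the letters-only check would also drop mentions and hashtags by itself. The explicit mention and hashtag filters mainly make the intent visible.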

### Process datasets

Apply the function to both datasets, drop the rows whose text is empty after cleanup, and reset the indices.

In [8]:
df_train.text = df_train.text.progress_map(process_text)
df_test.text = df_test.text.progress_map(process_text)
df_train = df_train[df_train.text!='']
df_test = df_test[df_test.text!='']
df_train.reset_index(drop=True, inplace=True)
df_test.reset_index(drop=True, inplace=True)

100%|██████████| 5540068/5540068 [03:40<00:00, 25119.95it/s]
100%|██████████| 40000/40000 [00:01<00:00, 24511.04it/s]


## Save clean datasets

Save the cleaned up datasets back to disk.

In [9]:
df_train.to_csv(train_clean_csv, index=False)
df_test.to_csv(test_clean_csv, index=False)
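
One point to keep in mind when reloading these files: by default `pd.read_csv` parses certain strings, such as `null` and `nan`, as missing values. A cleaned tweet that consists only of such a word would come back as `NaN`. The following sketch, on a made-up three-row dataframe written to an in-memory buffer, shows the effect and one way to avoid it with `keep_default_na=False`. Whether later steps reload the files with pandas is an assumption.

```python
import io

import pandas as pd

# Hypothetical cleaned data; two texts collide with pandas' default NA strings
df_demo = pd.DataFrame({
    'text': ['hello world', 'null', 'nan'],
    'type': [0, 1, 0],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
df_demo.to_csv(buffer, index=False)

# Default read: 'null' and 'nan' are parsed as missing values
buffer.seek(0)
default_read = pd.read_csv(buffer)

# keep_default_na=False keeps them as ordinary strings
buffer.seek(0)
safe_read = pd.read_csv(buffer, keep_default_na=False)

print(default_read.text.isna().sum())
print(safe_read.text.tolist())
```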