# Detection of Twitter users who use hateful lexicon using graph machine learning with Stellargraph

We consider the use case of identifying hateful users on Twitter, motivated by the work in [1] and using the dataset published there. Classification is based on a graph of users' retweets, together with user attributes derived from account activity and the content of tweets.

We pose identifying hateful users as a binary classification problem and demonstrate that using connected rather than unconnected data improves prediction accuracy on a highly imbalanced dataset, in a semi-supervised setting with few training examples.

For connected data, we use the Graph Neural Network methods GCN [2], GAT [3], and GraphSAGE [4] as implemented in the `stellargraph` library. We pose the problem of identifying hateful Twitter users as node attribute inference in graph machine learning.

**References**

1. "Like Sheep Among Wolves": Characterizing Hateful Users on Twitter. M. H. Ribeiro, P. H. Calais, Y. A. Santos, V. A. F. Almeida, and W. Meira Jr. arXiv preprint arXiv:1801.00317 (2017).

2. Semi-Supervised Classification with Graph Convolutional Networks. T. Kipf, M. Welling. ICLR 2017. arXiv:1609.02907.

3. Graph Attention Networks. P. Velickovic et al. ICLR 2018.

4. Inductive Representation Learning on Large Graphs. W. L. Hamilton, R. Ying, and J. Leskovec. arXiv:1706.02216 [cs.SI], 2017.

In [None]:
import networkx as nx
import pandas as pd
import numpy as np
import seaborn as sns
import itertools
import os

from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.linear_model import LogisticRegressionCV

import stellargraph as sg
from stellargraph.mapper import GraphSAGENodeGenerator, FullBatchNodeGenerator
from stellargraph.layer import GraphSAGE, GCN, GAT
from stellargraph import globalvar

from keras import layers, optimizers, losses, Model, models
from sklearn import preprocessing, feature_extraction
from sklearn.model_selection import train_test_split
from sklearn import metrics  # used below for roc_auc_score and roc_curve

import matplotlib.pyplot as plt

%matplotlib inline

In [None]:
def remove_prefix(text, prefix):
    # Strip `prefix` from the start of `text` if present; otherwise return `text` unchanged
    return text[len(prefix):] if text.startswith(prefix) else text

def plot_history(history):
    # Collect each metric name once, merging "val_<name>" with its training counterpart
    metric_names = sorted(set(remove_prefix(m, "val_") for m in history.history.keys()))
    for m in metric_names:
        # summarize history for metric m
        plt.plot(history.history[m])
        plt.plot(history.history['val_' + m])
        plt.title(m, fontsize=18)
        plt.ylabel(m, fontsize=18)
        plt.xlabel('epoch', fontsize=18)
        plt.legend(['train', 'validation'], loc='best')
        plt.show()


### Loading the data

**Downloading the dataset:**

The dataset for this demo was published in [1] and it is freely available to download from Kaggle [here](https://www.kaggle.com/manoelribeiro/hateful-users-on-twitter/home).

The following is the description of the datasets:

>This dataset contains a network of 100k users, out of which ~5k were annotated as hateful or
>not. For each user, several content-related, network-related and activity related features
>were provided. 

Additional files of hateful lexicon can be found [here](https://github.com/manoelhortaribeiro/HatefulUsersTwitter/tree/master/data/extra).

Download the dataset and then set the `data_dir` variable to point to the download location.

In [None]:
data_dir = os.path.expanduser("~/data/hateful-twitter-users")

### First load and prepare the node features

Each node in the graph is associated with a large number of features. These are:

hate :("hateful"|"normal"|"other")
  if user was annotated as hateful, normal, or not annotated.
  
  (is_50|is_50_2) :bool
  whether user was deleted up to 12/12/17 or 14/01/18. 
  
  (is_63|is_63_2) :bool
  whether user was suspended up to 12/12/17 or 14/01/18. 
        
  (hate|normal)_neigh :bool
  is the user on the neighborhood of a (hateful|normal) user? 
  
  [c_] (statuses|follower|followees|favorites)_count :int
  number of (tweets|follower|followees|favorites) a user has.
  
  [c_] listed_count:int
  number of lists a user is in.

  [c_] (betweenness|eigenvector|in_degree|out_degree) :float
  centrality measurements for each user in the retweet graph.
  
  [c_] *_empath :float
  occurrences of empath categories in the users latest 200 tweets.

  [c_] *_glove :float          
  glove vector calculated for users latest 200 tweets.
  
  [c_] (sentiment|subjectivity) :float
  average sentiment and subjectivity of users tweets.
  
  [c_] (time_diff|time_diff_median) :float
  average and median time difference between tweets.
  
  [c_] (tweet|retweet|quote) number :float
  percentage of direct tweets, retweets and quotes of a user.
  
  [c_] (number urls|number hashtags|baddies|mentions) :float
  number of urls|hashtags|bad words|mentions per tweet on average.
  
  [c_] status length :float
  average status length.
  
  hashtags :string
  all hashtags employed by the user separated by spaces.
  
**Notice** that features prefixed with `c_` are calculated over the 1-neighborhood of a user in the retweet network (averaged out).

First, we are going to load the user features and prepare them for machine learning.

In [None]:
users_feat = pd.read_csv(os.path.join(data_dir, 
                                      'users_neighborhood_anon.csv'))
users_feat.head()

Let's have a look at the distribution of hateful, normal (not hateful), and other (unknown) users in the dataset.

In [None]:
print("Initial hateful/normal users distribution")
print(users_feat.shape)
print(users_feat.hate.value_counts())

There is a clear imbalance in the number of users tagged as hateful vs normal and unknown.

### Data cleaning and preprocessing

The dataset as given includes a large number of graph related features that are manually extracted. 

Since we are going to employ modern graph neural network methods for classification, we are going to drop these manually engineered features.

The power of Graph Neural Networks stems from their ability to learn useful graph-related features eliminating the need for manual feature engineering.
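
To make this concrete, here is a minimal NumPy sketch of a single graph convolution step in the spirit of GCN [2]: each node's new representation is a degree-normalised combination of its own features and those of its neighbours, followed by a linear map and a ReLU. The toy graph, feature values, and identity weight matrix are made up for illustration, and this is not how `stellargraph` implements the layer internally.

```python
import numpy as np

def gcn_layer(adj, features, weights):
    """One GCN-style propagation step: relu(D^-1/2 (A + I) D^-1/2 X W)."""
    a_hat = adj + np.eye(adj.shape[0])       # add self-loops
    deg = a_hat.sum(axis=1)                  # node degree, including the self-loop
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    a_norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(a_norm @ features @ weights, 0.0)

# Toy undirected path graph 0 - 1 - 2 with two features per node (illustrative values)
adj = np.array([[0, 1, 0],
                [1, 0, 1],
                [0, 1, 0]], dtype=float)
features = np.array([[1.0, 0.0],
                     [0.0, 1.0],
                     [1.0, 1.0]])
weights = np.eye(2)                          # identity weights, to make the mixing visible

hidden = gcn_layer(adj, features, weights)
print(hidden)
```

Node 0 starts with a zero in its second feature, but after one step that entry is non-zero because it has absorbed information from its neighbour, node 1. Stacking such layers lets the model learn neighbourhood-level features instead of relying on hand-crafted ones.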

In [None]:
def data_cleaning(feat):
    feat = feat.drop(columns=["hate_neigh", "normal_neigh"])
    
    # Convert target values in hate column from strings to integers (0,1,2)
    feat['hate'] = np.where(feat['hate']=='hateful', 1, np.where(feat['hate']=='normal', 0, 2))
    
    # Replace missing values with 0
    feat.fillna(0, inplace=True)

    # Drop the suspension and deletion flags (is_*), as they should not be used in the predictive model
    feat.drop(feat.columns[feat.columns.str.contains("is_")], axis=1, inplace=True)

    # drop glove features
    feat.drop(feat.columns[feat.columns.str.contains("_glove")], axis=1, inplace=True)

    # drop c_ (neighborhood) features; note that this is a substring match, so any
    # column whose name contains "c_" anywhere is dropped as well
    feat.drop(feat.columns[feat.columns.str.contains("c_")], axis=1, inplace=True)

    # drop sentiment features for now
    feat.drop(feat.columns[feat.columns.str.contains("sentiment")], axis=1, inplace=True)

    # drop hashtag feature
    feat.drop(['hashtags'], axis=1, inplace=True)

    # Drop centrality based measures
    feat.drop(columns=['betweenness', 'eigenvector', 'in_degree', 'out_degree'], inplace=True)
    
    feat.drop(columns=['created_at'], inplace=True)
    
    return feat

In [None]:
node_data = data_cleaning(users_feat)

Of the original **1037** node features, we are keeping only **204** that are based on a user's attributes and tweet lexicon.

In [None]:
node_data.shape

In [None]:
node_data.head()

Next, we normalize the continuous features.

In [None]:
columns_to_use = node_data.columns[1:].values # contains everything excluding user_id
len(columns_to_use)

In [None]:
# Ignore the first two columns because those are user_id and hate (the target variable)
df_values = node_data.iloc[:, 2:].values

In [None]:
pt = preprocessing.PowerTransformer(method='yeo-johnson', 
                                    standardize=True) 

In [None]:
df_values_log = pt.fit_transform(df_values)

Let's have a look at one of the normalized features before and after the power transform was applied.

In [None]:
paper_rc = {'lines.linewidth': 3}                  
sns.set_context("paper", rc = paper_rc) 

In [None]:
sns.kdeplot(df_values[:, 100])  # distribution of one feature (column 100) across all users
s = plt.ylabel("Density", fontsize=18)
s = plt.xlabel("Feature value", fontsize=18)
s = plt.title("Before Power Transform", fontsize=18)

In [None]:
sns.kdeplot(df_values_log[:, 100])  # the same feature after the power transform
s = plt.ylabel("Density", fontsize=18)
s = plt.xlabel("Feature value", fontsize=18)
s = plt.title("After Power Transform", fontsize=18)

Feature normalization looks like it is doing the right thing as the raw features have long tails that are eliminated after applying the power transform. 

So let us use the normalized features from now on.
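
As a quick sanity check of this claim, the sketch below applies the same Yeo-Johnson transform to a synthetic long-tailed feature and compares the skewness before and after. The log-normal sample and the `skewness` helper are assumptions introduced for illustration; they are not part of the dataset or of the notebook's pipeline.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

def skewness(x):
    """Sample skewness: third central moment divided by the cubed standard deviation."""
    x = np.asarray(x, dtype=float)
    centred = x - x.mean()
    return float((centred ** 3).mean() / (centred ** 2).mean() ** 1.5)

# Synthetic long-tailed feature (hypothetical stand-in for a raw count feature)
rng = np.random.default_rng(0)
raw = rng.lognormal(mean=0.0, sigma=1.0, size=(5000, 1))

transformer = PowerTransformer(method="yeo-johnson", standardize=True)
transformed = transformer.fit_transform(raw)

skew_before = skewness(raw[:, 0])
skew_after = skewness(transformed[:, 0])
print("skewness before: {:.2f}, after: {:.2f}".format(skew_before, skew_after))
```

With `standardize=True` the transformed feature also has zero mean and unit variance, which is convenient for both the neural network and the logistic regression baseline.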

In [None]:
node_data.iloc[:, 2:] = df_values_log

In [None]:
# Set the dataframe index to the user_id (as a string, to match the node ids
# read from the edge list) and drop the user_id column
node_data.index = node_data["user_id"].astype(str)
node_data.drop(columns=['user_id'], inplace=True)

Node features are now ready for machine learning.

In [None]:
node_data.head()

### Next load the graph

Now that we have the node features prepared for machine learning, let us load the retweet graph.

In [None]:
g_nx = nx.read_edgelist(path=os.path.expanduser(os.path.join(data_dir,
                                                             "users.edges")))

In [None]:
g_nx.number_of_nodes(), g_nx.number_of_edges()

The graph has just over 100k nodes and approximately 2.2m edges.

We aim to train a graph neural network model that will predict the "hate" attribute on the nodes.

In [None]:
print(set(node_data["hate"]))

### Splitting the data

For machine learning we want to take a subset of the nodes for training and use the rest for testing. We'll use scikit-learn to split our data into training and test sets.

The total number of annotated nodes is very small compared to the total number of nodes in the graph. We are only going to use 15% of the annotated nodes for training and the remaining 85% for testing.

First, we are going to select the subset of nodes that are annotated as hateful or normal. These will be the nodes that have 'hate' values that are either 0 or 1.

In [None]:
# choose the nodes annotated with normal or hateful classes
pd.options.mode.chained_assignment = None

annotated_users = node_data[node_data['hate']!=2]

In [None]:
annotated_users.head()

In [None]:
annotated_users.shape

In [None]:
annotated_user_features = annotated_users.drop(columns=['hate'])
annotated_user_targets = annotated_users['hate']

There are 4971 annotated nodes out of approximately 100k nodes in total.

In [None]:
print(annotated_user_targets.value_counts())

In [None]:
# split the data
train_data, test_data, train_targets, test_targets = train_test_split(annotated_user_features,
                                         annotated_user_targets,
                                         test_size=0.85,
                                         random_state=101)
train_targets = train_targets.values
train_targets = train_targets[...,np.newaxis]
test_targets = test_targets.values
test_targets = test_targets[...,np.newaxis]
#train_data.drop(columns=['hate'], inplace=True)
#test_data.drop(columns=['hate'], inplace=True)
print("Sizes and class distributions for train/test data")
print("Shape train_data {}".format(train_data.shape))
print("Shape test_data {}".format(test_data.shape))
print("Train data number of 0s {} and 1s {}".format(np.sum(train_targets==0), 
                                                    np.sum(train_targets==1)))
print("Test data number of 0s {} and 1s {}".format(np.sum(test_targets==0), 
                                                   np.sum(test_targets==1)))

In [None]:
train_targets.shape, test_targets.shape

In [None]:
train_data.head()

In [None]:
train_data.shape, test_data.shape

We are going to use 745 nodes for training and 4226 nodes for testing.
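
These sizes follow from the 85% test fraction applied to the 4971 annotated nodes. The short calculation below reproduces them, assuming that `train_test_split` rounds the test size up and assigns the remaining samples to the training set, which matches the numbers reported above.

```python
import math

n_annotated = 4971       # annotated (hateful or normal) nodes
test_fraction = 0.85     # value passed as test_size

# Assumption: the test size is rounded up and the remainder is used for training
n_test = math.ceil(test_fraction * n_annotated)
n_train = n_annotated - n_test

print("train: {}, test: {}".format(n_train, n_test))
```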

In [None]:
# choosing features to assign to a graph, excluding target variable
node_features = node_data.drop(columns=['hate'])
node_features.head()

### Dealing with imbalanced data

As the data is imbalanced, we introduce class weights.

In [None]:
from sklearn.utils.class_weight import compute_class_weight
class_weights = compute_class_weight('balanced', 
                                     classes=np.unique(train_targets), 
                                     y=train_targets[:,0])
train_class_weights = dict(zip(np.unique(train_targets), 
                               class_weights))
train_class_weights
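
The `'balanced'` heuristic weights each class inversely to its frequency: `n_samples / (n_classes * count_of_class)`. The sketch below reimplements that formula with NumPy on a made-up label vector so the effect is easy to see; the helper name `balanced_class_weights` and the toy labels are illustrative only.

```python
import numpy as np

def balanced_class_weights(labels):
    """Weight for each class: n_samples / (n_classes * number of samples in that class)."""
    labels = np.asarray(labels).ravel()
    classes, counts = np.unique(labels, return_counts=True)
    weights = labels.size / (classes.size * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

# Toy labels: nine "normal" (0) users and one "hateful" (1) user
toy_labels = np.array([0] * 9 + [1])
toy_weights = balanced_class_weights(toy_labels)
print(toy_weights)
```

The rare class receives a much larger weight than the common one, so misclassifying a hateful user costs more during training than misclassifying a normal user.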

Our data is now ready for machine learning.

Node features are stored in the Pandas DataFrame `node_features`.

The graph in networkx format is stored in the variable `g_nx`.

### Specify global parameters

Here we specify some parameters that control the type of model we are going to use. For example, we specify the base model type (GCN, GAT, or GraphSAGE) as well as model-specific parameters.

In [None]:
model_type = 'graphsage'    # Can be either gcn, gat, or graphsage

if model_type == "graphsage":
    # For GraphSAGE model
    batch_size = 50
    num_samples = [20, 10]   # neighbours sampled per node at hop 1 and hop 2
    epochs = 30          # The number of training epochs
elif model_type == "gcn":
    # For GCN model
    epochs = 20          # The number of training epochs
elif model_type == "gat":
    # For GAT model
    layer_sizes = [8, 1]
    attention_heads = 8
    epochs = 20         # The number of training epochs    

## Creating the base graph machine learning model in Keras

Now create a `StellarGraph` object from the `NetworkX` graph and the node features. `StellarGraph` objects are what this library uses to perform machine learning tasks. We pass `node_features`, which excludes the `hate` target column, so that the label does not leak into the model inputs.

In [None]:
G = sg.StellarGraph(g_nx, node_features=node_features)

To feed data from the graph to the Keras model we need a generator. The generators are specialized to the model and the learning task. 

For training we map only the training nodes returned from our splitter and the target values.

In [None]:
if model_type == 'graphsage':
    generator = GraphSAGENodeGenerator(G, batch_size, num_samples)
    train_gen = generator.flow(train_data.index, 
                               train_targets, 
                               shuffle=True)
elif model_type == 'gcn': 
    generator = FullBatchNodeGenerator(G, method="gcn", sparse=True)
    train_gen = generator.flow(train_data.index, 
                               train_targets, )
elif model_type == 'gat':
    generator = FullBatchNodeGenerator(G, method="gat", sparse=True)
    train_gen = generator.flow(train_data.index, 
                               train_targets,)
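
For GraphSAGE, the generator supplies each training node together with a fixed-size sample of its neighbourhood, controlled by `num_samples`. The pure-Python sketch below illustrates the idea under the assumption that neighbours are drawn with replacement; the function name, the toy adjacency dictionary, and the `None` placeholder for nodes without neighbours are hypothetical and do not reproduce the `stellargraph` implementation.

```python
import random

def sample_neighbourhood(adjacency, head_node, num_samples, rng):
    """Sample a fixed number of neighbours per node at each hop, with replacement."""
    hops = [[head_node]]
    for n in num_samples:
        next_hop = []
        for node in hops[-1]:
            neighbours = adjacency.get(node, [])
            if neighbours:
                next_hop.extend(rng.choices(neighbours, k=n))
            else:
                next_hop.extend([None] * n)   # placeholder when a node has no neighbours
        hops.append(next_hop)
    return hops

# Toy retweet graph: "a" is connected to "b" and "c"
toy_adjacency = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}
hops = sample_neighbourhood(toy_adjacency, "a", [20, 10], random.Random(42))
hop_sizes = [len(h) for h in hops]
print(hop_sizes)
```

With `num_samples = [20, 10]`, every head node contributes 20 first-hop and 20 × 10 = 200 second-hop samples, so each batch has a fixed shape regardless of the actual node degrees.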

Now we can specify our machine learning model. We need a few more parameters for this, but they are model-specific.

In [None]:
if model_type == 'graphsage':
    base_model = GraphSAGE(
        layer_sizes=[32, 32],
        generator=train_gen,
        bias=True,
        dropout=0.5,
    )
    x_inp, x_out = base_model.default_model(flatten_output=True)
    prediction = layers.Dense(units=1, activation="sigmoid")(x_out)
elif model_type == 'gcn':
    base_model = GCN(
        layer_sizes=[32, 16],
        generator = generator,
        bias=True,
        dropout=0.5,
        activations=["elu", "elu"]
    )
    x_inp, x_out = base_model.node_model()
    prediction = layers.Dense(units=1, activation="sigmoid")(x_out)
elif model_type == 'gat':
    base_model = GAT(
        layer_sizes=layer_sizes,
        attn_heads=attention_heads,
        generator=generator,
        bias=True,
        in_dropout=0.5,
        attn_dropout=0.5,
        activations=["elu", "sigmoid"],
        normalize=None,
    )
    x_inp, prediction = base_model.node_model()

### Create a Keras model

Now let's create the actual Keras model with the graph inputs `x_inp` provided by the `base_model` and outputs being the predictions from the sigmoid layer.

In [None]:
model = Model(inputs=x_inp, outputs=prediction)

We compile our Keras model to use the `Adam` optimiser and the binary cross-entropy loss.

In [None]:
model.compile(
    optimizer=optimizers.Adam(lr=0.005),
    loss=losses.binary_crossentropy,
    metrics=["acc"],
)
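
For reference, the binary cross-entropy loss, and the effect of class weights on it, can be sketched in NumPy as follows. We assume here that a class weight simply scales each sample's loss before averaging over the batch; the function name and the toy predictions are illustrative and are not taken from Keras.

```python
import numpy as np

def weighted_binary_crossentropy(y_true, y_prob, class_weights=None, eps=1e-7):
    """Mean binary cross-entropy, optionally scaling each sample by its class weight."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_prob = np.clip(np.asarray(y_prob, dtype=float).ravel(), eps, 1.0 - eps)
    per_sample = -(y_true * np.log(y_prob) + (1.0 - y_true) * np.log(1.0 - y_prob))
    if class_weights is not None:
        per_sample = per_sample * np.where(y_true == 1, class_weights[1], class_weights[0])
    return float(per_sample.mean())

# One hateful (1) and one normal (0) user, both predicted fairly confidently
y_true = [1, 0]
y_prob = [0.9, 0.1]

unweighted = weighted_binary_crossentropy(y_true, y_prob)
weighted = weighted_binary_crossentropy(y_true, y_prob, class_weights={0: 0.5, 1: 5.0})
print(unweighted, weighted)
```

Both samples have the same unweighted loss, but with the weights applied the hateful user contributes ten times as much to the batch loss as the normal user.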

In [None]:
# The model is a standard Keras Model
model

Train the model, keeping track of its loss and accuracy on the training set and on a held-out set. We need another generator over the test data for this. Note that here the test set also serves as the validation data during training.

In [None]:
test_gen = generator.flow(test_data.index, test_targets)

Now we can train the model by calling the `fit_generator` method.

In [None]:
class_weight = None
if model_type == 'graphsage':
    class_weight=train_class_weights
history = model.fit_generator(
    train_gen,
    epochs=epochs,
    validation_data=test_gen,
    verbose=0,
    shuffle=False,
    class_weight=class_weight,
)

In [None]:
plot_history(history)

### Model Evaluation

Now we have trained the model, let's evaluate it on the test set.

We are going to consider 4 evaluation metrics calculated on the test set: Accuracy, Area Under the ROC curve (AU-ROC), the ROC curve, and the confusion table.

#### Accuracy

In [None]:
test_metrics = model.evaluate_generator(test_gen)
print("\nTest Set Metrics:")
for name, val in zip(model.metrics_names, test_metrics):
    print("\t{}: {:0.4f}".format(name, val))

#### AU-ROC

Let's use the trained GNN model to make a prediction for each node in the graph.

Then, select only the predictions for the nodes in the test set and calculate the AU-ROC as another performance metric in addition to the accuracy shown above.

In [None]:
all_nodes = node_data.index
all_gen = generator.flow(all_nodes)

In [None]:
all_predictions = model.predict_generator(all_gen).squeeze()[..., np.newaxis]

In [None]:
all_predictions.shape

In [None]:
all_predictions_df = pd.DataFrame(all_predictions, 
                                  index=node_data.index)

Let's extract the predictions for the test data only.

In [None]:
test_preds = all_predictions_df.loc[test_data.index, :]

In [None]:
test_preds.shape

The predictions are the probability of the positive class, which in this case is the probability of a user being hateful.

In [None]:
test_preds.head()

In [None]:
test_predictions = test_preds.values

In [None]:
test_predictions.shape

In [None]:
test_predictions_class = ((test_predictions>0.5)*1).flatten()
test_df = pd.DataFrame({"Predicted_score": test_predictions.flatten(), 
                        "Predicted_class": test_predictions_class, 
                        "True": test_targets[:,0]})
roc_auc = metrics.roc_auc_score(test_df['True'].values, 
                                test_df['Predicted_score'].values)
print("The AUC on test set:\n")
print(roc_auc)
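
The AU-ROC has a useful interpretation: it is the probability that a randomly chosen hateful user receives a higher score than a randomly chosen normal user, with ties counted as one half. The NumPy sketch below computes it directly from that definition on a toy example; it is meant as an illustration of the metric, not as a replacement for `roc_auc_score`.

```python
import numpy as np

def pairwise_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    y_true = np.asarray(y_true).ravel()
    scores = np.asarray(scores, dtype=float).ravel()
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]      # every positive score minus every negative score
    correct = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(correct / (pos.size * neg.size))

# Toy example: 3 of the 4 (positive, negative) pairs are ranked correctly
toy_auc = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(toy_auc)
```

Because the metric depends only on the ranking of the scores, it is not affected by the 0.5 decision threshold used for the predicted classes, which makes it better suited than accuracy to a highly imbalanced dataset.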

#### Confusion table

In [None]:
pd.crosstab(test_df['True'], test_df['Predicted_class'])

#### ROC curve

In [None]:
fpr, tpr, thresholds = metrics.roc_curve(test_df['True'], test_df['Predicted_score'], pos_label=1)
plt.figure(figsize=(12,6,))

lw = 2
plt.plot(fpr, tpr, color='darkblue',
         lw=lw, label='GNN ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate', fontsize=18)
plt.ylabel('True Positive Rate', fontsize=18)
plt.title('Receiver operating characteristic curve', fontsize=18)
plt.legend(loc="lower right")
plt.show()

## Visualisation of node embeddings

Evaluate node embeddings as activations of the output of one of the graph convolutional or aggregation layers in the Keras model, and visualise them, coloring nodes by their label (hateful or normal).

You can find the index of the layer of interest by calling `model.layers`.

First, create a Keras model for calculating the embeddings

In [None]:
model.layers

In [None]:
if model_type == 'graphsage':
    # For GraphSAGE, we are going to use the output activations 
    # of the second GraphSAGE layer as the node embeddings
    # x_inp, prediction
    emb_model = Model(inputs=x_inp, outputs=model.layers[-4].output)
    emb = emb_model.predict_generator(generator=all_gen, )
elif model_type == 'gcn':
    # For GCN, we are going to use the output activations of 
    # the second GCN layer as the node embeddings
    emb_model = Model(inputs=x_inp, outputs=model.layers[6].output)
    emb = emb_model.predict_generator(generator=all_gen)
elif model_type == 'gat':
    # For GAT, we are going to use the output activations of the 
    # first Graph Attention layer as the node embeddings
    emb_model = Model(inputs=x_inp, outputs=model.layers[6].output)
    emb = emb_model.predict_generator(generator=all_gen)

In [None]:
emb.shape

In [None]:
emb = emb.squeeze()

In [None]:
if model_type == "graphsage":
    emb_all_df = pd.DataFrame(emb, index=node_data.index)
elif model_type == "gcn" or model_type == "gat":
    emb_all_df = pd.DataFrame(emb, index=G.nodes())

Select the embeddings for the test set. We are only going to visualise the test set embeddings.

In [None]:
emb_test = emb_all_df.loc[test_data.index, :]

Project the embeddings to 2D using either a TSNE or PCA transform, and visualise them, coloring nodes by their label (hateful or normal).

In [None]:
X = emb_test
y = test_targets

In [None]:
X.shape

In [None]:
transform = TSNE # or use PCA 

trans = transform(n_components=2)
emb_transformed = pd.DataFrame(trans.fit_transform(X), index=test_data.index)
emb_transformed['label'] = y
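
If PCA is chosen instead of TSNE, the projection amounts to centring the embeddings and keeping the directions of largest variance. The sketch below shows this with a plain SVD on random stand-in data; the `pca_project` helper and the synthetic embeddings are assumptions for illustration, and the notebook itself relies on scikit-learn's `PCA` or `TSNE`.

```python
import numpy as np

def pca_project(x, n_components=2):
    """Project the rows of x onto their first principal components."""
    x = np.asarray(x, dtype=float)
    centred = x - x.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return centred @ vt[:n_components].T

# Synthetic 32-dimensional "embeddings" with unequal variance per dimension
rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(100, 32)) * np.linspace(3.0, 0.1, 32)
toy_2d = pca_project(toy_embeddings)
print(toy_2d.shape)
```

PCA is linear and deterministic, whereas TSNE is non-linear and stochastic, so the two plots can look quite different for the same embeddings.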

In [None]:
alpha = 0.7

fig, ax = plt.subplots(figsize=(7,7))
ax.scatter(emb_transformed[0], emb_transformed[1], c=emb_transformed['label'].astype("category"), 
            cmap="jet", alpha=alpha)
ax.set(aspect="equal", xlabel="$X_1$", ylabel="$X_2$")
plt.title('{} visualization of embeddings for Twitter dataset'.format(transform.__name__))
plt.show()

The node embeddings shown above indicate that the majority of hateful users tend to cluster together. However, some normal users are also in the same neighbourhood and these will be difficult to distinguish from hateful ones. Similarly, there is a small number of hateful users dispersed among normal users and these will also be difficult to classify correctly.

### Predictions using Logistic Regression

Finally, we train a Logistic Regression model on the same train and test data but this time ignoring the graph structure and focusing entirely on the node features.

The variables `train_data`, `test_data`, `train_targets`, and `test_targets`, hold the data we need to train the Logistic Regression classifier. 

In [None]:
lr = LogisticRegressionCV(cv=5, 
                          class_weight=class_weight,  # same class weights as passed to the GNN training above
                          max_iter=10000)  # other parameters are left at their defaults

In [None]:
lr.fit(train_data, train_targets.ravel())

We can now use the trained model to predict the test data

In [None]:
test_preds_lr = lr.predict_proba(test_data)

In [None]:
test_preds_lr.shape

In [None]:
test_preds_lr

#### Calculate AUC metric

In [None]:
test_predictions_class_lr = ((test_preds_lr[:, 1]>0.5)*1).flatten()
test_df_lr = pd.DataFrame({"Predicted_score": test_preds_lr[:, 1].flatten(), 
                        "Predicted_class": test_predictions_class_lr, 
                        "True": test_targets[:,0]})
roc_auc_lr = metrics.roc_auc_score(test_df_lr['True'].values, test_df_lr['Predicted_score'].values)
print("The AUC on test set:\n")
print(roc_auc_lr)

#### The confusion table

In [None]:
pd.crosstab(test_df_lr['True'], test_df_lr['Predicted_class'])

#### The ROC curve

In [None]:
fpr_lr, tpr_lr, thresholds_lr = metrics.roc_curve(test_df_lr['True'], test_df_lr['Predicted_score'], pos_label=1)
plt.figure(figsize=(12,6,))
lw = 2
plt.plot(fpr_lr, tpr_lr, color='darkorange',
         lw=lw, label='LR ROC curve (area = %0.2f)' % roc_auc_lr)
plt.plot(fpr, tpr, color='darkblue',
         lw=lw, label='GNN ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate', fontsize=18)
plt.ylabel('True Positive Rate', fontsize=18)
plt.title('Receiver operating characteristic curve', fontsize=18)
plt.legend(loc="lower right")
plt.show()

Let's have a closer look at the True Positive and False Positive Rates for the GNN and Logistic Regression approaches.

In [None]:
tp_fp_rates = pd.DataFrame({"tpr gnn": tpr[0:100], 
                            "fpr gnn": fpr[0:100],
                            "tpr lr": tpr_lr[0:100],
                            "fpr lr": fpr_lr[0:100]})

In [None]:
tp_fp_rates.iloc[60:61, :]

#### Comparison between LR and GNN

**Note:** This comparison is valid when comparing GraphSAGE with Logistic Regression using a specific split of the data. Using a different GNN algorithm will very likely produce different numerical results, although the conclusion below will still generally stand.

Comparing the ROC curves of the two machine learning methods, we see that adding the relationship information to our machine learning model by training a GNN improves overall performance.

When classifying a user as hateful, it is important to minimise the number of false positives, that is, the number of normal users that are incorrectly classified as hateful. At the same time, we would like to correctly identify as many hateful users as possible. We can trade off these two goals by setting the decision threshold guided by the ROC curve.

The table above shows that if we are willing to tolerate a false positive rate of approximately 1%, then the GNN model achieves a true positive rate of 32% whereas the logistic regression model achieves a true positive rate of 18%. Utilizing the information in the relationships between users improves the true positive rate by 14 percentage points for approximately the same false positive rate.
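
One way to turn such a false positive budget into a decision threshold is sketched below: scan the candidate thresholds from strictest to most lenient and keep the last one whose false positive rate stays within the budget. The function name and the toy scores are hypothetical; in the notebook the same information can be read from the arrays returned by `metrics.roc_curve`.

```python
import numpy as np

def threshold_for_fpr(y_true, scores, max_fpr):
    """Return (threshold, tpr, fpr) with the highest TPR whose FPR does not exceed max_fpr."""
    y_true = np.asarray(y_true).ravel()
    scores = np.asarray(scores, dtype=float).ravel()
    n_pos = (y_true == 1).sum()
    n_neg = (y_true == 0).sum()
    best = (np.inf, 0.0, 0.0)                 # default: classify nobody as hateful
    for t in np.unique(scores)[::-1]:         # thresholds in decreasing order
        predicted = scores >= t
        fpr = (predicted & (y_true == 0)).sum() / n_neg
        tpr = (predicted & (y_true == 1)).sum() / n_pos
        if fpr > max_fpr:
            break                             # FPR only grows as the threshold drops
        best = (float(t), float(tpr), float(fpr))
    return best

# Toy scores for four normal (0) and four hateful (1) users
toy_true = [0, 0, 0, 0, 1, 1, 1, 1]
toy_scores = [0.1, 0.2, 0.3, 0.7, 0.4, 0.6, 0.8, 0.9]

strict = threshold_for_fpr(toy_true, toy_scores, max_fpr=0.0)
lenient = threshold_for_fpr(toy_true, toy_scores, max_fpr=0.25)
print(strict, lenient)
```

In the toy example, allowing no false positives recovers half of the hateful users, while tolerating one false positive in four normal users recovers all of them. A threshold chosen this way should ideally be selected on a validation set rather than on the test set used for reporting.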