## Creating the Feature Matrix
The following two lists contain the features chosen for the naive (user based) and graph based feature matrixes. We create three feature matrixes. One with just user features, one with graph features, and then a final one with a combination of the two

In [1]:
import pickle

with open("./data/user_filled.pickle", "rb") as file:
    users = pickle.load(file)

In [2]:
user_features = [
    "statuses_count",
    "friends_count",
    "favourites_count",
    "age",
    "activity_statuses",
    "activity_friends"
]

graph_features = [
    "role",
    "eigenvector",
    "closeness"
]

In [3]:
X_naive = []
X_graph = []
X_hybrid = []
Y = []

for user in users:
    x_naive = []
    x_graph = []
    x_hybrid = []
    
    for feature in user_features:
        x_naive.append(int(user[feature]) if user[feature] else 0)
        x_hybrid.append(int(user[feature]) if user[feature] else 0)
    for graph_feature in graph_features:
        x_graph.append(user[graph_feature])
        x_hybrid.append(user[graph_feature])
        
    X_naive.append(x_naive)
    X_graph.append(x_graph)
    X_hybrid.append(x_hybrid)
    Y.append(int(user["class"] == "bot"))

## Simple Classifiers
The first step is to train 10 simple classifiers from SKLearn. This is just to get a baseline comparison before training more complex models in order to see where we stand in terms of possible results

In [4]:
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [13]:
names = ["Nearest Neighbors", "Gaussian Process",
         "Decision Tree", "Random Forest", "Neural Net", "AdaBoost",
         "Naive Bayes", "QDA"]

classifiers = [
    KNeighborsClassifier(3),
    GaussianProcessClassifier(1.0 * RBF(1.0)),
    DecisionTreeClassifier(max_depth=10),
    RandomForestClassifier(max_depth=10, n_estimators=10, max_features=3),
    MLPClassifier(alpha=1, max_iter=1000),
    AdaBoostClassifier(),
    GaussianNB(),
    QuadraticDiscriminantAnalysis()]

naive_scores = []

for name, classifier in zip(names, classifiers): 
    
    X = StandardScaler().fit_transform(X_naive)
    N_X_train, N_X_test, N_y_train, N_y_test = \
        train_test_split(X_naive, Y, test_size=.2, random_state=42)
    clf = classifier.fit(N_X_train, N_y_train)
    naive_scores.append(clf.score(N_X_test, N_y_test))

In [14]:
graph_scores = []

for name, classifier in zip(names, classifiers): 
    X = StandardScaler().fit_transform(X_graph)
    G_X_train, G_X_test, G_y_train, G_y_test = \
        train_test_split(X_graph, Y, test_size=.2, random_state=42)

    clf = classifier.fit(G_X_train, G_y_train)
    graph_scores.append(clf.score(G_X_test, G_y_test))

In [15]:
hybrid_scores = []

for name, classifier in zip(names, classifiers): 
    X = StandardScaler().fit_transform(X_hybrid)
    H_X_train, H_X_test, H_y_train, H_y_test = \
        train_test_split(X_hybrid, Y, test_size=.2, random_state=42)

    clf = classifier.fit(G_X_train, G_y_train)
    hybrid_scores.append(clf.score(G_X_test, G_y_test))

## Simple Classification Results
These results are pretty promising. These show that in all cases the graph based features are at least comparable to their counterparts. 

In [16]:
from IPython.display import HTML, display
import tabulate
import numpy as np

table = np.array([names, naive_scores, graph_scores, hybrid_scores]).T
display(HTML(tabulate.tabulate(table, tablefmt='html', headers=["Name", "Naive Score", "Graph Score", "Hybrid Score"])))

Name,Naive Score,Graph Score,Hybrid Score
Nearest Neighbors,0.715686,0.686275,0.686275
Gaussian Process,0.696078,0.696078,0.696078
Decision Tree,0.745098,0.705882,0.696078
Random Forest,0.823529,0.666667,0.676471
Neural Net,0.686275,0.696078,0.696078
AdaBoost,0.754902,0.686275,0.686275
Naive Bayes,0.735294,0.686275,0.686275
QDA,0.72549,0.676471,0.676471


In [21]:
import numpy as np
from numpy import loadtxt
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from sklearn.metrics import accuracy_score
import tensorflow as tf
tf.random.set_seed(42)
np.random.seed(42)

In [32]:
names = ["User", "Graph", "Hybrid"]
features = [X_naive, X_graph, X_hybrid]
accuracies = []

for X, name in zip(features, names):
    X_train, X_test, y_train, y_test = \
            train_test_split(X, Y, test_size=.2, random_state=42)

    X_Train = np.array(X_train)
    y_Train = np.array(y_train)
    X_Test= np.array(X_test)
    y_Test=np.array(y_test)

    model = Sequential()
    model.add(Dense(8, input_dim=len(X_train[0]), activation='relu')) # input_dim = 6 for Naive and 5 for Graph
    model.add(Dense(8, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

    model.fit(X_Train, y_Train, epochs=350, batch_size=10, verbose=0)

    _, accuracy = model.evaluate(X_Test, y_Test, verbose=0)
    accuracies.append(accuracy)
    
table = np.array([names, accuracies]).T
display(HTML(tabulate.tabulate(table, tablefmt='html', headers=["Feature Set", "Accuracy"])))

Feature Set,Accuracy
User,0.686275
Graph,0.701961
Hybrid,0.768627


In [31]:
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

accuracies = []

for X, name in zip(features, names):
    X_train, X_test, y_train, y_test = \
            train_test_split(X, Y, test_size=.2, random_state=42)

    X_Train = np.array(X_train)
    y_Train = np.array(y_train)
    X_Test= np.array(X_test)
    y_Test=np.array(y_test)

    model = XGBClassifier()
    model.fit(X_Train, y_Train)

    y_pred = model.predict(X_Test)
    predictions = [round(value) for value in y_pred]

    accuracies.append(accuracy_score(y_Test, predictions) * 100.0)
    
table = np.array([names, accuracies]).T
display(HTML(tabulate.tabulate(table, tablefmt='html', headers=["Feature Set", "Accuracy"])))

Feature Set,Accuracy
User,76.0784
Graph,69.8039
Hybrid,78.0392
