<a href="https://colab.research.google.com/github/vaxxstance/vaxxstance.github.io/blob/main/VaxxStance_Baseline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## VaxxStance@IberLEF 2021 Contextual Baseline

Mount your drive in order to access the files

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


The activities of the user in social networks will be measured with the following function. 

For this task, we measure twitter activities in terms of user status in twitter with the following measurements: 

- statuses_count  
- friends_count  
- followers_count  
- created_at  
- emoji (emojis in bio)

(Data available at file USER.csv)


On the other hand, the information related to the tweet itself will be measured with: 

- retweet_count  
- favorite_count  
- source  (FB, TWITTER)
- created_at

(Data available at file TWEETS.csv)


In [2]:
import pandas as pd

def getTwitterActionsFeatures(df_origin):
  df_tweets = pd.read_csv('/content/drive/path/to/dataset/tweet.csv')
  df_user = pd.read_csv('/content/drive/path/to/dataset/user.csv')
  df_merge = pd.merge(df_origin, df_tweets, on = ['user_id', 'tweet_id'])
  df_merged = pd.merge(df_merge, df_user, on = ['user_id'])
  return df_merged


In [None]:
# We build a train dataset that contains all the features that we want to use for training.

df = pd.read_csv('/content/drive/path/to/dataset/train.csv')
df_train = getTwitterActionsFeatures(df)
df_train

In [12]:
#data transformations:
# 1. labels to categorical codes
df_train.label = pd.Categorical(df_train.label)
df_train['label'] = df_train.label.cat.codes
# 2. source labels(twitter, fb) to  categorical codes
df_train['source'] = ('twitter' in str(df_train['source']))
df_train['source'] = df_train['source'].astype(int)
# 3. emoji in bio as count & fill NaN
import re
df_train['emoji_in_bio'] = df_train['emoji_in_bio'].apply(lambda x: len(re.findall(r'[\U0001f600-\U0001f650]', str(x))))
# 4. dates to timestamps
import numpy as np
df_train['created_at_x'] = pd.to_datetime(df_train['created_at_x']).astype(int)
df_train['created_at_y'] = pd.to_datetime(df_train['created_at_y']).astype(int)

In [24]:
from sklearn.model_selection import GroupShuffleSplit

## We divide the training set into train and test making sure that there is no user overlap among sets, in order to avoid overfitting.
train_inds, test_inds = next(GroupShuffleSplit(test_size=0.30, n_splits=2, random_state = 7).split(df_train, groups=df_train['user_id']))
train = df_train.iloc[train_inds]
test = df_train.iloc[test_inds]

X_train = train.drop(['tweet_id', 'user_id', 'text', 'label'], 1) # we don't need user ID, tweet Id, text or truth label in this set.
X_test = test.drop(['tweet_id', 'user_id', 'text', 'label'], 1) # we don't need user ID, tweet Id, text or truth label in this set.
y_train = train['label'] # truth labels
y_test = test['label']

In [25]:
def CreateBalancedSampleWeights(y_train, largest_class_weight_coef):
    classes = np.unique(y_train, axis = 0)
    classes.sort()
    class_samples = np.bincount(y_train)
    total_samples = class_samples.sum()
    n_classes = len(class_samples)
    weights = total_samples / (n_classes * class_samples * 1.0)
    class_weight_dict = {key : value for (key, value) in zip(classes, weights)}
    class_weight_dict[classes[1]] = class_weight_dict[classes[1]] * largest_class_weight_coef
    sample_weights = [class_weight_dict[y] for y in y_train]
    return sample_weights

In [26]:
from sklearn.utils import class_weight
import xgboost as xgb
import numpy as np
from sklearn import metrics
from sklearn.metrics import f1_score


largest_class_weight_coef = max(df_train['label'].value_counts().values)/df_train.shape[0]
weight = CreateBalancedSampleWeights(y_train, largest_class_weight_coef)

param_dist = {'objective':'multi:softmax', 'num_class': 3, 'eta': 0.3, 'max_depth':6, 'random_state': 24}
xg = xgb.XGBClassifier(**param_dist,  weights = weight)
bst = xg.fit(X_train, y_train)
preds = bst.predict(X_test)
print(metrics.classification_report(y_test, preds, digits=3))
fval = f1_score(y_test, preds, average=None)
print('F1 score average (Favour, against): ', (fval[0]+ fval[1])/2)

              precision    recall  f1-score   support

           0      0.913     0.741     0.818       170
           1      0.705     0.771     0.736       292
           2      0.609     0.625     0.617       192

    accuracy                          0.720       654
   macro avg      0.743     0.712     0.724       654
weighted avg      0.731     0.720     0.723       654

F1 score average (Favour, against):  0.7773396815950007
