### Sentiment140 data loading demo

This demo shows how to load the pre-partitioned Sentiment140 (Twitter) dataset. Following this guide, we load all the data from the files and create **one (torchtext-style) dataset for each Twitter user**.

The data is obtained by running the script from [LEAF](https://github.com/TalwalkarLab/leaf/tree/master/data/sent140). See more details in [my repo](https://github.com/wingter562/LEAF_prepartitioned/tree/main/sentiment140_k30).

In [None]:
# check the directory tree and create a dir for the Sentiment140 dataset
!ls -lh
!mkdir datasets

total 4.0K
drwxr-xr-x 1 root root 4.0K Oct  8 13:45 sample_data


In [None]:
# import packages
import torch
import numpy as np
import random

Download the pre-partitioned data, which has already been split into a training file and a test file.
- training data file name: all_data_niid_0_keep_30_train_9.json
- test file name: all_data_niid_0_keep_30_test_9.json

In [None]:
# download it from my repo
%cd /content/datasets/
!git clone https://github.com/wingter562/LEAF_prepartitioned.git
%cd LEAF_prepartitioned/sentiment140_k30/
!ls -lh

/content/datasets
Cloning into 'LEAF_prepartitioned'...
remote: Enumerating objects: 80, done.
remote: Counting objects: 100% (24/24), done.
remote: Compressing objects: 100% (23/23), done.
remote: Total 80 (delta 0), reused 12 (delta 0), pack-reused 56
Unpacking objects: 100% (80/80), done.
/content/datasets/LEAF_prepartitioned/sentiment140_k30
total 25M
-rw-r--r-- 1 root root 2.8M Oct 27 14:04 all_data_niid_0_keep_30_test_9.json
-rw-r--r-- 1 root root  22M Oct 27 14:04 all_data_niid_0_keep_30_train_9.json
-rw-r--r-- 1 root root 1.5K Oct 27 14:04 README.md


The JSON files use the following **format**: three key-value pairs at the top level, with the tweet data stored as a dictionary under the third key.
- "users": the list of users
- "num_samples": the list of number of samples for each user
- "user_data": 
  - username: 
    - "x": the list of the user's tweets, each tweet is a list containing the following items
      - tweet id, e.g., "1932561653"
      - date and timestamp, e.g., "Tue May 26 21:43:20 PDT 2009"
      - query tag, e.g., "NO_QUERY"
      - username (same as the key in the "user_data" dict), e.g., "sunshine_diva"
      - tweet text, e.g., "@faithgoddess7 I love the outdoors...but mostly in the Spring and Fall. Not too cold or too hot... "
      - train/test tag, e.g., "training"
    - "y": the labels of the user's tweets as a list of binary values, e.g., \[1 1 1 1 1 1 1 1 1 1\].
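
To make the layout concrete, here is a minimal sketch that builds a one-user, one-tweet miniature of this structure by hand and pulls out the tweet text. The field values are the examples listed above (the tweet text is shortened), and the label value is purely illustrative.

```python
import json

# A hand-built miniature of the file layout described above (one user, one tweet).
# Field values are the examples from the list; the label is illustrative.
sample = {
    "users": ["sunshine_diva"],
    "num_samples": [1],
    "user_data": {
        "sunshine_diva": {
            "x": [[
                "1932561653",                             # tweet id
                "Tue May 26 21:43:20 PDT 2009",           # date and timestamp
                "NO_QUERY",                               # query tag
                "sunshine_diva",                          # username
                "@faithgoddess7 I love the outdoors...",  # tweet text (shortened here)
                "training",                               # train/test tag
            ]],
            "y": [1],
        }
    },
}

# Round-trip through JSON to mimic reading the structure from a file.
loaded = json.loads(json.dumps(sample))

TEXT_FIELD = 4  # position of the tweet text inside each tweet record
for user in loaded["users"]:
    records = loaded["user_data"][user]["x"]
    labels = loaded["user_data"][user]["y"]
    texts = [rec[TEXT_FIELD] for rec in records]
    print(user, texts, labels)
```

Only field 4 (the text) and the label survive this extraction; the other fields are dropped.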

Based on the json data file structure, we define a function to extract the data, training and test separately, into two dictionaries using "username" as the keys.

Note: fields such as the date/timestamp and the query tag are discarded, as we are only interested in the tweet text and the label.

In [None]:
from collections import defaultdict
import os
import json

def read_json(fpath):
  assert fpath.endswith('.json')
  udata = defaultdict(lambda : None)

  print('loading %s...' % fpath)
  with open(fpath, 'r') as f:
    all_data = json.load(f)
  users = all_data['users']
  num_samples = all_data['num_samples']
  tweets = all_data['user_data']
  # parse user by user
  for u in users:
    # each user has multiple tweets; field 4 (0-based) of each tweet is its text
    u_tweets = [tw[4] for tw in tweets[u]['x']]
    u_labels = tweets[u]['y']
    udata[u] = {'x':u_tweets, 'y':u_labels}

  #users = list(sorted(tweets.keys()))
  return users, num_samples, udata

Next, we use the function to load the json files into dictionaries with "username" as the keys. The resulting train_data/test_data is in the following dictionary hierarchy:
- username:
  - 'x': the list of tweet texts
  - 'y': the list of the corresponding labels

In [None]:
users, num_samples_train, train_data = read_json("all_data_niid_0_keep_30_train_9.json")
_, num_samples_test, test_data = read_json("all_data_niid_0_keep_30_test_9.json")
print("number of users: ", len(users))
print("number of training samples: ", sum(num_samples_train))
print("number of test samples: ", sum(num_samples_test))
print("data sample format: ", train_data[users[0]])

loading all_data_niid_0_keep_30_train_9.json...
loading all_data_niid_0_keep_30_test_9.json...
number of users:  2875
number of training samples:  130743
number of test samples:  16025
data sample format:  {'x': ["@JennysMyName i'm sorry  p.s. Your mom can call tomorrow at 12:30pm or later. haah", 'This is making me cry ', "@DeeYoung08 i'm sorry  wish you could make it!", '@Janettex3 i guess he did! It is rude! ', '@M4DDYL0V3SY0U lmao, wow! Poor Maddy ', '@GabrielaElena neither can i  and my knees give out :/', '@tishh i agree ', '@biancaalosa i know! Its so sad  its /soeasyy; just search email: uptownherolover@aim.com', 'Good morning twitter world. P.s. I miss them ', "@RiskyBusinessMB i so wish i could've gone last night, i live 2 hours away   maybe another PPP with more notice?", "@biancaduhh awh! I'm sorry  my prayers are with your dad, &amp; his friends family &amp; friends.", 'Last time in the drama room ', 'This is sad ', '@JennysMyName neither do i ', "Freakin' cricket. Go away

In this demo, we wrap each Twitter user's data into a **custom dataset** object so that each user's samples can be loaded and processed through the [new PyTorch NLP pipeline](https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html) for sentiment analysis.

Native torchtext-style datasets (e.g., [torchtext.datasets.AG_NEWS](https://pytorch.org/text/stable/datasets.html)) are iterable (to enable dynamic loading of data streams) and yield a **(label, text)** tuple when iterated (see below). 

Nonetheless, in our case, a plain (map-style) dataset is enough for the pipeline.


```
# An example of native torchtext dataset
from torchtext.datasets import AG_NEWS

train_iter = AG_NEWS(split='train')
next(iter(train_iter))  # iter() returns an iterator unchanged, so this is safe for iterators too

>>> (3, "Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\\band of ultra-cynics, are seeing green again.")
```



In [None]:
from torch.utils.data import Dataset

class Sent140_user_dataset(Dataset):
  def __init__(self, username, user_data):
    """Create a Sentiment140 dataset instance for a single user.
    Arguments:
    username: username
    user_data: the user's data as a dictionary {'x': tweet texts, 'y': labels}, as described above
    """
    self.username = username
    self.user_data = user_data
    self.text = self.user_data['x']  # as a list of tweet text strings
    self.labels = self.user_data['y']  # as a list of integers (either 0 or 1)
    assert len(self.text) == len(self.labels)
    self.size = len(self.labels)

  def __len__(self):
    return self.size

  def __getitem__(self, index):
    return self.labels[index], self.text[index]  # a (label, text) tuple

Now we wrap the data, creating one dataset object for each user:

In [None]:
# one dataset object for one twitter user
user_datasets = {}
for u in users:
  user_datasets[u] = Sent140_user_dataset(u, train_data[u])

# pick one user and check it out
some_user = 'KayleenDuhh'
print("user:", user_datasets[some_user].username)
data_iterator = iter(user_datasets[some_user])
print("first three tweets:")
print(next(data_iterator))
print(next(data_iterator))
print(next(data_iterator))

user: KayleenDuhh
first three tweets:
(0, "@JennysMyName i'm sorry  p.s. Your mom can call tomorrow at 12:30pm or later. haah")
(0, 'This is making me cry ')
(0, "@DeeYoung08 i'm sorry  wish you could make it!")
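
Note that `Sent140_user_dataset` defines `__getitem__` and `__len__` but no `__iter__` of its own. The `iter(...)` call above still works because Python falls back to calling `__getitem__` with 0, 1, 2, ... until an `IndexError` is raised. The sketch below illustrates this fallback with a plain class; `TinyDataset` is a hypothetical stand-in that needs no PyTorch.

```python
class TinyDataset:
    """Map-style container: indexable, with a length, but no __iter__."""

    def __init__(self, texts, labels):
        assert len(texts) == len(labels)
        self.texts = texts
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, index):
        # Indexing a list past its end raises IndexError, which ends iteration.
        return self.labels[index], self.texts[index]


ds = TinyDataset(["This is making me cry ", "This is sad "], [0, 0])

# iter() falls back to __getitem__(0), __getitem__(1), ... until IndexError.
pairs = list(iter(ds))
print(pairs)
```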


Since each Twitter user has a very small corpus, **building a vocabulary** requires a global iterator (written as a generator function) that goes through all the tweets of all the users. The following code builds a global vocab from the pooled tweets.

In [None]:
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

tokenizer = get_tokenizer(tokenizer='spacy', language='en_core_web_sm')  # the default is "basic_english"
def yield_tokens(user_datasets):
  for uname in user_datasets:
    data_iter = iter(user_datasets[uname])
    for _, text in data_iter:
      yield tokenizer(text)

tokens_iter = yield_tokens(user_datasets)
print(next(tokens_iter))
print(next(tokens_iter))

# create vocab object, using torchtext apis and our token iterator function
vocab = build_vocab_from_iterator(yield_tokens(user_datasets), specials=["<pad>", "<unk>"])
vocab.set_default_index(vocab["<unk>"])

['@JennysMyName', 'i', "'m", 'sorry', ' ', 'p.s', '.', 'Your', 'mom', 'can', 'call', 'tomorrow', 'at', '12:30pm', 'or', 'later', '.', 'haah']
['This', 'is', 'making', 'me', 'cry']


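The vocab object maps tokens to their indices (or, equivalently, one-hot vectors). It also exposes the vocabulary size via `len(vocab)` and the index-ordered token list via `get_itos()`, in which the special tokens come first, followed by the remaining tokens in descending order of frequency.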

In [None]:
# test the vocab
print(vocab(['here', 'is', 'an', 'example']))
print("Vocabulary size: ", len(vocab))
print("Special tokens and 8 most frequent tokens", vocab.get_itos()[:10])
print("Index of word 'winter':", vocab['winter'])

[97, 17, 124, 4486]
Vocabulary size:  125979
Special tokens and 8 most frequent tokens ['<pad>', '<unk>', '!', '.', 'I', ' ', ',', 'to', 'the', 'you']
Index of word 'winter': 1797
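
The indexing behaviour seen in the output above (special tokens first, then tokens by descending frequency, unknown tokens mapped to a default index) can be imitated in plain Python. The sketch below is only an approximation for illustration, not torchtext's implementation; `build_toy_vocab` and `lookup` are hypothetical helpers and the corpus is made up.

```python
from collections import Counter


def build_toy_vocab(token_lists, specials=("<pad>", "<unk>")):
    """Assign indices: specials first, then tokens by descending frequency."""
    counter = Counter()
    for tokens in token_lists:
        counter.update(tokens)
    # Sort alphabetically first so that ties in frequency have a fixed order.
    by_freq = sorted(sorted(counter), key=lambda tok: counter[tok], reverse=True)
    itos = list(specials) + [t for t in by_freq if t not in specials]
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return stoi, itos


corpus = [
    ["This", "is", "sad"],
    ["This", "is", "making", "me", "cry"],
    ["This", "rocks"],
]
stoi, itos = build_toy_vocab(corpus)


def lookup(tokens, default="<unk>"):
    # Unknown tokens map to the index of the default token.
    return [stoi.get(tok, stoi[default]) for tok in tokens]


print(itos[:4])                           # ['<pad>', '<unk>', 'This', 'is']
print(lookup(["This", "is", "unseen"]))   # [2, 3, 1]
```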


With a global vocab ready (logically visible to all the clients), we now can set up the dataloaders for local training. Here we demonstrate with the users' training sets.

Loading text data is a bit tricky because sentences can vary in length. Depending on the model, a batch of sentences can be loaded in two ways:
- The DataLoader **stacks** the sentences when producing a batch and thus requires them to be of the same length. This is where the "padding" operation comes in.
- The DataLoader **concatenates** the sentences when producing a batch and records the offset of each sentence. This is to be coupled with "**Bag-of-Words**" models (see the book [McTear 2016](https://link.springer.com/content/pdf/10.1007/978-3-319-32967-3.pdf) and the [wiki link](https://en.wikipedia.org/wiki/Bag-of-words_model) for details), which typically use a special embedding layer (torch.nn.EmbeddingBag).
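
The two batch layouts can be compared on a toy example. This is a plain-Python sketch with made-up token indices; it assumes the padding token has index 0.

```python
# Three already-indexed sentences of different lengths (indices are made up).
sentences = [[5, 9, 2], [7], [4, 4, 8, 1]]
PAD = 0  # assumed index of the "<pad>" token

# Option 1: pad to a common length so the batch is a rectangular 2-D array.
max_len = max(len(s) for s in sentences)
padded = [s + [PAD] * (max_len - len(s)) for s in sentences]

# Option 2: concatenate into one flat list and remember where each sentence starts.
flat, offsets = [], []
for s in sentences:
    offsets.append(len(flat))
    flat.extend(s)

print(padded)         # [[5, 9, 2, 0], [7, 0, 0, 0], [4, 4, 8, 1]]
print(flat, offsets)  # [5, 9, 2, 7, 4, 4, 8, 1] [0, 3, 4]
```

The padded layout stores 12 values for 8 real tokens, whereas the concatenated layout stores only the 8 tokens plus one offset per sentence.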

We implement the **second** method through the DataLoader's **collate_fn** argument, as demonstrated in this [Torchtext tutorial](https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html).

In [None]:
from torch.utils.data import DataLoader

# define a sentence-to-indices pipeline using the tokenizer and vocab
# e.g., text_pipe('winter is coming.') >>> [1797, 17, 355, 3]
text_pipe = lambda x: vocab(tokenizer(x))

# the collate function is applied by the PyTorch DataLoader to each list of samples
def collate_batch(batch):
  label_list, text_list, offsets = [], [], [0]
  for (label, text) in batch:
    label_list.append(label)
    processed_text = torch.tensor(text_pipe(text), dtype=torch.int64)  # same dtype as offsets
    text_list.append(processed_text)
    offsets.append(len(processed_text))
  offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)  # cumulative offsets
  label_list = torch.tensor(label_list, dtype=torch.int)
  text_list = torch.cat(text_list)  # concat them
  return label_list, text_list, offsets

# create dataloaders
local_loaders = {}
for u in users:
  local_loaders[u] = DataLoader(user_datasets[u], 
                                batch_size=8, shuffle=False, 
                                collate_fn=collate_batch)

Now our local DataLoaders are ready. Check it out:

In [32]:
some_user = 'KayleenDuhh'
print("The shape of the batches for '%s':" % some_user)
for labels, text, offsets in local_loaders[some_user]:
  print(text.shape, "with %d sentences" % len(labels))

The shape of the batches for 'KayleenDuhh':
torch.Size([81]) with 8 sentences
torch.Size([97]) with 8 sentences
torch.Size([146]) with 8 sentences
torch.Size([118]) with 8 sentences
torch.Size([152]) with 8 sentences
torch.Size([67]) with 5 sentences


Each batch generated by our DataLoader is a 1-D tensor because the sentences are concatenated. This is the input shape expected by a special embedding layer called [torch.nn.EmbeddingBag](https://pytorch.org/docs/stable/nn.html?highlight=embeddingbag#torch.nn.EmbeddingBag). The offsets tell the layer where each sentence starts, so they must be passed to the model together with the text tensor.
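
To see why the offsets matter, here is a rough plain-Python sketch of what an embedding bag in `mean` mode computes: each sentence (bag) is reduced to the average of its tokens' embedding rows. `embedding_bag_mean` is a hypothetical helper, the table values are made up, and every bag is assumed to be non-empty.

```python
def embedding_bag_mean(table, flat_indices, offsets):
    """Average the embedding rows of each bag; bag i spans offsets[i]..offsets[i+1]."""
    ends = list(offsets[1:]) + [len(flat_indices)]
    dim = len(table[0])
    pooled = []
    for start, end in zip(offsets, ends):
        rows = [table[i] for i in flat_indices[start:end]]
        # Assumes the bag is non-empty, otherwise this would divide by zero.
        pooled.append([sum(r[d] for r in rows) / len(rows) for d in range(dim)])
    return pooled


# A toy 4-token vocabulary with 2-dimensional embeddings (values are made up).
table = [[0.0, 0.0], [1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
flat_indices = [1, 2, 3, 3]   # two sentences concatenated: [1, 2] and [3, 3]
offsets = [0, 2]              # start position of each sentence

print(embedding_bag_mean(table, flat_indices, offsets))  # [[2.0, 3.0], [5.0, 6.0]]
```

Whatever the sentence lengths, the result has one fixed-size vector per sentence, which is what the linear classifier needs.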

Before we move on to define the embedding layer and our model, we may want to load **pre-trained embedding vectors** so that our model's embedding layer can be properly initialized.

The vocab object we set up above is essentially a **word-to-index** dictionary. The new [torchtext Vocab](https://pytorch.org/text/stable/vocab.html) class no longer defines an **index-to-vector** mapping (i.e., an embedding), which means that we need an extra object for the pre-trained embeddings. Pre-trained embeddings are provided as [torchtext Vectors](https://pytorch.org/text/stable/vocab.html#pretrained-word-embeddings) objects.
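
Conceptually, combining the two objects means building one vector per vocabulary entry, in index order. The sketch below shows the idea with plain dictionaries and lists; `vectors_for_vocab` is a hypothetical helper, the 3-dimensional "pre-trained" table is made up, and the all-zeros fallback for unknown tokens is an assumption of this sketch.

```python
def vectors_for_vocab(itos, pretrained, dim, lower_case_backup=True):
    """Return one vector per vocabulary entry, in index order."""
    rows = []
    for token in itos:
        if token in pretrained:
            rows.append(pretrained[token])
        elif lower_case_backup and token.lower() in pretrained:
            # Fall back to the lower-cased token if the original case is missing.
            rows.append(pretrained[token.lower()])
        else:
            rows.append([0.0] * dim)  # assumed fallback for unknown tokens
    return rows


# Toy stand-ins: a 3-dimensional "pre-trained" table and a small vocabulary.
pretrained = {"winter": [0.1, 0.2, 0.3], "is": [0.4, 0.5, 0.6]}
itos = ["<pad>", "<unk>", "Winter", "is", "@JennysMyName"]

weights = vectors_for_vocab(itos, pretrained, dim=3)
for token, row in zip(itos, weights):
    print(token, row)
```

Row `i` of `weights` then belongs to the token with vocabulary index `i`, which is the alignment an embedding layer relies on.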

The following code loads **GloVe** as the pre-trained embeddings:

In [None]:
import torchtext

# 100-dim version of GloVe trained on a 6-billion-token corpus
glove_emb = torchtext.vocab.GloVe(name='6B', dim=100)

.vector_cache/glove.6B.zip: 862MB [02:41, 5.33MB/s]                           
100%|█████████▉| 399999/400000 [00:18<00:00, 22146.99it/s]


In [30]:
print("The embedding for word 'winter' from GloVe:")
print(glove_emb.get_vecs_by_tokens("winter", lower_case_backup=True))
print("The embedding shape for sentence '<pad> winter <unk>' from GloVe:")
print(glove_emb.get_vecs_by_tokens(['<pad>', 'winter', '<unk>'], lower_case_backup=True).shape)

The embedding for word 'winter' from GloVe:
tensor([-2.8663e-01,  6.7126e-01,  2.5696e-01, -3.4079e-01, -3.7587e-01,
        -6.8104e-01, -6.2413e-02,  4.7955e-01, -1.2093e+00, -4.8105e-01,
        -3.5496e-01, -7.1598e-01,  9.6647e-01, -1.1323e-01, -7.5996e-02,
         2.0510e-02, -8.1094e-02, -1.6928e-01, -3.0567e-01, -2.7979e-01,
        -3.6918e-01,  5.8724e-01,  8.2248e-01,  6.8072e-01, -4.6516e-01,
         8.2520e-01, -7.3008e-01, -6.3272e-01,  9.8242e-02, -5.7056e-01,
        -5.6975e-01, -4.0802e-01, -1.0243e-01, -5.2213e-01, -1.0497e-01,
         8.2012e-01,  3.4849e-02,  2.7265e-01, -2.1298e-01, -1.6997e-01,
        -1.1399e-01, -3.7558e-01, -9.0068e-02, -4.4219e-02,  9.6546e-01,
         4.1123e-01,  5.0636e-01, -1.0452e+00,  1.0377e+00, -4.7861e-01,
         2.0212e-01, -5.1381e-01,  3.6144e-01,  5.7141e-01, -1.2959e-01,
        -2.5805e+00,  8.0368e-02,  5.4031e-01,  1.2210e+00,  2.3555e-01,
        -2.9840e-01,  1.3073e+00, -2.7097e-01, -1.7336e-01,  1.0842e+00,
       

Now we define our **Bag-of-Words model with a Feed-Forward Network (FFN)** as the classifier. The structure of our model (again, from the [TorchText tutorial](https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html)) is shown below:

<p align="center">
  <img src="https://pytorch.org/tutorials/_images/text_sentiment_ngrams_model.png" width="450" height="450" />
</p>
<p align="center">
The BoW model from the <a href="https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html">TorchText tutorial</a> 
</p>

In [51]:
from torch import nn

class Model_BoW(nn.Module):
  def __init__(self, embed_dim, out_dim, vocab, pretrained):
    """
    embed_dim: number of dimensions for the embedding layer
    out_dim: number of dimensions for the output layer
    vocab: the vocabulary object
    pretrained: the loaded pre-trained embeddings
    """
    super(Model_BoW, self).__init__()
    # self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, sparse=True)
    self.embed_dim = embed_dim
    self.out_dim = out_dim
    assert self.embed_dim == pretrained.dim

    # load pre-trained embeddings as the initial weights
    # note: from_pretrained() freezes the embedding weights by default (freeze=True)
    pretrained_weights = self.embeddings2weights_by_vocab(vocab, pretrained)
    self.embedding = nn.EmbeddingBag.from_pretrained(pretrained_weights, sparse=False)

    self.fc = nn.Linear(embed_dim, out_dim)
    self.init_weights()
  
  def embeddings2weights_by_vocab(self, vocab, pretrained):
    """
    Load the pre-trained embeddings, ordered by vocabulary index, into a weight matrix.
    Tokens missing from the pre-trained table get the vectors' `unk_init` value
    (all zeros unless configured otherwise).
    """
    # vocab.get_itos() returns the list of tokens in index order
    # pretrained.get_vecs_by_tokens() takes a list of tokens and returns a 2D tensor
    return pretrained.get_vecs_by_tokens(vocab.get_itos(), lower_case_backup=True)

  def init_weights(self):
    initrange = 0.5
    #self.embedding.weight.data.uniform_(-initrange, initrange)
    self.fc.weight.data.uniform_(-initrange, initrange)
    self.fc.bias.data.zero_()

  def forward(self, text, offsets):
    embedded = self.embedding(text, offsets)
    out = self.fc(embedded)
    out = torch.sigmoid(out)
    return out

Now, using a global vocab and the loaded pre-trained embeddings, we can instantiate **a global model** that will be distributed to the clients to be trained on their local datasets/dataloaders we built earlier.

In [57]:
glob_model = Model_BoW(embed_dim=100, out_dim=1, vocab=vocab, pretrained=glove_emb)
print(glob_model)

# model inference
some_user = 'KayleenDuhh'
for batch in local_loaders[some_user]:
  labels, text, offsets = batch
  preds = glob_model(text, offsets).squeeze()
  print("\nModel predicts: ", preds.tolist())
  print("Ground truth: ", labels.tolist())

Model_BoW(
  (embedding): EmbeddingBag(125979, 100, mode=mean)
  (fc): Linear(in_features=100, out_features=1, bias=True)
)

Model predicts:  [0.3835672438144684, 0.3440566062927246, 0.34107184410095215, 0.36228814721107483, 0.49583756923675537, 0.3856084644794464, 0.5766748785972595, 0.5032239556312561]
Ground truth:  [0, 0, 0, 0, 0, 0, 0, 0]

Model predicts:  [0.39959895610809326, 0.4715730845928192, 0.42841678857803345, 0.2638038694858551, 0.2840256989002228, 0.5685793161392212, 0.4764051139354706, 0.3809822201728821]
Ground truth:  [0, 0, 0, 0, 0, 0, 0, 0]

Model predicts:  [0.5822653770446777, 0.42088374495506287, 0.47541746497154236, 0.35256025195121765, 0.381011962890625, 0.4372074604034424, 0.5040506720542908, 0.45315638184547424]
Ground truth:  [0, 0, 0, 0, 0, 0, 0, 0]

Model predicts:  [0.45184630155563354, 0.4816446304321289, 0.42339685559272766, 0.5139278769493103, 0.4345231354236603, 0.41869232058525085, 0.4425433874130249, 0.5118927359580994]
Ground truth:  [0, 0, 1, 1, 1
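
Because `forward` ends with a sigmoid, the outputs can be read as probabilities of the positive class. As a rough sketch of how such outputs could be scored, the snippet below takes the first batch printed above (rounded to four decimals), applies an assumed decision threshold of 0.5, and computes accuracy and binary cross-entropy in plain Python. The threshold and the choice of loss are assumptions of this sketch, not something fixed by the demo.

```python
import math

# First batch printed above, rounded to four decimals.
preds = [0.3836, 0.3441, 0.3411, 0.3623, 0.4958, 0.3856, 0.5767, 0.5032]
labels = [0, 0, 0, 0, 0, 0, 0, 0]


def binary_cross_entropy(probs, targets, eps=1e-12):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over the batch."""
    total = 0.0
    for p, y in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)


THRESHOLD = 0.5  # assumed decision threshold
hard_preds = [1 if p >= THRESHOLD else 0 for p in preds]
accuracy = sum(int(h == y) for h, y in zip(hard_preds, labels)) / len(labels)

print("hard predictions:", hard_preds)  # [0, 0, 0, 0, 0, 0, 1, 1]
print("accuracy:", accuracy)            # 0.75
print("BCE loss:", binary_cross_entropy(preds, labels))
```

The model is untrained at this point, so these numbers only reflect the initial weights.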

At this point the user datasets are ready and the model is defined for FL experiments. For the implementation of model training, evaluation and testing, we refer readers to the following links. They are written for centralized (single-machine) training but can be adapted to FL with little effort.

1. [TorchText tutorial](https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html) (new TorchText pipeline)
2. [bentrevett pytorch-sentiment-analysis repo](https://github.com/bentrevett/pytorch-sentiment-analysis) (old TorchText pipeline)
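
As a pointer to what the FL side could look like, here is a minimal plain-Python sketch of one common aggregation rule: averaging the clients' parameters weighted by their local sample counts (FedAvg-style). This is an assumption for illustration and not part of the demo above; `weighted_average` is a hypothetical helper and the parameter vectors are made up.

```python
def weighted_average(client_params, client_sizes):
    """Average per-client parameter vectors, weighted by local sample counts."""
    total = sum(client_sizes)
    dim = len(client_params[0])
    return [
        sum(p[d] * n for p, n in zip(client_params, client_sizes)) / total
        for d in range(dim)
    ]


# Hypothetical flattened parameters from three clients after local training.
client_params = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
client_sizes = [10, 30, 60]  # e.g. entries of num_samples_train for those users

global_params = weighted_average(client_params, client_sizes)
print(global_params)
```

In a real experiment the vectors would be the trainable tensors of each client's copy of `Model_BoW`, and the weights could come from `num_samples_train`.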