# Text classification with GNN: TextGCN
In this tutorial, we will go through the details of how to implement the TextGCN model proposed by Yao et al. (2019). <br>

In the previous lectures, we saw how GNN models extract graph and node representations in unsupervised and supervised learning tasks.<br>
Here we demonstrate how GNNs can be applied to learning document embeddings.
Text GCN uses a graph convolutional network for text classification: it builds one graph over the whole corpus, with words and documents as nodes, and classifies the document nodes. The figure below illustrates this setup.

![](https://149695847.v2.pressablecdn.com/wp-content/uploads/2021/11/image-15-1024x412.png)


Text GCN uses an identity matrix as the input feature matrix, so every node (word or document) is represented by a one-hot vector. An edge between a document node and a word node is created when the word occurs in that document, and it is weighted by TF-IDF (term frequency-inverse document frequency), where, as usual, the term frequency is the number of times the word occurs in the document. Edges between two word nodes are built from word co-occurrence across the whole corpus: a fixed-size window slides over every document, and the resulting global co-occurrence statistics are what the model uses for prediction and classification.

Mathematically, the weight of the edge between node i and node j is defined as:


![](https://149695847.v2.pressablecdn.com/wp-content/uploads/2021/11/image-16.png)
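
In case the figure does not load, the edge-weight definition from the TextGCN paper can be written out as follows. Here $\#W$ is the total number of sliding windows in the corpus, $\#W(i)$ is the number of windows that contain word $i$, and $\#W(i,j)$ is the number of windows that contain both words. This is a transcription for reference; the notation in the figure may differ slightly.

```latex
$$
A_{ij} =
\begin{cases}
\mathrm{PMI}(i,j) & \text{if } i, j \text{ are words and } \mathrm{PMI}(i,j) > 0 \\
\text{TF-IDF}_{ij} & \text{if } i \text{ is a document and } j \text{ is a word} \\
1 & \text{if } i = j \\
0 & \text{otherwise}
\end{cases}
$$

$$
\mathrm{PMI}(i,j) = \log \frac{p(i,j)}{p(i)\,p(j)}, \qquad
p(i,j) = \frac{\#W(i,j)}{\#W}, \qquad
p(i) = \frac{\#W(i)}{\#W}
$$
```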

We summarize the pipeline for using TextGCN to perform text classification as follows:

1. **Preparation of text data**: cleaning data, removing stopwords, train-test split.
2. **Preparation of text graph**: build a heterogeneous graph based on the aforementioned formula.
3. **Model training**: create a GCN model that takes the text graph, node features, and edge weights to learn document embeddings.

## Data preparation

In [1]:
# install packages
import os
import torch
os.environ['TORCH'] = torch.__version__
print(torch.__version__)

!pip install -q torch-scatter -f https://data.pyg.org/whl/torch-${TORCH}.html
!pip install -q torch-sparse -f https://data.pyg.org/whl/torch-${TORCH}.html
!pip install -q git+https://github.com/pyg-team/pytorch_geometric.git
!pip install -q torch-cluster -f https://data.pyg.org/whl/torch-${TORCH}.html

1.12.1+cu113

In [2]:
# import packages
import random
import numpy as np
import pickle as pkl
import torch
import scipy.sparse as sp
import pandas as pd 

from math import log
from collections import defaultdict
from torch_geometric.data import Data
from sklearn.preprocessing import LabelEncoder
from tqdm.auto import tqdm

In [3]:
# create an empty Data object
pyg_data = Data()
pyg_data

Data()

In [4]:
# download data
! git clone https://github.com/iworldtong/text_gcn.pytorch.git
%cd text_gcn.pytorch/preprocess/
! python remove_words.py R8
%cd ../../

Cloning into 'text_gcn.pytorch'...
remote: Enumerating objects: 80, done.
remote: Counting objects: 100% (21/21), done.
remote: Compressing objects: 100% (13/13), done.
remote: Total 80 (delta 12), reused 8 (delta 8), pack-reused 59
Unpacking objects: 100% (80/80), done.
Checking out files: 100% (39/39), done.
/content/text_gcn.pytorch/preprocess
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
{'these', 'be', "didn't", 'once', 'in', 'can', 'have', 'whom', 'both', 'no', 'so', 'itself', 'very', 'while', "hasn't", 'before', 'yourselves', 'at', 'because', "should've", "you'll", 'shan', 'about', 'were', "weren't", 'our', 'from', 'here', 'just', "you'd", 'above', "doesn't", 'has', 'him', 'any', 'their', 'why', "you've", 'an', 'aren', 'wouldn', 'yours', 'you', 'how', 'a', 'haven', 'been', 'into', 'who', 'for', 'his', 'and', 'below', 'most', 'hers', 'mustn', "won't", "shouldn't", 'during', 'herself', 'theirs', 'did', 'e

In [5]:
dataset = "R8"

# list for data
train_data = []
test_data = []

with open('text_gcn.pytorch/data/' + dataset + '.txt', 'r') as f:
    lines = f.readlines()
    for line in lines:
        temp = line.strip().split("\t") # temp: [docID, Train/Test, target_class]
        if "train" in temp:
            train_data.append(temp)
        elif "test" in temp:
            test_data.append(temp)

In [6]:
train_data[:3]

[['0', 'train', 'earn'], ['1', 'train', 'acq'], ['2', 'train', 'earn']]

In [7]:
# unpack train data
train_ids, _, train_target = zip(*train_data)
train_ids = np.array(list(map(int,train_ids)))
train_target = list(train_target)

# unpack test data
test_ids, _, test_target = zip(*test_data)
test_ids = np.array(list(map(int,test_ids)))
test_target = list(test_target)

# all ids
document_ids = np.append(train_ids,test_ids)

# handling labels
label_list = train_target + test_target
enc = LabelEncoder()
label_list = enc.fit_transform(label_list)
train_target = enc.transform(train_target)
test_target = enc.transform(test_target)

num_classes = len(enc.classes_)
print("Number of documents",len(document_ids))
print("Number of classes:",num_classes)

# sizes
train_size = len(train_ids)
val_size = int(0.1 * train_size)
real_train_size = train_size - val_size
test_size = len(test_ids)

Number of documents 7674
Number of classes: 8


In [8]:
# load document content from cleaned data
doc_content_list = []
with open('text_gcn.pytorch/data/corpus/' + dataset + '.clean.txt', 'r') as f:
    lines = f.readlines()
    for line in lines:
        doc_content_list.append(line.strip())
print(doc_content_list[0])

champion products approves stock split champion products inc said board directors approved two one stock split common shares shareholders record april company also said board voted recommend shareholders annual meeting april increase authorized capital stock five mln mln shares reuter


## Build vocabulary
To create the text graph, we first need the vocabulary and the word statistics (word counts and document frequencies) from which the TF-IDF edge weights are computed.

In [9]:
# build vocabulary set and calculate word frequency
word_freq = {}
word_set = set()
for doc_words in doc_content_list:
    words = doc_words.split()
    for word in words:
        word_set.add(word)
        if word in word_freq:
            word_freq[word] += 1
        else:
            word_freq[word] = 1

vocab = list(word_set)
vocab_size = len(vocab)
print("Vocabulary size:",vocab_size)

Vocabulary size: 7688


In [10]:
# map each word to the set of documents that contain it
word_doc_list = defaultdict(set)

for i, doc_words in enumerate(doc_content_list):
    for word in doc_words.split():
        word_doc_list[word].add(i)

# document frequency: number of documents that contain each word
word_doc_freq = {}
for word, doc_list in word_doc_list.items():
    word_doc_freq[word] = len(doc_list)
    
# word-index mapping
word_id_map = {}
for i in range(vocab_size):
    word_id_map[vocab[i]] = i

vocab_str = '\n'.join(vocab)

## Heterogeneous graph construction
There are two types of nodes in the text graph: **document nodes** and **word nodes**, connected by two types of edges:<br>
* Document-word edge: if a word appears in the document, create an edge between them. TF-IDF is used as the edge weight.
* Word-word edge: captures the co-occurrence of two words that appear in the same sliding window. PMI (pointwise mutual information) is used as the edge weight, and only pairs with positive PMI are connected.

![](https://149695847.v2.pressablecdn.com/wp-content/uploads/2021/11/image-16.png)
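
To make the PMI weights concrete, here is a minimal sketch on a hypothetical three-document corpus (`toy_docs`, `toy_window_size`, and `toy_pmi` are illustrative names, not part of the notebook). It counts each word pair at most once per window, following the paper's definition of #W(i, j); the notebook cell below instead counts every pair of positions, so its counts can be larger when a word repeats inside a window.

```python
from collections import Counter
from itertools import combinations
from math import log

# hypothetical toy corpus, for illustration only
toy_docs = ["stock split approved", "board approved stock split", "oil prices rise"]
toy_window_size = 3

# slide a fixed-size window over every document
toy_windows = []
for doc in toy_docs:
    words = doc.split()
    if len(words) <= toy_window_size:
        toy_windows.append(words)
    else:
        for start in range(len(words) - toy_window_size + 1):
            toy_windows.append(words[start:start + toy_window_size])

# single[w]: number of windows containing w
# pair[(a, b)]: number of windows containing both a and b (a < b alphabetically)
single = Counter()
pair = Counter()
for window in toy_windows:
    unique = sorted(set(window))
    single.update(unique)
    pair.update(combinations(unique, 2))

def toy_pmi(a, b):
    """PMI of two words; -inf if they never share a window."""
    n = len(toy_windows)
    count = pair[tuple(sorted((a, b)))]
    if count == 0:
        return float("-inf")
    return log((count / n) / ((single[a] / n) * (single[b] / n)))

print(toy_pmi("oil", "prices"), toy_pmi("split", "stock"), toy_pmi("stock", "oil"))
```

Pairs that only ever appear together (such as "oil" and "prices" here) get a high PMI, while pairs that never share a window get no edge at all, which mirrors the `pmi <= 0` filter used when the word-word edges are created.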

In [11]:
# word co-occurrence with context windows
# store words in the same window
window_size = 15
windows = []

for doc_words in doc_content_list:
    words = doc_words.split()
    length = len(words)
    if length <= window_size:
        windows.append(words)
    else:
        for j in range(length - window_size + 1):
            window = words[j: j + window_size]
            windows.append(window)

# count the number of windows that contain each word (the numerator of p(word))
word_window_freq = {}
for window in windows:
    appeared = set()
    for i in range(len(window)):
        word = window[i]
        # count each word at most once per window
        if word in appeared:
            continue
        if word in word_window_freq:
            word_window_freq[word] += 1
        else:
            word_window_freq[word] = 1
        appeared.add(word)

# count co-occurring word pairs within each window (the numerator of p(i,j))
word_pair_count = {}
for window in tqdm(windows):
    for i in range(1, len(window)):
        for j in range(0, i):
            word_i = window[i]
            word_i_id = word_id_map[word_i]
            word_j = window[j]
            word_j_id = word_id_map[word_j]
            if word_i_id == word_j_id:
                continue
            word_pair_str = str(word_i_id) + ',' + str(word_j_id)
            if word_pair_str in word_pair_count:
                word_pair_count[word_pair_str] += 1
            else:
                word_pair_count[word_pair_str] = 1
            # two orders
            word_pair_str = str(word_j_id) + ',' + str(word_i_id)
            if word_pair_str in word_pair_count:
                word_pair_count[word_pair_str] += 1
            else:
                word_pair_count[word_pair_str] = 1


## Creating word-word edges and calculating edge weights

In [12]:
row = []
col = []
weight = []

# pmi as weights

num_window = len(windows)

for key in word_pair_count:
    temp = key.split(',')
    i = int(temp[0])
    j = int(temp[1])
    count = word_pair_count[key]
    word_freq_i = word_window_freq[vocab[i]]
    word_freq_j = word_window_freq[vocab[j]]
    # PMI = log(p(i,j) / (p(i) * p(j)))
    pmi = log((count / num_window) /
              (word_freq_i * word_freq_j/(num_window * num_window)))
    if pmi <= 0:
        continue
    row.append(train_size + i)
    col.append(train_size + j)
    weight.append(pmi)

In [24]:
row[:5]

[8529, 12591, 10396, 12591, 10396]

In [25]:
col[:5]

[12591, 8529, 12591, 10396, 8529]

In [13]:
print("Number of edges:",len(row))

Number of edges: 2432734
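
The offset `train_size + i` used above reflects how nodes are ordered in the graph: training documents first, then words, then test documents. The sketch below illustrates that layout with hypothetical toy sizes and helper names (`doc_node`, `word_node`); like the notebook, it assumes that document IDs are numbered so that all training documents come before the test documents.

```python
# hypothetical toy sizes, chosen only for illustration
toy_train_size, toy_vocab_size, toy_test_size = 4, 3, 2

def doc_node(doc_id):
    """Node index of a document; test documents sit after the word block."""
    if doc_id < toy_train_size:
        return doc_id
    return doc_id + toy_vocab_size

def word_node(word_id):
    """Node index of a word; words sit right after the training documents."""
    return toy_train_size + word_id

# the full node ordering, spelled out for the toy sizes
layout = (
    [f"train_doc_{d}" for d in range(toy_train_size)]
    + [f"word_{w}" for w in range(toy_vocab_size)]
    + [f"test_doc_{d}" for d in range(toy_train_size, toy_train_size + toy_test_size)]
)

print(layout[word_node(1)], layout[doc_node(5)])
```

The same convention explains why test documents are later shifted by `vocab_size` when the document-word edges and the test mask are built.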


## Creating word-document edges and calculating TF-IDF

In [14]:
# term frequency: how often each word occurs in each document
doc_word_freq = defaultdict(lambda:0)

for doc_id in document_ids:
    doc_words = doc_content_list[doc_id]
    words = doc_words.split()
    for word in words:
        word_id = word_id_map[word]
        doc_word_str = str(doc_id) + ',' + str(word_id)
        doc_word_freq[doc_word_str] += 1

for doc_index in document_ids:
    doc_words = doc_content_list[doc_index]
    
    # avoid repeated calculation
    words = set(doc_words.split()) 
    for word in words:
        word_index = word_id_map[word]
        key = str(doc_index) + ',' + str(word_index)
        freq = doc_word_freq[key]
        if doc_index < train_size:
            row.append(doc_index)
        else:
            row.append(doc_index + vocab_size)
        col.append(train_size + word_index)
        
        # TF-IDF as edge weight
        idf = log(1.0 * len(document_ids) /
                  word_doc_freq[vocab[word_index]])
        weight.append(freq * idf) # TF*IDF


In [15]:
print("Number of edges:",len(row))

Number of edges: 2756404
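
As a small worked example of the TF-IDF weight used above (raw term count times `log(N / df)`, with no smoothing), here is a sketch on a hypothetical three-document corpus. The names `toy_docs`, `doc_freq`, and `toy_tfidf` are illustrative only.

```python
from collections import Counter
from math import log

# hypothetical toy corpus, for illustration only
toy_docs = ["stock split stock", "oil prices", "stock prices rise"]
tokenized = [doc.split() for doc in toy_docs]

# document frequency: in how many documents does each word occur?
doc_freq = Counter(word for words in tokenized for word in set(words))

def toy_tfidf(doc_index, word):
    """Raw term count in the document times log(N / document frequency)."""
    tf = tokenized[doc_index].count(word)
    idf = log(len(tokenized) / doc_freq[word])
    return tf * idf

print(toy_tfidf(0, "stock"), toy_tfidf(1, "oil"))
```

A word that is frequent in one document but rare across the corpus gets a large weight, while a word that occurs in every document would get an IDF of zero and therefore a zero-weight edge.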


### Storing all data in `pyg_data`

In [16]:
# nodes
node_size = train_size + vocab_size + test_size
pyg_data.num_nodes = node_size

adj = sp.csr_matrix(
    (weight, (row, col)), shape=(node_size, node_size))
adj = adj + adj.T.multiply(adj.T > adj) - adj.multiply(adj.T > adj) # symmetric

from torch_geometric.utils import from_scipy_sparse_matrix
edge_index, edge_weight = from_scipy_sparse_matrix(adj)

pyg_data.edge_index = edge_index.long()
pyg_data.edge_weight = edge_weight.float()
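
The symmetrization expression above can look opaque. The sketch below applies the same expression to a small hypothetical matrix (`toy`, with arbitrary values) and checks that it amounts to taking the element-wise maximum of the matrix and its transpose, at least for non-negative weights like the ones used here.

```python
import numpy as np
import scipy.sparse as sp

# hypothetical asymmetric weight matrix with arbitrary non-negative values
toy = sp.csr_matrix(np.array([[0.0, 2.0, 0.0],
                              [1.0, 0.0, 0.0],
                              [3.0, 0.0, 0.0]]))

# same expression as used for `adj` above:
# wherever the transposed entry is larger, swap it in for the original entry
toy_sym = toy + toy.T.multiply(toy.T > toy) - toy.multiply(toy.T > toy)

# reference result: element-wise maximum of the matrix and its transpose
dense = toy.toarray()
expected = np.maximum(dense, dense.T)

print(toy_sym.toarray())
print(np.array_equal(toy_sym.toarray(), expected))
```

Because document-word edges are only inserted in one direction (document row, word column), this step is what makes every such edge usable in both directions during message passing.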

In [17]:
# masks for training, testing
train_masks = torch.zeros(node_size).bool()
train_masks[train_ids] = 1
test_masks = torch.zeros(node_size).bool()
test_masks[test_ids+vocab_size] = 1 

pyg_data.train_mask = train_masks
pyg_data.test_mask = test_masks

# labels
pyg_data.num_classes = num_classes
targets = [torch.FloatTensor(train_target), torch.zeros(vocab_size), torch.FloatTensor(test_target)]
targets = torch.cat(targets,dim=0).long()
pyg_data.target = targets
assert targets.shape[0] == pyg_data.num_nodes

# initial node feature
node_feature = torch.eye(pyg_data.num_nodes)
pyg_data.x = node_feature

In [32]:
pyg_data

Data(num_nodes=15362, edge_index=[2, 3080074], edge_weight=[3080074], train_mask=[15362], test_mask=[15362], num_classes=8, target=[15362], x=[15362, 15362])

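In [28]:
from torch_geometric.nn import GCNConv
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = GCNConv(pyg_data.num_nodes, num_classes).to(device)

In [30]:
model(pyg_data.x.to(device), pyg_data.edge_index.to(device), pyg_data.edge_weight.to(device)).shape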

torch.Size([15362, 8])
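
As a reminder of what this layer computes: according to the PyTorch Geometric documentation, `GCNConv` applies the propagation rule of Kipf and Welling, with the edge weights entering through the adjacency matrix $A$ and self-loops added by default. The symbols below follow that documentation rather than anything defined in this notebook.

```latex
$$
X' = \hat{D}^{-1/2} \hat{A} \hat{D}^{-1/2} X \Theta,
\qquad \hat{A} = A + I,
\qquad \hat{D}_{ii} = \sum_{j} \hat{A}_{ij}
$$
```

Since the node features here are the identity matrix ($X = I$), the product $X\Theta$ is simply $\Theta$, so each row of the first layer's weight matrix acts as a learned embedding for one node.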

## Practice: Training TextGCN model
So far, we have prepared everything you need to train the TextGCN model. <br>
Please try to create a GCN model and perform text classification as a node classification task.<br>

In [48]:
from torch_geometric.nn import GCNConv
import torch.nn.functional as F
import torch.nn as nn

class TextGCN(torch.nn.Module):
    def __init__(self, in_channels, hidden_channels, out_channels):
        super().__init__()
        ############################################################################
        # TODO: Your code here! 
        # create your GCN models here
        # Note: the input and output dimension is defined by the input parameters in_channels and out_channels
        self.gcn1 = GCNConv(in_channels, hidden_channels)
        self.gcn2 = GCNConv(hidden_channels, out_channels)
        ############################################################################

    def forward(self, x, edge_index, edge_weight):
        # x = None
        ############################################################################
        # TODO: Your code here!       
        # define your forward pass logic here
        x = self.gcn1(x, edge_index, edge_weight)
        x = torch.relu(x)
        x = self.gcn2(x, edge_index, edge_weight)
        ############################################################################
        return x

In [49]:
# model configurations
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
dim = 64 
model = TextGCN(pyg_data.x.shape[1], dim, 8).to(device)
optimizer = torch.optim.Adam(model.parameters())
criterion = nn.CrossEntropyLoss(reduction="none")
############################################################################
# TODO: Your code here!  
# setup the model, optimizer here

############################################################################
print(model)

TextGCN(
  (gcn1): GCNConv(15362, 64)
  (gcn2): GCNConv(64, 8)
)


In [38]:
# map data to GPU device
pyg_data.x = pyg_data.x.to(device)
pyg_data.edge_index = pyg_data.edge_index.to(device)
pyg_data.edge_weight = pyg_data.edge_weight.to(device)
pyg_data.target = pyg_data.target.to(device)

In [39]:
for epoch in range(200):
    logits = model(pyg_data.x, pyg_data.edge_index, pyg_data.edge_weight)
    loss = criterion(logits, pyg_data.target)[pyg_data.train_mask]
    loss = loss.mean()    
    
    # Backward and optimize
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    
    # evaluation
    # train_accuracy = None
    # test_accuracy = None
    ############################################################################
    # TODO: Your code here!  
    # calculate the train/test accuracy
    
    ############################################################################
    # print(f"Train Accuracy:{train_accuracy:.4f} | Test Accuracy:{test_accuracy:.4f}")
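
One possible way to fill in the accuracy TODO is sketched below. The helper name `masked_accuracy` and the synthetic tensors are hypothetical and only serve to show the idea: take the argmax of the logits and compare it with the targets on the nodes selected by a mask.

```python
import torch

def masked_accuracy(logits, target, mask):
    """Fraction of correct argmax predictions among the nodes selected by mask."""
    pred = logits.argmax(dim=1)
    return (pred[mask] == target[mask]).float().mean().item()

# synthetic check with 4 nodes and 3 classes (values are arbitrary)
toy_logits = torch.tensor([[2.0, 0.1, 0.0],
                           [0.0, 3.0, 0.1],
                           [0.5, 0.2, 0.1],
                           [0.0, 0.0, 1.0]])
toy_target = torch.tensor([0, 1, 2, 2])
toy_mask = torch.tensor([True, True, True, False])

toy_acc = masked_accuracy(toy_logits, toy_target, toy_mask)
print(f"{toy_acc:.4f}")
```

Inside the training loop, such a helper could be called under `torch.no_grad()` with `logits`, `pyg_data.target`, and either `pyg_data.train_mask` or `pyg_data.test_mask`. Because the word nodes carry a dummy label of 0, restricting the comparison to the masks is what keeps them out of the score.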


## Additional questions
If you have already finished the above exercise, try to answer the following questions!
* How many GCN layers achieve the best performance?
* Does the window size affect the performance? What happens if we increase or decrease the window size?
* What are the limitations of TextGCN?