# Text Classification - Adjacency Matrix


## $\color{blue}{Sections:}$

* Preamble
1.   Admin
2.   Data
3.   Adjacency Matrix
4.   Save


## $\color{blue}{Preamble:}$

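The representation of our graph is central to any graph neural network (GNN).

In this notebook we create the adjacency matrices that all of our GNN approaches rely on. Each passage is a node, and two passages are connected when they mention at least one common named entity.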

## $\color{blue}{Admin}$
* Install relevant Libraries
* Import relevant Libraries

In [None]:
import openai
import re
import pandas as pd
from google.colab import drive
from google.colab import userdata
import os

## $\color{blue}{Data}$

* Connect to Drive
* Load the data to a string

In [None]:
drive.mount("/content/drive")
%cd '/content/drive/MyDrive'

Mounted at /content/drive
/content/drive/MyDrive


In [None]:
import pandas as pd
path = 'class/datasets/'
df_train = pd.read_pickle(path + 'df_train_augmentation_ft')
df_dev = pd.read_pickle(path + 'df_dev_augmentation_ft')
df_test = pd.read_pickle(path + 'df_test_augmentation_ft')

In [None]:
df_test.columns

Index(['master', 'book_idx', 'chapter_idx', 'content', 'vanilla_embedding.1',
       'direct_ft_augmented_embedding', 'ner_responses'],
      dtype='object')

In [None]:
for el in df_train['ner_responses'][0:5]:
  print(el)
  print()

Halted, he peered down the dark winding stairs and called out coarsely:   —Come up, @@Kinch##Person ! Come up, you fearful jesuit!   Solemnly he came forward and mounted the round gunrest. He faced about and blessed gravely thrice the tower, the surrounding land and the awaking mountains.

Then, catching sight of @@Stephen Dedalus##Person , he bent towards him and made rapid crosses in the air, gurgling in his throat and shaking his head.

@@Stephen Dedalus##Person , displeased and sleepy, leaned his arms on the top of the staircase and looked coldly at the shaking gurgling face that blessed him, equine in its length, and at the light untonsured hair, grained and hued like pale oak.   @@Buck Mulligan##Person peeped an instant under the mirror and then covered the bowl smartly.   —Back to barracks!

he said sternly.   He added in a preacher’s tone:   —For this, O dearly beloved, is the genuine @@Christine##Person : body and soul and blood and ouns. Slow music, please. Shut your eyes, ge

## $\color{blue}{Adjacency-Matrix}$


In [None]:
def get_entities(df):

  # Extract entities
  pattern = r"@@([^#]*)##(\w+\b)\S*"
  all_entities = [re.findall(pattern, text) for text in df['ner_responses']]

  #hold entities
  people = [None] * df.shape[0]
  locations = [None] * df.shape[0]
  entities = [None] * df.shape[0]

  # populate entity holders
  for i in range(len(entities)):

    people_holder = []
    locations_holder = []
    entity_holder = []

    for entity, label in all_entities[i]:
      if label in ('Person', 'person'):
        # strip honorifics (and an optional trailing period) so that
        # e.g. "mr. dedalus" and "dedalus" map to the same node
        title_pattern = r'\b(?:dr|mrs|mr|miss)\b\.?'
        person_clean = re.sub(title_pattern, '', entity.lower()).strip()
        people_holder.append(person_clean)
        entity_holder.append(person_clean)
      elif label in ('Location', 'location'):
        location_clean = entity.lower().strip()
        locations_holder.append(location_clean)
        entity_holder.append(location_clean)

    if people_holder:
      people[i] = people_holder
    if locations_holder:
      locations[i] = locations_holder
    if entity_holder:
      entities[i] = entity_holder

  return people, locations, entities
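
As a quick sanity check, the extraction pattern used in `get_entities` can be run on a single string written in the same `@@entity##Label` markup. The sample sentence below is made up for illustration and is not a row from the dataset.

```python
import re

# Same extraction pattern as in get_entities: "@@<entity>##<Label>"
pattern = r"@@([^#]*)##(\w+\b)\S*"

# Hypothetical sample in the same markup as the ner_responses column
sample = "Come up, @@Kinch##Person ! Then @@Buck Mulligan##Person left @@Dublin##Location ."

# Each match is an (entity, label) tuple
pairs = re.findall(pattern, sample)
print(pairs)
# [('Kinch', 'Person'), ('Buck Mulligan', 'Person'), ('Dublin', 'Location')]
```

Multi-word entities are captured whole because the first group matches everything up to the `##` separator.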

In [None]:
train_people, train_locations, train_entities = get_entities(df_train)
dev_people, dev_locations, dev_entities = get_entities(df_dev)
test_people, test_locations, test_entities = get_entities(df_test)

In [None]:
# combine dev + train rows so one adjacency matrix covers both sets
# (dev rows come first, followed by train rows)
df1 = df_train[['ner_responses']]
df2 = df_dev[['ner_responses']]
df_val = pd.concat([df2,df1])
val_people, val_locations, val_entities = get_entities(df_val)
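
Because the frames are concatenated as `[df2, df1]`, the dev rows occupy the first positions of the combined entity lists and the train rows follow. A minimal sketch of the resulting index ranges is shown below; the split sizes are hypothetical and only illustrate the layout.

```python
# Hypothetical split sizes, for illustration only
n_dev, n_train = 3, 5

# Positions in the combined (dev + train) entity lists and adjacency matrix
dev_idx = list(range(n_dev))                      # dev rows come first
train_idx = list(range(n_dev, n_dev + n_train))   # train rows follow

print(dev_idx)    # [0, 1, 2]
print(train_idx)  # [3, 4, 5, 6, 7]
```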

In [None]:
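import torch

def create_adjacency(lstr):
  # lstr holds one list of entity strings (or None) per row
  n = len(lstr)
  matrix = torch.zeros((n, n))
  for i in range(n):
    for j in range(n):
      if (i != j) and (lstr[i] is not None) and (lstr[j] is not None):
        # connect rows i and j if they share at least one entity
        if any(entity in lstr[j] for entity in lstr[i]):
          matrix[i, j] = 1
  return matrix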
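
To make the connection rule concrete, here is a dependency-free sketch of the same shared-entity rule on a toy input. It is not the notebook's function: `shared_entity_adjacency` is a hypothetical helper that uses set intersection and returns nested lists instead of a tensor, and the toy entity lists are invented.

```python
def shared_entity_adjacency(rows):
    """rows: one list of entity strings (or None) per passage."""
    sets = [set(r) if r else set() for r in rows]
    n = len(rows)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # an edge exists when the two passages share at least one entity
            if sets[i] & sets[j]:
                adj[i][j] = adj[j][i] = 1
    return adj

# Toy input: passage 2 has no entities, passage 3 shares none with the others
toy = [
    ["stephen dedalus", "buck mulligan"],
    ["buck mulligan"],
    None,
    ["dublin"],
]

adj = shared_entity_adjacency(toy)
for row in adj:
    print(row)
# [0, 1, 0, 0]
# [1, 0, 0, 0]
# [0, 0, 0, 0]
# [0, 0, 0, 0]
```

The relation is symmetric and has no self-loops, so only the upper triangle needs to be tested. Passages without entities, or with entities that appear nowhere else, end up as isolated nodes.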


## $\color{blue}{Save}$


In [None]:
path = 'class/tensors/adj_{}.pt'

In [None]:
# train
# train_people_adj = create_adjacency(train_people)
# torch.save(train_people_adj, path.format('train_people'))

# train_locations_adj = create_adjacency(train_locations)
# torch.save(train_locations_adj, path.format('train_locations'))

train_entities_adj = create_adjacency(train_entities)
torch.save(train_entities_adj, path.format('train_augmented_entities'))


In [None]:
# dev
# dev_people_adj = create_adjacency(dev_people)
# torch.save(dev_people_adj, path.format('dev_people'))

# dev_locations_adj = create_adjacency(dev_locations)
# torch.save(dev_locations_adj, path.format('dev_locations'))

dev_entities_adj = create_adjacency(dev_entities)
torch.save(dev_entities_adj, path.format('dev_augmented_entities'))

In [None]:
# test
# test_people_adj = create_adjacency(test_people)
# torch.save(test_people_adj, path.format('test_people'))

# test_locations_adj = create_adjacency(test_locations)
# torch.save(test_locations_adj, path.format('test_locations'))

test_entities_adj = create_adjacency(test_entities)
torch.save(test_entities_adj, path.format('test_augmented_entities'))

In [None]:
# val (dev + train)
# val_people_adj = create_adjacency(val_people)
# torch.save(val_people_adj, path.format('val_people.1'))

# val_locations_adj = create_adjacency(val_locations)
# torch.save(val_locations_adj, path.format('val_locations.1'))

val_entities_adj = create_adjacency(val_entities)
torch.save(val_entities_adj, path.format('val_augmented_entities'))

Create adjacency matrices and DataFrames restricted to connected nodes (rows that share at least one entity with another row) for the train and dev sets.

In [None]:
# train_entities_adj = create_adjacency(train_entities)
# train_connected_mask = (train_entities_adj.sum(dim=-1) != 0)
# train_connected_entities_adj = train_entities_adj[train_connected_mask][:,train_connected_mask]
# torch.save(train_connected_entities_adj,'class/tensors/adj_train_connected_entities.pt')

# dev_entities_adj = create_adjacency(dev_entities)
# dev_connected_mask = (dev_entities_adj.sum(dim=-1) != 0)
# dev_connected_entities_adj = dev_entities_adj[dev_connected_mask][:,dev_connected_mask]
# torch.save(dev_connected_entities_adj,'class/tensors/adj_dev_connected_entities.pt')
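
The filtering in the commented-out cell keeps only rows whose adjacency row sums to a non-zero value, then drops the matching columns. The sketch below mirrors that logic in plain Python on a toy matrix; the matrix values are invented and the list-based indexing stands in for the tensor indexing.

```python
# Toy adjacency matrix: nodes 0 and 1 are linked, nodes 2 and 3 are isolated
adj = [
    [0, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]

# A node is "connected" if its row has at least one non-zero entry
connected_mask = [sum(row) != 0 for row in adj]
keep = [i for i, flag in enumerate(connected_mask) if flag]

# Keep only the rows and columns of connected nodes
connected_adj = [[adj[i][j] for j in keep] for i in keep]

print(connected_mask)  # [True, True, False, False]
print(connected_adj)   # [[0, 1], [1, 0]]
```

The same boolean mask can then be used to select the matching DataFrame rows, so that row `i` of the filtered frame still corresponds to node `i` of the filtered matrix.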


In [None]:
# path = 'class/datasets/'

# train_mask = train_connected_mask.tolist()
# df_train_connected = df_train.loc[train_mask,:]
# df_train_connected.to_pickle(path + 'df_train_connected')

# dev_mask = dev_connected_mask.tolist()
# df_dev_connected = df_dev.loc[dev_mask,:]
# df_dev_connected.to_pickle(path + 'df_dev_connected')

In [None]:
# dev_mask is defined in the commented-out cell above; run that cell first
df_dev.loc[dev_mask,:].shape

(338, 37)

In [None]:
# # make adjacency of train + dev nodes
# path = 'class/tensors/adj_{}.pt'
# df1 = df_train_connected[['index', 'ner_responses']]
# df2 = df_dev_connected[['index', 'ner_responses']]
# df_val_connected = pd.concat([df2,df1])
# val_people, val_locations, val_entities = get_entities(df_val_connected)
# val_connected_entities_adj = create_adjacency(val_entities)
# torch.save(val_connected_entities_adj, path.format('val_connected_entities'))