# 🚀 Neural Network-Based Text Compression

#### 🖋️ Authors
- Feidnand Eide
- Seran Shanmugathas


## 📚 Install Libraries
We will need the following libraries:
- `torch` (PyTorch)
- `torch_geometric` (PyTorch Geometric)
- `pytorch-lightning`
- `pandas`
- `numpy`

In [8]:
%pip install numpy pandas torch torch_geometric pytorch-lightning --quiet

## 📌 Import Dependencies
The following libraries are used in this project:
- Standard libraries: `enum`
- PyTorch and PyTorch Lightning for model building and training
- PyTorch Geometric for graph data structures and graph layers
- Pandas for data handling

In [9]:
# Standard libraries
from enum import Enum

# Pandas
import pandas as pd

# PyTorch Lightning
import pytorch_lightning as pl

# PyTorch
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

# PyTorch geometric
import torch_geometric
import torch_geometric.data as geom_data
import torch_geometric.nn as geom_nn

# PL callbacks
from pytorch_lightning.callbacks import ModelCheckpoint
from torch import Tensor

## 🔧 Configuration
Set up the configuration for the model training.

In [10]:
config: dict = {
    "data_path": "dataset_tsmc2014/dataset_TSMC2014_NYC.txt",
    "save_path": "models/model.pth",
    "batch_size": 32,
    "max_length": 512,
    "vocab_size": 256,
    "embedding_dim": 128,
    "hidden_dim": 256,
    "num_layers": 2,
    "dropout_rate": 0.5,
    "max_epochs": 1,
    "learning_rate": 1e-3,
    "num_workers": 11,
    "log_every_n_steps": 20,
    "pin_memory": torch.cuda.is_available(),
    "accelerator": "cuda" if torch.cuda.is_available() else "cpu",
}

## 🗂️ Load and Preprocess the Dataset


> Shell command for downloading data

In [11]:
!wget http://www-public.tem-tsp.eu/~zhang_da/pub/dataset_tsmc2014.zip
!unzip dataset_tsmc2014.zip

--2024-01-20 14:15:59--  http://www-public.tem-tsp.eu/~zhang_da/pub/dataset_tsmc2014.zip
Resolving www-public.tem-tsp.eu (www-public.tem-tsp.eu)... 157.159.10.107, 2001:660:3203:100:1:0:80:107
Connecting to www-public.tem-tsp.eu (www-public.tem-tsp.eu)|157.159.10.107|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 25546284 (24M) [application/zip]
Saving to: ‘dataset_tsmc2014.zip’


2024-01-20 14:16:05 (4.60 MB/s) - ‘dataset_tsmc2014.zip’ saved [25546284/25546284]

Archive:  dataset_tsmc2014.zip
   creating: dataset_tsmc2014/
  inflating: dataset_tsmc2014/dataset_TSMC2014_NYC.txt  
  inflating: dataset_tsmc2014/dataset_TSMC2014_readme.txt  
  inflating: dataset_tsmc2014/dataset_TSMC2014_TKY.txt  


> Enum for defining columns in the dataset

In [12]:
class Columns(Enum):
    """
    Enum containing the columns of the dataset.
    """
    USER_ID = "User ID"
    VENUE_ID = "Venue ID"
    VENUE_CATEGORY_ID = "Venue category ID"
    VENUE_CATEGORY_NAME = "Venue category name"
    LATITUDE = "Latitude"
    LONGITUDE = "Longitude"
    TIMEZONE = "Timezone"
    UTC_TIME = "UTC time"

In [13]:
def load_data(path: str) -> pd.DataFrame:
    """
    Load the data as a dataframe

    Parameters
    ----------
    path: str
        The path to load the data

    Returns
    -------
    pd.DataFrame
        The dataset
    """
    columns = [column.value for column in Columns]
    return pd.read_csv(path, sep='\t', encoding='latin-1', names=columns)

df = load_data(config["data_path"])

In [14]:
df.head()

| | User ID | Venue ID | Venue category ID | Venue category name | Latitude | Longitude | Timezone | UTC time |
|---|---|---|---|---|---|---|---|---|
| 0 | 470 | 49bbd6c0f964a520f4531fe3 | 4bf58dd8d48988d127951735 | Arts & Crafts Store | 40.71981 | -74.002581 | -240 | Tue Apr 03 18:00:09 +0000 2012 |
| 1 | 979 | 4a43c0aef964a520c6a61fe3 | 4bf58dd8d48988d1df941735 | Bridge | 40.6068 | -74.04417 | -240 | Tue Apr 03 18:00:25 +0000 2012 |
| 2 | 69 | 4c5cc7b485a1e21e00d35711 | 4bf58dd8d48988d103941735 | Home (private) | 40.716162 | -73.88307 | -240 | Tue Apr 03 18:02:24 +0000 2012 |
| 3 | 395 | 4bc7086715a7ef3bef9878da | 4bf58dd8d48988d104941735 | Medical Center | 40.745164 | -73.982519 | -240 | Tue Apr 03 18:02:41 +0000 2012 |
| 4 | 87 | 4cf2c5321d18a143951b5cec | 4bf58dd8d48988d1cb941735 | Food Truck | 40.740104 | -73.989658 | -240 | Tue Apr 03 18:03:00 +0000 2012 |
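
The `UTC time` and `Timezone` columns in the preview above can be combined into a local check-in time. The sketch below assumes, following the dataset's readme, that `Timezone` is the offset from UTC in minutes. The one-row sample and the names `sample_df`, `utc` and `local_time` are illustrative and not part of the original notebook.

```python
import io

import pandas as pd

# One row copied from the preview above, tab-separated like the raw file.
sample = (
    "470\t49bbd6c0f964a520f4531fe3\t4bf58dd8d48988d127951735\t"
    "Arts & Crafts Store\t40.71981\t-74.002581\t-240\t"
    "Tue Apr 03 18:00:09 +0000 2012\n"
)
columns = [
    "User ID", "Venue ID", "Venue category ID", "Venue category name",
    "Latitude", "Longitude", "Timezone", "UTC time",
]
sample_df = pd.read_csv(io.StringIO(sample), sep="\t", names=columns)

# Parse the Twitter-style timestamp as a timezone-aware UTC datetime.
utc = pd.to_datetime(sample_df["UTC time"], format="%a %b %d %H:%M:%S %z %Y", utc=True)

# Assumption: "Timezone" is the UTC offset in minutes, so local = UTC + offset.
local_time = utc.dt.tz_localize(None) + pd.to_timedelta(sample_df["Timezone"], unit="min")

print(local_time.iloc[0])  # 2012-04-03 14:00:09
```

An offset of -240 minutes corresponds to UTC-4, which is consistent with New York daylight saving time in April.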


> Clean the data

In [15]:
print(f"Rows before cleaning: {len(df)}")
df = df.drop_duplicates()
df = df.dropna()
print(f"Rows after cleaning: {len(df)}")

Rows before cleaning: 227428
Rows after cleaning: 227178


## 🤖 The Model
Here we define the graph convolutional network (GCN) layer that serves as the building block of our recommendation model.

In [16]:
class GCNLayer(nn.Module):
    def __init__(self, input_channels: int, output_channels: int):
        """
        Graph Convolutional Network (GCN) Layer.

        Parameters
        ----------
        input_channels : int
            The number of input channels (features) per node.
        output_channels : int
            The number of output channels (features) per node.
        """
        super().__init__()
        self.projection = nn.Linear(input_channels, output_channels)

    def forward(self, node_features: Tensor, adjacency_matrix: Tensor) -> Tensor:
        """
        Forward pass of the GCN layer.

        Parameters
        ----------
        node_features : Tensor
            Node features with shape [batch_size, num_nodes, input_channels].
        adjacency_matrix : Tensor
            Batch of adjacency matrices with shape [batch_size, num_nodes, num_nodes].
            If there is an edge from node i to node j, adjacency_matrix[b, i, j] = 1, else 0.
            Directed edges are supported through non-symmetric matrices. Self-connections
            (identity entries) are assumed to be already added.

        Returns
        -------
        Tensor
            The updated node features after applying the GCN layer, with shape
            [batch_size, num_nodes, output_channels].

        """
        # Neighbour count per node: row sums of the adjacency matrix (self-connections included)
        num_neighbours = adjacency_matrix.sum(dim=-1, keepdim=True)
        node_features = self.projection(node_features)
        node_features = torch.bmm(adjacency_matrix, node_features)
        return node_features / num_neighbours
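
The forward pass above can be written as a single equation. The notation is introduced here for illustration and does not appear in the original notebook: $H$ is the node feature matrix of one graph, $W$ and $b$ are the weight and bias of `self.projection`, $\hat{A}$ is the adjacency matrix with self-connections, and $\hat{D}$ is the diagonal matrix of its row sums (`num_neighbours`).

```latex
H' = \hat{D}^{-1} \hat{A} \left( H W^{\top} + \mathbf{1} b^{\top} \right),
\qquad
\hat{D}_{ii} = \sum_{j} \hat{A}_{ij}

% Equivalently, for each node i with neighbourhood N(i) = \{ j : \hat{A}_{ij} = 1 \}:
h'_i = \frac{1}{|N(i)|} \sum_{j \in N(i)} \left( W h_j + b \right)
```

Each node's new feature vector is therefore the mean of the projected features of its neighbours, itself included.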

In [21]:
node_features = torch.arange(8, dtype=torch.float32).view(1, 4, 2)
adjacency_matrix = torch.tensor(
    [[[1, 1, 0, 0], [1, 1, 1, 1], [0, 1, 1, 1], [0, 1, 1, 1]]], dtype=torch.float32
)

print("Node features:\n", node_features)
print("\nAdjacency matrix:\n", adjacency_matrix)

Node features:
 tensor([[[0., 1.],
         [2., 3.],
         [4., 5.],
         [6., 7.]]])

Adjacency matrix:
 tensor([[[1., 1., 0., 0.],
         [1., 1., 1., 1.],
         [0., 1., 1., 1.],
         [0., 1., 1., 1.]]])
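
To see what the layer does with these inputs, it can be applied with a hand-set projection. This is a minimal sketch, assuming identity weights and a zero bias so that the output is just the neighbourhood average of the input features. The layer is repeated in compact form so the snippet runs on its own, and the names `layer` and `output_features` are illustrative.

```python
import torch
import torch.nn as nn


class GCNLayer(nn.Module):
    """Compact copy of the layer defined above, repeated so this snippet is self-contained."""

    def __init__(self, input_channels: int, output_channels: int):
        super().__init__()
        self.projection = nn.Linear(input_channels, output_channels)

    def forward(self, node_features, adjacency_matrix):
        num_neighbours = adjacency_matrix.sum(dim=-1, keepdim=True)
        node_features = self.projection(node_features)
        node_features = torch.bmm(adjacency_matrix, node_features)
        return node_features / num_neighbours


node_features = torch.arange(8, dtype=torch.float32).view(1, 4, 2)
adjacency_matrix = torch.tensor(
    [[[1, 1, 0, 0], [1, 1, 1, 1], [0, 1, 1, 1], [0, 1, 1, 1]]], dtype=torch.float32
)

layer = GCNLayer(input_channels=2, output_channels=2)

with torch.no_grad():
    # Assumption for illustration only: identity weights and zero bias,
    # so the projection leaves the features unchanged.
    layer.projection.weight.copy_(torch.eye(2))
    layer.projection.bias.zero_()
    output_features = layer(node_features, adjacency_matrix)

print(output_features)
# tensor([[[1., 2.],
#          [3., 4.],
#          [4., 5.],
#          [4., 5.]]])
```

The first node averages itself and the second node, giving `[1, 2]`. The last two nodes share the same neighbourhood and therefore end up with identical features, which illustrates how a mean-aggregating layer can make nodes with the same neighbours indistinguishable.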


## 🏋️‍♂️ Training
Setting up the training environment and initiating the training process.

## 📈 Evaluation
Evaluating the model on the test set.