## Document Classifier

In [4]:
!pip install -qq torchtext
!pip install -qq torchdata
!pip install torch==2.0.1+cpu torchvision==0.15.2+cpu torchtext==0.15.2+cpu --index-url https://download.pytorch.org/whl/cpu


Looking in indexes: https://download.pytorch.org/whl/cpu
Collecting torch==2.0.1+cpu
  Downloading https://download.pytorch.org/whl/cpu/torch-2.0.1%2Bcpu-cp310-cp310-linux_x86_64.whl (195.4 MB)
Collecting torchvision==0.15.2+cpu
  Downloading https://download.pytorch.org/whl/cpu/torchvision-0.15.2%2Bcpu-cp310-cp310-linux_x86_64.whl (1.5 MB)
Collecting torchtext==0.15.2+cpu
  Downloading https://download.pytorch.org/whl/torchtext-0.15.2%2Bcpu-cp310-cp310-linux_x86_64.whl (2.0 MB)
Collecting torchdata==0.6.1 (from torchtext==0.15.2+cpu)
  Downloading https://download.pytorch.org/whl/torchdata-0.6.1-cp310-cp310-manylinux_2_

In [4]:
!pip install portalocker


Collecting portalocker
  Downloading portalocker-3.0.0-py3-none-any.whl.metadata (8.5 kB)
Downloading portalocker-3.0.0-py3-none-any.whl (19 kB)
Installing collected packages: portalocker
Successfully installed portalocker-3.0.0


In [7]:
from itertools import accumulate

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.graph_objs as go
from tqdm import tqdm
from IPython.display import Markdown as md

import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torch.utils.data.dataset import random_split

from torchtext.datasets import AG_NEWS
from torchtext.data.utils import get_tokenizer
from torchtext.data.functional import to_map_style_dataset
from torchtext.vocab import build_vocab_from_iterator

from sklearn.manifold import TSNE
from sklearn.model_selection import train_test_split

# You can also use this section to suppress warnings generated by your code:
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn
warnings.filterwarnings('ignore')

In [8]:
def plot(COST, ACC):
    """Plot total loss (left axis, red) and accuracy (right axis, blue) per epoch."""
    fig, ax1 = plt.subplots()
    color = 'tab:red'
    ax1.plot(COST, color=color)
    ax1.set_xlabel('epoch')  # the x-axis is shared by both curves
    ax1.set_ylabel('total loss', color=color)
    ax1.tick_params(axis='y', labelcolor=color)

    ax2 = ax1.twinx()  # second y-axis that shares the same x-axis
    color = 'tab:blue'
    ax2.set_ylabel('accuracy', color=color)  # the x-label is already set on ax1
    ax2.plot(ACC, color=color)
    ax2.tick_params(axis='y', labelcolor=color)
    fig.tight_layout()  # otherwise the right y-label is slightly clipped

    plt.show()

### Creating iterator and checking text, associated labels

In [30]:
train_iter = iter(AG_NEWS(split="train"))

In [31]:
size = sum(1 for _ in train_iter)  # Count the number of items
print(f"Size of train_iter: {size}")

Size of train_iter: 120000


In [32]:
train_iter = iter(AG_NEWS(split="train"))
y, text = next(train_iter)
print(y, text)

3 Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\band of ultra-cynics, are seeing green again.


In [33]:
next(train_iter)  # each further call to next() returns the following (label, text) pair

(3,
 'Carlyle Looks Toward Commercial Aerospace (Reuters) Reuters - Private investment firm Carlyle Group,\\which has a reputation for making well-timed and occasionally\\controversial plays in the defense industry, has quietly placed\\its bets on another part of the market.')

In [22]:
ag_news_label = {1: "World", 2: "Sports", 3: "Business", 4: "Sci/Tech"}
ag_news_label[y]

'Business'

In [23]:
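# Count distinct labels over a fresh pass; train_iter above has already been partially consumed
num_class = len(set(label for label, _ in AG_NEWS(split="train")))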
num_class

4

## Data Preparation

1. What is an Iterable?

Definition: An iterable is any Python object that can be looped over, yielding its elements one at a time.

Key property: An iterable implements the `__iter__()` method, which returns an iterator.

Examples: Lists, tuples, dictionaries, strings, and any object that defines `__iter__()` or `__getitem__()`.

How to identify an iterable: If you can pass an object to `iter()` and get an iterator back, the object is an iterable. **AG_NEWS is an iterable object.**

The AG_NEWS dataset in torchtext does not support direct indexing like a list or tuple. It is not a random-access dataset but an iterable dataset that is consumed through an iterator. Streaming samples one at a time avoids holding the whole corpus in memory, which is useful for large text datasets.
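The difference between an iterable and an iterator can be checked with plain Python objects. The sketch below is only an illustration: it uses a small list of made-up headlines (the `headlines` variable is an assumption, not part of the dataset code) rather than AG_NEWS itself.

```python
# A list is an iterable: every call to iter() returns a fresh iterator.
headlines = ["Wall St. Bears Claw Back", "Carlyle Looks Toward Aerospace"]

first_pass = iter(headlines)
first_item = next(first_pass)    # first element
second_item = next(first_pass)   # second element

# The iterator is now exhausted; passing a default avoids StopIteration.
exhausted = next(first_pass, None)

# The iterable itself is unchanged, so a new iterator starts from the beginning.
restarted_item = next(iter(headlines))

# Iterables expose __iter__; iterators additionally expose __next__.
list_has_next = hasattr(headlines, "__next__")
iterator_has_next = hasattr(first_pass, "__next__")

print(first_item, "|", second_item, "|", exhausted, "|", restarted_item)
print(list_has_next, iterator_has_next)
```

The same pattern is why an exhausted dataset iterator has to be re-created with `iter(...)` before it can be read again.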



In [48]:

# Reload the training split without wrapping it in iter():
# AG_NEWS(split="train") returns an iterable that yields (label, text) pairs
train_iter = AG_NEWS(split="train")


# Define tokenizer and yield_tokens
tokenizer = get_tokenizer("basic_english")

# The purpose of the generator function yield_tokens is to yield tokenized texts one at a time.
def yield_tokens(data_iter):
    for _, text in data_iter:
        yield tokenizer(text.lower())  # Lowercase conversion for consistency

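If `train_iter` had instead been created with `iter(AG_NEWS(split="train"))`, it would be a one-shot iterator: `next(yield_tokens(train_iter))` would return the token list for one text, and calling it again would return the tokens of the next text, because the underlying iterator keeps its position. With the plain iterable used here, each new call to `yield_tokens(train_iter)` starts again from the first text.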
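This behaviour can be reproduced without downloading the dataset. In the sketch below, `toy_tokenizer` and `samples` are hypothetical stand-ins for `get_tokenizer("basic_english")` and AG_NEWS; only the shape of `yield_tokens` follows the cell above.

```python
def toy_tokenizer(text):
    # Hypothetical stand-in for the real tokenizer: lowercase, then split on whitespace.
    return text.lower().split()

def yield_tokens(data_iter):
    # Same shape as the generator above: ignore the label, yield one token list per text.
    for _, text in data_iter:
        yield toy_tokenizer(text)

# Made-up (label, text) pairs standing in for the dataset.
samples = [(3, "Wall St. Bears"), (3, "Carlyle Looks Toward"), (2, "Cup Final Tonight")]

# Case 1: a re-iterable object (a list). Every new generator restarts at the first sample.
from_iterable_a = next(yield_tokens(samples))
from_iterable_b = next(yield_tokens(samples))

# Case 2: a one-shot iterator. Its position is kept between generators, so the second call advances.
one_shot = iter(samples)
from_iterator_a = next(yield_tokens(one_shot))
from_iterator_b = next(yield_tokens(one_shot))

print(from_iterable_a, from_iterable_b)
print(from_iterator_a, from_iterator_b)
```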

What does build_vocab_from_iterator expect?

The function `build_vocab_from_iterator` works with any iterable whose items are lists (or iterators) of tokens, one item per text. It does not require an explicit iterator: a generator such as `yield_tokens(train_iter)` is enough, and a plain iterable is converted into an iterator internally with `iter()` when it is looped over.
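To make the idea concrete, here is a minimal pure-Python sketch of what building a vocabulary from an iterator of token lists involves. It is not torchtext's implementation: `build_toy_vocab` and `lookup` are hypothetical helpers, and the ordering rule (specials first, then descending frequency with alphabetical tie-breaking) is an assumption of this sketch.

```python
from collections import Counter

def build_toy_vocab(token_lists, specials=("<unk>",)):
    # Count every token across all token lists yielded by the iterator.
    counts = Counter(token for tokens in token_lists for token in tokens)
    # Assumed ordering: specials first, then by descending frequency, ties broken alphabetically.
    by_frequency = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    ordered = list(specials) + [token for token, _ in by_frequency]
    return {token: index for index, token in enumerate(ordered)}

def lookup(stoi, tokens, default="<unk>"):
    # Tokens missing from the vocabulary fall back to the index of the default token.
    return [stoi.get(token, stoi[default]) for token in tokens]

# An iterator that yields one token list per text.
token_lists = iter([["wall", "st", "bears"], ["bears", "claw", "back"]])
stoi = build_toy_vocab(token_lists)
indices = lookup(stoi, ["bears", "hello"])

print(stoi)
print(indices)
```

The fallback in `lookup` plays the role that `set_default_index` plays for a torchtext vocabulary: an out-of-vocabulary word such as "hello" here maps to the `<unk>` index instead of raising an error.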

In [44]:

# Build the vocabulary; "<unk>" is a special token used for out-of-vocabulary words
vocab = build_vocab_from_iterator(yield_tokens(train_iter), specials=["<unk>"])
vocab.set_default_index(vocab["<unk>"])

# Print the vocabulary size and sample tokens
print(f"Vocabulary size: {len(vocab)}")
print(f"Sample tokens: {list(vocab.get_stoi().keys())[:10]}")


Vocabulary size: 95811
Sample tokens: ['zzz', 'zygmunt', 'zwiki', 'zvidauri', 'zurine', 'zurab', 'zuo', 'zuloaga', 'zovko', 'zotinca']


In [51]:
vocab(["age", "hello"])  # look up the index of each token

[2120, 12544]

### Next Steps

1. Load the dataset: `train_iter` and `test_iter` hold the training and test data.
2. Convert to map-style datasets: make the datasets support random access (`train_dataset` and `test_dataset`).
3. Split the training dataset:
   - 95% for training (`split_train_`).
   - 5% for validation (`split_valid_`).
4. Select the device: use the GPU if one is available, otherwise default to the CPU.
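The split and device selection can be sketched on toy data as follows. This is a hedged illustration, not the notebook's actual cell: the list `train_dataset` is a made-up stand-in for the output of `to_map_style_dataset`, and only the names `split_train_` and `split_valid_` come from the steps listed here.

```python
import torch
from torch.utils.data import random_split

# Toy stand-in for a map-style dataset: a list of (label, text) pairs supports len() and indexing.
train_dataset = [(i % 4 + 1, f"sample text {i}") for i in range(100)]

# 95% of the samples for training, the remainder for validation.
num_train = len(train_dataset) * 95 // 100
num_valid = len(train_dataset) - num_train
split_train_, split_valid_ = random_split(train_dataset, [num_train, num_valid])

# Use the GPU when one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

print(len(split_train_), len(split_valid_), device.type)
```

Integer arithmetic is used for the split sizes so that the two lengths always sum to the dataset size, which `random_split` requires when it is given absolute lengths.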