
### Specs of Devices
Colab primarily offers two GPU options:
*   A100 GPU: hard to get assigned to this device (accessed Nov 30, 10:00 AM)
    - High-performance GPUs are allocated selectively, reportedly according to priority.
*   V100 GPU




#### 1. A100 GPU - Not available
As shown below, the A100 device is reported as NVIDIA A100-SXM4-40GB.

Its total GPU memory is 40960.0 MB (approx. 40 GB).
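
The "approx. 40 GB" figure follows directly from the reported total. Below is a minimal sketch of the conversion, assuming the reported value is in MiB (the unit `nvidia-smi` prints for the same number). `mib_to_gib` is a hypothetical helper name, not part of GPUtil.

```python
def mib_to_gib(mib: float) -> float:
    """Convert mebibytes to gibibytes (1 GiB = 1024 MiB)."""
    return mib / 1024

# Totals reported in this notebook (assumed to be MiB)
a100_total = mib_to_gib(40960.0)
v100_total = mib_to_gib(16384.0)
print(f"A100: {a100_total:.1f} GiB, V100: {v100_total:.1f} GiB")
# A100: 40.0 GiB, V100: 16.0 GiB
```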

In [1]:
!pip install GPUtil

Collecting GPUtil
  Downloading GPUtil-1.4.0.tar.gz (5.5 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: GPUtil
  Building wheel for GPUtil (setup.py) ... done
  Created wheel for GPUtil: filename=GPUtil-1.4.0-py3-none-any.whl size=7395 sha256=b53f6d7c05c5b5d65079cfcece0b6d40037ac17e026aec22c4caeb9f11c2a5be
  Stored in directory: /root/.cache/pip/wheels/a9/8a/bd/81082387151853ab8b6b3ef33426e98f5cbfebc3c397a9d4d0
Successfully built GPUtil
Installing collected packages: GPUtil
Successfully installed GPUtil-1.4.0


In [2]:
import GPUtil

In [3]:
# Get the list of available GPUs
gpus = GPUtil.getGPUs()

# Print GPU information
for i, gpu in enumerate(gpus):
    print(f"GPU {i + 1}:")
    print(f"  Name: {gpu.name}")
    print(f"  Driver: {gpu.driver}")
    print(f"  GPU Memory: {gpu.memoryTotal} MB")
    print(f"  GPU Memory Free: {gpu.memoryFree} MB")
    print(f"  GPU Memory Used: {gpu.memoryUsed} MB")
    print(f"  GPU Load: {gpu.load * 100}%")
    print("\n")

GPU 1:
  Name: NVIDIA A100-SXM4-40GB
  Driver: 525.105.17
  GPU Memory: 40960.0 MB
  GPU Memory Free: 40513.0 MB
  GPU Memory Used: 0.0 MB
  GPU Load: 0.0%




In [4]:
# Comprehensive view of GPU specs
!nvidia-smi

Thu Nov 30 16:07:17 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA A100-SXM...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   31C    P0    46W / 400W |      0MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

#### 2. V100 GPU
As shown below, the V100 device is reported as Tesla V100-SXM2-16GB.

Its total GPU memory is 16384.0 MB (approx. 16 GB).

In [None]:
# Get the list of available GPUs
gpus = GPUtil.getGPUs()

# Print GPU information
for i, gpu in enumerate(gpus):
    print(f"GPU {i + 1}:")
    print(f"  Name: {gpu.name}")
    print(f"  Driver: {gpu.driver}")
    print(f"  GPU Memory: {gpu.memoryTotal} MB")
    print(f"  GPU Memory Free: {gpu.memoryFree} MB")
    print(f"  GPU Memory Used: {gpu.memoryUsed} MB")
    print(f"  GPU Load: {gpu.load * 100}%")
    print("\n")

GPU 1:
  Name: Tesla V100-SXM2-16GB
  Driver: 525.105.17
  GPU Memory: 16384.0 MB
  GPU Memory Free: 16150.0 MB
  GPU Memory Used: 0.0 MB
  GPU Load: 0.0%




In [None]:
# Comprehensive view of GPU specs
!nvidia-smi

Mon Nov 27 20:49:47 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla V100-SXM2...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   31C    P0    24W / 300W |      0MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

### Program

In [1]:
import torch
torch.manual_seed(10)
from torch.autograd import Variable
from torch.utils.data import DataLoader
import pandas as pd
import numpy as np
from sklearn import decomposition
from pathlib import Path
import warnings
warnings.filterwarnings("ignore")
import seaborn as sns
from matplotlib import pyplot as plt
plt.rcParams['figure.figsize'] = (10,8)
import nltk
from nltk.corpus import stopwords
import time

In [2]:
######################################
# Passage split
######################################
# nltk is already imported above.
# Download the punkt tokenizer and the stopword list if not already downloaded
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [3]:
# Specify the path to your text file
file_path = "/content/New_data.txt"

# Read the passage from the text file
with open(file_path, 'r', encoding='utf-8') as file:
    passage = file.read()

# Use nltk to tokenize the text into sentences
corpus = nltk.sent_tokenize(passage)

In [4]:
def create_vocabulary(corpus):
    '''Creates a dictionary with all unique words in corpus with id'''
    vocabulary = {}
    i = 0
    for s in corpus:
        for w in s.split():
            if w not in vocabulary:
                vocabulary[w] = i
                i+=1
    return vocabulary

def prepare_set(corpus, n_gram = 1):
    '''Creates a dataset with an Input column and Output columns for neighboring words.
       The number of neighbors = n_gram*2; missing neighbors are filled with <padding>'''
    columns = ['Input'] + [f'Output{i+1}' for i in range(n_gram*2)]
    rows = []
    for sentence in corpus:
        words = sentence.split()  # split once per sentence
        for i,w in enumerate(words):
            out = []
            for n in range(1,n_gram+1):
                # look back
                if (i-n)>=0:
                    out.append(words[i-n])
                else:
                    out.append('<padding>')

                # look forward
                if (i+n)<len(words):
                    out.append(words[i+n])
                else:
                    out.append('<padding>')
            rows.append([w]+out)
    # DataFrame.append was removed in pandas 2.0, so build the frame once from the collected rows
    return pd.DataFrame(rows, columns = columns)

In [5]:
def prepare_set_ravel(corpus, n_gram = 1):
    '''Creates a dataset with one (Input, Output) row per neighboring word pair.
       Each word has up to n_gram*2 neighbors; positions outside the sentence are skipped'''
    columns = ['Input', 'Output']
    rows = []
    for sentence in corpus:
        words = sentence.split()  # split once per sentence
        for i,w in enumerate(words):
            for n in range(1,n_gram+1):
                # look back
                if (i-n)>=0:
                    rows.append([w, words[i-n]])

                # look forward
                if (i+n)<len(words):
                    rows.append([w, words[i+n]])
    # DataFrame.append was removed in pandas 2.0, so build the frame once from the collected rows
    return pd.DataFrame(rows, columns = columns)
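
To make the pair layout concrete, here is a minimal pure-Python sketch of the same windowing rule applied to one toy sentence. `skipgram_pairs` is a hypothetical stand-in for `prepare_set_ravel` that returns plain tuples instead of a DataFrame, and the toy sentence is made up for illustration.

```python
def skipgram_pairs(sentence: str, n_gram: int = 1):
    """Return (center, context) pairs within n_gram words on each side."""
    words = sentence.split()
    pairs = []
    for i, center in enumerate(words):
        for n in range(1, n_gram + 1):
            if i - n >= 0:            # look back
                pairs.append((center, words[i - n]))
            if i + n < len(words):    # look forward
                pairs.append((center, words[i + n]))
    return pairs

pairs = skipgram_pairs("gpu trains model", n_gram=1)
print(pairs)
# [('gpu', 'trains'), ('trains', 'gpu'), ('trains', 'model'), ('model', 'trains')]
```

Words at the sentence edges simply produce fewer pairs, so this layout needs no `<padding>` token.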

In [6]:
stop_words = set(stopwords.words('english'))

In [7]:
def preprocess(corpus):
    '''Tokenizes each sentence, lowercases it, and drops English stop words'''
    result = []
    for sentence in corpus:
        out = nltk.word_tokenize(sentence)
        out = [x.lower() for x in out]
        out = [x for x in out if x not in stop_words]
        result.append(" ".join(out))
    return result

#########################
# In the paper, 300 dimensions and a context size of 5 (n_gram) were used
#########################

corpus = preprocess(corpus)
vocabulary = create_vocabulary(corpus)
# Training uses the (Input, Output) pair layout; the earlier prepare_set() call
# was overwritten immediately, so it is omitted here
train_emb = prepare_set_ravel(corpus, n_gram = 2)
train_emb.Input = train_emb.Input.map(vocabulary)
train_emb.Output = train_emb.Output.map(vocabulary)

vocab_size = len(vocabulary)

def get_input_tensor(tensor):
    '''Transform 1D tensor of word indexes to one-hot encoded 2D tensor'''
    size = tensor.shape[0]
    # torch.zeros already yields float32, so no Variable wrapper or .float() cast is needed
    inp = torch.zeros(size, vocab_size).scatter_(1, tensor.unsqueeze(1), 1.)
    return inp

embedding_dims = 300
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')


initrange = 0.5 / embedding_dims
W1 = Variable(torch.randn(vocab_size, embedding_dims, device=device).uniform_(-initrange, initrange).float(), requires_grad=True) # shape V*H
W2 = Variable(torch.randn(embedding_dims, vocab_size, device=device).uniform_(-initrange, initrange).float(), requires_grad=True) #shape H*V
print(f'W1 shape is: {W1.shape}, W2 shape is: {W2.shape}')

num_epochs = 2000
learning_rate = 10.0
lr_decay = 0.99
loss_hist = []

# Record the start time
start_time = time.time()

W1 shape is: torch.Size([478, 300]), W2 shape is: torch.Size([300, 478])
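
Multiplying a one-hot matrix by `W1` just selects rows of `W1`, which is why `W1` can be read as the table of word embeddings. The sketch below shows this with plain Python lists so that it runs without PyTorch or a GPU. `one_hot`, `matmul`, and `W1_toy` are hypothetical names, and the numbers are made up.

```python
def one_hot(index: int, size: int):
    """Row vector with a single 1.0 at `index`."""
    return [1.0 if j == index else 0.0 for j in range(size)]

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

# Toy embedding matrix: 4 words in the vocabulary, 3 dimensions each
W1_toy = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]

word_ids = [2, 0]                                     # a batch of two word indexes
batch = [one_hot(i, len(W1_toy)) for i in word_ids]   # shape N*V
hidden = matmul(batch, W1_toy)                        # shape N*H

# Each hidden row is exactly the embedding row picked out by the word index
print(hidden == [W1_toy[i] for i in word_ids])
```

The full one-hot product costs O(N*V*H) work for what is effectively a row lookup. That is acceptable for a 478-word vocabulary but grows quickly with larger ones.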


In [8]:
for epo in range(num_epochs):
    total_correct = 0
    total_samples = 0

    for x,y in zip(DataLoader(train_emb.Input.values, batch_size=train_emb.shape[0]), DataLoader(train_emb.Output.values, batch_size=train_emb.shape[0])):
        # one-hot encode input tensor
        input_tensor = get_input_tensor(x).to(device) # shape N*V

        # simple NN architecture (W1 and W2 were created on `device` already)
        h = input_tensor.mm(W1) # shape N*H
        y_pred = h.mm(W2) # shape N*V

        # move target tensor to the same device
        y = y.to(device)

        # define loss func
        loss_f = torch.nn.CrossEntropyLoss() # see details: https://pytorch.org/docs/stable/nn.html

        # compute loss
        loss = loss_f(y_pred, y)

        # backpropagation step
        loss.backward()

        # Update weights using gradient descent. For this step we just want to mutate
        # the values of W1 and W2 in-place; we don't want to build up a computational
        # graph for the update steps, so we use the torch.no_grad() context manager
        # to prevent PyTorch from building a computational graph for the updates
        with torch.no_grad():
            # PyTorch ships an SGD optimizer, but the manual update makes each step explicit
            W1 -= learning_rate*W1.grad
            W2 -= learning_rate*W2.grad
            # zero gradients for next step
            W1.grad.zero_()
            W2.grad.zero_()

        # track accuracy: the predicted word is the one with the highest score
        _, predicted = torch.max(y_pred.detach(), 1)
        total_correct += (predicted == y).sum().item()
        total_samples += y.size(0)

    if epo%10 == 0:
        learning_rate *= lr_decay
    loss_hist.append(loss.item())  # store a Python float, not the tensor and its graph
    if epo%50 == 0:
        accuracy = total_correct / total_samples
        print(f'Epoch {epo}, loss = {loss}, accuracy = {accuracy}')

# Record the end time
end_time = time.time()

# Calculate the elapsed time
elapsed_time = end_time - start_time

# Print the elapsed time
print(f"Elapsed Time: {elapsed_time} seconds")

Epoch 0, loss = 6.169609069824219, accuracy = 0.003692132916785004
Epoch 50, loss = 5.610062599182129, accuracy = 0.20718545867651236
Epoch 100, loss = 6.899322986602783, accuracy = 0.20746946890088044
Epoch 150, loss = 5.239627361297607, accuracy = 0.22010792388525988
Epoch 200, loss = 4.646688461303711, accuracy = 0.20803748934961658
Epoch 250, loss = 5.488072872161865, accuracy = 0.15705765407554673
Epoch 300, loss = 4.520118236541748, accuracy = 0.20704345356432832
Epoch 350, loss = 4.153924942016602, accuracy = 0.26924169270093723
Epoch 400, loss = 4.154606342315674, accuracy = 0.2837262141437092
Epoch 450, loss = 4.249271869659424, accuracy = 0.2327463788696393
Epoch 500, loss = 4.416109561920166, accuracy = 0.21144561204203352
Epoch 550, loss = 4.308865070343018, accuracy = 0.21329167850042602
Epoch 600, loss = 4.256441116333008, accuracy = 0.21201363249076965
Epoch 650, loss = 4.030119895935059, accuracy = 0.27577392786140303
Epoch 700, loss = 3.984341621398926, accuracy = 0.26
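
The epoch-0 loss in the log above can be sanity-checked by hand. The weights start in a very small range, so the logits are nearly equal. Softmax is then close to uniform over the 478 vocabulary words, and the cross-entropy is close to ln(478). The sketch below is in plain Python and assumes natural-log cross-entropy, as used by `torch.nn.CrossEntropyLoss`. `cross_entropy` is a hypothetical helper, not the PyTorch implementation.

```python
import math

def cross_entropy(logits, target: int) -> float:
    """Softmax cross-entropy for one sample, using the log-sum-exp trick."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

vocab_size = 478                      # from the W1 shape printed earlier
uniform_logits = [0.0] * vocab_size   # every word scored equally
baseline = cross_entropy(uniform_logits, target=0)
print(f"{baseline:.4f}")
# 6.1696
```

This agrees with the logged epoch-0 loss of about 6.1696, so the untrained model starts at chance level. Any later loss above that value (for example at epoch 100) means the model is temporarily doing worse than a uniform guess.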