# Natural Language Modeling -- 2-Layer LSTM on the PTB Dataset

## We will create a 2-layer LSTM and train it on the PTB dataset to predict which word will come next.

# PART A: TRAINING

# ----------------------------------------------------------------------------------

There are 62,444,656 parameters in my network (emb_size = 5237, hid_size = 292). I set the initial learning rate to 3.1 and keep it fixed for epochs 0 to 3 (epochs are numbered from 0, as in the training log). From epoch 4 onward, I divide the learning rate by 1.1 at the beginning of each epoch and divide it again at the end of the epoch: by 2.4 for epochs 4 to 10, by 4.5 for epochs 11 to 15, and by 8 after epoch 15. I keep the two divisions separate because the perplexity got worse when I tried to combine them.

I first experimented with initial learning rates between 1 and 4 and found that 3.1 works best with everything else held constant. I also tried several learning rate schedules, and the one described above worked best. Then I varied the embedding size and the hidden size. When the two sizes differ only slightly, the perplexity gets worse; when they are equal, it improves; and it is best when the two are far apart, with emb_size > hid_size and a small hidden size of around 100-300. I think the embedding size has the largest effect on the perplexity.
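
The parameter count can be checked by hand. The sketch below assumes the standard PyTorch layout (an embedding table, an `nn.LSTM` with four gates and two bias vectors per layer, and a linear output layer with bias); the function name `lstm_language_model_param_count` is made up for this illustration. For the configuration trained in this notebook (vocab_size = 10000, emb_size = 5237, hid_size = 292) it reproduces the count printed by `utils.display_num_param(net)`.

```python
def lstm_language_model_param_count(vocab_size, emb_size, hid_size, num_layers=2):
    """Count parameters of Embedding -> stacked LSTM -> Linear (hypothetical helper)."""
    # embedding table: one vector of size emb_size per word
    embedding = vocab_size * emb_size

    lstm = 0
    for layer in range(num_layers):
        # the first layer reads embeddings, the following layers read hidden states
        in_size = emb_size if layer == 0 else hid_size
        # 4 gates, each with input weights and recurrent weights,
        # plus two bias vectors of size 4 * hid_size
        lstm += 4 * hid_size * (in_size + hid_size) + 8 * hid_size

    # output layer: weight matrix plus bias
    linear = hid_size * vocab_size + vocab_size

    return embedding + lstm + linear


print(lstm_language_model_param_count(10000, 5237, 292))  # 62444656
```

Most of the parameters (52,370,000) sit in the embedding table, which is consistent with the embedding size having a large effect on the model.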

# ----------------------------------------------------------------------------------

### Write your code below

In [2]:
import torch
import torch.nn.functional as F
import torch.nn as nn
import math
import time
import utils

In [3]:
device= torch.device("cuda")
print(device)

cuda


In [4]:
train_data  =  torch.load('../../data/ptb/train_data.pt')
test_data   =  torch.load('../../data/ptb/test_data.pt')

print(  train_data.size()  )
print(  test_data.size()   )

torch.Size([46479, 20])
torch.Size([4121, 20])


In [5]:
bs = 20
vocab_size = 10000

In [6]:
class rec_neural_net(nn.Module):
    
    def __init__(self, embedding_size, hidden_size):
        super().__init__()
        
        self.layer1 = nn.Embedding( vocab_size  , embedding_size  )
        self.layer2 = nn.LSTM(      embedding_size , hidden_size, num_layers=2, dropout=0.3 )
        self.layer3 = nn.Linear(    hidden_size , vocab_size )
        
        
    def forward(self, word_seq, h_init):
        
        # h_init is the tuple (h, c) holding the initial hidden and cell states of the LSTM
        input_seq_emb = self.layer1( word_seq )
        output_seq , (h_final, c_last) = self.layer2( input_seq_emb, h_init )
        scores_seq = self.layer3( output_seq )   
        
        return scores_seq, (h_final, c_last)

In [7]:
emb_size = 5237
hid_size = 292

net=rec_neural_net(emb_size,hid_size)
net = net.to(device)

print(net)
utils.display_num_param(net)

rec_neural_net(
  (layer1): Embedding(10000, 5237)
  (layer2): LSTM(5237, 292, num_layers=2, dropout=0.3)
  (layer3): Linear(in_features=292, out_features=10000, bias=True)
)
There are 62444656 (62.44 million) parameters in this neural network


In [8]:
net.layer1.weight.data.uniform_(-0.1, 0.1)
net.layer3.weight.data.uniform_(-0.1, 0.1)

tensor([[ 0.0069, -0.0692, -0.0362,  ...,  0.0284,  0.0921,  0.0357],
        [-0.0657,  0.0723, -0.0327,  ...,  0.0865,  0.0085,  0.0020],
        [-0.0904,  0.0845, -0.0496,  ...,  0.0302, -0.0925,  0.0876],
        ...,
        [-0.0557,  0.0961,  0.0363,  ...,  0.0889, -0.0958, -0.0670],
        [ 0.0714,  0.0149, -0.0260,  ..., -0.0454,  0.0363, -0.0526],
        [ 0.0509, -0.0077, -0.0801,  ...,  0.0064,  0.0028, -0.0469]],
       device='cuda:0')

In [9]:
criterion = nn.CrossEntropyLoss()
my_lr = 3.1
seq_length = 35

In [10]:
def eval_on_test_set():

    running_loss=0
    num_batches=0    
    
    # net.eval() is not called here, so the LSTM dropout stays active during evaluation
    with torch.no_grad():
       
        # initial hidden and cell states: one zero tensor per LSTM layer
        h = torch.zeros(2, bs, hid_size)
        c = torch.zeros(2, bs, hid_size)

        h=h.to(device)
        c=c.to(device)


        for count in range( 0 , 4120-seq_length ,  seq_length) :

            # the labels are the inputs shifted by one position
            minibatch_data =  test_data[ count   : count+seq_length   ]
            minibatch_label = test_data[ count+1 : count+seq_length+1 ]

            minibatch_data=minibatch_data.to(device)
            minibatch_label=minibatch_label.to(device)

            # pass both states as a tuple, as the forward method expects
            scores, (h1,c1) = net( minibatch_data, (h,c) )

            minibatch_label =   minibatch_label.view(  bs*seq_length ) 
            scores          =            scores.view(  bs*seq_length , vocab_size)

            loss = criterion(  scores ,  minibatch_label )    

            h=h.detach()

            running_loss += loss.item()
            num_batches += 1        
    
    total_loss = running_loss/num_batches 
    print('test: exp(loss) = ', math.exp(total_loss)  )
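
The value printed as `exp(loss)` is the perplexity: `nn.CrossEntropyLoss` averages the negative log-likelihood (in nats) over the words of a minibatch, and since all minibatches have the same size, exponentiating the mean of the minibatch losses gives the perplexity over the whole set. The sketch below illustrates this with plain Python; the helper name `perplexity` and the example losses are made up for illustration.

```python
import math


def perplexity(batch_losses):
    """exp of the mean per-word cross-entropy (hypothetical helper)."""
    mean_loss = sum(batch_losses) / len(batch_losses)
    return math.exp(mean_loss)


# A model that spreads its probability uniformly over a 10,000-word vocabulary
# has a cross-entropy of ln(10000) per word, so its perplexity is close to 10000.
uniform_loss = math.log(10000)
print(perplexity([uniform_loss, uniform_loss, uniform_loss]))
```

A perplexity of about 117 on the test set therefore means the network is, on average, as uncertain as if it were choosing uniformly among roughly 117 words.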

In [11]:
start=time.time()

for epoch in range(25):
    
    # keep the learning rate at 3.1 during the first 4 epochs, then divide by 1.1 at the start of every epoch
    if epoch >= 4:
        my_lr = my_lr / 1.1
    
    # create a new optimizer and give the current learning rate.   
    optimizer=torch.optim.SGD( net.parameters() , lr=my_lr )
        
    # set the running quantities to zero at the beginning of the epoch
    running_loss=0
    num_batches=0    
       
    # set the initial h to be the zero vector
    h = torch.zeros(2, bs, hid_size)
    c = torch.zeros(2, bs, hid_size)

    # send it to the gpu    
    h=h.to(device)
    c=c.to(device)
    
    for count in range( 0 , 46445-seq_length ,  seq_length):
             
        # Set the gradients to zeros
        optimizer.zero_grad()
        
        # create a minibatch
        minibatch_data =  train_data[ count   : count+seq_length   ]
        minibatch_label = train_data[ count+1 : count+seq_length+1 ] 
        
        # send them to the gpu
        minibatch_data=minibatch_data.to(device)
        minibatch_label=minibatch_label.to(device)
        
        # Detach h and c from any previous graph, then tell PyTorch to track the operations done on them.
        # The final states are returned as (h1, c1) and are not copied back into (h, c),
        # so every minibatch starts from the zero state.
        h=h.detach()
        c=c.detach()
        h=h.requires_grad_()
        c=c.requires_grad_()
        
        # forward the minibatch through the net        
        scores, (h1,c1)  = net( minibatch_data, (h,c) )
    
        # reshape the scores and labels to huge batch of size bs*seq_length
        scores          =            scores.view(  bs*seq_length , vocab_size)  
        minibatch_label =   minibatch_label.view(  bs*seq_length )       
        
        # Compute the average of the losses of the data points in this huge batch
        loss = criterion(  scores ,  minibatch_label )
        
        # backward pass to compute the gradient of the loss with respect to every parameter
        loss.backward()

        # normalize the gradient, then do one step of stochastic gradient descent
        utils.normalize_gradient(net)
        optimizer.step()
            
        # update the running loss  
        running_loss += loss.item()
        num_batches += 1
        
    # compute stats for the full training set
    total_loss = running_loss/num_batches
    elapsed = time.time()-start
    
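    # second division of the learning rate, at the end of the epoch;
    # the divisor grows as training progresses
    if epoch >= 4 and epoch <= 10:
        my_lr = my_lr / 2.4
    
    if epoch > 10 and epoch <= 15:
        my_lr = my_lr/4.5
        
    if epoch > 15:
        my_lr = my_lr/8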
        
    print('')
    print('epoch=',epoch, '\t time=', elapsed,'\t lr=', my_lr, '\t train: exp(loss)=',  math.exp(total_loss))
    eval_on_test_set() 
    




epoch= 0 	 time= 21.824772119522095 	 lr= 3.1 	 train: exp(loss)= 315.39196579440596
test: exp(loss) =  199.25337341190493

epoch= 1 	 time= 44.37513518333435 	 lr= 3.1 	 train: exp(loss)= 149.41416013346404
test: exp(loss) =  153.65944548640894

epoch= 2 	 time= 67.02786040306091 	 lr= 3.1 	 train: exp(loss)= 111.71319887900349
test: exp(loss) =  137.05368945267847

epoch= 3 	 time= 89.60040307044983 	 lr= 3.1 	 train: exp(loss)= 91.49975215365893
test: exp(loss) =  129.1294538112854

epoch= 4 	 time= 112.17343044281006 	 lr= 1.174242424242424 	 train: exp(loss)= 77.14474211901053
test: exp(loss) =  125.8397793845628

epoch= 5 	 time= 134.66523122787476 	 lr= 0.44478879706152424 	 train: exp(loss)= 61.894419116472264
test: exp(loss) =  120.09267649739712

epoch= 6 	 time= 157.02283310890198 	 lr= 0.16848060494754705 	 train: exp(loss)= 55.152509247831574
test: exp(loss) =  117.5755404311114

epoch= 7 	 time= 179.5460183620453 	 lr= 0.06381841096497995 	 train: exp(loss)= 52.529627776
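
The `lr` column in the log above is printed after the end-of-epoch division, so it is the value the next epoch starts from (before its own division by 1.1). The sketch below replays the schedule from the training loop in plain Python to reproduce those numbers; the function name `lr_after_epoch` is made up for this illustration.

```python
def lr_after_epoch(epoch, initial_lr=3.1):
    """Learning rate printed at the end of `epoch`, epochs numbered from 0 (hypothetical helper)."""
    lr = initial_lr
    for e in range(epoch + 1):
        # division at the beginning of the epoch
        if e >= 4:
            lr = lr / 1.1
        # division at the end of the epoch
        if 4 <= e <= 10:
            lr = lr / 2.4
        elif 10 < e <= 15:
            lr = lr / 4.5
        elif e > 15:
            lr = lr / 8
    return lr


for epoch in range(8):
    print(epoch, lr_after_epoch(epoch))
```

For epochs 4 to 7 this gives about 1.1742, 0.4448, 0.1685 and 0.0638, matching the log. With a combined divisor of 1.1 × 2.4 = 2.64 per epoch, the learning rate shrinks very quickly once the decay starts.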

# ----------------------------------------------------------------------------------

# ----------------------------------------------------------------------------------

# PART B: INFERENCE

In each of the cells below, we will: 
1. Take a sentence from the ptb test set
2. Convert this sentence into a LongTensor using the text2tensor function from utils.py
3. Feed the sentence to the network
4. Print the 30 most likely words that come after the last word in the sentence according to the network, using the show_most_likely_words function from utils.py
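
Step 4 amounts to a softmax over the scores at the last position, followed by a sort. Since `show_most_likely_words` lives in utils.py and is not shown here, the following is only a plain-Python sketch of the idea, not the actual implementation; the function name `top_k_words`, the toy vocabulary and the toy scores are all made up for illustration.

```python
import math


def top_k_words(scores, vocab, k=3):
    """Return the k most likely (probability, word) pairs (hypothetical helper)."""
    # softmax, with the maximum subtracted for numerical stability
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    # sort by probability, highest first, and keep the first k entries
    ranked = sorted(zip(probs, vocab), reverse=True)
    return ranked[:k]


toy_vocab = ["quarter", "half", "week", "month"]
toy_scores = [5.0, 1.0, 0.5, 0.2]  # scores of the last position, one per word

for prob, word in top_k_words(toy_scores, toy_vocab):
    print(f"{100 * prob:.1f}%\t {word}")
```

In the cells below, the same two operations are applied to the 10,000 scores the network produces for the last word of each sentence.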

In [12]:
sentence1 = "prices averaging roughly $ N a barrel higher in the third" # taken from ptb dataset
h = torch.zeros(2, 1, hid_size)
c = torch.zeros(2, 1, hid_size)
h = h.to(device)
c = c.to(device)

x = utils.text2tensor(sentence1)
x = x.view(-1,1)
x = x.to(device)

scores, (h1,c1)= net(x, (h,c))
p = F.softmax(scores[scores.size()[0]-1], dim=1)
print(sentence1, '... \n')
utils.show_most_likely_words(p)

prices averaging roughly $ N a barrel higher in the third ... 

94.2%	 quarter
0.8%	 half
0.6%	 week
0.5%	 or
0.4%	 consecutive
0.4%	 month
0.3%	 period
0.2%	 day
0.2%	 of
0.2%	 &
0.2%	 and
0.1%	 session
0.1%	 hour
0.1%	 range
0.1%	 term
0.1%	 year
0.1%	 area
0.0%	 quarters
0.0%	 <eos>
0.0%	 part
0.0%	 level
0.0%	 market
0.0%	 season
0.0%	 game
0.0%	 section
0.0%	 deficit
0.0%	 <unk>
0.0%	 standard
0.0%	 sector
0.0%	 third


In [13]:
sentence2 = "i think my line has been very consistent mrs. hills said at a news"
h = torch.zeros(2, 1, hid_size)
c = torch.zeros(2, 1, hid_size)
h = h.to(device)
c = c.to(device)

x = utils.text2tensor(sentence2)
x = x.view(-1,1)
x = x.to(device)

scores, (h1,c1)= net(x, (h,c))
p = F.softmax(scores[scores.size()[0]-1], dim=1)
print(sentence2, '... \n')
utils.show_most_likely_words(p)

i think my line has been very consistent mrs. hills said at a news ... 

67.8%	 conference
5.6%	 that
1.8%	 meeting
1.5%	 agency
1.0%	 and
0.9%	 show
0.6%	 <unk>
0.6%	 moment
0.6%	 firm
0.5%	 here
0.5%	 price
0.3%	 where
0.3%	 note
0.3%	 in
0.3%	 news
0.3%	 point
0.3%	 cost
0.3%	 policy
0.2%	 network
0.2%	 <eos>
0.2%	 group
0.2%	 with
0.2%	 rate
0.2%	 hearing
0.2%	 '
0.2%	 official
0.2%	 panel
0.2%	 of
0.2%	 research
0.2%	 said


In [14]:
sentence3 = "this appears particularly true at gm which had strong sales in the"
h = torch.zeros(2, 1, hid_size)
c = torch.zeros(2, 1, hid_size)
h = h.to(device)
c = c.to(device)

x = utils.text2tensor(sentence3)
x = x.view(-1,1)
x = x.to(device)

scores, (h1,c1)= net(x, (h,c))
p = F.softmax(scores[scores.size()[0]-1], dim=1)
print(sentence3, '... \n')
utils.show_most_likely_words(p)  

this appears particularly true at gm which had strong sales in the ... 

12.3%	 u.s.
5.9%	 quarter
4.8%	 company
3.9%	 <unk>
2.9%	 past
2.6%	 first
1.9%	 third
1.6%	 region
1.5%	 bay
1.4%	 industry
1.3%	 automotive
1.3%	 latest
1.1%	 field
1.0%	 country
0.9%	 markets
0.9%	 next
0.8%	 market
0.8%	 wake
0.7%	 new
0.7%	 fourth
0.7%	 oil
0.6%	 area
0.6%	 1990s
0.6%	 san
0.6%	 world
0.6%	 european
0.6%	 business
0.5%	 midwest
0.5%	 period
0.5%	 year


In [15]:
sentence4 = "some analysts expect oil prices to remain relatively"

x = utils.text2tensor(sentence4)
x = x.view(-1,1)
x = x.to(device)

scores, (h1,c1)= net(x, (h,c))
p = F.softmax(scores[scores.size()[0]-1], dim=1)
print(sentence4, '... \n')
utils.show_most_likely_words(p)  

some analysts expect oil prices to remain relatively ... 

19.0%	 low
10.3%	 higher
10.1%	 strong
10.0%	 high
2.0%	 flat
1.2%	 more
1.2%	 slow
1.2%	 <unk>
1.1%	 small
1.0%	 greater
1.0%	 bullish
1.0%	 weak
1.0%	 lower
0.9%	 modest
0.9%	 thin
0.8%	 positive
0.8%	 weaker
0.7%	 tight
0.7%	 profitable
0.7%	 volatile
0.6%	 minor
0.6%	 favorable
0.5%	 better
0.5%	 growth
0.5%	 difficult
0.4%	 good
0.4%	 significant
0.4%	 well
0.4%	 relatively
0.3%	 hard


### Make 3 sentences of your own (they should relate to the economy) and see what the network predicts.

In [16]:
sentence5 = "the economy is far from full employment , job creation is"

x = utils.text2tensor(sentence5)
x = x.view(-1,1)
x = x.to(device)

scores, (h1,c1)= net(x, (h,c))
p = F.softmax(scores[scores.size()[0]-1], dim=1)
print(sentence5, '... \n')
utils.show_most_likely_words(p)   

the economy is far from full employment , job creation is ... 

10.1%	 n't
6.2%	 <unk>
5.7%	 a
3.3%	 to
3.2%	 the
2.4%	 not
2.2%	 being
2.0%	 that
1.7%	 likely
1.7%	 expected
1.0%	 slowing
1.0%	 in
1.0%	 very
0.9%	 due
0.9%	 an
0.8%	 so
0.8%	 more
0.7%	 growing
0.7%	 one
0.6%	 still
0.6%	 getting
0.6%	 too
0.5%	 relatively
0.5%	 made
0.5%	 less
0.5%	 often
0.5%	 high
0.5%	 part
0.4%	 rising
0.4%	 far


In [17]:
sentence6 = "increasing share of government spending has been for transfer payments , rather than for purchases of goods and"

x = utils.text2tensor(sentence6)
x = x.view(-1,1)
x = x.to(device)

scores, (h1,c1)= net(x, (h,c))
p = F.softmax(scores[scores.size()[0]-1], dim=1)
print(sentence6, '... \n')
utils.show_most_likely_words(p)

increasing share of government spending has been for transfer payments , rather than for purchases of goods and ... 

18.9%	 services
3.6%	 other
3.3%	 equipment
3.0%	 the
2.3%	 costs
1.4%	 that
1.3%	 to
1.1%	 <unk>
1.1%	 service
1.1%	 expenses
1.0%	 a
1.0%	 income
1.0%	 at
0.9%	 property
0.9%	 for
0.8%	 companies
0.8%	 investment
0.8%	 paper
0.8%	 industry
0.8%	 management
0.7%	 sales
0.7%	 it
0.7%	 loans
0.7%	 money
0.7%	 business
0.7%	 exports
0.6%	 construction
0.6%	 investments
0.6%	 stock
0.5%	 was


In [18]:
sentence7 = "The rich earn higher incomes because they contribute more to society than others do. However, because of diminishing marginal utility, they don't get much value from their last few dollars of"

x = utils.text2tensor(sentence7)
x = x.view(-1,1)
x = x.to(device)

scores, (h1,c1)= net(x, (h,c))
p = F.softmax(scores[scores.size()[0]-1], dim=1)
print(sentence7, '... \n')
utils.show_most_likely_words(p)

The rich earn higher incomes because they contribute more to society than others do. However, because of diminishing marginal utility, they don't get much value from their last few dollars of ... 

11.8%	 <unk>
3.8%	 their
3.4%	 the
1.4%	 N
1.3%	 health
1.2%	 work
1.0%	 people
1.0%	 research
0.9%	 a
0.7%	 damage
0.7%	 revenue
0.6%	 equipment
0.6%	 insurance
0.5%	 u.s.
0.5%	 workers
0.5%	 money
0.5%	 tax
0.5%	 each
0.5%	 land
0.4%	 space
0.4%	 mortgage
0.4%	 medical
0.4%	 home
0.4%	 time
0.4%	 business
0.3%	 american
0.3%	 dollars
0.3%	 advertising
0.3%	 real
0.3%	 sales
