# Next Word Prediction with LSTM

## 📋 Overview
This notebook implements a text generation pipeline using an **LSTM Network**. By training on a custom FAQ dataset, the model learns to predict the most probable next word given a sequence of previous words, effectively acting as an automated "auto-complete" for course-related queries.

---

## 🛠️ Workflow Steps

### 1. Text Preprocessing & Tokenization
* **NLTK Integration**: Uses `word_tokenize` to break the raw document into individual tokens.
* **Vocabulary Creation**: Builds a mapping of unique tokens to numerical indices, including an `<unk>` token for out-of-vocabulary terms.
* **N-Gram Sequence Generation**: Converts sentences into cumulative sequences (e.g., "The course fee" becomes `[The, course]` and `[The, course, fee]`) to teach the model how sentences grow.
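
The vocabulary and cumulative-sequence steps above can be sketched in a few lines. This is a minimal illustration rather than the notebook's exact code: it assumes a plain whitespace split in place of `word_tokenize`, and `build_vocab` and `cumulative_sequences` are hypothetical helper names.

```python
def build_vocab(tokens):
    """Map each unique token to an index; index 0 is reserved for <unk>."""
    vocab = {"<unk>": 0}
    for token in tokens:
        if token not in vocab:
            vocab[token] = len(vocab)
    return vocab


def cumulative_sequences(indices):
    """Return every prefix of length >= 2, so each has an input part and a target."""
    return [indices[: i + 1] for i in range(1, len(indices))]


# Assumption: whitespace splitting stands in for nltk's word_tokenize.
tokens = "the course fee".split()
vocab = build_vocab(tokens)
indices = [vocab.get(t, vocab["<unk>"]) for t in tokens]
sequences = cumulative_sequences(indices)

print(vocab)      # {'<unk>': 0, 'the': 1, 'course': 2, 'fee': 3}
print(sequences)  # [[1, 2], [1, 2, 3]]

# A word that was never seen falls back to the <unk> index.
unseen = vocab.get("refund", vocab["<unk>"])
print(unseen)     # 0
```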

### 2. Sequence Handling
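* **Padding**: Uses "Pre-Padding" (leading zeros) so that all sequences have the same length ($20$ tokens in this run, the longest cumulative sequence; $19$ remain as input once the target is split off). Equal lengths let the sequences be stacked into one tensor for batch processing in PyTorch. Note that the prediction cell pads its input to $61$ tokens instead.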
* **X/Y Split**: For every sequence, the last word is treated as the **target (y)** and all preceding words as the **input (X)**.
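
As a small sketch of the padding and split described above, using plain Python lists instead of tensors (`pre_pad` is a hypothetical helper, and the three sequences are made up for illustration):

```python
def pre_pad(sequence, target_len, pad_value=0):
    """Left-pad a sequence with pad_value up to target_len."""
    return [pad_value] * (target_len - len(sequence)) + sequence


sequences = [[1, 2], [1, 2, 3], [1, 2, 3, 4]]
max_len = max(len(s) for s in sequences)
padded = [pre_pad(s, max_len) for s in sequences]

# The last column is the target; everything before it is the input.
X = [row[:-1] for row in padded]
y = [row[-1] for row in padded]

print(padded)  # [[0, 0, 1, 2], [0, 1, 2, 3], [1, 2, 3, 4]]
print(X)       # [[0, 0, 1], [0, 1, 2], [1, 2, 3]]
print(y)       # [2, 3, 4]
```

One design detail worth knowing: in this notebook index 0 is also the `<unk>` index, so padding and unknown words share the same ID.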

[Image of n-gram sequence generation for next word prediction]

### 3. LSTM Model Architecture
LSTMs are chosen for their ability to maintain "long-term memory" via a cell state, which helps in understanding context in long sentences.
* **Embedding Layer**: Learns 100-dimensional dense vectors for each word.
* **LSTM Layer**: Processes the sequence and maintains a hidden state of 150 dimensions.
* **Linear Output**: A fully connected layer that maps the LSTM output to the total vocabulary size for word selection.
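
To get a feel for the size of this architecture, the parameter count can be worked out by hand. The sketch below assumes the vocabulary size of 248 reported later in the notebook and PyTorch's default single-layer, unidirectional `nn.LSTM` with biases, which keeps two bias vectors per gate.

```python
vocab_size, embed_dim, hidden_dim = 248, 100, 150

# One 100-dim vector per vocabulary entry.
embedding_params = vocab_size * embed_dim

# Four gates (input, forget, cell, output), each with an input-to-hidden
# matrix, a hidden-to-hidden matrix, and two bias vectors.
lstm_params = 4 * (embed_dim * hidden_dim + hidden_dim * hidden_dim + 2 * hidden_dim)

# Weight matrix plus bias for the output layer.
linear_params = hidden_dim * vocab_size + vocab_size

total_params = embedding_params + lstm_params + linear_params
print(embedding_params, lstm_params, linear_params, total_params)
# 24800 151200 37448 213448
```

Most of the roughly 213k parameters sit in the LSTM layer, which is large relative to the 959 training samples in this dataset.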

[Image of LSTM cell architecture showing forget, input, and output gates]

### 4. Training the Generator
* **Optimization**: Uses the **Adam Optimizer** and **Cross-Entropy Loss**.
* **GPU Acceleration**: Moves the model and tensors to CUDA if available for faster computation over 50 epochs.
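
For intuition about what the loss measures, here is a minimal pure-Python sketch of cross-entropy for a single prediction. It is not how `nn.CrossEntropyLoss` is implemented internally; the logits are made-up values and `cross_entropy` is a hypothetical helper.

```python
import math


def cross_entropy(logits, target_index):
    """Negative log of the softmax probability assigned to the target class."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
    return log_sum_exp - logits[target_index]


# Toy example with a 4-word vocabulary (values are made up for illustration).
confident = cross_entropy([0.1, 4.0, 0.2, 0.1], target_index=1)
wrong = cross_entropy([0.1, 4.0, 0.2, 0.1], target_index=2)
print(confident < wrong)  # True: the loss is lower when the target gets the high score

# A model that guesses uniformly over 248 words scores ln(248).
uniform_baseline = cross_entropy([0.0] * 248, target_index=0)
print(round(uniform_baseline, 2))  # 5.51
```

The uniform baseline gives a rough reference point: a per-sample loss well below 5.51 means the model is doing better than random guessing over the vocabulary.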

### 5. Iterative Inference (Text Generation)
The prediction function doesn't just predict once; it can be used in a loop:
1. Provide a seed phrase (e.g., "What is the fee").
2. The model predicts the next word ("for").
3. The new word is appended to the input, and the process repeats.
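
The loop above can be sketched independently of the trained model. In this illustration `generate` and `toy_predict_next` are hypothetical names, and a small lookup table on the last word stands in for the LSTM.

```python
def generate(predict_next, seed_text, num_tokens):
    """Repeatedly append the predicted word to the running text."""
    text = seed_text
    for _ in range(num_tokens):
        text = text + " " + predict_next(text)
    return text


# Hypothetical stand-in for the trained model: a lookup on the last word.
toy_table = {"fee": "for", "for": "the", "the": "course"}


def toy_predict_next(text):
    last_word = text.lower().split()[-1]
    return toy_table.get(last_word, "<unk>")


result = generate(toy_predict_next, "What is the fee", num_tokens=3)
print(result)  # What is the fee for the course
```

Because each step picks a single most likely word (greedy decoding), the same seed always produces the same continuation.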

---

## 📚 Key Tech Stack
| Category | Tools |
| :--- | :--- |
| **NLP** | `nltk` (`word_tokenize`), `collections.Counter` |
| **Deep Learning** | PyTorch (`nn.LSTM`, `nn.Embedding`) |
| **Data Processing** | `DataLoader`, pre-padding |

In [110]:
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
from collections import Counter
from torch.utils.data import Dataset, DataLoader
from nltk.tokenize import word_tokenize
import nltk

In [111]:
document = """Q: What is the course fee for the Data Science Mentorship Program?
A: The course fee is Rs 799 per month.

Q: What is the total duration of the course?
A: The total duration of the course is 7 months.

Q: What is the total course fee for the full program?
A: The total fee is approximately Rs 5600.

Q: What modules are covered in the program?
A: Python, Data Analysis, SQL, Machine Learning, MLOPs, and case studies are covered.

Q: Is Deep Learning part of the curriculum?
A: No, Deep Learning is not included.

Q: Is NLP included in the program?
A: No, NLP is not part of the curriculum.

Q: Will recordings be available if I miss a session?
A: Yes, all sessions are recorded.

Q: Where can I find the class schedule?
A: The class schedule is available in the Google Sheet provided.

Q: How long does each live session last?
A: Each live session lasts around 2 hours.

Q: What language is used in the sessions?
A: The instructor speaks Hinglish.

Q: How will I be notified about upcoming classes?
A: You will receive an email before every session.

Q: Can non-tech students join the course?
A: Yes, students from non-tech backgrounds can join.

Q: Can I join the program in the middle?
A: Yes, you can join anytime.

Q: Will I get access to past lectures if I join late?
A: Yes, all past content will be available after payment.

Q: Do I need to submit tasks?
A: No, you will self-evaluate using provided solutions.

Q: Are case studies included in the program?
A: Yes, case studies are included.

Q: How can I contact the team?
A: You can email nitish.campusx@gmail.com.

Q: Where should I make payments?
A: Payments must be made on the official website.

Q: Can I pay the full fee at once?
A: No, the program follows a monthly subscription model.

Q: What is the validity of the monthly subscription?
A: The subscription is valid for 30 days from payment.

Q: Is there a refund policy?
A: Yes, you get a 7-day refund period.

Q: What if I cannot pay from outside India?
A: You should contact the team via email.

Q: Till when can I watch paid videos?
A: You can watch videos while your subscription is valid.

Q: Will I get lifetime access after completing the course?
A: No, videos are available till Aug 2024 after full payment.

Q: Why is lifetime access not provided?
A: Because of the low course fee.

Q: How can I ask doubts after a session?
A: Fill the doubt clearance Google form.

Q: Can I ask doubts from past weeks?
A: Yes, select past week doubts in the form.

Q: What is the criteria for certificate?
A: Pay full fee and attempt all assessments.

Q: How can I pay earlier month fees if I join late?
A: A payment link will be provided in your dashboard.

Q: Does placement assistance guarantee a job?
A: No, there is no placement guarantee.

Q: What is included in placement assistance?
A: Portfolio building, soft skills, mentorship, and job strategies.

Q: What topics are taught in Python Fundamentals?
A: Basics of Python programming and syntax.

Q: What Python libraries are taught?
A: Libraries for data science such as pandas and numpy.

Q: What is covered in Data Analysis?
A: Data cleaning, visualization, and exploratory analysis.

Q: What is SQL for Data Science?
A: SQL concepts for querying databases.

Q: What is Maths for Machine Learning?
A: Linear algebra, probability, and statistics basics.

Q: What ML algorithms are taught?
A: Regression, classification, and clustering algorithms.

Q: What is Practical Machine Learning?
A: Hands-on ML projects and implementations.

Q: What is MLOPs?
A: Machine learning deployment and operations.

Q: Will industry mentors interact with students?
A: Yes, sessions with industry mentors are included.

Q: Will soft skill sessions be conducted?
A: Yes, soft skill sessions are included.

Q: Will job hunting strategies be discussed?
A: Yes, job hunting strategies will be discussed.

Q: Can I access the dashboard after payment?
A: Yes, the dashboard becomes available after payment.

Q: Can I watch past content immediately after payment?
A: Yes, past sessions are unlocked after payment.

Q: Is the course suitable for beginners?
A: Yes, beginners can join the program.

Q: Is the course subscription monthly or yearly?
A: The course subscription is monthly.

Q: Will there be assessments?
A: Yes, assessments are part of the course.

Q: Is the course online or offline?
A: The course is conducted online.

Q: Are doubt sessions one-on-one?
A: Yes, 1-on-1 doubt clearance sessions are provided.

Q: Is email the only contact method?
A: Yes, email is the official contact method.

Q: Will the program help build a portfolio?
A: Yes, portfolio building sessions are included.

Q: Are interview calls guaranteed?
A: No, interview calls are not guaranteed.

"""

In [112]:
nltk.download('punkt')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [113]:
tokens = word_tokenize(document.lower())

In [114]:
vocab = {'<unk>':0}

for token in Counter(tokens).keys():
  if token not in vocab:
    vocab[token] = len(vocab)

vocab

{'<unk>': 0,
 'q': 1,
 ':': 2,
 'what': 3,
 'is': 4,
 'the': 5,
 'course': 6,
 'fee': 7,
 'for': 8,
 'data': 9,
 'science': 10,
 'mentorship': 11,
 'program': 12,
 '?': 13,
 'a': 14,
 'rs': 15,
 '799': 16,
 'per': 17,
 'month': 18,
 '.': 19,
 'total': 20,
 'duration': 21,
 'of': 22,
 '7': 23,
 'months': 24,
 'full': 25,
 'approximately': 26,
 '5600.': 27,
 'modules': 28,
 'are': 29,
 'covered': 30,
 'in': 31,
 'python': 32,
 ',': 33,
 'analysis': 34,
 'sql': 35,
 'machine': 36,
 'learning': 37,
 'mlops': 38,
 'and': 39,
 'case': 40,
 'studies': 41,
 'deep': 42,
 'part': 43,
 'curriculum': 44,
 'no': 45,
 'not': 46,
 'included': 47,
 'nlp': 48,
 'will': 49,
 'recordings': 50,
 'be': 51,
 'available': 52,
 'if': 53,
 'i': 54,
 'miss': 55,
 'session': 56,
 'yes': 57,
 'all': 58,
 'sessions': 59,
 'recorded': 60,
 'where': 61,
 'can': 62,
 'find': 63,
 'class': 64,
 'schedule': 65,
 'google': 66,
 'sheet': 67,
 'provided': 68,
 'how': 69,
 'long': 70,
 'does': 71,
 'each': 72,
 'live': 73,

In [115]:
len(vocab)

248

In [116]:
input_sentences = document.split('\n')

In [117]:
def text_to_indices(sentence_tokens, vocab):
    return [vocab.get(token, vocab['<unk>']) for token in sentence_tokens]


In [118]:
input_numerical_sentences = []

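# Note: tokenizing line by line can split text differently from the whole-document
# pass used to build the vocab (e.g. '5600.' there vs. '5600' here); such tokens
# are not in the vocab and map to <unk> (index 0).
for sentence in input_sentences:
  input_numerical_sentences.append(text_to_indices(word_tokenize(sentence.lower()), vocab))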


In [119]:
len(input_numerical_sentences)
input_numerical_sentences[:5]

[[1, 2, 3, 4, 5, 6, 7, 8, 5, 9, 10, 11, 12, 13],
 [14, 2, 5, 6, 7, 4, 15, 16, 17, 18, 19],
 [],
 [1, 2, 3, 4, 5, 20, 21, 22, 5, 6, 13],
 [14, 2, 5, 20, 21, 22, 5, 6, 4, 23, 24, 19]]

In [120]:
training_sequence = []
for sentence in input_numerical_sentences:
  # Build cumulative prefixes so each one yields an (input, target) pair.
  # For example, [1, 2, 3, 4] produces [1, 2], [1, 2, 3] and [1, 2, 3, 4];
  # the last element of each prefix is the target (2, 3 and 4) and the
  # elements before it are the input. No <eos> token is added.
  for i in range(1, len(sentence)):
    training_sequence.append(sentence[:i+1])  # slice end is exclusive, so i+1 includes index i

In [121]:
len(training_sequence)
training_sequence[:5]

[[1, 2], [1, 2, 3], [1, 2, 3, 4], [1, 2, 3, 4, 5], [1, 2, 3, 4, 5, 6]]

In [122]:
len_list = []

for sequence in training_sequence:
  len_list.append(len(sequence))

max(len_list)

20

In [123]:
training_sequence[0]
# The sequences differ in length, so they are padded to a common length next.


[1, 2]

In [124]:
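padded_training_sequence = []
max_len = max(len_list)  # longest sequence (20 here)
for sequence in training_sequence:
  # pre-pad with zeros so every sequence has max_len tokens
  padded_training_sequence.append([0] * (max_len - len(sequence)) + sequence)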

In [125]:
len(padded_training_sequence[10])


20

In [126]:
print(padded_training_sequence[12])

[0, 0, 0, 0, 0, 0, 1, 2, 3, 4, 5, 6, 7, 8, 5, 9, 10, 11, 12, 13]


In [127]:
type(padded_training_sequence)

list

In [128]:
#convert this into a tensor 
padded_training_sequence = torch.tensor(padded_training_sequence,dtype=torch.long) 

In [129]:
# separate X (all tokens but the last) and y (the last token)
X = padded_training_sequence[:, :-1]
y = padded_training_sequence[:,-1]

In [130]:
X

tensor([[  0,   0,   0,  ...,   0,   0,   1],
        [  0,   0,   0,  ...,   0,   1,   2],
        [  0,   0,   0,  ...,   1,   2,   3],
        ...,
        [  0,   0,   0,  ..., 245, 246,  29],
        [  0,   0,   0,  ..., 246,  29,  46],
        [  0,   0,   0,  ...,  29,  46, 247]])

In [131]:
y

tensor([  2,   3,   4,   5,   6,   7,   8,   5,   9,  10,  11,  12,  13,   2,
          5,   6,   7,   4,  15,  16,  17,  18,  19,   2,   3,   4,   5,  20,
         21,  22,   5,   6,  13,   2,   5,  20,  21,  22,   5,   6,   4,  23,
         24,  19,   2,   3,   4,   5,  20,   6,   7,   8,   5,  25,  12,  13,
          2,   5,  20,   7,   4,  26,  15,   0,  19,   2,   3,  28,  29,  30,
         31,   5,  12,  13,   2,  32,  33,   9,  34,  33,  35,  33,  36,  37,
         33,  38,  33,  39,  40,  41,  29,  30,  19,   2,   4,  42,  37,  43,
         22,   5,  44,  13,   2,  45,  33,  42,  37,   4,  46,  47,  19,   2,
          4,  48,  47,  31,   5,  12,  13,   2,  45,  33,  48,   4,  46,  43,
         22,   5,  44,  19,   2,  49,  50,  51,  52,  53,  54,  55,  14,  56,
         13,   2,  57,  33,  58,  59,  29,  60,  19,   2,  61,  62,  54,  63,
          5,  64,  65,  13,   2,   5,  64,  65,   4,  52,  31,   5,  66,  67,
         68,  19,   2,  69,  70,  71,  72,  73,  56,  74,  13,  

In [132]:
class CustomDataset(Dataset):

  def __init__(self, X, y):
    self.X = X
    self.y = y

  def __len__(self):
    return self.X.shape[0]

  def __getitem__(self, idx):
    return self.X[idx], self.y[idx]

In [133]:
dataset = CustomDataset(X,y)

In [134]:
len(dataset)

959

In [135]:
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)

In [136]:
class LSTMModel(nn.Module):

  def __init__(self, vocab_size):
    super().__init__()
    self.embedding = nn.Embedding(vocab_size, 100)   # token index -> 100-dim vector
    self.lstm = nn.LSTM(100, 150, batch_first=True)  # input is (batch, seq_len, 100)
    self.fc = nn.Linear(150, vocab_size)             # hidden state -> one score per word

  def forward(self, x):
    embedded = self.embedding(x)
    intermediate_hidden_states, (final_hidden_state, final_cell_state) = self.lstm(embedded)
    # final_hidden_state is (1, batch, 150); drop the leading layer dimension
    output = self.fc(final_hidden_state.squeeze(0))
    return output
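
To make the tensor shapes in `forward` concrete, the following sketch traces a random batch through the same three layers. It assumes PyTorch is installed; the batch size of 4 is arbitrary and the layers are freshly initialised, not the trained model.

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim = 248, 100, 150
batch_size, seq_len = 4, 19  # illustrative batch; 19 = padded length minus the target

embedding = nn.Embedding(vocab_size, embed_dim)
lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
fc = nn.Linear(hidden_dim, vocab_size)

x = torch.randint(0, vocab_size, (batch_size, seq_len))
embedded = embedding(x)               # (batch, seq_len, embed_dim)
outputs, (h_n, c_n) = lstm(embedded)  # outputs: (batch, seq_len, hidden); h_n: (1, batch, hidden)
logits = fc(h_n.squeeze(0))           # (batch, vocab_size)

print(embedded.shape, outputs.shape, h_n.shape, logits.shape)

# For a single-layer, unidirectional LSTM the final hidden state equals
# the output at the last time step.
same = torch.allclose(outputs[:, -1, :], h_n.squeeze(0))
print(same)
```

This is why the model can feed `final_hidden_state.squeeze(0)` straight into the linear layer: it already summarises the whole input sequence in one vector per sample.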

In [137]:
model = LSTMModel(len(vocab))

In [138]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cuda')

In [139]:
model.to(device)

LSTMModel(
  (embedding): Embedding(248, 100)
  (lstm): LSTM(100, 150, batch_first=True)
  (fc): Linear(in_features=150, out_features=248, bias=True)
)

In [140]:
epochs = 50
learning_rate = 0.001

criterion = nn.CrossEntropyLoss()

optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

In [141]:

for epoch in range(epochs):
  total_loss = 0

  for batch_x, batch_y in dataloader:

    batch_x, batch_y = batch_x.to(device), batch_y.to(device)

    optimizer.zero_grad()

    output = model(batch_x)

    loss = criterion(output, batch_y)

    loss.backward()

    optimizer.step()

    total_loss = total_loss + loss.item()

  print(f"Epoch: {epoch + 1}, Loss: {total_loss:.4f}")

Epoch: 1, Loss: 149.9856
Epoch: 2, Loss: 122.4637
Epoch: 3, Loss: 109.9638
Epoch: 4, Loss: 99.7200
Epoch: 5, Loss: 90.6483
Epoch: 6, Loss: 82.2432
Epoch: 7, Loss: 74.9682
Epoch: 8, Loss: 68.1077
Epoch: 9, Loss: 61.7338
Epoch: 10, Loss: 56.0683
Epoch: 11, Loss: 51.1080
Epoch: 12, Loss: 46.2950
Epoch: 13, Loss: 42.3509
Epoch: 14, Loss: 38.5963
Epoch: 15, Loss: 35.3854
Epoch: 16, Loss: 32.4524
Epoch: 17, Loss: 30.1617
Epoch: 18, Loss: 28.0440
Epoch: 19, Loss: 26.2975
Epoch: 20, Loss: 24.9816
Epoch: 21, Loss: 23.5191
Epoch: 22, Loss: 22.4846
Epoch: 23, Loss: 21.5051
Epoch: 24, Loss: 20.8324
Epoch: 25, Loss: 20.1089
Epoch: 26, Loss: 19.4302
Epoch: 27, Loss: 18.9418
Epoch: 28, Loss: 18.4199
Epoch: 29, Loss: 18.1243
Epoch: 30, Loss: 17.6575
Epoch: 31, Loss: 17.4511
Epoch: 32, Loss: 17.2155
Epoch: 33, Loss: 16.9507
Epoch: 34, Loss: 16.6268
Epoch: 35, Loss: 16.4645
Epoch: 36, Loss: 16.2904
Epoch: 37, Loss: 16.1996
Epoch: 38, Loss: 15.9363
Epoch: 39, Loss: 15.7993
Epoch: 40, Loss: 15.7430
Epoch:
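
The printed loss is the sum of the per-batch losses in each epoch, so its scale depends on the number of batches. A quick calculation, assuming the default `drop_last=False` so the final partial batch is kept, converts it to an approximate per-batch average:

```python
import math

num_samples, batch_size = 959, 32
num_batches = math.ceil(num_samples / batch_size)

epoch_1_total, epoch_40_total = 149.9856, 15.7430  # values printed above

print(num_batches)                             # 30
print(round(epoch_1_total / num_batches, 2))   # 5.0
print(round(epoch_40_total / num_batches, 2))  # 0.52
```

An average near 5.0 in the first epoch is close to the $\ln(248) \approx 5.51$ expected from uniform guessing over the vocabulary, while 0.52 by epoch 40 suggests the model has largely memorised this small dataset.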

In [142]:
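# prediction

def prediction(model, vocab, text):

  # tokenize
  tokenized_text = word_tokenize(text.lower())

  # text -> numerical indices (unknown words map to <unk>)
  numerical_text = text_to_indices(tokenized_text, vocab)

  # pre-pad with zeros; note that 61 is longer than the 19-token inputs used in training
  padded_text = torch.tensor([0] * (61 - len(numerical_text)) + numerical_text, dtype=torch.long).unsqueeze(0)

  # send to model (no gradients are needed at inference time)
  with torch.no_grad():
    output = model(padded_text.to(device))

  # index of the highest-scoring word
  value, index = torch.max(output, dim=1)

  # merge with text (vocab preserves insertion order, so position == index)
  return text + " " + list(vocab.keys())[index.item()]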

In [143]:
prediction(model, vocab, "The course follows a monthly")

'The course follows a monthly subscription'

In [149]:
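import time

num_tokens = 10
input_text = "What is the fee for course"

# generate num_tokens words, feeding each prediction back in as the new input
for i in range(num_tokens):
  output_text = prediction(model, vocab, input_text)
  # print(output_text)  # uncomment to watch the text grow word by word
  input_text = output_text
  time.sleep(0.5)  # pause between steps; only useful with the print above

print(output_text)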


What is the fee for course is the course fee for the data science mentorship program
