<a href="https://colab.research.google.com/github/snigdha2808/Gen-AI-/blob/main/Build_own_language_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
# Step 1: Set Up
%pip install numpy



In [3]:
# A language model predicts the next word in a sentence.
# We’ll keep things simple and build a bigram model. This just means that our
# model will predict the next word using only the current word.
import numpy as np

# Sample dataset: A small text corpus
corpus = """Artificial Intelligence is the new electricity.
Machine learning is the future of AI.
AI is transforming industries and shaping the future."""

# Preparing the Text
# First, we break the text into individual words and create a vocabulary
# (a list of all unique words). This gives us something to work with.

# Tokenize the corpus into words. split() only splits on whitespace, so
# punctuation stays attached: "ai." and "ai" count as different words.
words = corpus.lower().split()

# Create a vocabulary of unique words
vocab = list(set(words))
vocab_size = len(vocab)

print(f"Vocabulary: {vocab}")
print(f"Vocabulary size: {vocab_size}")

Vocabulary: ['learning', 'and', 'ai', 'artificial', 'ai.', 'transforming', 'is', 'industries', 'electricity.', 'new', 'the', 'future', 'shaping', 'future.', 'intelligence', 'machine', 'of']
Vocabulary size: 17
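
The vocabulary above lists both 'ai' and 'ai.' (and both 'future' and 'future.') because `split()` keeps punctuation attached to words. The sketch below is one possible alternative, not what this notebook uses: it assumes a simple regular expression is good enough for this corpus and sorts the vocabulary so the indices stay the same between runs.

```python
import re

corpus = """Artificial Intelligence is the new electricity.
Machine learning is the future of AI.
AI is transforming industries and shaping the future."""

# Keep only runs of letters, digits, and apostrophes, so "ai." becomes "ai".
# This pattern is an assumption that fits this small corpus, not a general tokenizer.
words = re.findall(r"[a-z0-9']+", corpus.lower())

# Sorting makes the word order (and therefore the indices) reproducible.
vocab = sorted(set(words))

print(f"Vocabulary: {vocab}")
print(f"Vocabulary size: {len(vocab)}")
```

With the punctuation stripped, the vocabulary shrinks from 17 to 15 words. The rest of the notebook keeps the original whitespace tokenization.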


In [6]:
# Map Words to Numbers
# Computers work with numbers, not words. So, we’ll map each word to an index
# and create a reverse mapping too (this will help when we convert them back to
# words later).

word_to_idx = {word: idx for idx, word in enumerate(vocab)}  # dictionary comprehension: word -> index
idx_to_word = {idx: word for word, idx in word_to_idx.items()}

# Convert the words in the corpus to indices
corpus_indices = [word_to_idx[word] for word in words]

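# We’re just turning words into numbers that our model can understand. Each word gets its own number:
# “ai” might become 0 and “learning” might become 1, depending on the order of vocab. Because vocab
# is built from a set, that order (and so the numbering) can change from run to run.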
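
To check that the two mappings really are inverses of each other, we can encode a few words and decode them again. This is a small standalone sketch: the six-word sample and the `sorted` call are assumptions made only so the numbers are predictable.

```python
words = "ai is the future of ai".split()
vocab = sorted(set(words))  # sorted only to make this example deterministic

word_to_idx = {word: idx for idx, word in enumerate(vocab)}
idx_to_word = {idx: word for word, idx in word_to_idx.items()}

# Encode the words as indices, then decode the indices back into words.
encoded = [word_to_idx[w] for w in words]
decoded = [idx_to_word[i] for i in encoded]

print(encoded)           # [0, 2, 4, 1, 3, 0]
print(decoded == words)  # True
```

Both occurrences of "ai" map to the same index, which is exactly what the counting step relies on.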


In [7]:
# Building the Model
# Now, let’s get to the heart of it: building the bigram model.
# We want to figure out the probability of one word following another. To do
# that, we’ll count how often each word pair (bigram) shows up in our dataset.
# Initialize bigram counts matrix
bigram_counts = np.zeros((vocab_size, vocab_size))

# Count occurrences of each bigram in the corpus
for i in range(len(corpus_indices) - 1):
    current_word = corpus_indices[i]
    next_word = corpus_indices[i + 1]
    bigram_counts[current_word, next_word] += 1

# Apply additive smoothing: add a small constant (0.01) to every bigram count
# so that no transition has zero probability. (Laplace smoothing would add 1.)
bigram_counts += 0.01

# Normalize the counts to get probabilities
bigram_probabilities = bigram_counts / bigram_counts.sum(axis=1, keepdims=True)

print("Bigram probabilities matrix: ", bigram_probabilities)

Bigram probabilities matrix:  [[0.00854701 0.00854701 0.00854701 0.00854701 0.00854701 0.00854701
  0.86324786 0.00854701 0.00854701 0.00854701 0.00854701 0.00854701
  0.00854701 0.00854701 0.00854701 0.00854701 0.00854701]
 [0.00854701 0.00854701 0.00854701 0.00854701 0.00854701 0.00854701
  0.00854701 0.00854701 0.00854701 0.00854701 0.00854701 0.00854701
  0.86324786 0.00854701 0.00854701 0.00854701 0.00854701]
 [0.00854701 0.00854701 0.00854701 0.00854701 0.00854701 0.00854701
  0.86324786 0.00854701 0.00854701 0.00854701 0.00854701 0.00854701
  0.00854701 0.00854701 0.00854701 0.00854701 0.00854701]
 [0.00854701 0.00854701 0.00854701 0.00854701 0.00854701 0.00854701
  0.00854701 0.00854701 0.00854701 0.00854701 0.00854701 0.00854701
  0.00854701 0.00854701 0.86324786 0.00854701 0.00854701]
 [0.00854701 0.00854701 0.86324786 0.00854701 0.00854701 0.00854701
  0.00854701 0.00854701 0.00854701 0.00854701 0.00854701 0.00854701
  0.00854701 0.00854701 0.00854701 0.00854701 0.00854701]
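
The two values that keep appearing in the matrix, 0.86324786 and 0.00854701, follow directly from the smoothing constant. The sketch below reproduces them for a word that was seen once with a single follower. The helper `smoothed_row` and its default arguments are made up for this illustration.

```python
import numpy as np

def smoothed_row(alpha, vocab_size=17, seen_idx=6):
    """Probabilities for a word observed once, always followed by seen_idx."""
    row = np.zeros(vocab_size)
    row[seen_idx] = 1       # one observed bigram
    row += alpha            # additive smoothing
    return row / row.sum()  # normalize so the row sums to 1

small = smoothed_row(alpha=0.01)
print(small[6], small[0])      # 1.01 / 1.17 and 0.01 / 1.17

laplace = smoothed_row(alpha=1.0)
print(laplace[6], laplace[0])  # 2 / 18 and 1 / 18
```

With 0.01 the observed follower keeps about 86% of the probability, and the 16 unseen words share the remaining 14%. With add-one (Laplace) smoothing the observed follower would drop to about 11%, which is a lot of smoothing for a corpus this small.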


In [8]:
# Predicting the Next Word

def predict_next_word(current_word, bigram_probabilities):
    word_idx = word_to_idx[current_word]
    next_word_probs = bigram_probabilities[word_idx]
    next_word_idx = np.random.choice(vocab_size, p=next_word_probs)
    return idx_to_word[next_word_idx]

# Test the model with a word
current_word = "ai"
next_word = predict_next_word(current_word, bigram_probabilities)
print(f"Given '{current_word}', the model predicts '{next_word}'.")

# This function takes a word, looks up its row of probabilities, and randomly
# selects the next word according to them. Words must be lowercase, as in the
# vocabulary: if you pass in "ai", the model will most likely predict "is".

Given 'ai', the model predicts 'is'.
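
Because the prediction is sampled, the same input can give different answers on different runs. If we want repeatable results, two options are to always take the most probable word or to sample from a seeded generator. The sketch below rebuilds the model so it can run on its own. The helper names `most_likely_next_word` and `sample_next_word`, the sorted vocabulary, and the seed value are assumptions for this example.

```python
import numpy as np

corpus = """Artificial Intelligence is the new electricity.
Machine learning is the future of AI.
AI is transforming industries and shaping the future."""

words = corpus.lower().split()
vocab = sorted(set(words))  # sorted so indices are stable across runs
word_to_idx = {w: i for i, w in enumerate(vocab)}

# Count bigrams, smooth with 0.01, and normalize each row.
counts = np.zeros((len(vocab), len(vocab)))
for current, nxt in zip(words, words[1:]):
    counts[word_to_idx[current], word_to_idx[nxt]] += 1
counts += 0.01
probs = counts / counts.sum(axis=1, keepdims=True)

def most_likely_next_word(current_word):
    """Return the single most probable follower (greedy choice)."""
    return vocab[int(np.argmax(probs[word_to_idx[current_word]]))]

def sample_next_word(current_word, rng):
    """Sample a follower using a caller-supplied random generator."""
    return vocab[rng.choice(len(vocab), p=probs[word_to_idx[current_word]])]

rng = np.random.default_rng(seed=0)
print(most_likely_next_word("ai"))  # is
print(sample_next_word("ai", rng))
```

The greedy version always returns the same word, so generated text tends to loop. Seeded sampling keeps the variety while still giving the same sequence each time the cell is rerun from the top.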


In [9]:
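# Generating a Sentence
# Starting from start_word, repeatedly predict the next word and feed it back
# in as the current word. The result has length + 1 words.
def generate_sentence(start_word, bigram_probabilities, length=5):
    sentence = [start_word]
    current_word = start_word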

    for _ in range(length):
        next_word = predict_next_word(current_word, bigram_probabilities)
        sentence.append(next_word)
        current_word = next_word

    return ' '.join(sentence)

# Generate a sentence starting with "artificial"
generated_sentence = generate_sentence("artificial", bigram_probabilities, length=10)
print(f"Generated sentence: {generated_sentence}")

Generated sentence: artificial intelligence is the future. learning future. is transforming industries and
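
The generated text above runs straight past "future." and keeps going, since the loop always produces a fixed number of words. One possible variation is to stop as soon as a word ends with a period. The sketch below is standalone and makes a few assumptions: the helper name `generate_until_period`, the `max_words` cap, the sorted vocabulary, and the seed are all chosen for this example.

```python
import numpy as np

corpus = """Artificial Intelligence is the new electricity.
Machine learning is the future of AI.
AI is transforming industries and shaping the future."""

words = corpus.lower().split()
vocab = sorted(set(words))
word_to_idx = {w: i for i, w in enumerate(vocab)}

# Same bigram model as before: count, smooth with 0.01, normalize.
counts = np.zeros((len(vocab), len(vocab)))
for current, nxt in zip(words, words[1:]):
    counts[word_to_idx[current], word_to_idx[nxt]] += 1
counts += 0.01
probs = counts / counts.sum(axis=1, keepdims=True)

def generate_until_period(start_word, rng, max_words=10):
    """Generate words until one ends with '.' or max_words is reached."""
    sentence = [start_word]
    while len(sentence) < max_words and not sentence[-1].endswith("."):
        row = probs[word_to_idx[sentence[-1]]]
        sentence.append(vocab[rng.choice(len(vocab), p=row)])
    return " ".join(sentence)

print(generate_until_period("artificial", np.random.default_rng(seed=0)))
```

This only works because the whitespace tokenization keeps the period attached to the last word of each sentence. With punctuation stripped, we would need an explicit end-of-sentence token instead.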

