# **``N-Grams``**

## **Imports**

In [16]:
import re
import math
import random
from collections import Counter, defaultdict

## **Task 0:** Corpus and Tokenization

In [57]:
def preprocess_text(text):
    text = text.lower()
    # Replace punctuation and any other non-alphanumeric character with a space
    text = re.sub(r'[^a-zA-Z0-9\s]', ' ', text)
    # Collapse runs of whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    tokens = text.split()
    return tokens

corpus = """
Honey bees are social insects that live in colonies. A colony of honey bees includes a queen, drones, and workers. The queen is the only reproductive female in the colony. She can live up to five years and lays up to 2,000 eggs per day. Drones are male bees whose primary function is to mate with virgin queens. Worker bees are sterile females who perform all the labor for the colony, including collecting nectar and pollen, making honey, building the honeycomb, and defending the hive.
Honey bees communicate through a sophisticated "dance language." When a worker bee discovers a good source of nectar or pollen, she returns to the hive and performs a dance that indicates the direction and distance of the food source. The round dance indicates that food is close to the hive, while the waggle dance provides information about more distant food sources.
The process of making honey begins when worker bees collect nectar from flowers. The nectar is stored in the bee's special stomach, where enzymes begin to break down the complex sugars. Back at the hive, the nectar is passed from bee to bee until its moisture content is reduced from about 70% to 20%. This concentrated nectar is then stored in honeycomb cells, where it becomes honey.
Bee pollination is vital to agriculture and ecosystem health. As bees collect nectar and pollen, they transfer pollen from the male parts of flowers to the female parts, enabling plant reproduction. About one-third of the human food supply depends on bee pollination, including fruits, nuts, and vegetables. The economic value of bee pollination services is estimated to be in the billions of dollars annually.
Beekeepers use various types of hives to house their bees. The Langstroth hive, invented in 1851, is the most common type worldwide. It features removable frames that allow beekeepers to inspect the colony and harvest honey without destroying the honeycomb. Other types include the top-bar hive, which is simpler to build and manage, and the Warré hive, which aims to mimic the natural habitat of bees.
Beekeepers perform regular hive inspections to assess the health of their colonies. They check for signs of disease or pest infestation, ensure the queen is present and laying eggs, and monitor honey stores. Common pests and diseases include Varroa mites, American foulbrood, European foulbrood, and nosema. Integrated pest management strategies help beekeepers control these threats with minimal use of chemicals.
Honey has been used for thousands of years for its nutritional and medicinal properties. It contains antioxidants and has antibacterial properties. Different types of honey vary in color, flavor, and aroma, depending on the floral sources visited by the bees. Popular varieties include clover, acacia, buckwheat, and manuka honey.
Beeswax is another valuable product produced by honey bees. Worker bees secrete wax scales from glands on their abdomen, which they then use to build the honeycomb structure. Beeswax has numerous applications, including candle making, cosmetics, food wrapping, and woodworking.
Propolis, sometimes called "bee glue," is a resinous substance collected by bees from tree buds and sap. Bees use propolis to seal gaps in the hive and to strengthen the honeycomb structure. It has antimicrobial properties and is used in traditional medicine and health supplements.
Royal jelly is a secretion produced by worker bees to feed the queen and young larvae. It is rich in proteins, vitamins, and fatty acids. The queen bee's diet of royal jelly allows her to grow larger and live much longer than worker bees. Royal jelly is harvested for use in dietary supplements and cosmetics.
Swarms are a natural part of the honey bee life cycle. When a colony grows too large for its hive, about half the bees leave with the old queen to establish a new colony. Before swarming, the bees create new queen cells to ensure the original colony can continue. Swarm management is an important aspect of beekeeping, as it helps prevent the loss of valuable bees while controlling population growth.
Urban beekeeping has grown in popularity in recent years, with hives appearing on rooftops, balconies, and gardens in cities worldwide. Urban beekeeping helps support bee populations, promotes pollination of urban plants, and connects city dwellers with nature and sustainable food production.
Winter management is crucial for bee colony survival in colder climates. Beekeepers ensure their hives have sufficient honey stores to last through winter, provide insulation against the cold, and protect against moisture buildup, which can be deadly to bees. They also monitor ventilation to prevent condensation inside the hive.
In the spring, beekeepers focus on building up their colonies after winter. They add frames for the bees to build new comb, monitor for signs of swarming, and may split strong colonies to prevent swarming and increase their number of hives. Spring management also includes disease prevention and ensuring the queen is healthy and productive.
Honey extraction typically occurs after a honey flow, when flowers are in bloom and bees are actively collecting nectar. Beekeepers remove frames heavy with capped honey, use a knife or uncapping fork to remove the wax cappings, and place the frames in an extractor that spins out the honey using centrifugal force. The honey is then filtered and bottled.
Beekeeping equipment includes protective gear such as a veil, gloves, and a bee suit; hive tools for manipulating frames; a smoker for calming bees; an uncapping knife for honey extraction; and various feeders for providing supplemental nutrition when necessary. Proper maintenance of this equipment is essential for successful beekeeping.
Bee-friendly gardening practices help support honey bee populations and other pollinators. These include planting a diverse range of flowering plants that bloom throughout the growing season, avoiding the use of pesticides, providing water sources, and creating suitable nesting habitat for native bees.
The genetic diversity of honey bees is threatened by commercial breeding practices that focus on a limited number of subspecies. Efforts to conserve native and locally adapted honey bee populations help maintain genetic diversity, which is crucial for the species' long-term resilience against diseases, pests, and environmental changes.
Conservation strategies for honey bees include creating and preserving bee habitat, reducing pesticide use, supporting pollinator research, educating the public about the importance of bees, and implementing policies that protect bees and other pollinators. These efforts require collaboration among beekeepers, farmers, scientists, policymakers, and the general public.
Climate change poses significant challenges for beekeeping. Changes in temperature and precipitation patterns affect the timing of plant flowering, potentially creating mismatches between bee activity and nectar availability. Extreme weather events and shifting ranges of pests and diseases also impact bee colonies. Beekeepers are adapting their practices to help bees cope with these changes.
"""

# Tokenize the corpus and add start and end markers.
# Note: the markers wrap the whole corpus once, not each sentence.
tokens = preprocess_text(corpus)
tokens = ['<s>'] + tokens + ['</s>']
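
Because the markers wrap the whole corpus once, `<s>` is followed by a single word in the counts, so generation always starts the same way. A minimal sketch of per-sentence markers follows. The helper `tokenize_sentences` and its naive sentence split are assumptions for illustration, not part of the notebook above.

```python
import re

def tokenize_sentences(text):
    """Hypothetical helper: wrap every sentence in <s> ... </s>."""
    tokens = []
    # Naive split after ., ! or ? followed by whitespace (an assumption)
    for sentence in re.split(r'(?<=[.!?])\s+', text.strip()):
        # Same normalization idea as preprocess_text: lowercase, drop punctuation
        words = re.sub(r'[^a-z0-9\s]', ' ', sentence.lower()).split()
        if words:
            tokens.extend(['<s>'] + words + ['</s>'])
    return tokens

sample = "Honey bees are social insects. A colony includes a queen."
sentence_tokens = tokenize_sentences(sample)
print(sentence_tokens)
```

With per-sentence markers, `<s>` would precede every sentence-initial word, so a generator would have several possible first words instead of one.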

## **Task 1:** Build Bi- and Trigram Models

In [58]:
def build_ngram_counts(tokens, n):
    """
    Build counters for n-grams and their contexts.

    For an n-gram (w1, w2, ..., wn), the context is (w1, w2, ..., wn-1).
    The counts are used to compute P(wn | w1, w2, ..., wn-1).
    """
    ngram_counts = Counter()
    context_counts = Counter()
    
    for i in range(len(tokens) - n + 1):
        ngram = tuple(tokens[i:i+n])
        context = tuple(tokens[i:i+n-1])
        ngram_counts[ngram] += 1
        context_counts[context] += 1
        
    return ngram_counts, context_counts

# Build the bigram and trigram models
bigram_counts, bigram_contexts = build_ngram_counts(tokens, 2)
trigram_counts, trigram_contexts = build_ngram_counts(tokens, 3)

# Build the vocabulary; its size V is used for smoothing
vocab = set(tokens)
V = len(vocab)

print(f"Vocabulary size: {V} unique words")
print(f"Number of bigrams: {len(bigram_counts)}")
print(f"Number of trigrams: {len(trigram_counts)}")

Vocabulary size: 494 unique words
Number of bigrams: 1005
Number of trigrams: 1097
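
To make the counting concrete, here is a small worked example on a toy token list. The function body mirrors `build_ngram_counts` above; the toy tokens are made up for illustration.

```python
from collections import Counter

def build_ngram_counts(tokens, n):
    """Count n-grams and their (n-1)-word contexts in a token list."""
    ngram_counts = Counter()
    context_counts = Counter()
    for i in range(len(tokens) - n + 1):
        ngram_counts[tuple(tokens[i:i+n])] += 1
        context_counts[tuple(tokens[i:i+n-1])] += 1
    return ngram_counts, context_counts

toy_tokens = ['<s>', 'bees', 'make', 'honey', 'bees', 'make', 'wax', '</s>']
toy_bigrams, toy_contexts = build_ngram_counts(toy_tokens, 2)

print(toy_bigrams[('bees', 'make')])   # 2
print(toy_contexts[('make',)])         # 2
# MLE estimate of P(honey | make) = C(make, honey) / C(make)
print(toy_bigrams[('make', 'honey')] / toy_contexts[('make',)])  # 0.5
```

Note that a context is counted only when a following word exists, so the final `</s>` never appears as a context.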


## **Task 2:** Interpolation

In [60]:
def mle_prob(ngram, ngram_counts, context_counts):
    """
    Maximum likelihood estimate for an n-gram:
    P(wn|w1,...,wn-1) = C(w1,...,wn) / C(w1,...,wn-1)

    For bigrams, as in the formula on the slide:
    P(wn|wn-1) = C(wn-1,wn) / C(wn-1)
    """
    context = ngram[:-1]
    if context not in context_counts or context_counts[context] == 0:
        return 0
    return ngram_counts[ngram] / context_counts[context]

def laplace_smoothed_prob(ngram, ngram_counts, context_counts):
    """
    Laplace (add-one) smoothing for an n-gram:
    P_laplace(wn|w1,...,wn-1) = [C(w1,...,wn) + 1] / [C(w1,...,wn-1) + V]

    For bigrams, as in the formula on the slide:
    P_laplace(wn|wn-1) = [C(wn-1,wn) + 1] / [C(wn-1) + V]
    """
    context = ngram[:-1]
    return (ngram_counts.get(ngram, 0) + 1) / (context_counts.get(context, 0) + V)

def interpolated_prob(trigram, lambda1=0.1, lambda2=0.3, lambda3=0.6):
    """
    Linear interpolation of trigram, bigram, and unigram estimates:
    P_interp(wn|wn-2,wn-1) = λ1*P(wn|wn-2,wn-1) + λ2*P(wn|wn-1) + λ3*P(wn)

    The trigram and bigram terms are Laplace-smoothed; the unigram term is
    the relative frequency. The weights should sum to 1.
    As shown on the "Conditional interpolation" slide.
    """
    w1, w2, w3 = trigram

    # Trigram component P(w3|w1,w2)
    p1 = laplace_smoothed_prob(trigram, trigram_counts, trigram_contexts)

    # Bigram component P(w3|w2)
    bigram = (w2, w3)
    p2 = laplace_smoothed_prob(bigram, bigram_counts, bigram_contexts)

    # Unigram component P(w3)
    p3 = tokens.count(w3) / len(tokens)

    # Linear interpolation
    return lambda1 * p1 + lambda2 * p2 + lambda3 * p3
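
A quick sanity check on the interpolation: when the weights sum to 1 and each component is a proper distribution, the interpolated probabilities over the vocabulary should also sum to 1. The sketch below checks this on a toy corpus. All names (`toy_tokens`, `count_ngrams`, `laplace`, `interpolated`) are hypothetical stand-ins for the notebook's globals and functions.

```python
import math
from collections import Counter

toy_tokens = ['<s>', 'bees', 'make', 'honey', 'bees', 'make', 'wax', '</s>']
toy_vocab = sorted(set(toy_tokens))
toy_V = len(toy_vocab)

def count_ngrams(tokens, n):
    """Return (n-gram counts, context counts), as in build_ngram_counts."""
    spans = range(len(tokens) - n + 1)
    ngrams = Counter(tuple(tokens[i:i+n]) for i in spans)
    contexts = Counter(tuple(tokens[i:i+n-1]) for i in spans)
    return ngrams, contexts

bi, bi_ctx = count_ngrams(toy_tokens, 2)
tri, tri_ctx = count_ngrams(toy_tokens, 3)
uni = Counter(toy_tokens)

def laplace(ngram, ngrams, contexts):
    """Add-one smoothed conditional probability."""
    return (ngrams.get(ngram, 0) + 1) / (contexts.get(ngram[:-1], 0) + toy_V)

def interpolated(w1, w2, w3, l1=0.1, l2=0.3, l3=0.6):
    """Weighted mix of trigram, bigram, and unigram estimates."""
    p_tri = laplace((w1, w2, w3), tri, tri_ctx)
    p_bi = laplace((w2, w3), bi, bi_ctx)
    p_uni = uni[w3] / len(toy_tokens)
    return l1 * p_tri + l2 * p_bi + l3 * p_uni

# Sum P_interp(w | bees, make) over every word in the vocabulary
total = sum(interpolated('bees', 'make', w) for w in toy_vocab)
print(math.isclose(total, 1.0))
```

If the weights were changed so that they no longer sum to 1, the same check would fail, which makes it a cheap guard when experimenting with different λ values.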

## **Task 3:** Sentence Generation

In [61]:
def generate_sentence(max_words=15, temperature=1.0):
    """
    Generate a sentence from the n-gram model with backoff and temperature control.

    Args:
        max_words: Maximum number of words in the sentence
        temperature: Randomness control (lower = more deterministic, higher = more random)
    """
    # Start from the start-of-sentence marker
    context = ['<s>']
    sentence = []
    
    for _ in range(max_words):
        if len(context) >= 2:
            # Use the trigram model
            candidates = []
            probs = []

            for gram in trigram_counts.keys():
                if gram[0] == context[-2] and gram[1] == context[-1]:
                    candidates.append(gram[2])
                    probs.append(interpolated_prob((context[-2], context[-1], gram[2])))

            if not candidates:
                # Back off to the bigram model
                for gram in bigram_counts.keys():
                    if gram[0] == context[-1]:
                        candidates.append(gram[1])
                        probs.append(laplace_smoothed_prob((context[-1], gram[1]),
                                                          bigram_counts,
                                                          bigram_contexts))
        else:
            # Use the bigram model for the first word
            candidates = []
            probs = []

            for gram in bigram_counts.keys():
                if gram[0] == context[-1]:
                    candidates.append(gram[1])
                    probs.append(laplace_smoothed_prob(gram, bigram_counts, bigram_contexts))

        # Guard against having no candidates at all
        if not candidates:
            # Fall back to a uniform choice over the vocabulary
            candidates = list(vocab - {'<s>', '</s>'})
            probs = [1/len(candidates)] * len(candidates)
        
        # Apply temperature to control randomness
        if temperature != 1.0:
            probs = [p ** (1/temperature) for p in probs]
            # Renormalize the probabilities
            s = sum(probs)
            if s > 0:
                probs = [p/s for p in probs]

        # Sample the next word
        chosen = random.choices(candidates, weights=probs, k=1)[0]

        # Stop if the end marker was drawn (the loop itself enforces max_words)
        if chosen == '</s>':
            break

        sentence.append(chosen)
        context.append(chosen)

        # Keep only the last two words as context
        if len(context) > 2:
            context = context[-2:]
    
    return ' '.join(sentence)
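
The temperature step in `generate_sentence` raises each candidate probability to the power 1/T and renormalizes. A minimal standalone sketch of that rescaling, using a hypothetical helper `apply_temperature` and a made-up three-word distribution:

```python
def apply_temperature(probs, temperature):
    """Hypothetical helper: raise each probability to 1/T, then renormalize."""
    scaled = [p ** (1 / temperature) for p in probs]
    total = sum(scaled)
    return [p / total for p in scaled]

base = [0.5, 0.3, 0.2]
sharp = apply_temperature(base, 0.5)  # exponent 2: favors the most likely word
flat = apply_temperature(base, 2.0)   # exponent 0.5: moves toward uniform

print([round(p, 3) for p in sharp])
print([round(p, 3) for p in flat])
```

When a context has only one candidate continuation, as is common in a small corpus, temperature has no effect, since a single-element distribution renormalizes to 1 at any T.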

## **Perplexity**

In [63]:
def calculate_perplexity(test_tokens, n=3):
    """
    Compute perplexity using the formula:
    perplexity(W) = [1/P(w1,w2,...,wN)]^(1/N) = [∏(1/P(wi|wi-n+1,...,wi-1))]^(1/N)

    As shown on the perplexity slide.
    """
    # Accept either a string or a list of tokens
    if isinstance(test_tokens, str):
        test_tokens = preprocess_text(test_tokens)

    # Add start and end markers
    test_tokens = ['<s>'] * (n-1) + test_tokens + ['</s>']

    # Number of predicted tokens (the start padding is not scored)
    N = len(test_tokens) - (n-1)
    log_prob_sum = 0

    for i in range(n-1, len(test_tokens)):
        if n == 3:
            # The trigram model uses interpolation
            ngram = (test_tokens[i-2], test_tokens[i-1], test_tokens[i])
            prob = interpolated_prob(ngram)
        else:
            # Other values of n use the smoothed probability
            ngram = tuple(test_tokens[i-(n-1):i+1])
            if n == 2:
                prob = laplace_smoothed_prob(ngram, bigram_counts, bigram_contexts)
            else:
                # Fallback: uniform probability over the vocabulary
                prob = 1 / V

        # Guard against log(0)
        if prob <= 0:
            prob = 1e-10

        log_prob_sum += math.log2(prob)

    # Perplexity is 2^(-L), where L is the mean log2 probability
    return 2 ** (-log_prob_sum / N)
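
A useful reference point for reading perplexity values: a model that assigns the same probability 1/V to every token has perplexity exactly V. The sketch below illustrates this with a hypothetical helper `perplexity` that takes per-token probabilities directly; the probability lists are made up.

```python
import math

def perplexity(probs):
    """Perplexity from the per-token probabilities a model assigned."""
    avg_log2 = sum(math.log2(p) for p in probs) / len(probs)
    return 2 ** (-avg_log2)

# Uniform guessing over 8 outcomes gives perplexity 8
uniform = perplexity([1 / 8] * 5)
# Higher per-token probabilities give lower perplexity
confident = perplexity([0.5, 0.5, 0.25, 0.25])

print(uniform)
print(confident)
```

Read this way, a score in the hundreds against a vocabulary of 494 words suggests the model is not far from uniform guessing on that sentence.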

## **Example**

In [64]:
# Generate a few sentences with different temperature settings
print("\nSentence generation with the extended corpus:")
print("1. Standard temperature (1.0):", generate_sentence(temperature=1.0))
print("2. Low temperature (0.5, more formulaic):", generate_sentence(temperature=0.5))
print("3. High temperature (1.5, more random):", generate_sentence(temperature=1.5))
print("4. Longer sentence:", generate_sentence(max_words=25))

# Compute perplexity for several test sentences
test_sentences = [
    "bees produce honey and pollinate crops",
    "beekeeping is an important agricultural activity",
    "honey bees live in colonies with a queen",
    "beekeepers use protective clothing when working with hives",
    "this sentence has nothing to do with beekeeping or bees",
    "artificial intelligence models process natural language"
]

print("\nPerplexity comparison with the extended corpus:")
for sentence in test_sentences:
    ppl = calculate_perplexity(preprocess_text(sentence))
    # Categorize the perplexity
    category = "Highly relevant" if ppl < 50 else "Somewhat relevant" if ppl < 150 else "Irrelevant"
    print(f"'{sentence}': {ppl:.2f} - {category}")

# Compare the bigram and trigram models
print("\nModel comparison (trigrams vs bigrams):")
for sentence in test_sentences[:3]:  # Use the first 3 sentences for the comparison
    bigram_ppl = calculate_perplexity(preprocess_text(sentence), n=2)
    trigram_ppl = calculate_perplexity(preprocess_text(sentence), n=3)
    improvement = ((bigram_ppl - trigram_ppl) / bigram_ppl) * 100
    print(f"'{sentence}'")
    print(f"  Bigram perplexity: {bigram_ppl:.2f}")
    print(f"  Trigram perplexity: {trigram_ppl:.2f}")
    print(f"  Improvement: {improvement:.1f}%")


Sentence generation with the extended corpus:
1. Standard temperature (1.0): honey bees worker bees are social insects that live in colonies a colony grows too
2. Low temperature (0.5, more formulaic): honey bees includes a queen drones and workers the queen and young larvae it is
3. High temperature (1.5, more random): honey bees includes a queen drones and workers the queen and young larvae it is
4. Longer sentence: honey bees is threatened by commercial breeding practices that focus on a limited number of subspecies efforts to conserve native and locally adapted honey bee

Perplexity comparison with the extended corpus:
'bees produce honey and pollinate crops': 302.96 - Irrelevant
'beekeeping is an important agricultural activity': 407.13 - Irrelevant
'honey bees live in colonies with a queen': 162.87 - Irrelevant
'beekeepers use protective clothing when working with hives': 426.55 - Irrelevant
'this sentence has nothing to do with beekeeping or bees': 370.31 - Irrelevant