## DATA 622 Natural Language Processing
### Homework 6

Questions
### Exercise 1: Markov Chain Story Generation
Instructions:
Write a program to generate a short story using Markov chains. Use a sample text (e.g., a
fairy tale) to build your Markov model. Start your generated story with the phrase:
"Once upon a time there was a kingdom..."
Steps:
1. Load a fairy tale or any sample text as your Markov chain training data.
2. Build a Markov chain based on word transitions.
3. Generate and print a story starting with the provided phrase.


In [None]:
import os
import random
import numpy as np
from collections import defaultdict
import re

TEXT_PATH = None

FAIRY_TALE = """
Once upon a time there was a kingdom of bright valleys and deep forests.
The king was kind, the queen was wise, and the people were happy.
But at the edge of the map, a quiet mountain kept a secret:
a dragon who could listen to dreams.
When the wind blew from the north, the dragon heard the hopes of children
and learned the songs of the river.
One summer, the river grew silent, and the dreams felt heavy.
A young scribe named Mira packed bread, ink, and a small bell,
and set out to ask the dragon what had changed.
"""

def load_corpus(text_path=TEXT_PATH, fallback_text=FAIRY_TALE):
    """Load training corpus.
    If text_path is a real file on disk, read it; otherwise use fallback_text.
    """
    if isinstance(text_path, str) and os.path.exists(text_path):
        with open(text_path, encoding='utf-8') as f:
            text = f.read()
    else:
        text = fallback_text
    text = re.sub(r"\s+", " ", text.strip())
    return text.split()

In [None]:
corpus = load_corpus()

def make_pairs(corpus):
    for i in range(len(corpus) - 1):
        yield (corpus[i], corpus[i+1])

pairs = list(make_pairs(corpus))

# word_dict[word][next_word] = count
word_dict = defaultdict(lambda: defaultdict(int))
for w1, w2 in pairs:
    word_dict[w1][w2] += 1

# Convert to a standard dict for safer downstream ops
markov_graph = {w1: dict(nexts) for w1, nexts in word_dict.items()}
len(markov_graph)

70
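
The nested counts in `markov_graph` can be read as empirical transition probabilities by normalizing the outgoing counts of each word. The following is a minimal sketch on a toy corpus; `build_graph`, `transition_probs`, and `toy_corpus` are illustrative names that do not appear in the notebook above.

```python
from collections import defaultdict

def build_graph(tokens):
    """Count word -> next-word transitions, mirroring the cell above."""
    counts = defaultdict(lambda: defaultdict(int))
    for w1, w2 in zip(tokens, tokens[1:]):
        counts[w1][w2] += 1
    return {w1: dict(nexts) for w1, nexts in counts.items()}

def transition_probs(graph, word):
    """Normalize the outgoing counts of `word` into probabilities."""
    nexts = graph.get(word, {})
    total = sum(nexts.values())
    return {w: c / total for w, c in nexts.items()} if total else {}

# Hypothetical toy corpus, chosen so the counts are easy to check by hand
toy_corpus = "the king was kind the king was wise the queen was wise".split()
toy_graph = build_graph(toy_corpus)

print(transition_probs(toy_graph, "the"))  # "king" twice, "queen" once
print(transition_probs(toy_graph, "was"))  # "kind" once, "wise" twice
```

Here "the" is followed by "king" in two of three occurrences, so the walk would pick "king" about two thirds of the time when it is at "the".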

In [None]:
def walk_graph(graph, distance=15, start_node=None):
    """Walk the Markov graph for `distance` steps using weighted next-word choice."""
    if not graph:
        return []
    if start_node is None or start_node not in graph:
        start_node = random.choice(list(graph.keys()))
    if distance <= 0:
        return [start_node]

    # Build a sequence
    seq = [start_node]
    current = start_node
    for _ in range(distance):
        neighbors = graph.get(current)
        if not neighbors:
            # Restart at a random node if we hit a dead end
            current = random.choice(list(graph.keys()))
            seq.append(current)
            continue
        choices = list(neighbors.keys())
        weights = np.array(list(neighbors.values()), dtype=np.float64)
        weights /= weights.sum()
        chosen = np.random.choice(choices, p=weights)
        seq.append(chosen)
        current = chosen
    return seq
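
`walk_graph` draws from two random sources (`random` for restarts, `np.random` for weighted choices), so repeating a run exactly would require seeding both. One possible variant, sketched here under the assumption that reproducibility is wanted, uses a single private `random.Random` instance and `random.choices` for the weighted draw. The function name `walk_graph_seeded`, the `seed` parameter, and `toy_graph` are hypothetical.

```python
import random

def walk_graph_seeded(graph, distance=15, start_node=None, seed=0):
    """Weighted random walk that draws from one private RNG for repeatable output."""
    rng = random.Random(seed)
    if not graph:
        return []
    nodes = list(graph)
    current = start_node if start_node in graph else rng.choice(nodes)
    seq = [current]
    for _ in range(max(distance, 0)):
        neighbors = graph.get(current)
        if not neighbors:
            # Dead end: restart at a random node, as in the original walk
            current = rng.choice(nodes)
        else:
            words = list(neighbors)
            counts = list(neighbors.values())
            # random.choices accepts raw counts as weights; no normalization needed
            current = rng.choices(words, weights=counts, k=1)[0]
        seq.append(current)
    return seq

# Hypothetical toy graph: "c" has no outgoing transitions (a dead end)
toy_graph = {"a": {"b": 3, "c": 1}, "b": {"a": 1}, "c": {}}
first = walk_graph_seeded(toy_graph, distance=10, start_node="a", seed=42)
second = walk_graph_seeded(toy_graph, distance=10, start_node="a", seed=42)
print(first)
print(first == second)  # the same seed reproduces the same walk
```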

In [None]:
START_PHRASE = "Once upon a time there was a kingdom..."

def generate_story(graph, start_phrase=START_PHRASE, extra_words=120):
    """Append a random walk of `extra_words` steps to `start_phrase`.

    The walk is seeded with the last word of the start phrase. Because
    walk_graph() returns its start node as the first element, that word is
    repeated right after the phrase (e.g. "kingdom... kingdom of").
    """
    # Seed the walk with the last token of the start phrase, minus trailing punctuation
    start_tokens = start_phrase.strip().split()
    seed_clean = re.sub(r"[\W_]+$", "", start_tokens[-1])
    if seed_clean not in graph:
        # Unknown seed: let walk_graph pick a random start node
        seed_clean = None

    generated = walk_graph(graph, distance=extra_words, start_node=seed_clean)
    text = start_phrase + " " + " ".join(generated)
    # Light cleanup: remove whitespace before punctuation
    text = re.sub(r"\s+([,.!?;:])", r"\1", text)
    return text

story = generate_story(markov_graph, extra_words=180)
print(story)

Once upon a time there was a kingdom... kingdom of the people were happy. But at the queen was wise, and deep forests. The king was a time there was a kingdom of bright valleys and deep forests. The king was kind, the dragon who could listen to dreams. When the north, the people were happy. But at the map, a quiet mountain kept a kingdom of bright valleys and the river grew silent, and deep forests. The king was kind, the map, a kingdom of children and deep forests. The king was kind, the queen was wise, and learned the edge of bright valleys and deep forests. The king was a kingdom of the river grew silent, and deep forests. The king was wise, and deep forests. The king was kind, the dragon who could listen to ask the people were happy. But at the hopes of children and a quiet mountain kept a dragon who could listen to dreams. When the map, a kingdom of the dragon what had changed. kind, the dragon what had changed. But at the dragon what had changed. queen was kind, the


### Exercise 2: LLM-Generated Story
The question is ‘Why Study Data Science Today?’
Instructions:
Use an LLM (e.g., OpenAI GPT, Hugging Face Transformers) to generate a creative story
about why it is important to study data science today.
Steps:
1. Use a summarization or text generation pipeline/model.
2. Prompt the model to write a story (not just an essay or bullet points).
3. Print the story.

In [None]:
llm_story = None
error_msg = None

PROMPT = (
    "Write a creative, character-driven short story (300-600 words) about why "
    "it is important to study data science today. Use a hopeful, forward-looking tone, "
    "weave in everyday examples (healthcare, climate, cities), and end on a memorable line."
)

try:
    # Try Hugging Face Transformers first (offline-capable if a model is cached).
    from transformers import pipeline
    try:
        generator = pipeline('text-generation', model='distilgpt2')
    except Exception:
        # Tiny fallback model if distilgpt2 is unavailable
        generator = pipeline('text-generation', model='sshleifer/tiny-gpt2')
    # max_new_tokens bounds the generated continuation (the prompt is not counted).
    out = generator(PROMPT, max_new_tokens=500, do_sample=True, top_p=0.95, temperature=0.9)
    llm_story = out[0]['generated_text']

except Exception as e:
    error_msg = f"Transformers pipeline failed: {e}\nFalling back to a template-based generator."

if llm_story is None:
    if error_msg:
        # Report why the fallback was used instead of discarding the message
        print(error_msg)
    # Deterministic fallback (no internet / no models). Keeps assignment runnable.
    paragraphs = [
        "The morning the buses ran on time, Aisha noticed the city had learned her habits. "
        "It wasn’t magic—it was data. Patterns from a thousand weekday rides braided together "
        "to shorten her wait and lengthen her coffee break. She smiled, thinking of the class "
        "she almost dropped, and the professor who said, 'Data science is how we listen to complex systems.'",

        "At the hospital across town, Malik watched a model flag a faint anomaly in his mother’s chart—"
        "nothing dramatic, just a whisper in the numbers that nudged a doctor to order a test early. "
        "They caught the problem before it learned to hide. Malik saved the printout, the graph that "
        "meant another autumn together, and sent a silent thank you to the students who tuned that model.",

        "Farther south, on a coastline where the maps keep changing, Alma’s team stacked satellite images "
        "like pages in a flipbook. The shoreline moved; the data didn’t lie. She trained a small network to "
        "spot the invisible—slow floods, salt creep, the quiet losses that don’t make the news. What they built "
        "was not just an algorithm but a lantern: where to raise the houses, where to plant the trees, where to wait.",

        "Studying data science today is not about worshipping numbers. It’s about dignity in the everyday—"
        "a bus that arrives, a treatment that works, a neighborhood that plans for storms while the sun is still out. "
        "It is a language for asking better questions and a craft for building tools that answer in time to matter.",

        "When Aisha opened her laptop that night, the cursor blinked like a metronome. She wrote: "
        "We study data science so tomorrow can hear us coming. And the city, the clinic, and the coast—all listening—nodded back."
    ]
    llm_story = "\n\n".join(paragraphs)

print(llm_story)


Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Both `max_new_tokens` (=256) and `max_length`(=500) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Write a creative, character-driven short story (300-600 words) about why it is important to study data science today. Use a hopeful, forward-looking tone, weave in everyday examples (healthcare, climate, cities), and end on a memorable line.
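
The printed output of this run is only the echoed prompt, with no story text after it. Because no exception was raised, the template fallback never ran. One way to catch this case, sketched under the assumption that the pipeline echoes the prompt at the start of `generated_text` (as the output above suggests), is to strip the prompt and check whether anything substantive remains. The helper names `continuation_of` and `is_usable_story` and the `min_words` threshold are hypothetical.

```python
def continuation_of(generated_text, prompt):
    """Return the text the model added after the echoed prompt."""
    if generated_text.startswith(prompt):
        generated_text = generated_text[len(prompt):]
    return generated_text.strip()

def is_usable_story(generated_text, prompt, min_words=50):
    """Heuristic: the continuation must contain at least `min_words` words."""
    return len(continuation_of(generated_text, prompt).split()) >= min_words

# Hypothetical inputs that imitate the two cases
prompt = "Write a short story about data science."
empty_output = prompt + "\n" * 200          # prompt echoed, then only newlines
real_output = prompt + " " + "word " * 60   # prompt echoed, then 60 words

print(is_usable_story(empty_output, prompt))  # False
print(is_usable_story(real_output, prompt))   # True
```

In the notebook, resetting `llm_story = None` when this check fails would let the existing `if llm_story is None:` branch supply the template story.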