# Qudud: Quit Smoking Chatbot Assistant
**Research Portfolio, Rifky Setiawan, Universitas Gadjah Mada (UGM)**


This notebook presents a compact, modular chatbot for quit-smoking support. It combines a lightweight retrieval component, an intent classifier, a transparent dialogue policy, and a small retrieval-augmented response step. The design focuses on clarity, safety, and reproducibility for academic review.


## Abstract

This portfolio notebook demonstrates a small end-to-end quit-smoking assistant named **Qudud**. The system integrates four parts: (1) content acquisition for a basic corpus, (2) bag-of-words retrieval to ground responses, (3) a compact intent classifier with TF‑IDF features and logistic regression from scikit-learn, and (4) a rule-based dialogue policy with a safety layer and a minimal retrieval-augmented step. The implementation is intentionally simple to keep the behavior transparent while remaining extensible for future research.


## 1. Problem Statement and Motivation

Quit-smoking coaching needs practical guidance that is available on demand. Many users benefit from short suggestions during cravings, relapse handling that is non-judgmental, and structured planning advice. This notebook provides a clean reference implementation of a transparent assistant that can be extended with larger models later while keeping a clear baseline for evaluation and ablation studies.


## 2. Pipeline Overview

- **Content Acquisition:** scrape a few public health pages to build a small text corpus; fail gracefully if network access is not available.  
- **Text Preparation:** sentence tokenization and basic cleaning.  
- **Retrieval Module:** bag-of-words vectors ranked by cosine similarity to retrieve relevant sentences.  
- **Intent Classification:** TF‑IDF features and a logistic regression classifier for a small set of intents.  
- **Dialogue Policy:** rule-based policy that uses intents to select supportive messages with stage tracking.  
- **Safety Filter:** keyword-based guard that replaces unsafe outputs with a fallback message.  
- **RAG Step:** append one short retrieved tip that matches the user query to improve specificity.


## 3. Environment Setup and Dependencies

The installation cell below installs the required libraries. Internet access may be needed for newspaper3k scraping and for downloading NLTK tokenizers.


In [None]:
# Environment Setup and Dependencies
# Note: If running on restricted environments, scraping may fail. The code will fall back to a small built-in corpus.

%pip install -q nltk newspaper3k lxml_html_clean scikit-learn numpy

import warnings
warnings.filterwarnings('ignore')

import random
from dataclasses import dataclass, field
from typing import List, Tuple

import numpy as np
import nltk

from newspaper import Article

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.linear_model import LogisticRegression

# Ensure sentence tokenizer data; handle newer NLTK expecting 'punkt_tab'
try:
    nltk.data.find("tokenizers/punkt")
except LookupError:
    nltk.download("punkt", quiet=True)

try:
    nltk.data.find("tokenizers/punkt_tab")
except LookupError:
    try:
        nltk.download("punkt_tab", quiet=True)
    except Exception:
        pass  # ignore if unavailable; downstream code uses a safe fallback

# Reproducibility
random.seed(42)
np.random.seed(42)

## 4. Setup and Imports

This cell gathers all imports in one place and performs lightweight initialization. The tokenizer is downloaded once.


In [None]:
# Minimal Safety Layer

# Minimal keyword-based safety
BLOCKLIST = {"self-harm", "suicide", "kill myself", "harm others"}
SAFE_FALLBACK = "I want to keep you safe. If you feel at risk, please contact local emergency services or a trusted person."

def is_unsafe(text: str) -> bool:
    t = (text or "").lower()
    return any(k in t for k in BLOCKLIST)

def apply_safety(input_text: str, output_text: str) -> str:
    if is_unsafe(input_text) or is_unsafe(output_text):
        return SAFE_FALLBACK
    return output_text

## 5. Safety Utilities

A minimal safety layer checks simple keyword patterns and replaces unsafe outputs with a supportive fallback. This is not a substitute for professional help. It serves as a visible baseline that can be expanded with more robust detection.
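
One way the keyword check could be tightened is to match whole phrases on word boundaries instead of raw substrings, so that a phrase such as "charm others" is not flagged by "harm others". The sketch below is illustrative only: `RISK_PHRASES`, `RISK_PATTERN`, and `flags_risk` are hypothetical names that are not part of the notebook, and the phrase list simply mirrors the style of the blocklist.

```python
import re

# Hypothetical phrase list for illustration; same style as the notebook's blocklist.
RISK_PHRASES = ["suicide", "self-harm", "kill myself", "harm others"]

# One compiled, case-insensitive pattern. The \b anchors require word boundaries,
# so "harm others" does not match inside "charm others".
RISK_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(p) for p in RISK_PHRASES) + r")\b",
    flags=re.IGNORECASE,
)

def flags_risk(text):
    """Return True if any risk phrase appears as a whole phrase in the text."""
    return bool(RISK_PATTERN.search(text or ""))

print(flags_risk("Sometimes I think I might kill myself"))  # whole phrase present
print(flags_risk("Her stories charm others"))               # substring only, not flagged
```

A pattern like this is still a keyword filter and shares its limits: it cannot detect paraphrases or indirect expressions of risk.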


In [None]:
# Content Acquisition

def scrape_articles(urls: List[str]) -> str:
    """Fetch and concatenate article text from a list of URLs.
    Returns an empty string if all fetches fail.
    """
    corpus_parts = []
    for url in urls:
        try:
            art = Article(url)
            art.download()
            art.parse()
            # art.nlp() is optional; not strictly required for .text
            corpus_parts.append(art.text.strip())
        except Exception:
            # Keep going if one source fails
            continue
    return "\n\n".join([c for c in corpus_parts if c])

## 6. Content Acquisition (Scraper)

This function scrapes a few public health resources. If a page fails to load, the function continues with the remaining sources. If all pages fail, the code falls back to a small built-in snippet list.
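
The continue-then-fall-back behavior described here can be sketched without any network dependency by passing the fetch step in as a function. Everything in this block is an assumption for illustration: `collect_text`, `FALLBACK_TIPS`, and the `fetch` parameter are hypothetical names, and the fake fetcher stands in for a real downloader.

```python
from typing import Callable, List

# Hypothetical built-in tips used only when every source fails.
FALLBACK_TIPS = [
    "Set a quit date within two weeks and tell a friend.",
    "Cravings last a few minutes. Prepare alternative activities.",
]

def collect_text(urls: List[str], fetch: Callable[[str], str]) -> List[str]:
    """Fetch each URL, skip failures, and fall back if nothing was retrieved."""
    parts = []
    for url in urls:
        try:
            text = fetch(url).strip()
        except Exception:
            continue  # one failing source should not stop the others
        if text:
            parts.append(text)
    return parts if parts else list(FALLBACK_TIPS)

def fake_fetch(url: str) -> str:
    """Stand-in fetcher: succeeds for one address and fails for all others."""
    if url == "https://example.org/ok":
        return "  Drink water when a craving starts.  "
    raise ConnectionError("no network")

print(collect_text(["https://example.org/ok", "https://example.org/down"], fake_fetch))
print(collect_text(["https://example.org/down"], fake_fetch))
```

Injecting the fetcher keeps the fallback path testable offline, which is useful when the notebook runs in a restricted environment.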


In [None]:
# Sentence Corpus Construction

SNIPPETS = [
    "Patches provide steady nicotine during the day. Follow product dosing instructions.",
    "Chewing gum for nicotine works best with the chew and park technique.",
    "Set a quit date within two weeks and tell a friend to increase accountability.",
    "Cravings last a few minutes. Prepare a short list of alternative activities."
]

DEFAULT_URLS = [
    "https://www.webmd.com/smoking-cessation/ss/slideshow-13-best-quit-smoking-tips-ever",
    "https://www.mayoclinic.org/healthy-lifestyle/quit-smoking/in-depth/nicotine-craving/art-20045454",
    "https://www.quit.org.au/"
]

def build_corpus_sentences(urls: List[str] = None, min_len: int = 30, max_len: int = 400) -> List[str]:
    urls = urls or DEFAULT_URLS
    text = scrape_articles(urls)
    if not text:
        # Fallback if scraping is not possible
        return SNIPPETS.copy()
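    try:
        # sent_tokenize was never imported by name, so call it through the nltk module
        sents = nltk.sent_tokenize(text)
    except LookupError:
        # Tokenizer data could not be downloaded; use the built-in corpus
        return SNIPPETS.copy()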
    # Basic filtering and deduplication
    seen = set()
    out = []
    for s in sents:
        s_clean = " ".join(s.split())
        if min_len <= len(s_clean) <= max_len and s_clean not in seen:
            out.append(s_clean)
            seen.add(s_clean)
    return out if out else SNIPPETS.copy()

corpus_sentences = build_corpus_sentences()
len(corpus_sentences)

97

## 7. Text Preparation

This cell splits the combined corpus into sentences and keeps a unique list to reduce redundancy. If scraping fails, it falls back to a small internal corpus with practical tips.
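
As a rough illustration of the split, clean, and deduplicate steps, the sketch below uses a regular expression in place of the NLTK tokenizer so that it runs without any downloads. The function name `split_and_dedupe` and the regex splitter are assumptions for this example, and the splitter is much cruder than a trained tokenizer (it will break on abbreviations such as "e.g.").

```python
import re

def split_and_dedupe(text, min_len=30, max_len=400):
    """Split text into sentences, normalize whitespace, and drop repeats."""
    # Split after ., ! or ? when followed by whitespace (a naive stand-in tokenizer).
    pieces = re.split(r"(?<=[.!?])\s+", text.strip())
    seen = set()
    out = []
    for piece in pieces:
        clean = " ".join(piece.split())  # collapse internal whitespace
        if min_len <= len(clean) <= max_len and clean not in seen:
            seen.add(clean)
            out.append(clean)
    return out

sample = (
    "Cravings usually pass within a few minutes.  "
    "Cravings usually pass within a few minutes. Ok."
)
print(split_and_dedupe(sample))
```

The length bounds drop fragments such as "Ok." as well as very long run-on sentences, which keeps the retrieved sentences short enough to append to a reply.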


In [None]:
# Bag of Words Retrieval

def retrieve_from_corpus(query: str, sentences: List[str], top_k: int = 3, min_score: float = 0.05) -> List[str]:
    """Return up to top_k sentences similar to the query using cosine similarity."""
    if not sentences:
        return []
    docs = sentences + [query]
    vect = CountVectorizer().fit_transform(docs)
    scores = cosine_similarity(vect[-1], vect).flatten()[:-1]  # exclude the query itself
    idx = np.argsort(-scores)  # descending
    results = []
    for i in idx:
        if len(results) >= top_k:
            break
        if scores[i] >= min_score:
            results.append(sentences[i])
    return results

# Quick smoke test
retrieve_from_corpus("cravings at night", corpus_sentences, top_k=2)

['Use these tips to lessen and resist cravings.',
 'These are a better bet for managing strong cravings.']

## 8. Retrieval Module (Bag-of-Words Similarity)

This cell uses a simple bag-of-words model to fetch the top sentences that match the user input. The matching threshold and the number of sentences are configurable.
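
To make the scoring concrete, the sketch below computes cosine similarity between two bag-of-words vectors by hand. It assumes plain whitespace tokenization, which differs from the default tokenizer in `CountVectorizer` (that one strips punctuation and drops single-character tokens), so scores will not match the notebook exactly. The helper names `bow` and `cosine` are hypothetical.

```python
import math
from collections import Counter

def bow(text):
    """Bag-of-words term counts using lowercase whitespace tokens."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

query = bow("strong cravings")
sentence = bow("cravings are strong today")
# Two shared terms, norms sqrt(2) and 2, so the score is 2 / (2 * sqrt(2)).
print(round(cosine(query, sentence), 4))
```

Because the score depends only on shared tokens, a query such as "urge to smoke" scores zero against a sentence about "cravings", which is the main weakness of bag-of-words retrieval.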


In [None]:
# Lightweight Generators for Dialogue

def greeting_response(text: str):
    """Return a greeting if the user greets the bot."""
    text = (text or "").lower()
    bot_greetings = ["Hi, how can I help today?", "Hello", "Hi there", "Hey", "Welcome"]  # English only for consistency
    user_greetings = {"hey", "hi", "hello", "greetings", "wassup", "halo"}
    for word in text.split():
        if word in user_greetings:
            return random.choice(bot_greetings)
    return None

def personalized_motivation(user_data: dict) -> str:
    """Generate a short motivation message based on user-provided state."""
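    try:
        craving_level = int(user_data.get("craving_level", 5))
    except (TypeError, ValueError):
        craving_level = 5  # non-numeric answer: assume a mid-scale craving level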
    mood = str(user_data.get("mood", "")).lower()
    reason_to_quit = user_data.get("reason_to_quit", "your long-term health")
    messages = []
    if craving_level > 7:
        messages.append("Stay strong. These intense cravings are temporary. Try deep breathing or a short distraction.")
    else:
        messages.append("You are doing well. Keep a steady pace and acknowledge each small win.")
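    # Substring check so free-text answers such as "really stressed right now" still match
    if any(word in mood for word in ("stressed", "bored")):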
        messages.append("If you feel stressed or bored, try light exercise, reading, or listening to music.")
    messages.append(f"Remember your main reason: '{reason_to_quit}'. You can reach your goal.")
    return " ".join(messages)

def craving_tips() -> str:
    tips = [
        "Drink a glass of water to reduce cravings.",
        "Take a short walk or do light exercise.",
        "Practice deep breathing for 5 minutes.",
        "Use sugar-free gum to keep your mouth busy.",
        "Recall your reason for quitting to stay focused."
    ]
    return random.choice(tips)

## 9. Generators for Greetings, Motivation, and Craving Tips

Small helper functions produce simple, supportive responses. The motivation function uses user-provided metadata that can be collected once per session.


In [None]:
# Intent Classifier with TF IDF and Logistic Regression

INTENT_LABELS = ["greet", "craving", "plan_quit", "relapse", "info_nrt", "goodbye"]
train_texts = [
    "hi there", "hello", "good morning",
    "I really want a cigarette right now", "cravings are strong today",
    "I want to set a quit date", "help me plan quitting",
    "I smoked again yesterday", "I relapsed after a week",
    "what is nicotine patch", "info about nicotine gum",
    "thanks bye", "goodbye see you"
]
train_y = [0,0,0, 1,1, 2,2, 3,3, 4,4, 5,5]

tfidf = TfidfVectorizer(ngram_range=(1,2), min_df=1)
X = tfidf.fit_transform(train_texts)
clf = LogisticRegression(max_iter=500, random_state=42)
clf.fit(X, train_y)

def predict_intent(text: str) -> Tuple[str, float]:
    v = tfidf.transform([text])
    proba = clf.predict_proba(v)[0]
    y = int(proba.argmax())
    return INTENT_LABELS[y], float(proba.max())

## 10. Intent Classifier (TF‑IDF + Logistic Regression)

A compact supervised classifier routes common intents. The dataset is toy-sized by design so that reviewers can understand each step. It can be replaced with a larger labeled set later.
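
Because the classifier always returns one of the known labels, a caller may want to treat weak predictions as unknown. The sketch below shows one possible way to do that with a confidence threshold. The function `pick_intent`, the `"unknown"` label, and the threshold value of 0.3 are all assumptions for illustration; a suitable threshold would need to be tuned on held-out data.

```python
from typing import List, Sequence, Tuple

def pick_intent(
    labels: List[str],
    probabilities: Sequence[float],
    threshold: float = 0.3,  # hypothetical cutoff, not taken from the notebook
) -> Tuple[str, float]:
    """Return the top label, or 'unknown' if its probability is below the threshold."""
    best = max(range(len(labels)), key=lambda i: probabilities[i])
    confidence = float(probabilities[best])
    if confidence < threshold:
        return "unknown", confidence
    return labels[best], confidence

labels = ["greet", "craving", "plan_quit"]
print(pick_intent(labels, [0.10, 0.75, 0.15]))  # confident prediction
print(pick_intent(labels, [0.34, 0.33, 0.33], threshold=0.5))  # near-uniform, rejected
```

With a toy training set of a dozen utterances, predicted probabilities tend to be close to uniform, so any fixed threshold should be read as a placeholder.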


In [None]:
# Dialogue Policy and Session State

STAGES = ["precontemplation", "contemplation", "preparation", "action", "maintenance"]

@dataclass
class SessionState:
    stage: str = "contemplation"
    last_intent: str = ""
    relapse_flag: bool = False
    history: list = field(default_factory=list)

def policy(state: SessionState, user_text: str) -> str:
    intent, conf = predict_intent(user_text)
    state.last_intent = intent
    state.history.append(("user", user_text))
    if intent == "plan_quit":
        state.stage = "preparation"
    elif intent == "relapse":
        state.relapse_flag = True
        state.stage = "contemplation"
    if intent == "greet":
        out = "Hi, I am here to support your quit journey. How are you feeling about smoking today?"
    elif intent == "craving":
        out = "Cravings pass. Try a 4D strategy: delay, deep breathing, drink water, do something else."
    elif intent == "plan_quit":
        out = "Good choice. Pick a quit date within two weeks and list your triggers. I can help with the list."
    elif intent == "relapse":
        out = "Slips happen. What did you feel before it happened? We can adjust your plan without judgment."
    elif intent == "info_nrt":
        out = "Nicotine replacement options include patches, gum, lozenges, and sprays."
    elif intent == "goodbye":
        out = "You did well today. I will be here when you need support again."
    else:
        out = "I want to understand you better. Could you rephrase that?"
    out = apply_safety(user_text, out)
    state.history.append(("bot", out))
    return out

## 11. Dialogue Policy and Session State

The policy updates a simple stage variable and produces supportive text based on the predicted intent. The safety layer is applied at the end.
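
The stage update can also be written as a lookup table, which may be easier to audit than a chain of conditionals. The sketch below mirrors the two transitions used by the policy (planning moves to preparation, relapse returns to contemplation); the names `STAGE_AFTER_INTENT` and `next_stage` are hypothetical.

```python
# Hypothetical transition table: intents that are not listed leave the stage unchanged.
STAGE_AFTER_INTENT = {
    "plan_quit": "preparation",
    "relapse": "contemplation",
}

def next_stage(current: str, intent: str) -> str:
    """Return the stage that follows the given intent, defaulting to the current one."""
    return STAGE_AFTER_INTENT.get(intent, current)

stage = "contemplation"
for intent in ["greet", "plan_quit", "craving", "relapse"]:
    stage = next_stage(stage, intent)
    print(intent, "->", stage)
```

Keeping the transitions in data makes it simple to add further rules later, for example moving to "action" once a quit date has passed.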


In [None]:
# Retrieval Augmentation

def retrieve_tips(query: str, k: int = 2):
    q = set((query or "").lower().split())
    scored = []
    for s in SNIPPETS:
        overlap = len(q & set(s.lower().split()))
        scored.append((overlap, s))
    # Sort by overlap only; the stable sort keeps the original snippet order for ties
    scored.sort(key=lambda item: -item[0])
    return [s for _, s in scored[:k]]

def policy_with_rag(state: SessionState, user_text: str) -> str:
    base = policy(state, user_text)
    add = retrieve_tips(user_text, k=1)
    if add:
        base += " " + add[0]
    return base

## 12. Retrieval-Augmented Response (RAG-lite)

This cell attaches one relevant short tip retrieved from a small knowledge list for extra specificity.
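
A possible variation on the overlap scoring is sketched below. It strips punctuation before comparing tokens and returns no tip at all when the query shares no words with any tip, so unrelated messages are not padded with an arbitrary sentence. The names `tokens` and `best_tip` are hypothetical, and this is one design option among several.

```python
import string

def tokens(text):
    """Lowercase word set with ASCII punctuation removed."""
    table = str.maketrans("", "", string.punctuation)
    return set(text.lower().translate(table).split())

def best_tip(query, tips):
    """Return the tip sharing the most words with the query, or None if none overlap."""
    if not tips:
        return None
    q = tokens(query)
    scored = [(len(q & tokens(tip)), tip) for tip in tips]
    overlap, tip = max(scored, key=lambda pair: pair[0])  # first tip wins ties
    return tip if overlap > 0 else None

tips = [
    "Patches provide steady nicotine during the day.",
    "Cravings last a few minutes.",
]
print(best_tip("How long do cravings last?", tips))
print(best_tip("ok thanks", tips))
```

Raw word overlap also counts stop words such as "the" or "a", so a stop-word filter would be a natural next refinement.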


In [None]:
# Unified Response Orchestration

def generate_response(user_text: str, state: SessionState, corpus_sentences: List[str], user_data: dict) -> str:
    # 1) Greetings
    g = greeting_response(user_text)
    if g:
        return apply_safety(user_text, g)

    # 2) Keyword shortcuts
    low = user_text.lower()
    if "motivation" in low:
        return apply_safety(user_text, personalized_motivation(user_data))
    if "craving" in low:
        return apply_safety(user_text, craving_tips())
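    # Match exit commands only as the whole message; a substring test would
    # also fire on phrases such as "I want to set a quit date"
    if low.strip() in {"exit", "bye", "quit"}: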
        return apply_safety(user_text, "Thank you for using Qudud. Stay strong and believe in yourself.")

    # 3) Policy response with small RAG tip
    out = policy_with_rag(state, user_text)

    # 4) If intent confidence is unknown in this simple demo, add corpus sentences
    # We approximate low confidence by checking if the policy fell back to the clarification line.
    if "rephrase" in out.lower():
        retrieved = retrieve_from_corpus(user_text, corpus_sentences, top_k=2, min_score=0.05)
        if retrieved:
            out += " " + " ".join(retrieved)

    return apply_safety(user_text, out)

## 13. Unified Bot Response

This function combines greetings, keyword shortcuts, personalized motivation, and the dialogue policy. The retrieval module adds short relevant sentences from the scraped corpus when confidence is low.


In [None]:
# Interactive Console Demo

def run_chat_interactive(max_turns: int | None = None):
    """
    Start an interactive session using input().
    Type 'exit', 'quit', or 'bye' to end the session.
    """
    try:
        state = SessionState()
        print("Welcome to Qudud - Interactive Mode")
        print("Type 'exit' anytime to quit.\n")

        # Collect user data once
        def _get(prompt, default=""):
            try:
                v = input(prompt).strip()
                return v if v else default
            except EOFError:
                return default

        user_data = {
            "smoking_frequency": _get("How many cigarettes do you smoke per day? ", "0"),
            "craving_level": _get("On a scale of 1-10, how strong are your cravings? ", "5"),
            "mood": _get("How do you feel right now? ", "neutral"),
            "reason_to_quit": _get("What is your main reason for quitting? ", "health")
        }

        turns = 0
        while True:
            if max_turns is not None and turns >= max_turns:
                print("Session limit reached.")
                break
            try:
                u = input("You: ").strip()
            except EOFError:
                print("\nInput stream closed.")
                break
            if u.lower() in {"exit", "quit", "bye"}:
                print("Qudud: Thank you for using Qudud. Stay strong and believe in yourself.")
                break
            bot = generate_response(u, state, corpus_sentences, user_data)
            print(f"Qudud: {bot}\n")
            turns += 1
    except KeyboardInterrupt:
        print("\nQudud: Goodbye.")

run_chat_interactive()

Welcome to Qudud - Interactive Mode
Type 'exit' anytime to quit.

How many cigarettes do you smoke per day? 7
On a scale of 1-10, how strong are your cravings? 9
How do you feel right now? I’m really stressed right now
What is your main reason for quitting? I want to live a healthy life
You: I want to set a quit date
Qudud: Thank you for using Qudud. Stay strong and believe in yourself.

You: Cravings are strong at night
Qudud: Drink a glass of water to reduce cravings.

You: What about patches
Qudud: Nicotine replacement options include patches, gum, lozenges, and sprays. Patches provide steady nicotine during the day. Follow product dosing instructions.

You: Ok thanks
Qudud: You did well today. I will be here when you need support again. Patches provide steady nicotine during the day. Follow product dosing instructions.

You: Bye
Qudud: Thank you for using Qudud. Stay strong and believe in yourself.


## 14. Results

This baseline produces clear and supportive messages for each tested intent. It augments low-confidence cases with a short relevant sentence from the corpus to increase specificity. Evaluation for this small demo is qualitative. A next iteration can include human preference judgments and simple satisfaction metrics after each turn.


## 15. Insights

- Lightweight TF‑IDF with logistic regression is fast and transparent for routing.  
- A visible safety check is essential even when simple. It clarifies limits while offering guidance to the user to seek help when needed.  
- A minimal retrieval addition increases grounding without complicating the core policy.  
- The code organization makes ablations straightforward: disable retrieval, swap the classifier, or extend the safety layer.


## 16. Responsible Use and Limitations

This assistant is not medical advice. It cannot diagnose conditions or handle crises. The safety layer is a simple keyword filter and must be replaced or complemented by a robust classifier in real deployments. Scraping should respect site policies and content licenses.


## 17. Conclusion

This notebook illustrates a transparent baseline for a quit-smoking assistant that is easy to read and extend. It is suitable as a reference point for research iterations on intent modeling, retrieval quality, and safety.


## 18. Next Steps

- Collect a larger labeled dataset for intents and add per-class metrics.  
- Replace bag-of-words retrieval with a compact embedding model that improves semantic matching.  
- Add configurable rules for local languages.  
- Integrate a better safety classifier and evaluate false negatives.  
- Conduct small user studies to measure usefulness and clarity.
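
For the per-class metrics mentioned in the first item, a starting point could look like the sketch below, which computes precision and recall per intent from gold and predicted labels in plain Python. The function name `per_class_metrics` and the example labels are made up for illustration and are not results from this notebook.

```python
from collections import Counter

def per_class_metrics(y_true, y_pred):
    """Return precision and recall for every label seen in either list."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label was wrong
            fn[truth] += 1  # true label was missed
    metrics = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        metrics[label] = {
            "precision": tp[label] / predicted if predicted else 0.0,
            "recall": tp[label] / actual if actual else 0.0,
        }
    return metrics

# Illustrative labels only.
gold = ["craving", "craving", "relapse", "greet"]
pred = ["craving", "relapse", "relapse", "greet"]
for label, scores in per_class_metrics(gold, pred).items():
    print(label, scores)
```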


## Author / Contact

**Rifky Setiawan**  
Undergraduate Student, Department of Computer Science  
Universitas Gadjah Mada (UGM), Indonesia  
Email: rifkysetiawan@mail.ugm.ac.id
