## Lab Activity: Hosting an App on Streamlit

### Introduction

Streamlit is an open-source Python framework for building web applications from ordinary Python scripts.

In this lab activity, you will go through the process of hosting an Information Retrieval (IR) app on Streamlit that uses document embeddings.

The app will allow users to enter a query and retrieve the top K most
relevant documents.

**Prerequisites**

Before starting, ensure you have:

- Python installed (Python 3.7+ recommended)
- pip installed for package management
- Precomputed document embeddings stored as a NumPy array
- A text-based dataset with corresponding documents

#### Step 1: Install Required Libraries

To get started, install Streamlit and the other required dependencies:

In [29]:
!pip install nltk gensim numpy scikit-learn streamlit




In [30]:
import streamlit as st
import numpy as np
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords, reuters
from sklearn.metrics.pairwise import cosine_similarity
from gensim.models import Word2Vec


#### Step 2: Download NLTK Data

In [31]:
import nltk

nltk.download('reuters')
nltk.download('punkt')  # newer NLTK releases may need 'punkt_tab' for word_tokenize instead
nltk.download('stopwords')


[nltk_data] Downloading package reuters to
[nltk_data]     C:\Users\tehre\AppData\Roaming\nltk_data...
[nltk_data]   Package reuters is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\tehre\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\tehre\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

#### Step 3: Build Corpus for Word2Vec

In [32]:
from nltk.corpus import reuters, stopwords
from nltk.tokenize import word_tokenize

# Prepare stopwords
stop_words = set(stopwords.words("english"))

# Create documents
# Original documents (for display in app)
documents_original = []

# Tokenized and cleaned documents (for embeddings)
documents_processed = []

for fileid in reuters.fileids():
    raw_text = reuters.raw(fileid).strip()
    documents_original.append(raw_text)  # Save original full text
    
    # Preprocess text: lowercase, remove stopwords & punctuation
    words = [w.lower() for w in word_tokenize(raw_text) if w.isalnum() and w.lower() not in stop_words]
    documents_processed.append(words)  # Keep as list of words for Word2Vec
    
print("Total documents:", len(documents_original))

Total documents: 10788
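
A query typed into the app would need the same cleaning as the documents, otherwise its tokens may not match the Word2Vec vocabulary. One way to keep the two in sync is to wrap the Step 3 logic in a function. The sketch below is only an illustration: the name `preprocess` and the injected `tokenize` argument are assumptions, and the demo uses `str.split` with a tiny stop-word set so it runs without NLTK data.

```python
def preprocess(text, tokenize, stop_words):
    """Lowercase, keep alphanumeric tokens, and drop stopwords (same rules as Step 3)."""
    return [
        w.lower()
        for w in tokenize(text)
        if w.isalnum() and w.lower() not in stop_words
    ]


# Toy demo: str.split stands in for word_tokenize, and the stop-word set is made up.
demo_stop_words = {"the", "of"}
demo_tokens = preprocess("The price of OIL rose", str.split, demo_stop_words)
print(demo_tokens)  # ['price', 'oil', 'rose']
```

Passing `word_tokenize` and the NLTK stop-word set as the last two arguments should reproduce the behaviour of the loop above.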


#### Step 4: Train Word2Vec

In [33]:
model = Word2Vec(
    sentences=documents_processed,
    vector_size=100,
    window=5,
    min_count=5,
    workers=4
)

print("Vocabulary size:", len(model.wv.index_to_key))

# Save the Word2Vec model for Streamlit app
model.save("word2vec_reuters.model")

Vocabulary size: 10407


#### Step 5: Create Document Embeddings

In [34]:
doc_embeddings = []

for tokens in documents_processed:
    vectors = [model.wv[w] for w in tokens if w in model.wv]
    if vectors:
        doc_vector = np.mean(vectors, axis=0)
    else:
        doc_vector = np.zeros(model.vector_size)
    doc_embeddings.append(doc_vector)

doc_embeddings = np.array(doc_embeddings)
print("Embeddings shape:", doc_embeddings.shape)

Embeddings shape: (10788, 100)


#### Step 6: Save Everything

In [35]:
np.save("embeddings.npy", doc_embeddings)

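with open("documents_original.txt", "w", encoding="utf-8") as f:
    for doc in documents_original:
        # Collapse internal newlines so each document occupies exactly one line
        # and line i stays aligned with row i of embeddings.npy
        f.write(" ".join(doc.split()) + "\n")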

print("Embeddings, documents, and Word2Vec model saved successfully!")


Embeddings, documents, and Word2Vec model saved successfully!
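
Because the app relies on row i of the embeddings matching document i, it may be worth reloading the saved files and comparing counts before moving on. The sketch below is a minimal round-trip check on toy data, not the Reuters files: the file names are hypothetical, everything is written to a temporary directory, and the documents are stored as JSON, which is one alternative that keeps newlines inside a document intact.

```python
import json
import tempfile
from pathlib import Path

import numpy as np

# Toy stand-ins for documents_original and doc_embeddings (not the Reuters data).
toy_documents = ["first doc\nwith two lines", "second doc"]
toy_embeddings = np.arange(6, dtype=np.float32).reshape(2, 3)

with tempfile.TemporaryDirectory() as tmp:
    emb_path = Path(tmp) / "toy_embeddings.npy"   # hypothetical file names
    docs_path = Path(tmp) / "toy_documents.json"

    # Save both artifacts, then read them back as the app would.
    np.save(emb_path, toy_embeddings)
    docs_path.write_text(json.dumps(toy_documents), encoding="utf-8")

    loaded_embeddings = np.load(emb_path)
    loaded_documents = json.loads(docs_path.read_text(encoding="utf-8"))

# One embedding row per document, and the text survives unchanged.
alignment_ok = (
    len(loaded_documents) == loaded_embeddings.shape[0]
    and loaded_documents == toy_documents
)
print("Aligned:", alignment_ok)
```

The same length comparison can be applied to the real `embeddings.npy` and the saved document file after reloading them.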


**Summary of variables for app**

- documents_original -> list of full text documents for display
- doc_embeddings     -> NumPy array of embeddings aligned with documents_original
- model              -> Word2Vec model used for query embedding
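
To show how these three variables could fit together at query time, here is a minimal sketch of top-K retrieval with cosine similarity. It is an illustration under stated assumptions rather than the app's actual code: the helper names `embed_tokens` and `top_k` are hypothetical, the 2-dimensional "word vectors" are made up, and plain NumPy is used in place of `sklearn`'s `cosine_similarity` so the example runs on its own.

```python
import numpy as np


def embed_tokens(tokens, word_vectors, dim):
    """Average the vectors of known tokens; return a zero vector if none are known."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    return np.mean(known, axis=0) if known else np.zeros(dim)


def top_k(query_vec, doc_matrix, k):
    """Return (row index, cosine similarity) pairs for the k most similar rows."""
    denom = np.linalg.norm(doc_matrix, axis=1) * np.linalg.norm(query_vec)
    # Zero vectors would cause division by zero, so their similarity is left at 0.
    sims = np.divide(
        doc_matrix @ query_vec,
        denom,
        out=np.zeros(len(doc_matrix)),
        where=denom > 0,
    )
    order = np.argsort(-sims, kind="stable")[:k]
    return [(int(i), float(sims[i])) for i in order]


# Toy data: made-up 2-dimensional vectors standing in for model.wv.
toy_vectors = {
    "oil": np.array([1.0, 0.0]),
    "price": np.array([0.8, 0.2]),
    "wheat": np.array([0.0, 1.0]),
}
toy_docs = [["oil", "price"], ["wheat"], ["unknown"]]
toy_matrix = np.array([embed_tokens(d, toy_vectors, 2) for d in toy_docs])

query_vec = embed_tokens(["oil"], toy_vectors, 2)
results = top_k(query_vec, toy_matrix, k=2)
print(results)  # document 0 ranks first
```

With the real objects, `model.wv` supports the same `in` and indexing operations as the toy dictionary, so `embed_tokens(tokens, model.wv, model.vector_size)` should work for a preprocessed query, and the returned indices can be used to look up entries in `documents_original`.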