# **CourseraRAG: AI-Powered Study Assistant**

### **Project Overview**

**CourseraRAG: AI-Powered Study Assistant** is a retrieval-augmented study tool built on course material. It combines three components:

1. **Retrieval System**  
   The retrieval system indexes a corpus made up of every lecture transcript and each chapter of the course textbook, all converted to text files. Given a user query, it ranks these documents and returns the most relevant ones, so that answers can draw on both the lectures and the textbook.

2. **Generative AI (Powered by OpenAI APIs)**  
   Using OpenAI's APIs, this component generates answers from the retrieved documents. Supplying the lecture transcripts and textbook content as context is intended to keep responses relevant and grounded in the course material.

3. **Interactive User Interface**  
   The user interface is the entry point to **CourseraRAG**. It lets users pose questions and receive AI-generated answers, and it exposes the capabilities of both the Retrieval System and the Generative AI component through a simple menu.

### **Project Goals**

- To deliver efficient and targeted learning by providing access to a rich database of lecture transcripts and textbook content.
- To enhance understanding and retention through AI-generated answers, leveraging the comprehensive corpus of educational materials and the power of OpenAI's APIs.
- To provide an intuitive and interactive tool that caters to the diverse needs of learners and educators.


## Step 1: Conversion of Textbook PDF into Chapter-wise Text Files

This initial step converts the textbook from PDF into individual text files, one per chapter. Keeping each chapter in its own file makes it easy to access and reference in the later stages of the project.
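
As a minimal sketch of the chapter-splitting idea, the snippet below walks a text line by line and starts a new bucket whenever a line begins with a known chapter title. The sample text and the helper name `split_by_titles` are illustrative assumptions, not part of the project code.

```python
def split_by_titles(text, titles):
    """Group lines under the most recent chapter title seen (illustrative helper)."""
    chapters = {}
    current = None
    for line in text.split("\n"):
        stripped = line.strip()
        # A non-empty line that starts with a known title opens a new chapter.
        match = next((t for t in titles if stripped and stripped.startswith(t)), None)
        if match is not None:
            current = match
            chapters[current] = []
        elif current is not None:
            chapters[current].append(line)
    return {title: "\n".join(lines) for title, lines in chapters.items()}


# Hypothetical sample text; headings fuse number and title, as PDF extraction often does.
sample = "Preface\n1Introduction\nText is everywhere.\n\n2Background\nProbability basics."
result = split_by_titles(sample, ["1Introduction", "2Background"])
print(result)
```

Lines that appear before the first recognized title (here, "Preface") are dropped, which mirrors how front matter is ignored when only chapter files are needed.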


In [None]:
%pip install pdfplumber
%pip install pytesseract
%pip install PyPDF2
%pip install openai
import pdfplumber
from PIL import Image
import pytesseract
import PyPDF2
import re
import os


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
def insert_spaces(text):
    # Pattern to identify places where a lowercase letter is followed by an uppercase letter
    pattern = re.compile(r'(?<=[a-z])(?=[A-Z])')

    # Insert a space at each identified position
    return pattern.sub(' ', text)

def extract_text_from_pdf(pdf_path):
    try:
        with open(pdf_path, 'rb') as file:
            reader = PyPDF2.PdfReader(file)

            # Check for encryption and try to decrypt
            if reader.is_encrypted:
                try:
                    reader.decrypt('')
                except Exception as e:
                    print(f"Unable to decrypt PDF: {e}")
                    return None

            text = ''

            # Extract text from each page
            for page in reader.pages:
                page_text = page.extract_text()
                if page_text:
                    # Add logic here if you need to clean or format the text
                    text += page_text + '\n'
            return text
    except FileNotFoundError:
        print("File not found. Please check the file path.")
        return None
    except Exception as e:
        print(f"An error occurred: {e}")
        return None

def chunk_text_by_chapter_and_save(text, chapter_titles, output_folder):
    if not os.path.exists(output_folder):
        os.makedirs(output_folder)

    current_chapter = None
    chapter_contents = []

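    for line in text.split('\n'):
        stripped = line.strip()
        # Check if the line matches or partially matches any chapter title.
        # Blank lines are excluded: every string starts with '', so an empty
        # line would otherwise match the first title and reset the chapter.
        for title in chapter_titles:
            if stripped and (title.startswith(stripped) or stripped.startswith(title)):
                if current_chapter:
                    save_chapter_to_file(current_chapter, '\n'.join(chapter_contents), output_folder)
                current_chapter = title
                chapter_contents = []
                break
        else:  # No title matched: the line belongs to the current chapter
            chapter_contents.append(line)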

    if current_chapter:
        save_chapter_to_file(current_chapter, '\n'.join(chapter_contents), output_folder)

def save_chapter_to_file(chapter_title, content, output_folder):
    filename = f"{chapter_title}.txt".replace(' ', '_').replace('/', '_')
    file_path = os.path.join(output_folder, filename)
    with open(file_path, 'w', encoding='utf-8') as file:
        file.write(content)


In [None]:
chapter_titles = [
    "1Introduction",
    "2Background",
    "3Text Data Understanding",
    "4META: A Unified Toolkit",
    "5Overview of Text",
    "6Retrieval Models",
    "7Feedback",
    "8Search Engine",
    "9Search Engine Evaluation",
    "10Web Search",
    "11Recommender Systems",
    "12Overview of Text",
    "13Word Association Mining",
    "14Text Clustering",
    "15Text Categorization",
    "16Text Summarization",
    "17Topic Analysis",
    "18Opinion Mining and",
    "19Joint Analysis of Text",
    "20Toward A Unified System for Text Management and Analysis"
]


pdf_path = '/content/drive/MyDrive/CS_410/textbook_410.pdf'
text = extract_text_from_pdf(pdf_path)

if text:
    chunk_text_by_chapter_and_save(text, chapter_titles, "files/")

## Step 2: File Verification and Inventory

This step is crucial in ensuring the completeness and readiness of our resources. It involves a thorough verification process to confirm the presence of all necessary files. The key components to be verified are:

- **Overview Files:** These files should contain guiding questions and key concepts. It's essential to ensure that each overview file is complete and accurately reflects the course material.

- **Lecture Transcripts:** We need to have transcripts for each week's lecture. This step includes checking that each transcript is available, legible, and correctly corresponds to the respective week's content.

- **Textbook Chapters:** Different chapters from our textbook should be available as individual text files. The verification process here involves confirming that each chapter is properly extracted, correctly labeled, and includes all the relevant content.

This comprehensive verification ensures that all the critical educational resources are in place and correctly organized for the subsequent stages of our project.
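
To illustrate how an overview file can be checked, here is a small sketch that pulls the lines between two headings with a regular expression. The overview text and the function name `section_between` are made up for the example; the real files may use different heading names.

```python
import re


def section_between(text, heading, next_heading=None):
    """Return the non-empty lines found between `heading` and `next_heading`."""
    start = re.escape(heading)
    if next_heading:
        pattern = rf"{start}\s*\n(.*?)\n{re.escape(next_heading)}"
    else:
        # With no following heading, read to the end of the text.
        pattern = rf"{start}\s*\n(.*)"
    match = re.search(pattern, text, re.DOTALL)
    body = match.group(1) if match else ""
    return [line.strip() for line in body.split("\n") if line.strip()]


# Hypothetical overview file contents.
overview = (
    "Goals and Objectives\n"
    "Explain the vector space model.\n"
    "Guiding Questions\n"
    "What is TF-IDF weighting?\n"
    "Why normalize document length?\n"
    "Key Phrases and Concepts\n"
    "Inverted index\n"
)

questions = section_between(overview, "Guiding Questions", "Key Phrases and Concepts")
concepts = section_between(overview, "Key Phrases and Concepts")
print(questions)
print(concepts)
```

An empty list for a section is a useful signal during verification: it usually means the heading in the file is spelled differently from the one being searched for.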


In [None]:
import re
import os

# Function to extract content between two headings
def extract_content(heading, text, next_heading=None):
    if next_heading:
        pattern = re.compile(rf"{heading}\n(.*?)\n{next_heading}", re.DOTALL)
    else:
        pattern = re.compile(rf"{heading}\n(.*?)(?=\n[A-Z][a-z])", re.DOTALL)
    match = pattern.search(text)
    content = match.group(1).strip() if match else ""
    return [line.strip() for line in content.split('\n') if line.strip()]

# Function to process overview files in a specified directory
def process_overviews(directory):
    files = sorted(os.listdir(directory))
    for file in files:
        file_path = os.path.join(directory, file)
        if os.path.isfile(file_path):
            with open(file_path, 'r') as f:
                text = f.read()

            goals_and_objectives = extract_content("Goals and Objectives", text, "Guiding Questions")
            guiding_questions = extract_content("Guiding Questions", text, "Key Phrases and Concepts")
            key_phrases_and_concepts = extract_content("Key Phrases and Concepts", text)

            print(f"File: {file}")
            print(f"Length: {len(text)} characters")
            print("Goals and Objectives:")
            for item in goals_and_objectives:
                print("-", item)
            print("\nGuiding Questions:")
            for item in guiding_questions:
                print("-", item)
            print("\nKey Phrases and Concepts:")
            for item in key_phrases_and_concepts:
                print("-", item)
            print("\n" + "-"*50 + "\n")

# Function to list textbook chapters or lecture transcripts
def list_files(directory, title):
    files = sorted(os.listdir(directory))
    print(f"{title}:")
    for file in files:
        file_path = os.path.join(directory, file)
        if os.path.isfile(file_path):
            with open(file_path, 'r') as f:
                text = f.read()
            print(f"{file} - Length: {len(text)} characters")
    print("\n" + "-"*50 + "\n")

# Function to process lecture transcripts in weekly folders
def process_lecture_transcripts(directory):
    for week in sorted(os.listdir(directory)):
        week_path = os.path.join(directory, week)
        if os.path.isdir(week_path):
            print(f"Processing {week} Transcripts:")
            list_files(week_path, f"{week} Transcripts")
            print("\n" + "-"*50 + "\n")

# Directory paths
overview_directory = "/content/drive/MyDrive/CS_410/Text Files/Overview"
chapter_directory = "/content/drive/MyDrive/CS_410/Text Files/Textbook"
lecture_directory = "/content/drive/MyDrive/CS_410/Text Files/Lectures"

# Check if directories exist and process files
if os.path.exists(overview_directory):
    process_overviews(overview_directory)
else:
    print(f"The overview directory {overview_directory} does not exist.")

if os.path.exists(chapter_directory):
    list_files(chapter_directory, "Textbook Chapters")
else:
    print(f"The textbook chapters directory {chapter_directory} does not exist.")

if os.path.exists(lecture_directory):
    process_lecture_transcripts(lecture_directory)
else:
    print(f"The lecture transcripts directory {lecture_directory} does not exist.")


File: Week1_Overview.txt
Length: 605 characters
Goals and Objectives:
- • Understand what cloud computing is and why it is important.
- • Get a picture of the economics of cloud computing.
- • Learn about Big Data (much more about Big Data in the second part of this course).
- Key Phrases/Concepts
- • Cloud computing
- • Big Data
- • Cloudonmics
- • Software defined architecture
- • IaaS: Infrastructure as a Service

Guiding Questions:

Key Phrases and Concepts:

--------------------------------------------------

The textbook chapters directory /content/drive/MyDrive/UIUC/CS_498/Week1/Textbook does not exist.


1_1_Cloud_Computing_Introduction.txt
1_2_Cloudonomics_Part_1.txt
1_3_Cloudonomics_Part_2.txt
1_4_Big_Data.txt
1_5_Summary_to_Cloud_Introduction.txt


## Step 3: Advanced Text Analysis and Retrieval System

This section of the project is dedicated to the development of an advanced text analysis and retrieval system. It encompasses several key functionalities designed to streamline the process of querying and extracting valuable insights from a comprehensive set of educational resources. The main components and their functionalities include:

- **Content Extraction from Overview Files:** Utilizing a custom function to parse and extract guiding questions from overview files. This ensures a focused approach in identifying key areas of study and topics of interest.

- **Preprocessing of Textual Data:** Implementation of a preprocessing routine involving lemmatization and removal of stopwords. This step is crucial for standardizing the text data, enhancing the effectiveness of subsequent analysis.

- **Lecture Transcripts and Textbook Chapters Processing:** Systematic processing of lecture transcripts and textbook chapters, converting them into a format suitable for advanced text analysis. This includes organizing lectures by weeks and chapters by their specific content.

- **TF-IDF Vectorization:** Application of the Term Frequency-Inverse Document Frequency (TF-IDF) vectorization to transform the textual data into a numerical format. This transformation is vital for enabling sophisticated similarity comparisons.

- **Cosine Similarity-Based Document Retrieval:** Utilization of cosine similarity measures to identify the most relevant documents in response to a query. This component is adept at retrieving the top matching lecture transcripts and textbook chapters, tailored to the specifics of each query.

- **Results Presentation:** Displaying the top matching documents, including lecture transcripts and the most relevant textbook chapter for each query. This provides users with immediate access to the most pertinent information, fostering an efficient and targeted learning experience.

This comprehensive system ensures that students and educators can swiftly locate the most relevant information, thereby enhancing the overall effectiveness of the learning process.
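
The ranking idea behind TF-IDF and cosine similarity can be sketched with the standard library alone. This is a simplified stand-in, not the scikit-learn implementation used in the project: it uses raw term counts and an unsmoothed `log(N / df)` weight, so its scores will differ from `TfidfVectorizer`. The corpus and the function names are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build simple TF-IDF vectors (raw term count times log(N / df))."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}
    vectors = [
        {term: count * idf[term] for term, count in Counter(tokens).items()}
        for tokens in tokenized
    ]
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(query, docs, names, top_n=2):
    """Return the top_n (name, score) pairs for a query."""
    vectors, idf = tfidf_vectors(docs)
    counts = Counter(query.lower().split())
    # Terms unseen in the corpus get zero weight.
    query_vec = {t: c * idf.get(t, 0.0) for t, c in counts.items()}
    scored = [(name, cosine(query_vec, vec)) for name, vec in zip(names, vectors)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


# Hypothetical three-document corpus.
names = ["lecture_a", "lecture_b", "chapter_c"]
docs = [
    "inverted index stores postings for each term",
    "probabilistic retrieval model ranks documents",
    "topic model discovers latent topics in text",
]
top = rank("retrieval model", docs, names)
print(top)
```

The document that shares both query terms ranks first; the one that shares only the more common term "model" ranks second with a much lower score, which is the effect the inverse document frequency weight is meant to produce.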

In [None]:
import os
import nltk
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
nltk.download('wordnet')
nltk.download('stopwords')

def extract_content(heading, text, next_heading=None):
    # re.DOTALL is passed to re.compile, so no inline (?s) is needed.
    # An inline global flag in the middle of a pattern is an error on Python 3.11+.
    if next_heading:
        # Capture everything between the heading and the next heading
        pattern = re.compile(rf"{re.escape(heading)}\s*\n(.*?)\n{re.escape(next_heading)}", re.DOTALL)
    else:
        # Without a next heading, capture up to the next line that starts with a capitalized word
        pattern = re.compile(rf"{re.escape(heading)}\s*\n(.*?)(?=\n[A-Z][a-z])", re.DOTALL)
    match = pattern.search(text)
    content = match.group(1).strip() if match else ""
    return [line.strip() for line in content.split('\n') if line.strip()]

# Function to process overview files and create a list of all guiding questions
def process_overviews_and_extract_questions(directory):
    all_guiding_questions = []
    files = sorted(os.listdir(directory))
    for file in files:
        file_path = os.path.join(directory, file)
        if os.path.isfile(file_path):
            with open(file_path, 'r') as f:
                text = f.read()

            guiding_questions = extract_content("Guiding Questions", text, "Key Phrases and Concepts")
            all_guiding_questions.extend(guiding_questions)

    return all_guiding_questions

# Build the stopword set and lemmatizer once instead of once per word
STOP_WORDS = set(stopwords.words('english'))
LEMMATIZER = WordNetLemmatizer()

def preprocess(text):
    words = text.lower().split()
    words = [LEMMATIZER.lemmatize(word) for word in words if word not in STOP_WORDS]
    return ' '.join(words)

# Directories
overview_directory = "/content/drive/MyDrive/UIUC/CS_498/Overview"
# chapter_directory is reused from the Step 2 cell; uncomment the next line to override it
#chapter_directory = "/content/drive/MyDrive/UIUC/CS_498/Week1/Textbook"
lecture_directory = "/content/drive/MyDrive/UIUC/CS_498/Lectures"

# Extract and preprocess overview questions
overview_questions = process_overviews_and_extract_questions(overview_directory)

# Read and preprocess transcripts and textbook chapters
documents = []
document_names = []

# Process lecture transcripts
for week_folder in sorted(os.listdir(lecture_directory)):
    week_path = os.path.join(lecture_directory, week_folder)
    if os.path.isdir(week_path):
        for filename in os.listdir(week_path):
            file_path = os.path.join(week_path, filename)
            if filename.endswith(".txt"):
                with open(file_path, 'r') as file:
                    documents.append(preprocess(file.read()))
                    document_names.append(f"{week_folder}/{filename}")

# Process textbook chapters
for filename in os.listdir(chapter_directory):
    if filename.endswith(".txt"):
        with open(os.path.join(chapter_directory, filename), 'r') as file:
            documents.append(preprocess(file.read()))
            document_names.append(f"Textbook/{filename}")

# TfidfVectorizer with preprocessing
vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = vectorizer.fit_transform(documents)

# Function to find top matching documents
def find_top_documents(query, top_n=3, is_textbook=False):
    # Apply the same preprocessing to the query that was applied to the documents
    query_tfidf = vectorizer.transform([preprocess(query)])
    cosine_similarities = cosine_similarity(query_tfidf, tfidf_matrix).flatten()

    if is_textbook:
        # Filter to include only textbook chapters
        textbook_indices = [i for i, doc_name in enumerate(document_names) if "Textbook/" in doc_name]
        textbook_similarities = [cosine_similarities[i] for i in textbook_indices]
        top_indices = sorted(range(len(textbook_similarities)), key=lambda i: textbook_similarities[i], reverse=True)[:top_n]
        top_indices = [textbook_indices[i] for i in top_indices]  # Map back to original indices
    else:
        top_indices = cosine_similarities.argsort()[-top_n:][::-1]

    results = [(document_names[i], cosine_similarities[i]) for i in top_indices]
    if is_textbook:
        # Return a single (name, score) pair, or None when no textbook chapter is indexed
        return results[0] if results else None
    return results


# Find and print top documents for each query
for query in overview_questions:
    top_lectures = find_top_documents(query)
    top_chapter = find_top_documents(query, top_n=1, is_textbook=True)

    print(f"Query: {query}")
    print("Top Lecture Transcripts:")
    for doc, score in top_lectures:
        if "Textbook" not in doc:
            print(f"Matching document: {doc} with score {float(score):.4f}")

    print("\nTop Textbook Chapter:")
    if top_chapter:
        chapter_doc, chapter_score = top_chapter
        print(f"Matching chapter: {chapter_doc} with score {float(chapter_score):.4f}")
    else:
        print("No matching chapter found.")
    print("\n" + "-"*50 + "\n")


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
overview_questions

[]

## Step 4: Parallelized Query Processing with GPT-4 and Contextual Document Retrieval

This section of the project focuses on efficiently answering guiding questions by leveraging the advanced capabilities of GPT-4, in conjunction with a context-based retrieval system. Utilizing parallel processing and the integration of contextual documents, this system aims to provide comprehensive and relevant answers to key questions derived from educational materials. Key features of this section include:

- **GPT-4 Query Function**: Utilizes OpenAI's GPT-4 to generate answers for the provided queries. This function forms the core of the query-answering mechanism, leveraging the advanced language understanding capabilities of GPT-4.

- **Contextual Information Gathering**: Before querying GPT-4, the system gathers relevant contextual information from a set of pre-identified documents. This includes the top lecture transcripts and the most relevant textbook chapters related to each query, ensuring that the responses are well-informed and pertinent.

- **Preprocessing and TF-IDF Vectorization**: Implements preprocessing routines and TF-IDF vectorization to transform textual data into a suitable format for analysis, enhancing the effectiveness of document retrieval based on query relevance.

- **Parallelized Processing**: Employs Python's `concurrent.futures.ThreadPoolExecutor` for parallel processing of multiple queries. This approach significantly improves efficiency, especially when dealing with multiple queries and large volumes of data.

- **Dynamic Response Generation**: For each guiding question, the system dynamically generates a prompt that includes the question and its associated contextual documents. This prompt is then used to query GPT-4, so that the response is informed by the most relevant course content.

- **Comprehensive Output**: The output for each query includes the guiding question, the documents used for context (both lecture transcripts and textbook chapters), and the answer generated by GPT-4. This structure provides a clear and thorough understanding of how each response was derived.

This system represents a sophisticated approach to automated query answering in educational settings, combining state-of-the-art AI with contextually rich academic resources to deliver insightful and accurate responses.
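
The overall flow (retrieve context, build a prompt, query the model, run several questions in parallel) can be sketched without calling any external API. In the snippet below, `fake_model` and `fake_retrieve` are stand-ins invented for illustration, and `executor.map` is used instead of `submit`/`as_completed` because it returns results in input order; this is one possible design, not necessarily the one used in the project.

```python
import concurrent.futures


def build_prompt(question, context):
    """Join a question and its (name, text) context pairs into one prompt."""
    parts = [f"Question: {question}", "", "Context:"]
    for name, text in context:
        parts.append(f"Document: {name}\n{text}\n")
    return "\n".join(parts)


def fake_model(prompt):
    """Stand-in for the real API call; it only reports the prompt length."""
    return f"(stub answer for a {len(prompt)}-character prompt)"


def fake_retrieve(question):
    """Hypothetical retriever that always returns the same two documents."""
    return [("Week1/intro.txt", "text retrieval basics"), ("Textbook/ch6.txt", "retrieval models")]


def answer_all(questions, retrieve, model):
    """Answer questions in parallel, returning results in input order."""
    def work(question):
        context = retrieve(question)
        return {
            "question": question,
            "documents_used": [name for name, _ in context],
            "answer": model(build_prompt(question, context)),
        }

    with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
        # executor.map preserves the order of `questions`.
        return list(executor.map(work, questions))


results = answer_all(["What is TF-IDF?", "What is BM25?"], fake_retrieve, fake_model)
for item in results:
    print(item["question"], "->", item["documents_used"])
```

Threads suit this workload because each task spends most of its time waiting on a network response rather than computing.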

In [None]:
import concurrent.futures
import openai

OPENAI_KEY = os.getenv("OPENAI_API_KEY")
def query_gpt4(prompt):
    client = openai.OpenAI(api_key=OPENAI_KEY)
    response = client.chat.completions.create(
        model="gpt-4-1106-preview",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=4096
    )
    return response.choices[0].message.content

def ask_gpt_with_context(question, context_documents, document_names):
    prompt = f"Question: {question}\n\nContext:\n"
    for doc, name in zip(context_documents, document_names):
        prompt += f"Document: {name}\n{doc}\n\n"

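    # Rough guard on prompt size. This counts characters, not tokens, so it only
    # approximates the model's context limit. Truncate from the end so that the
    # question at the start of the prompt is preserved.
    max_length = 128000
    if len(prompt) > max_length:
        prompt = prompt[:max_length]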

    return query_gpt4(prompt)

def process_question(query):
    try:
        top_documents = find_top_documents(query)
        top_chapter_result = find_top_documents(query, top_n=1, is_textbook=True)

        context_documents = []
        context_document_names = []

        # Process top lecture transcripts
        for doc_name, _ in top_documents:
            if doc_name in document_names:
                context_documents.append(documents[document_names.index(doc_name)])
                context_document_names.append(doc_name)
            else:
                print(f"Document '{doc_name}' not found in document_names.")

        # Process top textbook chapter
        if top_chapter_result:
            chapter_name, _ = top_chapter_result  # Unpack the result
            if chapter_name in document_names:
                chapter_content = documents[document_names.index(chapter_name)]
                context_documents.append(chapter_content)
                context_document_names.append(chapter_name)
            else:
                print(f"Chapter '{chapter_name}' not found in document_names.")

        answer = ask_gpt_with_context(query, context_documents, context_document_names)

        output = {
            "question": query,
            "documents_used": context_document_names,
            "answer": answer
        }

        return output
    except Exception as exc:
        print(f"An error occurred while processing the question '{query}': {exc}")
        return None


# Using ThreadPoolExecutor for parallel processing
outputs2 = []
with concurrent.futures.ThreadPoolExecutor() as executor:
    # Submit each query to the executor (only the first question here; widen the slice to process more)
    future_to_query = {executor.submit(process_question, query): query for query in overview_questions[:1]}

    # As each future completes, process its result
    for future in concurrent.futures.as_completed(future_to_query):
        query = future_to_query[future]
        try:
            output = future.result()
            if output is None:
                # process_question already reported the error for this query
                continue
            outputs2.append(output)
            print(f"Question: {query}")
            print(f"Documents used: {', '.join(output['documents_used'])}")
            print("Answer:", output['answer'])
            print("-------------------------------------------------------------------------------------------------------------\n\n")
        except Exception as exc:
            print(f"{query} generated an exception: {exc}")


Question: Develop your answers to the following guiding questions while watching the video lectures throughout the week.
Documents used: Week12/12_5_Contextual_Text_Mining_Contextual_Probabilistic_Latent_Semantic_Analysis.txt, Week4/4_1_Probabilistic_Retrieval_Model_Basic_Idea.txt, Week1/1_3_Text_Retrieval_Problem.txt, Textbook/3Text_Data_Understanding.txt
Answer: These documents provide a detailed insight into several concepts essential to understanding and working with text data and retrieval systems. While there's a substantial amount of information, I will focus on summarizing the key points from each document relevant to the guiding questions.

1. **Week12/12_5_Contextual_Text_Mining_Contextual_Probabilistic_Latent_Semantic_Analysis.txt**:
   - This document discusses Contextual Probabilistic Latent Semantic Analysis (CPLSA), which incorporates context variables (like time periods or locations) into topic modeling.
   - CPLSA aims to discover how topics and their coverage in text 

## Step 5 (Optional): Generating Guiding Questions and Key Concepts
This step creates study aids when guiding questions and key concepts are not readily available. It uses GPT-4 to generate these elements from the lecture transcripts, one week at a time.
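
As a rough sketch of what the generated text looks like and how it can be turned into structured data, the snippet below parses a made-up response in the `Q1:/A1:` and numbered `Term - Definition` format. The sample response and both patterns are assumptions for illustration; they are stricter than what a real model reply may require.

```python
import re

# Hypothetical model response in the requested format.
response = (
    "Guiding Questions:\n"
    "Q1: What is stemming?\n"
    "A1: Reducing words to a root form.\n"
    "Q2: What is a stopword?\n"
    "A2: A very common word that carries little meaning.\n"
    "\n"
    "Key Concepts:\n"
    "1. Tokenization - Splitting text into units.\n"
    "2. Lemmatization - Mapping a word to its dictionary form.\n"
)

# A question line followed by the answer line that carries the same number.
qa_pattern = re.compile(r"^Q(\d+):\s*(.+)\nA\1:\s*(.+)$", re.MULTILINE)
# A numbered line of the form "N. Term - Definition".
concept_pattern = re.compile(r"^\d+\.\s*(.+?)\s+-\s+(.+)$", re.MULTILINE)

questions = [{"Question": q, "Answer": a} for _, q, a in qa_pattern.findall(response)]
concepts = {term: definition for term, definition in concept_pattern.findall(response)}

print(questions)
print(concepts)
```

Model output does not always follow the requested format exactly, so a count of zero parsed questions is worth treating as a signal to inspect the raw response.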

In [None]:
def process_weekly_lectures(base_directory, weeks_to_process):
    all_responses = []

    # Function to generate the prompt for GPT
    def create_gpt_prompt(week_number, lectures_text):
        # Each instruction ends with a newline so that adjacent parts do not run together
        return (f"Week {week_number} Lectures for a Master-Level Text Information Systems Course:\n\n" +
                "FORMAT YOUR RESPONSE LIKE THIS:\n" +
                "Guiding Questions:\n" +
                "Q1: [Question 1]\nA1: [Answer to Question 1]\n" +
                "Q2: [Question 2]\nA2: [Answer to Question 2]\n\n" +
                "Key Concepts:\n" +
                "Identify Key Concepts mentioned in the lectures and provide a brief definition for each. Format the response as a numbered list, for example:\n" +
                "1. [Term1] - [Definition1]\n" +
                "2. [Term2] - [Definition2]\n" +
                "______________________________\n" +
                "Here are my lectures for the week:\n" +
                lectures_text +
                "\nBased on the above lectures, please generate Guiding Questions and Key Concepts as follows:\n\n" +
                "AGAIN, please format the response in this EXACT same format:\n" +
                "Guiding Questions:\n" +
                "Q1: [Question 1]\nA1: [Answer to Question 1]\n" +
                "Q2: [Question 2]\nA2: [Answer to Question 2]\n\n" +
                "Key Concepts:\n" +
                "Identify Key Concepts mentioned in the lectures and provide a brief definition for each. Format the response as a numbered list, for example:\n" +
                "1. [Term1] - [Definition1]\n" +
                "2. [Term2] - [Definition2]\n" +
                "______________________________"
            )

    # Iterate over the specified weeks
    for week in weeks_to_process:
        week_folder = f"Week{week}"
        week_path = os.path.join(base_directory, week_folder)
        if os.path.isdir(week_path):
            week_lectures = ""
            for filename in sorted(os.listdir(week_path)):
                file_path = os.path.join(week_path, filename)
                if filename.endswith(".txt"):
                    with open(file_path, 'r') as file:
                        week_lectures += file.read() + "\n\n"

            prompt = create_gpt_prompt(week, week_lectures)
            response_text = "Week " + str(week) + " Overview: \n" + query_gpt4(prompt)
            all_responses.append(response_text)

    return all_responses

# Example usage
base_directory = "/content/drive/MyDrive/CS_410/Text Files/Lectures"
selected_weeks = [1, 2]  # Example: process only weeks 1 and 2
responses = process_weekly_lectures(base_directory, selected_weeks)


In [None]:
def parse_guiding_questions(responses):
    all_parsed_questions = []
    question_pattern = re.compile(r"(?:Q\d*|[-•]|\d+\))\:? ([^\n]+)\n(?:A\d*|[-•]|\d+\))\:? ([^\n]+)")

    for response in responses:
        questions = question_pattern.findall(response)
        for question, answer in questions:
            all_parsed_questions.append({'Question': question.strip(), 'Answer': answer.strip()})

    return all_parsed_questions

def parse_key_concepts(responses):
    all_parsed_concepts = []
    concept_pattern = re.compile(r"(\d+\.|[-•]) ([^\-•\n]+) - ([^\n]+)")

    for response in responses:
        concepts = concept_pattern.findall(response)
        for _, term, definition in concepts:
            all_parsed_concepts.append({'Term': term.strip(), 'Definition': definition.strip()})

    return all_parsed_concepts


parsed_questions = parse_guiding_questions(responses)
parsed_concepts = parse_key_concepts(responses)

# Printing the parsed questions and concepts
print("Guiding Questions:")
for q in parsed_questions:
    print(f"Q: {q['Question']}")
    print(f"A: {q['Answer']}\n")

print("Key Concepts:")
for c in parsed_concepts:
    print(f"{c['Term']} - {c['Definition']}")


Guiding Questions:
Q: What is Natural Language Processing (NLP) and why is it important in text retrieval?
A: Natural Language Processing (NLP) is the main technique for processing natural languages to help computers understand the text data they process. It involves lexical analysis, semantic parsing, and inference among other tasks. It is important in text retrieval because understanding the structure and meaning of text is necessary for effectively finding and organizing relevant information.

Q: What are some of the main challenges in natural language processing that affect text retrieval?
A: Some main challenges in natural language processing that impact text retrieval include word-level ambiguity, where a word has multiple meanings or syntactic categories (like "design" functioning as a noun or a verb); syntactical ambiguities, where a sentence could have multiple interpretations; anaphora resolution, deciding what a pronoun or reference word stands for; and presuppositions, wher

In [None]:
import os
import re

def save_week_overviews(responses, output_folder):
    if not os.path.exists(output_folder):
        os.makedirs(output_folder)  # Create the folder if it doesn't exist

    for response in responses:
        # Use regular expression to find the week number
        match = re.search(r"Week (\d+)", response)
        if match:
            week_number = match.group(1)
        else:
            # Fallback to find any number if "Week X" format isn't found
            match = re.search(r"\d+", response)
            week_number = match.group() if match else "Unknown"

        file_name = f"Week{week_number}.txt"
        file_path = os.path.join(output_folder, file_name)

        # Write the response to the file
        with open(file_path, 'w') as file:
            file.write(response)


output_folder = "/content/drive/MyDrive/CS_410/Text Files/Generated_Overviews"  # Replace with your desired output folder path
save_week_overviews(responses, output_folder)


## Step 6: User Interaction and Query Processing

In this step, we focus on the user interface and the core functionality of our educational resource retrieval system. The following functions and processes are implemented to facilitate user interaction:

- **Display Menu:** We provide a menu with three options: get guiding questions for a specific week, ask a question, or quit the program.

- **Get Guiding Questions:** Users first choose between the provided and the generated guiding questions, then enter a week number. The system reads the corresponding text file and displays its contents, giving learners quick access to the relevant course materials.

- **Ask a Question:** Users can input their questions, and the system hands each query to a worker thread that retrieves documents and queries GPT-4 for an answer. The results presented to the user include the top documents used and the generated answer.

- **Main Program Loop:** The main program loop redisplays the menu after each action, so users can move between the available options until they choose to exit the program.

This step integrates the user interface with the information retrieval and AI-driven question-answering capabilities, making our educational resource system accessible and efficient.
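
The "Ask a Question" flow above relies on a `process_question` function defined elsewhere in the notebook; the menu code only assumes that it returns a dictionary with `documents_used` and `answer` keys. As a rough sketch of that contract, and not the notebook's actual implementation, the example below wires a hypothetical `retrieve` callable and a hypothetical `generate` callable together and runs the call in a worker thread. Every name other than the two dictionary keys is an assumption made for illustration.

```python
import concurrent.futures

def process_question_sketch(question, retrieve, generate, top_k=3):
    """Retrieve the top_k documents for a question, then generate an answer from them."""
    documents = retrieve(question)[:top_k]  # list of (name, text) pairs
    context = "\n\n".join(text for _, text in documents)
    answer = generate(question, context)
    return {
        "documents_used": [name for name, _ in documents],
        "answer": answer,
    }

# Hypothetical stand-ins for the real retrieval system and the GPT-4 call
def fake_retrieve(question):
    return [
        ("Week3_Lecture1.txt", "Precision and recall measure retrieval quality."),
        ("Week3_Lecture2.txt", "MAP averages precision over queries."),
    ]

def fake_generate(question, context):
    return f"Answer based on {len(context.split())} words of context."

# Run the pipeline in a worker thread and wait for the result, with a timeout
with concurrent.futures.ThreadPoolExecutor(max_workers=1) as executor:
    future = executor.submit(
        process_question_sketch, "What is MAP?", fake_retrieve, fake_generate
    )
    output = future.result(timeout=30)

print(output["documents_used"])
print(output["answer"])
```

Submitting a single task and immediately waiting on it keeps the interface simple but does not overlap any work; a thread pool mainly pays off when several questions or retrieval calls are submitted at once.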


In [None]:
import os
import concurrent.futures

# Function to display the main menu and get the user's choice
def display_menu():
    print("1. Get Guiding Questions")
    print("2. Ask a Question")
    print("3. Quit")
    choice = input("Enter your choice (1-3): ")
    return choice

# Function to choose the type of guiding questions (provided or generated)
def choose_question_type():
    print("1. Provided Guiding Questions")
    print("2. Generated Guiding Questions")
    choice = input("Choose the type of guiding questions (1-2): ")
    return choice

# Function to list the available week numbers for a guiding-questions folder
def get_week_options(base_path, subfolder):
    """Scan the directory to get available week options, sorted numerically."""
    week_options = []
    full_path = os.path.join(base_path, subfolder)
    if os.path.exists(full_path):
        for filename in os.listdir(full_path):
            if filename.startswith("Week") and filename.endswith(".txt"):
                # Extract the week number from the filename; skip names such as
                # "WeekUnknown.txt" whose middle part is not a plain number
                week_part = filename[4:-4]
                if week_part.isdecimal():
                    week_options.append(int(week_part))

    # Sort the week numbers in ascending numerical order
    week_options = sorted(week_options)

    # Convert back to strings
    week_options = [str(week) for week in week_options]
    return week_options

# Function to get and display guiding questions for a specified week
def get_guiding_questions(question_type):
    base_path = "/content/drive/MyDrive/CS_410/Text Files/"
    subfolder = "Overview" if question_type == '1' else "Generated_Overviews"
    week_options = get_week_options(base_path, subfolder)

    if week_options:
        print(f"Available weeks: {', '.join(week_options)}")
        # Strip stray whitespace so that an entry like " 3 " still matches Week3.txt
        week_number = input("Enter one of the available week numbers: ").strip()
        file_path = os.path.join(base_path, subfolder, f"Week{week_number}.txt")

        if os.path.exists(file_path):
            with open(file_path, 'r') as file:
                questions = file.read()
                print(questions + "\n\n")
        else:
            print(f"No guiding questions found for week {week_number}.\n\n\n")
    else:
        print(f"No guiding questions available in the {subfolder} folder.\n\n\n")

# Function to ask a question, retrieve documents, and query GPT for an answer
def ask_question():
    user_question = input("Enter your question: ")

    # Process the question using concurrent futures
    with concurrent.futures.ThreadPoolExecutor() as executor:
        future = executor.submit(process_question, user_question)
        try:
            output = future.result()
            print(f"Top Documents: {', '.join(output['documents_used'])}")
            print(f"Answer: {output['answer']}\n\n")
        except Exception as exc:
            print(f"An error occurred: {exc}")

# Main program function
def main():
    while True:
        choice = display_menu()
        if choice == '1':
            question_type = choose_question_type()
            get_guiding_questions(question_type)
        elif choice == '2':
            ask_question()
        elif choice == '3':
            print("Exiting the program.")
            break
        else:
            print("Invalid choice. Please enter 1, 2, or 3.")

if __name__ == "__main__":
    main()


1. Get Guiding Questions
2. Ask a Question
3. Quit
Enter your choice (1-3): 1
1. Provided Guiding Questions
2. Generated Guiding Questions
Choose the type of guiding questions (1-2): 1
Available weeks: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
Enter one of the available week numbers: 3
Goals and Objectives
After you actively engage in the learning experiences in this module, you should be able to:

Explain the Cranfield evaluation methodology and how it works for evaluating a text retrieval system.

Explain how to evaluate a set of retrieved documents and how to compute precision, recall, and F1.

Explain how to evaluate a ranked list of documents.

Explain how to compute and plot a precision-recall curve.

Explain how to compute average precision and mean average precision (MAP).

Explain how to evaluate a ranked list with multi-level relevance judgments.

Explain how to compute normalized discounted cumulative gain.

Explain why it is important to perform statistical significance tests.
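
In the sample session above, the entered week number is used as typed. One way to guard against a week that is not in the list is to re-prompt until the reply matches an available option. The helper below is a sketch under that assumption; `prompt_for_option`, its `input_fn` parameter, and the simulated replies are hypothetical additions and do not appear in the notebook.

```python
def prompt_for_option(prompt, options, input_fn=input, max_attempts=3):
    """Ask until the reply is one of options; return None after max_attempts failures."""
    for _ in range(max_attempts):
        reply = input_fn(prompt).strip()
        if reply in options:
            return reply
        print(f"Please enter one of: {', '.join(options)}")
    return None

# Simulated user who first enters an unavailable week, then a valid one
replies = iter(["13", " 3 "])
week = prompt_for_option(
    "Enter one of the available week numbers: ",
    [str(n) for n in range(1, 13)],
    input_fn=lambda _prompt: next(replies),
)
print(week)
```

Passing the input function as a parameter keeps the helper usable in the interactive notebook (where it defaults to `input`) while allowing it to be exercised with scripted replies.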


# Add On
## Create a GPT-4 Summary and Answer Guiding Questions

In [None]:
import os
import re
import openai
from concurrent.futures import ThreadPoolExecutor

# Assumes query_gpt4 and the other helper functions from the earlier cells are already defined.

# Function to read a file
def read_file(file_path):
    with open(file_path, 'r') as file:
        return file.read()

# Function to construct the prompt for GPT-4
def construct_prompt(lecture_transcript, overview_content):
    prompt = (
        "You are a highly knowledgeable assistant tasked with analyzing a master-level lecture on Cloud Computing Applications. "
        "Using only the provided lecture transcript and overview document, you are to perform a detailed analysis. "
        "Remember to reference specific parts of the transcript in your analysis and avoid drawing on outside information. "
        "Your responses should demonstrate a deep understanding appropriate for a master's program level.\n\n"
        "Lecture Transcript:\n"
        f"\"\"\"{lecture_transcript}\"\"\"\n\n"
        "Overview Content (containing Key Terms and Guiding Questions):\n"
        f"\"\"\"{overview_content}\"\"\"\n\n"
        "Please provide the following analysis:\n"
        "1. Summary: In a clear and concise format, summarize the lecture in a passage of about 300 words. Highlight the main points and crucial ideas, ensuring that your summary is grounded in the specifics of the lecture transcript.\n\n"
        "2. Guiding Questions: For each of the questions listed in the overview document, provide a direct and supported answer. Use quotations from the transcript to substantiate your answers. Detail your reasoning process for each response to demonstrate how you arrived at your conclusion.\n\n"
        "3. Key Terms: Identify and explain the significance of each key term listed in the overview. Discuss how these terms are applied in the context of the lecture transcript, providing specific examples or explanations from the transcript.\n\n"
        "Aim to format your analysis with clear, distinct sections for each part, using bullet points or numbered lists where appropriate. Your analysis is crucial for understanding and reviewing the core insights of the lecture. Thank you for your detailed work."
    )
    return prompt

# Function to save the output to a text file
def save_output(file_path, content):
    with open(file_path, 'w') as file:
        file.write(content)

# Function to process all files for a single week and create one final analysis with GPT-4
def process_week_final_prompt(week_num, base_directory):
    week_directory = f"{base_directory}/Week{week_num}"
    overview_directory = f"{week_directory}/Overview"
    transcripts_directory = f"{week_directory}/Transcripts"

    # Read the overview file
    overview_file_path = f"{overview_directory}/Week{week_num}_Overview.txt"
    overview_content = read_file(overview_file_path)

    # Initialize a variable to store all lecture transcripts for the week
    all_lectures_transcript = ""

    # Get all lecture transcript files for the week, sorted by name so that
    # the lectures are concatenated in a stable order
    lecture_files = sorted(f for f in os.listdir(transcripts_directory) if f.endswith('.txt'))

    # Concatenate each lecture transcript
    for lecture_file in lecture_files:
        lecture_file_path = f"{transcripts_directory}/{lecture_file}"
        lecture_transcript = read_file(lecture_file_path)
        all_lectures_transcript += f"{lecture_transcript}\n\n"

    # Construct the final prompt for GPT-4
    final_prompt = construct_prompt(all_lectures_transcript, overview_content)

    # Query GPT-4 with the final prompt
    final_gpt_response = query_gpt4(final_prompt)

    # Save the final GPT-4 response to a text file for the whole week
    final_output_file_path = f"{week_directory}/Week{week_num}_Final_GPT4_Analysis.txt"
    save_output(final_output_file_path, final_gpt_response)

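# Example usage: base_directory is the folder that contains the Week<N> subfolders.
# The function appends "Week{week_num}" itself, so with the path below it reads from
# ".../CS_498/Week1/Week1"; drop the trailing "Week1" if that is not your folder layout.
process_week_final_prompt(1, "/content/drive/MyDrive/UIUC/CS_498/Week1")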
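
`process_week_final_prompt` calls `query_gpt4`, which is defined elsewhere in the notebook and is not shown here. For readers who want to run this cell in isolation, the following is a minimal sketch of what such a function might look like, assuming the chat-completions interface of the `openai` Python package (version 1.x), where a client exposes `client.chat.completions.create(...)`. The function name `query_chat_model`, the injected `client` parameter, and the stand-in client used to exercise it are all hypothetical; the notebook's own `query_gpt4` may differ.

```python
from types import SimpleNamespace

def query_chat_model(prompt, client, model="gpt-4"):
    """Send a single user message and return the text of the first reply."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Stand-in client that mimics the response shape, so the sketch runs without network access
def _fake_create(model, messages):
    reply = f"[{model}] received {len(messages[0]['content'])} characters"
    message = SimpleNamespace(content=reply)
    return SimpleNamespace(choices=[SimpleNamespace(message=message)])

fake_client = SimpleNamespace(
    chat=SimpleNamespace(completions=SimpleNamespace(create=_fake_create))
)

result = query_chat_model("Summarize the Week 1 lectures.", fake_client)
print(result)
```

With the real package, a client created via `from openai import OpenAI` and `client = OpenAI()` would be passed in place of `fake_client`, with the API key supplied through the `OPENAI_API_KEY` environment variable.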

