Code to parse Azure cloud service data

This implementation includes several key features:
Web Scraping: Uses requests and BeautifulSoup4 to fetch and parse the Azure VM availability documentation.
Content Processing: Removes unnecessary HTML elements, Organizes content into sections based on headers, Maintains the structure of the documentation
Question Answering: Uses BERT model fine-tuned on SQuAD (Stanford Question Answering Dataset), Finds the most relevant section for each question, Generates answers based on the context
Interactive Interface: Provides a simple command-line chat interface, Allows users to ask questions until they choose to quit
To use the chatbot, simply run the script and start asking questions about Azure VM availability. For example:

# Example usage
chatbot = AzureVMDocChatbot()

# Ask questions
print(chatbot.answer_question("What are availability zones in Azure?"))
print(chatbot.answer_question("How does Azure Site Recovery help with business continuity?"))

The chatbot will provide answers based on the documentation content. The answers are generated using the BERT model, which understands context and can provide relevant responses based on the documentation.
Some example questions you can ask:
What are availability zones?
How does Azure ensure storage redundancy?
What is an availability set?
How does Azure Site Recovery work?
What is a Virtual Machine Scale Set?
Note that the quality of answers depends on:
The relevance of the section matching
The quality of the question
The presence of the information in the documentation
You can enhance this implementation by:
Adding better error handling
Implementing more sophisticated section matching
Adding support for follow-up questions
Implementing a web interface instead of command-line
Adding support for multiple documentation sources
Would you like me to explain any particular part of the implementation in more detail?

Import required libraries

In [1]:
import requests
from bs4 import BeautifulSoup
import re
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import torch
import numpy as np
import pandas as pd
import time
from tqdm import tqdm
import logging

  from .autonotebook import tqdm as notebook_tqdm

A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.0.1 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.

If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.

Traceback (most recent call last):  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/opt/miniconda3/envs/cursor-conda-python-3.11.11/lib/python3.11/site-packages/ipykernel_launcher.py", line 18, in <module>
    app.launch_new_instance()
  File "/opt/miniconda3/envs/cursor-conda-python-3.11.11/lib/python3.11/site-packages/traitlets/config/application.py", line 1075, in launch_instance
    app.start()
  File "/opt/miniconda3/envs/cursor-conda-pyth

Next steps

In [2]:


class AzureVMDocChatbot:
    def __init__(self, timeout=30):  # Add timeout parameter
        self.url = "https://learn.microsoft.com/en-us/azure/virtual-machines/availability"
        self.doc_content = ""
        self.sections = {}
        self.timeout = timeout
        
        # Set up logging
        logging.basicConfig(level=logging.INFO)
        self.logger = logging.getLogger(__name__)
        
        # Initialize the model and tokenizer with progress feedback
        self.logger.info("Loading BERT model and tokenizer...")
        start_time = time.time()
        try:
            self.tokenizer = AutoTokenizer.from_pretrained(
                "bert-large-uncased-whole-word-masking-finetuned-squad",
                local_files_only=False  # Allow downloading if not cached
            )
            self.model = AutoModelForQuestionAnswering.from_pretrained(
                "bert-large-uncased-whole-word-masking-finetuned-squad",
                local_files_only=False
            )
            self.logger.info(f"Model loaded successfully in {time.time() - start_time:.2f} seconds!")
        except Exception as e:
            self.logger.error(f"Error loading model: {str(e)}")
            raise

        # Load and parse the content
        self.load_and_parse_content()

    def load_and_parse_content(self):
        """Fetch and parse the Azure VM documentation with timeout and progress indicators."""
        try:
            self.logger.info(f"Fetching documentation from {self.url}")
            start_time = time.time()
            
            # Use timeout for the request
            response = requests.get(self.url, timeout=self.timeout)
            response.raise_for_status()
            
            fetch_time = time.time() - start_time
            self.logger.info(f"Fetched document in {fetch_time:.2f} seconds")
            
            self.logger.info("Parsing content...")
            parse_start_time = time.time()
            
            soup = BeautifulSoup(response.text, 'html.parser')
            
            # Extract main content
            main_content = soup.find('main')
            if main_content:
                # Remove unnecessary elements
                for element in main_content.find_all(['script', 'style', 'nav']):
                    element.decompose()
                
                # Extract text content
                self.doc_content = main_content.get_text(separator=' ', strip=True)
                
                # Parse sections with progress bar
                headers = main_content.find_all(['h1', 'h2', 'h3'])
                self.logger.info(f"Found {len(headers)} sections to parse")
                
                current_section = ""
                current_content = []
                
                for header in tqdm(headers, desc="Parsing sections"):
                    if current_section:
                        self.sections[current_section] = ' '.join(current_content)
                    current_section = header.get_text(strip=True)
                    current_content = []
                    
                    next_element = header.find_next_sibling()
                    while next_element and not next_element.name in ['h1', 'h2', 'h3']:
                        if next_element.get_text(strip=True):
                            current_content.append(next_element.get_text(strip=True))
                        next_element = next_element.find_next_sibling()
                
                # Add the last section
                if current_section:
                    self.sections[current_section] = ' '.join(current_content)
                
                parse_time = time.time() - parse_start_time
                self.logger.info(f"Successfully parsed {len(self.sections)} sections in {parse_time:.2f} seconds!")
                
                # Print first few sections as verification
                self.logger.info("\nFirst few sections found:")
                for i, (section, _) in enumerate(list(self.sections.items())[:3]):
                    self.logger.info(f"{i+1}. {section}")
                    
        except requests.Timeout:
            self.logger.error(f"Timeout error: Request took longer than {self.timeout} seconds")
            raise
        except requests.RequestException as e:
            self.logger.error(f"Error fetching documentation: {str(e)}")
            raise
        except Exception as e:
            self.logger.error(f"Error parsing content: {str(e)}")
            raise

    def find_most_relevant_section(self, question):
        """Find the most relevant section for the given question."""
        max_score = 0
        best_section = None
        
        self.logger.info("Searching for relevant section...")
        start_time = time.time()
        
        for section, content in self.sections.items():
            # Simple relevance scoring based on word overlap
            question_words = set(question.lower().split())
            section_words = set(section.lower().split() + content.lower().split())
            score = len(question_words.intersection(section_words))
            
            if score > max_score:
                max_score = score
                best_section = content
        
        search_time = time.time() - start_time
        self.logger.info(f"Found relevant section in {search_time:.2f} seconds")
        return best_section

    def answer_question(self, question):
        """Answer a question about Azure VM availability."""
        self.logger.info(f"\nProcessing question: {question}")
        start_time = time.time()
        
        # Find the most relevant section
        context = self.find_most_relevant_section(question)
        if not context:
            return "I'm sorry, I couldn't find relevant information to answer your question."

        # Prepare the input for the model
        self.logger.info("Generating answer...")
        inputs = self.tokenizer(question, context, return_tensors="pt", max_length=512, truncation=True)
        
        # Get the answer
        with torch.no_grad():
            outputs = self.model(**inputs)
        
        # Process the model output
        answer_start = torch.argmax(outputs.start_logits)
        answer_end = torch.argmax(outputs.end_logits)
        
        tokens = self.tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
        answer = tokens[answer_start:answer_end + 1]
        
        # Convert tokens to string
        answer = self.tokenizer.convert_tokens_to_string(answer)
        
        total_time = time.time() - start_time
        self.logger.info(f"Answer generated in {total_time:.2f} seconds")
        
        return answer if answer else "I'm sorry, I couldn't generate a good answer for that question."

# Test the improved implementation
def test_chatbot():
    print("Initializing chatbot...")
    chatbot = AzureVMDocChatbot(timeout=30)  # Set 30-second timeout
    
    # Test a simple question
    test_question = "What are availability zones?"
    print(f"\nTesting question: {test_question}")
    answer = chatbot.answer_question(test_question)
    print(f"Answer: {answer}")

if __name__ == "__main__":
    test_chatbot()

INFO:__main__:Loading BERT model and tokenizer...


Initializing chatbot...


Some weights of the model checkpoint at bert-large-uncased-whole-word-masking-finetuned-squad were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
INFO:__main__:Model loaded successfully in 3.18 seconds!
INFO:__main__:Fetching documentation from https://learn.microsoft.com/en-us/azure/virtual-machines/availability
INFO:__main__:Fetched document in 0.56 seconds
INFO:__main__:Parsing content...
INFO:__main__:Found 8 sections to parse
Parsing sections: 100


Testing question: What are availability zones?


INFO:__main__:Answer generated in 7.04 seconds


Answer: a physically separate zone, within an azure region


In [3]:
# Cell 3: Initialize the chatbot
chatbot = AzureVMDocChatbot()

INFO:__main__:Loading BERT model and tokenizer...
Some weights of the model checkpoint at bert-large-uncased-whole-word-masking-finetuned-squad were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
INFO:__main__:Model loaded successfully in 1.18 seconds!
INFO:__main__:Fetching documentation from https://learn.microsoft.com/en-us/azure/virtual-machines/availability
INFO:__main__:Fetched document in 0.21 seconds
INFO:__main__:Parsing content...
INFO:__main

In [4]:
# Cell 4: Visualize the parsed sections
sections_df = pd.DataFrame(list(chatbot.sections.items()), columns=['Section', 'Content'])
print(f"Total number of sections: {len(sections_df)}")
sections_df.head()

Total number of sections: 8


Unnamed: 0,Section,Content
0,Availability options for Azure Virtual Machines,Article08/22/20249 contributorsFeedback Applie...
1,Availability zones,Availability zonesexpands the level of control...
2,Virtual Machines Scale Sets,Azure virtual machine scale setslet you create...
3,Availability sets,Anavailability setis a logical grouping of VMs...
4,Load balancer,Combine theAzure Load Balancerwith availabilit...


In [5]:
# Cell 5: Test some example questions
test_questions = [
    "What are availability zones?",
    "How does Azure ensure storage redundancy?",
    "What is an availability set?",
    "How does Azure Site Recovery work?",
    "What is a Virtual Machine Scale Set?",
    "What is the difference between availability zones and availability sets?",
    "How many availability zones are there in an Azure region?",
    "How does Azure Site Recovery help with business continuity?",  
    "How does Azure Load Balancer work with availability zones?"
]

for question in test_questions:
    print(f"\nQ: {question}")
    print(f"A: {chatbot.answer_question(question)}")
    print("-" * 80)

INFO:__main__:
Processing question: What are availability zones?
INFO:__main__:Searching for relevant section...
INFO:__main__:Found relevant section in 0.00 seconds
INFO:__main__:Generating answer...



Q: What are availability zones?


INFO:__main__:Answer generated in 7.51 seconds
INFO:__main__:
Processing question: How does Azure ensure storage redundancy?
INFO:__main__:Searching for relevant section...
INFO:__main__:Found relevant section in 0.00 seconds
INFO:__main__:Generating answer...


A: a physically separate zone, within an azure region
--------------------------------------------------------------------------------

Q: How does Azure ensure storage redundancy?


INFO:__main__:Answer generated in 1.56 seconds
INFO:__main__:
Processing question: What is an availability set?
INFO:__main__:Searching for relevant section...
INFO:__main__:Found relevant section in 0.00 seconds
INFO:__main__:Generating answer...


A: redundancy ensures that your storage account meets its availability and durability targets even in the face of failures
--------------------------------------------------------------------------------

Q: What is an availability set?


INFO:__main__:Answer generated in 1.27 seconds
INFO:__main__:
Processing question: How does Azure Site Recovery work?
INFO:__main__:Searching for relevant section...
INFO:__main__:Found relevant section in 0.00 seconds
INFO:__main__:Generating answer...


A: a physically separate zone, within an azure region
--------------------------------------------------------------------------------

Q: How does Azure Site Recovery work?


INFO:__main__:Answer generated in 1.39 seconds
INFO:__main__:
Processing question: What is a Virtual Machine Scale Set?
INFO:__main__:Searching for relevant section...
INFO:__main__:Found relevant section in 0.00 seconds
INFO:__main__:Generating answer...


A: replicates workloads running on physical and virtual machines ( vms ) from a primary site to a secondary location
--------------------------------------------------------------------------------

Q: What is a Virtual Machine Scale Set?


INFO:__main__:Answer generated in 0.93 seconds


A: create and manage a group of load balanced vms
--------------------------------------------------------------------------------


In [None]:
# Cell 6: Interactive question answering
from IPython.display import clear_output

def interactive_qa():
    while True:
        question = input("Ask a question (or type 'quit' to exit): ")
        if question.lower() == 'quit':
            break
            
        clear_output(wait=True)
        print(f"Q: {question}")
        print(f"A: {chatbot.answer_question(question)}")
        print("\n" + "-" * 80 + "\n")

interactive_qa()

INFO:__main__:
Processing question: 
INFO:__main__:Searching for relevant section...
INFO:__main__:Found relevant section in 0.00 seconds


Q: 
A: I'm sorry, I couldn't find relevant information to answer your question.

--------------------------------------------------------------------------------

