To demonstrate your familiarity with the OpenAI API and Ollama, build a tool that takes a technical question
and responds with an explanation. You will be able to use this tool yourself during the course!

In [14]:
# imports

import os
from dotenv import load_dotenv
from openai import OpenAI
from IPython.display import display, Markdown, update_display
import requests



In [2]:
MODEL_GPT = 'gpt-4.1-mini'
MODEL_OLLAMA = 'llama3.2'

In [3]:
load_dotenv(override=True)
api_key = os.getenv('OPENAI_API_KEY')

if not api_key:
    print("No API key was found - please head over to the troubleshooting notebook in this folder to identify & fix!")
elif not api_key.startswith("sk-proj-"):
    print("An API key was found, but it doesn't start sk-proj-; please check you're using the right key - see troubleshooting notebook")
else:
    print("API key found and looks good so far!")

API key found and looks good so far!


In [4]:
tutor_system_prompt = """
You are a technical tutor assistant.
You will help students with their coding problems, and you will provide a step by step explanation.
"""

In [5]:
def get_user_prompt():
    prompt = """
Here is a code snippet that I would like you to explain:

def fetch_website_links(url):
    '''
    Return the links on the website at the given url
    I realize this is inefficient as we're parsing twice! This is to keep the code in the lab simple.
    Feel free to use a class and optimize it!
    '''
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.content, "html.parser")
    links = [link.get("href") for link in soup.find_all("a")]
    return [link for link in links if link]
"""
    return prompt
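
`get_user_prompt` hard-codes a single snippet, while the goal is a tool that accepts any technical question. A minimal sketch of a parameterized variant follows; the name `build_user_prompt` and its two arguments are hypothetical and not part of the original notebook:

```python
import textwrap


def build_user_prompt(question, code=""):
    """Combine a free-form technical question with an optional code snippet.

    Hypothetical helper: the original notebook only has get_user_prompt().
    """
    question = question.strip()
    if not question:
        raise ValueError("question must not be empty")

    # Remove shared indentation and surrounding blank lines from the snippet.
    snippet = textwrap.dedent(code).strip("\n")
    if not snippet:
        return question
    return f"{question}\n\n{snippet}"
```

The result can be passed as the `content` of the user message in place of `get_user_prompt()`.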

In [None]:
def explain_code_openai():
    openai = OpenAI()
    response = openai.chat.completions.create(
        model=MODEL_GPT,
        messages=[
            {"role": "system", "content": tutor_system_prompt},
            {"role": "user", "content": get_user_prompt()}
        ]
    )

    result = response.choices[0].message.content
    display(Markdown(result))
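
The imports include `update_display`, which this function never uses. One plausible use is streaming the reply as it arrives. The sketch below assumes the Chat Completions `stream=True` option and the chunk shape `chunk.choices[0].delta.content`; `accumulate_stream` and `explain_code_openai_streaming` are hypothetical names, not part of the original notebook:

```python
def accumulate_stream(chunks):
    """Yield the reply text accumulated so far after each non-empty chunk."""
    text = ""
    for chunk in chunks:
        if not chunk.choices:
            continue  # some chunks carry no choices
        delta = chunk.choices[0].delta.content
        if delta:  # the delta can be None, e.g. on the final chunk
            text += delta
            yield text


def explain_code_openai_streaming(system_prompt, user_prompt, model="gpt-4.1-mini"):
    # Imported here so accumulate_stream() can be used without these packages.
    from IPython.display import Markdown, display, update_display
    from openai import OpenAI

    stream = OpenAI().chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        stream=True,
    )
    # Create an empty output area, then redraw it as the text grows.
    handle = display(Markdown(""), display_id=True)
    for text in accumulate_stream(stream):
        update_display(Markdown(text), display_id=handle.display_id)
```

Calling `explain_code_openai_streaming(tutor_system_prompt, get_user_prompt())` should render the same explanation incrementally instead of all at once.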

    

In [10]:
explain_code_openai()

Certainly! Let's go through the code step-by-step.

### Function: `fetch_website_links(url)`

#### Purpose:
This function takes a URL as input and returns a list of all the hyperlinks (`href` attributes) found on the web page at that URL.

---

### Step-by-step explanation:

1. **Function Definition and Docstring:**
   ```python
   def fetch_website_links(url):
       '''
       Return the links on the website at the given url
       I realize this is inefficient as we're parsing twice! This is to keep the code in the lab simple.
       Feel free to use a class and optimize it!
       '''
   ```
   - The function `fetch_website_links` expects one parameter called `url`.
   - The docstring explains that it returns all the links on the website at the specified URL.
   - It also mentions there's an inefficiency due to parsing the content twice (though in this snippet it's not explicitly shown), and hints at an optimization with classes if desired.

2. **Making an HTTP GET Request:**
   ```python
   response = requests.get(url, headers=headers)
   ```
   - This line sends an HTTP GET request to the given `url`.
   - The `headers` argument is provided, likely to mimic a browser or include necessary headers (user-agent, etc.), but `headers` must be defined somewhere outside this snippet.
   - `requests.get` returns a `response` object containing the web page data.

3. **Parsing the HTML Content:**
   ```python
   soup = BeautifulSoup(response.content, "html.parser")
   ```
   - `response.content` contains the raw bytes of the webpage fetched.
   - `BeautifulSoup` is a popular Python library for parsing HTML and XML documents.
   - Here, the content is parsed using the `"html.parser"` which is Python's built-in HTML parser.
   - `soup` now contains a parsed tree of the HTML content.

4. **Extracting all `<a>` Tags:**
   ```python
   links = [link.get("href") for link in soup.find_all("a")]
   ```
   - `soup.find_all("a")` finds all anchor tags (`<a>`) in the HTML.
   - The list comprehension iterates over each anchor tag and gets the value of its `href` attribute (the actual link).
   - Some `<a>` tags may not have an `href` attribute, so `.get("href")` returns `None` in those cases.
   - `links` is a list of strings (the link URLs) or `None` values.

5. **Filtering out `None` Values:**
   ```python
   return [link for link in links if link]
   ```
   - This filters the `links` list to exclude any that are `None` or empty. Only valid strings remain.
   - The resulting list is returned.

---

### Summary

- The function fetches the webpage at a given URL.
- Parses the HTML to find all anchor tags `<a>`.
- Extracts the URLs from the `href` attributes.
- Returns a clean list of valid links.

---

### Notes:
- **Efficiency and Optimization:**  
  The comment about inefficiency likely refers to cases where multiple parsings or multiple requests happen, but this snippet parses once.
- **Prerequisites:**  
  You need to import the relevant libraries and define `headers`, for example:

  ```python
  import requests
  from bs4 import BeautifulSoup

  headers = {
      "User-Agent": "Mozilla/5.0 ..."
  }
  ```

If you want, I can also help you rewrite or optimize this code!

In [15]:
requests.get("http://localhost:11434").content

b'Ollama is running'
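
The bare `requests.get` call above raises a connection error when the Ollama server is not up. A minimal sketch of a gentler check, assuming the default address shown above: `ollama_is_running` is a hypothetical helper name, and it uses the standard library's `urllib` instead of `requests` only to stay dependency-free.

```python
import urllib.request


def ollama_is_running(base_url="http://localhost:11434", timeout=2.0):
    """Return True if the server answers with the 'Ollama is running' banner."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            return b"Ollama is running" in resp.read()
    except OSError:
        # URLError, refused connections and timeouts are all OSError subclasses.
        return False
```

If this returns `False`, starting the server (for example with `ollama serve`) before calling the model is the usual next step.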

In [16]:
def explain_code_ollama():
    # Point the OpenAI client at Ollama's OpenAI-compatible endpoint
    OLLAMA_BASE_URL = "http://localhost:11434/v1"

    ollama = OpenAI(base_url=OLLAMA_BASE_URL, api_key='ollama')
    response = ollama.chat.completions.create(
        model=MODEL_OLLAMA,
        messages=[
            {"role": "system", "content": tutor_system_prompt},
            {"role": "user", "content": get_user_prompt()}
        ]
    )

    result = response.choices[0].message.content
    display(Markdown(result))
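
`explain_code_openai` and `explain_code_ollama` differ only in the client and model they use. A minimal sketch of a shared helper that takes both as arguments follows; `explain_code` is a hypothetical name, and it returns the text rather than displaying it so the caller decides how to render it:

```python
def explain_code(client, model, system_prompt, user_prompt):
    """Send one system + user exchange and return the reply text."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    )
    return response.choices[0].message.content


# Usage with the names defined in this notebook:
#   explain_code(OpenAI(), MODEL_GPT, tutor_system_prompt, get_user_prompt())
#   explain_code(OpenAI(base_url="http://localhost:11434/v1", api_key="ollama"),
#                MODEL_OLLAMA, tutor_system_prompt, get_user_prompt())
```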


In [17]:
explain_code_ollama()

Let's break down this code snippet and discuss its efficiency.

**Functionality**

The `fetch_website_links` function takes a URL as input, retrieves the webpage content using `requests`, parses the HTML content using BeautifulSoup, extracts all the anchor tags (`<a>`) from the parsed content, and returns a list of link URLs associated with these anchor tags.

**Code Review**

Here's a step-by-step explanation:

1. `response = requests.get(url, headers=headers)`: This line sends an HTTP GET request to the specified URL and stores the response in the `response` variable.
2. `soup = BeautifulSoup(response.content, "html.parser")`: This line parses the HTML content of the response using BeautifulSoup and creates a parse tree (i.e., a representation of the structure of the HTML document).
3. `links = [link.get("href") for link in soup.find_all("a")]`: This list comprehension extracts all anchor tags (`<a>`) from the parsed content and retrieves their href attributes, which contain the link URLs.
4. `return [link for link in links if link]`: This line filters out any empty strings or None values from the extracted link URLs (i.e., removes unnecessary entries) and returns the filtered list.

**Inefficiency**

As you mentioned, this approach is inefficient because it involves parsing the HTML content twice:
	* Once using BeautifulSoup.
	* Again when extracting anchor tags (`<a>`) and their href attributes.

A more efficient approach would involve combining these two parse operations into a single pass over the parsed HTML tree. We can take advantage of BeautifulSoup's `find_all` method to extract all link URLs directly in one go, without needing to repeat the parsing step.

**Optimized Version**

Here's an optimized version of the function that uses BeautifulSoup's `extract` method with a CSS selector:
```python
def fetch_website_links(url):
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.content, "html.parser")
    links = [a.get("href") for a in soup.find_all("a", href=True)]  # extract links with href attribute
    return [link for link in links if link]
```
However, this approach still repeats the parsing step. To truly optimize the function, we can create a BeautifulSoup object and use its `text_content` method to parse the HTML content into a string, which can then be processed without repeating the parsing step.
```python
from bs4 import BeautifulSoup

def fetch_website_links(url):
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.content, "html.parser")
    
    def extract_links(html_tree):
        """Helper function to extract links from an HTML tree"""
        return [a.get("href") for a in html_tree.find_all("a", href=True)]

    # Create a Soup object to avoid repeating parsing
    s = soup.decode_contents()  # decode contents to process the parsed content
    
    # Process the parsed content without re-parsing it
    links = extract_links(s)
    
    return [link for link in links if link]
```
This optimized version of the function should be much more efficient than the original code snippet.

**Additional Advice**

* Always use try-except blocks to handle exceptions that may occur during request or parsing operations.
* Consider using a more modern Python version (e.g., Python 3.8+).
* Familiarize yourself with common Python libraries and their methods for web scraping, such as `requests`, `BeautifulSoup`, `lxml` (for HTML parsing), et al.

Do you have any specific questions or areas of the code that you'd like me to expand upon?
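
Note that the second "optimized" version in the Ollama output above does not run as written: `decode_contents()` returns a string, and strings have no `find_all` method. The original function already parses the page only once, as the GPT output points out. A minimal sketch of the same single-pass idea follows, deliberately using the standard library's `html.parser` rather than BeautifulSoup so it runs without extra installs; `LinkCollector` and `extract_links` are hypothetical names:

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect non-empty href values from <a> tags in a single pass."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:  # skip missing or empty href attributes
                self.links.append(href)


def extract_links(html):
    """Return the href of every <a> tag in the given HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links
```

For example, `extract_links(requests.get(url, headers=headers).text)` would play the role of `fetch_website_links(url)`, with `headers` defined as in the GPT output's prerequisites.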