# Part 2: Scrape HTML Content From a Page

- **Static Websites**
- _Hidden Websites_
- _Dynamic Websites_

In this course, you will work with a static website. You will also get a high-level overview of the challenges of scraping dynamically generated content and data behind logins.

## ⚠️ Durability Warning ⚠️

As [mentioned in the course](https://realpython.com/lessons/challenge-of-durability/), websites change frequently. Unfortunately, the job board that you'll see in the course, indeed.com, has started blocking scrapers since the course was recorded.

Just like in the associated written tutorial on [web scraping with Beautiful Soup](https://realpython.com/beautiful-soup-web-scraper-python/#scrape-the-fake-python-job-site), you can instead use [Real Python's fake jobs site](https://realpython.github.io/fake-jobs/) to practice scraping a static website.

All the concepts discussed in the course lessons are still accurate. Translating what you see onto a different website will be a good learning opportunity where you'll have to synthesize the information and apply it practically.

## Static Websites

In [None]:
import requests

In [None]:
url = "https://realpython.github.io/fake-jobs/"
response = requests.get(url)

In [None]:
response.content[400:500]  # let's take a peek

You have access to the data from this website. Can you already search for the content that interests you?

In [None]:
# let's try a string search on the decoded text
# (str(response.content) would search the bytes repr, which shifts the index)
loc = response.text.find("python")
loc

In [None]:
response.text[loc - 50 : loc + 50]

In [None]:
# what about regex?
import re

re.findall(r"python", response.text)
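One catch with both searches above is that they are case-sensitive, so a lowercase `"python"` inside a URL matches while a capitalized `"Python"` in a job title does not. The sketch below shows the difference on a small made-up snippet; `sample_html` is a hypothetical stand-in for the downloaded page, not the site's real markup.

```python
import re

# Hypothetical snippet standing in for the downloaded HTML.
sample_html = (
    '<h2 class="title">Senior Python Developer</h2>'
    '<a href="/jobs/python-dev">Apply</a>'
)

# Case-sensitive: only the lowercase occurrence in the link matches.
exact = re.findall(r"python", sample_html)

# Case-insensitive: the job title matches as well.
any_case = re.findall(r"python", sample_html, flags=re.IGNORECASE)

print(exact)     # ['python']
print(any_case)  # ['Python', 'python']
```

Even with the flag, you only get the matched strings back, with no idea which HTML element each one came from.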

It works, but it is tedious and inefficient! That's where **parsing** and `BeautifulSoup` come in and make your life easier. You will learn more about that in the next chapter.
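To get a feel for what parsing buys you before `BeautifulSoup` enters the picture, here is a minimal sketch that uses only the standard library's `html.parser`. The markup in `SAMPLE_HTML` and the `TitleParser` class are illustrative assumptions, not the real page's structure or the approach taken later in the course.

```python
from html.parser import HTMLParser

# Illustrative markup only; the real page's structure may differ.
SAMPLE_HTML = """
<div class="card">
  <h2 class="title">Senior Python Developer</h2>
  <p class="location">Stewartbury, AA</p>
</div>
<div class="card">
  <h2 class="title">Energy Engineer</h2>
  <p class="location">Christopherville, AA</p>
</div>
"""


class TitleParser(HTMLParser):
    """Collect the text found inside <h2> elements."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_title = False

    def handle_data(self, data):
        # Only keep text that sits between <h2> and </h2>.
        if self._in_title:
            self.titles.append(data.strip())


parser = TitleParser()
parser.feed(SAMPLE_HTML)

# Filter on the parsed titles instead of on the raw page text.
python_titles = [title for title in parser.titles if "python" in title.lower()]
print(python_titles)  # ['Senior Python Developer']
```

Because the parser knows where each element starts and ends, you can ask for "job titles that mention Python" rather than "character offsets where the word appears".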

## Hidden Websites

Some pages require you to log in before they display information. Scraping them _without_ logging in doesn't give you what you want. `requests` includes ways to authenticate with websites.

Practice this with [GitHub](https://github.com/) and our tutorial on [Python's Requests Library](https://realpython.com/python-requests/).

In [None]:
res = requests.get("https://api.github.com/user")

In [None]:
res.status_code  # whoops! not authorized!

In [None]:
res.content
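One common way to authenticate is to send a token in the `Authorization` header. The sketch below only prepares such a request so you can inspect it without touching the network. The helper name `build_authenticated_request` and the placeholder token are hypothetical, and the assumption that the API accepts a `Bearer` token should be checked against GitHub's current documentation.

```python
import requests


def build_authenticated_request(token):
    """Prepare, but do not send, a GET request that carries a bearer token."""
    request = requests.Request(
        "GET",
        "https://api.github.com/user",
        headers={"Authorization": f"Bearer {token}"},
    )
    return request.prepare()


# "hypothetical-token" is a placeholder, not a working credential.
prepared = build_authenticated_request("hypothetical-token")
print(prepared.headers["Authorization"])

# To actually send it, you could pass the prepared request to a session:
# with requests.Session() as session:
#     res = session.send(prepared)
#     print(res.status_code)
```

With a valid personal access token in place of the placeholder, the status code should no longer signal an authorization failure.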

## Dynamic Websites

Many websites offload work to the client. Instead of sending finished HTML, they send **JavaScript** code that the client's browser executes to build the page. These pages are harder to scrape, because `requests` only downloads the code and does not run it, so the information you are interested in never appears in the response.

Practice this with [requests-html](https://requests.readthedocs.io/projects/requests-html/en/latest/) or [Selenium](https://selenium-python.readthedocs.io/) and our tutorial on [Modern Web Automation with Python and Selenium](https://realpython.com/modern-web-automation-with-python-and-selenium/).

In [None]:
res = requests.get("https://twitter.com/search?q=realpython")

In [None]:
res.status_code  # the code says all is fine...

In [None]:
res.content  # ... but the content doesn't contain what you're looking for!
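The following sketch shows, without any network access, why a successful status code can still come with empty-looking content. `RAW_HTML` is a hypothetical response body for a JavaScript-driven page, and `VisibleTextParser` is an illustrative helper built on the standard library's `html.parser`.

```python
from html.parser import HTMLParser

# Hypothetical response body of a JavaScript-driven page: the markup holds an
# empty container, and the script fills it in only when a browser runs it.
RAW_HTML = """
<html>
  <body>
    <div id="results"></div>
    <script>
      fetch("/api/jobs")
        .then((response) => response.json())
        .then((jobs) => {
          document.getElementById("results").textContent = jobs[0].title;
        });
    </script>
  </body>
</html>
"""


class VisibleTextParser(HTMLParser):
    """Collect text that appears outside of <script> and <style> tags."""

    def __init__(self):
        super().__init__()
        self.visible_text = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Ignore whitespace and anything inside a script or style element.
        if not self._skip_depth and data.strip():
            self.visible_text.append(data.strip())


parser = VisibleTextParser()
parser.feed(RAW_HTML)
print(parser.visible_text)  # []
```

The job data would only arrive after a browser ran the script and called the API, which is the step that tools such as Selenium perform for you.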