# Part 2: Scrape HTML Content From a Page

- **Static Websites**
- _Hidden Websites_
- _Dynamic Websites_

In this course, you will work with a static website. You will also get a high-level overview of the challenges of scraping dynamically generated content and data behind logins.

## Static Websites

In [1]:
import requests

In [None]:
url = "https://www.indeed.com/jobs?q=python&l=new+york"
response = requests.get(url)

In [None]:
response.content[400:500]  # let's take a peek

You have access to the data from this website. Can you already search for the content that interests you?

In [None]:
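# let's try a string search on the decoded text
# (str(response.content) would search the b'...' repr, which shifts the index)
loc = response.text.find('python')  # index of the first match, or -1 if absent
loc

In [None]:
response.text[loc-10:loc+10]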

In [None]:
# what about regex?
import re
re.findall(r'python', response.text)
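
Both searches above only confirm that the substring occurs somewhere. As a minimal sketch of what it takes to see each match in context, the helper below (the name `find_in_context` and the sample `html` string are hypothetical, standing in for `response.text` so the cell runs without a network connection) collects a few characters around every hit:

```python
import re

# Hypothetical sample standing in for response.text
html = (
    '<div class="title"><a href="/job/1">Python Developer</a></div>'
    '<p>Senior python engineer wanted</p>'
)


def find_in_context(text, word, width=15):
    """Return every case-insensitive match of `word` with up to `width`
    characters of surrounding text on each side."""
    snippets = []
    for match in re.finditer(re.escape(word), text, flags=re.IGNORECASE):
        start = max(match.start() - width, 0)
        end = min(match.end() + width, len(text))
        snippets.append(text[start:end])
    return snippets


snippets = find_in_context(html, "python")
for snippet in snippets:
    print(snippet)
```

Note that the snippets still mix markup and text, so you would have to clean up each one by hand.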

It works, but it is tedious and inefficient! That's where **parsing** and `BeautifulSoup` come in and make your life easier. You will learn more about that in the next chapter.

## Hidden Websites

Some pages require you to log in before they display information. Scraping them _without_ logging in doesn't give you what you want. `requests` includes ways to authenticate with websites.

Practice this with [GitHub](https://github.com/) and our tutorial on [Python's Requests Library](https://realpython.com/python-requests/).

In [None]:
res = requests.get('https://api.github.com/user')

In [None]:
res.status_code  # whoops! not authorized!

In [None]:
res.content
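
The request above fails because it carries no credentials. As a rough sketch of how credentials could be attached, the cell below builds a request with HTTP Basic authentication and inspects it without sending anything. The `username` and `token` values are placeholders, and whether a given service accepts Basic authentication is an assumption you would need to check against its API documentation.

```python
import requests

# Placeholder credentials: replace with your own before sending anything
username = "your-username"
token = "your-token"

# Build the request without sending it, so you can inspect what would go out
request = requests.Request(
    "GET", "https://api.github.com/user", auth=(username, token)
)
prepared = request.prepare()

# A (user, password) tuple is turned into an HTTP Basic Authorization header
auth_header = prepared.headers["Authorization"]
print(auth_header.split()[0])  # the authentication scheme

# To actually send it, you could pass the same tuple directly:
# res = requests.get("https://api.github.com/user", auth=(username, token))
```

With valid credentials, the commented-out call at the end should return the protected content instead of an authorization error.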

## Dynamic Websites

Many websites offload computation to the client. That means they send back **JavaScript** code that the client's browser executes. These pages are harder to scrape because that code needs to run before the information you are interested in appears on the page.
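
To illustrate, here is a small sketch using a hypothetical HTML shell of the kind a JavaScript-heavy site might return (the `html` string and the `PageSummary` class are made up for this example). It uses Python's built-in `html.parser` to count `<script>` tags and collect the text outside of scripts and styles, which shows how little content such a response contains before any code runs:

```python
from html.parser import HTMLParser

# Hypothetical response body from a client-rendered page: an empty
# container plus a script that would fill it in inside a browser.
html = """<!DOCTYPE html>
<html>
<head><title>Search</title></head>
<body>
<div id="root"></div>
<script src="/static/app.js"></script>
</body>
</html>"""


class PageSummary(HTMLParser):
    """Count <script> tags and collect text outside scripts and styles."""

    def __init__(self):
        super().__init__()
        self.script_count = 0
        self.text_parts = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_count += 1
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


summary = PageSummary()
summary.feed(html)
summary.close()

page_text = " ".join(summary.text_parts)
print(summary.script_count)
print(page_text)
```

The only text in this response is the page title. The search results would be inserted into the empty `<div>` by the script, which `requests` never executes.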

Practice this with [requests-html](https://requests.readthedocs.io/projects/requests-html/en/latest/) or [Selenium](https://selenium-python.readthedocs.io/) and our tutorial on [Modern Web Automation with Python and Selenium](https://realpython.com/modern-web-automation-with-python-and-selenium/).

In [2]:
res = requests.get('https://twitter.com/search?q=realpython')

In [3]:
res.status_code  # the code says all is fine...

200

In [4]:
res.content  # ... but the content doesn't contain what you're looking for!

b'<!DOCTYPE html>\n<html dir="ltr" lang="en">\n<meta charset="utf-8" />\n<meta name="viewport" content="width=device-width,initial-scale=1,maximum-scale=1,user-scalable=0,viewport-fit=cover" />\n<link rel="preconnect" href="//abs.twimg.com" />\n<link rel="preconnect" href="//api.twitter.com" />\n<link rel="preconnect" href="//pbs.twimg.com" />\n<link rel="preconnect" href="//t.co" />\n<link rel="preconnect" href="//video.twimg.com" />\n<link rel="dns-prefetch" href="//abs.twimg.com" />\n<link rel="dns-prefetch" href="//api.twitter.com" />\n<link rel="dns-prefetch" href="//pbs.twimg.com" />\n<link rel="dns-prefetch" href="//t.co" />\n<link rel="dns-prefetch" href="//video.twimg.com" />\n<link rel="preload" as="script" crossorigin="anonymous" href="https://abs.twimg.com/responsive-web/web/polyfills.675e3184.js" nonce="YzVhMTM5MDktOTc0Ny00M2E2LTljZjktZDNkZDE5MmIwMWU0" />\n<link rel="preload" as="script" crossorigin="anonymous" href="https://abs.twimg.com/responsive-web/web/vendors~main.80