This repository contains the code for the web-scraping tasks of my first internship.

After downloading this project to your PC, open the project folder, launch your command-line interpreter there (e.g. Command Prompt on Windows), and run:
```shell
pip install -r requirements.txt
```
- Properly formatted code (PEP 8 ✅)
- Proper comments and descriptive variable names 🙌
## Use Session
```python
from requests import Session

with Session() as session:  # requests session init
    session.stream = False  # turn streaming off for every request in this session
    response = session.get(url='https://www.example.com')
    print(response.status_code)
```
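If a flaky connection is a concern, the same Session can also be configured to retry failed requests automatically — a minimal sketch using requests' `HTTPAdapter` and urllib3's `Retry` (the retry count, backoff factor, and status codes below are arbitrary example values):

```python
from requests import Session
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Retry up to 3 times on connection errors and 5xx responses,
# waiting 0.5s, 1s, 2s between attempts (example values):
retry_strategy = Retry(total=3, backoff_factor=0.5, status_forcelist=[500, 502, 503, 504])
adapter = HTTPAdapter(max_retries=retry_strategy)

with Session() as session:
    session.mount('https://', adapter)  # apply to every https:// request
    session.mount('http://', adapter)   # ...and every http:// request
    # response = session.get(url='https://www.example.com')
```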
## Preferred) Use Requests-HTML (Alternative Link)
- How I Scrape JAVASCRIPT websites with Python | Scrape Amazon NEW METHOD with Python 2020
- Render Dynamic Pages - Web Scraping Product Links with Python | Rendering Dynamic Pages 2! - Web Scraping ALL products with Python
- Python Tutorial: Web Scraping with Requests-HTML
## Alternative) Use Selenium
- Scrape content from dynamic websites
- How I use SELENIUM to AUTOMATE the Web with PYTHON. Pt1 | How to SCRAPE DYNAMIC websites with Selenium
```python
from time import sleep

from bs4 import BeautifulSoup
from selenium.common.exceptions import WebDriverException
from selenium.webdriver import Chrome


def get_parsed_page_html():
    return BeautifulSoup(markup=driver.page_source, features='html.parser')


def wait_to_load():
    print('SLOW INTERNET: PLEASE WAIT...')
    sleep(1)


try:
    driver = Chrome()  # webdriver init
except WebDriverException:
    raise SystemExit('''\nERROR: 'chromedriver' executable needs to be in PATH. Please see https://chromedriver.chromium.org/getting-started#h.p_ID_36 \n
TL;DR:
1) Download or Update the ChromeDriver binary for your platform from https://chromedriver.chromium.org/downloads
2) Include the ChromeDriver location in your PATH environment variable''')
driver.maximize_window()  # keep the window maximized; a minimized window makes the page take longer to load
driver.get(url='https://www.example.com')  # open the webpage
# wait_to_load()  # in case the page is still loading
html = get_parsed_page_html()
# print(html.prettify())  # debugging
print(html.p.text)
driver.quit()
```
## Use Threading
- Python Threading Tutorial: Run Code Concurrently Using the Threading Module
- PARALLEL and CONCURRENCY in Python for FAST Web Scraping
```python
from concurrent.futures import ThreadPoolExecutor

THREADS = 10


def main(i: int) -> None:
    print(i)


# Executing THREADS no. of threads at once:
with ThreadPoolExecutor(max_workers=THREADS) as executor:
    executor.map(main, range(1, THREADS + 1))
```
### Want even more SPEED?
- How to Make 2500 HTTP Requests in 2 Seconds with Async & Await
- multiprocessing vs multithreading vs asyncio in Python 3
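As a stdlib-only taste of the asyncio approach from those videos: `asyncio.gather` runs the (simulated) downloads concurrently, so ten 0.1-second "requests" finish in roughly 0.1 seconds instead of 1 second. For real pages you would swap the `sleep` for an async HTTP client such as `aiohttp`:

```python
import asyncio


async def fake_download(i: int) -> int:
    await asyncio.sleep(0.1)  # stands in for an awaitable HTTP request
    return i


async def main() -> list[int]:
    # Schedule all downloads at once; gather preserves the input order:
    return await asyncio.gather(*(fake_download(i) for i in range(1, 11)))


results = asyncio.run(main())
print(results)
```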
## Use Proxies
```python
from random import choice

from bs4 import BeautifulSoup
from requests import RequestException, get as get_request

# Using free proxies here, which are very slow; use paid proxies if possible.
html = BeautifulSoup(markup=get_request(url='https://www.sslproxies.org').text, features='html.parser')
proxies_raw = html.find(name='textarea').text.strip()
# print(proxies_raw)  # debugging
proxies = proxies_raw.split('\n')[3:]  # skip the heading lines before the IP:PORT list
# print(proxies, len(proxies))  # debugging
while True:  # keep trying random proxies until one responds in time
    proxy = choice(proxies)
    print(proxy)
    try:
        print(get_request('https://httpbin.org/ip', proxies={'http': proxy, 'https': proxy}, stream=False, timeout=5).json())
        break
    except RequestException:
        pass
```
## Use Custom Headers
- User Agent Switching - Python Web Scraping
- Get your user agent from: Google or httpbin.org
```python
from requests import get as get_request

URL = 'https://httpbin.org/user-agent'

print('Without:', get_request(url=URL).json())
# Without: {'user-agent': 'python-requests/2.27.1'}

HEADER = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.82 Safari/537.36'}
print('With:', get_request(url=URL, headers=HEADER).json())
# With: {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.82 Safari/537.36'}
```
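Going one step further than a single hard-coded header, the user agent can be rotated per request — a small sketch where the list of user-agent strings is just an illustrative sample (swap in your own, larger list):

```python
from random import choice

# A few example desktop user-agent strings (illustrative sample only):
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.82 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.3 Safari/605.1.15',
    'Mozilla/5.0 (X11; Linux x86_64; rv:98.0) Gecko/20100101 Firefox/98.0',
]


def random_headers() -> dict:
    """Pick a fresh user agent for each request."""
    return {'user-agent': choice(USER_AGENTS)}


print(random_headers())
# Pass it as: get_request(url=URL, headers=random_headers())
```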
## Use pandas.read_html

When the data sits in an HTML `<table>`, `pandas.read_html` can grab every table on the page in one line!!
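A quick sketch on an inline HTML snippet (for a live page you would pass the page's HTML text or URL instead; assumes pandas plus an HTML parser such as lxml is installed):

```python
from io import StringIO

import pandas as pd

html = '''<table>
  <tr><th>name</th><th>price</th></tr>
  <tr><td>apple</td><td>3</td></tr>
  <tr><td>banana</td><td>1</td></tr>
</table>'''

tables = pd.read_html(StringIO(html))  # one DataFrame per <table> found
df = tables[0]
print(df)
```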
## Always Check for the Hidden API when Web Scraping

([On Incognito] Inspect -> Network -> XHR -> Name -> some GET request -> Response)
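Once a hidden JSON endpoint shows up in the Network tab, it can usually be called directly and parsed without touching any HTML — a sketch (the endpoint URL below is hypothetical; use whatever the XHR request shows):

```python
from requests import get as get_request


def fetch_json(api_url: str) -> dict:
    """Call the discovered endpoint directly and return the parsed JSON."""
    response = get_request(url=api_url, timeout=10)
    response.raise_for_status()  # fail loudly on 4xx/5xx
    return response.json()


# Example (hypothetical endpoint spotted under Network -> XHR):
# data = fetch_json('https://www.example.com/api/products?page=1')
```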
- Saving large Excel files takes forever. Never use Excel files to save large data; if you must, partition the data across multiple Excel files (see this) OR use CSV files instead!
- An Excel (.xlsx) file is essentially a zipped archive. If the data has to be saved across a large number of (small) files, Excel files end up taking far less storage than the equivalent CSV files!
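The partitioning idea from the first point can be sketched with the stdlib `csv` module alone — splitting rows across numbered files (the chunk size of 2 and the sample rows are only for demonstration; real data would use thousands of rows per file):

```python
import csv
from pathlib import Path
from tempfile import mkdtemp

rows = [['apple', 3], ['banana', 1], ['cherry', 5], ['dates', 7], ['elderberry', 2]]
CHUNK_SIZE = 2  # rows per file; tiny here just to show the split
out_dir = Path(mkdtemp())

for part, start in enumerate(range(0, len(rows), CHUNK_SIZE), start=1):
    with open(out_dir / f'data_part{part}.csv', 'w', newline='') as file:
        writer = csv.writer(file)
        writer.writerow(['name', 'price'])  # repeat the header in every partition
        writer.writerows(rows[start:start + CHUNK_SIZE])

print(sorted(path.name for path in out_dir.glob('*.csv')))
# ['data_part1.csv', 'data_part2.csv', 'data_part3.csv']
```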