This repository contains the code for the web-scraping tasks of my first internship.

After downloading this project to your PC, open the project folder, launch your command-line interpreter there (e.g. Command Prompt on Windows), and run:
```shell
pip install -r requirements.txt
```
- Properly formatted code (PEP 8 ✅)
- Proper comments and descriptive variable names 🙌
## Use Session
```python
from requests import Session

with Session() as session:  # requests session init
    session.stream = False  # turn streaming off for every request in this session
    response = session.get(url='https://www.example.com')
    print(response.status_code)
```
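If a flaky connection is a concern, the same Session can also be configured to retry failed requests automatically — a minimal sketch using requests' `HTTPAdapter` and urllib3's `Retry` (the retry count, backoff factor, and status codes below are arbitrary example values):

```python
from requests import Session
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Retry up to 3 times on connection errors and 5xx responses,
# waiting 0.5s, 1s, 2s between attempts (example values):
retry_strategy = Retry(total=3, backoff_factor=0.5, status_forcelist=[500, 502, 503, 504])
adapter = HTTPAdapter(max_retries=retry_strategy)

with Session() as session:
    session.mount('https://', adapter)  # apply to every https:// request
    session.mount('http://', adapter)   # ...and every http:// request
    # response = session.get(url='https://www.example.com')
```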
## Preferred) Use Requests-HTML (Alternative Link)
- How I Scrape JAVASCRIPT websites with Python | Scrape Amazon NEW METHOD with Python 2020
- Render Dynamic Pages - Web Scraping Product Links with Python | Rendering Dynamic Pages 2! - Web Scraping ALL products with Python
- Python Tutorial: Web Scraping with Requests-HTML
## Alternative) Use Selenium
- Scrape content from dynamic websites
- How I use SELENIUM to AUTOMATE the Web with PYTHON. Pt1 | How to SCRAPE DYNAMIC websites with Selenium
```python
from time import sleep

from bs4 import BeautifulSoup
from selenium.common.exceptions import WebDriverException
from selenium.webdriver import Chrome


def get_parsed_page_html():
    return BeautifulSoup(markup=driver.page_source, features='html.parser')


def wait_to_load():
    print('SLOW INTERNET: PLEASE WAIT...')
    sleep(1)


try:
    driver = Chrome()  # webdriver init
except WebDriverException:
    raise SystemExit('''\nERROR: 'chromedriver' executable needs to be in PATH. Please see https://chromedriver.chromium.org/getting-started#h.p_ID_36 \n
TL;DR:
1) Download or Update the ChromeDriver binary for your platform from https://chromedriver.chromium.org/downloads
2) Include the ChromeDriver location in your PATH environment variable''')
driver.maximize_window()  # keep the window maximized; a minimized window makes the page take longer to load
driver.get(url='https://www.example.com')  # open the webpage
# wait_to_load()  # in case the page is still loading
html = get_parsed_page_html()
# print(html.prettify())  # debugging
print(html.p.text)
driver.quit()
```
## Use Threading
- Python Threading Tutorial: Run Code Concurrently Using the Threading Module
- PARALLEL and CONCURRENCY in Python for FAST Web Scraping
```python
from concurrent.futures import ThreadPoolExecutor

THREADS = 10


def main(i: int) -> None:
    print(i)


# Executing THREADS no. of threads at once:
with ThreadPoolExecutor(max_workers=THREADS) as executor:
    executor.map(main, range(1, THREADS + 1))
```
### Want even more SPEED?
- How to Make 2500 HTTP Requests in 2 Seconds with Async & Await
- multiprocessing vs multithreading vs asyncio in Python 3
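As a stdlib-only taste of the asyncio approach from those videos: `asyncio.gather` runs the (simulated) downloads concurrently, so ten 0.1-second "requests" finish in roughly 0.1 seconds instead of 1 second. For real pages you would swap the `sleep` for an async HTTP client such as `aiohttp`:

```python
import asyncio


async def fake_download(i: int) -> int:
    await asyncio.sleep(0.1)  # stands in for an awaitable HTTP request
    return i


async def main() -> list[int]:
    # Schedule all downloads at once; gather preserves the input order:
    return await asyncio.gather(*(fake_download(i) for i in range(1, 11)))


results = asyncio.run(main())
print(results)
```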
## Use Proxies
```python
from random import choice

from bs4 import BeautifulSoup
from requests import RequestException, get as get_request

# Using free proxies here, which are very slow; use paid proxies if possible.
html = BeautifulSoup(markup=get_request(url='https://www.sslproxies.org').text, features='html.parser')
proxies_raw = html.find(name='textarea').text.strip()
# print(proxies_raw)  # debugging
proxies = proxies_raw.split('\n')[3:]  # skip the heading lines before the IP:PORT list
# print(proxies, len(proxies))  # debugging
while True:  # keep trying random proxies until one responds in time
    proxy = choice(proxies)
    print(proxy)
    try:
        print(get_request('https://httpbin.org/ip', proxies={'http': proxy, 'https': proxy}, stream=False, timeout=5).json())
        break
    except RequestException:
        pass
```
## Use Custom Headers
- User Agent Switching - Python Web Scraping
- Get your user agent from: Google or httpbin.org
```python
from requests import get as get_request

URL = 'https://httpbin.org/user-agent'

print('Without:', get_request(url=URL).json())
# Without: {'user-agent': 'python-requests/2.27.1'}

HEADER = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.82 Safari/537.36'}
print('With:', get_request(url=URL, headers=HEADER).json())
# With: {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.82 Safari/537.36'}
```
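Going one step further than a single hard-coded header, the user agent can be rotated per request — a small sketch where the list of user-agent strings is just an illustrative sample (swap in your own, larger list):

```python
from random import choice

# A few example desktop user-agent strings (illustrative sample only):
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.82 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.3 Safari/605.1.15',
    'Mozilla/5.0 (X11; Linux x86_64; rv:98.0) Gecko/20100101 Firefox/98.0',
]


def random_headers() -> dict:
    """Pick a fresh user agent for each request."""
    return {'user-agent': choice(USER_AGENTS)}


print(random_headers())
# Pass it as: get_request(url=URL, headers=random_headers())
```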
## Use pandas.read_html

When the data sits in an HTML `<table>`, `pandas.read_html` can grab every table on the page in one line!!
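A quick sketch on an inline HTML snippet (for a live page you would pass the page's HTML text or URL instead; assumes pandas plus an HTML parser such as lxml is installed):

```python
from io import StringIO

import pandas as pd

html = '''<table>
  <tr><th>name</th><th>price</th></tr>
  <tr><td>apple</td><td>3</td></tr>
  <tr><td>banana</td><td>1</td></tr>
</table>'''

tables = pd.read_html(StringIO(html))  # one DataFrame per <table> found
df = tables[0]
print(df)
```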
## Always Check for the Hidden API when Web Scraping

([On Incognito] Inspect -> Network -> XHR -> Name -> some GET request -> Response)
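Once a hidden JSON endpoint shows up in the Network tab, it can usually be called directly and parsed without touching any HTML — a sketch (the endpoint URL below is hypothetical; use whatever the XHR request shows):

```python
from requests import get as get_request


def fetch_json(api_url: str) -> dict:
    """Call the discovered endpoint directly and return the parsed JSON."""
    response = get_request(url=api_url, timeout=10)
    response.raise_for_status()  # fail loudly on 4xx/5xx
    return response.json()


# Example (hypothetical endpoint spotted under Network -> XHR):
# data = fetch_json('https://www.example.com/api/products?page=1')
```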
- Saving large Excel files takes forever. Never use Excel files to save large data; if you must, partition the data across multiple Excel files (see this) OR use CSV files instead!
- An Excel (.xlsx) file is essentially a zipped archive. If the data has to be saved across a large number of (small) files, Excel files end up taking far less storage than the equivalent CSV files!
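The partitioning idea from the first point can be sketched with the stdlib `csv` module alone — splitting rows across numbered files (the chunk size of 2 and the sample rows are only for demonstration; real data would use thousands of rows per file):

```python
import csv
from pathlib import Path
from tempfile import mkdtemp

rows = [['apple', 3], ['banana', 1], ['cherry', 5], ['dates', 7], ['elderberry', 2]]
CHUNK_SIZE = 2  # rows per file; tiny here just to show the split
out_dir = Path(mkdtemp())

for part, start in enumerate(range(0, len(rows), CHUNK_SIZE), start=1):
    with open(out_dir / f'data_part{part}.csv', 'w', newline='') as file:
        writer = csv.writer(file)
        writer.writerow(['name', 'price'])  # repeat the header in every partition
        writer.writerows(rows[start:start + CHUNK_SIZE])

print(sorted(path.name for path in out_dir.glob('*.csv')))
# ['data_part1.csv', 'data_part2.csv', 'data_part3.csv']
```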