
## Web Scraping with Python and Beautiful Soup

The data available on the Internet is a rich resource for research and personal data analysis. To harvest that data effectively, a technique called web scraping is used. The Python libraries requests and Beautiful Soup are well suited to the job.
This notebook uses requests to download pages and Beautiful Soup to parse them, with examples shown below.

### What Is Web Scraping?
Web scraping is the process of gathering information from the Internet. The term usually refers to a process that involves automation. Manual collection can take a lot of time and repetition, so automated web scraping can speed up the data collection process.
Because websites change often, scrapers will probably need regular maintenance. It is useful to set up continuous integration that runs scraping tests periodically, so that the main script does not break without our knowledge.
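
One way to put the maintenance advice into practice is a small structure check that a scheduled CI job can run. The sketch below is an illustration only: the helper name `missing_selectors` and the sample markup are hypothetical, and in a real job the HTML would come from the live page rather than from a string.

```python
from bs4 import BeautifulSoup


def missing_selectors(html, selectors):
    """Return the CSS selectors that match nothing in the given HTML."""
    soup = BeautifulSoup(html, "html.parser")
    return [sel for sel in selectors if soup.select_one(sel) is None]


# Hypothetical markup standing in for a downloaded page.
sample_html = """
<html><body>
  <div id="seven-day-forecast">
    <p class="period-name">Tonight</p>
  </div>
</body></html>
"""

# Selectors the scraper depends on; the last one is deliberately absent.
required = ["#seven-day-forecast", ".period-name", ".temp"]
print(missing_selectors(sample_html, required))  # ['.temp']
```

A non-empty result is a signal that the page layout has probably changed and the scraper needs attention.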




### Scrape Static HTML Content From a Page
* Get the site’s HTML code into your Python script so that you can interact with it.
* Use Python’s requests library to do this.

The requests code in the cells below performs an HTTP request to the given URL. It retrieves the HTML that the server sends back and stores it in a Python object.
For a static site, the server sends back HTML documents that already contain all the data, so no JavaScript has to run before the content can be parsed.
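
A minimal sketch of that request step, assuming the `requests` package is installed. The helper name `fetch_html` and the 10-second timeout are illustrative choices, not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its HTML as text.

    Raises requests.HTTPError for 4xx/5xx responses instead of
    silently returning an error page.
    """
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response.text


# Example call (needs network access), left commented out:
# html = fetch_html("https://dataquestio.github.io/web-scraping-pages/simple.html")
# print(html[:100])
```

Setting a timeout keeps a scheduled scraper from hanging indefinitely when a server stops responding.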

The Components of a Web Page:

When we visit a web page, our web browser makes a request to a web server. This request is called a GET request, since we’re getting files from the server. The server then sends back files that tell our browser how to render the page for us. These files will typically include:
 *  HTML: the main content of the page. HTML consists of elements called tags. The most basic tag is the `<html>` tag, which tells the web browser that everything inside it is HTML.
 *  CSS: styling that makes the page look nicer.
 *  JS: JavaScript files that add interactivity to web pages.
 *  Images: image formats such as JPG and PNG allow web pages to show pictures.

After our browser receives all the files, it renders the page and displays it to us.

### HTML page example

<html>
<head>
</head>
<body>
<p>
Here's a paragraph of text!
<a href="https://www.dataquest.io">Learn Data Science Online</a>
</p>
<p>
Here's a second paragraph of text!
<a href="https://www.python.org">Python</a> </p>
</body></html>
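
To connect this sample page to the parsing steps, here is a minimal sketch that feeds the same markup to Beautiful Soup as a string and pulls out each link. It assumes the `beautifulsoup4` package is installed; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

sample_page = """
<html>
<head>
</head>
<body>
<p>
Here's a paragraph of text!
<a href="https://www.dataquest.io">Learn Data Science Online</a>
</p>
<p>
Here's a second paragraph of text!
<a href="https://www.python.org">Python</a> </p>
</body></html>
"""

sample_soup = BeautifulSoup(sample_page, "html.parser")

# Collect (link text, target URL) pairs from every <a> tag.
links = [(a.get_text(strip=True), a["href"]) for a in sample_soup.find_all("a")]
print(links)
# [('Learn Data Science Online', 'https://www.dataquest.io'), ('Python', 'https://www.python.org')]
print(len(sample_soup.find_all("p")))  # 2
```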

In [None]:
# Example 1: The requests library
# Python code without pprint
import requests


def geocode(address):
    # Note: the Google Geocoding API expects an API key in a `key` parameter;
    # without one the JSON response reports a REQUEST_DENIED status.
    url = "https://maps.googleapis.com/maps/api/geocode/json"
    resp = requests.get(url, params={'address': address})
    return resp.json()


# Call the geocode function
data = geocode('India gate')

# Print the JSON response
print(data)



import requests
page = requests.get("https://dataquestio.github.io/web-scraping-pages/simple.html")
page

In [None]:
# Example 2: The same request, printed with pprint
import requests
from pprint import pprint


def geocode(address):
    url = "https://maps.googleapis.com/maps/api/geocode/json"
    resp = requests.get(url, params={'address': address})
    return resp.json()


# Call the geocode function
data = geocode('India gate')

# pprint indents nested dictionaries and lists, which makes the JSON response easier to read
pprint(data)

### Techniques
A typical task is to get all of the data from a table displayed on a web page.

The steps, in sequence:
* Request the content (source code) of a specific URL from the server
* Download the content that is returned
* Identify the elements of the page that are part of the table we want
* Extract and (if necessary) reformat those elements into a dataset we can analyze or use in whatever way we require.
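
The last two steps can be sketched offline. The example below is a minimal illustration, assuming `beautifulsoup4` is installed; the table markup, the column names, and the helper `table_to_records` are hypothetical stand-ins for a page downloaded with `requests`.

```python
from bs4 import BeautifulSoup

# Hypothetical table markup; in practice this would be page.content.
table_html = """
<table id="cities">
  <tr><th>City</th><th>Country</th></tr>
  <tr><td>Delhi</td><td>India</td></tr>
  <tr><td>Paris</td><td>France</td></tr>
</table>
"""


def table_to_records(html, table_id):
    """Turn an HTML table into a list of dicts keyed by the header row."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id=table_id)
    rows = table.find_all("tr")
    headers = [th.get_text(strip=True) for th in rows[0].find_all("th")]
    records = []
    for row in rows[1:]:
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        records.append(dict(zip(headers, cells)))
    return records


print(table_to_records(table_html, "cities"))
# [{'City': 'Delhi', 'Country': 'India'}, {'City': 'Paris', 'Country': 'France'}]
```

A list of dictionaries like this can be passed straight to `pandas.DataFrame` for analysis.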

In [None]:
# Source: https://www.dataquest.io/blog/web-scraping-python-using-beautiful-soup/

import requests
#page = requests.get("https://github.com/umas-iit/Python-Examples/Pythonsample1.html")
page = requests.get("https://dataquestio.github.io/web-scraping-pages/simple.html")
page
#print(page.status_code)
#print(page.content)

In [None]:
# Parsing a page with BeautifulSoup

from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())
list(soup.children)


In [None]:
# All of the items are BeautifulSoup objects:
# The first is a Doctype object, which contains information about the type of the document.
# The second is a NavigableString, which represents text found in the HTML document.
# The final item is a Tag object, which contains other nested tags. The Tag object allows
# us to navigate through an HTML document and extract other tags and text.

[type(item) for item in list(soup.children)]

In [None]:
html = list(soup.children)[2]      # select the html tag and its children by taking the third item in the list
list(html.children)
body = list(html.children)[3]      # the body tag is the fourth child of html (index 3)
list(body.children)
p = list(body.children)[1]         # the p tag is the second child of body (index 1)
p.get_text()

In [None]:
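## Finding all instances of a tag at once

soup = BeautifulSoup(page.content, 'html.parser')
soup.find_all('p')                  # find_all returns a list of every matching tag
soup.find_all('p')[0].get_text()    # text of the first match

soup.find('p')                      # find returns only the first match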

In [None]:
## Searching for tags by class and id

page = requests.get("https://dataquestio.github.io/web-scraping-pages/ids_and_classes.html")
soup = BeautifulSoup(page.content, 'html.parser')
soup

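soup.find_all('p', class_='outer-text')   # p tags with the class outer-text
soup.find_all(class_="outer-text")        # any tag with the class outer-text
soup.find_all(id="first")                 # elements with the id first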
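
The same class and id lookups can also be written as CSS selectors with `select` and `select_one`. This sketch uses a small inline document rather than the downloaded page, so the markup and the name `soup_demo` are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with one id and two classes.
html_doc = """
<html><body>
  <p class="inner-text" id="first">First paragraph.</p>
  <p class="inner-text">Second paragraph.</p>
  <p class="outer-text">Outer paragraph.</p>
</body></html>
"""

soup_demo = BeautifulSoup(html_doc, "html.parser")

# "p.outer-text" matches <p> tags whose class list includes outer-text.
outer = [p.get_text() for p in soup_demo.select("p.outer-text")]
# "#first" matches the single element with id="first".
first = soup_demo.select_one("#first").get_text()

print(outer)  # ['Outer paragraph.']
print(first)  # First paragraph.
```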

In [None]:
## Start Scraping!
## We now know enough to download the page and start parsing it. In the code below, we will:
# 1. Download the web page containing the forecast.
# 2. Create a BeautifulSoup object to parse the page.
# 3. Find the div with the id seven-day-forecast and assign it to seven_day.
# 4. Inside seven_day, find each individual forecast item.
# 5. Extract and print the first forecast item.

page = requests.get("https://forecast.weather.gov/MapClick.php?lat=37.7772&lon=-122.4168")
soup = BeautifulSoup(page.content, 'html.parser')
seven_day = soup.find(id="seven-day-forecast")
forecast_items = seven_day.find_all(class_="tombstone-container")
tonight = forecast_items[0]
print(tonight.prettify())



In [None]:

# The longer description of the conditions is stored in the title attribute of the img tag
img = tonight.find("img")
desc = img['title']
print(desc)

In [None]:
# Extracting information from the page
# The forecast item tonight contains all the information we want. There are four pieces of information we can extract:
# The name of the forecast item: in this case, Tonight.
# The description of the conditions: this is stored in the title attribute of img.
# A short description of the conditions: in this case, Mostly Clear.
# The low temperature: in this case, 49 degrees.

period = tonight.find(class_="period-name").get_text()
short_desc = tonight.find(class_="short-desc").get_text()
temp = tonight.find(class_="temp").get_text()
print(period)
print(short_desc)
print(temp)
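
`find` returns `None` when nothing matches, so chaining `.get_text()` directly raises `AttributeError` if the page layout changes. A defensive variant might look like the sketch below; the helper `text_or_default` and the sample markup are hypothetical, with class names borrowed from the forecast page.

```python
from bs4 import BeautifulSoup


def text_or_default(parent, class_name, default=None):
    """Return the text of the first descendant with the class, or a default."""
    tag = parent.find(class_=class_name)
    return tag.get_text(strip=True) if tag is not None else default


# Hypothetical forecast item that is missing its temperature.
item_html = """
<div class="tombstone-container">
  <p class="period-name">Tonight</p>
  <p class="short-desc">Mostly Clear</p>
</div>
"""
item = BeautifulSoup(item_html, "html.parser")

print(text_or_default(item, "period-name"))          # Tonight
print(text_or_default(item, "temp", default="n/a"))  # n/a
```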

In [None]:
# Extracting all the information from the page
# Select all items with the class period-name inside an item with the class tombstone-container in seven_day.
# Use a list comprehension to call the get_text method on each BeautifulSoup object.

period_tags = seven_day.select(".tombstone-container .period-name")
periods = [pt.get_text() for pt in period_tags]
periods

In [None]:
# apply the same technique to get the other three fields

short_descs = [sd.get_text() for sd in seven_day.select(".tombstone-container .short-desc")]
temps = [t.get_text() for t in seven_day.select(".tombstone-container .temp")]
descs = [d["title"] for d in seven_day.select(".tombstone-container img")]
print(short_descs)
print(temps)
print(descs)
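
The lists line up by position, so the first entry of each list describes the same forecast item. A small sketch of pairing them with `zip`, using hypothetical sample values in place of the scraped lists:

```python
# Hypothetical sample values standing in for the scraped lists.
sample_periods = ["Tonight", "Thursday"]
sample_short_descs = ["Mostly Clear", "Sunny"]
sample_temps = ["Low: 49 °F", "High: 63 °F"]

# Every list should hold exactly one entry per forecast item.
same_length = len(sample_periods) == len(sample_short_descs) == len(sample_temps)
print(same_length)  # True

# zip pairs the entries by position, giving one tuple per forecast item.
rows = list(zip(sample_periods, sample_short_descs, sample_temps))
for period_name, short_text, temp_text in rows:
    print(f"{period_name}: {short_text}, {temp_text}")
```

Checking the lengths first matters because `zip` silently stops at the shortest list, which would hide a missing field.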

In [None]:
# Combining our data into a pandas DataFrame
# Call the DataFrame class and pass in the lists as a dictionary.
# Each dictionary key becomes a column in the DataFrame,
# and each list becomes the values in that column.

import pandas as pd
weather = pd.DataFrame({
    "period": periods,
    "short_desc": short_descs,
    "temp": temps,
    "desc": descs
})
weather
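
Once the data is in a DataFrame, the text columns can be cleaned up for analysis. As a sketch, the numeric part of each temperature string can be pulled out with `Series.str.extract`; the sample rows below are hypothetical, as are the column names `temp_num` and `is_night`.

```python
import pandas as pd

# Hypothetical rows shaped like the scraped forecast data.
sample = pd.DataFrame({
    "period": ["Tonight", "Thursday"],
    "temp": ["Low: 49 °F", "High: 63 °F"],
})

# Capture the first run of digits and convert it to an integer column.
sample["temp_num"] = sample["temp"].str.extract(r"(\d+)", expand=False).astype(int)
# Night-time periods report a low rather than a high.
sample["is_night"] = sample["temp"].str.contains("Low")

print(sample["temp_num"].tolist())  # [49, 63]
print(sample["is_night"].tolist())  # [True, False]
```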