
## Python Frameworks and Tools for Scraping
### Three popular web scraping tools for Python

#### Three main steps
* Get the Data
* Parse the data
* Save the data
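
The three steps can be sketched end to end as follows. This is a minimal illustration rather than the notebook's own code: the inline HTML string stands in for a downloaded page (in practice it would come from `requests.get(url).content`), and the in-memory buffer stands in for a CSV file on disk.

```python
import csv
import io

from bs4 import BeautifulSoup

# 1. Get the data (assumption: inline HTML stands in for a downloaded page)
html = "<html><body><h2>A Header</h2><h2>Another header</h2></body></html>"

# 2. Parse the data
soup = BeautifulSoup(html, "html.parser")
headers = [h2.get_text() for h2 in soup.find_all("h2")]

# 3. Save the data (assumption: an in-memory buffer stands in for a CSV file)
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["header"])
for text in headers:
    writer.writerow([text])

saved = buffer.getvalue()
print(saved)
```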

### 1. Beautiful Soup:

#### Pros:
* User friendly
* Works with both HTML and XML
* Easy to learn and master

#### Cons:
* Needs external dependencies (an HTTP library such as requests to fetch pages), which can cause portability issues
* Inefficient for large amounts of data

#### For
* Projects that need fewer features
* Beginners
* Quick/small projects



### 2. Selenium:
#### Pros:
* Designed for automating web applications
* Works well with JavaScript-rendered pages
* More versatile
#### Cons:
* Very slow, so not ideal for plain scraping
* Not user friendly
* Inefficient for large amounts of data

#### For
* Automating a wide range of actions, e.g. click, hover, etc.

### 3. Scrapy:
#### Pros:
* Efficient
* Powerful and customizable
* Self-contained framework: fetching, parsing, and exporting are built in
* Good portability; works on all major operating systems

#### Cons:
* Not user friendly
* Not easy to master

#### For
* Full-featured scraping
* Advanced users
* Complete projects


In [None]:
# Load in the necessary libraries
#!pip install requests
import requests
from bs4 import BeautifulSoup as bs # pip install beautifulsoup4

## Load our first page

In [None]:
# Load the webpage content
# In CSS selectors, a dot (.) denotes a class
# and a hash (#) denotes an id
# requests.get() sends a GET request and returns a Response object
# (status code in r.status_code, raw bytes in r.content)
r = requests.get("https://keithgalli.github.io/web-scraping/example.html")
r.content

b'<html>\n<head>\n<title>HTML Example</title>\n</head>\n<body>\n\n<div align="middle">\n<h1>HTML Webpage</h1>\n<p>Link to more interesting example: <a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a></p>\n</div>\n\n<h2>A Header</h2>\n<p><i>Some italicized text</i></p>\n\n<h2>Another header</h2>\n<p id="paragraph-id"><b>Some bold text</b></p>\n\n</body>\n</html>\n'
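
`requests.get()` returns a `Response` object, not a bare status code. A hedged sketch of a small wrapper that checks the status before handing back the bytes follows; the name `fetch_html` and the 10-second default timeout are illustrative choices, not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its raw bytes, raising on HTTP errors."""
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    return response.content
```

Calling `fetch_html("https://keithgalli.github.io/web-scraping/example.html")` should return the same bytes as `r.content` above, provided the page is reachable.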

In [None]:
# Convert to a beautiful soup object, naming the parser explicitly
soup = bs(r.content, "html.parser")

In [None]:
# Print out our HTML with prettify to see it properly indented
print(soup.prettify())

<html>
 <head>
  <title>
   HTML Example
  </title>
 </head>
 <body>
  <div align="middle">
   <h1>
    HTML Webpage
   </h1>
   <p>
    Link to more interesting example:
    <a href="https://keithgalli.github.io/web-scraping/webpage.html">
     keithgalli.github.io/web-scraping/webpage.html
    </a>
   </p>
  </div>
  <h2>
   A Header
  </h2>
  <p>
   <i>
    Some italicized text
   </i>
  </p>
  <h2>
   Another header
  </h2>
  <p id="paragraph-id">
   <b>
    Some bold text
   </b>
  </p>
 </body>
</html>



# Start using Beautiful Soup to Scrape

### find and find_all

In [None]:
# find returns the first matching element in the document
first_header = soup.find("h2")
print(first_header)


<h2>A Header</h2>


In [None]:
headers = soup.find_all("h2")
print(headers)

[<h2>A Header</h2>, <h2>Another header</h2>]


In [None]:
# Pass in a list of tag names; find returns whichever match comes first
# in the document, so the order of the list does not matter.
first_header = soup.find(["h2", "h1"])
print(first_header)
second_header = soup.find(["h1", "h2"])
print(second_header)

<h1>HTML Webpage</h1>
<h1>HTML Webpage</h1>


In [None]:
# The find_all method returns all matches, in document order.
headers = soup.find_all(["h1", "h2"])
print(headers)

[<h1>HTML Webpage</h1>, <h2>A Header</h2>, <h2>Another header</h2>]


In [None]:
# You can pass in attributes to the find/find_all function
paragraph = soup.find_all("p", attrs={"id": "paragraph-id"})
paragraph

[<p id="paragraph-id"><b>Some bold text</b></p>]
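
Beautiful Soup also accepts attribute filters as keyword arguments, which is often shorter than an `attrs` dictionary. A small sketch, assuming a made-up inline document (the `note` class is hypothetical and does not appear on the example page):

```python
from bs4 import BeautifulSoup

# Assumption: a small inline document, loosely modelled on the example page
html = '<p id="paragraph-id" class="note">Some bold text</p><p class="note">Other</p>'
soup = BeautifulSoup(html, "html.parser")

# Keyword shorthand for attrs={"id": "paragraph-id"}
by_id = soup.find_all("p", id="paragraph-id")

# "class" is a reserved word in Python, so Beautiful Soup uses class_
by_class = soup.find_all("p", class_="note")

# find returns None when nothing matches, while find_all returns an empty list
missing_one = soup.find("p", id="no-such-id")
missing_all = soup.find_all("p", id="no-such-id")

print(len(by_id), len(by_class), missing_one, missing_all)
```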

In [None]:
# You can use nested calls of find/find_all
body = soup.find('body')

print("_______Body_______")
print(body)

# Searching from body restricts the search to its contents; here we get the first div.

div = body.find('div')
print("_______Div_______")
print(div)

# Searching from div in turn returns only the h1 inside that div.
header = div.find('h1')
print("_______Header_______")
header

_______Body_______
<body>
<div align="middle">
<h1>HTML Webpage</h1>
<p>Link to more interesting example: <a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a></p>
</div>
<h2>A Header</h2>
<p><i>Some italicized text</i></p>
<h2>Another header</h2>
<p id="paragraph-id"><b>Some bold text</b></p>
</body>
_______Div_______
<div align="middle">
<h1>HTML Webpage</h1>
<p>Link to more interesting example: <a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a></p>
</div>
_______Header_______


<h1>HTML Webpage</h1>

In [None]:
# We can search for specific text in find/find_all calls,
# either with an exact string or with a regular expression.
import re

paragraphs = soup.find_all("p", string=re.compile("Some"))
print(paragraphs)

[<p><i>Some italicized text</i></p>, <p id="paragraph-id"><b>Some bold text</b></p>]


In [None]:
import re
headers = soup.find_all("h2", string=re.compile("(H|h)eader"))
print(headers)

headers2 = soup.find_all("h2", string=re.compile("(A)?r"))
print(headers2)

[<h2>A Header</h2>, <h2>Another header</h2>]
[<h2>A Header</h2>, <h2>Another header</h2>]
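
Both patterns return the same two headers, and the reason is easier to see with plain `re`. The sketch below assumes Beautiful Soup applies a compiled pattern with search semantics (a match anywhere in the text); the list of texts is illustrative.

```python
import re

header_pattern = re.compile("(H|h)eader")
optional_a_pattern = re.compile("(A)?r")

texts = ["A Header", "Another header", "Photos"]

# search() succeeds if the pattern occurs anywhere in the text
header_hits = [t for t in texts if header_pattern.search(t)]

# (A)? may match nothing at all, so this pattern only really requires an "r"
optional_a_hits = [t for t in texts if optional_a_pattern.search(t)]

print(header_hits)
print(optional_a_hits)
```

Because the group `(A)` is optional, `"(A)?r"` matches any text that contains the letter "r", which is why it does not narrow the result.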


### select (CSS selector)
    BeautifulSoup supports the most commonly used CSS selectors through its select method.
    A CSS selector is the part of a CSS rule set that picks out the elements you want to style.
    Universal selector (*)
    ID selector (#)
    Class selector (.)
    Sibling combinators: h2 ~ p (any p that follows a sibling h2), p + p (a p immediately after a sibling p)

In [None]:
# prettify() returns the HTML as a neatly indented string.
print(soup.body.prettify())

<body>
 <div align="middle">
  <h1>
   HTML Webpage
  </h1>
  <p>
   Link to more interesting example:
   <a href="https://keithgalli.github.io/web-scraping/webpage.html">
    keithgalli.github.io/web-scraping/webpage.html
   </a>
  </p>
 </div>
 <h2>
  A Header
 </h2>
 <p>
  <i>
   Some italicized text
  </i>
 </p>
 <h2>
  Another header
 </h2>
 <p id="paragraph-id">
  <b>
   Some bold text
  </b>
 </p>
</body>



In [None]:
content = soup.select("div p")
content

[<p>Link to more interesting example: <a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a></p>]

In [None]:
# Find all paragraphs that come after an h2 and share its parent (general sibling combinator).
paragraphs = soup.select("h2 ~ p")
for paragraph in paragraphs:
    print(paragraph.text)
paragraphs

Some italicized text
Some bold text


[<p><i>Some italicized text</i></p>,
 <p id="paragraph-id"><b>Some bold text</b></p>]
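
The two sibling combinators behave differently, which a compact example makes visible. The inline HTML below is an assumption made for illustration and is not the example page.

```python
from bs4 import BeautifulSoup

# Assumption: a compact inline document used only for illustration
html = "<body><h2>Title</h2><p>first</p><p>second</p><div><p>nested</p></div></body>"
soup = BeautifulSoup(html, "html.parser")

# General sibling: every <p> that follows an <h2> with the same parent
general = [p.get_text() for p in soup.select("h2 ~ p")]

# Adjacent sibling: only a <p> that comes immediately after an <h2>
adjacent = [p.get_text() for p in soup.select("h2 + p")]

# select_one returns the first match (or None) instead of a list
first = soup.select_one("h2 ~ p")

print(general, adjacent, first.get_text())
```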

In [None]:
# First find paragraph with id=paragraph-id and then find bold tag (<b></b>)
bold_text = soup.select("p#paragraph-id b")
bold_text

[<b>Some bold text</b>]

In [None]:
#print(soup.body.prettify())

In [None]:
# Select h1 elements whose direct parent is a div.
# The child combinator (>) matches direct children only, never grandchildren.
headers = soup.select("div > h1")
print(headers)

print("_______________________")
# select can be chained; the h1 contains no <i> tag, so this prints an empty list
for header in headers:
    print(header.select("i"))

[<h1>HTML Webpage</h1>]
_______________________
[]
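
To see the difference between the descendant combinator (a space) and the child combinator (`>`), compare the two on paragraphs. This is a sketch under the assumption of a simplified inline page that mirrors the example's structure.

```python
from bs4 import BeautifulSoup

# Assumption: inline HTML that mirrors the structure of the example page
html = (
    "<body>"
    "<div><h1>HTML Webpage</h1><p>inside div</p></div>"
    "<p>first body paragraph</p>"
    "<p>second body paragraph</p>"
    "</body>"
)
soup = BeautifulSoup(html, "html.parser")

# Descendant combinator (space): any <p> nested at any depth under <body>
descendants = soup.select("body p")

# Child combinator (>): only <p> elements whose direct parent is <body>
children = soup.select("body > p")

print(len(descendants), len(children))
```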


In [None]:
# Grab all elements that have a specific attribute value
soup.select("[align=middle]")

[<div align="middle">\n<h1>HTML Webpage</h1>\n<p>Link to more interesting example: <a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a></p>\n</div>]

## Get different properties of the HTML

In [None]:
# To get string value of elements, use .string
header = soup.find("h2")
header.string


'A Header'

In [None]:
# If an element contains several child elements, .string returns None;
# use get_text() to get the text of all of them
div = soup.find("div")
print(div.prettify())
print(div.get_text())

<div align="middle">
 <h1>
  HTML Webpage
 </h1>
 <p>
  Link to more interesting example:
  <a href="https://keithgalli.github.io/web-scraping/webpage.html">
   keithgalli.github.io/web-scraping/webpage.html
  </a>
 </p>
</div>


HTML Webpage
Link to more interesting example: keithgalli.github.io/web-scraping/webpage.html
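
The contrast between `.string` and `get_text()` can be checked on a shortened version of the same `<div>`. The inline HTML is an assumption for illustration; `separator` and `strip` are optional arguments of `get_text()`.

```python
from bs4 import BeautifulSoup

# Assumption: a shortened inline version of the <div> from the example page
html = '<div><h1>HTML Webpage</h1><p>Link: <a href="#">page</a></p></div>'
soup = BeautifulSoup(html, "html.parser")
div = soup.find("div")

# .string is None because the <div> has more than one child
print(div.string)

# .string works on a tag with a single text child
print(div.h1.string)

# separator and strip control how the pieces of text are joined
text = div.get_text(separator=" ", strip=True)
print(text)
```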



In [None]:
# Get a specific attribute from an element
link = soup.find("a")
link['href']


'https://keithgalli.github.io/web-scraping/webpage.html'

In [None]:
# Select the paragraph whose id is "paragraph-id", then read its id attribute
paragraphs = soup.select("p#paragraph-id")
paragraphs[0]['id']

'paragraph-id'
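
Indexing a tag like a dictionary raises `KeyError` when the attribute is missing, so `.get()` is the safer choice for optional attributes. A minimal sketch, assuming a one-link inline document:

```python
from bs4 import BeautifulSoup

# Assumption: a one-link inline document for illustration
html = '<a href="https://keithgalli.github.io/web-scraping/webpage.html">link</a>'
soup = BeautifulSoup(html, "html.parser")
link = soup.find("a")

# Indexing raises KeyError when the attribute is absent
try:
    link["title"]
    has_title = True
except KeyError:
    has_title = False

# .get() returns None (or a default) instead of raising
title = link.get("title")
href = link.get("href")

# .attrs exposes all attributes as a dictionary
all_attrs = link.attrs

print(has_title, title, href, all_attrs)
```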

## Code navigation

In [None]:
# Path Syntax
print(soup.body.prettify())

<body>
 <div align="middle">
  <h1>
   HTML Webpage
  </h1>
  <p>
   Link to more interesting example:
   <a href="https://keithgalli.github.io/web-scraping/webpage.html">
    keithgalli.github.io/web-scraping/webpage.html
   </a>
  </p>
 </div>
 <h2>
  A Header
 </h2>
 <p>
  <i>
   Some italicized text
  </i>
 </p>
 <h2>
  Another header
 </h2>
 <p id="paragraph-id">
  <b>
   Some bold text
  </b>
 </p>
</body>



In [None]:
# Know the terms:
# Parent: the element that directly contains another element
# Sibling: elements that share the same parent element
# Child: an element directly inside another element

# find_next_siblings() returns every sibling that comes after the given element.
# Here we want the siblings that follow the <div> element, so <div> itself
# is not included in the list.

soup.body.find("div").find_next_siblings()

[<h2>A Header</h2>,
 <p><i>Some italicized text</i></p>,
 <h2>Another header</h2>,
 <p id="paragraph-id"><b>Some bold text</b></p>]
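
The same terms map onto a handful of navigation attributes and methods. The sketch below assumes compact inline HTML with no whitespace between tags, so that no stray text nodes show up among the children.

```python
from bs4 import BeautifulSoup

# Assumption: compact inline HTML (no whitespace between tags, so there are
# no stray text nodes among the siblings)
html = "<body><div><h1>Top</h1></div><h2>A Header</h2><p>text</p></body>"
soup = BeautifulSoup(html, "html.parser")
h2 = soup.find("h2")

parent_name = h2.parent.name                     # tag that contains the <h2>
previous_name = h2.find_previous_sibling().name  # nearest sibling before it
next_name = h2.find_next_sibling().name          # nearest sibling after it
child_names = [child.name for child in soup.body.children]

print(parent_name, previous_name, next_name, child_names)
```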

## Load the webpage





In [None]:
# Load the webpage content
r = requests.get("https://keithgalli.github.io/web-scraping/webpage.html")

# Convert to a beautiful soup object, naming the parser explicitly
webpage = bs(r.content, "html.parser")

# Print out our html
print(webpage.prettify())

<html>
 <head>
  <title>
   Keith Galli's Page
  </title>
  <style>
   table {
    border-collapse: collapse;
  }
  th {
    padding:5px;
  }
  td {
    border: 1px solid #ddd;
    padding: 5px;
  }
  tr:nth-child(even) {
    background-color: #f2f2f2;
  }
  th {
    padding-top: 12px;
    padding-bottom: 12px;
    text-align: left;
    background-color: #add8e6;
    color: black;
  }
  .block {
  width: 100px;
  /*float: left;*/
    display: inline-block;
    zoom: 1;
  }
  .column {
  float: left;
  height: 200px;
  /*width: 33.33%;*/
  padding: 5px;
  }

  .row::after {
    content: "";
    clear: both;
    display: table;
  }
  </style>
 </head>
 <body>
  <h1>
   Welcome to my page!
  </h1>
  <img src="./images/selfie1.jpg" width="300px"/>
  <h2>
   About me
  </h2>
  <p>
   Hi, my name is Keith and I am a YouTuber who focuses on content related to programming, data science, and machine learning!
  </p>
  <p>
   Here is a link to my channel:
   <a href="https://www.youtube.com/kgmi

## Grab all of the social links from the webpage

Do this in at least 3 different ways



In [None]:
# Select the ul with class "socials" (the dot denotes a class), then all hyperlinks inside it.
links = webpage.select("ul.socials a")
actual_links = [link['href'] for link in links]
actual_links

['https://www.instagram.com/keithgalli/',
 'https://twitter.com/keithgalli',
 'https://www.linkedin.com/in/keithgalli/',
 'https://www.tiktok.com/@keithgalli']

In [None]:
# You can also use the find method to get the ul with class "socials".
ulist = webpage.find("ul", attrs={"class": "socials"})
# Then find all hyperlinks inside that ul (unordered list).
links = ulist.find_all("a")
actual_links = [link['href'] for link in links]
actual_links

['https://www.instagram.com/keithgalli/',
 'https://twitter.com/keithgalli',
 'https://www.linkedin.com/in/keithgalli/',
 'https://www.tiktok.com/@keithgalli']

In [None]:
# Select the hyperlinks inside every li (list item) with class "social"
links = webpage.select("li.social a")
actual_links = [link['href'] for link in links]
actual_links

['https://www.instagram.com/keithgalli/',
 'https://twitter.com/keithgalli',
 'https://www.linkedin.com/in/keithgalli/',
 'https://www.tiktok.com/@keithgalli']

In [None]:
# Chain descendant selectors: body, then ul, then li with class "social", then all hyperlinks
links = webpage.select("body ul li.social a")
links

[<a href="https://www.instagram.com/keithgalli/">https://www.instagram.com/keithgalli/</a>,
 <a href="https://twitter.com/keithgalli">https://twitter.com/keithgalli</a>,
 <a href="https://www.linkedin.com/in/keithgalli/">https://www.linkedin.com/in/keithgalli/</a>,
 <a href="https://www.tiktok.com/@keithgalli">https://www.tiktok.com/@keithgalli</a>]

## Exercise \#2: Grab all text on the webpage

Only grab the content above the Photos heading



In [None]:
header = webpage.body.find("h2", string="Photos")
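# find_previous_siblings() returns the nearest sibling first (reverse document order)
previous_elements = header.find_previous_siblings()
# Reverse the list to restore the order in which the elements appear on the page
previous_elements_sorted = previous_elements[::-1]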
elements = [x.get_text() for x in previous_elements_sorted]
text = "\n".join(elements)
print(text)


Welcome to my page!

About me
Hi, my name is Keith and I am a YouTuber who focuses on content related to programming, data science, and machine learning!
Here is a link to my channel: youtube.com/kgmit
I grew up in the great state of New Hampshire here in the USA. From an early age I always loved math. Around my senior year of high school, my brother first introduced me to programming. I found it a creative way to apply the same type of logical thinking skills that I enjoyed with math. This influenced me to study computer science in college and ultimately create a YouTube channel to share some things that I have learned along the way.
Hobbies
Believe it or not, I don't code 24/7. I love doing all sorts of active things. I like to play ice hockey & table tennis as well as run, hike, skateboard, and snowboard. In addition to sports, I am a board game enthusiast. The two that I've been playing the most recently are Settlers of Catan and Othello.
Fun Facts

Owned my dream car in high schoo

## Scrape the Table




In [None]:
# str.strip() removes leading and trailing whitespace (or the given characters) from a string.

import pandas as pd

table = webpage.select("table.hockey-stats")[0]
print(table)
columns = table.find("thead").find_all("th")
#print(columns)
column_names = [c.string for c in columns]
print(column_names)

table_rows = table.find("tbody").find_all("tr")
rows = []
for tr in table_rows:
    cells = tr.find_all('td')
    # Strip the surrounding whitespace and newlines from each cell's text
    row = [cell.get_text().strip() for cell in cells]
    rows.append(row)

df = pd.DataFrame(rows, columns=column_names)
df.head()

<table class="hockey-stats">
<thead>
<tr>
<th class="season" data-sort="">S</th>
<th class="team" data-sort="team">Team</th>
<th class="league" data-sort="league">League</th>
<th class="regular gp" data-sort="gp">GP</th>
<th class="regular g" data-sort="g">G</th>
<th class="regular a" data-sort="a">A</th>
<th class="regular tp" data-sort="tp">TP</th>
<th class="regular pim" data-sort="pim">PIM</th>
<th class="regular pm" data-sort="pm">+/-</th>
<th class="separator"> </th>
<th class="postseason">POST</th>
<th class="postseason gp" data-sort="playoffs-gp">GP</th>
<th class="postseason g" data-sort="playoffs-g">G</th>
<th class="postseason a" data-sort="playoffs-a">A</th>
<th class="postseason tp" data-sort="playoffs-tp">TP</th>
<th class="postseason pim" data-sort="playoffs-pim">PIM</th>
<th class="postseason pm" data-sort="playoffs-pm">+/-</th>
</tr>
</thead>
<tbody>
<tr class="team-continent-NA">
<td class="season sorted">
                  2014-15
              </td>
<td class="team"

Unnamed: 0,S,Team,League,GP,G,A,TP,PIM,+/-,Unnamed: 10,POST,GP.1,G.1,A.1,TP.1,PIM.1,+/-.1
0,2014-15,MIT (Mass. Inst. of Tech.),ACHA II,17.0,3.0,9.0,12.0,20.0,,|,,,,,,,
1,2015-16,MIT (Mass. Inst. of Tech.),ACHA II,9.0,1.0,1.0,2.0,2.0,,|,,,,,,,
2,2016-17,MIT (Mass. Inst. of Tech.),ACHA II,12.0,5.0,5.0,10.0,8.0,0.0,|,,,,,,,
3,2017-18,Did not play,,,,,,,,|,,,,,,,
4,2018-19,MIT (Mass. Inst. of Tech.),ACHA III,8.0,5.0,10.0,15.0,8.0,,|,,,,,,,
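
When pandas is not needed, the same header-and-rows pattern can produce a list of dictionaries. This is a sketch under stated assumptions: the inline table is a trimmed stand-in for the real one, keeping the `hockey-stats` class and three columns, with values taken from the output above.

```python
from bs4 import BeautifulSoup

# Assumption: a tiny inline table; the class name matches the page and the
# two data rows reuse values shown in the output above
html = """
<table class="hockey-stats">
  <thead><tr><th>S</th><th>Team</th><th>GP</th></tr></thead>
  <tbody>
    <tr><td> 2014-15 </td><td>MIT (Mass. Inst. of Tech.)</td><td>17</td></tr>
    <tr><td> 2015-16 </td><td>MIT (Mass. Inst. of Tech.)</td><td>9</td></tr>
  </tbody>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
table = soup.select_one("table.hockey-stats")

column_names = [th.get_text(strip=True) for th in table.select("thead th")]

records = []
for tr in table.select("tbody tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    # Pair each header with the cell in the same position
    records.append(dict(zip(column_names, cells)))

print(records)
```

The full table repeats column names such as GP for the regular season and the postseason, so a dictionary per row would overwrite the earlier value; the trimmed table sidesteps that, and the DataFrame approach above keeps both columns.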


## Grab all fun facts that use word "is"




In [None]:
# re.compile("is") matches "is" anywhere in the text, even inside a longer word;
# use r"\bis\b" to match only the standalone word

import re

facts = webpage.select("ul.fun-facts li")
facts_with_is = [fact.find(string=re.compile("is")) for fact in facts]

print(facts_with_is)

facts_with_is = [fact.find_parent().get_text() for fact in facts_with_is if fact]
facts_with_is


[None, 'Middle name is Ronald', None, 'Dunkin Donuts coffee is better than Starbucks', 'A favorite book series of mine is ', 'Current video game of choice is ', "The band that I've seen the most times live is the "]


['Middle name is Ronald',
 'Dunkin Donuts coffee is better than Starbucks',
 "A favorite book series of mine is Ender's Game",
 'Current video game of choice is Rocket League',
 "The band that I've seen the most times live is the Zac Brown Band"]

## Download an Image




In [None]:
# Done locally, but here is the code
import requests # pip install requests
from bs4 import BeautifulSoup as bs # pip install beautifulsoup4

# Load the webpage content
url = "https://keithgalli.github.io/web-scraping/"
r = requests.get(url+"webpage.html")

# Convert to a beautiful soup object, naming the parser explicitly
webpage = bs(r.content, "html.parser")
#print(webpage.prettify())
images = webpage.select("div.row div.column img")
image_url = images[0]['src']
full_url = url + image_url
img_data = requests.get(full_url).content

with open('lake_como.jpg', 'wb') as handler:
    handler.write(img_data)
    print("Image downloaded Link: ", full_url)

Image downloaded Link:  https://keithgalli.github.io/web-scraping/images/italy/lake_como.jpg
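
Concatenating the base URL and the `src` value works here because the path is relative to the page's folder. A more general sketch uses `urljoin` from the standard library; the relative paths below are taken from the page, while deriving the file name from the URL is an illustrative addition.

```python
import os
from urllib.parse import urljoin, urlparse

page_url = "https://keithgalli.github.io/web-scraping/webpage.html"

# urljoin resolves a relative src against the page it was found on
full_url = urljoin(page_url, "images/italy/lake_como.jpg")

# A leading "./" is resolved the same way
selfie_url = urljoin(page_url, "./images/selfie1.jpg")

# Derive a local file name from the last segment of the URL path
file_name = os.path.basename(urlparse(full_url).path)

print(full_url)
print(selfie_url)
print(file_name)
```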
