# [Web Scraping with BeautifulSoup and Requests](https://www.youtube.com/watch?v=ng2o98k983k)

## Requirements
* beautifulsoup4
* requests
* lxml (the parser named in the `BeautifulSoup(..., 'lxml')` calls below)

In [1]:
from bs4 import BeautifulSoup
import requests

## Loading a local file

In [2]:
with open('simple.html') as html_file:
    soup = BeautifulSoup(html_file, 'lxml')
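
If `lxml` is not installed, the parser name can be swapped for Python's built-in `html.parser`. The sketch below assumes a short inline HTML string (the hypothetical `html_doc`) in place of `simple.html`, so it runs without any file on disk:

```python
from bs4 import BeautifulSoup

# Hypothetical inline document standing in for simple.html.
html_doc = """
<html>
<head><title>Test - A Sample Website</title></head>
<body><h1 id="site_title">Test Website</h1></body>
</html>
"""

# 'html.parser' ships with Python, so no extra install is needed.
soup_builtin = BeautifulSoup(html_doc, 'html.parser')
print(soup_builtin.title.text)
```

Different parsers can build slightly different trees from malformed HTML, so it is worth sticking to one parser within a project.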

In [3]:
print(soup)

<!DOCTYPE html>
<html class="no-js" lang="">
<head>
<title>Test - A Sample Website</title>
<meta charset="utf-8"/>
<link href="css/normalize.css" rel="stylesheet"/>
<link href="css/main.css" rel="stylesheet"/>
</head>
<body>
<h1 id="site_title">Test Website</h1>
<hr/>
<div class="article">
<h2><a href="article_1.html">Article 1 Headline</a></h2>
<p>This is a summary of article 1</p>
</div>
<hr/>
<div class="article">
<h2><a href="article_2.html">Article 2 Headline</a></h2>
<p>This is a summary of article 2</p>
</div>
<hr/>
<div class="footer">
<p>Footer Information</p>
</div>
<script src="js/vendor/modernizr-3.5.0.min.js"></script>
<script src="js/plugins.js"></script>
<script src="js/main.js"></script>
</body>
</html>


In [4]:
# prettify() re-indents the parsed HTML so the nesting is easier to read.
print(soup.prettify())

<!DOCTYPE html>
<html class="no-js" lang="">
 <head>
  <title>
   Test - A Sample Website
  </title>
  <meta charset="utf-8"/>
  <link href="css/normalize.css" rel="stylesheet"/>
  <link href="css/main.css" rel="stylesheet"/>
 </head>
 <body>
  <h1 id="site_title">
   Test Website
  </h1>
  <hr/>
  <div class="article">
   <h2>
    <a href="article_1.html">
     Article 1 Headline
    </a>
   </h2>
   <p>
    This is a summary of article 1
   </p>
  </div>
  <hr/>
  <div class="article">
   <h2>
    <a href="article_2.html">
     Article 2 Headline
    </a>
   </h2>
   <p>
    This is a summary of article 2
   </p>
  </div>
  <hr/>
  <div class="footer">
   <p>
    Footer Information
   </p>
  </div>
  <script src="js/vendor/modernizr-3.5.0.min.js">
  </script>
  <script src="js/plugins.js">
  </script>
  <script src="js/main.js">
  </script>
 </body>
</html>


### Grabbing information from this HTML

In [5]:
match = soup.title
print(match)

<title>Test - A Sample Website</title>


In [6]:
# Grabbing only the text of the title tag.
match = soup.title.text
print(match)

Test - A Sample Website
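
Besides `.text`, a tag also offers `get_text()` and `.string`, which behave differently when a tag has nested children or stray whitespace. A small sketch on a hypothetical snippet (not part of `simple.html`):

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with extra whitespace and a nested tag.
snippet = BeautifulSoup('<p>  Footer <b>Information</b>  </p>', 'html.parser')

# .text joins every piece of text inside the tag, whitespace included.
raw_text = snippet.p.text
# get_text(' ', strip=True) strips each piece and joins them with a space.
clean_text = snippet.p.get_text(' ', strip=True)
# .string is None here because <p> has more than one child.
only_string = snippet.p.string

print(repr(raw_text))
print(repr(clean_text))
print(only_string)
```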


In [8]:
match = soup.div
print(match)

<div class="article">
<h2><a href="article_1.html">Article 1 Headline</a></h2>
<p>This is a summary of article 1</p>
</div>


### Using find

In [9]:
match = soup.find('div')
print(match)

<div class="article">
<h2><a href="article_1.html">Article 1 Headline</a></h2>
<p>This is a summary of article 1</p>
</div>


In [10]:
match = soup.find('div', class_='footer')
print(match)

<div class="footer">
<p>Footer Information</p>
</div>
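
`find()` accepts other attribute filters besides `class_`, and it returns `None` when nothing matches. A minimal sketch, assuming an inline fragment modelled on the sample page; the `sidebar` class is hypothetical and deliberately absent:

```python
from bs4 import BeautifulSoup

html_doc = (
    '<h1 id="site_title">Test Website</h1>'
    '<div class="footer"><p>Footer Information</p></div>'
)
page = BeautifulSoup(html_doc, 'html.parser')

# Keyword arguments other than class_ filter on that attribute, e.g. id.
title_tag = page.find('h1', id='site_title')
print(title_tag.text)

# find() returns None when nothing matches, so check before chaining.
sidebar = page.find('div', class_='sidebar')  # hypothetical class, not in the page
sidebar_text = sidebar.text if sidebar is not None else None
print(sidebar_text)
```

Chaining directly on a missing tag, as in `page.find('div', class_='sidebar').text`, raises `AttributeError`, which is why the check comes first.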


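### Use the browser inspector to find the tag you need

- Open the page in a browser.
- Right-click the element you are interested in and choose Inspect.
- Note the tag name and its attributes (such as `class` or `id`) so you can pass them to `find()`.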
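
Once the inspector shows where an element sits, the same path can also be written as a CSS selector. A sketch using `select_one()` and `select()`, assuming a reasonably recent beautifulsoup4 (selector support comes from its `soupsieve` dependency) and an inline copy of the two article blocks:

```python
from bs4 import BeautifulSoup

html_doc = """
<div class="article"><h2><a href="article_1.html">Article 1 Headline</a></h2></div>
<div class="article"><h2><a href="article_2.html">Article 2 Headline</a></h2></div>
"""
page = BeautifulSoup(html_doc, 'html.parser')

# select_one() returns the first match for a CSS selector, or None.
first_link = page.select_one('div.article h2 a')
# select() returns a list of all matches.
all_links = page.select('div.article h2 a')

print(first_link.text)
print([a['href'] for a in all_links])
```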

In [11]:
article = soup.find('div', class_='article')
print(article)

<div class="article">
<h2><a href="article_1.html">Article 1 Headline</a></h2>
<p>This is a summary of article 1</p>
</div>


### Accessing child tags

In [13]:
headline = article.h2.a.text
print(headline)

Article 1 Headline


In [14]:
summary = article.p.text
print(summary)

This is a summary of article 1
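
Attributes of a child tag, such as the link's `href`, can be read the same way. A short sketch on an inline copy of the first article block (`article_tag` and `link` are illustrative names):

```python
from bs4 import BeautifulSoup

html_doc = (
    '<div class="article">'
    '<h2><a href="article_1.html">Article 1 Headline</a></h2>'
    '</div>'
)
article_tag = BeautifulSoup(html_doc, 'html.parser').find('div', class_='article')

link = article_tag.h2.a
href = link['href']          # dictionary-style access raises KeyError if missing
title = link.get('title')    # .get() returns None instead of raising
classes = article_tag['class']  # class is multi-valued, so it comes back as a list

print(href, title, classes)
```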


### Use find_all()

In [29]:
articles = soup.find_all('div', class_='article')

In [31]:
for article in articles:
    headline = article.h2.a.text
    print(headline)

    summary = article.p.text
    print(summary)

    print()

Article 1 Headline
This is a summary of article 1

Article 2 Headline
This is a summary of article 2
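
Instead of printing inside the loop, the results can be collected into a list for later use. A sketch, assuming an inline copy of the two article blocks; the `rows` structure and its keys are illustrative choices:

```python
from bs4 import BeautifulSoup

html_doc = """
<div class="article">
<h2><a href="article_1.html">Article 1 Headline</a></h2>
<p>This is a summary of article 1</p>
</div>
<div class="article">
<h2><a href="article_2.html">Article 2 Headline</a></h2>
<p>This is a summary of article 2</p>
</div>
"""
page = BeautifulSoup(html_doc, 'html.parser')

# Build one dictionary per article.
rows = []
for art in page.find_all('div', class_='article'):
    rows.append({
        'headline': art.h2.a.text,
        'summary': art.p.text,
        'link': art.h2.a['href'],
    })

for row in rows:
    print(row)
```

A list of dictionaries like this is easy to hand to `csv.DictWriter` or `json.dump` afterwards.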



## From an actual website
[Website link](http://coreyms.com)

In [32]:
source = requests.get('http://coreyms.com').text
soup = BeautifulSoup(source, 'lxml')
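
A request can fail or hang, so it can help to wrap the download in a small helper. The sketch below is one possible approach, not part of the original notebook: `fetch_soup` is a hypothetical name, the 10-second timeout is an arbitrary choice, and `html.parser` is used so the block does not depend on `lxml`.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Return a parsed page, or None if the request fails (hypothetical helper)."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # turn 4xx/5xx responses into exceptions
    except requests.exceptions.RequestException as exc:
        print(f'Request failed: {exc}')
        return None
    return BeautifulSoup(response.text, 'html.parser')
```

Usage would look like `soup = fetch_soup('http://coreyms.com')`, followed by a check that the result is not `None` before calling `find()` on it.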

In [34]:
article = soup.find('article')

In [35]:
print(article.prettify())

<article class="post-1670 post type-post status-publish format-standard has-post-thumbnail category-development category-python tag-gzip tag-shutil tag-zip tag-zipfile entry" itemscope="" itemtype="https://schema.org/CreativeWork">
 <header class="entry-header">
  <h2 class="entry-title" itemprop="headline">
   <a class="entry-title-link" href="https://coreyms.com/development/python/python-tutorial-zip-files-creating-and-extracting-zip-archives" rel="bookmark">
    Python Tutorial: Zip Files – Creating and Extracting Zip Archives
   </a>
  </h2>
  <p class="entry-meta">
   <time class="entry-time" datetime="2019-11-19T13:02:37-05:00" itemprop="datePublished">
    November 19, 2019
   </time>
   by
   <span class="entry-author" itemprop="author" itemscope="" itemtype="https://schema.org/Person">
    <a class="entry-author-link" href="https://coreyms.com/author/coreymschafer" itemprop="url" rel="author">
     <span class="entry-author-name" itemprop="name">
      Corey Schafer
     </spa

In [37]:
headline = article.h2.a.text
print(headline)

Python Tutorial: Zip Files – Creating and Extracting Zip Archives


In [38]:
summary = article.find('div', class_='entry-content').p.text
print(summary)

In this video, we will be learning how to create and extract zip archives. We will start by using the zipfile module, and then we will see how to do this using the shutil module. We will learn how to do this with single files and directories, as well as learning how to use gzip as well. Let’s get started…


In [39]:
vid_src = article.find('iframe', class_='youtube-player')['src']
print(vid_src)

https://www.youtube.com/embed/z0gguhEmWiY?version=3&rel=1&showsearch=0&showinfo=1&iv_load_policy=1&fs=1&hl=en-US&autohide=2&wmode=transparent
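
The embed URL can be turned into a regular watch link by pulling out the video ID. A sketch using the standard library's `urllib.parse`; the URL is copied from the output above with the query string shortened, and the watch-link format is an assumption about YouTube's usual URL scheme:

```python
from urllib.parse import urlparse

# Embed URL from the output above (query string shortened).
vid_src = 'https://www.youtube.com/embed/z0gguhEmWiY?version=3&rel=1'

# The video ID is the last path segment; the query string is ignored.
vid_id = urlparse(vid_src).path.rsplit('/', 1)[-1]
yt_link = f'https://www.youtube.com/watch?v={vid_id}'

print(vid_id)
print(yt_link)
```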
