## Day 46 - Beautiful Soup

* Corey Schafer [video](https://www.youtube.com/watch?v=N5vscPTWKOk&vl=en) on virtual environments.

* Virtualenv [docs](https://virtualenv.pypa.io/en/stable/userguide/#usage).

* Beautiful Soup [docs](https://www.crummy.com/software/BeautifulSoup/bs4/doc/). 


In [1]:
import requests
import bs4

In [2]:
url = 'https://pybit.es/pages/projects.html'

---
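Using requests, we fetch the page and store the response in a variable called raw_site; the HTML itself is available as raw_site.text. raise_for_status() raises an exception (requests.exceptions.HTTPError) if the server answers with an error status such as 404 or 500. If we left that line out, an error page would be stored like any other response and could create issues further down the pipeline.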

In [3]:
raw_site = requests.get(url)
raw_site.raise_for_status()
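
As a rough sketch of how that check might be wrapped, the helper below catches the exception instead of letting it stop the notebook. `fetch_html` and its `timeout` default are hypothetical names chosen for illustration, not part of the original walkthrough. The second half builds a `Response` by hand so the behaviour can be seen without touching the network.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page body as text, or None if the request fails.

    Hypothetical helper: the name and the timeout default are assumptions.
    """
    try:
        response = requests.get(url, timeout=timeout)
        # Raises requests.exceptions.HTTPError for 4xx and 5xx status codes.
        response.raise_for_status()
    except requests.exceptions.RequestException as err:
        # HTTPError, ConnectionError and Timeout all derive from RequestException.
        print(f"Request failed: {err}")
        return None
    return response.text


# Offline demonstration: a hand-built Response carrying a 404 status.
fake_response = requests.Response()
fake_response.status_code = 404

try:
    fake_response.raise_for_status()
    error_raised = False
except requests.exceptions.HTTPError:
    error_raised = True

print(error_raised)
```

Returning None on failure is only one possible design; re-raising the exception may be the better choice when the rest of the pipeline cannot do anything useful without the page.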

---
select() takes a CSS selector and returns a list of every element that matches it. In this case, class="projectHeader" is what we're looking for. The syntax for a class selector is just a "." in front of our class name.

In [4]:
header_list = []
soup = bs4.BeautifulSoup(raw_site.text, 'html.parser')
html_header = soup.select('.projectHeader')
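
To make the selector syntax concrete, here is a small sketch that runs against a made-up HTML snippet rather than the live page (the markup in `sample_html` is an assumption for illustration). It shows that `.projectHeader` matches any tag with that class, while `h3.projectHeader` narrows the match to H3 tags only.

```python
import bs4

# Made-up markup for illustration; not taken from the PyBites page.
sample_html = """
<div>
  <h3 class="projectHeader">0. First project</h3>
  <p class="projectHeader">A paragraph that shares the class</p>
  <h3>A heading without the class</h3>
</div>
"""
sample_soup = bs4.BeautifulSoup(sample_html, "html.parser")

# Class selector: every element whose class list contains projectHeader.
by_class = sample_soup.select(".projectHeader")

# Tag plus class: only <h3> elements with that class.
by_tag_and_class = sample_soup.select("h3.projectHeader")

print(len(by_class), len(by_tag_and_class))
```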

---
select() returns each matching H3 tag in full, markup included. Since we only want the titles, we'll use getText() to pull out just the text.

![title](img/projectHeader.png)

In [5]:
for header in html_header:
    print(header)

<h3 class="projectHeader">0. PyBites Apps (first 100 days)</h3>
<h3 class="projectHeader">1. #100DaysOfCode (Mar 30, 2017 - Jul 07, 2017)</h3>
<h3 class="projectHeader">2. #100DaysOfDjango (Jul 08, 2017 - Oct 15, 2017)</h3>
<h3 class="projectHeader">3. PyBites Code Challenges Platform (Oct 16, 2017 - Jan 23, 2018)</h3>
<h3 class="projectHeader">4. PyBites Community #100DaysOfCode (Jan 24, 2017 - May 03, 2018)</h3>


In [6]:
for header in html_header:
    header_list.append(header.getText())

---
Now that looks much better!

In [7]:
for header in header_list:
    print(header)

0. PyBites Apps (first 100 days)
1. #100DaysOfCode (Mar 30, 2017 - Jul 07, 2017)
2. #100DaysOfDjango (Jul 08, 2017 - Oct 15, 2017)
3. PyBites Code Challenges Platform (Oct 16, 2017 - Jan 23, 2018)
4. PyBites Community #100DaysOfCode (Jan 24, 2017 - May 03, 2018)
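
The append loop can also be written as a list comprehension. The sketch below uses `get_text()`, the snake_case spelling used in the Beautiful Soup docs for the same method, on a made-up two-heading snippet. Passing `strip=True` is an optional extra that trims surrounding whitespace.

```python
import bs4

# Made-up markup for illustration, with stray whitespace in the first title.
sample_html = (
    '<h3 class="projectHeader">  0. PyBites Apps (first 100 days)  </h3>'
    '<h3 class="projectHeader">1. #100DaysOfCode</h3>'
)
sample_soup = bs4.BeautifulSoup(sample_html, "html.parser")

# One pass: select the headers and keep only their text.
titles = [tag.get_text(strip=True) for tag in sample_soup.select(".projectHeader")]

for title in titles:
    print(title)
```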


---
Let's try another URL.

In [8]:
url = 'https://pybit.es/pages/articles.html'

In [9]:
raw_site = requests.get(url)
raw_site.raise_for_status()

In [10]:
articleList = []
soup = bs4.BeautifulSoup(raw_site.text, 'html.parser')
# articleList is the class on the <ul>, so this matches the list element rather than each article
html_header = soup.select('.articleList')

In [11]:
for header in html_header:
    articleList.append(header.getText())

---

That's an impressive list! 

In [12]:
for article in articleList:
    print(article)


A Short Primer on Assembers, Compilers and Interpreters
Persistent Virtualenv Environment Variables with python-dotenv
You don't need to be a Pro @ Python to crack the code of Pycon
Career Development for Programmers
A Python Orientation - How to Get Started
How Promotions work in Large Corporations
Why Python is Great for Test Automation
My Anaconda Workflow: Python environment and package management made easy
Watch Me Code - Solving Bite 21. Query a Nested Data Structure
Why Python is so popular in Devops?
How Encoding Works in Python
Enough pytest to be Dangerous, 10 Things I Learned Writing Tests for 100 Python (Bites of Py) Exercises
Pushing the Packt "free book of the day" to the world with Scrapy and Alexa
PyCon 2018 - My First PyCon
CodeChalleng.es Platform Update 26-Mar-2018
All You Need to Know to Start Using Fixtures in Your pytest Code
Using Feedparser, Difflib and Plotly to Analyze PyBites Blog Tags
PyBites 1 Year Special - Taking Python Code Challenges to the Next Level 
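
One detail worth noting: the `articleList` class sits on the `<ul>` element, so the selector above matches the list as a single element and `getText()` returns all the titles as one block of text. If one entry per article is wanted, a descendant selector is one way to get there. The sketch below assumes a cut-down copy of the list structure; the two entries and their links are made up.

```python
import bs4

# Cut-down, made-up version of the list structure for illustration.
sample_html = """
<ul class="articleList" id="articleList">
<li><a href="/first.html">First article</a></li>
<li><a href="/second.html">Second article</a></li>
</ul>
"""
sample_soup = bs4.BeautifulSoup(sample_html, "html.parser")

# Matches the <ul> itself: a single element holding every title.
whole_list = sample_soup.select(".articleList")

# Descendant selector: each <a> inside an <li> inside the list.
per_article = sample_soup.select(".articleList li a")
article_titles = [link.get_text() for link in per_article]

print(len(whole_list), len(per_article))
print(article_titles)
```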

---
Let's look at some cool features of the soup object!

Accessing a tag name as an attribute, with the syntax soup.tagName, returns the first instance of that tag in the document.

In [13]:
soup.ul

<ul class="list">
<li><a href="/pages/about.html">About</a></li>
<li><a href="/pages/articles.html">Articles</a></li>
<li><a href="/pages/challenges.html">Code Challenges</a></li>
<li><a href="/pages/courses.html">#100DaysOfCode</a></li>
<li><a href="/pages/news.html">Python News</a></li>
<li><a href="/pages/search.html">Search</a></li>
</ul>
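
A minimal sketch of that behaviour on a made-up snippet: attribute access gives the same result as `find()`, and a tag that does not appear in the document comes back as `None` rather than raising an error.

```python
import bs4

# Made-up markup for illustration: two lists and no table.
sample_html = """
<ul class="list"><li>About</li></ul>
<ul class="social"><li>Twitter</li></ul>
"""
sample_soup = bs4.BeautifulSoup(sample_html, "html.parser")

first_ul = sample_soup.ul         # first <ul> in document order
same_ul = sample_soup.find("ul")  # the explicit spelling of the same lookup
missing = sample_soup.table       # no <table> in the snippet

print(first_ul["class"])
print(missing)
```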

---
find_all() is a little better because it returns every ul on the page, but we get all kinds of stuff that we don't want.

In [14]:
soup.find_all('ul')

[<ul class="list">
 <li><a href="/pages/about.html">About</a></li>
 <li><a href="/pages/articles.html">Articles</a></li>
 <li><a href="/pages/challenges.html">Code Challenges</a></li>
 <li><a href="/pages/courses.html">#100DaysOfCode</a></li>
 <li><a href="/pages/news.html">Python News</a></li>
 <li><a href="/pages/search.html">Search</a></li>
 </ul>, <ul class="social">
 <li><a href="https://twitter.com/pybites" target="_blank"><img alt="Follow us on Twitter" src="https://pybit.es/theme/img/socialmedia/twitter.png"/></a></li>
 <li><a href="https://github.com/pybites/" target="_blank"><img alt="Follow us on Github" src="https://pybit.es/theme/img/socialmedia/github.png"/></a></li>
 <li><a href="https://instagram.com/pybites" target="_blank"><img alt="Follow us on Instagram" src="https://pybit.es/theme/img/socialmedia/instagram.png"/></a></li>
 <li><a href="https://www.youtube.com/channel/UCBn-uKDGsRBfcB0lQeOB_gA" target="_blank"><img alt="Follow us on Youtube" src="https://pybit.es/the
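
`find_all()` also accepts filters, which is one way to cut down the noise. The sketch below runs on a made-up page with three lists and uses the `class_` and `limit` keyword arguments; the class names mirror the ones in the output above, but the markup itself is an assumption.

```python
import bs4

# Made-up markup for illustration: three lists with different classes.
sample_html = """
<ul class="list"><li>About</li></ul>
<ul class="social"><li>Twitter</li></ul>
<main><ul class="articleList"><li>First article</li></ul></main>
"""
sample_soup = bs4.BeautifulSoup(sample_html, "html.parser")

# No filter: every <ul> in the document.
all_lists = sample_soup.find_all("ul")

# class_ (with the trailing underscore) filters on the CSS class.
article_lists = sample_soup.find_all("ul", class_="articleList")

# limit stops the search after the given number of matches.
first_two = sample_soup.find_all("ul", limit=2)

print(len(all_lists), len(article_lists), len(first_two))
```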

---
Beautiful Soup lets us drill down a bit further to refine that list. On closer inspection, the unordered list that we are looking for is the only list that lives within the main part of the page.

In [15]:
soup.main.ul

<ul class="articleList" id="articleList">
<li><a href="https://pybit.es/python-interpreters.html">A Short Primer on Assembers, Compilers and Interpreters</a></li>
<li><a href="https://pybit.es/persistent-environment-variables.html">Persistent Virtualenv Environment Variables with python-dotenv</a></li>
<li><a href="https://pybit.es/howto-crack-pycon.html">You don't need to be a Pro @ Python to crack the code of Pycon</a></li>
<li><a href="https://pybit.es/career-development-programmers.html">Career Development for Programmers</a></li>
<li><a href="https://pybit.es/guest-python-orientation.html">A Python Orientation - How to Get Started</a></li>
<li><a href="https://pybit.es/guest-promotions-large-corporations.html">How Promotions work in Large Corporations</a></li>
<li><a href="https://pybit.es/guest-python-test-automation.html">Why Python is Great for Test Automation</a></li>
<li><a href="https://pybit.es/guest-anaconda-workflow.html">My Anaconda Workflow: Python environment and packa
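
Chained access like this assumes every tag in the chain exists. If the page had no main tag, `soup.main` would be `None` and the following `.ul` would raise an `AttributeError`. The sketch below, on made-up markup, shows a guarded version and the equivalent CSS lookup with `select_one()`.

```python
import bs4

# Made-up markup for illustration: one list in the navigation, one in main.
sample_html = """
<nav><ul class="list"><li>About</li></ul></nav>
<main><ul class="articleList"><li>First article</li></ul></main>
"""
sample_soup = bs4.BeautifulSoup(sample_html, "html.parser")

# Guarded chaining: only look for the list if <main> was found.
main_tag = sample_soup.main
main_list = main_tag.ul if main_tag is not None else None

# The same lookup as a CSS selector; returns None when nothing matches.
via_selector = sample_soup.select_one("main ul")

# A document with no <main>: the guard avoids an AttributeError.
no_main = bs4.BeautifulSoup("<ul><li>x</li></ul>", "html.parser")
safe_list = no_main.main.ul if no_main.main is not None else None

print(main_list["class"], safe_list)
```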

---
Another approach is to store all of the list items and then iterate over them, printing only the string portion of each.

In [16]:
all_li = soup.main.find_all('li')

In [17]:
for item in all_li:
    print(item.string)

A Short Primer on Assembers, Compilers and Interpreters
Persistent Virtualenv Environment Variables with python-dotenv
You don't need to be a Pro @ Python to crack the code of Pycon
Career Development for Programmers
A Python Orientation - How to Get Started
How Promotions work in Large Corporations
Why Python is Great for Test Automation
My Anaconda Workflow: Python environment and package management made easy
Watch Me Code - Solving Bite 21. Query a Nested Data Structure
Why Python is so popular in Devops?
How Encoding Works in Python
Enough pytest to be Dangerous, 10 Things I Learned Writing Tests for 100 Python (Bites of Py) Exercises
Pushing the Packt "free book of the day" to the world with Scrapy and Alexa
PyCon 2018 - My First PyCon
CodeChalleng.es Platform Update 26-Mar-2018
All You Need to Know to Start Using Fixtures in Your pytest Code
Using Feedparser, Difflib and Plotly to Analyze PyBites Blog Tags
PyBites 1 Year Special - Taking Python Code Challenges to the Next Level …

In [18]:
len(all_li)

118
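
A caveat on `.string`: it only returns text when a tag has a single child (or a single chain of single children), and gives `None` otherwise. That works for list items that contain nothing but a link, as above. The sketch below uses a made-up list in which the second item has extra markup, to show where `.string` and `get_text()` differ.

```python
import bs4

# Made-up markup for illustration; the second item has more than one child.
sample_html = (
    "<ul>"
    '<li><a href="/a.html">Plain title</a></li>'
    '<li><a href="/b.html">Title with</a> <em>extra markup</em></li>'
    "</ul>"
)
sample_soup = bs4.BeautifulSoup(sample_html, "html.parser")
items = sample_soup.find_all("li")

# .string follows single-child chains and gives None when there are several children.
strings = [item.string for item in items]

# get_text() joins all the text inside the tag, however it is nested.
texts = [item.get_text() for item in items]

print(strings)
print(texts)
```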