# Web Scraping with BeautifulSoup and Pandas

See this [talk](https://www.youtube.com/watch?v=XQgXKtPSzUI) for further information.

A related [article](https://pythonprogramminglanguage.com/web-scraping-with-pandas-and-beautifulsoup/) targets Python 2.x, but removing the `tabulate()` call gives the same result.

In [15]:
import bs4
from urllib.request import urlopen as uReq
from os.path import basename
from urllib.parse import urljoin
from lxml import html
from lxml.cssselect import CSSSelector
import requests

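# Download the page and parse it into an lxml element tree
page = requests.get("http://www.example.com").text
doc = html.fromstring(page)

# Select the first <a> element with a CSS selector
link = doc.cssselect("a")[0]

# Show the link text and its target URL
print(link.text_content())
print(link.attrib['href'])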

More information...
http://www.iana.org/domains/example
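
The same lookup can be sketched with BeautifulSoup, which the following cells use. This is a minimal sketch that runs offline: `SAMPLE_HTML` is a hypothetical stand-in for the downloaded page and `first_link` is an illustrative helper, neither of which appears in the original notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML standing in for the downloaded page (assumption).
SAMPLE_HTML = """
<html><body>
  <p>This domain is for use in illustrative examples.</p>
  <p><a href="http://www.iana.org/domains/example">More information...</a></p>
</body></html>
"""


def first_link(markup):
    """Return (text, href) for the first <a> element, or None if there is none."""
    soup = BeautifulSoup(markup, "html.parser")
    link = soup.find("a")  # first matching tag, like doc.cssselect("a")[0]
    if link is None:
        return None
    return link.get_text(strip=True), link.get("href")


result = first_link(SAMPLE_HTML)
print(result[0])
print(result[1])
```

Returning `None` when no link exists avoids the `IndexError` that indexing `[0]` on an empty result list would raise.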


Using BeautifulSoup to scrape data from a product page at NewEgg

In [4]:
from bs4 import BeautifulSoup as BS
from urllib.request import urlopen as uReq

# Grab graphics cards
# NewEgg desktop graphics cards (per the URL below)
source_url = 'https://www.newegg.com/Desktop-Graphics-Cards/SubCategory/ID-48?Tid=7709'

client = uReq(source_url) # open conn
page_html = client.read() # grab page
client.close() # close conn to be nice

# open file to create .csv of data
#   and write headers line
filename = "desktop-graphics-cards.csv"
f = open(filename, "w")
headers = "brand,product_name,shipping\n"  # no spaces, so column names stay clean
f.write(headers)

# Parse the html
soup = BS(page_html, "html.parser")

# traverse the graphics card page elements
containers = soup.find_all("div", {"class": "item-container"})

for container in containers:
    brand = container.div.div.a.img["title"]

    title_container = container.find_all("a", {"class": "item-title"})
    # commas would break the hand-built CSV row, so swap them for pipes
    title = title_container[0].text.replace(",", "|")

    ship_container = container.find_all("li", {"class": "price-ship"})
    ship = ship_container[0].text.strip() # remove newlines, etc.

    print(brand)
    print(title)
    print(ship)

    f.write(brand + "," + title + "," + ship + "\n")

f.close()

print('\nNumber of graphics cards:')
print(len(containers))

Sapphire Tech
Sapphire Radeon NITRO+ RX 580 4GB GDDR5 PCI-E Dual HDMI / DVI-D / Dual DP w/ Backplate (UEFI)| 100411NT+4GL
$4.99 Shipping
GIGABYTE
GIGABYTE GeForce GTX 1060 DirectX 12 GV-N1060WF2OC-3GD 3GB 192-Bit GDDR5 PCI Express 3.0 x16 ATX Video Card
$4.99 Shipping
MSI
MSI Radeon RX 570 DirectX 12 RX 570 ARMOR 4G OC 4GB 256-Bit GDDR5 PCI Express 3.0 x16 HDCP Ready CrossFireX Support ATX Video Card
$4.99 Shipping
ASUS
ASUS GeForce GTX 1070 DUAL-GTX1070-O8G 8GB 256-Bit GDDR5 PCI Express 3.0 HDCP Ready SLI Support Video Card
Free Shipping
AMD
AMD Radeon Vega Frontier Edition DirectX 12 100-506061 16GB 2048-Bit HBM2 Video Card (Air Cooled Model)
Free Shipping
GIGABYTE
GIGABYTE GeForce GTX 1060 DirectX 12 GV-N1060G1 GAMING-6GD 6GB 192-Bit GDDR5 PCI Express 3.0 x16 ATX Video Card
Free Shipping
MSI
MSI Radeon RX 570 DirectX 12 RX 570 GAMING X 4G 4GB 256-Bit GDDR5 PCI Express 3.0 HDCP Ready CrossFireX Support ATX Video Card
$4.99 Shipping
Sapphire Tech
SAPPHIRE Radeon RX Vega 56 DirectX 12 
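
The cell above replaces commas in product titles with `|` so that the hand-built CSV rows stay aligned. As an alternative, the standard-library `csv` module quotes such fields automatically. The sketch below is only an illustration: `write_products` is a hypothetical helper, and the sample rows are taken from the output above, with the comma in the first title restored on the assumption that the `|` replaced one.

```python
import csv
import io


def write_products(rows, out):
    """Write a header line plus (brand, title, shipping) rows as CSV to `out`."""
    writer = csv.writer(out)
    writer.writerow(["brand", "product_name", "shipping"])
    writer.writerows(rows)


# Sample rows based on the scraped output above (comma restored by assumption).
rows = [
    ("Sapphire Tech",
     "Sapphire Radeon NITRO+ RX 580 4GB GDDR5 PCI-E Dual HDMI / DVI-D / "
     "Dual DP w/ Backplate (UEFI), 100411NT+4GL",
     "$4.99 Shipping"),
    ("ASUS",
     "ASUS GeForce GTX 1070 DUAL-GTX1070-O8G 8GB 256-Bit GDDR5 "
     "PCI Express 3.0 HDCP Ready SLI Support Video Card",
     "Free Shipping"),
]

# An in-memory buffer keeps the sketch self-contained; a real run would use
# open(filename, "w", newline="") instead.
buffer = io.StringIO(newline="")
write_products(rows, buffer)

# Reading the text back shows that the comma inside the title survives intact.
parsed = list(csv.reader(io.StringIO(buffer.getvalue(), newline="")))
print(parsed[1][1])
```

Because the writer quotes any field that contains a comma, the title can be stored unchanged and still loads back as a single column.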

### Adding Pandas for DataFrames


In [14]:
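from io import StringIO

from bs4 import BeautifulSoup as BS
import pandas as pd
import requests

res = requests.get("http://www.nationmaster.com/country-info/stats/Media/Internet-users")
soup = BS(res.content, 'lxml')

# Take the first <table> on the page
table = soup.find_all('table')[0]

# read_html returns a list of DataFrames, one per table found;
# wrapping the markup in StringIO avoids the literal-HTML deprecation warning in newer pandas
dfs = pd.read_html(StringIO(str(table)))
print(dfs[0])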

        #                                        COUNTRY         AMOUNT  DATE  \
0       1                                          China    389 million  2009   
1       2                                  United States    245 million  2009   
2       3                                          Japan  99.18 million  2009   
3     NaN    Group of 7 countries (G7) average (profile)  80.32 million  2009   
4       4                                         Brazil  75.98 million  2009   
5       5                                        Germany  65.12 million  2010   
6       6                                          India  61.34 million  2009   
7       7                                         Russia   59.7 million  2010   
8     NaN      Non-religious countries average (profile)  51.56 million  2009   
9       8                                 United Kingdom  51.44 million  2009   
10      9                                         France  44.63 million  2010   
11     10                   