# Lab Question 7

Scrape the countries of the world and the related metrics from the following site: https://scrapethissite.com/pages/simple/

Store the result in a DataFrame that looks like the following:

| name | capital | population | area |
| ---- | ------- | ---------- | ---- |
| Andorra | Andorra la Vella | 84000 | 468.0 |
| ... | ... | ... | ... |

Then save your DataFrame as "countries.csv".

### Solution

This website is very scraping-friendly, but we still have to string together a lot of concepts we've been practicing in more contained problems:
- Fetching HTML with `requests`
- Parsing it with the BeautifulSoup class
- Locating elements of interest
- Looping over multiple elements
- Creating a DataFrame from scraped elements

As for finding the elements, the simplest "container" to loop over is the `div` with class "col-md-4 country" -- there is one of these for each country, so we can call `find_all()` once and then extract the information from each result.
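
Before hitting the live site, it can help to try the lookup on a small inline snippet. The sketch below uses a hand-written `SAMPLE_HTML` string that only approximates the page's markup (the tag layout and the `flag-icon` class are assumptions for illustration), and shows why `.string` is not enough for the country name: when a tag has more than one child, `.string` returns `None`.

```python
from bs4 import BeautifulSoup

# Hand-written sample modeled on the page's markup.
# This is an approximation for illustration, not a copy of the real page.
SAMPLE_HTML = """
<div class="col-md-4 country">
  <h3 class="country-name">
    <i class="flag-icon"></i>
    Andorra
  </h3>
  <div class="country-info">
    <span class="country-capital">Andorra la Vella</span>
    <span class="country-population">84000</span>
    <span class="country-area">468.0</span>
  </div>
</div>
"""

sample = BeautifulSoup(SAMPLE_HTML, 'html.parser')

# One container div per country; the class string must match exactly.
sample_divs = sample.find_all(name='div', class_='col-md-4 country')
first = sample_divs[0]

# The h3 holds an extra element plus text, so .string gives None here.
string_result = first.h3.string

# Joining every text fragment inside the h3 recovers the name.
joined_name = ''.join(first.h3.strings).strip()

# The spans contain only text, so .string works directly.
capital_text = first.find(name='span', class_='country-capital').string

print(string_result, joined_name, capital_text)
```

Passing the full "col-md-4 country" string matches only when the class attribute is written in exactly that order; searching on `class_='country'` alone would be a looser alternative.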

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [None]:
URL = 'https://scrapethissite.com/pages/simple/'

# Get the site HTML and parse it.
response = requests.get(URL)
# Stop early on an HTTP error instead of parsing an error page.
response.raise_for_status()
bs = BeautifulSoup(response.content, 'html.parser')
# Find the divs that contain countries.
divs = bs.find_all(name='div', class_='col-md-4 country')

# For each one, extract name, capital, population, and area --
# store that info in a dictionary and add it to our list of rows.
rows = []
for div in divs:
    #######
    # Name
    #######
    # We can't just use div.h3.string because there is also an image within
    # the h3 (not just text), and .string returns None when a tag has more
    # than one child.
    name = ''.join(div.h3.strings)
    # (Optional) Get rid of whitespace around the country name
    name = name.strip()
    
    # Everything else is simpler; use the span classes and .string.
    
    # Capital
    capital = div.find(name='span', class_='country-capital').string
    # Population
    population = div.find(name='span', class_='country-population').string
    # Area
    area = div.find(name='span', class_='country-area').string
    
    # Create a dictionary of this info
    country_dict = {'name': name, 'capital': capital, 'population': population, 'area': area}
    # Add it to our list of rows
    rows.append(country_dict)

In [None]:
# Now just transform our rows into a DataFrame
country_df = pd.DataFrame(rows)
country_df
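
One thing to watch: everything pulled out with `.string` is text, so the population and area columns hold strings even though they look like numbers when displayed. A minimal sketch of converting them, using a hypothetical one-row `sample_rows` list in place of the scraped `rows`:

```python
import pandas as pd

# Hypothetical stand-in for the scraped rows; the values are strings,
# as they would be when read from the page.
sample_rows = [
    {'name': 'Andorra', 'capital': 'Andorra la Vella',
     'population': '84000', 'area': '468.0'},
]

typed_df = pd.DataFrame(sample_rows)

# pd.to_numeric picks an integer or float dtype based on the values.
typed_df['population'] = pd.to_numeric(typed_df['population'])
typed_df['area'] = pd.to_numeric(typed_df['area'])

print(typed_df.dtypes)
```

With numeric columns, operations such as sorting by population or computing density behave as expected instead of comparing text.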

In [None]:
# Save it
# index=False is a nice option when saving DataFrames -- it omits the row index.
country_df.to_csv('countries.csv', index=False)
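
To see what `index=False` changes, the sketch below compares the header line written with and without the index, then reads the CSV text back as a quick check. It uses a hypothetical one-row `demo_df` and in-memory text rather than the real file, so it does not touch "countries.csv".

```python
import io

import pandas as pd

# Hypothetical one-row frame standing in for country_df.
demo_df = pd.DataFrame([
    {'name': 'Andorra', 'capital': 'Andorra la Vella',
     'population': 84000, 'area': 468.0},
])

# With no path argument, to_csv returns the CSV text as a string.
header_with_index = demo_df.to_csv().splitlines()[0]
header_without_index = demo_df.to_csv(index=False).splitlines()[0]

print(header_with_index)     # starts with an empty column for the index
print(header_without_index)

# Reading the text back should give the same four columns.
reloaded = pd.read_csv(io.StringIO(demo_df.to_csv(index=False)))
print(reloaded.shape)
```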