## A Comparative Study of Regional Air Quality

This notebook uses web scraping to collect pollen and allergen data for a comparative study of air quality across five cities that span three distinct terrains and two levels of population density:
1. San Diego, California (semi-arid, coastal, 62 ft elevation, xxx population);
2. Los Angeles, California;
3. Denver, Colorado (semi-arid, mountainous, 5414 ft elevation);
4. Atlanta, Georgia;
5. Nashville, Tennessee (humid subtropical, forested, 597 ft elevation).

The data gathered in this notebook comes from https://www.pollen.com/research/. The results are collected in a DataFrame and exported as a CSV file titled _'pollen_data.csv'_.

### Import the Required Libraries

In [1]:
import requests
import pandas as pd
from bs4 import BeautifulSoup as bs
from selenium import webdriver
import time

### Set up the Selenium Web Driver to Iterate through the Separate Charts for Each Season and Pollen Type

In [2]:
driver = webdriver.Chrome()

### Declare the Static Variables

In [3]:
# XPATHS for each season
seasons = {'Spring':'//*[@id="seasonlist"]/li[1]',
           'Summer':'//*[@id="seasonlist"]/li[2]',
           'Fall':'//*[@id="seasonlist"]/li[3]',
           'Winter':'//*[@id="seasonlist"]/li[4]'}

# XPATHS for each pollen type
pollens = {'Tree':'//*[@id="pollenlist"]/li[1]',
           'Grass':'//*[@id="pollenlist"]/li[2]',
           'Ragweed':'//*[@id="pollenlist"]/li[3]'}
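
The two dictionaries above follow a positional pattern, so the same XPaths can be generated rather than typed out. The cell below is only a sketch of that idea: the names `season_xpaths`, `pollen_xpaths`, and `chart_views` are illustrative and are not used elsewhere in this notebook. It also lists every (season, pollen type) chart that is visited for a single city.

```python
from itertools import product

# Keys copied from the `seasons` and `pollens` dictionaries above.
season_names = ['Spring', 'Summer', 'Fall', 'Winter']
pollen_names = ['Tree', 'Grass', 'Ragweed']

# Rebuild the positional XPaths (li[1], li[2], ...) from the list order.
# `season_xpaths` and `pollen_xpaths` are illustrative names only.
season_xpaths = {name: f'//*[@id="seasonlist"]/li[{i}]'
                 for i, name in enumerate(season_names, start=1)}
pollen_xpaths = {name: f'//*[@id="pollenlist"]/li[{i}]'
                 for i, name in enumerate(pollen_names, start=1)}

# Every (season, pollen type) pair, in the order nested loops would visit them.
chart_views = list(product(season_xpaths, pollen_xpaths))

print(len(chart_views))  # 4 seasons x 3 pollen types
for season, category in chart_views:
    print(season, category)
```

With five cities, this amounts to 5 x 12 = 60 chart views in total.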

In [4]:
# A zip codes dictionary with zip codes as keys and city names as the values
zip_dict = {'92101':'San Diego, CA',
            '90001':'Los Angeles, CA',
            '80201':'Denver, CO',
            '30301':'Atlanta, GA',
            '37201':'Nashville, TN'}

In [5]:
# The common url
url = 'https://www.pollen.com/research/'
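
Each city's page address is formed by appending its zip code to the common url. As a quick sanity check, the sketch below prints the five resulting addresses. The `city_urls` name is illustrative only, and the cell does not request the pages, so it does not confirm that each address resolves.

```python
# Same values as the cells above, repeated so this sketch runs on its own.
zip_dict = {'92101': 'San Diego, CA',
            '90001': 'Los Angeles, CA',
            '80201': 'Denver, CO',
            '30301': 'Atlanta, GA',
            '37201': 'Nashville, TN'}
url = 'https://www.pollen.com/research/'

# Map each city name to the address built as url + zip code.
# `city_urls` is an illustrative name, not used elsewhere in the notebook.
city_urls = {city: url + code for code, city in zip_dict.items()}

for city, city_url in city_urls.items():
    print(f'{city}: {city_url}')
```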

### Loop through the Charts

In [6]:
# Initiate an empty DataFrame
pollen_df = pd.DataFrame(columns=['city', 'season', 'category', 'species', 'allergenicity'])

# Loop through each city's page
for code in zip_dict:
    
    city = zip_dict[code]
    
    driver.get(url + code)
    assert city in driver.title
    time.sleep(5)
    
    # Loop through each season
    for season in seasons:
        
        # Select the season
        driver.find_element('xpath', seasons[season]).click()
        time.sleep(2)

        # Loop through each pollen category
        for category in pollens:
            
            # Select the category
            driver.find_element('xpath', pollens[category]).click()
            time.sleep(2)
            
            # Refresh the page content and locate the relevant element
            page_content = bs(driver.page_source, 'html.parser')
            species_divs = page_content.find_all(name='div', attrs={'class':'col-sm-6 no-padding'})

            # Loop through each species in the chart
            for div in species_divs:
                species = div.find('a').text
                allergenicity = div.find('div').get('class')[1].title()

                # Add the entry to the collective DataFrame
                pollen_df.loc[len(pollen_df)] = [city, season, category, species, allergenicity]
                
pollen_df

| | city | season | category | species | allergenicity |
|---|---|---|---|---|---|
| 0 | San Diego, CA | Spring | Tree | Arizona Cypress (Cupressus arizonica) | Severe |
| 1 | San Diego, CA | Spring | Tree | Arroyo Willow (Salix lasiolepis) | Severe |
| 2 | San Diego, CA | Spring | Tree | Box Elder, Ash-Leaf Maple (Acer negundo) | Severe |
| 3 | San Diego, CA | Spring | Tree | California Black Oak (Quercus kelloggii) | Severe |
| 4 | San Diego, CA | Spring | Tree | Canyon Live Oak (Quercus chrysolepis) | Severe |
| ... | ... | ... | ... | ... | ... |
| 567 | Nashville, TN | Winter | Grass | Winter Bent (Agrostis hyemalis) | Severe |
| 568 | Nashville, TN | Winter | Ragweed | Annual Ragweed (Ambrosia artemisiifolia) | Severe |
| 569 | Nashville, TN | Winter | Ragweed | Pennsylvania Pellitory (Parietaria pensy... | Severe |
| 570 | Nashville, TN | Winter | Ragweed | Rape (Brassica rapa) | Severe |
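
The least obvious step in the loop above is the allergenicity lookup: it takes the first nested `div` inside each species block and reads the second entry of its class list. The sketch below replays that step on a small, made-up HTML fragment. Only the outer class string and the "second class is the level" convention come from the loop. The inner class names, the second species entry, and the markup layout are assumptions for illustration and may not match the live page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the inner class name "allergen-level", the "mild" entry,
# and the overall layout are invented for this example.
sample_html = """
<div class="col-sm-6 no-padding">
  <div class="allergen-level severe"></div>
  <a href="#">Arizona Cypress (Cupressus arizonica)</a>
</div>
<div class="col-sm-6 no-padding">
  <div class="allergen-level mild"></div>
  <a href="#">Example Species (Genus species)</a>
</div>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

rows = []
# A class string containing a space matches the exact class attribute value.
for div in soup.find_all('div', attrs={'class': 'col-sm-6 no-padding'}):
    species = div.find('a').text
    # First nested div -> its class list -> second class, capitalized.
    level = div.find('div').get('class')[1].title()
    rows.append((species, level))

for species, level in rows:
    print(species, '|', level)
```

If the page ever changes the order of those class names, index `[1]` would silently return a different value, so it is worth spot-checking a few rows against the site.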


### Close out the Driver

In [8]:
# quit() closes every window and ends the WebDriver session
driver.quit()

### Export the DataFrame

In [7]:
pollen_df.to_csv('../data/pollen_data.csv', index=False)