# Web Scraping

[Coffee Review](https://www.coffeereview.com/) is a review aggregator founded by Kenneth Davids. He and his team have expertly reviewed over 5000 coffee roasts since 1997. The code below scrapes their website and collects data on each reviewed coffee. The scrape was run on January 23, 2019.

In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import time
import re

**Warning:** This notebook takes about 4 hours to run completely. The final scraped data can be found in the `data` folder of this repo.
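For a rough sense of where that time goes, the sketch below adds up the requests made by the three scraping loops in this notebook (257 listing pages, 5124 review pages, and the search-result pages listed in `search_dict`) and multiplies by the one-second pause between requests. The page counts are taken from the cells below; the variable names are illustrative only.

```python
# Rough lower bound on run time from the deliberate 1-second pauses alone.
# Page counts come from the loops in this notebook; names are illustrative.
listing_pages = 257
review_pages = 5124
# values of search_dict, one entry per advanced-search filter
search_pages = sum([56, 3, 41, 6, 20, 22, 35, 23, 15, 4, 9, 22, 34])

total_requests = listing_pages + review_pages + search_pages
pause_seconds = 1  # matches time.sleep(1) in each loop

sleep_hours = total_requests * pause_seconds / 3600
print(f'{total_requests} requests, at least {sleep_hours:.1f} hours of pauses')
```

The pauses alone account for roughly 1.6 hours; the rest of the run time is presumably network round trips and HTML parsing.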

## Scrape through 257 pages of coffees for their name and slug

In [2]:
#initialize empty list of coffee review names and slugs
name_slug_list = []

#iterate through 257 pages of coffee reviews
print('begin scraping')
for i in range(1,257+1):
    url = 'https://www.coffeereview.com/advanced-search/?pg={}'.format(i)
    res = requests.get(url,headers = {'User-agent': 'Lisa Vanderpump'})
    soup = BeautifulSoup(res.content, "lxml")
    names = soup.find_all('h2',{'class':"review-title"})
    
    #collect name and slugs
    for name in names:
        ns_dict ={}
        ns_dict['name'] = name.text
        ns_dict['slug'] = name.find('a').attrs['href']
        
        #add dictionary to list
        name_slug_list.append(ns_dict)
    
    #counter
    if (i%25) == 0:
        print(f'scraped {i} pages')

    #pause between requests
    time.sleep(1)

print('done scraping')

begin scraping
scraped 25 pages
scraped 50 pages
scraped 75 pages
scraped 100 pages
scraped 125 pages
scraped 150 pages
scraped 175 pages
scraped 200 pages
scraped 225 pages
scraped 250 pages
done scraping


In [3]:
name_slug_df = pd.DataFrame(name_slug_list)
name_slug_df.head()

Unnamed: 0,name,slug
0,Ethiopia Deri Kochoha,/review/ethiopia-deri-kochoha-2
1,Espresso,/review/espresso-14
2,Kenya Ruthaka Peaberry,/review/kenya-ruthaka-peaberry
3,Ethiopia Gora Kone Sidamo,/review/ethiopia-gora-kone-sidamo
4,Specialty Coffee Blend Espresso,/review/specialty-coffee-blend-espresso


In [4]:
name_slug_df.shape[0]

5124

- Successfully scraped 5124 coffee names and their corresponding unique slugs for further web scraping.
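The extraction step can also be tried offline on a small hand-written HTML fragment. The markup below is an assumption modeled on the selectors used above (an `h2` with class `review-title` containing a link), not a copy of the live page, and `parse_listing` is a hypothetical helper name. It uses the built-in `html.parser` backend so that `lxml` is not required.

```python
from bs4 import BeautifulSoup

# Hand-written fragment shaped like the selectors used above (an assumption,
# not the site's real markup).
SAMPLE_HTML = """
<div>
  <h2 class="review-title"><a href="/review/ethiopia-deri-kochoha-2">Ethiopia Deri Kochoha</a></h2>
  <h2 class="review-title"><a href="/review/espresso-14">Espresso</a></h2>
  <h2 class="other">Not a review</h2>
</div>
"""

def parse_listing(html):
    """Return a list of {'name', 'slug'} dicts, one per review title found."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    for title in soup.find_all('h2', {'class': 'review-title'}):
        link = title.find('a')
        if link is None or 'href' not in link.attrs:
            continue  # skip titles without a usable link
        rows.append({'name': title.get_text(strip=True), 'slug': link['href']})
    return rows

print(parse_listing(SAMPLE_HTML))
```

Keeping the parsing in a function like this makes it possible to check the selectors without sending any requests.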

## Scrape for content of 5124 different coffees

We can use the slugs we retrieved in our first web scrape to visit each page and collect the content that contains each coffee's ratings and descriptions.

In [8]:
#initialize empty lists and counter
coffee_list = []
i = 0
print('begin scraping')

for slug in name_slug_df.slug:
    #counter
    i += 1
    url = 'https://www.coffeereview.com{}'.format(slug)
    res = requests.get(url,headers = {'User-agent': 'Kyle Richards'})
    soup = BeautifulSoup(res.content, "lxml")
    #grab only the main content
    soup = soup.find('div',{'class':"maincontent"})

    #initialize empty dictionary for current page coffee data
    coffee_dict={}

    #add data to dictionary
    coffee_dict['slug'] = slug
    coffee_dict['all_text'] = soup.get_text()
    coffee_dict['rating'] = soup.find('div',{'class':"review-rating"}).text
    coffee_dict['roaster'] = soup.find('h3').text
    coffee_dict['name'] = soup.find('h2',{'class':"review-title"}).text

    coffee_list.append(coffee_dict)
    
    #counter
    if (i%500) == 0:
        print(f'scraped {i} pages')
        
    #pause between requests
    time.sleep(1)
    
print('done scraping')

begin scraping
scraped 500 pages
scraped 1000 pages
scraped 1500 pages
scraped 2000 pages
scraped 2500 pages
scraped 3000 pages
scraped 3500 pages
scraped 4000 pages
scraped 4500 pages
scraped 5000 pages
done scraping


In [9]:
coffee_df = pd.DataFrame(coffee_list)
coffee_df.head()

Unnamed: 0,all_text,name,rating,roaster,slug
0,\n\n\n\n \n93\nFlight Coffee Co.\nEthiopia Der...,Ethiopia Deri Kochoha,93,Flight Coffee Co.,/review/ethiopia-deri-kochoha-2
1,\n\n\n\n\n91\nDoi Chaang Coffee\nEspresso\nLoc...,Espresso,91,Doi Chaang Coffee,/review/espresso-14
2,\n\n\n\n \n95\nTemple Coffee and Tea\nKenya Ru...,Kenya Ruthaka Peaberry,95,Temple Coffee and Tea,/review/kenya-ruthaka-peaberry
3,\n\n\n\n \n93\nTemple Coffee and Tea\nEthiopia...,Ethiopia Gora Kone Sidamo,93,Temple Coffee and Tea,/review/ethiopia-gora-kone-sidamo
4,\n\n\n\n\n93\nChoosy Gourmet\nSpecialty Coffee...,Specialty Coffee Blend Espresso,93,Choosy Gourmet,/review/specialty-coffee-blend-espresso


In [10]:
coffee_df.shape[0]

5124

 - We successfully collected all 5124 coffee descriptions.
 - We will clean the text/data after the next scrape.
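A run of this length can be interrupted by a single failed request. As a hedged sketch (not part of the original notebook), a small retry wrapper could guard each fetch. The names `fetch_with_retry` and `flaky_fetch` are hypothetical; in this notebook the `fetch` argument could be a function that wraps `requests.get` with a `timeout` and calls `raise_for_status()` on the response.

```python
import time

def fetch_with_retry(fetch, url, retries=3, wait_seconds=1.0):
    """Call fetch(url), retrying on any exception for up to `retries` attempts.

    `fetch` is any callable that returns the page content or raises on failure.
    """
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except Exception as error:  # broad on purpose for this sketch
            last_error = error
            if attempt < retries:
                time.sleep(wait_seconds * attempt)  # wait a little longer each time
    raise RuntimeError(f'giving up on {url} after {retries} attempts') from last_error

# Demonstration with a stand-in fetcher that fails twice, then succeeds.
calls = {'count': 0}

def flaky_fetch(url):
    calls['count'] += 1
    if calls['count'] < 3:
        raise ConnectionError('simulated network failure')
    return f'<html>content of {url}</html>'

page = fetch_with_retry(flaky_fetch, '/review/espresso-14', wait_seconds=0)
print(calls['count'], page)
```

Passing the fetcher in as an argument keeps the retry logic testable without touching the network.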

## Additional Scraping

- The `region` and `type` categories are not listed on individual coffee review pages, but the _advanced search_ feature on the website can filter by them.
- Below, we scrape each filtered result list with a similar technique and add one binary (0/1) column per filter to our DataFrame.
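The search URLs requested in the next cell follow a fixed pattern: one filter name set to `on`, an empty keyword, the search button value, and a page number. As a sketch, the same query string can be built with `urllib.parse.urlencode` instead of string formatting. `build_search_url` is a hypothetical helper name; the parameter names are copied from the URL used in the next cell.

```python
from urllib.parse import urlencode

BASE_URL = 'https://www.coffeereview.com/advanced-search/'

def build_search_url(search_term, page):
    """Build the advanced-search URL for one filter and one results page."""
    params = {search_term: 'on', 'keyword': '', 'search': 'Search Now', 'pg': page}
    # urlencode escapes the space in 'Search Now' as '+'
    return f'{BASE_URL}?{urlencode(params)}'

print(build_search_url('region_hawaii', 2))
# https://www.coffeereview.com/advanced-search/?region_hawaii=on&keyword=&search=Search+Now&pg=2
```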


In [11]:
#advanced search filters mapped to their number of result pages
search_dict = {'region_africa_arabia':56, 'region_caribbean':3, 'region_central_america':41, 
               'region_hawaii':6, 'region_asia_pacific':20, 'region_south_america':22,
               'type_espresso':35, 'type_organic':23, 'type_fair_trade':15, 
               'type_decaffeinated':4, 'type_pod_capsule':9, 'type_blend':22, 'type_estate':34}

#iterate through pages of coffee
for search_term in search_dict:
    print(f"scraping {search_term}")

    #initialize an empty set
    results = set()
    
    #scrape each page of search results
    for i in range(1,search_dict[search_term]+1):
        url = 'https://www.coffeereview.com/advanced-search/?{}=on&keyword=&search=Search+Now&pg={}'.format(search_term, i)
        res = requests.get(url,headers = {'User-agent': 'Erika Girardi'})
        soup = BeautifulSoup(res.content, "lxml")
        names = soup.find_all('h2',{'class':"review-title"})

        #collect slugs and add them to the set
        for name in names:
            results.add(name.find('a').attrs['href'])
        
        #counter
        if (i%10) == 0:
            print(f'scraped {i} pages')

        #pause between requests
        time.sleep(1)
    
    #create binary column, indicating whether slug is in the search results
    coffee_df[search_term] = coffee_df['slug'].map(lambda x: 1 if x in results else 0)

    print(f'done scraping {search_term}')
    print('')

scraping region_africa_arabia
scraped 10 pages
scraped 20 pages
scraped 30 pages
scraped 40 pages
scraped 50 pages
done scraping region_africa_arabia

scraping region_caribbean
done scraping region_caribbean

scraping region_central_america
scraped 10 pages
scraped 20 pages
scraped 30 pages
scraped 40 pages
done scraping region_central_america

scraping region_hawaii
done scraping region_hawaii

scraping region_asia_pacific
scraped 10 pages
scraped 20 pages
done scraping region_asia_pacific

scraping region_south_america
scraped 10 pages
scraped 20 pages
done scraping region_south_america

scraping type_espresso
scraped 10 pages
scraped 20 pages
scraped 30 pages
done scraping type_espresso

scraping type_organic
scraped 10 pages
scraped 20 pages
done scraping type_organic

scraping type_fair_trade
scraped 10 pages
done scraping type_fair_trade

scraping type_decaffeinated
done scraping type_decaffeinated

scraping type_pod_capsule
done scraping type_pod_capsule

scraping type_blend
scr

In [12]:
coffee_df.head()

Unnamed: 0,all_text,name,rating,roaster,slug,region_africa_arabia,region_caribbean,region_central_america,region_hawaii,region_asia_pacific,region_south_america,type_espresso,type_organic,type_fair_trade,type_decaffeinated,type_pod_capsule,type_blend,type_estate
0,\n\n\n\n \n93\nFlight Coffee Co.\nEthiopia Der...,Ethiopia Deri Kochoha,93,Flight Coffee Co.,/review/ethiopia-deri-kochoha-2,1,0,0,0,0,0,0,0,0,0,0,0,0
1,\n\n\n\n\n91\nDoi Chaang Coffee\nEspresso\nLoc...,Espresso,91,Doi Chaang Coffee,/review/espresso-14,0,0,0,0,1,0,1,0,0,0,0,0,1
2,\n\n\n\n \n95\nTemple Coffee and Tea\nKenya Ru...,Kenya Ruthaka Peaberry,95,Temple Coffee and Tea,/review/kenya-ruthaka-peaberry,1,0,0,0,0,0,0,0,0,0,0,0,0
3,\n\n\n\n \n93\nTemple Coffee and Tea\nEthiopia...,Ethiopia Gora Kone Sidamo,93,Temple Coffee and Tea,/review/ethiopia-gora-kone-sidamo,1,0,0,0,0,0,0,0,0,0,0,0,0
4,\n\n\n\n\n93\nChoosy Gourmet\nSpecialty Coffee...,Specialty Coffee Blend Espresso,93,Choosy Gourmet,/review/specialty-coffee-blend-espresso,0,0,0,0,0,0,1,0,0,0,0,0,0


 - The DataFrame now has 13 binary columns, one per search filter.
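The flag-column step can be checked on toy data. The sketch below uses made-up slugs and compares the `map` approach used above with a vectorized alternative, `Series.isin` followed by `astype(int)`; the alternative is shown only for illustration and is not the notebook's code.

```python
import pandas as pd

# Toy data: three slugs, two of which appear in a pretend search result set.
toy_df = pd.DataFrame({'slug': ['/review/a', '/review/b', '/review/c']})
results = {'/review/a', '/review/c'}

# Approach used in the notebook: map each slug to 1 or 0.
toy_df['flag_map'] = toy_df['slug'].map(lambda x: 1 if x in results else 0)

# Vectorized alternative (an illustration, not the notebook's code).
toy_df['flag_isin'] = toy_df['slug'].isin(results).astype(int)

print(toy_df)
```

Both columns hold the same 0/1 values, so the choice between them is mostly a matter of style and speed.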

### Extract info from `all_text`:

- Using regular expressions, we capture the desired fields from the `all_text` column for each coffee.

In [13]:
info_list = []

#iterate over the raw page text of each coffee
for text in coffee_df['all_text']:
    # remove part of description in `Notes` sections that contains roaster's website and phone number
    # commented out as this section is not used for consideration in the next notebook
    #     text = re.sub('[Vv]isit.+','',text)
    #     text = re.sub('[Ff]or more information.+','',text)
    #raw strings keep regex escapes such as \d from being read as string escapes
    data_info = [
            re.findall(r'Location: (.+)',text),
            re.findall(r'Origin: (.+)',text),
            re.findall(r'Roast: (.+)',text),
            re.findall(r'Est\. Price: (.+)',text),
            re.findall(r'Review Date: (.+)',text),
            re.findall(r'Agtron: (.+)',text),
            re.findall(r'Aroma: (\d*\.*\d*)',text),
            re.findall(r'Acid.+: (\d*\.*\d*)',text),
            re.findall(r'Body: (\d*\.*\d*)',text),
            re.findall(r'Flavor: (\d*\.*\d*)',text),
            re.findall(r'Aftertaste: (\d*\.*\d*)',text),
            re.findall(r'With Milk: (\d*\.*\d*)',text),
            re.findall(r'Blind Assessment: ?\n?(.+)',text),
            re.findall(r'Notes: ?\n?(.+)',text),
            re.findall(r'Bottom Line: ?\n?(.+)\n',text),
            re.findall(r'Who Should Drink It: ?\n?(.+)\n',text)
    ]
    
    info_list.append(data_info)
    
info_df = pd.DataFrame(info_list,columns = ['location', 'origin', 'roast', 'est_price', 
                                            'review_date', 'agtron', 'aroma', 'acid', 
                                            'body', 'flavor', 'aftertaste', 'with_milk', 
                                            'desc_1', 'desc_2', 'desc_3','desc_4'])
#replace each list of matches with its first match, or None when nothing matched
info_df = info_df.applymap(lambda x: x[0] if x != [] else None)
info_df.head()


Unnamed: 0,location,origin,roast,est_price,review_date,agtron,aroma,acid,body,flavor,aftertaste,with_milk,desc_1,desc_2,desc_3,desc_4
0,"Bedford, New Hampshire","West Guji Zone, Oromia Region, southeastern Et...",Medium-Light,$17.00/12 ounces,January 2019,56/80,9,8.0,9,9,8,,"Bright, crisp, sweetly tart. Citrus medley, ca...",From the Deri Kochoha mill in the Hagere Marya...,A poised and melodic wet-processed Ethiopia co...,
1,"Richmond, British Columbia, Canada",Northern Thailand,Medium,CAD $29.99/32 ounces,January 2019,46/68,8,,8,8,8,9.0,"Evaluated as espresso. Deeply rich, sweetly ro...",Doi Chaang is a single-estate coffee produced ...,"A rich, resonant espresso from Thailand, espec...",
2,"Sacramento, California","Nyeri growing region, south-central Kenya",Medium,$19.00/12 ounces,January 2019,48/72,9,8.0,9,10,8,,"Deeply sweet, richly savory. Dark chocolate, p...",Despite challenges ranging from contested gove...,"A high-toned, nuanced Kenya cup, classic in it...",
3,"Sacramento, California","Sidamo (also Sidama) growing region, south-cen...",Medium-Light,$20.00/12 ounces,January 2019,55/77,9,8.0,9,9,8,,"Fruit-forward, richly chocolaty. Raspberry cou...",Southern Ethiopia coffees like this one are la...,"A playful, unrestrained fruit bomb of a coffee...",
4,"Kaohsiung, Taiwan","Ethiopia, Colombia, Kenya",Medium-Light,NT $250/16 ounces,January 2019,51/75,9,,8,9,8,9.0,"Evaluated as espresso. Rich, chocolaty, sweetl...",A blend of coffees from Ethiopia (natural-proc...,An espresso blend in which spice notes — in pa...,
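The one-pattern-per-field approach can be exercised on a short made-up snippet. The sample text below is invented to resemble an `all_text` entry, and `first_match` is a hypothetical helper; it uses `re.search` to return the first captured value directly, or `None` when a field is absent, as with `With Milk` on non-espresso reviews.

```python
import re

# Invented text shaped like an `all_text` entry (not real scraped data).
sample_text = (
    '93\nFlight Coffee Co.\nEthiopia Deri Kochoha\n'
    'Location: Bedford, New Hampshire\n'
    'Roast: Medium-Light\n'
    'Aroma: 9\nBody: 9\n'
)

def first_match(pattern, text):
    """Return the first captured group for `pattern`, or None if absent."""
    match = re.search(pattern, text)
    return match.group(1) if match else None

fields = {
    'location': first_match(r'Location: (.+)', sample_text),
    'roast': first_match(r'Roast: (.+)', sample_text),
    'aroma': first_match(r'Aroma: (\d+(?:\.\d+)?)', sample_text),
    'with_milk': first_match(r'With Milk: (\d+(?:\.\d+)?)', sample_text),
}
print(fields)
```

Because `.` does not match a newline by default, each `(.+)` capture stops at the end of its line.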


### Concatenate dataframes into one:

In [14]:
coffee_df = pd.concat([coffee_df, info_df], axis =1)
coffee_df.head()

Unnamed: 0,all_text,name,rating,roaster,slug,region_africa_arabia,region_caribbean,region_central_america,region_hawaii,region_asia_pacific,...,aroma,acid,body,flavor,aftertaste,with_milk,desc_1,desc_2,desc_3,desc_4
0,\n\n\n\n \n93\nFlight Coffee Co.\nEthiopia Der...,Ethiopia Deri Kochoha,93,Flight Coffee Co.,/review/ethiopia-deri-kochoha-2,1,0,0,0,0,...,9,8.0,9,9,8,,"Bright, crisp, sweetly tart. Citrus medley, ca...",From the Deri Kochoha mill in the Hagere Marya...,A poised and melodic wet-processed Ethiopia co...,
1,\n\n\n\n\n91\nDoi Chaang Coffee\nEspresso\nLoc...,Espresso,91,Doi Chaang Coffee,/review/espresso-14,0,0,0,0,1,...,8,,8,8,8,9.0,"Evaluated as espresso. Deeply rich, sweetly ro...",Doi Chaang is a single-estate coffee produced ...,"A rich, resonant espresso from Thailand, espec...",
2,\n\n\n\n \n95\nTemple Coffee and Tea\nKenya Ru...,Kenya Ruthaka Peaberry,95,Temple Coffee and Tea,/review/kenya-ruthaka-peaberry,1,0,0,0,0,...,9,8.0,9,10,8,,"Deeply sweet, richly savory. Dark chocolate, p...",Despite challenges ranging from contested gove...,"A high-toned, nuanced Kenya cup, classic in it...",
3,\n\n\n\n \n93\nTemple Coffee and Tea\nEthiopia...,Ethiopia Gora Kone Sidamo,93,Temple Coffee and Tea,/review/ethiopia-gora-kone-sidamo,1,0,0,0,0,...,9,8.0,9,9,8,,"Fruit-forward, richly chocolaty. Raspberry cou...",Southern Ethiopia coffees like this one are la...,"A playful, unrestrained fruit bomb of a coffee...",
4,\n\n\n\n\n93\nChoosy Gourmet\nSpecialty Coffee...,Specialty Coffee Blend Espresso,93,Choosy Gourmet,/review/specialty-coffee-blend-espresso,0,0,0,0,0,...,9,,8,9,8,9.0,"Evaluated as espresso. Rich, chocolaty, sweetl...",A blend of coffees from Ethiopia (natural-proc...,An espresso blend in which spice notes — in pa...,


In [15]:
coffee_df.shape

(5124, 34)

Our DataFrame now includes data for every coffee including name, roaster, webpage slug, review date, text descriptions, ratings, and binary columns.
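One detail worth keeping in mind: `pd.concat(..., axis=1)` lines rows up by index label, not by position. Here both frames carry the default 0 to 5123 index, so the rows match. The toy sketch below, with made-up frames, shows what a mismatched index would do and how `reset_index(drop=True)` restores positional alignment.

```python
import pandas as pd

left = pd.DataFrame({'name': ['A', 'B', 'C']})                   # index 0, 1, 2
right = pd.DataFrame({'rating': [93, 91, 95]}, index=[2, 3, 4])  # shifted index

# Labels only overlap at 2, so the result has 5 rows with NaN gaps.
misaligned = pd.concat([left, right], axis=1)

# Resetting the index makes the rows line up by position.
aligned = pd.concat([left, right.reset_index(drop=True)], axis=1)

print(misaligned.shape, aligned.shape)
```

A quick shape check like `coffee_df.shape` above is a simple way to confirm that no extra rows were introduced.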

#### Export full dataframe:

In [16]:
coffee_df.to_csv('../data/coffee.csv', index=False)