# Webscraping with Beautiful Soup
In this notebook we'll be web scraping different HTML pages through beautiful soup. We'll be performing these exercises without utilizing regex. <br/><br/>
The websites scraped are,<br/>
**ChubbyGrub**- http://chubbygrub.com<br/>
Sample HTML Page provided by GA's instructor- https://rldaggie.github.io/sample-html/<br/>
**Basketball Reference**- https://www.basketball-reference.com/


In [1]:
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd

### ChubbyGrub
ChubbyGrub is a website that holds nutritional information from multiple restaurants. Let's build a pandas dataframe of all the food listed by restraurant.

In [2]:
# Grab data from chubbygrub.com
res = requests.get('http://chubbygrub.com')
soup = bs(res.content, 'lxml')

In [3]:
# Isolate the div that has all the restaurant links
restaurants_section = soup.find('div', {'class': 'restaurant-buttons'})

# Create list of dictionaries of names and slugs
restaurants = []

for r in restaurants_section.find_all('a', {'class': \
                                            'btn btn-lg btn-primary'}):
    
    restaurant = {}
    restaurant['name'] = r.text
    restaurant['slug'] = r['href'].split('/')[-1]
    
    restaurants.append(restaurant)

# Top 10 dictionaries
restaurants[0:10]

[{'name': 'A&W Restaurants', 'slug': 'aw-restaurants'},
 {'name': "Applebee's", 'slug': 'applebees'},
 {'name': "Arby's", 'slug': 'arbys'},
 {'name': 'Atlanta Bread Company', 'slug': 'atlanta-bread-company'},
 {'name': "Bojangle's Famous Chicken 'n Biscuits",
  'slug': 'bojangles-famous-chicken-n-biscuits'},
 {'name': 'Buffalo Wild Wings', 'slug': 'buffalo-wild-wings'},
 {'name': 'Burger King', 'slug': 'burger-king'},
 {'name': "Captain D's", 'slug': 'captain-ds'},
 {'name': "Carl's Jr.", 'slug': 'carls-jr'},
 {'name': "Charley's Grilled Subs", 'slug': 'charleys-grilled-subs'}]

We can use the slugs to scrape each restaurant's page and create a single list of food dictionaries.

In [4]:
foods = []

# 
for r in restaurants:
    restaurant_res = requests.get('http://chubbygrub.com/restaurants/{}'.format(restaurant['slug']))
    restaurant_soup = bs(restaurant_res.content, 'lxml')
    tble = restaurant_soup.find('table', {'id': 'items'})
    
    for row in tble.find('tbody').find_all('tr'):
        
        cells = row.find_all('td')
        
        food = {}
        
        food['restaurant'] = restaurant['name']
        food['name'] = cells[0].text
        food['category'] = cells[1].text.strip()
        food['calories'] = cells[2].text
        food['fat'] = cells[3].text
        food['carbs'] = cells[4].text
        
        foods.append(food)

# Top 5 dictionaries
foods[0:5]

[{'calories': '450',
  'carbs': '24',
  'category': 'Chicken',
  'fat': '30',
  'name': '10-piece Chicken Nuggets',
  'restaurant': "Wendy's"},
 {'calories': '430',
  'carbs': '23',
  'category': 'Chicken',
  'fat': '28',
  'name': '10-piece Spicy Chicken Nuggets',
  'restaurant': "Wendy's"},
 {'calories': '180',
  'carbs': '10',
  'category': 'Chicken',
  'fat': '12',
  'name': '4-piece Chicken Nuggets',
  'restaurant': "Wendy's"},
 {'calories': '170',
  'carbs': '9',
  'category': 'Chicken',
  'fat': '11',
  'name': '4-piece Spicy Chicken Nuggets',
  'restaurant': "Wendy's"},
 {'calories': '270',
  'carbs': '14',
  'category': 'Chicken',
  'fat': '18',
  'name': '6-piece Chicken Nuggets',
  'restaurant': "Wendy's"}]

In [5]:
df_food = pd.DataFrame(foods)
print(df_food.shape)
df_food.head()

(2948, 6)


Unnamed: 0,calories,carbs,category,fat,name,restaurant
0,450,24,Chicken,30,10-piece Chicken Nuggets,Wendy's
1,430,23,Chicken,28,10-piece Spicy Chicken Nuggets,Wendy's
2,180,10,Chicken,12,4-piece Chicken Nuggets,Wendy's
3,170,9,Chicken,11,4-piece Spicy Chicken Nuggets,Wendy's
4,270,14,Chicken,18,6-piece Chicken Nuggets,Wendy's


In [6]:
# Export if you want the csv
df_food.to_csv('foods.csv', index=False)

### Sample HTML
General Assembly instructor provided sample HTML file. This is just a HTML file that has different tags to practice web scraping.

In [7]:
res = requests.get('https://rldaggie.github.io/sample-html/')
res.status_code

200

In [8]:
soup = bs(res.content, 'lxml')

In [9]:
# Place items in todo and completed lists into a dataframe
todo_list = []

for ol in soup.find_all('ol'):
    for li in ol.find_all('li'):
        todo = {}
        todo['task'] = li.text
        todo_list.append(todo)
pd.DataFrame(todo_list)

Unnamed: 0,task
0,Take out trash
1,Pay billz
2,Feed dog
3,Mow lawn
4,Take out compost
5,Create scraping lecture


In [10]:
# Place items in todo list into a dataframe
todo_list = []

for li in soup.find('ol').find_all('li'):
    todo_list.append(li.text)
pd.DataFrame(todo_list, columns=['To-Do'])

Unnamed: 0,To-Do
0,Take out trash
1,Pay billz
2,Feed dog


In [11]:
# Place items in completed list into a dataframe
todo_list = []

for li in soup.find_all('ol')[-1].find_all('li'):
    todo_list.append(li.text)
pd.DataFrame(todo_list, columns=['Completed'])

Unnamed: 0,Completed
0,Mow lawn
1,Take out compost
2,Create scraping lecture


In [12]:
# Create dataframe of students name, email, and role
table = soup.find('table', {'id': 'directory'})

# Create list of dictionaries
entries = []

for row in table.find('tbody').find_all('tr'):
    dct = {}
    dct['Name'] = row.text.split()[0]
    dct['Email'] = row.find('a').attrs['href'][7:]
    dct['Role'] = row.text.split()[1]
    
    entries.append(dct)

pd.DataFrame(entries)

Unnamed: 0,Email,Name,Role
0,praveen@ga.co,Praveen,Student
1,fred@ga.co,Fred,Student
2,homer@ga.co,Homer,Student
3,kyle@ga.co,Kyle,Student
4,sam@ga.co,Sam,Student
5,javier@ga.co,Javier,Student
6,nengkuan@ga.co,Nengkuan,Student
7,kieth@ga.co,Kieth,Student
8,bola@ga.co,Bola,Student
9,steve@ga.co,Steve,Student


### Basketball Reference
For this website, we'll create a dataframe out of NBA team names, ranks, wins, conference, and 3 letter acronym.

In [13]:
res = requests.get('https://www.basketball-reference.com/')
res.status_code

200

In [14]:
soup = bs(res.content, 'lxml')

In [15]:
conferences = ['East', 'West']
teams = []

for conference in conferences:
    table = soup.find('table', {'id': 'confs_standings_{}'.format(conference[0])})
    for team in table.find('tbody').find_all('tr'):
        dct = {}
        dct['Slug'] = team.find('a').text
        dct['Name'] = team.find('a').attrs['title']
        dct['Wins'] = team.find_all('td')[-2].text
        dct['Losses'] = team.find_all('td')[-1].text
        dct['Conference'] = conference
        
        teams.append(dct)

pd.DataFrame(teams)

Unnamed: 0,Conference,Losses,Name,Slug,Wins
0,East,10,Boston Celtics,BOS,28
1,East,10,Toronto Raptors,TOR,23
2,East,11,Cleveland Cavaliers,CLE,24
3,East,14,Detroit Pistons,DET,19
4,East,16,Washington Wizards,WAS,19
5,East,16,Indiana Pacers,IND,19
6,East,15,Milwaukee Bucks,MIL,17
7,East,16,Miami Heat,MIA,18
8,East,17,New York Knicks,NYK,17
9,East,18,Philadelphia 76ers,PHI,15
