## Food for Thought: Extract - Transform - Load Project

### Team: Tolani, Joy, Andrea, Manuel


### Objective and Methodology:

The internet is a great resource for aspiring chefs, casual cooks, and people who want to make healthier food and lifestyle choices in order to lose weight or feel better. One of the largest collections of recipes can be found on allrecipes.com. Its recipes are organized into several categories, including Diet and Health, which features a curated list of healthy recipes. To verify the nutritional value of each recipe, we used data from the United States Department of Agriculture's FoodData Central download site. By merging the two datasets, we provide a way to find recipes that are not only tasty but nutritious as well.

This notebook outlines the first step in this study: scraping the Allrecipes.com website for healthy recipes. As part of the scraping process, we extract the recipe title, recipe link, and ingredients using Beautiful Soup and Selenium, and store the results in MongoDB through PyMongo. The data is then loaded into a DataFrame using Pandas.

### STEP ONE: Web Scraping the Allrecipes.com website and creating MongoDB database

In [2]:
# Dependencies
import time

import numpy as np
import pymongo
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By

### Web Scraping allrecipes.com 

In [3]:
# Make a connection to MongoDB and create two collections

conn = pymongo.MongoClient('mongodb://localhost:27017')
recipesdb = conn['allrecipes_etl1']
collection = recipesdb['recipes']
collection2 = recipesdb['ingredients']


In [3]:
# URL of page to be scraped
url = 'https://www.allrecipes.com/recipes/84/healthy-recipes/'

# Retrieve page with the requests module; stop early on an HTTP error status
response = requests.get(url)
response.raise_for_status()

# Create BeautifulSoup object; parse with 'lxml'
soup = BeautifulSoup(response.text, 'lxml')

In [4]:
# Examine the results, then determine the element that contains the needed info.
# find_all returns an iterable list of matching elements.

results = soup.find_all(class_="fixed-recipe-card")

# Loop through the returned results and collect the href of each card's first 'a' tag

list_url = []
for result in results:
    # Error handling: a card without an 'a' tag or an href raises TypeError or KeyError
    try:
        link = result.a['href']

        # If the link exists, append it to the list
        if link:
            list_url.append(link)
            # Print the running list of links collected so far
            print(list_url)

    except (TypeError, KeyError) as e:
        print(e)
        


['https://www.allrecipes.com/recipe/217013/chop-chop-salad/']
['https://www.allrecipes.com/recipe/217013/chop-chop-salad/', 'https://www.allrecipes.com/recipe/245946/healthy-turmeric-chicken-stew/']
['https://www.allrecipes.com/recipe/217013/chop-chop-salad/', 'https://www.allrecipes.com/recipe/245946/healthy-turmeric-chicken-stew/', 'https://www.allrecipes.com/recipe/260392/turkish-orange-salad-with-mediterranean-dressing/']
['https://www.allrecipes.com/recipe/217013/chop-chop-salad/', 'https://www.allrecipes.com/recipe/245946/healthy-turmeric-chicken-stew/', 'https://www.allrecipes.com/recipe/260392/turkish-orange-salad-with-mediterranean-dressing/', 'https://www.allrecipes.com/recipe/259262/acai-smoothie-bowl/']
['https://www.allrecipes.com/recipe/217013/chop-chop-salad/', 'https://www.allrecipes.com/recipe/245946/healthy-turmeric-chicken-stew/', 'https://www.allrecipes.com/recipe/260392/turkish-orange-salad-with-mediterranean-dressing/', 'https://www.allrecipes.com/recipe/259262/ac

In [5]:
#return list of recipe links
list_url

['https://www.allrecipes.com/recipe/217013/chop-chop-salad/',
 'https://www.allrecipes.com/recipe/245946/healthy-turmeric-chicken-stew/',
 'https://www.allrecipes.com/recipe/260392/turkish-orange-salad-with-mediterranean-dressing/',
 'https://www.allrecipes.com/recipe/259262/acai-smoothie-bowl/',
 'https://www.allrecipes.com/recipe/72381/orange-roasted-salmon/',
 'https://www.allrecipes.com/recipe/199598/quinoa-with-asian-flavors/',
 'https://www.allrecipes.com/recipe/256590/chorizo-spiced-party-sized-chopped-veggie-salad/',
 'https://www.allrecipes.com/recipe/256584/khitchari/',
 'https://www.allrecipes.com/recipe/256558/hummingbird-carrot-cake-oatmeal/',
 'https://www.allrecipes.com/recipe/49552/quinoa-and-black-beans/',
 'https://www.allrecipes.com/recipe/51283/maple-salmon/',
 'https://www.allrecipes.com/recipe/8665/braised-balsamic-chicken/',
 'https://www.allrecipes.com/recipe/26692/annies-fruit-salsa-and-cinnamon-chips/',
 'https://www.allrecipes.com/recipe/85452/homemade-black-
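The collected links can be sanity-checked before they are handed to Selenium. The sketch below is not part of the original notebook: it assumes recipe pages follow the `/recipe/<id>/<slug>/` pattern visible in the output above, and the helper name `filter_recipe_links` is hypothetical. It keeps only links that match the pattern and drops duplicates while preserving order.

```python
import re
from urllib.parse import urlparse

# Assumed pattern, inferred from the links printed above: /recipe/<numeric id>/<slug>/
RECIPE_PATH = re.compile(r"^/recipe/(\d+)/([a-z0-9-]+)/?$")


def filter_recipe_links(links):
    """Return unique recipe links in first-seen order, skipping anything else."""
    seen = set()
    kept = []
    for link in links:
        parsed = urlparse(link)
        # Keep only links on the expected host with a recipe-style path
        if parsed.netloc != "www.allrecipes.com":
            continue
        if not RECIPE_PATH.match(parsed.path):
            continue
        if link in seen:
            continue
        seen.add(link)
        kept.append(link)
    return kept


sample = [
    "https://www.allrecipes.com/recipe/217013/chop-chop-salad/",
    "https://www.allrecipes.com/recipe/217013/chop-chop-salad/",  # duplicate
    "https://www.allrecipes.com/recipes/84/healthy-recipes/",     # category page, not a recipe
    "https://www.allrecipes.com/recipe/51283/maple-salmon/",
]
print(filter_recipe_links(sample))
```

Running a filter like this on `list_url` before opening the browser would avoid visiting the same recipe twice if the category page lists it more than once.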

In [12]:
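from selenium.common.exceptions import NoSuchElementException

# Open a browser using the Google Chrome driver
browser = webdriver.Chrome()
# Wait up to 1 second for elements to appear; set once, it applies to every lookup
browser.implicitly_wait(1)

# Loop through the list of urls to extract the recipe title
for recipe_url in list_url:

    browser.get(recipe_url)
    time.sleep(1)

    try:
        rtitle = browser.find_element(By.TAG_NAME, 'h1').text
        print(rtitle)

    except (NoSuchElementException, TimeoutException) as e:
        rtitle = 'NA'
        print(e)

    # Loop through checklist items to extract individual ingredients
    ingred = browser.find_elements(By.CLASS_NAME, "checkList__item")

    ingredients = []
    # Skip the last checklist item (presumably a non-ingredient entry)
    for item in ingred[:-1]:
        # str() of the encoded bytes yields "b'...'", the wrapper visible in the results below
        ingredients.append(str(item.text.encode('ascii', 'ignore')))
        print("testoutput")
        recoutput = '\t' + rtitle
        print(recoutput)

    # Capture ingredients and insert them into the MongoDB 'ingredients' collection
    listingr = []
    for ingr in ingredients:
        # Encoding again stores the value as b"b'...'"; it is cleaned up in a later step
        temp = {'ingredient': ingr.encode('ascii', 'ignore')}
        collection2.insert_one(temp)

        # listingr is never filled, so this prints a lone tab per ingredient
        ingroutput = '\t' + '\t'.join(listingr)
        print(ingroutput)

        # One title document is inserted per ingredient, so titles repeat in the 'recipes' collection
        temp = {'recipe_title': rtitle.encode('ascii', 'ignore')}
        collection.insert_one(temp)

# Close the browser once every recipe has been visited
browser.quit()

# Verify that MongoDB is reachable
if __name__ == '__main__':
    try:
        conn = pymongo.MongoClient()
        # MongoClient connects lazily, so ping the server to force a connection attempt
        conn.admin.command('ping')
        print("Connected successfully!!!")
    except pymongo.errors.ConnectionFailure as e:
        print("Could not connect to MongoDB: %s" % e)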
 

Chop Chop Salad
testoutput
	Chop Chop Salad
testoutput
	Chop Chop Salad
testoutput
	Chop Chop Salad
testoutput
	Chop Chop Salad
testoutput
	Chop Chop Salad
testoutput
	Chop Chop Salad
testoutput
	Chop Chop Salad
	
	
	
	
	
	
	
Healthy Turmeric Chicken Stew
testoutput
	Healthy Turmeric Chicken Stew
testoutput
	Healthy Turmeric Chicken Stew
testoutput
	Healthy Turmeric Chicken Stew
testoutput
	Healthy Turmeric Chicken Stew
testoutput
	Healthy Turmeric Chicken Stew
testoutput
	Healthy Turmeric Chicken Stew
testoutput
	Healthy Turmeric Chicken Stew
testoutput
	Healthy Turmeric Chicken Stew
testoutput
	Healthy Turmeric Chicken Stew
	
	
	
	
	
	
	
	
	
Turkish Orange Salad with Mediterranean Dressing
testoutput
	Turkish Orange Salad with Mediterranean Dressing
testoutput
	Turkish Orange Salad with Mediterranean Dressing
testoutput
	Turkish Orange Salad with Mediterranean Dressing
testoutput
	Turkish Orange Salad with Mediterranean Dressing
testoutput
	Turkish Orange Salad with Mediterranean Dre

Fish Tacos
testoutput
	Fish Tacos
testoutput
	Fish Tacos
testoutput
	Fish Tacos
testoutput
	Fish Tacos
testoutput
	Fish Tacos
testoutput
	Fish Tacos
testoutput
	Fish Tacos
testoutput
	Fish Tacos
testoutput
	Fish Tacos
testoutput
	Fish Tacos
testoutput
	Fish Tacos
testoutput
	Fish Tacos
testoutput
	Fish Tacos
testoutput
	Fish Tacos
testoutput
	Fish Tacos
testoutput
	Fish Tacos
testoutput
	Fish Tacos
testoutput
	Fish Tacos
testoutput
	Fish Tacos
	
	
	
	
	
	
	
	
	
	
	
	
	
	
	
	
	
	
	
Refried Beans Without the Refry
testoutput
	Refried Beans Without the Refry
testoutput
	Refried Beans Without the Refry
testoutput
	Refried Beans Without the Refry
testoutput
	Refried Beans Without the Refry
testoutput
	Refried Beans Without the Refry
testoutput
	Refried Beans Without the Refry
testoutput
	Refried Beans Without the Refry
testoutput
	Refried Beans Without the Refry
	
	
	
	
	
	
	
	
Baked Kale Chips
testoutput
	Baked Kale Chips
testoutput
	Baked Kale Chips
testoutput
	Baked Kale Chips
	
	
	
Roas
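The output above shows each title repeated once per ingredient, because the cell writes a title document inside the ingredient loop and stores ingredients in a separate collection with no link back to their recipe. One alternative, sketched here under assumptions, is to build a single document per recipe that carries its own ingredient list. The field names (`recipe_title`, `recipe_link`, `ingredients`) and the helper `build_recipe_document` are hypothetical and are not the schema used in this notebook.

```python
def build_recipe_document(title, link, ingredient_texts):
    """Combine one scraped recipe into a single MongoDB-ready dict."""
    # Strip whitespace and skip empty checklist entries
    ingredients = [text.strip() for text in ingredient_texts if text and text.strip()]
    return {
        "recipe_title": title.strip(),
        "recipe_link": link,
        "ingredients": ingredients,
    }


doc = build_recipe_document(
    "Chop Chop Salad",
    "https://www.allrecipes.com/recipe/217013/chop-chop-salad/",
    ["1 red grapefruit", " 1 cup chopped cucumber ", ""],
)
print(doc["recipe_title"], len(doc["ingredients"]))
# With a live connection, a document like this could be stored with collection.insert_one(doc).
```

Keeping the title, link, and ingredients together would remove the need to deduplicate titles later and would preserve which ingredient belongs to which recipe.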

In [14]:
# Dependencies
import pandas as pd
from pymongo import MongoClient

# Open the MongoDB instance and get handles to both collections
client = MongoClient('mongodb://localhost:27017')
db = client.allrecipes_etl1
collection = db.recipes
collection2 = db.ingredients

In [15]:
# Create dataframe from collections using Pandas
df = pd.DataFrame(list(collection.find()))
df2 = pd.DataFrame(list(collection2.find()))


In [16]:
# Check recipe title results; id from MongoDB. Recipe title requires cleanup.
df

| | _id | recipe_title |
|---|---|---|
| 0 | 5ccfa84818cadbc63e296833 | b'Chop Chop Salad' |
| 1 | 5ccfa84818cadbc63e296835 | b'Chop Chop Salad' |
| 2 | 5ccfa84818cadbc63e296837 | b'Chop Chop Salad' |
| 3 | 5ccfa84818cadbc63e296839 | b'Chop Chop Salad' |
| 4 | 5ccfa84818cadbc63e29683b | b'Chop Chop Salad' |
| 5 | 5ccfa84818cadbc63e29683d | b'Chop Chop Salad' |
| 6 | 5ccfa84818cadbc63e29683f | b'Chop Chop Salad' |
| 7 | 5ccfa85718cadbc63e296841 | b'Healthy Turmeric Chicken Stew' |
| 8 | 5ccfa85718cadbc63e296843 | b'Healthy Turmeric Chicken Stew' |
| 9 | 5ccfa85718cadbc63e296845 | b'Healthy Turmeric Chicken Stew' |


In [17]:
# Check ingredient results; id from MongoDB. Ingredients require cleanup.
df2

| | _id | ingredient |
|---|---|---|
| 0 | 5ccfa84718cadbc63e296832 | b"b'1 red grapefruit'" |
| 1 | 5ccfa84818cadbc63e296834 | b"b'1 cup peeled, chopped jicama'" |
| 2 | 5ccfa84818cadbc63e296836 | b"b'1 cup chopped orange bell pepper'" |
| 3 | 5ccfa84818cadbc63e296838 | b"b'1 cup chopped cucumber'" |
| 4 | 5ccfa84818cadbc63e29683a | b"b'1 tomato, chopped'" |
| 5 | 5ccfa84818cadbc63e29683c | b"b'2 green onions, chopped'" |
| 6 | 5ccfa84818cadbc63e29683e | b"b'1/4 cup chopped fresh cilantro'" |
| 7 | 5ccfa85718cadbc63e296840 | b"b'2 tablespoons olive oil'" |
| 8 | 5ccfa85718cadbc63e296842 | b"b'2 skinless, boneless chicken breasts, cubed'" |
| 9 | 5ccfa85718cadbc63e296844 | b"b'2 sweet potatoes, cubed'" |
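The `b'...'` and `b"b'...'"` wrappers in the two results above come from encoding the scraped text to bytes before storing it. As a rough illustration of the cleanup they call for, the sketch below peels those wrappers off and removes repeated titles. It is an assumption about how the cleanup could be done, not the method used in the follow-up notebook, and the helper name `unwrap_bytes_repr` is hypothetical.

```python
import ast


def unwrap_bytes_repr(value):
    """Peel bytes and "b'...'" wrappers off a value until a plain str remains."""
    for _ in range(10):  # small safety bound; the results above show at most two layers
        if isinstance(value, bytes):
            value = value.decode("ascii", "ignore")
            continue
        looks_wrapped = (
            isinstance(value, str)
            and len(value) >= 3
            and value[0] == "b"
            and value[1] in "'\""
            and value[-1] == value[1]
        )
        if not looks_wrapped:
            break
        try:
            # Safely evaluate the bytes literal text back into a bytes object
            value = ast.literal_eval(value)
        except (ValueError, SyntaxError):
            break
    return value


raw_ingredients = [b"b'1 red grapefruit'", b"b'1 cup peeled, chopped jicama'"]
raw_titles = [b"Chop Chop Salad", b"Chop Chop Salad", b"Healthy Turmeric Chicken Stew"]

clean_ingredients = [unwrap_bytes_repr(v) for v in raw_ingredients]
# dict.fromkeys drops repeated titles while keeping first-seen order
clean_titles = list(dict.fromkeys(unwrap_bytes_repr(v) for v in raw_titles))

print(clean_ingredients)
print(clean_titles)
```

The same function could be applied to a DataFrame column, for example with `df2['ingredient'].map(unwrap_bytes_repr)`, assuming the column holds values shaped like those shown above.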


### The MongoDB collections of recipe titles and ingredients are complete. Further cleanup is necessary; please open the file 'Recipes.csv' in Jupyter Notebook for the next steps.