# CS699 Web Crawler - Report

This is a Python project that crawls and scrapes restaurant data from [Yelp.com](https://www.yelp.com/madison).

## Running the Spider

This project uses the [Scrapy framework](https://scrapy.org/) for running the crawler. The crawler is invoked using the scrapy command as follows:

```bash
# Command Line Flags
# -o output file name
# -t format of output file
scrapy crawl yelp -o restaurant.csv -t csv
```

## Implementation Notes

We chose to crawl Yelp.com because of its relaxed rules around web crawling. While domains like Zomato.com and GrubHub.com aggressively block crawlers and instead encourage the use of their APIs, Yelp.com isn't as effective at blocking crawlers. To avoid being blocked by Yelp.com, we configured the crawler (see [settings.py](https://github.com/shantanusinghal/mad-repo/blob/master/cs699-python/Breakfast/settings.py)) to request pages at a slow, human-like pace and to make no concurrent requests.
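A minimal sketch of the kind of throttling configuration described above is shown here. The setting names are standard Scrapy settings, but the values are illustrative assumptions and are not taken from the project's `settings.py`.

```python
# settings.py (illustrative values only; assumptions, not the project's actual configuration)

# Wait between consecutive requests to the same site, in seconds.
DOWNLOAD_DELAY = 5

# When enabled, Scrapy waits a random time between 0.5x and 1.5x DOWNLOAD_DELAY,
# so the request timing looks less mechanical.
RANDOMIZE_DOWNLOAD_DELAY = True

# Allow only one request in flight at a time.
CONCURRENT_REQUESTS = 1
CONCURRENT_REQUESTS_PER_DOMAIN = 1

# Let Scrapy adjust the delay automatically based on how quickly the server responds.
AUTOTHROTTLE_ENABLED = True
```

A larger delay makes the crawl slower but less likely to trigger rate limiting; the right value depends on the site and has to be found by experiment.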

We've written the Python code in an object-oriented manner with domain entities such as [`YelpSpider`](https://github.com/shantanusinghal/mad-repo/blob/master/cs699-python/Breakfast/spiders/yelp.py#L6) and [`YelpListing`](https://github.com/shantanusinghal/mad-repo/blob/master/cs699-python/Breakfast/items.py#L18). The `YelpSpider` starts at the search results page and collects the links to individual restaurant pages, which are crawled asynchronously by the [`parse_listing_page`](https://github.com/shantanusinghal/mad-repo/blob/master/cs699-python/Breakfast/spiders/yelp.py#L25) function. For each listing page we build a `YelpListing` instance that encapsulates all the attributes associated with that web page. We've used XPath and CSS selectors to extract this data from the HTML body.
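To illustrate the idea of turning one listing page into one record with XPath, here is a small self-contained sketch. It uses the standard library's `xml.etree.ElementTree` (which supports a subset of XPath) as a stand-in for Scrapy's selectors. The markup, the class names, the `Listing` class and the `parse_listing` function are all hypothetical; the real Yelp pages and the project's `YelpListing` differ.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# Hypothetical, simplified and well-formed markup; real Yelp pages look different.
SAMPLE_PAGE = """
<html><body>
  <h1 class="biz-page-title">Short Stack Eatery</h1>
  <span class="review-count">282 reviews</span>
  <span class="category"><a>Breakfast &amp; Brunch</a></span>
</body></html>
"""


@dataclass
class Listing:
    """Hypothetical stand-in for the project's YelpListing item."""
    title: str
    reviews: int
    categories: list


def parse_listing(page: str) -> Listing:
    """Extract a few attributes from one listing page using XPath expressions."""
    root = ET.fromstring(page)
    title = root.find(".//h1[@class='biz-page-title']").text.strip()
    # "282 reviews" -> 282
    review_text = root.find(".//span[@class='review-count']").text
    reviews = int(review_text.split()[0])
    categories = [a.text for a in root.findall(".//span[@class='category']/a")]
    return Listing(title=title, reviews=reviews, categories=categories)


listing = parse_listing(SAMPLE_PAGE)
print(listing)
```

In a Scrapy spider the same XPath expressions would be passed to `response.xpath(...)` inside the callback, and the callback would yield the populated item instead of returning it.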

## Extracted data

The extracted data is stored in a CSV file and looks like this sample:

```csv
reviews,rating,price_to,title,price_from,address,phone_number,categories
282,4,30,Short Stack Eatery,11,"301 W Johnson St Madison, WI 53703",(608) 709-5569,["Breakfast & Brunch"]
```
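The address field contains a comma, so it is quoted in the CSV and the file should be read with a real CSV parser rather than by splitting on commas. The sketch below reads one row in this shape with the standard library; the `SAMPLE` string and variable names are illustrative, and the categories value is written as plain text, as it appears in the notebook output further down.

```python
import csv
import io

# One header line and one data row in the shape of the crawler's output.
SAMPLE = (
    "reviews,rating,price_to,title,price_from,address,phone_number,categories\n"
    '282,4.0,30,Short Stack Eatery,11,"301 W Johnson St Madison, WI 53703",'
    "(608) 709-5569,Breakfast & Brunch\n"
)

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
listing = rows[0]

# Every field arrives as a string, so numeric columns need explicit conversion.
reviews = int(listing["reviews"])
rating = float(listing["rating"])
price_range = (int(listing["price_from"]), int(listing["price_to"]))

print(listing["title"], reviews, rating, price_range)
print(listing["address"])
```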

## Visualizing data

In the following cells, we visualize the data to gain some insight into its underlying distribution. We use the pandas library to read the data from disk and numpy for data manipulation and transformation tasks.

* Fig 1 - Each bar denotes the average rating of restaurants in that category and the error bars show the standard deviation of ratings within the category. Looking at this plot, it is easy to identify 'Lounges' and 'Juice Bars' as two categories with generally poorly rated establishments.
* Fig 2 - Shows the relationship between ratings, number of reviews and prices at these restaurants. Looking at the plot, it's fair to say that neither the number of reviews nor the price point is a good indicator of the rating. This would imply that we need to collect more or different data about each restaurant before we can start predicting its rating.




```python
# import all required libraries
import plotly
# note: in plotly 4 and later this module lives in the separate chart_studio package
import plotly.plotly as py
import plotly.graph_objs as go
import numpy as np
import pandas as pd
import math
from plotly.graph_objs import *
from plotly.graph_objs import ColorBar

# set the credentials to access the Plotly API
#plotly.tools.set_credentials_file(username='shantanusinghal', api_key='*********')

# read the data collected from web scraping
df = pd.read_csv('breakfast_yelp.csv', encoding='latin1')
df
```

| | reviews | rating | price_to | title | price_from | address | phone_number | categories |
|---|---|---|---|---|---|---|---|---|
| 0 | 88 | 4.5 | 10 | Pat O'Malley's Jet Room | 0 | 3606 Corben Ct Madison, WI 53704 | (608) 268-5010 | Diners,Breakfast & Brunch,American (Traditional) |
| 1 | 135 | 4.5 | 10 | 4&20 Bakery & Cafe | 0 | 305 N 4th St Madison, WI 53704 | (608) 819-8893 | Bakeries,Breakfast & Brunch,Coffee & Tea |
| 2 | 51 | 4.5 | 10 | Cottage Cafe | 0 | 915 Atlas Ave Madison, WI 53714 | (608) 221-4815 | Breakfast & Brunch,Cafes |
| 3 | 351 | 4.0 | 30 | Bassett Street Brunch Club | 11 | 444 W Johnson St Madison, WI 53703 | (608) 467-5051 | Breakfast & Brunch,American (Traditional),Donuts |
| 4 | 246 | 4.0 | 30 | Eldorado Grill | 11 | 744 Williamson St Madison, WI 53703 | (608) 280-9378 | Tex-Mex,Breakfast & Brunch |
| 5 | 253 | 4.0 | 10 | Mickies Dairy Bar | 0 | 1511 Monroe St Madison, WI 53711 | (608) 256-9476 | Breakfast & Brunch |
| 6 | 146 | 4.0 | 10 | Crema Cafe | 0 | 4124 Monona Dr Madison, WI 53716 | (608) 224-1150 | Breakfast & Brunch,Sandwiches,Cafes |
| 7 | 282 | 4.0 | 30 | Short Stack Eatery | 11 | 301 W Johnson St Madison, WI 53703 | (608) 709-5569 | Breakfast & Brunch |
| 8 | 226 | 4.0 | 10 | Lazy Jane's | 0 | 1358 Williamson St Madison, WI 53703 | (608) 257-5263 | Cafes,Bakeries,Breakfast & Brunch |
| 9 | 141 | 3.5 | 30 | Gates & Brovi | 11 | 3502 Monroe St Madison, WI 53711 | (608) 819-8988 | American (New),Sports Bars,Breakfast & Brunch |


```python
# collect the ratings of every restaurant under each of its categories
# (columns are accessed by name rather than by position in df.values)
cat_dict = {}
for rating, categories in zip(df['rating'], df['categories']):
    for cat in categories.split(','):
        cat_dict.setdefault(cat, []).append(float(rating))

x_labels = [str(cat) for cat in cat_dict]

# calculate mean and std deviation of the ratings in each category
y_avg = [np.mean(cat_dict[cat]) for cat in cat_dict]
y_std = [np.std(cat_dict[cat]) for cat in cat_dict]

# define traces and layout for the figure
# (plain dicts replace the Marker, ErrorY, XAxis and YAxis helper classes)
trace = go.Bar(
    x=x_labels,
    y=y_avg,
    marker=dict(color='#E3BA22'),
    error_y=dict(
        type='data',
        array=y_std,
        color='#E6842A'
    )
)

layout = go.Layout(
    title='Fig 1. Category vs Ratings Plot',  # plot title
    yaxis=dict(
        title='Average Rating',  # y-axis title
        gridcolor='white'
    ),
    xaxis=dict(
        title='Category'  # x-axis title
    ),
    paper_bgcolor='rgb(233,233,233)',  # background around the plot
    plot_bgcolor='rgb(233,233,233)',   # background of the plotting area
)

figure = go.Figure(data=[trace], layout=layout)

# upload the figure and embed it in the notebook
py.iplot(figure, filename='categories-scatter')
```
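To read the lowest-rated categories off as numbers rather than from the chart, the same per-category statistics can be sorted by mean. The sketch below is an illustration using only the standard library and three rows copied from the table above, so its output says nothing about the full dataset; the variable names are hypothetical.

```python
from statistics import mean, pstdev

# (rating, categories) pairs copied from three rows of the table above
rows = [
    (4.5, "Bakeries,Breakfast & Brunch,Coffee & Tea"),
    (4.0, "Tex-Mex,Breakfast & Brunch"),
    (3.5, "American (New),Sports Bars,Breakfast & Brunch"),
]

# group the ratings by category, as the cell above does
ratings_by_category = {}
for rating, categories in rows:
    for cat in categories.split(","):
        ratings_by_category.setdefault(cat, []).append(rating)

# pstdev is the population standard deviation, matching np.std's default (ddof=0)
stats = {
    cat: (mean(vals), pstdev(vals))
    for cat, vals in ratings_by_category.items()
}

# sort ascending by mean so the lowest-rated categories come first
ranked = sorted(stats.items(), key=lambda item: item[1][0])
for cat, (avg, std) in ranked:
    count = len(ratings_by_category[cat])
    print(f"{cat:25s} mean={avg:.2f} std={std:.2f} n={count}")
```

A category that contains a single restaurant always has a standard deviation of zero, so it is worth reporting the count next to the mean before drawing conclusions about a category.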

```python
# pull the plotted columns out as 1-D arrays
# (DataFrame.as_matrix was removed in pandas 1.0 and returned 2-D arrays here)
rating = df['rating'].values
reviews = df['reviews'].values
price_to = df['price_to'].values
title = df['title'].values

trace = go.Scatter(
    x=reviews,
    y=rating,
    mode='markers',
    marker=dict(
        size=10,  # marker size is a number, not a string
        opacity=0.8,
        color=price_to,  # color each marker by the upper price bound
        colorscale='Viridis',
        showscale=True,
        colorbar=dict(title='Price')
    ),
    text=title  # restaurant name shown on hover
)

axis_style = dict(
    gridcolor='#FFFFFF',  # white grid lines
    zeroline=False        # remove thick zero line
)

layout = go.Layout(
    title='Fig 2. Price Heat Map',
    plot_bgcolor='#f2f2f2',
    xaxis=dict(axis_style, title='Number of Reviews'),
    yaxis=dict(axis_style, title='Rating')
)

figure = go.Figure(data=[trace], layout=layout)

# upload the figure and embed it in the notebook
py.iplot(figure, filename='basic-scatter')
```