# Web Scraping Script
------------------------------------

## Datasets

Uber API - https://developer.uber.com/dashboard/

Uber Rides Python SDK (beta) - https://github.com/uber/rides-python-sdk 

Lyft API - https://developer.lyft.com/v1/reference 

Weather API - https://openweathermap.org/api

Yelp API - https://www.yelp.com/developers/documentation/v3/business_search 

As listed in the Datasets section above, we query the Uber API. To do that, we first register an application on Uber’s developer dashboard and install the uber-rides SDK, which is also available on GitHub.

After successfully creating an Uber session with a server token, we found that the Uber Rides API accepts a pair of coordinates (latitude and longitude) for the start and end of a trip and returns data on its service types and their price estimates.

In [1]:
from __future__ import print_function
import json
import pprint
import requests
import urllib
import os
from datetime import timedelta
import datetime
import pandas as pd
import numpy as np
import sqlite3
from pandas.io import sql
try:
    from pandas import json_normalize  # pandas >= 1.0
except ImportError:
    from pandas.io.json import json_normalize  # older pandas

These are the imports needed to run our script.

## Yelp API

In this study, we focus on Uber rides in the Boston and Cambridge areas.

We restricted the study to these two small geographic regions for practical reasons: analyzing a smaller region should let us predict prices better using other factors such as weather or the popularity of places, and it keeps the analysis feasible with the resources available.

We use the Yelp API to get location coordinates, which are then fed to both the Uber and Lyft APIs for price prediction.

In [1]:
#Boston zipcodes: http://www.city-data.com/zipmaps/Boston-Massachusetts.html

#Cambridge zipcodes: http://www2.cambridgema.gov/CityOfCambridge_Content/documents/ZipCodeMap.pdf

zipcodes = ["02108", "02109", "02110", "02111", "02113", "02114", "02115", "02116", "02118", "02119", "02120", "02121", "02122", 
            "02124", "02125", "02126", "02127", "02128", "02129", "02130", "02131", "02132", "02134", "02135", "02136", "02151", 
            "02152", "02163", "02199", "02203", "02210", "02215", "02467", "02138", "02139", "02140", "02141", "02142"]

Instead of inputting random coordinates in Boston and Cambridge, the two areas we are studying, we query Yelp’s API because it returns the locations of actual businesses, and the returned data includes latitude and longitude.

In [0]:
# data = {'grant_type': 'client_credentials',
#         'client_id': app_id,
#         'client_secret': app_secret}
# token = requests.post('https://api.yelp.com/oauth2/token', data=data)
# access_token = token.json()['access_token']
api_key = 'nIBG6z_BWFoH2CUvBagSh-7LNxr0UIXp0TIgnKrxDvCkBiCYu2InnKrqFvm-_KYm2a8_EBoFGiAqv5OMsdg6h38Vn-BiQNLx6hhiTgKD1F7gBKNn5SQRAJgh3EGcWnYx'
url = 'https://api.yelp.com/v3/businesses/search'
headers = {'Authorization': 'bearer %s' % api_key}

# Yelp v3 API: https://nz.yelp.com/developers/documentation/v3
# https://www.yelp.com/developers/documentation/v3/business_search

results = []
for z in zipcodes:  # loop over the zipcodes in Boston and Cambridge
    params = {'location': z,
              'categories': 'active', # 'active' is the alias of Yelp's Active Life category: https://www.yelp.com/developers/documentation/v3/all_category_list
              'limit': 50} # maximum 50
    # the request has to run inside the loop; otherwise only the last zipcode is queried
    response = requests.get(url=url, params=params, headers=headers)
    results.extend(response.json().get('businesses', []))

for business in results:
    print(business['name'], business['location'])

print('\nTotal Businesses retrieved:', len(results))


conn = sqlite3.connect('yelp.db')
cur = conn.cursor()


# normalizing json to pandas dataframe
df = json_normalize(results)

df = df.drop(['categories', 'location.display_address', 'transactions'], axis=1)

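# Renaming the columns: replace the literal dots left by json_normalize
# (regex=False keeps the behavior the same across pandas versions)
df.columns = df.columns.str.replace('.', '_', regex=False)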

# converting to sqlite
df.to_sql("yelp_businesses", conn, if_exists="replace")

#Getting the data into our dataframe
pd.read_sql_query("select * from yelp_businesses;", conn)

We use the API to collect data for the given parameters (the zipcodes), then convert the JSON response to a dataframe and store it in SQLite.
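
The flattening step can be tried offline. The sketch below is a minimal illustration, assuming pandas 1.0 or later (for `pd.json_normalize`). The records in `sample_businesses` are hypothetical and only mimic the shape of a Yelp business search result. An in-memory SQLite database stands in for `yelp.db`.

```python
import sqlite3

import pandas as pd

# Hypothetical records shaped like Yelp business search results (all values made up).
sample_businesses = [
    {
        "name": "Example Climbing Gym",
        "coordinates": {"latitude": 42.3601, "longitude": -71.0589},
        "location": {
            "address1": "1 Example St",
            "city": "Boston",
            "zip_code": "02108",
            "display_address": ["1 Example St", "Boston, MA 02108"],
        },
        "categories": [{"alias": "climbing", "title": "Climbing"}],
        "transactions": [],
    },
    {
        "name": "Example Yoga Studio",
        "coordinates": {"latitude": 42.3736, "longitude": -71.1097},
        "location": {
            "address1": "2 Sample Ave",
            "city": "Cambridge",
            "zip_code": "02138",
            "display_address": ["2 Sample Ave", "Cambridge, MA 02138"],
        },
        "categories": [{"alias": "yoga", "title": "Yoga"}],
        "transactions": [],
    },
]

# Nested dicts become dotted columns such as "coordinates.latitude".
flat = pd.json_normalize(sample_businesses)

# List-valued columns cannot be stored in SQLite, so drop them first.
flat = flat.drop(columns=["categories", "location.display_address", "transactions"])

# Dots are awkward in SQL column names; replace them literally (not as a regex).
flat.columns = flat.columns.str.replace(".", "_", regex=False)

demo_conn = sqlite3.connect(":memory:")  # in-memory stand-in for yelp.db
flat.to_sql("yelp_businesses", demo_conn, if_exists="replace", index=False)
round_trip = pd.read_sql_query(
    "SELECT name, coordinates_latitude, coordinates_longitude FROM yelp_businesses;",
    demo_conn,
)
demo_conn.close()
print(round_trip)
```

The column names produced here (`coordinates_latitude`, `location_zip_code`, and so on) are the ones the later SQL queries select.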

## Uber API

We import the Uber SDK and create a client to call the API.

In [0]:
# pip install uber-rides
from uber_rides.session import Session as uber_Session
from uber_rides.client import UberRidesClient
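# conda install -c conda-forge geopy
# note: vincenty was removed in geopy 2.0, so this import needs geopy < 2.0 (geodesic is its replacement)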
from geopy.distance import vincenty
import csv

session = uber_Session(server_token='Uvu3eEPnLtPKCbTU7KrCko5jo1ua4CVgYAqd0JfO')
client = UberRidesClient(session)

We draw the source and destination coordinates for the API at random from our Yelp database.

In [0]:
df1 = pd.read_sql_query("SELECT name, coordinates_latitude, coordinates_longitude, location_address1, location_address2, location_address3, location_city, location_state, location_zip_code, location_country FROM yelp_businesses ORDER BY RANDOM() LIMIT 1;", conn)
df1.head()

df2 = pd.read_sql_query('SELECT name, coordinates_latitude, coordinates_longitude, location_address1, location_address2, location_address3, location_city, location_state, location_zip_code, location_country FROM yelp_businesses where name IN (SELECT name FROM yelp_businesses ORDER BY RANDOM() LIMIT 1)', conn)
df2.head()

start_loc = (df1['coordinates_latitude'][0], df1['coordinates_longitude'][0])
start_loc

end_loc = (df2['coordinates_latitude'][0], df2['coordinates_longitude'][0])
end_loc
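
Two independent random draws can return the same business for both ends of the trip. One possible alternative, sketched here under assumptions, is to draw both rows in a single query with `LIMIT 2`. The helper `pick_trip_endpoints` and the three demo rows are hypothetical. The table and column names follow the `yelp_businesses` table used above.

```python
import sqlite3

# In-memory stand-in for yelp.db with made-up rows.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE yelp_businesses ("
    "name TEXT, coordinates_latitude REAL, coordinates_longitude REAL)"
)
demo_conn.executemany(
    "INSERT INTO yelp_businesses VALUES (?, ?, ?)",
    [
        ("Example Gym", 42.3601, -71.0589),
        ("Example Studio", 42.3736, -71.1097),
        ("Example Pool", 42.3467, -71.0972),
    ],
)


def pick_trip_endpoints(conn):
    """Return two different rows as (start, end) dicts (hypothetical helper)."""
    rows = conn.execute(
        "SELECT name, coordinates_latitude, coordinates_longitude "
        "FROM yelp_businesses ORDER BY RANDOM() LIMIT 2"
    ).fetchall()
    if len(rows) < 2:
        raise ValueError("need at least two businesses to build a trip")
    start_row, end_row = rows
    start = {"name": start_row[0], "loc": (start_row[1], start_row[2])}
    end = {"name": end_row[0], "loc": (end_row[1], end_row[2])}
    return start, end


start, end = pick_trip_endpoints(demo_conn)
print(start, end)
```

Because both rows come from one shuffled result set, they are always different rows, although two rows could still share a name if the table contains duplicates.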

We only request an estimate when the source and destination are more than 1 mile apart, since riders are unlikely to book a trip shorter than that.

In [0]:
distance = vincenty(start_loc, end_loc).miles
uber_rides = []  # stays empty when the trip is too short to query
if distance > 1:
    print(distance, '\n')
    response = client.get_price_estimates(
        start_latitude=df1['coordinates_latitude'][0],
        start_longitude=df1['coordinates_longitude'][0],
        end_latitude=df2['coordinates_latitude'][0],
        end_longitude=df2['coordinates_longitude'][0]
    )
    uber_rides = response.json.get("prices")
    print(uber_rides)
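
The distance check does not strictly need geopy. As a rough sketch, the haversine formula below approximates the same distance on a spherical Earth. This is a simpler technique than the ellipsoidal `vincenty` calculation used in this notebook, so results can differ slightly. The function names and the sample coordinates are illustrative assumptions.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(start, end):
    """Great-circle distance in miles between two (latitude, longitude) pairs."""
    lat1, lon1 = map(math.radians, start)
    lat2, lon2 = map(math.radians, end)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def is_worth_querying(start, end, min_miles=1.0):
    """True when the trip is longer than the minimum distance."""
    return haversine_miles(start, end) > min_miles


# Illustrative coordinates: one point in Boston, one in Cambridge.
boston_point = (42.3601, -71.0589)
cambridge_point = (42.3736, -71.1097)
print(haversine_miles(boston_point, cambridge_point))
print(is_worth_querying(boston_point, cambridge_point))
```

For a threshold of about a mile, the spherical approximation should be close enough to decide whether to call the pricing API.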
       

Adding the time and coordinates to each record of the JSON response.

In [0]:
dt = datetime.datetime.now()
for rides in uber_rides:
    rides["time"] = dt.strftime('%H:%M:%S')
    rides['day'] = dt.strftime('%A')
    rides['date'] = dt.strftime('%B %d, %Y')
    rides["start_latitude"] = df1['coordinates_latitude'][0]
    rides["start_longitude"] = df1['coordinates_longitude'][0]
    rides["end_latitude"] = df2['coordinates_latitude'][0]
    rides["end_longitude"] = df2['coordinates_longitude'][0]
    rides['start_location'] = df1['name'][0]
    rides['end_location'] = df2['name'][0]
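
The same tagging is applied to the Lyft records later on, so it could be factored into one helper. The sketch below is one way to do that. `annotate_rides` and the `start`/`end` dictionaries are hypothetical names, and the sample ride record is made up.

```python
import datetime


def annotate_rides(rides, start, end, when=None):
    """Return copies of the ride records tagged with a timestamp and trip endpoints.

    start and end are dicts with 'name', 'latitude' and 'longitude' keys
    (a hypothetical shape chosen for this sketch).
    """
    when = when or datetime.datetime.now()
    stamp = {
        "time": when.strftime("%H:%M:%S"),
        "day": when.strftime("%A"),
        "date": when.strftime("%B %d, %Y"),
        "start_latitude": start["latitude"],
        "start_longitude": start["longitude"],
        "end_latitude": end["latitude"],
        "end_longitude": end["longitude"],
        "start_location": start["name"],
        "end_location": end["name"],
    }
    # dict(ride, **stamp) copies each record, so the API response is left untouched.
    return [dict(ride, **stamp) for ride in rides]


raw_rides = [{"display_name": "UberX"}]  # made-up sample record
example = annotate_rides(
    raw_rides,
    start={"name": "Example Gym", "latitude": 42.3601, "longitude": -71.0589},
    end={"name": "Example Studio", "latitude": 42.3736, "longitude": -71.1097},
    when=datetime.datetime(2018, 4, 2, 9, 30, 0),
)
print(example[0])
```

Passing `when` explicitly makes it possible to stamp the Uber and Lyft records of one trip with the same moment.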

We convert the JSON records to a dataframe and append them to a CSV file. This lets us accumulate data across runs for further analysis.

In [0]:
df_uber = pd.DataFrame(uber_rides)

# to append when sending to server
with open('uber_test.csv', 'a') as f:
    df_uber.to_csv(f, sep=',', encoding='utf-8', index=False, header=False)
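
Appending with `header=False` leaves the CSV without column names, and rows only line up if every run writes the columns in the same order. A possible refinement, sketched here with a hypothetical helper `append_to_csv`, writes the header only when the file is new or empty and optionally pins the column order.

```python
import os

import pandas as pd


def append_to_csv(df, path, columns=None):
    """Append df to path, writing the header only if the file is new or empty."""
    if columns is not None:
        # Pin the column order so rows from different runs line up.
        df = df.reindex(columns=columns)
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    df.to_csv(path, mode="a", sep=",", encoding="utf-8", index=False, header=write_header)
```

A call such as `append_to_csv(df_uber, 'uber_test.csv')` would then take the place of the `with open(...)` block.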
    
    

## Lyft API

In [1]:
# pip install lyft_rides
from lyft_rides.auth import ClientCredentialGrant
from lyft_rides.session import Session as lyft_Session
from lyft_rides.auth import AuthorizationCodeGrant

Using the same source and destination coordinates, we get price estimates from the Lyft API so that we can compare Uber and Lyft.

In [0]:
auth_flow = ClientCredentialGrant(
    'gRUenY4LPYg_',
    'dFyiT-f23Jwmo_A7n2xGfzp_WcWvBIi8',
    'public',
    )
lyft_session = auth_flow.get_session()

#Use the same location
df1.head()

df2.head()

# Get the available ride types; Lyft's glossary describes them: https://developer.lyft.com/docs/glossary
from lyft_rides.client import LyftRidesClient

lyft_client=LyftRidesClient(lyft_session)
lyft_type_response = lyft_client.get_ride_types(df1['coordinates_latitude'][0], df1['coordinates_longitude'][0])
ride_types = lyft_type_response.json.get('ride_types')
print(ride_types)

As with Uber, we only request a cost estimate when the trip is longer than 1 mile.

In [0]:
# Get the estimated ride cost
distance = vincenty(start_loc, end_loc).miles
lyft_rides = []  # stays empty when the trip is too short to query
if distance > 1:
    print(distance)
    lyft_price_response = lyft_client.get_cost_estimates(
        start_latitude=df1['coordinates_latitude'][0],
        start_longitude=df1['coordinates_longitude'][0],
        end_latitude=df2['coordinates_latitude'][0],
        end_longitude=df2['coordinates_longitude'][0]
    )
    lyft_rides = lyft_price_response.json.get('cost_estimates')
    print(lyft_rides)

We add the date and time and the start and end locations to each JSON record, then convert the records to a dataframe and append them to a CSV file.

In [0]:
for rides in lyft_rides:
    rides["time"] = dt.strftime('%H:%M:%S')
    rides['day'] = dt.strftime('%A')
    rides['date'] = dt.strftime('%B %d, %Y')
    rides["start_latitude"] = df1['coordinates_latitude'][0]
    rides["start_longitude"] = df1['coordinates_longitude'][0]
    rides["end_latitude"] = df2['coordinates_latitude'][0]
    rides["end_longitude"] = df2['coordinates_longitude'][0]
    rides['start_location'] = df1['name'][0]
    rides['end_location'] = df2['name'][0]
    
df_lyft = pd.DataFrame(lyft_rides)
df_lyft

file_name_lyft = os.path.join(os.getcwd(), 'lyft_test.csv')
df_lyft.to_csv(file_name_lyft, sep=',', encoding='utf-8', index=False)

# to append when sending to server
with open('lyft_test.csv', 'a') as f:
    df_lyft.to_csv(f, sep=',', encoding='utf-8', index=False, header=False)
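
To compare the two services, their estimates first need a common unit. The sketch below assumes, as we understand the two APIs, that Uber price records carry `low_estimate` and `high_estimate` in whole currency units, while Lyft cost estimates carry `estimated_cost_cents_min` and `estimated_cost_cents_max` in cents. These field names should be checked against real responses. The helper `midpoint_price` and the sample records are hypothetical.

```python
def midpoint_price(record, provider):
    """Midpoint of the estimated price range in whole currency units, or None."""
    if provider == "uber":
        low, high = record.get("low_estimate"), record.get("high_estimate")
        scale = 1.0  # assumed to be whole currency units already
    elif provider == "lyft":
        low = record.get("estimated_cost_cents_min")
        high = record.get("estimated_cost_cents_max")
        scale = 100.0  # assumed to be cents
    else:
        raise ValueError("unknown provider: %r" % provider)
    if low is None or high is None:
        # Some products may not return a numeric range.
        return None
    return (low + high) / 2.0 / scale


# Made-up sample records for illustration only.
uber_sample = {"display_name": "UberX", "low_estimate": 8, "high_estimate": 12}
lyft_sample = {
    "ride_type": "lyft",
    "estimated_cost_cents_min": 900,
    "estimated_cost_cents_max": 1300,
}

print(midpoint_price(uber_sample, "uber"))
print(midpoint_price(lyft_sample, "lyft"))
```

Using the midpoint is only one choice; keeping the low and high values as separate columns would retain more information for the later analysis.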

## Weather API

We fetch the current weather at the start location as an additional feature for our dataset.

In [4]:
#Get the current weather information from lat and long
#http://api.openweathermap.org/data/2.5/weather?lat=42.37046&lon=-71.10352&appid=119fe664452f079528a64467c793dd7d
lat = str(df1['coordinates_latitude'][0])
lon = str(df1['coordinates_longitude'][0])  # renamed from `long`, which shadows a Python 2 builtin

api_address = 'http://api.openweathermap.org/data/2.5/weather'
# let requests build the query string; the empty trailing `q=` parameter is dropped
weather_params = {'lat': lat, 'lon': lon, 'appid': '119fe664452f079528a64467c793dd7d'}

# real_time=time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())

weather_data = requests.get(api_address, params=weather_params).json()

print(weather_data)
# print(real_time)


In [0]:
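# flatten the nested JSON response into a one-row dataframe
df_weather = json_normalize(weather_data)
df_weather.columns = df_weather.columns.str.replace('.', '_', regex=False)
df_weather.head()

df_weather = df_weather[['weather', 'main_temp', 'main_temp_max', 'main_temp_min']]
df_weather.head()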
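
Two details of the OpenWeatherMap response matter when using it as a feature. Temperatures are in Kelvin unless a `units` parameter is passed, and the `weather` field is a list of condition objects rather than a scalar. The sketch below shows one way to reduce a response to flat values. `summarize_weather` is a hypothetical helper and `sample_payload` is a made-up, trimmed-down response.

```python
def kelvin_to_celsius(kelvin):
    """Convert Kelvin to Celsius, passing None through unchanged."""
    return None if kelvin is None else round(kelvin - 273.15, 2)


def summarize_weather(payload):
    """Reduce a current-weather response to flat, analysis-friendly values."""
    conditions = payload.get("weather") or [{}]
    main = payload.get("main", {})
    return {
        "condition": conditions[0].get("main"),  # first listed condition
        "temp_c": kelvin_to_celsius(main.get("temp")),
        "temp_max_c": kelvin_to_celsius(main.get("temp_max")),
        "temp_min_c": kelvin_to_celsius(main.get("temp_min")),
    }


# Made-up payload that mimics the shape of a current-weather response.
sample_payload = {
    "weather": [{"id": 800, "main": "Clear", "description": "clear sky"}],
    "main": {"temp": 283.15, "temp_max": 285.15, "temp_min": 280.15},
}

summary = summarize_weather(sample_payload)
print(summary)
```

A flat dictionary like this can be merged into each ride record before it is written to the CSV files.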