# NEXT Recommendation System
  **2017-06-12**          
  _Jon Petersen, petersen@uber.com_       
                  
                     
                     
### Objective 
  * find the ten most recommended venues for a given user in AM, PM, and late night    
  * each recommended venue is to be a venue not visited by the user for that time bucket 
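
As a rough illustration of the three time buckets, the helper below mirrors the request-hour ranges used by the `time_block` expression in this notebook's SQL queries (6 to 10 for AM, 16 to 20 for PM, and 21 to 23 plus 0 for late night). The function name `time_block_for_hour` is hypothetical and is not part of the notebook; hours outside the three ranges map to `None`, as in the queries.

```python
def time_block_for_hour(hour):
    """Map a local request hour (0-23) to a time bucket, or None if unbucketed."""
    if 6 <= hour <= 10:
        return 'AM'
    if 16 <= hour <= 20:
        return 'PM'
    if hour in (0, 21, 22, 23):
        return 'late_night'
    return None  # e.g. midday and early-morning hours are not bucketed


# bucket a few sample hours
sample_blocks = [time_block_for_hour(h) for h in (8, 12, 18, 23, 0, 1)]
```

Trips whose request hour falls outside all three ranges get no bucket, which is why some rows in the trip tables have an empty `time_block`.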
  
### Current Approach 
  * given a client UUID, fetch all of that client's venue trips within a specified date range (I use _2016-10-01_ to _2017-05-31_)
  * fetch all trips from all clients who have taken at least one trip to a venue in common with the test user    
  * run a collaborative filtering algo to score non-visited venues     
    _Note: we assume that a visit is analogous to a ranking, so more visits mean higher satisfaction; this assumption is clearly at risk if a visit was a poor-quality trip_
  * return the 10 highest-ranked venues    
  * if fewer than 10 venues get ranked (e.g. intersection for morning venues too small to generate recommendations), then suggest venues with the highest overall visits

### Plan
  * test this out with Advanced Programs team & friends (Jeff, Nikhil, Janie, JBad, Rob D, Bryant, JP)   
  * iterate to make sure recommendations make sense   
  * once this is stable, roll out to a much broader audience (company wide with a UI?); iterate & improve until algo is sufficiently tested and hardened  
  
  **things to improve upon this version:**    
      1. consider upranking lower-ranked venues that are closer to the rider's destination      
        (_open question: when in the app would these recs be surfaced? if they are pre-request this may be difficult, since we may not know with a high degree of confidence where the rider is going_)      
      2. we need to pay attention to sparsity: most people included here are more or less power users, so how well does this perform for those who take far fewer trips? if relatively poorly, we should pivot from a user-based collaborative filter to a probabilistic matrix factorization (PMF) model
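
One simple way to keep an eye on the sparsity concern is to track the fraction of non-zero entries in the user-location matrix. The sketch below is only an illustration on a toy matrix; `matrix_density` is a hypothetical helper, not something defined in this notebook, and the numbers are made up.

```python
def matrix_density(matrix):
    """Fraction of non-zero cells in a list-of-lists count matrix."""
    total = sum(len(row) for row in matrix)
    nonzero = sum(1 for row in matrix for value in row if value != 0)
    return float(nonzero) / total if total else 0.0


# toy matrix: 2 users x 3 venues, entries are visit counts
toy_matrix = [[2, 1, 0],
              [0, 1, 1]]
toy_density = matrix_density(toy_matrix)  # 4 of 6 cells are non-zero
```

A density that drops sharply when lighter riders are added would be one signal that the similarity estimates are resting on very few co-visits.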

### Algo details
   This algo has four steps. Let $t$ index the time block, $t \in \left\{ AM, PM, late\right\}$        
   $\hspace{0.3cm}$ **step 1**: create user-location matrix $A$ for which $a_{ij}$ represents the number of trips for user $i \in U$ to location $j \in L$ within time block $t$    
   $\hspace{0.3cm}$ **step 2**: create item-similarity index - let $s(i,j,t)$ denote the cosine similarity for items $i$ and $j$ during time block $t$:      
$
s(i,j,t) = cos(i,j,t) = \frac{i \cdot j}{||i||_{2} ||j||_{2}} = \frac{\sum_{u \in U(i,j)}{x_{ui}x_{uj}}}{\sqrt{\sum_{u \in U(i,j)}x_{ui}^{2}}\sqrt{\sum_{u \in U(i,j)}x_{uj}^{2}}}
$
           where $x_{ui}$ is the number of visits by user $u \in U$ to venue $i \in I$     
   $\hspace{0.3cm}$**step 3**: generate scores for new venues: for active user $u \in U$, the predicted ranking for new item $j \in I$ is    
$r(u,j,t) = \frac{\sum_{k \in L(u,t)} r_{u,k}\,s(j,k,t)}{\sum_{k \in L(u,t)}|s(j,k,t)|}$ where $L(u,t)$ is the set of locations visited by user $u$ during time block $t$ and $r_{u,k} = x_{uk}$ is the number of visits by $u$ to venue $k$    
   $\hspace{0.3cm}$**step 4**: return the $q$ (e.g. 10) highest-scoring venues $j$ for which $j \not\in L(u,t)$; if fewer than $q$ venues are returned, fill in the rest with the most common venues not visited by $u$ during $t$
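
To make steps 2 and 3 concrete, here is a minimal sketch on a toy user-location matrix, using only the standard library. The helper names (`cosine_similarity`, `predicted_score`) and the toy data are assumptions for illustration. The similarity is computed over full venue columns, which is also what the scipy-based code later in this notebook does; the score uses the step 3 formula with visit counts as the rankings.

```python
import math


def cosine_similarity(x, y):
    """Cosine similarity of two equal-length visit-count vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm > 0 else 0.0


def predicted_score(visits, sims):
    """Step 3: similarity-weighted average of the user's visit counts.

    visits: {visited venue k: number of visits by the user}
    sims:   {visited venue k: similarity between k and the candidate venue}
    """
    num = sum(visits[k] * sims[k] for k in visits)
    den = sum(abs(sims[k]) for k in visits)
    return num / den if den > 0 else 0.0


# toy matrix: rows are users u0..u2, columns are venues a, b, c
A = [[2, 1, 0],
     [0, 1, 1],
     [1, 0, 1]]
columns = {name: [row[j] for row in A] for j, name in enumerate(['a', 'b', 'c'])}

# user u0 visited a twice and b once; c is the candidate venue
visits_u0 = {'a': 2, 'b': 1}
sims_to_c = {k: cosine_similarity(columns['c'], columns[k]) for k in visits_u0}
score_c = predicted_score(visits_u0, sims_to_c)
```

Because the score is a weighted average of the user's own visit counts, it always lands between the smallest and largest of those counts; here it falls between 1 and 2.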


In [1]:
user_dict = {'JP': tuple(['75dbe0db-10cf-47d2-998a-ade9027824d4',1,'JP']),
             'JP_MSP': tuple(['75dbe0db-10cf-47d2-998a-ade9027824d4',28,'JP_MSP']),
             'NG': tuple(['d91dc3e1-6eec-480c-a29d-58816b4229bc',1,'Nikhil']),
             'JG': tuple(['0fc5c0be-ff04-41e6-bbb3-0d8a527d796b',1,'Janie']),
             'JB': tuple(['a8f576fb-e470-4d91-91c2-1417b13a06eb',1,'JBad']),
             'RD': tuple(['eba76912-6c4e-4fbb-aa9d-59e00da3e371',5,'Rob']),
             'BJ': tuple(['34948544-648e-40b3-85cc-e015d1bc4550',1,'Bryant']),
             'JH': tuple(['c597668c-e279-4fc3-aa06-fc03836794d3',1,'Jeff']),
             'SC': tuple(['ecdd6690-93e9-4ead-8871-dc69d17bac31',1,'Sean'])}

### set configs

In [2]:
rider       = 'NG'
begin_date  = '2017-04-01'
end_date    = '2017-09-01'
excluded_categories = list(['Airport','Metro Station','Train Station','Light Rail Station',
                            'Hotel','Food & Beverage','Education','Bus Station','Apartment/Condo',
                            'Auto Service','Office','Department Store','Shops & Services',
                            'Clothing & Accessories','Professional Place','Gym','Hospital',
                            'Hair Care','Electronics','Pharmacy','Church','Post Office', 
                            'Veterinarian', 'Urgent Care', 'Florist', 'Convenience Store', 
                            'Courthouse', 'Embassy', 'Gas Station', 'Spiritual Center', 
                            'Car Rental', 'Bank', 'Bus Station', 'Bike Shop', 'Ferry Station',
                            'Dentist', 'Shoe Store', 'Government', 'Jewelry', 'Adult Entertainment',
                            'Medical Center', 'College & University', 'City Hall', 'Physician',
                            'Laundry', 'Car Dealer', 'Hardware Store', 'Library', 'Home (private)', 
                            'Bus Stop', 'Toll Gate', 'Emergency Room', 'Police Station', 'Real Estate',
                            'Health & Fitness', 'Car Wash', 'Banking & Finance', 'Parking', 
                            'Home & Garden'])
output_recs_path = './saved_recs/'

client_uuid = user_dict[rider][0]
city_id = user_dict[rider][1]
client_name = user_dict[rider][2]

n_top_venues = 500   # if not None, keep only the top-N venues by total trips;
                     # if None, consider all venues

print 'running algo for {} (city ID: {})'.format(client_name, city_id)
print 'user trips from  {} to {}'.format(begin_date, end_date)
print 'using top {} venues'.format(n_top_venues if n_top_venues else 'unlimited')

running algo for Nikhil (city ID: 1)
user trips from  2017-04-01 to 2017-09-01
using top 500 venues


In [3]:
UBER_BLUE = '#1FBAD6'
UBER_RED = '#F25754'
UBER_YELLOW = '#EDCA3A'
UBER_GREEN = '#8BC53F'

In [4]:
# ------
import time
import numpy as np 
from scipy.spatial.distance import cosine
import pandas as pd
import collections
import haversine
import datetime
import operator
from datetime import timedelta
import folium
from folium import plugins
from queryrunner_client import Client, QueryRunnerException 
# ------
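
The queries in this notebook derive `haversine_miles` in SQL by scaling a planar `ST_DISTANCE` (in degrees) by a metres-per-degree constant. For comparison, a great-circle distance can be sketched with the standard library alone. This is an illustrative helper, not the notebook's method: the function name `great_circle_miles` and the Earth-radius constant are assumptions, and the coordinates are the Crissy Field and Pier 39 rows from the morning recommendations.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # assumed mean Earth radius


def great_circle_miles(lat1, lng1, lat2, lng2):
    """Haversine great-circle distance in miles between two (lat, lng) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


# Crissy Field to Pier 39
crissy_to_pier39 = great_circle_miles(37.803932, -122.464902, 37.809754, -122.410323)
```

Unlike a flat scaling of degrees, this accounts for longitude lines converging away from the equator, which matters for east-west trips at San Francisco's latitude.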

In [5]:
%pylab inline

Populating the interactive namespace from numpy and matplotlib


In [6]:
# 
# fetch user trips
#

In [7]:
client_query = """
    with vvid_table as (
        select distinct vehicle_view_id as vvid
        from dwh.fact_trip 
        where 1=1 
            and datestr = '{{start_date}}' 
            and status = 'completed' 
            and city_id = {{city_id}}
            and product_type_name in ('uberX', 'uberPOOL', 'Black', 'SELECT', 'uberXL', 'UberSUV')
    ),
    
    places_trips as (
        select djc.datestr 
            , djc.msg.clientUUID as client_uuid
            , djc.msg.jobUUID as venue_trip_uuid 
            , element_at(djc.msg.waypoints, CARDINALITY(djc.msg.waypoints)).location.referenceType as ref_type
            , element_at(djc.msg.waypoints, CARDINALITY(djc.msg.waypoints)).location.uuid as location_uuid 
            , (case when element_at(djc.msg.waypoints, CARDINALITY(djc.msg.waypoints)).location.reference is null 
                then element_at(djc.msg.waypoints, CARDINALITY(djc.msg.waypoints)).anchorGeolocation.location.id 
                else element_at(djc.msg.waypoints, CARDINALITY(djc.msg.waypoints)).location.reference end) as placeId 
        from hdrone.demand_job_completed djc
        where 1=1 
            and FROM_ISO8601_DATE(djc.datestr) between FROM_ISO8601_DATE('{{start_date}}') - interval '2' day and FROM_ISO8601_DATE('{{end_date}}') + interval '2' day
            and djc.msg.clientUUID = '{{client_uuid}}'
            and djc.msg.region.countryId = 1 
            and djc.msg.vehicleViewId IN (select * from vvid_table)
            and ((element_at(djc.msg.waypoints,CARDINALITY(djc.msg.waypoints)).location.reference is not null) 
                or (element_at(djc.msg.waypoints,CARDINALITY(djc.msg.waypoints)).anchorGeolocation.location.id is not null))
    ),

    venue_trips as (
        select pt.datestr 
            , pt.client_uuid 
            , pt.venue_trip_uuid 
            , pt.placeId 
            , msp.uuid as venue_uuid
            , msp.name as venue_name 
            , msp.uber_category_names 
            , msp.lat as venue_lat 
            , msp.lng as venue_lng
        from places_trips pt 
        join map_services.places msp 
            on pt.placeId = msp.uuid 
        where 1=1 
            and msp.uuid is not null 
            and CARDINALITY(ARRAY_INTERSECT(msp.uber_category_names, SPLIT('{{excluded_categories}}',','))) = 0
    ),

    all_trips as (
        select t.request_date_id_local as datestr_local 
            , t.city_id 
            , t.uuid as trip_uuid 
            , t.client_uuid 
            , t.request_timestamp_local as req_at_local 
            , extract(HOUR FROM t.request_timestamp_local) as local_req_hour
            , (case 
                when extract(HOUR FROM t.request_timestamp_local) between 6 and 10 then 'AM' 
                when extract(HOUR FROM t.request_timestamp_local) between 16 and 20 then 'PM' 
                when extract(HOUR FROM t.request_timestamp_local) IN (0,21,22,23) then 'late_night'
                else NULL 
                end ) as time_block            
            , t.request_lat 
            , t.request_lng 
            , t.begintrip_timestamp_local as begin_at_local
            , t.begintrip_lat 
            , t.begintrip_lng 
            , t.dropoff_timestamp_local as end_at_local 
            , t.dropoff_lat 
            , t.dropoff_lng 
            , t.trip_distance_miles
            , t.product_type_name as product_name
            , t.surge_multiplier
            , ST_DISTANCE(ST_POINT(t.begintrip_lng, t.begintrip_lat),
                ST_POINT(t.dropoff_lng, t.dropoff_lat)) * 111195 * 0.000621371
                as haversine_miles
            , t.vehicle_view_id as vvid 
            , t.product_type_name
            , (case when vt.venue_trip_uuid is not null then 1 else 0 end) as is_venue_trip
            , vt.venue_name as place_name 
            , vt.venue_uuid
            , vt.uber_category_names
            , vt.venue_lat
            , vt.venue_lng
        from dwh.fact_trip t 
        left outer join venue_trips vt 
            on t.uuid = vt.venue_trip_uuid 
        where 1=1 
            and t.request_date_id_local between '{{start_date}}' and '{{end_date}}' 
            and t.status = 'completed' 
            and t.city_id = {{city_id}}
            and t.product_type_name IN ('uberX', 'uberPOOL', 'SELECT', 'uberXL', 'UberBLACK', 'UberSUV')
            and t.client_uuid = '{{client_uuid}}'
    )

    select datestr_local 
        , client_uuid 
        , city_id
        , trip_uuid 
        , extract(HOUR FROM req_at_local) as local_hr_request
        , time_block
        , vvid 
        , product_type_name
        , req_at_local 
        , is_venue_trip
        , place_name 
        , venue_uuid 
        , uber_category_names[1] as primary_category 
        , (case when cardinality(uber_category_names) > 1 then uber_category_names[2] else NULL end) as secondary_category
        , venue_lat
        , venue_lng 
        , request_lat 
        , request_lng 
        , begin_at_local 
        , begintrip_lat 
        , begintrip_lng 
        , end_at_local 
        , dropoff_lat 
        , dropoff_lng 
        , trip_distance_miles
        , haversine_miles
        , product_name
        , surge_multiplier
    from all_trips 
    where 1=1 
        and venue_uuid is not null 
    order by 1,2,3  
"""

In [8]:
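def add_query_params(query, params):
    """Convert {{name}} placeholders into str.format fields and fill them from params."""
    # collapse parenthesized placeholders, e.g. ({{x}}) -> {x}
    no_parentheses = query.replace('({{', '{').replace('}})', '}')
    new_query = no_parentheses.replace('{{', '{').replace('}}', '}')
    # str.replace returns a new string, so the result has to be assigned back
    new_query = new_query.replace('(\'[\'','(\'').replace('\']\')','\')')
    old_str = '[\'['
    new_query = new_query.replace(old_str,'[')
    with_query_params = new_query.format(**params)
    return with_query_params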
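
The core of the placeholder substitution can be seen on a toy template. This is a stripped-down sketch rather than a replacement for `add_query_params`: the helper name `fill_template`, the `trips` table, and the parameter values are all made up for illustration.

```python
def fill_template(query, params):
    """Turn {{name}} placeholders into str.format fields, then substitute params."""
    return query.replace('{{', '{').replace('}}', '}').format(**params)


toy_query = "select * from trips where city_id = {{city_id}} and datestr >= '{{start_date}}'"
filled = fill_template(toy_query, {'city_id': 1, 'start_date': '2017-04-01'})
```

Plain string substitution like this is only reasonable for trusted, internally generated values; anything user-supplied would call for real query parameters.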

In [9]:
def run_query(start_date, end_date, client_uuid, city_id, excluded_categories):
    
    qr =  Client(user_email='petersen@uber.com', interactive=False)
    try:
        start = time.time()
        params = {'start_date':start_date, 
                  'end_date': end_date,
                  'client_uuid': client_uuid,
                  'city_id':city_id,
                  'excluded_categories':','.join(excluded_categories)
                 }
        result_dict = qr.execute('presto', add_query_params(client_query,params))
        query_time = time.time() - start
        query_df = pd.DataFrame(result_dict.fetchall())
        if len(query_df) == 0:
            raise Exception("no rows in query")
        return query_df
    except QueryRunnerException as e:
        print 'Query failed:  {}'.format(e)
        return None

In [10]:
client_df = run_query(begin_date, end_date, client_uuid, city_id, excluded_categories)

In [11]:
all_venue_uuids = client_df.venue_uuid.unique()

In [12]:
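# note: len(all_venue_uuids) counts distinct venues, so the figure printed is unique venues rather than trips
print '  there are {} venue trips'.format(len(all_venue_uuids))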
client_df[['datestr_local', 'time_block', 'place_name', 'primary_category', 'secondary_category', 'venue_uuid']]

  there are 27 venue trips


| | datestr_local | time_block | place_name | primary_category | secondary_category | venue_uuid |
|---|---|---|---|---|---|---|
| 0 | 2017-04-15 | PM | Gracias Madre | Vegan & Vegetarian | Mexican | 12148489-7af8-dae7-bff5-9d27271a73f4 |
| 1 | 2017-04-28 | PM | Shizen | Sushi | Vegan & Vegetarian | db66bb1b-4eeb-9ebb-ee35-3149a9cc431b |
| 2 | 2017-05-01 | PM | Yank Sing | Chinese | Asian | fdceec41-96db-75d1-5bc2-a2369e5fd553 |
| 3 | 2017-05-08 | PM | Foreign Cinema | New American | Theater | f64e383d-272b-46f1-54f4-82efd6ec0fb8 |
| 4 | 2017-05-25 | PM | Bluxome Street Winery | Entertainment | | b9bfa6d9-200e-2a93-76ee-7869f3c1c87d |
| 5 | 2017-05-26 | late_night | The Tipsy Pig | Gastropub | Bar | b0290671-9143-a943-2643-a14a61b281d5 |
| 6 | 2017-05-27 | late_night | 1760 | New American | Italian | 76b55222-2fbe-c021-a5b4-f20a10338e32 |
| 7 | 2017-05-27 | | Biergarten | Bar | German | 8449547b-2cac-921e-8356-e0f222db8ef9 |
| 8 | 2017-05-28 | AM | The Cavalier | English | | 01c35c12-f316-bdd4-5e67-21a182c4fe6f |
| 9 | 2017-06-10 | PM | Burma Love | Asian | | 53b9bb47-9288-378c-dd78-0b58eccc23a5 |


In [13]:
#
# query used to find other riders 
#

In [14]:
other_riders_query = """
    with vvid_table as (
        select distinct vehicle_view_id as vvid
        from dwh.fact_trip 
        where 1=1 
            and datestr = '{{start_date}}' 
            and status = 'completed' 
            and city_id = {{city_id}}
            and product_type_name in ('uberX', 'uberPOOL', 'Black', 'SELECT', 'uberXL', 'UberSUV')
    ),
    
    places_trips as (
        select djc.datestr 
            , djc.msg.clientUUID as client_uuid
            , djc.msg.jobUUID as venue_trip_uuid 
            , element_at(djc.msg.waypoints, CARDINALITY(djc.msg.waypoints)).location.referenceType as ref_type
            , element_at(djc.msg.waypoints, CARDINALITY(djc.msg.waypoints)).location.uuid as location_uuid 
            , (case when element_at(djc.msg.waypoints, CARDINALITY(djc.msg.waypoints)).location.reference is null 
                then element_at(djc.msg.waypoints, CARDINALITY(djc.msg.waypoints)).anchorGeolocation.location.id 
                else element_at(djc.msg.waypoints, CARDINALITY(djc.msg.waypoints)).location.reference end) as placeId 
        from hdrone.demand_job_completed djc
        where 1=1 
            and djc.datestr between '{{start_date}}' and '{{end_date}}' 
            and djc.msg.region.countryId = 1 
            and djc.msg.vehicleViewId IN (select * from vvid_table)
            and ((element_at(djc.msg.waypoints,CARDINALITY(djc.msg.waypoints)).location.reference is not null) 
                or (element_at(djc.msg.waypoints,CARDINALITY(djc.msg.waypoints)).anchorGeolocation.location.id is not null))
    ),

    venue_trips as (
        select pt.datestr 
            , pt.client_uuid 
            , pt.venue_trip_uuid 
            , pt.placeId 
            , msp.uuid as venue_uuid
            , msp.name as venue_name 
            , msp.uber_category_names 
            , msp.lat as venue_lat 
            , msp.lng as venue_lng
        from places_trips pt 
        join map_services.places msp 
            on pt.placeId = msp.uuid 
        where 1=1 
            and msp.uuid is not null 
            and CARDINALITY(ARRAY_INTERSECT(msp.uber_category_names, SPLIT('{{excluded_categories}}',','))) = 0
    ),

    venue_riders as (
        select distinct client_uuid as rider_uuid
        from venue_trips vt 
        where 1=1 
            and vt.venue_uuid IN ('{{venue_uuid_list}}')
    ),

    all_trips as (
        select t.datestr 
            , t.uuid as trip_uuid 
            , t.client_uuid 
            , t.request_timestamp_local as req_at_local
            , extract(HOUR FROM t.request_timestamp_local) as local_req_hour
            , (case 
                when extract(HOUR FROM t.request_timestamp_local) between 6 and 10 then 'AM' 
                when extract(HOUR FROM t.request_timestamp_local) between 16 and 20 then 'PM' 
                when extract(HOUR FROM t.request_timestamp_local) IN (0,21,22,23) then 'late_night'
                else NULL 
                end ) as time_block            
            , t.request_lat 
            , t.request_lng 
            , t.begintrip_timestamp_local as begin_at_local
            , t.begintrip_lat 
            , t.begintrip_lng 
            , t.dropoff_timestamp_local as end_at_local 
            , t.dropoff_lat 
            , t.dropoff_lng 
            , t.trip_distance_miles
            , t.product_type_name as product_name
            , t.surge_multiplier
            , ST_DISTANCE(ST_POINT(t.begintrip_lng, t.begintrip_lat),
                ST_POINT(t.dropoff_lng, t.dropoff_lat)) * 111195 * 0.000621371
                as haversine_miles
            , t.vehicle_view_id as vvid 
            , (case when vt.venue_trip_uuid is not null then 1 else 0 end) as is_venue_trip
            , t.city_id  
            , vt.venue_name as place_name 
            , vt.venue_uuid
            , vt.uber_category_names
            , vt.venue_lat
            , vt.venue_lng
        from dwh.fact_trip t 
        join venue_riders vr 
            on t.client_uuid = vr.rider_uuid 
        left outer join venue_trips vt 
            on t.uuid = vt.venue_trip_uuid 
        where 1=1 
            and t.datestr between '{{start_date}}' and '{{end_date}}' 
            and t.status = 'completed' 
            and t.city_id = 1 
    )

    select client_uuid 
        , trip_uuid
        , venue_uuid 
        , time_block 
        , CAST(venue_lat as double) as venue_lat
        , CAST(venue_lng as double) as venue_lng
        , place_name 
        , uber_category_names
        , element_at(uber_category_names,1) as primary_category        
        , (case when cardinality(uber_category_names) > 1 then element_at(uber_category_names,2) else NULL end) as secondary_category
        , count(*) as n_trips
    from all_trips 
    where 1=1 
        and is_venue_trip = 1
        and time_block is not NULL
    group by 1,2,3,4,5,6,7,8,9
    order by 1,2,3,4,5,6,7,8,9
"""

In [15]:
def run_other_trips_query(start_date, end_date, city_id, venue_uuids, client_uuid, excluded_categories):
    
    qr =  Client(user_email='petersen@uber.com', interactive=False)
    try:
        start = time.time()
        params = {'start_date':start_date, 
                  'end_date': end_date,
                  'city_id':city_id, 
                  'venue_uuid_list':venue_uuids,
                  'client_uuid':client_uuid,
                  'excluded_categories':','.join(excluded_categories).encode('utf-8').strip()
                 }
            
        jp = add_query_params(other_riders_query,params)
        jp = jp.replace('\'[\'','\'').replace('\']\'','\'')
        result_dict = qr.execute('presto', jp)
        query_time = time.time() - start
        query_df = pd.DataFrame(result_dict.fetchall())
        if len(query_df) == 0:
            raise Exception("no rows in query")
        return query_df
    except QueryRunnerException as e:
        print 'Query failed:  {}'.format(e)
        return None

In [16]:
# build the body of a quoted SQL IN list (uuid1','uuid2','uuid3) from the venues visited by the given user;
# the opening and closing quotes come from the query template
all_uuids_str = "','".join(all_venue_uuids)
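
For reference, this is how such a joined string slots into an `IN ('...')` clause like the one in `other_riders_query`. The venue IDs below are placeholders, not real venue uuids, and the clause is built with `str.format` purely for illustration.

```python
# placeholder IDs standing in for venue uuids
venue_ids = ['aaa', 'bbb', 'ccc']

# join with quote-comma-quote; the template supplies the outer quotes
in_list_body = "','".join(venue_ids)
clause = "vt.venue_uuid IN ('{}')".format(in_list_body)

# the longer replace-based construction yields the same string
chained = ",".join(venue_ids).replace(',', "','").replace("''", "'")
```

This relies on the IDs containing neither commas nor quotes, which holds for uuids but would not be safe for free-text values.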

In [17]:
# -----
#  the following call will return a dataframe with ALL trips taken by a rider with a common venue 
# -----

In [18]:
other_riders_df = run_other_trips_query(begin_date, 
                                        end_date, 
                                        city_id, 
                                        all_uuids_str, 
                                        client_uuid, 
                                        excluded_categories)

In [19]:
venue_trip_count = other_riders_df.venue_uuid.value_counts()

In [20]:
# extract only the top venues (TODO: exclude chains)

In [21]:
N = n_top_venues if n_top_venues and n_top_venues <= len(venue_trip_count) else len(venue_trip_count)

In [22]:
venue_trip_count = venue_trip_count[0:N]

In [23]:
# flag rows whose venue is among the top-N venues kept above
other_riders_df.loc[:,'is_venue_valid'] = other_riders_df.venue_uuid.isin(venue_trip_count.index)

In [24]:
other_riders_df = other_riders_df[other_riders_df.is_venue_valid==True]

In [25]:
rider_counts = other_riders_df.client_uuid.value_counts()

In [26]:
# keep riders with at least min(len(client_df), 5) rows among the valid venues
other_riders_df.loc[:,'is_qualified'] = other_riders_df.client_uuid.map(rider_counts) >= min(len(client_df),5)

In [27]:
other_riders_df = other_riders_df[other_riders_df.is_qualified==True]

In [28]:
other_riders_df

Unnamed: 0,client_uuid,n_trips,place_name,primary_category,secondary_category,time_block,trip_uuid,uber_category_names,venue_lat,venue_lng,venue_uuid,is_venue_valid,is_qualified
0,00021d3c-e1e6-4a54-bf1e-77f9333b1436,1,Le Colonial,Vietnamese,Entertainment,PM,206136ef-56aa-4761-b58f-9f9693bbcf37,VietnameseEntertainmentAsian,37.788203,-122.412404,7bcade73-5da6-7455-ba2d-e3394e47622b,True,True
1,00021d3c-e1e6-4a54-bf1e-77f9333b1436,1,Club Deluxe,Entertainment,Bar,late_night,2cfa2817-2bac-4d34-8e12-516c28d5db9d,EntertainmentBar,37.769790,-122.447129,b3786fe3-6927-506d-1aca-2b72a151a96d,True,True
2,00021d3c-e1e6-4a54-bf1e-77f9333b1436,1,ABV,Bar,Gastropub,late_night,348402d2-b180-4d46-b89a-593936348a00,BarGastropub,37.765035,-122.423566,e502be6d-9e7b-18b3-9955-a7ba6d92a7fb,True,True
3,00021d3c-e1e6-4a54-bf1e-77f9333b1436,1,Precita Park Caf,Cafe,American,AM,3df9c99c-2c2c-48df-a42a-b12c631e2303,CafeAmerican,37.747018,-122.410416,c1fe5435-237a-298c-22a2-d171f2de85f3,True,True
6,00021d3c-e1e6-4a54-bf1e-77f9333b1436,1,La Taqueria,Burrito,Taco,PM,a7c95cc0-6dd6-4c2b-9b90-544f481e95df,BurritoTacoMexican,37.750882,-122.418096,d4c9a494-44b6-0ab2-58db-508219114e30,True,True
7,00021d3c-e1e6-4a54-bf1e-77f9333b1436,1,Mission Bowling Club,Bowling Alley,Restaurant,late_night,bed3412a-013a-4d5d-a6c9-0302037e808e,Bowling AlleyRestaurantBar,37.763830,-122.416697,3181a9e5-7faa-aa17-7e30-fd1fc54d8ded,True,True
8,00021d3c-e1e6-4a54-bf1e-77f9333b1436,1,Limon Rotisserie,South American,Restaurant,PM,cf65ce20-22bb-429a-95e2-175c78200137,South AmericanRestaurantCuban,37.757047,-122.416510,1a7ba67e-e8b2-3986-7d9b-3c8bc0b86c9f,True,True
10,0002d143-3b32-45f2-b2be-2f41bd72cde7,1,Smuggler's Cove,Bar,,PM,034efcb7-bb9c-49ff-a9ac-b41cfa53ad84,Bar,37.779407,-122.423292,ae081100-2f84-bdfa-9547-3263e882d750,True,True
12,0002d143-3b32-45f2-b2be-2f41bd72cde7,1,Coin-Op Game Room,Restaurant,Bar,PM,0dbd1bba-81e3-46b5-9982-3b5ad7464bf1,RestaurantBarEntertainment,37.779193,-122.398095,9df98944-c3a8-56f7-2e57-97667e6927cf,True,True
21,0002d143-3b32-45f2-b2be-2f41bd72cde7,1,Northstar Cafe,Bar,,late_night,62c909eb-4a00-43e0-9acb-2402943c5b7b,Bar,37.799226,-122.410429,b2f714ca-496c-8e7a-f5f3-6c0771a0aeaa,True,True


In [29]:
# CREATE USER-LOCATION MATRIX
def create_user_location_matrix(venue_df, unique_users, unique_venues):

    client_dict = {}
    for i in np.arange(len(unique_users)):
        client_dict[unique_users[i]] = i
        
    venue_dict = {}
    for i in np.arange(len(unique_venues)):
        venue_dict[unique_venues[i]] = i  
        
    # total trips per (user, venue) pair; a pair can span several rows
    G = venue_df.groupby(['client_uuid','venue_uuid'])['n_trips']    
    R = pd.DataFrame(0, index=unique_users, columns=unique_venues)
    for g in G:
        i = client_dict[g[0][0]]
        j = venue_dict[g[0][1]]
        R.iloc[i,j] = g[1].sum()
        
    return R
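
The same aggregation can be sketched without pandas to show what each cell of the matrix holds. This is an illustrative stand-in, not the notebook's function: `build_visit_counts`, `to_dense`, and the toy trips are hypothetical.

```python
from collections import defaultdict


def build_visit_counts(trips):
    """Sum n_trips per (user, venue). trips is an iterable of (user, venue, n_trips)."""
    counts = defaultdict(lambda: defaultdict(int))
    for user, venue, n_trips in trips:
        counts[user][venue] += n_trips
    return counts


def to_dense(counts, users, venues):
    """Lay the counts out as a users x venues list of lists, with 0 for unvisited venues."""
    return [[counts.get(u, {}).get(v, 0) for v in venues] for u in users]


toy_trips = [('u1', 'a', 1), ('u1', 'a', 1), ('u1', 'b', 1),
             ('u2', 'b', 1), ('u2', 'c', 1)]
toy_counts = build_visit_counts(toy_trips)
toy_R = to_dense(toy_counts, ['u1', 'u2'], ['a', 'b', 'c'])
```

Here `u1` has two rows for venue `a`, so the cell holds 2, matching step 1's definition of the entry as a trip count.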

In [30]:
# CREATE ITEM SIMILARITY MATRIX
def create_item_similairity_matrix(unique_venues, R):
    
    S = pd.DataFrame(index=unique_venues, columns=unique_venues)
    for i in np.arange(len(S.columns)):
        for j in np.arange(i, len(S.columns)):
            S.iloc[i,j] = 1-cosine(R.iloc[:,i].tolist(), R.iloc[:,j].tolist())
            
    return S

In [31]:
# GENERATE SCORE
def get_curr_score(S, client_uuid, active_user_venues, n_trips, curr_venue):
    
    num   = 0.0
    denom = 0.0

    # get similarity score between the current venue and all that the user visited
    for ix in xrange(len(active_user_venues)):
        visited_venue = active_user_venues[ix]
        curr_similiarity = S.loc[visited_venue][curr_venue] 
        if np.isnan(curr_similiarity):
            curr_similiarity = S.loc[curr_venue][visited_venue]
        assert np.isnan(curr_similiarity) == False
        #denom += np.abs(curr_similiarity)
        num += curr_similiarity * n_trips[ix]
        
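    # denominator: sum of all stored similarities for curr_venue, read from whichever of
    # its row or column in the upper-triangular S has more non-NaN entries
    # (this differs from the step 3 formula, which sums |s(j,k)| over visited venues only)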
    valid_row = [x for x in S.loc[curr_venue,:] if np.isnan(x)==False]
    valid_col = [x for x in S.loc[:,curr_venue] if np.isnan(x)==False]
    denom = np.sum(valid_row) if len(valid_row) >= len(valid_col) else np.sum(valid_col)
    
    return np.divide(num,denom) if np.abs(denom) > 0.0001 else 0.0 

In [32]:
# CREATE RANKINGS OF NEW VENUES
def create_new_venue_rankings(R, S, active_user_df):
    venue_rank_dict = {}
    active_user_row = R.loc[client_uuid,:]
    active_user_venues = active_user_df.venue_uuid.tolist()
    n_visits = active_user_df.n_trips.tolist()
    for curr_venue in active_user_row.index:
        # only score venues the active user has not visited
        if active_user_row[curr_venue] == 0:
            curr_score = get_curr_score(S, client_uuid, active_user_venues, n_visits, curr_venue)
            venue_rank_dict[curr_venue] = curr_score 
    
    return venue_rank_dict

In [33]:
def get_recommended_df(ranked_venues, orig_df, active_user_df, venue_counts):
    
    # ranked_venues: list of (venue_uuid, score) pairs sorted by descending score
    top_venue_dict = {} # key: venue_uuid, value: rank
    curr_rank = 1
    for v in ranked_venues:
        if np.abs(v[-1]) > 0.0001:
            top_venue_dict[v[0]] = int(curr_rank)
            curr_rank += 1
        else:
            break
    
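    # fewer than 10 scored venues: back-fill with the most visited venues
    if len(top_venue_dict) < 10:
        user_venues = active_user_df.venue_uuid.tolist()
        for k in venue_counts.index:
            # skip venues the user has visited or that are already ranked
            if k not in user_venues and k not in top_venue_dict:
                top_venue_dict[k] = int(curr_rank)
                curr_rank += 1
            if len(top_venue_dict) == 10:
                break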
                
    df = orig_df.copy()
    df.loc[:,'venue_rank'] = map(lambda x: top_venue_dict.get(x,None), df.venue_uuid.tolist())
    df = df[df.venue_rank <= 10.0]
    df = df[['place_name','primary_category','venue_lat','venue_lng','venue_rank']]
    df.drop_duplicates(inplace=True)
    df.sort_values('venue_rank', inplace=True)
    df.set_index(np.arange(1,len(df)+1), drop=True, inplace=True)
    df.drop('venue_rank', axis=1, inplace=True)
    return df
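
The selection logic (step 4) can be illustrated in isolation: take the scored venues in descending order, then back-fill from a popularity ordering until `q` venues are chosen. This is a sketch under assumptions; `top_q_with_fallback`, its parameters, and the toy inputs are hypothetical and are not the notebook's API.

```python
def top_q_with_fallback(scores, visited, popularity, q=10, eps=1e-4):
    """Pick up to q unvisited venues: scored ones first, then the most popular ones.

    scores:     {venue: predicted score}
    visited:    set of venues the user has already visited in this time block
    popularity: venues ordered from most to least visited overall
    """
    by_score = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    picks = [v for v, s in by_score if v not in visited and abs(s) > eps][:q]
    for venue in popularity:
        if len(picks) >= q:
            break
        # skip visited venues and anything already picked by score
        if venue not in visited and venue not in picks:
            picks.append(venue)
    return picks


toy_picks = top_q_with_fallback(
    scores={'a': 0.9, 'b': 0.0, 'c': 0.4},
    visited={'d'},
    popularity=['d', 'c', 'b', 'e'],
    q=3,
)
```

In the toy call, `a` and `c` are chosen by score, `b` has a zero score and only enters through the popularity back-fill, and `d` is skipped because the user has already visited it.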
    

In [34]:
# MAIN METHOD TO GENERATE RECOMMENDATIONS
def generate_recommendations(orig_df, time_block=None, write_csv=False):
    assert time_block in (None, 'AM', 'PM', 'late_night')
    
    venue_df = orig_df.copy()
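    # NOTE: this keeps a single row per place_name across all riders, so only the
    # first rider seen for each venue contributes to R below
    venue_df.drop_duplicates('place_name', keep='first', inplace=True)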
    active_user_df = venue_df[venue_df.client_uuid==client_uuid]    
    
    if time_block:
        venue_df = venue_df[venue_df.time_block==time_block]
        active_user_df = active_user_df[active_user_df.time_block==time_block]
        
    total_venue_count = venue_df.venue_uuid.value_counts()
    
    unique_users  = venue_df.client_uuid.unique()
    unique_venues = venue_df.venue_uuid.unique() 

    R = create_user_location_matrix(venue_df, unique_users, unique_venues)
    S = create_item_similairity_matrix(unique_venues, R)
    if client_uuid in R.index:
        rankings = create_new_venue_rankings(R, S, active_user_df)    
        rankings = sorted(rankings.items(), key=operator.itemgetter(1), reverse=True)
    else:
        rankings = []
        
    recommended_df = get_recommended_df(rankings, orig_df, active_user_df, total_venue_count)
    
    if write_csv:
        tm = time_block if time_block else 'all'
        out_path = output_recs_path + 'recs_{}_{}.csv'.format(rider,tm)
        recommended_df.to_csv(out_path)
    
    return recommended_df
        

In [35]:
# PLOT
def plot_venues(recs_df):
    mean_loc = tuple([np.mean(recs_df.venue_lat.tolist()),
                      np.mean(recs_df.venue_lng.tolist())])
    df_map  = folium.Map(location=mean_loc,
                          tiles='Mapbox',
                          zoom_start=12,
                          control_scale=True,
                          API_key='uber-mapper.lce5708h')
    
    color_dict = {'Restaurant': UBER_RED, 
                  'Bar':UBER_BLUE, 
                  'Cafe': UBER_GREEN, 
                  'Entertainment': UBER_YELLOW,
                  'Music Venue' : UBER_YELLOW,
                  'Night Club' : UBER_YELLOW,
                 }
    
    # pair each venue's latitude with its longitude
    locs = list(zip(recs_df.venue_lat.tolist(), recs_df.venue_lng.tolist()))
    places = recs_df.place_name.tolist()
    categs = recs_df.primary_category.tolist()
    for k in xrange(len(recs_df)):
        folium.CircleMarker(locs[k], 
                            radius=175, 
                            color=None, 
                            fill_color=color_dict.get(categs[k], 'black'), 
                            fill_opacity=0.7, 
                            popup=places[k].decode('utf_8')).add_to(df_map)
    
    display(df_map)

## morning recommendations

In [36]:
rankings_AM   = generate_recommendations(other_riders_df, time_block='AM', write_csv=True)

In [37]:
rankings_AM

| rank | place_name | primary_category | venue_lat | venue_lng |
|---|---|---|---|---|
| 1 | Crissy Field | Park | 37.803932 | -122.464902 |
| 2 | The Mill | Cafe | 37.776458 | -122.437876 |
| 3 | Pier 39 | Travel & Transportation | 37.809754 | -122.410323 |
| 4 | Baker Beach | Attraction | 37.792775 | -122.483896 |
| 5 | Causwells | American | 37.800314 | -122.442041 |
| 6 | Absinthe Brasserie & Bar | French | 37.777095 | -122.422751 |
| 7 | Flywheel Sports - Market Street | Athletics & Sports | 37.793065 | -122.394311 |
| 8 | Fort Mason Center for Arts & Culture | Museum | 37.806455 | -122.432128 |
| 9 | The Interval at Long Now | Bar | 37.80655 | -122.432157 |
| 10 | Yank Sing | Chinese | 37.789806 | -122.399391 |


In [38]:
plot_venues(rankings_AM)

## evening recommendations

In [39]:
rankings_PM   = generate_recommendations(other_riders_df, time_block='PM', write_csv=True)

In [40]:
rankings_PM

| rank | place_name | primary_category | venue_lat | venue_lng |
|---|---|---|---|---|
| 1 | Taverna Aventine | Italian | 37.795727 | -122.402871 |
| 2 | Bar Crudo | Seafood | 37.775668 | -122.438245 |
| 3 | Li Po Cocktail Lounge | Bar | 37.79541 | -122.406418 |
| 4 | Tacko | Taco | 37.798253 | -122.435953 |
| 5 | Foreign Cinema | New American | 37.756419 | -122.419213 |
| 6 | Elephant Sushi | Sushi | 37.798662 | -122.418731 |
| 7 | Union Square | Attraction | 37.787942 | -122.40752 |
| 8 | Velvet Cantina | Mexican | 37.753665 | -122.41957 |
| 9 | The Social Study | Entertainment | 37.784089 | -122.432501 |
| 10 | Four Barrel Coffee | Cafe | 37.767025 | -122.421779 |


In [41]:
plot_venues(rankings_PM)

## late night recommendations

In [42]:
rankings_late = generate_recommendations(other_riders_df, time_block='late_night',write_csv=True)

In [43]:
rankings_late

| rank | place_name | primary_category | venue_lat | venue_lng |
|---|---|---|---|---|
| 1 | Tupelo | Southern Or Soul | 37.79926 | -122.4075 |
| 2 | The Tipsy Pig | Gastropub | 37.800207 | -122.440077 |
| 3 | Trick Dog | Bar | 37.759353 | -122.411176 |
| 4 | MatrixFillmore | Night Club | 37.798696 | -122.435622 |
| 5 | John Colins | Bar | 37.786918 | -122.4004 |
| 6 | Tosca Cafe | Italian | 37.797587 | -122.405895 |
| 7 | Infusion Lounge | Night Club | 37.785585 | -122.4084 |
| 8 | Bob's Donuts | Donuts | 37.791853 | -122.421151 |
| 9 | Lucky Strike | Bowling Alley | 37.778152 | -122.392325 |
| 10 | Bruno's | Bar | 37.758962 | -122.418801 |


In [None]:
plot_venues(rankings_late)

## all recommendations

In [None]:
rankings_all = generate_recommendations(other_riders_df, write_csv=True)

In [None]:
rankings_all

In [None]:
plot_venues(rankings_all)