# Question

Setting

Beeradvocate is many things:

- a website built by beer enthusiasts for beer enthusiasts;
- a forum;
- a network;
- a marketplace;
- a cognitive device that assists users in navigating the beer market;
- a social valuation device;
- a unique source of data to conduct market research.


Problem

The Beeradvocate team wants to put the platform in a stronger economic and financial position.

To do that, it is key to increase traffic (advertisers will pay more).
The company is evaluating two alternatives that are not mutually exclusive: a) pushing occasional users to engage with the platform on a more regular basis; b) pushing heavy users to explore portions of the platform they are less familiar with.

Instructions

Help the Beeradvocate team reach its goal by building a recommender that provides users with consumption/review suggestions.

In [8]:
import numpy as np
from datetime import datetime
import pandas as pd
import networkx as nx
import csv
import datetime as dt
from igraph import Graph as IGraph
from igraph import *
import random

# Import datasets

In [9]:
usr_usr = pd.read_csv("Data/user_user_graph.csv", header = None)
pop = pd.read_csv("Data/popular_beer.csv")
usr_int = pd.read_csv("Data/all_user_interactions.csv")
beer_att = pd.read_csv("Data/gb_beer_attributes.csv")
beers = pd.read_csv("Data/gb_beers.csv")

In [10]:
# store beerID to beerName mapping
beer_list = beers[['beer','name']]

In [11]:
beer_list = beer_list.drop_duplicates(subset='beer')

# Community Detection

In [12]:
#%% community detection with the Walktrap algorithm of Pons & Latapy
# build the weighted edge list; each line holds "source target weight"
edges = []

with open('Data/user_user_weighted_graph.csv', 'r') as ifile:
    for row in csv.reader(ifile):
        u, v, weight = row[0].split()
        edges.append((u, v, float(weight)))
        
g = IGraph.TupleList(edges, directed=False, vertex_name_attr='name', edge_attrs=None, weights=True)

In [13]:
# function that shortlists beers by the user's community
def beer_by_com (g, usr_id, edges, beer_rev):
    def detect_community (edge_list):
        # the graph g was already built from the edge list; reuse its edge weights
        weights = g.es["weight"]

        # find clusters with the "walktrap" method
        clusters = g.community_walktrap(weights=weights).as_clustering()

        # store community to user_id mapping
        community = {}
        for vertex, com in zip(g.vs, clusters.membership):
            community.setdefault(com, []).append(vertex["name"])

        return community
    
    # function to return the first key whose value (a list) contains the given value
    def find_key(input_dict, value):
        return next((k for k, v in input_dict.items() if value in v), None)
    
    community = detect_community(edges)
    
    #%% find top beers in each community

    # extract username to userid mapping
    beer_rev.loc[:, 'usr_name'] = beer_rev['usr'].str.split(
            '.').str.get(0)
    beer_rev.loc[:, 'usr_id'] = beer_rev['usr'].str.split(
            '.').str.get(1).str.strip('.')
    
    # find user's community membership 
    com_list = []
    for index, row in beer_rev.iterrows():
        com_list.append(find_key(community, row['usr_id']))
    beer_rev.loc[:, 'com'] = com_list

    # keep reviews from users that exist in the usr-interaction network
    beer_rev.dropna(subset=['com'], inplace=True)

    # find average score for each beer by community
    gr2 = beer_rev.groupby(['beer','com'], as_index=False)
    beer_com = gr2['bascore_norm'].mean()

    # extract top beers from each community
    community_top = {}
    for com in community:
        subset = beer_com.loc[beer_com['com'] == com]
        # keep beers whose mean score is at least 75% of the community's best mean score
        cutoff = subset['bascore_norm'].max() * 0.75
        top_beers = subset.loc[subset['bascore_norm'] >= cutoff, 'beer'].tolist()
        community_top[com] = top_beers
        
    # find the community user belongs to
    membership = find_key(community,usr_id)
    return community_top.get(membership)
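
The selection rule inside `beer_by_com` keeps, for each community, the beers whose mean score is at least 75% of that community's best mean score. The following is a minimal, dependency-free sketch of that rule only. It assumes reviews arrive as `(beer, user, score)` tuples and that community membership is a plain `user -> community` dictionary; the names `top_beers_by_community` and `ratio` are hypothetical and do not appear in the notebook.

```python
from collections import defaultdict

def top_beers_by_community(reviews, membership, ratio=0.75):
    """Return {community: [beers whose mean score >= ratio * best mean score]}."""
    # Collect scores per (community, beer); skip users outside the network.
    scores = defaultdict(list)
    for beer, user, score in reviews:
        com = membership.get(user)
        if com is None:
            continue
        scores[(com, beer)].append(score)

    # Mean score of each beer inside each community.
    means = defaultdict(dict)
    for (com, beer), vals in scores.items():
        means[com][beer] = sum(vals) / len(vals)

    # Keep beers at or above the community-specific cutoff.
    top = {}
    for com, beer_means in means.items():
        cutoff = ratio * max(beer_means.values())
        top[com] = sorted(b for b, m in beer_means.items() if m >= cutoff)
    return top

# Toy data (made up for illustration only).
reviews = [
    ("stout", "u1", 4.0), ("stout", "u2", 5.0),
    ("lager", "u1", 2.0),
    ("bitter", "u3", 3.0), ("porter", "u3", 4.0),
    ("ipa", "u9", 5.0),  # u9 is not in the network, so this review is ignored
]
membership = {"u1": 0, "u2": 0, "u3": 1}
result = top_beers_by_community(reviews, membership)
print(result)
```

Note that this cutoff is relative to the maximum, not a quantile: a community with one outstanding beer and many mediocre ones returns a very short list.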

# Data Cleaning Process for User Profile

In [14]:
# import and reshape dataset to extract user id
df = usr_int
df.loc[:, 'usr'] = df['usr'].str.split('.').str.get(1).str.strip('/')

# drop housekeeping column
df.drop('call', axis=1, inplace=True) 
df.drop('Unnamed: 0', axis=1, inplace=True)

# calculate the total activity of each user
# total activity proxied by number of posts/threads
df.loc[:,'count']=1

# construct a new dataframe containing each unique user and the corresponding activity
# (built directly from the groupby result so users and counts stay aligned)
usr_activity = df.groupby('usr', as_index=False)['count'].sum()
usr_activity = usr_activity.rename(columns={'count': 'activity'})

# import another dataset and extract user id
df2 = pd.read_csv('Data/gb_beer_reviewers_attributes.csv')
df2.loc[:, 'usr'] = df2['usr'].str.split('.').str.get(1)

# extract unique users and their join time
usr2 = df2['usr'].drop_duplicates()
jointime = df2.query("var == 'joined'")

# clean the jointime into universal string format
jointime = jointime.reset_index(drop = True)
# drop relative dates (weekday names, 'Yesterday'); copy to avoid SettingWithCopyWarning
relative_dates = ['Yesterday','Wednesday','Saturday','Monday','Tuesday','Friday','Thursday']
time = jointime.loc[~jointime['value'].isin(relative_dates)].copy()
time.loc[71, 'value'] = 'Sep 2, 2005'

# transform the string format into timestamp 
time.loc[:,'date_time'] = pd.to_datetime(time['value'], format='%b %d, %Y')

# calculate the tenure in hours based on jointime
time.loc[:,'present'] = pd.Timestamp('2018-12-10')
time.loc[:,'tenure'] = (time.present-time.date_time).dt.total_seconds() / 3600
time = time.drop_duplicates()

# merge the tenure and activity of users
usr_level = pd.merge(time,usr_activity,on='usr')

In [15]:
def profile(usr_id,usr_level):
    # activity level is total activity divided by tenure
    usr_level.loc[:,'level'] = usr_level.activity/usr_level.tenure
    # users whose activity level is within the top 1/3 are 'heavy users' (1), otherwise 'occasional users' (0)
    threshold  = np.percentile(usr_level['level'], 66.67)
    usr_level['category'] = (usr_level['level'] >= threshold).astype(int)
    category = usr_level[usr_level.usr == usr_id].category.item()
    
    return category
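
The heavy/occasional split in `profile` can be illustrated without pandas. The sketch below assumes activity counts and tenure (in hours) are held in plain dictionaries keyed by user id; `percentile` and `classify_users` are hypothetical helper names, and the percentile uses linear interpolation, which is also NumPy's default method.

```python
def percentile(values, q):
    """Linear-interpolation percentile for q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * frac

def classify_users(activity, tenure_hours, q=66.67):
    """Label each user 1 (heavy) or 0 (occasional) by activity per hour of tenure."""
    levels = {u: activity[u] / tenure_hours[u] for u in activity}
    threshold = percentile(levels.values(), q)
    return {u: int(level >= threshold) for u, level in levels.items()}

# Toy data (made up): four users with the same tenure and different activity.
activity = {"a": 10, "b": 40, "c": 90, "d": 5}
tenure_hours = {"a": 100, "b": 100, "c": 100, "d": 100}
labels = classify_users(activity, tenure_hours)
print(labels)
```

With only four users, the 66.67th percentile falls just above the third-ranked level, so only the most active user is labelled heavy. On a realistic sample the share of heavy users approaches one third.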

# Neighbour's Neighbour

In [16]:
def NN(tar_usr,user_user_graph,popular,gb_beer_reviews):

    #Import data
    df = user_user_graph
    popular = popular
    beer_usr_rtdate = gb_beer_reviews
    
    #Data cleaning extract 'src' and 'tgt'
    df.loc[:, 'src'] = df[0].str.split(' ').str.get(0)
    df.loc[:, 'tgt'] = df[0].str.split(' ').str.get(1)
    df.drop(0, axis=1, inplace=True)

    #Create a network 
    G = nx.from_pandas_edgelist(df, source='src', target='tgt')

    #Find the neighbors of the input user
    usr_N = G.neighbors(tar_usr)

    #Find the neighbors' neighbors, accumulated over all neighbors
    usr_NN = set()
    for i in usr_N:
        usr_NN.update(G.neighbors(i))

    #Put neighbor's neighbor into a dataframe
    usr_NsN = pd.DataFrame(sorted(usr_NN), columns=['usr'])

    #Extract the user ID
    beer_usr_rtdate.loc[:, 'usr'] = beer_usr_rtdate['usr'].str.split('.').str.get(1)

    #merge user and popular on beer ID
    usr_NN_beer = usr_NsN.merge(beer_usr_rtdate,on='usr')

    usr_NN_beer = usr_NN_beer.merge(popular,on='beer')

    usr_NN_beer = usr_NN_beer.drop_duplicates('beer')

    #Find top_3 beer according to popularity
    usr_NN_beer = usr_NN_beer.sort_values(by=['popular'], ascending= False)

    #select the top 3 beers for neighbor's neighbor
    top3 = usr_NN_beer.iloc[0:3, :] 

    #crop the data without the top 3 beers to avoid duplication
    df_no_top3 = usr_NN_beer.iloc[3:,:]

    #select 3 beers at random
    df_no_top3 = df_no_top3.sample(n = 3)

    #the top 6 beers with 3 top popular beers and 3 randomly selected beers
    top6 = list(top3['beer'])+ list(df_no_top3['beer'])

    return top6
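
The "neighbour's neighbour" lookup at the heart of `NN` can be sketched with a plain adjacency dictionary. This is an illustrative version, not the notebook's implementation: the function name `neighbours_of_neighbours` is hypothetical, and it additionally excludes the target user and their direct neighbours from the result, which is an assumption the notebook code does not make.

```python
def neighbours_of_neighbours(edges, target):
    """Return users exactly two steps away from target in an undirected graph."""
    # Build an undirected adjacency map from (u, v) pairs.
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    direct = adj.get(target, set())

    # Union of the neighbourhoods of every direct neighbour.
    second = set()
    for n in direct:
        second |= adj[n]

    # Assumption: drop the target and its direct neighbours.
    return second - direct - {target}

# Toy graph (made up): a-b, a-c, b-d, c-e, d-f
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "e"), ("d", "f")]
two_step = neighbours_of_neighbours(edges, "a")
print(sorted(two_step))
```

Accumulating over all direct neighbours matters: keeping only one neighbour's contacts would make the candidate pool depend on iteration order.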

# Hubs

In [17]:
def hub(beer_usr_rtdate, g):
    
    # Prepare the gb_beer_reviews data
    beer_usr_rtdate.loc[:, 'usr'] = beer_usr_rtdate['usr'].str.split('.').str.get(1)
    
    #%% Find hubs
    
    # Calculate top 100 betweenness centrality of user network
    btvs = []
    for p in zip(g.vs, g.betweenness()):
        btvs.append({"name": p[0]["name"], "bt": p[1]})
    user_user_weighted_bc = sorted(btvs, key=lambda k: k['bt'], reverse=True)[:100]
    
    # Get the hub list
    hub_nodes_list = []
    for i in range(len(user_user_weighted_bc)):
        hub_nodes_list.append(user_user_weighted_bc[i].get('name'))
    
    #%% The beers reviewed and scored by the hubs
    
    beer_rtdate = beer_usr_rtdate
    
    # slice beers reviewed in last 3 months
    beer_rtdate['date'] = pd.to_datetime(beer_rtdate['date'], errors='coerce')
    beer_rtdate['year'] = beer_rtdate['date'].dt.year
    beer_rtdate['month'] = beer_rtdate['date'].dt.month
    beer_rtdate = beer_rtdate[beer_rtdate.year == 2018]
    beer_rtdate = beer_rtdate[beer_rtdate.month >= 9]
    
    # slice beers scored >= 4
    beer_rtdate = beer_rtdate[beer_rtdate.bascore_norm >= 4]
    
    # Prepare beer data for hubs
    hub_nodes_df = pd.DataFrame(hub_nodes_list, columns=['hub'])
    hub_nodes_df.rename(columns={"hub": "usr"}, inplace=True)
    
    hub_beers_df = hub_nodes_df.merge(beer_rtdate,on='usr')
    hub_beers_df = hub_beers_df.drop_duplicates('beer')
    
    # Randomly selected from hub beers
    hub_beers_sample = hub_beers_df.sample(n = 20)
    
    # Generate the hub beers list
    hub_beers_list = hub_beers_sample['beer'].tolist()
    
    return hub_beers_list
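
`hub` ranks users by betweenness centrality and keeps the top 100. For readers without igraph at hand, the sketch below computes unweighted betweenness on a small undirected graph with Brandes' algorithm and picks the top-`k` nodes. The helper names `betweenness` and `top_hubs` are hypothetical. The scores follow the usual undirected convention of counting each pair of endpoints once; igraph's `Graph.betweenness()` is not called here.

```python
from collections import deque

def betweenness(adj):
    """Unweighted betweenness centrality of an undirected graph (Brandes)."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        stack = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Back-propagate dependencies from the farthest nodes.
        delta = {v: 0.0 for v in adj}
        while stack:
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each unordered pair was counted from both endpoints.
    return {v: b / 2 for v, b in bc.items()}

def top_hubs(adj, k):
    """Return the k nodes with the highest betweenness."""
    scores = betweenness(adj)
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Toy star graph (made up): 'h' connects three otherwise unconnected users.
adj = {"h": ["a", "b", "c"], "a": ["h"], "b": ["h"], "c": ["h"]}
scores = betweenness(adj)
hubs = top_hubs(adj, 1)
print(scores, hubs)
```

In the star graph every shortest path between two leaves passes through `h`, so `h` is the only node with non-zero betweenness.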

In [18]:
def recommend (usr_id, usr_level, g, edges, usr_usr, pop, beer_list):
    
    com_rev = pd.read_csv("Data/gb_beer_reviews.csv")
    hub_rev = pd.read_csv("Data/gb_beer_reviews.csv")
    nn_rev = pd.read_csv("Data/gb_beer_reviews.csv")
    
    usr_type = profile(usr_id,usr_level)
    beers = []
    hub_rec = random.sample(hub(hub_rev, g),5)
    for x in hub_rec: 
        beers.append(beer_list.loc[beer_list['beer'] == x, 'name'].iloc[0])
    
    # if user selected is a 'heavy' user, use community algorithm
    if usr_type == 1:
        com = random.sample(beer_by_com(g, usr_id, edges, com_rev),5)
        for x in com: 
            beers.append(beer_list.loc[beer_list['beer'] == x, 'name'].iloc[0])
        
    # if user selected is an 'occasional' user, use alter's alter algorithm
    else:
        alter = random.sample(NN(usr_id,usr_usr,pop,nn_rev),5)
        for x in alter: 
            beers.append(beer_list.loc[beer_list['beer'] == x, 'name'].iloc[0])
    return beers
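
`recommend` draws five beers from each candidate list with `random.sample`, which raises `ValueError` when a list holds fewer than five items and `TypeError` when the list is `None` (for example, when a user has no community). One possible guard is sketched below; `sample_up_to` and its `seed` parameter are hypothetical names, not part of the notebook.

```python
import random

def sample_up_to(items, k, seed=None):
    """Sample at most k distinct items; tolerate short or missing lists."""
    pool = list(items or [])          # None or empty input becomes []
    rng = random.Random(seed)         # local generator, optionally reproducible
    return rng.sample(pool, min(k, len(pool)))

# Toy inputs (made up for illustration).
short = sample_up_to(["Pintle", "Hazy Jane", "Railway Porter"], 5, seed=0)
missing = sample_up_to(None, 5)
full = sample_up_to(range(10), 5, seed=1)
print(short, missing, full)
```

A caller could then replace `random.sample(candidates, 5)` with `sample_up_to(candidates, 5)` and accept a shorter recommendation list for sparsely connected users.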

In [19]:
usr_id = '988388'

In [21]:
recommender = recommend(usr_id, usr_level, g, edges, usr_usr, pop, beer_list)

In [22]:
recommender

['Fourpure / Devils Peak - Coastline',
 'Pintle',
 'What the Water Wanted',
 'Hedonic Escalation',
 'Hazy Jane',
 'Railway Porter',
 'Edinburgh Tattoo Strong Ale',
 'DDH IPA Citra BBC',
 'Badger Best Bitter',
 'Stranded Bunny Porter']

# Lecturer's comments

Dear Group 3 members,

let me first recall the assessment criteria that apply to the SMM635 final course project: i) appropriate use of notions and frameworks discussed in class; ii) effectiveness of the proposed answer or solution; iii) originality/creativity of the proposed answer or solution; iv) organization and clarity of submitted materials. All criteria carry equal weight in terms of marks.

I am very positively impressed by the quality of your submission. Great job!

What works:

- the supporting documentation is clear and well-organized. Also, I like your choice to keep the code and the documentation for your project in the same document, which is very useful when it comes to sharing and cooperating on Python projects

- your recommender properly leverages a number of network aspects: some of these are 'micro' elements (individual nodes and their positions), some others concern the 'ego-network' (an ego's neighbors or neighbors' neighbors), or the internal organization of the network (i.e., the community structure). Even more important, you support the choice of each individual network aspect with sound theoretical arguments. Well done.

- I like the idea of providing the recommender in different flavors, according to the status of the target user
  
What can be improved:

- you do not show how and to what extent the set of recommendations changes as you expand on (different) network elements. A systematic comparison would have made this project outstanding.
  
Taking all these points into consideration, the mark of your project is 74 (bonus points from the presentation session are included).

Best,

Simone