# Cold Start Problem

One of the engineering concerns we faced during the project was the cold start problem. It arises when we try to recommend beers to users about whom we do not have enough information to make an accurate prediction, including new users who are using the application for the first time. We therefore designed an algorithm that takes over recommending beers to users with few or no reviews. The switch from the cold start algorithm to the main recommendation algorithm is made once the user has a certain number of reviews.

In [1]:
# import the libraries used in this notebook
import pandas as pd
import numpy as np
import operator
import os
import gzip
import re

In [2]:
with gzip.open('../Beeradvocate.txt.gz', 'r') as f:
  rb_file = f.readlines()


data = []
row_out = []

for i in rb_file:
    row = i.decode('utf-8', errors = 'replace')
    #print(row)
    if row == '\n':
      data.append(row_out)
      row_out = []
      continue
    cat, field = row.split(":", 1)
    # strip trailing whitespace and the newline (the leading space after ':' is kept)
    field = field.rstrip()
    row_out.append(field)

In [3]:
data = pd.DataFrame(data)

data.columns = ['beer_name', 'beer_beerId', 'beer_brewer', 'beer_ABV', 'beer_style', 
                'review_appearance', 'review_aroma', 'review_palate', 'review_taste', 
                'review_overall', 'review_time', 'review_profileName', 'review_text']

# keep 5 columns: beer name, user name, overall score, review time, beer style
data2 = data[['beer_name', 'review_profileName', 'review_overall', 'review_time', 'beer_style']]

m = 33382 # number of users
n = 56855 # Number of items
# remove NA
data2 = data2[pd.notnull(data2.beer_name)]
data2 = data2[pd.notnull(data2.review_profileName)]
data2 = data2[pd.notnull(data2.review_overall)]
print(data2.shape)


(1586614, 5)


In [4]:
# keep the top m most frequent users (m = 33382, set above)
user = data2.review_profileName.value_counts()
user_list = user.keys()[:m].tolist()

# keep the top n most reviewed beers (n = 56855, set above)
beer = data2.beer_name.value_counts()[:n]
beer_list = beer.keys()[:n].tolist()

# keep (beer&user) pair in (user_list) and (beer_list)
subdata = data2[data2.beer_name.isin(beer_list)]
subdata = subdata[subdata.review_profileName.isin(user_list)]

# sort by user names
subdata = subdata.sort_values(by=['review_profileName','beer_name','review_time','beer_style'])

print(subdata.shape)

(1586606, 5)


## Pre-calculated values during downtime of the recommendation system or once a day (Batch) 

The algorithm below computes the top 10 beers for each beer type based on popularity. A beer is considered only if it has been reviewed more than 50 times within its beer type. Its popularity score is the average of all the overall review scores it has received.

In addition, an overall top 10 list is computed using the same popularity score.
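
The same batch computation can be sketched in plain Python on a toy data set. This is only an illustrative sketch, not the notebook's implementation: the function name `popularity_tables`, its parameters and the toy reviews are assumptions made for the example, and the review threshold is lowered to 1 so that the small sample produces output.

```python
from collections import defaultdict


def popularity_tables(reviews, min_reviews=50, top_k=10):
    """Hypothetical helper. reviews: iterable of (beer_style, beer_name, score)."""
    # Collect every score given to each (style, beer) pair.
    scores = defaultdict(list)
    for style, beer, score in reviews:
        scores[(style, beer)].append(float(score))

    per_style = defaultdict(list)
    overall = []
    for (style, beer), values in scores.items():
        # Only beers reviewed more than `min_reviews` times are scored.
        if len(values) > min_reviews:
            mean = sum(values) / len(values)
            per_style[style].append((beer, mean))
            overall.append((beer, mean))

    def by_score(pair):
        return pair[1]

    # Rank by average score, best first, and keep the top `top_k`.
    top_per_style = {
        style: sorted(beers, key=by_score, reverse=True)[:top_k]
        for style, beers in per_style.items()
    }
    top_overall = sorted(overall, key=by_score, reverse=True)[:top_k]
    return top_per_style, top_overall


# Made-up reviews, for illustration only.
toy_reviews = [
    ("IPA", "A", "4.0"), ("IPA", "A", "5.0"),
    ("IPA", "B", "3.0"), ("IPA", "B", "4.0"),
    ("Stout", "C", "5.0"), ("Stout", "C", "5.0"),
    ("Stout", "D", "2.0"),
]
top_per_style, top_overall = popularity_tables(toy_reviews, min_reviews=1, top_k=2)
print(top_per_style)
print(top_overall)
```

Beer D has a single review, so it does not pass the threshold and never appears in either list. Unlike the notebook cell, this sketch leaves out beer types that have no qualifying beer instead of storing an empty list for them.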


In [5]:

Beer_styles = list(set(subdata['beer_style']))
Popular = {}
popdict = {}
for style in Beer_styles:
    new_data = subdata[subdata['beer_style'] == style]
    beers = list(set(new_data['beer_name']))
    tempdict = {}
    for beer in beers:
        beerdata = new_data[new_data['beer_name'] == beer]
        if len(beerdata) > 50:
            val = (pd.to_numeric(beerdata['review_overall']).sum())/len(beerdata)
            tempdict[beer] = val
            popdict[beer] = val
    newA = list(sorted(tempdict.items(), key=operator.itemgetter(1), reverse=True)[:10])
    Popular[style] = newA

Most_popular = list(sorted(popdict.items(), key=operator.itemgetter(1), reverse=True)[:10])

   


In [6]:
Most_popular

[(" Armand'4 Oude Geuze Lente (Spring)", 4.730769230769231),
 (' Hoppy Birthday', 4.684615384615385),
 (' Geuze Cuvée J&J (Joost En Jessie) Blauw (Blue)', 4.633802816901408),
 (' Citra DIPA', 4.630952380952381),
 (' Cantillon Blåbær Lambik', 4.628205128205129),
 (' Veritas 004', 4.626506024096385),
 (' Heady Topper', 4.6257995735607675),
 (' Deviation - Bottleworks 9th Anniversary', 4.620535714285714),
 (' Trappist Westvleteren 12', 4.617924528301887),
 (' King Henry', 4.61734693877551)]

### List of all beer types 

In [7]:
for i in range(len(Beer_styles)):
    print(str(i) + " - " +Beer_styles[i])

0 -  English Bitter
1 -  Smoked Beer
2 -  Bière de Garde
3 -  English Barleywine
4 -  American Double / Imperial Pilsner
5 -  Belgian Dark Ale
6 -  Herbed / Spiced Beer
7 -  Oatmeal Stout
8 -  English Pale Ale
9 -  Märzen / Oktoberfest
10 -  American Amber / Red Lager
11 -  Lambic - Fruit
12 -  Winter Warmer
13 -  Scotch Ale / Wee Heavy
14 -  Baltic Porter
15 -  Munich Helles Lager
16 -  Quadrupel (Quad)
17 -  Cream Ale
18 -  Rauchbier
19 -  Belgian Strong Dark Ale
20 -  Berliner Weissbier
21 -  Flanders Red Ale
22 -  Weizenbock
23 -  Light Lager
24 -  American IPA
25 -  English Dark Mild Ale
26 -  American Barleywine
27 -  Irish Red Ale
28 -  Munich Dunkel Lager
29 -  Belgian Strong Pale Ale
30 -  Faro
31 -  American Double / Imperial IPA
32 -  Kristalweizen
33 -  Japanese Rice Lager
34 -  Euro Dark Lager
35 -  American Porter
36 -  Foreign / Export Stout
37 -  American Double / Imperial Stout
38 -  Bock
39 -  Saison / Farmhouse Ale
40 -  American Strong Ale
41 -  Lambic - Unblended
4

### Cold Start Algorithm 

The cold start algorithm is designed so that recommendations are based on the most popular beers available as well as the user's beer type preference. Let us look at the algorithm case by case.

**1 - New user**

When a new user requests a recommendation, the user is shown the list of all beer types and is expected to select 0, 1 or 2 of them as input to the algorithm.

- If the user does not select any beer type, the 10 most popular beers overall are recommended.
- If the user selects 1 beer type, the top 5 most popular beers overall and the top 5 most popular beers of the selected beer type are recommended.
- If the user selects 2 beer types, the top 5 most popular beers overall, the top 3 beers of the first selected beer type and 2 beers of the second selected beer type are recommended.
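
The slot allocation described in the list above can be written down as a small lookup. This is a minimal sketch; the function name `slot_plan` and its return shape are assumptions made for illustration and do not appear in the notebook.

```python
def slot_plan(selected_styles):
    """Hypothetical helper: split the 10 recommendation slots for a new user.

    Returns (number of slots for the overall list,
             [(style, number of slots for that style), ...]).
    """
    if len(selected_styles) == 0:
        return 10, []
    if len(selected_styles) == 1:
        return 5, [(selected_styles[0], 5)]
    if len(selected_styles) == 2:
        return 5, [(selected_styles[0], 3), (selected_styles[1], 2)]
    raise ValueError("a new user selects at most two beer types")


print(slot_plan([]))
print(slot_plan(["American IPA"]))
print(slot_plan(["American IPA", "Oatmeal Stout"]))
```

In every case the slots add up to 10, which is the length of the final recommendation list.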


**2 - Returning user with fewer than 30 reviews**

When a returning user with fewer than 30 reviews requests a recommendation, the algorithm looks at the reviews the user has already given to find the user's preferred beer type. Based on the beer type the user has reviewed most often, the user is recommended the top 5 most popular beers overall and the top 5 most popular beers of that beer type. Beers the user has already reviewed are skipped.
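
Finding the preferred beer type amounts to counting the user's reviews per beer type. Below is a minimal sketch using `collections.Counter`; the function name `preferred_style` and the review history are hypothetical, and ties between beer types are not handled specially.

```python
from collections import Counter


def preferred_style(user_reviews):
    """Hypothetical helper. user_reviews: iterable of (beer_name, beer_style)."""
    counts = Counter(style for _beer, style in user_reviews)
    if not counts:
        return None  # no reviews yet: treat the user as a new user
    style, _count = counts.most_common(1)[0]
    return style


# Made-up review history, for illustration only.
history = [
    ("beer 1", "American IPA"),
    ("beer 2", "German Pilsener"),
    ("beer 3", "American IPA"),
    ("beer 4", "American Brown Ale"),
]
print(preferred_style(history))  # American IPA
```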

**3 - Returning user with more than 30 reviews**

When a returning user with more than 30 reviews requests a recommendation, the system switches from the cold start algorithm to the main recommendation algorithms.
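
The three cases can be summarised as a dispatch on the number of reviews. The sketch below is only illustrative: the function name and the returned labels are made up, a new user is assumed to be one with zero reviews, and a user with exactly 30 reviews is assumed to go to the main algorithm, since the text above does not say which side that boundary falls on.

```python
REVIEW_THRESHOLD = 30  # threshold taken from the cases above


def choose_strategy(n_reviews, threshold=REVIEW_THRESHOLD):
    """Hypothetical dispatcher between the cold start and main algorithms."""
    if n_reviews == 0:
        return "cold_start_new_user"
    if n_reviews < threshold:
        return "cold_start_returning_user"
    # Assumption: exactly `threshold` reviews already counts as enough.
    return "main_recommender"


for n in (0, 12, 30, 250):
    print(n, choose_strategy(n))
```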


- Note 1: Recommendations are never repeated. If a beer already appears among the top 5 most popular beers and also belongs to the beer type the user selected, it is skipped and the next beer in that beer type's ranking is recommended instead.

- Note 2: If a beer type has fewer popular beers than the number of slots allotted to it, the remaining slots are filled from the overall most popular list.
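
The two notes describe a de-duplicating fill followed by a backfill. A minimal sketch of that idea is shown below; `fill_slots` and the beer labels are hypothetical and only illustrate the rule, not the exact notebook code.

```python
def fill_slots(base, style_ranked, n_from_style, fallback, total=10):
    """Hypothetical helper illustrating Notes 1 and 2.

    base:         beers already chosen (e.g. the overall top 5)
    style_ranked: beers of the chosen beer type, best first
    n_from_style: number of slots reserved for that beer type
    fallback:     overall ranking used to fill any remaining slots
    """
    picks = list(base)
    added = 0
    for beer in style_ranked:
        if added == n_from_style:
            break
        if beer not in picks:  # Note 1: skip beers that are already recommended
            picks.append(beer)
            added += 1
    for beer in fallback:  # Note 2: backfill from the overall list
        if len(picks) >= total:
            break
        if beer not in picks:
            picks.append(beer)
    return picks


overall = ["P1", "P2", "P3", "P4", "P5", "P6", "P7", "P8", "P9", "P10"]
style_top = ["P2", "S1", "S2"]  # only two beers that are not already picked
print(fill_slots(overall[:5], style_top, 5, overall))
```

Here `P2` is skipped because it is already in the overall top 5, and the three slots the beer type cannot fill are taken by `P6`, `P7` and `P8`.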


In [8]:
def cold_start(name, flag, demand = ""):
    recommendation = [item[0] for item in Most_popular[:5]]
    if flag == 0:
        types = demand.split(",")
        if demand == "": 
            recommendation = [item[0] for item in Most_popular[:10]]
            
        elif len(types) == 1:
            i = 0
            while((len(recommendation) < 10) and (len(Popular[Beer_styles[int(types[0])]]) > i)):
                if Popular[Beer_styles[int(types[0])]][i][0] not in recommendation:
                        recommendation.append(Popular[Beer_styles[int(types[0])]][i][0])
                i = i + 1

        elif len(types) == 2:
            i = 0
            while((len(recommendation) < 8) and (len(Popular[Beer_styles[int(types[0])]]) > i)):
                if Popular[Beer_styles[int(types[0])]][i][0] not in recommendation:
                    recommendation.append(Popular[Beer_styles[int(types[0])]][i][0])
                i = i + 1
                
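            # note: i is not reset here, so the second beer type is scanned
            # starting from the index where the first loop stopped
            while((len(recommendation) < 10) and (len(Popular[Beer_styles[int(types[1])]]) > i)):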
                if Popular[Beer_styles[int(types[1])]][i][0] not in recommendation:
                    recommendation.append(Popular[Beer_styles[int(types[1])]][i][0])
                i = i + 1
        
    elif flag == 1: 
        new_data = subdata[subdata['review_profileName'] == name]
        main_beer_style = new_data.groupby("beer_style")['beer_name'].count().reset_index(name='count').sort_values(['count'], ascending=False).head(1)
        i = 0
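        # stop at the end of the style list; the fallback loop below fills any remaining slots (Note 2)
        while (len(recommendation) < 10) and (i < len(Popular[main_beer_style.iloc[0]['beer_style']])):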
            if Popular[main_beer_style.iloc[0]['beer_style']][i][0] not in recommendation and Popular[main_beer_style.iloc[0]['beer_style']][i][0] not in new_data['beer_name'].values:
                    recommendation.append(Popular[main_beer_style.iloc[0]['beer_style']][i][0])
            i = i + 1
    
    i = 5
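    # fill any remaining slots from the rest of the overall most popular list
    while (len(recommendation) < 10) and (i < len(Most_popular)):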
        if Most_popular[i][0] not in recommendation:
            recommendation.append(Most_popular[i][0])
        i = i+ 1
    return recommendation


Now let us look at each of the cases mentioned above

**Case 1: New user with no beer type selected**

In [9]:
cold_start("User 1", 0 , "")

[" Armand'4 Oude Geuze Lente (Spring)",
 ' Hoppy Birthday',
 ' Geuze Cuvée J&J (Joost En Jessie) Blauw (Blue)',
 ' Citra DIPA',
 ' Cantillon Blåbær Lambik',
 ' Veritas 004',
 ' Heady Topper',
 ' Deviation - Bottleworks 9th Anniversary',
 ' Trappist Westvleteren 12',
 ' King Henry']

As mentioned, the user is recommended the 10 most popular beers overall.

**Case 2: New user with 1 beer type selected. Let us assume the user opts for American Amber / Red Lager (code 10).**

In [10]:
cold_start("User 2", 0 , "10")

[" Armand'4 Oude Geuze Lente (Spring)",
 ' Hoppy Birthday',
 ' Geuze Cuvée J&J (Joost En Jessie) Blauw (Blue)',
 ' Citra DIPA',
 ' Cantillon Blåbær Lambik',
 ' Brooklyn Lager',
 ' Hopfenmalz',
 ' Creemore Springs Premium Lager',
 ' Riverwest Stein Beer',
 ' Winter Skål']

The first 5 recommendations are from the most popular list, and the next 5 are the top 5 from the American Amber / Red Lager beer type. The codes refer to the list of beer types printed above; because `Beer_styles` is built from a `set`, the numbering may differ between runs.

**Case 3: New user with 2 beer types selected. Let us assume the user opts for American Amber / Red Lager (code 10) and English Dark Mild Ale (code 25).**

In [11]:
# for a new user
cold_start("Assdas", 0 , "10,25")


[" Armand'4 Oude Geuze Lente (Spring)",
 ' Hoppy Birthday',
 ' Geuze Cuvée J&J (Joost En Jessie) Blauw (Blue)',
 ' Citra DIPA',
 ' Cantillon Blåbær Lambik',
 ' Brooklyn Lager',
 ' Hopfenmalz',
 ' Creemore Springs Premium Lager',
 ' Harpoon Brown Session Ale',
 ' Brawler Pugilist Style Ale']

The first 5 recommendations are from the most popular list, the next 3 slots are filled by the top 3 from the American Amber / Red Lager beer type, and the last 2 are filled from the English Dark Mild Ale beer type.

**For returning users with fewer than 30 reviews**

Let us look at the case of Hayward, who has 12 reviews. It is really difficult to recommend beers with such a small amount of information about the user.

In [12]:
print(len(subdata[subdata["review_profileName"] == " Hayward"]))


12


In [13]:
subdata[subdata["review_profileName"] == " Hayward"][["beer_style", "beer_name"]].groupby("beer_style").count()

| beer_style | beer_name |
|---|---|
| American Amber / Red Ale | 1 |
| American Brown Ale | 1 |
| American IPA | 4 |
| American Strong Ale | 1 |
| Dortmunder / Export Lager | 1 |
| German Pilsener | 3 |
| Maibock / Helles Bock | 1 |

Hayward has reviewed American IPA 4 times, more than any other beer type. Hence the algorithm selects American IPA as the preferred beer type and recommends the most popular American IPA beers along with the most popular beers overall.

In [14]:
# for a returning user (assuming fewer than 30 reviews)
cold_start(" Hayward", 1)


[" Armand'4 Oude Geuze Lente (Spring)",
 ' Hoppy Birthday',
 ' Geuze Cuvée J&J (Joost En Jessie) Blauw (Blue)',
 ' Citra DIPA',
 ' Cantillon Blåbær Lambik',
 ' Masala Mama India Pale Ale',
 " O'Brien's IPA",
 ' Sculpin India Pale Ale',
 ' White Rajah',
 ' Nelson']