This notebook narrates the <a href="https://www.kaggle.com/c/expedia-hotel-recommendations">Kaggle Expedia hotel recommendations competition </a>. I would like to acknowledge Vik Paruchuri for providing the wonderful guide for this competition.

The competition Details can be found at the above link. But I would like to summarize it for quick insight.

<b> The Expedia Kaggle competition</b>

The Expedia competition challenges you with predicting what hotel a user will book based on some attributes about the search the user is conducting on Expedia. Before we dive into any coding, we’ll need to put in time to understand both the problem and the data.

<b> Data Fields </b>

The details of the data fields is available at Kaggle site <a href="https://www.kaggle.com/c/expedia-hotel-recommendations/data"> data details </a>.


<b> Data Exploration </b>


In [3]:
#library imports
import pandas as pd
import numpy as np

In [4]:
destinations = pd.read_csv("destinations.csv")
test = pd.read_csv("test.csv")
train = pd.read_csv("train.csv")

In [5]:
#lets see the size of the different data files
print("train data size:",train.shape)
print("test data size:",test.shape)

train data size: (37670293, 24)
test data size: (2528243, 22)


We have about 37 million training set rows, and 2 million testing set rows, which will make this problem a bit challenging to work with.

We can explore the first few rows of the data:

In [6]:
train.head() #displays first 5 rows by default

Unnamed: 0,date_time,site_name,posa_continent,user_location_country,user_location_region,user_location_city,orig_destination_distance,user_id,is_mobile,is_package,...,srch_children_cnt,srch_rm_cnt,srch_destination_id,srch_destination_type_id,is_booking,cnt,hotel_continent,hotel_country,hotel_market,hotel_cluster
0,2014-08-11 07:46:59,2,3,66,348,48862,2234.2641,12,0,1,...,0,1,8250,1,0,3,2,50,628,1
1,2014-08-11 08:22:12,2,3,66,348,48862,2234.2641,12,0,1,...,0,1,8250,1,1,1,2,50,628,1
2,2014-08-11 08:24:33,2,3,66,348,48862,2234.2641,12,0,0,...,0,1,8250,1,0,1,2,50,628,1
3,2014-08-09 18:05:16,2,3,66,442,35390,913.1932,93,0,0,...,0,1,14984,1,0,1,2,50,1457,80
4,2014-08-09 18:08:18,2,3,66,442,35390,913.6259,93,0,0,...,0,1,14984,1,0,1,2,50,1457,21


There are a few things that immediately stick out:

<ul>
<li> date_time could be useful in our predictions, so we’ll need to convert it.
<li> Most of the columns are integers or floats, so we can’t do a lot of feature engineering. For example, user_location_country isn’t the name of a country, it’s an integer. This makes it harder to create new features, because we don’t know exactly which each value means.
</ul>

In [7]:
test.head()

Unnamed: 0,id,date_time,site_name,posa_continent,user_location_country,user_location_region,user_location_city,orig_destination_distance,user_id,is_mobile,...,srch_ci,srch_co,srch_adults_cnt,srch_children_cnt,srch_rm_cnt,srch_destination_id,srch_destination_type_id,hotel_continent,hotel_country,hotel_market
0,0,2015-09-03 17:09:54,2,3,66,174,37449,5539.0567,1,1,...,2016-05-19,2016-05-23,2,0,1,12243,6,6,204,27
1,1,2015-09-24 17:38:35,2,3,66,174,37449,5873.2923,1,1,...,2016-05-12,2016-05-15,2,0,1,14474,7,6,204,1540
2,2,2015-06-07 15:53:02,2,3,66,142,17440,3975.9776,20,0,...,2015-07-26,2015-07-27,4,0,1,11353,1,2,50,699
3,3,2015-09-14 14:49:10,2,3,66,258,34156,1508.5975,28,0,...,2015-09-14,2015-09-16,2,0,1,8250,1,2,50,628
4,4,2015-07-17 09:32:04,2,3,66,467,36345,66.7913,50,0,...,2015-07-22,2015-07-23,2,0,1,11812,1,2,50,538


There are a few things we can take away from looking at test.csv:

<ul>
<li> It looks like all the dates in test.csv are later than the dates in train.csv, and the <a href="https://www.kaggle.com/c/expedia-hotel-recommendations/data">data page</a> confirms this. The testing set contains dates from 2015, and the training set contains dates from 2013 and 2014.
<li> It looks like the user ids in test.csv are a subset of the user ids in train.csv, given the overlapping integer ranges. We can confirm this later on.
<li> The is_booking column always looks to be 1 in test.csv. The <a href="https://www.kaggle.com/c/expedia-hotel-recommendations/data">data page</a> confirms this.
</ul>


<b> What to predict </b>


We’ll be predicting which hotel_cluster a user will book after a given search. According to the description, there are 100 clusters in total.


<b> How we’ll be scored </b>

The evaluation page says that we’ll be scored using Mean Average Precision @ 5, which means that we’ll need to make 5 cluster predictions for each row, and will be scored on whether or not the correct prediction appears in our list. If the correct prediction comes earlier in the list, we get more points.

For example, if the “correct” cluster is 3, and we predict 4, 43, 60, 3, 20, our score will be lower than if we predict 3, 4, 43, 60, 20. We should put predictions we’re more certain about earlier in our list of predictions.


<b> Exploring Hotel Clusters </b>


In [8]:
train["hotel_cluster"].value_counts()

91    1043720
41     772743
48     754033
64     704734
65     670960
5      620194
98     589178
59     570291
42     551605
21     550092
70     545572
18     545284
83     534132
46     534038
25     530591
62     518809
95     509266
28     507016
68     503797
82     503755
37     496061
50     489892
30     489287
9      488328
58     483253
97     479446
16     477868
72     457463
1      452694
99     444887
       ...   
19     282893
84     278264
66     273505
38     269246
87     260398
23     259233
12     259022
31     257587
67     255946
43     253578
7      252447
54     250745
92     244343
89     243560
45     241408
49     240124
3      225250
80     220218
60     217919
71     216054
93     214293
86     209054
14     192299
75     165226
24     164127
35     139122
53     134812
88     107784
27     105040
74      48355
Name: hotel_cluster, Length: 100, dtype: int64

The output above is truncated, but it shows that the number of hotels in each cluster is fairly evenly distributed. There doesn’t appear to be any relationship between cluster number and the number of items.

<b>Exploring train and test user ids</b>


Finally, we’ll confirm our hypothesis that all the test user ids are found in the train DataFrame. We can do this by finding the unique values for user_id in test, and seeing if they all exist in train.

In [9]:
#find the intersection of train and test user ids
test_ids = set(test["user_id"].unique())
train_ids = set(train.user_id.unique())
intersection_count = len(test_ids & train_ids)
intersection_count == len(test_ids)

True

In [10]:
#another way to do this using numpy array
a = np.array(list(test_ids)) # donot know why in jupyter I need to cast the set to list before converting to array, in python console a set can be directly converted to array
b = np.array(list(train_ids))
common_users = np.intersect1d(a, b)
print(len(common_users), len(test_ids))
len(common_users) == len(test_ids)

1181577 1181577


True

Looks like our hypothesis is correct, which will make working with this data much easier!

<b> Downsampling our Kaggle data </b>


The entire train.csv dataset contains 37 million rows, which makes it hard to experiment with different techniques. Ideally, we want a small enough dataset that lets us quickly iterate through different approaches but is still representative of the whole training data.

We can do this by first randomly sampling rows from our data, then selecting new training and testing datasets from train.csv. By selecting both sets from train.csv, we’ll have the true hotel_cluster label for every row, and we’ll be able to calculate our accuracy as we test techniques

<b> Add in times and dates</b>

The first step is to add month and year fields to train. Because the train and test data is differentiated by date, we’ll need to add date fields to allow us to segment our data into two sets the same way. If we add year and month fields, we can split our data into training and testing sets using them.

The code below will:
<ul>
<li> Convert the date_time column in train from an object to a datetime value. This makes it easier to work with as a date.
<li> Extract the year and month from from date_time, and assign them to their own columns.
</ul>

In [11]:
train["date_time"] = pd.to_datetime(train["date_time"])
train["year"] = train["date_time"].dt.year
train["month"] = train["date_time"].dt.month

<b> Pick 10000 users</b>

Because the user ids in test are a subset of the user ids in train, we’ll need to do our random sampling in a way that preserves the full data of each user. We can accomplish this by selecting a certain number of users randomly, then only picking rows from train where user_id is in our random sample of user ids.


In [12]:
import random

unique_user_ids = train["user_id"].unique()
sel_user_ids = random.sample(set(unique_user_ids),10000)
sel_train = train[train.user_id.isin(sel_user_ids)]

<b> Pick new training and testing sets </b>

We’ll now need to pick new training and testing sets from sel_train. We’ll call these sets t1 and t2.


In [13]:
t1 = sel_train[((sel_train.year == 2013) | ((sel_train.year == 2014) & (sel_train.month < 8)))]
t2 = sel_train[((sel_train.year == 2014) & (sel_train.month >= 8))]


In the original train and test DataFrames, test contained data from 2015, and train contained data from 2013 and 2014. We split this data so that anything after July 2014 is in t2, and anything before is in t1. This gives us smaller training and testing sets with similar characteristics to train and test.

<b> Remove click events</b>

If is_booking is 0, it represents a click, and a 1 represents a booking.  test contains only booking events, so we’ll need to sample t2 to only contain bookings as well.

In [14]:
t2 = t2[t2.is_booking == 1]

<b> A simple algorithm</b>


The most simple technique we could try on this data is to find the most common clusters across the data, then use them as predictions. (Whichever cluster has many user_ids will be assumed to be the WINNER!!!)

We can again use the value_counts method to help us here:

In [15]:
#lets take few of the top clusters
most_common_clusters = list(train.hotel_cluster.value_counts().head().index)



The above code will give us a list of the 5 most common clusters in train. This is because the head method returns the first 5 rows by default, and the index property will return the index of the DataFrame, which is the hotel cluster after running the value_counts method.

<b>Generating predictions</b>


We can turn most_common_clusters into a list of predictions by making the same prediction for each row.



In [16]:
predictions = [most_common_clusters for i in range(t2.shape[0])]
#print(predictions[0])
#print(len(predictions), t2.shape)

This will create a list with as many elements as there are rows in t2. Each element will be equal to most_common_clusters.

<b> Evaluating error</b>


In order to evaluate error, we’ll first need to figure out how to compute Mean Average Precision. Luckily, <a href="https://github.com/benhamner">Ben Hamner</a> has written an implementation that can be found <a href="https://github.com/benhamner/Metrics/blob/master/Python/ml_metrics/average_precision.py">here</a>. It can be installed as part of the ml_metrics package, and you can find installation instructions for how to install it here.

We can compute our error metric with the mapk method in ml_metrics. Rather than installing, I am copying the actual code from the above link, to avoid installation and restart of the notebook in the middle of work.

In [17]:
import numpy as np

def apk(actual, predicted, k=10):
    """
    Computes the average precision at k.
    This function computes the average prescision at k between two lists of
    items.
    Parameters
    ----------
    actual : list
             A list of elements that are to be predicted (order doesn't matter)
    predicted : list
                A list of predicted elements (order does matter)
    k : int, optional
        The maximum number of predicted elements
    Returns
    -------
    score : double
            The average precision at k over the input lists
    """
    if len(predicted)>k:
        predicted = predicted[:k]

    score = 0.0
    num_hits = 0.0

    for i,p in enumerate(predicted):
        if p in actual and p not in predicted[:i]:
            num_hits += 1.0
            score += num_hits / (i+1.0)

    if not actual:
        return 0.0

    return score / min(len(actual), k)

def mapk(actual, predicted, k=10):
    """
    Computes the mean average precision at k.
    This function computes the mean average prescision at k between two lists
    of lists of items.
    Parameters
    ----------
    actual : list
             A list of lists of elements that are to be predicted 
             (order doesn't matter in the lists)
    predicted : list
                A list of lists of predicted elements
                (order matters in the lists)
    k : int, optional
        The maximum number of predicted elements
    Returns
    -------
    score : double
            The mean average precision at k over the input lists
    """
    return np.mean([apk(a,p,k) for a,p in zip(actual, predicted)])

In [18]:
target = [[l] for l in t2["hotel_cluster"]]
#metrics.mapk(target, predictions, k=5)
#the mapk takes list of lists as the target and predictions
print(mapk(target, predictions, k=5))

0.0616247700797


Our target needs to be in list of lists format for mapk to work, so we convert the hotel_cluster column of t2 into a list of lists. Then, we call the mapk method with our target, our predictions, and the number of predictions we want to evaluate (5).

Our result here isn’t great, but we’ve just generated our first set of predictions, and evaluated our error! The framework we’ve built will allow us to quickly test out a variety of techniques and see how they score. We’re well on our way to building a good-performing solution for the leaderboard.

<b> Finding correlations </b>


Before we move on to creating a better algorithm, let’s see if anything correlates well with hotel_cluster. This will tell us if we should dive more into any particular columns.

We can find linear correlations in the training set using the corr method:

In [19]:
train.corr()["hotel_cluster"]

site_name                   -0.022408
posa_continent               0.014938
user_location_country       -0.010477
user_location_region         0.007453
user_location_city           0.000831
orig_destination_distance    0.007260
user_id                      0.001052
is_mobile                    0.008412
is_package                   0.038733
channel                      0.000707
srch_adults_cnt              0.012309
srch_children_cnt            0.016261
srch_rm_cnt                 -0.005954
srch_destination_id         -0.011712
srch_destination_type_id    -0.032850
is_booking                  -0.021548
cnt                          0.002944
hotel_continent             -0.013963
hotel_country               -0.024289
hotel_market                 0.034205
hotel_cluster                1.000000
year                        -0.001050
month                       -0.000560
Name: hotel_cluster, dtype: float64

This tells us that no columns correlate linearly with hotel_cluster. This makes sense, because there is no linear ordering to hotel_cluster. For example, having a higher cluster number isn’t tied to having a higher srch_destination_id.

Unfortunately, this means that techniques like linear regression and logistic regression won’t work well on our data, because they rely on linear correlations between predictors and targets.



<b> Creating better predictions for our Kaggle entry</b>

This data for this competition is quite difficult to make predictions on using machine learning for a few reasons:

<ul>

<li> There are millions of rows, which increases runtime and memory usage for algorithms.

<li> There are 100 different clusters, and according to the competition admins, the boundaries are fairly fuzzy, so it will likely be hard to make predictions. As the number of clusters increases, classifiers generally decrease in accuracy.

<li> Nothing is linearly correlated with the target (hotel_clusters), meaning we can’t use fast machine learning techniques like linear regression.


For these reasons, machine learning probably won’t work well on our data, but we can try an algorithm and find out.

<b> Generating features</b>

The first step in applying machine learning is to generate features. We can generate features using both what’s available in the training data, and what’s available in destinations. We haven’t looked at destinations yet, so let’s take a quick peek.

<b> Generating features from destinations</b>

Destinations contains an id that corresponds to srch_destination_id, along with 149 columns of latent information about that destination. Here’s a sample:

In [20]:
destinations.head()

Unnamed: 0,srch_destination_id,d1,d2,d3,d4,d5,d6,d7,d8,d9,...,d140,d141,d142,d143,d144,d145,d146,d147,d148,d149
0,0,-2.198657,-2.198657,-2.198657,-2.198657,-2.198657,-1.897627,-2.198657,-2.198657,-1.897627,...,-2.198657,-2.198657,-2.198657,-2.198657,-2.198657,-2.198657,-2.198657,-2.198657,-2.198657,-2.198657
1,1,-2.18169,-2.18169,-2.18169,-2.082564,-2.18169,-2.165028,-2.18169,-2.18169,-2.031597,...,-2.165028,-2.18169,-2.165028,-2.18169,-2.18169,-2.165028,-2.18169,-2.18169,-2.18169,-2.18169
2,2,-2.18349,-2.224164,-2.224164,-2.189562,-2.105819,-2.075407,-2.224164,-2.118483,-2.140393,...,-2.224164,-2.224164,-2.196379,-2.224164,-2.192009,-2.224164,-2.224164,-2.224164,-2.224164,-2.057548
3,3,-2.177409,-2.177409,-2.177409,-2.177409,-2.177409,-2.115485,-2.177409,-2.177409,-2.177409,...,-2.161081,-2.177409,-2.177409,-2.177409,-2.177409,-2.177409,-2.177409,-2.177409,-2.177409,-2.177409
4,4,-2.189562,-2.187783,-2.194008,-2.171153,-2.152303,-2.056618,-2.194008,-2.194008,-2.145911,...,-2.187356,-2.194008,-2.191779,-2.194008,-2.194008,-2.185161,-2.194008,-2.194008,-2.194008,-2.188037


The competition doesn’t tell us exactly what each latent feature is, but it’s safe to assume that it’s some combination of destination characteristics, like name, description, and more. These latent features were converted to numbers, so they could be anonymized.

We can use the destination information as features in a machine learning algorithm, but we’ll need to compress the number of columns down first, to minimize runtime. We can use PCA to do this. PCA will reduce the number of columns in a matrix while trying to preserve the same amount of variance per row. Ideally, PCA will compress all the information contained in all the columns into less, but in practice, some information is lost.

In the code below, we:

<ul>
<li> Initialize a PCA model using scikit-learn.
<li> Specify that we want to only have 3 columns in our data.
<li> Transform the columns d1-d149 into 3 columns.

</ul>

In [21]:
from sklearn.decomposition import PCA

pca = PCA(n_components=3)
# we skip the srch_destination_id from dimensionality reduction because we might need to extract it later
dest_small = pca.fit_transform(destinations[["d{0}".format(i + 1) for i in range(149)]])
dest_small = pd.DataFrame(dest_small)
#now we apply the srch_destination_id to our dimension reduced data
dest_small["srch_destination_id"] = destinations["srch_destination_id"]
#lets see the column names
print(dest_small.columns.values)

[0 1 2 'srch_destination_id']


The above code compresses the 149 columns in destinations down to 3 columns, and creates a new DataFrame called dest_small. We preserve most of the variance in destinations while doing this, so we don’t lose a lot of information, but save a lot of runtime for a machine learning algorithm.

<b>Generating features</b>

Now that the preliminaries are done with, we can generate our features. We’ll do the following:

<ul>

<li> Generate new date features based on date_time, srch_ci, and srch_co.
<li> Remove non-numeric columns like date_time.
<li> Add in features from dest_small.
<li> Replace any missing values with -1.
</ul

In [22]:
#lets review the columns in train and test set
print(train.columns.values)
print(test.columns.values)

['date_time' 'site_name' 'posa_continent' 'user_location_country'
 'user_location_region' 'user_location_city' 'orig_destination_distance'
 'user_id' 'is_mobile' 'is_package' 'channel' 'srch_ci' 'srch_co'
 'srch_adults_cnt' 'srch_children_cnt' 'srch_rm_cnt' 'srch_destination_id'
 'srch_destination_type_id' 'is_booking' 'cnt' 'hotel_continent'
 'hotel_country' 'hotel_market' 'hotel_cluster' 'year' 'month']
['id' 'date_time' 'site_name' 'posa_continent' 'user_location_country'
 'user_location_region' 'user_location_city' 'orig_destination_distance'
 'user_id' 'is_mobile' 'is_package' 'channel' 'srch_ci' 'srch_co'
 'srch_adults_cnt' 'srch_children_cnt' 'srch_rm_cnt' 'srch_destination_id'
 'srch_destination_type_id' 'hotel_continent' 'hotel_country'
 'hotel_market']


In [23]:
def calc_fast_features(df):
    df["date_time"] = pd.to_datetime(df["date_time"])
    df["srch_ci"] = pd.to_datetime(df["srch_ci"], format='%Y-%m-%d', errors="coerce")
    df["srch_co"] = pd.to_datetime(df["srch_co"], format='%Y-%m-%d', errors="coerce")
    
    props = {}
    for prop in ["month", "day", "hour", "minute", "dayofweek", "quarter"]:
        props[prop] = getattr(df["date_time"].dt, prop)
    
    carryover = [p for p in df.columns if p not in ["date_time", "srch_ci", "srch_co"]]
    for prop in carryover:
        props[prop] = df[prop]
    
    date_props = ["month", "day", "dayofweek", "quarter"]
    for prop in date_props:
        props["ci_{0}".format(prop)] = getattr(df["srch_ci"].dt, prop)
        props["co_{0}".format(prop)] = getattr(df["srch_co"].dt, prop)
    props["stay_span"] = (df["srch_co"] - df["srch_ci"]).astype('timedelta64[h]')
        
    ret = pd.DataFrame(props)
    
    ret = ret.join(dest_small, on="srch_destination_id", how='left', rsuffix="dest")
    ret = ret.drop("srch_destination_iddest", axis=1)
    return ret

df = calc_fast_features(t1)
df.fillna(-1, inplace=True)
print(df.columns.values)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  This is separate from the ipykernel package so we can avoid doing imports until
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  after removing the cwd from sys.path.


['channel' 'ci_day' 'ci_dayofweek' 'ci_month' 'ci_quarter' 'cnt' 'co_day'
 'co_dayofweek' 'co_month' 'co_quarter' 'day' 'dayofweek' 'hotel_cluster'
 'hotel_continent' 'hotel_country' 'hotel_market' 'hour' 'is_booking'
 'is_mobile' 'is_package' 'minute' 'month' 'orig_destination_distance'
 'posa_continent' 'quarter' 'site_name' 'srch_adults_cnt'
 'srch_children_cnt' 'srch_destination_id' 'srch_destination_type_id'
 'srch_rm_cnt' 'stay_span' 'user_id' 'user_location_city'
 'user_location_country' 'user_location_region' 'year' 0 1 2]


The above will calculate features such as length of stay, check in day, and check out month. These features will help us train a machine learning algorithm later on.

Replacing missing values with -1 isn’t the best choice, but it will work fine for now, and we can always optimize the behavior later on.


<b> Machine learning</b>

Now that we have features for our training data, we can try machine learning. We’ll use 3-fold cross validation across the training set to generate a reliable error estimate. Cross validation splits the training set up into 3 parts, then predicts hotel_cluster for each part using the other parts to train with.

We’ll generate predictions using the Random Forest algorithm. Random forests build trees, which can fit to nonlinear tendencies in data. This will enable us to make predictions, even though none of our columns are linearly related.

We’ll first initialize the model and compute cross validation scores:


In [24]:
predictors = [c for c in df.columns if c not in ["hotel_cluster"]]
from sklearn import cross_validation
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=10, min_weight_fraction_leaf=0.1)
scores = cross_validation.cross_val_score(clf, df[predictors], df['hotel_cluster'], cv=3)
scores



array([ 0.06260304,  0.06737546,  0.06583448])

The above code doesn’t give us very good accuracy, and confirms our original suspicion that machine learning isn’t a great approach to this problem. However, classifiers tend to have lower accuracy when there is a high cluster count. We can instead try training 100 binary classifiers. Each classifier will just determine if a row is in it’s cluster, or not. This will entail training one classifier per label in hotel_cluster.

<b> Binary classifiers</b>

We’ll again train Random Forests, but each forest will predict only a single hotel cluster. We’ll use 2 fold cross validation for speed, and only train 10 trees per label.

In the code below, we:

<ul>

<li> Loop across each unique hotel_cluster.
<ul>
    <li> Train a Random Forest classifier using 2-fold cross validation.
    <li> Extract the probabilities from the classifier that the row is in the unique hotel_cluster
</ul>
<li> Combine all the probabilities.
<li> For each row, find the 5 largest probabilities, and assign those hotel_cluster values as predictions.
<li> Compute accuracy using mapk.
</ul>

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.cross_validation import KFold
from itertools import chain

all_probs = []
unique_clusters = df["hotel_cluster"].unique()
for cluster in unique_clusters:
    df["target"] = 1
    df["target"][df["hotel_cluster"] != cluster] = 0
    predictors = [col for col in df if col not in ['hotel_cluster', "target"]]
    probs = []
    cv = KFold(len(df["target"]), n_folds=2)
    clf = RandomForestClassifier(n_estimators=10, min_weight_fraction_leaf=0.1)
    for i, (tr, te) in enumerate(cv):
        print(tr)
        print(te)
        #print(df[predictors].iloc[tr])
        clf.fit(df[predictors].iloc[tr], df["target"].iloc[tr])
        preds = clf.predict_proba(df[predictors].iloc[te])
        probs.append([p[1] for p in preds])
    full_probs = chain.from_iterable(probs)
    all_probs.append(list(full_probs))

prediction_frame = pd.DataFrame(all_probs).T
prediction_frame.columns = unique_clusters
def find_top_5(row):
    return list(row.nlargest(5).index)

preds = []
for index, row in prediction_frame.iterrows():
    preds.append(find_top_5(row))

mapk([[l] for l in t2.iloc["hotel_cluster"]], preds, k=5)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':


[100030 100031 100032 ..., 200057 200058 200059]
[     0      1      2 ..., 100027 100028 100029]
[     0      1      2 ..., 100027 100028 100029]
[100030 100031 100032 ..., 200057 200058 200059]


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':


[100030 100031 100032 ..., 200057 200058 200059]
[     0      1      2 ..., 100027 100028 100029]
[     0      1      2 ..., 100027 100028 100029]
[100030 100031 100032 ..., 200057 200058 200059]


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':


[100030 100031 100032 ..., 200057 200058 200059]
[     0      1      2 ..., 100027 100028 100029]
[     0      1      2 ..., 100027 100028 100029]
[100030 100031 100032 ..., 200057 200058 200059]


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':


[100030 100031 100032 ..., 200057 200058 200059]
[     0      1      2 ..., 100027 100028 100029]
[     0      1      2 ..., 100027 100028 100029]
[100030 100031 100032 ..., 200057 200058 200059]


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':


[100030 100031 100032 ..., 200057 200058 200059]
[     0      1      2 ..., 100027 100028 100029]
[     0      1      2 ..., 100027 100028 100029]
[100030 100031 100032 ..., 200057 200058 200059]


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':


[100030 100031 100032 ..., 200057 200058 200059]
[     0      1      2 ..., 100027 100028 100029]
[     0      1      2 ..., 100027 100028 100029]
[100030 100031 100032 ..., 200057 200058 200059]


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':


[100030 100031 100032 ..., 200057 200058 200059]
[     0      1      2 ..., 100027 100028 100029]
[     0      1      2 ..., 100027 100028 100029]
[100030 100031 100032 ..., 200057 200058 200059]


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':


[100030 100031 100032 ..., 200057 200058 200059]
[     0      1      2 ..., 100027 100028 100029]
[     0      1      2 ..., 100027 100028 100029]
[100030 100031 100032 ..., 200057 200058 200059]


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':


[100030 100031 100032 ..., 200057 200058 200059]
[     0      1      2 ..., 100027 100028 100029]
[     0      1      2 ..., 100027 100028 100029]
[100030 100031 100032 ..., 200057 200058 200059]


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':


[100030 100031 100032 ..., 200057 200058 200059]
[     0      1      2 ..., 100027 100028 100029]
[     0      1      2 ..., 100027 100028 100029]
[100030 100031 100032 ..., 200057 200058 200059]


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':


[100030 100031 100032 ..., 200057 200058 200059]
[     0      1      2 ..., 100027 100028 100029]
[     0      1      2 ..., 100027 100028 100029]
[100030 100031 100032 ..., 200057 200058 200059]


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':


[100030 100031 100032 ..., 200057 200058 200059]
[     0      1      2 ..., 100027 100028 100029]
[     0      1      2 ..., 100027 100028 100029]
[100030 100031 100032 ..., 200057 200058 200059]


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':


[100030 100031 100032 ..., 200057 200058 200059]
[     0      1      2 ..., 100027 100028 100029]
[     0      1      2 ..., 100027 100028 100029]
[100030 100031 100032 ..., 200057 200058 200059]


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':


[100030 100031 100032 ..., 200057 200058 200059]
[     0      1      2 ..., 100027 100028 100029]
[     0      1      2 ..., 100027 100028 100029]
[100030 100031 100032 ..., 200057 200058 200059]


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':


[100030 100031 100032 ..., 200057 200058 200059]
[     0      1      2 ..., 100027 100028 100029]
[     0      1      2 ..., 100027 100028 100029]
[100030 100031 100032 ..., 200057 200058 200059]


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':


[100030 100031 100032 ..., 200057 200058 200059]
[     0      1      2 ..., 100027 100028 100029]
[     0      1      2 ..., 100027 100028 100029]
[100030 100031 100032 ..., 200057 200058 200059]


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':


[100030 100031 100032 ..., 200057 200058 200059]
[     0      1      2 ..., 100027 100028 100029]
[     0      1      2 ..., 100027 100028 100029]
[100030 100031 100032 ..., 200057 200058 200059]


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':


[100030 100031 100032 ..., 200057 200058 200059]
[     0      1      2 ..., 100027 100028 100029]
[     0      1      2 ..., 100027 100028 100029]
[100030 100031 100032 ..., 200057 200058 200059]


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':


[100030 100031 100032 ..., 200057 200058 200059]
[     0      1      2 ..., 100027 100028 100029]
[     0      1      2 ..., 100027 100028 100029]
[100030 100031 100032 ..., 200057 200058 200059]


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':


[100030 100031 100032 ..., 200057 200058 200059]
[     0      1      2 ..., 100027 100028 100029]
[     0      1      2 ..., 100027 100028 100029]
[100030 100031 100032 ..., 200057 200058 200059]


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':


[100030 100031 100032 ..., 200057 200058 200059]
[     0      1      2 ..., 100027 100028 100029]
[     0      1      2 ..., 100027 100028 100029]
[100030 100031 100032 ..., 200057 200058 200059]


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':


[100030 100031 100032 ..., 200057 200058 200059]
[     0      1      2 ..., 100027 100028 100029]
[     0      1      2 ..., 100027 100028 100029]
[100030 100031 100032 ..., 200057 200058 200059]


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':


[100030 100031 100032 ..., 200057 200058 200059]
[     0      1      2 ..., 100027 100028 100029]
[     0      1      2 ..., 100027 100028 100029]
[100030 100031 100032 ..., 200057 200058 200059]


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':


[100030 100031 100032 ..., 200057 200058 200059]
[     0      1      2 ..., 100027 100028 100029]
[     0      1      2 ..., 100027 100028 100029]
[100030 100031 100032 ..., 200057 200058 200059]


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':


[100030 100031 100032 ..., 200057 200058 200059]
[     0      1      2 ..., 100027 100028 100029]
[     0      1      2 ..., 100027 100028 100029]
[100030 100031 100032 ..., 200057 200058 200059]


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':


[100030 100031 100032 ..., 200057 200058 200059]
[     0      1      2 ..., 100027 100028 100029]
[     0      1      2 ..., 100027 100028 100029]
[100030 100031 100032 ..., 200057 200058 200059]


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':


[100030 100031 100032 ..., 200057 200058 200059]
[     0      1      2 ..., 100027 100028 100029]
[     0      1      2 ..., 100027 100028 100029]
[100030 100031 100032 ..., 200057 200058 200059]


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':


[100030 100031 100032 ..., 200057 200058 200059]
[     0      1      2 ..., 100027 100028 100029]
[     0      1      2 ..., 100027 100028 100029]
[100030 100031 100032 ..., 200057 200058 200059]


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':


[100030 100031 100032 ..., 200057 200058 200059]
[     0      1      2 ..., 100027 100028 100029]
[     0      1      2 ..., 100027 100028 100029]
[100030 100031 100032 ..., 200057 200058 200059]


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':


[100030 100031 100032 ..., 200057 200058 200059]
[     0      1      2 ..., 100027 100028 100029]
[     0      1      2 ..., 100027 100028 100029]
[100030 100031 100032 ..., 200057 200058 200059]


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':


[100030 100031 100032 ..., 200057 200058 200059]
[     0      1      2 ..., 100027 100028 100029]
[     0      1      2 ..., 100027 100028 100029]
[100030 100031 100032 ..., 200057 200058 200059]


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':


[100030 100031 100032 ..., 200057 200058 200059]
[     0      1      2 ..., 100027 100028 100029]
[     0      1      2 ..., 100027 100028 100029]
[100030 100031 100032 ..., 200057 200058 200059]


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':


[100030 100031 100032 ..., 200057 200058 200059]
[     0      1      2 ..., 100027 100028 100029]
[     0      1      2 ..., 100027 100028 100029]
[100030 100031 100032 ..., 200057 200058 200059]


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':


[100030 100031 100032 ..., 200057 200058 200059]
[     0      1      2 ..., 100027 100028 100029]
[     0      1      2 ..., 100027 100028 100029]
[100030 100031 100032 ..., 200057 200058 200059]


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':


[100030 100031 100032 ..., 200057 200058 200059]
[     0      1      2 ..., 100027 100028 100029]
[     0      1      2 ..., 100027 100028 100029]
[100030 100031 100032 ..., 200057 200058 200059]


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':


[100030 100031 100032 ..., 200057 200058 200059]
[     0      1      2 ..., 100027 100028 100029]
[     0      1      2 ..., 100027 100028 100029]
[100030 100031 100032 ..., 200057 200058 200059]


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':


[100030 100031 100032 ..., 200057 200058 200059]
[     0      1      2 ..., 100027 100028 100029]
[     0      1      2 ..., 100027 100028 100029]
[100030 100031 100032 ..., 200057 200058 200059]


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':


[100030 100031 100032 ..., 200057 200058 200059]
[     0      1      2 ..., 100027 100028 100029]
[     0      1      2 ..., 100027 100028 100029]
[100030 100031 100032 ..., 200057 200058 200059]


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':


[100030 100031 100032 ..., 200057 200058 200059]
[     0      1      2 ..., 100027 100028 100029]
[     0      1      2 ..., 100027 100028 100029]
[100030 100031 100032 ..., 200057 200058 200059]


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':


[100030 100031 100032 ..., 200057 200058 200059]
[     0      1      2 ..., 100027 100028 100029]
[     0      1      2 ..., 100027 100028 100029]
[100030 100031 100032 ..., 200057 200058 200059]


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':


[100030 100031 100032 ..., 200057 200058 200059]
[     0      1      2 ..., 100027 100028 100029]
[     0      1      2 ..., 100027 100028 100029]
[100030 100031 100032 ..., 200057 200058 200059]


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':


[100030 100031 100032 ..., 200057 200058 200059]
[     0      1      2 ..., 100027 100028 100029]


<b>References </b>

A post from Vik Paruchuri  at https://www.dataquest.io/blog/kaggle-tutorial/