## EDA Summary

After familiarising ourselves with the dataset through the EDA, the next step is to get the data ready for modelling. 

- We are working with a sample of the original data, which contains 191385 cases, and 24 features. 
- Our target feature, 'hotel_cluster' ranges from 0-100, and we have all values represented in our sample.
- Several of our potentially predictive features are categorical (user_location, is_mobile, is_package, etc.), so we need to decide how to treat them: use them as they are, or convert them to dummies. 
- Further work is needed to incorporate the variables srch_ci & srch_co, the check-in and check-out dates. 
- We have seen no strong correlations among the features. No likely predictor has emerged, and no potential multicollinearity has been detected.

## Preparing the data for modelling:

In [1]:
import numpy as np
import pandas as pd
%matplotlib inline

In [2]:
#Let's begin by importing our clean sample
EXPEDIA_FOLDER = "/Users/veronicabianchini/Documents/GA-Data Science/Expedia Project/"
sample_clean = pd.read_csv(EXPEDIA_FOLDER+"sampled_df_2017-05-13-13-58-59.csv")

In [3]:
sample_clean.shape

(191385, 26)

Description of my variables (screenshot from Kaggle)

![alt text](Expedia_data_features.png "Data Features")

Second, let's create the dummy variables needed and drop the features we won't use. Which ones in my set need transforming?
- 'date_time': will drop for now, but might use the month & day values 
- 'site_name' : will not use in my model
- *'posa_continent'*: will make into dummy
- *'user_location_country'* : will make into dummy
- *'user_location_region'*: will make into dummy
- 'user_location_city': will drop, since it has too many categories to use as dummy
- 'orig_destination_distance': is scalar variable already 
- 'user_id': won't use in my model
- 'is_mobile' : is already a dummy
- 'is_package' : is already a dummy
- 'channel': won't use in my model
- 'srch_ci': will create 'stay_lenght'
- 'srch_co': will create 'stay_lenght'
- 'srch_adults_cnt': is scalar variable already  
- 'srch_children_cnt': is scalar variable already  
- 'srch_rm_cnt': is scalar variable already 
- 'srch_destination_id': will not use in my model
- *'srch_destination_type_id'*: will make into dummy
- 'is_booking': is already a dummy
- 'cnt': won't use in my model
- *'hotel_continent'*: will make into dummy 
- *'hotel_country'*: will make into dummy 
- *'hotel_market'*: will make into dummy
- **'hotel_cluster'**: Is my target.
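
The 'date_time' entry above mentions possibly using the month and day values later. A minimal sketch of how that could look, assuming the timestamps follow the `YYYY-MM-DD HH:MM:SS` layout seen in the sample; the column names `search_month` and `search_dayofweek` are hypothetical and are not used anywhere else in this notebook:

```python
import pandas as pd

# Two timestamps in the same layout as the 'date_time' column.
toy = pd.DataFrame({"date_time": ["2013-02-10 13:18:32", "2013-12-03 20:43:30"]})

stamp = pd.to_datetime(toy["date_time"], format="%Y-%m-%d %H:%M:%S")

# Hypothetical derived features: calendar month and day of week (Monday=0 ... Sunday=6).
toy["search_month"] = stamp.dt.month
toy["search_dayofweek"] = stamp.dt.dayofweek

print(toy)
```

Both derived columns are small integer codes, so they could be used directly by tree models or turned into dummies like the other categorical features.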


### Strategy for creating feature matrix:

- Get columns to convert to dummy
- Get columns to drop from initial modelling.
- Add dummy columns.
- Separate feature matrix from the target variable into X and Y dataframe.
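
The four steps above can be sketched on a toy frame before applying them to the real sample. This is only an illustration: the three-row `toy` DataFrame and its values are invented, and only the column names are borrowed from the dataset.

```python
import pandas as pd

# Toy stand-in for the cleaned sample (values are invented for illustration).
toy = pd.DataFrame({
    "user_id": [1, 2, 3],            # identifier, dropped before modelling
    "hotel_continent": [2, 4, 2],    # categorical code, turned into dummies
    "is_package": [1, 0, 0],         # already binary, kept as is
    "hotel_cluster": [69, 25, 69],   # target
})

columns_to_dummies = ["hotel_continent"]
columns_to_drop = ["user_id"]

# Drop the unused columns, then expand the categorical codes into dummy columns.
features = pd.get_dummies(toy.drop(columns_to_drop, axis=1),
                          columns=columns_to_dummies)

# Separate the feature matrix from the target.
X = features.drop(["hotel_cluster"], axis=1)
y = features["hotel_cluster"]

print(list(X.columns))
```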

In [4]:
#Get columns to convert to dummy
columns_to_dummies = [
    'posa_continent',
    'user_location_country',
    'user_location_region',
    'srch_destination_type_id',
    'hotel_continent', 
    'hotel_country',
    'hotel_market'
]

columns_to_drop = [
    "Unnamed: 0",
    "Unnamed: 0.1",
    "date_time",
    "site_name",
    "user_location_city",
    "srch_ci",
    "srch_co",
    "srch_destination_id",
    "user_id",
    "cnt",
    "channel"
]

In [5]:
sample_clean.columns

Index([u'Unnamed: 0', u'Unnamed: 0.1', u'date_time', u'site_name',
       u'posa_continent', u'user_location_country', u'user_location_region',
       u'user_location_city', u'orig_destination_distance', u'user_id',
       u'is_mobile', u'is_package', u'channel', u'srch_ci', u'srch_co',
       u'srch_adults_cnt', u'srch_children_cnt', u'srch_rm_cnt',
       u'srch_destination_id', u'srch_destination_type_id', u'is_booking',
       u'cnt', u'hotel_continent', u'hotel_country', u'hotel_market',
       u'hotel_cluster'],
      dtype='object')

In [6]:
# Drop columns that will not be used in the initial modelling (might come back to them afterwards to apply more 
# sophisticated feature engineering)
df_dropped = sample_clean.drop(columns_to_drop, axis=1)
df_dropped.head(2)

Unnamed: 0,posa_continent,user_location_country,user_location_region,orig_destination_distance,is_mobile,is_package,srch_adults_cnt,srch_children_cnt,srch_rm_cnt,srch_destination_type_id,is_booking,hotel_continent,hotel_country,hotel_market,hotel_cluster
0,3,66,226,749.7598,0,1,1,1,1,1,0,2,50,675,69
1,3,66,196,152.9141,0,0,2,1,1,1,0,2,50,694,25


In [7]:
# Create dummy columns and drop original feature columns.
df_dropped_and_dummies = pd.get_dummies(df_dropped, columns=columns_to_dummies)
df_dropped_and_dummies.head(2)

Unnamed: 0,orig_destination_distance,is_mobile,is_package,srch_adults_cnt,srch_children_cnt,srch_rm_cnt,is_booking,hotel_cluster,posa_continent_0,posa_continent_1,...,hotel_market_2107,hotel_market_2108,hotel_market_2109,hotel_market_2110,hotel_market_2111,hotel_market_2112,hotel_market_2113,hotel_market_2115,hotel_market_2116,hotel_market_2117
0,749.7598,0,1,1,1,1,0,69,0,0,...,0,0,0,0,0,0,0,0,0,0
1,152.9141,0,0,2,1,1,0,25,0,0,...,0,0,0,0,0,0,0,0,0,0


In [None]:
#sample_clean['srch_co']=pd.to_datetime(sample_clean['srch_co'],format='%Y-%m-%d')
#sample_clean['srch_ci']=pd.to_datetime(sample_clean['srch_ci'],format='%Y-%m-%d')
#sample_clean['stay_lenght']= sample_clean['srch_co']-sample_clean['srch_ci']
#sample_clean['stay_lenght']=sample_clean['stay_lenght'].dt.days

# MODELS

In [8]:
#from sklearn import datasets, neighbors, metrics, grid_search, cross_validation, linear_model

from sklearn.grid_search import GridSearchCV
from sklearn import neighbors
from sklearn import cross_validation
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import grid_search
from sklearn.metrics import accuracy_score, confusion_matrix



### Strategy for modelling:

Core:

- Create train-test split on our cleaned dataset.
- Calculate baseline model.
- Fit simple logistic regression (with all its defaults) as a naive benchmark.
- Assess different classification models.


In [33]:
X = df_dropped_and_dummies.drop(["hotel_cluster"], axis=1)
y = df_dropped_and_dummies["hotel_cluster"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [14]:
# For quickly debugging models we will take a smaller sample from our cleaned dataset.
quick_df = df_dropped_and_dummies.sample(frac=0.1)
print quick_df.shape
print df_dropped_and_dummies.shape

(19139, 2346)
(191385, 2346)


In [15]:
# Create train test split.
#X = quick_df.drop(["hotel_cluster"], axis=1)
#y = quick_df["hotel_cluster"]
#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

### Baseline model

Our first benchmark will be given by the accuracy of a model "predicting" the majority class of our target variable.

In [16]:
len(y)

19139

In [17]:
y.value_counts().max()

688

In [18]:
y.value_counts()[:10]

91    688
48    481
41    461
65    439
18    321
42    314
98    311
95    304
16    303
50    298
Name: hotel_cluster, dtype: int64

In [19]:
baseline_accuracy = float(y.value_counts()[91]) / len(y)
print "Our baseline accuracy is: " + str(baseline_accuracy)

Our baseline accuracy is: 0.0359475416688
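
The cells above look up class 91 by hand. A small sketch of the same calculation that finds the majority class automatically; the helper name `majority_baseline` and the toy labels are hypothetical:

```python
import pandas as pd

def majority_baseline(y):
    """Return (most frequent class, share of rows belonging to it)."""
    # value_counts sorts by frequency, most frequent first;
    # normalize=True converts counts into proportions.
    shares = y.value_counts(normalize=True)
    return shares.index[0], shares.iloc[0]

# Invented labels: class 91 appears in 2 of 5 rows.
toy_y = pd.Series([91, 48, 91, 41, 65], name="hotel_cluster")
majority_class, baseline_accuracy = majority_baseline(toy_y)

print("class %s, baseline accuracy %.2f" % (majority_class, baseline_accuracy))
```

Applied to the notebook's `y`, the second value would be the baseline accuracy without hardcoding the class label.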


### Logistic Regression

In [20]:
# We specify our model (we will later fit another one using gridsearch to optimize the hyperparameters)
logit = LogisticRegression(multi_class="multinomial", solver="lbfgs")

In [23]:
# We train the model on our data.
logit_fit = logit.fit(X=X_train, y=y_train)

In [24]:
y_predicted = logit_fit.predict(X=X_test)

In [25]:
# We calculate our out of sample or test accuracy
print accuracy_score(y_pred=y_predicted, y_true=y_test)

0.0617479417353


## Decision Tree & Random Forest

In [37]:
from sklearn.tree import DecisionTreeClassifier

DT_model = DecisionTreeClassifier (max_depth=10)

X = X_train
y = y_train    
    
# Fits the model
DT_model.fit(X, y)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=10,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

In [38]:
y_predicted_DT = DT_model.predict(X=X_test)

In [41]:
# We calculate our out of sample or test accuracy
print accuracy_score(y_pred=y_predicted_DT, y_true=y_test)

0.107618987302


In [65]:
DT= DecisionTreeClassifier()
param_grid = {'max_depth': [10, 20,30,40,50]}

grid_DT = grid_search.GridSearchCV(DT, param_grid, cv=10)
grid_DT.fit(X_train, y_train)

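# Report the decision-tree search. This cell was originally run with grid_RF here,
# which is why the output below lists n_estimators and a RandomForestClassifier.
print grid_DT.grid_scores_
print grid_DT.best_score_ # mean cross-validated accuracy of the best parameter setting
print grid_DT.best_estimator_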

[mean: 0.12453, std: 0.00178, params: {'n_estimators': 100, 'max_depth': 30}, mean: 0.12552, std: 0.00168, params: {'n_estimators': 200, 'max_depth': 30}, mean: 0.11171, std: 0.00124, params: {'n_estimators': 100, 'max_depth': 50}, mean: 0.11262, std: 0.00211, params: {'n_estimators': 200, 'max_depth': 50}]
0.125519586359
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=30, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=200, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)


What about overfitting? Can I improve my model even further by using Random Forests instead of Decision Trees?
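
One common way to check for overfitting is to compare accuracy on the training split with accuracy on the test split: a large gap suggests the tree has memorised the training rows. The sketch below uses synthetic data, not the Expedia sample, and the names `X_toy`, `y_toy` and `results` are invented for the example.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: one informative feature plus label noise.
rng = np.random.RandomState(42)
X_toy = rng.normal(size=(600, 5))
y_toy = (X_toy[:, 0] + rng.normal(scale=1.0, size=600) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.33, random_state=42)

results = {}
for depth in (3, None):  # None lets the tree grow until every leaf is pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    train_acc = accuracy_score(y_tr, tree.predict(X_tr))
    test_acc = accuracy_score(y_te, tree.predict(X_te))
    results[depth] = (train_acc, test_acc)
    print("max_depth=%s train=%.3f test=%.3f" % (depth, train_acc, test_acc))
```

Running the same comparison on `X_train`/`X_test` for each `max_depth` would show whether deeper trees keep helping on unseen data or only on the training rows.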

### Random Forest

In [42]:
from sklearn.ensemble import RandomForestClassifier

RF_model = RandomForestClassifier(n_estimators = 20, oob_score= True)
    
RF_model.fit(X, y)

  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])


RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=20, n_jobs=1, oob_score=True, random_state=None,
            verbose=0, warm_start=False)

In [43]:
RF_model.oob_score_ 

0.094176733449273548

My first try does not beat the Decision Tree: the out-of-bag score (0.094) is below the tree's test accuracy (0.108). I'll try different parameters through cross-validation.

In [46]:
from sklearn.cross_validation import cross_val_score

In [53]:
RF= RandomForestClassifier()
# Grid matching the output shown below (60/70 estimators, depth 20/30).
param_grid = {
                 'n_estimators': [60, 70],
                 'max_depth': [20, 30]
             }

grid_RF = grid_search.GridSearchCV(RF, param_grid, cv=10)
grid_RF.fit(X_train, y_train)

print grid_RF.grid_scores_
print grid_RF.best_score_ # mean cross-validated accuracy of the best parameter setting
print grid_RF.best_estimator_

[mean: 0.11895, std: 0.00198, params: {'n_estimators': 60, 'max_depth': 20}, mean: 0.11955, std: 0.00182, params: {'n_estimators': 70, 'max_depth': 20}, mean: 0.12291, std: 0.00139, params: {'n_estimators': 60, 'max_depth': 30}, mean: 0.12387, std: 0.00177, params: {'n_estimators': 70, 'max_depth': 30}]
0.123874067084
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=30, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=70, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)


In [54]:
RF= RandomForestClassifier()
param_grid = {
                 'n_estimators': [100, 200],
                 'max_depth': [30, 50]
             }

grid_RF = grid_search.GridSearchCV(RF, param_grid, cv=10)
grid_RF.fit(X_train, y_train)

print grid_RF.grid_scores_
print grid_RF.best_score_ 
print grid_RF.best_estimator_

[mean: 0.12453, std: 0.00178, params: {'n_estimators': 100, 'max_depth': 30}, mean: 0.12552, std: 0.00168, params: {'n_estimators': 200, 'max_depth': 30}, mean: 0.11171, std: 0.00124, params: {'n_estimators': 100, 'max_depth': 50}, mean: 0.11262, std: 0.00211, params: {'n_estimators': 200, 'max_depth': 50}]
0.125519586359
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=30, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=200, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)


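**My best Random Forest result is a mean cross-validated accuracy of 0.125519586359, achieved with 200 estimators and a maximum depth of 30. The output shown under the Decision Tree grid search above is identical to this one (it lists `n_estimators` and a `RandomForestClassifier`), so it does not provide an independent Decision Tree score. There is probably a better result out there, but each try takes hours to run.**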

In [64]:
#Which are my most important features?
features = X.columns
feature_importances = grid_RF.best_estimator_.feature_importances_
features_df = pd.DataFrame({'Features': features, 'Importance Score': feature_importances})
features_df.sort_values('Importance Score', inplace=True, ascending=False)
print features_df

                        Features  Importance Score
0      orig_destination_distance          0.229070
3                srch_adults_cnt          0.060068
4              srch_children_cnt          0.047505
1031            hotel_market_628          0.029863
2                     is_package          0.029744
1                      is_mobile          0.027005
241   srch_destination_type_id_1          0.021183
5                    srch_rm_cnt          0.020164
245   srch_destination_type_id_6          0.018114
6                     is_booking          0.017440
1078            hotel_market_675          0.014203
251            hotel_continent_2          0.013086
253            hotel_continent_4          0.012617
263              hotel_country_8          0.012054
299             hotel_country_50          0.011672
242   srch_destination_type_id_3          0.009120
244   srch_destination_type_id_5          0.008655
56      user_location_region_174          0.008199
1846           hotel_market_150

Since my most important feature is 'orig_destination_distance' (the one with a lot of missing values), doing something better than dropping those rows could improve my model.

# KNN

- For KNN we first need to standardize our feature matrix; otherwise the different scales of our features will distort the distances the model relies on.
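
A hand-rolled sketch of what standardization does, written with NumPy only so each step is visible. It computes the mean and standard deviation on the training rows and reuses them for the test rows, which is one common way to keep test information out of the preprocessing; this differs from the cells below, which fit the scaler on the full matrix. The arrays are synthetic and all names are invented for the example.

```python
import numpy as np

rng = np.random.RandomState(0)

# Two invented features on very different scales: a distance and a 0/1 flag.
X_train = np.column_stack([rng.uniform(0, 5000, 200),
                           rng.randint(0, 2, 200)]).astype(float)
X_test = np.column_stack([rng.uniform(0, 5000, 50),
                          rng.randint(0, 2, 50)]).astype(float)

# Statistics come from the training rows only.
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)
std[std == 0] = 1.0  # avoid dividing by zero for constant columns

X_train_scaled = (X_train - mean) / std
X_test_scaled = (X_test - mean) / std  # reuse the training statistics

print(X_train_scaled.mean(axis=0).round(6), X_train_scaled.std(axis=0).round(6))
```

After scaling, both columns have mean 0 and standard deviation 1 on the training rows, so neither dominates the Euclidean distance.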

In [87]:
from sklearn import preprocessing
from sklearn.neighbors import KNeighborsClassifier

In [72]:
X = df_dropped_and_dummies.drop(["hotel_cluster"], axis=1)
y = df_dropped_and_dummies["hotel_cluster"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [73]:
# Define preprocessor
scaler = preprocessing.StandardScaler()

In [74]:
# Fit preprocessor
scaled_x = scaler.fit_transform(X)

In [75]:
scaled_X= pd.DataFrame(scaled_x, index=X.index, columns=X.columns)

In [76]:
scaled_X.head()

Unnamed: 0,orig_destination_distance,is_mobile,is_package,srch_adults_cnt,srch_children_cnt,srch_rm_cnt,is_booking,posa_continent_0,posa_continent_1,posa_continent_2,...,hotel_market_2107,hotel_market_2108,hotel_market_2109,hotel_market_2110,hotel_market_2111,hotel_market_2112,hotel_market_2113,hotel_market_2115,hotel_market_2116,hotel_market_2117
0,-0.547293,-0.39541,1.76024,-1.153843,0.863504,-0.240598,-0.298897,-0.090184,-0.228907,-0.071962,...,-0.007581,-0.004572,-0.002286,-0.003233,-0.014816,-0.003959,-0.005111,-0.002286,-0.004572,-0.012728
1,-0.813717,-0.39541,-0.568105,-0.042304,0.863504,-0.240598,-0.298897,-0.090184,-0.228907,-0.071962,...,-0.007581,-0.004572,-0.002286,-0.003233,-0.014816,-0.003959,-0.005111,-0.002286,-0.004572,-0.012728
2,-0.726792,-0.39541,-0.568105,-1.153843,-0.46209,-0.240598,-0.298897,-0.090184,-0.228907,-0.071962,...,-0.007581,-0.004572,-0.002286,-0.003233,-0.014816,-0.003959,-0.005111,-0.002286,-0.004572,-0.012728
3,0.214003,-0.39541,-0.568105,-1.153843,-0.46209,-0.240598,-0.298897,-0.090184,-0.228907,-0.071962,...,-0.007581,-0.004572,-0.002286,-0.003233,-0.014816,-0.003959,-0.005111,-0.002286,-0.004572,-0.012728
4,-0.393925,2.529022,-0.568105,-1.153843,-0.46209,-0.240598,-0.298897,-0.090184,-0.228907,-0.071962,...,-0.007581,-0.004572,-0.002286,-0.003233,-0.014816,-0.003959,-0.005111,-0.002286,-0.004572,-0.012728


In [79]:
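# hotel_cluster is a class label, not a numeric feature, so it is left unscaled
# (calling scaler.fit_transform(y) would also refit the scaler on the target).
# The split below uses y as is.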



In [80]:
X_train, X_test, y_train, y_test = train_test_split(scaled_X, y, test_size=0.33, random_state=42)

In [88]:
knn= KNeighborsClassifier(n_neighbors=50, weights='uniform')

In [None]:
knn_fit= knn.fit(X=X_train, y=y_train)

In [None]:
y_predicted_knn = knn_fit.predict(X=X_test)

In [None]:
# We calculate our out of sample or test accuracy
print accuracy_score(y_pred=y_predicted_knn, y_true=y_test)

In [None]:
# 3-fold cross-validated search over the number of neighbours, on the scaled features
kf = cross_validation.KFold(len(scaled_X), n_folds=3, shuffle=True)
params = np.arange(30, 60, 10)
gs = grid_search.GridSearchCV(
    estimator=neighbors.KNeighborsClassifier(),
    param_grid={
        'n_neighbors': params,
        # 'weights': ['distance', 'uniform'],
    },
    cv=kf
)
gs.fit(scaled_X, y)

print gs.grid_scores_
print gs.best_score_ # mean cross-validated accuracy of the best parameter setting
print gs.best_estimator_ # explains which grid_search setup worked best


Problem: KNN takes too long to run because of the number of columns in my dataset (more than 2,300 after creating the dummies).

## Conclusions:

The Expedia dataset had the following challenges:

- Expedia's dataset was too large to analyse in full.
- There was also a very large number of missing values, most of them present in a single feature ('orig_destination_distance').
- Most of the features were categorical, with a very large number of categories. Handling them meant creating more than 2,000 dummies!
- Dates were not recorded using a consistent method, mixing 'm-d-y' with 'd-m-y' formats. This produced inconsistent results when trying to create a feature called 'stay_lenght' (not shown in the final output).

How did I tackle these issues?

- I created a sample of the dataset, to be able to handle it.
- I dropped the rows with missing values, as a starting point.
- I created dummies, but dropped the feature with the most categories (user_location_city).
- I did not have time to create the 'stay_lenght' variable, but this could be a following step.
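
A minimal sketch of how the stay-length feature could be built while guarding against badly formatted dates. It assumes the expected layout is `YYYY-MM-DD` and treats anything else as missing rather than guessing the day/month order; the toy rows are invented, and the column name follows the 'stay_lenght' spelling used in this notebook.

```python
import pandas as pd

# Invented rows: a valid stay, a same-day stay, a d-m-y date, and a missing check-in.
toy = pd.DataFrame({
    "srch_ci": ["2014-03-20", "2013-12-23", "23-12-2013", None],
    "srch_co": ["2014-03-22", "2013-12-23", "26-12-2013", "2014-01-05"],
})

# errors="coerce" turns strings that do not match the format into NaT
# instead of raising or silently swapping day and month.
ci = pd.to_datetime(toy["srch_ci"], format="%Y-%m-%d", errors="coerce")
co = pd.to_datetime(toy["srch_co"], format="%Y-%m-%d", errors="coerce")

toy["stay_lenght"] = (co - ci).dt.days

# Keep only rows with a positive, known stay length; .copy() avoids
# SettingWithCopyWarning if columns are added to the result later.
valid = toy[toy["stay_lenght"] > 0].copy()

print(valid)
```

Rows with unparseable or missing dates end up with a missing stay length and are filtered out together with zero-night stays.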

Results:

- My best model, the Random Forest, reached 12.5% accuracy (mean 10-fold cross-validation score). This is a large increase over the baseline (3.6%) and the logistic regression (6.2%). 
- There is potential for growth using the same features and models, by allowing the Random Forest to create more trees, yet I lack the computing power to calculate that. 

# Next Steps:

The path is open to do several things that might improve my model:

- Using tools that allow me to devote more computing power to the models I was not able to run (Random Forest with more than 200 trees, KNN).
- The same computing power might be used to work with a larger sample or the whole dataset. Yet we believe that the improvement should not be dramatic (considering we have a robust sample).
- Several things can be improved in the handling of the features themselves: 
   - Since the top predicting feature in the Random Forest was 'orig_destination_distance', the feature with the most missing cases, finding a better way to handle these NaNs might greatly improve our results (e.g., imputing mean values).
   - Another feature that could enhance the explanatory power of our model is the constructed 'stay_lenght'. Resolving the issues with the date format and including the new feature in our analysis could result in improvement. 
   - Finding a way to incorporate 'user_location_city' in our model may also improve results.
   - Finally, there could be other original ways to classify searches: were they booking for summer or winter? To unlock this feature we would need further information on how each region is classified (number to region), something we do not have.
- Additional time and resources are recommended to improve the model.
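
One possible way to bring 'user_location_city' into the model without thousands of extra dummies is to keep only the most frequent cities and group the rest into a single bucket. The sketch below is an assumption, not something tried in this notebook: the helper `keep_top_categories`, the `other_label` value of -1 and the toy city codes are all hypothetical.

```python
import pandas as pd

def keep_top_categories(series, top_n, other_label=-1):
    """Keep the top_n most frequent values and replace all others with other_label."""
    top = series.value_counts().index[:top_n]
    return series.where(series.isin(top), other_label)

# Invented city codes: 10 and 20 are frequent, 30 and 40 appear once each.
cities = pd.Series([10, 10, 10, 20, 20, 30, 40], name="user_location_city")

reduced = keep_top_categories(cities, top_n=2)
dummies = pd.get_dummies(reduced, prefix="user_location_city")

print(list(dummies.columns))
```

The number of dummy columns is then bounded by `top_n + 1`, whatever the number of distinct cities in the data.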

In [3]:
EXPEDIA_FOLDER = "/Users/veronicabianchini/Documents/GA-Data Science/Expedia Project/"
sample= pd.read_csv(EXPEDIA_FOLDER+"sampled_df_2017-05-06-13-59-14.csv")

In [4]:
sample.shape

(300100, 25)

In [5]:
sample['orig_destination_distance'].fillna(sample.mean()['orig_destination_distance'], inplace=True)
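
Mean imputation gives every filled row the same distance, so the model can no longer tell which values were originally missing. A possible variant, sketched here on invented numbers, records that information in a separate flag before filling; the `_missing` column name is hypothetical.

```python
import numpy as np
import pandas as pd

col = "orig_destination_distance"
toy = pd.DataFrame({col: [100.0, np.nan, 300.0, np.nan]})

# Flag the rows that were missing before any filling takes place.
toy[col + "_missing"] = toy[col].isnull().astype(int)

# Fill the gaps with the mean of the observed values (200.0 for this toy column).
toy[col] = toy[col].fillna(toy[col].mean())

print(toy)
```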


In [6]:
sample.isnull().sum()  # number of missing values per column


Unnamed: 0                     0
date_time                      0
site_name                      0
posa_continent                 0
user_location_country          0
user_location_region           0
user_location_city             0
orig_destination_distance      0
user_id                        0
is_mobile                      0
is_package                     0
channel                        0
srch_ci                      397
srch_co                      397
srch_adults_cnt                0
srch_children_cnt              0
srch_rm_cnt                    0
srch_destination_id            0
srch_destination_type_id       0
is_booking                     0
cnt                            0
hotel_continent                0
hotel_country                  0
hotel_market                   0
hotel_cluster                  0
dtype: int64

In [7]:
clean_sample2=sample.dropna()

In [8]:
from pandas import to_datetime

In [9]:
clean_sample2['srch_co2']=pd.to_datetime(clean_sample2['srch_co'],format='%Y-%m-%d')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':


In [10]:
clean_sample2['srch_ci2']=pd.to_datetime(clean_sample2['srch_ci'],format='%Y-%m-%d')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':


In [11]:
clean_sample2['stay_lenght']= clean_sample2['srch_co2']-clean_sample2['srch_ci2']

clean_sample2['stay_lenght']=clean_sample2['stay_lenght'].dt.days

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  app.launch_new_instance()


In [12]:
error_dates=clean_sample2[clean_sample2['stay_lenght']<=0]

In [72]:
error_dates.head()

Unnamed: 0.1,Unnamed: 0,date_time,site_name,posa_continent,user_location_country,user_location_region,user_location_city,orig_destination_distance,user_id,is_mobile,...,srch_destination_type_id,is_booking,cnt,hotel_continent,hotel_country,hotel_market,hotel_cluster,srch_co2,srch_ci2,stay_lenght
54,4563,2013-02-10 13:18:32,2,3,66,184,26748,1659.1186,28447,0,...,1,0,1,4,8,110,65,2013-03-19,2013-03-19,0
204,23465,2013-12-03 20:43:30,37,1,46,171,24727,5847.8806,101809,1,...,1,0,1,3,106,107,64,2013-12-23,2013-12-23,0
270,20995,2013-12-14 11:57:40,2,3,66,258,6945,1725.9137,93370,0,...,1,0,1,4,47,1508,52,2014-03-29,2014-03-29,0
301,38012,2013-09-18 22:37:39,2,3,66,174,47371,640.784,145233,0,...,4,0,1,2,50,659,17,2013-10-18,2013-10-18,0
320,34202,2014-02-27 12:33:43,2,3,66,356,22202,1103.0742,134013,0,...,1,0,1,2,50,662,92,2014-03-20,2014-03-20,0


In [13]:
clean_sample2.shape

(299703, 28)

In [14]:
clean_sample3=clean_sample2[clean_sample2.stay_lenght>0]

In [15]:
clean_sample3.shape

(298529, 28)

In [16]:
clean_sample3[clean_sample3['stay_lenght']<0]

Unnamed: 0.1,Unnamed: 0,date_time,site_name,posa_continent,user_location_country,user_location_region,user_location_city,orig_destination_distance,user_id,is_mobile,...,srch_destination_type_id,is_booking,cnt,hotel_continent,hotel_country,hotel_market,hotel_cluster,srch_co2,srch_ci2,stay_lenght


### Now that I have improved my dataset, let's go back to modelling

In [17]:
#from sklearn import datasets, neighbors, metrics, grid_search, cross_validation, linear_model

from sklearn.grid_search import GridSearchCV
#from sklearn import neighbors
from sklearn import cross_validation
#from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
#from sklearn import grid_search
from sklearn.metrics import accuracy_score, confusion_matrix



In [18]:
#Get columns to convert to dummy
columns_to_dummies = [
    'posa_continent',
    'user_location_country',
    'user_location_region',
    'srch_destination_type_id',
    'hotel_continent', 
    'hotel_country',
    'hotel_market'
]

columns_to_drop = [
    "Unnamed: 0",
    "date_time",
    "site_name",
    "user_location_city",
    "srch_ci2",
    "srch_co2",
    "srch_ci",
    "srch_co",
    "srch_destination_id",
    "user_id",
    "cnt",
    "channel"
]

In [19]:
df2_dropped = clean_sample3.drop(columns_to_drop, axis=1)
df2_dropped.head(2)

Unnamed: 0,posa_continent,user_location_country,user_location_region,orig_destination_distance,is_mobile,is_package,srch_adults_cnt,srch_children_cnt,srch_rm_cnt,srch_destination_type_id,is_booking,hotel_continent,hotel_country,hotel_market,hotel_cluster,stay_lenght
0,2,23,48,1975.741688,0,0,8,0,4,1,0,3,151,69,81,1
1,3,66,352,1975.741688,1,1,2,0,1,1,0,4,8,126,2,5


In [20]:
# Create dummy columns and drop original feature columns.
df2_dropped_and_dummies = pd.get_dummies(df2_dropped, columns=columns_to_dummies)
df2_dropped_and_dummies.head(2)

Unnamed: 0,orig_destination_distance,is_mobile,is_package,srch_adults_cnt,srch_children_cnt,srch_rm_cnt,is_booking,hotel_cluster,stay_lenght,posa_continent_0,...,hotel_market_2107,hotel_market_2108,hotel_market_2109,hotel_market_2110,hotel_market_2111,hotel_market_2112,hotel_market_2113,hotel_market_2115,hotel_market_2116,hotel_market_2117
0,1975.741688,0,0,8,0,4,0,81,1,0,...,0,0,0,0,0,0,0,0,0,0
1,1975.741688,1,1,2,0,1,0,2,5,0,...,0,0,0,0,0,0,0,0,0,0


In [22]:
X2 = df2_dropped_and_dummies.drop(["hotel_cluster"], axis=1)
y2 = df2_dropped_and_dummies["hotel_cluster"]
X_train2, X_test2, y_train2, y_test2 = train_test_split(X2, y2, test_size=0.33, random_state=42)

In [31]:
from sklearn.tree import DecisionTreeClassifier

DT_model = DecisionTreeClassifier (max_depth=100)

X = X_train2
y = y_train2    
    
# Fits the model
DT_model.fit(X, y)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=100,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

In [32]:
y_predicted_DT = DT_model.predict(X=X_test2)

In [33]:

# We calculate our out of sample or test accuracy
print accuracy_score(y_pred=y_predicted_DT, y_true=y_test2)

0.098563670507


In [36]:
from sklearn.ensemble import RandomForestClassifier

RF_model_2 = RandomForestClassifier(n_estimators = 200, max_depth= 60, oob_score= True)
    
RF_model_2.fit(X_train2, y_train2)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=60, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=200, n_jobs=1, oob_score=True, random_state=None,
            verbose=0, warm_start=False)

In [35]:
RF_model_2.oob_score_ 

0.11100222984391092