### About the Dataset

We will be using the file 'train_users_2.csv'.<br><br>
The dataset has 213451 rows and consists of the following columns:<br>
1. id: user id
2. date_account_created: the date of account creation
3. timestamp_first_active: timestamp of the first activity, note that it can be earlier than date_account_created or date_first_booking because a user can search before signing up
4. date_first_booking: date of first booking
5. gender
6. age
7. signup_method
8. signup_flow: the page a user came to sign up from
9. language: international language preference
10. affiliate_channel: what kind of paid marketing
11. affiliate_provider: where the marketing is e.g. google, craigslist, other
12. first_affiliate_tracked: what's the first marketing the user interacted with before signing up
13. signup_app
14. first_device_type
15. first_browser
16. country_destination: this is the target variable you are to predict

Note: There are 12 possible outcomes of the destination country: 'US', 'FR', 'CA', 'GB', 'ES', 'IT', 'PT', 'NL', 'DE', 'AU', 'NDF' (no destination found), and 'other'. Please note that 'NDF' is different from 'other': 'other' means there was a booking, but to a country not included in the list, while 'NDF' means there wasn't a booking.

### Imported the required libraries

In [None]:
import pandas as pd
import numpy as np
from pandas import DataFrame, Series

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

import warnings
warnings.filterwarnings("ignore")

### Loaded the dataset and checked the number of rows in the dataset

In [None]:
df_train=pd.read_csv("Datasets/train_users_2.csv")
df_test=pd.read_csv("Datasets/test_users.csv")
print(str('There are total '+ str(df_train.shape[0])+' rows in training dataset.'))
print(str('There are total '+ str(df_test.shape[0])+' rows in test dataset.'))

### Now we'll start the cleaning process:
#### Step 1: Check for Nulls

In [None]:
df_train.isnull().sum()

In [None]:
df_test.isnull().sum()

We can see that age, date_first_booking and first_affiliate_tracked have NULL values.

#### Step 2: Check for NULL values which are in some other format

In some cases, missing values are not stored as NaN but as a placeholder such as '-unknown-'. These can be identified by looking at the unique values in each column, as shown below:

a. Affiliate Channel:

In [None]:
df_train.affiliate_channel.unique()

b. Affiliate Provider:

In [None]:
df_train.affiliate_provider.unique()

c. Age:

In [None]:
df_train.age.unique()

d. Country destination:

In [None]:
df_train.country_destination.unique()

e. Date Account Created:

In [None]:
df_train.date_account_created.unique()

f. Date First Booking:

In [None]:
df_train.date_first_booking.unique()

g. First Affiliate Tracked 

In [None]:
df_train.first_affiliate_tracked.unique()

h. First Browser

In [None]:
df_train.first_browser.unique()

###### Here, in the browsers list, we can see the value '-unknown-'.

In [None]:
from IPython.display import Image
Image(filename="img/Browser.jpg", width=550, height=350)

i. First Device Type

In [None]:
df_train.first_device_type.unique()

j. Gender

In [None]:
df_train.gender.unique()

###### Similarly for Gender.

In [None]:
Image(filename="img/Gender.jpg", width=450, height=300)

k. Id

In [None]:
df_train.id.unique()

l. Language

In [None]:
df_train.language.unique()

###### Similarly for Language.

m. Signup App

In [None]:
df_train.signup_app.unique()

n. Signup Flow

In [None]:
df_train.signup_flow.unique()

o. Signup Method

In [None]:
df_train.signup_method.unique()

p. Timestamp First Active

In [None]:
df_train.timestamp_first_active.unique()

Now we will replace the '-unknown-' values with NaN.

In [None]:
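#Assign the result back instead of calling replace(..., inplace=True) on a column accessor
df_train['first_browser']=df_train['first_browser'].replace('-unknown-', np.nan)
df_train['gender']=df_train['gender'].replace('-unknown-', np.nan)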
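
The column-by-column check in Step 2 can also be sketched as a single pass over the whole frame. The snippet below is only an illustration on a made-up toy frame (`toy` and the helper `count_placeholders` are hypothetical names, not part of this notebook); it counts how often the placeholder string appears in each column and then replaces it everywhere.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_train; the values are made up for illustration.
toy = pd.DataFrame({
    "gender": ["MALE", "-unknown-", "FEMALE", "-unknown-"],
    "first_browser": ["Chrome", "Safari", "-unknown-", "Firefox"],
    "signup_app": ["Web", "iOS", "Web", "Android"],
})

def count_placeholders(df, placeholder="-unknown-"):
    """Return, for each column containing it, how many cells equal the placeholder."""
    counts = df.eq(placeholder).sum()
    return counts[counts > 0]

placeholder_counts = count_placeholders(toy)
print(placeholder_counts)

# Replace the placeholder in every column, then confirm it is now counted as missing.
cleaned = toy.replace("-unknown-", np.nan)
print(cleaned.isnull().sum())
```

Running the same helper on `df_train` and `df_test` would list every column that still hides missing values behind '-unknown-'.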

Overview of the data:

In [None]:
df_train.head()

#### Step 3: Check for anomalies in numerical variables

Here, we only have Age as a continuous numerical variable.

Obtain the summary statistics of Age:

In [None]:
df_train.age.describe()

Here, we can see some implausible values: the minimum age is 1 and the maximum is 2014. So, we will analyze this further by looking at users with age less than 15 and more than 150.

In [None]:
print('Summary statistics for age<15:')
print(' ')
print(df_train[df_train.age<15].describe())
print('-------------------------------------------------------')
print('Summary statistics for age>150:')
print(' ')
print(df_train[df_train.age>150].describe())

Here we can see that for ages above 150, the users have entered their year of birth instead of their age. We can fix this by subtracting the given year from the current year (for this dataset it was 2015) to get the age of the user.<br>
Ages less than 15 can be considered incorrect inputs and can be filtered out.

In [None]:
df_test.loc[df_test.age>18, 'age']

In [None]:
df_abnormal_age=df_train['age']>150
df_train.loc[df_abnormal_age,'age']=2015 - df_train.loc[df_abnormal_age,'age']
df_train.age.describe()

Now we will keep only ages between 18 and 100, as these are the relevant ones, and replace all others with NaN.

In [None]:
df_train.loc[df_train.age<18,'age']=np.nan
df_train.loc[df_train.age>100,'age']=np.nan
df_train.age.describe()
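
The age rule described above (treat values above 150 as a year of birth, then keep only ages from 18 to 100) can be sketched as one small helper. This is a minimal sketch on made-up values; the function name `clean_age` and its parameter names are hypothetical, and the reference year 2015 is the one stated in the text.

```python
import numpy as np
import pandas as pd

def clean_age(age, reference_year=2015, min_age=18, max_age=100):
    """Convert birth years to ages, then blank out ages outside [min_age, max_age]."""
    age = age.astype(float).copy()
    looks_like_year = age > 150
    age[looks_like_year] = reference_year - age[looks_like_year]
    age[(age < min_age) | (age > max_age)] = np.nan
    return age

# Made-up inputs: a valid age, a birth year, a child, an over-100 value,
# a birth year that yields an age of 1, and a missing value.
raw = pd.Series([25, 1985, 5, 105, 2014, np.nan])
cleaned_age = clean_age(raw)
print(cleaned_age.tolist())
```

Wrapping the rule in a function makes it easier to apply the same logic to both the train and the test set.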

#### Step 4: Removing unwanted values

In the 'About the Dataset' section, it was mentioned that one of the values in the country_destination column is 'NDF', i.e. 'no destination found', which means the user has not booked any destination yet. We filter out those rows because they do not add any value to our analysis of bookings.

In [None]:
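#.copy() gives an independent frame, so the column assignments in Step 5 do not act on a view of df_train
df_without_ndf=df_train[df_train.country_destination !='NDF'].copy()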
df_without_ndf.head()

#### Step 5: Convert the columns into desired format

Convert the date_account_created, date_first_booking and timestamp_first_active columns into datetime format.

Before conversion, none of these three columns is stored as a datetime, as shown below:

In [None]:
print('Datatype of date_account_created: '+ str(df_without_ndf.date_account_created.dtype))
print('')
print('Datatype of date_first_booking: '+ str(df_without_ndf.date_first_booking.dtype))
print('')
print('Datatype of timestamp_first_active: '+ str(df_without_ndf.timestamp_first_active.dtype))

Now, convert the dates into datetime format.

In [None]:
#To convert the dates into datetime format
df_without_ndf['date_account_created']=pd.to_datetime(df_without_ndf['date_account_created'])
df_without_ndf['date_first_booking']=pd.to_datetime(df_without_ndf['date_first_booking'])
#Integer division by 1000000 drops the trailing HHMMSS digits and keeps only YYYYMMDD
df_without_ndf['timestamp_first_active']=pd.to_datetime(df_without_ndf['timestamp_first_active']//1000000, format='%Y%m%d')

Check the datatypes of the columns:

In [None]:
print('Datatype of date_account_created: '+ str(df_without_ndf.date_account_created.dtype))
print('')
print('Datatype of date_first_booking: '+ str(df_without_ndf.date_first_booking.dtype))
print('')
print('Datatype of timestamp_first_active: '+ str(df_without_ndf.timestamp_first_active.dtype))
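
The conversion above truncates timestamp_first_active to its date part. If the time of day is ever needed, the full value can be parsed instead. The sketch below assumes the values follow a 14-digit YYYYMMDDHHMMSS layout (which the division by 1000000 implies); the two sample values are made up.

```python
import pandas as pd

# Made-up values in the assumed YYYYMMDDHHMMSS layout.
raw_timestamps = pd.Series([20090319043255, 20140630235959])

# Parse the full value instead of truncating it to the date part.
parsed = pd.to_datetime(raw_timestamps.astype(str), format="%Y%m%d%H%M%S")
print(parsed)

# The date-only view used in this notebook is still available afterwards.
date_only = parsed.dt.normalize()
print(date_only)
```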

We are now done with cleaning, so we can dive into visualizing and analyzing the data:

### Visualizing and Analyzing the Airbnb user data

##### 1. How are the Destination countries distributed among the users?

In [None]:
#visualizing the distribution of user's selection of country
plt.figure(figsize=(12,6))
destination_percentage=df_without_ndf.country_destination.value_counts()/df_without_ndf.shape[0]*100
destination_percentage.plot(kind='bar')
# sns.countplot(x='country_destination', data=df_without_ndf,order=df_without_ndf.country_destination.value_counts().index)
plt.ylabel("% of users")
plt.title("Distribution of destination countries among users")
plt.xticks(rotation='horizontal')
plt.show()

About 68% of the users who made a booking chose a destination in the US. This might be because all the users are from the US and prefer to vacation within the US.

##### 2. What is the age distribution of users?

In [None]:
#Let's visualize the ages of users
plt.figure(figsize=(12,6))
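#histplot replaces the deprecated distplot; stat='density' keeps the y-axis as a density
sns.histplot(df_without_ndf.age.dropna(), kde=True, stat='density')
plt.title("Age Distribution of users")
plt.ylabel('Density')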
plt.show()

Most of our users are in the age range of 25-35 years.

##### 3. How does age vary with Destination countries?

In [None]:
#Let's check how age is distributed across the destination countries
plt.figure(figsize=(12,6))
sns.boxplot(y='age' , x='country_destination',data=df_without_ndf)
plt.title("Age Distribution across the destinations")
plt.xlabel("")
plt.show()

Almost all the countries have a similar median age. Only users travelling to Spain and Portugal are slightly younger. <br> Users aged 80 and above mostly choose the US as their destination. A possible reason is that all the users are from the US, and older people there may prefer not to travel outside their home country.

##### 4. What is the gender distribution of users?

In [None]:
#gender distribution
plt.figure(figsize=(10,6))
gender_percentage=df_without_ndf.gender.value_counts()/df_without_ndf.shape[0]*100
gender_percentage.plot(kind='bar')
# sns.countplot(x='gender',data=df_without_ndf)
plt.ylabel("% of users")
plt.title("Gender Distribution of users")
plt.xticks(rotation='horizontal')
plt.show()

35% of the users are female and 30% are male, so the difference between the genders is not large. Also, gender information is missing for 34% of the users.
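
The share of missing gender values mentioned above can also be read directly from `value_counts`. The sketch below uses a made-up gender column; with `normalize=True, dropna=False`, the missing values get their own row and the shares add up to 100%.

```python
import numpy as np
import pandas as pd

# Made-up gender column with three missing entries out of ten.
gender = pd.Series(["FEMALE", "MALE", np.nan, "FEMALE", np.nan,
                    "OTHER", "MALE", "FEMALE", np.nan, "MALE"])

# dropna=False keeps NaN as its own category; normalize=True returns fractions.
share = gender.value_counts(normalize=True, dropna=False) * 100
print(share)

# Share of missing values, taken from the row whose label is NaN.
missing_share = share[share.index.isna()].iloc[0]
print(missing_share)
```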

##### 5. How does Gender Distribution vary across the destination bookings?

In [None]:
# fig,axes=plt.subplots(nrows=1,ncols=2, figsize=(12,4))

plt.figure(figsize=(12,6))
sns.countplot(x='country_destination',data=df_without_ndf, hue='gender', 
              order=df_without_ndf.country_destination.value_counts().index)
plt.xlabel('')
plt.ylabel('No. of users')

# plt.figure(figsize=(12,6))
ctab=pd.crosstab(df_without_ndf.country_destination,df_without_ndf['gender']).apply(lambda x: x/x.sum()*100, axis=1)
ctab.plot(kind='bar',stacked=True,legend=True)
plt.ylabel('% of users')
plt.xlabel('')
plt.xticks(rotation='horizontal')

plt.legend(loc='upper left', bbox_to_anchor=(1,1), ncol=1)

plt.show()

Bookings made by females are slightly higher for most of the destination countries, except for Canada, Germany, the Netherlands and other (unlisted) countries, where bookings by males are slightly higher than those by females.
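
The row percentages behind the stacked chart can also be obtained with the `normalize='index'` option of `pd.crosstab`. The sketch below, on a made-up toy frame, shows that it gives the same numbers as dividing each row by its sum.

```python
import pandas as pd

# Made-up bookings; only the column names match this notebook.
toy = pd.DataFrame({
    "country_destination": ["US", "US", "US", "FR", "FR", "DE"],
    "gender": ["FEMALE", "MALE", "FEMALE", "MALE", "FEMALE", "MALE"],
})

# normalize='index' divides every row by its own total.
row_pct = pd.crosstab(toy["country_destination"], toy["gender"], normalize="index") * 100

# The manual version: divide each row by its sum.
manual = pd.crosstab(toy["country_destination"], toy["gender"]).apply(
    lambda row: row / row.sum() * 100, axis=1)

print(row_pct)
```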

##### 6. Which is the most popular signup Application among the users?

In [None]:
#User signup app distribution
plt.figure(figsize=(12,6))
signup_app_percentage=df_without_ndf.signup_app.value_counts()/df_without_ndf.shape[0]*100
signup_app_percentage.plot(kind='bar')
# sns.countplot(x='signup_app', data=df_without_ndf, order=df_without_ndf.signup_app.value_counts().index)
plt.title("Signup app distribution of users")
plt.ylabel('% of users')
plt.xticks(rotation='horizontal')
plt.show()

More than 80% of the users sign up using Web, followed by iOS, Mobile Web and Android.

##### 7. Which signup Application is used by users to book their country destinations?

Note: For clear visualization of the data for countries other than US, I displayed both the charts: one with US and one excluding US. I'll follow this similarly for subsequent visualizations wherever required.

In [None]:
df_without_ndf_and_US=df_without_ndf[df_without_ndf.country_destination!='US']

In [None]:
# fig, axes=plt.subplots(nrows=1,ncols=2,figsize=(15,5))
plt.figure(figsize=(12,6))
sns.countplot(x='country_destination',data=df_without_ndf,hue='signup_app',
              order=df_without_ndf.country_destination.value_counts().index,
              hue_order=['Web', 'iOS', 'Moweb', 'Android'])
plt.title("Distribution of Signup app across destination countries")
plt.ylabel('No. of users')
plt.xlabel('')

plt.figure(figsize=(12,6))
sns.countplot(x='country_destination',data=df_without_ndf_and_US,hue='signup_app',
              order=df_without_ndf_and_US.country_destination.value_counts().index,
              hue_order=['Web', 'iOS', 'Moweb', 'Android'])
plt.title("Distribution of Signup app across destination countries excluding US")
plt.ylabel('No. of users')
plt.xlabel('')

plt.show()

We can see that users mostly sign up through the Web, regardless of which destination country they book.

##### 8. Which channel attracts more users to Airbnb?

In [None]:
plt.figure(figsize=(12,6))
affiliate_channel_percentage=df_without_ndf.affiliate_channel.value_counts()/df_without_ndf.shape[0]*100
affiliate_channel_percentage.plot(kind='bar')
plt.title('Distribution of Affiliate channels used to attract the users')
plt.ylabel('% of users')
plt.xticks(rotation='horizontal')
# sns.countplot(x='affiliate_channel',data=df_without_ndf,order=df_without_ndf.affiliate_channel.value_counts().index)
plt.show()

Direct paid marketing is responsible for attracting 60% of the users who book a place using Airbnb.

##### 9. Which Affiliate channel attracts users to book places in their destination countries using Airbnb?

In [None]:
df_without_ndf.affiliate_channel.unique()

In [None]:
#Channel Distribution based on Destination countries
# fig, axes=plt.subplots(nrows=1, ncols=2,figsize=(15,5))
plt.figure(figsize=(12,6))
sns.countplot(x='country_destination',data=df_without_ndf,hue='affiliate_channel',
              order=df_without_ndf.country_destination.value_counts().index,
              hue_order=['direct', 'sem-brand', 'sem-non-brand', 'seo', 'other', 'api', 'content', 'remarketing'])
plt.title('Distribution of Affiliate channels among the destination countries')
plt.ylabel('No. of users')

plt.figure(figsize=(12,6))
sns.countplot(x='country_destination',data=df_without_ndf_and_US,hue='affiliate_channel',
              order=df_without_ndf_and_US.country_destination.value_counts().index,
              hue_order=['direct', 'sem-brand', 'sem-non-brand', 'seo', 'other', 'api', 'content', 'remarketing'])
plt.title('Distribution of Affiliate channels among the destination countries excluding US')
plt.xlabel('country_destination without US')
plt.ylabel('No. of users')

plt.show()

We can see that direct marketing is the most popular channel for attracting users, regardless of the destination country they book.

##### 10. Which signup method is used by users to register on Airbnb?

In [None]:
#Now let's check which signup methods are being used
plt.figure(figsize=(12,6))
signup_method_percentage=df_without_ndf.signup_method.value_counts()/df_without_ndf.shape[0]*100
signup_method_percentage.plot(kind='bar')
plt.title("Signup Method distribution among users")
plt.ylabel('% of users')
plt.xticks(rotation='horizontal')
# sns.countplot(x='signup_method',data=df_without_ndf, order=df_without_ndf.signup_method.value_counts().index)
plt.show()

More than 70% of the users use the basic signup method to register themselves on Airbnb, followed by Facebook. Users rarely use their Google account to register on Airbnb.

##### 11. Which signup method is popular among users to register on Airbnb before booking their stay in the destination countries?

In [None]:
#Destination country based on signup method
plt.figure(figsize=(12,6))
sns.countplot(x='country_destination',data=df_without_ndf, order=df_without_ndf.country_destination.value_counts().index,
             hue='signup_method', hue_order=['basic','facebook','google'])
plt.ylabel("No. of users")
plt.title("Distribution of Signup Methods among the destination countries")
plt.legend(loc='upper right')

plt.show()

The basic signup method is the most common way for users to sign up on Airbnb, regardless of the destination country they book.

##### 12. Which is the first device used by users to access Airbnb?

In [None]:
#First Device type distribution
plt.figure(figsize=(18,6))
first_device_type_percentage=df_without_ndf.first_device_type.value_counts()/df_without_ndf.shape[0]*100
first_device_type_percentage.plot(kind='bar')
# sns.countplot(x='first_device_type',data=df_without_ndf,order=df_without_ndf.first_device_type.value_counts().index)
plt.ylabel("% of users")
plt.title("First Device type distribution among users")
plt.xticks(rotation='horizontal')
plt.show()

More than 40% of the users use Mac Desktop to access Airbnb. Mac Desktop and Windows Desktop together account for approximately 80% of all the users, which means most users first access Airbnb from a desktop. This supports our earlier result that stated "80% of users use Web as a signup app to register on Airbnb".

##### 13. Which device is used by the users first to book their destination countries?

In [None]:
#First Device type distribution across destinations
plt.figure(figsize=(18,6))
sns.countplot(x='country_destination',data=df_without_ndf, order=df_without_ndf.country_destination.value_counts().index,
             hue='first_device_type',hue_order=['Mac Desktop', 'Windows Desktop', 'iPhone', 'iPad', 'Other/Unknown', 
                                                'Android Phone','Desktop (Other)', 'Android Tablet', 'SmartPhone (Other)'])
plt.ylabel("No. of users")
plt.title('Distribution of First Device type across destination countries')
plt.legend(loc='upper right')

plt.figure(figsize=(18,6))
sns.countplot(x='country_destination',data=df_without_ndf_and_US, 
              order=df_without_ndf_and_US.country_destination.value_counts().index,
              hue='first_device_type', hue_order=['Mac Desktop', 'Windows Desktop', 'iPhone', 'iPad', 'Other/Unknown', 
                                                'Android Phone','Desktop (Other)', 'Android Tablet', 'SmartPhone (Other)'])
plt.ylabel("No. of users")
plt.title('Distribution of First Device type across destination countries excluding US')
plt.legend(loc='upper right')
plt.show()

Mac Desktop and Windows Desktop have been the most popular first devices used by users to access Airbnb.<br>

iPad is used more than iPhone as a first device by the users who book their places in countries apart from US and other (not mentioned) countries.

##### 14. Which is the most popular browser among users to access Airbnb?

In [None]:
#First Browser distribution
plt.figure(figsize=(18,6))
first_browser_percentage=df_without_ndf.first_browser.value_counts()/df_without_ndf.shape[0]*100
first_browser_percentage.plot(kind='bar')
# sns.countplot(x='first_browser', data=df_without_ndf, order=df_without_ndf.first_browser.value_counts().index)
plt.ylabel("% of users")
plt.show()

35% of users use Chrome to access Airbnb, followed by Safari and Firefox. 

Earlier, we observed that Mac Desktop was used by most of our users, followed by Windows Desktop, iPhone and iPad. Taken together, this suggests that Chrome is popular across device types, on Apple devices as well as on Windows Desktop.

##### 15. How many pages do users access before landing on Airbnb page?

In [None]:
plt.figure(figsize=(12,6))
signup_flow_percentage=df_without_ndf.signup_flow.value_counts()/df_without_ndf.shape[0]*100
signup_flow_percentage.plot(kind='bar')
plt.title('Pages accessed before landing on Airbnb page')
plt.ylabel('% of users')
plt.xlabel('Page no.')
# sns.countplot(x='signup_flow',data=df_without_ndf)
plt.show()

We can see that more than 75% of the users land on the Airbnb page directly. One interesting thing to note is that around 5-6% of users land on the Airbnb page after accessing 25 pages. A possible reason is that they start looking for options on competitor websites first and, while searching, come across an Airbnb advertisement with attractive deals that leads them to visit the Airbnb page. Note that the column description above defines signup_flow as the page a user came to sign up from, so reading its value as a page count is an assumption.

##### 16. How has the customer base been expanding for Airbnb over time?

In [None]:
#New account created over time
plt.figure(figsize=(12,6))
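#sort_index() puts the daily counts in date order; value_counts() alone sorts by count
(df_without_ndf.date_account_created.value_counts().sort_index().plot(kind='line',linewidth=1))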
plt.ylabel("No. of Customers")
plt.xlabel("Date")
plt.show()

There was a huge rise in user registration after 2014. This was the time when Airbnb's business started to boom and since then it has expanded at a very high rate.
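
Daily counts can be noisy, so a monthly view may make the growth trend easier to read. The sketch below is an illustration on made-up dates (the variable names are hypothetical); it builds both the date-ordered daily counts and a monthly aggregate.

```python
import pandas as pd

# Made-up account creation dates.
created = pd.Series(pd.to_datetime([
    "2013-12-30", "2014-01-02", "2014-01-02", "2014-01-15", "2014-02-01",
]))

# Daily counts, sorted by date rather than by count.
daily = created.value_counts().sort_index()

# Monthly counts: map each date to its calendar month, then count per month.
monthly = created.dt.to_period("M").value_counts().sort_index()

print(daily)
print(monthly)
```

Applied to this notebook, `created` would be `df_without_ndf.date_account_created` after the datetime conversion, and `monthly.plot(kind='line')` would draw the smoothed curve.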

### Feature Engineering

1. Clean the test dataset in the same way as the train dataset.

In [None]:
#Remove the NDF rows from the train dataset; .copy() avoids assigning to a view of df_train later on
df_train1=df_train[df_train['country_destination']!='NDF'].copy()
df_train1.head()

In [None]:
#Replacing the unknowns
df_test['first_browser']=df_test['first_browser'].replace('-unknown-', np.nan)
df_test['gender']=df_test['gender'].replace('-unknown-', np.nan)
df_test['language']=df_test['language'].replace('-unknown-', np.nan)

In [None]:
#Fixing the age issues
df_abnormal_age_test=df_test['age']>150
df_test.loc[df_abnormal_age_test,'age']=2015 - df_test.loc[df_abnormal_age_test,'age']
df_test.loc[df_test.age<18,'age']=np.nan
df_test.loc[df_test.age>100,'age']=np.nan
print(df_test.age.describe())

Segregate columns date_account_created and timestamp_first_active into year, month and day

In [None]:
#For date_account_created
df_train1['date_account_created']=pd.to_datetime(df_train1['date_account_created'])
df_train1['date_account_created_year']=df_train1.date_account_created.dt.year
df_train1['date_account_created_month']=df_train1.date_account_created.dt.month
df_train1['date_account_created_day']=df_train1.date_account_created.dt.day

df_test['date_account_created']=pd.to_datetime(df_test['date_account_created'])
df_test['date_account_created_year']=df_test.date_account_created.dt.year
df_test['date_account_created_month']=df_test.date_account_created.dt.month
df_test['date_account_created_day']=df_test.date_account_created.dt.day

In [None]:
#For timestamp_first_active
df_train1['timestamp_first_active']=pd.to_datetime((df_train1.timestamp_first_active//1000000),format='%Y%m%d')
df_train1['timestamp_first_active_year']=df_train1.timestamp_first_active.dt.year
df_train1['timestamp_first_active_month']=df_train1.timestamp_first_active.dt.month
df_train1['timestamp_first_active_day']=df_train1.timestamp_first_active.dt.day

df_test['timestamp_first_active']=pd.to_datetime((df_test.timestamp_first_active//1000000),format='%Y%m%d')
df_test['timestamp_first_active_year']=df_test.timestamp_first_active.dt.year
df_test['timestamp_first_active_month']=df_test.timestamp_first_active.dt.month
df_test['timestamp_first_active_day']=df_test.timestamp_first_active.dt.day
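
The year, month and day split above is repeated for two columns and two datasets. A small helper could remove that duplication. This is a minimal sketch; `add_date_parts` is a hypothetical name and the toy dates are made up.

```python
import pandas as pd

def add_date_parts(df, column):
    """Return a copy of df with <column>_year, <column>_month and <column>_day added."""
    out = df.copy()
    dates = pd.to_datetime(out[column])
    out[column + "_year"] = dates.dt.year
    out[column + "_month"] = dates.dt.month
    out[column + "_day"] = dates.dt.day
    return out

# Made-up dates; only the column name matches this notebook.
toy = pd.DataFrame({"date_account_created": ["2014-01-02", "2013-12-30"]})
expanded = add_date_parts(toy, "date_account_created")
print(expanded)
```

Because the helper returns a copy, the same call can be applied to the train and the test frame without modifying the inputs.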

In [None]:
#Drop the main columns
df_train1=df_train1.drop(['date_account_created','timestamp_first_active'], axis=1)
df_test=df_test.drop(['date_account_created','timestamp_first_active'], axis=1)

In [None]:
df_train1.head()

In [None]:
#Replace the NULL values in age 
df_train1['age']=df_train1['age'].fillna(-1)
df_test['age']=df_test['age'].fillna(-1)

In [None]:
#Create the target variable and drop it from train dataset
y_train=df_train1['country_destination']
x_train=df_train1.drop(['country_destination'], axis=1)
x_test=df_test

In [None]:
#Drop the unwanted columns from both the datasets
id_test=x_test['id']
x_train=x_train.drop(['date_first_booking'], axis=1)
x_test=df_test.drop(['date_first_booking'], axis=1)

In [None]:
#Check the total rows and columns in both train and test datasets
print("Train Dataset: "+str(x_train.shape))
print("Test Dataset: "+str(x_test.shape))

In [None]:
#Overview of train dataset
x_train.head()

In [None]:
#Overview of test dataset
x_test.head()

Encoding the categorical features using one hot encoding

In [None]:
#Merge x_train and x_test so that both get the same one-hot encoded columns
merge_train_test=pd.concat([x_train,x_test],axis=0)

In [None]:
#Use get_dummies function to convert the categorical variables into one hot encoding
categorical_columns=['gender', 'signup_method', 'signup_flow', 'language',
       'affiliate_channel', 'affiliate_provider', 'first_affiliate_tracked',
       'signup_app', 'first_device_type', 'first_browser',
       'date_account_created_year', 'date_account_created_month',
       'date_account_created_day', 'timestamp_first_active_year',
       'timestamp_first_active_month', 'timestamp_first_active_day']
merge_train_test1=pd.get_dummies(merge_train_test,columns=categorical_columns)

In [None]:
merge_train_test2=merge_train_test1.set_index('id')

In [None]:
x_train2=merge_train_test2.loc[x_train['id']]
x_train2.shape

In [None]:
x_test2=merge_train_test2.loc[x_test['id']]
x_test2.shape
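
The reason for concatenating train and test before calling `get_dummies` is that a category present in only one of them would otherwise produce mismatched columns. The sketch below illustrates this on two made-up frames and then repeats the concatenate, encode and split-by-id pattern used above.

```python
import pandas as pd

# Made-up frames: 'Safari' appears only in train, 'Opera' only in test.
train = pd.DataFrame({"id": ["a", "b"], "first_browser": ["Chrome", "Safari"]})
test = pd.DataFrame({"id": ["c", "d"], "first_browser": ["Chrome", "Opera"]})

# Encoding each frame separately yields different column sets.
train_alone = pd.get_dummies(train, columns=["first_browser"])
test_alone = pd.get_dummies(test, columns=["first_browser"])
mismatch = sorted(set(train_alone.columns) ^ set(test_alone.columns))
print(mismatch)

# Encoding the concatenation gives both parts identical columns.
combined = pd.get_dummies(pd.concat([train, test], axis=0),
                          columns=["first_browser"]).set_index("id")
train_encoded = combined.loc[train["id"]]
test_encoded = combined.loc[test["id"]]
print(train_encoded.columns.tolist())
```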

In [None]:
from sklearn.preprocessing import LabelEncoder
label_encoder=LabelEncoder()
encoded_y_train=label_encoder.fit_transform(y_train)

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

In [None]:
x_train2.dtypes

In [None]:
import xgboost as xgb
xg_train = xgb.DMatrix(x_train2, label=encoded_y_train)
#Specifying the hyperparameters
#'n_estimators' and 'missing' belong to the scikit-learn wrapper and are not used by xgb.train,
#so they are left out; the number of boosting rounds is set through num_boost_round below
params = {'max_depth': 10,
    'learning_rate': 1,
    'objective': 'multi:softprob',
    #NDF rows were removed from the train data, so count the classes actually present
    'num_class': len(label_encoder.classes_),
    'gamma': 0,
    'min_child_weight': 1,
    'max_delta_step': 0,
    'subsample': 1,
    'colsample_bytree': 1,
    'colsample_bylevel': 1,
    'reg_alpha': 0,
    'reg_lambda': 1,
    'scale_pos_weight': 1,
    'base_score': 0.5,
    'nthread': 4,
    'seed': 42
          }
num_boost_round = 5
print("Train an XGBoost model")
gbm = xgb.train(params, xg_train, num_boost_round)

In [None]:
y_pred=gbm.predict(xgb.DMatrix(x_test2))

In [None]:
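# We take the 5 highest probabilities for each user
ids = []  #list of ids
cts = []  #list of countries
for i in range(len(id_test)):
    idx = id_test[i]
    ids += [idx] * 5
    cts += label_encoder.inverse_transform(np.argsort(y_pred[i])[::-1])[:5].tolist()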

In [None]:
y_pred[0]
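
The per-user loop that picks the five most probable countries can also be written without a Python loop. The sketch below uses made-up probabilities; `classes` stands in for `label_encoder.classes_` (a `LabelEncoder` stores its classes in sorted order, so column `i` of the probability matrix corresponds to `classes[i]`), and `user_ids` is a hypothetical stand-in for `id_test`.

```python
import numpy as np

# Stand-ins for label_encoder.classes_ and id_test (made-up values).
classes = np.array(["AU", "CA", "DE", "ES", "FR", "GB", "US"])
user_ids = np.array(["u1", "u2"])

# Made-up class probabilities, one row per user.
proba = np.array([
    [0.02, 0.05, 0.03, 0.10, 0.20, 0.15, 0.45],
    [0.30, 0.01, 0.25, 0.04, 0.12, 0.20, 0.08],
])

# Sort each row ascending, reverse to descending, keep the first five columns.
top5_idx = np.argsort(proba, axis=1)[:, ::-1][:, :5]
top5 = classes[top5_idx]

# Flatten into the long format used for the submission: five rows per user.
ids = np.repeat(user_ids, 5)
countries = top5.ravel()
print(top5)
```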

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
forest_class = RandomForestClassifier(random_state = 42)

n_estimators = [100, 500]
min_samples_split = [10, 20]

param_grid_forest = {'n_estimators' : n_estimators, 'min_samples_split' : min_samples_split}


rand_search_forest = GridSearchCV(forest_class, param_grid_forest, cv = 4, refit = True,
                                 n_jobs = -1, verbose=2)

rand_search_forest.fit(x_train2, encoded_y_train)
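
After a grid search has been fitted, its best parameter combination and cross-validated score can be inspected before the best estimator is used. The sketch below runs on a small synthetic dataset from `make_classification`, not on the Airbnb data, and uses a deliberately tiny grid so that it finishes quickly; the variable names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic 3-class problem standing in for the real features.
X, y = make_classification(n_samples=200, n_features=8, n_informative=4,
                           n_classes=3, random_state=42)

grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    {"n_estimators": [10, 20], "min_samples_split": [10, 20]},
    cv=4,
)
grid.fit(X, y)

# Best hyperparameters and their mean cross-validated accuracy.
print(grid.best_params_)
print(round(grid.best_score_, 3))

# With refit=True (the default), best_estimator_ is already trained on all of X.
proba = grid.best_estimator_.predict_proba(X)
print(proba.shape)
```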

In [None]:
random_estimator = rand_search_forest.best_estimator_

#Class probabilities on the training features (x_train2, built above)
y_pred_random_estimator = random_estimator.predict_proba(x_train2)

In [None]:
y_pred = random_estimator.predict_proba(x_test2)

# We take the 5 highest probabilities for each person
ids = []  #list of ids
cts = []  #list of countries
for i in range(len(id_test)):
    idx = id_test[i]
    ids += [idx] * 5
    cts += label_encoder.inverse_transform(np.argsort(y_pred[i])[::-1])[:5].tolist()

# Generating a csv file with the predictions 
sub = pd.DataFrame(np.column_stack((ids, cts)), columns=['id', 'country'])
sub.to_csv('output_randomForest.csv',index=False)