# Description
This challenge is open to users from English-speaking African countries.

The Tanzanian tourism sector plays a significant role in the Tanzanian economy, contributing about 17% to the country’s GDP and 25% of all foreign exchange revenues. The sector, which provides direct employment for more than 600,000 people and up to 2 million people indirectly, generated approximately $2.4 billion in 2018 according to government statistics. Tanzania received a record 1.1 million international visitor arrivals in 2014, mostly from Europe, the US and Africa.

Tanzania is the only country in the world that has allocated more than 25% of its total area to wildlife, national parks, and protected areas. There are 16 national parks in Tanzania, 28 game reserves, 44 game-controlled areas, two marine parks and one conservation area.

Tanzania’s tourist attractions include the Serengeti plains, which host the largest terrestrial mammal migration in the world; the Ngorongoro Crater, the world’s largest intact volcanic caldera and home to the highest density of big game in Africa; Kilimanjaro, Africa’s highest mountain; and the Mafia Island marine park, among many others. The scenery, topography, rich culture and very friendly people provide for excellent cultural tourism, beach holidays, honeymooning, game hunting, historical and archaeological ventures, and certainly the best wildlife photography safaris in the world.

The objective of this hackathon is to develop a machine learning model that classifies a tourist's expenditure in Tanzania into a cost range. Tour operators and the Tanzania Tourism Board can use the model to help tourists around the world estimate their expenditure before visiting Tanzania.

AI4D AFRICA’S ANGLOPHONE MULTIDISCIPLINARY RESEARCH LAB (http://ai4dlab.or.tz/)

In [60]:
import pandas as pd
from sklearn import preprocessing
from sklearn.tree import DecisionTreeClassifier
from catboost import CatBoostClassifier

In [3]:

sample = pd.read_csv('SampleSubmission.csv')
train = pd.read_csv('Train.csv')
test = pd.read_csv('Test.csv')
definitions = pd.read_csv('VariableDefinitions.csv')

In [5]:
print(sample.info())
sample.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6169 entries, 0 to 6168
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Tour_ID       6169 non-null   object 
 1   High Cost     1 non-null      float64
 2   Higher Cost   1 non-null      float64
 3   Highest Cost  1 non-null      float64
 4   Low Cost      1 non-null      float64
 5   Lower Cost    1 non-null      float64
 6   Normal Cost   1 non-null      float64
dtypes: float64(6), object(1)
memory usage: 337.5+ KB
None


Unnamed: 0,Tour_ID,High Cost,Higher Cost,Highest Cost,Low Cost,Lower Cost,Normal Cost
0,tour_idynufedne,0.23,0.56,0.04,0.12,0.005,0.12
1,tour_id9r3y5moe,,,,,,
2,tour_idf6itml6g,,,,,,
3,tour_id99u4znru,,,,,,
4,tour_idj4i9urbx,,,,,,
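
The sample file suggests the expected submission layout: a `Tour_ID` column followed by one probability column per cost category. Below is a minimal sketch of a sanity check for that layout. It assumes each row's probabilities should sum to 1, which the placeholder row in the sample does not itself satisfy (it sums to 1.075). The helper name `check_submission` and the tolerance are illustrative, not part of the competition kit.

```python
import numpy as np
import pandas as pd

CLASS_COLS = ['High Cost', 'Higher Cost', 'Highest Cost',
              'Low Cost', 'Lower Cost', 'Normal Cost']

def check_submission(df, tol=1e-6):
    """Return a list of problems found in a submission frame (hypothetical helper)."""
    problems = []
    if list(df.columns) != ['Tour_ID'] + CLASS_COLS:
        return ['unexpected columns']
    probs = df[CLASS_COLS]
    if probs.isna().any().any():
        problems.append('missing probabilities')
    if ((probs < 0) | (probs > 1)).any().any():
        problems.append('probabilities outside [0, 1]')
    if not np.allclose(probs.sum(axis=1), 1.0, atol=tol):
        problems.append('rows do not sum to 1')
    return problems

# Toy frame: the first row is valid, the second row is empty like the sample file
demo = pd.DataFrame(
    [['tour_a', 0.2, 0.5, 0.05, 0.1, 0.05, 0.1],
     ['tour_b'] + [np.nan] * 6],
    columns=['Tour_ID'] + CLASS_COLS,
)
issues = check_submission(demo)
```

Running the same check on the final submission frame before writing it to disk would catch empty or mislabelled columns early.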


In [7]:
print(train.info())
train.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18506 entries, 0 to 18505
Data columns (total 21 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Tour_ID                18506 non-null  object 
 1   country                18506 non-null  object 
 2   age_group              18506 non-null  object 
 3   travel_with            17431 non-null  object 
 4   total_female           18504 non-null  float64
 5   total_male             18500 non-null  float64
 6   purpose                18506 non-null  object 
 7   main_activity          18506 non-null  object 
 8   info_source            18506 non-null  object 
 9   tour_arrangement       18506 non-null  object 
 10  package_transport_int  18506 non-null  object 
 11  package_accomodation   18506 non-null  object 
 12  package_food           18506 non-null  object 
 13  package_transport_tz   18506 non-null  object 
 14  package_sightseeing    18506 non-null  object 
 15  pa

Unnamed: 0,Tour_ID,country,age_group,travel_with,total_female,total_male,purpose,main_activity,info_source,tour_arrangement,...,package_accomodation,package_food,package_transport_tz,package_sightseeing,package_guided_tour,package_insurance,night_mainland,night_zanzibar,first_trip_tz,cost_category
0,tour_id1hffseyw,ITALY,45-64,With Children,0.0,2.0,Visiting Friends and Relatives,Beach Tourism,"Friends, relatives",Package Tour,...,Yes,Yes,Yes,No,No,No,0,7,Yes,High Cost
1,tour_idnacd7zag,UNITED KINGDOM,25-44,With Spouse,1.0,1.0,Leisure and Holidays,Wildlife Tourism,"Travel agent, tour operator",Package Tour,...,Yes,Yes,Yes,No,No,No,0,7,Yes,High Cost
2,tour_id62vz7e71,UNITED STATES OF AMERICA,65+,With Spouse,1.0,1.0,Leisure and Holidays,Widlife Tourism,"Travel agent, tour operator",Package Tour,...,Yes,Yes,Yes,Yes,Yes,No,6,6,Yes,Higher Cost
3,tour_idrc76tzix,RWANDA,25-44,With Spouse and Children,3.0,1.0,Leisure and Holidays,Beach Tourism,"Radio, TV, Web",Independent,...,No,No,No,No,No,No,3,0,No,Lower Cost
4,tour_idn723m0n9,UNITED STATES OF AMERICA,45-64,Alone,0.0,1.0,Leisure and Holidays,Widlife Tourism,"Travel agent, tour operator",Package Tour,...,Yes,Yes,Yes,No,Yes,Yes,7,0,Yes,Higher Cost
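
`train.info()` reports missing values in `travel_with`, `total_female` and `total_male`. One possible way to handle them before modelling is sketched on a toy frame below. The `'Unknown'` label and the use of the median are assumptions for illustration; the notebook itself does not impute anything.

```python
import numpy as np
import pandas as pd

# Made-up rows that mimic the three columns with gaps
toy = pd.DataFrame({
    'travel_with': ['Alone', None, 'With Spouse'],
    'total_female': [0.0, 1.0, np.nan],
    'total_male': [1.0, np.nan, 1.0],
})

def fill_missing(df):
    """Fill categorical gaps with a sentinel label and numeric gaps with the column median."""
    out = df.copy()
    out['travel_with'] = out['travel_with'].fillna('Unknown')
    for col in ['total_female', 'total_male']:
        out[col] = out[col].fillna(out[col].median())
    return out

filled = fill_missing(toy)
```

If this were applied to the real data, the medians would ideally be computed on the training set and reused for the test set.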


In [10]:
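import random
import numpy as np

def seed_everything(seed):
    # Seed Python's and NumPy's global random number generators for reproducibility
    random.seed(seed)
    np.random.seed(seed)

SEED = 42
seed_everything(SEED)
train.columns  # list the available columns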

Index(['Tour_ID', 'country', 'age_group', 'travel_with', 'total_female',
       'total_male', 'purpose', 'main_activity', 'info_source',
       'tour_arrangement', 'package_transport_int', 'package_accomodation',
       'package_food', 'package_transport_tz', 'package_sightseeing',
       'package_guided_tour', 'package_insurance', 'night_mainland',
       'night_zanzibar', 'first_trip_tz', 'cost_category'],
      dtype='object')

In [21]:
# cols = ['country','total_female','package_food']
# target = train['cost_category']
# X=pd.DataFrame()
# testy=pd.DataFrame()

In [28]:
# # One-hot encoder; categories not seen during fit are encoded as all zeros.
# label_encoder = preprocessing.OneHotEncoder(handle_unknown='ignore')

# # Fit on the training 'country' column and encode it.
# X= label_encoder.fit_transform(train[['country']])

# # Encode the test 'country' column with the categories learned from train.
# testy= label_encoder.transform(test[['country']])

In [35]:
# x = pd.DataFrame(X.toarray())
# t = pd.DataFrame(testy.toarray())
# print(t.shape)
# x.head()

(6169, 131)


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,121,122,123,124,125,126,127,128,129,130
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
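
The commented-out cells above one-hot encode `country`. The `handle_unknown='ignore'` setting matters because the test set may contain countries that never appear in training. A small sketch on made-up values shows the effect: an unseen category is encoded as an all-zero row.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data; KENYA appears only in the "test" frame
train_toy = pd.DataFrame({'country': ['ITALY', 'RWANDA', 'ITALY']})
test_toy = pd.DataFrame({'country': ['RWANDA', 'KENYA']})

encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(train_toy[['country']])

# transform() returns a sparse matrix; toarray() makes it dense for inspection
encoded = encoder.transform(test_toy[['country']]).toarray()
categories = list(encoder.categories_[0])
```

Here `categories` holds the two countries seen during fitting, and the second row of `encoded` contains only zeros because `KENYA` was not among them.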


In [66]:
model = CatBoostClassifier()
# Categorical feature columns used for training
cat_cols = ['country', 'age_group', 'purpose', 'main_activity', 'info_source',
            'tour_arrangement']
# The target definition is commented out in an earlier cell, so define it here
target = train['cost_category']

model.fit(train[cat_cols], target, cat_features=cat_cols)

Learning rate set to 0.091824
0:	learn: 1.7004832	total: 884ms	remaining: 14m 43s
1:	learn: 1.6290997	total: 1.99s	remaining: 16m 34s
2:	learn: 1.5727322	total: 3.05s	remaining: 16m 54s
3:	learn: 1.5266029	total: 4.38s	remaining: 18m 10s
4:	learn: 1.4889037	total: 5.82s	remaining: 19m 19s
5:	learn: 1.4462241	total: 7.21s	remaining: 19m 53s
6:	learn: 1.4123929	total: 8.53s	remaining: 20m 10s
7:	learn: 1.3832497	total: 9.77s	remaining: 20m 11s
8:	learn: 1.3598595	total: 10.6s	remaining: 19m 24s
9:	learn: 1.3393052	total: 12.1s	remaining: 19m 54s
10:	learn: 1.3203619	total: 13.2s	remaining: 19m 47s
11:	learn: 1.3050459	total: 14.3s	remaining: 19m 38s
12:	learn: 1.2905294	total: 15.6s	remaining: 19m 45s
13:	learn: 1.2772276	total: 16.9s	remaining: 19m 49s
14:	learn: 1.2674969	total: 18s	remaining: 19m 39s
15:	learn: 1.2585675	total: 18.8s	remaining: 19m 14s
16:	learn: 1.2487389	total: 19.6s	remaining: 18m 50s
17:	learn: 1.2404193	total: 20.5s	remaining: 18m 37s
18:	learn: 1.2337041	total: 

<catboost.core.CatBoostClassifier at 0x23db11a7a30>
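
The `learn` column in the log is the loss on the training data only, so it says little about how the model generalises. Below is a minimal sketch of a multiclass log loss that could be applied to predictions on a held-out split. It assumes log loss is the quantity of interest, since the notebook does not state the competition metric. `multiclass_log_loss` is a hypothetical helper written in NumPy, not a CatBoost function.

```python
import numpy as np

def multiclass_log_loss(y_true, proba, classes, eps=1e-15):
    """Mean negative log-probability assigned to the true class of each row."""
    proba = np.clip(np.asarray(proba, dtype=float), eps, 1 - eps)
    proba = proba / proba.sum(axis=1, keepdims=True)  # renormalise after clipping
    index = {label: i for i, label in enumerate(classes)}
    cols = np.array([index[label] for label in y_true])
    picked = proba[np.arange(len(cols)), cols]
    return float(-np.mean(np.log(picked)))

# Toy check: a uniform guess over two classes scores ln(2)
toy_classes = ['High Cost', 'Low Cost']
toy_truth = ['High Cost', 'Low Cost']
toy_proba = [[0.5, 0.5], [0.5, 0.5]]
toy_loss = multiclass_log_loss(toy_truth, toy_proba, toy_classes)
```

For six classes a uniform guess would score ln(6), roughly 1.79, which gives a baseline to compare a held-out score against.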

In [68]:
# Predict class probabilities for the test set using the same feature columns
pred = model.predict_proba(test[cat_cols])

In [69]:
pred
sol = pd.DataFrame(pred)

In [70]:
sol.head()

Unnamed: 0,0,1,2,3,4,5
0,0.178836,0.143372,0.002784,0.080965,0.041743,0.5523
1,0.318976,0.353521,0.042089,0.010282,0.011223,0.263909
2,0.568434,0.28282,0.002859,0.015787,0.009666,0.120435
3,0.041706,0.012401,0.000416,0.146356,0.569185,0.229936
4,0.0382,0.008152,7.4e-05,0.214956,0.361308,0.37731


In [50]:
sample.head()

Unnamed: 0,Tour_ID,High Cost,Higher Cost,Highest Cost,Low Cost,Lower Cost,Normal Cost
0,tour_idynufedne,0.23,0.56,0.04,0.12,0.005,0.12
1,tour_id9r3y5moe,,,,,,
2,tour_idf6itml6g,,,,,,
3,tour_id99u4znru,,,,,,
4,tour_idj4i9urbx,,,,,,


In [71]:
# Keep Tour_ID from the sample file and attach the predicted probabilities.
# This relies on the prediction columns being in the same (alphabetical) order as the sample header.
sub = pd.concat([sample[['Tour_ID']], sol], axis=1)
sub.columns = ['Tour_ID','High Cost','Higher Cost','Highest Cost','Low Cost','Lower Cost','Normal Cost']

In [72]:
sub.to_csv('sub.csv',index = False)
sub.head()

Unnamed: 0,Tour_ID,High Cost,Higher Cost,Highest Cost,Low Cost,Lower Cost,Normal Cost
0,tour_idynufedne,0.178836,0.143372,0.002784,0.080965,0.041743,0.5523
1,tour_id9r3y5moe,0.318976,0.353521,0.042089,0.010282,0.011223,0.263909
2,tour_idf6itml6g,0.568434,0.28282,0.002859,0.015787,0.009666,0.120435
3,tour_id99u4znru,0.041706,0.012401,0.000416,0.146356,0.569185,0.229936
4,tour_idj4i9urbx,0.0382,0.008152,7.4e-05,0.214956,0.361308,0.37731
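
The submission above relies on the probability columns coming out in the same order as the sample header. A hedged alternative labels the columns by class name and then reorders them, so a mismatch raises a `KeyError` instead of silently mislabelling a column. The arrays below are made up. With a fitted classifier, the `classes` list would typically come from `model.classes_`.

```python
import numpy as np
import pandas as pd

sample_cols = ['High Cost', 'Higher Cost', 'Highest Cost',
               'Low Cost', 'Lower Cost', 'Normal Cost']

# Hypothetical class order reported by a model (deliberately shuffled)
classes = ['Normal Cost', 'Low Cost', 'High Cost',
           'Lower Cost', 'Highest Cost', 'Higher Cost']
pred_toy = np.array([[0.5, 0.1, 0.2, 0.05, 0.05, 0.1]])
tour_ids = pd.Series(['tour_x'], name='Tour_ID')

# Name each probability column after its class, then select in the sample's order
proba = pd.DataFrame(pred_toy, columns=classes)
aligned = pd.concat([tour_ids, proba[sample_cols]], axis=1)
```

In `aligned`, the 0.5 ends up under `Normal Cost` and the 0.2 under `High Cost`, regardless of the order in which the model reported its classes.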
