# Method - 2

A simple Linear Regression model scored better on the private leaderboard (LB) than the other models. This notebook covers the feature engineering and model training for it.

In [None]:
# importing needed libraries

import os

import pandas as pd
import numpy as np

from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

from sklearn.preprocessing import LabelEncoder

In [None]:
# path for the datasets.

TRAIN_PATH = 'train.csv'
TEST_PATH = 'test.csv'

SAMPLE_PATH = 'sample_submission.csv'

In [None]:
# loading train and test set

train = pd.read_csv(TRAIN_PATH)
test = pd.read_csv(TEST_PATH)

In [None]:
# let's see whether there are any duplicate columns in the dataset

cols = train.columns
for idx, col in enumerate(cols):
    for c in cols[idx+1:]:
        if train[col].equals(train[c]):
            print(col, c)

The check finds no duplicate columns, so let's move on.
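
For wider frames, the pairwise loop can be replaced by a single vectorized check. The sketch below is a minimal illustration on a toy frame with hypothetical column names, not the data used in this notebook. Note that it compares by value after transposing, so it may flag columns that `equals` would treat as different because of their dtype.

```python
import pandas as pd

# Toy frame with hypothetical column names; 'b' repeats 'a'.
toy = pd.DataFrame({'a': [1, 2, 3], 'b': [1, 2, 3], 'c': [3, 2, 1]})

# Transposing turns columns into rows, so duplicated() flags every
# column whose values repeat an earlier column.
dup_mask = toy.T.duplicated()
duplicate_cols = dup_mask[dup_mask].index.tolist()
print(duplicate_cols)
```

Transposing copies the data, so this is best suited to frames that fit comfortably in memory.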

# Preprocessing and Feature Engineering

This section fixes the issues found in the EDA and generates new features from the insights the EDA gave us.

In [None]:
# the hotel_stars feature has a formatting problem, as seen in the EDA.
# let's fix it by replacing ',' with '.' and then casting to float

train.hotel_stars = train.hotel_stars.map(lambda x : x.replace(',', '.'))
test.hotel_stars = test.hotel_stars.map(lambda x : x.replace(',', '.'))

train.hotel_stars = train.hotel_stars.astype('float32')
test.hotel_stars = test.hotel_stars.astype('float32')
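
The same cleanup can also be written with the pandas string accessor. This is a minimal sketch on a toy Series (the values are made up), assuming the column holds strings with an occasional comma decimal separator. `errors='coerce'` is an extra safeguard not present in the cell above: it turns anything unparseable into `NaN` instead of raising.

```python
import pandas as pd

# Toy values (hypothetical) with a comma used as the decimal separator.
raw_stars = pd.Series(['3', '4,5', '5'])

# Replace the comma literally (regex=False), then parse as numbers.
clean_stars = pd.to_numeric(
    raw_stars.str.replace(',', '.', regex=False),
    errors='coerce',
).astype('float32')

print(clean_stars.tolist())
```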

The EDA showed a significant dependency between whether the hotel had a swimming_pool and the score it got. The period of stay also has a significant relationship with the score.

It would be good to create a feature that indicates whether the earthling stayed in 'Mar-May' (assuming it is the hottest time of the year).

In [None]:
# we saw in the EDA that there is some relationship between the period of stay and the score
# let's make a new feature 'stayed_in_summer' to indicate whether the person stayed in the period 'Mar-May'

train['stayed_in_summer'] = train.period_of_stay.map(lambda x : 1 if x == 'Mar-May' else 0)
test['stayed_in_summer'] = test.period_of_stay.map(lambda x : 1 if x == 'Mar-May' else 0)

In [None]:
# Let's encode the categorical features one by one
# (the encoded values overwrite the original 'earthling_continent' column)

continent_encoder = LabelEncoder().fit(train['earthling_continent'])
train['earthling_continent'] = continent_encoder.transform(train['earthling_continent'])
test['earthling_continent'] = continent_encoder.transform(test['earthling_continent'])

In [None]:
# We saw that the feature 'earthling_country' can't be used in our model as it has different values in train and test
# Let's drop that feature

del train['earthling_country']
del test['earthling_country']

In [None]:
# Let's run a loop to encode the rest of the categorical columns

for col in train.columns:
    if train[col].dtype == object:
        enc = LabelEncoder().fit(train[col])
        
        train[col] = enc.transform(train[col])
        test[col] = enc.transform(test[col])
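
`LabelEncoder.transform` raises a `ValueError` when the test set contains a category that was not seen during fitting, which is the situation that made 'earthling_country' unusable. One possible way around it is sketched below with `OrdinalEncoder` (scikit-learn 0.24 or later). The column name `traveler_type`, the toy values, and the choice of -1 for unseen categories are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy frames with a hypothetical categorical column.
toy_train = pd.DataFrame({'traveler_type': ['Business', 'Couples', 'Solo', 'Couples']})
toy_test = pd.DataFrame({'traveler_type': ['Couples', 'Friends']})  # 'Friends' is unseen

# Unseen categories are mapped to -1 instead of raising an error.
enc = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
enc.fit(toy_train[['traveler_type']])

encoded_test = enc.transform(toy_test[['traveler_type']]).ravel()
print(encoded_test)
```

Whether a sentinel such as -1 is meaningful for a linear model is a separate question; dropping the column, as done above for 'earthling_country', remains the simpler option.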

In [None]:
# We saw in the EDA that 'free_wifi' had a significant impact on the score
# Let's target encode that feature

score_map = train.groupby(by=['free_wifi'])['score'].mean().to_dict()
train['wifi_score_encoded'] = train['free_wifi'].map(score_map)
test['wifi_score_encoded'] = test['free_wifi'].map(score_map)

In [None]:
# Let's also target encode the new feature 'stayed_in_summer'

score_map = train.groupby(by=['stayed_in_summer'])['score'].mean().to_dict()
train['stayed_in_summer_score_encoded'] = train['stayed_in_summer'].map(score_map)
test['stayed_in_summer_score_encoded'] = test['stayed_in_summer'].map(score_map)
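
Target encoding with `map` leaves `NaN` for any test category that has no entry in `score_map`. A minimal sketch of one common remedy, falling back to the global training mean, is shown below. The toy values and the 'MAYBE' category are made up for illustration.

```python
import pandas as pd

# Toy data (hypothetical values); 'MAYBE' never appears in the training set.
toy_train = pd.DataFrame({
    'free_wifi': ['YES', 'YES', 'NO', 'NO'],
    'score': [5, 4, 3, 2],
})
toy_test = pd.DataFrame({'free_wifi': ['YES', 'NO', 'MAYBE']})

# Mean score per category, learned from the training set only.
score_map = toy_train.groupby('free_wifi')['score'].mean()
global_mean = toy_train['score'].mean()

# Unseen categories fall back to the global mean instead of NaN.
encoded = toy_test['free_wifi'].map(score_map).fillna(global_mean)
print(encoded.tolist())
```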

In [None]:
# the feature 'number_of_rooms' was not a prominent feature in our EDA
# Let's drop that feature

del train['number_of_rooms']
del test['number_of_rooms']

# Model Training

With the features pre-processed and a few features generated, we can now move on with the model training and testing.

The score ranges from 1 to 5, and the smaller scores are very rare in the train set, so the problem is very imbalanced. This makes it difficult for a classifier to learn.

An alternative is to fit a regressor on the data and then map its predictions to class labels using thresholds.
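
The thresholding step can be sketched in vectorized form with `np.digitize`. The cut points below mirror the ones in the `map_prediction` function of this notebook; the names `CUT_POINTS` and `to_labels` and the sample values are hypothetical.

```python
import numpy as np

# Cut points between classes 1|2, 2|3, 3|4 and 4|5.
CUT_POINTS = np.array([1.10, 2.10, 3.10, 4.10])

def to_labels(raw_scores):
    """Map continuous regressor outputs to integer classes 1..5."""
    # digitize returns how many cut points are <= each score (0..4),
    # so adding 1 yields the class label.
    return np.digitize(raw_scores, CUT_POINTS) + 1

raw = np.array([0.7, 1.10, 2.05, 3.4, 4.10, 6.2])
labels = to_labels(raw)
print(labels)
```

Values below the first cut point collapse to class 1 and values above the last to class 5, so out-of-range predictions still receive a valid label.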

In [None]:
# Let's split the data into train and validation sets

train_feats, valid_feats, train_labels, valid_labels = train_test_split(train.drop(['score'], axis=1), train['score'],
                                                                       test_size=0.25, random_state=2019, stratify=train['score'])

In [None]:
# We are trying to fit a regressor on the data
# The output of the regressor is then mapped to the different classes
# with the function map_prediction

def map_prediction(score):
    if score < 1.10:
        return 1
    elif score < 2.10:
        return 2
    elif score < 3.10:
        return 3
    elif score < 4.10:
        return 4
    else:
        return 5
    
# `normalize=False` was the default; the argument was removed in scikit-learn 1.2
lr = LinearRegression()
lr.fit(train_feats, train_labels)

valid_preds = lr.predict(valid_feats)
valid_preds = list(map(map_prediction, valid_preds))

accuracy_score(valid_labels, valid_preds)

The result is not that great.

For this data, however, it is one of the best results we could get.

Let's now make predictions on the test set. Note that the cells below reuse the model fitted on the training split; they do not refit it on the full data.
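
A refit on all labelled rows could look like the sketch below. It uses toy data, and the names `full_feats`, `full_labels`, and `final_model` are hypothetical, not taken from this notebook.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy stand-ins for the labelled and unlabelled frames (hypothetical values).
toy_train = pd.DataFrame({
    'hotel_stars': [3.0, 4.0, 5.0, 4.5],
    'stayed_in_summer': [0, 1, 0, 1],
    'score': [3, 4, 5, 5],
})
toy_test = pd.DataFrame({'stayed_in_summer': [1], 'hotel_stars': [3.5]})

# Use every labelled row, with no validation split held out.
full_feats = toy_train.drop(columns=['score'])
full_labels = toy_train['score']
final_model = LinearRegression().fit(full_feats, full_labels)

# Align the test columns with the training features before predicting.
toy_test = toy_test[full_feats.columns]
raw_preds = final_model.predict(toy_test)
```

The trade-off is that a model refit this way has no held-out data left to validate it on.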

In [None]:
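# keep only the training features in the test set, in the same column order
# (.copy() avoids a SettingWithCopyWarning when the frame is modified later)

test = test[train.drop(['score'], axis=1).columns].copy()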

In [None]:
# fixing the mars_membership_years feature of the test set
# Let's replace the negative values with 0

test['mars_membership_years'] = test['mars_membership_years'].map(lambda y: 0 if y < 0 else y)

In [None]:
# predicting on test set

test_preds = lr.predict(test)
test_preds = list(map(map_prediction, test_preds))

In [None]:
# making submission dataframe

sub = pd.read_csv(SAMPLE_PATH)
sub['score'] = test_preds

In [None]:
# saving as csv

sub.to_csv('lr_model.csv', index=False)