## Two Sigma Connect: Rental Listing Inquiries

I got this exercise as a hiring test.<br>
The limitations I had:
* the images are not available anymore (no peers)
* for various reasons I had only two days

Considering the above, I decided to use Catboost as a multipurpose tool with powerful built-in modules for categorical and text data preprocessing.

### Content
1. Data loading
1. Defining pipeline
1. Playing with features
1. Catboost baseline
1. Catboost grid search
1. Augmentation (not used)

In [None]:
import numpy as np
import pandas as pd
import os
from catboost import Pool, CatBoostClassifier
from sklearn.model_selection import train_test_split
from numpy.random import seed
seed(17)

In [None]:
train_raw = pd.read_json('../input/two-sigma-connect-rental-listing-inquiries/train.json.zip').reset_index(drop = True)
test_raw = pd.read_json('../input/two-sigma-connect-rental-listing-inquiries/test.json.zip').reset_index(drop = True)
full_raw = pd.concat([train_raw, test_raw])

print(train_raw.shape, test_raw.shape, full_raw.shape)
train_raw.head(1)

After a quick look at the data I decided to apply the following feature transformations:


```bathrooms```       convert to int<br>
```bedrooms```        no changes<br>
```building_id```     drop<br>
```created```         extract day and month<br>
```description```     clean text<br>
```display_address``` clean text<br>
```features```        extract from list and clean text<br>
```latitude```        no changes<br>
```listing_id```      drop<br>
```longitude```       no changes<br>
```manager_id```      no changes<br>
```photos```          count links (as the images themselves are not available)<br>
```price```           round<br>
```street_address```  clean text<br>
```interest_level```  convert to labels

I do not care about outliers and frequency encoding, as Catboost has good built-in [algorithms](https://catboost.ai/en/docs/concepts/algorithm-main-stages_cat-to-numberic) for this, especially on GPU.<br>The best thing is that it processes [text](https://catboost.ai/en/docs/concepts/algorithm-main-stages_text-to-numeric) data automatically as well, which makes life much easier. The only thing I have to do is set the feature types.<br>
*Honestly, I tried to remove outliers manually, but the result was worse...*

### Process data

In [None]:
#Drop unused
full = full_raw.drop(['building_id', 'listing_id'], axis=1)

#Extract from list
full['features'] = [','.join(map(str, i)) for i in full['features']]

#Convert dtypes
full[['bathrooms', 'bedrooms']] = full[['bathrooms', 'bedrooms']].astype(int)

#Extract day/month and drop original date
created = pd.to_datetime(full.created)
full['day'] = created.dt.day.astype('object')
full['month'] = created.dt.month.astype('object')
full = full.drop('created', axis=1)

#Count web links
full['photos'] = full.photos.apply(len)

#Round price
full.price = full.price // 50 * 50
full.loc[full.price > 10000, 'price'] = full.loc[full.price > 10000, 'price'] // 500 * 500

#Replace classes by labels
full = full.replace({'interest_level' : { 'high' : 0, 'medium' : 1, 'low' : 2 }})

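To make the price rounding and the date extraction easier to follow, here is a minimal sketch on a toy dataframe. The values are made up for illustration, and `toy` and `price_binned` are hypothetical names that do not appear in the notebook.

```python
import pandas as pd

# Toy data; the values are made up for illustration only.
toy = pd.DataFrame({
    'price': [2495, 3310, 10400, 12999],
    'created': ['2016-06-24 07:54:24', '2016-04-12 12:19:27',
                '2016-05-01 00:00:00', '2016-06-30 23:59:59'],
})

# Step 1: floor every price to a multiple of 50.
toy['price_binned'] = toy['price'] // 50 * 50
# Step 2: prices above 10000 get coarser bins of 500.
expensive = toy['price_binned'] > 10000
toy.loc[expensive, 'price_binned'] = toy.loc[expensive, 'price_binned'] // 500 * 500

# Day and month come from two different datetime accessors.
created = pd.to_datetime(toy['created'])
toy['day'] = created.dt.day
toy['month'] = created.dt.month

print(toy[['price', 'price_binned', 'day', 'month']])
```

The coarser bins for expensive listings reduce the number of distinct price values in the sparse upper range.
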
### Text cleaning

Common steps for all text features

In [None]:
#Columns to clean text
cols_text = ['description', 'display_address', 'street_address', 'features']
#To string
full[cols_text] = full[cols_text].astype(str)
#To lower case
full[cols_text] = full[cols_text].apply(lambda x: x.str.lower())

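#Remove punctuation (symbols are matched literally, not as regex patterns)
with_whitespace = ['&', '(', ')', "-", "_", ':', '=', '"', ',']
#Note: 'br' strips leftovers of <br> tags, but it also removes 'br' inside ordinary words
with_empty = ['.', "'", '`', '!', '*', '#', '/', '<', '>', 'br',
              ';', '$', '%', '|', '+', '?']


def replace_symbol(df, to_replace, replace_by):
    for symbol in to_replace:
        #regex=False: several symbols ('.', '(', '*', '?', ...) are regex metacharacters
        df = df.apply(lambda x: x.str.replace(symbol, replace_by, regex=False))
    return df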

full[cols_text] = replace_symbol(full[cols_text], with_whitespace, ' ')
full[cols_text] = replace_symbol(full[cols_text], with_empty, '')

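The symbols in the two lists are meant literally, but several of them are regex metacharacters. The sketch below uses only the standard `re` module (no pandas) to show the difference between literal and regex replacement. The `sample` string is made up. How pandas `str.replace` treats such symbols depends on its `regex` flag and, for single characters, on the pandas version.

```python
import re

sample = "renovated 2br apt. (no fee!) call 555-1234"

# Literal replacement: only the actual dot characters are removed.
literal = sample.replace('.', '')

# Regex replacement: '.' matches any character, so everything is removed.
as_regex = re.sub('.', '', sample)

# Escaping the pattern makes the regex engine treat it literally again.
escaped = re.sub(re.escape('.'), '', sample)

# Some symbols are not even valid patterns on their own.
try:
    re.sub('(', ' ', sample)
    paren_is_valid = True
except re.error:
    paren_is_valid = False

print(repr(literal))
print(repr(as_regex))
print(paren_is_valid)
```
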
Text cleaning in ```display_address``` and ```street_address```

In [None]:
adr_feat = ['display_address', 'street_address']

#Expand 'st' and 'ave' abbreviations (the lookahead keeps the whitespace that follows)
full[adr_feat] = full[adr_feat].replace(r'\sst(?=\s|$)', ' street', regex = True)
full[adr_feat] = full[adr_feat].replace(r'\save(?=\s|$)', ' avenue', regex = True)

#Expand 'e' and 'w' abbreviations
full[adr_feat] = full[adr_feat].replace([r'\se\s', r'^e\s'], ' east ', regex = True)
full[adr_feat] = full[adr_feat].replace([r'\sw\s', r'^w\s'], ' west ', regex = True)
full[adr_feat].sample()

Text cleaning in ```description``` and ```features```

In [None]:
desc_feat = ['description', 'features']
#web links
full[desc_feat] = full[desc_feat].replace(r'\swww.\S*', ' weblink ', regex = True)
#emails
full[desc_feat] = full[desc_feat].replace(r'\s\S*@\S*', ' emailaddress', regex = True)
#time
full[desc_feat] = full[desc_feat].replace(r'\s\d{1,2}\s\d\d[ap]m', ' ampmtime', regex = True)
#phone numbers
full[desc_feat] = full[desc_feat].replace(r'\s\d{2,4}\s\d{2,4}\s\d{2,4}', ' phonenumber', regex = True)

Trim leading, trailing and repeated whitespace

In [None]:
#Reduce multiple whitespaces
full[cols_text] = full[cols_text].replace(r'\s+', ' ', regex = True)
#Trim leading and trailing whitespaces
full[cols_text] = full[cols_text].replace([r'^\s', r'\s$'], '', regex = True)

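Here is a small sketch of what the placeholder patterns and the whitespace trimming do to a single string. It uses plain `re` instead of pandas, and the sample text is made up. The text is assumed to be already lower-cased and stripped of punctuation, as in the cells above. `PLACEHOLDERS` and `tag_tokens` are hypothetical names.

```python
import re

# Patterns taken from the cells above, applied in the same order.
PLACEHOLDERS = [
    (r'\swww.\S*', ' weblink '),
    (r'\s\S*@\S*', ' emailaddress'),
    (r'\s\d{1,2}\s\d\d[ap]m', ' ampmtime'),
    (r'\s\d{2,4}\s\d{2,4}\s\d{2,4}', ' phonenumber'),
]

def tag_tokens(text):
    """Replace links, emails, times and phone numbers by placeholder words."""
    for pattern, placeholder in PLACEHOLDERS:
        text = re.sub(pattern, placeholder, text)
    # Collapse repeated whitespace, then trim both ends.
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

sample = ("visit wwwexamplecom or mail agent@examplecom "
          "open house 10 30am call 212 555 0199 today")
print(tag_tokens(sample))
```

Every pattern starts with `\s`, so a link, email or number at the very beginning of a text is not replaced.
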
### Catboost

Split the concatenated dataframe back into train / test, and split train into ```X``` and ```y```

In [None]:
X = full.iloc[:-1 * len(test_raw)].drop('interest_level', axis=1)
y = full.iloc[:-1 * len(test_raw)].interest_level
test = full.iloc[-1 * len(test_raw):].drop('interest_level', axis=1)
print(full.shape, '->', X.shape, y.shape, test.shape)

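The concatenate, preprocess, split pattern can be sketched on toy frames like this. All names ending in `_toy` and the floor-division step are stand-ins, not part of the notebook.

```python
import pandas as pd

train_toy = pd.DataFrame({'price': [100, 200, 300], 'interest_level': [0, 2, 1]})
test_toy = pd.DataFrame({'price': [400, 500]})  # the test part has no target

# Concatenate so that preprocessing is applied once to both parts.
full_toy = pd.concat([train_toy, test_toy])
full_toy['price'] = full_toy['price'] // 200 * 200  # stand-in preprocessing step

# Split back by position: the last len(test_toy) rows are the test set.
n_test = len(test_toy)
X_toy = full_toy.iloc[:-n_test].drop('interest_level', axis=1)
y_toy = full_toy.iloc[:-n_test]['interest_level'].astype(int)
test_back = full_toy.iloc[-n_test:].drop('interest_level', axis=1)

print(full_toy.index.tolist())
print(X_toy.shape, y_toy.tolist(), test_back['price'].tolist())
```

After `pd.concat` the index labels repeat, which is why the split uses positions (`iloc`) rather than labels. Slicing with `-n_test` only works when the test set is not empty.
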
Train / validation split and data pool creation.<br>
*[Pool](https://catboost.ai/en/docs/concepts/python-reference_pool) is a special Catboost data constructor that increases training performance*<br>
As you can see, I set ```display_address``` as a categorical feature and ```street_address``` as a text feature. Experiments showed that this was the right approach.

In [None]:
X_train, X_valid, y_train, y_valid =  train_test_split(
                                      X, y, test_size=0.2, stratify=y, random_state=17)

cat_features = ['day', 'month', 'manager_id', 'display_address']
text_features = ['description','street_address' , 'features']

Train = Pool(data=X_train,
             label=y_train,
             cat_features=cat_features,
             text_features = text_features)
            
Valid = Pool(data=X_valid,
             label=y_valid,
             cat_features=cat_features,
             text_features = text_features)

#### Baseline
Simple out-of-the-box Catboost model.<br>
Catboost has no **metric** called log_loss, but since it is of the same nature as the MultiClass **loss_function**, I did not set a metric and left the default.

In [None]:
model = CatBoostClassifier( random_seed = 17,     
                            thread_count = -1, 
                            verbose = 100,  
                            loss_function='MultiClass',
                            task_type = "GPU" )
# Fit model
model.fit(Train, eval_set=Valid)
preds_class = model.predict(Valid)
preds_proba = model.predict_proba(Valid)

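For reference, multiclass log loss can be computed by hand with numpy. This is a minimal sketch, not Catboost's own implementation. It assumes that column `k` of the probability matrix belongs to label `k`, and `multiclass_log_loss` is a hypothetical helper name.

```python
import numpy as np

def multiclass_log_loss(y_true, proba, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    proba = np.clip(np.asarray(proba, dtype=float), eps, 1 - eps)
    # Renormalise so that every row sums to 1 after clipping.
    proba = proba / proba.sum(axis=1, keepdims=True)
    y_true = np.asarray(y_true, dtype=int)
    # Pick the predicted probability of the true class in every row.
    true_class_proba = proba[np.arange(len(y_true)), y_true]
    return float(-np.mean(np.log(true_class_proba)))

# Made-up labels and probabilities for illustration.
y_demo = [0, 2, 2, 1]
proba_demo = [[0.7, 0.2, 0.1],
              [0.1, 0.2, 0.7],
              [0.2, 0.3, 0.5],
              [0.3, 0.4, 0.3]]
print(multiclass_log_loss(y_demo, proba_demo))
```

With the notebook's variables this would be called as `multiclass_log_loss(y_valid, preds_proba)`, under the column-order assumption above.
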
#### Feature importances
You can see how important it was to handle the text features

In [None]:
FE = model.get_feature_importance(data=Valid,
                       thread_count=-1,
                       verbose=False)
FEG = pd.DataFrame(FE, index = X_valid.columns ).sort_values(0, ascending = False)
FEG.plot.bar(figsize = (20,5), rot = 0)

#### Validation scores

In [None]:
result = pd.DataFrame()
result['Real'] = y_valid.values
result['Pred'] = preds_class[:,0]

for i in range(3):
    print('Accuracy for Class', str(i), ':', "%.4f" %
          (result.loc[result.Real == i, 'Real'] ==
           result.loc[result.Real == i, 'Pred']).mean())
    print('Class', str(i), 'in observations:', "%.0f" %
          (result.loc[result.Real == i].shape[0] / len(result) * 100),
          '%', "\n")
print ('Mean accuracy:',  "%.4f" % (result.Real == result.Pred).mean(), "\n")
for i in range(3):
    print('Class', str(i), 'observations count:', "%.0f" % y[y==i].shape[0])

Here is the **main problem**: the data set is very imbalanced, which is why the scores for the minority classes are so low.<br>
One possible solution is to set class weights; I will try it in the grid search.

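To give an idea of what class weights look like, here is a sketch that derives them from label counts. As far as I understand the Catboost documentation, `Balanced` uses the largest class count divided by the count of each class, and `SqrtBalanced` uses the square root of that ratio. Treat the formula as an assumption. The helper `class_weights` and the label counts are made up.

```python
import numpy as np

def class_weights(labels, mode='Balanced'):
    """Weight per class: max class count / class count (or its square root)."""
    classes, counts = np.unique(np.asarray(labels), return_counts=True)
    ratio = counts.max() / counts
    if mode == 'SqrtBalanced':
        ratio = np.sqrt(ratio)
    elif mode != 'Balanced':
        raise ValueError(f'unknown mode: {mode}')
    return {int(c): float(w) for c, w in zip(classes, ratio)}

# Made-up label counts, not the competition's actual class distribution.
labels_demo = [0] * 10 + [1] * 40 + [2] * 160

print(class_weights(labels_demo, 'Balanced'))
print(class_weights(labels_demo, 'SqrtBalanced'))
```

A dictionary like this can be passed to the `class_weights` parameter of `CatBoostClassifier` to set the weights manually.
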
In [None]:
model.get_all_params()

#### Baseline model predictions

In [None]:
#Generate a pool from the test data
Test = Pool(data=test,
            cat_features=cat_features,
            text_features = text_features)

preds_proba = model.predict_proba(Test)
predictions = pd.DataFrame(preds_proba)

sub = pd.read_csv('../input/two-sigma-connect-rental-listing-inquiries/sample_submission.csv.zip')

sub['high'] = predictions[0]
sub['medium'] = predictions[1]
sub['low'] = predictions[2]

sub.to_csv('submission_baseline.csv', index = False)

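An alternative way to assemble the submission is to build the frame directly from the listing ids and the probability matrix, with a couple of sanity checks. This is a sketch only: `build_submission` is a hypothetical helper, the ids are made up, and it assumes the probability columns follow the label order 0 = high, 1 = medium, 2 = low used in this notebook.

```python
import numpy as np
import pandas as pd

def build_submission(listing_ids, proba, class_order=('high', 'medium', 'low')):
    """Put one probability column per class next to the listing ids."""
    proba = np.asarray(proba, dtype=float)
    if proba.shape != (len(listing_ids), len(class_order)):
        raise ValueError('proba must have one row per listing and one column per class')
    if not np.allclose(proba.sum(axis=1), 1.0):
        raise ValueError('each row of probabilities should sum to 1')
    sub = pd.DataFrame(proba, columns=list(class_order))
    sub.insert(0, 'listing_id', list(listing_ids))
    return sub

# Made-up ids and probabilities for illustration.
sub_demo = build_submission([7142618, 7210040],
                            [[0.1, 0.3, 0.6], [0.5, 0.4, 0.1]])
print(sub_demo)
```

Taking the ids from the same frame that produced the predictions keeps rows and probabilities aligned without relying on the row order of the sample file.
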
The baseline model scored 0.5593 on validation; when I made the submission I got 0.56646 on the private LB.<br>
It is not a good score compared to others...

#### Grid search

Here are my attempts with the grid search. Although it is possible to use sklearn GridSearchCV with Catboost models, I recommend the built-in ```grid_search``` as it works much faster. The main reason is that the built-in module supports GPU.
To save notebook committing time I kept only the best parameters, but you can see all the other values I played with in the comments.<br>
My major disappointment is that ```auto_class_weights``` did not work as I hoped. Setting class weights manually (Catboost allows this) decreased the scores as well.<br>
Actually, the main thing I could play with was the number of trees; Catboost picks all other parameters well automatically, depending on what I set manually.

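A grid search tries every combination of the listed values, so the grid grows quickly. The sketch below expands a grid in plain Python to count the combinations. It is not how Catboost implements `grid_search`, and `expand_grid` and `grid_demo` are hypothetical names.

```python
from itertools import product

def expand_grid(param_grid):
    """Return every parameter combination of a dict of candidate lists."""
    names = list(param_grid)
    return [dict(zip(names, values)) for values in product(*param_grid.values())]

# A small grid built from some of the values commented out in the cell below.
grid_demo = {
    'iterations': [1000, 2500],
    'depth': [4, 6, 8],
    'l2_leaf_reg': [2, 3],
}
combos = expand_grid(grid_demo)
print(len(combos))
print(combos[0])
```

Each combination means at least one full training run, which is why I searched one or two parameters at a time.
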
In [None]:
Train_gs = Pool(data=X,
             label=y,
             cat_features=cat_features,
             text_features = text_features)

model_gs = CatBoostClassifier(random_seed = 17,     
                            thread_count = -1, 
                            verbose = 1000,  
                            loss_function='MultiClass',
                            task_type = "GPU",
                            )
params_gs = {
            'iterations': [10000],    #  [1000,2500,5000, 10000]
            #'learning_rate': [0.01, 0.1, 0.15, 0.3, 0.5], 
            #'auto_class_weights': ['None', 'Balanced', 'SqrtBalanced']
            #'depth': [4, 6, 8, 10]
            #'l2_leaf_reg': [2,3,4]
            #'min_data_in_leaf': [1, 2, 3,4]
            }
gs_result = model_gs.grid_search(params_gs, 
                              Train_gs, 
                              partition_random_seed = 17,
                              stratified = True,
                              verbose = 1000,
                              plot=False)

In [None]:
model_gs.get_all_params()

#### Final model predictions and submissions

In [None]:
Test = Pool(data=test,
            cat_features=cat_features,
            text_features = text_features)

preds_proba_gs = model_gs.predict_proba(Test)
predictions_gs = pd.DataFrame(preds_proba_gs)

sub_gs = pd.read_csv('../input/two-sigma-connect-rental-listing-inquiries/sample_submission.csv.zip')

sub_gs['high'] = predictions_gs[0]
sub_gs['medium'] = predictions_gs[1]
sub_gs['low'] = predictions_gs[2]

sub_gs.head()
sub_gs.to_csv('submission_final.csv', index = False)

Final submission score is 0.55789.<br>
**Thank you for your attention!**

#### Augmentation (optional reading)
One way to manage imbalanced data is data augmentation: adding synthetic data for the minority classes. The issue is that in this competition you need to augment text data as well, which takes a lot of time that I did not have. *(with my algorithm, augmenting 10 text cells took about a minute because of Google Translate lag and the slow pandas.apply() method)*<br>
By the way, in the hidden cell you can find my algorithm; maybe it will be useful for your experiments. It translates the text to a random language and then back to English. I tried augmenting the numeric data only, but as expected it made things worse.

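The numeric part of the augmentation boils down to multiplying each value by a random factor close to 1. Here is a minimal runnable sketch of that idea on made-up rows. `jitter_numeric` is a hypothetical helper, and the relative ranges mirror the ones in the commented cell below. As noted above, numeric-only augmentation made my results worse.

```python
import numpy as np
import pandas as pd

def jitter_numeric(df, rel_ranges, seed=17):
    """Return a copy of df with each listed column scaled by random factors."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    for col, rel in rel_ranges.items():
        # One factor per row, drawn uniformly from [1 - rel, 1 + rel).
        factors = rng.uniform(1 - rel, 1 + rel, size=len(out))
        out[col] = out[col] * factors
    return out

# Made-up minority-class rows for illustration.
minority = pd.DataFrame({'price': [2500.0, 4000.0],
                         'latitude': [40.7145, 40.7947]})
augmented = jitter_numeric(minority, {'price': 0.02, 'latitude': 0.0005})
print(augmented)
```
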
In [None]:
# !pip uninstall googletrans -y
# !pip install googletrans==4.0.0rc1
# import googletrans
# from googletrans import Translator
# translator = Translator()  # text_aug() below relies on this instance

# data = X_train.join(y_train)

# data_0 = data[data.interest_level == 0]
# data_1 = data[data.interest_level == 1]

# # Class_0_add / Class_1_add are replication factors per class;
# # they are not defined in this cell and must be set before running it
# data_0_aug = pd.concat([data_0]*(Class_0_add-1))
# data_1_aug = pd.concat([data_1]*(Class_1_add-1))

# data_aug = pd.concat([data_0_aug, data_1_aug]).reset_index(drop = True)
# for i in range(3):
#     print('Class', str(i), 'count:', "%.0f" % data_aug[data_aug.interest_level == i].shape[0])


# length = data_aug.shape[0]

# rand_price = np.random.uniform(0.98, 1.02, length)
# rand_long = np.random.uniform(0.999, 1.001, length)
# rand_lat = np.random.uniform(0.9995, 1.0005, length)

# data_aug['price'] = (data_aug['price'] * rand_price).astype(int) // 10 * 10
# data_aug['longitude'] = round(data_aug['longitude'] * rand_long,4)
# data_aug['latitude'] = round(data_aug['latitude'] * rand_lat,4)

# train_new = pd.concat([data, data_aug])
# print('New train set:')
# for i in range(3):
#     print('Class', str(i), 'count:', "%.0f" % train_new[train_new.interest_level == i].shape[0])


# #*****Text Data distortion*****

# languages = list(googletrans.LANGUAGES.keys())
# def text_aug (text):
#     rand = np.random.randint(0, len(languages))
#     translated = translator.translate(text, dest = languages[rand])
#     return str.lower(translator.translate
#                     (translated.text, dest = 'en').text)

# # Can be used only column by column, like this:
# for col in df.columns:
#     df[col] = df[col].apply(text_aug)