# Twinkle
A ranking model for flats (apartments), based on the learning-to-rank (LTR) technique. [LambdaRank](https://www.microsoft.com/en-us/research/uploads/prod/2016/02/MSR-TR-2010-82.pdf) is used as the core algorithm. The final ranker is meant to order housing listings by relevance for the end Telegram user. Real-time ranking quality can be checked via the local dev service **browsing**.

## I/O
The required data consists of two parts:
- Training tabular data is located at `/home/jovyan/data/training.csv`;
- Testing (validation) tabular data is located at `/home/jovyan/data/testing.csv`.

The output model must be stored at `/home/jovyan/models/twinkle.txt`.

## Features
- `actual_price` - the flat's actual price (in USD);
- `utmost_price` - the query's price limit (in USD); search results should not exceed it by much;
- `total_area` - the apartment's total area (in square meters);
- `living_area` - the flat's living area (in square meters);
- `kitchen_area` - the flat's kitchen area (in square meters);
- `actual_room_number` - the flat's actual number of living rooms;
- `desired_room_number` - the target room count;
- `actual_floor` - the apartment's floor (the ground floor is floor #1);
- `total_floor` - the number of floors in the building;
- `desired_floor` - the target floor for the flat;
- `housing` - either a new build or a used apartment;
- `ssf` - Subway Station Factor, a score for nearby subway stations;
- `izf` - Industrial Zone Factor, a score for nearby factories & plants;
- `gzf` - Green Zone Factor, a score for nearby parks;
- `relevance` - the sample's search quality (training only);
- `query` - the sample's group ID (training only).

## Categorical data
The categories of each feature are listed in rank order.
- `desired_room_number`
    * `whatever` - any number of rooms;
    * `1` - 1 room;
    * `2` - 2 rooms;
    * `3` - 3 rooms;
    * `4+` - huge luxurious apartments with many rooms;
- `desired_floor`
    * `whatever` - any floor;
    * `low` - low floors are preferred;
    * `high` - top floors are preferred;
- `housing`
    * `primary` - new builds & houses under construction;
    * `secondary` - old & previously occupied apartments;
- `relevance`
    * `terrible` - don't show this thing again!
    * `bad` - poor quality;
    * `so-so` - average result;
    * `good` - a fairly good match;
    * `excellent` - the best matches.
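
The rank ordering above can be turned into integer codes, which is the form LightGBM needs for categorical features and ranking labels. The following is a minimal sketch, not the project's actual preprocessing: whether the CSV files already store these categories as integers is not stated here, and the names `ORDERED_CATEGORIES`, `encode`, and `encode_row` are hypothetical.

```python
# Hypothetical rank tables: each category's code is its zero-based
# position in the order listed in this section (an assumption).
ORDERED_CATEGORIES = {
    "desired_room_number": ["whatever", "1", "2", "3", "4+"],
    "desired_floor": ["whatever", "low", "high"],
    "housing": ["primary", "secondary"],
    "relevance": ["terrible", "bad", "so-so", "good", "excellent"],
}


def encode(column, value):
    """Return the zero-based rank of `value` among `column`'s categories."""
    return ORDERED_CATEGORIES[column].index(value)


def encode_row(row):
    """Encode the categorical fields of a dict row; leave other fields as-is."""
    return {
        key: encode(key, value) if key in ORDERED_CATEGORIES else value
        for key, value in row.items()
    }
```

With this encoding, `relevance` grows from `terrible` to `excellent`, so a higher label means a better match, as a ranking objective expects.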

In [2]:
from pandas import read_csv
from lightgbm import train, Dataset

In [3]:
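def load(path, params, reference=None):
    """Read a CSV file and wrap it into a LightGBM ranking Dataset."""
    frame = read_csv(path)
    return Dataset(
        # Features: every column except the label and the query group ID.
        frame.drop(columns=['relevance', 'query']),
        # Label: the relevance grade of each sample.
        frame['relevance'],
        # Group sizes come out ordered by sorted query ID,
        # so the rows are expected to be sorted by `query` as well.
        group=frame.groupby(['query']).size(),
        categorical_feature=['desired_room_number', 'desired_floor', 'housing'],
        # Validation data reuses the binning of the training Dataset.
        reference=reference,
        params=params
    )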
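
LightGBM reads `group` as the sizes of consecutive row blocks, while `frame.groupby(['query']).size()` returns sizes ordered by sorted query ID. The two agree only when each query's rows are contiguous and the queries appear in sorted order. Whether the CSV files satisfy this is not stated here, so the sketch below shows one way to check it. The helper names are hypothetical.

```python
from itertools import groupby


def contiguous_group_sizes(query_ids):
    """Sizes of consecutive runs of equal query IDs, in row order."""
    return [sum(1 for _ in run) for _, run in groupby(query_ids)]


def is_grouped(query_ids):
    """True if every query ID occupies one contiguous block of rows."""
    run_keys = [key for key, _ in groupby(query_ids)]
    return len(run_keys) == len(set(run_keys))


# Toy query column: three blocks of rows, already contiguous.
queries = [7, 7, 7, 3, 3, 9]
sizes = contiguous_group_sizes(queries)
```

Here the row-order sizes are `[3, 2, 1]`, whereas sizes ordered by sorted query ID (3, 7, 9) would be `[2, 3, 1]`. Sorting the frame by `query` before building the `Dataset` removes the mismatch.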

In [5]:
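params = {'objective': 'lambdarank', 'metric': 'ndcg'}
training = load('/home/jovyan/data/training.csv', params)
testing = load('/home/jovyan/data/testing.csv', params, training)
# Train for at most 30 rounds, stop early after 15 rounds without improvement.
# Note: LightGBM 4.0 removed the `early_stopping_rounds` keyword of `train`;
# newer versions need `callbacks=[lightgbm.early_stopping(15)]` instead.
train(
    params,
    training,
    num_boost_round=30,
    valid_sets=[testing],
    early_stopping_rounds=15
).save_model('/home/jovyan/models/twinkle.txt')
pass  # suppress the cell's return value in the notebook output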
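
The training log reports `ndcg@k` values for k from 1 to 5. As a rough illustration of what those numbers measure, here is a minimal sketch of NDCG@k for a single query, assuming integer relevance labels and the gain `2**label - 1` (LightGBM's default `label_gain`). The function names are hypothetical, and returning 1.0 for a query with no relevant items is a convention chosen for this sketch.

```python
from math import log2


def dcg_at_k(labels, k):
    """Discounted cumulative gain of the first k labels, in the given order."""
    return sum(
        (2 ** label - 1) / log2(position + 2)
        for position, label in enumerate(labels[:k])
    )


def ndcg_at_k(labels_in_predicted_order, k):
    """DCG of the predicted order divided by the DCG of the ideal order."""
    ideal = dcg_at_k(sorted(labels_in_predicted_order, reverse=True), k)
    if ideal == 0:
        return 1.0  # no relevant items: treated as a perfect ranking here
    return dcg_at_k(labels_in_predicted_order, k) / ideal
```

The reported metric is the average of such per-query values over all queries in the validation set.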

You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1538
[LightGBM] [Info] Number of data points in the train set: 1350, number of used features: 14
[1]	valid_0's ndcg@1: 0.681481	valid_0's ndcg@2: 0.561127	valid_0's ndcg@3: 0.610224	valid_0's ndcg@4: 0.616368	valid_0's ndcg@5: 0.636542
Training until validation scores don't improve for 15 rounds
[2]	valid_0's ndcg@1: 0.807407	valid_0's ndcg@2: 0.707112	valid_0's ndcg@3: 0.673786	valid_0's ndcg@4: 0.673082	valid_0's ndcg@5: 0.6551
[3]	valid_0's ndcg@1: 0.807407	valid_0's ndcg@2: 0.738634	valid_0's ndcg@3: 0.753274	valid_0's ndcg@4: 0.750052	valid_0's ndcg@5: 0.720191
[4]	valid_0's ndcg@1: 0.585185	valid_0's ndcg@2: 0.653959	valid_0's ndcg@3: 0.730182	valid_0's ndcg@4: 0.715897	valid_0's ndcg@5: 0.725505
[5]	valid_0's ndcg@1: 0.644444	valid_0's ndcg@2: 0.667369	valid_0's ndcg@3: 0.740445	valid_0's ndcg@4: 0.754325	valid_0's ndcg@5: 0.739453
[6]	valid_0's ndcg@1: 0.525926	valid_0's ndcg@2: 0.640549	vali