![Rampart Logo](../images/logo.png)

Twinkle is a model that ranks flats, built with the learning-to-rank (LTR) technique. It uses [LambdaRank](https://www.microsoft.com/en-us/research/uploads/prod/2016/02/MSR-TR-2010-82.pdf) as its core objective. The trained ranker is meant to order housing listings by relevance for the end Telegram user.

## I/O
All datasets are located in `../scientific/binaries`. The final model must be saved to `../scientific/models/twinkle.latest.txt`.

## Metadata
Dataset files have self-explanatory names:
```
<tag>.<group>.bin
```
For instance:
```
70e36b1e47754064ad14b57a5c79d5bc.validation.bin
```
Placeholders in angle brackets:
- `tag` - a UUID4 in hexadecimal form.
- `group` - the dataset's purpose, either `training` or `validation`.
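The naming scheme above can be built and checked with a few lines of Python. This is a minimal sketch: the helper names `build_name` and `parse_name` are hypothetical, and the pattern assumes the tag is always the 32-character lowercase hex form shown in the example. Names that use the `latest` alias, as loaded by the code cells in this notebook, would be rejected by this pattern.

```python
import re
from typing import Optional, Tuple
from uuid import UUID, uuid4

# Pattern for "<tag>.<group>.bin": a 32-character hex tag and a known group.
NAME_PATTERN = re.compile(
    r"^(?P<tag>[0-9a-f]{32})\.(?P<group>training|validation)\.bin$"
)


def build_name(group: str, tag: Optional[str] = None) -> str:
    """Compose a dataset file name (hypothetical helper).

    A fresh UUID4 tag is generated when none is given.
    """
    if group not in ("training", "validation"):
        raise ValueError(f"unknown group: {group!r}")
    return f"{tag or uuid4().hex}.{group}.bin"


def parse_name(name: str) -> Tuple[str, str]:
    """Split a dataset file name into (tag, group), rejecting malformed names."""
    match = NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a dataset file name: {name!r}")
    tag = match.group("tag")
    # Assumption: tags are genuine version-4 UUIDs, as the description says.
    if UUID(tag).version != 4:
        raise ValueError(f"tag is not a UUID4: {tag!r}")
    return tag, match.group("group")
```

For example, `parse_name('70e36b1e47754064ad14b57a5c79d5bc.validation.bin')` yields the tag and the `validation` group as a pair.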

## Features
- `actual_price` - the flat's actual price (in USD).
- `utmost_price` - the query's price limit (in USD); search results should not exceed it by much.
- `total_area` - the apartment's total area (in square meters).
- `living_area` - the flat's living-room area (in square meters).
- `kitchen_area` - the flat's kitchen area (in square meters).
- `actual_room_number` - the flat's actual number of living rooms.
- `desired_room_number` - the desired number of rooms.
- `actual_floor` - the apartment's floor (the ground floor is floor 1).
- `total_floor` - the number of floors in the building.
- `desired_floor` - the desired floor for the flat.
- `housing` - whether the apartment is a new build or a previously occupied one.
- `ssf` - Subway Station Factor, a score reflecting nearby subway stations.
- `izf` - Industrial Zone Factor, a score reflecting nearby factories and plants.
- `gzf` - Green Zone Factor, a score reflecting nearby parks.
- `unknown_count` - number of unclassified photos.
- `abandoned_count` - number of unavailable or missing photos.
- `luxury_count` - number of photos showing elite housing.
- `comfort_count` - number of photos showing an ordinary flat.
- `junk_count` - number of photos showing an outdated apartment interior.
- `construction_count` - number of photos showing an unfinished building.
- `excess_count` - number of irrelevant ("trash") photos.
- `panorama_count` - number of panorama (360°) images.

## Categorical data
The categories of each feature are listed in rank order, from lowest to highest:
- `desired_room_number`
    * `whatever` - any number of rooms;
    * `1` - 1 room;
    * `2` - 2 rooms;
    * `3` - 3 rooms;
    * `4+` - large luxury apartments with many rooms;
- `desired_floor`
    * `whatever` - any floor;
    * `low` - low floors are preferred;
    * `high` - top floors are preferred;
- `housing`
    * `primary` - new builds and houses under construction;
    * `secondary` - older, previously occupied apartments;
- `relevance`
    * `terrible` - don't show this thing again!
    * `bad` - poor quality;
    * `so-so` - an average result;
    * `good` - a fairly good match;
    * `excellent` - the best matches.
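Because the categories are ordered, one simple way to turn them into numbers is ordinal encoding. The sketch below assumes zero-based integer codes that follow the list order; the notebook does not state the encoding actually used when the binary datasets were built, and `CATEGORY_ORDER` and `encode` are hypothetical names. LightGBM's LambdaRank objective does expect integer relevance labels, so a mapping of this kind fits `relevance` as well.

```python
# Category names copied from the lists above, lowest rank first.
CATEGORY_ORDER = {
    "desired_room_number": ["whatever", "1", "2", "3", "4+"],
    "desired_floor": ["whatever", "low", "high"],
    "housing": ["primary", "secondary"],
    "relevance": ["terrible", "bad", "so-so", "good", "excellent"],
}


def encode(feature: str, category: str) -> int:
    """Return the zero-based rank of a category within its feature.

    Raises KeyError for an unknown feature and ValueError for an
    unknown category, so typos surface early.
    """
    return CATEGORY_ORDER[feature].index(category)


# Encode one hypothetical query: any room count, top floors, new build.
encoded_query = {
    "desired_room_number": encode("desired_room_number", "whatever"),
    "desired_floor": encode("desired_floor", "high"),
    "housing": encode("housing", "primary"),
}
```

Under this assumption the relevance grades map to the integers 0 through 4.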

In [None]:
%matplotlib inline

In [None]:
from lightgbm import train, Dataset
from uuid import uuid4

In [None]:
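# Load the prebuilt LightGBM binary datasets.
training_dataset = Dataset('../scientific/binaries/latest.training.bin')
# create_valid() ties the validation set to the training set as its reference.
validation_dataset = training_dataset.create_valid('../scientific/binaries/latest.validation.bin')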

In [None]:
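from lightgbm import early_stopping

booster = train(
    {'objective': 'lambdarank', 'metric': 'ndcg', 'force_row_wise': True},
    training_dataset,
    num_boost_round=15,
    valid_sets=[validation_dataset],
    # Stop if validation NDCG has not improved for 10 consecutive rounds.
    # The `early_stopping_rounds` keyword was removed from train() in LightGBM 4.0.
    callbacks=[early_stopping(stopping_rounds=10)]
)
# Overwrite the "latest" model and keep a uniquely tagged copy alongside it.
booster.save_model('../scientific/models/twinkle.latest.txt')
booster.save_model(f'../scientific/models/twinkle.{uuid4().hex}.txt')
pass  # keep the cell from echoing the return value of save_model()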
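
The `ndcg` metric passed to `train` scores how well a ranked list places highly relevant items near the top. The pure-Python sketch below illustrates the idea for a single query, using the common exponential gain `2^rel - 1` and a `log2` position discount. The function names are hypothetical, and returning `0.0` for a query with no relevant items is an assumption of this sketch; LightGBM's own handling of such queries may differ.

```python
import math
from typing import Optional, Sequence


def dcg(relevances: Sequence[int], k: Optional[int] = None) -> float:
    """Discounted cumulative gain of relevance grades given in ranked order."""
    top = relevances[:k] if k is not None else relevances
    # Position 0 gets discount log2(2) = 1, position 1 gets log2(3), and so on.
    return sum((2 ** rel - 1) / math.log2(pos + 2) for pos, rel in enumerate(top))


def ndcg(relevances: Sequence[int], k: Optional[int] = None) -> float:
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    # Assumption: a query without any relevant item scores 0.0.
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0


# Relevance grades of one query's results, in the order the ranker returned them
# (integers 0..4, assuming `terrible` = 0 and `excellent` = 4).
perfect_order = ndcg([4, 3, 2, 1, 0])
swapped_order = ndcg([0, 4])
```

A perfectly ordered list scores 1, and any ordering that pushes a highly relevant flat further down scores lower.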