# Task

The files `airlines.train.tsv` and `airlines.test.tsv` contain user ratings of various airlines. The full dataset is available <a href="https://github.com/quankiquanki/skytrax-reviews-dataset">here</a>.

Each record contains the airline name, the reviewer's country, the cabin class flown, the review text, and the overall rating from 0 to 10.

The task is to predict the rating the user gave from the first four fields (airline, country, cabin class, review text). To do this, the data must first be converted to the vw format. The format in which solutions must be submitted is described below.

The answer to submit is the trained weights of a Vowpal Wabbit model. To evaluate the solution, vw will be run with these weights on the test data and the R2 metric will be computed. Solutions that score above `0.35` receive 100%. Solutions with a lower score receive a lower grade according to the quality achieved. The model itself (the weights) must be saved to the file `result.vw`.
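As a reminder of what the metric measures, here is a minimal sketch of R2 computed by hand, assuming the grader uses the standard coefficient of determination (the same definition as `sklearn.metrics.r2_score`). The function name `r2_manual` and the ratings are made up for illustration.

```python
import numpy as np

def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # squared error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # squared error of the mean predictor
    return 1.0 - ss_res / ss_tot

# Made-up ratings, for illustration only
y_true = [8.0, 6.0, 5.0, 1.0]
perfect = r2_manual(y_true, y_true)                   # a perfect model
baseline = r2_manual(y_true, [np.mean(y_true)] * 4)   # always predicting the mean
```

Under this definition a perfect model scores 1 and a model that always predicts the mean rating scores 0, so the `0.35` threshold means removing about a third of the squared error of that constant baseline.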

vw format:
* The target variable is the user rating
* 4 namespaces: name, country, cabin, review
* Values in name, country, cabin are normalized to a uniform format: every character that is not a letter or a digit (that is, matches the regular expression `\W`) is replaced with `_`, and the whole string is lowercased.
* In review only valid tokens are kept (that is, those matching the regular expression `[a-zA-Z0-9_]+`).

To show what this format looks like, the file `airlines.test.sample.vw` contains the first 10 rows of the test set encoded this way.
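The rules above can be tried on a single record. The sketch below is only an illustration: the helper names `normalize_field` and `tokenize_review` are hypothetical, the review text is shortened, and lowercasing the review is an assumption based on how the sample file looks.

```python
import re

def normalize_field(value):
    # Lowercase, then replace every character matching \W with "_"
    return re.sub(r"\W", "_", value.lower())

def tokenize_review(text):
    # Keep only runs of letters, digits and underscores (lowercased first)
    return re.findall(r"[a-zA-Z0-9_]+", text.lower())

example = "{} |name {} |country {} |cabin {} |review {}".format(
    6.0,
    normalize_field("jet-airways"),
    normalize_field("Qatar"),
    normalize_field("Business Class"),
    " ".join(tokenize_review("Flew Business Class DOH-BOM-DOH.")),
)
print(example)
# 6.0 |name jet_airways |country qatar |cabin business_class |review flew business class doh bom doh
```

Note that in Python 3 `\W` is Unicode-aware for `str` patterns, so an accented letter in a name field is kept, whereas the ASCII-only review pattern splits a word such as `fiancé` into `fianc`.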


In [1]:
import pandas as pd
import numpy as np

In [2]:
df_train = pd.read_csv('airlines.train.tsv', sep='\t')
df_test = pd.read_csv('airlines.test.tsv', sep='\t')

In [3]:
df_train.head()

Unnamed: 0,airline_name,author_country,cabin_flown,content,rating
0,sunwing-airlines,Canada,Economy,March 5th 2014 from Ottawa Canada to Cuba WG 6...,9.0
1,lufthansa,United Kingdom,Economy,SIN-FRA-BHX in Economy. First leg from Singapo...,7.0
2,spirit-airlines,United States,Economy,"Spirit does what they state on their web site,...",7.0
3,sunwing-airlines,Canada,Premium Economy,My fiancé and I were booked to fly to Cayo San...,1.0
4,british-airways,United States,First Class,DXB-LHR B777-200ER BA0108 August 18 First Clas...,9.0


In [4]:
df_test.head()

Unnamed: 0,airline_name,author_country,cabin_flown,content,rating
0,south-african-airways,United Kingdom,Economy,JNB-LHR on the new airbus. Seats were roomy an...,8.0
1,jet-airways,Qatar,Business Class,Flew Business Class DOH-BOM-DOH. Outbound: Use...,6.0
2,american-airlines,United States,First Class,This is a rough review because we flew first b...,5.0
3,flybe,United Kingdom,Economy,Am thoroughly fed up with Flybe customer servi...,1.0
4,american-airlines,United Arab Emirates,Economy,I have flown MIA-JFK on an old B767-300. Fligh...,5.0


In [5]:
! head -n 2 airlines.test.sample.vw

8.0 |name south_african_airways |country united_kingdom |cabin economy |review jnb lhr on the new airbus seats were roomy and comfy staff polite and friendly and inflight entertainment system outstanding we had terrible turbulence throughout the flight but the captain was informative and reassuring and everyone remained calm food not great but otherwise excellent
6.0 |name jet_airways |country qatar |cabin business_class |review flew business class doh bom doh outbound used the oryx lounge at doha airport which was nice cabin was nearly empty seats are similar to those on jet s domestic business class found it difficult to sleep with the recline provided at 6 3 legrests did not help as my legs overshot it the light sandwich was passable service was attentive and cheerful inbound evening flight so looked forward to meal and wine same cheap french table wine indian non veg meal was not great cabin crew were attentive and friendly ife was limited one negative was that my bag was one of t

The coefficients you obtain will be checked roughly as follows:

In [135]:
# Here you convert train and test to the vw format and train vw on the data from airlines.train.tsv
# The choice of parameters is entirely up to you: try to get the best quality you can
! vw --final_regressor result.vw airlines.test.sample.vw 

final_regressor = result.vw
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = airlines.test.sample.vw
num sources = 1
average  since         example        example  current  current  current
loss     last          counter         weight    label  predict features
64.000000 64.000000            1            1.0   8.0000   0.0000       48
40.509827 17.019653            2            2.0   6.0000   1.8745      116
22.732204 4.954582            4            4.0   1.0000   2.1652       92
16.739873 10.747542            8            8.0   8.0000   4.8364      152

finished run
number of examples = 10
weighted example sum = 10.000000
weighted label sum = 57.000000
average loss = 15.418942
best constant = 5.700000
total feature number = 1277


In [136]:
# During grading, roughly the following command will be run.
# Instead of airlines.test.sample.vw, the whole of airlines.test.tsv converted to the format above will be used
# Since you have the complete airlines.test.tsv file, you can convert it to the vw format and use it
# for self-checking

! vw --testonly --initial_regressor result.vw --predictions predictions.txt airlines.test.sample.vw

only testing
predictions = predictions.txt
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = airlines.test.sample.vw
num sources = 1
average  since         example        example  current  current  current
loss     last          counter         weight    label  predict features
11.737091 11.737091            1            1.0   8.0000   4.5741       48
7.378532 3.019973            2            2.0   6.0000   7.7378      116
7.366215 7.353898            4            4.0   1.0000   3.3891       92
6.362258 5.358301            8            8.0   8.0000   8.0000      152

finished run
number of examples = 10
weighted example sum = 10.000000
weighted label sum = 57.000000
average loss = 5.482505
best constant = 5.700000
total feature number = 1277


## Converting the data

In [11]:
df_train.iloc[0].values

array(['sunwing-airlines', 'Canada', 'Economy',
       "March 5th 2014 from Ottawa Canada to Cuba WG 630. They announced that the flight was going to be delayed 1 hour no explanation why. They started boarding and we took off only 1/2 hour late. There were 6 of us 2 were seated together and remaining 4 were put in aisle seats side by side. On the way back from Cuba on March 12th 2014 WG 631 we were slow going through immigration no fault of Sunwing. Finally arrived to our plane at 10.35am the doors immediately closed and the plane took off 5 minutes later 20 minutes earlier than expected. The 6 of us were pretty much split up by 2 each seating my 12 old daughter by herself behind us. Overall the staff were great very friendly and approachable. The food served was pretty good considering most airlines don't offer meal service for free. It was comparable to meals we've had to purchase on other airlines.",
       9.0], dtype=object)

In [14]:
import re

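def convert_to_vw(row):
    # row holds the five fields in order: airline, country, cabin, review text, rating
    name, country, cabin, review, target = row

    # Lowercase and replace every non-word character with "_"
    # (raw strings avoid the invalid-escape warning for "\W")
    name = re.sub(r'\W', '_', name.lower())
    country = re.sub(r'\W', '_', country.lower())
    cabin = re.sub(r'\W', '_', cabin.lower())

    # Keep only [a-zA-Z0-9_]+ tokens from the lowercased review text
    words = re.findall(r"[a-zA-Z0-9_]+", review.lower())
    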
    
    vw_str = "{target} |name {name} |country {country} |cabin {cabin} |review {review}"
    
    return vw_str.format(
        target=target,
        name=name,
        country=country,
        cabin=cabin,
        review=' '.join(words)
    )

In [16]:
df_train.dropna(inplace=True)

In [17]:
def save_to_vw(df: pd.DataFrame, converter: callable, filename: str):
    with open(filename, 'w') as f:
        for line in df.values:
            vw_line = converter(line)
            f.write(f'{vw_line} \n')

In [18]:
save_to_vw(df_train, convert_to_vw, 'airlines.train.vw')

In [22]:
save_to_vw(df_test, convert_to_vw, 'airlines.test.vw')
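One way to sanity-check the conversion is to compare the first rows of the generated test file with the provided `airlines.test.sample.vw`. The helper below is a hypothetical sketch; it assumes the sample file stores complete lines and differs from the generated file at most in trailing whitespace.

```python
from itertools import islice

def first_lines_match(generated_path, sample_path, n=10):
    """Compare the first n lines of two vw files, ignoring trailing whitespace."""
    with open(generated_path) as gen, open(sample_path) as ref:
        gen_lines = [line.rstrip() for line in islice(gen, n)]
        ref_lines = [line.rstrip() for line in islice(ref, n)]
    return gen_lines == ref_lines

# Intended usage once both files exist:
# first_lines_match('airlines.test.vw', 'airlines.test.sample.vw')
```

If the check returns `False`, the first differing line usually points at a normalization rule that was applied differently.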

In [19]:
! head -n 10 airlines.train.vw 

9.0 |name sunwing_airlines |country canada |cabin economy |review march 5th 2014 from ottawa canada to cuba wg 630 they announced that the flight was going to be delayed 1 hour no explanation why they started boarding and we took off only 1 2 hour late there were 6 of us 2 were seated together and remaining 4 were put in aisle seats side by side on the way back from cuba on march 12th 2014 wg 631 we were slow going through immigration no fault of sunwing finally arrived to our plane at 10 35am the doors immediately closed and the plane took off 5 minutes later 20 minutes earlier than expected the 6 of us were pretty much split up by 2 each seating my 12 old daughter by herself behind us overall the staff were great very friendly and approachable the food served was pretty good considering most airlines don t offer meal service for free it was comparable to meals we ve had to purchase on other airlines 
7.0 |name lufthansa |country united_kingdom |cabin economy |review sin fra bhx in e

## Train

In [29]:
! vw --final_regressor result.vw airlines.train.vw --ngram 2 --learning_rate 40.0 --bit_precision 22 --passes 50 --cache_file vw.cache

Generating 2-grams for all namespaces.
final_regressor = result.vw
Num weight bits = 22
learning rate = 40
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
using cache_file = vw.cache
ignoring text input in favor of cache input
num sources = 1
Enabled reductions: gd, scorer
average  since         example        example  current  current  current
loss     last          counter         weight    label  predict features
81.000000 81.000000            1            1.0   9.0000   0.0000      330
53.915319 26.830639            2            2.0   7.0000   1.8202      236
46.320791 38.726262            4            4.0   1.0000   8.6986      502
47.301578 48.282366            8            8.0   1.0000  10.0000      598
31.030626 14.759673           16           16.0   8.0000   2.9523      130
25.978610 20.926595           32           32.0   1.0000   7.0240      316
19.727663 13.476716           64           64.0  10.0000  10.0000      430
15.382648 11.037633          128          128.0   8
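When trying several parameter combinations, it can be convenient to assemble the training command in Python rather than edit it by hand. The sketch below only builds the command line from the flags used above; `build_vw_train_command` is a hypothetical helper, and nothing is executed.

```python
import shlex

def build_vw_train_command(train_path, model_path, **options):
    """Build a vw training command from keyword options (hypothetical helper)."""
    args = ["vw", "--final_regressor", model_path, train_path]
    for flag, value in options.items():
        # Each keyword becomes a "--flag value" pair, in the order given
        args += [f"--{flag}", str(value)]
    return args

cmd = build_vw_train_command(
    "airlines.train.vw", "result.vw",
    ngram=2, learning_rate=40.0, bit_precision=22, passes=50, cache_file="vw.cache",
)
print(shlex.join(cmd))
# vw --final_regressor result.vw airlines.train.vw --ngram 2 --learning_rate 40.0 --bit_precision 22 --passes 50 --cache_file vw.cache
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`. When changing parameters between runs, remember that vw reuses an existing cache file, so it may need to be deleted if the training data changes.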

## Evaluate

In [30]:
! vw --testonly --initial_regressor result.vw --predictions airlines-predictions.txt airlines.test.vw

Generating 2-grams for all namespaces.
only testing
predictions = airlines-predictions.txt
Num weight bits = 22
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = airlines.test.vw
num sources = 1
Enabled reductions: gd, scorer
average  since         example        example  current  current  current
loss     last          counter         weight    label  predict features
4.000000 4.000000            1            1.0   8.0000  10.0000       92
4.759127 5.518253            2            2.0   6.0000   8.3491      228
3.957396 3.155665            4            4.0   1.0000   3.1507      180
2.493897 1.030397            8            8.0   8.0000   6.7246      300
3.365292 4.236687           16           16.0   8.0000   5.5176      102
2.543051 1.720809           32           32.0   8.0000   6.4655      108
3.084309 3.625567           64           64.0   8.0000   8.2260       70
3.073055 3.061800          128          128.0   8.0000   8.9499      396
3.601490 4.1

In [31]:
! head airlines-predictions.txt

10
8.349096
3.701584
3.150685
5.030509
5.645532
4.812071
6.724612
8.824572
5.668486


In [32]:
from sklearn.metrics import r2_score

def read_target_from_vw(vw_object):
    return float(vw_object.split(' ')[0])


def calc_r2(predictions_path, answers_path):
    with open(predictions_path, 'r') as f:
        y_pred = np.array([float(value) for value in f.readlines()])
        
    with open(answers_path, 'r') as f:
        y_expected = np.array([read_target_from_vw(value) for value in f.readlines()])
        
    return r2_score(y_expected, y_pred)

In [33]:
calc_r2('airlines-predictions.txt', 'airlines.test.vw')

0.6505202865000241