# Task

The files `airlines.reviews.train.tsv` and `airlines.reviews.test.tsv` contain user reviews of different airlines. The full dataset is available <a href="https://github.com/quankiquanki/skytrax-reviews-dataset">at this link</a>.

Each record consists of a review written by a user and a score from 0 to 10. The task is to predict the score from the review text using different text-processing approaches.

Below is the code that trains the baseline: a linear VW (Vowpal Wabbit) model with all the text encoded as a bag of words. Your code should improve on this model by using a better text encoding.

In [6]:
import pandas as pd
import numpy as np


pd.set_option('display.max_columns', None)  
pd.set_option('display.expand_frame_repr', False)
pd.set_option('max_colwidth', 800)

In [7]:
df_train = pd.read_csv('airlines.reviews.train.tsv', sep='\t')
df_test = pd.read_csv('airlines.reviews.test.tsv', sep='\t')

In [8]:
df_train.head(3)

Unnamed: 0,rating,content
0,9.0,March 5th 2014 from Ottawa Canada to Cuba WG 630. They announced that the flight was going to be delayed 1 hour no explanation why. They started boarding and we took off only 1/2 hour late. There were 6 of us 2 were seated together and remaining 4 were put in aisle seats side by side. On the way back from Cuba on March 12th 2014 WG 631 we were slow going through immigration no fault of Sunwing. Finally arrived to our plane at 10.35am the doors immediately closed and the plane took off 5 minutes later 20 minutes earlier than expected. The 6 of us were pretty much split up by 2 each seating my 12 old daughter by herself behind us. Overall the staff were great very friendly and approachable. The food served was pretty good considering most airlines don't offer meal service for free. It wa...
1,7.0,SIN-FRA-BHX in Economy. First leg from Singapore on the A380 was great largely because I was fortunate enough to get an exit row seat with unlimited legroom (judging by fellow passengers one wouldn't be happy with normal seats as they had rather pathetic legroom). Nice modern AVOD system but the PTVs were rather small compared to other A380 airlines. Service was really friendly and warm but few frills (no amenity kit whatsoever no footrests). Meals were alright but again rather simple compared to Asian carriers. Second leg to Birmingham on an A320 was above average by intra-Europe standards with a decent snack/beverage service and friendly service again. All flights on time.
2,7.0,"Spirit does what they state on their web site, they get you there - cheaply. For that I give them 5 stars because they did exactly what the said they would do. The plane was full and the seats were close together. I read all about that before I bought the ticket and it was as they said it would be, hence the low cost. Plan ahead and know what to expect and it will be a great experience. Its obvious that some of the people that gave 1 star reviews didn't understand about cost of bags or any extras and not done their homework - and are now very disappointed."


In [9]:
df_test.head(3)

Unnamed: 0,rating,content
0,8.0,JNB-LHR on the new airbus. Seats were roomy and comfy staff polite and friendly and inflight entertainment system outstanding. We had terrible turbulence throughout the flight but the captain was informative and reassuring and everyone remained calm. Food not great but otherwise excellent.
1,6.0,"Flew Business Class DOH-BOM-DOH. Outbound: Used the Oryx lounge at Doha airport which was nice. Cabin was nearly empty. Seats are similar to those on Jet's domestic business class. Found it difficult to sleep with the recline provided. At 6'3"" legrests did not help as my legs overshot it. The light sandwich was passable. Service was attentive and cheerful. Inbound: Evening flight so looked forward to meal and wine. Same cheap French table wine. Indian non-veg meal was not great. Cabin crew were attentive and friendly. IFE was limited. One negative was that my bag was one of the last off both flights with a priority tag."
2,5.0,This is a rough review because we flew first business and coach. We usually fly coach but for a trip to Napa we used our points to go first class. The AA/United merger combined the worst two airlines in the Western world. Flew on 4/7 (260 / 193) - BAN-DFX-SAN. Service food seating excellent. Plane a little old and shaky but all in all a good flight. Returned 4/13 (193/5290) SAN-Charlotte-BNA. Although we had first class we were relegated to business with an accompanying drop in quality across the board. The trouble is the age of the planes - it's like something from a museum. The noise from the engine was so loud it was like sticking your head under the hood of a car. But for once all flights left on time without mechanical problems. We were closer to the real world of 95% of all trave...


In [10]:
X_train, Y_train = df_train['content'], df_train['rating']
X_test, Y_test = df_test['content'], df_test['rating']

In [12]:
import re
from sklearn.metrics import r2_score


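def convert_to_vw(raw_text, target):
    """Encode one review as a VW line: '<label> |d <lowercased tokens>'."""
    # Keep only ASCII letters, digits and underscores; this also drops the
    # characters that are special in VW's input format ('|' and ':').
    word_pattern = re.compile(r"[a-zA-Z0-9_]+")
    words = word_pattern.findall(raw_text.lower())

    # Reviews without any token yield None and are skipped by the caller.
    if not words:
        return None
    return "{} |d {}".format(float(target), " ".join(words))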


def write_vw(X_data, Y_data, filename):
    with open(filename, "w") as f:
        for x, y in zip(X_data, Y_data):
            vw_object = convert_to_vw(x, y)
            if not vw_object:
                continue
            f.write(vw_object + '\n')
            

def read_target_from_vw(vw_object):
    return float(vw_object.split(' ')[0])


def calc_r2(predictions_path, answers_path):
    with open(predictions_path, 'r') as f:
        y_pred = np.array([float(value) for value in f.readlines()])
        
    with open(answers_path, 'r') as f:
        y_expected = np.array([read_target_from_vw(value) for value in f.readlines()])
        
    return r2_score(y_expected, y_pred)

In [13]:
write_vw(X_train, Y_train, 'airlines.train.vw')
write_vw(X_test, Y_test, 'airlines.test.vw')

In [14]:
! head -n 2 airlines.train.vw

9.0 |d march 5th 2014 from ottawa canada to cuba wg 630 they announced that the flight was going to be delayed 1 hour no explanation why they started boarding and we took off only 1 2 hour late there were 6 of us 2 were seated together and remaining 4 were put in aisle seats side by side on the way back from cuba on march 12th 2014 wg 631 we were slow going through immigration no fault of sunwing finally arrived to our plane at 10 35am the doors immediately closed and the plane took off 5 minutes later 20 minutes earlier than expected the 6 of us were pretty much split up by 2 each seating my 12 old daughter by herself behind us overall the staff were great very friendly and approachable the food served was pretty good considering most airlines don t offer meal service for free it was comparable to meals we ve had to purchase on other airlines
7.0 |d sin fra bhx in economy first leg from singapore on the a380 was great largely because i was fortunate enough to get an exit row seat wit

In [15]:
! head -n 2 airlines.test.vw

8.0 |d jnb lhr on the new airbus seats were roomy and comfy staff polite and friendly and inflight entertainment system outstanding we had terrible turbulence throughout the flight but the captain was informative and reassuring and everyone remained calm food not great but otherwise excellent
6.0 |d flew business class doh bom doh outbound used the oryx lounge at doha airport which was nice cabin was nearly empty seats are similar to those on jet s domestic business class found it difficult to sleep with the recline provided at 6 3 legrests did not help as my legs overshot it the light sandwich was passable service was attentive and cheerful inbound evening flight so looked forward to meal and wine same cheap french table wine indian non veg meal was not great cabin crew were attentive and friendly ife was limited one negative was that my bag was one of the last off both flights with a priority tag
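The lines above follow VW's text format: a label, then a namespace (`|d`) holding the tokens. Since the training command is fixed by the task, any richer features have to be written into these files. The sketch below shows one possible encoding, not the intended solution: it keeps the unigrams in `|d` and adds adjacent-word bigrams in a second namespace. The function names and the `|b` namespace are assumptions made for illustration, and whether this raises the score has to be checked by running the pipeline.

```python
import re

WORD_PATTERN = re.compile(r"[a-zA-Z0-9_]+")


def tokenize(raw_text):
    """Lowercase the text and keep alphanumeric tokens, as the baseline does."""
    return WORD_PATTERN.findall(raw_text.lower())


def make_bigrams(tokens):
    """Join each pair of adjacent tokens with an underscore."""
    return ["{}_{}".format(a, b) for a, b in zip(tokens, tokens[1:])]


def convert_to_vw_with_bigrams(raw_text, target):
    """Hypothetical variant of convert_to_vw: unigrams in |d, bigrams in |b."""
    tokens = tokenize(raw_text)
    if not tokens:
        return None
    line = "{} |d {}".format(float(target), " ".join(tokens))
    bigrams = make_bigrams(tokens)
    if bigrams:  # a one-word review has no bigrams
        line += " |b {}".format(" ".join(bigrams))
    return line


print(convert_to_vw_with_bigrams("Food not great, but otherwise excellent.", 8))
# 8.0 |d food not great but otherwise excellent |b food_not not_great great_but but_otherwise otherwise_excellent
```

A bigram such as `not_great` lets a linear model assign a weight to the phrase as a whole, which a plain bag of words cannot do.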


In [16]:
! vw --final_regressor result.model.vw airlines.train.vw --learning_rate 5 --bit_precision 18 --passes 20 -c -k

final_regressor = result.model.vw
Num weight bits = 18
learning rate = 5
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
creating cache_file = airlines.train.vw.cache
Reading datafile = airlines.train.vw
num sources = 1
average  since         example        example  current  current  current
loss     last          counter         weight    label  predict features
81.000000 81.000000            1            1.0   9.0000   0.0000      164
45.962098 10.924195            2            2.0   7.0000   3.6948      117
39.999505 34.036913            4            4.0   1.0000   9.0000      250
42.917619 45.835733            8            8.0   1.0000  10.0000      298
27.137390 11.357161           16           16.0   8.0000   3.7579       64
24.114076 21.090763           32           32.0   1.0000   8.3515      157
17.459891 10.805705           64           64.0  10.0000  10.0000      214
13.683004 9.906117          128          128.0   8.0000   8.3591      127
12.371640 11.060277          25

In [17]:
! vw --initial_regressor result.model.vw --testonly --predictions predictions.txt airlines.test.vw

only testing
predictions = predictions.txt
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = airlines.test.vw
num sources = 1
average  since         example        example  current  current  current
loss     last          counter         weight    label  predict features
2.617054 2.617054            1            1.0   8.0000   9.6177       45
1.327304 0.037555            2            2.0   6.0000   6.1938      113
7.045001 12.762697            4            4.0   1.0000   1.7248       89
6.794588 6.544175            8            8.0   8.0000   6.5657      149
4.249113 1.703638           16           16.0   8.0000   6.2097       50
4.000919 3.752724           32           32.0   8.0000   5.5210       53
4.803549 5.606180           64           64.0   8.0000   7.7164       34
4.736039 4.668528          128          128.0   8.0000   7.5639      197
4.860949 4.985860          256          256.0   2.0000   4.3240      104
4.961022 5.061094

In [18]:
calc_r2('predictions.txt', 'airlines.test.vw')

0.5547449450822988
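For reference, the value returned by `r2_score` is the coefficient of determination, R² = 1 - SS_res / SS_tot. The sketch below recomputes it with plain NumPy on made-up numbers (not the notebook's data). The name `r2_manual` is hypothetical, and the function assumes the true targets are not all equal.

```python
import numpy as np


def r2_manual(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of predicting the mean
    return 1.0 - ss_res / ss_tot


# Toy targets and predictions, invented for illustration.
y_true = [8.0, 6.0, 1.0, 8.0]
print(r2_manual(y_true, [9.0, 6.0, 2.0, 7.0]))  # close to 1: small errors
print(r2_manual(y_true, [5.75] * 4))            # 0.0: no better than the mean
```

A score of 0.55 therefore means that the baseline removes a little over half of the squared error of a model that always predicts the mean rating.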

The baseline model reaches an R² score of **0.55**. Solutions scoring at least **0.56** receive 100 points. Solutions with a lower score are graded according to the quality they achieve.

The result of your work should be a Zip archive named `result.zip` that contains two files, `airlines.train.vw` and `airlines.test.vw` (exact names!), with your encoded features. It is important to preserve the order and the number of objects in each file. Solutions that change the order and/or the set of objects will score 0 points.
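Because a reordered or shortened file scores 0 points, it can help to check a new encoding against the baseline files before zipping. The following is a minimal sketch under stated assumptions: the helper names are hypothetical, the baseline `.vw` files are assumed to be kept as a reference, and the demo uses tiny made-up files rather than the real dataset.

```python
import os
import tempfile
import zipfile


def read_labels(vw_path):
    """Return the label (first space-separated field) of every non-empty line."""
    with open(vw_path, "r") as f:
        return [float(line.split(" ")[0]) for line in f if line.strip()]


def same_labels(baseline_path, candidate_path):
    """True if both files contain the same labels in the same order."""
    return read_labels(baseline_path) == read_labels(candidate_path)


def build_archive(train_path, test_path, archive_path="result.zip"):
    """Zip the two files under the exact names the task requires."""
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.write(train_path, arcname="airlines.train.vw")
        zf.write(test_path, arcname="airlines.test.vw")
    return archive_path


# Demo on two tiny stand-in files (made-up content, not the real dataset).
with tempfile.TemporaryDirectory() as tmp:
    baseline = os.path.join(tmp, "baseline.vw")
    candidate = os.path.join(tmp, "candidate.vw")
    with open(baseline, "w") as f:
        f.write("9.0 |d great staff\n7.0 |d small seats\n")
    with open(candidate, "w") as f:
        f.write("9.0 |d great staff |b great_staff\n"
                "7.0 |d small seats |b small_seats\n")
    print(same_labels(baseline, candidate))  # True
    archive = build_archive(candidate, candidate, os.path.join(tmp, "result.zip"))
    with zipfile.ZipFile(archive) as zf:
        print(zf.namelist())  # ['airlines.train.vw', 'airlines.test.vw']
```

Comparing labels catches dropped or reordered rows, but not two rows with equal ratings swapped; a stricter check would also compare the text.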

These two files will be used to train a VW model with the same parameters as the baseline, namely: `vw --final_regressor result.model.vw airlines.train.vw --learning_rate 5 --bit_precision 18 --passes 20 -c -k`. The R² score will then be measured on `airlines.test.vw` by running the same prediction command as in the baseline solution above.
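To iterate on encodings outside the notebook, the two `vw` invocations can be assembled in Python. This sketch only builds the argument lists, using the flags that appear in the notebook's commands; the function names are hypothetical, and actually running the commands requires `vw` to be installed and on the `PATH`.

```python
def train_command(train_path="airlines.train.vw", model_path="result.model.vw"):
    """Argument list for the fixed training command from the task."""
    return ["vw", "--final_regressor", model_path, train_path,
            "--learning_rate", "5", "--bit_precision", "18",
            "--passes", "20", "-c", "-k"]


def predict_command(test_path="airlines.test.vw", model_path="result.model.vw",
                    predictions_path="predictions.txt"):
    """Argument list for the prediction command used by the baseline."""
    return ["vw", "--initial_regressor", model_path, "--testonly",
            "--predictions", predictions_path, test_path]


print(" ".join(train_command()))
print(" ".join(predict_command()))

# To execute them (assumes vw is installed):
# import subprocess
# subprocess.run(train_command(), check=True)
# subprocess.run(predict_command(), check=True)
```

After both commands finish, `calc_r2('predictions.txt', 'airlines.test.vw')` from the notebook gives the score for the new encoding.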

So, if you wanted to submit the baseline solution as your own, you would use the following command:

In [19]:
! zip result.zip airlines.train.vw airlines.test.vw

  adding: airlines.train.vw (deflated 65%)
  adding: airlines.test.vw (deflated 65%)


In [20]:
! head -c 100 result.zip
