# Task

The files `airlines.reviews.train.tsv` and `airlines.reviews.test.tsv` contain user reviews of different airlines. The whole dataset can be found <a href="https://github.com/quankiquanki/skytrax-reviews-dataset">at this link</a>.

The data includes the review text written by a user and a score from 0 to 10. For now we will work __only with the review texts from the train dataset__ (file "airlines.reviews.train.tsv").

__Note:__ Tasks 1-3 should be done step by step, since later tasks require the results of earlier ones.

In [38]:
import pandas as pd
import numpy as np


pd.set_option('display.max_columns', None)  
pd.set_option('display.expand_frame_repr', False)
pd.set_option('max_colwidth', 800)

In [39]:
df = pd.read_csv('airlines.reviews.train.tsv', sep='\t', usecols=['content'])

In [40]:
df.head(3)

Unnamed: 0,content
0,March 5th 2014 from Ottawa Canada to Cuba WG 630. They announced that the flight was going to be delayed 1 hour no explanation why. They started boarding and we took off only 1/2 hour late. There were 6 of us 2 were seated together and remaining 4 were put in aisle seats side by side. On the way back from Cuba on March 12th 2014 WG 631 we were slow going through immigration no fault of Sunwing. Finally arrived to our plane at 10.35am the doors immediately closed and the plane took off 5 minutes later 20 minutes earlier than expected. The 6 of us were pretty much split up by 2 each seating my 12 old daughter by herself behind us. Overall the staff were great very friendly and approachable. The food served was pretty good considering most airlines don't offer meal service for free. It wa...
1,SIN-FRA-BHX in Economy. First leg from Singapore on the A380 was great largely because I was fortunate enough to get an exit row seat with unlimited legroom (judging by fellow passengers one wouldn't be happy with normal seats as they had rather pathetic legroom). Nice modern AVOD system but the PTVs were rather small compared to other A380 airlines. Service was really friendly and warm but few frills (no amenity kit whatsoever no footrests). Meals were alright but again rather simple compared to Asian carriers. Second leg to Birmingham on an A320 was above average by intra-Europe standards with a decent snack/beverage service and friendly service again. All flights on time.
2,"Spirit does what they state on their web site, they get you there - cheaply. For that I give them 5 stars because they did exactly what the said they would do. The plane was full and the seats were close together. I read all about that before I bought the ticket and it was as they said it would be, hence the low cost. Plan ahead and know what to expect and it will be a great experience. Its obvious that some of the people that gave 1 star reviews didn't understand about cost of bags or any extras and not done their homework - and are now very disappointed."


### Task 1 (10 points)
Convert the whole text to lowercase, remove (replace with an empty string) all redundant symbols (symbols not from the Latin alphabet), and split the text into tokens by spaces. Save the resulting text; it will be required for the next task.

Find the 20 most frequent tokens (the tokens that occur most often in the whole collection). Write the resulting 20 tokens into __popular_tokens.txt__ in descending order of frequency, one word per line, 20 lines total.
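The steps of the task can be sketched on toy data as follows. This is only an illustration under one literal reading of the task (every character that is not `a-z` or whitespace is deleted); the helper names `tokenize` and `top_tokens` and the toy reviews are hypothetical and not part of the assignment.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase, delete every character that is not a-z or whitespace, split on whitespace."""
    cleaned = re.sub(r'[^a-z\s]', '', text.lower())
    return cleaned.split()


def top_tokens(texts, k=20):
    """Return the k most frequent tokens over all texts, most frequent first."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return [token for token, _ in counts.most_common(k)]


# Toy stand-in for the review column (not real data from the dataset).
toy_reviews = [
    "The flight was late.",
    "The crew was kind, the food was cold, the seat was fine, the end.",
]

print(tokenize("Don't panic, 2 seats!"))
print(top_tokens(toy_reviews, k=2))
```

`Counter.most_common` already returns tokens in descending order of frequency, so its output can be joined with `'\n'` and written to the answer file directly.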

Example of how to write a list of words (here `example_words`) to a file in the required format:

In [41]:
example_words = ['word_'+str(i) for i in range(1,21)] # the _i suffix is only for illustration, do not include it in your solution
print(example_words)

answer = '\n'.join(example_words)

with open('popular_tokens.txt', 'w') as f:
    f.write(answer)

['word_1', 'word_2', 'word_3', 'word_4', 'word_5', 'word_6', 'word_7', 'word_8', 'word_9', 'word_10', 'word_11', 'word_12', 'word_13', 'word_14', 'word_15', 'word_16', 'word_17', 'word_18', 'word_19', 'word_20']


In [42]:
!head -n 5 popular_tokens.txt

word_1
word_2
word_3
word_4
word_5


In [43]:
import re
import nltk.data
import collections

from nltk.corpus import stopwords
from collections import Counter


stopwords = stopwords.words('english')
stopwords[:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

In [46]:
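def preprocess(s):
    # Lowercase, keep only runs of a-z, and join them with single spaces.
    # Note: non-letters act as separators here, so "don't" becomes "don t".
    if isinstance(s, str):
        return ' '.join(re.findall('[a-z]+', s.lower()))
    # Non-string values (e.g. NaN for a missing review) become empty text.
    return ''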

In [47]:
df['clean_content'] = df['content'].map(preprocess)
df.sample(3)

Unnamed: 0,content,clean_content
193,HND-BKK in C BKK-NRT in First. After coming off UA F from SFO the TG business class was a huge improvement. The sense of style and presentation that was absent on UA was here in spades and the crew was warm inviting and ready to provide for any special request. The flight to NRT in First was on the A380 and the lounge ground staff and spa were simply unbelievable-worth the price of the ticket alone. The meal which I pre-ordered (lobster thermador) was superb and every aspect of the flight was delightful.,hnd bkk in c bkk nrt in first after coming off ua f from sfo the tg business class was a huge improvement the sense of style and presentation that was absent on ua was here in spades and the crew was warm inviting and ready to provide for any special request the flight to nrt in first was on the a and the lounge ground staff and spa were simply unbelievable worth the price of the ticket alone the meal which i pre ordered lobster thermador was superb and every aspect of the flight was delightful
5724,17/7/14 CX659 HKG-SIN. Sparkling new B777-300ER aircraft registration B-KQO hardly over a week in service. This was the aircraft's 8th flight. Got these 4 flights at S$660 so its the best value for money. Transited in HKG from CX500 and had a 3hr layover. Boarding was late by 15 minutes and had no seating order all economy passengers just boarded at the same time. Full flight. The plane had a new fresh cabin smell with the latest seats and AVOD entertainment. The AVOD was easy to use and had up to date movies. Music selection was really good for Asian music with new mandopop albums available. Inflight service was conducted but there was barely any 'proper' service. Due to the midnight flight we were given a snack box containing cold food. Terrible offering. A box of iced lemon tea was ...,cx hkg sin sparkling new b er aircraft registration b kqo hardly over a week in service this was the aircraft s th flight got these flights at s so its the best value for money transited in hkg from cx and had a hr layover boarding was late by minutes and had no seating order all economy passengers just boarded at the same time full flight the plane had a new fresh cabin smell with the latest seats and avod entertainment the avod was easy to use and had up to date movies music selection was really good for asian music with new mandopop albums available inflight service was conducted but there was barely any proper service due to the midnight flight we were given a snack box containing cold food terrible offering a box of iced lemon tea was given service was by request after that and i ...
4023,SYD-MEL. Check-in at desk as usual and placed into an exit seat. Crew made a lot of fuss about all of the safety procedures. Onboard cabin much like a decent easyjet plane. Crew and pilots very chatty and pleasant. Arrival through the barb wire surrounded 'shed' for baggage claim was funny but also totally practical - can there be a quicker means of leaving any airport? Overall I enjoyed my flights which were also inexpensive.,syd mel check in at desk as usual and placed into an exit seat crew made a lot of fuss about all of the safety procedures onboard cabin much like a decent easyjet plane crew and pilots very chatty and pleasant arrival through the barb wire surrounded shed for baggage claim was funny but also totally practical can there be a quicker means of leaving any airport overall i enjoyed my flights which were also inexpensive
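The cleaned sample above contains fragments such as `aircraft s th flight`, because non-letter characters are treated as separators rather than deleted. The sketch below compares that behaviour with a literal "replace with empty string" reading of the task on one sentence taken from the sample. The function names `clean_as_separators` and `clean_by_deletion` are hypothetical; which variant the grader expects is an assumption that should be checked against the task statement.

```python
import re


def clean_as_separators(text):
    """Keep runs of a-z and join them with spaces (non-letters split tokens)."""
    return ' '.join(re.findall('[a-z]+', text.lower()))


def clean_by_deletion(text):
    """Delete every character that is not a-z or whitespace, then normalise spaces."""
    return ' '.join(re.sub(r'[^a-z\s]', '', text.lower()).split())


sentence = "This was the aircraft's 8th flight."

print(clean_as_separators(sentence))  # this was the aircraft s th flight
print(clean_by_deletion(sentence))    # this was the aircrafts th flight
```

The first variant produces standalone tokens such as `s` and `t` from contractions and possessives; the second glues them onto the preceding word. Either choice changes the token counts, so it is worth deciding deliberately.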


In [48]:
tokens = {}  # token -> number of occurrences across all reviews

def tokens_counter(text):
    # Update the global `tokens` dict in place; returns None.
    for token in text.split():
        tokens[token] = tokens.get(token, 0) + 1
            

In [49]:
df['clean_content'].map(tokens_counter)

0        None
1        None
2        None
3        None
4        None
         ... 
23317    None
23318    None
23319    None
23320    None
23321    None
Name: clean_content, Length: 23322, dtype: object
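The Series of `None` values above appears because `map` is used here only for its side effect of filling a dictionary. As a possible alternative, the counting can be done with pandas alone. The sketch below uses a toy frame with a `clean_content` column (the toy rows are made up); `Series.explode` requires pandas 0.25 or newer.

```python
import pandas as pd

# Toy stand-in for df['clean_content'] (not real data from the dataset).
toy = pd.DataFrame({
    'clean_content': [
        'the flight was late',
        'the crew was kind the food was cold the end',
        '',  # an empty review contributes no tokens
    ]
})

# Split each text into a list of tokens, put one token per row, then count.
counts = toy['clean_content'].str.split().explode().value_counts()

top = counts.head(2).index.tolist()
print(counts.head(2))
print(top)
```

`value_counts` sorts by frequency in descending order and drops missing values by default, so empty reviews do not produce a token.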

In [50]:
# Sort by count, descending; dicts preserve insertion order in Python 3.7+.
tokens = dict(sorted(tokens.items(), key=lambda item: item[1], reverse=True))

In [51]:
answer = '\n'.join(list(tokens)[:20])

with open('popular_tokens.txt', 'w') as f:
    f.write(answer)

In [52]:
!head -n 5 popular_tokens.txt

the
and
to
was
a
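The `!head` call above depends on a Unix shell being available. A portable way to inspect the answer file and check the required format (20 non-empty, distinct lines) could look like the sketch below. The helpers `read_answer_lines` and `looks_valid` are hypothetical, and the demo writes to a temporary file so that the real `popular_tokens.txt` is not touched.

```python
import tempfile
from pathlib import Path


def read_answer_lines(path):
    """Return the lines of an answer file, without line terminators."""
    return Path(path).read_text(encoding='utf-8').splitlines()


def looks_valid(lines, expected=20):
    """True if there are exactly `expected` non-empty, distinct lines."""
    return (len(lines) == expected
            and all(line.strip() for line in lines)
            and len(set(lines)) == expected)


# Demo on a throwaway file in the same format as the example cell.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / 'demo_tokens.txt'
    demo_path.write_text('\n'.join(f'word_{i}' for i in range(1, 21)), encoding='utf-8')
    demo_lines = read_answer_lines(demo_path)

print(demo_lines[:5])
print(looks_valid(demo_lines))
```

For the real file, `looks_valid(read_answer_lines('popular_tokens.txt'))` would perform the same check.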


In [53]:
for k, v in tokens.items():
    print(k, v)

the 128293
and 87679
to 80123
was 61064
a 59893
on 44574
i 44344
in 43251
flight 37515
of 35150
for 29589
with 27736
were 26786
we 21658
not 20099
is 19916
but 19349
it 18319
at 17409
from 16914
very 16869
that 16639
had 14804
they 14722
my 14696
as 14339
service 13643
this 13589
good 13265
no 13102
have 13046
time 12636
food 12433
seats 11295
all 10414
you 10397
seat 10130
t 9820
be 9774
crew 9625
flights 9577
are 9446
an 9226
staff 8915
so 8723
class 8018
cabin 7901
only 7863
our 7833
would 7728
there 7640
which 7317
return 7259
one 7113
plane 7043
again 6731
check 6617
business 6583
airline 6536
or 6338
by 6222
when 6178
their 6166
me 6131
s 6050
out 5880
us 5723
hours 5563
entertainment 5500
economy 5496
back 5491
if 5476
than 5453
more 5369
flew 5351
first 5260
both 5201
airport 5165
friendly 5143
get 5110
them 5035
comfortable 4931
fly 4868
new 4860
other 4842
passengers 4818
air 4763
hour 4743
about 4731
after 4731
will 4575
up 4513
great 4476
airlines 4453
aircraft 4430
just 43