# Task

The files `airlines.reviews.train.tsv` and `airlines.reviews.test.tsv` contain user reviews of different airlines. The whole dataset is available <a href="https://github.com/quankiquanki/skytrax-reviews-dataset">at this link</a>.

Each record includes a review written by a user and a score from 0 to 10. For now we will work __only with the review texts from the train dataset__ (file "airlines.reviews.train.tsv").

__Note:__ Tasks 1-3 should be done in order, as later tasks rely on the results of earlier ones.

In [1]:
import pandas as pd
import numpy as np


pd.set_option('display.max_columns', None)  
pd.set_option('display.expand_frame_repr', False)
pd.set_option('max_colwidth', 800)

In [2]:
df = pd.read_csv('airlines.reviews.train.tsv', sep='\t', usecols=['content'])

In [3]:
df.head(3)

Unnamed: 0,content
0,March 5th 2014 from Ottawa Canada to Cuba WG 630. They announced that the flight was going to be delayed 1 hour no explanation why. They started boarding and we took off only 1/2 hour late. There were 6 of us 2 were seated together and remaining 4 were put in aisle seats side by side. On the way back from Cuba on March 12th 2014 WG 631 we were slow going through immigration no fault of Sunwing. Finally arrived to our plane at 10.35am the doors immediately closed and the plane took off 5 minutes later 20 minutes earlier than expected. The 6 of us were pretty much split up by 2 each seating my 12 old daughter by herself behind us. Overall the staff were great very friendly and approachable. The food served was pretty good considering most airlines don't offer meal service for free. It wa...
1,SIN-FRA-BHX in Economy. First leg from Singapore on the A380 was great largely because I was fortunate enough to get an exit row seat with unlimited legroom (judging by fellow passengers one wouldn't be happy with normal seats as they had rather pathetic legroom). Nice modern AVOD system but the PTVs were rather small compared to other A380 airlines. Service was really friendly and warm but few frills (no amenity kit whatsoever no footrests). Meals were alright but again rather simple compared to Asian carriers. Second leg to Birmingham on an A320 was above average by intra-Europe standards with a decent snack/beverage service and friendly service again. All flights on time.
2,"Spirit does what they state on their web site, they get you there - cheaply. For that I give them 5 stars because they did exactly what the said they would do. The plane was full and the seats were close together. I read all about that before I bought the ticket and it was as they said it would be, hence the low cost. Plan ahead and know what to expect and it will be a great experience. Its obvious that some of the people that gave 1 star reviews didn't understand about cost of bags or any extras and not done their homework - and are now very disappointed."


### Task 2 (10 points)

Work with text from task 1.

Perform stemming using the SnowballStemmer from the NLTK library, then remove all stop words (take the stop-word list from NLTK). Find the 20 most frequent stems among the words left after stop-word removal and write them to the __popular_stems.txt__ file in descending order of frequency (as in task 1).

Save the resulting texts (stems with stop words removed) for task 3.

In [4]:
import re
import nltk.data
import collections

from nltk.corpus import stopwords
from collections import Counter


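# Requires the NLTK stop-word corpus: nltk.download('stopwords')
# Note: this rebinds the imported `stopwords` name, so the cell cannot be re-run as is.
stopwords = set(stopwords.words('english'))  # set gives fast membership tests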

In [5]:
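def preprocess(s):
    # Lowercase the text and keep only runs of the letters a-z;
    # digits, punctuation and apostrophes are dropped.
    if isinstance(s, str):
        return ' '.join(re.findall(r'[a-z]+', s.lower()))
    else:
        # Missing reviews (NaN) become empty strings
        return ''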

In [6]:
df['clean_content'] = df['content'].map(preprocess)

In [7]:
df['clean_content'].iloc[0]

'march th from ottawa canada to cuba wg they announced that the flight was going to be delayed hour no explanation why they started boarding and we took off only hour late there were of us were seated together and remaining were put in aisle seats side by side on the way back from cuba on march th wg we were slow going through immigration no fault of sunwing finally arrived to our plane at am the doors immediately closed and the plane took off minutes later minutes earlier than expected the of us were pretty much split up by each seating my old daughter by herself behind us overall the staff were great very friendly and approachable the food served was pretty good considering most airlines don t offer meal service for free it was comparable to meals we ve had to purchase on other airlines'
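
The cleaning step above keeps only runs of ASCII letters, which is why numbers disappear and "don't" turns into "don t" in the output. A minimal sketch on a made-up sentence follows; `clean_text` is a hypothetical name that mirrors `preprocess`, and the sample string is an assumption, not a row from the dataset.

```python
import re


def clean_text(s):
    """Lowercase `s` and keep only runs of the letters a-z, joined by spaces."""
    if not isinstance(s, str):
        return ''  # e.g. NaN for a missing review
    return ' '.join(re.findall(r'[a-z]+', s.lower()))


# Made-up sample, not taken from the dataset
sample = "Flight WG 630 didn't leave until 10.35am!"
cleaned = clean_text(sample)
print(cleaned)
```

Note that the apostrophe splits contractions into fragments such as "didn" and "t", so some of these fragments may survive later filtering if they are not in the stop-word list.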

In [8]:
from nltk.stem import SnowballStemmer


stemmer = SnowballStemmer('english')
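
To get a feel for what the stemmer does before applying it to the whole column, it can be tried on a few words. The sketch below uses words that occur in the first review; the word list is chosen only for illustration.

```python
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer('english')

# A handful of words from the first review, chosen for illustration
words = ['announced', 'airlines', 'service', 'boarding', 'friendly']
stems = {w: stemmer.stem(w) for w in words}

for word, stem in stems.items():
    print(f'{word} -> {stem}')
```

Stems are not necessarily dictionary words ("servic", "airlin"); they only need to be the same for related word forms.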

In [23]:
def preprocess(s: str):
    # Replaces the earlier preprocess(): drop stop words, then stem the remaining tokens
    stems = [stemmer.stem(t) for t in s.split() if t not in stopwords]
    return ' '.join(stems)


In [24]:
df['stemmed_content'] = df['clean_content'].map(preprocess)
df['stemmed_content'].iloc[0]

'march th ottawa canada cuba wg announc flight go delay hour explan start board took hour late us seat togeth remain put aisl seat side side way back cuba march th wg slow go immigr fault sunw final arriv plane door immedi close plane took minut later minut earlier expect us pretti much split seat old daughter behind us overal staff great friend approach food serv pretti good consid airlin offer meal servic free compar meal purchas airlin'
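
The task statement asks for stemming first and stop-word removal second, while the cell above filters stop words before stemming. The order matters if the stop-word list itself is left unstemmed, because a stemmed stop word may no longer match its entry in the list. The sketch below illustrates this with a one-word toy stop list (`toy_stopwords` is an assumption for the example, not the NLTK list).

```python
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer('english')

toy_stopwords = {'very'}  # hypothetical stop list, for illustration only
tokens_in = 'very friendly staff'.split()

# Order used in the notebook: filter raw tokens, then stem
filter_then_stem = [stemmer.stem(t) for t in tokens_in if t not in toy_stopwords]

# Naive "stem first": the stemmed stop word no longer matches the raw list
naive_stem_then_filter = [s for s in map(stemmer.stem, tokens_in)
                          if s not in toy_stopwords]

# Stem first, but compare against a stemmed stop list
stemmed_stopwords = {stemmer.stem(w) for w in toy_stopwords}
stem_then_filter = [s for s in map(stemmer.stem, tokens_in)
                    if s not in stemmed_stopwords]

print(filter_then_stem)
print(naive_stem_then_filter)
print(stem_then_filter)
```

In this toy case, filtering before stemming and stemming both the text and the stop list give the same result, while the naive variant lets the stemmed stop word through.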

In [25]:
# Count how often each stem occurs across all reviews

tokens = {}

def counting_tokens(s: str):
    # Updates the module-level `tokens` dict in place and returns None
    for token in s.split():
        tokens[token] = tokens.get(token, 0) + 1

In [26]:
df['stemmed_content'].apply(counting_tokens)

0        None
1        None
2        None
3        None
4        None
         ... 
23317    None
23318    None
23319    None
23320    None
23321    None
Name: stemmed_content, Length: 23322, dtype: object

In [27]:
tokens = dict(sorted(tokens.items(), key=lambda item: item[1], reverse=True))
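
The counting and sorting cells above can also be sketched with `collections.Counter` from the standard library, which avoids the global dictionary and the column of `None` values printed by `apply`. This is an alternative sketch, not the approach used in the notebook; `stemmed_texts` is a made-up stand-in for `df['stemmed_content']`.

```python
from collections import Counter

# Made-up stemmed texts standing in for df['stemmed_content']
stemmed_texts = ['flight seat flight', 'seat flight time']

stem_counts = Counter()
for text in stemmed_texts:
    # update() adds one count per token in the iterable
    stem_counts.update(text.split())

# most_common(n) returns (stem, count) pairs in descending order of count
top = stem_counts.most_common(2)
print(top)
```

On the real data, `stem_counts.most_common(20)` would give the top 20 stems together with their counts.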

In [28]:
answer = '\n'.join(list(tokens)[:20])  # the 20 most frequent stems, one per line

with open('popular_stems.txt', 'w') as f:
    f.write(answer)

In [29]:
!head -n 20 popular_stems.txt

flight
seat
time
servic
good
food
airlin
hour
crew
staff
plane
check
return
cabin
class
fli
board
would
one
busi