# Task

The files `airlines.reviews.train.tsv` and `airlines.reviews.test.tsv` contain user reviews of different airlines. The whole dataset can be found <a href="https://github.com/quankiquanki/skytrax-reviews-dataset"> at this link </a>.

The data includes the review text written by a user and a score from 0 to 10. For now we will work __only with the review texts from the train dataset__ (file "airlines.reviews.train.tsv").

__Note:__ Tasks 1-3 should be done step by step, since later tasks require the results of earlier ones.

In [1]:
import pandas as pd
import numpy as np


pd.set_option('display.max_columns', None)  
pd.set_option('display.expand_frame_repr', False)
pd.set_option('max_colwidth', 800)

In [2]:
df = pd.read_csv('airlines.reviews.train.tsv', sep='\t', usecols=['content'])

In [3]:
df.head(3)

,content
0,March 5th 2014 from Ottawa Canada to Cuba WG 630. They announced that the flight was going to be delayed 1 hour no explanation why. They started boarding and we took off only 1/2 hour late. There were 6 of us 2 were seated together and remaining 4 were put in aisle seats side by side. On the way back from Cuba on March 12th 2014 WG 631 we were slow going through immigration no fault of Sunwing. Finally arrived to our plane at 10.35am the doors immediately closed and the plane took off 5 minutes later 20 minutes earlier than expected. The 6 of us were pretty much split up by 2 each seating my 12 old daughter by herself behind us. Overall the staff were great very friendly and approachable. The food served was pretty good considering most airlines don't offer meal service for free. It wa...
1,SIN-FRA-BHX in Economy. First leg from Singapore on the A380 was great largely because I was fortunate enough to get an exit row seat with unlimited legroom (judging by fellow passengers one wouldn't be happy with normal seats as they had rather pathetic legroom). Nice modern AVOD system but the PTVs were rather small compared to other A380 airlines. Service was really friendly and warm but few frills (no amenity kit whatsoever no footrests). Meals were alright but again rather simple compared to Asian carriers. Second leg to Birmingham on an A320 was above average by intra-Europe standards with a decent snack/beverage service and friendly service again. All flights on time.
2,"Spirit does what they state on their web site, they get you there - cheaply. For that I give them 5 stars because they did exactly what the said they would do. The plane was full and the seats were close together. I read all about that before I bought the ticket and it was as they said it would be, hence the low cost. Plan ahead and know what to expect and it will be a great experience. Its obvious that some of the people that gave 1 star reviews didn't understand about cost of bags or any extras and not done their homework - and are now very disappointed."


### Task 3 (30 points)

Use texts from task 2.

Apply a TF-IDF transformation (with n-gram range = (1, 1)) to the document collection. For each document, find the top-1 stem with the largest tf-idf weight. Write those stems into the __tfidf_stems.txt__ file in the following format: each document is represented by one word, and the rows of the output file should follow the same order as the documents in the input dataset. The resulting file should contain exactly as many words and lines as there are documents in the "airlines.reviews.train.tsv" file.
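To make the "top-1 stem per document" idea concrete, here is a minimal pure-Python sketch on a hypothetical toy corpus (the documents and the helper name `top_tfidf_term` are illustrative and not part of the dataset or the solution). It assumes the smoothed idf `ln((1 + n) / (1 + df)) + 1`, which corresponds to scikit-learn's default settings, and skips the L2 normalization, since scaling a whole document by one constant does not change which term has the largest weight.

```python
import math
from collections import Counter

# Hypothetical toy corpus of already-stemmed documents (not taken from the dataset).
toy_docs = [
    "flight delay flight crew",
    "seat legroom seat seat crew",
    "crew friend",
]


def top_tfidf_term(docs):
    """Return the term with the largest tf-idf weight for each document."""
    tokenized = [d.split() for d in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    df = Counter(t for tokens in tokenized for t in set(tokens))
    top = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Raw term count times smoothed idf (assumed to match scikit-learn's default).
        weights = {
            t: count * (math.log((1 + n_docs) / (1 + df[t])) + 1)
            for t, count in tf.items()
        }
        top.append(max(weights, key=weights.get))
    return top


toy_top_terms = top_tfidf_term(toy_docs)
print(toy_top_terms)  # ['flight', 'seat', 'friend']
```

In the last toy document both words occur once, but "crew" appears in every document, so its idf is lower and "friend" wins.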

In [4]:
import re
import nltk.data
import collections

from nltk.corpus import stopwords
from collections import Counter


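# Rebinds the name `stopwords` from the NLTK corpus reader to a set of English stop words (a set gives fast membership checks)
stopwords = set(stopwords.words('english'))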

In [5]:
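def preprocess(s):
    # Lowercase the text and keep only runs of ASCII letters;
    # non-string values (e.g. NaN for a missing review) become an empty string
    if isinstance(s, str):
        return ' '.join(re.findall('[a-z]+', s.lower()))
    return ''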

In [6]:
df['clean_content'] = df['content'].map(preprocess)

In [7]:
from nltk.stem import SnowballStemmer


stemmer = SnowballStemmer('english')

In [8]:
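# Note: this redefines preprocess from the cell above; it expects already cleaned, space-separated text
def preprocess(s: str):
    # Drop English stop words, then stem each remaining token
    tokens = [stemmer.stem(t) for t in s.split() if t not in stopwords]
    return ' '.join(tokens)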

In [10]:
df['stemmed_content'] = df['clean_content'].map(preprocess)

In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer



In [13]:
vec = TfidfVectorizer(ngram_range=(1, 1))
tfidf = vec.fit_transform(df['stemmed_content'])

tfidf.shape

(23322, 15365)

In [15]:
# Convert the sparse matrix into a list of dicts, one per document d in collection D,
# each mapping a word w that occurs in d to its weight tfidf(w, d, D)

index_value = {index: word for word, index in vec.vocabulary_.items()}

fully_indexed = []

for row in tfidf:
    fully_indexed.append({index_value[column]: value for column, value in zip(row.indices, row.data)})

In [23]:
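# For each document pick the word with the largest tf-idf weight
# (max raises ValueError if a document has no words left after preprocessing)
tokens = []

for doc in fully_indexed:
    token = max(doc, key=doc.get)
    tokens.append(token)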

In [25]:
answer = '\n'.join(tokens)

with open('tfidf_stems.txt', 'w') as f:
    f.write(answer)

In [26]:
!head -n 10 tfidf_stems.txt

wg
rather
star
santa
ba
lci
km
calgari
discount
recaro
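
Since the task requires exactly one word per document, a quick sanity check on the output file can help. The sketch below uses a hypothetical helper, `count_lines_and_words`, and demonstrates it on a small temporary file so that it runs on its own; it is only one possible way to check the result.

```python
import os
import tempfile


def count_lines_and_words(path):
    """Return (number of lines, number of whitespace-separated words) in a text file."""
    with open(path, encoding="utf-8") as f:
        lines = f.read().splitlines()
    n_words = sum(len(line.split()) for line in lines)
    return len(lines), n_words


# Self-contained demo on a temporary file that mimics the output format.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_stems.txt")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("wg\nrather\nstar")
    demo_counts = count_lines_and_words(demo_path)

print(demo_counts)  # (3, 3)
```

In the notebook, the analogous check would be that `count_lines_and_words('tfidf_stems.txt')` equals `(len(df), len(df))`.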
