# Task

The files `airlines.reviews.train.tsv` and `airlines.reviews.test.tsv` contain user ratings of various airlines. The full dataset is available <a href="https://github.com/quankiquanki/skytrax-reviews-dataset">here</a>.

Each record contains the review a user left and their rating from 0 to 10. For now we will work __only with the review texts of the train split__ (the file "airlines.reviews.train.tsv").

__Note:__ Tasks 1-3 must be completed in order, because each one uses the results of the previous one.

In [1]:
import pandas as pd
import numpy as np

pd.set_option('display.max_columns', None)  
pd.set_option('display.expand_frame_repr', False)
pd.set_option('max_colwidth', 800)

In [2]:
df = pd.read_csv('airlines.reviews.train.tsv', sep='\t', usecols=['content'])

In [3]:
df.head()

Unnamed: 0,content
0,March 5th 2014 from Ottawa Canada to Cuba WG 630. They announced that the flight was going to be delayed 1 hour no explanation why. They started boarding and we took off only 1/2 hour late. There were 6 of us 2 were seated together and remaining 4 were put in aisle seats side by side. On the way back from Cuba on March 12th 2014 WG 631 we were slow going through immigration no fault of Sunwing. Finally arrived to our plane at 10.35am the doors immediately closed and the plane took off 5 minutes later 20 minutes earlier than expected. The 6 of us were pretty much split up by 2 each seating my 12 old daughter by herself behind us. Overall the staff were great very friendly and approachable. The food served was pretty good considering most airlines don't offer meal service for free. It wa...
1,SIN-FRA-BHX in Economy. First leg from Singapore on the A380 was great largely because I was fortunate enough to get an exit row seat with unlimited legroom (judging by fellow passengers one wouldn't be happy with normal seats as they had rather pathetic legroom). Nice modern AVOD system but the PTVs were rather small compared to other A380 airlines. Service was really friendly and warm but few frills (no amenity kit whatsoever no footrests). Meals were alright but again rather simple compared to Asian carriers. Second leg to Birmingham on an A320 was above average by intra-Europe standards with a decent snack/beverage service and friendly service again. All flights on time.
2,"Spirit does what they state on their web site, they get you there - cheaply. For that I give them 5 stars because they did exactly what the said they would do. The plane was full and the seats were close together. I read all about that before I bought the ticket and it was as they said it would be, hence the low cost. Plan ahead and know what to expect and it will be a great experience. Its obvious that some of the people that gave 1 star reviews didn't understand about cost of bags or any extras and not done their homework - and are now very disappointed."
3,"My fiancé and I were booked to fly to Cayo Santa Maria (CUBA) February 6-13 2014. Our flight was scheduled to leave at 6.10am. Upon arriving at the airport at 4.30am we quickly noticed that the line up was very long. When we finally got to the check-in desk they asked us where we were headed we replied Cayo Santa Maria. We advised her that we had checked in online already and we just needed to print our boarding passes. She took our baggage and weighed it. Right before she was about to send it off a rude manager from the back came and just yelled out ""gates to Santa Clara are closed"". We were so shocked because it was only 4.55am at that time. We told them the plane would just be sitting there we could still make it. The rep simply told us ""please step aside we need to assist other pas..."
4,"DXB-LHR B777-200ER BA0108 August 18 First Class. Transferred from an Emirates flight in DXB. BA DXB Galleries Lounge reception staff member excellent. Boarding reasonable with an on-time departure. Cabin crew outstanding and definitely lived up to the ""To Fly. To Serve"" BA slogan. Food tasty and well presented but not quite First Class and cost-cutting was evident. The New First seat is comfortable though the footrest is poorly designed and a storage area for small inflight items is missing. IFE monitor controls and selection very good but the screen could be more adjustable for reach. Lovely cabin ambience including colors textures mood lighting and window blinds. Toilets cramped and stocked with the cheapest liquid soaps and toilet paper. Overall an enjoyable flight but as a longstan..."


# Task 1 (10 points)
Convert all text to lowercase, replace every non-letter character (anything that is not a Latin letter) with a space, and split the text into tokens on whitespace. Save the resulting text; it will be needed for the following tasks.

Find the 20 most frequent tokens (the tokens that occur most often across the whole collection). Write these 20 tokens to the file __popular_tokens.txt__ in descending order of frequency, one word per line, 20 lines in total.

## Solution 1

Let's write a preprocessing function.

In [4]:
import re

def preprocess_text(text: str) -> list:
    # Lowercase, turn every run of non-Latin-letter characters into a space, split on whitespace.
    return re.sub("[^a-z]+", " ", text.lower()).split()
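
To make the effect of the regular expression concrete, here is a minimal sketch that repeats the function above and applies it to one made-up sentence (the sample string is an assumption, not a row from the dataset). Note how digits and apostrophes split words apart, which is why tokens such as `t` and `a` show up in the processed reviews.

```python
import re

def preprocess_text(text: str) -> list:
    # Same logic as the notebook cell: lowercase, replace non-Latin runs, split.
    return re.sub("[^a-z]+", " ", text.lower()).split()

# Hypothetical sample sentence, used only for illustration.
sample_tokens = preprocess_text("SIN-FRA-BHX in Economy. Didn't like the A380!")
print(sample_tokens)
# ['sin', 'fra', 'bhx', 'in', 'economy', 'didn', 't', 'like', 'the', 'a']
```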

Apply it to the text corpus in parallel.

In [5]:
from multiprocessing import Pool
from tqdm.notebook import tqdm

with Pool(4) as pool:
    processed_texts = list(tqdm(pool.imap(preprocess_text, df.content), total=len(df)))


In [6]:
df['processed_content'] = processed_texts

In [7]:
df.head()

Unnamed: 0,content,processed_content
0,March 5th 2014 from Ottawa Canada to Cuba WG 630. They announced that the flight was going to be delayed 1 hour no explanation why. They started boarding and we took off only 1/2 hour late. There were 6 of us 2 were seated together and remaining 4 were put in aisle seats side by side. On the way back from Cuba on March 12th 2014 WG 631 we were slow going through immigration no fault of Sunwing. Finally arrived to our plane at 10.35am the doors immediately closed and the plane took off 5 minutes later 20 minutes earlier than expected. The 6 of us were pretty much split up by 2 each seating my 12 old daughter by herself behind us. Overall the staff were great very friendly and approachable. The food served was pretty good considering most airlines don't offer meal service for free. It wa...,"[march, th, from, ottawa, canada, to, cuba, wg, they, announced, that, the, flight, was, going, to, be, delayed, hour, no, explanation, why, they, started, boarding, and, we, took, off, only, hour, late, there, were, of, us, were, seated, together, and, remaining, were, put, in, aisle, seats, side, by, side, on, the, way, back, from, cuba, on, march, th, wg, we, were, slow, going, through, immigration, no, fault, of, sunwing, finally, arrived, to, our, plane, at, am, the, doors, immediately, closed, and, the, plane, took, off, minutes, later, minutes, earlier, than, expected, the, of, us, were, pretty, much, split, up, by, ...]"
1,SIN-FRA-BHX in Economy. First leg from Singapore on the A380 was great largely because I was fortunate enough to get an exit row seat with unlimited legroom (judging by fellow passengers one wouldn't be happy with normal seats as they had rather pathetic legroom). Nice modern AVOD system but the PTVs were rather small compared to other A380 airlines. Service was really friendly and warm but few frills (no amenity kit whatsoever no footrests). Meals were alright but again rather simple compared to Asian carriers. Second leg to Birmingham on an A320 was above average by intra-Europe standards with a decent snack/beverage service and friendly service again. All flights on time.,"[sin, fra, bhx, in, economy, first, leg, from, singapore, on, the, a, was, great, largely, because, i, was, fortunate, enough, to, get, an, exit, row, seat, with, unlimited, legroom, judging, by, fellow, passengers, one, wouldn, t, be, happy, with, normal, seats, as, they, had, rather, pathetic, legroom, nice, modern, avod, system, but, the, ptvs, were, rather, small, compared, to, other, a, airlines, service, was, really, friendly, and, warm, but, few, frills, no, amenity, kit, whatsoever, no, footrests, meals, were, alright, but, again, rather, simple, compared, to, asian, carriers, second, leg, to, birmingham, on, an, a, was, above, average, by, intra, ...]"
2,"Spirit does what they state on their web site, they get you there - cheaply. For that I give them 5 stars because they did exactly what the said they would do. The plane was full and the seats were close together. I read all about that before I bought the ticket and it was as they said it would be, hence the low cost. Plan ahead and know what to expect and it will be a great experience. Its obvious that some of the people that gave 1 star reviews didn't understand about cost of bags or any extras and not done their homework - and are now very disappointed.","[spirit, does, what, they, state, on, their, web, site, they, get, you, there, cheaply, for, that, i, give, them, stars, because, they, did, exactly, what, the, said, they, would, do, the, plane, was, full, and, the, seats, were, close, together, i, read, all, about, that, before, i, bought, the, ticket, and, it, was, as, they, said, it, would, be, hence, the, low, cost, plan, ahead, and, know, what, to, expect, and, it, will, be, a, great, experience, its, obvious, that, some, of, the, people, that, gave, star, reviews, didn, t, understand, about, cost, of, bags, or, any, extras, and, not, ...]"
3,"My fiancé and I were booked to fly to Cayo Santa Maria (CUBA) February 6-13 2014. Our flight was scheduled to leave at 6.10am. Upon arriving at the airport at 4.30am we quickly noticed that the line up was very long. When we finally got to the check-in desk they asked us where we were headed we replied Cayo Santa Maria. We advised her that we had checked in online already and we just needed to print our boarding passes. She took our baggage and weighed it. Right before she was about to send it off a rude manager from the back came and just yelled out ""gates to Santa Clara are closed"". We were so shocked because it was only 4.55am at that time. We told them the plane would just be sitting there we could still make it. The rep simply told us ""please step aside we need to assist other pas...","[my, fianc, and, i, were, booked, to, fly, to, cayo, santa, maria, cuba, february, our, flight, was, scheduled, to, leave, at, am, upon, arriving, at, the, airport, at, am, we, quickly, noticed, that, the, line, up, was, very, long, when, we, finally, got, to, the, check, in, desk, they, asked, us, where, we, were, headed, we, replied, cayo, santa, maria, we, advised, her, that, we, had, checked, in, online, already, and, we, just, needed, to, print, our, boarding, passes, she, took, our, baggage, and, weighed, it, right, before, she, was, about, to, send, it, off, a, rude, manager, from, the, ...]"
4,"DXB-LHR B777-200ER BA0108 August 18 First Class. Transferred from an Emirates flight in DXB. BA DXB Galleries Lounge reception staff member excellent. Boarding reasonable with an on-time departure. Cabin crew outstanding and definitely lived up to the ""To Fly. To Serve"" BA slogan. Food tasty and well presented but not quite First Class and cost-cutting was evident. The New First seat is comfortable though the footrest is poorly designed and a storage area for small inflight items is missing. IFE monitor controls and selection very good but the screen could be more adjustable for reach. Lovely cabin ambience including colors textures mood lighting and window blinds. Toilets cramped and stocked with the cheapest liquid soaps and toilet paper. Overall an enjoyable flight but as a longstan...","[dxb, lhr, b, er, ba, august, first, class, transferred, from, an, emirates, flight, in, dxb, ba, dxb, galleries, lounge, reception, staff, member, excellent, boarding, reasonable, with, an, on, time, departure, cabin, crew, outstanding, and, definitely, lived, up, to, the, to, fly, to, serve, ba, slogan, food, tasty, and, well, presented, but, not, quite, first, class, and, cost, cutting, was, evident, the, new, first, seat, is, comfortable, though, the, footrest, is, poorly, designed, and, a, storage, area, for, small, inflight, items, is, missing, ife, monitor, controls, and, selection, very, good, but, the, screen, could, be, more, adjustable, for, reach, lovely, cabin, ...]"


Count token frequencies.

In [8]:
from collections import Counter

cnt = Counter()
# Series.iteritems() was removed in pandas 2.0; iterating over the Series yields the token lists directly.
for text in tqdm(df.processed_content, total=len(df)):
    cnt.update(text)


## Output 1

In [13]:
n_common = 20
most_common_words = [word for (word, counts) in cnt.most_common(n_common)]
most_common_words

['the',
 'and',
 'to',
 'was',
 'a',
 'on',
 'i',
 'in',
 'flight',
 'of',
 'for',
 'with',
 'were',
 'we',
 'not',
 'is',
 'but',
 'it',
 'at',
 'from']
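
The counting step can be tried on a toy corpus. The sketch below uses three made-up token lists (an assumption, not dataset rows) and shows that `Counter.most_common` returns `(token, count)` pairs in descending order of count, with ties kept in the order the tokens were first seen.

```python
from collections import Counter

# Hypothetical tokenized documents, used only for illustration.
tokenized_docs = [
    ["the", "flight", "was", "late"],
    ["the", "crew", "was", "great"],
    ["the", "flight", "left", "early"],
]

counts = Counter()
for tokens in tokenized_docs:
    counts.update(tokens)  # adds one count per token occurrence

top = counts.most_common(3)
print(top)  # [('the', 3), ('flight', 2), ('was', 2)]

# Keep only the tokens, as the notebook does before writing the file.
top_words = [word for word, _ in top]
```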

In [14]:
with open('popular_tokens.txt', 'w') as f:
    f.write('\n'.join(most_common_words))

In [15]:
!head -n 20 popular_tokens.txt

the
and
to
was
a
on
i
in
flight
of
for
with
were
we
not
is
but
it
at
from

# Task 2 (10 points)

Work with the text obtained in Task 1.

Stem the tokens with SnowballStemmer from the NLTK library. Then remove all stop words (take the stop-word list from NLTK). Find the 20 most frequent stems among those that remain after stop-word removal and write them in descending order of frequency (as in Task 1) to the file __popular_stems.txt__.

Save the resulting texts (stems with stop words removed) for Task 3.

## Solution 2

In [47]:
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

snowball = SnowballStemmer('english')
stops = set(stopwords.words('english'))

In [53]:
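from functools import lru_cache

# The cache lives at module level so that it persists across documents
# (within each worker process) instead of being rebuilt on every call.
@lru_cache(maxsize=None)
def stemm_token(token: str) -> str:
    return snowball.stem(token)

def stemm_content(content: list, filtered_words: set) -> list:
    # Stop words are filtered on the raw tokens; only the remaining tokens are stemmed.
    return [stemm_token(token) for token in content if token not in filtered_words]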
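
The order of the two steps matters, because a stemmed stop word may no longer match the unstemmed stop-word list. The sketch below illustrates this with `toy_stem`, a hypothetical stand-in for `SnowballStemmer.stem` that implements only two suffix rules (modelled on the `seats` to `seat` and `pretty` to `pretti` behaviour visible in the output), and with a tiny made-up stop-word set rather than the NLTK list.

```python
def toy_stem(token: str) -> str:
    """Hypothetical stand-in for SnowballStemmer.stem: only two suffix rules."""
    if len(token) > 2 and token.endswith("y"):
        return token[:-1] + "i"   # pretty -> pretti
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]         # seats -> seat
    return token

# Tiny illustrative subset, not the NLTK stop-word list.
STOPS = {"the", "were", "very", "only"}

def filter_then_stem(tokens, stops, stem=toy_stem):
    # Drop stop words first, then stem what is left (the order used in this notebook).
    return [stem(t) for t in tokens if t not in stops]

def stem_then_filter(tokens, stops, stem=toy_stem):
    # Stem first, then compare the stems against the unstemmed stop-word list.
    return [s for s in (stem(t) for t in tokens) if s not in stops]

tokens = ["the", "seats", "were", "very", "pretty"]
print(filter_then_stem(tokens, STOPS))  # ['seat', 'pretti']
print(stem_then_filter(tokens, STOPS))  # ['seat', 'veri', 'pretti']
```

With stemming first, `very` becomes `veri` and slips past the stop-word filter, so one would also have to stem the stop-word list to get the same result.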

In [54]:
from functools import partial

stemm_content_with_stops = partial(stemm_content, filtered_words=stops)

In [55]:
with Pool(4) as pool:
    stemmed_texts = list(tqdm(pool.imap(stemm_content_with_stops, df.processed_content), total=len(df)))


In [56]:
df['stemmed_content'] = stemmed_texts

In [57]:
df.head()

Unnamed: 0,content,processed_content,stemmed_content
0,March 5th 2014 from Ottawa Canada to Cuba WG 630. They announced that the flight was going to be delayed 1 hour no explanation why. They started boarding and we took off only 1/2 hour late. There were 6 of us 2 were seated together and remaining 4 were put in aisle seats side by side. On the way back from Cuba on March 12th 2014 WG 631 we were slow going through immigration no fault of Sunwing. Finally arrived to our plane at 10.35am the doors immediately closed and the plane took off 5 minutes later 20 minutes earlier than expected. The 6 of us were pretty much split up by 2 each seating my 12 old daughter by herself behind us. Overall the staff were great very friendly and approachable. The food served was pretty good considering most airlines don't offer meal service for free. It wa...,"[march, th, from, ottawa, canada, to, cuba, wg, they, announced, that, the, flight, was, going, to, be, delayed, hour, no, explanation, why, they, started, boarding, and, we, took, off, only, hour, late, there, were, of, us, were, seated, together, and, remaining, were, put, in, aisle, seats, side, by, side, on, the, way, back, from, cuba, on, march, th, wg, we, were, slow, going, through, immigration, no, fault, of, sunwing, finally, arrived, to, our, plane, at, am, the, doors, immediately, closed, and, the, plane, took, off, minutes, later, minutes, earlier, than, expected, the, of, us, were, pretty, much, split, up, by, ...]","[march, th, ottawa, canada, cuba, wg, announc, flight, go, delay, hour, explan, start, board, took, hour, late, us, seat, togeth, remain, put, aisl, seat, side, side, way, back, cuba, march, th, wg, slow, go, immigr, fault, sunw, final, arriv, plane, door, immedi, close, plane, took, minut, later, minut, earlier, expect, us, pretti, much, split, seat, old, daughter, behind, us, overal, staff, great, friend, approach, food, serv, pretti, good, consid, airlin, offer, meal, servic, free, compar, meal, purchas, airlin]"
1,SIN-FRA-BHX in Economy. First leg from Singapore on the A380 was great largely because I was fortunate enough to get an exit row seat with unlimited legroom (judging by fellow passengers one wouldn't be happy with normal seats as they had rather pathetic legroom). Nice modern AVOD system but the PTVs were rather small compared to other A380 airlines. Service was really friendly and warm but few frills (no amenity kit whatsoever no footrests). Meals were alright but again rather simple compared to Asian carriers. Second leg to Birmingham on an A320 was above average by intra-Europe standards with a decent snack/beverage service and friendly service again. All flights on time.,"[sin, fra, bhx, in, economy, first, leg, from, singapore, on, the, a, was, great, largely, because, i, was, fortunate, enough, to, get, an, exit, row, seat, with, unlimited, legroom, judging, by, fellow, passengers, one, wouldn, t, be, happy, with, normal, seats, as, they, had, rather, pathetic, legroom, nice, modern, avod, system, but, the, ptvs, were, rather, small, compared, to, other, a, airlines, service, was, really, friendly, and, warm, but, few, frills, no, amenity, kit, whatsoever, no, footrests, meals, were, alright, but, again, rather, simple, compared, to, asian, carriers, second, leg, to, birmingham, on, an, a, was, above, average, by, intra, ...]","[sin, fra, bhx, economi, first, leg, singapor, great, larg, fortun, enough, get, exit, row, seat, unlimit, legroom, judg, fellow, passeng, one, happi, normal, seat, rather, pathet, legroom, nice, modern, avod, system, ptvs, rather, small, compar, airlin, servic, realli, friend, warm, frill, amen, kit, whatsoev, footrest, meal, alright, rather, simpl, compar, asian, carrier, second, leg, birmingham, averag, intra, europ, standard, decent, snack, beverag, servic, friend, servic, flight, time]"
2,"Spirit does what they state on their web site, they get you there - cheaply. For that I give them 5 stars because they did exactly what the said they would do. The plane was full and the seats were close together. I read all about that before I bought the ticket and it was as they said it would be, hence the low cost. Plan ahead and know what to expect and it will be a great experience. Its obvious that some of the people that gave 1 star reviews didn't understand about cost of bags or any extras and not done their homework - and are now very disappointed.","[spirit, does, what, they, state, on, their, web, site, they, get, you, there, cheaply, for, that, i, give, them, stars, because, they, did, exactly, what, the, said, they, would, do, the, plane, was, full, and, the, seats, were, close, together, i, read, all, about, that, before, i, bought, the, ticket, and, it, was, as, they, said, it, would, be, hence, the, low, cost, plan, ahead, and, know, what, to, expect, and, it, will, be, a, great, experience, its, obvious, that, some, of, the, people, that, gave, star, reviews, didn, t, understand, about, cost, of, bags, or, any, extras, and, not, ...]","[spirit, state, web, site, get, cheapli, give, star, exact, said, would, plane, full, seat, close, togeth, read, bought, ticket, said, would, henc, low, cost, plan, ahead, know, expect, great, experi, obvious, peopl, gave, star, review, understand, cost, bag, extra, done, homework, disappoint]"
3,"My fiancé and I were booked to fly to Cayo Santa Maria (CUBA) February 6-13 2014. Our flight was scheduled to leave at 6.10am. Upon arriving at the airport at 4.30am we quickly noticed that the line up was very long. When we finally got to the check-in desk they asked us where we were headed we replied Cayo Santa Maria. We advised her that we had checked in online already and we just needed to print our boarding passes. She took our baggage and weighed it. Right before she was about to send it off a rude manager from the back came and just yelled out ""gates to Santa Clara are closed"". We were so shocked because it was only 4.55am at that time. We told them the plane would just be sitting there we could still make it. The rep simply told us ""please step aside we need to assist other pas...","[my, fianc, and, i, were, booked, to, fly, to, cayo, santa, maria, cuba, february, our, flight, was, scheduled, to, leave, at, am, upon, arriving, at, the, airport, at, am, we, quickly, noticed, that, the, line, up, was, very, long, when, we, finally, got, to, the, check, in, desk, they, asked, us, where, we, were, headed, we, replied, cayo, santa, maria, we, advised, her, that, we, had, checked, in, online, already, and, we, just, needed, to, print, our, boarding, passes, she, took, our, baggage, and, weighed, it, right, before, she, was, about, to, send, it, off, a, rude, manager, from, the, ...]","[fianc, book, fli, cayo, santa, maria, cuba, februari, flight, schedul, leav, upon, arriv, airport, quick, notic, line, long, final, got, check, desk, ask, us, head, repli, cayo, santa, maria, advis, check, onlin, alreadi, need, print, board, pass, took, baggag, weigh, right, send, rude, manag, back, came, yell, gate, santa, clara, close, shock, time, told, plane, would, sit, could, still, make, rep, simpli, told, us, pleas, step, asid, need, assist, passeng, found, passeng, us, left, behind, told, us, go, airlin, check, open, seat, liabl, long, stori, short, purchas, anoth, 
packag, next, morn, flight, time, knew, lie, us, board, plane, pleas, favour, ...]"
4,"DXB-LHR B777-200ER BA0108 August 18 First Class. Transferred from an Emirates flight in DXB. BA DXB Galleries Lounge reception staff member excellent. Boarding reasonable with an on-time departure. Cabin crew outstanding and definitely lived up to the ""To Fly. To Serve"" BA slogan. Food tasty and well presented but not quite First Class and cost-cutting was evident. The New First seat is comfortable though the footrest is poorly designed and a storage area for small inflight items is missing. IFE monitor controls and selection very good but the screen could be more adjustable for reach. Lovely cabin ambience including colors textures mood lighting and window blinds. Toilets cramped and stocked with the cheapest liquid soaps and toilet paper. Overall an enjoyable flight but as a longstan...","[dxb, lhr, b, er, ba, august, first, class, transferred, from, an, emirates, flight, in, dxb, ba, dxb, galleries, lounge, reception, staff, member, excellent, boarding, reasonable, with, an, on, time, departure, cabin, crew, outstanding, and, definitely, lived, up, to, the, to, fly, to, serve, ba, slogan, food, tasty, and, well, presented, but, not, quite, first, class, and, cost, cutting, was, evident, the, new, first, seat, is, comfortable, though, the, footrest, is, poorly, designed, and, a, storage, area, for, small, inflight, items, is, missing, ife, monitor, controls, and, selection, very, good, but, the, screen, could, be, more, adjustable, for, reach, lovely, cabin, ...]","[dxb, lhr, b, er, ba, august, first, class, transfer, emir, flight, dxb, ba, dxb, galleri, loung, recept, staff, member, excel, board, reason, time, departur, cabin, crew, outstand, definit, live, fli, serv, ba, slogan, food, tasti, well, present, quit, first, class, cost, cut, evid, new, first, seat, comfort, though, footrest, poor, design, storag, area, small, inflight, item, miss, ife, monitor, control, select, good, screen, could, adjust, reach, love, cabin, ambienc, includ, color, textur, 
mood, light, window, blind, toilet, cramp, stock, cheapest, liquid, soap, toilet, paper, overal, enjoy, flight, longstand, loyal, ba, custom, gold, tier, ask, serious, consid, improv, product, increas, competit, ...]"


In [60]:
from collections import Counter

cnt = Counter()
# Series.iteritems() was removed in pandas 2.0; iterating over the Series yields the stem lists directly.
for text in tqdm(df.stemmed_content, total=len(df)):
    cnt.update(text)


## Output 2

In [61]:
n_common = 20
most_common_words = [word for (word, counts) in cnt.most_common(n_common)]
most_common_words

['flight',
 'seat',
 'time',
 'servic',
 'good',
 'food',
 'airlin',
 'hour',
 'crew',
 'staff',
 'plane',
 'check',
 'return',
 'cabin',
 'class',
 'fli',
 'board',
 'would',
 'one',
 'busi']

In [62]:
with open('popular_stems.txt', 'w') as f:
    f.write('\n'.join(most_common_words))
    
! head -n 20 popular_stems.txt

flight
seat
time
servic
good
food
airlin
hour
crew
staff
plane
check
return
cabin
class
fli
board
would
one
busi

# Task 3 (30 points)

Work with the texts obtained in Task 2.

Apply a TF-IDF transformation (with n-gram range = (1, 1)) to the document collection. For each document, find the single stem with the highest tf-idf weight. Write these stems to the file __tfidf_stems.txt__ in the following format: one word per document, with lines in the same order as the documents in the original dataset. The resulting file must contain as many lines and words as there are documents in "airlines.reviews.train.tsv".

## Solution 3

In [63]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [64]:
vec = TfidfVectorizer(ngram_range=(1, 1), preprocessor=lambda x: ' '.join(x))

content_tfidf = vec.fit_transform(df.stemmed_content)

In [65]:
content_tfidf

<23322x15365 sparse matrix of type '<class 'numpy.float64'>'
	with 1154719 stored elements in Compressed Sparse Row format>

In [81]:
vocabulary_index = {value: key for key, value in vec.vocabulary_.items()}

In [83]:
significant_stems = []

# Rows of the TF-IDF matrix follow the document order of df, so iterate by row position.
for i in tqdm(range(content_tfidf.shape[0])):
    best_stem = vocabulary_index[content_tfidf[i].argmax()]
    significant_stems.append(best_stem)
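
To see why rare stems such as airport codes win, the per-document argmax can be reproduced without scikit-learn. This is a minimal sketch on three made-up documents; it assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1`, which is my understanding of the `TfidfVectorizer` default, and it skips L2 normalization because rescaling a whole document cannot change which term is largest. The function name `top_tfidf_term`, the alphabetical tie-breaking, and the empty string for empty documents are choices of this sketch, not behaviour taken from the notebook.

```python
import math
from collections import Counter

def top_tfidf_term(docs):
    """Return, for each tokenized document, the term with the highest TF-IDF weight."""
    n_docs = len(docs)
    # Document frequency: the number of documents each term occurs in.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF (assumed to match the scikit-learn default).
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    best = []
    for doc in docs:
        tf = Counter(doc)
        # sorted() makes ties resolve alphabetically; empty documents yield "".
        best.append(max(sorted(tf), key=lambda t: tf[t] * idf[t]) if tf else "")
    return best

# Hypothetical stemmed documents, used only for illustration.
docs = [
    ["flight", "seat", "seat", "good"],
    ["flight", "crew", "good"],
    ["flight", "delay", "delay", "delay"],
]
best_terms = top_tfidf_term(docs)
print(best_terms)  # ['seat', 'crew', 'delay']
```

`flight` occurs in every document, so its IDF is the minimum value of 1 and it never wins, even though it is the most frequent stem overall.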


In [84]:
significant_stems

['wg',
 'rather',
 'star',
 'santa',
 'ba',
 'lci',
 'km',
 'calgari',
 'discount',
 'recaro',
 'window',
 'packag',
 'attain',
 'durban',
 'brink',
 'busi',
 'spectacular',
 'tiger',
 'quit',
 'lhe',
 'hilo',
 'lpl',
 'jfk',
 'partner',
 'sydney',
 'cushion',
 'realis',
 'saudi',
 'cold',
 'dominican',
 'inordin',
 'ryanair',
 'cx',
 'print',
 'christchurch',
 'floor',
 'rather',
 'hak',
 'sof',
 'debat',
 'phuket',
 'outstand',
 'grubbi',
 'clark',
 'retrofit',
 'beg',
 'lci',
 'dragonair',
 'sofia',
 'pek',
 'god',
 'fra',
 'frantic',
 'dmm',
 'glanc',
 'bye',
 'upgrad',
 'pvg',
 'section',
 'man',
 'serv',
 'inbound',
 'kul',
 'main',
 'good',
 'april',
 'adria',
 'prg',
 'allway',
 'bhd',
 'copenhagen',
 'consol',
 'grain',
 'ewr',
 'window',
 'singapor',
 'juan',
 'toilet',
 'halifax',
 'vietnam',
 'personalis',
 'southampton',
 'told',
 'pudong',
 'nagoya',
 'pretti',
 'geneva',
 'zac',
 'lax',
 'quiet',
 'confort',
 'san',
 'thai',
 'vnukovo',
 'kqs',
 'chines',
 'avail',
 'tku

## Output 3

In [85]:
with open('tfidf_stems.txt', 'w') as f:
    f.write('\n'.join(significant_stems))

! head -n 20 tfidf_stems.txt

wg
rather
star
santa
ba
lci
km
calgari
discount
recaro
window
packag
attain
durban
brink
busi
spectacular
tiger
quit
lhe
