# Keyword extraction using TF-IDF #

The TF-IDF method

TF-IDF (TF: term frequency, IDF: inverse document frequency) is a term-weighting scheme that scores words by how often they occur; it belongs to the family of statistical term-weighting algorithms. Each document is represented by a vector made up of the weights of the words that occur in it. The weight balances a term's local importance (its frequency within the document) against its importance across the whole document collection, so a term that appears in many documents is weighted down.

More information can be found here: https://www.sunrisesystem.pl/blog/2411-czym-jest-tfidf-i-jak-wplywa-na-pozycjonowanie-strony-ster-n.html
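To make the weighting concrete, here is a minimal sketch that computes the weights by hand on a toy corpus. It assumes the smoothed variant, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalisation, which is what scikit-learn documents for `TfidfTransformer(smooth_idf=True)`. The function name `tfidf_weights` and the three toy documents are made up for illustration and are not part of the notebook.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Return one {term: weight} dict per document (smoothed IDF, L2-normalised)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # document frequency: in how many documents each term occurs
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # smoothed inverse document frequency
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts in this document
        raw = {term: tf[term] * idf[term] for term in tf}
        norm = math.sqrt(sum(v * v for v in raw.values()))
        weights.append({term: v / norm for term, v in raw.items()})
    return weights

# hypothetical toy corpus
toy_docs = ["python list sort", "python dict lookup", "java list iterator"]
weights = tfidf_weights(toy_docs)
top_term = max(weights[0], key=weights[0].get)
print(top_term, round(weights[0][top_term], 3))
```

In the first toy document every term occurs once, so the term that is rarest across the corpus ends up with the highest weight.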

The training and test sets come from Stack Overflow. The collected posts mostly cover technology, programming, and computer science in the broad sense, along with related fields. They are written in English.

The resulting algorithm works well for technology articles, and it also copes with other kinds of articles, for example on politics, nature, and health. The articles must, however, be in English.

### Importing libraries ###

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
import re

In [2]:
# training set
df_idf = pd.read_json("stackoverflow-data-idf.json",lines=True)
 
# print the schema and size of the training set
print("Schema:\n\n",df_idf.dtypes)
print("Number of questions,columns=",df_idf.shape)

Schema:

 accepted_answer_id          float64
answer_count                  int64
body                         object
comment_count                 int64
community_owned_date         object
creation_date                object
favorite_count              float64
id                            int64
last_activity_date           object
last_edit_date               object
last_editor_display_name     object
last_editor_user_id         float64
owner_display_name           object
owner_user_id               float64
post_type_id                  int64
score                         int64
tags                         object
title                        object
view_count                    int64
dtype: object
Number of questions,columns= (20000, 19)


In [3]:
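# text preprocessing: lowercase, then strip markup, special characters and digits
def pre_process(text):
    text = text.lower()
    # this pattern only matches HTML comments (<!-- ... -->)
    text = re.sub("<!--?.*?-->","", text)
    # runs of digits and non-word characters (including the < / > of tags) become one space,
    # so tag names such as p, pre and code remain as tokens
    text = re.sub("(\\d|\\W)+", " ", text)
    return text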
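The effect of the two substitutions is easier to see on a short string. The sketch below repeats `pre_process` so that it runs on its own and compares it with `pre_process_strip_tags`, a hypothetical variant (not part of the notebook) that removes whole tags before the second substitution. The sample string is made up.

```python
import re

def pre_process(text):
    # same steps as the notebook's function
    text = text.lower()
    text = re.sub("<!--?.*?-->", "", text)
    text = re.sub("(\\d|\\W)+", " ", text)
    return text

def pre_process_strip_tags(text):
    # hypothetical variant: drop whole tags first, so tag names do not become tokens
    text = text.lower()
    text = re.sub("<[^>]+>", " ", text)
    text = re.sub("(\\d|\\W)+", " ", text)
    return text

sample = "<p>Loop #3: use <code>for..in</code></p>"
cleaned = pre_process(sample)
cleaned_no_tags = pre_process_strip_tags(sample)
print(cleaned.split())
print(cleaned_no_tags.split())
```

With the notebook's function the tag names `p` and `code` stay in the text, which matches the sample output from the training set. Whether they should be removed is a design choice: they occur in almost every post, so a `max_df` threshold or a stop-word list can filter them out later as well.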

In [5]:
df_idf['text'] = df_idf['title'] + df_idf['body']
df_idf['text'] = df_idf['text'].apply(lambda x:pre_process(x))
 
# sample text from the training set
df_idf['text'][3]

'loop variable as parameter in asynchronous function call p i have an object with the following form p pre code sortedfilters task appointment email code pre p now i want to use a code for in code loop to build my tasks array for code async parallel code which contains the functions to be executed asynchronously p pre code var tasks for var entity in sortedfilters tasks push function callback var entityresult fetchrecordsforentity entity businessunits sortedfilters var formattedentityresults formatresults entity entityresult callback null formattedentityresults code pre p the problem is that at the moment the functions are called code entity code points at the last value of the loop in this case it would be code email code p p how can i nail the exact value at the moment the function is added to the array p '

In [6]:
def get_stop_words(stop_file_path):
    """load stop words """
    with open(stop_file_path, 'r', encoding="utf-8") as f:
        stopwords = f.readlines()
        stop_set = set(m.strip() for m in stopwords)
        return frozenset(stop_set)

# load the stop-word set: the list of words to discard
stopwords = get_stop_words("stopwords.txt")
 
# the document collection
docs = df_idf['text'].tolist()
  
# ignore words that occur in more than 85% of all documents
# and drop stop words (passed as a list, which every scikit-learn version accepts;
# recent releases may reject a set or frozenset here)
cv = CountVectorizer(max_df=0.85, stop_words=list(stopwords))
word_count_vector = cv.fit_transform(docs)



In [7]:
# same vectorizer, now limited to the 10,000 most frequent terms
cv = CountVectorizer(max_df=0.85, stop_words=list(stopwords), max_features=10000)
word_count_vector = cv.fit_transform(docs)
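The three vocabulary filters used above (`max_df`, `stop_words`, `max_features`) can be checked on a toy corpus. This is only a sketch: the four documents and the one-word stop list are made up, and the names `toy_cv` and `toy_cv_small` are not part of the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

# hypothetical toy corpus: "the" and "is" occur in every document
toy_docs = [
    "the python list is sorted",
    "the python dict is hashed",
    "the java list is linked",
    "the rust vector is owned",
]

# "is" is dropped as a stop word; "the" is dropped because its
# document frequency (4 of 4) exceeds max_df=0.85
toy_cv = CountVectorizer(max_df=0.85, stop_words=["is"])
toy_counts = toy_cv.fit_transform(toy_docs)
vocab = sorted(toy_cv.vocabulary_)
print(vocab)
print(toy_counts.shape)  # (number of documents, vocabulary size)

# max_features keeps only the most frequent of the remaining terms
toy_cv_small = CountVectorizer(max_df=0.85, stop_words=["is"], max_features=3)
toy_cv_small.fit(toy_docs)
small_vocab = set(toy_cv_small.vocabulary_)
print(small_vocab)
```

`max_df` is applied before `max_features`, so very common words do not take up slots among the 10,000 retained terms.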



In [8]:
list(cv.vocabulary_.keys())[:10]

['serializing',
 'private',
 'struct',
 'public',
 'class',
 'contains',
 'properties',
 'string',
 'serialize',
 'attempt']

In [9]:
tfidf_transformer = TfidfTransformer(smooth_idf=True,use_idf=True)
tfidf_transformer.fit(word_count_vector)

TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)
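After fitting, the learned IDF values are available in the transformer's `idf_` attribute, one per vocabulary column. The sketch below fits the same kind of transformer on a made-up three-document corpus to show how a term's IDF falls as it appears in more documents; `toy_cv`, `toy_tfidf` and `idf_by_term` are illustrative names.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# hypothetical toy corpus: "python" is in all three documents, "list" in two, "sort" in one
toy_docs = ["python list sort", "python dict lookup", "python list iterator"]

toy_cv = CountVectorizer()
toy_tfidf = TfidfTransformer(smooth_idf=True, use_idf=True)
toy_tfidf.fit(toy_cv.fit_transform(toy_docs))

# order the terms by column index so they line up with idf_
terms = sorted(toy_cv.vocabulary_, key=toy_cv.vocabulary_.get)
idf_by_term = dict(zip(terms, toy_tfidf.idf_))

for term in ("python", "list", "sort"):
    print(term, round(idf_by_term[term], 3))
```

A term present in every document gets the minimum IDF of 1.0 under smoothing, so it still contributes to the vector but never outranks a rarer term with the same count.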

In [10]:
# load the test set and combine each post's title and body
df_test = pd.read_json("stackoverflow-test.json", lines=True)
df_test['text'] = df_test['title'] + df_test['body']
df_test['text'] = df_test['text'].apply(lambda x:pre_process(x))
 
# build the list of test documents
docs_test = df_test['text'].tolist()

In [11]:
def sort_coo(coo_matrix):
    tuples = zip(coo_matrix.col, coo_matrix.data)
    return sorted(tuples, key=lambda x: (x[1], x[0]), reverse=True)
 
def extract_topn_from_vector(feature_names, sorted_items, topn=10):
    """get the feature names and tf-idf score of top n items"""
    
    # keep only the topn best results
    sorted_items = sorted_items[:topn]
 
    score_vals = []
    feature_vals = []
    
    # word index and tf-idf score
    for idx, score in sorted_items:
        
        score_vals.append(round(score, 3))
        feature_vals.append(feature_names[idx])
 
    results= {}
    for idx in range(len(feature_vals)):
        results[feature_vals[idx]]=score_vals[idx]
    
    return results
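The two helpers can be tried on a hand-built sparse row. The sketch below restates them in condensed form so that it runs on its own (the dict comprehension replaces the two loops and is assumed to behave the same for unique feature names); the feature names and scores are made up.

```python
from scipy.sparse import coo_matrix

def sort_coo(matrix):
    # pair each column index with its score, highest score first
    tuples = zip(matrix.col, matrix.data)
    return sorted(tuples, key=lambda x: (x[1], x[0]), reverse=True)

def extract_topn_from_vector(feature_names, sorted_items, topn=10):
    # condensed form of the notebook's helper
    return {feature_names[idx]: round(score, 3) for idx, score in sorted_items[:topn]}

# hypothetical vocabulary and one tf-idf row; zero entries are not stored
toy_names = ["alpha", "beta", "gamma", "delta"]
toy_vector = coo_matrix([[0.5, 0.0, 0.2, 0.7]])

toy_keywords = extract_topn_from_vector(toy_names, sort_coo(toy_vector), topn=2)
print(toy_keywords)
```

Because the COO format stores only non-zero entries, words that are absent from the document never reach the sort, and a document with fewer than `topn` known words returns a shorter result.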

In [12]:
def find_keywords(doc):
    # compute the tf-idf vector for the document doc
    tf_idf_vector = tfidf_transformer.transform(cv.transform([doc]))

    # sort the keyword candidates by tf-idf score
    sorted_items = sort_coo(tf_idf_vector.tocoo())

    # keep the 10 best results
    keywords = extract_topn_from_vector(feature_names, sorted_items, 10)

    # print the result
    print("\n=====Doc=====")
    print(doc)
    print("\n===Keywords===")
    for k in keywords:
        print(k, keywords[k])

In [13]:
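# get_feature_names() was removed in scikit-learn 1.2;
# on releases older than 1.0 call cv.get_feature_names() instead
feature_names = cv.get_feature_names_out()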

## Usage examples ##

In [14]:
# pass in the document whose keywords we want to extract
doc1 = '''
Researchers have come up with a method for creating realistic-looking — but fake — 
videos of anyone by using just a single image of them with a trained artificial 
intelligence system. It's a potentially worrisome capability in the runup to the 
2020 United States presidential election, as falsified videos of candidates are 
expected to spread. Researchers at the Samsung AI Center in Moscow and the Moscow-based 
Skolkovo Institute of Science and Technology explained the feat in a paper published 
this week to the arXiv, an online academic pre-print service. They said they were able 
to animate one or several photos of people by first training an AI system on a dataset 
of videos including many celebrities, so it could learn about key points on the face. 
After that, the AI system was able to combine that familiarity with one or more images 
of a person to come up with a convincing "talking head"-style video of them. 
'''

find_keywords(doc1)


=====Doc=====

Researchers have come up with a method for creating realistic-looking — but fake — 
videos of anyone by using just a single image of them with a trained artificial 
intelligence system. It's a potentially worrisome capability in the runup to the 
2020 United States presidential election, as falsified videos of candidates are 
expected to spread. Researchers at the Samsung AI Center in Moscow and the Moscow-based 
Skolkovo Institute of Science and Technology explained the feat in a paper published 
this week to the arXiv, an online academic pre-print service. They said they were able 
to animate one or several photos of people by first training an AI system on a dataset 
of videos including many celebrities, so it could learn about key points on the face. 
After that, the AI system was able to combine that familiarity with one or more images 
of a person to come up with a convincing "talking head"-style video of them. 


===Keywords===
ai 0.451
videos 0.37
system 0.199
c

In [15]:
doc2 = '''
Computer scientists at The University of Texas at Austin have taught an 
artificial intelligence agent how to do something that usually only humans 
can do -- take a few quick glimpses around and infer its whole environment, 
a skill necessary for the development of effective search-and-rescue robots 
that one day can improve the effectiveness of dangerous missions. The team, 
led by professor Kristen Grauman, Ph.D. candidate Santhosh Ramakrishnan and 
former Ph.D. candidate Dinesh Jayaraman (now at the University of California, 
Berkeley) published their results today in the journal Science Robotics.
Most AI agents -- computer systems that could endow robots or other machines 
with intelligence -- are trained for very specific tasks -- such as to 
recognize an object or estimate its volume -- in an environment they have 
experienced before, like a factory. But the agent developed by Grauman and 
Ramakrishnan is general purpose, gathering visual information that can then 
be used for a wide range of tasks.
'''

find_keywords(doc2)


=====Doc=====

Computer scientists at The University of Texas at Austin have taught an 
artificial intelligence agent how to do something that usually only humans 
can do -- take a few quick glimpses around and infer its whole environment, 
a skill necessary for the development of effective search-and-rescue robots 
that one day can improve the effectiveness of dangerous missions. The team, 
led by professor Kristen Grauman, Ph.D. candidate Santhosh Ramakrishnan and 
former Ph.D. candidate Dinesh Jayaraman (now at the University of California, 
Berkeley) published their results today in the journal Science Robotics.
Most AI agents -- computer systems that could endow robots or other machines 
with intelligence -- are trained for very specific tasks -- such as to 
recognize an object or estimate its volume -- in an environment they have 
experienced before, like a factory. But the agent developed by Grauman and 
Ramakrishnan is general purpose, gathering visual information that can the

In [16]:
doc3 = '''
A fleet of driverless cars working together to keep traffic moving smoothly can improve 
overall traffic flow by at least 35 percent, researchers have shown.
The researchers, from the University of Cambridge, programmed a small fleet of 
miniature robotic cars to drive on a multi-lane track and observed how the traffic 
flow changed when one of the cars stopped. When the cars were not driving cooperatively, 
any cars behind the stopped car had to stop or slow down and wait for a gap in the traffic, 
as would typically happen on a real road. A queue quickly formed behind the stopped car 
and overall traffic flow was slowed. However, when the cars were communicating with each 
other and driving cooperatively, as soon as one car stopped in the inner lane, it sent a 
signal to all the other cars. Cars in the outer lane that were in immediate proximity of 
the stopped car slowed down slightly so that cars in the inner lane were able to quickly 
pass the stopped car without having to stop or slow down significantly.
'''

find_keywords(doc3)


=====Doc=====

A fleet of driverless cars working together to keep traffic moving smoothly can improve 
overall traffic flow by at least 35 percent, researchers have shown.
The researchers, from the University of Cambridge, programmed a small fleet of 
miniature robotic cars to drive on a multi-lane track and observed how the traffic 
flow changed when one of the cars stopped. When the cars were not driving cooperatively, 
any cars behind the stopped car had to stop or slow down and wait for a gap in the traffic, 
as would typically happen on a real road. A queue quickly formed behind the stopped car 
and overall traffic flow was slowed. However, when the cars were communicating with each 
other and driving cooperatively, as soon as one car stopped in the inner lane, it sent a 
signal to all the other cars. Cars in the outer lane that were in immediate proximity of 
the stopped car slowed down slightly so that cars in the inner lane were able to quickly 
pass the stopped car without h

In [17]:
doc4 = '''
McMaster researchers have developed a simple and highly novel form of computing 
by shining patterned bands of light and shadow through different facets of a 
polymer cube and reading the combined results that emerge.
The material in the cube reads and reacts intuitively to the light in 
much the same way a plant would turn to the sun, or a cuttlefish would 
change the color of its skin. The researchers are graduate students in 
chemistry supervised by Kalaichelvi Saravanamuttu, an associate professor of 
chemistry and chemical biology whose lab focuses on ideas inspired by natural 
biological systems. The researchers were able to use their new process to 
perform simple addition and subtraction questions.
'''

find_keywords(doc4)


=====Doc=====

McMaster researchers have developed a simple and highly novel form of computing 
by shining patterned bands of light and shadow through different facets of a 
polymer cube and reading the combined results that emerge.
The material in the cube reads and reacts intuitively to the light in 
much the same way a plant would turn to the sun, or a cuttlefish would 
change the color of its skin. The researchers are graduate students in 
chemistry supervised by Kalaichelvi Saravanamuttu, an associate professor of 
chemistry and chemical biology whose lab focuses on ideas inspired by natural 
biological systems. The researchers were able to use their new process to 
perform simple addition and subtraction questions.


===Keywords===
cube 0.389
light 0.301
polymer 0.216
plant 0.213
professor 0.211
skin 0.201
computing 0.197
lab 0.191
natural 0.188
simple 0.184


In [None]:
# articles on topics other than technology

In [18]:
doc5 = '''
With eight million tons of plastic trash added to our oceans every year, 
the seas that we depend on for survival are in grave danger. Besides 
washing up on shorelines, plastic is now found just about everywhere and 
in everything. It’s in the deepest ocean trenches, on the most remote 
shorelines, in all levels of the marine food web, in the water we drink, 
the food we eat, and even the air we breathe. adidas, in partnership with 
Parley for the Oceans, is trying to do something about it.
Now in its third year, the annual Run For The Oceans, an initiative 
established and driven by adidas in collaboration with Parley, has helped 
the development of the Parley Ocean School Program. In 2018 alone, the run 
united nearly a million runners around the world and raised $1 million, 
with adidas contributing one dollar for every kilometer run directly to the 
program. This year the run will be bigger and, given the urgency of the marine
plastic pollution crisis, more important than ever.
'''

find_keywords(doc5)


=====Doc=====

With eight million tons of plastic trash added to our oceans every year, 
the seas that we depend on for survival are in grave danger. Besides 
washing up on shorelines, plastic is now found just about everywhere and 
in everything. It’s in the deepest ocean trenches, on the most remote 
shorelines, in all levels of the marine food web, in the water we drink, 
the food we eat, and even the air we breathe. adidas, in partnership with 
Parley for the Oceans, is trying to do something about it.
Now in its third year, the annual Run For The Oceans, an initiative 
established and driven by adidas in collaboration with Parley, has helped 
the development of the Parley Ocean School Program. In 2018 alone, the run 
united nearly a million runners around the world and raised $1 million, 
with adidas contributing one dollar for every kilometer run directly to the 
program. This year the run will be bigger and, given the urgency of the marine
plastic pollution crisis, more importa

In [19]:
doc6 = '''
Ian Lavery, the Labour party chairman, has hit out at a section of pro-remain 
campaigners for sneering at “ordinary people” with pro-Brexit views and sniping 
at those who want to see the results of the 2016 poll respected.
As Jeremy Corbyn faces intense pressure to back a “people’s vote” in the wake 
of the European elections, Lavery argued in an article for the Guardian that 
Labour would not win a general election “simply by fighting for the biggest share 
of the 48%”. He said both sides needed to come together to fight the prospect of 
a no-deal Brexit being pushed by some of the Conservative leadership candidates 
who are competing to be the next prime minister.'''

find_keywords(doc6)


=====Doc=====

Ian Lavery, the Labour party chairman, has hit out at a section of pro-remain 
campaigners for sneering at “ordinary people” with pro-Brexit views and sniping 
at those who want to see the results of the 2016 poll respected.
As Jeremy Corbyn faces intense pressure to back a “people’s vote” in the wake 
of the European elections, Lavery argued in an article for the Guardian that 
Labour would not win a general election “simply by fighting for the biggest share 
of the 48%”. He said both sides needed to come together to fight the prospect of 
a no-deal Brexit being pushed by some of the Conservative leadership candidates 
who are competing to be the next prime minister.

===Keywords===
who 0.291
people 0.289
wake 0.232
pressure 0.229
ordinary 0.224
biggest 0.224
vote 0.221
poll 0.217
prime 0.209
sides 0.205


In [20]:
doc7 = '''
A recent study examines the health impact of consuming alcohol at different ages. 
The authors conclude that, for people over the age of 50, the health risks may 
be less severe. Heavy drinking is linked to a range of serious health consequences.
These include certain cancers, liver and heart disease, and damage to the nervous 
system, including the brain. However, as has been exhaustively covered in the 
popular press, drinking in moderation might have certain health benefits.
A number of studies have concluded that drinking alcohol at a low level could 
have a protective effect.
'''

find_keywords(doc7)


=====Doc=====

A recent study examines the health impact of consuming alcohol at different ages. 
The authors conclude that, for people over the age of 50, the health risks may 
be less severe. Heavy drinking is linked to a range of serious health consequences.
These include certain cancers, liver and heart disease, and damage to the nervous 
system, including the brain. However, as has been exhaustively covered in the 
popular press, drinking in moderation might have certain health benefits.
A number of studies have concluded that drinking alcohol at a low level could 
have a protective effect.


===Keywords===
health 0.666
risks 0.195
ages 0.18
damage 0.178
brain 0.176
study 0.172
covered 0.172
benefits 0.172
severe 0.169
authors 0.169


In [24]:
# Polish articles

In [21]:
doc8 = '''
Gdyby szukać analogii w świecie przyrody, to hakerów bez problemu porównać 
można by było do… mrówek. Potrafią przecisnąć się nawet przez najmniejszą 
możliwą szczelinę. Gdy tylko dostępna jest okazja, z pewnością ją wykorzystają. 
Takimi szczelinami, w naszych komputerach są właśnie luki w oprogramowaniach, 
które otwierają drogę do danych i informacji przechowywanych na dyskach.
Czym właściwie jest luka? To swego rodzaju „furtka” dla hakerów, umożliwiająca 
zainfekowanie komputera złośliwym oprogramowaniem. Mówi się o niej, jako o błędzie 
w kodzie w zabezpieczenia aplikacji. Chociaż definicji jest wiele, wszystkie 
spójnie wyjaśniają, że luka umożliwia hakerowi wykonywanie czynności na naszym 
komputerze, uzyskuje on zatem dostęp do danych, podszywając się pod inną jednostkę/program.
'''

find_keywords(doc8)


=====Doc=====

Gdyby szukać analogii w świecie przyrody, to hakerów bez problemu porównać 
można by było do… mrówek. Potrafią przecisnąć się nawet przez najmniejszą 
możliwą szczelinę. Gdy tylko dostępna jest okazja, z pewnością ją wykorzystają. 
Takimi szczelinami, w naszych komputerach są właśnie luki w oprogramowaniach, 
które otwierają drogę do danych i informacji przechowywanych na dyskach.
Czym właściwie jest luka? To swego rodzaju „furtka” dla hakerów, umożliwiająca 
zainfekowanie komputera złośliwym oprogramowaniem. Mówi się o niej, jako o błędzie 
w kodzie w zabezpieczenia aplikacji. Chociaż definicji jest wiele, wszystkie 
spójnie wyjaśniają, że luka umożliwia hakerowi wykonywanie czynności na naszym 
komputerze, uzyskuje on zatem dostęp do danych, podszywając się pod inną jednostkę/program.


===Keywords===
na 0.831
pod 0.495
program 0.253


In [22]:
doc9 = '''
Czym jest olej kokosowy?
Olej kokosowy otrzymuje się przez rozgrzanie i tłoczenie miąższu kokosa. 
Już od wieków stosuje się go nie tylko w kuchni, ale i w codziennej pielęgnacji. 
Najwięcej używają go Azjatki, przede wszystkim popularny jest w Indiach i 
na Półwyspie Indyjskim. Stosowany na skórę pomoże w walce z rozstępami oraz 
cellulitem, można go też stosować do nawilżania skórek przy paznokciach. 
Sprawdzi się równie świetnie jako baza do maseczek i peelingów do twarzy i ust. 
Olej kokosowy jest w stanie stałym, wystarczy jednak lekko go podgrzać w dłoni, 
aby stał się płynny i przez to się nakładał.
'''

find_keywords(doc9)


=====Doc=====

Czym jest olej kokosowy?
Olej kokosowy otrzymuje się przez rozgrzanie i tłoczenie miąższu kokosa. 
Już od wieków stosuje się go nie tylko w kuchni, ale i w codziennej pielęgnacji. 
Najwięcej używają go Azjatki, przede wszystkim popularny jest w Indiach i 
na Półwyspie Indyjskim. Stosowany na skórę pomoże w walce z rozstępami oraz 
cellulitem, można go też stosować do nawilżania skórek przy paznokciach. 
Sprawdzi się równie świetnie jako baza do maseczek i peelingów do twarzy i ust. 
Olej kokosowy jest w stanie stałym, wystarczy jednak lekko go podgrzać w dłoni, 
aby stał się płynny i przez to się nakładał.


===Keywords===
na 0.844
od 0.536


For articles in Polish our algorithm unfortunately fails; this can be addressed by preparing a training set of Polish articles.
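One likely reason for the short keyword lists above is that `CountVectorizer.transform` ignores every token that is not in the fitted vocabulary, so only Polish words that happen to coincide with tokens seen in the English training data can be scored. A minimal sketch of this effect follows; the English toy corpus and the Polish sentence are made up, and `toy_cv` is not the notebook's vectorizer.

```python
from sklearn.feature_extraction.text import CountVectorizer

# hypothetical English "training set"
english_docs = [
    "this program reads a file",
    "the program sorts a list",
    "a pod runs the container",
]
toy_cv = CountVectorizer()
toy_cv.fit(english_docs)

# hypothetical Polish input: only "pod" and "program" exist in the vocabulary
polish_doc = "luka umożliwia hakerowi dostęp do danych pod inną jednostkę program"
row = toy_cv.transform([polish_doc])

# collect the vocabulary terms that received a non-zero count
known = sorted(t for t, idx in toy_cv.vocabulary_.items() if row[0, idx] > 0)
print(known)
```

Fitting the vectorizer and the transformer on a Polish corpus, together with a Polish stop-word list, would give the Polish vocabulary its own columns and IDF values.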