# Laboratorium 4 - rekomendacje dla portali informacyjnych

## Przygotowanie

 * pobierz i wypakuj dataset: https://mind201910small.blob.core.windows.net/release/MINDsmall_train.zip
   * więcej możesz poczytać tutaj: https://learn.microsoft.com/en-us/azure/open-datasets/dataset-microsoft-news
 * [opcjonalnie] Utwórz wirtualne środowisko
 `python3 -m venv ./recsyslab4`
 * zainstaluj potrzebne biblioteki:
 `pip install nltk sklearn`

## Część 1. - przygotowanie danych

In [1]:
# importujemy wszystkie potrzebne pakiety

import codecs
from collections import defaultdict # mozesz uzyc zamiast zwyklego slownika, rozwaz wplyw na czas obliczen
import math
import re
from string import punctuation

import nltk
nltk.download('stopwords')

from nltk.corpus import stopwords
from nltk.stem import RSLPStemmer
from nltk.stem import WordNetLemmatizer
from nltk.stem.lancaster import LancasterStemmer
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer

# mozesz uzyc do obliczania najbardziej podobnych tekstow zamiast liczenia "na piechote"
# ale pamietaj o dostosowaniu formatu danych
from sklearn.neighbors import NearestNeighbors

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\WLGS\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
# definiujemy potrzebne zmienne

PATH = './MINDsmall_train'
STOPWORDS = set(stopwords.words('english'))

In [3]:
# wczytujemy metadane artykułów

def parse_news_entry(entry):
    news_id, category, subcategory, title, abstract = entry.split('\t')[:5]
    return {
        'news_id': news_id,
        'category': category,
        'subcategory': subcategory,
        'title': title,
        'abstract': abstract
    }

def get_news_metadata():
    with codecs.open(f'{PATH}/news.tsv', 'r', 'UTF-8') as f:
        raw = [x for x in f.read().split('\n') if x]
        parsed_entries = [parse_news_entry(entry) for entry in raw]
        return {x['news_id']: x for x in parsed_entries}

news = get_news_metadata()
news_ids = sorted(list(news.keys()))
news_indices = {x[1]: x[0] for x in enumerate(news_ids)}
print(len(news))

51282


## Część 2. - TF-IDF

In [6]:
# normalizujemy teksty na potrzeby dalszego przetwarzania

def preprocess_text(text: str):
    # usuwamy znaki interpunkcyjne
    preprocessed = re.sub(f'[{punctuation}]', '', text)
    # usuwamy wszystkie liczby
    preprocessed = re.sub(r'\d+', '', preprocessed)
    # podmieniamy wszystkie wielkie litery
    preprocessed = preprocessed.lower()
    # dzielimy na tokeny
    preprocessed = preprocessed.split()
    # usuwamy stopwords
    preprocessed = [x for x in preprocessed if x not in STOPWORDS]
    return preprocessed

def stem_texts(corpus):
    stemmer = SnowballStemmer('english')
    return [[stemmer.stem(word) for word in preprocess_text(text)] for text in corpus]

texts = [news[news_id]['abstract'] for news_id in news_ids]
stemmed_texts = stem_texts(texts)

In [7]:
# porownajmy teksty przed i po przetworzeniu

print(texts[2] + '\n')
print(' '.join(stemmed_texts[2]))

"I think we have a really good team, and a team that can really do some special, good things because that group is very close in there." - Brian Schmetzer

think realli good team team realli special good thing group close brian schmetzer


In [8]:
# tworzymy liste wszystkich slow w korpusie

def get_all_words_sorted(corpus):
    # generujemy posortowana alfabetycznie liste wszystkich slow (tokenow)
    return sorted(list(set([word for text in corpus for word in text])))

wordlist = get_all_words_sorted(stemmed_texts)
word_indices = {x[1]: x[0] for x in enumerate(wordlist)}
print(len(wordlist))

41852


In [9]:
# obliczamy liczbe tekstow, w ktorych wystapilo kazde ze slow
# pamietaj, ze jesli slowo wystapilo w danym tekscie wielokrotnie, to liczymy je tylko raz

def get_document_frequencies(corpus, wordlist):
    # return {word -> count}
    result = {}
    for word in wordlist:
        count = 0
        for text in corpus:
            if word in text:
                count += 1
                continue
        result[word] = count
    return result

document_frequency = get_document_frequencies(stemmed_texts, wordlist)

In [10]:
# obliczamy liczbe wystapien kazdego slowa w kazdym tekscie

def get_term_frequencies(corpus, news_indices):
    # return {news_id -> {word -> count}}
    return {news_id: {word: text.count(word) for word in text} for news_id, text in zip(news_indices, corpus)}

term_frequency = get_term_frequencies(stemmed_texts, news_indices)

In [53]:
# sprawdzmy wyniki

term_frequency['N10062']

{}

In [12]:
# obliczamy metryke tf_idf

def calculate_tf_idf(term_frequency, document_frequency, corpus_size):
    # return {news_id -> {word -> tf_idf}}
    return {news_id: {word: tf * math.log(corpus_size / df) for word, tf in tf_dict.items()} for news_id, tf_dict in term_frequency.items() for word, df in document_frequency.items() if word in tf_dict}

tf_idf = calculate_tf_idf(term_frequency, document_frequency, len(news_ids))

In [51]:
# sprawdzmy wyniki

tf_idf[news_ids[42337]]

dict_keys(['N10', 'N100', 'N1000', 'N10000', 'N10001', 'N10002', 'N10003', 'N10004', 'N10005', 'N10007', 'N10009', 'N1001', 'N10010', 'N10011', 'N10013', 'N10014', 'N10015', 'N10016', 'N1002', 'N10021', 'N10022', 'N10023', 'N10024', 'N10025', 'N10026', 'N10027', 'N10029', 'N1003', 'N10031', 'N10032', 'N10033', 'N10034', 'N10035', 'N10037', 'N10038', 'N10039', 'N1004', 'N10040', 'N10041', 'N10042', 'N10044', 'N10046', 'N10047', 'N10048', 'N10049', 'N10051', 'N10052', 'N10053', 'N10055', 'N10056', 'N10057', 'N10058', 'N10059', 'N1006', 'N10060', 'N10061', 'N10063', 'N10064', 'N10065', 'N10066', 'N10067', 'N10068', 'N1007', 'N10070', 'N10072', 'N10073', 'N10074', 'N10075', 'N10076', 'N10077', 'N10078', 'N10079', 'N1008', 'N10080', 'N10081', 'N10083', 'N10084', 'N10087', 'N10088', 'N10089', 'N1009', 'N10090', 'N10091', 'N10092', 'N10093', 'N10094', 'N10095', 'N10097', 'N10099', 'N101', 'N1010', 'N10100', 'N10101', 'N10102', 'N10103', 'N10107', 'N10108', 'N10109', 'N10111', 'N10112', 'N1011

## Część 3. - Podobieństwo tekstów

In [71]:
# obliczmy odleglosc miedzy dwoma artykulami
# przetestuj rozne metryki odleglosci i wybierz najlepsza

def calculate_distance(tf_idf, id1, id2, metric='euclidean'):
    if id2 not in tf_idf or id1 not in tf_idf:
        return 0
    if metric == 'euclidean':
        return math.sqrt(sum([(tf_idf[id1][word] - tf_idf[id2][word]) ** 2 for word in tf_idf[id1] if word in tf_idf[id2]]))
    elif metric == 'cosine':
        return sum([tf_idf[id1][word] * tf_idf[id2][word] for word in tf_idf[id1] if word in tf_idf[id2]]) / (math.sqrt(sum([tf_idf[id1][word] ** 2 for word in tf_idf[id1]])) * math.sqrt(sum([tf_idf[id2][word] ** 2 for word in tf_idf[id2]])))
calculate_distance(tf_idf, news_ids[42337], 'N10', 'cosine')
tf_idf[news_ids[42337]]

{'man': 2.5527969853308514,
 'claim': 2.5527969853308514,
 'creat': 2.5527969853308514,
 'car': 22.975172867977662,
 'might': 2.5527969853308514,
 'solv': 2.5527969853308514,
 'world': 2.5527969853308514,
 'traffic': 12.763984926654256,
 'congest': 5.105593970661703,
 'problem': 2.5527969853308514,
 'rick': 25.527969853308512,
 'woodburi': 2.5527969853308514,
 'spokan': 2.5527969853308514,
 'washington': 2.5527969853308514,
 'usa': 2.5527969853308514,
 'presid': 2.5527969853308514,
 'founder': 2.5527969853308514,
 'sole': 2.5527969853308514,
 'employe': 2.5527969853308514,
 'commut': 5.105593970661703,
 'carmak': 2.5527969853308514,
 'flagship': 2.5527969853308514,
 'model': 2.5527969853308514,
 'super': 2.5527969853308514,
 'slim': 2.5527969853308514,
 'twoseat': 2.5527969853308514,
 'tango': 5.105593970661703,
 'highperform': 2.5527969853308514,
 'electr': 2.5527969853308514,
 'preced': 2.5527969853308514,
 'tesla': 2.5527969853308514,
 'told': 2.5527969853308514,
 'btv': 2.552796985

In [76]:
# wyznaczmy k najpodobniejszych tekstow do danego
# pamietaj o odpowiedniej kolejnosci sortowania w zaleznosci od wykorzystanej metryki
# pamietaj, zeby wsrod podobnych tekstow nie bylo danego

def get_k_most_similar_news(tf_idf, n_id, k):
    distances = [(calculate_distance(tf_idf, n_id, n_id2, 'cosine'), n_id2) for n_id2 in news_ids if n_id != n_id2]
    distances.sort(key=lambda x: x[0], reverse=True)
    return [x[1] for x in distances[:k]]

def print_k_most_similar_news(tf_idf, n_id, k, corpus, news_indices):
    similar = get_k_most_similar_news(tf_idf, n_id, k)
    print(f'id: {n_id}, text: {corpus[news_indices[n_id]]}')
    print(f'\n{k} most similar:')
    for s_id in similar:
        print(f'\nid: {s_id}, text: {corpus[news_indices[s_id]]}, distance: {calculate_distance(tf_idf, n_id, s_id, "cosine")}')

print_k_most_similar_news(tf_idf, 'N5717', 5, texts, news_indices)

id: N5717, text: Your credit score, a reliable income and how much outstanding debt you owe are critical factors in determining the best mortgage interest rate that you'll be offered. Yet none of it has anything to do with the range of interest rates available. It's a bit more complicated than that. Although the Federal Reserve rate, bond markets, inflation and the demand for homes all play a big part, the stock market also plays a less direct role in...

5 most similar:

id: N57220, text: The Federal Reserve cut short-term interest rates by a quarter point. It is the third time this year, the Fed has cut rates. WSJ's AnnaMaria Andriotis reports, despite the cut, your credit card rates could go up. Photo Illustration: Adele Morgan, distance: 0.28960008132444265

id: N53907, text: The IRS released the federal tax rates and income brackets for 2020. The seven tax rates remain unchanged, while the income limits have been adjusted for inflation., distance: 0.2823912473624527

id: N46487, t