# [5 points] Part 1. Data cleaning

The task is to clear the text data of the crawled web-pages from different sites.

It is necessary to ensure that the distribution of the 100 most frequent words includes only meaningful words in english language (not particles, conjunctions, prepositions, numbers, tags, symbols).

Determine the order of operations below and carry out the appropriate cleaning.

1. Remove html-tags (try to do it with regular expression, or play with beautifulsoap library)
2. Apply tokenization
3. Lowercase translation
4. Apply lemmatization / stemming
5. Remove non-english words
6. Remove stop-words

> Additional processing - At your own initiative, if this helps to obtain a better distribution

#### Hints

1. To do text processing you may use nltk and re libraries
1. and / or any other libraries on your choise

In [1]:
import re
import string

import nltk
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
from datasketch import MinHash, MinHashLSH
from nltk import ngrams
from nltk.corpus import stopwords, words
from nltk.stem import WordNetLemmatizer
from tqdm import tqdm

pd.options.plotting.backend = "plotly"

nltk.download("words")
nltk.download("stopwords")
nltk.download("omw-1.4")

[nltk_data] Downloading package words to
[nltk_data]     C:\Users\Sultan\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Sultan\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\Sultan\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

#### Data reading

The dataset for this part can be downloaded here: `https://drive.google.com/file/d/1wLwo83J-ikCCZY2RAoYx8NghaSaQ-lBA/view?usp=sharing`

In [2]:
web_sites_data_df = pd.read_csv("../data/web_sites_data.csv")
web_sites_data_df["text_num"] = web_sites_data_df.index + 1

print(web_sites_data_df.shape)
web_sites_data_df.head()

(71699, 2)


Unnamed: 0,text,text_num
0,"<html>\n<head profile=""http://www.w3.org/2005/...",1
1,"<html>\n<head profile=""http://www.w3.org/2005/...",2
2,"<html>\n<head profile=""http://www.w3.org/2005/...",3
3,"<html>\n<head profile=""http://www.w3.org/2005/...",4
4,"<html>\n<head profile=""http://www.w3.org/2005/...",5


#### Data processing

In [3]:
punkt_table = str.maketrans("", "", string.punctuation)
lemmatizer = WordNetLemmatizer()
english_words = set(words.words())
stop_words = set(stopwords.words("english"))


def process_text(text: str) -> list:
    raw_tokens = (
        BeautifulSoup(text, "html.parser").get_text(strip=True, separator=" ").split()
    )
    clean_tokens = [
        lemmatizer.lemmatize(w.translate(punkt_table).lower()) for w in raw_tokens
    ]
    tokens = [
        word
        for word in clean_tokens
        if word in english_words and not word in stop_words and len(word) > 1
    ]
    # tokens = [word for word in tokens if len(word) > 3]

    return tokens

In [4]:
# processing
# web_words_df = web_sites_data_df.copy()
# web_words_df["tokens"] = web_words_df["text"].apply(process_text)
# web_words_df = (
#     web_words_df.drop(columns=["text"])
#     .explode("tokens")
#     .rename(columns={"tokens": "word"})
#     .reset_index(drop=True)
# )
# web_words_df.to_pickle("../data/web_sites.pkl")

# reading result of processing
web_words_df = pd.read_pickle("../data/web_sites.pkl")

print(web_words_df.shape)
web_words_df.head(5)

(23582004, 2)


Unnamed: 0,text_num,word
0,1,eric
1,1,love
2,1,war
3,1,eric
4,1,love


In [5]:
web_text_df = (
    web_words_df.groupby("text_num")["word"]
    .agg(text=lambda words: " ".join(words))
    .reset_index()
)

print(web_text_df.shape)
web_text_df.head(5)

(71699, 2)


Unnamed: 0,text_num,text
0,1,eric love war eric love war author eric title ...
1,2,eric short walk eric short walk author eric ti...
2,3,poetry unabridged poetry unabridged author tit...
3,4,uncle cabin uncle cabin author title uncle cab...
4,5,consider lily consider lily author title consi...


#### Vizualization

As a visualisation, it is necessary to construct a frequency distribution of words (the 100 most common words), sorted by frequency.

For visualization purposes we advice you to use plotly, but you are free to choose other libraries

In [6]:
web_words_df["word"].value_counts().head(100).plot(kind="bar")

#### Provide examples of processed text (some parts)

Is everything all right with the result of cleaning these examples? What kind of information was lost?

> After clearing the texts, we lose a number of features that could be useful when building a model. Thus, it is necessary to extract features both before and after cleaning

In [7]:
rand_text_num = np.random.randint(
    web_text_df["text_num"].min(), web_text_df["text_num"].max() + 1
)
print(f"rand_text_num: {rand_text_num}")
print(
    f"random text:\n{web_text_df.loc[web_text_df['text_num'] == rand_text_num, 'text'].values[0]}"
)

rand_text_num: 5665
random text:
stock quote corp stock quote quote stock price bulletin investor alert sign become member today front page news viewer commentary market personal finance community stock mutual fund option bond commodity currency getting adviser premium newsletter interactive research tool corp go set alert add portfolio trade overview profile news chart pay historical quote analyst estimate option sec filing insider action corp hour change volume volume quote min previous close change day low day high week low week high market cap average volume ratio na na dividend div yield ex dividend date compare index global dow news economy muffle roar air show global military spending hit record global military spending hit trillion cut neutral wa rated outperform oversight weapon purchase quarterly dividend quarterly dividend defense stock may limited upside analyst post quarterly profit gain profit update advisory surprise earnings per share see cent stock focus martin profit 

# [10 points] Part 2. Duplicates detection. LSH

#### Libraries you can use

1. LSH - https://github.com/ekzhu/datasketch
1. LSH - https://github.com/mattilyra/LSH
1. Any other library on your choise

1. Detect duplicated text (duplicates do not imply a complete word-to-word match, but texts that may contain a paraphrase, rearrangement of words, sentences)
2. Make a plot dependency of duplicates on shingle size (with fixed minhash length) 
3. Make a plot dependency of duplicates on minhash length (with fixed shingle size)

In [8]:
def get_duplicate_pairs(sentences, threshold, shingle_size, minhash_length):
    lsh = MinHashLSH(threshold, minhash_length)
    
    min_hashes = list()
    for text in sentences:
        min_hash = MinHash(minhash_length)
        shrinks = ngrams(set(text.split()), shingle_size)
        for ngram in shrinks:
            min_hash.update(" ".join(ngram).encode("utf-8"))
        min_hashes.append(min_hash)
    
    for i, hash in enumerate(min_hashes):
        lsh.insert(i, hash)
    
    candidates = set()
    duplicate_pairs_dict = dict()

    for i, hash in enumerate(min_hashes):
        duplicate = lsh.query(hash)
        if duplicate[0] != i:
            duplicate_pairs_dict[i] = duplicate
            candidates.update(duplicate)
    
    return duplicate_pairs_dict

In [9]:
num_sentences = 5000
sentences = web_text_df["text"].sample(num_sentences).to_list()
threshold = 0.9

pairs = get_duplicate_pairs(sentences, threshold, 9, 2048)
print(f"Number of duplicate pairs: {len(pairs.keys())}")

Number of duplicate pairs: 486


In [10]:
# processing
# data = list()

# for shingle_size in tqdm([1, 2, 5, 9, 15]):
#     for minhash_length in [128, 256, 512, 1024, 2048]:
#         _pairs = get_duplicate_pairs(sentences, threshold, shingle_size, minhash_length)
#         data.append([shingle_size, minhash_length, len(_pairs.keys())])

# duplicates_df = pd.DataFrame(
#     data, columns=["shingle_size", "minhash_length", "duplicates_number"]
# )
# duplicates_df.to_pickle("../data/duplicates_df.pkl")

# reading result of processing
duplicates_df = pd.read_pickle("../data/duplicates.pkl")

duplicates_df

Unnamed: 0,shingle_size,minhash_length,duplicates_number
0,1,128,1593
1,1,256,1665
2,1,512,1637
3,1,1024,1594
4,1,2048,1513
5,2,128,767
6,2,256,785
7,2,512,807
8,2,1024,789
9,2,2048,784


In [11]:
duplicates_df[duplicates_df["minhash_length"] == 2048].plot.bar(
    x="shingle_size",
    y="duplicates_number",
    title="Dependency of duplicates on shingle size (minhash_length = 2048)",
)

In [12]:
duplicates_df[duplicates_df["shingle_size"] == 9].plot.bar(
    x="minhash_length",
    y="duplicates_number",
    title="Dependency of duplicates on minhash length (shingle_size = 9)",
)

# [Optional 10 points] Part 3. Topic model

In this part you will learn how to do topic modeling with common tools and assess the resulting quality of the models. 

The provided data contain chunked stories by Edgar Allan Poe (EAP), Mary Shelley (MWS), and HP Lovecraft (HPL).

The dataset can be downloaded here: `https://drive.google.com/file/d/14tAjAzHr6UmFVFV7ABTyNHBh-dWHAaLH/view?usp=sharing`

#### Preprocess dataset with the functions from the Part 1

#### Quality estimation

Implement the following three quality fuctions: `coherence` (or `tf-idf coherence`), `normalized PMI`, `based on the distributed word representation`(you can use pretrained w2v vectors or some other model). You are free to use any libraries (for instance gensim) and components.

### Topic modeling

Read and preprocess the dataset, divide it into train and test parts `sklearn.model_selection.train_test_split`. Test part will be used in classification part. For simplicity we do not perform cross-validation here, but you should remember about it.

Plot the histogram of resulting tokens counts in the processed datasets.

Plot the histogram of resulting tokens counts in the processed datasets.

#### NMF

Implement topic modeling with NMF (you can use `sklearn.decomposition.NMF`) and print out resulting topics. Try to change hyperparameters to better fit the dataset.

#### LDA

Implement topic modeling with LDA (you can use gensim implementation) and print out resulting topics. Try to change hyperparameters to better fit the dataset.

### Additive regularization of topic models 

Implement topic modeling with ARTM. You may use bigartm library (simple installation for linux: pip install bigartm) or TopicNet framework (`https://github.com/machine-intelligence-laboratory/TopicNet`)

Create artm topic model fit it to the data. Try to change hyperparameters (number of specific and background topics) to better fit the dataset. Play with smoothing and sparsing coefficients (use grid), try to add decorrelator. Print out resulting topics.

Write a function to convert new documents to topics probabilities vectors.

Calculate the quality scores for each model. Make a barplot to compare the quality.