<a href="https://colab.research.google.com/github/tcmb1987/EE274-lossy-text-compression/blob/main/articles_only.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install sent2vec
!pip install zstd

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting sent2vec
  Using cached sent2vec-0.3.0-py3-none-any.whl (8.1 kB)
Collecting transformers
  Using cached transformers-4.25.1-py3-none-any.whl (5.8 MB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Collecting huggingface-hub<1.0,>=0.10.0
  Using cached huggingface_hub-0.11.1-py3-none-any.whl (182 kB)
Installing collected packages: tokenizers, huggingface-hub, transformers, sent2vec
Successfully installed huggingface-hub-0.11.1 sent2vec-0.3.0 tokenizers-0.13.2 transformers-4.25.1
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting zstd
  Downloading zstd-1.5.2.6-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.7 MB)

In [None]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
import string
from urllib import request
from bs4 import BeautifulSoup
from sent2vec.vectorizer import Vectorizer
import numpy as np
from scipy.spatial import distance
import matplotlib.pyplot as plt
import sys
import gzip
import zstd
nltk.download('stopwords')
nltk.download('words')
nltk.download('punkt')

## Excerpt 2 - From Herman Melville's *Moby Dick*

In [None]:
url = "https://www.gutenberg.org/files/2701/2701-0.txt"
sentences = []

response = request.urlopen(url)
soup = BeautifulSoup(response, features="html.parser")
text = soup.get_text()
sentences += sent_tokenize(text)

# Keep only sentences that contain at least one letter or whitespace character
sentences = [s for s in sentences if re.search(r"[a-zA-Z\s]+", s) is not None]
# Normalize line breaks and drop colons
sentences = [sentence.replace("\r", "") for sentence in sentences]
sentences = [sentence.replace("\n", " ") for sentence in sentences]
sentences = [sentence.replace(":", "") for sentence in sentences]
# sentences = [re.sub(r'[^\w\s]', '', sentence) for sentence in sentences]
# Remove digits
sentences = ["".join(i for i in s if not i.isdigit()) for s in sentences]
# Skip the first four sentences
sentences = sentences[4:]
# Drop sentences longer than 512 characters
sentences = [s for s in sentences if len(s) <= 512]

stop_words = ['a', 'an', 'the']

def prune_sentence(sentence, stop_words):
    # Drop stop words and punctuation tokens, then rejoin the rest with spaces
    word_tokens = word_tokenize(sentence)
    return ' '.join([w for w in word_tokens if w.lower() not in stop_words and w.lower() not in string.punctuation])

pruned_sentences = [prune_sentence(sentence, stop_words) for sentence in sentences]
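
To see what the pruning step does to a single sentence without downloading any NLTK data, here is a minimal sketch. It swaps `word_tokenize` for a simple regex tokenizer, which is an assumption made for illustration only; the name `prune_sentence_simple` is hypothetical, and its output may differ from the NLTK-based `prune_sentence` on contractions and unusual punctuation.

```python
import re
import string

STOP_WORDS = {"a", "an", "the"}
PUNCTUATION = set(string.punctuation)


def prune_sentence_simple(sentence, stop_words=STOP_WORDS):
    """Drop articles and punctuation tokens (regex tokenizer, not NLTK)."""
    # A token is either a word (optionally joined by hyphens or apostrophes)
    # or a single non-word, non-space character.
    tokens = re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", sentence)
    return " ".join(
        t for t in tokens
        if t.lower() not in stop_words and t not in PUNCTUATION
    )


print(prune_sentence_simple("What despair in those immovable inscriptions!"))
print(prune_sentence_simple("The whale is a mammal."))
```

Only the three articles are removed, so the pruned text stays readable while still shedding a few characters per sentence.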

Compress entire text

In [None]:
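original_text = ' '.join(sentences)
pruned_text = ' '.join(pruned_sentences)
# Note: sys.getsizeof reports the in-memory size of the Python object
# (including interpreter overhead), not the number of bytes in an encoded file.
original_size = (sys.getsizeof(original_text) / 1024)
pruned_size = (sys.getsizeof(pruned_text) / 1024)
original = original_size  # keep the uncompressed size for the final ratios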

print("Uncompressed size of original text is %.2f KB." % original_size)
print("Uncompressed size of pruned text is %.2f KB." % pruned_size)
print("Pruned text is %.1f%% size of original text.\n" % (pruned_size/original_size*100))

original_compressed = gzip.compress(bytes(original_text, 'utf-8'))
pruned_compressed = gzip.compress(bytes(pruned_text, 'utf-8'))
original_size = (sys.getsizeof(original_compressed) / 1024)
pruned_size = (sys.getsizeof(pruned_compressed) / 1024)

print("Compressed size of original text using gzip is %.2f KB." % original_size)
print("Compressed size of pruned text using gzip is %.2f KB." % pruned_size)
print("Pruned text compresses to %.1f%% size of compressed original text.\n" % (pruned_size/original_size*100))

original_compressed = zstd.compress(bytes(original_text, 'utf-8'), 22)
pruned_compressed = zstd.compress(bytes(pruned_text, 'utf-8'), 22)
original_size = (sys.getsizeof(original_compressed) / 1024)
pruned_size = (sys.getsizeof(pruned_compressed) / 1024)

print("Compressed size of original text using zstd level 22 is %.2f KB." % original_size)
print("Compressed size of pruned text using zstd level 22 is %.2f KB." % pruned_size)
print("Pruned text compresses to %.1f%% size of compressed original text.\n" % (pruned_size/original_size*100))

print("Compressed original text is %.1f%% size of original text using zstd." % (original_size/original*100))
print("Compressed pruned text is %.1f%% size of original text using zstd." % (pruned_size/original*100))

Uncompressed size of original text is 2241.26 KB.
Uncompressed size of pruned text is 2064.99 KB.
Pruned text is 92.1% size of original text.

Compressed size of original text using gzip is 444.39 KB.
Compressed size of pruned text using gzip is 409.58 KB.
Pruned text compresses to 92.2% size of compressed original text.

Compressed size of original text using zstd level 22 is 365.48 KB.
Compressed size of pruned text using zstd level 22 is 337.83 KB.
Pruned text compresses to 92.4% size of compressed original text.

Compressed original text is 16.3% size of original text using zstd.
Compressed pruned text is 15.1% size of original text using zstd.
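
The sizes above come from `sys.getsizeof`, which measures the Python object in memory. As a hedged alternative, the sketch below measures the length of the UTF-8 encoding and of the gzip output directly. The helper names and the sample strings are hypothetical and are not taken from the notebook's corpus, so the printed numbers are not comparable to the results above.

```python
import gzip
import sys


def encoded_size_kb(text):
    """Size of the UTF-8 encoding of `text`, in KB."""
    return len(text.encode("utf-8")) / 1024


def gzip_size_kb(text):
    """Size of the gzip-compressed UTF-8 encoding of `text`, in KB."""
    return len(gzip.compress(text.encode("utf-8"))) / 1024


# Hypothetical sample text; the \u2014 dash mimics the punctuation in the novel.
sample_original = "Call me the whale. " * 1000 + "say\u2014here"
sample_pruned = sample_original.replace("the ", "")

print("getsizeof:   %.2f KB" % (sys.getsizeof(sample_original) / 1024))
print("UTF-8 bytes: %.2f KB" % encoded_size_kb(sample_original))
print("gzip bytes:  %.2f KB" % gzip_size_kb(sample_original))
print("pruned/original (UTF-8): %.1f%%"
      % (encoded_size_kb(sample_pruned) / encoded_size_kb(sample_original) * 100))
```

In CPython, a string containing a character outside Latin-1 is stored with at least two bytes per character, so `sys.getsizeof` can report roughly twice the UTF-8 byte count for mostly-ASCII text.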


In [None]:
sentences_2 = sentences[990:995]
sentences_2

['Whether any of the relatives of the seamen whose names appeared there were now among the congregation, I knew not; but so many are the unrecorded accidents in the fishery, and so plainly did several women present wear the countenance if not the trappings of some unceasing grief, that I feel sure that here before me were assembled those, in whose unhealing hearts the sight of those bleak tablets sympathetically caused the old wounds to bleed afresh.',
 'Oh!',
 'ye whose dead lie buried beneath the green grass; who standing among flowers can say—here, _here_ lies my beloved; ye know not the desolation that broods in bosoms like these.',
 'What bitter blanks in those black-bordered marbles which cover no ashes!',
 'What despair in those immovable inscriptions!']

In [None]:
original_vectorizer = Vectorizer()
original_vectorizer.run(sentences_2)
original_sentence_vectors = original_vectorizer.vectors

Initializing Bert distilbert-base-uncased
Vectorization done on cpu


Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_layer_norm.bias', 'vocab_layer_norm.weight', 'vocab_projector.weight', 'vocab_projector.bias', 'vocab_transform.weight', 'vocab_transform.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
shorts_2 = pruned_sentences[990:995]
shorts_2

['Whether any of relatives of seamen whose names appeared there were now among congregation I knew not but so many are unrecorded accidents in fishery and so plainly did several women present wear countenance if not trappings of some unceasing grief that I feel sure that here before me were assembled those in whose unhealing hearts sight of those bleak tablets sympathetically caused old wounds to bleed afresh',
 'Oh',
 'ye whose dead lie buried beneath green grass who standing among flowers can say—here _here_ lies my beloved ye know not desolation that broods in bosoms like these',
 'What bitter blanks in those black-bordered marbles which cover no ashes',
 'What despair in those immovable inscriptions']

In [None]:
short_vectorizer = Vectorizer()
short_vectorizer.run(shorts_2)
short_sentence_vectors = short_vectorizer.vectors

Initializing Bert distilbert-base-uncased
Vectorization done on cpu



In [None]:
reconstructed_2 = [
    'Whether any of the relatives of the seamen whose names appeared there were now among the congregation; I knew not, but so many are the unrecorded accidents in fishery, and so plainly did several women present wear countenance if not trappings of some unceasing grief that I feel sure that here before me were assembled those in whose unhealing hearts sight of those bleak tablets sympathetically caused old wounds to bleed afresh.',
    'Oh!',
    'ye whose dead lie buried beneath the green grass who standing among flowers can say—here _here_ lies my beloved ye know not desolation that broods in bosoms like these',
    'What bitter blanks in those black-bordered marbles which cover no ashes!',
    'What despair in those immovable inscriptions!',
]
reconstructed_2

['Whether any of the relatives of the seamen whose names appeared there were now among the congregation; I knew not, but so many are the unrecorded accidents in fishery, and so plainly did several women present wear countenance if not trappings of some unceasing grief that I feel sure that here before me were assembled those in whose unhealing hearts sight of those bleak tablets sympathetically caused old wounds to bleed afresh.',
 'Oh!',
 'ye whose dead lie buried beneath the green grass who standing among flowers can say—here _here_ lies my beloved ye know not desolation that broods in bosoms like these',
 'What bitter blanks in those black-bordered marbles which cover no ashes!',
 'What despair in those immovable inscriptions!']

In [None]:
reconst_vectorizer = Vectorizer()
reconst_vectorizer.run(reconstructed_2)
reconst_sentence_vectors = reconst_vectorizer.vectors

Initializing Bert distilbert-base-uncased
Vectorization done on cpu



In [None]:
# Distances between pruned and original vectors
orig_distances = [
    distance.cosine(orig_vec, short_vec)
    for orig_vec, short_vec in zip(original_sentence_vectors, short_sentence_vectors)
]

# Distances between reconstructed and original vectors
rest_distances = [
    distance.cosine(orig_vec, reconst_vec)
    for orig_vec, reconst_vec in zip(original_sentence_vectors, reconst_sentence_vectors)
]
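
For reference, cosine distance is one minus the cosine of the angle between two vectors, so 0 means the embeddings point in the same direction and larger values mean they diverge. The pure-Python sketch below illustrates that definition on toy vectors. The function name `cosine_distance` is hypothetical, and this is not SciPy's implementation.

```python
import math


def cosine_distance(u, v):
    """Return 1 - (u . v) / (|u| * |v|) for two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


print(cosine_distance([1.0, 0.0], [2.0, 0.0]))  # same direction
print(cosine_distance([1.0, 0.0], [0.0, 3.0]))  # orthogonal
```

Because the measure ignores vector length, it compares only the direction of the sentence embeddings, not their magnitude.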

In [None]:
orig_distances

[0.032640695571899414,
 0.05033355951309204,
 0.028860092163085938,
 0.2735424041748047,
 0.03151208162307739]

In [None]:
rest_distances

[0.004494011402130127,
 0.004813849925994873,
 0.009506165981292725,
 0.011010468006134033,
 0.010721981525421143]

In [None]:
np.mean(orig_distances)

0.0833777666091919

In [None]:
np.mean(rest_distances)

0.00810929536819458
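
With only five sentences, one large value can dominate the mean: the fourth pruned sentence has a distance of about 0.27, while the others are near 0.03 to 0.05. As a rough sketch, the median and maximum can be reported alongside the mean. The values below are copied from the outputs above, and the choice of summary statistics is an assumption, not part of the original analysis.

```python
import statistics

# Distances copied from the notebook outputs.
orig_distances = [
    0.032640695571899414,
    0.05033355951309204,
    0.028860092163085938,
    0.2735424041748047,
    0.03151208162307739,
]
rest_distances = [
    0.004494011402130127,
    0.004813849925994873,
    0.009506165981292725,
    0.011010468006134033,
    0.010721981525421143,
]

for name, values in (("pruned", orig_distances), ("reconstructed", rest_distances)):
    print("%s: mean=%.4f median=%.4f max=%.4f"
          % (name, statistics.mean(values), statistics.median(values), max(values)))
```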