<a href="https://colab.research.google.com/github/spatiebalk/text_mining_project/blob/master/score_generated_texts_hp.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Scoring generated texts

Inputs to `score` are a list of candidate sentences and a list of reference sentences. 

In [2]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [4]:
import json
import time

import nltk
import nltk.data
from nltk.tokenize import word_tokenize
from nltk.translate.bleu_score import sentence_bleu

nltk.download('punkt')

# Load the reference text.
with open('/content/gdrive/My Drive/TxMM/harrypotter.txt') as f:
    text = f.read()

# Split the reference into sentences, then tokenize each sentence into words.
sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
text_tokens = [word_tokenize(sent) for sent in sent_detector.tokenize(text.strip())]

# Load the generated texts (a dict keyed by string indices).
with open('/content/gdrive/My Drive/TxMM/GRU_results/hp_generated_texts.json') as json_file:
    generated_text = json.load(json_file)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## BLEU score
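BLEU ranges from 0 to 1, where 1 means the generated text matches a reference exactly. Only trigrams are scored here (`weights = (0, 0, 1, 0)`), and every sentence of the reference text serves as a reference.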
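
As a rough illustration of what `weights = (0, 0, 1, 0)` measures in the cell below, this sketch computes clipped trigram precision in plain Python. It is not NLTK's implementation: the brevity penalty and smoothing are left out, and `ngrams` and `clipped_ngram_precision` are hypothetical helper names used only here.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_ngram_precision(references, hypothesis, n=3):
    """Fraction of hypothesis n-grams found in any reference.

    Each n-gram's count is clipped to the maximum count seen in a
    single reference. Assumption: no brevity penalty, no smoothing.
    """
    hyp_counts = Counter(ngrams(hypothesis, n))
    if not hyp_counts:
        return 0.0

    # For every n-gram, the highest count in any one reference.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)

    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in hyp_counts.items())
    return clipped / sum(hyp_counts.values())


# Toy data (made up for illustration).
refs = [["the", "boy", "who", "lived"], ["he", "was", "the", "boy"]]
hyp = ["the", "boy", "who", "lived", "here"]

# 2 of the 3 hypothesis trigrams occur in a reference.
precision = clipped_ngram_precision(refs, hyp)
print(precision)
```

With trigram-only weights, a generated text scores above zero only if it shares at least one full three-token sequence with some reference sentence.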

In [None]:
score_dict_BLEU = {}
start = time.time()

for i in range(0, 100):
  gen_text = generated_text[str(i)]
  gen_text_sent = word_tokenize(gen_text)

  BLEUscore = sentence_bleu(text_tokens, gen_text_sent, weights = (0, 0, 1, 0))
  score_dict_BLEU[i] = BLEUscore

end = time.time()
print("This took {:.2f} minutes".format((end-start)/60))
with open('/content/gdrive/My Drive/TxMM/GRU_results/hp_scores_BLEU.json', 'w') as fp:
  json.dump(score_dict_BLEU, fp)

<class 'list'>
['rry', 'Potter', 'and', 'the', 'Sorcerer', "'s", 'Stone', 'CHAPTER', 'ONE', 'THE', 'BOY', 'WHO', 'LIVED', 'Mr.', 'and', 'Mrs.', 'Dursley', ',', 'of', 'number', 'four', ',', 'Privet', 'Drive', ',', 'were', 'proud', 'to', 'say', 'that', 'they', 'were', 'perfectly', 'normal', ',', 'thank', 'you', 'very', 'much', '.'] <class 'list'>
<class 'str'>
<class 'list'>
85748
['was', 'no', 'longer', 'a', 'bit', 'of', 'the', 'potion', 'with', 'the', 'prophecy', '.', 'He', 'stopped', 'to', 'the', 'first', 'time', 'to', 'get', 'the', 'sound', 'of', 'the', 'wizard', 'to', 'stay', 'with', 'the', 'sound', 'of', 'Ron', 'and', 'Hermione', 'slithered', 'at', 'the', 'tiny', 'part', 'of', 'them', ',', 'even', 'though', 'he', 'was', 'still', 'concerned', 'at', 'the', 'silver', 'staircase', 'on', 'the', 'statue', 'when', 'he', 'added', 'and', 'the', 'handsome', 'thing', 'there', 'about'

## ROUGE score
ROUGE-3 (trigram overlap) precision, recall, and F1 each lie between 0 and 1; higher means more similar.
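
To make the three ROUGE-3 numbers concrete, here is a minimal plain-Python sketch of trigram precision, recall, and F1. It assumes pre-tokenized input and skips the stemming and tokenization that `rouge_score` applies; `ngram_counts` and `rouge_n` are hypothetical names, not part of that library.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count all contiguous n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(target_tokens, prediction_tokens, n=3):
    """ROUGE-N precision, recall, and F1 from n-gram overlap."""
    target = ngram_counts(target_tokens, n)
    prediction = ngram_counts(prediction_tokens, n)

    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((target & prediction).values())

    precision = overlap / max(sum(prediction.values()), 1)
    recall = overlap / max(sum(target.values()), 1)
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return {"P": precision, "R": recall, "F1": f1}


# Toy data (made up for illustration).
target = "harry looked at the old wizard".split()
prediction = "ron looked at the old".split()

scores = rouge_n(target, prediction)
print(scores)
```

Recall divides the overlap by the number of trigrams in the target. When the target is an entire book and the prediction is a short generated passage, recall (and therefore F1) will be very small, so precision is likely the more informative of the three values.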

In [None]:
!pip install rouge-score
from rouge_score import rouge_scorer

Collecting rouge-score
  Downloading https://files.pythonhosted.org/packages/1f/56/a81022436c08b9405a5247b71635394d44fe7e1dbedc4b28c740e09c2840/rouge_score-0.0.4-py2.py3-none-any.whl
Installing collected packages: rouge-score
Successfully installed rouge-score-0.0.4


In [None]:
score_dict_ROUGE = {}
start = time.time()

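# Build the scorer once; ROUGE-3 counts overlapping trigrams (after stemming).
scorer = rouge_scorer.RougeScorer(['rouge3'], use_stemmer=True)

for i in range(0, 100):
  gen_text = generated_text[str(i)]

  # score(target, prediction): the full reference text is the target.
  ROUGEscore = scorer.score(text, gen_text)["rouge3"]
  score_dict_ROUGE[i] = {"P": ROUGEscore.precision, "R": ROUGEscore.recall, "F1": ROUGEscore.fmeasure}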

end = time.time()
print("This took {:.2f} minutes".format((end-start)/60))
with open('/content/gdrive/My Drive/TxMM/GRU_results/hp_scores_ROUGE.json', 'w') as fp:
  json.dump(score_dict_ROUGE, fp)

This took 0.22 minutes


## BERT scores
Higher is more similar. The score is the mean cosine similarity between averaged BERT embeddings, so it lies between -1 and 1.

In [10]:
!pip install transformers


In [32]:
import torch
from transformers import BertTokenizer, BertModel
import numpy as np
from tqdm import tqdm

In [12]:
BERT_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
BERT_model = BertModel.from_pretrained('bert-base-uncased')

Split the reference text into 3200 chunks of about 1900 characters each, which is roughly 512 tokens per chunk (the maximum BERT input length). For each chunk, compute the average BERT embedding and append it to a list. Later, the average embedding of each generated text is compared with each of these chunk embeddings, and the mean cosine similarity is used as the BERT score for that generated text.
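
The character-based chunking can be sketched on its own, without BERT. The helper name `split_into_chunks` and the toy string are assumptions for illustration only.

```python
def split_into_chunks(text, steps):
    """Split text into `steps` chunks of equal character length.

    Any trailing characters beyond steps * chunk_len are dropped.
    """
    chunk_len = len(text) // steps
    return [text[i * chunk_len:(i + 1) * chunk_len] for i in range(steps)]


# Toy input: 100 characters split into 8 chunks of 12 characters each.
sample = "abcdefghij" * 10
chunks = split_into_chunks(sample, 8)

print(len(chunks), len(chunks[0]))
# Characters that do not fit into a full chunk are left out.
print(len(sample) - sum(len(c) for c in chunks))
```

Because the split is by characters rather than tokens, a chunk can tokenize to more than 512 tokens; in the notebook, `truncation=True` cuts such a chunk off at the model limit, so the end of that chunk does not contribute to its embedding.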

In [34]:
ref_text_embeddings = []

steps = 3200
par_len = len(text) // steps  # characters per chunk

assert steps * par_len <= len(text)

for i in tqdm(range(0, steps)):

  par = text[i*par_len:i*par_len+par_len]
  # truncation=True cuts any chunk that exceeds the model's token limit
  inputs = BERT_tokenizer(par, truncation=True, return_tensors="pt")
  with torch.no_grad():  # inference only: skip gradient tracking
    outputs = BERT_model(**inputs)

  # Average the token embeddings into one vector per chunk
  last_hidden_states = outputs.last_hidden_state
  avg_embedding = torch.mean(last_hidden_states, dim=1)
  avg_embedding = torch.squeeze(avg_embedding)
  avg_embedding = avg_embedding.detach().numpy()

  ref_text_embeddings.append(avg_embedding)

 24%|██▎       | 754/3200 [17:30<57:30,  1.41s/it]

KeyboardInterrupt: ignored

In [39]:
from sklearn.metrics.pairwise import cosine_similarity

754
768


In [45]:
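def BERTscore(ref_text_embeddings, gen_text_embedding):
  # Mean cosine similarity between the generated-text embedding and every
  # reference chunk embedding. Returns an array of shape (1, 1).
  sim = 0
  for embed in ref_text_embeddings:
    sim += cosine_similarity(embed.reshape(1, -1), gen_text_embedding.reshape(1, -1))

  return sim/len(ref_text_embeddings)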
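
The same averaging can be written without a Python loop. The following is a sketch using only NumPy, assuming all embeddings are non-zero vectors of equal length; `mean_cosine_similarity` is a hypothetical name and the toy vectors are made up.

```python
import numpy as np


def mean_cosine_similarity(ref_embeddings, gen_embedding):
    """Average cosine similarity between one vector and a list of vectors.

    Assumes every vector is non-zero (otherwise the norm division fails).
    """
    refs = np.asarray(ref_embeddings, dtype=np.float64)  # (n_chunks, hidden)
    gen = np.asarray(gen_embedding, dtype=np.float64)    # (hidden,)

    ref_norms = np.linalg.norm(refs, axis=1)
    gen_norm = np.linalg.norm(gen)

    # Dot product of each reference row with gen, divided by the norms.
    sims = (refs @ gen) / (ref_norms * gen_norm)
    return float(sims.mean())


# Toy 2-dimensional "embeddings": similarities are 1, 0, and 1/sqrt(2).
refs = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]
gen = np.array([1.0, 0.0])

score = mean_cosine_similarity(refs, gen)
print(score)
```

Returning a plain Python `float` also keeps the result directly serializable with `json.dump`, which a NumPy `float32` value is not.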

In [46]:
score_dict_BERT = {}
start = time.time()

for i in range(0, 100):
  gen_text = generated_text[str(i)]

  # create BERT embedding
  inputs = BERT_tokenizer(gen_text, truncation=True, return_tensors="pt")
  outputs = BERT_model(**inputs)

  last_hidden_states = outputs.last_hidden_state
  avg_embedding = torch.mean(last_hidden_states, dim=1)
  avg_embedding = torch.squeeze(avg_embedding)
  avg_embedding = avg_embedding.detach().numpy()

  # compare gen_text with ref_text
  sim = BERTscore(ref_text_embeddings, avg_embedding )

  # float(), not int(): int() would truncate the cosine similarity to 0
  score_dict_BERT[i] = float(sim[0][0])

end = time.time()
print("This took {:.2f} minutes".format((end-start)/60))

with open('/content/gdrive/My Drive/TxMM/GRU_results/hp_scores_BERT.json', 'w') as fp:
  json.dump(score_dict_BERT, fp)

This took 1.32 minutes
