In [None]:
%tensorflow_version 1.x
!pip install -q gpt-2-simple
!pip install text-matcher
!pip install nltk
!python -m nltk.downloader stopwords

import gpt_2_simple as gpt2
import os

from datetime import datetime
from google.colab import files

TensorFlow 1.x selected.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.



To run the metrics, upload two files to the root directory of your Google Drive: `with_repeat_data.txt`, the corpus of original songs with repeat tokens added, and `results.txt`, the results generated by your model of choice.

In [None]:
gpt2.mount_gdrive()
gpt2.copy_file_from_gdrive("with_repeat_data.txt")
gpt2.copy_file_from_gdrive("results.txt")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


The code below strips the special tokens (the start, end, line-break, and verse-break markers, but not `<|UNK|>`) from a given text so that they do not count towards the plagiarized character count. It is run on both documents before they are compared.

In [None]:
start_token = "<|startoftext|>"
end_token = "<|endoftext|>"
line_break_token = "<|line_break|>"
verse_break_token = "<|verse_break|>"

def strip_tokens(input_text, clean_text):
  # Copy input_text to clean_text with the special tokens removed.
  # <|UNK|> is left in place on purpose.
  tokens = (line_break_token, start_token, end_token, verse_break_token)
  with open(input_text, "r") as input_f, open(clean_text, "w") as clean_f:
    for line in input_f.read().splitlines():
      clean_line = line
      for token in tokens:
        # str.replace is a no-op when the token is absent, so no membership check is needed.
        clean_line = clean_line.replace(token, "")

      clean_f.write(clean_line + "\n")

In [None]:
strip_tokens("data.txt", "clean_data.txt")
strip_tokens("results.txt", "clean_results.txt")
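
As a quick sanity check of the stripping step, here is a minimal sketch that applies the same idea to an in-memory string instead of a file. The helper name `strip_tokens_from_string` and the sample lyric are hypothetical and are not part of the notebook's pipeline.

```python
import re

# Special tokens removed before matching; <|UNK|> is deliberately not listed.
SPECIAL_TOKENS = ["<|startoftext|>", "<|endoftext|>", "<|line_break|>", "<|verse_break|>"]

# re.escape is needed because "|" is a regex metacharacter.
TOKEN_PATTERN = re.compile("|".join(re.escape(token) for token in SPECIAL_TOKENS))

def strip_tokens_from_string(text):
    """Hypothetical helper: remove every special token from a string."""
    return TOKEN_PATTERN.sub("", text)

# Made-up sample line, not taken from the corpus.
sample = "<|startoftext|>oh oh <|UNK|> baby<|line_break|>yeah<|endoftext|>"
print(strip_tokens_from_string(sample))
```

Note that replacing a token with the empty string joins its neighbours ("baby" and "yeah" above become one word) whenever the token is not surrounded by spaces. `strip_tokens` behaves the same way, so this only matters if the corpus writes tokens without padding.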

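The commands below run `text-matcher` on the two cleaned documents and log the matches to a CSV file, which is parsed in a later cell. Note that `text-matcher` appends to the log file passed in, so the first command removes any existing `log.csv` before the comparison runs.

Matches are highlighted in red in the notebook output. In a plain-text export, the highlighted span is the text between the color codes `[31m` and `[0m`.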

In [None]:
!rm -rf log.csv
!text-matcher clean_data.txt clean_results.txt -l log.csv

16 total matches found.
Extending match forwards with words: oh oh
Extending match forwards with words: unk unk
Extending match forwards with words: oh oh
Extending match forwards with words: oh oh
Extending match forwards with words: oh oh


match 1:
[32mclean_data.txt[0m: (1083756, 1083798) aye singing oh oh-oh [31moh-oh-oh oh-oh <|UNK|> oh-oh-oh-oh-oh-oh[0m oh ooh yeah if we could throw
[32mclean_results.txt[0m: (4177, 4240) life of a <|UNK|> oh yeah <|repeat [31moh oh oh oh oh my my my <|UNK|> oh oh my <|UNK|> oh oh my oh[0m oh <|UNK|> just a day a day a day


match 2:
[32mclean_data.txt[0m: (1518589, 1518651) fire oh oh oh i'm on fire [31mooh-ooh ooh ooh-ooh ooh-ooh ooh-ooh-ooh ooh-ooh ooh ooh ooh[0m ooh-ooh ooh-ooh-ooh
[32mclean_results.txt[0m: (16169, 16231) get in my bed because i am the one and you are the other [31mooh ooh ooh ooh ooh ooh ooh ooh ooh ooh ooh ooh ooh ooh ooh[0m way you keep me coming keep me coming


match 3:
[32mclean_data.txt[0m: (5377088, 

The following code parses the CSV generated by `text-matcher` to retrieve the character locations of each match, computes the length of each match as its end offset minus its start offset, and reports how much of the generated song document was plagiarized from the original document. The final value is printed as a fraction, so 0.011 corresponds to about 1.1%.

In [None]:
import csv

def get_location_tuple(location_str):
  # Convert a logged location string such as "[(a, b), (c, d)]" into a list of
  # (start, end) integer tuples. Only the first bracketed list is used.
  location_lst = location_str.split("] [")

  replace_markers = location_lst[0].replace("[", "").replace("]", "").replace("(", "").replace("), ", "|").replace(")", "")
  convert_int = [(int(s.split(", ")[0]), int(s.split(", ")[1])) for s in replace_markers.split("|")]

  return convert_int

def calc_char_diff(location_tuple):
  # Length of a match in characters: end offset minus start offset.
  return location_tuple[1] - location_tuple[0]

with open('log.csv', newline='') as csvfile:
  reader = csv.DictReader(csvfile)
  # log.csv is deleted before text-matcher runs, so it holds one row for this comparison.
  for row in reader:
    match_lst_A = get_location_tuple(row['Locations in A'])
    match_lst_B = get_location_tuple(row['Locations in B'])

    num_matches = int(row['Num Matches'])
    generated_text_len = float(row['Text B Length'])

    total_plagiarized = 0
    for match_idx in range(num_matches):
      print("\nMatch ", match_idx + 1)

      # The printed count is a running total over the matches seen so far.
      total_plagiarized += calc_char_diff(match_lst_B[match_idx])
      print("Number of characters plagiarized: ", total_plagiarized)

    # Fraction (not percent) of the generated text covered by matches.
    percent_plagiarized = total_plagiarized / generated_text_len
    print("\nPercentage plagiarized from the corpus: ", percent_plagiarized)


Match  1
Number of characters plagiarized:  63

Match  2
Number of characters plagiarized:  125

Match  3
Number of characters plagiarized:  215

Match  4
Number of characters plagiarized:  280

Match  5
Number of characters plagiarized:  481

Match  6
Number of characters plagiarized:  544

Match  7
Number of characters plagiarized:  723

Match  8
Number of characters plagiarized:  808

Match  9
Number of characters plagiarized:  842

Percentage plagiarized from the corpus:  0.01137637982516585
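
To make the arithmetic concrete, here is a minimal sketch that parses a location string with a regular expression and recomputes the first two running totals. The offsets are the `clean_results.txt` locations of matches 1 and 2 shown in the `text-matcher` output above. The exact string layout of the log column, the helper name `parse_locations`, and the text length of 10,000 characters are assumptions made for illustration.

```python
import re

def parse_locations(location_str):
    """Hypothetical helper: extract (start, end) integer pairs from a location string."""
    pairs = re.findall(r"\((\d+),\s*(\d+)\)", location_str)
    return [(int(start), int(end)) for start, end in pairs]

# Assumed layout of the "Locations in B" column; offsets are from matches 1 and 2.
locations_b = parse_locations("[(4177, 4240), (16169, 16231)]")

# Each match length is end minus start.
match_lengths = [end - start for start, end in locations_b]
print(match_lengths)

total_plagiarized = sum(match_lengths)
print(total_plagiarized)

# Hypothetical generated-text length, used only to show the final division.
text_b_length = 10_000
print(total_plagiarized / text_b_length)
```

The two lengths are 63 and 62, which sum to 125 and agree with the running totals printed for Match 1 and Match 2.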


The code below is sourced from [this Stack Overflow post](https://stackoverflow.com/questions/15173225/calculate-cosine-similarity-given-2-sentence-strings). It measures the surface similarity of two texts: each text is reduced to a vector of word counts, and the score is the cosine of the angle between the two vectors. We vectorize our generated songs and the original corpus as vocabulary distributions in order to represent how “pop-like” our generated lines are.

In [None]:
import math
import re
from collections import Counter

WORD = re.compile(r"\w+")


def get_cosine(vec1, vec2):
    intersection = set(vec1.keys()) & set(vec2.keys())
    numerator = sum([vec1[x] * vec2[x] for x in intersection])

    sum1 = sum([vec1[x] ** 2 for x in list(vec1.keys())])
    sum2 = sum([vec2[x] ** 2 for x in list(vec2.keys())])
    denominator = math.sqrt(sum1) * math.sqrt(sum2)

    if not denominator:
        return 0.0
    else:
        return float(numerator) / denominator


def text_to_vector(text):
    words = WORD.findall(text)
    return Counter(words)

with open("data.txt", "r") as big_corpus, open("results.txt", "r") as results:
  vector1 = text_to_vector(big_corpus.read())
  vector2 = text_to_vector(results.read())

  cosine = get_cosine(vector1, vector2)

print("Cosine similarity:", cosine)

Cosine similarity: 0.9556724155306071
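
For intuition about what this score does and does not capture, here is a small self-contained sketch of the same bag-of-words cosine on two short strings. The function name `cosine_similarity` and the sample phrases are hypothetical, chosen only for illustration.

```python
import math
import re
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Hypothetical helper: cosine similarity between the word-count vectors of two strings."""
    vec_a = Counter(re.findall(r"\w+", text_a))
    vec_b = Counter(re.findall(r"\w+", text_b))

    # Dot product over the words the two texts share.
    dot = sum(vec_a[word] * vec_b[word] for word in vec_a.keys() & vec_b.keys())

    # Product of the Euclidean norms of the two count vectors.
    norm = math.sqrt(sum(c * c for c in vec_a.values())) * math.sqrt(sum(c * c for c in vec_b.values()))
    return dot / norm if norm else 0.0

# Counts are {oh: 2, baby: 1} and {baby: 1, oh: 1, yeah: 1}:
# dot = 2*1 + 1*1 = 3, norms = sqrt(5) and sqrt(3), so the score is 3 / sqrt(15).
print(cosine_similarity("oh oh baby", "baby oh yeah"))

# Word order is ignored, so a reshuffled line scores (up to rounding) 1.0.
print(cosine_similarity("oh oh baby", "baby oh oh"))
```

Because the vectors only count words, a high score indicates a shared vocabulary rather than copied phrases, which is why it complements the character-level match count above instead of replacing it.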
