# Baseline Models 

This method only works for sentiment transfer, because the attribute markers in sentiment transfer are less nuanced than those needed for formality transfer. The attribute markers used here do not capture which words are more formal, so they cannot be generalized as easily.

These are the baseline models proposed for comparison against DeleteOnly and DeleteAndRetrieve.

In [1]:
import numpy as np
import seaborn as sns

import os
import re

## Load Data

In [2]:
with open('../../Data/Unsupervised Data/Entertainment_Music/U_Formal_EM_Test.txt') as fh:
    formal = fh.read()
with open('../../Data/Unsupervised Data/Entertainment_Music/U_Informal_EM_Val.txt') as fh:
    informal = fh.read()

# formal_holdout = open('../../Data/Unsupervised Data/Entertainment_Music/U_Formal_EM_ValTest.txt').read()
# informal_holdout = open('../../Data/Unsupervised Data/Entertainment_Music/U_Informal_EM_ValTest.txt').read()

In [3]:
def process_sequence(seq):
    """Pad each punctuation mark (. , ! ? ( )) with spaces, then collapse repeated whitespace."""
    s = re.sub(r'([.,!?()])', r' \1 ', seq)
    s = re.sub(r'\s{2,}', ' ', s)
    return s

f_corpus = [process_sequence(seq) for seq in formal.split('\n')]
if_corpus = [process_sequence(seq) for seq in informal.split('\n')]
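
To make the effect of this preprocessing concrete, here is a minimal sketch on a made-up sentence. The helper name `pad_punctuation` and the example string are illustrative assumptions; the two substitutions simply repeat those in `process_sequence` so that the snippet runs on its own.

```python
import re

def pad_punctuation(seq):
    # Same two substitutions as process_sequence, repeated here
    # so the snippet is self-contained (hypothetical helper name).
    s = re.sub(r'([.,!?()])', r' \1 ', seq)
    return re.sub(r'\s{2,}', ' ', s)

example = "Wow, that was great!"  # hypothetical input, not from the corpus
tokens = pad_punctuation(example).split(" ")
print(tokens)
```

Padding the final punctuation mark leaves a trailing space, so splitting on a single space yields an empty string as the last token for any sequence that ends in punctuation.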

In [4]:
print("{} formal sequences and {} informal sequences for evaluation.".format(len(f_corpus), 
                                                                          len(if_corpus)))

1083 formal sequences and 2878 informal sequences for evaluation.


## Retrieve Only

This is the simplest approach to unsupervised style transfer: identify a sequence $x^{tgt}$ in the target-style corpus that is similar in content to $x^{src}$ and return that sequence unchanged.

I'm going to use <b>Jaccard Similarity</b>, which is the size of the intersection of the two token sets divided by the size of their union. This seems like a simple approach that would help filter out style words and match on content.
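
As a small worked illustration of intersection over union on token sets, the sketch below scores one source sentence against two candidate targets and picks the best match. The sentences and the names `jaccard`, `source`, and `targets` are hypothetical and chosen only for the example; they are not taken from the corpus.

```python
def jaccard(tokens_a, tokens_b):
    """Intersection over union of two token collections."""
    a, b = set(tokens_a), set(tokens_b)
    union = a | b
    # Guard against two empty inputs to avoid dividing by zero.
    return len(a & b) / len(union) if union else 0.0

# Hypothetical toy sentences, tokenized on whitespace.
source = "i luv this song !".split()
targets = [
    "i enjoy this song very much .".split(),
    "the film was not entertaining .".split(),
]

scores = [jaccard(source, t) for t in targets]
# Index of the target with the highest similarity to the source.
best = max(range(len(targets)), key=scores.__getitem__)
print(scores, "->", " ".join(targets[best]))
```

The first candidate shares three of nine distinct tokens with the source (`i`, `this`, `song`), while the second shares none, so the first is retrieved.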

In [6]:
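def jaccard_similarity(seq_a, seq_b):
    """Jaccard similarity between the space-separated token sets of two sequences."""
    a = set(seq_a.split(" "))
    b = set(seq_b.split(" "))
    c = a.intersection(b)
    # |A ∪ B| = |A| + |B| - |A ∩ B|
    return len(c) / (len(a) + len(b) - len(c))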

def compute_similarities(inf, f):
    """Map each informal sequence to the formal sequence with the highest Jaccard similarity."""
    evaluations = {}
    for seq_if in inf:
        best = -1
        best_f = None
        for seq_f in f:
            score = jaccard_similarity(seq_if, seq_f)
            if score > best:
                best = score
                best_f = seq_f
        evaluations[seq_if] = best_f
    return evaluations

def see_match(seq_a, target_dict):
    """See best content match of seq_a"""
    seq_b = target_dict[seq_a]
    sim = jaccard_similarity(seq_a, seq_b)
    print("Informal Seq: {} \nFormal Seq: {} \nJaccard Similarity {}".format(seq_a, 
                                                                             seq_b,
                                                                             sim))

In [None]:
def retrieve_dict(input_corpus, target_corpus):
    

In [7]:
seq_target_dict = compute_similarities(if_corpus, f_corpus)

In [10]:
see_match(if_corpus[200], seq_target_dict)

Informal Seq: I'm sure hes in the phone book < : ) < 
Formal Seq: They cannot see anything in the beginning .  
Jaccard Similarity 0.11764705882352941


## Template Based

This method is slightly less trivial than Retrieve Only. Here we proceed by retrieving a target sentence y 