# Determining the Difference Between Input & Output Texts

The goal of this function is to create a new feature, "text_diff". This feature shows the concrete (isolated) changes made to the text, omitting the parts that are present in both input and output.

**Some possible changes that could occur:**

    - Spelling changes: Misteries -> Mysteries
    - Grammar changes: On the countryside, I ride bikes -> I ride bikes on the countryside
    - Drastic changes: completely new text, deleted text, etc.
    - The essence of changes: 
        + Deleted words 
        + New words 
        + Rearranged words (switch place in sentence) 
        + Edited words (only a few characters in word have changed -> Bike vs Biking)
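
The change types listed above can be sketched at word level with `difflib`. This is a minimal illustration, not a settled design: the function name `classify_word_changes` and the 0.5 cut-off that separates an "edited" word from a swapped-out one are assumptions made for the example.

```python
import difflib

# Assumed cut-off: word pairs at least this similar count as "edited".
EDIT_THRESHOLD = 0.5


def classify_word_changes(before, after, threshold=EDIT_THRESHOLD):
    """Sort word-level differences into deleted, new, edited, rearranged."""
    a_words, b_words = before.split(), after.split()
    matcher = difflib.SequenceMatcher(None, a_words, b_words, autojunk=False)
    deleted, new, edited = [], [], []

    for tag, a0, a1, b0, b1 in matcher.get_opcodes():
        if tag == "delete":
            deleted.extend(a_words[a0:a1])
        elif tag == "insert":
            new.extend(b_words[b0:b1])
        elif tag == "replace":
            old_span, new_span = a_words[a0:a1], b_words[b0:b1]
            paired = min(len(old_span), len(new_span))
            for old, fresh in zip(old_span, new_span):
                # Character-level ratio decides "edited" vs. "swapped out".
                if difflib.SequenceMatcher(None, old, fresh).ratio() >= threshold:
                    edited.append((old, fresh))
                else:
                    deleted.append(old)
                    new.append(fresh)
            # Leftovers of an uneven replace are plain deletions or insertions.
            deleted.extend(old_span[paired:])
            new.extend(new_span[paired:])

    # A word removed in one place and added in another is treated as rearranged.
    rearranged = [w for w in deleted if w in new]
    return {
        "deleted": [w for w in deleted if w not in rearranged],
        "new": [w for w in new if w not in rearranged],
        "edited": edited,
        "rearranged": rearranged,
    }


print(classify_word_changes("The Misteries of Paris", "The Mysteries of Paris"))
```

Splitting on whitespace keeps punctuation attached to words, so "countryside," and "countryside" count as different tokens; a real tokenizer would likely be needed before relying on the "rearranged" bucket.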

**Features that could be useful:**

    - Actual textual difference
    - See above: essence of changes
    - Similarity measure between the texts
    - Whether bad words are present or not -> Compare to a vocabulary of vulgar words.
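
Two of the features above, the similarity measure and the bad-word check, could look roughly like this. The vocabulary `VULGAR_WORDS` is a hypothetical placeholder (a real list would be loaded from a file), and `text_features` is a name made up for this sketch.

```python
import difflib
import re

# Hypothetical placeholder vocabulary, for illustration only.
VULGAR_WORDS = frozenset({"darn", "heck"})


def text_features(before, after, vulgar_words=VULGAR_WORDS):
    """Return a similarity score and a bad-word flag for the output text."""
    similarity = difflib.SequenceMatcher(None, before, after).ratio()
    # Lower-case and keep only letter runs so punctuation does not hide a match.
    tokens = set(re.findall(r"[a-z']+", after.lower()))
    return {
        "similarity": round(similarity, 4),
        "has_vulgar_word": bool(tokens & set(vulgar_words)),
    }


print(text_features("abxcd", "abcd"))
```

The flag only looks at the output text here; whether the input should be checked as well is an open design choice.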
    
**To Do:**

    - Understanding the RDD & DataFrame format
    - Proper difference between input & output text (what do we want the difference to show?)
    - Test out a model

In [1]:
import difflib
import sys


string1 = "abxcd"
string2 = "abcd"

matches = difflib.SequenceMatcher(
    None, string1, string2).get_matching_blocks()
for match in matches:
    print(string1[match.a:match.a + match.size])

ab
cd



In [2]:
# text1 = open("sample1.txt").readlines()
# text2 = open("sample2.txt").readlines()

text1 = 'I am Simon Hiel, I study Handelsingenieur'
text2 = 'I am Cedric Foccaert, I study Handelsingenieur' 

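# unified_diff expects sequences of lines; passing plain strings makes it
# compare character by character, which explains the output below.
sys.stdout.writelines(difflib.unified_diff(text1, text2, fromfile='before.py', tofile='after.py'))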

--- before.py
+++ after.py
@@ -3,16 +3,21 @@
 a m  -S+C+e+d+r i-m+c+ +F o-n- -H-i+c+c+a e-l+r+t ,   I
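
The output above is hard to read because each character is treated as a "line" of the diff. One possible alternative, shown here as a sketch, is to split both texts into words first so that every diff line is one word. The `fromfile`/`tofile` labels are arbitrary, and `lineterm=''` is used because the words carry no trailing newlines.

```python
import difflib

text1 = 'I am Simon Hiel, I study Handelsingenieur'
text2 = 'I am Cedric Foccaert, I study Handelsingenieur'

# Compare word lists instead of raw strings; one word per diff line.
diff_lines = list(difflib.unified_diff(
    text1.split(), text2.split(),
    fromfile='before', tofile='after', lineterm=''))

print('\n'.join(diff_lines))
```

Lines starting with `-` are words only in the first text, lines starting with `+` are words only in the second, and lines starting with a space are shared context.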

In [38]:
string1 = 'I am Simon Hiel and I study Handelsingenieur'
string2 = 'I am Not Simon Hiel but I am Cedric Foccaert and I study Major Data Science' 

matches = difflib.SequenceMatcher(None, string1, string2).get_matching_blocks()
for match in matches:
    print(string1[match.a:match.a + match.size])

I am
 Simon Hiel
 and I study 
a
i
en
e



In [39]:
## Calculating similarity between strings

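# The first argument (isjunk) tells the matcher to treat spaces as junk.
SM = difflib.SequenceMatcher(lambda x: x == " ", string1, string2)
matches = SM.get_matching_blocks()

# ratio() returns 2*M/T: M = matched characters, T = combined length of both strings.
similarity = round(SM.ratio(), 4)
print(similarity)  # Better implementation: weighted ratio!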

for match in matches:
    print(string1[match.a:match.a + match.size])


0.5546
I am
 Simon Hiel 
and I study 
a
i
en
e
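
To make the similarity score less of a black box: `ratio()` is defined as 2·M/T, where M is the number of matched elements and T is the combined length of both sequences. The sketch below recomputes it by hand from the matching blocks. For the blocks printed above, M = 33 and T = 44 + 75 = 119, which gives 66/119 ≈ 0.5546.

```python
import difflib

string1 = 'I am Simon Hiel and I study Handelsingenieur'
string2 = 'I am Not Simon Hiel but I am Cedric Foccaert and I study Major Data Science'

sm = difflib.SequenceMatcher(None, string1, string2)

# M: total size of all matching blocks (the final dummy block has size 0).
matched = sum(block.size for block in sm.get_matching_blocks())
# T: combined length of both strings.
total = len(string1) + len(string2)

manual_ratio = 2.0 * matched / total
print(manual_ratio, sm.ratio())
```

Because the score is purely character based, long unchanged passages dominate it; this is one reason a word-level or weighted variant may be worth trying.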



In [40]:
def show_diff(seqm):
    """Return seqm.a with inserted and deleted spans marked by <ins>/<del> tags."""
    output = []
    for opcode, a0, a1, b0, b1 in seqm.get_opcodes():
        if opcode == 'equal':
            output.append(seqm.a[a0:a1])
        elif opcode == 'insert':
            output.append(" <ins>" + seqm.b[b0:b1] + " </ins>")
        elif opcode == 'delete':
            output.append(" <del>" + seqm.a[a0:a1] + " </del>")
        # 'replace' handling is disabled, so replaced spans are dropped from the output.
#         elif opcode == 'replace':
#             output.append(" <old>" + seqm.a[a0:a1] + " </old>" + " ---> " + " <new>" + seqm.b[b0:b1] + " </new>")
#         else:
#             raise RuntimeError("unexpected opcode")
            
    return ''.join(output)




In [41]:
test = difflib.SequenceMatcher(None, string1, string2)
show_diff(test)

'I am <ins> Not </ins> Simon Hiel <ins> but I am Cedric Foccaert </ins> and I study ai <del>ng </del>ene <del>ur </del>'
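
In the result above, replaced spans (for example the "H" of "Handelsingenieur" versus the "M" of "Major") are missing, because `show_diff` skips `replace` opcodes. A possible variant that keeps them is sketched below. The name `show_diff_with_replace` is made up for this example, and rendering a replacement as a `<del>` followed by an `<ins>` is just one reasonable choice.

```python
import difflib


def show_diff_with_replace(seqm):
    """Mark inserted, deleted and replaced spans of a SequenceMatcher."""
    output = []
    for opcode, a0, a1, b0, b1 in seqm.get_opcodes():
        if opcode == 'equal':
            output.append(seqm.a[a0:a1])
        elif opcode == 'insert':
            output.append('<ins>' + seqm.b[b0:b1] + '</ins>')
        elif opcode == 'delete':
            output.append('<del>' + seqm.a[a0:a1] + '</del>')
        elif opcode == 'replace':
            # A replacement is shown as the old span followed by the new span.
            output.append('<del>' + seqm.a[a0:a1] + '</del>')
            output.append('<ins>' + seqm.b[b0:b1] + '</ins>')
        else:
            raise RuntimeError('unexpected opcode: ' + opcode)
    return ''.join(output)


print(show_diff_with_replace(difflib.SequenceMatcher(None, 'abcd', 'abxd')))
```

The tags are added without extra spaces here, so removing the tags and the `<ins>` contents gives back the first string exactly.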