### <font color='pink'>Shaleigh Smith</font>

---

# <font color='salmon'>difflib Module</font>

---

# <font color='scarlet'>What is the difflib Module used for?</font>  

---


# <font color='black'>Comparing Sequences: </font>

### <font color='indigo'>difflib.Differ( )</font> 

### <font color='purple'>difflib.unified_diff( ) & difflib.context_diff( )</font> 

###  <font color='navy'>difflib.SequenceMatcher( )</font> 

###  <font color='green'>difflib.get_close_matches( )</font> 

### <font color='teal'>difflib.ndiff( ) & difflib.restore( )</font> 

###  <font color='turquoise'>difflib.HtmlDiff( )</font> 

---

---

In [1]:
import difflib

---

# <font color='indigo'>difflib.Differ( )</font>

### Compares sequences of text lines

### Displays recognizable deltas (differences)

### Shows specific differences between individual lines

---

### <font color='indigo'>Deltas are shown using symbols at the beginning of the line: </font>

#### - the line was in the 1st sequence but not the 2nd
    
#### + the line was in the 2nd sequence but not the 1st
    
#### ? shows where the differences fall within the line above it (this hint line is not part of either sequence)
    
#### (two spaces) the line is the same in both sequences



---

# <font color='indigo'>difflib.Differ.compare(a,b)</font>



#### .splitlines( ) returns a list of the lines as strings 

---

In [2]:
poem1 = """Sometimes coding is tough.
Sometimes coding is rough.
But other times coding is fun.
Although I would rather be coding in the sun."""

poem1_lines = poem1.splitlines()

poem2 = """Sometimes coding is tough.
Python makes me huff and puff,
And other times, coding is fun! :)
Although I would rather be coding in the sun."""

poem2_lines = poem2.splitlines()


In [3]:

d = difflib.Differ()

poem_difference = d.compare(poem1_lines, poem2_lines)
print('\n'.join(poem_difference))


  Sometimes coding is tough.
- Sometimes coding is rough.
+ Python makes me huff and puff,
- But other times coding is fun.
? ^^^                          ^

+ And other times, coding is fun! :)
? ^^^            +              ^^^^

  Although I would rather be coding in the sun.
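
Because every delta line starts with a two-character code, the delta can be summarized or filtered by looking at that prefix. The cell below is a minimal sketch of that idea using the same two poems; the variable names `counts` and `without_hints` are only illustrative.

```python
import difflib
from collections import Counter

poem1_lines = """Sometimes coding is tough.
Sometimes coding is rough.
But other times coding is fun.
Although I would rather be coding in the sun.""".splitlines()

poem2_lines = """Sometimes coding is tough.
Python makes me huff and puff,
And other times, coding is fun! :)
Although I would rather be coding in the sun.""".splitlines()

# compare() returns a generator, so store it in a list to reuse it
delta = list(difflib.Differ().compare(poem1_lines, poem2_lines))

# The first character of each delta line is its code: '-', '+', '?' or ' '
counts = Counter(line[0] for line in delta)
print(counts)

# Dropping the '?' hint lines keeps only lines that came from an input
without_hints = [line for line in delta if not line.startswith('? ')]
print('\n'.join(without_hints))
```

Filtering out the `?` lines also removes the extra blank lines seen in the output above, since only the hint lines carry their own trailing newline.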


---


In [4]:
dna1 = """AGAGCCGTCGGGTCAAAGTCAGTCAAGTTTGG"""

dna2 = """AGAGCCGTCGGGTCAAAAGTCAGTCAAGTTGG"""

d = difflib.Differ()

dna_difference = d.compare(dna1, dna2)  # plain strings are compared character by character
print('\n'.join(dna_difference))

  A
  G
  A
  G
  C
  C
  G
  T
  C
  G
  G
  G
  T
  C
  A
  A
  A
+ A
  G
  T
  C
  A
  G
  T
  C
  A
  A
  G
  T
  T
- T
  G
  G


---


# <font color='purple'>difflib.unified_diff( )</font> 

# <font color='purple'>difflib.context_diff( )</font>

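### Similar to difflib.Differ, but changes are grouped into hunks

### Less information: only the changed lines plus a few lines of context (3 by default), with no ? hint lines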

---

## <font color='purple'>difflib.unified_diff(a,b)</font> 


In [5]:
poem1 = """Sometimes coding is tough.
Sometimes coding is rough.
But other times coding is fun.
Although I would rather be coding in the sun."""

poem1_lines = poem1.splitlines()

poem2 = """Sometimes coding is tough.
Python makes me huff and puff,
And other times, coding is fun! :)
Although I would rather be coding in the sun."""

poem2_lines = poem2.splitlines()

In [6]:
poem_difference2 = difflib.unified_diff(poem1_lines, poem2_lines)
print('\n'.join(poem_difference2))

--- 

+++ 

@@ -1,4 +1,4 @@

 Sometimes coding is tough.
-Sometimes coding is rough.
-But other times coding is fun.
+Python makes me huff and puff,
+And other times, coding is fun! :)
 Although I would rather be coding in the sun.
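
The blank lines in the output above appear because the header lines end with a newline by default while the poem lines do not. A possible way to tidy this up, sketched below, is to pass `lineterm=''`; the file labels `poem1.txt` and `poem2.txt` are hypothetical names used only to fill in the header.

```python
import difflib

poem1_lines = """Sometimes coding is tough.
Sometimes coding is rough.
But other times coding is fun.
Although I would rather be coding in the sun.""".splitlines()

poem2_lines = """Sometimes coding is tough.
Python makes me huff and puff,
And other times, coding is fun! :)
Although I would rather be coding in the sun.""".splitlines()

diff_lines = list(difflib.unified_diff(
    poem1_lines,
    poem2_lines,
    fromfile='poem1.txt',  # hypothetical label for the first sequence
    tofile='poem2.txt',    # hypothetical label for the second sequence
    lineterm='',           # the inputs have no trailing newlines, so add none
))
print('\n'.join(diff_lines))
```

`difflib.context_diff` accepts the same `fromfile`, `tofile` and `lineterm` arguments.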


---


## <font color='purple'>difflib.context_diff(a,b)</font> 

In [7]:
poem1 = """Sometimes coding is tough.
Sometimes coding is rough.
But other times coding is fun.
Although I would rather be coding in the sun."""

poem1_lines = poem1.splitlines()

poem2 = """Sometimes coding is tough.
Python makes me huff and puff,
And other times, coding is fun! :)
Although I would rather be coding in the sun."""

poem2_lines = poem2.splitlines()

In [8]:
poem_difference2 = difflib.context_diff(poem1_lines, poem2_lines)
print('\n'.join(poem_difference2))

*** 

--- 

***************

*** 1,4 ****

  Sometimes coding is tough.
! Sometimes coding is rough.
! But other times coding is fun.
  Although I would rather be coding in the sun.
--- 1,4 ----

  Sometimes coding is tough.
! Python makes me huff and puff,
! And other times, coding is fun! :)
  Although I would rather be coding in the sun.


---

---


# <font color='navy'>difflib.SequenceMatcher( )</font>


### Compares pairs of sequences of any type

### The sequence elements MUST be hashable 

---

## <font color='navy'> An object is _hashable_ if it has a hash value which never changes during its lifetime</font>



---


### Hashable: 
#### Built-in objects that are immutable
    - Strings
    - Integers
    - Booleans
    - Floats
    - Tuples (when every element is itself hashable)
    
    

### Not Hashable:
#### Objects that are mutable
    - Lists
    - Sets
    - Dictionaries
        

In [9]:
hash(1000112322341232423) # 19 digits long: small enough to hash to itself

1000112322341232423

In [10]:
hash(10001123223412324234) # 20 digits long: reduced modulo 2**61 - 1 on 64-bit CPython

777751186557548430

In [11]:
hash('Hello World') #string; by default the value differs between interpreter sessions

-6749998673946092768

In [12]:
w = [1, 2, 3, 4] #list
hash(w)

TypeError: unhashable type: 'list'

In [13]:
x = (1, 2, 3, 4) #tuple
hash(x)

485696759010151909

In [14]:
y = set(['hi', 'hello', 'bonjour']) #set
hash(y)

TypeError: unhashable type: 'set'

In [15]:
z = {'a': 1, 'b': 2} #dictionary
hash(z)

TypeError: unhashable type: 'dict'
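
Two edge cases are worth checking alongside the cells above: a tuple is only hashable when everything inside it is hashable, and a `frozenset` is the hashable counterpart of a set. The helper `is_hashable` below is a hypothetical convenience function written for this illustration, not part of `difflib`.

```python
def is_hashable(obj):
    """Return True if hash(obj) succeeds (illustrative helper)."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True

print(is_hashable((1, 2, 3)))                    # tuple of integers
print(is_hashable((1, [2, 3])))                  # tuple that contains a list
print(is_hashable(frozenset(['hi', 'hello'])))   # immutable version of a set
```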

---


## <font color='navy'> Why is this important? </font>
### SequenceMatcher accepts any pair of sequences whose elements are hashable

---


## <font color='navy'>difflib.SequenceMatcher(isjunk=None, a='', b='', autojunk=True )</font>


###  Default _isjunk_ argument (None): no elements are ignored 

### Default _autojunk_ argument (True): certain items in a sequence are automatically treated as junk

- If an item has duplicates that make up more than 1% of a sequence that's at least 200 items long it's marked 'popular' and is considered junk.
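
One way to see the autojunk heuristic at work is to inspect the matcher's `bpopular` attribute, which holds the items of the second sequence that were marked popular. The sketch below uses made-up strings that are just over 200 characters long; the exact effect on `.ratio()` depends on the data, so only the popular sets are shown.

```python
import difflib

# Made-up sequences: 'a' and 'b' each appear 100 times, 'x', 'y', 'z' once
a = 'ab' * 100 + 'xyz'   # 203 characters
b = 'ab' * 100 + 'xyz'

with_auto = difflib.SequenceMatcher(None, a, b)                     # autojunk=True
without_auto = difflib.SequenceMatcher(None, a, b, autojunk=False)
too_short = difflib.SequenceMatcher(None, 'ab' * 50, 'ab' * 50)     # only 100 items

print(sorted(with_auto.bpopular))      # items treated as popular
print(sorted(without_auto.bpopular))   # heuristic switched off
print(sorted(too_short.bpopular))      # sequence shorter than 200 items
```

Passing `autojunk=False` is worth considering when long sequences are built from a small alphabet, such as DNA strings.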

### As a rule of thumb, a .ratio() value (score) over 0.6 (or 60%) means the sequences are close matches

---

In [16]:
a = 'Hey how are you doing today?'
b = 'Good, how are you doing today?'

match = difflib.SequenceMatcher(None, a, b)
c = match.ratio()*100
print(c)

86.20689655172413
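
The score comes from the formula 2.0 * M / T, where M is the number of matched elements and T is the total number of elements in both sequences. The cell below is a small sketch that recomputes the score by hand from `get_matching_blocks()`; the names `matched`, `total` and `manual_ratio` are illustrative.

```python
import difflib

a = 'Hey how are you doing today?'
b = 'Good, how are you doing today?'

matcher = difflib.SequenceMatcher(None, a, b)

# M: total size of all matching blocks (the final dummy block has size 0)
matched = sum(block.size for block in matcher.get_matching_blocks())

# T: combined length of both sequences
total = len(a) + len(b)

manual_ratio = 2.0 * matched / total
print(matched, total, manual_ratio * 100)
```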


---

In [17]:
a = 'Hey how are you doing today?'
b = 'Good,                     how are you doing today?'

match = difflib.SequenceMatcher(None, a, b)
c = match.ratio()*100
print(c)

64.1025641025641


---

In [18]:
d = 'Hey how are you doing today?'
e = 'Hey how are you doing today?'

match1 = difflib.SequenceMatcher(None, d, e)
f = match1.ratio()*100
print(f)

100.0


---

In [20]:
g = 'Hey how are you doing today?'
h = 'Fine the weather is nice!'

match3 = difflib.SequenceMatcher(None, g, h)
i = match3.ratio()*100
print(i)

18.867924528301888


---

In [21]:
list1 = [1, 2, 3]
list2 = [2, 3, 6, 7, 9]

match4 = difflib.SequenceMatcher(None, list1, list2)
lists1 = match4.ratio()*100
print(lists1)

50.0


In [22]:
list3 = [[1, 2], [3]]
list4 = [[4, 5, 6], [7, 9]]

match5 = difflib.SequenceMatcher(None, list3, list4)  # raises TypeError: the inner lists are unhashable
lists = match5.ratio()*100
print(lists)

TypeError: unhashable type: 'list'
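
A possible workaround for nested lists is to convert each inner list to a tuple first, since tuples of integers are hashable. This is a minimal sketch; `list5` is a made-up variant added so that the two sequences share one element.

```python
import difflib

list3 = [[1, 2], [3]]
list4 = [[4, 5, 6], [7, 9]]
list5 = [[1, 2], [7, 9]]  # hypothetical data sharing [1, 2] with list3

def as_tuples(nested):
    """Convert each inner list to a tuple so the elements become hashable."""
    return [tuple(item) for item in nested]

no_overlap = difflib.SequenceMatcher(None, as_tuples(list3), as_tuples(list4))
one_shared = difflib.SequenceMatcher(None, as_tuples(list3), as_tuples(list5))

print(no_overlap.ratio() * 100)
print(one_shared.ratio() * 100)
```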

---

In [23]:
dna_1 = 'AGAGCCGTCGGGTCAAAGTCAGTCAAGTTTGG'

dna_2 = 'AGAGCCGTCGGGTCAAAAGTCAGTCAAGTTGG'

dna_match = difflib.SequenceMatcher(None, dna_1, dna_2)
dna_seq = dna_match.ratio()*100
print(dna_seq)


96.875
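
Beyond a single score, `SequenceMatcher.get_opcodes()` describes how to turn the first sequence into the second. The sketch below prints the opcodes for the two DNA strings and, as a sanity check, rebuilds `dna_2` from them; the `rebuilt` variable is illustrative.

```python
import difflib

dna_1 = 'AGAGCCGTCGGGTCAAAGTCAGTCAAGTTTGG'
dna_2 = 'AGAGCCGTCGGGTCAAAAGTCAGTCAAGTTGG'

matcher = difflib.SequenceMatcher(None, dna_1, dna_2)

pieces = []
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    print(f'{tag:7} dna_1[{i1}:{i2}] -> dna_2[{j1}:{j2}]')
    if tag == 'equal':
        # unchanged stretch: copy it from the first sequence
        pieces.append(dna_1[i1:i2])
    elif tag in ('replace', 'insert'):
        # new material comes from the second sequence
        pieces.append(dna_2[j1:j2])
    # 'delete' contributes nothing to the second sequence

rebuilt = ''.join(pieces)
print(rebuilt == dna_2)
```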


---

# <font color='green'>difflib.get_close_matches( )</font> 

###  Compares a word against a list of possibilities using a 'similarity score'

### Returns a list of the best matches that score at least the cutoff, most similar first

---

## <font color='green'>difflib.get_close_matches(word, possibilities, n=3, cutoff=0.6)</font> 

In [24]:
difflib.get_close_matches('Hello', ['Hi', 'Helo', 'Heyo', 'Hell', 'Bonjour', 'Bye'])

['Helo', 'Hell', 'Heyo']

---

In [25]:
difflib.get_close_matches('Hello', ['Hi', 'Hey', 'Heyo', 'Hell', 'Bonjour', 'Bye'], n = 1)

['Hell']

---

In [27]:
difflib.get_close_matches('Hello', ['Hi', 'Hey', 'Heyo', 'Hell', 'Bonjour', 'Bye'], n=5, cutoff=0.4)

['Hell', 'Heyo', 'Hey']
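
The comparison is case-sensitive, so a word typed in the wrong case can miss every possibility. One way around this, sketched below, is to lowercase both sides before matching and then map the hits back to their original spelling. The `suggest` function is a hypothetical helper, not part of `difflib`, and it assumes no two possibilities differ only by case.

```python
import difflib

def suggest(word, vocabulary, n=3, cutoff=0.6):
    """Case-insensitive wrapper around get_close_matches (illustrative helper)."""
    # Map each lowercased entry back to its original spelling
    lowered = {entry.lower(): entry for entry in vocabulary}
    hits = difflib.get_close_matches(word.lower(), list(lowered), n=n, cutoff=cutoff)
    return [lowered[hit] for hit in hits]

greetings = ['Hi', 'Helo', 'Heyo', 'Hell', 'Bonjour', 'Bye']

print(difflib.get_close_matches('hELLO', greetings))  # case-sensitive comparison
print(suggest('hELLO', greetings))                    # case-insensitive comparison
```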

---

# <font color='teal'>difflib.ndiff( )</font> 


### Compares lists of strings

### Returns a Differ-style delta (as a generator)


# <font color='teal'>difflib.restore( )</font> 

### Returns one of the two sequences that generated a delta (which = 1 or 2)

### Used after difflib.ndiff or difflib.Differ.compare

## <font color='teal'>difflib.ndiff(a, b)</font> 



In [28]:
diff = difflib.ndiff('hello\nmy\nname\nis\nShaleigh\n'.splitlines(keepends=True), #line breaks included
                    'hey\nmi\nnae\nis\nShaliehg\n'.splitlines(keepends=True))
print(''.join(diff))

- hello
- my
+ hey
+ mi
- name
?   -
+ nae
  is
- Shaleigh
?      ^ -
+ Shaliehg
?     + ^



---

## <font color='teal'>difflib.restore(delta, which)</font> 



In [29]:
diff = difflib.ndiff('hello\nmy\nname\nis\nShaleigh\n'.splitlines(keepends=True),
             'hey\nmi\nnae\nis\nShaliehg\n'.splitlines(keepends=True))

diff = list(diff)
print(''.join(difflib.restore(diff, 1)))

hello
my
name
is
Shaleigh
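
Passing `2` as the second argument returns the other sequence instead. The sketch below checks that both inputs can be recovered from a single delta; the variable names are illustrative.

```python
import difflib

first = 'hello\nmy\nname\nis\nShaleigh\n'.splitlines(keepends=True)
second = 'hey\nmi\nnae\nis\nShaliehg\n'.splitlines(keepends=True)

# Store the delta in a list so it can be consumed twice
delta = list(difflib.ndiff(first, second))

restored_first = list(difflib.restore(delta, 1))
restored_second = list(difflib.restore(delta, 2))

print(''.join(restored_second))
print(restored_first == first, restored_second == second)
```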



---

#  <font color='turquoise'>difflib.HtmlDiff( )</font> 


### Creates an HTML table (make_table) or a complete HTML file containing the table (make_file)

### Compares text line by line, side by side
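
As a rough sketch of how this could look with the two poems from earlier, the cell below builds both the table and the full page. The column labels `poem1` and `poem2` and the output file name in the comment are assumptions chosen for this example.

```python
import difflib

poem1_lines = """Sometimes coding is tough.
Sometimes coding is rough.
But other times coding is fun.
Although I would rather be coding in the sun.""".splitlines()

poem2_lines = """Sometimes coding is tough.
Python makes me huff and puff,
And other times, coding is fun! :)
Although I would rather be coding in the sun.""".splitlines()

html_diff = difflib.HtmlDiff()

# Just the <table> element, for embedding in an existing page
table = html_diff.make_table(poem1_lines, poem2_lines,
                             fromdesc='poem1', todesc='poem2')

# A complete HTML document that contains the same table plus styling
page = html_diff.make_file(poem1_lines, poem2_lines,
                           fromdesc='poem1', todesc='poem2')

# To view the result in a browser, the page could be written to disk:
# with open('poem_diff.html', 'w') as handle:
#     handle.write(page)
print(len(table), len(page))
```

In a notebook, wrapping the `page` string in `IPython.display.HTML` should render the comparison inline.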

---


In [None]:
?difflib

### Sources

https://pymotw.com/3/difflib/index.html

https://docs.python.org/3.6/library/difflib.html

https://docs.python.org/3.6/glossary.html#term-hashable

https://docs.python.org/2.4/lib/sequence-matcher.html