# ...in Python

## Comparing Data in Python

by Anastasia Ramig

These recipe examples were tested on January 24, 2022.

## 1. Difference between datasets

Python's standard library includes the [difflib](https://docs.python.org/3/library/difflib.html) module for computing deltas, that is, line-by-line differences between two sequences. It can be used, for example, to compare the contents of two text files.
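
As a quick illustration of what a delta is, here is a minimal sketch using `difflib.ndiff` and `difflib.restore` on two short in-memory lists. The sample lines are hypothetical and are not taken from the benchmark files used below.

```python
import difflib

# Hypothetical sample data, not taken from the benchmark files
before = ["F5 0 3\n", "Cl0 1\n"]
after = ["F5 0 3\n", "Cl0 2\n"]

# ndiff yields one string per line: '  ' unchanged, '- ' only in before,
# '+ ' only in after, '? ' hints pointing at the changed characters
delta = list(difflib.ndiff(before, after))
print("".join(delta), end="")

# A delta carries enough information to rebuild either input
rebuilt_before = list(difflib.restore(delta, 1))
rebuilt_after = list(difflib.restore(delta, 2))
```

Because the delta keeps both versions of every changed line, `restore` can recover the first or the second sequence from it.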

We're going to use the `WeiningerCEX_132_reading_smilesvalence.txt` and `avalon_1.2.0_reading_smilesvalence.txt` files to demonstrate some of the difflib functions. You can find these files [here](https://github.com/ualibweb/UALIB_Workshops/tree/master/02_Unix_fall_2020/Udata/Benchmark).

### Import libraries

In [17]:
import difflib

### Context diff

Print the `context_diff` between the two files. Lines that differ are marked with `!`.

In [18]:
# Read both files as lists of lines; the with-blocks close the files afterwards
with open('WeiningerCEX_132_reading_smilesvalence.txt') as f1:
    file1 = f1.readlines()
with open('avalon_1.2.0_reading_smilesvalence.txt') as f2:
    file2 = f2.readlines()
cdiff = difflib.context_diff(file1, file2)
# Each diff line already ends with a newline, so join without a separator
print(''.join(cdiff), end="")

*** 
--- 
***************
*** 43,61 ****
  F5 0 3 3 3 3 3
  Cl0 1
  Cl1 0 3
! Cl2 0 3 3
  Cl3 0 3 3 3
! Cl4 0 3 3 3 3
  Cl5 0 3 3 3 3 3
  Br0 1
  Br1 0 3
! Br2 0 3 3
  Br3 0 3 3 3
! Br4 0 3 3 3 3
  Br5 0 3 3 3 3 3
  I0 1
  I1 0 3
! I2 0 3 3
  I3 0 3 3 3
! I4 0 3 3 3 3
  I5 0 3 3 3 3 3
--- 43,61 ----
  F5 0 3 3 3 3 3
  Cl0 1
  Cl1 0 3
! Cl2 1 3 3
  Cl3 0 3 3 3
! Cl4 1 3 3 3 3
  Cl5 0 3 3 3 3 3
  Br0 1
  Br1 0 3
! Br2 1 3 3
  Br3 0 3 3 3
! Br4 1 3 3 3 3
  Br5 0 3 3 3 3 3
  I0 1
  I1 0 3
! I2 1 3 3
  I3 0 3 3 3
! I4 1 3 3 3 3
  I5 0 3 3 3 3 3
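
The `***` and `---` header lines in the output above are empty because no file labels were passed to `context_diff`. The sketch below shows the `fromfile`, `tofile`, and `n` (number of context lines, default 3) parameters on hypothetical in-memory data. The helper name `labeled_context_diff` and the labels `old.txt` and `new.txt` are made up for illustration.

```python
import difflib

# Hypothetical in-memory data standing in for the two benchmark files
old_lines = ["Cl1 0 3\n", "Cl2 0 3 3\n", "Cl3 0 3 3 3\n"]
new_lines = ["Cl1 0 3\n", "Cl2 1 3 3\n", "Cl3 0 3 3 3\n"]

def labeled_context_diff(a, b, a_name, b_name, context=1):
    """Return a context diff as one string, with labels in the header lines."""
    diff = difflib.context_diff(a, b, fromfile=a_name, tofile=b_name, n=context)
    return "".join(diff)

labeled = labeled_context_diff(old_lines, new_lines, "old.txt", "new.txt")
print(labeled, end="")
```

Passing the real file names as labels makes a saved diff easier to interpret later, and a smaller `n` keeps the output short when the changes are scattered.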


### Unified diff

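Print the `unified_diff` between the two files. Lines only in the first file are prefixed with `-`, and lines only in the second file with `+`.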

In [19]:
# Read both files as lists of lines; the with-blocks close the files afterwards
with open('WeiningerCEX_132_reading_smilesvalence.txt') as f1:
    file1 = f1.readlines()
with open('avalon_1.2.0_reading_smilesvalence.txt') as f2:
    file2 = f2.readlines()
udiff = difflib.unified_diff(file1, file2)
# Each diff line already ends with a newline, so join without a separator
print(''.join(udiff), end="")

--- 
+++ 
@@ -43,19 +43,19 @@
 F5 0 3 3 3 3 3
 Cl0 1
 Cl1 0 3
-Cl2 0 3 3
+Cl2 1 3 3
 Cl3 0 3 3 3
-Cl4 0 3 3 3 3
+Cl4 1 3 3 3 3
 Cl5 0 3 3 3 3 3
 Br0 1
 Br1 0 3
-Br2 0 3 3
+Br2 1 3 3
 Br3 0 3 3 3
-Br4 0 3 3 3 3
+Br4 1 3 3 3 3
 Br5 0 3 3 3 3 3
 I0 1
 I1 0 3
-I2 0 3 3
+I2 1 3 3
 I3 0 3 3 3
-I4 0 3 3 3 3
+I4 1 3 3 3 3
 I5 0 3 3 3 3 3
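
When only the changed lines matter, the unified diff can be post-processed. The following is a minimal sketch, assuming hypothetical in-memory data and a made-up helper named `changed_lines`. It requests no context lines (`n=0`) and sorts the remaining lines by their `-` or `+` prefix.

```python
import difflib
from itertools import islice

# Hypothetical in-memory data standing in for the two benchmark files
old_lines = ["Br1 0 3\n", "Br2 0 3 3\n", "Br3 0 3 3 3\n", "Br4 0 3 3 3 3\n"]
new_lines = ["Br1 0 3\n", "Br2 1 3 3\n", "Br3 0 3 3 3\n", "Br4 1 3 3 3 3\n"]

def changed_lines(a, b):
    """Return (removed, added) lines taken from a unified diff with no context."""
    removed, added = [], []
    diff = difflib.unified_diff(a, b, n=0)
    # The first two lines are the '---' / '+++' file headers, so skip them
    for line in islice(diff, 2, None):
        if line.startswith("-"):
            removed.append(line[1:].rstrip("\n"))
        elif line.startswith("+"):
            added.append(line[1:].rstrip("\n"))
        # '@@ ... @@' hunk markers match neither prefix and are ignored
    return removed, added

removed, added = changed_lines(old_lines, new_lines)
print("removed:", removed)
print("added:  ", added)
```

If the two inputs are identical, `unified_diff` yields nothing and both lists stay empty.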


### Side by Side HTML diff

The `difflib.HtmlDiff` class can generate an HTML-formatted, side-by-side comparison of the files, which can be displayed directly within a Jupyter Notebook. The recipe below saves the comparison to `sbs_diff_table.html` and renders it in the notebook. Note that the output is not shown here.

In [None]:
import difflib
from IPython import display

# Read both files as lists of lines; the with-blocks close the files afterwards
with open('WeiningerCEX_132_reading_smilesvalence.txt') as f1:
    file1 = f1.readlines()
with open('avalon_1.2.0_reading_smilesvalence.txt') as f2:
    file2 = f2.readlines()

# Build a complete HTML page containing the side-by-side comparison table
sbs_diff = difflib.HtmlDiff(tabsize=2)
html = sbs_diff.make_file(fromlines=file1, tolines=file2,
                          fromdesc='WeiningerCEX_132_reading_smilesvalence',
                          todesc='avalon_1.2.0_reading_smilesvalence')

# Save the page to disk, then render the same HTML in the notebook
with open('sbs_diff_table.html', 'w') as outfile:
    outfile.write(html)
display.HTML(html)
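
For long files, a full side-by-side page can be hard to scan. `HtmlDiff` also provides `make_table`, which returns only the comparison table and accepts `context=True` with `numlines` to show just the changed lines and their neighbours. The sketch below uses hypothetical in-memory data and made-up column labels.

```python
import difflib

# Hypothetical in-memory data standing in for the two benchmark files
old_lines = ["Cl1 0 3\n", "Cl2 0 3 3\n", "Cl3 0 3 3 3\n"]
new_lines = ["Cl1 0 3\n", "Cl2 1 3 3\n", "Cl3 0 3 3 3\n"]

sbs_diff = difflib.HtmlDiff(tabsize=2)

# context=True limits the table to changed lines plus `numlines` lines
# of surrounding context on each side
table = sbs_diff.make_table(old_lines, new_lines,
                            fromdesc='old', todesc='new',
                            context=True, numlines=1)

print(table[:200])  # preview the start of the generated HTML
```

In a notebook, the string could be passed to `IPython.display.HTML(table)`. Unlike `make_file`, `make_table` does not include the page-level stylesheet, so the table is rendered without the colour highlighting unless the styles are supplied separately.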