# Data mining

Now that we have re-analysed the outbreak data from the [paper](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4001082), we may wish to augment that original study with new data or ask some new questions.

For this workflow, we'll be sketching with:

* Bloom filters
* HyperLogLog (coming soon)

***

## Isolates with matching resistome profiles

In our [resistome profiling workflow](./r4.3.Resistome-profiling.ipynb), we did a MinHash screen of raw reads and then aligned to AMR gene variation graphs. We ended up with the following resistome profile for isolate EC1a (ERX168346):

```
(Bla)SHV-12
(Sul)SulII
(Tmt)DfrA1
(Bla)SHV-183
(Sul)SulI
(AGly)Sat-2A
```

> If we looked at the alignments against the variation graphs, we would see that blaSHV-183 is likely a mis-call of blaSHV-12: the two sequences are very similar, but blaSHV-12 has more support. We will therefore remove blaSHV-183 from the profile.

Now, what if we wanted to find other isolates with a matching resistome profile? Trawling through the [ENA](https://www.ebi.ac.uk/ena) to download the sequence data and then re-running the resistome profiling for every isolate would take a long time. Instead, let's use the brilliant [BIGSI](http://www.bigsi.io/) and its index of the ENA.

* start with getting the sequences for all the AMR genes that are in our isolate's resistome profile:

In [None]:
# download the AMR gene database we used for resistome profiling
!wget https://raw.githubusercontent.com/will-rowe/groot/master/db/full-ARG-databases/arg-annot-db/argannot-args.fna

# load the FASTA file into a database
import screed
screed.make_db("./argannot-args.fna")
fadb = screed.ScreedDB("./argannot-args.fna")

# store each gene
SHV12 = fadb["argannot~~~(Bla)SHV-12~~~FJ685654:24-860"].sequence
SulII = fadb["argannot~~~(Sul)SulII~~~EU360945:1617-2432"].sequence
DfrA1 = fadb["argannot~~~(Tmt)DfrA1~~~JQ794607:474"].sequence
SulI = fadb["argannot~~~(Sul)SulI~~~AF071413:6700-7539"].sequence
Sat2A = fadb["argannot~~~(AGly)Sat-2A~~~X51546:518-1042"].sequence

# create the resistome profile
resistomeProfile = {"SHV12":SHV12, "SulII":SulII, "DfrA1":DfrA1, "SulI":SulI, "Sat2A":Sat2A}
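
The database keys used above pack three fields into one header, separated by `~~~`. As a small sketch, the helper below splits a header into those fields; the function name `parseHeader` and the field names `database`, `gene` and `location` are labels invented for this example, not part of screed or the ARG-ANNOT database.

```python
def parseHeader(header):
    """Split a 'database~~~gene~~~location' FASTA header into its three fields."""
    database, gene, location = header.split("~~~")
    return {"database": database, "gene": gene, "location": location}


# two of the headers used in the cell above
headers = [
    "argannot~~~(Bla)SHV-12~~~FJ685654:24-860",
    "argannot~~~(Sul)SulII~~~EU360945:1617-2432",
]

for header in headers:
    fields = parseHeader(header)
    print("{}\t{}\t{}".format(fields["gene"], fields["database"], fields["location"]))
```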

* now, set up the search function for our calls to the BIGSI ENA index:

In [None]:
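import requests

def search(seq, threshold):
    """Query the BIGSI ENA index and return the matching samples and their classifications."""
    # set up the api call
    url = "http://api.bigsi.io/search?threshold=%f&seq=%s" % (float(threshold), seq)
    response = requests.get(url)
    # fail early on an HTTP error rather than trying to parse an error page
    response.raise_for_status()
    results = response.json()
    samples = []
    # bigsi includes species metadata, derived from Bracken + Kraken analysis of the ENA data
    classification = []
    # the first top-level entry holds a "results" mapping of sample accession -> metadata
    for sample, metadata in list(results.values())[0]["results"].items():
        samples.append(sample)
        classification.append(metadata)
    return samples, classification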

# set threshold for proportion of query k-mers contained in results
threshold=0.8
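
To make the threshold concrete, the sketch below computes the proportion of a query's k-mers that also occur in a sample, using exact Python sets. The function names, the toy sequences and the choice of `k=4` are assumptions for illustration; BIGSI answers this kind of question from its Bloom filter index rather than from exact sets, so treat this as the idea rather than its implementation.

```python
def kmerSet(seq, k):
    """Return the set of distinct k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def containment(query, sample, k):
    """Proportion of the query's distinct k-mers that are also found in the sample."""
    queryKmers = kmerSet(query, k)
    if not queryKmers:
        return 0.0
    return len(queryKmers & kmerSet(sample, k)) / len(queryKmers)


# toy sequences (not real data): the sample lacks the last k-mer of the query
toyQuery = "ATGCGTACGT"
toySample = "TTATGCGTACGAAC"

proportion = containment(toyQuery, toySample, k=4)
print("proportion of query k-mers found: {:.3f}".format(proportion))
print("passes a 0.8 threshold: {}".format(proportion >= 0.8))
```

Here six of the seven distinct query 4-mers are present in the sample, so the toy sample would pass a threshold of 0.8 but fail a threshold of 0.9.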

* start with a sanity check and make sure our isolate is returned from the ENA:

In [None]:
# BIGSI uses ERR, so look ours up (see r.4.2.Sample-QC for the table)
isolateID = "ERR193657"

# to save us re-running the BIGSI search (not that it took long), store the results for later
results = []
classificationDict = {}

# search BIGSI for each gene in our resistome profile
for gene, sequence in resistomeProfile.items():
    print("query: {}".format(gene))

    # run the search
    genomes, classification = search(sequence, threshold)

    # store the hits for this gene once, whether or not our isolate is among them
    results.append(genomes)

    # record the species prediction for every hit
    for ERR, metadata in zip(genomes, classification):
        classificationDict[ERR] = metadata["species"]

    # check if our isolate was in the search results for this gene
    if isolateID in genomes:
        print("\t- our isolate was returned in the results")
    else:
        print("\t- our isolate was NOT returned in the results")


* great - our isolate was returned for each gene in its resistome profile. Now let's do something more interesting. Using the search results, find all the isolates in BIGSI that carry every gene in our isolate's resistome profile:

In [None]:
# use a set data type
matchingIsolates = set(results[0])

# get intersection on search results for each gene
for isolates in results[1:]:
    matchingIsolates.intersection_update(isolates)

# another sanity check, make sure we have matches and that our isolate is one of them
if len(matchingIsolates) < 1:
    print("fail. no isolates matching our resistome profile")
elif isolateID not in matchingIsolates:
    print("fail. our isolate was not returned in the BIGSI search for the resistome profile")
else:
    # print the number of isolates we have that match our resistome profile
    print("success: {} isolates in ENA matching the resistome profile".format(len(matchingIsolates)))
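
The intersection step is easiest to see on a toy example. In the sketch below, each inner list stands for the hits returned for one gene; the sample names are placeholders invented for illustration, not real accessions.

```python
# one list of hits per gene (placeholder sample names, not real data)
toyResults = [
    ["sampleA", "sampleB", "sampleC"],
    ["sampleB", "sampleC"],
    ["sampleC", "sampleB", "sampleD"],
]

# keep only the samples that appear in the hits for every gene
shared = set(toyResults[0])
for hits in toyResults[1:]:
    shared.intersection_update(hits)

print(sorted(shared))
```

Only `sampleB` and `sampleC` appear in all three lists, so they are the only samples that survive the intersection.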

* we stored the classifications that BIGSI assigned to each entry in its index (using Kraken/Bracken); now we can check what we have:

In [None]:
for isolate in matchingIsolates:
    print("isolate:")
    print("\tERR= {}" .format(isolate))
    print("\tpred.= {}" .format(classificationDict[isolate]))
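
With many matching isolates, a per-species tally can be easier to read than one entry per isolate. The sketch below shows one way to do this with `collections.Counter`; the dictionary is toy data standing in for `classificationDict`, and it assumes each classification is a plain species string.

```python
from collections import Counter

# toy stand-in for classificationDict (placeholder sample names, not real data)
toyClassifications = {
    "sampleA": "Enterobacter cloacae",
    "sampleB": "Escherichia coli",
    "sampleC": "Enterobacter cloacae",
}

# count how many isolates were assigned to each species
speciesCounts = Counter(toyClassifications.values())

# list the species from most to least common
for species, count in speciesCounts.most_common():
    print("{}\t{}".format(count, species))
```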

* there are a few *E. cloacae* isolates in there; let's pull them out into their own list:

In [None]:
import re

cloacaeIsolates = []
for isolate in matchingIsolates:
    # name the result `match` so that we don't overwrite our BIGSI search() function
    match = re.search(r'Enterobacter\scloacae', classificationDict[isolate], re.M|re.I)
    if match:
        cloacaeIsolates.append(isolate)

print("success: {} E. cloacae isolates matching the resistome profile".format(len(cloacaeIsolates)))

* we can try getting some more information on them:

In [None]:
# code adapted from: https://bioinfo.umassmed.edu/bootstrappers/guides/main/python_get_sra_run_ids.html
import requests, csv, io

def getInfoTableFromSearchTerm(search):
    # request the run info table (CSV) for the search term from the SRA
    payload = {"save": "efetch", "db": "sra", "rettype": "runinfo", "term": search}
    r = requests.get('http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi', params=payload)
    if r.status_code != 200:
        raise Exception("Error in downloading from " + str(r.url) + " got response code " + str(r.status_code))
    if r.text.isspace():
        raise Exception("Got blank string from " + str(r.url))
    # parse the CSV text into one dictionary per run
    infoRows = list(csv.DictReader(io.StringIO(r.text)))
    if len(infoRows) == 0:
        raise Exception('Found %d entries in SRA for "%s" when expecting at least 1' % (len(infoRows), search))
    return infoRows
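
The parsing step in this function can be tried offline. The sketch below feeds a small mock CSV string to `csv.DictReader`; the column names `LoadDate`, `Platform`, `BioProject` and `CenterName` are the ones used in this notebook, while the `Run` column and all of the values are placeholders assumed for illustration.

```python
import csv
import io

# mock run info text (placeholder values, not a real SRA record)
mockRunInfo = (
    "Run,LoadDate,Platform,BioProject,CenterName\n"
    "RUN_1,2000-01-01 00:00:00,ILLUMINA,PROJECT_1,CENTRE_1\n"
)

# DictReader uses the first line as the header and yields one dictionary per row
rows = list(csv.DictReader(io.StringIO(mockRunInfo)))

for row in rows:
    print("Platform:\t{}".format(row.get("Platform")))
    # .get() returns None for a column that is not in the table
    print("Missing column:\t{}".format(row.get("NotAColumn")))
```

Because `.get()` returns `None` for an absent column, formatting the value is safer than concatenating it to a string with `+`.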

In [None]:
for isolate in cloacaeIsolates:
    print(isolate)

    # look up the SRA run info (the function raises an exception if there are no hits)
    try:
        infoTable = getInfoTableFromSearchTerm(isolate)
    except Exception as err:
        print("\tSRA lookup failed - {}".format(err))
        continue

    # print some bits of info (format() copes with fields that are missing)
    for row in infoTable:
        print("\tUpload Date:\t{}".format(row.get("LoadDate")))
        print("\tPlatform:\t{}".format(row.get("Platform")))
        print("\tBioproject:\t{}".format(row.get("BioProject")))
        print("\tSubmitting Centre:\t{}".format(row.get("CenterName")))

>More content coming soon