Instructions for accessing PRIDE programmatically are at http://www.ebi.ac.uk/pride/help/archive/access/webservice
This first code block is a direct copy of their example, used just to work out how to make the web calls. It also provides
a quick test of whether their web services are running correctly, which is not always the case.

In [1]:

import json
import urllib.request as urllib2


projects = ["PXD004033", "PXD003805"]
# the accessions are hard-coded here rather than read from the command line

# for each of the provided project accessions retrieve the record and print some details
for accession in projects:
  try:
    # Set the request URL
    url = 'http://www.ebi.ac.uk/pride/ws/archive/project/' + accession
    # Create the request
    req = urllib2.Request(url)
    # Send the request and retrieve the data
    resp = urllib2.urlopen(req).read()
    # Interpret the JSON response
    project = json.loads(resp.decode('utf8'))
    #print (project)
    # Output some project properties
    print(project['accession'] + ' - ' + project['title'])
    print('\tsubmission date: ' + project['publicationDate'])
    print('\tsubmission type: ' + project['submissionType'])
    print('\tsubmitter email: ' + project['submitter']['email'])
    print('\tnumber of assays: ' + str(project['numAssays']))
  except Exception:
    # report the accession string: by this point `project` may already hold the decoded dict
    print('Error for project: ' + accession + ' perhaps this project does not exist?')

PXD004033 - Metaproteomic Monitoring of Perinatal Mouse Gut Microbiota
	submission date: 2016-10-13
	submission type: PARTIAL
	submitter email: stefano.levimortera@opbg.net
	number of assays: 0
PXD003805 - Metaproteome of chicken gastrointestinal tract microbiota
	submission date: 2016-10-06
	submission type: PARTIAL
	submitter email: jseifert@uni-hohenheim.de
	number of assays: 0
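
The request, read and decode steps above are repeated in every later cell. A minimal sketch of how they could be wrapped once is given here. The name `fetch_json`, the 30-second timeout and the `opener` parameter are assumptions made for illustration and appear in neither the notebook nor the PRIDE help page. The sketch reports network and HTTP failures separately, so that a service outage is not mistaken for a missing project.

```python
import json
import urllib.error
import urllib.request


def fetch_json(url, timeout=30, opener=urllib.request.urlopen):
    """Return the decoded JSON body of `url`, or None on a network/HTTP error.

    `opener` is a hypothetical hook that exists only so the function can be
    exercised without touching the network; by default it is urlopen.
    """
    try:
        with opener(url, timeout=timeout) as resp:
            return json.loads(resp.read().decode('utf8'))
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (e.g. unknown accession)
        print('HTTP %s for %s' % (err.code, url))
    except urllib.error.URLError as err:
        # The server could not be reached at all
        print('Could not reach %s: %s' % (url, err.reason))
    return None
```

With this in place the loop body would reduce to something like `project = fetch_json(url)` followed by a check for `None`.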


In [7]:
#Given that we are going to use this little loop for
#each batch of results from PRIDE's webservice, I figured we should make a function out of it
import TaxonomyQueries
import PrideData

def CheckOrganismsFromBatchRequest(BatchOfResults):
    for Result in BatchOfResults:
        Object = PrideData.PrideSubmission(Result)
        ProjectAccession = Object.GetAccession()
        OrganismList = Object.GetOrgList()
       
        for Org in OrganismList:
            # 1. check to see if I have seen this before
            if Org in Dictionary_IsNotBacteria:
                #don't really care about it
                continue
            if Org in Dictionary_IsBacteria:
                #there is an organism in this project that is a bacterium! Hooray!
                #save this to the list of objects that I care about
                PrideSubmissionsWithBacteria.append(Object)
                continue
            #I only get here because Org has not been seen before. Time for the NCBI web calls
            #sometimes, they add parenthetical stuff about strains or common names. So let's 
            #just get the first two and that will be genus, species
            TaxonID = 0
            ManyNames = Org.split(' ')
            if len(ManyNames) < 2:
                TaxonID = TaxonomyQueries.GetTaxonIDFromSingleWord(ManyNames[0])
            else:
                TaxonID = TaxonomyQueries.GetTaxonIDFromGenusSpecies(ManyNames[0], ManyNames[1])
            if TaxonID == 0:
                #we had one of several failures here. Print some error message and move on
                #notably we don't do anything with the created Object. Just let it die 
                print("Error: Unable to get TaxonID for :%s: project %s"%(Org, ProjectAccession))
                continue
                
            IsBacteria = TaxonomyQueries.TestIsBacteriaWithTaxonID(TaxonID)
            if IsBacteria:
                Dictionary_IsBacteria[Org] = TaxonID
                PrideSubmissionsWithBacteria.append(Object)
            else:
                Dictionary_IsNotBacteria[Org] = TaxonID
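
One thing to note about the loop above: it appends the submission once per bacterial organism, so a project that lists two bacteria is stored twice (PXD003275 shows up twice in the final output of this notebook). A rough sketch of the same cache-then-lookup idea that stops at the first bacterial hit follows. The names `genus_species`, `has_bacteria` and `lookup` are hypothetical; `lookup` stands in for the two TaxonomyQueries calls and is assumed to return True, False, or None when no taxon ID can be found.

```python
def genus_species(org_name):
    """Return the first two words of a PRIDE organism string (or the single
    word if that is all there is), dropping strain and common-name suffixes."""
    return tuple(org_name.split()[:2])


def has_bacteria(org_list, is_bacteria_cache, lookup):
    """Return True as soon as one organism in `org_list` is bacterial.

    `is_bacteria_cache` maps organism name -> bool and is filled as we go,
    so each distinct name triggers at most one successful lookup.
    """
    for org in org_list:
        if org not in is_bacteria_cache:
            verdict = lookup(genus_species(org))
            if verdict is None:
                continue  # unresolved name: skip it without caching
            is_bacteria_cache[org] = verdict
        if is_bacteria_cache[org]:
            return True  # one hit is enough, so the caller appends only once
    return False


# Stand-in for the NCBI queries, for illustration only
known = {('Escherichia', 'coli'): True, ('Homo', 'sapiens'): False}
cache = {}
print(has_bacteria(['Homo sapiens (Human)', 'Escherichia coli'], cache, known.get))  # prints True
```

The caller would then append the submission only when `has_bacteria` returns True. Organisms listed after the first hit are left unclassified until a later project mentions them.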


In [3]:
#set aside some containers to keep track of things that I've seen before. I don't want to keep calling
#the webservice if I don't have to
Dictionary_IsBacteria = {} # key = species, value = taxonID
Dictionary_IsNotBacteria = {} # key = species, value = taxonID
PrideSubmissionsWithBacteria = [] # list of PrideSubmission objects


In [8]:
#1. Get an idea of how many projects there are at PRIDE
url_numProjects = 'http://www.ebi.ac.uk/pride/ws/archive/project/count'
req = urllib2.Request(url_numProjects)
# Send the request and retrieve the data
resp = urllib2.urlopen(req).read()
#this response is a bytes object, which int() accepts directly #print (type(resp))

NumProjects = int(resp)
#NumProjects = 91 # a cutout for testing
BatchSize = 10
Pages = (NumProjects//BatchSize) + 1 # // is integer (floor) division; the +1 covers the final partial page

#2. Now we iteratively query the EBI webservices to learn about projects.
#   We are specifically interested in finding submissions that contain Bacterial species
for PageNumber in range(Pages):
    #build a URL to get back a batch of projects
    url = 'http://www.ebi.ac.uk/pride/ws/archive/project/list?show=%s&page=%s&order=desc'%(BatchSize,PageNumber)
    print ("**Sending webRequest to EBI for PRIDE projects, page %s of %s"%(PageNumber, Pages))
    #print (url)
    req = urllib2.Request(url)
    # Send the request and retrieve the data
    resp = urllib2.urlopen(req).read()
    # Interpret the JSON response 
    ProjectBatch = json.loads(resp.decode('utf8'))
    # type(ProjectBatch) shows this is a dictionary with a single key/value pair:
    # key = 'list', value = what appears to be a list of project records.
    ListOfResults = ProjectBatch['list']
    CheckOrganismsFromBatchRequest(ListOfResults)


**Sending webRequest to EBI for PRIDE projects, page 0 of 304
**Sending webRequest to EBI for PRIDE projects, page 1 of 304
**Sending webRequest to EBI for PRIDE projects, page 2 of 304
**Sending webRequest to EBI for PRIDE projects, page 3 of 304
Error: Unable to get TaxonID for :human gut metagenome: project PXD004039
**Sending webRequest to EBI for PRIDE projects, page 4 of 304
**Sending webRequest to EBI for PRIDE projects, page 5 of 304
**Sending webRequest to EBI for PRIDE projects, page 6 of 304
Error: Unable to get TaxonID for :thiotrophic endosymbiont of Bathymodiolus azoricus (Menez Gwen): project PXD004061
Error: Unable to get TaxonID for :methanotrophic endosymbiont of Bathymodiolus azoricus (Menez Gwen): project PXD004061
**Sending webRequest to EBI for PRIDE projects, page 7 of 304
**Sending webRequest to EBI for PRIDE projects, page 8 of 304
**Sending webRequest to EBI for PRIDE projects, page 9 of 304
**Sending webRequest to EBI for PRIDE projects, page 10 of 304
**Send
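
A note on the page arithmetic in the cell above: `(NumProjects//BatchSize) + 1` requests one page more than needed whenever the project count is an exact multiple of the batch size. A small sketch using ceiling division avoids that. The function names `page_count` and `page_urls` are made up for this example, and the sketch assumes, as the loop above does, that page numbering starts at 0.

```python
def page_count(num_projects, batch_size):
    """Number of pages needed to cover `num_projects` items (ceiling division)."""
    return -(-num_projects // batch_size)


def page_urls(num_projects, batch_size):
    """Yield one project-list URL per page, using the same endpoint and
    query string as the cell above."""
    base = 'http://www.ebi.ac.uk/pride/ws/archive/project/list?show=%s&page=%s&order=desc'
    for page in range(page_count(num_projects, batch_size)):
        yield base % (batch_size, page)


print(page_count(91, 10))  # prints 10
print(page_count(90, 10))  # prints 9
```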

In [6]:
#Next I should try to figure out how much coverage exists for these data sets. Perhaps some of them are really small.

#for the moment, just try and aggregate things
for Project in PrideSubmissionsWithBacteria:
    InstrumentList = Project.GetInstrument()
    OrgList = Project.GetOrgList()
    #print (InstrumentList)
    if 'Q Exactive' not in InstrumentList:
        continue
    print ("%s: %s: %s" %(OrgList, Project.GetAccession(), Project.GetDescription()))


['Streptococcus pyogenes M1 GAS']: PXD005167: Proteomics characterisation of membrane vesicles (MV) and corres
['Mycobacterium marinum ATCC BAA-535']: PXD003766: Pathogenic mycobacteria contain up to five type VII secretion (T
['Clostridia']: PXD004512: Metaproteomics was used to identify proteins from a mized commun
['Homo sapiens (Human)', 'Escherichia coli']: PXD003640: Despite the outstanding advantages that isobaric labelling has t
['Escherichia coli']: PXD003468: To study incorporation of nonproteinogenic amino acids in bacter
['Streptococcus pyogenes serotype M1 (strain ATCC BAA-947 / MGAS5005)', 'Homo sapiens (Human)', 'Mus musculus (Mouse)']: PXD003405: In order to test the usability of the new MS feature detection a
['Alteromonas macleodii ATCC 27126', 'Breviatea', 'Arcobacter']: PXD003275: The Breviatea form a lineage of free-living protists that emerge
['Alteromonas macleodii ATCC 27126', 'Breviatea', 'Arcobacter']: PXD003275: The Breviatea form a lineage of free-living pro
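
As a possible next step in the aggregation, here is a small sketch that tallies projects per instrument while counting each accession only once. The function `instrument_counts` is hypothetical, and it assumes the data has first been reduced to (accession, instrument list) pairs, for example from `GetAccession()` and `GetInstrument()`, with the latter assumed to return a list of names.

```python
from collections import Counter


def instrument_counts(records):
    """Count projects per instrument from (accession, instrument_list) pairs,
    counting each accession once even if it occurs several times."""
    seen = set()
    counts = Counter()
    for accession, instruments in records:
        if accession in seen:
            continue  # repeated submission, already counted
        seen.add(accession)
        counts.update(set(instruments))
    return counts


records = [
    ('PXD003275', ['Q Exactive']),
    ('PXD003275', ['Q Exactive']),  # repeated entry, as in the output above
    ('PXD003468', ['Q Exactive']),
]
print(instrument_counts(records))  # prints Counter({'Q Exactive': 2})
```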