# SNARE-Protein Project

In [9]:
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup as BS
import urllib

## Do Any Known SNARE Family Proteins Belong to Mycobacterium tuberculosis?

### 1. Get SNARE family protein IDs from InterPro

In [14]:
# downloadProteinID function
# input:
#   None
# output:
#   the name of the file containing the list of protein IDs downloaded from InterPro
def downloadProteinID ():
    url = "http://www.ebi.ac.uk/interpro/entry/IPR010989/proteins-matched?taxonomy=2&export=ids"
    r = requests.get(url, allow_redirects=True)
    # Use a context manager so the file is flushed and closed before it is read again
    with open("SNARE_proteins_bacteria.txt", 'wb') as f:
        f.write(r.content)
    return "SNARE_proteins_bacteria.txt"

# The file of protein IDs downloaded from InterPro
proteinIDs = downloadProteinID ()
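
As a minimal sketch, the downloaded ID file can be loaded into a Python list before querying UniProt, which makes it easy to check how many IDs were retrieved. The helper name `read_protein_ids` is hypothetical and not part of the original notebook, and it assumes the file holds one protein ID per line.

```python
def read_protein_ids(path):
    """Return the non-empty, whitespace-stripped lines of a protein ID file.

    Assumption: the file contains one protein ID per line.
    """
    with open(path, "r") as handle:
        return [line.strip() for line in handle if line.strip()]

# Possible usage with the file name returned by downloadProteinID():
#   ids = read_protein_ids(proteinIDs)
#   print(len(ids))
```

Stripping each line also removes the trailing newline, which matters when an ID is later pasted into a URL.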

### 2. Get organism names from UniProt

In [15]:
# searchUniport_stable function
# input:
#   filename: string
#             the name of a file containing a list of protein IDs
# output:
#   a file listing every organism name matched to the protein IDs, with the
#   corresponding frequency of appearance
# SideNote: This function accomplishes the same goal as searchUniport_fast, but
#           it uses the UniProt API, so the result is stable but rather slow;
#           it can take up to 10 minutes.
def searchUniport_stable (filename):
    orgDic = {}
    with open(filename, "r") as proteinIDs:
        for protein in proteinIDs:
            protein = protein.strip()  # drop the trailing newline so the URL stays valid
            if not protein:
                continue
            url = "https://www.uniprot.org/uniprot/?query=id:" + protein + "&columns=organism&format=tab"
            result = requests.get(url).text
            # line 0 is the column header; line 1 holds the organism name
            organism = result.splitlines()[1]
            orgDic[organism] = orgDic.get(organism, 0) + 1
    # write the organisms from most to least frequent
    with open("organism.txt", "w") as outFile:
        for org_freqs in sorted(orgDic, key=orgDic.get, reverse=True):
            outFile.write(str(orgDic[org_freqs]) + " " + org_freqs + "\n")
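
The function above takes the second line of the tab-formatted response as the organism name, which raises an error when no row comes back for an ID. A minimal sketch of a more defensive parser follows; `parse_organism` is a hypothetical helper, and it assumes the response consists of a header line followed by at most one data row whose first field is the organism.

```python
def parse_organism(tab_text):
    """Return the organism from a tab-formatted response, or None if no row is present.

    Assumption: line 0 is the column header and line 1 is the only data row.
    """
    lines = tab_text.splitlines()
    if len(lines) < 2 or not lines[1].strip():
        return None
    # Only the organism column is requested, so take the first tab-separated field
    return lines[1].split("\t")[0].strip()
```

Returning `None` lets the caller decide whether to skip the ID or record it as unmatched.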

In [16]:
# searchUniport_fast function
# input:
#   filename: string
#             the name of a file containing a list of protein IDs
# output:
#   a file listing every organism name matched to the protein IDs, with the
#   corresponding frequency of appearance
# SideNote: This function accomplishes the same goal as searchUniport_stable, but
#           it scrapes the entry page with Python's BeautifulSoup package, so the
#           result could be unstable but is faster than using the UniProt API.
def searchUniport_fast (filename):
    orgDic = {}
    with open(filename, "r") as proteinIDs:
        for protein in proteinIDs:
            protein = protein.strip()  # drop the trailing newline so the URL stays valid
            if not protein:
                continue
            text = requests.get('http://www.uniprot.org/uniprot/' + protein).text
            soup = BS(text, "html.parser")  # name the parser explicitly
            title = soup.head.title.text
            # the organism is the third " - "-separated field of the page title
            organism = title.split(" - ")[2]
            orgDic[organism] = orgDic.get(organism, 0) + 1
    # write the organisms from most to least frequent
    with open("organism.txt", "w") as outFile:
        for org_freqs in sorted(orgDic, key=orgDic.get, reverse=True):
            outFile.write(str(orgDic[org_freqs]) + " " + org_freqs + "\n")

Get the organism names from UniProt. This can take a while, depending on which function is used.
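
Both search functions end with the same counting step: tally how often each organism appears and write the organisms from most to least frequent. A minimal sketch of that step using `collections.Counter` is given here; `tally_organisms` is a hypothetical helper, not part of the original notebook.

```python
from collections import Counter

def tally_organisms(names):
    """Return 'count name' lines, most frequent organism first."""
    counts = Counter(names)
    # most_common() sorts by descending count; ties keep first-seen order
    return [f"{count} {name}" for name, count in counts.most_common()]
```

The returned lines use the same "count name" layout that the search functions write to organism.txt.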

In [17]:
searchUniport_fast(proteinIDs)

### 3. Search the organism names for Mycobacterium tuberculosis

In [18]:
# queryOrg function
# input:
#   queryName: string
#              the name of the organism to look for in the organism list
#   orgFile: string
#            the name of the file written by the searchUniport functions, which
#            contains all the organism names that have a specific protein family
# output:
#   a boolean that tells whether the query organism is in the organism list
def queryOrg (queryName, orgFile):
    qLower = queryName.lower()
    with open(orgFile, "r") as f:
        for org in f:
            # case-insensitive substring match against each line
            if qLower in org.lower():
                return True
    return False

In [20]:
# Determine if TB is in the organism list
query = "Mycobacterium tuberculosis"
organism = "organism.txt"
result = queryOrg(query, organism)
print ("Mycobacterium tuberculosis is in the name list: " + str(result))

Mycobacterium tuberculosis is in the name list: False
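
The check above only reports whether a match exists. When a match is found, it can be useful to see which lines matched, for example to tell strains apart. A minimal sketch follows, using the same case-insensitive substring test; `matching_organisms` is a hypothetical helper, not part of the original notebook.

```python
def matching_organisms(query, lines):
    """Return the stripped lines that contain `query`, ignoring case."""
    q = query.lower()
    return [line.strip() for line in lines if q in line.lower()]

# Possible usage with the file written by the search functions:
#   with open("organism.txt") as f:
#       hits = matching_organisms("Mycobacterium tuberculosis", f)
```

An empty list corresponds to the `False` result of `queryOrg`.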


### 4. Conclusion

## Blast SNARE-like proteins against Mycobacterium tuberculosis

### 1. Import Data

Import a list of all (most) predicted "Incs" from C. trachomatis from the paper *Expression and Localization of Predicted Inclusion Membrane Proteins in Chlamydia trachomatis* by Mary M. Weber et al. Store the locus tags for the 3 corresponding strains in **C.trachomatis_predict_Incs.txt**.

Change the locus tags for the D/UW-3/CX strain to match the NCBI locus tag format and store the changed tags in the DUW-3CX_predict_Incs.txt file using the following Unix command:
```shell
cut -d " " -f 1 C.trachomatis_predict_Incs.txt | sed s/CT/CT_/g > DUW-3CX_predict_Incs.txt
```
Get the locus tags for the L2/434/Bu and A/HAR-13 strains using similar commands:
```shell
cut -d " " -f 2 C.trachomatis_predict_Incs.txt > L2434Bu_predict_Incs.txt
cut -d " " -f 3 C.trachomatis_predict_Incs.txt > AHAR13_predict_Incs.txt
```
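
For readers who prefer to stay in Python, the `cut` and `sed` steps above can be sketched as a single function. This is an illustration only: `split_strain_columns` is a hypothetical helper, and it assumes every row holds three space-separated fields in the order D/UW-3/CX, L2/434/Bu, A/HAR-13. Unlike `cut`, it skips rows with fewer than three fields.

```python
def split_strain_columns(text):
    """Split space-separated rows into three per-strain lists of locus tags."""
    duw, l2, ahar = [], [], []
    for row in text.splitlines():
        fields = row.split(" ")
        if len(fields) < 3:
            continue  # skip blank or incomplete rows
        # Same substitution as `sed s/CT/CT_/g` on the first column
        duw.append(fields[0].replace("CT", "CT_"))
        l2.append(fields[1])
        ahar.append(fields[2])
    return duw, l2, ahar
```

The header row passes through unchanged because the strain name D/UW-3/CX does not contain the substring "CT".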

Import the protein tags for each strain as a DataFrame:

In [6]:
DUW3CX_proteins = pd.read_csv("DUW-3CX_predict_Incs.txt", sep = " ", index_col = False)
L2434Bu_proteins = pd.read_csv("L2434Bu_predict_Incs.txt", sep = " ", index_col = False)
AHAR13_proteins = pd.read_csv("AHAR13_predict_Incs.txt", sep = " ", index_col = False)

print(DUW3CX_proteins.head())
print(L2434Bu_proteins.head())
print(AHAR13_proteins.head())

  D/UW-3/CX
0    CT_005
1    CT_006
2    CT_036
3    CT_058
4    CT_079
  L2/434/Bu
0   CTL0260
1   CTL0261
2   CTL0291
3   CTL0314
4   CTL0335
  A/HAR-13
0  CTA0006
1  CTA0007
2  CTA0038
3  CTA0062
4  CTA0084


### 2. Set up Entrez Direct to perform BLAST search

Follow the directions on the NCBI website [here](https://www.ncbi.nlm.nih.gov/books/NBK179288/) to install Entrez Direct: E-utilities on the UNIX Command Line.

First, run the following code in a terminal:
```shell
cd ~
  /bin/bash
  perl -MNet::FTP -e \
    '$ftp = new Net::FTP("ftp.ncbi.nlm.nih.gov", Passive => 1);
     $ftp->login; $ftp->binary;
     $ftp->get("/entrez/entrezdirect/edirect.tar.gz");'
  gunzip -c edirect.tar.gz | tar xf -
  rm edirect.tar.gz
  builtin exit
  export PATH=${PATH}:$HOME/edirect >& /dev/null || setenv PATH "${PATH}:$HOME/edirect"
  ./edirect/setup.sh
```
Then add the edirect directory to PATH in the shell startup files so that its commands can be run from any directory:

```shell
echo "source ~/.bash_profile" >> $HOME/.bashrc
echo "export PATH=\${PATH}:/Users/xuqiantan/edirect" >> $HOME/.bash_profile
```
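
Before moving on, it may help to confirm from the notebook that the Entrez Direct commands are visible on PATH. A minimal sketch using the standard library's `shutil.which` follows; `missing_tools` is a hypothetical helper, and the default names are the two commands used in the next step.

```python
import shutil

def missing_tools(names=("esearch", "efetch")):
    """Return the names from `names` that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]

# Possible usage: an empty list means both commands were found
#   print(missing_tools())
```

Note that a notebook kernel started before the PATH change may need to be restarted to see the new value.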


### 3. Use esearch to get the sequence of the tagged locus

Use the Unix command below to get the protein sequences for the tagged loci of each strain and store all the sequences in **Chlamydia_protein_seq.fasta** for the BLAST search:

```shell
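for file in DUW-3CX_predict_Incs.txt L2434Bu_predict_Incs.txt AHAR13_predict_Incs.txt
do
    # skip the first line of each file, which holds the strain name rather than a locus tag
    tail -n +2 "${file}" | while read -r line
    do
        echo "${line}"
        # redirect stdin so esearch cannot consume the remaining lines of the loop input
        esearch -db protein -query "${line}" < /dev/null | efetch -format fasta >> Chlamydia_protein_seq.fasta
    done
done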
```
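
A quick sanity check after the download is to count the records in the FASTA file and compare that number with the number of locus tags queried. A minimal sketch follows; `count_fasta_records` is a hypothetical helper, and it relies only on the FASTA convention that each record starts with a `>` header line.

```python
def count_fasta_records(path):
    """Count FASTA records by counting the header lines that start with '>'."""
    with open(path, "r") as handle:
        return sum(1 for line in handle if line.startswith(">"))

# Possible usage:
#   print(count_fasta_records("Chlamydia_protein_seq.fasta"))
```

A count that differs from the number of tags suggests that some queries returned no sequence or more than one.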


### 4. Perform blast search on the protein sequence