# Obtaining Data

- Palmitoylation site information is obtained from SwissPalm
- sites.txt is downloaded and converted to sites.csv
- sites.csv is stored at https://github.com/sonluongvu/Palm_structure
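
The conversion from sites.txt to sites.csv could look like the sketch below. It assumes the SwissPalm export is tab-separated with a header row, which is not stated here; the helper name `txt_to_csv` and the two-row sample (including its IDs) are made up for illustration.

```python
import io
import pandas as pd

def txt_to_csv(txt_source, csv_target, sep="\t"):
    """Read a delimited text export and write it back out as CSV."""
    table = pd.read_csv(txt_source, sep=sep)
    table.to_csv(csv_target, index=False)
    return table

# Hypothetical two-row sample standing in for sites.txt
sample_txt = io.StringIO(
    "id\tuniprot_ac\tpos\n"
    "SPALMS1\tP01112\t181\n"
    "SPALMS2\tP01112\t184\n"
)
csv_buffer = io.StringIO()
converted = txt_to_csv(sample_txt, csv_buffer)
csv_text = csv_buffer.getvalue()
```

With real files, `txt_source` and `csv_target` would simply be the paths 'sites.txt' and 'sites.csv'.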

Open sites.csv

In [None]:
! wget 'https://github.com/sonluongvu/Palm_structure/raw/main/sites.csv'

--2022-01-22 21:47:37--  https://github.com/sonluongvu/Palm_structure/raw/main/sites.csv
Resolving github.com (github.com)... 140.82.113.4
Connecting to github.com (github.com)|140.82.113.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/sonluongvu/Palm_structure/main/sites.csv [following]
--2022-01-22 21:47:38--  https://raw.githubusercontent.com/sonluongvu/Palm_structure/main/sites.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3260327 (3.1M) [text/plain]
Saving to: ‘sites.csv’


2022-01-22 21:47:38 (45.8 MB/s) - ‘sites.csv’ saved [3260327/3260327]



In [None]:
import pandas as pd
import numpy as np
sites_path = 'https://github.com/sonluongvu/Palm_structure/raw/main/sites.csv'
sites = pd.read_csv(sites_path)
sites.shape

(7452, 26)

Number of palmitoylation sites per organism

In [None]:
sites.organism.value_counts()

Mus musculus                                                            4328
Homo sapiens                                                            2691
Rattus norvegicus                                                        160
Saccharomyces cerevisiae (strain ATCC 204508 / S288c)                     46
Bos taurus                                                                38
Oryctolagus cuniculus                                                     30
Arabidopsis thaliana                                                      30
Torpedo californica                                                       13
Drosophila melanogaster                                                   12
Canis familiaris                                                           8
Hepatitis E virus genotype 3 (isolate Human/United States/US2)             8
Semliki forest virus                                                       8
Sindbis virus                                                              8

- Most of the palmitoylation sites listed above have not been verified experimentally
- The list of verified site IDs is obtained from the SwissPalm database

Obtain list of verified palm sites for Homo sapiens

In [None]:
! wget 'https://github.com/sonluongvu/Palm_structure/raw/main/Homo_sapiens_verified_sites.csv'

--2022-01-22 21:47:38--  https://github.com/sonluongvu/Palm_structure/raw/main/Homo_sapiens_verified_sites.csv
Resolving github.com (github.com)... 140.82.112.4
Connecting to github.com (github.com)|140.82.112.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/sonluongvu/Palm_structure/main/Homo_sapiens_verified_sites.csv [following]
--2022-01-22 21:47:39--  https://raw.githubusercontent.com/sonluongvu/Palm_structure/main/Homo_sapiens_verified_sites.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7373 (7.2K) [text/plain]
Saving to: ‘Homo_sapiens_verified_sites.csv’


2022-01-22 21:47:39 (37.1 MB/s) - ‘Homo_sapiens_verified_sites.csv’ saved [7373/7373]



In [None]:
Homo_sapiens_id_path = '/content/Homo_sapiens_verified_sites.csv'
Homo_sapiens_id = pd.read_csv(Homo_sapiens_id_path)
Homo_sapiens_id.shape

(648, 1)

Obtain the list of IDs

In [None]:
# Replace the first 7 characters of each verified ID with the 'SPALMS'
# prefix so the IDs match the 'id' column of sites
Homo_sapiens_list = ['SPALMS' + str(x)[7:] for x in Homo_sapiens_id.iloc[:, 0]]

len(Homo_sapiens_list)

648

Select the rows of sites whose IDs are in Homo_sapiens_list

In [None]:
Homo_sapiens_sites = sites[sites['id'].isin(Homo_sapiens_list)]
Homo_sapiens_sites.uniprot_ac.value_counts()

P16144    24
P60033    17
P48509    14
Q9H3Z4    14
P01112    12
          ..
P11801     1
P14672     1
P21554     1
P49798     1
Q8IZJ1     1
Name: uniprot_ac, Length: 202, dtype: int64
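
Because `isin` silently drops IDs that have no counterpart in sites, it can be worth checking how many verified IDs actually matched. The sketch below uses small made-up frames (`sites_demo`, `verified_demo`) rather than the real data; only the `id` and `uniprot_ac` column names come from the notebook.

```python
import pandas as pd

# Toy stand-ins for sites and Homo_sapiens_list (hypothetical values)
sites_demo = pd.DataFrame({
    "id": ["SPALMS1", "SPALMS2", "SPALMS3"],
    "uniprot_ac": ["P1", "P1", "P2"],
})
verified_demo = ["SPALMS1", "SPALMS3", "SPALMS9"]

# Rows of the site table that are in the verified list
matched = sites_demo[sites_demo["id"].isin(verified_demo)]

# Verified IDs with no row in the site table
missing = sorted(set(verified_demo) - set(sites_demo["id"]))
```

An empty `missing` list would confirm that every verified ID was found after the prefix conversion.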

Obtain protein sequence

In [None]:
! pip install biopython

Collecting biopython
  Downloading biopython-1.79-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.whl (2.3 MB)
     |████████████████████████████████| 2.3 MB 4.8 MB/s 
Installing collected packages: biopython
Successfully installed biopython-1.79


In [None]:
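import requests as r
from Bio import SeqIO
from io import StringIO 

Uniprot_ID = Homo_sapiens_sites['uniprot_ac']
baseUrl = "https://rest.uniprot.org/uniprotkb/"
seq_cache = {}  # many sites share a protein, so fetch each accession only once
Seq_list = []
for cID in Uniprot_ID:
  if cID not in seq_cache:
    # Retrieve the FASTA record with GET (not POST) and fail loudly on HTTP errors
    response = r.get(baseUrl + cID + ".fasta", timeout=30)
    response.raise_for_status()
    seq_cache[cID] = list(SeqIO.parse(StringIO(response.text), 'fasta'))
  # Keep one entry per site so Seq_list stays aligned with Homo_sapiens_sites
  Seq_list.append(seq_cache[cID])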

KeyboardInterrupt: ignored

In [None]:
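protein_seq_list = []
for record in Seq_list:
  # Pad both ends with 200 '-' so a window around a site near either terminus stays in range
  sequence = 200*'-' + str(record[0].seq) + 200*'-'
  protein_seq_list.append(sequence)
# Work on a copy to avoid pandas' SettingWithCopyWarning on the filtered frame
Homo_sapiens_sites = Homo_sapiens_sites.copy()
Homo_sapiens_sites['protein_seq'] = protein_seq_list
Homo_sapiens_sites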

In [None]:
columns = ['id', 'uniprot_ac', 'pos', 'protein_seq']
Homo_sapiens_info = Homo_sapiens_sites.filter(columns, axis = 1)
Homo_sapiens_info['protein_seq'] = protein_seq_list
Homo_sapiens_info.shape

In [None]:
Homo_sapiens_info.to_csv('Homo_sapiens_info.csv', index = False)
type(Seq_list[1])

Obtain the 100 aa upstream and 100 aa downstream of each site
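
A minimal sketch of the window extraction, assuming `pos` is the 1-based position of the site in the unpadded sequence and that each sequence carries the 200-character '-' padding added earlier. The function name `extract_window` and the short example sequence are hypothetical.

```python
def extract_window(padded_seq, pos, flank=100, pad=200):
    """Return the site residue plus `flank` residues on each side.

    `pos` is assumed to be 1-based and to refer to the unpadded sequence.
    """
    center = pad + pos - 1  # index of the site in the padded string
    return padded_seq[center - flank: center + flank + 1]

raw_seq = "MTEYKLVVVGAC"  # hypothetical 12-residue sequence, site at position 12
padded = 200 * "-" + raw_seq + 200 * "-"
window = extract_window(padded, pos=12)
```

Each window is 201 characters long with the site at index 100; positions that fall outside the protein are filled with '-'.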