# Introduction to CPTAC
This tutorial introduces the CPTAC dataset through `cptac`, a (new) Python package written to access the CPTAC data. It plays a role similar to `TCGAbiolinks`. The `cptac` package returns its data as `pandas` dataframes, which are similar to some of the dataframes we navigated in R, so this tutorial also introduces some basic uses of `pandas`. There is a `pandas` cheat sheet in the Shared Folder that we recommend saving for future reference.

`cptac` documentation: https://pypi.org/project/cptac/

### Installing and Importing `cptac`
Just as in R, we need to install `cptac` before we can load it into our current Python environment. This installation only needs to occur once. In your terminal, install `cptac` into your conda environment, and double check that the environment is activated first. At the very start of the line in the terminal, `(base)` means no conda environment of your own is active, while `(name_of_environment)` tells you which conda environment you have activated.

Install `cptac` with `pip install cptac`. Return to this notebook and run `import cptac`. If you are having trouble, try to close and restart the notebook.
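
If the import fails, a quick check from inside the notebook can show which Python the notebook is running and whether `cptac` is visible to it. The sketch below is only a suggestion: the helper name `check_package` is made up for this tutorial, and it assumes conda has set the `CONDA_DEFAULT_ENV` environment variable, which it normally does when an environment is activated.

```python
import importlib.util
import os
import sys


def check_package(package_name):
    """Return True if `package_name` can be imported by this Python interpreter."""
    return importlib.util.find_spec(package_name) is not None


# Path of the interpreter this notebook kernel is using.
print("Python executable:", sys.executable)
# Name of the active conda environment (None if conda did not set it).
print("Conda environment:", os.environ.get("CONDA_DEFAULT_ENV"))
# Whether cptac was installed into the environment the notebook is using.
print("cptac installed:", check_package("cptac"))
```

If the last line prints `False`, `cptac` was most likely installed into a different environment than the one the notebook is using.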

### Start exploring CPTAC with `cptac` (!!)
Similar to `TCGAbiolinks`, we need to access and download the data. Take note of this syntax as you will need it for future reference.

In [9]:
import cptac
cptac.list_datasets()

Dataset name,Description,Data reuse status,Publication link
Brca,breast cancer,no restrictions,https://pubmed.ncbi.nlm.nih.gov/33212010/
Ccrcc,clear cell renal cell carcinoma (kidney),no restrictions,https://pubmed.ncbi.nlm.nih.gov/31675502/
Colon,colorectal cancer,no restrictions,https://pubmed.ncbi.nlm.nih.gov/31031003/
Endometrial,endometrial carcinoma (uterine),no restrictions,https://pubmed.ncbi.nlm.nih.gov/32059776/
Gbm,glioblastoma,no restrictions,https://pubmed.ncbi.nlm.nih.gov/33577785/
Hnscc,head and neck squamous cell carcinoma,no restrictions,https://pubmed.ncbi.nlm.nih.gov/33417831/
Lscc,lung squamous cell carcinoma,password access only,unpublished
Luad,lung adenocarcinoma,no restrictions,https://pubmed.ncbi.nlm.nih.gov/32649874/
Ovarian,high grade serous ovarian cancer,no restrictions,https://pubmed.ncbi.nlm.nih.gov/27372738/
Pdac,pancreatic ductal adenocarcinoma,password access only,unpublished


In [10]:
cptac.download(dataset="Brca")
br = cptac.Brca()

                                         

The following lists the data available for the breast cancer dataset. Notice that even though this is through cptac (proteomics), we can also access the accompanying transcriptomics, CNV, etc. We will be focusing on the proteomics, but acetylproteomics and phosphoproteomics are also interesting aspects to explore. They capture "post-translational modifications": the addition of acetyl or phosphate groups to proteins after they are translated. This can often give insight into which proteins are being used in which cellular pathways.

In [11]:
br.list_data()

Below are the dataframes contained in this dataset:
	acetylproteomics
		Dimensions: (122, 9868)
	clinical
		Dimensions: (122, 23)
	CNV
		Dimensions: (122, 23692)
	derived_molecular
		Dimensions: (122, 32)
	followup
		Dimensions: (276, 95)
	phosphoproteomics
		Dimensions: (122, 38775)
	proteomics
		Dimensions: (122, 10107)
	somatic_mutation
		Dimensions: (24106, 3)
	transcriptomics
		Dimensions: (122, 23121)


In [30]:
protein_data = br.get_proteomics()

#The dataframes are MultiIndex pandas dataframes (the columns have more than one level of labels).
#However, to teach the basics of pandas, we will remove the "multi" part of the dataframe.
protein_data = protein_data.droplevel(1, axis=1)

protein_data

Patient_ID,A1BG,A2M,A2ML1,AAAS,AACS,AADAT,AAED1,AAGAB,AAK1,AAMDC,...,ZSCAN31,ZSWIM8,ZW10,ZWILCH,ZWINT,ZXDC,ZYG11B,ZYX,ZZEF1,ZZZ3
CPT000814,-0.6712,-0.2075,2.7959,1.3969,-1.0899,,1.6708,-0.3484,-0.4756,-0.7299,...,-5.2868,-0.6536,0.3384,2.1169,1.3910,-2.1230,0.9136,-0.8082,-1.4793,0.9136
CPT001846,1.3964,1.3302,-5.0948,0.7674,-1.6845,,2.1022,-0.5814,0.2916,-2.2857,...,-0.7592,0.4711,0.6018,0.2062,-0.2137,-2.1219,0.0860,2.5814,-0.2852,-0.1074
X01BR001,2.0219,1.6269,-3.2943,0.3352,-1.0739,1.2255,0.2754,-1.1187,-0.0534,-0.2519,...,,0.2306,-0.3010,0.3395,-0.5316,,0.4996,0.7622,-1.5607,0.0256
X01BR008,-0.5290,0.3267,1.4342,0.4938,-2.8676,,,-1.0691,-0.3643,-1.8173,...,-2.1789,0.2695,0.1506,1.0498,0.7546,1.7889,-0.2499,-0.2590,-0.1263,0.3725
X01BR009,1.2556,3.4489,2.8043,-0.2956,-1.7261,,,-2.0471,-0.3547,-0.8298,...,-2.3990,-0.2596,0.1898,-0.5010,-0.4189,0.3080,0.5057,0.2181,-0.2288,-0.2750
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
X21BR001,-0.6610,-0.6402,-4.8578,1.2319,-1.6491,,,-0.3074,-0.3074,-0.0266,...,-0.2528,0.5090,0.0306,0.4908,-0.5570,2.3864,0.3764,-0.6974,1.3541,1.1123
X21BR002,-1.3735,0.4227,-4.9553,0.6327,-3.1434,,,0.3071,0.7562,-1.6912,...,-3.3351,0.1548,1.0792,-0.6619,-1.4444,-0.3704,0.4909,0.3938,0.2992,-0.3494
X21BR010,1.1583,0.3329,-5.7358,-0.1658,-2.0413,-1.2433,0.9090,-0.2410,0.6717,-0.1651,...,-0.7054,0.2752,0.8850,-2.6704,-0.9444,-1.9717,0.0650,0.6300,-0.0686,0.1798
X22BR005,0.4948,-1.0986,-8.8314,0.2826,-1.0123,-2.5732,5.7567,1.7644,0.5415,0.1531,...,-0.3936,-0.0340,-0.9367,-0.1922,1.2572,1.3220,-1.0698,0.4012,-0.3792,1.2752
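
To see what `droplevel(1, axis=1)` does without downloading anything, here is a minimal sketch on a made-up dataframe. The second level name `Database_ID` and the ID strings are assumptions chosen for illustration; the numbers are copied from the first two rows of the table above.

```python
import pandas as pd

# Toy data with two-level column labels. The level names "Name" and
# "Database_ID" and the ID strings are assumptions made for illustration.
columns = pd.MultiIndex.from_tuples(
    [("A1BG", "ID_1"), ("A2M", "ID_2")],
    names=["Name", "Database_ID"],
)
toy = pd.DataFrame(
    [[-0.6712, -0.2075], [1.3964, 1.3302]],
    index=pd.Index(["CPT000814", "CPT001846"], name="Patient_ID"),
    columns=columns,
)

# Drop the second column level (position 1), keeping only the gene names.
flat = toy.droplevel(1, axis=1)

print(toy.columns.nlevels)    # number of column levels before dropping
print(flat.columns.tolist())  # column labels after dropping
```

One thing to keep in mind: if two columns share a gene name and differ only in the dropped level, the flattened dataframe will contain duplicate column names.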


### Introduction to Pandas
The following commands are the basics for navigating pandas dataframes. First we import pandas under its common alias, `pd`. Remember that there are additional Python and pandas resources in the Shared Folder.


In [None]:
import pandas as pd

In [34]:
#Jupyter gives a nice format for viewing pandas dataframes
protein_data

Patient_ID,A1BG,A2M,A2ML1,AAAS,AACS,AADAT,AAED1,AAGAB,AAK1,AAMDC,...,ZSCAN31,ZSWIM8,ZW10,ZWILCH,ZWINT,ZXDC,ZYG11B,ZYX,ZZEF1,ZZZ3
CPT000814,-0.6712,-0.2075,2.7959,1.3969,-1.0899,,1.6708,-0.3484,-0.4756,-0.7299,...,-5.2868,-0.6536,0.3384,2.1169,1.3910,-2.1230,0.9136,-0.8082,-1.4793,0.9136
CPT001846,1.3964,1.3302,-5.0948,0.7674,-1.6845,,2.1022,-0.5814,0.2916,-2.2857,...,-0.7592,0.4711,0.6018,0.2062,-0.2137,-2.1219,0.0860,2.5814,-0.2852,-0.1074
X01BR001,2.0219,1.6269,-3.2943,0.3352,-1.0739,1.2255,0.2754,-1.1187,-0.0534,-0.2519,...,,0.2306,-0.3010,0.3395,-0.5316,,0.4996,0.7622,-1.5607,0.0256
X01BR008,-0.5290,0.3267,1.4342,0.4938,-2.8676,,,-1.0691,-0.3643,-1.8173,...,-2.1789,0.2695,0.1506,1.0498,0.7546,1.7889,-0.2499,-0.2590,-0.1263,0.3725
X01BR009,1.2556,3.4489,2.8043,-0.2956,-1.7261,,,-2.0471,-0.3547,-0.8298,...,-2.3990,-0.2596,0.1898,-0.5010,-0.4189,0.3080,0.5057,0.2181,-0.2288,-0.2750
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
X21BR001,-0.6610,-0.6402,-4.8578,1.2319,-1.6491,,,-0.3074,-0.3074,-0.0266,...,-0.2528,0.5090,0.0306,0.4908,-0.5570,2.3864,0.3764,-0.6974,1.3541,1.1123
X21BR002,-1.3735,0.4227,-4.9553,0.6327,-3.1434,,,0.3071,0.7562,-1.6912,...,-3.3351,0.1548,1.0792,-0.6619,-1.4444,-0.3704,0.4909,0.3938,0.2992,-0.3494
X21BR010,1.1583,0.3329,-5.7358,-0.1658,-2.0413,-1.2433,0.9090,-0.2410,0.6717,-0.1651,...,-0.7054,0.2752,0.8850,-2.6704,-0.9444,-1.9717,0.0650,0.6300,-0.0686,0.1798
X22BR005,0.4948,-1.0986,-8.8314,0.2826,-1.0123,-2.5732,5.7567,1.7644,0.5415,0.1531,...,-0.3936,-0.0340,-0.9367,-0.1922,1.2572,1.3220,-1.0698,0.4012,-0.3792,1.2752
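
Beyond displaying the whole table, a few attributes and methods give a quick summary of a dataframe. The sketch below uses a small stand-in for `protein_data` (three patients and three proteins, with values copied from the table above) so that it runs without the download; the variable name `toy` is just for this example.

```python
import numpy as np
import pandas as pd

# Small stand-in for protein_data; values copied from the table above,
# with NaN marking a protein that was not measured in a sample.
toy = pd.DataFrame(
    {
        "A1BG": [-0.6712, 1.3964, 2.0219],
        "A2M": [-0.2075, 1.3302, 1.6269],
        "AADAT": [np.nan, np.nan, 1.2255],
    },
    index=pd.Index(["CPT000814", "CPT001846", "X01BR001"], name="Patient_ID"),
)

print(toy.shape)             # (number of rows, number of columns)
print(toy.index.tolist())    # row labels (patients)
print(toy.columns.tolist())  # column labels (proteins)
print(toy.head(2))           # first two rows
print(toy.isna().sum())      # count of missing values per protein
```

The same calls work on `protein_data` itself, where `shape` should match the dimensions reported by `br.list_data()`.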


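There are multiple options to index into a pandas dataframe. The first is the bracket format, `dataframe["column_name"]`, which selects a single column by name and returns it as a pandas Series.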

In [38]:
protein_data["A1BG"]

Patient_ID
CPT000814   -0.6712
CPT001846    1.3964
X01BR001     2.0219
X01BR008    -0.5290
X01BR009     1.2556
              ...  
X21BR001    -0.6610
X21BR002    -1.3735
X21BR010     1.1583
X22BR005     0.4948
X22BR006     0.5049
Name: A1BG, Length: 122, dtype: float64
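
Other common ways to index, not shown in the cell above, are label-based `.loc`, position-based `.iloc`, a list of column names, and a boolean mask. The following is a minimal sketch on a small stand-in dataframe (values copied from the first three patients above), not a complete tour of pandas indexing.

```python
import pandas as pd

toy = pd.DataFrame(
    {"A1BG": [-0.6712, 1.3964, 2.0219], "A2M": [-0.2075, 1.3302, 1.6269]},
    index=pd.Index(["CPT000814", "CPT001846", "X01BR001"], name="Patient_ID"),
)

# Label-based: row label first, then column label.
by_label = toy.loc["CPT001846", "A2M"]

# Position-based: row position first, then column position (both start at 0).
by_position = toy.iloc[1, 1]

# A list of column names returns a smaller dataframe rather than a Series.
subset = toy[["A1BG", "A2M"]]

# A boolean mask keeps only the rows where the condition is True.
high_a1bg = toy[toy["A1BG"] > 1]

print(by_label, by_position)
print(high_a1bg)
```

Here `by_label` and `by_position` point at the same cell, once by name and once by position.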

In [2]:
# CPTAC python tutorial 1: downloading dataset

import subprocess
import sys

def install(package):
    subprocess.check_call([sys.executable, "-m", "pip", "install", package])

install("cptac")
import cptac

cptac.list_datasets()
cptac.download(dataset="Brca")
br = cptac.Brca()
# br.list_data()
# proteomics = br.get_proteomics()
# samples = proteomics.index
# proteins = proteomics.columns
# print("Samples:",samples[0:20].tolist()) #the first twenty samples
# print("Proteins:",proteins[0:20].tolist()) #the first twenty proteins
# proteomics.head()


                                         

True
