## Data Wrangling-MetaData
---

### - This notebook uses _pandas_ (version 1.1.1) to curate and clean information from a metadata file
### - It generates the sample labels used by the ML classifiers
### - The file used in this notebook can be downloaded from [GREIN](http://www.ilincs.org/apps/grein/session/3ac4c6e5dd644337909800e52c1ba8f1/download/downloadmeta?w=)

#### Step 1: Load libraries

In [1]:
# Pandas for Dataframe processing
import pandas as pd

# This will print entire output of the cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

#### Step 2: Load the raw metadata file
- The file used in this notebook can be downloaded from [GREIN](http://www.ilincs.org/apps/grein/session/3ac4c6e5dd644337909800e52c1ba8f1/download/downloadmeta?w=)

In [2]:
# Import the MetaData file
MetaData = pd.read_csv("GSE103147_full_metadata.csv", index_col = 0)

# Viewing 
MetaData.head(1)

# Dimensions
MetaData.shape

Unnamed: 0,geo_accession,title,status,submission_date,last_update_date,type,channel_count,source_name_ch1,organism_ch1,characteristics_ch1,...,BioSample,SampleType,TaxID,ScientificName,Tumor,CenterName,Submission,Consent,RunHash,ReadHash
GSM2754496,GSM2754496,100_04_0779_DAY_0_T_Ag85_100_L8.LB23,Public on Oct 17 2017,Aug 27 2017,Oct 17 2017,SRA,1,T cells,Homo sapiens,cell type: Tcells,...,SAMN07564511,simple,9606,Homo sapiens,no,GEO,SRA603022,public,55B430EF91FD45723A75A0F83E487566,A1E7835851FA48F2F69C0DBF1E4E0640


(1650, 80)
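
Before writing filter conditions, it can help to list the distinct values in each `characteristics_ch1*` column, since the filters in the following steps match these strings exactly. The sketch below does this on a small made-up frame; the rows and the `toy` and `counts` names are illustrative assumptions, not taken from the GSE103147 metadata.

```python
import pandas as pd

# Toy stand-in for the metadata table (values are illustrative only)
toy = pd.DataFrame(
    {
        "characteristics_ch1": [
            "cell type: Tcells",
            "cell type: Tcells",
            "cell type: other",
        ],
        "characteristics_ch1.5": [
            "group: case",
            "group: control",
            "group: control",
        ],
    },
    index=["S1", "S2", "S3"],
)

# Count how often each distinct value occurs in every column of interest
counts = {col: toy[col].value_counts().to_dict() for col in toy.columns}
print(counts)
```

On the real table, the same dictionary comprehension could be run over the relevant columns of `MetaData` to confirm the exact spelling of each value before filtering.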

#### Step 3: Boolean Subsetting (Case Samples)
- Conditions used:

a. characteristics_ch1 = cell type: Tcells

b. characteristics_ch1.3 = timepoint: 0

c. characteristics_ch1.5 = group: __case__

d. characteristics_ch1.1 = stimulation: unstim

In [3]:
# Build a boolean mask (a pandas Series) that is True for case samples
Case = (
    (MetaData["characteristics_ch1"] == "cell type: Tcells")
    & (MetaData["characteristics_ch1.3"] == "timepoint: 0")
    & (MetaData["characteristics_ch1.5"] == "group: case")
    & (MetaData["characteristics_ch1.1"] == "stimulation: unstim")
)

# Select the Run IDs of the rows where the mask is True
Case_samples = MetaData["Run"][Case]

# Converting pandas series to dataFrame
case_samples = pd.DataFrame(Case_samples)

# Adding a new column with the label "Case"
case_samples['Labels'] = "Case"

# Viewing 
case_samples.head(2)

# Dimensions
case_samples.shape

Unnamed: 0,Run,Labels
GSM2755030,SRR5980958,Case
GSM2755033,SRR5980961,Case


(40, 2)
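
The same subsetting can be written more compactly with `.loc`, which returns a DataFrame directly when given a list of columns. This is a minimal sketch on made-up rows, not the notebook's own method; the `toy`, `mask`, and `case_toy` names and all sample values are assumptions for illustration.

```python
import pandas as pd

# Toy metadata with the four columns used in the filter (illustrative values)
toy = pd.DataFrame(
    {
        "Run": ["R1", "R2", "R3", "R4"],
        "characteristics_ch1": ["cell type: Tcells"] * 4,
        "characteristics_ch1.1": [
            "stimulation: unstim",
            "stimulation: unstim",
            "stimulation: stim",
            "stimulation: unstim",
        ],
        "characteristics_ch1.3": [
            "timepoint: 0",
            "timepoint: 0",
            "timepoint: 0",
            "timepoint: 180",
        ],
        "characteristics_ch1.5": [
            "group: case",
            "group: control",
            "group: case",
            "group: case",
        ],
    },
    index=["S1", "S2", "S3", "S4"],
)

# All four conditions must hold for a row to be kept
mask = (
    (toy["characteristics_ch1"] == "cell type: Tcells")
    & (toy["characteristics_ch1.3"] == "timepoint: 0")
    & (toy["characteristics_ch1.5"] == "group: case")
    & (toy["characteristics_ch1.1"] == "stimulation: unstim")
)

# A column list inside .loc yields a DataFrame; .copy() makes it independent of toy
case_toy = toy.loc[mask, ["Run"]].copy()
case_toy["Labels"] = "Case"
print(case_toy)
```

Only `S1` satisfies every condition here: `S2` is a control, `S3` is stimulated, and `S4` has a different timepoint.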

#### Step 4: Boolean Subsetting (Control Samples)
- Conditions used:

a. characteristics_ch1 = cell type: Tcells

b. characteristics_ch1.3 = timepoint: 0

c. characteristics_ch1.5 = group: __control__

d. characteristics_ch1.1 = stimulation: unstim

In [4]:
# Build a boolean mask (a pandas Series) that is True for control samples
Control = (
    (MetaData["characteristics_ch1"] == "cell type: Tcells")
    & (MetaData["characteristics_ch1.3"] == "timepoint: 0")
    & (MetaData["characteristics_ch1.5"] == "group: control")
    & (MetaData["characteristics_ch1.1"] == "stimulation: unstim")
)

# Select the Run IDs of the rows where the mask is True
Control_samples = MetaData["Run"][Control]

# Converting pandas series to dataFrame
control_samples = pd.DataFrame(Control_samples)

# Adding a new column with the label "Control"
control_samples['Labels'] = "Control"

# Viewing 
control_samples.head(2)

# Dimensions
control_samples.shape

Unnamed: 0,Run,Labels
GSM2755031,SRR5980959,Control
GSM2755032,SRR5980960,Control


(73, 2)
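
Steps 3 and 4 differ only in the group value and the label, so the shared filter could be factored into one function. The sketch below assumes a hypothetical helper named `select_group` (it is not part of the notebook or of pandas) and uses made-up rows.

```python
import pandas as pd

def select_group(meta, group, label):
    """Hypothetical helper: Run IDs of unstimulated T cells at timepoint 0
    for one group, with a constant Labels column attached."""
    mask = (
        (meta["characteristics_ch1"] == "cell type: Tcells")
        & (meta["characteristics_ch1.3"] == "timepoint: 0")
        & (meta["characteristics_ch1.5"] == f"group: {group}")
        & (meta["characteristics_ch1.1"] == "stimulation: unstim")
    )
    # assign() returns a new DataFrame, leaving meta untouched
    return meta.loc[mask, ["Run"]].assign(Labels=label)

# Toy metadata (illustrative values only)
toy = pd.DataFrame(
    {
        "Run": ["R1", "R2", "R3"],
        "characteristics_ch1": ["cell type: Tcells"] * 3,
        "characteristics_ch1.1": ["stimulation: unstim"] * 3,
        "characteristics_ch1.3": ["timepoint: 0"] * 3,
        "characteristics_ch1.5": ["group: case", "group: control", "group: control"],
    },
    index=["S1", "S2", "S3"],
)

case_toy = select_group(toy, "case", "Case")
control_toy = select_group(toy, "control", "Control")
print(case_toy)
print(control_toy)
```

Keeping the conditions in one place means a change to, say, the timepoint only has to be made once.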

#### Step 5: Concatenation
- Concatenation of case and control dataframes
- Label __"Case"__ denotes case samples
- Label __"Control"__ denotes control samples

In [5]:
# Stack control and case samples into a single labelled table
sample_id_map = pd.concat([control_samples, case_samples], axis = 0)
sample_id_map.head(2)
sample_id_map.shape

Unnamed: 0,Run,Labels
GSM2755031,SRR5980959,Control
GSM2755032,SRR5980960,Control


(113, 2)
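
After concatenating, a quick sanity check on class counts and duplicate Run IDs can catch filtering mistakes early. If a downstream classifier needs numeric labels, the string labels can also be mapped to integers. The sketch below assumes 1 for case and 0 for control; the `Label_num` column name and the toy rows are hypothetical.

```python
import pandas as pd

# Toy labelled tables standing in for control_samples and case_samples
control_toy = pd.DataFrame(
    {"Run": ["R2", "R5"], "Labels": ["Control", "Control"]}, index=["S2", "S5"]
)
case_toy = pd.DataFrame({"Run": ["R1"], "Labels": ["Case"]}, index=["S1"])

# Stack rows; the original sample index is preserved
combined = pd.concat([control_toy, case_toy], axis=0)

# Each Run ID should appear only once
has_duplicates = combined["Run"].duplicated().any()

# Class balance after concatenation
counts = combined["Labels"].value_counts().to_dict()

# Optional numeric encoding (assumed convention: 1 = case, 0 = control)
combined["Label_num"] = combined["Labels"].map({"Case": 1, "Control": 0})
print(combined)
```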

#### Step 6: Save the file
- Save the file without the index column

In [6]:
# Saving the file without Index
sample_id_map.to_csv("Sample_Labels.csv", index = False)
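
Writing with `index = False` drops the sample accessions held in the index, so the saved file contains only the `Run` and `Labels` columns. The round trip below illustrates this with an in-memory buffer instead of a file on disk; the `toy_map` frame and its values are made up for the example.

```python
import io

import pandas as pd

# Toy stand-in for sample_id_map (illustrative values)
toy_map = pd.DataFrame(
    {"Run": ["R1", "R2"], "Labels": ["Case", "Control"]}, index=["S1", "S2"]
)

# Write to an in-memory text buffer without the index
buffer = io.StringIO()
toy_map.to_csv(buffer, index=False)

# Read it back; pandas assigns a fresh integer index
buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(reloaded)
```

If the sample accessions are needed later, one option is to omit `index = False` so that they are written as the first column.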

---