# Data
The file *human_small5.csv* contains expression levels of 100 human transcripts over 100 RNA sequencing runs. The file format is as follows:

* Each row corresponds to a transcript
* The first 15 columns are transcript description columns
* The remaining 100 columns each correspond to one RNA sequencing run

Note: this is a very small subset of the original dataset, whose dimensions are 1,888,228 x 8,619.

The first step is to read the data into a pandas DataFrame.

In [2]:
# Import the `pandas` library as `pd`
import pandas as pd


# read the data file
df = pd.read_csv("data/human_small5.csv")

# check the DataFrame dimensions
print(df.shape)

# print the first 5 rows
print(df.head(5))

(100, 115)
     Transcript id   Source   Gene stable ID Transcript stable ID  \
0  ENST00000482857  ENSEMBL  ENSG00000152689      ENST00000482857   
1  ENST00000482858  ENSEMBL  ENSG00000100033      ENST00000482858   
2  ENST00000482859  ENSEMBL  ENSG00000241741      ENST00000482859   
3  ENST00000482860  ENSEMBL  ENSG00000146963      ENST00000482860   
4  ENST00000482861  ENSEMBL  ENSG00000144290      ENST00000482861   

                                    Gene description  \
0  RAS guanyl releasing protein 3 [Source:HGNC Sy...   
1  proline dehydrogenase 1 [Source:HGNC Symbol;Ac...   
2  ribosomal protein L7a pseudogene 30 [Source:HG...   
3  LUC7 like 2; pre-mRNA splicing factor [Source:...   
4  solute carrier family 4 member 10 [Source:HGNC...   

   Transcript length (including UTRs and CDS) Gene name Transcript name  \
0                                         688   RASGRP3     RASGRP3-214   
1                                        4676     PRODH       PRODH-211   
2           

## Column names
Our data has columns of type *object* and *float*. The first 15 columns are description columns only, and the rest are the data columns. Save the two sets of column names in separate variables so that it is easier to select only the data columns for mathematical operations.

In [8]:
# the first 15 columns describe the transcript
infoCols=df.columns[:15]
print(infoCols)
# the remaining columns hold one sequencing run each
dataCols=df.columns[15:]
print(dataCols)


Index(['Transcript id', 'Source', 'Gene stable ID', 'Transcript stable ID',
       'Gene description', 'Transcript length (including UTRs and CDS)',
       'Gene name', 'Transcript name', 'Gene % GC content', 'Gene type',
       'Protein stable ID', 'Ensembl Family Description', 'Pfam domain ID',
       'RFAM transcript name ID', 'Interpro Short Description'],
      dtype='object')
Index(['SRR3362406', 'SRR4044997', 'SRR3616971', 'SRR2963615', 'SRR5366356',
       'SRR2963313', 'SRR2964288', 'SRR4052832', 'SRR2964130', 'SRR4052927',
       'SRR645773', 'SRR857237', 'SRR2313095', 'SRR4050956', 'SRR2963586',
       'SRR2963434', 'SRR2939143', 'SRR903263', 'SRR4050737', 'SRR2964133',
       'SRR2963820', 'SRR2963277', 'SRR3593406', 'SRR4050892', 'SRR2963008',
       'SRR2964588', 'SRR2963718', 'SRR3090249', 'SRR2964480', 'SRR1313209',
       'SRR315336', 'SRR3090560', 'SRR3157890', 'SRR3362438', 'SRR1602519',
       'SRR2314045', 'SRR5038621', 'SRR4050661', 'SRR903214', 'SRR4052754',
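
Because the split above is purely positional, it can be worth confirming that every data column really is numeric before doing arithmetic on it. The following is a minimal sketch on a made-up toy frame; the names `toy`, `n_info`, `info_cols`, `data_cols` and `non_numeric` are illustrative and do not appear in the notebook.

```python
import pandas as pd

# Toy frame standing in for df: two description columns and two run columns.
# All values are made up for illustration.
toy = pd.DataFrame({
    "Transcript id": ["T1", "T2"],
    "Gene name": ["A", "B"],
    "RUN1": [0.5, 0.0],
    "RUN2": [1.5, 2.0],
})

n_info = 2  # would be 15 for human_small5.csv
info_cols = toy.columns[:n_info]
data_cols = toy.columns[n_info:]

# Collect any data column that is not numeric; an empty list means the split is safe
non_numeric = [c for c in data_cols if not pd.api.types.is_numeric_dtype(toy[c])]
print(non_numeric)
```

An empty list suggests the positional split put only numeric columns on the data side.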
    

## Clean data
Remove any transcripts (rows) whose mean expression level across all runs is less than 0.05. With 100 runs, this is the same as a row sum below 5. Display the info data (here, the transcript ID) of the removed transcripts.

In [45]:
# find rows whose sum over all data columns is below 5 (mean below 0.05 over 100 runs)
rowstoDrop=[]
for i in range(df.shape[0]):
    thisRsum=df.iloc[[i]][dataCols].sum(axis=1).sum()
    if thisRsum<5:
        print(df.iloc[[i]]['Transcript id'])
        rowstoDrop.append(i)

# positions equal index labels here because df has the default integer index
df_2=df.drop(rowstoDrop)
print(df_2.shape)
# check that a removed row is gone (expect an empty result)
df_2[df_2["Transcript id"]=="ENST00000482920"]

        

11    ENST00000482869
Name: Transcript id, dtype: object
15    ENST00000482873
Name: Transcript id, dtype: object
25    ENST00000482886
Name: Transcript id, dtype: object
27    ENST00000482888
Name: Transcript id, dtype: object
31    ENST00000482892
Name: Transcript id, dtype: object
43    ENST00000482909
Name: Transcript id, dtype: object
54    ENST00000482920
Name: Transcript id, dtype: object
60    ENST00000482928
Name: Transcript id, dtype: object
62    ENST00000482930
Name: Transcript id, dtype: object
64    ENST00000482932
Name: Transcript id, dtype: object
70    ENST00000482938
Name: Transcript id, dtype: object
71    ENST00000482939
Name: Transcript id, dtype: object
77    ENST00000482946
Name: Transcript id, dtype: object
82    ENST00000482951
Name: Transcript id, dtype: object
88    ENST00000482957
Name: Transcript id, dtype: object
91    ENST00000482961
Name: Transcript id, dtype: object
(84, 115)


Unnamed: 0,Transcript id,Source,Gene stable ID,Transcript stable ID,Gene description,Transcript length (including UTRs and CDS),Gene name,Transcript name,Gene % GC content,Gene type,...,SRR3090408,SRR3090459,SRR2133252,SRR645769,SRR4052856,SRR3593307,SRR4051015,SRR2735903,SRR2964052,SRR1056401
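
The row-by-row loop above can also be written as a single vectorized filter, which tends to matter once the table grows toward the size of the original dataset. This is a sketch on a made-up toy frame, not the notebook's own code; `toy`, `toy_data_cols`, `low`, `removed_ids` and `toy_kept` are illustrative names. Note that `mean` skips missing values while the loop's `sum` treats them as zero, so the two approaches agree only when no values are missing.

```python
import pandas as pd

# Toy frame with one id column and two run columns (made-up values)
toy = pd.DataFrame({
    "Transcript id": ["T1", "T2", "T3"],
    "RUN1": [0.00, 1.0, 0.04],
    "RUN2": [0.02, 3.0, 0.04],
})
toy_data_cols = ["RUN1", "RUN2"]

# Mean expression per transcript across all runs
row_means = toy[toy_data_cols].mean(axis=1)

# Boolean mask of transcripts below the 0.05 threshold
low = row_means < 0.05

# IDs of the transcripts that will be removed
removed_ids = toy.loc[low, "Transcript id"].tolist()
print(removed_ids)

# Keep only the transcripts at or above the threshold
toy_kept = toy.loc[~low]
print(toy_kept.shape)
```

Filtering with a boolean mask also avoids relying on row positions matching index labels, which `drop` with a list of integers does.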


Remove any data columns whose expression is 0 for all transcripts. Display the columns removed (if any).

In [64]:
# find data columns whose sum over all remaining transcripts is zero
colstoDrop=[]
for d in dataCols:
    thisCsum=df_2[d].sum()
    #print(d,thisCsum)
    if thisCsum<=0:
        print(d)
        colstoDrop.append(d)

print(colstoDrop)
#drop all the columns in the list
df_3=df_2.drop(colstoDrop,axis=1)

# check that one of the dropped columns is gone (expect False)
"SRR2963313" in df_3.columns
# or check that none of the dropped columns remain

if not set(colstoDrop).intersection(df_3.columns):
    print("All removed")


SRR2963313
SRR2964130
SRR2963586
SRR2964133
SRR2963820
SRR2963008
SRR2964588
SRR2963718
SRR2965196
SRR2964212
SRR2965397
SRR2964164
SRR2963148
SRR2964156
SRR2964052
['SRR2963313', 'SRR2964130', 'SRR2963586', 'SRR2964133', 'SRR2963820', 'SRR2963008', 'SRR2964588', 'SRR2963718', 'SRR2965196', 'SRR2964212', 'SRR2965397', 'SRR2964164', 'SRR2963148', 'SRR2964156', 'SRR2964052']


False
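
The column check can likewise be expressed without an explicit loop. The sketch below uses a made-up toy frame and illustrative names (`toy`, `toy_data_cols`, `all_zero`, `cols_to_drop`, `toy_clean`). It tests each column for being exactly zero everywhere, which matches the `sum <= 0` test used above only under the assumption that expression values are never negative.

```python
import pandas as pd

# Toy frame: RUN1 and RUN3 are zero for every transcript (made-up values)
toy = pd.DataFrame({
    "Transcript id": ["T1", "T2"],
    "RUN1": [0.0, 0.0],
    "RUN2": [1.5, 0.0],
    "RUN3": [0.0, 0.0],
})
toy_data_cols = ["RUN1", "RUN2", "RUN3"]

# True for each data column whose values are all zero
all_zero = (toy[toy_data_cols] == 0).all(axis=0)

# Names of the all-zero columns
cols_to_drop = all_zero[all_zero].index.tolist()
print(cols_to_drop)

# Drop them and confirm none of them remain
toy_clean = toy.drop(columns=cols_to_drop)
print(not set(cols_to_drop) & set(toy_clean.columns))
```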