# Quick & Easy: Visualizing NGS Count Data with Python

**Why Python?** <br>
1. Python offers a clearer, more readable syntax and robust error handling, making complex programming tasks easier to manage and debug compared to Bash scripts.
2. With an extensive ecosystem of libraries and frameworks, Python can handle a wider range of applications—from data analysis to web development—while Bash is primarily suited for system and file management tasks.
3. Python is cross-platform, which means that it can run on almost any device you can imagine (including your fridge). Strong community support provides continuous improvements and resources, making Python solutions more sustainable, maintainable, and scalable than Bash scripts.


# 1. Python Packages: Installation and Importing

Python offers a wide variety of packages to help you with your data analysis. New packages are published all the time, and you can even create your own!

**Installation of packages**

There are several options to install packages:
1. Using pip for your Python version on the command line (https://pypi.org/project/pip/)
2. Using a tool like conda (https://docs.conda.io/en/latest/), which also includes Python and an environment management system
3. Downloading packages from GitHub (if you have a special scientific case and someone in another lab has already written code you can use) and installing them manually.
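
Whichever option you choose, it can be handy to check from within Python whether a package is actually available. The following is a minimal sketch using only the standard library; the helper name `is_installed` and the package name `not_a_real_package_xyz` are made up for this illustration.

```python
from importlib import util

def is_installed(package_name):
    """Return True if Python can find a top-level module with this name."""
    return util.find_spec(package_name) is not None

# "json" ships with Python; the last name is deliberately nonsense
for name in ["json", "pandas", "not_a_real_package_xyz"]:
    if is_installed(name):
        print(name, "is available")
    else:
        print(name, "is missing, install it with pip or conda")
```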


**Importing packages**

Now we have installed our first packages. How do we use them in Python?

We first need to import the package into our script. To access the functions a package offers, we reference them through the package name. For this reason we can also give pandas a shorter name, so we do not need to type out `pandas` every time we want to use it.

In [20]:
#import pandas
import pandas as pd

# Importing just one submodule or function from a package is also possible, for example:
# from matplotlib import pyplot

Available packages (should) come with documentation. For pandas you can check out their extensive documentation website here: https://pandas.pydata.org/docs/index.html 
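
To make the two import styles concrete, here is a small sketch. It uses the `statistics` module from the standard library as a stand-in (so it runs without installing anything); the alias `st` and the example numbers are chosen only for this illustration.

```python
import statistics as st       # the whole package under a short alias
from statistics import mean   # just one function from the package

expression = [50, 75, 65, 80, 45]

# Both lines call the very same function, only the way we reach it differs
print(st.mean(expression))
print(mean(expression))
```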

# 2. Data types in Python

Before we can use our imported package, we need to learn about the data types that are used in Python. Here we want to give you a short overview of these data types.

### Lists


**A list is an ordered, mutable collection of items.**

In [21]:
fruits = ["apple", "banana", "cherry"]
fruits

['apple', 'banana', 'cherry']

Elements of a list can be accessed using brackets: [] 

Beware: Python starts counting elements at 0! Go ahead and test what happens when you try to access different items in the list.

In [22]:
print(fruits[0])

apple
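
A few more ways to access list elements are sketched below: negative indices, slices, and what happens when an index does not exist. The list is redefined here so the example runs on its own.

```python
fruits = ["apple", "banana", "cherry"]

print(fruits[-1])    # negative indices count from the end
print(fruits[0:2])   # a slice: the start is included, the stop is excluded
print(len(fruits))   # number of elements in the list

# Asking for an index that does not exist raises an IndexError
try:
    fruits[3]
except IndexError as error:
    print("Oops:", error)
```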


To add to a list, we can use the `append` method.

In [23]:
fruits.append('mleon')
fruits

['apple', 'banana', 'cherry', 'mleon']

Or we can replace an element in the list (for example, to fix a typo...):

In [24]:
fruits[3] = 'melon'
fruits

['apple', 'banana', 'cherry', 'melon']

### Dictionaries

**A dictionary stores key-value pairs, allowing fast retrieval based on unique keys.**

Dictionaries (dicts for short) are very powerful, especially in bioinformatics. Think of pairs of primer names and sequences! So, remember dicts! They are your friends.

In [25]:
primer_order = {'primer_1':'AATGC', 'primer_2':'CGTAGCT', 'primer_3':'cgtcagt'}
primer_order['primer_2']

'CGTAGCT'

By the way, did you see that the sequence in `primer_3` is lowercase? We can fix that if we want:

In [26]:
primer_order['primer_3'].upper()

'CGTCAGT'

... and write the data back to the dict

In [27]:
primer_order['primer_3'] = primer_order['primer_3'].upper()
primer_order

{'primer_1': 'AATGC', 'primer_2': 'CGTAGCT', 'primer_3': 'CGTCAGT'}

You can always return all keys and values as lists:

In [28]:
list(primer_order.keys())

['primer_1', 'primer_2', 'primer_3']

In [29]:
list(primer_order.values())

['AATGC', 'CGTAGCT', 'CGTCAGT']
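
Often we want keys and values together. The sketch below loops over the pairs with `.items()` and builds a new dict from an existing one. The helper `gc_content` is a hypothetical function written only for this example.

```python
primer_order = {'primer_1': 'AATGC', 'primer_2': 'CGTAGCT', 'primer_3': 'CGTCAGT'}

def gc_content(sequence):
    """Fraction of G and C bases in a sequence (illustrative helper)."""
    return (sequence.count('G') + sequence.count('C')) / len(sequence)

# .items() yields (key, value) pairs
for name, sequence in primer_order.items():
    print(name, len(sequence), round(gc_content(sequence), 2))

# A dict comprehension builds a new dict: primer name -> primer length
primer_lengths = {name: len(sequence) for name, sequence in primer_order.items()}
print(primer_lengths)
```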

<div class="alert alert-block alert-success"> <b>Now it's your turn!</b> <br>
<b>Exercise 1:</b>
Perform the following actions: <br>
1. Create a dictionary with cute zoo animals and their names: <br><br>
    a "Lion" named "Leo",<br>
    a "Tiger" named "Tobias",<br>
    an "Elephant" named "Ella",<br>
    a "Giraffe" named "Gabriella",<br>
    a "Zebra" named "Zelda",<br>
    a "Tiger" named "Ted" <br> <br>
2. From the dictionary, create a <b>non-redundant list</b> of all zoo animals <br>
3. From the dictionary, create a list of all zoo animals' names, sorted alphabetically (use the <code>sorted(your_list)</code> command!) <br> <br>

<i>Hint: Be careful how you construct your dict! Python can only handle unique keys! Values can be non-unique...</i>


</div>

In [30]:
# Your solution goes here




















In [31]:
# A possible solution: CLICK to unfold
'''
Construct the dict using the animals' names as keys, since they are unique anyway.
If we do it the other way round, the duplicate key "Tiger" is overwritten and we
lose Tobias :-( Try it out!
By the way, this triple-quoted string serves as a comment block!
'''

zoo_animals = {
    "Leo": "Lion",
    "Tobias": "Tiger",
    "Ella": "Elephant",
    "Gabriella": "Giraffe",
    "Zelda": "Zebra",
    "Ted" : "Tiger" 
}

all_animals = list(set(zoo_animals.values()))
print(all_animals)

all_names_sorted = sorted(list(zoo_animals.keys()))
print(all_names_sorted)

['Elephant', 'Tiger', 'Giraffe', 'Lion', 'Zebra']
['Ella', 'Gabriella', 'Leo', 'Ted', 'Tobias', 'Zelda']
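
If we do want to look things up by animal, one possible workaround for the unique-key limitation is to store a list of names per animal. This is only a sketch of that idea; the variable name `names_by_animal` is made up for this example.

```python
zoo_animals = {
    "Leo": "Lion",
    "Tobias": "Tiger",
    "Ella": "Elephant",
    "Gabriella": "Giraffe",
    "Zelda": "Zebra",
    "Ted": "Tiger",
}

# Each animal becomes a key, the value is a list that collects all names
names_by_animal = {}
for name, animal in zoo_animals.items():
    names_by_animal.setdefault(animal, []).append(name)

print(names_by_animal["Tiger"])   # both tigers are kept
print(len(names_by_animal))       # number of distinct animals
```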


### Tuples & Sets
Tuples and sets are included only to complete the overview of data types; we will not cover them in detail today.


**A tuple is an ordered, immutable collection of items.**

In [32]:
coordinates = (10, 20)
print(coordinates[1]) 

20


In [33]:
# tuples are immutable: uncommenting the next line raises a TypeError
#coordinates[0] = 15
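
To see immutability in action without stopping the notebook, the assignment can be wrapped in `try`/`except`. This is a small sketch; it also shows how a tuple is unpacked into separate variables.

```python
coordinates = (10, 20)

# Item assignment is not allowed on a tuple
try:
    coordinates[0] = 15
except TypeError as error:
    print("Not allowed:", error)

# To "change" a tuple we build a new one instead
coordinates = (15, coordinates[1])
print(coordinates)

# Unpacking assigns each element to its own variable
x, y = coordinates
print(x, y)
```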

**A set is an unordered collection of unique items.**

In [61]:
unique_numbers = {1, 2, 3, 2, 1}

In [62]:
# Sets automatically remove duplicates:
unique_numbers

{1, 2, 3}
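
Sets become useful when comparing collections, for example gene lists from two samples. The gene sets below are invented for this sketch.

```python
genes_sample_a = {"BRCA1", "EGFR", "MYC"}
genes_sample_b = {"MYC", "APOE", "EGFR"}

shared = genes_sample_a & genes_sample_b    # intersection: in both sets
combined = genes_sample_a | genes_sample_b  # union: in at least one set
only_a = genes_sample_a - genes_sample_b    # difference: only in sample a

# sorted() gives a list in a predictable order, sets themselves are unordered
print(sorted(shared))
print(sorted(combined))
print(sorted(only_a))
```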

# 3. Pandas Dataframes

Pandas is a versatile library for data manipulation and analysis. It introduces two primary data structures: Series (1-dimensional) and DataFrame (2-dimensional), which are perfect for handling tabular data commonly found in bioinformatics. 

Here we want to explore the basics of dataframes with an example similar to count data from a transcriptomics experiment. In this dataframe we have genes, their respective chromosomes, and the expression level of these genes in our sample.

In [41]:
# Creating a DataFrame from a dictionary
gene_data = {
    'Gene': ['BRCA1', 'GLDP1', 'EGFR', 'MYC', 'APOE'],
    'Chromosome': ['17', '17', '7', '8', '19'],
    'Expression_Level': [50, 75, 65, 80, 45]
    
}

df = pd.DataFrame(gene_data)

Now we can take a look at the dataframe:

In [42]:
df

|   | Gene | Chromosome | Expression_Level |
|---|---|---|---|
| 0 | BRCA1 | 17 | 50 |
| 1 | GLDP1 | 17 | 75 |
| 2 | EGFR | 7 | 65 |
| 3 | MYC | 8 | 80 |
| 4 | APOE | 19 | 45 |
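
Before working with a dataframe, it helps to check its size and column types. The sketch below rebuilds the same dataframe so that it runs on its own.

```python
import pandas as pd

gene_data = {
    'Gene': ['BRCA1', 'GLDP1', 'EGFR', 'MYC', 'APOE'],
    'Chromosome': ['17', '17', '7', '8', '19'],
    'Expression_Level': [50, 75, 65, 80, 45],
}
df = pd.DataFrame(gene_data)

print(df.shape)     # (number of rows, number of columns)
print(df.dtypes)    # Chromosome holds text, Expression_Level holds integers
print(df.head(3))   # only the first three rows

# Simple statistics work directly on a numeric column
mean_expression = df['Expression_Level'].mean()
print(mean_expression)
```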


We can now select data based on rows or columns:

In [43]:
df[df['Chromosome'] == '17']

|   | Gene | Chromosome | Expression_Level |
|---|---|---|---|
| 0 | BRCA1 | 17 | 50 |
| 1 | GLDP1 | 17 | 75 |


In [44]:
df[df['Expression_Level'] < 50]

|   | Gene | Chromosome | Expression_Level |
|---|---|---|---|
| 4 | APOE | 19 | 45 |
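
Conditions can also be combined. In pandas this is done with `&` (and) and `|` (or), and each condition needs its own parentheses. The thresholds below are arbitrary and only serve as an illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Gene': ['BRCA1', 'GLDP1', 'EGFR', 'MYC', 'APOE'],
    'Chromosome': ['17', '17', '7', '8', '19'],
    'Expression_Level': [50, 75, 65, 80, 45],
})

# Both conditions must be true: on chromosome 17 AND expression above 60
high_on_17 = df[(df['Chromosome'] == '17') & (df['Expression_Level'] > 60)]
print(high_on_17)

# At least one condition must be true: expression below 50 OR on chromosome 7
low_or_chr7 = df[(df['Expression_Level'] < 50) | (df['Chromosome'] == '7')]
print(low_or_chr7)

# isin() checks a column against a list of allowed values
selected = df[df['Gene'].isin(['MYC', 'APOE'])]
print(selected)
```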


<div class="alert alert-block alert-success"> <b>Now it's your turn!</b> <br>
<b>Exercise 2:</b>

1. Find all genes with an expression level over 70.
2. Get all information for the gene BRCA1 from the dataframe.

</div>

In [45]:
# Your solution goes here












In [46]:
# A possible solution: CLICK to unfold

# 1. Genes with expression level over 70
x = df[df['Expression_Level'] > 70]
print(x)

# 2. All Information for gene BRCA1
y = df[df['Gene'] == 'BRCA1']
print(y)

    Gene Chromosome  Expression_Level
1  GLDP1         17                75
3    MYC          8                80
    Gene Chromosome  Expression_Level
0  BRCA1         17                50


We can also access a single row or column of our dataframe.

In [47]:
#accessing a row (is returned as Series object)
df.loc[0]

Gene                BRCA1
Chromosome             17
Expression_Level       50
Name: 0, dtype: object

In [48]:
#accessing multiple rows (is returned as a dataframe)
df.loc[[0,1]]

|   | Gene | Chromosome | Expression_Level |
|---|---|---|---|
| 0 | BRCA1 | 17 | 50 |
| 1 | GLDP1 | 17 | 75 |


In [49]:
#accessing a single field by row and column
df.loc[0, 'Gene']


'BRCA1'
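
Note that `.loc` selects by row *label*, not by position. As long as the labels are 0, 1, 2, ... both are the same, but after sorting they differ. The sketch below contrasts `.loc` with the position-based `.iloc`; the variable names are chosen for this example only.

```python
import pandas as pd

df = pd.DataFrame({
    'Gene': ['BRCA1', 'GLDP1', 'EGFR', 'MYC', 'APOE'],
    'Chromosome': ['17', '17', '7', '8', '19'],
    'Expression_Level': [50, 75, 65, 80, 45],
})

# Sorting reorders the rows but every row keeps its original label
by_expression = df.sort_values('Expression_Level', ascending=False)

label_zero = by_expression.loc[0, 'Gene']    # the row labelled 0
first_position = by_expression.iloc[0, 0]    # the first row, first column
print(label_zero, first_position)

# A single column can be turned into a plain Python list
genes = df['Gene'].tolist()
print(genes)
```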

# 4. Importing your own data files

Pandas also allows us to read in our own data from a **CSV** or **Excel** file as dataframes!

In [65]:
# reading in your raw count data
import pandas as pd

counts = pd.read_csv('gene_counts.txt')

print(counts)

      # Program:featureCounts v2.0.6; Command:"/mnt/bin/subread/subread-v2.0.6/bin/featureCounts" "-T" "8" "-t" "gene" "-g" "ID" "-a" "Arabidopsis_thaliana.TAIR10.61.gff3" "-o" "gene_counts.txt" "aligned_bams/col0.root.28C.rep1_Aligned.sortedByCoord.out.bam" "aligned_bams/col0.root.28C.rep2_Aligned.sortedByCoord.out.bam" "aligned_bams/col0.root.28C.rep3_Aligned.sortedByCoord.out.bam" "aligned_bams/hy5.root.28C.rep1_Aligned.sortedByCoord.out.bam" "aligned_bams/hy5.root.28C.rep2_Aligned.sortedByCoord.out.bam" "aligned_bams/hy5.root.28C.rep3_Aligned.sortedByCoord.out.bam" 
0      Geneid\tChr\tStart\tEnd\tStrand\tLength\talign...                                                                                                                                                                                                                                                                                                                                                                               

This does not look right... Any idea what is wrong?

Our data is separated by **tabs** (`\t`) instead of commas, which pandas expects as the default delimiter of a **comma-separated file**. If you go to the documentation page of pandas for the `read_csv()` function (https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html), can you find an option which might help here?

Using the option `delimiter` (an alias for `sep`), we can tell the function which delimiter is used in our file.

In [66]:
counts = pd.read_csv('gene_counts.txt', delimiter='\t')

print(counts)

                                                                                                                                                                                                                                                                                                              # Program:featureCounts v2.0.6; Command:"/mnt/bin/subread/subread-v2.0.6/bin/featureCounts" "-T" "8" "-t" "gene" "-g" "ID" "-a" "Arabidopsis_thaliana.TAIR10.61.gff3" "-o" "gene_counts.txt" "aligned_bams/col0.root.28C.rep1_Aligned.sortedByCoord.out.bam" "aligned_bams/col0.root.28C.rep2_Aligned.sortedByCoord.out.bam" "aligned_bams/col0.root.28C.rep3_Aligned.sortedByCoord.out.bam" "aligned_bams/hy5.root.28C.rep1_Aligned.sortedByCoord.out.bam" "aligned_bams/hy5.root.28C.rep2_Aligned.sortedByCoord.out.bam" "aligned_bams/hy5.root.28C.rep3_Aligned.sortedByCoord.out.bam" 
Geneid         Chr Start  End    Strand Length aligned_bams/col0.root.28C.rep1_Aligned.sortedB... aligned_bams/col0.root.28C.re
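
The effect of the delimiter can be reproduced without the real file. The sketch below uses `io.StringIO` to pretend that a short string is a file; the three-column text is a cut-down stand-in for `gene_counts.txt`, not the real data.

```python
import io
import pandas as pd

# A tiny tab-separated "file" held in memory
text = "Geneid\tChr\tLength\nAT1G01010\t1\t2269\nAT1G01020\t1\t2343\n"

# Default delimiter (comma): every line ends up in one single column
wrong = pd.read_csv(io.StringIO(text))
print(wrong.shape)

# Tab as delimiter: the three columns are separated correctly
right = pd.read_csv(io.StringIO(text), delimiter='\t')
print(right.shape)
print(list(right.columns))
```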

Now we have the correct delimiter. There is still another problem with the data. Can you spot it?

The first line of our file is a comment written by featureCounts, so the actual header is in the second line. We can tell `read_csv()` which row contains the header by passing its row number. Rows are counted from 0, so `header=1` selects the second line. If our data does not have a header, we can use `header=None`.

In [67]:
counts = pd.read_csv('gene_counts.txt', delimiter='\t', header=1)

print(counts)

               Geneid Chr   Start     End Strand  Length  \
0      gene:AT1G01010   1    3631    5899      +    2269   
1      gene:AT1G01020   1    6788    9130      -    2343   
2      gene:AT1G01030   1   11649   13714      -    2066   
3      gene:AT1G01040   1   23121   31227      +    8107   
4      gene:AT1G01050   1   31170   33171      -    2002   
...               ...  ..     ...     ...    ...     ...   
27650  gene:ATCG01250  Pt  141854  143708      +    1855   
27651  gene:ATCG01270  Pt  144921  145154      -     234   
27652  gene:ATCG01280  Pt  145291  152175      -    6885   
27653  gene:ATCG01300  Pt  152506  152787      +     282   
27654  gene:ATCG01310  Pt  152806  154312      +    1507   

       aligned_bams/col0.root.28C.rep1_Aligned.sortedByCoord.out.bam  \
0                                                    290               
1                                                    171               
2                                                     35       
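
As a possible alternative to counting rows, `read_csv()` also has a `comment` option that skips lines starting with a given character. The sketch below compares both approaches on a small in-memory stand-in for the file (the comment line is shortened). Be aware that `comment='#'` would also cut off any data line at a `#`, so it only fits files where that character does not occur in the data.

```python
import io
import pandas as pd

text = (
    "# Program:featureCounts v2.0.6 (comment line shortened for this sketch)\n"
    "Geneid\tChr\tLength\n"
    "gene:AT1G01010\t1\t2269\n"
    "gene:AT1G01020\t1\t2343\n"
)

# Option 1: tell pandas that the header is in row 1 (counting from 0)
by_row = pd.read_csv(io.StringIO(text), delimiter='\t', header=1)

# Option 2: tell pandas to ignore lines that start with '#'
by_comment = pd.read_csv(io.StringIO(text), delimiter='\t', comment='#')

print(by_row)
print(by_row.equals(by_comment))   # both approaches give the same table
```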

As we can see, the column names are not ideal, as they are very long. First, we can get a list of the column names using the following command:

In [68]:
counts.columns

Index(['Geneid', 'Chr', 'Start', 'End', 'Strand', 'Length',
       'aligned_bams/col0.root.28C.rep1_Aligned.sortedByCoord.out.bam',
       'aligned_bams/col0.root.28C.rep2_Aligned.sortedByCoord.out.bam',
       'aligned_bams/col0.root.28C.rep3_Aligned.sortedByCoord.out.bam',
       'aligned_bams/hy5.root.28C.rep1_Aligned.sortedByCoord.out.bam',
       'aligned_bams/hy5.root.28C.rep2_Aligned.sortedByCoord.out.bam',
       'aligned_bams/hy5.root.28C.rep3_Aligned.sortedByCoord.out.bam'],
      dtype='object')

We can replace the column names by assigning a new list. Make sure to keep the correct order, so that you do not mix up your samples!

In [69]:
counts.columns =  ['Geneid', 'Chr', 'Start', 'End', 'Strand', 'Length',
       'col0.root.28C.rep1',
       'col0.root.28C.rep2',
       'col0.root.28C.rep3',
       'hy5.root.28C.rep1',
       'hy5.root.28C.rep2',
       'hy5.root.28C.rep3']

In [55]:
counts.columns

Index(['Geneid', 'Chr', 'Start', 'End', 'Strand', 'Length',
       'col0.root.28C.rep1', 'col0.root.28C.rep2', 'col0.root.28C.rep3',
       'hy5.root.28C.rep1', 'hy5.root.28C.rep2', 'hy5.root.28C.rep3'],
      dtype='object')
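
Typing the new names by hand works, but the order is easy to get wrong. A possible alternative is to strip the unwanted parts from the existing names, so the order cannot change. The sketch below does this on a one-row stand-in dataframe with two of the sample columns.

```python
import pandas as pd

counts = pd.DataFrame({
    'Geneid': ['gene:AT1G01010'],
    'aligned_bams/col0.root.28C.rep1_Aligned.sortedByCoord.out.bam': [290],
    'aligned_bams/hy5.root.28C.rep1_Aligned.sortedByCoord.out.bam': [380],
})

# Remove the folder prefix and the file suffix from every column name.
# regex=False treats the patterns as plain text, not as regular expressions.
counts.columns = (
    counts.columns
    .str.replace('aligned_bams/', '', regex=False)
    .str.replace('_Aligned.sortedByCoord.out.bam', '', regex=False)
)

print(list(counts.columns))
```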

We can also set the index of the dataframe to the `Geneid` column, since the gene IDs will be the same in all the count data. This also allows us to access rows by gene name instead of using the row number.

But first we remove the "gene:" prefix from this column, as it makes things unnecessarily complicated.

In [70]:
counts['Geneid'] = counts['Geneid'].str.replace('gene:', '')
counts

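|   | Geneid | Chr | Start | End | Strand | Length | col0.root.28C.rep1 | col0.root.28C.rep2 | col0.root.28C.rep3 | hy5.root.28C.rep1 | hy5.root.28C.rep2 | hy5.root.28C.rep3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AT1G01010 | 1 | 3631 | 5899 | + | 2269 | 290 | 265 | 272 | 380 | 433 | 350 |
| 1 | AT1G01020 | 1 | 6788 | 9130 | - | 2343 | 171 | 177 | 193 | 344 | 301 | 294 |
| 2 | AT1G01030 | 1 | 11649 | 13714 | - | 2066 | 35 | 57 | 58 | 60 | 58 | 57 |
| 3 | AT1G01040 | 1 | 23121 | 31227 | + | 8107 | 744 | 739 | 803 | 880 | 884 | 802 |
| 4 | AT1G01050 | 1 | 31170 | 33171 | - | 2002 | 1544 | 1328 | 1502 | 1697 | 1870 | 1626 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 27650 | ATCG01250 | Pt | 141854 | 143708 | + | 1855 | 0 | 0 | 0 | 0 | 0 | 0 |
| 27651 | ATCG01270 | Pt | 144921 | 145154 | - | 234 | 0 | 0 | 0 | 0 | 0 | 0 |
| 27652 | ATCG01280 | Pt | 145291 | 152175 | - | 6885 | 0 | 0 | 0 | 0 | 0 | 0 |
| 27653 | ATCG01300 | Pt | 152506 | 152787 | + | 282 | 0 | 0 | 0 | 0 | 0 | 0 |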


In [71]:
counts = counts.set_index('Geneid')
counts

| Geneid | Chr | Start | End | Strand | Length | col0.root.28C.rep1 | col0.root.28C.rep2 | col0.root.28C.rep3 | hy5.root.28C.rep1 | hy5.root.28C.rep2 | hy5.root.28C.rep3 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| AT1G01010 | 1 | 3631 | 5899 | + | 2269 | 290 | 265 | 272 | 380 | 433 | 350 |
| AT1G01020 | 1 | 6788 | 9130 | - | 2343 | 171 | 177 | 193 | 344 | 301 | 294 |
| AT1G01030 | 1 | 11649 | 13714 | - | 2066 | 35 | 57 | 58 | 60 | 58 | 57 |
| AT1G01040 | 1 | 23121 | 31227 | + | 8107 | 744 | 739 | 803 | 880 | 884 | 802 |
| AT1G01050 | 1 | 31170 | 33171 | - | 2002 | 1544 | 1328 | 1502 | 1697 | 1870 | 1626 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| ATCG01250 | Pt | 141854 | 143708 | + | 1855 | 0 | 0 | 0 | 0 | 0 | 0 |
| ATCG01270 | Pt | 144921 | 145154 | - | 234 | 0 | 0 | 0 | 0 | 0 | 0 |
| ATCG01280 | Pt | 145291 | 152175 | - | 6885 | 0 | 0 | 0 | 0 | 0 | 0 |
| ATCG01300 | Pt | 152506 | 152787 | + | 282 | 0 | 0 | 0 | 0 | 0 | 0 |


Now we can access each gene by using the Geneid instead of the row number:

In [73]:
counts.loc['AT1G01010']

Chr                      1
Start                 3631
End                   5899
Strand                   +
Length                2269
col0.root.28C.rep1     290
col0.root.28C.rep2     265
col0.root.28C.rep3     272
hy5.root.28C.rep1      380
hy5.root.28C.rep2      433
hy5.root.28C.rep3      350
Name: AT1G01010, dtype: object
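
With the gene IDs as index, we can also pick a single count or a group of sample columns for one gene. The sketch below uses a two-gene stand-in for the count table, with values copied from the first two rows shown above.

```python
import pandas as pd

counts = pd.DataFrame(
    {
        'Length': [2269, 2343],
        'col0.root.28C.rep1': [290, 171],
        'col0.root.28C.rep2': [265, 177],
        'hy5.root.28C.rep1': [380, 344],
    },
    index=pd.Index(['AT1G01010', 'AT1G01020'], name='Geneid'),
)

# One gene, one sample: a single value
single_count = counts.loc['AT1G01010', 'col0.root.28C.rep1']
print(single_count)

# One gene, all Col-0 samples: select the columns by their name prefix
col0_columns = [name for name in counts.columns if name.startswith('col0')]
print(counts.loc['AT1G01020', col0_columns])

# Check whether a gene is present before trying to access it
print('AT1G01030' in counts.index)
```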

**Metadata**

Since our samples belong to two different categories, we can add some metadata. For this we read in the metadata file. Alternatively, you can create the dataframe from a dictionary if you do not have a metadata file available.

In [60]:
# read in the sample metadata (sample names and genotypes)
metadata = pd.read_csv('metadata.csv')
metadata

|   | sample_id | genotype |
|---|---|---|
| 0 | col0.root.28C.rep1 | Col-0 |
| 1 | col0.root.28C.rep2 | Col-0 |
| 2 | col0.root.28C.rep3 | Col-0 |
| 3 | hy5.root.28C.rep1 | hy5 |
| 4 | hy5.root.28C.rep2 | hy5 |
| 5 | hy5.root.28C.rep3 | hy5 |
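
If no metadata file is available, the same table can be built from a dictionary. The sketch below does that and then uses the metadata to average the counts of one gene per genotype. The counts are copied from the `AT1G01010` row shown earlier; the name `mean_counts` is made up for this example.

```python
import pandas as pd

# Metadata built from a dictionary instead of a file
metadata = pd.DataFrame({
    'sample_id': ['col0.root.28C.rep1', 'col0.root.28C.rep2', 'col0.root.28C.rep3',
                  'hy5.root.28C.rep1', 'hy5.root.28C.rep2', 'hy5.root.28C.rep3'],
    'genotype': ['Col-0', 'Col-0', 'Col-0', 'hy5', 'hy5', 'hy5'],
})

# One-gene stand-in for the count table
counts = pd.DataFrame(
    {'col0.root.28C.rep1': [290], 'col0.root.28C.rep2': [265], 'col0.root.28C.rep3': [272],
     'hy5.root.28C.rep1': [380], 'hy5.root.28C.rep2': [433], 'hy5.root.28C.rep3': [350]},
    index=['AT1G01010'],
)

# For each genotype, look up its samples and average their counts
mean_counts = {}
for genotype in metadata['genotype'].unique():
    samples = metadata.loc[metadata['genotype'] == genotype, 'sample_id'].tolist()
    mean_counts[genotype] = counts.loc['AT1G01010', samples].mean()

print(mean_counts)
```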
