# **[Project] Modelling of DNA Methylation Information**

# Introduction

We provide human [nanopore sequencing](https://en.wikipedia.org/wiki/Nanopore_sequencing) data comprising:

- the actual reads (i.e., base sequences),
- non-hydroxy methylation information.

> DNA methylation is a biological process by which methyl groups are added to the DNA molecule.
> Methylation can change the activity of a DNA segment without changing the sequence.
> When located in a gene promoter, DNA methylation typically acts to repress gene transcription.
> Two of DNA's four bases, cytosine and adenine, can be methylated.
> Cytosine methylation is widespread in both eukaryotes and prokaryotes, so we focus on cytosine methylation only.

The data was aligned to a reference genome, and the data pertaining to chromosome 20 was extracted.

The data is in the [TSV format](https://en.wikipedia.org/wiki/Tab-separated_values) ([CSV](https://en.wikipedia.org/wiki/Comma-separated_values)-like, with tabs as the delimiter).

Each file contains one read per line, comprising the following information:

- First column: mapping position
- Second column: DNA sequence
- Third column: methylation data

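The methylation data is given as a comma-separated list of skip counts. Each number is the count of unmethylated Cs skipped since the previous methylated C (or since the start of the read, for the first number); the C that follows the skipped ones is methylated. Only C methylation is in focus here, so other bases are not counted, and any Cs after the last listed entry are not methylated.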

As an example:

```
42   ACTGCCCTGCCCC   1,2,0
         ^    ^^
```

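The read starts at position 42; the marked Cs are methylated. Reading the list `1,2,0` from left to right: one C is skipped before the first methylated C, two Cs are skipped before the second, and none before the third.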
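
The decoding rule can be sketched in a few lines of Python. This is a minimal illustration, not part of the provided material: the function name `methylated_offsets` is made up for this example, and it assumes the sequence uses uppercase `C` and that the skip counts refer to the Cs of the read as written.

```python
from __future__ import annotations


def methylated_offsets(sequence: str, skips: str) -> list[int]:
    """Return the 0-based offsets (within the read) of the methylated Cs.

    `skips` is the comma-separated skip-count string from the third column.
    """
    # Offsets of every C in the read, in order of appearance.
    c_offsets = [i for i, base in enumerate(sequence) if base == "C"]

    offsets = []
    c_index = -1  # index into c_offsets of the most recent methylated C
    for skip in (int(s) for s in skips.split(",") if s):
        # Jump over `skip` unmethylated Cs and land on the methylated one.
        c_index += skip + 1
        offsets.append(c_offsets[c_index])
    return offsets


# The example read from above.
print(methylated_offsets("ACTGCCCTGCCCC", "1,2,0"))  # [4, 9, 10]
```

The offsets are relative to the start of the read. Turning them into reference coordinates by adding the mapping position would additionally assume an alignment without insertions or deletions, which the data description does not state.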

# Data access

The download from the TNT homepage is straightforward:

```
! wget http://www.tnt.uni-hannover.de/edu/vorlesungen/AMLG/data/project-dna-methylation.tar.gz
! tar -xzvf project-dna-methylation.tar.gz
! mv -v project-dna-methylation/ data/
! rm -v project-dna-methylation.tar.gz
```

In the `data/` folder you will now find the file `reads.tsv`.

To start, you can read the TSV file into a [pandas](https://pandas.pydata.org) [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) using the [`pandas.read_csv()`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html#pandas-read-csv) function with the `sep` parameter set to `"\t"`:

```python
import pandas as pd

# Read the TSV file into a DataFrame
df = pd.read_csv(filepath_or_buffer="data/reads.tsv", sep="\t")

# Display the first few rows of the DataFrame
print(df.head())
```
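
By default, `pandas.read_csv()` treats the first line of the file as a header. If `reads.tsv` has no header row, which the one-read-per-line description suggests but does not confirm, the first read would be consumed as column names. The sketch below shows one way to load the three columns explicitly under that assumption. The column names `position`, `sequence`, and `methylation` are chosen for this example only, and the two reads are made-up sample data so that the snippet runs without the download.

```python
import io

import pandas as pd

# Two made-up reads in the three-column layout described above (no header row).
sample = "42\tACTGCCCTGCCCC\t1,2,0\n60\tGGCATC\t0\n"

reads = pd.read_csv(
    io.StringIO(sample),      # replace with "data/reads.tsv" for the real file
    sep="\t",
    header=None,              # assumption: the file has no header line
    names=["position", "sequence", "methylation"],  # hypothetical column names
    dtype={"methylation": str},  # keep skip lists such as "0" as text
    keep_default_na=False,    # keep an empty methylation field as "" instead of NaN
)

# Number of methylated Cs per read: one list entry per methylated C.
reads["n_methylated"] = reads["methylation"].map(
    lambda s: len(s.split(",")) if s else 0
)
# Total number of Cs per read, for comparison.
reads["n_cytosines"] = reads["sequence"].str.count("C")

print(reads[["position", "n_methylated", "n_cytosines"]])
```

Reading the methylation column as a string avoids pandas interpreting single-entry lists as integers, so every row can be split on commas in the same way.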