# DNA Digital Drives
------------------


> Humanity has a data storage problem: More data were created in the past 2 years than in all of preceding history. And that torrent of information may soon outstrip the ability of hard drives to capture it. Now, researchers report that they've come up with a new way to encode digital data in DNA to create the highest-density large-scale data storage scheme ever invented. Capable of storing 215 petabytes (215 million gigabytes) in a single gram of DNA, the system could, in principle, store every bit of datum ever recorded by humans in a container about the size and weight of a couple of pickup trucks. But whether the technology takes off may depend on its cost. [1]

While there are still technical hurdles to overcome on the biology side (mostly in reading data back out of the DNA drive [2]), several successful projects have demonstrated proof of concept [3]. However, the field has not settled on a consistent approach to the seemingly straightforward task of encoding text as DNA. In fact, a search of the literature shows quite a few different strategies, considerations, and approaches being pursued.

Now it is your turn to enter the field and write a text-to-DNA encoder and a DNA-to-text decoder. You may utilize any of the references below for ideas, inspiration, or specific algorithms. You do not need to develop a completely novel approach, but you must be able to __explain your approach, and support your rationale for all the decisions you made__. Do not forget about punctuation and numbers.

In addition, please create a short (no longer than 5 minutes) video explaining the background of the problem, your approach, details of your implementation, and a working demonstration of encoding and decoding the following text:
>"When in the Course of human events, it becomes necessary for one people to dissolve the political bands which have connected them with another, and to assume among the powers of the earth, the separate and equal station to which the Laws of Nature and of Nature's God entitle them, a decent respect to the opinions of mankind requires that they should declare the causes which impel them to the separation."

You do not need to do a complete code walkthrough, just an overview of how you implemented it. The presentation does not need high production value: it can just be you presenting slides with a talking head, and it does not need to be edited. This functionality is available directly in Panopto.

Upload your video to Panopto for students to view at this link: https://uchicago.hosted.panopto.com/Panopto/Pages/Sessions/List.aspx?folderID=d4e15734-eebd-402b-a75d-b09100e92746


Post a link to your video in the Slack `#showcase` channel using the following format:

```
    NAME:    <your name>
    VIDEO:    https://uchicago.hosted.panopto.com/Panopto/<your link here>
    DNA:     <Convert the string "Let knowledge grow from more to more; and so be human life enriched." to DNA using your approach>
```

Commit your presentation slides and any supporting materials (e.g., scripts) to the GitHub repository. Submit the GitHub URL to Canvas.


### References 
1. [DNA could store all of the world's data in one room](https://www.sciencemag.org/news/2017/03/dna-could-store-all-worlds-data-one-room)
2. [Storing data in DNA is a lot easier than getting it back out](https://www.technologyreview.com/2018/01/26/145993/storing-data-in-dna-is-a-lot-easier-than-getting-it-back-out/)
3. [Synthetic double-helix faithfully stores Shakespeare's sonnets | Nature](https://www.nature.com/articles/nature.2013.12279)

### Additional References
- [The Rise of DNA Data Storage](https://www.wired.com/story/the-rise-of-dna-data-storage/)
- [Towards practical, high-capacity, low-maintenance information storage in synthesized DNA | Nature](https://www.nature.com/articles/nature11875)
- [Nick Goldman talking about DNA Hard Drives at the WEF2015](https://www.youtube.com/watch?v=tBvd7OSDGgQ)
- [Goldman group DNA storage](http://www.ebi.ac.uk/research/goldman/dna-storage)
- [Emily Leproust talking about DNA storage](https://vimeo.com/119612296)
- http://courses.cs.vt.edu/cs2104/Spring13Onufriev/LectureNotes/DNA.storage.pdf
- [Hiding messages in DNA microdots](http://www.researchgate.net/profile/Carter_Bancroft/publication/12921709_Hiding_messages_in_DNA_microdots/links/0922b4f2ac1d18eb73000000.pdf)
- [An improved Huffman coding method for archiving text, images, and music characters in DNA](http://www.biotechniques.com/multimedia/archive/00055/Supplementary_Materi_55848a.pdf)
- [Bacterial based storage and encryption device](http://2010.igem.org/files/presentation/Hong_Kong-CUHK.pdf)
- [The Xenotext Experiment](http://triplehelixblog.com/2014/01/the-xenotext-experiment/)
- [If You Were a Secret Message, Where in the Human Genome Would You Hide?](http://nautil.us/blog/-if-you-were-a-secret-message-where-in-the-human-genome-would-you-hide)
- [Store digital files for eons in silica-encased DNA](http://hackaday.com/2015/02/21/store-digital-files-for-eons-in-silica-encased-dna)

## DNA Storage
#### Original Text: 
When in the Course of human events, it becomes necessary for one people to dissolve the political bands which have connected them with another, and to assume among the powers of the earth, the separate and equal station to which the Laws of Nature and of Nature's God entitle them, a decent respect to the opinions of mankind requires that they should declare the causes which impel them to the separation.


In [2]:
original_text = "When in the Course of human events, it becomes necessary for one people to dissolve the political bands which have connected them with another, and to assume among the powers of the earth, the separate and equal station to which the Laws of Nature and of Nature's God entitle them, a decent respect to the opinions of mankind requires that they should declare the causes which impel them to the separation."


In [3]:
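# Define a 3 x 3 matrix that maps (row, col) positions to 3-base DNA sequences.
# Each position maps to a unique sequence. I chose diverse, mixed, and balanced combinations.
# Note: the encoder below uses a 1-bit row and a 1-bit column, so only positions
# (0, 0), (0, 1), (1, 0), and (1, 1) are ever emitted; the other five entries are unused.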
matrix_to_dna = {
    (0, 0): 'ACG', (0, 1): 'TGC', (0, 2): 'GGA',
    (1, 0): 'CTA', (1, 1): 'GTT', (1, 2): 'AAC',
    (2, 0): 'TTG', (2, 1): 'CCA', (2, 2): 'GCT'
}


# Reverse map for decoding
dna_to_matrix = {v: k for k, v in matrix_to_dna.items()}
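
One way to sanity-check the codebook above is to confirm that it is invertible and to look at the GC fraction of each 3-base entry, since the comment describes the choices as "balanced". The sketch below is illustrative only: the helper name `gc_fraction` is an assumption and is not part of the original notebook.

```python
# Codebook copied from the cell above.
matrix_to_dna = {
    (0, 0): 'ACG', (0, 1): 'TGC', (0, 2): 'GGA',
    (1, 0): 'CTA', (1, 1): 'GTT', (1, 2): 'AAC',
    (2, 0): 'TTG', (2, 1): 'CCA', (2, 2): 'GCT',
}
dna_to_matrix = {v: k for k, v in matrix_to_dna.items()}


def gc_fraction(seq):
    """Hypothetical helper: fraction of bases in seq that are G or C."""
    return sum(base in 'GC' for base in seq) / len(seq)


# Decoding is only unambiguous if no 3-base sequence is used twice.
is_invertible = len(dna_to_matrix) == len(matrix_to_dna)
print("invertible:", is_invertible)

# Per-entry GC fraction; a 3-base sequence can only score 0, 1/3, 2/3, or 1.
codon_gc = {codon: gc_fraction(codon) for codon in dna_to_matrix}
for codon, gc in codon_gc.items():
    print(codon, round(gc, 2))
```

Because every entry has an odd length, no single entry can be exactly 50% GC; balance can only be judged over a longer encoded strand.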



### Encoder (Text to DNA)
Encode the text into binary and convert it into matrix positions

In [4]:

def text_to_binary(text):
    '''
    Function to convert text to binary
    Ref: https://www.geeksforgeeks.org/python-convert-string-to-binary/

    Inputs:
      text (string): input text as string

    Returns:
      bin_text (string): binary representation of the text
    ''' 
    bin_text = ''.join(format(ord(char), '08b') for char in text)
    return bin_text

def binary_to_matrix(bin_text):
    '''
    Function to split a binary string into 2-bit chunks, where each chunk represents a matrix location
    Ref: https://stackoverflow.com/questions/20024490/how-to-split-a-byte-string-into-separate-parts

    Inputs:
      bin_text (string): input binary text as string (length must be even)

    Returns:
      dna_matrix (list): list of (row, col) tuples, one per 2-bit chunk
    ''' 
    dna_matrix = []
    for i in range(0, len(bin_text), 2):
        row = int(bin_text[i], 2)  # for row (1st bit)
        col = int(bin_text[i + 1], 2)  # for column (2nd bit)
        dna_matrix.append((row, col))

    return dna_matrix
    
def encode_text_to_dna(text):
    '''
    Function to convert the text to a dna sequence representation

    Inputs:
      text (string): original text

    Returns:
      dna_sequence (string): the dna sequence representation of given text
    ''' 
    
    binary_text = text_to_binary(text)
    matrix_positions = binary_to_matrix(binary_text) 
    
    # map the matrix positions to DNA sequences
    dna_sequence = ''.join(matrix_to_dna[pos] for pos in matrix_positions) 
    
    return dna_sequence

encoded_dna = encode_text_to_dna(original_text)
print(encoded_dna)



TGCTGCTGCGTTTGCCTACTAACGTGCCTATGCTGCTGCCTAGTTCTAACGCTAACGACGTGCCTACTATGCTGCCTAGTTCTAACGCTAACGACGTGCGTTTGCACGTGCCTACTAACGTGCCTATGCTGCACGCTAACGACGTGCACGACGGTTTGCCTAGTTGTTTGCGTTTGCTGCTGCGTTACGCTATGCGTTACGGTTTGCCTATGCTGCACGCTAACGACGTGCCTAGTTGTTTGCCTATGCCTAACGCTAACGACGTGCCTACTAACGTGCGTTTGCTGCTGCCTAGTTTGCTGCCTAACGTGCTGCCTAGTTCTAACGCTAACGACGTGCCTATGCTGCTGCGTTTGCCTATGCCTATGCTGCTGCCTAGTTCTATGCGTTTGCACGTGCGTTACGGTTACGCTAGTTACGACGCTAACGACGTGCCTACTATGCTGCGTTTGCACGACGCTAACGACGTGCCTAACGCTATGCCTATGCTGCTGCCTAACGGTTTGCCTAGTTGTTTGCCTAGTTTGCTGCCTATGCTGCTGCGTTACGGTTACGCTAACGACGTGCCTAGTTCTATGCCTATGCTGCTGCCTAACGGTTTGCCTATGCTGCTGCGTTACGGTTTGCGTTACGGTTTGCCTAACGTGCTGCGTTACGCTATGCGTTCTATGCACGCTAACGACGTGCCTATGCCTATGCCTAGTTGTTTGCGTTACGCTAACGCTAACGACGTGCCTAGTTGTTTGCCTAGTTCTATGCCTATGCTGCACGCTAACGACGTGCGTTACGACGTGCCTATGCTGCTGCCTAGTTGTTTGCGTTACGACGTGCCTAGTTACGTGCCTATGCTGCACGCTAACGACGTGCGTTTGCACGTGCCTAGTTGTTACGCTAACGACGTGCCTATGCACGTGCCTACTATGCTGCGTTACGGTTTGCGTTACGGTTTGCCTAGTTGTTTGCCTAGTTACGTGCGTTTGCCTATGCCTATGCTGCACGC
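
To make the pipeline concrete, the trace below follows the first character of the text, `W`, through each stage. It is a self-contained sketch that repeats only the four codebook entries reachable with a 1-bit row and a 1-bit column; the variable names are illustrative and do not come from the notebook.

```python
# The four codebook entries reachable with a 1-bit row and a 1-bit column.
codebook = {(0, 0): 'ACG', (0, 1): 'TGC', (1, 0): 'CTA', (1, 1): 'GTT'}

char = 'W'
bits = format(ord(char), '08b')            # 8-bit binary of the code point
pairs = [(int(bits[i]), int(bits[i + 1]))  # one (row, col) per 2-bit chunk
         for i in range(0, len(bits), 2)]
dna = ''.join(codebook[p] for p in pairs)

print(bits)   # 01010111
print(pairs)  # [(0, 1), (0, 1), (0, 1), (1, 1)]
print(dna)    # TGCTGCTGCGTT

# Each 2-bit chunk costs 3 bases, so one 8-bit character costs 12 bases.
bits_per_base = len(bits) / len(dna)
print(round(bits_per_base, 3))  # 0.667
```

The result matches the first 12 bases of the output above. A four-letter alphabet can carry at most 2 bits per base, so this scheme uses three times as many bases as that upper bound.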

### Decoder (DNA to Text)


In [5]:

def matrix_to_binary(matrix_positions):
    '''
    Function to convert matrix positions back to a binary string

    Inputs:
      matrix_positions (list): list of (row, col) tuples, each value 0 or 1

    Returns:
      bit_text (string): binary representation of the text
    ''' 
    binary_data = []
    for row, col in matrix_positions:
        binary_data.append(format(row, '01b'))  # to convert row back to 1-bit
        binary_data.append(format(col, '01b'))  # to convert column back to 1-bit
    bit_text = ''.join(binary_data)
    return bit_text

def binary_to_text(binary):
    '''
    Function to convert a binary string back into characters, where every 8 bits form one character (ASCII)
    Ref: https://www.ibm.com/docs/en/informix-servers/14.10?topic=locale-code-sets-character-data

    Inputs:
      binary (string): the binary representation

    Returns:
      characters (string): the characters decoded
    ''' 
    text = []
    for i in range(0, len(binary), 8):
        text.append(chr(int(binary[i:i + 8], 2)))
    characters = ''.join(text)
    return characters

def decode_dna_to_text(dna_sequence):
    '''
    Function to decode the dna sequence into text

    Inputs:
      dna_sequence (string): your dna sequence

    Returns:
      characters (string): the characters decoded
    ''' 
    # split the DNA sequence into components of 3 (each set of 3 bases is a matrix position)
    dna_chunks = [dna_sequence[i:i+3] for i in range(0, len(dna_sequence), 3)]
    
    # DNA seq back to matrix positions
    matrix_positions = [dna_to_matrix[chunk] for chunk in dna_chunks]

    # convert to original text
    binary_data = matrix_to_binary(matrix_positions)
    out_text = binary_to_text(binary_data)
    return out_text

decoded_text = decode_dna_to_text(encoded_dna)
print(decoded_text)


When in the Course of human events, it becomes necessary for one people to dissolve the political bands which have connected them with another, and to assume among the powers of the earth, the separate and equal station to which the Laws of Nature and of Nature's God entitle them, a decent respect to the opinions of mankind requires that they should declare the causes which impel them to the separation.
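
The decoder above assumes a clean strand: its length must be a multiple of 3 and every chunk must appear in the codebook. A hedged sketch of input validation is shown below. The helper `split_codons` is hypothetical and not part of the original notebook, and it deliberately accepts only the four entries the encoder can emit, since an index of 2 would not fit in the 1-bit fields that `matrix_to_binary` writes.

```python
# 3-base sequences the encoder can actually emit (1-bit row, 1-bit column).
dna_to_matrix = {'ACG': (0, 0), 'TGC': (0, 1), 'CTA': (1, 0), 'GTT': (1, 1)}


def split_codons(dna_sequence):
    """Hypothetical helper: split a strand into validated 3-base chunks."""
    if len(dna_sequence) % 3 != 0:
        raise ValueError("sequence length is not a multiple of 3")
    codons = [dna_sequence[i:i + 3] for i in range(0, len(dna_sequence), 3)]
    for index, codon in enumerate(codons):
        if codon not in dna_to_matrix:
            raise ValueError(f"unknown chunk {codon!r} at chunk index {index}")
    return codons


print(split_codons('TGCTGCTGCGTT'))  # ['TGC', 'TGC', 'TGC', 'GTT']

try:
    split_codons('TGCTG')  # truncated strand
except ValueError as err:
    print("rejected:", err)
```

Failing early with a descriptive message is easier to debug than the bare `KeyError` the dictionary lookup would otherwise raise.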


In [6]:
# Compare the two texts:
if original_text == decoded_text:
    print("The texts are identical.")
else:
    print("The texts are NOT identical.")



The texts are identical.
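
The assignment asks that punctuation and numbers not be forgotten, so it is worth round-tripping a few extra strings. The sketch below also covers non-ASCII text: characters whose code point exceeds 255 produce more than 8 bits under `format(ord(char), '08b')`, which would misalign the 8-bit chunks in the decoder. Encoding to UTF-8 bytes first is one possible fix, not the notebook's method; `encode_bytes` and `decode_bytes` are hypothetical names.

```python
# Same four 3-base sequences as the notebook, keyed by 2-bit chunk.
codebook = {'00': 'ACG', '01': 'TGC', '10': 'CTA', '11': 'GTT'}
reverse = {codon: bits for bits, codon in codebook.items()}


def encode_bytes(text):
    """Sketch: text -> UTF-8 bytes -> bits -> DNA."""
    bits = ''.join(format(byte, '08b') for byte in text.encode('utf-8'))
    return ''.join(codebook[bits[i:i + 2]] for i in range(0, len(bits), 2))


def decode_bytes(dna):
    """Sketch: DNA -> bits -> UTF-8 bytes -> text."""
    bits = ''.join(reverse[dna[i:i + 3]] for i in range(0, len(dna), 3))
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    return data.decode('utf-8')


# Punctuation, digits, and characters outside the 8-bit range.
samples = ["Nature's God", "1776; 13 colonies!", "π ≈ 3.14"]
for sample in samples:
    assert decode_bytes(encode_bytes(sample)) == sample
print("all samples round-trip")
```

For pure ASCII input the UTF-8 bytes equal the code points, so this variant produces the same DNA as the notebook's encoder.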


In [7]:
new_text = "Let knowledge grow from more to more; and so be human life enriched."
new_enc = encode_text_to_dna(new_text)
print(new_enc)

TGCACGGTTACGTGCCTATGCTGCTGCGTTTGCACGACGCTAACGACGTGCCTACTAGTTTGCCTAGTTCTATGCCTAGTTGTTTGCGTTTGCGTTTGCCTAGTTACGTGCCTATGCTGCTGCCTATGCACGTGCCTATGCGTTTGCCTATGCTGCACGCTAACGACGTGCCTATGCGTTTGCGTTACGCTATGCCTAGTTGTTTGCGTTTGCGTTACGCTAACGACGTGCCTATGCCTATGCGTTACGCTATGCCTAGTTGTTTGCCTAGTTTGCACGCTAACGACGTGCCTAGTTTGCTGCCTAGTTGTTTGCGTTACGCTATGCCTATGCTGCACGCTAACGACGTGCGTTTGCACGTGCCTAGTTGTTACGCTAACGACGTGCCTAGTTTGCTGCCTAGTTGTTTGCGTTACGCTATGCCTATGCTGCACGGTTCTAGTTACGCTAACGACGTGCCTAACGTGCTGCCTAGTTCTATGCCTATGCACGACGCTAACGACGTGCGTTACGGTTTGCCTAGTTGTTACGCTAACGACGTGCCTAACGCTATGCCTATGCTGCACGCTAACGACGTGCCTACTAACGTGCGTTTGCTGCTGCCTAGTTTGCTGCCTAACGTGCTGCCTAGTTCTAACGCTAACGACGTGCCTAGTTACGTGCCTACTATGCTGCCTATGCCTATGCCTATGCTGCACGCTAACGACGTGCCTATGCTGCTGCCTAGTTCTATGCGTTACGCTATGCCTACTATGCTGCCTAACGGTTTGCCTACTAACGTGCCTATGCTGCTGCCTATGCACGACGCTAGTTCTA
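
Since the codebook was chosen with balance in mind, it can be useful to measure the encoded strand itself. The sketch below computes GC content and the longest single-base run on the first 24 bases of the output above (the letters `Le`). The function names are assumptions, and which values count as acceptable depends on the synthesis and sequencing method, which this document does not specify.

```python
from itertools import groupby


def gc_content(dna):
    """Fraction of bases that are G or C."""
    return sum(base in 'GC' for base in dna) / len(dna)


def longest_homopolymer(dna):
    """Length of the longest run of one repeated base."""
    return max(len(list(run)) for _, run in groupby(dna))


# First 24 bases of the encoded motto above, i.e. the characters "Le".
sample = 'TGCACGGTTACGTGCCTATGCTGC'

print(round(gc_content(sample), 3))
print(longest_homopolymer(sample))
```

Running the same two functions on the full `new_enc` string would give strand-level figures to cite when explaining the design rationale.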


References:
- Rise of DNA Data Storage: https://www.wired.com/story/the-rise-of-dna-data-storage/
- Synthetic DNA Storage Milestone: https://blogs.microsoft.com/ai/synthetic-dna-storage-milestone/
- Matrix representations of DNA: https://pubs.rsc.org/en/content/articlelanding/2015/sc/c4sc02930e
- Multidimensional representations of DNA: https://www.nature.com/articles/s41565-023-01348-9
- ASCII representation: https://www.ibm.com/docs/en/informix-servers/14.10?topic=locale-code-sets-character-data