Discover the algorithms underlying a variety of bioinformatics topics: computational mass spectrometry, alignment, dynamic programming, genome assembly, genome rearrangements, phylogeny, probability, string algorithms and others.

探索各种生物信息学相关的算法：计算质谱，比对，动态规划，基因组装配，基因组重排，系统发育，概率，字符串算法等。

# Counting DNA Nucleotides

# 计算DNA核苷酸

## A Rapid Introduction to Molecular Biology

## 分子生物学快速入门

Making up all living material, the **cell** is considered to be the building block of life. The **nucleus**, a component of most **eukaryotic** cells, was identified as the hub of cellular activity 150 years ago. Viewed under a light microscope, the nucleus appears only as a darker region of the cell, but as we increase magnification, we find that the nucleus is densely filled with a stew of macromolecules called **chromatin**. During **mitosis** (eukaryotic cell division), most of the chromatin condenses into long, thin strings called **chromosomes**. Figure 1 shows cells in different stages of mitosis.

**细胞**作为构成所有生物原料被认为是生命的基石。**细胞核**是大多数**真核细胞**的组成部分，150年前被确定为细胞活动的中心。在光学显微镜下观察，细胞核仅作为细胞的较暗区域出现，但随着我们增加放大倍数，我们发现细胞核密集地充满了称为**染色质**的大分子物质。在**有丝分裂**期间（真核细胞分裂），大多数染色质浓缩成长而细的细胞串，称为**染色体**。有关有丝分裂不同阶段的细胞图见下图。

![Figure 1. A 1900 drawing by Edmund Wilson of onion cells at different stages of mitosis. The sample has been dyed, causing chromatin in the cells (which soaks up the dye) to appear in greater contrast to the rest of the cell.](Images/001.png)

**Figure 1.** A 1900 drawing by Edmund Wilson of onion cells at different stages of mitosis. The sample has been dyed, causing chromatin in the cells (which soaks up the dye) to appear in greater contrast to the rest of the cell.

**图1.** 在1900年Edmund Wilson在有丝分裂不同阶段绘制的洋葱细胞图。由于样品已被染色，导致细胞中的染色质（吸收染料）与细胞的其他部分形成鲜明对比。

One class of the macromolecules contained in chromatin is called **nucleic acids**. Early 20th century research into the chemical identity of nucleic acids culminated with the conclusion that nucleic acids are **polymers**, or repeating chains of smaller, similarly structured molecules known as **monomers**. Because of their tendency to be long and thin, nucleic acid polymers are commonly called **strands**.

染色质中含有的一类大分子称为**核酸**。20世纪早期对核酸化学特性的研究最终得出结论：核酸是**聚合物**，或者将这种重复结构的称为**单体**。由于它们倾向于长而薄，核酸聚合物通常被称为**链**。

The nucleic acid monomer is called a **nucleotide** and is used as a unit of strand length (abbreviated to nt). Each nucleotide is formed of three parts: a **sugar** molecule, a negatively charged **ion** called a phosphate, and a compound called a **nucleobase** ("base" for short). Polymerization is achieved as the sugar of one nucleotide bonds to the phosphate of the next nucleotide in the chain, which forms a **sugar-phosphate backbone** for the nucleic acid strand. A key point is that the nucleotides of a specific type of nucleic acid always contain the same sugar and phosphate molecules, and they differ only in their choice of base. Thus, one strand of a nucleic acid can be differentiated from another based solely on the order of its bases; this ordering of bases defines a nucleic acid's **primary structure**.

核酸单体称为**核苷酸**，并作为链长度的单位（缩写为*nt*）。每个核苷酸由三部分组成：**糖分子**，带有负离子的**磷酸盐**，和**核碱基**化合物（简称“碱基”）。当一个核苷酸的糖与链中下一个核苷酸的磷酸键合时开始聚合，其形成核酸链的**糖-磷酸骨架**。关键点在于特定类型核酸的核苷酸总是含有相同的糖和磷酸盐分子，它们的区别仅在于它们对碱基的选择。因此，核酸的一条链可以仅基于其碱基的顺序与另一条链区分开;碱基的这种排序定义了核酸的**主要结构**。

For example, Figure 2 shows a strand of **deoxyribose nucleic acid** (DNA), in which the sugar is called **deoxyribose**, and the only four choices for nucleobases are molecules called **adenine** (A), **cytosine** (C), **guanine** (G), and **thymine** (T).

例如，图2显示了**脱氧核糖核酸**（DNA）链，其中糖被称为**脱氧核糖**，分别有四种碱基：**腺嘌呤**（A），**胞嘧啶**（C），**鸟嘌呤**（G）和**胸腺嘧啶**（T）。

![Figure 2. A sketch of DNA's primary structure.](Images/002.png)

**Figure 2.** A sketch of DNA's primary structure.


**图2.** DNA的主要结构草图。

For reasons we will soon see, DNA is found in all living organisms on Earth, including bacteria; it is even found in many viruses (which are often considered to be nonliving). Because of its importance, we reserve the term **genome** to refer to the sum total of the DNA contained in an organism's chromosomes.

DNA存在于地球上的所有生物体中，包括细菌;它甚至存在于许多病毒中（通常被认为是非生命的）。由于其重要性，我们使用“**基因组**”来指代生物体染色体中包含的DNA的总和。

## Problem

## 问题

A **string** is simply an ordered collection of symbols selected from some **alphabet** and formed into a word; the **length** of a string is the number of symbols that it contains.

**字符串**只是从某些**字母表**中选择的符号的有序集合，并形成一个单词;字符串的**长度**是它包含的符号数。 

An example of a length 21 **DNA string** (whose alphabet contains the symbols 'A', 'C', 'G', and 'T') is "ATGCTTCAGAAAGGTCTTACG."

长度为21的**DNA串**（其字母包含符号'A'，'C'，'G'和'T'）的示例是“ATGCTTCAGAAAGGTCTTACG”。

**Given:** A DNA string s of length at most 1000 nt.

**Return:** Four integers (separated by spaces) counting the respective number of times that the symbols 'A', 'C', 'G', and 'T' occur in s.

## Sample Dataset

## 样本数据集

```
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC
```

## Sample Output
 
## 样本输出

```
20 12 17 21
```

In [24]:
def count_DNA(string):
    return string.count("A"), string.count("C"), string.count("G"), string.count("T") 

In [25]:
print(count_DNA("AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC"))

(20, 12, 17, 21)


In [27]:
with open("Bioinformatics_Stronghold/data/rosalind_dna.txt", "r") as r_dna:
    r_dna = r_dna.read()

In [28]:
print(count_DNA(r_dna))

(243, 228, 221, 231)


In [29]:
r_dna.count("A")

243
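
The problem asks for four space-separated integers, whereas the cells above print a tuple. A minimal sketch of one way to produce the required format, assuming the input may carry a trailing newline when read from a file; the name `nucleotide_counts` is introduced here for illustration only:

```python
from collections import Counter


def nucleotide_counts(dna: str) -> str:
    """Return the counts of A, C, G and T as a space-separated string."""
    # strip() drops surrounding whitespace such as a trailing newline
    counts = Counter(dna.strip())
    # A Counter returns 0 for symbols that never occur
    return " ".join(str(counts[base]) for base in "ACGT")


sample = "AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC"
print(nucleotide_counts(sample))
```

A single pass with `Counter` avoids scanning the string four times, although for strings of at most 1000 nt either approach is fast enough.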

# Transcribing DNA into RNA

# DNA 转录为 RNA

## The Second Nucleic Acid

## 第二种核酸

In “**Counting DNA Nucleotides**”, we described the **primary structure** of a **nucleic acid** as a polymer of **nucleotide** units, and we mentioned that the omnipresent nucleic acid **DNA** is composed of a varied sequence of four bases.

在“**计数DNA核苷酸**”中，我们描述了**核酸**的**一级结构**作为**核苷酸**单位的聚合物，我们提到了无处不在的核酸**DNA**由四个碱基的不同序列组成。

Yet a second nucleic acid exists alongside DNA in the **chromatin**; this molecule, which possesses a different sugar called **ribose**, came to be known as **ribose nucleic acid**, or RNA. RNA differs further from DNA in that it contains a base called **uracil** in place of **thymine**; structural differences between DNA and RNA are shown in Figure 1. Biologists initially believed that RNA was only contained in plant **cells**, whereas DNA was restricted to animal cells. However, this hypothesis dissipated as improved chemical methods discovered both nucleic acids in the cells of all life forms on Earth.

然而，第二种核酸与**染色质**中的DNA一起存在;这种分子具有不同的糖，称为**核糖**，后来被称为**核糖核酸**或RNA。RNA与DNA的不同之处在于它含有一种叫做**尿嘧啶**的碱基代替**胸腺嘧啶**;DNA和RNA之间的结构差异如图1所示。生物学家最初认为RNA仅包含在植物**细胞**中，而DNA仅限于动物细胞。然而，随着改进的化学方法在地球上所有生命形式的细胞中发现了两种核酸，这一假设消失了。

![Figure 1. Structural differences between RNA and DNA](Images/003.png)

**Figure 1.** Structural differences between RNA and DNA

**图1.** RNA和DNA之间的结构差异

The **primary structure** of DNA and RNA is so similar because the former serves as a blueprint for the creation of a special kind of RNA molecule called **messenger RNA**, or mRNA. mRNA is created during RNA transcription, during which a **strand** of DNA is used as a template for constructing a strand of RNA by copying nucleotides one at a time, where uracil is used in place of thymine.

DNA和RNA的**主要结构**是如此相似，因为前者是创建**信使RNA**或mRNA这种特殊RNA分子的蓝图。mRNA在RNA转录期间产生，在此期间，DNA的**链**用作构建RNA链的模板，通过一次复制一个核苷酸，其中使用尿嘧啶代替胸腺嘧啶。

In eukaryotes, DNA remains in the **nucleus**, while RNA can enter the far reaches of the cell to carry out DNA's instructions. In future problems, we will examine the process and ramifications of RNA transcription in more detail.

在真核生物中，DNA存在于**细胞核**中，而RNA可以进入细胞的远端以执行DNA的命令。在以后的问题中，我们将更详细地研究RNA转录的过程和分枝。

## Problem

An **RNA string** is a string formed from the alphabet containing 'A', 'C', 'G', and 'U'.

**RNA串**是由包含'A'，'C'，'G'和'U'的字母组成的字符串。

Given a DNA string t corresponding to a coding strand, its transcribed RNA string u is formed by replacing all occurrences of 'T' in t with 'U' in u.

给定对应于编码链的DNA串*t*，其转录的RNA串*u*通过用*u*中的'U'替换t中所有出现的'T'而形成。
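
The replacement rule above can be sketched as follows. This is an illustrative variant, not the only way to do it: it assumes the input may contain surrounding whitespace and that anything outside 'A', 'C', 'G', 'T' should be rejected; the name `transcribe` is hypothetical.

```python
def transcribe(dna: str) -> str:
    """Transcribe a coding-strand DNA string into its RNA string."""
    dna = dna.strip()
    # Reject symbols outside the DNA alphabet (an assumption, not a Rosalind rule)
    unexpected = set(dna) - set("ACGT")
    if unexpected:
        raise ValueError(f"not a DNA string, found: {sorted(unexpected)}")
    # Every 'T' becomes 'U'; all other symbols are unchanged
    return dna.replace("T", "U")


print(transcribe("GATGGAACTTGACTACGTAAATT"))
```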

**Given:** A DNA string t having length at most 1000 nt.

**Return:** The transcribed RNA string of t.

## Sample Dataset

```
GATGGAACTTGACTACGTAAATT
```

## Sample Output

```
GAUGGAACUUGACUACGUAAAUU
```

In [20]:
def transcribing_RNA(string):
    return string.replace("T", "U")

In [21]:
print(transcribing_RNA("GATGGAACTTGACTACGTAAATT"))

GAUGGAACUUGACUACGUAAAUU


In [22]:
with open("../Bioinfo/Bioinformatics_Stronghold/data/rosalind_rna.txt") as rna:
    rna = rna.read()

In [23]:
print(transcribing_RNA(rna))

AACUGCGGCAUCUUAAUCGUGCACUCUUCACAAUGACUACAUGAACAUCAAUUCAGGACGAGGUCUUAUAGCCGGUACUAUGCUUGUCCUGUGAAGGUGCCAUGAGAACAUUGAGAAUAACGCCCCUGGCGCCUUUCACAAUCAUUUCGGGUCACUCCCCAUAUCCGCUAGGGCAACGGUGACGUCUUUCACGAAAUUCAUAGGUAAAGACCGACUUUCAAGCUUGCUAUACGAAUCGCCAGGUCCCUAUUAAACACUAGUAUACAUACACCUCCAGGUGGACCGCGAGUCAAAACAACCAAUUACCUUAGCCUGCAAUCGACCGAGUUAUGGCAGUCCGGAGGAUACGCCGCUCCUCGACCGCAUUUAACUGGUUUGUUGUCACACGAACGCGAUUCUACUGGUAAUUUAAUAUUCUAGAUGCUCUAAGAGCACUUUCUGUGAUGUGAUGCGAAAGGCAUGAACGCUAAACAACUGCCUCGCACAUCACUGUUCACAAGUAGAGCGAUGCCGUGUACUCACAUCAUGCCGUAGUUCUGGUGAUGUUCAGUGCCAGUACAAUGCAUCCUUGGCGCCCGCACGAGCUCCUUGAUAACACUGUGACAGAUAAGGCUAUCUGUAUACCGUCUUCGCGUCCUUCAGGCCUUCAGGGGAAGAGCGCUCAGAGAUACUUGAUCACGAUUCCGCCGGGCUCUGACGGAAUCCAACAGACACAAUUCUAGGCGGUAACCGGCCUUACUUGCGUAUGUGAGUUUCCUGAAAAUGCAUUUUCUAUUGCACCAUGAAUGCCUGGAGAAGUAAUCUCGUCGUCACUCCCAGUCCGACAAGCCAAUAAUCUACCCGCUUUGACUGUGACUAACAACAAUUCUCGCGGGCCGAACGGCAGGACCGGGGUUCGAGACACGAAUCAGCAGGAACAGGCCAGGCCUAGGUAAUGGCUAUGUUCUUGUCG



# Complementing a Strand of DNA

# DNA的互补链

## The Secondary and Tertiary Structures of DNA

## DNA的二级和三级结构

In “Counting DNA Nucleotides”, we introduced nucleic acids, and we saw that the **primary structure** of a nucleic acid is determined by the ordering of its **nucleobases** along the **sugar-phosphate backbone** that constitutes the bonds of the nucleic acid **polymer**. Yet primary structure tells us nothing about the larger, 3-dimensional shape of the molecule, which is vital for a complete understanding of nucleic acids.

在“计数DNA核苷酸”中，我们引入了核酸，并且我们看到核酸的**一级结构**由其核碱基沿着构成核酸聚合物键的**糖-磷酸**主链的有序性决定。然而，初级结构没有告诉我们关于分子的更大的三维形状，这对于完全理解核酸是至关重要的。

The search for a complete chemical structure of nucleic acids was central to molecular biology research in the mid-20th Century, culminating in 1953 with a publication in Nature of fewer than 800 words by James Watson and Francis Crick. Consolidating a high resolution X-ray image created by Rosalind Franklin and Raymond Gosling with a number of established chemical results, Watson and Crick proposed the following structure for DNA:

寻找完整的核酸化学结构是20世纪中叶分子生物学研究的核心，最终于1953年在詹姆斯·沃森和弗朗西斯·克里克的“自然”杂志上发表了不到800字的文章。结合由Rosalind Franklin和Raymond Gosling创建的高分辨率X射线图像以及许多已确定的化学结果，Watson和Crick提出了以下DNA结构：

1. The DNA molecule is made up of two strands, running in opposite directions.

2. Each base bonds to a base in the opposite strand. Adenine always bonds with thymine, and cytosine always bonds with guanine; the complement of a base is the base to which it always bonds; see Figure 1.

3. The two strands are twisted together into a long spiral staircase structure called a double helix; see Figure 2.


1. DNA分子由两条链组成，以相反的方向运行。 

2. 每个碱基与相反链中的碱基键合。腺嘌呤总是与胸腺嘧啶结合，胞嘧啶总是与鸟嘌呤结合;碱基互补是它始终联系的基础;参见图1. 

3. 将两股绞合成一个称为双螺旋的长螺旋楼梯结构;见图2。

![Figure 1. Base pairing across the two strands of DNA.](Images/004.png)

**Figure 1.** Base pairing across the two strands of DNA.

![Figure 2. The double helix of DNA on the molecular scale.](Images/005.png)

**Figure 2.** The double helix of DNA on the molecular scale.

Because they dictate how bases from different strands interact with each other, (1) and (2) above compose the secondary structure of DNA. (3) describes the 3-dimensional shape of the DNA molecule, or its tertiary structure.

因为它们决定了来自不同链的碱基如何相互作用，上述（1）和（2）构成了DNA的二级结构。（3）描述了DNA分子的三维形状或其三级结构。

In light of Watson and Crick's model, the bonding of two complementary bases is called a **base pair** (bp). Therefore, the length of a DNA molecule will commonly be given in bp instead of nt. By complementarity, once we know the order of bases on one strand, we can immediately deduce the sequence of bases in the complementary strand. These bases will run in the opposite order to match the fact that the two strands of DNA run in opposite directions.

根据Watson和Crick的模型，两个互补碱基的键合称为**碱基对**（bp）。因此，DNA分子的长度通常以bp而不是nt给出。通过互补性，一旦我们知道一条链上碱基的顺序，我们就可以立即推断出互补链中的碱基序列。这些碱基将以相反的顺序运行以匹配两条DNA链以相反方向运行的事实。

## Problem

In DNA strings, symbols 'A' and 'T' are complements of each other, as are 'C' and 'G'.

在DNA字符串中，符号“A”和“T”是彼此的互补，“C”和“G”也是如此。

The reverse complement of a DNA string *s* is the string $s^c$ formed by reversing the symbols of *s*, then taking the complement of each symbol (e.g., the reverse complement of "GTCA" is "TGAC").

DNA串*s*的反向互补是通过反转*s*的符号形成的串*sc*，然后取每个符号的补码（例如，“GTCA”的反向补码是“TGAC”）。
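
The definition above can be sketched with a translation table. This is a minimal illustration that assumes the input contains only 'A', 'C', 'G', 'T' plus optional surrounding whitespace; `COMPLEMENT` and `reverse_complement` are names introduced here.

```python
# Map each base to its complement: A<->T and C<->G
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Complement every symbol, then reverse the string."""
    # The order of the two steps does not matter because
    # complementing acts on each symbol independently.
    return dna.strip().translate(COMPLEMENT)[::-1]


print(reverse_complement("GTCA"))
print(reverse_complement("AAAACCCGGT"))
```

Stripping whitespace before translating removes the need for a special-case entry for the newline character when the string is read from a file.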

**Given:** A DNA string s of length at most 1000 bp.

**Return:** The reverse complement $s^c$ of $s$.

## Sample Dataset

```
AAAACCCGGT
```

## Sample Output


```
ACCGGGTTTT
```

In [36]:
def complement_strand(string):
    rules = {"A":"T", "T":"A", "C":"G", "G":"C", "\n":""}
    return "".join(rules[i] for i in string[::-1])

In [37]:
print(complement_strand("AAAACCCGGT"))

ACCGGGTTTT


In [38]:
with open("../Bioinfo/Bioinformatics_Stronghold/data/rosalind_revc.txt", "r") as revc:
    revc = revc.read()

In [39]:
print(complement_strand(revc))

ACGAGAGGCCTTTCATACTGAATTCGCTCCTTTACCGATGCTGAAGGTTCGCGTAGGCATGGCAATTGAAGACGCTCCGCATTGACCCCTCTCGCGTTAACTCAAACAAGCTGGGCGTGCCGTAGGAGACTTTCAGCTACTGACCTTGCTCTTTCGGACTGGCAAGAAGGTAAGTCTGCTAAAGTCTTTCAGAACGTCCCCCTAAGTACGGAAGGGTTCGTTATTACGAGGATAGATATCGGCAATCTGGAGAGTCCAGAGTTATTGGCATTCGAGGGGATTCGAGAGTGCGTCCTGGCATGAACGATCCAGTCGGGTACTCCGGATAGCCCAAAGATCTGTATTAATGGCGCAGATGACCTGACCGGTCGGAGTCTGGCTCACCCAATGGAGCCGATGGTCAAACTAGGCGGAACATATTTTAGAGGACCGTGTAATCCAAGTCAAGTCTTCAGCAGGTATTAGGGCGAGCTGTATCTAGGCGGAGCTGCTATGAGTAGTCTCGCTTTCCGTCGTCTCGTCTTGCTAATCGATTTGTCATTGCTCGAGCAAGTTATTCCAGACCAACTACTAGCTCCAAAACGTAGTCGAGACCTGGTTATAGCGTTGTAGCTCTACCCTCATACAAGTGTTTGACGCTGAATGATCGTAAATGAAGCTTAGATTATCAGCTTGTCGTCAATATCTTAGGTGCAGAAACGAGAGAGTCTACAGTGTGTTCTATATCAGCGTACCATGATCGTCTCCCGCTGCCCAATGCAGCATTGTCAGGTGGAATCATTGTCTAATGACTTTCGATCAGTCGCCGGTGGCC


# Mendel's First Law 

# 孟德尔第一定律

## Introduction to Mendelian Inheritance

## 孟德尔遗传定律简介

Modern laws of inheritance were first described by Gregor Mendel (an Augustinian friar) in 1865. The prevailing hereditary model of his time, called **blending inheritance**, stated that an organism must exhibit a blend of its parents' traits. This rule is obviously violated both empirically (consider the huge number of people who are taller than both their parents) and statistically (over time, blended traits would simply blend into the average, severely limiting variation).

现代的遗传定律首先由格雷戈尔·孟德尔（奥古斯丁·弗莱尔）在1865年描述。当时的遗传模型，称为**混合遗传**，表明有机体必须表现出其父母特征的混合。这个规则显然在经验上都被违反（考虑到比他们父母都高的人数）和统计学（随着时间的推移，混合特征会简单地融入平均，严重限制的变化）。

Mendel, working with thousands of pea plants, believed that rather than viewing traits as continuous processes, they should instead be divided into discrete building blocks called **factors**. Furthermore, he proposed that every factor possesses distinct forms, called **alleles**.

孟德尔研究了成千上万的豌豆植物，认为不应将特征视为连续过程，而应将其划分为称为**因子**的离散构建块。此外，他提出每个因素都有不同的形式，称为**等位基因**。

In what has come to be known as his **first law** (also known as the law of segregation), Mendel stated that every organism possesses a pair of alleles for a given factor. If an individual's two alleles for a given factor are the same, then it is **homozygous** for the factor; if the alleles differ, then the individual is **heterozygous**. The first law concludes that for any factor, an organism randomly passes one of its two alleles to each offspring, so that an individual receives one allele from each parent.

在后来被称为他的**第一定律**（也称为分离定律）的事件中，孟德尔说每个生物体都拥有一对特定因子的等位基因。如果个体的两个等位基因对于给定因子是相同的，则该因子是**纯合的**;如果等位基因不同，那么个体是**杂合的**。第一定律得出结论，对于任何因素，有机体随机地将其两个等位基因中的一个传递给每个后代，以便个体从每个母体接收一个等位基因。

Mendel also believed that any factor corresponds to only two possible alleles, the **dominant** and **recessive** alleles. An organism only needs to possess one copy of the dominant allele to display the trait represented by the dominant allele. In other words, the only way that an organism can display a trait encoded by a recessive allele is if the individual is homozygous recessive for that factor.

孟德尔还认为，任何因子只对应于两个可能的等位基因，**显性**和**隐性**等位基因。生物体仅需要拥有一个显性等位基因就可以显示由显性等位基因代表的性状。换句话说，生物体能够显示由隐性等位基因编码的性状的唯一方式是个体是否是该因子的纯合隐性。

We may encode the dominant allele of a factor by a capital letter (e.g., A) and the recessive allele by a lower case letter (e.g., a). Because a heterozygous organism can possess a recessive allele without displaying the recessive form of the trait, we henceforth define an organism's **genotype** to be its precise genetic makeup and its **phenotype** as the physical manifestation of its underlying traits.

我们可以用大写字母（例如A）编码因子的显性等位基因，用小写字母（例如a）编码隐性等位基因。因为杂合生物可以具有隐性等位基因而不显示性状的隐性形式，所以我们今后将生物体的**基因型**定义为其精确的基因组成，并将其**表型**定义为其潜在性状的物理表现。

The different possibilities describing an individual's inheritance of two alleles from its parents can be represented by a **Punnett square**; see Figure 1 for an example.

描述个体从父母那里继承两个等位基因的不同可能性可以用**旁氏表**表示;有关示例，请参见图1。

![Figure 1. A Punnett square representing the possible outcomes of crossing a heterozygous organism (Yy) with a homozygous recessive organism (yy); here, the dominant allele Y corresponds to yellow pea pods, and the recessive allele y corresponds to green pea pods.](Images/006.png)

**Figure 1.** A Punnett square representing the possible outcomes of crossing a heterozygous organism (Yy) with a homozygous recessive organism (yy); here, the dominant allele Y corresponds to yellow pea pods, and the recessive allele y corresponds to green pea pods.

**图1.** 一个旁氏表，表示杂合生物（Yy）与纯合隐性生物（yy）杂交的可能结果;这里，显性等位基因Y对应于黄豌豆荚，而隐性等位基因y对应于绿豌豆荚。

## Problem

**Probability** is the mathematical study of randomly occurring phenomena. We will model such a phenomenon with a **random variable**, which is simply a variable that can take a number of different distinct **outcomes** depending on the result of an underlying random process.

**概率**是研究随机发生现象的数学方法。我们将使用**随机变量**对这种现象进行建模，**随机变量**只是一个变量，它可以根据潜在随机过程的结果取得许多不同的**结果**。

For example, say that we have a bag containing 3 red balls and 2 blue balls. If we let $X$ represent the random variable corresponding to the color of a drawn ball, then the **probability** of each of the two outcomes is given by $Pr(X=red)=\frac{3}{5}$ and $Pr(X=blue)=\frac{2}{5}$.

例如，假设我们有一个包含3个红球和2个蓝色球的包。如果我们让$X$代表对应于绘制球颜色的随机变量，则两个结果中每一个的**概率**由$Pr(X=red)=\frac{3}{5}$和$Pr(X=blue)=\frac{2}{5}$。

Random variables can be combined to yield new random variables. Returning to the ball example, let $Y$ model the color of a second ball drawn from the bag (without replacing the first ball). The probability of $Y$ being red depends on whether the first ball was red or blue. To represent all outcomes of $X$ and $Y$, we therefore use a **probability tree diagram**. This branching diagram represents all possible individual probabilities for $X$ and $Y$, with outcomes at the endpoints ("leaves") of the tree. The probability of any outcome is given by the product of probabilities along the path from the beginning of the tree; see Figure 2 for an illustrative example.

随机变量可以组合以产生新的随机变量。回到球的例子，让$Y$模拟从球袋中抽出的第二个球的颜色（第一个球不放回）。 $Y$变红的概率取决于第一球是红色还是蓝色。为了表示$X$和$ Y $的所有结果，我们因此使用**概率树图**。该分支图表示$X$和$Y$的所有可能的个体概率，其结果在树的端点（“叶子”）处。任何结果的概率都是从树的开始沿路径的概率乘积给出的;有关说明性示例，请参见图2。

![Figure 2. The probability of any outcome (leaf) in a probability tree diagram is given by the product of probabilities from the start of the tree to the outcome. For example, the probability that X is blue and Y is blue is equal to (2/5)(1/4), or 1/10.](Images/008.png)

**Figure 2.** The probability of any outcome (leaf) in a probability tree diagram is given by the product of probabilities from the start of the tree to the outcome. For example, the probability that X is blue and Y is blue is equal to (2/5)(1/4), or 1/10.

**图2.** 概率树图中任何结果（叶）的概率由从树的开始到结果的概率的乘积给出。例如，X为蓝色且Y为蓝色的概率等于（2/5）（1/4）或1/10。


An **event** is simply a collection of outcomes. Because outcomes are distinct, the probability of an event can be written as the sum of the probabilities of its constituent outcomes. For our colored ball example, let $A$ be the event "$Y$ is blue." $Pr(A)$ is equal to the sum of the probabilities of two different outcomes: 

$$
Pr(X=blue\ and\ Y=blue)+Pr(X=red\ and\ Y=blue)
$$

or $\frac{3}{10}+\frac{1}{10}=\frac{2}{5}$ (see Figure 2 above).

**事件**只是结果的集合。由于结果是截然不同的，事件的概率可以写成其组成结果概率的总和。对于我们的彩球示例，让$A$成为“$Y$为蓝色”的事件。 

$$
Pr(X=blue\ and\ Y=blue)+Pr(X=red\ and\ Y=blue)
$$

或者$\frac{3}{10}+\frac{1}{10}=\frac{2}{5}$（见上图2）。
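
The ball example can be checked numerically. The sketch below builds the four leaves of the probability tree with exact fractions and then sums the leaves that make up the event "$Y$ is blue"; the variable names are chosen here for illustration.

```python
from fractions import Fraction

bag = {"red": 3, "blue": 2}
total = sum(bag.values())

# One leaf per (first colour, second colour); its probability is the
# product of the probabilities along the path from the root.
leaves = {}
for first, n_first in bag.items():
    p_first = Fraction(n_first, total)
    for second, n_second in bag.items():
        # The first ball is not replaced, so one ball of its colour is gone
        remaining = n_second - (1 if second == first else 0)
        p_second = Fraction(remaining, total - 1)
        leaves[(first, second)] = p_first * p_second

# An event is a set of outcomes; its probability is the sum over them
pr_y_blue = sum(p for (x, y), p in leaves.items() if y == "blue")

print(leaves[("blue", "blue")])
print(pr_y_blue)
```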

**Given:** Three positive integers $k$, $m$, and $n$, representing a population containing $k+m+n$ organisms: $k$ individuals are homozygous dominant for a factor, $m$ are heterozygous, and $n$ are homozygous recessive.

**给定：**三个正整数$k$，$m$和$n$，代表一个含有$k+m+n$有机体的人口：$k$个体是纯合子占优势的因子，$m$是杂合子，$n$是纯合的隐性。

**Return:** The probability that two randomly selected mating organisms will produce an individual possessing a dominant allele (and thus displaying the dominant phenotype). Assume that any two organisms can mate.

**返回：** 两个随机选择的交配生物将产生具有显性等位基因的个体（从而显示显性表型）的概率。假设任何两种生物都可以交配。

## Sample Dataset

```
2 2 2
```

## Sample Output

```
0.78333
```

In [56]:
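def probability_dominant(AA, Aa, aa):
    total = AA + Aa + aa  # renamed to avoid shadowing the built-in sum()
    # Probability of drawing each pairing that can produce a recessive child
    aa_aa = (aa/total)*((aa-1)/(total-1))
    Aa_Aa = (Aa/total)*((Aa-1)/(total-1))
    Aa_aa = (Aa/total)*(aa/(total-1)) + (aa/total)*(Aa/(total-1))

    # Those pairings yield a recessive child with probability 1, 1/4 and 1/2
    return 1 - (aa_aa + Aa_Aa * 0.25 + Aa_aa * 0.5)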

In [57]:
print(probability_dominant(2, 2, 2))

0.7833333333333333


In [60]:
print(probability_dominant(27, 22, 26))

0.759009009009009
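
As a cross-check on the complement-based calculation, the same probability can be obtained by brute force: enumerate every ordered pair of distinct organisms and average the chance that their offspring carries a dominant allele. This is only a sketch under the problem's stated assumption that any two organisms can mate; the function name `dominant_probability` is introduced here.

```python
from fractions import Fraction
from itertools import permutations


def dominant_probability(k: int, m: int, n: int) -> Fraction:
    """Exact probability that a random mating yields a dominant-allele child."""
    population = ["AA"] * k + ["Aa"] * m + ["aa"] * n
    total = Fraction(0)
    pairs = 0
    # Ordered pairs of distinct individuals, each pair equally likely
    for i, j in permutations(range(len(population)), 2):
        # Each parent passes 'a' with probability (number of 'a' alleles) / 2
        pass_a_1 = Fraction(population[i].count("a"), 2)
        pass_a_2 = Fraction(population[j].count("a"), 2)
        # The child lacks a dominant allele only if both parents pass 'a'
        total += 1 - pass_a_1 * pass_a_2
        pairs += 1
    return total / pairs


print(float(dominant_probability(2, 2, 2)))
```

The enumeration is quadratic in the population size, so it is useful for verification rather than as a replacement for the closed-form version.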


# Calculating Expected Offspring

# 计算后代的期望值

## The Need for Averages

## 平均需求

Averages arise everywhere. In sports, we want to project the average number of games that a team is expected to win; in gambling, we want to project the average losses incurred playing blackjack; in business, companies want to calculate their average expected sales for the next quarter.

平均值的应用非常广泛。在体育方面，我们希望预测一支球队有望获胜的平均比赛数量;在赌博中，我们想要预测玩二十一点的平均损失;在商业方面，公司希望计算下一季度的平均预期销售额。

Molecular biology is not immune from the need for averages. Researchers need to predict the expected number of antibiotic-resistant pathogenic bacteria in a future outbreak, estimate the predicted number of locations in the genome that will match a given motif, and study the distribution of **alleles** throughout an evolving population. In this problem, we will begin discussing the third issue; first, we need to have a better understanding of what it means to average a random process.

分子生物学不能满足对平均值的需求。研究人员需要在未来的爆发中预测抗生素抗性致病菌的预期数量，估计基因组中与给定基序匹配的位置的预测数量，并研究不断变化的群体中等位基因的分布。在这个问题上，我们将开始讨论第三个问题;首先，我们需要更好地理解平均随机过程意味着什么。

## Problem

For a **random variable** $X$ taking integer values between $1$ and $n$, the **expected value** of $X$ is $E(X)=\sum_{k=1}^{n}k\times Pr(X=k)$. The expected value offers us a way of taking the long-term average of a random variable over a large number of trials.

对于一个在$1$到$n$区间内取整值的**随机变量**$X$，$X$的**数学期望**是$E(X)=\sum_{k=1}^{n}k\times Pr(X=k)$。数学期望为我们提供了一种在大量试验中获取随机变量的长期平均值的方法。

As a motivating example, let $X$ be the number on a six-sided die. Over a large number of rolls, we should expect to obtain an average of 3.5 on the die (even though it's not possible to roll a 3.5). The formula for expected value confirms that $E(X)=\sum^{6}_{k=1}k\times Pr(X=k)=3.5$.

例如，将$X$令作六面骰子上的数字。在大量投掷实验后，通过数学期望的公式$E(X)=\sum^{6}_{k=1}k\times Pr(X=k)$可以得出其期望值为$3.5$，$3.5$虽是“点数”的期望值，但却不属于可能结果中的任一个，没有可能掷出此点数。

More generally, a random variable for which every one of a number of equally spaced outcomes has the same probability is called a **uniform random variable** (in the die example, this "equal spacing" is equal to 1). We can generalize our die example to find that if $X$ is a uniform random variable with minimum possible value $a$ and maximum possible value $b$, then $E(X)=\frac{a+b}{2}$. You may also wish to verify that for the die example, if $Y$ is the random variable associated with the outcome of a second die roll, then $E(X+Y)=7$.

更一般地，多个等间隔结果中的每一个具有相同概率的随机变量被称为**均匀随机变量**（在上面示例中，该“等间距”等于1）。我们可以将上述例子推广，发现如果$X$是一个具有最小可能值$a$和最大可能值$b$的均匀随机变量，那么$E(X)= \frac {a + b} {2}$。更近一步我们可以发现，两个相关联的均匀随机变量的期望值可以相加，如果$Y$是与第二个掷骰子的结果相关联的随机变量，则$E(X+Y)= 7$。
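
Both claims about the die can be verified directly from the definition of expected value. The sketch below uses exact fractions and assumes the two rolls are independent; `expected_value` is a helper name introduced here.

```python
from fractions import Fraction
from itertools import product


def expected_value(distribution: dict) -> Fraction:
    """Sum of value * probability over a {value: probability} mapping."""
    return sum(value * prob for value, prob in distribution.items())


# A fair six-sided die: each face has probability 1/6
die = {face: Fraction(1, 6) for face in range(1, 7)}
e_x = expected_value(die)

# Distribution of X + Y for two independent rolls
two_dice = {}
for x, y in product(die, repeat=2):
    two_dice[x + y] = two_dice.get(x + y, 0) + die[x] * die[y]
e_sum = expected_value(two_dice)

print(e_x, e_sum)
```

The first result agrees with the uniform formula $\frac{a+b}{2}$ for $a=1$ and $b=6$.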

**Given:** Six nonnegative integers, each of which does not exceed 20,000. The integers correspond to the number of couples in a population possessing each **genotype** pairing for a given **factor**. In order, the six given integers represent the number of couples having the following genotypes:

**给定：** 六个非负整数，每个整数不超过20,000。整数对应于拥有给定**因子**的每种**基因型**配对的群体中的夫妻数量。按顺序，六个给定的整数表示具有以下基因型的夫妇的数量：

1. AA-AA
2. AA-Aa
3. AA-aa
4. Aa-Aa
5. Aa-aa
6. aa-aa

**Return:** The expected number of offspring displaying the dominant phenotype in the next generation, under the assumption that every couple has exactly two offspring.

**返回：** 在假设每对夫妇只有两个后代的情况下，下一代显示出显性表型的后代的预期数量。

## Sample Dataset

```
1 0 0 1 0 1
```

## Sample Output

```
3.5
```
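
As a worked check of the sample, write $c_1,\dots,c_6$ for the six couple counts in the order listed (these symbols are introduced here only for illustration). A child of an AA-AA, AA-Aa, or AA-aa couple always shows the dominant phenotype, a child of an Aa-Aa couple does so with probability $\frac{3}{4}$, a child of an Aa-aa couple with probability $\frac{1}{2}$, and a child of an aa-aa couple never does. With two offspring per couple, adding up the per-child expectations gives

$$
E = 2\left(c_1 + c_2 + c_3 + \frac{3}{4}c_4 + \frac{1}{2}c_5 + 0\cdot c_6\right)
$$

For the sample dataset, $c_1 = c_4 = c_6 = 1$ and the other counts are zero, so $E = 2\left(1 + \frac{3}{4}\right) = 3.5$.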

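In [11]:
def expect_offspring(AA_AA, AA_Aa, AA_aa, Aa_Aa, Aa_aa, aa_aa):
    # Chance that one child shows the dominant phenotype, per couple type:
    # 1, 1, 1, 3/4, 1/2 and 0; every couple has exactly two offspring.
    return 2 * (AA_AA + AA_Aa + AA_aa + 0.75 * Aa_Aa + 0.5 * Aa_aa + 0 * aa_aa)

In [12]:
expect_offspring(1,0,0,1,0,1)

3.5

In [13]:
expect_offspring(17196, 19218, 16073, 16183, 19183,18309)

148431.5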

In [20]:
with open("/home/duansq/Downloads/rosalind_iev (1).txt", "r") as line:
    lines = line.read()

In [21]:
couples = lines.split()

In [22]:
offspring = int(couples[0])*2 + int(couples[1])*2 + int(couples[2])*2 + int(couples[3])*(3/2.0) + int(couples[4])

In [23]:
print(offspring)

154003.0


# Computing GC Content

# 计算GC含量

## Identifying Unknown DNA Quickly

## 快速识别未知DNA

A quick method used by early computer software to determine the language of a given piece of text was to analyze the frequency with which each letter appeared in the text. This strategy was used because each language tends to exhibit its own letter frequencies, and as long as the text under consideration is long enough, software will correctly recognize the language quickly and with a very low error rate. See Figure 1 for a table compiling English letter frequencies.

早期计算机软件用来确定给定文本语言的快速方法是分析每个字母出现在文本中的频率。之所以使用这种策略是因为每种语言都倾向于展示自己的字母频率，只要所考虑的文本足够长，软件就能快速正确地识别语言并且错误率非常低。有关编制英文字母频率的表格，请参见图1。

![Figure 1. The table above was computed from a large number of English words and shows for any letter the frequency with which it appears in those words. These frequencies can be used to reliably identify a piece of English text and differentiate it from that of another language. Taken from http://en.wikipedia.org/wiki/File:English_letter_](Images/009.png)

**Figure 1.** The table above was computed from a large number of English words and shows for any letter the frequency with which it appears in those words. These frequencies can be used to reliably identify a piece of English text and differentiate it from that of another language. Taken from http://en.wikipedia.org/wiki/File:English_letter_

**图1.** 上表是根据大量英文单词计算得出的，并显示任何字母在这些单词中出现的频率。这些频率可用于可靠地识别一段英文文本，并将其与另一种语言区分开来。摘自http://en.wikipedia.org/wiki/File:English_letter_

You may ask: what in the world does this linguistic problem have to do with biology? Although two members of the same species will have different **genomes**, they still share the vast majority of their **DNA**; notably, 99.9% of the 3.2 billion **base pairs** in a human genome are common to almost all humans (i.e., excluding people having major genetic defects). For this reason, biologists will speak of the human genome, meaning an average-case genome derived from a collection of individuals. Such an average-case genome can be assembled for any species, a challenge that we will soon discuss.

您可能会问：这个语言问题与生物学有什么关系？虽然同一物种的两个成员具有不同的**基因组**，但他们仍然拥有相当大比例的DNA;值得注意的是，人类基因组中32亿**碱基对**中的99.9％对于几乎所有人来说是共同的（即，排除具有主要遗传缺陷的人）。出于这个原因，生物学家将谈论人类基因组，这意味着来自个体集合的平均病例基因组。这样的平均病例基因组可以为任何物种组装，这是我们即将讨论的挑战。

The biological analog of identifying unknown text arises when researchers encounter a molecule of DNA from an unknown species. Because of the base pairing relations of the two DNA strands, cytosine and guanine will always appear in equal amounts in a double-stranded DNA molecule. Thus, to analyze the symbol frequencies of DNA for comparison against a database, we compute the molecule's **GC-content**, or the percentage of its bases that are either cytosine or guanine.

当研究人员遇到来自未知物种的DNA分子时，会发现识别未知文本的生物类似物。由于两条DNA链的碱基配对关系，胞嘧啶和鸟嘌呤在双链DNA分子中总是以相等的量出现。因此，为了分析DNA的符号频率以与数据库进行比较，我们计算分子的**GC含量**，或其碱基的百分比，即胞嘧啶或鸟嘌呤。

In practice, the GC-content of most eukaryotic genomes hovers around 50%. However, because genomes are so long, we may be able to distinguish species based on very small discrepancies in GC-content; furthermore, most prokaryotes have a GC-content significantly higher than 50%, so that GC-content can be used to quickly differentiate many prokaryotes and eukaryotes by using relatively small DNA samples.

实验表明，大多数真核基因组的GC含量徘徊在50％左右。然而，由于基因组很长，我们可能能够根据GC含量的微小差异来区分物种;此外，大多数原核生物的GC含量显着高于50％，因此GC含量可用于通过使用相对较小的DNA样品快速区分许多原核生物和真核生物。

## Problem

The GC-content of a DNA string is given by the percentage of symbols in the string that are 'C' or 'G'. For example, the GC-content of "AGCTATAG" is 37.5%. Note that the reverse complement of any DNA string has the same GC-content.

DNA字符串的GC含量由字符串中符号“C”或“G”的百分比给出。例如，“AGCTATAG”的GC含量为37.5％。请注意，任何DNA字符串的反向互补具有相同的GC含量。

DNA strings must be labeled when they are consolidated into a database. A commonly used method of string labeling is called FASTA format. In this format, the string is introduced by a line that begins with '>', followed by some labeling information. Subsequent lines contain the string itself; the next line that begins with '>' marks the label of the following string.

DNA字符串在合并到数据库中时必须进行标记。常用的字符串标记方法称为FASTA格式。在这种格式中，字符串由以“>”开头的行引入，后跟一些标签信息。后续行包含字符串本身;以'>'开头的第一行表示下一个字符串的标签。

In Rosalind's implementation, a string in FASTA format will be labeled by the ID "Rosalind_xxxx", where "xxxx" denotes a four-digit code between 0000 and 9999.

在Rosalind的实验中，FASTA格式的字符串将用ID“Rosalind_xxxx”标记，其中“xxxx”表示0000和9999之间的四位数代码。

**Given:** At most 10 DNA strings in FASTA format (of length at most 1 kbp each).

**Return:** The ID of the string having the highest GC-content, followed by the GC-content of that string. Rosalind allows for a default error of 0.001 in all decimal answers unless otherwise stated; please see the note on absolute error below.

## Sample Dataset

```
>Rosalind_6404
CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCCTCCCACTAATAATTCTGAGG
>Rosalind_5959
CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCTATATCCATTTGTCAGCAGACACGC
>Rosalind_0808
CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGACTGGGAACCTGCGGGCAGTAGGTGGAAT
```

## Sample Output

```
Rosalind_0808
60.919540
```
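
The sample can be reproduced with a small FASTA reader. The sketch below is one possible approach rather than a reference implementation: it assumes each label line holds only the ID and that a record's sequence may span several lines. The names `parse_fasta` and `gc_content` are introduced here.

```python
def parse_fasta(text: str) -> dict:
    """Map each FASTA label (without '>') to its concatenated sequence."""
    records = {}
    label = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new label starts a new record
            label = line[1:]
            records[label] = ""
        elif label is not None:
            # Sequence lines are appended to the current record
            records[label] += line
    return records


def gc_content(dna: str) -> float:
    """Percentage of symbols that are 'G' or 'C'."""
    return 100 * (dna.count("G") + dna.count("C")) / len(dna)


sample = """>Rosalind_6404
CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCCTCCCACTAATAATTCTGAGG
>Rosalind_5959
CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCTATATCCATTTGTCAGCAGACACGC
>Rosalind_0808
CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGACTGGGAACCTGCGGGCAGTAGGTGGAAT
"""

records = parse_fasta(sample)
# Rank the IDs by the GC-content of their sequences, not by the IDs themselves
best_id = max(records, key=lambda name: gc_content(records[name]))
print(best_id)
print(f"{gc_content(records[best_id]):.6f}")
```

Passing a `key` function to `max` matters here: calling `max` on the dictionary alone would compare the ID strings instead of the GC-content values.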

## Note on Absolute Error

We say that a number $x$ is within an absolute error of $y$ to a correct solution if $x$ is within $y$ of the correct solution. For example, if an exact solution is $6.157892$, then for $x$ to be within an absolute error of $0.001$, we must have that $|x-6.157892|<0.001$, or $6.156892<x<6.158892$.

如果$x$在正确解的$y$之内，我们说数字$x$在$y$的绝对误差范围内是正确的解。例如，如果精确解是$6.157892$，那么对于$x$在绝对误差$0.001$内，我们必须具有$|x-6.157892|<0.001$或$6.156892<x<6.158892$。

Error bounding is a vital practical tool because of the inherent round-off error in representing decimals in a computer, where only a finite number of decimal places are allotted to any number. After being compounded over a number of operations, this round-off error can become evident. As a result, rather than testing whether two numbers are equal with $x=z$, you may wish to simply verify that $|x-z|$ is very small.

错误边界是一个重要的实用工具，因为在计算机中表示小数的固有舍入误差，其中只有有限数量的小数位被分配给任何数字。在通过多个操作复合之后，这种舍入错误可能变得明显。因此，您可能希望简单地验证$|x-z|$而不是测试两个数字是否与$x=z$相等非常小。

The mathematical field of **numerical analysis** is devoted to rigorously studying the nature of computational approximation.

**数值分析**的数学领域致力于严格研究计算近似的本质。

In [126]:
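def content_GC(string):
    # Fraction of G and C symbols in the string (a value between 0 and 1);
    # multiply by 100 to get the percentage shown in the sample output.
    return (string.count("G") + string.count("C"))/(len(string))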

In [134]:
content_GC("CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGACTGGGAACCTGCGGGCAGTAGGTGGAAT")

0.6091954022988506

In [142]:
with open("/home/duansq/Downloads/rosalind_gc.txt", "r") as gc:
    # Join every line of the FASTA file into one long string without newlines.
    t = ""
    for i in gc.readlines():
        t = t + i.strip("\n")

In [141]:
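t = t.split(">")
t.pop(0)  # drop the empty string that precedes the first ">"
d = {}
for i in t:
    # Assumes every ID is exactly 13 characters long ("Rosalind_xxxx").
    d[i[0:13]] = i[13:]
x = {}
for i in d:
    x[i] = content_GC(d[i])
# Caution: max(x) compares the dictionary keys, so this prints the
# lexicographically largest ID and its GC fraction; max(x, key=x.get)
# would select the ID with the highest GC-content instead.
print(max(x), x[max(x)])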

Rosalind_9463 0.5238095238095238
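
Calling `max` on a dictionary compares its keys, so the cell above reports the largest ID rather than the largest GC-content. The following is a minimal sketch of one way to select by value and print a percentage, checked here against the sample dataset only. The names `parse_fasta` and `gc_percent` are hypothetical helpers, not part of the original notebook.

```python
def parse_fasta(text):
    """Parse FASTA text into a dict mapping each ID to its full sequence."""
    records = {}
    label = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            label = line[1:]          # the ID is everything after ">"
            records[label] = ""
        elif label is not None:
            records[label] += line    # sequences may span several lines
    return records

def gc_percent(seq):
    """GC-content of seq, expressed as a percentage."""
    return 100 * (seq.count("G") + seq.count("C")) / len(seq)

sample = """>Rosalind_6404
CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCCTCCCACTAATAATTCTGAGG
>Rosalind_5959
CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCTATATCCATTTGTCAGCAGACACGC
>Rosalind_0808
CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGACTGGGAACCTGCGGGCAGTAGGTGGAAT
"""

gc = {label: gc_percent(seq) for label, seq in parse_fasta(sample).items()}
best = max(gc, key=gc.get)  # compare by GC-content, not by ID
print(best)                 # Rosalind_0808
print(f"{gc[best]:.6f}")    # 60.919540
```

Reading the ID up to the end of the header line, instead of slicing a fixed 13 characters, keeps the parser working if a label has a different length.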


# Translating RNA into Protein

## The Genetic Code

Just as **nucleic acids** are **polymers** of nucleotides, **proteins** are chains of smaller molecules called **amino acids**; 20 amino acids commonly appear in every species. Just as the **primary structure** of a nucleic acid is given by the order of its nucleotides, the primary structure of a protein is the order of its amino acids. Some proteins are composed of several subchains called **polypeptides**, while others are formed of a single polypeptide; see Figure 1.

正如**核酸**是核苷酸的**聚合物**一样，**蛋白质**是由**氨基酸**构成的小分子链;每种物种中通常会出现20种氨基酸。正如核酸的**一级结构**由其**核苷酸**的顺序给出，蛋白质的**一级结构**是其**氨基酸**的顺序。一些蛋白质由称为**多肽**的几个亚链组成，而其他蛋白质由单个多肽组成;见图1。

![Figure 1. The human hemoglobin molecule consists of 4 polypeptide chains; α subunits are shown in red and β subunits are shown in blue](Images/010.png)

**Figure 1.** The human hemoglobin molecule consists of 4 polypeptide chains; α subunits are shown in red and β subunits are shown in blue

**图1. **人血红蛋白分子由4条多肽链组成; α亚基以红色显示，β亚基以蓝色显示

Proteins power every practical function carried out by the cell, and so presumably, the key to understanding life lies in interpreting the relationship between a chain of amino acids and the function of the protein that this chain of amino acids eventually constructs. **Proteomics** is the field devoted to the study of proteins.

正是由于蛋白质的存在，细胞才能执行各种功能，因此，理解生命的关键在于理解氨基酸链与这种氨基酸链最终构建的蛋白质功能之间的关系。 **蛋白质组学**是致力于蛋白质研究的领域。

How are proteins created? The **genetic code**, discovered over the course of a number of ingenious experiments in the late 1950s, details the **translation** of an RNA molecule called **messenger RNA** (mRNA) into amino acids for protein creation. The apparent difficulty in translation is that somehow 4 RNA bases must be translated into a language of 20 amino acids; because pairs of bases allow only $4^2=16$ combinations, we must translate **3-nucleobase strings** (called **codons**) into amino acids in order for every possible amino acid to be created. Note that there are $4^3=64$ possible codons, so that multiple codons may encode the same amino acid. Two special types of codons are the **start codon** (AUG), which codes for the amino acid methionine and always indicates the start of translation, and the three **stop codons** (UAA, UAG, UGA), which do not code for an amino acid and cause translation to end.

蛋白质是如何产生的？在20世纪50年代后期的许多巧妙实验过程中发现的**遗传密码**详细描述了一种名为**信使RNA **（mRNA）的RNA分子**转化为氨基酸以进行蛋白质创建过程。显然，要将4个RNA碱基翻译成20个氨基酸是很困难的;为了产生每种可能的氨基酸，我们必须将**每三个碱基串**（称为**密码子**）翻译成氨基酸。注意，有$4^3=64$个可能的密码子，因此多个密码子可以编码相同的氨基酸。两种特殊类型的密码子分别是是**起始密码子**（AUG），其编码氨基酸蛋氨酸总是表示翻译的起点，而三个**终止密码子**（UAA，UAG，UGA），不是氨基酸的代码，导致翻译结束。
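
The counting argument can be checked by enumeration. This is a small sketch using `itertools.product`; the helper name `words_of_length` is hypothetical.

```python
from itertools import product

BASES = "ACGU"

def words_of_length(k):
    """All strings of length k over the four RNA bases."""
    return ["".join(p) for p in product(BASES, repeat=k)]

for k in (1, 2, 3):
    n = len(words_of_length(k))
    verdict = "enough for 20 amino acids" if n >= 20 else "too few"
    print(k, n, verdict)
# 1 4 too few
# 2 16 too few
# 3 64 enough for 20 amino acids
```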

The notion that protein is always created from RNA, which in turn is always created from DNA, forms the **central dogma of molecular biology**. Like all dogmas, it does not always hold; however, it offers an excellent approximation of the truth.

蛋白质总是由RNA产生，而RNA总是由DNA产生，由此形成分子生物学的**中心法则**。像所有的法则一样，它并不总是存在;然而，但它提供了一个很好的近似真相。

An organelle called a **ribosome** creates peptides by using a helper molecule called **transfer RNA** (tRNA). A single tRNA molecule possesses a string of three RNA nucleotides on one end (called an **anticodon**) and an amino acid at the other end. The ribosome takes an RNA molecule transcribed from DNA, called **messenger RNA** (mRNA), and examines it one codon at a time. At each step, the tRNA possessing the complementary anticodon bonds to the mRNA at this location, and the amino acid found on the opposite end of the tRNA is added to the growing peptide chain before the remaining part of the tRNA is ejected into the cell, and the ribosome looks for the next tRNA molecule.

称为**核糖体**的细胞器通过使用**转录RNA**（tRNA）辅助分子来产生肽。单个tRNA分子在一端具有一串三个RNA核苷酸（称为**反密码子**），在另一端具有氨基酸。核糖体附着到从DNA转录的**信使RNA**（mRNA），并一次检测一个密码子。在每次检测时，具有互补反密码子的tRNA与该位置处的mRNA结合，并且在tRNA的另一端发现的氨基酸被添加到合成的肽链中时，tRNA的剩余部分被喷射到细胞中，紧接着核糖体寻找下一个tRNA分子。
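
As a rough illustration of what "complementary anticodon" means, the sketch below pairs each base of a codon with its complement (A with U, C with G). It assumes strict pairing, ignoring the looser pairing that real tRNAs allow at the third position. Because the two strands run in opposite directions, the result is reversed so that it reads 5' to 3'. The function name `anticodon` is hypothetical.

```python
COMPLEMENT = {"A": "U", "U": "A", "C": "G", "G": "C"}

def anticodon(codon):
    """Return the strictly complementary anticodon, written 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(codon))

print(anticodon("AUG"))  # CAU
```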

Not every RNA base eventually becomes translated into a protein, and so an interval of RNA (or an interval of DNA transcribed into RNA) that does code for a protein is of great biological interest; such an interval of DNA or RNA is called a **gene**. Because protein creation drives cellular processes, genes differentiate organisms and serve as a basis for **heredity**, or the process by which traits are inherited.

还有一点值得注意的是并非每个RNA碱基最终都会转化为蛋白质，因此编码蛋白质的RNA区域（或转化为RNA的DNA区间）具有很大的生物学意义;这种DNA或RNA的间隔称为**基因**。因为蛋白质的产生驱动了细胞的生物过程，基因可以区分生物体，并作为**遗传**的基础，或遗传特征的过程。

## Problem

The 20 commonly occurring amino acids are abbreviated by using 20 letters from the English alphabet (all letters except for B, J, O, U, X, and Z). Protein strings are constructed from these 20 symbols. Henceforth, the term genetic string will incorporate protein strings along with DNA strings and RNA strings.

通过使用20个字母（除了B，J，O，U，X和Z之外的所有字母）表示20个常见氨基酸。蛋白质串由这20个符号构成。此后，遗传串将包含蛋白质串以及DNA串和RNA串。

The RNA codon table dictates the details regarding the encoding of specific codons into the amino acid alphabet.

RNA密码子表规定了将特定密码子编码到氨基酸字母表中的细节。

```
UUU F      CUU L      AUU I      GUU V
UUC F      CUC L      AUC I      GUC V
UUA L      CUA L      AUA I      GUA V
UUG L      CUG L      AUG M      GUG V
UCU S      CCU P      ACU T      GCU A
UCC S      CCC P      ACC T      GCC A
UCA S      CCA P      ACA T      GCA A
UCG S      CCG P      ACG T      GCG A
UAU Y      CAU H      AAU N      GAU D
UAC Y      CAC H      AAC N      GAC D
UAA Stop   CAA Q      AAA K      GAA E
UAG Stop   CAG Q      AAG K      GAG E
UGU C      CGU R      AGU S      GGU G
UGC C      CGC R      AGC S      GGC G
UGA Stop   CGA R      AGA R      GGA G
UGG W      CGG R      AGG R      GGG G 
```

**Given:** An RNA string *s* corresponding to a strand of mRNA (of length at most 10 kbp).

**Return:** The protein string encoded by *s*.

## Sample Dataset

```
AUGGCCAUGGCGCCCAGAACUGAGAUCAAUAGUACCCGUAUUAACGGGUGA
```

## Sample Output

```
MAMAPRTEINSTRING
```
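
A self-contained sketch of the translation, assuming the codon table is embedded as text rather than read from a file. The names `CODON_TABLE_TEXT`, `codon_table`, and `translate` are hypothetical. The table is split on whitespace and every codon is paired with the symbol that follows it.

```python
CODON_TABLE_TEXT = """
UUU F      CUU L      AUU I      GUU V
UUC F      CUC L      AUC I      GUC V
UUA L      CUA L      AUA I      GUA V
UUG L      CUG L      AUG M      GUG V
UCU S      CCU P      ACU T      GCU A
UCC S      CCC P      ACC T      GCC A
UCA S      CCA P      ACA T      GCA A
UCG S      CCG P      ACG T      GCG A
UAU Y      CAU H      AAU N      GAU D
UAC Y      CAC H      AAC N      GAC D
UAA Stop   CAA Q      AAA K      GAA E
UAG Stop   CAG Q      AAG K      GAG E
UGU C      CGU R      AGU S      GGU G
UGC C      CGC R      AGC S      GGC G
UGA Stop   CGA R      AGA R      GGA G
UGG W      CGG R      AGG R      GGG G
"""

# Tokens alternate codon, symbol, codon, symbol, ...
tokens = CODON_TABLE_TEXT.split()
codon_table = dict(zip(tokens[0::2], tokens[1::2]))

def translate(rna):
    """Translate rna codon by codon until a stop codon or the end of the string."""
    protein = []
    # Stepping to len(rna) - 2 skips any incomplete trailing codon.
    for i in range(0, len(rna) - 2, 3):
        amino_acid = codon_table[rna[i:i + 3]]
        if amino_acid == "Stop":
            break
        protein.append(amino_acid)
    return "".join(protein)

print(len(codon_table))  # 64
print(translate("AUGGCCAUGGCGCCCAGAACUGAGAUCAAUAGUACCCGUAUUAACGGGUGA"))  # MAMAPRTEINSTRING
```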

In [310]:
with open("Bioinformatics_Stronghold/data/code.txt", "r") as code:
    t = code.read()
    code = t.split()
code = dict(zip(code[0::2], code[1::2]))

In [317]:
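def transfer(string):
    # Translate an mRNA string codon by codon, stopping at the first stop codon.
    decoded = ""
    ## or, assuming the string ends with a stop codon:
    ## for i in range(0, len(string)-3, 3):
    ##     decoded += code[string[i:i+3]]
    for i in range(len(string)//3):
        amino_acid = code[string[3*i:3*(i+1)]]  # look the codon up once
        if amino_acid == "Stop":
            break
        decoded += amino_acid
    return decoded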

In [318]:
x = "AUGGCCAUGGCGCCCAGAACUGAGAUCAAUAGUACCCGUAUUAACGGGUGA"

In [319]:
transfer(x)

'MAMAPRTEINSTRING'

In [291]:
with open("Bioinformatics_Stronghold/data/rosalind_prot.txt", "r") as prot:
    prot = prot.read()

In [292]:
transfer(prot)

'MQVRPLDRPRRYLSSKTTGYNLVFSGDCRTEACCQDDKCTPGSRELLLVRNFYKREDSNHAMSLQSKAYASLKTQSGVILLARDVPRIRGTFSILPAADQNRSTCPINVDTAIIGRLCLNPSEPSSGGSALPCYVGYDTQLSCRLAVCYTCLLQVCVNKAGPGPQDLGWPEGGEHCAHDASLKGSTVIARDLVVSEPPCSMEVPLCVTLPTPGSLPSVLTHDSPDLSVPPQSIIKMIGTVHKRFRSLFDQSSIEPSSYDLVGRSRGTRDHACSHQTLLACNWDKTVVISHYVVLCYLADRLNNLLPTPGAGKVSQKIRYGNMWFTMATEMAISYSPGRANAPINFHGKGQLILSSCRKMSREICTPANTRSWKEVQRRRRVLLPRPIDAVALMSASGARVPGRNGAAHLLMMINYVYSSRSRRECSTSVTRSGIRSRMCHYKERCSCIGRHTLLGGQASNPQSCMRLVCCDRVMVSDWNRLAAWAIFYFPGRVDRCIPSKLIGRDATASDGPRLWSRRRKLYAVHVRYRTFVSHCSNRGLYFPPFWEFELRVRKRVRVWSYCLSRPSLSRPAKQSPRRLALRKENVRRISRSAVASLNLPVIDARGSTGLEQSARSHVCLNLAARLLVGQDLTRLNLLMCILVRLFCAADLVTSRYFILIFRAPCFTVGTGNMVSVARPTSTCVGPSGSTTHLMSWDSQWASRHTCNRSITNAAERGLVHHCRVLVTFSCSDPSRWHGLNQQLRKIGYFPQIFYFFGSGVSSLQAKINLRPRHSHYLLLYTLYCLHPRPEQLTIRIRGLPRASTRWSPERTRRITRVTTAEPASELAVISCCSYDSIKGMVIRLRQGRVWAIGFEPPFRPNSGVNFRHIRFTNVQGEGQSIYDPTAFEDVLGVRNVHFTAKISLPCICTGHSAGAGPIIAHLNDSRIPPAHVMTFVKMIADWRILHKCDVAASGWVALTLPRAQGVLTLPIVMQFLARICDTPIIHTQKYIDRKGTGNL

{1: 2}