# lesson 2
data: cm.fa.gz  

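This is the genome sequence of the ant *Camponotus floridanus*, stored in FASTA format.

The tasks are: 1) read up on the FASTA format online; 2) compute the length of every scaffold in the genome and write the result with the scaffold ID in the first column and the scaffold length in the second; 3) report the genome size of this ant in Mb; 4) compute the GC content of the genome; 5) count the occurrences of the restriction site CCGGTCGACCGG in the genome.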

## answer

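A FASTA file is a plain-text file for storing sequences; it can hold DNA, RNA, or protein sequences. Each record has two parts: the first line starts with `>` and carries the description of the sequence (such as the database accession, the sequence name, and the sequence type), and the remaining lines carry the sequence itself. Sequence lines are usually wrapped at 70 or 80 characters per line. An example is shown in the figure below: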

![fasta_eg](./image/fasta_eg.JPG)
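
The record structure described above can be sketched as a small parser. This is a minimal illustration, not the script used for the lesson: `read_fasta` and the two example records are made up for this note, and the sketch assumes that every record starts with a `>` line.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith('>'):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, ''.join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    # The last record is not followed by a header, so emit it here.
    if header is not None:
        yield header, ''.join(chunks)


# Hypothetical records, only to show the shape of the data.
example = [
    '>seq1 hypothetical example record',
    'ACGTACGT',
    'GGCC',
    '>seq2',
    'TTTT',
]
for header, seq in read_fasta(example):
    print(header.split()[0], len(seq))
```

Collecting the wrapped lines in a list and joining them once per record avoids rebuilding a long string on every line, which may matter for large scaffolds.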

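FASTA files come with many different file extensions. Each extension indicates a different kind of biological data, but the basic format is the same, as shown in the figure below: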

![fasta_extension](./image/fasta_extension.JPG)
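
The data file for this lesson, cm.fa.gz, carries an extra `.gz` suffix because it is gzip-compressed. One possible way to read such a file without decompressing it first is the standard-library `gzip` module. The helper name `open_fasta` and the temporary demo file are assumptions made for this sketch.

```python
import gzip
import os
import tempfile


def open_fasta(path):
    """Open a FASTA file as text, decompressing on the fly if it ends in .gz."""
    if path.endswith('.gz'):
        return gzip.open(path, 'rt')  # 'rt' = read in text mode
    return open(path, 'r')


# Round-trip demo on a tiny temporary file (not the lesson data).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'demo.fa.gz')
    with gzip.open(demo_path, 'wt') as out:
        out.write('>demo\nACGT\n')
    with open_fasta(demo_path) as handle:
        demo_lines = [line.strip() for line in handle]
print(demo_lines)
```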

## Statistics

In [12]:
# lesson2 practice
# date: 2018/08/01
# author: zxzhu

import re

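def main(infile_name, outfile_name):
    """Write per-record lengths to outfile_name and print genome-wide statistics."""
    title = ''       # header of the record currently being read
    sequence = ''    # sequence of the current record accumulated so far
    genome = 0       # total number of bases across all records
    sites = 0        # total number of restriction-site matches
    GC = 0           # total number of G and C bases (uppercase only)
    scaffolds = []   # one (length, title) tuple per record
    f = open(infile_name, 'r')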
    for line in (i.strip() for i in f):
        if line.startswith('>'):
            # A new header closes the previous record, if there was one.
            if sequence:
                scaffolds.append((len(sequence), title))
                genome += len(sequence)
                GC += sequence.count('G')+sequence.count('C')
                sites += site_count(sequence)
                sequence = ''
            title = line[1:]
        else:
            sequence += line
    # The last record is not followed by a header, so flush it here.
    scaffolds.append((len(sequence), title))
    genome += len(sequence)
    GC += sequence.count('G')+sequence.count('C')
    sites += site_count(sequence)
    f.close()
    
    w = open(outfile_name, 'w')
    scaffolds = sorted(scaffolds, key=lambda x: x[-1])
    w.write('\n'.join([i[-1] + '\t' + str(i[0]) for i in scaffolds]))
    w.close()
    
    # Fixed two-decimal formatting for the genome size (Mb) and the GC percentage.
    print('Genome size: {:.2f}Mb.\nGC content: {:.2f}%.\nscaffold:{}.\n'.format(genome / 1000000, GC * 100 / genome, len(scaffolds)))
    print('The number of Restriction Enzyme cutting sites is {}\n'.format(sites))

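def site_count(sequence, pattern=r'CCGGTCGACCGG'):
    # re.findall returns non-overlapping matches, scanning left to right.
    # The match is case-sensitive, so lowercase bases are not counted.
    site = re.findall(pattern, sequence)
    return len(site)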

infile_name = './data/Traning/Lesson2/cm.fa'
outfile_name = 'scaffold_length.txt'
main(infile_name, outfile_name)


Genome size: 235.58Mb.
GC content: 33.01%.
scaffold:24029.

The number of Restriction Enzyme cutting sites is 14
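
Two counting details could affect the numbers above, depending on the data. First, the site CCGGTCGACCGG begins and ends with CCGG, so two sites can share those four bases, and `re.findall` with the plain pattern counts such a pair once. Second, `str.count('G')` ignores lowercase bases, and dividing by the full length counts any `N` characters in the denominator. The sketch below shows one possible alternative for each; the function names are made up for this note, excluding `N` is only one common convention, and whether either change alters the result for cm.fa has not been checked here.

```python
import re

SITE = 'CCGGTCGACCGG'


def count_sites_overlapping(sequence, site=SITE):
    """Count every start position of site, including overlapping ones."""
    # A zero-width lookahead matches at each position without consuming
    # characters, so overlapping occurrences are all found.
    return len(re.findall('(?=' + re.escape(site) + ')', sequence.upper()))


def gc_percent(sequence):
    """GC percentage over A/C/G/T bases only, ignoring case and other symbols."""
    seq = sequence.upper()
    gc = seq.count('G') + seq.count('C')
    acgt = gc + seq.count('A') + seq.count('T')
    return 100.0 * gc / acgt if acgt else 0.0


# Synthetic sequence: two sites that share the middle CCGG.
overlapping = 'CCGGTCGACCGGTCGACCGG'
print(len(re.findall(SITE, overlapping)))    # non-overlapping count
print(count_sites_overlapping(overlapping))  # overlapping count
print(gc_percent('ggccNNat'))
```

Because the site is its own reverse complement, scanning one strand is enough in either variant.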



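## Question

As the figure below shows, besides the scaffold\* records, cm.fa also contains C\* records, whose lengths generally range from about 100 bp to about 10 kb. What does a sequence such as C3769279 represent, and how does it differ from a scaffold?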

![unknown sequence](./image/unknown_sequence.JPG)

For the statistics above, I counted both kinds of sequence as scaffolds.
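
To see how much each kind of record contributes, the per-record lengths could be grouped by the alphabetic prefix of the ID. This is only a sketch: `summarize_by_prefix` is a name made up for this note, and the IDs and lengths in `rows` are hypothetical placeholders. In practice the pairs could be read from the two tab-separated columns of scaffold_length.txt.

```python
import re
from collections import defaultdict


def summarize_by_prefix(rows):
    """Group (record_id, length) pairs by the alphabetic prefix of the ID."""
    summary = defaultdict(lambda: [0, 0])  # prefix -> [record count, total bp]
    for record_id, length in rows:
        match = re.match(r'[A-Za-z]+', record_id)
        prefix = match.group(0) if match else 'other'
        summary[prefix][0] += 1
        summary[prefix][1] += length
    return dict(summary)


# Hypothetical lengths, only to show the shape of the result.
rows = [('scaffold1', 5000), ('scaffold2', 3000), ('C3769279', 150)]
print(summarize_by_prefix(rows))
```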

## Reference

[FASTQ and FASTA](https://zhuanlan.zhihu.com/p/34518389)

[FASTA format](https://en.wikipedia.org/wiki/FASTA_format#FASTA_file)

[contig and scaffold](http://blog.sina.com.cn/s/blog_670445240101mw8e.html)

[ContigN50 and ScaffoldN50](http://blog.sina.com.cn/s/blog_80d2d9fd0100x3fa.html)