##### Copyright 2020 The TensorFlow Authors.

In [None]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

<table class="tfo-notebook-buttons" align="left">
  <td><a target="_blank" href="https://www.tensorflow.org/io/tutorials/genome"><img src="https://www.tensorflow.org/images/tf_logo_32px.png">View on TensorFlow.org</a></td>
  <td><a target="_blank" href="https://colab.research.google.com/github/tensorflow/io/blob/master/docs/tutorials/genome.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png">Run in Google Colab</a></td>
  <td><a target="_blank" href="https://github.com/tensorflow/io/blob/master/docs/tutorials/genome.ipynb"><img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png">View source on GitHub</a></td>
      <td><a href="https://storage.googleapis.com/tensorflow_docs/io/docs/tutorials/genome.ipynb"><img src="https://www.tensorflow.org/images/download_logo_32px.png">Download notebook</a></td>
</table>

## Overview

This tutorial demonstrates the `tfio.genome` package, which provides commonly used genomics I/O functionality: reading several genomics file formats and performing common data-preparation operations (for example, one-hot encoding or converting Phred quality scores into probabilities).

This package uses the [Google Nucleus](https://github.com/google/nucleus) library to provide some of its core functionality.

## Setup

In [None]:
try:
  %tensorflow_version 2.x
except Exception:
  pass
!pip install tensorflow-io

In [None]:
import tensorflow_io as tfio
import tensorflow as tf

## FASTQ Data

FASTQ is a common genomics file format that stores sequence information together with per-base quality information.
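
To make the format concrete, here is a minimal pure-Python sketch of how a FASTQ record is laid out. It assumes the common strict layout of four lines per record (an `@` identifier line, the sequence, a `+` separator, and a quality string of the same length as the sequence). The `FastqRecord` type, the `parse_fastq` helper, and the example reads are hypothetical illustrations and are not part of `tfio.genome`.

```python
from typing import Iterator, NamedTuple


class FastqRecord(NamedTuple):
    """One FASTQ record (hypothetical helper type for this sketch)."""
    identifier: str
    sequence: str
    quality: str


def parse_fastq(text: str) -> Iterator[FastqRecord]:
    """Yield records from FASTQ text, assuming strict four-line records."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("expected a multiple of four non-empty lines")
    for i in range(0, len(lines), 4):
        header, sequence, separator, quality = lines[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record number {i // 4 + 1}")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        # Drop the leading "@" from the identifier line.
        yield FastqRecord(header[1:], sequence, quality)


# Made-up example data, not the contents of test.fastq.
example = "@read1\nGATTACA\n+\nIIIIIII\n@read2\nACGT\n+\n!!!!\n"
records = list(parse_fastq(example))
for record in records:
    print(record.identifier, record.sequence, record.quality)
```

Each record pairs one sequence string with one quality string of equal length, which is why a reader can return sequences and qualities as two parallel collections.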

First, let's download a sample `fastq` file.

In [None]:
# Download some sample data:
!curl -OL https://raw.githubusercontent.com/tensorflow/io/master/tests/test_genome/test.fastq

### Read FASTQ Data

Now, let's use `tfio.genome.read_fastq` to read this file (note that a `tf.data` API is coming soon).

In [None]:
fastq_data = tfio.genome.read_fastq(filename="test.fastq")
print(fastq_data.sequences)
print(fastq_data.raw_quality)

As you can see, the returned `fastq_data` has two fields. `fastq_data.sequences` is a string tensor of all sequences in the FASTQ file, each of which can be a different length. `fastq_data.raw_quality` holds the Phred-encoded quality of each base read in each sequence.

### Quality

You can use a helper op to convert this Phred-encoded quality information into probabilities.
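
For intuition, the standard Phred relationship can be sketched in plain Python. A Phred score Q corresponds to a base-calling error probability of 10^(-Q/10). The sketch assumes the Phred+33 (Sanger) ASCII encoding, and the helper name `phred_to_error_probability` is hypothetical. It illustrates the formula only and is not the implementation of the `tfio` op.

```python
from typing import List

PHRED_OFFSET = 33  # assumption: Phred+33 (Sanger) ASCII encoding


def phred_to_error_probability(quality: str,
                               offset: int = PHRED_OFFSET) -> List[float]:
    """Convert an ASCII quality string to per-base error probabilities."""
    probabilities = []
    for char in quality:
        score = ord(char) - offset  # Phred score Q for this base
        if score < 0:
            raise ValueError(f"character {char!r} is below the offset")
        # P(error) = 10 ** (-Q / 10)
        probabilities.append(10.0 ** (-score / 10.0))
    return probabilities


# "!", "+", "5", "I" encode Q = 0, 10, 20, 40 under Phred+33,
# so the error probabilities fall from 1 to about 1e-4.
probs = phred_to_error_probability("!+5I")
print(probs)
```

Higher quality characters therefore map to smaller error probabilities, and each step of 10 in Q reduces the error probability by a factor of ten.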

In [None]:
quality = tfio.genome.phred_sequences_to_probability(fastq_data.raw_quality)
print(quality.shape)
print(quality.row_lengths().numpy())
print(quality)

### One-hot encodings

You may also want to encode the genome sequence data (which consists of `A`, `T`, `C`, and `G` bases) using a one-hot encoder. There's a built-in operation that can help with this.
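
As a rough illustration of the idea, one-hot encoding turns each base into a length-4 vector with a single 1. The sketch below is plain Python and makes two assumptions of its own: the channel order A, C, G, T, and an all-zero vector for any other character. Both may differ from what the built-in op does, so check its docstring for the actual mapping. The helper name `one_hot_encode` is hypothetical.

```python
from typing import Dict, List

# Assumption for this sketch: channel order A, C, G, T.
ONE_HOT: Dict[str, List[int]] = {
    "A": [1, 0, 0, 0],
    "C": [0, 1, 0, 0],
    "G": [0, 0, 1, 0],
    "T": [0, 0, 0, 1],
}
# This sketch's own choice for characters outside A, C, G, T (such as N).
UNKNOWN: List[int] = [0, 0, 0, 0]


def one_hot_encode(sequence: str) -> List[List[int]]:
    """Return one length-4 row per base in the sequence."""
    return [list(ONE_HOT.get(base, UNKNOWN)) for base in sequence.upper()]


encoded = one_hot_encode("ACGTN")
for base, row in zip("ACGTN", encoded):
    print(base, row)
```

A sequence of length n becomes an n-by-4 matrix, so sequences of different lengths produce matrices with different numbers of rows.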


In [None]:
print(tfio.genome.sequences_to_onehot.__doc__)
