# Section 01: Introduction

<a rel="license" href="https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode.txt"><img alt="Attribution-NonCommercial-ShareAlike 4.0 International" src="https://mirrors.creativecommons.org/presskit/buttons/88x31/svg/by-nc-sa.eu.svg" title="This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License"/></a>

## Content

- [Overview](#overview)
    - [Machine Learning: A New Paradigm in Population Genetics](#machine-learning)
    - [Jupyter Notebook](#jupyter-notebook)
- [Resources](#resources)
- [Python Programming](#python-programming)
    - [Variables](#variables)
        - [Task 1](#task1)
    - [Built-in Data Types](#built-in-data-types)
        - [`int`](#int)
        - [`float`](#float)
        - [`bool`](#bool)
        - [`str`](#str)
        - [`list`](#li)
        - [`dict`](#dict)
    - [Commonly Used Statements](#commonly-used-statements)
        - [`if`](#if)
        - [`for`](#for)
        - [`while`](#while)
        - [Define a Function](#def)
        - [Import a Package](#import)
    - [File Input and Output](#io)
    - [Task 2: Comprehensive Application](#task2)
- [Summary](#summary)

<a name="overview"></a>
## Overview

<a name="machine-learning"></a>
### Machine Learning: A New Paradigm in Population Genetics

Population genetics is a discipline studying how evolution shapes genetic variation. Traditionally, studying population genetics has involved either experimental methods or theoretical approaches, such as statistical inference.

Recent advances in machine learning, especially deep learning, have introduced a new paradigm for studying population genetics. This is because machine learning excels at solving complex problems with large-scale data, similar to the challenges encountered in the genomic era of population genetics.

<p align="center">
    <img alt="pgml" src="https://github.com/xin-huang/pgml/blob/main/Section_01/pgml.png?raw=true" width=480>
</p>

<p align="center">
    <strong>A figure created by <a href="https://openai.com/research/dall-e">DALL·E</a> for machine learning in population genetics</strong>
</p>

In this course, we will introduce basic knowledge of machine learning by implementing its applications to address practical problems from recent publications in population genetics. We will divide our course into two parts, based on machine learning paradigms:

- **Supervised Learning**: This is the process of training a machine learning model on a dataset that contains input-output pairs, where the model learns to predict the output from the input. It is ideal for tasks like classification and regression. In this part, we will apply supervised learning algorithms, such as logistic regression and decision trees, to detect ghost introgressed fragments in genomes.

- **Unsupervised Learning**: This involves training a model on data without pre-defined labels, allowing the model to identify patterns or structures on its own. It is used for clustering, dimensionality reduction, and more. Here, we will explore unsupervised learning algorithms, like dimensionality reduction techniques and generative models, to uncover population structure or simulate artificial genomes.

We will not cover **reinforcement learning**, which involves training a model to make decisions based on rewards from interactions with an environment, due to its current limited application in population genetics.

Although the focus is on applying machine learning to population genetics problems, due to the versatility and power of machine learning, we encourage you to think about how these techniques can be applied within your own fields of study.

<a name="jupyter-notebook"></a>
### Jupyter Notebook <img align="right" alt="jupyter" src="https://jupyter.org/assets/share.png" width=72>

In this course, all materials are implemented using [Jupyter Notebook](https://jupyter.org/), which is an interactive, web-based platform that allows for the integration of code, plots, and text into a single document, known as a Jupyter notebook. Jupyter Notebook supports multiple programming languages, such as Python, R, and Julia, which are especially popular for data analysis.

A Jupyter notebook consists of a series of cells. There are primarily two types of cells: text cells, which use [Markdown syntax](https://www.markdownguide.org/basic-syntax/) for formatting, and code cells, which contain executable code. The code within these cells can be run to display its results directly beneath the cell, allowing for interactive computational narratives.

To run a code cell in Google Colab, we can click the play icon at the left side of the code cell or press `Ctrl + Enter`.

In [None]:
# This is a code cell.
print("Hello World!")

Here, the symbol `#` indicates that the statement is a comment, which is not executed. The `print()` function is a built-in function from Python, which prints the content within the parentheses to the screen.

In this course, we will use Python and some common [Linux](https://en.wikipedia.org/wiki/Linux) shell commands for analysing population genetic problems with machine learning. To execute a shell command in a Jupyter Notebook code cell, prefix the command with an exclamation mark (`!`):

In [None]:
!cat /proc/version

Here, [cat](https://linux.die.net/man/1/cat) is a Linux command that can display the content of a file (e.g., `/proc/version`).

<a name="resources"></a>
## Resources

- [W3schools Python Tutorial](https://www.w3schools.com/python/)
- [Python Language Reference](https://docs.python.org/3/reference/)
- [Pattern Recognition and Machine Learning](https://www.microsoft.com/en-us/research/uploads/prod/2006/01/Bishop-Pattern-Recognition-and-Machine-Learning-2006.pdf)
- [Deep Learning](https://www.deeplearningbook.org/)
- [Population Genetics Notes](https://github.com/cooplab/popgen-notes)

<a name="python-programming"></a>
## Python Programming <wbr><img alt="python" align="right" src="https://upload.wikimedia.org/wikipedia/commons/thumb/c/c3/Python-logo-notext.svg/1869px-Python-logo-notext.svg.png" width=32>

[Python](https://www.python.org/) is a widely-used programming language, favored in the fields of machine learning and bioinformatics. In our course, we will apply three prominent Python packages for implementing machine learning applications in population genetics:

- [scikit-learn](https://scikit-learn.org/stable/) offers a wide array of algorithms for various machine learning tasks, including classification, regression, and clustering.
- [PyTorch](https://pytorch.org/), developed by Meta, is a deep learning framework that is popular among academic researchers for its user-friendly interface.
- [TensorFlow](https://www.tensorflow.org/), created by Google, is another deep learning framework designed for large-scale, real-world applications, often preferred in the industry.

<p align="center">
    <img alt="scikit-learn" src="https://upload.wikimedia.org/wikipedia/commons/thumb/0/05/Scikit_learn_logo_small.svg/2560px-Scikit_learn_logo_small.svg.png" width=96>
    <img alt="pytorch" src="https://upload.wikimedia.org/wikipedia/commons/thumb/c/c6/PyTorch_logo_black.svg/1200px-PyTorch_logo_black.svg.png" width=96>
    <img alt="tensorflow" src="https://upload.wikimedia.org/wikipedia/commons/thumb/a/ab/TensorFlow_logo.svg/1200px-TensorFlow_logo.svg.png" width=96>
</p>

Before diving into our machine learning implementation, we will start with the basics of Python programming. This foundational knowledge will allow us to progressively enhance our understanding of Python programming throughout the course.

<a name="variables"></a>
### Variables

In Python, creating a variable is straightforward: we simply assign a value to a name. For example, we can create a variable named `x` to store an integer value `1` using the following:

In [None]:
x = 1
print(x)

<a name="task1"></a>
**Task 1:** Create a variable `y` that stores an integer value `200` and print out `y`.

In [None]:
# Please implement your code here.


<details>
  <summary>
    <font size="3" color="darkgreen">
      <b>Click for hints</b>
    </font>
  </summary>

  - During the course, we will ask you to perform some tasks to practice the knowledge we have learned. We encourage you to think of your solutions independently. However, if you are unsure how to complete a task, you can click on the cell containing hints and solutions located below the task. Note that the solution provided here may not be the only or the best solution; we encourage you to explore and find the most effective solution.
  - <details>
      <summary>
        <font size="3" color="darkblue">
          <b>Click for solutions</b>
        </font>
      </summary>

      ```
      y = 200
      print(y)
      ```
  </details>
</details>

<a name="built-in-data-types"></a>
### Built-in Data Types


<a name="int"></a>
#### `int`

There are several built-in [data types](https://www.w3schools.com/python/python_datatypes.asp) in Python. For example, when we assigned an integer to the variable `x`, Python automatically identified its data type as `int` because it can infer the data type from the assigned value. Consequently, there is no need to explicitly specify the data type of a variable upon its creation in Python. We can find out the data type of a variable by using the `type()` function:


In [None]:
type(x)

<a name="float"></a>
#### `float`

`float` is another data type for numbers. It represents real numbers with a fractional part, usually as double-precision floating-point values, and is suitable for mathematical calculations that integers alone cannot express:

In [None]:
pi = 3.1415926535897932384626
type(pi)
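
Because a `float` has limited precision, not every digit of a long literal is stored, and some calculations carry tiny rounding errors. The following minimal sketch illustrates this; it uses the `isclose()` function from Python's standard `math` module, and the variable name `total` is only illustrative:

```python
import math

# A float keeps only about 15 to 17 significant decimal digits,
# so the extra digits of this literal are not stored.
pi = 3.1415926535897932384626
print(pi == 3.141592653589793)

# Rounding errors can make mathematically equal results differ slightly.
total = 0.1 + 0.2
print(total == 0.3)

# math.isclose() compares two floats within a small tolerance.
print(math.isclose(total, 0.3))
```

For this reason, it is usually safer to compare `float` values with a tolerance than to test them for exact equality.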

<a name="bool"></a>
#### `bool`
In addition to mathematical calculations, logical operations are essential for making decisions in programming. For this purpose, we use the `bool` data type in Python, which comprises only two values: `True` and `False`. `bool` is the abbreviation for [Boolean](https://en.wikipedia.org/wiki/Boolean_data_type) in Python. A `bool` variable can be created by assigning either `True` or `False` to it:

In [None]:
flag = True
type(flag)
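
In practice, `bool` values often come from comparisons rather than from direct assignment. As a small sketch (the variable names here are illustrative and not used elsewhere in the course), comparisons produce `bool` values that can be combined with the keywords `and`, `or`, and `not`:

```python
sequence_length = 4

# A comparison evaluates to a bool value.
is_short = sequence_length < 10
is_empty = sequence_length < 1
print(is_short, type(is_short))
print(is_empty)

# Logical operators combine or invert bool values.
print(is_short and not is_empty)
print(is_short or is_empty)
```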

<a name="str"></a>
#### `str`

Occasionally, we need to store characters or sequences of characters, such as DNA sequences; in these instances, the variable should be of the `str` data type. `str` stands for "string" in Python, which is used to represent textual data. To create a `str` variable, text can be enclosed within single (`'`), double (`"`), or triple quotes (`'''` or `"""`):

In [None]:
dna_sequence = 'ATCG'
type(dna_sequence)

When a number is enclosed in quotes, the variable is treated as a `str` data type, regardless of its numeric value:

In [None]:
x = "1"
type(x)
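
If we need the numeric value of such a string, we can convert it explicitly. The sketch below uses the built-in `int()` and `str()` functions; the variable names `text` and `number` are only illustrative:

```python
text = "1"

# int() converts a numeric string into an integer.
number = int(text)
print(type(number))
print(number + 1)

# str() converts a number back into a string, so `+` joins text again.
print(str(number) + "1")
```

Note that `+` adds two `int` values but joins two `str` values, which is why the two results differ.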

We can join two `str` variables by using the addition symbol `+`:

In [None]:
dna_sequence1 = 'ATCG'
dna_sequence2 = 'GCTA'
dna_sequence = dna_sequence1 + dna_sequence2
print(dna_sequence)

Each character in a `str` variable can be accessed by using its index in square brackets, where the index starts from `0`. For example, we can obtain the first character in `dna_sequence`:

In [None]:
print(dna_sequence[0])

However, we cannot modify the items in a `str` variable:

In [None]:
# This code does not work, because the value of a `str` variable is immutable.
dna_sequence[0] = 'C'
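
One possible workaround is to build a new string instead of changing the existing one. The sketch below uses slicing, which is described in the `list` section of this notebook, and an illustrative variable named `example_sequence`:

```python
example_sequence = 'ATCGGCTA'

# Build a new string: the replacement character plus everything after index 0.
modified_sequence = 'C' + example_sequence[1:]
print(modified_sequence)

# The original string is unchanged.
print(example_sequence)
```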

<a name="li"></a>
#### `list`

When dealing with a large volume of data, such as thousands of genome sequences, it is impractical to create thousands of variables for each genome. In such scenarios, the `list` data type becomes invaluable as it enables the storage of multiple items within a single variable. To create a `list`, we can enclose our items within square brackets `[]`:

In [None]:
dna_sequences = ['ATCG', 'GCTA', 'AAAAA', 'TTT', 'CCCCGGG']
print(dna_sequences)

To add an item into a list, we can use the `append()` method:

In [None]:
dna_sequences.append('GGG')
print(dna_sequences)

We can also add a list into a list:

In [None]:
dna_sequences.append(['A', 'T', 'C', 'G'])
print(dna_sequences)

If we want to avoid creating a nested list, we can use the `extend()` method:

In [None]:
dna_sequences.extend(['A', 'T', 'C', 'G'])
print(dna_sequences)

To access an item in a list, we can use its index in square brackets. For example, we can get the first item in `dna_sequences`:

In [None]:
print(dna_sequences[0])

We can also modify an item in a list through its index. For instance, we can change the last item in `dna_sequences`:

In [None]:
dna_sequences[len(dna_sequences)-1] = 'ACG'
print(dna_sequences)
dna_sequences[-1] = 'GTA'
print(dna_sequences)

The `len()` function returns the number of items in a list. Since list indexing starts from `0`, the index of the last item is `len(list)-1`. Additionally, Python supports negative indexing to access elements from the end, allowing us to use the index `-1` to directly access the last item.

We can also obtain part of a list using slicing. For example, we can take the first three elements of the `dna_sequences` list:

In [None]:
dna_sequences[0:3]

The slice is specified by indicating a `start` index, a `stop` index, and an optional `step` size within square brackets (`[]`) after the list variable. The basic form of slicing syntax is `list[start:stop:step]`.

- The `start` index is inclusive, meaning the slice will include the element at this index. If `start` is omitted, the slice starts from the beginning of the list.
- The `stop` index is exclusive, meaning the slice will stop before the element at this index. If `stop` is omitted, the slice goes through to the end of the list.
- The `step` size indicates the increment between indices in the slice. If `step` is omitted, it defaults to 1, so every element between `start` and `stop` is included in the slice.

Here are examples:

In [None]:
# Get elements from index 2 to 5
print(dna_sequences[2:6])

# Get elements from the beginning to index 3
print(dna_sequences[:4])

# Get elements from index 5 to the end
print(dna_sequences[5:])

# Get every other element from the list
print(dna_sequences[::2])

# Reverse the list
print(dna_sequences[::-1])

Similarly, `str` variables can use the same slicing technique as lists. This allows us to extract substrings from a string using a similar syntax: `string[start:stop:step]`.
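
As a brief illustration of string slicing, the following sketch applies the same `start`, `stop`, and `step` rules to an illustrative string named `example_sequence`:

```python
example_sequence = 'AATTCCGG'

# The first three characters (indices 0, 1, and 2).
print(example_sequence[0:3])

# Every second character, starting from index 0.
print(example_sequence[::2])

# The reversed string.
print(example_sequence[::-1])
```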

<a name="dict"></a>
#### `dict`

Although lists are effective for collections of data, there are instances where accessing items using a non-numeric index (called "key") may be more convenient. In such situations, the `dict` (short for "dictionary") data type becomes particularly useful. To create a dictionary, we can use curly braces:

In [None]:
dna_sequences = {
    'Human': 'ATCG',
    'E.coli': 'GCTA',
    'Drosophila': 'AAAAA',
}
print(dna_sequences)

We can obtain the `Human` sequence:

In [None]:
print(dna_sequences['Human'])

We can add a new item into a dictionary by specifying a key-value pair:

In [None]:
dna_sequences['Arabidopsis'] = 'TTT'
print(dna_sequences)

Here, `'Arabidopsis'` is the key and `'TTT'` is the corresponding value.
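
Accessing a key that does not exist with square brackets raises a `KeyError`. As a hedged sketch, the dictionary method `get()` can be used instead when a key may be missing; the dictionary `example_sequences` and the key `'Zebrafish'` are illustrative only:

```python
example_sequences = {
    'Human': 'ATCG',
    'E.coli': 'GCTA',
}

# get() returns the value for an existing key.
print(example_sequences.get('Human'))

# For a missing key, get() returns None or a default that we provide.
print(example_sequences.get('Zebrafish'))
print(example_sequences.get('Zebrafish', 'unknown'))

# Assigning to an existing key overwrites its value.
example_sequences['Human'] = 'ATCGA'
print(example_sequences['Human'])

# len() returns the number of key-value pairs.
print(len(example_sequences))
```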

<a name="commonly-used-statements"></a>
### Commonly Used Statements

To effectively use variables and built-in data types, we need some statements to help us to control the workflow in our programs, such as making decisions or looping over data.

<a name="if"></a>
#### `if`

The `if` statement is commonly used for decision-making; that is, if a condition is met, an operation will be executed. For instance, we can print a sequence from a species stored in the `dna_sequences` variable, provided the sequence of the specified species exists:

In [None]:
species = 'Human'
if species in dna_sequences.keys():
    print('The sequence from ' + species + ': ' + dna_sequences[species])
print("The if block has ended.")

The `if` statement begins with the keyword `if`, followed by the condition we wish to evaluate. A colon (`:`) follows the condition, signaling the start of the `if` block. The subsequent code block must be indented to indicate that it belongs to the `if` statement.

In Python, control structures such as `if` statements, `for` and `while` loops, along with function definitions that begin with the keyword `def`, utilize a colon (`:`) to indicate the start of a code block. This code block must be indented to distinguish it as part of the control structure. Python uses indentation to define blocks of code, meaning that all lines of code within the same block must have the same level of indentation. The end of a code block is marked by returning to the previous level of indentation. This principle of indentation is a fundamental aspect of Python syntax, ensuring code readability and structural clarity.

The `keys()` method returns the keys of a dictionary (as in `dna_sequences` in this case), and the `in` keyword checks whether the value of a variable (`species` in this case) is among the keys of the dictionary:

In [None]:
print(dna_sequences.keys())
print(species in dna_sequences.keys())

The `if` statement evaluates the condition following the keyword `if`. If the condition (e.g., `species in dna_sequences.keys()`) evaluates to `True`, the code within the `if` block is executed (e.g., `print('The sequence from ' + species + ': ' + dna_sequences[species])`). If the condition is `False`, the `if` block is skipped.

We can follow the `if` statement with an `else` statement to execute operations when the `if` condition is `False`:

In [None]:
species = 'Oryza sativa'
if species in dna_sequences.keys():
    print('The sequence from ' + species + ': ' + dna_sequences[species])
else:
    print('We do not have the sequence from ' + species)
print(species in dna_sequences.keys())

If we have multiple conditions, we can follow the `if` statement with an `elif` (short for "else if") statement to specify additional conditions. The `elif` statement can be used multiple times to create a chain of conditions, allowing for complex decision-making paths in the code. Each `elif` is checked in sequence after the initial `if`. If none of the conditions in the `if` or `elif` statements are met, we can also add an `else` statement at the end. The `else` block will execute if all preceding conditions are `False`, providing a default action when no specific conditions apply.

For example, we can add the sequence of a species that does not exist in the `dna_sequences` variable to the same variable:

In [None]:
sequences = {
    'Oryza sativa': 'GGG',
}

species = 'Oryza sativa'
if species in dna_sequences.keys():
    print('The sequence from ' + species + ': ' + dna_sequences[species])
elif sequences[species] != '':
    dna_sequences[species] = sequences[species]
    print('Added a new sequence!')
    print('The sequence from ' + species + ': ' + dna_sequences[species])
else:
    print('We do not have the sequence from ' + species)

This structure is highly flexible and can accommodate as many `elif` statements as necessary to cover the different scenarios your program might encounter.

Here, `''` represents an empty string, which is a string that contains no characters. The `!=` operator checks whether the values on either side of it are unequal. If the values are not the same, the expression evaluates to `True`; if they are the same, it evaluates to `False`. Hence, `sequences[species] != ''` checks if the value associated with the key `species` in the `sequences` dictionary is not an empty string. Conversely, the `==` operator verifies whether the values on either side of it are equal. If the values are the same, the expression evaluates to `True`; if they are not the same, it evaluates to `False`.

In [None]:
sequences['C.elegans'] = ''

species = 'C.elegans'
if species in dna_sequences.keys():
    print('The sequence from ' + species + ': ' + dna_sequences[species])
elif sequences[species] != '':
    dna_sequences[species] = sequences[species]
    print('Added a new sequence!')
    print('The sequence from ' + species + ': ' + dna_sequences[species])
else:
    print('We do not have the sequence from ' + species)
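
To see the `==` and `!=` operators in isolation, here is a minimal sketch with two illustrative variables, `empty_sequence` and `example_sequence`:

```python
empty_sequence = ''
example_sequence = 'GGG'

# `==` is True when both sides are equal; `!=` is True when they differ.
print(empty_sequence == '')
print(empty_sequence != '')
print(example_sequence != '')

# Comparisons between strings are case-sensitive.
print(example_sequence == 'ggg')
```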

<a name="for"></a>
#### `for`

The `for` statement is used for iterating over iterable objects, such as lists, strings, and dictionaries, allowing us to execute a block of code for each item. This makes it a powerful tool for looping through data structures to perform operations on each element.

For example, we can use the `for` statement to iterate over the `dna_sequences` dictionary, printing out each key (species here) and its corresponding value (sequences in this case):

In [None]:
print('The species for which we have sequences are:')
for k in dna_sequences:
    print('- ' + k + ': ' + dna_sequences[k])
print("The for loop has ended.")

The `for` statement begins with the keyword `for`, followed by a variable name (e.g., `k`), which is then followed by the keyword `in`. After `in`, we specify the iterable variable (e.g., `dna_sequences`) that we want to iterate over.  Within the loop, we can access each item in `dna_sequences` using the variable `k`.
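
The same pattern works for other iterables. The sketch below, which uses illustrative variables, shows two common variations: the dictionary method `items()`, which provides each key together with its value, and a loop over a list that collects the length of every sequence:

```python
example_sequences = {'Human': 'ATCG', 'E.coli': 'GCTA'}

# items() yields each key together with its value,
# which we unpack into two loop variables.
for species, sequence in example_sequences.items():
    print('- ' + species + ': ' + sequence)

# Iterating over a list visits its items in order.
lengths = []
for sequence in ['ATCG', 'TTT', 'CCCCGGG']:
    lengths.append(len(sequence))
print(lengths)
```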

<a name="while"></a>
#### `while`

The `while` statement is used for executing a block of code repeatedly as long as a given condition is met. It enables the creation of loops that continue to run until the condition becomes `False`. This is particularly useful for scenarios where the number of iterations needed is not known before the loop starts.

For example, we can also use the `while` statement to loop through the `dna_sequences` dictionary and print out each key-value pair:



In [None]:
print('The species for which we have sequences are:')
count = 0
while count < len(dna_sequences):
    k = list(dna_sequences)[count]
    print('- ' + k + ': ' + dna_sequences[k])
    count += 1
print("The while loop has ended.")

Here, we use the variable `count` to track the number of key-value pairs we have processed. In each iteration, `list(dna_sequences)` converts the keys of the dictionary into a list so that we can select a key by its position. As long as `count` is smaller than the total number of items in the `dna_sequences` dictionary, the loop continues; it ends once `count` is no longer smaller than (that is, `>=`) the total number of items in the dictionary.

The statement `count += 1` means that the current value of the `count` variable is increased by 1. This is a shorthand for `count = count + 1`, which takes the current value of `count`, adds 1 to it, and then updates `count` with the new value. It is commonly used in loops and iterative processes as a counter to keep track of the number of iterations or to accumulate a total incrementally.
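
A `while` loop is most natural when the number of iterations is not known in advance. As a simplified sketch, the loop below reads an illustrative sequence three characters at a time and stops at the first `'TAG'` triplet; checking only `'TAG'` is an assumption made to keep the example short:

```python
example_sequence = 'ATGAAATAGCCC'

# Read three characters at a time until 'TAG' is found
# or fewer than three characters remain.
position = 0
codon = example_sequence[position:position + 3]
while len(codon) == 3 and codon != 'TAG':
    position += 3
    codon = example_sequence[position:position + 3]
print('Stopped at index ' + str(position))
```

Here, the loop condition combines two checks with `and`, so the loop also ends safely when the sequence contains no `'TAG'` triplet.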

<a name="def"></a>
#### Define a Function

From the examples provided, we observe that certain code segments are used repeatedly:
```
if species in dna_sequences.keys():
    print('The sequence from ' + species + ': ' + dna_sequences[species])
elif sequences[species] != '':
    dna_sequences[species] = sequences[species]
    print('Added a new sequence!')
    print('The sequence from ' + species + ': ' + dna_sequences[species])
else:
    print('We do not have the sequence from ' + species)
```
To make our code more reusable and organized, we can encapsulate this logic within a function. This approach allows us to decompose complex problems into smaller, manageable components and structure our code into logical sections. It simplifies navigating and understanding the flow of the program, making it easier to maintain and extend. Here is the example:

In [None]:
def check_sequences(dictionary: dict, species: str,
                    sequences: dict = None) -> None:
    """
    Description:
        Verifies if a specified species' sequence exists within the dictionary.
        If the sequence is absent but provided, it is then added to the
        dictionary.

    Arguments:
        dictionary (dict): The dictionary to check for the species' sequence.
        species (str): The species to look for in the dictionary.
        sequences (dict, optional): A dictionary containing the species and
            its sequence to add to the dictionary if not already present.
            Default: None.

    Returns:
        None.
    """
    if species in dictionary.keys():
        print('The sequence from ' + species + ': ' + dictionary[species])
    elif (sequences is not None):
        if (species in sequences) and (sequences[species] != ''):
            dictionary[species] = sequences[species]
            print('Added a new sequence!')
            print('The sequence from ' + species + ': ' + dictionary[species])
        else:
            print('We do not have the sequence from ' + species)
    else:
        print('We do not have the sequence from ' + species)

To define a function, we start with the keyword `def`, followed by the function name (`check_sequences` in this case). We then specify the parameters of our function within the parentheses after the function name. In the `check_sequences()` function, we define three parameters:

- `dictionary`
- `species`
- `sequences`

We assign a default value of `None`, a built-in constant typically used to signify the absence of a value, to the `sequences` parameter. This means that if we do not specify a value for `sequences` when calling the `check_sequences()` function, it will automatically use `None`. Since we do not set default values for the `dictionary` and `species` parameters, they must be provided when using the `check_sequences()` function; otherwise, Python will raise a `TypeError` due to missing required positional arguments:

In [None]:
# This code does not work, because one required parameter `species` is missing.
check_sequences(dna_sequences)

In [None]:
check_sequences(dna_sequences, 'Human')
# Add a new sequence
sequences['C.elegans'] = 'TTTTT'
check_sequences(dna_sequences, 'C.elegans', sequences)

When calling a function with positional arguments, the arguments must follow the order of the parameters as defined in the function. However, arguments can also be passed explicitly to their corresponding parameters by name (keyword arguments), in which case their order does not matter:

In [None]:
check_sequences(species='C.elegans', dictionary=dna_sequences)

To enhance the readability and maintainability of our code, we can specify the expected data type of each parameter using a colon followed by the type. This practice, known as type annotation, helps developers understand what types of values should be passed to functions. We also annotate the return type of our function with `-> None`, indicating that our function does not return any value, as it primarily prints information to the screen. Although type annotations are optional in Python and do not influence the runtime behavior of our code, they play a crucial role in documenting our function clearly.
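
To illustrate a function that returns a value, and to show that annotations are not checked at run time, here is a small sketch. The function `gc_content()` is a hypothetical example and is not part of the course code; the `return` statement passes the result back to the caller:

```python
def gc_content(sequence: str) -> float:
    """Return the fraction of G and C characters in a sequence."""
    if len(sequence) == 0:
        return 0.0
    gc_count = 0
    for base in sequence:
        if base in 'GC':
            gc_count += 1
    return gc_count / len(sequence)


print(gc_content('ATCG'))

# Annotations are not enforced at run time: a list of characters is
# accepted even though the parameter is annotated as str.
print(gc_content(['G', 'C', 'A', 'A']))
```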

Similarly, the following block, which is enclosed in triple quotes, is called a docstring in Python. It serves as documentation for what a function does, including details about its parameters, return values, and any other relevant information.

```
"""
Description:
    Verifies if a specified species' sequence exists within the dictionary.
    If the sequence is absent but provided, it is then added to the
    dictionary.

Arguments:
    dictionary (dict): The dictionary to check for the species' sequence.
    species (str): The species to look for in the dictionary.
    sequences (dict, optional): A dictionary containing the species and
        its sequence to add to the dictionary if not already present.
        Default: None.

Returns:
    None.
"""
```
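
A docstring is more than a comment, because Python stores it with the function. As a sketch with a hypothetical function named `reverse_sequence()`, the docstring can be read from the `__doc__` attribute or displayed with the built-in `help()` function:

```python
def reverse_sequence(sequence: str) -> str:
    """Return the sequence in reverse order."""
    return sequence[::-1]


# The docstring is stored in the function's __doc__ attribute.
print(reverse_sequence.__doc__)

# help() prints the function signature together with the docstring.
help(reverse_sequence)
```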

Note that triple quotes (either `'''` for single triple quotes or `"""` for double triple quotes) can be used to define multi-line strings, allowing us to easily span a string over several lines without using escape characters for newlines. This is particularly useful for long texts, docstrings, or when embedding formatted text directly within our code.

In contrast, strings in single (`'`) or double (`"`) quotes must be written on a single line of code. To include line breaks in such a string, we need to use the newline escape character (`\n`):

In [None]:
single_line_string = "This is an example of a\nsingle-line string in Python,\nusing double quotes."
print(single_line_string)
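
For comparison, the following sketch writes the same kind of text once with triple quotes and once with `\n`, and checks that both forms produce an identical string; the variable names are illustrative:

```python
multi_line_string = """This is an example of a
multi-line string in Python,
using triple quotes."""

escaped_string = "This is an example of a\nmulti-line string in Python,\nusing triple quotes."

# Both forms produce exactly the same string.
print(multi_line_string == escaped_string)
print(multi_line_string)
```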

<a name="import"></a>
#### Import a Package

Python boasts an active and supportive community, which stands as one of the key reasons behind its widespread popularity. This vibrant ecosystem allows us to reuse code developed by the community, significantly speeding up the development process. For scientific computing and data visualization, three widely-used Python packages are [numpy](https://numpy.org/), [scipy](https://scipy.org/), and [matplotlib](https://matplotlib.org/). Besides, [pandas](https://pandas.pydata.org/) is a popular library for data analysis.

These packages are pre-installed in Google Colab. To use them, we can use the `import` statement:

In [None]:
import scipy.stats  # Load the stats subpackage explicitly.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

We can rename a package when importing it by using the keyword `as` for convenience or to avoid naming conflicts.

Here, we use `numpy`, `scipy`, and `matplotlib` to plot a normal distribution:

In [None]:
fig, ax = plt.subplots(1, 1)
x = np.linspace(scipy.stats.norm.ppf(0.01), scipy.stats.norm.ppf(0.99), 100)
ax.plot(x, scipy.stats.norm.pdf(x), 'r-', lw=5, alpha=0.6, label='norm pdf')
plt.show()

To use a method (or function) from a package in Python, the notation is `package.method()` or `package.function()`, where `package` is the name of the package (or module) and `method()` or `function()` is the name of the method or function you want to use. If you have imported the package with an alias, you would use that alias instead of the original package name. For example, if you imported `numpy` with `import numpy as np`, you would use `np.linspace()` to create an array of linearly spaced numbers between a specified start value and stop value.

Because we do not assign an alias to `scipy` upon importing, we must explicitly specify the full method name as `scipy.stats.norm.ppf()` to use it.
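
Another common form is `from package import name`, which makes a single function available without the package prefix. The sketch below uses the standard `math` module purely as an illustration, since it needs no installation:

```python
import math
from math import sqrt

# With `import math`, the function is reached through the module name.
print(math.sqrt(16))

# With `from math import sqrt`, the name can be used directly.
print(sqrt(16))
```

Importing individual names keeps calls short, while the full `package.function()` form makes it clearer where a function comes from.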

<a name="io"></a>
### File Input and Output

Usually, we need to read data from files and write data to files. To output our `dna_sequences` dictionary into a file named `dna_sequences.txt`, we can use the following statement:

In [None]:
with open('dna_sequences.txt', 'w') as f:
    for k in dna_sequences:
        f.write('- ' + k + ': ' + dna_sequences[k] + '\n')

The statement begins with the keyword `with`, followed by the `open()` function. Within `open()`, we specify two arguments: the first is the name of the file we want to open (e.g., `dna_sequences.txt`), and the second is the mode, here `w` for writing. We assign an alias `f` to this opened file. This makes it possible to access and work with the file using the variable `f`. For instance, to write items from the `dna_sequences` dictionary to the file, we can iterate over the dictionary and use `f.write()` to add each item to the file.

To read data from the `dna_sequences.txt` file, we can change the mode from `w` (write) to `r` (read) in the `open()` function. Then, we can use a for loop to iterate through each line in the file:

In [None]:
with open('dna_sequences.txt', 'r') as f:
    for line in f:
        print(line.strip())

The `.strip()` method removes whitespace, including the newline character (`\n`), from both ends of each line. The newline character is kept at the end of each line when reading a file this way. Calling `.strip()` is optional and depends on how you need to process each line.

In [None]:
# Without the .strip() method, an extra blank line is printed between two lines.
with open('dna_sequences.txt', 'r') as f:
    for line in f:
        print(line)
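
Reading and writing can be combined to store a dictionary and load it again. The following sketch assumes a tab-separated layout and a hypothetical file name, `example_sequences.txt`; it uses the string method `split()` to break each line at the tab character:

```python
example_sequences = {'Human': 'ATCG', 'E.coli': 'GCTA'}

# Write one tab-separated key-value pair per line.
with open('example_sequences.txt', 'w') as f:
    for species in example_sequences:
        f.write(species + '\t' + example_sequences[species] + '\n')

# Read the file back and rebuild the dictionary.
loaded_sequences = {}
with open('example_sequences.txt', 'r') as f:
    for line in f:
        species, sequence = line.strip().split('\t')
        loaded_sequences[species] = sequence

print(loaded_sequences == example_sequences)
```

A simple separator such as a tab makes each line easy to split again, provided that the keys and values do not contain the separator themselves.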

<a name="task2"></a>
### Task 2: Comprehensive Application

[The central dogma of molecular biology](https://www.genome.gov/genetics-glossary/Central-Dogma) describes the flow of genetic information from DNA to RNA, and then from RNA to protein. Based on this principle, we can translate DNA sequences into protein sequences.

In this task, we will focus on the DNA sequence encoding the [G6PD](https://www.ncbi.nlm.nih.gov/gene/2539) protein. G6PD is of particular interest as it may be under [natural selection](https://www.nature.com/articles/376246a0) in Africans due to resistance to severe malaria. We can download its DNA sequence using the following command:

In [None]:
!wget -c https://raw.githubusercontent.com/xin-huang/pgml/main/Section_01/G6PD.dna.sequence.txt

We can take a look at its DNA sequence:

In [None]:
!cat G6PD.dna.sequence.txt

Using what we have learned, please implement a Python program to translate the DNA sequence in the `G6PD.dna.sequence.txt` file into the corresponding protein sequence using the standard DNA or RNA codon table from [here](https://en.wikipedia.org/wiki/DNA_and_RNA_codon_tables). Denote stop codons with the `*` symbol. Store your result in a variable called `g6pd_protein` and a file named `G6PD.protein.sequence.txt`. For reference, you can find the DNA and protein sequences [here](https://www.ncbi.nlm.nih.gov/CCDS/CcdsBrowse.cgi?REQUEST=GENEID&DATA=2539).

In [None]:
# Please implement your code here.
g6pd_protein = ''

You can check your result by executing the following code cells.

In [None]:
g6pd_protein == 'MAEQVALSRTQVCGILREELFQGDAFHQSDTHIFIIMGASGDLAKKKIYPTIWWLFRDGLLPENTFIVGYARSRLTVADIRKQSEPFFKATPEEKLKLEDFFARNSYVAGQYDDAASYQRLNSHMNALHLGSQANRLFYLALPPTVYEAVTKNIHESCMSQIGWNRIIVEKPFGRDLQSSDRLSNHISSLFREDQIYRIDHYLGKEMVQNLMVLRFANRIFGPIWNRDNIACVILTFKEPFGTEGRGGYFDEFGIIRDVMQNHLLQMLCLVAMEKPASTNSDDVRDEKVKVLKCISEVQANNVVLGQYVGNPDGEGEATKGYLDDPTVPRGSTTATFAAVVLYVENERWDGVPFILRCGKALNERKAEVRLQFHDVAGDIFHQQCKRNELVIRVQPNEAVYTKMMTKKPGMFFNPEESELDLTYGNRYKNVKLPDAYERLILDVFCGSQMHFVRSDELREAWRIFTPLLHQIELEKPKPIPYIYGSRGPTEADELMKRVGFQYEGTYKWVNPHKL*'


In [None]:
!cat G6PD.protein.sequence.txt
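
If the comparison above returns `False`, it can help to find out where the two sequences first differ. The following is a minimal sketch of such a check. The function `first_mismatch` is a hypothetical helper, not part of the course material, and the strings used here are toy examples rather than the real G6PD sequence.

```python
def first_mismatch(result, expected):
    """Return the first index where the two strings differ, or -1 if they are identical."""
    for i in range(min(len(result), len(expected))):
        if result[i] != expected[i]:
            return i
    # One string is a prefix of the other: they differ where the shorter one ends.
    if len(result) != len(expected):
        return min(len(result), len(expected))
    return -1


# Toy strings for illustration; in the task these would be
# g6pd_protein and the expected protein sequence.
print(first_mismatch('MAEQ*', 'MAEQ*'))  # -1: identical
print(first_mismatch('MAEK*', 'MAEQ*'))  # 3: different amino acid at index 3
print(first_mismatch('MAEQ', 'MAEQ*'))   # 4: the stop symbol is missing
```

Multiplying the returned index by three gives the position in the DNA sequence where the corresponding codon starts.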

<details>
  <summary>
    <font size="3" color="darkgreen">
      <b>Click for hints</b>
    </font>
  </summary>

  1. **Create a Codon Table Dictionary:** Given that three nucleotides encode a single amino acid, it is straightforward to map these nucleotide triplets (codons) to their corresponding amino acids. A [dictionary](#dict) is an ideal structure for this codon table, using the triplet sequences as keys and their respective amino acids as values.

  2. **Read the DNA Sequence from a File:** Utilize the `open()` function with [read mode](#io) to read the DNA sequence from `G6PD.dna.sequence.txt`. Employ a [`for` loop](#for) to process the file line by line, storing the sequence in a variable (e.g., `g6pd_dna`). Ensure you use the `strip()` method to eliminate any trailing newline characters from each line.

  3. **Convert the DNA Sequence to a Protein Sequence:** Implement a [`while` loop](#while) to iterate through the DNA sequence stored in your variable. Extract sets of three characters (codons) at a time, and use the codon table dictionary to translate these codons into their corresponding amino acids. Accumulate these amino acids to form the protein sequence, storing it in a variable (e.g., `g6pd_protein`).

  4. **Write the Protein Sequence to a File:** Use the `open()` function with [write mode](#io) to output the protein sequence into a file named `G6PD.protein.sequence.txt`.
  - <details>
      <summary>
        <font size="3" color="darkblue">
          <b>Click for solutions</b>
        </font>
      </summary>

      ```
      # Standard genetic code: DNA codons mapped to one-letter amino acids ('*' marks a stop codon).
      codon_table = {
          'TTT': 'F', 'TTC': 'F', 'TTA': 'L', 'TTG': 'L',
          'CTT': 'L', 'CTC': 'L', 'CTA': 'L', 'CTG': 'L',
          'ATT': 'I', 'ATC': 'I', 'ATA': 'I', 'ATG': 'M',
          'GTT': 'V', 'GTC': 'V', 'GTA': 'V', 'GTG': 'V',
          'TCT': 'S', 'TCC': 'S', 'TCA': 'S', 'TCG': 'S',
          'CCT': 'P', 'CCC': 'P', 'CCA': 'P', 'CCG': 'P',
          'ACT': 'T', 'ACC': 'T', 'ACA': 'T', 'ACG': 'T',
          'GCT': 'A', 'GCC': 'A', 'GCA': 'A', 'GCG': 'A',
          'TAT': 'Y', 'TAC': 'Y', 'TAA': '*', 'TAG': '*',
          'CAT': 'H', 'CAC': 'H', 'CAA': 'Q', 'CAG': 'Q',
          'AAT': 'N', 'AAC': 'N', 'AAA': 'K', 'AAG': 'K',
          'GAT': 'D', 'GAC': 'D', 'GAA': 'E', 'GAG': 'E',
          'TGT': 'C', 'TGC': 'C', 'TGA': '*', 'TGG': 'W',
          'CGT': 'R', 'CGC': 'R', 'CGA': 'R', 'CGG': 'R',
          'AGT': 'S', 'AGC': 'S', 'AGA': 'R', 'AGG': 'R',
          'GGT': 'G', 'GGC': 'G', 'GGA': 'G', 'GGG': 'G',
      }

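      # Read the DNA sequence, joining all lines into a single string.
      g6pd_ccds = ''

      with open('G6PD.dna.sequence.txt', 'r') as f:
          for line in f:
              g6pd_ccds += line.strip()

      # Translate the sequence codon by codon (three nucleotides at a time).
      g6pd_protein = ''

      i = 0
      while i < len(g6pd_ccds):
          codon = g6pd_ccds[i:i+3]
          g6pd_protein += codon_table[codon]
          i += 3

      # Write the protein sequence to a file.
      with open('G6PD.protein.sequence.txt', 'w') as f:
          f.write(g6pd_protein + '\n')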
      ```
  </details>
</details>

<a name="summary"></a>
## Summary

In this section, we covered:

1. The structure and content of our course, including an overview of the curriculum and the instructional approach.
2. The fundamental concepts of **Python programming**, including variables, built-in data types, conditional statements (`if`), loops (`for` and `while`), function definitions, package imports, and file I/O for reading and writing data.