# Python Tutorial {#python_tutorial}
- This tutorial uses **Python 3** by default.
- **Strongly recommended:** install Anaconda, download this tutorial's notebook file [python_tutorial.ipynb](./python_tutorial.ipynb), then open and run it in Jupyter Notebook.

![life_is_short](assets/life_is_short.png)

## Anaconda and Jupyter

### Anaconda

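* Anaconda makes it easy to install and manage Python packages (with `conda`), and it bundles some very useful tools such as Jupyter Notebook.
* Download it from the [official site](https://www.anaconda.com/download/), or from the TUNA mirror: [traffic-free Anaconda download](https://mirrors.tuna.tsinghua.edu.cn/anaconda/archive/).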

Operating System | Download Link | Notes
--- | --- | ---
Mac | [Anaconda3-5.2.0-MacOSX-x86_64.pkg](https://mirrors.tuna.tsinghua.edu.cn/anaconda/archive/Anaconda3-5.2.0-MacOSX-x86_64.pkg) | 
Linux | [Anaconda3-5.2.0-Linux-x86_64.sh](https://mirrors.tuna.tsinghua.edu.cn/anaconda/archive/Anaconda3-5.2.0-Linux-x86_64.sh) | Note: environment variables need to be set
Windows | [Anaconda3-5.2.0-Windows-x86_64.exe](https://mirrors.tuna.tsinghua.edu.cn/anaconda/archive/Anaconda3-5.2.0-Windows-x86_64.exe) | 

![anaconda_packages](assets/anaconda_packages.png)

Install a Python package with `conda`, using `h5py` as an example:
```bash
conda install h5py
```

Update `h5py` to the latest version with `conda`:
```bash
conda update h5py
```

Installing conda also installs pip, another tool for installing and managing Python packages. Install a Python package with `pip`, again using `h5py` as an example:
```bash
pip install h5py
```

Update `h5py` to the latest version with `pip`:
```bash
pip install --upgrade h5py
```
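
After installing a package with either tool, it can be handy to confirm from Python that it is importable. The following is a minimal sketch using only the standard library; the helper name `is_installed` is made up for this illustration and is not part of conda or pip.

```python
import importlib.util


def is_installed(package_name):
    """Return True if a top-level package can be found on the import path."""
    return importlib.util.find_spec(package_name) is not None


# `json` ships with Python, so this check should succeed everywhere.
print(is_installed("json"))
# Whether this prints True depends on whether h5py was installed above.
print(is_installed("h5py"))
```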

### jupyter notebook
URL: (http://jupyter.org/)

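Jupyter Notebook is a browser-based, highly interactive Python development environment that is widely used in both research and industry. It makes it easy to visualize results and to write and adjust code quickly. **Highly recommended.**

**Opening Jupyter Notebook**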

```bash
jupyter notebook --no-browser
```

Alternatively, open it with the Jupyter application bundled with the desktop version of Anaconda.

**Jupyter notebooks manager**

![jupyter_main](assets/jupyter_main.png)


**Jupyter notebook**

![jupyter_notebook](assets/jupyter_notebook.png)

**Convenient visualization (together with matplotlib, seaborn, etc.)**

![jupyter_matplotlib](assets/jupyter_matplotlib.png)


**Displaying images**

![jupyter_image](assets/jupyter_image.png)

**Displaying DataFrames (together with pandas)**

![jupyter_dataframe](assets/jupyter_dataframe.png)


**Markdown support**

![jupyter_markdown](assets/jupyter_markdown.png)

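#### Basic Jupyter usage
- Save; add, delete, copy, and paste cells; move cells up and down; run and interrupt cells; restart the kernel (this **clears memory**); and switch the cell type.
- Press Shift+Enter to run a cell; press Enter to start a new line.
- The nbextensions plugin collection provides more features. To install it: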

```
pip install jupyter_contrib_nbextensions
jupyter contrib nbextension install --user
```

# Basic 

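## Python syntax conventions
Python is strict about how code is laid out. Indentation is a good example: it is mandatory and must be written with tabs or spaces. A tab or four spaces per level is recommended; PEP 8 prefers four spaces, and tabs and spaces should not be mixed within one file.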
```python
# use a tab
for i in range(3):
    print(i)
# use 2 spaces
for i in range(3):
  print(i)
# use 4 spaces
for i in range(3):
    print(i)
```
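
Any consistent indentation width works within a block, but a line that dedents to a level matching no enclosing block is rejected. The sketch below checks this with the built-in `compile`; the helper name `compiles` and the two source strings are made up for this illustration.

```python
consistent = "for i in range(3):\n    print(i)\n"
# The third line dedents to two spaces, which matches no enclosing block.
inconsistent = "for i in range(3):\n    print(i)\n  print(i)\n"


def compiles(source):
    """Return True if `source` is syntactically valid Python."""
    try:
        compile(source, "<example>", "exec")
        return True
    except IndentationError:
        return False


print(compiles(consistent))    # True
print(compiles(inconsistent))  # False
```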

## Running a Python script from the terminal
Create a Python script named `welcome.py`:
```python
print('welcome to python!')
```

Run it from the same directory:
```bash
python welcome.py    # use python to run welcome.py (no executable bit needed)
```

With a shebang line you do not need to name the Python interpreter on the command line: add `#! /usr/bin/env python` as the first line of the script.
```python
#! /usr/bin/env python
# -*- coding: UTF-8 -*-
print('welcome to python!')
```

The script can now be run directly, without naming the Python interpreter:

```bash
chmod +x welcome.py  #set the python script as executable
./welcome.py
```

## Hello World!

In [5]:
# This is a one line comment
print('Hello World!')

Hello World!


In [7]:
print("The \n makes a new line")
print("The \t is a tab")
print('I\'m going to the movies')

The 
 makes a new line
The 	 is a tab
I'm going to the movies


In [9]:
firstVariable = 'Hello World!'
print(firstVariable)

Hello World!


In [10]:
# Append ? to a method name if you are not sure what it does (see the next cell).
print(firstVariable.lower())
print(firstVariable.upper())
print(firstVariable.title())

hello world!
HELLO WORLD!
Hello World!


In [11]:
# To look up what each method does in jupyter notebook
firstVariable.lower?

In [12]:
# Can also use help
help(firstVariable.lower)

Help on built-in function lower:

lower(...) method of builtins.str instance
    S.lower() -> str
    
    Return a copy of the string S converted to lowercase.



## Simple Math

In [19]:
# Addition, subtraction, division, multiplication, exponentiation, and modulo
print (1+1)
print (130-2.0)
print (126/3)
print (2*3)
print (2**3)
print (10%3)

2
128.0
42.0
6
8
1
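
The output above shows that `/` returns a float even when the result is exact. When an integer result is wanted, floor division is the usual tool; here is a short sketch (the variable names are only illustrative):

```python
# True division always returns a float, even for exact results.
print(126 / 3)    # 42.0

# Floor division rounds down to the nearest integer.
print(7 // 2)     # 3

# divmod returns the floor quotient and the remainder together.
quotient, remainder = divmod(10, 3)
print(quotient, remainder)   # 3 1
```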


## if statement
Comparison Operator | Function
--- | --- 
< | less than
<= | less than or equal to
> | greater than
>= | greater than or equal to
== | equal
!= | not equal

In [39]:
num = 3
if num % 3 == 0:
    print("if statement satisfied")

if statement satisfied


Logical Operator | Description
--- | ---
and | If both the operands are True then condition becomes True.
or | If either of the two operands is True then condition becomes True.
not | Reverses the logical value (not False becomes True, not True becomes False)

In [25]:
# both the conditions are true, so the num will be printed out
num = 3
if num > 0 and num  < 15:
    print(num)

3
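
As a small sketch of all three logical operators side by side, plus Python's chained comparison as a shorter spelling of the `and` test above (the variable names are only illustrative):

```python
num = 3

in_range = num > 0 and num < 15    # both comparisons must hold
chained = 0 < num < 15             # chained comparison, same meaning
outside = num < 0 or num > 15      # True if either comparison holds
not_even = not (num % 2 == 0)      # negation of "num is even"

print(in_range, chained, outside, not_even)   # True True False True
```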


## else and elif

In [27]:
my_num = 5
if my_num % 2 == 0:
    print("Your number is even")
elif my_num % 2 != 0:
    print("Your number is odd")
else: 
    print("Are you sure your number is an integer?")

Your number is odd


## Swap values

In [38]:
a = 1
b = 2
b, a = a, b
print(a, b)

2 1
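
The swap works because the right-hand side is evaluated into a tuple before anything is assigned. The sketch below spells out that intermediate step and shows the same mechanism rotating three values; the name `packed` is only for illustration.

```python
a, b = 1, 2

packed = (a, b)   # the right-hand side is built first: (1, 2)
b, a = packed     # then unpacked into the targets, left to right
print(a, b)       # 2 1

# The same mechanism rotates three values at once.
x, y, z = 1, 2, 3
x, y, z = y, z, x
print(x, y, z)    # 2 3 1
```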


## List
Note that Python indexing **starts at 0**, not 1!

z = | [3, | 7, | 4, | 2]
--- | --- | --- | --- | ---
index | 0 | 1 | 2 | 3

### Accessing Values in List

In [29]:
# Defining a list
z = [3, 7, 4, 2]

In [30]:
# The first element of a list is at index 0
z[0]

3

In [32]:
# Access Last Element of List 
z[-1]

2

### Slicing Lists

In [34]:
# first index is inclusive (before the :) and last (after the :) is not. 
# not including index 2
z[0:2]

[3, 7]

In [35]:
# everything up to index 3
z[:3]

[3, 7, 4]

In [36]:
# index 1 to end of list
z[1:]

[7, 4, 2]
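
Slices also accept negative indices and an optional step. A brief sketch on the same list (the result names are only illustrative):

```python
z = [3, 7, 4, 2]

every_other = z[::2]   # step of 2
last_two = z[-2:]      # a negative start counts from the end
reversed_z = z[::-1]   # a negative step walks backwards

print(every_other)     # [3, 4]
print(last_two)        # [4, 2]
print(reversed_z)      # [2, 4, 7, 3]
print(z)               # [3, 7, 4, 2] (slicing returns new lists)
```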

### Minimum, Maximum, Length, and Sum of a list

In [None]:
print(min(z), max(z), len(z), sum(z))

### Add to the End of List

In [37]:
x = [3, 7, 2, 11, 8, 10, 4]
y = ['Steve', 'Rachel', 'Michael', 'Adam', 'Monica', 'Jessica', 'Lester']
x.append(3)
y.append('James')
print(x)
print (y)

[3, 7, 2, 11, 8, 10, 4, 3]
['Steve', 'Rachel', 'Michael', 'Adam', 'Monica', 'Jessica', 'Lester', 'James']


### list comprehension

In [43]:
#Use for loops
a = []
for i in range(10):
    a.append(i + 10)
print(a)

[10, 11, 12, 13, 14, 15, 16, 17, 18, 19]


In [42]:
#Use list comprehension
a = [i + 10 for i in range(10)]
print(a)

[10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
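
A comprehension can also filter with a trailing `if`. The sketch below keeps only even `i` and shows the equivalent loop for comparison; the variable names are only illustrative.

```python
# List comprehension with a condition: keep even i, then add 10.
evens_plus_ten = [i + 10 for i in range(10) if i % 2 == 0]
print(evens_plus_ten)   # [10, 12, 14, 16, 18]

# Equivalent for loop.
result = []
for i in range(10):
    if i % 2 == 0:
        result.append(i + 10)
print(result == evens_plus_ten)   # True
```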


## Dictionary

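A dictionary is another mutable container type, and it can store objects of any type.

Each `key=>value` pair is written with a colon `:` between the key and the value, pairs are separated by commas `,`, and the whole dictionary is enclosed in curly braces `{}`.

Keys are unique: if a key is repeated, the last key-value pair replaces the earlier ones. Values do not need to be unique.

### Defining a dictionary and getting values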

In [45]:
dict = {'a': 1, 'b': 2, 'b': '3'};
dict['b']

'3'

### Modifying a dictionary

In [48]:
dict = {'Name': 'Zara', 'Age': 7, 'Class': 'First'};
 
dict['Age'] = 8; # update existing entry
dict['School'] = "DPS School"; # Add new entry
 
 
print ("dict['Age']: ", dict['Age'])
print ("dict['School']: ", dict['School'])

dict['Age']:  8
dict['School']:  DPS School
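
A few other everyday dictionary operations, sketched on a similar record. The variable is named `student` here (an illustrative choice) so that the built-in `dict` type is not shadowed.

```python
student = {'Name': 'Zara', 'Age': 7, 'Class': 'First'}

# `in` tests for a key; `get` returns a default instead of raising KeyError.
print('Age' in student)                   # True
print(student.get('School', 'unknown'))   # unknown

# `del` removes an entry.
del student['Class']

# `items()` yields key/value pairs in insertion order (Python 3.7+).
for key, value in student.items():
    print(key, value)
```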


### Dict comprehension

In [50]:
#Use for-loops:
a = {}
for i in range(10):
    a[i] = chr(ord('A') + i) 
print(a)

{0: 'A', 1: 'B', 2: 'C', 3: 'D', 4: 'E', 5: 'F', 6: 'G', 7: 'H', 8: 'I', 9: 'J'}


In [51]:
#Use dict comprehension:
a = {i:chr(ord('A') + i) for i in range(10)}
print(a)

{0: 'A', 1: 'B', 2: 'C', 3: 'D', 4: 'E', 5: 'F', 6: 'G', 7: 'H', 8: 'I', 9: 'J'}


# Scientific computation

## Python packages for scientific computing
![scipy ecosystem](assets/scipy_ecosystem.jpg)

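## Using Python packages
Python developers have published tens of thousands of packages. Taking the built-in package `os` and the matrix computation package `numpy` as examples, packages are imported as follows: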

In [53]:
import os
import numpy as np 

For beginners, we recommend mastering the following Python packages first:
- Numpy
- Scipy
- Pandas
- Matplotlib

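They provide powerful and useful scientific computing functionality and are widely used in matrix computation, statistical modeling, machine learning, and data visualization.

If you are interested in machine learning and deep learning, you can go on to explore the following packages:
- scikit-learn
- Keras/TensorFlow/PyTorch

### Matrix computation: NumPy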
URL: (http://www.numpy.org/)

In [54]:
import numpy as np

In [67]:
# create a zero-filled matrix of shape (5, 4)
X = np.zeros((5, 4), dtype=np.int32)
X

array([[0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0]], dtype=int32)

In [68]:
# create an array of length 5: [0, 1, 2, 3, 4]
y = np.arange(0,5)
y

array([0, 1, 2, 3, 4])

In [69]:
# create an array of length 4: [0, 1, 2, 3]
z = np.arange(4)
z

array([0, 1, 2, 3])

In [70]:
# set Row 1 to [0, 1, 2, 3]
X[0] = np.arange(4)
# set Row 2 to [1, 1, 1, 1]
X[1] = 1
X

array([[0, 1, 2, 3],
       [1, 1, 1, 1],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0]], dtype=int32)

In [71]:
# add 1 to all elements
X += 1
X

array([[1, 2, 3, 4],
       [2, 2, 2, 2],
       [1, 1, 1, 1],
       [1, 1, 1, 1],
       [1, 1, 1, 1]], dtype=int32)

In [72]:
# add y[i] to every element of row i (y is broadcast as a column vector)
X += y.reshape((-1, 1))
X

array([[1, 2, 3, 4],
       [3, 3, 3, 3],
       [3, 3, 3, 3],
       [4, 4, 4, 4],
       [5, 5, 5, 5]], dtype=int32)

In [73]:
# add z[j] to every element of column j (z is broadcast as a row vector)
X += z.reshape((1, -1))
X

array([[1, 3, 5, 7],
       [3, 4, 5, 6],
       [3, 4, 5, 6],
       [4, 5, 6, 7],
       [5, 6, 7, 8]], dtype=int32)
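
The two `reshape` calls above rely on NumPy broadcasting: an axis of length 1 is stretched to match the other operand. The sketch below repeats the idea on a fresh zero matrix so the effect is easy to read; the names `col` and `row` are only illustrative.

```python
import numpy as np

X = np.zeros((5, 4), dtype=np.int32)
y = np.arange(5)   # one value per row
z = np.arange(4)   # one value per column

col = y.reshape((-1, 1))   # shape (5, 1): stretched across the 4 columns
row = z.reshape((1, -1))   # shape (1, 4): stretched down the 5 rows
print(col.shape, row.shape)

X += col   # X[i, j] += y[i]
X += row   # X[i, j] += z[j]
print(X)   # every entry is now i + j
```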

In [75]:
# get row sums (sum over the columns of each row)
row_sums = X.sum(axis=1)
row_sums

array([16, 18, 18, 22, 26])

In [76]:
# get column sums
col_sums = X.sum(axis=0)
col_sums

array([16, 22, 28, 34])

In [78]:
# matrix multiplication
A = X.dot(X.T)
A

array([[ 84,  82,  82,  98, 114],
       [ 82,  86,  86, 104, 122],
       [ 82,  86,  86, 104, 122],
       [ 98, 104, 104, 126, 148],
       [114, 122, 122, 148, 174]], dtype=int32)
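
The matrix product can also be written with the `@` operator, which should not be confused with the elementwise `*`. A small sketch on a 2x3 matrix chosen for readability (not the `X` from the cells above):

```python
import numpy as np

X = np.arange(6).reshape((2, 3))   # [[0, 1, 2], [3, 4, 5]]

A_dot = X.dot(X.T)    # method form, as used above
A_at = X @ X.T        # operator form (Python 3.5+)
elementwise = X * X   # NOT a matrix product: multiplies entry by entry

print(A_at)                          # [[ 5 14] [14 50]]
print(np.array_equal(A_dot, A_at))   # True
print(elementwise)                   # [[ 0  1  4] [ 9 16 25]]
```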

### Numerical analysis (probability distributions, signal analysis, etc.): SciPy
URL: (https://www.scipy.org/)

`scipy.stats` contains a large number of probability distributions:
![scipy_stats](assets/scipy_stats.png)
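
As a minimal sketch of how these distributions are typically used, the example below works with the standard normal distribution from `scipy.stats`. The sample size and seed are arbitrary choices for this illustration.

```python
from scipy import stats

# A "frozen" distribution object with fixed parameters.
standard_normal = stats.norm(loc=0, scale=1)

print(standard_normal.pdf(0))       # density at 0
print(standard_normal.cdf(0))       # P(X <= 0), exactly one half by symmetry
print(standard_normal.ppf(0.975))   # 97.5th percentile, about 1.96

# Draw reproducible random samples.
samples = standard_normal.rvs(size=1000, random_state=0)
print(samples.mean(), samples.std())
```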

### Working with data frames: pandas
URL: (http://pandas.pydata.org/pandas-docs/stable/)

In [80]:
import pandas as pd

In [96]:
# read a bed file
genes = pd.read_table('data/gene.bed', header=None, sep='\t',
                     names=('chrom', 'start', 'end', 'gene_id', 'score', 'strand', 'biotype'))
genes.head(10)

index | chrom | start | end | gene_id | score | strand | biotype
--- | --- | --- | --- | --- | --- | --- | ---
0 | chr10 | 100237155 | 100237302 | ENSG00000212464.1 | . | - | snoRNA
1 | chr10 | 100258570 | 100258677 | ENSG00000207362.1 | . | + | snRNA
2 | chr10 | 100398351 | 100398446 | ENSG00000212325.1 | . | - | misc_RNA
3 | chr10 | 100666694 | 100667009 | ENSG00000274660.1 | . | - | misc_RNA
4 | chr10 | 100907173 | 100907280 | ENSG00000222072.1 | . | - | misc_RNA
5 | chr10 | 100974984 | 100975084 | ENSG00000207551.1 | . | + | miRNA
6 | chr10 | 101307528 | 101307718 | ENSG00000222238.1 | . | + | snRNA
7 | chr10 | 101364844 | 101365035 | ENSG00000222414.1 | . | - | snRNA
8 | chr10 | 101601416 | 101601497 | ENSG00000263436.1 | . | + | miRNA
9 | chr10 | 101601416 | 101601497 | ENSG00000283558.1 | . | - | miRNA


In [87]:
# get all gene IDs
gene_ids = genes['gene_id']

In [95]:
# set gene_id as index
genes.index = genes['gene_id']
genes.head()

gene_id (index) | chrom | start | end | gene_id | score | strand | biotype
--- | --- | --- | --- | --- | --- | --- | ---
ENSG00000212464.1 | chr10 | 100237155 | 100237302 | ENSG00000212464.1 | . | - | snoRNA
ENSG00000207362.1 | chr10 | 100258570 | 100258677 | ENSG00000207362.1 | . | + | snRNA
ENSG00000212325.1 | chr10 | 100398351 | 100398446 | ENSG00000212325.1 | . | - | misc_RNA
ENSG00000274660.1 | chr10 | 100666694 | 100667009 | ENSG00000274660.1 | . | - | misc_RNA
ENSG00000222072.1 | chr10 | 100907173 | 100907280 | ENSG00000222072.1 | . | - | misc_RNA


In [90]:
# get row with given gene_id
gene = genes.loc['ENSG00000212325.1']
gene

chrom                  chr10
start              100398351
end                100398446
gene_id    ENSG00000212325.1
score                      .
strand                     -
biotype             misc_RNA
Name: ENSG00000212325.1, dtype: object

In [93]:
# get rows with biotype = 'protein_coding'
genes_selected = genes[genes['biotype'] == 'protein_coding']
genes_selected.head()

gene_id (index) | chrom | start | end | gene_id | score | strand | biotype
--- | --- | --- | --- | --- | --- | --- | ---
ENSG00000120054.11 | chr10 | 100042192 | 100081877 | ENSG00000120054.11 | . | - | protein_coding
ENSG00000107566.13 | chr10 | 100150093 | 100188334 | ENSG00000107566.13 | . | - | protein_coding
ENSG00000213341.10 | chr10 | 100188297 | 100229619 | ENSG00000213341.10 | . | - | protein_coding
ENSG00000095485.16 | chr10 | 100232297 | 100267680 | ENSG00000095485.16 | . | - | protein_coding
ENSG00000196072.11 | chr10 | 100273279 | 100286712 | ENSG00000196072.11 | . | - | protein_coding


In [97]:
# get protein coding genes in chr1
genes_selected = genes.query('(biotype == "protein_coding") and (chrom == "chr1")')
genes_selected.head()

index | chrom | start | end | gene_id | score | strand | biotype
--- | --- | --- | --- | --- | --- | --- | ---
23888 | chr1 | 153772370 | 153774079 | ENSG00000279767.1 | . | - | protein_coding
23944 | chr1 | 16539065 | 16539575 | ENSG00000268991.2 | . | + | protein_coding
23950 | chr1 | 16673002 | 16673512 | ENSG00000237847.2 | . | + | protein_coding
23952 | chr1 | 16733951 | 16734461 | ENSG00000279132.2 | . | - | protein_coding
24470 | chr1 | 100038096 | 100083377 | ENSG00000156875.13 | . | + | protein_coding
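
The two filtering styles shown above, a boolean mask and `query`, select the same rows. The sketch below demonstrates this on a toy table with made-up values, standing in for `data/gene.bed` so that it runs without the data file.

```python
import pandas as pd

# Toy table with made-up values (an assumption for this illustration).
genes = pd.DataFrame({
    'chrom': ['chr1', 'chr1', 'chr10', 'chr10'],
    'start': [100, 500, 200, 900],
    'end': [250, 800, 260, 1500],
    'gene_id': ['g1', 'g2', 'g3', 'g4'],
    'biotype': ['protein_coding', 'miRNA', 'protein_coding', 'snRNA'],
})

# Boolean mask: combine conditions with & and wrap each one in parentheses.
by_mask = genes[(genes['biotype'] == 'protein_coding') & (genes['chrom'] == 'chr1')]

# query(): the same filter written as an expression string.
by_query = genes.query('(biotype == "protein_coding") and (chrom == "chr1")')

print(by_mask.equals(by_query))       # True
print(by_query['gene_id'].tolist())   # ['g1']
```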


In [98]:
# count genes for each biotype
biotype_counts = genes.groupby('biotype')['gene_id'].count()
biotype_counts

biotype
3prime_overlapping_ncRNA                 31
IG_C_gene                                14
IG_C_pseudogene                           9
IG_D_gene                                37
IG_J_gene                                18
IG_J_pseudogene                           3
IG_V_gene                               144
IG_V_pseudogene                         188
IG_pseudogene                             1
Mt_rRNA                                   2
Mt_tRNA                                  22
TEC                                    1068
TR_C_gene                                 6
TR_D_gene                                 4
TR_J_gene                                79
TR_J_pseudogene                           4
TR_V_gene                               108
TR_V_pseudogene                          30
antisense                              5529
bidirectional_promoter_lncRNA             8
lincRNA                                7520
macro_lncRNA                              1
miRNA                   

In [99]:
# add a column for gene length
genes['length'] = genes['end'] - genes['start']
genes.head()

index | chrom | start | end | gene_id | score | strand | biotype | length
--- | --- | --- | --- | --- | --- | --- | --- | ---
0 | chr10 | 100237155 | 100237302 | ENSG00000212464.1 | . | - | snoRNA | 147
1 | chr10 | 100258570 | 100258677 | ENSG00000207362.1 | . | + | snRNA | 107
2 | chr10 | 100398351 | 100398446 | ENSG00000212325.1 | . | - | misc_RNA | 95
3 | chr10 | 100666694 | 100667009 | ENSG00000274660.1 | . | - | misc_RNA | 315
4 | chr10 | 100907173 | 100907280 | ENSG00000222072.1 | . | - | misc_RNA | 107
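
Derived columns combine naturally with `groupby`. As a sketch, the example below computes a length column and then per-biotype counts and mean lengths on a toy table; all values and the short gene IDs are made up for this illustration.

```python
import pandas as pd

# Toy table with made-up values (an assumption for this illustration).
toy_genes = pd.DataFrame({
    'gene_id': ['g1', 'g2', 'g3', 'g4'],
    'start': [100, 500, 200, 900],
    'end': [250, 800, 260, 1500],
    'biotype': ['protein_coding', 'miRNA', 'protein_coding', 'snRNA'],
})

# Column arithmetic is elementwise, one result per row.
toy_genes['length'] = toy_genes['end'] - toy_genes['start']

# Count genes and average the new column within each biotype.
counts = toy_genes.groupby('biotype')['gene_id'].count()
mean_length = toy_genes.groupby('biotype')['length'].mean()

print(counts)
print(mean_length)
```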


In [101]:
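# save DataFrame to Excel file (length_table is not defined above; assumed here to be these two columns)
length_table = genes[['gene_id', 'length']]
length_table.to_excel('data/length_table.xlsx')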

### Basic graphics and plotting: matplotlib
URL: (https://matplotlib.org/contents.html)

![matplotlib](assets/matplotlib.png)

### Statistical data visualization: seaborn
URL: (https://seaborn.pydata.org/)

![seaborn](assets/seaborn.png)


### Progress bar: tqdm
URL: (https://pypi.python.org/pypi/tqdm)

A handy tool for showing progress and timing; it can be installed with `pip` or `conda`:

```
pip install tqdm
```

In [109]:
from tqdm.notebook import tqdm  # replaces the deprecated `tqdm_notebook` import
from time import sleep

In [110]:
for i in tqdm(range(20)):
    sleep(0.2)




# Further reading
- [Recommendation: Python tutorial by Shibinbin](https://shibinbin.gitbooks.io/bioinfomatics-training-program/content/python_basics.html#install_python)
- [Python Tutorials](https://github.com/mGalarnyk/Python_Tutorials)
- [Liao Xuefeng's Python tutorial (in Chinese)](https://www.liaoxuefeng.com/wiki/0014316089557264a6b348958f449949df42a6d3a2e542c000)

# Homework
- Install Anaconda on your computer, run the code from this tutorial in Jupyter Notebook, and observe the output.