# Lecture 5 Modules

## Python modules
- A python module is a python file that contains function and/or class definitions and executable statements. A folder comprised of multiple python files is called a package, although the two terms are often used interchangeably
- We have already seen the `re` module
- There are many modules in the Python Standard Library; these are preinstalled with python
- There are thousands of third party modules for many different applications
- Python modules can be installed from the Python Package Index, also known as [PyPI](https://pypi.org/)
- Python comes with a package manager called `pip` that can download and install modules from PyPI
- Anaconda comes with many useful third party modules preinstalled
    - The main ones we will use will be `numpy`, `scipy`, `pandas`, `matplotlib`, and `seaborn`
    - All of them are included in Anaconda
- You can use `pip` in the Anaconda prompt
- To install a package like numpy:
```
pip install numpy
```
- To uninstall a package
```
pip uninstall numpy
```
- You can also use the `conda` command to install packages
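
Before installing anything, it can be handy to check whether python can already find a module. The sketch below uses `importlib.util.find_spec` from the standard library; the helper name `is_installed` is made up for this illustration and only handles top-level module names.

```python
from importlib.util import find_spec


def is_installed(module_name):
    """Return True if python can find a top-level module with this name."""
    # find_spec returns None when no module with that name can be located
    return find_spec(module_name) is not None


# 're' is part of the standard library, so it is always available
print(is_installed("re"))
# numpy is only found if it has been installed (it is included in Anaconda)
print(is_installed("numpy"))
```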

### Importing modules
- Python modules need to be imported in your code
    - This will load the module
    - We need to load the modules because the alternative would be for python to pre-load all the modules on start-up which would be slow.
        - It doesn't make sense for python to load modules that you don't need
- We have seen that we need to import the `re` module
- When we import a module we can access its functions and classes using `.` syntax. For example, `re.sub` calls the `sub` function in the `re` module
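
As a small illustration of the `.` syntax, the sketch below imports `re` and calls `re.sub` to strip every character that is not A, T, G, or C. The input string is invented for the example.

```python
import re

# remove anything that is not one of the four DNA bases
cleaned = re.sub('[^ATGC]', '', 'ATG-CC NNA')
print(cleaned)  # ATGCCA
```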

### Importing submodules from modules
- There are also submodules within modules.
- Submodules have their own functions
- For example `matplotlib` has a submodule called `pyplot`
```python
import matplotlib.pyplot
```
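You can also access submodule functions by chaining `.`, as long as the submodule itself has been imported (a plain `import matplotlib` does not reliably load `pyplot`)
```python
import matplotlib.pyplot
# access scatter, which is a function in pyplot, which is a submodule of matplotlib
my_plot = matplotlib.pyplot.scatter(range(10),range(10))
```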
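- Use `*` to import all public names (functions, classes, and variables) from a module so they can be used without the module prefix. This can silently overwrite names that already exist in your code, so use it sparingly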
```python
from re import *
sub('[ATGC]+','sequence','AATGC')
```
- By convention we write our import statements at the beginning of our code

### Using the `from` keyword
- Our code can start getting really long if we chain modules, submodules, and functions
- We can use `from` to load only a single submodule or a single function
- Importing a submodule:
```python
from matplotlib import pyplot
my_plot = pyplot.scatter(range(10),range(10))
```
- Importing a single function
```python
from matplotlib.pyplot import scatter
my_plot = scatter(range(10),range(10))
```

### Importing modules and functions using aliases with `as`
- We can call a module or function anything we want in our code using `as`
- There are some common abbreviations used for various packages
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
```
- Use sensible aliases for modules and functions because your code will become very confusing if you don't!

In [43]:
import re as Pizza
Pizza.sub('[Cc]','G','ATGCCCCC')

'ATGGGGGG'

In [44]:
# Please please please don't write code like this
from re import compile as Hamburger
from re import search as ice_cream
from re import sub as Cookie
stuff = Hamburger(r'[ATGC]+')
thing = ice_cream(stuff,"Sequence is AAATGCATGA")
jawn = Cookie(stuff,'deleted','Sequence is AATGCATGA')
print(jawn)

Sequence is deleted


## Importing from your own python files
- You can import a function from another python file as long as both files are in the same directory (folder)
- The module name is the file name without the extension
- For example, in homework 1 we made a file called "sequence_functions.py". We can import from it like this:
```python
from sequence_functions import has_start_codon
```
or 
```python
import sequence_functions as sf
sf.has_start_codon('ATGAAA')
```
- Remember to do a demonstration
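
The sketch below walks through the idea end to end without touching your real files. It writes a tiny module into a temporary folder, adds that folder to `sys.path` (the list of folders python searches when importing), and then imports it. The file name `demo_sequence_tools.py` and the function `count_bases` are hypothetical and exist only for this illustration.

```python
import sys
import tempfile
from pathlib import Path

# create a throwaway folder that stands in for your project directory
project_dir = Path(tempfile.mkdtemp())

# write a small module file (hypothetical name and function)
module_file = project_dir / "demo_sequence_tools.py"
module_file.write_text(
    "def count_bases(seq):\n"
    "    return len(seq)\n"
)

# python looks for modules in the folders listed in sys.path
sys.path.insert(0, str(project_dir))

# the module name is the file name without the .py extension
import demo_sequence_tools as dst

print(dst.count_bases("ATGAAA"))  # 6
```

When your script and the module file sit in the same folder, the `sys.path` step is normally not needed, because the folder of the script being run is searched automatically.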

## Controlling the execution of your own python files
- When you import a file python will execute all of the code in the file, which you may or may not want it to do
- Python files can have different behavior when they are imported vs. when they are run
- Let's say you only want python to define functions when it is imported and you want it to define functions and execute code when the file is run
- Here is a file called "example.py"
```python
def add2(x, y):
    return x + y


if __name__ == "__main__":
    print(add2(10,5))
    print(add2(2,2))
```
- In the example above, python always defines the function `add2`. The `print` calls are skipped when you import from the file and are executed only when you run the file

```python
# nothing will be printed
from example import add2
s = add2(100,100)
```

- "example.py" will only print if you click the run button in Spyder or you run it in the Anaconda prompt with:

```
python example.py
```

- The `__name__` variable is a built-in variable that is equal to the module name when the file is being imported and it is equal to `"__main__"` when the file is being run. 
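
To see `__name__` in action, the sketch below prints the `__name__` of two standard library modules and uses a small helper to describe how a file is being used. The helper `describe_run_mode` is made up for this illustration.

```python
import json
import collections.abc

# an imported module's __name__ is its module name
print(json.__name__)             # json
print(collections.abc.__name__)  # collections.abc


def describe_run_mode(name):
    """Turn a __name__ value into a short description."""
    if name == "__main__":
        return "run directly"
    return f"imported as {name}"


# what this prints depends on how the current file is being executed
print(describe_run_mode(__name__))
print(describe_run_mode(json.__name__))  # imported as json
```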

## `pathlib` module
- The `os` module is an older module that python uses to interact with your computer's file system
- `pathlib` is a newer, more modern interface that lets you work with the file system
- `pathlib` gives us `Path` objects
- Paths are the location of files and folders on your computer

In [41]:
# load Path into python
from pathlib import Path

# get the location of the current directory where python is working
current_dir = Path('.')

# '.' means the current directory
print(current_dir)

.


### Absolute and relative paths
- Absolute paths start at the root of your drive, e.g. `C:\`
- Relative paths are relative to the current directory
    - To go up one directory use `..` (or `./..`)
- `Path` objects have a method `.absolute()` to get the absolute path

In [42]:
current_dir = Path('.').absolute()
print(current_dir)

c:\Users\rehman\Documents\bioinfo_coding_bootcamp\Day5
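
The difference between the two kinds of path can be checked with `Path.is_absolute()`. The sketch below also uses `Path.resolve()`, which is not covered above: it is a related `pathlib` method that returns an absolute path with any `..` parts collapsed.

```python
from pathlib import Path

# '..' is a relative path: one directory up from wherever python is working
relative = Path("..")
print(relative.is_absolute())  # False

# resolve() turns it into an absolute path and collapses the '..'
absolute = relative.resolve()
print(absolute.is_absolute())  # True
print(absolute)
```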


- Path objects can point to either directories (folders) or files
    - Check if a path points to a folder with `.is_dir()`
    - Check if it is a file with `.is_file()`

In [10]:
current_dir = Path('.').absolute()
current_dir.is_dir()

True

In [11]:
current_dir = Path('.').absolute()
current_dir.is_file()

False

In [12]:
# get the path as a string
current_dir = Path('.').absolute()
str(current_dir)
# If you are running this on Windows, note how the backslashes appear doubled.
# That is how python displays a single backslash inside a string; the path itself is unchanged

'c:\\Users\\rehman\\Documents\\bioinfo_coding_bootcamp\\Day5'

In [13]:
# get the files in your directory
list(current_dir.glob('*'))

[WindowsPath('c:/Users/rehman/Documents/bioinfo_coding_bootcamp/Day5/.ipynb_checkpoints'),
 WindowsPath('c:/Users/rehman/Documents/bioinfo_coding_bootcamp/Day5/examples'),
 WindowsPath('c:/Users/rehman/Documents/bioinfo_coding_bootcamp/Day5/hw3.ipynb'),
 WindowsPath('c:/Users/rehman/Documents/bioinfo_coding_bootcamp/Day5/Lecture5.ipynb')]

In [14]:
# search all subfolders in the directory with rglob
# recursive glob
# this will match any file with a "txt" extension
list(current_dir.rglob('*.txt'))

[]

In this context **recursive** means to search a folder and all subfolders within it. (Also all subfolders in each subfolder, etc.)
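
Since the search above found no text files, here is a self-contained sketch that builds a small folder tree in a temporary directory and compares `glob` with `rglob`. All file and folder names are invented for the example.

```python
import tempfile
from pathlib import Path

# build a small folder tree in a temporary location
base = Path(tempfile.mkdtemp())
(base / "sub" / "deeper").mkdir(parents=True)
(base / "top.txt").write_text("a")
(base / "sub" / "middle.txt").write_text("b")
(base / "sub" / "deeper" / "bottom.txt").write_text("c")

# glob only looks in the folder itself
top_only = sorted(p.name for p in base.glob("*.txt"))
print(top_only)    # ['top.txt']

# rglob also looks in every subfolder, however deeply nested
everything = sorted(p.name for p in base.rglob("*.txt"))
print(everything)  # ['bottom.txt', 'middle.txt', 'top.txt']
```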

### `glob` module
- You can instead use the glob function from the glob module
- `glob` takes a string of the path as input instead of a `Path` object
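- Both `glob` and `Path.glob` use Unix-like pattern expansion
    - `*` is a wildcard which will match any number of characters except for `\`, which separates directories in Windows paths
    - `**` matches any number of nested directories, so it crosses the directory separator (the `glob` function only does this when called with `recursive=True`)
    - `C:\users\rehman\documents\**\*.txt` will match any text files in the documents folder and all of its subfolders

```python
from glob import glob
# glob already returns a list; recursive=True is needed for ** to search subfolders
txt_files = glob(r"C:\users\rehman\documents\**\*.txt", recursive=True)
```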
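
The effect of `recursive=True` in the `glob` function can be seen in a small experiment. The sketch below builds a temporary folder (all names invented) and runs the same `**` pattern with and without the flag.

```python
import tempfile
from glob import glob
from pathlib import Path

# one text file at the top level and one inside a subfolder
base = Path(tempfile.mkdtemp())
(base / "notes").mkdir()
(base / "readme.txt").write_text("x")
(base / "notes" / "day1.txt").write_text("y")

# glob takes the pattern as a string
pattern = str(base / "**" / "*.txt")

# without the flag, ** behaves like a single * (exactly one folder level)
without_flag = glob(pattern)
print(len(without_flag))  # 1

# with the flag, ** matches zero or more nested folders
with_flag = glob(pattern, recursive=True)
print(len(with_flag))     # 2
```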

### Other `Path` object functions
- `Path.mkdir()` will create a new folder/directory
- `Path.exists()` will tell you if the file or folder exists

In [15]:
current_dir.exists()

True
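
`Path.mkdir()` is not demonstrated above, so here is a minimal sketch that works inside a temporary directory. The folder names are invented. `parents=True` creates any missing folders along the way and `exist_ok=True` prevents an error if the folder is already there.

```python
import tempfile
from pathlib import Path

base = Path(tempfile.mkdtemp())
results_dir = base / "results" / "plots"

print(results_dir.exists())  # False

# create 'results' and 'results/plots' in one call
results_dir.mkdir(parents=True, exist_ok=True)
print(results_dir.exists())  # True
print(results_dir.is_dir())  # True

# calling it again does nothing because of exist_ok=True
results_dir.mkdir(parents=True, exist_ok=True)
```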

- `Path.parts` is a tuple of the different "parts" of the path

In [16]:
current_dir.parts

('c:\\', 'Users', 'rehman', 'Documents', 'bioinfo_coding_bootcamp', 'Day5')

- `Path.parent` will give the directory one level above the path
- `Path.parents` will return a sequence of the directories going up to the base of the path

In [18]:
current_dir.parent

WindowsPath('c:/Users/rehman/Documents/bioinfo_coding_bootcamp')

In [21]:
list(current_dir.parents)

[WindowsPath('c:/Users/rehman/Documents/bioinfo_coding_bootcamp'),
 WindowsPath('c:/Users/rehman/Documents'),
 WindowsPath('c:/Users/rehman'),
 WindowsPath('c:/Users'),
 WindowsPath('c:/')]

- `Path.name` will give you the final part of the path whether it is a directory or a file

In [24]:
current_dir.name

'Day5'

- `Path.stem` gives you the file name without the extension
- `Path.suffix` and `Path.suffixes` will give you file suffixes

In [26]:
file = list(current_dir.glob('*.ipynb'))[0]
print(file)
file.suffix

c:\Users\rehman\Documents\bioinfo_coding_bootcamp\Day5\hw3.ipynb


'.ipynb'
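
The differences between `name`, `stem`, `suffix`, and `suffixes` show up most clearly on a file with two extensions. The path below is made up and does not need to exist, since these attributes only look at the text of the path.

```python
from pathlib import Path

reads = Path("data") / "sample1.fastq.gz"

print(reads.name)      # sample1.fastq.gz
print(reads.stem)      # sample1.fastq  (only the last suffix is removed)
print(reads.suffix)    # .gz
print(reads.suffixes)  # ['.fastq', '.gz']
```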

- `Path.joinpath()` allows you to append to the path

In [30]:
new_path = current_dir.joinpath('folder','test.py')
print(new_path)

c:\Users\rehman\Documents\bioinfo_coding_bootcamp\Day5\folder\test.py


In [31]:
new_path.exists()

False

### Deleting files
- `Path.unlink()` will delete a file

```python
my_path = Path(r"c:\users\rehman\documents\test.py")
# to delete the file use unlink
my_path.unlink()
```
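
Deleting with `unlink()` is permanent (the file does not go to the recycle bin), and it raises an error if the file is missing. The sketch below shows a cautious pattern on a temporary file with an invented name. The `missing_ok` argument requires Python 3.8 or newer.

```python
import tempfile
from pathlib import Path

# make a temporary file so there is something safe to delete
scratch = Path(tempfile.mkdtemp()) / "scratch.txt"
scratch.write_text("temporary")

# check that the path really is a file before deleting it
if scratch.is_file():
    scratch.unlink()
print(scratch.exists())  # False

# missing_ok=True avoids an error when the file is already gone
scratch.unlink(missing_ok=True)
```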

### There are more `Path` functions; refer to the official [documentation](https://docs.python.org/3/library/pathlib.html)

## `shutil` module
- Python module for file operations
- We will cover 3 useful shutil functions
    - There are many more functions: see the [documentation](https://docs.python.org/3/library/shutil.html)
- `shutil` functions take either strings or `Path` objects as inputs
- `shutil.copy` copies a file

```python
import shutil as sh
# copies my_file.py to a new location (the new_directory folder must already exist)
sh.copy("my_file.py","new_directory/my_file.py")
```

- `shutil.copytree` is like `shutil.copy` except it copies a whole directory, including everything inside it
- `shutil.rmtree` deletes a directory and everything inside it
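
A minimal sketch of `copytree` and `rmtree` working together, run entirely inside a temporary directory so nothing important can be deleted. The folder and file names are invented for the example.

```python
import shutil
import tempfile
from pathlib import Path

# set up a small folder with one file in it
base = Path(tempfile.mkdtemp())
source = base / "project"
source.mkdir()
(source / "notes.txt").write_text("hello")

# copy the whole folder; the destination must not exist yet
backup = base / "project_backup"
shutil.copytree(source, backup)
print((backup / "notes.txt").read_text())  # hello

# delete the original folder and everything inside it
shutil.rmtree(source)
print(source.exists())  # False
print(backup.exists())  # True
```

Like `Path.unlink()`, `shutil.rmtree` deletes permanently, so it is worth printing the path and double-checking it before running the call on real data.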