# Lecture 7: File I/O, Functions, and Generators

- Reminder about reading
- Reminder about pathlib, glob, and shutil

## Working with text files
- Reading and writing text files is a common programming task
    - Reading a file into a Python data structure is called parsing the file
- Text files can have any extension, not just ".txt"
    - ".csv" is commonly used for comma separated files
- We can store tables of data in text files by separating values with a special character called a delimiter
- The most common delimiters are tabs or commas
    - By convention, tab-delimited files have a '.txt' extension and comma-delimited files have a '.csv' extension
- You can access or create a file with `open`
- `open` will return an object which is a connection to the file
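- `open` is usually called with 2 arguments
    - the file name
    - the mode
        - 'w' create a file, overwrite the file if it already exists
        - 'r' read-only (the default if no mode is given)
        - 'a' append, meaning only write to the end of the file (the file is created if it does not exist)
- Connections must be closed after you are done reading or writing the file; otherwise data you wrote may not be saved to disk and the file stays open for the rest of your program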

```python
f = open('a_file.txt','w')
for i in range(10):
    f.write(f"{i}\n")
f.close()
```

- The above code is not the preferred way to use `open`. It is very easy to forget to close the file, and the file is also left open if an error occurs before `f.close()` is reached, so avoid writing code like you see above
- Use a `with` statement instead
    - `with` uses the file object as a context manager and will automatically close the connection

```python
with open('a_file.txt','w') as f:
    for i in range(10):
        f.write(f"{i}\n")
```
- `with` will automatically close the file after the code under it has been executed
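One way to convince yourself that `with` really closes the file is to check the `closed` attribute of the file object. The sketch below is only an illustration: it writes into a temporary directory created with `tempfile` (my choice, so that it does not touch your own files), and then uses mode 'a' to add a second line to the same file.

```python
import tempfile
from pathlib import Path

# work in a throwaway directory so the example does not touch your own files
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'a_file.txt'

    with open(path, 'w') as f:
        f.write("first line\n")
        inside = f.closed   # False: the file is still open inside the block
    outside = f.closed      # True: `with` closed it when the block ended

    # 'a' appends to the end of the file instead of overwriting it
    with open(path, 'a') as f:
        f.write("second line\n")

    with open(path, 'r') as f:
        contents = f.read()

print(inside, outside)
print(contents)
```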

# **Always use `with` when you open a file**

## File object methods
- `f.read()` reads the entire contents of a file into a single string
- `f.readline()` reads a single line from the file; each call returns the next line, starting from the beginning of the file
- `f.readlines()` reads all remaining lines into a list, with each line as an element of the list
- `f.write()` writes a string at the current position in the file (the start of a file opened with 'w', the end of a file opened with 'a')
- `f.writelines()` writes a list of strings to the file; it does not add newline characters, so each string must already end with "\n" if you want it on its own line
- The reading methods return strings (or a list of strings for `readlines`), so if you are trying to read numbers from a file you will have to convert them to `int` or `float`
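Here is a small sketch of these methods working together. The file name `methods_example.txt` and the temporary directory are made up for the example. Note that the newline characters are written explicitly, because `writelines` does not add them.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'methods_example.txt'

    with open(path, 'w') as f:
        # writelines does not add newlines, so they are included explicitly
        f.writelines(["1\n", "2\n", "3\n"])

    with open(path, 'r') as f:
        first = f.readline()    # the first line, including its newline
        rest = f.readlines()    # continues from where readline stopped

    with open(path, 'r') as f:
        everything = f.read()   # the whole file as one string

# values read from a file are strings, so convert before doing arithmetic
numbers = [int(line) for line in everything.splitlines()]
print(repr(first), rest, numbers)
```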

### reading a file into a list
```python
with open("file_example.txt",'r') as file:
    # read all the lines in the file
    lines = file.readlines()
# every element of the list lines will be a string ending with a newline character

# remove the newline and split each line by tab
lines = [line.rstrip('\n').split('\t') for line in lines]
# lines is now a list of lists, with each inner list holding the values from one row of the tab-delimited file
```

### reading a file into a list, but skipping the first row which is a header
```python
with open("file_example.txt",'r') as file:
    # access the first line
    header = file.readline()
    # read all lines in the file starting from the second line
    lines = file.readlines()
# every element of the list lines will be a string ending with a newline character

# remove the newline and split each line by tab
lines = [line.rstrip('\n').split('\t') for line in lines]
# lines is now a list of lists, with each inner list holding the values from one row of the tab-delimited file
```

### writing a comma-separated file
```python
with open('numbers.csv','w') as file:
    for i in range(0,100,10):
        file.write(','.join([str(j) for j in range(i,i+10)]) + '\n')
```
- If we try to write strings that do not end with a new line "\n" character the `write` function will continue to append to the same line
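To close the loop, the sketch below writes the same table and then parses it back into a list of lists of `int`. It assumes every field in the file is a whole number. It also loops over the file object directly (`for line in file`), which reads one line at a time, and it uses a temporary directory only so the example cleans up after itself.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'numbers.csv'

    # write 10 rows of 10 comma-separated integers
    with open(path, 'w') as file:
        for i in range(0, 100, 10):
            file.write(','.join(str(j) for j in range(i, i + 10)) + '\n')

    # read the file back one row at a time and convert each value to int
    with open(path, 'r') as file:
        rows = [[int(value) for value in line.rstrip('\n').split(',')]
                for line in file]

print(rows[0])
print(rows[-1])
```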

## Demonstration
- A FASTA file is a specially formatted text file for storing nucleotide and protein sequences
- Read a FASTA sequence and write its complementary sequence
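The in-class demonstration is not reproduced in these notes, so what follows is only a rough sketch of one way it could be done. It assumes DNA sequences written with uppercase A, T, G, and C, that header lines start with `>`, and that we want the complement (not the reverse complement). The function name `complement_fasta` and the file names are hypothetical.

```python
import tempfile
from pathlib import Path

# base-pairing rules for DNA
COMPLEMENT = {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G'}

def complement_fasta(in_path, out_path):
    """Write the complement of every sequence line in a FASTA file."""
    with open(in_path, 'r') as fin, open(out_path, 'w') as fout:
        for line in fin:
            line = line.rstrip('\n')
            if line.startswith('>'):
                # header lines are copied unchanged
                fout.write(line + '\n')
            else:
                fout.write(''.join(COMPLEMENT[base] for base in line) + '\n')

with tempfile.TemporaryDirectory() as tmp:
    fasta = Path(tmp) / 'example.fasta'
    out = Path(tmp) / 'complement.fasta'
    # a tiny made-up FASTA file to test with
    fasta.write_text(">seq1 made-up example\nATGC\nGGCA\n")
    complement_fasta(fasta, out)
    result = out.read_text()

print(result)
```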

## Function default values
- We define functions with `def`
    ```python
    def power(x,n=2):
        return x ** n
    ```
- In the above example we define a function with 2 parameters `x` and `n`
- The default value of `n` is 2
    - since a default value is specified providing `n` when the function is called is optional

In [1]:
def power(x,n=2):
    return x ** n

# I can call the function providing only x, and the default value of n will be used
power(3)

9

In [2]:
def power(x,n=2):
    return x ** n

# I can also provide a different value for n
power(x=2,n=3)

8

## Function calls
- I can call the function `power` in multiple ways
```python
power(2,3)
power(2,n=3)
power(x=2,n=3)
power(n=3,x=2)
```
- When an argument is passed with its name, as in `n=3`, it is called a keyword argument or named argument; an argument passed without a name is a positional argument
- The only restriction when calling a function is that you cannot pass a positional argument after a keyword argument
- This will cause an error:
```python
power(x=2,3)
```
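The sketch below checks that the four call styles agree, and shows what kind of error the bad call produces. Because `power(x=2,3)` is invalid syntax, Python rejects it before running anything; the built-in `compile` is used here only so that the error can be caught and printed instead of stopping the script.

```python
def power(x, n=2):
    return x ** n

# all four call styles give the same result
results = [power(2, 3), power(2, n=3), power(x=2, n=3), power(n=3, x=2)]
print(results)

# compile() parses the text as Python code, which triggers the SyntaxError
try:
    compile("power(x=2, 3)", "<example>", "eval")
except SyntaxError as err:
    message = err.msg
    print("SyntaxError:", message)
```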

## Docstrings
- You can write a little help snippet for your function using a triple-quoted string right under the `def` statement line
- You can see any function's docstring in Jupyter or the IPython shell by adding a ? to the name e.g. `print?`
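The `?` syntax only works in Jupyter and IPython. In a plain Python script or shell you can get the same information from the built-in `help` function or from the function's `__doc__` attribute. A minimal sketch, using the `power` function from earlier with a docstring I added for illustration:

```python
def power(x, n=2):
    """Return x raised to the power n (n defaults to 2)."""
    return x ** n

# the raw docstring is stored on the function object
doc = power.__doc__
print(doc)

# help() prints the signature and the docstring
help(power)
```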

In [3]:
print?

Docstring:
print(value, ..., sep=' ', end='\n', file=sys.stdout, flush=False)

Prints the values to a stream, or to sys.stdout by default.
Optional keyword arguments:
file:  a file-like object (stream); defaults to the current sys.stdout.
sep:   string inserted between values, default a space.
end:   string appended after the last value, default a newline.
flush: whether to forcibly flush the stream.
Type:      builtin_function_or_method


In [4]:
def add_one(n):
    """Add 1 to a value and return the result
    This is the second line of the docstring"""
    return n + 1

In [5]:
add_one?

Signature: add_one(n)
Docstring:
Add 1 to a value and return the result
This is the second line of the docstring
File:      c:\users\rehman\appdata\local\temp\ipykernel_22408\1919225670.py
Type:      function


## lambda functions
- You can define a function in one line with the `lambda` keyword
```python
power = lambda x, n=2: x ** n
```
- This is equivalent to the power function above

In [6]:
p = lambda x, n=2: x ** n
p(2)

4

In [7]:
p(3,3)

27

- A `lambda` function is defined with the `lambda` keyword, followed by the input variables, followed by a colon, followed by a single Python expression whose value is returned
- Ideally `lambda` functions should be used only for simple expressions
- If you write a complicated `lambda` function, it is probably better to rewrite it as a normal function using the `def` keyword

## `map`
- So far we have seen we can use `for` loops and list comprehensions to apply a function to a collection of values such as a `list`, `tuple`, or `str` object
    - Objects we can loop over like this are called iterables in Python
    - `range` objects are also iterables
    
```python
# get the squared value of the numbers 0-9
squares = [power(i) for i in range(10)]
```

Another way

```python
squares = []
for i in range(10):
    squares.append(power(i))
```
- Another option is to use the `map` function

```python
squares = map(power,range(10))
```

- `map` applies a function (the first argument) to every value in an iterable (the second argument)

In [10]:
map(power,range(10))

<map at 0x2a7707239a0>

- `map` returns a map object, which is a lazily evaluated iterable like `range` or the output of `re.finditer`
    - a `map` object is an iterator, which behaves like a special kind of iterable called a ***generator***
- We have to convert it to a `list` to see what the values inside are

In [11]:
list(map(power,range(10)))

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

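- When `map` is given one iterable, it calls the function with a single argument, so the other parameters of `power` keep their default values
- However, there is a workaround if we want a different value for `n`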

In [13]:
from functools import partial
# create a new function with a default value for n equal to 3
partial_fun = partial(power,n=3)
list(map(partial_fun,range(10)))

[0, 1, 8, 27, 64, 125, 216, 343, 512, 729]
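Another option, not used in the cell above, is to give `map` more than one iterable. It then takes one value from each iterable per call and stops when the shortest one runs out. In this sketch `itertools.repeat(3)` supplies the same exponent over and over; that is my choice for the example, and a list of exponents works just as well.

```python
from itertools import repeat

def power(x, n=2):
    return x ** n

# map pairs up values from each iterable: power(0, 3), power(1, 3), ...
cubes = list(map(power, range(10), repeat(3)))
print(cubes)

# the exponents do not have to be the same for every value
mixed = list(map(power, [2, 3, 4], [1, 2, 3]))
print(mixed)
```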

- We can combine `lambda` functions with `map` to get very compact code

In [14]:
squares = map(lambda x: x**2,range(25))
print(list(squares))

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169, 196, 225, 256, 289, 324, 361, 400, 441, 484, 529, 576]


## `filter`
- `filter` allows us to get only the values in an iterable that meet a certain criterion
- It works like `map` except the function must return `True` or `False`
- The output is only the values in the iterable where the function evaluates to `True`

In [15]:
# get all multiples of 3 from 0 to 99
mult3 = filter(lambda x: x % 3 == 0, range(100))
print(list(mult3))

[0, 3, 6, 9, 12, 15, 18, 21, 24, 27, 30, 33, 36, 39, 42, 45, 48, 51, 54, 57, 60, 63, 66, 69, 72, 75, 78, 81, 84, 87, 90, 93, 96, 99]
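For comparison, the same result can be written as a list comprehension with an `if` clause. The sketch below checks that both versions agree; the name `is_mult3` is just for illustration.

```python
is_mult3 = lambda x: x % 3 == 0

with_filter = list(filter(is_mult3, range(100)))
with_comprehension = [x for x in range(100) if x % 3 == 0]

print(with_filter == with_comprehension)
```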


## MapReduce
- MapReduce is a programming paradigm often used in Big Data
- Basically we split a large dataset into a lot of small pieces and `map` a function to each piece
- Then we combine the pieces into a single object with `reduce`
- We must import `reduce` from the `functools` module
- For example, we could get the product of the squares of the numbers from 1 to *n*
$$
C = \prod_{i=1}^{n}i^2
$$

In [16]:
# Map Reduce example
# get the cumulative product of all numbers 1 - 5 squared

from functools import reduce
# square every number
squares = map(lambda x: x**2,range(1,6))
# get the product of every number in squares
product = lambda x, y: x * y
cumprod = reduce(product,squares)
print(cumprod)

14400


- `reduce` requires a function that takes 2 input variables and returns 1 output variable
- `reduce` applies the function to the first 2 values in the iterable
    - Then it takes the output of that and applies the function to the previous output and the next value in the iterable
- In the above example the function is `lambda x,y: x * y` which returns the product of 2 numbers
- When `reduce` is called it works like this:
1. `product(1 ** 2, 2 ** 2)`, which equals 4
2. `product(4, 3 ** 2)`, which equals 36
3. `product(36, 4 ** 2)`, which equals 576
4. `product(576, 5 ** 2)`, which equals 14400
5. We have reached the end of the iterable, so the final result is 14400
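The steps above can be written out as an ordinary loop. `my_reduce` below is a simplified stand-in written for this example, not the real implementation: it only handles non-empty lists, while `functools.reduce` accepts any iterable and an optional starting value.

```python
from functools import reduce

def my_reduce(function, values):
    """Simplified version of reduce for a non-empty list (illustration only)."""
    result = values[0]                    # start with the first value
    for value in values[1:]:
        # combine the running result with the next value
        result = function(result, value)
    return result

product = lambda x, y: x * y
squares = [x ** 2 for x in range(1, 6)]

mine = my_reduce(product, squares)
real = reduce(product, squares)
print(mine, real)
```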


## Generators
- We have been dancing around the concept of generators
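- `map`, `filter`, and `re.finditer` all return lazy iterators that behave like generators (`re.findall` returns an ordinary list, and `range` is lazy but can be iterated over more than once)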
- generators are iterables that are evaluated in a *lazy* way
- *Lazy* means that the values in a generator are not processed by your computer until that specific value is requested.
- Generators are incredibly useful when you are using a really large dataset that cannot fit into your computer memory, or you are doing computations that take a long time.
- We have seen that the `list` function will evaluate a generator and return a list

In [18]:
# print the numbers 0-9 squared
my_map = map(lambda x: x ** 2, range(10))
print(my_map)
for value in my_map:
    print(value)

<map object at 0x000002A770746910>
0
1
4
9
16
25
36
49
64
81


- We can use the generator returned by the map in our code and it will not be evaluated until it is required

In [19]:
def square(n):
    print(f'Calculating the square of {n}')
    return n ** 2

sq = map(square,range(10))
# the square function has not been executed yet

In [20]:
def plus_one(x):
    print(f'Adding one to {x}')
    return x + 1
sq_plus_1 = map(plus_one,sq)
# the square function and the plus_one function have not been executed yet
# note how nothing has been printed

In [21]:
# now we will iterate through sq_plus_1
[i for i in sq_plus_1]

Calculating the square of 0
Adding one to 0
Calculating the square of 1
Adding one to 1
Calculating the square of 2
Adding one to 4
Calculating the square of 3
Adding one to 9
Calculating the square of 4
Adding one to 16
Calculating the square of 5
Adding one to 25
Calculating the square of 6
Adding one to 36
Calculating the square of 7
Adding one to 49
Calculating the square of 8
Adding one to 64
Calculating the square of 9
Adding one to 81


[1, 2, 5, 10, 17, 26, 37, 50, 65, 82]

- Only when we try to access the values in a generator is a generator evaluated
- All values are accessed when we convert a generator to a list
- When we access a value in a generator it is consumed, so a generator can only be iterated over once

In [22]:
# sq_plus_1 is now empty
list(sq_plus_1)

[]
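Here is the same behaviour in a smaller sketch. The `map` object can only be consumed once, but the `range` object it was built from can be looped over again, so we can always build a fresh `map` from it.

```python
numbers = range(5)
squares = map(lambda x: x ** 2, numbers)

first_pass = list(squares)    # consumes every value
second_pass = list(squares)   # nothing is left

# the range object itself can be iterated again
again = list(map(lambda x: x ** 2, numbers))
print(first_pass, second_pass, again)
```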

- Most of the time you will not have to create your own generators, but I am going to show you how to do it for the sake of completeness
- You can define a generator by defining a function that uses the `yield` keyword instead of `return`

In [23]:
# create a generator
def square_to_n(n = 10):
    for i in range(n):
        yield i ** 2
    return

gen = square_to_n(10)
for i in gen:
    print(i)
print(list(gen))

0
1
4
9
16
25
36
49
64
81
[]


- If you are writing code that takes a long time to execute it is often a good idea to write it as a generator so that the computation is only performed when required

In [24]:
# simulating slow code
from time import sleep

def square(x):
    # wait 1 second
    sleep(1)
    return x ** 2

def square_to_n(n):
    for i in range(n):
        yield square(i)
    return

sq = square_to_n(5)

In [25]:
%%time
sq_plus_1 = map(lambda x: x + 1, sq)

CPU times: total: 0 ns
Wall time: 0 ns


In [26]:
%%time
[i for i in sq_plus_1]

CPU times: total: 0 ns
Wall time: 5.04 s


[1, 2, 5, 10, 17]

- The 5 seconds are only spent when the values in the generator are requested
- `sq_plus_1` is created by using `map` to apply a function to another generator
- Both functions are only evaluated when the values in `sq_plus_1` are requested

### `next` function
- You can access the "next" value in a generator with `next`

In [27]:
def square_to_n(n):
    for i in range(n):
        yield i ** 2
    return
gen = square_to_n(5)
print(next(gen))
print(next(gen))
print(next(gen))

0
1
4


- Each time `next` is called the next value is returned
- You will get a `StopIteration` error if the generator is empty and you call `next`
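A short sketch of what happens when a generator runs out. The error raised is `StopIteration`, and `next` also accepts an optional second argument that is returned instead of raising the error.

```python
def square_to_n(n):
    for i in range(n):
        yield i ** 2

gen = square_to_n(2)
values = [next(gen), next(gen)]   # the only two values in the generator

exhausted = False
try:
    next(gen)
except StopIteration:
    exhausted = True
    print("the generator is empty")

# the second argument is returned when there is nothing left
fallback = next(gen, None)
print(values, fallback)
```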

In [28]:
# the first three values from the generator are gone
list(gen)

[9, 16]

In [30]:
# in general it's better to use a for loop than next
def square_to_n(n):
    for i in range(n):
        yield i ** 2
    return
gen = square_to_n(5)
for i in gen:
    print(i)

0
1
4
9
16


- You won't need to write any generators for this class but you should understand how they work

In [31]:
# remember my example of generating a chromosome
from random import choice

def get_random_sequence(n):
    """Returns a generator of random nucleotides with n elements"""
    for i in range(n):
        yield choice('ATGC')
    return

''.join(list(get_random_sequence(9)))

'CACAAAGCG'

In [32]:
%%time
# now I can use a generator to make a chromosome without it taking a long time
chromosome = get_random_sequence(int(1e8))

CPU times: total: 0 ns
Wall time: 0 ns


- I have a generator for a random chromosome with 100 million base pairs, and creating it took essentially no time because nothing has been generated yet
- It won't take any time until the generator is evaluated
- Without a generator 100 million base pairs would take up about 400 MB of RAM, but a generator uses only minimal memory unless the entire generator is evaluated and assigned to a variable
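You can get a rough feel for the difference with `sys.getsizeof`. This is only a sketch: it uses a much smaller `n` so it runs quickly, and `getsizeof` reports the size of the list object itself, not of the strings stored inside it. The exact numbers depend on your Python version.

```python
import sys
from random import choice

def get_random_sequence(n):
    """Returns a generator of random nucleotides with n elements"""
    for i in range(n):
        yield choice('ATGC')

n = 100_000   # much smaller than a chromosome so the example runs quickly
lazy = get_random_sequence(n)           # nothing generated yet
eager = list(get_random_sequence(n))    # every nucleotide generated and stored

lazy_size = sys.getsizeof(lazy)    # size of the generator object in bytes
eager_size = sys.getsizeof(eager)  # size of the list object in bytes
print(lazy_size, eager_size)
```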

In [33]:
def get_complement(n):
    binding = {'A':'T','G':'C','T':'A','C':'G'}
    return binding[n]

In [34]:
%%time
seq = ''
for i in range(100000):
    seq += get_complement(next(chromosome))

# only going to print the first 100 complementary nucleotides out of 100,000 for my sanity
print(seq[:100])

ATTAACATTGAGGTATCTTACCATGGCCGTTTCCAAGGTTAAGGCTTAAACTATACAGAGAGGACGTTTTTATCCCTATGTAATATTGGTACTGGTATGT
CPU times: total: 125 ms
Wall time: 131 ms
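An alternative to calling `next` in a loop is `itertools.islice`, which takes the next few values from a generator without evaluating the rest. This is a sketch of the same idea, not a replacement for the cell above; it uses `''.join` to build the string in one step, and a dictionary lookup in place of `get_complement`.

```python
from itertools import islice
from random import choice

def get_random_sequence(n):
    """Returns a generator of random nucleotides with n elements"""
    for i in range(n):
        yield choice('ATGC')

binding = {'A': 'T', 'G': 'C', 'T': 'A', 'C': 'G'}
chromosome = get_random_sequence(int(1e8))

# islice takes the next 1000 values; the rest of the chromosome is never generated
first_1000 = list(islice(chromosome, 1000))
complement = ''.join(binding[base] for base in first_1000)
print(complement[:50])
```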


In [35]:
numbers = list(range(10))

def gen():
    for i in numbers:
        yield i
    return

x = gen()
for j in x:
    print(j)

0
1
2
3
4
5
6
7
8
9
