# 2 - Working with Colab

<b>Summary</b>:
> * Reading and writing files in Colab
> * Working on your own (mini) project


For more details see:
- https://neptune.ai/blog/google-colab-dealing-with-files

### Reading and writing files in Colab

If you have a link to a plain-text file, you can load it directly into Colab. For example, we'll use the `requests` library to download the Iris dataset.

In [None]:
import requests


result = requests.get('https://raw.githubusercontent.com/venky14/Machine-Learning-with-Iris-Dataset/master/Iris.csv')
data_text = result.text
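
The downloaded `data_text` is one long string. A minimal sketch of turning such text into rows with the standard-library `csv` module is shown here. The short `sample_text` string is a made-up stand-in for `data_text` (its column names and values are illustrative assumptions, not the actual file contents), so the cell runs without a network connection.

```python
import csv
import io

# Made-up stand-in for `data_text`; the real file's columns may differ.
sample_text = "Id,Length,Species\n1,5.1,setosa\n2,4.9,setosa\n"

# io.StringIO lets csv.reader treat a string like an open file.
reader = csv.reader(io.StringIO(sample_text))
header = next(reader)  # the first row holds the column names
rows = list(reader)    # the remaining rows, each a list of strings

print(header)
print(len(rows), "rows")
```

Every field comes back as a string, so numeric columns still need an explicit `float(...)` conversion.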

More often, you'll need to upload data to Colab yourself. You can do that from the file browser in the left sidebar of Colab: create a folder called `data` and drag and drop `data.csv` into it. Now you can read the file.

In [None]:
with open("./data/data.csv") as f:
  data = f.read()

In [None]:
import numpy as np
a = np.arange(10)
print(a)
print(a.sum())
print(np.sum(a))

[0 1 2 3 4 5 6 7 8 9]
45
45


To analyze data, you first have to read it into your program. Once you've performed your calculations, you may want to save the results for future use. Here we will read in a data file containing observations from a Gaussian distribution of unknown mean and variance, and then calculate the mean and variance.

In [None]:
# open the file and read the header (the first line)
data = open("./data/data.tsv")
header = data.readline()
print(header)

data_clean = []
for line in data:
    # remove whitespace (including the trailing newline) from both ends of the line
    line = line.strip()
    # split the line on the delimiter. If you don't know what the delimiter is, check the data
    line = line.split('\t')
    # turn the observation (second column) into a float
    line_data = float(line[1])
    data_clean.append(line_data)

# close the file now that all lines have been read
data.close()


mean_data = sum(data_clean)/len(data_clean)
variance_data = sum([(data_i - mean_data)**2 for data_i in data_clean])/len(data_clean)

print("Mean =", mean_data)
print("Variance =", variance_data)

Observation number	Observation

Mean = 3.042112833945
Variance = 1.0535919965442606
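
The hand-written formulas can be cross-checked against the standard-library `statistics` module. The sketch below uses a made-up list, `sample`, in place of `data_clean`. Note that `statistics.pvariance` divides by $n$, like the formula above, whereas `statistics.variance` divides by $n-1$.

```python
import statistics

# Made-up observations standing in for `data_clean`.
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# Same formulas as in the cell above.
mean_manual = sum(sample) / len(sample)
var_manual = sum((x - mean_manual) ** 2 for x in sample) / len(sample)

# Library versions: pvariance is the population variance (divides by n).
mean_lib = statistics.mean(sample)
var_lib = statistics.pvariance(sample)

print(mean_manual, mean_lib)
print(var_manual, var_lib)
```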


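Now let's save the results to a file for future use. Here the argument ```'w'``` means "write" (an existing file at that path is overwritten), and ```%``` is the string-formatting operator, which inserts the values into the format string: ```%f``` formats a float with six decimal places, and ```%s``` uses the value's string form.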

In [None]:
out_path = "./data/processed_data.tsv"

file_out = open(out_path, 'w')

# write the header
file_out.write('Mean\tVariance\n')
# write the data
file_out.write('%f\t%s\n' % (mean_data, variance_data))


file_out.close()
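
To see what the two format codes do, here is a small sketch that formats the mean and variance printed earlier without writing any file. The variable names `mean_value` and `variance_value` are placeholders introduced for this illustration.

```python
# Values copied from the output printed earlier in this notebook.
mean_value = 3.042112833945
variance_value = 1.0535919965442606

# %f rounds to six decimal places; %s uses str(), which keeps every digit.
as_f = "%f" % mean_value
as_s = "%s" % variance_value
line = "%f\t%s\n" % (mean_value, variance_value)

print(as_f)
print(as_s)
print(repr(line))  # repr makes the tab and newline characters visible
```

Because ```%f``` rounds, using ```%s``` (or a wider format such as ```%.12f```) is the safer choice when the saved number will be reused in later calculations.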

Alternatively, you could open the file with a ```with``` statement, and Python will automatically close the file as soon as the block ends, even if an error occurs inside it. If you loop over the lines yourself, you will still need to add a statement that skips the header in your file.

In [None]:
# path to the input file (assumed here to be the file read above)
in_path = "./data/data.tsv"

with open(in_path, 'r') as file_:
    read_data = file_.read()

# We can check that the file has been automatically closed.
file_.closed
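
One way to skip the header inside a ```with``` block is to call `next()` on the file object before looping. The sketch below first writes a tiny tab-separated file to a temporary folder so that it runs on its own. The file name `demo.tsv` and its two observations are made up for illustration.

```python
import os
import tempfile

# Build a tiny tab-separated file so the sketch is self-contained.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.tsv")
with open(demo_path, "w") as f:
    f.write("Observation number\tObservation\n1\t2.5\n2\t3.5\n")

observations = []
with open(demo_path) as f:
    next(f)  # consume the header line so the loop starts at the data
    for line in f:
        fields = line.strip().split("\t")
        observations.append(float(fields[1]))

print(observations)
print(f.closed)  # the file is closed once the with block ends
```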

### Mini-project

Often you will need to write your own code to answer your scientific questions. Python libraries are helpful, but they do not contain every conceivable function. Furthermore, it's useful to have some idea of what a function in a library is doing. Being able to describe what is going on "under the hood" can help you understand whether or not a given function is suitable for your project.

An effective way to build this ability is through learning-by-doing, so today you will select one of four small projects and code solutions in *pure Python*.





1. Say you have a population of $k$ pairs of rabbits, free to reproduce without limit. In this toy model, rabbits are able to mate at the age of one month so that at the end of its second month a female can produce another pair of rabbits. Suppose that our rabbits never die and that the female always produces one new pair (one male, one female) every month *from the second month on*.
After $n$ generations, how many rabbits will you have? Given the framing of the question, you do not need a model of population dynamics to provide an answer. You can simply use the Fibonacci sequence

$F_{n} = \left\{\begin{matrix}
F_{n-1} + F_{n-2}, &  n > 1 \\
1, & n = 1 \\
0, &  n = 0 \\
\end{matrix}\right.$

Set $n=35$ and $k=5$. (Note that the Fibonacci sequence gives the number of rabbits for one pair!)
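
As a reference for the pen-and-paper check, unrolling the recurrence above for the first few values of $n$ (starting from a single pair) gives:

$F_{2} = F_{1} + F_{0} = 1, \quad F_{3} = F_{2} + F_{1} = 2, \quad F_{4} = F_{3} + F_{2} = 3, \quad F_{5} = F_{4} + F_{3} = 5, \quad F_{6} = F_{5} + F_{4} = 8$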


In [None]:
"""
Hint. Construct a function that has n,k as input, and contains a loop.
Check whether the result is correct for small n, with pen and paper.
"""




The number of rabbits is 46137325
The number of rabbits is 875089148811941


There is also a closed-form solution to this problem. Can you find it? (Hint: define a recurrence relation and use its characteristic polynomial.)



2. In DNA, symbols 'A' and 'T' are complements of each other, as are 'C' and 'G'. The reverse complement of a DNA string $s$ is the string $s^{c}$ formed by reversing the symbols of $s$, then taking the complement of each symbol (e.g., the reverse complement of "GTCA" is "TGAC"). Read in the string of DNA in `data_2.txt` and print the reverse complement.






In [None]:
"""
Hint. There are many simple ways to achieve this result.
A possibility is to use "if" conditions, and string slicings.
Additionally, you can try to use a dictionary for the complements, and separate the problem into two functions:
- one returning the reverse of an input string
- one returning the complement of an input string
"""




The reverse complement of the DNA is...
CTAAAATGGTCTGGCGTGACGCTATGTAGCTCCTAAGTTACGTACTGTCCGTACATTGAACAATTAAAGGTAGCCCCCCTTTGAGAGCGTTGCAACTTGTTCGTAACTGTTAGACCCATCCTGTTTCGGTAGAAGATATCCACTATACCGGTTAGTATGGGATCCAAGCTACGCTGTACTTCGCAGTATATTATCAATATAAAATGTCCACACCGTTCGAGTTGAATTGCTCTGTTTCATCCCCATCGGCCAGCGTACTTGCTACGAGCAAGCATATTGGCGGTCGTCAAACACCTGTCATCACGTGGTCTGGGGTATGCGCCGGTAGATGGTACAACCGCGTTAGGGACGGCACCCGTCCAACAGCATACCTTCAACTACTCGTTTTACTTGAACTACAGTTCGTGCTTTTGTCAGGGCGCACCGTCCCTACTCCAAAAACTCCGAGAGCGCGAAGCACGGCCAGTGGCAAAGTCTTATGTACATGCGAATGGATTTGAGGAAGCAGAGTTGCCTACACATGTTCCGGAGATATCTCGCCGGTATAAAAGGGTGAAGTGGTGAAATCATACCGCTGTGTAACGCACTCTAGGACAAACATACCAACGCTCCAATAGTCCCCTTAGGCATGGAGGCGAGTTAATAGGCAAATATTTCGTATCACATTCGACTGTGAAATAGCCTTTAAGTACAGCTCAAGGTTTTTAAACTTCACATCAGTGTGCTAATCTCACCTTGAGTTGGACTAGCAACCAGGCAAATTATAGTGTCCGCCGAGAGAATTGTCAGGTCGGCTGTCATACGTTGGAGAGTTTGACCGGTCTATATC
Output from str.maketrans()...
CTAAAATGGTCTGGCGTGACGCTATGTAGCTCCTAAGTTACGTACTGTCCGTACATTGAACAATTAAAGGTAGCCCCCCTTTGAGAGCGTTGCAACTTG
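
The second output above refers to `str.maketrans()`. As one possible sketch of that approach (not necessarily the code that produced the output), `str.maketrans` builds a character-to-character table and `str.translate` applies it. The function name `reverse_complement` is introduced here for illustration, and the example uses the toy string from the problem statement rather than `data_2.txt`.

```python
def reverse_complement(dna):
    """Return the reverse complement of a DNA string made of A, C, G, T."""
    # Map each symbol to its complement: A<->T and C<->G.
    table = str.maketrans("ACGT", "TGCA")
    # Reverse with a slice, then complement every symbol in one pass.
    return dna[::-1].translate(table)

print(reverse_complement("GTCA"))
```

Characters that are not in the table (for example a trailing newline) pass through `translate` unchanged, so it is worth calling `strip()` on the text read from a file first.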

3. In a weighted alphabet, every symbol is assigned a positive real number called a weight. A string formed from a weighted alphabet is called a weighted string, and its weight is equal to the sum of the weights of its symbols. The standard weight assigned to each member of the 20-symbol amino acid alphabet is the monoisotopic mass of the corresponding amino acid. So the molecular weight of a protein can be calculated as the sum of the weights of its amino acids. Given the amino acid string in `data_3.txt`, calculate its molecular weight using the weights of each amino acid provided in `amino_acid_weights.txt`.


In [None]:
"""
Hint: Divide the problem into separate parts.
- Read the molecular weights, and save them into a dictionary
- Read the data
- Make a function that takes a string and the molecular weights dictionary as input, and returns the string's weight.
- Calculate all molecular weights.
"""




The monoisotopic mass [Da] of the protein is 115256.037480


4. Often in biology we want to calculate the distance between observations. This could be the distance in species composition between two communities or the distance in genetic composition between two strands of DNA, to give two examples. Here you will calculate the latter as the **fraction** of nucleotides that differ between DNA strings $s_{1}$ and $s_{2}$, $d(s_{1}, s_{2})$. You are provided with a list of $n$ sequences in `data_4.txt`. Calculate the distance between each pair of sequences ($d(s_{i}, s_{j})$) and store the results as a nested list in matrix form.

$\mathbf{d} =
\begin{pmatrix}
d_{1,1} & d_{1,2} & \cdots & d_{1,n} \\
d_{2,1} & d_{2,2} & \cdots & d_{2,n} \\
\vdots  & \vdots  & \ddots & \vdots  \\
d_{n,1} & d_{n,2} & \cdots & d_{n,n}
\end{pmatrix}$
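
To make the definition concrete, here is a toy illustration on three made-up four-letter strings. It assumes that the sequences being compared have equal length, which the problem statement does not say explicitly. The name `fraction_differing` is hypothetical.

```python
def fraction_differing(s1, s2):
    """Fraction of positions at which two equal-length strings differ."""
    if len(s1) != len(s2):
        raise ValueError("sequences must have the same length")
    # True counts as 1, so the sum is the number of mismatched positions.
    mismatches = sum(a != b for a, b in zip(s1, s2))
    return mismatches / len(s1)

# Made-up sequences, not taken from data_4.txt.
toy = ["ACGT", "ACGA", "TCGA"]
matrix = [[fraction_differing(a, b) for b in toy] for a in toy]

for row in matrix:
    print(row)
```

The matrix is symmetric with zeros on the diagonal, which is a quick sanity check for your own result.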


In [None]:
"""
Hint: Divide the problem into separate parts.
- Read the data
- Construct a function that calculates the distance (fraction) of two strings.
- Construct the matrix
"""

# specify your path



[[0.0, 0.31053203040173727, 0.5537459283387622, 0.3289902280130293, 0.5841476655808904, 0.31596091205211724, 0.6015200868621065, 0.5385450597176982, 0.48642779587404994], [0.31053203040173727, 0.0, 0.5928338762214984, 0.4820846905537459, 0.6254071661237784, 0.47014115092290987, 0.6753528773072747, 0.6199782844733985, 0.5787187839305103], [0.5537459283387622, 0.5928338762214984, 0.0, 0.46362649294245384, 0.5613463626492943, 0.6134636264929425, 0.5656894679695983, 0.46254071661237783, 0.31704668838219324], [0.3289902280130293, 0.4820846905537459, 0.46362649294245384, 0.0, 0.5309446254071661, 0.48751357220412594, 0.5494028230184582, 0.42888165038002174, 0.28555917480998916], [0.5841476655808904, 0.6254071661237784, 0.5613463626492943, 0.5309446254071661, 0.0, 0.6123778501628665, 0.48751357220412594, 0.30510314875135724, 0.4527687296416938], [0.31596091205211724, 0.47014115092290987, 0.6134636264929425, 0.48751357220412594, 0.6123778501628665, 0.0, 0.6568946796959826, 0.6091205211726385, 0