# Homework Assignment #1 (Individual)
## Git practice, debugging practice, unfamiliar data, and new Python packages

### <p style="text-align: right;"> &#9989; Theodore Zimbo.</p>
### <p style="text-align: right;"> &#9989; zimbot13.</p>

<img src="https://biopython.org/assets/images/biopython_logo_white.png" width=300px align="right" style="margin-left: 20px" alt="Image credit: https://biopython.org/">

### Goals for this homework assignment
By the end of this assignment, you should be able to:
* Use Git to create a repository, track changes to the files within the repository, and push those changes to a remote repository.
* Debug some Python code.
* Work with an unfamiliar data format and successfully load it into your notebook.
* Visualize unfamiliar files/data using Python.
* Read documentation and example code to use a new Python package.

Work through the following assignment, making sure to follow all of the directions and answer all of the questions.

There are **22 points** possible on this assignment. Point values for each part are included in the section headers and question prompts.

**This assignment is due roughly two weeks from now at 11:59 pm on Friday, September 24.** It should be uploaded into the "Homework Assignments" submission folder for Homework #1 on D2L.  Submission instructions can be found at the end of the notebook.

---
## Part 1: Setting up a git repository to track your progress on your assignment (3 points)

For this assignment, you're going to create a new **private** GitHub repository that you can use to track your progress on this homework assignment and future assignments. Again, this should be a **private** repository so that your solutions are not publicly accessible.

**&#9989; Do the following**:

1. On [GitHub](https://github.com) make sure you are logged into your account and then create a new <font color="red">**_private_**</font> GitHub repository called `cmse202-f21-turnin`.
2. Once you've initialized the repository on GitHub, **clone a copy of it onto JupyterHub or your computer**.
3. Inside the `cmse202-f21-turnin` repository, create a new folder called `hw-01`.
4. Move this notebook into that **new directory** in your repository, then **add it and commit it to your repository**. **Important**: you'll want to make sure you **save and close** the notebook before you do this step and then re-open it once you've added it to your repository.
5. Finally, to test that everything is working, `git push` the notebook file so that it shows up in your <font color="red">**_private_**</font> GitHub repository on the web.

**Important**: Make sure you've added your Professor and your TA as collaborators to your new "turnin" repository with "Read" access so that they can see your assignment. **You should check the Slack channel _for your section of the course_ to get this information.**

**Double-check the following**: Make sure that the version of this notebook that you are working on is the same one that you just added to your repository! If you are working on a different copy of the notebook, **none of your changes will be tracked**.

If everything went as intended, the file should now show up on your GitHub account in the "`cmse202-f21-turnin`" repository inside the `hw-01` directory that you just created.  Periodically, **you'll be asked to commit your changes to the repository and push them to the remote GitHub location**. Of course, you can always commit your changes more often than that, if you wish.  It can be good to get into a habit of committing your changes any time you make a significant modification, or when you stop working on the project for a bit.

&#9989; **Do this**: Before you move on, put the command that your instructor should run to clone your repository in the markdown cell below.

``` bash
git clone https://github.com/zimbot13/cmse202-f21-turnin
```

---

## Part 2: Bit of code debugging: reading Python and understanding error messages (6 points)

As a bit of Python practice, review the following code, read the error outputs, and **fix the code**. When you fix the code, **add a comment to explain what was wrong with the original code**.

### Fixing errors

**Question 1 [6 points]**: Resolve the errors in the following pieces of code and add a comment that explains what was wrong in the first place.

In [None]:
for i in range(10)
    print("The value of i is %i" %i)

In [None]:
def compute_fraction(numerator, denominator):
    fraction = numerator/denominator
    print("The value of the fraction is %f" %fraction)
    
compute_fraction(5, 0)

In [None]:
def compute_fraction(numerator, denominator):
    fraction = numerator/denominator
    print("The value of the fraction is %f" %fraction)
    
compute_fraction("one", 25)

In [None]:
import numpy as np

n = np.arange(20)
print("The value of the 10th element is %d" %n(9))

In [None]:
odd = [1, 3, 5, 7, 9]
even = [2, 4, 6, 8, 10]

for i in odd:
    print(i)
    
for j in evven:
    print(j)

In [None]:
spanish = dict()
spanish['hello'] = 'hola'
spanish['yes'] = 'si'
spanish['one'] = 'uno'
spanish['two'] = 'dos'
spanish['three'] = 'tres'
spanish['red'] = 'rojo'
spanish['black'] = 'negro'
spanish['green'] = 'verde'
spanish['blue'] = 'azul'

print(spanish["hello"])
print(spanish["one"], spanish["two"], spanish["three"])
print(spanish["orange"])
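
When reading a traceback, the last line usually names the exception type and gives a short message, and that line is often the quickest clue to what went wrong. As a minimal sketch (the snippets here are separate illustrative examples, not the cells above, and the helper name `describe_error` is hypothetical), the following runs a few small broken statements and prints the exception type and message for each:

```python
def describe_error(source):
    """Run a snippet of code and return 'ExceptionType: message' if it fails."""
    try:
        # compile() raises SyntaxError before any of the snippet runs
        code = compile(source, "<snippet>", "exec")
        exec(code, {})
    except Exception as err:
        return f"{type(err).__name__}: {err}"
    return "no error"


# Illustrative snippets, each chosen to trigger a different exception type
snippets = [
    "total = 1 / 0",
    "value = 'abc' + 5",
    "print(undefined_name)",
    "colors = {'red': 'rojo'}; colors['purple']",
    "for k in range(3) print(k)",
]

for snippet in snippets:
    print(f"{snippet!r:50} -> {describe_error(snippet)}")
```

Matching the exception name (for example `NameError` versus `KeyError`) to the line the traceback points at is usually enough to decide what kind of fix is needed.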

---
### &#128721; STOP
**Pause to commit your changes to your Git repository!**

Take a moment to save your notebook, commit the changes to your Git repository using the commit message "Committing Part 2", and push the changes to GitHub.

---

## Part 3: Working with unfamiliar data and a new Python library (13 points)

Since we've been practicing downloading data and repositories from the internet and learning to use new Python packages, you're going to practice doing exactly that in this assignment! This will require using the command line a bit (or running command-line commands from inside your notebook), reading documentation, and looking at code examples you're not familiar with. These are all authentic parts of being an independent computational professional.

---
### 3.1: Download the data! (2 points)

For this assignment you're going to need to download a data file from the internet. It's a relatively small file, so it shouldn't take very long. Since you can't do parts of the assignment without the file, let the instructor know if you run into issues right away! Remember, in order to work with the data in this notebook, you'll need to make sure the data is in the same place as the notebook or you'll need to put the full path to the file in your data reading commands.

**Add and commit the file to your repository once you've downloaded it.**

The file you need is located here: `http://devinsilvia.com/cmse202/Example_chromatogram.ab1`

In the cell below, include the command-line command that you used to download the file (you can either run the command on the command line or inside the Jupyter notebook using the correct leading character). If you're not sure how to download it using the command line, download it however you need to in order to get it onto your computer and move on.
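
If no command-line download tool is available, a Python-only fallback is one option for getting the file onto your computer. The following is a minimal sketch using the standard library's `urllib.request.urlretrieve`; the helper name `download_if_missing` and the destination filename are assumptions made for illustration, and the assignment still asks for a command-line command in the cell below.

```python
from pathlib import Path
from urllib.request import urlretrieve

# URL given in the assignment text
DATA_URL = "http://devinsilvia.com/cmse202/Example_chromatogram.ab1"


def download_if_missing(url, dest):
    """Download url to dest unless dest already exists; return the path."""
    dest = Path(dest)
    if not dest.exists():
        urlretrieve(url, str(dest))
    return dest


# Example call (requires network access), left commented out:
# path = download_if_missing(DATA_URL, "Example_chromatogram.ab1")
# print(path, path.stat().st_size, "bytes")
```

Skipping the download when the file already exists keeps the cell safe to re-run without fetching the data again.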

In [None]:
# Put your download command here


---
### &#128721; STOP
**Pause to commit your changes to your Git repository!**

Take a moment to save your notebook, commit the changes to your Git repository using the commit message "Committing part 3.1", and push the changes to GitHub.

---

### 3.2: Loading/Reading unfamiliar data in Python (4 points)

You might notice that the file you downloaded has the extension ".ab1". This is likely a file extension that you are not familiar with and it actually indicates that it is an "ABI" file. So what is this file? [A quick internet search](https://fileinfo.com/extension/abi) indicates that this is a binary file format that contains information about a DNA sequence produced by a particular DNA analysis instrument. It's also commonly referred to as a "trace file". Although you might not have a background in DNA processing or biology in general, you should have all of the skills necessary to interact with this data in Python.

That said, we've never opened such files in class! Your first task is to figure out how to open and read the file using Python. You should take a moment to search the internet for a Python package that will open an ABI file. It's particularly useful if there is a way of visualizing the contents of that file as well. If everything goes well, you should find that there is a package called [Biopython](http://biopython.org/) that is capable of reading and, using matplotlib, plotting the information contained in this type of file -- great!

&#9989; **Question 2 [1 point]**: There is at least one other package out there that you could install and use for loading ABI files in Python. Did you find this package in your search? What is it called? Why aren't we using that package?

<font size=+3>&#9998;</font> Do This - Erase the contents of this cell and replace it with your answer to the above question!  (double-click on this text to edit this cell, and hit shift+enter to save the text)

#### Installing Biopython and loading the data

Unfortunately, Biopython is not already included with Anaconda. However, you should be able to follow the directions on the [download](http://biopython.org/wiki/Download) page of the Biopython documentation to install the package.

&#9989; **Question 3 [1 point]**: What command did you use to install Biopython? Include this command in the Markdown cell below.

``` bash
# Put the command for installing Biopython here!

```

&#9989; Once you've installed Biopython, **do the following [2 points]**:

1. Open/read "Example_chromatogram.ab1" using the `SeqIO` module from the Biopython package. *Important note*: in the `SeqIO` module, there is a `parse` function and a `read` function. `parse` has a lot of extra functionality, but you should be able to just use the `read` function for the data that you've been given. You might need to review the documentation for `SeqIO`, which you can find [here](http://biopython.org/wiki/SeqIO).
2. Once you've loaded up the trace file, you should extract the DNA sequence from the file and store it as a new variable. The sequence is actually stored as an attribute on the ABI file object that you loaded up. (Remember, Python is an object-oriented language!)
3. Print the sequence string. You should find that you get something that looks like this:

```
NNNNNNNNTCGTTGGTGACCAGCGGAGGGATCATTACCGAGTTTACAACTCCCAAACCCCTGTGAACATACCACTTGTTGCCTCGGCGGA
TCAGCCCGCTCCCGGTAAAACGGGACGGCCCGCCAGAGGACCCCTAAACTCTGTTTCTATATGTAACTTCTGAGTAAAACCATAAATAAA
TCAAAACTTTCAACAACGGATCTCTTGGTTCTGGCATCGATGAAGAACGCAGCAAAATGCGATAAGTAATGTGAATTGCAGAATTCAGTG
AATCATCGAATCTTTGAACGCACATTGCGCCCGCCAGTATTCTGGCGGGCATGCCTGTTCGAGCGTCATTTCAACCCTCAAGCACAGCTT
GGTGTTGGGACTCGCGTTAATTCGCGTTCCTCAAATTGATTGGCGGTCACGTCGAGCTTCCATAGCGTAGTAGTAAAACCCTCGTTACTG
GTAATCGTCGCGGCCACGCCGTTAAACCCCAACTTCTGAATGTTGACCTCGGATCAGGTAGGAATACCCGCTGAACTTAAGCATATCAAT
AAGCGGAGGAAAAGAAACCAACAGGGATTGCCCTAGTAACGGCGAGTGAAGCGGCAACAGCTCAAATTTGAAATCTGGCTCTCGGGCCCG
AGTTGTAATTTGTAGAGGATACTTTTGATGCGGTGCCTTCCGAGTTCCCTGGAACGGGACGCCATAGAGGGTGAGAGCCCCGTCTGGTTG
GATGCCAAATCTCTGTAAAGTTCCTTCAACGAGTCGAGTAGTTTGGGAATGCTGCTCTAAATGGGAGGTATATGTCTTCTAAAGCTAAAT
ACCGGCCAGAGACCGATAGCGCACAAGTAGAGTGATCGAAAGATGAAAAGCACTTTGAAAAGAGAGTTAAAAAGTACGTGAAATTGTTGA
AAGGGAAGCGTTTATGACCAGACTTGGGCTTGGTTAATCATCTGGGGTTCTCCCCAGTGCACTTTTCCAGTCCAGGCCAGCATCAGTTTC
CCCGGGGGANAAGGNNGCGGGAATGTGGCTCNCTTCNGGGAGTGTNTAGCCCACCGNGNANNCCCTGGGGGGGACTGAGTCGCGCATCTG
CAGNNGCTGNNTANGTTNNNNNNNNNNNNNNNNNNNNNNN
```
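
The three steps above can be sketched roughly as follows, assuming Biopython is installed and the file sits next to the notebook. `SeqIO.read` with the format name `"abi"` returns a record object whose `seq` attribute holds the sequence. The function name `load_trace_sequence` is hypothetical, and this is one possible approach rather than the expected solution.

```python
def load_trace_sequence(path):
    """Read a single-record ABI trace file and return its sequence as a str."""
    # Imported here so the function can be defined before Biopython is installed
    from Bio import SeqIO

    # "abi" is Biopython's format name for .ab1 trace files; some older
    # Biopython versions may need a handle opened with open(path, "rb") instead
    record = SeqIO.read(path, "abi")
    return str(record.seq)


# Example call, assuming the file has been downloaded next to the notebook:
# sequence = load_trace_sequence("Example_chromatogram.ab1")
# print(len(sequence))
# print(sequence)
```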

In [None]:
# Put your code here


---
### &#128721; STOP
**Pause to commit your changes to your Git repository!**

Take a moment to save your notebook, commit the changes to your Git repository using the commit message "Committing part 3.2", and push the changes to GitHub.

---

### 3.3: Working with the data (4 points)

Now that you've got a DNA sequence loaded into Python using Biopython, you're going to perform a bit of analysis to understand the properties of the sequence. First off, here's a bit of background on what all those letters in the sequence mean:

* Each letter, C, T, A, and G, represents a particular type of DNA nucleotide. Specifically, these letters represent the base of the nucleotide, which can be one of four things:
    - cytosine (C)
    - thymine (T)
    - adenine (A)
    - guanine (G)
* The "N"s stand for any base; they indicate that the data in that region is of sufficiently poor quality that determining the correct base is not possible.

Now, assuming you don't have a particularly strong background in the properties of DNA sequences, one naive question we might want to ask of the data is:

**"How common is each type of nucleotide base in this DNA sequence?"**

&#9989; Using whatever means necessary, your job is to determine the total length of the DNA sequence and the corresponding number of times that each letter (C, T, A, G, N) shows up in the sequence. Then, **make a bar chart that shows the relative fraction of each nucleotide type compared to the entire length of the sequence [2 points]**. Your resulting plot should look something like this:

<img src="https://i.imgur.com/CvfTN4B.png">

<font color='red'>*Hint*</font>: You should check to see if the Biopython sequence object has any sort of built-in methods that would simplify this process! Using the tools that are built into the Python packages you are using is generally more efficient than writing all of the same code yourself.

If you were unable to read in the sequence data in the previous section, you can use the following lines of code to create a similar sort of sequence and then make the plot using that sequence:

```
from Bio.Seq import Seq

my_seq = Seq('NNTCGTTGGTGACCAGCGGAGGGATCATTACCGAGTTTACAAGACTGAGTCGCGCATCTGCAGNNGCTGNNTANGTTNNNN')
```
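
As a hedged illustration of the counting step only, here is a minimal sketch that works on the fallback sequence as a plain string and uses `collections.Counter` from the standard library. The helper name `base_fractions` is hypothetical; Biopython `Seq` objects also provide a `count` method that could be used instead.

```python
from collections import Counter

# Fallback sequence from the assignment text, as a plain string
sequence = "NNTCGTTGGTGACCAGCGGAGGGATCATTACCGAGTTTACAAGACTGAGTCGCGCATCTGCAGNNGCTGNNTANGTTNNNN"


def base_fractions(seq, bases="CTAGN"):
    """Return {base: count / len(seq)} for each requested base."""
    seq = str(seq)  # also accepts a Bio.Seq.Seq object
    counts = Counter(seq)  # missing bases count as 0
    return {base: counts[base] / len(seq) for base in bases}


fractions = base_fractions(sequence)
print("length:", len(sequence))
for base, frac in fractions.items():
    print(f"{base}: {frac:.3f}")
```

The dictionary's keys and values can then be passed to a matplotlib bar plot to produce the chart.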


In [None]:
# Put your code here


&#9989; **Question 4 [1 point]**: What does your resulting graph tell you about the relative probability of a given nucleotide coming up in a given DNA sequence? With just one DNA sequence, do you think you can generalize any claim you might make to all DNA sequences?

<font size=+3>&#9998;</font> Do This - Erase the contents of this cell and replace it with your answer to the above question!  (double-click on this text to edit this cell, and hit shift+enter to save the text)

---
### &#128721; STOP
**Pause to commit your changes to your Git repository!**

Take a moment to save your notebook, commit the changes to your Git repository using the commit message "Committing part 3.3", and push the changes to GitHub.

---

### 3.4: Using a specialized package for visualization of data (3 points)

It turns out that if you were to visualize a DNA trace file, it would look something like this:

<img src="https://i.imgur.com/SaneoEw.png" width=800px>

The spikes in the above plot correspond to different channels of the data where specific nucleotide bases dominate the signal. This is the information that is used to create the string of letters in the DNA sequence you looked at above. If the quality of the data is poor and there is no clearly dominating spike, that location would be marked with an "N".

&#9989;  **Using Biopython and matplotlib (and the same ABI data file), you are going to attempt to recreate this type of visualization to produce an image that looks like the following [3 points]:**

<img src="https://i.imgur.com/rDOtf5G.png">

You should make sure that your plot spans the same range on the x-axis as the above image. **Important note**: the colors in the above plot do not necessarily represent the same nucleotide bases as in the plot before that, but are simply used to convey a similar type of information.

<font color='red'>*Hint*</font>: You should be able to find some Biopython documentation that will make this relatively straightforward.
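
As a rough sketch of one possible approach, Biopython's ABI parser keeps the raw instrument fields in `record.annotations["abif_raw"]`, and the Biopython documentation on ABI traces uses the `DATA9` through `DATA12` entries as the four trace channels. The function name `plot_trace` and the default `start`/`stop` sample range are assumptions for illustration; the range should be adjusted to match the target image.

```python
def plot_trace(path, start=0, stop=500):
    """Plot the four raw trace channels of an ABI file between two sample indices."""
    # Lazy imports: both packages must be installed before calling this function
    from Bio import SeqIO
    import matplotlib.pyplot as plt

    record = SeqIO.read(path, "abi")
    raw = record.annotations["abif_raw"]  # dict of raw ABIF fields
    channels = ["DATA9", "DATA10", "DATA11", "DATA12"]

    fig, ax = plt.subplots(figsize=(10, 3))
    for name in channels:
        signal = raw[name][start:stop]
        # Build x from the slice length so x and y always match
        ax.plot(range(start, start + len(signal)), signal, label=name)
    ax.set_xlabel("sample index")
    ax.set_ylabel("signal")
    ax.legend()
    return fig, ax


# Example call, assuming the file has been downloaded next to the notebook:
# plot_trace("Example_chromatogram.ab1", start=0, stop=500)
```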

In [None]:
# Put your code here


---
### &#128721; STOP
**Pause to commit your changes to your Git repository!**

Take a moment to save your notebook, commit the changes to your Git repository using the commit message "Committing part 3.4", and push the changes to GitHub.

---
## Assignment wrap-up

Please fill out the form that appears when you run the code below.  **You must completely fill this out in order to receive credit for the assignment!**

In [None]:
from IPython.display import HTML
HTML(
"""
<iframe 
	src="https://forms.office.com/Pages/ResponsePage.aspx?id=MHEXIi9k2UGSEXQjetVofddd5T-Pwn1DlT6_yoCyuCFUMVNYSEkxMUJOTUtGRUQzRUdMMTVSM0VVOS4u" 
	width="800px" 
	height="600px" 
	frameborder="0" 
	marginheight="0" 
	marginwidth="0">
	Loading...
</iframe>
"""
)

### Congratulations, you're done!

Submit this assignment by uploading it to the course Desire2Learn web page.  Go to the "Homework Assignments" folder, find the dropbox link for Homework #1, and upload it there.

&#169; Copyright 2021,  Department of Computational Mathematics, Science and Engineering at Michigan State University