<small><small><i>
Introduction to Python for Bioinformatics - available at https://github.com/GunzIvan28/MScMak2025-IntroductionToPython.
</i></small></small>

## Files, Scripting and Modules

So far, we have been writing all our Python code in Jupyter notebooks. However, if you want to use that code as part of a pipeline, you need to write scripts. Also, most of the time the data you need to analyse is in a file, which you need to read into Python and process. 


### Reading Files

So far we have been working from memory. In bioinformatics, you will often need to read data from a file or write output to one. For both we use the `open` function. 

In [None]:
myfile = open("../Data/test.txt", "w")        # "w" opens for writing: creates the file, or empties it if it exists ("r" = reading, "a" = appending)
myfile.write("My first file written from Python \n")
myfile.write("---------------------------------\n")
myfile.write("Hello, world!\n")
myfile.close()  # Closes the file (always a must, don't forget)

In [None]:
type(myfile)    #Text wrapper that enables navigation through the file, enables editing and manipulation through the file

In [None]:
myfile.seek(2)  # Raises ValueError here because myfile was closed above; seek() needs an open file

In [None]:
read_file = open("../Data/test.txt", 'r')

In [None]:
read_file.readline() #Reads line per line

In [None]:
read_file.readlines() #Reads all remaining lines from the current position and returns them as a list

In [None]:
read_file.seek(0) #Brings cursor back to the start

In [None]:
read_file.readlines()
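The cells above move a cursor through the file: each read starts where the previous one stopped, and `seek(0)` rewinds it. A minimal sketch of that behaviour follows. It assumes a scratch file in a temporary folder (the name `seek_demo.txt` is made up for illustration) so that nothing in `../Data` is touched.

```python
import os
import tempfile

# Hypothetical scratch file, created only for this illustration
path = os.path.join(tempfile.mkdtemp(), "seek_demo.txt")

handle = open(path, "w")
handle.write("first line\nsecond line\n")
handle.close()

handle = open(path, "r")
first = handle.readline()      # reads up to and including the first "\n"
rest = handle.readlines()      # list of the lines that remain after the cursor
at_end = handle.readlines()    # cursor is at the end, so this is an empty list
handle.seek(0)                 # move the cursor back to the start
again = handle.readline()      # the first line can be read again
handle.close()

print(repr(first), rest, at_end, repr(again))
```

An empty list from `readlines()` usually means the cursor is already at the end of the file, not that the file is empty.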

The **mode** in which you open the file determines whether you write (`w`), read (`r`) or append (`a`) to the file. 

Opening a file creates what we call a **file handle**, an object with methods for manipulating the file. In our case, `myfile` has the methods to write to and close the file. Closing the file flushes any buffered text to disk and releases the file. 
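To see how the three modes differ, here is a small sketch that writes, appends and then reads the same file. The file name `modes_demo.txt` and the temporary folder are assumptions made for the example.

```python
import os
import tempfile

# Hypothetical scratch file, created only for this illustration
path = os.path.join(tempfile.mkdtemp(), "modes_demo.txt")

log = open(path, "w")     # "w" creates the file, or empties it if it already exists
log.write("run 1\n")
log.close()

log = open(path, "a")     # "a" keeps the existing content and adds to the end
log.write("run 2\n")
log.close()

log = open(path, "r")     # "r" only allows reading
content = log.read()
log.close()

print(content)
```

Had the second `open` used `"w"` instead of `"a"`, the line `run 1` would have been lost.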

Alternatively, one can open the file using a `with` statement, which automatically closes the file when the indented block ends. 

In [None]:
with open("../Data/test1.txt", "w") as myfile:
    myfile.write("My first file written from Python \n")
    myfile.write("---------------------------------\n")
    myfile.write("Hello, world!\n")

Let's check what else we can do with `open`.

In [None]:
?open

#### Fetching file from the web
Download this [file](https://www.uniprot.org/docs/humchrx.txt), which we will use to explore file reading in Python. 

In [None]:
import urllib.request                                 # Python's counterpart to 'wget' in bash

url = "https://www.uniprot.org/docs/humchrx.txt"      # URL of the file to download
destination_filename = "../Data/humchrx.txt"          # Where to save the download
urllib.request.urlretrieve(url, destination_filename) # Fetch url and save it as destination_filename

#### Reading a file line-at-a-time

We can read the file line by line using `readline`. This reads one line per call until the end of the file. It is suitable for a large file that may not fit in memory. 

In [None]:
humchrx = open('../Data/humchrx.txt', 'r')
line = humchrx.readline()      # Reads only the next line (here, the first line of the file)
print(line)

In [None]:
humchrx.close()

In [None]:
with open('../Data/test.txt', 'r') as myfile:
    while True:
        line = myfile.readline()
        if len(line) == 0: # If there are no more lines
            break
        print(line) 

In [None]:
with open('../Data/humchrx.txt', 'r') as myfile:
    while True:
        line = myfile.readline()
        if len(line) == 0: # If there are no more lines
            break
        print(line) 
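The `while True` / `readline` loop above can also be written by looping over the file handle itself, which likewise hands over one line at a time without loading the whole file. The sketch below assumes a small made-up file; its contents and name are illustrative only.

```python
import os
import tempfile

# Hypothetical scratch file with a blank line in the middle
path = os.path.join(tempfile.mkdtemp(), "names_demo.txt")
with open(path, "w") as handle:
    handle.write("ABCB7\n\nACE2\n")

names = []
with open(path, "r") as handle:
    for line in handle:              # yields one line per iteration
        line = line.rstrip("\n")     # drop the trailing newline
        if line:                     # skip blank lines
            names.append(line)

print(names)
```

The loop stops by itself at the end of the file, so no `break` is needed.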

### Read the whole file

If the file is small or your computer has enough memory, you can read the whole file into memory as a list using `readlines`.

In [None]:
with open('../Data/test.txt', 'r') as myfile:
    lines = myfile.readlines()                 #reads all lines into a list, one string per line
    for line in lines:
        print(line)

or as a whole

In [None]:
with open('../Data/test.txt', 'r') as myfile:
    whole_file = myfile.read()                  #reads the whole file as a single string
    print(whole_file)

In [None]:
with open('../Data/humchrx.txt', 'r') as myfile:
    whole_file = myfile.read()                  #reads the whole file as a single string
    print(whole_file)

In [None]:
# Note: a blank line in the file is read as the string '\n', not as an empty string
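The three ways of reading a whole file return different things. The sketch below compares them on a made-up three-line file with a blank line in the middle (file name and contents are assumptions for illustration).

```python
import os
import tempfile

# Hypothetical scratch file, created only for this illustration
path = os.path.join(tempfile.mkdtemp(), "whole_demo.txt")
with open(path, "w") as handle:
    handle.write("line one\n\nline three\n")

with open(path, "r") as handle:
    whole = handle.read()          # one string holding the entire file

with open(path, "r") as handle:
    lines = handle.readlines()     # list of strings, newlines kept

clean = whole.splitlines()         # list of strings, newlines removed

print(repr(whole))
print(lines)
print(clean)
```

Note how the blank line appears as `'\n'` in `lines` and as `''` in `clean`.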

### Exercise 1

Write a function that reads the file (humchrx.txt) and writes a clean list of gene names to another file (gene_names.txt).

In [None]:
humchr=open('../Data/humchrx.txt', 'r')

In [None]:
with open('../Data/humchrx.txt', 'r') as myfile:
    for line in myfile.readlines():
        
        if line.startswith("Gene"):
            print(line)

## Exercises

 1. Write a function `areatriangle(b,h)` to compute the area of a triangle: formula is `area = .5*b*h`. Output should look like: 
 
 `The area of a triangle of base 3 and height 5 is 7.5`

In [None]:
def areatriangle(b,h):
    """Print the area of a triangle with base b and height h."""
    area = 0.5*b*h
    print("The area of a triangle of base %d and height %d is %.1f" %(b,h,area))

In [None]:
areatriangle(3, 5)

2. Write a function `celsius_to_fahrenheit(temp)` to convert a Celsius temperature to Fahrenheit. The formula is `(9/5) times temp plus 32`. Print the output in the form: 
 
`The Celsius temperature 50.0 is equivalent to 122.0 degrees Fahrenheit.`

In [None]:
def celsius_to_fahrenheit(temp):
    """A function that converts a temperature in Celsius to Fahrenheit"""
    f_temp = (9/5*temp) + 32
    print("The Celsius temperature %.1f is equivalent to %.1f degrees Fahrenheit." % (temp, f_temp))

In [None]:
celsius_to_fahrenheit(50)

3. Create a function that prompts the user for their first and last name. The function written should include the city and state.
That is, ask two more questions to get the city and the state you live in. Print where you are from on a new line. Put the customary comma between
city and state. Your run should look like the following: 


```
Enter your first name: Ivan
Enter your last name: Lloyd
Enter the city you live in: Kampala
Enter the state you live in: Central

Your name is: Ivan Lloyd
You live in:  Kampala, Central
```

In [None]:
def biodata():
    fname = input("Enter your first name: ")
    lname = input("Enter your last name: ")
    city = input("Enter the city you live in: ")
    state = input("Enter the state you live in: ")
    print("Your name is: %s %s" %(fname, lname))
    print("You live in: %s, %s" %(city, state))

biodata()

4. Write a function `count_down()` that starts at 10 and counts down to rocket launch. Its output should be:  
`10 9 8 7 6 5 4 3 2 1 BLASTOFF!`  

You can make all the numbers on the same line or different lines. Use a while loop.

5. Write a function `sum_prod(x,y)` that prints the sum and product of the numbers x and y on separate lines, the sum printing first.

6. Write a function `steps(n)` that adds up the numbers 1 through n and prints out the result. You should use either a `'while'` loop or a `'for'` loop. Be sure that you check your answer on several numbers n.  Be careful that your loop steps through all the numbers from 1 through and including n.

7. Write a function `conv(miles)` to convert miles to feet. There are 5280 feet in each mile. Make the printout a statement as follows:  
`There are 10560 feet in 2 miles.`  Except for the numbers, this statement should be exactly as written.

8. Write a function `drinks(age)`. This function should use an `if-elif-else` statement to print out:  

`Have a glass of milk.` for anyone under 7;  
`Have a coke.` for anyone under 21, and   
`Have a martini.` for anyone 21 or older.   

Tip: Be careful about the ages 7 (a seven year old is not under 7) and 21. Also be careful to make the phrases exactly as shown.
Test the runs (3 of them). Note that the test for the function output will use different numbers:

```
drinks(5)
Have a glass of milk.

drinks(10)
Have a coke.

drinks(25)
Have a martini.
```

9. Write a function `odd_nums()` that prints the odd numbers from 1 through 100. Make all of these numbers appear on the same line (actually, when the line fills up it will wrap around, but ignore that.). In order to do this, your print statement should have `end=" "` in it. For example, `print(name,end=" ")` will keep the next print statement from starting a new line. Be sure there is a space between these quotes or your numbers will run together. Use a single space as that is what the testing program expects. Use a 'for' loop and the `range()` function.  

Things to be careful of that might go wrong: 
You print too many numbers, you put too much or too little space between them, you print each number on its own line, you print even numbers or all numbers, your first number isn't 1 or your last number isn't 99.  
Always check first and last outputs when you write a loop.

Test run (I've inserted a newline here to cause wrapping in the editor):

```
odd_nums()
1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49 51 53 55 
57 59 61 63 65 67 69 71 73 75 77 79 81 83 85 87 89 91 93 95 97 99
```

10. Write a function `trapez()` that computes the area of a trapezoid. Here is the formula: `A = (1/2)(b1+b2)h`. In the formula b1 is the length of one of the bases, b2 the other. The height is h and the area is A. Basically, this takes the average of the two bases times the height. For a rectangle b1 = b2, so this reduces to b1*h. This means that you can do a pretty good test of the correctness of your function using a rectangle (that way you can compute the answer in your head). Use input statements to ask for the bases and the height.
Convert these input strings to real numbers using `float()`. Print the output nicely EXACTLY like mine below.

Tip: Be careful that your output on the test case below is exactly as shown so that the testing function judges your output correctly.  The testing function does not look at your input statements, so you don't have to use my input prompts if you don't want to. However, the testing function will enter the three inputs in the order shown. See the other test run below.

```
trapez()
Enter the length of one of the bases: 3
Enter the length of the other base: 4
Enter the height: 8
The area of a trapezoid with bases 3.0 and 4.0 and height 8.0 is 28.0
```

Another test run. In grading, expect different input numbers to be used.

```
trapez()
Enter the length of one of the bases: 10
Enter the length of the other base: 11
Enter the height: 12
The area of a trapezoid with bases 10.0 and 11.0 and height 12.0 is 126.0
```

11. Write a function `diner_waitress()` that asks for your order. First start an empty list, call it order. Then use a `while` loop and an `input()` statement to gather the order. Continue in the while loop until the customer says `that's all`. One way to end the loop is to use `break` to break out of the loop when `that's all` is entered. Recall that you can add to a list by using the list's `.append()` method. You are going to have to input one food at a time and append it to the order list. Then print out the order. Here is my run:  
```
diner_waitress()

Hello, I'll be your waitress. What will you have?

menu item: eggs
menu item: bacon
menu item: toast
menu item: jelly
menu item: that's all

You've ordered:
['eggs', 'bacon', 'toast', 'jelly']
```

12. Heron's formula for computing the area of a triangle with sides a, b, and c is as follows. Let `s = .5(a + b + c)` --- that is, 1/2 of the perimeter of the triangle. Then the area is the square root of `s(s-a)(s-b)(s-c)`. You can compute the square root of x by `x**.5` (raise x to the 1/2 power). Use an input statement to get the length of the sides. Don't forget to convert this input to a real number using float(). Adjust your output to be just like what you see below. Here is a run of my program:

```
heron()

Enter length of side one: 9
Enter length of side two: 12
Enter length of side three: 15

Area of a triangle with sides 9.0 12.0 15.0 is 54.0
```

13. The following list gives the hourly temperature during a 24 hour day. Write a function that will take such a list and compute 3 things: average temperature, high (maximum temperature), and low (minimum temperature) for the day.  I will test with a different set of temperatures, so don't pick out the low or the high and code it into your program. This should work for other hourly_temp lists as well. This can be done by looping (iterating) through the list. I suggest you not write it all at once. You might write a function that computes just one of these, say average, then improve it to handle another, say maximum, etc. Sample run using the list hourly_temp. 

Note that the testing function will use a different hourly list.  Be sure that your function works on this list and test it on at least one other list of your own construction. Note also that the list the grader uses may not have the same number of items as this one.


hourly_temp  = [40.0, 39.0, 37.0, 34.0, 33.0, 34.0, 36.0, 37.0, 38.0, 39.0, \
               40.0, 41.0, 44.0, 45.0, 47.0, 48.0, 45.0, 42.0, 39.0, 37.0, \
               36.0, 35.0, 33.0, 32.0]


```
weather(hourly_temp)
Average: 38.791666666666664
High: 48.0
Low: 32.0
```

14. Write a function that is complementary to the one in the previous problem that will convert a date such as June 17, 2016 into the format 6/17/2016.  I suggest that you use a dictionary to convert from the name of the month to the number of the month. Then it is easy to look up the month number as months["February"] and so on. Note that the month names should begin with capital letters. 
 
***Tip:***    
In print statements, commas create a space.  So you may have difficulty avoiding a space between the 7, 17, and 2016 below and the following comma.  I suggest that you build the output as a single string containing the properly formatted date and then print that.  You can convert any number to string by using str() and tie the parts together using +. Duplicate the format of the example output exactly. 

Here is a printout of my run for July 17, 2016.

```
date_conv_modified("July",17, 2016)
7/17/2016
```

15. Write a function temp_stat(temps) to compute the average, median, standard deviation and variance of the temperatures in the table.  Print each out.
The following code generates the same temperatures each time because the seed is set to 150. Print the temperature list as the first line of the function.

Here is what my run on the table of temperatures built below looks like:

```  
temp_stat(temperatures)
[52, 61, 45, 50, 44, 34, 57, 80, 91, 50, 38, 91, 84, 20, 55, 23, 83, 42, 44, 84]
Mean:  56.4
Median:  51.0
Standard deviation:  22.04397518836526
Variance:  485.9368421052631

```

16. Write a function `write_to_file(filename,myname,myage,major)` that opens the file <filename> and writes 3 lines in it using the data given. Here is a sample of what could be in the file. Call the file 'namefile.txt', so that it is identifiable as a text file:

```  
My name is George  
My age is 21   
I am majoring in Physics

```

Tips: 
1. The file's `write()` method can take only a string and, in fact, only one. So whatever you write has to be put together into one string (using +, for example).
2. Add `+"\n"` on the end of each write to put a newline at the end of each line. Otherwise, everything will be jammed together on one line.
3. Convert your age (a number) to a string -- use `str()` to do that. Then join it to the other parts of the string before writing.
4. When running the function, use quotes around every argument except the number.

17. Write a function that takes a product and its cost and prints them out nicely. Specifically, allow 25 characters for the product name and left-justify it in that space; allow 6 characters for the cost and right-justify it in that space with 2 decimal places. Precede the cost with a dollar sign.  There should be no other spaces in the output.

Here is how one of my runs looks:
```  
toothbrush               $  2.60

```

18. Write a program that will sort an alphabetic list (or list of words) into alphabetical order. Make it sort independently of whether the letters are 
capital or lowercase. First print out the wordlist, then sort and print out the sorted list. Here is my run on the list firstline below (note that the wrapping was added when I pasted it into the file -- this is really two lines in the output).

```  
['Happy', 'families', 'are', 'all', 'alike;', 'every', 'unhappy', 'family',
 'is', 'unhappy', 'in', 'its', 'own', 'way.', 'Leo Tolstoy', 'Anna Karenina']
['alike;', 'all', 'Anna Karenina', 'are', 'every', 'families', 'family',
'Happy', 'in', 'is', 'its', 'Leo Tolstoy', 'own', 'unhappy', 'unhappy', 'way.']

```

### MORE ON READING AND WRITING TO FILES

_Reading/writing files summary:_

```
infile = open(filename)  # For reading. Also infile = open(filename,'r')
infile.close()
outfile = open(filename,"w")  # Open for writing
outfile.write("string to write")
outfile.close()
```

In [None]:
def print_file(filename):
    """ Opens file and prints its contents line by line. """
    infile = open(filename)
    
    for line in infile:
        print(line, end="") # the file has "\n" at the end of each line already
    
    infile.close()

In [None]:
print_file("../Data/ls_orchid.fasta")

In [None]:
def print_file():
    """ Opens file and prints its contents line by line. """
    infile = open("../Data/newhumpty.txt")
    
    for line in infile:
        print(line, end = "") # the file has "\n" at the end of each line already
    
    infile.close()

In [None]:
print_file()

In [None]:
def copy_file(infilename, outfilename):
    """ Opens two files and copies one into the other line by line. """
    infile = open(infilename, 'r')
    outfile = open(outfilename,'w')
    
    for x in infile:
        outfile.write(x)
        
    infile.close()
    outfile.close()

In [None]:
copy_file("../Data/HumptyDumpty.txt", "../Data/copy_cat.txt")

In [None]:
def copy_file():
    """ Opens two files and copies one into the other line by line. """
    infile = open("../Data/HumptyDumpty.txt", 'r')
    outfile = open("../Data/Newdumpty.txt",'w')
    
    for x in infile:
        outfile.write(x)
        
    infile.close()
    outfile.close()

In [None]:
copy_file()

##### Example 1:

Convert this function to a standalone program or script that takes two file names from the command line and copies one to the other   

Steps:   
1. Delete the `def` line. You don't need it.
2. Use the Edit menu of Jupyter Notebook to unindent all the lines.
3. Import the system library `sys`.
4. `sys.argv` is a list holding the program name followed by the command-line arguments (here, the file names).   
   `sys.argv[0]` is the program name,   
   `sys.argv[1]` is the first argument, etc.   
   Get the infilename and outfilename from this list.
5. Save the program as `../Scripts/copy_file.py`
6. Run the program from a terminal window (Mac or Linux) as:  
   `python ../Scripts/copy_file.py ../Data/HumptyDumpty.txt ../Data/newhumpty.txt`

In [None]:
import sys
infilename = sys.argv[1]
outfilename = sys.argv[2]

""" Opens two files and copies one into the other line by line. """
infile = open(infilename, 'r')
outfile = open(outfilename,'w')

for x in infile:
    outfile.write(x)
    
infile.close()
outfile.close()

In [None]:

""" Opens two files and copies one into the other line by line. """

infile = open(infilename)
outfile = open(outfilename,'w')

for line in infile:
    outfile.write(line)
    
infile.close()
outfile.close()

In [None]:
copy_file("../Data/test.txt", "../Data/written.txt")

In [None]:
## Stand-alone script
import sys

infilename = sys.argv[1]
outfilename = sys.argv[2]

infile = open(infilename)
outfile = open(outfilename,'w')

for line in infile:
    outfile.write(line)
    
infile.close()
outfile.close()
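The stand-alone script above fails with an `IndexError` when it is started without two file names. One possible refinement, sketched here under the assumption that the copying logic is wrapped in a helper (the name `copy_from_args` is made up for illustration), checks the number of arguments first and uses `with` so both files are closed automatically.

```python
import os
import tempfile

def copy_from_args(argv):
    """Copy the file named in argv[1] to argv[2]; return False if arguments are missing."""
    if len(argv) != 3:
        print("Usage: python copy_file.py <infile> <outfile>")
        return False
    with open(argv[1], "r") as infile, open(argv[2], "w") as outfile:
        for line in infile:
            outfile.write(line)
    return True

# Simulate two command lines with scratch files; in a real script pass sys.argv instead
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "in.txt")
dst = os.path.join(workdir, "out.txt")
with open(src, "w") as handle:
    handle.write("Humpty Dumpty sat on a wall\n")

ok = copy_from_args(["copy_file.py", src, dst])   # copies src to dst
missing = copy_from_args(["copy_file.py"])        # prints the usage message

with open(dst, "r") as handle:
    copied = handle.read()
```

Passing the argument list into a function, rather than reading `sys.argv` directly, makes the script easy to try out from a notebook.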

##### Example 2:   

The function reads through a text file and counts the number of different words. It uses a dict (dictionary data type) `d = {key1:value1,  key2:value2,  key3:value3}` where `d[key2]` gives value2, etc. The key in this case is a word and the value is the number of times it occurs.

The plan is to read through a text file, split each line into its constituent
words, add each word to a dictionary, then add one to the number of times the
word occurs in the dictionary. Finally, we sort the dictionary and print it
out listing each word (key) and its count (value). 

In [None]:
def count_words(filename):
    """ 
    Makes a list of the words in the file filename and the number of times each word appears.

    """
        
    text_file = open(filename)     # open the file for reading
    
    # Set up an empty dictionary to start a standard design pattern loop
    words_dic = {}
    
    # This loop adds each word to the dictionary and updates its count. 
    # Change all words to lower case so Horse and horse are seen as the same word.
    
    for line in text_file:         # step through each line in the text file
        for word in line.lower().split():   # split into a list of words
            word = word.strip("'?,.;!-/\"") # strip out the stuff we ignore
            if word not in words_dic:
                words_dic[word] = 0      # add word to words with 0 count
            words_dic[word] = words_dic[word] + 1    # add 1 to the count
    
    text_file.close() 
                   
    # Sorts the dictionary words into a list and then print them out
    print("List of words in the file with number of times each appears.")
    word_list = sorted(words_dic)
    for word in word_list:
        print(word, words_dic[word])

In [None]:
count_words("../Data/HumptyDumpty.txt")

In [None]:
# Stand-alone program
import sys

filename = sys.argv[1]
# print("\n",filename,"\n")  # You can check that the filename is correct
    
text_file = open(filename)     # open the file for reading

# Set up an empty dictionary to start a standard design pattern loop
words_dic = {}

# This loop adds each word to the dictionary and updates its count. Change 
# all words to lower case so Horse and horse are seen as the same word.
for line in text_file:         # step through each line in the text file
    for word in line.lower().split():   # split into a list of words
        word = word.strip("'?,.;!-/\"") # strip out the stuff we ignore
        if word not in words_dic:
            words_dic[word] = 0      # add word to words with 0 count
        words_dic[word] = words_dic[word] + 1    # add 1 to the count

text_file.close() 
               
# Sorts the dictionary words into a list and then print them out
print("List of words in the file with number of times each appears.")
word_list = sorted(words_dic)
for word in word_list:
    print(words_dic[word], word)
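The "add the word with a count of 0, then add 1" pattern used above is common enough that the standard library provides `collections.Counter` for it. The sketch below is an alternative, not the method used in this notebook: it counts words in a list of strings instead of a file, and the helper name `count_words_in_lines` and the sample text are assumptions for illustration. Unlike the version above, it also skips tokens that become empty after stripping punctuation.

```python
from collections import Counter

def count_words_in_lines(lines):
    """Return a Counter mapping each lower-cased word to its number of occurrences."""
    counts = Counter()
    for line in lines:
        for word in line.lower().split():
            word = word.strip("'?,.;!-/\"")   # same characters stripped as above
            if word:                            # ignore tokens that were only punctuation
                counts[word] += 1               # missing keys start at 0 automatically
    return counts

sample = ["Humpty Dumpty sat on a wall,", "Humpty Dumpty had a great fall."]
counts = count_words_in_lines(sample)

for word in sorted(counts):
    print(word, counts[word])
```

Because an open file can be looped over line by line, `count_words_in_lines(open(filename))` would work on a file as well.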


##### Example 3: CSV files

We now turn to reading/writing CSV files, that is Comma Separated Value files.   
Text files lack the structure that we need for certain applications.  CSV files can be read or written by spreadsheet programs such as Excel.   
They may be the most common way of transferring files from one application to another.   

_Summary of CSV file statements:_   

```
import csv

infile = open(filename)              # For reading. Also infile = open(filename,'r')
rows = csv.reader(infile)            # Reader that yields each row as a list
infile.close()                       # Close only after the rows have been read

f = open(filename, 'w', newline='')  # Open for writing
writer = csv.writer(f)
writer.writerows(rowlist)            # Write all rows at once
writer.writerow(row)                 # Write one row
f.close()
```

In [None]:
import csv

def read_csv_file(filename):
    """Reads a CSV file and prints each row, which is a list. """
    f = open(filename)
    for row in csv.reader(f):
        print(row)
    f.close()

In [None]:
read_csv_file("../Data/BooksRead.csv")
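To see writing and reading together, the sketch below writes two rows to a scratch CSV file and reads them back. The file name `books_demo.csv` and the way the sample rows are split into three columns are assumptions for illustration.

```python
import csv
import os
import tempfile

# Hypothetical scratch file, created only for this illustration
path = os.path.join(tempfile.mkdtemp(), "books_demo.csv")

rows = [
    ["Darwin, Charles", "Origin of Species", "science"],
    ["James, Henry", "Daisy Miller", "novel"],
]

with open(path, "w", newline="") as f:
    csv.writer(f).writerows(rows)        # fields containing commas are quoted automatically

with open(path, "r", newline="") as f:
    read_back = list(csv.reader(f))      # every row comes back as a list of strings

print(read_back)
```

The author names contain commas, yet each still comes back as a single item, which is the main reason to use the `csv` module instead of `line.split(",")`.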

#### Exercise

Rewrite read_csv_file(filename), call it read_csv_file2(filename), so that you print each row without the list bracket. You will print each item in the row separately instead of printing the whole row.   
This requires you to know before-hand how many columns are in the csv file. In the case of `BooksRead.csv`, 
there are 3 items in each row:   
How do you address each item in the row list named row? They are row[?] and row[??] and row[???], where you fill in the ?, ??, and ??? values.   

Here's what the output should look like:

```
read_csv_file2("BooksRead.csv")
Beckert, Sven Empire of Cotton history
Buckley, Carla The Deepest Secret mystery
Carcaterra, Lorenzo Chasers mystery
Catton, Bruce The Army of the Potomac: The Glory Road military
Cohen, Gabriel The Ninth Step mystery
Darwin, Charles Origin of Species science
Ho, Yong China: An Illustrated History history
James, Henry Daisy Miller novel
Larsson, Stieg The Girl who played with fire novel
Lewis, Michael Liar's Poker: rising through the wreckage on Wall Street economics
Messenger, Bill Elements of Jazz: From Cakewalks to Fusion music
Paulos, John Allen Innumeracy mathematics
Penzler, Otto, ed. Murder at the Racetrack  mystery
Pintoff, Stefanie Secret of the White Rose mystery
Post, Robert C. Democracy, Expertise, Academic Freedom law
Solzhenitsyn, Alexander One Day in the Life of Ivan Denisovich novel
Torrence, Bruce F. and Eve A. The Student's Introduction to Mathematica mathematics
Woods, Stewart Mounting Fears novel
```

In [None]:
import csv

def read_csv_file2(filename):
    """Reads a CSV file and prints each row without list brackets. """
    f = open(filename, 'r')
    for row in csv.reader(f):
        # Each item is already a string, so join the three columns with spaces
        print(row[0] + " " + row[1] + " " + row[2])
    f.close()

read_csv_file2("../Data/BooksRead.csv")

In [None]:
# Stand-alone script
import csv
import sys

filename = sys.argv[1]
f = open(filename,'r')
for row in csv.reader(f):
    print(row[0] + " " + row[1] + " " + row[2])
f.close()


##### Example 4: Just for Fun 😆😆😆

Let's write a four line poem. Call it `simple_poem()`. Essentially you have to write a loop around this so that you get 4 lines.
Remember that the inside or scope of the loop has to be indented 4 spaces. Follow the conventional English language sentence construction notation of `article + noun + verb + adverb` using the lists given below for each and combine them using the `random` module.

```
verbs=["are","is","goes","cooks","shoots","faints","chews","screams"]
nouns=["bear","lion","mother","baby","sister","car","bicycle","book"]
adverbs=["handily","sweetly","sourly","gingerly","forcefully","meekly"]
articles=["a","the","that","this"]

```

In [None]:
import random

verbs=["are","is","goes","cooks","shoots","faints","chews","screams"]
nouns=["bear","lion","mother","baby","sister","car","bicycle","book"]
adverbs=["handily","sweetly","sourly","gingerly","forcefully","meekly"]
articles=["a","the","that","this"]

def simple_poem():
    for _ in range(4):                   # one pass per line of the poem
        article = random.choice(articles)
        noun = random.choice(nouns)
        verb = random.choice(verbs)
        adverb = random.choice(adverbs)

        our_sentence = article + " " + noun + " " + verb + " " + adverb + "."
        our_sentence = our_sentence.capitalize()

        print(our_sentence)

simple_poem()

### Scripts and Modules

A script is a file containing Python definitions and statements for performing some analysis. Scripts are known as modules when they are intended for use in other Python programs. Many modules come with Python as part of the standard library. 

You can get a list of available modules using `help("modules")` and explore them.

In [None]:
ls

In [None]:
cd ../Scripts/

In [None]:
"""write_genes.py takes an annotation file and
writes gene names to file
Usage:
    python write_genes.py <gene_file> <out_file>"""
import sys


def getGenList(gene_file):
    """Return the gene names listed in the annotation file."""
    gene_list = []
    with open(gene_file, 'r') as humchr:
        tag = False  # Becomes True once the line starting with 'Gene' is reached
        for line in humchr:
            if line.startswith('Gene'):
                tag = True
            if tag:
                line_split = line.split()
                # Skip blank lines and lines whose first field contains '-'
                if len(line_split) != 0 and '-' not in line_split[0]:
                    gene_list.append(line_split[0])
    return gene_list[3:][:-2]  # Trim the leading header and trailing footer entries


def writeGeneList(clean_gene_list, out_file):
    """Write one gene name per line to out_file."""
    with open(out_file, 'w') as gene_names:
        for gene in clean_gene_list:
            gene_names.write(gene + '\n')
    print('Genes have been written successfully!!')


# Run only when started as a script, not when imported as a module
if __name__ == "__main__":
    if len(sys.argv) < 3:
        print(__doc__)
    else:
        clean_gene_list = getGenList(sys.argv[1])
        writeGeneList(clean_gene_list, sys.argv[2])

In [None]:

with open(gene_file, 'r') as humchr:
    with open(out_file, 'w') as gene_names:
        tag = False
        gene_list = []
        for line in humchr:
            if line.startswith('Gene'):
                tag = True
            if tag:
                line_split = line.split()
                if len(line_split) != 0:
                    if '_' in line_split[0]:
                        continue
                    else:
                        gene_list.append(line_split[0])
                        #print(gene_list)
        for gene in (gene_list[3:][:-2]):
            gene_names.writelines(gene+'\n')

In [None]:
ls -l

In [None]:
import write_genes

In [None]:
from write_genes import *

In [None]:
getGenList('../Data/humchrx.txt')

In [None]:
%%bash
python write_genes.py ../Data/humchrx.txt ../Data/gene_names2.txt

### OR

In [None]:
!python ../Scripts/write_genes.py ../Data/humchrx.txt ../Data/gene_names2.txt
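The same file is used in two ways above: it is imported for its functions and it is run as a program. Python sets the variable `__name__` to `"__main__"` only when a file is run directly, so a file can test it to decide whether to act on the command line. A minimal sketch follows; the file name `shout.py` and the functions `shout` and `main` are made up for illustration.

```python
import sys

def shout(text):
    """Return the text in upper case followed by an exclamation mark."""
    return text.upper() + "!"

def main(argv):
    """Build the program's output from a command-line style argument list."""
    if len(argv) < 2:
        return "Usage: python shout.py <text>"
    return shout(argv[1])

# True when the file is run as a script, False when it is imported
if __name__ == "__main__":
    print(main(sys.argv))
```

With this guard, `import shout` only defines the functions, while `python shout.py hello` also prints the result.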


### File handling, OS module, Shutil and Path modules

Python can also interface directly with the Linux operating system using the **os**, **shutil** and **os.path** modules.

First, let's import the `os` module.

In [None]:
import os

In [None]:
dir(os)  # Lists the names defined in os (in Jupyter, typing "os." and pressing Tab shows the same)

In [None]:
os.getcwd() #Same as pwd in bash

In [None]:
os.chdir('..') #Moves up one directory (same as cd .. in bash)

In [None]:
os.getcwd()

In [None]:
?os

In [None]:
os.listdir()

In [None]:
os.chdir("Scripts")

In [None]:
os.getcwd()

In [None]:
os.path.isdir('../Scripts')

In [None]:
os.path.isfile('./bank.py')
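The calls above depend on where the notebook happens to be running. The sketch below repeats them inside a temporary folder so the results are predictable. The folder layout (a `Scripts` folder holding an empty `bank.py`) is recreated only for illustration.

```python
import os
import tempfile

start = os.getcwd()                          # remember where we started

# Build a small, hypothetical folder layout to explore
workdir = tempfile.mkdtemp()
os.mkdir(os.path.join(workdir, "Scripts"))
with open(os.path.join(workdir, "Scripts", "bank.py"), "w") as handle:
    handle.write("# placeholder\n")

os.chdir(workdir)                            # like cd in bash
listing = os.listdir()                       # like ls: names in the current directory
is_dir = os.path.isdir("Scripts")            # True: it is a directory
is_file = os.path.isfile(os.path.join("Scripts", "bank.py"))  # True: it is a file
dir_is_file = os.path.isfile("Scripts")      # False: a directory is not a file

os.chdir(start)                              # go back to the starting directory
print(listing, is_dir, is_file, dir_is_file)
```

Changing back to the starting directory at the end keeps later relative paths such as `../Data` working.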

### path manipulation
The `path` module inside the `os` module contains functions for path manipulation. For example, you can use `path.join()` to join paths. 
- `path.exists(path):` Checks if a given path exists.
- `path.split(path):` Returns a tuple `(head, tail)`, where `tail` is the file or directory name at the end and `head` is the rest of the path.
- `path.splitext(path):` Splits off the extension of a file. It returns a tuple `(root, ext)`: the path up to the last dot, and the extension including the dot.
- `path.join(directory1,directory2,...)`: Join two or more path name components, inserting the operating system path separator as needed
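The functions listed above can be combined to build a new file name from an existing one. A short sketch follows; the `_genes` suffix is an arbitrary choice for illustration, and the result of `exists` depends on the directory you run it from.

```python
import os

path = os.path.join("Data", "humchrx.txt")     # uses the right separator for your system

head, tail = os.path.split(path)               # ("Data", "humchrx.txt")
root, ext = os.path.splitext(tail)             # ("humchrx", ".txt")

# Build a sibling file name in the same folder with the same extension
out_name = os.path.join(head, root + "_genes" + ext)

exists = os.path.exists(path)                  # True only if the file is really there
print(head, tail, root, ext, out_name, exists)
```

Using `os.path.join` instead of gluing strings together with `/` keeps the code portable between operating systems.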

In [None]:
import os
?os.path.join

Explore more in your own time.

### Shutil
Utility functions for copying and archiving files and directory trees.

In [None]:
import shutil

In [None]:
?shutil
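As a brief illustration of what `shutil` offers, the sketch below copies a file, copies a whole directory tree and then deletes the copy. It works entirely inside a temporary folder, and all file and folder names are made up for the example.

```python
import os
import shutil
import tempfile

# Hypothetical working folder with one small file
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "notes.txt")
with open(src, "w") as handle:
    handle.write("hello\n")

# Copy a single file; the destination path is returned
copy_path = shutil.copy(src, os.path.join(workdir, "notes_copy.txt"))

# Copy the whole folder to a new location that must not exist yet
backup_dir = shutil.copytree(workdir, workdir + "_backup")
backup_files = sorted(os.listdir(backup_dir))

# Remove the backup folder and everything inside it
shutil.rmtree(backup_dir)

print(copy_path, backup_files)
```

Take care with `shutil.rmtree`: it deletes the folder and its contents without asking and without a recycle bin.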

### Converting a function to a module

You can save a function as a Python module by:    
1. Creating a new Python file, e.g. `my_module.py`    
2. Copying the function into `my_module.py` 

In [None]:
def greet(name):
    print("Hello, %s!" % name)

In [None]:
greet("Grace")

In [None]:
pwd

In [None]:
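import sys
sys.path.append('Scripts')   # Folder that contains my_module.py; add it before importing

In [None]:
import my_module

In [None]:
my_module.greet("Joel")      # greet() prints the greeting itself, so no print() is needed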

Now you can import and use it in another script:

In [None]:
import my_module
my_module.greet("Ivan")      # greet() already prints; wrapping it in print() would also show None
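The steps above can be tried end to end without leaving the notebook. The sketch below writes a small module file to a temporary folder, adds that folder to `sys.path` and imports it. The module name `demo_greetings` is made up to avoid clashing with `my_module`, and this `greet` returns the greeting instead of printing it, which is a design choice of the example.

```python
import importlib
import os
import sys
import tempfile

# Step 1 and 2: create a Python file and put the function inside it
module_dir = tempfile.mkdtemp()
with open(os.path.join(module_dir, "demo_greetings.py"), "w") as handle:
    handle.write('def greet(name):\n    return "Hello, %s!" % name\n')

# Tell Python where to look, then import the module by name
sys.path.append(module_dir)
importlib.invalidate_caches()                  # make sure the new file is noticed
demo_greetings = importlib.import_module("demo_greetings")

message = demo_greetings.greet("Grace")
print(message)
```

In everyday use `import demo_greetings` does the same as the `importlib.import_module` call. Returning the string lets the caller decide whether to print it, store it or write it to a file.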

## Exercise

a. Write a function called `make_album()` that builds a dictionary describing a music album. 
The function should take in an artist name and an album title, and it should return a dictionary containing these two pieces of information. 
Use the function to make three dictionaries representing different albums. Print each return value to show that the dictionaries are storing the album information correctly.

b. Add an optional parameter to `make_album()` that allows you to store the number of tracks on an album. If the calling line includes a value for the number of tracks, add that value to the album’s dictionary. Make at least one new function call that includes the number of tracks on an album.


Write a Python function that, using a DNA sequence read from a file, answers the following questions:
1. Does the DNA string contain only four distinct letters? Show that it does.
2. The DNA string contains regions in which a single letter repeats. What is the letter and length of the longest repeating region?
3. How many occurrences of `ATG` are in the DNA string?

NB: Use the file `coding_seq.fa`. Bonus points if the script works for a multisequence fasta file.

In [None]:
dna='AGGGTTTCTCTGTGTAGCCCTGGCTGTCCTGGAACTCACTCTGTAGACCAGGCTGGCCTTGAACTCAGAAATCTGCCGGCCTCTGCCTCCCAAGTGCTGGGATTAAAGGTGTGTGCCACCACAGCTCAGGGTTCTTTTTTATCATTAAAATAATTTATTACTTTTTAGTTCATGTACATTGGTGTTTCATCTGTGTGTGTCTGTATGAAGGCTTTGGATCCCCTGGAGTTACAGACAGTTATTAGCTGCCATGTGGGTGCTGGGAATTGAACCCAGATCCTCTGGAAGAGCAGCCAGTGCTCTTAACTGCTGAGCTATTTCTCTCGCCCTGGCAGCTACTTTTCTATAGATTATTCTAATTATTTTATACAGATGAACTACAGGCTGGGGATGGGGAGATGGCTCACCAGGTGAGAGCCTTTGCCATGCAATGCCCAGAACCCAGGCTGGAAGGGAAGACCTGACCTCTACAGTCAGGCTGCAGCACCCCTGCCCCATCATGCACATACACACATAAATAAAATAAAACCCAAATGGACTCATACAGTATTTGCCTTTGTGACTAGCTTATTTTATTGAGCAATTTCACCCATAGCATTTCAAATAGAACAGCTTCAAGTGTACAGCAAAATTAAATAGATGGTACAAGGGTTTCCTAAATGTCTCCTGCCCTTGATATATTGCTTACCCCTCTCTTAAATGTTTCACTTCCTAAATAATACCTATGTGAGGTAATGCATATTTAATTGGCTAGATTTTATCATTTATGATGTGTATATAATTTTCAAATAGCATGCTGTATATGATAAATAGTTTTATCTCTCTATTTGAAATACAAATTAAACTTTACAAAGACTTCACAGCGTCTCCTGTTTATTGCAGGGGATATGTTCACTGGACTTCAGCAGACACCTAAGACTGGATAGTAGTAACCTAAGCCACAGTCTAGTCGCTCACTGTGGCCATAACATTTTAGCTACTTCCCTCCACCTTCATGTAGCTCCTGTGCATGTTTTCGTTTATACCTTAATATTTCACTTTTAGGAGGCATTGATAGAAGTGAAACTACATCTGATTCCAAATGCTACTTGTTCATTGTTGATACATAAGAAAGCATTTATTTATTTATGTATCTACCACATCCTACTTGTTGTTCAATCCAGGAGTCTTTGGTTGATCACCTTTATATGTAGACAGTCATGCCATGCAAAAACAGTTGTGTTTTCCTTCTCAGAGGCCCCTCTCCTGCTTTATCTTCCTCTTTGCTCCGCCCTCTCTCTCTTGCCCTCCCTTACCACTGTTGCCTCCTTTCCTTTCCTTTTTTCCTTTTCCTTTTTCTTGTGGTTTTCCGAGACAGGGTTTCTCCGTATAACCCTGACTGTCCTGGAACTCTCTGCCTCCCGAGTGCTGGGATTAAAGGCGTGCACCACCACCGCCCGGGTGTCTCCTTTTCTTTTATTGTTCTTTTCTTTGTTCTTTTACTACATAAACTGAGTTCCAGTATAATGTTGACAATAGAAGACATCCTTTTCTTGCTCCTGATTTTAATGGGAAAGGTCGAATGGTATGTGGTTCATGTAGACCACATTTTGTTTCCCTCTCACCCATTGATGGACACTTGGGTAGCTTCCATTTTTGGCTGTTGTGAATAATGCTGCTATGAACATGGGTGTGCACAGAGCTCTCTGAGACGCTGCTTTCAGTCCTTCTGGCAGTAGATCTTCATGGAGGAGCACGGAGTGACCCAAACTGAACACATGGCTACCATAGAAGCCCATGCAGTGGCCCAGCAAGTCCAGCAGGTCCATGTAGCCACGTACACTGAGCACAGTATGCTAAGTGCTGATGAAGACTCCCCTTCCTCCCCCGAGGACACTTCTTATGATGACTCGGACATCCTCAACTCCACGGCAGCTGATGAGGTAACTGCCCATCTGGCTGCTGCAGGTCCTGTGGGAATGGCCGCTGCTGCTGCTGTGGCAACAGGGAAGAAAC
GGAAACGGCCTCATGTGTTTGAGTCTAATCCATCTATCCGAAAGAGACAGCAGACACGTTTGCTTCGGAAACTCAGAGCCACGTTGGATGAGTACACGACGCGAGTGGGACAGCAAGCGATTGTACTCTGCATCTCACCCTCCAAACCCAACCCTGTCTTCAAGGTGTTTGGCGCAGCACCTTTGGAGAATGTGGTGCGAAAGTACAAGAGCATGATCCTGGAAGACCTCGAGTCTGCTCTGGCAGAACACGCCCCTGCGCCACAGGAGGTTAATTCAGAGCTGCCGCCTCTCACCATCGATGGGATTCCAGTCTCTGTGGACAAAATGACCCAGGCTCAGCTTCGGGCATTTATCCCAGAGATGCTCAAGTATTCCACAGGTCGGGGGAAACCAGGCTGGGGGAAAGAAAGCTGCAAGCCTATCTGGTGGCCAGAAGATATCCCATGGGCCAATGTCCGCAGTGATGTCCGCACAGAAGAGCAAAAACAAAGGGTTTCATGGACCCAGGCATTACGGACCATAGTTAAAAATTGCTATAAGCAACATGGGCGGGAGGATCTTTTATATGCTTTTGAAGATCAGCAAACACAAACTCAGGCCACCACCACACACAGTATAGCTCATCTCGTACCATCACAGACCGTAGTACAGACCTTCAGCAACCCTGATGGCACCGTGTCGCTCATCCAGGTTGGTACAGGGGCAACAGTAGCCACATTGGCTGATGCTTCAGAACTGCCAACCACAGTCACTGTTGCCCAAGTGAATTACTCTGCTGTGGCTGATGGAGAGGTGGAACAAAACTGGGCCACGTTACAGGGCGGTGAAATGACCATCCAGACGACGCAAGCATCAGAGGCCACCCAGGCGGTAGCATCACTGGCAGAAGCCGCAGTGGCAGCTTCTCAGGAGATGCAGCAGGGAGCCACTGTCACCATGGCCCTCAACAGTGAAGCTGCCGCCCATGCTGTCGCCACTCTGGCGGAAGCCACCTTACAAGGTGGGGGACAGATAGTCCTGTCTGGGGAAACCGCAGCAGCCGTCGGAGCACTTACTGGAGTCCAAGATGCTAATGGCCTGGTCCAGATCCCTGTGAGCATGTACCAGACTGTGGTAACCAGCCTCGCCCAGGGCAACGGGCCGGTGCAGGTGGCCATGGCCCCAGTGACCACCAGGATATCGGACAGCGCAGTCACCATGGATGGCCAGGCTGTGGAGGTGGTGACCTTGGAACAGTAGCATGGAGCTCTATCATGGCAGCGTTTTCTAGTCTACTGCAGAATTTTTTACATGTTTGCAGAGGTGCAATCAAATGGAATTAAGTCTCTCGACTTGGAAAGAAAGTTTTGGTAACCTTTTTTTAAGAAGGAAGAAAGGCAGCAGATTTTGGAATCACACTTTTTTAAAGCACCACTCTGGGATCTGGTGGAATGAACGCCACCGATTTCACTGTCCCAAAAAGCCAAATTGTGGCCAGACTTCTTTGTGCAGAAATGTGTGTATACTTACGTGTGTGTACGTGTGAGTGTGAATATATGTATATGTGTACATATGGACATACACATTTACATATATGTATAAAGTATATATGTACATACATACATATGTATGAAACCTGCATGGAATTACCTGTATGAAATCAAGGTGAACTGTGGGAACAAGAACCCACCCAGATTCGTGGGTGGTAGGGTACATGACCAAACACAGTCACCTGGTTTTCGTTCATACCAGGGTCATGCATTGAGCTACTGACAGACTCAGGCGGAGGTGACCACGTCCTTCACCAAAGCTGCCTCCCAGTGGCCGCCTAGACCTCTGCTAGATTCACCGAAGGAAGGAAGATCCAGGACACAGCGTGGTCCAGAGAGTGCTTGTGAAGTCCAGGGACAGAGAGTGCGTGCGCACATGTGCGCTTTGCCAGCAGAGACACACGGCAGCTGGCCCAGGTGCTGACCTTGCCACAGGCAGGTAAACGCCCTGCAGGCTCCTGGCAGGGGC
AAGAAATCGTTCCTCAGCCTCCATCTTCTCCCTTCCCAGGAACCCTCAGTCTCACGACTATTCAAGAGTTGCTTGGTTGTAAGGTCAGTCCTGTTACAAACTGAAGGTGACAGAAGTGTTAAGGGTCTGAGGAGTGTTCATGGAGCAGGCGGGTGTAAGTGCAGGGTGTGTGTGTGTGTGTGTGTGTGTGTGTATGAGTAATGGAGAAAATGGGAAGATTATAGGAGAGCAAAATAGGAAGGAGGGAGAAAACTCTTCATAAATCAGGGTGCGCCGTGGGAACCGTGTTCTCCAGCTGTCTGCAGCTGTATTTCAGCAGAGGAGACTGCCTCACACAGGACCTCTGCGCAAAGGCTGGCCGTCACAGATGTGTCAGAAGACTCTGTGAGGACTTTTCCCAGGCACATCCTGGCGGCACAGGCCTGGGACAGCTTTCCTGCTCACAGTGTGGCTTGCACTGAGCAGTCATTGTCACTGTGAGCTTCTGTGCTTTCCAGCCACAAGCCCTGAGTCTCCCGTGGCTCATTCATCTGATGTCTTGACAAGCCAAATCTCCACTCCTGGCGTGCAGGGACTCTTCCTCCTTCCTGCCAGCCCTCTCCCGTGCGTGATAGTGTATTTAATGTGGTGTTTTTGGTTTTTTGTTTTTTAATGAGACATTAAAAGATTCTTCATGTCTTGCTCAGCCTTTGAGAAAAGTTTCCAATTCTTATATTTGCTTGTTTTATATAAAACTATTCAATGTTCTTTGTATGTTCTTTTCTGTATGTGATAAGGGAGGGGTGGGAAATTTGCATATCAATGTCCTGGTTCTACAATTGGTTACTTTTTTTTTTTTTTTAAACTGTGAAGCTGTCCAGGGGCTTTAAGGCCCGTGTTCCTTTGTGGTGAAATAAGCCTCCCGATAGTTTGAGAAATTGCCAAGAAGATAAAAGCAAGATCCCAGCAGCAGAGCATGGAATCTGTGTTGTTCTCCATTCTGTCTAAACTGCCTCATTCAATAAATAGTTTAATGTGGCGAC'

Write a function that takes two arguments – a protein sequence and an amino acid residue code – and returns the percentage of the protein that the amino acid makes up. Use the following assertions to test your function.

```python
assert my_function("MSRSLLLRFLLFLLLLPPLP", "M") == 5
assert my_function("MSRSLLLRFLLFLLLLPPLP", "r") == 10
assert my_function("MSRSLLLRFLLFLLLLPPLP", "L") == 50
assert my_function("MSRSLLLRFLLFLLLLPPLP", "Y") == 0
```