# Lecture07 assignment

#### 1. Print Specific Files from a Zip Archive
Question: Write a Python script that collects only the **.fna.gz** files whose names do not contain the substring **cds** from a zip archive named **assignment_data.zip**. Store each file name and its uncompressed size in bytes in a list named **output**. (5 points)

Hints:

- Use ZipFile to iterate over files in the archive and filter them based on the specified conditions.
- Use the ZipInfo.<span style="color:#3d85c6">file_size</span> attribute to get the uncompressed size of each selected file.
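
The two hints can be tried without the assignment archive. The sketch below builds a tiny zip in memory and applies the same filter; the member names and contents are invented for the demo and are not taken from **assignment_data.zip**.

```python
import io
from pathlib import PurePosixPath
from zipfile import ZipFile

# Build a small archive in memory; these member names are hypothetical.
buffer = io.BytesIO()
with ZipFile(buffer, "w") as archive:
    archive.writestr("demo/GCF_1/GCF_1_genomic.fna.gz", b"A" * 10)
    archive.writestr("demo/GCF_1/cds_from_genomic.fna.gz", b"C" * 20)
    archive.writestr("demo/GCF_1/report.jsonl", b"{}")

selected = []
with ZipFile(buffer) as archive:
    for info in archive.infolist():  # one ZipInfo object per member
        name = info.filename
        # keep .fna.gz members whose path does not mention "cds"
        if name.endswith(".fna.gz") and "cds" not in name:
            # file_size is the uncompressed size in bytes
            selected.append((PurePosixPath(name).name, info.file_size))

print(selected)  # [('GCF_1_genomic.fna.gz', 10)]
```

Member names inside a zip always use forward slashes, which is why `PurePosixPath` is a reasonable choice for taking the base name here.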



In [None]:
#code without Path
# import re
# from zipfile import ZipFile as zf

# output = []
# with zf("assignment_data.zip", "r") as inZF:
#     all_file_list = inZF.infolist()
#     all_file_list_cleaned = []
#     # remove folders from all_file_list
#     for file in all_file_list:
#         if not file.is_dir():
#             all_file_list_cleaned.append(file)
#     # print([x for x in all_file_list if x not in all_file_list_cleaned])

#     # filter using re pattern
#     # output is a list of tuples [(,), (,)]

#     pattern = r"GCF_\w+\.\w+_genomic\.fna\.gz"
#     for file in all_file_list_cleaned:
#         matches = re.search(pattern, file.filename)
#         if matches:
#             # print(matches[0], file.file_size)
#             output.append((matches[0],file.file_size))

In [143]:
#import re
from zipfile import ZipFile as zf
from pathlib import Path

output = []
with zf("assignment_data.zip", "r") as inZF:
    all_file_list = inZF.infolist()
    
    for file in all_file_list:
        # keep .fna.gz files whose path does not contain 'cds'
        if file.filename.endswith('.fna.gz') and 'cds' not in file.filename:
            fname = Path(file.filename)
            output.append((fname.name, file.file_size))
            #print(fname.name, file.file_size)


In [144]:
## Expected output
for fname, fsize in output:
    print(f"{fname}, file size: {fsize} bytes") ## print result


GCF_000091345.1_ASM9134v1_genomic.fna.gz, file size: 475447 bytes
GCF_000013245.1_ASM1324v1_genomic.fna.gz, file size: 483245 bytes
GCF_000011725.1_ASM1172v1_genomic.fna.gz, file size: 478748 bytes
GCF_000093185.1_ASM9318v1_genomic.fna.gz, file size: 480190 bytes
GCF_000021165.1_ASM2116v1_genomic.fna.gz, file size: 500790 bytes
GCF_000008785.1_ASM878v1_genomic.fna.gz, file size: 495396 bytes
GCF_000023805.1_ASM2380v1_genomic.fna.gz, file size: 472639 bytes
GCF_000020245.1_ASM2024v1_genomic.fna.gz, file size: 483848 bytes
GCF_000008525.1_ASM852v1_genomic.fna.gz, file size: 502194 bytes


---

#### 2. Convert Date String to Datetime
Question: Write a Python script that converts a string representing a date and time (**"2024-08-26 14:30:00"**) into a datetime object. Then, print the formatted string **'Mon Aug 26 14:30:00 2024'** using the strftime method. (2 points)

Hints:

- Use the <span style="color:green">**datetime.strptime()**</span> method to parse the date string.
- Use the <span style="color:green">**strftime()**</span> method to format the datetime object.
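
As a quick illustration of how the two methods pair up, here is a minimal sketch on a different, made-up timestamp. Parsing and then formatting with the same format string should return the original text.

```python
from datetime import datetime

stamp = "2023-01-05 09:07:03"   # hypothetical input, not the assignment's date
fmt_in = "%Y-%m-%d %H:%M:%S"    # year-month-day hour:minute:second

parsed = datetime.strptime(stamp, fmt_in)    # str -> datetime
print(parsed.year, parsed.month, parsed.day)
print(parsed.strftime(fmt_in) == stamp)      # datetime -> str round trip
print(parsed.strftime("%d/%m/%Y %H:%M"))     # same moment, different layout
```

Directives such as `%a` and `%b` produce weekday and month names that depend on the active locale, so their output may differ on non-English systems.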



In [145]:
from datetime import datetime

date_str = "2024-08-26 14:30:00"
## get datetime obj.
date_object = datetime.strptime(date_str, "%Y-%m-%d %H:%M:%S")
#print(date_object)
## create output
date_out = date_object.strftime("%a %b %d %H:%M:%S %Y")


In [146]:
print(date_out)

Mon Aug 26 14:30:00 2024


---

#### 3. Measure Script Execution Time
Question: Write a Python script that measures and compares the execution time of two loops:

- A loop that iterates directly over a <span style="color:green">**range**</span> of 100,000,000 numbers (0 to 99,999,999).
- A loop that iterates over a <span style="color:green">**list**</span> created from the same range of numbers.
  
Your task is to measure and print the execution time of both loops to determine the difference in performance between iterating over a range object and iterating over a list created from that range. (2 points)

Hints:

- Use the <span style="color:green">**time.time()**</span> function to capture the start and end times for both loops.
- Consider the memory and computational differences between using a <span style="color:green">**range**</span> and a <span style="color:green">**list**</span>.
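
The memory side of the second hint can be seen with `sys.getsizeof`. This is a rough sketch that uses a smaller `n` than the assignment so it runs quickly; the variable names are illustrative only.

```python
import sys

n = 1_000_000            # smaller than the assignment's 100,000,000
lazy = range(n)          # stores only start, stop and step
eager = list(lazy)       # materializes every element

range_bytes = sys.getsizeof(lazy)
# getsizeof on a list counts the list object and its pointer array,
# not the integer objects the pointers refer to
list_bytes = sys.getsizeof(eager)

print(f"range object: {range_bytes} bytes")
print(f"list object:  {list_bytes} bytes")
```

The range object stays the same size however large `n` is, while the list grows with the number of elements.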

In [156]:

import time

# Measure the time it takes to loop over a range of 100,000,000 numbers
start_time = time.time()
for i in range(100000000):
    pass
end_time = time.time()
print(f"Execution Time (range): {end_time - start_time} seconds")


# Measure the time it takes to create a list from a range of 100,000,000 numbers and loop over it
start_time = time.time()
for i in list(range(100000000)):
    pass
end_time = time.time()
print(f"Execution Time (list): {end_time - start_time} seconds")



Execution Time (range): 1.7435460090637207 seconds
Execution Time (list): 3.7202017307281494 seconds
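
The list timing above includes the cost of building the list. One way to see where the extra time goes is to time construction and iteration separately. The sketch below swaps in `time.perf_counter()`, a clock intended for measuring short intervals, in place of the `time.time()` suggested by the hint, and uses a smaller `n`; the measured values will vary between machines.

```python
import time

n = 1_000_000  # smaller than the assignment's 100,000,000

start = time.perf_counter()
for _ in range(n):
    pass
range_loop = time.perf_counter() - start

start = time.perf_counter()
numbers = list(range(n))        # construction only
build = time.perf_counter() - start

start = time.perf_counter()
for _ in numbers:               # iteration only
    pass
list_loop = time.perf_counter() - start

print(f"loop over range:  {range_loop:.4f} s")
print(f"build the list:   {build:.4f} s")
print(f"loop over list:   {list_loop:.4f} s")
```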


---

#### 4. Persistent Data Storage with pickle
Question: Write a Python script that serializes a list of dictionaries containing student names and their grades into a file named **students.pkl** using the pickle module. Then, write another script to deserialize the data and print it. (1 point)

Hints:

- Use <span style="color:green">**pickle.dump()**</span> to serialize the data.
- Use <span style="color:green">**pickle.load()**</span> to deserialize the data.
  
**Expected Outcome:** The script should store the list of dictionaries in **students.pkl** and then load and print the data.
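
The same round trip can be checked in memory with `pickle.dumps()` and `pickle.loads()`, the bytes-based counterparts of `dump()` and `load()`. The records below are made up for the sketch.

```python
import pickle

records = [{"name": "Ann", "grade": "A"}, {"name": "Bob", "grade": "B"}]  # made-up data

blob = pickle.dumps(records)     # serialize to a bytes object
restored = pickle.loads(blob)    # rebuild an equal Python object

print(type(blob).__name__)       # bytes
print(restored == records)       # True: same content
print(restored is records)       # False: a new object was created
```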

In [148]:
import pickle

# List of dictionaries to serialize
students = [
    {'name': 'John', 'grade': 'A'},
    {'name': 'Jane', 'grade': 'B'},
    {'name': 'Doe', 'grade': 'C'}
]

# Serialize the list to a file
with open('students.pkl', 'wb') as outputfile:
    pickle.dump(students, outputfile)

# Deserialize the list from the file
with open('students.pkl', 'rb') as infile:
    loaded_students = pickle.load(infile)


In [149]:
print(loaded_students)

[{'name': 'John', 'grade': 'A'}, {'name': 'Jane', 'grade': 'B'}, {'name': 'Doe', 'grade': 'C'}]


---

#### 5. Extracting and Processing JSONL Files from a Zip Archive to Generate a TSV Report
Question: Complete the Python code below so that it reproduces the output shown. Upload **data_table.tsv** together with the completed Python code. (10 points)

### Please complete the Python code:

In [28]:
from zipfile import ZipFile
from pathlib import Path
import re, csv
import json
import pandas as pd

## Extract only the .jsonl files from assignment_data.zip
jsonl_list = []
pattern = r'.*\.jsonl$'  # '$' anchors the match at the end of the file name
data_info = {
                'assembly_id' : [],
                'genbank_id' : [],
                'moleculeType' : [],
                'length' : []
            }

with ZipFile('assignment_data.zip') as zipf:
    ## loop over file in ZipFile
    files = zipf.infolist()
    #print(files)
    for f in files:
        #print(f.filename)
        matches = re.search(pattern, f.filename)
        if matches:
            #print(matches[0])
            jsonl_list.append(matches[0])
            zipf.extract(matches[0])
#print(jsonl_list)
paths = Path('./assignment_data/data/')
with open('data_table.tsv','w', newline='') as csvOut:
    writer = csv.writer(csvOut, delimiter="\t")## create writer object
    writer.writerow(['assembly_id', 'genbank_id', 'moleculeType', 'length']) ## write header
    ## read file.jsonl
    for file_path in jsonl_list:
        with open(file_path, 'r') as f:
            for line in f:
                data = json.loads(line)
                data_info['assembly_id'].append(data['assemblyUnit'])
                data_info['genbank_id'].append(data['genbankAccession'])
                data_info['moleculeType'].append(data['assignedMoleculeLocationType'])
                data_info['length'].append(data['length'])

    ## build the DataFrame once, after every file has been read
    df = pd.DataFrame(data_info)

    for index, row in df.iterrows(): ## write the information from jsonl to tsv file
        writer.writerow([row['assembly_id'], row['genbank_id'], row['moleculeType'], row['length']])

#print(data_info)

In [30]:
#without pandas
## NOTE: this cell reuses data_info, which is filled by the previous cell
from zipfile import ZipFile
from pathlib import Path
import re, csv

jsonl_list = []
pattern = r'.*\.jsonl$'

with ZipFile('assignment_data.zip') as zipf:
    ## loop over file in ZipFile
    files = zipf.infolist()
    #print(files)
    for f in files:
        #print(f.filename)
        matches = re.search(pattern, f.filename)
        if matches:
            #print(matches[0])
            jsonl_list.append(matches[0])
            zipf.extract(matches[0])
#print(jsonl_list)
paths = Path('./assignment_data/data/')
## turn the column-wise dict into a list of rows
data_info_list = []
for i in range(len(data_info['assembly_id'])):
    record = [data_info['assembly_id'][i], data_info['genbank_id'][i], data_info['moleculeType'][i], data_info['length'][i]]
    data_info_list.append(record)

with open('data_tabl.tsv', 'w', newline='') as csvOut:
    writer = csv.writer(csvOut, delimiter="\t") ## create writer object
    writer.writerow(['assembly_id', 'genbank_id', 'moleculeType', 'length']) ## write header
    writer.writerows(data_info_list) ## write the records

print(Path('data_tabl.tsv').read_text())


assembly_id	genbank_id	moleculeType	length
GCF_000091355.1	FM991728.1	Chromosome	1576758
GCF_000013255.1	CP000241.1	Chromosome	1596366
GCF_000013255.1	CP000242.1	Plasmid	9370
GCF_000011735.1	CP000012.1	Chromosome	1589954
GCF_000093195.1	CP001582.1	Chromosome	1588278
GCF_000093195.1	CP001583.1	Plasmid	7326
GCF_000021175.1	CP001173.1	Chromosome	1652982
GCF_000021175.1	CP001174.1	Plasmid	10031
GCF_000008795.1	AE001439.1	Chromosome	1643831
GCF_000023815.1	CP001680.1	Chromosome	1568826
GCF_000020255.1	CP001072.2	Chromosome	1608548
GCF_000008535.1	AE000511.1	Chromosome	1667867



### Output from your python code

In [152]:
for p in paths.rglob("*.*"): # any name containing a dot, so the GCF_*.1 directories match too
    print(p)

assignment_data\data\GCF_000008525.1
assignment_data\data\GCF_000008785.1
assignment_data\data\GCF_000011725.1
assignment_data\data\GCF_000013245.1
assignment_data\data\GCF_000020245.1
assignment_data\data\GCF_000021165.1
assignment_data\data\GCF_000023805.1
assignment_data\data\GCF_000091345.1
assignment_data\data\GCF_000093185.1
assignment_data\data\GCF_000008525.1\sequence_report.jsonl
assignment_data\data\GCF_000008785.1\sequence_report.jsonl
assignment_data\data\GCF_000011725.1\sequence_report.jsonl
assignment_data\data\GCF_000013245.1\sequence_report.jsonl
assignment_data\data\GCF_000020245.1\sequence_report.jsonl
assignment_data\data\GCF_000021165.1\sequence_report.jsonl
assignment_data\data\GCF_000023805.1\sequence_report.jsonl
assignment_data\data\GCF_000091345.1\sequence_report.jsonl
assignment_data\data\GCF_000093185.1\sequence_report.jsonl


In [153]:
print( Path('data_table.tsv').read_text() )

assembly_id	genbank_id	moleculeType	length
GCF_000091355.1	FM991728.1	Chromosome	1576758
GCF_000013255.1	CP000241.1	Chromosome	1596366
GCF_000013255.1	CP000242.1	Plasmid	9370
GCF_000011735.1	CP000012.1	Chromosome	1589954
GCF_000093195.1	CP001582.1	Chromosome	1588278
GCF_000093195.1	CP001583.1	Plasmid	7326
GCF_000021175.1	CP001173.1	Chromosome	1652982
GCF_000021175.1	CP001174.1	Plasmid	10031
GCF_000008795.1	AE001439.1	Chromosome	1643831
GCF_000023815.1	CP001680.1	Chromosome	1568826
GCF_000020255.1	CP001072.2	Chromosome	1608548
GCF_000008535.1	AE000511.1	Chromosome	1667867
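
The JSON Lines to TSV step from question 5 can also be tried without the archive. The sketch below parses two invented records held in a string and writes them to an in-memory buffer; the key names match those read by the solution, but the values are hypothetical.

```python
import csv
import io
import json

# Two made-up JSONL records, one JSON object per line.
jsonl_text = (
    '{"assemblyUnit": "GCF_X.1", "genbankAccession": "AB000001.1", '
    '"assignedMoleculeLocationType": "Chromosome", "length": 1000}\n'
    '{"assemblyUnit": "GCF_X.1", "genbankAccession": "AB000002.1", '
    '"assignedMoleculeLocationType": "Plasmid", "length": 50}\n'
)

out = io.StringIO()
writer = csv.writer(out, delimiter="\t", lineterminator="\n")
writer.writerow(["assembly_id", "genbank_id", "moleculeType", "length"])

for line in io.StringIO(jsonl_text):
    if not line.strip():         # skip blank lines
        continue
    rec = json.loads(line)       # each line is parsed on its own
    writer.writerow([
        rec["assemblyUnit"],
        rec["genbankAccession"],
        rec["assignedMoleculeLocationType"],
        rec["length"],
    ])

tsv_text = out.getvalue()
print(tsv_text)
```

Writing each row as soon as its line is parsed avoids holding every record in memory, which may matter for larger reports.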

