# DX 602 Week 4 Homework



## Introduction

In this homework, you will practice working with strings to represent data, and reading and writing files to access and store data.

You may find it helpful to refer to this GitHub repository of Jupyter notebooks for sample code.

* https://github.com/bu-cds-omds/dx602-examples

Any calculations demonstrated in code examples or videos may be found in these notebooks, and you are allowed to copy this example code in your homework answers.

## Instructions

You should replace every instance of "..." below.
These are where you are expected to write code to answer each problem.

After some of the problems, there are extra code cells that will test functions that you wrote so you can quickly see how they run on an example.
If your code works on these examples, it is more likely to be correct.
However, the autograder will test different examples, so working correctly on these examples does not guarantee full credit for the problem.
You may change the example inputs to further test your functions on your own.
You may also add your own example inputs for problems where we did not provide any.

Be sure to run each code block after you edit it to make sure it runs as expected.
When you are done, we strongly recommend you run all the code from scratch (Runtime menu -> Restart and Run all) to make sure your current code works for all problems.

If your code raises an exception when run from scratch, it will interfere with the auto-grader process, causing you to lose some or all points for this homework.
Please ask for help in YellowDig or schedule an appointment with a learning facilitator if you get stuck.


### Submission

To submit your homework, take the following steps.

1. Save and commit this notebook.
2. Push your changes to GitHub.
3. Confirm that your changes are visible in GitHub.
4. Delete the codespace to avoid wasting your free quota.

The auto-grading process usually completes within a few minutes of pushing to GitHub, but can occasionally take up to an hour.
If you submit your homework early enough, you may review the auto-grading results and fix any mistakes before the deadline.


#### Shared Imports

These common imports will be useful for some problems.
You may add other imports, but you should not try to install new modules not available in our Codespaces environment.

In [2]:
import csv
import json
import re

### Problem 1

Set the variable `p1` to the length of the string `q1`.

In [4]:
# DO NOT CHANGE

q1 = "Hello, I am a robot living in this notebook."

In [5]:
# YOUR CHANGES HERE

p1 = len(q1)

### Problem 2

The variable `q2` below contains a line of text from a comma-separated value file.
Set `p2` to the number of fields in `q2`.

Hint: Is there a string method that can separate the fields in `q2` into a list?
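
As a minimal sketch of the idea in the hint, `str.split` breaks a line into a list of fields. The `sample_line` value below is a made-up example, not the homework data. Note that a trailing newline stays attached to the last field unless it is stripped first.

```python
# Hypothetical sample line (not the homework's q2).
sample_line = "7,blue,n/a,true\n"

# Remove the trailing newline, then split on commas.
fields = sample_line.rstrip("\n").split(",")

print(fields)       # ['7', 'blue', 'n/a', 'true']
print(len(fields))  # 4
```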

In [6]:
# DO NOT CHANGE

q2 = "35,Hello,red,153.2,n/a,true,true,true,false,154,92,2024-09-01,2020-03-06,confirmed,n/a,F,100,100\n"

In [7]:
# YOUR CHANGES HERE

p2 = len(q2.split(","))

In [8]:
p2

18

### Problem 3

Write a function `p3` that takes in a string as input, and returns `True` if the string contains "silly" and the length of the string is at most 50 characters and `False` otherwise.
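
One pitfall worth keeping in mind: `str.find` returns an index (or -1 when nothing is found), so its result should not be used directly as a truth value. The sketch below uses made-up strings to show how it differs from the `in` operator.

```python
# str.find returns the index of the first match, or -1 if there is none.
index_at_start = "silly goose".find("silly")
index_missing = "serious goose".find("silly")
print(index_at_start, bool(index_at_start))  # 0 False
print(index_missing, bool(index_missing))    # -1 True

# The `in` operator gives a plain True/False answer.
starts_with_match = "silly" in "silly goose"
has_no_match = "silly" in "serious goose"
print(starts_with_match, has_no_match)       # True False
```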

In [9]:
# YOUR CHANGES HERE

def p3(joke):
    # `in` tests for the substring; "at most 50" means the length may equal 50.
    return "silly" in joke and len(joke) <= 50

In [10]:
p3("this is a silly joke")

True

In [11]:
p3("this joke is so boring because it drones on and on and on and on forever as if noone is really reading this amirite?")

False

### Problem 4

In the video "Parsing Numbers from Strings", you saw the `ord` function used to map individual characters to their character codes.
You can reverse this operation with the `chr` function.

In [12]:
ord('🦄')

129412

In [13]:
chr(129412)

'🦄'

The Unicode characters with codes 128200, 128201, and 128202 are all emoji related to data science.
Set `p4` to the concatenation of these three emoji characters together.
That is, make `p4` with those three characters in that order and just those three characters.

In [15]:
# YOUR CHANGES HERE

p4 = chr(128200) + chr(128201) + chr(128202)

In [16]:
p4

'📈📉📊'

### Problem 5

Set `p5` to be a copy of the variable `q5` after replacing "jumped over" with "greeted" and "lazy" with "friendly".

In [17]:
# DO NOT CHANGE

q5 = "The quick brown fox jumped over the lazy brown dog."

In [18]:
# YOUR CHANGES HERE

p5 = q5.replace("jumped over", "greeted").replace("lazy", "friendly")

In [19]:
p5

'The quick brown fox greeted the friendly brown dog.'

### Problem 6

Write a function `p6` that takes in a filename as an argument, reads it as a TSV with a header row, and returns an iterator of dictionaries like in the example code.
Each value should be parsed as an integer.
If a value does not parse successfully, set the value to None.
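
The parse-or-fall-back behavior described above can be sketched with `try`/`except`. The helper name `parse_int_or_none` is hypothetical and only illustrates the pattern on single strings rather than on a file.

```python
def parse_int_or_none(text):
    """Hypothetical helper: return int(text), or None if text is not an integer."""
    try:
        return int(text)
    except (TypeError, ValueError):
        # ValueError: text such as "4.5" or "n/a"; TypeError: a missing value (None).
        return None

print(parse_int_or_none("42"))   # 42
print(parse_int_or_none("4.5"))  # None
print(parse_int_or_none("n/a"))  # None
print(parse_int_or_none(None))   # None
```

Catching only `TypeError` and `ValueError`, rather than using a bare `except`, keeps unrelated errors visible.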

In [20]:
# YOUR CHANGES HERE

def p6(filename):
    with open(filename, newline="") as file:
        reader = csv.DictReader(file, dialect="excel-tab")
        for row in reader:
            for column_name in row:
                try:
                    # The problem asks for integers, so parse with int().
                    row[column_name] = int(row[column_name])
                except (TypeError, ValueError):
                    # Missing or non-integer values become None.
                    row[column_name] = None

            yield row

In [21]:
list(p6("data6_a.tsv"))

[{'a': 3, 'b': 4}]

In [22]:
list(p6("data6_b.tsv"))

[{'a': None, 'b': None, 'c': None}]

### Problem 7

Write a function `p7` that takes in a filename as an argument, reads it as a TSV with a header row, and returns an iterator of dictionaries like in the example code.
Each value should be parsed as an integer, and if the value does not parse successfully, set the value to 3.
In addition, add a new key “finagled” to each dictionary with value True if any value did not parse successfully and False otherwise.

Hints:
1. Reuse the function `p6` that you wrote previously to handle the shared parsing work.
2. Use `is None` to check for parsing failures.
3. For the new finagled flag, set it to False initially, and change the value to True if you find a parsing failure.
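
The flag pattern from the hints can be sketched on in-memory data. The `parsed_rows` list below is hypothetical; it stands in for rows where an earlier parsing step already stored None for every failure.

```python
# Hypothetical rows, as if an earlier parsing step had already stored None
# for every value that failed to parse.
parsed_rows = [{"a": 1, "b": None}, {"a": 5, "b": 6}]

flagged_rows = []
for row in parsed_rows:
    had_failure = False        # start by assuming nothing failed
    cleaned = {}
    for key, value in row.items():
        if value is None:      # `is None` detects a parsing failure
            cleaned[key] = 3   # fallback value from the problem statement
            had_failure = True
        else:
            cleaned[key] = value
    cleaned["finagled"] = had_failure
    flagged_rows.append(cleaned)

print(flagged_rows)
# [{'a': 1, 'b': 3, 'finagled': True}, {'a': 5, 'b': 6, 'finagled': False}]
```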


In [23]:
# YOUR CHANGES HERE

def p7(filename):
    with open(filename, newline="") as file:
        reader = csv.DictReader(file, dialect="excel-tab")
        for row in reader:
            finagled = False
            for column_name in row:
                try:
                    row[column_name] = int(row[column_name])
                except (TypeError, ValueError):
                    # Parsing failed: substitute 3 and remember the failure.
                    row[column_name] = 3
                    finagled = True
            row["finagled"] = finagled

            yield row

In [24]:
list(p7("data7_a.tsv"))

[{'a': 3, 'b': 4, 'finagled': False}]

In [25]:
list(p7("data7_b.tsv"))

[{'a': 3, 'b': 3, 'c': 3, 'finagled': True}]

### Problem 8

Write a function `p8` that takes in three inputs: an input filename, an output filename, and a list of column names.
The function should read the input file using the TSV format and write the output file using the TSV format with just the specified input column names.
The output file should have the columns in the same order as the input column name list.
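
One possible way to select and reorder columns is sketched below with in-memory text instead of files. The column names and data are made up, and `extrasaction="ignore"` is just one option for dropping unwanted keys; building a filtered dictionary per row works as well.

```python
import csv
import io

# Hypothetical in-memory TSV input standing in for a file on disk.
source = io.StringIO("id\theight\twidth\n1\t45\t23\n2\t62\t15\n")
target = io.StringIO()

wanted = ["width", "height"]  # output order follows this list

reader = csv.DictReader(source, dialect="excel-tab")
# extrasaction="ignore" tells DictWriter to skip keys not in fieldnames.
writer = csv.DictWriter(target, fieldnames=wanted, dialect="excel-tab",
                        extrasaction="ignore", lineterminator="\n")
writer.writeheader()
for row in reader:
    writer.writerow(row)

result = target.getvalue()
print(result)
```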


In [26]:
# YOUR CHANGES HERE

def p8(input_filename, output_filename, column_names):
    with open(input_filename, newline='') as infile, open(output_filename, 'w', newline='') as outfile:
        reader = csv.DictReader(infile, dialect="excel-tab")
        writer = csv.DictWriter(outfile, fieldnames=column_names, dialect="excel-tab")
        writer.writeheader()
        for row in reader:
            filtered_row = {col: row[col] for col in column_names}
            writer.writerow(filtered_row)

You can use the next two cells to test your function.

In [27]:
# test p8
p8("input-8.tsv", "output-8.tsv", ["height", "width", "color"])

In [28]:
try:
    with open("output-8.tsv") as check_fp:
        for line in check_fp:
            print(line.rstrip("\n"))
except FileNotFoundError:
    print("file not found")

height	width	color
45	23	red
62	15	blue
23	123	green


### Problem 9

Write a function `p9` that takes in a filename as an argument, reads it as a TSV with a header row, and returns the number of rows with data.

Hint:
* This should be simple, but make sure not to count blank lines.
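
To see how blank lines behave, the sketch below feeds made-up TSV text to `csv.DictReader`. Completely empty lines are skipped by the reader, but a line containing only whitespace still produces a row. Whether the auto-grader treats whitespace-only lines as blank is an assumption here.

```python
import csv
import io

# Hypothetical TSV text with one empty line and one whitespace-only line.
text = "a\tb\n1\t2\n\n \n3\t4\n"

rows = list(csv.DictReader(io.StringIO(text), dialect="excel-tab"))
print(len(rows))  # 3: the empty line is skipped, the whitespace-only line is not

# Keep only rows where at least one field has non-whitespace content.
data_rows = [
    row for row in rows
    if any(isinstance(value, str) and value.strip() for value in row.values())
]
print(len(data_rows))  # 2
```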

In [17]:
# YOUR CHANGES HERE

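def p9(filename):
    with open(filename, newline="") as file:
        reader = csv.DictReader(file, dialect="excel-tab")
        count = 0
        for row in reader:
            # Count the row only if at least one field has non-whitespace content.
            if any(isinstance(value, str) and value.strip() for value in row.values()):
                count += 1
        return count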


In [21]:
p9("data9_a.tsv")

1

In [19]:
p9("data9_b.tsv")

0

### Problem 10

Write a function `p10` that takes in a filename as an argument, and returns `True` if the file is formatted as a TSV file and `False` otherwise.


Hint: You can do this just looking at the first line of the file.

In [32]:
# YOUR CHANGES HERE

def p10(filename):
    with open(filename) as file:
        lines = file.readlines()
        if not lines:
            return False
    num_columns = len(lines[0].strip().split("\t"))
    if num_columns < 2:
        return False
    for line in lines:
        if len(line.strip().split("\t")) != num_columns:
            return False

    return True

In [33]:
p10("data10_a.tsv")

True

In [34]:
p10("data10_b.tsv")

False

### Problem 11

The variable `p11` below is assigned using an f-string without formatting options.
Modify the f-string to display the number of visits with commas, and the average visit revenue with two digits after the decimal point.
You should only modify the f-string for this problem.

Feel free to search for the formatting options to guide you modifying the f-string.
You will learn the more commonly used options with practice.
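
As a short illustration of format specifiers, the sketch below uses hypothetical values (not `q11a` and `q11b`). The part after the colon inside the braces controls how the value is displayed.

```python
# Hypothetical values, not the homework's q11a and q11b.
count = 1234567
ratio = 2.71828
amount = 1234.5

with_commas = f"{count:,}"     # thousands separators
two_decimals = f"{ratio:.2f}"  # fixed-point, two digits after the decimal point
combined = f"{amount:,.2f}"    # both options together

print(with_commas)   # 1,234,567
print(two_decimals)  # 2.72
print(combined)      # 1,234.50
```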

In [None]:
# DO NOT CHANGE

q11a = 5125
q11b = 3.5123565123

In [None]:
# YOUR CHANGES HERE

p11 = f"Number of visits = {q11a}, average visit revenue = {q11b}"

In [None]:
p11

### Problem 12

The variable `q12` below contains a line of text read from a CSV file.
Set `p12` to the floating point number in the first column of `q12`.

In [None]:
# DO NOT CHANGE

q12 = "6.4,dog,red,0.9\n"

In [None]:
# YOUR CHANGES HERE

p12 = ...

In [None]:
p12

### Problem 13

Write a function `p13` that takes in a filename as an argument, reads it as a CSV with a header row, and returns a list of the column names.
The list of column names should be in the same order as in the file's header.
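
For reference, `csv.DictReader` exposes the header through its `fieldnames` attribute. The sketch below reads made-up CSV text from memory rather than from a file.

```python
import csv
import io

# Hypothetical CSV text standing in for a file on disk.
sample = io.StringIO("name,age,city\nAda,36,London\n")

reader = csv.DictReader(sample)
header = list(reader.fieldnames)  # column names in header order
print(header)  # ['name', 'age', 'city']
```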

In [None]:
# YOUR CHANGES HERE

def p13(filename):
    ...

In [None]:
p13("data13_a.csv")

In [None]:
p13("data13_b.csv")

### Problem 14

Write a function `p14` that takes in an input filename and column name, parses the file as a CSV with a header row, and returns a list of the values in the given column.
If the column is missing, your function should raise a KeyError.

In [None]:
# YOUR CHANGES HERE

def p14(filename, column_name):
    ...

In [None]:
p14("data14_a.csv", "foo")

In [None]:
p14("data14_a.csv", "bar")

In [None]:
p14("data14_b.csv", "foo")

In [89]:
p14("data14_b.csv", "bar")

### Problem 15

Write a function `p15` that takes in a filename and string key, parses the file as JSON, and returns the value for that key. If the object in the JSON file is not a dictionary or the given key does not exist, then the function should return None.
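
The type check and key lookup can be sketched on JSON strings. The helper `lookup` is hypothetical and uses `json.loads` on text; for a file, `json.load` on an open file object plays the same role.

```python
import json

def lookup(json_text, key):
    """Hypothetical helper working on a JSON string instead of a file."""
    parsed = json.loads(json_text)
    if not isinstance(parsed, dict):
        return None          # e.g. the JSON document is a list or a number
    return parsed.get(key)   # dict.get returns None for a missing key

print(lookup('{"x": 1}', "x"))   # 1
print(lookup('{"x": 1}', "y"))   # None
print(lookup('[1, 2, 3]', "x"))  # None
```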

In [41]:
# YOUR CHANGES HERE

def p15(filename, key):
    ...

In [None]:
p15("data15_a.json", "x")

In [None]:
p15("data15_a.json", "y")

### Problem 16

Write a function `p16` that takes in an input filename, parses the file as a TSV with a header row, and returns a dictionary with the average value of each column.

Hint:
* Write a helper function to compute the average of a list, and use list comprehensions to get all the values for each column.
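
The approach in the hint can be sketched on rows that are already parsed into numbers. The helper name `average` and the sample rows are hypothetical, and the sketch assumes every column has at least one value.

```python
def average(values):
    """Hypothetical helper: arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)

# Made-up rows, as if the TSV values had already been converted to floats.
rows = [{"a": 1.0, "b": 10.0}, {"a": 3.0, "b": 20.0}]
column_names = ["a", "b"]

# A list comprehension collects one column's values; a dict comprehension
# repeats that for every column.
averages = {name: average([row[name] for row in rows]) for name in column_names}
print(averages)  # {'a': 2.0, 'b': 15.0}
```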


In [None]:
# YOUR CHANGES HERE

def p16(filename):
    ...

In [None]:
p16("data16_a.tsv")

### Problem 17

Write a function `p17` that takes in a filename and a list of column names as arguments, parses the file as a CSV, and returns True if all the given columns are in the file and False otherwise.

In [None]:
# YOUR CHANGES HERE

def p17(filename, column_names):
    ...

In [113]:
p17("data17_a.csv", ["foo"])

In [114]:
p17("data17_a.csv", ["foo", "bar"])

In [115]:
p17("data17_a.csv", ["baz"])

### Problem 18

Write a function `p18` that takes in a filename, column name, and column value as arguments, parses the file as a TSV, and returns the first row where the given column has the given value. The row should be returned as a dictionary. If no such row exists, return None.
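
One detail to watch: the `csv` module returns every value as a string, so comparing against an integer never matches. The sketch below uses made-up TSV text; converting the search value with `str` is only one way to handle this, and which types the auto-grader passes is an assumption.

```python
import csv
import io

# Hypothetical TSV text standing in for a file on disk.
sample = io.StringIO("foo\tbar\n1\t2\n3\t4\n")
rows = list(csv.DictReader(sample, dialect="excel-tab"))

int_comparison = rows[1]["foo"] == 3       # False: "3" is a string, 3 is an int
str_comparison = rows[1]["foo"] == str(3)  # True
print(int_comparison, str_comparison)

# next() with a default returns the first match, or None if there is none.
first_match = next((row for row in rows if row["foo"] == str(3)), None)
print(first_match)  # {'foo': '3', 'bar': '4'}
```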


In [None]:
# YOUR CHANGES HERE

def p18(filename, column_name, column_value):
    ...

In [None]:
p18("data18_a.tsv", "foo", 3)

In [None]:
p18("data18_a.tsv", "bar", 4)

### Problem 19

The following function `p19` is supposed to check if its input list has at least 10 entries, and return the 10th entry if it exists, and None if it has fewer than 10 entries.

However, there is a bug in the code so it often returns wrong answers and sometimes crashes.
Fix the bug in `p19`.
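
As a reminder of how positions map to indexes, the sketch below uses a made-up 10-character string. Python indexes start at 0, so the last valid index is always one less than the length.

```python
letters = "abcdefghij"  # exactly 10 characters

first = letters[0]      # the 1st entry is at index 0
tenth = letters[9]      # the 10th entry is at index 9
print(len(letters), first, tenth)  # 10 a j

# letters[10] would raise IndexError because valid indexes are 0 through 9.
```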

In [None]:
# YOUR CHANGES HERE

def p19(input):
    if len(input) >= 10:
        # Indexes start at 0, so the 10th entry is at index 9.
        return input[9]

    return None

In [None]:
# this should return "a"

p19("aaaaaaaaaaaaaa")

In [None]:
# this should return None
p19("bbbbbbb")

In [None]:
# this should return "j"

p19("abcdefghij")

### Problem 20

Set `p20` to be a list of filenames of the form "data20_X.tsv" that exist in the current directory.
X should be a one digit number from 0 to 9.
For example, the filename could be "data20_0.tsv", "data20_9.tsv", or any other filename using 0 to 9 to set X.



There are many ways to do this.
It can be done using just this week's lessons.
You may also find easier ways to check based on libraries.
You may use libraries to solve this problem, as long as they are installed by default in our Codespaces environment.
(If you try to install other libraries, your answer will likely be rejected by the auto-grader.)
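
One standard-library option is `os.path.exists`, sketched below. The example builds a temporary folder with hypothetical files so that it is self-contained; in the homework itself the files would be checked in the current directory instead.

```python
import os
import tempfile

# Build a throwaway directory with hypothetical files so the sketch is self-contained.
with tempfile.TemporaryDirectory() as folder:
    for name in ["data20_3.tsv", "data20_7.tsv", "data20_12.tsv", "other.tsv"]:
        open(os.path.join(folder, name), "w").close()

    # Try each one-digit candidate and keep the names that exist.
    found = []
    for digit in range(10):
        candidate = f"data20_{digit}.tsv"
        if os.path.exists(os.path.join(folder, candidate)):
            found.append(candidate)

print(found)  # ['data20_3.tsv', 'data20_7.tsv']
```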

In [None]:
# YOUR CHANGES HERE

p20 = []

...

Ellipsis

In [None]:
p20