# HW1 - Base Python

See Canvas for details on how to complete and submit this assignment.

## Introduction

This assignment bridges foundational Python concepts with the data manipulation skills you'll need throughout the course. You'll work with nested data structures, implement string processing algorithms, clean messy data, and recreate built-in Python functionality from scratch - all essential skills for data science.

### Learning Objectives

- Become familiar with professional code styling guidelines and use them to improve the readability and maintainability of your code
- Compare and contrast different data structures (lists vs. dictionaries) through hands-on implementation
- Practice test-driven development using assertions to verify code correctness
- Transform messy, real-world data into clean, analyzable formats
- Progress from explicit loops to Pythonic idioms that you'll use with pandas and numpy

The problems follow a deliberate progression from simple text processing to complex data transformations. You'll implement solutions using basic constructs first, then explore how Python's built-in tools and methods can simplify your code. This approach mirrors real-world development, where understanding the problem deeply leads to better solution choices.

Each function you write will be tested automatically, introducing the testing practices essential for reliable data analysis. By the end, you'll have practical experience with the exact patterns you'll use when cleaning datasets, aggregating results, and transforming data structures throughout your data science journey.

This assignment should take 3-5 hours to complete, toward the higher end for graduate students.

### Generative AI Allowance

You may use GenAI tools for brainstorming, explanations, and code sketches if you disclose it, understand it, and validate it. Your submission must represent your own work and you are solely responsible for its correctness.

### Scoring

- Reading: 30pts, 15 each
- Coding: 60pts, 15 each
- Reflection: 10pts

## Reading

### Markdown Guide

Complete the [Markdown Tutorial](https://www.markdowntutorial.com) and review the [Basic Syntax section of Markdown Guide](https://www.markdownguide.org/basic-syntax/).

Add a text / markdown cell below this one and give some brief insights from that experience. Include a numbered list, some text formatting (e.g. bold and/or italics), and a level 3 header in that, along with any other formatting you would like to include.

### _Guidance Gleaned from a Glancing Gaze at Markdown Cells_

Markdown cells provide an absolute _litany_ of ways to share and **emphasize** text on the web!  
The ways I saw are listed below, ordered by how complicated I think they'll be to master.

1. Reference Images and Links

   The notation for specifying what the reference is and denoting how it's used is a lot

2. Inline Images and Links
   
   At least here everything is on the same line, although I see the usefulness of one reference for many links/images.

3. Lists

   Again, a lot of notation requirements

4. Blockquotes

   >"If people knew how hard I worked to get my mastery, it wouldn't seem so wonderful at all." (_Michelangelo_)

5. Headers

6. Italics and Bolding

### Python Standards

Review [PEP 8, the Style Guide for Python](https://peps.python.org/pep-0008/), focusing on the elements that are familiar to you and most applicable at your current stage of development as a Python user.

Add a text / markdown cell below this one to share your main takeaways. You might address some of the following issues and/or entirely different topics.

- Why is code styling important for collaboration and maintainability?
- Which of the PEP 8 recommendations felt the most applicable to you?
- Which do you plan to implement?
- Which were the most surprising?

**Graduate students only:** Also review [Google's Python Style Guide](https://google.github.io/styleguide/pyguide.html) and consider it in your response.

### _Main Takeaways from PEP 8_

* The guide starts by acknowledging that it isn't gospel, just a shared way to remain consistent

* Consistency is key so that everyone can read and understand code easily; code is read more often than it is written

* Indentation is done with spaces unless older code is already using tabs

* The 79-character line limit is intended for working with several windows side by side, which is hard for me to appreciate because multiple windows are wildly distracting to me

* There is a set order to import libraries

* The whitespace rules are where I have deviated the most from the guidelines in the past; I will try to implement more readable practices in the future

* For docstrings refer to PEP 257. Proper commenting will be a struggle to get used to early on, but will pay off later.

* Name variables and functions in lowercase with separating underscores  
  `example_variable`  
  `yet_another`
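
A few of the takeaways above can be seen together in one short snippet. This is only an illustrative sketch; `circle_area` and `example_area` are hypothetical names chosen for the example, not part of the assignment.

```python
import math  # imports go at the top of the file, standard library first


def circle_area(radius):
    """Return the area of a circle with the given radius."""
    # spaces around the lowest-priority operator keep the expression readable
    return math.pi * radius**2


example_area = circle_area(2.0)  # snake_case for variables and functions
print(round(example_area, 2))  # 12.57
```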

### _Main Takeaways from Google's Python Style Guide_

* Google has max line length at 80 instead of 79, my understanding is the extra character may cause wrapping on the space where the next character would be entered

* Both guides say to use Python's built-in features wherever possible

* URLs go on their own line if long

* 4 space indentation is shared

* Google gives explicit definitions for most of the rules it mentions, which makes each rule easier to parse but the guide overwhelming as a whole

* Using + to join strings in loops can lead to quadratic runtimes instead of linear, an interesting example of how a small syntax choice leads to readability _and_ performance issues

* All variable names should be descriptive

* Consistency is still key here
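
The string-joining point in the list above can be sketched as follows. The word list is made up for illustration, and the performance difference only becomes noticeable for much longer inputs.

```python
words = ['clean', 'readable', 'python']  # hypothetical sample data

# Repeated += may copy the growing string on each pass through the loop
sentence_concat = ''
for word in words:
    sentence_concat += word + ' '
sentence_concat = sentence_concat.strip()

# str.join assembles the result from all the pieces in a single call
sentence_join = ' '.join(words)

print(sentence_concat == sentence_join)  # True
```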

## Coding

Your code will be evaluated primarily on functionality, but basic PEP 8 compliance will be considered:

- Descriptive function names using `snake_case`
- Clear docstrings explaining function purpose
- Meaningful variable names
- Proper spacing around operators

All solutions will be implemented as functions. This is best practice for many reasons, including testability. As you will see, with functions we can write simple tests to check the correctness of implementation. This theme will be revisited and expanded on throughout the semester.

### Count Letters

This simple problem is designed to reintroduce Python and demonstrate:

- there are many ways to solve problems in Python
- some are better and easier than others
- the "hard way" is a necessary educational tool but Python provides alternatives for a reason

Write three versions of a function that takes a string and returns the number of occurrences of each letter in it:

1. `count_letters_v1` - use a list of lists where each inner list is `[letter, count]`
2. `count_letters_v2` - use a dictionary, checking if keys exist before updating
3. `count_letters_v3` - use dictionary's `.get()` method to simplify the logic

Write your functions in the cell below.

In [94]:
def count_letters_v1(text):
    """Return the count of each letter in text as a list of [letter, count]
    Combine conditionals with a nested loop
    String methods `lower` and `isalpha` may be helpful
    """
    clean_text = text.lower()
    count = []
    for char in clean_text:
        if not char.isalpha():
            continue  # skip digits, spaces, and punctuation
        sub_count = [char, 0]
        # second pass: count every occurrence of this letter
        for letter in clean_text:
            if letter == char:
                sub_count[1] += 1
        if sub_count not in count:
            count.append(sub_count)
    return count

def count_letters_v2(text):
    """Return the count of each letter in text as a dictionary of letter:count pairs
    Use a single loop and test if each key exists before creating the pair / updating the count
    """
    count_dict = {}
    for char in text.lower():
        if not char.isalpha():
            continue  # skip digits, spaces, and punctuation
        if char not in count_dict:
            count_dict[char] = 1
        else:
            count_dict[char] += 1
    return count_dict

def count_letters_v3(text):
    """Return the count of each letter in text as a dictionary of letter:count pairs
    Use a single loop with the dictionary get method to construct the pairs directly
    """
    count_dict = {}
    for char in text.lower():
        if char.isalpha():  # ignore digits, spaces, and punctuation
            count_dict[char] = count_dict.get(char, 0) + 1
    return count_dict

#### Tests

Run the code below to test your implementation. If an error is detected, use the information provided to correct your function definition.

**You must run the cell above each time you make changes to it (to create the function definition) before running these tests.**

In [104]:
def normalize_result(result):
    """Helper to compare different return types"""
    if isinstance(result, list):
        return {item[0]: item[1] for item in result}
    return result


test_cases = [
    ('Hello World', {'h': 1, 'e': 1, 'l': 3, 'o': 2, 'w': 1, 'r': 1, 'd': 1}),
    ('AAaaa', {'a': 5}),
    ('123!@#', {}),  # No letters
    ('', {}),  # Empty string
]

for text, expected in test_cases:
    assert normalize_result(count_letters_v1(text)) == expected, f"v1 failed on '{text}'"
    assert count_letters_v2(text) == expected, f"v2 failed on '{text}'"
    assert count_letters_v3(text) == expected, f"v3 failed on '{text}'"
    assert count_letters_v4(text) == expected, f"v4 failed on '{text}'"

print('All tests passed!')

All tests passed!


#### Interpretation

Add a text / markdown cell below to describe the progression from v1 to v3. Which method do you prefer and why? Specifically, why are dictionaries better suited for this problem than lists, and what is the advantage of `.get()`?

#### _Progression Explanation_
In version one, to create a list of lists one must iterate through the characters in the text twice while creating a list for each one then checking it against the existing lists for duplicate characters.

In version two, one loop is enough to check if a key exists or create one if it doesn't. This still requires conditional statements nested inside a loop.

In version three, a single loop and the get method are sufficient to check and update the container in two lines.

Dictionaries offer advantages over lists because their keys are unique and a key lookup makes it easy to change a value that already exists.

The `.get()` method returns the value stored for a key if the key exists and a default (here `0`) if it doesn't, so the lookup and the missing-key case are handled in a single expression.

I prefer the second version because I could understand it intuitively. The other two versions took me at least 20 minutes to figure out.
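
To make the `.get()` behavior concrete, here is a minimal sketch with a hypothetical `counts` dictionary. Note that `.get()` on its own never adds a key; the assignment on the left-hand side does that.

```python
counts = {'a': 2}  # hypothetical starting state

print(counts.get('a', 0))  # 2
print(counts.get('b', 0))  # 0
print('b' in counts)  # False

# Read with a default, add one, and store the result back under the key
counts['b'] = counts.get('b', 0) + 1
print(counts)  # {'a': 2, 'b': 1}
```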

#### Follow-Up (Graduate Students)

This part is for grad students only.

Implement a fourth version of the solution using [`collections.Counter` from the standard library](https://www.geeksforgeeks.org/python/counters-in-python-set-1/). Test your implementation as you did for v1-3.

In [102]:
from collections import Counter


def count_letters_v4(text):
    """Return the count of each letter in text as a dictionary of letter:count pairs
    Use collections.Counter, which was specifically designed for this common task
    """
    # keep only letters, then convert the Counter to a plain dict
    return dict(Counter(char for char in text.lower() if char.isalpha()))
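
For readers new to `Counter`, the short sketch below shows a few behaviors that matter for this problem. The variable `letter_counts` is a hypothetical name used only for illustration.

```python
from collections import Counter

letter_counts = Counter('hello')

print(letter_counts['l'])  # 2
# A missing key reads as zero instead of raising KeyError
print(letter_counts['z'])  # 0
# Counter is a dict subclass, so it compares equal to a plain dict
print(letter_counts == {'h': 1, 'e': 1, 'l': 2, 'o': 1})  # True
print(letter_counts.most_common(1))  # [('l', 2)]
```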

### Extract Valid Data

Create a function, `extract_valid_data`, that takes a list of lists containing an arbitrary mix of *only* `int`, `float`, and `str` data types, along with a `max_val` number. Return a list of the unique integer values less than `max_val`, sorted in ascending order. The default value of `max_val` is 10. For example, the following function call:

```python
lols = [[1, 'a', 50], [50, 101, -5], [25, 3.14]]
extract_valid_data(lols, max_val=100)
```

should return

```python
[-5, 1, 25, 50]
```

To better understand how default arguments are used when defining and calling Python functions, review the first part of [this Geeks for Geeks article](https://www.geeksforgeeks.org/python/default-arguments-in-python/). The second part, about mutable defaults, is very important; we will revisit this topic later in the course.
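
The mutable-default pitfall mentioned above can be illustrated with a small sketch. Both functions are hypothetical examples, not part of the assignment; the second shows the common `None` sentinel pattern.

```python
def add_score_buggy(score, scores=[]):
    # The default list is created once and shared between calls
    scores.append(score)
    return scores


def add_score_safe(score, scores=None):
    # A fresh list is created on every call that omits scores
    if scores is None:
        scores = []
    scores.append(score)
    return scores


first = add_score_buggy(90)
second = add_score_buggy(75)
print(second)  # [90, 75]
print(first is second)  # True

print(add_score_safe(90))  # [90]
print(add_score_safe(75))  # [75]
```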

You will need to use either `type` or `isinstance` to identify objects of type `int` in your solution. Consult the Python documentation or use the built-in help (e.g. `help(isinstance)`) for more information.
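
As a rough guide to how the two checks differ, consider the sketch below. The `bool` case goes beyond what this problem requires, since the inputs are limited to `int`, `float`, and `str`, but it is a frequent surprise.

```python
checks = {
    'isinstance(5, int)': isinstance(5, int),
    'isinstance(5.0, int)': isinstance(5.0, int),
    "isinstance('5', int)": isinstance('5', int),
    'type(5) is int': type(5) is int,
    # bool is a subclass of int, so the two approaches disagree here
    'isinstance(True, int)': isinstance(True, int),
    'type(True) is int': type(True) is int,
}

for expression, outcome in checks.items():
    print(expression, '->', outcome)
```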

Write your function in the cell below.

In [131]:
def extract_valid_data(lists, max_val=10):
    """Return the sorted unique ints below max_val found in a list of lists."""
    below_threshold_ints = []
    for group in lists:
        for x in group:
            if isinstance(x, int) and x < max_val:
                below_threshold_ints.append(x)
    # set() drops duplicates; sorted() returns a new ascending list
    return sorted(set(below_threshold_ints))

In [132]:
extract_valid_data([[1, 'a', 50], [50, 101, -5], [25, 3.14]], max_val=100)

[-5, 1, 25, 50]

#### Tests

Run the code below to test your implementation. If an error is detected, use the information provided to correct your function definition.

**You must run the cell above each time you make changes to it (to create the function definition) before running these tests.**

In [133]:
# Test 1: Basic example from problem description
lols = [[1, 'a', 50], [50, 101, -5], [25, 3.14]]
assert extract_valid_data(lols, max_val=100) == [-5, 1, 25, 50], 'Basic test failed'

# Test 2: Default max value (10)
data1 = [[1, 5, 15], [8, 12, 3], [5, 9, 10]]
assert extract_valid_data(data1) == [1, 3, 5, 8, 9], 'Default max_val=10 test failed'

# Test 3: No valid integers (all exceed max)
data2 = [[100, 200], [150, 300]]
assert extract_valid_data(data2, max_val=50) == [], 'No valid integers test failed'

# Test 4: Duplicates should be removed
data3 = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
assert extract_valid_data(data3, max_val=10) == [1, 2, 3, 4, 5], 'Duplicate removal test failed'

# Test 5: Mixed types - only integers should be included
data4 = [[1, 2.0, '3'], [4.5, 5, 'six'], [7.0, 8, 9.9]]
assert extract_valid_data(data4, max_val=10) == [1, 5, 8], 'Type filtering test failed'

# Test 6: Negative numbers
data5 = [[-5, -3, -1], [0, 1, 2]]
assert extract_valid_data(data5, max_val=3) == [-5, -3, -1, 0, 1, 2], 'Negative numbers test failed'

# Test 7: Single element sublists
data6 = [[1], [2], [3], [2], [1]]
assert extract_valid_data(data6, max_val=5) == [1, 2, 3], 'Single element test failed'

# Test 8: Large max value
data7 = [[1, 100, 1000], [50, 500, 5000]]
assert extract_valid_data(data7, max_val=10000) == [1, 50, 100, 500, 1000, 5000], (
    'Large max test failed'
)

# Test 9: Boundary case - values equal to max should be excluded
data8 = [[8, 9, 10, 11], [10, 10, 10]]
assert extract_valid_data(data8, max_val=10) == [8, 9], 'Boundary test failed (max_val=10)'

print('All tests passed!')

All tests passed!


#### Interpretation

Add a text / markdown cell below to explain how the test code works. In particular, what does `assert` do here? Are you surprised by the number of tests required to fully check the solution?

#### _Test Code Explanation_
The `assert` statement raises an `AssertionError`, with the message given after the comma, if the returned list does not exactly match the expected list. I was surprised at how many different test cases there were. I don't think I would have been able to think through every single error easily, if at all.
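
A minimal sketch of what `assert` does when a comparison fails is shown below. The helper `run_check` is hypothetical and exists only to capture the message instead of stopping the program.

```python
def run_check(actual, expected, label):
    """Return 'ok' or the assertion message (hypothetical helper)."""
    try:
        assert actual == expected, f'{label} failed'
    except AssertionError as error:
        return str(error)
    return 'ok'


print(run_check([1, 2], [1, 2], 'Order test'))  # ok
# Lists compare element by element, so order matters
print(run_check([2, 1], [1, 2], 'Order test'))  # Order test failed
```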

#### Follow-Up (Graduate Students)

This part is for grad students only.

Rewrite this function as a single list comprehension.

Is the result more or less easy to read than your original implementation? What does this tell you about when comprehensions are best used, in practice?

In [None]:
def extract_valid_data(lists, max_val=10):
    return sorted(
        set(
            item
            for sublist in lists
            for item in sublist
            if isinstance(item, int) and item < max_val
        )
    )

#### _List Comprehension_
This rewrite is much harder for me to understand than my original implementation. I think with time I will be better able to read code like this but I really struggle currently. I don't really understand when the best time to implement this would be.
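
One way to read a nested comprehension is to line it up against the equivalent loops: the `for` clauses appear in the same order, and the `if` comes last. The sketch below uses hypothetical sample data.

```python
values = [[1, 'a', 50], [50, 101, -5]]  # hypothetical sample data

# Explicit nested loops
ints_loop = []
for sublist in values:
    for item in sublist:
        if isinstance(item, int):
            ints_loop.append(item)

# Same logic as a comprehension: outer for, inner for, then the filter
ints_comp = [item for sublist in values for item in sublist if isinstance(item, int)]

print(ints_loop == ints_comp)  # True
print(ints_comp)  # [1, 50, 50, 101, -5]
```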

### Data Cleaning

Create a function, `clean_record`, that takes a dictionary and returns a cleaned version of the same. Each `dict` consists of four key:value pairs. All keys are strings and the expected type of each value is specified below:

- 'name': str
- 'age': int
- 'email': str
- 'score': float

To clean each record, your function should:

- convert all keys to lowercase
- convert all age and score values to integer or float values, as specified
- validate that age is positive and less than 100; if not, replace the value with `None` and print a warning message
- round score to a single digit of precision using `round(val, 1)`
- convert name to "Last, First" format
  - you can assume that all names come in "First Middle Last" format, but middle is optional
  - you can also assume that the names will not include titles (e.g. "Dr."), suffixes (e.g. "Jr."), multi-word last names (e.g. "Van Buren"), etc.
- return the cleaned version

Note: Python's `round` function uses Banker's Rounding, which can lead to unexpected results. See [this article for additional background / details](https://medium.com/@akhilnathe/understanding-pythons-round-function-from-basics-to-bankers-b64e7dd73477).
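
The rounding behavior described in the note can be seen in a few lines. The `decimal` comparison at the end is an optional extra for illustration, not something the assignment asks for.

```python
from decimal import Decimal, ROUND_HALF_UP

# Ties go to the nearest even integer
print(round(0.5))  # 0
print(round(1.5))  # 2
print(round(2.5))  # 2

# 2.675 cannot be stored exactly as a float, so it sits slightly below the tie
print(round(2.675, 2))  # 2.67

# Decimal works from the exact text and can round halves up when asked
half_up = Decimal('2.675').quantize(Decimal('0.01'), rounding=ROUND_HALF_UP)
print(half_up)  # 2.68
```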

You may assume there are no missing keys in the data.

Write your function in the cell below.

In [156]:
def clean_record(record):
    """Return a cleaned copy of record with lowercase keys and typed values."""
    clean_dict = {key.lower(): value for key, value in record.items()}
    # "First [Middle] Last" -> "Last, First"
    split_name = clean_dict["name"].split()
    clean_dict["name"] = split_name[-1] + ", " + split_name[0]
    clean_dict["age"] = int(clean_dict["age"])
    # age must be positive and less than 100
    if (clean_dict["age"] <= 0) or (clean_dict["age"] >= 100):
        clean_dict["age"] = None
        print("Age is outside 0 to 100, please investigate.")
    clean_dict["score"] = round(float(clean_dict["score"]), 1)
    return clean_dict

#### Tests

Run the code below to test your implementation. If an error is detected (the output doesn't match the expected value for any of the 9 tests), use the information provided to correct your function definition.

**You must run the cell above each time you make changes to it (to create the function definition) before running these tests.**

In [158]:
### DO NOT CHANGE THE CODE IN THIS CELL

# Test data for clean_records function

test_input = [
    {
        'name': 'John Doe',
        'age': '25',
        'email': 'john@email.com',
        'score': '87.456',
    },
    {
        'NAME': 'Mary Jane Smith',
        'AGE': '30',
        'EMAIL': 'mj@email.com',
        'SCORE': '92.149',
    },
    {
        'Name': 'Bob Wilson',
        'Age': 42,
        'Email': 'bob@test.com',
        'Score': 81.951,
    },
    {
        'name': 'Anna Chen',
        'age': '1',
        'email': 'anna@email.com',
        'score': '95.678',
    },
    {
        'name': 'Senior Citizen',
        'age': '99',
        'email': 'senior@test.com',
        'score': 73.2,
    },
    {
        'NAME': 'Charlie Brown',
        'AGE': 19,
        'EMAIL': 'charlie@test.com',
        'SCORE': '90.5',
    },
    {
        'name': 'Jennifer Anne Marie Thompson',
        'age': '31',
        'email': 'jamt@email.com',
        'score': '88.8',
    },
    {
        'Name': 'Carlos Rodriguez',
        'Age': '28',
        'Email': 'carlos@email.com',
        'Score': 100,
    },
    {
        'name': 'Invalid Age',
        'age': '150',
        'email': 'invalid@test.com',
        'score': '80.0',
    },
]

test_expected = [
    {'name': 'Doe, John', 'age': 25, 'email': 'john@email.com', 'score': 87.5},
    {'name': 'Smith, Mary', 'age': 30, 'email': 'mj@email.com', 'score': 92.1},
    {'name': 'Wilson, Bob', 'age': 42, 'email': 'bob@test.com', 'score': 82.0},
    {'name': 'Chen, Anna', 'age': 1, 'email': 'anna@email.com', 'score': 95.7},
    {'name': 'Citizen, Senior', 'age': 99, 'email': 'senior@test.com', 'score': 73.2},
    {'name': 'Brown, Charlie', 'age': 19, 'email': 'charlie@test.com', 'score': 90.5},
    {'name': 'Thompson, Jennifer', 'age': 31, 'email': 'jamt@email.com', 'score': 88.8},
    {'name': 'Rodriguez, Carlos', 'age': 28, 'email': 'carlos@email.com', 'score': 100.0},
    {'name': 'Age, Invalid', 'age': None, 'email': 'invalid@test.com', 'score': 80.0},
]

# run tests to ensure output matches expected for each given input

for idx, data in enumerate(test_input):
    expected = test_expected[idx]
    actual = clean_record(data)

    # Check if dictionaries match
    if actual != expected:
        # Find which fields don't match
        for key in expected:
            if actual.get(key) != expected[key]:
                assert False, (
                    f"Test {idx + 1} failed on field '{key}': expected {expected[key]}, got {actual.get(key)}"
                )

print('All tests pass!')

Age is outside 0 to 100, please investigate.
All tests pass!


#### Interpretation

Add a text / markdown cell below to explain how the test code works. In particular, look up the `enumerate` function and `dict.get()` method. Consider how the equivalent would be written without them - how do those features simplify this implementation?

Also, what does `assert` do here and how else could it be used in other testing situations?
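
As a starting point for that comparison, the sketch below finds mismatched rows twice: once with index arithmetic and explicit key checks, and once with `enumerate` and `dict.get()`. The row data and variable names are hypothetical.

```python
expected_rows = [{'age': 25}, {'age': 30}]  # hypothetical data
actual_rows = [{'age': 25}, {}]  # second row is missing its key

# Without enumerate or get: manual indexing and an explicit membership test
mismatches_verbose = []
for idx in range(len(actual_rows)):
    actual = actual_rows[idx]
    value = actual['age'] if 'age' in actual else None
    if value != expected_rows[idx]['age']:
        mismatches_verbose.append(idx + 1)

# With enumerate and get: the index and the safe lookup come for free
mismatches = [
    idx + 1
    for idx, actual in enumerate(actual_rows)
    if actual.get('age') != expected_rows[idx]['age']
]

print(mismatches)  # [2]
```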

#### Follow-Up (Graduate Students)

This part is for grad students only.

Explain in a text / markdown cell below how your approach would have to change if you could not assume each record was complete (all four keys present).

Sketch out the code change required. You can include (non-running) code blocks in markdown cells as shown below (edit this cell to see the formatting). This does not have to run, it is for communication purposes only.

```python
# code blocks are denoted in markdown with three backticks before and after
print("This is a markdown code block.")
```
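
One possible direction for handling incomplete records, offered only as a sketch: fill in any absent key with `None` before the rest of the cleaning runs. `REQUIRED_KEYS` and `fill_missing_keys` are hypothetical names, and the later cleaning steps would still need to skip `None` values.

```python
REQUIRED_KEYS = ('name', 'age', 'email', 'score')  # hypothetical constant


def fill_missing_keys(record):
    """Return a lowercase-keyed copy with None for any absent required key."""
    lowered = {key.lower(): value for key, value in record.items()}
    # dict.get returns None for an absent key instead of raising KeyError
    return {key: lowered.get(key) for key in REQUIRED_KEYS}


partial = fill_missing_keys({'Name': 'John Doe', 'AGE': '25'})
print(partial)  # {'name': 'John Doe', 'age': '25', 'email': None, 'score': None}
```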

### Implement a Simplified `zip()`

Create a function, `simple_zip`, that emulates some functionality of the `zip` function included with base Python:

```text
> help(zip)
Help on class zip in module builtins:

class zip(object)
 |  zip(*iterables, strict=False)
 |
 |  The zip object yields n-length tuples, where n is the number of iterables
 |  passed as positional arguments to zip().  The i-th element in every tuple
 |  comes from the i-th iterable argument to zip().  This continues until the
 |  shortest argument is exhausted.
 |
 |  If strict is true and one of the arguments is exhausted before the others,
 |  raise a ValueError.
 |
 |     >>> list(zip('abcdefg', range(3), range(4)))
 |     [('a', 0, 0), ('b', 1, 1), ('c', 2, 2)]
```

Python's version returns a lazy iterator (it behaves much like a *generator*) that produces values as needed rather than all at once. Your solution should return a list of tuples instead. For example, the following function call:

```python
it1 = [1, 2, 3]
it2 = ['a', 'b', 'c']
simple_zip(it1, it2)
```

should return

```python
[(1, 'a'), (2, 'b'), (3, 'c')]
```

Do not implement the `strict` argument. Instead, emulate the default behavior of `zip`: if the iterables are of differing lengths, stop when the shortest one is exhausted.

Note that the first argument in `zip` is `*iterables`, allowing it to accept any number of iterables. When you use `*VARIABLE_NAME` in this fashion, Python automatically collects all the positional arguments into a tuple called `VARIABLE_NAME` (e.g. `iterables`). It is your responsibility to extract individual arguments from the resulting tuple. The following code block demonstrates this for clarity.

In [None]:
def example(*args):
    # return args as constructed by Python from the user's arguments
    return args


var1 = 'first argument'
var2 = 'second argument'
result = example(var1, var2)

# inspect results
print(result)  # ('first argument', 'second argument')
print(result[0])  # 'first argument'

Write your function in the cell below.

In [None]:
def simple_zip(*iterables): ...
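
One possible way to fill in the stub above is sketched here. It is an illustration under the stated requirements (list of tuples, stop at the shortest input), not the expected or only solution; returning an empty list when called with no arguments is an assumption chosen to mirror `list(zip())`.

```python
def simple_zip(*iterables):
    """Return a list of tuples, stopping when the shortest iterable ends."""
    if not iterables:
        return []  # assumption: mirror list(zip()) for zero arguments
    # iter() lets strings, ranges, and lists all be consumed the same way
    iterators = [iter(iterable) for iterable in iterables]
    result = []
    while True:
        row = []
        for iterator in iterators:
            try:
                row.append(next(iterator))
            except StopIteration:
                # one input ran out, so the partial row is discarded
                return result
        result.append(tuple(row))


print(simple_zip([1, 2, 3], ['a', 'b']))  # [(1, 'a'), (2, 'b')]
```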

#### Tests

Run the code below to test your implementation. If an error is detected, use the information provided to correct your function definition.

**You must run the cell above each time you make changes to it (to create the function definition) before running these tests.**

In [None]:
# Basic test cases
assert simple_zip([1, 2], ['a', 'b']) == [(1, 'a'), (2, 'b')], 'Basic test failed'
assert simple_zip([1, 2, 3], ['a', 'b']) == [(1, 'a'), (2, 'b')], "Doesn't stop at shortest"

# Multiple iterables
assert simple_zip([1, 2], ['a', 'b'], [10, 20]) == [(1, 'a', 10), (2, 'b', 20)], (
    "Doesn't handle >2 iterables"
)

# Different types of iterables
assert simple_zip('abc', [1, 2, 3]) == [('a', 1), ('b', 2), ('c', 3)], "Doesn't handle mixed types"
assert simple_zip(range(3), 'xyz') == [(0, 'x'), (1, 'y'), (2, 'z')], (
    "Doesn't handle other iterable types"
)

print('All tests passed!')

#### Interpretation

Add a text / markdown cell below to discuss how you might make your code more concise and/or readable by using list comprehensions. If you already used them in your solution, describe why you chose that approach.

Also, what tests have we overlooked? What are we assuming about the input that might cause a crash when this function is called?
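
A few edge cases worth considering can be explored with the built-in `zip` as a reference point. The sketch below only shows how the built-in behaves; whether `simple_zip` should match it in every case is a design decision.

```python
# No arguments at all
no_args = list(zip())
print(no_args)  # []

# One input is empty, so nothing can be paired
with_empty = list(zip([], [1, 2]))
print(with_empty)  # []

# A non-iterable argument raises TypeError
try:
    zip(5, [1, 2])
    error_name = None
except TypeError as error:
    error_name = type(error).__name__
print(error_name)  # TypeError
```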

#### Follow-Up (Graduate Students)

Read more about [generator objects](https://realpython.com/introduction-to-python-generators/). Then run the following code, noting the included comments.

In [None]:
# Python's built-in zip returns a generator-like object
result1 = zip([1, 2, 3], ['a', 'b', 'c'])
print(result1)  # What do you see?
print(list(result1))  # Convert to list
print(list(result1))  # Try again - what happens?

# Your simple_zip returns a list
result2 = simple_zip([1, 2, 3], ['a', 'b', 'c'])
print(result2)  # What do you see?
print(result2)  # Try again - what happens?

Based on your reading and the code above:

- What's the key difference between a generator and a list?
- Why might Python's zip return a generator instead of a list?
- Name one advantage and one disadvantage of generators vs lists.
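
The single-use nature of generators can be reproduced with a tiny hand-written one. `count_up` is a hypothetical example function, not something from the reading.

```python
def count_up(limit):
    """Yield 0, 1, ..., limit - 1 one value at a time."""
    for number in range(limit):
        yield number


gen = count_up(3)
first_pass = list(gen)
second_pass = list(gen)  # the generator is already exhausted
print(first_pass)  # [0, 1, 2]
print(second_pass)  # []

# A list keeps its values, so it can be indexed and read repeatedly
numbers = [0, 1, 2]
print(numbers[1])  # 1
print(numbers)  # [0, 1, 2]
```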

## Reflection

Address the following (concise bullets or short paragraphs are fine):

1. Key takeaway
   - What part of this assignment most surprised you or led to the most significant improvement in your Python understanding?
   - Include a concrete before/after to illustrate how this assignment has changed your approach to problem solving, syntax, styling, or other implementation details.
2. GenAI use
   - If used, specify the tool / model used, how you used it, how you verified correctness, and how it was most helpful (breadth / depth of understanding, quality of code, time to completion, etc.). Note any limits or problems you observed and how you mitigated them.
   - If not, why and when do you expect to use it in this course, if at all?
3. Feedback
   - Approximately how much time did you spend on this assignment?
   - What was the most difficult part?
   - How would you improve it?
   - Anything else you want to share or ask?