# HW1 - Base Python

See Canvas for details on how to complete and submit this assignment.

## Introduction

This assignment bridges foundational Python concepts with the data manipulation skills you'll need throughout the course. You'll work with nested data structures, implement string processing algorithms, clean messy data, and recreate built-in Python functionality from scratch - all essential skills for data science.

### Learning Objectives

- Become familiar with professional code styling guidelines and use them to improve the readability and maintainability of your code
- Compare and contrast different data structures (lists vs. dictionaries) through hands-on implementation
- Practice test-driven development using assertions to verify code correctness
- Transform messy, real-world data into clean, analyzable formats
- Progress from explicit loops to Pythonic idioms that you'll use with pandas and numpy

The problems follow a deliberate progression from simple text processing to complex data transformations. You'll implement solutions using basic constructs first, then explore how Python's built-in tools and methods can simplify your code. This approach mirrors real-world development, where understanding the problem deeply leads to better solution choices.

Each function you write will be tested automatically, introducing the testing practices essential for reliable data analysis. By the end, you'll have practical experience with the exact patterns you'll use when cleaning datasets, aggregating results, and transforming data structures throughout your data science journey.

This assignment should take 3-5 hours to complete, toward the higher end for graduate students.

### Generative AI Allowance

You may use GenAI tools for brainstorming, explanations, and code sketches if you disclose it, understand it, and validate it. Your submission must represent your own work and you are solely responsible for its correctness.

### Scoring

- Reading: 30pts, 15 each
- Coding: 60pts, 15 each
- Reflection: 10pts

## Reading

### Markdown Guide

Complete the [Markdown Tutorial](https://www.markdowntutorial.com) and review the [Basic Syntax section of Markdown Guide](https://www.markdownguide.org/basic-syntax/).

Add a text / markdown cell below this one and give some brief insights from that experience. Include a numbered list, some text formatting (e.g. bold and/or italics), and a level 3 header in that, along with any other formatting you would like to include.

### USING MARKDOWN FOR MY WRITINGS

Using markdown for my writing will enable me to **easily share** it with:  
1. Computers
   * Desktop computers
   * Personal computers
2. Mobile phones
3. People

Both the [_Markdown Tutorial_](https://www.markdowntutorial.com) and the [_Basic Syntax section of Markdown Guide_](https://www.markdownguide.org/basic-syntax/) showed me useful ways to format my writing.

* I learned how to make words **bold** or _italic_ wherever they appear in the text
* I learned that images can be added to my writing easily, unlike in other word processing applications




### Python Standards

Review [PEP 8, the Style Guide for Python](https://peps.python.org/pep-0008/), focusing on the elements that are familiar to you and most applicable at your current stage of development as a Python user.

Add a text / markdown cell below this one to share your main takeaways. You might address some of the following issues and/or entirely different topics.

- Why is code styling important for collaboration and maintainability?
- Which of the PEP 8 recommendations felt the most applicable to you?
- Which do you plan to implement?
- Which were the most surprising?

**Graduate students only:** Also review [Google's Python Style Guide](https://google.github.io/styleguide/pyguide.html) and consider it in your response.

### CODE STYLING
Consistent code styling improves **collaboration** by making it easier for people working together to read and understand each other's work. It also enhances **maintainability**:
* Identifying errors is easier when everything is well organized.
* With proper indentation, spacing, and naming, fixing bugs is much easier.

#### APPLICABLE RECOMMENDATIONS
**1. Whitespace:** Avoid spaces immediately inside parentheses, braces, and brackets.

**2. Indentation:** Avoid mixing tabs and spaces in the same code. Use 4 spaces per indentation level.

**3. Naming convention:** Constants should be written in all capital letters with underscores separating words. This will be helpful to me in collaborative coding since it will tell me specifically which parts of the code can be changed and which cannot.

**4. Block and inline comments:** These will help me better understand what my code is about and what a specific line of code does.

**5. TODO comments:** I can use this type of comment when my code is temporary or not yet good enough.

**6. Lint:** The pylint tool will be useful in my coding, since it makes detecting bugs and styling problems simple.

_It is quite surprising that PEP 8 advises against using `l`, `O`, and `I` as single-character variable names, because in some fonts they are indistinguishable from the numerals 1 and 0._


## Coding

Your code will be evaluated primarily on functionality, but basic PEP 8 compliance will be considered:

- Descriptive function names using `snake_case`
- Clear docstrings explaining function purpose
- Meaningful variable names
- Proper spacing around operators

All solutions will be implemented as functions. This is best practice for many reasons, including testability. As you will see, with functions we can write simple tests to check the correctness of implementation. This theme will be revisited and expanded on throughout the semester.
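
As a minimal sketch of that idea (the function `add_one` is hypothetical and used only for illustration), an `assert` statement checks a condition and raises an `AssertionError` with the optional message when the condition is false:

```python
def add_one(value):
    """Return value plus one (hypothetical example function)."""
    return value + 1


# Each assert passes silently when its condition is true
assert add_one(1) == 2, 'add_one(1) should be 2'
assert add_one(-1) == 0, 'add_one(-1) should be 0'

# A failing assert raises AssertionError carrying the message
try:
    assert add_one(1) == 3, 'this check is intentionally wrong'
except AssertionError as error:
    failure_message = str(error)

print(failure_message)  # this check is intentionally wrong
```

The tests in the sections below follow the same pattern: call the function, compare the result to the expected value, and attach a message that identifies the failing case.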

### Count Letters

This simple problem is designed to reintroduce Python and demonstrate:

- there are many ways to solve problems in Python
- some are better and easier than others
- the "hard way" is a necessary educational tool but Python provides alternatives for a reason

Write three versions of a function that takes a string and returns the number of occurrences of each letter in it:

1. `count_letters_v1` - use a list of lists where each inner list is `[letter, count]`
2. `count_letters_v2` - use a dictionary, checking if keys exist before updating
3. `count_letters_v3` - use dictionary's `.get()` method to simplify the logic

Write your functions in the cell below.

In [None]:
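def count_letters_v1(text):
    """Return the count of each letter in text as a list of [letter, count]
    Combine conditionals with a nested loop
    String methods `lower` and `isalpha` may be helpful
    """
    letter_counts = []
    for char in text.lower():
        if char.isalpha():
            # Look for an existing [letter, count] entry and update it
            found = False
            for entry in letter_counts:
                if entry[0] == char:
                    entry[1] += 1
                    found = True
                    break
            # First occurrence of this letter: start a new entry
            if not found:
                letter_counts.append([char, 1])
    return letter_counts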


def count_letters_v2(text):
    """Return the count of each letter in text as a dictionary of letter:count pairs
    Use a single loop and test if each key exists before creating the pair / updating the count
    """
    ...


def count_letters_v3(text):
    """Return the count of each letter in text as a dictionary of letter:count pairs
    Use a single loop with the dictionary get method to construct the pairs directly
    """
    ...

#### Solution

In [3]:
def count_letters_v1(text):
    """Count letters using nested lists [[letter, count], ...]"""
    counts = []
    for char in text.lower():
        if char.isalpha():
            # Search for existing entry
            found = False
            for entry in counts:
                if entry[0] == char:
                    entry[1] += 1
                    found = True
                    break
            if not found:
                counts.append([char, 1])
    return counts


def count_letters_v2(text):
    """Count letters using dictionary with if/else"""
    counts = {}
    for char in text.lower():
        if char.isalpha():
            if char in counts:
                counts[char] += 1
            else:
                counts[char] = 1
    return counts


def count_letters_v3(text):
    """Count letters using dict.get()"""
    counts = {}
    for char in text.lower():
        if char.isalpha():
            counts[char] = counts.get(char, 0) + 1
    return counts

#### Tests

Run the code below to test your implementation. If an error is detected, use the information provided to correct your function definition.

**You must run the cell above each time you make changes to it (to create the function definition) before running these tests.**

In [4]:
def normalize_result(result):
    """Helper to compare different return types"""
    if isinstance(result, list):
        return {item[0]: item[1] for item in result}
    return result


test_cases = [
    ('Hello World', {'h': 1, 'e': 1, 'l': 3, 'o': 2, 'w': 1, 'r': 1, 'd': 1}),
    ('AAaaa', {'a': 5}),
    ('123!@#', {}),  # No letters
    ('', {}),  # Empty string
]

for text, expected in test_cases:
    assert normalize_result(count_letters_v1(text)) == expected, f"v1 failed on '{text}'"
    assert count_letters_v2(text) == expected, f"v2 failed on '{text}'"
    assert count_letters_v3(text) == expected, f"v3 failed on '{text}'"

print('All tests passed!')

All tests passed!


#### Interpretation

Add a text / markdown cell below to describe the progression from v1 to v3. Which method do you prefer and why? Specifically, why are dictionaries better suited for this problem than lists, and what is the advantage of `.get()`?

#### Follow-Up (Graduate Students)

This part is for grad students only.

Implement a fourth version of the solution using [`collections.Counter` from the standard library](https://www.geeksforgeeks.org/python/counters-in-python-set-1/). Test your implementation as you did for v1-3.

In [None]:
from collections import Counter


def count_letters_v4(text):
    """Return the count of each letter in text as a dictionary of letter:count pairs
    Use collections.Counter, which was specifically designed for this common task
    """
    ...

#### Solution

In [None]:
from collections import Counter


def count_letters_v4(text):
    """Count letters using collections.Counter"""
    return Counter(char for char in text.lower() if char.isalpha())
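
A sketch of how v4 might be tested, reusing the cases from the earlier tests. The function is repeated here only so the block runs on its own. Because `Counter` is a subclass of `dict`, it compares equal to a plain dictionary holding the same pairs, so no conversion helper is needed.

```python
from collections import Counter


def count_letters_v4(text):
    """Count letters using collections.Counter (repeated so this block is self-contained)."""
    return Counter(char for char in text.lower() if char.isalpha())


test_cases = [
    ('Hello World', {'h': 1, 'e': 1, 'l': 3, 'o': 2, 'w': 1, 'r': 1, 'd': 1}),
    ('AAaaa', {'a': 5}),
    ('123!@#', {}),  # No letters
    ('', {}),  # Empty string
]

for text, expected in test_cases:
    assert count_letters_v4(text) == expected, f"v4 failed on '{text}'"

print('All v4 tests passed!')
```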

### Extract Valid Data

Create a function, `extract_valid_data`, that takes a list of lists containing an arbitrary mix of *only* `int`, `float`, and `str` data types, along with a `max_val` number. Return a list of the unique integer values less than `max_val`, sorted in ascending order. The default value of `max_val` is 10. For example, the following function call:

```python
lols = [[1, 'a', 50], [50, 101, -5], [25, 3.14]]
extract_valid_data(lols, max_val=100)
```

should return

```python
[-5, 1, 25, 50]
```

To better understand how default arguments are used when defining and calling Python functions, review the first part of [this Geeks for Geeks article](https://www.geeksforgeeks.org/python/default-arguments-in-python/). The second part, about mutable defaults, is very important; we will revisit this topic later in the course.
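
To make the mutable-default concern concrete, here is a minimal sketch. Both functions are hypothetical examples, not part of the assignment. A default value is created once, when the function is defined, so a mutable default such as a list is shared across calls.

```python
def append_item(item, bucket=[]):
    """Hypothetical example: the default list is created once, at definition time."""
    bucket.append(item)
    return bucket


def append_item_safe(item, bucket=None):
    """Common workaround: default to None and create a new list on each call."""
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket


first = append_item(1)
second = append_item(2)  # reuses the same default list as the first call
print(second)  # [1, 2]

print(append_item_safe(1))  # [1]
print(append_item_safe(2))  # [2]
```

Immutable defaults such as `max_val=10` do not have this problem.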

You will need to use either `type` or `isinstance` to identify objects of type `int` in your solution. Consult the Python documentation or use the built-in help (e.g. `help(isinstance)`) for more information.
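
The two approaches differ in one detail, sketched below with made-up sample values: `isinstance` also accepts subclasses, while comparing `type(...)` matches the exact type only. In Python, `bool` is a subclass of `int`, so the two checks disagree on `True`. The inputs in this problem contain only `int`, `float`, and `str`, so either check works here.

```python
values = [5, 3.14, '7', True]  # sample values for illustration only

# isinstance returns True for the type itself and for its subclasses
isinstance_flags = [isinstance(v, int) for v in values]
print(isinstance_flags)  # [True, False, False, True]

# type(...) is int matches the exact type only
exact_type_flags = [type(v) is int for v in values]
print(exact_type_flags)  # [True, False, False, False]
```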

Write your function in the cell below.

In [None]:
def extract_valid_data(lists, max_val=10): ...

#### Solution

In [None]:
def extract_valid_data(lists, max_val=10):
    """Extract unique integers less than max_val from nested lists."""

    valid_numbers = []

    for sublist in lists:
        for item in sublist:
            # Check if item is an integer (not float) and less than max
            if isinstance(item, int):
                if item < max_val:
                    if item not in valid_numbers:
                        valid_numbers.append(item)

    valid_numbers.sort()

    return valid_numbers

#### Tests

Run the code below to test your implementation. If an error is detected, use the information provided to correct your function definition.

**You must run the cell above each time you make changes to it (to create the function definition) before running these tests.**

In [None]:
# Test 1: Basic example from problem description
lols = [[1, 'a', 50], [50, 101, -5], [25, 3.14]]
assert extract_valid_data(lols, max_val=100) == [-5, 1, 25, 50], 'Basic test failed'

# Test 2: Default max value (10)
data1 = [[1, 5, 15], [8, 12, 3], [5, 9, 10]]
assert extract_valid_data(data1) == [1, 3, 5, 8, 9], 'Default max_val=10 test failed'

# Test 3: No valid integers (all exceed max)
data2 = [[100, 200], [150, 300]]
assert extract_valid_data(data2, max_val=50) == [], 'No valid integers test failed'

# Test 4: Duplicates should be removed
data3 = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
assert extract_valid_data(data3, max_val=10) == [1, 2, 3, 4, 5], 'Duplicate removal test failed'

# Test 5: Mixed types - only integers should be included
data4 = [[1, 2.0, '3'], [4.5, 5, 'six'], [7.0, 8, 9.9]]
assert extract_valid_data(data4, max_val=10) == [1, 5, 8], 'Type filtering test failed'

# Test 6: Negative numbers
data5 = [[-5, -3, -1], [0, 1, 2]]
assert extract_valid_data(data5, max_val=3) == [-5, -3, -1, 0, 1, 2], 'Negative numbers test failed'

# Test 7: Single element sublists
data6 = [[1], [2], [3], [2], [1]]
assert extract_valid_data(data6, max_val=5) == [1, 2, 3], 'Single element test failed'

# Test 8: Large max value
data7 = [[1, 100, 1000], [50, 500, 5000]]
assert extract_valid_data(data7, max_val=10000) == [1, 50, 100, 500, 1000, 5000], (
    'Large max test failed'
)

# Test 9: Boundary case - values equal to max should be excluded
data8 = [[8, 9, 10, 11], [10, 10, 10]]
assert extract_valid_data(data8, max_val=10) == [8, 9], 'Boundary test failed (max_val=10)'

print('All tests passed!')

#### Interpretation

Add a text / markdown cell below to explain how the test code works. In particular, what does `assert` do here? Are you surprised by the number of tests required to fully check the solution?

#### Follow-Up (Graduate Students)

This part is for grad students only.

Rewrite this function as a single list comprehension.

Is the result easier or harder to read than your original implementation? What does this tell you about when comprehensions are best used in practice?

In [1]:
def extract_valid_data(lists, max_val=10):
    return sorted(
        set(
            item
            for sublist in lists
            for item in sublist
            if isinstance(item, int) and item < max_val
        )
    )

### Data Cleaning

Create a function, `clean_record`, that takes a dictionary and returns a cleaned version of the same. Each `dict` consists of four key:value pairs. All keys are strings and the expected type of each value is specified below:

- 'name': str
- 'age': int
- 'email': str
- 'score': float

To clean each record, your function should:

- convert all keys to lowercase
- convert all age and score values to integer or float values, as specified
- validate that age is positive and less than 100; if it is not, replace the value with `None` and print a warning message
- round score to one decimal place using `round(val, 1)`
- convert name to "Last, First" format
  - you can assume that all names come in "First Middle Last" format, but middle is optional
  - you can also assume that the names will not include titles (e.g. "Dr."), suffixes (e.g. "Jr."), multi-word last names (e.g. "Van Buren"), etc.
- return the cleaned version

Note: Python's `round` function uses Banker's Rounding, which can lead to unexpected results. See [this article for additional background / details](https://medium.com/@akhilnathe/understanding-pythons-round-function-from-basics-to-bankers-b64e7dd73477).
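
A short illustration of both effects, using values chosen for this sketch rather than taken from the test data. Exact ties go to the nearest even digit, and many decimal values that look like ties are not stored exactly as binary floats, so they are not ties at all.

```python
# Exact ties round to the nearest even digit ("round half to even")
ties = [round(0.5), round(1.5), round(2.5)]
print(ties)  # [0, 2, 2]

# 2.675 cannot be represented exactly as a binary float; the stored value
# is slightly below 2.675, so it rounds down
not_a_tie = round(2.675, 2)
print(not_a_tie)  # 2.67
```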

You may assume there are no missing keys in the data.

Write your function in the cell below.

In [None]:
def clean_record(record): ...

#### Solution

In [3]:
def clean_record(record):
    """Return a cleaned copy of record with lowercase keys and normalized values."""
    cleaned = {}

    # Process each key-value pair
    for key, value in record.items():
        key_lower = key.lower()

        if key_lower == 'name':
            # Convert "First [Middle] Last" to "Last, First"
            name_parts = str(value).strip().split()
            cleaned['name'] = f'{name_parts[-1]}, {name_parts[0]}'

        elif key_lower == 'age':
            # Convert to int and validate
            age_val = int(value)
            if 0 < age_val < 100:
                cleaned['age'] = age_val
            else:
                print(f'Invalid age: {age_val}. Age must be positive and less than 100.')
                cleaned['age'] = None

        elif key_lower == 'email':
            # Keep email as string
            cleaned['email'] = str(value)

        elif key_lower == 'score':
            # Convert to float and round to 1 decimal place
            score_val = float(value)
            cleaned['score'] = round(score_val, 1)

    return cleaned

#### Tests

Run the code below to test your implementation. If an error is detected (the output doesn't match the expected value for any of the 9 tests), use the information provided to correct your function definition.

**You must run the cell above each time you make changes to it (to create the function definition) before running these tests.**

In [5]:
### DO NOT CHANGE THE CODE IN THIS CELL

# Test data for clean_record function

test_input = [
    {
        'name': 'John Doe',
        'age': '25',
        'email': 'john@email.com',
        'score': '87.456',
    },
    {
        'NAME': 'Mary Jane Smith',
        'AGE': '30',
        'EMAIL': 'mj@email.com',
        'SCORE': '92.149',
    },
    {
        'Name': 'Bob Wilson',
        'Age': 42,
        'Email': 'bob@test.com',
        'Score': 81.951,
    },
    {
        'name': 'Anna Chen',
        'age': '1',
        'email': 'anna@email.com',
        'score': '95.678',
    },
    {
        'name': 'Senior Citizen',
        'age': '99',
        'email': 'senior@test.com',
        'score': 73.2,
    },
    {
        'NAME': 'Charlie Brown',
        'AGE': 19,
        'EMAIL': 'charlie@test.com',
        'SCORE': '90.5',
    },
    {
        'name': 'Jennifer Anne Marie Thompson',
        'age': '31',
        'email': 'jamt@email.com',
        'score': '88.8',
    },
    {
        'Name': 'Carlos Rodriguez',
        'Age': '28',
        'Email': 'carlos@email.com',
        'Score': 100,
    },
    {
        'name': 'Invalid Age',
        'age': '150',
        'email': 'invalid@test.com',
        'score': '80.0',
    },
]

test_expected = [
    {'name': 'Doe, John', 'age': 25, 'email': 'john@email.com', 'score': 87.5},
    {'name': 'Smith, Mary', 'age': 30, 'email': 'mj@email.com', 'score': 92.1},
    {'name': 'Wilson, Bob', 'age': 42, 'email': 'bob@test.com', 'score': 82.0},
    {'name': 'Chen, Anna', 'age': 1, 'email': 'anna@email.com', 'score': 95.7},
    {'name': 'Citizen, Senior', 'age': 99, 'email': 'senior@test.com', 'score': 73.2},
    {'name': 'Brown, Charlie', 'age': 19, 'email': 'charlie@test.com', 'score': 90.5},
    {'name': 'Thompson, Jennifer', 'age': 31, 'email': 'jamt@email.com', 'score': 88.8},
    {'name': 'Rodriguez, Carlos', 'age': 28, 'email': 'carlos@email.com', 'score': 100.0},
    {'name': 'Age, Invalid', 'age': None, 'email': 'invalid@test.com', 'score': 80.0},
]

# run tests to ensure output matches expected for each given input

for idx, data in enumerate(test_input):
    expected = test_expected[idx]
    actual = clean_record(data)

    # Check if dictionaries match
    if actual != expected:
        # Find which fields don't match
        for key in expected:
            if actual.get(key) != expected[key]:
                assert False, (
                    f"Test {idx + 1} failed on field '{key}': expected {expected[key]}, got {actual.get(key)}"
                )

print('All tests pass!')

Invalid age: 150. Age must be positive and less than 100.
All tests pass!


#### Interpretation

Add a text / markdown cell below to explain how the test code works. In particular, look up the `enumerate` function and `dict.get()` method. Consider how the equivalent would be written without them - how do those features simplify this implementation?

Also, what does `assert` do here, and how else could it be used in other testing situations?
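
As a starting point for that comparison, here is a small sketch with hypothetical sample data. It shows each feature next to a hand-written equivalent.

```python
names = ['Doe, John', 'Smith, Mary']  # hypothetical sample data

# Without enumerate: generate the indexes, then look each item up
manual_pairs = []
for idx in range(len(names)):
    manual_pairs.append((idx, names[idx]))

# With enumerate: the index and the item arrive together
enumerate_pairs = []
for idx, name in enumerate(names):
    enumerate_pairs.append((idx, name))

print(manual_pairs == enumerate_pairs)  # True

record = {'name': 'Doe, John'}  # hypothetical record with no 'age' key

# Without dict.get: test for the key before reading it
if 'age' in record:
    age_manual = record['age']
else:
    age_manual = None

# With dict.get: a missing key returns None (or a chosen default)
# instead of raising KeyError
age_get = record.get('age')

print(age_manual, age_get)  # None None
```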

#### Follow-Up (Graduate Students)

This part is for grad students only.

Explain in a text / markdown cell below how your approach would have to change if you could not assume each record was complete (all four keys present).

Sketch out the code change required. You can include (non-running) code blocks in markdown cells as shown below (edit this cell to see the formatting). This does not have to run, it is for communication purposes only.

```python
# code blocks are denoted in markdown with three backticks before and after
print("This is a markdown code block.")
```
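
One possible sketch of such a change, assuming that a missing field should simply appear as `None` in the output (the assignment does not specify this). The name `clean_record_partial` is hypothetical, and the age range check and warning message are left out for brevity.

```python
def clean_record_partial(record):
    """Hypothetical variant of clean_record that tolerates missing keys."""
    # Normalize key case first so lookups do not depend on capitalization
    lowered = {key.lower(): value for key, value in record.items()}

    # dict.get returns None for any key that is absent
    name = lowered.get('name')
    age = lowered.get('age')
    email = lowered.get('email')
    score = lowered.get('score')

    if name is not None:
        parts = str(name).split()
        name = f'{parts[-1]}, {parts[0]}'

    return {
        'name': name,
        'age': int(age) if age is not None else None,
        'email': str(email) if email is not None else None,
        'score': round(float(score), 1) if score is not None else None,
    }


partial = clean_record_partial({'NAME': 'Mary Jane Smith', 'Score': '92.149'})
print(partial)
# {'name': 'Smith, Mary', 'age': None, 'email': None, 'score': 92.1}
```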

### Implement a Simplified `zip()`

Create a function, `simple_zip`, that emulates some functionality of the `zip` function included with base Python:

```bash
> help(zip)
Help on class zip in module builtins:

class zip(object)
 |  zip(*iterables, strict=False)
 |
 |  The zip object yields n-length tuples, where n is the number of iterables
 |  passed as positional arguments to zip().  The i-th element in every tuple
 |  comes from the i-th iterable argument to zip().  This continues until the
 |  shortest argument is exhausted.
 |
 |  If strict is true and one of the arguments is exhausted before the others,
 |  raise a ValueError.
 |
 |     >>> list(zip('abcdefg', range(3), range(4)))
 |     [('a', 0, 0), ('b', 1, 1), ('c', 2, 2)]
```

Python's version creates a lazy iterator (which behaves much like a *generator*) that produces values as needed rather than all at once. Your solution should return a list of tuples instead. For example, the following function call:

```python
it1 = [1, 2, 3]
it2 = ['a', 'b', 'c']
simple_zip(it1, it2)
```

should return

```python
[(1, 'a'), (2, 'b'), (3, 'c')]
```

Do not implement the `strict` argument. Instead, emulate the default behavior of `zip`: if the iterables are of differing lengths, stop when the shortest one is exhausted.

Note that the first argument in `zip` is `*iterables`, allowing it to accept any number of iterables. When you use `*VARIABLE_NAME` in this fashion, Python automatically collects all the positional arguments into a tuple called `VARIABLE_NAME` (e.g. `iterables`). It is your responsibility to extract individual arguments from the resulting tuple. The following code block demonstrates this for clarity.

In [None]:
def example(*vars):
    # return vars as constructed by Python from the user's arguments
    return vars


var1 = 'first argument'
var2 = 'second argument'
result = example(var1, var2)

# inspect results
print(result)  # ('first argument', 'second argument')
print(result[0])  # 'first argument'

Write your function in the cell below.

In [None]:
def simple_zip(*iterables): ...

#### Solution

In [None]:
def simple_zip(*iterables):
    """Recreate a simplified version of Python's zip function (returns a list)."""

    # Convert all iterables to lists to access by index
    lists = []
    for iterable in iterables:
        lists.append(list(iterable))

    min_length = len(lists[0])
    for lst in lists:
        if len(lst) < min_length:
            min_length = len(lst)

    # Build the result
    result = []
    for i in range(min_length):
        # Build tuple for position i
        current_tuple = []
        for lst in lists:
            current_tuple.append(lst[i])
        result.append(tuple(current_tuple))

    return result

#### Tests

Run the code below to test your implementation. If an error is detected, use the information provided to correct your function definition.

**You must run the cell above each time you make changes to it (to create the function definition) before running these tests.**

In [None]:
# Basic test cases
assert simple_zip([1, 2], ['a', 'b']) == [(1, 'a'), (2, 'b')], 'Basic test failed'
assert simple_zip([1, 2, 3], ['a', 'b']) == [(1, 'a'), (2, 'b')], "Doesn't stop at shortest"

# Multiple iterables
assert simple_zip([1, 2], ['a', 'b'], [10, 20]) == [(1, 'a', 10), (2, 'b', 20)], (
    "Doesn't handle >2 iterables"
)

# Different types of iterables
assert simple_zip('abc', [1, 2, 3]) == [('a', 1), ('b', 2), ('c', 3)], "Doesn't handle mixed types"
assert simple_zip(range(3), 'xyz') == [(0, 'x'), (1, 'y'), (2, 'z')], (
    "Doesn't handle other iterable types"
)

print('All tests passed!')

#### Interpretation

Add a text / markdown cell below to discuss how you might make your code more concise and/or readable by using list comprehensions. If you already used them in your solution, describe why you chose that approach.

Also, what tests have we overlooked? What are we assuming about the input that might cause a crash when this function is called?
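
For reference, this sketch shows how the built-in `zip` behaves on a few inputs that the tests above do not cover. These are candidate cases to compare against `simple_zip`. For instance, an implementation that reads the length of the first iterable without checking that any were passed would raise an `IndexError` on the first case.

```python
# No iterables at all
no_args = list(zip())
print(no_args)  # []

# One of the iterables is empty
one_empty = list(zip([], [1, 2]))
print(one_empty)  # []

# A single iterable produces 1-tuples
single = list(zip([1, 2, 3]))
print(single)  # [(1,), (2,), (3,)]
```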

#### Follow-Up (Graduate Students)

Read more about [generator objects](https://realpython.com/introduction-to-python-generators/). Then run the following code, noting the included comments.

In [None]:
# Python's built-in zip returns a generator-like object
result1 = zip([1, 2, 3], ['a', 'b', 'c'])
print(result1)  # What do you see?
print(list(result1))  # Convert to list
print(list(result1))  # Try again - what happens?

# Your simple_zip returns a list
result2 = simple_zip([1, 2, 3], ['a', 'b', 'c'])
print(result2)  # What do you see?
print(result2)  # Try again - what happens?

Based on your reading and the code above:

- What's the key difference between a generator and a list?
- Why might Python's zip return a generator instead of a list?
- Name one advantage and one disadvantage of generators vs lists.
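
To ground the comparison, here is a minimal sketch of a generator-based variant. The name `lazy_zip` is hypothetical, and this is only one way to write it. It yields one tuple at a time instead of building the whole list.

```python
from itertools import count


def lazy_zip(*iterables):
    """Yield tuples one at a time, stopping at the shortest iterable."""
    iterators = [iter(iterable) for iterable in iterables]
    if not iterators:
        return  # nothing to zip, so the generator ends immediately
    while True:
        current = []
        for iterator in iterators:
            try:
                current.append(next(iterator))
            except StopIteration:
                return  # one input is exhausted, so stop producing tuples
        yield tuple(current)


pairs = lazy_zip([1, 2, 3], 'ab')
print(next(pairs))  # (1, 'a')
print(list(pairs))  # [(2, 'b')]
print(list(pairs))  # []  (already exhausted)

# Values are produced on demand, so an infinite iterable is fine
print(list(lazy_zip(count(), 'ab')))  # [(0, 'a'), (1, 'b')]
```

A list-returning version could not accept `count()`, because it would try to consume the infinite iterable before returning anything.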

## Reflection

Address the following (concise bullets or short paragraphs are fine):

1. Key takeaway
   - What part of this assignment most surprised you or led to the most significant improvement in your Python understanding?
   - Include a concrete before/after to illustrate how this assignment has changed your approach to problem solving, syntax, styling, or other implementation details.
2. GenAI use
   - If used, specify the tool / model used, how you used it, how you verified correctness, and how it was most helpful (breadth / depth of understanding, quality of code, time to completion, etc.). Note any limits or problems you observed and how you mitigated them.
   - If not, why not, and when do you expect to use it in this course, if at all?
3. Feedback
   - Approximately how much time did you spend on this assignment?
   - What was the most difficult part?
   - How would you improve it?
   - Anything else you want to share or ask?