
## validUtf8

Write a program that takes in an array of integers representing byte values, and returns whether it is a valid UTF-8 encoding.
UTF-8 is a character encoding that maps each symbol to one, two, three, or four bytes.

For example, the Euro sign, €, corresponds to the three bytes 11100010 10000010 10101100. The rules for mapping characters are as follows:

- For a single-byte character, the first bit must be zero.
- For an n-byte character, the first byte starts with n ones and a zero. The other n - 1 bytes all start with 10.

Visually, this can be represented as follows.
````

 Bytes   |           Byte format
-----------------------------------------------
   1     | 0xxxxxxx
   2     | 110xxxxx 10xxxxxx
   3     | 1110xxxx 10xxxxxx 10xxxxxx
   4     | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
````
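
As a quick check of the Euro example, Python's built-in `str.encode` can produce the bytes, which can then be printed in binary. This is only a minimal sketch; the helper name `to_binary` is made up for illustration and is not part of the original solution.

```python
def to_binary(data):
    # Format each byte value as an 8-bit binary string.
    return " ".join(f"{b:08b}" for b in data)

# "\u20ac" is the Euro sign; list() turns the bytes into integer byte values.
euro_bytes = list("\u20ac".encode("utf-8"))
print(to_binary(euro_bytes))  # prints: 11100010 10000010 10101100
```

The first byte starts with three ones and a zero, and the two bytes after it start with `10`, matching the 3-byte row of the table.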

### Solving the problem:

1. Create a function `validUtf8` that takes a list of integers as input.
2. Iterate through each integer in the list.
3. When no character is in progress, count the leading 1 bits of the byte to determine how many bytes the character uses. A count of 0 is a single-byte character; a count of 1 or more than 4 is invalid.
4. Use a variable to keep track of how many bytes are still expected for the current character, and check that each of them starts with `10`.
5. If we encounter an invalid byte, or the count of expected bytes is non-zero at the end, return `False`.
6. If all checks pass, return `True`.
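
The leading-bit count used to classify a byte can be sketched in isolation. The helper name `leading_ones` is hypothetical and only illustrates the idea; it assumes each value fits in 8 bits.

```python
def leading_ones(byte):
    # Count consecutive 1 bits starting from the most significant bit
    # of an 8-bit value.
    count = 0
    mask = 0b10000000
    while mask and byte & mask:
        count += 1
        mask >>= 1
    return count

for b in (0b01001001, 0b10000010, 0b11000100, 0b11100010, 0b11110000):
    print(f"{b:08b} -> {leading_ones(b)}")
```

A result of 0 marks a single-byte character, 1 marks a continuation byte, and 2 to 4 mark the first byte of a multi-byte character.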

Here's the code for the `validUtf8` function based on the above approach:

In [1]:
def validUtf8(data):
    # Number of bytes still expected for the current UTF-8 character
    n_bytes = 0

    # Mask to check if the most significant bit is set or not
    mask1 = 1 << 7
    # Mask to check if the second most significant bit is set or not
    mask2 = 1 << 6

    for num in data:
        # Mask to check the number of bytes the current character uses
        mask = 1 << 7
        if n_bytes == 0:
            while mask & num:
                n_bytes += 1
                mask = mask >> 1

            # 1 byte character
            if n_bytes == 0:
                continue

            # Invalid: a lone continuation byte (10xxxxxx) or more than 4 leading ones
            if n_bytes == 1 or n_bytes > 4:
                return False
        else:
            # Check if the byte is of the pattern 10xxxxxx
            if not (num & mask1 and not (num & mask2)):
                return False
        n_bytes -= 1

    return n_bytes == 0

# Test the function with the given example
test_data = [0b11100010, 0b10000010, 0b10101100]
validUtf8(test_data)


True

## Testing the Solution

1. **Setup**: We will create a list of test cases. Each test case will consist of two elements:
    - A list of integers representing the byte values.
    - The expected result (either `True` or `False`).
2. **Execution**: We will run each test case through the `validUtf8` function.
3. **Verification**: After execution, we will compare the result with the expected result.
4. **Output**: We will display a message indicating whether the test passed or failed, along with the input, expected output, and actual output.

Here are the 10 test cases we will use:

1. The given example, which represents the Euro sign (€).
2. A valid 1-byte character.
3. A valid 2-byte character.
4. A valid 4-byte character.
5. An invalid byte sequence that starts with `10xxxxxx`.
6. An invalid 5-byte character.
7. An incomplete 2-byte character.
8. An incomplete 3-byte character.
9. An invalid sequence with a correct starting byte but incorrect following bytes.
10. A mix of valid 1-byte and 2-byte characters.

In [2]:
def test_validUtf8():
    test_cases = [
        # Test case 1: Given example for Euro sign (€)
        ([0b11100010, 0b10000010, 0b10101100], True),

        # Test case 2: Valid 1-byte character (ASCII)
        ([0b01001001], True),

        # Test case 3: Valid 2-byte character
        ([0b11000100, 0b10001001], True),

        # Test case 4: Valid 4-byte character
        ([0b11110000, 0b10010000, 0b10000000, 0b10000000], True),

        # Test case 5: Invalid byte sequence starting with 10xxxxxx
        ([0b10001001], False),

        # Test case 6: Invalid 5-byte character
        ([0b11111000, 0b10000000, 0b10000000, 0b10000000, 0b10000000], False),

        # Test case 7: Incomplete 2-byte character
        ([0b11000100], False),

        # Test case 8: Incomplete 3-byte character
        ([0b11100010, 0b10000010], False),

        # Test case 9: Invalid sequence with correct starting byte but incorrect following bytes
        ([0b11100010, 0b11000010, 0b10101100], False),

        # Test case 10: Mix of valid 1-byte and 2-byte characters
        ([0b01001001, 0b11000100, 0b10001001], True),
    ]

    passed = 0
    for i, (data, expected) in enumerate(test_cases, 1):
        result = validUtf8(data)
        if result == expected:
            print(f"Test case {i}: PASSED")
            passed += 1
        else:
            print(f"Test case {i}: FAILED")
            print(f"  Input: {data}")
            print(f"  Expected: {expected}")
            print(f"  Got: {result}")
        print('-' * 50)

    print(f"Total test cases passed: {passed}/{len(test_cases)}")

test_validUtf8()


Test case 1: PASSED
--------------------------------------------------
Test case 2: PASSED
--------------------------------------------------
Test case 3: PASSED
--------------------------------------------------
Test case 4: PASSED
--------------------------------------------------
Test case 5: PASSED
--------------------------------------------------
Test case 6: PASSED
--------------------------------------------------
Test case 7: PASSED
--------------------------------------------------
Test case 8: PASSED
--------------------------------------------------
Test case 9: PASSED
--------------------------------------------------
Test case 10: PASSED
--------------------------------------------------
Total test cases passed: 10/10


### Breakdown of the test cases:

1. **Euro sign (€)**: This was the given example and represents a valid 3-byte UTF-8 character.
2. **Valid 1-byte character**: This represents an ASCII character, which is always valid in UTF-8.
3. **Valid 2-byte character**: This sequence correctly starts with `110xxxxx` and is followed by `10xxxxxx`.
4. **Valid 4-byte character**: This sequence correctly starts with `11110xxx` and is followed by three `10xxxxxx` bytes.
5. **Invalid byte sequence**: The sequence starts with `10xxxxxx`, which is not a valid starting byte.
6. **Invalid 5-byte character**: UTF-8 does not support characters that use 5 bytes.
7. **Incomplete 2-byte character**: Only the starting byte is given without the following byte.
8. **Incomplete 3-byte character**: The starting byte and one following byte are given, but the third byte is missing.
9. **Invalid sequence with correct starting byte**: The sequence starts correctly but is followed by an incorrect byte pattern.
10. **Mix of valid characters**: A sequence containing both a valid 1-byte and a valid 2-byte character.

The test harness prints a clear result for each test case. `validUtf8` passes all ten, which cover valid characters of one to four bytes, an invalid leading byte, an unsupported 5-byte pattern, truncated sequences, and a malformed continuation byte.
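
As a further sanity check, results can be compared against Python's built-in strict UTF-8 decoder. The sketch below is an assumption-laden illustration, not part of the original solution: the name `strict_valid_utf8` is hypothetical. Note that the built-in decoder applies more rules than the problem statement does, so the two checks can disagree. For example, the overlong sequence `11000000 10000000` satisfies the structural rules above but is rejected by the decoder.

```python
def strict_valid_utf8(data):
    """Return True if Python's strict decoder accepts the byte values."""
    try:
        bytes(data).decode("utf-8")
        return True
    except ValueError:
        # Covers UnicodeDecodeError (a ValueError subclass) and
        # values outside 0..255, which bytes() rejects.
        return False

samples = [
    [0b11100010, 0b10000010, 0b10101100],  # Euro sign
    [0b10001001],                          # stray continuation byte
    [0b11000100],                          # truncated 2-byte character
    [0b11000000, 0b10000000],              # overlong: structurally fine, rejected here
]
for data in samples:
    print(data, strict_valid_utf8(data))
```

Any case where this check and `validUtf8` disagree points to a rule that goes beyond the byte patterns in the problem statement.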