
In [24]:
# Import necessary libraries
import unittest
import pandas as pd
import numpy as np
import seaborn as sns
import io
import contextlib # For redirecting stdout
from urllib.error import URLError
import warnings
import sys
import os
# from pathlib import Path # Not used in current run_tests, but maybe intended
# from typing import List, Any, Optional, Tuple, Dict # Not used, but good practice

_DATASET_CACHE = {}
def load_cached_dataset(name):
    """
    Loads a seaborn dataset using caching.
    Returns a copy of the DataFrame, or None if loading fails (the failure is cached).
    """
    if name not in _DATASET_CACHE:
        print(f"\nAttempting to load dataset '{name}'...")
        try:
            with warnings.catch_warnings():
                warnings.simplefilter("ignore")
                # Specify cache location and attempt loading
                df = sns.load_dataset(name, cache=True, data_home='./seaborn-data')

            if df is None:
                # This case might happen if the dataset name is invalid but doesn't raise an Exception directly
                raise ValueError(f"sns.load_dataset('{name}') returned None. Dataset might not exist or is empty.")

            _DATASET_CACHE[name] = df
            print(f"Dataset '{name}' loaded successfully ({df.shape[0]} rows, {df.shape[1]} cols).")

        except Exception as e:  # Exception already covers ValueError, URLError, TimeoutError, etc.
            # Log the error clearly so the cause is visible in the notebook output
            print("------------------------------------------------------")
            print(f"ERROR: Data loading failed for '{name}': {type(e).__name__} - {e}")
            print(f"Tests relying on '{name}' will likely fail or be skipped.")
            print("------------------------------------------------------")
            # Cache None as a failure signal instead of re-raising, so callers
            # can return early and tests can skip via skipTest.
            # To make tests fail immediately instead, re-raise here with a bare `raise`.
            _DATASET_CACHE[name] = None

    # Return a copy if loaded successfully, otherwise return None (or raise error if preferred)
    cached_data = _DATASET_CACHE.get(name)
    return cached_data.copy() if cached_data is not None else None

# --- Provided Test Runner Function ---
def run_tests(test_class):
    """Runs tests from a specific unittest.TestCase subclass."""
    print(f"\n--- Running tests from {test_class.__name__} ---")
    # Use defaultTestLoader for compatibility
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(test_class)
    # Ensure output goes to stdout even in environments that might redirect it
    runner = unittest.TextTestRunner(verbosity=2, stream=sys.stdout, buffer=False) # buffer=False might help in some notebook setups

    result = runner.run(suite)
    print(f"--- Finished tests for {test_class.__name__} ---")

    print("-" * 70)

Descriptions of the exercise functions (Exercises 1-6):

**Head/Tail Data Inspection / Инспекция Начальных/Конечных Данных**

Score: 5

EN: Loads the 'tips' dataset using a helper function, extracts the first 3 rows (head) and the last 3 rows (tail), and returns them as a tuple of two DataFrames.

RU: Загружает набор данных 'tips' с помощью вспомогательной функции, извлекает первые 3 строки (head) и последние 3 строки (tail) и возвращает их в виде кортежа из двух DataFrame.
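The head/tail inspection described above can be sketched on a small in-memory frame. This is only an illustration: `sample` and `head_and_tail` are hypothetical names, and the five rows stand in for the real 'tips' data so the snippet runs without a download.

```python
import pandas as pd

# Hypothetical stand-in for the 'tips' dataset: five illustrative rows.
sample = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59],
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61],
    "day": ["Sun", "Sun", "Sun", "Sun", "Sun"],
})


def head_and_tail(df, n=3):
    """Return the first n and the last n rows as a (head, tail) tuple."""
    return df.head(n), df.tail(n)


head_df, tail_df = head_and_tail(sample)
print(head_df)
print(tail_df)
```

Both results keep the original index labels, so here `tail_df` is indexed 2, 3, 4 rather than 0, 1, 2.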

**DataFrame Shape and Data Types / Форма и Типы Данных DataFrame**

Score: 5

EN: Loads the 'titanic' dataset, retrieves its dimensions (shape) and the data types of each column (dtypes), and returns these as a tuple.

RU: Загружает набор данных 'titanic', получает его размерность (shape) и типы данных каждого столбца (dtypes) и возвращает их в виде кортежа.
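As a rough illustration of what `shape` and `dtypes` return, here is a sketch on a tiny hand-made frame (the `sample` frame and its values are assumptions, not the real 'titanic' data):

```python
import pandas as pd

# Hypothetical three-row frame with an integer, a float, and a boolean column.
sample = pd.DataFrame({
    "survived": [0, 1, 1],
    "fare": [7.25, 71.2833, 7.925],
    "alone": [False, False, True],
})

shape = sample.shape    # plain tuple: (rows, columns)
dtypes = sample.dtypes  # Series indexed by column name

print(shape)  # (3, 3)
print(dtypes)
```

Because `dtypes` is a Series indexed by column name, looking up `dtypes["fare"]` is more robust than relying on column position.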

**Titanic Data Cleaning / Очистка Данных Титаника**

Score: 20

EN: Loads the 'titanic' dataset and applies a series of cleaning operations: converts 'age' to numeric (handling errors), fills missing 'age' with the mean, fills missing 'embarked' and 'embark\_town' with their respective modes, and drops rows with missing 'deck' values. Returns the cleaned DataFrame.

RU: Загружает набор данных 'titanic' и применяет серию операций очистки: преобразует 'age' в числовой формат (обрабатывая ошибки), заполняет пропущенные значения 'age' средним, заполняет пропущенные значения 'embarked' и 'embark\_town' их соответствующими модами и удаляет строки с пропущенными значениями 'deck'. Возвращает очищенный DataFrame.
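The cleaning steps above can be tried on a small frame first. The sketch below uses made-up values (the `sample` frame is an assumption, not the real 'titanic' data) and shows that `Series.mode()` returns a Series, so its first element is taken as the fill value.

```python
import pandas as pd

# Hypothetical rows: one unparseable age, one missing age,
# one missing 'embarked', and one missing 'deck'.
sample = pd.DataFrame({
    "age": ["22", "bad", None, "38"],
    "embarked": ["S", "S", None, "C"],
    "deck": ["C", None, "E", "B"],
})

cleaned = sample.copy()
# Step 1: non-numeric strings become NaN instead of raising.
cleaned["age"] = pd.to_numeric(cleaned["age"], errors="coerce")
# Step 2: the mean is computed after coercion, over the valid values only.
cleaned["age"] = cleaned["age"].fillna(cleaned["age"].mean())
# Step 3: mode() returns a Series; take the first (most frequent) value.
cleaned["embarked"] = cleaned["embarked"].fillna(cleaned["embarked"].mode().iloc[0])
# Step 5: drop rows whose 'deck' is missing.
cleaned = cleaned.dropna(subset=["deck"])

print(cleaned)
```

The row with the missing `deck` is dropped last, so the fill values for `age` and `embarked` are computed over all rows, including the one that is later removed.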

**Text Processing and Filtering / Обработка Текста и Фильтрация**

Score: 20

EN: Loads the 'titanic' dataset, creates a 'category' column based on the 'who' column, converts 'category' to uppercase, and then filters the DataFrame twice: first for rows where 'category' is 'WOMAN', and second for rows where 'embarked' is 'S'. Returns the resulting filtered DataFrame.

RU: Загружает набор данных 'titanic', создает столбец 'category' на основе столбца 'who', преобразует 'category' в верхний регистр, а затем дважды фильтрует DataFrame: сначала по строкам, где 'category' равно 'WOMAN', а затем по строкам, где 'embarked' равно 'S'. Возвращает итоговый отфильтрованный DataFrame.
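A minimal sketch of the uppercase-then-filter sequence described above, on a hypothetical four-row frame (the values are illustrative; only the column names come from the description):

```python
import pandas as pd

# Hypothetical rows standing in for the 'titanic' data.
sample = pd.DataFrame({
    "who": ["man", "woman", "woman", "child"],
    "embarked": ["S", "C", "S", "S"],
})

result = sample.copy()
# Build 'category' from 'who' and upper-case it via the .str accessor.
result["category"] = result["who"].str.upper()
# First filter: keep women only.
result = result[result["category"] == "WOMAN"]
# Second filter: keep passengers who embarked at 'S'.
result = result[result["embarked"] == "S"]

print(result)
```

Applying the two boolean masks one after the other gives the same rows as combining them with `&` in a single step.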

**Grouping and Aggregation / Группировка и Агрегация**

Score: 20

EN: Loads the 'tips' dataset, groups the data by 'day' and 'smoker', and then calculates aggregate statistics for each group: the mean of 'total\_bill' (as 'avg\_bill'), the maximum 'tip' (as 'max\_tip'), and the count of entries (as 'count'). Returns the resulting aggregated DataFrame with a MultiIndex.

RU: Загружает набор данных 'tips', группирует данные по 'day' и 'smoker', а затем вычисляет агрегированную статистику для каждой группы: среднее значение 'total\_bill' (как 'avg\_bill'), максимальное значение 'tip' (как 'max\_tip') и количество записей (как 'count'). Возвращает итоговый агрегированный DataFrame с MultiIndex.
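One way to produce the three named statistics is pandas named aggregation. The sketch below runs on a hypothetical five-row frame (the values are made up) and is only one possible approach to the exercise:

```python
import pandas as pd

# Hypothetical rows standing in for the 'tips' data.
sample = pd.DataFrame({
    "day": ["Sat", "Sat", "Sun", "Sun", "Sun"],
    "smoker": ["No", "No", "Yes", "No", "No"],
    "total_bill": [10.0, 20.0, 30.0, 12.0, 18.0],
    "tip": [1.0, 3.0, 5.0, 2.0, 2.5],
})

# Named aggregation: output_column=(input_column, function).
summary = sample.groupby(["day", "smoker"]).agg(
    avg_bill=("total_bill", "mean"),
    max_tip=("tip", "max"),
    count=("total_bill", "count"),
)

print(summary)
```

Grouping by two keys yields a MultiIndex, so a single group is addressed with a tuple, for example `summary.loc[("Sat", "No")]`.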

**Car Name Extraction and Cleaning / Извлечение и Очистка Названий Автомобилей**

Score: 30

EN: Loads the 'mpg' dataset, extracts the first word from the 'name' column into 'initial\_brand'. It then cleans common typos in 'initial\_brand' (e.g., 'chevy' to 'chevrolet') to create a final 'brand' column using mapping. Finally, it creates a 'cleaned\_name' column by removing the first word from the original 'name' and joining the remaining parts using string accessor methods. Returns the DataFrame with these new columns.

RU: Загружает набор данных 'mpg', извлекает первое слово из столбца 'name' в 'initial\_brand'. Затем очищает распространенные опечатки в 'initial\_brand' (например, 'chevy' в 'chevrolet') для создания итогового столбца 'brand', используя сопоставление (mapping). Наконец, создает столбец 'cleaned\_name', удаляя первое слово из исходного 'name' и соединяя оставшиеся части, используя методы строкового аксесора. Возвращает DataFrame с этими новыми столбцами.
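The brand extraction described above can be sketched with the `.str` accessor. The typo map `brand_fixes` below is hypothetical (the exercise's actual mapping is not shown here), and `Series.replace` is used as one possible way to apply it:

```python
import pandas as pd

# Hypothetical car names standing in for the 'mpg' data.
sample = pd.DataFrame({
    "name": ["chevy impala", "toyota corolla 1200", "vw rabbit", "ford pinto"],
})

# Hypothetical typo/abbreviation map; extend as needed.
brand_fixes = {"chevy": "chevrolet", "vw": "volkswagen"}

result = sample.copy()
parts = result["name"].str.split()          # each row becomes a list of words
result["initial_brand"] = parts.str[0]      # first word
# Values not present in the map are left unchanged by replace().
result["brand"] = result["initial_brand"].replace(brand_fixes)
# Drop the first word and join the remainder back into one string.
result["cleaned_name"] = parts.str[1:].str.join(" ")

print(result)
```

`replace` is chosen here over `map` because `map` with a dict turns unmatched brands into NaN, which would then need a `fillna` with the original value.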

# Head/Tail Data Inspection / Инспекция Начальных/Конечных Данных

In [26]:
def exercise_1_load_and_inspect_head_tail():
    """
    Loads the 'tips' dataset, returns head(3)/tail(3). Returns (head_df, tail_df) or (None, None).

    Загружает набор данных 'tips', возвращает head(3)/tail(3). Возвращает (head_df, tail_df) или (None, None).

    Source/Источник: https://rdrr.io/cran/reshape2/man/tips.html
    """
    dataset_name = 'tips'
    print(f"\n--- Running Exercise 1 body: Load '{dataset_name}' ---")
    try:
        df = load_cached_dataset(dataset_name)
        if df is None or df.empty:
            print(f"INFO Ex1: Cannot run exercise body: DataFrame '{dataset_name}' is empty or None.")
            # Return a tuple of Nones so the failure path has the same arity as the success path
            return None, None

        df_processed = df.copy()
        head_df = df_processed.head(3)
        tail_df = df_processed.tail(3)

        return head_df, tail_df
    except Exception as e:
        print(f"ERROR in exercise_1 body: {type(e).__name__} - {e}")
        return None, None

# Display head and tail one after the other
for frame in exercise_1_load_and_inspect_head_tail():
    display(frame)


--- Running Exercise 1 body: Load 'tips' ---


index,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3


index,total_bill,tip,sex,smoker,day,time,size
241,22.67,2.0,Male,Yes,Sat,Dinner,2
242,17.82,1.75,Male,No,Sat,Dinner,2
243,18.78,3.0,Female,No,Thur,Dinner,2


In [27]:
class TestExercise1(unittest.TestCase):
    def test_ex1_logic(self):
        """Test Ex1: Checks return types, lengths, and specific values for head/tail."""
        print("\nRunning test_ex1_logic...") # Test-specific print
        # Suppress prints from the exercise function itself during the test run
        with contextlib.redirect_stdout(io.StringIO()) as captured_output:
            head, tail = exercise_1_load_and_inspect_head_tail() # Call without argument

        # Optional: Print captured output for debugging if needed
        # print("Captured output from exercise_1:\n", captured_output.getvalue())

        if head is None and tail is None:
            # Distinguish an explicit early return (data load problem) from an unexpected exception
            func_output = captured_output.getvalue()
            if "Cannot run exercise body" in func_output or "ERROR: Data loading failed" in func_output:
                self.skipTest("Ex1 Skipped: Function returned (None, None), data load issue detected.")
            else:
                # (None, None) was returned because of an unexpected exception inside the try block
                self.fail("Ex1 Failed: Function returned (None, None) unexpectedly. Check logs.")


        # --- Assertions remain the same as they test the logic based on 'tips' ---
        # Check types
        self.assertIsInstance(head, pd.DataFrame, "Ex1 Head is not a DataFrame")
        self.assertIsInstance(tail, pd.DataFrame, "Ex1 Tail is not a DataFrame")

        # Check non-emptiness
        self.assertFalse(head.empty, "Ex1 Head DF is empty")
        self.assertFalse(tail.empty, "Ex1 Tail DF is empty")

        # Check lengths
        self.assertEqual(len(head), 3, "Ex1 Head Length Check Failed")
        self.assertEqual(len(tail), 3, "Ex1 Tail Length Check Failed")  # 'tips' has 244 rows, so tail(3) returns exactly 3

        # Check specific values (assuming 'tips' dataset loaded correctly)
        # Use .iloc for position-based access, less prone to index changes
        self.assertEqual(head.iloc[0]['sex'], 'Female', "Ex1 Head[0] sex mismatch")
        self.assertEqual(head.iloc[2]['day'], 'Sun', "Ex1 Head[2] day mismatch")
        # Compare floats with a tolerance rather than exact equality
        self.assertAlmostEqual(tail.iloc[-1]['total_bill'], 18.78, places=2, msg="Ex1 Tail[-1] total_bill mismatch")
        self.assertEqual(tail.iloc[-2]['smoker'], 'No', "Ex1 Tail[-2] smoker mismatch")
        print("Test test_ex1_logic PASSED.")

# --- Run Function and Tests ---
for frame in exercise_1_load_and_inspect_head_tail():
    display(frame)
run_tests(TestExercise1)


--- Running Exercise 1 body: Load 'tips' ---


index,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3


index,total_bill,tip,sex,smoker,day,time,size
241,22.67,2.0,Male,Yes,Sat,Dinner,2
242,17.82,1.75,Male,No,Sat,Dinner,2
243,18.78,3.0,Female,No,Thur,Dinner,2



--- Running tests from TestExercise1 ---
test_ex1_logic (__main__.TestExercise1.test_ex1_logic)
Test Ex1: Checks return types, lengths, and specific values for head/tail. ... 
Running test_ex1_logic...
Test test_ex1_logic PASSED.
ok

----------------------------------------------------------------------
Ran 1 test in 0.008s

OK
--- Finished tests for TestExercise1 ---
----------------------------------------------------------------------


# DataFrame Shape and Data Types / Форма и Типы Данных DataFrame


In [28]:
def exercise_2_check_shape_info_dtypes():
    """
    Loads the 'titanic' dataset, returns shape and dtypes. Returns (shape, dtypes) or (None, None).

    Загружает набор данных 'titanic', возвращает shape и dtypes. Возвращает (shape, dtypes) или (None, None).

    Source/Источник: https://www.kaggle.com/c/titanic/data
    """
    dataset_name = 'titanic' # Hardcoded dataset name
    print(f"\n--- Running Exercise 2 body: Load '{dataset_name}' ---")
    try:
        df = load_cached_dataset(dataset_name)
        if df is None or df.empty:
            print(f"INFO Ex2: Cannot run exercise body: DataFrame '{dataset_name}' is empty or None.")
            # Return a tuple of Nones so the failure path has the same arity as the success path
            return None, None

        df_processed = df.copy()
        return df_processed.shape, df_processed.dtypes


    except Exception as e:
        print(f"ERROR in exercise_2 body: {type(e).__name__} - {e}")
        return None, None

# Display the shape tuple and the dtypes Series one after the other
for item in exercise_2_check_shape_info_dtypes():
    display(item)


--- Running Exercise 2 body: Load 'titanic' ---

Attempting to load dataset 'titanic'...
Dataset 'titanic' loaded successfully (891 rows, 15 cols).


(891, 15)

column,dtype
survived,int64
pclass,int64
sex,object
age,float64
sibsp,int64
parch,int64
fare,float64
embarked,object
class,category
who,object


In [29]:
# --- Tests for Exercise 2 ---
class TestExercise2(unittest.TestCase):
    def test_ex2_logic(self):
        """Test Ex2: Checks return types, shape values, and specific dtypes for 'titanic'."""
        print("\nRunning test_ex2_logic...")
        # Suppress prints from the exercise function itself during the test run
        with contextlib.redirect_stdout(io.StringIO()) as captured_output:
            shape, dtypes = exercise_2_check_shape_info_dtypes()

        if shape is None and dtypes is None:
            # Distinguish an explicit early return (data load problem) from an unexpected exception
            func_output = captured_output.getvalue()
            if "Cannot run exercise body" in func_output or "ERROR: Data loading failed" in func_output:
                self.skipTest("Ex2 Skipped: Function returned (None, None), data load issue detected.")
            else:
                self.fail("Ex2 Failed: Function returned (None, None) unexpectedly. Check logs.")

        # --- Assertions ---
        self.assertIsInstance(shape, tuple, "Ex2 Shape is not a tuple")
        self.assertEqual(len(shape), 2, "Ex2 Shape tuple length invalid")
        self.assertIsInstance(dtypes, pd.Series, "Ex2 Dtypes is not a Series")
        self.assertFalse(dtypes.empty, "Ex2 Dtypes series is empty")

        # Specific checks for the 'titanic' dataset loaded via seaborn
        # These might fail if a different version/source of 'titanic' is loaded
        self.assertEqual(shape[0], 891, f"Ex2 Expected 891 rows, got {shape[0]}")
        self.assertEqual(shape[1], 15, f"Ex2 Expected 15 columns, got {shape[1]}")

        # Check specific dtypes by index (less robust) or name (more robust)
        # Assuming standard seaborn titanic dataset column order/names
        # Index 0: 'survived' (int64)
        # Index 1: 'pclass' (int64)
        # Index 2: 'sex' (object/string)

        # Using index (potentially fragile if columns change)
        # self.assertEqual(dtypes.iloc[0].name, 'int64', f"Ex2 dtype[0] expected int64, got {dtypes.iloc[0]}")
        # self.assertEqual(dtypes.iloc[2].name, 'object', f"Ex2 dtype[2] expected object, got {dtypes.iloc[2]}")

        # Using column names (more robust)
        self.assertIn('survived', dtypes.index, "Ex2 'survived' column missing")
        self.assertEqual(dtypes['survived'].name, 'int64', f"Ex2 dtype['survived'] expected int64, got {dtypes['survived']}")

        self.assertIn('sex', dtypes.index, "Ex2 'sex' column missing")
        self.assertEqual(dtypes['sex'].name, 'object', f"Ex2 dtype['sex'] expected object, got {dtypes['sex']}")

        # Check for a float type (e.g., 'age' or 'fare')
        self.assertIn('fare', dtypes.index, "Ex2 'fare' column missing")
        self.assertEqual(dtypes['fare'].name, 'float64', f"Ex2 dtype['fare'] expected float64, got {dtypes['fare']}")

        print("Test test_ex2_logic PASSED.")


# --- Run Function and Tests ---
for item in exercise_2_check_shape_info_dtypes():
    display(item)
run_tests(TestExercise2)


--- Running Exercise 2 body: Load 'titanic' ---


(891, 15)

column,dtype
survived,int64
pclass,int64
sex,object
age,float64
sibsp,int64
parch,int64
fare,float64
embarked,object
class,category
who,object



--- Running tests from TestExercise2 ---
test_ex2_logic (__main__.TestExercise2.test_ex2_logic)
Test Ex2: Checks return types, shape values, and specific dtypes for 'titanic'. ... 
Running test_ex2_logic...
Test test_ex2_logic PASSED.
ok

----------------------------------------------------------------------
Ran 1 test in 0.004s

OK
--- Finished tests for TestExercise2 ---
----------------------------------------------------------------------


# Titanic Data Cleaning / Очистка Данных Титаника



In [30]:
# --- Exercise 3: Data Cleaning - Missing Values and Type Conversion (Corrected) ---
def exercise_3_clean_titanic():
    """
    Loads the 'titanic' dataset and performs cleaning steps:
    1. Convert 'age' column to numeric, coercing errors to NaN.
    2. Fill missing 'age' values with the mean age (calculated *after* coercion).
    3. Fill missing 'embarked' values with the mode* of the 'embarked' column.
    4. Fill missing 'embark_town' values with the mode* of the 'embark_town'.
    5. Drop rows where 'deck' has a missing value (NaN).
    Returns: The cleaned DataFrame or None.

    Загружает набор данных 'titanic' и выполняет шаги очистки:
    1. Преобразует столбец 'age' в числовой тип, приводя ошибки к NaN.
    2. Заполняет пропущенные значения 'age' средним возрастом (рассчитанным *после* приведения типов).
    3. Заполняет пропущенные значения 'embarked' модой* столбца 'embarked'.
    4. Заполняет пропущенные значения 'embark_town' модой* столбца 'embark_town'.
    5. Удаляет строки, в которых столбец 'deck' имеет пропущенное значение (NaN).
    Возвращает: Очищенный DataFrame или None.

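    *Mode is tricky: Series.mode() returns a Series (there can be several modes), not a single
    value, so take its first element. The DataFrame version is documented here:
    *Мода - хитрая штука: Series.mode() возвращает Series (мод может быть несколько), а не одно
    значение, поэтому берите первый элемент. Версия для DataFrame описана здесь:
    https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.mode.html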

    Source/Источник: https://www.kaggle.com/c/titanic/data
    """
    dataset_name = 'titanic' # Hardcoded dataset name
    print(f"\n--- Running Exercise 3 body: Load and clean '{dataset_name}' ---")
    try:
        df = load_cached_dataset(dataset_name)
        if df is None or df.empty:
            print(f"INFO Ex3: Cannot run exercise body: DataFrame '{dataset_name}' is empty or None.")
            return None

        df_cleaned = df.copy()
        df_cleaned['age'] = pd.to_numeric(df_cleaned['age'], errors='coerce')

        mean_age = df_cleaned['age'].mean()
        df_cleaned['age'] = df_cleaned['age'].fillna(mean_age)

        embarked_mode = df_cleaned['embarked'].mode()[0]
        df_cleaned['embarked'] = df_cleaned['embarked'].fillna(embarked_mode)

        if 'embark_town' in df_cleaned.columns:
            town_mode = df_cleaned['embark_town'].mode()[0]
            df_cleaned['embark_town'] = df_cleaned['embark_town'].fillna(town_mode)

        if 'deck' in df_cleaned.columns:
            df_cleaned = df_cleaned.dropna(subset=['deck'])

        return df_cleaned

    except KeyError as ke:
        print(f"ERROR in exercise_3 body: KeyError - Likely a required column is missing: {ke}")
        return None
    except Exception as e:
        print(f"ERROR in exercise_3 body: {type(e).__name__} - {e}")
        return None

display(exercise_3_clean_titanic())


--- Running Exercise 3 body: Load and clean 'titanic' ---


index,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False
6,0,1,male,54.0,0,0,51.8625,S,First,man,True,E,Southampton,no,True
10,1,3,female,4.0,1,1,16.7000,S,Third,child,False,G,Southampton,yes,False
11,1,1,female,58.0,0,0,26.5500,S,First,woman,False,C,Southampton,yes,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
871,1,1,female,47.0,1,1,52.5542,S,First,woman,False,D,Southampton,yes,False
872,0,1,male,33.0,0,0,5.0000,S,First,man,True,B,Southampton,no,True
879,1,1,female,56.0,0,1,83.1583,C,First,woman,False,C,Cherbourg,yes,False
887,1,1,female,19.0,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True


In [31]:
class TestExercise3(unittest.TestCase):
    original_df = None
    cleaned_df_final = None # Stores result of the *complete* exercise function

    @classmethod
    def setUpClass(cls):
        """Load original data and run the full cleaning process once."""
        print("\nSetting up TestExercise3: Loading original data...")
        cls.original_df = load_cached_dataset('titanic')
        if cls.original_df is None:
            print("Setup FATAL: Failed to load original titanic data for TestExercise3.")
            # Individual tests will skip if original_df is None

        print("\nSetting up TestExercise3: Running full cleaning function...")
        # Run the complete function once
        with contextlib.redirect_stdout(io.StringIO()): # Suppress function prints
            cls.cleaned_df_final = exercise_3_clean_titanic()
        if cls.cleaned_df_final is None:
            print("Setup WARNING: Full cleaning function returned None.")
            # Tests requiring cleaned_df_final will skip

    # --- Tests verifying consequences of individual steps ---

    def test_consequences_step1_age_numeric(self):
        """Test Ex3 Step 1 Consequences: Check final 'age' dtype."""
        print("\nRunning test_consequences_step1_age_numeric...")
        if self.cleaned_df_final is None:
            self.skipTest("Ex3 Skipped (Step 1 - Age Numeric): Final cleaned data not available.")
        if 'age' not in self.cleaned_df_final.columns:
             self.fail("Test Prerequisite Failed: 'age' column missing in final DataFrame.")

        # Assert: Check the final dtype of the 'age' column
        self.assertTrue(pd.api.types.is_numeric_dtype(self.cleaned_df_final['age'].dtype),
                        "Step 1 Consequences Assert Failed: Final 'age' dtype is not numeric.")
        print("Test test_consequences_step1_age_numeric PASSED.")


    def test_consequences_step2_age_filled(self):
        """Test Ex3 Step 2 Consequences: Check final 'age' column has no NaNs."""
        print("\nRunning test_consequences_step2_age_filled...")
        if self.cleaned_df_final is None:
            self.skipTest("Ex3 Skipped (Step 2 - Age Filled): Final cleaned data not available.")
        if 'age' not in self.cleaned_df_final.columns:
             self.fail("Test Prerequisite Failed: 'age' column missing in final DataFrame.")

        # Assert: Check for NaNs in the final 'age' column
        final_age_nan_count = self.cleaned_df_final['age'].isnull().sum()
        self.assertEqual(final_age_nan_count, 0,
                         f"Step 2 Consequences Assert Failed: Final 'age' column still has {final_age_nan_count} NaNs after the fillna step.")

        # Optional Assert: Check if a row known to have NaN age originally AND survived dropna
        # now has a value close to the original mean age.
        if self.original_df is not None:
            idx_check = 5  # Passenger originally had NaN age
            # Calculate mean age from the original df *after* coercion to numeric
            original_age_numeric = pd.to_numeric(self.original_df['age'], errors='coerce')
            expected_mean = original_age_numeric.mean()

            # Check if this row survived the dropna(subset=['deck']) step
            if idx_check in self.cleaned_df_final.index:
                # Check if the original value was actually NaN
                if pd.isna(self.original_df.loc[idx_check, 'age']):
                    self.assertAlmostEqual(self.cleaned_df_final.loc[idx_check, 'age'], expected_mean, places=4,
                                           msg=f"Step 2 Consequences Assert Failed: Passenger {idx_check} (orig NaN age) doesn't have approx mean age in final data.")
                else:
                    print(f"INFO Test Step 2: Original age at index {idx_check} was not NaN.")
            else:
                print(f"INFO Test Step 2: Passenger {idx_check} (orig NaN age) did not survive final dropna, cannot check filled value.")

        print("Test test_consequences_step2_age_filled PASSED.")


    def test_consequences_step3_embarked_filled(self):
        """Test Ex3 Step 3 Consequences: Check final 'embarked' column has no NaNs."""
        print("\nRunning test_consequences_step3_embarked_filled...")
        if self.cleaned_df_final is None:
            self.skipTest("Ex3 Skipped (Step 3 - Embarked Filled): Final cleaned data not available.")
        if 'embarked' not in self.cleaned_df_final.columns:
             self.fail("Test Prerequisite Failed: 'embarked' column missing in final DataFrame.")

        # Assert: Check for NaNs in the final 'embarked' column
        final_embarked_nan_count = self.cleaned_df_final['embarked'].isnull().sum()
        self.assertEqual(final_embarked_nan_count, 0,
                         f"Step 3 Consequences Assert Failed: Final 'embarked' column still has {final_embarked_nan_count} NaNs.")

        # Optional Assert: Check specific rows known to have NaN embarked originally
        # ONLY if they survived the final dropna step.
        if self.original_df is not None:
            original_mode = self.original_df['embarked'].mode()[0] # Get expected fill value
            for idx in [61, 829]: # Indices known to have NaN embarked originally
                if idx in self.cleaned_df_final.index: # Check if row survived
                    if pd.isna(self.original_df.loc[idx, 'embarked']): # Check if it was originally NaN
                        self.assertEqual(self.cleaned_df_final.loc[idx, 'embarked'], original_mode,
                                         f"Step 3 Consequences Assert Failed: Passenger {idx} (orig NaN embarked) doesn't have mode '{original_mode}' in final data.")
                    else:
                        print(f"INFO Test Step 3: Original embarked at index {idx} was not NaN.")
                else:
                    print(f"INFO Test Step 3: Passenger {idx} (orig NaN embarked) did not survive final dropna.")

        print("Test test_consequences_step3_embarked_filled PASSED.")


    def test_consequences_step4_embark_town_filled(self):
        """Test Ex3 Step 4 Consequences: Check final 'embark_town' has no NaNs."""
        print("\nRunning test_consequences_step4_embark_town_filled...")
        if self.cleaned_df_final is None:
            self.skipTest("Ex3 Skipped (Step 4 - Embark Town Filled): Final cleaned data not available.")
        if 'embark_town' not in self.cleaned_df_final.columns:
             self.skipTest("Ex3 Skipped (Step 4): 'embark_town' column not in final data.")

        # Assert: Check for NaNs in the final 'embark_town' column
        final_town_nan_count = self.cleaned_df_final['embark_town'].isnull().sum()
        self.assertEqual(final_town_nan_count, 0,
                         f"Step 4 Consequences Assert Failed: Final 'embark_town' column still has {final_town_nan_count} NaNs.")

        # Optional Assert: Check specific rows known to have NaN embark_town originally
        if self.original_df is not None and 'embark_town' in self.original_df.columns:
            original_mode = self.original_df['embark_town'].mode()[0]  # Get expected fill value
            for idx in [61, 829]:  # Indices known to have NaN embark_town originally
                if idx in self.cleaned_df_final.index:  # Check if row survived
                    if pd.isna(self.original_df.loc[idx, 'embark_town']):  # Check if it was originally NaN
                        self.assertEqual(self.cleaned_df_final.loc[idx, 'embark_town'], original_mode,
                                         f"Step 4 Consequences Assert Failed: Passenger {idx} (orig NaN town) doesn't have mode '{original_mode}' in final data.")
                    else:
                        print(f"INFO Test Step 4: Original embark_town at index {idx} was not NaN.")
                else:
                    print(f"INFO Test Step 4: Passenger {idx} (orig NaN town) did not survive final dropna.")

        print("Test test_consequences_step4_embark_town_filled PASSED.")


    def test_consequences_step5_deck_dropped(self):
        """Test Ex3 Step 5 Consequences: Check final 'deck' NaN count and row count."""
        print("\nRunning test_consequences_step5_deck_dropped...")
        if self.cleaned_df_final is None:
            self.skipTest("Ex3 Skipped (Step 5 - Deck Dropped): Final cleaned data not available.")
        if self.original_df is None:
            self.skipTest("Ex3 Skipped (Step 5 - Deck Dropped): Original data not available for comparison.")
        if 'deck' not in self.original_df.columns:
             self.skipTest("Ex3 Skipped (Step 5): 'deck' column not in original data.")

        # Assert: Check final 'deck' column NaN count (should be 0)
        if 'deck' in self.cleaned_df_final.columns:
            final_deck_nan_count = self.cleaned_df_final['deck'].isnull().sum()
            self.assertEqual(final_deck_nan_count, 0,
                         f"Step 5 Consequences Assert Failed: Final 'deck' column still has {final_deck_nan_count} NaNs after dropna.")
        else:
             # This case shouldn't happen if dropna was run correctly and deck existed originally
             self.fail("Step 5 Consequences Assert Failed: 'deck' column missing entirely in final data.")


        # Assert: Check final row count against original non-NaN deck count
        expected_rows = self.original_df['deck'].notna().sum()
        self.assertEqual(len(self.cleaned_df_final), expected_rows,
                         f"Step 5 Consequences Assert Failed: Final row count ({len(self.cleaned_df_final)}) doesn't match original non-NaN deck count ({expected_rows}).")

        # Assert: Check a specific row known to have NaN deck originally is GONE
        idx_nan_deck = 0 # Example row
        if pd.isna(self.original_df.loc[idx_nan_deck, 'deck']): # Verify assumption
            self.assertNotIn(idx_nan_deck, self.cleaned_df_final.index,
                            f"Step 5 Consequences Assert Failed: Row {idx_nan_deck} (orig NaN deck) still present in final data.")
        else:
             print(f"INFO Test Step 5: Original deck at index {idx_nan_deck} was not NaN.")

        print("Test test_consequences_step5_deck_dropped PASSED.")


    # --- Optional: Keep a simple integration test as well ---
    def test_integration_final_state(self):
        """Test Ex3 Integration: Simple check on final DataFrame properties."""
        print("\nRunning test_integration_final_state...")
        if self.cleaned_df_final is None:
            self.skipTest("Ex3 Skipped (Integration): Final cleaned data not available.")

        # Basic checks on the final state
        self.assertIsInstance(self.cleaned_df_final, pd.DataFrame)
        self.assertFalse(self.cleaned_df_final.empty)
        # Check overall NaN counts for key columns are zero
        self.assertEqual(self.cleaned_df_final[['age', 'embarked', 'deck']].isnull().sum().sum(), 0,
                         "Integration Assert Failed: Unexpected NaNs found in final age, embarked, or deck columns.")
        if 'embark_town' in self.cleaned_df_final.columns:
            self.assertEqual(self.cleaned_df_final['embark_town'].isnull().sum(), 0,
                             "Integration Assert Failed: Unexpected NaNs found in final embark_town column.")


        print("Test test_integration_final_state PASSED.")

display(exercise_3_clean_titanic())
# --- Run Tests for Exercise 3 ---
run_tests(TestExercise3)


--- Running Exercise 3 body: Load and clean 'titanic' ---


Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False
6,0,1,male,54.0,0,0,51.8625,S,First,man,True,E,Southampton,no,True
10,1,3,female,4.0,1,1,16.7000,S,Third,child,False,G,Southampton,yes,False
11,1,1,female,58.0,0,0,26.5500,S,First,woman,False,C,Southampton,yes,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
871,1,1,female,47.0,1,1,52.5542,S,First,woman,False,D,Southampton,yes,False
872,0,1,male,33.0,0,0,5.0000,S,First,man,True,B,Southampton,no,True
879,1,1,female,56.0,0,1,83.1583,C,First,woman,False,C,Cherbourg,yes,False
887,1,1,female,19.0,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True



--- Running tests from TestExercise3 ---

Setting up TestExercise3: Loading original data...

Setting up TestExercise3: Running full cleaning function...
test_consequences_step1_age_numeric (__main__.TestExercise3.test_consequences_step1_age_numeric)
Test Ex3 Step 1 Consequences: Check final 'age' dtype. ... 
Running test_consequences_step1_age_numeric...
Test test_consequences_step1_age_numeric PASSED.
ok
test_consequences_step2_age_filled (__main__.TestExercise3.test_consequences_step2_age_filled)
Test Ex3 Step 2 Consequences: Check final 'age' column has no NaNs. ... 
Running test_consequences_step2_age_filled...
INFO Test Step 2: Passenger 5 (orig NaN age) did not survive final dropna, cannot check filled value.
Test test_consequences_step2_age_filled PASSED.
ok
test_consequences_step3_embarked_filled (__main__.TestExercise3.test_consequences_step3_embarked_filled)
Test Ex3 Step 3 Consequences: Check final 'embarked' column has no NaNs. ... 
Running test_consequences_step3_embarke
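
The row-count assertion in `test_consequences_step5_deck_dropped` rests on a simple property: dropping rows that lack a value in one column leaves exactly as many rows as that column had non-missing values, and the surviving rows keep their original index labels. The sketch below illustrates this on a hypothetical four-row frame (not the Titanic data), assuming the cleaning function ends with a `dropna` restricted to `deck`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the Titanic 'deck' and 'age' columns.
toy = pd.DataFrame({
    "deck": ["C", np.nan, "E", np.nan],
    "age": [38.0, 26.0, 54.0, 2.0],
})

# Drop only the rows whose 'deck' is missing.
cleaned = toy.dropna(subset=["deck"])

# The number of surviving rows equals the number of non-NaN 'deck' values.
expected_rows = int(toy["deck"].notna().sum())
print(len(cleaned) == expected_rows)

# Index labels are preserved, so a dropped row can be detected by its label.
print(list(cleaned.index))
```

Because the labels are preserved rather than renumbered, a check such as `assertNotIn(label, cleaned.index)` is a valid way to confirm that a specific row was removed.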

# Text Processing and Filtering

In [32]:
# --- Exercise 4: Text Manipulation and Filtering ---
# Note: Renamed 'LastName' to 'category' for clarity, as it's derived from 'who'
def exercise_4_text_processing(dataset_name='titanic'):
    """
    Loads the titanic dataset and performs text processing:
    1. Create a 'category' column derived from the 'who' column (man, woman, child).
    2. Convert the 'category' column to uppercase.
    3. Filter the DataFrame to include only entries where 'category' is 'WOMAN'.
    4. Further filter for those women who embarked at 'S' (Southampton).
    Returns: The filtered DataFrame or None.

    Source: https://www.kaggle.com/c/titanic/data
    """
    print(f"\n--- Running Exercise 4 body: Load, process text, filter '{dataset_name}' ---")
    try:
        df = load_cached_dataset(dataset_name)
        if df is None or df.empty:
            print(f"INFO Ex4: Cannot run exercise body: DataFrame '{dataset_name}' is empty or None.")
            return None
        df_processed = df.copy()
        df_processed['category'] = df_processed['who']
        df_processed['category'] = df_processed['category'].str.upper()

        woman_df = df_processed[df_processed['category'] == 'WOMAN']
        result_df = woman_df[woman_df['embarked'] == 'S']
        return result_df

    except Exception as e:
        print(f"ERROR in exercise_4 body: {type(e).__name__} - {e}")
        return None

display(exercise_4_text_processing())


--- Running Exercise 4 body: Load, process text, filter 'titanic' ---


Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone,category
2,1,3,female,26.0,0,0,7.9250,S,Third,woman,False,,Southampton,yes,True,WOMAN
3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False,WOMAN
8,1,3,female,27.0,0,2,11.1333,S,Third,woman,False,,Southampton,yes,False,WOMAN
11,1,1,female,58.0,0,0,26.5500,S,First,woman,False,C,Southampton,yes,True,WOMAN
15,1,2,female,55.0,0,0,16.0000,S,Second,woman,False,,Southampton,yes,True,WOMAN
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
871,1,1,female,47.0,1,1,52.5542,S,First,woman,False,D,Southampton,yes,False,WOMAN
880,1,2,female,25.0,0,1,26.0000,S,Second,woman,False,,Southampton,yes,False,WOMAN
882,0,3,female,22.0,0,0,10.5167,S,Third,woman,False,,Southampton,no,True,WOMAN
887,1,1,female,19.0,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True,WOMAN
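
`exercise_4_text_processing` applies its two filters one after another. An equivalent formulation combines both conditions into a single boolean mask. The sketch below compares the two on a hypothetical four-row frame; the column names mirror the exercise, while the data is made up for illustration.

```python
import pandas as pd

# Hypothetical toy data with the two columns the exercise filters on.
toy = pd.DataFrame({
    "who": ["man", "woman", "woman", "child"],
    "embarked": ["S", "S", "C", "S"],
})
toy["category"] = toy["who"].str.upper()

# Chained filtering, as in the exercise.
chained = toy[toy["category"] == "WOMAN"]
chained = chained[chained["embarked"] == "S"]

# One combined mask. The parentheses are required because & binds
# more tightly than == in Python.
combined = toy[(toy["category"] == "WOMAN") & (toy["embarked"] == "S")]

print(chained.equals(combined))
print(list(combined.index))
```

Both forms keep the original index labels, which is why the tests can look up a passenger by the label it had before filtering.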


In [33]:
class TestExercise4(unittest.TestCase):
    # Class level setup
    df_filtered = None
    original_columns = None

    @classmethod
    def setUpClass(cls):
        """Load and process data once for all tests in this class."""
        print("\nSetting up TestExercise4: Loading and processing data...")
        # Load original first to get column list
        original_df = load_cached_dataset('titanic')
        if original_df is not None:
            cls.original_columns = original_df.columns.tolist()
        else:
            print("Setup WARNING: Failed to load original titanic data for column check.")


        with contextlib.redirect_stdout(io.StringIO()): # Suppress function prints during setup
            cls.df_filtered = exercise_4_text_processing(dataset_name='titanic')

        if cls.df_filtered is None:
            print("Setup WARNING: Failed to process data for TestExercise4.")
        else:
            print(f"Setup INFO: Processed data loaded for TestExercise4 ({len(cls.df_filtered)} rows).")


    def test_ex4_structure_type_columns(self):
        """Test Ex4: Checks DataFrame type, non-emptiness, expected columns, index type."""
        print("\nRunning test_ex4_structure_type_columns...")
        if self.df_filtered is None:
            self.skipTest("Ex4 Skipped (structure): Processed data not available.")

        # Check type
        self.assertIsInstance(self.df_filtered, pd.DataFrame, "Ex4 Result is not a DataFrame")

        # Check non-emptiness (should have results for Titanic)
        self.assertFalse(self.df_filtered.empty, "Ex4 Result DataFrame is unexpectedly empty.")

        # Check columns: should have original columns + new 'category' column
        expected_cols = set(self.original_columns or []) | {'category'} # Combine original (if loaded) and new
        self.assertSetEqual(set(self.df_filtered.columns), expected_cols, "Ex4 Column list mismatch")

        # Check index type (boolean filtering preserves original index labels type)
        self.assertIsInstance(self.df_filtered.index, pd.Index, "Ex4 Index is not a pandas Index object") # General check

        print("Test test_ex4_structure_type_columns PASSED.")


    def test_ex4_column_content(self):
        """Test Ex4: Checks content of key columns ('category', 'who', 'embarked')."""
        print("\nRunning test_ex4_column_content...")
        if self.df_filtered is None:
            self.skipTest("Ex4 Skipped (content): Processed data not available.")
        if self.df_filtered.empty:
            self.skipTest("Ex4 Skipped (content): Filtered DataFrame is empty, cannot check content.")

        # Check 'category' column is all 'WOMAN'
        self.assertTrue(all(self.df_filtered['category'] == 'WOMAN'), "Ex4 Not all 'category' values are 'WOMAN'")

        # Check 'who' column is all 'woman' (original value check)
        if 'who' in self.df_filtered.columns:
            self.assertTrue(all(self.df_filtered['who'] == 'woman'), "Ex4 Not all 'who' values are 'woman'")
        else:
            self.fail("Ex4 'who' column missing in filtered data, cannot verify.")


        # Check 'embarked' column is all 'S'
        self.assertTrue(all(self.df_filtered['embarked'] == 'S'), "Ex4 Not all 'embarked' values are 'S'")

        print("Test test_ex4_column_content PASSED.")


    def test_ex4_row_count(self):
        """Test Ex4: Checks the final number of rows after filtering."""
        print("\nRunning test_ex4_row_count...")
        if self.df_filtered is None:
            self.skipTest("Ex4 Skipped (row count): Processed data not available.")

        # Expected rows: Number of women who embarked at 'S' in the standard Titanic dataset
        expected_rows = 174
        self.assertEqual(len(self.df_filtered), expected_rows, f"Ex4 Final filtered row count mismatch. Expected {expected_rows}.")

        print("Test test_ex4_row_count PASSED.")


    def test_ex4_data_types(self):
        """Test Ex4: Checks data types of key columns after processing."""
        print("\nRunning test_ex4_data_types...")
        if self.df_filtered is None:
            self.skipTest("Ex4 Skipped (dtypes): Processed data not available.")

        # Check dtypes
        self.assertTrue(pd.api.types.is_string_dtype(self.df_filtered['category']), "Ex4 'category' dtype is not string/object")
        if 'who' in self.df_filtered.columns:
            self.assertTrue(pd.api.types.is_string_dtype(self.df_filtered['who']), "Ex4 'who' dtype is not string/object")
        self.assertTrue(pd.api.types.is_string_dtype(self.df_filtered['embarked']), "Ex4 'embarked' dtype is not string/object")
        # Check a numeric column stayed numeric
        if 'fare' in self.df_filtered.columns:
            self.assertTrue(pd.api.types.is_numeric_dtype(self.df_filtered['fare']), "Ex4 'fare' dtype is not numeric")
        if 'age' in self.df_filtered.columns:
            self.assertTrue(pd.api.types.is_numeric_dtype(self.df_filtered['age']), "Ex4 'age' dtype is not numeric")

        print("Test test_ex4_data_types PASSED.")

    def test_ex4_specific_passenger_check(self):
        """Test Ex4: Checks specific attributes of a known passenger meeting the criteria."""
        print("\nRunning test_ex4_specific_passenger_check...")
        if self.df_filtered is None:
            self.skipTest("Ex4 Skipped (specific passenger): Processed data not available.")

        # Find a known passenger: e.g., Passenger ID 3 (original index 2)
        # Miss. Laina Heikkinen, Age 26, Embarked S, Survived 1, Pclass 3, Fare 7.925
        original_index_label = 2 # The original index label for this passenger

        # Check if this passenger is still in the filtered DataFrame (should be)
        self.assertIn(original_index_label, self.df_filtered.index, f"Ex4 Passenger with original index {original_index_label} not found in filtered data.")

        # assertIn above guarantees the label is present, so the row can be read directly
        passenger_data = self.df_filtered.loc[original_index_label]

        # Verify key attributes
        self.assertEqual(passenger_data['who'], 'woman', f"Ex4 Passenger {original_index_label}: 'who' mismatch")
        self.assertEqual(passenger_data['category'], 'WOMAN', f"Ex4 Passenger {original_index_label}: 'category' mismatch")
        self.assertEqual(passenger_data['embarked'], 'S', f"Ex4 Passenger {original_index_label}: 'embarked' mismatch")
        self.assertEqual(passenger_data['pclass'], 3, f"Ex4 Passenger {original_index_label}: 'pclass' mismatch")
        self.assertEqual(passenger_data['survived'], 1, f"Ex4 Passenger {original_index_label}: 'survived' mismatch")
        self.assertAlmostEqual(passenger_data['age'], 26.0, places=1, msg=f"Ex4 Passenger {original_index_label}: 'age' mismatch")
        self.assertAlmostEqual(passenger_data['fare'], 7.925, places=3, msg=f"Ex4 Passenger {original_index_label}: 'fare' mismatch")

        print("Test test_ex4_specific_passenger_check PASSED.")


display(exercise_4_text_processing())
# --- Run Tests for Exercise 4 ---
run_tests(TestExercise4)


--- Running Exercise 4 body: Load, process text, filter 'titanic' ---


Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone,category
2,1,3,female,26.0,0,0,7.9250,S,Third,woman,False,,Southampton,yes,True,WOMAN
3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False,WOMAN
8,1,3,female,27.0,0,2,11.1333,S,Third,woman,False,,Southampton,yes,False,WOMAN
11,1,1,female,58.0,0,0,26.5500,S,First,woman,False,C,Southampton,yes,True,WOMAN
15,1,2,female,55.0,0,0,16.0000,S,Second,woman,False,,Southampton,yes,True,WOMAN
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
871,1,1,female,47.0,1,1,52.5542,S,First,woman,False,D,Southampton,yes,False,WOMAN
880,1,2,female,25.0,0,1,26.0000,S,Second,woman,False,,Southampton,yes,False,WOMAN
882,0,3,female,22.0,0,0,10.5167,S,Third,woman,False,,Southampton,no,True,WOMAN
887,1,1,female,19.0,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True,WOMAN



--- Running tests from TestExercise4 ---

Setting up TestExercise4: Loading and processing data...
Setup INFO: Processed data loaded for TestExercise4 (174 rows).
test_ex4_column_content (__main__.TestExercise4.test_ex4_column_content)
Test Ex4: Checks content of key columns ('category', 'who', 'embarked'). ... 
Running test_ex4_column_content...
Test test_ex4_column_content PASSED.
ok
test_ex4_data_types (__main__.TestExercise4.test_ex4_data_types)
Test Ex4: Checks data types of key columns after processing. ... 
Running test_ex4_data_types...
Test test_ex4_data_types PASSED.
ok
test_ex4_row_count (__main__.TestExercise4.test_ex4_row_count)
Test Ex4: Checks the final number of rows after filtering. ... 
Running test_ex4_row_count...
Test test_ex4_row_count PASSED.
ok
test_ex4_specific_passenger_check (__main__.TestExercise4.test_ex4_specific_passenger_check)
Test Ex4: Checks specific attributes of a known passenger meeting the criteria. ... 
Running test_ex4_specific_passenger_check.
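
The test classes in this notebook share one pattern: `setUpClass` computes the result once for the whole class, and each test calls `skipTest` when that result is `None`. The sketch below isolates the pattern. `compute_result` and `TestPattern` are hypothetical names, and the suite is run with the standard `unittest.TextTestRunner` rather than the notebook's own `run_tests` helper.

```python
import io
import unittest


def compute_result():
    """Hypothetical stand-in for an exercise function that failed and returned None."""
    return None


class TestPattern(unittest.TestCase):
    result = None

    @classmethod
    def setUpClass(cls):
        # Runs once before any test in the class.
        cls.result = compute_result()

    def test_uses_result(self):
        # Skip, rather than fail, when the shared result is unavailable.
        if self.result is None:
            self.skipTest("Result not available.")
        self.assertTrue(self.result)


suite = unittest.TestLoader().loadTestsFromTestCase(TestPattern)
outcome = unittest.TextTestRunner(stream=io.StringIO(), verbosity=0).run(suite)
print(len(outcome.skipped), len(outcome.failures))
```

A skipped test is reported separately from a failure, so a data-loading problem does not masquerade as a bug in the exercise code.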

# Grouping and Aggregation



In [36]:
# --- Exercise 5: Grouping and Aggregation ---
def exercise_5_group_aggregate():
    """
    Loads the tips dataset and performs grouping and aggregation:
    1. Group the data by 'day' and 'smoker'. Use observed=False for consistency.
    2. Calculate the following aggregations for each group:
        - Average 'total_bill' (mean)
        - Maximum 'tip' (max)
        - Total number of entries in the group (size, computed on the 'tip' column)
    3. The aggregated columns should be named 'avg_bill', 'max_tip', and 'count'.
    Returns: The aggregated DataFrame with renamed columns or None.

    Source: https://rdrr.io/cran/reshape2/man/tips.html
    """
    dataset_name='tips'
    print(f"\n--- Running Exercise 5 body: Load, group, aggregate '{dataset_name}' ---")
    try:
        df = load_cached_dataset(dataset_name)
        df_processed = df.copy()
        grouped = df_processed.groupby(['day', 'smoker'], observed=False)
        aggregated = grouped.agg(
            avg_bill=('total_bill', 'mean'),
            max_tip=('tip', 'max'),
            count=('tip', 'size')
        )

        return aggregated

    except Exception as e:
        print(f"ERROR in exercise_5 body: {type(e).__name__} - {e}")
        return None

display(exercise_5_group_aggregate())


--- Running Exercise 5 body: Load, group, aggregate 'tips' ---


day,smoker,avg_bill,max_tip,count
Thur,Yes,19.190588,5.0,17
Thur,No,17.113111,6.7,45
Fri,Yes,16.813333,4.73,15
Fri,No,18.42,3.5,4
Sat,Yes,21.276667,10.0,42
Sat,No,19.661778,9.0,45
Sun,Yes,24.12,6.5,19
Sun,No,20.506667,6.0,57
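
The `observed=False` argument matters because `day` and `smoker` are categorical columns: with it, every combination of categories gets a row in the result, even one that has no data. The sketch below shows the same named aggregation on a hypothetical three-row frame in which the `('Fri', 'No')` combination never occurs.

```python
import pandas as pd

# Hypothetical toy data; both grouping columns are categorical, as in 'tips'.
toy = pd.DataFrame({
    "day": pd.Categorical(["Thur", "Thur", "Fri"], categories=["Thur", "Fri"]),
    "smoker": pd.Categorical(["Yes", "No", "Yes"], categories=["Yes", "No"]),
    "total_bill": [10.0, 20.0, 30.0],
    "tip": [1.0, 2.0, 3.0],
})

# Named aggregation: output_column=(input_column, function).
agg = toy.groupby(["day", "smoker"], observed=False).agg(
    avg_bill=("total_bill", "mean"),
    max_tip=("tip", "max"),
    count=("tip", "size"),
)

# 2 days x 2 smoker statuses gives 4 rows, although only 3 combinations have data.
print(len(agg))
print(agg)
```

This is the reasoning behind the test's "4 days * 2 smoker statuses = 8" check: the expected group count follows from the categories, not from which combinations happen to appear.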


In [37]:
class TestExercise5(unittest.TestCase):
    # Class level setup to load data once for all tests in this class
    df_agg = None

    @classmethod
    def setUpClass(cls):
        """Load the aggregated data once before all tests in this class."""
        print("\nSetting up TestExercise5: Loading aggregated data...")
        with contextlib.redirect_stdout(io.StringIO()): # Suppress function prints during setup
            cls.df_agg = exercise_5_group_aggregate()
        if cls.df_agg is None:
            print("Setup WARNING: Failed to load aggregated data for TestExercise5.")
            # Each test calls self.skipTest() when df_agg is None

    def test_ex5_structure_and_basic_values(self):
        """Test Ex5: Checks type, index, columns, group count, and basic values."""
        print("\nRunning test_ex5_structure_and_basic_values...")
        if self.df_agg is None:
            self.skipTest("Ex5 Skipped (structure/basic): Aggregated data not loaded.")

        # Check type and structure
        self.assertIsInstance(self.df_agg, pd.DataFrame, "Ex5 Result is not a DataFrame")
        self.assertIsInstance(self.df_agg.index, pd.MultiIndex, "Ex5 Result index is not a MultiIndex")
        self.assertEqual(self.df_agg.index.names, ['day', 'smoker'], "Ex5 Index names are incorrect")

        # Check column names
        expected_cols = ['avg_bill', 'max_tip', 'count']
        self.assertListEqual(self.df_agg.columns.tolist(), expected_cols, "Ex5 Column names mismatch")

        # Check number of groups (4 days * 2 smoker statuses = 8)
        self.assertEqual(len(self.df_agg), 8, "Ex5 Number of groups mismatch")

        # Check specific aggregated values (use .loc with tuple for MultiIndex)
        try:
            # Thursday smokers
            thurs_smoker = self.df_agg.loc[('Thur', 'Yes')]
            self.assertAlmostEqual(thurs_smoker['avg_bill'], 19.190588, places=4, msg="Ex5 Thur/Yes avg_bill mismatch")
            self.assertEqual(thurs_smoker['max_tip'], 5.00, msg="Ex5 Thur/Yes max_tip mismatch")
            self.assertEqual(thurs_smoker['count'], 17, msg="Ex5 Thur/Yes count mismatch")

            # Sunday non-smokers
            sun_nosmoker = self.df_agg.loc[('Sun', 'No')]
            self.assertAlmostEqual(sun_nosmoker['avg_bill'], 20.506667, places=4, msg="Ex5 Sun/No avg_bill mismatch")
            self.assertEqual(sun_nosmoker['max_tip'], 6.00, msg="Ex5 Sun/No max_tip mismatch")
            self.assertEqual(sun_nosmoker['count'], 57, msg="Ex5 Sun/No count mismatch")
        except KeyError as e:
            self.fail(f"Ex5 Failed to access group in MultiIndex during basic checks: {e}")

        print("Test test_ex5_structure_and_basic_values PASSED.")

    def test_ex5_data_types(self):
        """Test Ex5: Checks the data types of the aggregated columns."""
        print("\nRunning test_ex5_data_types...")
        if self.df_agg is None:
            self.skipTest("Ex5 Skipped (dtypes): Aggregated data not loaded.")

        # Check dtypes
        self.assertTrue(pd.api.types.is_float_dtype(self.df_agg['avg_bill']), "Ex5 'avg_bill' dtype is not float")
        # max_tip could be float or int depending on original data, check for numeric
        self.assertTrue(pd.api.types.is_numeric_dtype(self.df_agg['max_tip']), "Ex5 'max_tip' dtype is not numeric")
        self.assertTrue(pd.api.types.is_integer_dtype(self.df_agg['count']), "Ex5 'count' dtype is not integer")
        print("Test test_ex5_data_types PASSED.")

    def test_ex5_non_nullness(self):
        """Test Ex5: Checks that aggregated columns do not contain null values."""
        print("\nRunning test_ex5_non_nullness...")
        if self.df_agg is None:
            self.skipTest("Ex5 Skipped (non-null): Aggregated data not loaded.")

        # Check for NaNs - shouldn't occur with mean/max/size on this dataset unless a group was truly empty
        self.assertFalse(self.df_agg['avg_bill'].isnull().any(), "Ex5 'avg_bill' contains NaN values")
        self.assertFalse(self.df_agg['max_tip'].isnull().any(), "Ex5 'max_tip' contains NaN values")
        self.assertFalse(self.df_agg['count'].isnull().any(), "Ex5 'count' contains NaN values")
        print("Test test_ex5_non_nullness PASSED.")

    def test_ex5_index_uniqueness(self):
        """Test Ex5: Checks if the MultiIndex is unique."""
        print("\nRunning test_ex5_index_uniqueness...")
        if self.df_agg is None:
            self.skipTest("Ex5 Skipped (index unique): Aggregated data not loaded.")

        self.assertTrue(self.df_agg.index.is_unique, "Ex5 MultiIndex is not unique")
        print("Test test_ex5_index_uniqueness PASSED.")

    def test_ex5_more_group_values(self):
        """Test Ex5: Checks specific values for additional groups."""
        print("\nRunning test_ex5_more_group_values...")
        if self.df_agg is None:
            self.skipTest("Ex5 Skipped (more groups): Aggregated data not loaded.")

        try:
            # Friday smokers
            fri_smoker = self.df_agg.loc[('Fri', 'Yes')]
            self.assertAlmostEqual(fri_smoker['avg_bill'], 16.813333, places=4, msg="Ex5 Fri/Yes avg_bill mismatch")
            self.assertEqual(fri_smoker['max_tip'], 4.73, msg="Ex5 Fri/Yes max_tip mismatch")
            self.assertEqual(fri_smoker['count'], 15, msg="Ex5 Fri/Yes count mismatch")

            # Saturday non-smokers
            sat_nosmoker = self.df_agg.loc[('Sat', 'No')]
            self.assertAlmostEqual(sat_nosmoker['avg_bill'], 19.661778, places=4, msg="Ex5 Sat/No avg_bill mismatch")
            self.assertEqual(sat_nosmoker['max_tip'], 9.00, msg="Ex5 Sat/No max_tip mismatch")
            self.assertEqual(sat_nosmoker['count'], 45, msg="Ex5 Sat/No count mismatch")
        except KeyError as e:
            self.fail(f"Ex5 Failed to access group in MultiIndex during more group checks: {e}")

        print("Test test_ex5_more_group_values PASSED.")

# --- Run Tests for Exercise 5 ---
display(exercise_5_group_aggregate())
run_tests(TestExercise5)


--- Running Exercise 5 body: Load, group, aggregate 'tips' ---


day,smoker,avg_bill,max_tip,count
Thur,Yes,19.190588,5.0,17
Thur,No,17.113111,6.7,45
Fri,Yes,16.813333,4.73,15
Fri,No,18.42,3.5,4
Sat,Yes,21.276667,10.0,42
Sat,No,19.661778,9.0,45
Sun,Yes,24.12,6.5,19
Sun,No,20.506667,6.0,57



--- Running tests from TestExercise5 ---

Setting up TestExercise5: Loading aggregated data...
test_ex5_data_types (__main__.TestExercise5.test_ex5_data_types)
Test Ex5: Checks the data types of the aggregated columns. ... 
Running test_ex5_data_types...
Test test_ex5_data_types PASSED.
ok
test_ex5_index_uniqueness (__main__.TestExercise5.test_ex5_index_uniqueness)
Test Ex5: Checks if the MultiIndex is unique. ... 
Running test_ex5_index_uniqueness...
Test test_ex5_index_uniqueness PASSED.
ok
test_ex5_more_group_values (__main__.TestExercise5.test_ex5_more_group_values)
Test Ex5: Checks specific values for additional groups. ... 
Running test_ex5_more_group_values...
Test test_ex5_more_group_values PASSED.
ok
test_ex5_non_nullness (__main__.TestExercise5.test_ex5_non_nullness)
Test Ex5: Checks that aggregated columns do not contain null values. ... 
Running test_ex5_non_nullness...
Test test_ex5_non_nullness PASSED.
ok
test_ex5_structure_and_basic_values (__main__.TestExercise5.test_e
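
The value checks in `TestExercise5` read one group at a time by passing a tuple to `.loc`, and wrap the lookup in `try/except KeyError`. The sketch below shows both behaviours on a hypothetical two-row result; the numbers are illustrative only.

```python
import pandas as pd

# Hypothetical aggregated result with a (day, smoker) MultiIndex.
idx = pd.MultiIndex.from_tuples(
    [("Thur", "Yes"), ("Thur", "No")], names=["day", "smoker"]
)
agg = pd.DataFrame({"avg_bill": [19.19, 17.11], "count": [17, 45]}, index=idx)

# A full tuple key selects a single row, returned as a Series.
row = agg.loc[("Thur", "Yes")]
print(row["count"])

# A key that is absent from the index raises KeyError.
try:
    agg.loc[("Fri", "Yes")]
    missing = False
except KeyError:
    missing = True
print(missing)
```

Catching `KeyError` and converting it to `self.fail(...)` gives a clearer message than an unhandled exception when a group is absent from the result.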

# Car Name Extraction and Cleaning

In [40]:
# --- Exercise 6: Extracting and Cleaning Car Names ---
def exercise_6_process_car_names():
    """
    Loads the MPG dataset and processes the 'name' column:
    1. Creates a new column 'initial_brand' by extracting the first word from the 'name' column.
       (Assumption: first word is the brand). Handles potential errors during extraction.
    2. Creates a column 'brand' from 'initial_brand' using .str.replace() or map and lambda:
        - Replaces 'chevy' and 'chevroelt' with 'chevrolet'.
        - Replaces 'maxda' with 'mazda'.
        - Replaces 'mercedes-benz' with 'mercedes'.
        - Replaces 'vw' and 'vokswagen' with 'volkswagen'.
    3. Creates a new column 'cleaned_name' by removing the brand (the first word)
       and any leading/trailing whitespace from the original 'name'.

    Use:
    1) The .str accessor
    2) .str[n] indexing (including slices) to access elements of the split list
    3) .str.join to join the list back into a string

    Returns: The DataFrame with the new 'brand' and 'cleaned_name' columns or None.

    """
    dataset_name='mpg'
    print(f"\n--- Running Exercise 6 body: Load '{dataset_name}', process car names ---")
    try:
        df = load_cached_dataset(dataset_name)
        if df is None or df.empty:
            print(f"INFO Ex6: Cannot run exercise body: DataFrame '{dataset_name}' is empty or None.")
            return None

        df_processed = df.copy()
        # Step 1: the first word of each name is taken as the brand
        split_names = df_processed['name'].str.split()
        df_processed['initial_brand'] = split_names.str[0]

        # Step 2: normalize misspellings and abbreviations (plain-text, not regex, replacement)
        brand = df_processed['initial_brand']
        for wrong, right in [('chevy', 'chevrolet'), ('chevroelt', 'chevrolet'), ('maxda', 'mazda'),
                             ('mercedes-benz', 'mercedes'), ('vw', 'volkswagen'), ('vokswagen', 'volkswagen')]:
            brand = brand.str.replace(wrong, right, regex=False)
        df_processed['brand'] = brand

        # Step 3: drop the first word and re-join the rest; the .str accessor passes NaN through
        df_processed['cleaned_name'] = split_names.str[1:].str.join(' ').str.strip()
        return df_processed


    except Exception as e:
        print(f"ERROR in exercise_6 body: {type(e).__name__} - {e}")
        return None

display(exercise_6_process_car_names())


--- Running Exercise 6 body: Load 'mpg', process car names ---


Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,name,initial_brand,brand,cleaned_name
0,18.0,8,307.0,130.0,3504,12.0,70,usa,chevrolet chevelle malibu,chevrolet,chevrolet,chevelle malibu
1,15.0,8,350.0,165.0,3693,11.5,70,usa,buick skylark 320,buick,buick,skylark 320
2,18.0,8,318.0,150.0,3436,11.0,70,usa,plymouth satellite,plymouth,plymouth,satellite
3,16.0,8,304.0,150.0,3433,12.0,70,usa,amc rebel sst,amc,amc,rebel sst
4,17.0,8,302.0,140.0,3449,10.5,70,usa,ford torino,ford,ford,torino
...,...,...,...,...,...,...,...,...,...,...,...,...
393,27.0,4,140.0,86.0,2790,15.6,82,usa,ford mustang gl,ford,ford,mustang gl
394,44.0,4,97.0,52.0,2130,24.6,82,europe,vw pickup,vw,volkswagen,pickup
395,32.0,4,135.0,84.0,2295,11.6,82,usa,dodge rampage,dodge,dodge,rampage
396,28.0,4,120.0,79.0,2625,18.6,82,usa,ford ranger,ford,ford,ranger


In [41]:
# --- Tests for Exercise 6 ---
class TestExercise6(unittest.TestCase):
    original_df = None
    result_df = None
    test_indices = {
        # name_key: index # original 'name' value from dataset
        'chevrolet_ok': 0,      # fallback index; setUpClass re-resolves it by searching for 'chevrolet malibu'
        'chevrolet_typo1': 26,  # 'chevy c20'
        'chevrolet_typo2': 161, # 'chevroelt chevelle malibu'
        'mazda_typo': 294,      # 'maxda glc deluxe'
        'vw_typo1': 394,        # 'vw pickup'
        'vw_typo2': 309,        # 'vokswagen rabbit'
        'mercedes_hyphen': 211, # 'mercedes-benz 280s'
        'toyota_ok': 31,        # 'toyota corona mark ii'
        'ford_ok': 32,          # 'ford pinto'
        'multi_space': 97,     # 'plymouth valiant'
        'single_word': 100,     # 'hi 1200d'
        'nan_name': None        # Placeholder for index with NaN name if one exists
    }

    # Expected values are the *first word* of the original name string
    initial_brand_test_cases = {
            'chevrolet_ok': 'chevrolet',
            'chevrolet_typo1': 'chevy',
            'chevrolet_typo2': 'chevroelt',
            'mazda_typo': 'maxda',
            'vw_typo1': 'vw',
            'vw_typo2': 'vokswagen',
            'mercedes_hyphen': 'mercedes-benz',
            'toyota_ok': 'toyota',
            'ford_ok': 'ford',
            'single_word': 'hi', # First word of 'hi 1200d'
        }

    @classmethod
    def setUpClass(cls):
        """Load original data and run the exercise function once."""
        print("\nSetting up TestExercise6: Loading original mpg data and running function...")
        cls.original_df = load_cached_dataset('mpg')

        # Update test indices based on loaded data if possible
        # This makes tests less brittle if the dataset has slight variations
        if cls.original_df is not None:
            name_col = cls.original_df['name']
            idx_map = {
                'chevrolet_ok': name_col[name_col == 'chevrolet malibu'].index,
                'chevrolet_typo1': name_col[name_col == 'chevy c20'].index,
                'chevrolet_typo2': name_col[name_col == 'chevroelt chevelle malibu'].index,
                'mazda_typo': name_col[name_col == 'maxda glc deluxe'].index,
                'vw_typo1': name_col[name_col == 'vw pickup'].index,
                'vw_typo2': name_col[name_col == 'vokswagen rabbit'].index,
                'mercedes_hyphen': name_col[name_col == 'mercedes-benz 280s'].index,
                'toyota_ok': name_col[name_col == 'toyota corona mark ii'].index,
                'ford_ok': name_col[name_col == 'ford pinto'].index,
                'multi_space': name_col[name_col == 'plymouth valiant'].index, # No trailing space usually
                'single_word': name_col[name_col == 'hi 1200d'].index,
            }
            for name, found_indices in idx_map.items():
                if not found_indices.empty:
                    cls.test_indices[name] = found_indices[0] # Take the first match
                else:
                    print(f"WARN Setup: Could not find index for test case '{name}' in loaded data.")
                    cls.test_indices[name] = None # Mark as not found

            # Record the index of a NaN 'name', if the dataset contains one
            nan_indices = cls.original_df[cls.original_df['name'].isna()].index
            if not nan_indices.empty:
                cls.test_indices['nan_name'] = nan_indices[0]
                print(f"INFO: Found existing NaN in 'name' at index {cls.test_indices['nan_name']}.")

        else:
            print("Setup WARNING: Original MPG data failed to load. Some tests may be skipped or use fallback indices.")


        # Run the function (which loads its own data)
        with contextlib.redirect_stdout(io.StringIO()):
            cls.result_df = exercise_6_process_car_names() # Function loads 'mpg' internally

        if cls.result_df is None:
            print("Setup WARNING: Exercise 6 function returned None.")
        elif cls.original_df is not None and len(cls.result_df) != len(cls.original_df):
             print(f"Setup WARNING: Row count mismatch between original ({len(cls.original_df)}) and result ({len(cls.result_df)}).")
        elif 'cleaned_name' not in cls.result_df.columns:
             print("Setup WARNING: 'cleaned_name' column is missing from the result (likely Step 3 not implemented).")


    def test_ex6_structure_columns(self):
        """Test Ex6: Checks result type, existence and non-emptiness of new columns."""
        print("\nRunning test_ex6_structure_columns...")
        if self.result_df is None:
            self.skipTest("Ex6 Skipped (structure): Result data not available (function returned None).")

        self.assertIsInstance(self.result_df, pd.DataFrame, "Ex6 Result is not a DataFrame.")

        # Check new columns exist
        self.assertIn('brand', self.result_df.columns, "Ex6 Column 'brand' is missing.")
        # Check 'cleaned_name' exists, even if it's just the placeholder NaN column
        self.assertIn('cleaned_name', self.result_df.columns, "Ex6 Column 'cleaned_name' is missing (placeholder expected if Step 3 not done).")

        # Check 'brand' is not entirely null (assuming original dataset isn't empty).
        # 'cleaned_name' is not checked here: it may be an all-NaN placeholder until Step 3 is done.
        if self.original_df is not None and not self.original_df.empty:
            self.assertFalse(self.result_df['brand'].isnull().all(), "Ex6 Column 'brand' is entirely null.")

        # Check shape (should be same number of rows as original)
        if self.original_df is not None:
            self.assertEqual(len(self.result_df), len(self.original_df),
                             f"Ex6 Row count changed unexpectedly. Original={len(self.original_df)}, Result={len(self.result_df)}")

        print("Test test_ex6_structure_columns PASSED.")

    def test_ex6_initial_brand_extraction(self):
        """Test Ex6 Step 1: Checks 'brand' *before* cleaning, i.e. the 'initial_brand' column (original first word)."""
        print("\nRunning test_ex6_initial_brand_extraction...")
        if self.result_df is None:
            self.skipTest("Ex6 Skipped (initial brand): Result data not available.")
        if 'initial_brand' not in self.result_df.columns:
             self.fail("Ex6 Prerequisite Fail (initial brand): 'initial_brand' column missing.")

        for name, expected_initial_brand in self.initial_brand_test_cases.items():
             idx = self.test_indices.get(name)
             if idx is None:
                 print(f"WARN (initial brand): Test index for '{name}' not found, skipping check.")
                 continue
             if idx not in self.result_df.index:
                  self.fail(f"Ex6 Test Fail (initial brand): Index {idx} for '{name}' not found in result DataFrame.")

             actual_brand = self.result_df.loc[idx, 'initial_brand']
             self.assertEqual(actual_brand, expected_initial_brand,
                              f"Ex6 Initial Brand incorrect for index {idx} (case: '{name}'). Expected first word '{expected_initial_brand}', got '{actual_brand}'.")

        print("Test test_ex6_initial_brand_extraction PASSED.")


    def test_ex6_brand_extraction_and_cleaning(self):
        """Test Ex6 Step 2: Checks specific 'brand' values *after* extraction and replacement."""
        print("\nRunning test_ex6_brand_extraction_and_cleaning...")
        if self.result_df is None:
            self.skipTest("Ex6 Skipped (brand cleaning): Result data not available.")
        if 'brand' not in self.result_df.columns:
             self.fail("Ex6 Prerequisite Fail (brand cleaning): 'brand' column missing.")

        # This test assumes Step 2 (replacement) has been implemented in the function
        cleaned_brand_test_cases = {
            # name_key: (expected_CLEANED_brand, original_full_name for context)
            'chevrolet_ok': ('chevrolet', 'chevrolet malibu'),
            'chevrolet_typo1': ('chevrolet', 'chevy c20'),
            'chevrolet_typo2': ('chevrolet', 'chevroelt chevelle malibu'),
            'mazda_typo': ('mazda', 'maxda glc deluxe'),
            'vw_typo1': ('volkswagen', 'vw pickup'),
            'vw_typo2': ('volkswagen', 'vokswagen rabbit'),
            'mercedes_hyphen': ('mercedes', 'mercedes-benz 280s'),
            'toyota_ok': ('toyota', 'toyota corona mark ii'), # No cleaning needed
            'ford_ok': ('ford', 'ford pinto'),             # No cleaning needed
            'single_word': ('hi', 'hi 1200d'), # Brand 'hi' requires no cleaning
        }

        failures = []
        for name, (expected_brand, original_name) in cleaned_brand_test_cases.items():
             idx = self.test_indices.get(name)
             if idx is None:
                 print(f"WARN (brand cleaning): Test index for '{name}' not found, skipping check.")
                 continue
             if idx not in self.result_df.index:
                  failures.append(f"Index {idx} for '{name}' not found in result DataFrame.")
                  continue

             actual_brand = self.result_df.loc[idx, 'brand']
             try:
                self.assertEqual(actual_brand, expected_brand)
             except AssertionError:
                 failures.append(f"Brand incorrect for index {idx} (orig: '{original_name}'). Expected CLEANED '{expected_brand}', got '{actual_brand}'.")


        if failures:
            # Treat the failures as "Step 2 not implemented" only when every one of them is a
            # brand that still equals the raw first word expected from Step 1.
            # (The previous check compared the 'brand' column with itself, so it was always True.)
            mismatches = []
            for name, (expected, _orig) in cleaned_brand_test_cases.items():
                idx = self.test_indices.get(name)
                if idx is None or idx not in self.result_df.index:
                    continue
                actual = self.result_df.loc[idx, 'brand']
                if actual != expected:
                    mismatches.append((name, actual))
            step1_only = (
                bool(mismatches)
                and len(mismatches) == len(failures)
                and all(self.initial_brand_test_cases.get(name) == actual
                        for name, actual in mismatches)
            )

            if step1_only:
                self.skipTest("Ex6 Skipped (brand cleaning): Failures detected, likely because Step 2 (replacement) is not implemented. Failures:\n" + "\n".join(failures))
            else:
                self.fail("Ex6 Test Fail (brand cleaning):\n" + "\n".join(failures))
        else:
            print("Test test_ex6_brand_extraction_and_cleaning PASSED.")


    def test_ex6_cleaned_name_creation(self):
        """Test Ex6 Step 3: Checks specific 'cleaned_name' values (removal of *original* first word)."""
        print("\nRunning test_ex6_cleaned_name_creation...")
        if self.result_df is None:
            self.skipTest("Ex6 Skipped (cleaned_name creation): Result data not available.")
        if 'cleaned_name' not in self.result_df.columns:
             self.fail("Ex6 Prerequisite Fail (cleaned_name creation): 'cleaned_name' column missing.")
        # Check if the column is just the placeholder NaN column
        if self.result_df['cleaned_name'].isnull().all():
            self.skipTest("Ex6 Skipped (cleaned_name creation): 'cleaned_name' column contains only NaNs (likely Step 3 not implemented).")


        # This test assumes Step 3 has been implemented in the function
        cleaned_name_test_cases = {
            # name_key: (expected_cleaned_name, original_full_name for context)
            'chevrolet_ok': ('malibu', 'chevrolet malibu'),
            'toyota_ok': ('corona mark ii', 'toyota corona mark ii'),
            'ford_ok': ('pinto', 'ford pinto'),
            'chevrolet_typo2': ('chevelle malibu', 'chevroelt chevelle malibu'), # Should remove 'chevroelt'
            'mazda_typo': ('glc deluxe', 'maxda glc deluxe'),  # Should remove 'maxda'
            'vw_typo1': ('pickup', 'vw pickup'),              # Should remove 'vw'
            'mercedes_hyphen': ('280s', 'mercedes-benz 280s'), # Should remove 'mercedes-benz'
            'multi_space': ('valiant', 'plymouth valiant'), # Assumes stripping occurs
            'single_word': ('1200d', 'hi 1200d'), # Should remove 'hi'
        }

        failures = []
        for name, (expected_cleaned, original_name) in cleaned_name_test_cases.items():
             idx = self.test_indices.get(name)
             if idx is None:
                 print(f"WARN (cleaned_name creation): Test index for '{name}' not found, skipping check.")
                 continue
             if idx not in self.result_df.index:
                  failures.append(f"Index {idx} for '{name}' not found in result DataFrame.")
                  continue

             actual_cleaned = self.result_df.loc[idx, 'cleaned_name']
             try:
                 self.assertEqual(actual_cleaned, expected_cleaned)
             except AssertionError:
                  failures.append(f"cleaned_name incorrect for index {idx} (orig: '{original_name}'). Expected '{expected_cleaned}', got '{actual_cleaned}'.")

        if failures:
             self.fail("Ex6 Test Fail (cleaned_name creation):\n" + "\n".join(failures))
        else:
             print("Test test_ex6_cleaned_name_creation PASSED.")


    def test_ex6_data_types(self):
        """Test Ex6: Checks the data types of the new columns."""
        print("\nRunning test_ex6_data_types...")
        if self.result_df is None:
            self.skipTest("Ex6 Skipped (dtypes): Result data not available.")
        if 'brand' not in self.result_df.columns or 'cleaned_name' not in self.result_df.columns:
             self.fail("Ex6 Prerequisite Fail (dtypes): 'brand' or 'cleaned_name' column missing.")

        # Brand should ideally be string (object or pandas StringDtype)
        self.assertTrue(pd.api.types.is_string_dtype(self.result_df['brand'].dtype) or pd.api.types.is_object_dtype(self.result_df['brand'].dtype) ,
                        f"Ex6 'brand' dtype is not string/object. Found: {self.result_df['brand'].dtype}")

        # Allow cleaned_name to be object or float if it contains NaNs (esp if Step 3 not run or if single words existed)
        is_str_or_obj = pd.api.types.is_string_dtype(self.result_df['cleaned_name'].dtype) or \
                        pd.api.types.is_object_dtype(self.result_df['cleaned_name'].dtype)
        self.assertTrue(is_str_or_obj or self.result_df['cleaned_name'].isnull().all(),
                        f"Ex6 'cleaned_name' dtype is not string/object (and not all null). Found: {self.result_df['cleaned_name'].dtype}")

        print("Test test_ex6_data_types PASSED.")

    def test_ex6_other_columns_unchanged(self):
        """Test Ex6: Checks that columns other than 'brand' and 'cleaned_name' are identical to original."""
        print("\nRunning test_ex6_other_columns_unchanged...")
        if self.result_df is None:
            self.skipTest("Ex6 Skipped (unchanged cols): Result data not available.")
        if self.original_df is None:
            self.skipTest("Ex6 Skipped (unchanged cols): Original data not available for comparison.")
        if len(self.result_df) != len(self.original_df):
             self.skipTest("Ex6 Skipped (unchanged cols): Row count mismatch prevents reliable comparison.")

        original_cols = list(self.original_df.columns)
        new_cols = ['brand', 'cleaned_name']
        other_cols = [col for col in original_cols if col not in new_cols and col != 'name'] # Exclude 'name' too

        # Ensure these columns still exist in the result
        missing_cols = [col for col in other_cols if col not in self.result_df.columns]
        if missing_cols:
             self.fail(f"Ex6 Unchanged Cols Fail: The following original columns are missing from the result: {missing_cols}")

        # Compare the DataFrames subsetted to these columns
        # Use reset_index to compare values even if index was altered (though it shouldn't be)
        # Handle potential dtype changes if NaNs were introduced/removed in compared columns (unlikely here)
        try:
            pd.testing.assert_frame_equal(
                self.original_df[other_cols].reset_index(drop=True),
                self.result_df[other_cols].reset_index(drop=True),
                check_dtype=True, # Be strict about dtypes
                check_like=True # Ignore row/column order; labels and values must still match
            )
        except AssertionError as e:
            # Provide more specific feedback about where the difference lies
            diff_summary = []
            df1 = self.original_df[other_cols].reset_index(drop=True)
            df2 = self.result_df[other_cols].reset_index(drop=True)
            for col in other_cols:
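                if not df1[col].equals(df2[col]):
                    # NaN != NaN is True element-wise, so exclude positions where both sides are NaN
                    both_nan = df1[col].isna() & df2[col].isna()
                    diff_indices = df1.index[(df1[col] != df2[col]) & ~both_nan]
                    diff_summary.append(f"Column '{col}' differs at indices: {list(diff_indices[:5])}...") # Show first few diffs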

            self.fail(f"Ex6 Unchanged Cols Fail: DataFrame comparison failed for columns {other_cols}.\nDetails:\n{e}\nSummary of differences:\n" + "\n".join(diff_summary))
        except Exception as e:
             self.fail(f"Ex6 Unchanged Cols Fail: Error during comparison of columns {other_cols}.\nDetails: {type(e).__name__} - {e}")


        print("Test test_ex6_other_columns_unchanged PASSED.")
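
# --- Illustrative aside (not part of the original test suite) ---
# A minimal sketch of why element-wise `!=` can over-report differences when a
# column holds NaN (the mpg 'horsepower' column does), whereas `Series.equals`
# treats NaNs in matching positions as equal.
# The helper name `_nan_aware_diff_positions` and the `_demo_*` names are
# hypothetical, introduced only for this illustration.
import numpy as np
import pandas as pd

def _nan_aware_diff_positions(left, right):
    """Return index labels where two equally indexed Series differ, treating NaN == NaN."""
    both_nan = left.isna() & right.isna()
    return list(left.index[(left != right) & ~both_nan])

_demo_left = pd.Series([1.0, np.nan, 3.0])
_demo_right = pd.Series([1.0, np.nan, 4.0])
# Plain `!=` also flags the NaN position, because NaN != NaN evaluates to True.
_naive_diff = list(_demo_left.index[_demo_left != _demo_right])
# The NaN-aware variant reports only the position whose values really differ.
_aware_diff = _nan_aware_diff_positions(_demo_left, _demo_right)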


# --- Run Tests for Exercise 6 ---
display(exercise_6_process_car_names())
run_tests(TestExercise6)
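
# --- Illustrative aside (not part of the original notebook) ---
# A minimal sketch, on a toy frame, of the three steps the tests above check:
# first word -> 'initial_brand', typo replacement -> 'brand', remainder -> 'cleaned_name'.
# This is NOT the notebook's exercise_6_process_car_names; `_toy`, `_parts` and
# `_brand_fixes` are hypothetical names, and the mapping is only inferred from
# the expected values listed in the test cases.
import pandas as pd

_toy = pd.DataFrame({'name': ['chevy c20', 'maxda glc deluxe', 'vw pickup',
                              'mercedes-benz 280s', 'ford pinto']})
_brand_fixes = {'chevy': 'chevrolet', 'chevroelt': 'chevrolet', 'maxda': 'mazda',
                'vw': 'volkswagen', 'vokswagen': 'volkswagen',
                'mercedes-benz': 'mercedes'}

_parts = _toy['name'].str.split(n=1)               # split once, on the first run of whitespace
_toy['initial_brand'] = _parts.str[0]              # Step 1: raw first word
_toy['brand'] = _toy['initial_brand'].replace(_brand_fixes)  # Step 2: normalise typos and aliases
_toy['cleaned_name'] = _parts.str[1].str.strip()   # Step 3: name without the original first word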


--- Running Exercise 6 body: Load 'mpg', process car names ---


Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,name,initial_brand,brand,cleaned_name
0,18.0,8,307.0,130.0,3504,12.0,70,usa,chevrolet chevelle malibu,chevrolet,chevrolet,chevelle malibu
1,15.0,8,350.0,165.0,3693,11.5,70,usa,buick skylark 320,buick,buick,skylark 320
2,18.0,8,318.0,150.0,3436,11.0,70,usa,plymouth satellite,plymouth,plymouth,satellite
3,16.0,8,304.0,150.0,3433,12.0,70,usa,amc rebel sst,amc,amc,rebel sst
4,17.0,8,302.0,140.0,3449,10.5,70,usa,ford torino,ford,ford,torino
...,...,...,...,...,...,...,...,...,...,...,...,...
393,27.0,4,140.0,86.0,2790,15.6,82,usa,ford mustang gl,ford,ford,mustang gl
394,44.0,4,97.0,52.0,2130,24.6,82,europe,vw pickup,vw,volkswagen,pickup
395,32.0,4,135.0,84.0,2295,11.6,82,usa,dodge rampage,dodge,dodge,rampage
396,28.0,4,120.0,79.0,2625,18.6,82,usa,ford ranger,ford,ford,ranger



--- Running tests from TestExercise6 ---

Setting up TestExercise6: Loading original mpg data and running function...
test_ex6_brand_extraction_and_cleaning (__main__.TestExercise6.test_ex6_brand_extraction_and_cleaning)
Test Ex6 Step 2: Checks specific 'brand' values *after* extraction and replacement. ... 
Running test_ex6_brand_extraction_and_cleaning...
Test test_ex6_brand_extraction_and_cleaning PASSED.
ok
test_ex6_cleaned_name_creation (__main__.TestExercise6.test_ex6_cleaned_name_creation)
Test Ex6 Step 3: Checks specific 'cleaned_name' values (removal of *original* first word). ... 
Running test_ex6_cleaned_name_creation...
Test test_ex6_cleaned_name_creation PASSED.
ok
test_ex6_data_types (__main__.TestExercise6.test_ex6_data_types)
Test Ex6: Checks the data types of the new columns. ... 
Running test_ex6_data_types...
Test test_ex6_data_types PASSED.
ok
test_ex6_initial_brand_extraction (__main__.TestExercise6.test_ex6_initial_brand_extraction)
Test Ex6 Step 1: Checks 'brand