## Check Uniqueness & Validity

**Objective**: Evaluate data quality by checking for uniqueness and validity of data entries.

For this activity, you will use a sample dataset, `students.csv`, that contains the following
columns: `ID`, `Name`, `Age`, `Grade`, `Email`.

**Steps**:
1. Check Uniqueness
    - Unique IDs
    - Unique Email Addresses
    - Unique Combination

2. Check Validity
    - Validate Age Range
    - Validate Grade Scale
    - Validate Name Format
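
The steps above can be sketched end to end on a small in-memory table. This is a minimal sketch, not the activity's reference solution: the rows in `SAMPLE_CSV` are made up to stand in for `students.csv` (whose contents are not shown), and the age bounds (0 to 120), grade bounds (0 to 100), "First Last" name pattern, and the choice of `ID` plus `Email` as the combination are assumptions.

```python
import io

import pandas as pd

# Hypothetical stand-in for students.csv; these rows are invented for illustration.
SAMPLE_CSV = """ID,Name,Age,Grade,Email
1,John Doe,20,85,johndoe@example.com
2,Jane Smith,22,90,janesmith@example.com
3,alice white,25,101,alicewhite@example.com
3,Bob Johnson,130,78,bobjohnson@example.com
"""

df = pd.read_csv(io.StringIO(SAMPLE_CSV))

# Step 1: uniqueness checks
ids_unique = bool(df["ID"].is_unique)        # ID 3 appears twice
emails_unique = bool(df["Email"].is_unique)  # every email is distinct
# A combination is unique when no (ID, Email) pair repeats.
combo_unique = not df.duplicated(subset=["ID", "Email"]).any()

# Step 2: validity checks (bounds and pattern are assumptions, see above)
bad_age = df[(df["Age"] < 0) | (df["Age"] > 120)]
bad_grade = df[(df["Grade"] < 0) | (df["Grade"] > 100)]
name_pattern = r"^[A-Z][a-z]* [A-Z][a-z]*$"  # capitalized first and last name
bad_name = df[~df["Name"].str.match(name_pattern, na=False)]

print("IDs unique:", ids_unique)
print("Emails unique:", emails_unique)
print("ID+Email unique:", combo_unique)
print("Invalid ages:", bad_age["Name"].tolist())
print("Invalid grades:", bad_grade["Name"].tolist())
print("Invalid names:", bad_name["Name"].tolist())
```

Note that a repeated `ID` does not by itself make the `ID` and `Email` combination non-unique: the two rows with `ID` 3 carry different emails, so the pair check still passes.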

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Check Uniqueness & Validity\n",
    "\n",
    "**Objective**: Evaluate data quality by checking for uniqueness and validity of data entries.\n",
    "\n",
    "For this activity, you will use a sample dataset students.csv that contains the following\n",
    "columns: ID , Name , Age , Grade , Email .\n",
    "\n",
    "**Steps**:\n",
    "1. Check Uniqueness\n",
    "    - Unique IDs\n",
    "    - Unique Email Addresses\n",
    "    - Unique Combination\n",
    "\n",
    "2. Check Validity\n",
    "    - Validate Age Range\n",
    "    - Validate Grade Scale\n",
    "    - Validate Name Format"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "test_check_unique (__main__.TestDataQualityFunctions)\n",
      "Test uniqueness of a single column (ID). ... ok\n",
      "test_check_unique_combination (__main__.TestDataQualityFunctions)\n",
      "Test uniqueness of the combination of multiple columns (ID and Email). ... FAIL\n",
      "test_validate_age (__main__.TestDataQualityFunctions)\n",
      "Test validation of the age column. ... ok\n",
      "test_validate_grade (__main__.TestDataQualityFunctions)\n",
      "Test validation of the grade column. ... ok\n",
      "test_validate_name_format (__main__.TestDataQualityFunctions)\n",
      "Test validation of the name format column. ... ok\n",
      "\n",
      "======================================================================\n",
      "FAIL: test_check_unique_combination (__main__.TestDataQualityFunctions)\n",
      "Test uniqueness of the combination of multiple columns (ID and Email).\n",
      "----------------------------------------------------------------------\n",
      "Traceback (most recent call last):\n",
      "  File \"/tmp/ipykernel_22294/1524112830.py\", line 120, in test_check_unique_combination\n",
      "    self.assertFalse(result, \"Combination of ID and Email should be unique.\")\n",
      "AssertionError: True is not false : Combination of ID and Email should be unique.\n",
      "\n",
      "----------------------------------------------------------------------\n",
      "Ran 5 tests in 0.014s\n",
      "\n",
      "FAILED (failures=1)\n"
     ]
    }
   ],
   "source": [
    "# Write your code from here\n",
    "import pandas as pd\n",
    "import re\n",
    "\n",
    "# Function to check for uniqueness of a column\n",
    "def check_unique(df, column):\n",
    "    \"\"\"\n",
    "    Check if all values in a specified column are unique.\n",
    "    \n",
    "    Parameters:\n",
    "    df (pandas.DataFrame): Input DataFrame.\n",
    "    column (str): Column name to check uniqueness.\n",
    "    \n",
    "    Returns:\n",
    "    bool: True if all values are unique, False otherwise.\n",
    "    \"\"\"\n",
    "    if column not in df.columns:\n",
    "        raise KeyError(f\"Column '{column}' not found in DataFrame.\")\n",
    "    \n",
    "    return df[column].is_unique\n",
    "\n",
    "# Function to check for unique combination of multiple columns\n",
    "def check_unique_combination(df, columns):\n",
    "    \"\"\"\n",
    "    Check if the combination of multiple columns has unique rows.\n",
    "    \n",
    "    Parameters:\n",
    "    df (pandas.DataFrame): Input DataFrame.\n",
    "    columns (list): List of column names to check unique combination.\n",
    "    \n",
    "    Returns:\n",
    "    bool: True if all combinations of values are unique, False otherwise.\n",
    "    \"\"\"\n",
    "    if not all(col in df.columns for col in columns):\n",
    "        raise KeyError(\"One or more columns not found in DataFrame.\")\n",
    "    \n",
    "    return df.duplicated(subset=columns).sum() == 0\n",
    "\n",
    "# Function to validate Age (age should be within 0 to 120)\n",
    "def validate_age(df, column, min_age=0, max_age=120):\n",
    "    \"\"\"\n",
    "    Check if the Age values fall within a valid range.\n",
    "    \n",
    "    Parameters:\n",
    "    df (pandas.DataFrame): Input DataFrame.\n",
    "    column (str): Column name to check age validity.\n",
    "    min_age (int): Minimum valid age (default is 0).\n",
    "    max_age (int): Maximum valid age (default is 120).\n",
    "    \n",
    "    Returns:\n",
    "    pandas.DataFrame: Rows with invalid age values.\n",
    "    \"\"\"\n",
    "    if column not in df.columns:\n",
    "        raise KeyError(f\"Column '{column}' not found in DataFrame.\")\n",
    "    \n",
    "    return df[(df[column] < min_age) | (df[column] > max_age)]\n",
    "\n",
    "# Function to validate Grade (grade should be between 0 and 100)\n",
    "def validate_grade(df, column, min_grade=0, max_grade=100):\n",
    "    \"\"\"\n",
    "    Check if the Grade values fall within a valid range.\n",
    "    \n",
    "    Parameters:\n",
    "    df (pandas.DataFrame): Input DataFrame.\n",
    "    column (str): Column name to check grade validity.\n",
    "    min_grade (int): Minimum valid grade (default is 0).\n",
    "    max_grade (int): Maximum valid grade (default is 100).\n",
    "    \n",
    "    Returns:\n",
    "    pandas.DataFrame: Rows with invalid grade values.\n",
    "    \"\"\"\n",
    "    if column not in df.columns:\n",
    "        raise KeyError(f\"Column '{column}' not found in DataFrame.\")\n",
    "    \n",
    "    return df[(df[column] < min_grade) | (df[column] > max_grade)]\n",
    "\n",
    "# Function to validate Name format (name should be capitalized)\n",
    "def validate_name_format(df, column):\n",
    "    \"\"\"\n",
    "    Validate if the Name follows proper format (capitalized).\n",
    "    \n",
    "    Parameters:\n",
    "    df (pandas.DataFrame): Input DataFrame.\n",
    "    column (str): Column name to check name validity.\n",
    "    \n",
    "    Returns:\n",
    "    pandas.DataFrame: Rows with invalid name format.\n",
    "    \"\"\"\n",
    "    if column not in df.columns:\n",
    "        raise KeyError(f\"Column '{column}' not found in DataFrame.\")\n",
    "    \n",
    "    name_regex = r'^[A-Z][a-z]* [A-Z][a-z]*$'  # First and Last name, both capitalized\n",
    "    return df[~df[column].str.match(name_regex, na=False)]\n",
    "\n",
    "# Unit tests for validation functions\n",
    "import unittest\n",
    "\n",
    "class TestDataQualityFunctions(unittest.TestCase):\n",
    "    \n",
    "    def setUp(self):\n",
    "        \"\"\"Setup for unit tests - creating a sample dataframe.\"\"\"\n",
    "        data = {\n",
    "            'ID': [1, 2, 3, 4, 5, 5],\n",
    "            'Name': ['John Doe', 'Jane Smith', 'Bob Johnson', 'alice white', 'Charlie Brown', 'David Lee'],\n",
    "            'Age': [20, 22, 19, 25, 121, None],\n",
    "            'Grade': [85, 90, 78, 88, 101, None],\n",
    "            'Email': ['johndoe@example.com', 'janesmith@example.com', 'bobjohnson@example.com', \n",
    "                      'alicewhite@example.com', 'charliebrown@example.com', 'invalid-email']\n",
    "        }\n",
    "        self.df = pd.DataFrame(data)\n",
    "    \n",
    "    def test_check_unique(self):\n",
    "        \"\"\"Test uniqueness of a single column (ID).\"\"\"\n",
    "        result = check_unique(self.df, 'ID')\n",
    "        self.assertFalse(result, \"IDs should not be unique due to duplication.\")\n",
    "    \n",
    "    def test_check_unique_combination(self):\n",
    "        \"\"\"Test uniqueness of the combination of multiple columns (ID and Email).\"\"\"\n",
    "        result = check_unique_combination(self.df, ['ID', 'Email'])\n",
    "        self.assertFalse(result, \"Combination of ID and Email should be unique.\")\n",
    "    \n",
    "    def test_validate_age(self):\n",
    "        \"\"\"Test validation of the age column.\"\"\"\n",
    "        result = validate_age(self.df, 'Age', 0, 120)\n",
    "        self.assertEqual(len(result), 1, \"Should identify 1 invalid age.\")\n",
    "\n",
    "    def test_validate_grade(self):\n",
    "        \"\"\"Test validation of the grade column.\"\"\"\n",
    "        result = validate_grade(self.df, 'Grade', 0, 100)\n",
    "        self.assertEqual(len(result), 1, \"Should identify 1 invalid grade.\")\n",
    "    \n",
    "    def test_validate_name_format(self):\n",
    "        \"\"\"Test validation of the name format column.\"\"\"\n",
    "        result = validate_name_format(self.df, 'Name')\n",
    "        self.assertEqual(len(result), 1, \"Should identify 1 invalid name format.\")\n",
    "\n",
    "if __name__ == \"__main__\":\n",
    "    unittest.main(argv=[''], verbosity=2, exit=False)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
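
One behaviour of the range checks in the notebook above is worth spelling out: in pandas, comparisons against `NaN` evaluate to `False`, so a row with a missing `Age` or `Grade` is not returned by `validate_age` or `validate_grade`. A minimal sketch of one way to report missing values as well follows. The helper name `find_out_of_range` and its `flag_missing` parameter are hypothetical and not part of the notebook, and whether a missing value should count as invalid is an assumption that depends on the dataset.

```python
import pandas as pd


def find_out_of_range(df, column, low, high, flag_missing=True):
    """Return rows whose value in `column` lies outside [low, high].

    Hypothetical helper for illustration. With flag_missing=True, rows whose
    value is missing (NaN/None) are returned as well.
    """
    if column not in df.columns:
        raise KeyError(f"Column '{column}' not found in DataFrame.")
    values = df[column]
    # NaN < low and NaN > high are both False, so missing values pass this mask.
    invalid = (values < low) | (values > high)
    if flag_missing:
        invalid = invalid | values.isna()
    return df[invalid]


# Same Age values as the notebook's test fixture.
ages = pd.DataFrame({"Age": [20, 22, 19, 25, 121, None]})

strict = find_out_of_range(ages, "Age", 0, 120)                        # 121 and the missing value
lenient = find_out_of_range(ages, "Age", 0, 120, flag_missing=False)   # 121 only

print(len(strict), len(lenient))  # 2 1
```

With `flag_missing=False` the helper behaves like the comparison used in `validate_age`, which is why the notebook's test expects exactly one invalid age.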

{'cells': [{'cell_type': 'markdown',
   'metadata': {},
   'source': ['## Check Uniqueness & Validity\n',
    '\n',
    '**Objective**: Evaluate data quality by checking for uniqueness and validity of data entries.\n',
    '\n',
    'For this activity, you will use a sample dataset students.csv that contains the following\n',
    'columns: ID , Name , Age , Grade , Email .\n',
    '\n',
    '**Steps**:\n',
    '1. Check Uniqueness\n',
    '    - Unique IDs\n',
    '    - Unique Email Addresses\n',
    '    - Unique Combination\n',
    '\n',
    '2. Check Validity\n',
    '    - Validate Age Range\n',
    '    - Validate Grade Scale\n',
    '    - Validate Name Format']},
  {'cell_type': 'code',
   'execution_count': 1,
   'metadata': {},
   'outputs': [{'name': 'stderr',
     'output_type': 'stream',
     'text': ['test_check_unique (__main__.TestDataQualityFunctions)\n',
      'Test uniqueness of a single column (ID). ... ok\n',
      'test_check_unique_combination (__main__.TestDataQ