## Check Accuracy & Completeness

**Objective**: Learn to assess data quality by checking for accuracy and completeness using Python.

For this, you will use a sample dataset students.csv that contains the following
columns: ID , Name , Age , Grade , Email .

**Steps**:
1. Check Accuracy
    - Verify Numerical Data Accuracy
    - Validate Email Format
    - Integer Accuracy Check for Age
2. Check Completeness
    - Identify Missing Values
    - Rows with Missing Data
    - Column Specific Missing Value Check

In [1]:
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Check Accuracy & Completeness\n",
    "\n",
    "**Objective**: Learn to assess data quality by checking for accuracy and completeness using Python.\n",
    "\n",
    "For this, you will use a sample dataset students.csv that contains the following\n",
    "columns: ID , Name , Age , Grade , Email .\n",
    "\n",
    "**Steps**:\n",
    "1. Check Accuracy\n",
    "    - Verify Numerical Data Accuracy\n",
    "    - Validate Email Format\n",
    "    - Integer Accuracy Check for Age\n",
    "2. Check Completeness\n",
    "    - Identify Missing Values\n",
    "    - Rows with Missing Data\n",
    "    - Column Specific Missing Value Check"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "test_check_age_validity (__main__.TestDataQualityFunctions)\n",
      "Test the validity of the age column. ... FAIL\n",
      "test_check_column_missing (__main__.TestDataQualityFunctions)\n",
      "Test checking for missing values in specific column. ... ok\n",
      "test_check_null_values (__main__.TestDataQualityFunctions)\n",
      "Test checking for rows with null values. ... FAIL\n",
      "test_check_numerical_accuracy (__main__.TestDataQualityFunctions)\n",
      "Test numerical accuracy for age and grade. ... FAIL\n",
      "test_validate_email_format (__main__.TestDataQualityFunctions)\n",
      "Test email validation. ... ok\n",
      "\n",
      "======================================================================\n",
      "FAIL: test_check_age_validity (__main__.TestDataQualityFunctions)\n",
      "Test the validity of the age column.\n",
      "----------------------------------------------------------------------\n",
      "Traceback (most recent call last):\n",
      "  File \"/tmp/ipykernel_21504/2373566044.py\", line 135, in test_check_age_validity\n",
      "    self.assertEqual(len(result), 1, \"Should identify 1 invalid age.\")\n",
      "AssertionError: 0 != 1 : Should identify 1 invalid age.\n",
      "\n",
      "======================================================================\n",
      "FAIL: test_check_null_values (__main__.TestDataQualityFunctions)\n",
      "Test checking for rows with null values.\n",
      "----------------------------------------------------------------------\n",
      "Traceback (most recent call last):\n",
      "  File \"/tmp/ipykernel_21504/2373566044.py\", line 120, in test_check_null_values\n",
      "    self.assertEqual(len(result), 2, \"Should identify 2 rows with missing values.\")\n",
      "AssertionError: 1 != 2 : Should identify 2 rows with missing values.\n",
      "\n",
      "======================================================================\n",
      "FAIL: test_check_numerical_accuracy (__main__.TestDataQualityFunctions)\n",
      "Test numerical accuracy for age and grade.\n",
      "----------------------------------------------------------------------\n",
      "Traceback (most recent call last):\n",
      "  File \"/tmp/ipykernel_21504/2373566044.py\", line 125, in test_check_numerical_accuracy\n",
      "    self.assertEqual(len(result), 1, \"Should identify 1 invalid age.\")\n",
      "AssertionError: 0 != 1 : Should identify 1 invalid age.\n",
      "\n",
      "----------------------------------------------------------------------\n",
      "Ran 5 tests in 0.013s\n",
      "\n",
      "FAILED (failures=3)\n"
     ]
    }
   ],
   "source": [
    "# Write your code from here\n",
    "import pandas as pd\n",
    "import re\n",
    "\n",
    "# Function to check for missing values in a DataFrame\n",
    "def check_null_values(df):\n",
    "    \"\"\"\n",
    "    Check for null values in the DataFrame.\n",
    "    \n",
    "    Parameters:\n",
    "    df (pandas.DataFrame): Input DataFrame to check for null values.\n",
    "    \n",
    "    Returns:\n",
    "    pandas.DataFrame: Rows with null values.\n",
    "    \"\"\"\n",
    "    if not isinstance(df, pd.DataFrame):\n",
    "        raise ValueError(\"Input must be a pandas DataFrame.\")\n",
    "    \n",
    "    return df[df.isnull().any(axis=1)]\n",
    "\n",
    "# Function to check numerical accuracy of a column\n",
    "def check_numerical_accuracy(df, column, min_value, max_value):\n",
    "    \"\"\"\n",
    "    Check if the values in a numerical column fall within the specified range.\n",
    "    \n",
    "    Parameters:\n",
    "    df (pandas.DataFrame): Input DataFrame.\n",
    "    column (str): Column name to check.\n",
    "    min_value (int or float): Minimum valid value for the column.\n",
    "    max_value (int or float): Maximum valid value for the column.\n",
    "    \n",
    "    Returns:\n",
    "    pandas.DataFrame: Rows where values in the column are out of the specified range.\n",
    "    \"\"\"\n",
    "    if column not in df.columns:\n",
    "        raise KeyError(f\"Column '{column}' not found in DataFrame.\")\n",
    "    \n",
    "    if df[column].dtype not in ['int64', 'float64']:\n",
    "        raise TypeError(f\"Column '{column}' must be numeric.\")\n",
    "\n",
    "    return df[(df[column] < min_value) | (df[column] > max_value)]\n",
    "\n",
    "# Function to validate email format using regex\n",
    "def validate_email_format(df, column):\n",
    "    \"\"\"\n",
    "    Validate if email addresses in the specified column have the correct format.\n",
    "    \n",
    "    Parameters:\n",
    "    df (pandas.DataFrame): Input DataFrame.\n",
    "    column (str): Column name to validate.\n",
    "    \n",
    "    Returns:\n",
    "    pandas.DataFrame: Rows with invalid email format.\n",
    "    \"\"\"\n",
    "    if column not in df.columns:\n",
    "        raise KeyError(f\"Column '{column}' not found in DataFrame.\")\n",
    "    \n",
    "    if not df[column].dtype == 'object':\n",
    "        raise TypeError(f\"Column '{column}' must be of type string.\")\n",
    "    \n",
    "    email_regex = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$'\n",
    "    return df[~df[column].str.match(email_regex, na=False)]\n",
    "\n",
    "# Function to check if the age is valid\n",
    "def check_age_validity(df, column, min_age=0, max_age=120):\n",
    "    \"\"\"\n",
    "    Ensure that the age values in the column are within a reasonable human range.\n",
    "    \n",
    "    Parameters:\n",
    "    df (pandas.DataFrame): Input DataFrame.\n",
    "    column (str): Column name to check age validity.\n",
    "    min_age (int): Minimum valid age (default is 0).\n",
    "    max_age (int): Maximum valid age (default is 120).\n",
    "    \n",
    "    Returns:\n",
    "    pandas.DataFrame: Rows with invalid age values.\n",
    "    \"\"\"\n",
    "    if column not in df.columns:\n",
    "        raise KeyError(f\"Column '{column}' not found in DataFrame.\")\n",
    "    \n",
    "    return check_numerical_accuracy(df, column, min_age, max_age)\n",
    "\n",
    "# Function to check for missing values in a specific column\n",
    "def check_column_missing(df, column):\n",
    "    \"\"\"\n",
    "    Check if a specific column has missing values.\n",
    "    \n",
    "    Parameters:\n",
    "    df (pandas.DataFrame): Input DataFrame.\n",
    "    column (str): Column name to check for missing values.\n",
    "    \n",
    "    Returns:\n",
    "    pandas.DataFrame: Rows where the specified column has missing values.\n",
    "    \"\"\"\n",
    "    if column not in df.columns:\n",
    "        raise KeyError(f\"Column '{column}' not found in DataFrame.\")\n",
    "    \n",
    "    return df[df[column].isnull()]\n",
    "\n",
    "# Unit tests for validation functions\n",
    "import unittest\n",
    "\n",
    "class TestDataQualityFunctions(unittest.TestCase):\n",
    "    \n",
    "    def setUp(self):\n",
    "        \"\"\"Setup for unit tests - creating a sample dataframe.\"\"\"\n",
    "        data = {\n",
    "            'ID': [1, 2, 3, 4, 5, 6],\n",
    "            'Name': ['John Doe', 'Jane Smith', 'Bob Johnson', 'Alice White', 'Charlie Brown', 'David Lee'],\n",
    "            'Age': [20, 22, 19, 25, 21, None],\n",
    "            'Grade': [85, 90, 78, 88, 95, None],\n",
    "            'Email': ['johndoe@example.com', 'janesmith@example.com', 'bobjohnson@example.com', \n",
    "                      'alicewhite@example.com', 'charliebrown@example.com', 'invalid-email']\n",
    "        }\n",
    "        self.df = pd.DataFrame(data)\n",
    "    \n",
    "    def test_check_null_values(self):\n",
    "        \"\"\"Test checking for rows with null values.\"\"\"\n",
    "        result = check_null_values(self.df)\n",
    "        self.assertEqual(len(result), 2, \"Should identify 2 rows with missing values.\")\n",
    "    \n",
    "    def test_check_numerical_accuracy(self):\n",
    "        \"\"\"Test numerical accuracy for age and grade.\"\"\"\n",
    "        result = check_numerical_accuracy(self.df, 'Age', 0, 120)\n",
    "        self.assertEqual(len(result), 1, \"Should identify 1 invalid age.\")\n",
    "\n",
    "    def test_validate_email_format(self):\n",
    "        \"\"\"Test email validation.\"\"\"\n",
    "        result = validate_email_format(self.df, 'Email')\n",
    "        self.assertEqual(len(result), 1, \"Should identify 1 invalid email.\")\n",
    "\n",
    "    def test_check_age_validity(self):\n",
    "        \"\"\"Test the validity of the age column.\"\"\"\n",
    "        result = check_age_validity(self.df, 'Age', 0, 120)\n",
    "        self.assertEqual(len(result), 1, \"Should identify 1 invalid age.\")\n",
    "    \n",
    "    def test_check_column_missing(self):\n",
    "        \"\"\"Test checking for missing values in specific column.\"\"\"\n",
    "        result = check_column_missing(self.df, 'Grade')\n",
    "        self.assertEqual(len(result), 1, \"Should identify 1 row with missing grade.\")\n",
    "\n",
    "if __name__ == \"__main__\":\n",
    "    unittest.main(argv=[''], verbosity=2, exit=False)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}

{'cells': [{'cell_type': 'markdown',
   'metadata': {},
   'source': ['## Check Accuracy & Completeness\n',
    '\n',
    '**Objective**: Learn to assess data quality by checking for accuracy and completeness using Python.\n',
    '\n',
    'For this, you will use a sample dataset students.csv that contains the following\n',
    'columns: ID , Name , Age , Grade , Email .\n',
    '\n',
    '**Steps**:\n',
    '1. Check Accuracy\n',
    '    - Verify Numerical Data Accuracy\n',
    '    - Validate Email Format\n',
    '    - Integer Accuracy Check for Age\n',
    '2. Check Completeness\n',
    '    - Identify Missing Values\n',
    '    - Rows with Missing Data\n',
    '    - Column Specific Missing Value Check']},
  {'cell_type': 'code',
   'execution_count': 1,
   'metadata': {},
   'outputs': [{'name': 'stderr',
     'output_type': 'stream',
     'text': ['test_check_age_validity (__main__.TestDataQualityFunctions)\n',
      'Test the validity of the age column. ... FAIL\n',
      't