# General ML Model Pipeline

### Table of contents:
1. Dependencies  
2. Read-in data  
3. Data cleanup  
    3a. Check inconsistencies  
    3b. Fix inconsistencies  
    3c. Determine feature data types  
    3d. Convert and re-code feature data  
    3e. Missing values  
4. Group imbalance

### 1. Dependencies

In [9]:
import pandas as pd
import numpy as np

### 2. Read-in data
Read the data and load it into a pandas DataFrame.  
Test case: local data in .csv format.
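
The read-in step can be sketched as follows. This is a minimal sketch, not the notebook's own code: the helper name `read_data` and the sample columns are hypothetical, and an in-memory buffer stands in for the local .csv file so the example runs anywhere.

```python
import io

import pandas as pd


def read_data(source, **read_csv_kwargs) -> pd.DataFrame:
    """Load delimited text into a DataFrame.

    `source` may be a file path or an open file-like object;
    extra keyword arguments are passed straight to pd.read_csv.
    """
    return pd.read_csv(source, **read_csv_kwargs)


# Made-up sample data standing in for a local .csv file (assumption)
sample_csv = io.StringIO("id,group,score\n1,a,0.5\n2,b,0.7\n3,a,\n")
df = read_data(sample_csv)

print(df.shape)
print(df.dtypes)
```

For a real file, pass its path instead of the buffer, along with any `pd.read_csv` options the file needs, such as `sep=";"`.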

### 3. Data cleanup

### 3a. Check inconsistencies
Custom function that checks a DataFrame for common inconsistencies: missing values, data types, stray delimiters, duplicate columns and rows, inconsistent casing, and extra whitespace.

In [10]:
def check_inconsistent_values(df: pd.DataFrame) -> None:
    """Print a report of common data-quality issues found in df."""
    # Checking for missing values
    print("Missing Values Check:")
    print(df.isnull().sum())
    
    # Check for inconsistent data types
    print("\nData Types Check:")
    print(df.dtypes)
    
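    # Check for non-numeric entries in columns that look numeric.
    # A column with an int64/float64 dtype cannot hold non-numeric values, so
    # inspect object columns where most (but not all) entries parse as numbers.
    print("\nNon-numeric Data in Mostly Numeric Columns:")
    for col in df.select_dtypes(include=['object']).columns:
        values = df[col].dropna()
        unparsable = pd.to_numeric(values, errors='coerce').isna()
        # Heuristic threshold: under half of the entries fail to parse
        if 0 < unparsable.mean() < 0.5:
            print(f"Non-numeric values found in {col}:")
            print(values[unparsable])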
    
    # Checking for delimiter issues: flag any string cell containing a comma or semicolon
    print("\nDelimiter Issues Check (commas, semicolons):")
    # Column-wise Series.map avoids DataFrame.applymap, which is deprecated since pandas 2.1
    delimiter_issues = df.apply(
        lambda s: s.map(lambda x: isinstance(x, str) and (',' in x or ';' in x))
    )
    if delimiter_issues.any().any():
        print("Delimiter issues found in the following columns:")
        print(df.columns[delimiter_issues.any()])

    # Checking for duplicate features (repeated column names)
    print("\nDuplicate Columns Check:")
    # Index the column labels with the mask; df[mask] would filter rows instead
    duplicatecols = df.columns[df.columns.duplicated()]
    if len(duplicatecols) > 0:
        print(f"{len(duplicatecols)} duplicate columns found:")
        print(duplicatecols.tolist())
    else:
        print("No duplicate columns found.")

    # Checking for duplicate rows
    print("\nDuplicate Rows Check:")
    duplicaterows = df[df.duplicated()]
    if not duplicaterows.empty:
        print(f"{len(duplicaterows)} duplicate rows found:")
        print(duplicaterows)
    else:
        print("No duplicate rows found.")
    
    # Checking for mixed-case values in string columns (neither all lowercase nor all uppercase)
    print("\nInconsistent Casing in String Columns:")
    for col in df.select_dtypes(include=['object']).columns:
        inconsistent_case = df[col].apply(lambda x: isinstance(x, str) and (x != x.lower() and x != x.upper()))
        if inconsistent_case.any():
            print(f"Inconsistent casing found in {col}:")
            print(df[inconsistent_case][col])

    # Checking for extra whitespace in string columns
    print("\nExtra Whitespace in String Columns:")
    for col in df.select_dtypes(include=['object']).columns:
        whitespace_issues = df[col].apply(lambda x: isinstance(x, str) and (x != x.strip()))
        if whitespace_issues.any():
            print(f"Whitespace issues found in {col}:")
            print(df[whitespace_issues][col])
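
To see what these checks flag, it can help to run a few of them on a small frame with known problems. The sketch below is an illustration rather than part of the original notebook: the helper `summarize_inconsistencies` and the toy data are hypothetical. It returns counts instead of printing, which makes the result easy to assert on.

```python
import pandas as pd


def summarize_inconsistencies(df: pd.DataFrame) -> dict:
    """Return counts for a few data-quality checks instead of printing them."""
    whitespace = 0
    for col in df.select_dtypes(include=["object"]).columns:
        values = df[col].dropna()
        # Keep only actual strings so .str methods are safe on mixed columns
        values = values[values.map(lambda x: isinstance(x, str))]
        whitespace += int((values != values.str.strip()).sum())
    return {
        "missing_values": int(df.isnull().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
        "duplicate_column_names": int(df.columns.duplicated().sum()),
        "cells_with_extra_whitespace": whitespace,
    }


# Toy data (made up): one missing score, one repeated row, one padded name
toy = pd.DataFrame({
    "name": pd.Series(["Alice", " bob", "CAROL", "Alice"], dtype="object"),
    "score": [1.0, None, 3.0, 1.0],
})
summary = summarize_inconsistencies(toy)
print(summary)
```

Returning a dictionary rather than printing is a design choice for this sketch: the counts can be logged, compared before and after cleanup, or checked in a test.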