# Uncovering Healthcare Inefficiencies - SQL Database and Tableau Visualizations

This notebook documents how a SQL database is set up and populated with data obtained from the Centers for Medicare & Medicaid Services (CMS). It then creates additional tables and views within the database. The refined tables and views are exported to Tableau to build the visualization dashboards.

---

## Table of Contents 
1. [Import Libraries](#import-libraries)
2. [Connect to Database](#connect-to-database-fwa_healthcare)
3. [Table Manipulation](#table-manipulation)
4. [View Manipulation](#view-manipulation)
5. [Potential FWA Prediction (Bagging Classifier Results)](#potential-fwa-prediction-bagging-classifier-results)
6. [Accessing Visualizations](#accessing-visualizations)

## Import Libraries

The following libraries are imported to support the SQL database process:

In [2]:
# import required libraries
import pandas as pd
import numpy as np
import os

import pymysql
from sqlalchemy import Table, Column, Integer, Float, String, Text, MetaData, VARCHAR, DECIMAL
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
from sqlalchemy.sql import text

from tqdm import tqdm

from tabulate import tabulate

import powerbiclient

from powerbiclient import QuickVisualize, get_dataset_config, Report
from powerbiclient.authentication import DeviceCodeLoginAuthentication

from geopy.geocoders import Bing
import folium

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)  # suppress FutureWarning messages



## Connect to Database (`fwa_healthcare`) 

The connection parameters are defined, and a connection to the SQL database is established:

In [3]:
# connection parameters
username = "root"
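password = os.getenv("FWA_DB_PASSWORD", "")  # read the secret from an environment variable (variable name is an assumption) instead of hard-coding it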
hostName = "127.0.0.1"
database = "fwa_healthcare"

# connect to the database
try:
    conn = pymysql.connect(host=hostName, 
                           port=3306,
                           user=username,
                           password=password, 
                           db=database,
                           local_infile=True)  # enable local file loading
    print("Connection to SQL was successful!")
except pymysql.MySQLError as e:
    print(f"Error connecting to MySQL: {e}")
    raise  # re-raise so later cells do not run without a connection

Connection to SQL was successful!


A SQLAlchemy engine is also defined:

In [4]:
# define SQLAlchemy engine
engine = create_engine(f'mysql+pymysql://{username}:{password}@{hostName}/{database}')

# define metadata
metadata = MetaData()
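
If the password contains characters that are special in a URL, such as `@` or `/`, interpolating it directly into the connection string can produce a URL that parses incorrectly. The following is a minimal sketch of one way to guard against this with the standard library's `urllib.parse.quote_plus`. The helper name `build_mysql_url` and the sample password are hypothetical and are not part of the original notebook.

```python
from urllib.parse import quote_plus


def build_mysql_url(user, password, host, database, port=3306):
    """Assemble a mysql+pymysql URL with percent-encoded credentials."""
    return (
        f"mysql+pymysql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


# hypothetical credentials, chosen only to show the escaping
example_url = build_mysql_url("root", "p@ss/word#1", "127.0.0.1", "fwa_healthcare")
print(example_url)
```

The resulting string can be passed to `create_engine` in place of the f-string used in the cell above.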

### Import Data to Existing Database 

The following steps are followed to import data from a CSV file into the SQL database:

In [5]:
# define the file path
file_path = "data/cms_data_sql.csv"

# function to count the number of lines in the file
def count_lines(file_path):
    with open(file_path, 'r') as file:
        return sum(1 for _ in file)

# count total lines in the file (excluding header)
total_lines = count_lines(file_path) - 1  # subtracting header row

cursor = conn.cursor()

try:
    # check if the table already contains data
    cursor.execute("SELECT COUNT(*) FROM healthcare_market_saturation_fraud;")
    row_count = cursor.fetchone()[0]

    if row_count > 0:
        print(f"Table already contains {row_count} rows. Skipping data import.")
    else:
        # create SQL command for importing data
        sql = f"""
        LOAD DATA LOCAL INFILE '{file_path}'
        INTO TABLE `healthcare_market_saturation_fraud`
        FIELDS TERMINATED BY ','
        ENCLOSED BY '"'
        LINES TERMINATED BY '\n'
        IGNORE 1 LINES
        (reference_period, type_of_service, aggregation_level, state, county, state_fips, county_fips, num_fee_service_benef, 
        num_providers, avg_users_per_provider, pct_users_out_ffs_benef, num_users, avg_providers_per_county, num_dual_eligible_users, 
        pct_dual_eligible_users_out_total_users, percent_dual_elig_ffs, total_payment, moratorium, num_fee_service_benef_dual_color, 
        num_fee_service_benef_desc, num_providers_dual_color, num_providers_desc, avg_users_per_provider_dual_color, avg_users_per_provider_desc, 
        pct_users_out_ffs_benef_dual_color, pct_users_out_ffs_benef_desc, num_users_dual_color, num_users_desc, avg_providers_per_county_dual_color, 
        avg_providers_per_county_desc, num_dual_eligible_users_dual_color, num_dual_eligible_users_desc, pct_dual_eligible_total_users_dual_color, 
        pct_dual_eligible_total_users_desc, pct_dual_eligible_ffs_dual_color, pct_dual_eligible_ffs_desc, total_payment_dual_color, total_payment_desc, 
        num_fee_service_benef_change, num_providers_change, avg_users_per_provider_change, pct_users_out_ffs_benef_change, num_users_change, 
        avg_providers_per_county_change, num_dual_eligible_users_change, pct_dual_eligible_users_out_total_users_change, pct_dual_eligible_ffs_change, 
        total_payment_change);
        """

        # start a progress bar
        with tqdm(total=total_lines, desc='Importing Data', unit='line') as pbar:
            cursor.execute(sql)
            conn.commit()
            print("Data imported successfully!")
            # update the progress bar
            pbar.update(total_lines)  

except pymysql.MySQLError as e:
    print(f"Error: {e}")
    conn.rollback()

Table already contains 1044711 rows. Skipping data import.


Because the table already contained 1,044,711 rows, the import was skipped. This check prevents the same CSV file from being loaded twice and creating duplicate rows.
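
The "load only when the table is empty" pattern can be tried without a MySQL server. The sketch below uses the standard library's `sqlite3` and `csv` modules as a stand-in, so it uses `executemany` rather than `LOAD DATA LOCAL INFILE`. The function name `load_if_empty`, the table `demo_source`, and the two rows are hypothetical toy values, not CMS data.

```python
import csv
import io
import sqlite3


def load_if_empty(conn, table, columns, csv_text):
    """Insert the CSV rows into `table` only when the table has no rows yet.

    Returns the number of rows inserted (0 when the load is skipped).
    `table` and `columns` are trusted identifiers here, not user input.
    """
    existing = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    if existing > 0:
        return 0

    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the header row
    rows = [tuple(row) for row in reader]

    placeholders = ", ".join("?" for _ in columns)
    conn.executemany(
        f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})",
        rows,
    )
    conn.commit()
    return len(rows)


# toy data, not the CMS file
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE demo_source (state TEXT, county TEXT, total_payment REAL)")
demo_csv = "state,county,total_payment\nAL,AUTAUGA,100.0\nAK,ANCHORAGE,250.5\n"
demo_columns = ["state", "county", "total_payment"]

first_load = load_if_empty(demo_conn, "demo_source", demo_columns, demo_csv)
second_load = load_if_empty(demo_conn, "demo_source", demo_columns, demo_csv)
print(first_load, second_load)
```

The second call inserts nothing because the first call already populated the table, which mirrors the "Skipping data import" message above.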

## Table Manipulation

In this section, various tables are defined and created in the SQL database using SQLAlchemy, followed by inserting data from an existing table into these newly created tables. The following tables are defined:
- **Providers**: This table holds information about providers, including their identifiers and various metrics related to beneficiaries and payments. 
- **Beneficiaries**: This table tracks information about beneficiaries, including their identifiers, reference periods, and metrics related to fees, providers, and payments.
- **Provider Utilization**: This table records data on provider utilization, including metrics like the number of beneficiaries, providers, and payments.
- **Changes**: This table captures changes over time in metrics related to beneficiaries, providers, and payments, including various percentages and counts.
- **Dual Eligible**: This table holds information about dual-eligible users, including their counts and percentages, along with related descriptive data.

### Define Tables 
SQLAlchemy is used to define the structure of each table, with explicit columns and data types. The `extend_existing=True` parameter lets a `Table` that is already registered in the `MetaData` object be redefined, which is useful when the cell is re-run.

In [6]:
# define tables

# Providers table
providers = Table('providers', metadata,
    Column('provider_id', Integer, primary_key=True, autoincrement=True),
    Column('state', VARCHAR(2)),  
    Column('county', VARCHAR(255)), 
    Column('state_fips', Integer),  
    Column('county_fips', Integer),  
    Column('total_beneficiaries', DECIMAL(20, 0)),  
    Column('total_providers', DECIMAL(20, 0)),  
    Column('avg_users_per_provider', DECIMAL(20, 2)), 
    Column('avg_pct_users_out_ffs', DECIMAL(5, 2)), 
    Column('avg_total_payment', DECIMAL(20, 2)),  
    extend_existing=True
)

# Beneficiaries table
beneficiaries = Table('beneficiaries', metadata,
    Column('beneficiary_id', Integer, primary_key=True, autoincrement=True),
    Column('reference_period', VARCHAR(255)),
    Column('num_fee_service_benef', DECIMAL(20, 0)),  
    Column('num_providers', DECIMAL(20, 0)),
    Column('num_users', DECIMAL(20, 0)),
    Column('total_payment', DECIMAL(20, 2)),
    Column('moratorium', VARCHAR(255)),
    Column('state', VARCHAR(2)), 
    Column('county', VARCHAR(255)),            
    extend_existing=True
)

# Provider Utilization table
provider_utilization = Table('provider_utilization', metadata,
    Column('utilization_id', Integer, primary_key=True, autoincrement=True),
    Column('state', VARCHAR(2)),
    Column('county', VARCHAR(255)),
    Column('state_fips', Integer),
    Column('county_fips', Integer),
    Column('total_beneficiaries', DECIMAL(20, 0)),
    Column('total_providers', DECIMAL(20, 0)),
    Column('avg_users_per_provider', DECIMAL(20, 2)),
    Column('avg_pct_users_out_ffs', DECIMAL(5, 2)),
    Column('avg_total_payment', DECIMAL(20, 2)),
    extend_existing=True
)

# Changes table
changes = Table('changes', metadata,
    Column('change_id', Integer, primary_key=True, autoincrement=True),
    Column('reference_period', VARCHAR(255)),
    Column('num_fee_service_benef_change', DECIMAL(20, 2)),
    Column('num_providers_change', DECIMAL(20, 2)),
    Column('avg_users_per_provider_change', DECIMAL(20, 2)),
    Column('pct_users_out_ffs_benef_change', DECIMAL(5, 2)),
    Column('num_users_change', DECIMAL(20, 0)),
    Column('avg_providers_per_county_change', DECIMAL(20, 2)),
    Column('num_dual_eligible_users_change', DECIMAL(20, 0)),
    Column('pct_dual_eligible_users_out_total_users_change', DECIMAL(5, 2)),
    Column('pct_dual_eligible_ffs_change', DECIMAL(5, 2)),
    Column('total_payment_change', DECIMAL(20, 2)),
    extend_existing=True
)

# Dual Eligible table
dual_eligible = Table('dual_eligible', metadata,
    Column('dual_id', Integer, primary_key=True, autoincrement=True),
    Column('reference_period', VARCHAR(255)),
    Column('num_dual_eligible_users', DECIMAL(20, 0)),
    Column('pct_dual_eligible_users_out_of_total_users', DECIMAL(5, 2)),
    Column('num_users_dual_color', VARCHAR(255)),
    Column('pct_dual_eligible_ffs_dual_color', VARCHAR(255)),
    extend_existing=True
)

# create tables in the database
metadata.create_all(engine)

### Insert Data into Tables

The following SQL commands are used to insert data into the newly created tables from an existing source table.


In [7]:
# Define SQL commands
commands = [
    # beneficiaries table 
    """
    INSERT INTO beneficiaries (
        reference_period, 
        num_fee_service_benef, 
        num_providers, 
        num_users, 
        total_payment, 
        moratorium, 
        state, 
        county
    )
    SELECT 
        reference_period, 
        num_fee_service_benef, 
        num_providers, 
        num_users, 
        total_payment, 
        moratorium, 
        state, 
        county
    FROM `healthcare_market_saturation_fraud`;
    """,
    # changes table  
    """
    INSERT INTO changes (
        reference_period, 
        num_fee_service_benef_change, 
        num_providers_change, 
        avg_users_per_provider_change, 
        pct_users_out_ffs_benef_change, 
        num_users_change, 
        avg_providers_per_county_change, 
        num_dual_eligible_users_change, 
        pct_dual_eligible_users_out_total_users_change, 
        pct_dual_eligible_ffs_change, 
        total_payment_change
    )
    SELECT 
        reference_period, 
        num_fee_service_benef_change, 
        num_providers_change, 
        avg_users_per_provider_change, 
        pct_users_out_ffs_benef_change, 
        num_users_change, 
        avg_providers_per_county_change, 
        num_dual_eligible_users_change, 
        pct_dual_eligible_users_out_total_users_change, 
        pct_dual_eligible_ffs_change, 
        total_payment_change
    FROM `healthcare_market_saturation_fraud`;
    """,
    # dual eligible table 
    """
    INSERT INTO dual_eligible (
        reference_period, 
        num_dual_eligible_users, 
        pct_dual_eligible_users_out_of_total_users, 
        num_users_dual_color, 
        pct_dual_eligible_ffs_dual_color
    )
    SELECT 
        reference_period, 
        num_dual_eligible_users, 
        pct_dual_eligible_users_out_total_users, 
        num_users_dual_color, 
        pct_dual_eligible_ffs_dual_color
    FROM `healthcare_market_saturation_fraud`;
    """,
    #provider utilization table 
    """
    INSERT INTO provider_utilization (
        state, 
        county, 
        state_fips, 
        county_fips, 
        total_beneficiaries, 
        total_providers, 
        avg_users_per_provider, 
        avg_pct_users_out_ffs, 
        avg_total_payment
    )
    SELECT 
        state, 
        county, 
        state_fips, 
        county_fips, 
        num_fee_service_benef, 
        num_providers, 
        avg_users_per_provider, 
        pct_users_out_ffs_benef, 
        total_payment
    FROM `healthcare_market_saturation_fraud`;
    """,
    #providers table 
    """
    INSERT INTO providers (
        state, 
        county, 
        state_fips, 
        county_fips, 
        total_beneficiaries, 
        total_providers, 
        avg_users_per_provider, 
        avg_pct_users_out_ffs, 
        avg_total_payment
    )
    SELECT 
        state, 
        county, 
        state_fips, 
        county_fips, 
        num_fee_service_benef, 
        num_providers, 
        avg_users_per_provider, 
        pct_users_out_ffs_benef, 
        total_payment
    FROM `healthcare_market_saturation_fraud`;
    """
]

# execute commands inside a transaction that commits on success and rolls back on error
try:
    with engine.begin() as connection:
        for command in commands:
            connection.execute(text(command))
    print("Data insertion successful")
except Exception as e:
    print(f"An error occurred: {e}")

Data insertion successful
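
Each `INSERT ... SELECT` copies every source row, so a derived table should end up with the same number of rows as the source, and re-running the cell would double that number. The sketch below shows one possible sanity check for this, using `sqlite3` as a stand-in for MySQL. The helper names and the `_demo` tables are hypothetical.

```python
import sqlite3


def row_counts(conn, tables):
    """Return {table_name: row_count} for trusted table names."""
    return {t: conn.execute(f"SELECT COUNT(*) FROM {t}").fetchone()[0] for t in tables}


def mismatched_tables(conn, source, targets):
    """List the target tables whose row count differs from the source table."""
    counts = row_counts(conn, [source, *targets])
    return [t for t in targets if counts[t] != counts[source]]


check_conn = sqlite3.connect(":memory:")
check_conn.executescript(
    """
    CREATE TABLE source_demo (state TEXT, total_payment REAL);
    CREATE TABLE beneficiaries_demo (state TEXT, total_payment REAL);
    CREATE TABLE changes_demo (state TEXT, total_payment REAL);
    INSERT INTO source_demo VALUES ('AL', 100.0), ('AK', 250.5);

    -- copied once
    INSERT INTO beneficiaries_demo SELECT state, total_payment FROM source_demo;

    -- copied twice, simulating a cell that was run two times
    INSERT INTO changes_demo SELECT state, total_payment FROM source_demo;
    INSERT INTO changes_demo SELECT state, total_payment FROM source_demo;
    """
)

problems = mismatched_tables(check_conn, "source_demo", ["beneficiaries_demo", "changes_demo"])
print(problems)
```

An empty list means every derived table matches the source row count.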


### Load Metadata and Reflect Tables

The metadata is reflected from the database to confirm that the tables were created, and the first rows of each table are displayed.

In [8]:
# Load the metadata and reflect the tables
metadata.reflect(bind=engine)

# Create a session
Session = sessionmaker(bind=engine)
session = Session()

# Function to display the first rows of a table
def display_table(table, n=5):
    query = table.select().limit(n)  # fetch only the rows that are displayed
    result = session.execute(query)
    df = pd.DataFrame(result.fetchall(), columns=result.keys())
    print(f"Table: {table.name}")
    print(df)
    print("\n")

# Display data from all tables
for table in [providers, beneficiaries, provider_utilization, changes, dual_eligible]:
    display_table(table)

Table: providers
   provider_id state   county  state_fips  county_fips total_beneficiaries  \
0            1    --  --ALL--           0            0            36122263   
1            2    AL  --ALL--           1            0              547486   
2            3    AK  --ALL--           2            0               91480   
3            4    AZ  --ALL--           4            0              740278   
4            5    AR  --ALL--           5            0              437616   

  total_providers avg_users_per_provider avg_pct_users_out_ffs  \
0            8814                 495.69                 12.09   
1             146                 501.47                 13.37   
2              33                 254.97                  9.20   
3             170                 401.34                  9.22   
4              86                 628.07                 12.34   

  avg_total_payment  
0     4037494106.32  
1       74641552.63  
2        6904088.39  
3       49176136.26  
4      

In [9]:
# Query to list all tables in the database
cursor.execute("SHOW TABLES;")

# Fetch and print all table names
tables = cursor.fetchall()
print("Tables in the database:")
for table in tables:
    print(table[0])

Tables in the database:
beneficiaries
beneficiary_provider_changes_over_time
changes
dual_eligible
dual_eligible_users_summary
healthcare_market_saturation_fraud
provider_utilization
provider_utilization_summary
providers
total_payments_by_reference_period_county


## View Manipulation

In this section, several SQL views are defined and created to aggregate and summarize data. Views help encapsulate complex queries for easier analysis.

The following SQL views are created:

- **total_payments_by_reference_period_county**: Aggregates total payments by reference period and county.
- **provider_utilization_summary**: Summarizes provider utilization metrics.
- **beneficiary_provider_changes_over_time**: Shows changes in beneficiary and provider metrics over time.
- **dual_eligible_users_summary**: Summarizes dual-eligible user data by reference period.

### Define the Views
The views are defined using SQL `CREATE VIEW` statements. Each view aggregates data from existing tables to facilitate easier analysis.

In [10]:
# define the views
views = [
    # total_payments_by_reference_period_county view 
    {
        'name': 'total_payments_by_reference_period_county',
        'create': """
            CREATE VIEW total_payments_by_reference_period_county AS
            SELECT 
                reference_period,
                county,
                SUM(total_payment) AS total_payments
            FROM 
                beneficiaries
            GROUP BY 
                reference_period, county;
        """
    },
    #provider_utilization_summary
    {
        'name': 'provider_utilization_summary',
        'create': """
            CREATE VIEW provider_utilization_summary AS
            SELECT 
                state,
                county,
                AVG(avg_users_per_provider) AS avg_users_per_provider,
                AVG(avg_pct_users_out_ffs) AS avg_pct_users_out_ffs,
                AVG(avg_total_payment) AS avg_total_payment
            FROM 
                provider_utilization
            GROUP BY 
                state, county;
        """
    },
    #beneficiary_provider_changes_over_time
    {
        'name': 'beneficiary_provider_changes_over_time',
        'create': """
            CREATE VIEW beneficiary_provider_changes_over_time AS
            SELECT 
                reference_period,
                AVG(num_fee_service_benef_change) AS avg_fee_service_benef_change,
                AVG(num_providers_change) AS avg_providers_change,
                AVG(avg_users_per_provider_change) AS avg_users_per_provider_change,
                AVG(pct_users_out_ffs_benef_change) AS avg_pct_users_out_ffs_benef_change,
                AVG(num_users_change) AS avg_users_change,
                AVG(avg_providers_per_county_change) AS avg_providers_per_county_change,
                AVG(num_dual_eligible_users_change) AS avg_dual_eligible_users_change,
                AVG(pct_dual_eligible_users_out_total_users_change) AS avg_pct_dual_eligible_users_out_total_users_change,
                AVG(pct_dual_eligible_ffs_change) AS avg_pct_dual_eligible_ffs_change,
                AVG(total_payment_change) AS avg_total_payment_change
            FROM 
                changes
            GROUP BY 
                reference_period;
        """
    },
    #dual_eligible_users_summary
    {
        'name': 'dual_eligible_users_summary',
        'create': """
            CREATE VIEW dual_eligible_users_summary AS
            SELECT 
                reference_period,
                AVG(num_dual_eligible_users) AS avg_dual_eligible_users,
                AVG(pct_dual_eligible_users_out_of_total_users) AS avg_pct_dual_eligible_users_out_of_total_users
            FROM 
                dual_eligible
            GROUP BY 
                reference_period;
        """
    }
]

### Execute View Creation Commands

Each `CREATE VIEW` statement is executed over the SQLAlchemy connection. Before a view is created, `information_schema.VIEWS` is queried to check whether a view with that name already exists in the current database, so an existing view is never redefined.

In [11]:
# execute view creation commands
try:
    with engine.connect() as connection:
        for view in views:
            # check if the view exists
            result = connection.execute(text(f"""
                SELECT COUNT(*) 
                FROM information_schema.VIEWS 
                WHERE TABLE_NAME = '{view['name']}'
                  AND TABLE_SCHEMA = DATABASE();
            """)).fetchone()
            
            # if view does not exist, create it
            if result[0] == 0:
                connection.execute(text(view['create']))
                print(f"View {view['name']} created successfully!")
            else:
                print(f"View {view['name']} already exists.")
except Exception as e:
    print(f"An error occurred: {e}")
    
# close connection 
finally:
    cursor.close()
    conn.close()

View total_payments_by_reference_period_county already exists.
View provider_utilization_summary already exists.
View beneficiary_provider_changes_over_time already exists.
View dual_eligible_users_summary already exists.
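
The check-then-create logic can be exercised without a MySQL server. The sketch below uses `sqlite3` as a stand-in, so `sqlite_master` plays the role of `information_schema.VIEWS`, and the view name is passed as a bound parameter instead of being interpolated into the query. The helper names, the `_demo` objects, and the three rows are hypothetical.

```python
import sqlite3

view_conn = sqlite3.connect(":memory:")
view_conn.executescript(
    """
    CREATE TABLE beneficiaries_demo (reference_period TEXT, county TEXT, total_payment REAL);
    INSERT INTO beneficiaries_demo VALUES
        ('2020', 'AUTAUGA', 100.0),
        ('2020', 'AUTAUGA', 50.0),
        ('2021', 'AUTAUGA', 70.0);
    """
)


def view_exists(conn, name):
    """Return True when a view called `name` is present in the SQLite catalog."""
    row = conn.execute(
        "SELECT COUNT(*) FROM sqlite_master WHERE type = 'view' AND name = ?",
        (name,),
    ).fetchone()
    return row[0] > 0


def create_view_once(conn, name, create_sql):
    """Create the view only if it is missing; return True when it was created."""
    if view_exists(conn, name):
        return False
    conn.execute(create_sql)
    return True


payments_view_sql = """
    CREATE VIEW total_payments_demo AS
    SELECT reference_period, county, SUM(total_payment) AS total_payments
    FROM beneficiaries_demo
    GROUP BY reference_period, county
"""

created_first = create_view_once(view_conn, "total_payments_demo", payments_view_sql)
created_second = create_view_once(view_conn, "total_payments_demo", payments_view_sql)
view_rows = view_conn.execute(
    "SELECT * FROM total_payments_demo ORDER BY reference_period"
).fetchall()
print(created_first, created_second, view_rows)
```

In MySQL, `CREATE OR REPLACE VIEW` is an alternative when the definition should be refreshed rather than skipped.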


## Potential FWA Prediction (Bagging Classifier Results) 

In this section, the results from the Bagging Classifier model for predicting potential Fraud, Waste, and Abuse (FWA) are analyzed. The aim is to evaluate the model's performance and understand its implications for FWA detection.

The Bagging Classifier model was evaluated for its effectiveness in predicting potential FWA cases. Performance metrics such as accuracy, precision, recall, and F1-score were assessed to understand how well the model can identify suspicious patterns.

The results from the model can inform better resource allocation and detection strategies in healthcare, aiding in the prevention of FWA.

### Table Manipulation

The following table was created:

- **prediction_results**: Stores the Bagging Classifier output. Each column is given an explicit data type; for example, `type_of_service` is a string with a maximum length of 255 characters, while `total_payment_dual_color` is a float.

#### Import data 

In [12]:
# import data from csv file; the first (unnamed) CSV column is the saved row index
bagging_results = pd.read_csv('data/final_bagging_results.csv', index_col=0)

Unnamed: 0,type_of_service,state,county,number_of_fee_for_service_beneficiaries_dual_color_reconstructed,number_of_providers_dual_color_reconstructed,average_number_of_users_per_provider_dual_color_reconstructed,percentage_of_users_out_of_ffs_beneficiaries_dual_color_reconstructed,number_of_users_dual_color_reconstructed,average_number_of_providers_per_county_dual_color_reconstructed,number_of_dual_eligible_users_dual_color_reconstructed,percentage_of_dual_eligible_users_out_of_total_users_dual_color_reconstructed,percentage_of_dual_eligible_users_out_of_dual_eligible_ffs_beneficiaries_dual_color_reconstructed,total_payment_dual_color_reconstructed,predictions,actual,probabilities
0,Skilled Nursing Facility,ID,SHOSHONE,2,3,1,1,2,3,2,3,1,2,0,0,0.0
1,Preventive Health Services,NJ,ATLANTIC,5,5,4,4,5,5,5,3,3,5,0,1,0.1875
2,Federally Qualified Health Center (FQHC),NC,--ALL--,4,4,3,3,4,2,4,2,2,4,1,1,1.0
3,Podiatry Services,MS,LEAKE,2,1,2,1,1,1,1,3,1,1,0,0,0.0
4,Hospice,TX,BRAZOS,4,4,3,3,4,4,4,2,3,4,0,1,0.025
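
The metrics named earlier (accuracy, precision, recall, and F1-score) can be derived directly from the `predictions` and `actual` columns. The sketch below is a plain-Python illustration that uses only the five preview rows shown above. The function name `binary_metrics` is hypothetical, and the numbers it produces describe this five-row sample, not the model's overall performance.

```python
def binary_metrics(actual, predicted):
    """Compute accuracy, precision, recall and F1 for 0/1 labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# `actual` and `predictions` values of the five preview rows
preview_actual = [0, 1, 1, 0, 1]
preview_predicted = [0, 0, 1, 0, 0]

preview_metrics = binary_metrics(preview_actual, preview_predicted)
print(preview_metrics)
```

On this sample the model makes no false-positive calls but misses two of the three actual positives, which is why precision is high while recall is low.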


#### Adjust Column Names

MySQL limits column names to 64 characters, so the column names are checked and shortened where necessary.

In [21]:
# check if all column names are under 64 characters
for col in bagging_results.columns:
    if len(col) > 64:
        print(f"Column name '{col}' is too long: {len(col)} characters")
    else:
        print(f"Column name '{col}' is {len(col)} characters")

# check if any column name exceeds 64 characters
exceeds_64 = any(len(col) > 64 for col in bagging_results.columns)
if exceeds_64:
    print("Some column names exceed 64 characters.")
else:
    print("All column names are within 64 characters.")

Column name 'type_of_service' is 15 characters
Column name 'state' is 5 characters
Column name 'county' is 6 characters
Column name 'number_of_fee_for_service_beneficiaries_dual_color_reconstructed' is 64 characters
Column name 'number_of_providers_dual_color_reconstructed' is 44 characters
Column name 'average_number_of_users_per_provider_dual_color_reconstructed' is 61 characters
Column name 'percentage_of_users_out_of_ffs_beneficiaries_dual_color_reconstructed' is too long: 69 characters
Column name 'number_of_users_dual_color_reconstructed' is 40 characters
Column name 'average_number_of_providers_per_county_dual_color_reconstructed' is 63 characters
Column name 'number_of_dual_eligible_users_dual_color_reconstructed' is 54 characters
Column name 'percentage_of_dual_eligible_users_out_of_total_users_dual_color_reconstructed' is too long: 77 characters
Column name 'percentage_of_dual_eligible_users_out_of_dual_eligible_ffs_beneficiaries_dual_color_reconstructed' is too long: 97 char

In [30]:
# adjust identified columns to be under 64 characters to meet the SQL requirement
# create a dictionary to rename columns and remove 'reconstructed'
# (keys must match the column names in the CSV exactly)
rename_dict = {
    'number_of_fee_for_service_beneficiaries_dual_color_reconstructed': 'num_fee_for_service_benef_dual_color',
    'number_of_providers_dual_color_reconstructed': 'num_providers_dual_color',
    'average_number_of_users_per_provider_dual_color_reconstructed': 'avg_users_per_provider_dual_color',
    'percentage_of_users_out_of_ffs_beneficiaries_dual_color_reconstructed': 'pct_users_out_ffs_benef',
    'number_of_users_dual_color_reconstructed': 'num_users_dual_color',
    'average_number_of_providers_per_county_dual_color_reconstructed': 'avg_providers_per_county_dual_color',
    'number_of_dual_eligible_users_dual_color_reconstructed': 'num_dual_eligible_users_dual_color',
    'percentage_of_dual_eligible_users_out_of_total_users_dual_color_reconstructed': 'pct_dual_eligible_total_users_dual_color',
    'percentage_of_dual_eligible_users_out_of_dual_eligible_ffs_beneficiaries_dual_color_reconstructed': 'pct_dual_eligible_ffs_dual_color',
    'total_payment_dual_color_reconstructed': 'total_payment_dual_color',
}

# rename columns in the df
bagging_results.rename(columns=rename_dict, inplace=True)

# check the new column names and their lengths
for col in bagging_results.columns:
    print(f"Column name '{col}' is {len(col)} characters")

Column name 'type_of_service' is 15 characters
Column name 'state' is 5 characters
Column name 'county' is 6 characters
Column name 'num_fee_for_service_benef_dual_color' is 36 characters
Column name 'num_providers_dual_color' is 24 characters
Column name 'avg_users_per_provider_dual_color' is 33 characters
Column name 'pct_users_out_ffs_benef' is 23 characters
Column name 'num_users_dual_color' is 20 characters
Column name 'avg_providers_per_county_dual_color' is 35 characters
Column name 'num_dual_eligible_users_dual_color' is 34 characters
Column name 'pct_dual_eligible_total_users_dual_color' is 40 characters
Column name 'pct_dual_eligible_ffs_dual_color' is 32 characters
Column name 'total_payment_dual_color' is 24 characters
Column name 'predictions' is 11 characters
Column name 'actual' is 6 characters
Column name 'probabilities' is 13 characters
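
Because `DataFrame.rename` silently ignores keys that do not match any column, it can help to validate a rename mapping before applying it. The sketch below is one possible check in plain Python. The function name `check_rename`, the three-column sample, and the deliberately wrong key `pct_typo_key` are hypothetical.

```python
MYSQL_MAX_IDENTIFIER_LENGTH = 64


def check_rename(columns, rename_map, max_len=MYSQL_MAX_IDENTIFIER_LENGTH):
    """Report mapping keys that match no column, plus names that stay too long or collide."""
    unmatched = [old for old in rename_map if old not in columns]
    renamed = [rename_map.get(col, col) for col in columns]
    too_long = [col for col in renamed if len(col) > max_len]
    duplicates = sorted({col for col in renamed if renamed.count(col) > 1})
    return {"unmatched": unmatched, "too_long": too_long, "duplicates": duplicates}


# small sample of column names taken from the output above
sample_columns = [
    "state",
    "percentage_of_users_out_of_ffs_beneficiaries_dual_color_reconstructed",
    "total_payment_dual_color_reconstructed",
]
sample_map = {
    "percentage_of_users_out_of_ffs_beneficiaries_dual_color_reconstructed": "pct_users_out_ffs_benef",
    "pct_typo_key": "never_applied",  # hypothetical key that matches no column
}

report = check_rename(sample_columns, sample_map)
print(report)
```

A non-empty `unmatched` list signals that part of the mapping would have no effect.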


#### Define Table

In [28]:
# define data types of each column 
dtype_mapping = {
    'type_of_service': String(255),
    'state': String(15),
    'county': String(60),
    'num_fee_for_service_benef_dual_color': Integer,
    'num_providers_dual_color': Integer,
    'avg_users_per_provider_dual_color': Float,
    'pct_users_out_ffs_benef': Float,
    'num_users_dual_color': Integer,
    'avg_providers_per_county_dual_color': Float,
    'num_dual_eligible_users_dual_color': Integer,
    'pct_dual_eligible_total_users_dual_color': Float,
    'pct_dual_eligible_ffs_dual_color': Float,
    'total_payment_dual_color': Float,
    'predictions': Integer,
    'actual': Integer,
    'probabilities': Float
}

#### Insert Data into Table

In [29]:
# add df to db as a new table 
# define table name 
table_name = 'prediction_results' 

# write df to the database
try:
    bagging_results.to_sql(table_name, engine, if_exists='replace', index=False, dtype=dtype_mapping)
    print("Table created and data inserted successfully.")
except Exception as e:
    print(f"Error occurred: {e}")

Table created and data inserted successfully.


### Views Manipulation

The following Views were created: 
- **summary_statistics**: aggregates key metrics such as the average number of fee-for-service beneficiaries, providers, and users per provider by state and county. This view provides a summary of provider and beneficiary statistics.
- **predictions_actuals_comparison**: displays the predictions, actual values, and probabilities for each record in the prediction_results table, facilitating comparison between predicted and actual values.
- **top_providers**: lists the top 10 counties with the highest number of providers, sorted in descending order. This view helps identify areas with the highest provider density.
- **state_wise_payment_analysis**: aggregates total payments by state. This view provides insights into payment distribution across different states.
- **suspicious_patterns**: identifies patterns that may indicate potential fraud, including counties with a high percentage of users out of fee-for-service and substantial total payments. This view highlights areas for further investigation.

#### Define Views

In [34]:
# define views 

views = [
    # Summary Statistics View
    {
        'name': 'summary_statistics',
        'create': """
            CREATE VIEW summary_statistics AS
            SELECT
                state,
                county,
                AVG(num_fee_for_service_benef_dual_color) AS avg_num_fee_for_service_benef_dual_color,
                AVG(num_providers_dual_color) AS avg_num_providers_dual_color,
                AVG(avg_users_per_provider_dual_color) AS avg_avg_users_per_provider_dual_color,
                AVG(pct_users_out_ffs_benef) AS avg_pct_users_out_ffs_benef,
                AVG(num_users_dual_color) AS avg_num_users_dual_color,
                AVG(avg_providers_per_county_dual_color) AS avg_avg_providers_per_county_dual_color,
                AVG(num_dual_eligible_users_dual_color) AS avg_num_dual_eligible_users_dual_color,
                AVG(pct_dual_eligible_total_users_dual_color) AS avg_pct_dual_eligible_total_users_dual_color,
                AVG(pct_dual_eligible_ffs_dual_color) AS avg_pct_dual_eligible_ffs_dual_color,
                AVG(total_payment_dual_color) AS avg_total_payment_dual_color
            FROM
                prediction_results
            GROUP BY
                state, county;
        """
    },
    # Predictions and Actuals Comparison
    {
        'name': 'predictions_actuals_comparison',
        'create': """
            CREATE VIEW predictions_actuals_comparison AS
            SELECT
                predictions,
                actual,
                probabilities
            FROM
                prediction_results;
        """
    },
    # Top Providers
    {
        'name': 'top_providers',
        'create': """
            CREATE VIEW top_providers AS
            SELECT
                state,
                county,
                num_providers_dual_color
            FROM
                prediction_results
            ORDER BY
                num_providers_dual_color DESC
            LIMIT 10;
        """
    },
    # State-wise Payment Analysis
    {
        'name': 'state_wise_payment_analysis',
        'create': """
            CREATE VIEW state_wise_payment_analysis AS
            SELECT
                state,
                SUM(total_payment_dual_color) AS total_payment
            FROM
                prediction_results
            GROUP BY
                state;
        """
    },
    # Suspicious Patterns for Fraud Detection
    {
        'name': 'suspicious_patterns',
        'create': """
            CREATE VIEW suspicious_patterns AS
            SELECT
                state,
                county,
                pct_users_out_ffs_benef,
                total_payment_dual_color
            FROM
                prediction_results
            WHERE
                pct_users_out_ffs_benef > 0.5
                AND total_payment_dual_color > 100000;
        """
    }
]

#### Create Views

In [36]:
# execute view creation commands
try:
    with engine.connect() as connection:
        for view in views:
            # Check if the view exists
            result = connection.execute(text(f"""
                SELECT COUNT(*) 
                FROM information_schema.VIEWS 
                WHERE TABLE_NAME = '{view['name']}'
                  AND TABLE_SCHEMA = DATABASE();
            """)).fetchone()
            
            # If view does not exist, create it
            if result[0] == 0:
                connection.execute(text(view['create']))
                print(f"View {view['name']} created successfully!")
            else:
                print(f"View {view['name']} already exists.")
except Exception as e:
    print(f"An error occurred: {e}")

# release pooled connections; the `with` block has already closed `connection`
finally:
    engine.dispose()

View summary_statistics already exists.
View predictions_actuals_comparison already exists.
View top_providers already exists.
View state_wise_payment_analysis already exists.
View suspicious_patterns already exists.
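
Once a view such as `predictions_actuals_comparison` exists, it can be queried like a table. The sketch below shows one way to count each combination of actual and predicted label in SQL, using `sqlite3` as a stand-in for MySQL. The `_demo` table and view are hypothetical and hold only the five preview rows from the CSV shown earlier.

```python
import sqlite3

cmp_conn = sqlite3.connect(":memory:")
cmp_conn.executescript(
    """
    CREATE TABLE prediction_results_demo (
        predictions INTEGER,
        actual INTEGER,
        probabilities REAL
    );
    INSERT INTO prediction_results_demo VALUES
        (0, 0, 0.0),
        (0, 1, 0.1875),
        (1, 1, 1.0),
        (0, 0, 0.0),
        (0, 1, 0.025);

    CREATE VIEW predictions_actuals_demo AS
    SELECT predictions, actual, probabilities
    FROM prediction_results_demo;
    """
)

# one row per (actual, predicted) combination that occurs in the data
confusion_counts = cmp_conn.execute(
    """
    SELECT actual, predictions, COUNT(*) AS n
    FROM predictions_actuals_demo
    GROUP BY actual, predictions
    ORDER BY actual, predictions
    """
).fetchall()
print(confusion_counts)
```

Each tuple is `(actual, predictions, count)`, so the rows correspond to the occupied cells of a confusion matrix.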


## Accessing Visualizations

To access visualizations created in Tableau, navigate to the `tableau-visualizations` folder.