# Weaving the data threads of Maji Ndogo's narrative

### Introduction

In the third part of the integrated project, we extract data from various tables and perform statistical analyses to assess the impact of an audit report that cross-references a random sample of records.

# Connecting to our MySQL database

In [1]:
# Load the SQL Extension

%load_ext sql

In [2]:
# Establish a connection to the local database 

%sql mysql+pymysql://root:1234567@localhost:3306/md_water_services

'Connected: root@md_water_services'

# Maji Ndogo Water Services ERD

An audit has been conducted and we aim to incorporate its findings into the database. For us to integrate it successfully, we need to examine our current ERD of the database to fully understand the relationships between the tables.

![ERD](exploration/assets/md_water_services_erd.png)


# ERD Investigative Analysis

Based on the ERD above, we observe that the `visits` table serves as a central connector between other tables.

- `location_id` is the **PRIMARY KEY** in the `location` table and a **FOREIGN KEY** in the `visits` table.
- `source_id` is the **PRIMARY KEY** in the `water_source` table and a **FOREIGN KEY** in the `visits` table.
- `assigned_employee_id` is the **PRIMARY KEY** in the `employee` table and a **FOREIGN KEY** in the `visits` table.

In a nutshell, the `visits` table logs each occasion on which a specific employee visits a location and inspects a particular water source. The `location`, `employee`, and `water_source` tables therefore each have a one-to-many relationship with the `visits` table.

However, according to the ERD, the `visits` table has a one-to-many relationship with the `water_quality` table. This contradicts our understanding, as each visit should correspond to a single record in the `water_quality` table. The discrepancy points to an error in how the relationship between these two tables is represented, which the updated ERD below corrects.

![Updated ERD](exploration/assets/updated_md_water_services_erd.png)
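One way to test whether the relationship really is one-to-one is to look for any `record_id` that appears more than once in `water_quality`. The sketch below is only an illustration: it uses Python's built-in `sqlite3` with an in-memory database and invented rows instead of the real MySQL tables, and the two toy tables carry only the columns needed for the check.

```python
import sqlite3

# Toy stand-in for the real tables: SQLite in memory, invented rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE visits (record_id INTEGER PRIMARY KEY, location_id TEXT);
    CREATE TABLE water_quality (record_id INTEGER, subjective_quality_score INTEGER);
    INSERT INTO visits VALUES (1, 'LocA'), (2, 'LocB'), (3, 'LocB');
    INSERT INTO water_quality VALUES (1, 7), (2, 10), (3, 4);
""")

# Any record_id listed more than once in water_quality would mean the
# relationship is one-to-many rather than one-to-one.
repeated_record_ids = conn.execute("""
    SELECT record_id, COUNT(*) AS n
    FROM water_quality
    GROUP BY record_id
    HAVING COUNT(*) > 1
""").fetchall()

print(repeated_record_ids)  # an empty list means at most one quality row per visit
```

The same `GROUP BY ... HAVING COUNT(*) > 1` query should run unchanged against the MySQL database.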

# Importing the Auditor's Report

With a clear representation of the relationships in our database, we can now import the data from the auditor's report, which is in `.csv` format. To do this, we follow the steps below:

1. Create an empty `auditor_report` table in the `md_water_services` database. To do this we run the following in MySQL Workbench:

   <span style="color: green;">DROP TABLE IF EXISTS</span> `auditor_report`; <br><br>
    <span style="color: green;">CREATE TABLE</span> `auditor_report` ( <br>
    &nbsp;&nbsp;&nbsp;&nbsp;`location_id` VARCHAR(32), <br>
    &nbsp;&nbsp;&nbsp;&nbsp;`type_of_water_source` VARCHAR(64), <br>
    &nbsp;&nbsp;&nbsp;&nbsp;`true_water_source_score` INT <span style="color: green;">DEFAULT NULL</span>, <br>
    &nbsp;&nbsp;&nbsp;&nbsp;`statements` VARCHAR(255) <br>
    );


2. Import the `.csv` file sent by the auditor using **MySQL Workbench**. When prompted for a destination, select the existing `auditor_report` table rather than creating a new one, since we already created the empty table in the first step.
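For readers who prefer a scripted import, the two steps can be sketched in Python. This is a hedged illustration rather than the method used in this project: it loads into an in-memory SQLite database instead of MySQL, and the two CSV rows are invented placeholders that only mimic the four-column layout of the auditor's file.

```python
import csv
import io
import sqlite3

# Invented sample in the same four-column layout as the auditor's file.
sample_csv = io.StringIO(
    "location_id,type_of_water_source,true_water_source_score,statements\n"
    "LocA,well,3,\"Example statement, with a comma.\"\n"
    "LocB,shared_tap,,Example statement without a score.\n"
)

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE auditor_report (
        location_id VARCHAR(32),
        type_of_water_source VARCHAR(64),
        true_water_source_score INT DEFAULT NULL,
        statements VARCHAR(255)
    )
""")

# Convert the score to an integer, storing NULL when the field is empty.
rows = [
    (
        r["location_id"],
        r["type_of_water_source"],
        int(r["true_water_source_score"]) if r["true_water_source_score"] else None,
        r["statements"],
    )
    for r in csv.DictReader(sample_csv)
]
conn.executemany("INSERT INTO auditor_report VALUES (?, ?, ?, ?)", rows)

imported_count = conn.execute("SELECT COUNT(*) FROM auditor_report").fetchone()[0]
print(imported_count)  # 2
```

Comparing the imported row count with the number of data rows in the file is a quick sanity check after any import.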

# Auditor's Report Integration

Once the auditor's report is imported into the SQL database, we can observe that it contains **1620** records corresponding to all the revisited sites assessed by the auditor. The report includes the following attributes:

- `location_id`: Identifies the revisited locations.
- `type_of_water_source`: Specifies the water sources visited by the auditor.
- `true_water_source_score`: A score assigned by the auditor that indicates the water quality.
- `statements`: Captured comments from the auditor gathered through discussions with locals at each site.

Based on the auditor's report, we can conduct a comparative analysis with the surveyors' records.

# Questions to Consider

Now that we have all our data consolidated, let's explore the following questions: 


1. Is there a difference in scores between the auditor's report and the data provided by the surveyors?
2. If a difference exists, is there a pattern we can identify?

# Investigating Differences in Auditor & Surveyor Scores

To determine whether there are discrepancies between the water quality scores from auditors and surveyors, we will need to create some joins. Here's the breakdown:

- The `auditor_report` table in our database includes a `location_id` attribute, while the `water_quality` table contains a `record_id` attribute, which corresponds to a similar attribute in the `visits` table. 
- The `visits` table has both `location_id` and `record_id` attributes, making it the ideal table for joining the `auditor_report` and `water_quality` tables.

To construct the appropriate SQL query for retrieving the necessary information from the various tables, let's start with the following:

- Grab the `location_id` and `true_water_source_score` columns from `auditor_report`.
- **JOIN** the `visits` table with the `auditor_report` table using the shared `location_id` column to access the `record_id`.
- Now that we have the `record_id` for each location, our next step is to retrieve the corresponding scores from the `water_quality` table. We are particularly interested in the `subjective_quality_score`. To do this, we'll **JOIN** the `visits` table and the `water_quality` table, using the `record_id` as the connecting key.
- Clean up the resulting table by removing redundant columns and renaming the score columns to `auditor_score` and `surveyor_score`.
- Use a `WHERE` clause to compare `surveyor_score` with `auditor_score`, starting with the records where the two scores agree.

In [3]:
%%sql
# Records where the auditor's score matches the surveyor's score
SELECT
    auditor_report.location_id,
    visits.record_id,
    auditor_report.true_water_source_score AS auditor_score,
    water_quality.subjective_quality_score AS surveyor_score
FROM
    auditor_report
JOIN
    visits
    ON visits.location_id = auditor_report.location_id
JOIN
    water_quality
    ON visits.record_id = water_quality.record_id
WHERE auditor_report.true_water_source_score = water_quality.subjective_quality_score
LIMIT 10;

 * mysql+pymysql://root:***@localhost:3306/md_water_services
10 rows affected.


location_id,record_id,auditor_score,surveyor_score
SoRu34980,5185,1,1
AkRu08112,59367,3,3
AkLu02044,37379,0,0
AkHa00421,51627,3,3
SoRu35221,28758,0,0
HaAm16170,31048,1,1
AkRu04812,1513,3,3
AkRu08304,1218,3,3
AkRu05107,8322,2,2
HaDe16541,13070,2,2


Without the `LIMIT`, this query returns **2505** records, yet the auditor's report contains only **1620** entries. The discrepancy arises because the `visits` table records some locations several times, once per visit, so each repeat visit adds another joined row for the same audited location. To remove these duplicates, we add a condition to the `WHERE` clause that keeps only the first visit to each location (`visit_count = 1`).
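The row inflation can be reproduced on a small scale. The sketch below is a toy illustration using Python's built-in `sqlite3` with invented rows, not the real data: one audited location has three visits, so the join returns more rows than the audit table holds until we filter on `visit_count = 1`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE auditor_report (location_id TEXT, true_water_source_score INTEGER);
    CREATE TABLE visits (record_id INTEGER, location_id TEXT, visit_count INTEGER);
    INSERT INTO auditor_report VALUES ('LocA', 3), ('LocB', 5);
    -- LocB was visited three times, so it has three visit rows.
    INSERT INTO visits VALUES (1, 'LocA', 1), (2, 'LocB', 1), (3, 'LocB', 2), (4, 'LocB', 3);
""")

join_sql = """
    SELECT COUNT(*)
    FROM auditor_report
    JOIN visits ON visits.location_id = auditor_report.location_id
"""
all_visits = conn.execute(join_sql).fetchone()[0]
first_visits = conn.execute(join_sql + " WHERE visits.visit_count = 1").fetchone()[0]

print(all_visits, first_visits)  # 4 2
```

Two audited locations produce four joined rows, and restricting the join to first visits brings the count back to two.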

In [4]:
%%sql
# Keep only the first visit to each location to remove duplicate rows
SELECT
    auditor_report.location_id,
    visits.record_id,
    auditor_report.true_water_source_score AS auditor_score,
    water_quality.subjective_quality_score AS surveyor_score
FROM
    auditor_report
JOIN
    visits
    ON visits.location_id = auditor_report.location_id
JOIN
    water_quality
    ON visits.record_id = water_quality.record_id
WHERE 
    visits.visit_count = 1 
    AND auditor_report.true_water_source_score = water_quality.subjective_quality_score
LIMIT 10;

 * mysql+pymysql://root:***@localhost:3306/md_water_services
10 rows affected.


location_id,record_id,auditor_score,surveyor_score
SoRu34980,5185,1,1
AkRu08112,59367,3,3
AkLu02044,37379,0,0
AkHa00421,51627,3,3
SoRu35221,28758,0,0
HaAm16170,31048,1,1
AkRu04812,1513,3,3
AkRu08304,1218,3,3
AkRu05107,8322,2,2
HaDe16541,13070,2,2


With the duplicates removed, we get **1518** surveyor records that match the auditor's. That is an excellent result: 1518 out of 1620 is roughly a 94% match rate for the records the auditor reviewed. But what about the remaining ~6%? Let's take a closer look at those.
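The percentages follow directly from the two counts reported so far. A short check in Python, using only those two numbers:

```python
audited_records = 1620   # rows in the auditor's report
matching_records = 1518  # first-visit records where both scores agree

mismatched_records = audited_records - matching_records
match_rate = matching_records / audited_records

print(mismatched_records)   # 102
print(f"{match_rate:.1%}")  # 93.7%
```

So the 94% figure is 93.7% rounded up, and the remaining records number 102.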

In [5]:
%%sql
# Let's check incorrect records
SELECT
    auditor_report.location_id,
    visits.record_id,
    auditor_report.true_water_source_score AS auditor_score,
    water_quality.subjective_quality_score AS surveyor_score
FROM
    auditor_report
JOIN
    visits
    ON visits.location_id = auditor_report.location_id
JOIN
    water_quality
    ON visits.record_id = water_quality.record_id
WHERE 
    visits.visit_count = 1 
    AND auditor_report.true_water_source_score != water_quality.subjective_quality_score
LIMIT 10;

 * mysql+pymysql://root:***@localhost:3306/md_water_services
10 rows affected.


location_id,record_id,auditor_score,surveyor_score
AkRu05215,21160,3,10
KiRu29290,7938,3,10
KiHa22748,43140,9,10
SoRu37841,18495,6,10
KiRu27884,33931,1,10
KiZu31170,17950,9,10
KiZu31370,36864,3,10
AkRu06495,45924,2,10
HaRu17528,30524,1,10
SoRu38331,13192,3,10


Since we used some of this data in our previous analyses, we need to make sure those results are still valid now that we know some of the scores are incorrect. We didn't use the scores much, but we relied heavily on the `type_of_water_source`, so let's check whether there are any errors there.

In [6]:
%%sql
# Let's check records that were incorrect including water sources
SELECT
    auditor_report.location_id,
    auditor_report.type_of_water_source AS auditor_source,
    water_source.type_of_water_source AS surveyor_source,
    visits.record_id,
    auditor_report.true_water_source_score AS auditor_score,
    water_quality.subjective_quality_score AS surveyor_score
FROM
    auditor_report
JOIN
    visits
    ON visits.location_id = auditor_report.location_id
JOIN
    water_quality
    ON visits.record_id = water_quality.record_id
JOIN
    water_source
    ON visits.source_id = water_source.source_id
WHERE 
    visits.visit_count = 1 
    AND auditor_report.true_water_source_score != water_quality.subjective_quality_score
LIMIT 10;

 * mysql+pymysql://root:***@localhost:3306/md_water_services
10 rows affected.


location_id,auditor_source,surveyor_source,record_id,auditor_score,surveyor_score
AkRu05215,well,well,21160,3,10
KiRu29290,shared_tap,shared_tap,7938,3,10
KiHa22748,tap_in_home_broken,tap_in_home_broken,43140,9,10
SoRu37841,shared_tap,shared_tap,18495,6,10
KiRu27884,well,well,33931,1,10
KiZu31170,tap_in_home_broken,tap_in_home_broken,17950,9,10
KiZu31370,shared_tap,shared_tap,36864,3,10
AkRu06495,well,well,45924,2,10
HaRu17528,well,well,30524,1,10
SoRu38331,shared_tap,shared_tap,13192,3,10


The resulting dataset contains a total of **102** records. Additionally, we observe that there are no discrepancies between the types of water sources in the auditor's report and the surveyor's records being compared, allowing us to confidently proceed with our investigation. As we advance, the SQL query we created will become increasingly complex, so let's explore how we can simplify our process by utilizing **Common Table Expressions (CTEs)**.

To do this, we will wrap our existing query within a **CTE** and name the expression as `Incorrect_records`. This approach helps us break down our query into simpler, more manageable steps and allows us to treat the referenced query as though it were a table.

> **NOTE: CTEs** are not actual tables; they do **NOT** store data.
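To make the note concrete, the sketch below shows a CTE being usable inside its own statement and gone in the next one. It is a toy illustration on an in-memory SQLite database with an invented `scores` table, not the project's MySQL schema, although CTE scoping works the same way in MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE scores (location_id TEXT, auditor_score INTEGER, surveyor_score INTEGER);
    INSERT INTO scores VALUES ('LocA', 3, 10), ('LocB', 5, 5);
""")

# Within a single statement the CTE behaves like a table.
inside = conn.execute("""
    WITH Incorrect_records AS (
        SELECT location_id FROM scores WHERE auditor_score != surveyor_score
    )
    SELECT location_id FROM Incorrect_records
""").fetchall()

# In the next statement the name no longer exists.
try:
    conn.execute("SELECT * FROM Incorrect_records")
    cte_persisted = True
except sqlite3.OperationalError:
    cte_persisted = False

print(inside, cte_persisted)  # [('LocA',)] False
```

This is why every cell that relies on a CTE has to repeat its full definition.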

While we're at it, let us add the `employee` table and **JOIN** it with our resulting dataset using the `assigned_employee_id` attribute, giving us a view of the surveyors who recorded inaccurate scores according to the auditor's report.

In [7]:
%%sql
# Create a CTE of employees/surveyors responsible for the incorrect scores
WITH Incorrect_records AS (
    SELECT
        auditor_report.location_id,
        visits.record_id,
        employee.employee_name,
        auditor_report.true_water_source_score AS auditor_score,
        water_quality.subjective_quality_score AS surveyor_score
    FROM
        auditor_report
    JOIN
        visits
        ON visits.location_id = auditor_report.location_id
    JOIN
        water_quality
        ON visits.record_id = water_quality.record_id
    JOIN
        employee
        ON visits.assigned_employee_id = employee.assigned_employee_id
    WHERE 
        visits.visit_count = 1 
        AND auditor_report.true_water_source_score != water_quality.subjective_quality_score
)
SELECT *
FROM Incorrect_records
LIMIT 10;

 * mysql+pymysql://root:***@localhost:3306/md_water_services
10 rows affected.


location_id,record_id,employee_name,auditor_score,surveyor_score
AkRu05215,21160,Rudo Imani,3,10
KiRu29290,7938,Bello Azibo,3,10
KiHa22748,43140,Bello Azibo,9,10
SoRu37841,18495,Rudo Imani,6,10
KiRu27884,33931,Bello Azibo,1,10
KiZu31170,17950,Zuriel Matembo,9,10
KiZu31370,36864,Yewande Ebele,3,10
AkRu06495,45924,Bello Azibo,2,10
HaRu17528,30524,Jengo Tumaini,1,10
SoRu38331,13192,Zuriel Matembo,3,10


The `Incorrect_records` **CTE** is functioning as intended. From the output above, we can see that some surveyor names are repeated across the records, indicating that certain surveyors made multiple errors. First, let’s examine the unique count of surveyors who recorded inaccurate scores.

In [10]:
%%sql
# Get the list of surveyors responsible for the incorrect data
WITH Incorrect_records AS (
    SELECT
        auditor_report.location_id,
        visits.record_id,
        employee.employee_name,
        auditor_report.true_water_source_score AS auditor_score,
        water_quality.subjective_quality_score AS surveyor_score
    FROM
        auditor_report
    JOIN
        visits
        ON visits.location_id = auditor_report.location_id
    JOIN
        water_quality
        ON visits.record_id = water_quality.record_id
    JOIN
        employee
        ON visits.assigned_employee_id = employee.assigned_employee_id
    WHERE 
        visits.visit_count = 1 
        AND auditor_report.true_water_source_score != water_quality.subjective_quality_score
)
SELECT DISTINCT employee_name
FROM Incorrect_records;


 * mysql+pymysql://root:***@localhost:3306/md_water_services
17 rows affected.


employee_name
Rudo Imani
Bello Azibo
Zuriel Matembo
Yewande Ebele
Jengo Tumaini
Farai Nia
Malachi Mavuso
Makena Thabo
Lalitha Kaburi
Gamba Shani


The query returns **17** rows (the listing above shows only the first 10), so **17** surveyors have at least one incorrect record. To further our analysis, we will aggregate the frequency of inaccurate scores recorded by each surveyor using the `COUNT` function, grouping the results by the `employee_name` attribute.

In [13]:
%%sql
# Aggregate frequency of inaccurate scores
WITH Incorrect_records AS (
    SELECT
        auditor_report.location_id,
        visits.record_id,
        employee.employee_name,
        auditor_report.true_water_source_score AS auditor_score,
        water_quality.subjective_quality_score AS surveyor_score
    FROM
        auditor_report
    JOIN
        visits
        ON visits.location_id = auditor_report.location_id
    JOIN
        water_quality
        ON visits.record_id = water_quality.record_id
    JOIN
        employee
        ON visits.assigned_employee_id = employee.assigned_employee_id
    WHERE 
        visits.visit_count = 1 
        AND auditor_report.true_water_source_score != water_quality.subjective_quality_score
)
SELECT
    employee_name,
    COUNT(employee_name) AS number_of_mistakes
FROM 
    Incorrect_records
GROUP BY
    employee_name
ORDER BY number_of_mistakes DESC;

 * mysql+pymysql://root:***@localhost:3306/md_water_services
17 rows affected.


employee_name,number_of_mistakes
Bello Azibo,26
Malachi Mavuso,21
Zuriel Matembo,17
Lalitha Kaburi,7
Rudo Imani,5
Farai Nia,4
Enitan Zuri,4
Yewande Ebele,3
Jengo Tumaini,3
Makena Thabo,3


We can observe that **Bello Azibo** recorded the highest number of mistakes at **26**, followed closely by **Malachi Mavuso** with **21**. These discrepancies could stem from either genuine errors or intentional misconduct, but we can't determine the exact cause at this point. Further investigation is necessary to uncover the root cause of the erroneous inputs from our surveyors. To streamline our work, let’s convert the `Incorrect_records` **CTE** into a SQL `VIEW` and also incorporate the `statements` attribute from the `auditor_report` table.

> NOTE: The cell below uses `CREATE OR REPLACE VIEW`, so it is safe to re-run after restarting the notebook. A plain `CREATE VIEW` would raise an error on the second run because the view would already exist in the database.

In [15]:
%%sql
# Change the Incorrect_records CTE to a SQL VIEW
CREATE OR REPLACE VIEW Incorrect_records AS (
SELECT
    auditor_report.location_id,
    visits.record_id,
    employee.employee_name,
    auditor_report.true_water_source_score AS auditor_score,
    water_quality.subjective_quality_score AS surveyor_score,
    auditor_report.statements AS statements
FROM
    auditor_report
JOIN
    visits
    ON visits.location_id = auditor_report.location_id
JOIN
    water_quality
    ON visits.record_id = water_quality.record_id
JOIN
    employee
    ON visits.assigned_employee_id = employee.assigned_employee_id
WHERE 
    visits.visit_count = 1 
    AND auditor_report.true_water_source_score != water_quality.subjective_quality_score
);

 * mysql+pymysql://root:***@localhost:3306/md_water_services
0 rows affected.


[]
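Unlike a CTE, a view is stored in the database, so a plain `CREATE VIEW` fails if the view already exists. The sketch below reproduces that behaviour on an in-memory SQLite database with an invented `scores` table. MySQL also offers `CREATE OR REPLACE VIEW`; SQLite does not, so this sketch drops the view first to make the creation repeatable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (location_id TEXT, auditor_score INTEGER, surveyor_score INTEGER)")
conn.execute("INSERT INTO scores VALUES ('LocA', 3, 10), ('LocB', 5, 5)")

create_view = """
    CREATE VIEW Incorrect_records AS
    SELECT location_id FROM scores WHERE auditor_score != surveyor_score
"""
conn.execute(create_view)

# A second plain CREATE VIEW fails because the name is already taken.
try:
    conn.execute(create_view)
    second_run_failed = False
except sqlite3.OperationalError:
    second_run_failed = True

# Dropping the view first makes the creation step repeatable.
conn.execute("DROP VIEW IF EXISTS Incorrect_records")
conn.execute(create_view)
view_rows = conn.execute("SELECT * FROM Incorrect_records").fetchall()

print(second_run_failed, view_rows)  # True [('LocA',)]
```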

After transforming the **CTE** into a **VIEW**, we can test it out to see if it works as intended.   

In [8]:
%sql SELECT * FROM Incorrect_records LIMIT 5;

 * mysql+pymysql://root:***@localhost:3306/md_water_services
5 rows affected.


location_id,record_id,employee_name,auditor_score,surveyor_score,statements
AkRu05215,21160,Rudo Imani,3,10,"Villagers admired the official's visit for its respectful interactions, hard work, and genuine concern."
KiRu29290,7938,Bello Azibo,3,10,"A young artist sketches the faces in the queue, capturing the weariness of daily hours spent waiting for water."
KiHa22748,43140,Bello Azibo,9,10,"A young girl's hopeful eyes are clouded by mistrust, her innocence tarnished by the corrupt system."
SoRu37841,18495,Rudo Imani,6,10,"The official's respectful and diligent presence was met with heartfelt appreciation, creating a sense of closeness with the villagers."
KiRu27884,33931,Bello Azibo,1,10,"A traditional healer's empathy turns to bitterness, knowing that corrupt practices harm her community."


Our **VIEW** works as intended. Next, let's create a **CTE** named `error_count` that counts the errors made by each employee. We can then apply the `AVG` function to this **CTE** to find the average number of mistakes across the **17** surveyors.

In [17]:
%%sql
# Create a CTE to count the number of errors made by surveyors and calculate the average number of mistakes.
WITH error_count AS (
    SELECT
        employee_name,
        COUNT(employee_name) AS number_of_mistakes
    FROM
        Incorrect_records
    GROUP BY
        employee_name
    ORDER BY number_of_mistakes DESC
)
SELECT
    AVG(number_of_mistakes)
FROM 
    error_count;

 * mysql+pymysql://root:***@localhost:3306/md_water_services
1 rows affected.


AVG(number_of_mistakes)
6.0


The output shows that the surveyors made **6** mistakes each on average. To identify employees with a high number of errors, we’ll filter for those who made more than the average number of mistakes. We can achieve this by using a subquery in the `WHERE` clause.

In [18]:
%%sql
# Find employees who made more mistakes than the average.
WITH error_count AS (
    SELECT
        employee_name,
        COUNT(employee_name) AS number_of_mistakes
    FROM
        Incorrect_records
    GROUP BY
        employee_name
    ORDER BY number_of_mistakes DESC
)
SELECT
    employee_name,
    number_of_mistakes
FROM
    error_count
WHERE
    number_of_mistakes > (SELECT AVG(number_of_mistakes) FROM error_count);

 * mysql+pymysql://root:***@localhost:3306/md_water_services
4 rows affected.


employee_name,number_of_mistakes
Bello Azibo,26
Malachi Mavuso,21
Zuriel Matembo,17
Lalitha Kaburi,7
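Both the average and the filtered list can be cross-checked by hand from numbers already reported: 102 incorrect records, 17 surveyors, and the ten per-surveyor counts visible in the earlier listing. The remaining seven counts are not shown in this notebook, so the check below covers only the visible ones.

```python
incorrect_records = 102      # rows returned by the incorrect-records query
surveyors_with_errors = 17   # distinct surveyors among those rows

average_mistakes = incorrect_records / surveyors_with_errors

# The ten per-surveyor counts visible in the truncated listing earlier.
shown_counts = [26, 21, 17, 7, 5, 4, 4, 3, 3, 3]
above_average = [n for n in shown_counts if n > average_mistakes]
not_shown_total = incorrect_records - sum(shown_counts)

print(average_mistakes)  # 6.0
print(above_average)     # [26, 21, 17, 7]
print(not_shown_total)   # 9
```

Because that listing is sorted in descending order, the seven surveyors it omits share the remaining 9 mistakes and have at most 3 each, so none of them can exceed the average.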


Only 4 surveyors made an above-average number of mistakes. Since we also have `statements` recorded by the auditor during visits to various sites, we can review what locals had to say about these surveyors. First, let’s create a **CTE** named `suspect_list` to capture employees with above-average mistakes, and then use it in a subquery within the `WHERE` clause, as shown in the code cell below.

In [9]:
%%sql
# Retrieve statements about the above suspect list
WITH error_count AS (
    SELECT
        employee_name,
        COUNT(employee_name) AS number_of_mistakes
    FROM
        Incorrect_records
    GROUP BY
        employee_name
),
suspect_list AS (
    SELECT
        employee_name,
        number_of_mistakes
    FROM
        error_count
    WHERE
        number_of_mistakes > (SELECT AVG(number_of_mistakes) FROM error_count)
)
SELECT 
    employee_name,
    location_id,
    statements
FROM
    Incorrect_records
WHERE
    employee_name IN (SELECT employee_name FROM suspect_list)
LIMIT 5;

 * mysql+pymysql://root:***@localhost:3306/md_water_services
5 rows affected.


employee_name,location_id,statements
Bello Azibo,KiRu29290,"A young artist sketches the faces in the queue, capturing the weariness of daily hours spent waiting for water."
Bello Azibo,KiHa22748,"A young girl's hopeful eyes are clouded by mistrust, her innocence tarnished by the corrupt system."
Bello Azibo,KiRu27884,"A traditional healer's empathy turns to bitterness, knowing that corrupt practices harm her community."
Zuriel Matembo,KiZu31170,"A community leader stood with his people, expressing concern for the water quality and the time lost in queues."","""
Bello Azibo,AkRu06495,"A healthcare worker in the queue expressed fears about water-borne diseases, her face etched with worry."","""


From our resulting dataset, we observe statements suggesting surveyor malpractice, with issues ranging from corruption to arrogance. Let's zoom in on some specific audit locations using the `location_id` in the dataset to examine the types of statements recorded by the auditor from local residents.

In [22]:
%%sql
# Retrieve statements about the suspect list at specific locations
WITH error_count AS (
    SELECT
        employee_name,
        COUNT(employee_name) AS number_of_mistakes
    FROM
        md_water_services.Incorrect_records
    GROUP BY
        employee_name
),
suspect_list AS (
    SELECT
        employee_name,
        number_of_mistakes
    FROM
        error_count
    WHERE
        number_of_mistakes > (SELECT AVG(number_of_mistakes) FROM error_count)
)
SELECT 
    employee_name,
    location_id,
    statements
FROM
    Incorrect_records
WHERE
    employee_name IN (SELECT employee_name FROM suspect_list)
    AND location_id IN ("AkRu04508", "AkRu07310", "KiRu29639", "AmAm09607");

 * mysql+pymysql://root:***@localhost:3306/md_water_services
4 rows affected.


employee_name,location_id,statements
Malachi Mavuso,AmAm09607,Villagers spoke of an unsettling encounter with an official who appeared dismissive and detached. The reference to cash transactions added to their growing sense of distrust.
Bello Azibo,AkRu04508,"An unsettling atmosphere surrounded the official, as villagers shared their experiences of arrogance and lack of dedication. The mention of cash exchanges only intensified their doubts."
Lalitha Kaburi,AkRu07310,"Villagers spoke of their unsettling encounters with an official who seemed indifferent and uninterested, hinting at potential improprieties involving cash exchanges."
Bello Azibo,KiRu29639,An unsettling atmosphere prevailed as villagers shared stories of an official's arrogance and perceived corruption. The mention of cash exchanges only intensified their concerns.


The statements allege that some surveyors in our `suspect_list` were involved in malpractice, including corruption and bribery. Let’s now check whether any surveyor outside our `suspect_list` faces similar allegations.

In [23]:
%%sql
# Identify any surveyor not listed as a suspect who has allegations of bribery.
WITH error_count AS (
    SELECT
        employee_name,
        COUNT(employee_name) AS number_of_mistakes
    FROM
        Incorrect_records
    GROUP BY
        employee_name
),
suspect_list AS (
    SELECT
        employee_name,
        number_of_mistakes
    FROM
        error_count
    WHERE
        number_of_mistakes > (SELECT AVG(number_of_mistakes) FROM error_count)
)
SELECT 
    employee_name,
    location_id,
    statements
FROM
    Incorrect_records
WHERE
    statements LIKE "%cash%"
    AND employee_name NOT IN (SELECT employee_name FROM suspect_list);

 * mysql+pymysql://root:***@localhost:3306/md_water_services
0 rows affected.


employee_name,location_id,statements


The query returns no rows: among the incorrect records, no surveyor outside the `suspect_list` has a statement that mentions cash.
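An empty result is easier to trust once we have seen what a hit would look like. The sketch below runs the same `LIKE` plus `NOT IN` pattern on invented data in an in-memory SQLite database. The surveyor names and statements are hypothetical, and one non-suspect is deliberately given a statement that mentions cash.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Incorrect_records (employee_name TEXT, statements TEXT);
    INSERT INTO Incorrect_records VALUES
        ('Surveyor A', 'Villagers mentioned cash changing hands.'),
        ('Surveyor A', 'Long queues at the tap.'),
        ('Surveyor A', 'The official seemed dismissive.'),
        ('Surveyor B', 'Villagers praised the respectful visit.'),
        ('Surveyor C', 'Villagers spoke of cash requests.');
""")

rows = conn.execute("""
    WITH error_count AS (
        SELECT employee_name, COUNT(*) AS number_of_mistakes
        FROM Incorrect_records
        GROUP BY employee_name
    ),
    suspect_list AS (
        SELECT employee_name
        FROM error_count
        WHERE number_of_mistakes > (SELECT AVG(number_of_mistakes) FROM error_count)
    )
    SELECT employee_name, statements
    FROM Incorrect_records
    WHERE statements LIKE '%cash%'
      AND employee_name NOT IN (SELECT employee_name FROM suspect_list)
""").fetchall()

print(rows)  # [('Surveyor C', 'Villagers spoke of cash requests.')]
```

Only Surveyor A is above the toy average, so Surveyor C's statement is returned. The real query returning nothing therefore means no such record exists outside the suspect list, at least for statements containing the word "cash".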

# Conclusion

To summarize the findings for Zuriel Matembo, Malachi Mavuso, Bello Azibo, and Lalitha Kaburi:

1. Each made an above-average number of mistakes compared with their peers.
2. Each has incriminating statements made against them, and only them.

It’s important to note, however, that this is not definitive proof.