# Dealing with NULL Values
The example data in the tables in the world and sakila databases shown earlier are all accurate and complete. Every row has a value for each attribute. However, real data is usually not so clean and tidy. You will often find **NULL** values in some tables.

Nulls in a database can cause a few headaches. Moreover, the descriptions in the SQL standards on how to handle NULLs seem ambiguous. It is not clear from the standards documents exactly how NULLs should be handled in all circumstances.

Sometimes we can avoid NULLs by setting the NOT NULL constraint when we create a table. However, bear in mind that making fields NOT NULL does not always work and can create more headaches than it cures. Not every NULL means there is a problem with the data.

`NULL` is the term used to represent a missing value. A NULL value in a table is a value in a field that appears to be blank. However, a NULL value should not be thought of as 0 (zero) or as an empty string like ''. It marks a value that is either empty or undefined.
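A quick way to see that NULL is neither zero nor an empty string is to compare them directly. The sketch below uses Python's built-in `sqlite3` module with an in-memory database as a stand-in for the MySQL server used in this notebook (an assumption, made so the snippet runs without a server); MySQL evaluates these comparisons the same way.

```python
import sqlite3

# In-memory SQLite database, used here only as a stand-in for MySQL.
conn = sqlite3.connect(":memory:")

# Comparing NULL with = yields NULL (unknown), not true or false.
null_eq_zero, null_eq_empty, null_is_null = conn.execute(
    "SELECT NULL = 0, NULL = '', NULL IS NULL"
).fetchone()

print(null_eq_zero)   # None
print(null_eq_empty)  # None
print(null_is_null)   # 1
conn.close()
```

Python's `None` is how the driver reports a SQL NULL, so the first two comparisons are "unknown" rather than false.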

This notebook will present:

- How to DROP a table IF EXISTS
- How to CREATE a new table from an existing table
- How to UPDATE a table with a WHERE condition
- How to COUNT NULL values with IS NULL
- How to give NULLs default values with the `COALESCE` function

In [1]:
import pandas as pd
import mysql.connector as sql
import os

In [2]:
connection = sql.connect(
    host = os.environ.get('mysql_host'),
    user = os.environ.get('mysql_user'),
    password = os.environ.get('mysql_password')
)

cursor = connection.cursor()

## 1 Create a table with NULL values from an existing table
Take the `country` table in the `world` database as an example.

- Firstly, make a backup table

           The SQL CREATE TABLE AS statement is used to create a table from an existing table by copying the existing table's columns.

In [3]:
pd.read_sql_query("""
    SELECT Name, IndepYear
    FROM world.country
    """,
    connection)

Unnamed: 0,Name,IndepYear
0,Aruba,
1,Afghanistan,1919.0
2,Angola,1975.0
3,Anguilla,
4,Albania,1912.0
...,...,...
234,Yemen,1918.0
235,Yugoslavia,1918.0
236,South Africa,1910.0
237,Zambia,1964.0


In [4]:
pd.read_sql_query("""
    SHOW TABLES
    FROM world
    """,
    connection)

Unnamed: 0,Tables_in_world
0,city
1,country
2,country_year
3,countrylanguage


In [3]:
# DROP/CREATE return no result set, so run them through the cursor
# (one statement per call) instead of pd.read_sql_query.
cursor.execute("DROP TABLE IF EXISTS world.country_year")
cursor.execute("""
    CREATE TABLE world.country_year AS
    SELECT Name, IndepYear
    FROM world.country
    """)
connection.commit()
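Statements such as DROP TABLE and CREATE TABLE return no rows, so they are better run through a cursor than through a function that expects a result set. The sketch below illustrates the drop-and-recreate pattern with Python's built-in `sqlite3` module as a stand-in for MySQL. The in-memory database, the unqualified table names, the helper `make_backup`, and the sample rows are all assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the MySQL connection

# Hypothetical miniature version of the country table.
conn.execute("CREATE TABLE country (Name TEXT, IndepYear INTEGER, Continent TEXT)")
conn.executemany(
    "INSERT INTO country VALUES (?, ?, ?)",
    [("Aruba", None, "North America"), ("Afghanistan", 1919, "Asia")],
)

def make_backup(connection):
    """Drop the backup table if it exists, then rebuild it from country."""
    connection.execute("DROP TABLE IF EXISTS country_year")
    connection.execute(
        "CREATE TABLE country_year AS SELECT Name, IndepYear FROM country"
    )

make_backup(conn)
make_backup(conn)  # rerunning is safe because of IF EXISTS

rows = conn.execute(
    "SELECT Name, IndepYear FROM country_year ORDER BY Name"
).fetchall()
print(rows)  # [('Afghanistan', 1919), ('Aruba', None)]
conn.close()
```

Only the selected columns are copied, and the NULL in `IndepYear` is carried over unchanged.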

Have a quick check of the backup table

In [8]:
pd.read_sql_query("""
    SELECT *
    FROM world.country_year
    """,
    connection)

Unnamed: 0,Name,IndepYear


- There are some NULL values already, but if there had not been any, we could have created some with an UPDATE. Here I set years below 0 to NULL.
    
            The SQL UPDATE statement modifies existing records in a table. Use a WHERE clause with UPDATE to change only selected rows; otherwise all rows are updated.

In [7]:
# UPDATE returns no result set, so run it through the cursor and commit.
cursor.execute("""
    UPDATE world.country_year
    SET IndepYear = NULL
    WHERE IndepYear < 0
    """)
connection.commit()
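To make the effect of the WHERE clause concrete, here is a minimal sketch using Python's built-in `sqlite3` module as a stand-in for MySQL. The in-memory database and the sample rows are illustrative assumptions, not data taken from this notebook. The cursor's `rowcount` reports how many rows the UPDATE touched.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the MySQL connection
conn.execute("CREATE TABLE country_year (Name TEXT, IndepYear INTEGER)")
conn.executemany(
    "INSERT INTO country_year VALUES (?, ?)",
    [("China", -1523), ("Angola", 1975), ("Albania", 1912)],  # illustrative rows
)

# Only rows matching the WHERE condition are modified.
cur = conn.execute(
    "UPDATE country_year SET IndepYear = NULL WHERE IndepYear < 0"
)
conn.commit()
updated = cur.rowcount

rows = conn.execute(
    "SELECT Name, IndepYear FROM country_year ORDER BY Name"
).fetchall()
print(updated)  # 1
print(rows)     # [('Albania', 1912), ('Angola', 1975), ('China', None)]
conn.close()
```

Without the WHERE clause, the same statement would set `IndepYear` to NULL in every row.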

## 2. Find NULLs
NULL values cannot be found with `=`, because any comparison with NULL evaluates to NULL rather than true. We need to use the **IS NULL** or **IS NOT NULL** operators to identify them. So, to get all records with no recorded year, we could run this query.

In [9]:
pd.read_sql_query("""
    SELECT *
    FROM world.country_year
    WHERE IndepYear IS NULL
    """,
    connection)

Unnamed: 0,Name,IndepYear
0,Aruba,
1,Anguilla,
2,Netherlands Antilles,
3,American Samoa,
4,Antarctica,
5,French Southern territories,
6,Bermuda,
7,Bouvet Island,
8,Cocos (Keeling) Islands,
9,China,
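The difference between `= NULL` and `IS NULL` is easy to check on a tiny table. This is a minimal sketch using Python's built-in `sqlite3` module as a stand-in for MySQL (an assumption made so it runs without a server); the three sample rows mirror values shown earlier in this notebook.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the MySQL connection
conn.execute("CREATE TABLE country_year (Name TEXT, IndepYear INTEGER)")
conn.executemany(
    "INSERT INTO country_year VALUES (?, ?)",
    [("Aruba", None), ("Afghanistan", 1919), ("Anguilla", None)],
)

# "= NULL" never evaluates to true, so no rows are returned.
rows_equals = conn.execute(
    "SELECT Name FROM country_year WHERE IndepYear = NULL"
).fetchall()

# IS NULL is the operator that actually tests for missing values.
rows_is_null = conn.execute(
    "SELECT Name FROM country_year WHERE IndepYear IS NULL ORDER BY Name"
).fetchall()

print(rows_equals)   # []
print(rows_is_null)  # [('Anguilla',), ('Aruba',)]
conn.close()
```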


Count the rows where the year is NULL

In [10]:
pd.read_sql_query("""
    SELECT COUNT(Name) AS Missings
    FROM world.country_year
    WHERE IndepYear IS NULL
    """,
    connection)

Unnamed: 0,Missings
0,47


## 3. Handle NULLs
NULLs can be ambiguous and annoying because they are represented differently depending on the data source. Tables can have NULL values for a number of reasons, such as observations that were not recorded and data corruption.

In general, there are two main strategies for handling NULLs at query time without changing the original data in the table.

### 3.1 Do not use rows with NULL values
This strategy is quite simple, as we can always filter the data with a WHERE ... IS NOT NULL condition. In practice, however, the data may not be usable at all if the ratio of NULLs is too high.

In [11]:
pd.read_sql_query("""
    SELECT *
    FROM world.country_year
    WHERE IndepYear IS NOT NULL
    """,
    connection)

Unnamed: 0,Name,IndepYear
0,Afghanistan,1919
1,Angola,1975
2,Albania,1912
3,Andorra,1278
4,United Arab Emirates,1971
...,...,...
184,Yemen,1918
185,Yugoslavia,1918
186,South Africa,1910
187,Zambia,1964


Calculate the counts of NULLs, NOT NULLs and the total. Keep in mind that COUNT applied to a column ignores NULL values in that column.

In [14]:
pd.read_sql_query("""
    SELECT
        SUM(CASE
                WHEN IndepYear IS NULL THEN 1
                ELSE 0
                END) AS NULLs_Count,
        COUNT(IndepYear) AS Not_NULLs_Count,
        COUNT(Name)
    FROM world.country_year
    """,
    connection)

Unnamed: 0,NULLs_Count,Not_NULLs_Count,COUNT(Name)
0,47.0,192,239


We could have done the same count with the following query:

In [15]:
pd.read_sql_query("""
    SELECT
        SUM(IF(IndepYear IS NULL, 1, 0)) AS NULLs_Count,
        COUNT(IndepYear) AS Not_NULLs_Count,
        COUNT(Name)
    FROM world.country_year
    """,
    connection)

Unnamed: 0,NULLs_Count,Not_NULLs_Count,COUNT(Name)
0,47.0,192,239
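Because `COUNT(column)` skips NULLs while `COUNT(*)` counts every row, the number of NULLs can also be obtained as their difference, and the NULL ratio follows directly. The sketch below uses Python's built-in `sqlite3` module as a stand-in for MySQL, with a four-row sample table as an illustrative assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the MySQL connection
conn.execute("CREATE TABLE country_year (Name TEXT, IndepYear INTEGER)")
conn.executemany(
    "INSERT INTO country_year VALUES (?, ?)",
    [("Aruba", None), ("Afghanistan", 1919), ("Angola", 1975), ("Anguilla", None)],
)

total, not_null, nulls = conn.execute(
    """
    SELECT
        COUNT(*),                     -- counts every row
        COUNT(IndepYear),             -- skips rows where IndepYear is NULL
        COUNT(*) - COUNT(IndepYear)   -- therefore the number of NULLs
    FROM country_year
    """
).fetchone()

null_ratio = nulls / total
print(total, not_null, nulls)  # 4 2 2
print(null_ratio)              # 0.5
conn.close()
```

A ratio like this is one way to judge whether simply dropping the NULL rows leaves enough data to work with.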


### 3.2 Replace NULL values with sensible values
Before replacing NULL values, check the database documentation to confirm what a NULL means, from a business perspective, in each nullable column (a column that is allowed to hold NULL values).

SQL provides an elegant way of handling NULL values: the `COALESCE()` function. It accepts two or more arguments and returns the first non-NULL one, so a column followed by a default value yields the default wherever the column is NULL. If all the arguments are NULL, COALESCE returns NULL.

The following illustrates the syntax of the COALESCE function:
COALESCE(parameter1, parameter2, …);
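The three cases described above (first argument non-NULL, leading NULLs skipped, all arguments NULL) can be checked with a one-line query. This sketch uses Python's built-in `sqlite3` module as a stand-in for MySQL, which is an assumption made so the snippet runs without a server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the MySQL connection

first_non_null, skips_nulls, all_null = conn.execute(
    "SELECT COALESCE(1919, 0), COALESCE(NULL, NULL, 1847), COALESCE(NULL, NULL)"
).fetchone()

print(first_non_null)  # 1919
print(skips_nulls)     # 1847
print(all_null)        # None
conn.close()
```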

Here we want all NULLs of IndepYear to be replaced by the mean of the non-NULL IndepYear values.

Please note that filling missing years with the mean year makes little sense in practice; it is done here only to illustrate the technique.

Calculate the mean of the non-NULL values (AVG ignores NULLs).

In [16]:
pd.read_sql_query("""
    SELECT
        AVG(IndepYear)
    FROM world.country_year
    """,
    connection)

Unnamed: 0,AVG(IndepYear)
0,1847.2604


Replace NULLs with the mean of the non-NULL values calculated above

In [17]:
pd.read_sql_query("""
    SELECT
        Name,
        COALESCE(IndepYear, 1847) AS Year
    FROM world.country_year
    """,
    connection)

Unnamed: 0,Name,Year
0,Aruba,1847
1,Afghanistan,1919
2,Angola,1975
3,Anguilla,1847
4,Albania,1912
...,...,...
234,Yemen,1918
235,Yugoslavia,1918
236,South Africa,1910
237,Zambia,1964
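Hard-coding the mean (1847) means the query goes stale whenever the data changes. One option, sketched here under assumptions, is to let a scalar subquery supply the default inside `COALESCE`. The example uses Python's built-in `sqlite3` module as a stand-in for MySQL, with a three-row sample table for illustration; the same query shape should also work in MySQL, though the numeric type of the result may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the MySQL connection
conn.execute("CREATE TABLE country_year (Name TEXT, IndepYear INTEGER)")
conn.executemany(
    "INSERT INTO country_year VALUES (?, ?)",
    [("Aruba", None), ("Afghanistan", 1919), ("Angola", 1975)],
)

rows = conn.execute(
    """
    SELECT
        Name,
        -- AVG ignores NULLs, so the subquery is the mean of the known years
        COALESCE(IndepYear, (SELECT AVG(IndepYear) FROM country_year)) AS Year
    FROM country_year
    ORDER BY Name
    """
).fetchall()

print(rows)  # [('Afghanistan', 1919), ('Angola', 1975), ('Aruba', 1947.0)]
conn.close()
```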


## Summary
Dealing with NULL values is a complicated task. It is best to get assistance from domain experts unless you know very clearly what the NULL values stand for.

# References
- [Chonghua Yin notebook](https://github.com/royalosyin/Practice-SQL-with-SQLite-and-Jupyter-Notebook/blob/master/ex11-Dealing%20with%20NULL%20Values.ipynb)