# Identifying data types

Being able to identify data types before setting up your query can help you plan for any potential issues. There is a group of tables, or a schema, called the `information_schema`, which provides a wide array of information about the database itself, including the structure of tables and columns.

The `columns` table houses useful details about the columns, including the data type.

Note that the `information_schema` is not the default schema SQL looks at when querying, which means you will need to explicitly tell SQL to pull from this schema. To pull a table from a non-default schema, use the syntax `schema_name.table_name`.

```
-- Pull column_name & data_type from the columns table
SELECT
    column_name,
    data_type
FROM information_schema.columns
-- Filter for the table 'country_stats'
WHERE table_name = 'country_stats';
```
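Table names are only unique within a schema, so filtering on `table_name` alone can return columns from more than one table if several schemas contain a `country_stats` table. A minimal sketch that narrows the lookup, assuming the table lives in PostgreSQL's default `public` schema (an assumption; the schema is not named here):

```sql
-- Assumption: country_stats lives in the 'public' schema
SELECT
    column_name,
    data_type
FROM information_schema.columns
WHERE table_schema = 'public'
  AND table_name = 'country_stats'
-- ordinal_position is the column's position within its table
ORDER BY ordinal_position;
```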

# Interpreting error messages

Inevitably, you will run into errors when running SQL queries. It is important to understand how to interpret these errors to correctly identify what type of error it is.

The console contains two separate queries, each of which produces an error when run. In this exercise, you will run each query, read the error message, and troubleshoot the error.

```
/*
-- First query (fixed, then commented out): cast pop_in_millions to float before averaging
SELECT AVG(CAST(pop_in_millions AS float)) AS avg_population
FROM country_stats;
*/
-- Second query: count distinct summer and winter athletes per country
SELECT
    s.country_id,
    COUNT(DISTINCT s.athlete_id) AS summer_athletes,
    COUNT(DISTINCT w.athlete_id) AS winter_athletes
FROM summer_games AS s
JOIN winter_games_str AS w
-- Fix the error by making both columns integers
ON s.country_id = CAST(w.country_id AS INT)
GROUP BY s.country_id;
```
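As an illustration of the kind of error a type mismatch produces, the sketch below uses a made-up inline table (`sample_stats` and its values are hypothetical) with a population column stored as text. In PostgreSQL, calling `AVG()` on that column directly fails with an error along the lines of `function avg(text) does not exist`; casting the column first resolves it.

```sql
-- Hypothetical sample data: population stored as text
WITH sample_stats(pop_in_millions) AS (
    VALUES (CAST('1.5' AS text)), (CAST('2.5' AS text))
)
-- AVG(pop_in_millions) alone would fail because the column is text;
-- casting to float first makes the aggregate valid
SELECT AVG(CAST(pop_in_millions AS float)) AS avg_population
FROM sample_stats;
-- avg_population: 2
```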

# Using date functions on strings

There are several useful functions that act specifically on date or datetime fields. For example:

- `DATE_TRUNC('month', date)` truncates each date to the first day of the month.
- `DATE_PART('year', date)` outputs the year of each date value as a number.

In general, the arguments for both functions are `('period', field)`, where `period` is a date or time unit, such as `'minute'`, `'day'`, or `'decade'`.
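To make the difference between the two functions concrete, here is a small sketch that applies them to a single literal date (the date itself is arbitrary and chosen only for illustration):

```sql
SELECT
    DATE_TRUNC('month', DATE '2016-08-15')  AS month_start,   -- 2016-08-01 at midnight
    DATE_PART('year', DATE '2016-08-15')    AS year_part,     -- 2016
    DATE_PART('decade', DATE '2016-08-15')  AS decade_part,   -- 201 (the year divided by 10)
    DATE_TRUNC('decade', DATE '2016-08-15') AS decade_start;  -- 2010-01-01 at midnight
```

`DATE_TRUNC()` keeps a date-like value, while `DATE_PART()` reduces the date to a single number.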

In this exercise, your goal is to test out these date functions on the `country_stats` table, specifically by outputting the decade of each year using two separate approaches. To run these functions, you will first need to use the `CAST()` function to convert the `year` field to a date.

```
SELECT
    year,
    -- Pull decade, decade_truncated, and the world's gdp
    DATE_PART('decade', CAST(year AS DATE)) AS decade,
    DATE_TRUNC('decade', CAST(year AS DATE)) AS decade_truncated,
    SUM(gdp) AS world_gdp
FROM country_stats
-- Group by year and the fields derived from it, then order by year descending
GROUP BY year, decade, decade_truncated
ORDER BY year DESC;
```

# String functions

There are a number of string functions that can be used to alter strings. A few of them are described below:

- The `LOWER(fieldName)` function converts all characters in `fieldName` to lower case.
- The `INITCAP(fieldName)` function converts `fieldName` to proper case: the first letter of each word is capitalized and the remaining letters are lowercased.
- The `LEFT(fieldName, N)` function returns the leftmost N characters of the string `fieldName`.
- The `SUBSTRING(fieldName FROM S FOR N)` function returns N characters starting from position S of the string `fieldName`. Either `FROM S` or `FOR N` can be omitted, but at least one of them must be given.
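The functions in the list above can be tried on a literal string before running them against a table. A minimal sketch (the sample string is arbitrary):

```sql
SELECT
    LOWER('UNITED STATES')                  AS lowered,       -- united states
    INITCAP('UNITED STATES')                AS proper_case,   -- United States
    LEFT('UNITED STATES', 3)                AS first_three,   -- UNI
    SUBSTRING('UNITED STATES' FROM 8 FOR 6) AS from_8_for_6,  -- STATES
    SUBSTRING('UNITED STATES' FROM 8)       AS from_8;        -- STATES (runs to the end)
```

Positions are counted from 1, so position 8 is the first character after the space.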

```
-- Convert country to proper case
SELECT 
	country, 
    INITCAP(LOWER(country)) AS country_altered
FROM countries
GROUP BY country;
```

```
-- Output the left 3 characters of country
SELECT 
	country, 
    LEFT(country,3) AS country_altered
FROM countries
GROUP BY country;
```

```
-- Output all characters starting with position 7
SELECT 
	country, 
    SUBSTR(country,7,LENGTH(country)) AS country_altered
FROM countries
GROUP BY country;
```

# Replacing and removing substrings

The `REPLACE()` function is a versatile function that allows you to replace or remove characters from a string. The syntax is as follows:

`REPLACE(fieldName, 'searchFor', 'replaceWith')`

Where `fieldName` is the field or string being updated, `searchFor` is the substring to be replaced, and `replaceWith` is the replacement substring. To remove a substring instead, pass an empty string as `replaceWith`.

In this exercise, you will look at one specific value in the `countries` table and reformat it using a few `REPLACE()` calls.

```
SELECT 
	region, 
    -- Replace all '&' characters with the string 'and'
    REPLACE(region,'&','and') AS character_swap,
    -- Remove all periods
    REPLACE(region,'.','') AS character_remove,
    -- Combine the functions to run both changes at once
    REPLACE(REPLACE(region,'&','and'),'.','') AS character_swap_and_remove
FROM countries
WHERE region = 'LATIN AMER. & CARIB'
GROUP BY region;
```

# Fixing incorrect groupings

One issue with having strings stored in different formats is that you may incorrectly group data. If the same value is represented in multiple ways, your report will split the values into different rows, which can lead to inaccurate conclusions.

In this exercise, you will query from the `summer_games_messy` table, which is a messy, smaller version of `summer_games`. You'll notice that the same event is stored in multiple ways. Your job is to clean the `event` field to show the correct number of rows.

```
-- Pull event and unique athletes from summer_games_messy 
SELECT 
	event, 
    COUNT(DISTINCT athlete_id) AS athletes
FROM summer_games_messy
-- Group by the non-aggregated field
GROUP BY event;
```

```
-- Pull event and unique athletes from summer_games_messy 
SELECT 
    -- Trim surrounding spaces and remove dashes from all event values
    REPLACE(TRIM(event),'-','') AS event_fixed, 
    COUNT(DISTINCT athlete_id) AS athletes
FROM summer_games_messy
-- Update the group by accordingly
GROUP BY event_fixed;
```
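To see the effect of the cleanup in isolation, the sketch below runs the same expression on a made-up inline table. The rows in `messy` are hypothetical and are not taken from `summer_games_messy`.

```sql
-- Hypothetical sample rows: one event stored three different ways
WITH messy(event, athlete_id) AS (
    VALUES
        ('GYMNASTICS', 1),
        (' GYMNASTICS ', 2),
        ('GYM-NASTICS', 3)
)
SELECT
    REPLACE(TRIM(event), '-', '') AS event_fixed,
    COUNT(DISTINCT athlete_id) AS athletes
FROM messy
GROUP BY event_fixed;
-- One row: GYMNASTICS, 3
-- Grouping by the raw event column instead would give three rows of 1 athlete each
```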

# Filtering out nulls

One way to deal with nulls is to simply filter them out. There are two important conditionals related to nulls:

- `IS NULL` is true for any value that is null.
- `IS NOT NULL` is true for any value that is not null.

Note that a zero or a blank (empty) string is not the same as a null.

These conditionals can be used in several places, such as `CASE` expressions, `WHERE` clauses, and `HAVING` clauses. In this exercise, you will learn how to filter out nulls using two separate techniques.
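A quick way to check how nulls behave is to test a few literal values. The sketch below also shows why `IS NULL` is needed: comparing a null with `=` yields null rather than true or false.

```sql
SELECT
    0 IS NULL                AS zero_is_null,      -- false
    CAST('' AS text) IS NULL AS empty_is_null,     -- false
    NULL IS NULL             AS null_is_null,      -- true
    CAST(NULL AS int) = 0    AS null_equals_zero;  -- null, neither true nor false
```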

Feel free to reference the E:R Diagram.

```
-- Show total gold_medals by country
SELECT 
	country, 
    SUM(gold) AS gold_medals
FROM winter_games AS w
JOIN countries AS c
ON w.country_id = c.id
-- Comment out the WHERE statement
-- WHERE gold IS NOT NULL
GROUP BY country
-- Replace WHERE statement with equivalent HAVING statement
HAVING SUM(gold) IS NOT NULL
-- Order by gold_medals in descending order
ORDER BY gold_medals DESC;
```

# Fixing calculations with coalesce

Null values impact aggregations in a number of ways. One issue is related to the `AVG()` function. By default, the `AVG()` function ignores null values: they are left out of both the sum and the count, so the average is taken over the non-null rows only. However, there may be times when you want to include these null values in the calculation as zeros.

To replace null values with a string or a number, use the `COALESCE()` function. The syntax is `COALESCE(fieldName, replacement)`, where `replacement` is the value returned whenever `fieldName` is null.
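The difference is easiest to see on a tiny example. The sketch below uses a made-up inline table (`sample_games` and its rows are hypothetical) for one athlete with four events and a single gold medal:

```sql
-- Hypothetical sample rows: one athlete, four events, one gold medal
WITH sample_games(athlete_id, gold) AS (
    VALUES (1, 1), (1, NULL), (1, NULL), (1, NULL)
)
SELECT
    AVG(gold)              AS avg_ignoring_nulls,  -- 1: only the one non-null row is averaged
    AVG(COALESCE(gold, 0)) AS avg_nulls_as_zero    -- 0.25: all four rows are averaged
FROM sample_games;
```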

This exercise will walk you through why null values can throw off calculations and how to troubleshoot these issues.

```
-- Pull events and golds by athlete_id for summer events
SELECT 
    athlete_id, 
    -- Add a field that averages the existing gold field
    AVG(gold) AS avg_golds,
    COUNT(event) AS total_events, 
    SUM(gold) AS gold_medals
FROM summer_games
GROUP BY athlete_id
-- Order by total_events descending and athlete_id ascending
ORDER BY total_events DESC, athlete_id;
```

If the report were accurate, what should the first three values of `avg_golds` be?
- `[0, 0, .125]`

```
-- Pull events and golds by athlete_id for summer events
SELECT 
    athlete_id, 
    -- Replace all null gold values with 0
    AVG(COALESCE(gold,0)) AS avg_golds,
    COUNT(event) AS total_events, 
    SUM(gold) AS gold_medals
FROM summer_games
GROUP BY athlete_id
-- Order by total_events descending and athlete_id ascending
ORDER BY total_events DESC, athlete_id;
```