# Data loading and storage

The FROM clause of a query can reference data from several kinds of sources, such as base tables, views, and materialized views.

Which of the following sources both stores and refreshes data?
- Base Table
- Materialized View
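
As a rough illustration of what "stores and refreshes" means, a materialized view keeps the result of its query on disk and is brought up to date on demand. The sketch below assumes a hypothetical view name, `hawaii_aqi_mv`, built on the `daily_aqi` table used later in these notes.

```sql
-- Hypothetical materialized view: the query result is stored on disk
CREATE MATERIALIZED VIEW hawaii_aqi_mv AS
SELECT county_name, aqi, category, aqi_date
FROM daily_aqi
WHERE state_code = 15; -- Hawaii state code

-- Re-run the stored query to pick up changes made to the base table
REFRESH MATERIALIZED VIEW hawaii_aqi_mv;
```

A base table, by contrast, stores data but has no defining query to refresh from.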

# Finding the table type

You may have wondered whether these data sources are actual tables or views. Knowing this can help you decide how to structure your queries and understand why a query may be slow (e.g., because it reads from a view).

```
SELECT DISTINCT table_type 
FROM information_schema.tables 
WHERE table_catalog = 'olympics_aqi'; 
```

```
SELECT *
FROM information_schema.tables 
WHERE table_catalog = 'olympics_aqi' 
AND table_name = 'annual_aqi';
```
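
One caveat in PostgreSQL: `information_schema.tables` reports base tables and regular views, but materialized views do not appear there. A minimal sketch of how they could be listed instead, using the `pg_matviews` system view:

```sql
-- List materialized views and whether they currently hold data
SELECT schemaname
  , matviewname
  , ispopulated
FROM pg_matviews;
```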

# Row-oriented storage properties

Databases store data in two main formats: row-oriented storage and column-oriented storage. Postgres inherently uses row-oriented storage.

Choose the statements that correctly explain the properties of row-oriented storage in a database.

- Row-oriented storage keeps the association between columns.
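
One way to observe this property in Postgres is sketched below, using the `daily_aqi` table and `aqi` column that appear later in these notes. Because the columns of a row are stored together, a sequential scan reads whole rows even when a single column is requested. The two plans would typically show the same estimated cost, with only the reported row width differing.

```sql
-- All columns: the plan reports the full row width
EXPLAIN
SELECT *
FROM daily_aqi;

-- One column: narrower width, but the scan still reads every row in full
EXPLAIN
SELECT aqi
FROM daily_aqi;
```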

# Previewing a row-oriented table

When working with new tables, it is common to SELECT * to preview the data. However, selecting all the rows from a large table using row-oriented storage is resource intensive and slow. A quick optimization trick is to limit the number of rows returned.

You will be working with air quality data from the United States. This data lives in a Postgres database using row-oriented storage. Familiarize yourself with daily_aqi by previewing the data. First, select all records. Then limit the results to improve speed.

Use the EXPLAIN command to quantify the effect of limiting the rows.

```
EXPLAIN
SELECT *
FROM daily_aqi;
```

```
EXPLAIN
SELECT *
FROM daily_aqi
LIMIT 10;
```
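
EXPLAIN on its own reports the planner's cost estimates without running the query. To compare the estimates against measured execution time, EXPLAIN ANALYZE can be used instead. Keep in mind that it actually executes the statement, so it is best kept to read-only queries like this one.

```sql
-- Runs the query and reports actual time and row counts alongside the estimates
EXPLAIN ANALYZE
SELECT *
FROM daily_aqi
LIMIT 10;
```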

# Partitioning on location

The air quality index (AQI) is calculated from four air pollutants (ozone, particle pollution, carbon monoxide, and sulfur dioxide) and then categorized into six levels.

Air quality in the US is monitored at the state level, with states reporting data to the Environmental Protection Agency (EPA). Because the data comes from these different sources, the air quality table is partitioned by state. While it looks like one table, each state has its own child table.

Find the cost estimate impact of the partition. Then find the best days and places to visit in Hawaii based on the AQI.

```
EXPLAIN
SELECT * 
FROM daily_aqi
WHERE state_code = 15; -- Hawaii state code
```

```
EXPLAIN
SELECT * 
FROM daily_aqi_partitioned
WHERE state_code = 15; -- Hawaii state code
```

```
SELECT county_name
  , aqi
  , category
  , aqi_date
FROM daily_aqi_partitioned
WHERE state_code = 15 
ORDER BY aqi;
```
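
For readers curious how such a layout might be declared, the following is a minimal sketch using declarative list partitioning (available in PostgreSQL 10 and later). The table names and column types are assumptions for illustration and are not the actual definition of `daily_aqi_partitioned`.

```sql
-- Hypothetical parent table; column names follow the queries above, types are assumed
CREATE TABLE daily_aqi_partitioned_sketch (
    state_code  integer NOT NULL,
    county_name text,
    aqi         integer,
    category    text,
    aqi_date    date
) PARTITION BY LIST (state_code);

-- One child table per state; this one holds only Hawaii (state code 15)
CREATE TABLE daily_aqi_sketch_hawaii
    PARTITION OF daily_aqi_partitioned_sketch
    FOR VALUES IN (15);
```

With a layout like this, a filter on `state_code` lets the planner skip the child tables that cannot contain matching rows.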

# Finding the database indexes

One aspect of writing well-performing queries is taking advantage of the database's optimization features. When working in row-oriented databases, you want to limit the number of records returned. If partitions and indexes exist, you should use them in your queries as filters. Ideally, you could reference a database diagram or ask your friendly database administrator (DBA) which tables and columns have indexes.

However, sometimes documentation is missing, and DBAs are busy. Luckily, the pg_catalog schema has a view, pg_indexes, that shows all the existing indexes.

```
SELECT tablename
 , indexname
FROM pg_indexes;
```
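
The unfiltered query also returns indexes on the system catalogs. A slightly extended sketch, which hides those and adds the `indexdef` column to show which columns each index covers:

```sql
SELECT schemaname
  , tablename
  , indexname
  , indexdef -- The CREATE INDEX statement behind each index
FROM pg_indexes
WHERE schemaname NOT IN ('pg_catalog', 'information_schema')
ORDER BY tablename, indexname;
```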

# Creating and using an index

You are working with the US air quality index (AQI) data. AQI data captures four pollutants: ozone, particle pollution, carbon monoxide, and sulfur dioxide. You want to know if one of these pollutants is the main source of poor air quality.

The daily_aqi table contains daily AQI measurements. The defining_parameter field describes which of the four pollutants was the worst for that day.

You will be writing a lot of queries using the pollutant type as a filter. Before you dig into the data, you want to check if defining_parameter has an index on it. If not, you want to add one.

```
SELECT indexname
FROM pg_indexes
WHERE tablename = 'daily_aqi'; -- Filter condition
```

```
CREATE INDEX defining_parameter_index 
 ON daily_aqi (defining_parameter); -- Define the index creation

SELECT indexname -- Check for the index
FROM pg_indexes
WHERE tablename = 'daily_aqi';
```
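
Running CREATE INDEX a second time with the same name raises an error. If the script may be re-run, a hedged alternative is to make the statement idempotent with IF NOT EXISTS (PostgreSQL 9.5 and later), and to drop the index with IF EXISTS when it is no longer needed:

```sql
-- Creates the index only if no relation with this name already exists
CREATE INDEX IF NOT EXISTS defining_parameter_index
  ON daily_aqi (defining_parameter);

-- Cleanup, if the index turns out not to help
DROP INDEX IF EXISTS defining_parameter_index;
```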

# Compare runtimes

Hawaii is home to many volcanoes. One volcano, Kīlauea, has been erupting nearly continuously since 1983. The volcanic smog (vog) it spews is high in sulfur dioxide (SO2) and results in poor air quality. In fact, based on the count of good or moderate air quality days over 10 years (2008 to 2018), Hawaii has the worst air quality of any state.

Look at the main air pollutant on poor air quality days, focusing on SO2. Compare the cost estimate between a query with an index on the pollutant column and a query without an index.

```
SELECT category
  , COUNT(*) as record_cnt
  , SUM(no_sites) as aqi_monitoring_site_cnt
FROM daily_aqi
WHERE category <> 'Good'
AND state_code = 15 -- Filter to Hawaii
GROUP BY category;
```

```
EXPLAIN
SELECT category
  , COUNT(*) as record_cnt
  , SUM(no_sites) as aqi_monitoring_site_cnt
FROM daily_aqi
WHERE defining_parameter = 'SO2'
AND category <> 'Good'
AND state_code = 15 -- Filter to Hawaii
GROUP BY category;
```

```
CREATE INDEX IF NOT EXISTS defining_parameter_index ON daily_aqi (defining_parameter); -- Skipped if the index already exists

EXPLAIN
SELECT category
  , COUNT(*) as record_cnt
  , SUM(no_sites) as aqi_monitoring_site_cnt
FROM daily_aqi
WHERE defining_parameter = 'SO2'
AND category <> 'Good'
AND state_code = 15 -- Hawaii
GROUP BY category;
```
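
Since the query filters on both `state_code` and `defining_parameter`, a multicolumn index is another option worth comparing. The sketch below uses a hypothetical index name, `state_parameter_index`. Whether the planner chooses it, and whether the cost estimate improves, depends on the data distribution, so treat it as an experiment and not a guaranteed gain.

```sql
-- Hypothetical multicolumn index covering both equality filters
CREATE INDEX IF NOT EXISTS state_parameter_index
  ON daily_aqi (state_code, defining_parameter);

EXPLAIN
SELECT category
  , COUNT(*) AS record_cnt
  , SUM(no_sites) AS aqi_monitoring_site_cnt
FROM daily_aqi
WHERE defining_parameter = 'SO2'
AND category <> 'Good'
AND state_code = 15 -- Hawaii
GROUP BY category;
```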

# Column-oriented storage properties

You are at a small company that is just starting to build its data infrastructure. You are discussing with coworkers the best type of database structure for the company data warehouse. The data warehouse will bring together data from multiple source systems such as human resources data, financial data, and customer data.

You think you should use a database that uses column-oriented storage.

Which of these reasons does NOT support your argument?

- The data needs to be real time, so records will continually be appended to tables in the warehouse. Because column-oriented storage databases are slow to insert records, constantly appending data will significantly slow performance.
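
For contrast, the kind of workload that tends to suit column-oriented storage is an aggregate over a few columns of a wide table, where the engine can read only the columns it needs. The sketch below reuses the `daily_aqi` table from earlier sections purely to show the query shape; that table itself lives in row-oriented Postgres.

```sql
-- Touches only two columns, however wide the table is
SELECT state_code
  , AVG(aqi) AS avg_aqi
FROM daily_aqi
GROUP BY state_code;
```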

# Using the information schema

The most basic optimization method with column-oriented storage databases is to reduce the number of columns each query returns.

When working with new tables, it is common to select the first 5 or 10 rows. However, a basic select on a wide table may be resource intensive. The information schema provides some column metadata and is a good starting place to learn about your data.

Although they are not shown as available, the views in the information_schema can always be queried. Feel free to explore the columns view in the console to see what information it offers before completing the exercise.

```
-- Examine metadata about daily_aqi
SELECT column_name, data_type, is_nullable
FROM information_schema.columns
WHERE table_catalog = 'olympics_aqi'
AND table_name = 'daily_aqi' -- Limit to a specific table
;
```
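
Once the metadata shows which columns exist, a preview can name only the columns of interest instead of using SELECT *. A minimal sketch, using column names that appear in the earlier `daily_aqi` queries:

```sql
-- Preview a few rows, returning only the columns needed
SELECT aqi_date
  , aqi
  , category
FROM daily_aqi
LIMIT 10;
```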