# 4.3 SQL and Databases: Advanced

1. Using PostgreSQL

    Learn the basics of PostgreSQL and how to use it from Python
    * The basics of PostgreSQL
    * How to create & manipulate tables in PostgreSQL 
    
2. Command Line PostgreSQL
    
    Learn how to work with PostgreSQL from the command line
    * How to work with PostgreSQL from the command line
    * How to create a user & add permissions
    
3. Projects: PostgreSQL Installation

    Learn how to install PostgreSQL and the Psycopg2 library
    * Learn how to install PostgreSQL & Psycopg2 library
    
4. Introduction To Indexing

    Learn about how SQLite accesses data and how to use indexes to speed this up
    * How the EXPLAIN query plan works 
    * How to create a table index in SQLite
    
5. Multi-Column Indexing

    Learn how to take advantage of indexing when querying multiple columns
    * How to use multiple-column indexing to speed up certain queries
    * How to understand multi-column query plan
    * What a covering index is

## 4.3.1 Using PostgreSQL

__[What are pros and cons of PostgreSQL and MySQL? With respect to reliability, speed, scalability, and features](https://www.quora.com/What-are-pros-and-cons-of-PostgreSQL-and-MySQL-With-respect-to-reliability-speed-scalability-and-features)__

__[SQLite vs MySQL vs PostgreSQL: A Comparison Of Relational Database Management Systems](https://www.digitalocean.com/community/tutorials/sqlite-vs-mysql-vs-postgresql-a-comparison-of-relational-database-management-systems)__

__[Why Uber Engineering Switched from Postgres to MySQL](https://eng.uber.com/mysql-migration/)__

__[PostgreSQL vs MySQL](https://www.2ndquadrant.com/en/postgresql/postgresql-vs-mysql/)__


<span style="color:red">IP address vs Domain Name vs URL ???</span>
<span style="color:red">Can we type in the IP address to access to a web page???</span>

### 4.3.1.1 SQLite vs PostgreSQL

So far, we've been using a database engine called __[SQLite](https://www.sqlite.org/index.html)__. SQLite is one of the most common database engines, and has many advantages:

* The database is stored in a single file, making it portable.
* You can use a SQLite database directly from Python, and don't need a separate program running.
* It implements most SQL commands, enabling you to use most of the statements you're familiar with.

However, particularly when developing larger applications, SQLite has a few __downsides__ that make other database engines more attractive:

* Only one process at a time can write to the database. When you have a complex web application, you may have multiple processes updating information in the database at the same time. For example, on Facebook, one process might handle updating user information, and another might handle generating the news feed.
* You can't take advantage of server-side performance features, such as __[caching](https://en.wikipedia.org/wiki/Cache_(computing))__. Because a SQLite database is a single file, and it doesn't require a separate server program to run, there is no long-running process that can keep optimizations like a shared cache between requests. When running a site like Facebook that has a ton of traffic, it's important to be able to look up data quickly.
* SQLite doesn't have any built-in security. With a production website, it's common to want some people to be able to modify tables in a database (write), and others to only be able to make SELECT queries to tables in the database (read). This is because giving someone write access to the database can be a security risk, in that they can update or overwrite data. SQLite doesn't allow for restricting access to a database in this way.

In general, SQLite is good in cases where having a small and simple database engine is important. SQLite is used extensively in __embedded applications__, such as Android and iOS applications.

In cases where there will be multiple users or performance is important, __[PostgreSQL](https://www.postgresql.org/)__ is one of the most commonly used database engines. PostgreSQL is open source, and is free to download and use.

In this mission, we'll look at the basics of PostgreSQL, then dive into creating a database, querying data, and some advanced features.

### 4.3.1.2 PostgreSQL overview (server/clients; system port)

__PostgreSQL__, also known as Postgres, is an extremely powerful database engine. At a high level, PostgreSQL consists of two pieces, <span style="color:red">__a server and clients__</span>. The server is a program that manages databases and handles queries. Clients communicate back and forth to the server. <span style="color:red">__Only the server__ ever directly accesses the databases</span> -- the clients can only make __requests__ to the server. If you've gone through the APIs and Web Scraping course, the communication process is very similar to making requests to a remote API.

One of the advantages of this model is that multiple clients can communicate with the server at the same time. This allows __multiple processes__ to write to a database at the same time.

It's possible to run a PostgreSQL server __either remotely or locally__. If it's remote, you connect to it via the internet. If it's local, you connect to it on your own machine. In both cases, you'll be connecting to PostgreSQL via a __[system port](https://en.wikipedia.org/wiki/Port_(computer_networking))__.

One way to think of ports is to think of receiving mail at an apartment building. Let's say 5 people live in an apartment building, but they only have a single address. All incoming mail will come to the address, then have to be sorted out and given to each person:

![img alt](Capture-3.PNG)

All incoming mail is merged into a single pile, because the whole apartment building only has one address. Each apartment occupant then has to sort through the pile to find their mail. Not only is this inefficient, it also results in some apartments getting mail that isn't theirs by accident.

We can make life easier for everyone by giving each apartment its own address:

![img alt](Capture-4.PNG)

Now, nobody has to sort mail, and it's unlikely that someone will accidentally get a message that isn't theirs.

Every computer runs dozens to hundreds of programs. Many of these programs can accept incoming connections from the internet. For instance, web servers, such as the servers that run Dataquest, run on ordinary computers and accept connections from people all over the world. Once the connections are created, data is sent along the connections.

If every program received data in the same stream, you'd have a similar situation to all of the apartments only having one address. Each program would be responsible for figuring out which messages were for it, and many messages would be sent to the wrong program. It would be impossible to know which program you were communicating with when you connected to the computer.

One way to avoid this is for __each program to have its own address__. A __system port__ is similar to an apartment number in that a port on a computer can only be used by one server at a time. For example, __web servers__ conventionally listen on __port 80__ for HTTP traffic. Any incoming messages on this port are automatically sent to the web server program.

By default, PostgreSQL uses __port 5432__ to communicate with the outside world. __If you start a PostgreSQL server__, it will listen for incoming connections on port 5432. Clients will be able to connect to the server using this port. __If you start a client__, you'll have to specify which server to connect to, along with the port to connect to.
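To make the port idea concrete, here is a minimal sketch using Python's standard `socket` module. It is not PostgreSQL-specific: a toy server listens on a port picked by the operating system (binding to port `0` is a demo assumption, chosen so the sketch never collides with a real service such as PostgreSQL on 5432), and a client reaches it by naming both the host and that port.

```python
import socket
import threading


def serve_once(server_sock):
    """Accept a single client, reply to its message, then hang up."""
    conn, _addr = server_sock.accept()
    with conn:
        data = conn.recv(1024)
        conn.sendall(b"server got: " + data)


# Port 0 asks the operating system for any free port (demo assumption).
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)
port = server.getsockname()[1]

worker = threading.Thread(target=serve_once, args=(server,))
worker.start()

# The client names both the host and the port, just as a PostgreSQL
# client would name a server address and port 5432.
with socket.create_connection(("127.0.0.1", port)) as client:
    client.sendall(b"hello")
    chunks = []
    while True:
        chunk = client.recv(1024)
        if not chunk:  # empty bytes means the server closed the connection
            break
        chunks.append(chunk)
reply = b"".join(chunks)

worker.join()
server.close()
print(f"talked to port {port}: {reply!r}")
```

Only the program bound to that port receives the message, which is the "one apartment number per occupant" idea from the analogy above.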

### 4.3.1.3 Psycopg2 (different kinds of clients, including python_client, GUI client)

There are __many clients__ for __PostgreSQL__, including __[graphical clients](https://wiki.postgresql.org/wiki/Community_Guide_to_PostgreSQL_GUI_Tools)__. The most common __Python client__ for PostgreSQL is called __[psycopg2](http://initd.org/psycopg/)__. Connecting to a PostgreSQL database using psycopg2 is similar to connecting to a SQLite database using the sqlite3 library. psycopg2 also uses __[Connection](http://initd.org/psycopg/docs/connection.html)__ and __[Cursor](http://initd.org/psycopg/docs/cursor.html)__ objects.

We'd connect to a database using __psycopg2__ like this:

```python
import psycopg2
conn = psycopg2.connect("dbname=postgres user=postgres")
cur = conn.cursor()
```

You may notice that we have to specify __both a database name and a user name__. A PostgreSQL server can have multiple databases and multiple users, so we need to specify which user we're connecting as, and which database we're connecting to.

When PostgreSQL is first installed, the __default user account__ is called __postgres__, with an associated database called __postgres__.

You may also notice that we didn't specify a server to connect to. __Psycopg2__ will default to connecting to __port 5432__ on the current computer.
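If we do want to name the server explicitly, the connection string accepts `host` and `port` keywords alongside `dbname` and `user`. The sketch below only builds the strings, so it runs without a server; `build_dsn` is a hypothetical helper written for these notes (not part of psycopg2), and `db.example.com` is a placeholder host name.

```python
def build_dsn(dbname, user, host=None, port=None):
    """Build a keyword=value connection string (hypothetical helper).

    Values containing spaces or quotes are not handled in this sketch.
    """
    parts = [f"dbname={dbname}", f"user={user}"]
    if host is not None:
        parts.append(f"host={host}")
    if port is not None:
        parts.append(f"port={port}")
    return " ".join(parts)


# Relies on the defaults described above: local server, port 5432.
local_dsn = build_dsn("dq", "dq")

# Spells out the server and port; the host name is a placeholder.
remote_dsn = build_dsn("dq", "dq", host="db.example.com", port=5432)

print(local_dsn)
print(remote_dsn)
```

Either string could then be passed to `psycopg2.connect(...)` in place of the literal used above.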

When you're done with a __Connection object__, you should close it to avoid issues where one connection prevents another from executing a query. You can close a connection like this: <span style="color:red">Q: With multiple threads or multiple users, does each connection also need to be closed??</span>

```python
conn.close()
```

Closing a connection will __terminate__ the __client's connection__ with the __PostgreSQL server__. It's a good idea to close a connection whenever you're done executing your queries.
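One way to make sure a connection is always closed, even if a query raises an error, is `contextlib.closing` from the standard library. The sketch below uses `sqlite3` as a stand-in so it runs without a PostgreSQL server (an assumption made for these notes); with PostgreSQL, the connection would come from `psycopg2.connect("dbname=dq user=dq")` instead.

```python
import sqlite3
from contextlib import closing

# sqlite3 stands in for psycopg2 here so the sketch runs without a server.
with closing(sqlite3.connect(":memory:")) as conn:
    cur = conn.cursor()
    cur.execute("SELECT 1 + 1")
    result = cur.fetchone()[0]

# Leaving the block called conn.close(), so further use raises an error.
try:
    conn.cursor()
    still_open = True
except sqlite3.ProgrammingError:
    still_open = False

print(result, still_open)
```

In psycopg2, a bare `with conn:` block ends the transaction but leaves the connection open, which is why this sketch wraps the connection in `closing(...)`.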

We've automatically started a PostgreSQL server, and created a database called dq, with an associated user called dq.

#### Instructions

* Import the __[psycopg2 library](http://initd.org/psycopg/)__ (the Python client).
* Connect to the dq database with the user dq.
* Initialize a Cursor object from the connection.
* Use the print function to display the Cursor object.
* Close the Connection using the __[close method](http://initd.org/psycopg/docs/connection.html#connection.close)__.

#### Answers
```python
import psycopg2
conn = psycopg2.connect("dbname=dq user=dq")  # Q: where is the URL for the database? No host is given, so psycopg2 connects to the local server on port 5432.
cur = conn.cursor()
print(cur)
conn.close()
```

Outputs:
```python
conn connection (<class 'psycopg2._psycopg.connection'>)
    <connection object at 0x7fae123d37b8; dsn: 'dbname=dq user=dq', closed: 1>
cur cursor (<class 'psycopg2._psycopg.cursor'>)
    <cursor object at 0x7fae123cac78; closed: 0>
```

### 4.3.1.4 Creating a table

Once we've connected to a database, we can create a table inside that database. You may recall the __CREATE TABLE__ statement from an earlier mission:

```SQL
CREATE TABLE tableName(
   column1 dataType1 PRIMARY KEY,
   column2 dataType2,
   column3 dataType3,
   ...
);
```

We can use the same syntax to create a table in the __dq database__. In order to execute the query, we can use the __[execute method](http://initd.org/psycopg/docs/cursor.html#cursor.execute)__ of the __[Cursor object](http://initd.org/psycopg/docs/cursor.html)__:

```PYTHON
conn = psycopg2.connect("dbname=dq user=dq")
cur = conn.cursor()
cur.execute("SELECT * FROM notes;")
```

The above code will connect to the database __dq__, then execute a query. The syntax above should look familiar to you from using the __[sqlite3](https://docs.python.org/3.5/library/sqlite3.html)__ library, as the Connection and Cursor methods used here have the same names.

#### Instructions

* Connect to the dq database as the user dq
* Write a SQL query that creates a table called notes in the dq database, with the following columns and data types:
    * id -- integer data type, and is a primary key.
    * body -- text data type.
    * title -- text data type.
* Execute the query using the execute method.
* Close the Connection using the close method.

#### Answers
```python
import psycopg2
conn = psycopg2.connect('dbname=dq user=dq')
cur = conn.cursor()

query = '''
    CREATE TABLE notes (
        id integer PRIMARY KEY,
        body text,
        title text
        );
        '''
cur.execute(query)
conn.close()
```


### 4.3.1.5 SQL Transactions

If you checked the database dq now, you would notice that there actually isn't a notes table inside it. This isn't a bug -- it's because of a concept called __SQL transactions__. __With SQLite__, every query we made that modified the data was __immediately executed__, and immediately changed the database.

__With PostgreSQL__, we're dealing with multiple users who could be changing the database at the same time. Let's imagine a simple scenario where we're keeping track of accounts for different customers of a bank. We could write a simple query to create a table for this:

```SQL
CREATE TABLE accounts(
   id integer PRIMARY KEY,
   name text,
   balance float
);
```

Let's say we have the following two rows in the table:

```SQL
id    name    balance
1     Jim     100
2     Sue     200
```

Let's say Sue gives 100 dollars to Jim. We could model this with two queries:

```SQL
UPDATE accounts SET balance=200 WHERE name='Jim';

UPDATE accounts SET balance=100 WHERE name='Sue';
```

In the above example, we remove 100 dollars from Sue, and add 100 dollars to Jim. Let's say either the second UPDATE statement has an error in it, the database fails, or another user has a conflicting query. The first query would run properly, but the second would fail. That would result in the following:

```SQL
id    name    balance
1     Jim     200
2     Sue     200
```

Jim would be credited 100 dollars, but 100 dollars would not be removed from Sue. This would cause the bank to lose money.

Transactions prevent this type of behavior by ensuring that all the queries in a __transaction block__ are __applied as a single unit__. If any of the queries fail, the __whole group fails__, and no changes are made to the database at all.

Whenever we run queries on a __[Connection](http://initd.org/psycopg/docs/connection.html)__ in __psycopg2__, a new transaction is automatically started. All queries run up until the __[commit method](http://initd.org/psycopg/docs/connection.html#connection.commit)__ is called will be placed into the __same transaction block__. When commit is called, PostgreSQL makes the changes from all of those queries permanent at once.

If we don't want to apply the changes in the transaction block, we can call the __[rollback method](http://initd.org/psycopg/docs/connection.html#connection.rollback)__ to remove the transaction. Not calling either commit or rollback will cause the transaction to stay in a __pending state__, and will result in the changes not being applied to the database.
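The bank transfer above can be sketched as an all-or-nothing operation with commit and rollback. This example uses `sqlite3` with an in-memory database as a stand-in so it runs without a PostgreSQL server (an assumption for these notes); the `transfer` and `balances` helpers and the `fail_midway` flag are hypothetical names. With psycopg2 the pattern is the same, except that query placeholders are written `%s` rather than `?`.

```python
import sqlite3


def transfer(conn, sender, receiver, amount, fail_midway=False):
    """Move money between two accounts as one all-or-nothing transaction."""
    cur = conn.cursor()
    try:
        cur.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = ?",
            (amount, receiver),
        )
        if fail_midway:
            raise RuntimeError("simulated failure between the two updates")
        cur.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = ?",
            (amount, sender),
        )
        conn.commit()    # both updates become permanent together
        return True
    except RuntimeError:
        conn.rollback()  # discard the half-finished transfer
        return False


def balances(conn):
    """Return (name, balance) rows ordered by id."""
    return conn.execute("SELECT name, balance FROM accounts ORDER BY id").fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id integer PRIMARY KEY, name text, balance float)")
conn.execute("INSERT INTO accounts VALUES (1, 'Jim', 100), (2, 'Sue', 200)")
conn.commit()

failed = transfer(conn, "Sue", "Jim", 100, fail_midway=True)
after_failure = balances(conn)

succeeded = transfer(conn, "Sue", "Jim", 100)
after_success = balances(conn)
conn.close()

print(failed, after_failure)
print(succeeded, after_success)
```

After the simulated failure, the rollback leaves both balances untouched, so Jim is never credited without Sue being debited.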

#### Instructions
* Connect to the dq database as the user dq.
* Write a SQL query that creates a table called notes in the dq database, with the following columns and data types:
    * id -- integer data type, and is a primary key.
    * body -- text data type.
    * title -- text data type.
* Execute the query using the execute method.
* Use the commit method on the Connection object to apply the changes in the transaction to the database.
* Close the Connection.


#### Answers
```python
import psycopg2

conn = psycopg2.connect('dbname=dq user=dq')
cur = conn.cursor()
query = '''
CREATE TABLE notes(
    id integer PRIMARY KEY,
    body text,
    title text
    );
    '''

cur.execute(query)
conn.commit()
conn.close()
```

### 4.3.1.6 Autocommitting

There are cases when you won't want to __manage a transaction__, and you'll instead want changes right away. This is most common when you're making changes to the database that you want to be __guaranteed to happen immediately__.

Some changes also have such widespread effects that they can't be wrapped inside of a transaction. One example of this is creating a database. When creating a database, we'll need to activate autocommit mode first.

To activate autocommit mode, we'll need to set the __[autocommit](http://initd.org/psycopg/docs/connection.html#connection.autocommit)__ property of the __[Connection object](http://initd.org/psycopg/docs/connection.html)__ to __True__.

```Python
conn = psycopg2.connect("dbname=dq user=dq")
conn.autocommit = True
cur = conn.cursor()
cur.execute("CREATE TABLE notes(id integer PRIMARY KEY, body text, title text)")
```

The above command will create a table called notes without having to explicitly commit the transaction. We'll then be able to use the notes table right away.

#### Instructions

* Connect to the dq database as the user dq.
* Set the autocommit property of the Connection object to True.
* Write a SQL query that creates a table called facts in the dq database, with the following columns and data types:
    * id -- integer data type, and is a primary key.
    * country -- text data type.
    * value -- integer data type.
* Execute the query using the execute method.
* Close the Connection.

#### Answers
```python
import psycopg2

conn = psycopg2.connect('dbname=dq user=dq')
conn.autocommit = True
cur = conn.cursor()
query = '''
    CREATE TABLE facts(
        id integer PRIMARY KEY,
        country text,
        value integer
        );
        '''
cur.execute(query)
conn.close()
```

### 4.3.1.7 Executing queries

We can issue SELECT queries against a database using the __execute method__, along with the __fetchall__ and __[fetchone methods](http://initd.org/psycopg/docs/cursor.html#cursor.fetchone)__ to retrieve results:

```python
cur.execute("SELECT * FROM notes;")
rows = cur.fetchall()
print(rows)
```

The above code will select all of the rows in the __notes table__, then print them all out.

Of course, we don't have any rows in our table yet. You may recall how to insert rows from a previous mission:

```SQL
INSERT INTO tableName
VALUES (value1, value2, ...);
```

The below query will insert a row into the __notes table__:

```SQL
INSERT INTO notes
VALUES (1, 'This is my note text.', 'Test note');
```
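To see how fetchone and fetchall differ, here is a small sketch. It uses `sqlite3` with an in-memory database as a stand-in so it runs without a PostgreSQL server (an assumption for these notes); psycopg2 cursors expose methods with the same names. The second note row is made-up sample data.

```python
import sqlite3

# sqlite3 stands in for psycopg2 here so the sketch runs without a server.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE notes (id integer PRIMARY KEY, body text, title text)")
cur.execute("INSERT INTO notes VALUES (1, 'This is my note text.', 'Test note')")
cur.execute("INSERT INTO notes VALUES (2, 'A second note.', 'Another note')")

cur.execute("SELECT * FROM notes ORDER BY id")
first_row = cur.fetchone()   # one tuple: the next row in the result set
remaining = cur.fetchall()   # list of tuples: every row not yet fetched
exhausted = cur.fetchone()   # None once the result set is used up
conn.close()

print(first_row)
print(remaining)
print(exhausted)
```

fetchone returns a single tuple (or None when no rows are left), while fetchall returns a list of tuples.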

#### Instructions
* Connect to the dq database as the user dq.
* Execute a SQL query that inserts a row into the notes table with the following values:
    * id -- 1
    * body -- 'Do more missions on Dataquest.'
    * title -- 'Dataquest reminder'.
* Execute a SQL query that selects all of the rows from the notes table.
* Fetch all of the results and print them out.
* Close the Connection.

#### Answers
```python
import psycopg2

conn = psycopg2.connect('dbname=dq user=dq')
conn.autocommit = True
cur = conn.cursor()
cur.execute('''INSERT INTO notes VALUES(1, 'Do more missions on Dataquest.', 'Dataquest reminder');''')
query = '''
    SELECT *
    FROM notes;
    '''
cur.execute(query)
print(cur.fetchall())
conn.close()
```

Outputs:
```python
[(1, 'Do more missions on Dataquest.', 'Dataquest reminder')]
```

### 4.3.1.8 Creating a database

One of the __most powerful aspects__ of PostgreSQL is that it enables you to create __multiple databases__. Different databases are generally used to hold information about different applications. For instance, if you have the following three datasets and applications:

* An application that enables you to add and remove friends in your neighborhood.
* A dataset on household income worldwide.
* An application that allows you to store and share notes.

You could in theory make different tables for each of these in an existing database. But eventually, you'll reach a point where each application has multiple tables, due to foreign keys and joins. It will __get messy to manage__ all the tables for each application separately. By storing data for a single application in a single database, we __encapsulate that application__, and make it easier to manage and alter the data for it.

We can create a database using the __CREATE DATABASE SQL statement__:

```SQL
CREATE DATABASE dbName;
```

Here's a concrete example:

```SQL
CREATE DATABASE notes;
```

The above SQL command will create a database called notes. We can specify the user who will own the database when we create it as well, using the __OWNER statement__:

```SQL
CREATE DATABASE notes OWNER postgres;
```

The above statement will create a database called notes with the __default postgres user as the owner__. The owner of a database is the only one that can access and modify a database, unless they give __permission__ to other users. An exception is __superusers__, who we'll cover in a later mission, who can perform any action on any database without being given permission.
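A database name and an owner are identifiers rather than values, so they can't be supplied as ordinary query parameters. One cautious approach, sketched below, is to validate the names before building the statement. `create_database_sql` is a hypothetical helper written for these notes, and it only accepts plain unquoted identifiers.

```python
import re

# Plain unquoted identifiers only: a letter or underscore,
# followed by letters, digits, or underscores.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def create_database_sql(db_name, owner=None):
    """Return a CREATE DATABASE statement (hypothetical helper)."""
    for name in (db_name, owner):
        if name is not None and not _IDENTIFIER.fullmatch(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    statement = f"CREATE DATABASE {db_name}"
    if owner is not None:
        statement += f" OWNER {owner}"
    return statement + ";"


print(create_database_sql("notes"))
print(create_database_sql("notes", owner="postgres"))
```

The resulting string could be passed to the cursor's execute method on a connection that is in autocommit mode.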

#### Instructions
* Connect to the dq database with the user dq.
* Set the connection to autocommit mode.
* Create a database called income where the owner is the user dq.
* Close the __[Connection](http://initd.org/psycopg/docs/connection.html)__.

#### Answers
```PYTHON
import psycopg2

conn = psycopg2.connect('dbname=dq user=dq')
conn.autocommit = True
cur = conn.cursor()
query_create = '''
CREATE DATABASE income OWNER dq;
'''
cur.execute(query_create)
conn.close()
```

### 4.3.1.9 Deleting a database

We can delete a database using the __DROP DATABASE statement__. The DROP DATABASE statement will immediately remove a database, provided the user executing the query has the __right permissions__. It should be __used with caution__ when working with real data.

```SQL
DROP DATABASE dbName;
```

Here's a more concrete example:

```SQL
DROP DATABASE income;
```

The above statement will remove the database called income, along with any tables it contains.

#### Instructions
* Connect to the dq database with the user dq.
* Set the connection to autocommit mode.
* Drop the income database.
* Close the Connection.

#### Answers
```PYTHON
import psycopg2

conn = psycopg2.connect('dbname=dq user=dq')
conn.autocommit = True
cur = conn.cursor()
query = "DROP DATABASE income;"
cur.execute(query)
conn.close()
```


### 4.3.1.10 Next Steps

In this mission, we covered the __basics of PostgreSQL__, along with __transactions__ and working with databases. In the next mission, we'll look at managing databases, users, and permissions in PostgreSQL.

## 4.3.2 Command line PostgreSQL



### 4.3.2.1 The psql tool

In the last mission, we worked with __[PostgreSQL](https://www.postgresql.org/)__, or Postgres, databases and tables. In this mission, we'll learn how to work with the PostgreSQL command line tool, called __[psql](https://www.postgresql.org/docs/9.4/static/app-psql.html)__.

psql is __similar to__ the sqlite3 command line tool __in that__ it allows you to connect to and manage databases. psql connects to a running __PostgreSQL server process__, then enables you to:

* Run queries.
* Manage users and permissions.
* Manage databases.
* See PostgreSQL system information.

By default, psql will connect to a __PostgreSQL server__ running on the __current computer__, using __port 5432__. If you don't specify a user and database to connect to, it will use the defaults. __By default__, the name of the currently logged in system user will be used as __both the PostgreSQL user name and database name__.

If you're logged in to a computer as the system user dq, then type __psql__, you will connect to the dq database as a PostgreSQL user called dq. We'll learn later on how to connect to different databases using different PostgreSQL users.

After you're finished working with psql, you can exit using the __\q command__.

#### Instructions
* Start the PostgreSQL command line tool by typing psql.
* Exit psql by typing \q.

#### Answers


### 4.3.2.2 Running SQL queries

After starting the psql command line tool, we can run SQL queries. Any valid SQL query will be executed. Because the psql shell is about giving __instant feedback__, it runs in __autocommit mode by default__: each command we type is immediately executed and committed, without us having to manage a transaction. This allows us to quickly test out queries and get results.

Since creating a database is one SQL query, we can do it via psql. You may recall that the syntax to create a database is like the following:

```SQL
CREATE DATABASE dbName;
```

Queries in psql must end with a __semicolon (;)__, or they won't be performed.

#### Instructions
* Start the psql command line tool.
* Create a database called bank_accounts
* Exit the psql command line tool.

#### Answers
```bash
psql <<EOF
CREATE DATABASE bank_accounts;
EOF
```

### 4.3.2.3 Special PostgreSQL commands

We can run several special commands using psql. These commands start with a __backslash (\)__, and can perform a variety of functions, including:

* Listing databases
* Listing tables
* Managing users

You can see a full list of all of the special functions by running __\?__ after starting psql. You'll need to type __q to exit__ the resulting help interface. You can also find the full list here.

Three common commands to run are:

* \l -- list all available databases.
* \dt -- list all tables in the current database.
* \du -- list users.

#### Instructions
* Start psql.
* List all available databases.
* Exit psql.

### 4.3.2.4 Switching databases

When we're connected to a specific SQL database, we can only create tables within that database, and run queries on tables in that database. In the past few screens, we've been connected to the dq database. This prevents us from manipulating tables in the bank_accounts database.

You can connect to a different database using the __-d option of psql__. If you wanted to connect to a database called dataquest, you could use the following command:

```bash
psql -d dataquest
```
psql will start connected to the specified database, and you'll be able to create tables in the database.
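When psql needs to be driven from a Python script, the same options can be assembled as an argument list. The sketch below only builds the list, so it runs without PostgreSQL installed; `psql_command` is a hypothetical helper written for these notes, while `-d`, `-U`, and `-c` are standard psql options.

```python
def psql_command(dbname=None, user=None, command=None):
    """Build a psql argument list (hypothetical helper)."""
    args = ["psql"]
    if dbname is not None:
        args += ["-d", dbname]   # database to connect to
    if user is not None:
        args += ["-U", user]     # PostgreSQL user to connect as
    if command is not None:
        args += ["-c", command]  # run one command, then exit
    return args


list_tables = psql_command(dbname="bank_accounts", command="\\dt")
print(list_tables)
```

On a machine with psql installed, the list could be handed to `subprocess.run(list_tables)` from the standard library.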

#### Instructions
* Start psql and connect to the bank_accounts database.
* Create a table called deposits in bank_accounts with the following columns:
    * id, integer, primary key
    * name, text
    * amount, float
* Use the \dt command to list all of the tables in bank_accounts.
* Exit psql.

#### Answers
```bash
psql -d bank_accounts <<EOF
CREATE TABLE deposits (
    id integer PRIMARY KEY,
    name text,
    amount float
);
\dt
EOF
```

### 4.3.2.5 Creating users

In order to __manage access__ to different databases, you can also __create users__. Users will be able to log into a PostgreSQL database and run queries. You can create a user with the __[CREATE ROLE statement](https://www.postgresql.org/docs/9.4/static/sql-createrole.html)__. Here's how the statement looks:

```SQL
CREATE ROLE userName;
```

By default, the user isn't allowed to login to PostgreSQL and run queries. You can fix this by adding the WITH and LOGIN statements:

```SQL
CREATE ROLE userName WITH LOGIN;
```

If you run the pseudo-code above with a real username, you may be unable to login as that user. Depending on the configuration of your PostgreSQL instance, you may either be unable to login entirely, or will only be able to login when your system user name is the same as the PostgreSQL user name you want to login as. You can __get around this by creating a password__ -- you'll then be able to login using the password. We'll cover PostgreSQL __authentication__ and __login methods__ in more depth in a later mission.

You can create a password using the __WITH PASSWORD statement__ like this:

```SQL
CREATE ROLE userName WITH LOGIN PASSWORD 'password';
```

If the user needs to be able to create databases, you can add that ability in with the __CREATEDB statement__:

```SQL
CREATE ROLE userName WITH CREATEDB LOGIN PASSWORD 'password';
```

As you may be able to tell from above, we can __keep modifying__ how the user is created by __adding statements after the WITH__ statement. Some other statements we can add are:

* CREATEROLE -- allows the user to create other users.
* SUPERUSER -- makes the user a superuser. We'll cover what a superuser is later on.

For a full list of statements that can be added, see __[here](https://www.postgresql.org/docs/9.4/static/sql-createrole.html)__.
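The way options stack up after WITH can be sketched as a small statement builder. `create_role_sql` is a hypothetical helper written for these notes; it covers only the options discussed above, accepts only plain unquoted role names, and escapes single quotes in the password by doubling them, which assumes PostgreSQL's default string-literal settings.

```python
import re

_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def create_role_sql(name, login=False, createdb=False, password=None):
    """Return a CREATE ROLE statement (hypothetical helper)."""
    if not _IDENTIFIER.fullmatch(name):
        raise ValueError(f"unsafe role name: {name!r}")
    options = []
    if login:
        options.append("LOGIN")
    if createdb:
        options.append("CREATEDB")
    if password is not None:
        escaped = password.replace("'", "''")  # double any single quotes
        options.append(f"PASSWORD '{escaped}'")
    statement = f"CREATE ROLE {name}"
    if options:
        statement += " WITH " + " ".join(options)
    return statement + ";"


print(create_role_sql("sec", login=True, createdb=True, password="test"))
```

Each keyword argument maps to one option, so adding CREATEROLE or SUPERUSER would follow the same pattern.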

#### Instructions
* Start psql.
* Create a user called sec with the following modifying statements:
    * LOGIN
    * PASSWORD 'test'
    * CREATEDB
* List all the users using \du.
* Exit psql.


#### Answers
```bash
psql <<EOF
CREATE ROLE sec WITH LOGIN CREATEDB PASSWORD 'test';
\du
EOF
```

### 4.3.2.6 Adding permissions

When users are created, they don't have any ability, or permissions, to access tables in existing databases. This is done for security reasons, so that __all permissions are issued explicitly__ instead of being unexpected. You can issue permissions to a user using the __[GRANT statement](https://www.postgresql.org/docs/9.4/static/sql-grant.html)__. The __GRANT statement__ will __issue permissions to access certain tables in a database to a certain user__. You can allow a user to perform SELECT queries on a given table like this:

```SQL
GRANT SELECT ON tableName TO userName;
```

If you want to grant different types of permissions, you can separate them with commas. The below query will allow a given user to query data from a table, update rows in the table, insert rows into the table, and delete rows from the table:

```SQL
GRANT SELECT, INSERT, UPDATE, DELETE ON tableName TO userName;
```

A shortcut for this is to use the __ALL PRIVILEGES statement__:

```SQL
GRANT ALL PRIVILEGES ON tableName TO userName;
```

You can use the psql __\dp__ command to find out what privileges have been granted to users for a specific table:

```SQL
\dp tableName
```

#### Instructions

* Start psql and connect to the bank_accounts database.
* Grant all privileges on the table deposits to the user sec.
* List all the privileges for deposits using \dp.
* Exit psql.

#### Answers
```bash
psql -d bank_accounts <<EOF
GRANT ALL PRIVILEGES ON deposits TO sec;
\dp deposits
EOF
```

### 4.3.2.7 Removing permissions

There are times when you'll want to __remove permissions__ that you granted to a user previously. Permissions can be removed using the REVOKE statement. The __[REVOKE statement](https://www.postgresql.org/docs/9.4/static/sql-revoke.html)__ enables you to take back any permissions given via the __GRANT statement__. You can revoke the ability for a user to run queries:

```SQL
REVOKE SELECT ON tableName FROM userName;
```

If you want to revoke different types of permissions, you can separate them with commas. The below query will revoke permissions for a given user to query data from a table, update rows in the table, insert rows into the table, and delete rows from the table:

```SQL
REVOKE SELECT, INSERT, UPDATE, DELETE ON tableName FROM userName;
```

A shortcut for this is to use the __ALL PRIVILEGES statement__:

```SQL
REVOKE ALL PRIVILEGES ON tableName FROM userName;
```

The above syntax likely looks very __similar to the GRANT syntax__ from the last screen. This is __by design__, and both are as similar as possible to make adding and removing permissions straightforward.
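The symmetry between the two statements can be sketched with a single builder, where only the leading keyword and the TO/FROM word change. `privilege_sql` is a hypothetical helper written for these notes, and it does no validation of the table or user names.

```python
def privilege_sql(action, privileges, table, user):
    """Return a GRANT or REVOKE statement (hypothetical helper)."""
    # GRANT pairs with TO, REVOKE pairs with FROM.
    connector = {"GRANT": "TO", "REVOKE": "FROM"}[action]
    return f"{action} {', '.join(privileges)} ON {table} {connector} {user};"


grant = privilege_sql("GRANT", ["SELECT", "INSERT"], "deposits", "sec")
revoke = privilege_sql("REVOKE", ["SELECT", "INSERT"], "deposits", "sec")

print(grant)
print(revoke)
```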

#### Instructions

* Start psql and connect to the bank_accounts database.
* Revoke all privileges on the table deposits from the user sec.
* List all the privileges for deposits using \dp.
* Exit psql.

#### Answers
```bash
psql -d bank_accounts <<EOF
REVOKE ALL PRIVILEGES ON deposits FROM sec;
\dp deposits
EOF
```

### 4.3.2.8 Superusers

A superuser is a special type of user that __overrides all access restrictions__. Superusers can perform any function in a database, and a user should only be made a superuser in special cases. Adding the __SUPERUSER statement__ to a CREATE ROLE statement will make a user a superuser:

```SQL
CREATE ROLE userName WITH SUPERUSER;
```

You can also set up login and a password for the superuser:

```SQL
CREATE ROLE userName WITH LOGIN PASSWORD 'password' SUPERUSER;
```

#### Instructions

* Start psql.
* Create a user called aig with the following modifying statements:
    * LOGIN
    * PASSWORD 'test'
    * SUPERUSER
* List all the users using \du.
* Exit psql.

#### Answers
```bash
psql <<EOF
CREATE ROLE aig WITH LOGIN PASSWORD 'test' SUPERUSER;
\du
EOF
```

## 4.3.3 Projects: PostgreSQL Installation

* sqlite3 is a standard module in Python 3.x, which is why we don't need to install anything extra to use SQLite.
* psycopg2 is not part of the Python standard library, and PostgreSQL itself is a separate program, so both need to be installed before we can use PostgreSQL from Python.

__[Python Module Index](https://docs.python.org/3/py-modindex.html)__

* local installation: 
    * Add the directory of PostgreSQL into System PATH (under folder bin/)
    * To start it from command line: psql -U username
    * To change the password in PostgreSQL shell: \password
    * username: postgres
    * password: 12345


### 4.3.3.1 Introduction

So far, we've explored many database concepts using SQLite and PostgreSQL. In this project, we'll walk through how to install the PostgreSQL database system and the psycopg2 Python library for Windows, Mac, and Linux. We'll be focusing on installing and running PostgreSQL __locally__ on your own machine instead of on a remote server.

### 4.3.3.2 Installing PostgreSQL

First things first, let's install PostgreSQL. During the setup process, you'll be asked to __specify a default username and password__. Select a username and password combination you'll remember since you're only installing PostgreSQL locally and don't need a highly secure combination.

Also during installation, you may be asked to __specify a port number__. <span style="color:red">Even though PostgreSQL will be running on the same machine, __other applications__ communicate with it through the port __as if it were on a remote machine__</span>. Port number __5432__ is the default for PostgreSQL.

Here are the installation instructions for each operating system:

#### Mac:

Download Postgres.app __[here](https://postgresapp.com/)__, move it to the Applications folder, and double-click to launch. This application runs in the background, and it needs to be running for you to connect to it from Python. By default, PostgreSQL will run on port 5432.
Add the following line to the end of ~/.bash_profile:
export PATH=\$PATH:/Applications/Postgres.app/Contents/Versions/latest/bin

#### Windows:

* Download the latest Windows installer __[here](https://www.postgresql.org/download/windows/)__, double click the installer, and go through the installation wizard.
* When asked for a port number, use 5432.

#### Linux:

* Read the installation directions for your specific flavor of Linux from the official documentation.
* If asked for a port number, use 5432.

To test your installation, open your command line application, type __psql__, and you should be in the PostgreSQL shell.

To log in with a specific username, type __psql -U (your username)__. For example, if you're trying to access the database as the user "myuser", you would use the following psql login command: __psql -U myuser__. After you execute the command, enter the password when prompted to connect to the database.

### 4.3.3.3 Psycopg2

You can install this library using Anaconda:

```SHELL
conda install psycopg2
```

* If the conda installation doesn't work, try pip instead:

```SHELL
pip install psycopg2
```

### 4.3.3.4 Connecting to PostgreSQL from psycopg2

Launch your Python shell and import the psycopg2 library. Then, run the following code to connect to PostgreSQL and test that everything works as expected:

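```python
import psycopg2

# Connect to the default "postgres" database as the "postgres" user.
# Pass password="..." as well if your installation requires one.
conn = psycopg2.connect(dbname="postgres", user="postgres")
conn.autocommit = True  # run each statement immediately, without an explicit commit
cursor = conn.cursor()
cursor.execute("CREATE TABLE notes(id integer PRIMARY KEY, body text, title text)")
conn.close()
```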

If no errors were returned, then your setup is good to go! If you run into any issues, use Google, StackOverflow, and the Dataquest member-only Slack community to get help.

## 4.3.4 Introduction to Indexing

To return the schema of a table:

* With the SQLite shell special command: .schema table
* With a query: PRAGMA table_info(table)

### 4.3.4.1 Introduction

In previous missions, we've explored how to __query, modify, and create tables in a database__. In this mission, we'll explore __how queries are executed__ in SQLite. After exploring this at a high level, we'll explore how to __create and use indexes for better performance__. As our data gets larger and our queries more complex, it's important to be able to tweak the queries we write and optimize a database's schema to ensure that we're getting results back quickly.

To explore __database performance__, we'll work with __factbook.db__, a SQLite database that contains information about each country in the world. We'll be working with the __facts table__ in the database. Each row in facts represents a single country, and contains several columns, including:

* name -- the name of the country.
* area -- the total land and sea area of the country.
* population -- the population of the country.
* birth_rate -- the birth rate of the country.
* created_at -- the date the record was created.
* updated_at -- the date the record was updated.

Here are the first few rows of facts:

<table class="table table-bordered">
<tbody><tr>
<th>id</th>
<th>code</th>
<th>name</th>
<th>area</th>
<th>area_land</th>
<th>area_water</th>
<th>population</th>
<th>population_growth</th>
<th>birth_rate</th>
<th>death_rate</th>
<th>migration_rate</th>
<th>created_at</th>
<th>updated_at</th>
</tr>
<tr>
<td>1</td>
<td>af</td>
<td>Afghanistan</td>
<td>652230</td>
<td>652230</td>
<td>0</td>
<td>32564342</td>
<td>2.32</td>
<td>38.57</td>
<td>13.89</td>
<td>1.51</td>
<td>2015-11-01 13:19:49.461734</td>
<td>2015-11-01 13:19:49.461734</td>
</tr>
<tr>
<td>2</td>
<td>al</td>
<td>Albania</td>
<td>28748</td>
<td>27398</td>
<td>1350</td>
<td>3029278</td>
<td>0.3</td>
<td>12.92</td>
<td>6.58</td>
<td>3.3</td>
<td>2015-11-01 13:19:54.431082</td>
<td>2015-11-01 13:19:54.431082</td>
</tr>
<tr>
<td>3</td>
<td>ag</td>
<td>Algeria</td>
<td>2381741</td>
<td>2381741</td>
<td>0</td>
<td>39542166</td>
<td>1.84</td>
<td>23.67</td>
<td>4.31</td>
<td>0.92</td>
<td>2015-11-01 13:19:59.961286</td>
<td>2015-11-01 13:19:59.961286</td>
</tr>
</tbody></table>

Before we dive in further, let's set up our environment and explore the data.

#### Instructions

* Write a query that returns the schema of the facts table and assign the resulting list of tuples to schema.
* Use a for loop and a print statement to display each tuple in schema on a separate line.

#### Answers
```PYTHON
import sqlite3
conn = sqlite3.connect("factbook.db")
schema = conn.execute("pragma table_info(facts);").fetchall() ## a shortcut mechanism in SQLite to skip Cursor explicitly
for s in schema:
    print(s)
```

Outputs:
```PYTHON
(0, 'id', 'INTEGER', 1, None, 1)
(1, 'code', 'varchar(255)', 1, None, 0)
(2, 'name', 'varchar(255)', 1, None, 0)
(3, 'area', 'integer', 0, None, 0)
(4, 'area_land', 'integer', 0, None, 0)
(5, 'area_water', 'integer', 0, None, 0)
(6, 'population', 'integer', 0, None, 0)
(7, 'population_growth', 'float', 0, None, 0)
(8, 'birth_rate', 'float', 0, None, 0)
(9, 'death_rate', 'float', 0, None, 0)
(10, 'migration_rate', 'float', 0, None, 0)
(11, 'created_at', 'datetime', 0, None, 0)
(12, 'updated_at', 'datetime', 0, None, 0)
```

### 4.3.4.2 Query planner

When you execute a SQL query, SQLite performs many steps before returning the results to you. First, it tokenizes and parses your query to look for any __syntax errors__. If there are any syntax errors, the query execution process halts and the error message is returned to you. If the parser was able to successfully parse the query, then SQLite moves on to the query __planning and optimization phase__.

There are many different ways for SQLite to access the underlying data in a database. When working with a database that's stored on disk as a file, it's crucial to minimize the number of disk reads to avoid long running times. The __query optimizer__ generates cost estimates for the various ways to access the underlying data, factoring in the schema of the tables and the operations the query requires. The heuristics and algorithms involved in query optimization are complex and out of this mission's scope.

The optimizer quickly assesses the __various ways__ to access the data and generates __a best guess__ for the fastest __query plan__. This high level query plan is then converted into low-level __bytecode__, which SQLite's virtual machine (itself written in C) runs to interact with the database file on disk. Thankfully, we can observe the query plan to understand what SQLite is doing to return our results.

### 4.3.4.3 Explain query plan

We can use the __EXPLAIN QUERY PLAN statement__ before any query we're running to get a high level query plan that would be performed. If you write a __SELECT statement__ and place the __EXPLAIN QUERY PLAN statement__ before it:

```SQL
EXPLAIN QUERY PLAN SELECT * FROM facts;
```

the results of the SELECT query won't be returned and instead the high level query plan will be:

```SQL
[(0, 0, 0, 'SCAN TABLE facts')]
```

In this mission, we'll focus on __'SCAN TABLE facts'__, the last value from the returned tuple. __SCAN TABLE__ means that every row in the entire table (facts) had to be accessed to evaluate the query. Since the SELECT query we wrote returns all of the columns and rows in the facts table, the entire table had to be accessed to get the results we requested.

When running the query using the sqlite3 library, you'll still need to use the __fetchall() method__.

```python
query_plan = conn.execute("EXPLAIN QUERY PLAN SELECT * FROM facts;").fetchall()
```

The query plan is returned as a list of tuples, which is how the sqlite3 library represents the rows of a result.
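As a minimal sketch of the same idea outside of factbook.db, the snippet below builds a small in-memory table and prints only the last value of each query plan tuple. The `demo` table, its rows, and the name `demo_conn` are made up for illustration. The exact wording of the plan differs between SQLite versions (newer releases print `SCAN demo` rather than `SCAN TABLE demo`), so treat the output as indicative only.

```python
import sqlite3

# Hypothetical in-memory table, used only for illustration.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE demo (id INTEGER PRIMARY KEY, name TEXT, area INTEGER)")
demo_conn.executemany(
    "INSERT INTO demo (name, area) VALUES (?, ?)",
    [("A", 100), ("B", 50000), ("C", 70000)],
)

plan = demo_conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM demo WHERE area > 40000;"
).fetchall()

# Each row of the plan is a tuple; the human-readable step is its last value.
details = [row[-1] for row in plan]
print(details)
demo_conn.close()
```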

#### Instructions

* Return the query plan for the query that returns all columns and rows where area exceeds 40000. Assign the results to query_plan_one.

* Return the query plan for the query that returns only the area column for all rows where area exceeds 40000. Assign the results to query_plan_two.

* Return the query plan for the query that returns the row for the country Czech Republic. Assign the results to query_plan_three.

* Use the print function to display each query plan.

#### Answers
```PYTHON
query_plan_one = conn.execute("EXPLAIN QUERY PLAN SELECT * FROM facts WHERE area>40000;").fetchall()
query_plan_two = conn.execute("EXPLAIN QUERY PLAN SELECT area FROM facts WHERE area>40000;").fetchall()
query_plan_three = conn.execute("EXPLAIN QUERY PLAN SELECT * FROM facts WHERE name='Czech Republic';").fetchall()
print(query_plan_one)
print(query_plan_two)
print(query_plan_three)
```

Outputs:
```python
[(0, 0, 0, 'SCAN TABLE facts')]
[(0, 0, 0, 'SCAN TABLE facts')]
[(0, 0, 0, 'SCAN TABLE facts')]

```

### 4.3.4.4 Data representation

You'll notice that all 3 query plans are exactly the same. The entire facts table had to be accessed to return the data we needed for all 3 queries. Even though all the queries asked for a __subset__ of the facts table, SQLite still __ends up scanning the entire table__. Why is this? This is because of the way SQLite represents data.

For the facts table, we set the __id column as the primary key__ and SQLite uses this column to order the records in the database file. Since the rows are __ordered by id__, SQLite can search for a specific row based on its id value using __binary search__. Unless we provide specific id values in the WHERE clause of the query, SQLite can't take advantage of binary search and instead has to scan the __entire table, row by row__. To return the results for the first 2 queries, SQLite has to:

* access the first row in the table (lowest id value),
    * check if that row's value for area exceeds 40000 and store the row separately in a temporary collection if it is,
* move onto the next row,
    * check if that row's value for area exceeds 40000 and store the row separately in a temporary collection if it is,
* repeat moving and checking each row for the rest of the table,
* return the final collection of rows that meet the criteria.

Here's a diagram of what that looks like:

![img alt](Capture-diagram-queryplan.png)
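The row-by-row process described above can be sketched in plain Python. This is only an illustration of the idea, not how SQLite is implemented; the function name `full_table_scan` is hypothetical, and the three rows are taken from the sample of facts shown earlier.

```python
# Rows as (id, name, area) tuples, ordered by id.
rows = [
    (1, "Afghanistan", 652230),
    (2, "Albania", 28748),
    (3, "Algeria", 2381741),
]

def full_table_scan(rows, min_area):
    """Check every row in order and collect the ones whose area exceeds min_area."""
    matches = []        # the temporary collection
    rows_accessed = 0
    for row in rows:    # move from one row to the next
        rows_accessed += 1
        if row[2] > min_area:
            matches.append(row)
    return matches, rows_accessed

matches, rows_accessed = full_table_scan(rows, 40000)
print(matches)
print(rows_accessed)  # every row was accessed, whether or not it matched
```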

If we were instead interested in a row with __a specific id value__, like in the following query:

```SQL
SELECT * FROM facts WHERE id=15;
```

SQLite can use __binary search__ to quickly find the corresponding row at that id value. Instead of performing a full table scan, SQLite would:

* use binary search to find the __first__ row where the id value matches 15 in __O(log N)__ time complexity and store this row in a temporary collection,
* advance to the __next__ row to look for any more rows with the __same id values__ and add those rows to the temporary collection,
* return the final collection of rows that matched.

In SQLite, a __PRIMARY KEY__ column like id can't contain duplicate values, so no 2 rows can share an id value. Because the rows are ordered by id, SQLite can __stop searching__ almost immediately: it has to advance at most one row past the match to confirm that there are no further rows with the same id value.
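To make the lookup by id more concrete, here is a minimal sketch using Python's `bisect` module on a sorted list of id values. The list `ids` and the helper `find_by_id` are hypothetical, and the sketch only mirrors the idea of binary search over ordered keys; it doesn't reproduce SQLite's on-disk data structures.

```python
from bisect import bisect_left

# Hypothetical id values, kept in sorted order.
ids = list(range(1, 262))

def find_by_id(ids, target):
    """Return the position of target in the sorted ids list, or None if absent."""
    pos = bisect_left(ids, target)  # binary search, O(log N)
    if pos < len(ids) and ids[pos] == target:
        return pos
    return None

print(find_by_id(ids, 15))    # position of id 15 in the list
print(find_by_id(ids, 1000))  # None, since no such id exists
```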

If you need a refresher on algorithmic complexity head to our mission on __Algorithms__. If you want to dive into binary search, check out our mission on __Binary Search__.

### 4.3.4.5 Time Complexity

Binary search on a table using the __primary key__ would be __O(log N)__ time complexity where N is the total number of rows in the table. On the other hand, a full table scan would be __O(N)__ time complexity since each row would have to be accessed. If we're working with a table containing a million rows, binary search needs only about 20 steps (log2 of 1,000,000 is roughly 20), while a full table scan may need to access all million rows. While you may not notice major performance differences when working with a small, on-disk database, they __become profound as you scale up the amount of data__ you work with. Many organizations work with databases containing billions or trillions of rows, and understanding the time complexity of queries is important to avoid writing queries that take a long time to complete.
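As a rough back-of-the-envelope check, the snippet below compares the worst-case number of rows a full scan touches with the approximate number of steps a binary search needs for a few table sizes. The table sizes are arbitrary examples, and the step counts ignore real-world factors such as disk pages and caching.

```python
import math

# Approximate binary search steps for a few hypothetical table sizes.
steps = {}
for n in (1_000, 1_000_000, 1_000_000_000):
    steps[n] = math.ceil(math.log2(n))
    print(f"rows: {n:>13,}  full scan: up to {n:,} rows  binary search: about {steps[n]} steps")
```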

Let's now observe the query plan that SQLite takes to access a row at a specific id value.

#### Instructions

* Return the query plan for the query that selects the row at id value 20 from the facts table.
* Assign the query plan to query_plan_four and display the query plan using the print function.

#### Answers
```python
query_plan_four = conn.execute("EXPLAIN QUERY PLAN SELECT * FROM facts WHERE id=20").fetchall()
print(query_plan_four)
```

Outputs:
```PYTHON
[(0, 0, 0, 'SEARCH TABLE facts USING INTEGER PRIMARY KEY (rowid=?)')]
```

### 4.3.4.6 Search and rowid

Instead of using a full table scan:

```PYTHON
[(0, 0, 0, 'SCAN TABLE facts')]
```

SQLite performed __binary search__ on the facts table using the integer primary key:

```python
[(0, 0, 0, 'SEARCH TABLE facts USING INTEGER PRIMARY KEY (rowid=?)')]
```

SQLite uses __rowid__ to refer to the primary key of a table. The __alias rowid__ will be displayed in the query plan, no matter what you name the primary key column for that table. Either __SCAN__ or __SEARCH__ will always appear at the start of the query explanation for SELECT queries.
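The rowid alias can be observed with a small experiment. The sketch below uses a made-up in-memory table (`countries`, with a primary key named `country_id`) and prints the plan for a lookup by that key. The precise wording of the plan depends on the SQLite version, so the snippet only prints it rather than assuming an exact string.

```python
import sqlite3

# Hypothetical table whose primary key is named country_id rather than id.
alias_conn = sqlite3.connect(":memory:")
alias_conn.execute("CREATE TABLE countries (country_id INTEGER PRIMARY KEY, name TEXT)")
alias_conn.execute("INSERT INTO countries (name) VALUES ('India')")

alias_plan = alias_conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM countries WHERE country_id = 1;"
).fetchall()
alias_detail = alias_plan[0][-1]
print(alias_detail)  # the plan refers to rowid, not to country_id
alias_conn.close()
```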


### 4.3.4.7 Indexing (Perform two binary search)

SQLite can take advantage of speedy lookups when searching for a specific primary key. Unfortunately, we don't always have the primary keys for the rows we're interested in beforehand. When we're expressing our intent as a SQL query, we're often thinking in terms of row and column values. We need to find a way that allows us to benefit from the speed of primary key lookups without actually knowing the primary key in advance.

__To that end__, we could create a __separate table__ that's optimized for lookups by a different column from the facts table instead of by the id. We can make the column we want to query by the primary key, so we get the speed benefits, and embed the id value from the facts table corresponding to that row. We call this table an __index__ and each row in the index contains:

* the value we want to be able to search by, as the primary key,
* an id value for the corresponding row in facts.

Let's walk through a concrete example. If we wrote a SELECT query to look up the population of India from the facts table:

```SQL
SELECT population FROM facts WHERE name = 'India';
```

SQLite would need to perform a full table scan on facts to find the specific row where the value for name was India. We can instead __create an index__ that's ordered by name values (primary key) and where each row contains the corresponding row's id from the facts table. Here's what that index would look like:

![img alt](Capture-diagram-queryplan-1.png)

We can write a query that uses the primary key, the country name, of the index table, which we'll call __name_idx__, to look up the row we're interested in and then extract the id value for that row in facts. Then, we can write a separate query that uses the id value returned from the previous query to look up the specific row in the facts table that contains information on India and then return just the population value.

Instead of performing a single full table scan of facts, SQLite would perform a binary search on the index then another binary search on facts using the id value. Both queries are taking advantage of the primary key for the index and the facts table to quickly return the results we want. Here's a diagram of these concepts:

![img alt](Capture-diagram-queryplan-2.png)
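The two-step lookup can be tried by hand with an ordinary table standing in for the index. This is a sketch under stated assumptions: the in-memory database, the column `facts_id`, and the connection name `idx_conn` are hypothetical, and the example looks up Albania because its population appears in the sample rows shown earlier.

```python
import sqlite3

idx_conn = sqlite3.connect(":memory:")
idx_conn.execute("CREATE TABLE facts (id INTEGER PRIMARY KEY, name TEXT, population INTEGER)")
idx_conn.executemany(
    "INSERT INTO facts (id, name, population) VALUES (?, ?, ?)",
    [(1, "Afghanistan", 32564342), (2, "Albania", 3029278), (3, "Algeria", 39542166)],
)

# Hand-built index: name is the primary key, facts_id points back to facts.id.
idx_conn.execute("CREATE TABLE name_idx (name TEXT PRIMARY KEY, facts_id INTEGER)")
idx_conn.execute("INSERT INTO name_idx SELECT name, id FROM facts")

# First lookup: find the id in the index by name.
(facts_id,) = idx_conn.execute(
    "SELECT facts_id FROM name_idx WHERE name = ?", ("Albania",)
).fetchone()

# Second lookup: use that id to fetch the row from facts.
(population,) = idx_conn.execute(
    "SELECT population FROM facts WHERE id = ?", (facts_id,)
).fetchone()

print(facts_id, population)
idx_conn.close()
```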


### 4.3.4.8 Creating an index (Tradeoffs: Speed vs Space)

Instead of creating a separate table and updating it ourselves, we can specify a column we want an index table for and SQLite will take care of the rest. SQLite, and most databases, make it easy for you to __create indexes for tables on columns__ we plan to query often. To create an index we use the __[CREATE INDEX statement](https://www.sqlite.org/lang_createindex.html)__. Here's the pseudo-code for that statement:

```SQL
CREATE INDEX index_name ON table_name(column_name);
```

As you can see from the pseudo-code above, each index we create needs a name (to replace __index_name__). Similar to when you add a table to a database, using the __IF NOT EXISTS clause__ helps you avoid attempting to create an index that already exists; without the clause, that attempt causes SQLite to throw an error. To create an index for the __area__ column called __area_idx__, we write the following query:

```SQL
CREATE INDEX IF NOT EXISTS area_idx ON facts(area);
```

An empty array will be returned when you run the query. The main benefit of having SQLite handle the maintenance of indexes we create is that the indexes are used automatically when we execute a query whenever there will be any speed advantages. As our queries become more complex, letting SQLite decide how and when to use the indexes we create helps us be much more productive.

If we create an index for the area column in the facts table, SQLite will use the index whenever we search for rows in facts using that column. This index would be similar to the one we worked with in the past step and each row would only contain the area value and the corresponding row's id value. The index would __be ordered__ by the area values for quick lookups.

All three of the following queries would take advantage of the area_idx index:

```SQL
SELECT * FROM facts WHERE area = 10000;
SELECT * FROM facts WHERE area > 10000;
SELECT * FROM facts WHERE area < 10000;
```

Since the __area_idx__ index would be ordered by the area values, SQLite would:

* search for the first instance in the index where area equaled 10000 and store the id value in a temporary collection.
* it would then advance to the next row in the index to check if the WHERE condition was still met.
    * if not, then the temporary collection would be returned and the process completes.
    * if so, then SQLite would add that id value to the collection and check the next row.
* when SQLite finds a value for area that doesn't match the WHERE condition,
    * it will look up and return the rows in facts using the id values stored in the temporary collection.
    * each of these lookups will be O(log N) time complexity and while this could add up, it will still be faster than a full table scan.
    
This process allows us to just write one query instead of 2 and have SQLite maintain and interact with the index. A table can have many indexes, and most tables in production environments usually do. Every time you add a row to the table or delete one from it, all of the indexes will be updated. If you edit the values in a row, SQLite will figure out which indexes are affected by the changes and update those indexes.
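The walk through the index for an equality condition can be sketched with a sorted list of (area, id) pairs. The entries in `area_index` and the helper `ids_with_area` are hypothetical, and the sketch only illustrates the steps listed above: jump to the first match with binary search, then advance while the condition still holds.

```python
from bisect import bisect_left

# Hypothetical index entries as (area, id) pairs, kept sorted by area.
area_index = [(5000, 7), (10000, 2), (10000, 9), (28748, 4), (652230, 1)]

def ids_with_area(index, target):
    """Collect the ids of all index entries whose area equals target."""
    # Binary search for the first entry whose area is at least target.
    pos = bisect_left(index, (target,))
    ids = []
    # Advance through the index while the WHERE condition is still met.
    while pos < len(index) and index[pos][0] == target:
        ids.append(index[pos][1])
        pos += 1
    return ids

print(ids_with_area(area_index, 10000))  # [2, 9]
```

Each id collected this way would then be used for a separate primary key lookup in the table itself.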

While creating indexes gives us tremendous __speed benefits__, they come at the __cost of space__. Each index needs to be stored in the database file. In addition, adding, editing, and deleting rows takes longer since each of the affected indexes needs to be updated. Since indexes can be created after a table is created, it's recommended to __only create an index when you find yourself querying on a specific column frequently__. Throughout the rest of the course, we'll explore how to understand the __tradeoffs__ and you'll develop a better sense of how to create indexes in an optimal way.
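One way to see the space cost is to compare the size of a database file before and after an index is created. The sketch below assumes a throwaway database in a temporary directory; the file name `space_demo.db`, the generated area values, and the row count are all arbitrary choices for illustration, and the exact sizes will vary by platform and SQLite version.

```python
import os
import sqlite3
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "space_demo.db")  # hypothetical scratch database
    space_conn = sqlite3.connect(path)
    space_conn.execute("CREATE TABLE facts (id INTEGER PRIMARY KEY, area INTEGER)")
    space_conn.executemany(
        "INSERT INTO facts (area) VALUES (?)",
        [(i * 7 % 1000,) for i in range(20000)],
    )
    space_conn.commit()
    size_before = os.path.getsize(path)

    # The index is stored in the same file, so the file grows.
    space_conn.execute("CREATE INDEX IF NOT EXISTS area_idx ON facts(area);")
    space_conn.commit()
    size_after = os.path.getsize(path)
    space_conn.close()

print(size_before, size_after)
```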

Now it's your turn to practice creating an index.

### 4.3.4.9 All together now

Instead of performing a full table scan on facts, SQLite used the name_idx index to return the id values first (in this case just one id value). Then, SQLite used binary search to extract just the rows from the facts table that corresponded to that id. SQLite __utilized 2 binary searches instead of a full table scan__ to find the row corresponding to India.

Let's now synthesize the concepts we learned in this mission to practice understanding the query plan and creating an index.

#### Instructions

* Return the query plan for the query that returns all values in the rows in facts where the population exceeds 10000. Assign the resulting query plan to query_plan_six and display using the print function.
* Create an index for the population column in the facts table named pop_idx.
* Return the query plan for the query that returns all values in the rows in facts where the population exceeds 10000. Assign the resulting query plan to query_plan_seven and display using the print function.

#### Answers
```PYTHON
query_plan_six = conn.execute("EXPLAIN QUERY PLAN SELECT * FROM facts WHERE population>10000;").fetchall()
print(query_plan_six)

conn.execute("CREATE INDEX IF NOT EXISTS pop_idx ON facts(population);")

query_plan_seven = conn.execute("EXPLAIN QUERY PLAN SELECT * FROM facts WHERE population>10000;").fetchall()
print(query_plan_seven)
```

Outputs:
```Python
[(0, 0, 0, 'SCAN TABLE facts')]
[(0, 0, 0, 'SEARCH TABLE facts USING INDEX pop_idx (population>?)')]
```

### 4.3.4.10 Next Steps

Instead of ending in __USING INDEX pop_idx (population)__, the query plan ended in __USING INDEX pop_idx (population>?)__. This is to indicate the granularity of the lookup that SQLite had to do for that index.

In this mission, we explored how SQLite accesses data and how to create and take advantage of indexes. In the next mission, we'll learn how to create more complex indexes, dive deeper into database performance, and learn about __multi-column indices__.

## 4.3.5 Multi-column indexing

### 4.3.5.1 Introduction

In the last mission, we explored how to __speed up SELECT queries__ that only __filter on one column__ by creating an index for that column. In this mission, we'll explore how to create indexes for speeding up queries that filter on multiple columns.

We'll continue to work with factbook.db, a SQLite database that contains information about each country in the world. Recall that this database contains just the facts table and each row represents a single country. While we created indexes for the facts table in this database in the previous mission, this version of factbook.db contains no indexes.

Here are some of the columns:

* name -- the name of the country.
* area -- the total land and sea area of the country.
* population -- the population of the country.
* birth_rate -- the birth rate of the country.
* created_at -- the date the record was created.
* updated_at -- the date the record was updated.

and the first few rows of facts:

<table class="table table-bordered">
<tbody><tr>
<th>id</th>
<th>code</th>
<th>name</th>
<th>area</th>
<th>area_land</th>
<th>area_water</th>
<th>population</th>
<th>population_growth</th>
<th>birth_rate</th>
<th>death_rate</th>
<th>migration_rate</th>
<th>created_at</th>
<th>updated_at</th>
</tr>
<tr>
<td>1</td>
<td>af</td>
<td>Afghanistan</td>
<td>652230</td>
<td>652230</td>
<td>0</td>
<td>32564342</td>
<td>2.32</td>
<td>38.57</td>
<td>13.89</td>
<td>1.51</td>
<td>2015-11-01 13:19:49.461734</td>
<td>2015-11-01 13:19:49.461734</td>
</tr>
<tr>
<td>2</td>
<td>al</td>
<td>Albania</td>
<td>28748</td>
<td>27398</td>
<td>1350</td>
<td>3029278</td>
<td>0.3</td>
<td>12.92</td>
<td>6.58</td>
<td>3.3</td>
<td>2015-11-01 13:19:54.431082</td>
<td>2015-11-01 13:19:54.431082</td>
</tr>
<tr>
<td>3</td>
<td>ag</td>
<td>Algeria</td>
<td>2381741</td>
<td>2381741</td>
<td>0</td>
<td>39542166</td>
<td>1.84</td>
<td>23.67</td>
<td>4.31</td>
<td>0.92</td>
<td>2015-11-01 13:19:59.961286</td>
<td>2015-11-01 13:19:59.961286</td>
</tr>
</tbody></table>


We limited ourselves to working with queries that only filtered on one column like:

```SQL
SELECT * FROM facts WHERE name = 'India';
```

In this mission, we'll explore how to create indexes for speeding up queries that __filter on multiple columns__, like:

```SQL
SELECT * FROM facts WHERE population > 1000000 AND population_growth < 2.0;
```

We'll also explore how to modify the queries we write to better take advantage of indexes. For example, if we create an index for the name column, we'll explore why the following query:

```SQL
SELECT name FROM facts WHERE name = 'India';
```

will be faster than:

```SQL
SELECT * FROM facts WHERE name = 'India';
```

To start, let's write and run a query that involves filtering on more than 1 column and use the __EXPLAIN QUERY PLAN statement__ to understand what SQLite is doing to return the results. Our intuition suggests that SQLite will have to perform a full table scan. It will have to check if each row in the table meets the WHERE constraints since there are no indexes in the table to take advantage of.

We've already imported the sqlite3 library and initialized a connection to factbook.db in the coding cell.


### 4.3.5.2 Introduction

#### Instructions
* Return the query plan for a query that returns all rows where population is greater than 1000000 and where population_growth is less than 0.05.
    * We're interested in all of the columns in the rows.
* Assign the query plan to query_plan_one and use print function to display the query plan.

#### Answers
```python
import sqlite3
conn = sqlite3.connect("factbook.db")

query_plan_one = conn.execute("EXPLAIN QUERY PLAN SELECT * FROM facts WHERE population>1000000 AND population_growth<0.05;").fetchall()
print(query_plan_one)
```

Outputs:
```python
[(0, 0, 0, 'SCAN TABLE facts')]
```


### 4.3.5.3 Query plan for multi-column queries

As expected, SQLite had to perform a full table scan to access the data we asked for. Let's add indexes for both the __population and population_growth columns__ to see how SQLite uses these indexes for returning the same query.

#### Instructions

* Create an index called pop_idx for the population column in the facts table.
* Create an index called pop_growth_idx for the population_growth column in the facts table.
* Return the query plan for a query that returns all rows where population is greater than 1000000 and where population_growth is less than 0.05. We're interested in all of the columns in the rows.
* Assign the query plan to query_plan_two and display it using the print function.

#### Answers
```python
conn = sqlite3.connect("factbook.db")

conn.execute("CREATE INDEX IF NOT EXISTS pop_idx ON facts(population);")
conn.execute("CREATE INDEX IF NOT EXISTS pop_growth_idx ON facts(population_growth);")

query_plan_two = conn.execute("EXPLAIN QUERY PLAN SELECT * FROM facts WHERE population>1000000 AND population_growth<0.05;").fetchall()

print(query_plan_two)
```

Outputs:
```PYTHON
[(0, 0, 0, 'SEARCH TABLE facts USING INDEX pop_growth_idx (population_growth<?)')]
```


### 4.3.5.4 Explanation (can't work both)

If you recall, SQLite returns only a high-level query plan when you use the __EXPLAIN QUERY PLAN statement__ in front of a query. This means that you'll often have to augment the returned query plan with your own understanding of the available indexes. In this case, the facts table has 2 indexes:

* one ordered by population called pop_idx,
* one ordered by population_growth, called pop_growth_idx.

SQLite struggles to take advantage of both indexes since each index is optimized for lookups on just that column. SQLite can use the indexes to quickly find the row id values where either population is greater than 1000000 or where population_growth is less than 0.05. If SQLite uses the index of population values to return all of the row id values where population is greater than 1000000, it can't use those id values to search the pop_growth_idx index quickly to find the rows where population_growth is less than 0.05.

If you look at the query plan, you can infer that SQLite first decided to use the pop_growth_idx index to return the id values for the rows where population_growth was less than 0.05. Then, SQLite used a binary search on the facts table to access the row at each id value, add that row to a temporary collection if the value for population was greater than 1000000, and return the collection of rows.

You may be wondering __why SQLite chose the pop_growth_idx instead of the pop_idx__. This is because when there are 2 possible indexes available, SQLite tries to estimate which index will result in better performance. Unfortunately, to keep SQLite lightweight, it has only a limited ability to estimate and plan accurately, and it often __ends up picking an index at random__.

### 4.3.5.5 Multi-column index

In cases like this, we need to create a __multi-column index__ that contains values from both of the columns we're filtering on. This way, both criteria in the WHERE statement can be evaluated in the index itself and the facts table will only be queried at the end when we have the specific row id values.

Here's what a multi-column index for __population and population_growth__ would look like:

![img alt](Capture-diagram-queryplan-3.PNG)

<span style="color:red">While the single column indexes we've created in the past contain just the primary key column (population) and the row id (id) columns, this multi-column index contains the population_growth column as well</span>. SQLite can:

* use binary search to find the first row in this index where population is greater than 1000000,
* add the row to a temporary collection if population_growth is less than 0.05,
* advance to the next row (the index is ordered by population) and check if it's greater than 1000000,
* add the row to a temporary collection if population_growth is less than 0.05,
* when the end of the index is reached, look up each row in facts using the id values from the temporary collection.

This way the facts table is only accessed at the end and the index is used to process the __WHERE criteria__.

When creating a multi-column index, we need to __specify which of the columns we want as the primary key__. In the example above, this means that SQLite can use binary search to quickly jump to the first row that matches a specific population value but not for jumping to the first row that matches a specific population_growth value.
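The role of the leading column can be sketched with a sorted list of tuples. The entries in `pop_growth_index` are made-up values and `matching_ids` is a hypothetical helper: binary search is only possible on population, because that is the column the entries are ordered by, while population_growth is checked entry by entry.

```python
from bisect import bisect_right

# Hypothetical index entries as (population, population_growth, id),
# sorted by population, the leading column.
pop_growth_index = [
    (500000, 0.01, 4),
    (3029278, 0.30, 2),
    (5000000, 0.02, 5),
    (32564342, 2.32, 1),
    (39542166, 1.84, 3),
]

def matching_ids(index, min_population, max_growth):
    """Return ids where population > min_population and growth < max_growth."""
    populations = [entry[0] for entry in index]
    # Binary search jumps to the first entry with population > min_population.
    start = bisect_right(populations, min_population)
    # Both WHERE conditions are then evaluated inside the index itself.
    return [row_id for _, growth, row_id in index[start:] if growth < max_growth]

print(matching_ids(pop_growth_index, 1000000, 0.05))  # [5]
```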

### 4.3.5.6 Creating a multi-column index

To create a multi-column index, we use the same __CREATE INDEX__ syntax as before but instead __specify 2 columns__ in the __ON statement__:

```sql
CREATE INDEX index_name ON table_name(column_name_1, column_name_2);
```

The important thing to know here is that the __first column__ in the parentheses becomes the __primary key for the index__. Let's create a multi-column index for the population and population_growth columns and return the query plan for the query we've been working with.

#### Instructions

* Create a multi-column index for population and population_growth named pop_pop_growth_idx with population as the primary key.
* Return the query plan for a query that returns all rows where population is greater than 1000000 and where population_growth is less than 0.05. We're interested in all of the columns in the rows.
* Assign the returned query plan to query_plan_three and use the print function to display it.

#### Answers
```python
import sqlite3

conn = sqlite3.connect("factbook.db")

conn.execute("CREATE INDEX IF NOT EXISTS pop_pop_growth_idx ON facts(population, population_growth);")

query_plan_three = conn.execute("EXPLAIN QUERY PLAN SELECT * FROM facts WHERE population>1000000 AND population_growth<0.05;").fetchall()

print(query_plan_three)
```

Outputs:
```python
[(0, 0, 0, 'SEARCH TABLE facts USING INDEX pop_pop_growth_idx (population>?)')]
```
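To double-check which column ended up as the primary key of an index, SQLite's `PRAGMA index_info` lists the indexed columns in order. The sketch below assumes an in-memory database with a simplified, hypothetical facts schema, so it runs without factbook.db.

```python
import sqlite3

# Assumption: a simplified stand-in for the facts table, created in memory.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE facts (id INTEGER PRIMARY KEY, name TEXT, "
    "population INTEGER, population_growth REAL);"
)
conn.execute("CREATE INDEX pop_pop_growth_idx ON facts(population, population_growth);")

# Each row returned by PRAGMA index_info is (seqno, cid, name);
# seqno 0 is the first (primary key) column of the index.
index_columns = [
    name for _seqno, _cid, name in conn.execute("PRAGMA index_info(pop_pop_growth_idx);")
]
print(index_columns)  # ['population', 'population_growth']
```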

### 4.3.5.7 Covering index

This time, SQLite used the multi-column index pop_pop_growth_idx that we created instead of either pop_growth_idx or pop_idx. SQLite only needed to access the facts table to return the remaining column values for the rows that met the WHERE criteria, because pop_pop_growth_idx contains only the population and population_growth values (plus the row id).

What if we restricted the columns in the SELECT that we want returned to just population and population_growth? In this case, SQLite will not need to interact with the facts table since the pop_pop_growth_idx can service the query. When an __index contains all of the information necessary to answer a query__, it's called a __covering index__. Since the index covers for the actual table and can return the requested results to the query, SQLite doesn't need to query the actual table. For many queries, especially as your data gets larger, this can be much more efficient.
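As a rough illustration, the sketch below checks whether a query plan reports a covering index. It assumes an in-memory database with a simplified, hypothetical facts schema, and `uses_covering_index` is a helper name invented for this example.

```python
import sqlite3

# Assumption: a simplified stand-in for the facts table, created in memory.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE facts (id INTEGER PRIMARY KEY, name TEXT, "
    "population INTEGER, population_growth REAL);"
)
conn.execute("CREATE INDEX pop_pop_growth_idx ON facts(population, population_growth);")

def uses_covering_index(sql):
    """Return True if any step of the query plan mentions a covering index."""
    details = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]
    return any("COVERING INDEX" in detail for detail in details)

where = "WHERE population > 1000000 AND population_growth < 0.05"

# name is not stored in the index, so the table must still be read.
all_columns = uses_covering_index(f"SELECT * FROM facts {where};")

# Both requested columns live in the index, so the index alone is enough.
indexed_only = uses_covering_index(f"SELECT population, population_growth FROM facts {where};")

print(all_columns, indexed_only)
```

Only the second query can be answered from the index alone, which is what the covering-index label in the plan indicates.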

Let's write a query that uses the index we created as a covering index and return its query plan.

#### Instructions

* Return the query plan for a query that returns all rows where population is greater than 1000000 and where population_growth is less than 0.05. Select only the population and population_growth columns.
* Assign the returned query plan to query_plan_four and use the print function to display it.

#### Answers
```python
import sqlite3

conn = sqlite3.connect("factbook.db")
conn.execute("CREATE INDEX IF NOT EXISTS pop_pop_growth_idx ON facts(population, population_growth);")

query_plan_four = conn.execute("EXPLAIN QUERY PLAN SELECT population, population_growth FROM facts WHERE population>1000000 AND population_growth<0.05;").fetchall()
print(query_plan_four)
```

Outputs:
```python
[(0, 0, 0, 'SEARCH TABLE facts USING COVERING INDEX pop_pop_growth_idx (population>?)')]
```



### 4.3.5.8 Covering index for single column

There are two things that stand out in the __query plan__ from the previous screen:

* instead of __USING INDEX__ the query plan says __USING COVERING INDEX__,
* the query plan still contains SEARCH TABLE facts as before.

Even though the query plan indicates that a binary search on facts was performed, this is __misleading__: SQLite was instead able to service the query from the covering index alone. You can read more about that __[in the documentation](https://www.sqlite.org/queryplanner.html#covidx)__.

Covering indexes don't apply just to multi-column indexes. If a query we write only touches a column in the database that we have a single-column index for, SQLite will use only the index to service the query. Let's test this by writing a query that can take advantage of just the index, pop_idx, for the population column.

#### Instructions

* Return the query plan for a query that returns all rows where population is greater than 1000000. We're only interested in the population column.
* Assign the returned query plan to query_plan_five and use the print function to display it.

#### Answers
```python
import sqlite3

conn = sqlite3.connect("factbook.db")
# Make sure the single-column index on population exists (a no-op if it was created earlier).
conn.execute("CREATE INDEX IF NOT EXISTS pop_idx ON facts(population);")

query_plan_five = conn.execute("EXPLAIN QUERY PLAN SELECT population FROM facts WHERE population>1000000;").fetchall()

print(query_plan_five)
```

Outputs:
```python
[(0, 0, 0, 'SEARCH TABLE facts USING COVERING INDEX pop_idx (population>?)')]
```

### 4.3.5.9 Conclusion

Since only the population values were necessary to service the query, SQLite used the pop_idx index as a covering index and didn't have to access the facts table.

In this mission, we explored how to create multi-column indexes and how to restrict the columns in a query so that an index can service it when we don't need values that are available only in the table.