# The SELECT Statement

The SELECT statement gives SQL users the power to perform data analysis. It describes a **task** to be completed by the RDBMS, and the result of this task is always a table of data in rows and columns. You can think of the SELECT statement as the counterpart of the methods of a pandas DataFrame. Nearly everything that can be done with a SELECT statement can also be done with a DataFrame, and vice versa. By the end of this chapter, you should be able to answer many of the exercises in previous chapters using the SELECT statement.

This chapter covers the most useful functionality of the SELECT statement that is common to most RDBMS's. As the various RDBMS's have different syntax, it will not be possible to provide exhaustive coverage of all the functionality in each RDBMS and you will need to consult your specific RDBMS's documentation to find its syntax. As our data is stored in SQLite, we will use syntax that works specifically for it, though for many of the commands, the syntax will be similar (and often identical) to other RDBMS's.

## Writing and executing SQL statements

Before we learn how to write SQL statements, we need to understand where we can write out SQL statements and execute them so that they run on our database. This section covers the different methods for writing and executing SQL statements. 

### DbSchema

With DbSchema open, click on the **Query Tools** menu bar and select **New SQL Editor**. A tab with a blank editor will open in the bottom half of the screen. You may write your SQL statements in that editor and either click the **Run Query** button or press **ctrl + enter** to execute. Many other graphical user interfaces exist (some specifically designed for a particular RDBMS) that allow you to execute SQL commands.

### `sqlite3` command

After installing Python with Miniconda, the `sqlite3` command will be available to you from the command line. Open your terminal/command prompt and `cd` into the `data/databases` directory of this book. Run the command `sqlite3 healthcare.db` and a new **sqlite>** prompt should appear. Within this prompt, SQL statements may be written and executed. Other RDBMS's will have their own command line prompts. Enter `.quit` to exit the prompt.

## Connecting to a database programmatically

And of course, it is possible to write and execute SQL statements with another programming language. This is the method we will use for the remainder of the book. This allows us to remain in the notebook instead of switching between other software. In order to programmatically connect to a database, you'll need to know the following items:

* username
* password
* host
* port
* database name

Connecting to a database isn't too different from logging in to your email, which also requires a username and password. The host is the address of the machine where the database is located (often localhost). The port is a number, often four digits, that identifies where the database is listening on that host. For instance, Jupyter Notebooks run on port 8888 on your localhost by default.

### Installing the database driver

A database **driver** is software that facilitates the transfer of commands and data between the specific RDBMS and the other piece of software (Python in this case). You'll often have to download a separate Python library for each RDBMS. In the case of MySQL, one popular driver is [PyMySQL][0], which you can install with `pip install pymysql`.

With the driver library installed, you can use it directly to issue commands to the RDBMS. Consult each driver's documentation on how to create a connection object, which will allow you to run an `execute` method.
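
As a minimal sketch of that connect-then-execute pattern, the block below uses Python's built-in `sqlite3` module, which plays the role of the driver for SQLite. The in-memory database, the `doctor` table, and its two rows are made up for illustration. Other drivers follow a broadly similar pattern, but their connection arguments differ, so treat this as an outline rather than a template for every RDBMS.

```python
import sqlite3

# Connect to a throwaway in-memory database (made up for this example).
conn = sqlite3.connect(":memory:")

# The connection's execute method sends a statement to the database.
conn.execute("CREATE TABLE doctor (doctor_id INTEGER, first_name TEXT)")
conn.executemany("INSERT INTO doctor VALUES (?, ?)", [(1, "Leo"), (2, "Liam")])

# A query returns a cursor; fetchall collects its rows as a list of tuples.
rows = conn.execute("SELECT doctor_id, first_name FROM doctor").fetchall()
print(rows)

conn.close()
```

Each row comes back as a plain tuple, which is one reason the rest of the chapter lets pandas build a DataFrame from the result instead.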

### Connection strings with SQLAlchemy 

Although it isn't too complicated to work with the driver library directly, pandas requires that we use SQLAlchemy, an additional third-party package that streamlines the process for working with any driver. With SQLAlchemy, we provide all of the information to connect to the database as a **connection string**. The general format for the connection string for any RDBMS is:

`dialect+driver://username:password@host:port/database`

Each component above will be replaced with its specific value for an actual connection. [Check here][1] for a list of all connection strings in SQLAlchemy. You will still need to install the driver, but won't need to import it. For example, here is what a connection string to a MySQL database might look like.

`mysql+pymysql://ted:nikopenny1234@dunderdata.com:3306/students`
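
To make the pieces of that string concrete, here is a small sketch that splits the example connection string into its components with the standard library's `urllib.parse.urlsplit`. This is only an illustration of the format; SQLAlchemy does its own parsing internally, and the `components` dictionary is a name chosen for this example.

```python
from urllib.parse import urlsplit

connection_string = "mysql+pymysql://ted:nikopenny1234@dunderdata.com:3306/students"
parts = urlsplit(connection_string)

# The scheme holds "dialect+driver"; the rest maps onto the items listed earlier.
dialect, driver = parts.scheme.split("+")
components = {
    "dialect": dialect,
    "driver": driver,
    "username": parts.username,
    "password": parts.password,
    "host": parts.hostname,
    "port": parts.port,
    "database": parts.path.lstrip("/"),
}

for name, value in components.items():
    print(f"{name:>8}: {value}")
```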

### Connecting to SQLite

Most RDBMS's control access by creating users with passwords, but SQLite is an exception: it has no users and no passwords. Also, SQLite databases are just a single file - there is no server running from a port. It's not possible to connect to a SQLite database remotely. The connection string simplifies to become:

`sqlite:///path/to/database`
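
Because a SQLite database is just a file, a file path is all that is needed to connect. The sketch below illustrates this with the built-in `sqlite3` module rather than SQLAlchemy. The temporary folder and the file name `example.db` are hypothetical stand-ins for a real database path.

```python
import os
import sqlite3
import tempfile

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "example.db")  # hypothetical file name

    # No username, password, host, or port: only a path to the file.
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE clinic (clinic_id INTEGER)")
    conn.commit()
    conn.close()

    # The whole database now lives in this one file.
    is_single_file = os.path.isfile(path)

print(is_single_file)
```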


### Reading in tables as pandas DataFrames

pandas further simplifies the process of executing SQL commands with its `read_sql` function. We only need to supply the SQL statement and connection string. pandas will execute our statement and return the result as a DataFrame. We don't even have to import SQLAlchemy, though it must be installed with `conda install sqlalchemy` first. Here, we read in the patient table from the healthcare database. The first argument to `read_sql` can be either a SQL statement or the name of a table. Since we haven't learned any SQL yet, the entire table is retrieved.

[0]: https://github.com/PyMySQL/PyMySQL/
[1]: https://docs.sqlalchemy.org/en/latest/core/engines.html

In [2]:
import pandas as pd
CS = 'sqlite:///../data/databases/healthcare.db'
patient = pd.read_sql('patient', CS)
patient.head(3)

Unnamed: 0,patient_id,first_name,last_name,sex,address,date_of_birth
0,1,Ezra,Gonzalez,Male,"270 Elm St., Houston, TX 77005",1954-06-11
1,2,Molly,Clark,Female,"325 Main St., Houston, TX 77005",1976-08-11
2,3,Ivy,Jackson,Female,"136 Blueberry Hill, Houston, TX 77084",1948-03-25


All results with SQL statements will be duplicated with pandas commands. We read in the other four tables as DataFrames so that we can use them when needed below.

In [3]:
doctor = pd.read_sql('doctor', CS)
clinic = pd.read_sql('clinic', CS)
procedure = pd.read_sql('procedure', CS)
appointment = pd.read_sql('appointment', CS)

## SELECT statement clauses

The SELECT statement syntax is composed of various **clauses** that each provide a description for the task to be completed. Before covering the specific clauses, it might be helpful to see how natural language statements can be broken down into a collection of clauses. Take a look at the following statement.

```
BRING 4 apples and 3 oranges
FROM my_favorite_market
TO my_house
WHERE each apple is greater than 300 grams
BY 3 p.m. today.
```

Let's call this a **BRING** statement which begins with the word "BRING" and ends in a period. All of the words in all capital letters at the start of each line can be referred to as a clause (technically, natural language clauses have verbs, but we will ignore this fact in order to make our analogy work). Each clause provides more detail as to what task needs to be completed. Notice that all clauses are **optional**. Omitting any of them still provides a valid description of a task.

Brief descriptions of the primary SELECT statement clauses are listed below. Note that SELECT is both a clause and the name of the entire statement. Each clause has its own syntax with other keywords available to use. The words **JOIN**, **HAVING**, and **OFFSET** are **subclauses** and can only appear after their main clause. The clauses must appear in the order listed below without exception.

* **SELECT** - choose columns, evaluate expressions, call aggregate functions
* **FROM** - table of data
    * **JOIN** - joins two tables together
* **WHERE** - filter rows based on boolean condition
* **GROUP BY** - create independent groups based on values in given columns
    * **HAVING** - filter groups after grouping
* **ORDER BY** - sort the rows based on given columns
* **LIMIT** - limit the number of returned rows
    * **OFFSET** - skip a given number of rows
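
To see the clause order in one place, here is a sketch of a single query that uses every clause in the list except JOIN. It runs against a tiny made-up `procedure` table in an in-memory SQLite database, not the book's healthcare database. GROUP BY and HAVING appear only to show where they sit in the statement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE procedure (major_category TEXT, cost INTEGER)")
conn.executemany(
    "INSERT INTO procedure VALUES (?, ?)",
    [("Surgery", 100), ("Surgery", 300), ("Medicine", 50),
     ("Medicine", 70), ("Radiology", 400), ("Radiology", 20)],
)

# The clauses appear in the required order, one per line.
sql = """
SELECT major_category, count(*) AS n, avg(cost) AS avg_cost
FROM procedure
WHERE cost > 30
GROUP BY major_category
HAVING count(*) > 1
ORDER BY avg_cost DESC
LIMIT 2 OFFSET 0
"""
rows = conn.execute(sql).fetchall()
print(rows)
```

With this data, WHERE removes the 20-cost Radiology row, HAVING then drops the Radiology group because only one row remains in it, and the two surviving groups are sorted by average cost.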

## The FROM clause

In this section, we cover both the SELECT and FROM clauses as it is rare for a SELECT clause to appear by itself. The most basic use case for the SELECT clause is to select specific columns from a table. After SELECT, list the names of the columns to select, separated by commas. The FROM clause identifies the table. Here, we select three columns from the patient table.

By default, all rows are selected. The `read_sql` function returns the result as a DataFrame, on which we call the `head` method to display just the first five rows. The `read_sql` function is used merely to display the results of the SQL command in the notebook as a pandas DataFrame.

In [4]:
sql = """
SELECT patient_id, first_name, sex
FROM patient
"""
pd.read_sql(sql, CS).head()

Unnamed: 0,patient_id,first_name,sex
0,1,Ezra,Male
1,2,Molly,Female
2,3,Ivy,Female
3,4,Ivy,Female
4,5,Amara,Female


Translating to pandas, this becomes:

In [5]:
cols = ['patient_id', 'first_name', 'sex']
patient[cols].head()

Unnamed: 0,patient_id,first_name,sex
0,1,Ezra,Male
1,2,Molly,Female
2,3,Ivy,Female
3,4,Ivy,Female
4,5,Amara,Female


### Renaming columns

Each column can be renamed using the AS keyword. This is also known as providing an **alias**. The same terminology is used when importing Python modules and subsequently renaming them with the `as` keyword. Here, we rename two of the columns.

In [6]:
sql = """
SELECT doctor_id, first_name AS first, last_name AS last
FROM doctor
"""
pd.read_sql(sql, CS).head()

Unnamed: 0,doctor_id,first,last
0,1,Leo,Davis
1,2,Liam,Perez
2,3,Molly,Johnson
3,4,Oliver,Jackson
4,5,Zoey,Thomas


Replicating in pandas, this becomes:

In [7]:
cols = ['doctor_id', 'first_name', 'last_name']
doctor[cols].rename(columns={'first_name': 'first', 
                             'last_name': 'last'}).head()

Unnamed: 0,doctor_id,first,last
0,1,Leo,Davis
1,2,Liam,Perez
2,3,Molly,Johnson
3,4,Oliver,Jackson
4,5,Zoey,Thomas
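
An alias changes only the column names of the returned result, not the table itself. The sketch below checks this with the built-in `sqlite3` module on a one-row, made-up `doctor` table: `cursor.description` reports the column names of the result, while `PRAGMA table_info` (a SQLite-specific command) reports the names stored in the table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE doctor (doctor_id INTEGER, first_name TEXT, last_name TEXT)"
)
conn.execute("INSERT INTO doctor VALUES (1, 'Leo', 'Davis')")

cursor = conn.execute(
    "SELECT doctor_id, first_name AS first, last_name AS last FROM doctor"
)
# The first item of each description entry is the column name in the result.
result_columns = [col[0] for col in cursor.description]

# The second item of each table_info row is the column name in the table.
table_columns = [row[1] for row in conn.execute("PRAGMA table_info(doctor)")]

print(result_columns)
print(table_columns)
```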


### Keywords, table names, and column names are case insensitive

Unlike in most programming languages, keywords in SQL are case insensitive, as are table and column names. Here, we run the same query as before but with a mix of lowercase and uppercase letters. By convention, the clause keywords are capitalized. Additionally, extra whitespace and line breaks between keywords and names are ignored.

In [8]:
sql = """
select doctor_id, FIRST_name as first, last_name AS LAST
fROm DOCtor
"""
pd.read_sql(sql, CS).head(3)

Unnamed: 0,doctor_id,first,LAST
0,1,Leo,Davis
1,2,Liam,Perez
2,3,Molly,Johnson


### All returned tables are temporary

Every time you run a SELECT statement, a new temporary table is created. This table is not saved in the database and is not available after the query. It is only available to view. This is exactly what happens when we call methods from a pandas DataFrame that return a new DataFrame. Unless we use an assignment statement to save the results, the new returned DataFrame is temporary and we cannot access it again without re-running the same code.

### Selecting all columns with `*`

Instead of writing each column name, use an asterisk to select every column from a table. Here, we select all columns (and all rows) from the doctor table.

In [9]:
sql = """
SELECT *
FROM doctor
"""
pd.read_sql(sql, CS).head(3)

Unnamed: 0,doctor_id,first_name,last_name,specialty
0,1,Leo,Davis,Dermatology
1,2,Liam,Perez,Radiology
2,3,Molly,Johnson,Anesthesiology


### Selecting unique values

The unique values of a column may be returned by placing the keyword DISTINCT immediately after the SELECT keyword. Here, we return the unique major_category values, in the order that they appear.

In [10]:
sql = """
SELECT DISTINCT major_category
FROM procedure
"""
pd.read_sql(sql, CS)

Unnamed: 0,major_category
0,Anesthesia
1,Surgery
2,Radiology
3,Pathology and Laboratory
4,Medicine
5,Evaluation and Management


The `drop_duplicates` method replicates this functionality. Note that the index preserves the original location where the first unique value appeared.

In [11]:
procedure['major_category'].drop_duplicates()

0                      Anesthesia
156                       Surgery
2919                    Radiology
3497     Pathology and Laboratory
4572                     Medicine
5214    Evaluation and Management
Name: major_category, dtype: object

Any combination of unique values across any number of columns may be found with DISTINCT.

In [12]:
sql = """
SELECT DISTINCT major_category, minor_category
FROM procedure
"""
pd.read_sql(sql, CS).head()

Unnamed: 0,major_category,minor_category
0,Anesthesia,head
1,Anesthesia,neck
2,Anesthesia,thorax
3,Anesthesia,intrathoracic
4,Anesthesia,spineandspinal cord


The same `drop_duplicates` method is used again.

In [13]:
procedure[['major_category', 'minor_category']].drop_duplicates().head()

Unnamed: 0,major_category,minor_category
0,Anesthesia,head
21,Anesthesia,neck
25,Anesthesia,thorax
29,Anesthesia,intrathoracic
48,Anesthesia,spineandspinal cord


## The LIMIT clause

The LIMIT clause limits the number of rows returned. Use it by placing an integer after it. Here, we select the first four rows along with a few columns from the procedure table.

In [14]:
sql = """
SELECT procedure_id, major_category, cost
FROM procedure
LIMIT 4
"""
pd.read_sql(sql, CS)

Unnamed: 0,procedure_id,major_category,cost
0,100,Anesthesia,248
1,103,Anesthesia,84
2,104,Anesthesia,111
3,120,Anesthesia,511


The `head` method replicates the LIMIT clause.

In [15]:
cols = ['procedure_id', 'major_category', 'cost']
procedure[cols].head(4)

Unnamed: 0,procedure_id,major_category,cost
0,100,Anesthesia,248.0
1,103,Anesthesia,84.0
2,104,Anesthesia,111.0
3,120,Anesthesia,511.0


### The OFFSET subclause

The OFFSET subclause must appear after the LIMIT clause. The integer provided to it references the number of rows to skip before starting the selection. Here, we skip 1,000 rows before selecting the next four.

In [16]:
sql = """
SELECT procedure_id, major_category, cost
FROM procedure
LIMIT 4 OFFSET 1000
"""
pd.read_sql(sql, CS)

Unnamed: 0,procedure_id,major_category,cost
0,28010,Surgery,179
1,28011,Surgery,156
2,28022,Surgery,140
3,28024,Surgery,924


The `iloc` indexer provides this functionality in pandas. We begin at integer location 1,000.

In [17]:
cols = ['procedure_id', 'major_category', 'cost']
procedure[cols].iloc[1000:1004]

Unnamed: 0,procedure_id,major_category,cost
1000,28010,Surgery,179.0
1001,28011,Surgery,156.0
1002,28022,Surgery,140.0
1003,28024,Surgery,924.0
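
A common use of LIMIT together with OFFSET is stepping through a table one "page" at a time. The following is a minimal sketch of that idea using the built-in `sqlite3` module and a made-up ten-row `procedure` table. The names `page_size` and `pages` are chosen for this example, and an ORDER BY clause is included so that the pages come back in a predictable order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE procedure (procedure_id INTEGER)")
conn.executemany(
    "INSERT INTO procedure VALUES (?)", [(i,) for i in range(100, 110)]
)

page_size = 4
pages = []
sql = "SELECT procedure_id FROM procedure ORDER BY procedure_id LIMIT ? OFFSET ?"
for page_number in range(3):
    # Skip the rows belonging to earlier pages, then take the next page_size rows.
    rows = conn.execute(sql, (page_size, page_number * page_size)).fetchall()
    pages.append([row[0] for row in rows])

print(pages)
```

The last page holds only two rows because LIMIT is an upper bound, not a guarantee.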


## Database terminology - rows/records and columns/fields

The terms rows and columns have been used throughout this book to refer to the horizontal and vertical pieces of data in a table. In the database world, the term **record** is often used for a row and **field** for a column. You will encounter these terms in the chapters on SQL.

## The WHERE clause

The WHERE clause filters the data based on a boolean condition and is analogous to the DataFrame `query` method. An expression that evaluates as a boolean must be placed to the right of the WHERE clause. Here, we retrieve the first three records where cost is less than 100.

In [18]:
sql = """
SELECT *
FROM procedure
WHERE cost < 100
LIMIT 3
"""
pd.read_sql(sql, CS)

Unnamed: 0,procedure_id,description,major_category,minor_category,cost
0,103,Anesthesia for procedure on eyelid,Anesthesia,head,84
1,144,Anesthesia for procedure on eye for corneal tr...,Anesthesia,head,64
2,352,Anesthesia for tying procedure on major blood ...,Anesthesia,neck,60


Replicating with pandas:

In [19]:
procedure.query('cost < 100').head(3)

Unnamed: 0,procedure_id,description,major_category,minor_category,cost
1,103,Anesthesia for procedure on eyelid,Anesthesia,head,84.0
7,144,Anesthesia for procedure on eye for corneal tr...,Anesthesia,head,64.0
24,352,Anesthesia for tying procedure on major blood ...,Anesthesia,neck,60.0


One minor difference from Python is that the equality comparison operator in SQL is a single equal sign (`=`), not two. The keywords AND, OR, and NOT are all available to perform conditional logic.

In [20]:
sql = """
SELECT *
FROM procedure
WHERE cost < 200 AND minor_category = "pulmonary"
LIMIT 3
"""
pd.read_sql(sql, CS)

Unnamed: 0,procedure_id,description,major_category,minor_category,cost
0,94016,Physician interpretation and report of measure...,Medicine,pulmonary,184
1,94450,Lung function response to low oxygen,Medicine,pulmonary,113
2,94680,Collection and analysis of exhaled air for eva...,Medicine,pulmonary,140


The expression following the WHERE clause can often be placed directly into the `query` method as it is here.

In [21]:
procedure.query('cost < 200 and minor_category == "pulmonary"').head(3)

Unnamed: 0,procedure_id,description,major_category,minor_category,cost
4944,94016,Physician interpretation and report of measure...,Medicine,pulmonary,184.0
4951,94450,Lung function response to low oxygen,Medicine,pulmonary,113.0
4967,94680,Collection and analysis of exhaled air for eva...,Medicine,pulmonary,140.0
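
Only AND is shown above, so here is a short sketch of OR and NOT as well, run against a made-up five-row `procedure` table in an in-memory SQLite database. The helper `matching_ids` is a hypothetical convenience function written for this example. String values are written with single quotes, which is the standard SQL form.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE procedure (procedure_id INTEGER, minor_category TEXT, cost INTEGER)"
)
conn.executemany(
    "INSERT INTO procedure VALUES (?, ?, ?)",
    [(1, "pulmonary", 184), (2, "pulmonary", 852), (3, "head", 84),
     (4, "neck", 60), (5, "head", 511)],
)

def matching_ids(condition):
    """Return the procedure_id values of the rows that satisfy the condition."""
    sql = f"SELECT procedure_id FROM procedure WHERE {condition} ORDER BY procedure_id"
    return [row[0] for row in conn.execute(sql)]

either = matching_ids("cost < 100 OR minor_category = 'pulmonary'")
negated = matching_ids("NOT minor_category = 'pulmonary'")
# Parentheses make the intended grouping explicit when mixing AND and OR.
combined = matching_ids("(cost < 100 OR cost > 800) AND NOT minor_category = 'head'")

print(either, negated, combined)
```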


The BETWEEN keyword exists to test whether a value is between the given minimum and maximum, which must be separated by AND. Here, we select all procedures with a cost between 800 and 900 (inclusive) that have "pulmonary" as the minor_category.

In [22]:
sql = """
SELECT *
FROM procedure
WHERE cost BETWEEN 800 AND 900 AND minor_category = "pulmonary"
LIMIT 3
"""
pd.read_sql(sql, CS)

Unnamed: 0,procedure_id,description,major_category,minor_category,cost
0,94060,Measurement and graphic recording of the amoun...,Medicine,pulmonary,852
1,94728,Measurement of airway resistance by impulse os...,Medicine,pulmonary,844
2,94762,Overnight measurement of oxygen saturation in ...,Medicine,pulmonary,813


Within the `query` method, a chained comparison replicates the behavior of BETWEEN.

In [23]:
procedure.query('800 <= cost <= 900 and minor_category == "pulmonary"').head()

Unnamed: 0,procedure_id,description,major_category,minor_category,cost
4945,94060,Measurement and graphic recording of the amoun...,Medicine,pulmonary,852.0
4972,94728,Measurement of airway resistance by impulse os...,Medicine,pulmonary,844.0
4977,94762,Overnight measurement of oxygen saturation in ...,Medicine,pulmonary,813.0


### Test multiple equalities with IN

Use the keyword IN followed by a comma-separated list of values surrounded by parentheses to test a column for multiple equalities.

In [24]:
sql = """
SELECT *
FROM doctor
WHERE doctor_id IN (5, 10, 20)
"""
pd.read_sql(sql, CS)

Unnamed: 0,doctor_id,first_name,last_name,specialty
0,5,Zoey,Thomas,Neurology
1,10,Oliver,Johnson,Anesthesiology
2,20,Zoey,Moore,Radiology


The syntax is identical within the `query` method. The parentheses technically create a Python tuple here. A set or a list would work as well.

In [25]:
doctor.query('doctor_id in (5, 10, 20)')

Unnamed: 0,doctor_id,first_name,last_name,specialty
4,5,Zoey,Thomas,Neurology
9,10,Oliver,Johnson,Anesthesiology
19,20,Zoey,Moore,Radiology
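
As a quick check of what "multiple equalities" means, the sketch below compares IN with the equivalent chain of OR conditions, and also shows NOT IN. It uses a made-up `doctor` table with four ids in an in-memory SQLite database. The helper `doctor_ids` is a hypothetical function written for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE doctor (doctor_id INTEGER)")
conn.executemany("INSERT INTO doctor VALUES (?)", [(5,), (10,), (15,), (20,)])

def doctor_ids(condition):
    """Return the doctor_id values of the rows that satisfy the condition."""
    sql = f"SELECT doctor_id FROM doctor WHERE {condition} ORDER BY doctor_id"
    return [row[0] for row in conn.execute(sql)]

with_in = doctor_ids("doctor_id IN (5, 10, 20)")
with_or = doctor_ids("doctor_id = 5 OR doctor_id = 10 OR doctor_id = 20")
with_not_in = doctor_ids("doctor_id NOT IN (5, 10, 20)")

print(with_in, with_or, with_not_in)
```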


### Partial string matching with LIKE

The LIKE keyword performs partial string matching when the string following it contains a percent sign at the beginning or end. The following matches all descriptions that begin with the word "muscle". The percent sign represents any number of any other characters. Notice that the match is not case sensitive.

In [26]:
sql = """
SELECT *
FROM procedure
WHERE description LIKE "muscle%"
"""
pd.read_sql(sql, CS)

Unnamed: 0,procedure_id,description,major_category,minor_category,cost
0,15732,Muscle flap wound repair at head and neck,Surgery,integumentary system,82
1,15734,Muscle flap wound repair at trunk,Surgery,integumentary system,821
2,15736,Muscle flap wound repair of arm,Surgery,integumentary system,367
3,15738,Muscle flap wound repair of leg,Surgery,integumentary system,507


The `query` method doesn't provide partial string matching. You'll need to create a boolean Series using one of the string-only methods available with the `str` accessor. Here, we use the `startswith` method (which is case sensitive).

In [27]:
filt = procedure['description'].str.startswith('Muscle')
procedure[filt]

Unnamed: 0,procedure_id,description,major_category,minor_category,cost
345,15732,Muscle flap wound repair at head and neck,Surgery,integumentary system,82.0
346,15734,Muscle flap wound repair at trunk,Surgery,integumentary system,821.0
347,15736,Muscle flap wound repair of arm,Surgery,integumentary system,367.0
348,15738,Muscle flap wound repair of leg,Surgery,integumentary system,507.0


In order to ignore case, we need to use the `contains` method with a regular expression.

In [28]:
filt = procedure['description'].str.contains('^muscle', case=False)
procedure[filt]

Unnamed: 0,procedure_id,description,major_category,minor_category,cost
345,15732,Muscle flap wound repair at head and neck,Surgery,integumentary system,82.0
346,15734,Muscle flap wound repair at trunk,Surgery,integumentary system,821.0
347,15736,Muscle flap wound repair of arm,Surgery,integumentary system,367.0
348,15738,Muscle flap wound repair of leg,Surgery,integumentary system,507.0


Here, we match strings that end in "muscle".

In [29]:
sql = """
SELECT *
FROM procedure
WHERE description LIKE "%muscle"
LIMIT 3
"""
pd.read_sql(sql, CS)

Unnamed: 0,procedure_id,description,major_category,minor_category,cost
0,1474,Anesthesia for procedure to repair calf muscle,Anesthesia,lower leg (below knee),482
1,11046,Removal of skin and/or muscle,Surgery,integumentary system,191
2,20200,Biopsy of muscle,Surgery,musculoskeletal system,632


In pandas, the `endswith` string-only method is used. As above, use `contains` with a regular expression to make the match case insensitive.

In [30]:
filt = procedure['description'].str.endswith('muscle')
procedure[filt].head(3)

Unnamed: 0,procedure_id,description,major_category,minor_category,cost
118,1474,Anesthesia for procedure to repair calf muscle,Anesthesia,lower leg (below knee),482.0
181,11046,Removal of skin and/or muscle,Surgery,integumentary system,191.0
473,20200,Biopsy of muscle,Surgery,musculoskeletal system,632.0


Use percent signs on both sides to match strings containing the substring anywhere inside them.

In [31]:
sql = """
SELECT *
FROM procedure
WHERE description LIKE "%muscle%"
LIMIT 3
"""
pd.read_sql(sql, CS)

Unnamed: 0,procedure_id,description,major_category,minor_category,cost
0,1250,"Anesthesia for procedure on nerves, muscles, t...",Anesthesia,upper leg (except knee),627
1,1320,"Anesthesia for procedure on nerves, muscles, t...",Anesthesia,knee and popliteal area,940
2,1470,"Anesthesia for procedure on nerves, muscles, t...",Anesthesia,lower leg (below knee),475


In this instance, we must use the `contains` method as it matches substrings in any part of the string.

In [32]:
filt = procedure['description'].str.contains('muscle')
procedure[filt].head(3)

Unnamed: 0,procedure_id,description,major_category,minor_category,cost
104,1250,"Anesthesia for procedure on nerves, muscles, t...",Anesthesia,upper leg (except knee),627.0
108,1320,"Anesthesia for procedure on nerves, muscles, t...",Anesthesia,knee and popliteal area,940.0
117,1470,"Anesthesia for procedure on nerves, muscles, t...",Anesthesia,lower leg (below knee),475.0
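
The three LIKE patterns can be compared side by side on a small table. This sketch uses four made-up descriptions in an in-memory SQLite database, with a hypothetical helper `ids_like` that passes the pattern as a query parameter. It relies on SQLite's default behavior of ignoring case for ASCII letters in LIKE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE procedure (procedure_id INTEGER, description TEXT)")
conn.executemany(
    "INSERT INTO procedure VALUES (?, ?)",
    [(1, "Muscle flap wound repair of arm"),
     (2, "Biopsy of muscle"),
     (3, "Anesthesia for procedure on nerves, muscles, tendons"),
     (4, "Strapping of toes")],
)

def ids_like(pattern):
    """Return the ids of the rows whose description matches the LIKE pattern."""
    sql = "SELECT procedure_id FROM procedure WHERE description LIKE ? ORDER BY procedure_id"
    return [row[0] for row in conn.execute(sql, (pattern,))]

starts_with = ids_like("muscle%")   # begins with "muscle"
ends_with = ids_like("%muscle")     # ends with "muscle"
contains = ids_like("%muscle%")     # "muscle" anywhere in the string

print(starts_with, ends_with, contains)
```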


## The ORDER BY clause

The ORDER BY clause sorts the data by one or more columns. By default, it uses ascending order. Here, we sort the procedures by cost from least to greatest.

In [33]:
sql = """
SELECT *
FROM procedure
ORDER BY cost
LIMIT 7
"""
pd.read_sql(sql, CS)

Unnamed: 0,procedure_id,description,major_category,minor_category,cost
0,38520,"Biopsy or removal of lymph nodes of neck, open...",Surgery,hemicandlymphatic systems,
1,54056,Freezing destruction of penile growths,Surgery,male genital system,
2,69644,"Repair of eardrum, ear canal and bones with re...",Surgery,auditory system,
3,76998,Ultrasonic guidance during surgery,Radiology,diagnostic ultrasound,
4,80177,Levetiracetam level,Pathology and Laboratory,therapeutic drug assays,
5,21627,Debridement of chest bone,Surgery,musculoskeletal system,49.0
6,24400,Incision to repair upper arm bone,Surgery,musculoskeletal system,49.0


Missing values in SQL are represented as NULL, and SQLite places them at the top when sorting in ascending order. pandas does not have a NULL object, so NULL values are read in as `np.nan` or `None`. Filter out the missing values with the WHERE clause using `IS NOT NULL`.

In [34]:
sql = """
SELECT *
FROM procedure
WHERE cost IS NOT NULL
ORDER BY cost
LIMIT 3
"""
pd.read_sql(sql, CS)

Unnamed: 0,procedure_id,description,major_category,minor_category,cost
0,21627,Debridement of chest bone,Surgery,musculoskeletal system,49
1,24400,Incision to repair upper arm bone,Surgery,musculoskeletal system,49
2,29550,Strapping of toes,Surgery,musculoskeletal system,49
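
The `IS` keyword matters here: comparing a value to NULL with `=` never evaluates as true, so `cost = NULL` matches no rows. The sketch below shows this on a made-up three-row `procedure` table in an in-memory SQLite database. The helper `ids` is a hypothetical function written for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE procedure (procedure_id INTEGER, cost INTEGER)")
# Python's None is stored as NULL.
conn.executemany(
    "INSERT INTO procedure VALUES (?, ?)", [(1, 49), (2, None), (3, 60)]
)

def ids(sql):
    """Run a query and return the first column of every row as a list."""
    return [row[0] for row in conn.execute(sql)]

equals_null = ids("SELECT procedure_id FROM procedure WHERE cost = NULL")
is_null = ids("SELECT procedure_id FROM procedure WHERE cost IS NULL")
is_not_null = ids(
    "SELECT procedure_id FROM procedure WHERE cost IS NOT NULL ORDER BY procedure_id"
)
# In SQLite, NULL sorts before every other value in ascending order.
sorted_by_cost = ids("SELECT procedure_id FROM procedure ORDER BY cost")

print(equals_null, is_null, is_not_null, sorted_by_cost)
```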


Use the `sort_values` method to replicate the ORDER BY behavior in pandas. Setting the `na_position` parameter to "first" will match the SQL placement of missing values.

In [35]:
procedure.sort_values('cost', na_position="first").head(7)

Unnamed: 0,procedure_id,description,major_category,minor_category,cost
1687,38520,"Biopsy or removal of lymph nodes of neck, open...",Surgery,hemicandlymphatic systems,
2265,54056,Freezing destruction of penile growths,Surgery,male genital system,
2910,69644,"Repair of eardrum, ear canal and bones with re...",Surgery,auditory system,
3300,76998,Ultrasonic guidance during surgery,Radiology,diagnostic ultrasound,
3526,80177,Levetiracetam level,Pathology and Laboratory,therapeutic drug assays,
2088,49651,Repositioning of recurrent groin hernia using ...,Surgery,digestive system,49.0
712,24400,Incision to repair upper arm bone,Surgery,musculoskeletal system,49.0


By default, pandas places the rows with missing values at the end when sorting.

In [36]:
procedure.sort_values('cost').head(3)

Unnamed: 0,procedure_id,description,major_category,minor_category,cost
2088,49651,Repositioning of recurrent groin hernia using ...,Surgery,digestive system,49.0
712,24400,Incision to repair upper arm bone,Surgery,musculoskeletal system,49.0
3859,83593,"Ketosteroids, 17 (hormone) measurement",Pathology and Laboratory,chemistry,49.0


By default, pandas uses the quicksort algorithm, which does not preserve the original order when there are ties. This is why the results are not identical. Change to mergesort to perform a stable sort.

In [37]:
procedure.sort_values('cost', kind='mergesort').head(3)

Unnamed: 0,procedure_id,description,major_category,minor_category,cost
578,21627,Debridement of chest bone,Surgery,musculoskeletal system,49.0
712,24400,Incision to repair upper arm bone,Surgery,musculoskeletal system,49.0
1120,29550,Strapping of toes,Surgery,musculoskeletal system,49.0


Sort by values in multiple columns by separating each column name with a comma. Place the keywords ASC or DESC after the column name to sort ascending or descending. Because the default ordering is ASC, it's usually not written. Here, we sort the text column `minor_category` in descending (reverse alphabetical) order, and, within it, by cost from least to greatest.

In [38]:
sql = """
SELECT *
FROM procedure
ORDER BY minor_category DESC, cost ASC
LIMIT 3
"""
pd.read_sql(sql, CS)

Unnamed: 0,procedure_id,description,major_category,minor_category,cost
0,90732,Vaccine for pneumococcal polysaccharide for in...,Medicine,"vaccines, toxoids",51
1,90691,Vaccine for typhoid for injection into muscle,Medicine,"vaccines, toxoids",75
2,90785,Interactive complexity,Medicine,"vaccines, toxoids",87


Use lists for both the first argument and the `ascending` parameter to sort multiple columns in specific directions in pandas.

In [39]:
procedure.sort_values(['minor_category', 'cost'], 
                      ascending=[False, True],
                      kind='mergesort').head(3)

Unnamed: 0,procedure_id,description,major_category,minor_category,cost
4596,90732,Vaccine for pneumococcal polysaccharide for in...,Medicine,"vaccines, toxoids",51.0
4592,90691,Vaccine for typhoid for injection into muscle,Medicine,"vaccines, toxoids",75.0
4602,90785,Interactive complexity,Medicine,"vaccines, toxoids",87.0


## Functions

Thus far, we've merely selected existing data from one of the tables in the database. We have not performed any operations or called any functions to change any of the values. All RDBMS's have functions, which are called by passing arguments inside parentheses to produce a new result.

As we covered with pandas, there are both aggregating and non-aggregating functions. The aggregating functions return a single value for the entire table, while the non-aggregating functions return a single value for each row. The available functions and their names are where RDBMS's differ the most. All of the [SQLite functions are found in the documentation][0].

The `count` function returns the number of non-missing values of a particular column when passed the column name. When passed the `*`, it returns the total number of rows. Here, we find the total number of rows, number of non-missing cost values, and the minimum, maximum, and average cost.

[0]: https://www.sqlite.org/lang_corefunc.html

In [40]:
sql = """
SELECT count(*), count(cost), min(cost), max(cost), avg(cost)
FROM procedure
"""
pd.read_sql(sql, CS)

Unnamed: 0,count(*),count(cost),min(cost),max(cost),avg(cost)
0,5286,5281,49,998,527.69627


Passing the `agg` method a list of strings replicates the result.

In [41]:
procedure['cost'].agg(['size', 'count', 'min', 'max', 'mean'])

size     5286.00000
count    5281.00000
min        49.00000
max       998.00000
mean      527.69627
Name: cost, dtype: float64

It's common to rename the columns that result from function calls, which is done with AS just as we did above.

In [42]:
sql = """
SELECT count(*) AS size, count(cost) AS count, 
       min(cost) AS min_cost, max(cost) AS max_cost, avg(cost) AS avg_cost
FROM procedure
"""
pd.read_sql(sql, CS)

Unnamed: 0,size,count,min_cost,max_cost,avg_cost
0,5286,5281,49,998,527.69627


In [43]:
procedure['cost'].agg(size='size', count='count', min_cost='min',
                      max_cost='max', avg_cost='mean')

size        5286.00000
count       5281.00000
min_cost      49.00000
max_cost     998.00000
avg_cost     527.69627
Name: cost, dtype: float64

You can actually count the number of unique values in any column by inserting the DISTINCT keyword in the count function.

In [44]:
sql = """
SELECT count(distinct major_category) AS num_unique_major_category, 
       count(distinct minor_category) AS num_unique_minor_category
FROM procedure
"""
pd.read_sql(sql, CS)

Unnamed: 0,num_unique_major_category,num_unique_minor_category
0,6,96


The `nunique` aggregating method returns the number of unique values in each column in a DataFrame.

In [45]:
procedure.nunique()

procedure_id      5286
description       4731
major_category       6
minor_category      96
cost               947
dtype: int64
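
The way the aggregating functions treat missing values is easier to verify by hand on a few rows. This sketch uses a made-up four-row `procedure` table, with one NULL cost, in an in-memory SQLite database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE procedure (cost INTEGER)")
# Four rows: one missing cost and one repeated cost.
conn.executemany("INSERT INTO procedure VALUES (?)", [(49,), (None,), (60,), (60,)])

sql = """
SELECT count(*), count(cost), count(DISTINCT cost),
       min(cost), max(cost), avg(cost)
FROM procedure
"""
n_rows, n_costs, n_unique, min_cost, max_cost, avg_cost = conn.execute(sql).fetchone()

print(n_rows, n_costs, n_unique, min_cost, max_cost, avg_cost)
```

Here `count(*)` counts all four rows, `count(cost)` skips the NULL, and `count(DISTINCT cost)` counts 49 and 60 once each. The average is taken over the three non-missing values only.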

Non-aggregating functions work on one value at a time, so the result has the same number of records as the original table. Here, we use the `length` and `upper` functions to find the number of characters in the description and uppercase all characters in `major_category`.

In [46]:
sql = """
SELECT description, 
       length(description) as desc_len, 
       upper(major_category) as MAJOR_CATEGORY
FROM procedure
LIMIT 3
"""
pd.read_sql(sql, CS)

Unnamed: 0,description,desc_len,MAJOR_CATEGORY
0,Anesthesia for procedure on salivary gland wit...,54,ANESTHESIA
1,Anesthesia for procedure on eyelid,34,ANESTHESIA
2,Anesthesia for electric shock treatment,39,ANESTHESIA


One way to accomplish this in pandas is with the DataFrame constructor, mapping the three column names to Series.

In [47]:
pd.DataFrame({
    'description':    procedure['description'],
    'desc_len':       procedure['description'].str.len(),
    'MAJOR_CATEGORY': procedure['major_category'].str.upper()
}).head(3)

Unnamed: 0,description,desc_len,MAJOR_CATEGORY
0,Anesthesia for procedure on salivary gland wit...,54,ANESTHESIA
1,Anesthesia for procedure on eyelid,34,ANESTHESIA
2,Anesthesia for electric shock treatment,39,ANESTHESIA
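
The difference between non-aggregating and aggregating functions can be sketched with a tiny in-memory table. The table `demo` and its third row are invented for this illustration; the point is only that `length` and `upper` return one value per record, while wrapping the same expression in `max` collapses the result to a single record.

```python
import sqlite3

# Hypothetical three-row table; not part of the book's database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (description TEXT, major_category TEXT)")
conn.executemany(
    "INSERT INTO demo VALUES (?, ?)",
    [
        ("Anesthesia for procedure on eyelid", "Anesthesia"),
        ("Anesthesia for electric shock treatment", "Anesthesia"),
        ("Removal of tonsils", "Surgery"),
    ],
)

# Non-aggregating: one output record per input record
scalar_rows = conn.execute(
    "SELECT length(description), upper(major_category) FROM demo ORDER BY rowid"
).fetchall()

# Aggregating: all records collapse to a single record
aggregate_rows = conn.execute(
    "SELECT max(length(description)) FROM demo"
).fetchall()

print(len(scalar_rows), len(aggregate_rows))
conn.close()
```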


## Arithmetic and comparison operations

Most of the arithmetic and comparison operators in Python also work in SQL, with the following exceptions:

* Exponentiation - use the `pow` function to raise to a power instead of `**`
* Equality test - use `=` rather than `==` to test equality (SQLite happens to accept `==` as well, but `=` is the standard SQL operator)
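
The operators above can be tried out in a short, self-contained sketch using an in-memory database. One assumption worth flagging: `pow` is only available when SQLite was compiled with its math functions, so the example falls back to registering Python's built-in `pow` under the same name when the function is missing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# pow may be missing in some SQLite builds; if so, register Python's pow
# as a two-argument SQL function with the same name (a fallback, not the
# built-in implementation)
try:
    conn.execute("SELECT pow(3, 5)")
except sqlite3.OperationalError:
    conn.create_function("pow", 2, pow)

sql = """
SELECT 7 % 3     AS remainder,
       3 = 3     AS is_equal,
       3 != 4    AS is_not_equal,
       3 <> 4    AS is_not_equal_alt,
       pow(3, 5) AS raise_power
"""
result = conn.execute(sql).fetchone()
print(result)
conn.close()
```

Both `!=` and `<>` test inequality in SQLite, and each comparison comes back as 1 or 0.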

### Using SELECT by itself as a calculator

Although rare, it's possible to use the SELECT clause by itself without any other clauses. We can turn the clause into a calculator, separating each calculation with a comma. Each calculation becomes a separate column in the result and is named with AS. By default, integer division is performed when both operands are integers. Casting one of the operands to a float with the `cast` function forces true division instead. Notice that scalar values (such as 10 and "Python") are allowed as well.

In [48]:
sql = """
SELECT 10 AS scalar_number, 
       7 * 9 - 12 AS calculation,
       95 / 23 AS integer_division,
       95 / cast(23 AS float) AS true_division,
       pow(3, 5) AS raise_power,
       'Python' AS string
"""
pd.read_sql(sql, CS)

Unnamed: 0,scalar_number,calculation,integer_division,true_division,raise_power,string
0,10,51,4,4.130435,243.0,Python


An easier way to perform true division is to write the integer as a decimal, such as `23.0` in the above query, though this won't work when changing an entire column of integers to floats.
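
For whole columns, one possible workaround is to multiply by `1.0` before dividing, which has the same effect as casting. The sketch below compares the approaches on a made-up two-row table named `demo` (an assumption for illustration, not the book's data). It casts to `real`, which is the name of SQLite's floating-point storage class.

```python
import sqlite3

# Hypothetical table with two integer columns
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (cost INTEGER, procedure_id INTEGER)")
conn.executemany("INSERT INTO demo VALUES (?, ?)", [(248, 100), (84, 103)])

sql = """
SELECT cost / procedure_id               AS integer_division,
       cost * 1.0 / procedure_id         AS times_one_point_zero,
       cast(cost AS real) / procedure_id AS cast_to_real
FROM demo
ORDER BY procedure_id
"""
division = conn.execute(sql).fetchall()
for row in division:
    print(row)
conn.close()
```

The first column truncates toward zero, while the last two columns hold the same floating-point quotient.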

Operating on two columns from a table is far more common. Below, we divide `cost` by `procedure_id` twice: once with integer division and once with true division, by casting one of the columns to a float. Because both columns are integers, the first division returns an integer. We also create a boolean column that tests whether the cost is greater than 200. Many RDBMS's have a true boolean data type, but SQLite does not, so the comparison returns either 0 or 1. Finally, a column of constant values is created.

In [49]:
sql = """
SELECT cost, procedure_id,
       cost / procedure_id AS cost_per_procedure_id, 
       cost / cast(procedure_id as float) AS cost_per_procedure_id_float, 
       cost > 200 AS is_cost_greater_than_200,
       5 AS constant
FROM procedure
LIMIT 3
"""
pd.read_sql(sql, CS)

Unnamed: 0,cost,procedure_id,cost_per_procedure_id,cost_per_procedure_id_float,is_cost_greater_than_200,constant
0,248,100,2,2.48,1,5
1,84,103,0,0.815534,0,5
2,111,104,1,1.067308,0,5
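
Because SQLite represents the result of a comparison as 0 or 1, those values can be fed into aggregating functions. The following is a minimal sketch on an invented table `demo` that counts the records with cost above 200 and computes their share among the non-missing costs.

```python
import sqlite3

# Hypothetical single-column table, including one missing cost
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (cost INTEGER)")
conn.executemany("INSERT INTO demo VALUES (?)", [(248,), (84,), (111,), (None,)])

# The comparison yields 1, 0, or NULL (when cost is missing)
flags = conn.execute("SELECT cost > 200 FROM demo ORDER BY rowid").fetchall()

sql = """
SELECT sum(cost > 200) AS num_over_200,
       avg(cost > 200) AS share_over_200
FROM demo
"""
num_over_200, share_over_200 = conn.execute(sql).fetchone()
print(flags, num_over_200, share_over_200)
conn.close()
```

Comparing a missing value produces a missing result, which `sum` and `avg` then skip.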


We can use the constructor like we did before, or we can use the `assign` method like we do below.

In [50]:
df = procedure[['cost', 'procedure_id']]
df.assign(cost_per_procedure_id=lambda x: x['cost'] // x['procedure_id'],
          is_cost_greater_than_200=lambda x: x['cost'] > 200,
          constant=5).head(3)

Unnamed: 0,cost,procedure_id,cost_per_procedure_id,is_cost_greater_than_200,constant
0,248.0,100,2.0,True,5
1,84.0,103,0.0,False,5
2,111.0,104,1.0,False,5


The result of a calculation between two or more columns can appear in a WHERE clause. Here, we filter for the records with cost at least three times the procedure_id.

In [51]:
sql = """
SELECT *
FROM procedure
WHERE cost / procedure_id >= 3
LIMIT 3
"""
pd.read_sql(sql, CS)

Unnamed: 0,procedure_id,description,major_category,minor_category,cost
0,120,Anesthesia for biopsy of external middle and i...,Anesthesia,head,511
1,126,Anesthesia for incision of ear drum,Anesthesia,head,852
2,142,Anesthesia for lens surgery,Anesthesia,head,952
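
The filter above relies on integer division, yet for positive integers it selects the same records as the condition `cost >= 3 * procedure_id`. The sketch below checks this on a small invented table `demo` (its rows are assumptions made for illustration) and also shows that a record with a missing cost is dropped by either condition.

```python
import sqlite3

# Hypothetical rows; 449 / 150 is just under 3, and one cost is missing
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (procedure_id INTEGER, cost INTEGER)")
conn.executemany(
    "INSERT INTO demo VALUES (?, ?)",
    [(100, 248), (120, 511), (126, 852), (142, 952), (150, 449), (160, None)],
)

by_division = conn.execute(
    "SELECT procedure_id FROM demo "
    "WHERE cost / procedure_id >= 3 ORDER BY procedure_id"
).fetchall()

by_product = conn.execute(
    "SELECT procedure_id FROM demo "
    "WHERE cost >= 3 * procedure_id ORDER BY procedure_id"
).fetchall()

print(by_division == by_product, by_division)
conn.close()
```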


An identical condition in the `query` method produces the same result in pandas.

In [52]:
procedure.query('cost / procedure_id >= 3').head(3)

Unnamed: 0,procedure_id,description,major_category,minor_category,cost
3,120,Anesthesia for biopsy of external middle and i...,Anesthesia,head,511.0
4,126,Anesthesia for incision of ear drum,Anesthesia,head,852.0
6,142,Anesthesia for lens surgery,Anesthesia,head,952.0


## Exercises

All of these exercises use the Chinook database, which contains data from a music store. It has tables on music tracks, artists, genres, playlists, invoices, customers, employees, and more. Open it with DbSchema so that you can see the database diagram. Use SQL SELECT statements to answer each of the following exercises.

### Exercise 1

<span style="color:green; font-size:16px">Create a variable called `CS_CHINOOK` and assign it the value of the connection string. Use it to read in all of the columns of the tracks table.</span>

### Exercise 2

<span style="color:green; font-size:16px">Select the name and composer columns from the tracks table, returning the first five records.</span>

### Exercise 3

<span style="color:green; font-size:16px">Find the number of unique composers and unit prices in the tracks table.</span>

### Exercise 4

<span style="color:green; font-size:16px">Find the unique unit prices in the tracks table.</span>

### Exercise 5

<span style="color:green; font-size:16px">Count the total number of records and the number of non-missing values of composer in the tracks table.</span>

### Exercise 6

<span style="color:green; font-size:16px">Return the first five records in the tracks table where composer is missing.</span>

### Exercise 7

<span style="color:green; font-size:16px">Filter the tracks table for records where the unit price is 1.99. Return the first five records after the 100th.</span>

### Exercise 8

<span style="color:green; font-size:16px">Compute the minutes and seconds of each song in the tracks table as separate columns. Return the song name and milliseconds along with the other two columns naming them appropriately.</span>

### Exercise 9

<span style="color:green; font-size:16px">Select all the records between three and four minutes in length from the tracks table.</span>

### Exercise 10

<span style="color:green; font-size:16px">How many records from the tracks table are under 30 seconds in length?</span>

### Exercise 11

<span style="color:green; font-size:16px">How many records from the tracks table are under 30 seconds or more than 10 minutes in length?</span>

### Exercise 12

<span style="color:green; font-size:16px">Calculate the average unit price for songs greater than 10 minutes in length from the tracks table.</span>

### Exercise 13

<span style="color:green; font-size:16px">Select tracks with TrackId of 10, 100, or 1000.</span>

### Exercise 14

<span style="color:green; font-size:16px">Select all customers from France and Portugal.</span>

### Exercise 15

<span style="color:green; font-size:16px">Find the top 10 invoices by total.</span>

### Exercise 16

<span style="color:green; font-size:16px">Sort the invoices table by BillingCountry and within that by Total from greatest to least.</span>

### Exercise 17

<span style="color:green; font-size:16px">Find all tracks that have a name beginning or ending in 'X'.</span>

### Exercise 18

<span style="color:green; font-size:16px">Find all tracks that have the word 'smith' anywhere in the composer.</span>

### Exercise 19

<span style="color:green; font-size:16px">Calculate the average bytes per millisecond for all tracks. Make sure to use true division. Round to one decimal place.</span>

### Exercise 20

<span style="color:green; font-size:16px">Return the five longest names in the tracks table.</span>

### Exercise 21

<span style="color:green; font-size:16px">Use the [SQLite math functions page][0] to calculate the area of a circle with radius of 17.</span>


[0]: https://www.sqlite.org/lang_mathfunc.html

### Exercise 22

<span style="color:green; font-size:16px">Count the number of customers that do not have a company name.</span>