# Step 4.1: SQL Fundamentals


1. Introduction To SQL

    Learn the basics of SQL to explore a dataset
    * How to preview a SQLite database table
    * How to filter the rows in a table

2. Summary Statistics 
    
    Learn how to calculate summary statistics in SQL
    * How to use aggregate functions in SQL
    * How to compute min, max, and average values in a column
    * How to perform arithmetic in SQL

3. Group Summary Statistics
    
    Learn how to compute statistics across groups
    * How to compute group level summary statistics in a database table
    * How to query virtual columns within a group

4. Subqueries
    
    Learn how to write complex, nested queries using subqueries
    * How to use subqueries to nest queries in the SELECT clause
    * How to use subqueries to nest queries in the WHERE clause

5. Querying SQLite From Python
    
    Learn how to query a SQLite database from Python
    * How to run SQL queries using sqlite3 in Python
    * How to work with cursors and tuples

6. Guided Project: Analyzing CIA Factbook Data Using SQLite And Python
    
    Practice the Python SQLite workflow using CIA Factbook data.
    * Working with SQLite data in Python
    * Generating visualizations from SQLite results


## 4.1.1 Introduction to SQL

### 4.1.1.1 Introduction to Databases

In previous missions, we primarily worked with data represented in a CSV file. The workflow looked something like this:
![text alt](https://s3.amazonaws.com/dq-content/252/pandas_workflow.svg)

The pandas workflow works well when:

* the data fits in __memory__ (a few gigabytes but not terabytes)
* the data is relatively __static__ (doesn't need to be loaded into memory every minute because the data has changed)
* only a __single__ person is __accessing__ the data (shared access to memory is difficult)
* __security__ isn't important (security is critical for company scale production situations)

When the data __changes frequently__, requires __shared access__, doesn't fit in __memory__, and __security__ is critical, a __database__ is a much better solution. A database is a data representation that lives on disk that can be queried, accessed, and updated without using much memory. We primarily interact with a database using a __[database management system](https://en.wikipedia.org/wiki/Database)__ or __DBMS__ for short.

In the pandas workflow, we spend most of our time thinking about what functions and methods to use, where to store intermediate results in variables, and juggling all of these. To work with data stored in a database, we instead use a language called __SQL__ (or structured query language). In SQL, we express each unique request (whether it be fetching a subset of or editing values in the data) as a single query and then ask the DBMS to run the query and display any results.

For example, to fetch a specific subset of the data from a database, we would:

* write the SQL query: __SELECT * FROM salaries__
* ask the DBMS to run the query and display the results to us

Here's what the database workflow looks like:

![text alt](https://s3.amazonaws.com/dq-content/252/database_workflow.svg)
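
The two-step workflow above (write a query, then ask the DBMS to run it) can be sketched with Python's built-in sqlite3 module. This is only a minimal illustration: the contents of the `salaries` table below are invented so that there is something to query, and the database lives in memory rather than on disk.

```python
import sqlite3

# In-memory database so the sketch runs without any files on disk.
conn = sqlite3.connect(":memory:")

# Hypothetical table and rows, invented only to have something to query.
conn.execute("CREATE TABLE salaries (name TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO salaries VALUES (?, ?)",
    [("Ada", 120000), ("Grace", 135000)],
)

# Step 1: write the SQL query as a string.
query = "SELECT * FROM salaries"

# Step 2: ask the DBMS to run the query and hand back the results.
results = conn.execute(query).fetchall()
print(results)
conn.close()
```

Each result row comes back as a Python tuple, one value per column.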


Because the data lives on __disk__, we can work with datasets that consume multiple terabytes of disk space. Many data science teams in industry have servers and setups in __cloud environments__ like Microsoft Azure or Amazon Web Services that let team members work with this scale of data. Robust and popular DBMS tools like __[Postgres](https://www.postgresql.org/)__ and __[MySQL](https://www.mysql.com/)__ include powerful features for managing user credentials, security, and high data throughput (quickly changing data). In this course and the next, we'll learn the fundamentals of SQL using a small, portable DBMS called __[SQLite](https://www.sqlite.org/index.html)__. SQLite is one of the most widely deployed database engines in the world and is __lightweight__ enough that the SQLite DBMS is included as a __[module in Python](https://docs.python.org/3.6/library/sqlite3.html)__. In later courses, we'll dive into production systems like Postgres.

In this course, we'll explore data from the American Community Survey on job outcome statistics based on college majors. While the original CSV version can be found on __[FiveThirtyEight's Github](https://github.com/fivethirtyeight/data/tree/master/college-majors)__, we'll be using a slightly modified version of the data that's stored as a database. We'll be working with the subset of the data that covers recent college grads only, from 2010-2012. In this mission, we'll learn how to write SQL queries to explore and start to understand the dataset.

### 4.1.1.2 Previewing A Table Using SELECT

Whenever we encountered a new dataset in the past, we displayed the first few rows to get familiar with the different columns, types of values, and some sample data.

We've loaded the dataset on job outcome statistics into a database. A database usually consists of multiple, related tables of data. Each table contains rows and columns, just like a CSV file. We'll be working with the database file __jobs.db__, which contains a single table named __recent_grads__. In later courses, we'll learn how to work with a database containing multiple tables.

![text alt](https://s3.amazonaws.com/dq-content/252/sql_table.svg)

To display the first 5 rows from the __recent_grads__ table, we need to:

* write SQL code that expresses this request
* ask the SQLite DBMS software to run the code and display the results.

Like other programming languages, code in SQL has to adhere to a defined structure and vocabulary. To specify that we want to return the first 5 rows from __recent_grads__, we need to run the following SQL query:

```SQL
SELECT * FROM recent_grads LIMIT 5
```

Here's what's returned when the query is run:

<div>
<table><tbody><tr><th>index</th><th>Rank</th><th>Major_code</th><th>Major</th><th>Major_category</th><th>Total</th><th>Sample_size</th><th>Men</th><th>Women</th><th>ShareWomen</th><th>Employed</th><th>Full_time</th><th>Part_time</th><th>Full_time_year_round</th><th>Unemployed</th><th>Unemployment_rate</th><th>Median</th><th>P25th</th><th>P75th</th><th>College_jobs</th><th>Non_college_jobs</th><th>Low_wage_jobs</th></tr><tr><td>0</td><td>1</td><td>2419</td><td>PETROLEUM ENGINEERING</td><td>Engineering</td><td>2339</td><td>36</td><td>2057</td><td>282</td><td>0.120564344</td><td>1976</td><td>1849</td><td>270</td><td>1207</td><td>37</td><td>0.018380527</td><td>110000</td><td>95000</td><td>125000</td><td>1534</td><td>364</td><td>193</td></tr><tr><td>1</td><td>2</td><td>2416</td><td>MINING AND MINERAL ENGINEERING</td><td>Engineering</td><td>756</td><td>7</td><td>679</td><td>77</td><td>0.10185185199999999</td><td>640</td><td>556</td><td>170</td><td>388</td><td>85</td><td>0.117241379</td><td>75000</td><td>55000</td><td>90000</td><td>350</td><td>257</td><td>50</td></tr><tr><td>2</td><td>3</td><td>2415</td><td>METALLURGICAL ENGINEERING</td><td>Engineering</td><td>856</td><td>3</td><td>725</td><td>131</td><td>0.153037383</td><td>648</td><td>558</td><td>133</td><td>340</td><td>16</td><td>0.024096386</td><td>73000</td><td>50000</td><td>105000</td><td>456</td><td>176</td><td>0</td></tr><tr><td>3</td><td>4</td><td>2417</td><td>NAVAL ARCHITECTURE AND MARINE ENGINEERING</td><td>Engineering</td><td>1258</td><td>16</td><td>1123</td><td>135</td><td>0.107313196</td><td>758</td><td>1069</td><td>150</td><td>692</td><td>40</td><td>0.050125313</td><td>70000</td><td>43000</td><td>80000</td><td>529</td><td>102</td><td>0</td></tr><tr><td>4</td><td>5</td><td>2405</td><td>CHEMICAL 
ENGINEERING</td><td>Engineering</td><td>32260</td><td>289</td><td>21239</td><td>11021</td><td>0.341630502</td><td>25694</td><td>23170</td><td>5180</td><td>16697</td><td>1672</td><td>0.061097712</td><td>65000</td><td>50000</td><td>75000</td><td>18314</td><td>4440</td><td>972</td></tr></tbody></table>
</div>

In this query, we specified:

* the columns we wanted using SELECT *
* the table we wanted to query using FROM recent_grads
* the number of rows we wanted using LIMIT 5

Here's a visual breakdown of the different components of the query:

![img alt](https://s3.amazonaws.com/dq-content/252/select_breakdown_2.svg)

Writing and running SQL queries in our interface is similar to writing and running Python code. Type the query in the code cell and click Run to execute the query against the database. If you write multiple queries in a code cell, SQLite will __only display the last query's results__.
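
Outside of the course interface, the same preview can be approximated from Python. The sketch below assumes a small in-memory stand-in for __recent_grads__ that holds only four of the columns and seven rows copied from the preview tables in these notes; it is not the real __jobs.db__ file.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Stand-in table with a subset of the real columns (an assumption for brevity).
conn.execute(
    "CREATE TABLE recent_grads "
    "(Rank INTEGER, Major TEXT, Major_category TEXT, Median INTEGER)"
)
sample_rows = [
    (1, "PETROLEUM ENGINEERING", "Engineering", 110000),
    (2, "MINING AND MINERAL ENGINEERING", "Engineering", 75000),
    (3, "METALLURGICAL ENGINEERING", "Engineering", 73000),
    (4, "NAVAL ARCHITECTURE AND MARINE ENGINEERING", "Engineering", 70000),
    (5, "CHEMICAL ENGINEERING", "Engineering", 65000),
    (6, "NUCLEAR ENGINEERING", "Engineering", 65000),
    (7, "ACTUARIAL SCIENCE", "Business", 62000),
]
conn.executemany("INSERT INTO recent_grads VALUES (?, ?, ?, ?)", sample_rows)

# SELECT * asks for every column; LIMIT 5 caps the number of rows returned.
cursor = conn.execute("SELECT * FROM recent_grads LIMIT 5")
columns = [col[0] for col in cursor.description]  # column names of the result
preview = cursor.fetchall()

print(columns)
for row in preview:
    print(row)
conn.close()
```

The `cursor.description` attribute is a handy way to recover the column names, since the rows themselves are plain tuples.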


#### Instructions

* Write a SQL query that returns the first 10 rows from recent_grads.

#### Answers
```SQL
SELECT * FROM recent_grads LIMIT 10;
```


### 4.1.1.3 Filtering Rows Using WHERE

SQLite ran our query and returned the first 10 rows and all columns from the recent_grads table.

<table><tbody><tr><th>index</th><th>Rank</th><th>Major_code</th><th>Major</th><th>Major_category</th><th>Total</th><th>Sample_size</th><th>Men</th><th>Women</th><th>ShareWomen</th><th>Employed</th><th>Full_time</th><th>Part_time</th><th>Full_time_year_round</th><th>Unemployed</th><th>Unemployment_rate</th><th>Median</th><th>P25th</th><th>P75th</th><th>College_jobs</th><th>Non_college_jobs</th><th>Low_wage_jobs</th></tr><tr><td>0</td><td>1</td><td>2419</td><td>PETROLEUM ENGINEERING</td><td>Engineering</td><td>2339</td><td>36</td><td>2057</td><td>282</td><td>0.120564344</td><td>1976</td><td>1849</td><td>270</td><td>1207</td><td>37</td><td>0.018380527</td><td>110000</td><td>95000</td><td>125000</td><td>1534</td><td>364</td><td>193</td></tr><tr><td>1</td><td>2</td><td>2416</td><td>MINING AND MINERAL ENGINEERING</td><td>Engineering</td><td>756</td><td>7</td><td>679</td><td>77</td><td>0.10185185199999999</td><td>640</td><td>556</td><td>170</td><td>388</td><td>85</td><td>0.117241379</td><td>75000</td><td>55000</td><td>90000</td><td>350</td><td>257</td><td>50</td></tr><tr><td>2</td><td>3</td><td>2415</td><td>METALLURGICAL ENGINEERING</td><td>Engineering</td><td>856</td><td>3</td><td>725</td><td>131</td><td>0.153037383</td><td>648</td><td>558</td><td>133</td><td>340</td><td>16</td><td>0.024096386</td><td>73000</td><td>50000</td><td>105000</td><td>456</td><td>176</td><td>0</td></tr><tr><td>3</td><td>4</td><td>2417</td><td>NAVAL ARCHITECTURE AND MARINE ENGINEERING</td><td>Engineering</td><td>1258</td><td>16</td><td>1123</td><td>135</td><td>0.107313196</td><td>758</td><td>1069</td><td>150</td><td>692</td><td>40</td><td>0.050125313</td><td>70000</td><td>43000</td><td>80000</td><td>529</td><td>102</td><td>0</td></tr><tr><td>4</td><td>5</td><td>2405</td><td>CHEMICAL 
ENGINEERING</td><td>Engineering</td><td>32260</td><td>289</td><td>21239</td><td>11021</td><td>0.341630502</td><td>25694</td><td>23170</td><td>5180</td><td>16697</td><td>1672</td><td>0.061097712</td><td>65000</td><td>50000</td><td>75000</td><td>18314</td><td>4440</td><td>972</td></tr><tr><td>5</td><td>6</td><td>2418</td><td>NUCLEAR ENGINEERING</td><td>Engineering</td><td>2573</td><td>17</td><td>2200</td><td>373</td><td>0.144966965</td><td>1857</td><td>2038</td><td>264</td><td>1449</td><td>400</td><td>0.177226407</td><td>65000</td><td>50000</td><td>102000</td><td>1142</td><td>657</td><td>244</td></tr><tr><td>6</td><td>7</td><td>6202</td><td>ACTUARIAL SCIENCE</td><td>Business</td><td>3777</td><td>51</td><td>832</td><td>960</td><td>0.535714286</td><td>2912</td><td>2924</td><td>296</td><td>2482</td><td>308</td><td>0.095652174</td><td>62000</td><td>53000</td><td>72000</td><td>1768</td><td>314</td><td>259</td></tr><tr><td>7</td><td>8</td><td>5001</td><td>ASTRONOMY AND ASTROPHYSICS</td><td>Physical Sciences</td><td>1792</td><td>10</td><td>2110</td><td>1667</td><td>0.44135557299999995</td><td>1526</td><td>1085</td><td>553</td><td>827</td><td>33</td><td>0.021167415</td><td>62000</td><td>31500</td><td>109000</td><td>972</td><td>500</td><td>220</td></tr><tr><td>8</td><td>9</td><td>2414</td><td>MECHANICAL ENGINEERING</td><td>Engineering</td><td>91227</td><td>1029</td><td>12953</td><td>2105</td><td>0.139792801</td><td>76442</td><td>71298</td><td>13101</td><td>54639</td><td>4650</td><td>0.057342277999999997</td><td>60000</td><td>48000</td><td>70000</td><td>52844</td><td>16384</td><td>3253</td></tr><tr><td>9</td><td>10</td><td>2408</td><td>ELECTRICAL ENGINEERING</td><td>Engineering</td><td>81527</td><td>631</td><td>8407</td><td>6548</td><td>0.437846874</td><td>61928</td><td>55450</td><td>12695</td><td>41413</td><td>3895</td><td>0.059173845</td><td>60000</td><td>45000</td><td>72000</td><td>45829</td><td>10874</td><td>3170</td></tr></tbody></table>

Head to the __[dataset page](https://github.com/fivethirtyeight/data/tree/master/college-majors)__ and spend some time getting familiar with what each column represents.

Based on this dataset preview and an understanding of what each column represents, here are some questions we may have:

* Which majors had mostly female students? Which ones had mostly male students?
* Which majors had the largest spread (difference) between the 25th and 75th percentile starting salaries?
* Which engineering majors had the highest full time employment rates?

Let's start by focusing on the first question. The SQL workflow revolves around translating the question we want to answer to the subset of data we want from the database. To determine which majors had mostly female students, we want the following subset:

* only the Major column
* only the rows where ShareWomen is greater than 0.5 (corresponding to 50%)

To return only the __Major__ column, we need to list that column name in the __SELECT__ part of the query (instead of using the * wildcard to return all columns):

```SQL
SELECT Major FROM recent_grads
```

This will return all of the values in the Major column. We can specify multiple columns this way as well and the results table will preserve the order of the columns:
```SQL
SELECT Major, Major_category FROM recent_grads
```

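To return only the rows where ShareWomen is greater than or equal to 0.5, we need to add a WHERE clause. Note that >= also keeps majors where women make up exactly 50% of graduates; use > instead for a strict majority: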
```SQL
SELECT Major FROM recent_grads
WHERE ShareWomen >= 0.5
```
Finally, we can limit the number of rows returned using LIMIT:
```SQL
SELECT Major FROM recent_grads
WHERE ShareWomen >= 0.5
LIMIT 5
```

Running this query will return the following results table:

<table>
<tbody><tr>
<th>Major</th>
</tr>
<tr>
<td>ACTUARIAL SCIENCE</td>
</tr>
<tr>
<td>COMPUTER SCIENCE</td>
</tr>
<tr>
<td>ENVIRONMENTAL ENGINEERING</td>
</tr>
<tr>
<td>NURSING</td>
</tr>
<tr>
<td>INDUSTRIAL PRODUCTION TECHNOLOGIES</td>
</tr>
</tbody></table>


Here's a breakdown of the different components:

![img alt](https://s3.amazonaws.com/dq-content/252/where_breakdown_1.svg)

While in the __SELECT__ part of the query, we express the specific column we want, in the __WHERE__ part we express the specific rows we want. The beauty of SQL is that these can be independent.
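
As a rough sketch of that independence, the query below returns only the Major column while filtering on ShareWomen, a column that never appears in the output. The table is an in-memory stand-in with four rows copied from the tables in these notes, not the full dataset.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE recent_grads (Major TEXT, ShareWomen REAL)")
conn.executemany(
    "INSERT INTO recent_grads VALUES (?, ?)",
    [
        ("PETROLEUM ENGINEERING", 0.120564344),
        ("ACTUARIAL SCIENCE", 0.535714286),
        ("ASTRONOMY AND ASTROPHYSICS", 0.441355573),
        ("ENVIRONMENTAL ENGINEERING", 0.558548009),
    ],
)

# SELECT picks the column to return; WHERE picks the rows,
# using a column that is not part of the result.
query = """
SELECT Major FROM recent_grads
WHERE ShareWomen >= 0.5
"""
majors = [row[0] for row in conn.execute(query)]
print(majors)
conn.close()
```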

#### Instructions
Write a SQL query that returns the majors where women were a minority.

* Only return the Major and ShareWomen columns (in that order) and don't limit the number of rows returned.

#### Answers
```SQL
SELECT Major, ShareWomen
FROM recent_grads
WHERE ShareWomen < 0.5;
```



### 4.1.1.4 Expressing Multiple Filter Criteria Using AND

To filter rows by specific criteria, we need to use the WHERE statement. A simple WHERE statement requires three things:

* The column we want the database to filter on: ShareWomen
* A comparison operator that specifies how we want to compare a value in a column: >
* The value we want the database to compare each value to: 0.5

Here are the comparison operators we can use:

* Less than: <
* Less than or equal to: <=
* Greater than: >
* Greater than or equal to: >=
* Equal to: =
* Not equal to: !=

The comparison value after the operator must be either text or a number, depending on the field. Because ShareWomen is a numeric column, we don't need to enclose the number 0.5 in quotes. __Finally, most database systems require that the SELECT and FROM statements come first, before WHERE or any other statements__.

We can use the AND operator to combine multiple filter criteria. For example, to determine which engineering majors had a female majority, we'd need to specify 2 filtering criteria:
```SQL
SELECT Major FROM recent_grads
WHERE Major_category = 'Engineering' AND ShareWomen > 0.5
```

<table><tbody><tr><th>Major</th></tr><tr><td>ENVIRONMENTAL ENGINEERING</td></tr><tr><td>INDUSTRIAL PRODUCTION TECHNOLOGIES</td></tr></tbody></table>

It looks like only 2 majors met these criteria. If we wanted to "zoom" back out to look at all of the columns for both of these majors to see if they shared some other common attributes, we could modify the SELECT statement and use the symbol * to represent all columns:

```SQL
SELECT * FROM recent_grads
WHERE Major_category = 'Engineering' AND ShareWomen > 0.5
```
Now, all of the columns for the same 2 rows will be returned:

<table><tbody><tr><th>index</th><th>Rank</th><th>Major_code</th><th>Major</th><th>Major_category</th><th>Total</th><th>Sample_size</th><th>Men</th><th>Women</th><th>ShareWomen</th><th>Employed</th><th>Full_time</th><th>Part_time</th><th>Full_time_year_round</th><th>Unemployed</th><th>Unemployment_rate</th><th>Median</th><th>P25th</th><th>P75th</th><th>College_jobs</th><th>Non_college_jobs</th><th>Low_wage_jobs</th></tr><tr><td>30</td><td>31</td><td>2410</td><td>ENVIRONMENTAL ENGINEERING</td><td>Engineering</td><td>4047</td><td>26</td><td>2639</td><td>3339</td><td>0.558548009</td><td>2983</td><td>2384</td><td>930</td><td>1951</td><td>308</td><td>0.093588575</td><td>50000</td><td>42000</td><td>56000</td><td>2028</td><td>830</td><td>260</td></tr><tr><td>38</td><td>39</td><td>2503</td><td>INDUSTRIAL PRODUCTION TECHNOLOGIES</td><td>Engineering</td><td>4631</td><td>73</td><td>528</td><td>1588</td><td>0.75047259</td><td>4428</td><td>3988</td><td>597</td><td>3242</td><td>129</td><td>0.028308097</td><td>46000</td><td>35000</td><td>65000</td><td>1394</td><td>2454</td><td>480</td></tr></tbody></table>
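
The "filter, then zoom back out" pattern can be sketched in Python as follows. It assumes an in-memory stand-in table with three columns and four rows copied from these notes; reusing the WHERE clause as a string is just one convenient way to iterate, not something the course prescribes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE recent_grads (Major TEXT, Major_category TEXT, ShareWomen REAL)"
)
conn.executemany(
    "INSERT INTO recent_grads VALUES (?, ?, ?)",
    [
        ("PETROLEUM ENGINEERING", "Engineering", 0.120564344),
        ("ACTUARIAL SCIENCE", "Business", 0.535714286),
        ("ENVIRONMENTAL ENGINEERING", "Engineering", 0.558548009),
        ("INDUSTRIAL PRODUCTION TECHNOLOGIES", "Engineering", 0.75047259),
    ],
)

# Both conditions must be true for a row to be returned.
where = "WHERE Major_category = 'Engineering' AND ShareWomen > 0.5"

# First pass: only the Major column.
majors = conn.execute("SELECT Major FROM recent_grads " + where).fetchall()

# Second pass: same rows, every column.
full_rows = conn.execute("SELECT * FROM recent_grads " + where).fetchall()

print(majors)
print(full_rows)
conn.close()
```

ACTUARIAL SCIENCE has a female majority but is filtered out because its category is Business, which shows that AND requires both conditions to hold.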

The ability to quickly iterate on queries as you think of new questions is the appeal of SQL. The SQL workflow lets data professionals focus on asking and answering questions, instead of lower level programming concepts. There's a clear separation of concerns between the engine that stores, organizes, and retrieves the data and the language that lets people interface with the data easily without having to worry about the underlying mechanics.

As the scale of data has increased, engineers have maintained the interface of SQL while swapping out the database engine underneath. This allows people who need to ask and answer questions to easily transfer their SQL experience, even as database technologies change. For example, the __[Presto project](https://en.wikipedia.org/wiki/Presto_%28SQL_query_engine%29)__ lets you query using SQL but use data from database systems like MySQL, from a __distributed file system__ like __HDFS__, and more.


#### Instructions
Write a SQL query that returns all majors that:

* had a female majority, and
* had a median salary greater than 50000.

Only include the following columns in the results and in this order:

* Major
* Major_category
* Median
* ShareWomen

#### Answers
```SQL
SELECT Major, Major_category, Median, ShareWomen
FROM recent_grads
WHERE ShareWomen > 0.5 AND Median > 50000;
```


### 4.1.1.5 Returning One of Several Conditions with OR

We used the AND operator to specify that our filter needs to pass two Boolean conditions. Both of the conditions had to evaluate to True for the record to appear in the result set. If we wanted to specify a filter that meets __either__ of the conditions instead, we would use the OR operator.

```SQL
SELECT [column1, column2,...] FROM [table1]
WHERE [condition1] OR [condition2]
```
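
The template above can be filled in concretely. The sketch below uses an in-memory stand-in table with four rows copied from these notes; the thresholds (65000 and 100) are arbitrary values chosen so that each condition matches a different set of rows, and they differ from the ones in the exercise.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE recent_grads (Major TEXT, Median INTEGER, Unemployed INTEGER)"
)
conn.executemany(
    "INSERT INTO recent_grads VALUES (?, ?, ?)",
    [
        ("PETROLEUM ENGINEERING", 110000, 37),       # meets both conditions
        ("CHEMICAL ENGINEERING", 65000, 1672),       # meets only the first
        ("ASTRONOMY AND ASTROPHYSICS", 62000, 33),   # meets only the second
        ("MECHANICAL ENGINEERING", 60000, 4650),     # meets neither
    ],
)

# A row is returned when at least one of the two conditions is true.
query = """
SELECT Major FROM recent_grads
WHERE Median >= 65000 OR Unemployed <= 100
"""
matches = [row[0] for row in conn.execute(query)]
print(matches)
conn.close()
```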
We'll dive straight into a practice problem because we use the OR and AND operators in similar ways.

#### Instructions
Write a SQL query that returns the first 20 majors that either:

* have a Median salary greater than or equal to 10,000, or
* have 1,000 or fewer Unemployed people

Only include the following columns in the results and in this order:

* Major
* Median
* Unemployed

#### Answer
```SQL
SELECT Major, Median, Unemployed
FROM recent_grads
WHERE Median >= 10000 OR Unemployed <= 1000
LIMIT 20;
```



### 4.1.1.6 Grouping Operators With Parentheses

There's a certain class of questions that we can't answer using only the techniques we've learned so far. For example, if we wanted to write a query that returned all __Engineering__ majors that __either__ had mostly female graduates __or__ an unemployment rate below 5.1%, we would need to use parentheses to express this more complex logic.

The three raw conditions we'll need are:

```SQL
Major_category = 'Engineering'
ShareWomen > 0.5
Unemployment_rate < 0.051
```

Here's what the SQL query looks like using parentheses:

```SQL
SELECT Major, Major_category, ShareWomen, Unemployment_rate
FROM recent_grads
WHERE (Major_category = 'Engineering') AND (ShareWomen > 0.5 OR Unemployment_rate < 0.051);
```

The first thing to note is that SQL's built-in keywords are __case-insensitive__, which means we don't have to capitalize operators like AND or statements like SELECT; the query above would behave the same if it were written with select, from, where, and, and or in lowercase. In SQLite, this also goes for the column names (you can use either major_category or Major_category). We'll stick to using capitalized SQL and the original column names to stay consistent in these missions.

The second thing you may notice is how we enclosed the logic we wanted to be evaluated together in __parentheses__. This is very similar to how we group mathematical calculations together in a particular order. The parentheses make it explicitly clear to the database that we want all of the rows where both of the following expressions evaluate to True:

```SQL
(Major_category = 'Engineering') -> True or False
(ShareWomen > 0.5 OR Unemployment_rate < 0.051) -> True or False
```

If we had written the WHERE clause without any parentheses, the database wouldn't guess our intentions. Because AND takes precedence over OR, it would group the conditions like this and execute the following instead:

```sql
WHERE (Major_category = 'Engineering' AND ShareWomen > 0.5) OR (Unemployment_rate < 0.051)
```

Leaving the parentheses out means AND is evaluated before OR, so the query would also return majors outside of Engineering whose unemployment rate is below 5.1%, which isn't the data we want. Now let's run our intended query and see the results!
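
To see the difference concretely, both versions of the WHERE clause can be run side by side. This sketch assumes an in-memory stand-in table with five rows copied from these notes; ASTRONOMY AND ASTROPHYSICS is included because it is a non-Engineering major with a low unemployment rate.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE recent_grads "
    "(Major TEXT, Major_category TEXT, ShareWomen REAL, Unemployment_rate REAL)"
)
conn.executemany(
    "INSERT INTO recent_grads VALUES (?, ?, ?, ?)",
    [
        ("PETROLEUM ENGINEERING", "Engineering", 0.120564344, 0.018380527),
        ("MINING AND MINERAL ENGINEERING", "Engineering", 0.101851852, 0.117241379),
        ("ENVIRONMENTAL ENGINEERING", "Engineering", 0.558548009, 0.093588575),
        ("ASTRONOMY AND ASTROPHYSICS", "Physical Sciences", 0.441355573, 0.021167415),
        ("ACTUARIAL SCIENCE", "Business", 0.535714286, 0.095652174),
    ],
)

# Intended logic: Engineering AND (mostly women OR low unemployment).
with_parens = """
SELECT Major FROM recent_grads
WHERE (Major_category = 'Engineering')
  AND (ShareWomen > 0.5 OR Unemployment_rate < 0.051)
"""

# Same conditions without parentheses: the OR branch escapes the category filter.
without_parens = """
SELECT Major FROM recent_grads
WHERE Major_category = 'Engineering' AND ShareWomen > 0.5
   OR Unemployment_rate < 0.051
"""

intended = sorted(row[0] for row in conn.execute(with_parens))
unintended = sorted(row[0] for row in conn.execute(without_parens))
print(intended)
print(unintended)
conn.close()
```

Only the second query returns ASTRONOMY AND ASTROPHYSICS, a Physical Sciences major, because its low unemployment rate satisfies the OR branch on its own.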

#### Instructions

Run the query we explored above, which returns all majors that:

* fell under the category of Engineering and
* either
    * had mostly women graduates
    * or had an unemployment rate below 5.1%, which was the rate in August 2015

Only include the following columns in the results and in this order:

* Major
* Major_category
* ShareWomen
* Unemployment_rate

#### Answers:
```SQL
SELECT Major, Major_category, ShareWomen, Unemployment_rate
FROM recent_grads
WHERE Major_category = 'Engineering' AND (ShareWomen > 0.5 OR Unemployment_rate < 0.051);
```



### 4.1.1.7 Ordering Results Using ORDER BY

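The results of every query we've written so far have come back ordered by the Rank column. This reflects the order in which the rows are stored in the table rather than a guarantee; without an ORDER BY clause, SQL doesn't promise any particular order. Recall the query from early in the mission that returned all of the columns and didn't filter rows on any specific criteria (SELECT * FROM recent_grads LIMIT 5):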

<table><tbody><tr><th>index</th><th>Rank</th><th>Major_code</th><th>Major</th><th>Major_category</th><th>Total</th><th>Sample_size</th><th>Men</th><th>Women</th><th>ShareWomen</th><th>Employed</th><th>Full_time</th><th>Part_time</th><th>Full_time_year_round</th><th>Unemployed</th><th>Unemployment_rate</th><th>Median</th><th>P25th</th><th>P75th</th><th>College_jobs</th><th>Non_college_jobs</th><th>Low_wage_jobs</th></tr><tr><td>0</td><td>1</td><td>2419</td><td>PETROLEUM ENGINEERING</td><td>Engineering</td><td>2339</td><td>36</td><td>2057</td><td>282</td><td>0.120564344</td><td>1976</td><td>1849</td><td>270</td><td>1207</td><td>37</td><td>0.018380527</td><td>110000</td><td>95000</td><td>125000</td><td>1534</td><td>364</td><td>193</td></tr><tr><td>1</td><td>2</td><td>2416</td><td>MINING AND MINERAL ENGINEERING</td><td>Engineering</td><td>756</td><td>7</td><td>679</td><td>77</td><td>0.10185185199999999</td><td>640</td><td>556</td><td>170</td><td>388</td><td>85</td><td>0.117241379</td><td>75000</td><td>55000</td><td>90000</td><td>350</td><td>257</td><td>50</td></tr><tr><td>2</td><td>3</td><td>2415</td><td>METALLURGICAL ENGINEERING</td><td>Engineering</td><td>856</td><td>3</td><td>725</td><td>131</td><td>0.153037383</td><td>648</td><td>558</td><td>133</td><td>340</td><td>16</td><td>0.024096386</td><td>73000</td><td>50000</td><td>105000</td><td>456</td><td>176</td><td>0</td></tr><tr><td>3</td><td>4</td><td>2417</td><td>NAVAL ARCHITECTURE AND MARINE ENGINEERING</td><td>Engineering</td><td>1258</td><td>16</td><td>1123</td><td>135</td><td>0.107313196</td><td>758</td><td>1069</td><td>150</td><td>692</td><td>40</td><td>0.050125313</td><td>70000</td><td>43000</td><td>80000</td><td>529</td><td>102</td><td>0</td></tr><tr><td>4</td><td>5</td><td>2405</td><td>CHEMICAL 
ENGINEERING</td><td>Engineering</td><td>32260</td><td>289</td><td>21239</td><td>11021</td><td>0.341630502</td><td>25694</td><td>23170</td><td>5180</td><td>16697</td><td>1672</td><td>0.061097712</td><td>65000</td><td>50000</td><td>75000</td><td>18314</td><td>4440</td><td>972</td></tr></tbody></table>

If we modify the query from the last screen to include the Rank column, you'll notice that the results are ordered by the Rank column as well:

<table><tbody><tr><th>Rank</th><th>Major</th><th>Major_category</th><th>ShareWomen</th><th>Unemployment_rate</th></tr><tr><td>1</td><td>PETROLEUM ENGINEERING</td><td>Engineering</td><td>0.120564344</td><td>0.018380527</td></tr><tr><td>3</td><td>METALLURGICAL ENGINEERING</td><td>Engineering</td><td>0.153037383</td><td>0.024096386</td></tr><tr><td>4</td><td>NAVAL ARCHITECTURE AND MARINE ENGINEERING</td><td>Engineering</td><td>0.107313196</td><td>0.050125313</td></tr><tr><td>14</td><td>MATERIALS SCIENCE</td><td>Engineering</td><td>0.310820285</td><td>0.023042836</td></tr><tr><td>15</td><td>ENGINEERING MECHANICS PHYSICS AND SCIENCE</td><td>Engineering</td><td>0.183985189</td><td>0.006334343</td></tr><tr><td>17</td><td>INDUSTRIAL AND MANUFACTURING ENGINEERING</td><td>Engineering</td><td>0.34347321799999997</td><td>0.042875544</td></tr><tr><td>24</td><td>MATERIALS ENGINEERING AND MATERIALS SCIENCE</td><td>Engineering</td><td>0.292607004</td><td>0.027788805</td></tr><tr><td>31</td><td>ENVIRONMENTAL ENGINEERING</td><td>Engineering</td><td>0.558548009</td><td>0.093588575</td></tr><tr><td>39</td><td>INDUSTRIAL PRODUCTION TECHNOLOGIES</td><td>Engineering</td><td>0.75047259</td><td>0.028308097</td></tr><tr><td>51</td><td>ENGINEERING AND INDUSTRIAL MANAGEMENT</td><td>Engineering</td><td>0.174122505</td><td>0.03365166</td></tr></tbody></table>

As the questions we want to answer get more complex, we want more control over how the results are ordered. We can specify the order using the ORDER BY clause. For example, we may want to understand which majors that met the criteria in the WHERE statement had the lowest unemployment rate:

```SQL
SELECT Rank, Major, Major_category, ShareWomen, Unemployment_rate
FROM recent_grads
WHERE (Major_category = 'Engineering') AND (ShareWomen > 0.5 OR Unemployment_rate < 0.051)
ORDER BY Unemployment_rate
```

This will return the results in ascending order (increasing) by the Unemployment_rate column:

<table><tbody><tr><th>Rank</th><th>Major</th><th>Major_category</th><th>ShareWomen</th><th>Unemployment_rate</th></tr><tr><td>15</td><td>ENGINEERING MECHANICS PHYSICS AND SCIENCE</td><td>Engineering</td><td>0.183985189</td><td>0.006334343</td></tr><tr><td>1</td><td>PETROLEUM ENGINEERING</td><td>Engineering</td><td>0.120564344</td><td>0.018380527</td></tr><tr><td>14</td><td>MATERIALS SCIENCE</td><td>Engineering</td><td>0.310820285</td><td>0.023042836</td></tr><tr><td>3</td><td>METALLURGICAL ENGINEERING</td><td>Engineering</td><td>0.153037383</td><td>0.024096386</td></tr><tr><td>24</td><td>MATERIALS ENGINEERING AND MATERIALS SCIENCE</td><td>Engineering</td><td>0.292607004</td><td>0.027788805</td></tr><tr><td>39</td><td>INDUSTRIAL PRODUCTION TECHNOLOGIES</td><td>Engineering</td><td>0.75047259</td><td>0.028308097</td></tr><tr><td>51</td><td>ENGINEERING AND INDUSTRIAL MANAGEMENT</td><td>Engineering</td><td>0.174122505</td><td>0.03365166</td></tr><tr><td>17</td><td>INDUSTRIAL AND MANUFACTURING ENGINEERING</td><td>Engineering</td><td>0.34347321799999997</td><td>0.042875544</td></tr><tr><td>4</td><td>NAVAL ARCHITECTURE AND MARINE ENGINEERING</td><td>Engineering</td><td>0.107313196</td><td>0.050125313</td></tr><tr><td>31</td><td>ENVIRONMENTAL ENGINEERING</td><td>Engineering</td><td>0.558548009</td><td>0.093588575</td></tr></tbody></table>

If we instead want the results ordered by the same column but in descending order, we can add the DESC keyword:

```SQL
SELECT Rank, Major, Major_category, ShareWomen, Unemployment_rate
FROM recent_grads
WHERE (Major_category = 'Engineering') AND (ShareWomen > 0.5 OR Unemployment_rate < 0.051)
ORDER BY Unemployment_rate DESC
```

Here's the result of that query:
<table><tbody><tr><th>Rank</th><th>Major</th><th>Major_category</th><th>ShareWomen</th><th>Unemployment_rate</th></tr><tr><td>31</td><td>ENVIRONMENTAL ENGINEERING</td><td>Engineering</td><td>0.558548009</td><td>0.093588575</td></tr><tr><td>4</td><td>NAVAL ARCHITECTURE AND MARINE ENGINEERING</td><td>Engineering</td><td>0.107313196</td><td>0.050125313</td></tr><tr><td>17</td><td>INDUSTRIAL AND MANUFACTURING ENGINEERING</td><td>Engineering</td><td>0.34347321799999997</td><td>0.042875544</td></tr><tr><td>51</td><td>ENGINEERING AND INDUSTRIAL MANAGEMENT</td><td>Engineering</td><td>0.174122505</td><td>0.03365166</td></tr><tr><td>39</td><td>INDUSTRIAL PRODUCTION TECHNOLOGIES</td><td>Engineering</td><td>0.75047259</td><td>0.028308097</td></tr><tr><td>24</td><td>MATERIALS ENGINEERING AND MATERIALS SCIENCE</td><td>Engineering</td><td>0.292607004</td><td>0.027788805</td></tr><tr><td>3</td><td>METALLURGICAL ENGINEERING</td><td>Engineering</td><td>0.153037383</td><td>0.024096386</td></tr><tr><td>14</td><td>MATERIALS SCIENCE</td><td>Engineering</td><td>0.310820285</td><td>0.023042836</td></tr><tr><td>1</td><td>PETROLEUM ENGINEERING</td><td>Engineering</td><td>0.120564344</td><td>0.018380527</td></tr><tr><td>15</td><td>ENGINEERING MECHANICS PHYSICS AND SCIENCE</td><td>Engineering</td><td>0.183985189</td><td>0.006334343</td></tr></tbody></table>
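
A compact way to compare the two orderings from Python is sketched below. It assumes an in-memory stand-in table with four rows copied from the result tables above and tracks the Rank values so the reordering is easy to see.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE recent_grads (Rank INTEGER, Major TEXT, Unemployment_rate REAL)"
)
conn.executemany(
    "INSERT INTO recent_grads VALUES (?, ?, ?)",
    [
        (1, "PETROLEUM ENGINEERING", 0.018380527),
        (4, "NAVAL ARCHITECTURE AND MARINE ENGINEERING", 0.050125313),
        (15, "ENGINEERING MECHANICS PHYSICS AND SCIENCE", 0.006334343),
        (31, "ENVIRONMENTAL ENGINEERING", 0.093588575),
    ],
)

# ORDER BY sorts ascending by default.
ascending = conn.execute(
    "SELECT Rank FROM recent_grads ORDER BY Unemployment_rate"
).fetchall()

# Adding DESC reverses the sort direction.
descending = conn.execute(
    "SELECT Rank FROM recent_grads ORDER BY Unemployment_rate DESC"
).fetchall()

asc_ranks = [row[0] for row in ascending]
desc_ranks = [row[0] for row in descending]
print(asc_ranks)   # lowest unemployment rate first
print(desc_ranks)  # highest unemployment rate first
conn.close()
```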


#### Instructions
Write a query that returns all majors where:

* ShareWomen is greater than 0.3
* and Unemployment_rate is less than 0.1

Only include the following columns in the results and in this order:

* Major
* ShareWomen
* Unemployment_rate

Order the results in descending order by the ShareWomen column.

#### Answers:
```SQL
SELECT Major, ShareWomen, Unemployment_rate
FROM recent_grads
WHERE ShareWomen>0.3 AND Unemployment_rate<0.1
ORDER BY ShareWomen DESC;
```

### 4.1.1.8 Practice Writing A Query

In this step, you'll practice going from question to answer using the SQL workflow. You'll focus on one of the questions we posed early in this mission:

* Which engineering majors had the highest full time employment rates?

#### Instructions

* Write a query that returns the Engineering or Physical Sciences majors in ascending order of unemployment rates.
    * The results should only contain the Major_category, Major, and Unemployment_rate columns.
    
#### Answers:
```SQL
SELECT Major_category, Major, Unemployment_rate
FROM recent_grads
WHERE Major_category IN ('Engineering', 'Physical Sciences')
ORDER BY Unemployment_rate;
```
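
The answer above uses the IN operator, which hasn't been covered in this mission; it is standard SQL shorthand for a chain of equality checks joined with OR. The sketch below, which assumes an in-memory stand-in table with four rows copied from these notes, checks that both spellings return the same rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE recent_grads "
    "(Major_category TEXT, Major TEXT, Unemployment_rate REAL)"
)
conn.executemany(
    "INSERT INTO recent_grads VALUES (?, ?, ?)",
    [
        ("Engineering", "PETROLEUM ENGINEERING", 0.018380527),
        ("Physical Sciences", "ASTRONOMY AND ASTROPHYSICS", 0.021167415),
        ("Business", "ACTUARIAL SCIENCE", 0.095652174),
        ("Engineering", "MINING AND MINERAL ENGINEERING", 0.117241379),
    ],
)

# Version 1: membership test with IN.
in_query = """
SELECT Major FROM recent_grads
WHERE Major_category IN ('Engineering', 'Physical Sciences')
ORDER BY Unemployment_rate
"""

# Version 2: the same filter written with OR, as taught in this mission.
or_query = """
SELECT Major FROM recent_grads
WHERE Major_category = 'Engineering' OR Major_category = 'Physical Sciences'
ORDER BY Unemployment_rate
"""

in_result = [row[0] for row in conn.execute(in_query)]
or_result = [row[0] for row in conn.execute(or_query)]
print(in_result)
print(in_result == or_result)
conn.close()
```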


### 4.1.1.9 Next Steps

In this mission, we became familiar with a dataset stored in a SQLite table by learning how to craft basic SQL queries.

Here are a few things to note:

* We rarely linked to the SQLite documentation, because it's a bit challenging to understand while you're just starting out. Sites like W3Schools and SQLZoo are friendlier for looking up SQL commands.
* We learned about clauses, statements, keywords, and operators in SQL. Here's a diagram describing the difference between each term:

![img alt](https://s3.amazonaws.com/dq-content/252/sql_components.svg)

In the next mission, we'll learn how to compute summary statistics and perform reductions on the same data in SQL.


## 4.1.2 Summary Statistics

<span style="color:red">Question: Is the query below valid, even though it selects plain columns and an aggregate at the same time?</span>
```SQL
SELECT Major, Major_category, MIN(Median)
FROM recent_grads
WHERE Major_category = 'Engineering'
```
Output:
```SQL
Major | Major_category | MIN(Median)
ARCHITECTURE | Engineering | 40000
```
<span style="color:red">Question: What if we replace MIN with SUM, which does **not make sense** for the Major and Major_category columns?</span>




### 4.1.2.1 Introduction

In the last mission, we wrote queries that filtered rows and columns in a database table. Each of the queries we ran returned a collection of rows of values. What if we wanted to calculate the sum, average, min, or max of the results from these queries?

In this mission, we'll learn how to calculate __[summary statistics](https://en.wikipedia.org/wiki/Summary_statistics)__ on subsets of a database table. We'll continue working with data on job outcomes, compiled by FiveThirtyEight. Here's what the first 5 rows of the data look like:

<table><tbody><tr><th>index</th><th>Rank</th><th>Major_code</th><th>Major</th><th>Major_category</th><th>Total</th><th>Sample_size</th><th>Men</th><th>Women</th><th>ShareWomen</th><th>Employed</th><th>Full_time</th><th>Part_time</th><th>Full_time_year_round</th><th>Unemployed</th><th>Unemployment_rate</th><th>Median</th><th>P25th</th><th>P75th</th><th>College_jobs</th><th>Non_college_jobs</th><th>Low_wage_jobs</th></tr><tr><td>0</td><td>1</td><td>2419</td><td>PETROLEUM ENGINEERING</td><td>Engineering</td><td>2339</td><td>36</td><td>2057</td><td>282</td><td>0.120564344</td><td>1976</td><td>1849</td><td>270</td><td>1207</td><td>37</td><td>0.018380527</td><td>110000</td><td>95000</td><td>125000</td><td>1534</td><td>364</td><td>193</td></tr><tr><td>1</td><td>2</td><td>2416</td><td>MINING AND MINERAL ENGINEERING</td><td>Engineering</td><td>756</td><td>7</td><td>679</td><td>77</td><td>0.10185185199999999</td><td>640</td><td>556</td><td>170</td><td>388</td><td>85</td><td>0.117241379</td><td>75000</td><td>55000</td><td>90000</td><td>350</td><td>257</td><td>50</td></tr><tr><td>2</td><td>3</td><td>2415</td><td>METALLURGICAL ENGINEERING</td><td>Engineering</td><td>856</td><td>3</td><td>725</td><td>131</td><td>0.153037383</td><td>648</td><td>558</td><td>133</td><td>340</td><td>16</td><td>0.024096386</td><td>73000</td><td>50000</td><td>105000</td><td>456</td><td>176</td><td>0</td></tr><tr><td>3</td><td>4</td><td>2417</td><td>NAVAL ARCHITECTURE AND MARINE ENGINEERING</td><td>Engineering</td><td>1258</td><td>16</td><td>1123</td><td>135</td><td>0.107313196</td><td>758</td><td>1069</td><td>150</td><td>692</td><td>40</td><td>0.050125313</td><td>70000</td><td>43000</td><td>80000</td><td>529</td><td>102</td><td>0</td></tr><tr><td>4</td><td>5</td><td>2405</td><td>CHEMICAL 
ENGINEERING</td><td>Engineering</td><td>32260</td><td>289</td><td>21239</td><td>11021</td><td>0.341630502</td><td>25694</td><td>23170</td><td>5180</td><td>16697</td><td>1672</td><td>0.061097712</td><td>65000</td><td>50000</td><td>75000</td><td>18314</td><td>4440</td><td>972</td></tr></tbody></table>

Let's start with some motivating questions we want to answer:

* How many majors had mostly female students? How many had mostly male students? What proportion of majors had mostly female students?
* Which category of majors had the lowest unemployment rates? Which category of majors had the highest female representation?
* Which majors had the largest spread (difference) between the 25th and 75th percentile starting salaries?

Let's focus on the first set of questions around gender representation in this screen. In the last mission, we learned how to return all majors that contained majority female students:

```SQL
SELECT Major FROM recent_grads WHERE ShareWomen > 0.5
```

Instead of returning all of the rows, we want SQLite to count the number of rows and return just that value. While we don't need to change the subset of data we're working with, we do need to change how it's presented to us. To return just the count, we need to use the SQL function __[COUNT()](https://sqlite.org/lang_aggfunc.html#count)__:

```SQL
SELECT COUNT(Major) FROM recent_grads WHERE ShareWomen > 0.5
```

This will return:

<table><tbody><tr><th>COUNT(Major)</th></tr><tr><td>97</td></tr></tbody></table>

Instead of just returning a single value, SQLite returned a table with a column (__COUNT(Major)__) and the count as a row in that column (97).

A key idea in SQL is that __everything is a table__. One advantage of this simplification is that it's a common, visual representation that makes SQL approachable for a much wider audience. The disadvantage is that datasets and calculations that aren't well suited for this representation must be converted to be used in a SQL database environment.
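
To see the "everything is a table" idea outside the course interface, here is a minimal sketch using Python's built-in sqlite3 module. The three-row recent_grads table below is a hypothetical stand-in for the real data, so the count differs from the 97 shown above. Even a single count comes back as a list of rows, each row being a tuple.

```python
import sqlite3

# Hypothetical three-row stand-in for the real recent_grads table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE recent_grads (Major TEXT, ShareWomen REAL)")
conn.executemany(
    "INSERT INTO recent_grads VALUES (?, ?)",
    [("NURSING", 0.90), ("PETROLEUM ENGINEERING", 0.12), ("ART HISTORY", 0.85)],
)

cursor = conn.execute(
    "SELECT COUNT(Major) FROM recent_grads WHERE ShareWomen > 0.5"
)
column_names = [d[0] for d in cursor.description]  # one result column
rows = cursor.fetchall()                           # one result row

print(column_names)
print(rows)  # [(2,)]
```

The result is a one-column, one-row table, which is why the value has to be unpacked with something like `rows[0][0]`.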


#### Instructions
* Write a query that returns the number of majors with mostly male students.
    * Use all caps in the SELECT clause so our answer checking will match - COUNT(Major).
    
#### Answers:
```SQL
SELECT COUNT(Major)
FROM recent_grads
WHERE ShareWomen<0.5;
```

### 4.1.2.2 Finding a Column's Minimum and Maximum Values in SQL

Functions like __COUNT()__ are known as __[aggregate functions](https://sqlite.org/lang_aggfunc.html)__. Aggregate functions are applied over columns of values and return a single value. __MIN()__ and __MAX()__, for example, calculate and return the minimum and maximum values in a column.

We can use these functions to compute the lowest value in the ShareWomen column:

```SQL
SELECT MIN(ShareWomen) 
FROM recent_grads;
```
<table><tbody><tr><th>MIN(ShareWomen)</th></tr><tr><td>0.0</td></tr></tbody></table>


It's interesting that there's a major with 0 women in the dataset. <span style="color:red">__What if we wanted to know which major that was or access other columns for that row__</span>? We just need to add the additional columns we want returned in the SELECT clause:

```SQL
SELECT Major, MIN(ShareWomen) FROM recent_grads
```
<table><tbody><tr><th>Major</th><th>MIN(ShareWomen)</th></tr><tr><td>MISCELLANEOUS ENGINEERING TECHNOLOGIES</td><td>0.0</td></tr></tbody></table>

If you think about it, <span style="color:red">__MIN(ShareWomen) acts as a row filter in some way__</span>. While the query SELECT Major FROM recent_grads returns all of the values in the Major column, the query SELECT Major, MIN(ShareWomen) FROM recent_grads only returns the Major value from the row with the minimum value in the ShareWomen column. This is SQLite-specific behavior: many other databases reject a query that mixes a plain column with an aggregate function unless a GROUP BY clause is present.

One thing to note is that COUNT() can be used on any column, because it just counts the non-NULL values. MIN() and MAX() also work on text columns, where values are compared as strings, while SUM() and AVG() are only meaningful for numeric columns, since they perform arithmetic.
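
As a sketch of how SQLite behaves here, the following Python snippet runs MIN() on a text column and then pairs a plain column with MIN(). The table contents are made up for illustration; only the table and column names come from the mission.

```python
import sqlite3

# Hypothetical sample rows; the real table has many more majors.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE recent_grads (Major TEXT, ShareWomen REAL)")
conn.executemany(
    "INSERT INTO recent_grads VALUES (?, ?)",
    [
        ("ZOOLOGY", 0.64),
        ("ACCOUNTING", 0.52),
        ("MINING AND MINERAL ENGINEERING", 0.10),
    ],
)

# MIN() on a TEXT column compares strings, so it returns the
# alphabetically first major.
first_major = conn.execute("SELECT MIN(Major) FROM recent_grads").fetchone()[0]

# SQLite-specific: a plain column next to a single MIN() is taken
# from the row that holds the minimum.
major, share = conn.execute(
    "SELECT Major, MIN(ShareWomen) FROM recent_grads"
).fetchone()

print(first_major)   # ACCOUNTING
print(major, share)  # MINING AND MINERAL ENGINEERING 0.1
```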

#### Instructions

Write a query that returns the Engineering major with the lowest median salary.

* We only want the Major, Major_category, and MIN(Median) columns in the result.

#### Answers:
```SQL
SELECT Major, Major_category, MIN(Median)
FROM recent_grads
WHERE Major_category = 'Engineering';
```


### 4.1.2.3 Calculating Sums and Averages in SQL

The final two aggregation functions we'll look at are __SUM()__ and __AVG()__. SUM() adds all of the values in a column, while AVG() computes their average. SQLite also provides the __TOTAL()__ function, which returns the sum as a __floating__ point value even if the column contains only integers. The two also differ when there are no non-NULL values to add: SUM() returns NULL, while TOTAL() returns 0.0. You can read more __[here](https://sqlite.org/lang_aggfunc.html)__.

This time around, we're going to skip showing sample code since these functions are used the same way as COUNT(), MIN(), and MAX(). This is good practice working with new functions, as SQL contains many functions that you'll end up using down the road that you haven't been taught explicitly.
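
For readers who want to compare the three functions side by side, here is a small sketch in Python. It uses a hypothetical enrollments table rather than recent_grads, so it doesn't give away the exercise below.

```python
import sqlite3

# Hypothetical table, used only to compare the three functions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE enrollments (students INTEGER)")
conn.executemany("INSERT INTO enrollments VALUES (?)", [(10,), (20,), (33,)])

result = conn.execute(
    "SELECT SUM(students), AVG(students), TOTAL(students) FROM enrollments"
).fetchone()
print(result)  # (63, 21.0, 63.0)

# With no matching rows, SUM() yields NULL (None in Python),
# while TOTAL() yields 0.0.
empty = conn.execute(
    "SELECT SUM(students), TOTAL(students) FROM enrollments WHERE students > 100"
).fetchone()
print(empty)  # (None, 0.0)
```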

#### Instructions

Write a query that computes the sum of the Total column.

* Return only the total number of students as an integer value.

#### Answers

```SQL
SELECT SUM(Total)
FROM recent_grads
```


### 4.1.2.4 Combining Multiple Aggregation Functions

Instead of writing an individual query for each specific question we want to answer, we can actually write queries that answer multiple questions at once. Let's take the following questions:

* What's the lowest median salary?
* What's the highest median salary?
* What's the total number of students?

Recall that we can select multiple columns by including their names with commas, like so:

```SQL
SELECT Major, Major_category FROM recent_grads
```

We can apply the same principle to combine multiple aggregation functions into a single query:

```SQL
SELECT MIN(Median), MAX(Median), SUM(Total)
FROM recent_grads
```
<table><tbody><tr><th>min(Median)</th><th>max(Median)</th><th>sum(Total)</th></tr><tr><td>22000</td><td>110000</td><td>6776015</td></tr></tbody></table>

#### Instructions
Write a query that computes the average of the Total column, the minimum of the Men column, and the maximum of the Women column, in that specific order.

* Make sure that all of the aggregate functions are capitalized (SUM() not sum(), etc), so our results match yours.

#### Answers
```SQL
SELECT AVG(Total), MIN(Men), MAX(Women)
FROM recent_grads
```



### 4.1.2.5 Customizing The Results

All of the queries we've written so far have had somewhat unpleasant column names in the results, like AVG(Total) and MIN(Men). Many companies use SQL environments and tools that can run your query, turn the results into a plot of your choosing, and then create a PDF report containing multiple plots (and some additional explanation from the user). Given that others may need to interpret and understand the results of your SQL queries, it's helpful to be able to __specify custom names__ for the columns in our results.

We can do just that using the __AS keyword__:

```SQL
SELECT COUNT(*) as num_students FROM recent_grads
```

This is known as an __alias__ and the alias is restricted to just our results table (the table in the database won't be renamed). We can specify an arbitrary phrase as a string using quotation marks:

```SQL
SELECT COUNT(*) as "Total Students" FROM recent_grads
```

Even better, we can __drop AS__ entirely and just add the name next to the original column:

```SQL
SELECT COUNT(*) "Total Students" FROM recent_grads
```

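Lastly, we can reference renamed columns when writing longer queries to make our code more compact. SQLite accepts aliases in the WHERE clause, but this isn't standard SQL and many other databases reject it: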

```SQL
SELECT Major m, Major_category mc, Unemployment_rate ur
FROM recent_grads
WHERE (mc = 'Engineering') AND (ur > 0.04 AND ur < 0.08)
ORDER BY ur DESC
```
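
To check that an alias only renames columns in the result and leaves the table itself untouched, here is a sketch using Python's sqlite3 module. The two sample rows and the highest_rate alias are invented for illustration.

```python
import sqlite3

# Hypothetical two-row sample of recent_grads.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE recent_grads (Major TEXT, Unemployment_rate REAL)")
conn.executemany(
    "INSERT INTO recent_grads VALUES (?, ?)",
    [("MAJOR A", 0.05), ("MAJOR B", 0.09)],
)

cursor = conn.execute(
    'SELECT COUNT(*) AS "Total Students", '
    "MAX(Unemployment_rate) highest_rate FROM recent_grads"
)
aliases = [d[0] for d in cursor.description]  # names of the result columns
result = cursor.fetchone()

# The table's own column names are unchanged by the aliases.
table_columns = [row[1] for row in conn.execute("PRAGMA table_info(recent_grads)")]

print(aliases)        # ['Total Students', 'highest_rate']
print(result)         # (2, 0.09)
print(table_columns)  # ['Major', 'Unemployment_rate']
```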

#### Instructions
Write a query that returns, in the following order:

* the number of rows as Number of Students
* the maximum value of Unemployment_rate as Highest Unemployment Rate

#### Answers
```SQL
SELECT COUNT(*) AS "Number of Students", MAX(Unemployment_rate) AS "Highest Unemployment Rate"
FROM recent_grads
```

### 4.1.2.6 Counting Unique Values

<span style="color:red"> Notes: difference between **DISTINCT STATEMENT** and **DISTINCT FUNCTION**</span>

We've been working with the Major_category column a decent amount in our queries, and it's a column with only a few unique values. What if we want to return just the __unique values__ in this column? Or the number of unique values in this column?

We can return all of the unique values in a column using the __DISTINCT statement__.

```SQL
SELECT DISTINCT Major_category FROM recent_grads
```
<table><tbody><tr><th>Major_category</th></tr><tr><td>Engineering</td></tr><tr><td>Business</td></tr><tr><td>Physical Sciences</td></tr><tr><td>Law &amp; Public Policy</td></tr><tr><td>Computers &amp; Mathematics</td></tr><tr><td>Agriculture &amp; Natural Resources</td></tr><tr><td>Industrial Arts &amp; Consumer Services</td></tr><tr><td>Arts</td></tr><tr><td>Health</td></tr><tr><td>Social Science</td></tr><tr><td>Biology &amp; Life Science</td></tr><tr><td>Education</td></tr><tr><td>Humanities &amp; Liberal Arts</td></tr><tr><td>Psychology &amp; Social Work</td></tr><tr><td>Communications &amp; Journalism</td></tr><tr><td>Interdisciplinary</td></tr></tbody></table>

As with the other SQL clauses we've learned, we can use the DISTINCT statement with multiple columns to return __unique pairings__ of those columns:

```SQL
SELECT DISTINCT Major, Major_category FROM recent_grads LIMIT 5
```
<table><tbody><tr><th>Major</th><th>Major_category</th></tr><tr><td>PETROLEUM ENGINEERING</td><td>Engineering</td></tr><tr><td>MINING AND MINERAL ENGINEERING</td><td>Engineering</td></tr><tr><td>METALLURGICAL ENGINEERING</td><td>Engineering</td></tr><tr><td>NAVAL ARCHITECTURE AND MARINE ENGINEERING</td><td>Engineering</td></tr><tr><td>CHEMICAL ENGINEERING</td><td>Engineering</td></tr></tbody></table>

In this case, the Major_category column has far fewer unique values than the Major column (only 16 unique values for Major_category compared to 173 for Major), so the same Major_category value is __repeated__ across many unique values of Major.

Lastly, we can count the number of unique values in a column by placing the __DISTINCT__ keyword inside the __COUNT()__ function. DISTINCT is a keyword rather than a function, so the extra parentheses around the column name are optional:

```SQL
SELECT COUNT(DISTINCT(Major_category)) unique_major_categories FROM recent_grads
```
<table><tbody><tr><th>unique_major_categories</th></tr><tr><td>16</td></tr></tbody></table>

#### Instructions

Write a query that returns the number of unique values in the Major, Major_category, and Major_code columns. Use the following aliases in the following order:

* For the unique value count of the Major column, use the alias unique_majors.
* For the unique value count of the Major_category column, use the alias unique_major_categories.
* For the unique value count of the Major_code column, use the alias unique_major_codes.

#### Answers:
```SQL
SELECT COUNT(DISTINCT Major) AS unique_majors, COUNT(DISTINCT Major_category) AS unique_major_categories, COUNT(DISTINCT Major_code) AS unique_major_codes
FROM recent_grads
```
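
As a quick check of how COUNT() and COUNT(DISTINCT ...) differ, here is a sketch in Python with a hypothetical four-row table. The majors and categories are sample values, not the full data set.

```python
import sqlite3

# Hypothetical sample: four majors spread over two categories.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE recent_grads (Major TEXT, Major_category TEXT)")
conn.executemany(
    "INSERT INTO recent_grads VALUES (?, ?)",
    [
        ("PETROLEUM ENGINEERING", "Engineering"),
        ("CHEMICAL ENGINEERING", "Engineering"),
        ("FINE ARTS", "Arts"),
        ("MUSIC", "Arts"),
    ],
)

# COUNT(col) counts every non-NULL value; COUNT(DISTINCT col) counts
# each different value once.
all_values, unique_values = conn.execute(
    "SELECT COUNT(Major_category), COUNT(DISTINCT Major_category) FROM recent_grads"
).fetchone()

distinct_categories = [
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT Major_category FROM recent_grads ORDER BY Major_category"
    )
]

print(all_values, unique_values)  # 4 2
print(distinct_categories)        # ['Arts', 'Engineering']
```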


### 4.1.2.7 Performing Arithmetic in SQL

Let's revisit one of the questions from the beginning of the mission:

* Which majors had the largest spread (difference) between the 25th and 75th percentile starting salaries?

To answer this question, we need to be able to perform arithmetic on the columns in a table to compute the difference. SQL supports the standard arithmetic operators: __*, +, -, and /__, and we can use them like any other operator:

```SQL
SELECT P75th - P25th quartile_spread FROM recent_grads LIMIT 10
```

<table><tbody><tr><th>quartile_spread</th></tr><tr><td>30000</td></tr><tr><td>35000</td></tr><tr><td>55000</td></tr><tr><td>37000</td></tr><tr><td>25000</td></tr><tr><td>52000</td></tr><tr><td>19000</td></tr><tr><td>77500</td></tr><tr><td>22000</td></tr><tr><td>27000</td></tr></tbody></table>

You can also add, subtract, multiply, or divide columns by individual values:

```SQL
SELECT ShareWomen * 100 percent_female FROM recent_grads LIMIT 10
```
<table><tbody><tr><th>percent_female</th></tr><tr><td>12.0564344</td></tr><tr><td>10.1851852</td></tr><tr><td>15.3037383</td></tr><tr><td>10.731319599999999</td></tr><tr><td>34.1630502</td></tr><tr><td>14.4966965</td></tr><tr><td>53.571428600000004</td></tr><tr><td>44.135557299999995</td></tr><tr><td>13.979280099999999</td></tr><tr><td>43.7846874</td></tr></tbody></table>

One thing to note is that multiplying or dividing columns with a floating point value (or a column with floating point values) will result in floating point values:

* Two floats - Returns a float.
    * SELECT 100.0 / 100.0 returns 1.0.
* A float and an integer - Returns a float.
    * SELECT 100 / 1.0 returns 100.0.
* Two integers - Returns an integer, discarding any fractional part.
    * SELECT 100 / 10 returns 10.
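
The rules above can be checked directly from Python, since SQLite evaluates arithmetic even without a table. This is a minimal sketch; the 7 / 2 case is an added illustration of how integer division drops the remainder.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# No table is needed: SQLite can evaluate literal expressions directly.
int_div, mixed_div, float_div, truncated = conn.execute(
    "SELECT 100 / 10, 100 / 1.0, 100.0 / 100.0, 7 / 2"
).fetchone()

print(int_div)    # 10   (integer / integer -> integer)
print(mixed_div)  # 100.0 (integer / float -> float)
print(float_div)  # 1.0  (float / float -> float)
print(truncated)  # 3    (the fractional part of 3.5 is discarded)
```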
    
    
#### Instructions
Write a query that computes the difference between the 25th and 75th percentile of salaries for all majors.

* Return the Major column first, using the default column name.
* Return the Major_category column second, using the default column name.
* Return the computed difference between the 25th and 75th percentile third, using the alias quartile_spread.
* Order the results from lowest to highest and only return the first 20 results.

#### Answers
```SQL
SELECT Major, Major_category, P75th - P25th AS quartile_spread
FROM recent_grads
ORDER BY quartile_spread 
LIMIT 20
```


### 4.1.2.8 Next Steps

In this mission, we explored how to calculate summary statistics in SQL. It's often advantageous to do these computations in the SQL database instead of a Python environment because it's faster to code and execute. In the next mission, we'll learn how to calculate __statistics__ within specific __subgroups__ using the __GROUP BY statement__.

## 4.1.3 Group Summary Statistics



### 4.1.3.1 Introduction

In the last mission, we computed summary statistics across columns with SQL. In many cases, though, we want to drill down even more and compute __summary statistics per group__. In this mission, we'll explore how to calculate more granular summary statistics using groups.

We'll be working with a data set on jobs we stored in the __recent_grads__ table of __jobs.db__. Each row represents a single college major, and contains information about post-graduation employment of students who studied the major. You can find out more about the data set in __[FiveThirtyEight's GitHub repository](https://github.com/fivethirtyeight/data/tree/master/college-majors)__ for the project. Here are some descriptions for just a few of the 21 total columns:

* Rank - The major's rank by median earnings
* Major_code - The major's code or ID
* Major - The name of the major
* Major_category - The broader category the major belongs to
* Total - The total number of people who studied the major
* Men - The number of male graduates
* Women - The number of female graduates
* ShareWomen - Women as a proportion of the total number of graduates (a number ranging from 0 to 1)
* Employed - The number of employed graduates

Here are the first few rows and columns in the data set:

<table class="table table-bordered">
<thead><tr>
<th>Rank</th>
<th>Major_code</th>
<th>Major</th>
<th>Major_category</th>
<th>Total</th>
<th>Sample_size</th>
<th>Men</th>
<th>Women</th>
<th>ShareWomen</th>
<th>Employed</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>2419</td>
<td>PETROLEUM ENGINEERING</td>
<td>Engineering</td>
<td>2339</td>
<td>36</td>
<td>2057</td>
<td>282</td>
<td>0.120564</td>
<td>1976</td>
</tr>
<tr>
<td>2</td>
<td>2416</td>
<td>MINING AND MINERAL ENGINEERING</td>
<td>Engineering</td>
<td>756</td>
<td>7</td>
<td>679</td>
<td>77</td>
<td>0.101852</td>
<td>640</td>
</tr>
<tr>
<td>3</td>
<td>2415</td>
<td>METALLURGICAL ENGINEERING</td>
<td>Engineering</td>
<td>856</td>
<td>3</td>
<td>725</td>
<td>131</td>
<td>0.153037</td>
<td>648</td>
</tr>
<tr>
<td>4</td>
<td>2417</td>
<td>NAVAL ARCHITECTURE AND MARINE ENGINEERING</td>
<td>Engineering</td>
<td>1258</td>
<td>16</td>
<td>1123</td>
<td>135</td>
<td>0.107313</td>
<td>758</td>
</tr>
<tr>
<td>5</td>
<td>2405</td>
<td>CHEMICAL ENGINEERING</td>
<td>Engineering</td>
<td>32260</td>
<td>289</td>
<td>21239</td>
<td>11021</td>
<td>0.341631</td>
<td>25694 </td>
</tr>
</tbody>
</table>


As we progress through this mission, we'll drill down and compute summary statistics by group to answer questions like:

* What's the share of women in each major category?
* Which major categories have the greatest numbers of employed graduates?
* What percentage of people in each major category end up in low-wage jobs?

#### Instructions
* Write a SQL query that displays all of the columns and the first five rows of the recent_grads table.

#### Answers
```SQL
SELECT *
FROM recent_grads
LIMIT 5
```


### 4.1.3.2 Calculating Group-Level Summary Statistics

The __GROUP BY__ SQL statement allows us to compute summary statistics by "group," or unique value. When we use this statement, SQL creates a group for each unique value in a column or set of columns (the same values we get when we use the DISTINCT statement), and then does the calculations for them. To illustrate, we can find the total number of people employed in each major category with the following query:

```SQL
SELECT SUM(Employed) 
FROM recent_grads 
GROUP BY Major_category;
```

This will give us the total number of employed graduates for each major category. Here's a truncated view of the output:

<table class="table table-bordered">
<tbody><tr>
<th>SUM(Employed)</th>
</tr>
<tr>
<td>66943</td>
</tr>
<tr>
<td>288114</td>
</tr>
<tr>
<td>302797</td>
</tr>
</tbody></table>

The output shows aggregate counts of the Employed column for each Major_category. Unfortunately, it doesn't indicate which major category each row refers to. We can fix this by including the Major_category column in our query:

```SQL
SELECT Major_category, SUM(Employed) 
FROM recent_grads 
GROUP BY Major_category;
```
This makes the output much easier to understand:

<table class="table table-bordered">
<tbody><tr>
<th>Major_category</th>
<th>SUM(Employed)</th>
</tr>
<tr>
<td>Agriculture &amp; Natural Resources</td>
<td>66943</td>
</tr>
<tr>
<td>Arts</td>
<td>288114</td>
</tr>
<tr>
<td>Biology &amp; Life Science</td>
<td>302797</td>
</tr>
</tbody></table>

Here's how the query works. The __GROUP BY__ statement __splits__ the Major_category column into groups (with one group for each unique major category), then calculates the sum for each group. The following diagram shows how GROUP BY splits the data. (The diagram uses a small sample from the recent_grads table.):



For each group, the GROUP BY statement queries each column, and runs all of the aggregation functions we include in the query after the SELECT statement:



If a plain column is selected, SQLite takes its value from a single row in the group. Which row is used is arbitrary (in practice it is often the last one read), so it's best not to rely on it. If an aggregation function is selected, the SQL engine will compute the value for that aggregation function across the group.
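
The split-then-aggregate behavior of GROUP BY can be sketched in Python by comparing the SQL result with a manual grouping loop. The four sample rows are hypothetical, and the loop is only an illustration of the idea, not how SQLite is implemented internally.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample rows: (Major_category, Employed).
rows = [("Arts", 100), ("Engineering", 250), ("Arts", 40), ("Engineering", 10)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE recent_grads (Major_category TEXT, Employed INTEGER)")
conn.executemany("INSERT INTO recent_grads VALUES (?, ?)", rows)

sql_result = conn.execute(
    "SELECT Major_category, SUM(Employed) FROM recent_grads "
    "GROUP BY Major_category ORDER BY Major_category"
).fetchall()

# Manual equivalent: one bucket per unique category, then sum each bucket.
buckets = defaultdict(int)
for category, employed in rows:
    buckets[category] += employed
manual_result = sorted(buckets.items())

print(sql_result)     # [('Arts', 140), ('Engineering', 260)]
print(manual_result)  # [('Arts', 140), ('Engineering', 260)]
```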

The query in the diagram will give us the following result:

<table class="table table-bordered">
<tbody><tr>
<th>Employed</th>
<th>Major_category</th>
<th>SUM(Employed)</th>
</tr>
<tr>
<td>1290</td>
<td>Agriculture</td>
<td>4439</td>
</tr>
<tr>
<td>36165</td>
<td>Arts</td>
<td>39075</td>
</tr>
</tbody></table>

#### Instructions
* Use the SELECT statement to select the following columns and aggregates in a query:
    * Major_category
    * AVG(ShareWomen)
* Use the GROUP BY statement to group the query by the Major_category column.
    
#### Answers

```SQL
SELECT Major_category, AVG(ShareWomen)
FROM recent_grads
GROUP BY Major_category
```


### 4.1.3.3 Practice: Using GROUP BY

Now that we have a better understanding of the GROUP BY statement, let's practice using it by computing summary statistics by group for the recent_grads table.

#### Instructions
* For each major category, find the percentage of graduates who are employed.
    * Use the SELECT statement to select the following columns and aggregates in your query:
        * Major_category
        * AVG(Employed) / AVG(Total) as share_employed
* Use the GROUP BY statement to group the query by the Major_category column.

#### Answers
```SQL
SELECT Major_category, AVG(Employed)/AVG(Total) AS share_employed
FROM recent_grads
GROUP BY Major_category
```



### 4.1.3.4 Querying Virtual Columns With the HAVING Statement

Sometimes we want to select a subset of rows after performing a __GROUP BY__ query. On the last screen, for instance, we may have wanted to select only those rows where share_employed is greater than .8. We can't use the __WHERE clause__ to do this because share_employed isn't a column in recent_grads; it's actually a __virtual column__ generated by the GROUP BY statement.

When we want to filter on a column generated by a GROUP BY query, we can use the __HAVING statement__. Here's an example:

```SQL
SELECT Major_category, AVG(Employed) / AVG(Total) AS share_employed 
FROM recent_grads 
GROUP BY Major_category 
HAVING share_employed > .8;
```

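Note that we used the same column name in the HAVING statement that we originally specified with AS. SQLite allows us to reference custom column names in later clauses such as HAVING and ORDER BY. The WHERE clause, however, filters rows before any grouping happens, so it can't reference an alias like share_employed that is built from aggregate functions. The statement above will result in the following output: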

<table class="table table-bordered">
<tbody><tr>
<th>Major_category</th>
<th>share_employed</th>
</tr>
<tr>
<td>Agriculture &amp; Natural Resources</td>
<td>0.8369862842425075</td>
</tr>
<tr>
<td>Arts</td>
<td>0.8067482429367457</td>
</tr>
<tr>
<td>Business</td>
<td>0.8359659576036412</td>
</tr>
<tr>
<td>Communications &amp; Journalism</td>
<td>0.8422291333949735</td>
</tr>
</tbody></table>

Note that the results only include categories where share_employed is greater than .8. That's because the HAVING statement filters out the other rows.
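
To make the difference between WHERE and HAVING concrete, here is a sketch in Python with four hypothetical rows. WHERE removes individual rows before the groups are formed, while HAVING removes whole groups after the aggregates are computed.

```python
import sqlite3

# Hypothetical sample rows: (Major_category, Employed, Total).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE recent_grads (Major_category TEXT, Employed INTEGER, Total INTEGER)"
)
conn.executemany(
    "INSERT INTO recent_grads VALUES (?, ?, ?)",
    [
        ("Arts", 95, 100),
        ("Arts", 85, 100),
        ("Engineering", 60, 100),
        ("Engineering", 90, 100),
    ],
)

# HAVING filters groups: Engineering (share 0.75) is dropped as a whole.
having_rows = conn.execute(
    "SELECT Major_category, AVG(Employed) / AVG(Total) AS share_employed "
    "FROM recent_grads GROUP BY Major_category "
    "HAVING share_employed > .8 ORDER BY Major_category"
).fetchall()

# WHERE filters rows first: only the Engineering row with 60 employed is
# dropped, so both groups survive with a different average.
where_rows = conn.execute(
    "SELECT Major_category, AVG(Employed) / AVG(Total) AS share_employed "
    "FROM recent_grads WHERE Employed > 80 "
    "GROUP BY Major_category ORDER BY Major_category"
).fetchall()

print(having_rows)  # [('Arts', 0.9)]
print(where_rows)   # [('Arts', 0.9), ('Engineering', 0.9)]
```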

#### Instructions
* Find all of the major categories where the share of graduates with low-wage jobs is greater than .1.
    * Use the SELECT statement to select the following columns and aggregates in a query:
        * Major_category
        * AVG(Low_wage_jobs) / AVG(Total) as share_low_wage
* Use the GROUP BY statement to group the query by the Major_category column.
* Use the HAVING statement to restrict the selection to rows where share_low_wage is greater than .1.

#### Answers
```SQL
SELECT Major_category, AVG(Low_wage_jobs)/AVG(Total) AS share_low_wage
FROM recent_grads
GROUP BY Major_category
HAVING share_low_wage > 0.1
```



### 4.1.3.5 Rounding Results With the ROUND() Function

On the last screen, the percentages in our results were very long and hard to read (e.g., 0.16833085991095678). We can use the SQL ROUND function in our query to round them. Here's an example of what this looks like:

```SQL
SELECT Major_category, ROUND(ShareWomen, 2) AS rounded_share_women 
FROM recent_grads;
```

The query will round the ShareWomen column to two decimal places. Here's a truncated view of the results:

<table class="table table-bordered">
<tbody><tr>
<th>Major_category</th>
<th>rounded_share_women</th>
</tr>
<tr>
<td>Engineering</td>
<td>0.12</td>
</tr>
<tr>
<td>Engineering</td>
<td>0.1</td>
</tr>
</tbody></table>

By passing different values into the ROUND function, such as __ROUND(ShareWomen, 3)__, we can round to different decimal places.

#### Instructions
* Write a SQL query that returns the following columns of recent_grads (in the same order):
    * ShareWomen rounded to 4 decimal places
    * Major_category
* Limit the results to 10 rows.

#### Answers
```SQL
SELECT ROUND(ShareWomen, 4), Major_category
FROM recent_grads
LIMIT 10
```


### 4.1.3.6 Nesting Functions

On a previous screen, we ran the following query:

```SQL
SELECT Major_category, AVG(Employed) / AVG(Total) AS share_employed 
FROM recent_grads 
GROUP BY Major_category 
HAVING share_employed > .8;
```

This query returned very long fractional values for share_employed. We can update our query with the ROUND function to round the results to three decimal places:

```SQL
SELECT Major_category, ROUND(AVG(Employed) / AVG(Total), 3) AS share_employed 
FROM recent_grads 
GROUP BY Major_category 
HAVING share_employed > .8;
```

This will return the following result:

<table class="table table-bordered">
<tbody><tr>
<th>Major_category</th>
<th>share_employed</th>
</tr>
<tr>
<td>Agriculture &amp; Natural Resources</td>
<td>0.837</td>
</tr>
<tr>
<td>Arts</td>
<td>0.807</td>
</tr>
</tbody></table>

#### Instructions

* Use the SELECT statement to select the following columns and aggregates in a query:
    * Major_category
    * AVG(College_jobs) / AVG(Total) as share_degree_jobs
        * Use the ROUND function to round share_degree_jobs to 3 decimal places.
* Group the query by the Major_category column.
* Only select rows where share_degree_jobs is less than .3.

#### Answers
```SQL
SELECT Major_category, ROUND(AVG(College_jobs)/AVG(Total), 3) AS share_degree_jobs
FROM recent_grads
GROUP BY Major_category
HAVING share_degree_jobs<0.3
```


### 4.1.3.7 Casting

In the last few screens, we used SQL arithmetic to divide the results of AVG(), which are floats. This resulted in float values that we could round using the ROUND() function. Most of the underlying columns are stored as integers, though. We can use the __[PRAGMA TABLE_INFO()](https://sqlite.org/pragma.html#pragma_table_info)__ statement by itself to return the type, along with some other information, for each column:

```SQL
PRAGMA TABLE_INFO(recent_grads)
```

This query returns:

<table><tbody><tr><th>cid</th><th>name</th><th>type</th><th>notnull</th><th>dflt_value</th><th>pk</th></tr><tr><td>0</td><td>index</td><td>INTEGER</td><td>0</td><td>None</td><td>0</td></tr><tr><td>1</td><td>Rank</td><td>INTEGER</td><td>0</td><td>None</td><td>0</td></tr><tr><td>2</td><td>Major_code</td><td>INTEGER</td><td>0</td><td>None</td><td>0</td></tr><tr><td>3</td><td>Major</td><td>TEXT</td><td>0</td><td>None</td><td>0</td></tr><tr><td>4</td><td>Major_category</td><td>TEXT</td><td>0</td><td>None</td><td>0</td></tr><tr><td>5</td><td>Total</td><td>INTEGER</td><td>0</td><td>None</td><td>0</td></tr><tr><td>6</td><td>Sample_size</td><td>INTEGER</td><td>0</td><td>None</td><td>0</td></tr><tr><td>7</td><td>Men</td><td>INTEGER</td><td>0</td><td>None</td><td>0</td></tr><tr><td>8</td><td>Women</td><td>INTEGER</td><td>0</td><td>None</td><td>0</td></tr><tr><td>9</td><td>ShareWomen</td><td>REAL</td><td>0</td><td>None</td><td>0</td></tr><tr><td>10</td><td>Employed</td><td>INTEGER</td><td>0</td><td>None</td><td>0</td></tr><tr><td>11</td><td>Full_time</td><td>INTEGER</td><td>0</td><td>None</td><td>0</td></tr><tr><td>12</td><td>Part_time</td><td>INTEGER</td><td>0</td><td>None</td><td>0</td></tr><tr><td>13</td><td>Full_time_year_round</td><td>INTEGER</td><td>0</td><td>None</td><td>0</td></tr><tr><td>14</td><td>Unemployed</td><td>INTEGER</td><td>0</td><td>None</td><td>0</td></tr><tr><td>15</td><td>Unemployment_rate</td><td>REAL</td><td>0</td><td>None</td><td>0</td></tr><tr><td>16</td><td>Median</td><td>INTEGER</td><td>0</td><td>None</td><td>0</td></tr><tr><td>17</td><td>P25th</td><td>INTEGER</td><td>0</td><td>None</td><td>0</td></tr><tr><td>18</td><td>P75th</td><td>INTEGER</td><td>0</td><td>None</td><td>0</td></tr><tr><td>19</td><td>College_jobs</td><td>INTEGER</td><td>0</td><td>None</td><td>0</td></tr><tr><td>20</td><td>Non_college_jobs</td><td>INTEGER</td><td>0</td><td>None</td><td>0</td></tr><tr><td>21</td><td>Low_wage_jobs</td><td>INTEGER</td><td>0</td><td>None</td><td>0</td></tr></tbody></table>


If we try to divide 2 integer columns (Women and Total), SQLite will discard the fractional part and return integer values:

```SQL
SELECT Women, Total, Women/Total SW FROM recent_grads LIMIT 20
```

Here's a truncated view of the results:

<table><tbody><tr><th>Women</th><th>Total</th><th>SW</th></tr><tr><td>282</td><td>2339</td><td>0</td></tr><tr><td>77</td><td>756</td><td>0</td></tr><tr><td>131</td><td>856</td><td>0</td></tr><tr><td>135</td><td>1258</td><td>0</td></tr><tr><td>11021</td><td>32260</td><td>0</td></tr><tr><td>373</td><td>2573</td><td>0</td></tr><tr><td>960</td><td>3777</td><td>0</td></tr><tr><td>1667</td><td>1792</td><td>0</td></tr><tr><td>2105</td><td>91227</td><td>0</td></tr><tr><td>6548</td><td>81527</td><td>0</td></tr><tr><td>8284</td><td>41542</td><td>0</td></tr><tr><td>16016</td><td>15058</td><td>1</td></tr></tbody></table>

Instead, we need to use the __CAST()__ function to convert the columns to the Float type before dividing:

```SQL
SELECT CAST(Women AS Float) / CAST(Total AS Float) AS women_ratio FROM recent_grads LIMIT 5
```

This returns the results as float values:

<table><tbody><tr><th>women_ratio</th></tr><tr><td>0.12056434373663959</td></tr><tr><td>0.10185185185185185</td></tr><tr><td>0.1530373831775701</td></tr><tr><td>0.10731319554848967</td></tr><tr><td>0.3416305021698698</td></tr></tbody></table>


#### Instructions
* Write a query that divides the sum of the Women column by the sum of the Total column, aliased as SW.
* Group the results by Major_category and order by SW.
* The results should only contain the Major_category and SW columns, in that order.


#### Answers
```SQL
SELECT Major_category, CAST(SUM(Women) AS Float)/CAST(SUM(Total) AS Float) AS SW
FROM recent_grads
GROUP BY Major_category
ORDER BY SW
```


### 4.1.3.8 Next Steps

In this mission, we covered the __GROUP BY__ and __HAVING__ statements. We can use these statements to quickly calculate powerful summary statistics in SQL. In the next few missions, we'll learn more about working with SQL tables, including how to insert and modify data.


## 4.1.4 Subqueries

<span style="color:red">**ONLY IN SELECT, FROM, WHERE**</span>


### 4.1.4.1 Writing More Complex Queries

The SQL operations we've learned so far enable us to answer questions with only one source of uncertainty. Many times, we want to answer questions that have 2 or more levels of unknowns. For example:

* Which rows are above the average for the ShareWomen column?

Using the SQL techniques we've learned so far, there's no way to write a query that answers this question. So far, we've only used aggregate functions such as AVG() in the SELECT clause. They're also valid in the HAVING and ORDER BY clauses, but not in the WHERE clause. For example, the following query:

```SQL
SELECT * FROM recent_grads
WHERE ShareWomen > AVG(ShareWomen)
```

will return an error:

```SQL
(sqlite3.OperationalError) misuse of aggregate function AVG() [SQL: 'SELECT * FROM recent_grads WHERE ShareWomen > AVG(ShareWomen)']
```
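
To see this failure without the real data file, here is a small sketch that runs the same query from Python against a hypothetical three-row in-memory table and captures the exception. The exact wording of the message can differ between SQLite versions, so the code only stores it rather than comparing it to a fixed string:

```python
import sqlite3

# Assumption: a tiny in-memory stand-in for recent_grads with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE recent_grads (Major TEXT, ShareWomen REAL)")
conn.executemany(
    "INSERT INTO recent_grads VALUES (?, ?)",
    [("MAJOR A", 0.25), ("MAJOR B", 0.5), ("MAJOR C", 0.75)],
)

try:
    # An aggregate function is not allowed directly in the WHERE clause.
    conn.execute("SELECT * FROM recent_grads WHERE ShareWomen > AVG(ShareWomen)")
    error_message = None
except sqlite3.OperationalError as exc:
    error_message = str(exc)

print(error_message)
conn.close()
```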

We need to instead learn how to break up a question we want to answer into a series of queries that can be combined.


#### Instructions
Try to write a query that answers the following question using the SQL you've learned so far:

* Which rows are above the average for the ShareWomen column?

#### Answers
```SQL
SELECT *
FROM recent_grads
WHERE ShareWomen > (SELECT AVG(ShareWomen)
                        FROM recent_grads)
```



### 4.1.4.2 Subqueries

<span style="color:red">**Dynamic Computing; Declarative Programing Paradigm (explicit); Parallel Computing**</span>

To determine which majors are above the average for the ShareWomen column, we need to:

* first determine the average value for the ShareWomen column
* then select and filter the rows that are greater than the average value

If we had to do this using Python and pandas, we would compute the average value of ShareWomen, store it in a variable, and then use that variable in a table filter. While variables dominate how we express logic in object-oriented programming languages like Python and Java, SQL as we use it in SQLite __doesn't have support for variables__. The designers of SQL, a __[declarative programming language](https://en.wikipedia.org/wiki/Declarative_programming)__, wanted its users to focus on expressing computations rather than explicitly defining, setting, and juggling variables.

What would the query look like if we already knew the average value for the ShareWomen column?

```SQL
SELECT Major, ShareWomen FROM recent_grads
WHERE ShareWomen > 0.5225502029537575
```

Now, how do we make the computed average value, 0.5225502029537575, __dynamic__?

Let's introduce the SQL way to solve this problem -- __subqueries__. A subquery is a query nested within another query. Here's a template for a SQL statement where the subquery resides in the WHERE clause:

```SQL
SELECT Major, ShareWomen FROM recent_grads
WHERE ShareWomen > (subquery that returns the average value for ShareWomen)
```

The subquery is run first and returns the average value for the ShareWomen column (which happens to be 0.5225502029537575). Based on the result of the subquery, SQL will replace the subquery with this value __dynamically__. Note that SQL will ignore the column name (AVG(ShareWomen)) and is smart enough to just use the actual row value. Here's a diagram that makes the flow clearer:

![img alt](https://s3.amazonaws.com/dq-content/255/subquery_one.svg)


The query that replaces the __placeholder subquery__ needs to be a full query (containing SELECT and FROM clauses, etc.) that works even if it's run separately. In addition, the inner query should return a table with only a __single__ row and column because of where it __fits in the outer query__ (... WHERE ShareWomen > ?). If you instead try to return a table with multiple columns, for example, the following __error__ will be returned:

```SQL
(sqlite3.OperationalError) only a single result allowed for a SELECT that is part of an expression [SQL: 'SELECT Major, ShareWomen FROM recent_grads WHERE ShareWomen > (SELECT Major, AVG(ShareWomen) FROM recent_grads)']
```

Lastly, a subquery must always be contained within parentheses __()__, or the following error will be returned:

```SQL
(sqlite3.OperationalError) near "select": syntax error [SQL: 'select Major, Unemployment_rate from recent_grads where Unemployment_rate <  select AVG(Unemployment_rate) from recent_grads order by Unemployment_rate']
```
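
The two-step logic described above (compute the average, then filter by it) can be sketched from Python as follows. The four rows and their major names are made up for illustration, and the table lives in a hypothetical in-memory database:

```python
import sqlite3

# Assumption: made-up rows in an in-memory stand-in for recent_grads.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE recent_grads (Major TEXT, ShareWomen REAL)")
conn.executemany(
    "INSERT INTO recent_grads VALUES (?, ?)",
    [("MAJOR A", 0.25), ("MAJOR B", 0.5), ("MAJOR C", 0.75), ("MAJOR D", 1.0)],
)

# The inner query works on its own and returns one row with one column.
avg_share = conn.execute("SELECT AVG(ShareWomen) FROM recent_grads").fetchone()[0]

# The same inner query, wrapped in parentheses, supplies the filter value.
above_avg = conn.execute(
    """
    SELECT Major, ShareWomen FROM recent_grads
    WHERE ShareWomen > (SELECT AVG(ShareWomen) FROM recent_grads)
    ORDER BY ShareWomen
    """
).fetchall()

print(avg_share)
print(above_avg)
conn.close()
```

Running the inner query separately first is a convenient way to confirm that it returns a single value before nesting it.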

#### Instructions
Write a query that returns the majors that are below the average for Unemployment_rate. The results should:

* only contain the Major and Unemployment_rate columns
* be sorted in ascending order by Unemployment_rate

#### Answers
```SQL
SELECT Major, Unemployment_rate
FROM recent_grads
WHERE Unemployment_rate < (SELECT AVG(Unemployment_rate)
                            FROM recent_grads)
ORDER BY Unemployment_rate
```

### 4.1.4.3 Subquery In Select

In the last screen, we wrote SQL statements that used a subquery to express __dynamic filter criteria__ in the __WHERE__ clause. Specifically, we were interested in rows that were above or below the average value in a specific column. What if we wanted to understand the __proportion__ of majors that are above the average for a given column? We'd need to divide the number of rows that met the filter criteria by the total number of rows in the table.

Let's focus on the query from the last screen:

```SQL
SELECT Major, ShareWomen 
FROM recent_grads
WHERE ShareWomen > (SELECT AVG(ShareWomen) FROM recent_grads)
```

Using the __COUNT()__ aggregate function, we can return the number of rows the results set contains:

```SQL
SELECT COUNT(*) 
FROM recent_grads
WHERE ShareWomen > (SELECT AVG(ShareWomen) FROM recent_grads)
```
<table><tbody><tr><th>COUNT(*)</th></tr><tr><td>91</td></tr></tbody></table>

To return the proportion, we need to divide this value by the total number of rows in recent_grads. The challenge, however, is that we don't know the total number of rows, and we don't want to hard code a value that could go out of date.

To __dynamically__ calculate the number of total rows in recent_grads and be able to use it in another SQL statement, we can use a subquery in the SELECT clause:

```SQL
SELECT COUNT(*), (SELECT COUNT(*) FROM recent_grads) 
FROM recent_grads
WHERE ShareWomen > (SELECT AVG(ShareWomen) FROM recent_grads)
```

<table><tbody><tr><th>COUNT(*)</th><th>(SELECT COUNT(*) FROM recent_grads)</th></tr><tr><td>91</td><td>173</td></tr></tbody></table>

We'll leave it to you to extend the SQL statement to compute the actual proportion.
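
As a quick check of the query above, the sketch below runs it from Python on a hypothetical four-row in-memory table (the rows are invented for illustration). The single result row holds both the filtered count and the total count:

```python
import sqlite3

# Assumption: made-up rows in an in-memory stand-in for recent_grads.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE recent_grads (Major TEXT, ShareWomen REAL)")
conn.executemany(
    "INSERT INTO recent_grads VALUES (?, ?)",
    [("MAJOR A", 0.25), ("MAJOR B", 0.5), ("MAJOR C", 0.75), ("MAJOR D", 1.0)],
)

# The subquery in the SELECT clause counts every row in the table,
# while the outer COUNT(*) only counts rows that pass the WHERE filter.
row = conn.execute(
    """
    SELECT COUNT(*), (SELECT COUNT(*) FROM recent_grads)
    FROM recent_grads
    WHERE ShareWomen > (SELECT AVG(ShareWomen) FROM recent_grads)
    """
).fetchone()

above_count, total_count = row
print(above_count, total_count)
conn.close()
```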

#### Instructions

* Write a SQL statement that computes the proportion (as a __float value__) of rows that contain above average values for the ShareWomen.
* The results should only return the proportion, __aliased__ as proportion_abv_avg, like so (with a different value):

<table><tbody><tr><th>proportion_abv_avg</th></tr><tr><td>0.000</td></tr></tbody></table>

#### Answers
```SQL
SELECT CAST(COUNT(*) AS float)/(SELECT CAST(COUNT(*) AS float) FROM recent_grads) AS proportion_abv_avg
FROM recent_grads
WHERE ShareWomen >
    (SELECT AVG(ShareWomen)
    FROM recent_grads)
```


### 4.1.4.4 Returning Multiple Results In Subqueries

So far, the subqueries we've used have computed an aggregate value of some kind and returned that value to the outer query to use for filtering. This is because we only worked with the < and > operators, which, by definition, __expect a single value__ to compare against in a __filter__. As we learned earlier in this course from the documentation, SQLite contains all of the following operators:

![img alt](https://s3.amazonaws.com/dq-content/255/sqlite_operators.png)

Using the __IN operator__, we can specify a list of values that we want to match against in the __WHERE__ clause. All rows that match exactly will be returned. The following query returns the rows where Major_category equals either Business or Engineering:

```SQL
SELECT Major, Major_category FROM recent_grads
WHERE Major_category IN ('Business', 'Engineering')
LIMIT 7
```

We've limited the results to just 7 rows so they don't take up too much space, but we encourage you to remove the limit and run the query in our interface:

<table><tbody><tr><th>Major</th><th>Major_category</th></tr><tr><td>PETROLEUM ENGINEERING</td><td>Engineering</td></tr><tr><td>MINING AND MINERAL ENGINEERING</td><td>Engineering</td></tr><tr><td>METALLURGICAL ENGINEERING</td><td>Engineering</td></tr><tr><td>NAVAL ARCHITECTURE AND MARINE ENGINEERING</td><td>Engineering</td></tr><tr><td>CHEMICAL ENGINEERING</td><td>Engineering</td></tr><tr><td>NUCLEAR ENGINEERING</td><td>Engineering</td></tr><tr><td>ACTUARIAL SCIENCE</td><td>Business</td></tr></tbody></table>

Opportunities like this, where we've hard coded values, are usually good candidates for converting to a subquery. Instead of returning the rows where Major_category equals one of 2 specific values, we can write a subquery that returns the 5 Major_category values with the highest group-level sums for the Total column:

```SQL
SELECT Major_category FROM recent_grads
GROUP BY Major_category
ORDER BY SUM(Total) DESC
LIMIT 5
```

The subquery returns the following table:

<table><tbody><tr><th>Major_category</th></tr><tr><td>Business</td></tr><tr><td>Humanities &amp; Liberal Arts</td></tr><tr><td>Education</td></tr><tr><td>Engineering</td></tr><tr><td>Social Science</td></tr></tbody></table>

We'll leave it to you to finish integrating this subquery into the outer query.

#### Instructions
Write a query that returns the Major and Major_category columns for the rows where:

* Major_category is one of the 5 categories with the highest group-level sums for the Total column

Here's what the first 3 rows of the final table should look like:
<table><tbody><tr><th>Major</th><th>Major_category</th></tr><tr><td>PETROLEUM ENGINEERING</td><td>Engineering</td></tr><tr><td>MINING AND MINERAL ENGINEERING</td><td>Engineering</td></tr><tr><td>METALLURGICAL ENGINEERING</td><td>Engineering</td></tr></tbody></table>

#### Answers
```SQL
SELECT Major, Major_category
FROM recent_grads
WHERE Major_category IN 
    (SELECT Major_category 
    FROM recent_grads
    GROUP BY Major_category 
    ORDER BY SUM(Total) DESC
    LIMIT 5)
```
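
The same pattern can be tried on a small scale from Python. This sketch assumes a hypothetical in-memory table with five invented majors and uses LIMIT 2 instead of LIMIT 5 so that the effect of the subquery is visible on so few rows:

```python
import sqlite3

# Assumption: invented majors and totals in an in-memory stand-in for recent_grads.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE recent_grads (Major TEXT, Major_category TEXT, Total INTEGER)"
)
conn.executemany(
    "INSERT INTO recent_grads VALUES (?, ?, ?)",
    [
        ("M1", "Business", 100),
        ("M2", "Business", 50),
        ("M3", "Engineering", 80),
        ("M4", "Arts", 10),
        ("M5", "Education", 120),
    ],
)

top_categories_query = """
    SELECT Major_category FROM recent_grads
    GROUP BY Major_category
    ORDER BY SUM(Total) DESC
    LIMIT 2
"""

# Run the inner query alone first: it returns one column with several rows.
top_categories = [row[0] for row in conn.execute(top_categories_query)]

# Then nest it inside IN (...) so the list of categories is computed dynamically.
majors_in_top = conn.execute(
    f"""
    SELECT Major, Major_category FROM recent_grads
    WHERE Major_category IN ({top_categories_query})
    ORDER BY Major
    """
).fetchall()

print(top_categories)
print(majors_in_top)
conn.close()
```

Unlike the subqueries used with < and >, this one may return many rows, but it still returns a single column.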



### 4.1.4.5 Building Complex Subqueries

In the last few screens, we nested subqueries in the __WHERE__ and the __SELECT__ clauses that were evaluated before the outer query was. We can actually nest subqueries within subqueries many times, but this makes our SQL code more complex and harder to debug. In the next course, we'll explore other techniques of composing SQL statements that __make nested logic easier__.

When you have a SQL statement you want to write that will end up using many subqueries, it can be overwhelming at first to know how to start. In general, you want to __start with the inner queries first__ and work your way outwards. Let's say we're interested in understanding the ratio of the Sample_size column to the Total column. You can read the __[dataset documentation](https://github.com/fivethirtyeight/data/tree/master/college-majors)__ if you need a reminder for what these columns represent.

Specifically, let's say we're interested in:

* computing this ratio for every major
* understanding which majors are above the average for this ratio
* understanding how many majors are above the average for this ratio

We'll start by writing a query that computes the ratio for every major and then the average of all of these ratios.

#### Instructions
Write a query that returns the average ratio (Sample_size/Total) for all of the majors.

* You'll need to cast both columns to the float type.
* Use the alias avg_ratio for the average ratio.

Here's the general format for what your query should return:

<table><tbody><tr><th>avg_ratio</th></tr><tr><td>0.00</td></tr></tbody></table>

#### Answers
```SQL
SELECT AVG(CAST(Sample_size AS float)/CAST(Total AS float)) AS avg_ratio
FROM recent_grads

```

### 4.1.4.6 Practice Integrating A Subquery With The Outer Query

Now that we have a subquery that calculates the average ratio (of Sample_size to Total), we can return the rows that exceed this average.

#### Instructions
Write a query that:

* selects the Major, Major_category, and the computed ratio columns
* filters to just the rows where ratio is greater than avg_ratio:
    * recall that this value is the result of the subquery from the last screen: select AVG(cast(Sample_size as float)/cast(Total as float)) avg_ratio from recent_grads

Here's what the first 3 rows should look like:

<table><tbody><tr><th>Major</th><th>Major_category</th><th>ratio</th></tr><tr><td>PETROLEUM ENGINEERING</td><td>Engineering</td><td>0.015391192817443352</td></tr><tr><td>MINING AND MINERAL ENGINEERING</td><td>Engineering</td><td>0.009259259259259259</td></tr><tr><td>NAVAL ARCHITECTURE AND MARINE ENGINEERING</td><td>Engineering</td><td>0.012718600953895072</td></tr></tbody></table>

#### Answers
```SQL
SELECT Major, Major_category, CAST(Sample_size AS float)/CAST(Total AS float) AS ratio
FROM recent_grads
WHERE ratio > 
    (SELECT AVG(CAST(Sample_size AS float)/CAST(Total AS float)) AS avg_ratio
    FROM recent_grads)
```
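
The inside-out approach from the last two screens can be sketched in Python by building the inner query first and reusing its text in the outer queries. The table contents and the helper names (`ratio_expr`, `avg_query`) are assumptions made for this illustration:

```python
import sqlite3

# Assumption: invented rows in an in-memory stand-in for recent_grads.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE recent_grads (Major TEXT, Sample_size INTEGER, Total INTEGER)"
)
conn.executemany(
    "INSERT INTO recent_grads VALUES (?, ?, ?)",
    [("A", 100, 800), ("B", 200, 800), ("C", 400, 800), ("D", 500, 800)],
)

ratio_expr = "CAST(Sample_size AS Float) / CAST(Total AS Float)"

# Step 1: the innermost query, which computes the average ratio.
avg_query = f"SELECT AVG({ratio_expr}) FROM recent_grads"
avg_ratio = conn.execute(avg_query).fetchone()[0]

# Step 2: majors whose ratio is above that average.
above_avg = conn.execute(
    f"""
    SELECT Major, {ratio_expr} AS ratio
    FROM recent_grads
    WHERE {ratio_expr} > ({avg_query})
    ORDER BY Major
    """
).fetchall()

# Step 3: how many majors are above the average.
count_above = conn.execute(
    f"SELECT COUNT(*) FROM recent_grads WHERE {ratio_expr} > ({avg_query})"
).fetchone()[0]

print(avg_ratio, above_avg, count_above)
conn.close()
```

Testing each step separately before nesting it keeps errors easy to locate.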


### 4.1.4.7 Next Steps

In this mission, we explored how subqueries enable us to write more complex and dynamic queries. In the next mission, we'll learn how to work with __SQLite in Python__.

## 4.1.5 Querying SQLite from Python

### 4.1.5.1 Overview

In past missions, we focused on exploring the __SQL syntax__ for retrieving data from a database. In this mission, we'll explore how to __interact with a SQLite database in Python__ so you can start to incorporate databases into your data science workflow.

SQLite is a database that doesn't require a standalone server; it stores the entire database as a file on __disk__. This makes it __ideal__ for working with larger data sets that can __fit on disk but not in memory__.

The __pandas__ library loads the __entire data set__ we're working with __into memory__, making SQLite a compelling alternative for working with data sets larger than the available memory (roughly 8 gigabytes on many computers). Because an __entire database is contained in a single file__, SQLite databases are easy to share; some data sets are available online as SQLite database files (using the extension .db).

We can interact with a SQLite database in two main ways:

* Through the __[sqlite3 Python module](https://docs.python.org/3/library/sqlite3.html)__
* Through the __[SQLite shell](https://sqlite.org/cli.html)__

In this mission, we'll focus on learning how to use the sqlite3 module to interact with the database.


### 4.1.5.2 Introduction to the Data

We'll continue to work with the American Community Survey data on college majors and job outcomes, which looks like this:

<table class="table table-bordered">
<thead><tr>
<th>Rank</th>
<th>Major_code</th>
<th>Major</th>
<th>Major_category</th>
<th>Total</th>
<th>Sample_size</th>
<th>Men</th>
<th>Women</th>
<th>ShareWomen</th>
<th>Employed</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>2419</td>
<td>PETROLEUM ENGINEERING</td>
<td>Engineering</td>
<td>2339</td>
<td>36</td>
<td>2057</td>
<td>282</td>
<td>0.120564</td>
<td>1976</td>
</tr>
<tr>
<td>2</td>
<td>2416</td>
<td>MINING AND MINERAL ENGINEERING</td>
<td>Engineering</td>
<td>756</td>
<td>7</td>
<td>679</td>
<td>77</td>
<td>0.101852</td>
<td>640</td>
</tr>
<tr>
<td>3</td>
<td>2415</td>
<td>METALLURGICAL ENGINEERING</td>
<td>Engineering</td>
<td>856</td>
<td>3</td>
<td>725</td>
<td>131</td>
<td>0.153037</td>
<td>648</td>
</tr>
<tr>
<td>4</td>
<td>2417</td>
<td>NAVAL ARCHITECTURE AND MARINE ENGINEERING</td>
<td>Engineering</td>
<td>1258</td>
<td>16</td>
<td>1123</td>
<td>135</td>
<td>0.107313</td>
<td>758</td>
</tr>
<tr>
<td>5</td>
<td>2405</td>
<td>CHEMICAL ENGINEERING</td>
<td>Engineering</td>
<td>32260</td>
<td>289</td>
<td>21239</td>
<td>11021</td>
<td>0.341631</td>
<td>25694 </td>
</tr>
</tbody>
</table>


The full table has many more columns than the ones we've displayed above (21 to be specific). You can learn about all of them in __[FiveThirtyEight's GitHub repository](https://github.com/fivethirtyeight/data/tree/master/college-majors)__.

Here are the descriptions for the columns in the preview:

* Rank - The major's rank by median earnings
* Major_code - The major's code or ID
* Major - The name of the major
* Major_category - The broader category the major belongs to
* Total - The total number of people who studied the major
* Sample_size - The sample size (unweighted) of graduates with full time jobs
* Men - The number of male graduates
* Women - The number of female graduates
* ShareWomen - Women as a proportion of the total number of graduates (a number ranging from 0 to 1)
* Employed - The number of employed graduates

We've loaded a subset of the data into a table named recent_grads in a database. The subset contains the 2010-2012 data for recent college grads only. The database file we'll be working with is called jobs.db.


### 4.1.5.3 Connecting to the Database

Python 2.5 and later include the sqlite3 module in the standard library, which means we don't need to install any separate libraries to get started. The sqlite3 module was developed to work with __[SQLite version 3](https://www.sqlite.org/version3.html)__.

We can import it into our environment using this command:

```python
import sqlite3
```
Once we've imported the module, we connect to the database we want to query using the __[connect()](https://docs.python.org/3/library/sqlite3.html#sqlite3.connect)__ function. This function requires a single parameter, which is the database we want to connect to. Because the database we're working with exists as a file on disk, we need to pass in the file name.

The connect() function returns a __[Connection instance](https://docs.python.org/3/library/sqlite3.html#sqlite3.Connection)__, which maintains the connection to the database we want to work with. SQLite coordinates access by locking the database file: several __processes__ can read it at the same time, but only one can write at any moment, and a write in progress __prevents__ other processes from using the database until it finishes. The SQLite team made this design decision to keep the database lightweight, and avoid much of the complexity that arises when __multiple processes__ modify the same database.
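
To make the "database as a file on disk" idea concrete, here is a minimal sketch that connects to a database file in a temporary directory, closes the connection, and then reconnects to read the data back. The file name `example.db` and the `majors` table are hypothetical, and the CREATE TABLE, INSERT, and commit() calls are only there to give the example something to read:

```python
import os
import sqlite3
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    db_path = os.path.join(tmp_dir, "example.db")  # hypothetical file name

    # connect() takes the file name; the file is created if it doesn't exist yet.
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE majors (name TEXT)")
    conn.execute("INSERT INTO majors VALUES ('ZOOLOGY')")
    conn.commit()
    conn.close()

    file_exists = os.path.exists(db_path)

    # A new connection to the same file sees the stored data.
    conn = sqlite3.connect(db_path)
    rows = conn.execute("SELECT name FROM majors").fetchall()
    conn.close()

print(file_exists, rows)
```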

#### Instructions
* Import the sqlite3 library into the environment.
* Then, use the sqlite3.connect() function to connect to jobs.db, and assign the Connection instance it returns to conn.

#### Answers
```python
import sqlite3
conn = sqlite3.connect('jobs.db')
```



### 4.1.5.4 Introduction to Cursor Objects and Tuples

Before we can execute a query, we need to express our SQL query as a string. While we use the Connection class to represent the database we're working with, we use the __[Cursor class](https://docs.python.org/3/library/sqlite3.html#cursor-objects)__ to:

* Run a query against the database
* Parse the results from the database
* Convert the results to native Python objects ## <span style="color:red">**In Memory or In Disk???**</span>
* Store the results within the Cursor instance as a local variable

After running a query and converting the results to a list of __tuples__, the Cursor instance stores the list as a local variable. Before diving into the syntax of querying the database, let's learn more about tuples.


### 4.1.5.5 Working With Sequence of Values as Tuples

A __[tuple](https://docs.python.org/3/tutorial/datastructures.html#tuples-and-sequences)__ is a core data structure that Python uses to represent a sequence of values, similar to a list. Unlike lists, tuples are __immutable__, which means we can't modify existing ones. Python represents each row in the results set as a tuple.

To create an empty tuple, assign a pair of empty parentheses to a variable:

```Python
t = ()
```
Python indexes tuples from 0 to n-1, just like it does with lists. We access the values in a tuple using bracket notation.

```Python
t = ('Apple', 'Banana')
apple = t[0] 
banana = t[1]
```
Tuples are generally <span style="color:red">**faster**</span> to create and use less memory than lists, so they're helpful with larger databases and larger results sets.
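
The following short sketch illustrates the tuple behavior described above: indexing, immutability, and the trailing comma that a one-element tuple needs (which is why single-column query results look like `('ZOOLOGY',)`). The values are taken from the data preview, and the variable names are only illustrative:

```python
# A row with two values, similar to what a two-column query would return.
row = ("PETROLEUM ENGINEERING", "Engineering")

# Indexing works the same way as with lists.
major = row[0]

# Tuples are immutable: assigning to an index raises a TypeError.
try:
    row[0] = "ZOOLOGY"
    mutation_error = None
except TypeError as exc:
    mutation_error = exc

# A one-element tuple needs a trailing comma.
single = ("ZOOLOGY",)

# Unpacking assigns each value of the tuple to its own name.
major_name, category = row

print(major, single, category)
```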

Next, let's dive into how to use the __Cursor instance__ to query the database.


### 4.1.5.6 Creating a Cursor and Running a Query

We need to use the Connection instance method cursor() to return a Cursor instance corresponding to the database we want to query.

```python
cursor = conn.cursor()
```

In the following code block, we:

* Write a basic select query that will return all of the values from the recent_grads table, and store this query as a string named query
* Use the Cursor method execute() to run the query against our database
* Return the full results set and store it as results
* Print the first three tuples in the list results

```python
# SQL Query as a string
query = "select * from recent_grads;"
# Execute the query, convert the results to tuples, and store as a local variable
cursor.execute(query)
# Fetch the full results set as a list of tuples
results = cursor.fetchall()
# Display the first three results
print(results[0:3])
```

Now it's your turn!


#### Instructions
* Write a query that returns all of the values in the Major column from the recent_grads table.
* Store the full results set (a list of tuples) in majors.
* Then, print the first three tuples in majors.

#### Answers
```python
import sqlite3
conn = sqlite3.connect("jobs.db")
cursor = conn.cursor()

query = "select * from recent_grads;"
cursor.execute(query)
results = cursor.fetchall()
print(results[0:2])

query = "select major from recent_grads;"
majors = cursor.execute(query).fetchall()
print(majors[0:3])
```

Outputs:
```Python
conn Connection (<class 'sqlite3.Connection'>) ## Object
    <sqlite3.Connection at 0x7fa2d88aed50>
    
cursor Cursor (<class 'sqlite3.Cursor'>) ## Object
    <sqlite3.Cursor at 0x7fa26da0dd50>
```


```python
import sqlite3
conn = sqlite3.connect("jobs.db")
cursor = conn.cursor()

query = "select Major from recent_grads;"
cursor.execute(query)
majors = cursor.fetchall()
print(majors[0:2])
```

Outputs:
```Python
[('PETROLEUM ENGINEERING',), ('MINING AND MINERAL ENGINEERING',)]
```


### 4.1.5.7 Execute as a Shortcut for Running a Query

<span style="color:red">**conn.execute()    VS    conn.cursor().execute()**</span>

So far, we've been running queries by creating a Cursor instance, and then calling the execute method on the instance. The sqlite3 library actually allows us to __skip creating a Cursor__ altogether by using the __[execute method within the Connection object itself](https://docs.python.org/3/library/sqlite3.html#sqlite3.Connection.execute)__. Under the hood, sqlite3 still creates a Cursor instance for us and runs our query against the database, but this shortcut allows us to skip a step. Here's what the code looks like:

```Python
conn = sqlite3.connect("jobs.db")
query = "select * from recent_grads;"
conn.execute(query).fetchall()
```

Notice that we __didn't explicitly create a separate Cursor instance ourselves__ in this code example.
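
To compare the two approaches side by side, here is a small sketch that runs the same query through an explicit Cursor and through the Connection shortcut. It assumes a hypothetical two-row in-memory table in place of jobs.db:

```python
import sqlite3

# Assumption: an in-memory stand-in for recent_grads with two majors.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE recent_grads (Major TEXT)")
conn.executemany(
    "INSERT INTO recent_grads VALUES (?)", [("ZOOLOGY",), ("BOTANY",)]
)

query = "SELECT Major FROM recent_grads ORDER BY Major"

# Explicit approach: create the Cursor ourselves, then execute and fetch.
cursor = conn.cursor()
explicit = cursor.execute(query).fetchall()

# Shortcut: Connection.execute() creates a Cursor for us and returns it.
shortcut_cursor = conn.execute(query)
shortcut = shortcut_cursor.fetchall()

print(type(shortcut_cursor))
print(explicit == shortcut)
conn.close()
```

Because the shortcut still returns a Cursor, methods such as fetchall() can be chained onto it directly.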

Now let's learn how to fetch a specific number of results after we run a query.


### 4.1.5.8 Fetching a Specific Number of Results

To make it easier to work with large results sets, the Cursor class allows us to control the number of results we want to retrieve at any given time. To return a single result (as a tuple), we use the __[Cursor method fetchone()](https://docs.python.org/3/library/sqlite3.html#sqlite3.Cursor.fetchone)__. To return n results, we use the __[Cursor method fetchmany()](https://docs.python.org/3/library/sqlite3.html#sqlite3.Cursor.fetchmany)__.

Each Cursor instance contains an internal counter that updates every time we retrieve results. When we call the __fetchone()__ method, the Cursor instance will return a single result, and then increment its internal counter by 1. This means that if we call fetchone() again, the Cursor instance will actually return the second tuple in the results set (and increment by 1 again).

The __fetchmany()__ method takes in an integer (n) and returns the corresponding results, starting from the current position. It then increments the Cursor instance's counter by n. In the following code, we return the first two results using the fetchone() method, then the next five results using the fetchmany() method.

```Python
first_result = cursor.fetchone()
second_result = cursor.fetchone()
next_five_results = cursor.fetchmany(5)
```
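
The snippet above assumes a query has already been executed on `cursor`. A self-contained sketch of how the position advances, using a hypothetical in-memory table holding the numbers 1 through 10, might look like this:

```python
import sqlite3

# Assumption: an in-memory table of ten numbers, used only to show positions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE numbers (n INTEGER)")
conn.executemany("INSERT INTO numbers VALUES (?)", [(i,) for i in range(1, 11)])

cursor = conn.cursor()
cursor.execute("SELECT n FROM numbers ORDER BY n")

first_result = cursor.fetchone()         # first row
second_result = cursor.fetchone()        # second row
next_five_results = cursor.fetchmany(5)  # rows three through seven
remaining = cursor.fetchall()            # whatever is left
exhausted = cursor.fetchone()            # None once no rows remain

print(first_result, second_result, next_five_results, remaining, exhausted)
conn.close()
```

Once every row has been retrieved, fetchone() returns None and fetchall() returns an empty list.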

#### Instructions

* Write and run a query that returns the Major and Major_category columns from recent_grads.
* Then, fetch the first five results and store them as five_results.

#### Answers
```python
import sqlite3
conn = sqlite3.connect("jobs.db")
cursor = conn.cursor()
query = "SELECT Major, Major_category FROM recent_grads"
five_results = cursor.execute(query).fetchmany(5)
```

Outputs:
```Python
five_results list (<class 'list'>)
[('PETROLEUM ENGINEERING', 'Engineering'),
 ('MINING AND MINERAL ENGINEERING', 'Engineering'),
 ('METALLURGICAL ENGINEERING', 'Engineering'),
 ('NAVAL ARCHITECTURE AND MARINE ENGINEERING', 'Engineering'),
 ('CHEMICAL ENGINEERING', 'Engineering')]

conn Connection (<class 'sqlite3.Connection'>)
    <sqlite3.Connection at 0x7ffba12fbc70>

query str (<class 'str'>)
    'SELECT Major, Major_category FROM recent_grads'

cursor Cursor (<class 'sqlite3.Cursor'>)
    <sqlite3.Cursor at 0x7ffb36463f80>
```


```python
import sqlite3
import sys
conn = sqlite3.connect("jobs.db")
cursor = conn.cursor()
query = "SELECT Major, Major_category FROM recent_grads"
five_results = cursor.execute(query).fetchmany(5)

# Check the size of the variable in memory (getsizeof counts only the list object itself,
# not the tuples it references).
# Open question: does this mean SQLite exchanges data through memory rather than the disk?
sys.getsizeof(five_results)
```

Outputs:
```Python
128
```

### 4.1.5.9 Closing the Database Connection

Because an open connection can hold locks on the database file, we need to __close__ the connection when we're done working with it. Closing the connection releases those locks and __allows other processes to access the database__ freely, which is important when you're in a production environment and working with other team members.

To close a connection to a database, use the __Connection instance method close()__. When we're working with multiple databases and multiple Connection instances, we want to make sure we call the close() method on the correct instance. After closing the connection, attempting to query the database using any linked Cursor instances will raise the following error:

```Python
ProgrammingError: Cannot operate on a closed database.
```
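
Here is a brief sketch of that behavior, using a hypothetical in-memory database so it runs without jobs.db. It closes the connection and then catches the exception raised by a Cursor that was created from it:

```python
import sqlite3

# Assumption: an in-memory database stands in for jobs.db.
conn = sqlite3.connect(":memory:")
cursor = conn.cursor()
cursor.execute("SELECT 1")  # works while the connection is open

conn.close()

try:
    # The Cursor is tied to the closed Connection, so this raises an exception.
    cursor.execute("SELECT 1")
    closed_error = None
except sqlite3.ProgrammingError as exc:
    closed_error = str(exc)

print(closed_error)
```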

#### Instructions

* Close the connection to the database using the Connection instance method close().

#### Answers
```Python
conn = sqlite3.connect("jobs.db")
conn.close()
```



### 4.1.5.10 Practice 

Now let's practice the entire workflow we've learned so far, from start to finish.

#### Instructions
* Connect to the database jobs2.db, which contains the same data as jobs.db.
* Write and execute a query that returns all of the majors (Major) in reverse alphabetical order (Z to A).
* Assign the full result set to reverse_alphabetical.
* Finally, close the connection to the database.

#### Answers
```python
conn = sqlite3.connect("jobs2.db")
query = "select Major from recent_grads order by Major desc;"
reverse_alphabetical = conn.cursor().execute(query).fetchall() ## object evolution: Connection -> Cursor -> perform query -> get results as list of tuples
conn.close()
```

Outputs:
```Python
conn Connection (<class 'sqlite3.Connection'>)
    <sqlite3.Connection at 0x7ffb361fdf10>
query str (<class 'str'>)
    'SELECT Major FROM recent_grads ORDER BY Major DESC'
cursor Cursor (<class 'sqlite3.Cursor'>)
    <sqlite3.Cursor at 0x7ffb362035e0>
```


```python
conn = sqlite3.connect('jobs2.db')
cursor = conn.cursor()
query = 'SELECT Major FROM recent_grads ORDER BY Major DESC'
reverse_alphabetical = cursor.execute(query).fetchall()
print(reverse_alphabetical[:5])
conn.close()
```

Outputs:
```Python
[('ZOOLOGY',), ('VISUAL AND PERFORMING ARTS',), ('UNITED STATES HISTORY',), ('TREATMENT THERAPY PROFESSIONS',), ('TRANSPORTATION SCIENCES AND TECHNOLOGIES',)]
```


### 4.1.5.11 Next Steps

In this mission, we learned how to query a SQLite database from Python using the sqlite3 module. Next up is a guided project, where you'll practice analyzing data using SQLite.
