# 3. Advanced SQLAlchemy Queries
**In this chapter, you will learn to perform advanced—and incredibly useful—queries that enable you to interact with your data in powerful ways.**

## Calculating values in a query
### Math operators
Now that you know how to customize your SQL queries using filtering and aggregation functions, it's time to dive deeper by performing typical and useful math operations such as addition (`+`), subtraction (`-`), multiplication (`*`), division (`/`), and modulus (`%`) on columns in our query. It is important to remember that these operators behave differently with non-numeric data types.
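To make the point about non-numeric types concrete, here is a minimal sketch using a tiny in-memory SQLite table with made-up values (the table contents are assumptions, not the real census data). It is written in the `select(col, ...)` call style of SQLAlchemy 1.4 and later rather than the `select([...])` list style used in this chapter. On integer columns `-` subtracts, while `+` on string columns is rendered as string concatenation.

```python
from sqlalchemy import (Column, Integer, MetaData, String, Table,
                        create_engine, insert, select)

# In-memory SQLite database with one hypothetical row
engine = create_engine("sqlite://")
metadata = MetaData()
census = Table(
    "census", metadata,
    Column("state", String(30)),
    Column("sex", String(1)),
    Column("pop2000", Integer),
    Column("pop2008", Integer),
)
metadata.create_all(engine)
with engine.begin() as conn:
    conn.execute(insert(census), [
        {"state": "New York", "sex": "F", "pop2000": 100, "pop2008": 130},
    ])

stmt = select(
    # Numeric columns: "-" is arithmetic subtraction
    (census.columns.pop2008 - census.columns.pop2000).label("pop_change"),
    # String columns: "+" becomes the SQL concatenation operator
    (census.columns.state + "-" + census.columns.sex).label("state_sex"),
)

with engine.connect() as conn:
    row = conn.execute(stmt).first()

print(str(stmt))                      # the rendered SQL uses || for the strings
print(row.pop_change, row.state_sex)
```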

### Calculating difference
To find the top five age groups by growth between 2000 and 2008, we start by selecting the `age` column and the calculated difference between the `pop2008` and `pop2000` columns. Notice that we wrap the difference in parentheses so we can apply the label `pop_change` to it. Next, we group by age, order by `pop_change` in descending order, and apply a limit so that only the top 5 results are returned. Finally, we execute the statement and print the results.

In [43]:
from sqlalchemy import create_engine, Table, MetaData, select, desc
engine = create_engine('sqlite:///census.sqlite')
connection = engine.connect()
metadata = MetaData()
census = Table('census', metadata, autoload=True, autoload_with=engine)

In [44]:
stmt = select([census.columns.age,
              (census.columns.pop2008 - 
               census.columns.pop2000).label('pop_change')
              ])
stmt = stmt.group_by(census.columns.age)
stmt = stmt.order_by(desc('pop_change'))
stmt = stmt.limit(5)
results = connection.execute(stmt).fetchall()
print(results)

[(61, 25201), (54, 23503), (55, 21716), (60, 19677), (58, 19526)]


That lets us see that the number of 61 and 54 year olds grew quite a bit between those years.

### Case statement
Often when we are performing calculations, we want to selectively include data in a calculation based on a set of conditions. The case statement allows us to do just that. The case statement has a list of conditions, each paired with a column to return if that condition is met, and it ends with an else that tells it how to handle rows without a match.
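As a rough illustration of a case expression with more than one condition, the sketch below labels rows of a hypothetical `people` table by age group. The table, its values, and the group names are all assumptions made for the example. It passes each condition to `case()` as a positional tuple, which is the form accepted by SQLAlchemy 1.4 and later; the examples in this chapter pass a list instead.

```python
from sqlalchemy import (Column, Integer, MetaData, Table, case,
                        create_engine, insert, select)

engine = create_engine("sqlite://")
metadata = MetaData()
people = Table("people", metadata, Column("age", Integer))  # hypothetical table
metadata.create_all(engine)
with engine.begin() as conn:
    conn.execute(insert(people), [{"age": 10}, {"age": 40}, {"age": 70}])

# Conditions are checked in order; the first match wins,
# and else_ covers rows that match none of them.
age_group = case(
    (people.columns.age < 18, "minor"),
    (people.columns.age < 65, "adult"),
    else_="senior",
).label("age_group")

stmt = select(people.columns.age, age_group).order_by(people.columns.age)
with engine.connect() as conn:
    groups = [tuple(row) for row in conn.execute(stmt)]

print(groups)
```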

### Case example
Let's take a look at how a case statement works. This first example is only meant to demonstrate how case works: we could get the same result with a much simpler select statement and a where clause. However, we'll be building on the case statement in the next example to build queries that a where clause cannot perform. We start by importing the case statement from sqlalchemy. Then we build a select statement that includes a `sum` function wrapped around a case statement. The case statement begins with a conditional that checks whether the `state` is `New York`, and if so it returns the value of the `pop2008` column. Next we have an else clause that returns 0 for any record that does not have the state of New York. Then we execute the statement and print the results.


In [3]:
from sqlalchemy import func, case
stmt = select([
        func.sum(
            case([
                (census.columns.state == 'New York',
                census.columns.pop2008)
            ], else_=0))])
results = connection.execute(stmt).fetchall()
print(results)

[(19465159,)]


### Cast statement
The cast statement is also useful when we are performing operations and need to convert a column from one type to another. This is useful for converting integers to floats so we get the expected result when we use them in division. It can also be used to convert strings to dates. The cast statement accepts a column or expression and the type to which you want to convert it. Let's combine the case and cast statements in an example.
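Before combining the two, a small sketch of `cast()` on its own may help. It uses a hypothetical `counts` table with a single made-up row and the SQLAlchemy 1.4+ `select(col, ...)` call style. Casting the divisor to `Float` makes the division produce a fractional result.

```python
from sqlalchemy import (Column, Float, Integer, MetaData, Table, cast,
                        create_engine, insert, select)

engine = create_engine("sqlite://")
metadata = MetaData()
counts = Table(  # hypothetical table for illustration
    "counts", metadata,
    Column("part", Integer),
    Column("whole", Integer),
)
metadata.create_all(engine)
with engine.begin() as conn:
    conn.execute(insert(counts), [{"part": 1, "whole": 4}])

# cast(expression, type) wraps the expression in a SQL CAST
ratio = (counts.columns.part / cast(counts.columns.whole, Float)).label("ratio")
stmt = select(ratio)

with engine.connect() as conn:
    value = conn.execute(stmt).scalar()

print(str(stmt))  # the rendered SQL contains CAST(counts.whole AS FLOAT)
print(value)
```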

### Percentage example
Suppose we want to find what percentage of the total population lives in New York. We start by importing `case`, `cast`, and `Float`. Then we build a select statement that selects one fairly complex expression. To calculate a percentage, we need to sum the 2008 population for all the rows where the state is New York, divide it by the sum of the total 2008 population, and multiply by 100. We do that by calculating the sum of a case statement that returns the `pop2008` column if the state is New York and 0 for any other record, just like our last example. Then we divide that by the sum of the `pop2008` column for all the records. However, we cast that sum to a `Float` so that we get a fractional result when we perform the division. This is important because if we don't convert it, the database may perform integer (floor) division and we'll get 0 back. Next we multiply by 100 to get the percentage and label the entire calculation as `ny_percent`. Finally, we execute the statement and print the results.

In [4]:
from sqlalchemy import case, cast, Float
stmt = select([
        (func.sum(
            case([
                (census.columns.state == 'New York',
                census.columns.pop2008)
            ], else_=0)) /
         cast(func.sum(census.columns.pop2008),
             Float) * 100).label('ny_percent')])
results = connection.execute(stmt).fetchall()
print(results)

[(6.426761976501632,)]


Notice that we have used a sophisticated SQL query to extract the solution to a very intuitive question from our database: What percentage of the total population lived in New York in 2008?

```
6.43%
```

## Connecting to a MySQL database
Before you jump into the calculation exercises, let's begin by connecting to our database. Recall that in the last chapter you connected to a PostgreSQL database. Now, you'll connect to a MySQL database, for which many prefer to use the `pymysql` database driver, which, like `psycopg2` for PostgreSQL, you have to install prior to use.

This connection string is going to start with `'mysql+pymysql://'`, indicating which dialect and driver you're using to establish the connection. The dialect block is followed by the `'username:password'` combo. Next, you specify the host and port with the following `'@host:port/'`. Finally, you wrap up the connection string with the `'database_name'`.
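The pieces of the connection string can be checked without connecting to anything. The sketch below assembles a string from placeholder parts (the username, password, and host are made up for illustration) and parses it with SQLAlchemy's `make_url()` helper, which only splits the string and does not need the database driver to be installed.

```python
from sqlalchemy.engine.url import make_url

dialect_driver = "mysql+pymysql://"   # dialect and driver
credentials = "user:secret"           # hypothetical username:password
host_port = "@db.example.com:3306/"   # hypothetical host and port
database = "census"                   # database name

connection_string = dialect_driver + credentials + host_port + database

# make_url() parses the string into its components
url = make_url(connection_string)
print(url.drivername, url.username, url.host, url.port, url.database)
```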

Now you'll practice connecting to a MySQL database: it will be the same `census` database that you have already been working with. One of the great things about SQLAlchemy is that, after connecting, it abstracts over the type of database it has connected to and you can write the same SQLAlchemy code, regardless.

- Import the `create_engine` function from the `sqlalchemy` library.
- Create an engine to the `census` database by concatenating the following strings and passing them to `create_engine()`:
    - `'mysql+pymysql://'` (the dialect and driver).
    - `'student:datacamp'` (the username and password).
    - `'@courses.csrrinzqubik.us-east-1.rds.amazonaws.com:3306/'` (the host and port).
    - `'census'` (the database name).
- Use the `.table_names()` method on `engine` to print the table names.

In [5]:
!pip install PyMySQL



In [6]:
# Import create_engine function
from sqlalchemy import create_engine

# Create an engine to the census database
engine = create_engine('mysql+pymysql://student:datacamp@courses.csrrinzqubik.us-east-1.rds.amazonaws.com:3306/census')

# Print the table names
print(engine.table_names())


['census', 'state_fact']
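A note on versions: to the best of my knowledge, `engine.table_names()` was deprecated in SQLAlchemy 1.4 and is no longer available in 2.0. If it is missing in your installation, the inspector offers the same information. The sketch below uses an in-memory SQLite database with two empty stand-in tables instead of the remote MySQL server.

```python
from sqlalchemy import Column, Integer, MetaData, Table, create_engine, inspect

engine = create_engine("sqlite://")
metadata = MetaData()
# Empty stand-in tables, only so there is something to list
Table("census", metadata, Column("id", Integer, primary_key=True))
Table("state_fact", metadata, Column("id", Integer, primary_key=True))
metadata.create_all(engine)

# The inspector reads table names from the live database
table_names = inspect(engine).get_table_names()
print(table_names)
```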


## Calculating a difference between two columns
Often, you'll need to perform math operations as part of a query, such as if you wanted to calculate the change in population from 2000 to 2008. For math operations on numbers, the operators in SQLAlchemy work the same way as they do in Python.

You can use these operators to perform addition (`+`), subtraction (`-`), multiplication (`*`), division (`/`), and modulus (`%`) operations. Note: They behave differently when used with non-numeric column types.

Let's now find the top 5 states by population growth between 2000 and 2008.

- Define a select statement called `stmt` to return:
    1. The state column of the `census` table (`census.columns.state`).
    2. The difference in population count between 2008 (`census.columns.pop2008`) and 2000 (`census.columns.pop2000`) labeled as `'pop_change'`.
- Group the statement by `census.columns.state`.
- Order the statement by population change (`'pop_change'`) in descending order. Do so by passing it `desc('pop_change')`.
- Use the `.limit()` method on the previous statement to return only 5 records.
- Execute the statement and `fetchall()` the records.
- Print the state and population change for each record.

In [7]:
# Build query to return state names by population difference from 2008 to 2000: stmt
stmt = select([census.columns.state, 
               (census.columns.pop2008 - 
                census.columns.pop2000).label('pop_change')])

# Append group by for the state: stmt_grouped
stmt_grouped = stmt.group_by(census.columns.state)

# Append order by for pop_change in descending order: stmt_ordered
stmt_ordered = stmt_grouped.order_by(desc('pop_change'))

# Return only 5 results: stmt_top5
stmt_top5 = stmt_ordered.limit(5)

# Use connection to execute stmt_top5 and fetch all results
results = connection.execute(stmt_top5).fetchall()

# Print the state and population change for each record
for result in results:
    print('{}:{}'.format(result.state, result.pop_change))

Texas:40137
California:35406
Florida:21954
Arizona:14377
Georgia:13357


## Determining the overall percentage of women
It's possible to combine functions and operators in a single select statement as well. These combinations can be exceptionally handy when we want to calculate percentages or averages, and we can also use the `case()` expression to operate on data that meets specific criteria while not affecting the query as a whole. The `case()` expression accepts a list of conditions to match and the column to return if the condition matches, followed by an `else_` if none of the conditions match. We can wrap this entire expression in any function or math operation we like.

Often when performing integer division, we want to get a float back. While some databases will do this automatically, you can use the `cast()` function to convert an expression to a particular type.

- Import `case`, `cast`, and `Float` from `sqlalchemy`.
- Build an expression `female_pop2000` to calculate female population in 2000. To achieve this:
    - Use `case()` inside `func.sum()`.
    - The first argument of `case()` is a list containing a tuple of
        1. A boolean checking that `census.columns.sex` is equal to `'F'`.
        2. The column `census.columns.pop2000`.
    - The second argument is the `else_` condition, which should be set to 0.
- Calculate the total population in 2000 and use `cast()` to convert it to `Float`.
- Build a query to calculate the percentage of women in 2000. To do this, divide `female_pop2000` by `total_pop2000` and multiply by `100`.
- Execute the query and print `percent_female`.

In [8]:
# import case, cast and Float from sqlalchemy
from sqlalchemy import case, cast, Float

# Build an expression to calculate female population in 2000
female_pop2000 = func.sum(
                    case([
                        (census.columns.sex == 'F', census.columns.pop2000)
                    ], else_=0))

# Cast an expression to calculate total population in 2000 to Float
total_pop2000 = cast(func.sum(census.columns.pop2000), Float)

# Build a query to calculate the percentage of women in 2000: stmt
stmt = select([female_pop2000 / total_pop2000 * 100])

# Execute the query and store the scalar result: percent_female
percent_female = connection.execute(stmt).scalar()

# Print the percentage
print(percent_female)


51.09467432293413


*It looks like there were slightly more women than men in the US population in 2000.*

---
## SQL relationships
Tables can be related to one another via columns that act as a bridge between the tables.

### Relationships
We use relationships to avoid duplicating data. For example, an employee table might be related to a location table so that we can know which location an employee works at without the need to copy the same location data in every employee's record. Relationships allow us to change the data in one place. Back to our employee to location example, if that location moves to a new building, we'd be able to update the address once in the location table, and every employee related to that location would show the new address. Another way that we might use relationships is to store additional details that we don't need to use as often. These relationships might be predefined in the table.

### Relationships
- `Census`

state | sex | age | pop2000 | pop2008
:---|:---|:---|:---|:---
New York | F | 0 | 120335 | 122194
New York | F | 1 | 118219 | 119661
New York | F | 2 | 119577 | 116413

- `State_Fact`

name | abbreviation | type
:---|:---|:---
New York | NY | state
Washington DC | DC | capitol
Washington | WA | state

In our census data, we have a `census` table and a `state_fact` table that are related by the state name which is found in the state column of the census table and name column of the `state_fact` table. Let's use this predefined relationship to get the state abbreviation from the `state_fact` table instead of the name and the population in 2008 for that same record from the census table. This is called a join.

### Automatic joins
We build our statement with the column from each table that we desire. Then we can execute the query and print the results. Those results now show each record with the state abbreviation instead of the state name. 

```python
stmt = select([census.columns.pop2008,
              state_fact.columns.abbreviation])
results = connection.execute(stmt).fetchall()
print(results)
```

```

[(95012, u'IL'), (95012, u'NJ'), (95012, u'ND'), (95012, u'OR'), ...
```



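The claim here is that SQLAlchemy automatically adds the right join clause because the relationship is predefined in the database. Look closely at the output, though: the same `pop2008` value (95012) is paired with several different state abbreviations, which is what rows look like when every census record is combined with every `state_fact` record rather than matched on the state name. To be sure rows are matched, spell out the join with `select_from()` and a join clause, as described in the sections that follow.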
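One way to check whether rows are really being matched is to compare row counts. The sketch below uses two tiny hypothetical tables (two rows each, made-up values) with a foreign key declared between them, in the SQLAlchemy 1.4+ call style. In this setup, selecting columns from both tables without a join returns every combination of rows, while adding `select_from()` with a join returns one row per match.

```python
import warnings

from sqlalchemy import (Column, ForeignKey, Integer, MetaData, String, Table,
                        create_engine, insert, select)

engine = create_engine("sqlite://")
metadata = MetaData()
state_fact = Table(
    "state_fact", metadata,
    Column("name", String(30), primary_key=True),
    Column("abbreviation", String(2)),
)
census = Table(
    "census", metadata,
    Column("state", String(30), ForeignKey("state_fact.name")),
    Column("pop2008", Integer),
)
metadata.create_all(engine)
with engine.begin() as conn:
    conn.execute(insert(state_fact), [
        {"name": "New York", "abbreviation": "NY"},
        {"name": "Texas", "abbreviation": "TX"},
    ])
    conn.execute(insert(census), [
        {"state": "New York", "pop2008": 100},
        {"state": "Texas", "pop2008": 200},
    ])

unjoined = select(census.columns.pop2008, state_fact.columns.abbreviation)
# The join condition is inferred from the declared foreign key
joined = unjoined.select_from(census.join(state_fact))

with engine.connect() as conn, warnings.catch_warnings():
    warnings.simplefilter("ignore")  # some versions warn about the unjoined FROM
    unjoined_rows = conn.execute(unjoined).fetchall()
    joined_rows = conn.execute(joined).fetchall()

print(len(unjoined_rows), len(joined_rows))
```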

### Join
We can use a join clause to add a relationship that isn't necessarily predefined in a query. The join clause takes a related table and an expression that details the relationship. If the relationship is predefined in the table, we don't need that expression. The join clause should be placed right after the select statement and prior to any where, `order_by`, or `group_by` clauses. When we want to build queries that do not select a column from each table but use both tables in other clauses, we have to tell SQLAlchemy which tables to use in the query.
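To see what "predefined" means in practice, the sketch below defines two hypothetical tables, `employee` and `location`, linked by a declared foreign key, and prints the join clause SQLAlchemy builds with and without an explicit expression. No database connection is needed just to render the clause; the table and column names are assumptions for illustration.

```python
from sqlalchemy import Column, ForeignKey, Integer, MetaData, String, Table

metadata = MetaData()
location = Table(
    "location", metadata,
    Column("id", Integer, primary_key=True),
    Column("address", String(50)),
)
employee = Table(
    "employee", metadata,
    Column("id", Integer, primary_key=True),
    Column("name", String(30)),
    Column("location_id", Integer, ForeignKey("location.id")),
)

# No expression given: the ON clause comes from the foreign key
inferred = employee.join(location)

# Expression given: the ON clause is exactly what we pass in
explicit = employee.join(
    location, employee.columns.location_id == location.columns.id)

print(inferred)
print(explicit)
```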

### select_from()
The `select_from` method of the select statement allows us to do that, and a join clause is passed as the argument to `select_from`.

### select_from() example
In this example, we want to determine the total population in 2000 that was within the 10th Circuit Court jurisdiction. We use our select statement to sum the `pop2000` column from the census table, then we append the `select_from` method to include the census table joined with the `state_fact` table. Next, we use a where clause to find only the records where the `circuit_court` column from the `state_fact` table is `10`. After executing the statement, we can print our result.

```python
stmt = select([func.sum(census.columns.pop2000)])
stmt = stmt.select_from(census.join(state_fact))
stmt = stmt.where(state_fact.columns.circuit_court == '10')
result = connection.execute(stmt).scalar()
print(result)
```
```

14945252
```

### Joining tables without predefined relationship
So far, we have been using the join statement with a relationship that already exists in the database. However, as data scientists we will often get tables that have related data but are not set up with a relationship. To join such tables, we can give the join clause a Boolean expression that explains how the tables are related. This is the same type of Boolean expression we would use in a `where` clause. The join will only include rows from each table that can be matched between the two columns. It also doesn't work if the columns are of different types.
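The difference shows up as soon as the join is built. In the sketch below, two hypothetical table definitions have no foreign key between them, so asking for a join without an expression fails with `NoForeignKeysError`, while supplying the Boolean expression works. The column layout is a cut-down assumption, not the full census schema.

```python
from sqlalchemy import Column, Integer, MetaData, String, Table
from sqlalchemy.exc import NoForeignKeysError

metadata = MetaData()
census = Table(
    "census", metadata,
    Column("state", String(30)),
    Column("pop2008", Integer),
)
state_fact = Table(
    "state_fact", metadata,
    Column("name", String(30)),
    Column("abbreviation", String(2)),
)

# Without a declared relationship there is nothing to infer the join from
try:
    census.join(state_fact)
    inferred_ok = True
except NoForeignKeysError:
    inferred_ok = False

# With an explicit Boolean expression the join can be built
explicit = census.join(
    state_fact, census.columns.state == state_fact.columns.name)

print(inferred_ok)
print(explicit)
```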

### select_from() example
Imagine that we want to determine the total population in 2000 that belongs to the East South Central division of the census. The population and the location live in different tables. This time, however, I have removed the defined relationship between the `census` and `state_fact` tables so we can practice working with tables in that manner.

We begin by selecting the sum of the `pop2000` column from the census table. Next we append a `select_from` to include the join clause. This time in the join clause we specify the table and a condition that matches rows based on the state column of the census table and the name column of the `state_fact` table. Finally, we add a where clause to find the records where the `census_division_name` is `'East South Central'` in the `state_fact` table. Then we execute and print the results.


In [13]:
stmt = select([func.sum(census.columns.pop2000)])
stmt = stmt.select_from(
                    census.join(state_fact, census.columns.state ==
                                state_fact.columns.name))
stmt = stmt.where(
                state_fact.columns.census_division_name ==
                'East South Central')
result = connection.execute(stmt).scalar()
print(result)

16982311


## Automatic joins with an established relationship
If you have two tables that already have an established relationship, you can use that relationship by adding the columns you want from each table to the select statement. Recall the following query:
```python
stmt = select([census.columns.pop2008, 
               state_fact.columns.abbreviation])
```
This statement was used to join the `census` and `state_fact` tables and select the `pop2008` column from the first and the `abbreviation` column from the second. In this case, the `census` and `state_fact` tables had a predefined relationship: the `state` column of the former corresponded to the `name` column of the latter.

In this exercise, you'll use the same predefined relationship to select the `pop2000` and `abbreviation` columns.

- Build a statement to join the `census` and `state_fact` tables and select the **`pop2000`** column from the first and the `abbreviation` column from the second.
- Execute the statement to get the first result and save it as `result`.
- Print the key and value for each.

In [10]:
# Build a statement to join census and state_fact tables: stmt
stmt = select([census.columns.pop2000, state_fact.columns.abbreviation])

# Execute the statement and get the first result: result
result = connection.execute(stmt).first()

# Loop over the keys in the result object and print the key and value
for key in result.keys():
    print(key, getattr(result, key))

pop2000 89600
abbreviation IL


## Joins
If you aren't selecting columns from both tables, or the two tables don't have a defined relationship, you can still use the `.join()` method on a table to join it with another table and get extra data related to your query. The `.join()` method takes the table object you want to join as the first argument and a condition that indicates how the tables are related as the second argument. Finally, you use the `.select_from()` method on the select statement to wrap the join clause. For example, in the previous exercise the following code was executed to join the `census` table to the `state_fact` table such that the `state` column of the `census` table corresponded to the `name` column of the `state_fact` table.

```python
stmt = stmt.select_from(
    census.join(
        state_fact, census.columns.state == 
        state_fact.columns.name))
```

- Build a statement to select ALL the columns from the `census` and `state_fact` tables. To select ALL the columns from two tables `employees` and `sales`, for example, you would use `stmt = select([employees, sales])`.
- Append a `select_from` to `stmt` to join the `census` table to the `state_fact` table by the `state` column in `census` and the `name` column in the `state_fact` table.
- Execute the statement to get the first result and save it as `result`. 
- Print the key and value for each.

In [15]:
# Build a statement to select the census and state_fact tables: stmt
stmt = select([census, state_fact])

# Add a select_from clause that wraps a join for the census and state_fact
# tables where the census state column and state_fact name column match
stmt_join = stmt.select_from(
        census.join(state_fact, 
                    census.columns.state ==
                    state_fact.columns.name))

# Execute the statement and get the first result: result
result = connection.execute(stmt_join).first()

# Loop over the keys in the result object and print the key and value
for key in result.keys():
    print(key, getattr(result, key))


state Illinois
sex M
age 0
pop2000 89600
pop2008 95012
id 13
name Illinois
abbreviation IL
country USA
type state
sort 10
status current
occupied occupied
notes 
fips_state 17
assoc_press Ill.
standard_federal_region V
census_region 2
census_region_name Midwest
census_division 3
census_division_name East North Central
circuit_court 7


## More practice with joins
You can use the same select statement you built in the last exercise, however, let's add a twist and only return a few columns and use the other table in a `group_by()` clause.

- Build a statement to select:
    - The `state` column from the `census` table.
    - The sum of the `pop2008` column from the `census` table.
    - The `census_division_name` column from the `state_fact` table.
- Append a `.select_from()` to `stmt` in order to join the `census` and `state_fact` tables by the `state` and `name` columns.
- Group the statement by the `name` column of the `state_fact` table.
- Execute the statement `stmt_grouped` to get all the records and save it as `results`.
- Loop over the results object and print each record.

In [16]:
# Build a statement to select the state, sum of 2008 population and census
# division name: stmt
stmt = select([
    census.columns.state,
    func.sum(census.columns.pop2008),
    state_fact.columns.census_division_name
])

# Append select_from to join the census and state_fact tables by the census state and state_fact name columns
stmt_joined = stmt.select_from(
    census.join(state_fact, census.columns.state == state_fact.columns.name)
)

# Append a group by for the state_fact name column
stmt_grouped = stmt_joined.group_by(state_fact.columns.name)

# Execute the statement and get the results: results
results = connection.execute(stmt_grouped).fetchall()

# Loop over the results object and print each record.
for record in results:
    print(record)


('Alabama', 4649367, 'East South Central')
('Alaska', 664546, 'Pacific')
('Arizona', 6480767, 'Mountain')
('Arkansas', 2848432, 'West South Central')
('California', 36609002, 'Pacific')
('Colorado', 4912947, 'Mountain')
('Connecticut', 3493783, 'New England')
('Delaware', 869221, 'South Atlantic')
('Florida', 18257662, 'South Atlantic')
('Georgia', 9622508, 'South Atlantic')
('Hawaii', 1250676, 'Pacific')
('Idaho', 1518914, 'Mountain')
('Illinois', 12867077, 'East North Central')
('Indiana', 6373299, 'East North Central')
('Iowa', 3000490, 'West North Central')
('Kansas', 2782245, 'West North Central')
('Kentucky', 4254964, 'East South Central')
('Louisiana', 4395797, 'West South Central')
('Maine', 1312972, 'New England')
('Maryland', 5604174, 'South Atlantic')
('Massachusetts', 6492024, 'New England')
('Michigan', 9998854, 'East North Central')
('Minnesota', 5215815, 'West North Central')
('Mississippi', 2922355, 'East South Central')
('Missouri', 5891974, 'West North Central')
('Mon

## Working with hierarchical tables
In addition to tables that join with other tables, there are also tables that join with themselves.

### Hierarchical tables
We call these tables self-referential or hierarchical tables. These are commonly used to store organizational charts, geographic data, networks and relationship graphs.

### Hierarchical tables - example
- `Employees`

id | name | job | manager
:---|:---|:---|:---
1 | Johnson | Admin | 6
2 | Harding | Manager | 9
3 | Taft | Sales I | 2
4 | Hoover | Sales I | 2

Here we have an `employees` table, which, in addition to having the employee's name and position, also contains a column for their manager. That manager is also an employee and has a record in that table. The table has an undefined relationship between the id column and the manager column.

### Hierarchical tables - alias()
In order to use this relationship in a query, we need a way to refer to this table by another name. The `alias` method allows us to do just that by creating a way to refer to the same table with two unique names.
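As a quick look at what an alias does, the sketch below builds a cut-down, hypothetical `employees` table definition and renders a self-join. It passes an explicit alias name, `"managers"`, so the generated SQL is easy to read (calling `alias()` with no argument lets SQLAlchemy pick a name). It uses the SQLAlchemy 1.4+ `select(col, ...)` call style and only renders the SQL, so no database is needed.

```python
from sqlalchemy import Column, Integer, MetaData, String, Table, select

metadata = MetaData()
employees = Table(  # cut-down stand-in for the real table
    "employees", metadata,
    Column("id", Integer, primary_key=True),
    Column("name", String(20)),
    Column("mgr", Integer),
)

# Same table, second name
managers = employees.alias("managers")

stmt = select(
    managers.columns.name.label("manager"),
    employees.columns.name.label("employee"),
).select_from(
    employees.join(managers, managers.columns.id == employees.columns.mgr)
)

print(stmt)  # the FROM clause refers to "employees AS managers"
```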

### Querying hierarchical data
Let's get a list of managers and the employees that report to them. To join the employees table to itself using the relationship, we start by calling the `alias` method on the `employees` table and storing the alias as `managers`. Now we can use both the names `managers` and `employees` to refer to the table.

Now we are ready to build our query. We start by selecting the `name` column from the `managers` alias and labeling that column as `manager`. Next we select the `name` column from the `employees` table and label it `employee`. Then we use the `select_from` method to wrap an explicit join from the `employees` table to the `managers` alias. We use the `id` column from the `managers` alias together with the `mgr` column of the `employees` table (the manager column in the actual table) to form the join condition. Next, we order by the manager's name. Finally, we execute the statement and review the results.

In [22]:
from sqlalchemy import create_engine, Table, MetaData, select
engine_e = create_engine('sqlite:///employees.sqlite')
connection = engine_e.connect()
metadata = MetaData()
employees = Table('employees', metadata, autoload=True, autoload_with=engine_e)

In [28]:
print(repr(employees))

Table('employees', MetaData(bind=None), Column('id', INTEGER(), table=<employees>, primary_key=True, nullable=False), Column('name', VARCHAR(length=20), table=<employees>), Column('job', VARCHAR(length=20), table=<employees>), Column('mgr', INTEGER(), table=<employees>), Column('hiredate', DATETIME(), table=<employees>), Column('sal', NUMERIC(precision=7, scale=2), table=<employees>), Column('comm', NUMERIC(precision=7, scale=2), table=<employees>), Column('dept', INTEGER(), table=<employees>), schema=None)


In [29]:
employees.columns.keys()

['id', 'name', 'job', 'mgr', 'hiredate', 'sal', 'comm', 'dept']

In [32]:
managers = employees.alias()
stmt = select(
        [managers.columns.name.label('manager'),
         employees.columns.name.label('employee')])
stmt = stmt.select_from(employees.join(
                managers, managers.columns.id ==
                employees.columns.mgr))
stmt = stmt.order_by(managers.columns.name)
result_e = connection.execute(stmt).fetchall()

for result in result_e:
    print(result)

('FILLMORE', 'GRANT')
('FILLMORE', 'ADAMS')
('FILLMORE', 'MONROE')
('GARFIELD', 'JOHNSON')
('GARFIELD', 'LINCOLN')
('GARFIELD', 'POLK')
('GARFIELD', 'WASHINGTON')
('HARDING', 'TAFT')
('HARDING', 'HOOVER')
('JACKSON', 'HARDING')
('JACKSON', 'GARFIELD')
('JACKSON', 'FILLMORE')
('JACKSON', 'ROOSEVELT')


For example, Taft's supervisor here is Harding.

### group_by and func
Hierarchical tables can get tricky when performing `group_by`s or using functions. It's important to think of the table as if it were two different tables. You should focus on having the table in the `group_by` and the alias in the function, or vice versa. It's also important to make sure you are using both the alias and the table in the query when using the join; otherwise the query could raise an error or use a lot of resources.

### Querying hierarchical data
To practice this, let's pretend that we are making next year's budgets and we need to know how much salary to allocate for each manager's employees. We start by making the `managers` alias of the `employees` table. Then we begin building the select statement: we select the manager's name and then sum all the employees' salaries. Next, we use the same explicit join from the previous example in the `select_from`. Then we group by the manager's name, and finally we execute the query.

In [33]:
managers = employees.alias()
stmt = select([managers.columns.name,
              func.sum(employees.columns.sal)])
stmt = stmt.select_from(employees.join(
                managers, managers.columns.id ==
                employees.columns.mgr))
stmt = stmt.group_by(managers.columns.name)
result_e = connection.execute(stmt).fetchall()

for result in result_e:
    print(result)

('FILLMORE', Decimal('96000.00'))
('GARFIELD', Decimal('83500.00'))
('HARDING', Decimal('52000.00'))
('JACKSON', Decimal('197000.00'))



Notice that we applied the function to the `employees` table and grouped by the `managers` alias.

## Using alias to handle same table joined queries
Often, you'll have tables that contain hierarchical data, such as employees and managers who are also employees. For this reason, you may wish to join a table to itself on different columns. The `.alias()` method, which gives you a second name for referring to the same table (it does not copy any data), helps accomplish this task. Because it's the same table, you only need a where clause to specify the join condition.

Here, you'll use the `.alias()` method to build a query to join the `employees` table against itself to determine to whom everyone reports.

- Save an alias of the `employees` table as `managers`. To do so, apply the method `.alias()` to `employees`.
- Build a query to select the employee's `name` and their manager's `name`. Use `.label()` to label the `name` column of `managers` as `'manager'` and the `name` column of `employees` as `'employee'`.
- Append a where clause to `stmt` to match where the `id` column of the `managers` table corresponds to the `mgr` column of the `employees` table.
- Order the statement by the `name` column of the `managers` table.
- Execute the statement and store all the results. Print the names of the managers and all their employees.

In [34]:
# Make an alias of the employees table: managers
managers = employees.alias()

# Build a query to select names of managers and their employees: stmt
stmt = select(
    [managers.columns.name.label('manager'),
     employees.columns.name.label('employee')]
)

# Match managers id with employees mgr: stmt_matched
stmt_matched = stmt.where(managers.columns.id == employees.columns.mgr)

# Order the statement by the managers name: stmt_ordered
stmt_ordered = stmt_matched.order_by(managers.columns.name)

# Execute statement: results
results = connection.execute(stmt_ordered).fetchall()

# Print records
for record in results:
    print(record)

('FILLMORE', 'GRANT')
('FILLMORE', 'ADAMS')
('FILLMORE', 'MONROE')
('GARFIELD', 'JOHNSON')
('GARFIELD', 'LINCOLN')
('GARFIELD', 'POLK')
('GARFIELD', 'WASHINGTON')
('HARDING', 'TAFT')
('HARDING', 'HOOVER')
('JACKSON', 'HARDING')
('JACKSON', 'GARFIELD')
('JACKSON', 'FILLMORE')
('JACKSON', 'ROOSEVELT')


## Leveraging functions and group_bys with hierarchical data
It's also common to want to roll up data which is in a hierarchical table. Rolling up data requires making sure you're careful which alias you use to perform the group_bys and which table you use for the function.

Here, your job is to get a count of employees for each manager.

- Save an alias of the `employees` table as `managers`.
- Build a query to select the `name` column of the `managers` table and the count of the number of their employees. Use `func.count()` to count the `id `column of the `employees` table.
- Using a `.where()` clause, filter the records where the `id` column of the `managers` table and `mgr` column of the `employees` table are equal.
- Group the query by the `name` column of the `managers` table.
- Execute the statement and store all the results. Print the name of each manager and the count of their employees.

In [35]:
# Make an alias of the employees table: managers
managers = employees.alias()

# Build a query to select names of managers and counts of their employees: stmt
stmt = select([managers.columns.name, func.count(employees.columns.id)])

# Append a where clause that ensures the manager id and employee mgr are equal
stmt_matched = stmt.where(managers.columns.id == employees.columns.mgr)

# Group by Managers Name
stmt_grouped = stmt_matched.group_by(managers.columns.name)

# Execute statement: results
results = connection.execute(stmt_grouped).fetchall()

# Print each manager and their employee count
for record in results:
    print(record)

('FILLMORE', 3)
('GARFIELD', 4)
('HARDING', 2)
('JACKSON', 4)
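
The warning about choosing the right side for the `group_by` and for the function can be checked on a small example. The sketch below assumes a made-up six-row `employees` table in an in-memory SQLite database and SQLAlchemy 1.4 or later (positional `select()` arguments). It runs the rollup as intended and then with the two roles swapped, to show how the meaning of the count changes.

```python
from sqlalchemy import (Column, Integer, MetaData, String, Table,
                        create_engine, func, select)

# Hypothetical data, not the course's employees table
engine = create_engine('sqlite://')
metadata = MetaData()
employees = Table(
    'employees', metadata,
    Column('id', Integer, primary_key=True),
    Column('name', String),
    Column('mgr', Integer),
)
metadata.create_all(engine)

with engine.begin() as connection:
    connection.execute(employees.insert(), [
        {'id': 1, 'name': 'JACKSON', 'mgr': None},
        {'id': 2, 'name': 'HARDING', 'mgr': 1},
        {'id': 3, 'name': 'GARFIELD', 'mgr': 1},
        {'id': 4, 'name': 'TAFT', 'mgr': 2},
        {'id': 5, 'name': 'HOOVER', 'mgr': 2},
        {'id': 6, 'name': 'POLK', 'mgr': 3},
    ])

managers = employees.alias()
join_condition = managers.columns.id == employees.columns.mgr

# Intended rollup: group by the manager side, count the employee side
stmt = (
    select(managers.columns.name, func.count(employees.columns.id))
    .where(join_condition)
    .group_by(managers.columns.name)
)

# Roles swapped: this counts managers per employee, which is always 1
stmt_swapped = (
    select(employees.columns.name, func.count(managers.columns.id))
    .where(join_condition)
    .group_by(employees.columns.name)
)

with engine.connect() as connection:
    counts = {name: n for name, n in connection.execute(stmt)}
    swapped = {name: n for name, n in connection.execute(stmt_swapped)}

print(counts)   # JACKSON: 2, HARDING: 2, GARFIELD: 1
print(swapped)  # every employee with a manager maps to 1
```

Both statements are valid SQL, so the database will not flag the swapped version; only the result reveals the mistake.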


---
## Handling large ResultSets
So what do we do when we have really complex queries with large result sets?

### Dealing with large ResultSets
Dealing with large result sets can be problematic, as we might run out of memory or disk space to store the results. Thankfully, SQLAlchemy has a `fetchmany()` method that lets us retrieve results a fixed number of records at a time. We pass the number of records we want to `fetchmany()` and call it repeatedly in a loop. When there are no more records, `fetchmany()` returns an empty list. Because the `ResultProxy` does not know when we are done calling `fetchmany()`, we must call its `close()` method when we are finished. Let's look at an example.

### Fetching many rows
I want to count how many results we have for each state; however, we have a HUGE table, so I need to work in smaller groups of records with `fetchmany()`. We're going to do this in a while loop. Recall that a while loop checks whether a variable or expression is true and, if so, keeps running the loop body. When the condition becomes false, the loop stops executing.

In this example, we have already set `more_results` to True, started a `state_count` dictionary to hold the count for each state, and executed the query, storing the result proxy as `results_proxy`. We start the while loop by checking whether `more_results` is True. Inside the loop, we fetch 50 records from the result proxy and store them as `partial_results`. We immediately check whether `partial_results` is an empty list, which is how we know there are no more records to fetch. If it is, we set `more_results` to False so that we exit the loop. Next, we loop over `partial_results` and increment the `state_count` for each record's state by one.

The loop keeps running until `fetchmany()` returns an empty list. Once we exit the while loop, we close `results_proxy` so the database and SQLAlchemy know we are done with the large result set.

```python
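while more_results:
    # Fetch the next block of up to 50 rows
    partial_results = results_proxy.fetchmany(50)
    # An empty list means every row has been fetched
    if partial_results == []:
        more_results = False
    for row in partial_results:
        # .get() supplies 0 the first time a state is seen
        state_count[row.state] = state_count.get(row.state, 0) + 1
results_proxy.close()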
```
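
The loop above relies on `more_results`, `state_count`, and `results_proxy` already existing. As a rough, self-contained sketch of the same pattern, the code below builds a made-up five-row `census` table in an in-memory SQLite database (an assumption for illustration, as is the use of SQLAlchemy 1.4 or later) and fetches two rows at a time. It also records the size of each batch, which makes the final empty batch that ends the loop visible.

```python
from collections import defaultdict

from sqlalchemy import (Column, Integer, MetaData, String, Table,
                        create_engine, select)

# Hypothetical miniature census table, not the course's census.sqlite
engine = create_engine('sqlite://')
metadata = MetaData()
census = Table(
    'census', metadata,
    Column('id', Integer, primary_key=True),
    Column('state', String),
)
metadata.create_all(engine)

with engine.begin() as connection:
    connection.execute(census.insert(), [
        {'state': 'Texas'}, {'state': 'Ohio'}, {'state': 'Texas'},
        {'state': 'Ohio'}, {'state': 'Texas'},
    ])

connection = engine.connect()
results_proxy = connection.execute(select(census))

more_results = True
state_count = defaultdict(int)  # missing states start at 0
batch_sizes = []                # illustration only: rows returned per call

while more_results:
    # Ask for at most 2 rows; the last non-empty batch may be smaller
    partial_results = results_proxy.fetchmany(2)
    batch_sizes.append(len(partial_results))

    # An empty batch signals that the result set is exhausted
    if not partial_results:
        more_results = False

    for row in partial_results:
        state_count[row.state] += 1

results_proxy.close()
connection.close()

print(batch_sizes)        # [2, 2, 1, 0]
print(dict(state_count))  # {'Texas': 3, 'Ohio': 2}
```

Five rows in batches of two take four calls to `fetchmany()`: two full batches, one partial batch, and the empty one that stops the loop.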

## Working on blocks of records
Sometimes you need to work on a large ResultProxy without having the memory to load all the results at once. To work around that issue, you can get blocks of rows from the ResultProxy by using the `.fetchmany()` method inside a loop, passing it the number of records you want as an argument. When it returns an empty list, there are no more rows left to fetch, and you have processed all the results of the query. You then need to call the `.close()` method to close out the connection to the database.

You'll now have the chance to practice this on a large ResultProxy called `results_proxy`.

- Use a `while` loop that checks if there are `more_results`.
- Inside the loop, apply the method `.fetchmany()` to `results_proxy` to get `50` records at a time and store those records as `partial_results`.
- After fetching the records, if `partial_results` is an empty list (that is, if it is equal to `[]`), set `more_results` to `False`.
- Loop over the `partial_results` and, if `row.state` is a key in the `state_count` dictionary, increment `state_count[row.state]` by 1; otherwise set `state_count[row.state]` to 1.
- After the while loop, close the ResultProxy `results_proxy` using `.close()`.
- Print `state_count`.

In [49]:
from sqlalchemy import create_engine, Table, MetaData, select
engine = create_engine('sqlite:///census.sqlite')
connection = engine.connect()
# Create the MetaData object before reflecting the table into it
metadata = MetaData()
census = Table('census', metadata, autoload=True, autoload_with=engine)
stmt = select([census])
results_proxy = connection.execute(stmt)

In [50]:
more_results = True
state_count = {}

# Start a while loop checking for more results
while more_results:
    # Fetch the next 50 results from the ResultProxy: partial_results
    partial_results = results_proxy.fetchmany(50)

    # if empty list, set more_results to False
    if partial_results == []:
        more_results = False

    # Loop over the fetched records and increment the count for the state
    for row in partial_results:
        if row.state in state_count:
            state_count[row.state] += 1
        else:
            state_count[row.state] = 1

# Close the ResultProxy, and thus the connection
results_proxy.close()

# Print the count by state
print(state_count)

{'Illinois': 172, 'New Jersey': 172, 'District of Columbia': 172, 'North Dakota': 172, 'Florida': 172, 'Maryland': 172, 'Idaho': 172, 'Massachusetts': 172, 'Oregon': 172, 'Nevada': 172, 'Michigan': 172, 'Wisconsin': 172, 'Missouri': 172, 'Washington': 172, 'North Carolina': 172, 'Arizona': 172, 'Arkansas': 172, 'Colorado': 172, 'Indiana': 172, 'Pennsylvania': 172, 'Hawaii': 172, 'Kansas': 172, 'Louisiana': 172, 'Alabama': 172, 'Minnesota': 172, 'South Dakota': 172, 'New York': 172, 'California': 172, 'Connecticut': 172, 'Ohio': 172, 'Rhode Island': 172, 'Georgia': 172, 'South Carolina': 172, 'Alaska': 172, 'Delaware': 172, 'Tennessee': 172, 'Vermont': 172, 'Montana': 172, 'Kentucky': 172, 'Utah': 172, 'Nebraska': 172, 'West Virginia': 172, 'Iowa': 172, 'Wyoming': 172, 'Maine': 172, 'New Hampshire': 172, 'Mississippi': 172, 'Oklahoma': 172, 'New Mexico': 172, 'Virginia': 172, 'Texas': 172}


*As a data scientist, you'll inevitably come across huge databases, and being able to work on them in blocks is a vital skill.*