# 5. Putting it all together
**Bring together all of the skills you acquired in the previous chapters to work on a real-life project. From connecting to a database and populating it, to reading and querying it.**

It's time to put all your effort so far to good use on a census case study.

### Census case study
The case study is broken down into three parts:
1. Prepare SQLAlchemy and the database.
2. Load the data into the database.
3. Solve a few data science problems using what we know about queries.

### Part 1: preparing SQLAlchemy and the database
For part 1 we are going to focus on preparing SQLAlchemy and the database. You might remember this example from Chapter 1. We import `create_engine` and `MetaData`, then create the engine and initialize the metadata.
```python
from sqlalchemy import create_engine, MetaData
engine = create_engine('sqlite:///census_nyc.sqlite')
metadata = MetaData()
```

Then we will build the census table to hold our data. You might remember the employees table we built in Chapter 4. We begin by importing the `Table` and `Column` objects along with all the types we are going to use in our table. Next we define the table with the `Table` object by giving it a name, the metadata object, and each of the columns we want. Finally, we create the table in the database by calling the `.create_all()` method on the metadata with the engine.
```python
from sqlalchemy import create_engine, MetaData
from sqlalchemy import Table, Column, String, Integer, Numeric, Boolean

engine = create_engine('sqlite:///')
metadata = MetaData()

employees = Table('employees', metadata,
                  Column('id', Integer()),
                  Column('name', String(255)),
                  Column('salary', Numeric()),
                  Column('active', Boolean()))
metadata.create_all(engine)
```

## Set up the engine and metadata
In this exercise, your job is to create an engine to the database that will be used in this chapter. Then, you need to initialize its metadata.

Recall how you did this in Chapter 1 by leveraging `create_engine()` and `MetaData()`.

- Import `create_engine` and `MetaData` from `sqlalchemy`.
- Create an `engine` to the chapter 5 database by using `'sqlite:///chapter5.sqlite'` as the connection string.
- Create a MetaData object as `metadata`.

In [1]:
# Import create_engine, MetaData
from sqlalchemy import create_engine, MetaData

# Define an engine to connect to chapter5.sqlite: engine
engine = create_engine('sqlite:///chapter5.sqlite')

# Initialize MetaData: metadata
metadata = MetaData()

## Create the table in the database
Having set up the engine and initialized the metadata, you will now define the `census` table object and then create it in the database using the `metadata` and `engine` from the previous exercise. To create it in the database, use the `.create_all()` method on `metadata` with `engine` as the argument.

- Import `Table`, `Column`, `String`, and `Integer` from `sqlalchemy`.
- Define a `census` table with the following columns:
    - `'state'` - String - length of 30
    - `'sex'` - String - length of 1
    - `'age'` - Integer
    - `'pop2000'` - Integer
    - `'pop2008'` - Integer
- Create the table in the database using the `metadata` and `engine`.

In [2]:
# Import Table, Column, String, and Integer
from sqlalchemy import Table, Column, String, Integer

# Build a census table: census
census = Table('census', metadata,
               Column('state', String(30)),
               Column('sex', String(1)),
               Column('age', Integer),
               Column('pop2000', Integer),
               Column('pop2008', Integer))

# Create the table in the database
metadata.create_all(engine)
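
One way to check what the table definition amounts to is to run the equivalent DDL by hand and inspect the result. The sketch below uses the standard-library `sqlite3` module instead of SQLAlchemy so that it runs without extra dependencies. It assumes that `String(30)` maps to `VARCHAR(30)` and `Integer` to `INTEGER` on SQLite, and it works on an in-memory database rather than `chapter5.sqlite`.

```python
import sqlite3

# In-memory database: nothing is written to disk.
conn = sqlite3.connect(':memory:')

# Hand-written DDL, assumed to match the census Table definition.
conn.execute(
    "CREATE TABLE census ("
    "state VARCHAR(30), sex VARCHAR(1), "
    "age INTEGER, pop2000 INTEGER, pop2008 INTEGER)"
)

# List the tables in the database, then the column names of census.
tables = [name for (name,) in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")]
columns = [row[1] for row in conn.execute("PRAGMA table_info(census)")]

print(tables)
print(columns)
```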

---
## Populating the database
With our table in place, we can now load the data into it. The US Census Agency gave us a CSV file full of data that we need to load into the table.

### Part 2: populating the database
We'll start by building a `values_list`, as we did in this exercise from Chapter 4.
```python
values_list = []
for row in csv_reader:
    data = {'state': row[0], 'sex': row[1], 'age': row[2],
           'pop2000': row[3], 'pop2008': row[4]}
    values_list.append(data)
```
We begin by defining an empty list and looping over the rows of the CSV. For each row, we build a dictionary that matches the row's data to the column we want to store it in, then append the dictionary to the values list.
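
The loop can be tried without the real census file. The following is a minimal sketch that reads from a small in-memory sample; the three rows are made up for illustration, and the sample is assumed to have no header row. Unlike the loop above, it also converts the numeric fields to `int`, since `csv.reader` yields every field as a string.

```python
import csv
import io

# Hypothetical sample standing in for the census CSV (made-up numbers).
sample_csv = io.StringIO(
    "Illinois,M,0,89600,95012\n"
    "Illinois,F,0,85910,90286\n"
    "New York,M,1,125220,117750\n"
)

csv_reader = csv.reader(sample_csv)

values_list = []
for row in csv_reader:
    # Every field arrives as a string, so convert the numeric ones.
    data = {'state': row[0], 'sex': row[1], 'age': int(row[2]),
            'pop2000': int(row[3]), 'pop2008': int(row[4])}
    values_list.append(data)

print(len(values_list))
print(values_list[0])
```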

Now we can insert that `values_list`, as we did in this example from Chapter 4. We start by importing the `insert` statement. Then we build an insert statement for our table. Finally, we use the `execute` method on our connection, passing the statement and the values list, to insert the data into the table.
```python
from sqlalchemy import insert
stmt = insert(employees)
result_proxy = connection.execute(stmt, values_list)
print(result_proxy.rowcount)
```
```
2
```
To review how many rows were inserted, we use the `rowcount` attribute of the `ResultProxy`.
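
Passing a list of dictionaries means one insert is run per dictionary. The same idea can be sketched at the database-driver level with the standard-library `sqlite3` module, used here in place of SQLAlchemy so the example has no dependencies. The table layout and the two rows are hypothetical.

```python
import sqlite3

# In-memory database: nothing is written to disk.
conn = sqlite3.connect(':memory:')
conn.execute(
    "CREATE TABLE census (state VARCHAR(30), sex VARCHAR(1), "
    "age INTEGER, pop2000 INTEGER, pop2008 INTEGER)"
)

# Made-up rows in the same shape as values_list.
values_list = [
    {'state': 'Illinois', 'sex': 'M', 'age': 0,
     'pop2000': 89600, 'pop2008': 95012},
    {'state': 'Illinois', 'sex': 'F', 'age': 0,
     'pop2000': 85910, 'pop2008': 90286},
]

# Named placeholders are filled from each dictionary in turn.
cursor = conn.executemany(
    "INSERT INTO census (state, sex, age, pop2000, pop2008) "
    "VALUES (:state, :sex, :age, :pop2000, :pop2008)",
    values_list,
)
conn.commit()

# Count the stored rows to confirm the insert.
row_total = conn.execute("SELECT COUNT(*) FROM census").fetchone()[0]
print(cursor.rowcount, row_total)
```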

## Reading the data from the CSV
Leverage the Python CSV module from the standard library and load the data into a list of dictionaries.

- Create an empty list called `values_list`.
- Iterate over the rows of `csv_reader` with a for loop, creating a dictionary called `data` for each row and appending it to `values_list`.
    - Within the for loop, `row` will be a list whose entries are `'state'`, `'sex'`, `'age'`, `'pop2000'` and `'pop2008'` (in that order).

In [15]:
import csv

csv_reader = csv.reader(open('census.csv'))

# Create an empty list: values_list
values_list = []

# Iterate over the rows
for row in csv_reader:
    # Create a dictionary with the values
    data = {'state': row[0], 'sex': row[1], 'age': row[2], 
            'pop2000': row[3], 'pop2008': row[4]}
    # Append the dictionary to the values list
    values_list.append(data)

## Load data from a list into the Table
Using the multiple insert pattern, in this exercise, you will load the data from `values_list` into the table.

- Import `insert` from `sqlalchemy`.
- Build an insert statement for the `census` table.
- Execute the statement `stmt` along with `values_list`. You will need to pass them both as arguments to `connection.execute()`.
- Print the `rowcount` attribute of `results`.

In [16]:
# Import insert
from sqlalchemy import insert

# Build insert statement: stmt
stmt = insert(census)

# Use values_list to insert data: results
results = connection.execute(stmt, values_list)

# Print rowcount
print(results.rowcount)

8772


---
## Querying the database
### Part 3: answering data science questions with queries
Here is how we calculated a weighted average in an exercise from Chapter 3. We began with the imports. Next we built a select statement that computes the weighted average: the sum of age multiplied by population, divided by the sum of the population, labeled `average_age`. Then we grouped by the `sex` column to get the average age for each sex. Finally, we executed the query and fetched all the results.
```python
from sqlalchemy import select, func
stmt = select([census.columns.sex,
               (func.sum(census.columns.pop2008 *
                         census.columns.age) /
                func.sum(census.columns.pop2008)
               ).label('average_age')])
stmt = stmt.group_by(census.columns.sex)
results = connection.execute(stmt).fetchall()
```

We learned how to calculate a percentage by using the case and cast clauses in Chapter 3. We begin by importing `case`, `cast`, and `Float`. Then we build a select statement that calculates the sum of the `pop2008` column in cases where the state is New York. We divide that by the sum of the total population, cast to `Float` so that the division keeps its fractional part. Finally, we multiply by 100 to get a percentage and label it `ny_percent`.
```python
from sqlalchemy import select, func, case, cast, Float
stmt = select([
        (func.sum(
            case([
                (census.columns.state == 'New York',
                 census.columns.pop2008)
            ], else_=0)) /
         cast(func.sum(census.columns.pop2008),
              Float) * 100).label('ny_percent')])
```
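
To see why the cast matters, the same calculation can be written as plain SQL. This is a minimal sketch using the standard-library `sqlite3` module rather than SQLAlchemy, with three made-up rows. The SQL is an assumed hand-written equivalent of the statement above, not output captured from SQLAlchemy.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE census (state TEXT, pop2008 INTEGER)")
conn.executemany(
    "INSERT INTO census VALUES (?, ?)",
    [('New York', 25), ('Texas', 50), ('Ohio', 25)],  # made-up numbers
)

query = """
    SELECT SUM(CASE WHEN state = 'New York' THEN pop2008 ELSE 0 END)
               / SUM(pop2008) * 100                 AS without_cast,
           SUM(CASE WHEN state = 'New York' THEN pop2008 ELSE 0 END)
               / CAST(SUM(pop2008) AS FLOAT) * 100  AS ny_percent
    FROM census
"""
without_cast, ny_percent = conn.execute(query).fetchone()

# Integer division truncates 25 / 100 to 0 before the multiplication.
print(without_cast)
# With the cast, the division is done in floating point.
print(round(ny_percent, 2))
```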

Also in Chapter 3, we learned how to calculate the difference between two columns grouped by another column. We start by building a `select` statement that selects the column we want to determine the change by, which in this case is `age`. Then we calculate the difference between the population in 2008 and in 2000, and we label that `pop_change`. Remember to wrap the difference calculation in parentheses so you can label it. Next, we order by `pop_change`, and finally we limit it to just 5 results.
```python
stmt = select([census.columns.age,
              (census.columns.pop2008 -
               census.columns.pop2000).label('pop_change')
              ])
stmt = stmt.order_by('pop_change')
stmt = stmt.limit(5)
```
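
The effect of ordering by a labeled difference and then limiting can be checked on a tiny table. The sketch below uses the standard-library `sqlite3` module instead of SQLAlchemy, with four made-up rows and a limit of 2 instead of 5. The default sort order is ascending, so the largest decreases come first.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute(
    "CREATE TABLE census (age INTEGER, pop2000 INTEGER, pop2008 INTEGER)")
conn.executemany(
    "INSERT INTO census VALUES (?, ?, ?)",
    [(0, 100, 130), (1, 100, 90), (2, 100, 105), (3, 100, 60)],  # made up
)

query = """
    SELECT age, (pop2008 - pop2000) AS pop_change
    FROM census
    ORDER BY pop_change
    LIMIT 2
"""
rows = conn.execute(query).fetchall()
print(rows)
```

Adding `DESC` after `pop_change` in the `ORDER BY` clause would list the largest increases first instead.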

## Determine the average age by population
To calculate a weighted average, we first find the total sum of weights multiplied by the values we're averaging, then divide by the sum of all the weights.

For example, if we wanted to find a weighted average of `data = [10, 30, 50]` weighted by `weights = [2, 4, 6]`, we would compute `(2*10 + 4*30 + 6*50) / (2 + 4 + 6)`, or `sum(weights * data) / sum(weights)`.
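
In plain Python the shorthand has to be spelled out, because lists do not multiply element-wise. A minimal sketch of the same arithmetic:

```python
data = [10, 30, 50]
weights = [2, 4, 6]

# Sum of weight * value, divided by the sum of the weights.
weighted_sum = sum(w * x for w, x in zip(weights, data))
weighted_average = weighted_sum / sum(weights)

print(weighted_sum)                # 440
print(round(weighted_average, 2))  # 36.67
```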

In this exercise, however, you will use **`func.sum()`** together with `select` to compute the weighted average of a column from a table. You will still work with the `census` data: you will compute the average of `age` weighted by population in the year 2000, and then group this weighted average by sex.

- Import `select` and `func` from `sqlalchemy`.
- Write a statement to `select` the average of age (`age`) weighted by population in **2000** (`pop2000`) from `census`.

In [17]:
# Import select and func
from sqlalchemy import select, func

# Select the average of age weighted by pop2000
stmt = select([func.sum(census.columns.pop2000 *
                        census.columns.age) /
               func.sum(census.columns.pop2000)])

- Modify the select statement to alias the new column with weighted average as `'average_age'` using `.label()`.

In [18]:

# Import select and func
from sqlalchemy import select, func

# Relabel the new column as average_age
stmt = select([(func.sum(census.columns.pop2000 *
                         census.columns.age) /
                func.sum(census.columns.pop2000)).label('average_age')
              ])

- Modify the select statement to select the `sex` column of `census` in addition to the weighted average, with the `sex` column coming first.
- Group by the `sex` column of `census`.

In [19]:
# Import select and func
from sqlalchemy import select, func

# Add the sex column to the select statement
stmt = select([census.columns.sex,
               (func.sum(census.columns.pop2000 *
                         census.columns.age) /
                func.sum(census.columns.pop2000)).label('average_age')
              ])

# Group by sex
stmt = stmt.group_by(census.columns.sex)

- Execute the statement on the `connection` and fetch all the results.
- Loop over the results and print the values in the `sex` and `average_age` columns for each record in the results.

In [20]:

# Import select and func
from sqlalchemy import select, func

# Select sex and average age weighted by 2000 population
stmt = select([census.columns.sex,
               (func.sum(census.columns.pop2000 * 
                         census.columns.age) / 
                func.sum(census.columns.pop2000)).label('average_age')
              ])

# Group by sex
stmt = stmt.group_by(census.columns.sex)

# Execute the query and fetch all the results
connection = engine.connect()
results = connection.execute(stmt).fetchall()

# Print the sex and average age column for each result
for result in results:
    print(result.sex, result.average_age)

F 37
M 34
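
The averages printed above are whole numbers, which suggests the division was carried out on integers: both sums are integers, and SQLite truncates integer division. The sketch below reproduces the effect with the standard-library `sqlite3` module and made-up rows (not SQLAlchemy, and not the census data), and shows one possible workaround, multiplying the numerator by `1.0`.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute(
    "CREATE TABLE census (sex TEXT, age INTEGER, pop2000 INTEGER)")
conn.executemany(
    "INSERT INTO census VALUES (?, ?, ?)",
    [('F', 10, 1), ('F', 20, 2), ('M', 30, 2), ('M', 41, 1)],  # made up
)

query = """
    SELECT sex,
           SUM(pop2000 * age) / SUM(pop2000)       AS truncated_avg,
           SUM(pop2000 * age) * 1.0 / SUM(pop2000) AS exact_avg
    FROM census
    GROUP BY sex
    ORDER BY sex
"""
rows = conn.execute(query).fetchall()

# Each row holds the sex, the truncated average and the exact average.
for sex, truncated_avg, exact_avg in rows:
    print(sex, truncated_avg, round(exact_avg, 2))
```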


## Determine the percentage of population by gender and state
In this exercise, you will write a query to determine the percentage of the population in 2000 that was made up of women. You will group this query by state.

- Import `case`, `cast` and `Float` from `sqlalchemy`.
- Define a statement to select `state` and the percentage of women in 2000.
    - Inside `func.sum()`, use `case()` to select women (using the `sex` column) from `pop2000`. Remember to specify `else_=0` if the `sex` is not `'F'`.
    - To get the percentage, divide the number of women in the year 2000 by the overall population in 2000. Cast the divisor, the sum of `census.columns.pop2000`, to `Float` before multiplying by 100.
- Group the query by `state`.
- Execute the query and store it as `results`.
- Print `state` and `percent_female` for each record.

In [25]:

# Import case, cast, Float and desc from sqlalchemy
from sqlalchemy import case, cast, Float, desc

# Build a query to calculate the percentage of women in 2000: stmt
stmt = select([census.columns.state, 
               (func.sum(
                   case([
                       (census.columns.sex == 'F', 
                        census.columns.pop2000)
                   ], else_=0)) /
                cast(func.sum(census.columns.pop2000), 
                     Float) * 100).label('percent_female')
])

# Group By state
stmt = stmt.group_by(census.columns.state)

# Order by percent_female, highest first
stmt = stmt.order_by(desc('percent_female'))

# Execute the query and store the results: results
results = connection.execute(stmt).fetchall()

# Print the percentage
for result in results:
    print(result.state, result.percent_female)

District of Columbia 53.129626141738385
Rhode Island 52.07343391902215
Maryland 51.93575549972231
Mississippi 51.92229481794672
Massachusetts 51.843023571316785
New York 51.83453865150073
Alabama 51.832407770179465
Louisiana 51.75351596554121
Pennsylvania 51.74043473051053
South Carolina 51.73072129765755
Connecticut 51.66816507130644
Virginia 51.657252447241795
Delaware 51.61109733558627
New Jersey 51.51713956125773
Maine 51.50570813418951
North Carolina 51.482262322084594
Missouri 51.46888602639692
Ohio 51.46550350015544
Tennessee 51.430689699449275
West Virginia 51.40042318092286
Florida 51.36488001165242
Kentucky 51.32687036927168
Arkansas 51.26992846221834
Hawaii 51.118011836915514
Georgia 51.11408350339436
Oklahoma 51.11362457075227
Illinois 51.11224234802867
New Mexico 51.0471720798335
Vermont 51.018573209949466
Michigan 50.97246518318712
Indiana 50.95480313297678
Iowa 50.950398342534264
Nebraska 50.8584549336086
New Hampshire 50.858019844961746
Kansas 50.821864107754735
Wiscons

*Interestingly, the District of Columbia had the highest percentage of women in 2000, while Alaska had the highest percentage of males.*

## Determine the difference by state from the 2000 and 2008 censuses
In this final exercise, you will write a query to calculate the states that changed the most in population. You will limit your query to display only the top 10 states.

- Build a statement to:
    - Select `state`.
    - Calculate the difference in population between 2008 (`pop2008`) and 2000 (`pop2000`).
- Group the query by `census.columns.state` using the `.group_by()` method on `stmt`.
- Order by `'pop_change'` in descending order using the `.order_by()` method with the `desc()` function on `'pop_change'`.
- ~Limit the query to the top `10` states using the `.limit()` method.~
- Execute the query and store it as `results`.
- Print the state and the population change for each result. 

In [27]:
# Build query to return state name and population difference between 2000 and 2008
stmt = select([census.columns.state, 
               (census.columns.pop2008-
                census.columns.pop2000).label('pop_change')
])

# Group by State
stmt = stmt.group_by(census.columns.state)

# Order by Population Change
stmt = stmt.order_by(desc('pop_change'))

# Limit to top 10
##stmt = stmt.limit(10)

# Use connection to execute the statement and fetch all results
results = connection.execute(stmt).fetchall()

# Print the state and population change for each record
for result in results:
    print('{}:{}'.format(result.state, result.pop_change))

Texas:40137
California:35406
Florida:21954
Arizona:14377
Georgia:13357
North Carolina:11574
Virginia:6639
Colorado:6425
Utah:5934
Illinois:5412
Nevada:5367
Washington:4666
Tennessee:4621
Missouri:4547
Minnesota:3763
Oklahoma:3677
Pennsylvania:3384
South Carolina:3360
Wisconsin:2945
Oregon:2817
Maryland:2551
Arkansas:2549
Idaho:2500
Indiana:2336
New Mexico:2095
Kentucky:2021
Nebraska:1924
Iowa:1915
Mississippi:1864
New York:1851
New Jersey:1773
Kansas:1772
Ohio:1585
Alabama:1576
Hawaii:1454
South Dakota:990
Montana:960
Delaware:858
Wyoming:830
Alaska:740
District of Columbia:659
North Dakota:585
West Virginia:537
Maine:358
Rhode Island:197
New Hampshire:189
Vermont:7
Massachusetts:-242
Louisiana:-300
Connecticut:-392
Michigan:-2592
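
The `census` table holds many rows per state (one for each sex and age), so a per-state total needs an aggregate inside each group. The following is a hedged sketch of that variant, summing the difference within each state; whether the exercise intended this is an assumption. It uses the standard-library `sqlite3` module with made-up rows instead of SQLAlchemy and the real data.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute(
    "CREATE TABLE census (state TEXT, sex TEXT, age INTEGER, "
    "pop2000 INTEGER, pop2008 INTEGER)")
conn.executemany(
    "INSERT INTO census VALUES (?, ?, ?, ?, ?)",
    [('Texas', 'M', 0, 100, 130), ('Texas', 'F', 0, 100, 120),
     ('Ohio', 'M', 0, 100, 90), ('Ohio', 'F', 0, 100, 105),
     ('Utah', 'M', 0, 50, 70), ('Utah', 'F', 0, 50, 60)],  # made up
)

query = """
    SELECT state, SUM(pop2008 - pop2000) AS pop_change
    FROM census
    GROUP BY state
    ORDER BY pop_change DESC
    LIMIT 2
"""
rows = conn.execute(query).fetchall()

# Print each state with its summed population change.
for state, pop_change in rows:
    print('{}:{}'.format(state, pop_change))
```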
