# SQL Aggregates

### Introduction

So far we have used SQL to retrieve information about individual rows.  For example, we retrieved the name of *each* individual employee.  But what if we want to ask a question that requires examining multiple rows at once?  For example, we may want to know the number of rows in a table, the maximum or minimum value in a column, or the average of a column across many rows.  We'll explore questions like these in this lesson.

### Working with CSV and SQL

Let's start working with some real data.  For this lesson, we'll work with a list of restaurants in New York City that were listed on Yelp.  We can find the data [here](https://raw.githubusercontent.com/ledeprogram/courses/master/foundations/mapping/tilemill/yelp-lunch-nyc.csv).

And we can load that data using the `pandas` library.  Let's see how.

In [1]:
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/ledeprogram/courses/master/foundations/mapping/tilemill/yelp-lunch-nyc.csv')

So we first import the pandas library and read the CSV, which we then store as something called a dataframe.  We can see the first few rows of a dataframe by calling the `head` method.

In [5]:
df.head(3)

| | Name | Address | City | Category | Rating | URL |
|---|---|---|---|---|---|---|
| 0 | Rambling House | 4292 Katonah Ave | Bronx | Pubs | 4.0 | http://www.yelp.com/biz/rambling-house-bronx |
| 1 | Curry Spot | 4268 Katonah Ave | Bronx | Indian | 4.0 | http://www.yelp.com/biz/curry-spot-bronx |
| 2 | Eileens Country Kitchen | 964 McLean Ave | Yonkers | American (Traditional) | 3.5 | http://www.yelp.com/biz/eileens-country-kitche... |


There's our data.  We'll learn more about pandas in future lessons.  For now, let's stick with SQL.

We can convert our dataframe to SQL with the `to_sql` method in pandas.

In [6]:
df.to_sql()

TypeError: to_sql() missing 2 required positional arguments: 'name' and 'con'

The `to_sql` method requires two arguments: the name of the table we wish to create and a connection to the database.  We don't yet have a connection to a database, or even a database at all, so let's use SQLite to create one.  Then we can pass that connection to the `to_sql` method.

In [7]:
import sqlite3
yelp_db = sqlite3.connect('yelp.db')

In [8]:
df.to_sql('restaurants', yelp_db)

Great, so now the data from our dataframe should be loaded into a SQL table.  We can confirm this with a `SELECT` query to our database.

In [9]:
cursor = yelp_db.cursor()
cursor.execute('SELECT * FROM restaurants LIMIT 1')
cursor.fetchall()

[(0,
  'Rambling House',
  '4292 Katonah Ave',
  'Bronx',
  'Pubs',
  4.0,
  'http://www.yelp.com/biz/rambling-house-bronx')]
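
The tuple above has no column labels, so it can be hard to tell which value is which.  As a minimal sketch, we can pair each value with its column name using `cursor.description`.  The in-memory table below is a hypothetical stand-in for `yelp.db` holding two of the sample rows, and the `index` column name is an assumption about how the dataframe's index was saved.

```python
import sqlite3

# Hypothetical in-memory stand-in for yelp.db (assumption: same column layout).
conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE restaurants ("index" INTEGER, Name TEXT, Address TEXT, '
    'City TEXT, Category TEXT, Rating REAL, URL TEXT)'
)
conn.executemany('INSERT INTO restaurants VALUES (?, ?, ?, ?, ?, ?, ?)', [
    (0, 'Rambling House', '4292 Katonah Ave', 'Bronx', 'Pubs', 4.0,
     'http://www.yelp.com/biz/rambling-house-bronx'),
    (1, 'Curry Spot', '4268 Katonah Ave', 'Bronx', 'Indian', 4.0,
     'http://www.yelp.com/biz/curry-spot-bronx'),
])

cursor = conn.cursor()
cursor.execute('SELECT * FROM restaurants LIMIT 1')

# Each entry of cursor.description is a tuple whose first item is the column name.
column_names = [col[0] for col in cursor.description]
first_row = cursor.fetchone()

# Pair every column name with the matching value from the row.
labeled = dict(zip(column_names, first_row))
print(labeled)
```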

### Working with Aggregates

Now that we have our data in our `yelp.db` database, it's time to work with aggregates.  Remember that with aggregates, we ask questions of multiple rows.  Let's start by trying to find the highest rating in our database.  We can do so with the following.

In [10]:
cursor.execute('SELECT MAX(rating) FROM restaurants')
cursor.fetchall()

[(5.0,)]

Now let's find the lowest.

In [11]:
cursor.execute('SELECT MIN(rating) FROM restaurants')
cursor.fetchall()

[(1.0,)]

Notice that each query returns only one row: an aggregate collapses all of the rows it examines into a single value.  The general format for an aggregate query is:

```sql
SELECT aggregate(column) FROM table_name
```
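
The same pattern also accepts more than one aggregate in a single `SELECT`.  Here is a small sketch, assuming a hypothetical in-memory table built from the three sample rows shown earlier rather than the full `yelp.db`.

```python
import sqlite3

# Hypothetical three-row table (assumption: a stand-in for the real restaurants table).
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE restaurants (Name TEXT, City TEXT, Rating REAL)')
conn.executemany('INSERT INTO restaurants VALUES (?, ?, ?)', [
    ('Rambling House', 'Bronx', 4.0),
    ('Curry Spot', 'Bronx', 4.0),
    ('Eileens Country Kitchen', 'Yonkers', 3.5),
])

# Two aggregates in one query still produce a single row, with one value per aggregate.
row = conn.execute('SELECT MIN(Rating), MAX(Rating) FROM restaurants').fetchone()
lowest, highest = row
print(lowest, highest)  # 3.5 4.0
```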

Now let's find the average rating with `AVG`.

In [12]:
cursor.execute('SELECT AVG(rating) FROM restaurants')
cursor.fetchall()

[(3.892015143692996,)]

That wasn't so bad.  We quickly found the average rating using just SQL.

And if we want to find the number of entries in our table, we use the `COUNT` function.

`COUNT` is interesting because we could count any individual column and, as long as that column has no missing values, we would get the same answer.  For example, counting the restaurant names would return the same number as counting the ratings.  (`COUNT(column)` skips rows where that column is `NULL`.)  Really, what we want to do is count the rows.  To specify that we want to count entire rows rather than a specific column, we use `*` as our argument.
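
An average often comes back with many decimal places.  As a sketch, SQLite's `ROUND` function can wrap the aggregate to trim it.  The three-row table here is a hypothetical stand-in for the real data, so the number differs from the result above.

```python
import sqlite3

# Hypothetical three-row table (assumption: a stand-in for the real restaurants table).
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE restaurants (Name TEXT, Rating REAL)')
conn.executemany('INSERT INTO restaurants VALUES (?, ?)', [
    ('Rambling House', 4.0),
    ('Curry Spot', 4.0),
    ('Eileens Country Kitchen', 3.5),
])

# ROUND(value, 2) keeps two decimal places of the average.
rounded_avg = conn.execute(
    'SELECT ROUND(AVG(Rating), 2) FROM restaurants'
).fetchone()[0]
print(rounded_avg)  # 3.83
```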

In [17]:
cursor.execute('SELECT COUNT(*) FROM restaurants')
cursor.fetchall()

[(5811,)]
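
To see why `COUNT(*)` is the safer choice for counting rows, here is a minimal sketch with a hypothetical table in which one restaurant has no rating.  The missing rating is invented purely for illustration.

```python
import sqlite3

# Hypothetical table; the missing (NULL) rating is made up for this example.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE restaurants (Name TEXT, Rating REAL)')
conn.executemany('INSERT INTO restaurants VALUES (?, ?)', [
    ('Rambling House', 4.0),
    ('Curry Spot', 4.0),
    ('Unrated Diner', None),  # None is stored as NULL
])

# COUNT(*) counts rows; COUNT(Rating) counts only rows where Rating is not NULL.
total_rows, rated_rows = conn.execute(
    'SELECT COUNT(*), COUNT(Rating) FROM restaurants'
).fetchone()
print(total_rows, rated_rows)  # 3 2
```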

### Aggregates and Where Clauses

So far we have queried the entire table of restaurants.  But now let's say that we want to find the average rating not for all of the restaurants, but just for restaurants in the Bronx.  Is that rating higher or lower than the average rating of 3.89 for all of our restaurants?

In [18]:
cursor.execute("SELECT AVG(rating) FROM restaurants WHERE City = 'Bronx'")
cursor.fetchall()

[(3.821297429620563,)]

So it is slightly lower.  Here SQL first limited the rows to restaurants in the Bronx, and then took the average of just those rows.
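
When the city comes from a variable rather than being typed into the query, one option is a `?` placeholder, which lets `sqlite3` fill in the value safely.  This is a sketch on a hypothetical three-row table, so the averages do not match the real results above.

```python
import sqlite3

# Hypothetical three-row table (assumption: a stand-in for the real restaurants table).
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE restaurants (Name TEXT, City TEXT, Rating REAL)')
conn.executemany('INSERT INTO restaurants VALUES (?, ?, ?)', [
    ('Rambling House', 'Bronx', 4.0),
    ('Curry Spot', 'Bronx', 4.0),
    ('Eileens Country Kitchen', 'Yonkers', 3.5),
])

city = 'Bronx'

# The WHERE clause filters the rows first; AVG then runs on the rows that remain.
bronx_avg = conn.execute(
    'SELECT AVG(Rating) FROM restaurants WHERE City = ?', (city,)
).fetchone()[0]
print(bronx_avg)  # 4.0
```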

### Summary

In this lesson, we saw how to use aggregate functions in SQL.  Aggregate functions return a value calculated from multiple rows in the database instead of just one.  We saw that we can return an aggregate value with the syntax `SELECT aggregate(column_name) FROM table_name`.  And we saw that we can return the aggregate for a subset of our rows by combining an aggregate function with a `WHERE` clause.