# NYC High Schools Aggregates

### Introduction
In this lab we will practice using SQL aggregate functions. These functions, such as AVG, MIN, and MAX, perform a mathematical operation on a set of values and return a single value. We will also use the GROUP BY clause. GROUP BY groups rows that have identical values in a column (or columns), usually so that an aggregate function can be applied to each group.

In the database we are using in this lab, each row represents a school, and each column holds a metric or other information about that school. We could use an aggregate function to find the MAX total students among all the schools listed. But what if we wanted to know the MAX number of students in each boro? With only a WHERE clause, that would require a separate statement for each boro. That's where the GROUP BY clause comes in. In this example we could use GROUP BY boro, and a single query would return the result of our aggregate function for each boro.
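The difference between the WHERE approach and the GROUP BY approach can be sketched on a small in-memory table. The table name `schools_demo` and its rows are made up for illustration and are not the lab's data.

```python
import sqlite3

# Hypothetical toy data, not the lab's dataset.
demo_conn = sqlite3.connect(":memory:")
demo_cur = demo_conn.cursor()
demo_cur.execute("CREATE TABLE schools_demo (boro TEXT, total_students INTEGER)")
demo_cur.executemany(
    "INSERT INTO schools_demo VALUES (?, ?)",
    [("M", 400), ("M", 900), ("Q", 1200), ("Q", 700), ("X", 500)],
)

# WHERE approach: one statement per boro.
per_boro = {}
for boro in ("M", "Q", "X"):
    demo_cur.execute(
        "SELECT MAX(total_students) FROM schools_demo WHERE boro = ?", (boro,)
    )
    per_boro[boro] = demo_cur.fetchone()[0]

# GROUP BY approach: a single statement covers every boro.
demo_cur.execute(
    "SELECT boro, MAX(total_students) FROM schools_demo GROUP BY boro ORDER BY boro"
)
grouped = dict(demo_cur.fetchall())

print(per_boro)
print(grouped)
demo_conn.close()
```

Both dictionaries should hold the same values. The GROUP BY version also does not need to know the list of boros in advance.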

Let's begin by using the `sqlite3` library to connect to the database and loading the high schools CSV into a `high_schools` table.

In [3]:
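import sqlite3
import pandas as pd

# Connect to (or create) the SQLite database file and get a cursor for running queries
conn = sqlite3.connect('nyc_schools.db')
cursor = conn.cursor()

# Load the CSV into a high_schools table; replace the table if this cell is re-run
high_school_df = pd.read_csv('./highschools.csv')
high_school_df.to_sql('high_schools', conn, index=False, if_exists='replace')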

In [12]:
cursor.execute('SELECT * FROM high_schools;')
# cursor.fetchall()

<sqlite3.Cursor at 0x123013340>
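
Calling `cursor.fetchall()` on a `SELECT *` query prints every row, so a lighter way to get oriented is to read the column names from `cursor.description` and fetch only a few rows with `fetchmany`. The sketch below uses a made-up table `peek_demo` with hypothetical columns rather than the lab's `high_schools` table.

```python
import sqlite3

# Hypothetical toy table, used only to show the inspection pattern.
peek_conn = sqlite3.connect(":memory:")
peek_cur = peek_conn.cursor()
peek_cur.execute(
    "CREATE TABLE peek_demo (school_name TEXT, boro TEXT, total_students INTEGER)"
)
peek_cur.executemany(
    "INSERT INTO peek_demo VALUES (?, ?, ?)",
    [("School A", "M", 400), ("School B", "Q", 1200), ("School C", "X", 500)],
)

peek_cur.execute("SELECT * FROM peek_demo")

# After a SELECT, each entry of cursor.description starts with the column name.
column_names = [col[0] for col in peek_cur.description]

# fetchmany(n) returns at most n rows instead of the whole result set.
first_rows = peek_cur.fetchmany(2)

print(column_names)
print(first_rows)
peek_conn.close()
```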

### Aggregates

For each of the questions below, use a SQL aggregate function to find the solution.

* What's the average number of students in Manhattan?

In [5]:
cursor.execute("""
SELECT AVG(total_students) 
  FROM high_schools
 WHERE boro = 'M';

""")
cursor.fetchall()

[(601.9666666666667,)]

* What's the average attendance in Manhattan?

In [6]:
cursor.execute("""
SELECT AVG(attendance_rate) 
  FROM high_schools
 WHERE boro = 'M';
""")
cursor.fetchall()

[(0.8782222222222222,)]

* What's the largest difference between graduation_rate and college_career_rate?

In [7]:
cursor.execute("""
SELECT MAX(graduation_rate - college_career_rate) 
  FROM high_schools;

""")
cursor.fetchall()

[(0.55,)]

* What is the highest math_avg in Queens?

In [8]:
cursor.execute("""
SELECT MAX(math_avg) 
  FROM high_schools
 WHERE boro = 'Q';
""")
cursor.fetchall()

[(660.0,)]

* What is the highest math_avg in Manhattan?

In [9]:
cursor.execute("""
SELECT MAX(math_avg) 
  FROM high_schools
 WHERE boro = 'M';
""")
cursor.fetchall()

[(735.0,)]

* What is the highest combined score (math_avg plus reading_avg) in Manhattan?

In [10]:
cursor.execute("""
SELECT MAX(math_avg + reading_avg) 
  FROM high_schools
 WHERE boro = 'M';
""")
cursor.fetchall()

[(1414.0,)]

### Group By

* What's the average number of students in each borough?

In [11]:
cursor.execute("""
SELECT boro,
       AVG(total_students) 
  FROM high_schools
 GROUP BY boro;
""")
cursor.fetchall()

[('K', 740.2884615384615),
 ('M', 601.9666666666667),
 ('Q', 1135.4615384615386),
 ('R', 1863.2),
 ('X', 523.4827586206897)]

* What's the average difference between graduation_rate and college_career_rate by borough?

In [12]:
cursor.execute("""
SELECT boro,
       AVG(graduation_rate - college_career_rate) 
  FROM high_schools
 GROUP BY boro;

""")
cursor.fetchall()

[('K', 0.22480392156862752),
 ('M', 0.17298850574712643),
 ('Q', 0.1706153846153846),
 ('R', 0.23200000000000004),
 ('X', 0.21264367816091953)]

* What's the average college_career_rate grouped by math_avg scores? (Hint: https://stackoverflow.com/questions/30929526/sqlite-group-by-range-of-1000s)

In [13]:
cursor.execute("""
SELECT ROUND(math_avg / 100, 1) AS math_score,
       AVG(college_career_rate) 
  FROM high_schools
 GROUP BY math_score;
""")
cursor.fetchall()

[(None, 0.6124999999999999),
 (3.1, 0.42),
 (3.2, 0.446),
 (3.3, 0.47),
 (3.4, 0.51),
 (3.5, 0.4930769230769231),
 (3.6, 0.4627777777777778),
 (3.7, 0.46095238095238084),
 (3.8, 0.49268292682926834),
 (3.9, 0.5093939393939395),
 (4.0, 0.5037037037037039),
 (4.1, 0.57875),
 (4.2, 0.6064),
 (4.3, 0.5788888888888889),
 (4.4, 0.5892307692307693),
 (4.5, 0.72),
 (4.6, 0.7385714285714285),
 (4.7, 0.7218181818181819),
 (4.8, 0.7257142857142858),
 (4.9, 0.724),
 (5.0, 0.79),
 (5.1, 0.8400000000000001),
 (5.2, 0.8566666666666666),
 (5.3, 0.79),
 (5.4, 0.865),
 (5.5, 0.8400000000000001),
 (5.6, 0.8366666666666666),
 (5.7, 0.9133333333333334),
 (5.8, 0.86),
 (5.9, 0.885),
 (6.0, 0.98),
 (6.5, 0.9966666666666667),
 (6.6, 0.9299999999999999),
 (6.8, 0.88),
 (6.9, 0.98),
 (7.3, 0.98)]
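
The query above rounds `math_avg / 100` to one decimal place, so each group is a band about 10 points wide. The `None` group collects the schools whose `math_avg` is NULL, since NULL values form their own group. For coarser bands, one option is to truncate the score to the hundreds with `CAST`. The sketch below assumes a made-up table `scores_demo` with hypothetical rows, not the lab's data.

```python
import sqlite3

# Hypothetical toy data, chosen so the averages are easy to check by hand.
bin_conn = sqlite3.connect(":memory:")
bin_cur = bin_conn.cursor()
bin_cur.execute("CREATE TABLE scores_demo (math_avg REAL, college_career_rate REAL)")
bin_cur.executemany(
    "INSERT INTO scores_demo VALUES (?, ?)",
    [(310.0, 0.25), (390.0, 0.75), (455.0, 0.75), (None, 0.5)],
)

# CAST(... AS INTEGER) truncates, so 310 and 390 both land in the 300 band.
bin_cur.execute("""
SELECT CAST(math_avg / 100 AS INTEGER) * 100 AS math_band,
       AVG(college_career_rate),
       COUNT(*)
  FROM scores_demo
 GROUP BY math_band
 ORDER BY math_band;
""")
bands = bin_cur.fetchall()

for band in bands:
    print(band)
bin_conn.close()
```

Including `COUNT(*)` alongside the average shows how many schools sit in each band, which helps when judging bands that contain only one or two rows.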

### HAVING
One important thing to note is that the WHERE clause cannot filter on the result of an aggregate function, because WHERE is applied to individual rows before they are grouped. For example, let's say we wanted to know the average number of students in each boro, but only for boros with an average of more than 1000. Here we would use the HAVING clause, which filters the groups after the aggregate has been computed. See the example below, and then use the HAVING clause to find the solution for the next question.

In [14]:
cursor.execute("""
SELECT boro,
       AVG(total_students) 
  FROM high_schools
 GROUP BY boro
HAVING AVG(total_students) > 1000;
""")
cursor.fetchall()

[('Q', 1135.4615384615386), ('R', 1863.2)]

* What is the average college career rate for each boro, selecting only boros with an average college career rate less than 0.6?

In [15]:
cursor.execute("""
SELECT boro,
       AVG(college_career_rate) 
  FROM high_schools
 GROUP BY boro
HAVING AVG(college_career_rate) < 0.6;

""")
cursor.fetchall()

[('K', 0.5471568627450981), ('X', 0.5295402298850576)]
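
WHERE and HAVING can also appear in the same statement: WHERE removes rows before grouping, and HAVING removes groups after the aggregate is computed. The sketch below illustrates this on a made-up table `having_demo` with hypothetical rows, and also shows that SQLite rejects an aggregate placed inside WHERE.

```python
import sqlite3

# Hypothetical toy data, not the lab's dataset.
hv_conn = sqlite3.connect(":memory:")
hv_cur = hv_conn.cursor()
hv_cur.execute("CREATE TABLE having_demo (boro TEXT, total_students INTEGER)")
hv_cur.executemany(
    "INSERT INTO having_demo VALUES (?, ?)",
    [("M", 200), ("M", 1400), ("Q", 1200), ("Q", 1000), ("X", 300), ("X", 500)],
)

# WHERE drops schools under 500 students first; HAVING then keeps only
# the boros whose remaining schools average more than 1000 students.
hv_cur.execute("""
SELECT boro,
       AVG(total_students)
  FROM having_demo
 WHERE total_students >= 500
 GROUP BY boro
HAVING AVG(total_students) > 1000
 ORDER BY boro;
""")
filtered = hv_cur.fetchall()
print(filtered)

# An aggregate inside WHERE is an error, which is why HAVING exists.
try:
    hv_cur.execute(
        "SELECT boro FROM having_demo WHERE AVG(total_students) > 1000 GROUP BY boro"
    )
    where_rejected = False
except sqlite3.Error:
    where_rejected = True
print(where_rejected)
hv_conn.close()
```

In this toy data, boro M passes the HAVING filter only because the WHERE clause removed its 200-student school first. Without the WHERE clause its average would be 800.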

### Conclusion
In this lab, we used aggregate functions, which perform a mathematical operation on a set of values in our database and return a single value. We also used the GROUP BY clause, which let us apply an aggregate function to different subsets of the data at once, and the HAVING clause, which let us filter those groups by their aggregate values.