# Module 3: SQL in Pandas

Last week, we learned how to connect to a SQLite database and run queries against it. In this module, we will apply those same SQL queries to pandas DataFrames.

In [17]:
import sqlite3
import pandas as pd

connection = sqlite3.connect("cdeschools.sqlite")
cursor = connection.cursor()

First, let's simplify what we did last week, where we ran a query and then converted the result into a pandas DataFrame. With the pandas function `read_sql_query`, we can pass in a query string and a connection object and get the table back as a DataFrame right away, no cursor required!

In [13]:
pd.read_sql_query('''
    SELECT s.School, s.City, s.Latitude, s.Longitude, sat.NumTstTakr,
        sat.AvgScrRead, sat.AvgScrMath, sat.AvgScrWrite
    FROM schools as s
    INNER JOIN satscores as sat
    ON s.CDSCode = sat.cds
    WHERE s.StatusType = 'Closed'
''', connection)

Unnamed: 0,School,City,Latitude,Longitude,NumTstTakr,AvgScrRead,AvgScrMath,AvgScrWrite
0,FAME Public Charter,Newark,37.521436,-121.99391,17,503.0,546.0,505.0
1,Aspire California College Preparatory Academy,Berkeley,37.868991,-122.27844,0,,,
2,Encinal High,Alameda,37.773616,-122.29027,132,483.0,504.0,476.0
3,,Livermore,37.691041,-121.77055,75,516.0,523.0,515.0
4,North Campus Continuation,San Pablo,37.993898,-122.32079,22,324.0,307.0,328.0
5,Fresno Academy for Civic and Entrepreneurial L...,Fresno,36.731908,-119.79306,10,,,
6,National University Academy - Orange Center,Fresno,36.682671,-119.78174,0,,,
7,Northcoast Preparatory and Performing Arts Aca...,Arcata,40.863604,-124.07508,35,620.0,534.0,579.0
8,YouthBuild Charter School of California Central,Los Angeles,34.031953,-118.26627,0,,,
9,"National University Academy, Armona",Vista,33.170564,-117.22039,0,,,
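
To make the "no cursor required" point concrete, here is a minimal sketch that builds the same DataFrame both ways. It uses a small in-memory database as a stand-in for `cdeschools.sqlite`; the `demo_` names and the "Example High" row are made up for illustration.

```python
import sqlite3
import pandas as pd

# Hypothetical in-memory database standing in for cdeschools.sqlite.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE schools (School TEXT, City TEXT, StatusType TEXT)")
demo_conn.executemany(
    "INSERT INTO schools VALUES (?, ?, ?)",
    [("Encinal High", "Alameda", "Closed"), ("Example High", "Fresno", "Active")],
)

demo_query = "SELECT School, City FROM schools WHERE StatusType = 'Closed'"

# Cursor approach: fetch the rows, then rebuild the column names by hand.
demo_cursor = demo_conn.cursor()
demo_cursor.execute(demo_query)
rows = demo_cursor.fetchall()
column_names = [description[0] for description in demo_cursor.description]
via_cursor = pd.DataFrame(rows, columns=column_names)

# pandas approach: one call, column names included.
via_pandas = pd.read_sql_query(demo_query, demo_conn)

print(via_pandas)
```

Both variables should hold the same single-row table; the pandas version simply skips the bookkeeping.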


This is also a really quick way to get a table's column names when they aren't written down anywhere, which is quite convoluted to do with `sqlite3` alone:

In [15]:
pd.read_sql_query("SELECT * FROM schools", connection).columns

Index(['CDSCode', 'NCESDist', 'NCESSchool', 'StatusType', 'County', 'District',
       'School', 'Street', 'StreetAbr', 'City', 'Zip', 'State', 'MailStreet',
       'MailStrAbr', 'MailCity', 'MailZip', 'MailState', 'Phone', 'Ext',
       'Website', 'OpenDate', 'ClosedDate', 'Charter', 'CharterNum',
       'FundingType', 'DOC', 'DOCType', 'SOC', 'SOCType', 'EdOpsCode',
       'EdOpsName', 'EILCode', 'EILName', 'GSoffered', 'GSserved', 'Virtual',
       'Magnet', 'Latitude', 'Longitude', 'AdmFName1', 'AdmLName1',
       'AdmEmail1', 'AdmFName2', 'AdmLName2', 'AdmEmail2', 'AdmFName3',
       'AdmLName3', 'AdmEmail3', 'LastUpdate'],
      dtype='object')
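
If the table is large, selecting every row just to read the column names can be wasteful. The sketch below shows two lighter alternatives, assuming a SQLite connection: a `LIMIT 0` query and SQLite's `PRAGMA table_info`. The in-memory table and its three columns are a hypothetical stand-in for the real `schools` table.

```python
import sqlite3
import pandas as pd

# Hypothetical miniature of the schools table, for illustration only.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE schools (CDSCode TEXT, School TEXT, City TEXT)")
demo_conn.execute("INSERT INTO schools VALUES ('1', 'Demo High', 'Demo City')")

# LIMIT 0 returns no rows, but the result still carries the column names.
empty_result = pd.read_sql_query("SELECT * FROM schools LIMIT 0", demo_conn)
print(list(empty_result.columns))

# PRAGMA table_info lists one row per column, including the declared type.
table_info = pd.read_sql_query("PRAGMA table_info(schools)", demo_conn)
print(table_info[["name", "type"]])
```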

## Running Queries on DataFrames with `sqldf`

We have seen that SQL is a powerful language for taking existing data from various sources and easily filtering and combining until we get the exact table that we need for our tasks. We only needed a few keywords too! 

So, what if we could run SQL queries on data that doesn't come from a SQL database, like data from a CSV file? Well, we can, although we will need to throw a new library into the mix. Let's try it out:

In [16]:
from pandasql import sqldf

Did you get an error? That's okay! Anaconda (the distribution we installed to manage Python and its packages) doesn't come with every Python library preinstalled, and this is one that we will have to add ourselves. To do this:
- Go to Anaconda Navigator (likely how you launched Jupyter Lab)
- Click on the "Environments" tab
- You'll see a list of packages and their descriptions on the right side of the screen. On the dropdown next to the "Channels" button, make sure to change it to be "All" instead of "Installed." 
- Go to the search bar next to the "Update Index" button and type in "pandasql." One result should pop up in the list. 
- Click on the box next to its name, and it should turn into a green arrow pointing downwards.
- At the very bottom of the window, click on "Apply," and then "Apply" in the window that pops up afterwards. We're done!
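
After following the steps above, one way to check whether Python can now find the package, without triggering an import error, is sketched below. The helper name `is_installed` is made up for this example; `importlib.util.find_spec` is part of the standard library.

```python
import importlib.util

def is_installed(package_name):
    """Return True if Python can locate the named top-level package."""
    return importlib.util.find_spec(package_name) is not None

print("pandasql installed:", is_installed("pandasql"))
```

If this prints `False` right after installing, the notebook is probably running in a different environment than the one you installed into.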

Now let's try that cell again. You may need to restart the Python kernel and rerun the cells in this notebook.

In [18]:
from pandasql import sqldf

Sweet, hopefully it worked now. So how does this new `sqldf()` function work? Well, pass in a query as a string (for example, `sqldf("SELECT * FROM schools")`) and it returns a DataFrame. Pretty slick!
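
Conceptually, `sqldf` copies the DataFrames named in the query into a temporary SQLite database, runs the query there, and reads the result back as a DataFrame. The sketch below imitates that idea using only `sqlite3` and pandas. It is an illustration of the concept, not pandasql's actual code, and `run_query_on_frames`, `pets_df`, and the sample data are all hypothetical.

```python
import sqlite3
import pandas as pd

def run_query_on_frames(query, frames):
    """Run a SQL query against DataFrames given as {table_name: DataFrame}."""
    conn = sqlite3.connect(":memory:")
    try:
        # Copy each DataFrame into the temporary database under its table name.
        for table_name, frame in frames.items():
            frame.to_sql(table_name, conn, index=False)
        # Run the query and hand the result back as a DataFrame.
        return pd.read_sql_query(query, conn)
    finally:
        conn.close()

pets_df = pd.DataFrame({"name": ["Ada", "Bo", "Cy"], "age": [3, 7, 5]})
older_pets = run_query_on_frames(
    "SELECT name FROM pets_df WHERE age > 4 ORDER BY name",
    {"pets_df": pets_df},
)
print(older_pets)
```

The difference with `sqldf` is convenience: it looks up the DataFrames by variable name for you, so there is no dictionary to pass in.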

Keep in mind that every table name you use in a `sqldf` query *must* be the name of a DataFrame variable in your Python session. Let's load two CSVs and give it a shot.

In [50]:
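meals_df = pd.read_csv("../data/frpm.csv")
# Strip spaces from the column names so they are easier to reference in SQL.
meals_df.columns = meals_df.columns.str.replace(' ', '', regex=False)
# As a regular expression, '(%)' matches only the '%' character, so the
# parentheses stay behind (for example, 'Percent()EligibleFRPM(K-12)').
meals_df.columns = meals_df.columns.str.replace('(%)', '', regex=True)

sat_scores_df = pd.read_csv("../data/satscores.csv")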

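
The `Percent()` that shows up in the cleaned column names comes from how the `'(%)'` pattern is interpreted. The sketch below compares the two interpretations on a single column name. The raw name is assumed to look like the original CSV header and is included only for illustration.

```python
import pandas as pd

# Assumed raw header, before any cleaning.
raw_columns = pd.Index(["Percent (%) Eligible FRPM (Ages 5-17)"])
no_spaces = raw_columns.str.replace(" ", "", regex=False)

# As a regular expression, the parentheses form a group, so only '%' is removed.
as_regex = no_spaces.str.replace("(%)", "", regex=True)

# As a literal string, the whole '(%)' is removed.
as_literal = no_spaces.str.replace("(%)", "", regex=False)

print(as_regex[0])    # Percent()EligibleFRPM(Ages5-17)
print(as_literal[0])  # PercentEligibleFRPM(Ages5-17)
```

Because the default value of `regex` has differed between pandas versions, passing it explicitly keeps the resulting column names predictable.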


In [51]:
meals_df.head()

Unnamed: 0,index,AcademicYear,CountyCode,DistrictCode,SchoolCode,CountyName,DistrictName,SchoolName,DistrictType,SchoolType,...,FreeMealCount(K-12),Percent()EligibleFree(K-12),FRPMCount(K-12),Percent()EligibleFRPM(K-12),Enrollment(Ages5-17),FreeMealCount(Ages5-17),Percent()EligibleFree(Ages5-17),FRPMCount(Ages5-17),Percent()EligibleFRPM(Ages5-17),2013-14CALPADSFall1CertificationStatus
0,0,2014-2015,1.0,10017.0,109835.0,Alameda,Alameda County Office of Education,FAME Public Charter,County Office of Education (COE),K-12 Schools (Public),...,565.0,0.519779,715.0,0.657774,1070.0,553.0,0.516822,702.0,0.656075,1.0
1,1,2014-2015,1.0,10017.0,112607.0,Alameda,Alameda County Office of Education,Envision Academy for Arts & Technology,County Office of Education (COE),High Schools (Public),...,186.0,0.470886,186.0,0.470886,376.0,182.0,0.484043,182.0,0.484043,1.0
2,2,2014-2015,1.0,10017.0,118489.0,Alameda,Alameda County Office of Education,Aspire California College Preparatory Academy,County Office of Education (COE),High Schools (Public),...,134.0,0.54918,175.0,0.717213,230.0,128.0,0.556522,168.0,0.730435,1.0
3,3,2014-2015,1.0,10017.0,123968.0,Alameda,Alameda County Office of Education,Community School for Creative Education,County Office of Education (COE),Elementary Schools (Public),...,113.0,0.591623,139.0,0.727749,190.0,113.0,0.594737,139.0,0.731579,1.0
4,4,2014-2015,1.0,10017.0,124172.0,Alameda,Alameda County Office of Education,Yu Ming Charter,County Office of Education (COE),Elementary Schools (Public),...,14.0,0.054475,21.0,0.081712,257.0,14.0,0.054475,21.0,0.081712,1.0


In [52]:
sat_scores_df.head()

Unnamed: 0,index,cds,rtype,sname,dname,cname,enroll12,NumTstTakr,AvgScrRead,AvgScrMath,AvgScrWrite,NumGE1500,PctGE1500
0,0,0,X,,,,496901,210706,489.0,500.0,484.0,93334.0,44.3
1,1,1000000000000,C,,,Alameda,16978,8855,516.0,536.0,517.0,4900.0,55.34
2,2,1100170000000,D,,Alameda County Office of Education,Alameda,398,88,418.0,418.0,417.0,14.0,15.91
3,3,1100170109835,S,FAME Public Charter,Alameda County Office of Education,Alameda,62,17,503.0,546.0,505.0,9.0,52.94
4,4,1100170112607,S,Envision Academy for Arts & Technology,Alameda County Office of Education,Alameda,75,71,397.0,387.0,395.0,5.0,7.04


In [56]:
query = '''
    SELECT m.SchoolName, m."Percent()EligibleFRPM(Ages5-17)",
        s.AvgScrMath, s.AvgScrRead, s.AvgScrWrite
    FROM meals_df 
        AS m
    INNER JOIN sat_scores_df 
        AS s
    ON m.SchoolName = s.sname
'''

meal_sat_df = sqldf(query)

meal_sat_df

Unnamed: 0,SchoolName,Percent()EligibleFRPM(Ages5-17),AvgScrMath,AvgScrRead,AvgScrWrite
0,FAME Public Charter,0.656075,546.0,503.0,505.0
1,Envision Academy for Arts & Technology,0.484043,387.0,397.0,395.0
2,Aspire California College Preparatory Academy,0.730435,,,
3,Alameda Science and Technology Institute,0.299401,590.0,562.0,555.0
4,Nea Community Learning Center,0.308511,,,
...,...,...,...,...,...
1995,Lindhurst High,0.880150,450.0,428.0,423.0
1996,Lincoln (Abraham) (Alternative),0.479798,,,
1997,Marysville Charter Academy for the Arts,0.479893,494.0,501.0,484.0
1998,Marysville High,0.639588,513.0,489.0,487.0
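
For comparison, the same kind of INNER JOIN can be written directly in pandas with `pd.merge`. This is a minimal sketch on two tiny made-up DataFrames (`demo_meals` and `demo_sat` are hypothetical, as are their values), not a replacement for the query above.

```python
import pandas as pd

# Hypothetical miniature versions of the meals and SAT tables.
demo_meals = pd.DataFrame({
    "SchoolName": ["School A", "School B", "School C"],
    "PercentFRPM": [0.65, 0.48, 0.30],
})
demo_sat = pd.DataFrame({
    "sname": ["School A", "School C", "School D"],
    "AvgScrMath": [546.0, 590.0, 500.0],
})

# INNER JOIN ... ON SchoolName = sname keeps only names found in both tables.
merged = pd.merge(
    demo_meals, demo_sat,
    left_on="SchoolName", right_on="sname", how="inner",
)

# Keep just the columns we would have listed in the SELECT clause.
merged = merged[["SchoolName", "PercentFRPM", "AvgScrMath"]]
print(merged)
```

Whichever form you use, keep in mind that school names are not guaranteed to be unique, so a join on the name alone can pair up rows that belong to different schools.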
