# Introduction to SQL with Pandas

### Setup

First we import the Pandas library and SQLite

In [1]:
import pandas as pd
import sqlite3 as sql

Now we read the data on recent college graduates from Nate Silver's fivethirtyeight GitHub repository

In [2]:
df = pd.read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/recent-grads.csv")
df.head()

Unnamed: 0,Rank,Major_code,Major,Total,Men,Women,Major_category,ShareWomen,Sample_size,Employed,...,Part_time,Full_time_year_round,Unemployed,Unemployment_rate,Median,P25th,P75th,College_jobs,Non_college_jobs,Low_wage_jobs
0,1,2419,PETROLEUM ENGINEERING,2339.0,2057.0,282.0,Engineering,0.120564,36,1976,...,270,1207,37,0.018381,110000,95000,125000,1534,364,193
1,2,2416,MINING AND MINERAL ENGINEERING,756.0,679.0,77.0,Engineering,0.101852,7,640,...,170,388,85,0.117241,75000,55000,90000,350,257,50
2,3,2415,METALLURGICAL ENGINEERING,856.0,725.0,131.0,Engineering,0.153037,3,648,...,133,340,16,0.024096,73000,50000,105000,456,176,0
3,4,2417,NAVAL ARCHITECTURE AND MARINE ENGINEERING,1258.0,1123.0,135.0,Engineering,0.107313,16,758,...,150,692,40,0.050125,70000,43000,80000,529,102,0
4,5,2405,CHEMICAL ENGINEERING,32260.0,21239.0,11021.0,Engineering,0.341631,289,25694,...,5180,16697,1672,0.061098,65000,50000,75000,18314,4440,972


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 173 entries, 0 to 172
Data columns (total 21 columns):
Rank                    173 non-null int64
Major_code              173 non-null int64
Major                   173 non-null object
Total                   172 non-null float64
Men                     172 non-null float64
Women                   172 non-null float64
Major_category          173 non-null object
ShareWomen              172 non-null float64
Sample_size             173 non-null int64
Employed                173 non-null int64
Full_time               173 non-null int64
Part_time               173 non-null int64
Full_time_year_round    173 non-null int64
Unemployed              173 non-null int64
Unemployment_rate       173 non-null float64
Median                  173 non-null int64
P25th                   173 non-null int64
P75th                   173 non-null int64
College_jobs            173 non-null int64
Non_college_jobs        173 non-null int64
Low_wage_jobs           173 non-null int64

Now we create our database from the DataFrame we just downloaded. First we open a connection to the database

In [4]:
con = sql.connect("clase3-2.db")

and create a cursor that lets us manipulate this database with SQL

In [5]:
cur = con.cursor()

The next step is to convert our DataFrame into a table in the database

In [6]:
df.to_sql?

In [8]:
df.to_sql('recent_grads',con)

Now we can run SQL statements in two ways: directly against the database with cur.execute, or through Pandas with pd.read_sql_query, which is what the rest of this notebook uses
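
As a minimal sketch of the cur.execute route, the cell below uses an in-memory database and a tiny stand-in table (two columns, three rows copied from the df.head() output above) instead of the real clase3-2.db file. The reduced schema is an assumption made only to keep the example self-contained.

```python
import sqlite3

# In-memory database, so this sketch does not touch clase3-2.db.
con = sqlite3.connect(":memory:")
cur = con.cursor()

# Tiny stand-in for recent_grads: three rows taken from df.head() above.
cur.execute("CREATE TABLE recent_grads (Major TEXT, Median INTEGER)")
cur.executemany(
    "INSERT INTO recent_grads VALUES (?, ?)",
    [
        ("PETROLEUM ENGINEERING", 110000),
        ("MINING AND MINERAL ENGINEERING", 75000),
        ("METALLURGICAL ENGINEERING", 73000),
    ],
)
con.commit()

# execute() runs the statement; fetchall() returns the rows as a list of tuples.
cur.execute(
    "SELECT Major, Median FROM recent_grads "
    "WHERE Median > 74000 ORDER BY Median DESC;"
)
rows = cur.fetchall()
print(rows)

con.close()
```

Unlike pd.read_sql_query, the cursor returns plain tuples without column labels, which is one reason the DataFrame route is more convenient for exploration.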

### Continued...

We can rename computed columns with the AS keyword as follows

In [45]:
pd.read_sql_query("""
SELECT COUNT(*) AS num_students FROM recent_grads;
""", con)

Unnamed: 0,num_students
0,173


SQLite even lets us omit the AS keyword and write the alias right after the expression; quoting the alias allows it to contain spaces

In [27]:
pd.read_sql_query("""
SELECT COUNT(*) "num students" FROM recent_grads;
""", con)

Unnamed: 0,num students
0,173
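
To check how each spelling of the alias ends up as a column name, here is a small sketch against an in-memory table. The one-column table is a stand-in, not the real recent_grads, and cursor.description is used only to read back the name SQLite assigns to the result column.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Stand-in table with two rows; only the row count matters here.
cur.execute("CREATE TABLE recent_grads (Major TEXT)")
cur.executemany(
    "INSERT INTO recent_grads VALUES (?)",
    [("PETROLEUM ENGINEERING",), ("CHEMICAL ENGINEERING",)],
)

queries = [
    'SELECT COUNT(*) AS num_students FROM recent_grads;',   # explicit AS
    'SELECT COUNT(*) num_students FROM recent_grads;',      # AS omitted
    'SELECT COUNT(*) "num students" FROM recent_grads;',    # quoted alias with a space
]

column_names = []
for query in queries:
    cur.execute(query)
    # description holds one 7-tuple per result column; item 0 is its name.
    column_names.append(cur.description[0][0])
    cur.fetchall()

print(column_names)
con.close()
```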


#### 13. Write a query that returns the number of rows as "Number of students" and the maximum value of Unemployment_rate as "Maximum unemployment rate"

If we want the unique values of a column, we can use DISTINCT

In [50]:
pd.read_sql_query("""
SELECT DISTINCT Major_category FROM recent_grads;
""", con)

Unnamed: 0,Major_category
0,Engineering
1,Business
2,Physical Sciences
3,Law & Public Policy
4,Computers & Mathematics
5,Agriculture & Natural Resources
6,Industrial Arts & Consumer Services
7,Arts
8,Health
9,Social Science


If we pass more than one column, DISTINCT returns the unique combinations of those columns. Example

In [52]:
pd.read_sql_query("""
SELECT DISTINCT Major, Major_category FROM recent_grads
limit 5;
""", con)

Unnamed: 0,Major,Major_category
0,PETROLEUM ENGINEERING,Engineering
1,MINING AND MINERAL ENGINEERING,Engineering
2,METALLURGICAL ENGINEERING,Engineering
3,NAVAL ARCHITECTURE AND MARINE ENGINEERING,Engineering
4,CHEMICAL ENGINEERING,Engineering


We can even count the number of unique values in a column

In [53]:
pd.read_sql_query("""
SELECT COUNT(DISTINCT(Major_category)) unique_major_categories FROM recent_grads;
""", con)

Unnamed: 0,unique_major_categories
0,16
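
The three uses of DISTINCT can be compared side by side on a small stand-in table. The five rows below are copied from outputs shown in this notebook; the two-column schema is an assumption to keep the sketch short.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

cur.execute("CREATE TABLE recent_grads (Major TEXT, Major_category TEXT)")
cur.executemany(
    "INSERT INTO recent_grads VALUES (?, ?)",
    [
        ("PETROLEUM ENGINEERING", "Engineering"),
        ("CHEMICAL ENGINEERING", "Engineering"),
        ("ACTUARIAL SCIENCE", "Business"),
        ("ACCOUNTING", "Business"),
        ("ASTRONOMY AND ASTROPHYSICS", "Physical Sciences"),
    ],
)

# Unique values of a single column.
cur.execute(
    "SELECT DISTINCT Major_category FROM recent_grads ORDER BY Major_category;"
)
categories = [row[0] for row in cur.fetchall()]

# Unique combinations of two columns.
cur.execute("SELECT DISTINCT Major, Major_category FROM recent_grads;")
combinations = cur.fetchall()

# Number of unique values, computed inside SQL.
cur.execute("SELECT COUNT(DISTINCT Major_category) FROM recent_grads;")
n_categories = cur.fetchone()[0]

print(categories, len(combinations), n_categories)
con.close()
```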


#### 14. Write a query that returns the total number of unique values in the columns 'Major', 'Major_category' and 'Major_code'. Give each count an alias using a string

Another interesting question we can now explore: which major has the largest spread between the salaries at the 75th and 25th percentiles? The query below computes that spread for the first 10 rows

In [57]:
pd.read_sql_query("""
SELECT P75th - P25th quartile_spread, Major FROM recent_grads
LIMIT 10;
""", con)

Unnamed: 0,quartile_spread,Major
0,30000,PETROLEUM ENGINEERING
1,35000,MINING AND MINERAL ENGINEERING
2,55000,METALLURGICAL ENGINEERING
3,37000,NAVAL ARCHITECTURE AND MARINE ENGINEERING
4,25000,CHEMICAL ENGINEERING
5,52000,NUCLEAR ENGINEERING
6,19000,ACTUARIAL SCIENCE
7,77500,ASTRONOMY AND ASTROPHYSICS
8,22000,MECHANICAL ENGINEERING
9,27000,ELECTRICAL ENGINEERING
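
The query above only lists the spread. To actually pick the largest one we can sort by the alias, as in this sketch. It runs on six rows copied from the outputs in this notebook, so the answer it prints holds for that subset only, not necessarily for the full table.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

cur.execute("CREATE TABLE recent_grads (Major TEXT, P25th INTEGER, P75th INTEGER)")
cur.executemany(
    "INSERT INTO recent_grads VALUES (?, ?, ?)",
    [
        ("PETROLEUM ENGINEERING", 95000, 125000),
        ("MINING AND MINERAL ENGINEERING", 55000, 90000),
        ("METALLURGICAL ENGINEERING", 50000, 105000),
        ("NAVAL ARCHITECTURE AND MARINE ENGINEERING", 43000, 80000),
        ("CHEMICAL ENGINEERING", 50000, 75000),
        ("ASTRONOMY AND ASTROPHYSICS", 31500, 109000),
    ],
)

# Sort by the computed column (largest first) and keep the top row.
cur.execute(
    """
    SELECT Major, P75th - P25th AS quartile_spread
    FROM recent_grads
    ORDER BY quartile_spread DESC
    LIMIT 1;
    """
)
widest = cur.fetchone()
print(widest)
con.close()
```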


#### 15. Write a query that shows the difference between the 25th and 75th salary percentiles for all majors. Return the Major and Major_category columns first, using their default column names. Then compute the difference between the 25th and 75th percentiles using the alias quartile_spread. Finally, sort the results in ascending order and return only the first 20 results

Sometimes we want to keep only a subset of the groups produced by a GROUP BY. For these cases we can use HAVING

In [59]:
pd.read_sql_query("""
SELECT Major_category, AVG(Employed) / AVG(Total) AS share_employed 
FROM recent_grads 
GROUP BY Major_category 
HAVING share_employed > .8;
""", con)

Unnamed: 0,Major_category,share_employed
0,Arts,0.806748
1,Business,0.835966
2,Communications & Journalism,0.842229
3,Education,0.85819
4,Health,0.803374
5,Industrial Arts & Consumer Services,0.82267
6,Law & Public Policy,0.808399


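Notice that SQLite lets us reuse column aliases in later parts of the query, including HAVING and WHERE. This is a SQLite convenience that not every SQL engine accepts, and an alias for an aggregate such as AVG can only be filtered in HAVING, not in WHERE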
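
A minimal sketch of the GROUP BY plus HAVING pattern on made-up numbers. The categories "Category A" and "Category B" and their Employed/Total values are hypothetical, chosen so the shares are easy to verify by hand (0.85 and 0.60).

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Hypothetical toy data, not taken from the real dataset.
cur.execute(
    "CREATE TABLE recent_grads (Major_category TEXT, Employed INTEGER, Total REAL)"
)
cur.executemany(
    "INSERT INTO recent_grads VALUES (?, ?, ?)",
    [
        ("Category A", 90, 100.0),
        ("Category A", 80, 100.0),
        ("Category B", 50, 100.0),
        ("Category B", 70, 100.0),
    ],
)

# GROUP BY builds one row per category; HAVING then filters those groups
# using the aggregate alias share_employed.
cur.execute(
    """
    SELECT Major_category, AVG(Employed) / AVG(Total) AS share_employed
    FROM recent_grads
    GROUP BY Major_category
    HAVING share_employed > .8;
    """
)
high_share = cur.fetchall()
print(high_share)
con.close()
```

Only "Category A" survives the filter, because the condition is evaluated once per group after the averages have been computed.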

#### 16. Find all major categories where the share of graduates in low-wage jobs is greater than 10%. Use SELECT to get Major_category and AVG(Low_wage_jobs)/AVG(Total) AS share_low_wage, use GROUP BY to group by Major_category, and use HAVING to restrict the selection to rows where share_low_wage is greater than 10%

We can use the ROUND function to round values. Example

In [62]:
pd.read_sql_query("""
SELECT ROUND(ShareWomen, 4), Major_category FROM recent_grads
LIMIT 10;
""", con)

Unnamed: 0,"ROUND(ShareWomen, 4)",Major_category
0,0.1206,Engineering
1,0.1019,Engineering
2,0.153,Engineering
3,0.1073,Engineering
4,0.3416,Engineering
5,0.145,Engineering
6,0.4414,Business
7,0.5357,Physical Sciences
8,0.1196,Engineering
9,0.1965,Engineering
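
Because ROUND returns a number rather than formatted text, trailing zeros are not kept, which is why 0.153037 shows up as 0.153 above. The sketch below reproduces that on two ShareWomen values from the output and also gives the rounded column an alias; the alias share_women_rounded is a hypothetical name.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

cur.execute("CREATE TABLE recent_grads (Major TEXT, ShareWomen REAL)")
cur.executemany(
    "INSERT INTO recent_grads VALUES (?, ?)",
    [
        ("PETROLEUM ENGINEERING", 0.120564),
        ("METALLURGICAL ENGINEERING", 0.153037),
    ],
)

# Alias the rounded column so the result is not labelled "ROUND(ShareWomen, 4)".
cur.execute(
    """
    SELECT Major, ROUND(ShareWomen, 4) AS share_women_rounded
    FROM recent_grads
    ORDER BY Major DESC;
    """
)
column_names = [col[0] for col in cur.description]
rounded = cur.fetchall()
print(column_names, rounded)
con.close()
```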


#### 17. Use SELECT to select Major_category and compute AVG(College_jobs)/AVG(Total) with the alias "share_degree_jobs", using the ROUND function to round it to 3 decimal places. Also group by Major_category with GROUP BY and keep only the rows where "share_degree_jobs" is less than 30%

We can inspect some of the metadata of our table by running the following command

In [64]:
pd.read_sql_query("""
PRAGMA TABLE_INFO(recent_grads);
""", con)

Unnamed: 0,cid,name,type,notnull,dflt_value,pk
0,0,index,INTEGER,0,,0
1,1,Rank,INTEGER,0,,0
2,2,Major_code,INTEGER,0,,0
3,3,Major,TEXT,0,,0
4,4,Total,REAL,0,,0
5,5,Men,REAL,0,,0
6,6,Women,REAL,0,,0
7,7,Major_category,TEXT,0,,0
8,8,ShareWomen,REAL,0,,0
9,9,Sample_size,INTEGER,0,,0
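
The same PRAGMA can be read through a cursor, for example to build a column-name to type lookup. This sketch uses a three-column stand-in table; each row returned by table_info has the fields cid, name, type, notnull, dflt_value and pk, as in the output above.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Stand-in table with three of the columns listed above.
cur.execute(
    "CREATE TABLE recent_grads (Major TEXT, Total REAL, Sample_size INTEGER)"
)

cur.execute("PRAGMA table_info(recent_grads);")
# Each row is (cid, name, type, notnull, dflt_value, pk).
column_types = {name: col_type for _, name, col_type, _, _, _ in cur.fetchall()}
print(column_types)
con.close()
```

Knowing which columns are INTEGER and which are REAL matters for arithmetic between columns.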


Note that if we compute a ratio between two INTEGER values, SQLite performs integer division and discards the fractional part, keeping only the integer part

In [73]:
pd.read_sql_query("""
SELECT Full_time, Employed, Full_time/Employed FROM recent_grads
LIMIT 5;
""", con)

Unnamed: 0,Full_time,Employed,Full_time/Employed
0,1849,1976,0
1,556,640,0
2,558,648,0
3,1069,758,1
4,23170,25694,0


To get the correct result we have to use CAST() to change the data type in our operation

In [77]:
pd.read_sql_query("""
SELECT Full_time, Employed, CAST(Full_time AS Float)/CAST(Employed AS Float) "Share_full_time" FROM recent_grads
LIMIT 5;
""", con)

Unnamed: 0,Full_time,Employed,Share_full_time
0,1849,1976,0.935729
1,556,640,0.86875
2,558,648,0.861111
3,1069,758,1.41029
4,23170,25694,0.901767
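
The difference between integer and floating-point division can be checked on literal values, without any table. The sketch uses the first row above (1849 and 1976). Multiplying by 1.0 is shown as a common alternative to CAST, not something this notebook uses elsewhere.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

cur.execute(
    """
    SELECT 1849 / 1976,                 -- INTEGER / INTEGER: fractional part dropped
           CAST(1849 AS FLOAT) / 1976,  -- one REAL operand is enough
           1849 * 1.0 / 1976;           -- alternative: force REAL by multiplying by 1.0
    """
)
int_div, cast_div, mult_div = cur.fetchone()
print(int_div, cast_div, mult_div)
con.close()
```

Casting a single operand is sufficient, since SQLite switches to floating-point arithmetic as soon as either side of the division is REAL.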


#### 18. Write a query that selects the Major_category column and computes the proportion of full-time employees as "Share_full_time", groups by Major_category, and sorts the results by "Share_full_time" in ascending order

Now let's try to answer the question: which rows have a ShareWomen value above the column average?

If we write

In [28]:
pd.read_sql_query("""
SELECT * FROM recent_grads
WHERE ShareWomen > AVG(ShareWomen);
""", con)

DatabaseError: Execution failed on sql '
SELECT * FROM recent_grads
WHERE ShareWomen > AVG(ShareWomen);
': misuse of aggregate function AVG()

we get an error, because an aggregate function such as AVG() cannot be used directly inside WHERE

The way to write this kind of expression is with **subqueries**, which must always be enclosed in parentheses: (subquery)

We can get the desired result by writing

In [81]:
pd.read_sql_query("""
SELECT * FROM recent_grads
WHERE ShareWomen > (SELECT AVG(ShareWomen) FROM recent_grads);
""", con)

Unnamed: 0,index,Rank,Major_code,Major,Total,Men,Women,Major_category,ShareWomen,Sample_size,...,Part_time,Full_time_year_round,Unemployed,Unemployment_rate,Median,P25th,P75th,College_jobs,Non_college_jobs,Low_wage_jobs
0,7,8,5001,ASTRONOMY AND ASTROPHYSICS,1792.0,832.0,960.0,Physical Sciences,0.535714,10,...,553,827,33,0.021167,62000,31500,109000,972,500,220
1,29,30,5402,PUBLIC POLICY,5978.0,2639.0,3339.0,Law & Public Policy,0.558548,55,...,1306,2776,670,0.128426,50000,35000,70000,1550,1871,340
2,34,35,6107,NURSING,209394.0,21773.0,187621.0,Health,0.896019,2554,...,40818,122817,8497,0.044863,48000,39000,58000,151643,26146,6193
3,39,40,5102,"NUCLEAR, INDUSTRIAL RADIOLOGY, AND BIOLOGICAL ...",2116.0,528.0,1588.0,Physical Sciences,0.750473,31,...,579,1115,137,0.071540,46000,38000,53000,162,1475,124
4,40,41,6201,ACCOUNTING,198633.0,94519.0,104114.0,Business,0.524153,2042,...,27693,123169,12411,0.069749,45000,34000,56000,11417,39323,10886
5,44,45,6105,MEDICAL TECHNOLOGIES TECHNICIANS,15914.0,3916.0,11998.0,Health,0.753927,190,...,2665,9005,505,0.036983,45000,36000,50000,5546,7176,1002
6,46,47,3702,STATISTICS AND DECISION SCIENCE,6251.0,2960.0,3291.0,Computers & Mathematics,0.526476,37,...,1840,2151,401,0.086274,45000,26700,60000,2298,1200,343
7,48,49,3607,PHARMACOLOGY,1762.0,515.0,1247.0,Biology & Life Science,0.707719,3,...,532,565,107,0.085532,45000,40000,45000,603,478,93
8,49,50,5006,OCEANOGRAPHY,2418.0,752.0,1666.0,Physical Sciences,0.688999,36,...,379,1595,99,0.056995,44700,23000,50000,459,996,186
9,51,52,6104,MEDICAL ASSISTING SERVICES,11123.0,803.0,10320.0,Health,0.927807,67,...,4107,4290,407,0.042507,42000,30000,65000,2091,6948,1270
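
Both behaviors, the failing aggregate in WHERE and the working subquery, can be reproduced with plain sqlite3 on a six-row stand-in table (ShareWomen values copied from outputs in this notebook). The exact wording of the error message may vary between SQLite versions, so the sketch only checks that an error is raised.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

cur.execute("CREATE TABLE recent_grads (Major TEXT, ShareWomen REAL)")
cur.executemany(
    "INSERT INTO recent_grads VALUES (?, ?)",
    [
        ("PETROLEUM ENGINEERING", 0.120564),
        ("MINING AND MINERAL ENGINEERING", 0.101852),
        ("METALLURGICAL ENGINEERING", 0.153037),
        ("NAVAL ARCHITECTURE AND MARINE ENGINEERING", 0.107313),
        ("CHEMICAL ENGINEERING", 0.341631),
        ("ASTRONOMY AND ASTROPHYSICS", 0.535714),
    ],
)

# An aggregate directly inside WHERE is rejected by SQLite.
try:
    cur.execute("SELECT * FROM recent_grads WHERE ShareWomen > AVG(ShareWomen);")
    aggregate_in_where_failed = False
except sqlite3.Error as exc:
    aggregate_in_where_failed = True
    print("rejected:", exc)

# The parenthesized subquery is evaluated to a single value first,
# and each row is then compared against that value.
cur.execute(
    """
    SELECT Major, ShareWomen FROM recent_grads
    WHERE ShareWomen > (SELECT AVG(ShareWomen) FROM recent_grads)
    ORDER BY ShareWomen DESC;
    """
)
above_average = cur.fetchall()
print(above_average)
con.close()
```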


#### 19. Write a query that returns the majors whose Unemployment_rate is below the average. Your selection must contain only the Major and Unemployment_rate columns, and the results must be sorted in ascending order by Unemployment_rate

The IN operator in SQL lets a WHERE clause match against a set of values. For example, rows whose category is either 'Business' or 'Engineering'

In [83]:
pd.read_sql_query("""
SELECT Major, Major_category FROM recent_grads
WHERE Major_category IN ('Business', 'Engineering')
LIMIT 7;
""", con)

Unnamed: 0,Major,Major_category
0,PETROLEUM ENGINEERING,Engineering
1,MINING AND MINERAL ENGINEERING,Engineering
2,METALLURGICAL ENGINEERING,Engineering
3,NAVAL ARCHITECTURE AND MARINE ENGINEERING,Engineering
4,CHEMICAL ENGINEERING,Engineering
5,NUCLEAR ENGINEERING,Engineering
6,ACTUARIAL SCIENCE,Business


Likewise, we can select the major categories with the most graduates

In [84]:
pd.read_sql_query("""
SELECT Major_category FROM recent_grads
GROUP BY Major_category
ORDER BY SUM(Total) DESC
LIMIT 5;
""", con)

Unnamed: 0,Major_category
0,Business
1,Humanities & Liberal Arts
2,Education
3,Engineering
4,Social Science


Combining the two previous queries, we can select the majors whose Major_category appears in this list

In [86]:
pd.read_sql_query("""
SELECT Major, Major_category FROM recent_grads
WHERE Major_category IN 
(SELECT Major_category FROM recent_grads
GROUP BY Major_category
ORDER BY SUM(Total) DESC
LIMIT 5);
""", con)

Unnamed: 0,Major,Major_category
0,PETROLEUM ENGINEERING,Engineering
1,MINING AND MINERAL ENGINEERING,Engineering
2,METALLURGICAL ENGINEERING,Engineering
3,NAVAL ARCHITECTURE AND MARINE ENGINEERING,Engineering
4,CHEMICAL ENGINEERING,Engineering
5,NUCLEAR ENGINEERING,Engineering
6,ACTUARIAL SCIENCE,Business
7,MECHANICAL ENGINEERING,Engineering
8,ELECTRICAL ENGINEERING,Engineering
9,COMPUTER ENGINEERING,Engineering
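
A reduced version of the same idea, as a sketch: six rows copied from outputs in this notebook, and a subquery that keeps the top 2 categories by total graduates instead of the top 5 (an adjustment made only because the stand-in table is so small).

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

cur.execute("CREATE TABLE recent_grads (Major TEXT, Major_category TEXT, Total REAL)")
cur.executemany(
    "INSERT INTO recent_grads VALUES (?, ?, ?)",
    [
        ("PETROLEUM ENGINEERING", "Engineering", 2339.0),
        ("CHEMICAL ENGINEERING", "Engineering", 32260.0),
        ("ASTRONOMY AND ASTROPHYSICS", "Physical Sciences", 1792.0),
        ("NURSING", "Health", 209394.0),
        ("ACCOUNTING", "Business", 198633.0),
        ("PUBLIC POLICY", "Law & Public Policy", 5978.0),
    ],
)

# Inner query: the 2 categories with the most graduates in this subset.
# Outer query: every major that belongs to one of those categories.
cur.execute(
    """
    SELECT Major, Major_category FROM recent_grads
    WHERE Major_category IN
        (SELECT Major_category FROM recent_grads
         GROUP BY Major_category
         ORDER BY SUM(Total) DESC
         LIMIT 2)
    ORDER BY Major;
    """
)
majors_in_top_categories = cur.fetchall()
print(majors_in_top_categories)
con.close()
```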


Now let's write a query that returns the average ratio of Sample_size to Total (avg_ratio) across all majors

In [87]:
pd.read_sql_query("""
SELECT AVG(CAST(Sample_size as float)/CAST(Total as Float)) avg_ratio FROM recent_grads
""", con)

Unnamed: 0,avg_ratio
0,0.009091


#### 20. Write a query that selects the columns Major, Major_category and the computed column ratio, and that keeps only the rows where ratio is greater than avg_ratio. Tip: use the subquery above

In [89]:
con.close()