# Subqueries in WHERE Lab

### Introduction

In this lesson, we'll practice working with subqueries by using data from Spotify.

### Loading our data

We can begin by loading our data into dataframes.

In [23]:
import pandas as pd
tracks_df = pd.read_csv('./tracks.csv')
tracks_df[:2]

Unnamed: 0,id,name,popularity,duration_ms,explicit,artists,id_artists,release_date,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature
0,35iwgR4jXetI318WEWsa1Q,Carve,6,126903,0,['Uli'],['45tIt06XoI0Iio4LBEVpls'],1922-02-22,0.645,0.445,0,-13.338,1,0.451,0.674,0.744,0.151,0.127,104.851,3
1,021ht4sdgPcrDgSk7JTbKY,Capítulo 2.16 - Banquero Anarquista,0,98200,0,['Fernando Pessoa'],['14jtPCOoNZwquk5wd9DxrY'],1922-06-01,0.695,0.263,0,-22.136,1,0.957,0.797,0.0,0.148,0.655,102.009,1


In [29]:
artist = tracks_df['artists'].str[1:-1].str.split(', ').str[0].str[1:-1]

In [30]:
artist_id = tracks_df['id_artists'].str[1:-1].str.split(', ').str[0].str[1:-1]

In [33]:
updated_tracks_df = tracks_df.assign(artist = artist, artist_id = artist_id).drop(columns = ['artists', 'id_artists'])

In [35]:
updated_tracks_df[:2]

Unnamed: 0,id,name,popularity,duration_ms,explicit,release_date,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,artist,artist_id
0,35iwgR4jXetI318WEWsa1Q,Carve,6,126903,0,1922-02-22,0.645,0.445,0,-13.338,1,0.451,0.674,0.744,0.151,0.127,104.851,3,Uli,45tIt06XoI0Iio4LBEVpls
1,021ht4sdgPcrDgSk7JTbKY,Capítulo 2.16 - Banquero Anarquista,0,98200,0,1922-06-01,0.695,0.263,0,-22.136,1,0.957,0.797,0.0,0.148,0.655,102.009,1,Fernando Pessoa,14jtPCOoNZwquk5wd9DxrY
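
The slicing-and-splitting cells above keep only the first artist of each track. As a hedged aside, that approach splits on `', '`, so it would cut off a name that itself contains a comma. The sketch below compares it with parsing the string using `ast.literal_eval`, assuming the column holds valid Python list literals. The two-row `sample_df` is hypothetical and is not taken from `tracks.csv`.

```python
import ast
import pandas as pd

# Hypothetical sample shaped like the `artists` column; not real rows from tracks.csv.
sample_df = pd.DataFrame({
    'artists': ["['Uli']", "['Earth, Wind & Fire', 'Someone Else']"],
})

# Slice-and-split approach used in the cells above.
by_slicing = sample_df['artists'].str[1:-1].str.split(', ').str[0].str[1:-1]

# Alternative: parse each string as a list literal, then take its first element.
by_parsing = sample_df['artists'].apply(lambda names: ast.literal_eval(names)[0])

print(by_slicing.tolist())
print(by_parsing.tolist())
```

Both approaches agree on the first row; they differ only when the first artist's name contains `', '`.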


In [37]:
import pandas as pd
artists_df = pd.read_csv('./artists.csv')
artists_df[:10]

Unnamed: 0,id,followers,genres,name,popularity
0,0DheY5irMjBUeLybbCUEZ2,0.0,[],Armid & Amir Zare Pashai feat. Sara Rouzbehani,0
1,0DlhY15l3wsrnlfGio2bjU,5.0,[],ปูนา ภาวิณี,0
2,0DmRESX2JknGPQyO15yxg7,0.0,[],Sadaa,0
3,0DmhnbHjm1qw6NCYPeZNgJ,0.0,[],Tra'gruda,0
4,0Dn11fWM7vHQ3rinvWEl4E,2.0,[],Ioannis Panoutsopoulos,0
5,0DotfDlYMGqkbzfBhcA5r6,7.0,[],Astral Affect,0
6,0DqP3bOCiC48L8SM9gK4W8,1.0,[],Yung Seed,0
7,0Drs3maQb99iRglyTuxizI,0.0,[],Wi'Ma,0
8,0DsPeAi1gxPPnYjgpiEGSR,0.0,[],lentboy,0
9,0DtvnTxgZ9K5YaPS5jdlQW,20.0,[],addworks,0


In [40]:
artists_df.to_csv('spotify_artists.csv', index = False)

In [41]:
updated_tracks_df.to_csv('./spotify_tracks.csv', index = False)

In [38]:
import sqlite3
conn = sqlite3.connect('spotify.db')

In [39]:
artists_df.to_sql('artists', conn, if_exists = 'replace')

In [42]:
updated_tracks_df.to_sql('tracks', conn, if_exists = 'replace')

### Viewing our data

In [44]:
pd.read_sql('select * from artists limit 2', conn)

Unnamed: 0,index,id,followers,genres,name,popularity
0,0,0DheY5irMjBUeLybbCUEZ2,0.0,[],Armid & Amir Zare Pashai feat. Sara Rouzbehani,0
1,1,0DlhY15l3wsrnlfGio2bjU,5.0,[],ปูนา ภาวิณี,0


In [43]:
pd.read_sql('select * from tracks limit 2', conn)

Unnamed: 0,index,id,name,popularity,duration_ms,explicit,release_date,danceability,energy,key,...,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,artist,artist_id
0,0,35iwgR4jXetI318WEWsa1Q,Carve,6,126903,0,1922-02-22,0.645,0.445,0,...,1,0.451,0.674,0.744,0.151,0.127,104.851,3,Uli,45tIt06XoI0Iio4LBEVpls
1,1,021ht4sdgPcrDgSk7JTbKY,Capítulo 2.16 - Banquero Anarquista,0,98200,0,1922-06-01,0.695,0.263,0,...,1,0.957,0.797,0.0,0.148,0.655,102.009,1,Fernando Pessoa,14jtPCOoNZwquk5wd9DxrY
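
The `index` column in both results above appears because `to_sql` writes the DataFrame index as a column by default. The sketch below shows the difference that `index=False` makes, using a hypothetical two-row table in an in-memory database rather than `spotify.db`.

```python
import sqlite3
import pandas as pd

# Hypothetical miniature table; the values are illustrative only.
mini_artists_df = pd.DataFrame({
    'id': ['a1', 'a2'],
    'name': ['Artist One', 'Artist Two'],
})

demo_conn = sqlite3.connect(':memory:')

# Default behaviour: the DataFrame index is written as a column named `index`.
mini_artists_df.to_sql('artists_with_index', demo_conn, if_exists='replace')

# index=False writes only the DataFrame's own columns.
mini_artists_df.to_sql('artists_no_index', demo_conn, if_exists='replace', index=False)

with_index_cols = pd.read_sql('select * from artists_with_index', demo_conn).columns.tolist()
no_index_cols = pd.read_sql('select * from artists_no_index', demo_conn).columns.tolist()
print(with_index_cols)
print(no_index_cols)
```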


### Performing our queries

Let's begin by finding the artists whose follower count is within `40000000` of the highest follower count in the table.

In [56]:
pd.read_sql('select * from artists WHERE followers > (SELECT MAX(followers) FROM artists) - 40000000', conn)

Unnamed: 0,index,id,followers,genres,name,popularity
0,126658,6qqNVTkY8uBg9cP3Jd7DAH,41792604.0,"['electropop', 'pop']",Billie Eilish,92
1,144138,6eUKZXaKkcviH0Ku9w2n3V,78900234.0,"['pop', 'uk pop']",Ed Sheeran,92
2,144481,1uNFoZAHBGtllmzznpCI3s,44606973.0,"['canadian pop', 'pop', 'post-teen pop']",Justin Bieber,100
3,144485,66CXWjxzNUsdJxJ2JdwvnR,61301006.0,"['pop', 'post-teen pop']",Ariana Grande,95
4,144488,7dGJo4pcD2V6oG8kP0tJRR,43747833.0,"['detroit hip hop', 'hip hop', 'rap']",Eminem,94
5,313508,5pKCCKE2ajJHZ9KAiaK11H,42244011.0,"['barbadian pop', 'dance pop', 'pop', 'post-te...",Rihanna,92
6,313676,3TVXtAsR1Inumwj472S9r4,54416812.0,"['canadian hip hop', 'canadian pop', 'hip hop'...",Drake,98
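
To see why the query above works, it can help to run the inner query on its own first. The sketch below does this on a hypothetical four-row `artists` table in an in-memory database, with a threshold of `40` instead of `40000000`. The subquery evaluates to a single value, and the outer query then compares each row against that value minus the threshold.

```python
import sqlite3
import pandas as pd

demo_conn = sqlite3.connect(':memory:')

# Hypothetical data; names and follower counts are made up for illustration.
pd.DataFrame({
    'name': ['A', 'B', 'C', 'D'],
    'followers': [100, 70, 55, 10],
}).to_sql('artists', demo_conn, index=False)

# Step 1: the inner query alone returns one value, the largest follower count.
top = pd.read_sql('SELECT MAX(followers) AS top FROM artists', demo_conn)['top'][0]

# Step 2: the same inner query embedded in the WHERE clause of the outer query.
close_to_top = pd.read_sql("""
SELECT name, followers FROM artists
WHERE followers > (SELECT MAX(followers) FROM artists) - 40
ORDER BY followers DESC
""", demo_conn)

print(top)
print(close_to_top)
```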


A subquery can also look up a value from a specific row. Find all of the artists who have the same popularity as Billie Eilish, and use a subquery to do so.

In [58]:
pd.read_sql("""select * from artists 
WHERE popularity = (SELECT popularity FROM artists WHERE name = 'Billie Eilish')
""", conn)

Unnamed: 0,index,id,followers,genres,name,popularity
0,112978,0eDvMgVFoNV3TpwtrVCoTj,5076597.0,['brooklyn drill'],Pop Smoke,92
1,126658,6qqNVTkY8uBg9cP3Jd7DAH,41792604.0,"['electropop', 'pop']",Billie Eilish,92
2,144138,6eUKZXaKkcviH0Ku9w2n3V,78900234.0,"['pop', 'uk pop']",Ed Sheeran,92
3,144139,5K4W6rqBFWDnAN6FQUkS6x,13713751.0,"['chicago rap', 'rap']",Kanye West,92
4,144140,15UsOTVnJzReFVN1VCnxy4,26747778.0,"['emo rap', 'miami hip hop']",XXXTENTACION,92
5,144141,2R21vXR83lH98kGeO99Y66,17083706.0,"['latin', 'reggaeton', 'reggaeton flow', 'trap...",Anuel AA,92
6,144491,6LuN9FCkKOj5PcnpouEgny,13728298.0,"['alternative r&b', 'pop']",Khalid,92
7,313508,5pKCCKE2ajJHZ9KAiaK11H,42244011.0,"['barbadian pop', 'dance pop', 'pop', 'post-te...",Rihanna,92
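
Comparing with `=` assumes the subquery returns exactly one value. As a hedged sketch on hypothetical data: if no row matches the name, the subquery yields `NULL` and the outer query returns nothing; if several rows share the name, SQLite evaluates the scalar subquery using only the first row it returns, so `IN` is the safer choice when every match should count.

```python
import sqlite3
import pandas as pd

demo_conn = sqlite3.connect(':memory:')

# Hypothetical data: two different artists happen to share the name 'B'.
pd.DataFrame({
    'name': ['A', 'B', 'C', 'B'],
    'popularity': [90, 80, 80, 70],
}).to_sql('artists', demo_conn, index=False)

# No artist is named 'Z', so the subquery is NULL and `popularity = NULL` matches no row.
no_match = pd.read_sql("""
SELECT name, popularity FROM artists
WHERE popularity = (SELECT popularity FROM artists WHERE name = 'Z')
""", demo_conn)

# IN compares against every value the subquery returns (here 80 and 70).
all_matches = pd.read_sql("""
SELECT name, popularity FROM artists
WHERE popularity IN (SELECT popularity FROM artists WHERE name = 'B')
ORDER BY popularity DESC, name
""", demo_conn)

print(len(no_match))
print(all_matches)
```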


### Subqueries across tables

We can also use subqueries in the `WHERE` clause across multiple tables: the outer query reads from one table while the subquery reads from another.

For example, let's say that we want to find all of the artists who had a song with a popularity of 10 released after 2020-12-31. First, we write a query that finds all of the tracks with a popularity of 10 released after 2020-12-31.

In [71]:
top_songs = pd.read_sql("""select name,  release_date, popularity, artist,
artist_id from tracks where popularity = 10 AND release_date > '2020-12-31'
""", conn)
top_songs[:10]

Unnamed: 0,name,release_date,popularity,artist,artist_id
0,The Message - 2021 Rawmix,2021-03-08,10,Gaston Zani,2uU0gIFQJd8zSyAaxhhoL1
1,"Small, Bent and Ugly (2021 Remaster)",2021-01-08,10,Crucial Dudes,5ccLp1eIj7ufwLUpVLmSFl
2,Twenty Years,2021-02-19,10,Barock Project,1PKB3oDVcZBThmJCYyMOQH
3,DJ No Pare,2021-04-02,10,Farruko,329e4yvIujISKGKz1BZZbO
4,Take You Dancing,2021-04-02,10,Jason Derulo,07YZf4WDAMNwqr4jfgOZ8y
5,Take You Dancing,2021-04-02,10,Jason Derulo,07YZf4WDAMNwqr4jfgOZ8y
6,Kiss the Sky,2021-04-02,10,Jason Derulo,07YZf4WDAMNwqr4jfgOZ8y
7,Brown Sugar,2021-01-22,10,The Rolling Stones,22bE4uQ6baNwSHPVcDxLCe
8,Anyone,2021-03-31,10,Justin Bieber,1uNFoZAHBGtllmzznpCI3s
9,Azul,2021-04-09,10,J Balvin,1vyhD5VmyZ7KMfW5gqLgo5


Then, let's return the number of followers of every artist who had a song with a popularity of 10 after `'2020-12-31'`.

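We can do so with the `IN` operator, which keeps each artist whose `id` appears among the `artist_id` values returned by the subquery.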

In [75]:
artists_with_top_songs = pd.read_sql("""select name, followers from artists WHERE id in (SELECT artist_id from tracks
WHERE popularity = 10 AND release_date > '2020-12-31')
""", conn)
artists_with_top_songs[:10]

Unnamed: 0,name,followers
0,Lil Quil,2043.0
1,Gaston Zani,1020.0
2,015B,21857.0
3,DaBaby,6485079.0
4,Selena Gomez,26692413.0
5,The Rolling Stones,10060904.0
6,Daddy Yankee,22831280.0
7,Jason Derulo,9223795.0
8,Justin Bieber,44606973.0
9,J Balvin,27286822.0
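
Notice that Jason Derulo appears three times in `top_songs` but only once in `artists_with_top_songs`. The sketch below illustrates that behaviour on hypothetical in-memory tables: `IN` only filters the rows of `artists`, so each artist is returned at most once, while a plain join returns one row per matching track.

```python
import sqlite3
import pandas as pd

demo_conn = sqlite3.connect(':memory:')

# Hypothetical tables; ids, names, and numbers are made up for illustration.
pd.DataFrame({
    'id': ['a1', 'a2', 'a3'],
    'name': ['Artist One', 'Artist Two', 'Artist Three'],
    'followers': [500, 300, 100],
}).to_sql('artists', demo_conn, index=False)

pd.DataFrame({
    'name': ['t1', 't2', 't3', 't4'],
    'artist_id': ['a1', 'a1', 'a2', 'a3'],
    'popularity': [10, 10, 10, 5],
}).to_sql('tracks', demo_conn, index=False)

# IN subquery: one row per artist, however many of their tracks match.
in_result = pd.read_sql("""
SELECT name, followers FROM artists
WHERE id IN (SELECT artist_id FROM tracks WHERE popularity = 10)
ORDER BY followers DESC
""", demo_conn)

# Join: one row per matching track, so Artist One appears twice.
join_result = pd.read_sql("""
SELECT artists.name, artists.followers FROM artists
JOIN tracks ON tracks.artist_id = artists.id
WHERE tracks.popularity = 10
ORDER BY artists.followers DESC
""", demo_conn)

print(in_result)
print(join_result)
```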


Now use a subquery to find the name and number of followers of artists who had a song with a tempo over 200 that was released after `2020-12-31`. Order the results by the artist's number of followers, from most to fewest.

In [77]:
artists_with_fast_songs = pd.read_sql("""select name, followers from artists WHERE id in (SELECT artist_id from tracks
WHERE tempo > 200 AND release_date > '2020-12-31') ORDER BY followers desc
""", conn)
artists_with_fast_songs[:10]

Unnamed: 0,name,followers
0,Taylor Swift,38869193.0
1,J Balvin,27286822.0
2,Demi Lovato,19543911.0
3,Lana Del Rey,12750166.0
4,Morgan Wallen,1678738.0
5,G.E.M.,1453780.0
6,Frenna,457688.0
7,Tayc,366844.0
8,sanah,354762.0
9,Ardian Bujupi,153275.0


### Resources

[Subqueries in Select](https://www.essentialsql.com/get-ready-to-learn-sql-server-20-using-subqueries-in-the-select-statement/)

[College Tuition Data](https://www.kaggle.com/jessemostipak/college-tuition-diversity-and-pay)

[In vs join vs exists](https://explainextended.com/2009/06/16/in-vs-join-vs-exists/)