# COURSE 6: Database and SQL for DS

## Prerequisites

TODO: Perform API calls using the sodapy library to interact with the Socrata API (*API docs at: https://dev.socrata.com/foundry/data.sfgov.org/yitu-d5am*)

TODO: Create and store the Film Locations in San Francisco data in a db with sqlite3 from datasette

1. Install datasette
2. Import bs4, requests 

In [1]:
# !pip install datasette 
# !pip install datasette requests
# !pip install sodapy

In [2]:
import bs4 as bs 
import datasette 
import requests
import sqlite3

In [4]:
import pandas as pd
from sodapy import Socrata

In [60]:
def fetch_data(endpoint, limit=1000):
    """Page through a Socrata dataset and return all records as a DataFrame."""
    all_response = []
    offset = 0
    # None means no app token, so requests are unauthenticated
    client = Socrata("data.sfgov.org", None)
    while True:
        # Request one page of up to `limit` records, starting at `offset`
        try:
            response = client.get(endpoint, limit=limit, offset=offset)
        except Exception as e:
            print(f'Failed to retrieve data: Reason: {e}')
            break

        # An empty page means every record has been fetched
        if not response:
            print(f'No data left to retrieve after offset: {offset}')
            break

        # Append the page to all_response and move on to the next page
        all_response.extend(response)
        offset += limit

    client.close()
    return pd.DataFrame.from_records(all_response)

endpoint = "yitu-d5am"

# Fetch the data
df = fetch_data(endpoint)

No data left to retrieve after offset: 3000
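
The offset-based paging loop can also be checked without touching the network. The following is a minimal sketch, not the notebook's actual code path: `fetch_all`, `fake_get`, and `FAKE_ROWS` are hypothetical names, and `fake_get` only stands in for `client.get` by slicing a local list.

```python
from typing import Callable, Dict, List


def fetch_all(get_page: Callable[[int, int], List[Dict]], limit: int = 1000) -> List[Dict]:
    """Collect every record by requesting pages until an empty page comes back."""
    records: List[Dict] = []
    offset = 0
    while True:
        page = get_page(limit, offset)
        if not page:  # an empty page means the data is exhausted
            break
        records.extend(page)
        offset += limit
    return records


# Hypothetical stand-in for client.get: 2,084 placeholder rows served in slices.
FAKE_ROWS = [{"id": i} for i in range(2084)]


def fake_get(limit: int, offset: int) -> List[Dict]:
    return FAKE_ROWS[offset:offset + limit]


rows = fetch_all(fake_get, limit=1000)
print(len(rows))  # 2084
```

With 2,084 rows and a page size of 1,000, the loop requests offsets 0, 1000, and 2000 before the empty page at offset 3000 stops it, which matches the message printed above.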


In [64]:
# Create a SQLite database
conn = sqlite3.connect('FilmLocations.db')
cursor = conn.cursor()


In [65]:

# Create a table
cursor.execute('''
    CREATE TABLE IF NOT EXISTS FilmLocations (
        title TEXT,
        release_year TEXT,
        locations TEXT,
        fun_facts TEXT,
        production_company TEXT,
        distributor TEXT,
        director TEXT,
        writer TEXT,
        actor_1 TEXT,
        actor_2 TEXT,
        actor_3 TEXT
    )
''')

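# Write the DataFrame to the table.
# if_exists='replace' drops the table created above and recreates it from the
# DataFrame's own columns, so the stored schema follows the DataFrame.
df.to_sql('FilmLocations', conn, if_exists='replace', index=False)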

# Commit and close the connection
conn.commit()
conn.close()
print("Loaded and stored in FilmLocations.db")


Loaded and stored in FilmLocations.db


## THE LABS 

**Note**: before running SQL queries, open a connection to the database, run the query (or queries), assign each result to a pandas DataFrame, and then close the connection.
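
The open, query, close cycle from the note can be wrapped in a small helper. This is only a sketch using the standard library: `run_query` is a hypothetical name, and it returns plain column names and rows rather than a DataFrame so that it runs without pandas.

```python
import sqlite3
from contextlib import closing


def run_query(db_path, sql, params=()):
    """Open a connection, run one query, and close the connection again."""
    # closing() guarantees conn.close() even if the query raises.
    with closing(sqlite3.connect(db_path)) as conn:
        cursor = conn.execute(sql, params)
        columns = [col[0] for col in cursor.description]
        return columns, cursor.fetchall()


# Demonstrated against an in-memory database so nothing is written to disk.
columns, rows = run_query(":memory:", "SELECT 1 AS one, 'a' AS letter")
print(columns, rows)
```

Note that `with sqlite3.connect(...)` on its own only manages the transaction and does not close the connection, which is why `closing` is used here. Inside the helper, `pd.read_sql_query(sql, conn)` could be substituted to get a DataFrame back instead.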

### Select statement:

Suppose we want to retrieve details of all the films from the FilmLocations table. The details of each film record should contain all the columns. The query statement for this is:

**SELECT * FROM FilmLocations;**


In [66]:
# Connect to the SQLite database
conn = sqlite3.connect('FilmLocations.db')

# Define the SQL query
query1 = "SELECT * FROM FilmLocations"  # Select every column of every row
# Execute the query and load the result into a DataFrame
df1 = pd.read_sql_query(query1, conn)

# Display the DataFrame
df1

Unnamed: 0,title,release_year,locations,production_company,distributor,director,writer,actor_1,actor_2,actor_3,:@computed_region_6qbp_sg9q,:@computed_region_ajp5_b2md,:@computed_region_26cr_cadq,fun_facts
0,Experiment in Terror,1962,The Sea Captain's Chest (Fisherman's Wharf),Columbia Pictures Corporation,Columbia Pictures,Blake Edwards,The Gordons,Glenn Ford,Lee Remick,Stefanie Powers,99,23,3,
1,Experiment in Terror,1962,100 St. Germain Avenue,Columbia Pictures Corporation,Columbia Pictures,Blake Edwards,The Gordons,Glenn Ford,Lee Remick,Stefanie Powers,47,38,8,
2,Chan is Missing,1982,"Li Po (916 Grant Avenue at Washington, Chinatown)",New Yorker Films,New Yorker Films,Wayne Wang,Wayne Wang,Wood Moy,Marc Hayashi,Lauren Chew,104,6,3,
3,A View to a Kill,1985,Taylor and Jefferson Streets (Fisherman's Wharf),Metro-Goldwyn Mayer,MGM/UA Entertainment Company,John Glen,Richard Maibaum,Roger Moore,Christopher Walken,Tanya Roberts,99,23,3,
4,The Californians,2005,,Parker Film Company,Fabrication Films,Jonathan Parker,Jonathan Parker & Catherine DiNapoli,Noah Wyle,,,21,36,10,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2079,Vertigo,1958,900 Lombard Street,Alfred J. Hitchcock Productions,Paramount Pictures,Alfred Hitchcock,Alec Coppel,James Stewart,Kim Novak,Barbara Bel Geddes,107,32,6,Lombard Street is not actually the most crooke...
2080,Love & Taxes,2014,The Marsh Theatre (1062 Valencia Street),Bad Company Films,,Jacob Kornbluth,Jacob Kornbluth,Jacob Kornbluth,,,52,20,5,
2081,Chance- Season 1 ep109,2016,Terry A. Francois Blvd.,TVM Productions,HULU,Sara Gran and Pete Begler,Dan Attias,Hugh Laurie,Gretchen Mol,Ethan Suplee,34,4,10,
2082,Women is Losers,2020,1132 Florida St,Look at the Moon Pictures,,Lissette Feliciano,Lissette Feliciano,Lorenza Izzo,Simu Liu,Liza Weil,53,20,2,


**SELECT Title, Director, Writer FROM FilmLocations;**

In [67]:
query2 = "SELECT title, director, writer FROM FilmLocations"
df2 = pd.read_sql_query(query2, conn)

df2.head()

Unnamed: 0,title,director,writer
0,Experiment in Terror,Blake Edwards,The Gordons
1,Experiment in Terror,Blake Edwards,The Gordons
2,Chan is Missing,Wayne Wang,Wayne Wang
3,A View to a Kill,John Glen,Richard Maibaum
4,The Californians,Jonathan Parker,Jonathan Parker & Catherine DiNapoli


Retrieve the title, release year, and locations of films released in 2001 or later:

In [68]:
query3 = "SELECT title, release_year, locations FROM FilmLocations WHERE release_year>=2001;"
df3 = pd.read_sql_query(query3,conn)

df3.head()

Unnamed: 0,title,release_year,locations
0,The Californians,2005,
1,Babies,2010,
2,I's,2011,1 Post Street
3,When We Rise,2017,Bay Bridge
4,Nash Bridges,2021,California Street at Davis
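
Because `release_year` is stored as TEXT, a comparison such as `release_year >= 2001` is evaluated by SQLite as a string comparison. That gives the expected answer for four-digit years, but not in general. The sketch below illustrates the difference on an in-memory table; the third row is a made-up value added purely to expose the behaviour.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE FilmLocations (title TEXT, release_year TEXT)")
conn.executemany(
    "INSERT INTO FilmLocations VALUES (?, ?)",
    [
        ("Experiment in Terror", "1962"),
        ("The Californians", "2005"),
        ("Hypothetical bad row", "985"),  # made-up value, not from the dataset
    ],
)

# TEXT column vs. numeric literal: the literal is converted to '2001'
# and the comparison is done character by character.
text_cmp = [r[0] for r in conn.execute(
    "SELECT title FROM FilmLocations WHERE release_year >= 2001 ORDER BY title"
)]

# Casting the column forces a numeric comparison instead.
int_cmp = [r[0] for r in conn.execute(
    "SELECT title FROM FilmLocations "
    "WHERE CAST(release_year AS INTEGER) >= 2001 ORDER BY title"
)]
conn.close()

print(text_cmp)  # ['Hypothetical bad row', 'The Californians']
print(int_cmp)   # ['The Californians']
```

If the years are always four digits, both forms agree; the `CAST` version is simply the safer habit when the column type is TEXT.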


### PRACTICE **SELECT**

1. Retrieve the fun facts and filming locations of all films.


In [69]:
prac1_query = "SELECT fun_facts, locations FROM FilmLocations"
df_ans1 = pd.read_sql_query(prac1_query,conn)

df_ans1

Unnamed: 0,fun_facts,locations
0,,The Sea Captain's Chest (Fisherman's Wharf)
1,,100 St. Germain Avenue
2,,"Li Po (916 Grant Avenue at Washington, Chinatown)"
3,,Taylor and Jefferson Streets (Fisherman's Wharf)
4,,
...,...,...
2079,Lombard Street is not actually the most crooke...,900 Lombard Street
2080,,The Marsh Theatre (1062 Valencia Street)
2081,,Terry A. Francois Blvd.
2082,,1132 Florida St


2. Retrieve the names of all films released in the 20th century and before (release years before 2000 including 2000), along with filming locations and release years.


In [70]:
prac2_query = "SELECT title, locations, release_year FROM FilmLocations WHERE release_year <= 2000;"
df_ans2 = pd.read_sql_query(prac2_query, conn)

df_ans2


3. Retrieve the names, production company names, filming locations, and release years of the films not written by James Cameron.


In [71]:
prac3_query = "SELECT title, production_company, locations, release_year FROM FilmLocations WHERE writer != 'James Cameron';" 
df_ans3 = pd.read_sql_query(prac3_query, conn)

df_ans3

Unnamed: 0,title,production_company,locations,release_year
0,Experiment in Terror,Columbia Pictures Corporation,The Sea Captain's Chest (Fisherman's Wharf),1962
1,Experiment in Terror,Columbia Pictures Corporation,100 St. Germain Avenue,1962
2,Chan is Missing,New Yorker Films,"Li Po (916 Grant Avenue at Washington, Chinatown)",1982
3,A View to a Kill,Metro-Goldwyn Mayer,Taylor and Jefferson Streets (Fisherman's Wharf),1985
4,The Californians,Parker Film Company,,2005
...,...,...,...,...
2049,Vertigo,Alfred J. Hitchcock Productions,900 Lombard Street,1958
2050,Love & Taxes,Bad Company Films,The Marsh Theatre (1062 Valencia Street),2014
2051,Chance- Season 1 ep109,TVM Productions,Terry A. Francois Blvd.,2016
2052,Women is Losers,Look at the Moon Pictures,1132 Florida St,2020
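
One detail worth knowing for filters like `writer != 'James Cameron'`: in SQL, comparing NULL with `!=` yields NULL rather than true, so rows with no writer recorded are excluded as well. A minimal sketch on an in-memory table follows; the two "Hypothetical" rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE FilmLocations (title TEXT, writer TEXT)")
conn.executemany(
    "INSERT INTO FilmLocations VALUES (?, ?)",
    [
        ("Chan is Missing", "Wayne Wang"),
        ("Hypothetical film A", "James Cameron"),  # invented row
        ("Hypothetical film B", None),             # invented row, writer is NULL
    ],
)

# != silently drops the NULL row.
not_equal = conn.execute(
    "SELECT COUNT(*) FROM FilmLocations WHERE writer != 'James Cameron'"
).fetchone()[0]

# IS NOT is SQLite's NULL-safe inequality, so the NULL row is kept.
is_not = conn.execute(
    "SELECT COUNT(*) FROM FilmLocations WHERE writer IS NOT 'James Cameron'"
).fetchone()[0]
conn.close()

print(not_equal, is_not)  # 1 2
```

Which behaviour is wanted depends on whether a film with an unknown writer should count as "not written by James Cameron".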


## COUNT, DISTINCT, LIMIT

### COUNT

1. Suppose we want to count the number of records or rows of the "FilmLocations" table. The query for this would be:


In [73]:
query4 = "SELECT COUNT(*) FROM FilmLocations;"
df4 = pd.read_sql_query(query4, conn)
df4


Unnamed: 0,COUNT(*)
0,2084


2. We want to count the number of locations of the films. But we also want to restrict the output result set so that we only retrieve the number of locations of the films written by a certain writer. The query for this can be written as:


In [75]:
query5 = "SELECT COUNT(locations) FROM FilmLocations WHERE writer == 'James Cameron';"

df5 = pd.read_sql_query(query5, conn)
df5

Unnamed: 0,COUNT(locations)
0,24
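
`COUNT(*)` and `COUNT(locations)` are not interchangeable: the first counts rows, the second counts only rows where `locations` is not NULL. The sketch below shows this on three rows taken from the output above, assuming the blank location for "The Californians" is stored as NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE FilmLocations (title TEXT, locations TEXT)")
conn.executemany(
    "INSERT INTO FilmLocations VALUES (?, ?)",
    [
        ("Experiment in Terror", "100 St. Germain Avenue"),
        ("I's", "1 Post Street"),
        ("The Californians", None),  # assumption: blank location stored as NULL
    ],
)

total_rows, rows_with_location = conn.execute(
    "SELECT COUNT(*), COUNT(locations) FROM FilmLocations"
).fetchone()
conn.close()

print(total_rows, rows_with_location)  # 3 2
```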


### DISTINCT

1. Assume that we want to retrieve the titles of all films in the table so that duplicates will be discarded in the output result set.

In [76]:
query6 = "SELECT DISTINCT title FROM FilmLocations;"

df6 = pd.read_sql_query(query6, conn)
df6

Unnamed: 0,title
0,Experiment in Terror
1,Chan is Missing
2,A View to a Kill
3,The Californians
4,Babies
...,...
331,Another 48 Hours
332,Chance - Season 1ep105
333,Superman
334,The Fan


2. Count the distinct release years of films produced by Warner Bros. Pictures.

In [78]:
query7 = "SELECT COUNT(DISTINCT release_year) FROM FilmLocations WHERE production_company=='Warner Bros. Pictures';"
df7 = pd.read_sql_query(query7,conn)
df7

Unnamed: 0,COUNT(DISTINCT release_year)
0,14
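
The effect of `DISTINCT` inside an aggregate is easiest to see next to the plain count. This sketch uses four rows from the output above in an in-memory table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE FilmLocations (title TEXT, release_year TEXT)")
conn.executemany(
    "INSERT INTO FilmLocations VALUES (?, ?)",
    [
        ("Experiment in Terror", "1962"),
        ("Experiment in Terror", "1962"),  # same film, second location
        ("Chan is Missing", "1982"),
        ("A View to a Kill", "1985"),
    ],
)

all_titles, unique_titles = conn.execute(
    "SELECT COUNT(title), COUNT(DISTINCT title) FROM FilmLocations"
).fetchone()
conn.close()

print(all_titles, unique_titles)  # 4 3
```

Each film appears once per filming location in this dataset, so counting titles without `DISTINCT` counts locations rather than films.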


### LIMIT

1. Retrieve only the first 25 rows of the table.


In [82]:
query8 = "SELECT * FROM FilmLocations LIMIT 25;"
df8 = pd.read_sql_query(query8,conn)
df8

Unnamed: 0,title,release_year,locations,production_company,distributor,director,writer,actor_1,actor_2,actor_3,:@computed_region_6qbp_sg9q,:@computed_region_ajp5_b2md,:@computed_region_26cr_cadq,fun_facts
0,Experiment in Terror,1962,The Sea Captain's Chest (Fisherman's Wharf),Columbia Pictures Corporation,Columbia Pictures,Blake Edwards,The Gordons,Glenn Ford,Lee Remick,Stefanie Powers,99.0,23.0,3.0,
1,Experiment in Terror,1962,100 St. Germain Avenue,Columbia Pictures Corporation,Columbia Pictures,Blake Edwards,The Gordons,Glenn Ford,Lee Remick,Stefanie Powers,47.0,38.0,8.0,
2,Chan is Missing,1982,"Li Po (916 Grant Avenue at Washington, Chinatown)",New Yorker Films,New Yorker Films,Wayne Wang,Wayne Wang,Wood Moy,Marc Hayashi,Lauren Chew,104.0,6.0,3.0,
3,A View to a Kill,1985,Taylor and Jefferson Streets (Fisherman's Wharf),Metro-Goldwyn Mayer,MGM/UA Entertainment Company,John Glen,Richard Maibaum,Roger Moore,Christopher Walken,Tanya Roberts,99.0,23.0,3.0,
4,The Californians,2005,,Parker Film Company,Fabrication Films,Jonathan Parker,Jonathan Parker & Catherine DiNapoli,Noah Wyle,,,21.0,36.0,10.0,
5,Babies,2010,,Canal+,Focus Features,Thomas Balmes,Thomas Balmes,Bayar,Hattie,,21.0,36.0,10.0,
6,I's,2011,1 Post Street,Banshee Cinema,,Chris Edgette,Kyle Tuck,,,,19.0,8.0,3.0,
7,When We Rise,2017,Bay Bridge,Film 49 Productions,Amercian Broadcasting Company,Gus Van Sant,Dustin Lance Black,Guy Pierce,Mary-Louise Parker,Michael Kenneth Williams,,,,
8,Nash Bridges,2021,California Street at Davis,"Village NB Productions, LLC",USA Nework,Greg Beeman,"Carlton Cuse, Bill Chais",Don Johnson,Cheech Marin,Joe Dinicol,108.0,8.0,3.0,
9,This Is Us,2022,Alamo Square Park,20th Television,NBC,Mandy Moore,"Dan Fogelman, Casey Johnson, David Windsor, Ch...",Milo Ventimiglia,Mandy Moore,Sterling K. Brown,22.0,9.0,11.0,


2. Now, we want to retrieve 15 rows from the table starting from row 11.


In [84]:
query9 = "SELECT * FROM FilmLocations LIMIT 15 OFFSET 10;"
df9 = pd.read_sql_query(query9, conn)
df9

Unnamed: 0,title,release_year,locations,production_company,distributor,director,writer,actor_1,actor_2,actor_3,:@computed_region_6qbp_sg9q,:@computed_region_ajp5_b2md,:@computed_region_26cr_cadq,fun_facts
0,The Rock,1996,Coit Tower,Hollywood Pictures,Buena Vista Pictures,Michael Bay,David Weisberg,Sean Connery,Nicolas Cage,,18.0,23.0,3.0,The Tower was funded by a gift bequeathed by L...
1,Blue Jasmine,2013,Pacific & Divisadero,Perdido Productions,Sony Pictures Classics,Woody Allen,Woody Allen,Cate Blanchett,Alec Baldwin,Peter Sarsgaard,102.0,30.0,6.0,
2,The Caine Mutiny,1954,Golden Gate Bridge,Stanley Kramer Productions,Columbia Pictures,Edward Dmytryk,Stanley Roberts,Humphrey Bogart,Fred MacMurray,Jose Ferrer,,,,"With 23 miles of ladders and 300,000 rivets in..."
3,The Diary of a Teenage Girl,2015,700 Kansas St.,"Diary the Movie, LLC",Sony Pictures Classics,Marielle Heller,Marielle Heller,Alexander Skarsgard,Kristen Wiig,Christopher Meloni,54.0,26.0,9.0,
4,Venom,2018,Grant Ave btwn Clay and Jackson,"L.O.Z. Productions, Inc.","Columbia Pictures, Sony Pictures Releasing",Ruben Fleischer,"Jeff Pinkner, Scott Rosenberg",Tom Hardy,Michelle Wiliams,Riz Ahmed,104.0,6.0,3.0,
5,"Murder in the First, Season 1",2014,13 Lucky Street,Turner North Center Productions,Turner Network Television (TNT),Steven Bochcho,Eric Lodal,Taye Diggs,Kathleen Robertson,Ian Anthony Dale,53.0,20.0,2.0,
6,GirlBoss,2017,"The Castro Theater, 429 Castro St.","Hippolyta Productions, LLC",Netflix,"Jamie Babbit, Amanda Brotchie, Steven K. Tsuch...",Kay Cannon,Britt Robertson,Ellie Reed,Amanda Rea,38.0,5.0,5.0,
7,Sense8 - Season 2,2016,Clay between Stockton and Grant,"Unpronounceable Productions, LLC",Netflix,Wachowski Siblings,"J. Michael Straczynski, Wachowiski Siblings",Jamie Clayton,Daryl Hannah,Naveen Andrews,104.0,6.0,3.0,
8,The Pursuit of Happyness,2006,Glen Park Subway Station,Columbia Pictures Corporation,Columbia Pictures,Steven Conrad,Gabriele Muccino,Will Smith,Jayden C. Smith,,59.0,10.0,5.0,
9,Women is Losers,2020,Capp St at 25th St,Look at the Moon Pictures,,Lissette Feliciano,Lissette Feliciano,Lorenza Izzo,Simu Liu,Liza Weil,53.0,20.0,2.0,
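
`LIMIT 15 OFFSET 10` skips the first 10 rows and returns the next 15. SQL does not guarantee row order without `ORDER BY`, so a sketch of stable paging might order by SQLite's implicit `rowid`. The titles here are placeholders, not dataset values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE FilmLocations (title TEXT)")
# 30 placeholder titles: "film 01" ... "film 30"
conn.executemany(
    "INSERT INTO FilmLocations VALUES (?)",
    [(f"film {i:02d}",) for i in range(1, 31)],
)

# Skip 10 rows, then take 15: rows 11 through 25 in insertion order.
page = [r[0] for r in conn.execute(
    "SELECT title FROM FilmLocations ORDER BY rowid LIMIT 15 OFFSET 10"
)]
conn.close()

print(len(page), page[0], page[-1])  # 15 film 11 film 25
```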


### PRACTICE **COUNT**

1. Retrieve the number of locations of the films which are directed by Woody Allen.


In [90]:
prac4_query = "SELECT COUNT(locations) FROM FilmLocations WHERE director == 'Woody Allen';"
df_ans4 = pd.read_sql_query(prac4_query,conn)
df_ans4

Unnamed: 0,COUNT(locations)
0,31


2. Retrieve the number of films shot at Russian Hill.


In [92]:
prac5_query = "SELECT COUNT(*) FROM FilmLocations WHERE locations == 'Russian Hill';"
df_ans5 = pd.read_sql_query(prac5_query,conn)
df_ans5

Unnamed: 0,COUNT(*)
0,1
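
An equality test only matches rows whose `locations` value is exactly the given string. When the place name is embedded in a longer description, a `LIKE` pattern may be closer to the intent. The sketch below uses three rows from the output above, with Fisherman's Wharf standing in for Russian Hill, and binds the search strings as parameters to avoid escaping the apostrophe by hand.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE FilmLocations (title TEXT, locations TEXT)")
conn.executemany(
    "INSERT INTO FilmLocations VALUES (?, ?)",
    [
        ("Experiment in Terror", "The Sea Captain's Chest (Fisherman's Wharf)"),
        ("Experiment in Terror", "100 St. Germain Avenue"),
        ("A View to a Kill", "Taylor and Jefferson Streets (Fisherman's Wharf)"),
    ],
)

place = "Fisherman's Wharf"

# Exact match: no row consists of the place name alone.
exact = conn.execute(
    "SELECT COUNT(*) FROM FilmLocations WHERE locations = ?", (place,)
).fetchone()[0]

# Substring match: % matches any run of characters on either side.
partial = conn.execute(
    "SELECT COUNT(*) FROM FilmLocations WHERE locations LIKE ?", (f"%{place}%",)
).fetchone()[0]
conn.close()

print(exact, partial)  # 0 2
```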


3. Retrieve the number of rows with a release year of 1950 or earlier from the "FilmLocations" table.

In [93]:
prac6_query = "SELECT COUNT(*) FROM FilmLocations WHERE release_year <= 1950;"
df_ans6 = pd.read_sql_query(prac6_query,conn)
df_ans6

Unnamed: 0,COUNT(*)
0,47


### PRACTICE **DISTINCT**

1. Retrieve the names of all unique films released in the 21st century and onwards, along with their release years.

In [98]:
prac7_query = "SELECT DISTINCT title, release_year FROM FilmLocations WHERE release_year >= 2001;"
df_ans7 = pd.read_sql_query(prac7_query,conn)
df_ans7

Unnamed: 0,title,release_year
0,The Californians,2005
1,Babies,2010
2,I's,2011
3,When We Rise,2017
4,Nash Bridges,2021
...,...,...
141,50 First Dates,2004
142,The Sweetest Thing,2002
143,Serendipity,2001
144,Night of Henna,2005


2. Retrieve the directors' names and their distinct films shot at City Hall.

In [99]:
prac7_query = "SELECT DISTINCT title, director FROM FilmLocations WHERE locations == 'City Hall';"
df_ans7 = pd.read_sql_query(prac7_query,conn)
df_ans7

Unnamed: 0,title,director
0,San Francisco,W.S. Van Dyke
1,"Smile Again, Jenny Lee",Carlo Caldana
2,The Wedding Planner,Adam Shankman
3,Magnum Force,Ted Post
4,The Rock,Michael Bay
5,When We Rise,Gus Van Sant
6,A View to a Kill,John Glen
7,Invasion of the Body Snatchers,Philip Kaufman
8,The Enforcer,James Fargo
9,Class Action,Michael Apted


3. Retrieve the number of distinct distributors of films whose first-listed actor (actor_1) is Clint Eastwood.


In [102]:
prac8_query = "SELECT COUNT (DISTINCT(distributor)) FROM FilmLocations WHERE actor_1 == 'Clint Eastwood';"
df_ans8 = pd.read_sql_query(prac8_query,conn)
df_ans8

Unnamed: 0,COUNT (DISTINCT(distributor))
0,3


### PRACTICE **LIMIT**

1. Retrieve the names of the first 50 films.

In [105]:
prac9_query = "SELECT title FROM FilmLocations LIMIT 50;"
df_ans9 = pd.read_sql_query(prac9_query,conn)
df_ans9

Unnamed: 0,title
0,Experiment in Terror
1,Experiment in Terror
2,Chan is Missing
3,A View to a Kill
4,The Californians
5,Babies
6,I's
7,When We Rise
8,Nash Bridges
9,This Is Us


2. Retrieve the first 10 film names released in 2015.


In [106]:
prac10_query = "SELECT title FROM FilmLocations WHERE release_year == 2015 LIMIT 10;"
df_ans10 = pd.read_sql_query(prac10_query,conn)
df_ans10

Unnamed: 0,title
0,The Diary of a Teenage Girl
1,Quitters
2,"Murder in the First, Season 2"
3,Looking Season 2 ep 202
4,Sense8
5,Terminator - Genisys
6,Ant-Man
7,"Murder in the First, Season 2"
8,Terminator - Genisys
9,Looking Season 2 ep 202


3. Retrieve the next 3 film names that follow after the first 5 films released in 2015.


In [107]:
prac11_query = "SELECT title FROM FilmLocations WHERE release_year == 2015 LIMIT 3 OFFSET 5;"
df_ans11 = pd.read_sql_query(prac11_query,conn)
df_ans11

Unnamed: 0,title
0,Terminator - Genisys
1,Ant-Man
2,"Murder in the First, Season 2"
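
The year, limit, and offset in these practice queries can be passed as bound parameters instead of being written into the SQL string. The following is a sketch only: `titles_for_year` is a hypothetical helper, the table is an in-memory copy of the first eight 2015 titles shown above, and `ORDER BY rowid` is added to keep the paging deterministic.

```python
import sqlite3


def titles_for_year(conn, year, limit, offset=0):
    """Return up to `limit` titles for `year`, skipping the first `offset` rows."""
    sql = (
        "SELECT title FROM FilmLocations "
        "WHERE release_year = ? ORDER BY rowid LIMIT ? OFFSET ?"
    )
    # release_year is stored as TEXT, so the year is bound as a string.
    return [row[0] for row in conn.execute(sql, (str(year), limit, offset))]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE FilmLocations (title TEXT, release_year TEXT)")
conn.executemany(
    "INSERT INTO FilmLocations VALUES (?, '2015')",
    [
        ("The Diary of a Teenage Girl",),
        ("Quitters",),
        ("Murder in the First, Season 2",),
        ("Looking Season 2 ep 202",),
        ("Sense8",),
        ("Terminator - Genisys",),
        ("Ant-Man",),
        ("Murder in the First, Season 2",),
    ],
)

next_three = titles_for_year(conn, 2015, limit=3, offset=5)
conn.close()  # close the connection once the queries are done

print(next_three)
# ['Terminator - Genisys', 'Ant-Man', 'Murder in the First, Season 2']
```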
