<a href="https://colab.research.google.com/github/solver-Mart1n/data-science/blob/solver-Mart1n-c06w1p2-basic-sql/reference/ibm_ds/language/sql/basic_sql/1_Lab_Simple_SELECT_Statements.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1> Simple SELECT Statements in SQL using Jupyter Notebooks </h1>

## Description
In this lab, you will learn one of the most commonly used statements of SQL (Structured Query Language), the SELECT statement. The SELECT statement is used to select data from a database.

<h3>Objectives</h3>
<h4> After completing this lab, you will be able to: </h4>


*   Query a database to obtain a response as a result set
*   Retrieve all or selected columns of a dataset
*   Apply criteria commands to filter the result set





<h3>Table of Contents</h3>
    <ul>
        <li>Building the Database from an Internet Source</li>
        <li>Exploring the Database</li>
        <li>Using SELECT Statement</li>
        <li>Practice Exercises on the SELECT Statement</li>
    </ul>

<p>Estimated Time Needed: <strong>30 min</strong></p>
<hr>

## Building the Database from an Internet Source
The database used in this lab comes from the following dataset source: [Film Locations in San Francisco](https://data.sfgov.org/Culture-and-Recreation/Film-Locations-in-San-Francisco/yitu-d5am/about_data) under a [PDDL: Public Domain Dedication and License](http://opendatacommons.org/licenses/pddl/1.0/).

### Ingesting a CSV from a Data Source Endpoint
Two query parameters, `$limit` and `$offset`, are appended to the base URL of the data source to build three page requests. The data is paged to comply with the row limit of the endpoint API.

In [None]:
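# $offset is the number of rows to skip, so consecutive pages of 1000 rows
# start at offsets 0, 1000 and 2000 (offsets 999 and 1999 would repeat a row).
url1 = 'https://data.sfgov.org/resource/yitu-d5am.csv?$limit=1000&$offset=0'
url2 = 'https://data.sfgov.org/resource/yitu-d5am.csv?$limit=1000&$offset=1000'
url3 = 'https://data.sfgov.org/resource/yitu-d5am.csv?$limit=49&$offset=2000'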
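
The paging scheme can also be generated instead of typed by hand. The following is a minimal sketch, not part of the original lab: the helper name `build_page_urls` is hypothetical, and the total of 2049 rows is an assumption taken from the index range reported later in this notebook. Since `$offset` counts the rows to skip, each page starts at a multiple of the page size.

```python
from urllib.parse import urlencode

BASE_URL = "https://data.sfgov.org/resource/yitu-d5am.csv"


def build_page_urls(base_url, total_rows, page_size):
    """Return one URL per page, covering rows 0..total_rows-1 without overlap.

    Hypothetical helper for illustration; total_rows must be known in advance.
    """
    urls = []
    for offset in range(0, total_rows, page_size):
        # The last page may hold fewer rows than page_size.
        limit = min(page_size, total_rows - offset)
        # safe="$" keeps the "$" in "$limit" and "$offset" unescaped.
        query = urlencode({"$limit": limit, "$offset": offset}, safe="$")
        urls.append(f"{base_url}?{query}")
    return urls


# Assumed row count (2049) and page size (1000).
page_urls = build_page_urls(BASE_URL, total_rows=2049, page_size=1000)
for url in page_urls:
    print(url)
```

Each generated URL could then be passed to `pd.read_csv()` in the same way as the hand-written ones.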

The three pages are ingested as data frames and combined into one, which is then written out as a single CSV file.

In [None]:
!pip install pandas==1.3.3

In [None]:
import pandas as pd

In [None]:
# Read the first page of the dataset from the CSV endpoint
df1 = pd.read_csv(url1, header=0, sep=",")

# Display the last few rows of the DataFrame
df1.tail()

In [None]:
# Read the dataset from a csv file
df2 = pd.read_csv(url2, header=0, sep=",")

# Display the first few rows of the DataFrame
df2.head()

In [None]:
df2.tail()

In [None]:
# Read the dataset from a csv file
df3 = pd.read_csv(url3, header=0, sep=",")

# Display the first few rows of the DataFrame
df3.head()

In [None]:
df3.tail()

As the tail() and head() printouts of the three data frames show, the index restarts at 0 each time a CSV page is loaded. Passing `ignore_index=True` to `pd.concat()` discards these repeated indices and gives the combined data frame one continuous index. `pd.concat()` is used here because `DataFrame.append()` was deprecated in pandas 1.4 and removed in pandas 2.0.

In [None]:
df = pd.concat([df1, df2, df3], ignore_index=True)
len(df)

In [None]:
df.columns

In [None]:
df.drop(columns=[':@computed_region_6qbp_sg9q', ':@computed_region_ajp5_b2md', ':@computed_region_26cr_cadq'],inplace=True)

In [None]:
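# Assigning to df.columns renames the columns in place and works across pandas versions
# (the inplace argument of set_axis() was removed in pandas 2.0).
df.columns = ['Title', 'ReleaseYear', 'Locations', 'FunFacts', 'ProductionCompany', 'Distributor', 'Director', 'Writer', 'Actor1', 'Actor2', 'Actor3']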

As a result, the index labels of the combined data frame start at 0 and end at 2048. This indicates a successful merge of the pages of data from the [Data SF](https://data.sfgov.org/resource/yitu-d5am.csv) source URL.

In [None]:
df

Store the combined data frame into one CSV file.

In [None]:
df.to_csv('san_francisco_film_locations.csv', index=False)

### Create an SQL Database from the Pandas Data Frame

#### Option 1: Using DuckDB
You can create a DuckDB table directly from a CSV file by combining the **CREATE OR REPLACE TABLE** and **AS FROM** clauses with the _read_csv_auto()_ function.

In [None]:
%pip install jupysql --upgrade duckdb-engine --quiet

In [None]:
%reload_ext sql

In [None]:
%sql duckdb:///san_francisco_film_locations.duck.db

In [None]:
%%sql
CREATE OR REPLACE TABLE san_francisco_film_locations AS
FROM read_csv_auto('san_francisco_film_locations.csv', header=True, sep=',')

Proceed to the section: Exploring the Database

#### Option 2: Using SQLite and SQLAlchemy

In [None]:
import sqlite3 as sq3

In [None]:
conn = sq3.connect('san_francisco_film_locations.db')
#df.to_sql('san_francisco_film_locations', conn, if_exists='append', index=False)
df.to_sql('san_francisco_film_locations', conn, if_exists='replace', index=False)

In [None]:
!pip install sqlalchemy

In [None]:
%reload_ext sql

In [None]:
%sql sqlite:///san_francisco_film_locations.db
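
Before querying through the SQL magic, it can help to confirm that the table exists and holds rows. The following is a minimal sketch using only the standard-library `sqlite3` module. It runs against an in-memory database with made-up rows and a reduced set of columns, so it does not touch the real file. To inspect the database created in this section, connect to `'san_francisco_film_locations.db'` instead of `":memory:"` and skip the table creation.

```python
import sqlite3

# In-memory database so the sketch does not touch any file on disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE san_francisco_film_locations "
    "(Title TEXT, ReleaseYear INTEGER, Locations TEXT)"
)
# Hypothetical rows, for illustration only.
conn.executemany(
    "INSERT INTO san_francisco_film_locations VALUES (?, ?, ?)",
    [
        ("Sample Film A", 1995, "Sample Location 1"),
        ("Sample Film B", 2010, "Sample Location 2"),
    ],
)
conn.commit()

# sqlite_master lists every table stored in a SQLite database.
tables = [
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
]
row_count = conn.execute(
    "SELECT COUNT(*) FROM san_francisco_film_locations"
).fetchone()[0]

print(tables)
print(row_count)
conn.close()
```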

## Exploring the Database
Now that we have a database, we can start exploring it with the _SELECT_ statement. The _FROM_ clause specifies the table to query, and the '*' selects all of its columns.

A value of 5 passed to the _LIMIT_ clause limits the output to 5 records.

In [None]:
%%sql
SELECT *
FROM san_francisco_film_locations
LIMIT 5
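
The same kind of query can be issued from plain Python, which also exposes the column names of the result set. This is a minimal sketch under stated assumptions: the table `films` and its rows are hypothetical stand-ins for the real table, held in an in-memory SQLite database. Note that without an ORDER BY clause, SQL does not guarantee which rows LIMIT returns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table and rows, for illustration only.
conn.execute("CREATE TABLE films (Title TEXT, ReleaseYear INTEGER)")
conn.executemany(
    "INSERT INTO films VALUES (?, ?)",
    [(f"Sample Film {i}", 1990 + i) for i in range(10)],
)

cursor = conn.execute("SELECT * FROM films LIMIT 5")
# cursor.description holds one tuple per result column; item 0 is the column name.
column_names = [column[0] for column in cursor.description]
rows = cursor.fetchall()

print(column_names)
for row in rows:
    print(row)
conn.close()
```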

These are the column attribute descriptions from the **san_francisco_film_locations** table:

|Column|Description|
|---|---|
|   Title| titles of the films|
|   ReleaseYear| years of public release of the films|
|   Locations| locations in San Francisco where the films were shot|
|   FunFacts| fun facts about the filming locations|
|   ProductionCompany| companies who produced the films|
|   Distributor| companies who distributed the films|
|   Director| people who directed the films|
|   Writer| people who wrote the films|
|   Actor1| person 1 who acted in the films|
|   Actor2| person 2 who acted in the films|
|   Actor3| person 3 who acted in the films|






## Using SELECT Statement
Now, let's go through some examples of SELECT queries.
1. Suppose we want to retrieve details of all the films from the san_francisco_film_locations table. The details of each film record should contain all the columns. The query statement for this is:


In [None]:
%%sql
SELECT *
FROM san_francisco_film_locations

2. We want to retrieve only the film titles along with the director and writer names. The query now would be:

In [None]:
%%sql
SELECT Title, Director, Writer
FROM san_francisco_film_locations
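
Selecting named columns means each result row contains only those fields, in the order listed. A minimal sketch of reading such rows by column name from Python, assuming a hypothetical `films` table with made-up rows in an in-memory SQLite database:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# sqlite3.Row lets each result row be indexed by column name.
conn.row_factory = sqlite3.Row
# Hypothetical table and rows, for illustration only.
conn.execute(
    "CREATE TABLE films (Title TEXT, ReleaseYear INTEGER, Director TEXT, Writer TEXT)"
)
conn.executemany(
    "INSERT INTO films VALUES (?, ?, ?, ?)",
    [
        ("Sample Film A", 1995, "Director A", "Writer A"),
        ("Sample Film B", 2010, "Director B", "Writer B"),
    ],
)

rows = conn.execute("SELECT Title, Director, Writer FROM films").fetchall()
# Only the three selected columns are present; ReleaseYear is not returned.
selected_columns = list(rows[0].keys())

print(selected_columns)
for row in rows:
    print(row["Title"], row["Director"], row["Writer"])
conn.close()
```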

3. We want to retrieve film names along with filming locations and release years, but restrict the result set to films released in 2001 or later.

In [None]:
%%sql
SELECT Title, ReleaseYear, Locations
FROM san_francisco_film_locations
WHERE ReleaseYear>=2001
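
When the threshold year comes from a Python variable rather than being typed into the query, a parameterized query keeps the value separate from the SQL text. This is a minimal sketch, assuming a hypothetical `films` table with made-up rows and a hypothetical helper `films_released_since`; the `?` placeholder is the `sqlite3` module's parameter style.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table and rows, for illustration only.
conn.execute("CREATE TABLE films (Title TEXT, ReleaseYear INTEGER, Locations TEXT)")
conn.executemany(
    "INSERT INTO films VALUES (?, ?, ?)",
    [
        ("Sample Film A", 1995, "Sample Location 1"),
        ("Sample Film B", 2001, "Sample Location 2"),
        ("Sample Film C", 2015, "Sample Location 3"),
    ],
)


def films_released_since(connection, first_year):
    """Return (Title, ReleaseYear, Locations) rows with ReleaseYear >= first_year."""
    query = (
        "SELECT Title, ReleaseYear, Locations "
        "FROM films "
        "WHERE ReleaseYear >= ? "
        "ORDER BY ReleaseYear"
    )
    # The tuple supplies the value for the single ? placeholder.
    return connection.execute(query, (first_year,)).fetchall()


recent = films_released_since(conn, 2001)
for row in recent:
    print(row)
conn.close()
```

Because the comparison is `>=`, the film released in exactly 2001 is included in the result.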

## Practice Exercises on the SELECT Statement

1. Retrieve the fun facts and filming locations of all films.

<details><summary>Click here for a hint</summary>

```
Follow example 2 of SELECT, where records containing details of some particular columns have been retrieved.
```

</details>

<details><summary>Click here for the solution</summary>

```
%%sql
SELECT Locations, FunFacts
FROM san_francisco_film_locations
```

</details>

2. Retrieve the names of all films released in the 20th century and before (release years before 2000 including 2000), along with filming locations and release years.

<details><summary>Click here for a hint</summary>

```
Follow example 3 of SELECT, where we restricted the result set to film records with certain release years. Use the WHERE clause with the comparison operator <=, which means "less than or equal to".
```

</details>

<details><summary>Click here for the solution</summary>

```
%%sql
SELECT Title, ReleaseYear, Locations
FROM san_francisco_film_locations
WHERE ReleaseYear<=2000
```

</details>

3. Retrieve the names, production company names, filming locations, and release years of the films not written by James Cameron.

<details><summary>Click for a Hint</summary>

```
Use WHERE clause comparison operator <>, which means Not equal to.
```

</details>

<details><summary>Click here for the solution</summary>

```
%%sql
SELECT Title, ProductionCompany, Locations, ReleaseYear
FROM san_francisco_film_locations
WHERE Writer<>'James Cameron'
```

</details>
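
One subtlety of the `<>` operator is that a comparison with NULL is never true, so rows whose Writer is missing are left out of the result. Whether the real dataset contains such rows is not checked here. The sketch below uses a hypothetical `films` table with made-up rows in an in-memory SQLite database to show the effect and one way to include the missing values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table and rows, for illustration only.
conn.execute("CREATE TABLE films (Title TEXT, Writer TEXT)")
conn.executemany(
    "INSERT INTO films VALUES (?, ?)",
    [
        ("Sample Film A", "James Cameron"),
        ("Sample Film B", "Writer B"),
        ("Sample Film C", None),  # None is stored as NULL
    ],
)

# NULL <> 'James Cameron' evaluates to NULL, which WHERE treats as not true.
not_cameron = [
    row[0]
    for row in conn.execute(
        "SELECT Title FROM films WHERE Writer <> 'James Cameron' ORDER BY Title"
    )
]

# Adding IS NULL keeps the rows whose writer is unknown.
including_unknown = [
    row[0]
    for row in conn.execute(
        "SELECT Title FROM films "
        "WHERE Writer <> 'James Cameron' OR Writer IS NULL "
        "ORDER BY Title"
    )
]

print(not_cameron)
print(including_unknown)
conn.close()
```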

## Conclusion
Congratulations on completing this lab!
You are now able to:

*   Build a database from Internet sources
*   Query a database using SELECT statements
*   Retrieve all or selected columns of data
*   Filter the query response to meet defined criteria


## Credit to the Source Content
This Python notebook uses the content of **Hands-on Lab: Simple SELECT Statements** by IBM Skills Network from the [Databases and SQL for Data Science with Python](https://www.coursera.org/learn/sql-data-science) course. The source content uses a different platform, [Datasette](https://github.com/simonw/datasette), which does not use Jupyter notebooks.


### Change Log
All versions prior to 2.0 are attributable to IBM Skills Network's version of the **Hands-on Lab: Simple SELECT Statements** written for [Datasette](https://github.com/simonw/datasette).


| Date (YYYY-MM-DD) | Version | Changed By    | Change Description        |
| ----------------- | ------- | ------------- | ------------------------- |
|2024-05-05|2.0|Martin Borja|Ported to Jupyter/Python Notebooks|
| 2023-10-12 | 1.9 |Mercedes Schneider| QA Pass w/Edits|
|2023-10-11 |1.8 |Misty Taylor| ID Check|
|2023-10-01 |1.7 |Abhishek Gagneja |Updated Lab instructions|
|2023-07-11 |1.6 |Lakshmi Holla |Updated labs|
|2023-06-02 |1.5 |Eric Hao |Fixed Page Styles|
|2023-05-10 |1.4 |Eric Hao & Vladislav Boyko| Updated Page Frames|
|2023-05-04 |1.3 |Benny Li| Republished|
|2022-07-27 |1.2 |Lakshmi Holla| Updated html tag|
|2020-11-23 |1.1 |Steve Ryan |ID Review|
|2020-11-20 |1.0| Sandip Saha Joy| Initial version created|

<hr>

### <h4 align="center"> **Hands-on Lab: Simple SELECT Statements** © IBM Corporation 2020. All rights reserved. </h4>
### <h4 align="center"> **Simple SELECT Statements in SQL using Jupyter Notebooks** © Martin John Hilario Borja 2024. All rights reserved. </h4>
