<a href="https://colab.research.google.com/github/solver-Mart1n/data-science/blob/solver-Mart1n-c06w1p2-basic-sql/activities/relational_database/sql/basic_sql/2_Lab_COUNT_DISTINCT_LIMIT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1> COUNT, DISTINCT, LIMIT in SQL using Jupyter Notebooks </h1>

## Description
In this lab, you will learn three useful expressions that are used with SELECT statements. First, you will learn COUNT, an aggregate function that returns the number of rows that match the query criteria. Next, you will learn DISTINCT, which removes duplicate values from a result set so that only the unique values are returned. Lastly, you will learn LIMIT, which restricts the number of rows retrieved from the table.

<h3>Objectives</h3>
<h4>After completing this lab, you will be able to:</h4>
    <ul>
        <li>Retrieve the number of rows that match the query criteria</li>
        <li>Remove duplicate values from a result set and return the unique values</li>
        <li>Restrict the number of rows retrieved from a table</li>
    </ul>


<h3>Table of Contents</h3>
    <ul>
        <li>Building the Database from an Internet Source</li>
        <li>Exploring the Database</li>
        <li>Using COUNT Statement</li>
        <li>Using DISTINCT Statement</li>
        <li>Using LIMIT Statement</li>
        <li>Practice Exercises COUNT, DISTINCT, LIMIT</li>
    </ul>

<p>Estimated Time Needed: <strong>30 min</strong></p>
<hr>

## Building the Database from an Internet Source
The database used in this lab comes from the following dataset source: [Film Locations in San Francisco](https://data.sfgov.org/Culture-and-Recreation/Film-Locations-in-San-Francisco/yitu-d5am/about_data) under a [PDDL: Public Domain Dedication and License](http://opendatacommons.org/licenses/pddl/1.0/).

### Ingesting a CSV from a Data Source Endpoint
Two API query parameters, `$limit` and `$offset`, are appended to the base URL of the data source to request the data in three pages. Paging keeps each request within the row limit of the endpoint API.

In [None]:
url1 = 'https://data.sfgov.org/resource/yitu-d5am.csv?$limit=1000&$offset=0'
url2 = 'https://data.sfgov.org/resource/yitu-d5am.csv?$limit=1000&$offset=1000'
url3 = 'https://data.sfgov.org/resource/yitu-d5am.csv?$limit=49&$offset=2000'
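
The page URLs can also be generated rather than typed by hand. The following is a minimal sketch, assuming that `$offset` is the number of rows to skip and that the dataset holds 2049 rows in total; the helper name `page_urls` is hypothetical and not part of the original lab.

```python
from urllib.parse import urlencode

BASE_URL = "https://data.sfgov.org/resource/yitu-d5am.csv"


def page_urls(total_rows, page_size=1000, base_url=BASE_URL):
    """Return one URL per page; each page skips the rows already requested."""
    urls = []
    for offset in range(0, total_rows, page_size):
        # The last page may be shorter than page_size
        limit = min(page_size, total_rows - offset)
        # safe="$" keeps the parameter names ($limit, $offset) unescaped
        query = urlencode({"$limit": limit, "$offset": offset}, safe="$")
        urls.append(f"{base_url}?{query}")
    return urls


urls = page_urls(total_rows=2049)  # assumed row count
for u in urls:
    print(u)
```

Each offset equals the number of rows covered by the earlier pages (0, 1000, 2000), so no row is requested twice.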

The three pages will be ingested as data frames and combined into one, which will then be used to generate a single CSV file.

In [None]:
!pip install pandas==1.3.3

In [None]:
import pandas as pd

In [None]:
# Read the first page of the dataset from the CSV endpoint
df1 = pd.read_csv(url1, header=0, sep=",")

# Display the last few rows of the DataFrame
df1.tail()

In [None]:
# Read the dataset from a csv file
df2 = pd.read_csv(url2, header=0, sep=",")

# Display the first few rows of the DataFrame
df2.head()

In [None]:
df2.tail()

In [None]:
# Read the dataset from a csv file
df3 = pd.read_csv(url3, header=0, sep=",")

# Display the first few rows of the DataFrame
df3.head()

In [None]:
df3.tail()

As the tail() and head() printouts of the three data frames show, the index restarts from 0 each time a CSV page is loaded. The "ignore_index" parameter of the pandas concat() function is set to _True_ so that these repeating index labels are not copied to the combined data frame, whose rows are renumbered from 0 instead.

In [None]:
# Combine the three pages; ignore_index=True renumbers the rows from 0
df = pd.concat([df1, df2, df3], ignore_index=True)
len(df)
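
The effect of `ignore_index` is easiest to see on tiny frames. This is a minimal sketch with two made-up stand-in pages (`page_a` and `page_b` are hypothetical names), using `pd.concat`, which accepts the same `ignore_index` flag.

```python
import pandas as pd

# Two tiny stand-in pages; each gets its own 0-based index when created
page_a = pd.DataFrame({"title": ["A", "B"]})
page_b = pd.DataFrame({"title": ["C", "D"]})

# Without ignore_index the original labels are kept, so they repeat
kept = pd.concat([page_a, page_b])

# With ignore_index=True the combined frame is renumbered from 0
renumbered = pd.concat([page_a, page_b], ignore_index=True)

print(list(kept.index))        # [0, 1, 0, 1]
print(list(renumbered.index))  # [0, 1, 2, 3]
```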

In [None]:
df.columns

In [None]:
df.drop(columns=[':@computed_region_6qbp_sg9q', ':@computed_region_ajp5_b2md', ':@computed_region_26cr_cadq'],inplace=True)

In [None]:
# Rename the remaining columns; plain assignment works across pandas versions
df.columns = ['Title', 'ReleaseYear', 'Locations', 'FunFacts', 'ProductionCompany', 'Distributor', 'Director', 'Writer', 'Actor1', 'Actor2', 'Actor3']

As a result, the index labels of the combined data frame start from 0 and end with 2048. This indicates a successful merge of the pages of data from the [Data SF](https://data.sfgov.org/resource/yitu-d5am.csv) source URL.

In [None]:
df

Store the combined data frame into one CSV file.

In [None]:
df.to_csv('san_francisco_film_locations.csv', index=False)

### Create an SQL Database from the Pandas Data Frame

#### Option 1: Using DuckDB
You can create a DuckDB database from the CSV file using the **CREATE OR REPLACE TABLE** statement with an **AS FROM** clause and the _read_csv_auto()_ function.

In [None]:
%pip install jupysql --upgrade duckdb-engine --quiet

In [None]:
%reload_ext sql

In [None]:
%sql duckdb:///san_francisco_film_locations.duck.db

In [None]:
%%sql
CREATE OR REPLACE TABLE san_francisco_film_locations AS
FROM read_csv_auto('san_francisco_film_locations.csv', header=True, sep=',')

Proceed to the section: Exploring the Database

#### Option 2: Using SQLite and SQLAlchemy

In [None]:
import sqlite3 as sq3

In [None]:
conn = sq3.connect('san_francisco_film_locations.db')
#df.to_sql('san_francisco_film_locations', conn, if_exists='append', index=False)
df.to_sql('san_francisco_film_locations', conn, if_exists='replace', index=False)

In [None]:
!pip install sqlalchemy

In [None]:
%reload_ext sql

In [None]:
%sql sqlite:///san_francisco_film_locations.db
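
To make the CSV-to-table step less opaque, here is a rough standard-library sketch of the same idea: create a table, then insert one row per CSV record. The two sample rows are invented for illustration, and the database lives in memory, so it does not touch the files created above.

```python
import csv
import io
import sqlite3

# Hypothetical two-row sample standing in for san_francisco_film_locations.csv
sample_csv = io.StringIO(
    "Title,ReleaseYear,Locations\n"
    "Film A,1999,City Hall\n"
    "Film B,2015,Russian Hill\n"
)

conn = sqlite3.connect(":memory:")  # in-memory database; nothing is written to disk
conn.execute(
    "CREATE TABLE san_francisco_film_locations "
    "(Title TEXT, ReleaseYear INTEGER, Locations TEXT)"
)

# DictReader yields one dict per CSV row; named placeholders pick the values
reader = csv.DictReader(sample_csv)
conn.executemany(
    "INSERT INTO san_francisco_film_locations "
    "VALUES (:Title, :ReleaseYear, :Locations)",
    reader,
)
conn.commit()

rows = conn.execute(
    "SELECT Title, ReleaseYear FROM san_francisco_film_locations "
    "ORDER BY ReleaseYear"
).fetchall()
print(rows)
```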

## Exploring the Database
Now that we have a database, we can start exploring it with the _SELECT_ statement. _FROM_ specifies the table to query, and '*' selects all of its columns.

A value of 5 passed to the _LIMIT_ clause limits the output to only 5 records.

In [None]:
%%sql
SELECT *
FROM san_francisco_film_locations
LIMIT 5

These are the column attribute descriptions from the **san_francisco_film_locations** table:

|Column|Description|
|---|---|
|   Title| titles of the films|
|   ReleaseYear| year of public release of the films|
|   Locations| locations in San Francisco where the films were shot|
|   FunFacts| fun facts about the filming locations|
|   ProductionCompany| companies who produced the films|
|   Distributor| companies who distributed the films|
|   Director| people who directed the films|
|   Writer| people who wrote the films|
|   Actor1| person 1 who acted in the films|
|   Actor2| person 2 who acted in the films|
|   Actor3| person 3 who acted in the films|






## Using COUNT Statement
Now, let us go through some examples of COUNT-related queries.
1. Suppose we want to count the number of records or rows of the "san_francisco_film_locations" table. The query for this would be:


In [None]:
%%sql
SELECT COUNT(*) FROM san_francisco_film_locations

2. We want to count the number of locations of the films, but only for the films written by a certain writer. Note that COUNT applied to a column counts only the rows where that column is not NULL. The query for this can be written as:

In [None]:
%%sql
SELECT COUNT(Locations)
FROM san_francisco_film_locations
WHERE Writer = 'James Cameron';
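
The difference between `COUNT(*)` and `COUNT(column)` shows up when a column contains NULL values. The sketch below uses Python's built-in `sqlite3` module with a small made-up table named `films` (the table and its rows are assumptions for illustration, not the lab data).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE films (Title TEXT, Locations TEXT, Writer TEXT)")
conn.executemany(
    "INSERT INTO films VALUES (?, ?, ?)",
    [
        ("Film A", "City Hall", "Writer X"),
        ("Film A", None, "Writer X"),  # location missing (NULL)
        ("Film B", "Russian Hill", "Writer Y"),
    ],
)

# COUNT(*) counts rows, regardless of their content
all_rows = conn.execute("SELECT COUNT(*) FROM films").fetchone()[0]

# WHERE narrows the rows before they are counted
x_rows = conn.execute(
    "SELECT COUNT(*) FROM films WHERE Writer = 'Writer X'"
).fetchone()[0]

# COUNT(column) skips rows where that column is NULL
x_locations = conn.execute(
    "SELECT COUNT(Locations) FROM films WHERE Writer = 'Writer X'"
).fetchone()[0]

print(all_rows, x_rows, x_locations)  # 3 2 1
```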

## Using DISTINCT Statement
In this exercise, you will go through some examples of using DISTINCT in queries.
1. Assume that we want to retrieve the titles of all films in the table so that duplicates will be discarded in the output result set.

In [None]:
%%sql
SELECT DISTINCT Title
FROM san_francisco_film_locations;

2. We want to retrieve the count of release years of the films produced by a specific company so that duplicate release years of those films will be
discarded in the count.

In [None]:
%%sql
SELECT COUNT(DISTINCT ReleaseYear)
FROM san_francisco_film_locations
WHERE ProductionCompany='Warner Bros. Pictures';
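
As a small illustration of both forms, the sketch below runs `SELECT DISTINCT` and `COUNT(DISTINCT ...)` against a made-up `films` table using the built-in `sqlite3` module. The table name, studios, and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE films (Title TEXT, ReleaseYear INTEGER, ProductionCompany TEXT)"
)
conn.executemany(
    "INSERT INTO films VALUES (?, ?, ?)",
    [
        ("Film A", 2001, "Studio X"),
        ("Film A", 2001, "Studio X"),  # same film, second location row
        ("Film B", 2001, "Studio X"),
        ("Film C", 2015, "Studio X"),
        ("Film D", 1999, "Studio Y"),
    ],
)

# DISTINCT collapses repeated titles into one row each
titles = [
    row[0]
    for row in conn.execute("SELECT DISTINCT Title FROM films ORDER BY Title")
]

# COUNT(column) counts every non-NULL value, duplicates included
years_all = conn.execute(
    "SELECT COUNT(ReleaseYear) FROM films WHERE ProductionCompany = 'Studio X'"
).fetchone()[0]

# COUNT(DISTINCT column) counts each value once
years_distinct = conn.execute(
    "SELECT COUNT(DISTINCT ReleaseYear) FROM films "
    "WHERE ProductionCompany = 'Studio X'"
).fetchone()[0]

print(titles)                     # ['Film A', 'Film B', 'Film C', 'Film D']
print(years_all, years_distinct)  # 4 2
```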

## Using LIMIT Statement
In this exercise, you will first go through some examples of using LIMIT in queries.
1. Retrieve only the first 25 rows from the table so that rows other than those are not in the output result set.

In [None]:
%%sql
SELECT * FROM san_francisco_film_locations
LIMIT 25

2. Now, we want to retrieve 15 rows from the table starting from row 11.

In [None]:
%%sql
SELECT *
FROM san_francisco_film_locations
LIMIT 15
OFFSET 10
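
`OFFSET 10` skips the first 10 rows, so `LIMIT 15` then returns rows 11 through 25. The sketch below checks this on a made-up `films` table with 30 numbered rows, using the built-in `sqlite3` module. An `ORDER BY` is added here as an assumption about good practice: without it, the database is free to return rows in any order, so "row 11" is not well defined.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE films (id INTEGER PRIMARY KEY, Title TEXT)")
conn.executemany(
    "INSERT INTO films (id, Title) VALUES (?, ?)",
    [(i, f"Film {i}") for i in range(1, 31)],  # ids 1..30
)

# Skip 10 rows, then return the next 15
page = conn.execute(
    "SELECT id FROM films ORDER BY id LIMIT 15 OFFSET 10"
).fetchall()
ids = [row[0] for row in page]

print(ids[0], ids[-1], len(ids))  # 11 25 15
```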

## Practice Exercises

### COUNT

1. Retrieve the number of locations of the films which are directed by Woody Allen.

<details><summary>Hint</summary>

```
Follow example 2 of the COUNT exercise. Use the WHERE clause comparison operator =, which means Equal to.
```

</details>

<details><summary>Query Solution</summary>

```
%%sql
SELECT COUNT(Locations) FROM san_francisco_film_locations WHERE Director='Woody Allen'
```

</details>

In [None]:
%%sql
SELECT COUNT(Locations)
FROM san_francisco_film_locations
WHERE Director='Woody Allen'

2. Retrieve the number of films shot at Russian Hill.

<details><summary>Hint</summary>

```
Follow example 2 of the COUNT exercise. Use the WHERE clause comparison operator =, which means Equal to.
```

</details>

<details><summary>Query Solution</summary>

```
%%sql
SELECT Count(Title) FROM san_francisco_film_locations WHERE Locations='Russian Hill'
```

</details>

In [None]:
%%sql
SELECT Count(Title)
FROM san_francisco_film_locations
WHERE Locations='Russian Hill'

3. Retrieve the number of rows having a release year earlier than 1950 from the "san_francisco_film_locations" table.

<details><summary>Hint</summary>

```
Follow example 1 of the COUNT exercise. Use the WHERE clause comparison operator <, which means Less than.
```

</details>

<details><summary>Query Solution</summary>

```
%%sql
SELECT Count(*) FROM san_francisco_film_locations WHERE ReleaseYear<1950
```

</details>

In [None]:
%%sql
SELECT Count(*)
FROM san_francisco_film_locations
WHERE ReleaseYear<1950

### DISTINCT

1. Retrieve the names of all unique films released in the 21st century (2001 onwards), along with their release years.

<details><summary>Hint</summary>

```
Follow example 1 of DISTINCT. Use WHERE clause comparison operator >=, which means Greater than or equal to.
```

</details>

<details><summary>Query Solution</summary>

```
%%sql
SELECT DISTINCT Title, ReleaseYear FROM san_francisco_film_locations WHERE ReleaseYear>=2001
```

</details>

In [None]:
%%sql
SELECT DISTINCT Title, ReleaseYear FROM san_francisco_film_locations WHERE ReleaseYear>=2001

2. Retrieve the directors' names and their distinct films shot at City Hall.

<details><summary>Hint</summary>

```
Follow example 1 of DISTINCT. Use WHERE clause comparison operator =, which means Equal to.
```

</details>

<details><summary>Query Solution</summary>

```
%%sql
SELECT DISTINCT Title, Director FROM san_francisco_film_locations WHERE Locations='City Hall'
```

</details>

In [None]:
%%sql
SELECT DISTINCT Title, Director FROM san_francisco_film_locations WHERE Locations='City Hall'

3. Retrieve the number of distributors who distributed films with the 1st actor, Clint Eastwood.

<details><summary>Hint</summary>

```
Follow example 2 of DISTINCT. Use the WHERE clause comparison operator =, which means Equal to.
```

</details>

<details><summary>Query Solution</summary>

```
%%sql
SELECT COUNT(DISTINCT Distributor) FROM san_francisco_film_locations WHERE Actor1='Clint Eastwood'
```

</details>

In [None]:
%%sql
SELECT COUNT(DISTINCT Distributor) FROM san_francisco_film_locations WHERE Actor1='Clint Eastwood'

### LIMIT

1. Retrieve the names of the first 50 films.

<details><summary>Hint</summary>

```
Follow example 1 of LIMIT. Use DISTINCT.
```

</details>

<details><summary>Query Solution</summary>

```
%%sql
SELECT DISTINCT Title FROM san_francisco_film_locations LIMIT 50
```

</details>

In [None]:
%%sql
SELECT DISTINCT Title FROM san_francisco_film_locations LIMIT 50


2. Retrieve the first 10 film names released in 2015.

<details><summary>Hint</summary>

```
Follow example 1 of LIMIT. Use DISTINCT. Use WHERE clause comparison operator =, which means Equal to.
```

</details>

<details><summary>Query Solution</summary>

```
%%sql
SELECT DISTINCT Title FROM san_francisco_film_locations WHERE ReleaseYear=2015 LIMIT 10
```

</details>

In [None]:
%%sql
SELECT DISTINCT Title
FROM san_francisco_film_locations
WHERE ReleaseYear=2015
LIMIT 10

3. Retrieve the next 3 film names that follow after the first 5 films released in 2015.

<details><summary>Hint</summary>

```
Follow example 2 of the LIMIT exercise to learn how to use OFFSET. Use DISTINCT and use the WHERE clause comparison operator =, which
means Equal to.
```

</details>

<details><summary>Query Solution</summary>

```
%%sql
SELECT DISTINCT Title FROM san_francisco_film_locations WHERE ReleaseYear=2015 LIMIT 3 OFFSET 5
```

</details>

In [None]:
%%sql
SELECT DISTINCT Title FROM san_francisco_film_locations WHERE ReleaseYear=2015 LIMIT 3 OFFSET 5
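
The LIMIT exercises above combine DISTINCT with LIMIT. Duplicates are removed first and the row limit is applied to what remains, which is why the hints ask for DISTINCT. The sketch below shows the difference on a made-up `films` table (hypothetical rows) using the built-in `sqlite3` module; the `ORDER BY` is added only to make the output deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE films (Title TEXT, Locations TEXT)")
conn.executemany(
    "INSERT INTO films VALUES (?, ?)",
    [
        ("Film A", "City Hall"),
        ("Film A", "Russian Hill"),  # same film, second location
        ("Film B", "Golden Gate Bridge"),
        ("Film C", "Alcatraz"),
    ],
)

# Without DISTINCT, the limit can be used up by repeated titles
plain = [
    row[0]
    for row in conn.execute("SELECT Title FROM films ORDER BY Title LIMIT 2")
]

# With DISTINCT, duplicates are dropped before the limit is applied
unique = [
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT Title FROM films ORDER BY Title LIMIT 2"
    )
]

print(plain)   # ['Film A', 'Film A']
print(unique)  # ['Film A', 'Film B']
```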

## Conclusion
Congratulations on completing this lab!
You are now able to:

*   Build a database from Internet sources
*   Retrieve the number of rows that match the query criteria using COUNT
*   Remove duplicate values from a result set using DISTINCT
*   Restrict the number of rows retrieved from a table using LIMIT


## Credit to the Source Content
This Python notebook uses the content of **Hands-on Lab: Simple SELECT Statements** by IBM Skills Network from the [Databases and SQL for Data Science with Python](https://www.coursera.org/learn/sql-data-science) course. The source content uses a different platform, [Datasette](https://github.com/simonw/datasette), which does not utilize Jupyter notebooks.


### Change Log
All versions prior to 2.0 are attributable to IBM Skills Network's version of the **Hands-on Lab: Simple SELECT Statements** written for [Datasette](https://github.com/simonw/datasette).


| Date (YYYY-MM-DD) | Version | Changed By    | Change Description        |
| ----------------- | ------- | ------------- | ------------------------- |
|2024-05-06|2.0|Martin Borja|Ported to Jupyter/Python Notebooks|
|2023-10-11|1.8|Steve Hord| QA pass with edits|
|2023-10-01| 1.7| Abhishek Gagneja| Updated instruction set|
|2023-05-11| 1.6| Eric Hao & Vladislav Boyko| Updated Page Frames|
|2023-05-10| 1.5| Eric Hao & Vladislav Boyko| Updated Page Frames|
|2023-05-10| 1.4| Eric Hao & Vladislav Boyko| Updated Page Frames|
|2023-05-05| 1.3| Benny Li| Reformatted and republished|
|2022-07-27| 1.2| Lakshmi Holla| Updated html tag|
|2020-12-23| 1.1| Steve Ryan| ID Review|
|2020-11-24| 1.0| Sandip Saha Joy| Initial version created|

<hr>

### <h4 align="center"> **Hands-on Lab: COUNT DISTINCT LIMIT** © IBM Corporation 2020. All rights reserved. </h4>
### <h4 align="center"> **COUNT DISTINCT LIMIT in SQL using Jupyter Notebooks** © Martin John Hilario Borja 2024. All rights reserved. </h4>
