# Visualization and Modern Data Science

> Basic Queries in SQL

Kuo, Yao-Jen <yaojenkuo@ntu.edu.tw> from [DATAINPOINT](https://www.datainpoint.com)

In [1]:
%LOAD sqlite3 db=../datasets/nba.db timeout=2 shared_cache=true

## SQL Style Guide

## What is a style guide?

> Generally speaking, a (programming) style guide is a written document containing a set of rules or guidelines used when writing source code for a computer program (it could be anything from a web app to desktop software). A particular programming style may differ from language to language. For example, what is considered good practice in SQL may not be appropriate for Python, and vice versa.

## Why adopt a style guide?

> Code deployed in production should look like it was written by a single developer, even if it was written by hundreds. Conforming to a style guide removes unnecessary guesswork and ambiguities. It also allows for a more streamlined creation of code and its maintenance, because we won’t have to think about the style or how we should name a variable - we simply follow instructions.

## What if we are not satisfied with the current style guide adopted by our team?

> If we are new to a team, follow the style guide, no questions asked. We can suggest changes after a certain period of time (the probation period or the first year, say) if there is an obvious advantage; otherwise, just follow the style guide.

## Next, the "common" SQL style guide

![](https://media.giphy.com/media/121em4mM0An4Pe/giphy.gif)

Source: <https://giphy.com/>

## We are adopting the SQL style guide by [Simon Holywell](https://www.simonholywell.com/)

Source: <https://www.sqlstyle.guide/>

## Other SQL style guides

- [SQL Style Guide, GitLab](https://about.gitlab.com/handbook/business-ops/data-team/platform/sql-style-guide/)
- [SQL Style Guide, Mozilla](https://docs.telemetry.mozilla.org/concepts/sql_style.html)

## General Dos:

- Make judicious use of white space and indentation to make code easier to read
- Store ISO-8601 compliant time and date information (`YYYY-MM-DD HH:MM:SS.SSSSS`)
- Use only standard SQL functions instead of vendor specific functions for reasons of portability
- Include comments in SQL code where necessary
    - Use the opening `/*` and closing `*/` for multi-line comments
    - Use `--` for single-line comments

## What is your idea of a perfect date?

![Imgur](https://i.imgur.com/eBYN9Qe.png)

Source: Google Search

## General naming conventions

- Names must begin with a letter and may not end with an underscore
- Do not use a **reserved keyword**
- Only use letters, numbers and underscores
- Do not use multiple consecutive underscores
- Use abbreviations only if they are commonly understood
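As a rough illustration, the rules above can be turned into a small checker. This is a minimal Python sketch, not part of the style guide itself: the function name `is_valid_name` and the tiny `RESERVED` set are hypothetical, and a real check would use the full keyword list of the target database.

```python
import re

# Hypothetical, deliberately tiny sample of reserved keywords.
RESERVED = {"select", "from", "where", "table", "order", "group"}

# A letter first, then only letters, digits, and underscores.
NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def is_valid_name(name: str) -> bool:
    """Return True if `name` follows the naming conventions listed above."""
    if name.lower() in RESERVED:
        return False  # reserved keyword
    if NAME_PATTERN.fullmatch(name) is None:
        return False  # bad first character or illegal character
    if name.endswith("_") or "__" in name:
        return False  # trailing or consecutive underscores
    return True


for candidate in ["career_summaries", "order", "_tmp", "team__id", "players_"]:
    print(candidate, is_valid_name(candidate))
```

The last rule in the list (abbreviations must be commonly understood) is a judgment call and is left to code review.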

## Reserved keywords in SQL

<https://www.w3schools.com/sql/sql_ref_keywords.asp>

## Using reserved keywords in query syntax

- Always use uppercase for the reserved keywords like `SELECT` and `WHERE`
- Avoid abbreviated keywords and use the full-length ones where available

## Right align `SELECT`, `FROM`, etc. in query syntax

In [2]:
SELECT *
  FROM teams
 WHERE confName = 'East'
 LIMIT 5;

isNBAFranchise,isAllStar,city,altCityName,fullName,tricode,teamId,nickname,urlName,teamShortName,confName,divName
1,0,Atlanta,Atlanta,Atlanta Hawks,ATL,1610612737,Hawks,hawks,Atlanta,East,Southeast
1,0,Boston,Boston,Boston Celtics,BOS,1610612738,Celtics,celtics,Boston,East,Atlantic
1,0,Cleveland,Cleveland,Cleveland Cavaliers,CLE,1610612739,Cavaliers,cavaliers,Cleveland,East,Central
1,0,Chicago,Chicago,Chicago Bulls,CHI,1610612741,Bulls,bulls,Chicago,East,Central
1,0,Miami,Miami,Miami Heat,MIA,1610612748,Heat,heat,Miami,East,Southeast


## Include spaces

- before and after equals `=`
- after commas `,`
- surrounding apostrophes (`'`) where not within parentheses or with a trailing comma or semicolon

In [3]:
SELECT *
  FROM teams
 WHERE divName IN ('Atlantic', 'Southeast');

isNBAFranchise,isAllStar,city,altCityName,fullName,tricode,teamId,nickname,urlName,teamShortName,confName,divName
1,0,Atlanta,Atlanta,Atlanta Hawks,ATL,1610612737,Hawks,hawks,Atlanta,East,Southeast
1,0,Boston,Boston,Boston Celtics,BOS,1610612738,Celtics,celtics,Boston,East,Atlantic
1,0,Miami,Miami,Miami Heat,MIA,1610612748,Heat,heat,Miami,East,Southeast
1,0,Brooklyn,Brooklyn,Brooklyn Nets,BKN,1610612751,Nets,nets,Brooklyn,East,Atlantic
1,0,New York,New York,New York Knicks,NYK,1610612752,Knicks,knicks,New York,East,Atlantic
1,0,Orlando,Orlando,Orlando Magic,ORL,1610612753,Magic,magic,Orlando,East,Southeast
1,0,Philadelphia,Philadelphia,Philadelphia 76ers,PHI,1610612755,76ers,sixers,Philadelphia,East,Atlantic
1,0,Toronto,Toronto,Toronto Raptors,TOR,1610612761,Raptors,raptors,Toronto,East,Atlantic
1,0,Washington,Washington,Washington Wizards,WAS,1610612764,Wizards,wizards,Washington,East,Southeast
1,0,Charlotte,Charlotte,Charlotte Hornets,CHA,1610612766,Hornets,hornets,Charlotte,East,Southeast


## Other preferences

- Prefer `BETWEEN` to chaining multiple comparisons with `AND`
- Similarly, prefer `IN()` to multiple `OR` conditions
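The sketch below checks that both preferences are purely stylistic: the shorter form returns the same rows as the longer one. It uses Python's built-in `sqlite3` module with a hypothetical in-memory table `toy_players` holding made-up values, not the `nba.db` file used in this notebook.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical toy table with made-up values, only for illustration.
conn.execute("CREATE TABLE toy_players (name TEXT, pos TEXT, heightMeters REAL)")
conn.executemany(
    "INSERT INTO toy_players VALUES (?, ?, ?)",
    [("A", "G", 1.85), ("B", "F", 2.03), ("C", "C", 2.13), ("D", "G-F", 1.98)],
)


def names(sql: str) -> list:
    """Run a query and return the first column of every row."""
    return [row[0] for row in conn.execute(sql)]


between_names = names("""
    SELECT name
      FROM toy_players
     WHERE heightMeters BETWEEN 1.95 AND 2.05
     ORDER BY name;
""")
and_names = names("""
    SELECT name
      FROM toy_players
     WHERE heightMeters >= 1.95 AND heightMeters <= 2.05
     ORDER BY name;
""")
in_names = names("""
    SELECT name
      FROM toy_players
     WHERE pos IN ('G', 'C')
     ORDER BY name;
""")
or_names = names("""
    SELECT name
      FROM toy_players
     WHERE pos = 'G' OR pos = 'C'
     ORDER BY name;
""")

print(between_names == and_names, in_names == or_names)
```

Note that `BETWEEN` is inclusive on both ends, which is why it matches `>=` and `<=`.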

## Aliasing conventions

- An alias should relate in some way to the object or expression it stands for
- As a rule of thumb, the correlation name should be the first letter of each word in the object’s name
- If there is already a correlation with the same name then append a number
- Always include the `AS` keyword; being explicit makes the query easier to read
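A minimal sketch of these conventions, assuming a hypothetical in-memory copy of `career_summaries` with only two columns and one made-up row. The correlation name `cs` takes the first letter of each word in the table name, and every alias is introduced with `AS`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, cut-down stand-in for the real career_summaries table.
conn.execute("CREATE TABLE career_summaries (personId INTEGER, ppg REAL)")
conn.execute("INSERT INTO career_summaries VALUES (1, 27.0)")

cursor = conn.execute("""
    SELECT cs.personId AS person_id,
           cs.ppg AS points_per_game
      FROM career_summaries AS cs;
""")

# The aliases become the column names of the result set.
column_names = [description[0] for description in cursor.description]
rows = cursor.fetchall()
print(column_names, rows)
```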

## Data Types

## Whenever digging into a new database, check the data dictionary (sometimes called a data schema document)

A document that lists each column; specifies whether it’s a number, character, or other type; and explains the definition of column values.

## Well, in an ideal world...

![](https://media.giphy.com/media/iJ2cRDeQkcPXZiHh53/giphy.gif)

Source: <https://giphy.com/>

## Unfortunately, many organizations don’t create and maintain good documentation

## We can check data types manually via

- `TYPEOF` function for a certain column
- `PRAGMA_TABLE_INFO` function for an entire table

## `TYPEOF` function for a certain column

In [4]:
SELECT TYPEOF(heightMeters),
       TYPEOF(heightFeet)
  FROM players
 LIMIT 1;

TYPEOF(heightMeters),TYPEOF(heightFeet)
real,integer


## `PRAGMA_TABLE_INFO` function for an entire table

In [5]:
SELECT *
  FROM PRAGMA_TABLE_INFO('teams');

cid,name,type,notnull,dflt_value,pk
0,isNBAFranchise,INTEGER,0,,0
1,isAllStar,INTEGER,0,,0
2,city,TEXT,0,,0
3,altCityName,TEXT,0,,0
4,fullName,TEXT,0,,0
5,tricode,TEXT,0,,0
6,teamId,INTEGER,0,,1
7,nickname,TEXT,0,,0
8,urlName,TEXT,0,,0
9,teamShortName,TEXT,0,,0


In [6]:
SELECT * 
  FROM PRAGMA_TABLE_INFO('players');

cid,name,type,notnull,dflt_value,pk
0,firstName,TEXT,0,,0
1,lastName,TEXT,0,,0
2,temporaryDisplayName,TEXT,0,,0
3,personId,INTEGER,0,,1
4,teamId,INTEGER,0,,0
5,jersey,INTEGER,0,,0
6,pos,TEXT,0,,0
7,heightFeet,INTEGER,0,,0
8,heightInches,INTEGER,0,,0
9,heightMeters,REAL,0,,0


In [7]:
SELECT * 
  FROM PRAGMA_TABLE_INFO('career_summaries');

cid,name,type,notnull,dflt_value,pk
0,personId,INTEGER,0,,1
1,tpp,REAL,0,,0
2,ftp,REAL,0,,0
3,fgp,REAL,0,,0
4,ppg,REAL,0,,0
5,rpg,REAL,0,,0
6,apg,REAL,0,,0
7,bpg,REAL,0,,0
8,mpg,REAL,0,,0
9,spg,REAL,0,,0


## The categories we’ll encounter most

- `TEXT`
- `INTEGER`
- `REAL`
- Dates and times
- Boolean
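SQLite has no dedicated storage class for Booleans or for dates and times, so the last two categories are stored using the first three. The sketch below, which uses Python's built-in `sqlite3` module and literal values only (no table from `nba.db`), asks `TYPEOF` how each kind of value is actually represented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

storage_classes = conn.execute("""
    SELECT TYPEOF('East'),             -- text
           TYPEOF(30),                 -- integer
           TYPEOF(2.06),               -- real
           TYPEOF(1 = 1),              -- a Boolean result is an integer (1 or 0)
           TYPEOF(DATE('2021-01-01')), -- DATE() returns an ISO-8601 string
           TYPEOF(NULL);
""").fetchone()

print(storage_classes)
```

This is why columns such as `isNBAFranchise` show up as `INTEGER` in the `PRAGMA_TABLE_INFO` output.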

## Refer to the official documentation for data types in SQLite

<https://www.sqlite.org/datatypes.html>

## Using `CAST` function to convert data type for query results

Before using `CAST`.

In [8]:
SELECT heightMeters -- heightMeters column is recorded as REAL
  FROM players
 LIMIT 5;

heightMeters
2.06
2.01
2.03
2.08
1.98


## After using `CAST`

In [9]:
SELECT CAST(heightMeters AS INTEGER) AS heightMetersInteger -- heightMeters column is now displayed as INTEGER
  FROM players
 LIMIT 5;

heightMetersInteger
2
2
2
2
1


## However, `CAST` only changes the query result, not the table column

In [10]:
SELECT heightMeters -- heightMeters column is recorded as REAL
  FROM players
 LIMIT 5;

heightMeters
2.06
2.01
2.03
2.08
1.98
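One detail worth knowing: when `CAST` converts a `REAL` to an `INTEGER` in SQLite, it truncates toward zero instead of rounding, which is why 1.98 became 1 in the earlier result. A minimal sketch using Python's built-in `sqlite3` module and literal values (no table needed):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

cast_results = conn.execute("""
    SELECT CAST(1.98 AS INTEGER),  -- truncates toward zero
           CAST(-1.98 AS INTEGER), -- also toward zero, not down
           ROUND(1.98),            -- rounds, and stays REAL
           CAST('2.06' AS REAL);   -- text to number
""").fetchone()

print(cast_results)
```

If the nearest whole number is what we want, `ROUND` is usually the better fit than `CAST`.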


## Basic Operators and Functions for `TEXT`

## We've met the relational operators in `WHERE` for `TEXT`

- `=`: Equal to
- `!=`: Not equal to
- `IN`: Match one of a set of values
- `LIKE`: Match a pattern
- `NOT`: Negates a condition

## Filtering rows where `confName` equals `West` in `teams`

In [11]:
SELECT *
  FROM teams
 WHERE confName = 'West';

isNBAFranchise,isAllStar,city,altCityName,fullName,tricode,teamId,nickname,urlName,teamShortName,confName,divName
1,0,New Orleans,New Orleans,New Orleans Pelicans,NOP,1610612740,Pelicans,pelicans,New Orleans,West,Southwest
1,0,Dallas,Dallas,Dallas Mavericks,DAL,1610612742,Mavericks,mavericks,Dallas,West,Southwest
1,0,Denver,Denver,Denver Nuggets,DEN,1610612743,Nuggets,nuggets,Denver,West,Northwest
1,0,Golden State,Golden State,Golden State Warriors,GSW,1610612744,Warriors,warriors,Golden State,West,Pacific
1,0,Houston,Houston,Houston Rockets,HOU,1610612745,Rockets,rockets,Houston,West,Southwest
1,0,LA,LA Clippers,LA Clippers,LAC,1610612746,Clippers,clippers,LA Clippers,West,Pacific
1,0,Los Angeles,Los Angeles Lakers,Los Angeles Lakers,LAL,1610612747,Lakers,lakers,L.A. Lakers,West,Pacific
1,0,Minnesota,Minnesota,Minnesota Timberwolves,MIN,1610612750,Timberwolves,timberwolves,Minnesota,West,Northwest
1,0,Phoenix,Phoenix,Phoenix Suns,PHX,1610612756,Suns,suns,Phoenix,West,Pacific
1,0,Portland,Portland,Portland Trail Blazers,POR,1610612757,Trail Blazers,blazers,Portland,West,Northwest


## Or getting the same result the opposite way

In [12]:
SELECT *
  FROM teams
 WHERE confName != 'East';

isNBAFranchise,isAllStar,city,altCityName,fullName,tricode,teamId,nickname,urlName,teamShortName,confName,divName
1,0,New Orleans,New Orleans,New Orleans Pelicans,NOP,1610612740,Pelicans,pelicans,New Orleans,West,Southwest
1,0,Dallas,Dallas,Dallas Mavericks,DAL,1610612742,Mavericks,mavericks,Dallas,West,Southwest
1,0,Denver,Denver,Denver Nuggets,DEN,1610612743,Nuggets,nuggets,Denver,West,Northwest
1,0,Golden State,Golden State,Golden State Warriors,GSW,1610612744,Warriors,warriors,Golden State,West,Pacific
1,0,Houston,Houston,Houston Rockets,HOU,1610612745,Rockets,rockets,Houston,West,Southwest
1,0,LA,LA Clippers,LA Clippers,LAC,1610612746,Clippers,clippers,LA Clippers,West,Pacific
1,0,Los Angeles,Los Angeles Lakers,Los Angeles Lakers,LAL,1610612747,Lakers,lakers,L.A. Lakers,West,Pacific
1,0,Minnesota,Minnesota,Minnesota Timberwolves,MIN,1610612750,Timberwolves,timberwolves,Minnesota,West,Northwest
1,0,Phoenix,Phoenix,Phoenix Suns,PHX,1610612756,Suns,suns,Phoenix,West,Pacific
1,0,Portland,Portland,Portland Trail Blazers,POR,1610612757,Trail Blazers,blazers,Portland,West,Northwest


## Filtering `divName` in 'Atlantic', 'Southwest' in `teams`

In [13]:
SELECT *
  FROM teams
 WHERE divName IN ('Atlantic', 'Southwest');

isNBAFranchise,isAllStar,city,altCityName,fullName,tricode,teamId,nickname,urlName,teamShortName,confName,divName
1,0,Boston,Boston,Boston Celtics,BOS,1610612738,Celtics,celtics,Boston,East,Atlantic
1,0,New Orleans,New Orleans,New Orleans Pelicans,NOP,1610612740,Pelicans,pelicans,New Orleans,West,Southwest
1,0,Dallas,Dallas,Dallas Mavericks,DAL,1610612742,Mavericks,mavericks,Dallas,West,Southwest
1,0,Houston,Houston,Houston Rockets,HOU,1610612745,Rockets,rockets,Houston,West,Southwest
1,0,Brooklyn,Brooklyn,Brooklyn Nets,BKN,1610612751,Nets,nets,Brooklyn,East,Atlantic
1,0,New York,New York,New York Knicks,NYK,1610612752,Knicks,knicks,New York,East,Atlantic
1,0,Philadelphia,Philadelphia,Philadelphia 76ers,PHI,1610612755,76ers,sixers,Philadelphia,East,Atlantic
1,0,San Antonio,San Antonio,San Antonio Spurs,SAS,1610612759,Spurs,spurs,San Antonio,West,Southwest
1,0,Toronto,Toronto,Toronto Raptors,TOR,1610612761,Raptors,raptors,Toronto,East,Atlantic
1,0,Memphis,Memphis,Memphis Grizzlies,MEM,1610612763,Grizzlies,grizzlies,Memphis,West,Southwest


## Filtering rows where `divName` ends with `ic` using `LIKE '%ic'` in `teams`

In [14]:
SELECT *
  FROM teams
 WHERE divName LIKE '%ic';

isNBAFranchise,isAllStar,city,altCityName,fullName,tricode,teamId,nickname,urlName,teamShortName,confName,divName
1,0,Boston,Boston,Boston Celtics,BOS,1610612738,Celtics,celtics,Boston,East,Atlantic
1,0,Golden State,Golden State,Golden State Warriors,GSW,1610612744,Warriors,warriors,Golden State,West,Pacific
1,0,LA,LA Clippers,LA Clippers,LAC,1610612746,Clippers,clippers,LA Clippers,West,Pacific
1,0,Los Angeles,Los Angeles Lakers,Los Angeles Lakers,LAL,1610612747,Lakers,lakers,L.A. Lakers,West,Pacific
1,0,Brooklyn,Brooklyn,Brooklyn Nets,BKN,1610612751,Nets,nets,Brooklyn,East,Atlantic
1,0,New York,New York,New York Knicks,NYK,1610612752,Knicks,knicks,New York,East,Atlantic
1,0,Philadelphia,Philadelphia,Philadelphia 76ers,PHI,1610612755,76ers,sixers,Philadelphia,East,Atlantic
1,0,Phoenix,Phoenix,Phoenix Suns,PHX,1610612756,Suns,suns,Phoenix,West,Pacific
1,0,Sacramento,Sacramento,Sacramento Kings,SAC,1610612758,Kings,kings,Sacramento,West,Pacific
1,0,Toronto,Toronto,Toronto Raptors,TOR,1610612761,Raptors,raptors,Toronto,East,Atlantic
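`LIKE` has two wildcards: `%` matches any sequence of characters (including none) and `_` matches exactly one character. In SQLite, `LIKE` is also case-insensitive for ASCII letters by default. The sketch below illustrates this with Python's built-in `sqlite3` module and a hypothetical one-column table `toy_divisions`, not the real `teams` table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical toy table holding a few division names.
conn.execute("CREATE TABLE toy_divisions (divName TEXT)")
conn.executemany(
    "INSERT INTO toy_divisions VALUES (?)",
    [("Atlantic",), ("Pacific",), ("Central",), ("Southeast",)],
)


def matches(pattern_clause: str) -> list:
    """Return the division names satisfying the given WHERE condition."""
    sql = f"SELECT divName FROM toy_divisions WHERE {pattern_clause} ORDER BY divName;"
    return [row[0] for row in conn.execute(sql)]


ends_with_ic = matches("divName LIKE '%ic'")      # anything, then 'ic'
starts_with_pac = matches("divName LIKE 'pac%'")  # lowercase pattern still matches
second_letter_e = matches("divName LIKE '_e%'")   # one character, then 'e'
not_ending_ic = matches("divName NOT LIKE '%ic'")

print(ends_with_ic, starts_with_pac, second_letter_e, not_ending_ic)
```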


## Making uppercase with `UPPER`

In [15]:
SELECT UPPER(firstName) AS upperFirstName,
       UPPER(lastName) AS upperLastName
  FROM players
 LIMIT 5;

upperFirstName,upperLastName
LEBRON,JAMES
CARMELO,ANTHONY
UDONIS,HASLEM
DWIGHT,HOWARD
ANDRE,IGUODALA


## Concatenating with `||` operator

In [16]:
SELECT firstName || ' ' || lastName AS fullName
  FROM players
 LIMIT 5;

fullName
LeBron James
Carmelo Anthony
Udonis Haslem
Dwight Howard
Andre Iguodala
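A caveat when concatenating: if any operand of `||` is `NULL`, the whole result is `NULL`. The sketch below shows the effect and one possible workaround with `COALESCE`. It uses Python's built-in `sqlite3` module with literal values, and the single-name player "Zed" is made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

concatenated = conn.execute("""
    SELECT 'LeBron' || ' ' || 'James',           -- both parts present
           'Zed' || ' ' || NULL,                 -- a NULL operand makes the result NULL
           'Zed' || COALESCE(' ' || NULL, '');   -- fall back to an empty string
""").fetchone()

# sqlite3 maps SQL NULL to Python's None.
print(concatenated)
```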


## Measuring the number of characters with `LENGTH`

In [17]:
SELECT city,
       LENGTH(city) AS number_of_characters
  FROM teams
 LIMIT 5;

city,number_of_characters
Atlanta,7
Boston,6
Cleveland,9
New Orleans,11
Chicago,7


## Common functions for `TEXT` in SQLite

<https://www.sqlitetutorial.net/sqlite-string-functions/>

## Basic Operators and Functions for Numeric

## Numeric operators available in SQL

- `+`, `-`, `*`, `/` are straightforward
- `%` (modulo) returns just the remainder of a division

In [18]:
SELECT 55 + 66,
       55 - 66,
       55 * 6,
       55 / 66, -- integer division discards the fractional part
       55 % 6;

55 + 66,55 - 66,55 * 6,55 / 66,55 % 6
121,-11,330,0,1


## Mind the order of operations

1. Exponents and roots
2. Multiplication, division, modulo
3. Addition and subtraction

In [19]:
SELECT 100.0 * 9/5 + 32,
       212.0 - 32 * 5/9,
       (212.0 - 32) * 5/9;

100.0 * 9/5 + 32,212.0 - 32 * 5/9,(212.0 - 32) * 5/9
212.0,195.0,100.0
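The 195.0 in the middle column is a combination of two rules: `*` and `/` are evaluated before `-`, and dividing one integer by another discards the fractional part. The sketch below breaks the expression apart using Python's built-in `sqlite3` module and literal values only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

arithmetic = conn.execute("""
    SELECT 55 / 66,             -- integer / integer: fraction is discarded
           55.0 / 66,           -- one REAL operand gives a REAL result
           32 * 5/9,            -- 160 / 9 as integers
           212.0 - 32 * 5/9,    -- the integer part above is subtracted last
           (212.0 - 32) * 5/9;  -- parentheses force the subtraction first
""").fetchone()

print(arithmetic)
```

Writing one operand as a decimal (for example `5.0/9`) is a simple way to avoid accidental integer division.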


## Getting `bmi` for each NBA player in `players`

\begin{equation}
BMI = \frac{weight_{kg}}{height^2_m}
\end{equation}

In [20]:
SELECT firstName,
       lastName,
       weightKilograms / (heightMeters * heightMeters) AS bmi
  FROM players
 ORDER BY bmi DESC
 LIMIT 5;

firstName,lastName,bmi
Zion,Williamson,31.8803990000248
Jusuf,Nurkic,29.5366231665955
Jarrell,Brantley,29.5189504373178
Eric,Paschall,29.5122946638098
Udoka,Azubuike,29.3546597633136


## Rounding `bmi` to 2 decimal places for each NBA player in `players` with `ROUND`

In [21]:
SELECT firstName,
       lastName,
       ROUND(weightKilograms / (heightMeters * heightMeters), 2) AS bmi
  FROM players
 ORDER BY bmi DESC
 LIMIT 5;

firstName,lastName,bmi
Zion,Williamson,31.88
Jusuf,Nurkic,29.54
Jarrell,Brantley,29.52
Eric,Paschall,29.51
Udoka,Azubuike,29.35


## Is BMI (Body Mass Index) a good metric for measuring your fitness level?

![](https://media.giphy.com/media/jkZd4Z7ImNxNEWF1Z8/giphy.gif)

Source: <https://giphy.com/>

## Common functions for numeric in SQLite

<https://www.sqlite.org/lang_mathfunc.html>

## Summarizing and Grouping

## The functions for text and numeric values we've been using so far belong to a category called "universal functions" (more commonly known as scalar functions)

## We can roughly divide the SQL function-verse into 2 subcategories

1. Universal functions
2. Aggregate functions

## What is the difference between these 2 categories?

The major difference is whether the number of output rows equals the number of input rows.

## What does "the number of output rows equals the number of input rows" mean?

Say we apply the `ROUND` function to the height of each NBA player.

In [22]:
SELECT firstName,
       lastName,
       heightMeters,
       ROUND(heightMeters) AS heightMetersRounded
  FROM players
 LIMIT 5;

firstName,lastName,heightMeters,heightMetersRounded
LeBron,James,2.06,2.0
Carmelo,Anthony,2.01,2.0
Udonis,Haslem,2.03,2.0
Dwight,Howard,2.08,2.0
Andre,Iguodala,1.98,2.0


## The number of `heightMetersRounded` rows equals the number of input `heightMeters` rows, so `ROUND` is a "universal function".

## Aggregate functions combine values from multiple rows and return a single result based on an operation on those values

Say we want to know the average height of NBA players.

In [23]:
SELECT AVG(heightMeters) AS avgHeightMeters
  FROM players;

avgHeightMeters
1.98981744421906


## `avgHeightMeters` takes only 1 row, which does not equal the number of input `heightMeters` rows, so `AVG` is an "aggregate function".
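The difference can also be checked by counting rows. The sketch below uses Python's built-in `sqlite3` module and a hypothetical three-row table `toy_players` with made-up heights: `ROUND` returns one row per input row, while `AVG` collapses them into a single row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical toy table with three made-up heights.
conn.execute("CREATE TABLE toy_players (heightMeters REAL)")
conn.executemany(
    "INSERT INTO toy_players VALUES (?)",
    [(1.75,), (2.0,), (2.25,)],
)

rounded_rows = conn.execute(
    "SELECT ROUND(heightMeters) FROM toy_players;"
).fetchall()
averaged_rows = conn.execute(
    "SELECT AVG(heightMeters) FROM toy_players;"
).fetchall()

print(len(rounded_rows), "rows from ROUND")
print(len(averaged_rows), "row from AVG")
```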

## Counting rows using `COUNT(*)`

Before querying, a sensible first step is to make sure the table has the expected number of rows.

In [24]:
SELECT COUNT(*)
  FROM teams;

COUNT(*)
30


In [25]:
SELECT COUNT(*)
  FROM players;

COUNT(*)
493


In [26]:
SELECT COUNT(*)
  FROM career_summaries;

COUNT(*)
493


## Counting non-null observations using `COUNT(column_name)`

In [27]:
SELECT COUNT(points) AS number_of_players_with_scoring_record
  FROM career_summaries;

number_of_players_with_scoring_record
490
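The gap between 493 and 490 comes from `NULL` values: `COUNT(*)` counts rows, while `COUNT(column_name)` skips rows where that column is `NULL`. A minimal sketch with Python's built-in `sqlite3` module and a hypothetical table `toy_summaries` containing one missing value:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical toy table; None is stored as SQL NULL.
conn.execute("CREATE TABLE toy_summaries (personId INTEGER, ppg REAL)")
conn.executemany(
    "INSERT INTO toy_summaries VALUES (?, ?)",
    [(1, 10.0), (2, None), (3, 20.0)],
)

counts = conn.execute("""
    SELECT COUNT(*),    -- every row
           COUNT(ppg),  -- only rows where ppg is not NULL
           AVG(ppg)     -- aggregates also ignore NULL
      FROM toy_summaries;
""").fetchone()

print(counts)
```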


## How do we count the number of columns in a table?

## Using metadata!

In [28]:
SELECT COUNT(*)
  FROM PRAGMA_TABLE_INFO('teams');

COUNT(*)
12


## Finding maximum and minimum values

In [29]:
SELECT MAX(heightMeters),
       MIN(heightMeters)
  FROM players;

MAX(heightMeters),MIN(heightMeters)
2.26,1.78


## Common aggregate functions in SQLite

<https://sqlite.org/lang_aggfunc.html>

## Aggregating data using `GROUP BY`

`GROUP BY` on its own eliminates duplicate values from the results, similar to `DISTINCT`. The rows often come back sorted as well, as with `ORDER BY`, but only an explicit `ORDER BY` guarantees the order.

In [30]:
SELECT DISTINCT confName,
       divName
  FROM teams
 ORDER BY confName,
          divName;

confName,divName
East,Atlantic
East,Central
East,Southeast
West,Northwest
West,Pacific
West,Southwest


In [31]:
SELECT confName,
       divName
  FROM teams
 GROUP BY confName,
          divName;

confName,divName
East,Atlantic
East,Central
East,Southeast
West,Northwest
West,Pacific
West,Southwest


## Combining `GROUP BY` with `COUNT`

In [32]:
SELECT country,
       COUNT(*)
  FROM players
 GROUP BY country
 ORDER BY COUNT(*) DESC
 LIMIT 5;

country,COUNT(*)
USA,377
Canada,18
France,12
Australia,9
Serbia,6


## Combining `GROUP BY` with `AVG`

In [33]:
SELECT pos,
       ROUND(AVG(heightMeters), 2) AS avgHeightMeters
  FROM players
 GROUP BY pos
 ORDER BY avgHeightMeters;

pos,avgHeightMeters
G,1.91
G-F,1.98
F-G,2.0
F,2.02
F-C,2.08
C-F,2.1
C,2.12


## Filtering an aggregate query using `HAVING`

We are already familiar with using `WHERE` for filtering, but aggregate functions can’t be used within a `WHERE` clause: `WHERE` operates at the row level, before any grouping happens, whereas aggregate functions work across rows. `HAVING` filters the groups after aggregation.
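A minimal sketch of this restriction, using Python's built-in `sqlite3` module and a hypothetical table `toy_players` with made-up rows. The first query puts an aggregate in `WHERE` and is expected to be rejected by SQLite; the second moves the same condition to `HAVING`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical toy table with made-up rows.
conn.execute("CREATE TABLE toy_players (name TEXT, country TEXT)")
conn.executemany(
    "INSERT INTO toy_players VALUES (?, ?)",
    [("A", "USA"), ("B", "USA"), ("C", "USA"),
     ("D", "Canada"), ("E", "Canada"), ("F", "France")],
)

try:
    conn.execute("""
        SELECT country
          FROM toy_players
         WHERE COUNT(*) >= 2
         GROUP BY country;
    """)
    where_failed = False
except sqlite3.Error as error:
    where_failed = True
    print("Aggregate in WHERE was rejected:", error)

having_rows = conn.execute("""
    SELECT country,
           COUNT(*)
      FROM toy_players
     GROUP BY country
    HAVING COUNT(*) >= 2
     ORDER BY COUNT(*) DESC;
""").fetchall()

print(having_rows)
```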

## Combining `GROUP BY`, `COUNT`, and `HAVING`

In [34]:
SELECT country,
       COUNT(*)
  FROM players
 GROUP BY country
HAVING COUNT(*) >= 4
 ORDER BY COUNT(*) DESC;

country,COUNT(*)
USA,377
Canada,18
France,12
Australia,9
Serbia,6
Germany,6
Turkey,4
Spain,4
Croatia,4


## Combining `GROUP BY`, `AVG`, and `HAVING`

In [35]:
SELECT pos,
       ROUND(AVG(heightMeters), 2) AS avgHeightMeters
  FROM players
 GROUP BY pos
HAVING ROUND(AVG(heightMeters), 2) > 2
 ORDER BY avgHeightMeters;

pos,avgHeightMeters
F,2.02
F-C,2.08
C-F,2.1
C,2.12


## Putting what we have so far all together

SQL requires clauses to appear in a fixed order, so follow this template:

```sql
SELECT column_names
  FROM table_name
 WHERE conditions
 GROUP BY column_names
HAVING aggregate_conditions
 ORDER BY column_names;
```
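To see every clause working together, here is a minimal sketch using Python's built-in `sqlite3` module. The table `toy_players` and its rows are hypothetical. `WHERE` filters rows first, `GROUP BY` forms the groups, `HAVING` filters the groups, and `ORDER BY` and `LIMIT` shape the final output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical toy table with made-up rows.
conn.execute("CREATE TABLE toy_players (pos TEXT, country TEXT, heightMeters REAL)")
conn.executemany(
    "INSERT INTO toy_players VALUES (?, ?, ?)",
    [("G", "USA", 1.90), ("G", "USA", 1.94), ("G", "Canada", 1.88),
     ("F", "USA", 2.02), ("F", "USA", 2.06), ("C", "USA", 2.13)],
)

summary = conn.execute("""
    SELECT pos,
           COUNT(*) AS number_of_players,
           ROUND(AVG(heightMeters), 2) AS avgHeightMeters
      FROM toy_players
     WHERE country = 'USA'
     GROUP BY pos
    HAVING COUNT(*) >= 2
     ORDER BY avgHeightMeters DESC
     LIMIT 5;
""").fetchall()

for pos, number_of_players, avg_height in summary:
    print(pos, number_of_players, avg_height)
```

The Canadian guard is removed by `WHERE` before grouping, and the single center is removed by `HAVING` after grouping, so only the `F` and `G` groups remain.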