# Working with different datatypes in SQL
© Explore Data Science Academy

## Learning Objectives

In this train, we will learn about:
- Querying numerical data;
- Querying Datetime data; and
- Querying text data.

## Outline

This train is structured as follows:
- Assessing the datatypes of table columns
- Numeric data
    - Integers 
    - Decimals
- Text data
    - LIKE operator
    - Case
    - Concatenation
- Date/Time data
    - The interval datatype
    - Getting portions of datetime objects
    - Between

## Introduction

SQL databases consist of tables which in turn consist of columns. Each column is allowed to store a single datatype. This datatype is allocated to a given column based on:

1. The **type** of data that needs to be stored in the column.
2. The **size** (in bytes) required to store each datum in the column.


As such, the general rule of thumb is to use the smallest version of the data type that also has enough capacity to reliably support the data to be stored.

For this train, we will discuss some of the common datatypes in databases and SQL queries that will be useful when working with them.  

## Assessing the datatypes of table columns

In this section, we discuss how to find out the datatypes of columns in a given table. This will be useful when we want to use the WHERE clause (i.e. to verify the type of data in the column) or when we need to modify or add information to a given database table. 

First, let's load our sample database:

In [1]:
# load sql magics
%load_ext sql

# load chinook database
%sql sqlite:///chinook.db

Chinook db ER diagram:

![Chinook ERD](https://github.com/Explore-AI/Pictures/blob/master/sqlite-sample-database-color.jpg?raw=true)

_[Image source](https://www.sqlitetutorial.net/sqlite-sample-database/)_

We can show information (including column types) about a table in the database as follows:

**SQLite:**

In [2]:
%%sql

PRAGMA table_info(employees);

 * sqlite:///chinook.db
Done.


cid,name,type,notnull,dflt_value,pk
0,EmployeeId,INTEGER,1,,1
1,LastName,NVARCHAR(20),1,,0
2,FirstName,NVARCHAR(20),1,,0
3,Title,NVARCHAR(30),0,,0
4,ReportsTo,INTEGER,0,,0
5,BirthDate,DATETIME,0,,0
6,HireDate,DATETIME,0,,0
7,Address,NVARCHAR(70),0,,0
8,City,NVARCHAR(40),0,,0
9,State,NVARCHAR(40),0,,0


This query uses the [PRAGMA](https://www.sqlite.org/pragma.html) statement. This is useful for querying metadata of a database.
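
The same metadata can also be read outside the notebook magics. The sketch below is a minimal illustration using Python's built-in `sqlite3` module. The in-memory database and the `demo_employees` table are hypothetical stand-ins and are not part of Chinook.

```python
import sqlite3

# In-memory database with a hypothetical table (not part of Chinook).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE demo_employees (
        EmployeeId INTEGER PRIMARY KEY,
        LastName   NVARCHAR(20) NOT NULL,
        HireDate   DATETIME
    )
    """
)

# Each row of PRAGMA table_info is (cid, name, type, notnull, dflt_value, pk).
column_types = {
    name: declared_type
    for _cid, name, declared_type, _notnull, _default, _pk
    in conn.execute("PRAGMA table_info(demo_employees)")
}
print(column_types)
conn.close()
```

The dictionary maps each column name to its declared type, which is the same information shown in the `type` column of the output above.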

## 1. Numeric data
When using SQL in the context of data science, we will need to be familiar with numerical data in the form of integers and floats.

### 1.1. Integers
SQL offers multiple datatypes for storing integer values (negative and positive whole numbers). These datatypes differ in the range of integers they support and, correspondingly, in the storage they require per value. The most commonly used is `INTEGER`: although it uses only 4 bytes per value and supports integers in a fixed range (roughly $-2^{31}$ to $2^{31}$), it is sufficient for most datasets. The common integer datatypes are:

- `INTEGER` (or `INT` depending on the SQL engine used) - allocates 4 bytes of memory per integer and supports integers between the range $-2^{31}$ to $2^{31}-1$.
- `BIGINT`  - allocates 8 bytes of memory per integer and supports integers between the range $-2^{63}$ to $2^{63}-1$.
- `SMALLINT` - allocates 2 bytes of memory per integer and supports integers between the range $-2^{15}$ to $2^{15}-1$. It suits columns with small values, for example, an employee's age (this value will always be positive and seldom exceed 100).
- `SERIAL` - a special integer datatype (available in PostgreSQL, for example) that auto-increments when rows are added to the table (useful for creating id columns). Like the `INTEGER` data type, the `SERIAL` data type is 4 bytes in size but only supports integers from $1$ to $2^{31}-1$. It also has `BIGSERIAL` and `SMALLSERIAL` variants, which correspond to the `BIGINT` and `SMALLINT` data types respectively.
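
To make the ranges above concrete, here is a minimal Python sketch that derives the signed range from the byte width and picks the smallest type that fits a given range of values, in line with the rule of thumb from the introduction. The helper names `signed_range` and `smallest_int_type` are hypothetical and only serve this illustration.

```python
# Byte widths of the integer types listed above, smallest first.
INT_TYPES = [("SMALLINT", 2), ("INTEGER", 4), ("BIGINT", 8)]


def signed_range(n_bytes):
    """Return the (lowest, highest) signed integer that fits in n_bytes."""
    bits = 8 * n_bytes
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1


def smallest_int_type(lowest, highest):
    """Return the name of the smallest type that can hold lowest..highest."""
    for type_name, n_bytes in INT_TYPES:
        low, high = signed_range(n_bytes)
        if low <= lowest and highest <= high:
            return type_name
    raise ValueError("range does not fit in an 8-byte integer")


print(signed_range(4))                      # bounds of a 4-byte INTEGER
print(smallest_int_type(0, 120))            # e.g. an age column
print(smallest_int_type(0, 3_000_000_000))  # too large for 4 bytes
```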

With this knowledge, let's look at the tracks table in the Chinook database:

In [3]:
%%sql

PRAGMA table_info(tracks);

 * sqlite:///chinook.db
Done.


cid,name,type,notnull,dflt_value,pk
0,TrackId,INTEGER,1,,1
1,Name,NVARCHAR(200),1,,0
2,AlbumId,INTEGER,0,,0
3,MediaTypeId,INTEGER,1,,0
4,GenreId,INTEGER,0,,0
5,Composer,NVARCHAR(220),0,,0
6,Milliseconds,INTEGER,1,,0
7,Bytes,INTEGER,0,,0
8,UnitPrice,"NUMERIC(10,2)",1,,0


We can see that it contains 6 integer columns, the `TrackId`, `AlbumId`, `MediaTypeId`, `GenreId`, `Milliseconds`, and `Bytes` columns. 

Let's take a closer look at what these columns contain:

*Note: we limit our results to 10 rows for legibility.*

In [4]:
%%sql

SELECT TrackId, AlbumId, MediaTypeId, GenreId, Milliseconds, Bytes
FROM tracks
LIMIT 10;

 * sqlite:///chinook.db
Done.


TrackId,AlbumId,MediaTypeId,GenreId,Milliseconds,Bytes
1,1,1,1,343719,11170334
2,2,2,1,342562,5510424
3,3,2,1,230619,3990994
4,3,2,1,252051,4331779
5,3,2,1,375418,6290521
6,1,1,1,205662,6713451
7,1,1,1,233926,7636561
8,1,1,1,210834,6852860
9,1,1,1,203102,6599424
10,1,1,1,263497,8611245


As expected, all these columns contain integers. However, most of them are ID columns, which means that we could generate them ourselves using the `SERIAL` integer types if we had to rebuild the database. A good way to analyse the properties of a given numerical column is to use summary statistics. 

Let's do this for the Bytes and the Milliseconds columns. 

In [5]:
%%sql

SELECT mt.Name, max(t.Bytes) AS "Bytes_Max", min(t.Bytes) AS "Bytes_Min", avg(t.Bytes) AS "Bytes_Mean", 
       max(t.Milliseconds) AS "Milliseconds_Max", min(t.Milliseconds) AS "Milliseconds_Min", avg(t.Milliseconds) AS "Milliseconds_Mean" 
FROM tracks AS t
LEFT JOIN media_types AS mt
ON mt.MediaTypeId = t.MediaTypeId
GROUP BY mt.Name;

 * sqlite:///chinook.db
Done.


Name,Bytes_Max,Bytes_Min,Bytes_Mean,Milliseconds_Max,Milliseconds_Min,Milliseconds_Mean
AAC audio file,6034098,2775071,4476793.818181818,366085,172710,276506.9090909091
MPEG audio file,52490554,38747,8630428.7656559,1612329,1071,265574.28872775217
Protected AAC audio file,11157785,1189062,4663795.573839663,672773,66639,281723.87341772154
Protected MPEG-4 video file,1059546140,20831818,420493713.0140187,5286953,112712,2342940.425233645
Purchased AAC audio file,16454937,2229617,8759372.42857143,493573,51780,260894.7142857143


Since `Bytes` and `Milliseconds` are declared with the same data type, the same limit applies to both of them. In an engine where `INTEGER` is a 4-byte type, this means that we can't store tracks that are larger than $2^{31}-1$ bytes (i.e., 2 147 483 647 bytes or about 2.15 gigabytes) or longer than $2^{31}-1$ milliseconds (i.e., about 597 hours). SQLite itself is more lenient, since its `INTEGER` values can occupy up to 8 bytes. As you will see in later trains, it is important to know the datatypes of the columns we want to insert data into as well as the supported range of values.
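
The 4-byte limit is a property of engines that treat `INTEGER` as a 4-byte type. As a hedged side note, SQLite's own `INTEGER` storage class can use up to 8 bytes per value, so the sketch below stores a value just above $2^{31}-1$ without error. It also converts the 4-byte maximum into the units used in the text. The `demo_tracks` table is a hypothetical in-memory stand-in, not the Chinook table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo_tracks (Bytes INTEGER)")  # hypothetical table

four_byte_max = 2**31 - 1

# Insert a value one larger than the 4-byte maximum.
conn.execute("INSERT INTO demo_tracks VALUES (?)", (four_byte_max + 1,))

stored_value, storage_class = conn.execute(
    "SELECT Bytes, typeof(Bytes) FROM demo_tracks"
).fetchone()
print(stored_value, storage_class)
conn.close()

# The 4-byte maximum expressed in gigabytes and in hours of milliseconds.
max_gigabytes = four_byte_max / 1e9
max_hours = four_byte_max / 1000 / 3600
print(round(max_gigabytes, 2), round(max_hours, 1))
```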

### 1.2. Decimals

SQL also has datatypes for storing decimal numbers. These include:

- `DECIMAL` or `NUMERIC` - stores values exactly to a user-specified precision (total number of digits) and scale (number of digits to the right of the decimal point); the storage size varies with the specified precision. 
- `REAL` - allocates 4 bytes of memory per value and supports roughly 6 to 7 significant decimal digits of precision.
- `DOUBLE PRECISION` - allocates 8 bytes of memory per value and supports about 15 significant decimal digits of precision.
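
The difference between 4-byte and 8-byte floating-point storage can be illustrated in plain Python. This is only a sketch of the underlying number formats, not of any particular SQL engine: it round-trips the value `0.1` through a 4-byte and an 8-byte float using the standard `struct` module.

```python
import struct

value = 0.1

# Round-trip through a 4-byte float (the size used by REAL in engines
# such as PostgreSQL) and an 8-byte float (the size of DOUBLE PRECISION).
as_real = struct.unpack("f", struct.pack("f", value))[0]
as_double = struct.unpack("d", struct.pack("d", value))[0]

print(f"4-byte: {as_real:.17f}")
print(f"8-byte: {as_double:.17f}")
```

The 4-byte version drifts from the original value after about seven significant digits, while the 8-byte version preserves the Python float exactly.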

Let's take a look at the meta information of the invoices table:

In [6]:
%%sql

PRAGMA table_info(invoices);

 * sqlite:///chinook.db
Done.


cid,name,type,notnull,dflt_value,pk
0,InvoiceId,INTEGER,1,,1
1,CustomerId,INTEGER,1,,0
2,InvoiceDate,DATETIME,1,,0
3,BillingAddress,NVARCHAR(70),0,,0
4,BillingCity,NVARCHAR(40),0,,0
5,BillingState,NVARCHAR(40),0,,0
6,BillingCountry,NVARCHAR(40),0,,0
7,BillingPostalCode,NVARCHAR(10),0,,0
8,Total,"NUMERIC(10,2)",1,,0


Evidently, the Total column is the only decimal column in the table. Its data type, `NUMERIC(10,2)`, means that the column supports 10 digits in total (the precision), 2 of which are to the right of the decimal point (the scale). This leaves up to 8 digits to the left of the decimal point.

Let's confirm this:

In [7]:
%%sql

SELECT Total
FROM invoices
LIMIT 10;

 * sqlite:///chinook.db
Done.


Total
1.98
3.96
5.94
8.91
13.86
0.99
1.98
1.98
3.96
5.94


The `NUMERIC` column type is a convenient choice for decimals since it allows the user to specify the desired precision, i.e., we can ask to store more or fewer digits before or after the decimal point.
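
As a rough illustration of what a `NUMERIC(precision, scale)` declaration permits, the sketch below checks whether a value fits using Python's `decimal` module. The helper `fits_numeric` is hypothetical and simplified: engines that enforce the declaration typically round surplus fractional digits rather than reject them, and SQLite does not enforce the declaration at all.

```python
from decimal import Decimal


def fits_numeric(value, precision, scale):
    """Check whether a decimal string fits NUMERIC(precision, scale) exactly."""
    _sign, digits, exponent = Decimal(value).as_tuple()
    fractional_digits = max(0, -exponent)
    integer_digits = max(0, len(digits) + exponent)
    return fractional_digits <= scale and integer_digits <= precision - scale


print(fits_numeric("13.86", 10, 2))         # a typical invoice total
print(fits_numeric("12345678.90", 10, 2))   # 8 digits before the point
print(fits_numeric("123456789.00", 10, 2))  # 9 digits before the point
print(fits_numeric("1.999", 10, 2))         # 3 digits after the point
```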

*Note: size is an extremely important factor when considering what data type to use for a given column, particularly when dealing with numeric data. Using bigger datatypes not only affects the size of the database itself, but also the speed of calculations. This is especially true when dealing with high precision decimal values.*

## 2. Text data

This family of datatypes stores text data in the form of characters or sequences of characters (i.e. strings). Common datatypes for text are:

- `CHAR(n)` - stores fixed-length strings of `n` characters. Shorter strings are padded with whitespace until they have length `n`.
- `VARCHAR(n)` - supports strings of varying lengths up to length `n`. 
- `TEXT` (or `VARCHAR` without a length in some engines) - supports strings of any length.

Let's look at the meta information of the customers table:

In [8]:
%%sql

PRAGMA table_info(customers);

 * sqlite:///chinook.db
Done.


cid,name,type,notnull,dflt_value,pk
0,CustomerId,INTEGER,1,,1
1,FirstName,NVARCHAR(40),1,,0
2,LastName,NVARCHAR(20),1,,0
3,Company,NVARCHAR(80),0,,0
4,Address,NVARCHAR(70),0,,0
5,City,NVARCHAR(40),0,,0
6,State,NVARCHAR(40),0,,0
7,Country,NVARCHAR(40),0,,0
8,PostalCode,NVARCHAR(10),0,,0
9,Phone,NVARCHAR(24),0,,0


As shown above, the table has multiple `NVARCHAR(n)` columns of different lengths. `NVARCHAR(n)` is the national-character variant of `VARCHAR(n)`.
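
One SQLite-specific caveat is worth a quick sketch: SQLite records the declared length but does not enforce it, so a longer string is stored in full. The `demo_customers` table below is a hypothetical in-memory example, and other engines would typically reject or truncate such a value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table with a deliberately short declared length.
conn.execute("CREATE TABLE demo_customers (City NVARCHAR(5))")

long_city = "Mountain View"  # 13 characters, longer than the declared 5
conn.execute("INSERT INTO demo_customers VALUES (?)", (long_city,))

stored_city, stored_length = conn.execute(
    "SELECT City, length(City) FROM demo_customers"
).fetchone()
print(stored_city, stored_length)
conn.close()
```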

### 2.1. The LIKE operator
One of the most useful string operators in SQL is the `LIKE` operator. This allows us to search for a pattern of text within a table column. `LIKE` is used in the `WHERE` clause in conjunction with **wildcards**:
- `%` - represents zero, one, or multiple characters
- `_` - represents a single character

Let's illustrate its use through some examples. 

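1) Let's write a query that will return all tracks whose name contains the text "love". Note that `LIKE` in SQLite is case-insensitive for ASCII letters by default, and that the pattern also matches longer words such as "Loverman".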

In [9]:
%%sql 

SELECT * 
FROM tracks AS t
WHERE t.Name LIKE '%love%'
LIMIT 10;

 * sqlite:///chinook.db
Done.


TrackId,Name,AlbumId,MediaTypeId,GenreId,Composer,Milliseconds,Bytes,UnitPrice
24,Love In An Elevator,5,1,1,"Steven Tyler, Joe Perry",321828,10552051,0.99
56,"Love, Hate, Love",7,1,1,"Jerry Cantrell, Layne Staley",387134,12575396,0.99
195,Let Me Love You Baby,20,1,6,Willie Dixon,175386,5716994,0.99
335,My Love,29,1,9,Jauperi/Zeu Góes,203493,6772813,0.99
341,The Girl I Love She Got Long Black Wavy Hair,30,1,1,Jimmy Page/John Bonham/John Estes/John Paul Jones/Robert Plant,183327,5995686,0.99
345,Whole Lotta Love,30,1,1,Jimmy Page/John Bonham/John Paul Jones/Robert Plant/Willie Dixon,373394,12258175,0.99
413,Loverman,35,1,3,Cave,472764,15446975,0.99
440,Love Gun,37,1,1,Paul Stanley,196257,6424915,0.99
444,Do You Love Me,37,1,1,"Paul Stanley, B. Ezrin, K. Fowley",214987,6976194,0.99
449,Calling Dr. Love,37,1,1,Gene Simmons,225332,7395034,0.99


2) Next, we can write a query that returns all customers whose first name starts with the letter D.

In [11]:
%%sql

SELECT *
FROM customers AS c
WHERE c.FirstName LIKE 'D%';

 * sqlite:///chinook.db
Done.


CustomerId,FirstName,LastName,Company,Address,City,State,Country,PostalCode,Phone,Fax,Email,SupportRepId
8,Daan,Peeters,,Grétrystraat 63,Brussels,,Belgium,1000,+32 02 219 03 03,,daan_peeters@apple.be,4
20,Dan,Miller,,541 Del Medio Avenue,Mountain View,CA,USA,94040-111,+1 (650) 644-3358,,dmiller@comcast.com,4
40,Dominique,Lefebvre,,"8, Rue Hanovre",Paris,,France,75002,+33 01 47 42 71 71,,dominiquelefebvre@gmail.com,4
56,Diego,Gutiérrez,,307 Macacha Güemes,Buenos Aires,,Argentina,1106,+54 (0)11 4311 4333,,diego.gutierrez@yahoo.ar,4


3) Thirdly, let's formulate a query that returns customers whose email addresses end in a dot followed by exactly three characters (e.g. `.com`).

In [13]:
%%sql

SELECT *
FROM customers AS c
WHERE c.Email LIKE '%.___'
LIMIT 5;

 * sqlite:///chinook.db
Done.


CustomerId,FirstName,LastName,Company,Address,City,State,Country,PostalCode,Phone,Fax,Email,SupportRepId
3,François,Tremblay,,1498 rue Bélanger,Montréal,QC,Canada,H2G 1A7,+1 (514) 721-4711,,ftremblay@gmail.com,3
5,František,Wichterlová,JetBrains s.r.o.,Klanova 9/506,Prague,,Czech Republic,14700,+420 2 4172 5555,+420 2 4172 5555,frantisekw@jetbrains.com,4
6,Helena,Holý,,Rilská 3174/6,Prague,,Czech Republic,14300,+420 2 4177 0449,,hholy@gmail.com,5
16,Frank,Harris,Google Inc.,1600 Amphitheatre Parkway,Mountain View,CA,USA,94043-1351,+1 (650) 253-0000,+1 (650) 253-0000,fharris@google.com,4
17,Jack,Smith,Microsoft Corporation,1 Microsoft Way,Redmond,WA,USA,98052-8300,+1 (425) 882-8080,+1 (425) 882-8081,jacksmith@microsoft.com,5
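
The wildcard behaviour can be checked on a small scale. The sketch below uses a hypothetical in-memory `demo_tracks` table and a hypothetical helper `names_like`. It shows `%`, `_`, and how the `ESCAPE` clause lets us search for a literal `%` character.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo_tracks (Name TEXT)")  # hypothetical table
conn.executemany(
    "INSERT INTO demo_tracks VALUES (?)",
    [("Love Gun",), ("Loverman",), ("Glove",), ("Dove",), ("100% Pure",)],
)


def names_like(pattern):
    """Return the names matching a LIKE pattern; backslash escapes a wildcard."""
    rows = conn.execute(
        "SELECT Name FROM demo_tracks WHERE Name LIKE ? ESCAPE '\\' ORDER BY Name",
        (pattern,),
    )
    return [name for (name,) in rows]


contains_love = names_like("%love%")          # 'love' anywhere, any case
four_letters_ending_ove = names_like("_ove")  # exactly one character, then 'ove'
literal_percent = names_like("%\\%%")         # names containing a real '%'
conn.close()

print(contains_love, four_letters_ending_ove, literal_percent)
```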


### 2.2. Case

Depending on how the text data in a column were collected, we may need to consider the case. This is especially relevant when constructing queries that have comparison operators. SQL has the `lower()` and `upper()` functions for converting text to lowercase and uppercase. 

For example, let's write a query that shows all customers who live in the UK.

In [14]:
%%sql

SELECT *
FROM customers
WHERE country = 'United kingdom';

 * sqlite:///chinook.db
Done.


CustomerId,FirstName,LastName,Company,Address,City,State,Country,PostalCode,Phone,Fax,Email,SupportRepId


What went wrong here? If you have keen eyes, you will have noticed that the "k" in "United kingdom" in our query is lowercase, while the `=` comparison is case-sensitive. This is easy to fix in the query itself. However, the same kind of inconsistency can also occur in the data that's already in the column. As such, a better approach is to normalise the case with the `upper()` or `lower()` functions as follows:

In [15]:
%%sql

SELECT *
FROM customers
WHERE lower(country) = 'united kingdom';

 * sqlite:///chinook.db
Done.


CustomerId,FirstName,LastName,Company,Address,City,State,Country,PostalCode,Phone,Fax,Email,SupportRepId
52,Emma,Jones,,202 Hoxton Street,London,,United Kingdom,N1 5LH,+44 020 7707 0707,,emma_jones@hotmail.com,3
53,Phil,Hughes,,113 Lupus St,London,,United Kingdom,SW1V 3EN,+44 020 7976 5722,,phil.hughes@gmail.com,3
54,Steve,Murray,,110 Raeburn Pl,Edinburgh,,United Kingdom,EH4 1HH,+44 0131 315 3300,,steve.murray@yahoo.uk,5
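
In SQLite specifically, an alternative to `lower()` is the built-in `NOCASE` collation, which makes a comparison ignore the case of ASCII letters. The sketch below compares the three approaches on a hypothetical in-memory `demo_customers` table. The helper `first_names` is also hypothetical and is only meant for fixed, trusted conditions like these.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo_customers (FirstName TEXT, Country TEXT)")
conn.executemany(
    "INSERT INTO demo_customers VALUES (?, ?)",
    [("Emma", "United Kingdom"), ("Phil", "UNITED KINGDOM"), ("Dan", "USA")],
)


def first_names(condition):
    """Run a fixed, trusted WHERE condition and return the first names."""
    query = (
        "SELECT FirstName FROM demo_customers "
        f"WHERE {condition} ORDER BY FirstName"
    )
    return [name for (name,) in conn.execute(query)]


exact_match = first_names("Country = 'United kingdom'")
with_lower = first_names("lower(Country) = 'united kingdom'")
with_nocase = first_names("Country = 'United kingdom' COLLATE NOCASE")
conn.close()

print(exact_match, with_lower, with_nocase)
```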


### 2.3. Concatenation

In many SQL engines, two or more strings can be concatenated using the `concat()` function. However, older versions of SQLite do not support this function, so we use the concatenation operator `||` instead.

For example, let's write a query that shows employee first and last names in the same column.

In [16]:
%%sql

SELECT FirstName || ' ' || LastName AS "Full Name"
FROM employees;

 * sqlite:///chinook.db
Done.


Full Name
Andrew Adams
Nancy Edwards
Jane Peacock
Margaret Park
Steve Johnson
Michael Mitchell
Robert King
Laura Callahan
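
One pitfall to keep in mind with `||` is that concatenating with `NULL` yields `NULL`. The sketch below, on a hypothetical in-memory `demo_people` table, shows the problem and one possible workaround using `coalesce()` and `trim()`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo_people (FirstName TEXT, LastName TEXT)")
conn.executemany(
    "INSERT INTO demo_people VALUES (?, ?)",
    [("Andrew", "Adams"), ("Cher", None)],  # second row has no last name
)

# Plain concatenation: a NULL operand makes the whole result NULL.
plain = [
    row[0]
    for row in conn.execute(
        "SELECT FirstName || ' ' || LastName FROM demo_people ORDER BY FirstName"
    )
]

# Replace NULL with an empty string, then trim the leftover space.
safe = [
    row[0]
    for row in conn.execute(
        "SELECT trim(FirstName || ' ' || coalesce(LastName, '')) "
        "FROM demo_people ORDER BY FirstName"
    )
]
conn.close()

print(plain, safe)
```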


## 3. Date/Time data

Depending on the SQL engine, date and time values are stored under the hood as numbers (some combination of `ints` and `floats`) or as text, but they are displayed as strings. SQLite has no dedicated date/time storage class; in Chinook, these values are stored as text. In the value, units are ordered from the largest length of time to the smallest, i.e., year, then month, day, hour, and so on. Common data types for storing dates and times in SQL include:

- `DATE` - stores dates in the format `YYYY-MM-DD`. 
- `DATETIME` -  stores the date and time in the format `YYYY-MM-DD HH:MI:SS`.  
- `TIMESTAMP` - stores the date and time in the format `YYYY-MM-DD HH:MI:SS` (or a unique number depending on the SQL engine).

Let's take a look at the meta information of the employees table:

In [17]:
%%sql

PRAGMA table_info(employees);

 * sqlite:///chinook.db
Done.


cid,name,type,notnull,dflt_value,pk
0,EmployeeId,INTEGER,1,,1
1,LastName,NVARCHAR(20),1,,0
2,FirstName,NVARCHAR(20),1,,0
3,Title,NVARCHAR(30),0,,0
4,ReportsTo,INTEGER,0,,0
5,BirthDate,DATETIME,0,,0
6,HireDate,DATETIME,0,,0
7,Address,NVARCHAR(70),0,,0
8,City,NVARCHAR(40),0,,0
9,State,NVARCHAR(40),0,,0


The table has two `DATETIME` columns: `HireDate` and `BirthDate`. Let's use these to explore queries for date and time data.

### 3.1. The interval datatype
In addition to the data types listed above, some SQL engines (PostgreSQL, for example) also have the `INTERVAL` data type. Intervals measure the period between dates and times. In those engines, we can get an interval column through declaration (i.e. when creating the table) or by performing arithmetic on existing date and time columns. SQLite has no `INTERVAL` type, but we can still do arithmetic on its date columns. For example:

Let's write a query that shows the age of all employees when they were hired:

In [18]:
%%sql

SELECT FirstName, LastName, HireDate - BirthDate AS "Age when hired"
FROM employees
ORDER BY 3;

 * sqlite:///chinook.db
Done.


FirstName,LastName,Age when hired
Jane,Peacock,29
Michael,Mitchell,30
Robert,King,34
Laura,Callahan,36
Steve,Johnson,38
Andrew,Adams,40
Nancy,Edwards,44
Margaret,Park,56


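Note that the "Age when hired" column is not a true `INTERVAL` here. Because SQLite stores these dates as text, the `-` operator converts each value to a number using its leading digits (the year), so the result is simply the difference between the two calendar years, which can differ from the true age by one. In an engine with an `INTERVAL` type, subtracting two dates would yield an interval instead. The same conversion happens with addition, where adding 5 to `HireDate` adds 5 to the year: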

In [19]:
%%sql

SELECT FirstName, LastName, Hiredate + 5
FROM employees
LIMIT 1;

 * sqlite:///chinook.db
Done.


FirstName,LastName,Hiredate + 5
Andrew,Adams,2007
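
For date arithmetic that respects months and days, SQLite provides date and time functions such as `strftime()` and `date()`. The sketch below is one possible approach, using a hypothetical in-memory `demo_employees` table with made-up dates. It compares the plain subtraction with an age that accounts for whether the birthday has passed, and adds five years with a date modifier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo_employees (BirthDate DATETIME, HireDate DATETIME)")
# Hypothetical employee hired a few months before their birthday.
conn.execute(
    "INSERT INTO demo_employees VALUES ('1973-08-29 00:00:00', '2002-04-01 00:00:00')"
)

naive_years, exact_age, five_years_later = conn.execute(
    """
    SELECT
        -- Plain subtraction only compares the leading year digits.
        HireDate - BirthDate,
        -- Year difference, minus 1 if the birthday had not yet occurred.
        strftime('%Y', HireDate) - strftime('%Y', BirthDate)
            - (strftime('%m-%d', HireDate) < strftime('%m-%d', BirthDate)),
        -- Date modifiers shift a date by a calendar amount.
        date(HireDate, '+5 years')
    FROM demo_employees
    """
).fetchone()
conn.close()

print(naive_years, exact_age, five_years_later)
```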


### 3.2. Getting portions of datetime objects
In the context of datetime values stored as text, the `substr()` function (and `left()` in some SQL engines) allows us to extract part of the date or time. We use it by specifying the string, the index of the first character to return (counting from 1), and the number of characters to return, i.e.:

```sql
substr(datetime_column, start_index, length)
```

For example, let's write a query for calculating the month-to-month revenue at Chinook:

In [22]:
%%sql

SELECT substr(InvoiceDate,1,7) AS "Month", round(sum(Total),2) AS "Revenue"
FROM invoices
GROUP BY 1
ORDER BY 1
LIMIT 10;

 * sqlite:///chinook.db
Done.


Month,Revenue
2009-01,35.64
2009-02,37.62
2009-03,37.62
2009-04,37.62
2009-05,37.62
2009-06,37.62
2009-07,37.62
2009-08,37.62
2009-09,37.62
2009-10,37.62


We can do the year to year revenue in a similar way:

In [23]:
%%sql

SELECT substr(InvoiceDate,1,4) AS "Year", round(sum(Total),2) AS "Revenue"
FROM invoices
GROUP BY 1
ORDER BY 1;

 * sqlite:///chinook.db
Done.


Year,Revenue
2009,449.46
2010,481.45
2011,469.58
2012,477.53
2013,450.58
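
As an alternative to counting characters with `substr()`, SQLite's `strftime()` function extracts portions of a date by format code. The sketch below runs both approaches on a hypothetical in-memory `demo_invoices` table with made-up rows and shows that they produce the same monthly totals.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo_invoices (InvoiceDate DATETIME, Total NUMERIC(10,2))")
conn.executemany(
    "INSERT INTO demo_invoices VALUES (?, ?)",
    [
        ("2009-01-01 00:00:00", 1.98),
        ("2009-01-02 00:00:00", 3.96),
        ("2009-02-01 00:00:00", 5.94),
    ],
)

# Month taken as the first 7 characters of the text value.
by_substr = conn.execute(
    "SELECT substr(InvoiceDate, 1, 7), round(sum(Total), 2) "
    "FROM demo_invoices GROUP BY 1 ORDER BY 1"
).fetchall()

# Month taken with a format string: %Y is the year, %m the month.
by_strftime = conn.execute(
    "SELECT strftime('%Y-%m', InvoiceDate), round(sum(Total), 2) "
    "FROM demo_invoices GROUP BY 1 ORDER BY 1"
).fetchall()
conn.close()

print(by_substr)
print(by_strftime)
```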


### 3.3. Between

`BETWEEN` is a boolean operator used to check whether a value lies within a range of values. Both endpoints are included in the range.

For example:

Let's write a query that returns employees that were hired on or after `2002-08-14` and before `2003-10-17`.

In [24]:
%%sql 

SELECT *
FROM employees
WHERE HireDate BETWEEN '2002-08-14' AND '2003-10-17';

 * sqlite:///chinook.db
Done.


EmployeeId,LastName,FirstName,Title,ReportsTo,BirthDate,HireDate,Address,City,State,Country,PostalCode,Phone,Fax,Email
1,Adams,Andrew,General Manager,,1962-02-18 00:00:00,2002-08-14 00:00:00,11120 Jasper Ave NW,Edmonton,AB,Canada,T5K 2N1,+1 (780) 428-9482,+1 (780) 428-3457,andrew@chinookcorp.com
4,Park,Margaret,Sales Support Agent,2.0,1947-09-19 00:00:00,2003-05-03 00:00:00,683 10 Street SW,Calgary,AB,Canada,T2P 5G3,+1 (403) 263-4423,+1 (403) 263-4289,margaret@chinookcorp.com


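We can achieve the same result using standard comparison operators such as `<`, `>`, and `=`. Strictly speaking, `BETWEEN a AND b` is equivalent to `>= a AND <= b`. The strict operators below return the same rows only because the stored values include a time part: as text, `2002-08-14 00:00:00` sorts after `2002-08-14`, and `2003-10-17 00:00:00` sorts after `2003-10-17`.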

In [25]:
%%sql 

SELECT *
FROM employees
WHERE HireDate > '2002-08-14' AND
      HireDate < '2003-10-17';

 * sqlite:///chinook.db
Done.


EmployeeId,LastName,FirstName,Title,ReportsTo,BirthDate,HireDate,Address,City,State,Country,PostalCode,Phone,Fax,Email
1,Adams,Andrew,General Manager,,1962-02-18 00:00:00,2002-08-14 00:00:00,11120 Jasper Ave NW,Edmonton,AB,Canada,T5K 2N1,+1 (780) 428-9482,+1 (780) 428-3457,andrew@chinookcorp.com
4,Park,Margaret,Sales Support Agent,2.0,1947-09-19 00:00:00,2003-05-03 00:00:00,683 10 Street SW,Calgary,AB,Canada,T2P 5G3,+1 (403) 263-4423,+1 (403) 263-4289,margaret@chinookcorp.com
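
Because these dates are compared as text, a stored value with a time part falls just outside a date-only upper bound. The sketch below, on a hypothetical in-memory `demo_employees` table with made-up hire dates, shows the effect and one possible remedy: normalising the column with SQLite's `date()` function before comparing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo_employees (EmployeeId INTEGER, HireDate DATETIME)")
conn.executemany(
    "INSERT INTO demo_employees VALUES (?, ?)",
    [
        (1, "2002-08-14 00:00:00"),  # on the lower bound
        (2, "2003-05-03 00:00:00"),  # inside the range
        (3, "2003-10-17 00:00:00"),  # on the upper bound
        (4, "2004-01-02 00:00:00"),  # outside the range
    ],
)

# Text comparison: '2003-10-17 00:00:00' sorts after '2003-10-17'.
raw_ids = [
    row[0]
    for row in conn.execute(
        "SELECT EmployeeId FROM demo_employees "
        "WHERE HireDate BETWEEN '2002-08-14' AND '2003-10-17' ORDER BY 1"
    )
]

# date() strips the time part, so both bounds are truly inclusive.
normalised_ids = [
    row[0]
    for row in conn.execute(
        "SELECT EmployeeId FROM demo_employees "
        "WHERE date(HireDate) BETWEEN '2002-08-14' AND '2003-10-17' ORDER BY 1"
    )
]
conn.close()

print(raw_ids, normalised_ids)
```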


That's it for date-time data types!

## Conclusion

The concepts covered in this train will be more useful in cases where you have to create your own database tables. Although not comprehensively, we have covered: 

- How to view table meta information in SQLite - this is particularly useful in cases where we need to assess the datatypes of table columns.
- Common column data types in SQL such as: 
    - numerical data - `INTEGER`,`SERIAL`,`DECIMAL`, `REAL`, `DOUBLE PRECISION` 
    - text data - `CHAR(n)`,`VARCHAR(n)`, `VARCHAR`
    - date/time data - `DATE`, `DATETIME`, `INTERVAL`.
- How to construct queries around the above listed datatypes and useful built-in functions for each datatype.


## Additional Links

- [General SQL Datatypes](https://www.sqlite.org/datatype3.html)
- [LIKE operator](https://www.w3schools.com/sql/sql_like.asp)