A Gentle Introduction to SQL Using SQLite Part III
Spreading the data around: Data Normalization
All of the queries we’ve run so far are limited to a single table. If all you ever do is import data from a spreadsheet, then you could always limit your queries to a single table. But most data projects of any depth soon involve multiple database tables.
Why would you want to separate the data into different tables? Well let’s think back for a moment to the description of relational databases from Part I:
- Data is organized into tables (relations) that represent a collection of similar objects (e.g. contributors).
- The columns of the table represent the attributes that members of the collection share (last name, home address, amount of contribution).
- Each row in the table represents an individual member of the collection (one contributor).
- And the values in the row represent the attributes of that individual (Smith, 1228 Laurel St., $250).
So, a table represents a set of similar objects, and the objects all share certain attributes. But we could stretch that definition quite a bit: Contributors all have addresses, but they also have recipients (the candidates who received the contributions). Should we include in our contributors table the candidate name, the campaign address and phone number, the office sought, the state in which the candidate is running, etc? What about the candidate’s treasurer’s name? Committee positions the candidate holds? Previous offices held?
Including all of this loosely related data in a single table takes us pretty far afield of the original relation (Contributor). We would also be storing a lot of redundant data (all of the candidate data would be repeated for each contribution to a candidate). As a result, it could become difficult to update the data. Changing a candidate’s address, for example, would require a change to each row containing a contributor to that candidate. In addition, it would become increasingly difficult to spot any data entry errors. Each misspelling of a candidate’s name would be like adding a new candidate, and it would be easier to overlook the error amidst all the repeated data. Finally, all of this redundancy means we’re taking up more disk space than needed. (This last isn’t as big a concern as it once was when disk space was more expensive, but it can present problems.)
So, in order to help ensure data integrity, to keep tables logically coherent and to reduce disk usage, most database designers implement some degree of data normalization. There are varying degrees of normalization, known as the “normal forms,” but for practical purposes the goal is to remove repetition and to keep only clearly related data in the same table.
So, let’s go back to our hypothetically bulky contributors table and do some minimal normalization. Let’s begin with a table that looks like this:
|. last name|.first name|.street|.city|.state|_.zip|.amount|.date|.candidate last name|.candidate first name|.candidate party|
|Agee|Steven|549 Laurel Branch Road|Floyd|VA|24091|500.00|2007-06-30|Huckabee|Mike|R|
|Ahrens|Don|4034 Rennellwood Way|Pleasanton|CA|94566|250.00|2007-05-16|Huckabee|Mike|R|
Even with only two sample rows, it’s easy to see the redundancy here. Any place we see repetition is a potential site of some normalization. Also, it’s fairly clear that the table
really represents two different relations (contributors and candidates). So, one approach to restructuring this data is to create contributors and candidates tables and separate the
To get started, let’s create a fresh database. Start up Firefox and open the SQLite Manager:
When asked if you want to open the last database, click “Cancel,” so you can create a new database:
First, we’ll create the candidates table. Here’s the SQL
CREATE TABLE statement to execute:
CREATE TABLE "candidates" ("id" INTEGER PRIMARY KEY NOT NULL, "first_name" VARCHAR NOT NULL, "last_name" VARCHAR NOT NULL , "middle_name" VARCHAR, "party" VARCHAR NOT NULL )
This should all be old hat by now. We’re just creating a table for the candidates, including some basic information (name and party), and adding a PRIMARY KEY, a unique identifier for each candidate.
Now, let’s add some data to that table. Grab the text file at https://github.com/tthibo/SQL-Tutorial/raw/master/tutorial_files/candidates.txt and import
it using the Import icon () as you did in Part I.
- Remember to check the “First row contains column names” check box.
- And set the “Fields separated by” value to “Pipe (|).”
The import should add 17 rows to your candidates table:
So now, rather than having candidate data included with each row of the contributor data, we have one row for each candidate. It’s a much cleaner data structure.
Referentially speaking: Associating tables using foreign keys
So, now we have the candidates table, but we also have a problem. Now that we’ve moved the candidate data out of the contributors table, how do we link contributors to their candidates? Without this link, we have no way of running queries that give, for example, total contributions per candidate. To create this reference between the two tables, we’ll need a common field that the two tables share. The standard way of setting up this relationship is to include the Primary Key from the referenced table as a field in the referencing table. The new column in the referencing table is known as a Foreign Key.
Simply creating this foreign key column in the referencing table would be enough to let us run queries across both tables, but SQL also allows us to explicitly declare the foreign key and thus enforce this reference at the database level.
So, let’s create a new contributors table, but in addition to the data about the contributor, let’s add a candidate_id field and let SQLite know that it is a foreign key referencing the id column in the candidates table:
CREATE TABLE "contributors" ( "id" INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL, "last_name" VARCHAR, "first_name" VARCHAR, "middle_name" VARCHAR, "street_1" VARCHAR, "street_2" VARCHAR, "city" VARCHAR, "state" VARCHAR, "zip" VARCHAR, "amount" INTEGER, "date" DATETIME, "candidate_id" INTEGER NOT NULL, FOREIGN KEY(candidate_id) REFERENCES candidates(id) );
Notice the last two lines of that
CREATE statement. The penultimate line adds the candidate_id column, defines it as an integer, and makes it a required field (it cannot be null). The final line defines candidate_id as a foreign key referencing the id column in the candidates table.
Now SQLite will enforce this reference, and if we try to enter a row in the contributors table without a candidate_id or using a candidate_id that doesn’t actually appear in the candidates table, we’ll get an error. In other words, every contributor must now have a candidate, and that candidate must already exist in the candidates table.
Now let’s add some contributor data to the table. Download the text file at https://github.com/tthibo/SQL-Tutorial/raw/master/tutorial_files/contributors_with_candidate_id.txt and import it into the contributors table:
NOTE: You’ll have to set the table name to contributors. Otherwise, SQLite will create a new table called contributors_with_candidate_id, based on the name of the text file.
Also, you’ll have to check the “First row contains column names” box and set the separator to “Pipe(|).”
You should now have 175 rows in the contributors table:
(A quick aside about the text file: it includes an empty “id” column. SQLite will insert an autoincremented id into this field when importing the data, so we’ll have unique Primary Key values. For the candidates table, we specified the id field in each row so that they would match the candidate_id values in this data. In a real project, we would probably use autoincrementing values for the ids in the candidates table, and populating the candidate_id field in the contributors table with the appropriate value would be a separate task.)
Reaching across the aisles using JOINS
One way to run a query that uses data from two different tables is to use a subquery.
For example, to find all of the contributors to Barack Obama, you can do something like this:
SELECT * FROM contributors WHERE candidate_id = (SELECT id from candidates WHERE last_name = 'Obama' AND first_name = 'Barack');
This approach works fine as long as you’re simply looking up values in one table and using them in the conditions for the
WHERE clause. But often the queries you’ll want to run
will need to treat the two tables as a combined data set. A query that combines the data from two tables is known as a join on the tables. It is possible to do an implicit join simply by defining the relationship between the two tables in the
SELECT contributors.last_name, contributors.first_name, candidates.last_name FROM contributors, candidates WHERE contributors.candidate_id = candidates.id;
- Notice that we’re including both of the tables in the
- Also notice that we’re using a fully-qualified version of the column names:
candidates.last_name. We’re including the table name here because
last_nameappears in both tables. So, just using
last_name, as we usually would, would be ambiguous (the last name of the contributor or the last name of the candidate?). Adding the table name and a dot (.) before the column name disambiguates the column.
Including the full table name with each column name can become a bit tedious. So, SQL allows you to define an alias for the table. To do so, simply include the alias after the table name in the
FROM clause. Then you can use that alias, rather than the full table name, elsewhere in the query:
SELECT a.last_name, a.first_name, b.last_name FROM contributors a, candidates b WHERE a.candidate_id = b.id;
This query returns the same results as the one above, but it saves some typing by making “a” an alias for “contributors” and “b” an alias for “candidates.” The alias can use any valid table name you
like, but obviously shorter aliases will save more typing, while longer ones may make the intention of the query easier to understand.
In addition to this implicit join syntax, SQL includes an explicit
JOIN keyword. So, we could write the above query using that syntax instead:
SELECT contributors.last_name, contributors.first_name, candidates.last_name FROM contributors JOIN candidates ON contributors.candidate_id = candidates.id;
The query results should be the same as above, but using the
JOIN keyword makes the intent of the query more explicit.
Aliases work with
JOIN as well:
SELECT a.last_name, a.first_name, b.last_name FROM contributors a JOIN candidates b ON a.candidate_id = b.id;
Now let’s try something a bit more interesting:
SELECT count(a.id), b.id, b.last_name FROM contributors a JOIN candidates b ON a.candidate_id = b.id GROUP BY b.id, b.last_name;
Excellent! We now know that we have 25 contributors for each candidate. Very cool. But, hey, wait. Our list of candidates seems to be coming up short. Let’s check it:
SELECT DISTINCT id, last_name FROM candidates;
What the? We’re missing 10 candidates. SQLite has gone mad!
Actually, there’s a pretty sensible explanation for this result. We said above that performing the
JOIN would return the same results as the query with this clause:
WHERE contributors.candidate_id = candidates.id.
What if a candidate has no contributors? Then that candidate is not returned by the query.
JOIN acts just like the
WHERE clause and filters out any rows that don’t match the condition defined. Joins that return only rows in which there is a match in both tables are known as INNER JOINs. This is often exactly the behavior you want from the join (ignore any rows from either table that don’t relate to a row in the other table). So by default, the
JOIN keyword executes an
INNER JOIN. You can also explicitly request an
INNER JOIN, just to make things clearer:
SELECT count(a.id), b.id, b.last_name FROM contributors a INNER JOIN candidates b ON a.candidate_id = b.id GROUP BY b.id, b.last_name;
The results will be the same.
But how do we get the full list of candidates along with the number of contributors for each, including those candidates who have no contributors in our data set? SQL provides the
OUTER JOIN syntax for doing just that. Outer joins are typically defined by the table from which we want to include non-matching rows, and we do so by referring to where that table appears in the
LEFT OUTER JOINincludes all rows from the table on the left side of the statement and only matching rows from the table on the right side of the statement.
RIGHT OUTER JOINincludes all rows from the table on the right side of the statement and only matching rows from the left side of the statement.
FULL OUTER JOINincludes all rows from both tables.
Currently, SQLite only supports
LEFT OUTER JOIN from the list above, but some other database management systems support the other two types as well.
(As an aside, it’s easy to perform a
RIGHT OUTER JOIN in SQLite by simply reversing the order of tables and using a
LEFT OUTER JOIN. It’s also possible to do a
FULL OUTER JOIN by combining
LEFT OUTER JOINs using the
This all probably makes more sense in an example. Let’s rewrite the grouping query from above to include all candidates:
SELECT count(contributors.id), candidates.id, candidates.last_name FROM candidates LEFT OUTER JOIN contributors ON candidates.id = contributors.candidate_id GROUP BY candidates.id, candidates.last_name;
(Aliases would work here as well, but I’ve used the full table names to make the relationships clearer.)
candidates LEFT OUTER JOIN contributors. Because candidates is on the left side of that statement, the result set will include all of the candidate rows, even those for which there are no matching contributors:
Know thyself: Self Joins
Occasionally, it’s useful to perform a query that joins a table to itself. Consider, for example, if we had two tables of contributors (contributors_a and contributors_b), and we wanted to find contributors who appeared in both tables. Even without an explicit reference defined via a foreign key, we could still do a rough join on a combination of other columns, such as first and last names:
SELECT a.last_name, a.first_name, a.amount, b.amount FROM contributors_a a, contributors_b b WHERE a.last_name = b.last_name AND a.first_name = b.first_name;
Obviously, there are some limitations here (what if two people have the same first and last names? What if someone appears twice in one of the tables?) But it would still give us a rough and ready picture of the overlap between the two tables.
Now, what if all of the data is in a single table? For example, what if we want to know which contributors have contributed on more than one occasion? There are a few different ways to perform this query, but one quick
way is to join the table to itself and take the same approach we used to find the overlap between the two separate tables above:
SELECT a.last_name, a.first_name, a.id as 'A ID', b.id AS 'B ID', a.amount, b.amount FROM contributors a, contributors b where a.last_name = b.last_name AND a.first_name=b.first_name AND a.id < b.id;
Notice the final condition in the
a.id < b.id. So, we’re asking for rows that have the same last_name and first_name but not the same id. By using
< rather than
!=, we eliminate some duplicates (otherwise, we would get a result row for each side of the match).
This approach has the same limitations we would have if there were two tables involved; without a unique identifier that appears in both tables, we can’t be sure that these are really the same people. Also, as you can see from the results, because there are some contributors who appear more than twice, we’re getting some duplicates in the output. There are ways to eliminate these duplicates
Why be normal? Denormalization as an informed choice.
Looking at the candidates table, there is another column showing some repetition: party. Many database designers would extract this column into its own table and then include a
party_id foreign key in the candidates table. It might be a good idea here to use that id rather than a text field; as it stands, if the data came in with “R,” “Republican” and “GOP” all appearing in that column, we would have a real mess. If we had a parties table that included only “R,” “D” and “I,” then we’d know we have a nonstandard value coming in when we tried to look up the party_id for “GOP,” for example.
But normalization comes with a cost. Adding that
parties table would mean that, any time we want to show candidate name and party, we’d have to do a join. And if we wanted contributor, candidate, and party, we’d have a query with two joins:
SELECT contributors.last_name, candidates.last_name, parties.name FROM contributors JOIN candidates ON contributors.candidate_id = candidates.id JOIN parties ON candidates.party_id = parties.id;
Doing multiple joins can become rather expensive in terms of memory, so often developers will create summary tables from the output of a SELECT:
CREATE TABLE contributors_candidates AS SELECT contributors.last_name, candidates.last_name, parties.name FROM contributors JOIN candidates ON contributors.candidate_id = candidates.id JOIN parties ON candidates.party_id = parties.id;
But any changes to the contributors or candidates tables would immediately make this summary table out of date, so you’d have to create a way to update the summary table with each change.
There is another approach: denormalization. That is, collapsing your normalized data into a single table. If you’re interested, check out the blog post on codinghorror and the spirited debate in the comments. I’ll give Jeff Atwood the final comment here: “As the old adage goes, normalize until it hurts, denormalize until it works.”
Next up: Changing the structure of the database once it’s created and using INDEXES to improve performance.
A Gentle Introduction to SQL Using SQLite by Troy Thibodeaux is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License.