# Combining data from multiple datasets

In this Notebook we'll work through a number of different ways in which two (or more) tabular datasets can be combined into a single table.  

In the first instance we will look at forming the 'union' of two tables that have the same structure and common datatypes for the columns. The union of two tables usually represents having the same type of data in two tables - say the attendance register in two school classes, or the crime rates in two police districts - that we want to combine into a single dataset.  This is about adding more rows of the same type of data to the base dataset.

The second type of combination of data will be where we have more data about the same things - where we want to add to the data in a row, not simply add more rows.   Two tables of this form usually have some common contents represented in a row, so this is about adding more columns and column-row values to the dataset.  This is known as the 'join' of two tables.  And there are a lot of issues to consider when joining tables (so, as this is quite a long Notebook, you may want to take a break while working through it).

## The examples used in this Notebook

In the previous Notebook on selecting and projecting data from tables, we created and used DataFrames in _pandas_ and used the _pandasql_ library to show the basic operations.  

In this Notebook we'll use _pandas_ DataFrames formed from datasets from external sources; we'll explain the origins and form of this data as we go.

We'll also take a different approach to the SQL evaluation in this Notebook. We'll use the external Postgres database management system (DBMS) which is an application running outside the Python Notebook.  

There are two reasons for this approach:
- so that you are familiar with accessing data held and processed in external systems (the _it's good for you_ reason!)
- the pandasql library makes use of an SQLite database engine, which doesn't have all the join types we will consider (the _pragmatic_ reason!).

The Notebook takes you through the SQL examples first, then returns to them to look at how _pandas_ achieves the same (or similar) results.  But this shouldn't stop you starting with the _pandas_ if you want, as long as you then come back to the SQL.

## Accessing the Postgres database engine

In [2]:
import pandas as pd

The virtual machine you're using on this module has a PostgreSQL database management system installed.

We'll be using this to run the SQL code in this Notebook.   To do this we need first to connect to the Postgres system, and then have  a way to tell Python to pass the SQL code to the Postgres system for evaluation and to copy back into the Notebook any results tables we wish to capture.

This is most easily done using SQL cell magic - a way of marking a cell as containing SQL code so that the Notebook routes the SQL to the connected DBMS. SQL cell magic cells start with `%%sql`.

The exact details of the connection to the DBMS depend on the DBMS in use.   However, the following works for us.

In [3]:
# Load in the sql extensions:
%load_ext sql


# Then connect to a Postgres SQL database.
%sql postgresql://test:test@localhost:5432/tm351test

'Connected: test@tm351test'

With the *sql magic* extension loaded, we start a cell with `%%sql` and then write SQL commands. 

The following cell begins with the sql magic marker `%%sql`. It then checks to see if the `quickdemo` table already exists: if it does, it removes it (in SQL terms it _drops_ it).  Then it _creates_ the `quickdemo` table with three columns, called `id`, `name` and `value`, before _inserting_ two rows of data into the table.

Notice that this is a cell with five SQL commands in it, each ending in `;`. The `%%sql` magic marker indicates the **whole** cell is to be processed by the connected DBMS.

In [4]:
%%sql
DROP TABLE IF EXISTS quickdemo;
CREATE TABLE quickdemo(id INT PRIMARY KEY, name VARCHAR(20), value INT);
INSERT INTO quickdemo VALUES(1,'This',12);
INSERT INTO quickdemo VALUES(2,'That',345);

SELECT * FROM quickdemo;

Done.
Done.
1 rows affected.
1 rows affected.
2 rows affected.


id,name,value
1,This,12
2,That,345


It is not possible to put a Python assignment statement at the end of a `%%sql` cell, so the table returned at the end of the cell (the `SELECT * FROM quickdemo;`) is picked up in the following cell using the `_` variable.  

The tables returned are lists - but we can use the result's `DataFrame()` method to convert this to a DataFrame, complete with index.

In [5]:
result_from_sql = _
result_from_sql

id,name,value
1,This,12
2,That,345


In [6]:
result_df = result_from_sql.DataFrame()
result_df

Unnamed: 0,id,name,value
0,1,This,12
1,2,That,345


In a code cell it is also possible to use `%sql` ahead of a _single_-line SQL command.

In [7]:
result = %sql SELECT * FROM quickdemo WHERE value > 25

1 rows affected.


In [8]:
dataframe_df = result.DataFrame()
dataframe_df

Unnamed: 0,id,name,value
0,2,That,345


# SQL: more of the same, the UNION of multiple datasets
## Vertical joins

Here are two tables ABCD1 and ABCD2 with a few rows of data in each.

In [17]:
%%sql
DROP TABLE IF EXISTS ABCD1;  -- This just allows us to run this cell repeatedly, 
                             -- it destroys the table before we recreate it;
                             -- this is not the normal way to use SQL where the 
                             -- persistence of data is important.
CREATE TABLE ABCD1(a CHAR(2), b CHAR(2), c CHAR(2), d CHAR(2) );
INSERT INTO ABCD1 VALUES('A1','b1','c1','d3');
INSERT INTO ABCD1 VALUES('A1','b1','c1','d4');


SELECT * FROM ABCD1; -- The '--' represents an in-line comment to SQL: 
                     -- anything after the '--' on the line is ignored.


Done.
Done.
1 rows affected.
1 rows affected.
2 rows affected.


a,b,c,d
A1,b1,c1,d3
A1,b1,c1,d4


In [18]:
%%sql
DROP TABLE IF EXISTS ABCD2;  

CREATE TABLE ABCD2(a CHAR(2), b CHAR(2), c CHAR(2), d CHAR(2) );
INSERT INTO ABCD2 VALUES('A2','b2','c2','d1');
INSERT INTO ABCD2 VALUES('A1','b1','c2','d7');
INSERT INTO ABCD2 VALUES('A6','b1','c8','d6');

SELECT * FROM ABCD2;

Done.
Done.
1 rows affected.
1 rows affected.
1 rows affected.
3 rows affected.


a,b,c,d
A2,b2,c2,d1
A1,b1,c2,d7
A6,b1,c8,d6


As you can see the two tables have the same structure, the same number of columns and same datatypes in each column. They also have the same column names which is handy as we're trying to give the impression that the two tables are of data representing the same types of things.   

The vertical join of these two tables is achieved with the SQL UNION clause between two SELECT statements.

In [19]:
%%sql
SELECT a,b,c,d
FROM ABCD1
UNION
SELECT a,b,c,d
FROM ABCD2;

5 rows affected.


a,b,c,d
A1,b1,c1,d4
A2,b2,c2,d1
A1,b1,c1,d3
A6,b1,c8,d6
A1,b1,c2,d7


Notice that the table headers are metadata, so are not repeated, but are inherited by default from the first SQL statement.
(Also SQL doesn't maintain the order of the original rows.)  It also doesn't really care about the column names - just that the datatypes are compatible and the number of columns in each SELECT is the same. This is known as having **union-compatible tables**: the same number of columns and compatible datatypes for the columns.
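
The idea of union-compatibility can be sketched outside SQL. The following is a rough Python illustration, not how Postgres implements `UNION`: the rows are copied from the two tables above, and the helper `is_union_compatible` is a hypothetical name introduced only for this sketch. A Python set is used because, like the result of `UNION`, it has no guaranteed order and holds no duplicate rows.

```python
# Rows copied from the ABCD1 and ABCD2 tables above, one tuple per row.
abcd1 = [("A1", "b1", "c1", "d3"), ("A1", "b1", "c1", "d4")]
abcd2 = [("A2", "b2", "c2", "d1"), ("A1", "b1", "c2", "d7"), ("A6", "b1", "c8", "d6")]


def is_union_compatible(rows1, rows2):
    """Illustrative check only: every row has the same sequence of Python types."""
    shapes = {tuple(type(value) for value in row) for row in rows1 + rows2}
    return len(shapes) == 1


# A Python set, like the result of UNION, has no duplicates and no guaranteed order.
union_rows = set(abcd1) | set(abcd2)

print(is_union_compatible(abcd1, abcd2))
print(len(union_rows))
```

Note that the check looks only at the number and types of the values, not at any column names, which mirrors the point made above.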

We can reshape the original tables, using projection and selection, if we need to.

In [20]:
%%sql
SELECT a,b,c
FROM ABCD1
WHERE d='d4'
UNION
SELECT a,b,c
FROM ABCD2
WHERE c='c2';


3 rows affected.


a,b,c
A2,b2,c2
A1,b1,c1
A1,b1,c2


### Adjusting the tables to force compatibility
The above abstract example, with all the columns having the same datatype and no real semantics behind the values shown, might give the impression that not much work needs to be done to ensure the two tables are compatible - but union-compatible doesn't necessarily mean semantically compatible.

Consider the following two tables:

In [21]:
%%sql
DROP TABLE IF EXISTS Parts1;  

CREATE TABLE Parts1(description VARCHAR(20), 
                    length_in_cm REAL, 
                    colour VARCHAR(20) );

INSERT INTO Parts1 VALUES('Plank',160.0,'Oak');
INSERT INTO Parts1 VALUES('Brace',20.2,'Green');

SELECT * FROM Parts1;

Done.
Done.
1 rows affected.
1 rows affected.
2 rows affected.


description,length_in_cm,colour
Plank,160.0,Oak
Brace,20.2,Green


In [22]:
%%sql
DROP TABLE IF EXISTS Parts2;  

CREATE TABLE Parts2(description VARCHAR(20), 
                    length_in_metres REAL );

INSERT INTO Parts2 VALUES('Flange',0.5);
INSERT INTO Parts2 VALUES('Sprocket',2.4);

SELECT * FROM Parts2;

Done.
Done.
1 rows affected.
1 rows affected.
2 rows affected.


description,length_in_metres
Flange,0.5
Sprocket,2.4


Not only do they have different numbers of columns, but the interpretation of the length columns would suggest they're not actually compatible, even though the underlying type is REAL in both cases.  So here, if we want to union the two tables we've got some **harmonisation** to do first.

In [23]:
%%sql
SELECT description, length_in_cm, colour
FROM Parts1
UNION
SELECT description, (length_in_metres*100.0), NULL
FROM Parts2;

4 rows affected.


description,length_in_cm,colour
Brace,20.2000007629395,Green
Flange,50.0,
Sprocket,240.000009536743,
Plank,160.0,Oak


Oh dear, something odd has happened here.  

The table looks to be the correct shape - it has the three columns we expect, and all the rows we want to see. But some of the values are not quite what we would expect. 

(1) The `NULL` marker in SQL is used to show where a value doesn't apply to a row; so we used it in the SQL code for the `Parts2` colour column values in the unioned table.  However, this has been replaced by `None` in the displayed result (that looks to be a Python side-effect).

(2) The Postgres SQL is showing that floating-point arithmetic can sometimes become inaccurate due to the precision of the stored values. If accuracy matters, we could round the resulting values as part of our harmonisation.  Here it's just an annoying detail to note - but in some calculations you really would want to know such things were occurring.
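
The effect in (2) can be reproduced in plain Python. This is a sketch under two assumptions: that the `REAL` columns are stored as 4-byte single-precision values, and that the multiplication by 100.0 is then carried out at a higher precision. The helper name `as_real` is made up for this illustration.

```python
import struct


def as_real(value):
    """Round-trip a Python float through a 4-byte single-precision value."""
    return struct.unpack("f", struct.pack("f", value))[0]


brace_cm = as_real(20.2)             # stored length of the Brace, in cm
sprocket_cm = as_real(2.4) * 100.0   # Sprocket length converted from metres

# The stored values are close to, but not exactly, 20.2 and 240.0 ...
print(brace_cm, sprocket_cm)

# ... so rounding is one possible harmonisation step.
print(round(brace_cm, 1), round(sprocket_cm, 1))
```

Values such as 0.5 and 160.0 can be represented exactly in binary, which would explain why the Flange and Plank rows show no such noise.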

## Taking care with vertical joins
SQL is usually quite good at enforcing datatypes and ensuring consistency when it can; but other tabular data tools - spreadsheets and text editors - may take a more cavalier approach, allowing confusing hybrid columns to result.

We can demonstrate this with the two tables above: SQL sees only that the description and colour columns have the same base type, so it will accept them in swapped positions; and, of course, we could simply neglect to convert the metres to centimetres.

In [24]:
%%sql
SELECT colour, length_in_cm, description
FROM Parts1
UNION
SELECT description, length_in_metres, NULL
FROM Parts2;

4 rows affected.


colour,length_in_cm,description
Oak,160.0,Plank
Green,20.2,Brace
Sprocket,2.4,
Flange,0.5,


### Exercise
Here is data from two expenses claim sheets from an employee who drives her own car on UK roads, but a hire car when in Europe.

            UKMileage

EmployeeName | Date | Start Location|Destination|Distance
--------------|------|---------------|-----------|-------
Smith|10-10-2010|Newcastle|Sunderland|13.1
Smith|11-10-2010|Sunderland|Newcastle|14.2


            EuropeanMileage

EmployeeName | Date | Start Location|Destination|Distance
--------------|------|---------------|-----------|-------
Smith|12-12-2010|Rouen|Paris|123.6
Smith|13-12-2010|Amiens|Calais|179.3

Is it safe to simply UNION these two datasets?


### Discussion
We haven't got sufficient information to tell us. The values in the columns all look reasonably compatible, but we need to understand the semantics behind the values and their origin.   Did the `Distance` values come directly from the cars' odometers?  In this case the UK figures are probably in miles, while the Europe figures are in kilometres.  Or has Smith already applied a conversion?  Without a description of the units, or a company policy document, or by consulting an external data source, such as a route planner, it would be unsafe to UNION the tables.

# SQL: more about the same, joining multiple datasets
## Horizontal joins

Here we're combining tables in ways that put the columns from the original datasets side by side. In most cases we do this because the two tables contain different information about the same things (represented in each row) and we want to create a single table with the combination of that data appearing in a single row.

For example, each year the Higher Education Statistics Agency (HESA) publishes a wide variety of data about the performance of every UK university.  Each year, the Higher Education Funding Council for England (HEFCE) also publishes the results of a National Student Survey, again broken down by university. Data about research grants is published via the Gateway to Research site.  Results of the Research Excellence Framework are published by another organisation, and so on. Each organisation publishes data about the same set of things - UK universities.  So, how would we go about creating one table in which each university had all its data on a single row?

This process of combining rows from multiple tables that have the same values in some columns (e.g. University Name) is called the **horizontal join** (usually just **join**). In the following we'll explore several variations on the join.  

### The Cartesian product
The simplest case of the horizontal join doesn't try to do anything to match values between the source tables. It simply puts every row from one table alongside every row in a second table; the table that results is known by mathematicians as the Cartesian product.  

The Cartesian product is of interest because it's the basic logical building block for all the other horizontal joins.  

Let's look at an example, using our ABCD1 table from earlier (we'll remind ourselves what it looks like first) and a second similarly arbitrary table XYZ1.

In [25]:
%%sql
SELECT * FROM ABCD1;

2 rows affected.


a,b,c,d
A1,b1,c1,d3
A1,b1,c1,d4


In [26]:
%%sql
DROP TABLE IF EXISTS XYZ1;

CREATE TABLE XYZ1(x CHAR(2), y CHAR(2), z CHAR(2) );
INSERT INTO XYZ1 VALUES('X1','d3','z1');
INSERT INTO XYZ1 VALUES('X2','d3','z2');
INSERT INTO XYZ1 VALUES('X3','d4','z3');

SELECT * FROM XYZ1;

Done.
Done.
1 rows affected.
1 rows affected.
1 rows affected.
3 rows affected.


x,y,z
X1,d3,z1
X2,d3,z2
X3,d4,z3


The SQL to produce the Cartesian product is simply to list the two (or more) table names in the FROM clause.

In [27]:
%%sql
SELECT * 
FROM ABCD1, XYZ1;

6 rows affected.


a,b,c,d,x,y,z
A1,b1,c1,d3,X1,d3,z1
A1,b1,c1,d4,X1,d3,z1
A1,b1,c1,d3,X2,d3,z2
A1,b1,c1,d4,X2,d3,z2
A1,b1,c1,d3,X3,d4,z3
A1,b1,c1,d4,X3,d4,z3


If you look at the resulting table you will see that each row in the original `ABCD1` table is repeated alongside every row in the original `XYZ1` table.  The source tables had 2 and 3 rows, the result has 2 * 3 = 6 rows.
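
The same pairing of every row with every row can be sketched in Python with `itertools.product` from the standard library. The rows are copied from the two tables above; this is only an illustration of the idea, not of how the DBMS evaluates the query.

```python
from itertools import product

# Rows copied from the ABCD1 and XYZ1 tables above.
abcd1 = [("A1", "b1", "c1", "d3"), ("A1", "b1", "c1", "d4")]
xyz1 = [("X1", "d3", "z1"), ("X2", "d3", "z2"), ("X3", "d4", "z3")]

# Every row of abcd1 is placed alongside every row of xyz1.
cartesian = [left + right for left, right in product(abcd1, xyz1)]

for row in cartesian:
    print(row)
print(len(cartesian), "rows")
```

The row count is always the product of the two source row counts, which is why the result grows so quickly for larger tables.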

In fact the Cartesian product result gets big quite fast, so we rarely use it - but we did say it was the basis of the rest of the horizontal join types so let's look at some more interesting joins.

### The equality and theta joins
In these joins a condition is applied between values in the columns of the two tables being joined; only those rows in the Cartesian product that satisfy the condition appear in the result.

Usually that's an equality condition, giving us the equality join, but it could be an arbitrary condition between the values. 

Here's an example of an equality join between the values in the `d` and `y` columns of our two tables.

In [28]:
%%sql
SELECT * 
FROM ABCD1 JOIN XYZ1 ON d=y;

3 rows affected.


a,b,c,d,x,y,z
A1,b1,c1,d3,X1,d3,z1
A1,b1,c1,d3,X2,d3,z2
A1,b1,c1,d4,X3,d4,z3


So, this only has rows in which the value of `d` and the value of `y` in the row are the same; it's the equality join expressed using the `ON d=y` part of the statement.

A way to remember the behaviour of the equality and theta joins is to think of them as filtering the Cartesian product:

    In the FROM clause, form the Cartesian product of ABCD1 and XYZ1,
    For each row in the Cartesian product, if the ON condition is TRUE put that row in the result otherwise discard that row,
    now SELECT the columns to project into the result from the remaining rows.
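
The three steps above can be sketched in Python. This is a mental model only (a real DBMS will normally avoid building the whole Cartesian product), and the function name `join` and its `on` parameter are made up for this illustration.

```python
from itertools import product

# Rows copied from the ABCD1 and XYZ1 tables above.
abcd1 = [("A1", "b1", "c1", "d3"), ("A1", "b1", "c1", "d4")]
xyz1 = [("X1", "d3", "z1"), ("X2", "d3", "z2"), ("X3", "d4", "z3")]


def join(left_rows, right_rows, on):
    """Filter the Cartesian product of two row lists with the condition `on`."""
    # Step 1: form the Cartesian product.
    pairs = product(left_rows, right_rows)
    # Step 2: keep only the pairs for which the ON condition is true.
    kept = [(left, right) for left, right in pairs if on(left, right)]
    # Step 3: project the columns; here every column, as in SELECT *.
    return [left + right for left, right in kept]


# ON d = y: d is the fourth column of ABCD1, y the second column of XYZ1.
joined = join(abcd1, xyz1, on=lambda left, right: left[3] == right[1])
print(joined)
```

Passing a different function as `on` gives a different join over the same product, which is the sense in which the equality join is just one kind of conditional join.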
    
Let's look at a slightly more meaningful example - two sets of students and the marks they got on the module they took this year and the module they took last year.

In [29]:
%%sql
DROP TABLE IF EXISTS this_year;

CREATE TABLE this_year(student VARCHAR(20), 
                       course VARCHAR(20), 
                       mark INT);

INSERT INTO this_year VALUES('Ann','TM351',55);
INSERT INTO this_year VALUES('Alison','TM351', 90);
INSERT INTO this_year VALUES('Andy','TM355',5);
INSERT INTO this_year VALUES('Arthur','TM356',5);

SELECT * FROM this_year;

Done.
Done.
1 rows affected.
1 rows affected.
1 rows affected.
1 rows affected.
4 rows affected.


student,course,mark
Ann,TM351,55
Alison,TM351,90
Andy,TM355,5
Arthur,TM356,5


In [30]:
%%sql
DROP TABLE IF EXISTS last_year;

CREATE TABLE last_year (name VARCHAR(20), 
                        module VARCHAR(20), 
                        score Int);

INSERT INTO last_year VALUES('Ann','TM352',40);
INSERT INTO last_year VALUES('Alison','TM352', 70);
INSERT INTO last_year VALUES('Andy','TM356',90);
INSERT INTO last_year VALUES('Arthur','TM356',55);
SELECT * FROM last_year;

Done.
Done.
1 rows affected.
1 rows affected.
1 rows affected.
1 rows affected.
4 rows affected.


name,module,score
Ann,TM352,40
Alison,TM352,70
Andy,TM356,90
Arthur,TM356,55


### Exercise 
Write a join that will show you which students sat the same module this year and last year.

In [32]:
%%sql
SELECT *
FROM this_year JOIN last_year ON (student = name) AND (course = module)

1 rows affected.


student,course,mark,name,module,score
Arthur,TM356,5,Arthur,TM356,55


### Solution
In this query you need two equality conditions to be true in the same row - one for the student's name, the other for the module=course values.   This is still *technically* an equality join, but you can see we can put any condition applied to the Cartesian product row into the ON clause - this is where you get the *theta* join - if the condition in the ON is not an equality condition then it's described as a *theta* join.  

In [None]:
%%sql
SELECT * 
FROM  this_year JOIN last_year ON (student = name) AND (course = module);

### Exercise
Write an SQL statement using a join that will show you which students did better on this year's module than on last year's.

In [34]:
%%sql
SELECT *
FROM this_year JOIN last_year ON (score > mark) AND (course = module)

2 rows affected.


student,course,mark,name,module,score
Arthur,TM356,5,Arthur,TM356,55
Arthur,TM356,5,Andy,TM356,90


### Solution
In this query you're only interested in rows in the Cartesian product which have the same student *and* where the mark they got this year is higher than the score they got last year.

In [35]:
%%sql
SELECT * 
FROM  this_year JOIN last_year ON (student = name) AND (mark > score);

2 rows affected.


student,course,mark,name,module,score
Ann,TM351,55,Ann,TM352,40
Alison,TM351,90,Alison,TM352,70
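
The condition used in this solution mixes an equality test with a 'greater than' test. A minimal Python sketch of the same filter, using named tuples so the column names stay visible, might look like this (the rows are copied from the two tables above; the type names `ThisYear` and `LastYear` are made up for the sketch):

```python
from collections import namedtuple
from itertools import product

ThisYear = namedtuple("ThisYear", "student course mark")
LastYear = namedtuple("LastYear", "name module score")

this_year = [
    ThisYear("Ann", "TM351", 55),
    ThisYear("Alison", "TM351", 90),
    ThisYear("Andy", "TM355", 5),
    ThisYear("Arthur", "TM356", 5),
]
last_year = [
    LastYear("Ann", "TM352", 40),
    LastYear("Alison", "TM352", 70),
    LastYear("Andy", "TM356", 90),
    LastYear("Arthur", "TM356", 55),
]

# ON (student = name) AND (mark > score)
improved = [
    tuple(t) + tuple(l)
    for t, l in product(this_year, last_year)
    if t.student == l.name and t.mark > l.score
]
print(improved)
```

Dropping the `t.student == l.name` test would compare each student's mark with everyone's score from last year, which is rarely what is wanted.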


## Referencing the same column name in more than one table: table.column

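Notice that in the above examples we deliberately chose column names for the two tables so that there was no ambiguity. The following table holds the same data as `last_year`, but uses the same column names as `this_year`, so a reference to `student`, `course` or `mark` on its own would be ambiguous.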

In [36]:
%%sql
DROP TABLE IF EXISTS another_last_year;

CREATE TABLE another_last_year (student VARCHAR(20), course VARCHAR(20), mark Int);
INSERT INTO another_last_year VALUES('Ann','TM352',40);
INSERT INTO another_last_year VALUES('Alison','TM352', 70);
INSERT INTO another_last_year VALUES('Andy','TM356',90);
INSERT INTO another_last_year VALUES('Arthur','TM356',55);
SELECT * FROM another_last_year;

Done.
Done.
1 rows affected.
1 rows affected.
1 rows affected.
1 rows affected.
4 rows affected.


student,course,mark
Ann,TM352,40
Alison,TM352,70
Andy,TM356,90
Arthur,TM356,55


Here's how we resolve this ambiguity: when two tables have columns with the same name, we refer to each column by its full table and column name combination, `<tablename>.<columnname>`, as shown below. Notice also that the repeated column names in the resulting table have been given a `_1` suffix to avoid confusion.

In [37]:
%%sql
SELECT * 
FROM  this_year JOIN another_last_year 
      ON (this_year.student = another_last_year.student) 
        AND (this_year.mark > another_last_year.mark);

2 rows affected.


student,course,mark,student_1,course_1,mark_1
Ann,TM351,55,Ann,TM352,40
Alison,TM351,90,Alison,TM352,70


## Natural join
It's quite common for two tables to have columns with the same name - this often indicates that there is a relationship between the rows in the two tables (in relational database terms the values usually represent primary and foreign keys).

Here's an example, using data taken from UK Government data tables on car fuel consumption figures - the original data can be found at http://carfueldata.direct.gov.uk/search-by-fuel-economy.aspx (accessed 21-May-2015).  (Note that the data needed some cleaning and reshaping to get it into the form shown below.)

The first table shows the combined fuel consumption figures, the second shows the base consumption figures (the Urban and Extra Urban figures).   In both tables the values of the manufacturer, model and fuel-type columns act as a unique key for the rows in the table, and those rows with the same key values in each table have information about the same thing.

In [38]:
%%sql 
DROP TABLE IF EXISTS car_type_combined_consumption;

CREATE TABLE car_type_combined_consumption (manufacturer VARCHAR(30), model VARCHAR(50), imperial_combined REAL, fuel_type VARCHAR(10));
INSERT INTO car_type_combined_consumption VALUES('MORGAN MOTOR COMPANY','2000, From January 2011 onwards',40.3,'Petrol');
INSERT INTO car_type_combined_consumption VALUES('CHEVROLET','Orlando, MY2013', 40.3, 'Petrol');
INSERT INTO car_type_combined_consumption VALUES('VOLKSWAGEN C.V.','California Motor Home',40.4, 'Diesel');
INSERT INTO car_type_combined_consumption VALUES('VOLKSWAGEN','Touareg', 40.4, 'Diesel');
INSERT INTO car_type_combined_consumption VALUES('VOLKSWAGEN','Passat Saloon',40.4,'Petrol');
SELECT * FROM car_type_combined_consumption;

Done.
Done.
1 rows affected.
1 rows affected.
1 rows affected.
1 rows affected.
1 rows affected.
5 rows affected.


manufacturer,model,imperial_combined,fuel_type
MORGAN MOTOR COMPANY,"2000, From January 2011 onwards",40.3,Petrol
CHEVROLET,"Orlando, MY2013",40.3,Petrol
VOLKSWAGEN C.V.,California Motor Home,40.4,Diesel
VOLKSWAGEN,Touareg,40.4,Diesel
VOLKSWAGEN,Passat Saloon,40.4,Petrol


In [39]:
%%sql 
DROP TABLE IF EXISTS car_type_base_consumption;

CREATE TABLE car_type_base_consumption (manufacturer VARCHAR(30), model VARCHAR(50), fuel_type VARCHAR(10), imperial_urban_cold REAL, imperial_extra_urban REAL);
INSERT INTO car_type_base_consumption VALUES('MORGAN MOTOR COMPANY','2000, From January 2011 onwards','Petrol',32.8, 46.3);
INSERT INTO car_type_base_consumption VALUES('CHEVROLET','Orlando, MY2013', 'Petrol', 34.4, 44.8);
INSERT INTO car_type_base_consumption VALUES('VOLKSWAGEN C.V.','California Motor Home', 'Diesel',29.7, 51.4);
INSERT INTO car_type_base_consumption VALUES('VOLKSWAGEN','Touareg', 'Diesel',28.5, 55.3);
INSERT INTO car_type_base_consumption VALUES('VOLKSWAGEN','Passat Saloon','Petrol',30.7, 49.5);
SELECT * FROM car_type_base_consumption;

Done.
Done.
1 rows affected.
1 rows affected.
1 rows affected.
1 rows affected.
1 rows affected.
5 rows affected.


manufacturer,model,fuel_type,imperial_urban_cold,imperial_extra_urban
MORGAN MOTOR COMPANY,"2000, From January 2011 onwards",Petrol,32.8,46.3
CHEVROLET,"Orlando, MY2013",Petrol,34.4,44.8
VOLKSWAGEN C.V.,California Motor Home,Diesel,29.7,51.4
VOLKSWAGEN,Touareg,Diesel,28.5,55.3
VOLKSWAGEN,Passat Saloon,Petrol,30.7,49.5


A **natural join** matches rows on the columns that share a name in both tables (the datatypes of those columns must be comparable); the result is an equality join applied to all of those matching columns, with each matching column appearing only once in the result.

In [40]:
%%sql
SELECT *
FROM car_type_combined_consumption NATURAL JOIN car_type_base_consumption;

5 rows affected.


manufacturer,model,fuel_type,imperial_combined,imperial_urban_cold,imperial_extra_urban
CHEVROLET,"Orlando, MY2013",Petrol,40.3,34.4,44.8
MORGAN MOTOR COMPANY,"2000, From January 2011 onwards",Petrol,40.3,32.8,46.3
VOLKSWAGEN,Passat Saloon,Petrol,40.4,30.7,49.5
VOLKSWAGEN,Touareg,Diesel,40.4,28.5,55.3
VOLKSWAGEN C.V.,California Motor Home,Diesel,40.4,29.7,51.4


You need to be careful of natural joins: firstly the equality applies to *all* the columns that have the same name and datatype, and secondly if someone adds or removes columns from one of the tables then the behaviour of the query with the natural join may change.
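
The second risk is easy to demonstrate with a small Python sketch of a natural join over rows held as dictionaries. Everything here is illustrative: `natural_join` is a made-up helper, only two of the car rows and one consumption column are used, and the added `notes` column is a hypothetical change to the tables.

```python
def natural_join(left_rows, right_rows):
    """Join dict rows on every column name that the two tables share."""
    if not left_rows or not right_rows:
        return []
    shared = set(left_rows[0]) & set(right_rows[0])
    result = []
    for left in left_rows:
        for right in right_rows:
            if all(left[col] == right[col] for col in shared):
                result.append({**left, **right})
    return result


combined = [
    {"manufacturer": "VOLKSWAGEN", "model": "Touareg",
     "fuel_type": "Diesel", "imperial_combined": 40.4},
    {"manufacturer": "VOLKSWAGEN", "model": "Passat Saloon",
     "fuel_type": "Petrol", "imperial_combined": 40.4},
]
base = [
    {"manufacturer": "VOLKSWAGEN", "model": "Touareg",
     "fuel_type": "Diesel", "imperial_urban_cold": 28.5},
    {"manufacturer": "VOLKSWAGEN", "model": "Passat Saloon",
     "fuel_type": "Petrol", "imperial_urban_cold": 30.7},
]

before = natural_join(combined, base)

# Hypothetical change: a 'notes' column is added to both tables,
# holding different text in each.
combined_changed = [dict(row, notes="combined cycle") for row in combined]
base_changed = [dict(row, notes="urban cycle") for row in base]
after = natural_join(combined_changed, base_changed)

print(len(before), len(after))
```

The join itself is unchanged between the two calls; only the tables changed, yet the second result is empty because `notes` has silently become part of the matching condition.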

## Outer joins
The final major form of the join is the outer join - to handle cases when we would lose data if we simply accepted only the joined rows based on their filtering condition.

Look back at the join examples we've seen so far:

* the Cartesian product joined every row in one table with every row in the other 
* the other joins have generally been based around a condition (most often equality) between two tables in which values to be matched appeared in both tables. 

So, in the examples we've seen so far the result rows have always had values from rows taken from both tables.


Now consider this example:

Each year a small sports club registers members' names and addresses with the sports association, which gives them a registration number that the club notes.  The club also collects membership fees, which may be paid in installments, and they keep a running total of the amount paid by each member. At the start of the year this table is empty.

At some point the club has two datasets:

            member

name|address|registration_no
----|-------|---------------
Kevin | Milton Keynes | R345
Katy | Bedford | R34
Kirrin | Luton | R45

and 

            payment_made

name|total_amount
----|------------
Katy|54
Kevin|33

Note that Kirrin is missing from the payment table, as she has just joined and has not paid anything yet.


In [41]:
%%sql 
DROP TABLE IF EXISTS member;

CREATE TABLE member (name VARCHAR(30), address VARCHAR(50), registration_no VARCHAR(5));
INSERT INTO member VALUES('Kevin','Milton Keynes','R345');
INSERT INTO member VALUES('Katy','Bedford', 'R34');
INSERT INTO member VALUES('Kirrin','Luton', 'R45');

DROP TABLE IF EXISTS payment_made;

CREATE TABLE payment_made (name VARCHAR(30), total_amount INT);
INSERT INTO payment_made VALUES('Katy', 54);
INSERT INTO payment_made VALUES('Kevin', 33);

SELECT * FROM member;
SELECT * FROM payment_made;

Done.
Done.
1 rows affected.
1 rows affected.
1 rows affected.
Done.
Done.
1 rows affected.
1 rows affected.
3 rows affected.
2 rows affected.


name,total_amount
Katy,54
Kevin,33



Suppose the manager wants a query to show both registrations and payments in a single table.
In the cell below try writing a JOIN using the examples above that will create the table.

**Result**

name|address|registration_no|total_amount
----|-------|---------------|------------
...|...|...|...


In [42]:
%%sql
SELECT * 
FROM member, payment_made;

6 rows affected.


name,address,registration_no,name_1,total_amount
Kevin,Milton Keynes,R345,Katy,54
Katy,Bedford,R34,Katy,54
Kirrin,Luton,R45,Katy,54
Kevin,Milton Keynes,R345,Kevin,33
Katy,Bedford,R34,Kevin,33
Kirrin,Luton,R45,Kevin,33


The simple Cartesian product makes a real mess, simply putting everyone against everyone else.
If you try to use any of the conditional joins you'll notice that we keep losing Kirrin in the result table.   She has a registration number, but has no row in the payment table, so the join conditions never succeed.

To include those rows that appear in one table, but with no corresponding row in the other table we use an OUTER join.

In [43]:
%%sql
SELECT * 
FROM member LEFT OUTER JOIN payment_made ON member.name = payment_made.name;

3 rows affected.


name,address,registration_no,name_1,total_amount
Katy,Bedford,R34,Katy,54.0
Kevin,Milton Keynes,R345,Kevin,33.0
Kirrin,Luton,R45,,


The `LEFT OUTER JOIN`  says that if there is an unmatched row in the left table (the one named before the `LEFT OUTER JOIN` text) then it is kept in the result with NULLs in the columns from the right table.
The `None` (which substitutes for the SQL `NULL` marker) shows that there is no value in these row-column intersections.
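
As a rough Python sketch of this behaviour (the rows are copied from the `member` and `payment_made` tables above; the function `left_outer_join` and its parameters are made up for the illustration):

```python
# Rows copied from the member and payment_made tables above.
member = [
    ("Kevin", "Milton Keynes", "R345"),
    ("Katy", "Bedford", "R34"),
    ("Kirrin", "Luton", "R45"),
]
payment_made = [("Katy", 54), ("Kevin", 33)]


def left_outer_join(left_rows, right_rows, on, right_width):
    """Keep every left row; pad with None when no right row matches."""
    result = []
    for left in left_rows:
        matches = [right for right in right_rows if on(left, right)]
        if matches:
            result.extend(left + right for right in matches)
        else:
            # Unmatched left row: fill the right-hand columns with None.
            result.append(left + (None,) * right_width)
    return result


rows = left_outer_join(
    member,
    payment_made,
    on=lambda m, p: m[0] == p[0],  # member.name = payment_made.name
    right_width=2,
)
for row in rows:
    print(row)
```

Swapping the two table arguments (and adjusting `right_width` and the condition to match) keeps every payment instead of every member.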

Now suppose Kirrin pays £10, but brings along her friend Lucy who immediately pays the club £30 but does not fill in her registration form.  
We record these two events in the data tables.

In [44]:
%%sql
INSERT INTO payment_made VALUES('Kirrin',10);
INSERT INTO payment_made VALUES('Lucy',30);


1 rows affected.
1 rows affected.


[]

Kirrin will now appear in the equality or natural joins because there is data in both tables against her name, and of course we've seen she appears in the LEFT OUTER JOIN.

In [45]:
%%sql
SELECT * 
FROM member NATURAL JOIN payment_made;

3 rows affected.


name,address,registration_no,total_amount
Katy,Bedford,R34,54
Kevin,Milton Keynes,R345,33
Kirrin,Luton,R45,10


In [46]:
%%sql
SELECT * 
FROM member LEFT OUTER JOIN payment_made ON member.name = payment_made.name;

3 rows affected.


name,address,registration_no,name_1,total_amount
Katy,Bedford,R34,Katy,54
Kevin,Milton Keynes,R345,Kevin,33
Kirrin,Luton,R45,Kirrin,10


But we've lost Lucy's payment.

### Exercise
How do you think we can change the query to generate the required result?

In [47]:
%%sql
SELECT * 
FROM member LEFT OUTER JOIN payment_made ON member.name = payment_made.name;

3 rows affected.


name,address,registration_no,name_1,total_amount
Katy,Bedford,R34,Katy,54
Kevin,Milton Keynes,R345,Kevin,33
Kirrin,Luton,R45,Kirrin,10


### Discussion

We have two options: 

(1) reorder the two references to the tables in the `JOIN` statement:

In [48]:
%%sql -- solution
SELECT * 
FROM payment_made LEFT OUTER JOIN member ON member.name = payment_made.name;

4 rows affected.


name,total_amount,name_1,address,registration_no
Katy,54,Katy,Bedford,R34
Kevin,33,Kevin,Milton Keynes,R345
Kirrin,10,Kirrin,Luton,R45
Lucy,30,,,


(2) or use the `RIGHT OUTER JOIN`:

In [49]:
%%sql -- solution 
SELECT * 
FROM member RIGHT OUTER JOIN payment_made ON member.name = payment_made.name;

4 rows affected.


name,address,registration_no,name_1,total_amount
Katy,Bedford,R34,Katy,54
Kevin,Milton Keynes,R345,Kevin,33
Kirrin,Luton,R45,Kirrin,10
,,,Lucy,30


So, by using LEFT or RIGHT OUTER JOINs we can allow 'unmatched' rows from the first or second table in the JOIN statement - but what about unmatched rows from _both_ tables?

Suppose Kirrin hadn't paid the £10, but Lucy had paid her £30.

In [50]:
%%sql
DELETE FROM payment_made WHERE name ='Kirrin';
SELECT * FROM payment_made;

1 rows affected.
3 rows affected.


name,total_amount
Katy,54
Kevin,33
Lucy,30


To see both Kirrin, who has no payment noted, and Lucy, who has no registration number, we need a `FULL OUTER JOIN`.

In [51]:
%%sql
SELECT * 
FROM member FULL OUTER JOIN payment_made ON member.name = payment_made.name;

4 rows affected.


name,address,registration_no,name_1,total_amount
Katy,Bedford,R34,Katy,54.0
Kevin,Milton Keynes,R345,Kevin,33.0
,,,Lucy,30.0
Kirrin,Luton,R45,,


It's worth noting that there would still be some work to do to make the output of the FULL OUTER JOIN useful.  The `member.name` column has a `NULL` (`None`) in it for Lucy. We would need to copy across the `payment_made.name`, otherwise anyone looking for people by name would need to search two columns of the table.  
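
In SQL, the usual tool for this kind of tidying is `COALESCE`, which returns its first non-NULL argument. The idea can be sketched in Python as a full outer join that produces a single name column. This assumes each name appears at most once in each table; the function name `full_outer_join_on_name` is made up for the sketch, and the rows are copied from the tables as they stand at this point.

```python
# Rows copied from the member and payment_made tables at this point.
member = [
    ("Kevin", "Milton Keynes", "R345"),
    ("Katy", "Bedford", "R34"),
    ("Kirrin", "Luton", "R45"),
]
payment_made = [("Katy", 54), ("Kevin", 33), ("Lucy", 30)]


def full_outer_join_on_name(members, payments):
    """Full outer join on name, keeping one name column (names assumed unique)."""
    member_by_name = {name: (address, reg) for name, address, reg in members}
    payment_by_name = dict(payments)
    # Every name from either table, members first.
    names = list(member_by_name) + [
        name for name in payment_by_name if name not in member_by_name
    ]
    rows = []
    for name in names:
        address, reg = member_by_name.get(name, (None, None))
        rows.append((name, address, reg, payment_by_name.get(name)))
    return rows


for row in full_outer_join_on_name(member, payment_made):
    print(row)
```

Each person now appears once, with `None` only where the club really has no data for them.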

# SQL summary: where we've reached

We've been looking at combining into a single table, data from more than one table; so far, we've seen this in SQL.

Let's recap what has been covered, before we move on to seeing the same types of table combinations in *pandas*.

So far we've seen:
1. how to access PostgreSQL from the notebook.
2. vertical joins: the union of union-compatible tables
3. horizontal joins: cartesian product, equality and theta joins, natural join, and left, right and full outer joins.

Let's now move on to looking at how *pandas* handles combining data from multiple DataFrames.



# *pandas*: combining data from multiple tables
To work through the remainder of this Notebook we will use a single example, using data from the Open Data Communities website (details are given below). The datasets are held in the `data` folder for this part of the course.

The main sections in the rest of the Notebook correspond to the vertical and horizontal joins.

Let's start by describing, then loading, our example datasets.

## Introducing the datasets
The data we will use for this activity comes from the Department for Communities and Local Government (DCLG) Open Data Communities website (http://opendatacommunities.org/). 

Two sorts of data have been downloaded from this site:
- information about the average weekly social rent of new PRP (Private Registered Providers) general needs lettings for 2012/13  
- data relating to house building, in particular the permanent dwellings started from 2009/10 to 2012/13.

We've also copied an Ordnance Survey file giving names for geographical areas and reference codes for them.

We've copied the data into files in the `data/housingdata` folder.

In [52]:
!ls data/housingdata

house-building-starts-tenure-2009-2010.csv
house-building-starts-tenure-2010-2011.csv
house-building-starts-tenure-2011-2012.csv
house-building-starts-tenure-2012-2013.csv
households-social-lettings-general-needs-rents-prp-number-bedrooms-2012-2013.csv
yorksAndHumberside.csv


The house building data files all have a similar form (which looks messy due to the wide rows wrapping):

In [53]:
!head data/housingdata/house-building-starts-tenure-2009-2010.csv

Generated by opendatacommunities.org,2014-03-25T13:48:42+00:00
http://opendatacommunities.org/data/house-building/starts/tenure,"Permanent dwellings started, 2009/10 to 2012/13, England, District By Tenure "
Reference period,2009-2010

,,http://opendatacommunities.org/def/concept/general-concepts/tenure/all,http://opendatacommunities.org/def/concept/general-concepts/tenure/housingAssociations,http://opendatacommunities.org/def/concept/general-concepts/tenure/localAuthority,http://opendatacommunities.org/def/concept/general-concepts/tenure/privateEnterprise
http://opendatacommunities.org/def/ontology/geography/refArea,Reference area,All,Housing-Associations,Local-Authority,Private-Enterprise
http://statistics.data.gov.uk/id/statistical-geography/E06000001,E06000001 Hartlepool,230,0,0,230
http://statistics.data.gov.uk/id/statistical-geography/E06000002,E06000002 Middlesbrough,280,130,0,160
http://statistics.data.gov.uk/id/statistical-geography/E06000003,E06000003 Redcar and Cleve

Looking at this data, we can see most of the metadata gives URLs related to the definitions of the terms and concepts relevant to the file content.  This ensures that we can check our interpretation and understanding of the data elements and their context, by reference to the relevant definitions.

When we get down to the rows of data, each row has a URL link to a reference element for the statistical geography area, and the reference code is repeated alongside the textual name of the reference area. 

The first column value gives you the link to SPARQL data for the local authority area (you'll cover SPARQL in Part 23 of the module). The second column gives you two types of label for the reference area: its code and its name.  The data we're interested in manipulating is in the second and subsequent columns.

In [54]:
!head data/housingdata/house-building-starts-tenure-2010-2011.csv

Generated by opendatacommunities.org,2014-03-25T13:48:55+00:00
http://opendatacommunities.org/data/house-building/starts/tenure,"Permanent dwellings started, 2009/10 to 2012/13, England, District By Tenure "
Reference period,2010-2011

,,http://opendatacommunities.org/def/concept/general-concepts/tenure/all,http://opendatacommunities.org/def/concept/general-concepts/tenure/housingAssociations,http://opendatacommunities.org/def/concept/general-concepts/tenure/localAuthority,http://opendatacommunities.org/def/concept/general-concepts/tenure/privateEnterprise
http://opendatacommunities.org/def/ontology/geography/refArea,Reference area,All,Housing-Associations,Local-Authority,Private-Enterprise
http://statistics.data.gov.uk/id/statistical-geography/E06000001,E06000001 Hartlepool,230,90,0,140
http://statistics.data.gov.uk/id/statistical-geography/E06000003,E06000003 Redcar and Cleveland,210,100,0,110
http://statistics.data.gov.uk/id/statistical-geography/E06000004,E06000004 Stockton

In [55]:
!head data/housingdata/households-social-lettings-general-needs-rents-prp-number-bedrooms-2012-2013.csv

Generated by opendatacommunities.org,2014-03-25T13:53:14+00:00
http://opendatacommunities.org/data/households/social-lettings/general-needs/rents/prp/number-bedrooms,"Average weekly social rent of new PRP general needs lettings, 2012/2013, England, District By Number of Bedrooms"
Reference period,2012-2013

,,http://opendatacommunities.org/def/concept/households/social-lettings/general-needs/rents/prp/number-bedrooms/allBedroomSizes,http://opendatacommunities.org/def/concept/households/social-lettings/general-needs/rents/prp/number-bedrooms/fourOrMoreBedroom,http://opendatacommunities.org/def/concept/households/social-lettings/general-needs/rents/prp/number-bedrooms/oneBedroom,http://opendatacommunities.org/def/concept/households/social-lettings/general-needs/rents/prp/number-bedrooms/threeBedroom,http://opendatacommunities.org/def/concept/households/social-lettings/general-needs/rents/prp/number-bedrooms/twoBedroom
http://opendatacommunities.org/def/ontology/geography/refArea,Ref

We have also pulled down a file from the Ordnance Survey that contains a list of geographical areas within the Yorkshire and the Humber region, some of which are local councils and some of which aren't. Note that the data that identifies each authority appears to resemble that used in the DCLG data files but does not match exactly.

In [56]:
!head -n 5 data/housingdata/yorksAndHumberside.csv
# In the cell output,  the first column (up to the first comma) is a 
# URL giving access to an Ordnance Survey page for each district.

district,districtname,gss
http://data.ordnancesurvey.co.uk/id/7000000000022028,NorthYorkshire,E10000023
http://data.ordnancesurvey.co.uk/id/7000000000009082,Doncaster,E08000017
http://data.ordnancesurvey.co.uk/id/7000000000009113,Sheffield,E08000019
http://data.ordnancesurvey.co.uk/id/7000000000009123,Rotherham,E08000018


# Loading the house building data 
We can load the data from the CSV files using the *pandas* `read_csv()` function. For the housing data, we need to skip the first five lines (I counted!) of the file before accepting the header.

In [57]:
import pandas as pd

In [58]:
# Read in some of the data.
bldg_2009_10_df = pd.read_csv('data/housingdata/house-building-starts-tenure-2009-2010.csv', skiprows=5)
bldg_2010_11_df = pd.read_csv('data/housingdata/house-building-starts-tenure-2010-2011.csv', skiprows=5)


In [59]:
# Preview the data we have loaded, to make sure it looks sensible.
bldg_2009_10_df[:5]

Unnamed: 0,http://opendatacommunities.org/def/ontology/geography/refArea,Reference area,All,Housing-Associations,Local-Authority,Private-Enterprise
0,http://statistics.data.gov.uk/id/statistical-g...,E06000001 Hartlepool,230.0,0.0,0.0,230.0
1,http://statistics.data.gov.uk/id/statistical-g...,E06000002 Middlesbrough,280.0,130.0,0.0,160.0
2,http://statistics.data.gov.uk/id/statistical-g...,E06000003 Redcar and Cleveland,240.0,130.0,0.0,110.0
3,http://statistics.data.gov.uk/id/statistical-g...,E06000004 Stockton-on-Tees,440.0,50.0,0.0,390.0
4,http://statistics.data.gov.uk/id/statistical-g...,E06000005 Darlington,130.0,0.0,0.0,130.0
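To see what `skiprows=5` is doing without needing the real files, here is a minimal sketch that reads a tiny, made-up CSV with the same layout: five leading lines of metadata (one of them blank) followed by the header row. The `example.org` URLs and the `raw_text` and `demo_df` names are assumptions for illustration only; the two data rows reuse values from the preview above.

```python
import io
import pandas as pd

# A tiny stand-in for one of the house building files: five metadata
# lines (the fourth is blank), then the real header row, then the data.
raw_text = '''Generated by example.org,2014-03-25T13:48:42+00:00
http://example.org/data,"Permanent dwellings started"
Reference period,2009-2010

,,http://example.org/tenure/all,http://example.org/tenure/housingAssociations
http://example.org/refArea,Reference area,All,Housing-Associations
http://example.org/E06000001,E06000001 Hartlepool,230,0
http://example.org/E06000002,E06000002 Middlesbrough,280,130
'''

# skiprows=5 discards the first five lines of the file, so the sixth
# line is the one that pandas treats as the header.
demo_df = pd.read_csv(io.StringIO(raw_text), skiprows=5)

print(demo_df.columns.tolist())
print(len(demo_df))
```

If the wrong number of lines is skipped, the symptom is usually obvious in the preview: either a metadata line or a data row turns up as the column headings.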


### Exercise

In [62]:
# YOUR TURN
# Import the remaining house building files into separate DataFrames.

bldg_2011_12_df = pd.read_csv('data/housingdata/house-building-starts-tenure-2011-2012.csv', skiprows=5)
bldg_2012_13_df = pd.read_csv('data/housingdata/house-building-starts-tenure-2012-2013.csv', skiprows=5)

bldg_2011_12_df[:5]



Unnamed: 0,http://opendatacommunities.org/def/ontology/geography/refArea,Reference area,All,Housing-Associations,Local-Authority,Private-Enterprise
0,http://statistics.data.gov.uk/id/statistical-g...,E06000001 Hartlepool,190.0,30.0,0.0,160.0
1,http://statistics.data.gov.uk/id/statistical-g...,E06000003 Redcar and Cleveland,280.0,100.0,0.0,170.0
2,http://statistics.data.gov.uk/id/statistical-g...,E06000004 Stockton-on-Tees,560.0,50.0,0.0,510.0
3,http://statistics.data.gov.uk/id/statistical-g...,E06000005 Darlington,110.0,0.0,0.0,110.0
4,http://statistics.data.gov.uk/id/statistical-g...,E06000007 Warrington,650.0,40.0,0.0,610.0


In [74]:
!head -n 7 'data/housingdata/households-social-lettings-general-needs-rents-prp-number-bedrooms-2012-2013.csv'


Generated by opendatacommunities.org,2014-03-25T13:53:14+00:00
http://opendatacommunities.org/data/households/social-lettings/general-needs/rents/prp/number-bedrooms,"Average weekly social rent of new PRP general needs lettings, 2012/2013, England, District By Number of Bedrooms"
Reference period,2012-2013

,,http://opendatacommunities.org/def/concept/households/social-lettings/general-needs/rents/prp/number-bedrooms/allBedroomSizes,http://opendatacommunities.org/def/concept/households/social-lettings/general-needs/rents/prp/number-bedrooms/fourOrMoreBedroom,http://opendatacommunities.org/def/concept/households/social-lettings/general-needs/rents/prp/number-bedrooms/oneBedroom,http://opendatacommunities.org/def/concept/households/social-lettings/general-needs/rents/prp/number-bedrooms/threeBedroom,http://opendatacommunities.org/def/concept/households/social-lettings/general-needs/rents/prp/number-bedrooms/twoBedroom
http://opendatacommunities.org/def/ontology/geography/refArea,Ref

In [75]:
social_lettings_2012_13_df = pd.read_csv('data/housingdata/households-social-lettings-general-needs-rents-prp-number-bedrooms-2012-2013.csv', skiprows=5)
social_lettings_2012_13_df[:5]

Unnamed: 0,http://opendatacommunities.org/def/ontology/geography/refArea,Reference area,All bedroom sizes,Four or more bedrooms,One bedroom,Three bedrooms,Two bedrooms
0,http://statistics.data.gov.uk/id/statistical-g...,E06000001 Hartlepool,73.33,97.78,65.97,85.41,77.67
1,http://statistics.data.gov.uk/id/statistical-g...,E06000002 Middlesbrough,75.81,91.82,66.21,83.55,75.69
2,http://statistics.data.gov.uk/id/statistical-g...,E06000003 Redcar and Cleveland,79.92,104.27,70.5,87.98,79.02
3,http://statistics.data.gov.uk/id/statistical-g...,E06000004 Stockton-on-Tees,73.13,93.67,65.2,85.2,76.46
4,http://statistics.data.gov.uk/id/statistical-g...,E06000005 Darlington,70.45,85.64,63.11,79.0,73.67


In [80]:
# !head -n 1 data/housingdata/yorksAndHumberside.csv
# The first line of this file is already the header row, so no rows are skipped.
york_humberside_df = pd.read_csv('data/housingdata/yorksAndHumberside.csv')
york_humberside_df[:5]

Unnamed: 0,district,districtname,gss
0,http://data.ordnancesurvey.co.uk/id/7000000000...,NorthYorkshire,E10000023
1,http://data.ordnancesurvey.co.uk/id/7000000000...,Doncaster,E08000017
2,http://data.ordnancesurvey.co.uk/id/7000000000...,Sheffield,E08000019
3,http://data.ordnancesurvey.co.uk/id/7000000000...,Rotherham,E08000018
4,http://data.ordnancesurvey.co.uk/id/7000000000...,Barnsley,E08000016


# Vertical joins: 
## concatenating house building data from several datasets

Suppose we want to work with a single DataFrame that contains all the annual house building starts data over the period 2009-2013. 

The pandas `concat()` function will concatenate rows from a list of DataFrames where each DataFrame shares the same column headings.

Let's create a couple of samples from the tables just to try this function out.

In [81]:
#Just use a sample of the data rows for now as we develop the code
sample1_df = bldg_2009_10_df[:3]
sample2_df = bldg_2010_11_df[:3]
sample2_df

Unnamed: 0,http://opendatacommunities.org/def/ontology/geography/refArea,Reference area,All,Housing-Associations,Local-Authority,Private-Enterprise
0,http://statistics.data.gov.uk/id/statistical-g...,E06000001 Hartlepool,230.0,90.0,0.0,140.0
1,http://statistics.data.gov.uk/id/statistical-g...,E06000003 Redcar and Cleveland,210.0,100.0,0.0,110.0
2,http://statistics.data.gov.uk/id/statistical-g...,E06000004 Stockton-on-Tees,500.0,30.0,0.0,470.0


OK, so let's test the `concat()` function on our `sample1_df` and `sample2_df` datasets:

In [82]:
# Try out the concat() function - pass in a list of DataFrames to be concatenated.
pd.concat([sample1_df, sample2_df])

Unnamed: 0,http://opendatacommunities.org/def/ontology/geography/refArea,Reference area,All,Housing-Associations,Local-Authority,Private-Enterprise
0,http://statistics.data.gov.uk/id/statistical-g...,E06000001 Hartlepool,230.0,0.0,0.0,230.0
1,http://statistics.data.gov.uk/id/statistical-g...,E06000002 Middlesbrough,280.0,130.0,0.0,160.0
2,http://statistics.data.gov.uk/id/statistical-g...,E06000003 Redcar and Cleveland,240.0,130.0,0.0,110.0
0,http://statistics.data.gov.uk/id/statistical-g...,E06000001 Hartlepool,230.0,90.0,0.0,140.0
1,http://statistics.data.gov.uk/id/statistical-g...,E06000003 Redcar and Cleveland,210.0,100.0,0.0,110.0
2,http://statistics.data.gov.uk/id/statistical-g...,E06000004 Stockton-on-Tees,500.0,30.0,0.0,470.0


That should have worked OK... We've got the rows from two DataFrames combined into a single DataFrame - the row indexes have been repeated, but the table structure looks OK.
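If the repeated row indexes are a nuisance, `concat()` can renumber the rows for us. The sketch below uses two small hand-built DataFrames (`first_df` and `second_df` are names made up for this example, with values taken from the samples above) to compare the default behaviour with `ignore_index=True`.

```python
import pandas as pd

first_df = pd.DataFrame({
    'Reference area': ['E06000001 Hartlepool', 'E06000002 Middlesbrough'],
    'All': [230.0, 280.0],
})
second_df = pd.DataFrame({
    'Reference area': ['E06000001 Hartlepool', 'E06000003 Redcar and Cleveland'],
    'All': [230.0, 210.0],
})

# Default: each DataFrame keeps its own row labels, so 0 and 1 appear twice.
kept_index_df = pd.concat([first_df, second_df])

# ignore_index=True discards the old labels and numbers the rows from 0.
fresh_index_df = pd.concat([first_df, second_df], ignore_index=True)

print(kept_index_df.index.tolist())
print(fresh_index_df.index.tolist())
```

Whether to renumber is a judgement call: the original labels can be useful for tracing a row back to its source table, but duplicate labels make label-based selection ambiguous.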

What happens when we try to concatenate two complete DataFrames?

In [83]:
bldg_2009_11_df = pd.concat([bldg_2009_10_df, bldg_2010_11_df])
# Check to see if the dataframes appear to have been concatenated 
# together by inspecting row counts.
print(len(bldg_2009_10_df), len(bldg_2010_11_df), len(bldg_2009_11_df))

300 309 609


That looks to have worked, or it did when I tried it! The original DataFrames have 300 and 309 rows respectively, and the concatenated DataFrame has 609 rows; so no rows appear to have been lost or added. 

What happens if the DataFrames have the same column names, but they appear in a different order?

In [84]:
# Create a sample DataFrame containing the *same* columns as the original but 
# in a *different* order.
sample3_df = bldg_2009_10_df[['Reference area',
                              'All',
                              'Housing-Associations','http://opendatacommunities.org/def/ontology/geography/refArea',
                              'Local-Authority',
                              'Private-Enterprise']][:3]
sample3_df

Unnamed: 0,Reference area,All,Housing-Associations,http://opendatacommunities.org/def/ontology/geography/refArea,Local-Authority,Private-Enterprise
0,E06000001 Hartlepool,230.0,0.0,http://statistics.data.gov.uk/id/statistical-g...,0.0,230.0
1,E06000002 Middlesbrough,280.0,130.0,http://statistics.data.gov.uk/id/statistical-g...,0.0,160.0
2,E06000003 Redcar and Cleveland,240.0,130.0,http://statistics.data.gov.uk/id/statistical-g...,0.0,110.0


In [85]:
# Concatenate DataFrames with the same columns, but differently ordered.
concat_difforder_df = pd.concat([sample1_df, sample3_df])
concat_difforder_df

Unnamed: 0,All,Housing-Associations,Local-Authority,Private-Enterprise,Reference area,http://opendatacommunities.org/def/ontology/geography/refArea
0,230.0,0.0,0.0,230.0,E06000001 Hartlepool,http://statistics.data.gov.uk/id/statistical-g...
1,280.0,130.0,0.0,160.0,E06000002 Middlesbrough,http://statistics.data.gov.uk/id/statistical-g...
2,240.0,130.0,0.0,110.0,E06000003 Redcar and Cleveland,http://statistics.data.gov.uk/id/statistical-g...
0,230.0,0.0,0.0,230.0,E06000001 Hartlepool,http://statistics.data.gov.uk/id/statistical-g...
1,280.0,130.0,0.0,160.0,E06000002 Middlesbrough,http://statistics.data.gov.uk/id/statistical-g...
2,240.0,130.0,0.0,110.0,E06000003 Redcar and Cleveland,http://statistics.data.gov.uk/id/statistical-g...


_pandas_ is capable of automatically aligning the named columns from such DataFrames.

### Exercise
How does the behaviour of _pandas_ compare with SQL when the columns are in different orders?

### Solution
_pandas_ can cope with the column names being in different orders because it aligns columns by name; an SQL `UNION` matches columns by position, so the columns must be in the same order.

What happens if we try to concatenate DataFrames in which the DataFrames only partially share columns (that is, there are some columns in one DataFrame that are not in the other)?

In [86]:
# Create a sample DataFrame that contains only a subset of the 
# columns from an original DataFrame.
sample4_df=bldg_2009_10_df[['Reference area','All','Housing-Associations']][:3]
sample4_df

Unnamed: 0,Reference area,All,Housing-Associations
0,E06000001 Hartlepool,230.0,0.0
1,E06000002 Middlesbrough,280.0,130.0
2,E06000003 Redcar and Cleveland,240.0,130.0


In [87]:
# Concatenate two DataFrames with different numbers of columns.
concat_diffcolumns_df = pd.concat([sample1_df, sample4_df])
concat_diffcolumns_df 

Unnamed: 0,All,Housing-Associations,Local-Authority,Private-Enterprise,Reference area,http://opendatacommunities.org/def/ontology/geography/refArea
0,230.0,0.0,0.0,230.0,E06000001 Hartlepool,http://statistics.data.gov.uk/id/statistical-g...
1,280.0,130.0,0.0,160.0,E06000002 Middlesbrough,http://statistics.data.gov.uk/id/statistical-g...
2,240.0,130.0,0.0,110.0,E06000003 Redcar and Cleveland,http://statistics.data.gov.uk/id/statistical-g...
0,230.0,0.0,,,E06000001 Hartlepool,
1,280.0,130.0,,,E06000002 Middlesbrough,
2,240.0,130.0,,,E06000003 Redcar and Cleveland,


The `concat()` function aligns columns where it can. By default, the columns in the combined DataFrame are the union of the distinctly named columns in the DataFrames being concatenated. Missing values are given a NaN value.  

This form of concatenation is a type of *outer join* in the sense that we are producing a set of columns in the output that represent the combination of columns contained in the concatenated datasets - the widest possible table - putting empty cells in rows where the original table did not have that column.

The `concat()` function uses the outer style join by default as this doesn't lose any data.

We can also force it to adopt an *inner join* behaviour in which the columns in the output DataFrame correspond to the intersection of columns from the DataFrames, that is the common columns from the original tables.  Note that the inner join loses data in columns from at least one of the tables.

In [88]:
# Explicitly use an INNER join ('inner') on the concatenation; 'outer' is the default value.
concat_inner_df = pd.concat([sample1_df, sample4_df], join='inner')
concat_inner_df

Unnamed: 0,Reference area,All,Housing-Associations
0,E06000001 Hartlepool,230.0,0.0
1,E06000002 Middlesbrough,280.0,130.0
2,E06000003 Redcar and Cleveland,240.0,130.0
0,E06000001 Hartlepool,230.0,0.0
1,E06000002 Middlesbrough,280.0,130.0
2,E06000003 Redcar and Cleveland,240.0,130.0


Notice that only the common columns appear in the result: all the other data has been lost from the resulting DataFrame.
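Before choosing `join='inner'`, it can be worth checking which columns would be dropped. The following is a minimal sketch using two hand-built DataFrames (`wide_df` and `narrow_df` are hypothetical names, loosely modelled on `sample1_df` and `sample4_df`) and the set-like operations available on a DataFrame's `columns` index.

```python
import pandas as pd

wide_df = pd.DataFrame({
    'Reference area': ['E06000001 Hartlepool'],
    'All': [230.0],
    'Local-Authority': [0.0],
})
narrow_df = pd.DataFrame({
    'Reference area': ['E06000002 Middlesbrough'],
    'All': [280.0],
})

# Columns present in both tables: these survive an inner-style concat.
shared_columns = wide_df.columns.intersection(narrow_df.columns)

# Columns present in only one of the tables: these are lost.
dropped_columns = wide_df.columns.symmetric_difference(narrow_df.columns)

inner_df = pd.concat([wide_df, narrow_df], join='inner')

print(shared_columns.tolist())
print(dropped_columns.tolist())
print(inner_df.columns.tolist())
```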

## Exercise
What problems, if any, can you see in interpreting the data in any of the concatenated datasets produced above, and how might they be resolved?

## Discussion
Although the data items represent reports from different years, we have lost that information. 

The years the reports refer to are not encoded in the actual rows of data - but as *metadata* in the initial rows of the CSV file, and embedded in the filenames.  So, in the concatenated file we have the problem of determining which rows relate to which years.

If, as each dataset is loaded, we add a column containing the year the report relates to, we can carry that information into the concatenated dataset.

### Adding in the 'which years' metadata

So how can we add in an additional data column that identifies the period the data relates to before we concatenate the separate DataFrames?

In [89]:
# YOUR ATTEMPT HERE
bldg_2009_10_df['Period']="2009-10"
bldg_2009_10_df[:3]


Unnamed: 0,http://opendatacommunities.org/def/ontology/geography/refArea,Reference area,All,Housing-Associations,Local-Authority,Private-Enterprise,Period
0,http://statistics.data.gov.uk/id/statistical-g...,E06000001 Hartlepool,230.0,0.0,0.0,230.0,2009-10
1,http://statistics.data.gov.uk/id/statistical-g...,E06000002 Middlesbrough,280.0,130.0,0.0,160.0,2009-10
2,http://statistics.data.gov.uk/id/statistical-g...,E06000003 Redcar and Cleveland,240.0,130.0,0.0,110.0,2009-10


In [None]:
# Solution
# Here's how I did it: 
bldg_2009_10_df['Period']="2009-10"
bldg_2009_10_df[:3]

In [91]:
# Now add a Period column to each annual building DataFrame you created earlier.
bldg_2010_11_df['Period']="2010-11"
bldg_2011_12_df['Period']="2011-12"
bldg_2012_13_df['Period']="2012-13"
social_lettings_2012_13_df['Period']="2012-13"


In [97]:
# And create a single DataFrame containing all the house building data 
# with rows distinguishable by period.
all_bldg_data_df = pd.concat([bldg_2009_10_df, bldg_2010_11_df, bldg_2011_12_df, bldg_2012_13_df])
print(len(all_bldg_data_df), len(bldg_2009_10_df), len(bldg_2010_11_df),len(bldg_2011_12_df), len(bldg_2012_13_df))

1248 300 309 313 326


In [102]:
all_bldg_data_df[['Period','Reference area','All']][:5]

Unnamed: 0,Period,Reference area,All
0,2009-10,E06000001 Hartlepool,230.0
1,2009-10,E06000002 Middlesbrough,280.0
2,2009-10,E06000003 Redcar and Cleveland,240.0
3,2009-10,E06000004 Stockton-on-Tees,440.0
4,2009-10,E06000005 Darlington,130.0
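Typing the period by hand for every file works, but since the years are also embedded in the filenames, the label could be derived instead. The sketch below is one possible approach under stated assumptions: `period_from_filename` is a helper invented for this example, and `frames_by_file` is a small stand-in dictionary of hand-built DataFrames rather than the real loaded files.

```python
import re
import pandas as pd

# Toy stand-ins for the loaded files, keyed by filename. Only one row each,
# to keep the example short; the filenames follow the real pattern.
frames_by_file = {
    'house-building-starts-tenure-2009-2010.csv':
        pd.DataFrame({'Reference area': ['E06000001 Hartlepool'], 'All': [230.0]}),
    'house-building-starts-tenure-2010-2011.csv':
        pd.DataFrame({'Reference area': ['E06000001 Hartlepool'], 'All': [230.0]}),
}

def period_from_filename(filename):
    """Turn '...-2009-2010.csv' into a '2009-10' style label."""
    match = re.search(r'(\d{4})-(\d{4})\.csv$', filename)
    start_year, end_year = match.groups()
    return f'{start_year}-{end_year[2:]}'

# assign() returns a copy with the extra column, leaving the originals untouched.
labelled_frames = [
    df.assign(Period=period_from_filename(name))
    for name, df in frames_by_file.items()
]
combined_df = pd.concat(labelled_frames, ignore_index=True)

print(combined_df)
```

Deriving the label from the filename avoids typing mistakes, but it does rely on every file following the same naming convention.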


## Horizontally joining data: merging data from several datasets

By inspection of the building start data and the lettings data, we see that the two datasets have some columns in common: the geographical reference area URL and the `Reference area` column, which holds the area's code and name.   

The common columns allow us to join rows of data where the values in the common columns are the same.

In [103]:
bldg_2012_13_df = pd.read_csv('data/housingdata/house-building-starts-tenure-2012-2013.csv', 
                              skiprows=5)
bldgSample_df = bldg_2012_13_df[:3]
bldgSample_df

Unnamed: 0,http://opendatacommunities.org/def/ontology/geography/refArea,Reference area,All,Housing-Associations,Local-Authority,Private-Enterprise
0,http://statistics.data.gov.uk/id/statistical-g...,E06000001 Hartlepool,130.0,10.0,0.0,120.0
1,http://statistics.data.gov.uk/id/statistical-g...,E06000002 Middlesbrough,230.0,60.0,0.0,170.0
2,http://statistics.data.gov.uk/id/statistical-g...,E06000003 Redcar and Cleveland,210.0,20.0,0.0,190.0


In [104]:
lettings_2012_13_df = pd.read_csv('data/housingdata/households-social-lettings-general-needs-rents-prp-number-bedrooms-2012-2013.csv',
                                  skiprows=5)
lettingsSample_df=lettings_2012_13_df[:3]
lettingsSample_df

Unnamed: 0,http://opendatacommunities.org/def/ontology/geography/refArea,Reference area,All bedroom sizes,Four or more bedrooms,One bedroom,Three bedrooms,Two bedrooms
0,http://statistics.data.gov.uk/id/statistical-g...,E06000001 Hartlepool,73.33,97.78,65.97,85.41,77.67
1,http://statistics.data.gov.uk/id/statistical-g...,E06000002 Middlesbrough,75.81,91.82,66.21,83.55,75.69
2,http://statistics.data.gov.uk/id/statistical-g...,E06000003 Redcar and Cleveland,79.92,104.27,70.5,87.98,79.02


It is straightforward to merge the tables horizontally using the _pandas_ `merge()` function. The first two arguments specify the data tables to be merged. Where the columns that act as the focus for merging share the same name, we can specify them in a list assigned to the `on` parameter.

*If you worked through the SQL examples earlier you'll see a similarity to the JOIN ON clause.*

In [105]:
simplemerge_df = pd.merge(bldgSample_df, lettingsSample_df,
                          on=['http://opendatacommunities.org/def/ontology/geography/refArea',
                              'Reference area'])
simplemerge_df

Unnamed: 0,http://opendatacommunities.org/def/ontology/geography/refArea,Reference area,All,Housing-Associations,Local-Authority,Private-Enterprise,All bedroom sizes,Four or more bedrooms,One bedroom,Three bedrooms,Two bedrooms
0,http://statistics.data.gov.uk/id/statistical-g...,E06000001 Hartlepool,130.0,10.0,0.0,120.0,73.33,97.78,65.97,85.41,77.67
1,http://statistics.data.gov.uk/id/statistical-g...,E06000002 Middlesbrough,230.0,60.0,0.0,170.0,75.81,91.82,66.21,83.55,75.69
2,http://statistics.data.gov.uk/id/statistical-g...,E06000003 Redcar and Cleveland,210.0,20.0,0.0,190.0,79.92,104.27,70.5,87.98,79.02


Note that we could also have merged the DataFrames on a single column. In this case, any other columns that share a name are brought in to the merged result separately, and _pandas_ automatically appends a suffix to each one so it remains uniquely labelled in the resulting DataFrame (so for example we get `Reference area_x` and `Reference area_y` in the result).  Note: *again comparable to the SQL behaviour.*

In [106]:
pd.merge(bldgSample_df, lettingsSample_df,
         on=['http://opendatacommunities.org/def/ontology/geography/refArea'])

Unnamed: 0,http://opendatacommunities.org/def/ontology/geography/refArea,Reference area_x,All,Housing-Associations,Local-Authority,Private-Enterprise,Reference area_y,All bedroom sizes,Four or more bedrooms,One bedroom,Three bedrooms,Two bedrooms
0,http://statistics.data.gov.uk/id/statistical-g...,E06000001 Hartlepool,130.0,10.0,0.0,120.0,E06000001 Hartlepool,73.33,97.78,65.97,85.41,77.67
1,http://statistics.data.gov.uk/id/statistical-g...,E06000002 Middlesbrough,230.0,60.0,0.0,170.0,E06000002 Middlesbrough,75.81,91.82,66.21,83.55,75.69
2,http://statistics.data.gov.uk/id/statistical-g...,E06000003 Redcar and Cleveland,210.0,20.0,0.0,190.0,E06000003 Redcar and Cleveland,79.92,104.27,70.5,87.98,79.02


This time we have only a single key, with five uniquely named columns from the left table and six from the right.
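The default `_x` and `_y` suffixes don't say much about where a column came from. The `merge()` function accepts a `suffixes` parameter to override them, as in this minimal sketch. The two small DataFrames and the `area_url` column name are made up for the example.

```python
import pandas as pd

bldg_df = pd.DataFrame({
    'area_url': ['http://example.org/E06000001', 'http://example.org/E06000002'],
    'Reference area': ['E06000001 Hartlepool', 'E06000002 Middlesbrough'],
    'All': [130.0, 230.0],
})
lettings_df = pd.DataFrame({
    'area_url': ['http://example.org/E06000001', 'http://example.org/E06000002'],
    'Reference area': ['E06000001 Hartlepool', 'E06000002 Middlesbrough'],
    'All bedroom sizes': [73.33, 75.81],
})

# Merge on the URL only; the clashing 'Reference area' columns get our suffixes.
merged_df = pd.merge(bldg_df, lettings_df, on='area_url',
                     suffixes=('_bldg', '_lettings'))

print(merged_df.columns.tolist())
```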

If the column names are differently labelled, we can specify them explicitly for each data table.
We can change one of the `lettingsSample_df` column names to demonstrate this.

In [107]:
# Renaming one of the merge columns in one table:
lettingsSample_df.columns = ['Ref Area Code'] + lettingsSample_df.columns[1:].tolist()
lettingsSample_df

Unnamed: 0,Ref Area Code,Reference area,All bedroom sizes,Four or more bedrooms,One bedroom,Three bedrooms,Two bedrooms
0,http://statistics.data.gov.uk/id/statistical-g...,E06000001 Hartlepool,73.33,97.78,65.97,85.41,77.67
1,http://statistics.data.gov.uk/id/statistical-g...,E06000002 Middlesbrough,75.81,91.82,66.21,83.55,75.69
2,http://statistics.data.gov.uk/id/statistical-g...,E06000003 Redcar and Cleveland,79.92,104.27,70.5,87.98,79.02


We can explicitly declare the columns we want to merge from each table using the `left_on` and `right_on` parameters (I find this confusing, and would have expected `on_left` and `on_right`). 

For the `merge()` to work, these parameters need to identify the same number of columns in the same order.

In [108]:
pd.merge(bldgSample_df, lettingsSample_df, 
         left_on=['http://opendatacommunities.org/def/ontology/geography/refArea','Reference area'],
         right_on=['Ref Area Code','Reference area'])

Unnamed: 0,http://opendatacommunities.org/def/ontology/geography/refArea,Reference area,All,Housing-Associations,Local-Authority,Private-Enterprise,Ref Area Code,All bedroom sizes,Four or more bedrooms,One bedroom,Three bedrooms,Two bedrooms
0,http://statistics.data.gov.uk/id/statistical-g...,E06000001 Hartlepool,130.0,10.0,0.0,120.0,http://statistics.data.gov.uk/id/statistical-g...,73.33,97.78,65.97,85.41,77.67
1,http://statistics.data.gov.uk/id/statistical-g...,E06000002 Middlesbrough,230.0,60.0,0.0,170.0,http://statistics.data.gov.uk/id/statistical-g...,75.81,91.82,66.21,83.55,75.69
2,http://statistics.data.gov.uk/id/statistical-g...,E06000003 Redcar and Cleveland,210.0,20.0,0.0,190.0,http://statistics.data.gov.uk/id/statistical-g...,79.92,104.27,70.5,87.98,79.02
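Notice that when the key columns have different names, both of them are kept in the result, even though they hold the same values. One way to tidy this up is to drop the redundant column after merging, as sketched below with two small hand-built DataFrames (the `area_url` column and the `example.org` URLs are assumptions for illustration).

```python
import pandas as pd

bldg_df = pd.DataFrame({
    'area_url': ['http://example.org/E06000001', 'http://example.org/E06000002'],
    'All': [130.0, 230.0],
})
lettings_df = pd.DataFrame({
    'Ref Area Code': ['http://example.org/E06000001', 'http://example.org/E06000002'],
    'All bedroom sizes': [73.33, 75.81],
})

# Differently named keys: both key columns appear in the merged result.
merged_df = pd.merge(bldg_df, lettings_df,
                     left_on='area_url', right_on='Ref Area Code')

# The two key columns are identical after an inner join, so one can go.
tidy_df = merged_df.drop(columns='Ref Area Code')

print(merged_df.columns.tolist())
print(tidy_df.columns.tolist())
```

An alternative would be to rename the key column in one table before merging, so that a plain `on=` can be used.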


## Inner joins:  the merge() default behaviour
The default behaviour of *pandas* merge is an inner join (`how='inner'`) where the results table is formed from the intersection of the joined key column values. 

Consider the example where one table has additional rows.

In [109]:
bldgSample_long_df = bldg_2012_13_df[:4] # 4 rows compared to 3 in the lettings sample.
bldgSample_long_df.columns = ['Ref Area Code'] + bldgSample_long_df.columns[1:].tolist()

pd.merge(bldgSample_long_df, lettingsSample_df, 
         on=['Ref Area Code','Reference area'])

Unnamed: 0,Ref Area Code,Reference area,All,Housing-Associations,Local-Authority,Private-Enterprise,All bedroom sizes,Four or more bedrooms,One bedroom,Three bedrooms,Two bedrooms
0,http://statistics.data.gov.uk/id/statistical-g...,E06000001 Hartlepool,130.0,10.0,0.0,120.0,73.33,97.78,65.97,85.41,77.67
1,http://statistics.data.gov.uk/id/statistical-g...,E06000002 Middlesbrough,230.0,60.0,0.0,170.0,75.81,91.82,66.21,83.55,75.69
2,http://statistics.data.gov.uk/id/statistical-g...,E06000003 Redcar and Cleveland,210.0,20.0,0.0,190.0,79.92,104.27,70.5,87.98,79.02


Again the inner join is effectively losing data: where there is no match between the key columns in the two data tables, no row is put into the resulting DataFrame.

## Outer joins

Outer joins retain rows from one, or both, of the original DataFrames even if there is no matching row from the 'other' table.

### Left outer join

In a left outer join we keep all the rows from the left table, together with any matching rows from the right table.

Let's generate a long sample from the lettings data but include some different reference areas compared to the building start data. To do this we will take rows from the top and the bottom of the original lettings DataFrame.

In [110]:
lettingsSample_long_df = pd.concat([lettings_2012_13_df[:2], lettings_2012_13_df[-2:]])
lettingsSample_long_df

Unnamed: 0,http://opendatacommunities.org/def/ontology/geography/refArea,Reference area,All bedroom sizes,Four or more bedrooms,One bedroom,Three bedrooms,Two bedrooms
0,http://statistics.data.gov.uk/id/statistical-g...,E06000001 Hartlepool,73.33,97.78,65.97,85.41,77.67
1,http://statistics.data.gov.uk/id/statistical-g...,E06000002 Middlesbrough,75.81,91.82,66.21,83.55,75.69
324,http://statistics.data.gov.uk/id/statistical-g...,E09000032 Wandsworth,126.53,148.34,116.09,138.07,129.93
325,http://statistics.data.gov.uk/id/statistical-g...,E09000033 Westminster,124.34,147.12,115.08,138.94,129.54


In [111]:
# Remind yourself of the behaviour of inner joins when there are unmatched rows.
# What happens if you try to inner join bldgSample_df and lettingsSample_long_df?


Now try a left outer join, by setting `how='left'`. What happens to the columns from the right-hand table for the unmatched rows from the left table?

In [112]:
pd.merge(bldgSample_df, lettingsSample_long_df,
         on=['http://opendatacommunities.org/def/ontology/geography/refArea','Reference area'],
         how='left')


Unnamed: 0,http://opendatacommunities.org/def/ontology/geography/refArea,Reference area,All,Housing-Associations,Local-Authority,Private-Enterprise,All bedroom sizes,Four or more bedrooms,One bedroom,Three bedrooms,Two bedrooms
0,http://statistics.data.gov.uk/id/statistical-g...,E06000001 Hartlepool,130.0,10.0,0.0,120.0,73.33,97.78,65.97,85.41,77.67
1,http://statistics.data.gov.uk/id/statistical-g...,E06000002 Middlesbrough,230.0,60.0,0.0,170.0,75.81,91.82,66.21,83.55,75.69
2,http://statistics.data.gov.uk/id/statistical-g...,E06000003 Redcar and Cleveland,210.0,20.0,0.0,190.0,,,,,


Here we see the two key columns, four unique columns from the left table and five unique columns from the right. 

The final row shows missing values in the right table's columns: it's retained the data from the unmatched rows in the original left table. 

But we still don't have all the data from both tables - maybe there were unmatched key values in the right table too.

### Right outer join

Unsurprisingly, a right join is achieved by setting `how='right'`. What happens to the columns from the left-hand table for the unmatched rows from the right table?

In [113]:
pd.merge(bldgSample_df, lettingsSample_long_df,
         on=['http://opendatacommunities.org/def/ontology/geography/refArea','Reference area'],
         how='right')

Unnamed: 0,http://opendatacommunities.org/def/ontology/geography/refArea,Reference area,All,Housing-Associations,Local-Authority,Private-Enterprise,All bedroom sizes,Four or more bedrooms,One bedroom,Three bedrooms,Two bedrooms
0,http://statistics.data.gov.uk/id/statistical-g...,E06000001 Hartlepool,130.0,10.0,0.0,120.0,73.33,97.78,65.97,85.41,77.67
1,http://statistics.data.gov.uk/id/statistical-g...,E06000002 Middlesbrough,230.0,60.0,0.0,170.0,75.81,91.82,66.21,83.55,75.69
2,http://statistics.data.gov.uk/id/statistical-g...,E09000032 Wandsworth,,,,,126.53,148.34,116.09,138.07,129.93
3,http://statistics.data.gov.uk/id/statistical-g...,E09000033 Westminster,,,,,124.34,147.12,115.08,138.94,129.54
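
A right join can also be read as a left join with the two tables swapped. The sketch below checks this on two small made-up DataFrames (`left_df` and `right_df`, keyed on a hypothetical `area` column) rather than on the housing data; only the column order differs between the two results.

```python
import pandas as pd

# Toy tables (illustrative values only, not the DCLG data).
left_df = pd.DataFrame({'area': ['A', 'B', 'C'], 'starts': [130, 230, 210]})
right_df = pd.DataFrame({'area': ['A', 'B', 'D'], 'rent': [73.33, 75.81, 126.53]})

# Right join of left_df with right_df...
right_join = pd.merge(left_df, right_df, on='area', how='right')

# ...and left join with the tables swapped.
swapped_left_join = pd.merge(right_df, left_df, on='area', how='left')

# Put the columns in the same order and sort the rows before comparing.
a = right_join.sort_values('area').reset_index(drop=True)
b = (swapped_left_join[right_join.columns]
     .sort_values('area').reset_index(drop=True))
same_rows = a.equals(b)
print(same_rows)
```

In practice this means you rarely need `how='right'`: swapping the argument order and using `how='left'` gives the same rows.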


### Full outer join
A full outer join, retaining unmatched rows from both tables, can be achieved by setting `how='outer'`. 

What happens to the unmatched rows from each table?

In [114]:
pd.merge(bldgSample_df, lettingsSample_long_df,
         on=['http://opendatacommunities.org/def/ontology/geography/refArea','Reference area'],
         how='outer')

Unnamed: 0,http://opendatacommunities.org/def/ontology/geography/refArea,Reference area,All,Housing-Associations,Local-Authority,Private-Enterprise,All bedroom sizes,Four or more bedrooms,One bedroom,Three bedrooms,Two bedrooms
0,http://statistics.data.gov.uk/id/statistical-g...,E06000001 Hartlepool,130.0,10.0,0.0,120.0,73.33,97.78,65.97,85.41,77.67
1,http://statistics.data.gov.uk/id/statistical-g...,E06000002 Middlesbrough,230.0,60.0,0.0,170.0,75.81,91.82,66.21,83.55,75.69
2,http://statistics.data.gov.uk/id/statistical-g...,E06000003 Redcar and Cleveland,210.0,20.0,0.0,190.0,,,,,
3,http://statistics.data.gov.uk/id/statistical-g...,E09000032 Wandsworth,,,,,126.53,148.34,116.09,138.07,129.93
4,http://statistics.data.gov.uk/id/statistical-g...,E09000033 Westminster,,,,,124.34,147.12,115.08,138.94,129.54
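
If you want to see which table each row of an outer join came from, `pd.merge()` accepts an `indicator` argument. The following is a minimal sketch on two made-up DataFrames (the `area`, `starts` and `rent` columns are assumptions for illustration, not the housing data).

```python
import pandas as pd

# Toy tables (illustrative values only, not the DCLG data).
left_df = pd.DataFrame({'area': ['A', 'B', 'C'], 'starts': [130, 230, 210]})
right_df = pd.DataFrame({'area': ['A', 'B', 'D'], 'rent': [73.33, 75.81, 126.53]})

# indicator=True adds a _merge column recording the origin of each row:
# 'both', 'left_only' or 'right_only'.
outer_df = pd.merge(left_df, right_df, on='area', how='outer', indicator=True)

# Count how many rows matched in both tables or came from only one of them.
origin_counts = outer_df['_merge'].value_counts().to_dict()
print(outer_df)
print(origin_counts)
```

Filtering on the `_merge` column is a convenient way to list the unmatched rows from either side.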


# What happens if a key in one table matches key values in several rows in the second table? 

*(Note: if you know about relationship modelling, this represents a one-to-many relationship.)*

Let's generate a sample DataFrame that has several rows containing the same (repeated) reference area.


In [115]:
# Two rows from each of two building DataFrames - to create a DataFrame in which 
# rows have duplicate values for Reference area.  
bldg_sample_mixed_df = pd.concat([ bldg_2009_10_df[:2], bldg_2012_13_df[:2] ])
bldg_sample_mixed_df

Unnamed: 0,All,Housing-Associations,Local-Authority,Period,Private-Enterprise,Reference area,http://opendatacommunities.org/def/ontology/geography/refArea
0,230.0,0.0,0.0,2009-10,230.0,E06000001 Hartlepool,http://statistics.data.gov.uk/id/statistical-g...
1,280.0,130.0,0.0,2009-10,160.0,E06000002 Middlesbrough,http://statistics.data.gov.uk/id/statistical-g...
0,130.0,10.0,0.0,,120.0,E06000001 Hartlepool,http://statistics.data.gov.uk/id/statistical-g...
1,230.0,60.0,0.0,,170.0,E06000002 Middlesbrough,http://statistics.data.gov.uk/id/statistical-g...


### Exercise
We can then explore what happens when we try to merge a DataFrame with one unique reference area per row with a DataFrame where there may be multiple rows with repeated values.

What happens for the various joins (inner, left, right, outer) when applied to `bldg_sample_mixed_df` (which has two rows for Hartlepool and two for Middlesbrough) and `lettingsSample_long_df` (which has one row each for Hartlepool, Middlesbrough, Wandsworth and Westminster)?

In [116]:
# What happens with the inner join on bldg_sample_mixed_df and lettingsSample_long_df?
pd.merge(bldg_sample_mixed_df, lettingsSample_long_df,
         on=['http://opendatacommunities.org/def/ontology/geography/refArea']  )

Unnamed: 0,All,Housing-Associations,Local-Authority,Period,Private-Enterprise,Reference area_x,http://opendatacommunities.org/def/ontology/geography/refArea,Reference area_y,All bedroom sizes,Four or more bedrooms,One bedroom,Three bedrooms,Two bedrooms
0,230.0,0.0,0.0,2009-10,230.0,E06000001 Hartlepool,http://statistics.data.gov.uk/id/statistical-g...,E06000001 Hartlepool,73.33,97.78,65.97,85.41,77.67
1,130.0,10.0,0.0,,120.0,E06000001 Hartlepool,http://statistics.data.gov.uk/id/statistical-g...,E06000001 Hartlepool,73.33,97.78,65.97,85.41,77.67
2,280.0,130.0,0.0,2009-10,160.0,E06000002 Middlesbrough,http://statistics.data.gov.uk/id/statistical-g...,E06000002 Middlesbrough,75.81,91.82,66.21,83.55,75.69
3,230.0,60.0,0.0,,170.0,E06000002 Middlesbrough,http://statistics.data.gov.uk/id/statistical-g...,E06000002 Middlesbrough,75.81,91.82,66.21,83.55,75.69


#### Discussion
If one row in one table matches several rows in the other, a joined row is produced for each match, so the values from the single-row side are repeated. For the inner join any unmatched rows are lost.

In [117]:
# What happens to the left join on bldg_sample_mixed_df and lettingsSample_long_df?
pd.merge(bldg_sample_mixed_df, lettingsSample_long_df,
         on=['http://opendatacommunities.org/def/ontology/geography/refArea'],
         how='left')

Unnamed: 0,All,Housing-Associations,Local-Authority,Period,Private-Enterprise,Reference area_x,http://opendatacommunities.org/def/ontology/geography/refArea,Reference area_y,All bedroom sizes,Four or more bedrooms,One bedroom,Three bedrooms,Two bedrooms
0,230.0,0.0,0.0,2009-10,230.0,E06000001 Hartlepool,http://statistics.data.gov.uk/id/statistical-g...,E06000001 Hartlepool,73.33,97.78,65.97,85.41,77.67
1,280.0,130.0,0.0,2009-10,160.0,E06000002 Middlesbrough,http://statistics.data.gov.uk/id/statistical-g...,E06000002 Middlesbrough,75.81,91.82,66.21,83.55,75.69
2,130.0,10.0,0.0,,120.0,E06000001 Hartlepool,http://statistics.data.gov.uk/id/statistical-g...,E06000001 Hartlepool,73.33,97.78,65.97,85.41,77.67
3,230.0,60.0,0.0,,170.0,E06000002 Middlesbrough,http://statistics.data.gov.uk/id/statistical-g...,E06000002 Middlesbrough,75.81,91.82,66.21,83.55,75.69


#### Discussion 
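This is the left outer join, so no rows are lost from the first (left-most) table (`bldg_sample_mixed_df`), and any row in the second table that matches several rows in the first is repeated in the result. Unmatched left rows would have their right-hand columns filled with `NaN`; here every left row has a match, so the only `NaN` values are those already present in the `Period` column.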

In [118]:
# What happens to the right outer join on bldg_sample_mixed_df and lettingsSample_long_df?
pd.merge(bldg_sample_mixed_df, lettingsSample_long_df,
         on=['http://opendatacommunities.org/def/ontology/geography/refArea'],
         how='right')

Unnamed: 0,All,Housing-Associations,Local-Authority,Period,Private-Enterprise,Reference area_x,http://opendatacommunities.org/def/ontology/geography/refArea,Reference area_y,All bedroom sizes,Four or more bedrooms,One bedroom,Three bedrooms,Two bedrooms
0,230.0,0.0,0.0,2009-10,230.0,E06000001 Hartlepool,http://statistics.data.gov.uk/id/statistical-g...,E06000001 Hartlepool,73.33,97.78,65.97,85.41,77.67
1,130.0,10.0,0.0,,120.0,E06000001 Hartlepool,http://statistics.data.gov.uk/id/statistical-g...,E06000001 Hartlepool,73.33,97.78,65.97,85.41,77.67
2,280.0,130.0,0.0,2009-10,160.0,E06000002 Middlesbrough,http://statistics.data.gov.uk/id/statistical-g...,E06000002 Middlesbrough,75.81,91.82,66.21,83.55,75.69
3,230.0,60.0,0.0,,170.0,E06000002 Middlesbrough,http://statistics.data.gov.uk/id/statistical-g...,E06000002 Middlesbrough,75.81,91.82,66.21,83.55,75.69
4,,,,,,,http://statistics.data.gov.uk/id/statistical-g...,E09000032 Wandsworth,126.53,148.34,116.09,138.07,129.93
5,,,,,,,http://statistics.data.gov.uk/id/statistical-g...,E09000033 Westminster,124.34,147.12,115.08,138.94,129.54


#### Discussion
This is the right outer join, so no rows are lost from the second (right-most) table (`lettingsSample_long_df`), and any rows in it that match several rows in the first table are repeated in the result. The unmatched Wandsworth and Westminster rows have `NaN` in the columns that come from the first table.

In [119]:
# What happens to the full outer join on bldg_sample_mixed_df and lettingsSample_long_df?
pd.merge(bldg_sample_mixed_df, lettingsSample_long_df,
         on=['http://opendatacommunities.org/def/ontology/geography/refArea','Reference area'],
         how='outer')

Unnamed: 0,All,Housing-Associations,Local-Authority,Period,Private-Enterprise,Reference area,http://opendatacommunities.org/def/ontology/geography/refArea,All bedroom sizes,Four or more bedrooms,One bedroom,Three bedrooms,Two bedrooms
0,230.0,0.0,0.0,2009-10,230.0,E06000001 Hartlepool,http://statistics.data.gov.uk/id/statistical-g...,73.33,97.78,65.97,85.41,77.67
1,130.0,10.0,0.0,,120.0,E06000001 Hartlepool,http://statistics.data.gov.uk/id/statistical-g...,73.33,97.78,65.97,85.41,77.67
2,280.0,130.0,0.0,2009-10,160.0,E06000002 Middlesbrough,http://statistics.data.gov.uk/id/statistical-g...,75.81,91.82,66.21,83.55,75.69
3,230.0,60.0,0.0,,170.0,E06000002 Middlesbrough,http://statistics.data.gov.uk/id/statistical-g...,75.81,91.82,66.21,83.55,75.69
4,,,,,,E09000032 Wandsworth,http://statistics.data.gov.uk/id/statistical-g...,126.53,148.34,116.09,138.07,129.93
5,,,,,,E09000033 Westminster,http://statistics.data.gov.uk/id/statistical-g...,124.34,147.12,115.08,138.94,129.54


#### Discussion
This is the full outer join, so no rows are lost from either table, and any row that matches several rows in the other table is repeated in the result. The usual NaNs fill the unmatched row values.
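
The row counts seen in the four joins above can be reproduced on a small made-up example. This is only a sketch: `many_df` and `one_df` and their columns are hypothetical stand-ins for the building and lettings samples. It also shows the `validate` argument of `pd.merge()`, which checks that the keys really do have the relationship you expect.

```python
import pandas as pd

# Toy one-to-many example (made-up values): many_df repeats each key,
# one_df has one row per key plus two keys that many_df lacks.
many_df = pd.DataFrame({'area': ['A', 'B', 'A', 'B'],
                        'period': ['2009-10', '2009-10', '2012-13', '2012-13']})
one_df = pd.DataFrame({'area': ['A', 'B', 'C', 'D'],
                       'rent': [73.33, 75.81, 126.53, 124.34]})

# Number of result rows for each join type.
row_counts = {how: len(pd.merge(many_df, one_df, on='area', how=how))
              for how in ['inner', 'left', 'right', 'outer']}
print(row_counts)

# validate raises pandas.errors.MergeError if the right-hand keys
# turn out not to be unique.
checked_df = pd.merge(many_df, one_df, on='area', how='inner',
                      validate='many_to_one')
```

Using `validate` is a cheap guard against accidentally multiplying rows when a key you assumed to be unique is not.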

### Merging data tables where one key column represents a unique part of another

Where several datasets use a common scheme of identifiers for the same elements or entities, it is easy enough to merge the datasets using the column that contains the common identifier values. 

In the above examples, we were able to merge data about housing build starts and letting prices across UK administrative areas using the reference area names and/or codes - which the two datasets had in common.

In some cases, usually where the datasets have been generated by different organisations or with different data models, the values of the identifiers used in one dataset may only partially match the identifiers in another.  

Sometimes, it is possible for us to recreate the identifiers used in one scheme from the identifiers used in another. For example, if one dataset gave a reference area code in an abbreviated form, such as E06000001, we could generate the full identifier http://statistics.data.gov.uk/id/statistical-geography/E06000001 from it. 
This is because the full identifier has a regular pattern, http://statistics.data.gov.uk/id/statistical-geography/AREACODE, so given an AREACODE we can recreate the identifier.
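
As a minimal sketch of this idea, assuming a hypothetical table `codes_df` that holds only abbreviated codes in an `areacode` column, the full identifier can be built by prepending the fixed prefix, and the code can be recovered again as the last path segment:

```python
import pandas as pd

# Prefix taken from the identifier pattern described above.
PREFIX = 'http://statistics.data.gov.uk/id/statistical-geography/'

# Hypothetical table holding only the abbreviated codes.
codes_df = pd.DataFrame({'areacode': ['E06000001', 'E06000002']})

# Recreate the full identifier by prepending the fixed prefix to each code.
codes_df['refArea'] = PREFIX + codes_df['areacode']

# Going the other way: recover the code as the last path segment.
codes_df['recovered'] = codes_df['refArea'].str.rsplit('/', n=1).str[-1]
print(codes_df)
```

Either derived column could then be used as the key in a `pd.merge()` with a dataset that uses the other form of identifier.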

At other times, the partial match may be more problematic. For example, is 'Open Uni' the same as 'Open University'? Such issues are more in the nature of cleansing and harmonisation issues. More involved data cleansing and harmonisation processes are required to cope with such considerations, which we will ignore for now.

### Well-behaved partial matches
The data file `data/housingdata/yorksAndHumberside.csv` contains a list of administrative areas in the Yorkshire and Humberside region. There are three columns in the dataset, with values taking the form http://data.ordnancesurvey.co.uk/id/7000000000022028, NorthYorkshire, E10000023.

In [120]:
pd.read_csv('data/housingdata/yorksAndHumberside.csv')[0:12]

Unnamed: 0,district,districtname,gss
0,http://data.ordnancesurvey.co.uk/id/7000000000...,NorthYorkshire,E10000023
1,http://data.ordnancesurvey.co.uk/id/7000000000...,Doncaster,E08000017
2,http://data.ordnancesurvey.co.uk/id/7000000000...,Sheffield,E08000019
3,http://data.ordnancesurvey.co.uk/id/7000000000...,Rotherham,E08000018
4,http://data.ordnancesurvey.co.uk/id/7000000000...,Barnsley,E08000016
5,http://data.ordnancesurvey.co.uk/id/7000000000...,Kirklees,E08000034
6,http://data.ordnancesurvey.co.uk/id/7000000000...,Leeds,E08000035
7,http://data.ordnancesurvey.co.uk/id/7000000000...,Calderdale,E08000033
8,http://data.ordnancesurvey.co.uk/id/7000000000...,Bradford,E08000032
9,http://data.ordnancesurvey.co.uk/id/7000000000...,Wakefield,E08000036


These contrast with the way administrative areas are recorded in the DCLG datasets, which take the form of two columns e.g. http://statistics.data.gov.uk/id/statistical-geography/E06000002 and `E06000002 Middlesbrough`.  (And of course the DCLG datasets contain data from all over the country, not just Yorkshire and Humberside.)

In [121]:
pd.read_csv('data/housingdata/house-building-starts-tenure-2009-2010.csv', skiprows=5)[0:12]

Unnamed: 0,http://opendatacommunities.org/def/ontology/geography/refArea,Reference area,All,Housing-Associations,Local-Authority,Private-Enterprise
0,http://statistics.data.gov.uk/id/statistical-g...,E06000001 Hartlepool,230.0,0.0,0.0,230.0
1,http://statistics.data.gov.uk/id/statistical-g...,E06000002 Middlesbrough,280.0,130.0,0.0,160.0
2,http://statistics.data.gov.uk/id/statistical-g...,E06000003 Redcar and Cleveland,240.0,130.0,0.0,110.0
3,http://statistics.data.gov.uk/id/statistical-g...,E06000004 Stockton-on-Tees,440.0,50.0,0.0,390.0
4,http://statistics.data.gov.uk/id/statistical-g...,E06000005 Darlington,130.0,0.0,0.0,130.0
5,http://statistics.data.gov.uk/id/statistical-g...,E06000007 Warrington,480.0,20.0,0.0,460.0
6,http://statistics.data.gov.uk/id/statistical-g...,E06000008 Blackburn with Darwen,200.0,110.0,0.0,90.0
7,http://statistics.data.gov.uk/id/statistical-g...,E06000009 Blackpool,120.0,110.0,0.0,20.0
8,http://statistics.data.gov.uk/id/statistical-g...,"E06000010 Kingston upon Hull, City of",160.0,20.0,0.0,140.0
9,http://statistics.data.gov.uk/id/statistical-g...,E06000011 East Riding of Yorkshire,360.0,0.0,0.0,360.0


## Exercise
Looking at the two datasets, the `Reference area` of the DCLG data looks to be formed by joining the `gss` and `districtname` values into a single string. However, if you look at index row 11 of the Ordnance Survey data and index row 11 of the DCLG data then you'll see that the spacing between the elements of the district names differs.  Closer inspection suggests that the `gss` values are unique for each district, and are properly formed in the DCLG data.

So, if you wanted to join these datasets there is some harmonisation required first.

Describe how you would adjust the DCLG data so that you could create a DataFrame which could be joined with the Ordnance Survey dataset.

## Discussion
This requires the `Reference area` column values to be split after the first space.  The string up to the first space should be copied into a new column - this string is the `gss` value, which can be matched to the Ordnance Survey `gss` values.
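
One way to sketch this split, using a few `Reference area` values in the DCLG style, is the vectorised `.str.split()` method with `n=1`, so that the string is split once only and multi-word district names stay intact. The DataFrame `dclg_df` and the `name` column are illustrative assumptions.

```python
import pandas as pd

# A few Reference area values in the DCLG style.
dclg_df = pd.DataFrame({'Reference area': ['E06000001 Hartlepool',
                                           'E06000003 Redcar and Cleveland',
                                           'E06000010 Kingston upon Hull, City of']})

# Split once, at the first space only; expand=True returns a DataFrame
# with one column per piece.
parts = dclg_df['Reference area'].str.split(' ', n=1, expand=True)
dclg_df['gss'] = parts[0]    # the code before the first space
dclg_df['name'] = parts[1]   # everything after the first space
print(dclg_df)
```

This is an alternative to applying a splitting function row by row; both approaches give the same `gss` values.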

## Exercise

(a) Read the 2012-13 housing data from the DCLG dataset into a DataFrame, then split the `gss` code values out into a new column.

(b) Join the housing data with the Ordnance Survey data on the `gss` column, so that the result is the data for Yorkshire and Humberside only.


In [125]:
# Your solution (a).
housing_2012_13_df = pd.read_csv('data/housingdata/house-building-starts-tenure-2012-2013.csv', skiprows=5)
housing_2012_13_df[:5]


Unnamed: 0,http://opendatacommunities.org/def/ontology/geography/refArea,Reference area,All,Housing-Associations,Local-Authority,Private-Enterprise
0,http://statistics.data.gov.uk/id/statistical-g...,E06000001 Hartlepool,130.0,10.0,0.0,120.0
1,http://statistics.data.gov.uk/id/statistical-g...,E06000002 Middlesbrough,230.0,60.0,0.0,170.0
2,http://statistics.data.gov.uk/id/statistical-g...,E06000003 Redcar and Cleveland,210.0,20.0,0.0,190.0
3,http://statistics.data.gov.uk/id/statistical-g...,E06000004 Stockton-on-Tees,490.0,40.0,0.0,450.0
4,http://statistics.data.gov.uk/id/statistical-g...,E06000005 Darlington,100.0,0.0,10.0,80.0


In [130]:
columnsplitter = lambda x: pd.Series([i for i in (x.split(' '))])

splitgss = housing_2012_13_df['Reference area'].apply(columnsplitter)[0]

housing_2012_13_df['gss'] = splitgss

housing_2012_13_df[:6]

Unnamed: 0,http://opendatacommunities.org/def/ontology/geography/refArea,Reference area,All,Housing-Associations,Local-Authority,Private-Enterprise,gss
0,http://statistics.data.gov.uk/id/statistical-g...,E06000001 Hartlepool,130.0,10.0,0.0,120.0,E06000001
1,http://statistics.data.gov.uk/id/statistical-g...,E06000002 Middlesbrough,230.0,60.0,0.0,170.0,E06000002
2,http://statistics.data.gov.uk/id/statistical-g...,E06000003 Redcar and Cleveland,210.0,20.0,0.0,190.0,E06000003
3,http://statistics.data.gov.uk/id/statistical-g...,E06000004 Stockton-on-Tees,490.0,40.0,0.0,450.0,E06000004
4,http://statistics.data.gov.uk/id/statistical-g...,E06000005 Darlington,100.0,0.0,10.0,80.0,E06000005
5,http://statistics.data.gov.uk/id/statistical-g...,E06000008 Blackburn with Darwen,210.0,50.0,0.0,160.0,E06000008


In [None]:
# Your solution (b).

In [129]:
# Solution (a)
# Break this down with each step in a separate cell if you want to see what the intermediate 
#  results are like.

# First bring in the 12-13 dataset.
housing1213_df = pd.read_csv('data/housingdata/house-building-starts-tenure-2012-2013.csv', 
                             skiprows=5)

# Now extract the first part of the Reference area column values into a new Series.
# We saw how to do this when cleaning data, but this time we only need the first column named [0].
columnssplitter = lambda x: pd.Series([i for i in (x.split(' '))])

split_gss = housing1213_df['Reference area'].apply(columnssplitter)[0]

# Add this Series to the housing1213_df DataFrame with the column name gss.
housing1213_df['gss'] = split_gss

# A quick check to see that the added column looks correct.
housing1213_df[:4]

Unnamed: 0,http://opendatacommunities.org/def/ontology/geography/refArea,Reference area,All,Housing-Associations,Local-Authority,Private-Enterprise,gss
0,http://statistics.data.gov.uk/id/statistical-g...,E06000001 Hartlepool,130.0,10.0,0.0,120.0,E06000001
1,http://statistics.data.gov.uk/id/statistical-g...,E06000002 Middlesbrough,230.0,60.0,0.0,170.0,E06000002
2,http://statistics.data.gov.uk/id/statistical-g...,E06000003 Redcar and Cleveland,210.0,20.0,0.0,190.0,E06000003
3,http://statistics.data.gov.uk/id/statistical-g...,E06000004 Stockton-on-Tees,490.0,40.0,0.0,450.0,E06000004


In [131]:
# Solution (b)
# Unusually here we actually want to lose data; 
# we want to lose any rows that don't match in the Ordnance Survey data, 
# and those in the Ordnance Survey data that don't match in the housing data.
# So we want an inner join, and we want to retain those rows that have matching gss values.

# First read in the Ordnance Survey data
OSdata_df = pd.read_csv('data/housingdata/yorksAndHumberside.csv')

# Then merge the two datasets on the gss columns.
combineddata_df = pd.merge(housing1213_df, OSdata_df, on=['gss'])

# If I were tidying this table up for use later I'd probably lose the 
# refArea and Reference area columns now, and possibly the district column as well. 
# (Of course, that depends on what I was using the dataset for, and what I 
# intended doing with it later.)
YorkshireHumbersideHousing201213_df = combineddata_df[['gss', 
                                                       'districtname',
                                                       'All', 
                                                       'Housing-Associations',
                                                       'Local-Authority', 
                                                       'Private-Enterprise' ]]
YorkshireHumbersideHousing201213_df

Unnamed: 0,gss,districtname,All,Housing-Associations,Local-Authority,Private-Enterprise
0,E06000010,Cityof Kingston upon Hull,450.0,0.0,0.0,450.0
1,E06000013,NorthLincolnshire,180.0,50.0,0.0,130.0
2,E06000014,York,160.0,30.0,0.0,130.0
3,E08000016,Barnsley,710.0,10.0,0.0,700.0
4,E08000017,Doncaster,390.0,30.0,0.0,360.0
5,E08000018,Rotherham,690.0,70.0,70.0,550.0
6,E08000019,Sheffield,280.0,0.0,0.0,280.0
7,E08000032,Bradford,400.0,60.0,0.0,340.0
8,E08000033,Calderdale,220.0,60.0,0.0,160.0
9,E08000034,Kirklees,500.0,20.0,20.0,460.0


# _pandas_  joins summary


DataFrames can be joined vertically in *pandas* using the `concat()` function, which implements the notion of the `inner` and `outer` union for non-union compatible DataFrames. (This permits tables that don't have the same number of columns to be unioned.)
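
As a reminder of how the two kinds of union differ, here is a minimal sketch using two made-up DataFrames that share only some of their columns (`a_df`, `b_df` and their values are assumptions for illustration).

```python
import pandas as pd

# Toy tables sharing only some columns (made-up values).
a_df = pd.DataFrame({'area': ['A', 'B'], 'All': [130, 230],
                     'Period': ['2009-10', '2009-10']})
b_df = pd.DataFrame({'area': ['A', 'B'], 'All': [100, 200]})

# join='outer' (the default) keeps every column, filling gaps with NaN.
outer_union = pd.concat([a_df, b_df], join='outer', ignore_index=True)

# join='inner' keeps only the columns common to both tables.
inner_union = pd.concat([a_df, b_df], join='inner', ignore_index=True)

print(list(outer_union.columns))
print(list(inner_union.columns))
```

Both results have four rows; they differ only in which columns survive.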

Horizontal joins are achieved using `merge()`. The `how` parameter of *pandas* `merge()` supports `inner`, `left`, `right` and `outer` joins, where `outer` is the full outer join.


# What next?

In this Notebook, you have seen examples of a number of techniques for combining data from several tabular datasets.  Extending and enhancing a dataset with data from other datasets is a common requirement - the building block of complex analysis.

Once again you will benefit from a build up of case knowledge and experience. Feel free to add to this Notebook as you come up with your own techniques for joining datasets.

If you are working through this Notebook as part of an inline exercise, return to the module materials now. If you are working through this set of Notebooks as a whole, move on to `03.4 Handling missing data`.