# Phase 2 - Transform - Data

Our unified data model is actually pretty clean, so there aren't many transformations to do, but we'll be doing a few just as examples.

Just in case you forgot, you can query the view just like any table:

````sql
-- select everything
select * from joined_data;

-- select count of movies by category
select category, count(film_id) as total 
from joined_data 
group by category
order by total desc;
````

However, since our view was built to always have up-to-date data, it reruns all of its joins every time it is queried. If you try to execute the second query above, you'll get this result: `Result: 16 rows returned in 14543ms`

That's too much time to get 16 rows! So here's another cool feature: temporary tables. A temporary table stores a copy of the rows, so queries against it don't have to repeat the joins. Just run the SQL below to create one from our view:

````sql
create temporary table temp as
    select * from joined_data;
````

Now if you try the same query again:

````sql
select category, count(film_id) as total 
from temp 
group by category
order by total desc;
````

That's more like it: `Result: 16 rows returned in 106ms`
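The difference between the view and the temporary table can be reproduced on a toy database. The sketch below uses Python's built-in `sqlite3` module; the three-row `film` table is made up for illustration and stands in for the real `joined_data` view. It also shows the price of the speedup: the temporary table is a snapshot, so it does not see rows added after it was created.

````python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- made-up stand-in for the real data
    create table film (film_id integer primary key, category text);
    insert into film values (1, 'Action'), (2, 'Action'), (3, 'Comedy');

    -- a view is re-evaluated on every query...
    create view joined_data as select * from film;

    -- ...while a temporary table stores a snapshot of its rows
    create temporary table temp as select * from joined_data;
""")

query = """
    select category, count(film_id) as total
    from {source}
    group by category
    order by total desc
"""
from_view = conn.execute(query.format(source="joined_data")).fetchall()
from_temp = conn.execute(query.format(source="temp")).fetchall()
print(from_view)
print(from_temp)

# the snapshot does not see rows inserted after it was created
conn.execute("insert into film values (4, 'Comedy')")
view_count = conn.execute("select count(*) from joined_data").fetchone()[0]
temp_count = conn.execute("select count(*) from temp").fetchone()[0]
print(view_count, temp_count)
````

This is why the temporary table has to be dropped and recreated whenever the view changes.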

**Use the temporary table to inspect the data and see if you come up with possible transformations before loading everything into our data warehouse (I found 4!).**

Here they are:

* format `payment_date` to have the same format as the other dates
* check for null values and replace them
* remove the column `film_id:1`
* capitalize the `name`, `title` and `actor` columns

Let's go through each one.

## Format `payment_date`

This column is formatted as `2005-05-25T10:30:37.000Z`, while the other ones are formatted as `2005-05-25 11:30:37`, so we should normalize them to prevent headaches afterwards.

We're going to apply this change as early as possible so we don't have to come back to it later, and that means the first query:

````sql
select a.*, 
b.name, b.address, b.phone, b."zip code" as zip_code, b.city, b.country, 
c.name as staff_name, c.address as staff_address, c.phone as staff_phone, c.city as staff_city, c.country as staff_country,
d.rental_date, d.return_date,
e.address as store_address, e.district as store_district, e.city as store_city,
f.film_id 
from payment a
join customer b on b.id  = a.customer_id 
join staff c on c.id = a.staff_id 
join rentals d on d.rental_id = a.rental_id
join store e on e.store_id = c.sid
join inventory f on f.inventory_id = d.inventory_id
````

Which we'll be changing to:

````sql
select a.payment_id, a.customer_id, a.staff_id, a.rental_id, a.amount,  substr(replace(payment_date, 'T', ' '), 0, 20) as payment_date, -- only line changed
b.name, b.address, b.phone, b."zip code" as zip_code, b.city, b.country, 
c.name as staff_name, c.address as staff_address, c.phone as staff_phone, c.city as staff_city, c.country as staff_country,
d.rental_date, d.return_date,
e.address as store_address, e.district as store_district, e.city as store_city,
f.film_id 
from payment a
join customer b on b.id  = a.customer_id 
join staff c on c.id = a.staff_id 
join rentals d on d.rental_id = a.rental_id
join store e on e.store_id = c.sid
join inventory f on f.inventory_id = d.inventory_id;
````
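To see what the changed expression does, you can run it on a single value. This is a minimal sketch using Python's built-in `sqlite3` module, with the sample timestamp quoted earlier in this section. SQLite string positions start at 1, so `substr(x, 0, 20)` actually returns the first 19 characters, which is exactly the length of `YYYY-MM-DD HH:MM:SS`. The second call spells the same thing out with an explicit start of 1 and length of 19.

````python
import sqlite3

conn = sqlite3.connect(":memory:")
raw = "2005-05-25T10:30:37.000Z"

# the expression used in the query: start 0, length 20
as_written = conn.execute(
    "select substr(replace(?, 'T', ' '), 0, 20)", (raw,)
).fetchone()[0]

# equivalent form with an explicit 1-based start and a length of 19
explicit = conn.execute(
    "select substr(replace(?, 'T', ' '), 1, 19)", (raw,)
).fetchone()[0]

print(as_written)  # 2005-05-25 10:30:37
print(as_written == explicit)  # True
````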

## Null verification

Remember that not all films had actors? Maybe there are other columns with missing values. Let's run the following (we could, of course, quickly load the data into a pandas dataframe and check for nulls instead of writing this large query):

````sql
select * from
(select count(film_id) as rental_id from temp where rental_id is null) ,
(select count(film_id) as amount from temp where amount is null) ,
(select count(film_id) as payment_date from temp where payment_date is null) ,
(select count(film_id) as name from temp where name is null) ,
(select count(film_id) as address from temp where address is null) ,
(select count(film_id) as phone from temp where phone is null) ,
(select count(film_id) as zip_code from temp where zip_code is null) ,
(select count(film_id) as city from temp where city is null) ,
(select count(film_id) as country from temp where country is null) ,
(select count(film_id) as staff_name from temp where staff_name is null) ,
(select count(film_id) as staff_address from temp where staff_address is null) ,
(select count(film_id) as staff_phone from temp where staff_phone is null) ,
(select count(film_id) as staff_city from temp where staff_city is null) ,
(select count(film_id) as staff_country from temp where staff_country is null) ,
(select count(film_id) as rental_date from temp where rental_date is null) ,
(select count(film_id) as return_date from temp where return_date is null) ,
(select count(film_id) as store_address from temp where store_address is null) ,
(select count(film_id) as store_district from temp where store_district is null) ,
(select count(film_id) as store_city from temp where store_city is null) ,
(select count(film_id) as title from temp where title is null),
(select count(film_id) as description from temp where description is null) ,
(select count(film_id) as release_year from temp where release_year is null) ,
(select count(film_id) as rental_duration from temp where rental_duration is null) ,
(select count(film_id) as rental_rate from temp where rental_rate is null) ,
(select count(film_id) as length from temp where length is null) ,
(select count(film_id) as replacement_cost from temp where replacement_cost is null) ,
(select count(film_id) as rating from temp where rating is null) ,
(select count(film_id) as special_features from temp where special_features is null) ,
(select count(film_id) as language from temp where language is null) ,
(select count(film_id) as category from temp where category is null) ,
(select count(film_id) as actor from temp where actor is null);
````
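Writing one subquery per column gets tedious, so the same check can also be generated from the column list. The sketch below uses Python's built-in `sqlite3` module rather than pandas. The `null_counts` helper and the three-row `temp` table are hypothetical stand-ins, not part of our database.

````python
import sqlite3

def null_counts(conn, table):
    """Return {column: number of NULLs} for every column of `table`."""
    # an empty result set still carries the column names
    cursor = conn.execute(f'select * from "{table}" limit 0')
    columns = [col[0] for col in cursor.description]
    # count(*) counts all rows, count(col) skips NULLs
    parts = ", ".join(f'count(*) - count("{c}")' for c in columns)
    row = conn.execute(f'select {parts} from "{table}"').fetchone()
    return dict(zip(columns, row))

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- made-up sample rows, one of them without an actor
    create temporary table temp (film_id integer, title text, actor text);
    insert into temp values
        (1, 'FILM A', 'ANN EXAMPLE'),
        (2, 'FILM B', 'BOB SAMPLE'),
        (3, 'FILM C', null);
""")
counts = null_counts(conn, "temp")
print(counts)
````

Since the column names come from the table itself, the check keeps working when columns are added or renamed, and it cannot accidentally test the same column twice.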

Ok, so the only column with null values is `actor`. We'll be replacing the null values with 'No Information', and the best location possible is the second query:

````sql
select a.film_id, a.title, a.description, a.release_year, a.rental_duration, a.rental_rate, a.length, a.replacement_cost, a.rating, a.special_features,
b.name as language, 
d.name as category, 
group_concat(f.first_name || ' ' || f.last_name) as actor
from film a
join language b on b.language_id = a.language_id
join film_category c on c.film_id = a.film_id
join category d on d.category_id = c.category_id
left join film_actor e on e.film_id = a.film_id
left join actor f on f.actor_id = e.actor_id
group by a.film_id;
````

Which we'll be changing to:

````sql
select a.film_id, a.title, a.description, a.release_year, a.rental_duration, a.rental_rate, a.length, a.replacement_cost, a.rating, a.special_features,
b.name as language, 
d.name as category, 
iif(first_name is null, 'No Information', group_concat(f.first_name || ' ' || f.last_name)) as actor --one line changed
from film a
join language b on b.language_id = a.language_id
join film_category c on c.film_id = a.film_id
join category d on d.category_id = c.category_id
left join film_actor e on e.film_id = a.film_id
left join actor f on f.actor_id = e.actor_id
group by a.film_id;
````
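Note that `iif` only exists in SQLite 3.32.0 and later. If your build is older, wrapping `group_concat` in `coalesce` should give the same result, because `group_concat` returns NULL when a film has no non-NULL actor names to concatenate. Here is a small sketch of that alternative, run through Python's built-in `sqlite3` module. The tables are cut down to the columns that matter and the rows are made up.

````python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table film (film_id integer primary key, title text);
    create table actor (actor_id integer primary key, first_name text, last_name text);
    create table film_actor (film_id integer, actor_id integer);

    -- made-up rows: film 2 has no actors at all
    insert into film values (1, 'WITH ACTORS'), (2, 'WITHOUT ACTORS');
    insert into actor values (1, 'ANN', 'EXAMPLE'), (2, 'BOB', 'SAMPLE');
    insert into film_actor values (1, 1), (1, 2);
""")

rows = conn.execute("""
    select a.film_id,
           coalesce(group_concat(f.first_name || ' ' || f.last_name),
                    'No Information') as actor
    from film a
    left join film_actor e on e.film_id = a.film_id
    left join actor f on f.actor_id = e.actor_id
    group by a.film_id
    order by a.film_id
""").fetchall()

for film_id, actor in rows:
    print(film_id, actor)
````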

## Remove `film_id:1`

Why two `film_id` columns? Because it was the field used to join our first query to the second query.

So both subqueries need to expose their `film_id` column, or the join will fail. We'll have to deal with this afterwards.
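For reference, one way this could be handled later is to read the column list from the view itself and leave the duplicate out when copying the data. This is a sketch under assumptions, not the approach we commit to here: the two tiny tables and the `clean` table name are made up, and it relies on SQLite naming the duplicate column `film_id:1`, as it did in our view.

````python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- made-up stand-ins for the two subqueries of the real view
    create table payments (payment_id integer, film_id integer);
    create table films (film_id integer, title text);
    insert into payments values (10, 1);
    insert into films values (1, 'FILM A');

    create view joined_data as
    select p.*, f.*
    from payments as p
    join films as f on f.film_id = p.film_id;
""")

# column names of the view, including the renamed duplicate
cursor = conn.execute("select * from joined_data limit 0")
columns = [col[0] for col in cursor.description]

# keep everything except the duplicated join key
kept = [c for c in columns if c != "film_id:1"]
column_list = ", ".join(f'"{c}"' for c in kept)
conn.execute(f"create table clean as select {column_list} from joined_data")

clean_columns = [
    col[0] for col in conn.execute("select * from clean limit 0").description
]
print(columns)
print(clean_columns)
````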

## Capitalize columns

We could do this in SQL, but if there's one thing that SQL doesn't do well, it's string manipulation, unlike Python.

We now have two options: go all the way back to the Extraction phase and correct this, or deal with it after the Load phase and before starting our analysis.

We won't go back now because this is not that important, but it's an example of how many times we have to jump between phases to achieve the best (read: automated) data pipeline possible.
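As a taste of what the Python side could look like, here is a minimal sketch that title-cases those three columns on a row already fetched from the database. The `capitalize_columns` helper and the sample row are made up for illustration. `str.title()` is only one option: it capitalizes after every non-letter, so a word like `it's` would become `It'S`.

````python
def capitalize_columns(row, columns=("name", "title", "actor")):
    """Return a copy of `row` (a dict) with the given text columns title-cased."""
    cleaned = dict(row)
    for column in columns:
        value = cleaned.get(column)
        if isinstance(value, str):  # leave NULLs (None) untouched
            cleaned[column] = value.title()
    return cleaned

# made-up sample row using the same column names as our view
sample = {
    "name": "MARY SMITH",
    "title": "ACADEMY DINOSAUR",
    "actor": "ANN EXAMPLE,BOB SAMPLE",
    "amount": 2.99,
}
result = capitalize_columns(sample)
print(result)
````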

## Finalize the transformation

So the only thing missing is recreating the view and temporary table. Run this in your database:

````sql
drop view joined_data;

create view joined_data as
select first.*, second.*
from
	(select a.payment_id, a.customer_id, a.staff_id, a.rental_id, a.amount,  substr(replace(payment_date, 'T', ' '), 0, 20) as payment_date,
    b.name, b.address, b.phone, b."zip code" as zip_code, b.city, b.country, 
    c.name as staff_name, c.address as staff_address, c.phone as staff_phone, c.city as staff_city, c.country as staff_country,
    d.rental_date, d.return_date,
    e.address as store_address, e.district as store_district, e.city as store_city,
    f.film_id 
    from payment a
    join customer b on b.id  = a.customer_id 
    join staff c on c.id = a.staff_id 
    join rentals d on d.rental_id = a.rental_id
    join store e on e.store_id = c.sid
    join inventory f on f.inventory_id = d.inventory_id) as first
join
	(select a.film_id, a.title, a.description, a.release_year, a.rental_duration, a.rental_rate, a.length, a.replacement_cost, a.rating, a.special_features,
	b.name as language, 
	d.name as category, 
	iif(first_name is null, 'No Information', group_concat(f.first_name || ' ' || f.last_name)) as actor
	from film a
	join language b on b.language_id = a.language_id
	join film_category c on c.film_id = a.film_id
	join category d on d.category_id = c.category_id
	left join film_actor e on e.film_id = a.film_id
	left join actor f on f.actor_id = e.actor_id
	group by a.film_id) as second 
on second.film_id = first.film_id;
````

Once the view has been recreated, rerun the temporary table creation so it picks up the changes:

````sql
drop table temp;

create temporary table temp as
    select * from joined_data;
````