In [1]:
%load_ext sql

In [3]:
%sql postgresql://postgres:postgres@localhost:5432/analysis

'Connected: postgres@analysis'

# Inspecting and Modifying Data

## Importing Data on Meat, Poultry, and Eggs Producers

In [4]:
%%sql

CREATE TABLE meat_poultry_egg_inspect (
    est_number varchar(50) CONSTRAINT est_number_key PRIMARY KEY,
    company varchar(100),
    street varchar(100),
    city varchar(30),
    st varchar(2),
    zip varchar(5),
    phone varchar(14),
    grant_date date,
    activities text,
    dbas text
);

 * postgresql://postgres:***@localhost:5432/analysis
Done.


[]

In [7]:
%%sql

COPY meat_poultry_egg_inspect
FROM '/Users/ugurtigu/Documents/Learn/Docs/SQL/MPI_Directory_by_Establishment_Name.csv'
WITH (FORMAT CSV, HEADER, DELIMITER ',');

 * postgresql://postgres:***@localhost:5432/analysis
6287 rows affected.


[]

In [10]:
%%sql

CREATE INDEX company_idx ON meat_poultry_egg_inspect (company);

 * postgresql://postgres:***@localhost:5432/analysis
Done.


[]

* we first create a table for the meat, poultry, and egg inspection data
* we use est_number as a natural primary key
* the activities column describes the activities of each company
* these strings can be very long, so we use the text data type here
* we import our CSV file and copy its data into the table
* we create an index on the company column to speed up searches for particular companies
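The steps above can be tried without a PostgreSQL server. The following is a minimal sketch, assuming Python's built-in sqlite3 module as a stand-in; the table name inspect_demo and the two sample rows are hypothetical, and SQLite has no COPY command, so the CSV is read with the csv module instead.

```python
import csv
import io
import sqlite3

# Hypothetical two-row sample standing in for the real CSV file.
sample_csv = io.StringIO(
    "est_number,company,st,zip\n"
    "M1,Example Meats LLC,MN,55449\n"
    "M2,Sample Poultry Inc.,AL,36671\n"
)

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE inspect_demo (
        est_number TEXT PRIMARY KEY,  -- natural key, as in the notebook
        company TEXT,
        st TEXT,
        zip TEXT
    )
    """
)

# csv.DictReader consumes the header row, much like COPY ... HEADER does.
rows = [
    (r["est_number"], r["company"], r["st"], r["zip"])
    for r in csv.DictReader(sample_csv)
]
conn.executemany("INSERT INTO inspect_demo VALUES (?, ?, ?, ?)", rows)

# Index the company column to speed up lookups by company name.
conn.execute("CREATE INDEX company_idx ON inspect_demo (company)")

row_count = conn.execute("SELECT count(*) FROM inspect_demo").fetchone()[0]
print(row_count)
```

Comparing row_count with the number of data lines in the file is the same sanity check the notebook performs with count(*).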

In [11]:
%%sql

SELECT count(*) FROM meat_poultry_egg_inspect;

 * postgresql://postgres:***@localhost:5432/analysis
1 rows affected.


count
6287


* we check the number of rows in the table
    * it matches the 6287 rows imported, so everything is correct

## Interviewing the Data Set

In [13]:
%%sql

SELECT company,
        street,
        city,
        st,
        count(*) AS adress_count
FROM meat_poultry_egg_inspect
GROUP BY company, street, city, st
HAVING count(*) > 1
ORDER BY company, street, city, st;

 * postgresql://postgres:***@localhost:5432/analysis
23 rows affected.


company,street,city,st,adress_count
Acre Station Meat Farm,17076 Hwy 32 N,Pinetown,NC,2
Beltex Corporation,3801 North Grove Street,Fort Worth,TX,2
Cloverleaf Cold Storage,111 Imperial Drive,Sanford,NC,2
"Crete Core Ingredients, LLC",2220 County Road I,Crete,NE,2
"Crider, Inc.",1 Plant Avenue,Stillmore,GA,3
"Dimension Marketing & Sales, Inc.",386 West 9400 South,Sandy,UT,2
"Foster Poultry Farms, A California Corporation",6648 Highway 15 North,Farmerville,LA,2
"Freezer & Dry Storage, LLC",21740 Trolley Industrial Drive,Taylor,MI,2
JBS Souderton Inc.,249 Allentown Road,Souderton,PA,2
KB Poultry Processing LLC,15024 Sandstone Dr.,Utica,MN,2


* we group rows by unique combinations of the company, street, city, and st columns; count(*) returns the number of rows for each combination, and we give it the alias adress_count
* the HAVING clause keeps only combinations that appear more than once, which points to possible duplicate entries
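To see how GROUP BY and HAVING isolate repeated addresses, here is a small sketch using Python's sqlite3 module as a stand-in for PostgreSQL. The table demo and its four rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (company TEXT, street TEXT, city TEXT, st TEXT)")
conn.executemany(
    "INSERT INTO demo VALUES (?, ?, ?, ?)",
    [
        ("Acme Meats", "1 Main St", "Springfield", "IL"),
        ("Acme Meats", "1 Main St", "Springfield", "IL"),  # same address twice
        ("Acme Meats", "9 Oak Ave", "Springfield", "IL"),  # same company, other address
        ("Best Poultry", "2 Elm St", "Dover", "DE"),
    ],
)

# Only combinations that occur more than once survive the HAVING filter.
duplicates = conn.execute(
    """
    SELECT company, street, city, st, count(*) AS address_count
    FROM demo
    GROUP BY company, street, city, st
    HAVING count(*) > 1
    ORDER BY company, street, city, st
    """
).fetchall()
print(duplicates)
```

Only the address that was entered twice is returned; the same company at a different street is not flagged.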

### Checking for Missing Values

In [14]:
%%sql

SELECT st,
        count(*) AS st_count
FROM meat_poultry_egg_inspect
GROUP BY st
ORDER BY st;

 * postgresql://postgres:***@localhost:5432/analysis
57 rows affected.


st,st_count
AK,17
AL,93
AR,87
AS,1
AZ,37
CA,666
CO,121
CT,55
DC,2
DE,22


* this query is a simple count: it groups the rows by state postal code (st) and gives each count the alias st_count

In [15]:
%%sql

SELECT est_number,
        company, 
        city,
        st,
        zip
FROM meat_poultry_egg_inspect
WHERE st IS NULL;

 * postgresql://postgres:***@localhost:5432/analysis
3 rows affected.


est_number,company,city,st,zip
V18677A,"Atlas Inspection, Inc.",Blaine,,55449
M45319+P45319,"Hall-Namie Packing Company, Inc",,,36671
M263A+P263A+V263A,Jones Dairy Farm,,,53538


* we want to find out which rows the NULL values in st come from
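A short sketch of why the filter is written with IS NULL rather than = NULL, again assuming sqlite3 as a stand-in and a hypothetical four-row table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (est_number TEXT, st TEXT)")
conn.executemany(
    "INSERT INTO demo VALUES (?, ?)",
    [("E1", "MN"), ("E2", None), ("E3", None), ("E4", "WI")],
)

# GROUP BY collects all NULLs into one group, so missing values show up in the counts.
counts = dict(conn.execute("SELECT st, count(*) FROM demo GROUP BY st").fetchall())

# IS NULL finds the rows with a missing state code.
missing = conn.execute(
    "SELECT est_number FROM demo WHERE st IS NULL ORDER BY est_number"
).fetchall()

# A comparison with = NULL is never true, so it matches nothing.
wrong = conn.execute("SELECT est_number FROM demo WHERE st = NULL").fetchall()

print(counts, missing, wrong)
```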

### Checking for Inconsistent Data Values

In [24]:
%%sql

SELECT company,
        count(*) AS company_count
FROM meat_poultry_egg_inspect
GROUP BY company
ORDER BY company ASC
LIMIT 350;

 * postgresql://postgres:***@localhost:5432/analysis
350 rows affected.


company,company_count
121 In-Flight Catering LLC,1
165368 C. Corporation,1
1732 Meats LLC,1
"1st Original Texas Chili Company, Inc.",1
290 West Bar & Grill,1
3 Little Pigs LLC,1
3-A Enterprises,1
3282 Beaver Meadow Road LLC,1
"3D Meats, LLC",1
4 Frendz Meat Market,1


* scanning the sorted names, we can see that some companies are spelled inconsistently

### Checking for Malformed Values Using length()

In [26]:
%%sql

SELECT length(zip),
        count(*) AS length_count
FROM meat_poultry_egg_inspect
GROUP BY length(zip)
ORDER BY length(zip) ASC;

 * postgresql://postgres:***@localhost:5432/analysis
3 rows affected.


length,length_count
3,86
4,496
5,5705


* we find out that some ZIP codes are shorter than 5 characters
* these are ZIP codes that begin with zeros; the leading zeros were lost, most likely because the values were stored as integers during file conversion
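The effect is easy to reproduce in plain Python. This sketch uses three illustrative ZIP strings (not taken from the data set) and shows what an integer round trip does to them:

```python
# Illustrative ZIP codes; the first two begin with zeros.
zips = ["00601", "06510", "55449"]

# Storing them as integers drops the leading zeros...
as_integers = [int(z) for z in zips]

# ...and converting back to text does not bring them back.
round_tripped = [str(n) for n in as_integers]
lengths = [len(z) for z in round_tripped]

print(round_tripped, lengths)
```

The resulting lengths of 3, 4, and 5 correspond to the three groups the length(zip) query reports.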

In [28]:
%%sql

SELECT st,
        count(*) AS st_count
FROM meat_poultry_egg_inspect
WHERE length(zip) < 5
GROUP BY st
ORDER BY st ASC;

 * postgresql://postgres:***@localhost:5432/analysis
9 rows affected.


st,st_count
CT,55
MA,101
ME,24
NH,18
NJ,244
PR,84
RI,27
VI,2
VT,27


* using the WHERE clause, we can see which states the shortened ZIP codes come from

* so far we have to correct these errors:
    * **missing values for three rows in the st column**
    * **inconsistent spelling of at least one company's name**
    * **inaccurate ZIP codes due to file conversion**

## Modifying Tables, Columns, and Data

### Modifying Tables with ALTER TABLE

* ALTER TABLE table ADD COLUMN column data_type;
    * adds a column to a table
* ALTER TABLE table DROP COLUMN column;
    * removes a column
* ALTER TABLE table ALTER COLUMN column SET DATA TYPE data_type;
    * changes the data type of a column
* ALTER TABLE table ALTER COLUMN column SET NOT NULL;
    * adds a NOT NULL constraint to a column
* ALTER TABLE table ALTER COLUMN column DROP NOT NULL;
    * removes the NOT NULL constraint

### Modifying Values with UPDATE

* UPDATE table SET column = value;
    * modifies the data in a column in every row of the table

* UPDATE table SET column_a = value, column_b = value;
    * we first pass UPDATE the name of the table to update
    * the SET clause lists the columns and the values to change them to
    * the new value can be a string, a number, the name of another column, or even a query or expression that generates a value

* UPDATE table SET column = value WHERE criteria;
    * to restrict the update to particular rows, we add a WHERE clause with some criteria

* UPDATE table SET column = (SELECT column FROM table_b WHERE table.column = table_b.column) WHERE EXISTS (SELECT column FROM table_b WHERE table.column = table_b.column);
    * we can also update one table with values from another table
    * the WHERE EXISTS clause limits the update to rows that have a match in table_b; without it, rows with no match would be set to NULL
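The role of WHERE EXISTS in the cross-table form is easiest to see by running the update with and without it. This is a minimal sketch on SQLite via Python's sqlite3 module; table_a, table_b, and their rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE table_a (id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE table_b (id INTEGER PRIMARY KEY, region TEXT);
    INSERT INTO table_a VALUES (1, 'old'), (2, 'keep me');
    INSERT INTO table_b VALUES (1, 'new');  -- no row for id 2
    """
)

# With the guard: only rows that have a match in table_b are touched.
conn.execute(
    """
    UPDATE table_a
    SET region = (SELECT region FROM table_b WHERE table_a.id = table_b.id)
    WHERE EXISTS (SELECT 1 FROM table_b WHERE table_a.id = table_b.id)
    """
)
with_exists = conn.execute("SELECT id, region FROM table_a ORDER BY id").fetchall()

# Without the guard: the subquery returns nothing for id 2, so region becomes NULL.
conn.execute(
    "UPDATE table_a SET region = "
    "(SELECT region FROM table_b WHERE table_a.id = table_b.id)"
)
without_exists = conn.execute("SELECT id, region FROM table_a ORDER BY id").fetchall()

print(with_exists, without_exists)
```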

### Creating Backup Tables

In [29]:
%%sql

CREATE TABLE meat_poultry_egg_inspect_backup AS
SELECT * FROM meat_poultry_egg_inspect;

 * postgresql://postgres:***@localhost:5432/analysis
6287 rows affected.


[]

* before modifying a table, it is a good idea to make a copy for reference and backup
* the next query confirms that every row was copied by comparing the row counts of both tables

In [30]:
%%sql

SELECT
    (SELECT count(*) FROM meat_poultry_egg_inspect) AS original,
    (SELECT count(*) FROM meat_poultry_egg_inspect_backup) AS backup;

 * postgresql://postgres:***@localhost:5432/analysis
1 rows affected.


original,backup
6287,6287


### Restoring Missing Column Values

#### Creating a Column Copy

In [31]:
%%sql

ALTER TABLE meat_poultry_egg_inspect ADD COLUMN st_copy varchar(2);

 * postgresql://postgres:***@localhost:5432/analysis
Done.


[]

In [32]:
%%sql

UPDATE meat_poultry_egg_inspect
SET st_copy = st;

 * postgresql://postgres:***@localhost:5432/analysis
6287 rows affected.


[]

* we use the **ALTER TABLE** statement to add a column called st_copy with the same varchar data type as the original st column
* next, the **UPDATE** statement's **SET** clause fills our newly created st_copy column with the values from the st column
* because we don't specify any criteria using WHERE, every row is updated

In [37]:
%%sql 

SELECT st,
        st_copy
FROM meat_poultry_egg_inspect
ORDER BY st
LIMIT 10; -- LIMIT originally 6287 rows

 * postgresql://postgres:***@localhost:5432/analysis
10 rows affected.


st,st_copy
AK,AK
AK,AK
AK,AK
AK,AK
AK,AK
AK,AK
AK,AK
AK,AK
AK,AK
AK,AK


#### Updating Rows where Values are Missing

In [40]:
%%sql

UPDATE meat_poultry_egg_inspect
SET st = 'MN'
WHERE est_number = 'V18677A';

UPDATE meat_poultry_egg_inspect
SET st = 'AL'
WHERE est_number = 'M45319+P45319';

UPDATE meat_poultry_egg_inspect
SET st = 'WI'
WHERE est_number = 'M263A+P263A+V263A';

 * postgresql://postgres:***@localhost:5432/analysis
1 rows affected.
1 rows affected.
1 rows affected.


[]

#### Restoring Original Values

In [39]:
%%sql

UPDATE meat_poultry_egg_inspect
SET st = st_copy;

UPDATE meat_poultry_egg_inspect AS original
SET st = backup.st
FROM meat_poultry_egg_inspect_backup AS backup
WHERE original.est_number = backup.est_number;

 * postgresql://postgres:***@localhost:5432/analysis
6287 rows affected.
6287 rows affected.


[]

* the first UPDATE restores st from the st_copy backup column we created earlier in meat_poultry_egg_inspect
* the second UPDATE does the same using the backup table, matching rows on est_number
* either way, st holds its original values again

## Updating Values for Consistency

* in several cases a single company's name was entered inconsistently
* if we want to aggregate data by company name, such inconsistencies will hinder us from doing so

In [None]:
%%sql

ALTER TABLE meat_poultry_egg_inspect 
ADD COLUMN company_standart varchar(100);

In [45]:
%%sql 

UPDATE meat_poultry_egg_inspect
SET company_standart = company;

 * postgresql://postgres:***@localhost:5432/analysis
6287 rows affected.


[]

* to protect our data, we add a new column to our table, which we name company_standart
* we then **UPDATE** the table and **SET** the new column to the values of the original company column, which stays untouched

In [46]:
%%sql

UPDATE meat_poultry_egg_inspect
SET company_standart = 'Armour-Ekrich Meats'
WHERE company LIKE 'Armour%';

 * postgresql://postgres:***@localhost:5432/analysis
7 rows affected.


[]

* this statement uses a WHERE clause with the LIKE keyword, which does case-sensitive pattern matching
* because the wildcard % follows the string Armour, it updates all rows whose company starts with those characters, regardless of what comes after them
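A small sketch of the prefix match, assuming sqlite3 as a stand-in and three made-up company names. One difference worth knowing: SQLite's LIKE ignores ASCII case by default, whereas PostgreSQL's LIKE is case sensitive.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (company TEXT, company_standard TEXT)")
conn.executemany(
    "INSERT INTO demo (company) VALUES (?)",
    [
        ("Armour-Eckrich Meats LLC",),
        ("Armour - Eckrich Meats, LLC",),
        ("Premium Armour Foods",),  # contains the word, but does not start with it
    ],
)

# Work on a copy of the column so the original names stay untouched.
conn.execute("UPDATE demo SET company_standard = company")

# 'Armour%' matches only values that begin with 'Armour'.
cursor = conn.execute(
    "UPDATE demo SET company_standard = 'Armour-Eckrich Meats' "
    "WHERE company LIKE 'Armour%'"
)
updated = cursor.rowcount

standardized = [
    row[0]
    for row in conn.execute("SELECT company_standard FROM demo ORDER BY rowid")
]
print(updated, standardized)
```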

In [47]:
%%sql

SELECT company, company_standart
FROM meat_poultry_egg_inspect
WHERE company LIKE 'Armour%';

 * postgresql://postgres:***@localhost:5432/analysis
7 rows affected.


company,company_standart
Armour-Eckrich Meats LLC,Armour-Ekrich Meats
"Armour - Eckrich Meats, LLC",Armour-Ekrich Meats
Armour-Eckrich Meats LLC,Armour-Ekrich Meats
Armour-Eckrich Meats LLC,Armour-Ekrich Meats
"Armour-Eckrich Meats, Inc.",Armour-Ekrich Meats
"Armour-Eckrich Meats, LLC",Armour-Ekrich Meats
"Armour-Eckrich Meats, LLC",Armour-Ekrich Meats


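* the SELECT statement shows the updated company_standart column next to the original names
* every Armour-Eckrich row now shares one standard value (note that the string we set is spelled 'Armour-Ekrich Meats', without the c)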

## Repairing ZIP Codes Using Concatenation

* to fix the ZIP codes that lost their leading zeros, we use UPDATE again, this time together with the double-pipe string operator (||), which performs concatenation
* concatenation combines two or more string or non-string values into one string
* for example, if you insert || between abc and 123, you get abc123

In [48]:
%%sql

ALTER TABLE meat_poultry_egg_inspect ADD COLUMN zip_copy varchar(5);

 * postgresql://postgres:***@localhost:5432/analysis
Done.


[]

In [49]:
%%sql

UPDATE meat_poultry_egg_inspect
SET zip_copy = zip;

 * postgresql://postgres:***@localhost:5432/analysis
6287 rows affected.


[]

In [51]:
%%sql 

UPDATE meat_poultry_egg_inspect
SET zip = '00' || zip
WHERE st IN('PR', 'VI') AND length(zip) = 3;

 * postgresql://postgres:***@localhost:5432/analysis
86 rows affected.


[]

* first we create and fill the copy column
* then we restore the missing leading zeros in the zip column
* we do that by setting zip to the result of concatenating the string 00 with the existing value
* we limit the update to rows where the st column holds the state code 'PR' or 'VI', using the **IN** comparison operator, and add a test for rows where the length of zip is 3

In [52]:
%%sql

UPDATE meat_poultry_egg_inspect
SET zip = '0' || zip
WHERE st IN('CT', 'MA', 'ME', 'NH', 'NJ', 'RI', 'VT') AND length(zip) = 4;

 * postgresql://postgres:***@localhost:5432/analysis
496 rows affected.


[]

* we repeat the operation for the remaining ZIP codes, which are missing a single leading zero
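Both repair steps can be rehearsed on a tiny table. The sketch below assumes sqlite3 as a stand-in (it supports || and length() as well) and three hypothetical rows, one per ZIP length:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (st TEXT, zip TEXT)")
conn.executemany(
    "INSERT INTO demo VALUES (?, ?)",
    [("PR", "601"), ("CT", "6510"), ("MN", "55449")],
)

# Three-character codes need two zeros in front.
conn.execute(
    "UPDATE demo SET zip = '00' || zip "
    "WHERE st IN ('PR', 'VI') AND length(zip) = 3"
)

# Four-character codes need one zero in front.
conn.execute(
    "UPDATE demo SET zip = '0' || zip "
    "WHERE st IN ('CT', 'MA', 'ME', 'NH', 'NJ', 'RI', 'VT') AND length(zip) = 4"
)

repaired = conn.execute("SELECT st, zip FROM demo ORDER BY st").fetchall()
print(repaired)
```

The length test in each WHERE clause keeps the update from padding codes that are already complete.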

In [53]:
%%sql

SELECT length(zip),
        count(*) AS length_count
FROM meat_poultry_egg_inspect
GROUP BY length(zip)
ORDER BY length(zip) ASC;

 * postgresql://postgres:***@localhost:5432/analysis
1 rows affected.


length,length_count
5,6287


* we run the same query again to confirm that every ZIP code now has five characters

### Updating Values Across Tables

In [54]:
%%sql

CREATE TABLE state_regions (
        st varchar(2) CONSTRAINT st_key PRIMARY KEY,
        region varchar(20) NOT NULL
);

 * postgresql://postgres:***@localhost:5432/analysis
Done.


[]

In [57]:
%%sql

COPY state_regions
FROM '/Users/ugurtigu/Documents/Learn/Docs/SQL/state_regions-Copy1.csv'
WITH (FORMAT CSV, HEADER, DELIMITER ',');

 * postgresql://postgres:***@localhost:5432/analysis
56 rows affected.


[]

* we create a new table and fill it with the data from the CSV file
* the state_regions table has two columns
* one contains the two-character state code *st* and the other contains the region name *region*
* we set the primary key on the st column, which holds unique values, and name the constraint *st_key*

In [58]:
%%sql

ALTER TABLE meat_poultry_egg_inspect ADD COLUMN inspection_date date;

 * postgresql://postgres:***@localhost:5432/analysis
Done.


[]

In [61]:
%%sql

UPDATE meat_poultry_egg_inspect inspect
SET inspection_date = '2019-12-01'
WHERE EXISTS (SELECT state_regions.region
             FROM state_regions
             WHERE inspect.st = state_regions.st
                    AND state_regions.region = 'New England');

 * postgresql://postgres:***@localhost:5432/analysis
252 rows affected.


[]

* the **ALTER TABLE** statement creates the inspection_date column in the meat_poultry_egg_inspect table
* in the **UPDATE** statement, we start by giving the table the alias inspect to make the code easier to read
* we set a date value for the inspection_date column
* with **WHERE EXISTS** we connect the meat_poultry_egg_inspect table to the state_regions table
* the subquery looks for rows in the state_regions table where the region column matches the string New England
* at the same time, it joins the meat_poultry_egg_inspect table with the state_regions table using the *st* column from both tables
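The same pattern on a miniature data set, as a sketch: sqlite3 stands in for PostgreSQL, the table inspect_demo and its rows are hypothetical, the region assigned to CA is a placeholder, and the date is stored as text because SQLite has no date type.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE state_regions (st TEXT PRIMARY KEY, region TEXT NOT NULL);
    CREATE TABLE inspect_demo (
        est_number TEXT PRIMARY KEY,
        st TEXT,
        inspection_date TEXT
    );
    INSERT INTO state_regions VALUES
        ('CT', 'New England'), ('ME', 'New England'), ('CA', 'Pacific');
    INSERT INTO inspect_demo (est_number, st) VALUES
        ('E1', 'CT'), ('E2', 'CA'), ('E3', 'ME');
    """
)

# A row is updated only if its state appears in state_regions as New England.
conn.execute(
    """
    UPDATE inspect_demo
    SET inspection_date = '2019-12-01'
    WHERE EXISTS (SELECT 1
                  FROM state_regions
                  WHERE inspect_demo.st = state_regions.st
                    AND state_regions.region = 'New England')
    """
)

dates = conn.execute(
    "SELECT est_number, inspection_date FROM inspect_demo ORDER BY est_number"
).fetchall()
print(dates)
```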

In [62]:
%%sql

SELECT st, inspection_date
FROM meat_poultry_egg_inspect
GROUP BY st, inspection_date
ORDER BY st;

 * postgresql://postgres:***@localhost:5432/analysis
56 rows affected.


st,inspection_date
AK,
AL,
AR,
AS,
AZ,
CA,
CO,
CT,2019-12-01
DC,
DE,


* the table now has inspection dates for the states in the New England region; for all other states the column remains NULL

## Deleting Unnecessary Data

* without a backup, deleted data is gone for good
* **DELETE FROM**
    * removes rows from a table
* **ALTER TABLE**
    * with DROP COLUMN, removes a column from a table
* **DROP TABLE**
    * removes a whole table from the DB

### Deleting Rows from a Table

* DELETE FROM table_name;
    * removes all rows from a table
* DELETE FROM table_name WHERE expression;
    * deletes only the portion of rows that matches the expression we supply

In [63]:
%%sql

DELETE FROM meat_poultry_egg_inspect
WHERE st IN('PR', 'VI');

 * postgresql://postgres:***@localhost:5432/analysis
86 rows affected.


[]

* for example, if we want the table to cover only the U.S. states, we can remove the rows for the territories Puerto Rico (PR) and the Virgin Islands (VI)

### Deleting a Column from a Table

* ALTER TABLE table_name DROP COLUMN column_name;
    * we earlier created a zip_copy column as a backup of the zip column; now that the repair is verified, we can drop it

In [64]:
%%sql

ALTER TABLE meat_poultry_egg_inspect DROP COLUMN zip_copy;

 * postgresql://postgres:***@localhost:5432/analysis
Done.


[]

### Deleting a Table from a DB

* DROP TABLE table_name;
    * removes the whole table, including all of its data; here we drop the backup table

In [65]:
%%sql

DROP TABLE meat_poultry_egg_inspect_backup;

 * postgresql://postgres:***@localhost:5432/analysis
Done.


[]

## Using Transaction Blocks to Save or Revert Changes

* changes made with the techniques in this chapter, such as **UPDATE** or **DELETE**, are final
* the only way to undo them is to restore from a backup
* with a **transaction block**, you can check your changes before finalizing them and cancel them if they are not what you want
    * **START TRANSACTION** signals the start of the transaction block
    * **COMMIT** signals the end of the block and saves all changes
    * **ROLLBACK** signals the end of the block and reverts all changes
* a transaction block treats its statements as one unit: if one step fails, the others are canceled too
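A transaction block can be rehearsed in a few lines. This sketch assumes sqlite3 as a stand-in, where the block is opened with BEGIN (PostgreSQL accepts START TRANSACTION as well); the table demo and its single row are hypothetical.

```python
import sqlite3

# isolation_level=None leaves transaction control to the explicit statements below.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE demo (company TEXT)")
conn.execute("INSERT INTO demo VALUES ('AGRO Merchants Oakland LLC')")

conn.execute("BEGIN")  # start of the transaction block

# An update that introduces a typo on purpose.
conn.execute("UPDATE demo SET company = 'AGRO Merchantss Oakland LLC'")
inside = conn.execute("SELECT company FROM demo").fetchone()[0]

conn.execute("ROLLBACK")  # discard everything since BEGIN
after = conn.execute("SELECT company FROM demo").fetchone()[0]

print(inside, "|", after)
```

Inside the block the typo is visible to the session, and after ROLLBACK the original name is back. Ending the block with COMMIT instead would make the change permanent.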

In [69]:
%%sql

START TRANSACTION;

 * postgresql://postgres:***@localhost:5432/analysis
Done.


[]

In [70]:
%%sql

UPDATE meat_poultry_egg_inspect
SET company = 'AGRO Merchants Oakland LLC'
WHERE company = 'AGRO Merchantss Oakland, LLC';

 * postgresql://postgres:***@localhost:5432/analysis
0 rows affected.


[]

In [71]:
%%sql

SELECT company
FROM meat_poultry_egg_inspect
WHERE company LIKE 'AGRO%'
ORDER BY company;

 * postgresql://postgres:***@localhost:5432/analysis
3 rows affected.


company
AGRO Merchants Oakland LLC
AGRO Merchants Oakland LLC
AGRO Merchantss Oakland LLC


In [75]:
%%sql

ROLLBACK;

 * postgresql://postgres:***@localhost:5432/analysis
Done.


[]

* we run each statement separately, beginning with **START TRANSACTION**
* the DB now knows that any changes we make will not be permanent unless we **COMMIT** them
* next we run the **UPDATE** statement, which changes the name by mistake
* when we use SELECT to view the company names that start with AGRO, we see the typo
* instead of rerunning the UPDATE to fix the typo, we run the **ROLLBACK** command to discard the change

## Improving Performance When Updating Large Tables

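* adding a column and filling it with values can inflate a table's size, because PostgreSQL writes a new version of each updated row and keeps the old one until it is cleaned up
* instead, we can save disk space by copying the entire table and adding a populated column during the copy
* then we can rename the tables so the copy replaces the original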

In [83]:
%%sql

CREATE TABLE meat_poultry_egg_inspect_backup AS
SELECT *,
        '2018-02-07'::date AS reviewd_date
FROM meat_poultry_egg_inspect;

 * postgresql://postgres:***@localhost:5432/analysis
6201 rows affected.


[]

* in addition to selecting all the columns using the asterisk
* we also add a column called reviewd_date
* we supply its value as a string cast to the date data type and name the column with an alias

In [84]:
%%sql

ALTER TABLE meat_poultry_egg_inspect RENAME TO meat_poultry_egg_inspect_temp;

 * postgresql://postgres:***@localhost:5432/analysis
Done.


[]

In [85]:
%%sql

ALTER TABLE meat_poultry_egg_inspect_backup
RENAME TO meat_poultry_egg_inspect;

 * postgresql://postgres:***@localhost:5432/analysis
Done.


[]

In [87]:
%%sql

ALTER TABLE meat_poultry_egg_inspect_temp
RENAME TO meat_poultry_egg_inspect_backup;

 * postgresql://postgres:***@localhost:5432/analysis
Done.


[]

* we swap the names
* the first statement renames the original table so that it ends with _temp
* the second renames the copy we made to the original name
* finally we rename the table that ends with _temp so that it ends with _backup
* the original is now called meat_poultry_egg_inspect_backup and the copy with the added column is called meat_poultry_egg_inspect
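The copy-and-swap sequence, sketched on SQLite through Python's sqlite3 module. The table names are shortened hypothetical versions of the ones above, and the new column holds plain text because SQLite has no date type or ::date cast.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE inspect_demo (est_number TEXT, company TEXT);
    INSERT INTO inspect_demo VALUES ('E1', 'Acme Meats'), ('E2', 'Best Poultry');

    -- Copy the table and add a populated column in the same step.
    CREATE TABLE inspect_demo_backup AS
    SELECT *, '2018-02-07' AS reviewed_date
    FROM inspect_demo;

    -- Three renames swap the copy and the original.
    ALTER TABLE inspect_demo RENAME TO inspect_demo_temp;
    ALTER TABLE inspect_demo_backup RENAME TO inspect_demo;
    ALTER TABLE inspect_demo_temp RENAME TO inspect_demo_backup;
    """
)

# PRAGMA table_info returns one row per column; index 1 is the column name.
new_columns = [row[1] for row in conn.execute("PRAGMA table_info(inspect_demo)")]
old_columns = [row[1] for row in conn.execute("PRAGMA table_info(inspect_demo_backup)")]
print(new_columns, old_columns)
```

After the swap, the table under the original name carries the extra column, and the untouched original lives on under the _backup name.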

## Try it Yourself

-- In this exercise, you’ll turn the meat_poultry_egg_inspect table into useful
-- information. You need to answer two questions: How many of the companies
-- in the table process meat, and how many process poultry?

-- Create two new columns called meat_processing and poultry_processing. Each
-- can be of the type boolean.

-- Using UPDATE, set meat_processing = TRUE on any row where the activities
-- column contains the text 'Meat Processing'. Do the same update on the
-- poultry_processing column, but this time look for the text
-- 'Poultry Processing' in activities.

-- Use the data from the new, updated columns to count how many companies
-- perform each type of activity. For a bonus challenge, count how many
-- companies perform both activities.

In [95]:
%%sql

ALTER TABLE meat_poultry_egg_inspect ADD COLUMN meat_processing bool;

 * postgresql://postgres:***@localhost:5432/analysis
Done.


[]

In [96]:
%%sql

ALTER TABLE meat_poultry_egg_inspect ADD COLUMN poultry_processing bool;

 * postgresql://postgres:***@localhost:5432/analysis
Done.


[]

In [97]:
%%sql

UPDATE meat_poultry_egg_inspect
SET meat_processing = TRUE
WHERE activities LIKE '%Meat Processing%';

 * postgresql://postgres:***@localhost:5432/analysis
4764 rows affected.


[]

In [101]:
%%sql

UPDATE meat_poultry_egg_inspect
SET poultry_processing = TRUE
WHERE activities LIKE '%Poultry Processing%';

 * postgresql://postgres:***@localhost:5432/analysis
3728 rows affected.


[]

In [103]:
%%sql

SELECT count(meat_processing), count(poultry_processing)
FROM meat_poultry_egg_inspect;

 * postgresql://postgres:***@localhost:5432/analysis
1 rows affected.


count,count_1
4764,3728
