# Data Cleaning
- this notebook contains all table alterations
    - run it to transform the tables set up in 01_db_setup.ipynb into their final state
- the focus is on low-hanging fruit:
    - null columns
    - data anomalies (e.g., price data)

## Import Libraries, connect to database

In [114]:
import pandas as pd

#### Load ipython-sql extension

In [115]:
%load_ext sql

The sql extension is already loaded. To reload it, use:
  %reload_ext sql


### Connect to database

In [116]:
%sql postgresql://postgres:12345@localhost/ecomm_cleanse

### Restore default table states

- needed so the cells below can be re-run: the alterations are applied serially, so each one expects the tables in their original (backup) state

In [117]:
%%sql

DROP TABLE IF EXISTS all_sessions CASCADE;
CREATE TABLE IF NOT EXISTS all_sessions AS SELECT * FROM all_sessions_backup;

DROP TABLE IF EXISTS analytics CASCADE;
CREATE TABLE IF NOT EXISTS analytics AS SELECT * FROM analytics_backup;

DROP TABLE IF EXISTS products CASCADE;
CREATE TABLE IF NOT EXISTS products AS SELECT * FROM products_backup;

DROP TABLE IF EXISTS sales_by_sku CASCADE;
CREATE TABLE IF NOT EXISTS sales_by_sku AS SELECT * FROM sales_by_sku_backup;

DROP TABLE IF EXISTS sales_report CASCADE;
CREATE TABLE IF NOT EXISTS sales_report AS SELECT * FROM sales_report_backup;

 * postgresql://postgres:***@localhost/ecomm_cleanse
Done.
15134 rows affected.
Done.


4301122 rows affected.
Done.
1092 rows affected.
Done.
462 rows affected.
Done.
454 rows affected.


[]

## `all_sessions` table

### Duplicate records

- determine if any rows in the table are duplicates

In [118]:
%%sql
SELECT (
    SELECT COUNT(*)
    FROM (SELECT DISTINCT * FROM all_sessions) AS distinct_rows
) AS unique_rows,
(
    SELECT COUNT(*)
    FROM all_sessions
) AS total_rows;


 * postgresql://postgres:***@localhost/ecomm_cleanse
1 rows affected.


unique_rows,total_rows
15134,15134


- all records in `all_sessions` table are unique
- there may still be duplicate records that are being obscured by data collection/entry issues
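
- a minimal sketch of how an exact-match check can miss duplicates that differ only in formatting
- it uses Python's built-in `sqlite3` with an in-memory table rather than the PostgreSQL database, and the table name `sessions_demo` and its rows are made up for illustration

```python
import sqlite3

# Hypothetical rows: the second and third differ only by case and trailing whitespace.
rows = [
    ("1001", "Palo Alto"),
    ("1002", "Atlanta"),
    ("1002", "atlanta "),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sessions_demo (visitid TEXT, city TEXT)")
conn.executemany("INSERT INTO sessions_demo VALUES (?, ?)", rows)

# Exact-match check, same shape as the query above.
exact_unique = conn.execute(
    "SELECT COUNT(*) FROM (SELECT DISTINCT * FROM sessions_demo) AS d"
).fetchone()[0]

# Same check after normalising case and whitespace in the text column.
normalised_unique = conn.execute(
    "SELECT COUNT(*) FROM "
    "(SELECT DISTINCT visitid, LOWER(TRIM(city)) FROM sessions_demo) AS d"
).fetchone()[0]

print(exact_unique, normalised_unique)
conn.close()
```

- the exact check reports every row as unique, while the normalised check finds one fewer; which columns are worth normalising in `all_sessions` would need to be decided case by case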

### Determine null values
- get a count of the number of null values for each column in the table

In [119]:
%%sql

SELECT 
    'all_sessions' AS table_name,
    COUNT(*) AS total_rows,
    SUM(CASE WHEN fullvisitorid IS NULL THEN 1 ELSE 0 END) AS null_in_fullvisitorid, 
    SUM(CASE WHEN channelgrouping IS NULL THEN 1 ELSE 0 END) AS null_in_channelgrouping,
    SUM(CASE WHEN time IS NULL THEN 1 ELSE 0 END) AS null_in_time,
    SUM(CASE WHEN country IS NULL THEN 1 ELSE 0 END) AS null_in_country,
    SUM(CASE WHEN city IS NULL THEN 1 ELSE 0 END) AS null_in_city,
    SUM(CASE WHEN totaltransactionrevenue IS NULL THEN 1 ELSE 0 END) AS null_in_totaltransactionrevenue,
    SUM(CASE WHEN transactions IS NULL THEN 1 ELSE 0 END) AS null_in_transactions,
    SUM(CASE WHEN timeonsite IS NULL THEN 1 ELSE 0 END) AS null_in_timeonsite,
    SUM(CASE WHEN pageviews IS NULL THEN 1 ELSE 0 END) AS null_in_pageviews,
    SUM(CASE WHEN sessionqualitydim IS NULL THEN 1 ELSE 0 END) AS null_in_sessionqualitydim,
    SUM(CASE WHEN date IS NULL THEN 1 ELSE 0 END) AS null_in_date,
    SUM(CASE WHEN visitid IS NULL THEN 1 ELSE 0 END) AS null_in_visitid,
    SUM(CASE WHEN type IS NULL THEN 1 ELSE 0 END) AS null_in_type,
    SUM(CASE WHEN productrefundamount IS NULL THEN 1 ELSE 0 END) AS null_in_productrefundamount,
    SUM(CASE WHEN productquantity IS NULL THEN 1 ELSE 0 END) AS null_in_productquantity,
    SUM(CASE WHEN productprice IS NULL THEN 1 ELSE 0 END) AS null_in_productprice,
    SUM(CASE WHEN productrevenue IS NULL THEN 1 ELSE 0 END) AS null_in_productrevenue,
    SUM(CASE WHEN productsku IS NULL THEN 1 ELSE 0 END) AS null_in_productsku,
    SUM(CASE WHEN v2productname IS NULL THEN 1 ELSE 0 END) AS null_in_v2productname,
    SUM(CASE WHEN v2productcategory IS NULL THEN 1 ELSE 0 END) AS null_in_v2productcategory,
    SUM(CASE WHEN productvariant IS NULL THEN 1 ELSE 0 END) AS null_in_productvariant,
    SUM(CASE WHEN currencycode IS NULL THEN 1 ELSE 0 END) AS null_in_currencycode,
    SUM(CASE WHEN itemquantity IS NULL THEN 1 ELSE 0 END) AS null_in_itemquantity,
    SUM(CASE WHEN itemrevenue IS NULL THEN 1 ELSE 0 END) AS null_in_itemrevenue,
    SUM(CASE WHEN transactionrevenue IS NULL THEN 1 ELSE 0 END) AS null_in_transactionrevenue,
    SUM(CASE WHEN transactionid IS NULL THEN 1 ELSE 0 END) AS null_in_transactionid,
    SUM(CASE WHEN pagetitle IS NULL THEN 1 ELSE 0 END) AS null_in_pagetitle,
    SUM(CASE WHEN searchkeyword IS NULL THEN 1 ELSE 0 END) AS null_in_searchkeyword,
    SUM(CASE WHEN pagepathlevel1 IS NULL THEN 1 ELSE 0 END) AS null_in_pagepathlevel1,
    SUM(CASE WHEN ecommerceactiontype IS NULL THEN 1 ELSE 0 END) AS null_in_ecommerceactiontype,
    SUM(CASE WHEN ecommerceactionstep IS NULL THEN 1 ELSE 0 END) AS null_in_ecommerceactionstep,
    SUM(CASE WHEN ecommerceactionoption IS NULL THEN 1 ELSE 0 END) AS null_in_ecommerceactionoption
FROM all_sessions;

 * postgresql://postgres:***@localhost/ecomm_cleanse
1 rows affected.


table_name,total_rows,null_in_fullvisitorid,null_in_channelgrouping,null_in_time,null_in_country,null_in_city,null_in_totaltransactionrevenue,null_in_transactions,null_in_timeonsite,null_in_pageviews,null_in_sessionqualitydim,null_in_date,null_in_visitid,null_in_type,null_in_productrefundamount,null_in_productquantity,null_in_productprice,null_in_productrevenue,null_in_productsku,null_in_v2productname,null_in_v2productcategory,null_in_productvariant,null_in_currencycode,null_in_itemquantity,null_in_itemrevenue,null_in_transactionrevenue,null_in_transactionid,null_in_pagetitle,null_in_searchkeyword,null_in_pagepathlevel1,null_in_ecommerceactiontype,null_in_ecommerceactionstep,null_in_ecommerceactionoption
all_sessions,15134,0,0,0,0,0,15053,15053,3300,0,13906,0,0,0,15134,15081,0,15130,0,0,0,0,272,15134,15134,15130,15125,1,15134,0,0,0,15103


In [120]:
# assign the last query result to a pandas dataframe and display as transposed table
_.DataFrame().T

Unnamed: 0,0
table_name,all_sessions
total_rows,15134
null_in_fullvisitorid,0
null_in_channelgrouping,0
null_in_time,0
null_in_country,0
null_in_city,0
null_in_totaltransactionrevenue,15053
null_in_transactions,15053
null_in_timeonsite,3300
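
- the null-count query above repeats the same expression for every column, so it could also be generated from a list of column names
- a minimal sketch, assuming the names come from a trusted source (they are interpolated directly into the SQL text); `build_null_count_query` is a hypothetical helper, not part of the notebook

```python
def build_null_count_query(table_name, columns):
    """Return a SELECT that counts NULLs in each of the given columns.

    Identifiers are interpolated as-is, so only pass trusted names.
    """
    select_items = [
        f"    '{table_name}' AS table_name",
        "    COUNT(*) AS total_rows",
    ]
    for column in columns:
        select_items.append(
            f"    SUM(CASE WHEN {column} IS NULL THEN 1 ELSE 0 END) AS null_in_{column}"
        )
    return "SELECT\n" + ",\n".join(select_items) + f"\nFROM {table_name};"


# Build the query for a small subset of the all_sessions columns.
query = build_null_count_query(
    "all_sessions", ["fullvisitorid", "city", "searchkeyword"]
)
print(query)
```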


#### Drop null columns

- four columns are null in all 15,134 rows: `productrefundamount`, `itemquantity`, `itemrevenue`, and `searchkeyword`
- we'll drop these columns, since no data is lost and later queries and analysis have less to process

In [121]:
%%sql
ALTER TABLE all_sessions
DROP COLUMN IF EXISTS productrefundamount,
DROP COLUMN IF EXISTS itemquantity,
DROP COLUMN IF EXISTS itemrevenue,
DROP COLUMN IF EXISTS searchkeyword;

 * postgresql://postgres:***@localhost/ecomm_cleanse
Done.


[]
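
- the choice of columns to drop follows directly from the null counts: a column qualifies when its null count equals the total row count
- a minimal sketch of that rule, using a subset of the counts reported above; the helper names `all_null_columns` and `build_drop_statement` are hypothetical

```python
def all_null_columns(total_rows, null_counts):
    """Return the columns whose null count equals the total row count."""
    return [column for column, nulls in null_counts.items() if nulls == total_rows]


def build_drop_statement(table_name, columns):
    """Build an ALTER TABLE statement dropping the given (trusted) columns."""
    clauses = ",\n".join(f"DROP COLUMN IF EXISTS {column}" for column in columns)
    return f"ALTER TABLE {table_name}\n{clauses};"


# Subset of the null counts reported for all_sessions (15134 rows in total).
null_counts = {
    "city": 0,
    "timeonsite": 3300,
    "productrefundamount": 15134,
    "productrevenue": 15130,
    "itemquantity": 15134,
    "itemrevenue": 15134,
    "searchkeyword": 15134,
}

to_drop = all_null_columns(15134, null_counts)
statement = build_drop_statement("all_sessions", to_drop)
print(statement)
```

- mostly-null columns such as `productrevenue` (15,130 nulls) are kept by this rule, since they still hold a few values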

In [122]:
%%sql

-- confirm that the columns have been dropped
SELECT COLUMN_NAME 
FROM INFORMATION_SCHEMA.COLUMNS 
WHERE TABLE_NAME = 'all_sessions';

 * postgresql://postgres:***@localhost/ecomm_cleanse
28 rows affected.


column_name
fullvisitorid
channelgrouping
time
country
city
totaltransactionrevenue
transactions
timeonsite
pageviews
sessionqualitydim


### `date` column

- we loaded the data as string type so that no values were altered when the tables were filled from dataframes created from the csv data
- we will now convert columns to the appropriate data type as needed

- we confirmed above that the `date` column has no null values, so let's convert it to date type

In [123]:
%%sql
-- change the data type of all_sessions.date to date
ALTER TABLE all_sessions
ALTER COLUMN date TYPE DATE USING date::DATE;

 * postgresql://postgres:***@localhost/ecomm_cleanse
Done.


[]

In [124]:
%%sql
-- check the data type of the all_sessions.date

SELECT COLUMN_NAME, DATA_TYPE
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'all_sessions'
LIMIT 2;


 * postgresql://postgres:***@localhost/ecomm_cleanse
2 rows affected.


column_name,data_type
date,date
channelgrouping,text


### Financial data correction

- many of the columns have data that is meant to be currency
- these have extremely large values, which are likely a data recording issue involving a misplaced decimal point
    - we have context for what the prices are likely meant to represent, within an order of magnitude, given the `v2productname`
        - e.g., a t-shirt is unlikely to be priced at $23 million
- we will convert these values to numeric type, and then divide them by the appropriate value

In [125]:
%%sql
SELECT totaltransactionrevenue, productprice, productrevenue, transactionrevenue, v2productname
FROM all_sessions
WHERE totaltransactionrevenue IS NOT NULL
    AND productprice IS NOT NULL
    AND productrevenue IS NOT NULL
    AND transactionrevenue IS NOT NULL;

 * postgresql://postgres:***@localhost/ecomm_cleanse
4 rows affected.


totaltransactionrevenue,productprice,productrevenue,transactionrevenue,v2productname
200000000,119000000,120000000,200000000,Nest® Cam Indoor Security Camera - USA
169970000,55990000,58656666,169970000,Compact Bluetooth Speaker
1015480000,3500000,176400000,1015480000,Reusable Shopping Bag
1005500000,59990000,60365000,1005500000,Google Bongo Cupholder Bluetooth Speaker


- the above values appear to be inflated by a factor of 1,000,000
    - let's assume this is the case, given the .99 pattern typically associated with retail prices
        - e.g., the Compact Bluetooth Speaker would have its price corrected by moving the 99 to become decimals
            - 55990000 / 1000000 = 55.99

- assume that each column containing currency values has been affected by the same inflation factor

- let's test this before making the alteration

In [126]:
%%sql
SELECT
    ROUND((totaltransactionrevenue::NUMERIC / 1000000), 2) AS totaltransactionrevenue,
    ROUND((productprice::NUMERIC / 1000000), 2) AS productprice,
    ROUND((productrevenue::NUMERIC / 1000000), 2) AS productrevenue,
    ROUND((transactionrevenue::NUMERIC / 1000000), 2) AS transactionrevenue,
    v2productname
FROM all_sessions
WHERE totaltransactionrevenue IS NOT NULL
    AND productprice IS NOT NULL
    AND productrevenue IS NOT NULL
    AND transactionrevenue IS NOT NULL;

 * postgresql://postgres:***@localhost/ecomm_cleanse
4 rows affected.


totaltransactionrevenue,productprice,productrevenue,transactionrevenue,v2productname
200.0,119.0,120.0,200.0,Nest® Cam Indoor Security Camera - USA
169.97,55.99,58.66,169.97,Compact Bluetooth Speaker
1015.48,3.5,176.4,1015.48,Reusable Shopping Bag
1005.5,59.99,60.37,1005.5,Google Bongo Cupholder Bluetooth Speaker


- the values returned by the above query look much more appropriate, so let's alter the table to implement this change

In [127]:
%%sql
-- change the data type to numeric, divide by 1 million, round to 2 decimal places
ALTER TABLE all_sessions
	ALTER COLUMN totaltransactionrevenue TYPE NUMERIC USING ROUND((totaltransactionrevenue::NUMERIC / 1000000), 2),
	ALTER COLUMN productprice TYPE NUMERIC USING ROUND((productprice::NUMERIC / 1000000), 2),
	ALTER COLUMN productrevenue TYPE NUMERIC USING ROUND((productrevenue::NUMERIC / 1000000), 2),
	ALTER COLUMN transactionrevenue TYPE NUMERIC USING ROUND((transactionrevenue::NUMERIC / 1000000), 2);


 * postgresql://postgres:***@localhost/ecomm_cleanse
Done.


[]
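
- the same correction can be checked by hand outside the database
- a minimal sketch using Python's `decimal` module, applied to raw values from the sample rows above; `rescale` is a hypothetical helper, and the half-up rounding mode is chosen here because it reproduces the 60.37 shown in the query output

```python
from decimal import Decimal, ROUND_HALF_UP

SCALE = Decimal(1_000_000)  # assumed inflation factor
CENT = Decimal("0.01")      # round to 2 decimal places


def rescale(raw):
    """Divide a raw text value by one million and round to cents; keep nulls."""
    if raw is None:
        return None
    return (Decimal(raw) / SCALE).quantize(CENT, rounding=ROUND_HALF_UP)


# Raw productprice / productrevenue values from the sample rows, plus a null.
samples = ["55990000", "58656666", "60365000", None]
converted = [rescale(value) for value in samples]
print(converted)
```

- exact decimal arithmetic is used because binary floats may not represent values such as 60.365 exactly, which can change the result of rounding at the half-way point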

In [128]:
%%sql
SELECT totaltransactionrevenue, productprice, productrevenue, transactionrevenue, v2productname
FROM all_sessions
WHERE totaltransactionrevenue IS NOT NULL
    AND productprice IS NOT NULL
    AND productrevenue IS NOT NULL
    AND transactionrevenue IS NOT NULL;

 * postgresql://postgres:***@localhost/ecomm_cleanse
4 rows affected.


totaltransactionrevenue,productprice,productrevenue,transactionrevenue,v2productname
200.0,119.0,120.0,200.0,Nest® Cam Indoor Security Camera - USA
169.97,55.99,58.66,169.97,Compact Bluetooth Speaker
1015.48,3.5,176.4,1015.48,Reusable Shopping Bag
1005.5,59.99,60.37,1005.5,Google Bongo Cupholder Bluetooth Speaker


In [129]:
%sql SELECT * FROM all_sessions WHERE totaltransactionrevenue IS NOT NULL LIMIT 5;

 * postgresql://postgres:***@localhost/ecomm_cleanse
5 rows affected.


fullvisitorid,channelgrouping,time,country,city,totaltransactionrevenue,transactions,timeonsite,pageviews,sessionqualitydim,date,visitid,type,productquantity,productprice,productrevenue,productsku,v2productname,v2productcategory,productvariant,currencycode,transactionrevenue,transactionid,pagetitle,pagepathlevel1,ecommerceactiontype,ecommerceactionstep,ecommerceactionoption
3440884800752704995,Direct,55394,United States,Palo Alto,305.0,1,228,13,,2016-11-16,1479318391,PAGE,,119.0,,GGOENEBQ078999,Nest® Cam Outdoor Security Camera - USA,Home/Nest/Nest-USA/,(not set),USD,,,Nest-USA,/google+redesign/,0,1,
395821106338763980,Referral,14111,United States,Palo Alto,152.0,1,404,10,,2017-02-22,1487816197,EVENT,1.0,149.0,,GGOENEBJ079499,Nest® Learning Thermostat 3rd Gen-USA - Stainless Steel,Home/Nest/Nest-USA/,(not set),USD,,,Nest-USA,/google+redesign/,3,1,
4088086075239844129,Organic Search,0,United States,not available in demo dataset,13.21,1,297,11,,2016-09-06,1473206048,PAGE,,19.99,,GGOEGBMB073799,Google Zipper-front Sports Bag,Home/Bags/,(not set),USD,,,Bags,/google+redesign/,0,1,
803888563485194008,Referral,8713,United States,not available in demo dataset,32.18,1,422,10,,2016-09-26,1474892406,PAGE,,15.19,,GGOEGAAX0293,Android Women's Short Sleeve Tri-blend Badge Tee Light Grey,Home/Apparel/Women's/Women's-T-Shirts/,(not set),USD,,,Women's-T-Shirts,/google+redesign/,0,1,
8444725132150789814,Referral,700092,United States,Atlanta,742.48,1,865,20,,2016-12-13,1481645890,PAGE,4.0,11.2,,GGOEGBJR018199,Reusable Shopping Bag,Bags,Single Option Only,USD,,,Checkout Your Information,/yourinfo.html,5,1,Billing and Shipping


## `analytics` table

### Duplicate records

- determine if any rows in the table are duplicates

In [130]:
%%sql
SELECT (
    SELECT COUNT(*)
    FROM (SELECT DISTINCT * FROM analytics) AS distinct_rows
) AS unique_rows,
(
    SELECT COUNT(*)
    FROM analytics
) AS total_rows;


 * postgresql://postgres:***@localhost/ecomm_cleanse


1 rows affected.


unique_rows,total_rows
1739308,4301122


- from the above query, we can see that only 1,739,308 of the 4,301,122 rows in `analytics` are unique
- without more information, we can't determine whether these duplicates are caused by technical glitches or by data collection issues
    - a count of how many times each record is duplicated may be worth keeping, since patterns in users, times, or geographic locations could help trace the source of the glitch

In [131]:
%sql SELECT * FROM analytics LIMIT 1;

 * postgresql://postgres:***@localhost/ecomm_cleanse
1 rows affected.


visitnumber,visitid,visitstarttime,date,fullvisitorid,userid,channelgrouping,socialengagementtype,unitssold,pageviews,timeonsite,bounces,revenue,unitprice
7,1498424366,1498424366,20170625,9444016982622091039,,Display,Not Socially Engaged,,1,,1,,8990000


In [132]:
%%sql
-- count how many times each duplicated row occurs, grouping by every column except userid
SELECT COUNT(*) as duplicate_rows_count, visitnumber, visitid, visitstarttime, date, fullvisitorid, channelgrouping, socialengagementtype, unitssold, pageviews, timeonsite, bounces, revenue, unitprice
FROM analytics
GROUP BY visitnumber, visitid, visitstarttime, date, fullvisitorid, channelgrouping, socialengagementtype, unitssold, pageviews, timeonsite, bounces, revenue, unitprice
HAVING COUNT(*) > 1
ORDER BY COUNT(*) DESC
LIMIT 5;

 * postgresql://postgres:***@localhost/ecomm_cleanse
5 rows affected.


duplicate_rows_count,visitnumber,visitid,visitstarttime,date,fullvisitorid,channelgrouping,socialengagementtype,unitssold,pageviews,timeonsite,bounces,revenue,unitprice
298,1,1501621275,1501621275,20170801,7342454030115611747,Direct,Not Socially Engaged,,155,220,,,14990000
178,19,1495993707,1495993707,20170528,6347736525399420278,Referral,Not Socially Engaged,,104,2531,,,15190000
176,3,1500237228,1500237228,20170716,1280993661204347450,Referral,Not Socially Engaged,,145,4383,,,15190000
163,3,1500169155,1500169155,20170715,3663775838456282680,Paid Search,Not Socially Engaged,,139,2249,,,0
148,4,1496118354,1496118354,20170529,2943746373898362304,Referral,Not Socially Engaged,,99,4138,,,18990000


- from the above query, we can see that there is a significant number of duplicated records, and that they are not evenly distributed: certain `visitid`s have generated far more duplicates than others
- investigating the source of the duplicates is not possible with the data available, and is out of project scope, so as a compromise we will save the full duplicate counts as a csv file, which could be shared with the appropriate parties in a real-world scenario to help them locate any technical glitches causing this issue

In [133]:
%%sql

CREATE TEMPORARY VIEW analytics_duplicate_count AS (
    SELECT COUNT(*) as duplicate_rows_count, visitnumber, visitid, visitstarttime, date, fullvisitorid, channelgrouping, socialengagementtype, unitssold, pageviews, timeonsite, bounces, revenue, unitprice
    FROM analytics
    GROUP BY visitnumber, visitid, visitstarttime, date, fullvisitorid, channelgrouping, socialengagementtype, unitssold, pageviews, timeonsite, bounces, revenue, unitprice
    HAVING COUNT(*) > 1
    ORDER BY COUNT(*) DESC
);

 * postgresql://postgres:***@localhost/ecomm_cleanse
Done.


[]

#### Export count of duplicate records

In [134]:
# query the duplicate-count view and store the result set
analytics_duplicate_count_df = %sql SELECT * FROM analytics_duplicate_count;

# export the dataframe to a csv file
analytics_duplicate_count_df.DataFrame().to_csv('../data/processed/analytics_duplicate_count.csv', index=False)

 * postgresql://postgres:***@localhost/ecomm_cleanse
870577 rows affected.
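
- the count-then-export step can also be sketched with the standard library alone
- this is an illustration, not the notebook's method: the rows and counts are made up, only three columns are used, and the csv is written to an in-memory buffer instead of a file

```python
import csv
import io
from collections import Counter

# Hypothetical rows with three columns: visitid, channelgrouping, pageviews.
rows = [
    ("1501621275", "Direct", "155"),
    ("1501621275", "Direct", "155"),
    ("1501621275", "Direct", "155"),
    ("1495993707", "Referral", "104"),
    ("1495993707", "Referral", "104"),
    ("1500169155", "Paid Search", "139"),
]

# Equivalent of GROUP BY all columns ... HAVING COUNT(*) > 1 ORDER BY COUNT(*) DESC.
counts = Counter(rows)
duplicates = [(n, *row) for row, n in counts.most_common() if n > 1]

# Write the counts as csv, with the count in the first column.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["duplicate_rows_count", "visitid", "channelgrouping", "pageviews"])
writer.writerows(duplicates)
csv_text = buffer.getvalue()
print(csv_text)
```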


- with the duplicate rows counted and saved as a csv, we can safely drop the duplicate rows, and continue with the analysis

In [135]:
%%sql

-- create a deduplicated table
CREATE TABLE analytics_deduped AS (
    SELECT DISTINCT * FROM analytics
);

 * postgresql://postgres:***@localhost/ecomm_cleanse
1739308 rows affected.


[]

In [136]:
%%sql
-- drop the original table
DROP TABLE analytics CASCADE;

-- rename the deduplicated table to the original table name
ALTER TABLE analytics_deduped RENAME TO analytics;

 * postgresql://postgres:***@localhost/ecomm_cleanse
Done.
Done.


[]
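
- the create, drop, and rename sequence above can be tried on a small table before running it on the real one
- a minimal sketch using Python's built-in `sqlite3` with an in-memory database rather than PostgreSQL; the table `analytics_demo` and its rows are made up for illustration

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE analytics_demo (visitid TEXT, unitprice TEXT)")
conn.executemany(
    "INSERT INTO analytics_demo VALUES (?, ?)",
    [("1", "8990000"), ("1", "8990000"), ("2", "0"), ("2", None), ("2", None)],
)
rows_before = conn.execute("SELECT COUNT(*) FROM analytics_demo").fetchone()[0]

# Build the deduplicated copy, then swap it in under the original name.
conn.execute(
    "CREATE TABLE analytics_demo_deduped AS SELECT DISTINCT * FROM analytics_demo"
)
conn.execute("DROP TABLE analytics_demo")
conn.execute("ALTER TABLE analytics_demo_deduped RENAME TO analytics_demo")
conn.commit()

rows_after = conn.execute("SELECT COUNT(*) FROM analytics_demo").fetchone()[0]
print(rows_before, rows_after)
conn.close()
```

- rows that match in every column, including matching nulls, collapse to a single row, so comparing the counts before and after is a quick sanity check on the swap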

### Determine null columns

In [137]:
%%sql

SELECT
    'analytics' AS table_name,
    COUNT(*) AS total_rows,
    SUM(CASE WHEN visitnumber IS NULL THEN 1 ELSE 0 END) AS null_in_visitnumber,
    SUM(CASE WHEN visitid IS NULL THEN 1 ELSE 0 END) AS null_in_visitid,
    SUM(CASE WHEN visitstarttime IS NULL THEN 1 ELSE 0 END) AS null_in_visitstarttime,
    SUM(CASE WHEN date IS NULL THEN 1 ELSE 0 END) AS null_in_date,
    SUM(CASE WHEN fullvisitorid IS NULL THEN 1 ELSE 0 END) AS null_in_fullvisitorid,
    SUM(CASE WHEN userid IS NULL THEN 1 ELSE 0 END) AS null_in_userid,
    SUM(CASE WHEN channelgrouping IS NULL THEN 1 ELSE 0 END) AS null_in_channelgrouping,
    SUM(CASE WHEN socialengagementtype IS NULL THEN 1 ELSE 0 END) AS null_in_socialengagementtype,
    SUM(CASE WHEN unitssold IS NULL THEN 1 ELSE 0 END) AS null_in_unitssold,
    SUM(CASE WHEN pageviews IS NULL THEN 1 ELSE 0 END) AS null_in_pageviews,
    SUM(CASE WHEN timeonsite IS NULL THEN 1 ELSE 0 END) AS null_in_timeonsite,
    SUM(CASE WHEN bounces IS NULL THEN 1 ELSE 0 END) AS null_in_bounces,
    SUM(CASE WHEN revenue IS NULL THEN 1 ELSE 0 END) AS null_in_revenue,
    SUM(CASE WHEN unitprice IS NULL THEN 1 ELSE 0 END) AS null_in_unitprice
FROM analytics;

 * postgresql://postgres:***@localhost/ecomm_cleanse
1 rows affected.


table_name,total_rows,null_in_visitnumber,null_in_visitid,null_in_visitstarttime,null_in_date,null_in_fullvisitorid,null_in_userid,null_in_channelgrouping,null_in_socialengagementtype,null_in_unitssold,null_in_pageviews,null_in_timeonsite,null_in_bounces,null_in_revenue,null_in_unitprice
analytics,1739308,0,0,0,0,0,1739308,0,0,1678454,51,346491,1393938,1725889,0


In [138]:
# convert the last query result to a pandas dataframe and display it as transposed table
_.DataFrame().T

Unnamed: 0,0
table_name,analytics
total_rows,1739308
null_in_visitnumber,0
null_in_visitid,0
null_in_visitstarttime,0
null_in_date,0
null_in_fullvisitorid,0
null_in_userid,1739308
null_in_channelgrouping,0
null_in_socialengagementtype,0


#### Drop null columns

In [139]:
%%sql
ALTER TABLE analytics
DROP COLUMN IF EXISTS userid;

 * postgresql://postgres:***@localhost/ecomm_cleanse
Done.


[]

In [140]:
%%sql

-- confirm that the column has been dropped
SELECT COLUMN_NAME
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'analytics';

 * postgresql://postgres:***@localhost/ecomm_cleanse
13 rows affected.


column_name
visitnumber
visitid
visitstarttime
date
fullvisitorid
channelgrouping
socialengagementtype
unitssold
pageviews
timeonsite


### `date` column

In [141]:
%%sql
-- change the data type of the analytics.date to date
ALTER TABLE analytics
ALTER COLUMN date TYPE DATE USING date::DATE;

 * postgresql://postgres:***@localhost/ecomm_cleanse


Done.


[]
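
- the cast above relies on every text value being parseable as a date, such as the `20170625` seen in the sample row earlier
- a minimal sketch of the same conversion in Python, assuming the values follow the `YYYYMMDD` pattern; `parse_yyyymmdd` and `find_unparseable` are hypothetical helpers for spotting values that would not convert

```python
from datetime import date, datetime


def parse_yyyymmdd(raw):
    """Convert a 'YYYYMMDD' string to a date; raises ValueError if it does not fit."""
    return datetime.strptime(raw, "%Y%m%d").date()


def find_unparseable(values):
    """Return the values that cannot be read as 'YYYYMMDD' dates."""
    bad = []
    for value in values:
        try:
            parse_yyyymmdd(value)
        except ValueError:
            bad.append(value)
    return bad


parsed = parse_yyyymmdd("20170625")
print(parsed.isoformat())

# The second and third values are made up to show what gets flagged.
rejected = find_unparseable(["20170625", "20171345", "n/a"])
print(rejected)
```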

In [142]:
%%sql
-- check the data type of the analytics.date
SELECT column_name, data_type
FROM information_schema.columns
WHERE table_name = 'analytics'
LIMIT 2;


 * postgresql://postgres:***@localhost/ecomm_cleanse
2 rows affected.


column_name,data_type
date,date
visitid,text


### Financial data correction
- as in the `all_sessions` table, `analytics` contains columns that appear to be inflated by the same factor of 1,000,000, namely `revenue` and `unitprice`

In [143]:
%sql SELECT revenue, unitprice FROM analytics WHERE revenue IS NOT NULL LIMIT 5;

 * postgresql://postgres:***@localhost/ecomm_cleanse
5 rows affected.


revenue,unitprice
242000000,79000000
41590000,33590000
41590000,33590000
37590000,33590000
41590000,33590000


In [144]:
%%sql
-- change the data type to numeric, divide by 1 million, round to 2 decimal places
ALTER TABLE analytics
	ALTER COLUMN revenue TYPE NUMERIC USING ROUND((revenue::NUMERIC / 1000000), 2),
	ALTER COLUMN unitprice TYPE NUMERIC USING ROUND((unitprice::NUMERIC / 1000000), 2);

 * postgresql://postgres:***@localhost/ecomm_cleanse


Done.


[]

In [145]:
%sql SELECT revenue, unitprice FROM analytics WHERE revenue IS NOT NULL LIMIT 5;

 * postgresql://postgres:***@localhost/ecomm_cleanse
5 rows affected.


revenue,unitprice
41.59,33.59
41.59,33.59
37.59,33.59
41.59,33.59
14.49,12.99
