Final Week - Dillard's Database Exercises

Date created: 25 April 2017

Last updated: 9 Sept 2017

This is the COMPLETE answer key (including explanations where necessary) for Week 5 (final week) of the "Managing Big Data wtih MySQL" course by Duke University.

I wrote this answer key as no official answers have been released online. These answers reflect my own work and are accurate to the best of my knowledge. I will update them if the professors ever release an "official" answer key.

These answers will come in handy during the final exam for the course, which requires one to make similar queries.

Update: These answers are based on the older version of UA_Dillards dataset (not UA_Dillards1, nor UA_Dillards_2016). For example, this means I am using the table SKSTINFO and not SKSTINFO_FIX which is the newer version.

Meanwhile, let's start.

Answers

To start, enter DATABASE ua_dillards; into the Teradata SQL scratchpad.

Question 1

How many distinct dates are there in the saledate column of the transaction table for each month/year combination in the database?

SELECT 
  EXTRACT (month FROM saledate) AS month_num, 
  EXTRACT (year FROM saledate) AS year_num, 
  COUNT (DISTINCT EXTRACT (day FROM saledate)) AS days_in_month, 
  COUNT (EXTRACT (day FROM saledate)) AS num_transactions -- I'm curious abt num transactions per mth
FROM trnsact
GROUP BY month_num, year_num
ORDER BY year_num, month_num

Result

MONTH_NUM	YEAR_NUM	DAYS_IN_MONTH	NUM_TRANSACTIONS
8	2004	31	8292953
9	2004	30	8967415
10	2004	31	8412131
11	2004	29	7047319
12	2004	30	13383892
1	2005	31	8952311
2	2005	28	11352221
3	2005	30	8940444
4	2005	30	9082523
5	2005	31	7715779
6	2005	30	7922997
7	2005	31	11122770
8	2005	27	9724141

There appears to be an incomplete record of the month during August 2005 (as it has only 27 days).

As the homework instructs, I will restrict all further analysis of August sales to only include those recorded in 2004, and not 2005.

Next, it appears that Dillard's department has designated holidays from their calendar. None of the stores have data for 25 November (Thanksgiving), 25 December (Christmas), or 27 March (their annual sale date).

Question 2

Use a CASE statement within an aggregate function to determine which sku had the greatest total sales during the combined summer months of June, July, and August.

SELECT DISTINCT sku,
SUM (CASE WHEN EXTRACT(month FROM saledate)=6 AND stype='p' THEN amt END) AS rev_june,
SUM (CASE WHEN EXTRACT(month FROM saledate)=7 AND stype='p' THEN amt END) AS rev_july,
SUM (CASE WHEN EXTRACT(month FROM saledate)=8 AND stype='p' THEN amt END) AS rev_aug, -- !
(rev_aug + rev_june + rev_july) AS rev_total_summer
FROM trnsact
GROUP BY sku
ORDER BY rev_total_summer DESC
HAVING rev_total_summer > 0 -- exclude null values

There is a problem with this question statement. It suggests that:

'If your query is correct, you should find that sku #2783996 has the fifth greatest total sales during the combined months of June, July, and August, with a total summer sales sum of $897,807.01.'

However, you will only get this value if you include values from both August 2004 and August 2005, which Question 1 explicitly states not to do so.

A more sensible answer, that includes only one copy of each month per year, would be:

SELECT DISTINCT sku,
SUM (CASE WHEN EXTRACT(month FROM saledate)=6 AND stype='p' THEN amt END) AS rev_june,
SUM (CASE WHEN EXTRACT(month FROM saledate)=7 AND stype='p' THEN amt END) AS rev_july,
SUM (CASE WHEN EXTRACT(month FROM saledate)=8 AND stype='p' 
AND EXTRACT(year FROM saledate)=12 -- new line
THEN amt END) AS rev_aug,
(rev_aug + rev_june + rev_july) AS rev_total_summer
FROM trnsact
GROUP BY sku
ORDER BY rev_total_summer DESC
HAVING rev_total_summer > 0 -- exclude null values

This gives the answer:

SKU ITEM CODE	REV_JUNE 2005	REV_JULY 2005	REV_AUG 2004	REV_TOTAL_SUMMER
4108011	309511.88	379326.00	499821.00	1, 188, 658.88
3524026	269934.50	344833.00	458227.50	1, 072, 995.00
5528349	339349.00	325156.50	337221.00	1, 001, 726.50
3978011	197885.37	259279.60	308910.00	766, 074.97
2783996	190252.01	197414.50	313736.50	701, 403.01

Additional background information on the most popular summer items: (because I'm curious lol)

SELECT *
FROM SKUINFO
WHERE sku IN (4108011, 3524026, 5528349, 3978011, 2783996)

SKU CODE	COLOUR	SIZE	PACKSIZE	BRAND
4108011	DDML	DDML 4OZ	6	CLINIQUE
3524026	DDML	PUMP 4.2 OZ	6	CLINIQUE
5528349	01-BLACK	01-BLACK	3	LANCOME
3978011	CLARIFY #2	13.5 OZ	3	CLINIQUE
2783996	01-BLACK	NO SIZE	3	LANCOME

Exercise 3.

How many distinct dates are there in the saledate column of the transaction table for each month/year/store combination in the database? Sort your results by the number of days per combination in ascending order.

SELECT 
  EXTRACT (month FROM saledate) AS month_num, 
  EXTRACT (year FROM saledate) AS year_num,
  store,
  COUNT (DISTINCT saledate) AS num_dates
FROM trnsact
GROUP BY month_num, year_num, store
ORDER BY num_dates asc

Some stores appear to have missing or removed data (i.e., less than 30 days per month).

MONTH	YEAR	STORE ID	NUM_DATES
7	2005	7604	1
3	2005	8304	1
9	2004	4402	1
8	2004	9906	1
8	2004	8304	1
8	2004	7203	3
3	2005	6402	11

We will note the missing data in case of in future calculations. Where possible, we will aim to exclude months that do not meet criteria when doing trend analysis.

Exercise 4a.

What is the average daily revenue for each store/month/year combination in the database? Calculate this by dividing the total revenue for a group by the number of sales days available in the transaction table for that group.

We can solve this by modifying the solution from Qn 3 to include revenue data.

SELECT 
  store, 
  EXTRACT (month FROM saledate) AS month_num, 
  EXTRACT (year FROM saledate) AS year_num,
  COUNT (DISTINCT saledate) AS num_dates,
  SUM(amt) AS total_revenue,
  revenue/num_dates AS daily_revenue 
FROM trnsact
WHERE stype='p'
GROUP BY store, month_num, year_num
ORDER BY daily_revenue desc

Dr Jana: If your query is correct, you should find that store #204 has an average daily revenue of $16,303.65 in August of 2005.

-- Modified to check results
SELECT 
  store, 
  EXTRACT (month FROM saledate) AS month_num, 
  EXTRACT (year FROM saledate) AS year_num,
  COUNT (DISTINCT saledate) AS num_dates,
  SUM(amt) AS total_revenue,
  total_revenue/num_dates AS daily_revenue 
FROM trnsact
WHERE stype='p' AND store=204 -- !
GROUP BY store, month_num, year_num
ORDER BY year_num desc, month_num desc -- !

Results

STORE	MONTH_NUM	YEAR_NUM	NUM_DATES	TOTAL_REVENUE	DAILY_REVENUE
204	12	2004	30	651309.29	21710.31
204	7	2005	31	520512.72	16790.73
204	4	2005	30	503312.54	16777.08
204	8	2005	27	440198.68	16303.65

Awesome!

For all of the exercises that follow, unless otherwise specified, we will assess sales by summing the total revenue for a given time period, and dividing by the total number of days that contributed to that time period. This will give us “average daily revenue”.

Question 4b.

Modify the query you wrote above to assess the average daily revenue for each store/month/year combination with a clause that removes all the data from August, 2005. Then, given the data we have available in our data set, I propose that we only examine store/month/year combinations that have at least 20 days of data within that month.

SELECT 
  sub.store, 
  sub.year_num, 
  sub.month_num, 
  sub.num_dates, 
  sub.daily_revenue
FROM (
  SELECT 
  store, 
  EXTRACT (month FROM saledate) AS month_num, 
  EXTRACT (year FROM saledate) AS year_num,
  COUNT (DISTINCT saledate) AS num_dates,
  SUM(amt) AS total_revenue,
  total_revenue/num_dates AS daily_revenue,
  (CASE 
  WHEN (year_num=2005 AND month_num=8) THEN 'cannot' ELSE 'can' 
  END) As can_use_anot
  FROM trnsact
  WHERE stype='p' AND can_use_anot='can'
  GROUP BY store, month_num, year_num
  ) AS sub
HAVING sub.num_dates >=20
GROUP BY sub.store, sub.year_num, sub.month_num, sub.num_dates, sub.daily_revenue
ORDER BY sub.num_dates ASC;

DR JANA: Save your final queries that remove “bad data” for use in subsequent exercises. From now on (and in the graded quiz), when I ask for average daily revenue: (1) Only examine purchases (not returns). (2) Exclude all stores with less than 20 days of data. (3) Exclude all data from August, 2005.

Question 5.

What is the average daily revenue brought in by Dillard’s stores in areas of high, medium, or low levels of high school education? Define areas of “low” education as those that have high school graduation rates between 50-60%, areas of “medium” education as those that have high school graduation rates between 60.01-70%, and areas of “high” education as those that have high school graduation rates of above 70%.

I'll start by counting the number of stores within each education level first.

SELECT
(CASE
WHEN msa_high>=50 AND msa_high<70 THEN 'low'
WHEN msa_high>=70 AND msa_high<80 THEN 'med'
WHEN msa_high>=80 THEN 'high'
END) AS education_levels,
COUNT (DISTINCT store) AS num_stores
FROM store_msa
GROUP BY education_levels

Unfortunately it's not a nice distribution:

EDUCATION_LEVEL	NUM_STORES
LOW (>50%)	324
MED (>60%)	5
HIGH (>70%)	4

It would be better if we could redistribute it like this:

EDUCATION_LEVEL	NUM_STORES
LOW (>50%)	213
MED (>70%)	111
HIGH (>80%)	9

But that's not what the question asked, so I'll leave it aside for now. Back to the question, let's merge them:

SELECT 
	(CASE
	WHEN s.msa_high >= 50 and s.msa_high < 60 THEN 'low'
	WHEN s.msa_high >= 60 and s.msa_high < 70 THEN 'medium'
	WHEN s.msa_high >= 70 THEN 'high'
	END) AS education_levels,
	SUM(sub.total_revenue)/SUM(sub.num_dates) AS avg_daily_revenue
FROM store_msa s 
	JOIN (
		SELECT 
		store, 
		EXTRACT (year FROM saledate) AS year_num,
		EXTRACT (month FROM saledate) AS month_num, 
		SUM(amt) AS total_revenue, 
		COUNT (DISTINCT (saledate)) AS num_dates,
		(CASE 
		WHEN (year_num=2005 AND month_num=8) THEN 'cannot' ELSE 'can' 
		END) As can_use_anot
		FROM trnsact
		WHERE stype='p' AND can_use_anot='can'
		GROUP BY year_num, month_num, store
		HAVING num_dates >= 20 -- moving this back to within the subquery
		) AS sub
			ON s.store = sub.store
GROUP BY education_levels;

I'm not sure why line 21 HAVING num_dates >= 20 only works when inside the subquery but not when requested from the outer query. It worked fine in the previous question. I guess something about aggregate functions??

DR JANA: If you have executed this query correctly, you will find that the average daily revenue brought in by Dillard’s stores in the low education group is a little more than $34,000, the average daily revenue brought in by Dillard’s stores in the medium education group is a little more than $25,000, and the average daily revenue brought in by Dillard’s stores in the high education group is just under $21,000.

EDUCATION_LEVEL	AVG_DAILY_REVENUE
low	34,159.76
medium	27,112.67
high	20,921.32

Hooray! Moving forward...

Whenever I ask you to calculate the average daily revenue for a group of stores in either these exercises or the quiz, do so by summing together all the revenue from all the entries in that group, and then dividing that summed total by the total number of sale days that contributed to the total. Do not compute averages of averages.

Question 6.

Compare the average daily revenues of the stores with the highest median msa_income and the lowest median msa_income. In what city and state were these stores, and which store had a higher average daily revenue? Use msa_income to calculate.

SELECT 
s.city, 
s.state,
s.msa_income,
SUM(sub.total_revenue)/SUM(sub.num_dates) AS avg_daily_revenue
FROM store_msa s 
	JOIN (
		SELECT 
		store,
		EXTRACT (year FROM saledate) AS year_num,
		EXTRACT (month FROM saledate) AS month_num,  
		SUM(amt) AS total_revenue, 
		COUNT(DISTINCT saledate) AS num_dates, 
		(CASE 
			WHEN (year_num=2005 AND month_num=8) THEN 'cannot' ELSE 'can' 
			END) As can_use_anot
		FROM trnsact 
		WHERE stype='p' AND can_use_anot='can'
		GROUP BY year_num, month_num, store
		HAVING num_dates >= 20
		) AS sub 
			ON s.store = sub.store
WHERE s.msa_income IN (
	(SELECT MAX(msa_income) FROM store_msa),
	(SELECT MIN(msa_income) FROM store_msa))
GROUP BY s.city, s.state;

Overall pretty similar to Qn 5.

CITY	STATE	AVG_DAILY_REVENUE
SPANISH FORT	AL	17884.08
MCALLEN	TX	56601.99

Exercise 7:

What is the brand of the sku with the greatest standard deviation in sprice? Only examine skus that have been part of over 100 transactions.

SELECT 
DISTINCT (t.SKU) AS item,
s.brand AS brand,
STDDEV_SAMP(t.sprice) AS dev_price,
COUNT(DISTINCT(t.SEQ||t.STORE||t.REGISTER||t.TRANNUM||t.SALEDATE)) AS distinct_transactions
FROM TRNSACT t
	JOIN SKUINFO s
		ON t.sku=s.sku
WHERE t.stype='p'
HAVING distinct_transactions>100
GROUP BY item, brand
ORDER BY dev_price DESC

I'm not sure which of these columns are unique so I put them all in together: SEQ, STORE, REGISTER, TRANNUM, SALEDATE.

ITEM	BRAND	STYLE	COLOR	SIZE	DEV_PRICE
2762683	HART SCH	403154133510	BLACK	42REG	175.8106
5453849	POLO FAS	9HA 726680	FA02	L	169.4284
5623849	POLO FAS	9HA 726680	FA02	M	164.4187

Exercise 8:

**Examine all the transactions for the sku with the greatest standard deviation in sprice, but only consider skus that are part of more than 100 transactions. Do you think the retail price was set too high, or just right? **

SELECT 
distinct(s.sku) AS items, 
s.brand,
AVG(t.sprice) AS avg_price,
STDDEV_SAMP(t.sprice) AS variation_price, 
avg(t.orgprice)-avg(t.sprice) AS sale_price_diff,
COUNT(distinct(t.trannum)) AS distinct_transactions
FROM skuinfo s 
JOIN trnsact t
ON s.sku=t.sku
WHERE stype='p'
GROUP BY items, s.brand
HAVING distinct_transactions > 100
ORDER BY variation_price DESC;

Not perfect, but consider how items with the highest variation (std dev) prices are not quite those with the greatest sales price differences. This may suggest that some stores are simply pricing items higher/lower across the band, rather than offering massively discounted sale prices (vs original prices) to clear their stock. This might simply reflect their msa_income differences around each store.

So... Was the retail price just right? Can't say for sure, but it's definitely not too high.

Exercise 9

What was the average daily revenue Dillard’s brought in during each month of the year?

SELECT 
(CASE
WHEN sub.month_num=1 THEN 'Jan'
WHEN sub.month_num=2 THEN 'Feb'
WHEN sub.month_num=3 THEN 'Mar'
WHEN sub.month_num=4 THEN 'Apr'
WHEN sub.month_num=5 THEN 'May'
WHEN sub.month_num=6 THEN 'Jun'
WHEN sub.month_num=7 THEN 'Jul'
WHEN sub.month_num=8 THEN 'Aug'
WHEN sub.month_num=9 THEN 'Sep'
WHEN sub.month_num=10 THEN 'Oct'
WHEN sub.month_num=11 THEN 'Nov'
WHEN sub.month_num=12 THEN 'Dec'
END) as month_name,
SUM(num_dates) AS num_days_in_month,
SUM(total_revenue)/SUM(num_dates) AS avg_monthly_revenue
FROM (
	SELECT 
	EXTRACT (month FROM saledate) AS month_num, 
	EXTRACT (year FROM saledate) AS year_num,
	COUNT (DISTINCT saledate) AS num_dates,
	SUM(amt) AS total_revenue,
	(CASE 
	WHEN (year_num=2005 AND month_num=8) THEN 'cannot' ELSE 'can' 
	END) As can_use_anot
	FROM trnsact
	WHERE stype='p' AND can_use_anot='can'
	GROUP BY month_num, year_num
	HAVING num_dates>=20
	) AS sub
GROUP BY month_name
ORDER BY avg_monthly_revenue DESC;

MONTH_NUM	DAYS_IN_MONTH	AVG_MONTHLY_REVENUE
Dec	30	11333356.01
Feb	28	7363752.69
Jul	31	7271088.69
Apr	30	6949616.95
Mar	30	6736315.39
May	31	6666962.59
Jun	30	6524845.42
Nov	29	6296913.50
Oct	31	6106357.90
Jan	31	5836833.31
Aug	31	5616841.37
Sep	30	5596588.02

DR JANA: you should find that December consistently has the best sales, September consistently has the worst or close to the worst sales, and July has very good sales, although less than December.

Question 10

Which department, in which city and state of what store, had the greatest percentage increase in average daily sales revenue from November to December? Note: Use percentage change.

Hints from the notes:

Need to join 4 tables
Use two CASE statements within an aggregate function to sum all revenue Nov and Dec 1 .Use two CASE statements within an aggregate function to count the number of sale days that contributed to the revenue in November and December, separately
Use these 4 fields to calculate the average daily revenue for November and December. You can then calculate the change in these values using the following % change formula: *(X-Y)/Y)*100.
Don’t forget to exclude “bad data” and to exclude return transactions.

First I'd try to find just the percentage increase in revenue from November to December, for each store. I will join the extra details like dept and stuff later.

SELECT 
sub.store,
SUM(CASE WHEN sub.month_num=11 THEN sub.amt END) AS Nov_revenue,
SUM(CASE WHEN sub.month_num=12 THEN sub.amt END) AS Dec_revenue,
COUNT(DISTINCT CASE WHEN sub.month_num=11 THEN sub.saledate END) AS Nov_days,
COUNT(DISTINCT CASE WHEN sub.month_num=12 THEN sub.saledate END) AS Dec_days,
Nov_revenue/Nov_days AS Nov_daily_rev, 
Dec_revenue/Dec_days AS Dec_daily_rev,
((Dec_daily_rev-Nov_daily_rev)/Nov_daily_rev)*100 AS percent_increase
FROM (
  SELECT 
  store,
  amt,
  saledate,
  EXTRACT (month FROM saledate) AS month_num, 
  EXTRACT (year FROM saledate) AS year_num,
  (CASE WHEN (year_num=2005 AND month_num=8) THEN 'cannot' ELSE 'can' END) As can_use_anot
  FROM trnsact
  WHERE stype='p' AND can_use_anot='can'
  ) AS sub
GROUP BY sub.store
HAVING Nov_days>=20 AND Dec_days>=20
ORDER BY percent_increase DESC;

STORE	NOV_REV	DEC_REV	NOV_DAYS	DEC_DAYS	NOV_DAILY_REV	DEC_DAILY_REV	PERCENT_INC
3809	210139.08	486314.01	29	30	7246.18	16210.47	124.00
303	175003.74	399975.83	29	30	6034.61	13332.53	121.00
7003	169776.27	380024.73	29	30	5854.35	12667.49	116.00

Seems okay. Let's add the others in.

SELECT -- outer query to drop all necessary columns from inner query
clean.store,
clean.dept,
clean.deptdesc,
clean.city,
clean.state,
clean.percent_increase
FROM (
	SELECT 
	sub.store,
	d.dept, 
	d.deptdesc,
	str.city,
	str.state,
	SUM(CASE WHEN sub.month_num=11 THEN sub.amt END) AS Nov_revenue,
	SUM(CASE WHEN sub.month_num=12 THEN sub.amt END) AS Dec_revenue,
	COUNT(DISTINCT CASE WHEN sub.month_num=11 THEN sub.saledate END) AS Nov_days,
	COUNT(DISTINCT CASE WHEN sub.month_num=12 THEN sub.saledate END) AS Dec_days,    
	Nov_revenue/Nov_days AS Nov_daily_rev, 
	Dec_revenue/Dec_days AS Dec_daily_rev,
	((Dec_daily_rev-Nov_daily_rev)/Nov_daily_rev)*100 AS percent_increase
	FROM (
		SELECT 
 		sku.dept, -- NEW: include this here bc you need to group-by departments at most granular lvl
    t.store,
		t.amt,
		t.saledate,
		EXTRACT (month FROM t.saledate) AS month_num, 
		EXTRACT (year FROM t.saledate) AS year_num,
		(CASE WHEN (year_num=2005 AND month_num=8) THEN 'cannot' ELSE 'can' END) As can_use_anot
		FROM trnsact t
			INNER JOIN skuinfo sku 
				ON t.sku=sku.sku 
		WHERE stype='p' AND can_use_anot='can' -- only query purchases, from legal dates
		) AS sub 
		INNER JOIN strinfo str
			ON str.store = sub.store -- to select city and state
		INNER JOIN deptinfo d 
			ON d.dept = sub.dept -- to select department description
	GROUP BY sub.store, d.dept, d.deptdesc, str.city, str.state
	HAVING Nov_days>=20 AND Dec_days>=20 
	) AS clean
GROUP BY 1,2,3,4,5,6
ORDER BY clean.percent_increase DESC

STORE	DEPT	DEPT_DESC	CITY	STATE	PERCENTAGE_INCREASE
3403	7205	LOUIS VL	SALINA	KS	596.00
9806	6402	FREDERI	MABELVALE	AR	476.00
404	2107	MAI	PINE BLUFF	AR	442.00

Question 11

What is the city and state of the store that had the greatest decrease in average daily revenue from August to September?

This is easy, just adapt the query from Qn 10 and remove unnecessary tables.

SELECT 
sub.store, 
str.city, -- left join store_info table for these two
str.state,
SUM(CASE WHEN sub.month_num=8 THEN sub.amt END) AS Aug_revenue, 
SUM(CASE WHEN sub.month_num=9 THEN sub.amt END) AS Sep_revenue,
COUNT(DISTINCT CASE WHEN sub.month_num=8 THEN sub.saledate END) AS Aug_days,
COUNT(DISTINCT CASE WHEN sub.month_num=9 THEN sub.saledate END) AS Sep_days,    
Aug_revenue/Aug_days AS Aug_daily_rev, 
Sep_revenue/Sep_days AS Sep_daily_rev,
(Sep_daily_rev-Aug_daily_rev) AS rev_difference
	FROM ( -- clean inner query for legal dates and purchases only
		SELECT 
		store,
		amt,
		saledate,
		EXTRACT (month FROM saledate) AS month_num, 
		EXTRACT (year FROM saledate) AS year_num,
		(CASE WHEN (year_num=2005 AND month_num=8) THEN 'cannot' ELSE 'can' END) As can_use_anot
		FROM trnsact 
		WHERE stype='p' AND can_use_anot='can'
		) AS sub
	INNER JOIN strinfo str -- to extract store's city, state
		ON str.store = sub.store
GROUP BY sub.store, str.city, str.state
HAVING Aug_days>=20 AND Sep_days>=20 -- only keep stores with more than 20 dates per month
ORDER BY rev_difference ASC

STORE	CITY	STATE	REV_DIFFERENCE
4003	WEST DES MOINES	IA	-6479.60
9103	LOUISVILLE	KY	-5233.12
2707	MCALLEN	TX	-5109.47

Question 12

Determine the month of maximum total revenue for each store. Count the number of stores whose month of maximum total revenue was in each of the twelve months.

Then determine the month of maximum average daily revenue. Count the number of stores whose month of maximum average daily revenue was in each of the twelve months. How do they compare?

I'm guessing the assignment wants us to see which month has the most number of stores hitting their maximum total revenue in, and also their highest average daily revenue in.

If the numbers don't match, it might suggest hidden trends, outliers or missing data within the set.

Things to do:

Calculate the average daily revenue for each store, for each month (for each year, but there will only be one year associated with each month)
Order the rows within a store according to average daily revenue from high to low
Assign a rank to each of the ordered rows
Retrieve all of the rows that have the rank you want
Count all of your retrieved rows

DR JANA: You can assign ranks using the ROW_NUMBER or RANK() function. Make sure you “partition” by store in your ROW_NUMBER clause. Lastly when you have confirmed that the output is reasonable, introduce a QUALIFY clause (described in the references above) into your query in order to restrict the output to rows that represent the month with the minimum average daily revenue for each store.

Starting with task (1) and (2), I'll calculate the average daily revenue for each store, by month. We can do this by recycling the query from Qn 9.

SELECT 
(CASE
WHEN sub.month_num=1 THEN 'Jan'
WHEN sub.month_num=2 THEN 'Feb'
WHEN sub.month_num=3 THEN 'Mar'
WHEN sub.month_num=4 THEN 'Apr'
WHEN sub.month_num=5 THEN 'May'
WHEN sub.month_num=6 THEN 'Jun'
WHEN sub.month_num=7 THEN 'Jul'
WHEN sub.month_num=8 THEN 'Aug'
WHEN sub.month_num=9 THEN 'Sep'
WHEN sub.month_num=10 THEN 'Oct'
WHEN sub.month_num=11 THEN 'Nov'
WHEN sub.month_num=12 THEN 'Dec'
END) as month_name,
sub.store,
SUM(sub.total_revenue)/SUM(sub.num_dates) AS avg_daily_revenue
FROM (
	SELECT 
	store,
	EXTRACT (month FROM saledate) AS month_num, 
	EXTRACT (year FROM saledate) AS year_num,
	COUNT (DISTINCT saledate) AS num_dates,
	SUM(amt) AS total_revenue,
	(CASE 
	WHEN (year_num=2005 AND month_num=8) THEN 'cannot' ELSE 'can' 
	END) As can_use_anot
	FROM trnsact
	WHERE stype='p' AND can_use_anot='can'
	GROUP BY month_num, year_num
	HAVING num_dates>=20
	) AS sub
GROUP BY month_name, sub.store
ORDER BY avg_daily_revenue DESC;

(3) Let's add the bit for RANK() and PARTITION: (a snippet)

SELECT 
(CASE
WHEN sub.month_num=1 THEN 'Jan'
	...
WHEN sub.month_num=12 THEN 'Dec'
END) as month_name,
sub.store,
SUM(sub.total_revenue) AS sum_monthly_revenue, -- TOTAL monthly rev
SUM(sub.total_revenue)/SUM(sub.num_dates) AS avg_daily_revenue, -- AVERAGE rev within month
ROW_NUMBER() OVER (PARTITION BY sub.store ORDER BY avg_daily_revenue DESC ) AS Row_sum_rev, --! 
ROW_NUMBER() OVER (PARTITION BY sub.store ORDER BY sum_monthly_revenue DESC ) AS Row_avg_rev --!
FROM (
	...
	) AS sub
GROUP BY month_name, sub.store
ORDER BY avg_daily_revenue DESC;

(4)+(5) Finally, let's retrieve all rows with top ranking month, to see which month performed best.

SELECT 
clean.month_name AS month_n, 
COUNT(CASE WHEN clean.Row_sum_rev =1 THEN clean.store END) AS Total_monthly_rev_count, -- count number of rank 1s per month
COUNT(CASE WHEN clean.Row_avg_rev =1 THEN clean.store END) AS Average_daily_rev_count -- count number of rank 1s per month
FROM (
	SELECT 
	(CASE
	WHEN sub.month_num=1 THEN 'Jan'
	WHEN sub.month_num=2 THEN 'Feb'
	WHEN sub.month_num=3 THEN 'Mar'
	WHEN sub.month_num=4 THEN 'Apr'
	WHEN sub.month_num=5 THEN 'May'
	WHEN sub.month_num=6 THEN 'Jun'
	WHEN sub.month_num=7 THEN 'Jul'
	WHEN sub.month_num=8 THEN 'Aug'
	WHEN sub.month_num=9 THEN 'Sep'
	WHEN sub.month_num=10 THEN 'Oct'
	WHEN sub.month_num=11 THEN 'Nov'
	WHEN sub.month_num=12 THEN 'Dec'
	END) as month_name,
	sub.store,
	SUM(sub.total_revenue) AS sum_monthly_revenue,
	SUM(sub.total_revenue)/SUM(sub.num_dates) AS avg_daily_revenue,
	ROW_NUMBER() OVER (PARTITION BY sub.store ORDER BY avg_daily_revenue DESC ) AS Row_sum_rev,
	ROW_NUMBER() OVER (PARTITION BY sub.store ORDER BY sum_monthly_revenue DESC ) AS Row_avg_rev
	FROM (
		SELECT 
		store,
		EXTRACT (month FROM saledate) AS month_num, 
		EXTRACT (year FROM saledate) AS year_num,
		COUNT (DISTINCT saledate) AS num_dates,
		SUM(amt) AS total_revenue,
		(CASE 
		WHEN (year_num=2005 AND month_num=8) THEN 'cannot' ELSE 'can' 
		END) As can_use_anot
		FROM trnsact
		WHERE stype='p' AND can_use_anot='can'
		GROUP BY month_num, year_num, store
		HAVING num_dates>=20
		) AS sub
	GROUP BY month_name, sub.store
	) AS clean
GROUP BY Month_n
ORDER BY Total_monthly_rev_count DESC

DR JANA: If you write your queries correctly, you will find that 8 stores have the greatest total sales in April, while only 4 stores have the greatest average daily revenue in April.

MONTH	TOTAL_MONTHLY	AVG_DAILY
Dec	317	321
Mar	4	3
Jul	3	3

While the output fits with our expectations of the data (ie. that Dec should be the most popular month), but it doesn't match Dr Jana's hint.

AFter reading the forum, I realised that official assignment seems to give the wrong hint (quite a significant mistake!). We will get the expected result if we write our queries to find the LOWEST total sales as ranked by month instead of the HIGHEST total sales, like so:

SELECT 
clean.month_name AS month_n, 
COUNT(CASE WHEN clean.Row_sum_rev =1 THEN clean.store END) AS Total_monthly_rev_count, -- change 1 to 12
COUNT(CASE WHEN clean.Row_avg_rev =1 THEN clean.store END) AS Average_daily_rev_count -- change 1 to 12
FROM (
...

MONTH	LOW_TOTAL_MONTH	LOW_AVG_DAILY
Aug	120	77
Jan	73	54
Sep	72	108
...	...	...
Apr	4	8
...	...	...
Dec	0	0

End

Thoughts on this course: Notes were messy and with quite a few significant mistakes, like that last one we saw. But overall it was a good introduction to SQL and I appreciate the resources to let us try and play it out.

Key takeaways:

Computational thinking: Learning how to split large, complex problems into smaller sets that can be reassembled later
Rigorous testing and checking of trend inconsistencies using month, year-aggregations, or standard deviations
Dealing with outliers and missing data by setting predefined criterias in subqueries
Overall syntax nuances between dialects for MySQL, Teradata
Perseverance for long queries lol

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Week5 - Dillards.md

Week5 - Dillards.md

Final Week - Dillard's Database Exercises

Answers

Question 1

Question 2

Exercise 3.

Exercise 4a.

Question 4b.

Question 5.

Question 6.

Exercise 7:

Exercise 8:

Exercise 9

Question 10

Question 11

Question 12

End

Files

Week5 - Dillards.md

Latest commit

History

Week5 - Dillards.md

File metadata and controls

Final Week - Dillard's Database Exercises

Answers

Question 1

Question 2

Exercise 3.

Exercise 4a.

Question 4b.

Question 5.

Question 6.

Exercise 7:

Exercise 8:

Exercise 9

Question 10

Question 11

Question 12

End