# SQL Project Analysis
## Redfin Data Analyst Intern

In [3]:
%load_ext sql

The sql extension is already loaded. To reload it, use:
  %reload_ext sql


In [None]:
%sql mysql://~USER~:~PASSWORD~@~AWS_HOST~/~DATABASE_NAME~

## Exploratory Queries

### 1. 
#### Many measurements for smaller counties in all states turn up NULL results since there is either no data being registered from those counties, or Redfin does not collect from them since they are not active housing markets.

#### Purpose: This query filters out the rows where total_homes_sold is NULL. I have determined total_homes_sold to be one of the main measurements in the property data, and it is generally indicative of whether other measurements have been collected as well. Additionally, this query gives a brief snapshot of some of the most active housing markets in terms of homes sold and active listings. It also provides the average median sale price for each.

#### Estimated Run Time: ~1 second

In [4]:
%%sql
SELECT 
    region_name,
    SUM(total_homes_sold) AS total_homes_sold,
    FORMAT(AVG(median_sale_price), 0) AS avg_median_sale_price,
    FORMAT(AVG(total_active_listings), 0) AS avg_total_active_listings
FROM housing_data
WHERE total_homes_sold IS NOT NULL
GROUP BY region_name
ORDER BY total_homes_sold DESC, median_sale_price;

 * mysql://admin:***@sql-project.ckos5u2mpfmb.us-east-1.rds.amazonaws.com/sql_project
235 rows affected.


region_name,total_homes_sold,avg_median_sale_price,avg_total_active_listings
"Houston, TX metro area",6312067.0,243525,152545
"Los Angeles County, CA",4907011.0,633365,102865
"Los Angeles, CA metro area",4907011.0,633365,102865
"Dallas, TX metro area",4761026.0,293357,94801
"New York, NY metro area",4674078.0,547929,208524
"Riverside, CA metro area",4097739.0,378499,90244
"Harris County, TX",3647624.0,227531,86317
"San Diego, CA metro area",2541703.0,585497,38203
"San Diego County, CA",2541703.0,585497,38203
"Austin, TX metro area",2538921.0,320710,43178


#### Discovery:
After running this query, I found that there were many NULL results for different regions. I presume this is mostly due to a lack of reporting for some smaller regions in the US, which either Redfin has not yet expanded its services to or which simply do not have organized real estate registries available for analysis.
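
To illustrate how the NULL filter and the aggregates interact, here is a minimal sketch that uses Python's built-in `sqlite3` module in place of the MySQL instance. The three regions and their values are made up for illustration, and the table is a reduced stand-in for `housing_data`.

```python
import sqlite3

# In-memory stand-in for the MySQL database; rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE housing_data ("
    "region_name TEXT, total_homes_sold REAL, median_sale_price REAL)"
)
rows = [
    ("Region A", 10.0, 200000.0),
    ("Region A", 30.0, 300000.0),
    ("Region B", None, None),       # no data reported at all
    ("Region B", None, 150000.0),   # price present, sales missing
    ("Region C", 5.0, 100000.0),
]
conn.executemany("INSERT INTO housing_data VALUES (?, ?, ?)", rows)

# Same shape as the exploratory query: drop NULL sales, then aggregate.
result = conn.execute(
    """
    SELECT region_name,
           SUM(total_homes_sold) AS homes_sold,
           AVG(median_sale_price) AS avg_price
    FROM housing_data
    WHERE total_homes_sold IS NOT NULL
    GROUP BY region_name
    ORDER BY homes_sold DESC
    """
).fetchall()
print(result)
```

Region B disappears from the result because every one of its rows has a NULL `total_homes_sold`, which is the behaviour the filter relies on.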

### 2.
#### Connect each county with its corresponding 2019 median household income.

#### Purpose: Understand which counties are wealthier on average, which will most likely be reflected in property prices.

#### Estimated Run Time: ~53 seconds

In [8]:
%%sql
SELECT 
    hd.region_name,
    chi.countyName,
    chi.medianHouseholdIncome 
FROM housing_data hd 
JOIN countyHouseholdIncome chi 
    ON hd.region_name LIKE chi.countyName
GROUP BY region_name
LIMIT 100;

 * mysql://admin:***@sql-project.ckos5u2mpfmb.us-east-1.rds.amazonaws.com/sql_project
100 rows affected.


region_name,countyName,medianHouseholdIncome
"Alameda County, CA","Alameda County, CA",107589
"Amador County, CA","Amador County, CA",62640
"Butte County, CA","Butte County, CA",58394
"Calaveras County, CA","Calaveras County, CA",68248
"Contra Costa County, CA","Contra Costa County, CA",106555
"El Dorado County, CA","El Dorado County, CA",86202
"Fresno County, CA","Fresno County, CA",56926
"Glenn County, CA","Glenn County, CA",55682
"Kern County, CA","Kern County, CA",53245
"Lake County, CA","Lake County, CA",46897


#### Discovery:
After running the above query, I confirmed that the housing_data VIEW and the countyHouseholdIncome table were properly connected based on their shared region names, which are generally the names of specific US counties.

### 3.
#### List out all of the weekly periods for specific counties. In the housing_data table, each county has many period ranges, from weekly, to monthly, to quarterly. For the purposes of my queries, I will just be utilizing the weekly data.

#### Purpose: Understand the time structure of the housing_data table. Each county has 52 weeks' worth of data in each year from 2017 onward.

#### Estimated Run Time: ~1 second

In [9]:
%%sql
SELECT 
    region_name,
    period_begin,
    period_end,
    DATEDIFF(period_end, period_begin) AS period_length,
    total_homes_sold,
    median_sale_price
FROM housing_data hd 
WHERE YEAR(period_begin) = '2017'
    AND DATEDIFF(period_end, period_begin) = 6
    AND region_name = 'Los Angeles County, CA'
ORDER BY period_begin;

 * mysql://admin:***@sql-project.ckos5u2mpfmb.us-east-1.rds.amazonaws.com/sql_project
52 rows affected.


region_name,period_begin,period_end,period_length,total_homes_sold,median_sale_price
"Los Angeles County, CA",2017-01-02,2017-01-08,6,958.0,543750.0
"Los Angeles County, CA",2017-01-09,2017-01-15,6,1146.0,560000.0
"Los Angeles County, CA",2017-01-16,2017-01-22,6,1037.0,550000.0
"Los Angeles County, CA",2017-01-23,2017-01-29,6,1140.0,521000.0
"Los Angeles County, CA",2017-01-30,2017-02-05,6,1231.0,525000.0
"Los Angeles County, CA",2017-02-06,2017-02-12,6,924.0,528750.0
"Los Angeles County, CA",2017-02-13,2017-02-19,6,1220.0,547500.0
"Los Angeles County, CA",2017-02-20,2017-02-26,6,1089.0,545000.0
"Los Angeles County, CA",2017-02-27,2017-03-05,6,1450.0,547450.0
"Los Angeles County, CA",2017-03-06,2017-03-12,6,1281.0,563000.0


#### Discovery:
The above query revealed that each region had data for each of the 52 weeks of a year, which reflects how often the public dataset is modified and published by Redfin. Restricting to the weekly periods gave me a consistent basis for the rest of my yearly calculations, since the weekly, monthly, and quarterly periods overlap in the dates they cover.
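
The weekly filter `DATEDIFF(period_end, period_begin) = 6` can be checked by hand with a small Python sketch. The helper name `is_weekly_period` is hypothetical, and the monthly period at the end is an invented example of a row the filter would exclude.

```python
from datetime import date, timedelta


def is_weekly_period(period_begin: date, period_end: date) -> bool:
    """Mirror DATEDIFF(period_end, period_begin) = 6 (hypothetical helper)."""
    return (period_end - period_begin).days == 6


# First two weekly rows from the Los Angeles County output above.
week_1 = (date(2017, 1, 2), date(2017, 1, 8))
week_2 = (date(2017, 1, 9), date(2017, 1, 15))

# Invented monthly period, which should not pass the weekly filter.
month_1 = (date(2017, 1, 1), date(2017, 1, 31))

print(is_weekly_period(*week_1))   # True
print(is_weekly_period(*month_1))  # False

# Consecutive weekly rows do not overlap: week 2 starts the day after week 1 ends.
print(week_2[0] - week_1[1] == timedelta(days=1))  # True
```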

### 4.
#### DISPLAY total property sales each year and calculate year-over-year sales changes.

#### Purpose: Understand which housing markets have experienced the highest growth since 2017.

#### Estimated Run Time: ~16 seconds

In [10]:
%%sql
SELECT 
    DISTINCT(hd.region_name),
    hst.home_sale_year_total_2017,
    hst2.home_sale_year_total_2018,
    hst3.home_sale_year_total_2019,
    hst4.home_sale_year_total_2020,
    (
        (
            (hst2.home_sale_year_total_2018 / hst.home_sale_year_total_2017) +
            (hst3.home_sale_year_total_2019 / hst2.home_sale_year_total_2018) +
            (hst4.home_sale_year_total_2020 / hst3.home_sale_year_total_2019)
        ) / 3
    ) AS avg_yoy_sales_change
FROM housing_data hd
JOIN home_sale_totals_2017 hst
    ON hd.region_name = hst.region_name
JOIN home_sale_totals_2018 hst2 
    ON hd.region_name = hst2.region_name 
JOIN home_sale_totals_2019 hst3 
    ON hd.region_name = hst3.region_name 
JOIN home_sale_totals_2020 hst4 
    ON hd.region_name = hst4.region_name
WHERE hst.home_sale_year_total_2017 > 1000
ORDER BY avg_yoy_sales_change DESC;

 * mysql://admin:***@sql-project.ckos5u2mpfmb.us-east-1.rds.amazonaws.com/sql_project
128 rows affected.


region_name,home_sale_year_total_2017,home_sale_year_total_2018,home_sale_year_total_2019,home_sale_year_total_2020,avg_yoy_sales_change
"Comal County, TX",2633.0,3042.0,3443.0,4056.0,1.155066564552729
"Kaufman County, TX",2493.0,2731.0,2848.0,3549.0,1.1281487996438149
"Hays County, TX",3413.0,3759.0,4005.0,4731.0,1.116031144265743
"Hunt County, TX",1111.0,1127.0,1267.0,1512.0,1.1106650694581144
"El Paso, TX metro area",7211.0,8012.0,8710.0,9848.0,1.1096180117401386
"El Paso County, TX",7186.0,7948.0,8660.0,9756.0,1.1073935658659668
"Johnson County, TX",2414.0,2629.0,2461.0,3119.0,1.0975107191211038
"Montgomery County, TX",10215.0,10634.0,11331.0,13326.0,1.0942094170562402
"Brazoria County, TX",4410.0,4793.0,5041.0,5754.0,1.0933434623104985
"McAllen, TX metro area",2983.0,3033.0,3542.0,3870.0,1.0923951127006042


#### Discovery:
After running this query, I was able to see which regions had the greatest increases in total homes sold over the 2017-2020 period. I was surprised to see that many of the metro regions I initially expected to be on there, like Los Angeles, San Francisco, and New York, were not near the top of the list.
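
As a worked example of the `avg_yoy_sales_change` column, the sketch below recomputes the value for Comal County, TX from the yearly totals in the output above. The function name `avg_yoy_change` is hypothetical. It takes the arithmetic mean of the three year-over-year ratios, matching the expression in the query.

```python
def avg_yoy_change(yearly_totals):
    """Arithmetic mean of consecutive year-over-year ratios (hypothetical helper)."""
    ratios = [
        later / earlier
        for earlier, later in zip(yearly_totals, yearly_totals[1:])
    ]
    return sum(ratios) / len(ratios)


# Comal County, TX home sale totals for 2017, 2018, 2019, 2020 (from the output above).
comal = [2633.0, 3042.0, 3443.0, 4056.0]
result = avg_yoy_change(comal)
print(round(result, 4))  # 1.1551
```

A value above 1 means sales grew on average from one year to the next. A value of about 1.155 corresponds to roughly 15.5% average yearly growth.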

### 5.
#### This query compares the yearly average home sale prices across each county to see which markets have been increasing in price steadily over the past 4 years. 

#### Purpose: To provide a suitable and easily-modifiable subquery to search through for question #1, when attempting to collect data on the most and least successful housing markets over the 4-year period.

#### Estimated Run Time: ~1 second each

VIEW for counties with the highest sale price change ratio over the 4-year period

In [11]:
%%sql
CREATE OR REPLACE VIEW yoy_sale_price_increase AS 
SELECT 
    hd.region_name AS region_name,
    FORMAT(hst.avg_sale_price_2017, 0) AS avg_sale_price_2017,
    FORMAT(hst2.avg_sale_price_2018, 0) AS avg_sale_price_2018,
    FORMAT(hst3.avg_sale_price_2019, 0) AS avg_sale_price_2019,
    FORMAT(hst4.avg_sale_price_2020, 0) AS avg_sale_price_2020,
    (
        (
            (hst2.avg_sale_price_2018 / hst.avg_sale_price_2017) +
            (hst3.avg_sale_price_2019 / hst2.avg_sale_price_2018) +
            (hst4.avg_sale_price_2020 / hst3.avg_sale_price_2019)
        ) / 3
    ) AS yoy_sale_price_change
FROM housing_data hd
LEFT JOIN home_sale_totals_2017 hst
    ON hd.region_name = hst.region_name
LEFT JOIN home_sale_totals_2018 hst2 
    ON hd.region_name = hst2.region_name 
LEFT JOIN home_sale_totals_2019 hst3 
    ON hd.region_name = hst3.region_name 
LEFT JOIN home_sale_totals_2020 hst4 
    ON hd.region_name = hst4.region_name
WHERE 
    (
        (
            (hst2.avg_sale_price_2018 / hst.avg_sale_price_2017) +
            (hst3.avg_sale_price_2019 / hst2.avg_sale_price_2018) +
            (hst4.avg_sale_price_2020 / hst3.avg_sale_price_2019)
        ) / 3
    ) > 1
GROUP BY hd.region_name
ORDER BY yoy_sale_price_change DESC
LIMIT 50;

 * mysql://admin:***@sql-project.ckos5u2mpfmb.us-east-1.rds.amazonaws.com/sql_project
0 rows affected.


[]

VIEW for counties with the lowest sale price change ratio in the 4-year period

In [12]:
%%sql
CREATE OR REPLACE VIEW yoy_sales_price_decrease AS 
SELECT 
    hd.region_name AS region_name,
    FORMAT(hst.avg_sale_price_2017, 0) AS avg_sale_price_2017,
    FORMAT(hst2.avg_sale_price_2018, 0) AS avg_sale_price_2018,
    FORMAT(hst3.avg_sale_price_2019, 0) AS avg_sale_price_2019,
    FORMAT(hst4.avg_sale_price_2020, 0) AS avg_sale_price_2020,
    (
        (
            (hst2.avg_sale_price_2018 / hst.avg_sale_price_2017) +
            (hst3.avg_sale_price_2019 / hst2.avg_sale_price_2018) +
            (hst4.avg_sale_price_2020 / hst3.avg_sale_price_2019)
        ) / 3
    ) AS yoy_sale_price_change
FROM housing_data hd
LEFT JOIN home_sale_totals_2017 hst
    ON hd.region_name = hst.region_name
LEFT JOIN home_sale_totals_2018 hst2 
    ON hd.region_name = hst2.region_name 
LEFT JOIN home_sale_totals_2019 hst3 
    ON hd.region_name = hst3.region_name 
LEFT JOIN home_sale_totals_2020 hst4 
    ON hd.region_name = hst4.region_name
WHERE 
    (
        (
            (hst2.avg_sale_price_2018 / hst.avg_sale_price_2017) +
            (hst3.avg_sale_price_2019 / hst2.avg_sale_price_2018) +
            (hst4.avg_sale_price_2020 / hst3.avg_sale_price_2019)
        ) / 3
    ) < 1
GROUP BY hd.region_name
ORDER BY yoy_sale_price_change DESC
LIMIT 50;

 * mysql://admin:***@sql-project.ckos5u2mpfmb.us-east-1.rds.amazonaws.com/sql_project
0 rows affected.


[]

VIEW displaying sale price changes for all counties no matter if they have decreased or increased

In [13]:
%%sql
CREATE OR REPLACE VIEW yoy_sales_price_change AS 
SELECT 
    hd.region_name AS region_name,
    FORMAT(hst.avg_sale_price_2017, 0) AS avg_sale_price_2017,
    FORMAT(hst2.avg_sale_price_2018, 0) AS avg_sale_price_2018,
    FORMAT(hst3.avg_sale_price_2019, 0) AS avg_sale_price_2019,
    FORMAT(hst4.avg_sale_price_2020, 0) AS avg_sale_price_2020,
    (
        (
            (hst2.avg_sale_price_2018 / hst.avg_sale_price_2017) +
            (hst3.avg_sale_price_2019 / hst2.avg_sale_price_2018) +
            (hst4.avg_sale_price_2020 / hst3.avg_sale_price_2019)
        ) / 3
    ) AS yoy_sale_price_change
FROM housing_data hd
JOIN home_sale_totals_2017 hst
    ON hd.region_name = hst.region_name
JOIN home_sale_totals_2018 hst2 
    ON hd.region_name = hst2.region_name 
JOIN home_sale_totals_2019 hst3 
    ON hd.region_name = hst3.region_name 
JOIN home_sale_totals_2020 hst4 
    ON hd.region_name = hst4.region_name
GROUP BY hd.region_name
ORDER BY yoy_sale_price_change DESC;

 * mysql://admin:***@sql-project.ckos5u2mpfmb.us-east-1.rds.amazonaws.com/sql_project
0 rows affected.


[]

#### Discovery:
These views were useful to create because they helped me formulate the queries for my main questions surrounding year-over-year home sale price and total homes sold data.

### 6. 
#### CREATE VIEWS to organize data into yearly subsections for property sales and median property prices in each county

#### Estimated Run Time: ~1 second each

In [14]:
%%sql
CREATE OR REPLACE VIEW home_sale_totals_2017 AS
SELECT 
    DISTINCT(region_name),
    SUM(total_homes_sold) OVER(PARTITION BY region_name) AS home_sale_year_total_2017,
    AVG(median_sale_price) OVER(PARTITION BY region_name) AS avg_sale_price_2017
FROM housing_data hd 
WHERE YEAR(period_begin) = '2017'
    AND DATEDIFF(period_end, period_begin) = 6
ORDER BY home_sale_year_total_2017 DESC, region_name; 

 * mysql://admin:***@sql-project.ckos5u2mpfmb.us-east-1.rds.amazonaws.com/sql_project
0 rows affected.


[]

In [15]:
%%sql
CREATE OR REPLACE VIEW home_sale_totals_2018 AS
SELECT 
    DISTINCT(region_name),
    SUM(total_homes_sold) OVER(PARTITION BY region_name) AS home_sale_year_total_2018,
    AVG(median_sale_price) OVER(PARTITION BY region_name) AS avg_sale_price_2018
FROM housing_data hd 
WHERE YEAR(period_begin) = '2018'
    AND DATEDIFF(period_end, period_begin) = 6
ORDER BY home_sale_year_total_2018 DESC, region_name;

 * mysql://admin:***@sql-project.ckos5u2mpfmb.us-east-1.rds.amazonaws.com/sql_project
0 rows affected.


[]

In [16]:
%%sql
CREATE OR REPLACE VIEW home_sale_totals_2019 AS
SELECT 
    DISTINCT(region_name),
    SUM(total_homes_sold) OVER(PARTITION BY region_name) AS home_sale_year_total_2019,
    AVG(median_sale_price) OVER(PARTITION BY region_name) AS avg_sale_price_2019
FROM housing_data hd 
WHERE YEAR(period_begin) = '2019'
    AND DATEDIFF(period_end, period_begin) = 6
ORDER BY home_sale_year_total_2019 DESC, region_name;

 * mysql://admin:***@sql-project.ckos5u2mpfmb.us-east-1.rds.amazonaws.com/sql_project
0 rows affected.


[]

In [17]:
%%sql
CREATE OR REPLACE VIEW home_sale_totals_2020 AS
SELECT 
    DISTINCT(region_name),
    SUM(total_homes_sold) OVER(PARTITION BY region_name) AS home_sale_year_total_2020,
    AVG(median_sale_price) OVER(PARTITION BY region_name) AS avg_sale_price_2020
FROM housing_data hd 
WHERE YEAR(period_begin) = '2020'
    AND DATEDIFF(period_end, period_begin) = 6
ORDER BY home_sale_year_total_2020 DESC, region_name;

 * mysql://admin:***@sql-project.ckos5u2mpfmb.us-east-1.rds.amazonaws.com/sql_project
0 rows affected.


[]

In [18]:
%%sql
CREATE OR REPLACE VIEW home_sale_totals_2021 AS
SELECT 
    DISTINCT(region_name),
    SUM(total_homes_sold) OVER(PARTITION BY region_name) AS home_sale_year_total_2021,
    AVG(median_sale_price) OVER(PARTITION BY region_name) AS avg_sale_price_2021
FROM housing_data hd 
WHERE YEAR(period_begin) = '2021'
    AND DATEDIFF(period_end, period_begin) = 6
ORDER BY home_sale_year_total_2021 DESC, region_name;

 * mysql://admin:***@sql-project.ckos5u2mpfmb.us-east-1.rds.amazonaws.com/sql_project
0 rows affected.


[]

#### Discovery:
These VIEWS were useful to collect yearly data on all of the counties based off of the 52-week-long periods recorded for each.
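
Since the five yearly views differ only in the year, the statements could also be generated from one template. This is a sketch of that idea, not how the views above were actually created. `VIEW_TEMPLATE` and `build_yearly_view_sql` are hypothetical names, and the generated SQL would still have to be run against the database separately.

```python
# Template copied from the view definitions above, with the year parameterised.
VIEW_TEMPLATE = """CREATE OR REPLACE VIEW home_sale_totals_{year} AS
SELECT
    DISTINCT(region_name),
    SUM(total_homes_sold) OVER(PARTITION BY region_name) AS home_sale_year_total_{year},
    AVG(median_sale_price) OVER(PARTITION BY region_name) AS avg_sale_price_{year}
FROM housing_data hd
WHERE YEAR(period_begin) = '{year}'
    AND DATEDIFF(period_end, period_begin) = 6
ORDER BY home_sale_year_total_{year} DESC, region_name;"""


def build_yearly_view_sql(year: int) -> str:
    """Return the CREATE VIEW statement for one year (hypothetical helper)."""
    return VIEW_TEMPLATE.format(year=year)


# One statement per year, 2017 through 2021.
statements = [build_yearly_view_sql(year) for year in range(2017, 2022)]
print(statements[0])
```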

## Primary Question

### Based on housing data collected between 2017 and 2020, how have the housing markets in county/metro regions in California, New York, and Texas performed and which regions provide the best potential investments for residential property buyers?

## Two Sub Questions

### 1.A: Which housing markets have seen the biggest boom out of the 3 studied between 2017-2020?
### 1.B: Which housing markets have seen the largest bust out of the 3 studied between 2017-2020?

### 2: How do home prices compare to the (2019) median household income in each county in 2020? Which counties have the lowest home price to household income ratio in 2020? 
______________________________________________________________________________________________________________________

### 1.A: Which housing markets have seen the biggest boom out of the 3 studied between 2017-2020?
#### Business Justification:
It is important for Redfin analysts to understand which counties/regions in the US have experienced the greatest housing booms over the past 4 years so that they can adjust their range of listings accordingly. By identifying counties with high home sales and healthily increasing prices, Redfin can cater more to the areas where home buying is occurring the most.

#### Estimated Run Time: ~1m 49s

In [19]:
%%sql
WITH cte_yoy_sales_data AS (
    SELECT 
        DISTINCT(hd.region_name),
        hst.home_sale_year_total_2017,
        hst2.home_sale_year_total_2018,
        hst3.home_sale_year_total_2019,
        hst4.home_sale_year_total_2020,
        (
            (
                (hst2.home_sale_year_total_2018 / hst.home_sale_year_total_2017) +
                (hst3.home_sale_year_total_2019 / hst2.home_sale_year_total_2018) +
                (hst4.home_sale_year_total_2020 / hst3.home_sale_year_total_2019)
            ) / 3
        ) AS avg_yoy_sales_change
    FROM housing_data hd
    JOIN home_sale_totals_2017 hst
        ON hd.region_name = hst.region_name
    JOIN home_sale_totals_2018 hst2 
        ON hd.region_name = hst2.region_name 
    JOIN home_sale_totals_2019 hst3 
        ON hd.region_name = hst3.region_name 
    JOIN home_sale_totals_2020 hst4 
        ON hd.region_name = hst4.region_name
    WHERE hst.home_sale_year_total_2017 > 1000
    ORDER BY avg_yoy_sales_change DESC
)
SELECT 
    cte.region_name, 
    FORMAT(cte.avg_yoy_sales_change, 2) AS avg_yoy_home_sales_change,
    FORMAT(yspi.yoy_sale_price_change, 2) AS avg_yoy_sale_price_change 
FROM cte_yoy_sales_data cte
JOIN yoy_sale_price_increase yspi
    ON cte.region_name = yspi.region_name 
WHERE cte.region_name IN 
    (
        SELECT 
            region_name
        FROM yoy_sale_price_increase yspi 
    )
    AND cte.avg_yoy_sales_change > 1
ORDER BY cte.avg_yoy_sales_change DESC
LIMIT 10;

 * mysql://admin:***@sql-project.ckos5u2mpfmb.us-east-1.rds.amazonaws.com/sql_project
6 rows affected.


region_name,avg_yoy_home_sales_change,avg_yoy_sale_price_change
"Chautauqua County, NY",1.06,1.1
"Jamestown, NY metro area",1.06,1.1
"Madera, CA metro area",1.04,1.09
"Madera County, CA",1.04,1.09
"Poughkeepsie, NY metro area",1.03,1.09
"Orange County, NY",1.01,1.09


#### Recommendation:
In the query above, I have collected the following data:
- region name
- average year-over-year total home sales change
- average year-over-year sale price change

As seen in the above results, contrary to the popular belief that housing booms take place in largely urban and metropolitan areas, the strongest growth in housing markets across California, New York, and Texas is actually taking place in more rural areas, such as Madera County in California and Chautauqua County in New York. While metropolitan regions may boast the highest home sales and highest property prices, it is these areas that are growing at the healthiest rate. Because of this, I recommend that Redfin continue pushing these smaller but popular and healthy regions to consumers.

### 1.B: Which housing markets have seen the largest bust out of the 3 studied between 2017-2020?
#### Business Justification:
As much as it is important to understand where the housing market has been the most successful, it is equally important to understand where it has suffered the most. Redfin analysts must recognize which regions are struggling or stagnating in the current California, Texas, and New York housing markets.

#### Estimated Run Time: ~1m 45s

In [21]:
%%sql
WITH cte_yoy_sales_data AS (
    SELECT 
        DISTINCT(hd.region_name),
        hst.home_sale_year_total_2017,
        hst2.home_sale_year_total_2018,
        hst3.home_sale_year_total_2019,
        hst4.home_sale_year_total_2020,
        (
            (
                (hst2.home_sale_year_total_2018 / hst.home_sale_year_total_2017) +
                (hst3.home_sale_year_total_2019 / hst2.home_sale_year_total_2018) +
                (hst4.home_sale_year_total_2020 / hst3.home_sale_year_total_2019)
            ) / 3
        ) AS avg_yoy_sales_change
    FROM housing_data hd
    JOIN home_sale_totals_2017 hst
        ON hd.region_name = hst.region_name
    JOIN home_sale_totals_2018 hst2 
        ON hd.region_name = hst2.region_name 
    JOIN home_sale_totals_2019 hst3 
        ON hd.region_name = hst3.region_name 
    JOIN home_sale_totals_2020 hst4 
        ON hd.region_name = hst4.region_name
    WHERE hst.home_sale_year_total_2017 > 1000
    ORDER BY avg_yoy_sales_change DESC
)
SELECT 
    cte.region_name, 
    FORMAT(cte.avg_yoy_sales_change, 2) AS avg_yoy_home_sales_change,
    FORMAT(yspd.yoy_sale_price_change, 2) AS avg_yoy_sale_price_change
FROM cte_yoy_sales_data cte
JOIN yoy_sales_price_decrease yspd
    ON cte.region_name = yspd.region_name 
WHERE cte.region_name IN 
    (
        SELECT 
            yspd.region_name
        FROM yoy_sales_price_decrease yspd 
    )
    AND cte.avg_yoy_sales_change < 1
ORDER BY cte.avg_yoy_sales_change
LIMIT 10;

 * mysql://admin:***@sql-project.ckos5u2mpfmb.us-east-1.rds.amazonaws.com/sql_project
1 rows affected.


region_name,avg_yoy_home_sales_change,avg_yoy_sale_price_change
"New York County, NY",0.87,0.97


#### Recommendation:
In the query above, I have collected the following data:
- region name
- average year-over-year total home sales change
- average year-over-year sale price change

Based on my query, and relating it to the results in Question 1.A, it is New York County (Manhattan), the core of New York City, a city of roughly 8.4 million people, that has been stagnating the most over the 4-year period. The bustling metropolis is known for its sky-high home prices, making it inaccessible to most and a symbol of success and status to some. However, while the data shows that New York County's housing market struggled to stay healthy from 2017 to 2020, it has been reported that the market has been bouncing back over the past few months as the COVID-19 pandemic nears its end, allowing for greater home sales and reducing prices to more affordable levels. Because of this, I recommend that Redfin continue to direct consumers to smaller regions with healthier growth potential outside of metropolitan areas such as New York City.
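
The boom and bust filters used in 1.A and 1.B come down to whether both average year-over-year ratios sit above or below 1. The sketch below restates that rule in Python. The function name `classify_market` and the "mixed" label are hypothetical additions. It also ignores the 1,000-sale threshold and the 50-row limit of the views that the SQL applies.

```python
def classify_market(sales_change: float, price_change: float) -> str:
    """Label a region by its two average year-over-year ratios (hypothetical helper)."""
    if sales_change > 1 and price_change > 1:
        return "boom"   # both sales volume and prices grew on average (1.A)
    if sales_change < 1 and price_change < 1:
        return "bust"   # both sales volume and prices shrank on average (1.B)
    return "mixed"      # label not used in the queries; added for completeness


# Values taken from the 1.A and 1.B outputs above.
print(classify_market(1.06, 1.10))  # Chautauqua County, NY -> boom
print(classify_market(0.87, 0.97))  # New York County, NY -> bust
```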

### 2. How do home prices compare to the (2019) median household income in each county in 2020? Which counties have the lowest home price to household income ratio in 2020?

#### Business Justification:
For Redfin, it is important to understand the financial risks homebuyers are taking when purchasing property in different housing markets. By calculating the home price to income ratio across different counties, Redfin is able to get a better picture of who their potential customers are and what kinds of properties they are looking for cost-wise. Additionally, they can see which housing markets require buyers to pay significantly above their financial weight, which can be financially difficult and risky, considering buyers generally apply for mortgages from banks.

#### Estimated Run Time: ~1 second

In [22]:
%%sql
SELECT 
    hst.region_name, 
    FORMAT(avg_sale_price_2020, 0) AS avg_sale_price_2020,
    chi.medianHouseholdIncome AS median_household_income_2019,
    CAST( 
        (avg_sale_price_2020 / chi.medianHouseholdIncome) AS decimal(12,2)
    ) AS price_to_income_ratio,
    CASE 
        WHEN (avg_sale_price_2020 / chi.medianHouseholdIncome) < 2.5 THEN 'Affordable'
        WHEN (avg_sale_price_2020 / chi.medianHouseholdIncome) BETWEEN 2.5 AND 5.0 THEN 'Expensive'
        WHEN (avg_sale_price_2020 / chi.medianHouseholdIncome) > 5.0 THEN 'Incredibly Expensive'
    END AS property_affordability
FROM home_sale_totals_2020 hst
JOIN countyHouseholdIncome chi 
    ON hst.region_name LIKE chi.countyName
ORDER BY price_to_income_ratio;

 * mysql://admin:***@sql-project.ckos5u2mpfmb.us-east-1.rds.amazonaws.com/sql_project
165 rows affected.


region_name,avg_sale_price_2020,median_household_income_2019,price_to_income_ratio,property_affordability
"Allegany County, NY",85302,49411,1.73,Affordable
"Refugio County, TX",86500,46883,1.85,Affordable
"Fisher County, TX",101285,51053,1.98,Affordable
"Wyoming County, NY",118417,58214,2.03,Affordable
"Haskell County, TX",78892,38613,2.04,Affordable
"Oswego County, NY",123037,57640,2.13,Affordable
"Chautauqua County, NY",107117,50143,2.14,Affordable
"Throckmorton County, TX",91500,42746,2.14,Affordable
"Genesee County, NY",134758,62570,2.15,Affordable
"Orleans County, NY",118904,53115,2.24,Affordable


#### Recommendation:
For this query, I have collected the following data for county/metro regions for the year 2020:
- region name
- 2020 average median sale price
- 2019 median household income
- price to income ratio
- property affordability rating

Based on the above results, I believe that Redfin should expand their offerings to buyers by providing information and guidance on how to apply for loans/mortgages for buyers searching in markets where the price to income ratio is far above what is considered affordable. By doing so, Redfin buyers will trust the site more to guide their house buying decision since Redfin actively provides different options on their website catered towards the individual financing needs of each customer.
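
For reference, the price-to-income ratio and the affordability labels from the CASE expression can be reproduced with a short Python sketch. The function names are hypothetical and the thresholds are copied from the query. The sale price used here is the rounded figure from the output, so the unrounded value in the database may differ slightly.

```python
def price_to_income_ratio(avg_sale_price: float, median_income: float) -> float:
    """Average 2020 sale price divided by 2019 median household income."""
    return avg_sale_price / median_income


def affordability(ratio: float) -> str:
    """Same thresholds as the CASE expression in the query (hypothetical helper)."""
    if ratio < 2.5:
        return "Affordable"
    if ratio <= 5.0:  # BETWEEN 2.5 AND 5.0 is inclusive on both ends
        return "Expensive"
    return "Incredibly Expensive"


# Allegany County, NY, first row of the output above.
ratio = price_to_income_ratio(85302, 49411)
print(round(ratio, 2), affordability(ratio))  # 1.73 Affordable
```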

### Primary Question
#### Based on housing data collected between 2017 and 2020, how have the housing markets in county/metro regions in California, New York, and Texas performed and which regions provide the best potential investments for residential property buyers?

#### Business Justification:
As an analyst at Redfin, it is important to understand where home buyers have the greatest opportunities to find ample, healthy, and potentially lucrative home investments in the long term. By analyzing the housing market from the perspective of a buyer, Redfin analysts can optimally adjust their listings and the geographic regions they serve by either increasing their marketing to those regions or investing more in outreach to local real estate agencies. Similarly, by providing reliable data insights about the financial health of a property investment as well as its financial consequences, good or bad, Redfin can strengthen trust with consumers so that buyers will be more inclined to use Redfin's service and buy through the site, thereby increasing revenues all around.

#### Estimated Run Time: ~1m 9s

In [23]:
%%sql
WITH cte_yoy_sales_data AS (
    SELECT 
        DISTINCT(hd.region_name),
        (
            (
                (hst2.home_sale_year_total_2018 / hst.home_sale_year_total_2017) +
                (hst3.home_sale_year_total_2019 / hst2.home_sale_year_total_2018) +
                (hst4.home_sale_year_total_2020 / hst3.home_sale_year_total_2019)
            ) / 3
        ) AS avg_yoy_sales_change
    FROM housing_data hd
    JOIN home_sale_totals_2017 hst
        ON hd.region_name = hst.region_name
    JOIN home_sale_totals_2018 hst2 
        ON hd.region_name = hst2.region_name 
    JOIN home_sale_totals_2019 hst3 
        ON hd.region_name = hst3.region_name 
    JOIN home_sale_totals_2020 hst4 
        ON hd.region_name = hst4.region_name
    WHERE hst.home_sale_year_total_2017 > 1000
    ORDER BY avg_yoy_sales_change DESC
)
SELECT 
    hd.region_name,
    AVG(median_sale_price) OVER(PARTITION BY region_name) AS avg_median_sale_price,
    chi.medianHouseholdIncome AS median_household_income,
    CAST( 
        ( AVG(median_sale_price) OVER(PARTITION BY region_name)/ chi.medianHouseholdIncome) AS decimal(12,2) 
    ) AS price_to_income_ratio,
    CASE 
        WHEN 
            CAST( 
            ( AVG(median_sale_price) OVER(PARTITION BY region_name)/ chi.medianHouseholdIncome) AS decimal(12,2)
            ) < 2.5 
            AND CAST(cte.avg_yoy_sales_change AS decimal (12,2)) > 1
            AND CAST(yspc.yoy_sale_price_change AS decimal (12,2)) > 1 
        THEN 'Ideal to buy'
        WHEN 
            CAST( 
            ( AVG(median_sale_price) OVER(PARTITION BY region_name)/ chi.medianHouseholdIncome) AS decimal(12,2)
            ) BETWEEN 2.5 AND 5
            AND CAST(cte.avg_yoy_sales_change AS decimal (12,2)) > 1
            AND CAST(yspc.yoy_sale_price_change AS decimal (12,2)) > 1
        THEN 'Moderate risk; good future gains'
        WHEN 
            CAST( 
            ( AVG(median_sale_price) OVER(PARTITION BY region_name)/ chi.medianHouseholdIncome) AS decimal(12,2)
            ) > 5
            AND CAST(cte.avg_yoy_sales_change AS decimal (12,2)) > 1
            AND CAST(yspc.yoy_sale_price_change AS decimal (12,2)) > 1
        THEN 'high risk; potential future gains'
        WHEN 
            CAST( 
            ( AVG(median_sale_price) OVER(PARTITION BY region_name)/ chi.medianHouseholdIncome) AS decimal(12,2)
            ) > 5
            -- AND binds tighter than OR, so the OR branch is grouped explicitly
            AND (CAST(cte.avg_yoy_sales_change AS decimal (12,2)) < 1
                OR CAST(yspc.yoy_sale_price_change AS decimal (12,2)) < 1)
        THEN 'Extreme risk; not recommended'
        WHEN 
            CAST( 
            ( AVG(median_sale_price) OVER(PARTITION BY region_name)/ chi.medianHouseholdIncome) AS decimal(12,2)
            ) < 2.5
            -- Both mixed-signal cases apply only within the < 2.5 bracket
            AND (
                (CAST(cte.avg_yoy_sales_change AS decimal (12,2)) < 1
                    AND CAST(yspc.yoy_sale_price_change AS decimal (12,2)) > 1)
                OR (CAST(cte.avg_yoy_sales_change AS decimal (12,2)) > 1
                    AND CAST(yspc.yoy_sale_price_change AS decimal (12,2)) < 1)
            )
        THEN 'Little risk; good buy'
        ELSE 'Meh; not recommended'
    END AS homebuying_rating,
    CAST(cte.avg_yoy_sales_change AS decimal (12,2)) AS avg_home_sales_change_yoy,
    CAST(yspc.yoy_sale_price_change AS decimal (12,2)) AS avg_sale_price_change_yoy
FROM housing_data hd 
JOIN yoy_sales_price_change yspc
    ON hd.region_name = yspc.region_name 
JOIN cte_yoy_sales_data cte
ON hd.region_name = cte.region_name
JOIN countyHouseholdIncome chi
    ON hd.region_name LIKE chi.countyName 
GROUP BY region_name
ORDER BY 
    price_to_income_ratio, avg_home_sales_change_yoy DESC, 
    avg_sale_price_change_yoy DESC;

 * mysql://admin:***@sql-project.ckos5u2mpfmb.us-east-1.rds.amazonaws.com/sql_project
82 rows affected.


region_name,avg_median_sale_price,median_household_income,price_to_income_ratio,homebuying_rating,avg_home_sales_change_yoy,avg_sale_price_change_yoy
"Chautauqua County, NY",80900.0,50143,1.61,Ideal to buy,1.06,1.1
"Oswego County, NY",104500.0,57640,1.81,Little risk; good buy,0.98,1.08
"Niagara County, NY",104950.0,56371,1.86,Little risk; good buy,0.99,1.08
"Monroe County, NY",125000.0,62159,2.01,Little risk; good buy,0.98,1.07
"Wayne County, NY",140000.0,61989,2.26,Little risk; good buy,0.97,1.07
"Rensselaer County, NY",163187.9,70688,2.31,Ideal to buy,1.01,1.07
"Onondaga County, NY",146500.0,61597,2.38,Little risk; good buy,0.96,1.06
"Ontario County, NY",168862.5,66754,2.53,Meh; not recommended,1.0,1.08
"Rockwall County, TX",285950.0,105763,2.7,Moderate risk; good future gains,1.08,1.04
"Erie County, NY",166000.0,60620,2.74,Meh; not recommended,0.98,1.08


#### Recommendation:
In the above query, I have collected a wide variety of useful measurements with the goal of determining a "Redfin rating" of sorts, similar to the analysts' ratings you would find when researching stocks. For each CA, NY, or TX county with over 1,000 homes sold in 2017, I have collected the following over the course of the 2017-2020 period:
- average median sale price
- 2019 median household income
- home price-to-income ratio
- Redfin homebuy rating
- average year-over-year change in total home sales
- average year-over-year change in median home sale prices

Based on the results, I recommend that Redfin implement a similar feature to rank different counties and metro regions in terms of how optimal they are as financially safe investments for buyers. By taking into account multiple factors such as the price-to-income ratio and year-over-year sale price changes, Redfin can provide honest, statistically sound recommendations to buyers about how risky such an investment would be.
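
A minimal Python sketch of the rating rules follows, assuming each OR condition is meant to apply only within its own price-to-income bracket (an assumption about intent). The function name `homebuying_rating` is hypothetical, and the inputs are the rounded values shown in the output above.

```python
def homebuying_rating(ratio: float, sales_change: float, price_change: float) -> str:
    """Rating rules from the CASE expression, with OR branches grouped explicitly."""
    growing = sales_change > 1 and price_change > 1
    if ratio < 2.5 and growing:
        return "Ideal to buy"
    if 2.5 <= ratio <= 5 and growing:
        return "Moderate risk; good future gains"
    if ratio > 5 and growing:
        return "high risk; potential future gains"
    if ratio > 5 and (sales_change < 1 or price_change < 1):
        return "Extreme risk; not recommended"
    if ratio < 2.5 and (
        (sales_change < 1 and price_change > 1)
        or (sales_change > 1 and price_change < 1)
    ):
        return "Little risk; good buy"
    return "Meh; not recommended"


# Rows from the output above.
print(homebuying_rating(1.61, 1.06, 1.10))  # Chautauqua County, NY
print(homebuying_rating(2.70, 1.08, 1.04))  # Rockwall County, TX

# Why the grouping matters: in Python, as in SQL, AND binds tighter than OR.
# Invented values: a mid-priced region with falling prices.
ratio, sales, price = 3.0, 1.05, 0.98
ungrouped = ratio > 5 and sales < 1 or price < 1   # parsed as (A and B) or C
grouped = ratio > 5 and (sales < 1 or price < 1)
print(ungrouped, grouped)  # True False
```

Without the parentheses, any region with a falling average sale price would satisfy the "Extreme risk" condition regardless of its price-to-income ratio.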