# ENGR 371 - Housing affordability in Canada

## Purpose
This document demonstrates the calculations required to test the hypothesis outlined below.

## Hypothesis
The average remaining mortgage owed on the dwelling is greater than the average household income.

## Context
The population is all private households in Canada that have a mortgage on their dwelling.
It is important to analyze the relationship between household income and the amount of mortgage debt that a household is carrying.
This information can be useful in a variety of contexts, such as understanding the debt-to-income ratio of Canadian households and their financial health.


## Methodology
This study is based on data collected by Statistics Canada in 2019 and published as part of a survey available [here](https://www23.statcan.gc.ca/imdb/p2SV.pl?Function=getSurvey&Id=793713).
Data sources, methodology, and references are available on the same page.
The project uses a limited subset of the data collected in that survey and analyzes it using Student's t-test.

## Results
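The sample mean of the remaining mortgage balance (195666.83 CAD) is greater than the mean household income (134899.05 CAD).
The one-sample t-test gives a T-score of 37.15 and a right-tailed P-value of about $8.78 \times 10^{-282}$, so the null hypothesis is rejected at $\alpha = 0.01$.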

## Suggested next steps
Full code and documentation are available in our [public code repository](https://github.com/vlkyrylenko/stat-project).

# Setup

## Library import
We import all the required Python libraries.

In [46]:
# Data manipulation
import pandas as pd
import numpy as np
from pathlib import Path
import math as m
from scipy import stats

# Options for pandas
pd.options.display.max_columns = 5
pd.options.display.max_rows = 30

# Visualizations
import plotly
import plotly.graph_objs as go
import plotly.offline as ply
plotly.offline.init_notebook_mode(connected=True)
import matplotlib.pyplot as plt

## Local library import
We import all the required local libraries.

In [47]:
# System libraries
import os, sys, glob
import pathlib
# sys.path.append('path/to/local/lib')
abs_path = os.path.abspath('')
os.chdir(abs_path)

# Import local libraries
from stat_analysis import STAT

# Parameter definition
We set all relevant parameters for our notebook. By convention, parameters are uppercase, while all other variables follow Python's naming guidelines.


# Data import
Retrieve all the required data for the analysis.
The data was extracted from the [Statistics Canada website](https://www150.statcan.gc.ca/n1/pub/46-25-0001/2021001/2021.zip) and stored in CSV format.
The survey has over 40,000 respondents. However, since the goal of this project is to test the aforementioned hypothesis, only a subset of the survey data has been used.
To minimize error, the mean and standard deviation used to calculate the T-score for each attribute were computed from the original data, ignoring missing values.
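
As a rough illustration of computing a mean and standard deviation while ignoring missing values, the sketch below uses pandas directly. The inline CSV rows are made-up placeholder values, not survey data, and `summarize` is a hypothetical helper rather than part of the project's `STAT` class; only the column names `POWN_80` and `PHHTTINC` come from the survey.

```python
import io

import pandas as pd

# Placeholder rows for illustration only (NOT survey data).
CSV_TEXT = """POWN_80,PHHTTINC
100000,90000
,120000
250000,
300000,150000
"""


def summarize(df: pd.DataFrame, column: str) -> dict:
    """Return mean, sample standard deviation, and count, ignoring NaN."""
    values = df[column].dropna()
    return {
        "mean": values.mean(),
        "std": values.std(ddof=1),  # sample standard deviation
        "n": int(values.count()),
    }


df = pd.read_csv(io.StringIO(CSV_TEXT))
mortgage = summarize(df, "POWN_80")
income = summarize(df, "PHHTTINC")
print(mortgage)
print(income)
```

Missing values are dropped per column here, so each attribute keeps every observation it has; dropping whole rows instead would shrink both samples to the rows that are complete.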

# Data processing
The following process is based on custom class functions in combination with SciPy, NumPy, and pandas.
For more information, please refer to the documentation and code repository.

1. Calculate the mean of the remaining mortgage owed using the following code:

In [48]:
# Export csv file as a data frame and calculate statistical parameters
# POWN_80 - Shelter costs for owners - $ amount currently owed on mortgage
STAT.append_csv('./data/data.csv', dropNaN=True)
STAT.get_parameters('./data/new_data.csv','POWN_80')
mortgage_mean = STAT.get_parameters.mean
mortgage_std = STAT.get_parameters.std
print('Mean of the remaining mortgage owed is: {:.2f}$ CAD'.format(mortgage_mean))
print('Standard deviation of the remaining mortgage owed is: {:.2f}$ CAD'.format(mortgage_std))

Mean of the remaining mortgage owed is: 195666.83$ CAD
Standard deviation of the remaining mortgage owed is: 156177.76$ CAD


2. Calculate the mean of the household income:

In [49]:
# PHHTTINC - Total income of household
STAT.get_parameters('./data/new_data.csv','PHHTTINC')
income_mean = STAT.get_parameters.mean
income_std = STAT.get_parameters.std
print('Mean of the household income is: {:.2f}$ CAD'.format(income_mean))
print('Standard deviation of the household income is: {:.2f}$ CAD'.format(income_std))

Mean of the household income is: 134899.05$ CAD
Standard deviation of the household income is: 92340.09$ CAD


3. Make a claim based on the hypothesis described above.<br><br>
Claim (H1):
Population mean is greater than 176528.17$ CAD
\begin{align}
\mu > 176528.17 \text{ CAD}
\end{align}
Opposite (H0):

\begin{align}
\mu \leq 176528.17 \text{ CAD}
\end{align}

4. Select significance level $\alpha$ equal to 0.01 and sample size $n$ equal to 9118.

5. Calculate T score using the following formula:<br>
*Calculations are based on a custom function that resides within our python class*
\begin{align}
T = \frac{\bar{x} - \mu}{\frac {s}{\sqrt{n}}}
\end{align}

Where<ul><li>$\bar{x}$ is the sample mean of the mortgage balance</li><li>$\mu$ is the hypothesized population mean stated in the claim</li><li>$s$ is the sample standard deviation of the mortgage balance</li><li>$n$ is the sample size</li></ul>
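
The formula above can also be checked by hand from the summary statistics. The following is a minimal sketch, not the project's `STAT.mean_t` implementation: `t_score` is a hypothetical helper, and the inputs are the mean and standard deviation of the mortgage balance, the mean household income, and the sample size reported earlier in this notebook.

```python
import math


def t_score(sample_mean: float, pop_mean: float, sample_sd: float, n: int) -> float:
    """One-sample t statistic: (x_bar - mu) / (s / sqrt(n))."""
    if n < 2:
        raise ValueError("n must be at least 2")
    standard_error = sample_sd / math.sqrt(n)
    return (sample_mean - pop_mean) / standard_error


# Summary statistics reported earlier in this notebook.
t = t_score(sample_mean=195666.83, pop_mean=134899.05, sample_sd=156177.76, n=9118)
print(f"T-score is {t:.2f}")
```

Working from summary statistics keeps the calculation independent of the data file, which makes it a convenient cross-check for the class-based result.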

In [50]:
n = len(STAT.append_csv.df.index)
STAT.mean_t(populationMean=income_mean,sampleMean=mortgage_mean,
        sampleSD= mortgage_std,sampleSize=n,significanceLevel=0.01)
# populationMean - mu (from the claim), sampleMean - x bar,
# sampleSD - s (sample standard deviation), sampleSize - n
# Calculates the t-score for the mean when the population standard deviation is unknown;
# requires n > 30 or a normally distributed population.
print("T-score is {:.2f}".format(STAT.mean_t.value))

T-score is 37.15


In [51]:
stats.ttest_1samp(popmean=income_mean,a=STAT.append_csv.df['POWN_80'])

TtestResult(statistic=37.153857767147336, pvalue=1.7558861177799532e-281, df=9117)

In [52]:
# P-value in the right-tailed test:
stats.t.sf(abs(STAT.mean_t.value),df=n-1)
# where the first argument is the t-score, and the second argument is the degrees of freedom.

8.779430588896769e-282
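
`stats.ttest_1samp` runs a two-sided test by default, which is why the P-value it reports above is twice the right-tailed value. The sketch below shows how the two relate, assuming a SciPy version that supports the `alternative` argument (1.6 or later). The sample is synthetic and generated only for illustration; it is not survey data.

```python
import numpy as np
from scipy import stats

# Synthetic sample for illustration only (NOT survey data).
rng = np.random.default_rng(seed=0)
sample = rng.normal(loc=10.5, scale=2.0, size=200)
popmean = 10.0

# Default two-sided test, as used in the cell above.
two_sided = stats.ttest_1samp(sample, popmean=popmean)

# Right-tailed test: the alternative hypothesis is "mean > popmean".
right_tailed = stats.ttest_1samp(sample, popmean=popmean, alternative="greater")

# Manual right-tail P-value from the t statistic and n - 1 degrees of freedom.
manual = stats.t.sf(two_sided.statistic, df=len(sample) - 1)

print(two_sided.pvalue, right_tailed.pvalue, manual)
```

Passing `alternative="greater"` avoids the separate `stats.t.sf` step and makes the direction of the test explicit in the call.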

### Right tail test
#### 1st method
For a right-tailed test with $\alpha = 0.01$ and 9117 degrees of freedom, the critical value is approximately 2.327.
The T-score obtained using both methods ($37.15$) is far greater than this critical value.
Hence, there **is enough** evidence to reject the null hypothesis and support the claim that the mean mortgage debt is greater than the mean household income.
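
Printed t-tables seldom include a row for 9117 degrees of freedom, so the critical value can also be computed directly. The sketch below assumes SciPy's `stats.t.ppf` and reuses the significance level, sample size, and T-score from the steps above; with this many degrees of freedom the result is close to the standard normal quantile (about 2.33).

```python
from scipy import stats

ALPHA = 0.01     # significance level from step 4
N = 9118         # sample size from step 4
T_SCORE = 37.15  # T-score reported above

# Right-tail critical value: the t value with area ALPHA to its right.
critical_value = stats.t.ppf(1 - ALPHA, df=N - 1)

# Reject the null hypothesis when the T-score exceeds the critical value.
reject_null = T_SCORE > critical_value
print(f"critical value = {critical_value:.3f}, reject H0: {reject_null}")
```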

#### 2nd method
P-value

The P-value obtained using the 2nd method ($\approx 8.78 \times 10^{-282}$) is far smaller than our significance level of $\alpha = 0.01$.
Hence, there **is enough** evidence to reject the null hypothesis based on the P-value.

# References

1. Statistics Canada, Canadian Housing Survey: Public Use Microdata File, 2021, DOI: https://doi.org/10.25318/46250001-eng
2. Volodymyr Kyrylenko, Project repository, GitHub, 2023, https://github.com/vlkyrylenko/stat-project
3. Brandon Leonard, Statistics Lecture 8.5: Hypothesis Testing for Population Mean. Population Std Dev is Unknown, 2011, https://youtu.be/onTQhD7osY4