## Task 2
Test the hypothesis that properties with vulnerable tenants have a higher likelihood of developing damp and mould problems.

In [1]:
import pandas as pd
import numpy as np

tenancy_case_df = pd.read_csv("tenancy_case.csv")

## Table of contents

- Methodology
- Assumptions
- Data Preparation
- Hypothesis Testing
- Conclusion
- Appendix

## Methodology

Let 
- $p_1$ = the proportion of properties with vulnerable tenants that have damp and mould problems
- $p_2$ = the proportion of properties without vulnerable tenants that have damp and mould problems.

Then we are interested in testing the null hypothesis: $H_0:p_1=p_2$ against the alternative hypothesis: $H_1:p_1>p_2$.

The test statistic for testing the difference in two population proportions under the null hypothesis: $H_0:p_1=p_2$  is 

$Z=\dfrac{(\hat{p}_1-\hat{p}_2)-0}{\sqrt{\hat{p}(1-\hat{p})\left(\dfrac{1}{n_1}+\dfrac{1}{n_2}\right)}}$ 

where

$\hat{p}=\dfrac{Y_1+Y_2}{n_1+n_2}$

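is the proportion of properties with damp and mould problems in the two samples combined, $Y_i$ is the number of properties with damp and mould problems in sample $i$, and $n_i$ is the size of sample $i$.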

For more information on how the test statistic is derived, please refer to the *Appendix* section at the end.

## Assumptions

**1. The sampling method for each population is simple random sampling.**

Our population is all the units in L&Q properties (over 100,000) but we only have ~10,000 in the dataset. We need to check with the data engineers on why this might be the case.

For the vulnerable group, we only count tenants who have reported themselves as vulnerable, so vulnerable tenants who did not report are excluded (this could be for a variety of reasons, such as personal dignity or being unaware of the reporting mechanism). We need to check with the tenancy team on how often this is likely to occur.

**2. The samples are independent.**

This assumes that the likelihood of one property having a damp and mould problem is independent of any other property.

This is a reasonable assumption for houses but may be a weak one for flats. If the damp and mould issue is due to penetrating damp (caused by water coming through external walls or the roof) or rising damp (when moisture beneath a building is soaked up into the bricks or concrete), then it could affect all the units in a block of flats. Therefore, if one unit in a block has a damp and mould problem, another unit in the same block might be more likely to have one too.
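One way to probe the independence concern is to compare damp rates across buildings. The sketch below is illustrative only: it uses toy data, and `block_ref` is a hypothetical column identifying the building each unit belongs to, which is not known to exist in `tenancy_case.csv`.

```python
import pandas as pd

# Toy data only. `block_ref` is a hypothetical building identifier;
# it is an assumption and not a column of the dataset used in this notebook.
toy_df = pd.DataFrame({
    "unit_ref": [f"Unit{i}" for i in range(8)],
    "block_ref": ["A", "A", "A", "A", "B", "B", "C", "D"],
    "unit_damp_ind": [1, 1, 1, 0, 0, 0, 1, 0],
})

# Number of units and share of units with a damp and mould case, per block
block_summary = (
    toy_df.groupby("block_ref")["unit_damp_ind"]
    .agg(n_units="size", damp_rate="mean")
    .reset_index()
)

# Multi-unit blocks whose damp rate sits far from the overall rate hint at clustering
overall_rate = toy_df["unit_damp_ind"].mean()
multi_unit = block_summary[block_summary["n_units"] > 1]

print(f"Overall damp rate: {overall_rate:.2f}")
print(multi_unit)
```

If damp cases turned out to be concentrated in a few blocks, the units within those blocks would not be independent and the standard error used in the test would likely be too small.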

**3. Each sample includes at least 10 successes and 10 failures.** 

**4. Each population is at least 20 times as big as its sample.**
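Assumption 3 can be checked directly from the counts in the crosstab in the *Data Preparation* section. The sketch below hard-codes those counts; the dictionary layout and the `MIN_COUNT` name are illustrative choices. Assumption 4 would need the population size of each group, which is not available in the dataset, so it is not checked here.

```python
# Counts copied from the crosstab in the Data Preparation section
groups = {
    "vulnerable": {"n": 2402, "successes": 432},
    "non_vulnerable": {"n": 7158, "successes": 777},
}
MIN_COUNT = 10  # threshold stated in assumption 3

checks = {}
for name, g in groups.items():
    # A "failure" is a unit with no damp and mould case
    failures = g["n"] - g["successes"]
    checks[name] = g["successes"] >= MIN_COUNT and failures >= MIN_COUNT
    print(f"{name}: successes={g['successes']}, failures={failures}, ok={checks[name]}")
```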

## Data Preparation

`tenancy_case_df` contains the case history reported by tenants living in the unit on 2020-06-01. Since we are testing the hypothesis on a unit level, we need to aggregate up to the unit level.

We will use the following definitions:
- A unit with vulnerable tenants is a unit with at least one vulnerable tenant.
- A unit has a damp & mould problem if it reported a damp & mould issue at least once.

In [2]:
def reduce_unit_level(tenancy_case_df):
    """
    Aggregate from tenancy-case level up to unit level.
    """
    # create unit vulnerability indicator
    tenants_df = tenancy_case_df[["unit_ref", "tenancy_id", "vul_ind"]].drop_duplicates()
    unit_vul_df = tenants_df.groupby("unit_ref")["vul_ind"].sum().rename("number_vul_tenants").reset_index()
    unit_vul_df = unit_vul_df.assign(unit_vul_ind=np.where(unit_vul_df["number_vul_tenants"]==0, 0, 1))
    
    # create unit damp indicator
    damp_cond = tenancy_case_df["case_sub_type"]=="Damp & Mould"
    tenancy_case_df = tenancy_case_df.assign(damp_ind=np.where(damp_cond, 1, 0))
    unit_damp_df = tenancy_case_df.groupby("unit_ref")["damp_ind"].sum().rename("number_damp_issue").reset_index()
    unit_damp_df = unit_damp_df.assign(unit_damp_ind=np.where(unit_damp_df["number_damp_issue"]==0, 0, 1))
    
    merged_df = unit_vul_df.merge(unit_damp_df, on="unit_ref", validate="one_to_one")
    return merged_df

unit_df = reduce_unit_level(tenancy_case_df)

In [3]:
unit_df.head()

| | unit_ref | number_vul_tenants | unit_vul_ind | number_damp_issue | unit_damp_ind |
|---|---|---|---|---|---|
| 0 | Unit0 | 0.0 | 0 | 1 | 1 |
| 1 | Unit1 | 0.0 | 0 | 1 | 1 |
| 2 | Unit10 | 1.0 | 1 | 1 | 1 |
| 3 | Unit100 | 0.0 | 0 | 1 | 1 |
| 4 | Unit1000 | 0.0 | 0 | 0 | 0 |


Each row of `unit_df` represents a unit which is occupied on 2020-06-01. 
- `unit_vul_ind` indicates whether the unit has at least one vulnerable tenant
- `unit_damp_ind` indicates whether the unit reported damp and mould issues at least once during the tenancy.

In [4]:
pd.crosstab(unit_df["unit_damp_ind"], unit_df["unit_vul_ind"], margins=True)

| unit_damp_ind (rows) / unit_vul_ind (columns) | 0 | 1 | All |
|---|---|---|---|
| 0 | 6381 | 1970 | 8351 |
| 1 | 777 | 432 | 1209 |
| All | 7158 | 2402 | 9560 |


We have a total of 7158 properties without vulnerable tenants, of which 777 reported damp and mould issues at least once, and a total of 2402 properties with at least one vulnerable tenant, of which 432 reported damp and mould issues at least once.
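To make these counts easier to compare, the sample proportions can be computed directly. This is a minimal sketch using the counts from the crosstab; the variable names `p1_hat`, `p2_hat` and `diff` are illustrative.

```python
# Counts from the crosstab: group 1 = units with vulnerable tenants,
# group 2 = units without vulnerable tenants
y1, n1 = 432, 2402
y2, n2 = 777, 7158

# Sample proportion of units with at least one damp and mould case
p1_hat = y1 / n1
p2_hat = y2 / n2
diff = p1_hat - p2_hat

print(f"Vulnerable: {p1_hat:.1%}, non-vulnerable: {p2_hat:.1%}, difference: {diff:.1%}")
```

The hypothesis test that follows asks whether a difference of this size could plausibly arise by chance if the two population proportions were equal.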

## Hypothesis Testing

In [5]:
from scipy import stats

y1, y2 = 432, 777
n1, n2 = 2402, 7158
p1, p2 = y1/n1, y2/n2
p = (y1+y2)/(n1+n2)

se = (p*(1-p)*((1/n1)+(1/n2)))**(1/2)
z = (p1-p2)/se 

p_value = stats.norm.sf(z)
print(f"Test statistic: {z:.6f}, p-value: {p_value:.6f}")

Test statistic: 9.097456, p-value: 0.000000


In [6]:
# cross-check the manual calculation with the built-in function from statsmodels
from statsmodels.stats.proportion import proportions_ztest

z, p_value = proportions_ztest(count=[y1, y2], nobs=[n1, n2], alternative="larger")
print(f"Test statistic: {z:.6f}, p-value: {p_value:.6f}")

Test statistic: 9.097456, p-value: 0.000000
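Both cells print the p-value as 0.000000 because it is too small to show in six decimal places. The sketch below prints it in scientific notation instead. It uses only the standard library and relies on the identity $P(Z>z)=\tfrac{1}{2}\,\mathrm{erfc}(z/\sqrt{2})$ for a standard normal variable, which is the same upper-tail probability that `stats.norm.sf` returns.

```python
import math

z = 9.097456  # test statistic reported above

# Upper-tail probability of a standard normal: P(Z > z) = 0.5 * erfc(z / sqrt(2))
p_value = 0.5 * math.erfc(z / math.sqrt(2))

print(f"p-value: {p_value:.2e}")
```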


## Conclusion

The p-value is effectively zero, so a difference in proportions this large would be very unlikely if the two population proportions were equal. We therefore reject the null hypothesis at the 1% significance level and conclude that, subject to the assumptions above, properties with vulnerable tenants have a higher likelihood of developing damp and mould problems.

## Appendix

We can view each unit $X_i$ as a Bernoulli random variable with   

$
X_i=\begin{cases}
      1, & \text{if unit has damp and mould problem} \\
      0, & \text{otherwise}
    \end{cases}
$

For a Bernoulli random variable, the sample mean $\bar{X}$ is also the sample proportion $\hat{p}$.

By the Central Limit Theorem, if we have a large random sample of size $n$, $X_1, X_2, \ldots, X_n$, then the sample mean (proportion) is approximately normally distributed.

$\hat{p} \sim N(p, \frac{p(1-p)}{n})$.  

Since the two samples are independent, their variances add, so the difference in the two sample proportions, $\hat{p}_1-\hat{p}_2$, is also approximately normally distributed.

$\hat{p}_1-\hat{p}_2 \sim N\left(p_1-p_2, \frac{p_1(1-p_1)}{n_1}+\frac{p_2(1-p_2)}{n_2}\right)$

Under the null hypothesis $H_0: p_1=p_2$, the population proportions are equal to some common value $p$, so

$\hat{p}_1-\hat{p}_2 \sim N\left(0, p(1-p)\left(\frac{1}{n_1}+\frac{1}{n_2}\right)\right)$

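Since we don't know the true proportion $p$, we estimate it with the pooled sample proportion $\hat{p}=\dfrac{Y_1+Y_2}{n_1+n_2}$, the proportion of "successes" in the two samples combined. Standardising $\hat{p}_1-\hat{p}_2$ with this estimate gives the test statistic $Z$ in the *Methodology* section.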
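The normal approximation can be sanity-checked by simulation. The sketch below draws many pairs of samples with a shared proportion, so the null hypothesis holds by construction, and checks that the resulting $Z$ values look standard normal. The sample sizes match this analysis; `p_common`, `n_sims` and the seed are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

n1, n2 = 2402, 7158   # sample sizes used in this analysis
p_common = 0.126      # assumed common proportion under H0 (illustrative)
n_sims = 20_000       # number of simulated pairs of samples

# Simulate the number of "successes" in each sample under H0
y1 = rng.binomial(n1, p_common, size=n_sims)
y2 = rng.binomial(n2, p_common, size=n_sims)

# Pooled two-proportion z statistic for every simulated pair
p1_hat, p2_hat = y1 / n1, y2 / n2
p_pool = (y1 + y2) / (n1 + n2)
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z = (p1_hat - p2_hat) / se

# Under H0, Z should have mean near 0, standard deviation near 1,
# and about 5% of values above the one-sided 5% critical value 1.645
print(f"mean of Z: {z.mean():.3f}, std of Z: {z.std():.3f}")
print(f"share of Z above 1.645: {(z > 1.645).mean():.3f}")
```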