# Stat 201 Week 1 - Week 4 Contents
## Author: 201冲冲冲

This study guide covers the contents of Week 1 to Week 4 of STAT 201, i.e. the concepts and their application in R code. <br> 
Each main title is a hyperlink to the textbook, while the hyperlink next to it named <b>"Doc hyp"</b> takes you to the corresponding section of this guide.
</br>
The following are topics covered:
* <a href="https://moderndive.com/7-sampling.html"><b> Introduction to Statistical Inference and Sampling </b></a> 
<a href="#topic1">Doc hyp</a>
* <a href="https://docs.google.com/presentation/d/16DKhcwkkA3bulJ7jaECAEASVVM30_Yf_CBD1ACNZYD4/edit#slide=id.ge4a649f4_10"><b> Populations and Sampling </b></a>
<a href=#topic2>Doc hyp</a>
* <a href ="https://moderndive.com/8-confidence-intervals.html#resampling-simulation"><b> Bootstrapping and its Relationship to Sampling Distribution </b></a> 
<a href=#topic3>Doc hyp</a>
* <a href ="https://moderndive.com/8-confidence-intervals.html#ci-build-up"><b> Confidence Intervals via bootstrapping </b></a>
<a href=#topic4>Doc hyp</a>





In [1]:
# Load required packages 
library(cowplot)
library(datateachr)
library(infer)
library(repr)
library(tidyverse)
library(moderndive)
library(taxyvr)

-- Attaching packages ------------------------------------------------------------------------------- tidyverse 1.3.1 --

v ggplot2 3.3.5     v purrr   0.3.4
v tibble  3.1.6     v dplyr   1.0.7
v tidyr   1.1.4     v stringr 1.4.0
v readr   2.1.1     v forcats 0.5.1

-- Conflicts ---------------------------------------------------------------------------------- tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()



## <a id ="topic1">Introduction to Statistical Inference and Sampling</a>

Let's start with the types of data analysis questions. In STAT 201, we will mostly focus on inferential questions. 


### Types of data analysis questions
<br>
<img src="tables/table1.1_types_q.png" width=600/>
</br>

#### Important Statistical Definitions

##### Population
The collection of individuals or observations that we are interested in. In real life, the whole population is usually inaccessible.

##### Population Parameter

A <b>fixed</b> but unknown numeric summary of the population. It is usually estimated by a sample statistic.

##### Sample
A subset of a population.


##### Sampling
Collecting a sample from the population.

##### Estimate (point)
A sample statistic (usually the sample mean, proportion, or standard deviation) that is calculated from a sample and estimates the corresponding unknown population parameter.

##### Sampling variation
Different random samples from the same population contain different observations, so their sample statistics (mean, sd, var) also differ from one another.
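
To make sampling variation concrete, here is a minimal sketch in base R. The population is simulated (`rexp()` values, not the course data), and the names `sample_a` and `sample_b` are made up for illustration.

```r
set.seed(201)

# Hypothetical population: 10,000 right-skewed values (NOT the course data)
population <- rexp(10000, rate = 1 / 500)

# Draw two independent random samples of the same size
sample_a <- sample(population, size = 50)
sample_b <- sample(population, size = 50)

# The two point estimates differ from each other and from the parameter
mean_a <- mean(sample_a)
mean_b <- mean(sample_b)
pop_mean <- mean(population)

print(c(sample_a = mean_a, sample_b = mean_b, population = pop_mean))
```

Changing the seed changes both sample means but not the population mean, which is exactly what "fixed parameter, varying estimate" means.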

##### Statistical Inference
Take a sample from the population under appropriate assumptions and calculate a statistic from it to "guess" an unknown population parameter.

##### Standard Error
Quantifies how much a point estimate is expected to vary from sample to sample; it is the standard deviation of the sampling distribution. Every point estimate has a standard error (covered in STAT 344): the lower it is, the more precise the estimate. As sample size increases, the standard error decreases.
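
The claim that the standard error shrinks as the sample size grows can be checked by simulation. This sketch assumes a simulated normal population and a helper called `simulate_se()`, both invented for illustration. It compares the simulated standard error of the sample mean with the usual formula `sd / sqrt(n)`.

```r
set.seed(201)

# Hypothetical population (an assumption for this sketch)
population <- rnorm(100000, mean = 50, sd = 10)

# Simulated standard error: the sd of many sample means of size n
simulate_se <- function(n, reps = 2000) {
  sample_means <- replicate(reps, mean(sample(population, size = n)))
  sd(sample_means)
}

sample_sizes <- c(25, 100, 400)
simulated_se <- sapply(sample_sizes, simulate_se)

# Formula for the standard error of a sample mean
theoretical_se <- sd(population) / sqrt(sample_sizes)

print(data.frame(n = sample_sizes, simulated_se, theoretical_se))
```

Quadrupling the sample size should roughly halve the standard error, because of the square root in the formula.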

##### Sample Distribution
How the data are distributed within <b>ONE</b> sample.

##### Sampling Distribution
The distribution of all possible values of a point estimate, showing how often we would expect each value to appear if we took many samples of the same size. <b>Note:</b> This is not accessible in real life, see <a href=#topic3>Bootstrapping</a>. For an unbiased estimator such as the sample mean, the mean of the sampling distribution equals the unknown parameter; for other statistics (median, sd, var, ...) it is usually close to the corresponding population value.
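
The difference between a sample distribution and a sampling distribution is easy to mix up, so here is a small base R sketch. It assumes a simulated, skewed population. In real life you would only ever see `one_sample`.

```r
set.seed(201)

# Hypothetical skewed population (an assumption for this sketch)
population <- rexp(50000, rate = 1 / 600)

# Sample distribution: the raw values inside ONE sample of size 50
one_sample <- sample(population, size = 50)

# Sampling distribution: the means of 1000 different samples of size 50
sample_means <- replicate(1000, mean(sample(population, size = 50)))

# Spread of raw values vs. spread of sample means
sd_one_sample <- sd(one_sample)
sd_sample_means <- sd(sample_means)

# Distance between the centre of the sampling distribution and the parameter
centre_gap <- abs(mean(sample_means) - mean(population))

print(c(sd_one_sample = sd_one_sample,
        sd_sample_means = sd_sample_means,
        centre_gap = centre_gap))
```

The sample means are much less spread out than the raw values, and their centre sits close to the population mean.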

#### Facts

##### Sampling methodology:
A way of collecting a sample. To apply statistical methods, we need to make sure the sampling is done correctly:
* Representative sample (the sample roughly looks like the population)
* Generalizable (results from the sample can be generalized to the population)
* Unbiased (every individual in the population has an equal chance of being chosen)
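
A quick sketch of why unbiased sampling matters. It assumes a simulated population and compares a simple random sample with a deliberately bad "convenience" sample that only picks the largest values. Both the data and the convenience rule are hypothetical.

```r
set.seed(201)

# Hypothetical population values (an assumption for this sketch)
population <- rexp(20000, rate = 1 / 600)

# Simple random sample: every individual has the same chance of selection
random_sample <- sample(population, size = 100)

# Biased convenience sample: only the 100 largest values are selected
biased_sample <- sort(population, decreasing = TRUE)[1:100]

# Compare how far each estimate lands from the true population mean
pop_mean <- mean(population)
random_error <- abs(mean(random_sample) - pop_mean)
biased_error <- abs(mean(biased_sample) - pop_mean)

print(c(random_error = random_error, biased_error = biased_error))
```

The biased sample is not representative, so its mean cannot be generalized to the population no matter how large the sample is.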

##### Steps of statistical inference
1. Take a random sample (following the sampling methodology above) from the population
2. Calculate the point estimate(s) for the sample(s)
3. Describe the uncertainty of your estimate (report its standard error) <b>NOTE:</b> You can ignore this part for now ....
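
The three steps above can be sketched in a few lines of base R. The population here is simulated and the variable names are made up for illustration. Step 3 uses the common formula for the standard error of a sample mean (sample sd divided by the square root of n), which is an assumption about which uncertainty measure to report.

```r
set.seed(201)

# Hypothetical population (an assumption for this sketch)
population <- rnorm(20000, mean = 170, sd = 8)

# Step 1: take a random sample
n <- 50
my_sample <- sample(population, size = n)

# Step 2: calculate the point estimate (here, the sample mean)
point_estimate <- mean(my_sample)

# Step 3: describe the uncertainty with an estimated standard error
standard_error <- sd(my_sample) / sqrt(n)

print(c(point_estimate = point_estimate, standard_error = standard_error))
```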

#### Coding

Try this part once you are familiar with the concepts above. <br>
Let's apply the <b>"Steps of statistical inference"</b> to estimate the mean or proportion of a population. The code below consists of two parts, one for the mean and one for the proportion.

#### Part 1
We are only interested in the population corresponding to multiple-family dwellings in strata housing. In this data set, these are the properties that meet the following criteria:  
- **Have a land value greater than \$1:**  Some properties are assigned a value of `NA`; these are properties undergoing big renovations. Their values get amended after the renovations and are reflected in the following year's assessment. The same applies to homes that are assessed at \$0.
- **Are of legal type land `STRATA`**
- **Are of zone category `Multiple Family Dwelling`** 

In [41]:
# Part 1: estimating a population mean
# Read the data, apply the conditions, then assign the result to tax_pop
tax_pop <- tax_2019 %>%
           filter(!is.na(current_land_value),
                  current_land_value > 1,
                  legal_type == "STRATA",
                  zone_category == "Multiple Family Dwelling") %>%
            select(current_land_value)

# size of tax_pop
tax_size <- dim(tax_pop)
tax_obs <- nrow(tax_pop)

# Read first six rows of tax_pop
#head(tax_pop)
# Retrieves all column names of the pop
tax_cols <- colnames(tax_pop)

# We are interested in estimating the population mean of current_land_value
# In this case, we have access to the population
# Let's calculate it and then compare it with the sample estimates
# We can do this in two ways:
# either the base R way or the dplyr (tidyverse) way

# The original data contained NAs, so we filter them out in both versions
tax_fast <- mean(tax_pop$current_land_value, na.rm = TRUE)
tax_tidy <- tax_pop %>%
            filter(!is.na(current_land_value)) %>%
                  summarize(mean = mean(current_land_value)) 


# Combine the means computed both ways so they can be compared
tax_means <- rbind(tax_fast , tax_tidy)

# Visualize the population current land value
tax_plot <- tax_pop %>%
            filter(!is.na(current_land_value)) %>%
            ggplot(aes(x = current_land_value)) +
            geom_histogram(bins = 50)+
            xlab("Current Land Value") +
            ggtitle("Land Value Distribution") +
            scale_x_continuous(labels = scales::dollar_format()) +
            theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Let's follow the "Steps of statistical inference" to estimate
# the population mean of current_land_value computed above

# 1) Take a random sample, usually of size <= 10% of the population size (STAT 200)
# Since this data is large enough:
# 213182 rows (observations) * 32 columns (variables)
# a good sample size is 50 or 100, which is much lower than 10% of the pop size

# Set a seed (this makes the workflow reproducible)
# BTW, you can try commenting out the seed and rerunning!
# Let's take a single sample first
set.seed(3332)
tax_sample_one <- tax_pop %>%
              rep_sample_n(size = 50, reps = 1)

# Visualize the sample using ggplot
tax_one_plot <- tax_sample_one %>%
                ggplot(aes(x = current_land_value)) +
                geom_histogram(bins=50,color = "white") + 
                labs(x = "Current Land Value" ,
                    title = "Land value Distribution of One sample") +
                scale_x_continuous(labels = scales::dollar_format()) +
                theme(axis.text.x = element_text(angle = 45, hjust = 1))
#tax_one_plot

# 2) Calculate the point estimate (in this case, the mean of the sample)
tax_one_mean <- tax_sample_one %>%
                summarize(sample_mean = mean(current_land_value)) 

# Compare with pop mean
tax_one_vs_pop <- data.frame("One" = tax_one_mean,
                            "Population mean" = tax_fast) %>%
                  select(-1)

# ------------------------------------------------------------------

# Now let's try taking many different samples
# 1) Take many random samples of size 50
tax_many_sample <- tax_pop %>%
                   rep_sample_n(size = 50, reps = 1000)

# 2) Calculate the point estimates (in this case, the mean of each sample)
# Note: this gives one mean per sample (replicate)
# E.g. mean of sample 1 = 586428
# mean of sample 9 = 547782
tax_many_mean <- tax_many_sample %>%
                 group_by(replicate) %>%
                summarize(sample_mean = mean(current_land_value)) 

# Now we take the mean of these sample means, since this value
# closely approximates the actual population parameter

tax_mean_many_mean <- tax_many_mean %>%
                      summarize(mean = mean(sample_mean))


# Visualize the sampling distribution using ggplot
# We added a red line, which is actual population mean
tax_many_plot <- tax_many_mean %>%
                ggplot(aes(x = sample_mean)) +
                geom_histogram(bins=50,color = "white") + 
                labs(x = "Sample mean of Current Land Value" ,
                    title = "Sampling distribution of Sample means") +
                geom_vline(xintercept = tax_fast, color = "red") +
                scale_x_continuous(labels = scales::dollar_format()) +
                theme(axis.text.x = element_text(angle = 45, hjust = 1))
#tax_many_plot


# Compare with pop mean and one sample mean
tax_compare_means <- merge(tax_one_vs_pop, tax_mean_many_mean) %>%
                     setNames(., 
                              c("Mean of One Sample", "Actual Population Mean",
                             "Mean of Many Sample Means"))
                     
tax_compare_means






| Mean of One Sample | Actual Population Mean | Mean of Many Sample Means |
|---|---|---|
| `<dbl>` | `<dbl>` | `<dbl>` |
| 605000 | 620331.5 | 619465.4 |


#### Part 2 
#### IN PROGRESS

In [42]:
# This is the part for proportion


# bowl itself is a population data taken from "moderndive" library
# We are going to assign it to "pop" object
pop <- bowl

# Read the first 6 rows of the data
# After reading, we know there are 2 columns in the data
# head(pop)

# Let's say we are interested in calculating the mean 

#### BELOW is code from textbook
Press Ctrl + A to select all, then Ctrl + / to uncomment the code before running it.


In [4]:
# # Segment 1: sample size = 25 ------------------------------
# # 1.a) Virtually use shovel 1000 times
# virtual_samples_25 <- bowl %>% 
#   rep_sample_n(size = 25, reps = 1000)

# # 1.b) Compute resulting 1000 replicates of proportion red
# virtual_prop_red_25 <- virtual_samples_25 %>% 
#   group_by(replicate) %>% 
#   summarize(red = sum(color == "red")) %>% 
#   mutate(prop_red = red / 25)

# # 1.c) Plot distribution via a histogram
# ggplot(virtual_prop_red_25, aes(x = prop_red)) +
#   geom_histogram(binwidth = 0.05, boundary = 0.4, color = "white") +
#   labs(x = "Proportion of 25 balls that were red", title = "25") 


# # Segment 2: sample size = 50 ------------------------------
# # 2.a) Virtually use shovel 1000 times
# virtual_samples_50 <- bowl %>% 
#   rep_sample_n(size = 50, reps = 1000)

# # 2.b) Compute resulting 1000 replicates of proportion red
# virtual_prop_red_50 <- virtual_samples_50 %>% 
#   group_by(replicate) %>% 
#   summarize(red = sum(color == "red")) %>% 
#   mutate(prop_red = red / 50)

# # 2.c) Plot distribution via a histogram
# ggplot(virtual_prop_red_50, aes(x = prop_red)) +
#   geom_histogram(binwidth = 0.05, boundary = 0.4, color = "white") +
#   labs(x = "Proportion of 50 balls that were red", title = "50")  


# # Segment 3: sample size = 100 ------------------------------
# # 3.a) Virtually using shovel with 100 slots 1000 times
# virtual_samples_100 <- bowl %>% 
#   rep_sample_n(size = 100, reps = 1000)

# # 3.b) Compute resulting 1000 replicates of proportion red
# virtual_prop_red_100 <- virtual_samples_100 %>% 
#   group_by(replicate) %>% 
#   summarize(red = sum(color == "red")) %>% 
#   mutate(prop_red = red / 100)

# # 3.c) Plot distribution via a histogram
# ggplot(virtual_prop_red_100, aes(x = prop_red)) +
#   geom_histogram(binwidth = 0.05, boundary = 0.4, color = "white") +
#   labs(x = "Proportion of 100 balls that were red", title = "100") 

#### Above is code from textbook
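
The three textbook segments differ only in the shovel size, so the same idea can be sketched compactly with a helper function. This version uses base R and a stand-in bowl (2400 balls with 900 red is an assumption made here, not read from the `bowl` data). The helper `virtual_prop_red()` is hypothetical.

```r
set.seed(201)

# Stand-in bowl: 2400 balls, 900 of them red (an assumption for this sketch)
bowl_colors <- rep(c("red", "white"), times = c(900, 1500))

# Proportion of red balls in `reps` virtual shovels of a given size
virtual_prop_red <- function(size, reps = 1000) {
  replicate(reps, mean(sample(bowl_colors, size = size) == "red"))
}

# Spread (sd) of the sampling distribution for each shovel size
shovel_sizes <- c(25, 50, 100)
spread <- sapply(shovel_sizes, function(s) sd(virtual_prop_red(s)))
names(spread) <- shovel_sizes

print(spread)
```

The spread shrinks as the shovel gets bigger, which is the same pattern the three histograms show: larger samples give less variable estimates.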

## <a id ="topic2">Populations and Sampling</a>
### IN PROGRESS

## <a id = "topic3">Bootstrapping and its Relationship to Sampling Distribution</a>

### IN PROGRESS

<img src="tables/sampl_vs_boot.png"/>
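
Since the true sampling distribution is not accessible in real life, bootstrapping approximates it by resampling from the one sample we do have. Below is a minimal base R sketch under stated assumptions: the population is simulated only so that a sample can be drawn from it, and the number of resamples (1000) is an arbitrary choice.

```r
set.seed(201)

# Hypothetical population, used only to draw one sample from
population <- rexp(50000, rate = 1 / 600)

# In practice we only have ONE sample
one_sample <- sample(population, size = 50)

# Bootstrap: resample the sample WITH replacement, keeping the same size
boot_means <- replicate(
  1000,
  mean(sample(one_sample, size = 50, replace = TRUE))
)

# The bootstrap distribution is centred near the sample mean,
# and its spread approximates the standard error of the sample mean
boot_centre <- mean(boot_means)
boot_se <- sd(boot_means)
formula_se <- sd(one_sample) / sqrt(50)

print(c(sample_mean = mean(one_sample), boot_centre = boot_centre,
        boot_se = boot_se, formula_se = formula_se))
```

Note that the bootstrap distribution is centred on the sample mean, not on the population mean. What it borrows from the sampling distribution is the spread.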


## <a id ="topic4">Confidence Intervals via bootstrapping</a>
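
One common way to turn a bootstrap distribution into a confidence interval is the percentile method: take the middle 95% of the bootstrap estimates. The sketch below assumes a simulated sample of 50 values and 2000 resamples. Both are illustrative choices, not course data.

```r
set.seed(201)

# Hypothetical sample of 50 observations (an assumption for this sketch)
one_sample <- rexp(50, rate = 1 / 600)

# Bootstrap distribution of the sample mean
boot_means <- replicate(2000, mean(sample(one_sample, replace = TRUE)))

# Percentile method: the 2.5th and 97.5th percentiles give a 95% interval
ci_95 <- quantile(boot_means, probs = c(0.025, 0.975))

print(ci_95)
```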

 <a href="#top">Back to top</a>