**Revision**
Hypothesis Testing Review
- One Category (e.g. percent of flowers that are purple)
  - Test Statistic (1): empirical_percentage
  - Test Statistic (2): abs(empirical_percentage - null_percentage)
  - How to Simulate: sample_proportions(n, null_dist)
- Multiple Categories (e.g. color distribution of pea plants)
  - Test Statistic: tvd(empirical_dist, null_dist)
  - How to Simulate: sample_proportions(n, null_dist)
- Numerical Data (e.g. scores in a class section)
  - Test Statistic: empirical_mean
  - How to Simulate: population_data.sample(n, with_replacement=False)
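
As a minimal sketch of the three kinds of test statistic reviewed above, the plain-Python snippet below computes each one on made-up numbers. The `tvd` helper is an assumption (the review only names it): it uses the usual definition of total variation distance, half the sum of the absolute differences between two distributions.

```python
def tvd(dist1, dist2):
    """Total variation distance: half the sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(dist1, dist2)) / 2

# Made-up proportions, for illustration only
null_dist = [0.75, 0.25]       # e.g. 75% purple, 25% white under the null
empirical_dist = [0.70, 0.30]  # proportions observed in a sample

# One category: distance between the observed and the null proportion
one_category_stat = abs(empirical_dist[0] - null_dist[0])

# Multiple categories: TVD between the two distributions
multi_category_stat = tvd(empirical_dist, null_dist)

# Numerical data: the mean of the sampled values (made-up scores)
sample_scores = [71, 85, 90, 64]
numerical_stat = sum(sample_scores) / len(sample_scores)
```

With only two categories the TVD equals the one-category distance, which is why the first two statistics agree in this example.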

Given a CSV file containing longitude, latitude, and pm10 columns ([Air Quality data](https://raw.githubusercontent.com/IsamAljawarneh/datasets/master/data/NYC_PM.csv)), representing readings from low-cost air quality sensors mounted on moving vehicles, and a GeoJSON file containing polygons for the administrative divisions of New York City known as neighbourhoods ([nyc_polygon.geojson](https://raw.githubusercontent.com/IsamAljawarneh/datasets/master/data/nyc_polygon.geojson)), combine the two datasets. Based on the combined data, consider the following formulation of a hypothesis testing problem about the average pm10 value of the <font color='red'>Belmont</font> neighbourhood, which states exactly what the null hypothesis and the alternative hypothesis are.

Hypothesis Testing Problem:

### **Null Hypothesis (H0):** The average PM10 value in the `Belmont` neighborhood is equal to a specified reference value (e.g., the city-wide average PM10 level).

### **Alternative Hypothesis (H1):** The average PM10 value in the `Belmont` neighborhood is significantly different from the specified reference value.

In other terms:
***hypothesis***
* ***Null*** Hypothesis: Belmont's PM10 readings are like a random sample drawn from all readings, so Belmont's average PM10 is similar to the average of such a random sample
* ***Alternative*** Hypothesis: Belmont's average PM10 diverges too much for randomness alone to explain it

Explanation:

The null hypothesis (H0) posits that there is no significant difference between the average PM10 level observed in the Belmont neighborhood and a specified reference value (consider it to be the ***city-wide*** average PM10 level or any other predetermined benchmark). This implies that any observed difference in the average PM10 value of Belmont compared to the reference value is due to random chance or natural variability in air quality measurements.

Conversely, the alternative hypothesis (H1) suggests that there is a significant difference between the average PM10 level observed in the `Belmont` neighborhood and the reference value. This would mean that Belmont's observed average deviates from what random chance alone would produce, pointing to potential spatial disparities in air quality within the Belmont neighborhood.

# **You need to do the following tasks to decide between the two hypotheses, null and alternative, based on the data that you have!**

# **part - A** preprocessing [1.5 marks]

do all tasks and the subtasks!

###1. Among the test statistics reviewed above, which one would you choose, and why?


import necessary libraries

###2. Read the CSV file containing PM10 sensor readings, and read the GeoJSON file containing neighborhood boundaries into a GeoDataFrame

pm10_data.dtypes

###3. Convert the CSV data into a GeoDataFrame and assign it a coordinate reference system (CRS) identical to that of the GeoJSON file. Then join the two with a spatial join (sjoin). The result is a GeoDataFrame: convert it to a DataFrame and select the pm10 and neighborhood columns into a new DataFrame

###4. Convert the resulting dataframe to a Datascience Table. Use the following format: ```Table.from_df(df, keep_index=False)```. Read more here:
[create DS Table from DF](https://www.data8.org/datascience/_autosummary/datascience.tables.Table.from_df.html)

**N.B.** <font color='red'>from this point onwards, perform all tasks using the table abstraction as we have learned in class!</font>

The reverse conversion (Table to dataframe) is:

[Table.to_df](https://www.data8.org/datascience/_autosummary/datascience.tables.Table.to_df.html)

What is the maximum pm10 value?


Show the first few rows of the table.

Print the minimum and maximum pm10 values.

###5. Using the table abstraction, draw a histogram of the pm10 values. Which binning can you use? Also complete the sub-tasks that follow.
You should obtain a figure similar to the one attached (assignment2_fig1)

<font color='red'>attention</font>

remove pm10 values that are unreasonably high (above 10000 µg/m³)

What is the number of rows after removing outliers?
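
A minimal sketch of the outlier filter, using a plain Python list of rows with made-up values in place of the real table. In the assignment itself the same step would be done with the Table abstraction (for example `Table.where` with a predicate such as `are.below_or_equal_to`); the names `readings`, `cleaned`, and `PM10_CUTOFF` are illustrative assumptions.

```python
PM10_CUTOFF = 10000  # µg/m³, the threshold stated in the task

# Made-up rows standing in for the joined pm10/neighborhood table
readings = [
    {"neighborhood": "Belmont", "pm10": 12.5},
    {"neighborhood": "Belmont", "pm10": 25000.0},  # unreasonably high
    {"neighborhood": "Other", "pm10": 31.0},
    {"neighborhood": "Other", "pm10": 10000.0},    # exactly at the cutoff: kept
]

# Keep only readings that are not above the cutoff
cleaned = [row for row in readings if row["pm10"] <= PM10_CUTOFF]
rows_after = len(cleaned)
print(rows_after)
```

The task says to remove values *above* 10000, so a reading exactly equal to the cutoff is kept in this sketch.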

Group by neighborhood.

What is the average pm10 for each neighborhood?

What are the maximum and minimum average pm10 values after removing outliers?
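
The grouping step can be sketched in plain Python as follows. The rows are made up, and "Other A" / "Other B" are placeholder neighborhood names. With the Table abstraction, the equivalent is roughly `Table.group` with an averaging function as the second argument.

```python
from collections import defaultdict

# Made-up (neighborhood, pm10) rows, assumed to be already cleaned of outliers
rows = [
    ("Belmont", 10.0), ("Belmont", 14.0),
    ("Other A", 30.0), ("Other A", 20.0),
    ("Other B", 8.0),
]

# Collect the pm10 values of each neighborhood
groups = defaultdict(list)
for neighborhood, pm10 in rows:
    groups[neighborhood].append(pm10)

# Average pm10 per neighborhood
averages = {name: sum(values) / len(values) for name, values in groups.items()}

# Neighborhoods with the lowest and highest average
lowest = min(averages, key=averages.get)
highest = max(averages, key=averages.get)
print(averages, lowest, highest)
```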

Focus on `Belmont` for the analysis that follows.


# **part - B** Testing Hypothesis [3.5 marks]

do all tasks, in addition to the subtasks that follow!

# 6. compute the observed statistic
Belmont's pm10 average



# 7. compute the sample size (Belmont's population count)
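
As an illustration of steps 6 and 7, the sketch below computes an observed statistic and a sample size from made-up rows in plain Python. The variable names `observed_statistic` and `sample_size` are assumptions chosen to match the wording of the later steps.

```python
# Made-up (neighborhood, pm10) rows; "Other" is a placeholder name
rows = [
    ("Belmont", 11.0), ("Belmont", 13.0), ("Belmont", 15.0),
    ("Other", 30.0), ("Other", 22.0),
]

# pm10 readings that fall inside Belmont
belmont_pm10 = [pm10 for neighborhood, pm10 in rows if neighborhood == "Belmont"]

# Step 6: the observed statistic is Belmont's average pm10
observed_statistic = sum(belmont_pm10) / len(belmont_pm10)

# Step 7: the sample size is the number of readings in Belmont
sample_size = len(belmont_pm10)
print(observed_statistic, sample_size)
```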

# 8. build a simulated test statistic. Take a random sample of pm10 values of size equal to sample_size,
store the result in an array, then compute the test statistic (the average of those pm10 values)

# 9. simulate one value of the test statistic under the null hypothesis (that Belmont's pm10 values are like a random sample from the overall distribution)
Define a function random_sample_average that takes one argument (sample_size), draws a random sample of size sample_size from the pm10 values, and returns the average pm10 of the sample.
Then call the function several times with a sample size equal to Belmont's sample_size (its number of readings)
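
A minimal plain-Python sketch of such a function is shown here on synthetic data. It assumes sampling without replacement, in line with the numerical-data entry of the review (`with_replacement=False`); `all_pm10` and the value 40 for `sample_size` are made up for illustration.

```python
import random

random.seed(42)  # fixed seed so the sketch is repeatable

# Synthetic pm10 readings standing in for the cleaned pm10 column
all_pm10 = [round(random.uniform(5, 60), 1) for _ in range(1000)]

def random_sample_average(sample_size):
    """Draw sample_size readings without replacement and return their mean."""
    sample = random.sample(all_pm10, sample_size)
    return sum(sample) / len(sample)

sample_size = 40  # made-up stand-in for Belmont's number of readings
one_simulated_value = random_sample_average(sample_size)
print(one_simulated_value)
```

Each call returns a different average, which is the point: the spread of these values shows what chance alone produces for a sample of that size.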

# 10. perform a simulation
* run the trial many times: set repetitions = 5000 to simulate 5000 copies of the test statistic
* create an empty array that will contain the sample average for each of the trials
* create a for loop
* call random_sample_average(sample_size) to calculate the pm10 average of the sample for that trial
* append the new test statistic to the array
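
The loop described above can be sketched as follows. This is a self-contained plain-Python version on synthetic data: it uses a list where the assignment asks for an array, and `all_pm10` and the value of `sample_size` are made up.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Synthetic pm10 readings standing in for the cleaned pm10 column
all_pm10 = [random.uniform(5, 60) for _ in range(2000)]
sample_size = 50    # made-up stand-in for Belmont's number of readings
repetitions = 5000

def random_sample_average(sample_size):
    """Average pm10 of one random sample drawn without replacement."""
    sample = random.sample(all_pm10, sample_size)
    return sum(sample) / len(sample)

# One simulated test statistic per trial
sample_averages = []
for _ in range(repetitions):
    new_average = random_sample_average(sample_size)
    sample_averages.append(new_average)

print(len(sample_averages))
```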

# 11. make a decision regarding the hypothesis
* compare the simulated distribution of the statistic and the actual (observed statistic)
  * create a table containing the sample averages
* plot a histogram of the sample averages with a suitable number of bins
  * Remember: the histogram is the empirical distribution of the test statistic under the null hypothesis. It is like all of our simulated values
  * plot a red dot of size 120 at vertical coordinate -0.01, which is just under the horizontal axis exactly covering the location of the observed statistic

  You should obtain a graph similar to the one shown in the attached assignment2_figure2.


Compute the p-value

Decide whether Belmont's average is very different from what random samples of the same size produce, and explain your decision.

# 12. how to decide?
* consider a 0.05 P-value threshold
* compute how many of the simulated sample averages in the histogram are less than or equal to the observed average (the observed statistic)
 * use either sum or np.count_nonzero
* calculate the p-value (the probability, under the null hypothesis, of obtaining a result at least as extreme as the observed value). In other terms, the area of the simulated distribution at or beyond the observed value
  * remember: the p-value (the tail probability) is the ratio of the tail count to the number of repetitions in your simulation
  * Remember: the histogram is the empirical distribution of the test statistic under the null hypothesis. It is like all of our simulated values
  * so, you want to check how many of those are less than or equal to the observed value
* is the value that you obtained less than or equal to the p-value threshold (0.05)?
  * either way, state whether you reject the null hypothesis or fail to reject it, based on the calculated p-value
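
The decision rule above can be sketched in plain Python on made-up numbers. Here `simulated` stands in for the 5000 simulated sample averages and `observed` for Belmont's average. The tail is taken on the low side (less than or equal), as the steps describe.

```python
P_VALUE_THRESHOLD = 0.05

# Made-up simulated statistics: 100 values spread from 18.0 upwards
simulated = [18.0 + 0.05 * i for i in range(100)]
repetitions = len(simulated)
observed = 18.12  # made-up observed statistic

# Tail count: simulated values at or below the observed statistic
tail_count = sum(1 for value in simulated if value <= observed)

# p-value: tail count divided by the number of repetitions
p_value = tail_count / repetitions

# Reject the null hypothesis when the p-value is at or below the threshold
reject_null = p_value <= P_VALUE_THRESHOLD
print(tail_count, p_value, reject_null)
```

In this made-up case only 3 of the 100 simulated values are at or below the observed one, so the p-value is 0.03 and the null hypothesis would be rejected at the 0.05 threshold.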


**draw a gold vertical line representing the cut-off p-value (0.05), you should obtain a figure similar to assignment2_figure3**