# Exploratory Activity 11: Classifying Air Quality Monitoring Sites


Today we will explore a large data set compiled by the US Environmental Protection Agency’s Air Quality System ([EPA AQS](https://aqs.epa.gov/aqsweb/airdata/download_files.html#Annual)), a network of thousands of monitoring sites that measure dozens of meteorological as well as air quality-related variables. These sites can be classified as either “urban” (category 2) or “suburban/rural” (category 1) based on the population of the area in which they are located. 

In this activity, you will use a computational classification algorithm (specifically, the k-nearest-neighbor algorithm you learned about this morning) to predict a monitoring site’s urban vs. rural classification based on its annually averaged temperature, pressure, ozone, sulfur dioxide, and particulate matter measurements from 2013. 


## 11.1 Five-Nearest Neighbors Classification

Use the code below to load the data from 165 sites in 2013. Note that the temperature, pressure, and PM<sub>2.5</sub> quantities reflect the average of 365 daily averages, while the ozone and sulfur dioxide quantities are the average of 365 daily maxima. 


In [None]:
# import some useful libraries
import pandas as pd
import numpy as np

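# load the data, dropping any malformed rows with a warning
# (on_bad_lines requires pandas >= 1.3; it replaces the removed error_bad_lines argument)
data = pd.read_csv("data/knearest_neighbor_practice.csv", on_bad_lines="warn")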

# show the first 5 rows of data
data.head()

**Question 11.1.1** Begin by normalizing the numerical variables into "standard units."
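
As a minimal sketch of what "standard units" means, the example below standardizes a small made-up dataframe. The column names `temp` and `ozone` and their values are hypothetical and are not taken from the activity's data file. Note that the pandas `std()` method uses the sample standard deviation (`ddof=1`) by default; whether to use that or the population version (`ddof=0`) is a choice left to you.

```python
import pandas as pd

# hypothetical toy data (not from the EPA file)
toy = pd.DataFrame({"temp": [10.0, 20.0, 30.0], "ozone": [0.02, 0.04, 0.06]})

# subtract each column's mean and divide by its standard deviation
toy_std = (toy - toy.mean()) / toy.std()

print(toy_std)
```

After standardizing, every column should have a mean of approximately zero and a standard deviation of approximately one, which is a quick way to check your own result.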


In [None]:
## ENTER CODE HERE THAT SUBTRACTS OFF THE MEAN AND DIVIDES BY THE STANDARD DEVIATION OF EACH DATA COLUMN

**Question 11.1.2** Use the code below to separate the data set into two randomly selected and non-overlapping subgroups: the "training" sample and the "test" sample. You may choose any size for the training sample, but it is best if the training sample is at least as large as the test sample, if not much larger.


In [None]:
## CHOOSE HOW MANY SITES YOU WOULD LIKE TO INCLUDE IN YOUR TRAINING SAMPLE
training_sample_size = 

# select a training sample of this size from sites #0-164
training_indices = np.random.choice(165,training_sample_size,replace=False)
training_indices = np.sort(training_indices)

# define test sample as whatever is left
all_indices = np.arange(165)
test_indices = all_indices[np.logical_not(np.isin(all_indices,training_indices))]
test_indices = np.sort(test_indices)

# divide up the standardized numerical values according to site numbers chosen for each sample
training_data = data_std.iloc[training_indices]
test_data = data_std.iloc[test_indices]

The sample code above used "iloc" to extract certain values from a dataframe: in this case, all of the values in the rows selected for each sample. "iat" works similarly, but it extracts a single value at a given row and column position, and that value is a standalone number outside of the dataframe format. This means you can do math with the output of "iat". Use the code provided below to subtract the value in the first row and column of your training_data from the value in the first row and column of your test_data.

In [None]:
# get the value of the first row and column of training_data
first_train = training_data.iat[0,0]

# get the value of the first row and column of test_data
first_test = test_data.iat[0,0]

# subtract!
first_test - first_train

**Question 11.1.3** Calculate the "distance" between each site in your test sample and all of the sites in your training sample based on the differences in their temperature, pressure, ozone, sulfur dioxide, and particulate matter measurements.

*Hint: The "iat" function you used above will likely come in handy!*
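
One common choice of "distance" for k-nearest-neighbor classification is the Euclidean distance, i.e. the square root of the summed squared differences across all variables. The activity does not prescribe a particular formula, so treat the sketch below as one option under that assumption. It uses two tiny made-up dataframes (`toy_train` and `toy_test`, with hypothetical columns `a` and `b`) rather than your real samples.

```python
import numpy as np
import pandas as pd

# hypothetical standardized values for 3 training sites and 2 test sites
toy_train = pd.DataFrame({"a": [0.0, 3.0, 1.0], "b": [0.0, 4.0, 1.0]})
toy_test = pd.DataFrame({"a": [0.0, 3.0], "b": [0.0, 0.0]})

# one row per test site, one column per training site
toy_distances = pd.DataFrame(np.zeros((len(toy_test), len(toy_train))))

for i in range(len(toy_test)):
    for j in range(len(toy_train)):
        squared_sum = 0.0
        # add up the squared difference in each variable (column)
        for k in range(toy_train.shape[1]):
            diff = toy_test.iat[i, k] - toy_train.iat[j, k]
            squared_sum += diff ** 2
        toy_distances.iat[i, j] = np.sqrt(squared_sum)

print(toy_distances)
```

In this layout, row `i` of `toy_distances` holds the distances from test site `i` to every training site, so taking one row gives everything needed to find that site's nearest neighbors.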


In [None]:
# define an empty dataframe to hold your distance calculations
all_distances = pd.DataFrame()

# iterate over each site (row) in your test_data variable
for i in np.arange(len(test_indices)):
        
    ## ENTER CODE HERE TO CALCULATE THE "DISTANCE" BETWEEN THIS SITE AND EACH OF THE TRAINING SITES

# display the results
print(all_distances)

The code below demonstrates how to use the function "nsmallest" to identify the indices associated with the five smallest numbers in the first row of the "all_distances" variable you created above.

In [None]:
# extract the first row from the all_distances dataframe
first_test_site = all_distances.iloc[0,:]

# identify the indices of the five smallest values
smallest_indices = first_test_site.nsmallest(5).index

# display the result
print(smallest_indices)

**Question 11.1.4** By using the sample code above in a "for" loop, identify the "nearest" five training sample site neighbors for each test sample site.

In [None]:
# define an empty dataframe to hold the nearest neighbor indices
nearest_neighbors = pd.DataFrame()

# iterate over each site (row) in your all_distances dataframe
for i in np.arange(len(all_distances)):
    
    ## ENTER CODE HERE TO IDENTIFY WHICH FIVE TRAINING SITES ARE "NEAREST" TO THIS TEST SITE
    

The code below demonstrates how to use the scipy function "stats.mode" to identify the most common value in a list of values.

In [None]:
# import the necessary library
import scipy.stats

# define an example variable containing five hypothetical urban/rural classifications
example_classifications = [1,2,2,2,1]

# compute and display the mode of the example_classifications variable
print(scipy.stats.mode(example_classifications))

In [None]:
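# the numerical value of the mode can also be saved to its own variable
# (np.atleast_1d handles both older SciPy, which returns an array, and newer SciPy, which returns a single number)
mode = np.atleast_1d(scipy.stats.mode(example_classifications).mode)[0]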

**Question 11.1.5** Predict the urban vs. rural classification of each test sample site based on the most common classification observed among its five nearest training sample neighbors.
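
The majority-vote step for a single test site might look like the sketch below, assuming you already know the row positions of its nearest training neighbors. The labels in `toy_labels` and the positions in `toy_neighbors` are made up for illustration. `collections.Counter` is used here as a standard-library alternative to `scipy.stats.mode`; either approach identifies the most common value.

```python
from collections import Counter

import pandas as pd

# hypothetical classifications of six training sites (1 = suburban/rural, 2 = urban)
toy_labels = pd.Series([1, 2, 2, 1, 2, 1])

# hypothetical row positions of the five nearest training neighbors of one test site
toy_neighbors = [0, 1, 2, 4, 5]

# look up the neighbors' classifications and take the most common one
neighbor_labels = toy_labels.iloc[toy_neighbors].tolist()
predicted = Counter(neighbor_labels).most_common(1)[0][0]

print(predicted)
```

Here three of the five neighbors are urban (category 2), so the predicted classification is 2. Repeating this inside a loop over all test sites would give one prediction per site.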


In [None]:
## ENTER CODE HERE THAT PREDICTS THE CLASSIFICATION OF EACH TEST SITE

## 11.2 Model Evaluation

**Question 11.2.1** Compare your predicted classifications and the actual urban vs. rural classifications of your test sample sites. Calculate the number of "true positives" (predicted "urban" and actually "urban"), "false positives" (predicted "urban," but actually "rural"), and "false negatives" (predicted "rural," but actually "urban").
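
As a sketch of how these three counts can be tallied, the example below compares two short made-up arrays, treating "urban" (category 2) as the positive class. The names `toy_predicted` and `toy_actual` and their values are hypothetical; your own predictions and actual classifications may be stored differently.

```python
import numpy as np

# hypothetical predicted and actual classifications (1 = suburban/rural, 2 = urban)
toy_predicted = np.array([2, 2, 1, 1, 2, 1])
toy_actual = np.array([2, 1, 1, 2, 2, 1])

# "positive" here means urban (category 2)
toy_true_pos = int(np.sum((toy_predicted == 2) & (toy_actual == 2)))
toy_false_pos = int(np.sum((toy_predicted == 2) & (toy_actual == 1)))
toy_false_neg = int(np.sum((toy_predicted == 1) & (toy_actual == 2)))

print(toy_true_pos, toy_false_pos, toy_false_neg)
```

For these toy arrays there are 2 true positives, 1 false positive, and 1 false negative; the remaining 2 sites are correctly predicted as rural and do not enter any of the three counts.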


In [None]:
## ENTER CODE HERE TO CALCULATE THE NUMBER OF TRUE POSITIVES,
## FALSE POSITIVES, AND FALSE NEGATIVES IN YOUR PREDICTIONS

**Question 11.2.2** Modify the code below to calculate the "precision" and "recall" of your classification algorithm based on the names of the variables you defined above.


In [None]:
## UPDATE THE VARIABLE NAMES BELOW ACCORDINGLY
precision = true_pos / ( true_pos + false_pos )
recall = true_pos / ( true_pos + false_neg )

Finally, run the code below to calculate the overall F<sub>1</sub> score for your classification algorithm.


In [None]:
F1 = 2 * ( precision * recall ) / ( precision + recall )
F1

**Question 11.2.3** Given the definitions of the various metrics above, is it more desirable to have a high F<sub>1</sub> score or a low F<sub>1</sub> score?


## 11.3 Classification Using Other K Values

**Question 11.3.1** Re-write your original algorithm two more times, each time using a different number of nearest neighbors to determine the classification of your test sample sites.

*Hint: You should always use an odd number of neighbors! Why?*


In [None]:
## ENTER CODE HERE THAT PREDICTS URBAN/RURAL CLASSIFICATION BASED ON A DIFFERENT NUMBER OF NEIGHBORS

In [None]:
## ENTER CODE HERE THAT PREDICTS URBAN/RURAL CLASSIFICATION AGAIN, BASED ON YET ANOTHER NUMBER OF NEIGHBORS

**Question 11.3.2** Calculate the recall, precision, and F<sub>1</sub> score for your two new classification algorithms as well.


In [None]:
## ENTER CODE HERE THAT EVALUATES THE PERFORMANCE OF YOUR NEW CLASSIFICATION ALGORITHMS

**Question 11.3.3** How many neighbors were used in the algorithm with the optimal F<sub>1</sub> score?


## 11.4 Classification Using Fewer Variables


**Question 11.4.1** Although you were given five numerical variables upon which to base your classification, you do not necessarily need to incorporate all five into your algorithm. Think about what each of these variables actually represents in the physical world: do all seem equally related to a site's urban vs. rural classification?


**Question 11.4.2** Choose a two- to four-variable subset of the original five and re-write your classification algorithm using only this subset. 


In [None]:
## ENTER CODE HERE THAT PREDICTS URBAN/RURAL CLASSIFICATION BASED ON FEWER VARIABLES

**Question 11.4.3** Did your F<sub>1</sub> score improve? What does this say about your hypothesis regarding the relationship between certain variables and the urban vs. rural classification?


**Discussion Questions.** Pair up with a partner and discuss the following:

1. Compare the F<sub>1</sub> scores of your original, five-variable classification algorithms. 
    * What was the optimal number of neighbors used by either you or your partner?
    * Did you both use the same size of training vs. test samples? Which sample size resulted in the superior F<sub>1</sub> score for the original, five-neighbor algorithm?
    * What seems to be the general relationship between training sample size and algorithm performance? You should modify and re-run your algorithm a few times to help support your answer.
    * Notice that your F<sub>1</sub> score changes slightly every time you re-run your algorithm, even when you do not change any of the parameters. This is due to a new random selection of sites for the training vs. test samples. How many runs of a given algorithm configuration would you need to feel confident that you have characterized its performance relative to another algorithm?
2. The “urban” vs. “suburban/rural” classifications in this data set were made based on the population of the “[Core Based Statistical Area](https://en.wikipedia.org/wiki/Core-based_statistical_area)” in which the monitoring site is located. Knowing this definition, what factors might confound the relationship between a site’s classification and its meteorological and/or air quality measurements? What might this mean for the maximum efficacy your classification algorithm can ultimately achieve?
3. Given that the US Environmental Protection Agency knows the actual location of each of its air quality monitoring sites, there are probably easier ways to determine their urban vs. rural classification than applying an algorithm to their annual air quality metrics. Can you think of an atmospheric science question for which an effective classification algorithm would be more directly useful?
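
For the run-to-run variability raised in discussion question 1, one way to characterize an algorithm configuration is to repeat the random split many times and summarize the resulting F<sub>1</sub> scores. The sketch below does this on synthetic stand-in data, not the EPA data set. The function `f1_for_one_split`, the two-cluster data, the Euclidean distance, and the choice of 20 repetitions are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic stand-in data: two overlapping clusters, labeled 1 and 2
n_per_class = 40
features = np.vstack([
    rng.normal(0.0, 1.0, size=(n_per_class, 2)),
    rng.normal(1.5, 1.0, size=(n_per_class, 2)),
])
labels = np.array([1] * n_per_class + [2] * n_per_class)


def f1_for_one_split(train_size, k=5):
    """Randomly split the data, classify the test points, return the F1 score."""
    order = rng.permutation(len(labels))
    train, test = order[:train_size], order[train_size:]
    true_pos = false_pos = false_neg = 0
    for t in test:
        # Euclidean distance from this test point to every training point
        dists = np.sqrt(((features[train] - features[t]) ** 2).sum(axis=1))
        nearest = train[np.argsort(dists)[:k]]
        # majority vote among the k nearest training points
        predicted = 2 if np.sum(labels[nearest] == 2) > k / 2 else 1
        if predicted == 2 and labels[t] == 2:
            true_pos += 1
        elif predicted == 2 and labels[t] == 1:
            false_pos += 1
        elif predicted == 1 and labels[t] == 2:
            false_neg += 1
    # algebraically equivalent to 2 * precision * recall / (precision + recall)
    denom = 2 * true_pos + false_pos + false_neg
    return 2 * true_pos / denom if denom > 0 else 0.0


# repeat the random split 20 times and summarize the spread of scores
scores = np.array([f1_for_one_split(train_size=60) for _ in range(20)])
print(f"mean F1 = {scores.mean():.2f}, standard deviation = {scores.std():.2f}")
```

Comparing the spread of scores to the difference between two configurations' mean scores gives a rough sense of whether that difference is meaningful or just an artifact of the random split.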