## Lab 3: R in Probability and Distributions
#### MA 189 Data Dive Into Birmingham (with R)
##### _Blazer Core: City as Classroom_


Course Website: [github.com/tphilli2/datadiveintobirmingham](https://github.com/tphilli2/datadiveintobirmingham) 

#### Levels:
<div class="alert-success"> Concepts and general information</div>
<div class="alert-warning"> Important methods and technique details </div>
<div class="alert-info"> Extended reading </div>
<div class="alert-danger"> (Local) Examples, assignments, and <b>Practice in Birmingham</b> </div>

In this lab, we will work with probability and distribution concepts using R. We will cover calculating probabilities for different events, explore basic probability distributions, and visualize them using R.

---

### <div class="alert alert-block alert-danger"><b>Example</b>: Alabama Home Values</div>

In April 2024, the average home value in Alabama was \\$228,241, with a standard deviation of \\$20,000. Assume home values follow a normal distribution. Let’s calculate probabilities and visualize this data.

**Question 1:** What percentage of homes are worth more than \\$250,000?

**Step 1: Calculate the probability of a home being worth more than \\$250,000**


In [None]:
# Define the mean and standard deviation for home values
mean_home_value <- 228241
sd_home_value <- 20000

# Calculate the probability of a home being worth more than $250,000
P_more_than_250k <- 1 - pnorm(250000, mean = mean_home_value, sd = sd_home_value)
P_more_than_250k  # Proportion of homes worth more than $250,000 (multiply by 100 for a percentage)
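
The same upper-tail probability can be reached in two other ways, which is a useful check on the calculation above. The sketch below repeats the mean and standard deviation from the example; the names `p_complement`, `p_upper_tail`, `z`, and `p_from_z` are illustrative only.

```r
# Mean and standard deviation from the home value example
mean_home_value <- 228241
sd_home_value <- 20000

# Approach used above: 1 minus the area to the left of $250,000
p_complement <- 1 - pnorm(250000, mean = mean_home_value, sd = sd_home_value)

# Same area, requested directly as the upper tail
p_upper_tail <- pnorm(250000, mean = mean_home_value, sd = sd_home_value,
                      lower.tail = FALSE)

# Same area again, after converting $250,000 to a z-score
z <- (250000 - mean_home_value) / sd_home_value
p_from_z <- pnorm(z, lower.tail = FALSE)

# All three values agree; the last line expresses the result as a percentage
c(p_complement, p_upper_tail, p_from_z)
round(100 * p_upper_tail, 1)
```

Using `lower.tail = FALSE` avoids the subtraction and is slightly more accurate for probabilities that are very close to 0.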


**Question 2:** What is the probability of a home value being between \\$200,000 and \\$250,000?

**Step 2: Calculate the probability of a home value being in a certain range**


In [None]:
# Calculate the probability of a home value between $200,000 and $250,000
P_between_200k_250k <- pnorm(250000, mean = mean_home_value, sd = sd_home_value) -
  pnorm(200000, mean = mean_home_value, sd = sd_home_value)
P_between_200k_250k  # This gives the probability of home value falling in this range
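
To visualize this probability, the normal curve can be drawn with the region between \\$200,000 and \\$250,000 shaded. This is a minimal sketch in base R graphics; the grid `x`, the plot labels, and the color are choices made for illustration. The last lines use `integrate()` to confirm numerically that the shaded area matches the difference of the two `pnorm()` values.

```r
# Mean and standard deviation from the home value example
mean_home_value <- 228241
sd_home_value <- 20000

# Grid of home values covering four standard deviations on each side of the mean
x <- seq(mean_home_value - 4 * sd_home_value,
         mean_home_value + 4 * sd_home_value,
         length.out = 400)
density <- dnorm(x, mean = mean_home_value, sd = sd_home_value)

# Draw the normal density curve
plot(x, density, type = "l",
     xlab = "Home value (dollars)", ylab = "Density",
     main = "Alabama home values (assumed normal)")

# Shade the region between $200,000 and $250,000
in_range <- x >= 200000 & x <= 250000
polygon(c(200000, x[in_range], 250000),
        c(0, density[in_range], 0),
        col = "lightblue", border = NA)

# Mark the mean with a dashed vertical line
abline(v = mean_home_value, lty = 2)

# Numerical check: area under the curve over the shaded range
shaded_area <- integrate(dnorm, lower = 200000, upper = 250000,
                         mean = mean_home_value, sd = sd_home_value)$value
pnorm_area <- pnorm(250000, mean = mean_home_value, sd = sd_home_value) -
  pnorm(200000, mean = mean_home_value, sd = sd_home_value)
c(shaded_area, pnorm_area)
```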

**Question 3:** Find the minimum home value for the top 10% of homes.

**Step 3: Calculate the value corresponding to the 90th percentile of home values**


In [None]:
# Calculate the 90th percentile value (top 10% of home values)
top_10_percent_value <- qnorm(0.90, mean = mean_home_value, sd = sd_home_value)
top_10_percent_value  # This is the minimum value for the top 10% of homes
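
One way to sanity-check the cutoff is to simulate home values from the same normal distribution and count how many land above it. This is a rough sketch: the seed, the sample size of 100,000, and the names `simulated_values` and `share_above_cutoff` are arbitrary choices, and the simulated share will only be approximately 10%.

```r
# Mean and standard deviation from the home value example
mean_home_value <- 228241
sd_home_value <- 20000

# Cutoff for the top 10% of homes (90th percentile)
top_10_percent_value <- qnorm(0.90, mean = mean_home_value, sd = sd_home_value)

# Simulate home values from the assumed normal distribution
set.seed(189)  # arbitrary seed so the simulation is reproducible
simulated_values <- rnorm(100000, mean = mean_home_value, sd = sd_home_value)

# Share of simulated homes above the cutoff; this should be close to 0.10
share_above_cutoff <- mean(simulated_values > top_10_percent_value)
share_above_cutoff
```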

---
### <div class="alert alert-block alert-danger"><b>Example</b>: ACT Scores for Incoming Students at UAB</div>

Recall the information from the lecture regarding ACT scores being normally distributed. Assume an average national ACT score of 20.8 with a standard deviation of 5.8.

**Question 4:** A student earns an ACT score of 26.5 to improve their chances of UAB scholarships. What percentile are they in?


In [None]:
# Define the mean and standard deviation for ACT scores
mean_ACT <- 20.8
sd_ACT <- 5.8

# Calculate the percentile for a score of 26.5
percentile_26_5 <- pnorm(26.5, mean = mean_ACT, sd = sd_ACT)
percentile_26_5 
floor(percentile_26_5 * 100) # This gives the percentile rank of the student

**Question 5:** What percentile would a student earning the average Alabama ACT score of 18 be in?

In [None]:
# Calculate the percentile for a score of 18 (Alabama's average ACT score)
percentile_18 <- pnorm(18, mean = mean_ACT, sd = sd_ACT)
percentile_18 
floor(percentile_18 * 100)  # This gives the percentile rank of a student with an ACT score of 18

**Question 6:** What ACT scores make up the middle 68% of the normal distribution?

**Step 4: Calculate the range for the middle 68% using 1 standard deviation from the mean**

In [None]:
# Calculate the lower and upper bounds of the middle 68% (within 1 standard deviation)
lower_bound <- mean_ACT - sd_ACT
upper_bound <- mean_ACT + sd_ACT
c(lower_bound, upper_bound)  # This gives the range of ACT scores for the middle 68%
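
The "68%" in the one-standard-deviation rule is a rounded figure. The sketch below checks how much area actually lies within one standard deviation of the mean, and compares the rule's bounds with the bounds that enclose exactly the middle 68% (the 16th and 84th percentiles). The names `coverage_1sd`, `rule_bounds`, and `exact_bounds` are illustrative.

```r
# Mean and standard deviation from the ACT example
mean_ACT <- 20.8
sd_ACT <- 5.8

# Area within one standard deviation of the mean (slightly more than 0.68)
coverage_1sd <- pnorm(1) - pnorm(-1)
coverage_1sd

# Bounds from the one-standard-deviation rule
rule_bounds <- c(mean_ACT - sd_ACT, mean_ACT + sd_ACT)

# Bounds that enclose exactly the middle 68%: the 16th and 84th percentiles
exact_bounds <- qnorm(c(0.16, 0.84), mean = mean_ACT, sd = sd_ACT)

# The two pairs of bounds differ only slightly
rbind(rule_bounds, exact_bounds)
```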

**Question 7:** A student scores 29 on their ACT. What percentile are they in?

### <div class="alert alert-block alert-danger"><b>Example</b>: Alabama Teacher Salaries </div>

**Question 8:** In April 2024, the average teacher salary in Alabama was \\$53,572. Assume a standard deviation of \\$10,000 and assume salaries follow a normal distribution:
- What percentage of teachers earn more than \\$60,000?
- What is the probability that a teacher’s salary is between \\$50,000 and \\$60,000?

*Hints:*
- Use `pnorm()` to calculate probabilities for normal distributions. (The input is a value; the output is the cumulative probability, or area to the left of that value.)
- Use `qnorm()` to find the value at a given percentile. (The input is a cumulative probability, or area to the left; the output is the value.)
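
The direction of each function can be checked on the standard normal distribution (mean 0 and standard deviation 1, which are R's defaults). This is a minimal sketch; the value 1.5 and the variable names are illustrative and are not part of the lab data.

```r
# Illustrative value on the standard normal scale (not from the lab data)
z <- 1.5

# pnorm(): value in, cumulative probability (area to the left) out
area_left <- pnorm(z)

# qnorm(): cumulative probability in, value out
value_back <- qnorm(area_left)

# The two functions undo each other, so value_back matches z
all.equal(value_back, z)

# The area to the right is the complement of the area to the left
area_right <- 1 - area_left
c(area_left, area_right)
```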

---

### <div class="alert alert-block alert-danger"><b>Practice in Birmingham</b></div>

Note: This problem focuses more on the skills we learned in Lab 2 as further practice of that material. We will continue to work with the material that was presented here in Lab 3 throughout the term. 

Consider the data set provided by Alabama Power Company. Alabama Power’s incremental cost of generating electricity is monitored using the **system marginal cost**, also known as **system lambda**. Lambda represents the incremental cost of generating one more unit (megawatt-hour) of electricity. In a typical year (non-leap year), there are **8760 hours**; thus, this industry often refers to the “**8760 lambdas**.” As a general rule, generating units that run to help meet peak energy usage on the system are incrementally more expensive to run than baseload plants (those plants that run in both peak and off-peak times).

**Data**: Use the **Lambdas and RTP Customer Loads** Excel data set for analysis.

**Question 1:** Calculate what the average cost was (in dollars per megawatt-hour) for generating electricity from 12am-6am in the year 2018. What about from 6am-12pm in 2018? What conclusion might you make from this comparison?

**Hints:**
To calculate the average cost (in dollars per megawatt-hour) for generating electricity during specific time periods (e.g., 12am-6am and 6am-12pm) in 2018, you can follow the steps below. We'll first merge the relevant hourly columns for each time period and then calculate the averages.

**Question 2:** Calculate the average cost for generating electricity from 12am-6am and from 6am-12pm in the year 2020. What conclusion might you make from this comparison?

---
### Example R Code for Data Import and Analysis:

In [None]:
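# Load the tidyverse, installing it first only if it is missing
if (!require(tidyverse)) {
    install.packages("tidyverse")
    library(tidyverse)
}

# Load readxl to read the Excel file, installing it first only if it is missing
if (!require(readxl)) {
    install.packages("readxl")
    library(readxl)
}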

# Define the path to the Excel file
file_path <- "data/Lambdas and RTP Customer Loads.xlsx"

# Read the Excel file from the "data" subfolder, skipping the first two rows
lambda_data_2018 <- read_excel(file_path, sheet = "2018 Lambdas", skip = 2)

lambda_data_2018        # prints the first rows of the data, truncated to fit the display
head(lambda_data_2018)  # gives a snapshot of the first six rows of data

# Reshape the dataset: Merging hourly columns into a single column
data_long <- lambda_data_2018 %>%
  pivot_longer(cols = starts_with("hour"), # Select columns to merge
               names_to = "hour",          # Name for the new column with hour identifiers
               values_to = "lambda")       # Name for the new column with lambda values

# View the reshaped dataset
head(data_long, 30)  # shows 30 rows of data, organized with day, hour, and lambda as columns

In [None]:
# Calculate the average cost from 12am-6am (hour01 to hour06)
avg_cost_12am_6am <- data_long %>%
  filter(hour %in% paste0("hour0",1:6)) %>%
  summarise(avg_lambda_12am_6am = mean(lambda, na.rm = TRUE))

avg_cost_12am_6am
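
For the 6am-12pm window, note that `paste0("hour0", 7:12)` would produce labels such as `hour010`, which do not follow the two-digit pattern `hour01` to `hour06` used above. The sketch below builds the labels with `sprintf()` instead. It runs on a small made-up table, not the Alabama Power data: the names `toy_wide`, `toy_long`, and `morning_hours` and all lambda values are hypothetical, and the column names are assumed to follow the `hour01`, ..., `hour12` pattern.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide table: 2 days, 12 hourly columns, made-up lambda values
toy_wide <- as.data.frame(matrix(c(21:32, 31:42), nrow = 2, byrow = TRUE))
names(toy_wide) <- sprintf("hour%02d", 1:12)
toy_wide <- cbind(day = 1:2, toy_wide)

# Reshape to long format, as done for the 2018 data above
toy_long <- toy_wide %>%
  pivot_longer(cols = starts_with("hour"),
               names_to = "hour",
               values_to = "lambda")

# Zero-padded labels for 6am-12pm: "hour07", "hour08", ..., "hour12"
morning_hours <- sprintf("hour%02d", 7:12)

# Average lambda over the 6am-12pm window
avg_cost_6am_12pm <- toy_long %>%
  filter(hour %in% morning_hours) %>%
  summarise(avg_lambda_6am_12pm = mean(lambda, na.rm = TRUE))

avg_cost_6am_12pm
```

With the real data, the same `filter()` step would be applied to `data_long`, provided its hour labels use this two-digit format.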