# Intro to R and Exploratory Data Analysis

## Navigating around Jupyter Notebooks

This is a Jupyter notebook running R. To create a code cell, click the "+" button in the toolbar, or press ESC and then B (insert below) or A (insert above). To run a code cell, click the play button or press SHIFT+ENTER.

Try running the code cell below now.

In [None]:
4 * 5
3 + 6
x <- c(3, 4, 5)
x

The cell above shows examples of simple computations and creates a variable `x`, which stores a vector of three numbers.
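
As a small illustrative sketch (the variable names `doubled`, `n_values`, and `second` are made up for this example), arithmetic on a vector applies to every element at once, and square brackets pull out a single element:

```r
# A vector of three numbers, as in the cell above
x <- c(3, 4, 5)

doubled <- x * 2       # multiplies every element by 2
n_values <- length(x)  # number of elements in x
second <- x[2]         # R indexes from 1, so this is the second element

doubled
n_values
second
```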

## Loading Libraries

Before we begin, we need to load the `tidyverse` package, which includes helpful tools for data manipulation and visualization.

In [None]:
library(tidyverse)

## Loading Data from Army Vantage

In this course, we'll be working with datasets stored in Army Vantage. To load a dataset, we use the `datasets.read_table()` function with the dataset name in quotes.

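**Note:** If you are running this notebook outside Vantage (for example, in Colab), you may need to upload your data file or connect to your data source differently. The cell below reads a CSV copy of the dataset from a URL with `read_csv()`.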

In [None]:
# Load a CSV copy of the dataset from a URL
# If running in Colab with your own data, you may need to upload your CSV file first
dataset_173 <- read_csv("https://raw.githubusercontent.com/lonespear/MA206/refs/heads/main/dataset_173.csv")

# View the first few rows
head(dataset_173)

At any time, you may also view the dataset by clicking on it in the Environment pane (if using RStudio) or by running `View(dataset_173)`.

**Practice:** Load your dataset and answer the following:

- How many rows (observations) does the dataset have?
- How many columns (variables) does the dataset have?
- What are the names of the variables?
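
The questions above can be answered with a few base R functions. The sketch below uses a small made-up data frame (`toy_data` and its column names are hypothetical, not part of dataset_173); the same functions also work on the tibble that `read_csv()` returns.

```r
# Hypothetical stand-in for a loaded dataset; the columns are invented
toy_data <- data.frame(
  soldier_id = 1:4,
  height = c(66, 70, 68, 72),
  weight = c(150, 180, 165, 190)
)

n_rows <- nrow(toy_data)      # number of rows (observations)
n_cols <- ncol(toy_data)      # number of columns (variables)
var_names <- names(toy_data)  # names of the variables

n_rows
n_cols
var_names
```

Replacing `toy_data` with `dataset_173` should give the answers for the real dataset.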

## Descriptive Statistics: Measures of Location

Measures of location tell us about the "center" or "typical" value of our data. We can calculate them directly with the functions `mean()` and `median()`. We can also use these functions inside `summarize()` from tidyverse to create a summary table.

**Example:** Let's calculate the measures of location for the `m4_score` variable in `dataset_173`:

In [None]:
dataset_173 %>%
  summarize(
    mean_score = mean(m4_score),
    median_score = median(m4_score)
  )
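
The two functions can also be called directly on a vector. The sketch below uses made-up scores (not values from dataset_173) with one deliberately extreme value, to show how differently the two measures react to it:

```r
# Made-up scores; the last value is deliberately extreme
scores <- c(2, 3, 4, 5, 100)

mean_score <- mean(scores)      # pulled toward the extreme value
median_score <- median(scores)  # middle value of the sorted data

mean_score    # 22.8
median_score  # 4
```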

**Practice:**

1. Why might the mean and median be different?
2. Calculate the mean and median for another variable in dataset_173.

## Descriptive Statistics: Measures of Spread

Measures of spread tell us how "spread out" or variable our data is.

In [None]:
# Calculate multiple measures of spread
dataset_173 %>%
  summarize(
    min_score = min(m4_score),
    max_score = max(m4_score),
    range = max(m4_score) - min(m4_score),
    variance = var(m4_score),
    std_dev = sd(m4_score),
    iqr = IQR(m4_score),
    q1 = quantile(m4_score, 0.25),
    q3 = quantile(m4_score, 0.75)
  )

**Key Insight:** Standard deviation measures roughly how far, on average, data points fall from the mean. A small SD means the data are clustered tightly around the mean; a large SD means the data are more spread out.
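
To make this concrete, the sketch below computes the sample standard deviation by hand on a made-up vector and compares it with `sd()`. The names `deviations`, `variance_manual`, and `sd_manual` are chosen just for this illustration. Note that `var()` and `sd()` divide by n - 1 rather than n.

```r
# Made-up data whose mean is 5
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Signed distance of each point from the mean
deviations <- x - mean(x)

# Sample variance: sum of squared deviations divided by n - 1
variance_manual <- sum(deviations^2) / (length(x) - 1)

# Standard deviation is the square root of the variance
sd_manual <- sqrt(variance_manual)

sd_manual
sd(x)  # should match sd_manual
```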

**Practice:**

1. Calculate the above measures of spread using another variable in dataset_173.
2. What do these values tell you about the variability in that variable?

## Data Visualization: Histograms

Histograms show the distribution of a single quantitative variable by dividing the data into bins and displaying the frequency of observations in each bin. We'll use `ggplot2` to create professional-looking visualizations.

In [None]:
# Basic histogram
ggplot(dataset_173, aes(x = m4_score)) +
  geom_histogram()

In [None]:
# Histogram with customization
ggplot(dataset_173, aes(x = m4_score)) +
  geom_histogram(bins = 10, fill = "lightblue", color = "black") +
  labs(
    title = "Distribution of M4 Qual Scores",
    x = "Score",
    y = "Frequency"
  ) +
  theme_minimal()
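
To see what the binning step does, the sketch below counts observations per bin by hand in base R. The data and the bin edges are made up for illustration; `geom_histogram()` picks its own bin edges from `bins` or `binwidth`, so its counts will not necessarily match a manual choice like this one.

```r
# Made-up values between 1 and 10
x <- c(1, 2, 2, 3, 5, 6, 6, 7, 9, 10)

# Five bins of width 2; by default each bin includes its right edge, e.g. (0,2]
bins <- cut(x, breaks = c(0, 2, 4, 6, 8, 10))

# Frequency of observations in each bin
bin_counts <- table(bins)
bin_counts
```

Each count corresponds to the height of one bar in a histogram drawn with these bin edges.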

**Practice:**

1. Create a histogram of another variable from dataset_173.
2. Describe the shape: Is it symmetric? Skewed left or right? Are there outliers?

## Data Visualization: Boxplots

Boxplots display the five-number summary (minimum, Q1, median, Q3, maximum) and help identify outliers.

In [None]:
# Basic boxplot
ggplot(dataset_173, aes(y = m4_score)) +
  geom_boxplot()

In [None]:
# Boxplot with customization
ggplot(dataset_173, aes(y = m4_score)) +
  geom_boxplot(fill = "lightgreen", color = "black") +
  labs(
    title = "M4 Qual Scores Boxplot",
    y = "Score"
  ) +
  theme_minimal()

In [None]:
# Horizontal boxplot
ggplot(dataset_173, aes(x = m4_score)) +
  geom_boxplot(fill = "lightgreen", color = "black") +
  labs(
    title = "M4 Qual Scores Boxplot",
    x = "Score"
  ) +
  theme_minimal()

**Understanding Boxplots:**

- Box shows the interquartile range (Q1 to Q3)
- Line in the box is the median
- Whiskers extend to the most extreme data points that lie within 1.5×IQR of the box
- Points beyond whiskers are potential outliers
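
The 1.5×IQR rule in the list above can be checked by hand. This is a minimal sketch on made-up data; the names `lower_fence` and `upper_fence` are chosen for this illustration only.

```r
# Made-up data with one unusually large value
x <- c(10, 12, 13, 14, 15, 16, 18, 40)

q1 <- quantile(x, 0.25, names = FALSE)  # first quartile
q3 <- quantile(x, 0.75, names = FALSE)  # third quartile
iqr <- q3 - q1                          # interquartile range (height of the box)

# Points outside these fences are drawn individually as potential outliers
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

outliers <- x[x < lower_fence | x > upper_fence]
outliers
```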

**Practice:**

1. Create a boxplot of another variable from dataset_173.
2. Are there any outliers? How can you tell?

## Data Visualization: Scatterplots

Scatterplots show the relationship between two quantitative variables.

In [None]:
# Basic scatterplot
ggplot(dataset_173, aes(x = height, y = weight)) +
  geom_point()

In [None]:
# Scatterplot with customization
ggplot(dataset_173, aes(x = height, y = weight)) +
  geom_point(size = 0.5, color = "blue") +
  labs(
    title = "Height vs Weight",
    x = "Height (inches)",
    y = "Weight (lbs)"
  ) +
  theme_minimal()

**Practice:**

1. Create a scatterplot using two other variables from dataset_173.
2. Does there appear to be a relationship between the variables?
3. Is the relationship positive (both increase together) or negative (one increases as the other decreases)?

## Grouped Statistics with group_by()

Often we want to calculate statistics separately for different groups in our data. The `group_by()` function allows us to do this efficiently.

**Example:** Calculate mean M4 scores by age_bracket:

In [None]:
dataset_173 %>%
  group_by(age_bracket) %>%
  summarize(
    n = n(),  # count observations in each group
    mean_m4 = mean(m4_score),
    sd_m4 = sd(m4_score),
    median_m4 = median(m4_score)
  )
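
To check by hand what `group_by()` followed by `summarize()` produces, it can help to run the same pattern on a tiny table. The sketch below uses invented data (`toy`, `group`, and `score` are hypothetical names, not columns of dataset_173):

```r
library(dplyr)

# Hypothetical toy data; the group labels and scores are invented
toy <- tibble(
  group = c("A", "A", "B", "B", "B"),
  score = c(10, 20, 30, 40, 50)
)

# One output row per group, each statistic computed within its group
by_group <- toy %>%
  group_by(group) %>%
  summarize(
    n = n(),                  # count observations in each group
    mean_score = mean(score)  # mean of score within each group
  )

by_group
```

Group A has two rows with mean 15 and group B has three rows with mean 40, which is easy to confirm by eye.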

**Multiple grouping variables:** You can group by more than one variable:

In [None]:
dataset_173 %>%
  group_by(age_bracket, battalion) %>%
  summarize(
    n = n(),
    mean_m4 = mean(m4_score),
    .groups = "drop"  # removes grouping after summarize
  )

**Practice:**

1. Calculate the mean and standard deviation of another variable, grouped by company.
2. How do the groups compare? Which has the highest mean? Which has the most variability?

## Comparative Visualizations: Side-by-Side Boxplots

Side-by-side boxplots allow us to compare the distribution of a quantitative variable across different groups.

In [None]:
# Boxplots by age_bracket
ggplot(dataset_173, aes(x = age_bracket, y = m4_score, fill = age_bracket)) +
  geom_boxplot() +
  labs(
    title = "M4 Scores by age_bracket",
    x = "age_bracket",
    y = "M4 Score"
  ) +
  theme_minimal()

In [None]:
# Boxplots by battalion
ggplot(dataset_173, aes(x = battalion, y = m4_score, fill = battalion)) +
  geom_boxplot() +
  labs(
    title = "M4 Scores by Battalion",
    x = "Battalion",
    y = "M4 Score"
  ) +
  theme_minimal() +
  theme(legend.position = "none")  # remove legend if x-axis is clear

**Practice:**

1. Create side-by-side boxplots comparing another variable across groups.
2. Which group has the highest median? Are there more outliers in any particular group?

## Comparative Visualizations: Faceted Histograms

Faceting creates separate plots for each group, making it easy to compare distributions.

In [None]:
# Histograms faceted by age_bracket
ggplot(dataset_173, aes(x = m4_score, fill = age_bracket)) +
  geom_histogram(bins = 10, color = "black") +
  facet_wrap(~age_bracket) +
  labs(
    title = "Distribution of M4 Scores by age_bracket",
    x = "M4 Score",
    y = "Frequency"
  ) +
  theme_minimal()

In [None]:
# Can also use facet_grid for two grouping variables
ggplot(dataset_173, aes(x = m4_score)) +
  geom_histogram(bins = 10, fill = "steelblue", color = "black") +
  facet_grid(age_bracket ~ battalion) +
  labs(
    title = "M4 Scores by age_bracket and Battalion",
    x = "M4 Score",
    y = "Frequency"
  ) +
  theme_minimal()

**Practice:**

1. Create faceted histograms for another variable.
2. Do the distributions look similar across groups or different?

## Comparative Visualizations: Colored Scatterplots

Adding color to scatterplots helps us see patterns within different groups.

In [None]:
# Scatterplot with points colored by group
ggplot(dataset_173, aes(x = height, y = weight, color = age_bracket)) +
  geom_point(size = 2, alpha = 0.6) +  # alpha controls transparency
  labs(
    title = "Height vs Weight by age_bracket",
    x = "Height (inches)",
    y = "Weight (lbs)",
    color = "age_bracket"
  ) +
  theme_minimal()

In [None]:
# Can also add separate trend lines for each group
ggplot(dataset_173, aes(x = height, y = weight, color = age_bracket)) +
  geom_point(size = 2, alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE) +  # linear trend lines
  labs(
    title = "Height vs Weight by age_bracket",
    x = "Height (inches)",
    y = "Weight (lbs)",
    color = "age_bracket"
  ) +
  theme_minimal()

**Practice:**

1. Create a scatterplot with points colored by a grouping variable.
2. Does the relationship between your two variables appear to be different across groups?

## Comparative Visualizations: Overlapping Histograms

Sometimes it's useful to overlay histograms to directly compare distributions.

In [None]:
# Overlapping histograms
ggplot(dataset_173, aes(x = m4_score, fill = age_bracket)) +
  geom_histogram(bins = 10, alpha = 0.5, position = "identity") +
  labs(
    title = "Overlapping M4 Score Distributions",
    x = "M4 Score",
    y = "Frequency",
    fill = "age_bracket"
  ) +
  theme_minimal()
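
The `position = "identity"` argument matters here because `geom_histogram()` stacks the groups on top of each other by default. The sketch below, on made-up data, builds both versions and uses ggplot2's `layer_data()` to inspect where each bar starts. The object names (`toy`, `p_stack`, `p_overlap`) are hypothetical.

```r
library(ggplot2)

# Made-up scores for two groups that share the values 2 and 3
toy <- data.frame(
  score = c(1, 2, 2, 3, 2, 3, 3, 4),
  group = rep(c("A", "B"), each = 4)
)

# Default behavior: bars for the groups are stacked
p_stack <- ggplot(toy, aes(x = score, fill = group)) +
  geom_histogram(binwidth = 1)

# Overlaid behavior: every bar starts at zero, so transparency is needed
p_overlap <- ggplot(toy, aes(x = score, fill = group)) +
  geom_histogram(binwidth = 1, alpha = 0.5, position = "identity")

bars_stack <- layer_data(p_stack)
bars_overlap <- layer_data(p_overlap)

any(bars_stack$ymin > 0)     # stacked bars can start above zero
all(bars_overlap$ymin == 0)  # overlaid bars all start at zero
```

In a stacked histogram the bar heights of one group are hard to read because they sit on top of another group, which is why the overlaid version is used for direct comparison.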

In [None]:
# Density plots work well for this too
ggplot(dataset_173, aes(x = m4_score, fill = age_bracket)) +
  geom_density(alpha = 0.5) +
  labs(
    title = "M4 Score Distributions by age_bracket",
    x = "M4 Score",
    y = "Density",
    fill = "age_bracket"
  ) +
  theme_minimal()

**Practice:**

1. Create overlapping histograms or density plots for another variable.
2. Which group has the higher values on average? Is there a lot of overlap?

## Putting It All Together: Example Analysis

Here's a complete analysis workflow combining multiple techniques:

In [None]:
# 1. Calculate grouped statistics
dataset_173 %>%
  group_by(battalion) %>%
  summarize(
    n = n(),
    mean_m4 = mean(m4_score),
    sd_m4 = sd(m4_score),
    mean_weight = mean(weight),
    sd_weight = sd(weight)
  )

In [None]:
# 2. Create comparison boxplots
ggplot(dataset_173, aes(x = battalion, y = m4_score, fill = battalion)) +
  geom_boxplot() +
  labs(
    title = "M4 Score Comparison Across Battalions",
    x = "Battalion",
    y = "M4 Score"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

In [None]:
# 3. Create a faceted scatterplot
ggplot(dataset_173, aes(x = height, y = weight)) +
  geom_point(aes(color = age_bracket), alpha = 0.6) +
  facet_wrap(~battalion) +
  labs(
    title = "Height vs Weight by Battalion",
    x = "Height (inches)",
    y = "Weight (lbs)"
  ) +
  theme_minimal()

You will access the assignment via Army Vantage using your Brigade's data.