# Linear Regression: Univariate

---
This notebook demonstrates how to train and test a linear regression model in R, using tidyverse and modern R best practices. The examples and explanations are designed to mirror the Python version, so you can compare the two approaches side by side.

## What you'll learn

- How to generate and visualize linear data in R
- How to fit a linear regression model using `lm()`
- How to make predictions and interpret model coefficients
- How to use real-world data (e.g., `mtcars`)
- How to split data into train/test sets and evaluate model performance
- How to check model assumptions (linearity, multicollinearity, homoscedasticity, normality, independence)
- How to use time series data and create lagged features for forecasting

## Libraries

- **tidyverse**: For data manipulation and visualization (dplyr, ggplot2, readr, tibble, etc.)
- **broom**: For tidying model outputs
- **caret**: For train/test splitting and metrics
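- **lubridate**: For date handling (if needed)
- **reshape2**: For reshaping the correlation matrix before plotting (`melt()`)
- **gridExtra**: For arranging plots side by side (`grid.arrange()`)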

Let's get started!

> **Note:** This notebook is heavily commented and includes explanations for each step, just like the Python version. If you are new to R, pay attention to the code comments and markdown cells for guidance.

In [None]:
# Install packages if not already installed
packages <- c("tidyverse", "broom", "caret", "lubridate", "reshape2", "gridExtra")
new_packages <- packages[!(packages %in% installed.packages()[, "Package"])]
if (length(new_packages) > 0) {
  cat("Installing packages:", paste(new_packages, collapse = ", "), "\n")
  install.packages(new_packages, repos = "https://cran.rstudio.com/", dependencies = TRUE)
}

# Load packages with error handling
for (pkg in packages) {
  if (!require(pkg, character.only = TRUE, quietly = TRUE)) {
    stop(paste("Failed to load package:", pkg))
  }
}

## Generate Linear Data Example

Let's generate some linear-looking data, similar to the Python example. We'll use `runif` for uniform random numbers and `rnorm` for normal noise.

In [None]:
# Set a random seed for reproducibility (so results are the same each run)
set.seed(42)
n <- 100
# Generate 100 random x values between 0 and 2
X <- tibble(x = runif(n, 0, 2))
# Generate y values with a linear relationship plus some random noise
y <- 4 + 3 * X$x + rnorm(n)
# Combine x and y into a single data frame
df <- X %>% mutate(y = y)

In [None]:
# Visualize the data
ggplot(df, aes(x = x, y = y)) +
  geom_point(color = 'blue') +
  labs(x = 'x', y = 'y', title = 'Simulated Linear Data') +
  theme_minimal()

## Fit a Linear Model

We'll use `lm()` to fit a linear regression model.

In [None]:
# Fit a linear regression model: y = intercept + slope * x
model <- lm(y ~ x, data = df)
# Show a summary of the model (coefficients, R-squared, etc.)
summary(model)
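
As a cross-check on the summary output, the sketch below recomputes the slope and intercept from the closed-form formulas for simple regression and compares them with `coef()`. It regenerates the same simulated data in base R so that it runs on its own. The names `fit`, `slope_manual`, and `intercept_manual` are illustrative and not part of the notebook above.

```r
# Recreate the simulated data (base R only, so the block is self-contained)
set.seed(42)
n <- 100
x <- runif(n, 0, 2)
y <- 4 + 3 * x + rnorm(n)

# Fit with lm() and extract the estimated coefficients
fit <- lm(y ~ x)
print(coef(fit))

# Closed-form estimates for simple regression:
#   slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x)
slope_manual <- cov(x, y) / var(x)
intercept_manual <- mean(y) - slope_manual * mean(x)
print(c(intercept = intercept_manual, slope = slope_manual))

# 95% confidence intervals for both coefficients
print(confint(fit, level = 0.95))
```

Because the data were simulated from y = 4 + 3x plus noise, the estimates should land near 4 and 3, and `confint()` gives a sense of how much they could vary from sample to sample.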

In [None]:
# Add predictions to the data frame
df <- df %>% mutate(y_pred = predict(model, newdata = df))

# Plot data and fitted line
ggplot(df, aes(x = x, y = y)) +
  geom_point(color = 'blue') +
  geom_line(aes(y = y_pred), color = 'red') +
  labs(x = 'x', y = 'y', title = 'Linear Fit') +
  theme_minimal()

## Predict New Values

Let's predict for new x values.

In [None]:
# Predict y for new x values using the fitted model
X_new <- tibble(x = c(0.5, 1.75))
y_new_pred <- predict(model, newdata = X_new)
# Show the predicted values
tibble(x = X_new$x, y_pred = y_new_pred)
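
Point predictions hide how uncertain each prediction is. As a rough sketch, `predict()` for `lm` objects can also return interval estimates through its `interval` argument. The block below rebuilds the simulated data in base R; `sim`, `new_x`, `pred_int`, and `conf_int` are illustrative names.

```r
# Recreate the simulated data and model (base R only)
set.seed(42)
n <- 100
sim <- data.frame(x = runif(n, 0, 2))
sim$y <- 4 + 3 * sim$x + rnorm(n)
fit <- lm(y ~ x, data = sim)

new_x <- data.frame(x = c(0.5, 1.75))

# Prediction interval: range expected to contain a single new observation
pred_int <- predict(fit, newdata = new_x, interval = "prediction", level = 0.95)
# Confidence interval: range for the mean response at each x
conf_int <- predict(fit, newdata = new_x, interval = "confidence", level = 0.95)

print(cbind(new_x, pred_int))
print(cbind(new_x, conf_int))
```

The prediction interval is always wider than the confidence interval because it also accounts for the noise in an individual observation.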

## Real Data Example: Import and Explore

Next, let's work with a real dataset. For demonstration we'll use the built-in `mtcars` dataset; to use your own data, load a CSV file instead (for example with `readr::read_csv()`).

In [None]:
# Example: Use built-in mtcars dataset for demonstration
# (In practice, you would load your own CSV file here)
data <- as_tibble(mtcars)
# Show the first few rows of the data
data %>% head()
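
If your data lives in a CSV file, loading it might look like the sketch below. To stay runnable without an external file, it first writes `mtcars` to a temporary CSV and then reads it back with base R's `read.csv()`. The path `csv_path` and the name `cars` are placeholders; in a tidyverse workflow, `readr::read_csv()` plays the same role.

```r
# Write mtcars to a temporary CSV so there is a file to read (stand-in for your own file)
csv_path <- tempfile(fileext = ".csv")
write.csv(mtcars, csv_path, row.names = FALSE)

# Read the CSV back into a data frame
cars <- read.csv(csv_path)

# Quick sanity checks: structure and missing values per column
str(cars)
print(colSums(is.na(cars)))

# Clean up the temporary file
unlink(csv_path)
```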

## Data Exploration

Let's check the structure, missing values, and summary statistics.

In [None]:
# Check the structure of the data (column types, etc.)
glimpse(data)
# Check for missing values in each column
colSums(is.na(data))
# Show summary statistics for each column
summary(data)

## Simple Linear Regression Example

Let's predict `mpg` (miles per gallon) from `hp` (horsepower) as a univariate regression.

In [None]:
# Visualize the relationship between horsepower (hp) and miles per gallon (mpg)
ggplot(data, aes(x = hp, y = mpg)) +
  geom_point() +
  theme_minimal()

In [None]:
# Fit a simple linear regression: mpg ~ hp
model2 <- lm(mpg ~ hp, data = data)
# Show model summary
summary(model2)
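
To make the summary easier to interpret, the sketch below pulls out the slope, intercept, and R-squared and prints them in the units of the data. It fits the same `mpg ~ hp` model directly on `mtcars`; the names `fit_hp`, `slope`, `intercept`, and `r_squared` are illustrative.

```r
# Fit mpg ~ hp on the built-in dataset
fit_hp <- lm(mpg ~ hp, data = mtcars)

# Pull out the individual pieces of the fit
intercept <- unname(coef(fit_hp)["(Intercept)"])
slope <- unname(coef(fit_hp)["hp"])
r_squared <- summary(fit_hp)$r.squared

# Express the coefficients in plain terms
cat("Predicted mpg at hp = 0 (an extrapolation):", round(intercept, 2), "\n")
cat("Change in mpg per +1 hp:", round(slope, 4), "\n")
cat("Change in mpg per +100 hp:", round(100 * slope, 2), "\n")
cat("Share of mpg variance explained:", round(r_squared, 3), "\n")
```

The slope is negative, meaning that cars with more horsepower tend to get fewer miles per gallon. With a single predictor, R-squared equals the squared correlation between `mpg` and `hp`.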

In [None]:
# Add predictions to the data frame and plot the fit
data <- data %>% mutate(mpg_pred = predict(model2, newdata = data))
ggplot(data, aes(x = hp, y = mpg)) +
  geom_point(color = 'blue') +
  geom_line(aes(y = mpg_pred), color = 'red') +
  theme_minimal()

## Train/Test Split and Model Evaluation

We'll use `caret::createDataPartition` to split the data.

In [None]:
# Split the data into training and test sets (80% train, 20% test)
set.seed(123)
train_idx <- createDataPartition(data$mpg, p = 0.8, list = FALSE)
train <- data[train_idx, ]
test <- data[-train_idx, ]

# Fit the model on the training set
model3 <- lm(mpg ~ hp, data = train)
# Predict on the test set
test <- test %>% mutate(mpg_pred = predict(model3, newdata = test))

# Calculate RMSE (Root Mean Squared Error) and R-squared
# (R-squared is computed here as the squared correlation between actual and predicted values)
rmse <- sqrt(mean((test$mpg - test$mpg_pred)^2))
r2 <- cor(test$mpg, test$mpg_pred)^2
cat('Root Mean Squared Error:', round(rmse, 2), '\n')
cat('R-squared:', round(r2, 2), '\n')
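
On held-out data, the squared correlation and the coefficient of determination (1 - SS_res / SS_tot) can disagree, so it can help to report both alongside RMSE and MAE. The sketch below uses a plain random split with `sample()` instead of `caret::createDataPartition`, purely to stay dependency-free. The test size of 6 and all variable names are illustrative assumptions.

```r
# Simple random hold-out split (assumption: 6 test rows out of 32)
set.seed(123)
test_idx <- sample(nrow(mtcars), size = 6)
train_set <- mtcars[-test_idx, ]
test_set <- mtcars[test_idx, ]

# Fit on the training rows, predict on the held-out rows
fit <- lm(mpg ~ hp, data = train_set)
pred <- predict(fit, newdata = test_set)

# Error metrics on the test set
rmse <- sqrt(mean((test_set$mpg - pred)^2))
mae <- mean(abs(test_set$mpg - pred))

# Two versions of "R-squared" on held-out data
ss_res <- sum((test_set$mpg - pred)^2)
ss_tot <- sum((test_set$mpg - mean(test_set$mpg))^2)
r2_cod <- 1 - ss_res / ss_tot          # coefficient of determination
r2_cor <- cor(test_set$mpg, pred)^2    # squared correlation

cat("RMSE:", round(rmse, 2), " MAE:", round(mae, 2), "\n")
cat("R2 (1 - SS_res/SS_tot):", round(r2_cod, 2), "\n")
cat("R2 (squared correlation):", round(r2_cor, 2), "\n")
```

The coefficient of determination can never exceed the squared correlation and can even be negative when the model predicts worse than the test-set mean, which the squared correlation cannot reveal.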

## Residual Analysis

Let's check the residuals for homoscedasticity and normality.

In [None]:
# Calculate residuals (difference between actual and predicted values)
residuals <- test$mpg - test$mpg_pred
residual_df <- tibble(index = seq_along(residuals), residuals = residuals)

# Plot residuals to check for patterns (should look random if model is good)
p1 <- ggplot(residual_df, aes(x = index, y = residuals)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = 'dashed', color = 'red') +
  labs(title = 'Residuals', x = 'Index', y = 'Residual') +
  theme_minimal()

# Q-Q plot to check if residuals are normally distributed
p2 <- ggplot(residual_df, aes(sample = residuals)) +
  stat_qq() +
  stat_qq_line() +
  labs(title = 'Q-Q Plot', x = 'Theoretical Quantiles', y = 'Sample Quantiles') +
  theme_minimal()

# Display plots side by side
library(gridExtra)
grid.arrange(p1, p2, ncol = 2)
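
Plots can be complemented with a few numeric checks. The sketch below is one possible companion, not part of the original workflow: it uses the in-sample residuals of `mpg ~ hp` fitted on all of `mtcars`, compares residual spread across low and high fitted values as a crude homoscedasticity check, and runs a Shapiro-Wilk normality test (`shapiro.test()` from base R's stats package). All names are illustrative.

```r
# Fit on the full dataset and take in-sample residuals
fit_hp <- lm(mpg ~ hp, data = mtcars)
res <- resid(fit_hp)
fitted_vals <- fitted(fit_hp)

# OLS with an intercept forces the residuals to sum to (numerically) zero
cat("Sum of residuals:", signif(sum(res), 3), "\n")

# Crude spread comparison: residual SD for low vs high fitted values
lower_half <- fitted_vals <= median(fitted_vals)
cat("Residual SD (low fitted): ", round(sd(res[lower_half]), 2), "\n")
cat("Residual SD (high fitted):", round(sd(res[!lower_half]), 2), "\n")

# Shapiro-Wilk test: a small p-value suggests departure from normality
sw <- shapiro.test(res)
print(sw)
```

With only 32 observations, these checks have limited power, so treat them as a supplement to the plots rather than a verdict.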

## Correlation Matrix and Heatmap

Let's check the correlation between numeric variables.

In [None]:
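# Calculate correlation matrix for numeric variables
# (exclude mpg_pred, the prediction column added earlier, since it is a linear function of hp)
cor_mat <- cor(data %>% select(where(is.numeric), -any_of("mpg_pred")))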
# Reshape for plotting
melted_cor <- melt(cor_mat)
# Plot a heatmap of correlations
ggplot(melted_cor, aes(Var1, Var2, fill = value)) +
  geom_tile() +
  scale_fill_gradient2(low = 'blue', high = 'red', mid = 'white', midpoint = 0) +
  theme_minimal() +
  labs(title = 'Correlation Heatmap')
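
The reshaping step turns the square correlation matrix into a long table with one row per pair of variables, which is the layout `geom_tile()` expects. As a small illustration of that layout, here is a base R equivalent on three columns; `cor_long` is an illustrative name, and `as.data.frame(as.table(...))` is a stand-in for `melt()`, not what the notebook itself uses.

```r
# Correlation matrix for a small subset of columns
cor_mat <- cor(mtcars[, c("mpg", "hp", "wt")])
print(round(cor_mat, 2))

# Long format: one row per (Var1, Var2) pair with its correlation value
cor_long <- as.data.frame(as.table(cor_mat), responseName = "value")
print(cor_long)
```

A 3 x 3 matrix becomes 9 rows, and the rows where both variable names match hold the diagonal values of 1.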

## Time Series Example: Bitcoin Price Forecasting

Let's show how to use lagged features for time series forecasting in R. We'll use a simulated time series for demonstration (replace with your own data as needed).

In [None]:
# Simulate a time series (replace with your own data for real use)
# Here we create a fake Bitcoin price series for demonstration
set.seed(123)
n <- 200
btc <- tibble(
  date = seq.Date(from = as.Date('2022-01-01'), by = 'day', length.out = n),
  price = cumsum(rnorm(n, 0.1, 2)) + 30000
)
# Plot the simulated price series
ggplot(btc, aes(x = date, y = price)) +
  geom_line(color = 'blue') +
  labs(title = 'Simulated Bitcoin Price', x = 'Date', y = 'Price (USD)') +
  theme_minimal()

In [None]:
# Create lagged features for time series forecasting
# This function adds columns for previous values (lags) of the target variable
create_lags <- function(df, var, lags = 5) {
  # seq_len() handles lags = 0 safely (1:0 would loop over c(1, 0))
  for (i in seq_len(lags)) {
    df[[paste0(var, '_lag', i)]] <- dplyr::lag(df[[var]], i)
  }
  df
}
# Add 5 lagged features and drop rows with NA (due to lag)
btc_lagged <- create_lags(btc, 'price', lags = 5) %>% drop_na()
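
To see what lagging does to the data, it can help to look at a tiny example. The sketch below uses a hypothetical base R helper, `lag_vec()`, which mimics the shifting behavior of `dplyr::lag()` on a five-value vector; the prices are made up for illustration.

```r
# Hypothetical helper: shift a vector down by k positions, padding the start with NA
lag_vec <- function(x, k) c(rep(NA, k), head(x, -k))

# Made-up toy prices
prices <- c(10, 12, 11, 15, 14)

toy <- data.frame(
  price = prices,
  price_lag1 = lag_vec(prices, 1),  # yesterday's price
  price_lag2 = lag_vec(prices, 2)   # price two days ago
)
print(toy)

# The first rows have no history, so they are dropped before modeling
toy_complete <- toy[complete.cases(toy), ]
print(toy_complete)
```

Each lag costs one row at the start of the series, so using 5 lags on 200 observations leaves 195 usable rows.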

In [None]:
# Split the time series into train and test sets (use last 20% as test)
n_train <- floor(0.8 * nrow(btc_lagged))
train <- btc_lagged[1:n_train, ]
test <- btc_lagged[(n_train + 1):nrow(btc_lagged), ]

# Fit a linear model using lagged features to predict price
model_ts <- lm(price ~ price_lag1 + price_lag2 + price_lag3 + price_lag4 + price_lag5, data = train)
# Show model summary
summary(model_ts)

In [None]:
# Predict on the test set and evaluate performance
# (R-squared is computed here as the squared correlation between actual and predicted prices)
test <- test %>% mutate(pred = predict(model_ts, newdata = test))
rmse_ts <- sqrt(mean((test$price - test$pred)^2))
r2_ts <- cor(test$price, test$pred)^2
cat('Time Series RMSE:', round(rmse_ts, 2), '\n')
cat('Time Series R-squared:', round(r2_ts, 2), '\n')
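
For a series that behaves like a random walk, error metrics are easier to judge against a naive baseline that predicts tomorrow's price as today's price. The sketch below is one way to set up that comparison. It re-simulates the series in base R and, to keep it short, uses a single lag rather than the five lags above; all names are illustrative.

```r
# Re-simulate the price series (same recipe as the notebook, base R only)
set.seed(123)
n <- 200
price <- cumsum(rnorm(n, 0.1, 2)) + 30000

# Pair each price with the previous day's price
lagged <- data.frame(price = price[-1], price_lag1 = price[-n])

# Chronological split: first 80% for training, last 20% for testing
n_train <- floor(0.8 * nrow(lagged))
train_ts <- lagged[seq_len(n_train), ]
test_ts <- lagged[(n_train + 1):nrow(lagged), ]

fit <- lm(price ~ price_lag1, data = train_ts)

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Model forecast vs naive "no change" forecast
rmse_model <- rmse(test_ts$price, predict(fit, newdata = test_ts))
rmse_naive <- rmse(test_ts$price, test_ts$price_lag1)

cat("RMSE (linear model):", round(rmse_model, 2), "\n")
cat("RMSE (naive lag-1): ", round(rmse_naive, 2), "\n")
```

If the model's RMSE is not clearly below the naive RMSE, the lagged regression is adding little beyond persistence.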

In [None]:
# Plot true vs predicted prices for the test set
ggplot(test, aes(x = price, y = pred)) +
  geom_point(color = 'red') +
  geom_abline(slope = 1, intercept = 0, linetype = 'dashed') +
  labs(title = 'True vs Predicted Bitcoin Price', x = 'True', y = 'Predicted') +
  theme_minimal()

## Linear Regression Assumptions

1. **Linearity**: Relationship between predictors and target is linear.
2. **No (or little) multicollinearity**: Predictors are not highly correlated.
3. **Homoscedasticity**: Residuals have constant variance.
4. **Normality of residuals**: Residuals are normally distributed.
5. **Independence of residuals**: No autocorrelation in residuals (important for time series).

Let's check these for the time series model fitted above.

In [None]:
# 1. Linearity: Check if the relationship between lagged price and current price is linear
ggplot(train, aes(x = price_lag1, y = price)) +
  geom_point() +
  geom_smooth(method = 'lm', se = FALSE, color = 'red') +
  labs(title = 'Linearity Check', x = 'Lag 1 Price', y = 'Price')

In [None]:
# 2. Multicollinearity: Check if lagged features are highly correlated
cor(train %>% select(starts_with('price_lag')))
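
A correlation matrix shows pairwise relationships; a variance inflation factor (VIF) summarizes how well each predictor is explained by all the others together. The sketch below computes VIFs by hand as 1 / (1 - R²) from auxiliary regressions, using three lags of a re-simulated series. It is a manual illustration under those assumptions, not a call to a dedicated package, and all names are illustrative.

```r
# Re-simulate the price series (base R only)
set.seed(123)
n <- 200
price <- cumsum(rnorm(n, 0.1, 2)) + 30000

# embed() builds a matrix whose columns are price[t], price[t-1], price[t-2], price[t-3]
lag_mat <- embed(price, 4)
lags <- as.data.frame(lag_mat[, -1])   # keep only the lagged columns
names(lags) <- paste0("price_lag", 1:3)

# VIF for each lag: regress it on the other lags and apply 1 / (1 - R^2)
vif <- sapply(names(lags), function(v) {
  others <- setdiff(names(lags), v)
  r2 <- summary(lm(reformulate(others, response = v), data = lags))$r.squared
  1 / (1 - r2)
})
print(round(vif, 1))
```

A VIF of 1 means no collinearity, and larger values mean more. Consecutive lags of a trending series are typically very strongly collinear, which inflates the standard errors of the individual lag coefficients even when predictions are unaffected.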

In [None]:
# 3. Homoscedasticity: Plot residuals to check for constant variance
resid_ts <- test$price - test$pred
resid_ts_df <- tibble(index = seq_along(resid_ts), residuals = resid_ts)
ggplot(resid_ts_df, aes(x = index, y = residuals)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = 'dashed', color = 'red') +
  labs(title = 'Residuals (Time Series)', x = 'Index', y = 'Residual') +
  theme_minimal()

In [None]:
# 4. Normality of residuals: Q-Q plot to check if residuals are normally distributed
ggplot(resid_ts_df, aes(sample = residuals)) +
  stat_qq() +
  stat_qq_line() +
  labs(title = 'Q-Q Plot of Residuals', x = 'Theoretical Quantiles', y = 'Sample Quantiles') +
  theme_minimal()

In [None]:
# 5. Independence (autocorrelation): Plot autocorrelation of residuals
acf_result <- acf(resid_ts, plot = FALSE)
acf_df <- tibble(
  lag = as.numeric(acf_result$lag),
  acf = as.numeric(acf_result$acf)
)
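# Approximate 95% bounds for white noise: +/- 1.96 / sqrt(number of residuals)
ci <- qnorm(0.975) / sqrt(length(resid_ts))
ggplot(acf_df, aes(x = lag, y = acf)) +
  geom_hline(yintercept = 0, color = 'black') +
  geom_segment(aes(xend = lag, yend = 0)) +
  geom_hline(yintercept = c(-ci, ci), linetype = 'dashed', color = 'blue') +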
  labs(title = 'ACF of Residuals', x = 'Lag', y = 'ACF') +
  theme_minimal()
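
The ACF plot can be backed up with numbers. As a sketch, the block below computes the approximate 95% white-noise band (1.96 / sqrt(n)), counts how many autocorrelations fall outside it, and runs a Ljung-Box test with `Box.test()` from base R's stats package. It uses in-sample residuals from a single-lag model on a re-simulated series; the choice of 10 lags for the test is an arbitrary assumption, and all names are illustrative.

```r
# Re-simulate the series and fit a one-lag model (base R only)
set.seed(123)
n <- 200
price <- cumsum(rnorm(n, 0.1, 2)) + 30000
lagged <- data.frame(price = price[-1], price_lag1 = price[-n])
fit <- lm(price ~ price_lag1, data = lagged)
res <- resid(fit)

# Approximate 95% band for the autocorrelations of white noise
band <- qnorm(0.975) / sqrt(length(res))

# Sample autocorrelations, dropping lag 0 (which is always 1)
acf_vals <- acf(res, plot = FALSE)$acf[-1]

cat("Approx. 95% band: +/-", round(band, 3), "\n")
cat("Lags outside band:", sum(abs(acf_vals) > band), "of", length(acf_vals), "\n")

# Ljung-Box test: a small p-value suggests the residuals are autocorrelated
lb <- Box.test(res, lag = 10, type = "Ljung-Box")
print(lb)
```

Note that the band depends on the number of residuals, so a fixed cutoff is only appropriate for one particular sample size.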

## Conclusion

This notebook covered univariate and time series linear regression in R, using tidyverse and modern R idioms. The structure and explanations mirror the Python notebook, so you can compare the two approaches directly.

### Key takeaways
- You learned how to generate, visualize, and model linear data in R
- You saw how to use train/test splits and evaluate model performance
- You checked model assumptions visually
- You saw how to use lagged features for time series forecasting

For more advanced modeling, check out the `tidymodels` ecosystem in R!

> **Tip:** If you want to see more detailed explanations or have questions about any step, compare this notebook with the Python version or check the code comments above.