# Linear Regression: Univariate (R Version)

---
This notebook demonstrates how to train and test a linear regression model in R, using tidyverse and modern R best practices. The examples and explanations follow the same rhythm as the Python version, so you can compare the two approaches side by side.

## Libraries

- **tidyverse**: For data manipulation and visualization (dplyr, ggplot2, readr, tibble, tidyr, etc.)
- **broom**: For tidying model outputs
- **caret**: For train/test splitting and metrics
- **lubridate**: For date handling (if needed)
- **reshape2**: For reshaping the correlation matrix into long format (`melt`) for the heatmap

Let's get started!

In [None]:
# Install packages if not already installed
packages <- c("tidyverse", "broom", "caret", "lubridate", "reshape2")
new_packages <- packages[!(packages %in% installed.packages()[, "Package"])]
if (length(new_packages) > 0) {
  cat("Installing packages:", paste(new_packages, collapse = ", "), "\n")
  install.packages(new_packages, repos = "https://cran.rstudio.com/", dependencies = TRUE)
}

# Load packages with error handling
for (pkg in packages) {
  if (!require(pkg, character.only = TRUE, quietly = TRUE)) {
    stop(paste("Failed to load package:", pkg))
  }
}

## Generate Linear Data Example

Let's generate some linear-looking data, similar to the Python example. We'll use `runif` for uniform random numbers and `rnorm` for normal noise.

In [None]:
set.seed(42)
n <- 100
X <- tibble(x = runif(n, 0, 2))
y <- 4 + 3 * X$x + rnorm(n)
df <- X %>% mutate(y = y)

In [None]:
# Visualize the data
ggplot(df, aes(x = x, y = y)) +
  geom_point(color = 'blue') +
  labs(x = 'x', y = 'y', title = 'Simulated Linear Data') +
  theme_minimal()

## Fit a Linear Model

We'll use `lm()` to fit a linear regression model.

In [None]:
model <- lm(y ~ x, data = df)
summary(model)
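
For a single predictor, the least-squares estimates that `lm()` reports have a closed form: the slope is `cov(x, y) / var(x)` and the intercept is `mean(y) - slope * mean(x)`. The sketch below re-creates the simulated data in base R (so it runs on its own) and checks that the manual estimates agree with `lm()`. The names `slope`, `intercept`, `fit`, and `manual` are illustrative only.

```r
# Re-create the simulated data so the block runs on its own
set.seed(42)
n <- 100
x <- runif(n, 0, 2)
y <- 4 + 3 * x + rnorm(n)

# Closed-form least-squares estimates for one predictor
slope <- cov(x, y) / var(x)
intercept <- mean(y) - slope * mean(x)

# Compare with lm(); the two should agree up to floating-point error
fit <- lm(y ~ x)
manual <- c(intercept, slope)
all.equal(unname(coef(fit)), manual)
```

Both estimates should land near the true values used in the simulation (intercept 4, slope 3), with some deviation caused by the noise term.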

In [None]:
# Add predictions to the data frame
df <- df %>% mutate(y_pred = predict(model, newdata = df))

# Plot data and fitted line
ggplot(df, aes(x = x, y = y)) +
  geom_point(color = 'blue') +
  geom_line(aes(y = y_pred), color = 'red') +
  labs(x = 'x', y = 'y', title = 'Linear Fit') +
  theme_minimal()

## Predict New Values

Let's predict for new x values.

In [None]:
X_new <- tibble(x = c(0.5, 1.75))
y_new_pred <- predict(model, newdata = X_new)
tibble(x = X_new$x, y_pred = y_new_pred)
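
Point predictions say nothing about their uncertainty. As a hedged sketch, `predict()` for `lm` objects accepts `interval = "prediction"`, which returns lower and upper bounds for a single new observation alongside the fitted value. The data frame names `sim` and `new_x` are assumptions made for this example; the data are re-simulated so the block is self-contained.

```r
# Re-create the simulated data so the block runs on its own
set.seed(42)
n <- 100
sim <- data.frame(x = runif(n, 0, 2))
sim$y <- 4 + 3 * sim$x + rnorm(n)
fit <- lm(y ~ x, data = sim)

new_x <- data.frame(x = c(0.5, 1.75))

# interval = "prediction" adds lower/upper bounds for a single new observation
pred_int <- predict(fit, newdata = new_x, interval = "prediction", level = 0.95)
cbind(new_x, pred_int)
```

The result has columns `fit`, `lwr`, and `upr`. Using `interval = "confidence"` instead gives the narrower interval for the mean response at each `x`.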

## Real Data Example: Import and Explore

Let's work with a real dataset. For demonstration we use the built-in `mtcars` dataset; to use your own data, read a CSV with `readr::read_csv()` and your own file path instead.

In [None]:
# Example: Use built-in mtcars dataset for demonstration
data <- as_tibble(mtcars)
data %>% head()
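
If your data live in a CSV file, the import step might look like the sketch below. To keep it runnable without an external file, it first writes `mtcars` to a temporary CSV; `csv_path` and `cars_from_csv` are placeholder names, and in practice `csv_path` would point at your own file. Base R's `read.csv()` is used here; `readr::read_csv()` is the tidyverse counterpart and returns a tibble.

```r
# Write mtcars to a temporary CSV so there is a file to read back;
# csv_path is a stand-in for the path to your own file
csv_path <- tempfile(fileext = ".csv")
write.csv(mtcars, csv_path, row.names = FALSE)

# Read the file and inspect its structure
cars_from_csv <- read.csv(csv_path)
str(cars_from_csv)
```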

## Data Exploration

Let's check the structure, missing values, and summary statistics.

In [None]:
glimpse(data)
colSums(is.na(data))
summary(data)

## Simple Linear Regression Example

Let's predict `mpg` (miles per gallon) from `hp` (horsepower) as a univariate regression.

In [None]:
ggplot(data, aes(x = hp, y = mpg)) +
  geom_point() +
  theme_minimal()

In [None]:
model2 <- lm(mpg ~ hp, data = data)
summary(model2)

In [None]:
data <- data %>% mutate(mpg_pred = predict(model2, newdata = data))
ggplot(data, aes(x = hp, y = mpg)) +
  geom_point(color = 'blue') +
  geom_line(aes(y = mpg_pred), color = 'red') +
  theme_minimal()

## Train/Test Split and Model Evaluation

We'll use `caret::createDataPartition` to split the data.

In [None]:
set.seed(123)
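# list = FALSE returns a one-column matrix; take that column as a plain integer
# vector, because tibbles reject matrix subscripts
train_idx <- createDataPartition(data$mpg, p = 0.8, list = FALSE)[, 1]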
train <- data[train_idx, ]
test <- data[-train_idx, ]

model3 <- lm(mpg ~ hp, data = train)
test <- test %>% mutate(mpg_pred = predict(model3, newdata = test))

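# RMSE and R2 on the held-out test set
rmse <- sqrt(mean((test$mpg - test$mpg_pred)^2))
# R2 here is the squared correlation between observed and predicted values
r2 <- cor(test$mpg, test$mpg_pred)^2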
cat('Root Mean Squared Error:', round(rmse, 2), '\n')
cat('R-squared:', round(r2, 2), '\n')
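
The R-squared above is the squared correlation between observed and predicted values. Another common definition is `1 - SSE/SST`, which is what scikit-learn's `r2_score` computes, so the two notebooks may report different numbers for the same fit. The sketch below computes both on a held-out set. It uses a plain random split with `sample()` as a stand-in for `caret::createDataPartition`, and the names `train_df`, `test_df`, `r2_corr`, and `r2_sse` are illustrative.

```r
set.seed(123)
# Simple random 80/20 split in base R (a stand-in for caret::createDataPartition)
idx <- sample(nrow(mtcars), size = floor(0.8 * nrow(mtcars)))
train_df <- mtcars[idx, ]
test_df <- mtcars[-idx, ]

fit <- lm(mpg ~ hp, data = train_df)
pred <- predict(fit, newdata = test_df)

# Squared correlation: ignores any systematic offset or scaling in the predictions
r2_corr <- cor(test_df$mpg, pred)^2

# 1 - SSE/SST: penalizes biased predictions and can be negative on test data
sse <- sum((test_df$mpg - pred)^2)
sst <- sum((test_df$mpg - mean(test_df$mpg))^2)
r2_sse <- 1 - sse / sst

c(r2_corr = r2_corr, r2_sse = r2_sse)
```

On test data the `1 - SSE/SST` value is never larger than the squared correlation, so a large gap between the two hints at biased predictions.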

## Residual Analysis

Let's check the residuals for homoscedasticity and normality.

In [None]:
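# Named resid_test to avoid masking stats::residuals()
resid_test <- test$mpg - test$mpg_pred
par(mfrow = c(1, 2))
# Residuals vs predicted values: look for a roughly constant spread around zero
plot(test$mpg_pred, resid_test, main = 'Residuals vs Fitted', ylab = 'Residual', xlab = 'Predicted mpg')
abline(h = 0, lty = 2)
qqnorm(resid_test); qqline(resid_test)
par(mfrow = c(1, 1))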
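
The Q-Q plot is a visual check. A formal complement is the Shapiro-Wilk test from base R's `stats` package. The sketch below applies it to the residuals of a model fitted on all of `mtcars`, an assumption made here because the held-out set contains only a handful of rows; `fit` and `sw` are illustrative names.

```r
# Fit on the full dataset so there are enough residuals for the test
fit <- lm(mpg ~ hp, data = mtcars)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal
sw <- shapiro.test(resid(fit))
sw$p.value
```

A small p-value is evidence against normality. With small samples the test has little power, so it is best read together with the plots.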

## Correlation Matrix and Heatmap

Let's check the correlation between numeric variables.

In [None]:
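# Drop the prediction column added earlier so the heatmap shows only the original variables
cor_mat <- cor(data %>% select(where(is.numeric), -any_of("mpg_pred")))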
melted_cor <- melt(cor_mat)
ggplot(melted_cor, aes(Var1, Var2, fill = value)) +
  geom_tile() +
  scale_fill_gradient2(low = 'blue', high = 'red', mid = 'white', midpoint = 0) +
  theme_minimal() +
  labs(title = 'Correlation Heatmap')
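
`melt()` turns the square correlation matrix into a long table with one row per pair of variables, which is the shape `geom_tile()` needs. As a sketch of what that reshaping does, the same long layout can be produced in base R with `as.table()` and `as.data.frame()`. The name `long_cor` is illustrative.

```r
cor_mat <- cor(mtcars)

# as.table() + as.data.frame() gives one row per (row variable, column variable) pair
long_cor <- as.data.frame(as.table(cor_mat), responseName = "value")
head(long_cor)
```

The result has the columns `Var1`, `Var2`, and `value`, matching the aesthetics used in the heatmap code.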

# Time Series Example: Bitcoin Price Forecasting (R Version)

Let's show how to use lagged features for time series forecasting in R. We'll use a simulated time series for demonstration (replace with your own data as needed).

In [None]:
# Simulate a time series (replace with your own data for real use)
set.seed(123)
n <- 200
btc <- tibble(
  date = seq.Date(from = as.Date('2022-01-01'), by = 'day', length.out = n),
  price = cumsum(rnorm(n, 0.1, 2)) + 30000
)
ggplot(btc, aes(x = date, y = price)) +
  geom_line(color = 'blue') +
  labs(title = 'Simulated Bitcoin Price', x = 'Date', y = 'Price (USD)') +
  theme_minimal()

In [None]:
# Create lagged features: <var>_lag1 ... <var>_lagK hold the value 1..K rows earlier
create_lags <- function(df, var, lags = 5) {
  for (i in seq_len(lags)) {
    df[[paste0(var, '_lag', i)]] <- dplyr::lag(df[[var]], i)
  }
  df
}
btc_lagged <- create_lags(btc, 'price', lags = 5) %>% drop_na()
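
To see what the lagging step produces, here is a toy example on five prices. `lag_vec()` is a hypothetical base-R helper written for this sketch to mimic `dplyr::lag(x, k)`; it is not part of any package.

```r
# lag_vec() is a hypothetical base-R helper that mimics dplyr::lag(x, k)
lag_vec <- function(x, k) c(rep(NA, k), head(x, -k))

toy <- data.frame(price = c(10, 20, 30, 40, 50))
toy$price_lag1 <- lag_vec(toy$price, 1)
toy$price_lag2 <- lag_vec(toy$price, 2)
toy

# Rows with NA lags cannot be used for fitting, so they are dropped
toy_complete <- toy[complete.cases(toy), ]
toy_complete
```

Each lag shifts the series down by one more row, so the first rows have no history and contain `NA`. With five lags, the first five rows are lost in the same way.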

In [None]:
# Train/test split (time series: use the last 20% as test, never a random sample)
# Note: train and test now replace the mtcars splits of the same name
n_train <- floor(0.8 * nrow(btc_lagged))
train <- btc_lagged[1:n_train, ]
test <- btc_lagged[(n_train + 1):nrow(btc_lagged), ]

model_ts <- lm(price ~ price_lag1 + price_lag2 + price_lag3 + price_lag4 + price_lag5, data = train)
summary(model_ts)

In [None]:
# Predict and evaluate
test <- test %>% mutate(pred = predict(model_ts, newdata = test))
rmse_ts <- sqrt(mean((test$price - test$pred)^2))
r2_ts <- cor(test$price, test$pred)^2
cat('Time Series RMSE:', round(rmse_ts, 2), '\n')
cat('Time Series R-squared:', round(r2_ts, 2), '\n')
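
An RMSE on its own is hard to judge. One way to put it in context, sketched below under the same simulation settings, is to compare the model against a naive baseline that predicts each price as the previous day's price. The lag matrix is built with base R's `embed()` instead of `create_lags()` so the block is self-contained; `lagged`, `train_df`, `test_df`, `rmse_model`, and `rmse_naive` are illustrative names.

```r
set.seed(123)
n <- 200
price <- cumsum(rnorm(n, 0.1, 2)) + 30000

# embed() builds rows of (price_t, price_{t-1}, ..., price_{t-5})
lagged <- as.data.frame(embed(price, 6))
names(lagged) <- c("price", paste0("price_lag", 1:5))

# Chronological 80/20 split
n_train <- floor(0.8 * nrow(lagged))
train_df <- lagged[seq_len(n_train), ]
test_df <- lagged[(n_train + 1):nrow(lagged), ]

fit <- lm(price ~ ., data = train_df)
rmse_model <- sqrt(mean((test_df$price - predict(fit, newdata = test_df))^2))

# Naive baseline: the next price equals the most recent observed price
rmse_naive <- sqrt(mean((test_df$price - test_df$price_lag1)^2))

c(model = rmse_model, naive = rmse_naive)
```

If the two values are close, the lagged regression adds little beyond carrying the last price forward, even when its R-squared looks high.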

In [None]:
# Plot true vs predicted
ggplot(test, aes(x = price, y = pred)) +
  geom_point(color = 'red') +
  geom_abline(slope = 1, intercept = 0, linetype = 'dashed') +
  labs(title = 'True vs Predicted Bitcoin Price', x = 'True', y = 'Predicted') +
  theme_minimal()

# Linear Regression Assumptions (R Version)

1. **Linearity**: Relationship between predictors and target is linear.
2. **No (or little) multicollinearity**: Predictors are not highly correlated.
3. **Homoscedasticity**: Residuals have constant variance.
4. **Normality of residuals**: Residuals are normally distributed.
5. **Independence of residuals**: No autocorrelation in residuals (important for time series).

Let's check these for the time series model fitted above.

In [None]:
# 1. Linearity
ggplot(train, aes(x = price_lag1, y = price)) +
  geom_point() +
  geom_smooth(method = 'lm', se = FALSE, color = 'red') +
  labs(title = 'Linearity Check', x = 'Lag 1 Price', y = 'Price')

In [None]:
# 2. Multicollinearity
cor(train %>% select(starts_with('price_lag')))
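
Pairwise correlations can be summarized per predictor with the variance inflation factor, `VIF_j = 1 / (1 - R2_j)`, where `R2_j` comes from regressing predictor `j` on the remaining predictors. The sketch below computes it by hand in base R on the same simulated series. For brevity it uses all rows, not only the training portion, and `lags` and `vif` are names chosen for this example.

```r
set.seed(123)
price <- cumsum(rnorm(200, 0.1, 2)) + 30000

# Columns 2..6 of embed() are the five lagged series
lags <- as.data.frame(embed(price, 6)[, -1])
names(lags) <- paste0("price_lag", 1:5)

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing lag j on the other lags
vif <- sapply(names(lags), function(v) {
  f <- reformulate(setdiff(names(lags), v), response = v)
  r2 <- summary(lm(f, data = lags))$r.squared
  1 / (1 - r2)
})
round(vif, 1)
```

A common rule of thumb treats values above about 5 to 10 as a sign of strong multicollinearity. Consecutive lags of a trending series usually exceed that by a wide margin, which makes the individual coefficients unstable even when predictions are fine.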

In [None]:
# 3. Homoscedasticity
resid_ts <- test$price - test$pred
plot(resid_ts, main = 'Residuals (Time Series)', ylab = 'Residual', xlab = 'Index')

In [None]:
# 4. Normality of residuals
qqnorm(resid_ts); qqline(resid_ts)

In [None]:
# 5. Independence (autocorrelation)
acf(resid_ts, main = 'ACF of Residuals')
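
The ACF plot can be backed by numbers. The sketch below computes the Durbin-Watson statistic by hand and runs a Ljung-Box test with `Box.test()` from base R. As a simplification it uses the in-sample residuals of a model fitted on the whole simulated series; the names `lagged`, `e`, `dw`, and `lb` are illustrative.

```r
set.seed(123)
price <- cumsum(rnorm(200, 0.1, 2)) + 30000
lagged <- as.data.frame(embed(price, 6))
names(lagged) <- c("price", paste0("price_lag", 1:5))

fit <- lm(price ~ ., data = lagged)
e <- resid(fit)

# Durbin-Watson statistic: values near 2 suggest little lag-1 autocorrelation
dw <- sum(diff(e)^2) / sum(e^2)

# Ljung-Box test: the null hypothesis is no autocorrelation up to the given lag
lb <- Box.test(e, lag = 10, type = "Ljung-Box")

c(durbin_watson = dw, ljung_box_p = lb$p.value)
```

When lagged values of the response are used as predictors, the Durbin-Watson statistic is known to be biased toward 2, so treat it as a rough indicator and give more weight to the Ljung-Box result and the ACF plot.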

## Conclusion

This notebook covered univariate and time series linear regression in R, using tidyverse and modern R idioms. The structure and explanations mirror the Python notebook, so you can compare the two approaches directly.

For more advanced modeling, check out the `tidymodels` ecosystem in R!