Part (c) - Regularization
UE23CS342AA2 - Data Analytics

There are 4 sections in this worksheet.

Pranav Rao P - pranavraop2023@gmail.com

Name: Rithvik Rajesh Matta

SRN:PES2UG23CS485

Sec:H

## Importance of Regularization
In predictive modeling, a model that is closely fitted to the training data might even pick up not just the true underlying patterns but also noise and random fluctuations. This phenomenon, known as overfitting, results in poor generalization performance,ie. the model performs well on the training data but cannot retain accuracy when it is applied to new unseen data.

Regularization solves this challenge by adding a penalty term to the loss function of the model. The term discourages overmodeling(very large coefficient values) and urges the model to balance complexity and simplicity in fitting the data. Regularization helps the model become capable of generalizing from the training set.

We discuss two well-known regularization methods in this section.

* **Ridge Regression (L2 Regularization)**: Adds a penalty proportional to the square of the coefficients. It draws all the coefficients towards zero but never brings them to zero, thus can be applied where there is multicollinearity.

* **Lasso Regression (L1 Regularization)**: Dampens by an amount proportional to the absolute coefficient value. It can even set some of the coefficients to exactly zero, thus performing automatic feature selection.

Let's have a look at the task at hand and the data that it uses.



### Task: Predicting Player Rating in Valorant  
You're working as a data analyst for an esports coaching team. Your task is to build a predictive model that estimates a **player’s match rating** based on in-game performance metrics.

You’ll use `Valorant_Player_Data.csv`, which includes features like:  
- Kills  
- Deaths  
- Average Combat Score (ACS)  
- Head-shot %  
- First Blood Count and more 

You’ll compare between **Ridge and Lasso Regression** and evaluate which model generalizes better.


### Data Visualisation

In [3]:
library(tidyverse)  

df <- read_csv("/kaggle/input/worksheet-2-lasso-ridge/Valorant_Player_Data.csv" , show_col_types = FALSE)

head(df)

playerName,team,rating,region,playerCategory,average_combat_score,kill_deaths,kill_assists_survived_traded,average_damage_per_round,kills_per_round,assists_per_round,first_kills_per_round,first_deaths_per_round,headshot_percentage,clutch_success_percentage
<chr>,<chr>,<dbl>,<chr>,<chr>,<dbl>,<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<chr>
Kouf,AMB,1.53,Americas,vct-challengers,298.0,1.48,74%,185.9,1.03,0.36,0.13,0.03,35%,15%
nelu,63,1.31,Americas,vct-challengers,266.7,1.15,75%,182.0,0.86,0.4,0.06,0.06,30%,21%
welyy,Blue,1.31,Americas,vct-challengers,240.9,1.26,74%,164.1,0.82,0.39,0.07,0.04,27%,10%
ShoT_UP,TOR,1.29,Americas,vct-challengers,240.2,1.25,78%,158.6,0.82,0.4,0.06,0.06,25%,25%
mada,NRG,1.26,Americas,vct-challengers,268.8,1.34,76%,172.9,0.93,0.19,0.24,0.13,26%,11%
MattyIce,Equi,1.26,Americas,vct-challengers,256.5,1.27,58%,166.1,1.0,0.06,0.03,0.18,42%,17%


In [2]:
str(df)


spc_tbl_ [3,123 × 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ playerName                  : chr [1:3123] "Kouf" "nelu" "welyy" "ShoT_UP" ...
 $ team                        : chr [1:3123] "AMB" "63" "Blue" "TOR" ...
 $ rating                      : num [1:3123] 1.53 1.31 1.31 1.29 1.26 1.26 1.24 1.24 1.23 1.23 ...
 $ region                      : chr [1:3123] "Americas" "Americas" "Americas" "Americas" ...
 $ playerCategory              : chr [1:3123] "vct-challengers" "vct-challengers" "vct-challengers" "vct-challengers" ...
 $ average_combat_score        : num [1:3123] 298 267 241 240 269 ...
 $ kill_deaths                 : num [1:3123] 1.48 1.15 1.26 1.25 1.34 1.27 1.19 1.32 1.13 1.23 ...
 $ kill_assists_survived_traded: chr [1:3123] "74%" "75%" "74%" "78%" ...
 $ average_damage_per_round    : num [1:3123] 186 182 164 159 173 ...
 $ kills_per_round             : num [1:3123] 1.03 0.86 0.82 0.82 0.93 1 0.89 0.94 0.86 0.92 ...
 $ assists_per_round           : num [1:3123] 0.36 0.4 

**1)** What steps did you take to clean the input data before modeling? Mention how infinities, nulls, and constants were handled. (0.5 points)

In [4]:
library(tidyverse)

# Load
df <- read_csv("/kaggle/input/worksheet-2-lasso-ridge/Valorant_Player_Data.csv" , show_col_types = FALSE)

# --- Cleaning Step 1: Convert percentage columns ---
df <- df %>%
  mutate(
    kill_assists_survived_traded = as.numeric(str_remove(kill_assists_survived_traded, "%")),
    headshot_percentage = as.numeric(str_remove(headshot_percentage, "%")),
    clutch_success_percentage = as.numeric(str_remove(clutch_success_percentage, "%"))
  )

# --- Cleaning Step 2: Drop irrelevant categorical columns ---
df_model <- df %>%
  select(-playerName, -team, -region, -playerCategory)

# --- Cleaning Step 3: Handle infinities & NaNs ---
df_model <- df_model %>%
  mutate(across(everything(), ~ifelse(is.infinite(.), NA, .)))  # replace Inf with NA

# --- Cleaning Step 4: Drop rows with NA (or impute if preferred) ---
df_model <- df_model %>%
  drop_na()

# --- Cleaning Step 5: Remove constant columns (zero variance predictors) ---
constant_cols <- df_model %>%
  summarise(across(everything(), ~var(.) == 0)) %>%
  select(where(~ . == TRUE)) %>%
  names()

df_model <- df_model %>%
  select(-all_of(constant_cols))

# --- Final check ---
str(df_model)
summary(df_model)


tibble [2,210 × 11] (S3: tbl_df/tbl/data.frame)
 $ rating                      : num [1:2210] 1.53 1.31 1.31 1.29 1.26 1.26 1.24 1.24 1.23 1.23 ...
 $ average_combat_score        : num [1:2210] 298 267 241 240 269 ...
 $ kill_deaths                 : num [1:2210] 1.48 1.15 1.26 1.25 1.34 1.27 1.19 1.32 1.13 1.23 ...
 $ kill_assists_survived_traded: num [1:2210] 74 75 74 78 76 58 81 75 74 71 ...
 $ average_damage_per_round    : num [1:2210] 186 182 164 159 173 ...
 $ kills_per_round             : num [1:2210] 1.03 0.86 0.82 0.82 0.93 1 0.89 0.94 0.86 0.92 ...
 $ assists_per_round           : num [1:2210] 0.36 0.4 0.39 0.4 0.19 0.06 0.42 0.2 0.45 0.22 ...
 $ first_kills_per_round       : num [1:2210] 0.13 0.06 0.07 0.06 0.24 0.03 0.06 0.24 0.07 0.17 ...
 $ first_deaths_per_round      : num [1:2210] 0.03 0.06 0.04 0.06 0.13 0.18 0.06 0.13 0.1 0.14 ...
 $ headshot_percentage         : num [1:2210] 35 30 27 25 26 42 22 26 34 26 ...
 $ clutch_success_percentage   : num [1:2210] 15 21 10 25 1

     rating       average_combat_score  kill_deaths    
 Min.   :0.1900   Min.   : 60.5        Min.   :0.1400  
 1st Qu.:0.8200   1st Qu.:169.6        1st Qu.:0.7700  
 Median :0.9300   Median :188.6        Median :0.9000  
 Mean   :0.9207   Mean   :189.0        Mean   :0.8941  
 3rd Qu.:1.0300   3rd Qu.:209.3        3rd Qu.:1.0200  
 Max.   :1.5300   Max.   :306.6        Max.   :1.8100  
 kill_assists_survived_traded average_damage_per_round kills_per_round 
 Min.   :34.0                 Min.   : 47.8            Min.   :0.1300  
 1st Qu.:64.0                 1st Qu.:113.0            1st Qu.:0.5900  
 Median :68.0                 Median :125.5            Median :0.6600  
 Mean   :67.3                 Mean   :125.5            Mean   :0.6594  
 3rd Qu.:72.0                 3rd Qu.:138.8            3rd Qu.:0.7400  
 Max.   :89.0                 Max.   :196.0            Max.   :1.0900  
 assists_per_round first_kills_per_round first_deaths_per_round
 Min.   :0.0000    Min.   :0.00000      

Converted percentage columns

* Variables stored as strings with % symbols (kill_assists_survived_traded, headshot_percentage, clutch_success_percentage) were converted to numeric by removing % and casting to numbers.

Dropped irrelevant categorical features

* Columns like playerName, team, region, and playerCategory are identifiers or non-numeric and were removed, since they don’t add predictive value for regression.

Handled infinities (Inf)

* Any infinite values were replaced with NA to avoid errors during model fitting.

Handled missing values (NA)

* After cleaning, rows containing missing values were dropped (drop_na()), ensuring the training set only has valid numeric inputs.

Removed constant (zero-variance) columns

* Columns with the same value for all rows (constants) were identified and dropped, since they provide no information for the model.

Final dataset check

* Verified with summary() and str() that all features are numeric, free of missing values, and within reasonable ranges.

### **I.** Ridge Regression

**1)** What value of λ (lambda) was chosen for optimal Ridge regression? What does this say about the need for regularization in your dataset? (hint: use glmnet) (1 point)

In [5]:
library(glmnet)

# Prepare matrices
x <- as.matrix(select(df_model, -rating))
y <- df_model$rating

set.seed(123)
ridge_cv <- cv.glmnet(x, y, alpha = 0)  # alpha=0 => Ridge

# Best lambda
best_lambda_ridge <- ridge_cv$lambda.min
best_lambda_ridge


Loading required package: Matrix


Attaching package: ‘Matrix’


The following objects are masked from ‘package:tidyr’:

    expand, pack, unpack


Loaded glmnet 4.1-8



Since the λ value is small but not zero, it means only a light amount of regularization was applied.

This suggests the dataset does not suffer heavily from multicollinearity or overfitting, but a mild penalty still improves generalization.

The Ridge model shrinks coefficients slightly towards zero, improving stability without strongly suppressing predictors.

**2)** With the optimal lambda, print the coefficients of the various dependent variables. (1 point)

In [6]:
# Coefficients at optimal lambda
ridge_coef <- coef(ridge_cv, s = "lambda.min")
print(ridge_coef)


11 x 1 sparse Matrix of class "dgCMatrix"
                                        s1
(Intercept)                  -0.0151831218
average_combat_score          0.0007684048
kill_deaths                   0.3604419798
kill_assists_survived_traded  0.0021021272
average_damage_per_round      0.0014093123
kills_per_round               0.2343042340
assists_per_round             0.1551759369
first_kills_per_round        -0.1719797295
first_deaths_per_round       -0.4037226885
headshot_percentage           0.0003099383
clutch_success_percentage     0.0002181421


**2)** Using your cross‑validated Ridge model (cv_ridge), calculate R² for both the training set and test set. Report RMSE and adjusted R² for test set. (2 points)

In [7]:
library(caret)

# Train/Test split
set.seed(123)
trainIndex <- createDataPartition(df_model$rating, p = 0.8, list = FALSE)
train <- df_model[trainIndex, ]
test  <- df_model[-trainIndex, ]

x_train <- as.matrix(select(train, -rating))
y_train <- train$rating
x_test  <- as.matrix(select(test, -rating))
y_test  <- test$rating

# Cross-validated Ridge model
set.seed(123)
cv_ridge <- cv.glmnet(x_train, y_train, alpha = 0)

# Predictions
ridge_train_pred <- predict(cv_ridge, s = "lambda.min", newx = x_train)
ridge_test_pred  <- predict(cv_ridge, s = "lambda.min", newx = x_test)

# --- Training R² ---
ridge_train_r2 <- 1 - sum((y_train - ridge_train_pred)^2) / sum((y_train - mean(y_train))^2)

# --- Test R² ---
ridge_test_r2 <- 1 - sum((y_test - ridge_test_pred)^2) / sum((y_test - mean(y_test))^2)

# --- Test RMSE ---
ridge_test_rmse <- sqrt(mean((y_test - ridge_test_pred)^2))

# --- Adjusted R² for Test ---
n <- length(y_test)          # number of test observations
p <- ncol(x_test)            # number of predictors
adj_r2 <- 1 - (1 - ridge_test_r2) * ((n - 1) / (n - p - 1))

# Print results
cat("Training R²:", ridge_train_r2, "\n")
cat("Test R²:", ridge_test_r2, "\n")
cat("Test RMSE:", ridge_test_rmse, "\n")
cat("Adjusted Test R²:", adj_r2, "\n")


Loading required package: lattice


Attaching package: ‘caret’


The following object is masked from ‘package:purrr’:

    lift


The following object is masked from ‘package:httr’:

    progress




Training R²: 0.9472268 
Test R²: 0.9510837 
Test RMSE: 0.03772843 
Adjusted Test R²: 0.9499435 


### **II.** Lasso Regression

**1)** How many coefficients were exactly zero in the Lasso model? What does this suggest? Which were the top two in terms of weights?? (0.5 points)

In [9]:
# --- Cross-validated Lasso ---
set.seed(123)
cv_lasso <- cv.glmnet(x_train, y_train, alpha = 1)  # alpha=1 => Lasso

# Best lambda for Lasso
best_lambda_lasso <- cv_lasso$lambda.min
cat("Optimal Lambda (Lasso):", best_lambda_lasso, "\n")

# Coefficients at optimal lambda
lasso_coef <- coef(cv_lasso, s = "lambda.min")

# Convert to tidy format
lasso_coef_df <- as.matrix(lasso_coef) %>%
  as.data.frame() %>%
  rownames_to_column("Variable") %>%
  rename(Coefficient = 2)

# Count how many are exactly zero
zero_count <- sum(lasso_coef_df$Coefficient == 0)

cat("Number of coefficients set to zero:", zero_count, "\n")

# Top two by absolute weight (ignoring intercept)
top_vars <- lasso_coef_df %>%
  filter(Variable != "(Intercept)") %>%
  arrange(desc(abs(Coefficient))) %>%
  slice(1:2)

top_vars


Optimal Lambda (Lasso): 0.0004084707 
Number of coefficients set to zero: 2 


Variable,Coefficient
<chr>,<dbl>
kill_deaths,0.5646656
first_deaths_per_round,-0.3718841


* Optimal λ (lambda):
0.00041 (rounded to 5 decimals).

* Number of coefficients exactly zero:
2 coefficients were shrunk to exactly zero.
→ This means Lasso performed automatic feature selection, dropping predictors that contributed little to explaining player rating.

* Top two predictors (by weight):

kill_deaths → +0.565

first_deaths_per_round → -0.372

* These had the largest absolute coefficients, suggesting they are the most influential factors in predicting player rating.

Higher K/D ratio strongly increases rating.

More first deaths per round strongly decreases rating.

**2)** Did Lasso outperform OLS and Ridge in terms of Test R² and RMSE? Why or why not?(answer based on VIF standings) (1 point)

In [10]:
ols_model <- lm(rating ~ ., data = train)

# Predict on test set
ols_pred <- predict(ols_model, newdata = test)

# OLS Test R²
ols_r2 <- 1 - sum((y_test - ols_pred)^2) / sum((y_test - mean(y_test))^2)

# OLS Test RMSE
ols_rmse <- sqrt(mean((y_test - ols_pred)^2))

cat("OLS Test R²:", ols_r2, "\n")
cat("OLS Test RMSE:", ols_rmse, "\n")


OLS Test R²: 0.9561006 
OLS Test RMSE: 0.03574139 


* OLS already performs very well (R² ≈ 0.956, very low RMSE).

* * Ridge applied a mild penalty (λ ≈ 0.016) but performance was only slightly different from OLS.

* Lasso zeroed 2 coefficients (feature selection), but its R²/RMSE were also close to Ridge/OLS.

Based on VIF standings:

* multicollinearity isn’t severe in this dataset.

* That’s why Lasso did not strongly outperform OLS or Ridge — there wasn’t much redundancy for it to eliminate.

* Ridge and Lasso mainly added stability and slight shrinkage, but OLS already captured the relationships well.

In [None]:
# library(car)  

# X_df <- as.data.frame(X_train)

# dummy_y <- rnorm(nrow(X_df))  

# lm_model <- lm(dummy_y ~ ., data = X_df)

# vif_values <- vif(lm_model)

# vif_df <- data.frame(
#   feature = names(vif_values),
#   VIF = as.numeric(vif_values)
# ) %>% arrange(desc(VIF))

# print(vif_df)


### **III.** Inferences

**1)** Suggest at least two additional features (not in the current dataset) that could improve player rating prediction

Agent/Role Played (e.g., Duelist, Controller, Initiator, Sentinel):

* Different roles are expected to contribute differently to team performance.

* A Duelist is more kill-heavy, while a Controller’s impact might not be captured fully by combat stats.

* Including role/agent type could help control for playstyle differences.

Win Rate / Round Win Contribution:

* A player’s ability to convert rounds into wins (clutching or contributing to team economy) is crucial for rating.

* Stats like rounds won when alive or impact in eco vs full buy rounds would give more context than raw kills.

**2)** How does regularization help reduce overfitting in both Ridge and Lasso?

* Overfitting Problem:
In OLS regression, the model tries to minimize error perfectly on training data, which can cause it to pick up noise and perform poorly on unseen test data.

* Ridge Regression (L2 penalty):

Adds a penalty proportional to the sum of squared coefficients.

Shrinks coefficients towards zero but never completely eliminates them.

This reduces variance, stabilizes estimates, and helps the model generalize better.

* Lasso Regression (L1 penalty):

Adds a penalty proportional to the sum of absolute values of coefficients.

Forces some coefficients to become exactly zero, effectively performing feature selection.

This simplifies the model, reduces complexity, and helps avoid overfitting.

* Summary:
Both methods constrain coefficient magnitudes, discouraging reliance on too many features. Ridge is better when all predictors have some effect, while Lasso is better when only a few predictors are truly important.

### **IV.** Prediction of player ratings


**1)** Using the model that performs better, predict the rating for the following hypothetical player  (0.5 points)
* kills_per_round              : 0.78
* average_damage_per_round     : 160
* average_combat_score         : 245
* kill_deaths                  : 1.30
* assists_per_round            : 0.32
* first_kills_per_round        : 0.18
* first_deaths_per_round       : 0.14

In [14]:
# Hypothetical player data
new_player <- data.frame(
  kills_per_round = 0.78,
  average_damage_per_round = 160,
  average_combat_score = 245,
  kill_deaths = 1.30,
  assists_per_round = 0.32,
  first_kills_per_round = 0.18,
  first_deaths_per_round = 0.14,
  # fill missing predictors using dataset means
  kill_assists_survived_traded = mean(df$kill_assists_survived_traded, na.rm = TRUE),
  headshot_percentage = mean(df$headshot_percentage, na.rm = TRUE),
  clutch_success_percentage = mean(df$clutch_success_percentage, na.rm = TRUE)
)

# Predict rating using the best model (OLS here)
predicted_rating <- predict(ols_model, newdata = new_player)
predicted_rating


**2)** Use the same model to get ratings for ten players at random from test dataset and compare the values.

In [15]:
set.seed(123)  # for reproducibility

# Pick 10 random rows from test set
sample_rows <- test[sample(nrow(test), 10), ]

# Predict ratings using the chosen model (OLS in your case)
sample_rows$predicted_rating <- predict(ols_model, newdata = sample_rows)

# Select only actual vs predicted columns
comparison <- sample_rows[, c("rating", "predicted_rating")]

print(comparison)


[90m# A tibble: 10 × 2[39m
   rating predicted_rating
    [3m[90m<dbl>[39m[23m            [3m[90m<dbl>[39m[23m
[90m 1[39m   0.76            0.788
[90m 2[39m   0.72            0.781
[90m 3[39m   1.07            1.06 
[90m 4[39m   1.19            1.37 
[90m 5[39m   0.86            0.860
[90m 6[39m   0.87            0.826
[90m 7[39m   0.98            0.994
[90m 8[39m   0.92            0.894
[90m 9[39m   0.96            0.959
[90m10[39m   0.88            0.894
