# Baseball Player Salary Prediction Using Lasso Regression

## Problem Statement

The objective of this project is to predict baseball player salaries using performance, experience, and team-related features.

## 1. Import Required Libraries

In [37]:
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso, LassoCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings("ignore")

## 2. Data Loading and Understanding

In [38]:
df=pd.read_csv("Hitters.csv")
df.head()

Unnamed: 0.1,Unnamed: 0,AtBat,Hits,HmRun,Runs,RBI,Walks,Years,CAtBat,CHits,...,CRuns,CRBI,CWalks,League,Division,PutOuts,Assists,Errors,Salary,NewLeague
0,-Andy Allanson,293,66,1,30,29,14,1,293,66,...,30,29,14,A,E,446,33,20,,A
1,-Alan Ashby,315,81,7,24,38,39,14,3449,835,...,321,414,375,N,W,632,43,10,475.0,N
2,-Alvin Davis,479,130,18,66,72,76,3,1624,457,...,224,266,263,A,W,880,82,14,480.0,A
3,-Andre Dawson,496,141,20,65,78,37,11,5628,1575,...,828,838,354,N,E,200,11,3,500.0,N
4,-Andres Galarraga,321,87,10,39,42,30,2,396,101,...,48,46,33,N,E,805,40,4,91.5,N


In [39]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 322 entries, 0 to 321
Data columns (total 21 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  322 non-null    object 
 1   AtBat       322 non-null    int64  
 2   Hits        322 non-null    int64  
 3   HmRun       322 non-null    int64  
 4   Runs        322 non-null    int64  
 5   RBI         322 non-null    int64  
 6   Walks       322 non-null    int64  
 7   Years       322 non-null    int64  
 8   CAtBat      322 non-null    int64  
 9   CHits       322 non-null    int64  
 10  CHmRun      322 non-null    int64  
 11  CRuns       322 non-null    int64  
 12  CRBI        322 non-null    int64  
 13  CWalks      322 non-null    int64  
 14  League      322 non-null    object 
 15  Division    322 non-null    object 
 16  PutOuts     322 non-null    int64  
 17  Assists     322 non-null    int64  
 18  Errors      322 non-null    int64  
 19  Salary      263 non-null    float64

## 3. Handling Missing Salary Values

In [40]:
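# Assign the result back; calling fillna(inplace=True) on a column selection is deprecated in recent pandas.
df['Salary'] = df['Salary'].fillna(df['Salary'].median())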
df.isna().sum()

Unnamed: 0    0
AtBat         0
Hits          0
HmRun         0
Runs          0
RBI           0
Walks         0
Years         0
CAtBat        0
CHits         0
CHmRun        0
CRuns         0
CRBI          0
CWalks        0
League        0
Division      0
PutOuts       0
Assists       0
Errors        0
Salary        0
NewLeague     0
dtype: int64

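Salary had 59 missing values (only 263 of the 322 entries were non-null).

They were replaced with the median of the observed salaries.

The median is preferred to the mean here because salary data is typically right-skewed: a few very high salaries pull the mean upward but barely move the median. Because Salary is also the prediction target, all imputed rows share one target value, which can influence the error metrics reported later.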
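
To illustrate why the median is the more robust fill value, here is a minimal sketch on a small hypothetical salary series (the numbers are made up for illustration and are not taken from Hitters.csv). It also shows the assign-back form of fillna.

```python
import pandas as pd

# Hypothetical salaries: mostly modest values, one very large one, one missing.
salaries = pd.Series([90.0, 120.0, 150.0, 200.0, None, 2500.0])

mean_salary = salaries.mean()      # pulled upward by the single large value
median_salary = salaries.median()  # middle of the observed values

# Fill the gap with the median and assign the result to a new Series.
filled = salaries.fillna(median_salary)

print(f"mean={mean_salary:.1f}, median={median_salary:.1f}")
print(filled.tolist())
```

In this toy series the one large salary moves the mean far above most of the observed values, while the median stays near the typical player.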

## 4. Converting Categorical Columns to Numbers

In [41]:
dms = pd.get_dummies(df[['League','Division','NewLeague']],drop_first=True)

In [42]:
dms.head()

Unnamed: 0,League_N,Division_W,NewLeague_N
0,False,False,False
1,True,True,True
2,False,True,False
3,True,False,True
4,True,False,True


In [43]:
dms.columns

Index(['League_N', 'Division_W', 'NewLeague_N'], dtype='object')

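Each two-level categorical column was converted into a single boolean indicator. With drop_first=True, the first level (A for League and NewLeague, E for Division) is dropped and serves as the baseline.

True means the player belongs to the named category (for example, League_N is True for league N); False means the baseline category.

scikit-learn converts these booleans to 0 and 1 when the model is fitted.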
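
A minimal sketch of the same encoding on a hypothetical three-row frame (not the real data) shows which level becomes the baseline. The dtype=int argument is an optional addition that produces 0/1 integers instead of booleans.

```python
import pandas as pd

# Hypothetical toy frame using the same category codes as the dataset.
toy = pd.DataFrame({"League": ["A", "N", "A"], "Division": ["E", "W", "W"]})

# drop_first=True keeps one indicator per two-level column:
# League_N is 1 for "N" and 0 for "A" (the dropped baseline level).
dummies = pd.get_dummies(toy, drop_first=True, dtype=int)
print(dummies)
```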

## 5. Defining Target and Feature Variables

In [44]:
y=df["Salary"]
x_=df.drop(['Unnamed: 0','Salary','League','Division','NewLeague'],axis=1).astype('float64')
X=pd.concat([x_, dms[['League_N','Division_W','NewLeague_N']]],axis=1)

## 6. Train–Test Split

In [45]:
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.25,random_state=42)

## 7. Training Lasso Model and Intercept

In [46]:
lasso_model=Lasso().fit(X_train,y_train)
lasso_model.intercept_

np.float64(342.8733925858769)

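The Lasso model was trained with scikit-learn's default regularization strength (alpha = 1.0).

The intercept is the predicted salary when every feature equals zero; it is the baseline to which the feature effects are added.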
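
The role of the intercept can be checked directly. The sketch below uses synthetic data (the names X_demo and y_demo are hypothetical stand-ins, not the Hitters features) and rebuilds the predictions by hand as the intercept plus the weighted sum of the features.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data (not the Hitters data): 50 rows, 3 features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = 300 + 40 * X_demo[:, 0] - 25 * X_demo[:, 1] + rng.normal(scale=5, size=50)

model = Lasso(alpha=1.0).fit(X_demo, y_demo)

# A linear model's prediction is the intercept plus the weighted sum of features.
manual = model.intercept_ + X_demo @ model.coef_
print("manual matches predict():", np.allclose(manual, model.predict(X_demo)))

# With every feature set to zero, the prediction equals the intercept.
zero_row = np.zeros((1, 3))
print(model.predict(zero_row)[0], model.intercept_)
```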

## 8. Model Prediction

In [49]:
y_pred=lasso_model.predict(X_test)

In [50]:
#View Actual vs Predicted
comparison = pd.DataFrame({
    "Actual_Value": y_test.values,
    "Predicted_Value": y_pred
})

comparison.head()

Unnamed: 0,Actual_Value,Predicted_Value
0,425.0,315.849479
1,325.0,833.502178
2,425.0,313.000766
3,1100.0,632.583261
4,425.0,1119.961086


## 9. Model Evaluation

In [51]:
# RMSE
np.sqrt(mean_squared_error(y_test,y_pred))

np.float64(345.6190692407428)

In [52]:
#R² Score
r2_score(y_test,y_pred)

0.3657513009571691
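
For readers who want to see what the two metrics compute, here is a sketch that evaluates them by hand on the first five actual/predicted pairs from the comparison table in Section 8. Five rows are not representative of the full test set, so the numbers will differ from the values reported above.

```python
import numpy as np

# The first five actual/predicted pairs shown in the comparison table.
actual = np.array([425.0, 325.0, 425.0, 1100.0, 425.0])
predicted = np.array([315.849479, 833.502178, 313.000766, 632.583261, 1119.961086])

residuals = actual - predicted
rmse = np.sqrt(np.mean(residuals ** 2))          # root of the mean squared error

ss_res = np.sum(residuals ** 2)                  # variation left unexplained
ss_tot = np.sum((actual - actual.mean()) ** 2)   # total variation around the mean
r2 = 1 - ss_res / ss_tot                         # can be negative on a poor fit

print(f"RMSE on these five rows: {rmse:.2f}")
print(f"R^2 on these five rows:  {r2:.3f}")
```

RMSE is expressed in the same units as Salary, while R² compares the model's squared error with that of always predicting the mean.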

In [53]:
#Add Error Column
comparison["Error"] = comparison["Actual_Value"] - comparison["Predicted_Value"]
comparison.head()

Unnamed: 0,Actual_Value,Predicted_Value,Error
0,425.0,315.849479,109.150521
1,325.0,833.502178,-508.502178
2,425.0,313.000766,111.999234
3,1100.0,632.583261,467.416739
4,425.0,1119.961086,-694.961086


In [54]:
#Model Coefficients
lasso_model.coef_

array([-1.98558949e+00,  5.50494749e+00,  4.79612807e+00,  1.02123896e-01,
       -8.11521080e-01,  4.87004116e+00, -9.97808288e+00, -2.19391227e-01,
        6.16237616e-01,  9.03214960e-03,  8.73990383e-01,  7.84172593e-01,
       -8.13423037e-01,  1.83989460e-01,  4.04846687e-01, -4.08650952e+00,
        2.67092023e+01, -1.11463261e+02, -0.00000000e+00])

## 10. Lasso with Cross-Validation (LassoCV)

In [55]:
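# Note: these 100 candidate alphas are unseeded random integers in [0, 1000), so the grid and the selected alpha can vary between runs.
lasso_cv_model=LassoCV(alphas=np.random.randint(0,1000,100),cv=10,max_iter=10000,n_jobs=-1).fit(X_train,y_train)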

In [56]:
#Best Alpha Value
lasso_cv_model.alpha_

np.int32(5)

In [58]:
#Tuned Lasso Model and Performance
lasso_tuned = Lasso(alpha=lasso_cv_model.alpha_).fit(X_train,y_train)
y_pred_tuned = lasso_tuned.predict(X_test)
np.sqrt(mean_squared_error(y_test,y_pred_tuned))

np.float64(345.6397880532268)

Tuning the regularization strength had almost no effect on test error: RMSE was 345.62 with the default alpha of 1.0 and 345.64 with the cross-validated alpha of 5.
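
The candidate alphas in this section are drawn with np.random.randint and no seed, so the grid changes between runs and can include 0. One possible alternative, sketched here on synthetic data from make_regression (a stand-in, not the Hitters data), is a fixed log-spaced grid of strictly positive values. The grid bounds are an assumption chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

# Synthetic stand-in for the training data (not the Hitters data).
X_demo, y_demo = make_regression(n_samples=200, n_features=10, n_informative=4,
                                 noise=10.0, random_state=42)

# Log-spaced, strictly positive candidates; identical on every run.
alpha_grid = np.logspace(-2, 3, 50)

cv_model = LassoCV(alphas=alpha_grid, cv=10, max_iter=10000).fit(X_demo, y_demo)
print("selected alpha:", cv_model.alpha_)
```

A log-spaced grid covers several orders of magnitude evenly, which usually suits a penalty strength better than uniformly spaced integers.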

## 11. Final Lasso Feature Importance

In [59]:
pd.Series(lasso_tuned.coef_,index=X_train.columns)

AtBat          -1.970418
Hits            5.377108
HmRun           3.926642
Runs            0.152401
RBI            -0.501808
Walks           4.883807
Years          -8.310651
CAtBat         -0.230618
CHits           0.657385
CHmRun          0.065128
CRuns           0.857110
CRBI            0.757633
CWalks         -0.808948
PutOuts         0.182900
Assists         0.396243
Errors         -3.544343
League_N       10.512048
Division_W    -94.836778
NewLeague_N     0.000000
dtype: float64
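
These coefficients are expressed in each feature's original units, so their magnitudes are not directly comparable across features. The sketch below, on synthetic data with two hypothetical features of very different scale, shows how standardizing inside a pipeline (StandardScaler and make_pipeline from scikit-learn) puts coefficients on a common footing. It illustrates the idea and is not a refit of the model above.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (not the Hitters data): two features on very different scales
# that contribute roughly equally to the target.
rng = np.random.default_rng(0)
small_scale = rng.normal(0, 1, 200)       # values of order 1
large_scale = rng.normal(0, 1000, 200)    # values in the thousands
X_demo = np.column_stack([small_scale, large_scale])
y_demo = 50 * small_scale + 0.05 * large_scale + rng.normal(0, 5, 200)

raw = Lasso(alpha=1.0).fit(X_demo, y_demo)
scaled = make_pipeline(StandardScaler(), Lasso(alpha=1.0)).fit(X_demo, y_demo)

print("raw coefficients:         ", raw.coef_)
print("standardized coefficients:", scaled.named_steps["lasso"].coef_)
```

On raw inputs the large-scale feature receives a tiny coefficient even though it matters as much as the other; after standardization the two coefficients are of similar size.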

## Final Project Conclusion

In this project, Lasso Regression was used to predict baseball player salaries using performance and experience data. Missing values were handled using median imputation, and categorical features were converted into numeric form.

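Lasso was chosen because its L1 penalty can shrink the coefficients of less important features to exactly zero, which reduces model complexity. In this model only NewLeague_N was eliminated. LassoCV selected an alpha of 5 using 10-fold cross-validation.

On the test set, the model achieved an R² of about 0.37 and an RMSE of about 346, which indicates modest predictive performance; tuning alpha left the RMSE essentially unchanged. The largest coefficients by magnitude belong to Division_W, League_N, and Years, followed by Hits, Walks, and HmRun. Because the features were not standardized, coefficient sizes reflect each feature's units and should not be read as a direct ranking of importance.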

Overall, this project demonstrates how Lasso Regression can be used for feature selection and baseline regression modeling, with scope for improvement using better tuning and advanced models.