<a href="https://colab.research.google.com/github/surajacharya12/Predicting-Building-Damage-Grade-by-Earthquake/blob/main/Hitters_dataset_to_predict_player_salaries.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This project uses Support Vector Regression (SVR) on the Hitters dataset to predict player salaries. The theory is summarized first, followed by a breakdown of the steps.

## Theory of Non-Linear Support Vector Regression (SVR)

Support Vector Regression (SVR) is a supervised learning model used for regression analysis. It is an extension of Support Vector Machines (SVM), which are primarily used for classification. The goal of SVR is to find a function whose predictions deviate from the target values by no more than a small margin epsilon ($\epsilon$) while being as flat as possible. Flatness in the context of SVR means keeping the norm of the model's weight vector small.
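For the linear case, this trade-off is commonly written as the optimization problem below. This is the standard soft-margin formulation rather than something taken from the notebook: the slack variables $\xi_i, \xi_i^*$, the bias $b$, and the penalty $C$ are notation introduced here for illustration.

$$
\begin{aligned}
\min_{\mathbf{w},\, b,\, \xi,\, \xi^*} \quad & \tfrac{1}{2} ||\mathbf{w}||^2 + C \sum_{i=1}^{n} (\xi_i + \xi_i^*) \\
\text{subject to} \quad & y_i - \mathbf{w}^\top \mathbf{x}_i - b \le \epsilon + \xi_i, \\
& \mathbf{w}^\top \mathbf{x}_i + b - y_i \le \epsilon + \xi_i^*, \\
& \xi_i,\ \xi_i^* \ge 0 .
\end{aligned}
$$

The first term rewards flatness (a small $||\mathbf{w}||$), while the second term penalizes points that fall outside the $\epsilon$-tube. $C$ sets the balance between the two and is the parameter tuned by grid search later in this project.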

**Non-Linear SVR** is used when the relationship between the independent variables (features) and the dependent variable (target) is not linear. In such cases, a linear model would not be able to capture the complexity of the data, leading to poor performance.

The key to Non-Linear SVR is the use of **Kernel Functions**. A kernel function computes inner products between data points as if the data had been mapped into a higher-dimensional space, where a linear relationship may exist. The mapping is implicit: the coordinates of the data in the higher-dimensional space are never computed. This is known as the "kernel trick".
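The kernel trick can be checked numerically on a small case. The sketch below uses a degree-2 polynomial kernel (not the RBF kernel used in this project) because its feature map is finite and easy to write out; `phi` and `poly2_kernel` are hypothetical helper names chosen for illustration.

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map from 2-D to 3-D (illustrative helper)."""
    x1, x2 = v
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def poly2_kernel(a, b):
    """Homogeneous polynomial kernel of degree 2: (a . b)^2."""
    return float(np.dot(a, b) ** 2)

a = np.array([1.0, 2.0])
b = np.array([3.0, 0.5])

explicit = float(np.dot(phi(a), phi(b)))  # inner product computed in the 3-D space
implicit = poly2_kernel(a, b)             # same quantity, computed in the original 2-D space
print(explicit, implicit)
```

Both numbers agree up to floating-point error, which is the point of the trick: the kernel gives the inner product in the mapped space without ever building the mapped vectors.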

In this project, a **Radial Basis Function (RBF) kernel** is used. The RBF kernel is a popular choice for non-linear SVR and is defined as:

$K(\mathbf{x}_i, \mathbf{x}_j) = \exp(-\gamma ||\mathbf{x}_i - \mathbf{x}_j||^2)$

where:
- $\mathbf{x}_i$ and $\mathbf{x}_j$ are two data points.
- $||\mathbf{x}_i - \mathbf{x}_j||^2$ is the squared Euclidean distance between the two data points.
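- $\gamma$ is a positive parameter that controls how far the influence of a single training example reaches: a larger $\gamma$ makes the kernel decay faster with distance, so each example affects only its close neighbors.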

The RBF kernel maps the input data into an infinite-dimensional space, allowing SVR to find complex non-linear relationships.
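As a small illustration of the formula, the sketch below evaluates the RBF kernel by hand and compares it with scikit-learn's `rbf_kernel`. The two points and the `gamma` values are arbitrary choices for the example, and `rbf` is a hypothetical helper name.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def rbf(x_i, x_j, gamma):
    """RBF kernel value for two 1-D feature vectors (illustrative helper)."""
    sq_dist = np.sum((x_i - x_j) ** 2)   # squared Euclidean distance
    return np.exp(-gamma * sq_dist)

x_i = np.array([1.0, 2.0])
x_j = np.array([2.0, 4.0])

for gamma in (0.01, 0.1, 1.0):
    manual = rbf(x_i, x_j, gamma)
    # rbf_kernel expects 2-D arrays of shape (n_samples, n_features)
    library = rbf_kernel(x_i.reshape(1, -1), x_j.reshape(1, -1), gamma=gamma)[0, 0]
    print(f"gamma={gamma}: manual={manual:.4f}, sklearn={library:.4f}")
```

The kernel value shrinks toward 0 as `gamma` grows, so the same pair of points counts as "similar" for a small `gamma` and as nearly unrelated for a large one.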

By using a kernel function like RBF, Non-Linear SVR can effectively model complex patterns in the data that would be impossible to capture with a simple linear regression model. The goal is still to find a function that fits the data within a certain margin of error ($\epsilon$), but this function can now be non-linear in the original feature space.

1. **Load Libraries and Data:** (Cells 1 to 18)

Libraries for data manipulation, visualization, and machine learning are imported. The Hitters dataset is loaded from Google Drive and the first few rows are displayed.


In [1]:
from warnings import filterwarnings
filterwarnings("ignore")


In [2]:
import pandas as pd


In [3]:
import numpy as np


In [4]:
import matplotlib.pyplot as plt


In [5]:
import seaborn as sns


In [6]:
import statsmodels.api as sm


In [7]:
import statsmodels.formula.api as smf


In [8]:
from sklearn.linear_model import LinearRegression


In [9]:
from sklearn.metrics import mean_squared_error,r2_score


In [10]:
from sklearn.model_selection import train_test_split,cross_val_score,cross_val_predict,ShuffleSplit,GridSearchCV


In [11]:
from sklearn.decomposition import PCA


In [12]:
from sklearn.tree import DecisionTreeRegressor,DecisionTreeClassifier


In [13]:
from sklearn.preprocessing import scale


In [14]:
from sklearn import model_selection


In [15]:
from sklearn.svm import SVR


In [16]:
import time


In [17]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


**Non-Linear Support Vector Regression (SVR)**


**Theory**

When data cannot be separated with a straight line, a non-linear SVM is used. It relies on kernel functions, which map the data into a higher-dimensional space in which a linear model becomes workable.

For example, a kernel can map two variables x and y to three variables x, y, and z, so the data moves from 2-D space to 3-D space. In that space the data can be separated by a flat boundary. For regression, the same idea lets SVR fit a function that is linear in the mapped space and non-linear in the original features.

In [18]:
hts = pd.read_csv("/content/drive/MyDrive/non linera/Hitters.csv")
hts.head()

Unnamed: 0,AtBat,Hits,HmRun,Runs,RBI,Walks,Years,CAtBat,CHits,CHmRun,CRuns,CRBI,CWalks,League,Division,PutOuts,Assists,Errors,Salary,NewLeague
0,293,66,1,30,29,14,1,293,66,1,30,29,14,A,E,446,33,20,,A
1,315,81,7,24,38,39,14,3449,835,69,321,414,375,N,W,632,43,10,475.0,N
2,479,130,18,66,72,76,3,1624,457,63,224,266,263,A,W,880,82,14,480.0,A
3,496,141,20,65,78,37,11,5628,1575,225,828,838,354,N,E,200,11,3,500.0,N
4,321,87,10,39,42,30,2,396,101,12,48,46,33,N,E,805,40,4,91.5,N


2. **Data Preprocessing:** (Cells 19 to 24)
   * Rows with missing values are removed with `dropna`; in the preview above, the missing value is in the 'Salary' column.
   * Categorical features ('League', 'Division', 'NewLeague') are one-hot encoded to convert them into a numerical format suitable for the model. Each feature has two levels, so only one dummy column per feature ('League_N', 'Division_W', 'NewLeague_N') is kept.
   * The original categorical columns and the 'Salary' column are dropped from the feature set (X).
   * The target variable (y) is defined as the 'Salary' column.

In [19]:
hts.dropna(inplace=True)


In [20]:
one_hot_encoded = pd.get_dummies(hts[["League","Division","NewLeague"]])
one_hot_encoded.head()


Unnamed: 0,League_A,League_N,Division_E,Division_W,NewLeague_A,NewLeague_N
1,False,True,False,True,False,True
2,True,False,False,True,True,False
3,False,True,True,False,False,True
4,False,True,True,False,False,True
5,True,False,False,True,True,False


In [21]:
new_hts = hts.drop(["League","Division","NewLeague","Salary"],axis=1).astype("float64")


In [22]:
X = pd.concat([new_hts,one_hot_encoded[["League_N","Division_W","NewLeague_N"]]],axis=1)
X.head()

Unnamed: 0,AtBat,Hits,HmRun,Runs,RBI,Walks,Years,CAtBat,CHits,CHmRun,CRuns,CRBI,CWalks,PutOuts,Assists,Errors,League_N,Division_W,NewLeague_N
1,315.0,81.0,7.0,24.0,38.0,39.0,14.0,3449.0,835.0,69.0,321.0,414.0,375.0,632.0,43.0,10.0,True,True,True
2,479.0,130.0,18.0,66.0,72.0,76.0,3.0,1624.0,457.0,63.0,224.0,266.0,263.0,880.0,82.0,14.0,False,True,False
3,496.0,141.0,20.0,65.0,78.0,37.0,11.0,5628.0,1575.0,225.0,828.0,838.0,354.0,200.0,11.0,3.0,True,False,True
4,321.0,87.0,10.0,39.0,42.0,30.0,2.0,396.0,101.0,12.0,48.0,46.0,33.0,805.0,40.0,4.0,True,False,True
5,594.0,169.0,4.0,74.0,51.0,35.0,11.0,4408.0,1133.0,19.0,501.0,336.0,194.0,282.0,421.0,25.0,False,True,False


In [23]:
y = hts.Salary # Target-dependent variable


In [24]:
hts.shape


(263, 20)
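The preprocessing cells above can also be written more compactly. The sketch below is one possible variant, not the notebook's code: it runs on a small made-up frame (`toy`) rather than the Hitters file, and it lets `get_dummies` drop the redundant dummy column through `drop_first=True` instead of selecting columns by hand.

```python
import pandas as pd

# Toy stand-in for the Hitters frame; the values are made up for illustration.
toy = pd.DataFrame({
    "Hits": [81, 130, 141, 87],
    "League": ["N", "A", "N", "N"],
    "Division": ["W", "W", "E", "E"],
    "NewLeague": ["N", "A", "N", "N"],
    "Salary": [475.0, 480.0, None, 91.5],
})

toy = toy.dropna(subset=["Salary"])          # drop rows that have no target value
y_toy = toy["Salary"]
X_toy = pd.get_dummies(
    toy.drop(columns="Salary"),
    columns=["League", "Division", "NewLeague"],
    drop_first=True,                         # keep one dummy per two-level feature
    dtype=float,                             # numeric dummies instead of booleans
)
print(X_toy.columns.tolist())
```

With two-level features, `drop_first=True` keeps the same three dummy columns the notebook selects manually (`League_N`, `Division_W`, `NewLeague_N`).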

3. **Shape Check:** (Cells 25 to 26)
   * The shapes of the feature matrix (X) and the target (y) are inspected before splitting.

In [25]:
#Independent Variables
X.shape

(263, 19)

In [26]:
#Dependent Variables
y.shape


(263,)

4. **Data Splitting:** (Cell 27)
   * The dataset is split by position: the first 210 rows form the training set and the remaining 53 rows form the testing set.

In [27]:
# Positional split without shuffling: first 210 rows for training, the rest for testing
X_train = X.iloc[:210]
X_test = X.iloc[210:]
y_train = y.iloc[:210]
y_test = y.iloc[210:]

print("X_train Shape: ",X_train.shape)
print("X_test Shape: ",X_test.shape)
print("y_train Shape: ",y_train.shape)
print("y_test Shape: ",y_test.shape)

X_train Shape:  (210, 19)
X_test Shape:  (53, 19)
y_train Shape:  (210,)
y_test Shape:  (53,)
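The split above takes rows in file order. If the file happens to be ordered in some meaningful way, a shuffled split may be preferable. The sketch below shows one way to do that with `train_test_split`, which the notebook imports but does not use. The arrays are synthetic stand-ins with the same shape as `X` and `y`, and `test_size=0.2` and `random_state=42` are assumptions chosen for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(263, 19))   # synthetic stand-in with the same shape as X
y_demo = rng.normal(size=263)         # synthetic stand-in for y

# Shuffle the rows, then hold out 20% of them for testing
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(X_tr.shape, X_te.shape)
```

With 263 rows, a 20% hold-out gives the same 210/53 sizes as the positional split, so results stay comparable in size while no longer depending on row order.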


5. **Model Training (Initial):** (Cell 28)
   * A Support Vector Regressor with a Radial Basis Function (RBF) kernel is initialized with default hyperparameters and trained on the training data.

In [28]:
SVR_Radial_Basis = SVR(kernel='rbf').fit(X_train, y_train)


6. **Prediction on the Training Set:** (Cells 29 to 30)
   * The fitted model is displayed and then used to predict salaries for the training set.

In [29]:
SVR_Radial_Basis


In [30]:
y_pred=SVR_Radial_Basis.predict(X_train)


7. **Model Evaluation (Initial RMSE, Training Set):** (Cell 31)
   * The model's performance is evaluated on the training set using Root Mean Squared Error (RMSE).

In [31]:
#Train Error
np.sqrt(mean_squared_error(y_train,y_pred))


np.float64(466.0383888753838)

8. **Model Evaluation (Initial R-squared, Training Set):** (Cells 32 to 33)
   * The model's performance is evaluated on the training set using the R-squared score, and predictions are then made for the testing set.

In [32]:
r2_score(y_train,y_pred)


-0.003366481949100386
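A negative R-squared, like the training score above, means the predictions fit worse than a constant prediction equal to the mean of the targets. The sketch below reproduces the definition by hand on made-up numbers; `r2_manual` is a hypothetical helper name and the values are not from the Hitters data.

```python
import numpy as np
from sklearn.metrics import r2_score

def r2_manual(y_true, y_hat):
    """R^2 = 1 - SS_res / SS_tot (illustrative helper)."""
    ss_res = np.sum((y_true - y_hat) ** 2)             # residual sum of squares
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)   # spread around the mean
    return 1.0 - ss_res / ss_tot

y_true = np.array([100.0, 200.0, 300.0, 400.0, 1000.0])   # skewed toy targets
mean_pred = np.full_like(y_true, y_true.mean())           # always predict the mean
median_pred = np.full_like(y_true, np.median(y_true))     # always predict the median

print(r2_manual(y_true, mean_pred))     # baseline: exactly zero
print(r2_manual(y_true, median_pred))   # slightly below zero
```

Predicting the mean scores exactly 0, and any constant other than the mean scores below 0. A near-zero R-squared therefore suggests that the model's predictions are close to a constant.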

In [33]:
y_pred=SVR_Radial_Basis.predict(X_test)


9. **Model Evaluation (Initial, Testing Set):** (Cells 34 to 35)
   * The model's performance is evaluated on the testing set using RMSE and the R-squared score.




In [34]:
#Test Error
np.sqrt(mean_squared_error(y_test,y_pred))

np.float64(374.4664803549714)

In [35]:
r2_score(y_test,y_pred)


0.03992953009212119
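The model above is fit on raw features whose scales differ widely (for example, 'Errors' versus 'CAtBat'). Because the RBF kernel depends on Euclidean distance, features with large values can dominate it. A common remedy, which this notebook does not apply, is to standardize the features inside a pipeline. The sketch below shows the pattern on synthetic data; whether it improves the scores on Hitters has not been verified here.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
# Synthetic stand-in: two features on very different scales (not the Hitters data).
X_demo = np.column_stack([rng.normal(0, 1, 200), rng.normal(0, 1000, 200)])
y_demo = 3.0 * X_demo[:, 0] + 0.002 * X_demo[:, 1] + rng.normal(0, 0.1, 200)

# The scaler is fit on the training rows only, then reused at prediction time.
scaled_svr = make_pipeline(StandardScaler(), SVR(kernel="rbf"))
scaled_svr.fit(X_demo[:150], y_demo[:150])
pred = scaled_svr.predict(X_demo[150:])
print(pred.shape)
```

Wrapping the scaler and the model in one pipeline keeps test rows out of the scaling statistics, and the same pipeline object can be passed to `GridSearchCV` in place of a bare `SVR`.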

10. **Hyperparameter Tuning:** (Cell 36 and the cell labeled In [47])
    * A grid search with 15-fold cross-validation is performed to find the best 'C' parameter for the SVR model, over values from 0.2 to 9.9 in steps of 0.1.




In [36]:
SVR_Radial_Basis


In [47]:

svr_parameters = {"C": np.arange(0.2,10,0.1)}
svr_cv_model= GridSearchCV(SVR_Radial_Basis,svr_parameters,cv=15).fit(X_train,y_train)

11. **Best Parameter:** (Cell 38)
    * The value of 'C' selected by the grid search is displayed.

In [38]:
svr_cv_model.best_params_


{'C': np.float64(9.900000000000002)}
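The selected value sits at the upper edge of the searched range, which suggests the best 'C' may lie beyond it. One option is a wider, log-spaced grid. The sketch below shows that pattern on synthetic data; the grid bounds, `cv=5`, and the RMSE-based `scoring` are assumptions for the example (by default `GridSearchCV` scores regressors with R-squared).

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 3))                         # synthetic data, not Hitters
y_demo = np.sin(X_demo[:, 0]) + 0.1 * rng.normal(size=120)

param_grid = {"C": np.logspace(-1, 3, 5)}                  # 0.1, 1, 10, 100, 1000
search = GridSearchCV(
    SVR(kernel="rbf"),
    param_grid,
    cv=5,
    scoring="neg_root_mean_squared_error",                 # higher (less negative) is better
)
search.fit(X_demo, y_demo)

best_C = search.best_params_["C"]
tuned = search.best_estimator_    # already refit on all of X_demo with best_C
print(best_C, tuned.C)
```

Because `refit=True` is the default, `best_estimator_` is ready to use for prediction, so a separate model does not have to be trained with the winning value.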

12. **Model Training and Evaluation (Tuned):** (Cells 39 to 46)
    * An SVR model is trained using the best 'C' parameter found during tuning, and its performance is evaluated on both the training and testing sets using RMSE and the R-squared score.

In [39]:
svr_tuned = SVR(
    kernel="rbf",
    C=svr_cv_model.best_params_["C"]  # read the tuned value by key, not by position
).fit(X_train, y_train)

In [40]:
svr_tuned


In [41]:
y_pred=svr_tuned.predict(X_train)


In [42]:
#Train Error
np.sqrt(mean_squared_error(y_train,y_pred))

np.float64(408.3299662279138)

In [43]:
r2_score(y_train,y_pred)


0.22973757918634996

In [44]:
y_pred=svr_tuned.predict(X_test)


In [45]:
#Test Error
np.sqrt(mean_squared_error(y_test,y_pred))

np.float64(319.81293180361496)

In [46]:
r2_score(y_test,y_pred)


0.2997239785888175
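The evaluation cells reuse the name `y_pred` for training and testing predictions in turn, which makes it easy to score against the wrong target. A small helper can compute both metrics for one split at a time. The sketch below is one possible shape for it: `report` is a hypothetical name, and the data is synthetic rather than the Hitters split.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.svm import SVR

def report(model, X, y, label):
    """Print and return (RMSE, R^2) for a fitted model (illustrative helper)."""
    pred = model.predict(X)
    rmse = float(np.sqrt(mean_squared_error(y, pred)))
    r2 = float(r2_score(y, pred))
    print(f"{label}: RMSE={rmse:.2f}, R2={r2:.3f}")
    return rmse, r2

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))                        # synthetic features
y_demo = X_demo[:, 0] ** 2 + 0.1 * rng.normal(size=100)   # synthetic target

model = SVR(kernel="rbf", C=10.0).fit(X_demo[:80], y_demo[:80])
train_scores = report(model, X_demo[:80], y_demo[:80], "train")
test_scores = report(model, X_demo[80:], y_demo[80:], "test")
```

Calling the helper once per model and split gives a side-by-side view of the initial and tuned models without overwriting intermediate predictions.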