# Used Car Sale Prices

The goal of this project is to create a model that accurately predicts the sale price of a used car.

The data used for this project was found on Kaggle.com, uploaded by Aditya. The data contains 9 csv files, with each file storing the information about one make of car, including Audi, BMW, Ford, Hyundai, Mercedes, Skoda, Toyota, Vauxhall and Volkswagen. 

In this project we shall undertake the following tasks:

0. Data and Package Imports
1. Exploratory Data Analysis
2. Data Preprocessing
3. Model Creation and Evaluation
4. Conclusions

## 0: Data and Package Imports

In this section we shall import the 9 different csv files and the Python libraries needed for data handling and visualisation. We shall first import the packages.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

We shall now import each csv file into a Pandas dataframe.

In [None]:
audi = pd.read_csv('/kaggle/input/used-car-dataset-ford-and-mercedes/audi.csv')
bmw = pd.read_csv('/kaggle/input/used-car-dataset-ford-and-mercedes/bmw.csv')
ford = pd.read_csv('/kaggle/input/used-car-dataset-ford-and-mercedes/ford.csv')
hyundai = pd.read_csv('/kaggle/input/used-car-dataset-ford-and-mercedes/hyundi.csv')
mercedes = pd.read_csv('/kaggle/input/used-car-dataset-ford-and-mercedes/merc.csv')
skoda = pd.read_csv('/kaggle/input/used-car-dataset-ford-and-mercedes/skoda.csv')
toyota = pd.read_csv('/kaggle/input/used-car-dataset-ford-and-mercedes/toyota.csv')
vauxhall = pd.read_csv('/kaggle/input/used-car-dataset-ford-and-mercedes/vauxhall.csv')
vw = pd.read_csv('/kaggle/input/used-car-dataset-ford-and-mercedes/vw.csv')

Rather than working with 9 different dataframes, it will be simpler to concatenate them into one larger dataframe. Let us first check the columns of each dataframe to ensure that the information stored about each vehicle is the same.

In [None]:
print("Columns in the Audi dataframe:") 
print(list(audi.columns))
print("-" * 50)
print("Columns in the BMW dataframe:")
print(list(bmw.columns))
print("-" * 50)
print("Columns in the Ford dataframe:")
print(list(ford.columns))
print("-" * 50)
print("Columns in the Hyundai dataframe:")
print(list(hyundai.columns))
print("-" * 50)
print("Columns in the Mercedes dataframe:")
print(list(mercedes.columns))
print("-" * 50)
print("Columns in the Skoda dataframe:")
print(list(skoda.columns))
print("-" * 50)
print("Columns in the Toyota dataframe:")
print(list(toyota.columns))
print("-" * 50)
print("Columns in the Vauxhall dataframe:")
print(list(vauxhall.columns))
print("-" * 50)
print("Columns in the VW dataframe:")
print(list(vw.columns))
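
The same check can be done programmatically rather than by reading nine printed lists. The following is a minimal sketch, assuming the dataframes are collected in a dictionary; the helper name `column_mismatches` and the toy frames are illustrative and not part of the original notebook.

```python
import pandas as pd

def column_mismatches(frames, reference):
    """Return, per dataframe, the columns missing from or extra to a reference frame."""
    expected = set(frames[reference].columns)
    report = {}
    for name, frame in frames.items():
        actual = set(frame.columns)
        if actual != expected:
            report[name] = {
                "missing": sorted(expected - actual),
                "extra": sorted(actual - expected),
            }
    return report

# Toy frames standing in for the real csv files (values are made up).
toy_frames = {
    "audi": pd.DataFrame({"model": ["A1"], "tax": [145], "price": [12500]}),
    "hyundai": pd.DataFrame({"model": ["I10"], "tax(£)": [145], "price": [7000]}),
}
mismatches = column_mismatches(toy_frames, reference="audi")
print(mismatches)
```

An empty dictionary would indicate that every dataframe shares the reference columns.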

We can see that the columns within each dataframe are the same, with the exception of the tax column, which is named "tax(£)" in the Hyundai dataframe. Let us rename this column so that we are able to join the dataframes together.

In [None]:
hyundai.rename({'tax(£)': 'tax'},axis=1,inplace=True)

Let us check the columns of the Hyundai dataframe now.

In [None]:
print(list(hyundai.columns))

We notice that the column names are now identical for each dataframe, meaning that we are able to merge the dataframes together. First, let us create a new column in each dataframe called "make", which is simply the name of the manufacturer who produced the car, so that this information is not lost in our new dataframe.

In [None]:
audi['make'] = 'Audi'
bmw['make'] = 'BMW'
ford['make'] = 'Ford'
hyundai['make'] = 'Hyundai'
mercedes['make'] = 'Mercedes'
skoda['make'] = 'Skoda'
toyota['make'] = 'Toyota'
vauxhall['make'] = 'Vauxhall'
vw['make'] = 'Volkswagen'

We are now able to join the dataframes into a single larger dataframe which contains all the information about every car within our dataset. 

In [None]:
df = pd.concat([audi, bmw, ford, hyundai, mercedes, skoda, toyota, vauxhall, vw], axis=0, ignore_index=True)
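
As an alternative to adding the "make" column one dataframe at a time, the labelling and concatenation can be combined. This is only a sketch on made-up data; the dictionary `toy_frames` is hypothetical and stands in for the nine dataframes above.

```python
import pandas as pd

# Toy stand-ins for the per-make dataframes (values are made up).
toy_frames = {
    "Audi": pd.DataFrame({"model": ["A1", "A3"], "price": [12500, 16500]}),
    "Ford": pd.DataFrame({"model": ["Fiesta"], "price": [9000]}),
}

# Label each frame with its make, then stack them with a fresh 0..n-1 index.
combined = pd.concat(
    [frame.assign(make=make) for make, frame in toy_frames.items()],
    ignore_index=True,
)
print(combined)
```

Using `assign` returns a labelled copy, so the original per-make dataframes are left unchanged.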

Let us now check the info of the new dataframe.

In [None]:
df.info()

We can see that our dataframe contains data on nearly 100000 used cars from across the UK. Let us reorder the columns so that the data is presented in a logical order.

In [None]:
df = df[['make','model','year','fuelType','mileage','engineSize','transmission','mpg','tax','price']]

Let us now check the head of the dataframe.

In [None]:
df.head()

Here we can see some examples of the types of data stored within each column. We note that the "make", "model", "fuelType" and "transmission" variables are stored in the "object" format, which means we will have to use label encoding or dummy variables in order to input them into our machine learning algorithms.

## 1: Exploratory Data Analysis

In this section we shall begin to explore the data in order to identify any key relationships.

### 1.1: Understanding the Variables and Cleaning the Dataset

Let us begin investigating the range of different values possible for each variable.

In [None]:
df.nunique(axis=0)

We can see that, particularly for our categorical variables, there is a range of different values that could be taken. Let us investigate these further.

In [None]:
print('Unique values for "fuelType" column:', sorted(list(df['fuelType'].unique())))
print('Unique values for "transmission" column:', sorted(list(df['transmission'].unique())))


The options shown within these categorical variables are completely independent and we are therefore not able to reduce the number of categories within these features. Let us now investigate the numerical columns.

In [None]:
df.describe().apply(lambda s: s.apply(lambda x: format(x, 'f')))

#### Year

We immediately notice that there seems to be an issue with the "year" column, with at least 1 vehicle having a value of 2060. Let us look at that datapoint.

In [None]:
df[df['year'] == 2060]

We shall remove this entry from our dataset, as it is difficult to determine when this vehicle was first registered.

In [None]:
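# Select the row by its value rather than a hard-coded position, so the cell stays correct if the row order changes.
df = df.drop(df[df['year'] == 2060].index)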

We also notice that at least 1 vehicle was first registered in 1970. Let us investigate this date in more detail.

In [None]:
df[df['year'] == 1970]

Let us remove these two vehicles from the dataset also.

In [None]:
indexNames = df[df['year'] == 1970].index
df = df.drop(indexNames)

#### Engine Size

From the output of the describe method above, we also observe that there are some vehicles recorded with an engine size of 0. This is obviously impossible and these vehicles must be investigated. Let us first look at which vehicles have been recorded with this value.

In [None]:
df[df['engineSize'] == 0]

We can see that there are 272 vehicles that supposedly have an engine size of 0. Let us determine what percentage of our dataset this is.

In [None]:
len(df[df['engineSize'] == 0]) * 100 / len(df)

We see that these vehicles account for well under half a percent of the total dataset. As a result, we can remove them.

In [None]:
engineIndex = df[df['engineSize']==0].index
df = df.drop(engineIndex)

Now that the vehicles with an engine size of 0 have been removed, let us investigate the vehicles with particularly low mileage. 

#### Mileage

We observe that there are vehicles with a recorded mileage of 1. Let us find these vehicles.

In [None]:
df[df['mileage']==1]

Some of these vehicles were first registered in 2020 and therefore a mileage of 1 is understandable. However, for vehicles registered before this, a mileage value this low does not make sense. Let us see what percentage of our dataset consists of vehicles registered in 2019 or before that have a mileage figure of 1.

In [None]:
len(df[(df['mileage']==1) & (df['year']<= 2019)]) * 100 / len(df)

Once again, this is such a small percentage of our entire dataset that removing them will not affect our ability to accurately predict the prices of used cars. As a result, these vehicles shall be removed.

In [None]:
mileageIndex = df[(df['mileage']==1) & (df['year']<= 2019)].index
df = df.drop(mileageIndex)

#### Tax

Since tax payments are made on almost all vehicles purchased in the UK, let us now investigate the vehicles that have a recorded tax value of 0.

In [None]:
df[df['tax'] == 0]

In [None]:
len(df[df['tax'] == 0]) * 100 / len(df)

Since car tax payments within the UK are based on the age of the vehicle, its CO2 emissions, and various other factors, we are not able to impute these values with estimates of the tax that should be paid. As a result, since the vehicles in question account for only 6% of the total dataset, and our dataset is rather large, we can simply remove these cars.

In [None]:
taxindex = df[df['tax']==0].index
df = df.drop(taxindex)

#### MPG

Finally, let us now investigate vehicles that have a recorded MPG value of less than 5.

In [None]:
df[df['mpg'] < 5]

We notice that all of these vehicles, except for the Volkswagen Golf SV, have either Diesel or Hybrid engines. The diesel vehicles within this subset of data are all pickup trucks and as a result the low mpg figure is understandable and could easily be correct. The hybrid vehicles do not solely depend on their petrol or diesel engine as a result of the combination with electricity, which could explain the low mpg figure in these cases. In the case of the Volkswagen Golf SV, the low mpg figure is difficult to explain and as a result we shall drop this vehicle from the dataset.

In [None]:
df = df.drop(df[df['mpg']==0.3].index)

Let us now check the summary statistics of our dataframe again.

In [None]:
df.describe().apply(lambda s: s.apply(lambda x: format(x, 'f')))

Let us now check that we have no missing entries within our dataset.

In [None]:
df.isnull().sum()

We do not have any missing entries within our dataset. Our data has been successfully cleaned.
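
The row filters applied in this section can also be collected into a single function, which makes the cleaning easier to re-run. This is a sketch on made-up rows, not the notebook's own code: the function name `clean_cars` and the year bounds `min_year` and `max_year` are assumptions chosen to exclude the 1970 and 2060 entries, and the single MPG removal is left out.

```python
import pandas as pd

def clean_cars(df, min_year=1971, max_year=2020):
    """Apply the row filters described in this section using one boolean mask."""
    keep = (
        df["year"].between(min_year, max_year)             # drops the 1970 and 2060 entries
        & (df["engineSize"] != 0)                          # an engine size of 0 is impossible
        & ~((df["mileage"] == 1) & (df["year"] <= 2019))   # implausibly low mileage for older cars
        & (df["tax"] != 0)                                 # zero tax values are not imputed
    )
    return df[keep].reset_index(drop=True)

# Made-up rows, one per rule, plus two rows that should survive.
toy = pd.DataFrame({
    "year":       [2017, 2060, 1970, 2018, 2016, 2020, 2015],
    "engineSize": [1.4,  1.0,  1.4,  0.0,  1.6,  1.0,  2.0],
    "mileage":    [15000, 5000, 37000, 9000, 1, 1, 60000],
    "tax":        [145,  145,  200,  145,  30,  145,  0],
})
cleaned = clean_cars(toy)
print(cleaned)
```

Only the 2017 car and the 2020 car with a mileage of 1 remain in this toy example.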

### 1.2: Analysing Relationships Between Variables

#### 1.2.1: Numerical Variables

In this section we shall analyse the relationships between the variables in our dataset. We shall start by creating a correlation heatmap of the numerical variables.

In [None]:
plt.figure(figsize=(10,8))
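# Restrict to numeric columns; recent pandas versions raise an error if text columns are passed to corr().
sns.heatmap(df.select_dtypes(include='number').corr(),annot=True)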

Firstly we notice that there is a strong positive correlation between year and price and a strong negative correlation between mileage and price. This makes sense, since newer cars are generally more expensive and cars with more mileage are relatively cheaper. We also notice a negative correlation between mileage and year: the newer a car is, the fewer miles it is likely to have travelled. Furthermore, we notice a positive correlation between engine size and price, as well as between engine size and tax. This follows expectation, since it is common practice for manufacturers to sell models with larger engines for a higher price in comparison to the same model with a smaller engine. As a result, due to the higher price, a larger tax payment may be required, hence the positive correlation. This may also explain the positive correlation between tax and price.
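
To read the correlations with the target without scanning the whole heatmap, the price column of the correlation matrix can be sorted. The sketch below uses synthetic data in which price is constructed to fall with mileage and rise with engine size; the coefficients are arbitrary assumptions and not estimates from the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
# Synthetic data: price falls with mileage and rises with engine size.
mileage = rng.uniform(1_000, 100_000, n)
engine_size = rng.choice([1.0, 1.5, 2.0, 3.0], n)
price = 20_000 - 0.1 * mileage + 4_000 * engine_size + rng.normal(0, 1_000, n)
toy = pd.DataFrame({"mileage": mileage, "engineSize": engine_size,
                    "price": price, "fuelType": "Petrol"})

# Keep numeric columns only, then rank features by their correlation with price.
price_corr = (
    toy.select_dtypes(include="number")
       .corr()["price"]
       .drop("price")
       .sort_values()
)
print(price_corr)
```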

Let us highlight these observations through the use of scatterplots.

In [None]:
sns.scatterplot(x='mileage',y='price',data=df)
plt.title('Scatter plot of Mileage against Price')

We notice that the first miles driven have the largest negative impact on the price. This can be seen since the slope on the plot is much steeper at low mileage, while the rate of decrease of the price reduces as the mileage increases.

In [None]:
sns.scatterplot(x='engineSize',y='price',data=df)
plt.title('Scatter Plot of Engine Size against Price')

We clearly see that as the engine size of the vehicle increases, the price tends to increase too. Let us produce a pairplot to discover any relationships that have not been noticed yet.

In [None]:
sns.pairplot(df)

There are no clear and obvious variable relationships shown here that have not already been discussed. 

#### 1.2.2: Categorical Variables

In this section we shall attempt to identify any key trends between our categorical variables and the target variable.

##### 1.2.2.1: Make and Model

Let us investigate how the make and model of the vehicle affects the price. 

In [None]:
plt.figure(figsize=(10,8))
sns.boxplot(x='make',y='price',data=df)

We notice that the German made cars, namely Audi, BMW, Mercedes and Volkswagen, all have a higher price on average than the rest of the manufacturers within the dataset. It appears that there are vehicles from Hyundai and Skoda where the price seems to be an outlier. Let us investigate these two vehicles, starting with the Hyundai.

In [None]:
df[(df['make'] == 'Hyundai') & (df['price'] > 80000)]

After doing some research, it can be seen that this make and model of vehicle would cost a customer in the region of £15,000 to buy brand new. The price of £92,000 for this 3 year old version is therefore clearly a mistake, and we shall drop this point from the dataset.

In [None]:
hyundai_error = df[(df['make'] == 'Hyundai') & (df['price'] > 80000)].index
df = df.drop(hyundai_error)

Let us now investigate the problematic Skoda.

In [None]:
df[(df['make'] == 'Skoda') & (df['price']> 80000)]

Once again, it can be found that a brand new model of this car costs around £25,000 to purchase. Similarly to above, the price entered here is clearly in error and we shall remove this point from the dataset also.

In [None]:
skoda_error = df[(df['make'] == 'Skoda') & (df['price']> 80000)].index
df = df.drop(skoda_error)

Within this dataset we have over 190 different models of vehicle. Therefore, producing box plots to investigate this in detail would be too complex. However, it is somewhat obvious that the model purchased does have a clear effect on the price of the vehicle.
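
Although box plots for every model are impractical, a grouped summary gives a compact view of how price varies by model. The following is a sketch on made-up prices; the dataframe `toy` is hypothetical and only illustrates the `groupby` pattern.

```python
import pandas as pd

# Made-up prices for three models.
toy = pd.DataFrame({
    "model": ["Fiesta", "Fiesta", "Fiesta", "A3", "A3", "X5"],
    "price": [8000, 9000, 10000, 15000, 17000, 40000],
})

# Number of listings and median price per model, most expensive first.
model_summary = (
    toy.groupby("model")["price"]
       .agg(["count", "median"])
       .sort_values("median", ascending=False)
)
print(model_summary)
```

Including the count alongside the median helps to spot models with too few listings for the median to be reliable.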

##### 1.2.2.2: Fuel Type

Let us determine the effect that the fuel type of a vehicle has on its price.

In [None]:
sns.boxplot(x='fuelType',y='price',data=df)

We can see that, on average, petrol vehicles are cheaper to purchase than vehicles with different fuel types. Hybrid vehicles are the most expensive to purchase on average, possibly due to the advanced technology required in order to merge petrol and electric motors. We can clearly observe that the fuel type is an important feature in determining a vehicle's sale price.

##### 1.2.2.3: Transmission

Let us determine if the transmission type of a vehicle has an influence on the sale price.

In [None]:
sns.boxplot(x='transmission',y='price',data=df)

The first observation here is that vehicles with manual transmission tend to be cheaper to purchase than other transmission types. This may be due to the advanced resources required to design and implement automatic transmission systems. We can clearly see that this feature has a significant influence on the price.

#### 1.2.3: Analysing the Distribution of Numerical Variables

Many regression models perform better when the numerical features are approximately normally distributed. To check this, we produce histograms and see whether they follow the "bell" shaped curve.

##### 1.2.3.1: Price

In [None]:
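# distplot is deprecated in recent seaborn releases; histplot with kde=True is the replacement.
sns.histplot(df['price'],bins=50,kde=True)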

We can see that our target variable is extremely positively skewed. We shall apply a log transformation of this feature in the data preprocessing section.

##### 1.2.3.2: Year

In [None]:
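# distplot is deprecated in recent seaborn releases; histplot with kde=True is the replacement.
sns.histplot(df['year'],bins=50,kde=True)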

In this scenario, we notice that our data is negatively skewed.

##### 1.2.3.3: Mileage

In [None]:
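# distplot is deprecated in recent seaborn releases; histplot with kde=True is the replacement.
sns.histplot(df['mileage'],bins=50,kde=True)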

Our mileage data is positively skewed.

##### 1.2.3.4: Tax

In [None]:
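# distplot is deprecated in recent seaborn releases; histplot with kde=True is the replacement.
sns.histplot(df['tax'],bins=50,kde=True)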

We have slight positive skew in this case.

##### 1.2.3.5: MPG

In [None]:
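# distplot is deprecated in recent seaborn releases; histplot with kde=True is the replacement.
sns.histplot(df['mpg'],bins=50,kde=True)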

We observe a positive skew for our "MPG" data.

##### 1.2.3.6: Engine Size

In [None]:
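# distplot is deprecated in recent seaborn releases; histplot with kde=True is the replacement.
sns.histplot(df['engineSize'],bins=50,kde=True)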

We notice positive skew in this feature too.

## 2: Data Preprocessing

In this section, we shall deal with skewed data and create dummy variables for our categorical features. 

### 2.1: Dealing with Skewed Data

Let us apply a log transform to our numerical columns in an attempt to reduce skewness.

In [None]:
df['price'] = np.log(df['price'])
df['year'] = np.log(df['year'])
df['mileage'] = np.log(df['mileage'])
df['tax'] = np.log(df['tax'])
df['mpg'] = np.log(df['mpg'])
df['engineSize'] = np.log(df['engineSize'])
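
Whether a log transform actually reduces skewness can be checked numerically with the pandas `skew` method. The sketch below uses synthetic log-normal "prices" rather than the real data, so the printed values are illustrative only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Synthetic right-skewed "prices" (log-normal), not the real data.
toy_price = pd.Series(rng.lognormal(mean=9.5, sigma=0.5, size=5_000))

skew_before = toy_price.skew()          # clearly positive for right-skewed data
skew_after = np.log(toy_price).skew()   # close to 0 if the log removes the skew
print(f"skew before log: {skew_before:.2f}, after log: {skew_after:.2f}")
```

Applying the same comparison to each column of the real dataframe would show which features benefit from the transform and which, such as a negatively skewed feature, may not.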

### 2.2: Creating Dummy Variables

In order to use our categorical variables in the machine learning algorithms, we must create dummy variables for them. However, let us begin by removing the "make" column from our dataset since we can infer this information from the vehicle's "model".

In [None]:
df = df.drop('make',axis=1)

We can now create dummy variables for the categorical features within our dataset. We must set the parameter "drop_first" to be true in order to reduce multicollinearity.

In [None]:
# A prefix keeps the dummy column names unique if two features share a level name.
transmission = pd.get_dummies(df['transmission'],prefix='transmission',drop_first=True)
model = pd.get_dummies(df['model'],prefix='model',drop_first=True)
fueltype = pd.get_dummies(df['fuelType'],prefix='fuelType',drop_first=True)
df = pd.concat([df,transmission,model,fueltype],axis=1)
df = df.drop(['transmission','model','fuelType'],axis=1)

Let us check the head of the dataframe to ensure that the dummy variables were created successfully.

In [None]:
df.head()
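
The effect of "drop_first" is easiest to see on a tiny example. In the sketch below, the toy dataframe and its "Other" level, shared by both features, are hypothetical. Passing `columns=` to `get_dummies` prefixes each dummy with its source column, which keeps identically named levels apart.

```python
import pandas as pd

toy = pd.DataFrame({
    "transmission": ["Manual", "Automatic", "Other"],
    "fuelType": ["Petrol", "Diesel", "Other"],
})

# One dummy per level, minus the first (alphabetical) level of each feature.
dummies = pd.get_dummies(toy, columns=["transmission", "fuelType"], drop_first=True)
print(list(dummies.columns))
```

The dropped levels ("Automatic" and "Diesel" here) become the baseline: a row with all dummies of a feature equal to zero belongs to that level.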

### 2.3: Creating Training and Test Sets

We must now create training and test sets for our data.

In [None]:
X = df.drop('price',axis=1)
y = df['price']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,random_state=101)

Our dataset is now ready for use in the machine learning algorithms.

## 3: Model Creation and Evaluation

In this section we shall implement a range of machine learning algorithms in order to predict the prices of used cars, whilst simultaneously investigating the effects of scaling our data.

### 3.1: Non-scaled Data

#### 3.1.1: Linear Regression

The first model that we shall implement will be a linear regression model. First we must fit our model to the training data.

In [None]:
from sklearn.linear_model import LinearRegression
linreg = LinearRegression()
linreg.fit(X_train, y_train)

We can now create predictions using our trained model.

In [None]:
linreg_preds = linreg.predict(X_test)

Since this is a continuous regression problem, our key metrics for analysis purposes are the root mean squared error and the R2 score. Let us import these metrics from scikit-learn and use them to analyse our linear regression model. 

In [None]:
from sklearn.metrics import r2_score, mean_squared_error
linreg_r2 = r2_score(np.exp(y_test),np.exp(linreg_preds))
linreg_RMSE = np.sqrt(mean_squared_error(np.exp(y_test),np.exp(linreg_preds)))
print("Linear Regression R2 Score: {}".format(linreg_r2))
print("Linear Regression RMSE: {}".format(linreg_RMSE))

Our R2 score of 0.9 is very good: the model explains roughly 90% of the variance in the price of a used car using the independent variables included here. Our RMSE value of roughly £3000 is large, but in comparison to the average price of a car in our dataset, this value is reasonable. One reason for this larger value may be that some vehicles with an extremely large price remain in the dataset. These would be considered outliers under the usual box plot rule, since they lie more than 1.5 times the interquartile range above the upper quartile.
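
To make the two metrics concrete, they can be computed by hand: the RMSE is the square root of the mean squared residual, and R2 is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses three made-up prices and mirrors the notebook's approach of exponentiating log-scale values before scoring; the helper names `rmse` and `r2` are illustrative.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1 - ss_res / ss_tot)

# Made-up log prices; the third prediction is £6,000 too high on the price scale.
log_true = np.log([10_000.0, 20_000.0, 30_000.0])
log_pred = np.log([10_000.0, 20_000.0, 36_000.0])

# Back-transform to pounds before scoring, as in the cells above.
toy_rmse = rmse(np.exp(log_true), np.exp(log_pred))
toy_r2 = r2(np.exp(log_true), np.exp(log_pred))
print(f"RMSE: {toy_rmse:.2f}, R2: {toy_r2:.2f}")
```

Because the errors are squared, a single large miss on an expensive car can dominate the RMSE, which is consistent with the remark about high-priced outliers.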

#### 3.1.2: Decision Tree

The process of training a machine learning model and generating predictions will be the same as described above for all machine learning algorithms implemented from this point forward. We shall now implement the decision tree algorithm.

In [None]:
from sklearn.tree import DecisionTreeRegressor
dtr = DecisionTreeRegressor()
dtr.fit(X_train, y_train)
dtr_preds = dtr.predict(X_test)
dtr_r2 = r2_score(np.exp(y_test),np.exp(dtr_preds))
dtr_RMSE = np.sqrt(mean_squared_error(np.exp(y_test),np.exp(dtr_preds)))
print("Decision Tree R2 Score: {}".format(dtr_r2))
print("Decision Tree RMSE: {}".format(dtr_RMSE))

The decision tree regressor has managed to explain just over 93% of the variance within the price feature based on the independent variables used in the model. We have also managed to reduce the RMSE by approximately £500 in comparison to the Linear Regression model. 
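
Since every model in this section follows the same fit, predict and score steps, the pattern can be wrapped in a helper. This is a sketch under stated assumptions: the function name `evaluate` and the synthetic data are not part of the original notebook, and the target is assumed to be on the log scale as above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

def evaluate(model, X_train, X_test, y_train_log, y_test_log):
    """Fit on log prices, then score on the original price scale."""
    model.fit(X_train, y_train_log)
    preds = np.exp(model.predict(X_test))
    actual = np.exp(y_test_log)
    return {
        "r2": r2_score(actual, preds),
        "rmse": float(np.sqrt(mean_squared_error(actual, preds))),
    }

# Synthetic features and log prices with a nearly exact linear relationship.
rng = np.random.default_rng(101)
X = rng.normal(size=(200, 2))
y_log = 9.0 + 0.5 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(0, 0.01, 200)
X_tr, X_te, y_tr, y_te = train_test_split(X, y_log, test_size=0.25, random_state=101)

scores = evaluate(LinearRegression(), X_tr, X_te, y_tr, y_te)
print(scores)
```

Any scikit-learn regressor can be passed in place of `LinearRegression()`, which keeps the scoring identical across models.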

#### 3.1.3: Random Forest

In [None]:
from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor()
rfr.fit(X_train, y_train)
rfr_preds = rfr.predict(X_test)
rfr_r2 = r2_score(np.exp(y_test),np.exp(rfr_preds))
rfr_RMSE = np.sqrt(mean_squared_error(np.exp(y_test),np.exp(rfr_preds)))
print("Random Forest R2 Score: {}".format(rfr_r2))
print("Random Forest RMSE: {}".format(rfr_RMSE))

Our random forest regressor manages to explain approximately 96% of the variance within the dependent variable. The root mean squared error is also the lowest that we have seen, at approximately £1800, which equates to roughly 10% error based on the average price of vehicles within the dataset.

#### 3.1.4: Support Vector Regression

In [None]:
from sklearn.svm import SVR
svr = SVR()
svr.fit(X_train, y_train)
svr_preds = svr.predict(X_test)
svr_r2 = r2_score(np.exp(y_test),np.exp(svr_preds))
svr_RMSE = np.sqrt(mean_squared_error(np.exp(y_test),np.exp(svr_preds)))
print("Support Vector Regression R2 Score: {}".format(svr_r2))
print("Support Vector Regression RMSE: {}".format(svr_RMSE))

We observe that Support Vector Regression is currently the worst model we have implemented, achieving an R2 score of approximately 87% and an RMSE of approximately £3500.

#### 3.1.5: MLP Regressor

In [None]:
from sklearn.neural_network import MLPRegressor
mlp = MLPRegressor()
mlp.fit(X_train, y_train)
mlp_preds = mlp.predict(X_test)
mlp_r2 = r2_score(np.exp(y_test),np.exp(mlp_preds))
mlp_RMSE = np.sqrt(mean_squared_error(np.exp(y_test),np.exp(mlp_preds)))
print("MLP Regressor R2 Score: {}".format(mlp_r2))
print("MLP Regressor RMSE: {}".format(mlp_RMSE))

Our MLP regressor managed to achieve an R2 score of approximately 93%, but with an RMSE of around £2600.

#### 3.1.6: Summary of Findings

In [None]:
d = {'Model': ['Linear Regression', 'Decision Tree', 'Random Forest', 'Support Vector Regressor', 'MLP Regressor'],
    'R2 Score': [linreg_r2, dtr_r2, rfr_r2, svr_r2, mlp_r2],
    'RMSE': [linreg_RMSE, dtr_RMSE, rfr_RMSE, svr_RMSE, mlp_RMSE]}
results = pd.DataFrame(data=d)
results

We can see that our random forest model achieved both the best R2 score and RMSE. The support vector regressor achieved the worst R2 score. Let us plot this information for visual understanding.

In [None]:
plt.figure(figsize=(10,6))
sns.barplot(x='Model',y='R2 Score',data=results,order=['Support Vector Regressor', 'Linear Regression', 'MLP Regressor','Decision Tree','Random Forest'])
plt.title('R2 Score for Each Model')

In [None]:
sns.scatterplot(x='R2 Score',y='RMSE',data=results,hue='Model')

### 3.2: Scaled Data

In this section, we shall scale the data and refit the decision tree, random forest and support vector regression models from above. We shall use the Standard Scaler to scale our training and test sets.

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
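
Overwriting `X_train` and `X_test` means the unscaled cells in section 3.1 can no longer be re-run as they were. One alternative, sketched here on synthetic data, is to wrap the scaler and the model in a scikit-learn pipeline so the scaling is fitted on the training data only and the original arrays are left untouched. The feature values and coefficients are assumptions for illustration.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(101)
# Two synthetic features on very different scales ("year" and "mileage").
X_toy = rng.normal(loc=[2015, 30_000], scale=[3, 15_000], size=(120, 2))
y_toy = 9.5 + 0.1 * (X_toy[:, 0] - 2015) - 1e-5 * X_toy[:, 1]

X_tr, X_te = X_toy[:90], X_toy[90:]
y_tr = y_toy[:90]

# The scaler is fitted inside the pipeline, on the training rows only.
scaled_svr = make_pipeline(StandardScaler(), SVR())
scaled_svr.fit(X_tr, y_tr)
svr_preds = scaled_svr.predict(X_te)
print(svr_preds[:5])
```

The fitted scaler remains available through `scaled_svr.named_steps["standardscaler"]` if its means or variances need inspecting.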

Let us now implement the same models from section 3.1 above.

#### 3.2.1: Decision Tree

In [None]:
s_dtr = DecisionTreeRegressor()
s_dtr.fit(X_train,y_train)
s_dtr_preds = s_dtr.predict(X_test)
s_dtr_r2 = r2_score(np.exp(y_test),np.exp(s_dtr_preds))
s_dtr_RMSE = np.sqrt(mean_squared_error(np.exp(y_test),np.exp(s_dtr_preds)))
print("Scaled Decision Tree R2 Score: {}".format(s_dtr_r2))
print("Scaled Decision Tree RMSE: {}".format(s_dtr_RMSE))

Our decision tree model manages to explain 93% of the variance in our target variable, whilst producing a root mean squared error of approximately £2500.

#### 3.2.2: Random Forest

In [None]:
s_rfr = RandomForestRegressor()
s_rfr.fit(X_train, y_train)
s_rfr_preds = s_rfr.predict(X_test)
s_rfr_r2 = r2_score(np.exp(y_test),np.exp(s_rfr_preds))
s_rfr_RMSE = np.sqrt(mean_squared_error(np.exp(y_test),np.exp(s_rfr_preds)))
print("Scaled Random Forest R2 Score: {}".format(s_rfr_r2))
print("Scaled Random Forest RMSE: {}".format(s_rfr_RMSE))

The random forest regressor on our scaled training and test sets achieves an R2 score of approximately 96%, with a root mean squared error of roughly £1850.

#### 3.2.3: Support Vector Regression

In [None]:
s_svr = SVR()
s_svr.fit(X_train, y_train)
s_svr_preds = s_svr.predict(X_test)
s_svr_r2 = r2_score(np.exp(y_test),np.exp(s_svr_preds))
s_svr_RMSE = np.sqrt(mean_squared_error(np.exp(y_test),np.exp(s_svr_preds)))
print("Scaled Support Vector Regression R2 Score: {}".format(s_svr_r2))
print("Scaled Support Vector Regression RMSE: {}".format(s_svr_RMSE))

Our support vector regression model based on the scaled data achieves an R2 score of approximately 92% and an RMSE value of roughly £2300.

Let us join these results into a dataframe for a more direct comparison.

In [None]:
d = {'Model': ['Scaled Decision Tree', 'Scaled Random Forest', 'Scaled Support Vector Regressor'],
    'R2 Score': [s_dtr_r2,s_rfr_r2,s_svr_r2],
    'RMSE': [s_dtr_RMSE,s_rfr_RMSE,s_svr_RMSE]}
scaled_results = pd.DataFrame(d)
scaled_results

We notice that our scaled random forest regressor is able to explain the most variance within the price of used cars in comparison to the other models built using the scaled data. We also have the lowest root mean squared error as a result of using this model.

Let us now analyse the results obtained by all of the models we have implemented so far.

In [None]:
full_results = pd.concat([results,scaled_results],ignore_index=True)
full_results

We observe that in both the unscaled and scaled sets, the random forest regressor performed better than all of the other types of regressors implemented within this project. In the case of the random forest regressor, scaling the data changed the results only slightly, and the decision tree results were similarly close. Tree-based models depend only on the ordering of the values within each feature, so differences of this size are most likely due to random variation between fits, since no random_state was set for these models. However, scaling the data led to an approximate 7% increase in the amount of variance explained by the support vector regression model.
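
The limited effect of scaling on tree-based models can be checked directly. The sketch below fits two decision trees with the same `random_state`, one on raw features and one on standardised features, and compares their test predictions. The synthetic data and coefficients are assumptions, and the expectation that the predictions agree rests on trees using only the ordering of each feature.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
# Synthetic features: "age" in years and "mileage" in thousands of miles.
X_toy = np.column_stack([rng.uniform(0, 10, 300), rng.uniform(0, 100, 300)])
y_toy = 10.0 - 0.1 * X_toy[:, 0] - 0.01 * X_toy[:, 1] + rng.normal(0, 0.1, 300)
X_tr, X_te, y_tr = X_toy[:200], X_toy[200:], y_toy[:200]

scaler = StandardScaler().fit(X_tr)
tree_raw = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)
tree_scaled = DecisionTreeRegressor(random_state=0).fit(scaler.transform(X_tr), y_tr)

preds_raw = tree_raw.predict(X_te)
preds_scaled = tree_scaled.predict(scaler.transform(X_te))
# Fraction of test rows on which the two trees give the same prediction.
agreement = float(np.mean(np.isclose(preds_raw, preds_scaled)))
print("Fraction of matching predictions:", agreement)
```

Fixing `random_state` in the notebook's tree and forest models would remove run-to-run variation and make the scaled and unscaled results directly comparable.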