# _Regression notebook_

# <ins> A. Task definition and general information </ins> 

## Regression task - Predict Car Selling Price

### general knowledge about the dataset:
This dataset contains information about used cars listed on www.cardekho.com and published on Kaggle.

The columns in the given dataset are as follows:

- Car_Name
- Year (of manufacture)
- Selling_Price
- Present_Price
- Kms_Driven
- Fuel_Type
- Seller_Type
- Transmission
- Owner

# <ins> B. Basic familiarity with the Dataset </ins>

### imports

In [None]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn import metrics
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import KFold
from sklearn.model_selection import train_test_split

from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error as MAE
from sklearn.ensemble import RandomForestRegressor
from sklearn.dummy import DummyRegressor

plt.style.use('seaborn')

In [None]:
df = pd.read_csv("../input/vehicle-dataset-from-cardekho/car data.csv")
df.rename(columns = {'Owner':'Past_Owners'},inplace = True)
df

## attributes information

<br>
let's take a look at the attributes categories:

In [None]:
df.info()

In [None]:
df.isnull().any()

### Basic information about the data:

In [None]:
df.describe(include='all')

### Let's take a look at the histograms of the numeric attributes:

In [None]:
attributes_hist = df[["Kms_Driven", "Present_Price", "Selling_Price", "Year"]].hist(bins=50, figsize=(20,15))
attributes_hist

#### Let's look at the distributions of the categorical attributes (as pie charts):


In [None]:
fig, ax = plt.subplots(2,2, figsize = (12,12))
((ax1, ax2), (ax3, ax4)) = ax

labels = df['Fuel_Type'].value_counts().index.tolist()
values = df['Fuel_Type'].value_counts().tolist()
ax1.pie(x=values, labels=labels, autopct="%1.2f%%", shadow=True, explode=[0, 0.2, 0.2])
ax1.set_title("Fuel Type:", fontdict={'fontsize': 14})

labels = df['Seller_Type'].value_counts().index.tolist()[:2]
values = df['Seller_Type'].value_counts().tolist()[:2]
ax2.pie(x=values, labels=labels, autopct="%1.2f%%", shadow=True, explode=[0, 0.2])
ax2.set_title("Seller Type:", fontdict={'fontsize': 14})

labels = df['Transmission'].value_counts().index.tolist()[:2]
values = df['Transmission'].value_counts().tolist()
ax3.pie(x=values, labels=labels, autopct="%1.2f%%", shadow=True, explode=[0, 0.2])
ax3.set_title("Transmission:", fontdict={'fontsize': 14})

labels = df['Past_Owners'].value_counts().index.tolist()
values = df['Past_Owners'].value_counts().tolist()
ax4.pie(x=values, labels=labels, autopct="%1.2f%%", shadow=True, explode=[0, 0.2, 0.2])
ax4.set_title("number of Past Owners:", fontdict={'fontsize': 20})


# <ins> C. Clean and prepare the data </ins>  

### - Unique values
#### As we can see, the 'Fuel_Type' attribute has one category (CNG) with only 2 observations. Because there are so few, I will remove these observations.

In [None]:
print(df['Fuel_Type'].value_counts())

In [None]:
df = df[df['Fuel_Type'] != "CNG"]
print(df['Fuel_Type'].value_counts())

## - Handling text and categorical attributes

#### First of all, I will use the "get_dummies" function to convert every categorical attribute into indicator columns.

In [None]:
df_copy = df.copy() # save for later use
df = pd.get_dummies(df, columns=['Fuel_Type', 'Seller_Type', 'Transmission'])
df

#### The Year column does not generalize well, so I will convert it to the car's age, which is more informative.

In [None]:
df['Car_Age']= 2019-df['Year'] # the dataset is from 2019

#### I will drop the names and year columns.
In theory the names could give us useful information, but we have only about 300 rows and 98 unique names, so not in this case.

In [None]:
df.drop(columns=['Car_Name'], inplace=True)
df.drop(columns=['Year'], inplace=True)

In [None]:
df.head(3)

In [None]:
pd.DataFrame(data={'features': df.columns})

<br>

# <ins>D. Dig into the DATA - correlations and patterns</ins>

## Let's try to uncover some patterns.

#### Although linear correlations are not the only relationships we can find, they give us a good start. I will use Pearson's correlation coefficient in the following matrices.
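
As a small illustration of what Pearson's coefficient measures, here is a sketch that computes it by hand and compares it with NumPy. The arrays `x` and `y` are made-up values for illustration, not taken from the car dataset.

```python
import numpy as np

# Made-up, almost perfectly linear toy data (an assumption for illustration)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

# Pearson's r: covariance of the centered values divided by the product of their spreads
x_centered = x - x.mean()
y_centered = y - y.mean()
r_manual = (x_centered * y_centered).sum() / np.sqrt(
    (x_centered ** 2).sum() * (y_centered ** 2).sum()
)

# np.corrcoef returns the full 2x2 correlation matrix; take the off-diagonal entry
r_numpy = np.corrcoef(x, y)[0, 1]

print(r_manual, r_numpy)
```

A value close to +1 or -1 indicates a strong linear relationship, while a value near 0 only says that there is no *linear* one.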

### Correlation Matrix:

In [None]:
cmap = sns.diverging_palette(30, 230, 90, 20, as_cmap=True)
fig, ax = plt.subplots(figsize=(12,12))
sns.heatmap(df.corr(),annot=True, cmap=cmap)
sns.set(font_scale=1)

#### high correlations with selling price:

In [None]:
corr_matrix = df.corr()
corralations = corr_matrix['Selling_Price'].sort_values(ascending = False) 
high_corr = (corralations > 0.2)|(corralations < -0.2)
pd.DataFrame(corralations[high_corr])
corralations[high_corr].index

- Here we can see the features which have a notable linear correlation (absolute value above 0.2) with Selling Price.

### Heatmap of the correlations whose absolute value is greater than 0.2:

In [None]:
print("heatmap of the high correlations with Selling Price:")
fig, ax = plt.subplots(figsize=(12,12))
sns.heatmap(df[corralations[high_corr].index].corr(),annot=True, cmap=cmap)
sns.set(font_scale=1)

### categorical features correlations:

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(15, 15))
fig.suptitle('categorical features:')

sns.swarmplot(ax=axes[0,0], x="Fuel_Type", y="Selling_Price", data=df_copy)
sns.swarmplot(ax=axes[0,1], x="Seller_Type", y="Selling_Price", data=df_copy)
sns.swarmplot(ax=axes[1,0], x="Transmission", y="Selling_Price", data=df_copy)
sns.swarmplot(ax=axes[1,1], x="Past_Owners", y="Selling_Price", data=df_copy)

### numerical features correlations:

In [None]:
sns.pairplot(df[['Selling_Price', 'Present_Price', 'Kms_Driven', 'Car_Age']], kind='reg')

#### To get a better understanding of the effect of Age, I will plot it another way:

In [None]:
print('This bar plot represents an estimate of central tendency for a Selling-Price with the height of each rectangle and provides some indication of the uncertainty around that estimate price using error bars.')
fig = plt.figure(figsize=(10,5))
# pass x and y as keywords; recent seaborn versions no longer accept them positionally
sns.barplot(x='Car_Age', y='Selling_Price', data=df).set_title('Selling Price range by Car Age')

# <ins> E. Select a Performance Measure </ins>

I will use 2 performance measurements: R2 and MAE.

<br>R2:
The coefficient of determination, R2 ("R squared"), is the proportion of the variance in the dependent variable that is predictable from the independent variable(s).

It is a statistic used in the context of statistical models whose main purpose is either the prediction of future outcomes or the testing of hypotheses, on the basis of other related information. It provides a measure of how well observed outcomes are replicated by the model, based on the proportion of total variation of outcomes explained by the model.

( source - https://en.wikipedia.org/wiki/Coefficient_of_determination )

In simple words, R2 is the share of the total variance that the model explains.
<br>It tells us how much of the variance of the dependent variable is explained by the independent variables.<br>
The higher the explained share, the more X helps us predict Y.

MAE:

Of the three common error metrics:
- MAE is the easiest to understand, because it's the average absolute error.
- MSE is more popular than MAE, because MSE "punishes" larger errors, which tends to be useful in the real world.
- RMSE is even more popular than MSE, because RMSE is interpretable in the "y" units.

All of these are loss functions, because we want to minimize them.

I chose MAE because it gives a basic, simple-to-understand assessment of the model's error.
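
To make these definitions concrete, here is a minimal sketch that computes MAE, MSE, RMSE and R2 by hand. The values in `y_true` and `y_pred` are made up for illustration and are not results from this notebook.

```python
import numpy as np

# Made-up targets and predictions (an assumption for illustration)
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

errors = y_true - y_pred
mae = np.abs(errors).mean()   # average absolute error
mse = (errors ** 2).mean()    # average squared error, punishes large errors more
rmse = np.sqrt(mse)           # back in the units of y

# R2 = 1 - (unexplained variation / total variation)
ss_res = (errors ** 2).sum()
ss_tot = ((y_true - y_true.mean()) ** 2).sum()
r2 = 1 - ss_res / ss_tot

# A model that always predicts the mean of y explains nothing: its R2 is 0
baseline_pred = np.full_like(y_true, y_true.mean())
r2_baseline = 1 - ((y_true - baseline_pred) ** 2).sum() / ss_tot

print(mae, mse, rmse, r2, r2_baseline)
```

The last value shows why a mean-predicting baseline is a natural reference point: any useful model should reach an R2 clearly above 0.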

# <ins> F. Train Set and Test Set + Scaling</ins>

In [None]:
X = df.drop(columns=['Selling_Price'])
Y = df['Selling_Price']
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.25, random_state = 42)

- I will scale the data in 2 ways to see if there are significant differences.

### scaling the numerical features with StandardScaler and MinMaxScaler:

In [None]:
scaler_standard = StandardScaler()
scaler_MinMax = MinMaxScaler()

X_train_standardized = X_train.copy()
X_test_standardized = X_test.copy()
X_train_normalized = X_train.copy()
X_test_normalized = X_test.copy()

numerical_features = ['Present_Price', 'Kms_Driven', 'Past_Owners', 'Car_Age']

# Standardization:
scaler_standard.fit(X_train[numerical_features])
X_train_standardized[numerical_features] = scaler_standard.transform(X_train_standardized[numerical_features])

# the test set is scaled with the same scaler, fitted on the train data only
X_test_standardized[numerical_features] = scaler_standard.transform(X_test_standardized[numerical_features])

# Normalization:
scaler_MinMax.fit(X_train[numerical_features])
X_train_normalized[numerical_features] = scaler_MinMax.transform(X_train_normalized[numerical_features])

# the test set is scaled with the same scaler, fitted on the train data only
X_test_normalized[numerical_features] = scaler_MinMax.transform(X_test_normalized[numerical_features])


In [None]:
print('note: the mean is 0 and std is 1')
X_train_standardized.describe()[numerical_features].iloc[[1, 2]]

In [None]:
print('note: the min is 0 and max is 1')
indexes = [False, False, False, True, False, False, False, True]
X_train_normalized.describe()[numerical_features].iloc[indexes]

### Now that we have scaled train and test sets, we can continue to find a good model!
<br> <br>

### but first, let's see the dummy model:

In [None]:
dummy_regr = DummyRegressor(strategy="mean")
dummy_regr.fit(X_train, y_train)
R2_score = dummy_regr.score(X_test, y_test)
y_predict = dummy_regr.predict(X_test)
mae = MAE(y_test, y_predict)
print('The dummy model has an R2 score of ' + str(R2_score)[:6] + " as expected (around 0), and a mean absolute error of " + str(mae)[:4])

# <ins> G. Linear-Regression Model </ins>

In [None]:
LR = LinearRegression()

In [None]:
kf = KFold(n_splits=10, random_state=42, shuffle=True)

R2_scores_standardized = cross_val_score(LR, X_train_standardized, y_train, cv=kf)
y_predict_standardized = cross_val_predict(LR, X_train_standardized, y_train, cv=kf)
mae_standarsized = MAE(y_train, y_predict_standardized)

R2_scores_normalized = cross_val_score(LR, X_train_normalized, y_train, cv=kf)
y_predict_normalized = cross_val_predict(LR, X_train_normalized, y_train, cv=kf)
mae_normalized = MAE(y_train, y_predict_normalized)
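
The cross-validation above uses features that were scaled once on the whole training set, so every validation fold has already influenced the scaler. One possible alternative, sketched below on synthetic stand-in data (`X_demo` and `y_demo` are assumptions, not the car dataset), is to wrap the scaler and the model in a scikit-learn pipeline so that the scaler is re-fit inside each fold.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with features on very different scales (illustration only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3)) * np.array([1.0, 1000.0, 10.0])
y_demo = X_demo @ np.array([2.0, 0.003, -0.5]) + rng.normal(scale=0.1, size=100)

# The scaler is fitted on the training part of every fold, never on the validation part
pipeline = make_pipeline(StandardScaler(), LinearRegression())
kf_demo = KFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = cross_val_score(pipeline, X_demo, y_demo, cv=kf_demo)  # R2 per fold

print(fold_scores.mean(), fold_scores.std())
```

For plain linear regression this should not change the scores, but it keeps the evaluation honest if a scale-sensitive model is swapped in later.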

In [None]:
fig, axes = plt.subplots(1,2)
((ax1, ax2)) = axes

y_predicted = cross_val_predict(LR, X_train_standardized, y_train, cv=kf)
ax1.scatter(y_train, y_predicted, alpha=0.3, color='orange')
ax1.plot([y_train.min(), y_train.max()], [y_train.min(), y_train.max()], 'k--', lw=4)
ax1.set_xlabel('Actual')
ax1.set_ylabel('Predicted')
ax1.set_title('standardized:')

y_predicted = cross_val_predict(LR, X_train_normalized, y_train, cv=kf)
ax2.scatter(y_train, y_predicted, alpha=0.3, color='red')
ax2.plot([y_train.min(), y_train.max()], [y_train.min(), y_train.max()], 'k--', lw=4)
ax2.set_xlabel('Actual')
ax2.set_ylabel('Predicted')
ax2.set_title('normalized:')

plt.show()

#### standardized train set cross validation:

In [None]:
print("the scores of cross validation are:")
print(R2_scores_standardized)
print()
print("mean R2 is: " + str(R2_scores_standardized.mean())[:5] + " with std of  " + str(R2_scores_standardized.std())[:5] + " and MAE of " + str(mae_standarsized)[:6])

#### normalized train set cross validation:

In [None]:
print("the scores of cross validation are:")
print(R2_scores_normalized)
print()
print("mean R2 is: " + str(R2_scores_normalized.mean())[:5] + " with std of  " + str(R2_scores_normalized.std())[:5] + " and MAE of " + str(mae_normalized)[:6])

- This is pretty good, but I think I can improve it by creating new features!
- Note: it looks like the scaling method doesn't matter; I will check that later as well. You can see below, though, that the feature values differ between the two scaled sets.

In [None]:
X_train_normalized

In [None]:
X_train_standardized

### I will make new features to use the data more efficiently:

In [None]:
# create 3 more features:
df['KMs_Per_year'] = df['Kms_Driven']/df['Car_Age']
df['Present_Price_Age_ratio'] = df['Present_Price']/df['Car_Age']
df['Present_Price_KMs_ratio'] = df['Present_Price']/df['Kms_Driven']
df.describe()[['KMs_Per_year', 'Present_Price_Age_ratio', 'Present_Price_KMs_ratio']]

In [None]:
corr_matrix = df.corr()
corr_matrix['Selling_Price'].sort_values(ascending=False)

- Present_Price_Age_ratio is very significant.
- Present_Price_KMs_ratio is more significant than Kms_Driven alone.
- KMs_Per_year is more significant than Kms_Driven and Car_Age separately.

##### Let's see those correlations:

In [None]:
sns.pairplot(df[['Selling_Price', 'Present_Price_Age_ratio', 'Present_Price_KMs_ratio', 'KMs_Per_year']], kind='reg')

I will repeat the previous steps to include the new features in the train and test sets. 
<br>Note: the rows of the train and test sets will not change, because I'm using the same random_state, so each set keeps the same rows.
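
The claim about identical rows can be checked with a small sketch. It assumes the number and order of rows stay unchanged and only columns are added; `df_demo` and `x_squared` are hypothetical names used for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame standing in for the real data
df_demo = pd.DataFrame({"x": np.arange(20), "y": np.arange(20) * 2})
first_train, first_test = train_test_split(df_demo, test_size=0.25, random_state=42)

# Add a column (as the new features do) and split again with the same random_state
df_wide = df_demo.assign(x_squared=df_demo["x"] ** 2)
second_train, second_test = train_test_split(df_wide, test_size=0.25, random_state=42)

# The same row labels end up in the same sets, in the same order
same_test_rows = first_test.index.equals(second_test.index)
same_train_rows = first_train.index.equals(second_train.index)
print(same_test_rows, same_train_rows)
```

If rows were dropped or reordered between the two splits, this guarantee would no longer hold.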

### Scaling the numerical features (including the new ones) with StandardScaler and MinMax functions:

In [None]:
# test train split
X = df.drop(columns=['Selling_Price'])
Y = df['Selling_Price']
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.25, random_state = 42)

# scaling the numerical features
scaler_standard = StandardScaler()
scaler_MinMax = MinMaxScaler()

X_train_standardized = X_train.copy()
X_test_standardized = X_test.copy()
X_train_normalized = X_train.copy()
X_test_normalized = X_test.copy()

numerical_features = ['Present_Price', 'Kms_Driven', 'Past_Owners', 'Car_Age', 'KMs_Per_year', 'Present_Price_Age_ratio', 'Present_Price_KMs_ratio']

# Standardization:
scaler_standard.fit(X_train[numerical_features])
X_train_standardized[numerical_features] = scaler_standard.transform(X_train_standardized[numerical_features])

# the scaling is with the the same fitted scaler (by the train data)
X_test_standardized[numerical_features] = scaler_standard.transform(X_test_standardized[numerical_features])

# Normalization:
scaler_MinMax.fit(X_train[numerical_features])
X_train_normalized[numerical_features] = scaler_MinMax.transform(X_train_normalized[numerical_features])

# the scaling is with the the same fitted scaler (by the train data)
X_test_normalized[numerical_features] = scaler_MinMax.transform(X_test_normalized[numerical_features])


In [None]:
kf = KFold(n_splits=10, random_state=42, shuffle=True)

R2_standardized = cross_val_score(LR, X_train_standardized, y_train, cv=kf)
y_predict_standardized = cross_val_predict(LR, X_train_standardized, y_train, cv=kf)
mae_standarsized = MAE(y_train, y_predict_standardized)

print('standardized:')
print('R2 score: ' + str(R2_standardized.mean())[:6])
print('R2 std: ' + str(R2_standardized.std())[:6])
print('MAE: ' + str(mae_standarsized)[:6])

print()

R2_normalized = cross_val_score(LR, X_train_normalized, y_train, cv=kf)
# use the normalized features here (the standardized set was passed by mistake)
y_predict_normalized = cross_val_predict(LR, X_train_normalized, y_train, cv=kf)
mae_normalized = MAE(y_train, y_predict_normalized)

print('normalized:')
print('R2 score: ' + str(R2_normalized.mean())[:6])
print('R2 std: ' + str(R2_normalized.std())[:6])
print('MAE: ' + str(mae_normalized)[:6])

### This is a big improvement!
Note that the two results are still the same; it seems that normalization and standardization have the same effect on linear regression models.
<br>I will use only one of them in the next step.
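
A likely explanation is that both scalers apply a per-feature shift and rescale, and ordinary least squares with an intercept produces the same predictions under any such transformation. Below is a minimal sketch on synthetic data (not the car dataset); the helper `ols_predict` is a hypothetical name introduced here for illustration.

```python
import numpy as np

# Synthetic data with two features on very different scales (illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2)) * np.array([1.0, 1000.0])
y = 3.0 * X[:, 0] - 0.002 * X[:, 1] + rng.normal(scale=0.1, size=50)

def ols_predict(features, target):
    """Fit ordinary least squares with an intercept and return in-sample predictions."""
    design = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return design @ coef

# Standardization (zero mean, unit std) and min-max normalization (range 0..1)
X_standardized = (X - X.mean(axis=0)) / X.std(axis=0)
X_normalized = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

pred_standardized = ols_predict(X_standardized, y)
pred_normalized = ols_predict(X_normalized, y)

# The coefficients differ, but the predictions agree up to floating-point error
max_gap = np.abs(pred_standardized - pred_normalized).max()
print(max_gap)
```

The same reasoning does not carry over to regularized models such as Ridge or Lasso, where the choice of scaler can change the result.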

<br> I will try to increase the R2 with feature selection. 




## feature selection:

In [None]:
# I will use this function to make a copy of the
# train set restricted by a specific correlation limit.

# copy X, keeping the columns whose score is at least the given limit:
def copy_by_corr_limit(X, lim, limits):
    # `limits` may be sorted differently from X.columns, so align it by column name first
    s = (limits.reindex(X.columns) < lim)
    X_copy = X.loc[:, ~s.to_numpy()].copy()
    return X_copy
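
When the scores are sorted (as the correlation list used for the selection is), their order generally no longer matches the column order of the feature table, so a boolean mask built from them should be aligned by column name before it is applied. The following is a minimal sketch of that idea; `X_demo`, `scores` and `select_by_score` are hypothetical names for illustration.

```python
import pandas as pd

X_demo = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})
# Hypothetical scores, deliberately sorted differently from X_demo.columns
scores = pd.Series({"c": 0.9, "a": 0.5, "b": 0.1})

def select_by_score(X, scores, limit):
    """Return a copy of X with only the columns whose score is >= limit."""
    aligned = scores.reindex(X.columns)  # reorder the scores to match the column order
    return X.loc[:, (aligned >= limit).to_numpy()].copy()

selected = select_by_score(X_demo, scores, 0.4)
print(list(selected.columns))  # ['a', 'c']
```

Without the `reindex` step, the mask would be applied by position and could keep `a` and `b` instead of `a` and `c`.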

In [None]:
print('I will use the next list to select features by correlations')
print('correlations (without the sign+-):')
correlations = abs(corr_matrix['Selling_Price']).sort_values(ascending=False)
correlations.drop('Selling_Price', inplace=True)
correlations

In [None]:
corr_limits = [0, 0.03, 0.09, 0.25, 0.35, 0.40, 0.55, 0.552, 0.555, 0.9]

mean_scores = []
std_scores = []
mae_scores = []

for limit in corr_limits:
    X_train_copy = copy_by_corr_limit(X_train_standardized, limit, correlations)
    R2_scores = cross_val_score(LR, X_train_copy, y_train, cv=kf)
    y_predict = cross_val_predict(LR, X_train_copy, y_train, cv=kf)
    mae_score = MAE(y_train, y_predict)

    
    mean_scores.append(R2_scores.mean())
    std_scores.append(R2_scores.std())
    mae_scores.append(mae_score)
    
pd.DataFrame(data={'lim correlation:':corr_limits, 'R2_score': mean_scores, 'R2_std': std_scores, 'MAE score': mae_scores}) 

This is a minor improvement, so it is not significant. That is why I will go with a correlation limit of 0.00.

## Testing our best linear regression model:

In [None]:
X_train_copy = copy_by_corr_limit(X_train_standardized, 0.00, correlations)
X_test_copy = copy_by_corr_limit(X_test_standardized, 0.00, correlations)

LR.fit(X_train_copy, y_train)
R2_score = LR.score(X_test_copy, y_test)
y_predict = LR.predict(X_test_copy)
mae_score = MAE(y_test, y_predict)

indexes = list(range(1, len(y_predict)+1))
fig, axs = plt.subplots(1, 1, figsize=(9, 3), sharey=True)
axs.plot(indexes, y_predict, label='target_predicted', color='orange')
axs.plot(indexes, y_test, label='target_value', color='purple')
axs.legend()
axs.set_xlabel('target index')
axs.set_ylabel('Selling Price')
fig.suptitle('Predicted values VS True Values:')
plt.show()

pd.DataFrame(index=['test LR model'], data={'R2_score': R2_score, 'MAE score': mae_score})

<br>
<br>
<br>

# <ins> H. Random Forest Regressor model </ins>

In [None]:
RFR = RandomForestRegressor()

In [None]:
kf = KFold(n_splits=10, random_state=42, shuffle=True)

R2_scores_standardized = cross_val_score(RFR, X_train_standardized, y_train, cv=kf)
y_predict_standardized = cross_val_predict(RFR, X_train_standardized, y_train, cv=kf)
mae_standarsized = MAE(y_train, y_predict_standardized)

R2_scores_normalized = cross_val_score(RFR, X_train_normalized, y_train, cv=kf)
y_predict_normalized = cross_val_predict(RFR, X_train_normalized, y_train, cv=kf)
mae_normalized = MAE(y_train, y_predict_normalized)

In [None]:
fig, axes = plt.subplots(1,2)
((ax1, ax2)) = axes

y_predicted = cross_val_predict(RFR, X_train_standardized, y_train, cv=kf)
ax1.scatter(y_train, y_predicted, alpha=0.3, color='orange')
ax1.plot([y_train.min(), y_train.max()], [y_train.min(), y_train.max()], 'k--', lw=4)
ax1.set_xlabel('Actual')
ax1.set_ylabel('Predicted')
ax1.set_title('standardized:')

y_predicted = cross_val_predict(RFR, X_train_normalized, y_train, cv=kf)
ax2.scatter(y_train, y_predicted, alpha=0.3, color='red')
ax2.plot([y_train.min(), y_train.max()], [y_train.min(), y_train.max()], 'k--', lw=4)
ax2.set_xlabel('Actual')
ax2.set_ylabel('Predicted')
ax2.set_title('normalized:')

plt.show()

#### standardized train set cross validation:

In [None]:
print("the scores of cross validation are:")
print(R2_scores_standardized)
print()
print("mean R2 is: " + str(R2_scores_standardized.mean())[:5] + " with std of  " + str(R2_scores_standardized.std())[:5] + " and MAE of " + str(mae_standarsized)[:6])

#### normalized train set cross validation:

In [None]:
print("the scores of cross validation are:")
print(R2_scores_normalized)
print()
print("mean R2 is: " + str(R2_scores_normalized.mean())[:5] + " with std of  " + str(R2_scores_normalized.std())[:5] + " and MAE of " + str(mae_normalized)[:6])

#### Those are great scores! let's see the score with the test set:

### Random Forest Regressor standardized data Test:

In [None]:
RFR = RandomForestRegressor()

RFR.fit(X_train_standardized, y_train)
R2_score = RFR.score(X_test_standardized, y_test)
y_predict = RFR.predict(X_test_standardized)
mae_score = MAE(y_test, y_predict)

indexes = list(range(1, len(y_predict)+1))
fig, axs = plt.subplots(1, 1, figsize=(9, 3), sharey=True)
axs.plot(indexes, y_predict, label='target_predicted', color='orange')
axs.plot(indexes, y_test, label='target_value', color='purple')
axs.legend()
axs.set_xlabel('target index')
axs.set_ylabel('Selling Price')
fig.suptitle('Predicted values VS True Values:')
plt.show()

pd.DataFrame(index=['test RFR model (standardized)'], data={'R2_score': R2_score, 'MAE score': mae_score}) 


### Random Forest Regressor normalized data

In [None]:
RFR = RandomForestRegressor()

RFR.fit(X_train_normalized, y_train)
R2_score = RFR.score(X_test_normalized, y_test)
y_predict = RFR.predict(X_test_normalized)
mae_score = MAE(y_test, y_predict)

indexes = list(range(1, len(y_predict)+1))
fig, axs = plt.subplots(1, 1, figsize=(9, 3), sharey=True)
axs.plot(indexes, y_predict, label='target_predicted', color='orange')
axs.plot(indexes, y_test, label='target_value', color='purple')
axs.legend()
axs.set_xlabel('target index')
axs.set_ylabel('Selling Price')
fig.suptitle('Predicted values VS True Values:')
plt.show()

pd.DataFrame(index=['test RFR model (normalized)'], data={'R2_score': R2_score, 'MAE score': mae_score}) 


#### this score is pretty good!

- I will try to improve the R2 and the MAE of the Random Forest Regressor model by choosing the best hyperparameters.
- I will use only the normalized data because it has better scores (the gap is small and may simply reflect the forest's randomness, since no random_state is set).

## Randomized Search:

In [None]:
from sklearn.model_selection import RandomizedSearchCV

# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
# Number of features to consider at every split
max_features = [1.0, 'sqrt']  # 1.0 (all features) replaces 'auto', which newer scikit-learn versions no longer accept
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]
# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}

In [None]:
# search across 100 different combinations
rf_random = RandomizedSearchCV(estimator = RFR, param_distributions = random_grid, n_iter = 100, cv = kf, verbose=2, random_state=42, n_jobs = -1, scoring='r2')
# Fit the random search model
rf_random.fit(X_train_normalized, y_train)
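
To put the 100 sampled combinations in perspective, the following sketch counts how many combinations the grid above contains. The per-parameter counts are read off the lists defined above.

```python
from math import prod

# Number of candidate values per hyperparameter, mirroring random_grid above
grid_sizes = {
    "n_estimators": 10,       # 200, 400, ..., 2000
    "max_features": 2,
    "max_depth": 12,          # 10, 20, ..., 110 plus None
    "min_samples_split": 3,
    "min_samples_leaf": 3,
    "bootstrap": 2,
}

total_combinations = prod(grid_sizes.values())
sampled = 100  # n_iter in the search above
coverage = sampled / total_combinations

print(f"{total_combinations} combinations, {coverage:.1%} sampled")
```

The randomized search therefore explores only a small fraction of the grid, which is part of why a follow-up grid search around the best parameters can still be worthwhile.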

#### Massive improvement:

In [None]:
print("best R2 score is:")
print(rf_random.best_score_)

In [None]:
rf_random.best_params_

## Testing best Random forest Regressor model:

In [None]:
rf = rf_random.best_estimator_
rf.fit(X_train_normalized, y_train)

R2_score = rf.score(X_test_normalized, y_test)
y_predict = rf.predict(X_test_normalized)
mae_score = MAE(y_test, y_predict)

indexes = list(range(1, len(y_predict)+1))
fig, axs = plt.subplots(1, 1, figsize=(9, 3), sharey=True)
axs.plot(indexes, y_predict, label='target_predicted', color='orange')
axs.plot(indexes, y_test, label='target_value', color='purple')
axs.legend()
axs.set_xlabel('target index')
axs.set_ylabel('Selling Price')
fig.suptitle('Predicted values VS True Values:')
plt.show()

pd.DataFrame(index=['test RFR model'], data={'R2_score': R2_score, 'MAE score': mae_score}) 


#### Despite the high cross-validation score, the test score is not as good.
#### I will check the feature importances list to see whether feature selection can improve the model.

## feature selection:

In [None]:
rf.feature_importances_
feature_imp = pd.Series(rf.feature_importances_,index=X_train_normalized.columns).sort_values(ascending=False)
print("feature importances list:")
feature_imp

#### check the limits scores:

In [None]:
imp_limits = [0, 0.0001, 0.017, 0.02, 0.03, 0.035, 0.036, 0.037, 0.04, 0.07, 0.08, 0.1, 0.3]

mean_scores = []
std_scores = []
mae_scores = []

for limit in imp_limits:
    X_train_copy = copy_by_corr_limit(X_train_normalized, limit, feature_imp)
    R2_scores = cross_val_score(rf, X_train_copy, y_train, cv=kf)
    y_predict = cross_val_predict(rf, X_train_copy, y_train, cv=kf)
    mae_score = MAE(y_train, y_predict)

    mean_scores.append(R2_scores.mean())
    std_scores.append(R2_scores.std())
    mae_scores.append(mae_score)
    
pd.DataFrame(data={'lim importance:':imp_limits, 'R2_score': mean_scores, 'R2_std': std_scores, 'MAE score': mae_scores}) 

The best score is obtained without any limit. This is what we did earlier, so feature selection didn't help as I had hoped.

<br>
<br>

#### In conclusion:


# <ins>I. My best model:</ins>

### Linear Regression with the params below:

In [None]:
LR.get_params()  # parameters of the linear regression model (the original cell showed the random forest search results)

### The scores are:

In [None]:
X_train_copy = copy_by_corr_limit(X_train_standardized, 0.00, correlations)
X_test_copy = copy_by_corr_limit(X_test_standardized, 0.00, correlations)

LR.fit(X_train_copy, y_train)
R2_score = LR.score(X_test_copy, y_test)
y_predict = LR.predict(X_test_copy)
mae_score = MAE(y_test, y_predict)

indexes = list(range(1, len(y_predict)+1))
fig, axs = plt.subplots(1, 1, figsize=(9, 3), sharey=True)
axs.plot(indexes, y_predict, label='target_predicted', color='orange')
axs.plot(indexes, y_test, label='target_value', color='purple')
axs.legend()
axs.set_xlabel('target index')
axs.set_ylabel('Selling Price')
fig.suptitle('Predicted values VS True Values:')
plt.show()

pd.DataFrame(index=['test LR model'], data={'R2_score': R2_score, 'MAE score': mae_score}) 


# This _linear regression_ model has an R2 score that is a 10% improvement over the top-5 most voted notebooks on Kaggle for this dataset. The main reason is that I added 3 new features.

You can check the most voted notebooks here: https://www.kaggle.com/nehalbirla/vehicle-dataset-from-cardekho/code?datasetId=33080&sortBy=voteCount

### what can I do better?

- handle outliers.
- grid search for the RFR model.
- check more algorithms (more model types).

### I would love to get comments, reviews and suggestions for improvement!
I would especially love to get ideas why there are significant differences between the training (including cross-validation) and the testing with the Random Forest Regressor model.