# Sales of Summer Clothes on the E-Commerce Platform Wish

The task is to predict the number of units sold for each product. It is treated as a regression problem.

The emphasis is on data preprocessing, as I have been reading up on it quite a lot and wanted to implement it and see the effect for myself.

Feature selection is used to pick the features that help the predictions most.
GridSearchCV is then run on the best model to tune its hyperparameters.
Finally, an attempt at model boosting is made to get even better predictions. You can follow the steps in order or jump to the part you want to see:

* [Importing Data](#import)
* [Data Preprocessing](#preprocess)
* [Correlation between features](#correlation)
* [Feature Selection](#selection)
* [Regression Models](#regmodel)
* [GridSearchCV](#gridsearch)
* [Final Model](#finalmodel) (by model boosting method)

<a id='import'></a>

# Importing Data

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
# Basic 
import numpy as np
import pandas as pd

# Plotting
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

In [None]:
sales = pd.read_csv('/kaggle/input/summer-products-and-sales-in-ecommerce-wish/summer-products-with-rating-and-performance_2020-08.csv')
sales.head(1)

<a id='preprocess'></a>

# Data Preprocessing

Data preprocessing is a key step in building models that predict or classify, whatever the dataset and the question we aim to answer. It requires some awareness of the background of your data and of the question you intend to answer. Many underlying inferences can be drawn along the way to answering the actual question, and the preprocessing phase is what makes them possible.

I have read quite a few online articles about the different methods involved and when to apply each. It has been an interesting read, and I have attempted to apply what I learnt here.

Below are some steps that are commonly used in the data preprocessing stage. The list is not exhaustive; these are the methods that were used on this dataset:

1. [Removing the null values](#1)
2. [Transform categorical variables](#2)  (as necessary)
3. [Removing the features that have 1 unique value](#3)
4. [Engineer new feature](#4) (if there is a possibility)
5. [Remove unnecessary features](#5)
6. Binning data (if required) (was not required here)

In [None]:
sales.info()

<a id='1'></a>

## Removing the null values

In [None]:
sales.isnull().sum()[sales.isnull().sum() !=0]

* All 5 rating count features with null values can be replaced with 0. A null here most likely means that no customer gave the product that particular rating.


* Product color, Origin Country and Product Size variation will be dealt with in the next section. 


* 'has_urgency_banner' feature tells us whether or not the product has an urgency banner. We can see that the count of null values for 'has_urgency_banner' and 'urgency_text' is the same. This implies that the products that do not have an urgency text have been given a null value in 'has_urgency_banner'. Therefore, we can say that this feature can become a categorical variable: 
    * 1 denoting it has an urgency text
    * 0 denoting it does not have an urgency text (Replace null with 0)
    
  
* Merchant name, info subtitle and profile picture will be dealt with in a separate section.

In [None]:
# rating features: a missing count means nobody gave that rating
rating_cols = ['rating_five_count', 'rating_four_count', 'rating_three_count',
               'rating_two_count', 'rating_one_count']
# assign back instead of calling replace(..., inplace=True) on a single column,
# which may not modify `sales` in recent pandas versions
sales[rating_cols] = sales[rating_cols].fillna(0)

# urgency banner: null means the product has no banner
sales['has_urgency_banner'] = sales['has_urgency_banner'].fillna(0)

<a id='2'></a>

## Transform categorical variables

* In this section, I curate and reduce the distinct values of some features so that the dataset is more uniform and less sparse.

* A major step in transforming categorical variables is converting them to a ***one hot encoding format***. I have done that a little further on in this notebook; to jump there, click [here](#7).

In [None]:
sales.dtypes[sales.dtypes == 'object']

* 'title', 'title_orig', 'merchant_title', 'merchant_name', 'merchant_info_subtitle', 'merchant_profile_picture', 'product_url', 'product_picture', 'product_id', 'theme', 'crawl_month', 'shipping_option_name', 'currency_buyer', 'urgency_text' will be dealt with in separate sections further on. 


* 'tags' will be used to engineer a new feature later on.


* 'product_color', 'product_variation_size_id' and 'origin_country' will be handled in this section by reducing the number of categories in each feature.

### 1. Product Color

In [None]:
count = sales['product_color'].value_counts()
count[count>3]

We will club the colours into a few basic colours instead of keeping so many different shades.

The code and results in this section are quite long and may seem redundant, because I have tried to assign every single colour to a particular basic colour. I did so because I wanted to be exhaustive about it.

Another way to go about it would be to handle only the colours that appear most often and club the few occurrences of all other colours under the 'others' category.
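The alternative just described can be sketched roughly as below. This is only an illustration on a made-up Series: the helper name `club_rare` and the threshold of 2 are my assumptions and are not used anywhere else in this notebook.

```python
import pandas as pd

def club_rare(series, min_count, other="others"):
    """Keep values that occur at least `min_count` times; relabel the rest as `other`."""
    counts = series.value_counts()
    frequent = counts[counts >= min_count].index
    # where() keeps entries that satisfy the condition and replaces all others
    return series.where(series.isin(frequent), other)

# toy data, not taken from the dataset
colors = pd.Series(["black", "black", "white", "white", "white", "jasper", "claret"])
clubbed = club_rare(colors, min_count=2)
print(clubbed.value_counts())
```

The trade-off is that rare shades such as 'claret' end up in 'others' instead of 'red', so this is quicker but less precise than the exhaustive mapping.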

In [None]:
# map the frequent shades to basic colours in one pass;
# assigning the result back avoids relying on inplace=True on a single column
sales['product_color'] = sales['product_color'].replace({
    'armygreen': 'green', 'khaki': 'green', 'fluorescentgreen': 'green',
    'winered': 'red', 'rosered': 'red', 'rose': 'red',
    'navyblue': 'blue', 'lightblue': 'blue', 'skyblue': 'blue',
    'darkblue': 'blue', 'navy': 'blue',
    'gray': 'grey', 'coffee': 'brown', 'lightpink': 'pink',
})

In [None]:
count = sales['product_color'].value_counts()
count[count==3]

In [None]:
sales['product_color'] = sales['product_color'].replace({
    'orange-red': 'red', 'Black': 'black', 'lightgreen': 'green', 'White': 'white',
})

In [None]:
count = sales['product_color'].value_counts()
count[count==2]

In [None]:
sales['product_color'] = sales['product_color'].replace({
    'wine': 'red', 'coralred': 'red', 'lightred': 'red', 'burgundy': 'red',
    'watermelonred': 'red',
    'Pink': 'pink', 'dustypink': 'pink',
    'Army green': 'green', 'applegreen': 'green', 'mintgreen': 'green',
    'navy blue': 'blue', 'lakeblue': 'blue',
    'apricot': 'orange', 'silver': 'grey', 'camel': 'brown',
    'lightyellow': 'yellow', 'coolblack': 'black',
})

In [None]:
count = sales['product_color'].value_counts()
count[count==1]

In [None]:
sales['product_color'] = sales['product_color'].replace({
    'ivory': 'white', 'nude': 'white', 'offwhite': 'white',
    'lightkhaki': 'green', 'darkgreen': 'green', 'light green': 'green',
    'army green': 'green',
    'lightgray': 'grey', 'lightgrey': 'grey',
    'RED': 'red', 'jasper': 'red', 'Rose red': 'red', 'wine red': 'red',
    'rosegold': 'red', 'claret': 'red',
    'tan': 'brown', 'army': 'brown',
    'Blue': 'blue', 'denimblue': 'blue', 'prussianblue': 'blue',
    'lightpurple': 'purple', 'violet': 'purple',
    'offblack': 'black', 'gold': 'yellow',
})

In [None]:
count = sales['product_color'].value_counts()
count

In [None]:
sales['product_color'] = sales['product_color'].fillna('others')

* 'black', 'white', 'blue', 'red', 'green', 'yellow', 'pink', 'grey', 'purple', 'orange', 'brown', 'beige' are the basic colours we will go forward with. 

* Null values have already been replaced with 'others'.

* We will add a category 'dual' for products that have two colours (in the format __ & __).

* We will add a category 'others' for products that are multi-coloured or have a print on them. 

In [None]:
def color(col):
    ls = ['black', 'white', 'blue', 'red', 'green', 'yellow', 'pink', 'grey', 'purple', 'orange', 'brown', 'beige']
    if col not in ls:
        if '&' in col:
            return 'dual'
        else:
            return 'others'
    return col

In [None]:
sales['product_color'] = sales['product_color'].apply(color)

In [None]:
plt.figure(figsize=(12,10))
sns.countplot(x = 'product_color', data = sales, order = sales['product_color'].value_counts().iloc[:].index)
plt.xlabel('Product Colour')
plt.ylabel('Count')
plt.show()

### 2. Product Variation Size ID

* The categories which we will have are: XXXS, XXS, XS, S, M, L, XL, XXL, XXXL, XXXXL, XXXXXL, Others

* The code is lengthy here too, but only because the cleaning is meant to be exhaustive. However, many labels occur only once (84 distinct labels remain, as you will see later on), and categorising each of them would be very tedious. Thus, we will club all of these under 'Others'.

* I first map the labels that belong to a standard size, and later club everything else under 'Others' using the function *'size_name'*.

* All the null values will also fall under the category 'Others'.
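Much of the manual mapping could probably be approximated with a small normaliser. The sketch below is a hypothetical alternative: the function `normalize_size` and its regular expressions are my own assumptions, and I have not checked it against every label in the dataset.

```python
import re

VALID_SIZES = ["XXXS", "XXS", "XS", "S", "M", "L", "XL", "XXL", "XXXL", "XXXXL", "XXXXXL"]

def normalize_size(raw):
    """Map labels such as 'Size-XS', 's' or '2XL' to a standard size, else 'Others'."""
    if not isinstance(raw, str):
        return "Others"                      # covers NaN / missing values
    text = raw.strip().upper()
    # drop a leading 'SIZE' plus any separators, e.g. 'SIZE-XS' -> 'XS'
    text = re.sub(r"^SIZE[\s\-]*", "", text)
    # drop trailing dots, e.g. 'S.' -> 'S'
    text = text.rstrip(".")
    # expand numeric prefixes, e.g. '2XL' -> 'XXL'
    match = re.fullmatch(r"(\d)XL", text)
    if match:
        text = "X" * int(match.group(1)) + "L"
    return text if text in VALID_SIZES else "Others"

for label in ["Size-XS", "s", "2XL", "Size4XL", "S(bust 88cm)"]:
    print(label, "->", normalize_size(label))
```

Note that this sketch is stricter than the manual mapping: labels like 'S(bust 88cm)' or 'S Pink' go to 'Others' here, whereas the cells below map them to 'S'.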

In [None]:
count = sales['product_variation_size_id'].value_counts()
count[count>3]

In [None]:
# map the frequent spelling variants to standard sizes in one pass
sales['product_variation_size_id'] = sales['product_variation_size_id'].replace({
    'S.': 'S', 'Size S': 'S', 's': 'S',
    'XS.': 'XS', 'Size XS': 'XS', 'Size-XS': 'XS', 'SIZE XS': 'XS',
    'M.': 'M', '2XL': 'XXL', '4XL': 'XXXXL',
})

In [None]:
count = sales['product_variation_size_id'].value_counts()
count[count==3]

In [None]:
sales['product_variation_size_id'] = sales['product_variation_size_id'].replace({
    'SizeL': 'L', 'Size-S': 'S',
})

In [None]:
count = sales['product_variation_size_id'].value_counts()
count[count==2]

In [None]:
sales['product_variation_size_id'] = sales['product_variation_size_id'].replace({
    '5XL': 'XXXXXL', '3XL': 'XXXL', 'Size4XL': 'XXXXL',
    'Size -XXS': 'XXS', 'SIZE-XXS': 'XXS',
    'Size M': 'M',
    'S(bust 88cm)': 'S', 'size S': 'S', 'S Pink': 'S', 'Size S.': 'S', 'Suit-S': 'S',
})

In [None]:
count = sales['product_variation_size_id'].value_counts()
count.count()

In [None]:
def size_name(size):
    ls = ["XXXS", "XXS", "XS", "S", "M", "L", "XL", "XXL", "XXXL", "XXXXL", "XXXXXL"]
    if size in ls:
        return size
    return "Others"

In [None]:
sales['product_variation_size_id'] = sales['product_variation_size_id'].fillna('Others')
sales['product_variation_size_id'] = sales['product_variation_size_id'].apply(size_name)

In [None]:
plt.figure(figsize=(12,10))
sns.countplot(x = 'product_variation_size_id', data = sales, order = sales['product_variation_size_id'].value_counts().iloc[:].index)
plt.xlabel('Product Variation Size ID')
plt.ylabel('Count')
plt.show()

### 3. Origin country

In [None]:
sales['origin_country'].value_counts()

* We will keep the categories 'CN', 'US' and 'Others' (the last one will contain VE, SG, AT and GB).

* We are doing so because countries other than CN and US occur very rarely, and such sparse categories could interfere with the predictions of the model.

* All the null values are categorized under 'Others'.

In [None]:
def origin_name(country):
    ls = ["VE", "SG", "GB", "AT"]
    if country in ls:
        return "Others"
    return country

In [None]:
sales['origin_country'] = sales['origin_country'].fillna("Others")
sales['origin_country'] = sales['origin_country'].apply(origin_name)

In [None]:
plt.figure(figsize=(12,10))
sns.countplot(x = 'origin_country', data = sales, order = sales['origin_country'].value_counts().iloc[:].index)
plt.xlabel('Origin Country')
plt.ylabel('Count')
plt.show()

<a id='3'></a>

## Columns with 1 unique value

* Columns with only 1 unique value will not add any value to our model, so it would be best to drop them out.
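The same check can also drive the drop itself, so the column names do not have to be typed by hand. A minimal sketch on a toy frame (the values are made up for the illustration):

```python
import pandas as pd

# toy frame, not the real dataset
toy = pd.DataFrame({
    "price": [8.0, 5.5, 12.0],
    "currency_buyer": ["EUR", "EUR", "EUR"],
    "crawl_month": ["2020-08", "2020-08", "2020-08"],
})

# nunique() counts the distinct non-null values of a column
constant_cols = [col for col in toy.columns if toy[col].nunique() == 1]
toy = toy.drop(columns=constant_cols)
print(constant_cols)
```

Because `nunique()` ignores nulls by default, a column holding one value plus some nulls would also be flagged, which may or may not be what you want.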

In [None]:
ls = sales.nunique()
ls[ls==1]

In [None]:
sales.drop(labels = ['currency_buyer', 'theme', 'crawl_month'], axis=1, inplace=True)

<a id='4'></a>

## Engineer a new feature

* We will import the CSV file that has all the unique categories of tags sorted by count.

* The aim is to find, for each product, the share of all available tags that it uses. Our new feature will be 'tags_percentage' (despite the name, it is stored as a fraction and not multiplied by 100).

* The reasoning behind this feature is that the more tags a product has, the more often it should turn up in searches, which in turn should raise the likelihood of more units being sold.

* We will drop the 'tags' feature thereafter because we do not need it for the model. 
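As a quick worked example of the arithmetic, with made-up numbers (4 tags on a product and a hypothetical total of 200 distinct tags; the function name `tag_fraction` is only for this illustration):

```python
def tag_fraction(tags, total_unique_tags):
    """Share of all known tags that appear in a comma-separated tag string."""
    return len(tags.split(",")) / total_unique_tags

# hypothetical values, not taken from the dataset
example = tag_fraction("Summer,Fashion,dress,Women", total_unique_tags=200)
print(example)
```

Here the product uses 4 of 200 tags, i.e. a fraction of 0.02.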

In [None]:
collect_tags = pd.read_csv('/kaggle/input/summer-products-and-sales-in-ecommerce-wish/unique-categories.sorted-by-count.csv')
print('Total number of tags: ', collect_tags.shape[0])

In [None]:
# Return percentage of tags present for a product

def tag_number(tags):
    ls = tags.split(',')
    return len(ls)/collect_tags.shape[0]

In [None]:
sales['tags_percentage'] = sales['tags'].apply(tag_number)

In [None]:
sales.drop(labels = ['tags'], axis=1, inplace=True)

<a id='5'></a>

## Remove unnecessary features

In [None]:
sales.dtypes[sales.dtypes == 'object']

* Columns: title, title_orig, merchant_profile_picture, product_url, product_picture, product_id, merchant_id, merchant_info_subtitle, merchant_name, merchant_title, shipping_option_name, urgency_text

* These will be dropped for now, as they are unlikely to affect the number of units sold. For some of them, a corresponding feature already exists in the dataset that provides more information relevant to the model we want to build.

* The 'rating_count' column will also be removed for now, as the per-rating counts (5/4/3/2/1) already give a more detailed breakdown than the total.

In [None]:
sales.drop(labels = ['title', 'title_orig', 'merchant_profile_picture', 'product_url', 'product_picture', 'product_id', 'merchant_id', 
                     'merchant_info_subtitle', 'merchant_name', 'merchant_title', 'shipping_option_name', 'urgency_text'], 
           axis=1, 
           inplace=True)

In [None]:
sales.drop(labels = ['rating_count'], axis=1, inplace=True)

<a id='correlation'></a>

# Correlation between features

* We will check for correlation of all features with the number of units sold. 

* For the three categorical variables (product colour, variation size and origin country) however, we will do a separate check of correlation (using the one hot encoded format) with the units sold. This has been done because it will be difficult to visualise efficiently otherwise. 

<a id='7'></a>

### Categorical Variables: One hot encoding

* We will change our categorical variables to one hot encoding format. 
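To make the idea concrete, here is a minimal sketch of one hot encoding on a toy column (the values are illustrative, and passing `dtype=int` is my choice so the output shows 0/1 instead of booleans):

```python
import pandas as pd

# toy column, not taken from the dataset
origin = pd.Series(["CN", "US", "Others", "CN"], name="origin_country")

dummies = pd.get_dummies(origin, dtype=int)   # one 0/1 column per category
dummies = dummies.drop(columns=["Others"])    # 'Others' becomes the reference category
print(dummies)
```

A row with zeros in every remaining column then stands for the dropped category, which is why n-1 columns are enough to represent n categories.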

### 1. Product Color

In [None]:
# product color
dummies_color = pd.get_dummies(sales['product_color']) # gives us the one hot encoded features
# remove only the 'others' column, as n-1 encoded features represent n categories
# (drop_first=True would additionally drop a second category)
dummies_color.drop(labels = 'others', axis=1, inplace=True)

### 2. Product Variation Size ID

In [None]:
# product variation size id
dummies_variation = pd.get_dummies(sales['product_variation_size_id'])
dummies_variation.drop(labels = ['Others'], axis = 1, inplace=True)

### 3. Origin Country

In [None]:
dummies_origin = pd.get_dummies(sales['origin_country'])
dummies_origin.drop(labels=['Others'], axis = 1, inplace=True)


In [None]:
# concatenating all the one hot encoded features for the three categorical variables above

feat_onehot = pd.concat([dummies_color, dummies_variation, dummies_origin, sales['units_sold']], axis=1)
feat_onehot.head(1)

In [None]:
feat_onehot_corr = feat_onehot.corr()

feat_onehot_corr['units_sold'].sort_values(ascending=False)

* From the above result we can see that the linear (Pearson) correlation of units sold with product color, variation size and origin country is very weak. Note that this only rules out a linear relationship.

* For this reason, we will DROP these three features. 

In [None]:
sales.drop(labels = ['product_color', 'product_variation_size_id', 'origin_country'], 
           axis=1, 
           inplace=True)

### The correlation between the rest of the features and units of the product sold

In [None]:
sales_corr = sales.corr()

plt.figure(figsize = (18, 16))
sns.heatmap(sales_corr, annot=True, cmap='Blues_r')
plt.title('Correlation between features')
plt.show()

In [None]:
sales_corr['units_sold'].sort_values(ascending=False)

* Above we can see how each feature correlates with the units sold. The correlation method is *Pearson*, which is the pandas default.

* We will use the **SelectKBest method** to capture the best features for the model. 

<a id='selection'></a>

# Feature Selection

In [None]:
# separating the independent and dependent variables

y = sales['units_sold']
X = sales.drop(labels = ['units_sold'], axis = 1)
print("Shape of X is {} and that of y is {}".format(X.shape, y.shape))

In [None]:
# Splitting the dataset 

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

print('Shape of training set ', X_train.shape)
print('Shape of test set ', X_test.shape)

### SelectKBest

* Selects features according to the k highest scores.

* Scoring function used here is Mutual Info Regression

### Scoring Function: Mutual Info Regression

* We could have used f_regression, the usual F-test scoring function for regression, but it captures only linear dependencies.

* mutual_info_regression can capture any type of dependency between variables which is what we would need here. Check out the comparison [here](https://scikit-learn.org/stable/auto_examples/feature_selection/plot_f_test_vs_mi.html).

* **Mutual Information Regression**: 

    * Mutual information (MI) between two random variables is a non-negative value, which measures the dependency between the variables. It is equal to zero if and only if two random variables are independent, and higher values mean higher dependency.

    * The function relies on nonparametric methods based on entropy estimation from k-nearest neighbors distances. 
    
Source: [Link](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.mutual_info_regression.html#sklearn.feature_selection.mutual_info_regression)
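To see the difference between the two scoring functions, here is a small sketch on synthetic data (all names such as `X_demo` and the data-generating formula are my assumptions for the illustration). The target depends linearly on the first feature, quadratically on the second, and not at all on the third.

```python
import numpy as np
from sklearn.feature_selection import f_regression, mutual_info_regression

rng = np.random.default_rng(0)
x_linear = rng.uniform(-1, 1, 500)
x_quadratic = rng.uniform(-1, 1, 500)
x_noise = rng.uniform(-1, 1, 500)
X_demo = np.column_stack([x_linear, x_quadratic, x_noise])

# linear in the first feature, quadratic in the second, independent of the third
y_demo = x_linear + 3 * x_quadratic ** 2 + rng.normal(0, 0.05, 500)

f_scores, _ = f_regression(X_demo, y_demo)
mi_scores = mutual_info_regression(X_demo, y_demo, random_state=0)
print("F-test:", f_scores.round(2))
print("MI:    ", mi_scores.round(2))
```

The F-test should rate the quadratic feature about as low as the noise feature, because it has almost no linear correlation with the target, while mutual information should clearly separate it from the noise.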

In [None]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import mutual_info_regression

# feature selection
def select_features(X_train, y_train, X_test):
    # configure to select all features
    fs = SelectKBest(score_func=mutual_info_regression, k='all')
    # learn relationship from training data
    fs.fit(X_train, y_train)
    # transform train input data
    X_train_fs = fs.transform(X_train)
    # transform test input data
    X_test_fs = fs.transform(X_test)
    return X_train_fs, X_test_fs, fs
 

In [None]:
X_train_fs, X_test_fs, fs = select_features(X_train, y_train, X_test)

plt.bar([i for i in range(len(fs.scores_))], fs.scores_)
plt.tick_params(color='white', labelcolor='white')
plt.xlabel('Features', color='white')
plt.ylabel('Score of Features', color='white')
plt.show()

* We shall select the best 8 features for our model.
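When reading off which features were kept, the selector's mask has to be applied to the columns of the matrix that was passed to `fit`, not to the full dataframe that still contains the target. A minimal sketch on synthetic data (the frame `X_demo`, the wrapper `mi_score` and `k=2` are assumptions for the illustration):

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, mutual_info_regression

rng = np.random.default_rng(1)
X_demo = pd.DataFrame(rng.normal(size=(500, 4)), columns=["a", "b", "c", "d"])
# only 'a' and 'c' influence the target
y_demo = 3 * X_demo["a"] + 2 * X_demo["c"] ** 2 + rng.normal(0, 0.1, 500)

def mi_score(X, y):
    # fix the seed so the nearest-neighbour based MI estimate is reproducible
    return mutual_info_regression(X, y, random_state=0)

selector = SelectKBest(score_func=mi_score, k=2).fit(X_demo, y_demo)

# get_support() is a boolean mask aligned with the columns given to fit()
selected = X_demo.columns[selector.get_support()].tolist()
ranking = pd.Series(selector.scores_, index=X_demo.columns).sort_values(ascending=False)
print(selected)
print(ranking)
```

Pairing `scores_` with the column names like this also makes a bar plot of the scores easier to read than one indexed by position.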

In [None]:
def select_features(X_train, y_train, X_test):
    # configure to select the 8 best features
    fs = SelectKBest(score_func=mutual_info_regression, k=8)
    # learn relationship from training data
    fs.fit(X_train, y_train)
    # transform train input data
    X_train_fs = fs.transform(X_train)
    # transform test input data
    X_test_fs = fs.transform(X_test)
    return X_train_fs, X_test_fs, fs

In [None]:
# Selecting features

X_train_fs, X_test_fs, fs = select_features(X_train, y_train, X_test)

print('Shape of Training set with the best features: ', X_train_fs.shape)

In [None]:
cols = fs.get_support(indices=True)

print('Best columns that we are using for our model\n')
for i in cols:
    # the indices refer to X (without 'units_sold'), so look the names up in X, not in sales
    print (X.columns[i])

<a id='regmodel'></a>

# Regression Model

We will try out the following models:
1. Linear Regression
2. Polynomial Regression
3. SVR
4. Decision Tree Regression
5. Random Forest Regression
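For models that need preprocessing (scaling for SVR, polynomial features for polynomial regression), wrapping the steps in a scikit-learn pipeline is one way to make sure the preprocessing is fitted only on training data, including inside cross validation. A minimal sketch on synthetic data (the data and the name `svr_pipeline` are assumptions for the illustration):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(2)
# three features on very different scales
X_demo = rng.normal(size=(200, 3)) * [1.0, 100.0, 0.01]
y_demo = X_demo[:, 0] + 0.01 * X_demo[:, 1] + rng.normal(0, 0.1, 200)

# the scaler is re-fitted on the training part of every fold,
# so no fold is scaled with statistics from its own validation data
svr_pipeline = make_pipeline(StandardScaler(), SVR(kernel="rbf"))
scores = cross_val_score(svr_pipeline, X_demo, y_demo, cv=5)   # R2 per fold
print(scores.mean())
```

For regressors, `cross_val_score` reports R2 by default, so the mean above is a mean R2 and not an accuracy.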

In [None]:
# Importing models
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

# Regression Metrics
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

# Cross validation
from sklearn.model_selection import cross_val_score

In [None]:
regressors = [LinearRegression(),
             DecisionTreeRegressor(random_state=1),
             RandomForestRegressor(n_estimators = 10, random_state=1)]

df = pd.DataFrame(columns = ['Name', 'Train Score', 'Test Score', 'Mean Absolute Error', 'Mean Squared Error', 
                             'Cross Validation Score (Mean Accuracy)', 'R2 Score'])

In [None]:
for regressor in regressors:
    regressor.fit(X_train_fs, y_train)
    y_pred = regressor.predict(X_test_fs)
    
    # regressor name
    s = type(regressor).__name__
    
    # Train Score (R2 on the training set)
    train = regressor.score(X_train_fs, y_train)
    
    # Test Score (R2 on the test set)
    test = regressor.score(X_test_fs, y_test)
    
    # MAE score
    mae = mean_absolute_error(y_test, y_pred)
    
    # MSE Score
    mse = mean_squared_error(y_test, y_pred)
    
    # 10-fold cross validation; for regressors the default score is R2, not accuracy
    accuracy = cross_val_score(estimator = regressor, X = X_train_fs, y = y_train, cv=10)
    cv = accuracy.mean()*100
    
    r2 = r2_score(y_test, y_pred)
    
    # DataFrame.append was removed in pandas 2.0, so add the row with pd.concat
    row = {'Name': s, 'Train Score': train, 'Test Score': test, 'Mean Absolute Error': mae, 
           'Mean Squared Error': mse, 'Cross Validation Score (Mean Accuracy)': cv,
           'R2 Score': r2}
    df = pd.concat([df, pd.DataFrame([row])], ignore_index=True)

In [None]:
df

In [None]:
# Making Polynomial Features
from sklearn.preprocessing import PolynomialFeatures

poly_reg = PolynomialFeatures(degree = 3)
X_train_poly = poly_reg.fit_transform(X_train_fs)
# reuse the transformer fitted on the training set
X_test_poly = poly_reg.transform(X_test_fs)

# Fit PolyReg to training set
regressor = LinearRegression()
regressor.fit(X_train_poly, y_train)

# Predicting test values
y_pred = regressor.predict(X_test_poly)

# DataFrame.append was removed in pandas 2.0, so add the row with pd.concat
row = {'Name': type(regressor).__name__ + ' (Poly)', 
       'Train Score': regressor.score(X_train_poly, y_train), 
       'Test Score': regressor.score(X_test_poly, y_test), 
       'Mean Absolute Error': mean_absolute_error(y_test, y_pred), 
       'Mean Squared Error': mean_squared_error(y_test, y_pred), 
       # cross validate on the polynomial features, i.e. the inputs the model was fitted on
       'Cross Validation Score (Mean Accuracy)': cross_val_score(estimator = regressor, X = X_train_poly, y = y_train, cv=10).mean()*100,
       'R2 Score': r2_score(y_test, y_pred)}
df = pd.concat([df, pd.DataFrame([row])], ignore_index=True)

In [None]:
# Scaling
from sklearn.preprocessing import StandardScaler

# Applying feature scaling for this
sc = StandardScaler()
X_train_sc = sc.fit_transform(X_train_fs)
# scale the test set with the statistics learnt from the training set
X_test_sc = sc.transform(X_test_fs)

regressor = SVR(kernel='rbf')
regressor.fit(X_train_sc, y_train)

# Predicting test values
y_pred = regressor.predict(X_test_sc)

# DataFrame.append was removed in pandas 2.0, so add the row with pd.concat
row = {'Name': type(regressor).__name__, 
       'Train Score': regressor.score(X_train_sc, y_train), 
       'Test Score': regressor.score(X_test_sc, y_test), 
       'Mean Absolute Error': mean_absolute_error(y_test, y_pred), 
       'Mean Squared Error': mean_squared_error(y_test, y_pred), 
       'Cross Validation Score (Mean Accuracy)': cross_val_score(estimator = regressor, X = X_train_sc, y = y_train, cv=10).mean()*100,
       'R2 Score': r2_score(y_test, y_pred)}
df = pd.concat([df, pd.DataFrame([row])], ignore_index=True)

In [None]:
df

<a id='gridsearch'></a>

# GridSearchCV

We will run GridSearchCV on Random Forest Regression, which gave the best results among the models we tried.

In [None]:
from sklearn.model_selection import GridSearchCV

reg = RandomForestRegressor(random_state=1)

param_grid = { 
    'n_estimators': np.arange(4, 30, 2),
    'max_depth' : [4,5,6,7,8],
}

In [None]:
CV_reg = GridSearchCV(estimator=reg, param_grid=param_grid, cv= 5)
CV_reg.fit(X_train_fs, y_train)

In [None]:
CV_reg.best_params_

We will now check the results from the regressor with the best parameters.
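Instead of copying the best parameters by hand, the fitted search object can hand back the tuned model directly. A minimal sketch on synthetic data (the data, the small grid and the name `search` are assumptions for the illustration, chosen so that it runs quickly):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(3)
X_demo = rng.normal(size=(150, 4))
y_demo = X_demo[:, 0] ** 2 + X_demo[:, 1] + rng.normal(0, 0.1, 150)

search = GridSearchCV(
    estimator=RandomForestRegressor(random_state=1),
    param_grid={"n_estimators": [5, 10], "max_depth": [2, 4]},
    cv=3,
)
search.fit(X_demo, y_demo)

# with the default refit=True, best_estimator_ is already re-fitted
# on all the data passed to fit(), using the best parameters
best_model = search.best_estimator_
print(search.best_params_, round(search.best_score_, 3))
```

This avoids the risk of the hard-coded values drifting out of sync with `best_params_` if the grid or the data changes.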

In [None]:
# parameters taken from the grid search above
regressor = RandomForestRegressor(n_estimators=18, random_state=1, max_depth=4)

regressor.fit(X_train_fs, y_train)

# Predicting test values
y_pred = regressor.predict(X_test_fs)

# DataFrame.append was removed in pandas 2.0, so add the row with pd.concat
row = {'Name': type(regressor).__name__ + ' (after GridSearchCV)', 
       'Train Score': regressor.score(X_train_fs, y_train), 
       'Test Score': regressor.score(X_test_fs, y_test), 
       'Mean Absolute Error': mean_absolute_error(y_test, y_pred), 
       'Mean Squared Error': mean_squared_error(y_test, y_pred), 
       'Cross Validation Score (Mean Accuracy)': cross_val_score(estimator = regressor, X = X_train_fs, y = y_train, cv=10).mean()*100,
       'R2 Score': r2_score(y_test, y_pred)}
df = pd.concat([df, pd.DataFrame([row])], ignore_index=True)

<a id='finalmodel'></a>

# Final Model: Model Boosting

**MODEL BOOSTING:** We have used VotingRegressor to boost our results. Strictly speaking, this is an averaging ensemble and not a boosting algorithm such as AdaBoost or gradient boosting.

VotingRegressor: A voting regressor is an ensemble meta-estimator that fits several base regressors, each on the whole dataset. Then it averages the individual predictions to form a final prediction. Click [here](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingRegressor.html) for more details.

The voting regressor combines a *linear regressor* and the tuned *random forest regressor* to give predictions.
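The averaging behaviour can be checked directly. The sketch below uses synthetic data and smaller models (all names and values are assumptions for the illustration) and compares the ensemble's prediction with the plain mean of its fitted members:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)
X_demo = rng.normal(size=(120, 3))
y_demo = X_demo[:, 0] + X_demo[:, 1] ** 2 + rng.normal(0, 0.1, 120)

voter = VotingRegressor([
    ("lr", LinearRegression()),
    ("rf", RandomForestRegressor(n_estimators=10, max_depth=3, random_state=1)),
]).fit(X_demo, y_demo)

# estimators_ holds the fitted clones; with no weights the ensemble is their plain mean
individual = np.column_stack([est.predict(X_demo) for est in voter.estimators_])
manual_average = individual.mean(axis=1)
print(np.allclose(manual_average, voter.predict(X_demo)))
```

Because the two members are simply averaged, the ensemble helps most when their errors differ; it cannot correct a mistake that both models make in the same direction.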

In [None]:
from sklearn.ensemble import VotingRegressor

regressor = VotingRegressor([('lr',LinearRegression()), ('rf', RandomForestRegressor(n_estimators=18, random_state=1, max_depth=4))])

regressor.fit(X_train_fs, y_train)

# Predicting test values
y_pred = regressor.predict(X_test_fs)

# DataFrame.append was removed in pandas 2.0, so add the row with pd.concat
row = {'Name': type(regressor).__name__, 
       'Train Score': regressor.score(X_train_fs, y_train), 
       'Test Score': regressor.score(X_test_fs, y_test), 
       'Mean Absolute Error': mean_absolute_error(y_test, y_pred), 
       'Mean Squared Error': mean_squared_error(y_test, y_pred), 
       'Cross Validation Score (Mean Accuracy)': cross_val_score(estimator = regressor, X = X_train_fs, y = y_train, cv=10).mean()*100,
       'R2 Score': r2_score(y_test, y_pred)}
df = pd.concat([df, pd.DataFrame([row])], ignore_index=True)

In [None]:
df


The specifications of the best model:

VotingRegressor: 
1. Linear Regressor
2. Random Forest Regressor (n_estimators=18, max_depth=4)

with results:

* Train Score: 0.88
* Test Score: 0.83
* MAE: 1451.06
* MSE: 8.88e+06
* CV Score (mean R2 over 10 folds, times 100): 77.16
* R2 Score: 0.83
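Since the MSE is in squared units, taking its square root gives an error in the same units as 'units_sold', which is easier to compare with the MAE. A tiny sketch using the MSE reported above:

```python
import math

mse = 8.88e6              # test-set MSE reported above
rmse = math.sqrt(mse)     # back in the units of 'units_sold'
print(round(rmse))        # 2980
```

An RMSE roughly twice the MAE suggests that a few products with large errors dominate the squared error.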
