# Sales of summer clothes in E-commerce Wish

## Content:

## 1. EDA
Topics covered and questions to answer from the data:
* Comparison between price and retail price
* Sales versus origin country
* Sales comparison by colors
* Sales comparison by ratings of products
* What factor contributes most to a fast shipping badge?
* Tags encoding
* Which badge contributes most to the sales of a product?
* What kind of merchants are likely to gain product success?

## 2. Experimenting with the data to gain prediction insights
We are going to build a model that can help predict how well a product is going to sell, i.e., the exact sales for each product.

Such a model has many implications and could be used in many different ways, the most straightforward being to adjust how much of a product should be kept in stock.

Before we select and train a model, we should experiment with the data and combine attributes to find ideas for prediction. For example, do the proportions of good and bad ratings, the number of tags (which makes a product more discoverable) and price drops factor into the success of a product?


## 3. Prepare the data for Machine Learning algorithms
Once we have settled on the features to use, we need to prepare the data for Machine Learning algorithms.


## 4. Select and train a model
In this project, I will use four common classification models to see which performs best, then fine-tune the best one.

Note: We need to divide the dataset into a training set and a test set. The training set is preprocessed, and each model is trained and validated using cross-validation. During this process, we put the test set aside and don't even look at it, to make sure the evaluation is unbiased. Once the model type and hyperparameters have been selected, the generalization error is measured on the test set.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
%matplotlib inline
import plotly
import plotly.graph_objects as go
import plotly.express as px
import matplotlib.pyplot as plt
from matplotlib_venn import venn2, venn2_circles, venn2_unweighted
from matplotlib_venn import venn3, venn3_circles

# Data cleaning

In [None]:
import os
print(os.listdir("../input"))

In [None]:
wish=pd.read_csv("../input/summer-products-and-sales-in-ecommerce-wish/summer-products-with-rating-and-performance_2020-08.csv")
wish.info()

In [None]:
pd.set_option('display.max_columns', None)
wish.head()

In [None]:
wish.drop(columns=['title_orig','merchant_name','merchant_info_subtitle','merchant_id',
                   'merchant_profile_picture','product_url','product_picture','product_id','crawl_month','theme','currency_buyer'],inplace=True)


In [None]:
wish.isnull().sum()

In [None]:
nan_replace={'has_urgency_banner':0,'urgency_text':'N/A','origin_country':'unknown','product_color':'unknown'}
wish.fillna(nan_replace,inplace=True)
wish_cln=wish.dropna().copy()  # copy so that later column assignments don't write to a view
wish_cln.info()

# 1. EDA

# Comparison between price and retail price

Retail price is used by the seller to indicate a regular value or the price before discount. How do the price, the retail price and the discount between them relate to product success?

In [None]:
price_cmp=wish_cln[['price','retail_price','units_sold']].copy()  # copy: a column is added to this frame later
price_cmp.describe()
price_cmp.head()

In [None]:
plt.figure(figsize=(20,12))
sns.scatterplot(data=wish_cln,
               x="price",
               y='units_sold')
plt.show()

In [None]:
trace1 = go.Violin(y=price_cmp["price"],name='Price')
trace2 = go.Violin(y=price_cmp["retail_price"],name='Retail price')
fig=go.Figure([trace1, trace2])
fig.update_layout(
title='Comparison between price and retail price',
yaxis_title='Price(EUR)')
fig.show()

In [None]:
price_cmp['price_drops']=price_cmp["retail_price"]-price_cmp["price"]
plt.figure(figsize=(20,12))
sns.regplot(data=price_cmp,
           x='price_drops',
           y='units_sold')
plt.title('Price drops versus units sold')
plt.show()

There is a visible downward trend in units sold as the price increases. Products with high sales are usually concentrated in the price range of 0-20.

The difference between actual price and retail price is quite large. Prices are more concentrated, while retail prices have significantly more outliers. This could be a popular sales strategy.

Steep price drops don't necessarily result in product success.
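To put a number on the last observation, one option is to compute the price drop and its correlation with units sold. The sketch below uses a small made-up table (the values are illustrative, not taken from the dataset) and assumes the same column names as above; `discount_pct` is a name I introduce here.

```python
import pandas as pd

# Illustrative values only; in the notebook this would be wish_cln.
toy = pd.DataFrame({
    "price":        [5.0, 8.0, 11.0, 6.0, 16.0],
    "retail_price": [25.0, 9.0, 60.0, 6.0, 18.0],
    "units_sold":   [1000, 5000, 100, 20000, 100],
})

# Absolute drop, and the drop relative to the retail price.
toy["price_drops"] = toy["retail_price"] - toy["price"]
toy["discount_pct"] = 100 * toy["price_drops"] / toy["retail_price"]

# Pearson correlation of each price feature with sales.
corr_with_sales = toy[["price", "price_drops", "discount_pct"]].corrwith(toy["units_sold"])
print(corr_with_sales.round(2))
```

A coefficient close to zero or negative for `price_drops` would support the reading of the regression plot, although a few extreme retail prices can dominate a Pearson correlation.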


# Sales versus origin country

In [None]:
country_price=wish_cln[['units_sold','origin_country']]
country_mean_price=country_price.groupby('origin_country')['units_sold'].mean().reset_index()
country_mean_price.rename(columns={'units_sold': 'units_sold_mean'},inplace=True)

In [None]:
to_codes={'CN':'CHN',
         'GB':'GBR',
         'SG':'SGP',
         'US':'USA',
         'VE':'VEN'}
country_mean_price['code']=country_mean_price['origin_country'].map(to_codes)
country_mean_price

In [None]:
country_sales_map=px.choropleth(country_mean_price,
                       color='units_sold_mean',
                       locations='code',
                       hover_name='code',
                       color_continuous_scale=px.colors.sequential.Plasma,
                       title='Sales versus origin country')
country_sales_map.show()

Products from Singapore and China have higher average sales than the ones from other countries such as Britain, the US and Venezuela.

# Sales comparison by colors

Find the ten best-selling colors by sorting on total units sold.

In [None]:
color_sale=wish_cln.groupby('product_color')['units_sold'].sum()
color_sale=color_sale.reset_index().sort_values(by='units_sold',ascending=False)
color_sale

In [None]:
top_10_color_sale=color_sale.head(10)

In [None]:
fig=px.bar(data_frame=top_10_color_sale,
      x='product_color',
      y='units_sold')
fig.update_layout(title='Top 10 color sales')
fig.show()

Let's look at the sales of all colors so that we don't miss an emerging fashion trend.

In [None]:
fig=px.bar(data_frame=color_sale,
      x='product_color',
      y='units_sold')
fig.update_layout(title='All color sales')
fig.show()

We can see that some more specific colors sell well too, such as orange, navyblue and winered.

# Sales comparison by ratings of products

To start with, let's take a look at the distribution of ratings.

In [None]:
rating_cols=['rating_count','rating_five_count','rating_four_count',
             'rating_three_count','rating_two_count','rating_one_count']
ratings_data=wish_cln[rating_cols+['uses_ad_boosts']]

ratings_data.groupby('uses_ad_boosts').describe()

Use box plots to visualize how ad boosts affect the distributions of ratings.

In [None]:
fig = go.Figure()
for col in rating_cols:
    fig.add_trace(go.Box(x=ratings_data['uses_ad_boosts'],
                         y=ratings_data[col],
                         name=col,
                         boxmean=True,
                         boxpoints=False))
fig.update_traces(quartilemethod="exclusive")
fig.update_layout(boxmode='group',
                  title='Relations between ad boosts and rating',
                  xaxis = dict(
                  tickvals = [0,1],
                  ticktext = ['Without ad boosts','With ad boosts']))
fig.show()

Dividing the data into two groups, with and without ad boosts, shows that, surprisingly, products without ad boosts receive more ratings on average. The same holds for the number of 5-, 4-, 3-, 2- and 1-star ratings.

Now let's analyse how ratings factor into sales, which is what really matters.

In [None]:
cmp_table=wish_cln[['units_sold','rating','rating_count']]
# jointplot creates its own figure, so the size is set with `height`
sns.jointplot(data=cmp_table,
             x='rating',
             y='units_sold',
             height=12)
plt.show()


Successful products, those that sold more than 20,000 units, usually have a mean rating above 3.5.

Now let's look at how the mean rating and the number of ratings each relate to sales, using a 3D plot.

In [None]:
line=go.Scatter3d(x=cmp_table['rating'],
                  y=cmp_table['rating_count'],
                  z=cmp_table['units_sold'],
                  mode='markers')  # plot points only; row order carries no meaning
fig=go.Figure(line)
fig.update_layout(title='Impact of rating and rating count to sales',
                  height = 1000,
                  width = 1000,
                  scene = dict(
                  xaxis_title='rating',
                  yaxis_title='rating_count',
                  zaxis_title='units_sold'))
fig.show()

Sales show a visible upward trend as the number of ratings increases. The mean rating has a much smaller impact on sales; however, as mentioned earlier, products with higher sales are mainly concentrated at ratings above 3.5.

# What factor contributes most to a fast shipping badge?

In [None]:
index,name=wish_cln['shipping_option_name'].factorize()
wish_cln['shipping_option_index']=index

In [None]:
corr_map=wish_cln[['badge_fast_shipping','shipping_option_index','shipping_option_price','shipping_is_express','countries_shipped_to']]
corr_map=corr_map.corr()
plt.figure(figsize=(20,12))
sns.heatmap(corr_map,annot=True,cmap='Blues')
plt.xticks(rotation=45,fontsize=14)
plt.yticks(rotation=45,fontsize=14)
plt.show()

# Tags encoding

In [None]:
from wordcloud import WordCloud

tags_for_count=[]

for x in wish_cln['tags']:
    for word in str(x).split(sep=','):
        word=word.lower()
        tags_for_count.append(word)
tags_for_count       

In [None]:
plt.subplots(figsize=(25,15))
wordcloud = WordCloud(
                          background_color='white',
                          width=1920,
                          height=1080
                         ).generate(" ".join(tags_for_count))
plt.imshow(wordcloud)
plt.axis('off')
plt.show()

From the word cloud, we can see that the words merchants use most often in tags include "women", "fashion", "plus", "size", "sexy" and "shirt".
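A word cloud gives a quick impression, but because it is built from space-joined text, multi-word tags may be broken apart. If exact counts per tag are needed, `collections.Counter` is one option. This is a minimal sketch on made-up tag strings in the same comma-separated format as the `tags` column.

```python
from collections import Counter

# Illustrative tag strings; in the notebook these would come from wish_cln['tags'].
sample_tags = [
    "Summer,Fashion,Women's Fashion,Plus Size",
    "summer,Shirt,fashion",
    "Plus Size,Sexy,Summer",
]

# Count whole tags, ignoring case and surrounding whitespace.
tag_counts = Counter(
    tag.strip().lower()
    for row in sample_tags
    for tag in row.split(",")
)

for tag, count in tag_counts.most_common(3):
    print(tag, count)
```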

# Which badge contributes most to the sales of a product?

To better visualize the distribution of badges, I built a combined label for each product by concatenating the names of the badges it holds, which shows whether a product has one badge, two of them or all three.

In [None]:
wish_cln[wish_cln['badges_count']!=0].head(10)
badges=wish_cln[['badges_count','badge_local_product', 'badge_product_quality', 'badge_fast_shipping']].copy()

badges_cats=[]

for i in badges.index:
    categories = ['badge_local_product', 'badge_product_quality', 'badge_fast_shipping']
    codes = badges.loc[[i],['badge_local_product', 'badge_product_quality', 'badge_fast_shipping']].values.reshape(3,).tolist()
    zipped = zip(codes,categories)
    my_cats=[]
    for m,n in list(zipped):
        my_cats.append(m*n)
    badges_cats.append(my_cats)
badges_cats = pd.Series((v[0]+v[1]+v[2] for v in badges_cats))

In [None]:
badges.drop(columns=['badge_local_product', 'badge_product_quality', 'badge_fast_shipping'],inplace=True)
badges['badges_cats']=badges_cats.values
badges['records']=np.ones(len(badges))  # one record per row instead of a hard-coded 1514
badges_data=badges.groupby(['badges_count','badges_cats']).count().reset_index()
badges_data

The overwhelming majority of products (1368/1514) don't have any badge; among those that do, the product quality badge is the most common; only 2 products have all three badges.
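The badge combination label can also be built without an explicit loop over the index. The sketch below is an alternative on a toy frame, not the code used above; the `"+"` separator and the `"none"` label are my own choices.

```python
import pandas as pd

badge_cols = ["badge_local_product", "badge_product_quality", "badge_fast_shipping"]

# Illustrative rows; each flag is 0 or 1, as in the dataset.
toy = pd.DataFrame(
    [[0, 0, 0], [0, 1, 0], [1, 1, 0], [1, 1, 1]],
    columns=badge_cols,
)

# Join the names of the badges that are set; rows without badges get "none".
toy["badges_cats"] = toy[badge_cols].apply(
    lambda row: "+".join(col for col in badge_cols if row[col] == 1) or "none",
    axis=1,
)

print(toy["badges_cats"].value_counts())
```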

In [None]:
badges_cmp=wish_cln[['title','units_sold','badges_count','badge_local_product','badge_product_quality','badge_fast_shipping']]
# pairplot creates its own figure, so no plt.figure call is needed
sns.pairplot(data=badges_cmp,kind='reg')
plt.show()

Neither the number of badges nor any particular badge has much effect on sales.

# What kind of merchants are likely to gain product success?

Pull out all the columns concerning merchant information from the dataset.

As merchant ratings and rating counts are numerical values spread over a wide range, it's better to divide them into a few bins.
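As a quick reminder of how `pd.cut` assigns values to bins, here is a small sketch with made-up ratings. Intervals are right-closed by default, so a value sitting exactly on an edge lands in the lower bin.

```python
import numpy as np
import pandas as pd

ratings = pd.Series([3.0, 3.5, 3.9, 4.0, 4.6])  # illustrative values

# Right-closed intervals: 3.5 falls in (2.9, 3.5] and 4.0 in (3.5, 4.0].
bins = [2.9, 3.5, 4.0, np.inf]
cats = pd.cut(ratings, bins)

print(cats.value_counts().sort_index())
```

Values at or below the first edge (2.9 here) would become NaN, which is why it helps to check the minimum of the column before choosing the bins.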

In [None]:
merchant_sales=wish_cln[['merchant_title','merchant_rating_count',
                         'merchant_rating','merchant_has_profile_picture','units_sold']].copy()

In [None]:
merchant_sales['merchant_rating'].max()

In [None]:
merchant_sales['merchant_rating'].min()

In [None]:
bins1 = [2.9, 3.5, 4.0, np.inf]
cats1 = pd.cut(merchant_sales['merchant_rating'],bins1)
merchant_sales['merchant_raing_cats']=cats1

In [None]:
bins2 = [0, 250000, 900000, np.inf]
cats2 = pd.cut(merchant_sales['merchant_rating_count'],bins2)
merchant_sales['raing_count_cats']=cats2

In [None]:
merchant_top_50 = merchant_sales.groupby(['merchant_has_profile_picture','merchant_title','merchant_raing_cats','raing_count_cats'])['units_sold'].sum().nlargest(50).reset_index()

In [None]:
fig = px.bar(data_frame = merchant_top_50,
           x = 'merchant_title',
           y = 'units_sold',
           color = 'merchant_raing_cats',
           facet_col = 'merchant_has_profile_picture',
           facet_row = 'raing_count_cats',
           width = 1200, height = 800)
fig.update_layout(title = 'Top 50 merchants')
fig.show()

Among the top 50 merchants, the majority have fewer than 250,000 ratings, a mean rating above 4.0 and a profile picture.

# 2. Experimenting with the data to gain prediction insights

In [None]:
wish_cln_copy = wish_cln.copy()

Clean and simplify the product color column by keeping the ten best-selling colors and setting all other colors to "other".
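The "keep the top N, bucket the rest" idea can be sketched on a toy table with `Series.where`, which avoids chained indexing. The data and `top_n = 2` are illustrative only.

```python
import pandas as pd

toy = pd.DataFrame({
    "product_color": ["black", "white", "black", "teal", "white", "mauve"],
    "units_sold":    [1000, 500, 2000, 10, 700, 50],
})

top_n = 2  # the notebook keeps ten colors; two is enough for this toy table
top_colors = (
    toy.groupby("product_color")["units_sold"].sum().nlargest(top_n).index
)

# Keep the top colors and replace everything else with "other".
toy["product_color"] = toy["product_color"].where(
    toy["product_color"].isin(top_colors), "other"
)
print(toy["product_color"].tolist())
```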

In [None]:
color_sale = wish_cln_copy.groupby('product_color')['units_sold'].sum()
color_sale = color_sale.reset_index().sort_values(by = 'units_sold',ascending=False)
top_10_color_sale = color_sale.head(10)
top_10 = list(top_10_color_sale['product_color'])

In [None]:
# use .loc instead of chained indexing so the assignment reliably modifies the frame
wish_cln_copy.loc[~wish_cln_copy['product_color'].isin(top_10),'product_color']='other'

In [None]:
wish_cln_copy['product_color'].unique()

Also, let's add a column of the numbers of tags.
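Since the tags are stored as one comma-separated string per product, the number of tags is the number of comma-separated entries, not the length of the string. A small sketch on made-up values shows the difference.

```python
import pandas as pd

tags = pd.Series(["Summer,Fashion,Plus Size", "Shirt"])  # illustrative values

n_chars = tags.str.len()                 # length of the raw string
n_tags = tags.str.split(",").str.len()   # number of comma-separated tags

print(pd.DataFrame({"n_chars": n_chars, "n_tags": n_tags}))
```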

In [None]:
# Count comma-separated tags; len() on the raw string would count characters instead
wish_cln_copy['tags_num'] = wish_cln_copy['tags'].str.split(',').str.len()

In [None]:
wish_cln_copy['rating_count'].hist()

We will divide the dataset into a training set and a test set. The training set is preprocessed, and each model is trained and validated using cross-validation. During this process, we put the test set aside and don't even look at it, to make sure the evaluation is unbiased. Once the model type and hyperparameters have been selected, the generalization error is measured on the test set.

As the EDA showed, the number of ratings has a major impact on sales. To ensure the test set is representative of the various rating-count categories in the whole dataset, we need to create a category attribute and divide the dataset into homogeneous subgroups.

In [None]:
sns.scatterplot(data=wish_cln_copy,x='rating_count',y='units_sold')

In [None]:
wish_cln_copy["rating_count_cat"] = pd.cut(wish_cln_copy["rating_count"],
                               bins=[0, 300, 1000, np.inf],
                               labels=[1, 2, 3])

In [None]:
wish_cln_copy["rating_count_cat"].value_counts()

In [None]:
from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(wish_cln_copy, wish_cln_copy["rating_count_cat"]):
    strat_train_set = wish_cln_copy.iloc[train_index]
    strat_test_set = wish_cln_copy.iloc[test_index]

As we can see, the test and train sets generated using stratified sampling have rating count category proportions almost identical to those in the full dataset.
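The comparison can also be collected in a single table. Here is a sketch on synthetic rating counts (randomly generated, not the real dataset), using the same bins and the same splitter settings.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

rng = np.random.default_rng(42)
# Synthetic rating counts standing in for the real column.
toy = pd.DataFrame({"rating_count": rng.integers(1, 3000, size=500)})
toy["rating_count_cat"] = pd.cut(
    toy["rating_count"], bins=[0, 300, 1000, np.inf], labels=[1, 2, 3]
).astype(int)

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(split.split(toy, toy["rating_count_cat"]))

# Category shares in the full data, the training set and the test set.
proportions = pd.DataFrame({
    "overall": toy["rating_count_cat"].value_counts(normalize=True),
    "train": toy.iloc[train_idx]["rating_count_cat"].value_counts(normalize=True),
    "test": toy.iloc[test_idx]["rating_count_cat"].value_counts(normalize=True),
}).sort_index()
print(proportions.round(3))
```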

In [None]:
strat_test_set["rating_count_cat"].value_counts() / len(strat_test_set)

In [None]:
strat_train_set["rating_count_cat"].value_counts() / len(strat_train_set)

In [None]:
for set_ in (strat_train_set, strat_test_set):
    set_.drop("rating_count_cat", axis=1, inplace=True)

Make a copy of the train set to experiment with the attributes and their correlations with the target attribute.

In [None]:
wish_exp = strat_train_set.copy()

In [None]:
corr_matrix = wish_exp.select_dtypes(include='number').corr()  # correlations are only defined for numeric columns

In [None]:
corr_matrix["units_sold"].sort_values(ascending=False)

Looking at the correlation matrix, it's clear that the numbers of ratings from one to five stars all correlate strongly with sales, which runs against common sense: a product with a large number of purely bad ratings is not likely to sell well. We probably want to compare the number of each kind of rating with the overall number of ratings.
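To see why proportions carry different information from raw counts, consider two made-up products with the same number of ratings but opposite sentiment. The sketch below assumes the same column names as the dataset and divides every star column by the total in one step.

```python
import pandas as pd

star_cols = ["rating_one_count", "rating_two_count", "rating_three_count",
             "rating_four_count", "rating_five_count"]

# Two illustrative products: mostly good ratings vs. mostly bad ratings.
toy = pd.DataFrame(
    [[5, 5, 10, 30, 50], [50, 30, 10, 5, 5]],
    columns=star_cols,
)
toy["rating_count"] = toy[star_cols].sum(axis=1)

# Divide every star column by the total, row by row.
props = toy[star_cols].div(toy["rating_count"], axis=0).add_suffix("_prop")
print(props)
```

Both rows have 100 ratings in total, so the raw `rating_count` cannot tell them apart, while the five-star share does.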

In [None]:
wish_exp['rating_three_count_prop']=wish_exp['rating_three_count']/wish_exp['rating_count']
wish_exp['rating_four_count_prop']=wish_exp['rating_four_count']/wish_exp['rating_count']
wish_exp['rating_five_count_prop']=wish_exp['rating_five_count']/wish_exp['rating_count']
wish_exp['rating_two_count_prop']=wish_exp['rating_two_count']/wish_exp['rating_count']
wish_exp['rating_one_count_prop']=wish_exp['rating_one_count']/wish_exp['rating_count']

Also, let's calculate the price drop.

In [None]:
wish_exp['drops']=wish_exp["retail_price"]-wish_exp["price"]

In [None]:
corr_matrix = wish_exp.select_dtypes(include='number').corr()  # correlations are only defined for numeric columns
corr_matrix["units_sold"].sort_values(ascending=False)

We can keep the new attributes: the rating proportions and the price drop.

# 3. Prepare the data for Machine Learning algorithms

In [None]:
wish = strat_train_set.drop("units_sold", axis=1) # drop labels for training set
wish_labels = strat_train_set["units_sold"].copy()

In [None]:
wish.columns

Which numerical features are most important?

First of all, we need to drop text features. When predicting product sales, the inventory and countries_shipped_to are possible sources of leakage, because this information can change over time. I will also leave out unimportant features such as 'badge_fast_shipping' and 'shipping_is_express', to avoid overfitting.

In [None]:
wish_num = wish.drop(['title','tags','product_variation_size_id','product_variation_inventory',
                      'inventory_total','product_color','origin_country','urgency_text',
                      'shipping_option_name','badge_fast_shipping','shipping_option_index',
                      'merchant_title','countries_shipped_to'], axis=1)

In [None]:
wish_num.head()

In [None]:
wish_cat = wish[['product_color','origin_country','shipping_option_name']]

In [None]:
from sklearn.preprocessing import OneHotEncoder

cat_encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; convert it to a dense array for display
wish_cat_1hot = cat_encoder.fit_transform(wish_cat).toarray()
wish_cat_1hot

In [None]:
cat_encoder.categories_

Use FunctionTransformer to add the combined attributes we discussed earlier.
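Two details matter when a `FunctionTransformer` works on positional indices: the indices must be looked up in the same column list as the array the function receives, and ratio features should be computed on raw values, before any scaling. Here is a minimal sketch on a toy matrix; `add_ratio_features` and the column list are hypothetical names for this example only.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Toy matrix; the column order below is the one the transformer will see.
columns = ["price", "retail_price", "rating_count", "rating_five_count"]
X = np.array([[5.0, 20.0, 100.0, 60.0],
              [8.0, 8.0, 10.0, 2.0],
              [12.0, 30.0, 400.0, 300.0]])

price_ix, retail_ix, count_ix, five_ix = (columns.index(c) for c in columns)

def add_ratio_features(X):
    """Append the five-star share and the price drop as new columns."""
    five_prop = X[:, five_ix] / X[:, count_ix]
    drop = X[:, retail_ix] - X[:, price_ix]
    return np.c_[X, five_prop, drop]

# The adder runs before the scaler, so the ratios use raw counts and prices.
pipeline = Pipeline([
    ("attribs_adder", FunctionTransformer(add_ratio_features, validate=False)),
    ("std_scaler", StandardScaler()),
])

raw_with_extras = add_ratio_features(X)
X_prepared = pipeline.fit_transform(X)
print(raw_with_extras[:, -2:])
print(X_prepared.shape)
```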

In [None]:
from sklearn.preprocessing import FunctionTransformer

# Look the indices up in wish_num, because that is the frame the numerical pipeline receives
one_ix, two_ix, three_ix, four_ix, five_ix, rating_count_ix, retail_price_ix, price_ix= [
    list(wish_num.columns).index(col)
    for col in ("rating_one_count", "rating_two_count", 
                "rating_three_count", "rating_four_count",
               'rating_five_count', 'rating_count','retail_price','price')]

In [None]:
def add_extra_features(X):
    # proportions of each star rating relative to the total number of ratings
    rating_one_count_prop = X[:,one_ix]/ X[:,rating_count_ix]
    rating_two_count_prop = X[:,two_ix]/ X[:,rating_count_ix]
    rating_three_count_prop = X[:,three_ix]/ X[:,rating_count_ix]
    rating_four_count_prop = X[:,four_ix]/ X[:,rating_count_ix]
    rating_five_count_prop = X[:,five_ix]/ X[:,rating_count_ix]
    # price drop between retail price and actual price
    drops = X[:,retail_price_ix] - X[:,price_ix]
    return np.c_[X, rating_one_count_prop, rating_two_count_prop, 
                 rating_three_count_prop,rating_four_count_prop, 
                 rating_five_count_prop,drops]

attr_adder = FunctionTransformer(add_extra_features, validate=False)
wish_extra_attribs = attr_adder.transform(wish_num.values)

In [None]:
attr_adder

Check that the FunctionTransformer works before building the pipeline.

In [None]:
wish_extra_attribs = pd.DataFrame(
    wish_extra_attribs,
    columns = list(wish_num.columns)+["rating_one_count_prop", "rating_two_count_prop",
                               'rating_three_count_prop','rating_four_count_prop',
                               'rating_five_count_prop','drops'],
    index = wish_num.index)
wish_extra_attribs.head()

In [None]:
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Add the ratio features on raw values first, then scale everything
num_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy="median")),
        ('attribs_adder', FunctionTransformer(add_extra_features, validate=False)),
        ('std_scaler', StandardScaler())])

wish_num_tr = num_pipeline.fit_transform(wish_num)


In [None]:
wish_num_tr

In [None]:
# Build the full pipeline to preprocess numerical and categorical features.
from sklearn.compose import ColumnTransformer
num_attribs = list(wish_num)
cat_attribs = list(wish_cat)

full_pipeline = ColumnTransformer([
        ("num", num_pipeline, num_attribs),
        ("cat", OneHotEncoder(handle_unknown='ignore'), cat_attribs),  # ignore categories not seen during fit
    ])

wish_prepared = full_pipeline.fit_transform(wish)

In [None]:
wish_prepared

# 4. Select and train a model

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from xgboost.sklearn import XGBClassifier 
from sklearn.model_selection import KFold,cross_val_score

In [None]:
base_models = [('DT_model',DecisionTreeClassifier(random_state=42)),
            ('RF_model',RandomForestClassifier(random_state=42,n_jobs=-1)),
            ('LR_model',LogisticRegression(random_state=42,n_jobs=-1)),
            ("XGB_model", XGBClassifier(random_state=42, n_jobs=-1))]
# split data into 'kfolds' parts for cross validation,
# use shuffle to ensure random distribution of data:
kfolds = 4
split = KFold(n_splits=kfolds,shuffle=True,random_state=42)

# Preprocessing, fitting, making predictions and scoring for every model:
for name,model in base_models:
    model_steps = Pipeline(steps=[('model',model)])
    model_steps.fit(wish_prepared, wish_labels)
    cv_results = cross_val_score(model_steps,wish_prepared,wish_labels,cv=split,scoring='accuracy',
                              n_jobs=-1)
    # output:
    min_score = round(min(cv_results),4)
    max_score = round(max(cv_results),4)
    mean_score = round(np.mean(cv_results),4)
    std_dev = round(np.std(cv_results),4)
    print(f'{name} cross validation accuracy score:{mean_score} +- {std_dev} (std) min:{min_score},max:{max_score}')



As the accuracy scores of RF_model and XGB_model are close, let's compare both models on the test set to guard against overfitting.

In [None]:
XGB_clf = XGBClassifier(random_state=42,n_jobs=-1)

X_test = strat_test_set.drop("units_sold", axis=1)
y_test = strat_test_set["units_sold"].copy()
X_test_prepared = full_pipeline.transform(X_test)

kfolds = 4
split = KFold(n_splits=kfolds,shuffle=True,random_state=42)

cv_results = cross_val_score(XGB_clf,X_test_prepared,y_test,cv=split,scoring='accuracy',
                              n_jobs=-1)
min_score = round(min(cv_results),4)
max_score = round(max(cv_results),4)
mean_score = round(np.mean(cv_results),4)
std_dev = round(np.std(cv_results),4)
print(f'XGB_model cross validation accuracy score:{mean_score} +- {std_dev} (std) min:{min_score},max:{max_score}')


In [None]:
RF_clf = RandomForestClassifier(random_state=42,n_jobs=-1)

X_test = strat_test_set.drop("units_sold", axis=1)
y_test = strat_test_set["units_sold"].copy()
X_test_prepared = full_pipeline.transform(X_test)

kfolds=4
split=KFold(n_splits=kfolds,shuffle=True,random_state=42)

cv_results=cross_val_score(RF_clf,X_test_prepared,y_test,cv=split,scoring='accuracy',
                              n_jobs=-1)
min_score=round(min(cv_results),4)
max_score=round(max(cv_results),4)
mean_score=round(np.mean(cv_results),4)
std_dev=round(np.std(cv_results),4)
print(f'RF_model cross validation accuracy score:{mean_score} +- {std_dev} (std) min:{min_score},max:{max_score}')


The Random Forest model performs best at predicting the sales of products.
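The cells above run cross-validation on the test set itself. Another common way to measure the generalization error is to fit the model on the full training set and score it once on the held-out test set. Here is a minimal sketch of that workflow on synthetic data from `make_classification`, standing in for the prepared features and labels.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the prepared features and labels.
X, y = make_classification(n_samples=400, n_features=10, n_informative=5,
                           n_classes=3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

clf = RandomForestClassifier(n_estimators=100, random_state=42)

# Model selection uses cross-validation on the training set only.
cv_scores = cross_val_score(clf, X_train, y_train, cv=4, scoring="accuracy")

# The test set is used once, after fitting on the full training set.
clf.fit(X_train, y_train)
test_accuracy = accuracy_score(y_test, clf.predict(X_test))

print(f"CV accuracy: {cv_scores.mean():.3f} +- {cv_scores.std():.3f}")
print(f"Test accuracy: {test_accuracy:.3f}")
```

With this setup the test score reflects a model trained on all the training data, and no model is ever fitted on test rows.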

I also did some hyperparameter optimization. Sometimes I got a slightly better accuracy score with the refined model; sometimes it was even worse.

In [None]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint
import numpy as np

# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 100, stop = 2000, num = 10)]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]
# Create the random grid
param_distribs = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}

forest_clf = RandomForestClassifier(random_state=42, n_jobs=-1)
rnd_search = RandomizedSearchCV(forest_clf, param_distributions=param_distribs,
                                n_iter=5, cv=4,scoring="accuracy", random_state=42)
rnd_search.fit(wish_prepared, wish_labels)

In [None]:
rnd_search.best_params_

In [None]:
# final_model = grid_search.best_estimator_
final_RF_clf = RandomForestClassifier(random_state=42,n_jobs=-1,n_estimators= 1366,
 min_samples_split= 5,
 min_samples_leaf= 1,
 max_features='sqrt',
 max_depth= 30,
 bootstrap= True)

X_test = strat_test_set.drop("units_sold", axis=1)
y_test = strat_test_set["units_sold"].copy()
X_test_prepared = full_pipeline.transform(X_test)

kfolds=4
split=KFold(n_splits=kfolds,shuffle=True,random_state=42)

cv_results=cross_val_score(final_RF_clf,X_test_prepared,y_test,cv=split,scoring='accuracy',
                              n_jobs=-1)
min_score=round(min(cv_results),4)
max_score=round(max(cv_results),4)
mean_score=round(np.mean(cv_results),4)
std_dev=round(np.std(cv_results),4)
print(f'Final_RF_model cross validation accuracy score:{mean_score} +- {std_dev} (std) min:{min_score},max:{max_score}')


## Comments, questions, suggestions? Let me know!

## If you like the notebook or learned something please upvote! :D