# Variable scale / Magnitude

In [86]:
# Import pandas library and disable warnings
import pandas as pd
import warnings
warnings.simplefilter("ignore")
# Import MinMaxScaler to scale the features
from sklearn.preprocessing import MinMaxScaler
# Import train_test_split to separate train and test set
from sklearn.model_selection import train_test_split

In [87]:
# Load avocado dataset and store it to variable d
d = pd.read_csv('Data/avocado.csv')
d.head()

| | Date | AveragePrice | Total Volume | Small Hass Avocado | Large Hass Avocado | Extra Large Hass Avocado | Total Bags | Small Bags | Large Bags | XLarge Bags | type | year | region |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2015-01-04 | 1.75 | 27365.89 | 9307.34 | 3844.81 | 615.28 | 13598.46 | 13061.1 | 537.36 | 0.0 | organic | 2015 | Southeast |
| 1 | 2015-01-04 | 1.49 | 17723.17 | 1189.35 | 15628.27 | 0.0 | 905.55 | 905.55 | 0.0 | 0.0 | organic | 2015 | Chicago |
| 2 | 2015-01-04 | 1.68 | 2896.72 | 161.68 | 206.96 | 0.0 | 2528.08 | 2528.08 | 0.0 | 0.0 | organic | 2015 | HarrisburgScranton |
| 3 | 2015-01-04 | 1.52 | 54956.8 | 3013.04 | 35456.88 | 1561.7 | 14925.18 | 11264.8 | 3660.38 | 0.0 | conventional | 2015 | Pittsburgh |
| 4 | 2015-01-04 | 1.64 | 1505.12 | 1.27 | 1129.5 | 0.0 | 374.35 | 186.67 | 187.68 | 0.0 | organic | 2015 | Boise |


In [88]:
# Inspect the column names, non-null counts and dtypes
d.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18249 entries, 0 to 18248
Data columns (total 13 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Date                      18249 non-null  object 
 1   AveragePrice              18249 non-null  float64
 2   Total Volume              18249 non-null  float64
 3   Small Hass Avocado        18249 non-null  float64
 4   Large Hass Avocado        18249 non-null  float64
 5   Extra Large Hass Avocado  18249 non-null  float64
 6   Total Bags                18249 non-null  float64
 7   Small Bags                18249 non-null  float64
 8   Large Bags                18249 non-null  float64
 9   XLarge Bags               18249 non-null  float64
 10  type                      18249 non-null  object 
 11  year                      18249 non-null  int64  
 12  region                    18249 non-null  object 
dtypes: float64(9), int64(1), object(3)
memory usage: 1.8+ MB


# 1. Load data with numeric variables only

In [89]:
# Keep only the float columns in DataFrame data
# (this drops the text columns and the integer 'year' column)
data = d.select_dtypes(exclude = ['int64', 'object'])

In [90]:
# Print descriptive statistics of these variables to see their magnitudes
data.describe()

| | AveragePrice | Total Volume | Small Hass Avocado | Large Hass Avocado | Extra Large Hass Avocado | Total Bags | Small Bags | Large Bags | XLarge Bags |
|---|---|---|---|---|---|---|---|---|---|
| count | 18249.0 | 18249.0 | 18249.0 | 18249.0 | 18249.0 | 18249.0 | 18249.0 | 18249.0 | 18249.0 |
| mean | 1.405978 | 850644.0 | 293008.4 | 295154.6 | 22839.74 | 239639.2 | 182194.7 | 54338.09 | 3106.426507 |
| std | 0.402677 | 3453545.0 | 1264989.0 | 1204120.0 | 107464.1 | 986242.4 | 746178.5 | 243966.0 | 17692.894652 |
| min | 0.44 | 84.56 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 1.1 | 10838.58 | 854.07 | 3008.78 | 0.0 | 5088.64 | 2849.42 | 127.47 | 0.0 |
| 50% | 1.37 | 107376.8 | 8645.3 | 29061.02 | 184.99 | 39743.83 | 26362.82 | 2647.71 | 0.0 |
| 75% | 1.66 | 432962.3 | 111020.2 | 150206.9 | 6243.42 | 110783.4 | 83337.67 | 22029.25 | 132.5 |
| max | 3.25 | 62505650.0 | 22743620.0 | 20470570.0 | 2546439.0 | 19373130.0 | 13384590.0 | 5719097.0 | 551693.65 |


As we can see, our variables have very different magnitudes: their minimum and maximum values differ by several orders of magnitude. For example, the average avocado price ranges from 0.44 to 3.25, whereas the number of large bags sold (`Large Bags`) ranges from 0 to 5.719097e+06.

In [91]:
# Print the dataframe's columns
data.columns

Index(['AveragePrice', 'Total Volume', 'Small Hass Avocado',
       'Large Hass Avocado', 'Extra Large Hass Avocado', 'Total Bags',
       'Small Bags', 'Large Bags', 'XLarge Bags'],
      dtype='object')

In [92]:
# Get the range (maximum minus minimum) of each variable
for col in data.columns:
    print(col, 'range: ', data[col].max() - data[col].min())

AveragePrice range:  2.81
Total Volume range:  62505561.96
Small Hass Avocado range:  22743616.17
Large Hass Avocado range:  20470572.61
Extra Large Hass Avocado range:  2546439.11
Total Bags range:  19373134.37
Small Bags range:  13384586.8
Large Bags range:  5719096.61
XLarge Bags range:  551693.65


The ranges of our variables differ widely: from 2.81 for `AveragePrice` to more than 62 million for `Total Volume`.

# 2. Feature Scaling - a first look at scikit-learn

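Many models are sensitive to the scale of their input variables. In linear and logistic regression, the size of each coefficient depends on the scale of its feature, and models that are regularized, trained with gradient descent, or based on distances between observations can be dominated by the features with the largest magnitudes.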

*Feature scaling* can help with this issue. As the name suggests, this process changes the scale of the variables. There are several ways to scale features; in this notebook we demonstrate **min-max scaling**, which rescales each feature using its minimum and maximum values, with scikit-learn's `MinMaxScaler`. You can find more information about `MinMaxScaler` [in the scikit-learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html).

The formula for min-max scaling is: **x_scaled = (x - min(x)) / (max(x) - min(x))**

- for each feature, the scaler subtracts the feature's minimum value from every observation and divides the result by the feature's range (maximum minus minimum)
- the resulting values lie between 0 and 1 on the data the scaler was fitted on
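
To make the formula concrete, here is a minimal sketch that applies it by hand with NumPy. The array `x` is a made-up toy column, not a column from the avocado dataset.

```python
import numpy as np

# Hypothetical toy feature (not taken from the avocado data)
x = np.array([2.0, 4.0, 6.0, 10.0])

# Min-max scaling: (x - min(x)) / (max(x) - min(x))
x_min = x.min()
x_max = x.max()
x_scaled = (x - x_min) / (x_max - x_min)

# The smallest value maps to 0, the largest to 1,
# and the values in between keep their relative spacing
print(x_scaled)  # values: 0, 0.25, 0.5, 1
```

Note the parentheses around both the numerator and the denominator: without them, the division would be applied to `min(x) / max(x)` only.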

In [93]:
# Let's split our dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data[[ 'Total Volume', 'Small Hass Avocado','Large Hass Avocado', 'Extra Large Hass Avocado', 'Total Bags',
           'Small Bags', 'Large Bags', 'XLarge Bags']],
    data.AveragePrice,
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

((12774, 8), (5475, 8))

In [94]:
# We call our scaler
scaler = MinMaxScaler()

# Next we fit the scaler to training data: this computes the minimum and maximum values to be used 
# for later scaling
scaler.fit(X_train)

# The last step is to rescale both datasets with the fitted scaler
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [95]:
# The scaler stores the maximum values of the features, learned from train set
print(scaler.data_max_)

[62505646.52 22743616.17 20445501.03  1880231.38 19373134.37 13384586.8
  5719096.61   454343.65]


In [96]:
# Let's have a look at the scaled training dataset
print('Mean: ', X_train_scaled.mean(axis=0))
print('Standard Deviation: ', X_train_scaled.std(axis=0))
print('Minimum value: ', X_train_scaled.min(axis=0))
print('Maximum value: ', X_train_scaled.max(axis=0))

Mean:  [0.01354779 0.01276357 0.0144709  0.01199523 0.01229469 0.01354084
 0.00942789 0.00666706]
Standard Deviation:  [0.05558859 0.05596901 0.05903959 0.05605783 0.05156184 0.05643891
 0.04312771 0.03808955]
Minimum value:  [0. 0. 0. 0. 0. 0. 0. 0.]
Maximum value:  [1. 1. 1. 1. 1. 1. 1. 1.]


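After this rescaling, every feature in the training set ranges from 0 to 1. The test set was transformed with the minimum and maximum learned from the training set, so its values can fall outside this interval.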
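
The scaler learns its minimum and maximum from the training set only, so the 0 to 1 guarantee applies to the training data. In this notebook, for instance, the training maximum of `Large Hass Avocado` stored in `scaler.data_max_` (20445501.03) is below the full-data maximum (20470572.61), so at least one test value scales to slightly above 1. The sketch below illustrates the same effect on made-up toy arrays (`train` and `test` are hypothetical, not the avocado data).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data: one feature, with test values outside the training range
train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[5.0], [25.0], [40.0]])

# Fit on the training data only: the scaler learns min=10 and max=30
toy_scaler = MinMaxScaler()
toy_scaler.fit(train)

train_scaled = toy_scaler.transform(train)
test_scaled = toy_scaler.transform(test)

print(train_scaled.ravel())  # stays within [0, 1]
print(test_scaled.ravel())   # 5 maps below 0 and 40 maps above 1
```

This is usually acceptable, since fitting on the training set alone avoids leaking information from the test set. If strictly bounded values are required, `MinMaxScaler` accepts a `clip=True` argument in recent scikit-learn versions.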

**Optional**

Find other options to rescale the variables in the scikit-learn documentation.
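
As a possible starting point for this exercise, the sketch below tries two other scikit-learn scalers, `StandardScaler` and `RobustScaler`, on a made-up toy feature with one outlier. The array `X` is hypothetical and the choice of these two scalers is only an illustration, not a recommendation for the avocado data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler

# Hypothetical toy feature with one large outlier
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# StandardScaler: subtract the mean and divide by the standard deviation
X_std = StandardScaler().fit_transform(X)

# RobustScaler: subtract the median and divide by the interquartile range,
# which makes the scaling less sensitive to the outlier
X_rob = RobustScaler().fit_transform(X)

print('Standardized: ', X_std.ravel())
print('Robust-scaled:', X_rob.ravel())
```

Unlike min-max scaling, neither of these scalers bounds the output to a fixed interval.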

# Appendix

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html