# [In Progress] Scaling Techniques

Data comes in various shapes and sizes. To understand how the data behaves with
respect to each feature, it helps to bring the features onto a similar scale.
This is where scaling techniques come into play.

Let's take the example of house price prediction for a minute. The dataset could
contain the following features:

- Number of Bedrooms: Range = 1-5
- Number of Bathrooms: Range = 1-3
- Area of the House: Range = 1000-5000 sqft
- Distance from the City Center: Range = 1-10 miles
- Price of the House: Range = 100,000-500,000 USD

The task is to predict the price of the house.

If we do not scale the data, consider how you would compare the number of
bedrooms with the price of the house. The number of bedrooms ranges from 1 to 5,
whereas the price ranges from 100,000 to 500,000 USD, so the two differ by five
orders of magnitude. Similarly, the area of the house ranges from 1000 to 5000
sqft, which is again on a different scale from the price.

Even if we tried to evaluate the data by pure numbers, the sheer difference in
scale between the features would make it difficult for us to understand the data.

Mathematically, think of the machine learning equation as $Y = M \times X + C$, but this covers only a single variable.

In the case of multiple variables, the equation would be

$Y = M_1 \times X_1 + M_2 \times X_2 + M_3 \times X_3 + M_4 \times X_4 + C$

where

- $X_1$ is the number of bedrooms
- $X_2$ is the number of bathrooms
- $X_3$ is the area of the house
- $X_4$ is the distance from the city center
- $Y$ is the price of the house.

Given the above equation, domain knowledge suggests that a small change in
$X_1$ (one more bedroom) would change $Y$ by a significant amount, whereas a
small change in $X_4$ (one more mile from the city center) may or may not change
$Y$ by much. Numerically, however, the picture can be the reverse: a feature
with a larger range contributes larger raw values to the sum, so it can dominate
the equation regardless of its domain relevance, while a small-range feature
such as $X_1$ appears to have little influence on $Y$. With the ranges listed
above, this is most pronounced for $X_3$ (1000-5000), but it applies to any pair
of features on different scales.
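
To illustrate the point above, here is a minimal sketch that evaluates the terms $M_i \times X_i$ for a few houses. The house values are hypothetical and chosen only to fall inside the ranges listed earlier, and all coefficients are assumed to be 1 so that the only difference between the terms is the scale of the features. Under these assumptions, the area term accounts for more than 99% of the sum for every house.

```python
import numpy as np

# Hypothetical houses (illustrative values, not taken from any dataset).
# Columns: bedrooms, bathrooms, area in sqft, distance in miles.
feature_names = ["bedrooms", "bathrooms", "area_sqft", "distance_miles"]
X = np.array([
    [2, 1, 1200.0, 8.0],
    [3, 2, 2500.0, 5.0],
    [5, 3, 4800.0, 1.0],
])

# Assumption: equal coefficients M_1..M_4 = 1, so only feature scale differs.
M = np.ones(X.shape[1])

# Per-feature terms M_i * X_i of the linear equation (the intercept C is omitted).
terms = X * M

# Share of each term in the sum M_1*X_1 + ... + M_4*X_4, computed per house.
share = terms / terms.sum(axis=1, keepdims=True)

# Average share of each feature across the houses.
for name, s in zip(feature_names, share.mean(axis=0)):
    print(f"{name:>15}: {s:.4f}")
```

In a fitted model the coefficients would adapt to the feature scales, so this is not a statement about any particular trained model. It only shows how raw magnitudes skew a direct comparison between features.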

Scaling the values down and bringing them to a common scale helps make the
features comparable and understandable. Plotting them on graphs becomes
easier, and the data becomes more interpretable.

Data scaling is a technique used to normalize the range of
independent variables or features of data. In data processing, it is also known
as data normalization and is generally performed during the data preprocessing
step.
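
As one possible illustration of normalizing the range of features, the sketch below applies min-max scaling, $x' = \frac{x - \min(x)}{\max(x) - \min(x)}$, which maps each feature to the range $[0, 1]$. Min-max scaling is only one common technique and is chosen here as an assumption. The helper name `min_max_scale` and the house values are hypothetical.

```python
import numpy as np


def min_max_scale(X):
    """Rescale each column of X to the range [0, 1] (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    span = X.max(axis=0) - col_min
    # A constant column has zero span; use 1 to avoid dividing by zero,
    # which maps every value in that column to 0.
    span[span == 0] = 1.0
    return (X - col_min) / span


# Hypothetical houses (illustrative values, not taken from any dataset).
# Columns: bedrooms, bathrooms, area in sqft, distance in miles.
X = np.array([
    [2, 1, 1200.0, 8.0],
    [3, 2, 2500.0, 5.0],
    [5, 3, 4800.0, 1.0],
])

X_scaled = min_max_scale(X)
print(X_scaled)
```

After scaling, every column spans the same range, so bedrooms and area can be compared or plotted on the same axis. In practice the minimum and maximum would typically be computed on the training data only and then reused for any new data.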


In [2]:
from sqlalchemy import create_engine, text
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

# Connection to the local PostgreSQL database that holds the OHLC data.
engine = create_engine(
    'postgresql://postgres:postgres@localhost:6004/postgres'
)


def get_stock_data(symbol):
    """Load the OHLC rows for `symbol`, indexed and sorted by datetime."""
    # Bind the symbol as a query parameter instead of formatting it into the SQL.
    query = text('select * from ohlc_data where symbol = :symbol')
    return pd.read_sql(
        query,
        engine,
        params={'symbol': symbol},
        parse_dates=['datetime']
    ).set_index('datetime').sort_index().rename(columns={
        'open': 'Open',
        'high': 'High',
        'low': 'Low',
        'close': 'Close',
    })


data = get_stock_data('NIFTY')
data.head()

| datetime | Open | High | Low | Close | symbol |
|---|---|---|---|---|---|
| 2015-01-01 09:14:00 | 8272.8 | 8272.8 | 8272.8 | 8272.8 | NIFTY |
| 2015-01-01 09:15:00 | 8272.8 | 8272.8 | 8272.8 | 8272.8 | NIFTY |
| 2015-01-01 09:16:00 | 8253.15 | 8253.15 | 8253.15 | 8253.15 | NIFTY |
| 2015-01-01 09:17:00 | 8254.15 | 8254.15 | 8254.15 | 8254.15 | NIFTY |
| 2015-01-01 09:18:00 | 8261.15 | 8261.15 | 8261.15 | 8261.15 | NIFTY |
