In [14]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler


In [15]:
df = pd.read_csv('clean_datas.csv')
df.head()

Unnamed: 0,year,month,stateDescription,sectorName,customers,price,revenue,sales
0,2001,1,Wyoming,all sectors,,4.31,48.1284,1116.17208
1,2001,1,Wyoming,commercial,,5.13,12.67978,247.08691
2,2001,1,Wyoming,industrial,,3.26,19.60858,602.30484
3,2001,1,Wyoming,other,,4.75,0.76868,16.17442
4,2001,1,Wyoming,residential,,6.01,15.07136,250.60591


#### Handling data normalization

Data normalization transforms data onto a common scale or range so that differences in magnitude do not dominate an analysis and features become directly comparable. It is a standard preprocessing step: scale-sensitive methods, such as distance-based models (k-nearest neighbors, k-means) and models trained with gradient descent, often perform better on normalized inputs.

ETL (extract, transform, load) integration: This involves extracting data from different sources, transforming it to meet the needs of the target system, and loading it into a target database or data warehouse. ETL tools automate this process, making it more efficient and less error-prone. From Python, common choices include Apache Airflow (a Python-based workflow orchestrator) and Apache Beam (which offers a Python SDK). Apache NiFi is also widely used for ETL, although it is a Java-based tool rather than a Python library.
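The three ETL stages can be illustrated without any dedicated tool. The sketch below is a minimal example using only pandas and the standard-library `sqlite3` module; the in-memory CSV, the column names, and the `sales_by_sector` table name are assumptions made for illustration and are not part of this notebook's dataset or pipeline.

```python
import io
import sqlite3

import pandas as pd

# Extract: read raw records (an in-memory CSV stands in for a real source).
raw_csv = io.StringIO(
    "state,sector,sales\n"
    "Wyoming,commercial,247.08691\n"
    "Wyoming,industrial,602.30484\n"
    "Wyoming,residential,250.60591\n"
)
raw = pd.read_csv(raw_csv)

# Transform: tidy the sector labels and derive a rounded sales column.
transformed = raw.assign(
    sector=raw["sector"].str.title(),
    sales_rounded=raw["sales"].round(1),
)

# Load: write the result to a target table (an in-memory SQLite database here).
conn = sqlite3.connect(":memory:")
transformed.to_sql("sales_by_sector", conn, index=False)

# Read the table back to confirm the load step worked.
loaded = pd.read_sql("SELECT * FROM sales_by_sector", conn)
conn.close()
print(loaded)
```

Tools such as Airflow or Beam add scheduling, retries, and scaling around steps like these; the steps themselves stay conceptually the same.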

##### Min-Max Normalization

This method scales the data to a fixed range, typically between 0 and 1. The formula for min-max normalization is:

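x_norm = (x - x_min) / (x_max - x_min)

where x is the original value, x_min and x_max are the minimum and maximum of the data, respectively, and x_norm is the normalized value.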

In [16]:
sales_min = df['sales'].min()
sales_max = df['sales'].max()

In [17]:
sales_norm = (df['sales'] - sales_min) / (sales_max - sales_min)

In [18]:
sales_norm.describe()

count    85870.000000
mean         0.015259
std          0.054357
min          0.000000
25%          0.000738
50%          0.003694
75%          0.011074
max          1.000000
Name: sales, dtype: float64

In [24]:
# Using sklearn
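scaler = MinMaxScaler()  # create the scaler object
# fit_transform expects a 2-D input, hence the double brackets; it returns a NumPy array
norm = scaler.fit_transform(df[['sales']])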

In [26]:
print(norm.min())
print(norm.max())

0.0
1.0
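As a quick check that the manual formula and `MinMaxScaler` agree, the sketch below applies both to a small made-up column. The `toy` frame and its values are assumptions for illustration, not rows from `clean_datas.csv`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data, chosen so the arithmetic is easy to follow.
toy = pd.DataFrame({"sales": [10.0, 20.0, 25.0, 50.0]})

# Manual min-max normalization: (x - x_min) / (x_max - x_min).
col = toy["sales"]
manual = (col - col.min()) / (col.max() - col.min())

# The same transformation with scikit-learn.
scaler = MinMaxScaler()
scaled = scaler.fit_transform(toy[["sales"]]).ravel()

# Both approaches should give the same values.
same = np.allclose(manual.to_numpy(), scaled)
print(same)

# The fitted scaler remembers the min and max, so the scaling can be undone.
restored = scaler.inverse_transform(scaled.reshape(-1, 1)).ravel()
print(restored)
```

Because the fitted scaler stores the minimum and maximum it saw, the same object can later be applied to new data with `transform`, which the manual version would have to re-implement by hand.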


##### Z-score Normalization / StandardScaler

This method scales the data to have zero mean and unit variance. The formula for z-score normalization is:

x_norm = (x - mean) / std

where x is the original value, mean and std are the mean and standard deviation of the data, respectively, and x_norm is the normalized value.

In [28]:
x_mean = df['sales'].mean()
x_std = df['sales'].std()  # pandas default: sample standard deviation (ddof=1)

sales_norm = (df['sales'] - x_mean) / x_std
sales_norm.describe()

count    8.587000e+04
mean    -9.681321e-18
std      1.000000e+00
min     -2.807211e-01
25%     -2.671478e-01
50%     -2.127704e-01
75%     -7.699104e-02
max      1.811622e+01
Name: sales, dtype: float64

In [29]:
# Using sklearn
scaler = StandardScaler()  # scales with the population standard deviation (ddof=0)
norm = scaler.fit_transform(df[['sales']])

In [36]:
print(f"{norm.mean():.4f}")
print(f"{norm.std():.4f}")

-0.0000
1.0000
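One detail is easy to miss when comparing the two approaches: pandas' `.std()` defaults to the sample standard deviation (`ddof=1`), while `StandardScaler` divides by the population standard deviation (`ddof=0`). With roughly 86,000 rows the difference is negligible, but on small data it is visible. The sketch below uses a made-up `toy` column (an assumption for illustration) to show the two conventions.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: mean 5, population standard deviation exactly 2.
toy = pd.DataFrame({"sales": [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]})
col = toy["sales"]

sample_std = col.std()            # pandas default, ddof=1
population_std = col.std(ddof=0)  # matches StandardScaler's convention

# Manual z-score using the population standard deviation.
manual_pop = (col - col.mean()) / population_std

# The same transformation with scikit-learn.
scaled = StandardScaler().fit_transform(toy[["sales"]]).ravel()

matches = np.allclose(manual_pop.to_numpy(), scaled)
print(f"sample std:     {sample_std:.4f}")
print(f"population std: {population_std:.4f}")
print(f"manual (ddof=0) matches StandardScaler: {matches}")
```

To reproduce `StandardScaler` output exactly with pandas, pass `ddof=0` to `.std()`; the default `ddof=1` yields slightly smaller z-scores in magnitude.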
