In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

In [2]:
dataframe = pd.read_csv("clean_data.csv")
dataframe.head()

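| Unnamed: 0 | year | month | stateDescription | sectorName | customers | price | revenue | sales |
|---|---|---|---|---|---|---|---|---|
| 0 | 2001 | 1 | Wyoming | all sectors | NaN | 4.31 | 48.1284 | 1116.17208 |
| 1 | 2001 | 1 | Wyoming | commercial | NaN | 5.13 | 12.67978 | 247.08691 |
| 2 | 2001 | 1 | Wyoming | industrial | NaN | 3.26 | 19.60858 | 602.30484 |
| 3 | 2001 | 1 | Wyoming | other | NaN | 4.75 | 0.76868 | 16.17442 |
| 4 | 2001 | 1 | Wyoming | residential | NaN | 6.01 | 15.07136 | 250.60591 |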


**ETL (extract, transform, load) integration:** This involves extracting data from different sources, transforming it to meet the needs of the target system, and loading it into a target database or data warehouse. ETL tools automate this process, making it more efficient and less error-prone. Several Apache projects are commonly used for this: `Apache Airflow` (a workflow orchestrator written in Python), `Apache Beam` (a data processing model with a Python SDK), and `Apache NiFi` (a Java-based dataflow tool that Python code typically interacts with from outside).
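
As a minimal illustration of the three ETL stages, the sketch below uses only pandas and Python's built-in `sqlite3` module rather than a dedicated ETL tool. The in-memory CSV text, the `sales_clean` table name, and the `revenue_per_sale` column are hypothetical stand-ins chosen for the example.

```python
import io
import sqlite3

import pandas as pd

# Extract: read raw records (an in-memory CSV stands in for a real source).
raw_csv = io.StringIO(
    "state,sector,revenue,sales\n"
    "Wyoming,commercial,12.5,250.0\n"
    "Wyoming,industrial,,600.0\n"
    "Utah,commercial,30.0,500.0\n"
)
extracted = pd.read_csv(raw_csv)

# Transform: drop rows with missing revenue and derive a new column.
transformed = extracted.dropna(subset=["revenue"]).copy()
transformed["revenue_per_sale"] = transformed["revenue"] / transformed["sales"]

# Load: write the result into a SQLite table (an in-memory database here).
conn = sqlite3.connect(":memory:")
transformed.to_sql("sales_clean", conn, index=False, if_exists="replace")

loaded_rows = conn.execute("SELECT COUNT(*) FROM sales_clean").fetchone()[0]
print(loaded_rows)
```

A real pipeline would replace the in-memory pieces with actual sources and targets and would usually hand scheduling and retries to an orchestrator.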

## Handling data normalization

Data normalization is the process of transforming data onto a common scale or range, so that differences in magnitude between features do not dominate and values become easier to compare and interpret. It is an important preprocessing step because many machine learning methods, such as distance-based models and models trained with gradient descent, are sensitive to the scale of their inputs.

1. **Min-Max Normalization**

   This method scales the data to a fixed range, typically between 0 and 1. With `x_min` and `x_max` denoting the minimum and maximum of the data and `x` the original value, the formula for min-max normalization is:

```python
x_norm = (x - x_min) / (x_max - x_min)
```

In [3]:
sales_min = dataframe['sales'].min()
sales_max = dataframe['sales'].max()

sales_norm = (dataframe['sales'] - sales_min) / (sales_max - sales_min)
sales_norm.describe()

count    85870.000000
mean         0.015259
std          0.054357
min          0.000000
25%          0.000738
50%          0.003694
75%          0.011074
max          1.000000
Name: sales, dtype: float64

In [4]:
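# Using sklearn
scaler = MinMaxScaler()  # create the scaler object

# Equivalent two-step form (sklearn expects 2-D input, hence the double brackets):
# scaler.fit(dataframe[['sales']])               # learn the min and max
# norm = scaler.transform(dataframe[['sales']])  # scale the input data

# fit_transform does both steps at once; reshape(-1, 1) makes a single-column 2-D array
norm = scaler.fit_transform(dataframe['sales'].values.reshape(-1, 1))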
print(norm.min())
print(norm.max())

0.0
1.0
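
The separate `fit` and `transform` steps matter when the scaler is fitted on one set of values and reused on another. The sketch below illustrates this with small hypothetical arrays standing in for a train/test split of a single feature; the values are not taken from `clean_data.csv`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical values standing in for a train/test split of one feature.
train = np.array([10.0, 20.0, 30.0, 50.0]).reshape(-1, 1)
test = np.array([5.0, 40.0, 60.0]).reshape(-1, 1)

scaler = MinMaxScaler()  # default feature_range is (0, 1)
scaler.fit(train)        # learns min=10 and max=50 from the training values only

train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)  # reuses the training min and max

print(train_scaled.ravel())
print(test_scaled.ravel())
```

Because the minimum and maximum come from `train` alone, values in `test` that lie outside the training range are mapped outside the interval from 0 to 1.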


2. **Z-score Normalization / StandardScaler**

   This method scales the data to have zero mean and unit variance. The formula for z-score normalization is:

```python
x_norm = (x - mean) / std
```

where `x` is the original value, `mean` and `std` are the mean and standard deviation of the data, respectively, and `x_norm` is the normalized value.

In [5]:
x_mean = dataframe['sales'].mean()
x_std = dataframe['sales'].std()  # pandas uses the sample standard deviation (ddof=1) by default

sales_norm = (dataframe['sales'] - x_mean) / x_std
sales_norm.describe()

count    8.587000e+04
mean    -9.681321e-18
std      1.000000e+00
min     -2.807211e-01
25%     -2.671478e-01
50%     -2.127704e-01
75%     -7.699104e-02
max      1.811622e+01
Name: sales, dtype: float64

In [15]:
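# Using sklearn
scaler = StandardScaler()  # create the scaler object

# Equivalent two-step form (sklearn expects 2-D input, hence the double brackets):
# scaler.fit(dataframe[['sales']])               # learn the mean and standard deviation
# norm = scaler.transform(dataframe[['sales']])  # scale the input data

# fit_transform does both steps at once; reshape(-1, 1) makes a single-column 2-D array
norm = scaler.fit_transform(dataframe['sales'].values.reshape(-1, 1))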
print(f"{norm.mean():.4f}")
print(f"{norm.std():.4f}")

-0.0000
1.0000
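
The manual pandas calculation and `StandardScaler` do not use exactly the same standard deviation: pandas defaults to the sample estimate (`ddof=1`), while `StandardScaler` divides by the population estimate (`ddof=0`). On a column with tens of thousands of rows the difference is negligible, but it is visible on small inputs. The sketch below shows this on a short hypothetical series.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # hypothetical data

# pandas defaults to the sample standard deviation (ddof=1).
z_pandas = (values - values.mean()) / values.std()

# StandardScaler divides by the population standard deviation (ddof=0).
z_sklearn = StandardScaler().fit_transform(values.to_numpy().reshape(-1, 1)).ravel()

# Matching ddof=0 in pandas reproduces the sklearn result.
z_pandas_ddof0 = (values - values.mean()) / values.std(ddof=0)

print(np.allclose(z_pandas, z_sklearn))
print(np.allclose(z_pandas_ddof0, z_sklearn))
```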


3. **Log Transformation**

   This method applies a logarithmic function to the data, which compresses large values and often makes right-skewed data more symmetric and closer to a normal distribution. The formula for the log transformation is:

```python
x_norm = log(x)
```

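where `x` is the original value and `x_norm` is the transformed value. The logarithm is defined only for positive values: with NumPy, zeros produce `-inf` and negative values produce `NaN`, and a runtime warning is emitted in both cases.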

In [13]:
norm = np.log(dataframe['sales'])
norm.head()

0    7.017660
1    5.509740
2    6.400764
3    2.783431
4    5.523882
Name: sales, dtype: float64
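
When a column can contain zeros, one common workaround is `np.log1p`, which computes `log(1 + x)` and therefore maps zero to zero. The sketch below uses a short hypothetical series that includes a zero; it is a swapped-in variant of the plain log transformation above, not the method used in the cell.

```python
import numpy as np
import pandas as pd

sales = pd.Series([0.0, 16.17, 247.09, 1116.17])  # hypothetical values, including a zero

# log1p computes log(1 + x), so a zero input maps to zero instead of -inf.
log1p_sales = np.log1p(sales)

# expm1 inverts the transform: exp(y) - 1 recovers the original values.
recovered = np.expm1(log1p_sales)

print(log1p_sales.iloc[0])
print(np.allclose(recovered, sales))
```

Note that `log1p` still returns `NaN` for inputs below -1, so it only addresses zeros and small values, not arbitrary negative data.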

4. **Power Transformation**

   This method applies a power function to the data to adjust the skewness and kurtosis of its distribution, which can make the data more symmetric. The formula for the power transformation is:

```python
x_norm = sign(x) * abs(x) ** a
```

where `x` is the original value, `a` is the power parameter (typically between 0 and 1), `sign` returns the sign of `x` (+1, -1, or 0), and `abs` is the absolute value function. Raising `abs(x)` to the power `a` and then multiplying by `sign(x)` keeps negative inputs negative, so the transformation preserves the direction of the data.

In [11]:
power_factor = 0.5
norm = np.sign(dataframe['sales']) * np.power(np.abs(dataframe['sales']), power_factor)
norm.head()

0    33.409162
1    15.718998
2    24.541900
3     4.021743
4    15.830537
Name: sales, dtype: float64
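
To see how the choice of `a` affects skewness, the sketch below wraps the formula in a hypothetical helper, `signed_power`, and applies it to a synthetic right-skewed sample. The log-normal sample and the seed are assumptions made for illustration and are unrelated to the `sales` column.

```python
import numpy as np
import pandas as pd


def signed_power(x, a):
    """Apply sign(x) * |x| ** a element-wise."""
    return np.sign(x) * np.power(np.abs(x), a)


# Hypothetical right-skewed sample drawn from a log-normal distribution.
rng = np.random.default_rng(0)
sample = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=10_000))

skew_before = sample.skew()
skew_sqrt = signed_power(sample, 0.5).skew()    # a = 1/2 (square root)
skew_cbrt = signed_power(sample, 1 / 3).skew()  # a = 1/3 (cube root)

print(f"{skew_before:.2f} {skew_sqrt:.2f} {skew_cbrt:.2f}")

# Negative inputs keep their sign under the transform.
signed_example = signed_power(np.array([-4.0, 0.0, 9.0]), 0.5)
print(signed_example)
```

For positive, right-skewed data like this sample, smaller values of `a` pull the long tail in more strongly, so the measured skewness decreases as `a` shrinks.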