In [2]:
import pandas as pd
dataframe = pd.read_csv("clean_data.csv")

## Handling data normalization

Data normalization is the process of transforming data onto a common scale or range, so that differences in magnitude between features no longer dominate and values become easier to compare and interpret. It is an important preprocessing step because many machine learning models, in particular distance-based methods and models trained with gradient descent, are sensitive to the scale of their inputs.

**ETL (extract, transform, load) integration:** This involves extracting data from different sources, transforming it to meet the needs of the target system, and loading it into a target database or data warehouse. ETL tools are commonly used to automate this process, making it more efficient and less error-prone. Tools often used for this include `Apache NiFi`, `Apache Airflow`, and `Apache Beam`: Airflow pipelines are written in Python and Beam provides a Python SDK, while NiFi is a Java-based tool configured mainly through its web UI.

1. **Min-Max Normalization**

This method scales the data to a fixed range, typically between 0 and 1. With `x_min` and `x_max` denoting the smallest and largest values in the column, the formula for min-max normalization is:

```python
x_norm = (x - x_min) / (x_max - x_min)
```

In [3]:
sales_min = dataframe['sales'].min()
sales_max = dataframe['sales'].max()

In [7]:
sales_norm = (dataframe['sales'] - sales_min) / (sales_max - sales_min)

In [9]:
sales_norm.describe()

count    85870.000000
mean         0.015259
std          0.054357
min          0.000000
25%          0.000738
50%          0.003694
75%          0.011074
max          1.000000
Name: sales, dtype: float64

In [10]:
dataframe['sales'].describe()

count     85870.000000
mean       5980.048970
std       21302.453181
min           0.000000
25%         289.144572
50%        1447.518085
75%        4339.950965
max      391900.008970
Name: sales, dtype: float64
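
The cells above apply the formula inline to one column. As a minimal sketch, the same steps could be wrapped in a reusable helper. The function name `min_max_scale`, its `feature_range` parameter, and the sample values are hypothetical and not part of the original notebook. The handling of a constant column (where `x_max - x_min` is zero) is an assumed convention, since the formula itself is undefined in that case.

```python
import pandas as pd


def min_max_scale(series: pd.Series, feature_range=(0.0, 1.0)) -> pd.Series:
    """Rescale a numeric Series linearly into feature_range (hypothetical helper)."""
    low, high = feature_range
    s_min, s_max = series.min(), series.max()
    span = s_max - s_min
    if span == 0:
        # Constant column: the formula would divide by zero, so map every
        # value to the lower bound instead (an assumed convention).
        return pd.Series(low, index=series.index, name=series.name, dtype="float64")
    # Standard min-max formula, then stretch and shift into [low, high].
    return (series - s_min) / span * (high - low) + low


# Small made-up sample, not taken from clean_data.csv.
sample = pd.Series([0.0, 250.0, 500.0, 1000.0], name="sales")
scaled = min_max_scale(sample)
print(scaled.tolist())  # [0.0, 0.25, 0.5, 1.0]
```

Because the transformation is linear, summary statistics scale with it. In the two `describe()` outputs above, the minimum of `sales` is 0, so each normalized quantile equals the raw quantile divided by the maximum; for example, the median 1447.518085 / 391900.008970 ≈ 0.003694.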