## Data Preprocessing and Feature Engineering

#### Introduction
Data preprocessing and feature engineering turn raw records into a dataset that a model can learn from. This section cleans the data, handles missing values, converts column types, normalizes numerical features, derives new features, and one-hot encodes a categorical column. The goal is a high-quality dataset that supports accurate and robust predictive models.

#### Loading the Data
The dataset is loaded from a CSV file into a pandas DataFrame. This step reads the raw data into memory for further processing.


In [8]:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Load data
data = pd.read_csv('ucs_data.csv')
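
As a minimal sketch of what the loading step does, the example below reads a small in-memory CSV instead of `ucs_data.csv`. The three sample rows and the name `sample` are invented for illustration; only the column names come from the later cells. Reading `Invoice` as text and parsing `InvoiceDate` at load time are optional choices, not something the original cell does.

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for 'ucs_data.csv'.
sample_csv = io.StringIO(
    "Invoice,Description,Quantity,InvoiceDate,Price,Customer ID,Country\n"
    "489434,WHITE CHERRY LIGHTS,12,2009-12-01 07:45:00,6.75,13085,United Kingdom\n"
    "C489449,PAPER BUNTING,-12,2009-12-01 10:33:00,2.95,16321,Australia\n"
    "489435,,3,2009-12-01 07:46:00,1.25,,France\n"
)

# Read Invoice as text so that codes such as 'C489449' and plain numbers
# share one dtype, and parse InvoiceDate while loading.
sample = pd.read_csv(
    sample_csv,
    dtype={"Invoice": str},
    parse_dates=["InvoiceDate"],
)

# Quick sanity checks before any cleaning: shape and missing values per column.
n_rows, n_cols = sample.shape
missing_counts = sample.isna().sum()
```

Inspecting `missing_counts` at this point shows which columns will need attention in the later missing-value step.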


#### Data Cleaning
Data cleaning involves removing duplicates, filtering out invalid entries, and ensuring that the data is consistent and accurate. In this step, exact duplicate rows are removed, cancelled transactions (invoices whose number contains a `C`) are filtered out, and only rows with a positive quantity are retained.


In [9]:
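# Data Cleaning
# Drop exact duplicate rows
data = data.drop_duplicates()
# Remove cancelled transactions (invoice number contains 'C');
# astype(str) keeps the filter working if Invoice was read as numeric
data = data[~data['Invoice'].astype(str).str.contains('C')]
# Keep only rows with a positive quantity
data = data[data['Quantity'] > 0]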
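
To see how each cleaning filter changes the row count, here is a small sketch on a made-up five-row frame. The values and the names `toy`, `rows_start`, and so on are hypothetical; the three filters mirror the ones in the cell above.

```python
import pandas as pd

# Hypothetical rows: one exact duplicate, one cancellation, one zero quantity.
toy = pd.DataFrame({
    "Invoice": ["489434", "489434", "C489449", "489450", "489451"],
    "Quantity": [12, 12, -12, 0, 5],
    "Price": [6.75, 6.75, 2.95, 1.25, 3.00],
})
rows_start = len(toy)

# Step 1: drop exact duplicate rows.
toy = toy.drop_duplicates()
rows_after_dedup = len(toy)

# Step 2: drop cancelled invoices (number contains 'C').
toy = toy[~toy["Invoice"].astype(str).str.contains("C")]
rows_after_cancel = len(toy)

# Step 3: keep only positive quantities.
toy = toy[toy["Quantity"] > 0]
rows_final = len(toy)
```

Recording the count after each step makes it easy to spot a filter that removes far more rows than expected.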


#### Handling Missing Values
Missing values can significantly impact the performance of machine learning models. In this step, records with a missing `Customer ID` or `Description` value are removed to ensure data completeness.


In [10]:
# Handling Missing Values
data.dropna(subset=['Customer ID', 'Description'], inplace=True)
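
The sketch below illustrates, on invented data, that `dropna` with a `subset` removes a row when either listed column is missing (the default `how="any"`). The frame `toy` and the result names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical frame covering every combination of missing values.
toy = pd.DataFrame({
    "Customer ID": [13085.0, np.nan, 13078.0, np.nan],
    "Description": ["LIGHTS", "BUNTING", None, None],
    "Quantity": [12, 3, 6, 1],
})

# Count missing values per column before dropping anything.
missing_per_column = toy[["Customer ID", "Description"]].isna().sum()

# Default how="any": a row is dropped if EITHER column is missing.
kept_any = toy.dropna(subset=["Customer ID", "Description"])

# how="all": a row is dropped only if BOTH columns are missing.
kept_all = toy.dropna(subset=["Customer ID", "Description"], how="all")
```

With the default behavior only the fully populated row survives, which is the stricter rule used in the cell above.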


#### Data Transformation
Data transformation involves converting data into a suitable format for analysis. Here, the `InvoiceDate` is converted to a datetime format, and a new feature `TotalPrice` is created by multiplying `Quantity` and `Price`.


In [11]:
# Data Transformation
data['InvoiceDate'] = pd.to_datetime(data['InvoiceDate'])
data['TotalPrice'] = data['Quantity'] * data['Price']
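
A small sketch of the same two transformations on invented rows follows. The `errors="coerce"` argument is an optional addition, not part of the original cell: it turns unparseable strings into `NaT` so they can be counted instead of raising an error.

```python
import pandas as pd

# Hypothetical rows; the last date string is deliberately malformed.
toy = pd.DataFrame({
    "InvoiceDate": ["2009-12-01 07:45:00", "2009-12-01 10:33:00", "not a date"],
    "Quantity": [12, 3, 6],
    "Price": [6.75, 2.95, 1.25],
})

# Convert text to datetimes; bad values become NaT instead of raising.
toy["InvoiceDate"] = pd.to_datetime(toy["InvoiceDate"], errors="coerce")
unparsed = int(toy["InvoiceDate"].isna().sum())

# Line total in the original currency units.
toy["TotalPrice"] = toy["Quantity"] * toy["Price"]
```

Checking `unparsed` after the conversion gives a quick signal of how many timestamps failed to parse.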


#### Data Normalization
Normalization rescales numerical features to a common range, which can improve training stability and performance for models that are sensitive to feature scale. The `Quantity` and `Price` features are scaled to the range [0, 1] using `MinMaxScaler`. Because `TotalPrice` was computed in the previous step, it remains in the original units.


In [12]:
# Normalization
scaler = MinMaxScaler()
data[['Quantity', 'Price']] = scaler.fit_transform(data[['Quantity', 'Price']])
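
As an illustration of what `MinMaxScaler` computes, the sketch below compares it with the formula `(x - min) / (max - min)` on invented values. It also shows that a row outside the fitted range maps outside [0, 1], which matters if the scaler is later reused on new data. All names and numbers here are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical values for the two scaled columns.
toy = pd.DataFrame({"Quantity": [1, 5, 9, 13], "Price": [2.0, 4.0, 6.0, 10.0]})

# Scale with scikit-learn, as in the cell above.
scaler = MinMaxScaler()
scaled = scaler.fit_transform(toy[["Quantity", "Price"]])

# The same result computed by hand, column by column.
manual = (toy - toy.min()) / (toy.max() - toy.min())

# A new row outside the fitted min/max lands outside [0, 1].
unseen = pd.DataFrame({"Quantity": [17], "Price": [1.0]})
unseen_scaled = scaler.transform(unseen)
```

If the data is later split into training and test sets, fitting the scaler on the training portion only and calling `transform` on the rest avoids leaking test-set statistics into training.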


#### Feature Engineering
Feature engineering involves creating new features from the existing data to enhance the predictive power of the model. In this step, three calendar features are extracted from `InvoiceDate`: `Year`, `Month`, and `DayOfWeek` (an integer where Monday is 0 and Sunday is 6).


In [13]:
# Feature Engineering
data['Year'] = data['InvoiceDate'].dt.year
data['Month'] = data['InvoiceDate'].dt.month
data['DayOfWeek'] = data['InvoiceDate'].dt.dayofweek
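
The following sketch applies the same `.dt` accessors to two invented timestamps so the encoding of each feature is visible. The frame `dates` is hypothetical.

```python
import pandas as pd

# Hypothetical timestamps: a Wednesday in December and a Sunday in January.
dates = pd.DataFrame({
    "InvoiceDate": pd.to_datetime(["2010-12-01 08:26:00", "2011-01-02 12:00:00"]),
})

# Same extractions as in the cell above.
dates["Year"] = dates["InvoiceDate"].dt.year
dates["Month"] = dates["InvoiceDate"].dt.month
dates["DayOfWeek"] = dates["InvoiceDate"].dt.dayofweek  # Monday=0 ... Sunday=6
```

Note that `Month` runs from 1 to 12 while `DayOfWeek` starts at 0, so the two features use different conventions.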


#### One-Hot Encoding
One-hot encoding is applied to categorical features to convert them into a numerical format that can be used by machine learning algorithms. The `Country` feature is one-hot encoded, and the resulting encoded features are concatenated with the original dataset.


In [14]:
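# One-Hot Encoding
# scikit-learn >= 1.2 uses `sparse_output`; the older `sparse` argument was removed in 1.4
encoder = OneHotEncoder(sparse_output=False)
country_encoded = encoder.fit_transform(data[['Country']])
country_df = pd.DataFrame(country_encoded, columns=encoder.get_feature_names_out(['Country']))
# Reset the index so rows line up with country_df's fresh 0..n-1 index
data = pd.concat([data.reset_index(drop=True), country_df], axis=1)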




#### Saving the Preprocessed Data
The final preprocessed dataset is saved to a CSV file, `preprocessed_data.csv`, so that the cleaned and transformed data can be reloaded later for model training and evaluation.


In [16]:
data.to_csv('preprocessed_data.csv', index=False)
print('Data Preprocessing and Feature Engineering Completed')


Data Preprocessing and Feature Engineering Completed
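
One thing to keep in mind when reloading the saved file is that CSV stores dates as text. The sketch below demonstrates this with an in-memory buffer rather than `preprocessed_data.csv`; the frame `toy` and its values are invented for illustration.

```python
import io

import pandas as pd

# Hypothetical two-row frame with a datetime column.
toy = pd.DataFrame({
    "InvoiceDate": pd.to_datetime(["2010-12-01 08:26:00", "2011-01-02 12:00:00"]),
    "TotalPrice": [81.0, 7.5],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# A plain read returns InvoiceDate as text.
buffer.seek(0)
plain = pd.read_csv(buffer)

# Passing parse_dates restores the datetime dtype.
buffer.seek(0)
restored = pd.read_csv(buffer, parse_dates=["InvoiceDate"])
```

Passing `parse_dates=['InvoiceDate']` when reading the saved file back should therefore restore the datetime column for downstream steps.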
