---

<img src="../06_RESOURCES/books_data_transformation.png" alt="books_data_transformation-picture" height=500px>

---

# Books Data Transformation

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/quantumudit/Analyzing-Books/blob/master/02_ETL/Books%20Data%20Transformation.ipynb)
[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/quantumudit/Analyzing-Books/master?labpath=02_ETL/Books%20Data%20Transformation.ipynb)

## Introduction

In this notebook, we will be importing, cleaning, transforming and exporting a dataset obtained from the website [Books to Scrape](https://books.toscrape.com/). 

The purpose of this notebook is to clean the raw data and make it ready for analysis and visualization. The data source provides information on various books such as the book title, genre, availability in stock, price, and star ratings.

The scope of the cleaning operations includes reviewing the data, identifying and fixing any issues, changing the data types, addition of a custom index, and transforming the data as required. Finally, we will validate the data and export it to a CSV file for further analysis and visualization.

## Imports & Setup

In this section, we will import the libraries needed to clean and transform the books dataset.

This will include popular data analysis libraries such as `pandas` and `numpy` and we will also be importing the scraped books dataset for data cleaning and transformation.

Moreover, we will also be setting any necessary configurations or options that will be used throughout the notebook.

### Module Import & Setup

We will be importing the following libraries and modules:

- `pandas`: for loading, manipulating, and analyzing the dataset
- `numpy`: for performing numerical operations on the dataset
- `warnings`: for handling any warning that may occur during the execution of the code

We will also be setting any necessary configurations or options that will be used throughout the notebook, such as setting the display options for `pandas` dataframes. In addition, we'll be ignoring any warning that may appear during the execution of the notebook.

We are also importing helper functions from the **"helper_functions.py"** file, which contains various functions that help us keep our code organized and reusable.

In [1]:
# imports
import numpy as np
import pandas as pd
import warnings

# import helper functions
from helper_functions import dataframe_structure, dict_to_table, datatype_details

# module setup
%matplotlib inline
pd.options.display.precision = 5
warnings.filterwarnings("ignore")

### Data Import

Now, we will import the scraped books dataset from the directory using the `pandas` library. The `read_csv()` function reads the data from the CSV file. We'll also make a deep copy of the dataframe to work on, so that the originally loaded data stays untouched.

In [2]:
# import data from csv
scraped_books_df = pd.read_csv("../01_SCRAPER/scraped_data.csv", index_col=False)

# create a copy of it for working
books = scraped_books_df.copy(deep=True)

# view the glimpse of the dataframe
books.head()

|   | title | genre | price | star_rating | stock_availability | book_image | last_updated_at_UTC |
|---|-------|-------|-------|-------------|--------------------|------------|---------------------|
| 0 | It's Only the Himalayas | Travel | 45.17 | Two | In stock | https://books.toscrape.com/media/cache/27/a5/2... | 18-Jun-2022 19:04:35 |
| 1 | Full Moon over Noah’s Ark: An Odyssey to Mount... | Travel | 49.43 | Four | In stock | https://books.toscrape.com/media/cache/57/77/5... | 18-Jun-2022 19:04:35 |
| 2 | See America: A Celebration of Our National Par... | Travel | 48.87 | Three | In stock | https://books.toscrape.com/media/cache/9a/7e/9... | 18-Jun-2022 19:04:35 |
| 3 | Vagabonding: An Uncommon Guide to the Art of L... | Travel | 36.94 | Two | In stock | https://books.toscrape.com/media/cache/d5/bf/d... | 18-Jun-2022 19:04:35 |
| 4 | Under the Tuscan Sun | Travel | 37.33 | Three | In stock | https://books.toscrape.com/media/cache/98/c2/9... | 18-Jun-2022 19:04:35 |


With the successful import of the data and necessary libraries, we can now proceed to the data overview section for a comprehensive understanding of the dataset.

## Data Overview

In this section, we'll delve into the specifics of the data and gain a better understanding of the structure and quality of the data we have received. 

This analysis will include a comprehensive evaluation of the number of rows and columns in the data, the types of variables present, and any initial observations about the data's structure and quality. We'll also perform duplicate checks to identify any duplicate records in the data.

This analysis will provide a high-level summary of the data and help identify any potential issues that need to be addressed in the data cleaning process.

We'll then divide the data overview process into the following sub-sections:

- **Metadata Information**: This gives us additional information that helps us understand the data source.
- **Dataframe Details**: This gives us the overall structure of the data.
- **Field Details**: This gives us the general details of each field, such as the field name and field type.
- **Redundancy Checks**: This will check for duplicates in the data to make sure that we have a clean dataset to work with.

Now, let's dive into each of these sub-sections and get a deeper understanding of the data.

### Metadata Information

In this section, we'll be describing each field present in our dataset, providing insight into the information contained within each field. This information is crucial in helping us understand the structure of our data, and ensuring that we are able to effectively clean and transform it in preparation for analysis and visualization. Here's a breakdown of the fields contained within our dataset:

- **title** : Title of the book
- **genre** : Genre of the book
- **price** : Price of the book in pounds sterling (£)
- **star_rating** : Rating of book out of 5
- **stock_availability** : Availability status of the book
- **book_image** : Image URL of the book
- **last_updated_at_UTC** : Latest UTC timestamp of item scraped

### Dataframe Details

This section is a crucial one: it provides a high-level overview of the dataframe being analyzed. It includes important information about the size and structure of the data, as well as any missing or null values in the data. The details of the dataframe that we'll get are as follows:

- **Dimensions**: The number of rows and columns in the dataframe.
- **Shape**: The shape of the dataframe, represented as a tuple (rows, columns).
- **Row Count**: The number of rows in the dataframe.
- **Column Count**: The number of columns in the dataframe.
- **Total Datapoints**: The total number of data points in the dataframe, calculated as the number of rows multiplied by the number of columns.
- **Null Datapoints**: The number of missing or null values in the dataframe.
- **Non-Null Datapoints**: The number of non-missing or non-null values in the dataframe.
- **Total Memory Usage**: The total memory usage of the dataframe, represented in bytes.
- **Average Memory Usage**: The average memory usage per column of the dataframe, represented in bytes.

This section provides a quick reference for the dataframe, and helps to identify any potential issues with the data that may need to be addressed in the cleaning process. By having a clear and concise overview of the dataframe, it's easier to move forward with the data cleaning and analysis process.
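The attributes listed above can all be derived from standard `pandas` properties. The following is a minimal sketch under that assumption; `summarize_structure` is a hypothetical stand-in and not the project's actual `dataframe_structure()` helper, whose implementation is not shown in this notebook. Note that `memory_usage(deep=True)` returns one entry per column plus one for the index, so an average taken over it is an average per column including the index.

```python
import pandas as pd


def summarize_structure(df: pd.DataFrame) -> dict:
    """Hypothetical stand-in for dataframe_structure(); not the project's helper."""
    null_count = int(df.isna().sum().sum())
    # One entry per column plus one entry for the index, in bytes
    memory = df.memory_usage(deep=True)
    return {
        "Dimensions": df.ndim,
        "Shape": df.shape,
        "Row Count": df.shape[0],
        "Column Count": df.shape[1],
        "Total Datapoints": df.size,
        "Null Datapoints": null_count,
        "Non-Null Datapoints": int(df.size - null_count),
        "Total Memory Usage": int(memory.sum()),
        "Average Memory Usage": float(memory.mean()),
    }


# Toy data for illustration only; the real notebook passes the books dataframe
toy = pd.DataFrame({"title": ["A", "B", None], "price": [10.0, 12.5, 9.99]})
summary = summarize_structure(toy)
print(summary)
```

Keeping the result as a plain dictionary makes it easy to hand over to a table formatter such as the notebook's `dict_to_table()`.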

In [3]:
# get the dataframe structure details
df_structure = dataframe_structure(dataframe=books)

# prettify the dictionary response
tbl = dict_to_table(input_dict=df_structure,
                    column_headers=["Dataframe Attributes", "Value"])

# show table
print(tbl)

+----------------------+-----------+
| Dataframe Attributes | Value     |
+----------------------+-----------+
| Dimensions           | 2         |
| Shape                | (1000, 7) |
| Row Count            | 1000      |
| Column Count         | 7         |
| Total Datapoints     | 7000      |
| Null Datapoints      | 0         |
| Non-Null Datapoints  | 7000      |
| Total Memory Usage   | 512267    |
| Average Memory Usage | 64033.0   |
+----------------------+-----------+


💡 **Insights:**

The dataframe has $2$ dimensions and a shape of ($1000$, $7$), meaning it has $1000$ rows and $7$ columns.<br>
The dataframe has a total of $7000$ data points, with no null data points. The total memory usage of the dataframe is $512,267$ bytes and the average memory usage per column is about $64,033$ bytes.

### Field Details

This section provides a detailed view of the columns in the dataframe, and includes important information that can help with the data cleaning and analysis process. This section is typically used to get an understanding of the data types and distributions of the columns, and to identify any missing or null values that may need to be addressed.

We can use the `info()` method in `pandas` to quickly get an understanding of the structure and distribution of the data, and identify any issues that may need to be addressed in the cleaning process.

Additionally, we'll use the function `datatype_details()` defined in the [helper_functions.py](./helper_functions.py) file to get the datatype details of the dataframe.

Some of the key information that we'll get from this section are as follows:

- **Column Names**: A list of the names of all the columns in the dataframe, which can help you to identify any columns that may need to be renamed for clarity or consistency.
- **Datatypes**: The datatype of each column in the dataframe, such as integer, float, string, etc. The datatype information is important because it determines how the data can be analyzed and manipulated.
- **Non-Null Count**: The number of non-null or non-missing values for each column in the dataframe. This information can help you to determine if there are any columns that have a high percentage of missing values, which may need to be handled differently in the cleaning process.
- **Null Count**: The number of null or missing values for each column in the dataframe. This information can help you to determine if there are any columns that have a high number of missing values, which may need to be handled differently in the cleaning process.
- **Memory Usage**: The memory usage of each column in the dataframe, represented in bytes. This information can be useful in determining if there are any columns that are using a large amount of memory, which may need to be optimized for performance.
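The per-field details listed above can also be collected into a single table instead of being read off the `info()` printout. The sketch below uses toy data and only standard `pandas` calls; the names `field_summary` and `dtype_counts` are illustrative and do not come from the notebook or its helper module.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({"title": ["A", "B", None], "price": [10.0, 12.5, 9.99]})

# One row per column: datatype, non-null count, null count, memory in bytes
field_summary = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "non_null": toy.notna().sum(),
    "null": toy.isna().sum(),
    "memory_bytes": toy.memory_usage(deep=True, index=False),
})

# Number of fields per datatype, similar in spirit to what datatype_details() prints
dtype_counts = toy.dtypes.value_counts()

print(field_summary)
print(dtype_counts)
```

Having the summary as a dataframe makes it possible to sort or filter it, for example to list only the columns with at least one null value.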

In [4]:
# get field details
books.info(memory_usage ='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 7 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   title                1000 non-null   object 
 1   genre                1000 non-null   object 
 2   price                1000 non-null   float64
 3   star_rating          1000 non-null   object 
 4   stock_availability   1000 non-null   object 
 5   book_image           1000 non-null   object 
 6   last_updated_at_UTC  1000 non-null   object 
dtypes: float64(1), object(6)
memory usage: 500.3 KB


In [5]:
# get details of the datatypes
datatype_details(books)

There are 6 fields with object datatype
There are 1 fields with float64 datatype


💡 **Insight:**

Based on the above results, it appears that the dataframe has $7$ columns and none of the columns have missing values. This is a positive indication that the data is relatively clean and ready for further analysis. However, it is always good to double check the data and perform some exploratory data analysis to make sure that there are no other issues with the data.

We also found that our dataset has $6$ object-type fields and $1$ float-type field (`price`).

The data we have is almost clean; however, we can still make some enhancements for ease of use.

### Redundancy Checks

In this section, we will conduct a duplicate check on the data in order to ensure that our dataset is accurate and free from any duplicate records. We will exclude the `last_updated_at_UTC` field from this check as it represents the timestamp when the data was collected and is not relevant for the purpose of identifying duplicates. 

This step is important to maintain the integrity of our analysis and ensure that we are working with accurate information.

In [6]:
# number of duplicated entries
duplicates_cnt = books.drop(columns="last_updated_at_UTC").duplicated().sum()
print(f"There are {duplicates_cnt} duplicate entries in the dataset")

There are 0 duplicate entries in the dataset


💡 **Insight:**

Having a duplicate-free dataset is important for ensuring that our analysis is accurate and meaningful. 

As we can see from the results of our duplicate check, there are no duplicate values in our data. Each row in the dataframe perfectly represents a unique book entry. This is a great start as it ensures that our analysis will be based on a clean and reliable dataset.

Moving forward, we can be confident in the accuracy of our insights as we analyze the relationships between the various attributes of the books, such as genre, rating, and price.
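To illustrate why the timestamp column is excluded from the duplicate check, here is a small sketch on toy data. The rows and values are made up for the example; only the column name `last_updated_at_UTC` comes from the dataset. It also shows `keep=False`, which marks every member of a duplicate group and is handy for inspecting the offending rows.

```python
import pandas as pd

# Toy data: the first two rows describe the same book scraped one second apart
toy = pd.DataFrame({
    "title": ["A", "A", "B"],
    "price": [10.0, 10.0, 12.5],
    "last_updated_at_UTC": [
        "18-Jun-2022 19:04:35",
        "18-Jun-2022 19:04:36",
        "18-Jun-2022 19:04:37",
    ],
})

# Compare on every column except the scrape timestamp
compare_cols = toy.columns.difference(["last_updated_at_UTC"]).tolist()

# Including the timestamp hides the duplicate because the timestamps differ
naive_count = int(toy.duplicated().sum())

# Excluding it reveals one repeated row
dup_count = int(toy.duplicated(subset=compare_cols).sum())

# keep=False flags all rows that belong to a duplicate group
dup_rows = toy[toy.duplicated(subset=compare_cols, keep=False)]

print(naive_count, dup_count)
print(dup_rows)
```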

## Data Cleaning Operations

This is the section where the actual data cleaning and preparation takes place. In this section, we will apply various data cleaning techniques to the dataframe to remove any errors, inconsistencies, or irrelevant information that may impact our analysis. The goal of this section is to ensure that the data is in a clean, consistent, and usable format for analysis.

Some common data cleaning operations include:

- **Handling missing values**: Replacing or removing missing values from the dataframe.
- **Removing duplicates**: Identifying and removing duplicate records from the dataframe.
- **Fixing data types**: Converting data from one type to another, as needed, to ensure consistency across columns.
- **Removing irrelevant information**: Removing any columns or records that are not needed for the analysis.
- **Standardizing values**: Converting values to a standardized format, such as converting date strings to date objects.

By documenting the data cleaning operations that were performed, you can ensure that the data is cleaned in a consistent and reproducible manner, and that your results can be easily validated by others.

The "Data Overview" section has provided us with a comprehensive understanding of the state of the data. Upon review, we found that the data is in good shape, with no missing values present. However, there are still a few cleaning operations that we need to perform in order to ensure that the data is ready for analysis.

We can divide the cleaning operation into two stages:

- **Data Standardization**: Involves standardization of specific fields for a better analysis
- **Datatype Fixes**: Involves fixing any datatype errors or inconsistencies in the data, so that it can be effectively analyzed.

By performing these cleaning operations, we will have a well-structured and usable data set that can be analyzed and used to gain insights and make informed decisions.

### Data Standardization

In this section, we'll standardize the values in `star_rating` and `stock_availability` columns. Therefore, the action items are as follows:

- The first step is to convert the word values in the `star_rating` column, such as "One", "Two", etc., into their integer equivalents, such as 1, 2, etc. This will allow us to perform numerical calculations and analysis on the ratings data.
- Next, we will standardize the values in the `stock_availability` column by converting "In stock" to "Yes" and everything else to "No". This will allow us to easily determine if a product is available for purchase or not.

In [7]:
# view unique list of items in star_rating column
books["star_rating"].unique()

array(['Two', 'Four', 'Three', 'One', 'Five'], dtype=object)

In [8]:
# mapping dictionary
ratings_map = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

# apply map over column
books["star_rating"] = books["star_rating"].map(ratings_map)

# view unique list of items in star_rating column
books["star_rating"].unique()

array([2, 4, 3, 1, 5], dtype=int64)
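One thing to keep in mind with `Series.map()` and a dictionary: any value that is not a key in the dictionary becomes `NaN`, which would also turn the column into floats. All five values in this dataset are covered, but a small guard like the sketch below can catch surprises if the source data changes. The value "Six" is a made-up bad entry used only for illustration.

```python
import pandas as pd

ratings_map = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

# "Six" is a hypothetical unexpected value, not present in the real dataset
raw = pd.Series(["Two", "Four", "Six"], name="star_rating")

# Values missing from the dictionary are mapped to NaN
mapped = raw.map(ratings_map)

# Collect the original values that could not be mapped
unmapped = raw[mapped.isna()].unique().tolist()

if unmapped:
    print(f"Unmapped star_rating values: {unmapped}")
```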

In [9]:
# checking glimpse of the data
books.head()

|   | title | genre | price | star_rating | stock_availability | book_image | last_updated_at_UTC |
|---|-------|-------|-------|-------------|--------------------|------------|---------------------|
| 0 | It's Only the Himalayas | Travel | 45.17 | 2 | In stock | https://books.toscrape.com/media/cache/27/a5/2... | 18-Jun-2022 19:04:35 |
| 1 | Full Moon over Noah’s Ark: An Odyssey to Mount... | Travel | 49.43 | 4 | In stock | https://books.toscrape.com/media/cache/57/77/5... | 18-Jun-2022 19:04:35 |
| 2 | See America: A Celebration of Our National Par... | Travel | 48.87 | 3 | In stock | https://books.toscrape.com/media/cache/9a/7e/9... | 18-Jun-2022 19:04:35 |
| 3 | Vagabonding: An Uncommon Guide to the Art of L... | Travel | 36.94 | 2 | In stock | https://books.toscrape.com/media/cache/d5/bf/d... | 18-Jun-2022 19:04:35 |
| 4 | Under the Tuscan Sun | Travel | 37.33 | 3 | In stock | https://books.toscrape.com/media/cache/98/c2/9... | 18-Jun-2022 19:04:35 |


In [10]:
# view unique list of items in stock_availability column
books["stock_availability"].unique()

array(['In stock'], dtype=object)

In [11]:
# apply list comprehension to conditionally change value
books["stock_availability"] = ["Yes" if x == "In stock" else "No" for x in books["stock_availability"]]

# view unique list of items in stock_availability column
books["stock_availability"].unique()

array(['Yes'], dtype=object)
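The list comprehension works well here. As an alternative, the same conditional replacement can be written in vectorized form with `numpy.where`, which avoids the Python-level loop. This is only a sketch on toy data; "Out of stock" is a hypothetical value, since the scraped dataset contains only "In stock".

```python
import numpy as np
import pandas as pd

# Toy data; "Out of stock" is hypothetical and does not occur in the dataset
status = pd.Series(["In stock", "Out of stock", "In stock"])

# "Yes" where the value equals "In stock", "No" everywhere else
flag = pd.Series(np.where(status.eq("In stock"), "Yes", "No"), index=status.index)

print(flag.tolist())
```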

In [12]:
# checking glimpse of the data
books.head()

|   | title | genre | price | star_rating | stock_availability | book_image | last_updated_at_UTC |
|---|-------|-------|-------|-------------|--------------------|------------|---------------------|
| 0 | It's Only the Himalayas | Travel | 45.17 | 2 | Yes | https://books.toscrape.com/media/cache/27/a5/2... | 18-Jun-2022 19:04:35 |
| 1 | Full Moon over Noah’s Ark: An Odyssey to Mount... | Travel | 49.43 | 4 | Yes | https://books.toscrape.com/media/cache/57/77/5... | 18-Jun-2022 19:04:35 |
| 2 | See America: A Celebration of Our National Par... | Travel | 48.87 | 3 | Yes | https://books.toscrape.com/media/cache/9a/7e/9... | 18-Jun-2022 19:04:35 |
| 3 | Vagabonding: An Uncommon Guide to the Art of L... | Travel | 36.94 | 2 | Yes | https://books.toscrape.com/media/cache/d5/bf/d... | 18-Jun-2022 19:04:35 |
| 4 | Under the Tuscan Sun | Travel | 37.33 | 3 | Yes | https://books.toscrape.com/media/cache/98/c2/9... | 18-Jun-2022 19:04:35 |


### Datatype Fixes

In this section, we will take a closer look at the data types of each column in the dataframe to ensure that they are correctly assigned. The goal of this step is to ensure that each column is stored in a format that accurately represents its data, allowing us to perform accurate calculations and analysis.

During the "Data Standardization" steps, we have conditionally replaced values in the dataframe, so it is crucial to verify that the correct data types are in place. If needed, we will change the data type of a column to accurately reflect the data it contains.

For example, if we have a column that contains numbers, but it is stored as a string data type, we will change it to an integer or float data type. This will allow us to perform numerical calculations on the data and get meaningful results.

By ensuring that the data types are correctly assigned, we can be confident that the data is stored in a usable and meaningful format for analysis.

In [13]:
books.dtypes

title                   object
genre                   object
price                  float64
star_rating              int64
stock_availability      object
book_image              object
last_updated_at_UTC     object
dtype: object

💡 **Insight:**

The review of the data types in the dataframe has revealed that the `star_rating` column now has the correct datatype of "int64". This is a result of our previous standardization step where we converted the letter values into their integer equivalent. Additionally, the other fields in the dataframe have also been assigned appropriate data types.

However, there is one exception: the `last_updated_at_UTC` field is actually a datetime field, but it has been assigned a data type of "object". While this may seem like an issue, it is not a concern as we will not be performing any analysis on the timestamps. The timestamp is simply there to provide information on when the data was scraped.

In conclusion, the data types have been correctly assigned to ensure that the data is stored in a meaningful and usable format for analysis.
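Although the notebook deliberately leaves the timestamp as text, converting it would be straightforward if a later analysis ever needed it. The sketch below is optional and not part of the notebook's pipeline; it assumes the timestamps follow the `18-Jun-2022 19:04:35` pattern seen in the data and that month abbreviations are in English.

```python
import pandas as pd

# Sample values copied from the dataset's timestamp format
stamps = pd.Series(["18-Jun-2022 19:04:35", "18-Jun-2022 19:05:58"])

# Parse with an explicit format and mark the result as UTC
parsed = pd.to_datetime(stamps, format="%d-%b-%Y %H:%M:%S", utc=True)

print(parsed.dtype)
print(parsed.dt.minute.tolist())
```

Passing an explicit `format` avoids ambiguous day/month guessing and fails loudly if a value does not match the expected pattern.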

## Data Transformation

This section involves transforming the data in ways that will make it easier to work with, analyze and gain insights from. The data may need to be transformed in order to better represent the relationships and patterns within the data, or to prepare it for specific types of analysis.

Depending on the requirement, this can include aggregating the data at different levels of granularity, creating new variables through calculations, or transforming existing variables to suit a specific analysis.

The current level of granularity of the data is sufficient for analysis, so no additional aggregations are necessary. However, before exporting the data for further analysis and visualization, there are a few transformation steps that need to be taken to ensure the data is in the best format for these purposes. These steps will help to streamline the data and make it easier to work with.

We can divide the data transformation in the following two sub-sections:

- **Field Renaming**: Adjusting column names to better represent their contents.
- **Custom Index Addition**: Adding a custom index to the dataframe for improved organization and easier analysis.

By completing these transformations, we can prepare the data for more in-depth analysis, and increase the chances of uncovering useful insights.

### Field Renaming

This section involves renaming columns in the dataframe to make them more descriptive, consistent, and easy to work with. Properly naming columns can greatly improve the readability and understandability of the data, making it easier to perform analysis and visualization.

In this section, we will review the current column names and determine if any changes are necessary. We may need to rename columns to make them more descriptive, to make them match the names of other columns in related datasets, or to eliminate any confusion that may arise from ambiguous or inconsistent column names. By properly renaming columns, we can improve the overall quality and organization of the data, and make it easier to work with and analyze.

In [14]:
# view current column names
list(books.columns)

['title',
 'genre',
 'price',
 'star_rating',
 'stock_availability',
 'book_image',
 'last_updated_at_UTC']

In [15]:
# list of new column names
new_column_names = [
    'Title',
    'Genre',
    'Price (£)',
    'Rating',
    'Stock Availability Status',
    'Cover Page',
    'Last Update Timestamp (UTC)'
]

# rename columns
books.columns = new_column_names

# view new column names in dataframe
list(books.columns)

['Title',
 'Genre',
 'Price (£)',
 'Rating',
 'Stock Availability Status',
 'Cover Page',
 'Last Update Timestamp (UTC)']
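Assigning a list to `books.columns` relies on the new names being in exactly the same order as the existing columns. An alternative that does not depend on column order is `DataFrame.rename()` with an explicit mapping. The following is a sketch on a toy dataframe with three of the columns; `errors="raise"` makes the call fail if a key in the mapping does not exist in the dataframe.

```python
import pandas as pd

# Toy dataframe with a subset of the dataset's columns
toy = pd.DataFrame({"title": ["A"], "price": [10.0], "star_rating": [2]})

# Old name -> new name; the order of entries does not matter
rename_map = {
    "star_rating": "Rating",
    "title": "Title",
    "price": "Price (£)",
}

# errors="raise" surfaces typos in the old column names as a KeyError
renamed = toy.rename(columns=rename_map, errors="raise")

print(list(renamed.columns))
```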

In [16]:
# view glimpse of data
books.head()

|   | Title | Genre | Price (£) | Rating | Stock Availability Status | Cover Page | Last Update Timestamp (UTC) |
|---|-------|-------|-----------|--------|---------------------------|------------|-----------------------------|
| 0 | It's Only the Himalayas | Travel | 45.17 | 2 | Yes | https://books.toscrape.com/media/cache/27/a5/2... | 18-Jun-2022 19:04:35 |
| 1 | Full Moon over Noah’s Ark: An Odyssey to Mount... | Travel | 49.43 | 4 | Yes | https://books.toscrape.com/media/cache/57/77/5... | 18-Jun-2022 19:04:35 |
| 2 | See America: A Celebration of Our National Par... | Travel | 48.87 | 3 | Yes | https://books.toscrape.com/media/cache/9a/7e/9... | 18-Jun-2022 19:04:35 |
| 3 | Vagabonding: An Uncommon Guide to the Art of L... | Travel | 36.94 | 2 | Yes | https://books.toscrape.com/media/cache/d5/bf/d... | 18-Jun-2022 19:04:35 |
| 4 | Under the Tuscan Sun | Travel | 37.33 | 3 | Yes | https://books.toscrape.com/media/cache/98/c2/9... | 18-Jun-2022 19:04:35 |


### Custom Index Addition

In this section, we will add a custom index to the dataframe to make it easier to access and manipulate the data. An index is a set of labels that is used to identify each row in the dataframe, and it can greatly improve the performance and efficiency of data operations.

By adding a custom index, we can make the data more organized and easier to work with. This can be particularly useful when we want to perform specific operations on a subset of the data, or when we want to access specific rows or columns more quickly. The custom index can be set to a specific column in the data, or it can be created based on a combination of columns. Whatever the case may be, adding a custom index can greatly improve the data preparation process and make the data more accessible and usable for analysis and visualization.

In [17]:
# create custom index column
custom_index_col = pd.RangeIndex(start=1000, stop=1000+len(books), step=1, name='BookID')

# add index column to dataframe
books.index = custom_index_col
books.index = 'B' + books.index.astype('string')

# view glimpse of data
books.head()

| BookID | Title | Genre | Price (£) | Rating | Stock Availability Status | Cover Page | Last Update Timestamp (UTC) |
|--------|-------|-------|-----------|--------|---------------------------|------------|-----------------------------|
| B1000 | It's Only the Himalayas | Travel | 45.17 | 2 | Yes | https://books.toscrape.com/media/cache/27/a5/2... | 18-Jun-2022 19:04:35 |
| B1001 | Full Moon over Noah’s Ark: An Odyssey to Mount... | Travel | 49.43 | 4 | Yes | https://books.toscrape.com/media/cache/57/77/5... | 18-Jun-2022 19:04:35 |
| B1002 | See America: A Celebration of Our National Par... | Travel | 48.87 | 3 | Yes | https://books.toscrape.com/media/cache/9a/7e/9... | 18-Jun-2022 19:04:35 |
| B1003 | Vagabonding: An Uncommon Guide to the Art of L... | Travel | 36.94 | 2 | Yes | https://books.toscrape.com/media/cache/d5/bf/d... | 18-Jun-2022 19:04:35 |
| B1004 | Under the Tuscan Sun | Travel | 37.33 | 3 | Yes | https://books.toscrape.com/media/cache/98/c2/9... | 18-Jun-2022 19:04:35 |


After undergoing transformation, the data appears to be more organized with a clear index column and descriptive, user-friendly field names. We will now conduct a quick validation of the data prior to exporting it.
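As a small illustration of what a labelled index offers, the sketch below builds `BookID` labels on a toy dataframe and looks rows up by label with `.loc`. It constructs the labels with a list comprehension, which is a slightly different route from the `RangeIndex` approach used in the notebook; the toy titles and ratings are made up.

```python
import pandas as pd

# Toy dataframe standing in for the books data
toy = pd.DataFrame({"Title": ["A", "B", "C"], "Rating": [2, 4, 3]})

# Build labels B1000, B1001, ... and name the index BookID
toy.index = pd.Index(
    [f"B{n}" for n in range(1000, 1000 + len(toy))],
    name="BookID",
)

# Look up a single book by its ID
row = toy.loc["B1001"]

# Look up several books and one column at once
subset = toy.loc[["B1000", "B1002"], "Title"]

print(row["Title"])
print(subset.tolist())
```

Because the labels are unique, each ID resolves to exactly one row, which makes the index usable as a key in downstream tools.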

## Data Validation

In this section, we will verify that the data meets certain quality standards and that all transformations and cleaning operations have been performed correctly.

Data validation is important because it ensures that the data is accurate and reliable, and that it will not produce any errors or inconsistencies during analysis. By validating the data, we can ensure that the results of our analysis are trustworthy and that any insights we gain are based on accurate and complete data.

Now that we have completed the data cleaning and transformation, it's time to validate the data to ensure that it is ready for analysis and visualization. We will examine various columns and datatypes to confirm that the data is accurate, complete, and consistent.
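Alongside the visual checks, the expectations built up in the earlier sections can be expressed as explicit rules. The function below is a hypothetical sketch, not part of the notebook or its helper module; the rules (ratings between 1 and 5, positive prices, "Yes"/"No" stock flags, unique index) are assumptions drawn from the cleaning steps described in this notebook.

```python
import pandas as pd


def validate_books(df: pd.DataFrame) -> list[str]:
    """Return a list of rule violations; an empty list means all checks passed."""
    problems = []
    if df.isna().any().any():
        problems.append("null values present")
    if not df.index.is_unique:
        problems.append("index is not unique")
    if not df["Rating"].between(1, 5).all():
        problems.append("rating outside 1-5")
    if (df["Price (£)"] <= 0).any():
        problems.append("non-positive price")
    if not df["Stock Availability Status"].isin(["Yes", "No"]).all():
        problems.append("unexpected stock status")
    return problems


# Toy frames for illustration: one clean, one with deliberate errors
good = pd.DataFrame(
    {"Rating": [2, 4], "Price (£)": [45.17, 49.43],
     "Stock Availability Status": ["Yes", "Yes"]},
    index=pd.Index(["B1000", "B1001"], name="BookID"),
)
bad = pd.DataFrame(
    {"Rating": [6, 4], "Price (£)": [45.17, 49.43],
     "Stock Availability Status": ["Yes", "Maybe"]},
    index=pd.Index(["B1000", "B1000"], name="BookID"),
)

good_problems = validate_books(good)
bad_problems = validate_books(bad)
print(good_problems)
print(bad_problems)
```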

In [18]:
# basic info of dataframe
books.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1000 entries, B1000 to B1999
Data columns (total 7 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   Title                        1000 non-null   object 
 1   Genre                        1000 non-null   object 
 2   Price (£)                    1000 non-null   float64
 3   Rating                       1000 non-null   int64  
 4   Stock Availability Status    1000 non-null   object 
 5   Cover Page                   1000 non-null   object 
 6   Last Update Timestamp (UTC)  1000 non-null   object 
dtypes: float64(1), int64(1), object(5)
memory usage: 62.5+ KB


In [19]:
# review random data samples
books.sample(10)

| BookID | Title | Genre | Price (£) | Rating | Stock Availability Status | Cover Page | Last Update Timestamp (UTC) |
|--------|-------|-------|-----------|--------|---------------------------|------------|-----------------------------|
| B1933 | Redeeming Love | Christian Fiction | 20.47 | 5 | Yes | https://books.toscrape.com/media/cache/21/21/2... | 18-Jun-2022 19:05:58 |
| B1394 | Girl, Interrupted | Nonfiction | 42.14 | 3 | Yes | https://books.toscrape.com/media/cache/f9/ec/f... | 18-Jun-2022 19:05:07 |
| B1970 | The Four Agreements: A Practical Guide to Pers... | Spirituality | 17.66 | 5 | Yes | https://books.toscrape.com/media/cache/0f/7e/0... | 18-Jun-2022 19:06:04 |
| B1230 | The Murder That Never Was (Forensic Instincts #5) | Fiction | 54.11 | 3 | Yes | https://books.toscrape.com/media/cache/dc/44/d... | 18-Jun-2022 19:04:53 |
| B1678 | Rogue Lawyer (Rogue Lawyer #1) | Add a comment | 50.11 | 3 | Yes | https://books.toscrape.com/media/cache/92/e4/9... | 18-Jun-2022 19:05:31 |
| B1178 | Fifty Shades Darker (Fifty Shades #2) | Romance | 21.96 | 1 | Yes | https://books.toscrape.com/media/cache/cc/bd/c... | 18-Jun-2022 19:04:49 |
| B1244 | I Am Pilgrim (Pilgrim #1) | Fiction | 10.6 | 4 | Yes | https://books.toscrape.com/media/cache/ed/07/e... | 18-Jun-2022 19:04:53 |
| B1564 | Where'd You Go, Bernadette | Default | 18.13 | 1 | Yes | https://books.toscrape.com/media/cache/c6/7e/c... | 18-Jun-2022 19:05:21 |
| B1746 | The Natural History of Us (The Fine Art of Pre... | Young Adult | 45.22 | 3 | Yes | https://books.toscrape.com/media/cache/5d/7f/5... | 18-Jun-2022 19:05:38 |
| B1909 | The Love and Lemons Cookbook: An Apple-to-Zucc... | Food and Drink | 37.6 | 2 | Yes | https://books.toscrape.com/media/cache/0d/1f/0... | 18-Jun-2022 19:05:55 |


After undergoing several cleaning and transformation steps, the dataset is now in a refined state, ready for analysis and visualization. To make the data easily accessible and available for future use, we will proceed by exporting the cleaned dataset in CSV file format as our next step.

## Data Export

In this section, we will export the cleaned and transformed dataset as a CSV file, which can be easily loaded and used for further analysis and reporting.

In [20]:
# export data into CSV file
books.to_csv("../03_DATA/books_data.csv", encoding='utf-8', index_label='BookID')
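A quick way to gain confidence in the export is to read the file back and compare it with what was written. The sketch below does this round trip on a toy dataframe in a temporary directory, so it does not touch the project's `03_DATA` folder; the file name and toy values are used for illustration only.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy dataframe with a BookID index, mirroring the exported layout
toy = pd.DataFrame(
    {"Title": ["A", "B"], "Price (£)": [45.17, 49.43]},
    index=pd.Index(["B1000", "B1001"], name="BookID"),
)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "books_data.csv"
    # Write with the same options as the export cell
    toy.to_csv(path, encoding="utf-8", index_label="BookID")
    # Read it back, restoring BookID as the index
    reloaded = pd.read_csv(path, encoding="utf-8", index_col="BookID")

print(reloaded)
```

Reading with `index_col="BookID"` restores the custom index, so downstream notebooks see the same labels that were exported.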

## Conclusion

In this notebook, we imported the raw dataset scraped from the [Books to Scrape](https://books.toscrape.com/) website and performed a comprehensive data cleaning and transformation process to prepare the data for analysis and visualization. The process was broken down into several sections, including Data Overview, Dataframe Details, Field Details, Data Cleaning Operations, Datatype Fixes, Data Transformation, Field Renaming, Custom Index Addition, and Data Validation.

Throughout the cleaning and transformation process, we discovered the data was in good shape with no null values present and with appropriate datatypes. However, we performed several operations to ensure the data was in the desired format.

With the data now in a clean and usable format, the next steps would be to perform analysis and visualization on the dataset to gain actionable insights. Additionally, an interactive dashboard using Power BI could be built to present the insights to a wider audience.

---