
This project focuses on scraping data related to books by their genre from the "Books To Scrape" website; performing necessary transformations on the scraped data and then analyzing & visualizing it using Jupyter Notebook and Power BI.


Project Logo


Scraping & Analyzing books from Books To Scrape website with Python and Power BI

built-with-love powered-by-coffee cc-nc-sa

Overview • Prerequisites • Architecture • Demo • Support • License

Overview

The objective of this project is to gather information about books and their attributes from the website Books To Scrape.

The collected data then undergoes a thorough exploratory data analysis aimed at surfacing useful insights.

These insights are visualized in Power BI to provide a clear and comprehensive understanding of the data. The project's primary focus is turning raw information from the website into useful insights and visual representations.

Website Snippet

The repository directory structure is as follows:

Analyzing-Books
├─ 01_SCRAPER
├─ 02_ETL
├─ 03_DATA
├─ 04_ANALYSIS
├─ 05_DASHBOARD
└─ 06_RESOURCES

The type of content present in the directories is as follows:

01_SCRAPER

This directory contains a Python script that extracts book information from the website, together with a flat file that stores the scraped output.

The script automates the data collection process, and the flat file gives the downstream ETL step a single, easily accessed copy of the raw data.
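To make the scraping step concrete, here is a minimal sketch of how such a script might work, assuming the `requests` and `BeautifulSoup` libraries. The selectors match the public page structure of Books To Scrape, but the repository's actual script may collect more fields (such as genre or availability) and handle pagination differently.

```python
import csv

import requests
from bs4 import BeautifulSoup

BASE_URL = "https://books.toscrape.com/catalogue/page-{}.html"

def scrape_page(page: int) -> list[dict]:
    """Collect title, price, and star rating for every book on one catalogue page."""
    response = requests.get(BASE_URL.format(page), timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    books = []
    for article in soup.select("article.product_pod"):
        books.append({
            "title": article.h3.a["title"],
            "price": article.select_one("p.price_color").text,
            # the second CSS class on the rating tag is the rating word, e.g. "Three"
            "rating": article.select_one("p.star-rating")["class"][1],
        })
    return books

if __name__ == "__main__":
    rows = [book for page in range(1, 3) for book in scrape_page(page)]
    with open("books_raw.csv", "w", newline="", encoding="utf-8") as file:
        writer = csv.DictWriter(file, fieldnames=["title", "price", "rating"])
        writer.writeheader()
        writer.writerows(rows)
```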

02_ETL

This directory houses a Jupyter Notebook that performs an ETL (Extract, Transform, Load) process on the scraped data.

The notebook cleans, organizes, and structures the raw dataset into a form suitable for analysis. Each step is documented in the text cells, with the data cleaning and transformation decisions explained so the reasoning behind them is easy to follow.

Finally, the transformed data is exported to the 03_DATA directory, where it is ready for further examination and analysis.
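As an illustration, a minimal ETL pass with pandas might look like the sketch below. The column names and file paths are assumptions carried over from the scraping sketch above, not the notebook's actual schema.

```python
import pandas as pd

RATING_MAP = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

# Extract: load the raw scraped data (path and columns are illustrative)
raw = pd.read_csv("01_SCRAPER/books_raw.csv")

# Transform: strip the currency symbol and cast price to float,
# map the word ratings to integers, and drop exact duplicates
clean = (
    raw.assign(
        price=raw["price"].str.replace("£", "", regex=False).astype(float),
        rating=raw["rating"].map(RATING_MAP),
    )
    .drop_duplicates()
)

# Load: export the analysis-ready dataset into 03_DATA
clean.to_csv("03_DATA/books_clean.csv", index=False)
```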

03_DATA

This directory contains the cleaned, analysis-ready data.

Only processed data lives here: errors and inconsistencies have already been handled in the ETL step, so the files can be used directly for analysis and visualization.

04_ANALYSIS

This directory contains a Jupyter Notebook that analyzes the clean dataset to uncover insights.

The notebook is thoroughly annotated: the results of each analysis are documented in the text cells, making the thought process and the resulting insights easy to follow for anyone looking to gain a deeper understanding of the data.
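A short sketch of the kind of EDA the notebook performs is shown below. The `genre` column is an assumption based on the project description (books are scraped by genre), and the notebook's actual analyses go further.

```python
import pandas as pd

books = pd.read_csv("03_DATA/books_clean.csv")

# Average price and rating per genre, sorted by price (column names assumed)
summary = (
    books.groupby("genre")[["price", "rating"]]
    .mean()
    .round(2)
    .sort_values("price", ascending=False)
)
print(summary.head(10))
```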

05_DASHBOARD

This directory houses a short markdown file with an embedded link to the Power BI report.

The report is an interactive visual layer over the data: whether you want a high-level overview or a drill-down into specific details, it provides an effective and engaging way to explore the information.

06_RESOURCES

This directory holds the visual assets used across the project: images, icons, layouts, styling files, etc.

These elements support the presentation and visualization of the data, and keeping them in one central location keeps the rest of the repository focused on code and data.

Prerequisites

To fully grasp the concepts and processes involved in this project, it is recommended to have a solid understanding of the following skills:

  • Fundamental knowledge of Python and Jupyter Notebook
  • Familiarity with the Python libraries listed in the requirements.txt file
  • Basic proficiency in HTML and CSS
  • Basic familiarity with browser developer tools
  • An understanding of the basics of Power BI

Having these skills as a foundation will help to ensure a smooth and effective experience while working on this project.

The selection of applications and their installation process may differ depending on personal preferences and computer configurations.

Architecture

The architecture of this project is straightforward and can be easily understood through the accompanying diagram, as seen below:

Process Architecture

The project architecture consists of the following steps:

  • Data scraping: Data is collected from the website by a Python script and stored in a flat file.
  • Data cleaning: The raw data is processed and cleaned in an ETL-specific Jupyter Notebook.
  • Data analysis & visualization: The cleaned, analysis-ready dataset is explored (EDA) in a Jupyter Notebook and visualized as an interactive report in Power BI.

Together, these steps form a simple, linear pipeline that moves the data from raw web pages to meaningful insights.
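The pipeline can be summarized schematically as follows. The function names are placeholders standing in for the scraper script and the ETL/analysis notebooks, not real modules from the repository.

```python
def scrape() -> None:
    """01_SCRAPER: collect raw book data from the website into a flat file."""
    ...

def run_etl() -> None:
    """02_ETL: clean and transform the raw file, then export it to 03_DATA."""
    ...

def analyze() -> None:
    """04_ANALYSIS: run EDA on the clean dataset; Power BI also reads 03_DATA."""
    ...

if __name__ == "__main__":
    # placeholder stages executed in pipeline order
    scrape()
    run_etl()
    analyze()
```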

Demo

The following illustration demonstrates the process of collecting data from the website through scraping:

Scraping Graphic

Access the interactive Power BI dashboard through the link below:

Power BI Dashboard

Support

If you have any questions, concerns, or suggestions, feel free to reach out to me through any of the following channels:

Linkedin Badge Twitter Badge

If you find my work valuable, you can show your appreciation by buying me a coffee

buy-me-a-coffee

License

by-nc-sa

This license allows reusers to distribute, remix, adapt, and build upon the material in any medium or format for noncommercial purposes only, and only so long as attribution is given to the creator. If you remix, adapt, or build upon the material, you must license the modified material under identical terms.
