# Transformational AI: Analyzing Large Ecommerce Datasets for Machine Learning

Welcome to your journey as a data scientist at Transformational AI, where you will play a pivotal role in shaping the future of our company's machine learning models. In this project, you will leverage two Amazon Reviews 2023 datasets, originally sourced from Hugging Face, to perform data transformation, analysis, and visualization. Through this notebook, you will gain hands-on experience with pandas DataFrames and explore powerful data visualization techniques using `seaborn` and `matplotlib`. You will also learn to export your cleaned and processed data into `.csv` and `.parquet` formats, ready for advanced machine learning tasks. Get ready to dive into the world of data science and make impactful contributions to our machine learning team by preparing top-quality datasets for model fine-tuning and customer satisfaction analysis.

Do not worry if this sounds like a lot right now. Transformational AI makes no mistakes, and we hired you for a reason. We will take this one step at a time, and you have our full support every step of the way. We will be using a lot of advanced terminology, which will add to your Python knowledge while reinforcing core concepts that you will need to know for your upcoming PCEP exam.

We know you can do it and we cannot wait to see what you create. Let’s get started!

# Part 1: Preparing the Environment

## Part 1.1: Learn about Jupyter Notebooks & Python Virtual Environments

When working with Python for a company, organization, or personal project, it's essential to initialize your Python environment properly. Remember from the course the concept of a Python "virtual environment." This isolated environment allows you to write functions, store variables, work with data, and perform many other operations without interfering with other projects or the system Python environment.

In this step, we will set up and use the Python environment within Jupyter Notebook, which is an interactive computing environment. This will assist us in our tasks at Transformational AI, enabling us to efficiently manage and analyze data, write and test code, and visualize results.

If you remember from the course, Python is a versioned language, and software packages are versioned as well. Let's check which version of Python is installed in our Jupyter Notebook environment, and which version of Jupyter Notebook we are using.

Run the following commands to check these values:

In [None]:
# Check which version of Python is installed
!python --version

# Check which version of Jupyter Notebook you are using
!jupyter-notebook --version
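
The same Python version information can also be read from inside Python itself, which helps when a shell command is not available. This is a minimal sketch using only the standard library; the variable names are illustrative.

```python
import platform
import sys

# Full version string of the interpreter running this notebook kernel
python_version = platform.python_version()
print(f"Python version: {python_version}")

# sys.version_info exposes the parts as integers, which is handy for comparisons
major = sys.version_info.major
minor = sys.version_info.minor
print(f"Major: {major}, Minor: {minor}")
```

Comparing integers (for example `minor >= 9`) is safer than comparing version strings, because string comparison would rank "3.9" above "3.10".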

## Part 1.2: Set Up the Jupyter Notebook

To get started, we need to ensure that our Jupyter Notebook environment is equipped with the libraries this project needs. These libraries will help us manage data, create visualizations, and interact with our notebook more effectively.

For your work in this notebook, you will need the following packages:

- `datasets`: Provides access to a wide range of datasets, including the Amazon Reviews 2023 dataset we'll be using.
- `pandas`: A powerful data manipulation and analysis library that makes working with structured data easy with Python.
- `matplotlib`: A comprehensive library for creating static, animated, and interactive visualizations in Python.
- `seaborn`: Built on top of matplotlib, seaborn provides a high-level interface for drawing attractive and informative statistical graphics, great for visualizing trends from data.
- `ipywidgets`: Adds interactive widgets to your Jupyter notebooks, allowing for dynamic and interactive user interfaces. You will see various input fields throughout the notebook that leverage this package.

👉 Do not worry about memorizing these packages for the PCEP exam. However, as you continue to build upon your Python skills moving forward, these packages will be great tools to have in your toolbox.

Run a command to install these packages into the Jupyter Notebook environment:

In [None]:
#<---- YOUR CODE GOES HERE ---->

^ Note: If you see something that looks like an error message, such as `ERROR: pip's dependency resolver does not currently take into account all the packages that are installed....`, do not worry. As long as you see a line beginning with `Successfully installed...` printed at the end, the packages were installed successfully.

#### Solution

In [None]:
!pip install datasets pandas matplotlib seaborn ipywidgets

# Part 2: Download the Reviews Dataset:

## Part 2.1: Learn about Working with Big Data

In this part, we will load the `Electronics` dataset using the Hugging Face datasets library into our Jupyter Notebook.

### Working with large datasets:

NOTE: We are working with a **BIG dataset**. If we had all the time in the world, we could load the entire **22.6 GB** into the notebook. However, we are in a time crunch and don't have time to use the entire dataset. We need to understand very quickly which products are the best, so let's take a subset of the dataset to use. Do not worry if the download appears to be going slowly; this is just part of data science. Consider going to get a cup of your favorite warm beverage ☕️ and coming back when it's done downloading.

### Exploring DataFrames:

To understand how to work with data in Python, `pandas` is a fantastic library that is widely used and very popular. In your work at Transformational AI, you will be using `pandas` to analyze the data in a `DataFrame`.

You can think of `DataFrames` as a data structure packet that lets you work with data that could be in different formats or was from many different places, and has then been consolidated into a Pythonic, easy-to-work-with, structure. A `DataFrame` lets us do basic operations like viewing data, checking data types, and basic statistics, all within our Jupyter Notebook. How cool, right? 😎
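
To make the idea concrete, here is a minimal sketch of those basic operations on a tiny, made-up DataFrame. The column names and values are illustrative assumptions, not data from the Amazon dataset.

```python
import pandas as pd

# Hypothetical toy data, not taken from the Amazon Reviews dataset
toy_df = pd.DataFrame(
    {
        "title": ["USB cable", "Headphones", "Tripod"],
        "rating": [4.5, 3.0, 5.0],
        "verified": [True, False, True],
    }
)

print(toy_df)                    # view the data
print(toy_df.dtypes)             # check the data type of each column
print(toy_df["rating"].mean())   # a basic statistic on one column
```

Each column holds one data type, which is why `dtypes` reports a type per column rather than per cell.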

## Part 2.2: Download and visualize the reviews dataset with `pandas`

- For your first task at Transformational AI, we will be utilizing the `Electronics` dataset in order to help our machine learning engineers better understand which types of products are producing the strongest customer engagement and attention.

- While we could guess what products people like, Transformational AI is a data-driven company first and foremost. We will be relying on you to make data-informed decisions about which products are the most valuable for customers. The data you supply to the team will be very important for training their models, so we will need to work quickly and diligently to make sure you can supply the ML team with the best data possible.

- When you work with software packages in Python, they will often come with a suite of options, or parameters, to tell the underlying core functions what to do. Sometimes, these inputs will be strings, booleans, integers, floats, or other types of data type inputs.

- First, you will be loading into your Jupyter Notebook an Amazon Reviews dataset from 2023. In order to move quickly, we will only be pulling the first 100 items. After all, we don't have all day and have a very important deadline to meet this week. Speed is of the essence.
- You will load the data incrementally into a `pandas` DataFrame that will be set with a hard limit of 100 items in a for loop.
- Additionally, the `load_dataset` function call will need edits to supply the right parameters, in this case, boolean values for `streaming` and `trust_remote_code`.
  - You do not need to know what these parameters are doing behind the scenes, but the boolean values will tell the `datasets` package what to do - a great use of booleans here.
- Afterwards, we will then print the table of outputs as a rich HTML table visualization.

#### Steps to Complete:
- [ ] Set the sample_limit to an integer value of 100 so that no more than 100 data items will be collected from the dataset
- [ ] Create a for loop like you learned about in class so that you collect no more than 100 data samples
- [ ] Use boolean values to set the missing `load_dataset` parameters to `True`.

In [None]:
from datasets import load_dataset
import pandas as pd

# Load the reviews dataset
reviews_dataset = load_dataset(
    "McAuley-Lab/Amazon-Reviews-2023",
    "raw_review_Electronics",
    split="full",
    streaming=#<---- YOUR CODE GOES HERE ---->,
    trust_remote_code=#<---- YOUR CODE GOES HERE ---->
)

# Initialize an empty list to collect review samples
reviews = []

# Limit the number of data samples collected
sample_limit = #<---- YOUR CODE GOES HERE ---->

# Iterate over the reviews dataset and collect samples
for #<---- YOUR CODE GOES HERE ---->
    if #<---- YOUR CODE GOES HERE ---->
        break
    reviews.append(#<---- YOUR CODE GOES HERE ---->)

# Convert the list of review samples to a pandas DataFrame
reviews_df = pd.DataFrame(reviews)

# Print the dataframe
reviews_df

#### Solution

In [None]:
from datasets import load_dataset
import pandas as pd

# Load the reviews dataset
reviews_dataset = load_dataset("McAuley-Lab/Amazon-Reviews-2023", "raw_review_Electronics", split="full", streaming=True, trust_remote_code=True)

# Initialize an empty list to collect review samples
reviews = []

# Limit to 100 samples
sample_limit = 100

# Iterate over the reviews dataset and collect samples
for i, sample in enumerate(reviews_dataset):
    if i >= sample_limit:
        break
    reviews.append(sample)

# Convert the list of review samples to a pandas DataFrame
reviews_df = pd.DataFrame(reviews)

# Print the dataframe
reviews_df
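
As an alternative to counting with `enumerate`, the standard library's `itertools.islice` can cap how many items are taken from any iterable. The sketch below uses a hypothetical generator, `fake_review_stream`, as a stand-in for the streaming dataset so that it runs without a download; the record fields are invented for illustration.

```python
from itertools import islice


def fake_review_stream():
    """Hypothetical stand-in for a streaming dataset: yields records endlessly."""
    n = 0
    while True:
        yield {"rating": float(n % 5 + 1), "title": f"Review {n}"}
        n += 1


sample_limit = 100

# islice stops after sample_limit items, so the endless stream is never exhausted
reviews_sample = list(islice(fake_review_stream(), sample_limit))

print(len(reviews_sample))
print(reviews_sample[0])
```

Both approaches read only as many records as requested, which is the point of streaming a large dataset instead of downloading it in full.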

## Let's Reflect...

In [None]:
#@title 💭 Question to Answer:
#@markdown Describe the data that you see. What are the columns, what data types do you see, and how are the reviews structured?
Question1 = '' #@param {type:"string"}

if len(Question1) > 0:
    print(f"Question answered! You wrote: {Question1}")
else:
    print("Please answer the question")

## Part 2.3: Analyze the reviews dataset with `.head()` from `pandas`
- In the above cell, when we wrote `reviews_df` as the last line, the notebook displayed the Amazon review data we downloaded as a rich HTML table. Jupyter notebooks have many great data visualization tools, such as rendering a DataFrame as a rich HTML table that is readable and visually appealing. The table shows the first few and last few items in the DataFrame, making it easy to scan and work with.

- Sometimes this is sufficient, especially to get a nice visualization, but at other times you will need to leverage `pandas` a bit more, such as quickly gathering the first few items of the DataFrame with a method called `.head()`, which you will do in the subsequent cells.

- The `.head()` method from `pandas` returns the first few rows of the DataFrame (5 by default). Wrapping it in `print()` shows those rows in a plain text format, which can be extremely useful for quick inspection without overwhelming the display if you have a lot of columns or information to work with.

Let's try out this new data inspection method!

In [None]:
print(reviews_df.head())
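
It can help to see that `.head()` hands back a smaller DataFrame rather than printing anything itself. Below is a minimal sketch on a made-up DataFrame of 20 numbers; `.tail()` is the matching method for the end of the data.

```python
import pandas as pd

numbers_df = pd.DataFrame({"n": range(20)})

first_five = numbers_df.head()     # default is 5 rows
first_three = numbers_df.head(3)   # explicit row count
last_two = numbers_df.tail(2)      # counterpart for the end of the DataFrame

print(type(first_five).__name__)
print(len(first_five), len(first_three), len(last_two))
```

Because the result is itself a DataFrame, you can keep working with it, for example by selecting a column from it.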

## Let's Reflect...

In [None]:
#@title 💭 Question to Answer:
#@markdown What do you think of this output above - was this what you expected? Why or why not?
Question2 = '' #@param {type:"string"}

if len(Question2) > 0:
    print(f"Question answered! You wrote: {Question2}")
else:
    print("Please answer the question")

## Part 2.4 - Transform the DataFrame to be more visually readable

As you work with data at Transformational AI, you will see that what we get from another team, department, or company will not always be up to our standards. `pandas` wrapped the columns across several output blocks. The row labels 0-4 show that it returned the first 5 items in our dataset, which is what we would expect. However, wouldn't it be easier to see all the data for an item on the same line?

Let's set some options for how `pandas` will output our DataFrame. Remember from the course that you can set values, or options, for various structures. The display of a `pandas` DataFrame is one of those things we can control, with options such as:
- `max_columns`
- `width`
- `max_colwidth`
- and so on...

#### Steps to Complete:
- [ ] Set the `Items_To_Print` value to an integer of `10` so that we output the first 10 items of the DataFrame, overriding the `.head()` default of 5.

In [None]:
import pandas as pd

Items_To_Print = #<---- YOUR CODE GOES HERE ---->

# Adjust display options for better readability
pd.set_option('display.max_columns', None)  # Display all columns
pd.set_option('display.width', 1000)        # Set a larger display width
pd.set_option('display.max_colwidth', 50)   # Set max column width for better readability

# Print the first Items_To_Print rows of the reviews DataFrame
print(reviews_df.head(Items_To_Print))

#### Solution

In [None]:
Items_To_Print = 10
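
If you only want wider output for a single cell, `pd.option_context` applies display options temporarily and restores the previous values afterwards. This is a small sketch on a made-up wide DataFrame; the names `wide_df`, `inside_value`, and `after_value` are illustrative.

```python
import pandas as pd

# Hypothetical DataFrame with 30 columns
wide_df = pd.DataFrame({f"col_{i}": range(3) for i in range(30)})

default_max_columns = pd.get_option("display.max_columns")

# Options set here apply only inside the with-block
with pd.option_context("display.max_columns", None, "display.width", 1000):
    inside_value = pd.get_option("display.max_columns")
    print(wide_df.head(2))

# After the block, the previous setting is back in effect
after_value = pd.get_option("display.max_columns")
print(after_value == default_max_columns)
```

`pd.set_option` is the better fit when the setting should last for the rest of the notebook session.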

## Let's Reflect...

In [None]:
#@title 💭 Question to Answer:
#@markdown Describe what we did to the pandas DataFrame in order to change how the outputs are reflected
Question3 = '' #@param {type:"string"}

# Confirm that an answer was entered
if len(Question3) > 0:
    print(f"Question answered! You wrote: {Question3}")
else:
    print("Please answer the question")

# Part 3: Download the Reviews Metadata Dataset:

In this next part, you will be reviewing the metadata associated with reviews. This will give us more topical information about the actual products we want to analyze.

## Part 3.1: Download and visualize the reviews metadata dataset with `pandas`

We will be working with a different dataset, specifically scoped on the metadata around the reviews. This will give us more numeric data about the products the machine learning team at Transformational AI wants to know more about.

#### Steps to Complete:
- [ ] Set the metadata_limit to an integer value of 100 so that no more than 100 data items will be collected from the dataset
- [ ] Create a for loop like you learned about in class so that you collect no more than 100 data samples
- [ ] Use boolean values to set the missing `load_dataset` parameters to `True`.

In [None]:
from datasets import load_dataset
import pandas as pd

# Load the item metadata dataset
item_metadata_dataset = load_dataset(
    "McAuley-Lab/Amazon-Reviews-2023",
    "raw_meta_Electronics",
    split="full",
    streaming=#<---- YOUR CODE GOES HERE ---->,
    trust_remote_code=#<---- YOUR CODE GOES HERE ---->
)

# Initialize an empty list to collect item metadata samples
metadata = []

# Limit to 100 samples
metadata_limit = #<---- YOUR CODE GOES HERE ---->

# Iterate over the metadata dataset and collect samples
for #<---- YOUR CODE GOES HERE ---->
    if #<---- YOUR CODE GOES HERE ---->
        break
    metadata.append(#<---- YOUR CODE GOES HERE ---->)

# Convert the list of item metadata samples to a pandas DataFrame
item_metadata_df = pd.DataFrame(metadata)

# Print the dataframe
item_metadata_df

#### Solution

In [None]:
from datasets import load_dataset
import pandas as pd

# Load the item metadata dataset
item_metadata_dataset = load_dataset("McAuley-Lab/Amazon-Reviews-2023", "raw_meta_Electronics", split="full", streaming=True, trust_remote_code=True)

# Initialize an empty list to collect item metadata samples
metadata = []

# Limit to 100 samples
metadata_limit = 100

# Iterate over the metadata dataset and collect samples
for i, sample in enumerate(item_metadata_dataset):
    if i >= metadata_limit:
        break
    metadata.append(sample)

# Convert the list of item metadata samples to a pandas DataFrame
item_metadata_df = pd.DataFrame(metadata)

# Print the dataframe
item_metadata_df

## Let's Reflect...

In [None]:
#@title 💭 Question to Answer:
#@markdown Compare the Reviews Metadata dataset (the dataset immediately above this cell) and the Reviews dataset that you analyzed earlier. What are the differences between the 2 datasets?
Question4 = '' #@param {type:"string"}

if len(Question4) > 0:
    print(f"Question answered! You wrote: {Question4}")
else:
    print("Please answer the question")

## Part 3.2: Analyze the reviews metadata dataset with .head() from pandas

We will now use the `.head()` method again to analyze the `pandas` dataframe.

In [None]:
print(item_metadata_df.head(10))

^ Did you notice that when we used the `.head()` method this time, our `pandas` output preferences were still in effect? This is a benefit of the notebook environment we are using: once variables or settings are declared, they are preserved for the rest of the session, so we can reference them again later. This will save you lots of time as you continue your work here at Transformational AI.

## Part 3.3: Safeguarding Data Access with Large Datasets

Now that you are working with DataFrames and large datasets in Python, let's analyze our data a bit before moving on to the next section. Data analysis usually involves reviewing, investigating, or looking up specific information within the larger data context. However, this can lead to errors, especially if the data isn't formatted correctly, if data is missing, or if outputs aren't always consistent. As we work with Python, we will want to ensure our work remains robust and error-free at Transformational AI.

Let's look at how to search for product prices and discover more about the `item_metadata_df` data that we imported into our Jupyter Notebook. We will need a function that handles missing product information, so that the lookup and indexing errors that can arise are managed gracefully.

#### Steps to Complete:
- [ ] Review the items with and without a price
- [ ] Finish the `get_product` lookup function by handling the exception case where the product title or price is not found using try-except blocks


In [None]:
import pandas as pd

# Ensure the 'price' column is converted to numeric, setting errors='coerce' to handle non-numeric values
item_metadata_df['price'] = pd.to_numeric(item_metadata_df['price'], errors='coerce')

# Identify products with 'None' or non-float prices
none_price_products = item_metadata_df[item_metadata_df['price'].isna()][['title', 'price']]
valid_price_products = item_metadata_df[item_metadata_df['price'].notna()][['title', 'price']]

# Print the list of products with 'None' or non-float prices
print(f"❌ Products with no valid price listed (Total: {len(none_price_products)})")
for i, row in enumerate(none_price_products.itertuples(index=False), start=1):
    print(f"{i}. {row.title} - Price: {row.price}")

# Print the list of products with valid prices
print(f"\n✅ Products with valid price listed (Total: {len(valid_price_products)})")
for i, row in enumerate(valid_price_products.itertuples(index=False), start=1):
    print(f"{i}. {row.title} - Price: {row.price}")
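
To see what `errors='coerce'` does in isolation, here is a minimal sketch on a made-up Series of price strings. The values are assumptions chosen to cover the typical cases: a clean number, the text "None", a real missing value, and free text.

```python
import pandas as pd

# Hypothetical raw prices as they might arrive from a source file
raw_prices = pd.Series(["19.99", "None", None, "from 5.00", "7"])

# Anything that cannot be parsed as a number becomes NaN instead of raising
clean_prices = pd.to_numeric(raw_prices, errors="coerce")

print(clean_prices)
print(clean_prices.isna().sum())
```

Without `errors="coerce"`, the first unparseable value would raise an exception and stop the conversion.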

### Create a product lookup function with exception handling

In [None]:
# Function to get the price of a product by its title
def get_product(product_title):
    try:
        # Attempt to find the price of the product
        price = item_metadata_df.loc[item_metadata_df['title'] == product_title, 'price'].values[0]
        if pd.isna(price):
            return "❌ No valid price listed"
        return price
    except IndexError:
        # Handle the case where the product title is not found
        #<---- YOUR CODE GOES HERE ---->

#### Solution

In [None]:
# Function to get the price of a product by its title
def get_product(product_title):
    try:
        # Attempt to find the price of the product
        price = item_metadata_df.loc[item_metadata_df['title'] == product_title, 'price'].values[0]
        if pd.isna(price):
            return "❌ No valid price listed"
        return price
    except IndexError:
        # Handle the case where the product title is not found
        return "❌ Product not found"

### Test the product lookup function (known product title)
Check a specific item (with a known product title) in the DataFrame with a lookup:

In [None]:
known_title = "FS-1051 FATSHARK TELEPORTER V3 HEADSET"
print(f"\nPrice of '{known_title}': {get_product(known_title)}")

### Test the product lookup function (unknown product title)

In [None]:
unknown_title = "Smart Phone Tripod Holder (Mobile Version)"
print(f"Price of '{unknown_title}': {get_product(unknown_title)}")

Above, you can see how we can handle data access errors gracefully. By implementing exception handling in the `get_product` function, you can manage cases where the product title is not found or the price is missing, ensuring your code is robust and reliable.
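
To see all three outcomes side by side without depending on which products were downloaded, here is a small sketch with a made-up catalog. `get_product_price` is a hypothetical variant of `get_product` that takes the DataFrame as an argument, and the product names are invented.

```python
import pandas as pd

# Hypothetical catalog: one priced product and one with a missing price
catalog_df = pd.DataFrame(
    {"title": ["Widget", "Gadget"], "price": [9.99, float("nan")]}
)


def get_product_price(df, product_title):
    """Look up a price by title, handling missing titles and missing prices."""
    try:
        price = df.loc[df["title"] == product_title, "price"].values[0]
    except IndexError:
        # No row matched the title, so .values was empty
        return "Product not found"
    if pd.isna(price):
        return "No valid price listed"
    return price


print(get_product_price(catalog_df, "Widget"))
print(get_product_price(catalog_df, "Gadget"))
print(get_product_price(catalog_df, "Doohickey"))
```

Keeping only the risky indexing line inside the `try` block makes it clear which operation the `except IndexError` clause is guarding.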

# Part 4: Comparing our datasets together

To get a better sense of these two datasets we just downloaded, let's review their columns. This will give us a better sense of the data structures involved, data types, and how we can use the datasets for which purposes.

## Part 4.1 Review the columns of the datasets with `.columns`:

In order to analyze the datasets quickly, we can leverage our `pandas` DataFrames in several ways, such as looking at the `columns` in each dataset.

#### STEPS TO COMPLETE:
Place the right variable in the right spot:
- [ ] `reviews_df.columns`
- [ ] `item_metadata_df.columns`

In [None]:
print("Reviews DataFrame columns:", #<---- YOUR CODE GOES HERE ---->)
print("-------------------")
print("Metadata DataFrame columns:", #<---- YOUR CODE GOES HERE ---->)

#### Solution:

In [None]:
print("Reviews DataFrame columns:", reviews_df.columns)
print("Metadata DataFrame columns:", item_metadata_df.columns)

The above is helpful, but it is a bit messy to read. Let's use our new skills to create a for loop to iterate over the items so that they can be outputted as a list.

#### STEPS TO COMPLETE:
- [ ] Update the for loops to output the right information

In [None]:
print("Reviews DataFrame columns:")
for #<---- YOUR CODE GOES HERE ---->
    print(f"- {#<---- YOUR CODE GOES HERE ---->}")

print("-------------------")

print("Metadata DataFrame columns:")
for #<---- YOUR CODE GOES HERE ---->
    print(f"- {#<---- YOUR CODE GOES HERE ---->}")

#### Solution:

In [None]:
# Print column names to debug
print("Reviews DataFrame columns:")
for column in reviews_df.columns:
    print(f"- {column}")

print("-------------------")

print("Metadata DataFrame columns:")
for column in item_metadata_df.columns:
    print(f"- {column}")
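
Since the goal of this part is to compare the two datasets, Python sets offer a quick way to find which columns are shared and which are unique to each. This is a sketch on two empty, made-up DataFrames; the column names are illustrative assumptions rather than the actual dataset schema.

```python
import pandas as pd

# Hypothetical DataFrames that only define column names
left_df = pd.DataFrame(columns=["title", "rating", "parent_asin"])
right_df = pd.DataFrame(columns=["title", "price", "parent_asin"])

left_columns = set(left_df.columns)
right_columns = set(right_df.columns)

shared = sorted(left_columns & right_columns)       # in both
only_left = sorted(left_columns - right_columns)    # only in the first
only_right = sorted(right_columns - left_columns)   # only in the second

print("Shared:", shared)
print("Only in left:", only_left)
print("Only in right:", only_right)
```

Shared columns are natural candidates for joining two datasets later on.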

## Part 4.2 Review the data types for each dataset with `.dtypes`

Next, let's analyze what the data type is for each column. If you remember from the course, data types can be text (strings), numbers (integers, floats, etc.), and so on. A value's type determines which operations and comparisons make sense for it.

For our purposes, we want to get a better sense of the data that we pulled into our environment for the Machine Learning team to later use.

In [None]:
# Print column names and their data types using .dtypes
print("Reviews DataFrame columns and data types:")
print(reviews_df.dtypes)

print("-------------------")

print("Metadata DataFrame columns and data types:")
print(item_metadata_df.dtypes)

The above is helpful, but let's use a for loop to iterate over the items so that they can be outputted as a list:

In [None]:
# Print column names and their data types
print("Reviews DataFrame columns and data types:")
for column in reviews_df.columns:
    print(f"- {column}: {reviews_df[column].dtype}")

print("-------------------")

print("Metadata DataFrame columns and data types:")
for column in item_metadata_df.columns:
    print(f"- {column}: {item_metadata_df[column].dtype}")

## Part 4.3 Create a concise summary of a DataFrame with `.info()`

While you were analyzing your data, your manager messages you asking for a quick summary of both datasets within the next 15 minutes. Scrambling to figure out what to do, you remember that you can use the `.info()` method from `pandas` to get your manager exactly what they need.

- If you remember from class, a `method` in Python is a function that is associated with an object. `.info()` is a method of the pandas DataFrame class: when you call `something_df.info()`, you are invoking the `info` method on the `something_df` object, which is an instance of a DataFrame.
- Also, if you recall from class, the act of calling a method (like `.info()`) on an object (like `something_df`) is known as method invocation, which is fundamental to how Python works as an object-oriented programming language.

#### STEPS TO COMPLETE:
- [ ] Print the info for the `reviews_df` DataFrame
- [ ] Print the info for the `item_metadata_df` DataFrame

In [None]:
# Print DataFrame info which includes data types
print("Reviews DataFrame info:")
#<---- YOUR CODE GOES HERE ---->

print("-------------------")

print("Metadata DataFrame info:")
#<---- YOUR CODE GOES HERE ---->

#### Solution

In [None]:
# Print DataFrame info which includes data types
print("Reviews DataFrame info:")
reviews_df.info()

print("-------------------")

print("Metadata DataFrame info:")
item_metadata_df.info()
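
One detail worth knowing about this method: `.info()` prints its summary and returns `None`, which is why it is called on its own line rather than wrapped in `print()`. The sketch below shows this on a made-up DataFrame and also captures the summary as text through the `buf` parameter, which is useful if you want to send it to someone.

```python
import io

import pandas as pd

# Hypothetical DataFrame with a few missing values
small_df = pd.DataFrame({"title": ["A", "B", None], "price": [1.0, None, 3.0]})

result = small_df.info()   # prints the summary; the method itself returns None
print(result)

# Write the same summary into an in-memory text buffer instead of the screen
buffer = io.StringIO()
small_df.info(buf=buffer)
summary_text = buffer.getvalue()
print(summary_text)
```

The non-null counts in the summary are a fast way to spot columns with missing data.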

## ^ Phew! 😅 That was a close call.
But fortunately for us, `pandas` comes with a lot of out-of-the-box utilities for data analysis and data visualization. We might need to tweak or fine-tune them for our use cases, but without too much headache. Let's get back to the work at hand.

Great work!!

## Let's Reflect...

In [None]:
#@title 💭 Question to Answer:
#@markdown Which data analysis method did you find to be the most insightful or helpful?
Question5 = '' #@param {type:"string"}

if len(Question5) > 0:
    print(f"Question answered! You wrote: {Question5}")
else:
    print("Please answer the question")

# Part 5: Preparing the data for the Machine Learning Team

Now that we are comfortable with our datasets and have a good sense of what we can leverage, the Machine Learning team has just informed you that they need to know what the top products are to train their advanced machine learning model on.

You will be further cleaning and transforming the data so that you can create an informed, data-driven decision about what products are most important to the team.

So where would we start? Let's assume that we need a threshold value in order to separate what makes a good product from a great product. If you remember from earlier, the metadata dataset includes these columns:

- `title`: we need to know the product
- `average_rating`: we need to know if the product is at or above the threshold we set (4.5 or above)
- `price`: we need a real monetary value. Notice how it can sometimes be `None`. We need a number with a decimal (0.01 and above)


## Part 5.1: Clean the data for the requested parameters

We will create a new DataFrame called `top_products_df` with these parameters:
- has a title
- the average rating is greater than or equal to 4.5
- there is a value for the price

#### STEPS TO COMPLETE:
- [ ] Specify the 3 columns we want in the new dataframe `item_metadata_cleaned_df`
- [ ] Set the `average_rating_filter` variable to the correct float data type that is sought after in the instructions above

In [None]:
# Select relevant columns; .copy() avoids a SettingWithCopyWarning on the assignment below
item_metadata_cleaned_df = item_metadata_df[[#<---- YOUR CODE GOES HERE ---->]].copy()
average_rating_filter = #<---- YOUR CODE GOES HERE ---->

# Convert the price column to numeric, setting errors='coerce' to handle non-numeric values
item_metadata_cleaned_df['price'] = pd.to_numeric(item_metadata_cleaned_df['price'], errors='coerce')

# Filter to get only products with average rating set to the filter, valid price, and non-null title
top_products_df = item_metadata_cleaned_df[
    (item_metadata_cleaned_df['average_rating'] >= average_rating_filter) &
    (item_metadata_cleaned_df['price'].notna()) &
    (item_metadata_cleaned_df['title'].notna())
]

# Print the new cleaned DataFrame
top_products_df

#### Solution

In [None]:
# Select relevant columns; .copy() avoids a SettingWithCopyWarning on the assignment below
item_metadata_cleaned_df = item_metadata_df[['title', 'average_rating', 'price']].copy()
average_rating_filter = 4.5

# Convert the price column to numeric, setting errors='coerce' to handle non-numeric values
item_metadata_cleaned_df['price'] = pd.to_numeric(item_metadata_cleaned_df['price'], errors='coerce')

# Filter to get only products with average rating set to the filter, valid price, and non-null title
top_products_df = item_metadata_cleaned_df[
    (item_metadata_cleaned_df['average_rating'] >= average_rating_filter) &
    (item_metadata_cleaned_df['price'].notna()) &
    (item_metadata_cleaned_df['title'].notna())
]

# Print the new cleaned DataFrame
top_products_df
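
The filter above combines three conditions with `&`. To show how each condition contributes, here is a sketch on a made-up four-row DataFrame where each condition is stored as its own boolean mask first. All names and values are illustrative assumptions.

```python
import pandas as pd

# Hypothetical products: each row fails or passes a different condition
products_df = pd.DataFrame(
    {
        "title": ["A", "B", None, "D"],
        "average_rating": [4.8, 4.2, 4.9, 4.5],
        "price": ["10.00", "5.00", "7.50", None],
        "store": ["x", "y", "z", "w"],
    }
)

# .copy() gives an independent DataFrame that is safe to modify
subset_df = products_df[["title", "average_rating", "price"]].copy()
subset_df["price"] = pd.to_numeric(subset_df["price"], errors="coerce")

# One boolean mask per condition
rating_ok = subset_df["average_rating"] >= 4.5
price_ok = subset_df["price"].notna()
title_ok = subset_df["title"].notna()

# A row is kept only if all three masks are True for it
filtered_df = subset_df[rating_ok & price_ok & title_ok]
print(filtered_df)
```

Naming the masks makes it easy to check how many rows each condition removes, for example with `rating_ok.sum()`.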

## Part 5.2: Analyze the data using pandas

Let's find out how many products are in this new cleaned dataset.

In [None]:
# Review the items of the original dataframe
num_items_original = item_metadata_df.shape[0]
print(f"Number of items in the original DataFrame: {num_items_original}")

# Review the new items in cleaned dataframe
num_items_cleaned = top_products_df.shape[0]
print(f"Number of items in the cleaned DataFrame: {num_items_cleaned}")

## Part 5.3: Calculate the Percentage of Cleaned Data

You should see that the number of items in the cleaned DataFrame is much smaller than the number of items in the original DataFrame. Let's calculate the percentage of items in the cleaned DataFrame relative to the original DataFrame.

#### STEPS TO COMPLETE:
- [ ] Complete the function `calculate_percentage` to perform this calculation. Use your knowledge from the course to write a part-of-whole division equation.
- [ ] Ensure to use the `return` keyword so that your `calculate_percentage` function returns the result.

In [None]:
# Input from user
part = float(input("Enter the number of items in the cleaned DataFrame: "))
whole = float(input("Enter the number of items in the original DataFrame: "))

# Function to calculate percentage
def calculate_percentage(part, whole):
    if whole == 0:
        raise ValueError("The whole value cannot be zero.")

    # Calculate the percentage
    percentage = #<---- YOUR CODE GOES HERE ---->
    return #<---- YOUR CODE GOES HERE ---->

# Calculate and print the percentage
try:
    result = calculate_percentage(part, whole)
    print(f"{part} is {result:.2f}% of {whole}")
except ValueError as e:
    print(e)

#### Solution:

In [None]:
# Function to calculate percentage
def calculate_percentage(part, whole):
    if whole == 0:
        raise ValueError("The whole value cannot be zero.")

    # Calculate the percentage
    percentage = (part / whole) * 100
    return percentage
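
The function can also be called with counts directly, without typing them in through `input()`. In this sketch the two counts are made-up numbers standing in for `top_products_df.shape[0]` and `item_metadata_df.shape[0]`.

```python
def calculate_percentage(part, whole):
    """Return what percentage `part` is of `whole`."""
    if whole == 0:
        raise ValueError("The whole value cannot be zero.")
    return (part / whole) * 100


# Hypothetical counts used for illustration only
num_items_cleaned = 25
num_items_original = 100

result = calculate_percentage(num_items_cleaned, num_items_original)
print(f"{num_items_cleaned} is {result:.2f}% of {num_items_original}")

# The zero check turns a confusing ZeroDivisionError into a clear message
try:
    calculate_percentage(5, 0)
except ValueError as error:
    print(error)
```

Passing the counts as variables avoids typos from manual entry and keeps the result in sync if the data changes.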

## Part 5.4: Package the dataset up for analysis (.csv)

In order to save our new dataset, let's package it up into a `.csv` file so that your team at Transformational AI can analyze the outputs you found.

In [None]:
# Save the DataFrame as a CSV file
top_products_df.to_csv('top_products.csv', index=False)

# Verify that the file is saved
!ls -lh top_products.csv
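
The `!ls` command works on Linux and macOS. A portable way to verify an export is to check the file from Python and read it back in. This is a sketch on a made-up DataFrame, written to a temporary folder so it does not touch your real files.

```python
import os
import tempfile

import pandas as pd

# Hypothetical data standing in for top_products_df
export_df = pd.DataFrame(
    {"title": ["A", "B"], "average_rating": [4.6, 4.9], "price": [10.0, 5.5]}
)

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "top_products.csv")
    export_df.to_csv(csv_path, index=False)

    # Confirm the file exists and is not empty
    file_size = os.path.getsize(csv_path)

    # Read the file back to confirm the contents survived the round trip
    loaded_df = pd.read_csv(csv_path)

print(f"File size in bytes: {file_size}")
print(loaded_df)
```

Reading the file back is a stronger check than listing it, because it confirms the columns and values are what you expect.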

## Part 5.5: Package the dataset up for analysis (.parquet)

You might find that various team members need data saved in particular ways. For example, depending on how data is coordinated or analyzed, `.csv` might not cut it. Many teams are moving towards more compact storage, which can mean saving data in other formats.

Luckily, your team member at Transformational AI gave you a heads up about the team's love for the `.parquet` format. In order to impress your team and new manager, you have just enough time to transform your data outputs into this format.

In [None]:
# Ensure pyarrow or fastparquet is installed
!pip install pyarrow

In [None]:
# Save the DataFrame as a Parquet file
top_products_df.to_parquet('top_products.parquet', index=False)

# Verify that the file is saved
!ls -lh top_products.parquet

Really great work! 🥳 You'll now be able to send both files along to the teams: one for easy visualization (`.csv`), and one for easy data use (`.parquet`).

## Part 5.6: Collect the Top Rated Product Titles

When working with large datasets like the reviews datasets we are using, we might need to handle the data with lists and exceptions. These are fundamental Python skills that are important to know and have in your toolbox.

- Lists are a type of data structure that lets us store multiple items in a single variable.
- To work with them, we will use the `append` method to collect the data and "add" it to a single variable to reference it in our work.
- In this case, we will be pulling out the titles from our `highly_rated_products_df` DataFrame and storing those in our list, which we will create and name `top_rated_titles`.
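
As a quick warm-up, here is a minimal sketch of how `append` grows a list one item at a time. The ratings and variable names are made up for illustration and are not part of the reviews dataset.

```python
# Hypothetical ratings, used only for illustration
sample_ratings = [4.9, 3.2, 4.7, 2.8]

# Start with an empty list
high_ratings = []

# Add an item to the end of the list each time the condition is met
for rating in sample_ratings:
    if rating >= 4.5:
        high_ratings.append(rating)

print(high_ratings)       # [4.9, 4.7]
print(len(high_ratings))  # 2
```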

#### STEPS TO COMPLETE:
- [ ] Create an empty list called `top_rated_titles`.
- [ ] Create a for loop to iterate over the `highly_rated_products_df` DataFrame and then use the `.append` method for the titles to add them to the `top_rated_titles` list
- [ ] Print the results of the `top_rated_titles` list

In [None]:
# Initialize an empty list to collect top-rated product titles
#<---- YOUR CODE GOES HERE ---->

# Create a for loop to iterate over the highly_rated_products_df DataFrame
# Use the `append` method for the titles to add them to the `top_rated_titles` list
#<---- YOUR CODE GOES HERE ---->

# Print the list of top-rated product titles
print("List of Top-Rated Product Titles:")
#<---- YOUR CODE GOES HERE ---->

### Solution:

In [None]:
# Initialize an empty list to collect top-rated product titles
top_rated_titles = []

# Create a for loop to iterate over the highly_rated_products_df DataFrame
# Use the `append` method for the titles to add them to the `top_rated_titles` list
for title in highly_rated_products_df['title']:
    top_rated_titles.append(title)

# Print the list of top-rated product titles
print("List of Top-Rated Product Titles:")
print(top_rated_titles)
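
The loop above is great practice for the PCEP exam. For comparison, the sketch below shows that the pandas `tolist` method produces the same list in one line, and how a `try`/`except` block can catch the `KeyError` raised when a column name is misspelled. The `sample_df` DataFrame and the `product_name` column are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for highly_rated_products_df
sample_df = pd.DataFrame({
    "title": ["Desk Lamp", "Water Bottle", "Yoga Mat"],
    "average_rating": [4.8, 4.7, 4.9],
})

# The loop-and-append approach
titles_from_loop = []
for title in sample_df["title"]:
    titles_from_loop.append(title)

# The pandas shortcut: convert the column directly to a list
titles_from_tolist = sample_df["title"].tolist()

print(titles_from_loop == titles_from_tolist)  # True

# Asking for a column that does not exist raises a KeyError
try:
    sample_df["product_name"]
except KeyError as e:
    print(f"Column not found: {e}")
```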

## Part 5.7: Transforming our List Data
You might notice that printing the list produces one very long line. This is because Python prints a list as its items separated by commas and enclosed in square brackets. That default is convenient when working in code, but it is hard to read, so let's make the list a bit more human readable.

Something we can do is iterate through the list and print each item in the list separately and add numbers or bullet points to each element.

Run the following cell to do this:

In [None]:
for i, title in enumerate(top_rated_titles, start=1):
    print(f"{i}. {title}")

^ This looks much better and easier to read. Great work!
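
If you prefer bullet points to numbers, one possible approach is the string `join` method, which glues the items together with a separator of your choice. The titles below are made up for illustration.

```python
# Hypothetical titles, used only for illustration
sample_titles = ["Desk Lamp", "Water Bottle", "Yoga Mat"]

# Put a dash in front of each title and join them with newlines
bulleted = "\n".join(f"- {title}" for title in sample_titles)
print(bulleted)
```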

# Part 6: Visualize the data for the Business Team

To visualize our data, we will be using a histogram. A histogram is a type of graph that groups data points into user-specified ranges (bins). Each bar represents the frequency (the count) of data points that fall within its range.

We will also be using a scatterplot, which shows the individual data points that the histogram groups together.
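
To see what "grouping into bins" means before drawing anything, the sketch below counts a handful of made-up prices using NumPy's `histogram` function. The prices and the choice of three bins are assumptions for illustration only.

```python
import numpy as np

# Hypothetical prices, used only for illustration
sample_prices = [5, 12, 18, 25, 33, 47, 52, 68, 95]

# Split the price range into 3 equal-width bins and count the items in each
counts, bin_edges = np.histogram(sample_prices, bins=3)

for count, left, right in zip(counts, bin_edges[:-1], bin_edges[1:]):
    print(f"${left:.0f} to ${right:.0f}: {count} products")
```

Each printed line corresponds to one bar in a histogram: the price range is the bar's position and the count is its height.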

## Part 6.1: Map the distribution of Prices for Highly Rated Products as a Histogram Chart
In this graph, we want to see how prices are distributed among highly rated products (those with an average rating of 4.5 or higher).

#### STEPS TO COMPLETE:
- [ ] Use the `top_products_df` dataframe for the histogram graph
- [ ] Use the `price` column from the `top_products_df`
- [ ] Use a boolean value to set the missing `sns.histplot` parameter `kde` to `True`.


In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Distribution of Prices
plt.figure(figsize=(10, 6))
sns.histplot(
    <---- YOUR CODE GOES HERE ----> ['<---- YOUR CODE GOES HERE ---->'],
    bins=30,
    kde=#<---- YOUR CODE GOES HERE ---->
)
plt.title('Distribution of Prices for Highly Rated Products')
plt.xlabel('Price')
plt.ylabel('Frequency')
plt.show()

#### Solution

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Distribution of Prices
plt.figure(figsize=(10, 6))
sns.histplot(top_products_df['price'], bins=30, kde=True)
plt.title('Distribution of Prices for Highly Rated Products')
plt.xlabel('Price')
plt.ylabel('Frequency')
plt.show()

## Let's Reflect...

In [None]:
#@title 💭 Question to Answer:
#@markdown What trend (if any) do you notice with price and frequency?
Question6 = '' #@param {type:"string"}

if len(Question6) > 0:
    print(f"Question answered! You wrote: {Question6}")
else:
    print("Please answer the question")

## Part 6.2: Map the distribution of Prices for Highly Rated Products as a Scatterplot Graph
In this graph, we want to see whether there is a trend between price and rating by plotting the highly rated products (those with an average rating of 4.5 or higher) as a scatterplot rather than a histogram.

#### STEPS TO COMPLETE:
- [ ] Use the `top_products_df` dataframe for the scatterplot graph
- [ ] Plot your graph with `price` on the x axis and `average_rating` on the y axis

In [None]:
# Average Rating vs. Price
plt.figure(figsize=(10, 6))
sns.scatterplot(data=<......>, x='<---- YOUR CODE GOES HERE ---->', y='<---- YOUR CODE GOES HERE ---->')
plt.title('Average Rating vs. Price')
plt.xlabel('Price')
plt.ylabel('Average Rating')
plt.show()

#### Solution:

In [None]:
# Average Rating vs. Price
plt.figure(figsize=(10, 6))
sns.scatterplot(data=top_products_df, x='price', y='average_rating')
plt.title('Average Rating vs. Price')
plt.xlabel('Price')
plt.ylabel('Average Rating')
plt.show()
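
A scatterplot shows a trend visually. If you want a single number to go with it, one option is the Pearson correlation, which pandas computes with the `corr` method. The sketch below uses a made-up `sample_df`; with the real data you would use `top_products_df`.

```python
import pandas as pd

# Hypothetical stand-in for top_products_df
sample_df = pd.DataFrame({
    "price": [9.99, 14.50, 22.00, 35.00, 49.99, 80.00],
    "average_rating": [4.5, 4.6, 4.8, 4.6, 4.7, 4.9],
})

# Values near 1 or -1 suggest a strong linear trend; values near 0 suggest little or none
correlation = sample_df["price"].corr(sample_df["average_rating"])
print(f"Pearson correlation between price and rating: {correlation:.2f}")
```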

## Let's Reflect...

In [None]:
#@title 💭 Question to Answer:
#@markdown What trend (if any) do you notice with price and rating?
Question7 = '' #@param {type:"string"}

if len(Question7) > 0:
    print(f"Question answered! You wrote: {Question7}")
else:
    print("Please answer the question")

## Part 6.3: Display the distribution of ratings and price together

In a histogram or distribution plot, "frequency" refers to the number of times a particular value or range of values appears in the dataset. In our case, we want to know how many highly rated products fall within each price range.

For business-facing teams, showing data in a couple of different ways can be quite helpful, so we will visualize these two graphs together for maximum impact.

In [None]:
# Create a Dashboard Layout
plt.figure(figsize=(15, 10))

# Subplot 1: Distribution of Prices
plt.subplot(2, 2, 1)
sns.histplot(top_products_df['price'], bins=30, kde=True)
plt.title('Distribution of Prices for Highly Rated Products')
plt.xlabel('Price')
plt.ylabel('Frequency')

# Subplot 2: Average Rating vs. Price
plt.subplot(2, 2, 2)
sns.scatterplot(data=top_products_df, x='price', y='average_rating')
plt.title('Average Rating vs. Price')
plt.xlabel('Price')
plt.ylabel('Average Rating')

# Feel free to add other visualizations as you like

# Show the Dashboard
plt.tight_layout()
plt.show()

## Part 6.4: Narrow down on the handful of products that will be most meaningful

To get a better sense of what products the teams will specifically want to double down on, let's use these graphs to create even better outputs.

#### STEPS TO COMPLETE:
- [ ] Set the cutoff threshold `average_rating_cutoff` to the float value `4.7`

In [None]:
# Filter the DataFrame to include only products with an average rating of 4.7 or higher
average_rating_cutoff = #<---- YOUR CODE GOES HERE ---->
highly_rated_products_df = top_products_df[top_products_df['average_rating'] >= average_rating_cutoff]

# Display the filtered DataFrame
print(highly_rated_products_df)

#### Solution

In [None]:
# Filter the DataFrame to include only products with an average rating of 4.7 or higher
average_rating_cutoff = 4.7
highly_rated_products_df = top_products_df[top_products_df['average_rating'] >= average_rating_cutoff]

# Display the filtered DataFrame
print(highly_rated_products_df)

Count the number of items in the newly cleaned dataset:

In [None]:
# Review the new items in cleaned dataframe
highly_rated_num_items = highly_rated_products_df.shape[0]
print(f"Number of items in the highly rated DataFrame: {highly_rated_num_items}")

Calculate the percentage of the newly cleaned items with the function you wrote earlier:

In [None]:
# Input from user
part = float(input("Enter the number of items in the highly rated DataFrame: "))
whole = float(input("Enter the number of items in the original top products DataFrame: "))


# Calculate and print the percentage
try:
    result = calculate_percentage(part, whole)
    print(f"{part} is {result:.2f}% of {whole}")
except ValueError as e:
    print(e)
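
Typing the numbers in by hand works, but it is easy to mistype them. As a hedged alternative, the sketch below passes the row counts straight from the DataFrames to the function. The two sample DataFrames are made up, and `calculate_percentage` is repeated here only so the example runs on its own.

```python
import pandas as pd


# Same function as the earlier solution, repeated so this cell is self-contained
def calculate_percentage(part, whole):
    if whole == 0:
        raise ValueError("The whole value cannot be zero.")
    return (part / whole) * 100


# Hypothetical stand-ins for top_products_df and highly_rated_products_df
sample_top_df = pd.DataFrame({
    "title": ["Desk Lamp", "Water Bottle", "Yoga Mat", "Notebook", "Backpack"],
    "average_rating": [4.9, 4.6, 4.8, 4.5, 4.7],
})
sample_highly_rated_df = sample_top_df[sample_top_df["average_rating"] >= 4.7]

# len() on a DataFrame returns its number of rows
part = len(sample_highly_rated_df)
whole = len(sample_top_df)

try:
    result = calculate_percentage(part, whole)
    print(f"{part} is {result:.2f}% of {whole}")  # 3 is 60.00% of 5
except ValueError as e:
    print(e)
```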

Plot the new Histogram distribution with your highly rated DataFrame:

In [None]:
plt.figure(figsize=(10, 6))
sns.histplot(highly_rated_products_df['price'], bins=30, kde=True)
plt.title('Distribution of Prices for Products with Rating 4.7 or Higher')
plt.xlabel('Price')
plt.ylabel('Frequency')
plt.show()

Plot the new Scatterplot graph with your new highly rated DataFrame:

In [None]:
plt.figure(figsize=(10, 6))
sns.scatterplot(data=highly_rated_products_df, x='price', y='average_rating')
plt.title('Average Rating vs. Price for Products with Rating 4.7 or Higher')
plt.xlabel('Price')
plt.ylabel('Average Rating')
plt.show()

## Part 6.5: Create a Dashboard of Your New Findings
The next step will be to create a dashboard for these new results so that you can share this with the Machine Learning, Business, and Executive teams. This section will involve filtering, sorting, and visualizing data, which are key skills covered in the PCEP syllabus.

To accomplish this, we'll:
- Filter the DataFrame: We'll filter the DataFrame to include only products with an average rating of `4.7` or higher.
- Sort the DataFrame: We'll use the `.sort_values` method to sort the DataFrame by average rating in descending order.
- Display the DataFrame: We'll display the sorted DataFrame as a table.
- Create a Scatterplot: We'll create a scatterplot to visualize the relationship between price and average rating for the highly rated products.

#### STEPS TO COMPLETE:
- [ ] Set the `highly_rated_products_df_cutoff` with the desired float threshold value
- [ ] Update the `highly_rated_products_df` to use the `.sort_values` method like you learned in class.
- [ ] Fill in the code to display the new sorted DataFrame
- [ ] Complete the code to create a scatterplot using Seaborn.

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import display

# Filter the DataFrame to include only products with an average rating of 4.7 or higher
highly_rated_products_df_cutoff = #<---- YOUR CODE GOES HERE ---->
highly_rated_products_df = top_products_df[top_products_df['average_rating'] >= highly_rated_products_df_cutoff]

# Sort the DataFrame by average_rating in descending order
highly_rated_products_df = highly_rated_products_df.#<---- YOUR CODE GOES HERE ---->(by='average_rating', ascending=False)

# Display the sorted DataFrame as a table
print("Table of Highly Rated Products (Rating 4.7 or Higher, Sorted by Rating):")
display(#<---- YOUR CODE GOES HERE ---->)

# Create the scatterplot
plt.figure(figsize=(10, 6))
sns.scatterplot(data=highly_rated_products_df, x='price', y='average_rating')
plt.title('Average Rating vs. Price for Products with Rating 4.7 or Higher')
plt.xlabel('Price')
plt.ylabel('Average Rating')
plt.show()

#### Solution:

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import display

# Filter the DataFrame to include only products with an average rating of 4.7 or higher
highly_rated_products_df_cutoff = 4.7
highly_rated_products_df = top_products_df[top_products_df['average_rating'] >= highly_rated_products_df_cutoff]

# Sort the DataFrame by average_rating in descending order
highly_rated_products_df = highly_rated_products_df.sort_values(by='average_rating', ascending=False)

# Display the sorted DataFrame as a table
print("Table of Highly Rated Products (Rating 4.7 or Higher, Sorted by Rating):")
display(highly_rated_products_df)

# Create the scatterplot
plt.figure(figsize=(10, 6))
sns.scatterplot(data=highly_rated_products_df, x='price', y='average_rating')
plt.title('Average Rating vs. Price for Products with Rating 4.7 or Higher')
plt.xlabel('Price')
plt.ylabel('Average Rating')
plt.show()
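
Many products can share the same average rating, so you may want a second column to break ties. The sketch below shows one way to do this by passing lists to `sort_values`: rating from highest to lowest, then price from lowest to highest. The `sample_df` data is made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for highly_rated_products_df
sample_df = pd.DataFrame({
    "title": ["Desk Lamp", "Water Bottle", "Yoga Mat", "Notebook"],
    "price": [24.99, 12.50, 31.00, 6.75],
    "average_rating": [4.8, 4.9, 4.8, 4.9],
})

# Sort by rating (descending), then by price (ascending) to break ties
sorted_df = sample_df.sort_values(
    by=["average_rating", "price"],
    ascending=[False, True],
).reset_index(drop=True)

print(sorted_df)
```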

#### Amazing job!! You did it! 🥳 🎉

# 🎁 Wrap Up

Congratulations, you completed everything that the Machine Learning Team, the Business Team, and the Executive Team at Transformational AI need to get started. You completed this in record time and your manager is excited to offer you a promotion to the Data Science team thanks to all of your hard work! 🥇

To wrap up your excellent work, let's output all of your responses to the questions earlier into a `.txt` file. The code is already written for you, but let's look at what is happening:
- We create a `list` of questions to group all of your data inputs together
- We create a file name as a string called `output_file` where your responses will be written to.
- We write a for loop to iterate over each question, which lives inside of a with statement so that we can write each question and response as a line in the text file
- We use the `print` command to let us know when the file has been created and stored in our notebook.

In [None]:
#@title 💭 Add your name
#@markdown Write your first and last name here
Name = '' #@param {type:"string"}

if len(Name) > 0:
    print(f"You added your name: {Name}")
else:
    print("Please add your name")

In [None]:
# Collect all the questions into a list
questions = [Question1, Question2, Question3, Question4, Question5, Question6, Question7]

# Define the file name where your answers will be saved to
output_file = 'user_inputs.txt'

# Write the project details and questions to the output file
with open(output_file, 'w') as file:
    # Write the project title and your name
    file.write("Transformational AI: Analyzing Ecommerce Large Datasets for Machine Learning\n")
    file.write(f"Completed by: {Name}\n")
    file.write("--------------\n\n")

    # Write each question and its response on its own line
    for i, question in enumerate(questions, start=1):
        file.write(f"Question {i}: {question}\n")

print(f"All questions have been written to {output_file}")
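
To confirm that a file was written as expected, you can open it again in read mode. The sketch below writes two made-up answers to a temporary file and reads them back; the sample answers and the file name are assumptions for illustration, and with the real notebook you would open `user_inputs.txt` instead.

```python
import os
import tempfile

# Hypothetical answers, used only for illustration
sample_answers = ["Prices cluster at the low end.", "No clear trend."]
sample_path = os.path.join(tempfile.mkdtemp(), "sample_inputs.txt")

# Write each answer on its own line; the with statement closes the file for us
with open(sample_path, "w", encoding="utf-8") as file:
    for i, answer in enumerate(sample_answers, start=1):
        file.write(f"Question {i}: {answer}\n")

# Open the file again in read mode and split its contents into lines
with open(sample_path, "r", encoding="utf-8") as file:
    lines = file.read().splitlines()

for line in lines:
    print(line)
```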