About Dataset
This dataset contains ratings and reviews for 1K+ Amazon products, based on the details listed on Amazon's official website.

Features

product_id - Product ID    
product_name - Name of the Product  
category - Category of the Product  
discounted_price - Discounted Price of the Product   
actual_price - Actual Price of the Product  
discount_percentage - Percentage of Discount for the Product  
rating - Rating of the Product  
rating_count - Number of people who voted for the Amazon rating  
about_product - Description of the Product  
user_id - ID of the user who wrote the review for the Product  
user_name - Name of the user who wrote the review for the Product  
review_id - ID of the user review  
review_title - Short review  
review_content - Long review  
img_link - Image Link of the Product  
product_link - Official Website Link of the Product  



In this notebook, I will perform the following tasks:

1. Data Cleaning:
   - Handle missing values
   - Remove duplicates (if any)
   - Correct data types
   - Address any inconsistencies in the dataset

2. Data Exploration:
   - Analyze basic statistics of numerical columns
   - Visualize distributions of key variables
   - Identify patterns and trends in the data

3. Answer Key Questions:
   - Determine top-selling products
   - Identify most popular product categories
   - Explore the relationship between ratings and other variables

4. Visualization:
   - Create informative charts and graphs to illustrate findings

The goal is to gain insights into product performance, customer preferences, and pricing strategies to inform business decisions.

Imported Libraries:  
    Numpy  
    Pandas  
    Seaborn  
    Matplotlib.pyplot



In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
#importing the data
df = pd.read_csv("/kaggle/input/amazon-sales-dataset/amazon.csv")
df

#it's good to look at the column values first to get a better understanding of the data

In [None]:
#let's have a look at the data types, columns, and number of rows
df.info()

#all columns are stored as object, so the numeric ones need to be converted to a numerical format

In [None]:
#need to remove the ₹ sign and the commas before these columns can be converted to float
#(selecting both columns at once, since a cell only displays its last expression)
df[['actual_price', 'discounted_price']]

In [None]:
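#'₹' is not a regex metacharacter, so it needs no backslash ('\₹' is an invalid escape sequence in a normal string)
df['actual_price'] = df['actual_price'].replace({'₹': '', ',': ''}, regex=True).astype(float)

df['discounted_price'] = df['discounted_price'].replace({'₹': '', ',': ''}, regex=True).astype(float)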
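Since the same clean-up is applied to both price columns, it could be wrapped in a small helper. This is only a sketch: `clean_price` is a hypothetical name, and the sample values are made up to mimic the "₹1,099" style seen in the data.

```python
import pandas as pd

def clean_price(series: pd.Series) -> pd.Series:
    """Strip the rupee sign and thousands separators, then convert to float."""
    cleaned = (
        series.str.replace("₹", "", regex=False)  # plain-text replacement, no regex needed
        .str.replace(",", "", regex=False)
    )
    return cleaned.astype(float)

# Made-up sample values, not taken from the dataset
sample = pd.Series(["₹399", "₹1,099", "₹1,39,900"])
cleaned_sample = clean_price(sample)
print(cleaned_sample.tolist())
```

With the real data this would be called as `df['actual_price'] = clean_price(df['actual_price'])`, assuming the column still holds the raw strings.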


In [None]:
df.info()

In [None]:
df['discount_percentage']  #need to remove the % sign so we can convert it to float

#dividing by 100 stores the discount as a fraction between 0 and 1
df['discount_percentage'] = df['discount_percentage'].replace({'%': ''}, regex=True).astype(float)/100
df.info()
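Once the three price-related columns are numeric, one possible consistency check is to compare the listed discount with the discount implied by the two prices. The sketch below uses a toy frame with made-up values; the one-percentage-point tolerance is an assumption, chosen because listed percentages are presumably rounded.

```python
import numpy as np
import pandas as pd

# Toy rows (made-up values): prices as floats, discount as a fraction
toy = pd.DataFrame({
    "actual_price": [1000.0, 500.0, 200.0],
    "discounted_price": [750.0, 250.0, 180.0],
    "discount_percentage": [0.25, 0.50, 0.30],  # last row is deliberately wrong
})

# Discount implied by the two price columns
implied = 1 - toy["discounted_price"] / toy["actual_price"]

# Flag rows where the listed discount differs from the implied one by more than the tolerance
toy["discount_mismatch"] = ~np.isclose(implied, toy["discount_percentage"], atol=0.01)
print(toy[toy["discount_mismatch"]])
```

Rows flagged this way are not necessarily errors; they are just candidates for a closer look.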

In [None]:
#how about the rating column?
df['rating'] #looks ok, let's have a look at all unique values in this column

sorted(df['rating'].unique(), reverse=False)
#ratings range from 2 (min) to 5 (max), and there is a stray '|' value that needs to be removed or replaced
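Scanning the sorted unique values works well for a small column. As an alternative sketch, the non-numeric entries can be isolated directly by attempting the conversion and keeping whatever fails. The toy series below is made up, with one stray `|` like the one found in this column.

```python
import pandas as pd

# Toy ratings stored as strings (made-up values)
ratings = pd.Series(["4.2", "3.9", "|", "5.0"])

# Values that cannot be parsed as numbers become NaN under errors="coerce"
as_numbers = pd.to_numeric(ratings, errors="coerce")

# Keep the original strings at the positions where parsing failed
bad_values = ratings[as_numbers.isna() & ratings.notna()]
print(bad_values.tolist())
```

The `ratings.notna()` condition keeps genuinely missing values out of the result, so only malformed strings are reported.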

In [None]:
#let's see the row with | as its rating
df[df['rating']== '|']
#do we have any other review for this product?
#As my dataset is small, I prefer to see if I can replace the value rather than just removing the row

In [None]:
#do we have any other review for this product?
df[df['product_id']== 'B08L12N5H1']
#Nope! So I suggest replacing it with the average rating; I also looked at review_content to see whether it got good reviews

In [None]:
df['rating'] = pd.to_numeric(df['rating'], errors='coerce') #coerce turns non-numeric values into NaN instead of raising an error

#so now that value is NaN and I need to replace it with the mean

avg_rating = df['rating'].mean()

df['rating'] = df['rating'].fillna(avg_rating)
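One property of this choice is worth noting: `mean()` skips NaN by default, so filling the gap with the mean leaves the column average unchanged. A small sketch on made-up values illustrates this.

```python
import pandas as pd

# Toy ratings (made-up values) with one unparseable entry
ratings = pd.Series(["4.0", "3.0", "|", "5.0"])
numeric = pd.to_numeric(ratings, errors="coerce")

mean_before = numeric.mean()          # NaN is skipped, so this averages 4.0, 3.0 and 5.0
filled = numeric.fillna(mean_before)  # the former "|" now holds the mean

print(filled.tolist(), filled.mean())
```

The median would be another reasonable fill value if the ratings were heavily skewed; that is a judgment call rather than something the data dictates.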

In [None]:
#show me rows with a null rating (there should be none now); the column is also float now
df[df['rating'].isnull()]

In [None]:
df.info()

In [None]:
#let's have a look at rating_count; df.info() already showed that it has 2 nulls
df[df['rating_count'].isnull()]



In [None]:
#let me see if we have duplicates for these 2 products: B0B94JPY2N, B0BQRJ3C47
#df[df['product_id']=='B0BQRJ3C47']

df[df['product_id'].isin(['B0B94JPY2N', 'B0BQRJ3C47'])]
#nope! we can delete these rows, replace NA with 1 since each row has at least one review, or replace them with the average rating_count

In [None]:
#I think it's good to replace NaN with 1, as these rows include at least one rating; the format also needs to change
df['rating_count'] = df['rating_count'].fillna(1)

#Also need to remove the commas
df['rating_count'] = df['rating_count'].replace( { ',': ''}, regex=True).astype('float64')

df['rating_count']
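The same clean-up can also be written as one chain that removes the commas first and fills the gaps afterwards, so the fallback value never mixes with the raw strings. This is a sketch on made-up values; the final cast to `int64` is an optional assumption on my side (counts are whole numbers), whereas the cell above keeps `float64`.

```python
import pandas as pd

# Toy rating counts (made-up values) with thousands separators and one missing entry
counts = pd.Series(["24,269", "1,234", None, "87"])

without_commas = counts.str.replace(",", "", regex=False)  # missing values stay missing
cleaned_counts = (
    pd.to_numeric(without_commas)  # strings -> numbers, missing -> NaN
    .fillna(1)                     # same fallback of 1 as used above
    .astype("int64")               # safe only after the NaN values are filled
)
print(cleaned_counts.tolist())
```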

In [None]:
#the data format looks good for all columns now
df.info()

In [None]:
#Any other null values?
df.isnull().sum().sort_values(ascending = False)

#Nope! Looks great

In [None]:
#Let's have a look at a summary of the numerical columns
df.describe()

#This dataset covers a wide variety of products, with prices ranging from 39 to 139900.

In [None]:
#lets see if we have any duplicates

# Find Duplicated rows
df.duplicated().any()

In [None]:
#I'm also curious about the table's primary key: is it product_id?
df['product_id'].nunique()

#only 1351 unique product_id values across 1465 rows, which means some products are repeated, even though there are no fully duplicated rows

In [None]:
#show me rows with duplicated product id 

duplicated_products = df[df['product_id'].duplicated(keep=False)]

# Sort by product_id to group duplicates together
duplicated_products.sort_values('product_id')

#Now I can see what is different in the rows with the same product_id, just to understand the data better
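Instead of comparing the repeated rows by eye, one option is to count distinct values per column within each repeated `product_id` and list the columns that vary. This is a sketch on a toy frame with made-up values, not the result for this dataset.

```python
import pandas as pd

# Toy frame (made-up values): product "A1" appears twice under different categories
toy = pd.DataFrame({
    "product_id": ["A1", "A1", "B2"],
    "category": ["Cables", "Chargers", "Cables"],
    "rating": [4.2, 4.2, 3.9],
})

# Keep only rows whose product_id occurs more than once
dupes = toy[toy["product_id"].duplicated(keep=False)]

# Number of distinct values per column within each product_id
distinct_per_product = dupes.groupby("product_id").nunique()

# Columns that take more than one value for at least one repeated product
varies = (distinct_per_product > 1).any()
varying_columns = varies[varies].index.tolist()
print(varying_columns)
```

On the real data the same steps would run on `df` in place of `toy`.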

In [None]:


#we've got 1465 rows but only 1351 unique product_id values and 1337 unique product_name values, so neither can be the key
#I found that the unique column is product_link, which means each row of this table corresponds to one product link
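A quick way to look for a column that could serve as a key is to compare each column's number of distinct values with the row count. The sketch below uses a toy frame with made-up values in which only `product_link` is unique on every row.

```python
import pandas as pd

# Toy frame (made-up values)
toy = pd.DataFrame({
    "product_id": ["A1", "A1", "B2"],
    "product_name": ["Cable", "Cable", "Charger"],
    "product_link": ["link-1", "link-2", "link-3"],
})

# Distinct values per column, compared with the total number of rows
unique_counts = toy.nunique()
candidate_keys = unique_counts[unique_counts == len(toy)].index.tolist()
print(candidate_keys)
```

Note that `nunique()` ignores missing values, so a column containing NaN would never pass this check.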

In [None]:
#Any null values?
df.isnull().sum().sort_values(ascending = False)

#rating_count had 2 nulls before they were filled above, so no nulls should remain at this point