# Amazon Products Sales EDA 2024

Amazon product sales data for an analysis like this one may be found in public datasets on platforms such as Kaggle or the UCI Machine Learning Repository, in any data Amazon itself releases for research purposes, or in datasets published alongside research papers and studies. Verify the relevance and reliability of the data for your specific use case.

Remember to always review the terms of use and licensing agreements associated with any dataset you find to ensure that you are using it appropriately.

In [4]:
import numpy as np  # numerical arrays
import pandas as pd  # data loading and preprocessing
import matplotlib.pyplot as plt  # static plots
import plotly.express as px  # interactive visualizations
import os

In [5]:
df = pd.read_csv("Amazon-Products.csv")
df

Unnamed: 0.1,Unnamed: 0,name,main_category,sub_category,image,link,ratings,no_of_ratings,discount_price,actual_price
0,0,Lloyd 1.5 Ton 3 Star Inverter Split Ac (5 In 1...,appliances,Air Conditioners,https://m.media-amazon.com/images/I/31UISB90sY...,https://www.amazon.in/Lloyd-Inverter-Convertib...,4.2,2255,"₹32,999","₹58,990"
1,1,LG 1.5 Ton 5 Star AI DUAL Inverter Split AC (C...,appliances,Air Conditioners,https://m.media-amazon.com/images/I/51JFb7FctD...,https://www.amazon.in/LG-Convertible-Anti-Viru...,4.2,2948,"₹46,490","₹75,990"
2,2,LG 1 Ton 4 Star Ai Dual Inverter Split Ac (Cop...,appliances,Air Conditioners,https://m.media-amazon.com/images/I/51JFb7FctD...,https://www.amazon.in/LG-Inverter-Convertible-...,4.2,1206,"₹34,490","₹61,990"
3,3,LG 1.5 Ton 3 Star AI DUAL Inverter Split AC (C...,appliances,Air Conditioners,https://m.media-amazon.com/images/I/51JFb7FctD...,https://www.amazon.in/LG-Convertible-Anti-Viru...,4.0,69,"₹37,990","₹68,990"
4,4,Carrier 1.5 Ton 3 Star Inverter Split AC (Copp...,appliances,Air Conditioners,https://m.media-amazon.com/images/I/41lrtqXPiW...,https://www.amazon.in/Carrier-Inverter-Split-C...,4.1,630,"₹34,490","₹67,790"
...,...,...,...,...,...,...,...,...,...,...
551580,1099,Adidas Regular Fit Men's Track Tops,sports & fitness,Yoga,https://m.media-amazon.com/images/I/71tHAR9pIY...,https://www.amazon.in/Adidas-Regular-Mens-Trac...,3.2,9,"₹3,449","₹4,599"
551581,1100,Redwolf Noice Toit Smort - Hoodie (Black),sports & fitness,Yoga,https://m.media-amazon.com/images/I/41pKrMZ5lQ...,https://www.amazon.in/Redwolf-Noice-Smort-Cott...,2.0,2,"₹1,199","₹1,999"
551582,1101,Redwolf Schrute Farms B&B - Hoodie (Navy Blue),sports & fitness,Yoga,https://m.media-amazon.com/images/I/41n9u+zNSc...,https://www.amazon.in/Redwolf-Schrute-Farms-Ho...,4.0,1,"₹1,199","₹1,999"
551583,1102,Puma Men Shorts,sports & fitness,Yoga,https://m.media-amazon.com/images/I/51LoWv5JDt...,https://www.amazon.in/Puma-Woven-Short-5208526...,4.4,37,,
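
The `Unnamed: 0` column in the output above looks like a row index that was written into the CSV when it was saved. That is an assumption; the source does not say where it comes from. If so, it carries no product information and can be dropped. A minimal sketch on a tiny in-memory stand-in for the file (the rows and the names `csv_text`, `raw` and `tidy` are hypothetical):

```python
import io

import pandas as pd

# Tiny stand-in for Amazon-Products.csv with the same leftover index column.
csv_text = """Unnamed: 0,name,ratings
0,Item A,4.2
1,Item B,3.9
"""

raw = pd.read_csv(io.StringIO(csv_text))

# Keep every column whose name does not start with "Unnamed".
tidy = raw.loc[:, ~raw.columns.str.startswith("Unnamed")]
print(list(tidy.columns))
```

The rest of this notebook keeps the column, so the outputs below still show it.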


In [6]:
print(f"Number of rows: {df.shape[0]}, number of columns: {df.shape[1]}")

Number of rows: 551585, number of columns: 10


In [7]:
df.isnull().sum()

Unnamed: 0             0
name                   0
main_category          0
sub_category           0
image                  0
link                   0
ratings           175794
no_of_ratings     175794
discount_price     61163
actual_price       17813
dtype: int64
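
Raw counts are easier to judge as a share of all rows. With 551585 rows, the counts above work out to roughly 32% missing for `ratings` and `no_of_ratings`, 11% for `discount_price` and 3% for `actual_price`. The sketch below shows one way to compute such percentages; the frame `toy` is made-up data, not the real dataset.

```python
import numpy as np
import pandas as pd

# Made-up frame with the same kind of gaps as the real data.
toy = pd.DataFrame({
    "ratings": ["4.2", np.nan, "3.9", np.nan],
    "actual_price": ["₹100", "₹250", np.nan, "₹80"],
})

# isnull().mean() is the fraction of missing values per column.
missing_pct = (toy.isnull().mean() * 100).round(1)
print(missing_pct)
```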

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 551585 entries, 0 to 551584
Data columns (total 10 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   Unnamed: 0      551585 non-null  int64 
 1   name            551585 non-null  object
 2   main_category   551585 non-null  object
 3   sub_category    551585 non-null  object
 4   image           551585 non-null  object
 5   link            551585 non-null  object
 6   ratings         375791 non-null  object
 7   no_of_ratings   375791 non-null  object
 8   discount_price  490422 non-null  object
 9   actual_price    533772 non-null  object
dtypes: int64(1), object(9)
memory usage: 42.1+ MB


In [9]:
df.describe()

Unnamed: 0.1,Unnamed: 0
count,551585.0
mean,7006.200471
std,5740.835523
min,0.0
25%,1550.0
50%,5933.0
75%,11482.0
max,19199.0
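
`describe()` only summarised `Unnamed: 0` because every other column is still stored as `object` at this point. Passing `include="all"` also reports count, unique, top and freq for the text columns. A small sketch on made-up rows (`toy` and `summary` are illustrative names):

```python
import pandas as pd

# Made-up rows; both columns hold text, as in the raw dataset.
toy = pd.DataFrame({
    "main_category": ["appliances", "appliances", "sports & fitness"],
    "ratings": ["4.2", "4.0", "4.2"],
})

# include="all" summarises non-numeric columns too.
summary = toy.describe(include="all")
print(summary)
```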


In [10]:
# Strip the rupee sign and thousands separators, then convert to numbers.
# errors='coerce' turns anything that cannot be parsed into NaN.
df["discount_price"] = df["discount_price"].str.replace("₹", "", regex=False)
df["discount_price"] = df["discount_price"].str.replace(",", "", regex=False)
df["discount_price"] = pd.to_numeric(df["discount_price"], errors='coerce')
# The column still contains NaN, so it stays float for now; the cast to int happens after the fill in a later cell.

In [11]:
# Same cleaning for actual_price: strip "₹" and ",", then convert to numbers.
df["actual_price"] = df["actual_price"].str.replace("₹", "", regex=False)
df["actual_price"] = df["actual_price"].str.replace(",", "", regex=False)
df["actual_price"] = pd.to_numeric(df["actual_price"], errors='coerce')
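
Both price columns go through the same three steps, so the logic could live in one helper. The sketch below is one possible refactoring, not the notebook's own code; `clean_price` and the frame `toy` are hypothetical names and data.

```python
import pandas as pd


def clean_price(col: pd.Series) -> pd.Series:
    """Strip the rupee sign and thousands separators, then convert to numbers."""
    stripped = (
        col.str.replace("₹", "", regex=False)
           .str.replace(",", "", regex=False)
    )
    # Anything that still cannot be parsed becomes NaN.
    return pd.to_numeric(stripped, errors="coerce")


# Made-up rows in the same format as the raw price columns.
toy = pd.DataFrame({
    "discount_price": ["₹32,999", None, "₹1,199"],
    "actual_price": ["₹58,990", "₹1,999", "n/a"],
})

for name in ("discount_price", "actual_price"):
    toy[name] = clean_price(toy[name])

print(toy)
```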

In [12]:
# pd.to_numeric returned floats because both columns contain NaN.
# Fill the missing prices with the column mean first, then cast to int
# (the cast truncates the fractional part). Both price columns get the same treatment.


df["discount_price"] = df["discount_price"].fillna(df["discount_price"].mean()).astype(int)
df["actual_price"] = df["actual_price"].fillna(df["actual_price"].mean()).astype(int)
# No rows are dropped here: after the fill above neither column contains NaN.
df = df.dropna(subset=["discount_price", "actual_price"])
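
Filling with the mean gives every product without a listed price the same synthetic value, and afterwards those rows can no longer be told apart from real prices. Two alternatives are sketched below on made-up numbers: a nullable integer column that keeps the gap, and a median fill paired with a flag column. The column names `price_missing`, `price_nullable` and `price_median` are hypothetical, and neither option is what this notebook does.

```python
import numpy as np
import pandas as pd

# Made-up prices with one gap, already converted to float.
toy = pd.DataFrame({"actual_price": [58990.0, np.nan, 1999.0]})

# Record which prices were missing before any fill, so imputed rows stay identifiable.
toy["price_missing"] = toy["actual_price"].isna()

# Option A: the nullable Int64 dtype keeps the gap as <NA> instead of inventing a value.
toy["price_nullable"] = toy["actual_price"].astype("Int64")

# Option B: a median fill is less sensitive to a few very expensive items than the mean.
toy["price_median"] = toy["actual_price"].fillna(toy["actual_price"].median()).astype(int)

print(toy)
```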


In [13]:
df

Unnamed: 0.1,Unnamed: 0,name,main_category,sub_category,image,link,ratings,no_of_ratings,discount_price,actual_price
0,0,Lloyd 1.5 Ton 3 Star Inverter Split Ac (5 In 1...,appliances,Air Conditioners,https://m.media-amazon.com/images/I/31UISB90sY...,https://www.amazon.in/Lloyd-Inverter-Convertib...,4.2,2255,32999,58990
1,1,LG 1.5 Ton 5 Star AI DUAL Inverter Split AC (C...,appliances,Air Conditioners,https://m.media-amazon.com/images/I/51JFb7FctD...,https://www.amazon.in/LG-Convertible-Anti-Viru...,4.2,2948,46490,75990
2,2,LG 1 Ton 4 Star Ai Dual Inverter Split Ac (Cop...,appliances,Air Conditioners,https://m.media-amazon.com/images/I/51JFb7FctD...,https://www.amazon.in/LG-Inverter-Convertible-...,4.2,1206,34490,61990
3,3,LG 1.5 Ton 3 Star AI DUAL Inverter Split AC (C...,appliances,Air Conditioners,https://m.media-amazon.com/images/I/51JFb7FctD...,https://www.amazon.in/LG-Convertible-Anti-Viru...,4.0,69,37990,68990
4,4,Carrier 1.5 Ton 3 Star Inverter Split AC (Copp...,appliances,Air Conditioners,https://m.media-amazon.com/images/I/41lrtqXPiW...,https://www.amazon.in/Carrier-Inverter-Split-C...,4.1,630,34490,67790
...,...,...,...,...,...,...,...,...,...,...
551580,1099,Adidas Regular Fit Men's Track Tops,sports & fitness,Yoga,https://m.media-amazon.com/images/I/71tHAR9pIY...,https://www.amazon.in/Adidas-Regular-Mens-Trac...,3.2,9,3449,4599
551581,1100,Redwolf Noice Toit Smort - Hoodie (Black),sports & fitness,Yoga,https://m.media-amazon.com/images/I/41pKrMZ5lQ...,https://www.amazon.in/Redwolf-Noice-Smort-Cott...,2.0,2,1199,1999
551582,1101,Redwolf Schrute Farms B&B - Hoodie (Navy Blue),sports & fitness,Yoga,https://m.media-amazon.com/images/I/41n9u+zNSc...,https://www.amazon.in/Redwolf-Schrute-Farms-Ho...,4.0,1,1199,1999
551583,1102,Puma Men Shorts,sports & fitness,Yoga,https://m.media-amazon.com/images/I/51LoWv5JDt...,https://www.amazon.in/Puma-Woven-Short-5208526...,4.4,37,2623,23111


In [14]:
df["ratings"].unique()

array(['4.2', '4.0', '4.1', '4.3', '3.9', '3.8', '3.5', nan, '4.6', '3.3',
       '3.4', '3.7', '2.9', '5.0', '4.4', '3.6', '2.7', '4.5', '3.0',
       '3.1', '3.2', '4.8', '4.7', '2.5', '1.0', '2.6', '2.8', '2.3',
       '1.7', 'Get', '1.8', '2.4', '4.9', '2.2', '1.6', '1.9', '2.0',
       '1.4', '2.1', 'FREE', '1.2', '1.3', '1.5', '₹68.99', '₹65', '1.1',
       '₹70', '₹100', '₹99', '₹2.99'], dtype=object)

In [15]:
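# Some rows hold scraped text ('Get', 'FREE') or prices in the ratings column.
# Map those entries to '0.0' so the column can be cast to float.
df["ratings"] = df["ratings"].replace(['Get','FREE','₹68.99', '₹65','₹70', '₹100', '₹99', '₹2.99'], '0.0')
df["ratings"] = df["ratings"].astype(float)
df["ratings"].unique()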

array([4.2, 4. , 4.1, 4.3, 3.9, 3.8, 3.5, nan, 4.6, 3.3, 3.4, 3.7, 2.9,
       5. , 4.4, 3.6, 2.7, 4.5, 3. , 3.1, 3.2, 4.8, 4.7, 2.5, 1. , 2.6,
       2.8, 2.3, 1.7, 0. , 1.8, 2.4, 4.9, 2.2, 1.6, 1.9, 2. , 1.4, 2.1,
       1.2, 1.3, 1.5, 1.1])
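
Mapping invalid entries to `0.0` makes them look like real ratings of zero, which pulls averages down. A possible alternative, shown here only as a sketch on a made-up series, is to let `pd.to_numeric` coerce anything unparseable to `NaN`. That also avoids listing the bad values by hand.

```python
import pandas as pd

# Made-up sample mixing valid ratings with the kind of stray text seen above.
ratings = pd.Series(["4.2", "Get", "FREE", "₹68.99", None, "3.9"])

# Unparseable entries become NaN rather than 0.0.
numeric = pd.to_numeric(ratings, errors="coerce")

print(numeric.tolist())
# mean() skips NaN by default, so only the two valid ratings are averaged.
print(numeric.mean())
```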

In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 551585 entries, 0 to 551584
Data columns (total 10 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   Unnamed: 0      551585 non-null  int64  
 1   name            551585 non-null  object 
 2   main_category   551585 non-null  object 
 3   sub_category    551585 non-null  object 
 4   image           551585 non-null  object 
 5   link            551585 non-null  object 
 6   ratings         375791 non-null  float64
 7   no_of_ratings   375791 non-null  object 
 8   discount_price  551585 non-null  int32  
 9   actual_price    551585 non-null  int32  
dtypes: float64(1), int32(2), int64(1), object(6)
memory usage: 37.9+ MB


In [17]:
# Keep only rows whose no_of_ratings starts with a digit. This filters out
# missing values (the string 'nan' after the cast below) and stray text.
df['no_of_ratings'] = df['no_of_ratings'].astype(str)
df['correct_no_of_ratings'] = df['no_of_ratings'].str[:1].str.isdigit()
# Drop rows with an incorrect 'no_of_ratings'
df = df[df['correct_no_of_ratings']]
df['correct_no_of_ratings'].value_counts()

correct_no_of_ratings
True    369558
Name: count, dtype: int64

In [18]:
df["no_of_ratings"] = df["no_of_ratings"].str.replace(',', '').astype(float)

df.head()


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df["no_of_ratings"] = df["no_of_ratings"].str.replace(',', '').astype(float)


Unnamed: 0.1,Unnamed: 0,name,main_category,sub_category,image,link,ratings,no_of_ratings,discount_price,actual_price,correct_no_of_ratings
0,0,Lloyd 1.5 Ton 3 Star Inverter Split Ac (5 In 1...,appliances,Air Conditioners,https://m.media-amazon.com/images/I/31UISB90sY...,https://www.amazon.in/Lloyd-Inverter-Convertib...,4.2,2255.0,32999,58990,True
1,1,LG 1.5 Ton 5 Star AI DUAL Inverter Split AC (C...,appliances,Air Conditioners,https://m.media-amazon.com/images/I/51JFb7FctD...,https://www.amazon.in/LG-Convertible-Anti-Viru...,4.2,2948.0,46490,75990,True
2,2,LG 1 Ton 4 Star Ai Dual Inverter Split Ac (Cop...,appliances,Air Conditioners,https://m.media-amazon.com/images/I/51JFb7FctD...,https://www.amazon.in/LG-Inverter-Convertible-...,4.2,1206.0,34490,61990,True
3,3,LG 1.5 Ton 3 Star AI DUAL Inverter Split AC (C...,appliances,Air Conditioners,https://m.media-amazon.com/images/I/51JFb7FctD...,https://www.amazon.in/LG-Convertible-Anti-Viru...,4.0,69.0,37990,68990,True
4,4,Carrier 1.5 Ton 3 Star Inverter Split AC (Copp...,appliances,Air Conditioners,https://m.media-amazon.com/images/I/41lrtqXPiW...,https://www.amazon.in/Carrier-Inverter-Split-C...,4.1,630.0,34490,67790,True
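
The warning printed above appears because `df` was produced by a boolean filter in the previous cell and a column was then assigned on the result. Taking an explicit `.copy()` after filtering is the usual way to make the intent clear and avoid the warning. A minimal sketch on made-up data (`toy`, `keep` and `filtered` are illustrative names):

```python
import pandas as pd

# Made-up values in the same format as the raw no_of_ratings column.
toy = pd.DataFrame({"no_of_ratings": ["2,255", "nan", "69", "FREE"]})

# Same rule as in the notebook: keep rows whose value starts with a digit.
keep = toy["no_of_ratings"].str[:1].str.isdigit()

# .copy() makes the filtered frame independent of the original.
filtered = toy[keep].copy()
filtered["no_of_ratings"] = (
    filtered["no_of_ratings"].str.replace(",", "", regex=False).astype(float)
)

print(filtered)
```

The original frame `toy` is left unchanged, and the filtered rows keep their original index labels.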


In [19]:
# df["ratings"] = df["ratings"].fillna(0).astype(float)
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 369558 entries, 0 to 551584
Data columns (total 11 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   Unnamed: 0             369558 non-null  int64  
 1   name                   369558 non-null  object 
 2   main_category          369558 non-null  object 
 3   sub_category           369558 non-null  object 
 4   image                  369558 non-null  object 
 5   link                   369558 non-null  object 
 6   ratings                369558 non-null  float64
 7   no_of_ratings          369558 non-null  float64
 8   discount_price         369558 non-null  int32  
 9   actual_price           369558 non-null  int32  
 10  correct_no_of_ratings  369558 non-null  bool   
dtypes: bool(1), float64(2), int32(2), int64(1), object(5)
memory usage: 28.5+ MB
