# Furniture Product Analytics Dashboard - Exploratory Data Analysis

This notebook explores the furniture e-commerce dataset in order to:
1. Understand the data structure and quality
2. Generate insights for the analytics dashboard
3. Identify patterns and trends in product data
4. Export processed data for the React frontend

**Dataset**: 312 furniture products with attributes including title, brand, description, price, categories, images, dimensions, and more.

## 1. Import Required Libraries

We'll use pandas for data manipulation, matplotlib/seaborn for static visualizations, and plotly for interactive charts that can be integrated into the React dashboard.

In [1]:
# Data manipulation and analysis
import pandas as pd
import numpy as np
import json
import ast
from collections import Counter

# Data visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

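# Suppress all library warnings to keep notebook output readable
# (note: this also hides deprecation warnings)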
import warnings
warnings.filterwarnings('ignore')

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 50)

print("✓ All libraries imported successfully!")

✓ All libraries imported successfully!


## 2. Load and Explore Dataset

Load the furniture dataset and run an initial exploration to understand:
- Dataset dimensions (rows and columns)
- Column names and data types
- Sample records
- Basic statistics

In [2]:
# Load the dataset
df = pd.read_csv('intern_data_ikarus.csv')

# Display basic information
print("="*80)
print("DATASET OVERVIEW")
print("="*80)
print(f"\nDataset Shape: {df.shape[0]} rows × {df.shape[1]} columns")
print(f"\nColumn Names:\n{list(df.columns)}")
print(f"\nData Types:\n{df.dtypes}")
print(f"\nMemory Usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

# Display first few rows
print("\n" + "="*80)
print("SAMPLE DATA (First 3 rows)")
print("="*80)
display(df.head(3))

DATASET OVERVIEW

Dataset Shape: 312 rows × 12 columns

Column Names:
['title', 'brand', 'description', 'price', 'categories', 'images', 'manufacturer', 'package_dimensions', 'country_of_origin', 'material', 'color', 'uniq_id']

Data Types:
title                 object
brand                 object
description           object
price                 object
categories            object
images                object
manufacturer          object
package_dimensions    object
country_of_origin     object
material              object
color                 object
uniq_id               object
dtype: object

Memory Usage: 0.51 MB

SAMPLE DATA (First 3 rows)


|  | title | brand | description | price | categories | images | manufacturer | package_dimensions | country_of_origin | material | color | uniq_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | GOYMFK 1pc Free Standing Shoe Rack, Multi-laye... | GOYMFK | multiple shoes, coats, hats, and other items E... | $24.99 | ['Home & Kitchen', 'Storage & Organization', '... | ['https://m.media-amazon.com/images/I/416WaLx1... | GOYMFK | 2.36"D x 7.87"W x 21.6"H | China | Metal | White | 02593e81-5c09-5069-8516-b0b29f439ded |
| 1 | subrtex Leather ding Room, Dining Chairs Set o... | subrtex | subrtex Dining chairs Set of 2 |  | ['Home & Kitchen', 'Furniture', 'Dining Room F... | ['https://m.media-amazon.com/images/I/31SejUEW... | Subrtex Houseware INC | 18.5"D x 16"W x 35"H |  | Sponge | Black | 5938d217-b8c5-5d3e-b1cf-e28e340f292e |
| 2 | Plant Repotting Mat MUYETOL Waterproof Transpl... | MUYETOL |  | $5.98 | ['Patio, Lawn & Garden', 'Outdoor Décor', 'Doo... | ['https://m.media-amazon.com/images/I/41RgefVq... | MUYETOL | 26.8"L x 26.8"W |  | Polyethylene | Green | b2ede786-3f51-5a45-9a5b-bcf856958cd8 |
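
The sample rows show `categories` and `images` stored as stringified Python lists rather than real lists. One possible way to turn such cells into lists and count category frequencies is sketched below, using the `ast` and `Counter` imports from the first cell. The helper name `parse_list_cell` is hypothetical, and the input strings are shortened illustrations based on the visible part of the sample rows, not the full values from the CSV.

```python
import ast
from collections import Counter


def parse_list_cell(value):
    """Parse a stringified Python list; return [] for missing or malformed cells."""
    if not isinstance(value, str):
        return []  # covers None and NaN (a float) from pandas
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return []  # malformed or truncated literal
    return list(parsed) if isinstance(parsed, (list, tuple)) else []


# Shortened, illustrative cells (the real lists in the dataset are longer)
raw_categories = [
    "['Home & Kitchen', 'Storage & Organization']",
    "['Home & Kitchen', 'Furniture']",
    "['Patio, Lawn & Garden', 'Outdoor Décor']",
    None,
]

# Flatten all parsed lists and count how often each category appears
category_counts = Counter(
    category
    for cell in raw_categories
    for category in parse_list_cell(cell)
)
print(category_counts.most_common(3))
```

In the notebook itself, the same helper could presumably be applied column-wise, for example `df['categories'].apply(parse_list_cell)`. `ast.literal_eval` is used instead of `eval` because it only accepts Python literals.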


## 3. Data Quality Assessment

Assess data quality by:
- Identifying missing values across all columns
- Calculating missing percentages
- Visualizing missingness patterns
- Understanding which fields need attention for ML models

In [None]:
# Calculate missing values
missing_data = pd.DataFrame({
    'Column': df.columns,
    'Missing_Count': df.isnull().sum(),
    'Missing_Percentage': (df.isnull().sum() / len(df) * 100).round(2)
})
missing_data = missing_data[missing_data['Missing_Count'] > 0].sort_values('Missing_Percentage', ascending=False)

print("="*80)
print("MISSING DATA ANALYSIS")
print("="*80)
print(f"\nTotal Columns with Missing Data: {len(missing_data)}")
print(f"\n{missing_data.to_string(index=False)}")

# Visualize missing data
fig = px.bar(missing_data, 
             x='Column', 
             y='Missing_Percentage',
             title='Missing Data Percentage by Column',
             labels={'Missing_Percentage': 'Missing %'},
             color='Missing_Percentage',
             color_continuous_scale='Reds')
fig.update_layout(xaxis_tickangle=-45, height=500)
fig.show()

# Key insights (the figures below are hard-coded; re-check them against missing_data after each run)
print("\n" + "="*80)
print("KEY INSIGHTS:")
print("="*80)
print("• country_of_origin has highest missingness (59.94%) - not critical for recommendations")
print("• description missing in ~49% - we'll use title as fallback for NLP")
print("• price missing in ~31% - may need imputation or filtering for price-based features")
print("• manufacturer, material, color have moderate missingness - can be handled with 'Unknown' category")
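
The handling strategies named in the insights above (title as a fallback for a missing description, an 'Unknown' category for sparse categorical fields, and filtering for price-based features) could look roughly like the sketch below. It runs on a tiny made-up frame so that it is self-contained. The column names come from the dataset, while `text_for_nlp`, `has_price`, and the sample values are assumptions made for illustration only.

```python
import pandas as pd

# Tiny made-up frame with the same column names as the dataset
sample = pd.DataFrame({
    "title": ["Shoe Rack", "Dining Chairs", "Repotting Mat"],
    "description": ["Holds multiple shoes", "Set of 2", None],
    "price": ["$24.99", None, "$5.98"],
    "manufacturer": ["GOYMFK", None, "MUYETOL"],
    "material": ["Metal", None, "Polyethylene"],
    "color": ["White", "Black", None],
})

cleaned = sample.copy()

# Use the title wherever the description is missing (hypothetical column name)
cleaned["text_for_nlp"] = cleaned["description"].fillna(cleaned["title"])

# Replace missing categorical values with an explicit 'Unknown' label
for col in ["manufacturer", "material", "color"]:
    cleaned[col] = cleaned[col].fillna("Unknown")

# Flag rows with a price so price-based features can filter on it
cleaned["has_price"] = cleaned["price"].notna()
priced = cleaned[cleaned["has_price"]]

print(cleaned[["text_for_nlp", "material", "color", "has_price"]])
```

Keeping a `has_price` flag rather than dropping rows up front leaves the unpriced products available for features that do not depend on price.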

## 4. Price Analysis

Analyze the price distribution to understand:
- Price range and distribution
- Outliers and anomalies
- Price segments for different product categories
- Statistical summary of pricing
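
Because `price` is loaded as an `object` column holding strings such as `$24.99`, it has to be converted to numbers before any distribution or outlier analysis. The sketch below shows one way this might be done, followed by a simple 1.5 × IQR outlier rule. The helpers `parse_price` and `iqr_bounds` are hypothetical names, the IQR rule is a common convention rather than something the notebook prescribes, and apart from `$24.99` and `$5.98` the sample values are invented for illustration.

```python
import math
import re
import statistics


def parse_price(value):
    """Convert a string such as '$24.99' to float; return None if missing or unparseable."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return None
    # Drop thousands separators, then take the first number found (assumption:
    # if a cell ever held a range, only its first value would be kept)
    match = re.search(r"\d+(?:\.\d+)?", str(value).replace(",", ""))
    return float(match.group()) if match else None


def iqr_bounds(prices):
    """Return (low, high) fences using the 1.5 * IQR rule."""
    q1, _, q3 = statistics.quantiles(prices, n=4)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


# Illustrative raw values, including missing entries
raw = ["$24.99", float("nan"), "$5.98", "$1,299.00", None, "$49.50", "$32.00", "$18.75"]

prices = [p for p in map(parse_price, raw) if p is not None]
low, high = iqr_bounds(prices)
outliers = [p for p in prices if p < low or p > high]

print(f"min={min(prices):.2f} median={statistics.median(prices):.2f} max={max(prices):.2f}")
print(f"outliers: {outliers}")
```

In the notebook, the conversion could presumably be applied with `df['price'].map(parse_price)` into a new numeric column, after which `describe()` would give the statistical summary.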