# Flipkart Laptop Reviews – Exploratory Data Analysis

This project performs an exploratory analysis of customer reviews for laptops listed on Flipkart. The dataset contains fields such as product name, overall rating, number of ratings and reviews, individual review rating, review title, and review text.

The goal is to identify patterns and meaningful insights from customer feedback. These insights can help in decision-making related to product development, customer satisfaction, or recommendation systems.

The steps in this notebook include:
- Setting up the environment
- Loading and verifying the dataset
- Exploring and cleaning the data
- Visualizing important trends
- Highlighting key patterns affecting laptop reviews


### Step 1 – Environment Setup and File Encoding

This project is built using:
- Python (via Anaconda)
- Jupyter Notebook (opened in VS Code)

The CSV file is loaded with UTF-8 encoding to avoid problems with special characters such as emojis or symbols in the text data.

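UTF-8 is the most widely used character encoding and covers the full Unicode character set, so it supports almost every language and symbol. If the file is read with the wrong encoding, the text may show corrupted characters (like �) or the load may fail with a decoding error.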

To verify the encoding:
- Open the file using a text editor like Notepad++
- Check the "Encoding" menu to confirm it's set to UTF-8
- If needed, re-save the file with UTF-8 encoding

In pandas, UTF-8 is set by passing `encoding='utf-8'` to `read_csv`. UTF-8 is already the default for `read_csv`, so passing it explicitly mainly documents the intent.
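
The manual check above can also be done from Python. The following is a minimal sketch, not part of the original notebook: the helper name `is_valid_utf8` and the two throwaway files are assumptions made for illustration. It reads the raw bytes and reports whether they decode cleanly as UTF-8.

```python
import os
import tempfile


def is_valid_utf8(path):
    """Return True if the whole file decodes as UTF-8, else False (hypothetical helper)."""
    with open(path, "rb") as fh:
        raw = fh.read()
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Demonstration on two small throwaway files (not the project dataset)
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.csv")
    bad = os.path.join(tmp, "bad.csv")
    with open(good, "w", encoding="utf-8") as fh:
        fh.write("review\nGreat laptop ❤️\n")
    with open(bad, "wb") as fh:
        # Latin-1 bytes for "é" are not a valid UTF-8 sequence
        fh.write("review\ncafé\n".encode("latin-1"))
    good_ok = is_valid_utf8(good)
    bad_ok = is_valid_utf8(bad)

print(good_ok, bad_ok)  # True False
```

A passing check only shows that the bytes are valid UTF-8. A plain ASCII file also passes, so this does not prove which encoding the file was originally saved with.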


#### Required Libraries


In [None]:
# Importing core libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Enable inline plotting
%matplotlib inline

# Set visual theme
sns.set(style="whitegrid")

# Set display options for better viewing
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

### Step 2 – Loading the Dataset

The dataset file `laptops_dataset_final_600.csv` is read using pandas with UTF-8 encoding. This ensures proper interpretation of all characters, especially in text-based columns like laptop names or reviews.


In [None]:
# Alternative: load without the encoding parameter (kept commented out for reference)
# df_raw = pd.read_csv('laptops_dataset_final_600.csv')
# print("Loaded successfully without encoding parameter.")
# display(df_raw.head())

In [None]:
# Recommended way — read with UTF-8 encoding
df = pd.read_csv('laptops_dataset_final_600.csv', encoding='utf-8')
print("Data loaded successfully using UTF-8 encoding.")

# Show the first few records
df.head()


In [None]:
# Last few records
print("Last 5 rows: ")
df.tail()


### Step 3 – Dataset Overview and Structure

The next step involves checking:
- Number of rows and columns
- Names and data types of all columns
- First and last few records
- Missing and duplicate data
- Unique values in important columns
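
As a compact alternative, several of these checks can be collected into one table. This is only a sketch under stated assumptions: the `summarize` helper and the toy frame with made-up values are illustrative and do not come from the dataset.

```python
import pandas as pd


def summarize(frame):
    """Return one row per column with its dtype, missing count, and unique count."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "missing": frame.isnull().sum(),
        "unique": frame.nunique(),  # nunique() ignores missing values
    })


# Toy frame standing in for the review data (values are made up)
toy = pd.DataFrame({
    "product_name": ["Laptop A", "Laptop A", "Laptop B"],
    "rating": [5, 4, None],
    "title": ["Good", "Good", "Okay"],
})

summary = summarize(toy)
print(f"Shape: {toy.shape}, duplicate rows: {toy.duplicated().sum()}")
print(summary)
```

In the notebook, the same helper could be called on `df` once the data is loaded.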


In [None]:
# Show shape
print(f"Shape of the DataFrame: {df.shape}")

In [None]:
# Show all column names
print(f"Column names: {df.columns.tolist()}")

In [None]:
# Data types of each column
print("Data types: ")
print(df.dtypes)

In [None]:
print("Missing values in each column:")
print(df.isnull().sum())

In [None]:
# Duplicate rows
print(f"Number of duplicate rows: {df.duplicated().sum()}")

In [None]:
print(df['review'].iloc[0])  # Contains emoji or symbols?

- Encoding Display Test: The dataset was read using UTF-8 encoding. Special characters and emojis in the review text (such as 🥺❤️) are displayed correctly, confirming that no encoding issues are present.


### Step 4 – Understanding Key Columns

This dataset contains customer review information for various laptops listed on Flipkart. Below is a quick description of the available columns:

- `product_name`: The name and model of the laptop.
- `overall_rating`: The average rating given by all users for that product.
- `no_ratings`: Total number of users who rated the product.
- `no_reviews`: Total number of written reviews.
- `rating`: Rating given in this specific review (e.g., 1 to 5 stars).
- `title`: Title/summary of the review.
- `review`: Full content of the review.

These columns focus on customer feedback rather than hardware specifications. As such, analysis will focus on trends in ratings, review counts, and review content.


In [None]:
# Number of unique products
print("Number of unique laptop models:", df['product_name'].nunique())

In [None]:
# First five unique product names
print("Sample product names:")
print(df['product_name'].unique()[:5])

In [None]:
# Distribution of individual review ratings
print("Rating distribution:")
print(df['rating'].value_counts().sort_index())

In [None]:
# Unique overall ratings
print("Unique average (overall) ratings:")
print(df['overall_rating'].unique())

In [None]:
# First ten unique review titles
print("Sample review titles:")
print(df['title'].unique()[:10])

In [None]:
# First three unique review texts
print("Sample full reviews:")
print(df['review'].unique()[:3])

#### Checking for Whitespace

This step checks if any text fields have unnecessary leading or trailing spaces. If any are found, they will be removed to clean the data.


In [None]:
# Count values with leading/trailing whitespace in each text column
for col in ['product_name', 'title', 'review']:
    values = df[col].dropna().astype(str)  # skip missing values, which have no .strip()
    print(f"{col}: {(values != values.str.strip()).sum()}")

All key text fields (`product_name`, `title`, and `review`) were checked for leading and trailing whitespace. No issues were found, so no cleaning was necessary for this part.
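
Had any values been flagged, a cleanup along the following lines would remove the extra spaces. This is a sketch on a toy frame with made-up values, not the notebook's actual data.

```python
import pandas as pd

# Toy frame (made-up values) with stray spaces in two fields
toy = pd.DataFrame({
    "product_name": ["  Laptop A", "Laptop B"],
    "title": ["Good ", None],
})

text_cols = ["product_name", "title"]
for col in text_cols:
    # .str.strip() trims both ends and leaves missing values as missing
    toy[col] = toy[col].str.strip()

print(toy["product_name"].tolist())  # ['Laptop A', 'Laptop B']
```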


### Step 5 – Data Cleaning

To prepare the dataset for accurate analysis, the following actions are performed:
- Remove duplicate records
- Convert `no_ratings` and `no_reviews` columns to numeric types by removing non-numeric characters like commas and plus signs


In [None]:
print(f"Number of duplicate rows: {df.duplicated().sum()}")
print("Shape of the DataFrame: ", df.shape)

# Work on a copy so the original DataFrame stays intact
df_cp = df.copy()
print("Shape of the New DataFrame: ",df_cp.shape)

In [None]:
# Remove duplicate rows
df_cp.drop_duplicates(inplace=True)

# Confirm updated shape
print("Shape after removing duplicates:", df_cp.shape)

- Removing Duplicate Rows: The original dataset had 24,113 rows. After checking for duplicates, 7,122 duplicate rows were found and removed. The cleaned dataset now contains 16,991 unique records. A separate copy `df_cp` was created for cleaning to keep the original data intact.
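
The reported figures are consistent: 24,113 − 7,122 = 16,991. The sketch below shows why this relationship always holds, using a toy frame with made-up values. `duplicated()` marks every repeat after the first occurrence, so the number of rows kept by `drop_duplicates()` equals the original row count minus the duplicate count.

```python
import pandas as pd

# Toy frame (made-up values): rows 1 and 3 repeat row 0
toy = pd.DataFrame({
    "product_name": ["Laptop A", "Laptop A", "Laptop B", "Laptop A"],
    "rating": [5, 5, 4, 5],
})

n_before = len(toy)
n_dupes = int(toy.duplicated().sum())  # repeats after the first occurrence
deduped = toy.drop_duplicates()

# The first occurrence of each row is kept, so the counts must add up
assert n_before - n_dupes == len(deduped)
print(n_before, n_dupes, len(deduped))  # 4 2 2
```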


#### Cleaning Count Columns (`no_ratings` and `no_reviews`)

The columns `no_ratings` and `no_reviews` originally contain text characters such as commas and plus signs (e.g., "1,234+", "100+"). These need to be converted into numeric types to support meaningful analysis.
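
To illustrate the conversion on the example strings above, here is a minimal sketch. It assumes, beyond what the dataset description states, that some values could be missing or malformed. It therefore uses `pd.to_numeric` with `errors="coerce"` and the nullable `Int64` dtype, so a missing value stays missing instead of raising an error.

```python
import pandas as pd

# Sample strings in the format described above, plus one missing value (assumption)
raw = pd.Series(["1,234+", "100+", None], dtype="object")

# Drop every non-digit character, then convert to numbers.
# errors="coerce" turns anything unparseable into NaN instead of failing.
digits_only = raw.str.replace(r"[^\d]", "", regex=True)
cleaned = pd.to_numeric(digits_only, errors="coerce").astype("Int64")

print(cleaned)
```

If the columns are known to be complete and well formed, a plain `astype(int)` after removing the commas and plus signs is sufficient.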


In [None]:
df_cp.dtypes

In [None]:
# Preview the format of the values before cleaning
print("Sample no_ratings values (before):", df_cp['no_ratings'].unique()[:5])
print("Sample no_reviews values (before):", df_cp['no_reviews'].unique()[:5])

In [None]:
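# Function to clean count columns: strip commas and plus signs, then cast to int
def clean_count_column(col):
    # astype(str) lets this cell be re-run after the columns are already numeric
    return col.astype(str).str.replace(',', '', regex=False).str.replace('+', '', regex=False).astype(int)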

# Apply cleaning to both columns in df_cp
df_cp['no_ratings'] = clean_count_column(df_cp['no_ratings'])
df_cp['no_reviews'] = clean_count_column(df_cp['no_reviews'])


In [None]:
# Check the data types and preview cleaned values
print("Data types after cleaning:")
print(df_cp[['no_ratings', 'no_reviews']].dtypes)

print("\nSample cleaned values:")
print(df_cp[['no_ratings', 'no_reviews']].head())

print("Sample no_ratings values (after):", df_cp['no_ratings'].unique()[:5])
print("Sample no_reviews values (after):", df_cp['no_reviews'].unique()[:5])
