**Introduction:** Laptop Price Prediction

This project aims to develop a model for predicting laptop prices based on their hardware specifications.

We will analyze a dataset containing various features of laptops, such as RAM, storage (ROM), CPU, and GPU. Through exploratory data analysis (EDA), we will investigate the relationships between these features and the target variable, which is the laptop price.

Following the EDA, we will train different machine learning models to predict prices accurately. Finally, we will deploy the best performing model as a web application using Streamlit, allowing users to easily estimate laptop prices based on their desired configurations.

_____________________

## Streamlit App: 

https://laptop-prediction.streamlit.app/

____

# Importing Libraries

In [1]:
# importing libraries

# Data Manipulation and Visualisation
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go

# ignore warnings
import warnings
warnings.filterwarnings('ignore')

# Importing Our `Laptop Price` Data

In [30]:
df = pd.read_csv('data.csv')

* Let's check the first five rows of our data.

In [31]:
df.head().T

Unnamed: 0,0,1,2,3,4
Unnamed: 0.1,0,1,2,3,4
Unnamed: 0,0,1,2,3,4
brand,HP,HP,Acer,Lenovo,Apple
name,Victus 15-fb0157AX Gaming Laptop,15s-fq5007TU Laptop,One 14 Z8-415 Laptop,Yoga Slim 6 14IAP8 82WU0095IN Laptop,MacBook Air 2020 MGND3HN Laptop
price,49900,39900,26990,59729,69990
spec_rating,73.0,60.0,69.323529,66.0,69.323529
processor,5th Gen AMD Ryzen 5 5600H,12th Gen Intel Core i3 1215U,11th Gen Intel Core i3 1115G4,12th Gen Intel Core i5 1240P,Apple M1
CPU,"Hexa Core, 12 Threads","Hexa Core (2P + 4E), 8 Threads","Dual Core, 4 Threads","12 Cores (4P + 8E), 16 Threads",Octa Core (4P + 4E)
Ram,8GB,8GB,8GB,16GB,8GB
Ram_type,DDR4,DDR4,DDR4,LPDDR5,DDR4


# Exploratory Data Analysis `(EDA)`

## Data Info:

In [32]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 893 entries, 0 to 892
Data columns (total 18 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Unnamed: 0.1       893 non-null    int64  
 1   Unnamed: 0         893 non-null    int64  
 2   brand              893 non-null    object 
 3   name               893 non-null    object 
 4   price              893 non-null    int64  
 5   spec_rating        893 non-null    float64
 6   processor          893 non-null    object 
 7   CPU                893 non-null    object 
 8   Ram                893 non-null    object 
 9   Ram_type           893 non-null    object 
 10  ROM                893 non-null    object 
 11  ROM_type           893 non-null    object 
 12  GPU                893 non-null    object 
 13  display_size       893 non-null    float64
 14  resolution_width   893 non-null    float64
 15  resolution_height  893 non-null    float64
 16  OS                 893 non

## Checking the Rows and Columns:

In [33]:
print(f"The Laptop Price Dataset has {df.shape[0]} rows and {df.shape[1]} columns")

The Laptop Price Dataset has 893 rows and 18 columns


## Checking Missing Values:

In [34]:
df.isnull().sum()

Unnamed: 0.1         0
Unnamed: 0           0
brand                0
name                 0
price                0
spec_rating          0
processor            0
CPU                  0
Ram                  0
Ram_type             0
ROM                  0
ROM_type             0
GPU                  0
display_size         0
resolution_width     0
resolution_height    0
OS                   0
warranty             0
dtype: int64

* There are no missing values in our dataset.

### Let's Drop the Unwanted Columns:

In [35]:
df.drop(['Unnamed: 0.1', 'Unnamed: 0'], axis=1, inplace=True)

## Descriptive Statistics:

In [36]:
df.describe()

Unnamed: 0,price,spec_rating,display_size,resolution_width,resolution_height,warranty
count,893.0,893.0,893.0,893.0,893.0,893.0
mean,79907.409854,69.379026,15.173751,2035.393057,1218.324748,1.079507
std,60880.043823,5.541555,0.939095,426.076009,326.756883,0.326956
min,9999.0,60.0,11.6,1080.0,768.0,0.0
25%,44500.0,66.0,14.0,1920.0,1080.0,1.0
50%,61990.0,69.323529,15.6,1920.0,1080.0,1.0
75%,90990.0,71.0,15.6,1920.0,1200.0,1.0
max,450039.0,89.0,18.0,3840.0,3456.0,3.0


`Price`: ranges from 9999.00 to 450039.00, with an average of 79907.41. \
`Specs Rating`: ranges from 60.0 to 89.0, with an average of 69.38. \
`Display Size`: ranges from 11.6 to 18.0, with an average of 15.17. \
`Resolution Width`: ranges from 1080.0 to 3840.0. \
`Resolution Height`: ranges from 768.0 to 3456.0. \
`Warranty`: ranges from 0 to 3.

## How Many Laptop Brands Are There?

In [37]:
# Visualising the brands in the dataset
# (the figure is named `fig` so it does not overwrite matplotlib's `plt` alias)
fig = px.histogram(df, x="brand", title="Brand Distribution",
                   color="brand", color_discrete_sequence=px.colors.qualitative.Pastel)
fig.show()

### Top 3 Laptop Brands:
1. HP
2. Lenovo
3. Asus
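
The ranking read off the histogram can also be checked numerically. Here is a minimal sketch, assuming a small made-up frame in place of `df` (the `toy` rows are illustrative, not the real data):

```python
import pandas as pd

# Hypothetical stand-in for df; only the 'brand' column matters here.
toy = pd.DataFrame({"brand": ["HP", "HP", "HP", "Lenovo", "Lenovo", "Asus"]})

# value_counts() sorts by frequency in descending order by default,
# so head(3) gives the three most common brands.
top_brands = toy["brand"].value_counts().head(3)
print(top_brands)
```

The same call should work for the other categorical columns, such as `processor`, `CPU`, or `GPU`.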

## Distribution of Specs Ratings:

In [38]:
# Visualising the distribution of specs rating in the dataset

fig = px.scatter(df, x="spec_rating", title="Specs Rating Distribution",
                 color_discrete_sequence=px.colors.qualitative.Pastel)
# Drawing the mean line
fig.add_shape(type='line', x0=df['spec_rating'].mean(), y0=0,
              x1=df['spec_rating'].mean(), y1=1000, line=dict(color='red', dash='dot'))
fig.show()

`Spec Rating` ranges from 60.0 to 89.0, with an average of 69.38.

### Spec Rating Distribution with Brands:

In [39]:
# Visualising the spec rating distribution by brand to check which brands have the highest ratings
fig = px.scatter(df, x="spec_rating", y="brand", title="Specs Rating Distribution",
                 color_discrete_sequence=px.colors.qualitative.Pastel)
fig.show()

Top 4 brands with the highest spec ratings:
1. HP
2. Lenovo
3. Asus
4. MSI

## Let's Check out these Top 10 Laptops:

In [40]:
# selecting the top specs rating laptops
df.sort_values(by='spec_rating', ascending=False).head(10).T

Unnamed: 0,562,890,253,736,697,543,823,549,332,489
brand,MSI,Asus,Lenovo,HP,MSI,HP,Dell,HP,Asus,Acer
name,CreatorPro Z16 HX B13VKTO-214IN Laptop,ROG Zephyrus G14 2023 GA402XV-N2034WS Gaming L...,Legion Slim 7 16IRH8 82Y3007QIN Gaming Laptop,ZBook Studio G9 16 Workstation WQUXGA Laptop,Vector GP68HX 13VH-072IN Gaming Laptop,Omen 17-ck2011TX Gaming Laptop,XPS 9530 2023 Laptop,Omen 16-u0024TX Gaming Laptop,Vivobook Pro 16 OLED 2023 K6602VU-LZ952WS Laptop,Predator Triton 500 SE PT516-52s NH.QFQSI.001 ...
price,419990,189990,194990,240707,284990,362999,278290,298999,159990,146990
spec_rating,89.0,89.0,89.0,89.0,89.0,88.0,88.0,88.0,88.0,86.0
processor,13th Gen Intel Core i9 13950HX,7th Gen AMD Ryzen 9 7940HS,13th Gen Intel Core i9 13900H,12th Gen Intel Core i9 12900H,13th Gen Intel Core i9 13980HX,13th Gen Intel Core i9 13900HX,13th Gen Intel Core i7 13700H,13th Gen Intel Core i9 13900HX,13th Gen Intel Core i9 13900H,12th Gen Intel Core i7 12700H
CPU,24 Cores (8P + 16E),"Octa Core, 16 Threads","14 Cores (6P + 8E), 20 Threads","14 Cores (6P + 8E), 20 Threads","24 Cores (8P + 16E), 32 Threads","24 Cores (8P + 16E), 32 Threads","14 Cores (6P + 8E), 20 Threads","24 Cores (8P + 16E), 32 Threads","14 Cores (6P + 8E), 20 Threads","14 Cores (6P + 8E), 20 Threads"
Ram,64GB,32GB,16GB,32GB,32GB,32GB,32GB,32GB,16GB,32GB
Ram_type,DDR5,DDR5,DDR5,DDR5,DDR5,DDR5,DDR5,DDR5,DDR5,LPDDR5
ROM,2TB,1TB,1TB,1TB,1TB,1TB,1TB,2TB,1TB,2TB
ROM_type,SSD,SSD,SSD,SSD,SSD,SSD,SSD,SSD,SSD,SSD


## Processors Distribution:

In [41]:
fig = px.histogram(df, x='processor', title="Processor Distribution", color_discrete_sequence=px.colors.qualitative.Pastel)
fig.show()

Top 3 processors by count:
1. 12th Gen Intel Core i5 1235U
2. 13th Gen Intel Core i5 1335U
3. 12th Gen Intel Core i3 1215U

## CPU Distribution:

In [42]:
fig = px.histogram(df, x='CPU', title="CPU Distribution", color_discrete_sequence=px.colors.qualitative.Pastel)
fig.show()

Top 3 CPUs by count:
1. Quad Core, 8 Threads
2. Hexa Core, 12 Threads
3. 10 Core (2P + 8E), 12 Threads

## RAM Distribution:

The Distribution of RAM is Based on RAM Size and Type.

In [43]:
fig = px.histogram(df, x='Ram_type', color='Ram', title="Ram Distribution", color_discrete_sequence=px.colors.qualitative.Pastel)
fig.show()

__The Distribution of RAM is as follows:__\
The most common configurations are `8GB of DDR4` and `16GB of DDR4, LPDDR5, LPDDR4X, and DDR5`.

## ROM Distribution:

The Distribution of ROM is Based on ROM Size and Type.

In [44]:
fig = px.histogram(df, x='ROM_type', color='ROM', title="ROM Distribution", color_discrete_sequence=px.colors.qualitative.Pastel)
fig.show()

__The Distribution of ROM is as follows:__\
There are two storage types: SSD and Hard-Disk (HDD). \
Most of the ROMs are SSD. \
The most common `SSD` sizes are `512GB and 1TB`.\
The most common `HDD` size is `1TB`.

## GPU Distribution:

In [45]:
fig = px.histogram(df, x='GPU', color='GPU', title="GPU Distribution")
fig.show()

Top 3 GPUs by count:
1. Intel Iris Xe Graphics
2. Intel UHD Graphics
3. Intel Integrated UHD

## Display Distribution:

In [46]:
fig = px.histogram(df, x='display_size', color='display_size', title="Display Size Distribution", color_discrete_sequence=px.colors.qualitative.Pastel)
fig.show()

Most common display sizes: 15.6 and 14 inches

## OS Distribution:

In [47]:
fig = px.histogram(df, x='OS', color='OS', title="OS Distribution", color_discrete_sequence=px.colors.qualitative.Pastel)
fig.show()

Most Common OS: `Windows 11 and Windows 10`

## Let's Plot Price VS Other Features:

In [48]:
# plotting the price distribution
fig = px.scatter(df, x='price', color='price', title="Price Distribution", color_discrete_sequence=px.colors.qualitative.Pastel)
fig.show()

In [49]:
# Plot price against every other feature in a loop
for col in df.columns:
    if col != 'price':
        fig = px.scatter(df, x='price', y=col, title=f"{col} vs Price", color_discrete_sequence=px.colors.qualitative.Pastel)
        fig.show()

### Conclusion:
The price of a laptop depends on its specifications as a whole: RAM, ROM, CPU, GPU, display, OS, and so on all matter.
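
For the numeric columns, one way to put a number on this relationship is the Pearson correlation with `price`. The sketch below uses a few made-up rows in place of `df` (the `toy` values are illustrative only). It does not cover categorical columns such as `brand` or `GPU`.

```python
import pandas as pd

# Hypothetical rows standing in for df; values are illustrative only.
toy = pd.DataFrame({
    "brand": ["HP", "Acer", "Lenovo", "MSI"],
    "price": [39900, 26990, 59729, 284990],
    "display_size": [15.6, 14.0, 14.0, 16.0],
    "resolution_width": [1920, 1920, 2240, 2560],
})

# Keep numeric columns only, then correlate each one with price.
price_corr = (
    toy.select_dtypes(include="number")
       .corr()["price"]
       .drop("price")
       .sort_values(ascending=False)
)
print(price_corr)
```

A value close to 1 or -1 suggests a strong linear relationship. A value near 0 only rules out a linear one.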

# Making Predictions (Machine Learning Part)

In [50]:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from xgboost import XGBRegressor
from sklearn.svm import SVR
from sklearn.metrics import r2_score
import joblib

### Separating the Columns:

Selecting the Numerical and Categorical Columns.

In [51]:
# Define the columns
categorical_columns = df.select_dtypes(include=['object']).columns.tolist()
numerical_columns = df.select_dtypes(include=['int64', 'float64']).columns.tolist()
numerical_columns.remove('price')  # Exclude the target column from numerical columns
numerical_columns.remove('spec_rating') # We don't need this column, as the ratings are based on the RAM, ROM, CPU, GPU etc.


### Scaling and Encoding:

In [52]:

# Create transformers for numerical and categorical features
numerical_transformer = StandardScaler()
categorical_transformer = OneHotEncoder(handle_unknown='ignore')

# Combine transformers into a ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_columns),
        ('cat', categorical_transformer, categorical_columns)
    ])
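
To see what this preprocessor produces, and what `handle_unknown='ignore'` does with a category that was absent during fitting, here is a small sketch on made-up rows. The two-column `train` frame and the unseen brand `"Razer"` are illustrative assumptions, not taken from the dataset.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical training rows with one numeric and one categorical column.
train = pd.DataFrame({
    "brand": ["HP", "Asus", "HP"],
    "display_size": [14.0, 15.6, 17.2],
})
# A new row whose brand was never seen during fit.
new_row = pd.DataFrame({"brand": ["Razer"], "display_size": [15.6]})

pre = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["display_size"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["brand"]),
    ])
pre.fit(train)

encoded = pre.transform(new_row)
if hasattr(encoded, "toarray"):  # the output can be sparse in some configurations
    encoded = encoded.toarray()

# One scaled numeric column followed by one column per known brand (Asus, HP).
# The unseen brand is encoded as all zeros instead of raising an error.
print(encoded)
```

The numeric value is centred on the training mean, so a 15.6-inch display maps to roughly zero here because 15.6 is the mean of the toy training rows.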


## Splitting the Dataset:

In [53]:
# Separate the features from the target and hold out 20% of the rows for testing
X = df.drop(columns=['spec_rating', 'price'])
y = df['price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the preprocessor on the training split only
preprocessor.fit(X_train)

# Save the preprocessor
joblib.dump(preprocessor, 'preprocessor.pkl')


['preprocessor.pkl']
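
With 893 rows and `test_size=0.2`, scikit-learn rounds the test share up, so the split should come out at 714 training rows and 179 test rows. A quick check on placeholder arrays of the same length (the `X_demo` and `y_demo` names are made up for this sketch):

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 893  # row count reported by df.shape earlier in the notebook
X_demo = np.arange(n_rows).reshape(-1, 1)  # placeholder features
y_demo = np.arange(n_rows)                 # placeholder target

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(len(X_tr), len(X_te))
```

A test set of this size is fairly small, so the R^2 scores reported on it can shift noticeably with a different `random_state`.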

## Defining the Models and Training:

In [54]:
# Train and save multiple models
models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=100, max_depth=100, random_state=42),
    "XGBoost": XGBRegressor(n_estimators=100, random_state=42),
    "SVR": SVR()
}

best_model = None
best_score = float('-inf')
best_model_name = ""

for name, model in models.items():
    model.fit(preprocessor.transform(X_train), y_train)
    y_pred = model.predict(preprocessor.transform(X_test))
    r2 = r2_score(y_test, y_pred)
    print(f"{name} R^2 score: {r2}")
    if r2 > best_score:
        best_score = r2
        best_model = model
        best_model_name = name

# Save the best model
joblib.dump(best_model, 'best_model.pkl')

print(f"Best model: {best_model_name} with R^2 score: {best_score}")

Linear Regression R^2 score: 0.8705190983697756
Random Forest R^2 score: 0.8267143889892317
XGBoost R^2 score: 0.8404990037693847
SVR R^2 score: -0.0607171938986637
Best model: Linear Regression with R^2 score: 0.8705190983697756
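
The negative score for SVR is not an error in the metric. R^2 is computed as 1 - SS_res / SS_tot, so it falls below zero whenever a model's squared error on the test set is larger than that of always predicting the mean of `y_test`. A small sketch with made-up prices:

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([30000.0, 60000.0, 90000.0])  # hypothetical prices

# Baseline: always predict the mean of the true values.
mean_pred = np.full_like(y_true, y_true.mean())
# A constant prediction that sits below the mean.
flat_low = np.full_like(y_true, 45000.0)

r2_mean = r2_score(y_true, mean_pred)  # SS_res == SS_tot, so R^2 is 0
r2_low = r2_score(y_true, flat_low)    # SS_res > SS_tot, so R^2 is negative
print(r2_mean, r2_low)
```

A score just below zero, like the one above for the default `SVR()`, would be consistent with a model that predicts nearly the same price for every laptop. This is probably because the default settings are poorly matched to an unscaled target in the tens of thousands, though the notebook does not verify it.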


## Hyperparameter Tuning:

In [55]:
from sklearn.model_selection import RandomizedSearchCV

In [56]:
# Define the models and hyperparameter grids
models = {
    "Linear Regression": (LinearRegression(), {}),
    "Random Forest": (RandomForestRegressor(random_state=42), {
        'n_estimators': [50, 100, 200],
        'max_depth': [None, 10, 20, 30, 40, 50]
    }),
    "XGBoost": (XGBRegressor(random_state=42), {
        'n_estimators': [50, 100, 200],
        'max_depth': [3, 4, 5, 6, 7],
        'learning_rate': [0.01, 0.1, 0.2, 0.3]
    }),
    "SVR": (SVR(), {
        'kernel': ['linear', 'rbf'],
        'C': [0.1, 1, 10, 100],
        'gamma': ['scale', 'auto']
    })
}

best_model = None
best_score = float('-inf')
best_model_name = ""

for name, (model, param_grid) in models.items():
    if param_grid:
        search = RandomizedSearchCV(model, param_grid, n_iter=10, cv=5, scoring='r2', random_state=42, n_jobs=-1)
        search.fit(preprocessor.transform(X_train), y_train)
        best_model_for_name = search.best_estimator_
    else:
        best_model_for_name = model
        best_model_for_name.fit(preprocessor.transform(X_train), y_train)
    
    y_pred = best_model_for_name.predict(preprocessor.transform(X_test))
    r2 = r2_score(y_test, y_pred)
    print(f"{name} R^2 score: {r2}")
    if r2 > best_score:
        best_score = r2
        best_model = best_model_for_name
        best_model_name = name

# Save the best tuned model
joblib.dump(best_model, 'hypertuned_best_model.pkl')

print(f"Best model: {best_model_name} with R^2 score: {best_score}")

Linear Regression R^2 score: 0.8705190983697756
Random Forest R^2 score: 0.8292279226472555
XGBoost R^2 score: 0.8532602840781043
SVR R^2 score: 0.5420927530121156
Best model: Linear Regression with R^2 score: 0.8705190983697756
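
The preprocessor and the models are saved as separate files, so anything that serves predictions, such as the Streamlit app, has to load both and apply them in the same order. One alternative, sketched below on made-up rows, is to wrap both steps in a scikit-learn `Pipeline` and save a single file. The `toy` frame, the step names, and the file name `price_pipeline.pkl` are illustrative assumptions, not part of the original project.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical rows standing in for the training data.
toy = pd.DataFrame({
    "brand": ["HP", "Asus", "HP", "Lenovo"],
    "display_size": [14.0, 15.6, 15.6, 16.0],
    "price": [39900, 59990, 49900, 89990],
})
X_toy, y_toy = toy.drop(columns="price"), toy["price"]

# Preprocessing and model bundled into one estimator.
pipe = Pipeline([
    ("prep", ColumnTransformer([
        ("num", StandardScaler(), ["display_size"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["brand"]),
    ])),
    ("model", LinearRegression()),
])
pipe.fit(X_toy, y_toy)

# Save and reload as a single artifact (written to a temp directory here).
path = os.path.join(tempfile.mkdtemp(), "price_pipeline.pkl")
joblib.dump(pipe, path)
restored = joblib.load(path)

# Raw, untransformed input goes straight into predict().
query = pd.DataFrame({"brand": ["HP"], "display_size": [15.6]})
prediction = restored.predict(query)
print(prediction)
```

With this layout the consumer passes a raw DataFrame with the training column names and never calls `transform` itself.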
