# Laptop Dataset Cleaning

This notebook is a practice exercise in dataset cleaning and Exploratory Data Analysis (EDA). It covers three steps:
1) Data collection
2) Data checks to perform
3) Exploratory data analysis

Data source - https://www.kaggle.com/datasets/ehtishamsadiq/uncleaned-laptop-price-dataset

The dataset consists of 12 columns and 1303 rows

#### Importing all the libraries and getting the path for the dataset

In [40]:
import os
import sys
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [41]:
dir_path = os.getcwd()
file_path = os.path.join(dir_path,"laptopData.csv")
df = pd.read_csv(file_path)

In [42]:
df.head()

Unnamed: 0.1,Unnamed: 0,Company,TypeName,Inches,ScreenResolution,Cpu,Ram,Memory,Gpu,OpSys,Weight,Price
0,0.0,Apple,Ultrabook,13.3,IPS Panel Retina Display 2560x1600,Intel Core i5 2.3GHz,8GB,128GB SSD,Intel Iris Plus Graphics 640,macOS,1.37kg,71378.6832
1,1.0,Apple,Ultrabook,13.3,1440x900,Intel Core i5 1.8GHz,8GB,128GB Flash Storage,Intel HD Graphics 6000,macOS,1.34kg,47895.5232
2,2.0,HP,Notebook,15.6,Full HD 1920x1080,Intel Core i5 7200U 2.5GHz,8GB,256GB SSD,Intel HD Graphics 620,No OS,1.86kg,30636.0
3,3.0,Apple,Ultrabook,15.4,IPS Panel Retina Display 2880x1800,Intel Core i7 2.7GHz,16GB,512GB SSD,AMD Radeon Pro 455,macOS,1.83kg,135195.336
4,4.0,Apple,Ultrabook,13.3,IPS Panel Retina Display 2560x1600,Intel Core i5 3.1GHz,8GB,256GB SSD,Intel Iris Plus Graphics 650,macOS,1.37kg,96095.808


In [43]:
df.shape

(1303, 12)

## All the things we need to check for

1) Check Missing values

2) Check Duplicates

3) Check data type

4) Check the number of unique values of each column

5) Check statistics of data set

6) Check the categories present in each categorical column


#### Missing Values

Every column reports exactly 30 missing values, and the 30 affected rows are entirely empty (every column is NaN), so we drop those rows.

In [44]:
df.isnull().sum()

Unnamed: 0          30
Company             30
TypeName            30
Inches              30
ScreenResolution    30
Cpu                 30
Ram                 30
Memory              30
Gpu                 30
OpSys               30
Weight              30
Price               30
dtype: int64
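
Identical per-column counts are consistent with 30 fully empty rows, but on their own they do not prove it. A minimal sketch of a direct check follows, using a small hypothetical frame (`toy`) in place of `laptopData.csv`:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the laptop data: two of the four rows are entirely empty.
toy = pd.DataFrame({
    "Company": ["Apple", np.nan, "HP", np.nan],
    "Ram": ["8GB", np.nan, "8GB", np.nan],
    "Price": [71378.68, np.nan, 30636.0, np.nan],
})

rows_with_any_nan = toy.isnull().any(axis=1).sum()  # rows with at least one missing value
rows_all_nan = toy.isnull().all(axis=1).sum()       # rows where every column is missing

# If the two counts match, every incomplete row is completely empty,
# so dropna() and dropna(how="all") remove exactly the same rows.
print(rows_with_any_nan, rows_all_nan)
same_rows = toy.dropna().equals(toy.dropna(how="all"))
print(same_rows)
```

Running the same two counts on `df` would show whether a plain `dropna()` discards any partially filled rows.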

In [45]:
df_1 = df.dropna()
df_1

Unnamed: 0.1,Unnamed: 0,Company,TypeName,Inches,ScreenResolution,Cpu,Ram,Memory,Gpu,OpSys,Weight,Price
0,0.0,Apple,Ultrabook,13.3,IPS Panel Retina Display 2560x1600,Intel Core i5 2.3GHz,8GB,128GB SSD,Intel Iris Plus Graphics 640,macOS,1.37kg,71378.6832
1,1.0,Apple,Ultrabook,13.3,1440x900,Intel Core i5 1.8GHz,8GB,128GB Flash Storage,Intel HD Graphics 6000,macOS,1.34kg,47895.5232
2,2.0,HP,Notebook,15.6,Full HD 1920x1080,Intel Core i5 7200U 2.5GHz,8GB,256GB SSD,Intel HD Graphics 620,No OS,1.86kg,30636.0000
3,3.0,Apple,Ultrabook,15.4,IPS Panel Retina Display 2880x1800,Intel Core i7 2.7GHz,16GB,512GB SSD,AMD Radeon Pro 455,macOS,1.83kg,135195.3360
4,4.0,Apple,Ultrabook,13.3,IPS Panel Retina Display 2560x1600,Intel Core i5 3.1GHz,8GB,256GB SSD,Intel Iris Plus Graphics 650,macOS,1.37kg,96095.8080
...,...,...,...,...,...,...,...,...,...,...,...,...
1298,1298.0,Lenovo,2 in 1 Convertible,14,IPS Panel Full HD / Touchscreen 1920x1080,Intel Core i7 6500U 2.5GHz,4GB,128GB SSD,Intel HD Graphics 520,Windows 10,1.8kg,33992.6400
1299,1299.0,Lenovo,2 in 1 Convertible,13.3,IPS Panel Quad HD+ / Touchscreen 3200x1800,Intel Core i7 6500U 2.5GHz,16GB,512GB SSD,Intel HD Graphics 520,Windows 10,1.3kg,79866.7200
1300,1300.0,Lenovo,Notebook,14,1366x768,Intel Celeron Dual Core N3050 1.6GHz,2GB,64GB Flash Storage,Intel HD Graphics,Windows 10,1.5kg,12201.1200
1301,1301.0,HP,Notebook,15.6,1366x768,Intel Core i7 6500U 2.5GHz,6GB,1TB HDD,AMD Radeon R5 M330,Windows 10,2.19kg,40705.9200


#### Duplicates

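All the duplicated rows are the all-null rows, so dropping the null rows also removes the duplicates. The count on the raw data is 29 rather than 30 because `duplicated()` does not flag the first occurrence of a repeated row.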

In [46]:
df.duplicated().sum()

29

In [47]:
df_1.duplicated().sum()

0
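
By default `duplicated()` flags every repeat of a row but not its first occurrence, which would explain why 30 identical all-NaN rows give a count of 29. A small illustration on a hypothetical frame:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: three identical all-NaN rows plus two distinct rows.
toy = pd.DataFrame({
    "Company": ["Apple", np.nan, np.nan, "HP", np.nan],
    "Price": [71378.68, np.nan, np.nan, 30636.0, np.nan],
})

# keep="first" (the default) leaves the first occurrence unflagged,
# so three identical rows contribute two duplicates.
n_dupes = toy.duplicated().sum()
n_dupes_after_dropna = toy.dropna().duplicated().sum()
print(n_dupes, n_dupes_after_dropna)
```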

In [48]:
df_1.drop_duplicates()

Unnamed: 0.1,Unnamed: 0,Company,TypeName,Inches,ScreenResolution,Cpu,Ram,Memory,Gpu,OpSys,Weight,Price
0,0.0,Apple,Ultrabook,13.3,IPS Panel Retina Display 2560x1600,Intel Core i5 2.3GHz,8GB,128GB SSD,Intel Iris Plus Graphics 640,macOS,1.37kg,71378.6832
1,1.0,Apple,Ultrabook,13.3,1440x900,Intel Core i5 1.8GHz,8GB,128GB Flash Storage,Intel HD Graphics 6000,macOS,1.34kg,47895.5232
2,2.0,HP,Notebook,15.6,Full HD 1920x1080,Intel Core i5 7200U 2.5GHz,8GB,256GB SSD,Intel HD Graphics 620,No OS,1.86kg,30636.0000
3,3.0,Apple,Ultrabook,15.4,IPS Panel Retina Display 2880x1800,Intel Core i7 2.7GHz,16GB,512GB SSD,AMD Radeon Pro 455,macOS,1.83kg,135195.3360
4,4.0,Apple,Ultrabook,13.3,IPS Panel Retina Display 2560x1600,Intel Core i5 3.1GHz,8GB,256GB SSD,Intel Iris Plus Graphics 650,macOS,1.37kg,96095.8080
...,...,...,...,...,...,...,...,...,...,...,...,...
1298,1298.0,Lenovo,2 in 1 Convertible,14,IPS Panel Full HD / Touchscreen 1920x1080,Intel Core i7 6500U 2.5GHz,4GB,128GB SSD,Intel HD Graphics 520,Windows 10,1.8kg,33992.6400
1299,1299.0,Lenovo,2 in 1 Convertible,13.3,IPS Panel Quad HD+ / Touchscreen 3200x1800,Intel Core i7 6500U 2.5GHz,16GB,512GB SSD,Intel HD Graphics 520,Windows 10,1.3kg,79866.7200
1300,1300.0,Lenovo,Notebook,14,1366x768,Intel Celeron Dual Core N3050 1.6GHz,2GB,64GB Flash Storage,Intel HD Graphics,Windows 10,1.5kg,12201.1200
1301,1301.0,HP,Notebook,15.6,1366x768,Intel Core i7 6500U 2.5GHz,6GB,1TB HDD,AMD Radeon R5 M330,Windows 10,2.19kg,40705.9200

#### Cleaning the Weight and Inches columns

Strip the `kg` suffix from `Weight`, drop the rows where `Weight` or `Inches` holds the placeholder `'?'`, then cast both columns to float.


In [49]:
df_1.loc[:,'Weight'] = df_1.loc[:,'Weight'].str.replace('kg','')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_1.loc[:,'Weight'] = df_1.loc[:,'Weight'].str.replace('kg','')
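
The warning above appears because `df_1` was derived from `df`, and pandas versions that emit it cannot tell whether the assignment is meant to reach the original frame. One common way to avoid it, sketched here on a hypothetical frame rather than the real data, is to take an explicit copy right after filtering:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the raw data.
raw = pd.DataFrame({"Weight": ["1.37kg", np.nan, "1.86kg"]})

# .copy() makes `clean` an independent DataFrame, so later assignments are unambiguous.
clean = raw.dropna().copy()
clean["Weight"] = clean["Weight"].str.replace("kg", "", regex=False)

print(clean["Weight"].tolist())
print(raw["Weight"].tolist())  # the original frame is left untouched
```

In this notebook the equivalent change would be `df_1 = df.dropna().copy()`.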


In [50]:
df_1 = df_1[df_1.Weight != '?']

In [51]:
df_1 = df_1[df_1.Inches != '?']

In [52]:
df_1['Weight'] = df_1['Weight'].astype(float)
df_1['Inches'] = df_1['Inches'].astype(float)
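
The cells above filter out the `'?'` placeholders and then cast. An alternative sketch, not the notebook's method, lets `pd.to_numeric` turn any unparseable entry into NaN so that all such rows can be dropped in one step. The frame below is hypothetical:

```python
import pandas as pd

# Hypothetical rows mimicking the '?' placeholders seen in Weight and Inches.
toy = pd.DataFrame({
    "Inches": ["13.3", "?", "15.6", "14"],
    "Weight": ["1.37", "2.1", "?", "1.5"],
})

for col in ["Inches", "Weight"]:
    # errors="coerce" converts anything that is not a number into NaN
    toy[col] = pd.to_numeric(toy[col], errors="coerce")

# Drop every row where either column failed to parse.
toy = toy.dropna(subset=["Inches", "Weight"])
print(toy.dtypes)
print(len(toy))
```

This approach also catches placeholders other than `'?'`, at the cost of hiding which raw strings were rejected.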

In [69]:
df = df_1.drop('Unnamed: 0',axis =1)

#### DataTypes Analysis

In [70]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1271 entries, 0 to 1302
Data columns (total 11 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Company           1271 non-null   object 
 1   TypeName          1271 non-null   object 
 2   Inches            1271 non-null   float64
 3   ScreenResolution  1271 non-null   object 
 4   Cpu               1271 non-null   object 
 5   Ram               1271 non-null   object 
 6   Memory            1271 non-null   object 
 7   Gpu               1271 non-null   object 
 8   OpSys             1271 non-null   object 
 9   Weight            1271 non-null   float64
 10  Price             1271 non-null   float64
dtypes: float64(3), object(8)
memory usage: 151.4+ KB
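
`Ram` is still stored as `object` because of its `GB` suffix (for example `8GB` in the rows shown earlier). The notebook does not convert it. A possible sketch, assuming every entry is an integer followed by `GB`:

```python
import pandas as pd

# Hypothetical sample of Ram strings in the format shown by df.head().
ram = pd.Series(["8GB", "16GB", "4GB"], name="Ram")

# Strip the unit suffix and cast; assumes every entry looks like "<integer>GB".
ram_gb = ram.str.replace("GB", "", regex=False).astype(int)
print(ram_gb.tolist())
```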


#### Check the number of unique values in each column

In [71]:
df.nunique()

Company              19
TypeName              6
Inches               24
ScreenResolution     40
Cpu                 118
Ram                  10
Memory               40
Gpu                 110
OpSys                 9
Weight              180
Price               776
dtype: int64

#### Check statistics of data set

In [72]:
df.describe()

Unnamed: 0,Inches,Weight,Price
count,1271.0,1271.0,1271.0
mean,15.132258,2.077852,59888.473922
std,1.95453,0.808083,37309.185217
min,10.1,0.0002,9270.72
25%,14.0,1.5,31914.72
50%,15.6,2.04,52054.56
75%,15.6,2.32,79274.2464
max,35.6,11.1,324954.72
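
The summary contains values that look implausible for laptops: a minimum `Weight` of 0.0002, a maximum `Weight` of 11.1, and a maximum `Inches` of 35.6. Below is a sketch for flagging such rows for manual review. The bounds are illustrative assumptions, not taken from the dataset's documentation, and the frame is hypothetical:

```python
import pandas as pd

# Hypothetical rows; the extreme values mirror the min/max reported by describe().
toy = pd.DataFrame({
    "Inches": [13.3, 15.6, 35.6, 14.0],
    "Weight": [1.37, 0.0002, 2.04, 11.1],
})

# Assumed plausibility bounds (inches and kg); adjust as needed.
bounds = {"Inches": (10.0, 18.5), "Weight": (0.5, 5.0)}

# Mark a row as suspicious if any column falls outside its bounds.
suspicious = pd.Series(False, index=toy.index)
for col, (low, high) in bounds.items():
    suspicious |= ~toy[col].between(low, high)

print(toy[suspicious])
```

Whether flagged rows should be corrected, dropped, or kept depends on the analysis. The sketch only surfaces them.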


### Exploring The Dataset

In [73]:
df.head()

Unnamed: 0,Company,TypeName,Inches,ScreenResolution,Cpu,Ram,Memory,Gpu,OpSys,Weight,Price
0,Apple,Ultrabook,13.3,IPS Panel Retina Display 2560x1600,Intel Core i5 2.3GHz,8GB,128GB SSD,Intel Iris Plus Graphics 640,macOS,1.37,71378.6832
1,Apple,Ultrabook,13.3,1440x900,Intel Core i5 1.8GHz,8GB,128GB Flash Storage,Intel HD Graphics 6000,macOS,1.34,47895.5232
2,HP,Notebook,15.6,Full HD 1920x1080,Intel Core i5 7200U 2.5GHz,8GB,256GB SSD,Intel HD Graphics 620,No OS,1.86,30636.0
3,Apple,Ultrabook,15.4,IPS Panel Retina Display 2880x1800,Intel Core i7 2.7GHz,16GB,512GB SSD,AMD Radeon Pro 455,macOS,1.83,135195.336
4,Apple,Ultrabook,13.3,IPS Panel Retina Display 2560x1600,Intel Core i5 3.1GHz,8GB,256GB SSD,Intel Iris Plus Graphics 650,macOS,1.37,96095.808


In [75]:
# Print the unique values of every column except Price
for col in df.columns.drop('Price'):
    print("Categories in '{}' variable:     ".format(col), end=" ")
    print(df[col].unique())

Categories in 'Company' variable:      ['Apple' 'HP' 'Acer' 'Asus' 'Dell' 'Lenovo' 'Chuwi' 'MSI' 'Microsoft'
 'Toshiba' 'Huawei' 'Xiaomi' 'Vero' 'Razer' 'Mediacom' 'Samsung' 'Google'
 'Fujitsu' 'LG']
Categories in 'TypeName' variable:      ['Ultrabook' 'Notebook' 'Gaming' '2 in 1 Convertible' 'Workstation'
 'Netbook']
Categories in 'Inches' variable:      [13.3 15.6 15.4 14.  12.  17.3 13.5 12.5 13.  18.4 13.9 11.6 25.6 35.6
 12.3 27.3 24.  33.5 31.6 17.  15.  14.1 11.3 10.1]
Categories in 'ScreenResolution' variable:      ['IPS Panel Retina Display 2560x1600' '1440x900' 'Full HD 1920x1080'
 'IPS Panel Retina Display 2880x1800' '1366x768'
 'IPS Panel Full HD 1920x1080' 'IPS Panel Retina Display 2304x1440'
 'IPS Panel Full HD / Touchscreen 1920x1080'
 'Full HD / Touchscreen 1920x1080' 'Touchscreen / Quad HD+ 3200x1800'
 'Touchscreen 2256x1504' 'Quad HD+ / Touchscreen 3200x1800'
 'IPS Panel 1366x768' 'IPS Panel 4K Ultra HD / Touchscreen 3840x2160'
 'IPS Panel Full HD 2160x1440' '4K Ultra

In [76]:
numeric_categories = [feature for feature in df.columns if df[feature].dtype != 'O']
categoric_categories = [feature for feature in df.columns if df[feature].dtype == 'O']

print("We have {} numeric features {}".format(len(numeric_categories),numeric_categories))
print("We have {} categoric features {}".format(len(categoric_categories),categoric_categories))

We have 3 numeric features ['Inches', 'Weight', 'Price']
We have 8 categoric features ['Company', 'TypeName', 'ScreenResolution', 'Cpu', 'Ram', 'Memory', 'Gpu', 'OpSys']
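
The same split can also be expressed with `DataFrame.select_dtypes`, which avoids comparing dtypes against `'O'` by hand. A minimal sketch on a hypothetical frame with the same mix of column types:

```python
import pandas as pd

# Hypothetical frame with the same mix of dtypes as the cleaned laptop data.
toy = pd.DataFrame({
    "Company": ["Apple", "HP"],
    "Inches": [13.3, 15.6],
    "Price": [71378.68, 30636.0],
})

# "number" matches all integer and float dtypes.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()
print(numeric_cols, categorical_cols)
```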


### Visualising the Dataset

In [None]:
fig,ax