# Transforming datasets (Laptop, CPU, GPU)

This notebook transforms the following datasets so that they are compatible with, and usable for, the upcoming analysis:

* laptop data scraped from Scorptec

* CPU data scraped from PassMark

* GPU data from PassMark, downloaded from https://www.kaggle.com/datasets/alanjo/gpu-benchmarks


The overall goal is to make matching CPU and GPU components with their benchmark data straightforward, and to create columns that are easier to query.

Details of each transformation are given under its heading.
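The matching goal can be pictured with a small sketch. The two frames and their values are made up for illustration; only the column names `CPU_MODEL`, `CPU Model` and `CPU Mark` mirror the ones used in this notebook.

```python
import pandas as pd

# Toy frames with made-up values; only the column names mirror this notebook.
laptops = pd.DataFrame({
    'NAME': ['Laptop A', 'Laptop B'],
    'CPU_MODEL': ['i7-12700H', '5600H'],
})
cpus = pd.DataFrame(
    {'CPU Mark': [26000, 17000]},
    index=pd.Index(['i7-12700H', '5600H'], name='CPU Model'),
)

# Once the keys agree, attaching a benchmark score is a single left join.
matched = laptops.merge(cpus, how='left', left_on='CPU_MODEL', right_index=True)
print(matched)
```

A left join keeps every laptop, so any model that fails to match shows up as a missing `CPU Mark` rather than silently disappearing.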

In [408]:
import pandas as pd
import numpy as np
import os, re, json

cwd = os.getcwd()

In [409]:
pd.options.mode.chained_assignment = None

## Laptop data transformation
The laptop data transformation involves:
- Creating columns for the laptop brand, cpu brand, and gpu brand.

- Extracting the CPU Model and GPU Model.

- Modifying the CPU and GPU columns so they match the keys of the CPU and GPU datasets.

- Making the display size, memory, and storage size queryable as numbers.

- Removing incorrect entries.

In [410]:
laptop_original = pd.read_csv(cwd + '/datasets/laptop_data/scorptec2023-04-07.csv', quotechar="'")
laptop_modify = laptop_original.copy()
laptop_modify

Unnamed: 0,NAME,PRODUCTCODE,PRICE,PROCESSOR,MEMORY,GRAPHICS,STORAGE,DISPLAY
0,Gigabyte A5 K1 Black 15.6inch Ryzen 5 RTX 3060...,A5 K1-AAU1130SB,1799,Ryzen 5-5600H,"16GB , RAM",GeForce RTX 3060 Max-P Design,512GB M.2 PCIE SSD,15.6inch FHD 144Hz
1,Razer Blade 14 Black 14inch Ryzen 9 RTX 3070 T...,RZ09-0427NEA3-R3B1,2699,Ryzen 9-6900HX,16GB DDR5 RAM,GeForce RTX 3070 Ti 8GB,1TB M.2 PCIE SSD,14inch QHD 165Hz
2,MSI Katana GF66 12UD 15.6inch Core i7 RTX 3050...,Katana GF66 12UD-069AU,1549,i7-12700H,"16GB , DDR4 RAM",GeForce RTX 3050 Ti,1TB M.2 PCIE SSD,15.6inch FHD 144Hz
3,MSI Katana GF76 12UE Black 17.3inch Core i7 RT...,Katana GF76 12UE-019AU,1899,i7-12700H,"16GB , RAM",GeForce RTX 3060 6GB,1TB M.2 PCIE SSD,17.3inch FHD 144Hz
4,MSI GF63 Thin 11SC Black 15.6inch Core i5 GTX ...,GF63 Thin 11SC-1095AU,999,i5-11400H,"8GB , RAM",GeForce GTX 1650 4GB,512GB M.2 PCIE SSD,15.6inch FHD
...,...,...,...,...,...,...,...,...
683,Lenovo ThinkPad P15v G3 15.6 inch FHD i7 16GB ...,21D80022AU,2699,Core i7-12700H,"16GB RAM DDR5 ,",Nvidia T600 4GB,512GB M.2 SSD,15.6 inch FHD IPS
684,MSI Creator M16 A12UC 16 inch i7 RTX3050 Win11...,Creator M16 A12UC-237AU,2399,Core i7-12700H,"16GB RAM ,",RTX3050 4GB,512GB SSD,16 inch QHD+ 60Hz
685,"MSI CreatorPro Z17 A12UMST 17"" Core i7 32GB 1T...",CreatorPro Z17 A12UMST-218AU,6599,Core i7-12700H,"32GB DDR5 RAM ,",RTX A5500 16GB,1TB M.2 NVMe PCIe Gen4 SSD,17inch QHD+ 165Hz
686,MSI CreatorPro X17HX A13VKS 17.3inch Core i9 6...,CreatorPro X17HX A13VKS-204AU,9599,Core i9-13980HX,"64GB DDR5 RAM ,",RTX 3500 12GB,"4TB , M.2 NVMe PCIe Gen4 SSD",17.3inch UHD 144Hz


### Extract Laptop Brand

The laptop brand is extracted from the laptop name by splitting off its first word.

In [411]:
# Create column for laptop brand
brandsSeries = laptop_modify.NAME.str.split(n=1, expand=True)[0]
brandsSeries.name = 'LAPTOP_BRAND'
laptop_modify = pd.concat([brandsSeries, laptop_modify], axis='columns')
laptop_modify.head()

Unnamed: 0,LAPTOP_BRAND,NAME,PRODUCTCODE,PRICE,PROCESSOR,MEMORY,GRAPHICS,STORAGE,DISPLAY
0,Gigabyte,Gigabyte A5 K1 Black 15.6inch Ryzen 5 RTX 3060...,A5 K1-AAU1130SB,1799,Ryzen 5-5600H,"16GB , RAM",GeForce RTX 3060 Max-P Design,512GB M.2 PCIE SSD,15.6inch FHD 144Hz
1,Razer,Razer Blade 14 Black 14inch Ryzen 9 RTX 3070 T...,RZ09-0427NEA3-R3B1,2699,Ryzen 9-6900HX,16GB DDR5 RAM,GeForce RTX 3070 Ti 8GB,1TB M.2 PCIE SSD,14inch QHD 165Hz
2,MSI,MSI Katana GF66 12UD 15.6inch Core i7 RTX 3050...,Katana GF66 12UD-069AU,1549,i7-12700H,"16GB , DDR4 RAM",GeForce RTX 3050 Ti,1TB M.2 PCIE SSD,15.6inch FHD 144Hz
3,MSI,MSI Katana GF76 12UE Black 17.3inch Core i7 RT...,Katana GF76 12UE-019AU,1899,i7-12700H,"16GB , RAM",GeForce RTX 3060 6GB,1TB M.2 PCIE SSD,17.3inch FHD 144Hz
4,MSI,MSI GF63 Thin 11SC Black 15.6inch Core i5 GTX ...,GF63 Thin 11SC-1095AU,999,i5-11400H,"8GB , RAM",GeForce GTX 1650 4GB,512GB M.2 PCIE SSD,15.6inch FHD


### Extract Laptop CPU Brand

The CPU brand is extracted from the processor name based on the patterns in each brand's naming convention.
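As a minimal, self-contained sketch of this kind of pattern-based labelling: `np.select` evaluates the conditions in order and uses the first one that is true, so the order of the patterns matters when a string could match more than one. The sample strings are illustrative, and the patterns are simplified versions of the ones used in this notebook.

```python
import numpy as np
import pandas as pd

processors = pd.Series(['i7-12700H', 'Ryzen 5-5600H', 'Core i9-13980HX', 'M2 8-Core'])

# One boolean mask per brand; the first mask that is True decides the label.
conditions = [
    processors.str.contains(r'Intel|i\d[-\s]', case=False),
    processors.str.contains(r'AMD|Ryzen|R\d', case=False),
]
brands = np.select(conditions, ['Intel', 'AMD'], default='Unknown')
print(list(brands))
```

Rows that match no pattern fall through to the default label, which makes them easy to inspect later.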

In [412]:
# Create column for CPU brand
cpu_Conditions = [
    laptop_modify.PROCESSOR.str.contains(r'Intel|i\d[-\s]', case=False),
    laptop_modify.PROCESSOR.str.contains(r'AMD|Ryzen|R\d', case=False),
    laptop_modify.LAPTOP_BRAND == 'Apple'
]
cpu_Values = ['Intel', 'AMD', 'Apple']
laptop_modify['CPU_BRAND'] = np.select(cpu_Conditions, cpu_Values, default='Unknown')

laptop_modify.head()


Unnamed: 0,LAPTOP_BRAND,NAME,PRODUCTCODE,PRICE,PROCESSOR,MEMORY,GRAPHICS,STORAGE,DISPLAY,CPU_BRAND
0,Gigabyte,Gigabyte A5 K1 Black 15.6inch Ryzen 5 RTX 3060...,A5 K1-AAU1130SB,1799,Ryzen 5-5600H,"16GB , RAM",GeForce RTX 3060 Max-P Design,512GB M.2 PCIE SSD,15.6inch FHD 144Hz,AMD
1,Razer,Razer Blade 14 Black 14inch Ryzen 9 RTX 3070 T...,RZ09-0427NEA3-R3B1,2699,Ryzen 9-6900HX,16GB DDR5 RAM,GeForce RTX 3070 Ti 8GB,1TB M.2 PCIE SSD,14inch QHD 165Hz,AMD
2,MSI,MSI Katana GF66 12UD 15.6inch Core i7 RTX 3050...,Katana GF66 12UD-069AU,1549,i7-12700H,"16GB , DDR4 RAM",GeForce RTX 3050 Ti,1TB M.2 PCIE SSD,15.6inch FHD 144Hz,Intel
3,MSI,MSI Katana GF76 12UE Black 17.3inch Core i7 RT...,Katana GF76 12UE-019AU,1899,i7-12700H,"16GB , RAM",GeForce RTX 3060 6GB,1TB M.2 PCIE SSD,17.3inch FHD 144Hz,Intel
4,MSI,MSI GF63 Thin 11SC Black 15.6inch Core i5 GTX ...,GF63 Thin 11SC-1095AU,999,i5-11400H,"8GB , RAM",GeForce GTX 1650 4GB,512GB M.2 PCIE SSD,15.6inch FHD,Intel


### Extract Laptop CPU Model

The CPU Model is extracted from the processor name and standardised based on the CPU Brand.

Each brand is matched with their own regular expression based on cpu model naming conventions.
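To see what the brand-specific patterns pick out, the sketch below applies the Intel and AMD patterns from this section to a few PROCESSOR values taken from the laptop data. It is only a quick check on sample strings, not part of the pipeline.

```python
import re

# Same Intel and AMD patterns as the extraction cell in this section.
intel_regex = r'\b(([A-Z][A-Z0-9-]*|i.*)?\d{3,}\w*)'
amd_regex = r'\b(([A-Z]+-)?\d{3,}\w*)'

samples = {
    'Core i7-12700H': intel_regex,
    'i5-11400H': intel_regex,
    'Ryzen 5-5600H': amd_regex,
    'Ryzen 9-6900HX': amd_regex,
}

# group(0) is the whole match, i.e. the model string kept in CPU_MODEL.
extracted = {text: re.search(pattern, text).group(0) for text, pattern in samples.items()}
print(extracted)
```

Note that the AMD pattern drops the "Ryzen 5" family prefix while the Intel pattern keeps the "i7" prefix, which is why the two brands need separate expressions.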

In [413]:
# Extract cpu model
intel_regex = r'\b(([A-Z][A-Z0-9-]*|i.*)?\d{3,}\w*)'
amd_regex = r'\b(([A-Z]+-)?\d{3,}\w*)'
apple_regex = r'\b(M\d.*Core)'

def extract_laptop_cpu(row):
    # Pick the pattern for this row's brand; unknown brands have no pattern
    brand_regex = {'Intel': intel_regex, 'AMD': amd_regex, 'Apple': apple_regex}
    processor = row.PROCESSOR
    pattern = brand_regex.get(row.CPU_BRAND)

    if pattern is not None:
        modelMatch = re.search(pattern, processor)
        if modelMatch is not None:
            return modelMatch.group(0)

    # Fall back to the raw processor string when no model can be extracted
    return processor

laptop_modify['CPU_MODEL'] = laptop_modify.apply(extract_laptop_cpu, axis = 1)
laptop_modify.head()

Unnamed: 0,LAPTOP_BRAND,NAME,PRODUCTCODE,PRICE,PROCESSOR,MEMORY,GRAPHICS,STORAGE,DISPLAY,CPU_BRAND,CPU_MODEL
0,Gigabyte,Gigabyte A5 K1 Black 15.6inch Ryzen 5 RTX 3060...,A5 K1-AAU1130SB,1799,Ryzen 5-5600H,"16GB , RAM",GeForce RTX 3060 Max-P Design,512GB M.2 PCIE SSD,15.6inch FHD 144Hz,AMD,5600H
1,Razer,Razer Blade 14 Black 14inch Ryzen 9 RTX 3070 T...,RZ09-0427NEA3-R3B1,2699,Ryzen 9-6900HX,16GB DDR5 RAM,GeForce RTX 3070 Ti 8GB,1TB M.2 PCIE SSD,14inch QHD 165Hz,AMD,6900HX
2,MSI,MSI Katana GF66 12UD 15.6inch Core i7 RTX 3050...,Katana GF66 12UD-069AU,1549,i7-12700H,"16GB , DDR4 RAM",GeForce RTX 3050 Ti,1TB M.2 PCIE SSD,15.6inch FHD 144Hz,Intel,i7-12700H
3,MSI,MSI Katana GF76 12UE Black 17.3inch Core i7 RT...,Katana GF76 12UE-019AU,1899,i7-12700H,"16GB , RAM",GeForce RTX 3060 6GB,1TB M.2 PCIE SSD,17.3inch FHD 144Hz,Intel,i7-12700H
4,MSI,MSI GF63 Thin 11SC Black 15.6inch Core i5 GTX ...,GF63 Thin 11SC-1095AU,999,i5-11400H,"8GB , RAM",GeForce GTX 1650 4GB,512GB M.2 PCIE SSD,15.6inch FHD,Intel,i5-11400H


Some extracted model names are in a slightly different format from the CPU dataset. This step standardises the names for each brand.
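A standalone sketch of the standardisation rules may help. It works on plain strings rather than DataFrame rows, and the input strings are illustrative assumptions about how a model might be written, not values taken from the scraped data.

```python
import re

def standardise_cpu_model(brand, model):
    # Intel: 'i7 12700H' style names get a hyphen instead of a space.
    if brand == 'Intel' and model.startswith('i') and ' ' in model:
        return model.replace(' ', '-')
    # Apple: rebuild the name as '<chip> [tier] <cores> Core'.
    if brand == 'Apple':
        model = model.split('/')[0]
        parts = [re.search(r'^M\d', model).group(0)]
        tier = re.search(r'\b(Pro|Max|Ultra)\b', model)
        if tier:
            parts.append(tier.group(0))
        cores = re.search(r'(\d+)[-\s]core', model, re.IGNORECASE)
        parts += [cores.group(1), 'Core']
        return ' '.join(parts)
    # Other brands are left unchanged.
    return model

print(standardise_cpu_model('Intel', 'i7 12700H'))
print(standardise_cpu_model('Apple', 'M2 Pro 12-Core'))
```

The Apple branch assumes the string always names a chip and a core count; a string without them would raise an error, which is useful for spotting unexpected formats.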

In [414]:
# Fix up model names
def fix_cpu_model(row):
    brand = row.CPU_BRAND
    model = row.CPU_MODEL

    if brand == 'Intel':
        if model[0] == 'i' and ' ' in model:
            model = model.replace(' ', '-')

    if brand == 'Apple':
        if '/' in model:
            model = model.split('/')[0]
        modelComponents = []
        modelComponents += [re.search(r'^M\d', model).group(0)]

        series_model = re.search(r'\b(Pro|Max|Ultra)\b', model)
        if series_model:
            modelComponents += [series_model.group(0)]
            
        cores = re.search(r'(\d+)(-|\s)core', model, re.IGNORECASE)
        modelComponents += [cores.group(1)]
        modelComponents += ['Core']

        model = ' '.join(modelComponents)
        
    return model

laptop_modify.CPU_MODEL = laptop_modify.apply(fix_cpu_model, axis=1)

### Extract Laptop GPU Brand

The GPU brand is extracted based off each brand's product offerings and naming conventions.

In [415]:
# Create gpu brand column
gpu_Conditions = [
    laptop_modify.GRAPHICS.str.contains('Intel|UHD|Iris|Irix', case = False),
    laptop_modify.GRAPHICS.str.contains(r'Nvidia|GeForce|GTX|RTX|MX|Quadro|T\d\d\d', case = False),
    laptop_modify.GRAPHICS.str.contains('AMD|Radeon|RX', case=False),
    laptop_modify.LAPTOP_BRAND == 'Apple'
]
gpu_Values = ['Intel', 'Nvidia', 'AMD', 'Apple']

laptop_modify['GPU_BRAND'] = np.select(gpu_Conditions, gpu_Values, default='Unknown')
laptop_modify.head()

Unnamed: 0,LAPTOP_BRAND,NAME,PRODUCTCODE,PRICE,PROCESSOR,MEMORY,GRAPHICS,STORAGE,DISPLAY,CPU_BRAND,CPU_MODEL,GPU_BRAND
0,Gigabyte,Gigabyte A5 K1 Black 15.6inch Ryzen 5 RTX 3060...,A5 K1-AAU1130SB,1799,Ryzen 5-5600H,"16GB , RAM",GeForce RTX 3060 Max-P Design,512GB M.2 PCIE SSD,15.6inch FHD 144Hz,AMD,5600H,Nvidia
1,Razer,Razer Blade 14 Black 14inch Ryzen 9 RTX 3070 T...,RZ09-0427NEA3-R3B1,2699,Ryzen 9-6900HX,16GB DDR5 RAM,GeForce RTX 3070 Ti 8GB,1TB M.2 PCIE SSD,14inch QHD 165Hz,AMD,6900HX,Nvidia
2,MSI,MSI Katana GF66 12UD 15.6inch Core i7 RTX 3050...,Katana GF66 12UD-069AU,1549,i7-12700H,"16GB , DDR4 RAM",GeForce RTX 3050 Ti,1TB M.2 PCIE SSD,15.6inch FHD 144Hz,Intel,i7-12700H,Nvidia
3,MSI,MSI Katana GF76 12UE Black 17.3inch Core i7 RT...,Katana GF76 12UE-019AU,1899,i7-12700H,"16GB , RAM",GeForce RTX 3060 6GB,1TB M.2 PCIE SSD,17.3inch FHD 144Hz,Intel,i7-12700H,Nvidia
4,MSI,MSI GF63 Thin 11SC Black 15.6inch Core i5 GTX ...,GF63 Thin 11SC-1095AU,999,i5-11400H,"8GB , RAM",GeForce GTX 1650 4GB,512GB M.2 PCIE SSD,15.6inch FHD,Intel,i5-11400H,Nvidia


### Extract Laptop GPU Model

The GPU models are extracted from the GRAPHICS column based off each brand's naming conventions.

For Intel, the benchmark mark for UHD Graphics depends on the laptop's CPU. UHD_Graphics.json lists each Intel CPU with UHD Graphics and the UHD Graphics series it belongs to, extracted from Intel's website (https://ark.intel.com/content/www/us/en/ark.html#@PanelLabel211968). For laptops that list UHD Graphics in the GRAPHICS column, the CPU_MODEL is looked up in this list and the corresponding UHD Graphics series is used as the GPU_MODEL.
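The lookup can be sketched as follows. The dictionary shape (series name mapped to a list of CPU models) is an assumption inferred from how the file is used in this notebook, and the entries are placeholders rather than values from the real UHD_Graphics.json. The helper name `lookup_uhd_series` is hypothetical.

```python
# Assumed shape of UHD_Graphics.json: series name -> list of CPU models.
# The entries here are placeholders, not values taken from the real file.
uhd_data = {
    'UHD Graphics Series A': ['cpu-model-1', 'cpu-model-2'],
    'UHD Graphics Series B': ['cpu-model-3'],
}

def lookup_uhd_series(cpu_model, data, fallback='UHD'):
    # Return the first series whose CPU list contains this model.
    for series, cpu_models in data.items():
        if cpu_model in cpu_models:
            return series
    # CPUs missing from the list keep a generic label.
    return fallback

print(lookup_uhd_series('cpu-model-3', uhd_data))
print(lookup_uhd_series('not-listed', uhd_data))
```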

In [416]:
# Load in UHD Data
# UHD_Graphics.json contains the UHD generation each Intel UHD chip belongs to
with open('UHD_Graphics.json') as f:
    UHD_Data = json.load(f)

In [417]:
# Extract gpu model
def extract_laptop_gpu(row):
    gpu_regex = r'\b[A-Z0-9-]*\d\d[A-Z0-9-]*\b'
    brand = row.GPU_BRAND
    name = row.GRAPHICS

    if brand == 'Nvidia':
        gpu_model = re.search(gpu_regex, name)
        if gpu_model != None:
            gpu_model = gpu_model.group(0)
            if 'Ti' in name:
                gpu_model += ' Ti'
            if 'Super' in name:
                gpu_model += ' Super'
            return gpu_model
        elif 'Ti' in name:
            gpuNameList = name.split('Ti')
            gpu_model = re.search(gpu_regex, gpuNameList[0]).group(0)
            gpu_model += ' Ti'
            return gpu_model
        else:
            return 'Unknown'
        
    elif brand == 'AMD':
        amd_brand_regex = r'Radeon (R\d )?Graphics'
        if re.search(amd_brand_regex, name) != None:
            return row.CPU_MODEL
        gpu_model = re.search(gpu_regex, name)
        if gpu_model != None:
            gpu_model = gpu_model.group(0)
            if 'XT' in name:
                gpu_model += ' XT'
            return gpu_model
        else:
            return 'Unknown'
        
    elif brand == 'Intel':
        gpuNameList = name.split()
        if 'Iris' in name or 'Irix' in name:
            if 'Intel' in gpuNameList:
                gpuNameList.remove('Intel')
            if 'Graphics' in gpuNameList:
                gpuNameList.remove("Graphics")
            if 'Irix' in gpuNameList:
                # Fix the 'Irix' typo
                gpuNameList[gpuNameList.index('Irix')] = 'Iris'
            if 'X' in gpuNameList:
                # Fix the 'X' typo (should be 'Xe')
                gpuNameList[gpuNameList.index('X')] = 'Xe'
            return ' '.join(gpuNameList)
        elif 'UHD' in gpuNameList:
            # Search through the UHD JSON for the corresponding UHD Graphics series
            for series in UHD_Data:
                if row.CPU_MODEL in UHD_Data[series]:
                    return series
            return 'UHD'
        else:
            return 'Unknown'
    
    elif brand == 'Apple':
        if 'GPU' in row.GRAPHICS:
            return row.GRAPHICS
        else:
            return row.PROCESSOR    

    else:
        return 'Unknown'

laptop_modify['GPU_MODEL'] = laptop_modify.apply(extract_laptop_gpu, axis=1)
laptop_modify.head()

Unnamed: 0,LAPTOP_BRAND,NAME,PRODUCTCODE,PRICE,PROCESSOR,MEMORY,GRAPHICS,STORAGE,DISPLAY,CPU_BRAND,CPU_MODEL,GPU_BRAND,GPU_MODEL
0,Gigabyte,Gigabyte A5 K1 Black 15.6inch Ryzen 5 RTX 3060...,A5 K1-AAU1130SB,1799,Ryzen 5-5600H,"16GB , RAM",GeForce RTX 3060 Max-P Design,512GB M.2 PCIE SSD,15.6inch FHD 144Hz,AMD,5600H,Nvidia,3060
1,Razer,Razer Blade 14 Black 14inch Ryzen 9 RTX 3070 T...,RZ09-0427NEA3-R3B1,2699,Ryzen 9-6900HX,16GB DDR5 RAM,GeForce RTX 3070 Ti 8GB,1TB M.2 PCIE SSD,14inch QHD 165Hz,AMD,6900HX,Nvidia,3070 Ti
2,MSI,MSI Katana GF66 12UD 15.6inch Core i7 RTX 3050...,Katana GF66 12UD-069AU,1549,i7-12700H,"16GB , DDR4 RAM",GeForce RTX 3050 Ti,1TB M.2 PCIE SSD,15.6inch FHD 144Hz,Intel,i7-12700H,Nvidia,3050 Ti
3,MSI,MSI Katana GF76 12UE Black 17.3inch Core i7 RT...,Katana GF76 12UE-019AU,1899,i7-12700H,"16GB , RAM",GeForce RTX 3060 6GB,1TB M.2 PCIE SSD,17.3inch FHD 144Hz,Intel,i7-12700H,Nvidia,3060
4,MSI,MSI GF63 Thin 11SC Black 15.6inch Core i5 GTX ...,GF63 Thin 11SC-1095AU,999,i5-11400H,"8GB , RAM",GeForce GTX 1650 4GB,512GB M.2 PCIE SSD,15.6inch FHD,Intel,i5-11400H,Nvidia,1650


Fix up GPU_MODEL entries that were entered in a slightly different format.
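The kinds of fix-up involved can be illustrated on plain strings. This is a simplified sketch of the rules (normalise the "Iris Xe" spelling, separate a fused "TI" suffix, drop a fused series prefix); the function name `normalise_gpu_model` is hypothetical and the inputs are examples only.

```python
import re

def normalise_gpu_model(model):
    # Any capitalisation of 'Iris Xe' collapses to one spelling.
    if re.fullmatch('Iris Xe', model, re.IGNORECASE):
        return 'Iris Xe'
    # An upper-case 'TI' suffix becomes a separate ' Ti' token.
    if model.endswith('TI'):
        model = model[:-2] + ' Ti'
    # Drop a fused series prefix so only the model number remains.
    for prefix in ('GTX', 'RTX', 'RX'):
        if model.startswith(prefix):
            return model[len(prefix):]
    return model

for raw in ['RTX3050', 'RTX3050TI', 'IRIS XE', 'RX6600M', '3060']:
    print(raw, '->', normalise_gpu_model(raw))
```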

In [418]:
# fix gpu model
def fix_gpu_model(row):
    model = row.GPU_MODEL

    # Normalise any capitalisation of 'Iris Xe' (an exact match is already correct)
    if re.search('^Iris Xe$', model, re.IGNORECASE):
        return 'Iris Xe'
    
    if model[-2:] == 'TI':
        model = model[:-2] + ' Ti'
    
    if model[:3] in ['GTX', 'RTX']:
        model = model[3:]
    elif model[:2] == 'RX':
        model = model[2:]
    
    return model

laptop_modify.GPU_MODEL = laptop_modify.apply(fix_gpu_model, axis=1)

### Check laptops with incorrect/missing data

Some scraped laptop entries have values in the wrong columns and/or are missing part of their data.

On inspection, the cause is that their listings on the website are laid out slightly differently from the rest.

As they make up only a small portion of the dataset, they can be removed without significantly affecting the data.
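One way to back up the "small portion" claim is to measure the share of flagged rows before dropping them. The sketch below uses a toy frame with made-up values; the same two lines could be run against the real data.

```python
import pandas as pd

# Toy stand-in for the laptop table; values are made up.
toy = pd.DataFrame({'GPU_MODEL': ['3060', 'Unknown', '1650', '3070 Ti']})

# The mean of a boolean mask is the fraction of rows that are True.
unknown_mask = toy['GPU_MODEL'] == 'Unknown'
share_dropped = unknown_mask.mean()

cleaned = toy[~unknown_mask]
print(f'{share_dropped:.1%} of rows would be removed')
```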

In [419]:
# Show laptops with incorrect/missing data
laptop_modify[laptop_modify.GPU_MODEL == 'Unknown']

Unnamed: 0,LAPTOP_BRAND,NAME,PRODUCTCODE,PRICE,PROCESSOR,MEMORY,GRAPHICS,STORAGE,DISPLAY,CPU_BRAND,CPU_MODEL,GPU_BRAND,GPU_MODEL
8,MSI,MSI Katana GF66 12UC Black 15.6inch Core i7 RT...,Katana GF66 12UC-656AU,1349,i7-12650H,"8GB , RAM",GBLAN,15.6inch FHD 144Hz,GeForce RTX 3050 4GB,Intel,i7-12650H,Unknown,Unknown
41,ASUS,ASUS ROG Zephyrus Duo 16 Black 16inch Ryzen 9 ...,GX650RX-LB149W,5599,Ryzen 9-6900HX,"32GB , DDR5 RAM",ScreenPad Plus 14inch 4k,2TB M.2 PCIE SSD,16inch UHD+ 120Hz and FHD+ 240Hz,AMD,6900HX,Unknown,Unknown
101,Gigabyte,Gigabyte G5 KD Black 15.6inch Core i5 RTX 3060...,G5 KD-52AU123SO,1799,i5-11400H,"16GB , RAM",GBLAN,15.6inch FHD 144Hz,GeForce RTX 3060 Max-P Design,Intel,i5-11400H,Unknown,Unknown
118,ASUS,ASUS ROG Zephyrus Duo 16 Black 16inch Ryzen 9 ...,GX650PY-NM056W,6999,Ryzen 9-7945HX,"64GB , DDR5 RAM",14inch ScreenPad Plus 4k,1TB M.2 PCIE SSD,16inch QHD+ 240Hz,AMD,7945HX,Unknown,Unknown
271,Asus,Asus ZenBook ProDuo 15 OLED 15.6in UHD Touch C...,UX582ZM-H2009X,4399,i9-12900HK,32GB RAM,ScreenPad Duo,1TB PCIe NVMe SSD,15.6 inch 4K UHD OLED Touch,Intel,i9-12900HK,Unknown,Unknown
283,Microsoft,Microsoft Surface Laptop 5 13.5inch Core i5 16...,R7B-00016,2399,"256GB SSD ,",13.5-inch PixelSense Gorilla Glass 3 Touch Dis...,HD Cam,Intel Iris Xe,WIFI 6 + BT 5.1,Unknown,"256GB SSD ,",Unknown,Unknown
287,ASUS,ASUS ZenBook Pro Duo 15 OLED 15.6 inch i7 16GB...,UX582ZM-KY012W,3799,Core i7-12700H,16GB RAM onboard,ScreenPad Plus,1TB PCe Gen4 SSD,15.6 inch FHD OLED Touch,Intel,i7-12700H,Unknown,Unknown
351,Asus,Asus Vivobook S 15.6 inch FHD OLED i5 8GB 256G...,K3502ZA-L1367W,1499,Neutral Grey,Core i5-12500H,15.6 inch FHD OLED,"8GB ,",256GB M.2 SSD,Unknown,Neutral Grey,Unknown,Unknown
521,Microsoft,Microsoft Surface Laptop 5 13.5inch Core i7 16...,RB1-00039,2699,"256GB SSD ,",13.5-inch PixelSense Gorilla Glass 3 Touch Dis...,HD Cam,Intel Iris Xe,WIFI 6 + BT 5.1,Unknown,"256GB SSD ,",Unknown,Unknown
522,Microsoft,Microsoft Surface Laptop 5 13.5inch Core i7 16...,RBH-00041,2849,"512GB SSD ,",13.5-inch PixelSense Gorilla Glass 5 Touch Dis...,HD Cam,Intel Iris Xe,WIFI 6 + BT 5.1,Unknown,"512GB SSD ,",Unknown,Unknown


In [420]:
# Remove Laptops with incorrect/missing GPU data
laptop_modify.drop(laptop_modify[laptop_modify.GPU_MODEL == 'Unknown'].index, inplace=True)

### Create RAM column
The RAM is extracted from the MEMORY column as an integer so it can be queried for the visualisation.
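A slightly more defensive variant is sketched here, assuming some MEMORY values might not start with a number. Instead of failing on conversion, such rows become missing values. The last sample string is a made-up example of a malformed entry.

```python
import pandas as pd

memory = pd.Series(['16GB , RAM', '16GB DDR5 RAM', '8GB , RAM', 'onboard RAM'])

# Strip stray commas, then take the leading digits as the RAM size in GB.
cleaned = memory.str.replace(',', '', regex=False).str.strip()
leading_digits = cleaned.str.extract(r'^(\d+)')[0]

# errors='coerce' turns non-numeric results into NaN; Int64 keeps integers with gaps.
ram = pd.to_numeric(leading_digits, errors='coerce').astype('Int64')
print(ram)
```

The nullable `Int64` dtype is used so that the column stays integer-valued even when some rows are missing.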

In [421]:
# Clean up Memory Column and create RAM column
laptop_modify.MEMORY = laptop_modify.MEMORY.map(lambda s: s.replace(',', '').strip())
laptop_modify['RAM'] = laptop_modify.MEMORY.str.extract(r'(^\d+)')
laptop_modify.RAM = laptop_modify.RAM.astype(int)
laptop_modify.head()

Unnamed: 0,LAPTOP_BRAND,NAME,PRODUCTCODE,PRICE,PROCESSOR,MEMORY,GRAPHICS,STORAGE,DISPLAY,CPU_BRAND,CPU_MODEL,GPU_BRAND,GPU_MODEL,RAM
0,Gigabyte,Gigabyte A5 K1 Black 15.6inch Ryzen 5 RTX 3060...,A5 K1-AAU1130SB,1799,Ryzen 5-5600H,16GB RAM,GeForce RTX 3060 Max-P Design,512GB M.2 PCIE SSD,15.6inch FHD 144Hz,AMD,5600H,Nvidia,3060,16
1,Razer,Razer Blade 14 Black 14inch Ryzen 9 RTX 3070 T...,RZ09-0427NEA3-R3B1,2699,Ryzen 9-6900HX,16GB DDR5 RAM,GeForce RTX 3070 Ti 8GB,1TB M.2 PCIE SSD,14inch QHD 165Hz,AMD,6900HX,Nvidia,3070 Ti,16
2,MSI,MSI Katana GF66 12UD 15.6inch Core i7 RTX 3050...,Katana GF66 12UD-069AU,1549,i7-12700H,16GB DDR4 RAM,GeForce RTX 3050 Ti,1TB M.2 PCIE SSD,15.6inch FHD 144Hz,Intel,i7-12700H,Nvidia,3050 Ti,16
3,MSI,MSI Katana GF76 12UE Black 17.3inch Core i7 RT...,Katana GF76 12UE-019AU,1899,i7-12700H,16GB RAM,GeForce RTX 3060 6GB,1TB M.2 PCIE SSD,17.3inch FHD 144Hz,Intel,i7-12700H,Nvidia,3060,16
4,MSI,MSI GF63 Thin 11SC Black 15.6inch Core i5 GTX ...,GF63 Thin 11SC-1095AU,999,i5-11400H,8GB RAM,GeForce GTX 1650 4GB,512GB M.2 PCIE SSD,15.6inch FHD,Intel,i5-11400H,Nvidia,1650,8


### Create STORAGE_SIZE column
The total storage size is extracted from the STORAGE column by summing the sizes of the drives listed in it. The total is stored as an integer number of gigabytes (1TB is counted as 1000GB) so it can be queried in the visualisation.
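The summation can be sketched on plain strings. The multi-drive example with a "+" separator is an assumed format for illustration; the helper name `total_storage_gb` is hypothetical.

```python
import re

def total_storage_gb(storage):
    total = 0
    # Each '+'-separated part is treated as one drive.
    for drive in storage.split('+'):
        match = re.search(r'(\d+)\s?(GB|TB)', drive)
        if match is None:
            continue  # skip parts with no recognisable capacity
        size, unit = int(match.group(1)), match.group(2)
        # Decimal convention: 1TB is counted as 1000GB.
        total += size * 1000 if unit == 'TB' else size
    return total

print(total_storage_gb('512GB M.2 PCIE SSD'))
print(total_storage_gb('512GB SSD + 1TB HDD'))
```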

In [422]:
# Clean up Storage column and create STORAGE_SIZE column
laptop_modify.STORAGE = laptop_modify.STORAGE.map(lambda s: s.replace(',', '').strip())

def calc_storage(row):
    totalStorage = 0
    hd_regex = r'(\d+)\s?(GB|TB)'
    storage = row.STORAGE
    storageList = storage.split('+')
    storageList = [item.strip() for item in storageList]
    for hardDrive in storageList:
        hdStorage = re.search(hd_regex, hardDrive)
        if 'TB' in hdStorage.group(0):
            hdStorage = int(hdStorage.group(1))
            hdStorage = hdStorage*1000
        else:
            hdStorage = int(hdStorage.group(1))
        totalStorage += int(hdStorage)

    return totalStorage

laptop_modify['STORAGE_SIZE'] = laptop_modify.apply(calc_storage, axis=1)
laptop_modify.head()

Unnamed: 0,LAPTOP_BRAND,NAME,PRODUCTCODE,PRICE,PROCESSOR,MEMORY,GRAPHICS,STORAGE,DISPLAY,CPU_BRAND,CPU_MODEL,GPU_BRAND,GPU_MODEL,RAM,STORAGE_SIZE
0,Gigabyte,Gigabyte A5 K1 Black 15.6inch Ryzen 5 RTX 3060...,A5 K1-AAU1130SB,1799,Ryzen 5-5600H,16GB RAM,GeForce RTX 3060 Max-P Design,512GB M.2 PCIE SSD,15.6inch FHD 144Hz,AMD,5600H,Nvidia,3060,16,512
1,Razer,Razer Blade 14 Black 14inch Ryzen 9 RTX 3070 T...,RZ09-0427NEA3-R3B1,2699,Ryzen 9-6900HX,16GB DDR5 RAM,GeForce RTX 3070 Ti 8GB,1TB M.2 PCIE SSD,14inch QHD 165Hz,AMD,6900HX,Nvidia,3070 Ti,16,1000
2,MSI,MSI Katana GF66 12UD 15.6inch Core i7 RTX 3050...,Katana GF66 12UD-069AU,1549,i7-12700H,16GB DDR4 RAM,GeForce RTX 3050 Ti,1TB M.2 PCIE SSD,15.6inch FHD 144Hz,Intel,i7-12700H,Nvidia,3050 Ti,16,1000
3,MSI,MSI Katana GF76 12UE Black 17.3inch Core i7 RT...,Katana GF76 12UE-019AU,1899,i7-12700H,16GB RAM,GeForce RTX 3060 6GB,1TB M.2 PCIE SSD,17.3inch FHD 144Hz,Intel,i7-12700H,Nvidia,3060,16,1000
4,MSI,MSI GF63 Thin 11SC Black 15.6inch Core i5 GTX ...,GF63 Thin 11SC-1095AU,999,i5-11400H,8GB RAM,GeForce GTX 1650 4GB,512GB M.2 PCIE SSD,15.6inch FHD,Intel,i5-11400H,Nvidia,1650,8,512


### Create DISPLAY_SIZE column
The display size is extracted from the DISPLAY column as a float so it can be queried for the visualisation.
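The sketch below shows the size pattern applied to a few DISPLAY values from the laptop data, including one misplaced value that contains no screen size. Returning `None` for unparsable strings is a choice made for this sketch.

```python
import re

DISPLAY_REGEX = r'(\d+\.?\d?)(-|\s)?(inch|")'

def screen_size(display):
    # The first group holds the number that precedes 'inch' or a double quote.
    match = re.search(DISPLAY_REGEX, display)
    return float(match.group(1)) if match else None

samples = ['15.6inch FHD 144Hz', '14inch QHD 165Hz', '16 inch QHD+ 60Hz', 'WIFI 6 + BT 5.1']
sizes = [screen_size(s) for s in samples]
print(sizes)
```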

In [423]:
# Clean up DISPLAY column and create DISPLAY_SIZE column
laptop_modify.DISPLAY = laptop_modify.DISPLAY.map(lambda s: s.replace(',', '').strip())

def extract_screen_size(row):
    display_regex = r'(\d+\.?\d?)(-|\s)?(inch|")'
    size = re.search(display_regex, row.DISPLAY)
    if size is not None:
        return float(size.group(1))
    # Flag rows without a parsable size so they can be dropped below
    return 'Unknown'

laptop_modify['DISPLAY_SIZE'] = laptop_modify.apply(extract_screen_size, axis=1)
laptop_modify.head()

Unnamed: 0,LAPTOP_BRAND,NAME,PRODUCTCODE,PRICE,PROCESSOR,MEMORY,GRAPHICS,STORAGE,DISPLAY,CPU_BRAND,CPU_MODEL,GPU_BRAND,GPU_MODEL,RAM,STORAGE_SIZE,DISPLAY_SIZE
0,Gigabyte,Gigabyte A5 K1 Black 15.6inch Ryzen 5 RTX 3060...,A5 K1-AAU1130SB,1799,Ryzen 5-5600H,16GB RAM,GeForce RTX 3060 Max-P Design,512GB M.2 PCIE SSD,15.6inch FHD 144Hz,AMD,5600H,Nvidia,3060,16,512,15.6
1,Razer,Razer Blade 14 Black 14inch Ryzen 9 RTX 3070 T...,RZ09-0427NEA3-R3B1,2699,Ryzen 9-6900HX,16GB DDR5 RAM,GeForce RTX 3070 Ti 8GB,1TB M.2 PCIE SSD,14inch QHD 165Hz,AMD,6900HX,Nvidia,3070 Ti,16,1000,14.0
2,MSI,MSI Katana GF66 12UD 15.6inch Core i7 RTX 3050...,Katana GF66 12UD-069AU,1549,i7-12700H,16GB DDR4 RAM,GeForce RTX 3050 Ti,1TB M.2 PCIE SSD,15.6inch FHD 144Hz,Intel,i7-12700H,Nvidia,3050 Ti,16,1000,15.6
3,MSI,MSI Katana GF76 12UE Black 17.3inch Core i7 RT...,Katana GF76 12UE-019AU,1899,i7-12700H,16GB RAM,GeForce RTX 3060 6GB,1TB M.2 PCIE SSD,17.3inch FHD 144Hz,Intel,i7-12700H,Nvidia,3060,16,1000,17.3
4,MSI,MSI GF63 Thin 11SC Black 15.6inch Core i5 GTX ...,GF63 Thin 11SC-1095AU,999,i5-11400H,8GB RAM,GeForce GTX 1650 4GB,512GB M.2 PCIE SSD,15.6inch FHD,Intel,i5-11400H,Nvidia,1650,8,512,15.6


Remove laptops where the display was entered incorrectly.

In [424]:
# Remove Laptops with incorrect Display value
laptop_modify.drop(laptop_modify[laptop_modify.DISPLAY_SIZE == 'Unknown'].index, inplace=True)

## CPU data transformation
The CPU data transformation involves:
- Adding brand column to cpu data.

- Extracting processor model as primary key.

- Removing irrelevant and incorrect entries.

In [425]:
cpu_original = pd.read_csv(cwd + '/datasets/processor_data/cpu_data2023-03-21.csv', quotechar="'")
cpu_modify = cpu_original.copy()
# Convert CPU Mark to integer
cpu_modify['CPU Mark'] = cpu_modify['CPU Mark'].str.replace(',', '').astype(int)
cpu_modify

Unnamed: 0,CPU Name,CPU Mark
0,AArch64 rev 2 (aarch64),2267
1,AArch64 rev 4 (aarch64),1824
2,AC8257V/WAB,711
3,AMD 3015Ce,2067
4,AMD 3015e,2709
...,...,...
4104,ZHAOXIN KaiXian KX-6640MA@2.2+GHz,1549
4105,ZHAOXIN KaiXian KX-U6580@2.5GHz,3227
4106,ZHAOXIN KaiXian KX-U6780A@2.7GHz,3881
4107,ZHAOXIN KaiXian ZX-C+ C4700@2.0GHz,1547


Remove CPU models not offered in current laptops.

*Note: This filtering is based on anecdotal observations. A better approach would be to include 'CPU First Seen on Charts' and 'CPU Class' in the scraped data and filter on those two columns. However, the anecdotal filter is sufficient for the current dataset and the current scope of the project.*
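If those two fields were scraped, the filter might look roughly like the sketch below. Everything in it is hypothetical: the column names `CPU Class` and `First Seen Year`, their value formats, and the cutoff year are assumptions for illustration, since the current CSV contains neither field.

```python
import pandas as pd

# Hypothetical rows: these extra columns are not in the scraped CSV, and
# their value formats are assumptions made for this sketch.
cpus = pd.DataFrame({
    'CPU Name': ['cpu-a', 'cpu-b', 'cpu-c'],
    'CPU Class': ['Laptop', 'Desktop', 'Laptop'],
    'First Seen Year': [2022, 2021, 2012],
})

# Keep laptop-class CPUs first seen in or after an arbitrary cutoff year.
cutoff_year = 2018
recent_laptop_cpus = cpus[(cpus['CPU Class'] == 'Laptop') & (cpus['First Seen Year'] >= cutoff_year)]
print(recent_laptop_cpus)
```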

In [426]:
# Remove processors not from: Intel, AMD, Apple
cpu_modify = cpu_modify.drop(cpu_modify[~cpu_modify['CPU Name'].str.contains('Intel|AMD|Apple')].index)
# Remove processors that are not within the following lineup-
# Intel: Celeron, Pentium, Core, Xeon
# AMD: Ryzen, Athlon
# Apple: Apple silicon (M series)
cpu_modify = cpu_modify.drop(cpu_modify[~cpu_modify['CPU Name'].str.contains(r'Celeron|Pentium|Core|Xeon|Ryzen|Athlon|Apple| M\d')].index)
# Remove remaining mobile processors
cpu_modify = cpu_modify.drop(cpu_modify[cpu_modify['CPU Name'].str.contains('Mobile')].index)
# Remove CPUs with a CPU Mark below 1900
cpu_modify = cpu_modify.drop(cpu_modify[cpu_modify['CPU Mark']<1900].index)
cpu_modify

Unnamed: 0,CPU Name,CPU Mark
169,AMD Athlon 200GE,4102
170,AMD Athlon 220GE,4424
171,AMD Athlon 240GE,4534
175,AMD Athlon 3000G,4482
176,AMD Athlon 300GE,4256
...,...,...
3599,Intel Xeon X5680 @ 3.33GHz,6794
3600,Intel Xeon X5687 @ 3.60GHz,5288
3601,Intel Xeon X5690 @ 3.47GHz,6967
3602,Intel Xeon X5698 @ 4.40GHz,3447


### Extract CPU Brand
The CPU Brand is extracted from the CPU Name based off each brand's products and naming conventions.

In [427]:
# Remove the unnecessary clock speed in some of the CPU Names and trim the leftover whitespace
cpu_modify['CPU Name'] = cpu_modify['CPU Name'].str.split(pat='@', expand=True)[0].str.strip()

# Create Column for CPU Brand
cpu_Conditions = [
    cpu_modify['CPU Name'].str.contains('Intel', case=False),
    cpu_modify['CPU Name'].str.contains('AMD', case=False),
    cpu_modify['CPU Name'].str.contains('Apple', case=False)
]
cpu_Values = ['Intel', 'AMD', 'Apple']

cpu_modify['CPU Brand'] = np.select(cpu_Conditions, cpu_Values, default='Unknown')
cpu_modify

Unnamed: 0,CPU Name,CPU Mark,CPU Brand
169,AMD Athlon 200GE,4102,AMD
170,AMD Athlon 220GE,4424,AMD
171,AMD Athlon 240GE,4534,AMD
175,AMD Athlon 3000G,4482,AMD
176,AMD Athlon 300GE,4256,AMD
...,...,...,...
3599,Intel Xeon X5680,6794,Intel
3600,Intel Xeon X5687,5288,Intel
3601,Intel Xeon X5690,6967,Intel
3602,Intel Xeon X5698,3447,Intel


### Extract CPU Model
The CPU Model is extracted from the CPU Name based off each brand's naming conventions.

Some models may have more than one entry, usually due to the brand creating different versions of the same model, or the same CPU chip being entered under different names. Differentiating between these models is difficult as the laptop data does not specify which version of the chip is being used. Additionally, the marks of the different models are typically within a similar range. Therefore, for this analysis they will be aggregated and averaged.
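The aggregation idea can be sketched with a plain `groupby`. This is a simplified illustration on made-up rows, not the notebook's own implementation: it keeps the shortest name per model and averages the marks.

```python
import pandas as pd

# Made-up rows: 'model-x' appears under two names with slightly different marks.
cpus = pd.DataFrame({
    'CPU Model': ['model-x', 'model-x', 'model-y'],
    'CPU Name': ['Brand model-x Long Name', 'Brand model-x', 'Brand model-y'],
    'CPU Mark': [1000, 1100, 2000],
})

def shortest(names):
    # Pick the shortest of the names recorded for one model.
    return min(names, key=len)

aggregated = cpus.groupby('CPU Model').agg({'CPU Name': shortest, 'CPU Mark': 'mean'})
print(aggregated)
```

Averaging treats every entry equally; weighting by sample count would need data that is not in the current CSV.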

In [428]:
# Create column for CPU Model and set as index
intel_regex = r'\s(([A-Z][A-Z0-9-]*|i.*)?\d{3,}\w*)'
amd_regex = r'\s((\w+-)?\d{3,}\w*)'
apple_regex = r'\s(M\d.*Core)'

def extract_cpu_model(row):
    cpu_name = row['CPU Name']
    cpu_brand = row['CPU Brand']

    if cpu_brand == 'Intel':
        match = re.search(intel_regex, cpu_name)
        if match:
            return match.group(1)
    if cpu_brand == 'AMD':
        match = re.search(amd_regex, cpu_name)
        if match:
            return match.group(1)
    if cpu_brand == 'Apple':
        match = re.search(apple_regex, cpu_name)
        if match:
            return match.group(1)
        
    return 'Unknown'

cpu_modify['CPU Model'] = cpu_modify.apply(extract_cpu_model, axis = 1)

# Drop entries where the CPU Model extraction failed
cpu_modify = cpu_modify.drop(cpu_modify[cpu_modify['CPU Model'] == 'Unknown'].index)

# Display count of duplicate values
cpu_modify['CPU Model'].value_counts()

E3-1225     5
E3-1220     5
E3-1230     5
E3-1275     5
E3-1240     5
           ..
i5-3320M    1
i5-3317U    1
i5-3230M    1
i5-3210M    1
X6550       1
Name: CPU Model, Length: 1655, dtype: int64

In [429]:
# Aggregate duplicates and average the CPU Marks
def agg_cpu_model(group):
    if len(group) == 1:
        return group.iloc[0]
    else:
        # For duplicate values, it's presumably the same CPU under two different CPU names.
        # Take the shorter of the two names and average the CPU Mark.
        # Note: Averaging the CPU Marks is sufficient for practical purposes as the values should be similar.
        # If we wanted to be pedantic, PassMark shows the number of samples for each CPU Mark and we could recalculate a precise value.
        shorter_name_idx = group['CPU Name'].str.len().idxmin()
        shorter_name = group.loc[shorter_name_idx, 'CPU Name']
        avg_mark = group['CPU Mark'].mean()
        brand = group.iloc[0]['CPU Brand']
        return pd.Series({'CPU Name': shorter_name, 'CPU Mark': avg_mark, 'CPU Brand': brand})

cpu_modify = cpu_modify.groupby('CPU Model').apply(agg_cpu_model)
cpu_modify = cpu_modify.unstack()
cpu_modify.drop(columns='CPU Model', inplace=True)
cpu_modify

Unnamed: 0_level_0,CPU Brand,CPU Mark,CPU Name
CPU Model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1200,AMD,6309.5,AMD Ryzen 3 1200
1300,AMD,7234,AMD Ryzen 3 PRO 1300
1300X,AMD,6957,AMD Ryzen 3 1300X
1400,AMD,7756,AMD Ryzen 5 1400
1403,Intel,2284,Intel Pentium 1403 v2
...,...,...,...
i9-9940X,Intel,28402,Intel Core i9-9940X
i9-9960X,Intel,30354,Intel Core i9-9960X
i9-9980HK,Intel,14493,Intel Core i9-9980HK
i9-9980XE,Intel,32263,Intel Core i9-9980XE


## GPU data transformation
The GPU data transformation involves:
- Adding brand column to the dataset.

- Extracting GPU Model from GPU Name.

- Removing irrelevant and incorrect entries and columns.

In [430]:
gpu_original = pd.read_csv(cwd + '/datasets/downloaded_data/GPU_Benchmarks_Compilation/GPU_benchmarks_v7.csv', quotechar='"')
gpu_modify = gpu_original.copy()
gpu_modify

Unnamed: 0,gpuName,G3Dmark,G2Dmark,price,gpuValue,TDP,powerPerformance,testDate,category
0,GeForce RTX 3090 Ti,29094,1117,2099.99,13.85,450.0,64.65,2022,Unknown
1,GeForce RTX 3080 Ti,26887,1031,1199.99,22.41,350.0,76.82,2021,Desktop
2,GeForce RTX 3090,26395,999,1749.99,15.08,350.0,75.41,2020,Desktop
3,Radeon RX 6900 XT,25458,1102,1120.31,22.72,300.0,84.86,2020,Desktop
4,GeForce RTX 3080,24853,1003,999.00,24.88,320.0,77.66,2020,Desktop
...,...,...,...,...,...,...,...,...,...
2312,Intel 82852/82855 GM/GME Controller,1,107,,,,,2010,Unknown
2313,Quadro2 Pro,1,143,,,,,2009,Workstation
2314,Rage 128 Pro,1,40,,,,,2009,Unknown
2315,RAGE 128 PRO AGP 4X TMDS,1,142,,,,,2009,Unknown


Remove irrelevant rows and columns on an ad hoc basis.

In [431]:
# Remove irrelevant rows and columns
gpu_modify = gpu_modify[(gpu_modify.category != 'Desktop') & (gpu_modify.category != 'Workstation')]
gpu_modify.drop(gpu_modify[gpu_modify['gpuName'].str.contains(r'\+|/')].index, inplace=True)
gpu_modify.drop(gpu_modify[~gpu_modify['gpuName'].str.contains(r'Intel|AMD|Nvidia|GeForce|Radeon|RTX|GTX|MX|Quadro|T\d\d\d', case=False)].index, inplace=True)
gpu_modify.drop(gpu_modify[gpu_modify['gpuName'].str.contains('Radeon HD', case=False)].index, inplace=True)
gpu_modify.drop(columns=['G2Dmark', 'price', 'gpuValue', 'TDP', 'powerPerformance', 'testDate'], inplace=True)
gpu_modify.drop(gpu_modify[gpu_modify['G3Dmark']<700].index, inplace=True)
gpu_modify

Unnamed: 0,gpuName,G3Dmark,category
0,GeForce RTX 3090 Ti,29094,Unknown
11,RTX A4500,21546,Unknown
18,GeForce RTX 3080 Ti Laptop GPU,19507,Unknown
23,Radeon PRO W6800,18802,Unknown
26,GeForce RTX 3070 Ti Laptop GPU,18490,Unknown
...,...,...,...
1098,"Radeon R5 A10-9600P RADEON R5, 10 COMPUTE CORE...",716,Unknown
1099,Intel HD 5600,712,Unknown
1109,Radeon R7 A8-7650K,704,Unknown
1110,Radeon R7 A12-9700P RADEON,703,Unknown
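The name-based filters above can be tried on a handful of names to see which rows survive. This is a minimal sketch rather than part of the notebook's pipeline: the third name is made up for illustration, and `names`, `combined`, `known`, and `kept` are hypothetical variable names.

```python
import pandas as pd

names = pd.Series([
    'GeForce RTX 3090 Ti',
    'Intel 82852/82855 GM/GME Controller',
    'Hypothetical GPU A + Hypothetical GPU B',  # made-up combined entry
    'Rage 128 Pro',
])

# '+' is a regex metacharacter, so it is escaped; the raw string keeps the backslash intact
combined = names.str.contains(r'\+|/')

# Names that mention at least one known brand or product-line keyword
known = names.str.contains(
    r'Intel|AMD|Nvidia|GeForce|Radeon|RTX|GTX|MX|Quadro|T\d\d\d', case=False
)

# Keep names that are not combined entries and that match a known keyword
kept = names[~combined & known]
print(kept.tolist())
```

Only the first name passes both filters: the second and third contain `/` or `+`, and the last matches none of the keywords.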


### Create Brand Column
The brand column is derived from gpuName using each brand's product naming conventions. Names that match none of the patterns are labelled 'Unknown'.

In [432]:
# Create brand column
gpu_Conditions = [
    gpu_modify.gpuName.str.contains('Intel', case = False),
    gpu_modify.gpuName.str.contains(r'Nvidia|GeForce|GTX|RTX|T\d+', case = False),
    gpu_modify.gpuName.str.contains('AMD|Radeon|RX', case=False)
]
gpu_Values = ['Intel', 'Nvidia', 'AMD']

gpu_modify['gpuBrand'] = np.select(gpu_Conditions, gpu_Values, default='Unknown')
gpu_modify

Unnamed: 0,gpuName,G3Dmark,category,gpuBrand
0,GeForce RTX 3090 Ti,29094,Unknown,Nvidia
11,RTX A4500,21546,Unknown,Nvidia
18,GeForce RTX 3080 Ti Laptop GPU,19507,Unknown,Nvidia
23,Radeon PRO W6800,18802,Unknown,AMD
26,GeForce RTX 3070 Ti Laptop GPU,18490,Unknown,Nvidia
...,...,...,...,...
1098,"Radeon R5 A10-9600P RADEON R5, 10 COMPUTE CORE...",716,Unknown,AMD
1099,Intel HD 5600,712,Unknown,Intel
1109,Radeon R7 A8-7650K,704,Unknown,AMD
1110,Radeon R7 A12-9700P RADEON,703,Unknown,AMD
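`np.select` evaluates the conditions in order and takes the value of the first one that is true, falling back to `default` when none match. The sketch below repeats the same patterns on four names taken from this notebook's outputs; `names`, `conditions`, and `brands` are illustrative variable names only.

```python
import numpy as np
import pandas as pd

names = pd.Series([
    'Intel HD 5600',
    'GeForce RTX 3090 Ti',
    'Radeon PRO W6800',
    'Quadro 1100M',
])

# Same patterns as the cell above, checked in this order
conditions = [
    names.str.contains('Intel', case=False),
    names.str.contains(r'Nvidia|GeForce|GTX|RTX|T\d+', case=False),
    names.str.contains('AMD|Radeon|RX', case=False),
]

# The first matching condition wins; names matching nothing get the default
brands = np.select(conditions, ['Intel', 'Nvidia', 'AMD'], default='Unknown')
print(dict(zip(names, brands)))
```

`Quadro 1100M` matches none of the three patterns, which is consistent with the 'Unknown' brand it receives in the duplicate listing later in the notebook. Adding `Quadro` to the Nvidia pattern would be one way to change that, if desired.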


### Create Model Column
The gpuModel column is extracted from gpuName based on each brand's product naming conventions.

Some models have more than one entry, either because the model comes in several variants or because the same GPU is listed under alternate names. It is sometimes unclear which exact variant a laptop's graphics entry refers to. As this dataset is focused on laptops, duplicate models are handled as follows:
- If any entries are marked as laptop versions, only those entries are used.
- Otherwise, all entries are used.

The G3Dmark values in the chosen set are averaged, as differentiating further between versions is beyond the current scope.
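The duplicate-handling rule can be sketched on a toy frame. Everything here is illustrative: the GPU names and scores are made up, `toy`, `rows`, and `result` are hypothetical names, and the grouping uses gpuModel alone. Keeping the shortest name as the surviving entry mirrors what the notebook's own cleanup cell does further down.

```python
import pandas as pd

# Made-up entries: model '1000' has laptop versions, model '2000' does not
toy = pd.DataFrame({
    'gpuName': [
        'Example 1000',
        'Example 1000 Laptop GPU',
        'Example 1000 Laptop GPU v2',
        'Example 2000',
        'Example 2000 Max-Q',
    ],
    'gpuModel': ['1000', '1000', '1000', '2000', '2000'],
    'G3Dmark': [9000, 7000, 6000, 5000, 4000],
})

rows = []
for model, group in toy.groupby('gpuModel'):
    # Prefer the laptop entries when at least one exists, otherwise use all entries
    laptop = group[group['gpuName'].str.contains('Laptop')]
    chosen = laptop if not laptop.empty else group
    rows.append({
        'gpuModel': model,
        'gpuName': min(chosen['gpuName'], key=len),  # keep the shortest name
        'G3Dmark': chosen['G3Dmark'].mean(),         # average the chosen set
    })

result = pd.DataFrame(rows).set_index('gpuModel')
print(result)
```

For model `1000` the desktop score of 9000 is ignored and the two laptop scores average to 6500; for model `2000` both entries are averaged to 4500.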

In [433]:
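# Extract gpu model
def extract_gpu_model(row):
    # A model token is a run of uppercase letters, digits, and hyphens
    # that contains at least two consecutive digits (e.g. 3090, A4500, A8-7650K)
    gpu_regex = r'\b[A-Z0-9-]*\d\d[A-Z0-9-]*\b'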
    brand = row.gpuBrand
    name = row.gpuName

    if brand == 'Nvidia':
        gpu_model = re.search(gpu_regex, name)
        if gpu_model is not None:
            gpu_model = gpu_model.group(0)
            if re.search('Ti', name, re.IGNORECASE):
                gpu_model += ' Ti'
            if re.search('Super', name, re.IGNORECASE):
                gpu_model += ' Super'
            return gpu_model
        else:
            return 'Unknown model'
        
    elif brand == 'AMD':
        gpu_model = re.search(gpu_regex, name)
        if gpu_model is not None:
            gpu_model = gpu_model.group(0)
            if 'XT' in name:
                gpu_model += ' XT'
            return gpu_model
        else:
            return 'Unknown model'
        
    elif brand == 'Intel':
        intelName = name.split(' ')
        if intelName[0] == 'Intel':
            gpu_model = ' '.join(intelName[1:])
        else:
            gpu_model = name
        return gpu_model

    else:
        return 'Unknown brand'

gpu_modify['gpuModel'] = gpu_modify.apply(extract_gpu_model, axis=1)
gpu_modify.head()

Unnamed: 0,gpuName,G3Dmark,category,gpuBrand,gpuModel
0,GeForce RTX 3090 Ti,29094,Unknown,Nvidia,3090 Ti
11,RTX A4500,21546,Unknown,Nvidia,A4500
18,GeForce RTX 3080 Ti Laptop GPU,19507,Unknown,Nvidia,3080 Ti
23,Radeon PRO W6800,18802,Unknown,AMD,W6800
26,GeForce RTX 3070 Ti Laptop GPU,18490,Unknown,Nvidia,3070 Ti


Show duplicate values

In [434]:
# Check for duplicate gpu models
duplicateGpu = gpu_modify.groupby(by=['gpuBrand', 'gpuModel']).filter(lambda x: len(x)>1)
duplicateGpu = duplicateGpu[duplicateGpu.gpuModel != 'Unknown model']
duplicateGpu

Unnamed: 0,gpuName,G3Dmark,category,gpuBrand,gpuModel
52,GeForce RTX 2080 (Mobile),15107,Mobile,Nvidia,2080
54,Quadro RTX 5000 (Mobile),14832,"Mobile, Workstation",Nvidia,5000
65,Quadro RTX 5000 with Max-Q Design,13893,"Mobile, Workstation",Nvidia,5000
70,RTX A2000 12GB,13688,Unknown,Nvidia,A2000
71,GeForce RTX 2080 with Max-Q Design,13681,Mobile,Nvidia,2080
...,...,...,...,...,...
1008,Radeon R7 A8-9600 RADEON,798,Unknown,AMD,A8-9600
1035,Radeon R7 PRO A10-9700,771,Unknown,AMD,A10-9700
1036,GeForce 730A,769,Unknown,Nvidia,730A
1055,Quadro 1100M,755,Unknown,Unknown,Unknown brand


Aggregate and condense duplicate values

In [435]:
# Set the index to gpuName for dealing with duplicates
gpu_modify = gpu_modify.set_index('gpuName')

# Clean up duplicate gpu models
# If there are specified laptop versions, use the laptop versions
# Else use the non-laptop versions
# Average the marks across the duplicates
for model in duplicateGpu.gpuModel.unique():
    duplicateNames = gpu_modify[gpu_modify.gpuModel == model].index.values

    # Split the duplicate names into laptop and non-laptop versions
    laptopInName = [gpuName for gpuName in duplicateNames if 'Laptop' in gpuName]
    laptopNotInName = [gpuName for gpuName in duplicateNames if 'Laptop' not in gpuName]

    if laptopInName:
        # Laptop versions exist, so discard the non-laptop versions
        gpu_modify.drop(laptopNotInName, inplace=True)
        chosenNames = laptopInName
    else:
        chosenNames = laptopNotInName

    # Average the marks across the chosen set
    averageMark = gpu_modify.loc[chosenNames].G3Dmark.mean()

    # Keep the entry with the shortest name (first one on ties) and drop the rest
    selectName = min(chosenNames, key=len)
    chosenNames.remove(selectName)
    gpu_modify.at[selectName, 'G3Dmark'] = averageMark
    gpu_modify.drop(chosenNames, inplace=True)

gpu_modify

Unnamed: 0_level_0,G3Dmark,category,gpuBrand,gpuModel
gpuName,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
GeForce RTX 3090 Ti,29094.0,Unknown,Nvidia,3090 Ti
RTX A4500,21546.0,Unknown,Nvidia,A4500
GeForce RTX 3080 Ti Laptop GPU,19507.0,Unknown,Nvidia,3080 Ti
Radeon PRO W6800,18802.0,Unknown,AMD,W6800
GeForce RTX 3070 Ti Laptop GPU,18490.0,Unknown,Nvidia,3070 Ti
...,...,...,...,...
"Radeon R5 A10-9600P RADEON R5, 10 COMPUTE CORES 4C",716.0,Unknown,AMD,A10-9600P
Intel HD 5600,712.0,Unknown,Intel,HD 5600
Radeon R7 A8-7650K,704.0,Unknown,AMD,A8-7650K
Radeon R7 A12-9700P RADEON,703.0,Unknown,AMD,A12-9700P


## Check for laptops with no data in the cpu or gpu set

Now that the initial data transformation and cleaning have been completed, the CPU Model and GPU Model in laptop data can be examined against the cpu and gpu datasets for entries where there's no match.

### CPU Check
Display laptop entries where there's no match between the laptop CPU model and the CPU model dataset.

In [436]:
cpu_models = cpu_modify.index
invalid_cpu = ~laptop_modify['CPU_MODEL'].isin(cpu_models)
invalid_cpu_laptops = laptop_modify[invalid_cpu]
invalid_cpu_laptops

Unnamed: 0,LAPTOP_BRAND,NAME,PRODUCTCODE,PRICE,PROCESSOR,MEMORY,GRAPHICS,STORAGE,DISPLAY,CPU_BRAND,CPU_MODEL,GPU_BRAND,GPU_MODEL,RAM,STORAGE_SIZE,DISPLAY_SIZE
267,Microsoft,"Microsoft Surface Laptop 4 For Business 13.5"" ...",5Q1-00016,1649,AMD Ryzen 5 4680U,8GB RAM,Radeon Graphics,256GB SSD,13.5 inch PixelSense Touch,AMD,4680U,AMD,4680U,8,256,13.5
270,ASUS,ASUS ExpertBook B1 14 inch FHD i5-1165G7 8GB 5...,B1400CEAE-EB0933R,1189,Intel Core i5-1165G7,8GB RAM onboard,Iris Xe Graphics,512GB SSD,14.0 inch FHD,Intel,i5-1165G7,Intel,Iris Xe,8,512,14.0
409,Microsoft,"Microsoft Surface Laptop 4 For Business 13.5"" ...",7IQ-00016,2049,AMD Ryzen 5 4680U,16GB RAM,Radeon Graphics,256GB SSD,13.5 inch PixelSense Touch,AMD,4680U,AMD,4680U,16,256,13.5


In this case:
- The Intel Core i5-1165G7 entry is incorrect: the 1165G7 is an i7-series CPU, not an i5.

- The AMD Ryzen 5 4680U in the two Microsoft laptops is not in the CPU dataset, so these entries will be ignored.

In [437]:
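# Correct the mislabelled entry by matching on its value rather than a hard-coded row position
wrong_cpu = laptop_modify['CPU_MODEL'] == 'i5-1165G7'
laptop_modify.loc[wrong_cpu, 'PROCESSOR'] = 'Intel Core i7-1165G7'
laptop_modify.loc[wrong_cpu, 'CPU_MODEL'] = 'i7-1165G7'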

### GPU Check

Display laptop entries where there's no match between the laptop GPU model and the GPU model dataset.

In [438]:
gpu_models = gpu_modify['gpuModel']
invalid_gpu = ~laptop_modify['GPU_MODEL'].isin(gpu_models)
invalid_gpu_laptops = laptop_modify[invalid_gpu]
list(invalid_gpu_laptops.GPU_MODEL.unique())

['4060',
 '4080',
 '4070',
 '4090',
 '4050',
 '5625U',
 '5825U',
 '7730U',
 '5875U',
 'M1 8-core CPU/7-core GPU',
 '16‑core GPU',
 '16‑Core GPU',
 '32‑core GPU',
 '7530U',
 '4680U',
 '14‑core GPU',
 'M2 8-core CPU/10-core GPU',
 'M1 8-core CPU/8-core GPU',
 'M2 Chip 8-core CPU/10-core GPU',
 '38‑core GPU',
 '4980U',
 '32‑Core GPU',
 'M2 8-core CPU/8-core GPU',
 '19‑core GPU',
 '30‑core GPU',
 '5675U',
 '2000',
 'A1000',
 'A5500',
 '3500']

Unfortunately, these GPUs are not in the current GPU dataset and so will be ignored for now.

They are a mixture of Apple silicon chips, for which the Passmark dataset has no entries, newer GPUs which were not yet collected when the dataset was generated, and integrated AMD graphics recorded under the CPU model number (for example 4680U).

In [439]:
print('Number of laptops without matching GPU: ' + str(invalid_gpu_laptops.PRODUCTCODE.count()))
print('Number of Apple laptops: ' + str(sum(invalid_gpu_laptops.LAPTOP_BRAND == 'Apple')))
print('Number of laptops without matching GPU (excluding Apple): ' + str(sum(invalid_gpu_laptops.LAPTOP_BRAND != 'Apple')))
print('\nNumber of laptops with matching GPU: ' + str(laptop_modify[laptop_modify['GPU_MODEL'].isin(gpu_models)].PRODUCTCODE.count()))

Number of laptops without matching GPU: 161
Number of Apple laptops: 45
Number of laptops without matching GPU (excluding Apple): 116

Number of laptops with matching GPU: 514
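The same match check can also be written as a left merge with `indicator=True`, which attaches the benchmark score and flags unmatched rows in one step. This is a sketch on toy data, not the notebook's method: the product codes and the second score are made up, and it assumes gpuModel is unique in the GPU table (otherwise the merge would repeat laptop rows).

```python
import pandas as pd

# Toy stand-ins for laptop_modify and gpu_modify (values are illustrative)
laptops = pd.DataFrame({
    'PRODUCTCODE': ['toy-1', 'toy-2', 'toy-3'],
    'GPU_MODEL': ['3070 Ti', '4060', 'HD 5600'],
})
gpus = pd.DataFrame({
    'gpuModel': ['3070 Ti', 'HD 5600'],
    'G3Dmark': [18490.0, 712.0],
})

# how='left' keeps every laptop; the _merge column records whether a match was found
merged = laptops.merge(
    gpus, how='left', left_on='GPU_MODEL', right_on='gpuModel', indicator=True
)
unmatched = merged[merged['_merge'] == 'left_only']

print(len(unmatched), 'of', len(merged), 'laptops have no matching GPU')
```

The `isin` approach used in the notebook gives the same unmatched set; the merge is mainly useful when the benchmark columns are needed alongside the laptop rows.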


## Export data to CSVs

In [440]:
# Create csv file with the transformed laptop data
laptop_modify.set_index('PRODUCTCODE', inplace=True)
laptop_modify.to_csv(cwd+'/datasets/laptop.csv', quotechar="'")

In [441]:
# Create csv file with the transformed cpu data
cpu_modify.to_csv(cwd+'/datasets/cpu.csv', quotechar="'")

In [442]:
# Create csv file with the transformed gpu data
gpu_modify.to_csv(cwd+'/datasets/gpu.csv', quotechar="'")
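Because the files are written with `quotechar="'"`, they need to be read back with the same setting, otherwise fields containing commas would be split. The round trip below is a small sketch using an in-memory buffer instead of the real files; the two rows are copied from the GPU output above and `toy`, `buffer`, and `restored` are illustrative names.

```python
import io
import pandas as pd

toy = pd.DataFrame(
    {'G3Dmark': [716.0, 712.0]},
    index=pd.Index(
        ['Radeon R5 A10-9600P RADEON R5, 10 COMPUTE CORES 4C', 'Intel HD 5600'],
        name='gpuName',
    ),
)

# Write with single-quote quoting, as in the export cells above
buffer = io.StringIO()
toy.to_csv(buffer, quotechar="'")

# Read back with the same quotechar so the comma inside the first name is preserved
buffer.seek(0)
restored = pd.read_csv(buffer, quotechar="'", index_col='gpuName')
print(restored)
```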