----
# Data Cleaning
----

### Notebook Overview

In this notebook, I will clean the scraped dataset and prepare it for exploratory data analysis (EDA). The key steps are:

- **Removing Duplicates:** Dropping duplicated rows to ensure accurate analysis.

- **Handling Missing Values:** Addressing missing values within the dataset to ensure high data quality.

- **Feature Engineering:** Enhancing the dataset by extracting relevant features from product descriptions.

## Set Up

In [179]:
import numpy as np
import pandas as pd
import re
import matplotlib.colors  # get_colour uses matplotlib.colors.CSS4_COLORS


## Data Loading
----

In [180]:
df = pd.read_csv('../../data/headphones_data.csv', index_col = 0)


## Utility Functions

In [181]:
def df_check(df):
    '''
    Outputs quality measures for dataframes

    Parameters
    ---------
    df: DataFrame for quality check

    Returns
    -------
    Statements with data quality info such as shape, duplicated values, missing values
    '''
    
    shape = df.shape
    # Calling sum twice: the first sum returns per-column counts, the second adds them into the total number of null values
    null_vals = df.isna().sum().sum()
    duplicated_rows = df.duplicated().sum()
    duplicated_cols = df.columns.duplicated().sum()

    print (
    f"""
    Data Quality Checks:
    --------------------------------------------
    No. of rows: {shape[0]}
    No. of columns: {shape[1]}
    No. of missing values: {null_vals}
    No. of duplicated rows: {duplicated_rows}
    No. of duplicated columns: {duplicated_cols}
    """
)
    


In [182]:
def search_description(description, regexp):
    '''
    Outputs binary value

    Parameters
    ---------
    description: string of product description
    regexp: regular expression

    Returns
    -------
    1 if regexp is present in description, 0 if not
    '''
    if re.search(regexp, description.lower()):
        return 1
    else:
        return 0

In [183]:
def get_colour(description):
    '''
    Outputs colour in product description

    Parameters
    ---------
    description: string of product description

    Returns
    -------
    Colour mentioned in the description
    '''
    
    # Using matplotlib to get list of colours (instead of manually creating a list)    
    colour_names = matplotlib.colors.CSS4_COLORS.keys()

    # Looping through the list of colours to see if any of the colours are in the product description 
    for colour in colour_names:
        if re.search(rf'\b{colour}\b', description):
            return colour
        else:
            # if first colour in the list is not found try the next
            continue
    
    # Cases where no colour in colour_names is found in the description
    return 'Not Specified'
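
One property of the loop above is worth illustrating: it returns the first colour *in the colour list* that appears in the description, not the colour that appears first in the text. The sketch below contrasts the two behaviours using a short, hypothetical colour list (`COLOURS`) in place of `matplotlib.colors.CSS4_COLORS`; the function names and the sample description are illustrative assumptions, not part of the notebook.

```python
import re

# Hypothetical subset of colour names, in an arbitrary fixed order.
COLOURS = ["black", "blue", "gold", "pink", "red", "white"]

# One alternation, compiled once; \b keeps "red" from matching inside "wired".
COLOUR_RE = re.compile(r"\b(" + "|".join(map(re.escape, COLOURS)) + r")\b")

def first_colour_in_text(description):
    """Return the colour that appears earliest in the description."""
    match = COLOUR_RE.search(description)
    return match.group(1) if match else "Not Specified"

def first_colour_in_list(description):
    """Mimic the loop above: return the first listed colour that is present."""
    for colour in COLOURS:
        if re.search(rf"\b{colour}\b", description):
            return colour
    return "Not Specified"

text = "kids headphones (pink) with black cable"
print(first_colour_in_text(text))  # pink
print(first_colour_in_list(text))  # black
```

Which behaviour is preferable depends on the data; for descriptions that mention a single colour the two approaches agree.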
    

In [184]:
def get_battery_life(description):
    '''
    Outputs battery life listed in product description

    Parameters
    ---------
    description: string of product description

    Returns
    -------
    Battery life in hours as a string of digits, or 'Not Specified' if no match
    '''

    regexp = r'(\b[1-9]\d*)\s*(battery|batteries|hours?|hrs?|h)'
    # Searching once and reusing the match object
    match = re.search(regexp, description)
    if match:
        # using .group(1) to only get the numeric part of the match
        return match.group(1)

    return 'Not Specified'
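
To see how the battery-life pattern behaves, the sketch below runs it on three fragments taken from descriptions in this dataset plus one hypothetical fragment ("pack of 2 headphones for kids"). Because the final `h` alternative has no closing word boundary, a number followed by a word starting with "h" also matches. `STRICT` is a hypothetical tighter variant (time units only, closed by `\b`, result converted to `int`); it is an assumption for illustration, not the pattern used in this notebook.

```python
import re

# Pattern as used in get_battery_life above.
ORIGINAL = re.compile(r'(\b[1-9]\d*)\s*(battery|batteries|hours?|hrs?|h)')
# Hypothetical stricter variant: time units only, closed by a word boundary.
STRICT = re.compile(r'\b([1-9]\d*)\s*(?:hours?|hrs?|h)\b')

def battery_hours(description, pattern=STRICT):
    """Return the matched number as an int, or None when nothing matches."""
    match = pattern.search(description.lower())
    return int(match.group(1)) if match else None

samples = [
    "memory foam ear cups,40h playtime",
    "100 hours playtime - brown",
    "hd mic,12hrs playtime deep bass",
    "pack of 2 headphones for kids",  # hypothetical description
]
for text in samples:
    print(text, "|", battery_hours(text, ORIGINAL), "|", battery_hours(text, STRICT))
```

On the last sample the original pattern returns 2 while the stricter one returns `None`; the first three samples give 40, 100 and 12 with both patterns.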

## Preliminary Checks

In [185]:
df_check(df)


    Data Quality Checks:
    --------------------------------------------
    No. of rows: 1500
    No. of columns: 5
    No. of missing values: 688
    No. of duplicated rows: 396
    No. of duplicated columns: 0
    


In [186]:
df.info() # Checking data types

<class 'pandas.core.frame.DataFrame'>
Index: 1500 entries, 0 to 1499
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Product ID   1500 non-null   object
 1   Description  1500 non-null   object
 2   Price        1500 non-null   object
 3   Rating       812 non-null    object
 4   Is Prime     1500 non-null   int64 
dtypes: int64(1), object(4)
memory usage: 70.3+ KB


### Dealing with the Duplicates

In [187]:
df[df.duplicated(keep=False)].sort_values(by = 'Product ID', ascending=False) # quick 4 eyes check of duplicated rows

Unnamed: 0,Product ID,Description,Price,Rating,Is Prime
1145,B0DD2WMQ1N,"Ear Clip Headphone, Open Ear, Wireless Headphones, 5.3 Bone Conduction Headphones, Multifunctional Touch Control, Fast Transmission Earphone, Ideal for Adults",12.79,,0
1137,B0DD2WMQ1N,"Ear Clip Headphone, Open Ear, Wireless Headphones, 5.3 Bone Conduction Headphones, Multifunctional Touch Control, Fast Transmission Earphone, Ideal for Adults",12.79,,0
455,B0DCYYFF3G,"Wireless Headphones for TV Watching, FM Wireless Radio, Wireless Monitor, Wireless Headset with Digital Decorder, Wired Headset, Over-Ear Wireless Headset for Adult and Senior, Compatible Computer TV",29.98,,0
1080,B0DCYYFF3G,"Wireless Headphones for TV Watching, FM Wireless Radio, Wireless Monitor, Wireless Headset with Digital Decorder, Wired Headset, Over-Ear Wireless Headset for Adult and Senior, Compatible Computer TV",29.98,,0
66,B0DBPQCCM6,"Wireless Earbuds, In Ear Headphones Bluetooth 5.3 with HiFi Stereo Deep Bass, Bluetooth Earbuds Noise Cancelling with 4 ENC Mic, LED Display, 40H Playtime, IPX7 Waterproof Earphones for Android iOS",19.99,4.8 out of 5 stars,0
...,...,...,...,...,...
102,B00FZLV9L8,"Over Ear Wireless Bluetooth Headphones with Mic - August EP650 - Custom App for Easy EQ Sound Control, aptX Low Latency, NFC, Rich Bass Clear Sound, 30 days Stand By High-Performance Comfort [Red]",44.95,4.3 out of 5 stars,1
1471,B00FZLV9G8,"Over Ear Wireless Bluetooth Headphones with Mic - August EP650 - Custom App for Easy EQ Sound Control, aptX Low Latency, NFC, Rich Bass Clear Sound, 30 days Stand By High-Performance Comfort [White]",39.95,4.3 out of 5 stars,1
811,B00FZLV9G8,"Over Ear Wireless Bluetooth Headphones with Mic - August EP650 - Custom App for Easy EQ Sound Control, aptX Low Latency, NFC, Rich Bass Clear Sound, 30 days Stand By High-Performance Comfort [White]",39.95,4.3 out of 5 stars,1
1237,B00F54Y6GU,"Over Ear Wireless Bluetooth Headphones with Mic - August EP650 - Custom App for Easy EQ Sound Control, aptX Low Latency, NFC, Rich Bass Clear Sound, 30 days Stand By High-Performance Comfort [Black]",39.95,4.3 out of 5 stars,1


In [188]:
duplicated = df[df.duplicated(keep='first')].sort_values(by = 'Product ID', ascending=False)
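
The difference between `keep=False` and `keep='first'` is easy to miss, so here is a minimal sketch on a toy frame (the values are made up for illustration). `keep=False` flags every copy of a repeated row, which is useful for inspection, while `keep='first'` flags only the later copies, which are the rows `drop_duplicates()` removes by default.

```python
import pandas as pd

# Toy data: "A" appears three times, "C" twice, "B" once.
toy = pd.DataFrame({
    "Product ID": ["A", "A", "A", "B", "C", "C"],
    "Price": ["9.99", "9.99", "9.99", "5.00", "7.50", "7.50"],
})

all_copies = toy.duplicated(keep=False)       # every row that has a twin
later_copies = toy.duplicated(keep="first")   # only the repeats after the first

print(all_copies.sum())            # 5
print(later_copies.sum())          # 3
print(len(toy.drop_duplicates()))  # 3
```

This matches the checks in this notebook: `df_check` counts duplicates with the default `keep='first'`, so 1500 rows minus 396 duplicated rows leaves 1104 rows after `drop_duplicates()`.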

In [189]:
# using size to count number of occurrences of each duplicated headphone
duplicated.groupby(['Product ID'])[['Product ID']].size().sort_values(ascending=False)


Product ID
B09TXBWYRF    47
B082P6L3T5    44
B09PQSVFQT    29
B00N3UC444    27
B07SNBHTKD    23
              ..
B0BYMD6JR9     1
B0BZZ1XH2J     1
B0C1N93S1L     1
B0C2C6Q5MV     1
B0DD2WMQ1N     1
Length: 157, dtype: int64

In [190]:
df[df['Product ID'] == 'B0DD2WMQ1N']

Unnamed: 0,Product ID,Description,Price,Rating,Is Prime
1137,B0DD2WMQ1N,"Ear Clip Headphone, Open Ear, Wireless Headphones, 5.3 Bone Conduction Headphones, Multifunctional Touch Control, Fast Transmission Earphone, Ideal for Adults",12.79,,0
1145,B0DD2WMQ1N,"Ear Clip Headphone, Open Ear, Wireless Headphones, 5.3 Bone Conduction Headphones, Multifunctional Touch Control, Fast Transmission Earphone, Ideal for Adults",12.79,,0


In [191]:
# Dropping duplicates
df = df.drop_duplicates()

In [192]:
# Re-checking dataframe after dropping the duplicated rows
df_check(df)


    Data Quality Checks:
    --------------------------------------------
    No. of rows: 1104
    No. of columns: 5
    No. of missing values: 599
    No. of duplicated rows: 0
    No. of duplicated columns: 0
    


### Dealing with Missing Values

In [193]:
df.isna().sum()

Product ID       0
Description      0
Price            0
Rating         599
Is Prime         0
dtype: int64

In [194]:
# Viewing the rows with a null rating
df[df['Rating'].isna()]

Unnamed: 0,Product ID,Description,Price,Rating,Is Prime
202,B0CW9FL8BT,"Noise Cancelling Headphones, Wireless over Ear Bluetooth Headphones, With Mic 3-in-1multi-function Headset Folded and Stored Easily Memory Foam Ear Cups for Travel, Home Office #9",9.09,,0
281,B09FLTJQWB,"NCRD Wireless Noise Canceling Overhead Headphones with Mic, Over Ear Wireless Bluetooth Headphones, Deep Bass, for Adults, TV, Online Class, Home Office (Color : Red)",505.30,,0
308,B0D89ZZK43,"PCKOBEVER Bluetooth Wireless Headphones,Cute Cat Ear Earphone For Kids,Over Ear Headsets Foldable Stereo Headphones LED Light Up,Wireless Headphones With Microphone For Kids Girls Boys(Pink)",10.99,,0
310,B09FLT6T8X,"NCRD Quiet Comfort Wireless Bluetooth Headphones, Noise-Cancelling, Wireless Bluetooth Headphones, Deep Bass, Hi-Fi Sound, 20H Playtime Headset for Adults, TV, Online Class, Home Office",337.75,,0
313,B0BNWYLGCS,"Bewinner Cute Cat Ear Gaming Headphones, LED Lights 3.5mm Wired Wireless BT Foldable Gaming Headset for PC, Laptop Headset with Noise Canceling Microphone for Gift, Game (Green)",36.90,,1
...,...,...,...,...,...
1495,B0CYT9HGRJ,"GeRRiT Bluetooth Headset, Bluetooth Earpiece with MIC for Business, Office and Driving, Trucker Bluetooth Headset with Charging Case, in- Ear Headphones Wireless Earphones",175.49,,0
1496,B0CP88GBB5,ARTSZY Wireless Headphone Foldable Headset with Deep Bass Stereo Radio 3 Modes Microphone Earphone,179.63,,0
1497,B0CP87KH1G,ARTSZY Wireless Earbuds Bluetooth 5.0 Waterproof Touch Control Wireless Bluetooth Earbuds with Mic Earphone in-Ear Deep Bass Built-in Mic Bluetooth Headphones,179.12,,0
1498,B0CP87FZ6Q,ARTSZY Wireless Earbuds Bluetooth Waterproof Touch Control Wireless Bluetooth Earbuds with Mic Earphones in-Ear Deep Bass Built-in Mic Bluetooth Headphones,176.34,,0


**Comment:**

I have decided to delete rows with null ratings, even though this removes about half of my dataset. In this project, my aim is to keep the data used in the recommender system as authentic as possible.

Imputing missing ratings with the mean or any other value could introduce bias and potentially skew recommendations, given the large number of missing values.

While reducing the dataset size by roughly 50% is significant, I believe that retaining only authentic, untouched data will improve the reliability of the recommender system.
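
As a quick check on the "about half" figure, the arithmetic can be worked through from the numbers reported above (1104 rows after removing duplicates, 599 of them with a missing rating). The variable names are illustrative.

```python
rows_after_dedup = 1104   # from the df_check output after drop_duplicates
missing_ratings = 599     # from df.isna().sum()

rows_kept = rows_after_dedup - missing_ratings
share_dropped = missing_ratings / rows_after_dedup

print(rows_kept)               # 505
print(f"{share_dropped:.1%}")  # 54.3%
```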

In [195]:
# Dropping the rows with the missing rating 
df = df.dropna()

In [196]:
# Re-checking dataframe
df_check(df)


    Data Quality Checks:
    --------------------------------------------
    No. of rows: 505
    No. of columns: 5
    No. of missing values: 0
    No. of duplicated rows: 0
    No. of duplicated columns: 0
    


### Dealing with Not Specified 

When scraping, I added the placeholder 'Not Specified' wherever information was not present. Rows containing this placeholder need to be removed from columns where a real value is required, in this case `Price`.
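
The filtering step can be sketched on a toy frame (made-up prices). The boolean mask is the approach used in this notebook; the `pd.to_numeric(..., errors="coerce")` check is an optional alternative, not part of the original workflow, that would flag any non-numeric value rather than one specific placeholder.

```python
import pandas as pd

toy = pd.DataFrame({"Price": ["19.99", "Not Specified", "14.99", "Not Specified"]})

# Boolean mask: True where a real price was scraped.
has_price = toy["Price"] != "Not Specified"
print(has_price.sum())  # 2

# Alternative check: anything that cannot be parsed as a number becomes NaN.
as_number = pd.to_numeric(toy["Price"], errors="coerce")
print(as_number.isna().sum())  # 2

# Keep only the rows with a real price and renumber them.
filtered = toy[has_price].reset_index(drop=True)
print(filtered["Price"].tolist())  # ['19.99', '14.99']
```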

In [197]:
df.columns       

Index(['Product ID', 'Description', 'Price', 'Rating', 'Is Prime'], dtype='object')

In [198]:
df['Price'].value_counts()

Price
Not Specified    27
19.99            27
14.99            19
15.99            18
24.99            15
                 ..
9.79              1
31.20             1
8.88              1
15.19             1
176.50            1
Name: count, Length: 255, dtype: int64

In [199]:
df = df[df['Price'] != 'Not Specified']

In [200]:
df['Price'].value_counts()

Price
19.99     27
14.99     19
15.99     18
24.99     15
17.99     14
          ..
9.79       1
31.20      1
8.88       1
15.19      1
176.50     1
Name: count, Length: 254, dtype: int64

In [201]:
df_check(df)


    Data Quality Checks:
    --------------------------------------------
    No. of rows: 478
    No. of columns: 5
    No. of missing values: 0
    No. of duplicated rows: 0
    No. of duplicated columns: 0
    


### Resetting the Index

In [202]:
df = df.reset_index(drop = True)

## Product Description
----

In [203]:
df['Description']

0      INFURTURE Active Noise Cancelling Headphones, H1 Wireless Over Ear Bluetooth Headphones, Deep Bass Headset, Low Latency, Memory Foam Ear Cups,40H Playtime, for Adults, Kids, TV, Travel, Home Office
1                                                Lindy NC-60 Wired Active Noise Cancelling (ANC) Headphones, 40mm Drivers, Comfortable, Light, Carrycase, 1.5m Audio Cable, 3.5mm Jack, 6.3mm Adapter, Black
2                                                                                                                                                      Sony MDRZX310L.AE Foldable Headphones - Metallic Blue
3                                                                                                                                              Sony MDR-ZX110 Overhead Headphones - Black , BASIC, Pack of 1
4                  LORELEI X6 Over-Ear Headphones with Microphone, Lightweight Foldable & Portable Stereo Bass Headphones with 1.45M No-Tangle, Wired Headphones for Smartphone Tabl

### Feature Engineering - Wireless

In [204]:
df['Description'] = df['Description'].str.lower()

In [205]:
# Passing the regexp to search_description through the args parameter of apply
# args expects a tuple, hence the trailing comma
df['Wireless'] = df['Description'].apply(search_description, args=(r'\bwireless\b',))
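
For reference, here is a small sketch of the `args=(pattern,)` idiom next to a vectorised equivalent, `Series.str.contains`. The trailing comma matters because `(pattern)` without it is just the string itself, not a one-element tuple. The toy descriptions are made up, and `search_description` is redefined in compact form only so the block runs on its own.

```python
import re
import pandas as pd

def search_description(description, regexp):
    """Same contract as the helper defined earlier: 1 on a match, else 0."""
    return 1 if re.search(regexp, description.lower()) else 0

toy = pd.Series(["Wireless over-ear headset", "wired earbuds", "Wirelessly charged case"])
pattern = r"\bwireless\b"

# Row-by-row: apply passes each description plus the extra positional argument.
via_apply = toy.apply(search_description, args=(pattern,))

# Vectorised: case-insensitive regex match, then booleans converted to 0/1.
via_contains = toy.str.contains(pattern, case=False, regex=True).astype(int)

print(via_apply.tolist())     # [1, 0, 0]
print(via_contains.tolist())  # [1, 0, 0]
```

The word boundaries keep "Wirelessly" from counting as a match in both versions.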

### Feature Engineering - Noise Cancelling

In [206]:
df['Noise Cancelling'] = df['Description'].apply(search_description, args=(r'\bnoise[-\s]?cancelling\b',))

In [207]:
df['Noise Cancelling'].value_counts()

Noise Cancelling
0    411
1     67
Name: count, dtype: int64

### Feature Engineering - Colour


In [208]:
df['Description']

0      infurture active noise cancelling headphones, h1 wireless over ear bluetooth headphones, deep bass headset, low latency, memory foam ear cups,40h playtime, for adults, kids, tv, travel, home office
1                                                lindy nc-60 wired active noise cancelling (anc) headphones, 40mm drivers, comfortable, light, carrycase, 1.5m audio cable, 3.5mm jack, 6.3mm adapter, black
2                                                                                                                                                      sony mdrzx310l.ae foldable headphones - metallic blue
3                                                                                                                                              sony mdr-zx110 overhead headphones - black , basic, pack of 1
4                  lorelei x6 over-ear headphones with microphone, lightweight foldable & portable stereo bass headphones with 1.45m no-tangle, wired headphones for smartphone tabl

In [209]:
# Apply the function to each row in the Description column
df['Colour'] = df['Description'].apply(get_colour)

In [210]:
df['Colour'].value_counts()

Colour
Not Specified    188
black            109
blue              41
pink              36
green             19
white             17
gold              15
purple            13
red               11
grey               7
silver             5
orange             5
yellow             3
gray               2
beige              2
ivory              1
brown              1
navy               1
violet             1
cyan               1
Name: count, dtype: int64

In [211]:
df['Colour'] = df['Colour'].replace('gray', 'grey')

In [212]:
df

Unnamed: 0,Product ID,Description,Price,Rating,Is Prime,Wireless,Noise Cancelling,Colour
0,B08HDBZNZ9,"infurture active noise cancelling headphones, h1 wireless over ear bluetooth headphones, deep bass headset, low latency, memory foam ear cups,40h playtime, for adults, kids, tv, travel, home office",49.99,4.3 out of 5 stars,1,1,1,Not Specified
1,B074DZ39QJ,"lindy nc-60 wired active noise cancelling (anc) headphones, 40mm drivers, comfortable, light, carrycase, 1.5m audio cable, 3.5mm jack, 6.3mm adapter, black",58.78,4.3 out of 5 stars,1,0,1,black
2,B00I3LUYNG,sony mdrzx310l.ae foldable headphones - metallic blue,18.00,4.5 out of 5 stars,0,0,0,blue
3,B00NBR70DO,"sony mdr-zx110 overhead headphones - black , basic, pack of 1",15.99,4.5 out of 5 stars,0,0,0,black
4,B083P1HG9S,"lorelei x6 over-ear headphones with microphone, lightweight foldable & portable stereo bass headphones with 1.45m no-tangle, wired headphones for smartphone tablet mp3 / 4 (space black)",15.99,4.4 out of 5 stars,0,0,0,black
...,...,...,...,...,...,...,...,...
473,B0DC32R4K9,"rydohi wireless bluetooth headphones over ear, hi-fi stereo headset with deep bass, foldable and lightweight, wired and wireless modes built in mic for cell phones, tv, pc and traveling (pink)",14.90,4.4 out of 5 stars,0,1,0,pink
474,B09BW1BHC3,"shokz openrun pro, [england athletics recommended] bone conduction headphones, open-ear sports earphones with mic, ip55 waterproof bluetooth wireless headset for running workout driving (blue)",159.95,4.5 out of 5 stars,1,1,0,blue
475,B09D43P2G4,bbyogooz earbuds for kids with storage case cute kids earbud with mic microphone for school wired in-ear headphones for girls boys adultskids earbuds (pink unicorn),15.44,4.2 out of 5 stars,0,0,0,pink
476,B098TG1QM1,hngm headphone stand bamboo wood aluminum headphone stand gaming headset earphone display rack hanger holder bracket headsets storage accessories (color : silver),19.24,4.7 out of 5 stars,0,0,0,silver


### Feature Engineering - Battery Life

In [213]:
df['Battery Life'] = df['Description'].apply(get_battery_life)

### Feature Engineering - Microphone

In [214]:
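# The trailing s? allows plurals; a bare ? after the last letter would make that letter optional instead
df['Microphone'] = df['Description'].apply(search_description, args=(r'\b(mics?|microphones?)\b',))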
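
A note on the `?` quantifier in patterns of this kind: it makes only the immediately preceding character optional, so `mic?` matches `mi` or `mic`, whereas `mics?` matches `mic` or `mics`. The sketch below compares the two spellings on three short strings; the strings and the names `LOOSE` and `PLURAL` are illustrative assumptions.

```python
import re

LOOSE = re.compile(r"\b(mic?|microphone?)\b")     # "?" makes the final letter optional
PLURAL = re.compile(r"\b(mics?|microphones?)\b")  # optional trailing "s" instead

for text in ["headset with mic", "dual microphones", "mi band accessory"]:
    print(text, bool(LOOSE.search(text)), bool(PLURAL.search(text)))

# headset with mic True True
# dual microphones False True
# mi band accessory True False
```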

### Feature Engineering - Over Ear

In [215]:
df['Over Ear'] = df['Description'].apply(search_description, args=(r'\b(over[\s-]ear|overhead)\b',))

In [216]:
df[df['Over Ear'] == 0]

Unnamed: 0,Product ID,Description,Price,Rating,Is Prime,Wireless,Noise Cancelling,Colour,Battery Life,Microphone,Over Ear
1,B074DZ39QJ,"lindy nc-60 wired active noise cancelling (anc) headphones, 40mm drivers, comfortable, light, carrycase, 1.5m audio cable, 3.5mm jack, 6.3mm adapter, black",58.78,4.3 out of 5 stars,1,0,1,black,Not Specified,0,0
2,B00I3LUYNG,sony mdrzx310l.ae foldable headphones - metallic blue,18.00,4.5 out of 5 stars,0,0,0,blue,Not Specified,0,0
6,B0D5D294RR,"marshall major v bluetooth wireless headphones, 100 hours playtime - brown",129.99,4.6 out of 5 stars,0,1,0,brown,100,0,0
7,B0C8V45ZF5,roxel rx-90 wired headphones with microphone - lightweight on ear headphones for android/ios devices - comfortable head cushion ergonomic - answer incoming calls - perfect for music lovers (black),12.99,4.3 out of 5 stars,0,0,0,black,Not Specified,1,0
10,B00I3LV336,"sony zx310ap on-ear headphones compatible with smartphones, tablets and mp3 devices - metallic black",18.99,4.4 out of 5 stars,0,0,0,black,Not Specified,0,0
...,...,...,...,...,...,...,...,...,...,...,...
472,B0CBTZCXK1,"qaekie bone conduction headphones - bluetooth 5.3 open ear headphones with hd mic,12hrs playtime deep bass sport wireless headphones,sweatproof bone headphones for running,cycling,hiking,driving",67.29,3.9 out of 5 stars,0,1,0,Not Specified,12,1,0
474,B09BW1BHC3,"shokz openrun pro, [england athletics recommended] bone conduction headphones, open-ear sports earphones with mic, ip55 waterproof bluetooth wireless headset for running workout driving (blue)",159.95,4.5 out of 5 stars,1,1,0,blue,Not Specified,1,0
475,B09D43P2G4,bbyogooz earbuds for kids with storage case cute kids earbud with mic microphone for school wired in-ear headphones for girls boys adultskids earbuds (pink unicorn),15.44,4.2 out of 5 stars,0,0,0,pink,Not Specified,1,0
476,B098TG1QM1,hngm headphone stand bamboo wood aluminum headphone stand gaming headset earphone display rack hanger holder bracket headsets storage accessories (color : silver),19.24,4.7 out of 5 stars,0,0,0,silver,Not Specified,0,0


### Feature Engineering - Gaming

In [217]:
df['Gaming'] = df['Description'].apply(search_description, args=(r'\bgaming\b',))

### Feature Engineering - Foldable

In [218]:
df['Foldable'] = df['Description'].apply(search_description, args=(r'\bfoldable\b',))

### Feature Engineering - Brand

In [219]:
import spacy

In [220]:
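# Load the pre-trained model once, rather than on every call
nlp = spacy.load("en_core_web_sm")

def get_brand(description):
    # Process the text
    doc = nlp(description)
    # Extract named entities
    for ent in doc.ents:
        # Note: only the first entity is inspected, because both branches return
        if ent.label_ == "ORG":  # ORG for organisations/brands
            return ent.text
        else:
            return 'Unknown Brand'
    # If no entities are found, the function implicitly returns None (an empty Brand value)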

In [221]:
df['Brand'] = df['Description'].apply(get_brand)

In [222]:
df

Unnamed: 0,Product ID,Description,Price,Rating,Is Prime,Wireless,Noise Cancelling,Colour,Battery Life,Microphone,Over Ear,Gaming,Foldable,Brand
0,B08HDBZNZ9,"infurture active noise cancelling headphones, h1 wireless over ear bluetooth headphones, deep bass headset, low latency, memory foam ear cups,40h playtime, for adults, kids, tv, travel, home office",49.99,4.3 out of 5 stars,1,1,1,Not Specified,40,0,1,0,0,Unknown Brand
1,B074DZ39QJ,"lindy nc-60 wired active noise cancelling (anc) headphones, 40mm drivers, comfortable, light, carrycase, 1.5m audio cable, 3.5mm jack, 6.3mm adapter, black",58.78,4.3 out of 5 stars,1,0,1,black,Not Specified,0,0,0,0,Unknown Brand
2,B00I3LUYNG,sony mdrzx310l.ae foldable headphones - metallic blue,18.00,4.5 out of 5 stars,0,0,0,blue,Not Specified,0,0,0,1,sony
3,B00NBR70DO,"sony mdr-zx110 overhead headphones - black , basic, pack of 1",15.99,4.5 out of 5 stars,0,0,0,black,Not Specified,0,1,0,0,sony
4,B083P1HG9S,"lorelei x6 over-ear headphones with microphone, lightweight foldable & portable stereo bass headphones with 1.45m no-tangle, wired headphones for smartphone tablet mp3 / 4 (space black)",15.99,4.4 out of 5 stars,0,0,0,black,Not Specified,1,1,0,1,Unknown Brand
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
473,B0DC32R4K9,"rydohi wireless bluetooth headphones over ear, hi-fi stereo headset with deep bass, foldable and lightweight, wired and wireless modes built in mic for cell phones, tv, pc and traveling (pink)",14.90,4.4 out of 5 stars,0,1,0,pink,Not Specified,1,1,0,1,
474,B09BW1BHC3,"shokz openrun pro, [england athletics recommended] bone conduction headphones, open-ear sports earphones with mic, ip55 waterproof bluetooth wireless headset for running workout driving (blue)",159.95,4.5 out of 5 stars,1,1,0,blue,Not Specified,1,0,0,0,ip55
475,B09D43P2G4,bbyogooz earbuds for kids with storage case cute kids earbud with mic microphone for school wired in-ear headphones for girls boys adultskids earbuds (pink unicorn),15.44,4.2 out of 5 stars,0,0,0,pink,Not Specified,1,0,0,0,
476,B098TG1QM1,hngm headphone stand bamboo wood aluminum headphone stand gaming headset earphone display rack hanger holder bracket headsets storage accessories (color : silver),19.24,4.7 out of 5 stars,0,0,0,silver,Not Specified,0,0,1,0,


In [223]:
df['Brand'].value_counts()

Brand
Unknown Brand                       297
sony                                  8
c8                                    5
rgb                                   4
doqaus bluetooth headphones over      3
radio & wired                         3
louise & mann                         3
android                               2
osszit kids headphones                2
jyps                                  2
usb                                   2
jyps kids wireless                    2
betron                                2
jyps kids wireless headphones         1
netagon                               1
philips tat8506wt                     1
panasonic                             1
xosda kids                            1
samson technologies                   1
xosda bulk                            1
microphone & volume limited           1
rechargeable &                        1
ref                                   1
bowers & wilkins                      1
ukcoco                            

-----
**Comment:**

My attempt at using spaCy to extract brand names from the product descriptions did not work well.

For the majority of the dataset the brand is unknown, and most of the values that were extracted are not brands but common description terms such as wireless/bluetooth or colours. Therefore, I will drop this column from the dataframe.

In [224]:
df = df.drop(columns = ['Brand'])

## Final Clean Up
---

In [225]:
df.head(10)

Unnamed: 0,Product ID,Description,Price,Rating,Is Prime,Wireless,Noise Cancelling,Colour,Battery Life,Microphone,Over Ear,Gaming,Foldable
0,B08HDBZNZ9,"infurture active noise cancelling headphones, h1 wireless over ear bluetooth headphones, deep bass headset, low latency, memory foam ear cups,40h playtime, for adults, kids, tv, travel, home office",49.99,4.3 out of 5 stars,1,1,1,Not Specified,40,0,1,0,0
1,B074DZ39QJ,"lindy nc-60 wired active noise cancelling (anc) headphones, 40mm drivers, comfortable, light, carrycase, 1.5m audio cable, 3.5mm jack, 6.3mm adapter, black",58.78,4.3 out of 5 stars,1,0,1,black,Not Specified,0,0,0,0
2,B00I3LUYNG,sony mdrzx310l.ae foldable headphones - metallic blue,18.0,4.5 out of 5 stars,0,0,0,blue,Not Specified,0,0,0,1
3,B00NBR70DO,"sony mdr-zx110 overhead headphones - black , basic, pack of 1",15.99,4.5 out of 5 stars,0,0,0,black,Not Specified,0,1,0,0
4,B083P1HG9S,"lorelei x6 over-ear headphones with microphone, lightweight foldable & portable stereo bass headphones with 1.45m no-tangle, wired headphones for smartphone tablet mp3 / 4 (space black)",15.99,4.4 out of 5 stars,0,0,0,black,Not Specified,1,1,0,1
5,B09PQSVFQT,"kvidio bluetooth headphones over ear, 65 hours playtime wireless headphones with microphone, foldable lightweight headset with deep bass,hifi stereo sound for travel work pc cellphone (black)",14.2,4.5 out of 5 stars,0,1,0,black,65,1,1,0,1
6,B0D5D294RR,"marshall major v bluetooth wireless headphones, 100 hours playtime - brown",129.99,4.6 out of 5 stars,0,1,0,brown,100,0,0,0,0
7,B0C8V45ZF5,roxel rx-90 wired headphones with microphone - lightweight on ear headphones for android/ios devices - comfortable head cushion ergonomic - answer incoming calls - perfect for music lovers (black),12.99,4.3 out of 5 stars,0,0,0,black,Not Specified,1,0,0,0
8,B086D1Y52Q,"iclever hs18 over ear headphones with microphone - lightweight stereo headphones, adjustable foldable wired headphones with 3.5mm jack for online class/meeting/pc/phone/computer",16.99,4.5 out of 5 stars,0,0,0,Not Specified,Not Specified,1,1,0,1
9,B0BL14JNJJ,"headphones wired over ear adult, stereo hifi music headphone foldable compact wired headset",8.99,3.6 out of 5 stars,0,0,0,Not Specified,Not Specified,0,1,0,1


### Rating

Removing the `out of 5 stars` suffix from each rating, as it is redundant, and converting the remaining number to a float.

In [226]:
df['Rating'] = df['Rating'].astype('str')

In [227]:
df['Rating'] = df['Rating'].str.replace('out of 5 stars', '')

In [228]:
df['Rating'] = df['Rating'].astype(float)
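
The conversion above works even though removing the suffix leaves a trailing space (`'4.3 '`), because float parsing ignores surrounding whitespace. The sketch below shows this on a single string and, as an alternative, extracts the leading number with a regular expression; `parse_rating` is a hypothetical helper, not part of this notebook.

```python
import re

def parse_rating(text):
    """Return the leading number of a string like '4.3 out of 5 stars' as a float."""
    match = re.match(r"\s*(\d+(?:\.\d+)?)", text)
    return float(match.group(1)) if match else None

raw = "4.3 out of 5 stars"

# Same idea as the cells above: strip the suffix, then convert.
print(float(raw.replace("out of 5 stars", "")))  # 4.3

# Alternative: pull out the leading number directly.
print(parse_rating(raw))  # 4.3
```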

### Data Types

In [229]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 478 entries, 0 to 477
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Product ID        478 non-null    object 
 1   Description       478 non-null    object 
 2   Price             478 non-null    object 
 3   Rating            478 non-null    float64
 4   Is Prime          478 non-null    int64  
 5   Wireless          478 non-null    int64  
 6   Noise Cancelling  478 non-null    int64  
 7   Colour            478 non-null    object 
 8   Battery Life      478 non-null    object 
 9   Microphone        478 non-null    int64  
 10  Over Ear          478 non-null    int64  
 11  Gaming            478 non-null    int64  
 12  Foldable          478 non-null    int64  
dtypes: float64(1), int64(7), object(5)
memory usage: 48.7+ KB


## Export to CSV
----

In [230]:
cleaned_df = df.copy()

In [55]:
cleaned_df.to_csv('../../data/cleaned_headphones_data.csv')

In [231]:
cleaned_df.head(10)

Unnamed: 0,Product ID,Description,Price,Rating,Is Prime,Wireless,Noise Cancelling,Colour,Battery Life,Microphone,Over Ear,Gaming,Foldable
0,B08HDBZNZ9,"infurture active noise cancelling headphones, h1 wireless over ear bluetooth headphones, deep bass headset, low latency, memory foam ear cups,40h playtime, for adults, kids, tv, travel, home office",49.99,4.3,1,1,1,Not Specified,40,0,1,0,0
1,B074DZ39QJ,"lindy nc-60 wired active noise cancelling (anc) headphones, 40mm drivers, comfortable, light, carrycase, 1.5m audio cable, 3.5mm jack, 6.3mm adapter, black",58.78,4.3,1,0,1,black,Not Specified,0,0,0,0
2,B00I3LUYNG,sony mdrzx310l.ae foldable headphones - metallic blue,18.0,4.5,0,0,0,blue,Not Specified,0,0,0,1
3,B00NBR70DO,"sony mdr-zx110 overhead headphones - black , basic, pack of 1",15.99,4.5,0,0,0,black,Not Specified,0,1,0,0
4,B083P1HG9S,"lorelei x6 over-ear headphones with microphone, lightweight foldable & portable stereo bass headphones with 1.45m no-tangle, wired headphones for smartphone tablet mp3 / 4 (space black)",15.99,4.4,0,0,0,black,Not Specified,1,1,0,1
5,B09PQSVFQT,"kvidio bluetooth headphones over ear, 65 hours playtime wireless headphones with microphone, foldable lightweight headset with deep bass,hifi stereo sound for travel work pc cellphone (black)",14.2,4.5,0,1,0,black,65,1,1,0,1
6,B0D5D294RR,"marshall major v bluetooth wireless headphones, 100 hours playtime - brown",129.99,4.6,0,1,0,brown,100,0,0,0,0
7,B0C8V45ZF5,roxel rx-90 wired headphones with microphone - lightweight on ear headphones for android/ios devices - comfortable head cushion ergonomic - answer incoming calls - perfect for music lovers (black),12.99,4.3,0,0,0,black,Not Specified,1,0,0,0
8,B086D1Y52Q,"iclever hs18 over ear headphones with microphone - lightweight stereo headphones, adjustable foldable wired headphones with 3.5mm jack for online class/meeting/pc/phone/computer",16.99,4.5,0,0,0,Not Specified,Not Specified,1,1,0,1
9,B0BL14JNJJ,"headphones wired over ear adult, stereo hifi music headphone foldable compact wired headset",8.99,3.6,0,0,0,Not Specified,Not Specified,0,1,0,1


## Conclusion
-------



In this notebook, I have cleaned the dataset ready for EDA. Here is a review of the key steps/insights:

1. **Dropped Duplicates:** Removed duplicate entries to maintain data quality.

2. **Addressed Missing Values:** Dropped all records with null values in the `Rating` column to preserve the authenticity of the data.

3. **Feature Extraction:** Performed feature extraction to enhance the dataset, focusing on creating relevant features from product descriptions.

With the dataset now cleaned and features added, I am prepared to proceed to the next phase: EDA. 
