----
# Data Cleaning
----

### Notebook Overview

In this notebook, I will clean the scraped dataset and prepare it for exploratory data analysis (EDA). The key steps are:

- **Removing Duplicates:** Dropping duplicated rows to ensure accurate analysis.

- **Handling Missing Values:** Addressing missing values within the dataset to ensure high data quality.

- **Feature Extraction:** Enhancing the dataset by extracting relevant features from product descriptions.

## Set Up
---

In [76]:
import numpy as np
import pandas as pd
import re
import matplotlib.colors  # explicit submodule import; CSS4_COLORS is used in get_colour


## Data Loading
----

In [77]:
df = pd.read_csv('../../data/scraped_data.csv', index_col = 0)


## Utility Functions

In [78]:
def df_check(df):
    '''
    Outputs quality measures for dataframes

    Parameters
    ---------
    df: DataFrame for quality check

    Returns
    -------
    Statements with data quality info such as shape, duplicated values, missing values
    '''
    
    shape = df.shape
    # Calling sum twice: the first sum returns per-column null counts, the second totals them across all columns
    null_vals = df.isna().sum().sum()
    duplicated_rows = df.duplicated().sum()
    duplicated_cols = df.columns.duplicated().sum()

    print (
    f"""
    Data Quality Checks:
    --------------------------------------------
    No. of rows: {shape[0]}
    No. of columns: {shape[1]}
    No. of missing values: {null_vals}
    No. of duplicated rows: {duplicated_rows}
    No. of duplicated columns: {duplicated_cols}
    """
)
    


In [79]:
def search_description(description, regexp):
    '''
    Outputs binary value

    Parameters
    ---------
    description: string of product description
    regexp: regular expression

    Returns
    -------
    1 if regexp is present in description, 0 if not
    '''
    if re.search(regexp, description.lower()):
        return 1
    else:
        return 0

In [80]:
def get_colour(description):
    '''
    Outputs colour in product description

    Parameters
    ---------
    description: string of product description

    Returns
    -------
    Colour mentioned in the description
    '''
    
    # Using matplotlib to get a list of colour names (instead of manually creating one)
    colour_names = matplotlib.colors.CSS4_COLORS.keys()

    # Looping through the colour names and returning the first one found in the product description
    for colour in colour_names:
        if re.search(rf'\b{colour}\b', description):
            return colour

    # Cases where no colour in colour_names is found in the description
    return 'Not Specified'
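
One property of the loop in `get_colour` is worth noting: it returns the first colour, in the order the names are stored, that appears anywhere in the text, which is not necessarily the colour mentioned first in the description. A minimal sketch of a position-based alternative follows. The short `COLOURS` tuple and the name `first_colour_by_position` are hypothetical and used only for illustration; the notebook itself uses the full set of `matplotlib.colors.CSS4_COLORS` keys.

```python
import re

# Hypothetical short colour list; the notebook uses matplotlib's CSS4 names.
COLOURS = ("black", "blue", "gray", "grey", "red", "white")

def first_colour_by_position(description, colours=COLOURS):
    """Return the colour whose match starts earliest in the description."""
    text = description.lower()
    best = None  # (start_index, colour) of the earliest match so far
    for colour in colours:
        match = re.search(rf"\b{re.escape(colour)}\b", text)
        if match and (best is None or match.start() < best[0]):
            best = (match.start(), colour)
    if best is None:
        return "Not Specified"
    # Fold the US spelling into the UK one, as the notebook does later.
    return "grey" if best[1] == "gray" else best[1]

print(first_colour_by_position("Headphones in White with black trim"))  # white
print(first_colour_by_position("wired earbuds with mic"))  # Not Specified
```

Whether the earliest-mentioned colour is the better choice depends on how sellers write their listings, so this is a design option rather than a correction.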
    

In [81]:
def get_battery_life(description):
    '''
    Outputs battery life listed in product description

    Parameters
    ---------
    description: string of product description

    Returns
    -------
    Battery life in hours
    '''
        
    regexp = r'(\b[1-9]\d*)\s*(battery|batteries|hours?|hrs?|h)'
    # Searching once and reusing the match object
    match = re.search(regexp, description)
    if match:
        # using .group(1) to only get the numeric part of the match
        return match.group(1)
    else:
        return 'Not Specified'
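
The pattern in `get_battery_life` is fairly permissive: the bare `h` alternative has no closing word boundary, so a phrase such as "5 headphones" matches, and the `battery|batteries` alternatives capture a count of batteries rather than a duration. The sketch below shows a stricter variant under the assumption that only hour units should count. `HOURS_PATTERN` and `battery_hours` are hypothetical names, and this is not the pattern the notebook ran.

```python
import re

# Assumed variant of the notebook's pattern: hour units only, closed with \b.
HOURS_PATTERN = re.compile(r"\b([1-9]\d*)\s*(?:hours?|hrs?|h)\b")

def battery_hours(description):
    """Return the first hour count as an int, or None when none is found."""
    match = HOURS_PATTERN.search(description.lower())
    return int(match.group(1)) if match else None

print(battery_hours("kvidio bluetooth headphones over ear, 65 hours playtime"))  # 65
print(battery_hours("bulk pack of 5 headphones for classrooms"))  # None
```

Returning an `int` or `None` instead of a string placeholder would also let the column be stored as a numeric type, at the cost of reintroducing missing values.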

## Preliminary Checks

In [82]:
df_check(df)


    Data Quality Checks:
    --------------------------------------------
    No. of rows: 1503
    No. of columns: 5
    No. of missing values: 719
    No. of duplicated rows: 371
    No. of duplicated columns: 0
    


In [83]:
df.info() # Checking data types

<class 'pandas.core.frame.DataFrame'>
Index: 1503 entries, 0 to 1502
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Product ID   1503 non-null   object
 1   Description  1470 non-null   object
 2   Price        1503 non-null   object
 3   Rating       817 non-null    object
 4   Is Prime     1503 non-null   int64 
dtypes: int64(1), object(4)
memory usage: 70.5+ KB


### Dealing with Duplicates

In [84]:
df[df.duplicated(keep=False)].sort_values(by = 'Product ID', ascending=False) # quick visual check of the duplicated rows

Unnamed: 0,Product ID,Description,Price,Rating,Is Prime
1475,B0DD2WMQ1N,"Ear Clip Headphone, Open Ear, Wireless Headpho...",12.85,,0
1469,B0DD2WMQ1N,"Ear Clip Headphone, Open Ear, Wireless Headpho...",12.85,,0
1440,B0DCBMGNNZ,Shinyruo 3.5mm In Microphone Replacement Game ...,1.87,,0
1446,B0DCBMGNNZ,Shinyruo 3.5mm In Microphone Replacement Game ...,1.87,,0
672,B0D9LHK28S,Kanayu 100 Packs Kids Earbuds Bulk Basic Stude...,45.17,4.1 out of 5 stars,0
...,...,...,...,...,...
849,B00F54Y6GU,Over Ear Wireless Bluetooth Headphones with Mi...,33.95,4.3 out of 5 stars,0
1359,B00F54Y6GU,Over Ear Wireless Bluetooth Headphones with Mi...,33.95,4.3 out of 5 stars,0
126,B00F54Y6GU,Over Ear Wireless Bluetooth Headphones with Mi...,33.95,4.3 out of 5 stars,0
635,B001EOSZT4,Philips LFH2236 Stereo Headphones for dictatio...,31.44,4.3 out of 5 stars,0


In [85]:
duplicated = df[df.duplicated(keep='first')].sort_values(by = 'Product ID', ascending=False)

In [86]:
# using size to count number of occurrences of each duplicated headphone
duplicated.groupby(['Product ID'])[['Product ID']].size().sort_values(ascending=False)


Product ID
B07KY8G9NM    39
B0C8SJSL9H    34
B07SNBHTKD    28
B00N3UC444    28
B0BZD8KVM7    21
              ..
B09Z2S5VWM     1
B09Z2S3TSC     1
B09T322DTJ     1
B09RSSDC79     1
B0DD2WMQ1N     1
Length: 137, dtype: int64

In [87]:
df[df['Product ID'] == 'B0DD2WMQ1N']

Unnamed: 0,Product ID,Description,Price,Rating,Is Prime
1469,B0DD2WMQ1N,"Ear Clip Headphone, Open Ear, Wireless Headpho...",12.85,,0
1475,B0DD2WMQ1N,"Ear Clip Headphone, Open Ear, Wireless Headpho...",12.85,,0


In [88]:
# Dropping duplicates
df = df.drop_duplicates()

In [89]:
# Re-checking dataframe after dropping the duplicated rows
df_check(df)


    Data Quality Checks:
    --------------------------------------------
    No. of rows: 1132
    No. of columns: 5
    No. of missing values: 645
    No. of duplicated rows: 0
    No. of duplicated columns: 0
    


### Dealing with Missing Values

In [90]:
df.isna().sum()

Product ID       0
Description     32
Price            0
Rating         613
Is Prime         0
dtype: int64

In [91]:
# Viewing the rows with a null rating
df[df['Rating'].isna()]

Unnamed: 0,Product ID,Description,Price,Rating,Is Prime
160,B0DF2TJ4D8,,39.99,,0
212,B0DF2TJ4D8,"Foldable Wireless TV Headphones, Active Noise ...",39.99,,0
242,B0D4R8FYKD,MIXCU 48-Pack Wired Headphones Bundle with Mic...,Not Specified,,0
275,B09FLTJQWB,NCRD Wireless Noise Canceling Overhead Headpho...,505.30,,0
284,B0D7M7C4M2,"VR Headset, Virtual Reality Headset with Contr...",52.98,,0
...,...,...,...,...,...
1498,B0CDQ4BZ72,"LEONYS Bluetooth Wireless Earbuds,wireless Ear...",126.50,,0
1499,B0CXS6RCM6,"Sound Proof Headphones, 3.5mm Plug Headset Wit...",10.52,,0
1500,B0CZD5WKRF,HWYYLXS Wireless Over-Ear Headphones With Micr...,23.99,,0
1501,B0CPCFQBF9,"Bluetooth Earbuds with Flashlight, Bass Noise ...",117.33,,0


**Comment:**

I have decided to delete rows with null ratings, even though this results in losing about half of my dataset. In this project, my aim is to keep the data used in the recommender system as authentic as possible.

Imputing missing ratings with the mean or any other value could introduce bias and skew the recommendations, given the large number of missing values.

While reducing the dataset size by roughly 50% is significant, I believe that retaining only authentic, untouched data will improve the reliability of the recommender system.

I am also dropping the rows with a null description, which affects only 32 rows.

In [92]:
# Dropping all rows with a missing rating or description
df = df.dropna()

In [93]:
# Re-checking dataframe
df_check(df)


    Data Quality Checks:
    --------------------------------------------
    No. of rows: 488
    No. of columns: 5
    No. of missing values: 0
    No. of duplicated rows: 0
    No. of duplicated columns: 0
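
The row counts reported by `df_check` can be cross-checked with a little arithmetic. The sketch below uses only the figures printed in this notebook; the variable names are my own.

```python
# Figures copied from the df_check and isna() outputs above.
rows_before = 1132        # rows after dropping duplicates
rows_after = 488          # rows after dropna()
null_ratings = 613
null_descriptions = 32

rows_dropped = rows_before - rows_after
# Inclusion-exclusion: rows missing both fields are counted twice in the sum.
rows_missing_both = null_ratings + null_descriptions - rows_dropped

print(rows_dropped)                          # 644
print(rows_missing_both)                     # 1
print(round(rows_dropped / rows_before, 2))  # 0.57
```

The single row missing both fields is consistent with index 160 in the null-rating view, which has neither a description nor a rating.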
    


### Dealing with Not Specified 

When scraping, I added the placeholder 'Not Specified' wherever information was not present. These placeholders need to be removed from columns where every value is required, which here means Price.

In [94]:
df.columns       

Index(['Product ID', 'Description', 'Price', 'Rating', 'Is Prime'], dtype='object')

In [95]:
df['Price'].value_counts()

Price
15.99            22
Not Specified    22
19.99            22
14.99            17
13.99            12
                 ..
95.94             1
15.98             1
23.94             1
27.03             1
15.48             1
Name: count, Length: 270, dtype: int64

In [96]:
df = df[df['Price'] != 'Not Specified'].copy()  # copy so the later column assignment does not act on a slice

In [97]:
df['Price'].value_counts()

Price
15.99    22
19.99    22
14.99    17
13.99    12
16.99    11
         ..
95.94     1
15.98     1
23.94     1
27.03     1
15.48     1
Name: count, Length: 269, dtype: int64

In [98]:
df['Price'] = df['Price'].str.replace(',', '').astype(float)
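
The two price steps (filtering out the placeholder, then stripping thousands separators before casting) can be sketched on plain strings. `parse_price` is a hypothetical helper that is not part of the notebook, and the sample values are made up for illustration.

```python
def parse_price(raw, placeholder="Not Specified"):
    """Convert a scraped price string to float, or None for the placeholder."""
    if raw == placeholder:
        return None
    # Remove thousands separators such as the comma in "1,299.00".
    return float(raw.replace(",", ""))

samples = ["15.99", "1,299.00", "Not Specified"]
print([parse_price(s) for s in samples])  # [15.99, 1299.0, None]
```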

In [99]:
df_check(df)


    Data Quality Checks:
    --------------------------------------------
    No. of rows: 466
    No. of columns: 5
    No. of missing values: 0
    No. of duplicated rows: 0
    No. of duplicated columns: 0
    


### Resetting the Index

In [100]:
df = df.reset_index(drop = True)

## Product Description
----

In [101]:
df['Description']

0      Artix CL750 Wired Headphones with Mic & Volume...
1      Logitech G435 LIGHTSPEED & Bluetooth Wireless ...
2      Sony MDRZX310L.AE Foldable Headphones - Metall...
3      Sony MDR-ZX110 Overhead Headphones - Black , B...
4      LORELEI X6 Over-Ear Headphones with Microphone...
                             ...                        
461    JINSERTA RGB Cat Ear Headphones,Bluetooth 5.3 ...
462    Audiofly AFT2 True Wireless Bluetooth In-Ear H...
463    JINSERTA RGB Cat Ear Headphones,Bluetooth 5.3 ...
464    3.5mm Earbuds Wired Headphones for Samsung A25...
465           Koss KPH14V Side Firing Headphone (Violet)
Name: Description, Length: 466, dtype: object

### Feature Extraction - Wireless

In [102]:
df['Description'] = df['Description'].str.lower()

In [103]:
# using the args parameter of apply to pass the regexp to search_description
# args expects a tuple, hence the trailing comma
df['Wireless'] = df['Description'].apply(search_description, args=(r'\bwireless\b',))
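
The trailing comma matters because `(x)` is just `x` in parentheses, whereas `(x,)` is a one-element tuple. The sketch below shows this in plain Python, along with `functools.partial` as another way to bind the pattern. `has_pattern` is a hypothetical stand-in for `search_description`, and the sample descriptions are made up.

```python
import re
from functools import partial

def has_pattern(description, regexp):
    """Return 1 if regexp matches the lower-cased description, else 0."""
    return 1 if re.search(regexp, description.lower()) else 0

pattern = r"\bwireless\b"
print(type((pattern)).__name__)   # str
print(type((pattern,)).__name__)  # tuple

# Binding the pattern up front yields a one-argument function.
is_wireless = partial(has_pattern, regexp=pattern)
print([is_wireless(d) for d in ["Wireless earbuds", "wired headset"]])  # [1, 0]
```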

### Feature Extraction - Noise Cancelling

In [104]:
df['Noise Cancelling'] = df['Description'].apply(search_description, args=(r'\bnoise[-\s]?cancelling\b',))

In [105]:
df['Noise Cancelling'].value_counts()

Noise Cancelling
0    399
1     67
Name: count, dtype: int64
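
The pattern above only matches the British spelling "cancelling". Some listings in the scraped data use "Canceling" (for example the NCRD headphones shown earlier), so a variant with an optional second `l` may be worth considering. The sketch below is an assumption about how that could look, not what the notebook ran, so the counts above could change if it were applied.

```python
import re

# Optional second "l" covers both "cancelling" and "canceling".
NOISE_PATTERN = re.compile(r"\bnoise[-\s]?cancell?ing\b")

for text in ["active noise cancelling", "noise-canceling mic", "noise isolating"]:
    print(text, bool(NOISE_PATTERN.search(text)))
```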

### Feature Extraction - Colour


In [106]:
df['Description']

0      artix cl750 wired headphones with mic & volume...
1      logitech g435 lightspeed & bluetooth wireless ...
2      sony mdrzx310l.ae foldable headphones - metall...
3      sony mdr-zx110 overhead headphones - black , b...
4      lorelei x6 over-ear headphones with microphone...
                             ...                        
461    jinserta rgb cat ear headphones,bluetooth 5.3 ...
462    audiofly aft2 true wireless bluetooth in-ear h...
463    jinserta rgb cat ear headphones,bluetooth 5.3 ...
464    3.5mm earbuds wired headphones for samsung a25...
465           koss kph14v side firing headphone (violet)
Name: Description, Length: 466, dtype: object

In [107]:
# Apply the function to each row in the Description column
df['Colour'] = df['Description'].apply(get_colour)

In [108]:
df['Colour'].value_counts()

Colour
Not Specified    187
black            108
blue              42
pink              34
gold              16
white             16
green             16
purple            12
red               10
grey               7
orange             4
silver             3
yellow             3
navy               2
violet             2
ivory              1
gray               1
beige              1
cyan               1
Name: count, dtype: int64

In [109]:
df['Colour'] = df['Colour'].replace('gray', 'grey')

In [110]:
df

Unnamed: 0,Product ID,Description,Price,Rating,Is Prime,Wireless,Noise Cancelling,Colour
0,B087JVV8FK,artix cl750 wired headphones with mic & volume...,21.30,4.2 out of 5 stars,0,0,1,Not Specified
1,B07W7LNTM5,logitech g435 lightspeed & bluetooth wireless ...,33.24,4.2 out of 5 stars,0,1,0,black
2,B00I3LUYNG,sony mdrzx310l.ae foldable headphones - metall...,18.00,4.5 out of 5 stars,0,0,0,blue
3,B00NBR70DO,"sony mdr-zx110 overhead headphones - black , b...",14.49,4.5 out of 5 stars,0,0,0,black
4,B083P1HG9S,lorelei x6 over-ear headphones with microphone...,11.99,4.4 out of 5 stars,0,0,0,black
...,...,...,...,...,...,...,...,...
461,B0BQF1W43H,"jinserta rgb cat ear headphones,bluetooth 5.3 ...",28.99,3.8 out of 5 stars,0,1,0,Not Specified
462,B08GSSBHK3,audiofly aft2 true wireless bluetooth in-ear h...,121.62,4.1 out of 5 stars,0,1,0,Not Specified
463,B0CPXL7VT7,"jinserta rgb cat ear headphones,bluetooth 5.3 ...",28.99,3.8 out of 5 stars,0,1,0,Not Specified
464,B0CCDDPS87,3.5mm earbuds wired headphones for samsung a25...,15.50,4.1 out of 5 stars,0,0,0,blue


### Feature Extraction - Battery Life

In [111]:
df['Battery Life'] = df['Description'].apply(get_battery_life)

### Feature Extraction - Microphone

In [112]:
df['Microphone'] = df['Description'].apply(search_description, args=(r'\b(mic?|microphone?)\b',))

### Feature Extraction - Over Ear

In [113]:
df['Over Ear'] = df['Description'].apply(search_description, args=(r'\b(over[\s-]ear?|overhead?)\b',))
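
A note on the `?` quantifier used in the Microphone and Over Ear patterns: it applies only to the single character before it, so `mic?` means `mi` or `mic`, and `overhead?` means `overhea` or `overhead`. If the intent was an optional plural, a pattern along the lines below may be closer. This is a sketch of an alternative with made-up sample text, not the pattern the notebook ran.

```python
import re

original = re.compile(r"\b(mic?|microphone?)\b")
plural_aware = re.compile(r"\b(mics?|microphones?)\b")  # assumed intent

for text in ["xiaomi mi earbuds", "dual microphones", "built-in mic"]:
    print(text, bool(original.search(text)), bool(plural_aware.search(text)))
```

With these samples, the original pattern flags the standalone word "mi" and misses the plural "microphones", while the plural-aware variant does the opposite.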

### Feature Extraction - Gaming

In [115]:
df['Gaming'] = df['Description'].apply(search_description, args=(r'\bgaming\b',))

### Feature Extraction - Foldable

In [116]:
df['Foldable'] = df['Description'].apply(search_description, args=(r'\bfoldable\b',))

### Feature Extraction - Brand

In [117]:
import spacy

In [118]:
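# Load the pre-trained model once rather than on every call
nlp = spacy.load("en_core_web_sm")

def get_brand(description):
    '''
    Outputs the first organisation named in the product description

    Parameters
    ----------
    description: string of product description

    Returns
    -------
    Text of the first entity labelled ORG, 'Unknown Brand' if there is none
    '''
    doc = nlp(description)
    # Check every named entity, not just the first one
    for ent in doc.ents:
        if ent.label_ == "ORG":  # ORG for organisations/brands
            return ent.text
    return 'Unknown Brand'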

In [119]:
df['Brand'] = df['Description'].apply(get_brand)

In [120]:
df['Brand'].value_counts()

Brand
Unknown Brand                       297
sony                                  8
c8                                    5
radio & wired                         3
doqaus bluetooth headphones over      3
rgb                                   3
louise & mann                         3
mac                                   2
jyps                                  2
betron                                2
jyps kids wireless                    2
osszit kids headphones                2
android                               2
670nc                                 1
philips tat8506wt                     1
usb                                   1
koss kph14w                           1
samson technologies                   1
ukcoco                                1
ref                                   1
ip68                                  1
microphone & volume                   1
lomiluskr                             1
xosda bulk                            1
orzly                             

-----
**Comment:**

My attempt at using spaCy to extract brand names from the product descriptions did not work well.

For the majority of the dataset the brand comes back as unknown, and most of the values it did find are not brands but common description words such as wireless/bluetooth or colours. Therefore, I will drop this column from the dataframe.

In [121]:
df = df.drop(columns = ['Brand'])

## Final Clean Up
---

In [122]:
df.head(10)

Unnamed: 0,Product ID,Description,Price,Rating,Is Prime,Wireless,Noise Cancelling,Colour,Battery Life,Microphone,Over Ear,Gaming,Foldable
0,B087JVV8FK,artix cl750 wired headphones with mic & volume...,21.3,4.2 out of 5 stars,0,0,1,Not Specified,Not Specified,1,1,0,1
1,B07W7LNTM5,logitech g435 lightspeed & bluetooth wireless ...,33.24,4.2 out of 5 stars,0,1,0,black,18,0,1,1,0
2,B00I3LUYNG,sony mdrzx310l.ae foldable headphones - metall...,18.0,4.5 out of 5 stars,0,0,0,blue,Not Specified,0,0,0,1
3,B00NBR70DO,"sony mdr-zx110 overhead headphones - black , b...",14.49,4.5 out of 5 stars,0,0,0,black,Not Specified,0,1,0,0
4,B083P1HG9S,lorelei x6 over-ear headphones with microphone...,11.99,4.4 out of 5 stars,0,0,0,black,Not Specified,1,1,0,1
5,B0BCXJYD3G,"bluetooth headphones over-ear, powerlocus wire...",13.99,4.5 out of 5 stars,0,1,0,Not Specified,Not Specified,1,1,0,1
6,B01N3L2IEX,"rockpapa i20 wired headphones, wired headset o...",14.99,4.3 out of 5 stars,0,0,0,black,Not Specified,1,0,0,1
7,B0828S1TPM,"oneodio bluetooth headphones over ear, studio ...",33.59,4.4 out of 5 stars,0,1,0,Not Specified,110,1,1,0,1
8,B09PQSVFQT,"kvidio bluetooth headphones over ear, 65 hours...",14.2,4.5 out of 5 stars,0,1,0,black,65,1,1,0,1
9,B0C8V45ZF5,roxel rx-90 wired headphones with microphone -...,12.99,4.2 out of 5 stars,0,0,0,black,Not Specified,1,0,0,0


### Rating

Removing `out of 5 stars` from the ratings, as this text is redundant.

In [123]:
df['Rating'] = df['Rating'].astype('str')

In [124]:
df['Rating'] = df['Rating'].str.replace('out of 5 stars', '')

In [125]:
df['Rating'] = df['Rating'].astype(float)
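
The three cells above can also be expressed as a single regular-expression capture. The sketch below assumes every rating follows the "<number> out of 5 stars" layout seen in this dataset; `parse_rating` and `RATING_PATTERN` are hypothetical names.

```python
import re

RATING_PATTERN = re.compile(r"^\s*(\d+(?:\.\d+)?)\s+out of 5 stars\s*$")

def parse_rating(raw):
    """Extract the numeric rating, or None if the text has another layout."""
    match = RATING_PATTERN.match(raw)
    return float(match.group(1)) if match else None

print(parse_rating("4.2 out of 5 stars"))  # 4.2
print(parse_rating("no rating"))           # None
```

Returning `None` for unexpected text makes malformed ratings visible as missing values, whereas a direct cast to float would raise an error on them.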

### Data Types

In [126]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 466 entries, 0 to 465
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Product ID        466 non-null    object 
 1   Description       466 non-null    object 
 2   Price             466 non-null    float64
 3   Rating            466 non-null    float64
 4   Is Prime          466 non-null    int64  
 5   Wireless          466 non-null    int64  
 6   Noise Cancelling  466 non-null    int64  
 7   Colour            466 non-null    object 
 8   Battery Life      466 non-null    object 
 9   Microphone        466 non-null    int64  
 10  Over Ear          466 non-null    int64  
 11  Gaming            466 non-null    int64  
 12  Foldable          466 non-null    int64  
dtypes: float64(2), int64(7), object(4)
memory usage: 47.5+ KB


## Export to CSV
----

In [127]:
cleaned_df = df.copy()

In [128]:
cleaned_df.to_csv('../../data/cleaned_data.csv')

In [129]:
cleaned_df.head(10)

Unnamed: 0,Product ID,Description,Price,Rating,Is Prime,Wireless,Noise Cancelling,Colour,Battery Life,Microphone,Over Ear,Gaming,Foldable
0,B087JVV8FK,artix cl750 wired headphones with mic & volume...,21.3,4.2,0,0,1,Not Specified,Not Specified,1,1,0,1
1,B07W7LNTM5,logitech g435 lightspeed & bluetooth wireless ...,33.24,4.2,0,1,0,black,18,0,1,1,0
2,B00I3LUYNG,sony mdrzx310l.ae foldable headphones - metall...,18.0,4.5,0,0,0,blue,Not Specified,0,0,0,1
3,B00NBR70DO,"sony mdr-zx110 overhead headphones - black , b...",14.49,4.5,0,0,0,black,Not Specified,0,1,0,0
4,B083P1HG9S,lorelei x6 over-ear headphones with microphone...,11.99,4.4,0,0,0,black,Not Specified,1,1,0,1
5,B0BCXJYD3G,"bluetooth headphones over-ear, powerlocus wire...",13.99,4.5,0,1,0,Not Specified,Not Specified,1,1,0,1
6,B01N3L2IEX,"rockpapa i20 wired headphones, wired headset o...",14.99,4.3,0,0,0,black,Not Specified,1,0,0,1
7,B0828S1TPM,"oneodio bluetooth headphones over ear, studio ...",33.59,4.4,0,1,0,Not Specified,110,1,1,0,1
8,B09PQSVFQT,"kvidio bluetooth headphones over ear, 65 hours...",14.2,4.5,0,1,0,black,65,1,1,0,1
9,B0C8V45ZF5,roxel rx-90 wired headphones with microphone -...,12.99,4.2,0,0,0,black,Not Specified,1,0,0,0


## Conclusion
-------



In this notebook, I have cleaned the dataset so that it is ready for EDA. Here is a review of the key steps:

1. **Dropped Duplicates:** Removed duplicate rows to maintain data quality.

2. **Addressed Missing Values:** Dropped all records with null values in the Rating or Description columns to preserve the authenticity of the data.

3. **Feature Extraction:** Enhanced the dataset by extracting relevant features from product descriptions.

With the dataset now cleaned and features added, I am ready to proceed to the next phase: EDA.
