#### Exploring and Cleaning the Products Data:

Import relevant packages:

In [13]:
import pandas as pd
import numpy as np

Read in Data:

In [2]:
products = pd.read_csv('../Raw_Data/products.csv', lineterminator='\n')

##### Exploring Data:

In [3]:
products.head()

Unnamed: 0,id,product_name,category,product_description,price,location,page_id,create_time
0,ac2140ae-f0d5-4fe7-ac08-df0f109fd734,"Second-Hand Sofas, Couches & Armchairs for Sal...",,,,,1426592234,2022-02-26
1,243809c0-9cfc-4486-ad12-3b7a16605ba9,"Mirror wall art | in Wokingham, Berkshire | Gu...","Home & Garden / Dining, Living Room Furniture ...","Mirror wall art. Posted by Nisha in Dining, Li...",£5.00,"Wokingham, Berkshire",1426704584,2022-02-26
2,1c58d3f9-8b93-47ea-9415-204fcc2a22e6,"Stainless Steel Food Steamer | in Inverness, H...",Home & Garden / Other Household Goods,Morphy Richard’s (model no 48755)Stainless ste...,£20.00,"Inverness, Highland",1426704579,2022-02-26
3,860673f1-57f6-47ba-8d2f-13f9e05b8f9a,"Sun loungers | in Skegness, Lincolnshire | Gum...",Home & Garden / Garden & Patio / Outdoor Setti...,I have 2 of these - collection only as I don’t...,£20.00,"Skegness, Lincolnshire",1426704576,2022-02-26
4,59948726-29be-4b35-ade5-bb2fd7331856,Coffee side table from Ammunition ammo box hai...,"Home & Garden / Dining, Living Room Furniture ...",Great reclaimed army ammunition box used as co...,£115.00,"Radstock, Somerset",1426704575,2022-02-26


In [4]:
products.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8091 entries, 0 to 8090
Data columns (total 8 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   id                   8091 non-null   object
 1   product_name         8091 non-null   object
 2   category             7156 non-null   object
 3   product_description  7156 non-null   object
 4   price                7156 non-null   object
 5   location             7156 non-null   object
 6   page_id              8091 non-null   int64 
 7   create_time          8091 non-null   object
dtypes: int64(1), object(7)
memory usage: 505.8+ KB


Plans/Immediate Observations:

- Drop rows where any entry is null. At first glance, Category, Price, Location and Product Description are the useful pieces of information (these will be the features). Each has 7156 non-null values, so 935 of the 8091 rows are incomplete.
- Convert Price to float32, since PyTorch works with float32 by default.
- Location is worth splitting into two columns: Town and County/City.
- The "category" field can be split on forward slashes, as a product can belong to multiple categories.
- Turn this notebook into a script so that new data can be cleaned easily as it is streamed in.
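
The location and category splits planned above could look roughly like the sketch below. It runs on a small toy frame modelled on the preview rather than on the real data, and the column names `town`, `county` and `category_levels` are assumptions made for illustration, not names used elsewhere in this notebook.

```python
import pandas as pd

# Toy rows modelled on the preview above (illustrative values, not the real file).
sample = pd.DataFrame({
    "location": ["Wokingham, Berkshire", "Inverness, Highland"],
    "category": [
        "Home & Garden / Other Household Goods",
        "Home & Garden / Garden & Patio",
    ],
})

# str.partition always returns three columns (before, separator, after),
# so rows without a comma simply get an empty string for the second part.
parts = sample["location"].str.partition(",")
sample["town"] = parts[0].str.strip()
sample["county"] = parts[2].str.strip()

# Split the category path on "/" and trim the whitespace around each level.
sample["category_levels"] = (
    sample["category"]
    .str.split("/")
    .apply(lambda levels: [level.strip() for level in levels])
)

print(sample[["town", "county", "category_levels"]])
```

Splitting only on the first comma keeps the town intact; whether the remainder is always a county or city would need checking against the full data.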

In [5]:
# Keep only fully populated rows; .copy() avoids SettingWithCopyWarning on later column assignments
products = products.dropna().copy()

In [6]:
products.shape

(7156, 8)

In [7]:
products.head(5)

Unnamed: 0,id,product_name,category,product_description,price,location,page_id,create_time
1,243809c0-9cfc-4486-ad12-3b7a16605ba9,"Mirror wall art | in Wokingham, Berkshire | Gu...","Home & Garden / Dining, Living Room Furniture ...","Mirror wall art. Posted by Nisha in Dining, Li...",£5.00,"Wokingham, Berkshire",1426704584,2022-02-26
2,1c58d3f9-8b93-47ea-9415-204fcc2a22e6,"Stainless Steel Food Steamer | in Inverness, H...",Home & Garden / Other Household Goods,Morphy Richard’s (model no 48755)Stainless ste...,£20.00,"Inverness, Highland",1426704579,2022-02-26
3,860673f1-57f6-47ba-8d2f-13f9e05b8f9a,"Sun loungers | in Skegness, Lincolnshire | Gum...",Home & Garden / Garden & Patio / Outdoor Setti...,I have 2 of these - collection only as I don’t...,£20.00,"Skegness, Lincolnshire",1426704576,2022-02-26
4,59948726-29be-4b35-ade5-bb2fd7331856,Coffee side table from Ammunition ammo box hai...,"Home & Garden / Dining, Living Room Furniture ...",Great reclaimed army ammunition box used as co...,£115.00,"Radstock, Somerset",1426704575,2022-02-26
5,16dbc860-696e-4cda-93f6-4dd4926573fb,Modern Shannon Sofa for sale at low cost | in ...,"Home & Garden / Dining, Living Room Furniture ...",New Design Shannon Corner sofa 5 Seater Avail...,£450.00,"Delph, Manchester",1426704570,2022-02-26


Prices:

Use a regular expression to extract the numeric part of each price string (e.g. £115.00) so that it can be converted to a number.

In [11]:
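# Capture the leading digits of the price, allowing one thousands comma (e.g. '1,250' from '£1,250.00')
price_pattern = r'([0-9]*,?[0-9]+)'
# expand=False returns a Series rather than a one-column DataFrame
products['price'] = products['price'].str.extract(price_pattern, expand=False)
# Remove the thousands separator so the string can be cast to a number
products['price'] = products['price'].str.replace(',', '', regex=False)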

In [14]:
products['price'] = products['price'].astype(np.float32)
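
One caveat with the pattern above: it keeps only the whole-pound part and stops after the first comma group, so a string such as £4.50 would become 4 and £1,234,567.89 would become 1234. Whether the data actually contains such values has not been checked here. If it does, a slightly wider pattern is one option. The sketch below uses made-up price strings, and the names `raw`, `pattern` and `parsed` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical price strings covering pence and several thousands separators.
raw = pd.Series(["£5.00", "£1,250.00", "£1,234,567.89", "£4.50"])

# A digit, then any run of digits or commas, then an optional decimal part.
pattern = r"([0-9][0-9,]*(?:\.[0-9]+)?)"

parsed = (
    raw.str.extract(pattern, expand=False)   # Series of matched strings
       .str.replace(",", "", regex=False)    # drop thousands separators
       .astype(np.float32)                   # float32, as planned for PyTorch
)

print(parsed)
```

Note that float32 carries only about seven significant digits, so very large prices lose their pence when cast.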

In [16]:
products['price'].describe()

count      7156.000000
mean        358.831604
std        5392.854492
min           0.000000
25%          10.000000
50%          40.000000
75%         150.000000
max      399900.000000
Name: price, dtype: float64

Based on this:

- The price distribution is strongly positively skewed: the mean (358.83) is far above the median (40.00), and the maximum is 399,900.
- Rows where the price is 0 or extremely high may need to be dropped as outliers, but they are left in for now.
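
If the outliers are handled later, one possible approach is sketched below: compare the skew before and after a log transform, and flag zero prices together with anything above a high quantile. The toy prices, the 0.99 cut-off and the names `prices`, `upper`, `is_outlier` and `kept` are all assumptions for illustration, not choices made in this notebook.

```python
import numpy as np
import pandas as pd

# Toy prices echoing the shape of the summary above (not the real column).
prices = pd.Series([0, 5, 10, 20, 40, 115, 150, 450, 399900], dtype=np.float32)

# log1p compresses the long right tail and is defined at zero.
log_price = np.log1p(prices)
print("skew (raw):", prices.skew())
print("skew (log1p):", log_price.skew())

# Flag zero prices and anything above the 99th percentile (assumed cut-off).
upper = prices.quantile(0.99)
is_outlier = (prices == 0) | (prices > upper)
kept = prices[~is_outlier]

print(kept.describe())
```

A quantile cut-off is only one choice; a fixed price cap or a per-category rule may suit the data better once the high-priced listings have been inspected.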