<a href="https://colab.research.google.com/github/alaa-shehab/Ikea-Sofas-EDA/blob/main/Notebooks/Web_scraping_documented.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![My Image](Ikea_logo.png)

# EDA on Ikea Sofa Types Dataset

### About this Dataset
In this project, we scrape data from IKEA's sofa category pages:
- Sofa bed
- Chaise longue
- Modular
- Leather
- Fabric

The goal is to get insights into sofa prices and how they relate to the other columns.

### Attributes
- Name: name of the sofa
- Type: type of sofa
- Number_of_Seats: number of seats in each sofa
- Colour: colour of each sofa
- Price: price of the sofa, which depends on colour and number of seats
- Material: material of the sofa
- Others: whether the sofa has a washable cover
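The scraping steps later in this notebook collect a single `details` string per sofa (for example `'3-seat sofa-bed, Knisa dark grey'`) rather than separate seat and colour fields. The sketch below shows one way `Number_of_Seats` and `Colour` might be derived from that string. It assumes the colour is the text after the last comma and the seat count appears as `<digits>-seat`; `parse_details` is a hypothetical helper, not part of the original notebook.

```python
import re

def parse_details(details):
    """Split a scraped 'details' string into (description, seats, colour)."""
    # Assumption: the colour is whatever follows the last ", " in the string.
    description, sep, colour = details.rpartition(", ")
    if not sep:
        # No comma found: treat the whole string as the description.
        description, colour = details, None
    # Assumption: the seat count is written as '<digits>-seat'.
    match = re.search(r"(\d+)-seat", description)
    seats = int(match.group(1)) if match else None
    return description, seats, colour

print(parse_details("3-seat sofa-bed, Knisa dark grey"))
# ('3-seat sofa-bed', 3, 'Knisa dark grey')
```

Strings without a seat count, such as `'Corner sofa-bed with storage, Bomstad black'`, give `None` for the seats, so those rows would still need manual or rule-based handling.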


### Libraries and tools used for scraping, data cleaning and visualisation
- Pandas
- NumPy
- Matplotlib
- Seaborn
- Selenium

### Importing Used Libraries

In [None]:
import pandas as pd
import numpy as np
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

### Step 1: Web Scraping for the products' links

In [None]:
# links to be scraped for each category
fur_typ=["https://www.ikea.com/eg/en/cat/fabric-sofas-10661/?page=17"
        ,"https://www.ikea.com/eg/en/cat/leather-coated-fabric-sofas-10662/?page=2"
        ,"https://www.ikea.com/eg/en/cat/modular-sofas-16238/?page=8"
        ,"https://www.ikea.com/eg/en/cat/sofa-beds-10663/?page=7"
        ,"https://www.ikea.com/eg/en/cat/chaise-longues-57527/?page=3"
        ]

In [None]:
# run Chrome headless so the scraping happens in the background without opening a window
chrome_options = Options()
chrome_options.add_argument("--headless")
driver = webdriver.Chrome(options=chrome_options)

# links_arr collects the product links found on each category page
links_arr = []

for url in fur_typ:
  try:
    # reuse the single headless driver instead of launching a new browser per page
    driver.get(url)
    links = driver.find_elements("xpath","//div[@class='plp-mastercard__item plp-mastercard__image']/a")

    for link in links:
        links_arr.append(link.get_attribute("href"))
  except Exception as e:
    print(e)

# close this browser; the next cell creates its own driver
driver.quit()


In [None]:
len(links_arr)

36
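The count above includes every link found on the five category pages. If the same product is listed in more than one category (an assumption, not something checked here), or if an element has no `href`, the list would contain repeats or `None` values. A minimal sketch of cleaning such a list while keeping the original order follows; `unique_links` and the sample URLs are hypothetical.

```python
def unique_links(links):
    """Drop missing and repeated hrefs while keeping first-seen order."""
    # dict keys preserve insertion order, so fromkeys de-duplicates in order.
    return list(dict.fromkeys(link for link in links if link))

# Hypothetical sample standing in for links_arr.
sample = [
    "https://example.com/p/sofa-a",
    "https://example.com/p/sofa-b",
    None,
    "https://example.com/p/sofa-a",
]
print(unique_links(sample))
# ['https://example.com/p/sofa-a', 'https://example.com/p/sofa-b']
```

Applied to the notebook's data, this would be `links_arr = unique_links(links_arr)` before visiting each product page.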

In [None]:
# Create Chrome options
chrome_options = Options()
chrome_options.add_argument("--disable-images")

# Create a new instance of the Chrome driver with options
driver = webdriver.Chrome(options=chrome_options)

# links_sub_arr collects the links to the sub pages found on each product page
links_sub_arr = []

for url in links_arr:
  try:
    driver.get(url)
    links_sub = driver.find_elements("xpath","//div[@class='pip-product-styles__items']/a")

    for link in links_sub:
        links_sub_arr.append(link.get_attribute("href"))
  except Exception as e:
    print(e)


In [None]:
len(links_sub_arr)

92

### Step 2: Getting the Elements from the Sub Pages Using XPath

In [None]:

sofa_name = '//span[@class="pip-header-section__title--big notranslate"]'
sofa_details = '//h1/div/div/span/span[@class="pip-header-section__description-text"]'
sofa_price ='//div/div/div/div/div/span/span/span/span[@class="pip-temp-price__integer"]'
sofa_material ='//span[@class="pip-key-facts__type"]'

field = [sofa_name, sofa_details, sofa_price, sofa_material]


In [None]:
# links_sub_data holds one row of scraped text per sub page
links_sub_data = []

for url in links_sub_arr:
  # start each page with an empty row so a failure never reuses the previous page's data
  row = []
  try:
    driver.get(url)

    # collect the text of every element matched by each XPath, in field order
    for xpath in field:
        elements = driver.find_elements("xpath", xpath)
        for element in elements:
            row.append(element.text)
  except Exception as e:
    print(e)
  else:
    # keep the row only when the page was scraped without an error
    links_sub_data.append(row)


In [None]:
links_sub_data

[['ÄLVDALEN', '3-seat sofa-bed, Knisa dark grey', '18,995'],
 ['FRIHETEN', 'Corner sofa-bed with storage, Bomstad black', '28,795', 'Firm'],
 ['FRIHETEN', 'Corner sofa-bed with storage, Skiftebo blue', '25,695', 'Firm'],
 ['BÅRSLÖV',
  '3-seat sofa-bed with chaise longue, Tibbleby light grey-turquoise',
  '34,995'],
 ['GRÖNLID',
  'Crnr sofa-bed, 5-seat w chaise lng, Inseros white',
  '107,075',
  'Extra soft',
  'Washable cover'],
 ['GRÖNLID',
  'Crnr sofa-bed, 5-seat w chaise lng, Ljungen light green',
  '111,675',
  'Extra soft',
  'Washable cover'],
 ['GRÖNLID',
  'Crnr sofa-bed, 5-seat w chaise lng, Ljungen light red',
  '110,875',
  'Extra soft',
  'Washable cover'],
 ['GRÖNLID',
  'Crnr sofa-bed, 5-seat w chaise lng, Ljungen medium grey',
  '111,675',
  'Extra soft',
  'Washable cover']]
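The rows above have between three and five entries because every matched text is appended to one flat list. That works as long as only the trailing fields are missing, but a page without a price, for instance, would shift the material into the price position. The sketch below keeps the texts keyed by field name instead. `scrape_record` and `find_texts` are hypothetical names; with Selenium, `find_texts` could be something like `lambda xp: [e.text for e in driver.find_elements("xpath", xp)]`. Here a stub dictionary stands in for the browser so the example runs on its own.

```python
# XPaths taken from the notebook, keyed by a descriptive field name.
FIELDS = {
    "name": '//span[@class="pip-header-section__title--big notranslate"]',
    "details": '//h1/div/div/span/span[@class="pip-header-section__description-text"]',
    "price": '//div/div/div/div/div/span/span/span/span[@class="pip-temp-price__integer"]',
    "key_facts": '//span[@class="pip-key-facts__type"]',
}

def scrape_record(find_texts, fields=FIELDS):
    """Return {field_name: [texts]}; a field with no match maps to []."""
    return {name: find_texts(xpath) for name, xpath in fields.items()}

# Stub page: maps an XPath to the texts a real driver would return.
fake_page = {
    FIELDS["name"]: ["ÄLVDALEN"],
    FIELDS["details"]: ["3-seat sofa-bed, Knisa dark grey"],
    FIELDS["price"]: ["18,995"],
}
record = scrape_record(lambda xpath: fake_page.get(xpath, []))
print(record["name"], record["price"], record["key_facts"])
# ['ÄLVDALEN'] ['18,995'] []
```

With this layout a missing field shows up as an empty list under its own key, so the later column-building step does not have to infer a field from its position.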

### Step 3: Saving the Data in CSV Format

In [None]:
# build one list per column; rows have 3 to 5 entries, so missing values become NaN
col1 = [x[0] for x in links_sub_data]  # name
col2 = [x[1] for x in links_sub_data]  # details
col3 = [int(x[2].replace(',', '')) for x in links_sub_data]  # price as an integer, thousands separator removed
col4 = [x[3] if len(x) > 3 else np.nan for x in links_sub_data]  # material
col5 = [x[4] if len(x) > 4 else np.nan for x in links_sub_data]  # others

# put the data in a dataframe
df = pd.DataFrame({'name': col1, 'details': col2, 'price': col3,'material': col4,'others': col5})

# save the data as CSV
df.to_csv("data.csv")



In [None]:
df

| Unnamed: 0 | name | details | price | material |
|---|---|---|---|---|
| 0 | ÄLVDALEN | 3-seat sofa-bed, Knisa dark grey | 18995 | |
| 1 | FRIHETEN | Corner sofa-bed with storage, Bomstad black | 28795 | Firm |
| 2 | FRIHETEN | Corner sofa-bed with storage, Skiftebo blue | 25695 | Firm |
| 3 | BÅRSLÖV | 3-seat sofa-bed with chaise longue, Tibbleby l... | 34995 | |
| 4 | GRÖNLID | Crnr sofa-bed, 5-seat w chaise lng, Inseros white | 107075 | Extra soft |
| 5 | GRÖNLID | Crnr sofa-bed, 5-seat w chaise lng, Ljungen li... | 111675 | Extra soft |
| 6 | GRÖNLID | Crnr sofa-bed, 5-seat w chaise lng, Ljungen li... | 110875 | Extra soft |
| 7 | GRÖNLID | Crnr sofa-bed, 5-seat w chaise lng, Ljungen me... | 111675 | Extra soft |
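An `Unnamed: 0` column like the one in the output above is what pandas typically produces when a CSV that was written together with its index is read back. The sketch below compares the default `to_csv` call with `index=False`, using two rows from the scraped data and an in-memory buffer in place of `data.csv`. The padding step and variable names are illustrative assumptions, not the notebook's own code.

```python
import io
import pandas as pd

rows = [
    ["ÄLVDALEN", "3-seat sofa-bed, Knisa dark grey", "18,995"],
    ["FRIHETEN", "Corner sofa-bed with storage, Bomstad black", "28,795", "Firm"],
]
columns = ["name", "details", "price", "material", "others"]

# Pad short rows with None so every row has one value per column.
padded = [row + [None] * (len(columns) - len(row)) for row in rows]
df = pd.DataFrame(padded, columns=columns)
df["price"] = df["price"].str.replace(",", "", regex=False).astype(int)

# Default behaviour: the index is written as an unnamed first column.
buffer = io.StringIO()
df.to_csv(buffer)
buffer.seek(0)
with_index = pd.read_csv(buffer)

# index=False leaves the index out of the file.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
without_index = pd.read_csv(buffer)

print(list(with_index.columns))
# ['Unnamed: 0', 'name', 'details', 'price', 'material', 'others']
print(list(without_index.columns))
# ['name', 'details', 'price', 'material', 'others']
```

If the index carries no information, writing with `index=False` (or reading with `index_col=0`) avoids carrying the extra column into the analysis.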
