## Data scraping project

### Nike Sneaker Sale 

1) This project scrapes product data from the Nike sale store.

       link ( https://www.nike.com/in/w/sale-shoes-3yaepzy7ok )

2) We use Selenium to scroll the page until every product has loaded (the page uses infinite scrolling), Beautiful Soup to extract the data, and pandas to store the data in a CSV file.

3) The key information we are looking for is:
  
      a) Title,

      b) Description (stored in the `Discription` column),

      c) Sale Price (Current Price),

      d) MRP

4) Finally, we store the collected data in a CSV file.


In [40]:
from bs4 import BeautifulSoup
from selenium import webdriver 
import pandas as pd 
import time

In [43]:
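from selenium.webdriver.chrome.service import Service

# Selenium 4 takes the chromedriver path through a Service object; passing it positionally is deprecated
driver = webdriver.Chrome(service=Service(r"C:\Users\hp\saksham\chromedriver.exe"))
driver.get('https://www.nike.com/in/w/sale-shoes-3yaepzy7ok')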


In [44]:
# First, record the current page height; the scroll loop compares against it to detect when no new products load
last_height = driver.execute_script('return document.body.scrollHeight')

In [38]:
last_height
# This is the page height before any scrolling (4456 in this run). After scrolling to the bottom and letting
# more products load, the same call returns a larger value (roughly 2 * 4456 here).

In [39]:
# driver.execute_script('return document.body.scrollHeight')
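
The height comparison described above can be wrapped in a small helper. This is a minimal sketch, not the notebook's own code: `scroll_to_bottom`, `pause`, and `max_rounds` are hypothetical names, and the `max_rounds` cap is an added assumption so the loop cannot run forever if the page keeps growing.

```python
import time


def scroll_to_bottom(driver, pause=2.0, max_rounds=50):
    """Scroll until document.body.scrollHeight stops growing.

    Returns the number of scroll rounds performed.
    `driver` is any object with a Selenium-style execute_script method.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for rounds in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        time.sleep(pause)  # give lazily loaded products time to render
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return rounds  # height unchanged: nothing new was loaded
        last_height = new_height
    return max_rounds  # hypothetical safety cap reached
```

With a real browser session this would be called as `scroll_to_bottom(driver)` right after `driver.get(...)`.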


In [48]:
#  Keep scrolling to the bottom until the page height stops growing, so every product gets loaded
while True:
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight)')
    time.sleep(2)
    new_height = driver.execute_script('return document.body.scrollHeight')
    if new_height == last_height:
        break
    else:
        last_height = new_height
        
#  Collect data using Beautiful Soup on the page source rendered by Selenium

soup = BeautifulSoup(driver.page_source, 'lxml')
product_card = soup.find_all('div', {"class":"product-card__body"})


data_df = pd.DataFrame({'Link':[''],
                        'Title':[''],
                        'Discription':[''],
                        'Colors_available':[''],
                        'Current_Price': [''],
                        'MRP': ['']})             # DataFrame that stores the collected data; its blank seed row shows up as row 0


for product in product_card:
    try:
        link = product.find('a', {"class":"product-card__img-link-overlay"}).get('href')
        title = product.find('div', {"class":"product-card__title"}).text
        discription = product.find('div', {"class":"product-card__subtitle"}).text
        colour_available = product.find('div', {"class":"product-card__product-count font-override__body1"}).text
        current_price = (product.find('div', {"class":"product-price is--current-price css-1ydfahe"}).text).strip('₹')
        MRP = (product.find('div', {"class":"product-price in__styling is--striked-out css-0"}).text).strip('MRP :₹ ')
        # DataFrame.append was removed in pandas 2.0, so add the row with pd.concat instead
        row = pd.DataFrame([{'Link':link,
                             'Title':title,
                             'Discription':discription,
                             'Colors_available':colour_available,
                             'Current_Price': current_price,
                             'MRP': MRP}])
        data_df = pd.concat([data_df, row], ignore_index=True)
    except AttributeError:
        # find() returned None because the card lacks one of the fields; skip that card
        continue
    
#  This loop collects the required fields from every product card on the Nike sale page and stores them in data_df
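
The extraction step can also be tried without a browser. The sketch below is an illustration under stated assumptions, not the notebook's method: `parse_cards`, `FIELDS`, and `sample_html` are hypothetical, the HTML is a made-up sample that only mimics the class names used above, and the built-in `html.parser` stands in for `lxml`. It matches on a single class name (for example `is--current-price`) rather than the full class string, on the assumption that suffixes such as `css-1ydfahe` are generated and may change. Unlike the loop above, it keeps cards with missing fields and records `None` for them.

```python
from bs4 import BeautifulSoup

# Column name -> one CSS class that identifies the element (assumed stable)
FIELDS = {
    "Title": "product-card__title",
    "Discription": "product-card__subtitle",
    "Current_Price": "is--current-price",
    "MRP": "is--striked-out",
}


def parse_cards(html):
    """Return one dict per product card; missing fields become None."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for card in soup.find_all("div", class_="product-card__body"):
        link = card.find("a", class_="product-card__img-link-overlay")
        row = {"Link": link.get("href") if link else None}
        for column, css_class in FIELDS.items():
            node = card.find("div", class_=css_class)
            row[column] = node.get_text(strip=True) if node else None
        rows.append(row)
    return rows


# Made-up sample markup, for illustration only
sample_html = """
<div class="product-card__body">
  <a class="product-card__img-link-overlay" href="https://example.com/shoe-a"></a>
  <div class="product-card__title">Shoe A</div>
  <div class="product-card__subtitle">Men's Shoes</div>
  <div class="product-price is--current-price css-1ydfahe">₹5 097.00</div>
  <div class="product-price in__styling is--striked-out css-0">MRP : ₹5 995.00</div>
</div>
<div class="product-card__body">
  <div class="product-card__title">Shoe B</div>
</div>
"""
rows = parse_cards(sample_html)
```

Building the DataFrame once from the list, with `pd.DataFrame(rows)`, also avoids growing it row by row.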


In [49]:
data_df

Unnamed: 0,Link,Title,Discription,Colors_available,Current_Price,MRP
0,,,,,,
1,https://www.nike.com/in/t/metcon-8-flyease-eas...,Nike Metcon 8 FlyEase,Men's Easy On/Off Training Shoes,2 Colours,10 107.00,11 895.00
2,https://www.nike.com/in/t/zoom-fly-5-road-runn...,Nike Zoom Fly 5,Women's Road Running Shoes,2 Colours,14 247.00,14 995.00
3,https://www.nike.com/in/t/sb-force-58-skate-sh...,Nike SB Force 58,Skate Shoe,1 Colour,5 097.00,5 995.00
4,https://www.nike.com/in/t/air-zoom-superrep-3-...,Nike Air Zoom SuperRep 3,Women's HIIT Class Shoes,1 Colour,8 747.00,10 295.00
...,...,...,...,...,...,...
68,https://www.nike.com/in/t/wio-9-road-running-s...,Nike Winflo 9,Men's Road Running Shoes,2 Colours,8 067.00,8 495.00
69,https://www.nike.com/in/t/air-huarache-shoes-f...,Nike Air Huarache,Men's Shoes,2 Colours,9 777.00,10 295.00
70,https://www.nike.com/in/t/sb-bruin-high-skate-...,Nike SB Bruin High,Skate Shoes,2 Colours,7 117.00,7 495.00
71,https://www.nike.com/in/t/air-jordan-xxxvi-low...,Air Jordan XXXVI Low PF,Men's Basketball Shoes,1 Colour,15 477.00,16 295.00


In [52]:
pd.set_option('display.max_rows', None)

In [53]:
data_df

Unnamed: 0,Link,Title,Discription,Colors_available,Current_Price,MRP
0,,,,,,
1,https://www.nike.com/in/t/metcon-8-flyease-eas...,Nike Metcon 8 FlyEase,Men's Easy On/Off Training Shoes,2 Colours,10 107.00,11 895.00
2,https://www.nike.com/in/t/zoom-fly-5-road-runn...,Nike Zoom Fly 5,Women's Road Running Shoes,2 Colours,14 247.00,14 995.00
3,https://www.nike.com/in/t/sb-force-58-skate-sh...,Nike SB Force 58,Skate Shoe,1 Colour,5 097.00,5 995.00
4,https://www.nike.com/in/t/air-zoom-superrep-3-...,Nike Air Zoom SuperRep 3,Women's HIIT Class Shoes,1 Colour,8 747.00,10 295.00
5,https://www.nike.com/in/t/blazer-mid-77-d-olde...,Nike Blazer Mid '77 D,Older Kids' Shoes,1 Colour,5 947.00,6 995.00
6,https://www.nike.com/in/t/air-max-dawn-older-s...,Nike Air Max Dawn,Older Kids' Shoes,1 Colour,6 367.00,7 495.00
7,https://www.nike.com/in/t/air-force-1-pltaform...,Nike Air Force 1 PLT.AF.ORM LV8,Women's Shoes,1 Colour,10 257.00,10 795.00
8,https://www.nike.com/in/t/air-max-plus-shoes-0...,Nike Air Max Plus,Shoes,1 Colour,14 247.00,14 995.00
9,https://www.nike.com/in/t/nikecourt-legacy-sho...,NikeCourt Legacy,Women's Shoes,1 Colour,5 977.00,6 295.00


In [55]:
data_df.to_csv(r"C:\Users\hp\saksham\nike_scrapped.csv")
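
The `Unnamed: 0` columns that show up when the file is read back come from the DataFrame index, which `to_csv` writes as an unnamed first column by default. The sketch below shows the effect on a small made-up frame, using in-memory buffers instead of the real file path; passing `index=False` is one way to avoid the extra column.

```python
import io

import pandas as pd

# Made-up rows, for illustration only
df = pd.DataFrame({"Title": ["Shoe A", "Shoe B"], "MRP": [5995.0, 7495.0]})

# Default behaviour: the index is written as an unnamed first column
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# index=False leaves the index out of the file
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)
print(cols_without_index)
```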

In [61]:
pd.read_csv(r"C:\Users\hp\saksham\nike_scrapped.csv")

Unnamed: 0.1,Unnamed: 0,Link,Title,Discription,Colors_available,Current_Price,MRP,Percentage_discount
0,0,,,,,,,
1,1,https://www.nike.com/in/t/metcon-8-flyease-eas...,Nike Metcon 8 FlyEase,Men's Easy On/Off Training Shoes,2 Colours,10107.0,11895.0,17.690709
2,2,https://www.nike.com/in/t/zoom-fly-5-road-runn...,Nike Zoom Fly 5,Women's Road Running Shoes,2 Colours,14247.0,14995.0,5.250228
3,3,https://www.nike.com/in/t/sb-force-58-skate-sh...,Nike SB Force 58,Skate Shoe,1 Colour,5097.0,5995.0,17.618207
4,4,https://www.nike.com/in/t/air-zoom-superrep-3-...,Nike Air Zoom SuperRep 3,Women's HIIT Class Shoes,1 Colour,8747.0,10295.0,17.697496
5,5,https://www.nike.com/in/t/blazer-mid-77-d-olde...,Nike Blazer Mid '77 D,Older Kids' Shoes,1 Colour,5947.0,6995.0,17.622331
6,6,https://www.nike.com/in/t/air-max-dawn-older-s...,Nike Air Max Dawn,Older Kids' Shoes,1 Colour,6367.0,7495.0,17.71635
7,7,https://www.nike.com/in/t/air-force-1-pltaform...,Nike Air Force 1 PLT.AF.ORM LV8,Women's Shoes,1 Colour,10257.0,10795.0,5.245198
8,8,https://www.nike.com/in/t/air-max-plus-shoes-0...,Nike Air Max Plus,Shoes,1 Colour,14247.0,14995.0,5.250228
9,9,https://www.nike.com/in/t/nikecourt-legacy-sho...,NikeCourt Legacy,Women's Shoes,1 Colour,5977.0,6295.0,5.320395
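
The `Percentage_discount` column in this output is not produced by any cell shown here. For the rows displayed, its values match `(MRP - Current_Price) / Current_Price * 100`, so the sketch below assumes that formula; the function names are hypothetical. Note that this divides by the sale price, while a discount is more commonly quoted relative to the MRP, so both variants are shown.

```python
import re


def parse_price(text):
    """Turn a scraped price such as 'MRP : ₹ 11 895.00' into a float."""
    # Drop everything except digits and the decimal point
    return float(re.sub(r"[^\d.]", "", text))


def markup_over_sale_price(mrp, current):
    """Price difference as a percentage of the sale price (matches the column above)."""
    return (mrp - current) / current * 100


def discount_off_mrp(mrp, current):
    """Price difference as a percentage of the MRP (the usual discount definition)."""
    return (mrp - current) / mrp * 100


# Values taken from row 1 of the table above
mrp = parse_price("MRP : ₹ 11 895.00")
current = parse_price("₹ 10 107.00")

print(round(markup_over_sale_price(mrp, current), 6))
print(round(discount_off_mrp(mrp, current), 2))
```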


### Scratch code tried before finalizing the project

In [34]:
# soup = BeautifulSoup(driver.page_source, 'lxml')
# product_card = soup.find_all('div', {"class":"product-card__body"})


In [35]:
# soup.find('a', {"class":"product-card__img-link-overlay"}).get('href')

# soup.find('div', {"class":"product-card__title"}).text

# (soup.find('div', {"class":"product-price in__styling is--striked-out css-0"}).text).strip('MRP :₹ ')

# # soup.find('div', {"class":"product-card__subtitle"}).text

In [36]:
# data_df = pd.DataFrame({'Link':[''],
#                         'Title':[''],
#                         'Discription':[''],
#                         'Colors_available':[''],
#                         'Current_Price': [''],
#                         'MRP': ['']})

In [37]:
# for product in product_card:
#     try:
#         link = product.find('a', {"class":"product-card__img-link-overlay"}).get('href')
#         title = product.find('div', {"class":"product-card__title"}).text
#         discription = product.find('div', {"class":"product-card__subtitle"}).text
#         colour_available = product.find('div', {"class":"product-card__product-count font-override__body1"}).text
#         current_price = (product.find('div', {"class":"product-price is--current-price css-1ydfahe"}).text).strip('₹')
#         MRP = (product.find('div', {"class":"product-price in__styling is--striked-out css-0"}).text).strip('MRP :₹ ')
#         data_df2 = data_df.append({'Link':link,
#                                    'Title':title,
#                                    'Discription':discription,
#                                    'Colors_available':colour_available,
#                                    'Current_Price': current_price,
#                                    'MRP': MRP}, ignore_index=True)
#     except:
#         pass
    
    