## Macys.com Data Web Scraping Project

**Shu Liu, (shutel at hotmail dot com)**

I scraped the 'brand', 'link' and 'name' of 208,898 items on Macys.com.

### 1. Get the main categories

In [1]:
import time
import pandas as pd
import codecs
from bs4 import BeautifulSoup
import requests

In [2]:
url = 'https://www.macys.com'

In [3]:
homepage = BeautifulSoup(requests.get(url, headers = {'user-agent': 'Scrapy_project'}).text, 'lxml')
mainCategories = homepage.find(id = 'mainNavigation').find_all('li', class_ = "fob")

In [56]:
cats = dict() # the dictionary used to store 'categories'
for elm in mainCategories:
    cats[elm.text.encode('utf-8').strip()] = url + elm.a['href'].strip()

### 2. Explore the subcategories

A convenient way to reach every item on a merchant's website is to browse by brand.

In [6]:
url_brands = cats['BRANDS']

### Get links of all brands

In [9]:
brandspage = BeautifulSoup(requests.get(url_brands, headers = {'user-agent': 'Scrapy_project'}).text, 'lxml')

In [10]:
brands = dict()
brandbox = brandspage.find_all('div', class_ = 'brand-box')

In [12]:
for box in brandbox:
    for lst in box.find_all('ul'):
        for elm in lst.find_all('li'):
            brands[elm.text.encode('utf-8').strip()] = url + elm.a['href']

In [13]:
len(brands)

1594

### 3. Scrape items brand by brand

Save scraped data to local machine by brand:

In [15]:
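def save_data(df, fil):
    # Write the records to a UTF-8 encoded csv file.
    # to_csv opens and closes the file itself, so no separate file handle is needed.
    f_df = pd.DataFrame(df)
    f_df.to_csv(fil, encoding = 'utf-8')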

Extract detailed information of items:

In [16]:
def item_name(df, brand, itempage):
    items = itempage.find('ul', class_= 'items large-block-grid-3').find_all('div', class_ = 'productDescription')
    for elm in items:
        item = dict()
        item['brand'] = brand
        item['name'] = elm.a['title'].encode('utf-8').strip()
        item['link'] = url + elm.a['href'].strip()
        df.append(item)

Items scraping:

In [29]:
i = 0 
for brand, url_item in brands.iteritems():
    i += 1
    try:
        if i % 10 == 0:
            time.sleep(5)
        print 'Brand %d: %s' % (i, brand)
        df = list()
        turning = True
        
        # Turning Page:
        while turning: 
            itempage = BeautifulSoup(requests.get(url_item, headers = {'user-agent': 'Scrapy_project'}).text, 'lxml')
            item_name(df, brand, itempage)
            turning = itempage.find('ul', class_= 'filters')
            if turning:
                turning = turning.find('li', class_ = 'nextPage') 
                if not turning or turning.a['href'].strip() == '#':
                    turning = False
                else:
                    url_item = url + turning.a['href'].strip() 
        print 'End of items scraping for *%s*' % brand
        
        fil = str(i) + '.csv'
        save_data(df, fil)
        
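    except Exception: # a bare 'except:' would also swallow KeyboardInterrupt
        print '********************** Error *****************'
        time.sleep(10) # pause, then continue with the next brand (the failed brand is not saved)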
print 'End of scraping all items!'

Brand 500: New Era
End of items scraping for *New Era*
Brand 501: RIPE
End of items scraping for *RIPE*
Brand 502: Perry Ellis
End of items scraping for *Perry Ellis*
Brand 503: Cathy's Concepts
End of items scraping for *Cathy's Concepts*
Brand 504: Club Room
End of items scraping for *Club Room*
Brand 505: Kenroy Home
End of items scraping for *Kenroy Home*
Brand 506: Prada
End of items scraping for *Prada*
Brand 507: Steve Madden
End of items scraping for *Steve Madden*
Brand 508: B Darlin
End of items scraping for *B Darlin*
Brand 509: Fame and Partners
End of items scraping for *Fame and Partners*
Brand 510: SK-II
End of items scraping for *SK-II*
Brand 511: Vanity Fair
End of items scraping for *Vanity Fair*
Brand 512: Kelsi Dagger Brooklyn
End of items scraping for *Kelsi Dagger Brooklyn*
Brand 513: G by GUESS
End of items scraping for *G by GUESS*
Brand 514: Tristar
End of items scraping for *Tristar*
Brand 515: Darbie Angell
End of items scraping for *Darbie Angell*
Brand 516:

End of items scraping for *Ellison First Asia*
Brand 632: Joseph Joseph
End of items scraping for *Joseph Joseph*
Brand 633: Paco Rabanne
End of items scraping for *Paco Rabanne*
Brand 634: Design Pac
End of items scraping for *Design Pac*
Brand 635: Mickey Mouse
End of items scraping for *Mickey Mouse*
Brand 636: Heart of Haiti
End of items scraping for *Heart of Haiti*
Brand 637: Bey-Berk
End of items scraping for *Bey-Berk*
Brand 638: Wendy Bellissimo
End of items scraping for *Wendy Bellissimo*
Brand 639: XSCAPE
End of items scraping for *XSCAPE*
Brand 640: City Studios
End of items scraping for *City Studios*
Brand 641: Say Yes to the Dress
End of items scraping for *Say Yes to the Dress*
Brand 642: SIS by Simone I Smith
End of items scraping for *SIS by Simone I Smith*
Brand 643: Kenneth Cole New York
End of items scraping for *Kenneth Cole New York*
Brand 644: Fitbit
End of items scraping for *Fitbit*
Brand 645: Emoji
End of items scraping for *Emoji*
Brand 646: Jack Spade
End o

End of items scraping for *Mitchell & Ness*
Brand 760: Dollhouse
End of items scraping for *Dollhouse*
Brand 761: Playtex
End of items scraping for *Playtex*
Brand 762: JM Collection
End of items scraping for *JM Collection*
Brand 763: Shark
End of items scraping for *Shark*
Brand 764: Forecaster
End of items scraping for *Forecaster*
Brand 765: littleBits
End of items scraping for *littleBits*
Brand 766: Coopersburg
End of items scraping for *Coopersburg*
Brand 767: Loloi
End of items scraping for *Loloi*
Brand 768: Majestic
End of items scraping for *Majestic*
Brand 769: Little Earth
End of items scraping for *Little Earth*
Brand 770: Madyson's Marshmallows
End of items scraping for *Madyson's Marshmallows*
Brand 771: Eyeshadow
End of items scraping for *Eyeshadow*
Brand 772: Penelope Mack
End of items scraping for *Penelope Mack*
Brand 773: BumpStart
End of items scraping for *BumpStart*
Brand 774: Wusthof
End of items scraping for *Wusthof*
Brand 775: Jay Franco
End of items scrapi

End of items scraping for *Miken*
Brand 890: Clarks
End of items scraping for *Clarks*
Brand 891: VESI
End of items scraping for *VESI*
Brand 892: White Mountain
End of items scraping for *White Mountain*
Brand 893: Touch by Alyssa Milano
End of items scraping for *Touch by Alyssa Milano*
Brand 894: Top Chef
End of items scraping for *Top Chef*
Brand 895: Belgique
End of items scraping for *Belgique*
Brand 896: Mare Mare
End of items scraping for *Mare Mare*
Brand 897: Swim Time
End of items scraping for *Swim Time*
Brand 898: Livex
End of items scraping for *Livex*
Brand 899: Team Beans
End of items scraping for *Team Beans*
Brand 900: Urban Habitat
End of items scraping for *Urban Habitat*
Brand 901: Sweet Romeo
End of items scraping for *Sweet Romeo*
Brand 902: Schmidt's
End of items scraping for *Schmidt's*
Brand 903: Bittermilk
End of items scraping for *Bittermilk*
Brand 904: Trina Turk
End of items scraping for *Trina Turk*
Brand 905: Tarte
End of items scraping for *Tarte*
Bran

End of items scraping for *Holiday Lane*
Brand 1024: Nowadays
********************** Error *****************
Brand 1025: Dirty Laundry
End of items scraping for *Dirty Laundry*
Brand 1026: Easy Street
End of items scraping for *Easy Street*
Brand 1027: Homedics
End of items scraping for *Homedics*
Brand 1028: Boelter Brands
End of items scraping for *Boelter Brands*
Brand 1029: Hanes
End of items scraping for *Hanes*
Brand 1030: Esprit
End of items scraping for *Esprit*
Brand 1031: Trolls by DreamWorks
End of items scraping for *Trolls by DreamWorks*
Brand 1032: Hanna Andersson
End of items scraping for *Hanna Andersson*
Brand 1033: Macy's Impulse Beauty Collection
End of items scraping for *Macy's Impulse Beauty Collection*
Brand 1034: Westport
End of items scraping for *Westport*
Brand 1035: Rosie Pope
End of items scraping for *Rosie Pope*
Brand 1036: Bernardo
End of items scraping for *Bernardo*
Brand 1037: Checkered Flag Sports
End of items scraping for *Checkered Flag Sports*
Bra

End of items scraping for *Simply Designz*
Brand 1151: Movado
End of items scraping for *Movado*
Brand 1152: Anova
End of items scraping for *Anova*
Brand 1153: CCM
End of items scraping for *CCM*
Brand 1154: Circus by Sam Edelman
End of items scraping for *Circus by Sam Edelman*
Brand 1155: The Style Club
End of items scraping for *The Style Club*
Brand 1156: La Blanca
End of items scraping for *La Blanca*
Brand 1157: Eileen West
End of items scraping for *Eileen West*
Brand 1158: Department 56
End of items scraping for *Department 56*
Brand 1159: iTouch
End of items scraping for *iTouch*
Brand 1160: Petunia Pickle Bottom
End of items scraping for *Petunia Pickle Bottom*
Brand 1161: B BLOCK Headwear
End of items scraping for *B BLOCK Headwear*
Brand 1162: PLANT Apothecary
End of items scraping for *PLANT Apothecary*
Brand 1163: Madison Park
End of items scraping for *Madison Park*
Brand 1164: Avengers
End of items scraping for *Avengers*
Brand 1165: Shimmer and Shine
End of items scra

End of items scraping for *Jay Z*
Brand 1282: Sunbeam
End of items scraping for *Sunbeam*
Brand 1283: Sub_Urban Riot
End of items scraping for *Sub_Urban Riot*
Brand 1284: Sealy
End of items scraping for *Sealy*
Brand 1285: Creative Bath
End of items scraping for *Creative Bath*
Brand 1286: Monif C.
End of items scraping for *Monif C.*
Brand 1287: Isaac Morris
End of items scraping for *Isaac Morris*
Brand 1288: Bulova
End of items scraping for *Bulova*
Brand 1289: Lacoste
End of items scraping for *Lacoste*
Brand 1290: Blue 84
End of items scraping for *Blue 84*
Brand 1291: Armani Jeans
End of items scraping for *Armani Jeans*
Brand 1292: RITUALS
End of items scraping for *RITUALS*
Brand 1293: Carlos by Carlos Santana
End of items scraping for *Carlos by Carlos Santana*
Brand 1294: True Religion
End of items scraping for *True Religion*
Brand 1295: Hue
End of items scraping for *Hue*
Brand 1296: 3R Studio
End of items scraping for *3R Studio*
Brand 1297: Jansport
End of items scraping

End of items scraping for *Warner's*
Brand 1412: Sleep On It
End of items scraping for *Sleep On It*
Brand 1413: Sweet Heart Rose
End of items scraping for *Sweet Heart Rose*
Brand 1414: Godiva
End of items scraping for *Godiva*
Brand 1415: Tommy Hilfiger
End of items scraping for *Tommy Hilfiger*
Brand 1416: Hamilton Beach
End of items scraping for *Hamilton Beach*
Brand 1417: JPR
End of items scraping for *JPR*
Brand 1418: PANTONE UNIVERSE (TM)
End of items scraping for *PANTONE UNIVERSE (TM)*
Brand 1419: Waterpik
End of items scraping for *Waterpik*
Brand 1420: Lauren Madison
End of items scraping for *Lauren Madison*
Brand 1421: Jou Jou
End of items scraping for *Jou Jou*
Brand 1422: Hypnotize
End of items scraping for *Hypnotize*
Brand 1423: Bissell
End of items scraping for *Bissell*
Brand 1424: jam
End of items scraping for *jam*
Brand 1425: St. Tropez
End of items scraping for *St. Tropez*
Brand 1426: Shun
End of items scraping for *Shun*
Brand 1427: Levi's
End of items scrapin

End of items scraping for *Spring Air*
Brand 1542: Cartier
End of items scraping for *Cartier*
Brand 1543: Protect-A-Bed
End of items scraping for *Protect-A-Bed*
Brand 1544: DERMAFLASH
End of items scraping for *DERMAFLASH*
Brand 1545: Clubhouse
End of items scraping for *Clubhouse*
Brand 1546: LRG
End of items scraping for *LRG*
Brand 1547: Marc Tetro
End of items scraping for *Marc Tetro*
Brand 1548: Cole & Mason
End of items scraping for *Cole & Mason*
Brand 1549: Ashley Graham
End of items scraping for *Ashley Graham*
Brand 1550: Love Tribe
End of items scraping for *Love Tribe*
Brand 1551: American Needle
End of items scraping for *American Needle*
Brand 1552: Charbonnel et Walker
End of items scraping for *Charbonnel et Walker*
Brand 1553: Jambu
End of items scraping for *Jambu*
Brand 1554: Modern Littles
End of items scraping for *Modern Littles*
Brand 1555: Miraclesuit
End of items scraping for *Miraclesuit*
Brand 1556: Fairfield Square Collection
End of items scraping for *Fa

After scraping, check the output above for error notifications (for example, brand 1024 failed) and re-scrape the brands that raised them.
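Scanning a long log by eye is error-prone, so one option is to collect the failed brands in a dictionary while scraping. The following is a minimal sketch, not the code used in this notebook: `run_with_retries` and `scrape_one` are hypothetical names, and `scrape_one` stands for any function that scrapes one brand and raises an exception on failure. It is written so that it also runs under Python 3.

```python
def run_with_retries(tasks, scrape_one, max_attempts=2):
    """Run scrape_one(name, link) for every task and return (results, failed).

    tasks      : dict mapping a brand name to its URL (like `brands` above)
    scrape_one : hypothetical callable returning the rows for one brand;
                 it is expected to raise an exception when scraping fails
    """
    results, failed = {}, {}
    for name, link in tasks.items():
        for attempt in range(max_attempts):
            try:
                results[name] = scrape_one(name, link)
                failed.pop(name, None)    # an earlier attempt failed, this one worked
                break
            except Exception as exc:
                failed[name] = repr(exc)  # keep the most recent error message
    return results, failed
```

After the run, `failed` lists exactly the brands that still need attention, together with the last error seen for each.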

### 4. Import multiple csv files into pandas and concatenate into the final DataFrame:

In [31]:
import glob

In [51]:
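path = '.' # directory that holds the per-brand csv files
# note: if macys_items.csv already exists in this directory, it is matched as well
allFiles = glob.glob(path + "/*.csv")
lst = []
for fil in allFiles:
    data_file = pd.read_csv(fil, index_col = None, header = 0)
    lst.append(data_file)
# drop the index column that to_csv wrote into each file
final_data = pd.concat(lst).drop('Unnamed: 0', axis = 1)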

In [54]:
final_data.shape

(208898, 3)

Save data to the file 'macys_items.csv':

In [53]:
save_data(final_data, 'macys_items.csv')

### 5. Some Notes:

**How this approach could be generalized to other merchants:**

Websites of online merchants such as Macy's, JCPenney, Nordstrom, Kohl's, Bloomingdale's, and Saks Fifth Avenue share some common features in their page structure. The main page leads to several categories (Home, Men, Women, Kids, ...), and every category has many subcategories (Clothing, Shoes, Accessories, ...). The annoying thing about this structure is that on websites like Macy's many items appear in multiple categories, so the category pages overlap.

That's because the categories are not mutually exclusive: for two categories B and C, B ∩ C ≠ ∅ in general. Some categories are defined by the function of the items (clothing, shoes, bed, bath), while others follow different rules (sales, best offers, and more). It would be extremely inefficient to spend days working out every merchant's classification rules.

To scrape all items on a shopping site, the conservative way is to scrape the items in every root category and then apply a duplicate filter to remove the repeated items.
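A duplicate filter of this kind can be sketched in a few lines. The example below assumes that the 'link' field identifies an item uniquely, which may not hold if the same product is linked with different query parameters; `drop_duplicate_items` is a hypothetical helper and the sample records are made up for illustration.

```python
def drop_duplicate_items(items, key='link'):
    """Keep the first record for each distinct value of `key`."""
    seen = set()
    unique = []
    for item in items:
        if item[key] not in seen:
            seen.add(item[key])
            unique.append(item)
    return unique

# Illustrative records only; these are not scraped data.
sample = [
    {'brand': 'A', 'name': 'Shirt', 'link': '/p/1'},
    {'brand': 'A', 'name': 'Shirt', 'link': '/p/1'},  # same item found in a second category
    {'brand': 'B', 'name': 'Boots', 'link': '/p/2'},
]
unique_items = drop_duplicate_items(sample)
```

For data already loaded into pandas, `final_data.drop_duplicates(subset='link')` should give the same result under the same assumption.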

In my solution, I used 'BeautifulSoup' + 'requests' for the web scraping. My initial idea was to extract the main categories from the home page and then drill down from the top to the root pages, recording the product details of every item. However, this became much more complex when I found that the subcategories on Macy's follow messy classification rules, which meant I would need to build a duplicate filter. From my understanding, the 'scrapy' framework is a better choice than 'BeautifulSoup' + 'requests' for this scraping problem.

So why didn't I code this in the 'scrapy' framework?

Honestly, although it would be more efficient both for generalizing the scraping process and for the subsequent ETL process with MongoDB, I don't have much time to rebuild the scraping process with 'scrapy' and MongoDB now. Therefore, I kept looking for a more efficient way to complete the web scraping with 'BeautifulSoup'.

Fortunately, I found that the 'Brands' classification is the best way to reach all items without duplicated or missing items. This is the key choice in my web scraping.

'Brand' also appears on 'Bestbuy.com' and some other websites as a main category or subcategory, so my approach can be easily generalized to those merchants. For other merchants, 'Scraping All' + 'Duplicate Filter' is a more general and scalable approach.

**How to assess the accuracy?**

During the scraping, my code reports errors and saves the scraped data frequently (one file per brand), so it's easy to monitor errors that may lead to inaccurate data.

However, item information on Macys.com is updated all the time, and data scraping is not instantaneous, so the site may differ slightly between the start and the end of a scraping run. One method is to scrape the data multiple times and compare the results to approximately assess the accuracy. Another is to scrape the data on several machines (or cloud clusters) simultaneously and compare the results.
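Comparing two runs can be as simple as comparing the sets of item links they produced. The sketch below is one possible way to do it, not part of the original notebook: `compare_runs` is a hypothetical helper, and the Jaccard index (shared links divided by all links) is used here as a rough agreement score.

```python
def compare_runs(links_a, links_b):
    """Summarize how much two scraping runs agree, based on item links."""
    a, b = set(links_a), set(links_b)
    union = a | b
    return {
        'only_in_a': len(a - b),   # items seen only in the first run
        'only_in_b': len(b - a),   # items seen only in the second run
        'in_both': len(a & b),
        # 1.0 means both runs found exactly the same links
        'jaccard': len(a & b) / float(len(union)) if union else 1.0,
    }

# Illustrative links only; these are not scraped data.
report = compare_runs(['/p/1', '/p/2', '/p/3'], ['/p/2', '/p/3', '/p/4'])
```

A score well below 1.0 would suggest that the catalog changed between runs or that one run missed pages, and the `only_in_a` and `only_in_b` counts show in which direction.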

**Notes:**

1. I use IP proxies and temporary sleeps (`time.sleep`) to get around the anti-scraping measures of Macys.com. Fortunately, Macys.com is friendly to scraping spiders.
2. IP proxies are provided here: https://hidemy.name/en/proxy-list/
3. The scraping code can be run on AWS or other cloud platforms; I keep it running on my local machine.
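The notebook does not show how the proxies were wired in, so the following is only a sketch of how proxies and pauses might be combined. `polite_get` and `proxy_pool` are hypothetical names, the proxy address in the usage note is a placeholder, and the `get` argument is expected to behave like `requests.get`, which accepts `headers`, `proxies` and `timeout` keyword arguments.

```python
import random
import time

def polite_get(url, get, proxy_pool=(), min_delay=1.0, max_delay=3.0,
               sleep=time.sleep):
    """Wait a random interval, then fetch `url` through a randomly chosen proxy.

    get        : callable with the keyword interface of requests.get
    proxy_pool : hypothetical list of proxy URLs; empty means no proxy
    sleep      : injectable so the delay can be skipped in tests
    """
    sleep(random.uniform(min_delay, max_delay))
    kwargs = {'headers': {'user-agent': 'Scrapy_project'}, 'timeout': 30}
    if proxy_pool:
        proxy = random.choice(list(proxy_pool))
        # route both http and https traffic through the chosen proxy
        kwargs['proxies'] = {'http': proxy, 'https': proxy}
    return get(url, **kwargs)
```

With `requests` imported, a call could look like `polite_get(url_item, requests.get, proxy_pool=['http://10.0.0.1:8080'])`. Randomizing the delay is a design choice made for this sketch; the loop above uses a fixed five-second pause every ten brands instead.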