# Scraping Poshmark for Recent Postings
### _Date of Scrape: May 28, 2020_

### Imports

In [1]:
from requests import get
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

### Poshmark - Men

In [2]:
base_url = 'https://poshmark.com/category/Men'
res = get(base_url)
print(res.status_code)

200


In [17]:
def get_page(url):
    # Fetch the given URL; return the response on success, otherwise the status code
    res = get(url)
    if res.status_code == 200:
        return res
    else:
        return res.status_code

In [None]:
soup = BeautifulSoup(res.text, 'html.parser')

# Create a list of all div containers that have a class of 'tile'
# These are the boxes that represent a single item when browsing the site
tiles = soup.find_all('div', class_ = 'tile')
print(len(tiles))

Poshmark certainly has more than 48 items for sale. We captured only 48 because that is the number of tiles included in the initial page load. You may get a different number, but the reason is the same: the initial scrape is limited to the tiles that are loaded up front.

For now, the goal is to get these values into a pandas DataFrame. Getting more observations is a problem to tackle another day.

#### Extracting Values

I would also like to scrape the time each item was posted, but that requires visiting each item's Page URL to grab it. The same goes for color. Another day.

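|   Column  | dtype |                  Example Values                  |            Description            |
|:---------:|:-----:|:------------------------------------------------:|:---------------------------------:|
|   Title   |  str  | Men's New Balance Revlite 1550 Army Green size 9 |       Title given by seller       |
|   Seller  |  str  |                 example_username                 |         Username of seller        |
|   Price   |  int  |          25 (shown on the site as $25)           |       Asking price by seller      |
|    Size   |  str  |                     Size: XXL                    |    Size label shown on the tile   |
|   Brand   |  str  |             American Eagle Outfitters            |      Brand shown on the tile      |
|  Page URL |  str  | /listing/Orange-AMerican-Eagle-hoodie-5ea49b5888cce33ce1cf3a07 | Relative link to the listing page |
| Image URL |  str  | https://di2ponv0v5otw.cloudfront.net/posts/2020/04/25/5ea49b5888cce33ce1cf3a07/s_5ea49b6ecb692c6ec7629658.jpg | Link to the tile's cover image |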

In [4]:
# For the sake of demonstration for myself, I will be extracting values from a single tile
ex_tile = tiles[0]
print(ex_tile.prettify())

<div class="tile col-x12 col-l6 col-s8">
 <div class="card card--small">
  <a class="tile__covershot" data-et-element-type="image" data-et-name="listing" data-et-prop-listing_id="5ea49b5888cce33ce1cf3a07" data-et-prop-location="listing_tile" data-et-prop-unit_position="0" href="/listing/Orange-AMerican-Eagle-hoodie-5ea49b5888cce33ce1cf3a07">
   <div class="img__container img__container--square">
    <img alt="Orange AMerican Eagle hoodie" data-src="https://di2ponv0v5otw.cloudfront.net/posts/2020/04/25/5ea49b5888cce33ce1cf3a07/s_5ea49b6ecb692c6ec7629658.jpg" src="https://di2ponv0v5otw.cloudfront.net/posts/2020/04/25/5ea49b5888cce33ce1cf3a07/s_5ea49b6ecb692c6ec7629658.jpg"/>
   </div>
   <!-- -->
  </a>
  <div class="item__details">
   <div class="title__condition__container">
    <a class="tile__title tc--b" data-et-element-type="link" data-et-name="listing" data-et-prop-listing_id="5ea49b5888cce33ce1cf3a07" data-et-prop-location="listing_tile" data-et-prop-unit_position="0" href="/list

In [5]:
print(ex_tile.find_all('a', attrs = {
    'data-et-element-type': 'image'
})[0].prettify())

<a class="tile__covershot" data-et-element-type="image" data-et-name="listing" data-et-prop-listing_id="5ea49b5888cce33ce1cf3a07" data-et-prop-location="listing_tile" data-et-prop-unit_position="0" href="/listing/Orange-AMerican-Eagle-hoodie-5ea49b5888cce33ce1cf3a07">
 <div class="img__container img__container--square">
  <img alt="Orange AMerican Eagle hoodie" data-src="https://di2ponv0v5otw.cloudfront.net/posts/2020/04/25/5ea49b5888cce33ce1cf3a07/s_5ea49b6ecb692c6ec7629658.jpg" src="https://di2ponv0v5otw.cloudfront.net/posts/2020/04/25/5ea49b5888cce33ce1cf3a07/s_5ea49b6ecb692c6ec7629658.jpg"/>
 </div>
 <!-- -->
</a>



##### Title and Page URL

Since the title of the post and the link to the post are in the same tag, we will grab both in the same cell below.

In [18]:
# Find the first instance of an a tag with the 'tile__title' class.
# Then strip all leading and trailing whitespace from the resulting text.
ex_title_pageurl = ex_tile.find('a', class_ = 'tile__title')
ex_title = ex_title_pageurl.get_text(strip = True)
ex_page_url = ex_title_pageurl.get('href')
print(f'title: {ex_title}')
print(f'page url: https://www.poshmark.com{ex_page_url}')

title: Orange AMerican Eagle hoodie
page url: https://www.poshmark.com/listing/Orange-AMerican-Eagle-hoodie-5ea49b5888cce33ce1cf3a07


In [48]:
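def get_title(tile):
    "Gets the title of the item, or None if the tile has no title tag."
    try:
        return tile.find('a', class_ = 'tile__title').get_text(strip = True)
    except AttributeError:
        # find() returned None because the tag is missing
        return None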

get_title(ex_tile)

'Orange AMerican Eagle hoodie'

In [49]:
def get_page_url(tile):
    "Gets the relative URL of the item. Prepend 'https://www.poshmark.com' for the full link."
    try:
        return tile.find('a', class_ = 'tile__title').get('href')
    except AttributeError:
        # find() returned None because the tag is missing
        return None
        
get_page_url(ex_tile)

'/listing/Orange-AMerican-Eagle-hoodie-5ea49b5888cce33ce1cf3a07'
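
Because `get_page_url` returns a relative path, something has to add the host before the link is usable. Here is a minimal sketch of one way to do that with the standard library's `urljoin`. The `POSHMARK_ROOT` constant and the `to_absolute_url` helper are names I made up for illustration, and the root is assumed to match `base_url` above.

```python
from urllib.parse import urljoin

# Assumption: listing paths are relative to the same host used in base_url.
POSHMARK_ROOT = 'https://poshmark.com'

def to_absolute_url(path, root=POSHMARK_ROOT):
    """Join a relative listing path onto the site root; pass None through."""
    if path is None:
        return None
    return urljoin(root, path)

print(to_absolute_url('/listing/Orange-AMerican-Eagle-hoodie-5ea49b5888cce33ce1cf3a07'))
```

Passing `None` through keeps the helper safe to apply to tiles where no link was found.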

##### Seller

In [12]:
ex_seller = ex_tile.find('span', class_ = 'tc--g m--l--1').get_text(strip = True)
print(ex_seller)

michyxx3


In [50]:
def get_seller(tile):
    "Gets the seller's username, or None if the tag is missing."
    try:
        return tile.find('span', class_ = 'tc--g m--l--1').get_text(strip = True)
    except AttributeError:
        return None

get_seller(ex_tile)

'michyxx3'

##### Price

In [7]:
ex_price = ex_tile.find('span', class_ = 'fw--bold').get_text(strip = True)
print(ex_price)

$23


In [63]:
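def get_price(tile):
    "Gets the asking price as an int, or None if it is missing or not a plain number."
    try:
        # Skip the dollar sign and return the rest of the string as an int
        return int(tile.find('span', class_ = 'fw--bold').get_text(strip = True)[1:])
    except (AttributeError, ValueError):
        return None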

get_price(ex_tile)

23
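
Slicing off the first character works for a label like `$23`, but `int()` would fail on a label that contains a thousands separator. Below is a rough sketch of a more forgiving parser. The `parse_price` name is hypothetical, and I am assuming prices are whole dollars, so anything after a decimal point is dropped.

```python
import re

def parse_price(text):
    """Turn a price label such as '$23' or '$1,200' into an int, or None.

    Assumption: prices are whole dollars; cents, if present, are ignored.
    """
    if not text:
        return None
    # Grab the first run of digits, allowing commas as thousands separators
    match = re.search(r'\d[\d,]*', text)
    if match is None:
        return None
    return int(match.group().replace(',', ''))

print(parse_price('$23'))
print(parse_price('$1,200'))
```

This takes the text of the price tag, so it could replace the `int(...[1:])` step inside `get_price` if labels with commas ever show up.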

##### Size

In [8]:
ex_size = ex_tile.find('a', class_ = 'tile__details__pipe__size').get_text(strip = True)
print(ex_size)

Size: XXL


In [52]:
def get_size(tile):
    "Gets the size label (e.g. 'Size: XXL'), or None if the tag is missing."
    try:
        return tile.find('a', class_ = 'tile__details__pipe__size').get_text(strip = True)
    except AttributeError:
        return None

get_size(ex_tile)

'Size: XXL'
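
Every size comes back with a leading `Size: ` label, which is redundant once the value lives in a column named Size. A small sketch of stripping it is below. The `clean_size` helper is a name I am introducing for illustration, and the `Size:` prefix is assumed from the two tiles inspected in this notebook.

```python
def clean_size(label, prefix='Size:'):
    """Remove a leading 'Size:' label and surrounding whitespace."""
    if label is None:
        return None
    label = label.strip()
    if label.startswith(prefix):
        label = label[len(prefix):]
    return label.strip()

print(clean_size('Size: XXL'))
print(clean_size('Size: OS'))
```

Labels without the prefix are returned unchanged apart from whitespace, so the helper should be harmless if the site's markup differs on some tiles.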

##### Brand

In [9]:
ex_brand = ex_tile.find('a', class_ = 'tile__details__pipe__brand').get_text(strip = True)
print(ex_brand)

American Eagle Outfitters


In [55]:
def get_brand(tile):
    "Gets the brand name, or None if the tag is missing."
    try:
        return tile.find('a', class_ = 'tile__details__pipe__brand').get_text(strip = True)
    except AttributeError:
        return None

get_brand(ex_tile)

'American Eagle Outfitters'

##### Image URL

In [10]:
ex_img = ex_tile.find('img').get('data-src')
print(ex_img)

https://di2ponv0v5otw.cloudfront.net/posts/2020/04/25/5ea49b5888cce33ce1cf3a07/s_5ea49b6ecb692c6ec7629658.jpg


In [54]:
def get_img(tile):
    "Gets the URL of the tile's cover image, or None if there is no img tag."
    try:
        return tile.find('img').get('data-src')
    except AttributeError:
        return None

# Check if the result of the function is the same
# as the link above via copy and paste
get_img(ex_tile) == 'https://di2ponv0v5otw.cloudfront.net/posts/2020/04/25/5ea49b5888cce33ce1cf3a07/s_5ea49b6ecb692c6ec7629658.jpg'

True

### Checking If the Functions Work on Other Tiles For Consistency

In [14]:
ex_tile_1 = tiles[1]
print(ex_tile_1.prettify())

<div class="tile col-x12 col-l6 col-s8">
 <div class="card card--small">
  <a class="tile__covershot" data-et-element-type="image" data-et-name="listing" data-et-prop-listing_id="5ece9c6ee131649eb0693909" data-et-prop-location="listing_tile" data-et-prop-unit_position="1" href="/listing/USC-sweat-headband-5ece9c6ee131649eb0693909">
   <div class="img__container img__container--square">
    <img alt="USC sweat headband" data-src="https://di2ponv0v5otw.cloudfront.net/posts/2020/05/27/5ece9c6ee131649eb0693909/s_5ece9c73ac97021630c0d9b3.jpg" src="https://di2ponv0v5otw.cloudfront.net/posts/2020/05/27/5ece9c6ee131649eb0693909/s_5ece9c73ac97021630c0d9b3.jpg"/>
   </div>
   <!-- -->
  </a>
  <div class="item__details">
   <div class="title__condition__container">
    <a class="tile__title tc--b" data-et-element-type="link" data-et-name="listing" data-et-prop-listing_id="5ece9c6ee131649eb0693909" data-et-prop-location="listing_tile" data-et-prop-unit_position="1" href="/listing/USC-sweat-headba

In [31]:
get_title(ex_tile_1)

'USC sweat headband'

In [32]:
get_seller(ex_tile_1)

'imjacqueline'

In [33]:
get_price(ex_tile_1)

8

In [34]:
get_size(ex_tile_1)

'Size: OS'

In [35]:
get_brand(ex_tile_1)

'USC'

In [38]:
print(f'https://www.poshmark.com{get_page_url(ex_tile_1)}')

https://www.poshmark.com/listing/USC-sweat-headband-5ece9c6ee131649eb0693909


In [40]:
print(get_img(ex_tile_1))

https://di2ponv0v5otw.cloudfront.net/posts/2020/05/27/5ece9c6ee131649eb0693909/s_5ece9c73ac97021630c0d9b3.jpg


That's enough testing for me! I'm sure I'll run into a problem as I loop through the other 46 tiles, but let's go for it!

#### Populating a DataFrame with 48 Poshmark Items

In [64]:
titles, sellers, prices, sizes, brands, p_urls, i_urls = [], [], [], [], [], [], []
for tile in tiles:
    titles.append(get_title(tile))
    sellers.append(get_seller(tile))
    prices.append(get_price(tile))
    sizes.append(get_size(tile))
    brands.append(get_brand(tile))
    p_urls.append(get_page_url(tile))
    i_urls.append(get_img(tile))
df = pd.DataFrame({
    'Title': titles,
    'Seller': sellers,
    'Price': prices,
    'Size': sizes,
    'Brand': brands,
    'Page URL': p_urls,
    'Image URL': i_urls
})

df.head()

|   | Title | Seller | Price | Size | Brand | Page URL | Image URL |
|:-:|:------|:-------|------:|:-----|:------|:---------|:----------|
| 0 | Orange AMerican Eagle hoodie | michyxx3 | 23 | Size: XXL | American Eagle Outfitters | /listing/Orange-AMerican-Eagle-hoodie-5ea49b58... | https://di2ponv0v5otw.cloudfront.net/posts/202... |
| 1 | USC sweat headband | imjacqueline | 8 | Size: OS | USC | /listing/USC-sweat-headband-5ece9c6ee131649eb0... | https://di2ponv0v5otw.cloudfront.net/posts/202... |
| 2 | Buffalo David Bitton XL distressed green Vneck... | fashionrunway4 | 16 | Size: XL | Buffalo David Bitton | /listing/Buffalo-David-Bitton-XL-distressed-gr... | https://di2ponv0v5otw.cloudfront.net/posts/201... |
| 3 | Nike Joyride | tracib41 | 100 | Size: 11 | Nike | /listing/Nike-Joyride-5ec944fe6f6c91e905b830f5 | https://di2ponv0v5otw.cloudfront.net/posts/202... |
| 4 | BLEACHED RETRO NASCAR STYLE TEE | joeybrzezinski | 20 | Size: XL | winners circle | /listing/BLEACHED-RETRO-NASCAR-STYLE-TEE-5ecdd... | https://di2ponv0v5otw.cloudfront.net/posts/202... |
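
Since each extractor returns `None` when a tag is missing, it would be useful to know how often that happens across all tiles. Here is a sketch of one possible way to run a set of extractors per tile and tally the gaps. `extract_row` and `count_missing` are hypothetical helpers, and the demo uses made-up plain dicts in place of real BeautifulSoup tiles so that it runs on its own.

```python
def extract_row(tile, extractors):
    """Apply each named extractor to one tile and collect the results."""
    return {name: func(tile) for name, func in extractors.items()}

def count_missing(rows):
    """Count, per column, how many rows came back as None."""
    counts = {}
    for row in rows:
        for name, value in row.items():
            counts[name] = counts.get(name, 0) + (value is None)
    return counts

# Stand-in data (not scraped): the second "tile" has no price on purpose
demo_tiles = [{'title': 'first item', 'price': 8}, {'title': 'second item'}]
demo_extractors = {
    'Title': lambda t: t.get('title'),
    'Price': lambda t: t.get('price'),
}
demo_rows = [extract_row(t, demo_extractors) for t in demo_tiles]
print(count_missing(demo_rows))
```

With the real tiles, the extractor dict would map column names to the functions defined earlier, for example `{'Title': get_title, 'Price': get_price}`, and the resulting list of dicts can be passed straight to `pd.DataFrame`.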


In [65]:
df.dtypes

Title        object
Seller       object
Price         int64
Size         object
Brand        object
Page URL     object
Image URL    object
dtype: object

Sizes are all over the place. Are the XLs shirts? Is a size 11 a shoe, a hat, or a pair of pants? We will need to see whether we can categorize these items as we scrape.
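
As a first step toward that, the size labels themselves can at least be grouped by kind. The sketch below is only a heuristic: `size_kind` is a hypothetical helper, the reading of `OS` as "one size" is my assumption, and knowing that a size is numeric still does not say whether the item is a shoe or a pair of pants.

```python
import re

# Letter sizes such as S, M, L, XL, XXL, 2XL (assumed set, not taken from the site)
LETTER_SIZE = re.compile(r'^(?:(?:X{1,3}|\d+X)?[SL]|M)$')

def size_kind(size):
    """Roughly classify a size label as 'letter', 'numeric', 'one size', 'other', or 'unknown'."""
    if not size:
        return 'unknown'
    s = size.strip().upper()
    # Tolerate the 'Size: ' label that the tiles carry
    if s.startswith('SIZE:'):
        s = s[len('SIZE:'):].strip()
    if s == 'OS':
        return 'one size'
    if LETTER_SIZE.match(s):
        return 'letter'
    if re.fullmatch(r'\d+(?:\.\d+)?', s):
        return 'numeric'
    return 'other'

for label in ['Size: XXL', 'Size: OS', 'Size: 11']:
    print(label, '->', size_kind(label))
```

Applied to the Size column, this would give a rough grouping to inspect before deciding how to categorize the items properly.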