# Scraping in Python

For scraping we'll use two libraries: `requests` and `Beautiful Soup`. We also import `pandas` to store and analyse the results. Let's import them.

In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup as bs

We will scrape an Etsy search page. Run the search in the browser and copy the URL here.

In [4]:
www = "https://www.etsy.com/search?q=dog+sweater&explicit=1&locationQuery=6255148"
# &ship_to=SE
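
The copied URL is a base address followed by a query string. As a small sketch, the same address can be rebuilt from a dictionary of parameters with the standard library, which makes it easier to change the search term later. The names `base`, `params` and `www_rebuilt` are illustrative and not part of the original notebook.

```python
from urllib.parse import urlencode

# base address of the Etsy search page
base = "https://www.etsy.com/search"

# query parameters copied from the URL above
params = {"q": "dog sweater", "explicit": 1, "locationQuery": 6255148}

# urlencode joins the pairs with '&' and encodes the space as '+'
www_rebuilt = f"{base}?{urlencode(params)}"
print(www_rebuilt)
```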

We will use only a few methods from these libraries. From the `requests` library, we need just one:

```python
requests.get()
```

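From `Beautiful Soup`, we will use three things:
```python
soup.select()    # find all elements that match a CSS selector
element.get()    # read an attribute (for example href) of an element
element.text     # read the text inside an element
```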

Let's open the website with requests and parse it with Beautiful Soup, like this:

```python
r = requests.get(www)
soup = bs(r.text, "html.parser")
```

In [8]:
r = requests.get(www)
#r.text
soup = bs(r.text, "html.parser")
#soup
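
A request can fail or be blocked, and then `r.text` holds an error page instead of search results. The following is a minimal sketch of a more careful download step, not the notebook's own method. The helper names `parse_html` and `get_soup`, the timeout value and the User-Agent string are all assumptions made for illustration.

```python
import requests
from bs4 import BeautifulSoup as bs

def parse_html(html):
    """Parse an HTML string with Python's built-in parser."""
    return bs(html, "html.parser")

def get_soup(url, timeout=10):
    """Download a page and return it as a BeautifulSoup object."""
    # hypothetical User-Agent string; some sites reject the default one
    headers = {"User-Agent": "Mozilla/5.0 (scraping tutorial)"}
    r = requests.get(url, headers=headers, timeout=timeout)
    # raise an HTTPError for 4xx/5xx responses instead of parsing an error page
    r.raise_for_status()
    return parse_html(r.text)

# offline check of the parsing step, no network needed
demo = parse_html("<html><head><title>Dog sweaters</title></head></html>")
print(demo.title.text)
```

With these helpers, `soup = get_soup(www)` would replace the two lines above.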

Next, we need to know a little about how HTML is structured and about CSS selectors.

![html](img/anatomy-of-an-html-element.jpg)
![html](img/html-element.png)

Let's have a look at what our soup looks like:

In [None]:
soup

CSS selectors will allow us to look for elements within the HTML code.

![html](img/css_selectors.png)
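
To make the selector syntax concrete, here is a small sketch on a hand-written HTML snippet. The markup and the class names `card`, `title`, `big` and `link` are invented for this example and are not real Etsy markup.

```python
from bs4 import BeautifulSoup as bs

# a tiny hand-written page, not real Etsy markup
html = """
<div id="results">
  <div class="card">
    <h3 class="title big">Wool sweater</h3>
    <a class="link" href="/listing/1">open</a>
  </div>
  <div class="card">
    <h3 class="title">Knitting pattern</h3>
    <a class="link" href="/listing/2">open</a>
  </div>
</div>
"""
soup_demo = bs(html, "html.parser")

# tag.class: every <h3> that has the class 'title'
titles = [h.text for h in soup_demo.select("h3.title")]

# #id descendant: every <a> anywhere inside the element with id 'results'
links = [a.get("href") for a in soup_demo.select("#results a")]

# .class.class: elements that carry both classes
big = [h.text for h in soup_demo.select(".title.big")]

print(titles, links, big)
```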


In [12]:
soup.select("h3.v2-listing-card__title")

[<h3 class="wt-text-caption v2-listing-card__title wt-text-truncate">
                 PDF Instant Digital Download 4 dog coats knitting pattern double knit and chunky (2505)
         </h3>,
 <h3 class="wt-text-caption v2-listing-card__title wt-text-truncate">
                 Icelandic Dog Sweater - lopapeysa - wool
         </h3>,
 <h3 class="wt-text-caption v2-listing-card__title wt-text-truncate">
                 3 Sizes Crochet Dog pullover Sweater in ARAN Yarn **PDF Instant Download** Pattern ONLY
         </h3>,
 <h3 class="wt-text-caption v2-listing-card__title wt-text-truncate">
                 Maglioncino 100% cashmere “Loro Piana”  per cani di piccola taglia handmade made in Italy cane piccolo small dog teacup jumper dog sweater
         </h3>,
 <h3 class="wt-text-caption v2-listing-card__title wt-text-truncate">
                 Dog sweater pattern, Small and Medium dog sweater pattern, Pet stripe sweater crochet pattern, Cat crochet pattern, Sphynx cat sweater.
         

In [14]:
soup.select("span.currency-value")

[<span class="currency-value">1.93</span>,
 <span class="currency-value">54.57</span>,
 <span class="currency-value">3.75</span>,
 <span class="currency-value">52.62</span>,
 <span class="currency-value">3.06</span>,
 <span class="currency-value">3.82</span>,
 <span class="currency-value">32.40</span>,
 <span class="currency-value">36.00</span>,
 <span class="currency-value">1.38</span>,
 <span class="currency-value">4.51</span>]
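
`select()` returns a list of tags, not plain values. As a short sketch, the text inside each tag can be pulled out with `.text` and converted to a number. The snippet below reuses two of the elements from the output above as an inline string, so it runs without a network connection. The names `soup_demo` and `prices` are illustrative.

```python
from bs4 import BeautifulSoup as bs

# two elements copied from the output above
html = (
    '<span class="currency-value">1.93</span>'
    '<span class="currency-value">54.57</span>'
)
soup_demo = bs(html, "html.parser")

# .text returns the string inside the tag; float() turns it into a number
prices = [float(span.text.strip()) for span in soup_demo.select("span.currency-value")]
print(prices)
```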

In [None]:
# empty list to hold our data
data = []

In [None]:
for n in range(1,21):
    
    print(n)
    
    # create a webpage link with the page reference
    www = f"https://www.etsy.com/search?q=dog+sweater&explicit=1&locationQuery=6255148&page={n}&ref=pagination"
   
    # scrape this page
    r = requests.get(www)
    soup = bs(r.text, "html.parser")
    
    # for every 'div' item on the page with the class 'v2-listing-card'
    for item in soup.select("div.v2-listing-card"):

        # create an empty dictionary that will hold the information from every item.
        row = {}

        row["title"] = item.select("h3.v2-listing-card__title")[0].text.strip()
        row["price"] = item.select("span.currency-value")[0].text
        # TODO : add currency
        row["link"] = item.select(".listing-link")[0].get("href")

        # open the page of the item
        r_item = requests.get(row["link"])
        soup_item = bs(r_item.text, "html.parser")

        # find the URL of the shop
        row["shop_url"] = soup_item.select("#desktop_shop_owners_parent a")[0].get("href")

        # open the URL of the shop
        r_shop = requests.get(row["shop_url"])
        soup_shop = bs(r_shop.text, "html.parser")
        
        # try to find the shop location. If it fails, save an empty string
        try:
            row["location"] = soup_shop.select("span.shop-location")[0].text
        except IndexError:
            row["location"] = ''

        # append the dictionary to the data list
        data.append(row)

    # to save progress after each page, uncomment the .to_csv call
    pd.DataFrame(data)#.to_csv("data.csv")
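
The loop above indexes `[0]` on most `select()` results, so a single listing card without a price or a link stops the whole run with an `IndexError`. One possible way around this, sketched here under assumptions, is to use `select_one()`, which returns `None` when nothing matches. The helpers `first_text` and `first_attr` are hypothetical names, and the card is hand-written rather than real Etsy markup.

```python
from bs4 import BeautifulSoup as bs

def first_text(node, selector, default=""):
    """Return the stripped text of the first match, or `default` if none."""
    found = node.select_one(selector)  # None when nothing matches
    return found.text.strip() if found is not None else default

def first_attr(node, selector, attr, default=""):
    """Return an attribute of the first match, or `default` if none."""
    found = node.select_one(selector)
    return found.get(attr, default) if found is not None else default

# hand-written card that has a title and a link but no price
card = bs(
    '<div class="v2-listing-card">'
    '<h3 class="v2-listing-card__title"> Wool sweater </h3>'
    '<a class="listing-link" href="https://example.com/listing/1">open</a>'
    "</div>",
    "html.parser",
)

row = {
    "title": first_text(card, "h3.v2-listing-card__title"),
    "price": first_text(card, "span.currency-value"),
    "link": first_attr(card, ".listing-link", "href"),
}
print(row)
```

The missing price becomes an empty string, in the same way the loop already handles a missing shop location.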

In [54]:
df = pd.DataFrame(data)

In [57]:
# mark each title with True if it contains the word 'pattern' and False if it does not
df["pattern"] = df["title"].str.lower().str.contains("pattern")

In [72]:
# keep only the rows whose title does not contain the word 'pattern'
sweaters = df[~df["pattern"]]
sweaters.sample(1)

| | title | price | link | shop_url | location | pattern |
|---|---|---|---|---|---|---|
| 132 | Dog teddy bear fleece poloneck jumpers. For i... | 5.33 | https://www.etsy.com/listing/903151000/dog-ted... | https://www.etsy.com/shop/ThePoshPawsCompany?r... | Birmingham, United Kingdom | False |


In [107]:
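# get the country of the shop by splitting the location on commas and choosing the last element
# use .strip() to remove the surrounding whitespace
# pandas may print a SettingWithCopyWarning because sweaters is a filtered slice of df; the column is still added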
sweaters["country"] = sweaters["location"].str.split(",").str[-1].str.strip()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  sweaters["country"] = sweaters["location"].str.split(",").str[-1].str.strip()
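
The warning appears because `sweaters` was created by filtering `df`, and pandas cannot tell whether the new column is meant for the slice or for the original. One common way to avoid it is to take an explicit copy when filtering. The sketch below uses made-up toy data, and the names `df_demo` and `sweaters_demo` are illustrative.

```python
import pandas as pd

# toy data, made up for illustration
df_demo = pd.DataFrame({
    "title": ["Wool sweater", "Crochet pattern"],
    "location": ["Riga, Latvia", "Lviv, Ukraine"],
    "pattern": [False, True],
})

# ~ negates the boolean column; .copy() makes an independent DataFrame
sweaters_demo = df_demo[~df_demo["pattern"]].copy()

# adding a column to the copy does not touch df_demo
sweaters_demo["country"] = sweaters_demo["location"].str.split(",").str[-1].str.strip()
print(sweaters_demo)
```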


In [108]:
# the price column was scraped as text, so pandas does not treat it as numeric. Let's convert it
# ignore the SettingWithCopyWarning if you get it
sweaters["price"] = pd.to_numeric(sweaters["price"])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  sweaters["price"] = pd.to_numeric(sweaters["price"])
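
`pd.to_numeric` raises an error if any value cannot be parsed, for example a price with a thousands separator. A hedged sketch of a more tolerant conversion follows. The sample values are invented, and whether Etsy prices ever take this form is an assumption.

```python
import pandas as pd

# invented sample values: a plain price, one with a thousands separator, one unparseable
raw = pd.Series(["1.93", "1,250.00", "n/a"])

# drop the commas, then turn anything that still fails into NaN
clean = pd.to_numeric(raw.str.replace(",", "", regex=False), errors="coerce")
print(clean)
```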


In [104]:
# the 5 most expensive sweaters
sweaters.nlargest(5, "price")

| | title | price | link | shop_url | location | pattern | country |
|---|---|---|---|---|---|---|---|
| 79 | Sweater for large dogs / Warm winter dog coat ... | 120.05 | https://www.etsy.com/listing/1039505921/sweate... | https://www.etsy.com/shop/MioMyDog?ref=l2-abou... | Riga, Latvia | False | Latvia |
| 141 | Spaniel Waterproof Dog Raincoat Plain Color - ... | 103.0 | https://www.etsy.com/listing/783710021/spaniel... | https://www.etsy.com/shop/BarkAndGo?ref=l2-abo... | Lviv, Ukraine | False | Ukraine |
| 67 | Warm jacket for a big dog; Dog clothes; natur... | 92.76 | https://www.etsy.com/listing/909390108/warm-ja... | https://www.etsy.com/shop/LigEdHandmade?ref=l2... | Liepāja, Latvia | False | Latvia |
| 83 | Black natural woll ;knitted dog sweater; Warm ... | 92.76 | https://www.etsy.com/listing/1190804184/black-... | https://www.etsy.com/shop/LigEdHandmade?ref=l2... | Liepāja, Latvia | False | Latvia |
| 87 | dog sweater/ autum sweater/ Warm jacket for a ... | 92.76 | https://www.etsy.com/listing/1148924065/dog-sw... | https://www.etsy.com/shop/LigEdHandmade?ref=l2... | Liepāja, Latvia | False | Latvia |


In [109]:
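# the 10 countries with the highest average sweater price
# numeric_only=True skips text columns such as title and link, which newer pandas versions refuse to average
sweaters.groupby("country").mean(numeric_only=True).nlargest(10, "price")#.index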

| country | price | pattern |
|---|---|---|
| Hungary | 84.03 | 0.0 |
| Latvia | 82.15 | 0.0 |
| Lithuania | 57.84 | 0.0 |
| Iceland | 54.57 | 0.0 |
| Denmark | 54.526 | 0.0 |
| Sweden | 49.04 | 0.0 |
| Turkey | 46.466667 | 0.0 |
| Ukraine | 44.833333 | 0.0 |
| France | 43.65 | 0.0 |
| Italy | 43.396667 | 0.0 |
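
An average says little when a country has only one or two listings. As a sketch, named aggregation can report the number of items next to the mean, so small groups are easy to spot. The data below is made up for illustration, and the column names `mean_price` and `n_items` are assumptions.

```python
import pandas as pd

# toy data, made up for illustration
demo = pd.DataFrame({
    "country": ["Latvia", "Latvia", "Hungary", "Sweden", "Sweden"],
    "price": [90.0, 70.0, 84.0, 40.0, 60.0],
})

# one row per country with the average price and the number of listings
summary = (
    demo.groupby("country")
    .agg(mean_price=("price", "mean"), n_items=("price", "count"))
    .sort_values("mean_price", ascending=False)
)
print(summary)
```

In this toy data Hungary ranks first, but on the strength of a single listing.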
