# Parsing APIs Example

## Intro

Now we will look at real data. When you collect data from the web, you will often come across pages that load their content from an API.

For example, [zalando.fr](https://www.zalando.fr/accueil-homme/) is an API-based web page.

In this guided lab you will learn how to find the API links behind web pages and extract the data from them. Read through this doc, execute the cells in order, and make sure you understand the explanations.

*Note: This guided lab uses Google Chrome. Other browsers such as Safari and Firefox have similar developer tools, but they work differently. To save time while following this lab, we strongly recommend installing and using Google Chrome.*

## Obtaining the link

Zalando is an e-store where you can buy discounted clothes and accessories. The site is split into several sections. The general process is shown first, using the [Children section](https://www.zalando.fr/accueil-enfant/) as an example.

Here we will parse data about promotions only, so the final output will be a DataFrame with all the discounted goods.

[![Image from Gyazo](https://i.gyazo.com/fa4874d8e81c7570273bbfb853d66308.png)](https://gyazo.com/fa4874d8e81c7570273bbfb853d66308)


Go to the Promos page. Right-click anywhere on the page to open the context menu and select Inspect.

<img src='https://i.gyazo.com/bccbd11d69c9040dc98758d443e32052.png' width="400">


The developer tools panel will open on the right side or at the bottom of the window. There, click on Network:


[![Image from Gyazo](https://i.gyazo.com/f7e0db81cbfee67694183d1a7640bf81.png)](https://gyazo.com/f7e0db81cbfee67694183d1a7640bf81)

The developer panel then changes to show the files requested by the page. To see only the useful files, apply the following settings:
1. Enable Preserve log.
2. Select the XHR filter.

[![Image from Gyazo](https://i.gyazo.com/9a899d4441d9d93e795f79747f1e47d5.png)](https://gyazo.com/9a899d4441d9d93e795f79747f1e47d5)

To trigger some requests, scroll down and go to the second page.

[![Image from Gyazo](https://i.gyazo.com/0956eb3d5125075a236c9a439c7749c7.png)](https://gyazo.com/0956eb3d5125075a236c9a439c7749c7)

In the Network panel you can now see the files being loaded. All the data on the web page comes from a JSON file, which is one of the files listed. It is important to understand which file contains which kind of information.

<a href="https://gyazo.com/cf97a655869f0b22df0ada1cb2a41c3c"><img src="https://i.gyazo.com/cf97a655869f0b22df0ada1cb2a41c3c.png" alt="Image from Gyazo" width="724.8"/></a>

Once you have found a file that seems to contain the data you need, test it. Here we need the article... file:

<a href="https://gyazo.com/78b35bf492994b3f35c0564a21da202a"><img src="https://i.gyazo.com/78b35bf492994b3f35c0564a21da202a.png" alt="Image from Gyazo" width="727.2"/></a>

When we open the link in Chrome's incognito mode, we get the proper JSON file:


<a href="https://gyazo.com/b60453fa98454fa29771c731a5174443"><img src="https://i.gyazo.com/b60453fa98454fa29771c731a5174443.png" alt="Image from Gyazo" width="1530.4"/></a>

To move through the objects in the JSON file (a kind of pagination), change the offset, that is, the number of the first element on the page. If you look at the link, its structure is easy to understand.
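As a minimal sketch of this pagination scheme, the snippet below rebuilds the catalog URL used later in this lab for several pages by changing only the `offset` query parameter. The helper name `page_url` is hypothetical, the page size of 84 is taken from the `limit` value in the URL, and it is an assumption that the endpoint counts offsets from 0.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

BASE_URL = ("https://www.zalando.fr/api/catalog/articles"
            "?categories=promo-enfant&limit=84&offset=84&sort=sale")

def page_url(url, page, page_size=84):
    """Return `url` with its offset pointing at the first item of `page` (0-based)."""
    parts = urlsplit(url)
    query = dict(parse_qsl(parts.query))      # keeps the original parameter order
    query["limit"] = str(page_size)
    query["offset"] = str(page * page_size)   # assumption: offsets start at 0
    return urlunsplit(parts._replace(query=urlencode(query)))

for page in range(3):
    print(page_url(BASE_URL, page))
```

Only `offset` differs between the printed URLs, which is what makes this kind of API easy to iterate over.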

## Reading the data

Now the fun begins! Once we know how to obtain the data, collecting the whole database from the web page is straightforward.
In this lab you will build your own database of Zalando products. You choose which goods you want to track, and you can define as many filters as you want. Just make sure that the data matches the filters.




In [1]:
import json
import requests
import pandas as pd
# json_normalize is called below as pd.json_normalize; the pandas.io.json import path is deprecated


In [2]:
# Paste the URL you obtained for your data
# This URL was captured on the second page (offset=84); offset=0 should return the first page
urlzal='https://www.zalando.fr/api/catalog/articles?categories=promo-enfant&limit=84&offset=84&sort=sale'



#### Collect the first 84 objects of the data (1st page)

Your output should be a Pandas DataFrame of goods. Each row should contain only text or numbers, with *family_articles, flags, media* and *sizes* remaining as lists (they are exceptions). Hint: use the headers parameter to get the data!
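As a hedged illustration of the target shape, the sketch below flattens two made-up article records with `pd.json_normalize`. The records and the nested `price` fields are invented for the example and are not taken from the real Zalando response. Nested dictionaries become dotted columns, while list-valued fields such as *sizes* and *flags* stay as lists.

```python
import pandas as pd

# Hypothetical records: the field names mimic the task description,
# not the real Zalando response.
articles = [
    {"sku": "A1", "name": "T-shirt",
     "price": {"original": 20.0, "promotional": 15.0},
     "sizes": ["S", "M"], "flags": []},
    {"sku": "B2", "name": "Sneakers",
     "price": {"original": 60.0, "promotional": 45.0},
     "sizes": ["30", "31"], "flags": [{"key": "discount"}]},
]

# Nested dicts are flattened into 'price.original', 'price.promotional';
# lists are left untouched. The sku column becomes the index.
goods = pd.json_normalize(articles).set_index("sku")
print(goods.columns.tolist())
print(goods.loc["A1", "sizes"])
```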

In [3]:
headers_safari = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0.3 Safari/605.1.15',
    'Cookie': '_abck=D3A410542156C6446A65CE77A24C1962~-1~YAAQxvpxaKcGXWJ4AQAAlwrSYwV1UNJw/Lo5j9fZLikqmrMF0wSHif0NSyLELp/5uGpC1I8bZ4j6NECHzSIkazeyqYomY9in+u8pXWcMx9l8RftsuSuYjxPNiQNaui3Ny5ypA5rnKUFrmrRj/rlzKn9/ymy9hyKhsdwTOf4K89FFOdQan9FXDSYCgPQLJJd84Gva79fIiz/39ZgRFFRedWYrwEwZdq1VvE5jppoKPJ4HDVMvbuL7yk7BBW+QqZ9Id5JdHJfmXaW04uaC2zZf+OVCzx+cPnlz8ZaI1I5YO0XkrtujEn+Mw1panVu5Z7bJNFS/18OteWq2eh4OTsT78djxA8dS8V0PRx+/C5ZyQacV96sy6YE6099CG6nQvy2DJRhUONxAZt47BY826ACJ8Z5N+RJ2+apgz6gQE/t0tNMVhvVckk7UMJEUsbsAo+EnESa14GHV3NnVuQEB5tokuXfOZ06/JIAM~-1~-1~-1; Zalando-Client-Id=21c46254-40fc-44be-8f7e-b5740eeae773; _gat_zalga=1; bm_sv=FCE1A3F3B8D168898FAE490B0B094022~k9qt8e92MysEYyG7SZcdNF79Ynt/wNCbmOZ52ojVgRLEnNbWc/6q7xqs289Q/jo/Gk6IrKICLz07LI3GmqoFDRjfAJcslncOhiWXOKY0/qD6Y8Odr4RV6N2+S7JStdBnzmtyQh8c7U5NtCYhUSAAtGJN9ILdgnburXiNEBC7LuA=; _fbp=fb.1.1616582164479.1495405834; _ga=GA1.2.1536409037.1616582164; _gid=GA1.2.839128922.1616582164; ncx=k; _gcl_au=1.1.742902882.1616582164; sqt_cap=1616582151909; ak_bmsc=766A961EE4C14F7277332106EE3F0DCA~000000000000000000000000000000~YAAQ3fpxaIzjCWN4AQAAOdjNYwvvSwfDgE1+5V2juJ7/mZkg6Bd7wihnfCLEorkfGgxjGN5SxGt065fQEk5RqTuxcsjyLJher9bqxmzSdNSLslqekKF8O5T2BfZgihTBwSoeKWYfFwL3xLrvRV9QfJZpjSoP2fzn6qozrWhVvpNOnKnv3nf9d5vKnVLZvaUIJAp8yTaNxNKEds0r/aWFUM6bZhmVQ5RLBiuLzCXAT7jxbDhwyCS+uR9gtjSByzaLvYZwTafqXkprQNuUaY2xhtMLjsmfL78SBSnphQB9/9UlNiR7Vos7TGck4O7urhzoSic7EugdPokaEqQDoMN5VU2ajCj4rEKPPpihVlcBLFnQzBVueJdB8XWy7+Xc9NHjfoDcGBdMoWZlyAvl1STrhPDki8A3sTgCHMTG; fvgs_ml=mosaic; mpulseinject=false; bm_mi=B6F83C97ED5197A331B2167F3E9956B6~MaZrsGu9c98o83p/vhe/baeVDE7352agv4bKGXzvUTWW7Ny3yCvJew3JZ4R9h7gj7AwM21xupqQqw9PsWlcwAvCLIo+dciH4Esl4jWnYP/ED1bC8P3a3IItU83J22ek5HfEKnY/7HjfrvxFfrUArSneLZHGosFzNVxK32Pqg2VuOdfO9+ZbjmB2uYRRd3tpWUqBpVbQMzqhTiqXIPjUswtKUx5gWlW98nMcXDV0+edPZnuVPIipnlIFlf3mgjxwZIACom3CFipE7bQ2MSisaqQ==; 
bm_sz=005A1A795BE0753193A0DB3DC9FEADA3~YAAQRn8WAo24dGB4AQAAkpLNYwtvecQaL6naJJsQc+TajVbV1gwfbqK4vuGFLpX027ODDw41e+fwgufE8yO1otUMoU1Sb+5+yv3yDfW9WC9o5ILEVwhhiK7Hn0knSN3oPCk/Tj2GbIUNaajH+k1AwuSzZQemk0ghBEznM23uFftp77GqkW1LATfuNMWri8s78EjuyaFuMeF2ITCsHLmI1hdkpKbvkDDhzfhn2qdPifLn6GwWauJagufpwsOVdXWgR9qposnSWiSa8d8QitBsbQs1lm/v1wYXiiuo98cgdaw=; frsx=AAAAAIBbvAVqZJCEK0yUBKDVtRtYUfe3udRw3cWp9ervTdeUs0w7yWGO6vPp7jXpuOryCaKex_X_6F8tcQzWXIDHDTLiYOkygwBr6xoXTNY_NwdokXNH6VNW82y6zlNp0l8BlCSixYKqmdwXjojkps4='
}

In [4]:
respzal = requests.get(urlzal, headers = headers_safari)

In [5]:
respzal.status_code

200

In [6]:
respzal.json()

JSONDecodeError: Expecting value: line 1 column 1 (char 0)

In [7]:
respzal.text

'<!DOCTYPE html>\n<html>\n  <head>\n    <link\n      rel="shortcut icon"\n      href="

The site returns an error page even though the status code is 200. We will try another website that is connected to an API.
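A 200 status code only says that the server answered; it does not guarantee a JSON body. A minimal sketch of a defensive check is to look at the Content-Type header first and to guard the parse. The helper name `parse_api_body` is hypothetical; with requests, its two arguments would come from `resp.headers.get('Content-Type', '')` and `resp.text`.

```python
import json

def parse_api_body(content_type, body):
    """Return the decoded JSON payload, or None when the body is not JSON."""
    if "json" not in content_type.lower():
        return None                      # e.g. an HTML error page
    try:
        return json.loads(body)
    except json.JSONDecodeError:
        return None                      # header claimed JSON, body did not parse

print(parse_api_body("text/html; charset=utf-8", "<!DOCTYPE html>\n<html>"))  # None
print(parse_api_body("application/json", '{"data": []}'))                     # {'data': []}
```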

#### Test with Stubbe Chocolates website

In [8]:
url='https://cdn4.editmysite.com/app/store/api/v13/editor/users/135718637/sites/228142249342516758/products?page=1&per_page=60&sort_by=category_order&sort_order=asc&categories[]=11eb584a0b77c3e79467ac1f6bbbcc9c&include=images,media_files'


In [9]:
# This endpoint responds without custom headers
resp = requests.get(url)


In [10]:
resp.status_code

200

In [11]:
resp.text

'{"data":[{"id":"11eb5840016c06b19467ac1f6bbbcc9c","owner_id":"135718637","site_id":"228142249342516758","site_product_id":"2","visibility":"visible","visibility_tl":"Visible","visibility_locked":false,"name":"Verspielt","short_description":"<p>Verspielt, Daniel Stubbe\'s signature milk chocolate, translates from German as playful. This 40% cocoa milk chocolate is a rich, deep, memorable chocolate that playfully elevates the traditional milk chocolate taste with a higher cacao content.<\\/p><p> <\\/p><p><em>Availability same day pick-up, or next day delivery, if ordered before 3pm<\\/em><\\/p>","variation_type":"3","product_type":"physical","product_type_details":[],"taxable":true,"required_feature_to_sell":null,"min_prep_time":120,"site_shipping_box_id":3,"site_link":"product\\/verspielt\\/2","permalink":null,"seo_page_description":null,"seo_page_title":null,"avg_rating":null,"avg_rating_all":null,"inventory":{"total":33,"lowest":33,"enabled":true,"marked_sold_out_at_all_existing_loca

In [12]:
choc = resp.json()

In [13]:
choc['data'][9]['owner_id']

'135718637'

In [14]:
choclist = choc['data']
choclist[0]['id']

'11eb5840016c06b19467ac1f6bbbcc9c'

In [15]:
df = pd.DataFrame()

In [16]:
# Copy the fields of interest from each product into the DataFrame, one row per product
for a, product in enumerate(choclist):
    df.loc[a, 'name'] = product['name']
    df.loc[a, 'price'] = product['price']['high']
    df.loc[a, 'description'] = product['short_description']

In [17]:
# Your code

df

| | name | price | description |
|---|---|---|---|
| 0 | Verspielt | 9.0 | `<p>Verspielt, Daniel Stubbe's signature milk c...` |
| 1 | Verliebt | 9.0 | `<p>Verliebt, Daniel Stubbe's signature dark ch...` |
| 2 | Verliebt Gianduja bar | 10.0 | `<p>This gourmet bar is made with the Stubbe si...` |
| 3 | Garam Masala Pistachio | 10.0 | `<p>Single origin Ghana milk chocolate combined...` |
| 4 | Mango Passion fruit Lime Chili bar | 10.0 | `<p>An exciting combination of creamy Zephyr wh...` |
| 5 | Mint bar | 9.0 | `<p>The Stubbe gourmet mint flavoured dark choc...` |
| 6 | Dulcey White Chocolate Caramel | 9.0 | `<p>Smooth, creamy chocolate with a velvety tex...` |
| 7 | Cinnamon Milk Chocolate | 9.0 | `<p>Stubbe Chocolates cinnamon flavoured milk c...` |
| 8 | Semisweet Bars | 7.0 | `<p>Stubbe semisweet dark chocolate bars have a...` |
| 9 | Semisweet Sea Salt Bars | 7.0 | `<p>Stubbe semisweet dark chocolate bars, 54% c...` |


#### Collect all the objects from the selected filters. The total number of pages can be found in the same JSON. Use the *sku* column as index.

Your output should be a Pandas DataFrame of goods. Each row should contain only text or numbers, with family_articles, flags, media and sizes remaining as lists (they are exceptions).

The same exercise again, this time extracting more information and passing the query through params.
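To make the link between a full URL and the params argument concrete, here is a small sketch that splits the product URL tested earlier into an endpoint and a dictionary of query parameters, using only the standard library. The variable names are illustrative.

```python
from urllib.parse import urlsplit, parse_qsl

full_url = ("https://cdn4.editmysite.com/app/store/api/v13/editor/users/135718637"
            "/sites/228142249342516758/products?page=1&per_page=60"
            "&sort_by=category_order&sort_order=asc"
            "&categories[]=11eb584a0b77c3e79467ac1f6bbbcc9c&include=images,media_files")

parts = urlsplit(full_url)
endpoint = f"{parts.scheme}://{parts.netloc}{parts.path}"   # URL without the query string
params = dict(parse_qsl(parts.query))                       # one entry per query parameter

print(endpoint)
print(params)
```

The resulting pair could then be passed as `requests.get(endpoint, params=params)`. Note that requests percent-encodes characters such as `[`, `]` and `,` when it rebuilds the query string, and it is an assumption that the server treats the encoded form the same way.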

In [18]:
# First, test the endpoint by requesting page 2 through the query parameters

url2='https://cdn4.editmysite.com/app/store/api/v13/editor/users/135718637/sites/228142249342516758/products'

In [19]:
params2 = {'page':2}

In [20]:
resp2 = requests.get(url2, params = params2)

In [21]:
resp2.status_code

200

In [22]:
resp2.json()

{'data': [{'id': '11eb59bf8ff39ceb9467ac1f6bbbcc9c',
   'owner_id': '135718637',
   'site_id': '228142249342516758',
   'site_product_id': '27',
   'visibility': 'visible',
   'visibility_tl': 'Visible',
   'visibility_locked': False,
   'name': 'Mozart Torte',
   'short_description': '<p>Flourless almond cake with a layer of raspberry and a layer of dark chocolate ganache, covered in dark chocolate.</p><p><br /></p><p><em>Availability: two days after your order is placed</em></p>',
   'variation_type': '3',
   'product_type': 'physical',
   'product_type_details': [],
   'taxable': False,
   'required_feature_to_sell': None,
   'min_prep_time': 1440,
   'site_shipping_box_id': None,
   'site_link': 'product/mozart-torte/27',
   'permalink': None,
   'seo_page_description': 'Stubbe Chocolates Toronto flourless almond cake with a layer of raspberry and a layer of dark chocolate ganache, covered in dark chocolate.',
   'seo_page_title': 'Stubbe Chocolates Toronto Mozart Torte',
   'avg_r

In [23]:
pd.json_normalize(resp2.json()['data'])[['name', 'short_description', 'price.low']]

| | name | short_description | price.low |
|---|---|---|---|
| 0 | Mozart Torte | `<p>Flourless almond cake with a layer of raspb...` | 42.0 |
| 1 | The Davenport | `<p>Gluten-free rice flour chocolate cake with ...` | 42.0 |
| 2 | Anna Torte | `<p>Chocolate sponge cake with layers of orange...` | 40.0 |
| 3 | Afternoon Relaxer for Two | `<p>This gift set is perfect for an afternoon t...` | 34.0 |
| 4 | Cocoa Powder | `100% cocoa powder in a 1kg bag. This is a vega...` | 32.0 |
| 5 | Party Mix | `<p>Perfect for sharing, a cylinder with layers...` | 29.75 |
| 6 | Gateau Voyage Hazelnut | `<p>Our Hazelnut "Travel cake" is dense, moist ...` | 28.5 |
| 7 | Maltitol Dark Chocolate Almond Bark | `<p>Roasted Almonds in dark Maltitol chocolate ...` | 24.0 |
| 8 | Hot Cocoa Bomb Set of four | `<p>The Stubbe Cocoa Bomb Set is a fun way to e...` | 20.5 |
| 9 | Almond Bark | `<p>Sheets of milk or dark chocolate with roast...` | 20.5 |


In [24]:
table = pd.DataFrame()

In [25]:
# We checked that there are 11 pages. Iterate over them, collect the rows of each page,
# and combine everything into a single DataFrame
# (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)

pages = []
for a in range(1, 12):
    resp2 = requests.get(url2, params={'page': a})
    new = pd.json_normalize(resp2.json()['data'])[['name', 'short_description', 'price.low', 'price.low_formatted']]
    pages.append(new)
table = pd.concat(pages)
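The loop above relies on a manually verified page count. As a hedged alternative, the sketch below reads the count from the first response and then walks the remaining pages. The helper names `fetch_all` and `fake_fetch` are hypothetical, and `meta['pagination']['total_pages']` is an assumption about where the JSON stores the page count; inspect the real response before relying on it.

```python
def fetch_all(fetch_page, total_pages_of):
    """Collect the 'data' records of every page.

    fetch_page(page) returns the decoded JSON of one page;
    total_pages_of(payload) extracts the page count from that JSON.
    """
    first = fetch_page(1)
    records = list(first["data"])
    for page in range(2, total_pages_of(first) + 1):
        records.extend(fetch_page(page)["data"])
    return records

# Offline stand-in for the API: 3 pages with 2 items each.
# The 'meta' / 'pagination' / 'total_pages' layout is an assumption.
def fake_fetch(page):
    return {"data": [{"name": f"item-{page}-{i}"} for i in range(2)],
            "meta": {"pagination": {"total_pages": 3}}}

records = fetch_all(fake_fetch, lambda p: p["meta"]["pagination"]["total_pages"])
print(len(records))  # 6
```

With requests, `fetch_page` could be something like `lambda page: requests.get(url2, params={'page': page}).json()`, and the collected records could be passed to `pd.json_normalize` in one call.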


In [26]:
# Reset the index

table = table.reset_index(drop=True)

In [27]:
table

| | name | short_description | price.low | price.low_formatted |
|---|---|---|---|---|
| 0 | Verliebt | `<p>Verliebt, Daniel Stubbe's signature dark ch...` | 9.00 | CAD$9.00 |
| 1 | Verspielt | `<p>Verspielt, Daniel Stubbe's signature milk c...` | 9.00 | CAD$9.00 |
| 2 | Truffle Torte | `<p>Chocolate cake with layers of dark chocolat...` | 40.00 | CAD$40.00 |
| 3 | Sacher Torte | `<p>A Viennese chocolate torte with either the ...` | 40.00 | CAD$40.00 |
| 4 | Heart Chocolates | `<p>These little bite-sized heart shaped chocol...` | 11.75 | CAD$11.75 |
| ... | ... | ... | ... | ... |
| 152 | Wild Bunnies card | `<p>Local artist, Susan Rosen, creates wonderfu...` | 6.00 | CAD$6.00 |
| 153 | Tulip Garden card | `<p>Local artist, Susan Rosen, creates wonderfu...` | 6.00 | CAD$6.00 |
| 154 | Strawberry Love Bar | `<p>Strawberry puree and cocoa butter. This is ...` | 8.50 | CAD$8.50 |
| 155 | Yuzu Love Bar | `<p>Yuzu puree with cocoa butter. This is a veg...` | 8.50 | CAD$8.50 |


In [34]:
table.sort_values('price.low', ascending=False)

| | name | short_description | price.low | price.low_formatted |
|---|---|---|---|---|
| 101 | Happy as a Duck | `<p>A large hollow duck (approximately 35cm tal...` | 195.00 | CAD$195.00 |
| 5 | Chocolate Adventure Box | `<p>This gift box includes a large variety of d...` | 160.70 | CAD$160.70 |
| 100 | Giant Easter Bunny | `<p>This 64cm hollow chocolate bunny is a great...` | 145.00 | CAD$145.00 |
| 150 | Giant Nest Egg | `<p>This 35cm hollow chocolate egg makes a grea...` | 125.50 | CAD$125.50 |
| 139 | Stubbe Classic Forty-Eight Piece Chocolate Box | `<p>The forty-eight piece Stubbe Blue box is fi...` | 114.50 | CAD$114.50 |
| ... | ... | ... | ... | ... |
| 65 | Chocolate Dipped Orange Slice | `<p>Candied orange slice half dipped in dark ch...` | 3.75 | CAD$3.75 |
| 113 | Easter Bunny lolly | `<p>Festive Easter bunny shaped lollies.</p><p>...` | 3.75 | CAD$3.75 |
| 91 | Heart lolly | `<p><em>Availability: Same day pick-up or next ...` | 3.50 | CAD$3.50 |
| 105 | Star lolly | `<p>Chocolate lolly in the shape of a star with...` | 3.00 | CAD$3.00 |


#### Display the trending brand in the DataFrame

In [None]:
# n/a

#### Display the brand with maximal total discount (sum of discounts on all goods)

In [None]:
# n/a

#### Display the brands without any discount

In [None]:
# n/a
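
Since the Zalando data could not be loaded, the brand exercises above remain open. Purely as an illustration, the sketch below shows how the total discount per brand and the brands without any discount could be computed with `groupby`. The rows and the column names `brand`, `original` and `promotional` are made up for the example and do not come from the Zalando response.

```python
import pandas as pd

# Made-up rows standing in for the goods DataFrame.
goods = pd.DataFrame({
    "brand": ["Alpha", "Alpha", "Beta", "Gamma"],
    "original": [20.0, 50.0, 60.0, 30.0],
    "promotional": [15.0, 40.0, 40.0, 30.0],
})

# Discount per item, then summed over all goods of each brand.
goods["discount"] = goods["original"] - goods["promotional"]
total_discount = goods.groupby("brand")["discount"].sum()

top_brand = total_discount.idxmax()                               # brand with the largest total
no_discount = total_discount[total_discount == 0].index.tolist()  # brands with zero discount

print(top_brand)     # Beta
print(no_discount)   # ['Gamma']
```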