### Fetching data 

To generate new silk squares like those designed by Hermès, we need data, preferably photos of the designs created by the fashion house. A little research on the web turns up several sources: blogs run by brand enthusiasts that list images of squares from various collections, and the Hermès website itself, which offers very good quality images of the latest collections. Our first task is therefore to extract from these sources the raw material that we will feed to our algorithm.

The practice of extracting data from websites is called data scraping or web scraping. Its legal framework is something of a gray area. You must always take care not to violate the terms of use of the website or application from which you retrieve data. In addition, scraping often involves sending many requests, which may affect the proper functioning of the site or application in question. You therefore need to know what you are doing in order to stay within the law.


### Summary
* <a href='#Retrieve_images_from_Hermes'>1.1 Retrieve images from the Hermès site</a>
* <a href='#Retrieve_requests_with_Mitmproxy'>1.2 Retrieve requests to the Hermès website with Mitmproxy</a>   
* <a href='#Building_image_urls'>1.3 Building image urls: understanding the pattern</a>
* <a href='#Python_code'>1.4 Python code</a> 

<a id='Retrieve_images_from_Hermes'></a>

#### Retrieve images from the Hermès site
<p>To retrieve the images of the latest collections from the Hermès site, we need to understand how the site works. The best way is to go to the page dedicated to the squares and browse around to see what is going on.</p>
<p>We see that the content of the site loads dynamically: as we scroll down, the site loads new products in batches of 40 items.</p>


![SegmentLocal](Images/hermes_site.gif "segment")

<p>Those already familiar with web scraping know that this is a bit of a hassle. We cannot simply make a single request to the squares page and parse the source code to retrieve the links to all the images at once.</p>
<p>One way around this would be to use a program that controls our browser, scrolling and loading the items automatically. Unfortunately (or not), the site is quite good at detecting this kind of program and will not let us do so.</p>
<p>Therefore, we are going to look at what happens behind the scenes. We want to understand the requests sent to the server when, as a user of the site, we scroll to load more items. To do this we will use a tool called mitmproxy. This is a piece of software positioned between our computer and the web server we talk to when we visit the site. It records the requests that we send to the site, as well as the responses returned by the server hosting it.</p>

This article is not intended to explain in detail how mitmproxy works or how to install it. If you wish to learn more about this tool, I invite you to consult [their documentation](https://docs.mitmproxy.org/stable/). For other fun examples of its use, you can check out these resources:
- [reverse engineering a private API from the Couchsurfing website with mitmproxy](https://www.toptal.com/back-end/reverse-engineering-the-private-api-hacking-your-couch).
- [reverse engineering your bookstore app with mitmproxy](https://www.youtube.com/watch?v=LbPKgknr8m8).




<a id='Retrieve_requests_with_Mitmproxy'></a>

#### Retrieve requests to the Hermès website with Mitmproxy

<p>Once mitmproxy is configured on our computer, we repeat the experience of browsing the site. This time we see in real time the requests made by our browser, as well as the responses from the server. There are a lot of requests, and we have to inspect them one by one to find what we are looking for: how the request that tells the site to load more products on the page is formulated.</p>

![SegmentLocal](Images/flow.gif "segment")
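
Inspecting every request by hand gets tedious, so a small mitmproxy script can narrow things down. The following is only a sketch: the file name is hypothetical, and the watched host is an assumption taken from the endpoint that appears in the Python code section of this article. It relies on mitmproxy's `response` event hook and on the `pretty_url`, `method` and `status_code` attributes of the flow.

```python
# log_api_calls.py (hypothetical file name); run with: mitmdump -s log_api_calls.py
from urllib.parse import urlsplit

# Assumption: the host of the endpoint called in the Python code section.
WATCHED_HOST = "bck.hermes.com"


def is_watched(url: str, host: str = WATCHED_HOST) -> bool:
    """Return True when the URL targets the host we want to study."""
    return urlsplit(url).hostname == host


def response(flow) -> None:
    """mitmproxy event hook, called once a server response has been received."""
    url = flow.request.pretty_url
    if is_watched(url):
        # Print only the calls to the watched host; everything else is ignored.
        print(flow.request.method, url, "->", flow.response.status_code)
```

Filtering on the host rather than on a path keeps the script useful while we still do not know which endpoint serves the products.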

<p>The request we are looking for shows up when we try to load more items than those displayed on the initial page. It is sent to a kind of private API which communicates with a database.</p>
<img src="Images/mitm_flow.png" alt="Drawing" style="width: 600px;"/>


<p>Let's look more closely at the query parameters. pagesize is the number of items to display. offset is the position in the list from which the items to display are counted. sort indicates the order in which the items are displayed (here, by relevance).</p>
<img src="Images/mitm_request.png" alt="Drawing" style="width: 600px;"/>
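
To make the role of these parameters concrete, here is a minimal sketch that builds the query string for a given page. The endpoint is the one used in the Python code section. The `page_url` helper and the rule `offset = page_index * pagesize` are assumptions made for illustration, and the real request also carries `locale` and `category` parameters that are left out here.

```python
from urllib.parse import urlencode

API_URL = "https://bck.hermes.com/products"  # endpoint used in the Python code section


def page_url(page_index: int, pagesize: int = 40, sort: str = "relevance") -> str:
    """Build the URL of one page of results (hypothetical helper)."""
    query = {
        "sort": sort,
        "offset": page_index * pagesize,  # assumption: pages are contiguous
        "pagesize": pagesize,
    }
    return f"{API_URL}?{urlencode(query)}"


print(page_url(1))
# https://bck.hermes.com/products?sort=relevance&offset=40&pagesize=40
```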

<p>What is even more interesting is the format of the response: a clean JSON document that lists the items along with details such as the url of each product page.</p>
<img src="Images/mitm_response2.png" alt="Drawing" style="width: 600px;"/>

<a id='Building_image_urls'></a>

#### Building image urls: understanding the pattern

<p>Each image displayed on the site is accessible as an object via a url. When the site loads, it calls these urls to display the images. To understand how these urls are constructed, we inspect a few product pages. Right-clicking on an image gives us its link. After observing a few urls, we find the following pattern: base_url+1700...</p>

<p>What's great is that all the information we need to rebuild these urls is present in the json that we retrieved in the previous step.</p>

So to achieve our goal we need to:
- Send the 5 requests to the "http://" url, varying the offset parameter, to retrieve the information for the 200 products (5*40 = 200).
- Once the json files have been saved, process the data they contain to build the image urls for each product.
- Send requests to these image urls and record the responses.
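
The first step of this plan can be sketched as a loop over offsets. This is an illustration under stated assumptions, not the code used later in the article: `page_offsets`, `collect_items` and `fake_fetch` are hypothetical names, the offsets are assumed to start at 0, and the `products` / `items` keys are the ones read in the Python code section. The HTTP call is passed in as a function so that the sketch runs offline with a stand-in.

```python
from typing import Callable, Dict, List

PAGE_SIZE = 40  # items per request, as observed on the site
N_PAGES = 5     # 5 * 40 = 200 products


def page_offsets(n_pages: int = N_PAGES, pagesize: int = PAGE_SIZE) -> List[int]:
    """Offsets of successive pages, assuming the first page starts at 0."""
    return [page * pagesize for page in range(n_pages)]


def collect_items(fetch_page: Callable[[int], Dict], offsets: List[int]) -> List[Dict]:
    """Call fetch_page(offset) for each offset and concatenate the items."""
    items: List[Dict] = []
    for offset in offsets:
        payload = fetch_page(offset)
        items.extend(payload["products"]["items"])
    return items


def fake_fetch(offset: int) -> Dict:
    """Stand-in for the real HTTP request: returns PAGE_SIZE dummy items."""
    return {"products": {"items": [{"sku": f"SKU{offset + i}"} for i in range(PAGE_SIZE)]}}


all_items = collect_items(fake_fetch, page_offsets())
print(len(all_items))  # 200
```

In real use, `fake_fetch` would be replaced by a function that sends the request with the given offset and returns `response.json()`.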

<a id='Python_code'></a>

#### Python code

Let's get our hands dirty with some code. With mitmproxy, we can export the request that interests us as a curl command. We convert this curl command to Python thanks to [this great tool](https://curlconverter.com/#), which gives:

In [15]:
import requests

# Cookies captured with mitmproxy during one browsing session; capture your own
cookies = {
    'datadome': '6dGCGJi7OY-QVuTMNf_Va-ihruAS4fT~-prPt53aTUtSMJ6e2jTRdQI9rOCLxC9775DR1GQRrpLUs~67PZ000nS-WYlYuszR-qNvBG9kGLXYs.yI6CUPOdCrifnldDu',
    'ECOM_SESS': 'fcj2587fp2kfmusvhj08lbfbc5',
    'correlation_id': '1808b1db587ebab640cced310efcb6dfdd899ab3eb0c44ea08e3f11ab65441ee',
    '_gcl_au': '1.1.198033202.1643632040',
    '_ga': 'GA1.2.706534774.1643632042',
    '_gid': 'GA1.2.616364715.1643632042',
    'GeoFilteringBanner': '1',
    '_uetsid': '241e11c0829111eca0f5e908623d4c81',
    '_uetvid': '241e4b10829111ecb2360bf65e216a0f',
    '_ga_Y862HCHCQ7': 'GS1.1.1643632042.1.0.1643632045.0',
    '_cs_c': '1',
    '_cs_id': '116f9e30-3c8e-acbc-ecf4-8ef80963d888.1643632043.1.1643632043.1643632043.1.1677796043154',
    'ABTasty': 'uid=hs12gaez8thx8txq&fst=1643632043261&pst=-1&cst=1643632043261&ns=1&pvt=1&pvis=1&th=',
}

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:96.0) Gecko/20100101 Firefox/96.0',
    'Accept': 'application/json, text/plain, */*',
    'Accept-Language': 'fr,fr-FR;q=0.8,en-US;q=0.5,en;q=0.3',
    'Origin': 'https://www.hermes.com',
    'Connection': 'keep-alive',
    'Referer': 'https://www.hermes.com/',
    'Sec-Fetch-Dest': 'empty',
    'Sec-Fetch-Mode': 'cors',
    'Sec-Fetch-Site': 'same-site',
}

offset = 40  # position of the first item to return; 40 asks for the second batch

params = (
    ('locale', 'fr_fr'),
    ('category', 'WOMENSILKSCARVESETC'),
    ('sort', 'relevance'),
    ('offset', offset),
    ('pagesize', '40'),
)

response = requests.get('https://bck.hermes.com/products', headers=headers, params=params, cookies=cookies) 

The server answers our request with JSON. We parse it and print the first entry in its list of items.

In [29]:
json_response = response.json()
items = json_response['products']['items']
print(items[0])

{'sku': 'H892813S 16', 'title': 'Gavroche 45 Della Cavalleria', 'productCode': 'S102', 'perso_product_type': 'silk', 'assets': [{'url': '//assets.hermes.com/is/image/hermesproduct/892813S%2016_flat_1?a=a&size=3000,3000&extend=300,300,300,300&align=0,0', 'type': 'image', 'source': 'scene7'}, {'url': '//assets.hermes.com/is/image/hermesproduct/892813S%2016_worn_2?a=a&size=3000,3000&extend=0,0,0,0&align=0,0', 'type': 'image', 'source': 'scene7'}], 'moreColors': False, 'size': 'SANS_TAILLE', 'avgColor': 'rose', 'departmentCode': 'S', 'familyCode': 'S10', 'divisionCode': '04', 'price': 190, 'url': '/product/gavroche-45-della-cavalleria-H892813Sv16', 'slug': 'gavroche-45-della-cavalleria', 'hasStock': True, 'hasStockRetail': False, 'hasStockOrHasStockRetail': False, 'stock': {'ecom': True, 'retail': False}, 'personalize': True}
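
Note that each item also carries an `assets` list whose urls are protocol-relative (they start with `//`). As a side sketch, the hypothetical `asset_urls` helper below turns them into absolute urls. It only reads the fields visible in the output above, and whether these urls serve the images we want is an assumption that would need checking.

```python
from typing import Dict, List


def asset_urls(item: Dict, scheme: str = "https") -> List[str]:
    """Return absolute URLs for the image assets listed in a product item."""
    urls = []
    for asset in item.get("assets", []):
        if asset.get("type") != "image":
            continue
        url = asset["url"]
        # Protocol-relative URLs start with "//": prefix them with a scheme.
        if url.startswith("//"):
            url = f"{scheme}:{url}"
        urls.append(url)
    return urls


# Trimmed copy of the item printed above.
sample = {
    "sku": "H892813S 16",
    "assets": [
        {
            "url": "//assets.hermes.com/is/image/hermesproduct/892813S%2016_flat_1"
                   "?a=a&size=3000,3000&extend=300,300,300,300&align=0,0",
            "type": "image",
            "source": "scene7",
        },
    ],
}
print(asset_urls(sample)[0])
```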


In [None]:
urls = []

# Fixed prefix and suffix of the image urls observed on the product pages
base_url = 'https://assets.hermes.com/is/image/hermesproduct/'
suffix = '-flat-1-300-0-1700-1700-q99_b.jpg'

for item in items:
    slug = item['slug']
    sku = item['sku'][1:]  # drop the leading 'H' of the sku
    url = base_url + slug + '--' + sku + suffix
    urls.append(url)

In [30]:
print(urls[0])

https://assets.hermes.com/is/image/hermesproduct/gavroche-45-della-cavalleria--892813S 16-flat-1-300-0-1700-1700-q99_b.jpg
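
The sku contains a space, so the url printed above does too. A space is not a valid character in a url, so it may be worth percent-encoding it explicitly, for instance before logging the url or pasting it into another tool. A minimal sketch with the standard library:

```python
from urllib.parse import quote

raw = ("https://assets.hermes.com/is/image/hermesproduct/"
       "gavroche-45-della-cavalleria--892813S 16-flat-1-300-0-1700-1700-q99_b.jpg")

# Keep ':' and '/' as they are so that only unsafe characters get encoded.
encoded = quote(raw, safe=":/")
print(encoded)
```

The space becomes `%20`, the same encoding that appears in the `assets` urls of the JSON response.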


In [33]:
image_url = urls[0]
response = requests.get(url=image_url, headers=headers, cookies=cookies)
response.raise_for_status()  # stop here if the server refused the request

# Write the raw bytes of the response to disk
with open('image.jpeg', 'wb') as f:
    f.write(response.content)
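
To repeat this for every url while keeping the load on the site low, the download can be wrapped in a loop with a pause between requests. The sketch below is one possible arrangement, not a definitive implementation: `filename_for` and `download_all` are hypothetical helpers, the one-second delay is an arbitrary choice, and the function that fetches the bytes is passed in as an argument.

```python
import time
from pathlib import Path, PurePosixPath
from typing import Callable, Iterable, List
from urllib.parse import unquote, urlsplit


def filename_for(url: str) -> str:
    """Derive a local file name from the last path segment of the URL."""
    name = PurePosixPath(unquote(urlsplit(url).path)).name
    return name.replace(" ", "_")


def download_all(urls: Iterable[str], fetch_bytes: Callable[[str], bytes],
                 out_dir: str = "images", delay_s: float = 1.0) -> List[Path]:
    """Save each URL under out_dir, sleeping delay_s seconds between requests."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    for url in urls:
        target = out / filename_for(url)
        target.write_bytes(fetch_bytes(url))
        saved.append(target)
        time.sleep(delay_s)  # be gentle with the server
    return saved
```

With the objects defined earlier, `fetch_bytes` could be `lambda u: requests.get(u, headers=headers, cookies=cookies).content`.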
