# Introduction
This part of the repository scrapes data about interiors from Dezeen.com.

# Environment setup

## Google Drive mount
I use Google Colaboratory as my default platform, so the environment has to be integrated with Google Drive. You can skip this bit if you're working locally.

1. Mount Google Drive on the runtime to be able to read and write files. This will ask you to log in to your Google Account and provide an authorization code.
2. Create a symbolic link to the working directory.
3. Change to the directory where I cloned my repository.


In [None]:
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)

Mounted at /content/gdrive


In [None]:
!ln -s /content/gdrive/My\ Drive/Colab\ Notebooks/dezeenAI /mydrive
!ls /mydrive

classes.gsheet	dezeen_basic-detection.ipynb  LICENSE
darknet		dezeen_download.ipynb	      OIDv4_ToolKit
data		dezeen_scrape.ipynb	      README.md
dezeenAI	files


In [None]:
%cd /mydrive

/content/gdrive/My Drive/Colab Notebooks/dezeenAI
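
If the same notebook should run both on Colab and locally, the mount step above can be guarded instead of skipped by hand. This is a minimal sketch; the helper name `in_colab` is my own, and it assumes that the Colab runtime has already loaded the `google.colab` package by the time the cell runs.

```python
import sys


def in_colab() -> bool:
    """Return True when the google.colab package is already loaded."""
    return 'google.colab' in sys.modules


if in_colab():
    # Only importable on a Colab runtime.
    from google.colab import drive
    drive.mount('/content/gdrive', force_remount=True)
else:
    print('Not running on Colab; skipping the Drive mount.')
```

On a local machine the cell just prints the message, so the rest of the notebook can run against local paths.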


## Libraries & functions
- `requests` - HTTP handling
- `BeautifulSoup` - HTML parsing & web-scraping
- `urllib.request` - URL opening
- `tqdm` - loop progress bar
- `timeit` - runtime measurement
- `numpy` - linear algebra
- `pandas` - data manipulation & analysis
- `sys` - system-specific parameters & functions

In [None]:
import requests
from bs4 import BeautifulSoup
import urllib.request
from tqdm import tqdm
import timeit
import numpy as np
import pandas as pd
import sys

# Scraping
Let's now retrieve all the information we need to proceed.

## Article info
Let's scrape a list of articles tagged as `Interior` using [Dezeen](http://dezeen.com/)'s category pagination URL convention.

1. Initialise the list containers for article id, title and URL.
2. Iterate through page numbers 1 to 499 (`range(1, 500)`, an upper bound chosen with a margin; the run recorded below stopped after 249 pages) and print a message once a page no longer returns status 200.
3. Create a `BeautifulSoup` object and find all `<article>` elements with a `data-article-id` attribute present (each of these indicates a unique article container).
4. Iterate through the article elements and scrape the id, title and URL of each one.

In [None]:
start = timeit.default_timer() # start the timer

print('\nStarting. This might take a few minutes to complete...\n')

ids = []
titles = []
urls = []

for n in range(1, 500):

  url = 'https://www.dezeen.com/interiors/page/'+str(n)
  response = requests.get(url)
  if response.status_code != 200:
    print('\nCrawled a total of {} pages. Finishing.'.format(n-1))
    break

  soup = BeautifulSoup(response.content, 'html.parser')
  articles = soup.find_all('article', attrs={'data-article-id': True})

  for article in articles:
    ids.append(article['data-article-id'])
    titles.append(article.h3.string)
    urls.append(article.a['href'])

stop = timeit.default_timer() # stop the timer
print('Runtime: {} seconds.'.format(stop-start))


Starting. This might take a few minutes to complete...


Crawled a total of 249 pages. Finishing.
Runtime: 508.599164137 seconds.
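
The stopping rule from step 2 can be tried out without touching the network. Below is a minimal sketch in which the HTTP call is passed in as a function; `crawl_pages`, `fetch` and `fake_fetch` are hypothetical names that are not part of the notebook, and in real use `fetch` would wrap `requests.get`.

```python
from typing import Callable, List, Tuple

BASE_URL = 'https://www.dezeen.com/interiors/page/'


def crawl_pages(fetch: Callable[[str], Tuple[int, str]],
                max_pages: int = 499) -> List[str]:
    """Collect page bodies until `fetch` returns a non-200 status.

    `fetch` is a hypothetical callable that takes a URL and returns
    (status_code, body).
    """
    bodies = []
    for n in range(1, max_pages + 1):
        status, body = fetch(BASE_URL + str(n))
        if status != 200:
            break  # the first unavailable page ends the crawl
        bodies.append(body)
    return bodies


# Offline stand-in for the site: pages 1-3 exist, anything else is a 404.
def fake_fetch(url: str) -> Tuple[int, str]:
    page = int(url.rsplit('/', 1)[1])
    return (200, 'page {}'.format(page)) if page <= 3 else (404, '')


pages = crawl_pages(fake_fetch)
print('Crawled a total of {} pages.'.format(len(pages)))
```

Keeping the loop separate from the request makes it easy to check the page count logic before starting a crawl that takes several minutes.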


## Save to the DataFrame
Let's save the scraped info to a DataFrame.

In [None]:
articles_df = pd.DataFrame(
    data = list(zip(ids, titles, urls)),
    columns = ['id', 'title', 'url'])

articles_df

Unnamed: 0,id,title,url
0,1588230,Daosheng Design creates monochromatic bar with...,https://www.dezeen.com/2020/11/19/the-flow-of-...
1,1588111,Issey Miyake store in Osaka is splashed with w...,https://www.dezeen.com/2020/11/19/issey-miyake...
2,1586860,Nook Pod is a gabled workspace,https://www.dezeen.com/2020/11/18/nook-pod-dez...
3,1587217,Project #13 is an office for Studio Wills + Ar...,https://www.dezeen.com/2020/11/17/project-13-h...
4,1586339,AHEAD Europe 2020 awards winners announced in ...,https://www.dezeen.com/2020/11/16/ahead-europe...
...,...,...,...
4946,70,Marcel Wanders launches Crochet Chair,https://www.dezeen.com/2006/12/10/marcel-wande...
4947,67,WOKmedia show at Design Miami,https://www.dezeen.com/2006/12/10/wokmedia-sho...
4948,54,Zaha Hadid furniture exhibited in New York,https://www.dezeen.com/2006/12/07/zaha-hadid-f...
4949,48,Thomas Heatherwick beach cafe takes shape,https://www.dezeen.com/2006/12/04/thomas-heath...


## Article images
Let's retrieve a list of images in each article.

1. Iterate through the article URLs.
2. Request each URL and print a message if it is not reachable.
3. Create a `BeautifulSoup` object and find all `<figure>` elements with an `id` attribute present.
4. Iterate through the figures, then retrieve and save the image source of each one.

In [None]:
start = timeit.default_timer() # start the timer

print('\nStarting. This might take around two hours to complete...\n')

article_images = []

for url in tqdm(articles_df['url']):
  
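  response = requests.get(url)
  if response.status_code != 200:
    # record an empty list so article_images stays aligned with the DataFrame rows
    print('\nPage not available: {}'.format(url))
    article_images.append([])
    continue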
      
  soup = BeautifulSoup(response.content, 'html.parser')
  figures = soup.find_all('figure', attrs={'id': True})

  imgs = []
  for figure in figures:
    # skip figures without an <img> or without a data-src attribute
    if figure.img is not None and figure.img.get('data-src'):
      imgs.append(figure.img['data-src'])
  
  article_images.append(imgs)
    
stop = timeit.default_timer() # stop the timer
print('Runtime: {} seconds.'.format(stop-start))

  0%|          | 0/4951 [00:00<?, ?it/s]


Starting. This might take around two hours to complete...



100%|██████████| 4951/4951 [1:52:00<00:00,  1.36s/it]

Runtime: 6720.473755998006 seconds.
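
A run of almost two hours is painful to lose to a dropped Colab session. The sketch below shows one way to checkpoint the per-article results to a JSON file so an interrupted run can resume where it stopped. It is not part of the original notebook: `scrape_with_checkpoint`, `scrape_one`, the checkpoint path and the `every` interval are all hypothetical, and `scrape_one` stands for whatever function returns the image list for a single URL.

```python
import json
from pathlib import Path


def load_checkpoint(path):
    """Return the saved {url: images} mapping, or an empty dict."""
    p = Path(path)
    if p.exists():
        return json.loads(p.read_text(encoding='utf-8'))
    return {}


def save_checkpoint(path, done):
    """Write to a temporary file first, then swap it into place."""
    tmp = Path(str(path) + '.tmp')
    tmp.write_text(json.dumps(done), encoding='utf-8')
    tmp.replace(path)


def scrape_with_checkpoint(urls, scrape_one, path, every=100):
    """Scrape each URL once, saving progress after every `every` URLs."""
    done = load_checkpoint(path)
    for i, url in enumerate(urls, start=1):
        if url not in done:  # already scraped in an earlier run
            done[url] = scrape_one(url)
        if i % every == 0:
            save_checkpoint(path, done)
    save_checkpoint(path, done)
    # Return the results in the same order as the input URLs.
    return [done[url] for url in urls]
```

The returned list has one entry per URL in input order, so it could be assigned to a DataFrame column in the same way as `article_images`.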





## Append to the DataFrame

In [None]:
articles_df['images'] = article_images
articles_df

Unnamed: 0,id,title,url,images
0,1588230,Daosheng Design creates monochromatic bar with...,https://www.dezeen.com/2020/11/19/the-flow-of-...,[https://static.dezeen.com/uploads/2020/11/the...
1,1588111,Issey Miyake store in Osaka is splashed with w...,https://www.dezeen.com/2020/11/19/issey-miyake...,[https://static.dezeen.com/uploads/2020/11/iss...
2,1586860,Nook Pod is a gabled workspace,https://www.dezeen.com/2020/11/18/nook-pod-dez...,[https://static.dezeen.com/uploads/2020/11/noo...
3,1587217,Project #13 is an office for Studio Wills + Ar...,https://www.dezeen.com/2020/11/17/project-13-h...,[https://static.dezeen.com/uploads/2020/11/pro...
4,1586339,AHEAD Europe 2020 awards winners announced in ...,https://www.dezeen.com/2020/11/16/ahead-europe...,[]
...,...,...,...,...
4946,70,Marcel Wanders launches Crochet Chair,https://www.dezeen.com/2006/12/10/marcel-wande...,[https://static.dezeen.com/uploads/2006/12/Mar...
4947,67,WOKmedia show at Design Miami,https://www.dezeen.com/2006/12/10/wokmedia-sho...,[https://static.dezeen.com/uploads/2006/12/WOK...
4948,54,Zaha Hadid furniture exhibited in New York,https://www.dezeen.com/2006/12/07/zaha-hadid-f...,[https://static.dezeen.com/uploads/2006/12/Zah...
4949,48,Thomas Heatherwick beach cafe takes shape,https://www.dezeen.com/2006/12/04/thomas-heath...,[https://static.dezeen.com/uploads/2006/12/Tho...


## Export to a pickle file

1. Increase the maximum depth of the Python interpreter stack (the recursion limit) so that pickling does not fail with a recursion error.
2. Export the DataFrame using the `DataFrame.to_pickle()` method.

In [None]:
sys.setrecursionlimit(10000)

In [None]:
articles_df.to_pickle('files/articles.pkl')
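
`sys.setrecursionlimit()` changes the limit for the rest of the session. If that is not wanted, the higher limit can be scoped to the export alone. This is a minimal sketch with a hypothetical helper named `recursion_limit`; the value 10000 is simply the one used above.

```python
import sys
from contextlib import contextmanager


@contextmanager
def recursion_limit(limit):
    """Temporarily raise the recursion limit, then restore the old value."""
    previous = sys.getrecursionlimit()
    sys.setrecursionlimit(limit)
    try:
        yield
    finally:
        sys.setrecursionlimit(previous)


# In the notebook the body would be: articles_df.to_pickle('files/articles.pkl')
with recursion_limit(10000):
    inside = sys.getrecursionlimit()

print(inside, sys.getrecursionlimit())
```

A possible reason the higher limit is needed at all, which is an assumption on my part and not something the notebook confirms, is that `article.h3.string` returns a `BeautifulSoup` string object that keeps a reference to its parse tree. Converting titles with `str()` before building the DataFrame might remove the need for it.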