# Behance.net web scraping using network traffic - Infinite scrolling

![](https://i.imgur.com/wYatW14.jpg)

## 1. Introduction


### 1.1. What is web scraping 

Web scraping is the process of collecting structured web data in an automated fashion. Some of the main use cases of web scraping include price monitoring, price intelligence, news monitoring, lead generation, and market research among many others.

In general, web data extraction is used by people and businesses who want to make use of the vast amount of publicly available web data to make smarter decisions.

There are several tools and methods for web scraping. Scrapy, Selenium, and Beautiful Soup are among the most popular libraries; another approach, used in this project, is to read the site's network traffic directly.


### 1.2. Tools used in this project

Most scraping projects parse HTML with Beautiful Soup, but the requests a page makes in the background can also give us the information we need.

In this project, we use Python as our coding language.

We mainly use the Requests library to get the information from the website.

The scraped information is then turned into a pandas DataFrame and saved as an Excel file.



## 2. Project Steps

- Assessing the website to find the information we need
- Copying the request as cURL and converting it to Python
- Using Requests to download the information
- Extracting the information for one item
- Putting it all together to collect the information automatically

### 2.1. Assessing the website to find the information we need

To find the information we need, we right-click the page and choose Inspect to open the browser's developer tools.

Under the Network tab there is a Fetch/XHR filter. As we scroll down the page, new requests appear in the list.


![](https://i.imgur.com/9V2K3GU.jpg)


We need to check the items under the Name column and find the request that returns the information we need.

This part is trial and error until we find the right request.

![](https://i.imgur.com/T3AOBSZ.jpg)

### 2.2. Copying as cURL and converting to Python


After we have found the right request, we right-click it and choose Copy > Copy as cURL (cmd).

Then we need a cURL converter to turn the command into Python code; we can use curl.trillworks.com.

After converting, we bring the Python version into a Jupyter notebook and continue the project.
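
Values copied with the cmd variant can carry caret sequences such as `^%^7B` where the browser actually sent `%7B`. The sketch below assumes those `^` characters are cmd.exe escape characters and strips them; `unescape_cmd` is a hypothetical helper name, not part of Requests or of the converter.

```python
import re
from urllib.parse import unquote


def unescape_cmd(value: str) -> str:
    """Drop cmd.exe caret escapes: '^x' becomes 'x' and '^^' becomes '^'."""
    return re.sub(r"\^(.)", r"\1", value)


# A fragment in the style of a cookie copied with "Copy as cURL (cmd)".
raw = "^%^7B^%^22native_lazy_loading^%^22^%^3Afalse^%^7D"

cleaned = unescape_cmd(raw)   # URL-encoded text without the carets
decoded = unquote(cleaned)    # human-readable form, for inspection only

print(cleaned)
print(decoded)
```

Whether the server cares about the difference depends on the header, so this is mainly useful for reading the copied values.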

In [1]:
# Importing necessary libraries

import requests
import pandas as pd

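### 2.3. Using Requests to download the information

The request we copied while scrolling has `ordinal` set to 48, which appears to be the offset of the second batch of items. Because the page uses infinite scrolling, replacing 48 with 0 should return the first batch.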
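
As a minimal sketch of that idea, the helper below builds the query parameters for a zero-based batch number. It assumes `ordinal` is an item offset that advances by 48 per batch; `build_params` and `PAGE_SIZE` are hypothetical names introduced for illustration.

```python
PAGE_SIZE = 48  # assumed batch size, taken from the ordinal seen in the copied request


def build_params(page: int, page_size: int = PAGE_SIZE) -> dict:
    """Return the query parameters for a zero-based batch number."""
    if page < 0:
        raise ValueError("page must be >= 0")
    return {"content": "projects", "ordinal": str(page * page_size)}


# The first three batches start at offsets 0, 48 and 96.
for page in range(3):
    print(page, build_params(page))
```

A dict like this can be passed to `requests.get(..., params=...)` in the same way as the tuple produced by the converter.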

In [4]:
headers = {
    'authority': 'www.behance.net',
    'sec-ch-ua': '^\\^Chromium^\\^;v=^\\^92^\\^, ^\\^',
    'accept': '*/*',
    'x-newrelic-id': 'VgUFVldbGwsFU1BRDwUBVw==',
    'x-bcp': 'a0a19a75-0dbb-42e6-8446-3ceb9e74ffda',
    'x-requested-with': 'XMLHttpRequest',
    'sec-ch-ua-mobile': '?0',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36',
    'sec-fetch-site': 'same-origin',
    'sec-fetch-mode': 'cors',
    'sec-fetch-dest': 'empty',
    'referer': 'https://www.behance.net/',
    'accept-language': 'en-US,en;q=0.9',
    'cookie': 'gk_suid=46493427; gki=^%^7B^%^22native_lazy_loading^%^22^%^3Afalse^%^2C^%^22feature_hire_me_cta^%^22^%^3Afalse^%^2C^%^22search_recommended_images_with_field_factor^%^22^%^3Afalse^%^2C^%^22feature_stock_rail^%^22^%^3Afalse^%^2C^%^22feature_updated_profile_checklist^%^22^%^3Afalse^%^7D; bcp=a0a19a75-0dbb-42e6-8446-3ceb9e74ffda; ilo0=true; sat_domain=A; gpv=behance.net:search:projects',
}

params = (
    ('content', 'projects'),
    ('ordinal', '0'),
)

response = requests.get('https://www.behance.net/search', headers=headers, params=params)


In [5]:
# checking if the request response was successful
response 

<Response [200]>

In [6]:
# creating a JSON object from the response we got from the page,
# then checking a couple of items to confirm we have the data
json_response = response.json()
json_response

{'search': {'filters': {'search': None,
   'sort': 'recommended',
   'time': 'week',
   'field': None,
   'color_hex': None,
   'schools': None,
   'tools': None,
   'user_tags': None,
   'country': None,
   'state': None,
   'city': None,
   'stateCode': None},
  'creativeFields': {'popular': [{'label': 'Architecture',
     'value': 'architecture',
     'id': 4},
    {'label': 'Art Direction', 'value': 'art direction', 'id': 5},
    {'label': 'Branding', 'value': 'branding', 'id': 109},
    {'label': 'Fashion', 'value': 'fashion', 'id': 37},
    {'label': 'Graphic Design', 'value': 'graphic design', 'id': 44},
    {'label': 'Illustration', 'value': 'illustration', 'id': 48},
    {'label': 'Industrial Design', 'value': 'industrial design', 'id': 49},
    {'label': 'Interaction Design', 'value': 'interaction design', 'id': 51},
    {'label': 'Motion Graphics', 'value': 'motion graphics', 'id': 63},
    {'label': 'Photography', 'value': 'photography', 'id': 73},
    {'label': 'UI/UX', 'v

In [7]:
# check what type of object is our json

type(json_response)

dict

In [8]:
json_response.keys()



dict_keys(['search', 'creativeFields', 'tools', 'schools'])

These are the same keys that the Network tab shows for this response.

The information we need, as shown on the page, is under the "search" key.
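
Reaching into a nested response with chained square brackets raises an error as soon as one level is missing. The sketch below shows one defensive way to walk such a structure; `dig` is a hypothetical helper, and `sample` is a tiny stand-in for the real response.

```python
def dig(data, *keys, default=None):
    """Follow keys and list indices through nested data; return default on a miss."""
    current = data
    for key in keys:
        try:
            current = current[key]
        except (KeyError, IndexError, TypeError):
            return default
    return current


# Tiny stand-in with the same nesting as the part of the response we care about.
sample = {"search": {"content": {"projects": [{"name": "Monstera Species Poster"}]}}}

print(dig(sample, "search", "content", "projects", 0, "name"))
print(dig(sample, "search", "missing", default=[]))
```

Returning a default keeps a long scraping run alive when one response has an unexpected shape, at the cost of hiding the problem unless the default is checked.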

### 2.4. Extracting the information for one item

In [9]:
# the information we are going to get from the json is :
# design name
# first name
# last name
# user name
# country
# views
# likes
# comments
# url

In [10]:
results_json = json_response["search"]["content"]["projects"]
results_json

[{'id': 98483867,
  'name': 'Monstera Species Poster',
  'published_on': 1591568359,
  'created_on': 1591568246,
  'modified_on': 1629608412,
  'url': 'https://www.behance.net/gallery/98483867/Monstera-Species-Poster',
  'slug': 'Monstera-Species-Poster',
  'privacy': 'public',
  'fields': ['Illustration', 'Painting', 'Graphic Design'],
  'covers': {'808': 'https://mir-s3-cdn-cf.behance.net/projects/808/751ce598483867.Y3JvcCwxODA2LDE0MTIsMCwyNDU.jpg',
   '404': 'https://mir-s3-cdn-cf.behance.net/projects/404/751ce598483867.Y3JvcCwxODA2LDE0MTIsMCwyNDU.jpg',
   '202': 'https://mir-s3-cdn-cf.behance.net/projects/202/751ce598483867.Y3JvcCwxODA2LDE0MTIsMCwyNDU.jpg',
   '230': 'https://mir-s3-cdn-cf.behance.net/projects/230/751ce598483867.Y3JvcCwxODA2LDE0MTIsMCwyNDU.jpg',
   '115': 'https://mir-s3-cdn-cf.behance.net/projects/115/751ce598483867.Y3JvcCwxODA2LDE0MTIsMCwyNDU.jpg',
   'original': 'https://mir-s3-cdn-cf.behance.net/projects/original/751ce598483867.Y3JvcCwxODA2LDE0MTIsMCwyNDU.jpg',

In [11]:
len(results_json)

# 48 items matches the ordinal we saw in the copied request, so this is the right list

48
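
Finding that list by hand takes some trial and error. As a rough aid, the sketch below walks any decoded JSON value and reports where every list sits and how long it is, so a list whose length matches the batch size stands out. `list_paths` is a hypothetical helper and `sample` is a small stand-in for the real response.

```python
def list_paths(node, path=""):
    """Yield (path, length) for every list found in a decoded JSON value."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from list_paths(value, f"{path}.{key}" if path else key)
    elif isinstance(node, list):
        yield path, len(node)
        for index, item in enumerate(node):
            yield from list_paths(item, f"{path}[{index}]")


# Small stand-in: two projects instead of 48.
sample = {
    "search": {"content": {"projects": [
        {"id": 1, "fields": ["Illustration", "Painting"]},
        {"id": 2, "fields": []},
    ]}},
    "tools": [],
}

for path, length in list_paths(sample):
    print(f"{length:>3}  {path}")
```

On the real response, filtering the results for a length of 48 would be one way to narrow the candidates.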

In [12]:
# now we need to grab the information we need for one of the designs
results_json[0]

{'id': 98483867,
 'name': 'Monstera Species Poster',
 'published_on': 1591568359,
 'created_on': 1591568246,
 'modified_on': 1629608412,
 'url': 'https://www.behance.net/gallery/98483867/Monstera-Species-Poster',
 'slug': 'Monstera-Species-Poster',
 'privacy': 'public',
 'fields': ['Illustration', 'Painting', 'Graphic Design'],
 'covers': {'808': 'https://mir-s3-cdn-cf.behance.net/projects/808/751ce598483867.Y3JvcCwxODA2LDE0MTIsMCwyNDU.jpg',
  '404': 'https://mir-s3-cdn-cf.behance.net/projects/404/751ce598483867.Y3JvcCwxODA2LDE0MTIsMCwyNDU.jpg',
  '202': 'https://mir-s3-cdn-cf.behance.net/projects/202/751ce598483867.Y3JvcCwxODA2LDE0MTIsMCwyNDU.jpg',
  '230': 'https://mir-s3-cdn-cf.behance.net/projects/230/751ce598483867.Y3JvcCwxODA2LDE0MTIsMCwyNDU.jpg',
  '115': 'https://mir-s3-cdn-cf.behance.net/projects/115/751ce598483867.Y3JvcCwxODA2LDE0MTIsMCwyNDU.jpg',
  'original': 'https://mir-s3-cdn-cf.behance.net/projects/original/751ce598483867.Y3JvcCwxODA2LDE0MTIsMCwyNDU.jpg',
  'max_808': '

In [13]:
#design name
results_json[0]["name"]

'Monstera Species Poster'

In [14]:
#first name
results_json[0]["owners"][0]["first_name"]

'Aaron'

In [15]:
#last name
results_json[0]["owners"][0]["last_name"]

'Apsley'

In [16]:
#user name
results_json[0]["owners"][0]["username"]

'aaronapsley'

In [17]:
#country
results_json[0]["owners"][0]["country"]

'United States'

In [18]:
#views
results_json[0]["stats"]["views"]

8670

In [19]:
# likes
results_json[0]["stats"]["appreciations"]

288

In [20]:
#comments
results_json[0]["stats"]["comments"]

11

In [21]:
#url
results_json[0]["url"]

'https://www.behance.net/gallery/98483867/Monstera-Species-Poster'
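
The individual lookups above can be gathered into one function that turns a project into a flat record. This is a sketch under the assumption that every project has the `owners` and `stats` structure seen in this example; `extract_record` is a hypothetical name, and missing values come back as `None` rather than raising.

```python
def extract_record(project: dict) -> dict:
    """Flatten one project from the response into the nine fields we want."""
    owners = project.get("owners") or [{}]
    owner = owners[0]                      # only the first owner is used
    stats = project.get("stats") or {}
    return {
        "design_name": project.get("name"),
        "first_name": owner.get("first_name"),
        "last_name": owner.get("last_name"),
        "username": owner.get("username"),
        "country": owner.get("country"),
        "views": stats.get("views"),
        "likes": stats.get("appreciations"),  # Behance calls likes "appreciations"
        "comments": stats.get("comments"),
        "url": project.get("url"),
    }


# Trimmed copy of the first project shown above.
project = {
    "name": "Monstera Species Poster",
    "url": "https://www.behance.net/gallery/98483867/Monstera-Species-Poster",
    "owners": [{"first_name": "Aaron", "last_name": "Apsley",
                "username": "aaronapsley", "country": "United States"}],
    "stats": {"views": 8670, "appreciations": 288, "comments": 11},
}

record = extract_record(project)
print(record)
```

A list of such records can be passed straight to `pd.DataFrame(...)`, which avoids keeping one list per column in sync.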

### 2.5. Putting it all together and storing the information in a DataFrame


In [22]:
design_name = []
first_name = []
last_name = []
country = []
views = []
likes = []
comments = []


for result in results_json:
    design_name.append(result["name"])
    first_name.append(result["owners"][0]["first_name"])
    last_name.append(result["owners"][0]["last_name"])
    country.append(result["owners"][0]["country"])
    views.append(result["stats"]["views"])
    likes.append(result["stats"]["appreciations"])
    comments.append(result["stats"]["comments"])

In [23]:
graphic_df = pd.DataFrame({"design_name": design_name,
                           "first_name": first_name,
                           "last_name": last_name,
                           "country": country,
                           "views": views,
                           "likes": likes,
                           "comments": comments
                           })

In [24]:
graphic_df

| | design_name | first_name | last_name | country | views | likes | comments |
|---|---|---|---|---|---|---|---|
| 0 | Monstera Species Poster | Aaron | Apsley | United States | 8670 | 288 | 11 |
| 1 | Child Portrait II | Lucia | Kvetanova | Greece | 64 | 1 | 0 |
| 2 | Cyklus Slatinky | Kateřina | Coufalová | Czech Republic | 241 | 66 | 7 |
| 3 | Takami | Motyw | Studio | Poland | 6721 | 622 | 25 |
| 4 | Red Lights : Vatican | Aishy | ㅤ | France | 750 | 172 | 16 |
| 5 | YUZU BURGER | Júlía | Runólfsdóttir | Iceland | 441 | 31 | 0 |
| 6 | Apple Music: Nuevo nuevo | ChocoToy | cute | Mexico | 2633 | 605 | 18 |
| 7 | Osmosis | Martin | Naumann | Germany | 1197 | 164 | 20 |
| 8 | FLAVOR INDESCRIBABLE·BARISTA TASTING SOCKS风味不可... | HOOK | FOOD | China | 2072 | 189 | 16 |
| 9 | The first characters made in Вlender | Denis | Wipart | Russian Federation | 415 | 125 | 15 |


### Scraping Multiple Pages

The `ordinal` value in `params` is the offset of the first item in a batch, so we can use a for loop that raises it in steps of 48 to scrape as many batches as we want.

In [25]:
headers = {
    'authority': 'www.behance.net',
    'sec-ch-ua': '^\\^Chromium^\\^;v=^\\^92^\\^, ^\\^',
    'accept': '*/*',
    'x-newrelic-id': 'VgUFVldbGwsFU1BRDwUBVw==',
    'x-bcp': 'a0a19a75-0dbb-42e6-8446-3ceb9e74ffda',
    'x-requested-with': 'XMLHttpRequest',
    'sec-ch-ua-mobile': '?0',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36',
    'sec-fetch-site': 'same-origin',
    'sec-fetch-mode': 'cors',
    'sec-fetch-dest': 'empty',
    'referer': 'https://www.behance.net/',
    'accept-language': 'en-US,en;q=0.9',
    'cookie': 'gk_suid=46493427; gki=^%^7B^%^22native_lazy_loading^%^22^%^3Afalse^%^2C^%^22feature_hire_me_cta^%^22^%^3Afalse^%^2C^%^22search_recommended_images_with_field_factor^%^22^%^3Afalse^%^2C^%^22feature_stock_rail^%^22^%^3Afalse^%^2C^%^22feature_updated_profile_checklist^%^22^%^3Afalse^%^7D; bcp=a0a19a75-0dbb-42e6-8446-3ceb9e74ffda; ilo0=true; sat_domain=A; gpv=behance.net:search:projects',
}

design_name = []
first_name = []
last_name = []
country = []
views = []
likes = []
comments = [] 

# ordinal takes the values 0, 48, ..., 432: ten batches of 48 projects each
for i in range(0, 480, 48):
    params = (
        ('content', 'projects'),
        ('ordinal', str(i)),
    )

    response = requests.get('https://www.behance.net/search', headers=headers, params=params)
    json_response = response.json()
    results_json = json_response["search"]["content"]["projects"]

    # collect the same fields as in the single-batch example
    for result in results_json:
        design_name.append(result["name"])
        first_name.append(result["owners"][0]["first_name"])
        last_name.append(result["owners"][0]["last_name"])
        country.append(result["owners"][0]["country"])
        views.append(result["stats"]["views"])
        likes.append(result["stats"]["appreciations"])
        comments.append(result["stats"]["comments"])
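
The loop above always asks for a fixed number of batches. As a sketch of a slightly more defensive variant, the generator below stops as soon as a batch comes back empty and can pause between requests. The request itself is passed in as a function so the example runs offline; `iter_projects`, `fetch_page`, and `fake_fetch` are hypothetical names, and with the real site `fetch_page` would wrap the `requests.get` call and return `json_response["search"]["content"]["projects"]`.

```python
import time
from typing import Callable, Iterator


def iter_projects(fetch_page: Callable[[int], list], max_items: int,
                  page_size: int = 48, delay: float = 0.0) -> Iterator[dict]:
    """Yield projects batch by batch until max_items is covered or a batch is empty."""
    for ordinal in range(0, max_items, page_size):
        batch = fetch_page(ordinal)
        if not batch:        # nothing more to load, so stop early
            break
        yield from batch
        time.sleep(delay)    # optional pause between requests


# Stand-in for the real request: 100 fake projects served 48 at a time.
FAKE = [{"name": f"project {n}"} for n in range(100)]


def fake_fetch(ordinal: int) -> list:
    return FAKE[ordinal:ordinal + 48]


projects = list(iter_projects(fake_fetch, max_items=480))
print(len(projects))
```

Setting `delay` to a second or so keeps the request rate modest when the function is pointed at a real site.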
    

### Turning the data from multiple pages into a pandas DataFrame

In [26]:
graphic_multiple_df = pd.DataFrame({"design_name": design_name,
                                    "first_name": first_name,
                                    "last_name": last_name,
                                    "country": country,
                                    "views": views,
                                    "likes": likes,
                                    "comments": comments
                                    })

### Store results to Excel

In [27]:
graphic_multiple_df.to_excel("multiple_page_graphic.xlsx", index=False)
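
If a plain-text file is preferred over a workbook, the same rows can be written as CSV. With pandas this is `graphic_multiple_df.to_csv(...)`; the sketch below shows the equivalent using only the standard library. `rows_to_csv` is a hypothetical helper, and the two rows are copied from the table shown earlier.

```python
import csv
import io


def rows_to_csv(rows: list, fieldnames: list) -> str:
    """Serialize a list of dicts to CSV text with a header row."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = [
    {"design_name": "Monstera Species Poster", "country": "United States", "views": 8670},
    {"design_name": "Takami", "country": "Poland", "views": 6721},
]

csv_text = rows_to_csv(rows, ["design_name", "country", "views"])
print(csv_text)
```

To write to disk instead, the same `DictWriter` can be given a file opened with `open(path, "w", newline="", encoding="utf-8")`.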