# Autolist.com Web Scraping Using Network Traffic

![](https://i.imgur.com/di54DTP.jpg)

## 1. Introduction


### 1.1. What is web scraping 

Web scraping is the process of collecting structured web data in an automated fashion. Some of the main use cases of web scraping include price monitoring, price intelligence, news monitoring, lead generation, and market research among many others.

In general, web data extraction is used by people and businesses who want to make use of the vast amount of publicly available web data to make smarter decisions.

There are a number of tools and methods for performing web scraping; inspecting network traffic, Scrapy, Selenium, and Beautiful Soup are among the most popular.


### 1.2. Tools used in this project

Most scraping projects parse HTML with Beautiful Soup, but the requests a page makes in the background (its network traffic) can also give us the information we need.

In this project, we use Python as our coding language.

In Python, we mainly use the Requests library to get the information from the website.

The scraped information is then turned into a Pandas DataFrame and saved as an Excel file.



## 2. Project Steps:

- Assessing the website to find the information we need
- Copying the request as cURL and converting it to Python
- Using Requests to download the information
- Extracting info for one item
- Putting it all together and getting the information automatically

### 2.1. Assessing the website to find the information we need

To find the information we need, we right-click on the page and choose Inspect to open the browser's developer tools.

Under the Network tab there is a Fetch/XHR filter. As we scroll down the page, new requests appear there while more listings are loaded.


![](https://i.imgur.com/EYL7FMZ.jpg)


We need to check the items under the Name column to find the request that carries the listing data.

This part is trial and error until we find the right one.

![](https://i.imgur.com/od1F1Oz.jpg)

### 2.2. Copying as cURL and converting to Python


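After we have found the right request, we right-click on it and choose Copy > Copy as cURL (cmd). Note that the cmd variant escapes special characters with carets (`^`), which can leave sequences such as `^%^20` in the converted values; Copy as cURL (bash) avoids this.

Then we need a cURL converter to turn the command into Python code. We can use curl.trillworks.com.

After converting, we bring the Python version into a Jupyter notebook and continue the project.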
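Values copied through the cmd variant can carry caret escapes such as `^%^20`. The sketch below shows one way to clean such a value before passing it to Requests. It assumes the carets come from cmd-style escaping (where `^x` stands for `x`); the helper names `strip_cmd_escapes` and `clean_param` are hypothetical and not part of any library.

```python
import re
from urllib.parse import unquote


def strip_cmd_escapes(value: str) -> str:
    """Undo cmd-style caret escaping: '^x' becomes 'x'."""
    return re.sub(r"\^(.)", r"\1", value)


def clean_param(value: str) -> str:
    """Remove caret escapes, then decode percent-encoding such as %20."""
    return unquote(strip_cmd_escapes(value))


print(clean_param("San^%^20Francisco,^%^20CA"))  # San Francisco, CA
```

Passing the decoded value is usually enough, since Requests percent-encodes query parameters on its own.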

In [1]:
# Importing necessary libraries

import requests
import pandas as pd

### 2.3. Using Requests to download the information

In [3]:
headers = {
    'authority': 'www.autolist.com',
    'sec-ch-ua': '^\\^Chromium^\\^;v=^\\^92^\\^, ^\\^',
    'accept': '*/*',
    'x-requested-with': 'XMLHttpRequest',
    'sec-ch-ua-mobile': '?0',
    'x-autolist-session-guid': 'a05719cf-a626-4dee-bb98-bef4f39d375c',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36',
    'sec-fetch-site': 'same-origin',
    'sec-fetch-mode': 'cors',
    'sec-fetch-dest': 'empty',
    'referer': 'https://www.autolist.com/listings',
    'accept-language': 'en-US,en;q=0.9',
    'cookie': 'ec=eyJlZGdlVXNlckFnZW50IjoiTW96aWxsYS81LjAgKFdpbmRvd3MgTlQgMTAuMDsgV2luNjQ7IHg2NCkgQXBwbGVXZWJLaXQvNTM3LjM2IChLSFRNTCwgbGlrZSBHZWNrbykgQ2hyb21lLzkyLjAuNDUxNS4xNTkgU2FmYXJpLzUzNy4zNiJ9; client_guid_timestamp=e02c9c83-67cd-4781-9710-2a0fb9ca8e20.1629563935984; scavenger-a493=i2edJn8U8quG2BtrlZCXcua4KB62fcJRrAhaI/t9eYZNJczdHQYj3P0Y4s6myhEl4IbJl8vp9L0oN+SXDh2fFcTvJ/vA/hPyRJ0ohE6LnQGIee5ACCL3nGqW5xFg3QnijfTypiJhVgTfPjk3c81AtkxzR54e+BHqQL0DI6/oWNd7iIkgvHaWjuilZuITytvFV0IAUJ0bX041NuAIFpx8K90bzfPLeAKgkwc0lpwB3cLwm8NcNKMD3xELdjI6leANl5jZJ9KzZ3K+5J7nCpxnAynFJJt2e95bby70BTVZ0qdWtSWGgwFxEdjmesrF42H7EgAjt4LMIKDuRVtbhVwu/q/wUhPoUZvZb6F+CbFpxMXuZTpd27JmWUrboqGTfyS3Nr7gV1kl6hr+8w0aOlmo8NlrdmNRb0u51Ex4zu4JTymiAUsWvS9JhNr/uvk4GVvM3kafDQeKq7v+xHkB/lCvPkqLEifYP7VVan6h5ZmvaQTgULjs0N5JhUd76Pt6c/tY5R0v8R67gjgbvW8n/4Gqzw==; _sp_ses.8ca5=*; _fbp=fb.1.1629563936393.301782673; _ga=GA1.2.1962975142.1629563934; _gid=GA1.2.809253866.1629563937; __gads=ID=2fecf3c5f98db8bc:T=1629563936:S=ALNI_MbP7Qh5aV67Y9ME82H_NbgP-TB1lg; _ga_152345333=GS1.1.1629563933.1.1.1629563988.0; AMP_TOKEN=^%^24NOT_FOUND; sp-nuid=c9711a73-266d-4daa-8d34-a5f4b1050171; _gcl_au=1.1.1968340282.1629564101; cto_bundle=e7B8Fl94MDdzUjRmcjFBSlZrekd2TyUyRmNzdEZYR0hiM3FNWVFYSHBqbUtJMjJVYzFNOERnOFBSelQlMkZIT3o4Z2Z2RjA0SCUyQjdmJTJCVVhBZmZnRUlFRERqUFp5NlBrVksxWjhsekp2b3ZSMm1UMHc1ajklMkZVbnoyS2lOYklnNVRiSmNTVSUyRlpnJTJCenRRd01nU1BSd2JGelFZNXZJQXNMZyUzRCUzRA; _gat=1; _sp_id.8ca5=c99d8079-5160-4a37-9ca1-0ef78843c81f.1629563936.1.1629564425.1629563936.faa2eb1c-9606-4e57-ab87-742b8c6e8ca9',
    'if-none-match': 'W/^\\^8dcb08c3d726a06db127e8ed93cf14d4^\\^',
}

params = (
    ('make', 'Tesla'),
    ('location', 'San Francisco, CA'),  # caret escapes from the cmd copy removed; Requests URL-encodes the value
    ('latitude', '37.7749295'),
    ('longitude', '-122.4194155'),
    ('radius', '50'),
    ('page', '2'),
)

response = requests.get('https://www.autolist.com/api/cwv/seo/listings', headers=headers, params=params)
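To see what a `params` tuple turns into, the query string can be previewed with the standard library. This is only a rough sketch: Requests does its own encoding, and its output may differ in small details from `urlencode`. The decoded location value `'San Francisco, CA'` is an assumption about what the copied request intended.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.autolist.com/api/cwv/seo/listings"

# Same keys as the notebook's params; the location is written in plain text.
params = (
    ("make", "Tesla"),
    ("location", "San Francisco, CA"),
    ("latitude", "37.7749295"),
    ("longitude", "-122.4194155"),
    ("radius", "50"),
    ("page", "2"),
)

# urlencode percent-encodes each value and joins the pairs with '&'.
query = urlencode(params)
full_url = f"{BASE_URL}?{query}"
print(full_url)
```

Previewing the URL this way makes it easy to compare against the request URL shown in the browser's Network tab.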

In [4]:
# checking if the request response was successful
response

<Response [200]>

In [6]:
# creating a Python object from the JSON body of the response
json_response = response.json()
# inspect the output to confirm the listings were returned
json_response

{'total_results_count': 168,
 'make_model_name': 'Tesla',
 'trims': [],
 'ad_info': {'ad_unit': '/19485787/AutoList.com/Search_Results_Page/Tesla',
  'make': 'Tesla',
  'model': None,
  'page_type': 'srp',
  'cpo': False,
  'body_style': None,
  'year': 2022,
  'age_type': 'new',
  'app': 'web'},
 'search_results': [{'available_nationwide': False,
   'created_at': 1623286982,
   'recent_price_drop': False,
   'is_hot': False,
   'href_target': '/listings/5YJ3E1EA6KF483326',
   'vdp_url': '/tesla-model+3#vin=5YJ3E1EA6KF483326',
   'clickoff_url': 'https://www.shift.com/car/c110936?utm_source=Autolist&utm_medium=listing&utm_campaign=deeplink',
   'accepts_leads': True,
   'open_in_new_window': False,
   'id': 220874699,
   'primary_photo_url': 'https://static.cargurus.com/images/forsale/2021/06/09/23/25/2019_tesla_model_3-pic-6436851712061059861-1024x768.jpeg',
   'thumbnail_url_large': 'https://images.autolist.com/image/fetch/s--1kW4bu47--/c_fill,f_auto,q_auto,w_320/https://static.cargu

In [7]:
# check what type of object the parsed JSON is

type(json_response)

dict

In [8]:
json_response.keys()

dict_keys(['total_results_count', 'make_model_name', 'trims', 'ad_info', 'search_results', 'html'])

These keys match the ones shown for this request in the browser's Network tab. The information we need, as shown on the page, is in the value of the "search_results" key.
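Because the shape of the response could change, it can help to check for the key before using it. The following is a minimal sketch, assuming the payload is a dict like the one printed above; `get_search_results` is a hypothetical helper, and `sample_payload` is a shortened stand-in for the real response.

```python
def get_search_results(payload: dict) -> list:
    """Return the list under 'search_results', or raise if it is missing."""
    results = payload.get("search_results")
    if not isinstance(results, list):
        raise ValueError("payload has no 'search_results' list")
    return results


# Shortened stand-in for the real response, using values seen in the output.
sample_payload = {
    "total_results_count": 168,
    "make_model_name": "Tesla",
    "search_results": [{"model": "Model 3", "year": 2019}],
}

print(len(get_search_results(sample_payload)))  # 1
```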

### 2.4. Extracting info for one item

In [9]:
# the information we are going to get from the json is :
# model
# mileage
# year
# dealer name
# price 

In [10]:
results_json = json_response["search_results"]
results_json

[{'available_nationwide': False,
  'created_at': 1623286982,
  'recent_price_drop': False,
  'is_hot': False,
  'href_target': '/listings/5YJ3E1EA6KF483326',
  'vdp_url': '/tesla-model+3#vin=5YJ3E1EA6KF483326',
  'clickoff_url': 'https://www.shift.com/car/c110936?utm_source=Autolist&utm_medium=listing&utm_campaign=deeplink',
  'accepts_leads': True,
  'open_in_new_window': False,
  'id': 220874699,
  'primary_photo_url': 'https://static.cargurus.com/images/forsale/2021/06/09/23/25/2019_tesla_model_3-pic-6436851712061059861-1024x768.jpeg',
  'thumbnail_url_large': 'https://images.autolist.com/image/fetch/s--1kW4bu47--/c_fill,f_auto,q_auto,w_320/https://static.cargurus.com/images/forsale/2021/06/09/23/25/2019_tesla_model_3-pic-6436851712061059861-1024x768.jpeg',
  'lat': 37.6397,
  'lon': -122.416,
  'year': 2019,
  'make': 'Tesla',
  'model': 'Model 3',
  'model_id': 2217,
  'display_color': '',
  'price': '$40,500',
  'price_unformatted': 40500,
  'mileage': '16,341 Miles',
  'mileage_

In [11]:
len(results_json)

#this shows we have made the right selection

20
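The response reported a `total_results_count` of 168, and this page returned 20 results. Assuming every page holds up to 20 listings and the count covers all matching listings (neither is confirmed by the site), the number of pages can be estimated as follows; `page_count` is a hypothetical helper.

```python
import math


def page_count(total_results: int, per_page: int) -> int:
    """Number of pages needed to show total_results at per_page each."""
    return math.ceil(total_results / per_page)


print(page_count(168, 20))  # 9
```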

In [14]:
# now we need to grab the information we need for one of the cars
results_json[0]

{'available_nationwide': False,
 'created_at': 1623286982,
 'recent_price_drop': False,
 'is_hot': False,
 'href_target': '/listings/5YJ3E1EA6KF483326',
 'vdp_url': '/tesla-model+3#vin=5YJ3E1EA6KF483326',
 'clickoff_url': 'https://www.shift.com/car/c110936?utm_source=Autolist&utm_medium=listing&utm_campaign=deeplink',
 'accepts_leads': True,
 'open_in_new_window': False,
 'id': 220874699,
 'primary_photo_url': 'https://static.cargurus.com/images/forsale/2021/06/09/23/25/2019_tesla_model_3-pic-6436851712061059861-1024x768.jpeg',
 'thumbnail_url_large': 'https://images.autolist.com/image/fetch/s--1kW4bu47--/c_fill,f_auto,q_auto,w_320/https://static.cargurus.com/images/forsale/2021/06/09/23/25/2019_tesla_model_3-pic-6436851712061059861-1024x768.jpeg',
 'lat': 37.6397,
 'lon': -122.416,
 'year': 2019,
 'make': 'Tesla',
 'model': 'Model 3',
 'model_id': 2217,
 'display_color': '',
 'price': '$40,500',
 'price_unformatted': 40500,
 'mileage': '16,341 Miles',
 'mileage_unformatted': 16341,
 '

In [15]:
#model
results_json[0]["model"]

'Model 3'

In [16]:
#mileage
results_json[0]["mileage"]

'16,341 Miles'

In [17]:
#year
results_json[0]["year"]

2019

In [18]:
#dealer name
results_json[0]["dealer_name"]

'Shift SF - Test Drives Delivered To You'

In [19]:
# price
results_json[0]["price"]

'$40,500'
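The five lookups above can be collected into one small function. This is a sketch rather than the notebook's own code: `extract_listing` and `FIELDS` are hypothetical names, and `sample_result` is a shortened copy of the first listing shown earlier.

```python
FIELDS = ("model", "mileage", "year", "dealer_name", "price")


def extract_listing(result: dict) -> dict:
    """Keep only the fields of interest; missing keys become None."""
    return {field: result.get(field) for field in FIELDS}


# Shortened copy of the first listing from the output above.
sample_result = {
    "id": 220874699,
    "year": 2019,
    "make": "Tesla",
    "model": "Model 3",
    "price": "$40,500",
    "mileage": "16,341 Miles",
    "dealer_name": "Shift SF - Test Drives Delivered To You",
}

row = extract_listing(sample_result)
print(row)
```

Using `dict.get` instead of square brackets means a listing that lacks one of the fields yields `None` rather than raising a `KeyError`.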

### 2.5. Putting it all together, getting the information automatically, and storing it in a DataFrame

In [20]:
model = []
mileage = []
year = []
dealer_name = []
price = []


for result in results_json:
    model.append(result["model"])
    mileage.append(result["mileage"])
    year.append(result["year"])
    dealer_name.append(result["dealer_name"])
    price.append(result["price"])

In [21]:
cars_df = pd.DataFrame({"model": model,
                        "mileage": mileage,
                        "year": year,
                        "dealer_name": dealer_name,
                        "price": price})

In [22]:
cars_df

|   | model | mileage | year | dealer_name | price |
|---|---|---|---|---|---|
| 0 | Model 3 | 16,341 Miles | 2019 | Shift SF - Test Drives Delivered To You | $40,500 |
| 1 | Model 3 | 1,339 Miles | 2021 | CarMax Pleasanton - Now Open | $54,998 |
| 2 | Model 3 | 32,407 Miles | 2019 | Shift SF - Test Drives Delivered To You | $41,950 |
| 3 | Model 3 | 46,517 Miles | 2018 | Shift SF - Test Drives Delivered To You | $39,950 |
| 4 | Model 3 | 9,350 Miles | 2018 | San Jose Mitsubishi | $46,999 |
| 5 | Model Y | 17,245 Miles | 2020 | Premier Nissan Of Fremont | $62,995 |
| 6 | Model 3 | 17,796 Miles | 2020 | CarMax Santa Rosa - Now Open | $58,998 |
| 7 | Model 3 | 6,416 Miles | 2019 | Elite Motor Cars | $57,990 |
| 8 | Model 3 | 10,353 Miles | 2020 | CarMax Fremont - Now Open | $61,998 |
| 9 | Model 3 | 21,219 Miles | 2019 | Audi Oakland | $42,598 |
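The price and mileage columns hold display strings such as `$40,500` and `16,341 Miles`, which cannot be sorted or averaged as numbers. The listing output earlier also shows `price_unformatted` and `mileage_unformatted` keys that already hold integers, so reading those is likely the simpler route. If only the display strings are at hand, a sketch like the following could convert them; `to_int` is a hypothetical helper that simply keeps the digits.

```python
import re
from typing import Optional


def to_int(text: str) -> Optional[int]:
    """Strip everything except digits and convert; None if no digits remain."""
    digits = re.sub(r"[^0-9]", "", text)
    return int(digits) if digits else None


print(to_int("$40,500"))       # 40500
print(to_int("16,341 Miles"))  # 16341
```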


In [23]:
# store the data from the single page in an Excel file
cars_df.to_excel("single_page_car.xlsx", index=False)

### Scraping Multiple Pages 

The page number is part of "params", so we can use a for loop to repeat the scraping for as many pages as we want.

In [21]:
headers = {
    'authority': 'www.autolist.com',
    'sec-ch-ua': '^\\^Chromium^\\^;v=^\\^92^\\^, ^\\^',
    'accept': '*/*',
    'x-requested-with': 'XMLHttpRequest',
    'sec-ch-ua-mobile': '?0',
    'x-autolist-session-guid': 'a05719cf-a626-4dee-bb98-bef4f39d375c',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36',
    'sec-fetch-site': 'same-origin',
    'sec-fetch-mode': 'cors',
    'sec-fetch-dest': 'empty',
    'referer': 'https://www.autolist.com/listings',
    'accept-language': 'en-US,en;q=0.9',
    'cookie': 'ec=eyJlZGdlVXNlckFnZW50IjoiTW96aWxsYS81LjAgKFdpbmRvd3MgTlQgMTAuMDsgV2luNjQ7IHg2NCkgQXBwbGVXZWJLaXQvNTM3LjM2IChLSFRNTCwgbGlrZSBHZWNrbykgQ2hyb21lLzkyLjAuNDUxNS4xNTkgU2FmYXJpLzUzNy4zNiJ9; client_guid_timestamp=e02c9c83-67cd-4781-9710-2a0fb9ca8e20.1629563935984; scavenger-a493=i2edJn8U8quG2BtrlZCXcua4KB62fcJRrAhaI/t9eYZNJczdHQYj3P0Y4s6myhEl4IbJl8vp9L0oN+SXDh2fFcTvJ/vA/hPyRJ0ohE6LnQGIee5ACCL3nGqW5xFg3QnijfTypiJhVgTfPjk3c81AtkxzR54e+BHqQL0DI6/oWNd7iIkgvHaWjuilZuITytvFV0IAUJ0bX041NuAIFpx8K90bzfPLeAKgkwc0lpwB3cLwm8NcNKMD3xELdjI6leANl5jZJ9KzZ3K+5J7nCpxnAynFJJt2e95bby70BTVZ0qdWtSWGgwFxEdjmesrF42H7EgAjt4LMIKDuRVtbhVwu/q/wUhPoUZvZb6F+CbFpxMXuZTpd27JmWUrboqGTfyS3Nr7gV1kl6hr+8w0aOlmo8NlrdmNRb0u51Ex4zu4JTymiAUsWvS9JhNr/uvk4GVvM3kafDQeKq7v+xHkB/lCvPkqLEifYP7VVan6h5ZmvaQTgULjs0N5JhUd76Pt6c/tY5R0v8R67gjgbvW8n/4Gqzw==; _sp_ses.8ca5=*; _fbp=fb.1.1629563936393.301782673; _ga=GA1.2.1962975142.1629563934; _gid=GA1.2.809253866.1629563937; __gads=ID=2fecf3c5f98db8bc:T=1629563936:S=ALNI_MbP7Qh5aV67Y9ME82H_NbgP-TB1lg; _ga_152345333=GS1.1.1629563933.1.1.1629563988.0; AMP_TOKEN=^%^24NOT_FOUND; sp-nuid=c9711a73-266d-4daa-8d34-a5f4b1050171; _gcl_au=1.1.1968340282.1629564101; cto_bundle=e7B8Fl94MDdzUjRmcjFBSlZrekd2TyUyRmNzdEZYR0hiM3FNWVFYSHBqbUtJMjJVYzFNOERnOFBSelQlMkZIT3o4Z2Z2RjA0SCUyQjdmJTJCVVhBZmZnRUlFRERqUFp5NlBrVksxWjhsekp2b3ZSMm1UMHc1ajklMkZVbnoyS2lOYklnNVRiSmNTVSUyRlpnJTJCenRRd01nU1BSd2JGelFZNXZJQXNMZyUzRCUzRA; _gat=1; _sp_id.8ca5=c99d8079-5160-4a37-9ca1-0ef78843c81f.1629563936.1.1629564425.1629563936.faa2eb1c-9606-4e57-ab87-742b8c6e8ca9',
    'if-none-match': 'W/^\\^8dcb08c3d726a06db127e8ed93cf14d4^\\^',
}


model = []
mileage = []
year = []
dealer_name = []
price = []

# loop over pages 2 through 6 (range excludes the upper bound)
for i in range(2, 7):

    params = (
        ('make', 'Tesla'),
        ('location', 'San Francisco, CA'),  # Requests URL-encodes the value
        ('latitude', '37.7749295'),
        ('longitude', '-122.4194155'),
        ('radius', '50'),
        ('page', str(i)),
    )

    response = requests.get('https://www.autolist.com/api/cwv/seo/listings', headers=headers, params=params)
    response.raise_for_status()  # stop early if a page request fails
    json_response = response.json()
    results_json = json_response["search_results"]


    for result in results_json:
        model.append(result["model"])
        mileage.append(result["mileage"])
        year.append(result["year"])
        dealer_name.append(result["dealer_name"])
        price.append(result["price"])
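The loop above can also be written so that it stops when a page comes back empty and pauses between requests. This is a sketch under stated assumptions: `scrape_pages` is a hypothetical helper, the one-second default delay is an arbitrary choice, and `fake_fetch` stands in for the real network call so the example runs offline.

```python
import time
from typing import Callable, Iterable

FIELDS = ("model", "mileage", "year", "dealer_name", "price")


def scrape_pages(fetch_page: Callable[[int], dict],
                 pages: Iterable[int],
                 delay_seconds: float = 1.0) -> list:
    """Collect one dict per listing, stopping at the first empty page."""
    rows = []
    for page in pages:
        payload = fetch_page(page)
        results = payload.get("search_results", [])
        if not results:
            break  # no more listings, so later pages are skipped
        for result in results:
            rows.append({field: result.get(field) for field in FIELDS})
        time.sleep(delay_seconds)  # pause between requests
    return rows


def fake_fetch(page: int) -> dict:
    """Offline stand-in: pages up to 3 have one listing, later pages are empty."""
    if page > 3:
        return {"search_results": []}
    return {"search_results": [{"model": "Model 3", "year": 2019,
                                "price": "$40,500"}]}


rows = scrape_pages(fake_fetch, range(2, 7), delay_seconds=0)
print(len(rows))  # 2
```

In the notebook, `fetch_page` would wrap the `requests.get(...)` call with the page number placed in `params` and return `response.json()`. A list of dicts like `rows` can be passed directly to `pd.DataFrame`.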
    

### Turning our data of multiple pages into a pandas dataframe

In [22]:
cars_multiple_df = pd.DataFrame({"model": model,
                                 "mileage": mileage,
                                 "year": year,
                                 "dealer_name": dealer_name,
                                 "price": price})

### Store results to excel

In [23]:
cars_multiple_df.to_excel("multiple_page_cars.xlsx", index=False)