### Scraping MercadoLibre smartphone listings:
An analysis of e-commerce listings and sales of smartphones in Peru

### Notes:
#### Ideas:
1. Expand to other markets to do a comparative analysis
2. Compare a rich and a poor country
3. This would involve currency conversion

#### Business case:
* A Chilean entrepreneur wants to enter the Peruvian smartphone market.
* Which sellers and products are most successful? 
* Which features are most important to consumers? 
* (Maybe, how is the Peruvian market different from the Chilean market?)

In [171]:
from bs4 import BeautifulSoup
import requests
import numpy as np
import time

headers={'User-Agent':
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'}


### Part 1: scrape the URLs for the listings that we will use later

In [2]:
# create response object
response = requests.get('https://listado.mercadolibre.com.pe/celulares-telefonos/celulares-smartphones', headers = headers)


In [3]:
# pull the text
text = BeautifulSoup(response.content,'html.parser')


In [4]:
#obtain titles from the products to get number of products per page
titles=text.find_all('div', attrs={'class':'ui-search-item__group ui-search-item__group--title'})

len(titles)

53

In [6]:
# examine the title
titles[1]

<div class="ui-search-item__group ui-search-item__group--title"><span class="ui-search-item__brand-discoverability ui-search-item__group__element"></span><a class="ui-search-item__group__element ui-search-link" href="https://articulo.mercadolibre.com.pe/MPE-607311001-xiaomi-redmi-note-11-4gb-128gb-original-tienda-oficial-_JM?searchVariation=174390642601#searchVariation=174390642601&amp;position=2&amp;search_layout=stack&amp;type=item&amp;tracking_id=77df895c-6854-4f2d-8a20-d4a43a730418" title="Xiaomi Redmi Note 11 4gb-128gb Original - Tienda Oficial"><h2 class="ui-search-item__title">Xiaomi Redmi Note 11 4gb-128gb Original - Tienda Oficial</h2></a><a class="ui-search-official-store-item__link ui-search-link" href="https://tienda.mercadolibre.com.pe/xiaomi"><p class="ui-search-official-store-label ui-search-item__group__element ui-search-color--GRAY">Vendido por Xiaomi</p></a></div>

In [7]:
# extract the listing URLs from the title blocks into a list we can iterate over
phone_urls=[tag.find('a').get('href') for tag in titles]
print(phone_urls[4:5])

['https://articulo.mercadolibre.com.pe/MPE-609274773-poco-m4-pro-8gb-ram-256gb-rom-_JM?searchVariation=174440426312#searchVariation=174440426312&position=5&search_layout=stack&type=item&tracking_id=77df895c-6854-4f2d-8a20-d4a43a730418']


In [10]:
# create a list of results pages from which we can reach the individual product listing pages
results_pages = [f'https://listado.mercadolibre.com.pe/celulares-telefonos/celulares-smartphones/_Desde_{i}_NoIndex_True' for i in range(1,2001,50)]


In [26]:
# examine the results page list
results_pages[:1]

['https://listado.mercadolibre.com.pe/celulares-telefonos/celulares-smartphones/_Desde_1_NoIndex_True']

### Part 2: identify the data we will scrape from each individual listing

In [39]:
# explore the individual product listing to obtain the fields we want to scrape
response=requests.get('https://articulo.mercadolibre.com.pe/MPE-609274773-poco-m4-pro-8gb-ram-256gb-rom-_JM?searchVariation=174440426312#searchVariation=174440426312&position=5&search_layout=stack&type=item&tracking_id=77df895c-6854-4f2d-8a20-d4a43a730418', 
                      headers=headers)
text= BeautifulSoup(response.text, 'html.parser')

In [40]:
#title of item from item page
title = text.find('div', attrs = {'class':'ui-pdp-header__title-container'}).find('h1').string
print(title)

Poco M4 Pro 8gb Ram 256gb Rom


In [78]:
# fetch one seller page so we can explore which fields to scrape from it

response2=requests.get('https://perfil.mercadolibre.com.pe/IXCOMERCIO+PERU?brandId=159', 
                      headers=headers)
text2= BeautifulSoup(response2.text, 'html.parser')


In [96]:
# obtain total store reviews and good store reviews from the linked seller page
store_reviews=text2.find('section',{'class':'buyers-feedback-section'}).find('span').string
print(store_reviews)
good_str_reviews=text2.find('span',{'id':'feedback_good', 'class':'buyers-feedback-qualification'})
print(good_str_reviews)


227
<span class="buyers-feedback-qualification" id="feedback_good">Buena<!-- --> (<!-- -->193<!-- -->)</span>
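
The `feedback_good` span above prints as raw markup, with HTML comments between the label and the count. As a minimal sketch of how the count could be pulled out, assuming the label always has the shape `Label (count)` with plain digits, the hypothetical helper `parse_feedback_count` below strips the comments and reads the number in parentheses. It works on plain strings, so it can be applied to `str(good_str_reviews)` or to the value saved in the CSV.

```python
import re

def parse_feedback_count(label):
    """Return the integer in parentheses from a label such as 'Buena (193)'.

    HTML comments are removed first, so the raw <span> markup also works.
    Returns None when no '(digits)' group is present.
    """
    without_comments = re.sub(r'<!--.*?-->', '', label)
    match = re.search(r'\((\d+)\)', without_comments)
    return int(match.group(1)) if match else None

# the markup printed by the cell above
raw = ('<span class="buyers-feedback-qualification" id="feedback_good">'
       'Buena<!-- --> (<!-- -->193<!-- -->)</span>')
print(parse_feedback_count(raw))
```

Counts written with thousands separators are not handled here; that case would need an extra rule once the actual format is confirmed.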


In [64]:
# here we identify the standard fields we want from the individual product listings

#actual price (with potential discount accounted for)
price_in_soles=text.find('span',attrs={'class':'andes-money-amount ui-pdp-price__part andes-money-amount--cents-superscript andes-money-amount--compact'}).find('span').string
print(price_in_soles)
#number available
available = text.find('span',attrs={'class':'ui-pdp-buybox__quantity__available'}).string
print(available)
#new or used
newstatus_sales=text.find('span',attrs={'class':'ui-pdp-subtitle'}).string
print(newstatus_sales)
#publication number bottom right corner
#seller id
seller_id = text.find('div', {'class':'ui-box-component-pdp__visible--desktop'}).find('a').get('href')
print(seller_id)
#seller sales
seller_sales = text.find('strong', {'class':'ui-pdp-seller__sales-description'}).string
print(seller_sales)
# page id
page_id= text.find('span',{'class':'ui-pdp-color--BLACK ui-pdp-family--SEMIBOLD'}).string
print(page_id)
#number of ratings
ratings=text.find('span',{'class':'ui-pdp-review__amount'}).string
print(ratings)

# product rating
product_rating=text.find('header',{'class':'ui-review-view__header'}).find('p').string

#Try to get other seller details from product page
seller_svc_level=text.find('div',{'class':'ui-seller-info'}).find('ul').get('value')



929 soles
(88 disponibles)
Nuevo  |  242 vendidos
https://perfil.mercadolibre.com.pe/IXCOMERCIO+PERU?brandId=159
3424
#609274773
(43)
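
The fields printed above are still raw strings: the price carries its currency word, the counts are wrapped in parentheses, and the condition and units sold share one field. The sketch below shows one way these could be turned into numbers before analysis. The helper names `first_int` and `split_condition_and_sales` are hypothetical, and the code assumes the values look like the sample output above (plain digits, no thousands separators).

```python
import re

def first_int(raw):
    """Return the first run of digits in raw as an int, or None if there is none."""
    match = re.search(r'\d+', raw)
    return int(match.group()) if match else None

def split_condition_and_sales(raw):
    """Split a subtitle such as 'Nuevo  |  242 vendidos' into ('Nuevo', 242).

    If there is no '|' separator, the units sold come back as None.
    """
    condition, _, sales = raw.partition('|')
    return condition.strip(), first_int(sales)

# sample values copied from the output above
print(first_int('929 soles'))
print(first_int('(88 disponibles)'))
print(first_int('#609274773'))
print(split_condition_and_sales('Nuevo  |  242 vendidos'))
```

A price written with a separator would only yield its leading digit group, so that format should be checked against real listings before relying on `first_int` for prices.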


### Bringing the code together to execute the scraping:
#### Step 1: scrape the URLS of the individual product listings

In [1]:
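from bs4 import BeautifulSoup
import requests
import numpy as np  # used for missing values in the steps below
import time  # used to pause between seller page requests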
headers= {'User-Agent':
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'}

results_pages = [f'https://listado.mercadolibre.com.pe/celulares-telefonos/celulares-smartphones/_Desde_{i}_NoIndex_True' for i in range(1,2001,50)]

phone_page_urls=[]

for url in results_pages:
    response = requests.get(url, headers = headers)
    
    # stop before parsing if the request did not succeed
    if response.status_code!=200:
        raise Exception(f'The status code is not 200! It is {response.status_code}.')
        
    text = BeautifulSoup(response.content,'html.parser')
    
    # each image block wraps a link to the product listing
    titles=text.find_all('div', attrs={'class':'ui-search-result__image'})
    phone_urls=[tag.find('a').get('href') for tag in titles]
    
    phone_page_urls.extend(phone_urls)


In [88]:
print(len(phone_page_urls))
print(phone_page_urls[589:592])

2114
['https://articulo.mercadolibre.com.pe/MPE-444989657-apple-iphone-xs-256gb-libre-reacondicionado-silver-_JM?searchVariation=82237523651#searchVariation=82237523651&position=36&search_layout=stack&type=item&tracking_id=9fd0252b-37cd-4c06-89ac-41799045edc3', 'https://articulo.mercadolibre.com.pe/MPE-602012973-celular-nuevo-smooth-bommer-teclado-grande-marcacion-rapida-_JM?searchVariation=173764271353#searchVariation=173764271353&position=37&search_layout=stack&type=item&tracking_id=9fd0252b-37cd-4c06-89ac-41799045edc3', 'https://articulo.mercadolibre.com.pe/MPE-611805118-xiaomi-poco-f4-gt-global-12gb-ram-256gb-120w-stock-_JM?searchVariation=174462962758#searchVariation=174462962758&position=38&search_layout=stack&type=item&tracking_id=9fd0252b-37cd-4c06-89ac-41799045edc3']
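
Each collected URL carries a query string and fragment (`searchVariation`, `position`, `tracking_id`), so the same listing reached from two results pages would appear as two different strings. The sketch below shows one possible way to reduce the list to unique listings using only the path. The helper names are hypothetical, and the approach assumes the path alone identifies a listing; dropping `searchVariation` may merge variants of the same product, which should be checked before using it.

```python
from urllib.parse import urlsplit, urlunsplit

def strip_tracking(url):
    """Drop the query string and fragment, keeping scheme, host and path."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, '', ''))

def unique_listings(urls):
    """Return the stripped URLs in first-seen order, without duplicates."""
    seen = set()
    unique = []
    for url in urls:
        clean = strip_tracking(url)
        if clean not in seen:
            seen.add(clean)
            unique.append(clean)
    return unique

sample = [
    'https://articulo.mercadolibre.com.pe/MPE-609274773-poco-m4-pro-8gb-ram-256gb-rom-_JM?searchVariation=174440426312#position=5',
    'https://articulo.mercadolibre.com.pe/MPE-609274773-poco-m4-pro-8gb-ram-256gb-rom-_JM?searchVariation=174440426312#position=12',
]
print(unique_listings(sample))
```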


#### Step 2: Scrape the individual product listings

In [76]:
phone_list=[]

for url in phone_page_urls:
        phone_dict={}
        
        response=requests.get(url,headers=headers)
        
        # stop before parsing if the request did not succeed
        if response.status_code != 200:
            raise Exception(f'The status code is not 200! It is {response.status_code}.')   
        
        text=BeautifulSoup(response.text, 'html.parser')
        
        # each field falls back to NaN when its tag is missing:
        # find() returns None, and the chained call then raises AttributeError
        try:
            #get title
            title = text.find('div', attrs = {'class':'ui-pdp-header__title-container'}).find('h1').string
        except AttributeError:
            title=np.nan
        
        try:
            # page id (publication number)
            page_id= text.find('span',{'class':'ui-pdp-color--BLACK ui-pdp-family--SEMIBOLD'}).string
        except AttributeError:
            page_id=np.nan
        
        try:
            #get price
            price_in_soles=text.find('span',attrs={'class':'andes-money-amount ui-pdp-price__part andes-money-amount--cents-superscript andes-money-amount--compact'}).find('span').string
        except AttributeError:
            price_in_soles=np.nan
        
        try:
            #condition (new or used) and units sold, e.g. "Nuevo  |  242 vendidos"
            newstatus_sales=text.find('span',attrs={'class':'ui-pdp-subtitle'}).string
        except AttributeError:
            newstatus_sales=np.nan
        
        try:
            #number available
            total_units_available = text.find('span',attrs={'class':'ui-pdp-buybox__quantity__available'}).string
        except AttributeError:
            total_units_available=np.nan
        
        try:
            #actual product rating
            product_rating=text.find('header',{'class':'ui-review-view__header'}).find('p').string
        except AttributeError:
            product_rating=np.nan
        
        try:
            #number of ratings
            total_product_ratings=text.find('span',{'class':'ui-pdp-review__amount'}).string
        except AttributeError:
            total_product_ratings=np.nan
        
        try:
            #seller id (URL of the seller page)
            seller_id = text.find('div', {'class':'ui-box-component-pdp__visible--desktop'}).find('a').get('href')
        except AttributeError:
            seller_id=np.nan
        
        try:
            #seller sales
            total_seller_sales = text.find('strong', {'class':'ui-pdp-seller__sales-description'}).string
        except AttributeError:
            total_seller_sales=np.nan
        
        try:
            #seller service level 1-5
            seller_svc_level=text.find('div',{'class':'ui-seller-info'}).find('ul').get('value')
        except AttributeError:
            seller_svc_level=np.nan
        
        #store info in dictionary
        phone_dict['title']=title
        phone_dict['page_id']=page_id
        phone_dict['price_in_soles']=price_in_soles
        phone_dict['newstatus_sales']=newstatus_sales
        phone_dict['total_units_available']=total_units_available
        phone_dict['product_rating']=product_rating
        phone_dict['total_product_ratings']=total_product_ratings
        
        phone_dict['seller_id']=seller_id
        phone_dict['total_seller_sales']=total_seller_sales
        phone_dict['seller_svc_level']=seller_svc_level
        

        
        #add dictionary to list
        phone_list.append(phone_dict)
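
The loop above repeats the same try/except pattern once per field. As a sketch of one way to shorten it, the hypothetical helper `safe_extract` below wraps a single lookup and returns a default when a missing tag raises `AttributeError`. It uses `math.nan` so the example runs without NumPy; in the notebook the default could just as well be `np.nan`.

```python
import math

def safe_extract(extractor, default=math.nan):
    """Run extractor() and return its result.

    If a lookup in the chain returned None, the next attribute access raises
    AttributeError; in that case return the default instead.
    """
    try:
        return extractor()
    except AttributeError:
        return default

# stand-in for a page where find() found nothing
missing_tag = None
print(safe_extract(lambda: missing_tag.string))

# in the notebook, a field lookup might then read (not executed here):
# title = safe_extract(lambda: text.find('div', attrs={'class': 'ui-pdp-header__title-container'}).find('h1').string)
```

Catching only `AttributeError` keeps other failures, such as a keyboard interrupt or a network error, visible instead of silently turning them into missing values.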
        

In [1]:
# examine the product listing data
phone_list[1]

NameError: name 'phone_list' is not defined

#### Step 3: Scrape the seller page URLs

In [124]:
#obtain just a list of seller URLs
seller_list=[]

for url in phone_page_urls:
    
    seller_dict={}
        
    response=requests.get(url,headers=headers)
        
    # stop before parsing if the request did not succeed
    if response.status_code != 200:
        raise Exception(f'The status code is not 200! It is {response.status_code}.')   
          
    text=BeautifulSoup(response.text, 'html.parser')
        
    try:
        # the seller link; find() returns None when the block is absent
        seller_id = text.find('div', {'class':'ui-box-component-pdp__visible--desktop'}).find('a').get('href')
    except AttributeError:
        seller_id='missing!'
            
            
    seller_dict['seller_id']=seller_id
    seller_list.append(seller_dict)

#create a set of seller urls and then iterate over that
seller_urls=[val for i in seller_list for val in i.values()]
seller_unique_urls=set(seller_urls)

#### Step 4: Scrape the two pieces of data from the sellers' pages

In [172]:
#scrape the two key data points from the seller pages

seller_data=[]

for url in seller_unique_urls:
    # skip listings for which no seller link was found in Step 3
    if url=='missing!':
        continue

    #obtain information from the linked seller page
    response2=requests.get(url,headers=headers)
    text2=BeautifulSoup(response2.text, 'html.parser')

    if response2.status_code != 200:
        # back off before the next request; fields missing from a failed page fall back to NaN below
        time.sleep(3)

    try:
        #number of store reviews
        store_reviews=text2.find('section',{'class':'buyers-feedback-section'}).find('span').string
    except AttributeError:
        # reset on every iteration so a value from the previous seller is never reused
        store_reviews=np.nan

    #number of good store reviews (kept as the raw <span> tag); find() returns None when it is absent
    good_str_reviews=text2.find('span',{'id':'feedback_good', 'class':'buyers-feedback-qualification'})
    if good_str_reviews is None:
        good_str_reviews=np.nan

    #store the seller attributes
    seller_dict={}
    seller_dict['url']=url
    seller_dict['store_reviews']=store_reviews
    seller_dict['good_str_reviews']=good_str_reviews

    seller_data.append(seller_dict)
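
The seller loop pauses when a response is not 200 but does not request the page again. A retry with an increasing wait is one possible refinement, sketched below. The helper `get_with_retry` and its parameters are hypothetical; the request itself is passed in as a callable so the example runs without network access. In the notebook that callable could be `lambda u: requests.get(u, headers=headers)`.

```python
import time

def get_with_retry(fetch, url, retries=3, wait_seconds=3, sleep=time.sleep):
    """Call fetch(url) until it returns status 200 or the attempts run out.

    The wait grows linearly (3s, 6s, ...) between attempts. The last response
    is returned either way, so the caller can still inspect its status code.
    """
    response = None
    for attempt in range(retries):
        response = fetch(url)
        if response.status_code == 200:
            return response
        if attempt < retries - 1:
            sleep(wait_seconds * (attempt + 1))
    return response

# stand-in for requests.get: fails twice, then succeeds
class FakeResponse:
    def __init__(self, status_code):
        self.status_code = status_code

codes = iter([503, 503, 200])
waits = []
result = get_with_retry(lambda u: FakeResponse(next(codes)),
                        'https://example.invalid/seller', sleep=waits.append)
print(result.status_code, waits)
```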

#### Step 5: Export the sellers' data to one csv

In [176]:
#seller_data

#this is for exporting the seller (vendor) data
import csv

with open('vendors.csv','w',encoding='utf-8', newline='') as csvfile:
    vendor_writer=csv.writer(csvfile)
    vendor_writer.writerow(['url',
                            'store_reviews',
                           'good_str_reviews'])
    
    for vendor_dict in seller_data:
        vendor_writer.writerow(vendor_dict.values())

#### Step 6: Export the product listings' data scraped above

In [81]:
#this is for exporting cellphone product listing data
import csv

with open('phones.csv','w',encoding='utf-8', newline='') as csvfile:
    phone_writer=csv.writer(csvfile)
    phone_writer.writerow(['title',
                           'page_id',
                           'price_in_soles',
                           'newstatus_sales',
                           'total_units_available',
                           'product_rating',
                           'total_product_ratings',
                           'seller_id',
                           'total_seller_sales',
                           'seller_svc_level'])
    
    for phone_dict in phone_list:
        phone_writer.writerow(phone_dict.values())
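
Both export cells write `dict.values()` under a hand-typed header, which only lines up if every dictionary was filled in the same key order as the header. As a hedged alternative, `csv.DictWriter` from the standard library matches values to columns by key. The sketch below writes to an in-memory buffer with a shortened, illustrative field list; `write_rows` is a hypothetical helper name.

```python
import csv
import io

# shortened field list for illustration; the notebook has ten phone columns
FIELDS = ['title', 'page_id', 'price_in_soles']

def write_rows(rows, fieldnames, csvfile):
    """Write a header and one line per dict, matching values to columns by key."""
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames, restval='')
    writer.writeheader()
    writer.writerows(rows)

# keys deliberately out of order, and one row with a missing field
rows = [
    {'page_id': '#609274773', 'price_in_soles': '929 soles', 'title': 'Poco M4 Pro 8gb Ram 256gb Rom'},
    {'title': 'Xiaomi Redmi Note 11 4gb-128gb Original - Tienda Oficial'},
]
buffer = io.StringIO()
write_rows(rows, FIELDS, buffer)
print(buffer.getvalue())
```

With a real file, the same call would take the handle opened with `open('phones.csv', 'w', encoding='utf-8', newline='')`.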

#### Subsequent steps: read the csv files into R and conduct data cleaning and EDA