# Web project: How much do your groceries cost in Bitcoin?

Isaac Rodriguez

*Data Part Time Barcelona Dec 2019*

## Content
- [Project Description](#project)
- [Web scraping section](#web)
- [API Section](#api)
- [Merge datasets section](#merge)


<a name="project"></a>
## Project Description
The goal of this project is to choose an API to obtain data from and a web page to scrape, convert the results into pandas DataFrames, and export them as CSV files.

<a name="web"></a>

## Get all items from Ulabox. Web scraping. 

In [38]:
import requests
import pandas as pd
from bs4 import BeautifulSoup

### Generic functions

This section contains the generic functions used to get items from the Ulabox website.

In [24]:
ulabox_web = "https://ulabox.com/"

In [25]:
# This function receives a string whose UTF-8 bytes were misread as Latin-1 and returns it decoded as UTF-8.
def change_format_to_utf8(string):
    return string.encode('latin-1', 'replace').decode('utf-8', 'replace')
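
To see what this helper does, here is a minimal sketch that garbles a subcategory name on purpose and then repairs it. The garbling step is only a simulation of the encoding problem; we assume this is the kind of damage the scraped text shows.

```python
def change_format_to_utf8(string):
    return string.encode('latin-1', 'replace').decode('utf-8', 'replace')

original = "Frescos Ecológicos"

# Simulate the problem: UTF-8 bytes wrongly read as Latin-1 characters.
garbled = original.encode("utf-8").decode("latin-1")

# The helper turns the characters back into bytes and decodes them as UTF-8.
repaired = change_format_to_utf8(garbled)

print(garbled)
print(repaired)
```

The helper should only be applied to garbled text. A string that is already correct would lose its accented characters, because they are not valid UTF-8 once encoded as Latin-1.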

### Get categories

After reviewing how the website is designed, we first get all the categories and their links, which we will scrape later.

In [4]:
ulabox_content = requests.get(ulabox_web).content
ulabox_soup = BeautifulSoup(ulabox_content, "lxml")
# The first dropdown list on the home page holds the category menu.
table = ulabox_soup.find_all("ul", {"class": "list-dropdown"})[0]

rows = table.find_all("li")
links = [row.find("a", {"class": "list-dropdown-item__link list-dropdown-item-link | js-pjax js-track-ui js-updatable-track"}) for row in rows]

# Keep only the entries that have a link and drop the tracking parameter from each URL.
links = [link.get("href").split("?ula_src=")[0] for link in links if link]
rows = [change_format_to_utf8(row.text.strip()) for row in rows]

# Got all the categories from the ulabox website.
# rows[1:] skips the first <li> so that the names and the links have the same length.
df_categories = pd.DataFrame({"Categories": rows[1:], "Links": links})
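
The `split("?ula_src=")[0]` step above removes the tracking parameter from each link. As a small sketch of the same idea, the standard library can drop any query string, not only `ula_src`. The `href` value below is made up for the example.

```python
from urllib.parse import urlsplit

def strip_query(href):
    """Return only the path of a link, without query string or fragment."""
    return urlsplit(href).path

# Hypothetical link: the value after "ula_src=" is invented.
href = "/categoria/frutas/1582?ula_src=example-menu"

with_split = href.split("?ula_src=")[0]   # approach used in the notebook
with_urlsplit = strip_query(href)         # standard-library alternative

print(with_split)
print(with_urlsplit)
```

Both approaches give the same path for this link. The `urlsplit` version would also work if the site added other parameters before `ula_src`.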

### Get subcategories from each category

In this section we get all the subcategories of each category. To do so, we download the page behind each category link and scrape it to get each subcategory name and its link.

In [26]:
# DataFrame.append was removed in pandas 2.0, so we collect one frame per category and concatenate them at the end.
subcategory_frames = []

for _, category in df_categories.iterrows():
    ulabox_subcategory_web = ulabox_web + category['Links']
    ulabox_content = requests.get(ulabox_subcategory_web).content
    ulabox_soup = BeautifulSoup(ulabox_content, "lxml")

    # Each block holds one subcategory: its name is in an <h2> and its link is the first <a>.
    table = ulabox_soup.find_all("div", {"class": "col-xs-12 col-sm-4"})
    items = [block.find_all("h2", {"class": "category-item__name epsilon islet brand-face"}) for block in table]
    links = [block.find_all("a") for block in table]
    items_name = [change_format_to_utf8(item[0].text) for item in items]
    links_name = [link[0].get("href").split("?ula_src=")[0] for link in links]

    # The category name is repeated on every row of this category.
    subcategory_frames.append(pd.DataFrame({"Category": category['Categories'], "Links": links_name, "Subcategory": items_name}))

df_subcategories = pd.concat(subcategory_frames)

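We reset the index so that it is numeric and sequential. `reset_index` returns a new DataFrame, so we assign the result back.

In [27]:
df_subcategories = df_subcategories.reset_index(drop=True)
df_subcategories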

| | Category | Links | Subcategory |
|---|---|---|---|
| 0 | Frescos | /categoria/mercado/2493 | Mercado |
| 1 | Frescos | /categoria/frescos-de-temporada/2232 | Frescos de Temporada |
| 2 | Frescos | /categoria/frescos-ecologicos/2253 | Frescos Ecológicos |
| 3 | Frescos | /categoria/frutas/1582 | Frutas |
| 4 | Frescos | /categoria/pollo-y-aves/2559 | Pollo y Aves |
| ... | ... | ... | ... |
| 81 | Parafarmacia | /categoria/cuidado-capilar/664 | Cuidado Capilar |
| 82 | Parafarmacia | /categoria/alimentacion-y-cuidado-infantil/1314 | Alimentación y Cuidado Infantil |
| 83 | Mascotas | /categoria/perros/696 | Perros |
| 84 | Mascotas | /categoria/gatos/700 | Gatos |
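
The page URLs are built by concatenating `ulabox_web`, which ends with `/`, and links that start with `/`, so the result contains a double slash. The following sketch shows how `urljoin` from the standard library would combine the same two pieces; it is a possible alternative, not what the notebook uses.

```python
from urllib.parse import urljoin

ulabox_web = "https://ulabox.com/"
link = "/categoria/frutas/1582"  # taken from the table above

concatenated = ulabox_web + link      # keeps both slashes
joined = urljoin(ulabox_web, link)    # resolves the link against the base URL

print(concatenated)
print(joined)
```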


### Get items from each subcategory

Now that we have the categories and subcategories, we visit each subcategory link and save the ID, name, category, subcategory, price, currency, and brand of each item.

In [8]:
# DataFrame.append was removed in pandas 2.0, so we collect one frame per subcategory and concatenate them at the end.
product_frames = []

for _, subcategory in df_subcategories.iterrows():
    ulabox_products_web = ulabox_web + subcategory['Links']
    ulabox_content = requests.get(ulabox_products_web).content
    ulabox_soup = BeautifulSoup(ulabox_content, "lxml")

    # Skip the subcategory pages that have no product list.
    product_lists = ulabox_soup.find_all("section", {"class": "product-list"})
    if product_lists:
        table = product_lists[0]
        items = table.find_all("div", {"class": "grid__item m-one-whole t-one-third d-one-third dw-one-quarter | js-product-grid-grid"})
        # The product data lives in the data-* attributes of the first <article> of each grid cell.
        articles = [item.find_all("article")[0] for item in items if len(item.find_all("article")) > 0]

        product_price = [article.get("data-price") for article in articles]
        product_id = [article.get("data-product-id") for article in articles]
        product_brand = [change_format_to_utf8(article.get("data-product-brand")) for article in articles]
        product_name = [change_format_to_utf8(article.get("data-product-name")) for article in articles]

        product_frames.append(pd.DataFrame({"id": product_id, "name": product_name, "category": subcategory['Category'], "subcategory": subcategory['Subcategory'], "price": product_price, "currency": "EUR", "brand": product_brand}))

df_products = pd.concat(product_frames)
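
The product data is read from `data-*` attributes on each `<article>` tag. To make that structure visible without calling the website, here is a small sketch that uses the standard-library `HTMLParser` instead of BeautifulSoup. The HTML snippet and the brand name are invented; only the attribute names come from the scraping code.

```python
from html.parser import HTMLParser

# Hypothetical markup: only the data-* attribute names come from the notebook.
SAMPLE_HTML = """
<section class="product-list">
  <article data-product-id="42519" data-product-name="Aguacate Maduro"
           data-product-brand="Example Brand" data-price="3.99"></article>
  <article data-product-id="43876" data-product-name="Plátano de Canarias Verde"
           data-product-brand="Example Brand" data-price="2.47"></article>
</section>
"""

class ArticleCollector(HTMLParser):
    """Collects the data-* attributes of every <article> start tag."""

    def __init__(self):
        super().__init__()
        self.products = []

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            attributes = dict(attrs)
            self.products.append({
                "id": attributes.get("data-product-id"),
                "name": attributes.get("data-product-name"),
                "brand": attributes.get("data-product-brand"),
                "price": attributes.get("data-price"),
            })

collector = ArticleCollector()
collector.feed(SAMPLE_HTML)
products = collector.products

for product in products:
    print(product)
```

Note that every value, including the price, comes back as a string, because HTML attributes are always text.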

The prices were scraped as strings, so we convert the whole column to float.

In [9]:
df_products['price'] = df_products['price'].astype(float)
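
`astype(float)` raises an error if any value cannot be parsed as a number. As a hedged alternative, `pd.to_numeric` with `errors="coerce"` turns such values into `NaN` instead. The sample values below are made up; we do not know whether the scraped data contains entries like these.

```python
import pandas as pd

# Made-up values: "n/a" and None stand for prices that could not be scraped.
raw_prices = pd.Series(["2.98", "3.99", "n/a", None])

# Unparseable entries become NaN instead of raising an error.
prices = pd.to_numeric(raw_prices, errors="coerce")

print(prices)
```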

Because the products DataFrame was built by combining one DataFrame per subcategory, its index is not sequential. Here we reset it so that it increases from 0.

In [10]:
df_products.reset_index(drop=True, inplace=True)

Our final dataset.

In [11]:
df_products.head()

| | id | name | category | subcategory | price | currency | brand |
|---|---|---|---|---|---|---|---|
| 0 | 54521 | Ensalada Mezclum Petit Plà 250g | Frescos | Mercado | 2.98 | EUR | Fruites i Verdures Lluís Macià |
| 1 | 42519 | Aguacate Maduro | Frescos | Mercado | 3.99 | EUR | Fruites i Verdures Lluís Macià |
| 2 | 43876 | Plátano de Canarias Verde | Frescos | Mercado | 2.47 | EUR | Fruites i Verdures Lluís Macià |
| 3 | 42501 | Plátano de Canarias Maduro | Frescos | Mercado | 2.47 | EUR | Fruites i Verdures Lluís Macià |
| 4 | 42535 | Fresas de Maresme 500g | Frescos | Mercado | 3.98 | EUR | Fruites i Verdures Lluís Macià |


To finish this part, we export the dataset as a CSV file to the outputs folder.

In [12]:
df_products.to_csv("./outputs/web_scraping_items.csv")

<a name="api"></a>
## Get BTC Price. API section.

In [28]:
import requests
import pandas as pd
from pandas import json_normalize  # top-level import path since pandas 1.0; pandas.io.json no longer exposes it in pandas 2.x

### Generic functions.

In [29]:
blockchain_url = "https://blockchain.info/ticker"
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_1) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.3 Safari/605.1.15" }  

### Get current btc price

In [30]:
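# The ticker response is a dict keyed by currency code; each value holds the prices for that currency.
ticker = requests.get(blockchain_url, headers=headers).json()
currencies_array = []

for currency_code, quote in ticker.items():
    # Store the currency code inside each record so that it becomes a column.
    quote["Currency"] = currency_code
    currencies_array.append(quote)

df_btc_prices = pd.DataFrame(currencies_array)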

In [31]:
df_btc_prices.head()

| | 15m | last | buy | sell | symbol | Currency |
|---|---|---|---|---|---|---|
| 0 | 9546.87 | 9546.87 | 9546.87 | 9546.87 | $ | USD |
| 1 | 14393.55 | 14393.55 | 14393.55 | 14393.55 | $ | AUD |
| 2 | 41673.03 | 41673.03 | 41673.03 | 41673.03 | R$ | BRL |
| 3 | 12656.99 | 12656.99 | 12656.99 | 12656.99 | $ | CAD |
| 4 | 9381.41 | 9381.41 | 9381.41 | 9381.41 | CHF | CHF |
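
The loop that builds `df_btc_prices` moves each currency code into its record before creating the DataFrame. As a sketch of an alternative, pandas can build the same kind of table directly from the nested dictionary. The sample below reuses two rows from the output above and does not call the API.

```python
import pandas as pd

# Sample in the same shape as the ticker response; figures copied from the output above.
sample_ticker = {
    "USD": {"15m": 9546.87, "last": 9546.87, "buy": 9546.87, "sell": 9546.87, "symbol": "$"},
    "CHF": {"15m": 9381.41, "last": 9381.41, "buy": 9381.41, "sell": 9381.41, "symbol": "CHF"},
}

# orient="index" turns each top-level key into a row label;
# rename_axis and reset_index then move that label into a "Currency" column.
df_sample = (
    pd.DataFrame.from_dict(sample_ticker, orient="index")
    .rename_axis("Currency")
    .reset_index()
)

print(df_sample)
```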


In [32]:
df_btc_prices.to_csv("./outputs/api_btc_currencies.csv")

<a name="merge"></a>
## Merge the two datasets.

### Generic functions.

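Convert an item's price from its own currency (EUR for all our products) to BTC, by dividing it by the price of one bitcoin in that currency.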

In [None]:
def price_to_btc(row):
    currency = row['currency']
    price = row["price"]
    btc_price = get_value_from_currency(currency)
    return price / btc_price

Convert a BTC price to satoshis (1 BTC = 100,000,000 satoshis).

In [33]:
def btc_to_satoshi(row):
    price = row['price_btc']
    return price * 100000000

# Get price value based on the currency.
def get_value_from_currency(value):
    index = df_btc_prices[df_btc_prices['Currency'] == value].index.tolist()[0]
    return df_btc_prices["last"][index]
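
As a worked example of the arithmetic in these functions, the sketch below converts one price to satoshis. The rate of 10,000 EUR per bitcoin is a round number chosen for the example, not the rate returned by the API, and `Decimal` is used here (instead of the floats used in the notebook) only to keep the arithmetic exact.

```python
from decimal import Decimal

SATOSHI_PER_BTC = Decimal(100_000_000)

def eur_to_satoshi(price_eur, btc_price_eur):
    """Convert a EUR price to whole satoshis, given the EUR price of one bitcoin."""
    price_btc = Decimal(price_eur) / Decimal(btc_price_eur)
    return int((price_btc * SATOSHI_PER_BTC).to_integral_value())

# Hypothetical rate: 1 BTC = 10,000 EUR.
# 2.98 EUR / 10,000 EUR per BTC = 0.000298 BTC = 29,800 satoshis.
satoshis = eur_to_satoshi("2.98", "10000")
print(satoshis)
```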

We add a new column with the price of each item converted to bitcoin.

In [18]:
df_products['price_btc'] = df_products.apply(price_to_btc, axis=1)

We add a new column with each bitcoin price converted to satoshis.

In [34]:
df_products['price_satoshi'] = df_products.apply(btc_to_satoshi, axis=1)

In [37]:
df_products.head()

| | id | name | category | subcategory | price | currency | brand | price_btc | price_satoshi |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 54521 | Ensalada Mezclum Petit Plà 250g | Frescos | Mercado | 2.98 | EUR | Fruites i Verdures Lluís Macià | 0.000335 | 33536.012388 |
| 1 | 42519 | Aguacate Maduro | Frescos | Mercado | 3.99 | EUR | Fruites i Verdures Lluís Macià | 0.000449 | 44902.244775 |
| 2 | 43876 | Plátano de Canarias Verde | Frescos | Mercado | 2.47 | EUR | Fruites i Verdures Lluís Macià | 0.000278 | 27796.627718 |
| 3 | 42501 | Plátano de Canarias Maduro | Frescos | Mercado | 2.47 | EUR | Fruites i Verdures Lluís Macià | 0.000278 | 27796.627718 |
| 4 | 42535 | Fresas de Maresme 500g | Frescos | Mercado | 3.98 | EUR | Fruites i Verdures Lluís Macià | 0.000448 | 44789.70782 |
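
The two `apply` calls run a Python function once per row. A possible alternative, sketched here on made-up data, is to look up the rate for each row with `map` and then divide whole columns at once. The rates in `df_rates` are invented for the example.

```python
import pandas as pd

SATOSHI_PER_BTC = 100_000_000

# Hypothetical rates: the prices of one bitcoin below are invented.
df_rates = pd.DataFrame({"Currency": ["USD", "EUR"], "last": [11000.0, 10000.0]})
df_items = pd.DataFrame({
    "name": ["Aguacate Maduro", "Fresas de Maresme 500g"],
    "price": [3.99, 3.98],
    "currency": ["EUR", "EUR"],
})

# Look up the bitcoin price for each row's currency, then compute both columns in one step each.
rate_by_currency = df_rates.set_index("Currency")["last"]
df_items["price_btc"] = df_items["price"] / df_items["currency"].map(rate_by_currency)
df_items["price_satoshi"] = df_items["price_btc"] * SATOSHI_PER_BTC

print(df_items)
```

With only one currency in the dataset the result is the same as with `apply`; the column-wise version mainly saves time on large tables.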


#### We finally export our final dataset! 

In [36]:
df_products.to_csv("./outputs/final_ulabox_products.csv")