# **Green Bean Price Tracker**

## Using **BeautifulSoup** and **Requests** to pass a Login form

First we import the required libraries. The import is successful if there are no errors.

In [1]:
# Standard imports
import pandas as pd 
import requests
from bs4 import BeautifulSoup

# Sensitive data imports
import config as cfg 

__Requests__ is an elegant and simple HTTP library for Python, built for human beings. Requests allows you to send HTTP/1.1 requests extremely easily. There’s no need to manually add query strings to your URLs, or to form-encode your POST data. Keep-alive and HTTP connection pooling are 100% automatic, thanks to urllib3.

__Beautiful Soup__ is a Python library for pulling data out of HTML and XML files; here we work with HTML. It represents the HTML as a set of objects with methods for parsing it. We can navigate the HTML as a tree and/or filter out what we are looking for.
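A minimal sketch of both ideas, navigating and filtering, may help here. The HTML fragment below is made up for illustration, and it uses Python's built-in `html.parser` rather than the `html5lib` parser used later in this notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment, made up for illustration.
html = """
<html><body>
  <h1>Offer list</h1>
  <ul>
    <li class="origin">BRAZIL</li>
    <li class="origin">MEXICO</li>
  </ul>
</body></html>
"""

# "html.parser" ships with Python, so no extra parser is needed here.
demo_soup = BeautifulSoup(html, "html.parser")

# Navigate the tree: attribute access returns the first matching tag.
heading = demo_soup.h1.get_text()

# Filter the tree: find_all returns every tag that matches.
origins = [li.get_text() for li in demo_soup.find_all("li", class_="origin")]

print(heading)
print(origins)
```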

In [2]:
# URL of the page we are going to scrape.
url = "https://offerlist.rehmcoffee.de"

Below we will use a `Session` object because:
+ it allows us to persist certain parameters across requests
+ it persists cookies across all requests made from the Session instance 
+ it has all the methods of the main Requests API
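The points above can be sketched without touching the network. The header value, cookie name and `example.com` URL below are placeholders chosen for this illustration; the request is only prepared, never sent.

```python
import requests

demo_session = requests.Session()

# Parameters set on the session are reused by every request made through it.
demo_session.headers.update({"User-Agent": "green-bean-price-tracker-demo"})

# Cookies normally come from server responses; one is set by hand here
# only to show where the session keeps them.
demo_session.cookies.set("demo_cookie", "abc123")

# Preparing (not sending) a request shows the merged session headers.
prepared = demo_session.prepare_request(
    requests.Request("GET", "https://example.com/")
)

print(prepared.headers["User-Agent"])
print(demo_session.cookies.get("demo_cookie"))
```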

In [3]:
s = requests.Session()

Firstly, we use `Requests` to get access to the page content.

In [4]:
# Using the get method of the Session object to fetch the page content.
data = s.get(url)

In [5]:
# Printing out the URL of the page to check that everything works properly.
data.url

'https://offerlist.rehmcoffee.de/'

In [None]:
# Reading the content attribute to see the login page before we try to log in.
data.content

To parse a document, we pass it into the `BeautifulSoup` constructor. This creates a `BeautifulSoup` object, *soup*, which represents the document as a nested data structure.

In [6]:
soup = BeautifulSoup(data.content,"html5lib")

In [7]:
# Using prettify() method  to display the HTML in the nested structure:
print(soup.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <!-- 
	This website is powered by TYPO3 - inspiring people to share!
	TYPO3 is a free open source Content Management Framework initially created by Kasper Skaarhoj and licensed under GNU/GPL.
	TYPO3 is copyright 1998-2021 of Kasper Skaarhoj. Extensions are copyright of their respective owners.
	Information and contribution at https://typo3.org/
-->
  <title>
   Rehm &amp; Co.
  </title>
  <meta content="TYPO3 CMS" name="generator"/>
  <meta content="Rehm &amp; Co" property="og:site_name"/>
  <link href="/typo3temp/assets/css/d42b6e1bdf.css?1592917648" media="all" rel="stylesheet" type="text/css"/>
  <link href="/fileadmin/rehm/Resources/Public/StyleSheet/reset.css?1592910509" media="all" rel="stylesheet" type="text/css"/>
  <link href="/fileadmin/rehm/Resources/Public/StyleSheet/fontawesome.min.css?1592910510" media="all" rel="stylesheet" type="text/css"/>
  <link href="https://d1a7bb4s34c11s.cloudfront.net/cookiejar.

Next, we look for the inputs that must be submitted with the login form. Below we can see that not only _user_ and _password_ but also a few extra tokens are required.

In [8]:
list_input = soup.find_all("input")
list_input

[<input id="user" name="user" placeholder="Username" type="text" value=""/>,
 <input data-rsa-encryption="" id="pass" name="pass" placeholder="Password" type="password" value=""/>,
 <input name="logintype" type="hidden" value="login"/>,
 <input name="pid" type="hidden" value="77"/>,
 <input name="redirect_url" type="hidden" value=""/>,
 <input name="tx_felogin_pi1[noredirect]" type="hidden" value="0"/>]

In [9]:
token_nr1 = soup.find("input", {"name":"logintype"})["value"]
token_nr2 = soup.find("input", {"name":"pid"})["value"]
token_nr3 = soup.find("input", {"name":"redirect_url"})["value"]
token_nr4 = soup.find("input", {"name":"tx_felogin_pi1[noredirect]"})["value"]

In [10]:
print(token_nr1)
print(token_nr2)
print(token_nr3)
print(token_nr4)

login
77

0
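As an alternative to one variable per token, the hidden fields could be collected in a single pass. This is only a sketch: the HTML is copied from the list of inputs printed above, `hidden_fields` is a name introduced here, and the built-in `html.parser` is used instead of `html5lib`.

```python
from bs4 import BeautifulSoup

# Copied from the list of inputs printed above.
form_html = """
<input id="user" name="user" placeholder="Username" type="text" value=""/>
<input data-rsa-encryption="" id="pass" name="pass" placeholder="Password" type="password" value=""/>
<input name="logintype" type="hidden" value="login"/>
<input name="pid" type="hidden" value="77"/>
<input name="redirect_url" type="hidden" value=""/>
<input name="tx_felogin_pi1[noredirect]" type="hidden" value="0"/>
"""

form_soup = BeautifulSoup(form_html, "html.parser")

# Map the name of each hidden field to its value.
hidden_fields = {
    tag["name"]: tag.get("value", "")
    for tag in form_soup.find_all("input", attrs={"type": "hidden"})
}
print(hidden_fields)
```

If the site adds or renames a hidden field, this version picks it up without a code change.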


The username and password are sensitive information. The trader's website has no option to register, but a company representative can issue credentials for your business on request, once the required company information has been provided.



__config.py__ consists of a single line of code, a dictionary in the following format:

`login_data = {'user': 'name@company-name.com', 'pass': 'password_received_from_trader'}`

The config file has to be created in the root directory of the project. It is used in the code as follows:

In [11]:
#login_data = {"user":"mymail@company-name.com", "pass":"your_password", "logintype":token_nr1, "pid":token_nr2, "redirect_url":token_nr3, "tx_felogin_pi1[noredirect]":token_nr4}
login_data = cfg.login_data
login_data["logintype"] = token_nr1
login_data["pid"] = token_nr2
login_data["redirect_url"] = token_nr3
login_data["tx_felogin_pi1[noredirect]"] = token_nr4
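Because `login_data` refers to the same dictionary object as `cfg.login_data`, the assignments above also add the token keys to the imported config dictionary. If that is not wanted, one option is to build a separate payload. In this sketch `credentials`, `hidden_fields` and `payload` are stand-in names, and the values mirror the ones shown earlier.

```python
# Stand-in for cfg.login_data; real credentials stay in config.py.
credentials = {"user": "name@company-name.com", "pass": "password_received_from_trader"}

# Stand-in for the values read from the hidden form fields.
hidden_fields = {
    "logintype": "login",
    "pid": "77",
    "redirect_url": "",
    "tx_felogin_pi1[noredirect]": "0",
}

# Merge into a new dict so the credentials dict itself is left unchanged.
payload = {**credentials, **hidden_fields}

print(sorted(payload))
print(credentials)
```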

In [12]:
s.post(url, login_data)

<Response [200]>

__Status code 200__ in an HTTP response stands for OK (any 2xx code stands for success). If we skipped `Session()` and used plain `requests.get(url)` above, the response code would still be OK, but we would not stay logged in and would land on the login form again.
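Since a 200 response does not by itself prove that the login worked, one possible check is whether the returned page still contains a password field. The helper `login_form_present` and both HTML snippets below are hypothetical, written only to illustrate the idea.

```python
from bs4 import BeautifulSoup


def login_form_present(html):
    """Return True if the page still contains a password input."""
    page = BeautifulSoup(html, "html.parser")
    return page.find("input", attrs={"type": "password"}) is not None


# Hypothetical page bodies, made up for illustration.
login_page = '<form><input name="user" type="text"/><input name="pass" type="password"/></form>'
offer_page = '<table id="datatables"><tr><td>BRAZIL</td></tr></table>'

print(login_form_present(login_page))
print(login_form_present(offer_page))
```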

In [13]:
home_page = s.get(url)

In [14]:
home_page.content

b'<!DOCTYPE html>\r\n<html lang="en">\r\n<head>\r\n\r\n<meta charset="utf-8">\r\n<!-- \n\tThis website is powered by TYPO3 - inspiring people to share!\r\n\tTYPO3 is a free open source Content Management Framework initially created by Kasper Skaarhoj and licensed under GNU/GPL.\r\n\tTYPO3 is copyright 1998-2021 of Kasper Skaarhoj. Extensions are copyright of their respective owners.\r\n\tInformation and contribution at https://typo3.org/\r\n-->\r\n\r\n\r\n\r\n<title>Rehm &amp; Co.</title>\r\n<meta name="generator" content="TYPO3 CMS" />\n<meta property="og:site_name" content="Rehm &amp; Co" />\r\n\r\n\r\n<link rel="stylesheet" type="text/css" href="/typo3temp/assets/css/d42b6e1bdf.css?1592917648" media="all">\n<link rel="stylesheet" type="text/css" href="/fileadmin/rehm/Resources/Public/StyleSheet/reset.css?1592910509" media="all">\n<link rel="stylesheet" type="text/css" href="/fileadmin/rehm/Resources/Public/StyleSheet/fontawesome.min.css?1592910510" media="all">\n<link rel="styleshee

Going through the content, we see that we successfully passed the login form and have access to the data we are going to scrape.

## Using **BeautifulSoup** and **Pandas** to extract the data into a DataFrame

Now we pass the HTML into the `BeautifulSoup` constructor. This creates a new `BeautifulSoup` object, *soup_home_page*, which represents the HTML code as a nested data structure.

In [15]:
soup_home_page = BeautifulSoup(home_page.content,"html5lib")

In [16]:
print(soup_home_page.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <!-- 
	This website is powered by TYPO3 - inspiring people to share!
	TYPO3 is a free open source Content Management Framework initially created by Kasper Skaarhoj and licensed under GNU/GPL.
	TYPO3 is copyright 1998-2021 of Kasper Skaarhoj. Extensions are copyright of their respective owners.
	Information and contribution at https://typo3.org/
-->
  <title>
   Rehm &amp; Co.
  </title>
  <meta content="TYPO3 CMS" name="generator"/>
  <meta content="Rehm &amp; Co" property="og:site_name"/>
  <link href="/typo3temp/assets/css/d42b6e1bdf.css?1592917648" media="all" rel="stylesheet" type="text/css"/>
  <link href="/fileadmin/rehm/Resources/Public/StyleSheet/reset.css?1592910509" media="all" rel="stylesheet" type="text/css"/>
  <link href="/fileadmin/rehm/Resources/Public/StyleSheet/fontawesome.min.css?1592910510" media="all" rel="stylesheet" type="text/css"/>
  <link href="https://d1a7bb4s34c11s.cloudfront.net/cookiejar.

In [17]:
# Checking how many tables are on page.

tables = soup_home_page.find_all('table')
len(tables)

3

In [18]:
tables

[<table cellpadding="0" cellspacing="0">
 								<thead>
 									<tr>
 										<th>Month</th>
 										<th>NYC</th>
 										<th>Month</th>
 										<th>LDN</th>
 									</tr>
 								</thead>
 								<tbody>
 									
 										<tr>
 											<td>Jul 21</td>
 											<td>131,90</td>
 											<td>Jul 21</td>
 											<td>1385</td>
 										</tr>
 									
 										<tr>
 											<td>Sep 21</td>
 											<td>133,85</td>
 											<td>Sep 21</td>
 											<td>1401</td>
 										</tr>
 									
 										<tr>
 											<td>Dec 21</td>
 											<td>136,35</td>
 											<td>Nov 21</td>
 											<td>1420</td>
 										</tr>
 									
 										<tr>
 											<td>USD / EUR</td>
 											<td>1,20</td>
 											<td></td>
 											<td></td>
 										</tr>
 									
 								</tbody>
 							</table>,
 <table cellpadding="0" cellspacing="0">
 								<tbody>
 									<!-- <tr>
 										<td>Sina Albrecht</td>
 									

The output shows 3 tables. If we visit https://offerlist.rehmcoffee.de/ we see exactly 3 tables: Stock Exchange, Your Contact, and a third one, which is the one we are actually interested in.

Now we automate the choice of table. The loop below searches each table for a given word, stores the index of the matching table, and prints it. The next cell then displays the HTML of that table as a nested structure. (The search word must be unique to the table we are looking for.)

In [19]:
for index,table in enumerate(tables):
    if ("almond" in str(table)):
        table_index = index
        
print(table_index)

2


In [20]:
print(tables[table_index].prettify())

<table id="datatables" width="100%">
 <thead>
  <tr>
   <th class="sort" data-name="origin">
    Origin
   </th>
   <th class="sort" data-name="coffee">
    Coffee
   </th>
   <th class="sort" data-name="bags">
    Bags
   </th>
   <th class="no-sort" data-name="kg">
    Unit
   </th>
   <!-- <th data-name="packaging" class="no-sort">Packaging</th> -->
   <th class="sort" data-name="farmname">
    Farm / Name
   </th>
   <th class="sort" data-name="process">
    Process
   </th>
   <th class="sort" data-name="cert">
    Cert.
   </th>
   <!-- <th data-name="category" class="sort">Category</th> -->
   <!-- <th data-name="producer" class="sort">Producer</th> -->
   <th class="sort" data-name="cupprofile">
    Cup Profile
   </th>
   <!-- <th data-name="variety" class="sort">Variety</th> -->
   <!-- <th data-name="region" class="sort">Region</th> -->
   <!-- <th data-name="crop" class="sort">Crop</th> -->
   <th class="sort" data-name="availability">
    Avail.
   </th>
   <th class="sort
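The keyword loop leaves `table_index` undefined when no table contains the search word. Since the table printed above carries `id="datatables"`, selecting by id may be a sturdier option, assuming the site keeps that id. The shortened page below is made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened page with two tables; only the id is taken
# from the output above.
page_html = """
<table><tr><th>Month</th><th>NYC</th></tr></table>
<table id="datatables"><tr><th>Origin</th><th>Coffee</th></tr></table>
"""

page_soup = BeautifulSoup(page_html, "html.parser")

# find returns None when nothing matches, so a missing table is easy to detect.
offer_table = page_soup.find("table", id="datatables")
if offer_table is None:
    raise ValueError("offer table not found; the page layout may have changed")

headers = [th.get_text(strip=True) for th in offer_table.find_all("th")]
print(headers)
```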

It is possible to scrape data from HTML tables into a DataFrame using BeautifulSoup and the Pandas function `read_html`, which creates a DataFrame and populates it.

Our table is `tables[table_index]`. 

When we use the pandas function `read_html`, we give it the string version of the table as well as the flavor, `bs4`, which is the parsing engine.

The function `read_html` always returns a list of DataFrames, so we must pick the one we want out of the list. We use index `[0]` because we already selected the proper table from `tables` above.

We can also use the `read_html` function to get DataFrames directly from a URL and then pick the one we need out of the list, as follows (this only works when there is no login form to pass):
<code>
whole_page_df = pd.read_html(url, flavor='bs4')
len(whole_page_df)
whole_page_df[1]
</code>


In [21]:
offer_list_rehm = pd.read_html(str(tables[table_index]), flavor='bs4')[0]
offer_list_rehm

Unnamed: 0,Origin,Coffee,Bags,Unit,Farm / Name,Process,Cert.,Cup Profile,Avail.,€ / KG,$ / KG,Info,Unnamed: 12
0,BLEND,DECAFFEINATED,130,60kg,Espresso Blend DCM Decaffeinated,DCM decaf,,"dark chocolate, hazelnut",Hamb,"4,28€","5,13$",,
1,BRAZIL,ARABICA SPOT,515,59kg,Santos Aquarela NY2 17/18 s.s. fine cup,natural,,"chocolate, almond",Hamb,"2,72€","3,26$",,
2,BRAZIL,ARABICA SPOT,117,59kg,Cerrado Doce Diamantina NY2 16up natural,natural,,,Hamb,"2,79€","3,35$",,
3,BRAZIL,ARABICA SPOT,80,59kg,Mogiana Bella Giana NY2 17/18 s.s. fine cup pu...,pulped natural,,"hazelnut, cream",Hamb,"2,86€","3,44$",Factsheet,
4,BRAZIL,ARABICA SPOT,105,59kg,Santos NY2 Screen 19 s.s. fine cup,natural,,"hazelnut, almond, milk chocolate",Hamb,"3,16€","3,79$",,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
58,MEXICO,ARABICA SPOT,53,69kg,Finca el Flamingo Organic,washed,ORG,,Hamb,"4,35€","5,22$",,
59,NICARAGUA,ARABICA SPOT,229,69kg,Finca San Ramón Screen 18,washed,,"almond, lemon, tea like",Hamb,"4,26€","5,11$",,
60,PANAMA,ARABICA SPOT,13,15kg,Finca Los Limones natural,natural,,"blueberry, raisin, nougat",Hamb,"23,37€","28,04$",Factsheet,
61,RWANDA,ARABICA SPOT,68,60kg,Rugali Screen 15up,washed,,"almond, green tea",Hamb,"4,28€","5,14$",Factsheet,
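Newer pandas releases emit a deprecation warning when `read_html` receives a literal HTML string, and suggest wrapping it in `StringIO`. A small sketch of that variant follows; the two-row table is made up, with column names borrowed from the offer list, and `flavor="bs4"` needs both `bs4` and `html5lib` installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical two-row table; column names follow the offer list above.
table_html = """
<table>
  <thead><tr><th>Origin</th><th>Bags</th><th>Unit</th></tr></thead>
  <tbody>
    <tr><td>BRAZIL</td><td>515</td><td>59kg</td></tr>
    <tr><td>MEXICO</td><td>53</td><td>69kg</td></tr>
  </tbody>
</table>
"""

# read_html returns a list of DataFrames, one per table found.
frames = pd.read_html(StringIO(table_html), flavor="bs4")
demo_df = frames[0]
print(demo_df)
```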


## Data preparation before storage

### Adding Date column

First we need a Timestamp object that holds the current date. We call the Pandas function `to_datetime` with the argument `'today'` to get the current timestamp (date and time, not just the date) in the local timezone, then `normalize()` to reset the time to midnight, so that only the date remains, still as a Timestamp.

In [22]:
current_date = pd.to_datetime('today').normalize()
current_date

Timestamp('2021-04-21 00:00:00')
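To see what `normalize()` does in isolation, here is a small sketch with a fixed timestamp instead of `'today'`, so the result does not depend on when it is run. The afternoon time is an arbitrary choice for the illustration.

```python
import pandas as pd

# Fixed example timestamp so the result does not depend on the current time.
stamp = pd.Timestamp("2021-04-21 15:30:00")

# normalize() resets the time part to midnight and keeps the Timestamp type.
midnight = stamp.normalize()

print(midnight)
print(type(midnight).__name__)
```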

In [23]:
offer_list_rehm['offer_date'] = current_date
offer_list_rehm

Unnamed: 0,Origin,Coffee,Bags,Unit,Farm / Name,Process,Cert.,Cup Profile,Avail.,€ / KG,$ / KG,Info,Unnamed: 12,offer_date
0,BLEND,DECAFFEINATED,130,60kg,Espresso Blend DCM Decaffeinated,DCM decaf,,"dark chocolate, hazelnut",Hamb,"4,28€","5,13$",,,2021-04-21
1,BRAZIL,ARABICA SPOT,515,59kg,Santos Aquarela NY2 17/18 s.s. fine cup,natural,,"chocolate, almond",Hamb,"2,72€","3,26$",,,2021-04-21
2,BRAZIL,ARABICA SPOT,117,59kg,Cerrado Doce Diamantina NY2 16up natural,natural,,,Hamb,"2,79€","3,35$",,,2021-04-21
3,BRAZIL,ARABICA SPOT,80,59kg,Mogiana Bella Giana NY2 17/18 s.s. fine cup pu...,pulped natural,,"hazelnut, cream",Hamb,"2,86€","3,44$",Factsheet,,2021-04-21
4,BRAZIL,ARABICA SPOT,105,59kg,Santos NY2 Screen 19 s.s. fine cup,natural,,"hazelnut, almond, milk chocolate",Hamb,"3,16€","3,79$",,,2021-04-21
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
58,MEXICO,ARABICA SPOT,53,69kg,Finca el Flamingo Organic,washed,ORG,,Hamb,"4,35€","5,22$",,,2021-04-21
59,NICARAGUA,ARABICA SPOT,229,69kg,Finca San Ramón Screen 18,washed,,"almond, lemon, tea like",Hamb,"4,26€","5,11$",,,2021-04-21
60,PANAMA,ARABICA SPOT,13,15kg,Finca Los Limones natural,natural,,"blueberry, raisin, nougat",Hamb,"23,37€","28,04$",Factsheet,,2021-04-21
61,RWANDA,ARABICA SPOT,68,60kg,Rugali Screen 15up,washed,,"almond, green tea",Hamb,"4,28€","5,14$",Factsheet,,2021-04-21


### Modifying Unit, € / KG and $ / KG column values

We copy the DataFrame to check how our functions behave, and modify the data only once all functions work correctly.

In [45]:
df_copy_offer_list_rehm = offer_list_rehm.copy(deep=True)
df_copy_offer_list_rehm

Unnamed: 0,Origin,Coffee,Bags,Unit,Farm / Name,Process,Cert.,Cup Profile,Avail.,€ / KG,$ / KG,Info,Unnamed: 12,offer_date
0,BLEND,DECAFFEINATED,130,60kg,Espresso Blend DCM Decaffeinated,DCM decaf,,"dark chocolate, hazelnut",Hamb,"4,28€","5,13$",,,2021-04-21
1,BRAZIL,ARABICA SPOT,515,59kg,Santos Aquarela NY2 17/18 s.s. fine cup,natural,,"chocolate, almond",Hamb,"2,72€","3,26$",,,2021-04-21
2,BRAZIL,ARABICA SPOT,117,59kg,Cerrado Doce Diamantina NY2 16up natural,natural,,,Hamb,"2,79€","3,35$",,,2021-04-21
3,BRAZIL,ARABICA SPOT,80,59kg,Mogiana Bella Giana NY2 17/18 s.s. fine cup pu...,pulped natural,,"hazelnut, cream",Hamb,"2,86€","3,44$",Factsheet,,2021-04-21
4,BRAZIL,ARABICA SPOT,105,59kg,Santos NY2 Screen 19 s.s. fine cup,natural,,"hazelnut, almond, milk chocolate",Hamb,"3,16€","3,79$",,,2021-04-21
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
58,MEXICO,ARABICA SPOT,53,69kg,Finca el Flamingo Organic,washed,ORG,,Hamb,"4,35€","5,22$",,,2021-04-21
59,NICARAGUA,ARABICA SPOT,229,69kg,Finca San Ramón Screen 18,washed,,"almond, lemon, tea like",Hamb,"4,26€","5,11$",,,2021-04-21
60,PANAMA,ARABICA SPOT,13,15kg,Finca Los Limones natural,natural,,"blueberry, raisin, nougat",Hamb,"23,37€","28,04$",Factsheet,,2021-04-21
61,RWANDA,ARABICA SPOT,68,60kg,Rugali Screen 15up,washed,,"almond, green tea",Hamb,"4,28€","5,14$",Factsheet,,2021-04-21


First we create functions to modify the column values.

In [46]:
def clean_unit_col(x):
    x = x.replace("kg", "").replace(" ", "")
    return int(x)

In [47]:
def clean_eur_col(x):
    x = x.replace("€", "").replace(",", ".").replace(" ", "")
    return round(float(x),2)

In [48]:
def clean_usd_col(x):
    x = x.replace("$", "").replace(",", ".").replace(" ", "")
    return round(float(x),2)
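The two price functions differ only in the currency symbol, so they could be folded into one. The sketch below is one possible shape; `clean_price` and its `symbol` parameter are names introduced here, not part of the notebook.

```python
def clean_price(value, symbol):
    """Strip a currency symbol and convert a decimal-comma string to float."""
    cleaned = value.replace(symbol, "").replace(",", ".").replace(" ", "")
    return round(float(cleaned), 2)


print(clean_price("4,28€", "€"))
print(clean_price("28,04$", "$"))
```

With `apply`, the extra argument can be passed through `args`, for example `df['€ / KG'].apply(clean_price, args=("€",))`.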

Now we pass these functions to the `apply` method.

In [None]:
df_copy_offer_list_rehm['€ / KG'] = df_copy_offer_list_rehm['€ / KG'].apply(clean_eur_col)
df_copy_offer_list_rehm

In [52]:
df_copy_offer_list_rehm['$ / KG'] = df_copy_offer_list_rehm['$ / KG'].apply(clean_usd_col)
df_copy_offer_list_rehm

Unnamed: 0,Origin,Coffee,Bags,Unit,Farm / Name,Process,Cert.,Cup Profile,Avail.,€ / KG,$ / KG,Info,Unnamed: 12,offer_date
0,BLEND,DECAFFEINATED,130,60kg,Espresso Blend DCM Decaffeinated,DCM decaf,,"dark chocolate, hazelnut",Hamb,4.28,5.13,,,2021-04-21
1,BRAZIL,ARABICA SPOT,515,59kg,Santos Aquarela NY2 17/18 s.s. fine cup,natural,,"chocolate, almond",Hamb,2.72,3.26,,,2021-04-21
2,BRAZIL,ARABICA SPOT,117,59kg,Cerrado Doce Diamantina NY2 16up natural,natural,,,Hamb,2.79,3.35,,,2021-04-21
3,BRAZIL,ARABICA SPOT,80,59kg,Mogiana Bella Giana NY2 17/18 s.s. fine cup pu...,pulped natural,,"hazelnut, cream",Hamb,2.86,3.44,Factsheet,,2021-04-21
4,BRAZIL,ARABICA SPOT,105,59kg,Santos NY2 Screen 19 s.s. fine cup,natural,,"hazelnut, almond, milk chocolate",Hamb,3.16,3.79,,,2021-04-21
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
58,MEXICO,ARABICA SPOT,53,69kg,Finca el Flamingo Organic,washed,ORG,,Hamb,4.35,5.22,,,2021-04-21
59,NICARAGUA,ARABICA SPOT,229,69kg,Finca San Ramón Screen 18,washed,,"almond, lemon, tea like",Hamb,4.26,5.11,,,2021-04-21
60,PANAMA,ARABICA SPOT,13,15kg,Finca Los Limones natural,natural,,"blueberry, raisin, nougat",Hamb,23.37,28.04,Factsheet,,2021-04-21
61,RWANDA,ARABICA SPOT,68,60kg,Rugali Screen 15up,washed,,"almond, green tea",Hamb,4.28,5.14,Factsheet,,2021-04-21


In [53]:
df_copy_offer_list_rehm['Unit'] = df_copy_offer_list_rehm['Unit'].apply(clean_unit_col)
df_copy_offer_list_rehm

Unnamed: 0,Origin,Coffee,Bags,Unit,Farm / Name,Process,Cert.,Cup Profile,Avail.,€ / KG,$ / KG,Info,Unnamed: 12,offer_date
0,BLEND,DECAFFEINATED,130,60,Espresso Blend DCM Decaffeinated,DCM decaf,,"dark chocolate, hazelnut",Hamb,4.28,5.13,,,2021-04-21
1,BRAZIL,ARABICA SPOT,515,59,Santos Aquarela NY2 17/18 s.s. fine cup,natural,,"chocolate, almond",Hamb,2.72,3.26,,,2021-04-21
2,BRAZIL,ARABICA SPOT,117,59,Cerrado Doce Diamantina NY2 16up natural,natural,,,Hamb,2.79,3.35,,,2021-04-21
3,BRAZIL,ARABICA SPOT,80,59,Mogiana Bella Giana NY2 17/18 s.s. fine cup pu...,pulped natural,,"hazelnut, cream",Hamb,2.86,3.44,Factsheet,,2021-04-21
4,BRAZIL,ARABICA SPOT,105,59,Santos NY2 Screen 19 s.s. fine cup,natural,,"hazelnut, almond, milk chocolate",Hamb,3.16,3.79,,,2021-04-21
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
58,MEXICO,ARABICA SPOT,53,69,Finca el Flamingo Organic,washed,ORG,,Hamb,4.35,5.22,,,2021-04-21
59,NICARAGUA,ARABICA SPOT,229,69,Finca San Ramón Screen 18,washed,,"almond, lemon, tea like",Hamb,4.26,5.11,,,2021-04-21
60,PANAMA,ARABICA SPOT,13,15,Finca Los Limones natural,natural,,"blueberry, raisin, nougat",Hamb,23.37,28.04,Factsheet,,2021-04-21
61,RWANDA,ARABICA SPOT,68,60,Rugali Screen 15up,washed,,"almond, green tea",Hamb,4.28,5.14,Factsheet,,2021-04-21
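The same cleaning can also be written with pandas string methods instead of `apply`. This is a sketch on a made-up two-row sample that uses the same string formats as the scraped columns; `regex=False` matters for symbols such as `$`, which would otherwise be read as a regular-expression anchor.

```python
import pandas as pd

# Hypothetical sample with the same string formats as the scraped columns.
sample = pd.DataFrame({
    "Unit": ["60kg", "59kg"],
    "€ / KG": ["4,28€", "2,72€"],
})

# regex=False treats the pattern as plain text.
sample["Unit"] = sample["Unit"].str.replace("kg", "", regex=False).astype(int)
sample["€ / KG"] = (
    sample["€ / KG"]
    .str.replace("€", "", regex=False)
    .str.replace(",", ".", regex=False)
    .astype(float)
)
print(sample.dtypes)
```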


In [54]:
df_copy_offer_list_rehm

Unnamed: 0,Origin,Coffee,Bags,Unit,Farm / Name,Process,Cert.,Cup Profile,Avail.,€ / KG,$ / KG,Info,Unnamed: 12,offer_date
0,BLEND,DECAFFEINATED,130,60,Espresso Blend DCM Decaffeinated,DCM decaf,,"dark chocolate, hazelnut",Hamb,4.28,5.13,,,2021-04-21
1,BRAZIL,ARABICA SPOT,515,59,Santos Aquarela NY2 17/18 s.s. fine cup,natural,,"chocolate, almond",Hamb,2.72,3.26,,,2021-04-21
2,BRAZIL,ARABICA SPOT,117,59,Cerrado Doce Diamantina NY2 16up natural,natural,,,Hamb,2.79,3.35,,,2021-04-21
3,BRAZIL,ARABICA SPOT,80,59,Mogiana Bella Giana NY2 17/18 s.s. fine cup pu...,pulped natural,,"hazelnut, cream",Hamb,2.86,3.44,Factsheet,,2021-04-21
4,BRAZIL,ARABICA SPOT,105,59,Santos NY2 Screen 19 s.s. fine cup,natural,,"hazelnut, almond, milk chocolate",Hamb,3.16,3.79,,,2021-04-21
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
58,MEXICO,ARABICA SPOT,53,69,Finca el Flamingo Organic,washed,ORG,,Hamb,4.35,5.22,,,2021-04-21
59,NICARAGUA,ARABICA SPOT,229,69,Finca San Ramón Screen 18,washed,,"almond, lemon, tea like",Hamb,4.26,5.11,,,2021-04-21
60,PANAMA,ARABICA SPOT,13,15,Finca Los Limones natural,natural,,"blueberry, raisin, nougat",Hamb,23.37,28.04,Factsheet,,2021-04-21
61,RWANDA,ARABICA SPOT,68,60,Rugali Screen 15up,washed,,"almond, green tea",Hamb,4.28,5.14,Factsheet,,2021-04-21


In [55]:
df_copy_offer_list_rehm.to_csv('scraped-data/rehm-offer-list-21.04.2021.csv',index=False)
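The date in the file name above is typed by hand. It could instead be derived from a date object, as in this sketch; `dated_csv_name` is a hypothetical helper, and the same format spec should also work with the pandas Timestamp stored in `current_date`.

```python
from datetime import date


def dated_csv_name(day):
    """Build a file name such as rehm-offer-list-21.04.2021.csv."""
    return f"scraped-data/rehm-offer-list-{day:%d.%m.%Y}.csv"


print(dated_csv_name(date(2021, 4, 21)))
```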

Now our data bundle is complete. We can move on and save it to a file.

## Placing scraped data for storage in a .csv file using **Pandas** on the local machine

Using cell magic we execute `bash` commands to create folders to place our scraped data in a properly arranged manner.

In [None]:
%%bash
mkdir green-bean-price-tracker 
cd green-bean-price-tracker 
mkdir scraped-data
cd ~
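A `%%bash` cell runs in its own subprocess, so its `cd` commands do not move the notebook's working directory, and a second run of `mkdir` fails once the folders exist. A Python alternative is sketched below; the temporary base directory is an assumption made so the sketch can run anywhere.

```python
import tempfile
from pathlib import Path

# A temporary directory stands in for the project location in this sketch.
base_dir = Path(tempfile.mkdtemp())

data_dir = base_dir / "green-bean-price-tracker" / "scraped-data"

# parents=True creates missing parent folders; exist_ok=True makes reruns safe.
data_dir.mkdir(parents=True, exist_ok=True)
data_dir.mkdir(parents=True, exist_ok=True)  # second call does not raise

print(data_dir.is_dir())
```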

Using line magic we print the working directory to make sure we are back in the proper location.

In [None]:
%pwd

Using the `to_csv` method we create the .csv file in which we will collect our data as the green bean trader publishes new updates to the offer list.

In [None]:
offer_list_rehm.to_csv('scraped-data/rehm-offer-list.csv',index=False)

Now we can check our folders and the .csv file manually to make sure everything is in place. If we open the .csv file in Excel we may see some formatting issues, so we check the values by calling `read_csv` to be sure they are intact. If you get an error when reading the file, check whether you closed it in Excel after viewing it.

In [None]:
pd.read_csv('scraped-data/rehm-offer-list.csv')

## Appending new data releases to the .csv file

The green bean trader publishes updates to the offer list weekly. We plan to log in weekly and append the full list to the .csv file.

In [None]:
# IMPORTANT: run only when new data is available from trader!
offer_list_rehm.to_csv('scraped-data/rehm-offer-list.csv', mode='a',index=False,header=False)
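The append above assumes the file already exists with a header row. A small wrapper can cover both the first write and later appends. This is a sketch: `append_offers`, the two weekly frames and the temporary path are all made up for the illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd


def append_offers(df, csv_path):
    """Append df to csv_path, writing the header only when the file is new."""
    csv_path = Path(csv_path)
    df.to_csv(csv_path, mode="a", index=False, header=not csv_path.exists())


# Hypothetical weekly snapshots.
week_1 = pd.DataFrame({"Origin": ["BRAZIL"], "Bags": [515]})
week_2 = pd.DataFrame({"Origin": ["BRAZIL"], "Bags": [480]})

demo_path = Path(tempfile.mkdtemp()) / "rehm-offer-list.csv"
append_offers(week_1, demo_path)
append_offers(week_2, demo_path)

combined = pd.read_csv(demo_path)
print(combined)
```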

Checking whether new data is available from trader.

In [None]:
df_today = pd.read_csv('scraped-data/rehm-offer-list.csv')
df_previous =  pd.read_csv('scraped-data/rehm-offer-list-14.04.2021.csv')

In [None]:
df_previous.Bags == offer_list_rehm.Bags 

In [None]:
df_previous.Bags.compare(offer_list_rehm.Bags)
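Both `==` and `compare` raise an error when the two Series are not identically labeled, which happens as soon as the offer list gains or loses a row. A more forgiving check is sketched here on made-up values; `equals` answers the yes/no question, and `compare` is only called when the lengths match.

```python
import pandas as pd

# Hypothetical snapshots of the Bags column from two weeks.
previous = pd.Series([130, 515, 117], name="Bags")
current = pd.Series([130, 480, 117], name="Bags")

# equals() returns a single True/False and also works when lengths differ.
has_changed = not previous.equals(current)

# compare() lists only the differing positions; with default indexes,
# equal length is enough for the labels to match.
if len(previous) == len(current):
    differences = previous.compare(current)
else:
    differences = None

print(has_changed)
print(differences)
```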

## Data preparation

In [None]:
df_previous.shape

In [None]:
df_previous.describe()

In [None]:
df_today_copy = df_today.copy(deep=True)
df_today_copy

In [None]:
def clean_eur_col(x):
    x = x.replace("€", "").replace(",", ".").replace(" ", "")
    return float(x)

In [None]:
df_today_copy['€ / KG'] = df_today_copy['€ / KG'].apply(clean_eur_col)

In [None]:
df_today_copy['€ / KG'].head()

In [None]:
df_today_copy.describe()

In [None]:
def clean_unit_col(x):
    x = x.replace("kg", "").replace(" ", "")
    return int(x)

In [None]:
df_today_copy['Unit'] = df_today_copy['Unit'].apply(clean_unit_col)

In [None]:
df_today_copy['Unit'].head()

In [None]:
df_today_copy['total_kg'] =  df_today_copy['Unit'] * df_today_copy['Bags']

In [None]:
df_today_copy['total_eur'] =  df_today_copy['total_kg'] * df_today_copy['€ / KG']
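The two derived columns can be checked on a small frame. The sketch below uses the Bags, Unit and € / KG values of rows 0 and 60 from the cleaned output shown earlier; rounding to two decimals is an addition made here to keep floating-point noise out of the totals.

```python
import pandas as pd

# Values taken from rows 0 and 60 of the cleaned offer list shown earlier.
totals = pd.DataFrame({
    "Bags": [130, 13],
    "Unit": [60, 15],
    "€ / KG": [4.28, 23.37],
})

# Total weight per offer, then total value in EUR.
totals["total_kg"] = totals["Unit"] * totals["Bags"]
totals["total_eur"] = (totals["total_kg"] * totals["€ / KG"]).round(2)

print(totals)
```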

In [None]:
df_today_copy['total_eur'].head()

In [None]:
df_today_copy.describe()

In [None]:
df_copy_offer_list_rehm