# Why?
This notebook loads the North American borehole data and extracts the boreholes located in CA, so that potential locations in BC can be checked when building a model.

## Info 
The data set is titled *North America: Borehole Data and Climate Reconstruction* and is provided by the University of Michigan.

[link](http://geothermal.earth.lsa.umich.edu/NAM.html)




## Loading the data and making sure that the website was loaded successfully

In [22]:
!pip install requests
!pip install bs4
!pip install pandas
import requests
from bs4 import BeautifulSoup
import sys
import numpy as np
import pandas as pd



In [23]:
dataUrl = "http://geothermal.earth.lsa.umich.edu/NAM.html"

In [24]:
response = requests.get(dataUrl)

In [25]:
print(response.status_code)
if response.status_code != 200: 
    raise Exception("Failed to load the webpage")

200


In [26]:
data = response.content

## Parsing the data using bs4 library
The data of each borehole is stored on a separate page, but the link to that page can be constructed from the borehole id. It is therefore not necessary to parse the listing for the data links.
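As a minimal sketch of that construction: the helper name `build_data_url` is hypothetical, and the base URL and `.html` suffix simply follow the pattern used later in this notebook. It does not check that the page actually exists.

```python
# Base path of the per-borehole pages (the pattern used in this notebook).
BASE_URL = "http://geothermal.earth.lsa.umich.edu/DATA/"


def build_data_url(borehole_id: str) -> str:
    """Return the data-page URL for a borehole id such as 'CA-0001'.

    Hypothetical helper for illustration only: it strips surrounding
    whitespace and joins strings, nothing more.
    """
    return BASE_URL + borehole_id.strip() + ".html"


print(build_data_url(" CA-0001"))
# prints: http://geothermal.earth.lsa.umich.edu/DATA/CA-0001.html
```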

In [27]:
# parsing the content using bs
soup = BeautifulSoup(data)
# soup

In [28]:
boreholes = ""
for i in soup.find_all('td'):
    if i.string is not None:
        boreholes += i.string

# observations is a list in which each element is one observation (i.e., one table row)
observations = boreholes.splitlines()

## Creating the data abstraction
Headers in original data = Borehole, Longitude, Latitude, Data, Reconstruction, Data Contact

The Reconstruction column will not be stored, as it has no use here. An additional column is added to specify the country in which the borehole is located.
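The sketch below illustrates this abstraction on a single row. The helper `parse_row` is hypothetical. It assumes that fields are separated by two spaces and that the contact field ends with a country code in parentheses, as in the sample row quoted in the next cell. Unlike the notebook cell, it converts the coordinates to `float`.

```python
def parse_row(row: str):
    """Split one listing row into a record, or return None if it does not fit.

    Hypothetical helper: the Reconstruction field (index 3) is skipped and
    the country code is taken from the parentheses in the contact field.
    """
    fields = row.split("  ")
    if len(fields) != 6:
        return None
    contact_field = fields[4]
    open_paren = contact_field.rfind("(")
    if open_paren == -1:
        return None
    return {
        "Borehole": fields[0].strip(),
        "Longitude": float(fields[1]),
        "Latitudes": float(fields[2]),
        "Data Contact": contact_field[:open_paren].strip(),
        "Country": contact_field[open_paren + 1:].rstrip(") "),
    }


sample = " CA-0001  -93.94  51.13  Reconstruction  J-C Mareschal (CA)  "
record = parse_row(sample)
print(record["Borehole"], record["Data Contact"], record["Country"])
# prints: CA-0001 J-C Mareschal CA
```

Returning `None` for rows that do not match keeps header and separator lines out of the result without raising.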

In [29]:
# Creating the data abstraction 
# ' CA-0001  -93.94  51.13  Reconstruction  J-C Mareschal (CA)  '

# the Reconstruction data is not saved as it is not used 

boreholes = []
longitudes = []
latitudes = []
datalinks = []
dataContacts = []
countries = []
 

for observation in observations: 
    entry = observation.split("  ") # [' CA-0001', '-93.94', '51.13', 'Reconstruction', 'J-C Mareschal (CA)', '']
    if len(entry) == 6:
        boreholes.append(entry[0].replace(" ",""))
        longitudes.append(entry[1])
        latitudes.append(entry[2])
        link = "http://geothermal.earth.lsa.umich.edu/DATA/"+entry[0].replace(" ","") + ".html"
        datalinks.append(link)
        locationOfP = entry[4].index("(") # locating the ( to be used for indexing below
        dataContacts.append(entry[4][0:locationOfP-1])
        countries.append(entry[4][locationOfP+1:-1])

## Creating dataframe in pandas and saving it as a csv file 
Two CSV files will be saved: one that contains all of the data, and another that contains only the boreholes in CA.
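A small sketch of the filter-then-save step on toy data (the two rows and the names `toy`, `ca_only`, and `buffer` are made up for illustration). By default pandas writes the row index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back; passing `index=False` avoids that. An in-memory buffer stands in for a file path here.

```python
import io

import pandas as pd

# Toy stand-in for the full borehole table.
toy = pd.DataFrame(
    {"Borehole": ["CA-0001", "US-wes64"], "Country": ["CA", "US"]}
)

# Keep only the rows whose country code is CA.
ca_only = toy[toy["Country"] == "CA"]

# Write the filtered frame (not the full one) without the index column.
buffer = io.StringIO()
ca_only.to_csv(buffer, index=False)
print(buffer.getvalue())
```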

In [30]:
data = {"Borehole": boreholes, "Longitude":longitudes, "Latitudes":latitudes, "Data":datalinks, "Data Contact": dataContacts, "Country":countries}
df = pd.DataFrame(data)
print(df)
df.to_csv("NorthAmericanBoreHolesUMICHdata.csv")

     Borehole Longitude Latitudes  \
0     CA-0001    -93.94     51.13   
1     CA-0002    -93.86     51.18   
2     CA-0003    -93.17     50.96   
3     CA-0004    -93.14     50.99   
4     CA-0005    -92.89     51.12   
..        ...       ...       ...   
430  US-nos92    -72.55     43.33   
431  US-sar64    -74.27     44.33   
432  US-sar92    -74.27     44.33   
433  US-wad64    -73.47     44.23   
434  US-wes64    -72.82     43.28   

                                                  Data   Data Contact Country  
0    http://geothermal.earth.lsa.umich.edu/DATA/CA-...  J-C Mareschal      CA  
1    http://geothermal.earth.lsa.umich.edu/DATA/CA-...  J-C Mareschal      CA  
2    http://geothermal.earth.lsa.umich.edu/DATA/CA-...  J-C Mareschal      CA  
3    http://geothermal.earth.lsa.umich.edu/DATA/CA-...  J-C Mareschal      CA  
4    http://geothermal.earth.lsa.umich.edu/DATA/CA-...  J-C Mareschal      CA  
..                                                 ...            ...     .

In [31]:
dfCAonly = df[(df["Country"] == "CA")] # filtering for CA boreholes
print(dfCAonly)
dfCAonly.to_csv("NorthAmericanBoreHolesUMICHdata_CA.csv")  # save the filtered frame, not the full one

    Borehole Longitude Latitudes  \
0    CA-0001    -93.94     51.13   
1    CA-0002    -93.86     51.18   
2    CA-0003    -93.17     50.96   
3    CA-0004    -93.14     50.99   
4    CA-0005    -92.89     51.12   
..       ...       ...       ...   
292  CA-9921   -109.52     58.23   
293  CA-9922   -109.52     58.23   
294  CA-9923   -109.52     58.23   
295  CA-JM-a   -107.12     52.02   
296  CA-JM-b   -106.96     50.92   

                                                  Data   Data Contact Country  
0    http://geothermal.earth.lsa.umich.edu/DATA/CA-...  J-C Mareschal      CA  
1    http://geothermal.earth.lsa.umich.edu/DATA/CA-...  J-C Mareschal      CA  
2    http://geothermal.earth.lsa.umich.edu/DATA/CA-...  J-C Mareschal      CA  
3    http://geothermal.earth.lsa.umich.edu/DATA/CA-...  J-C Mareschal      CA  
4    http://geothermal.earth.lsa.umich.edu/DATA/CA-...  J-C Mareschal      CA  
..                                                 ...            ...     ...  
292  ht

## Parsing the data webpage for all the observations

In [32]:
!pip install tqdm



In [33]:
from tqdm import tqdm
from time import sleep

In [34]:
def dataWebpageParser(observation):
    """Parse the webpage that contains the data of one borehole, given its observation (a row of the borehole table)."""
    
    # getting the link from the observation entry
    link = observation["Data"]
    
    # loading the websites
    observationData = requests.get(link)
    if observationData.status_code != 200:
        print(observationData.status_code)
        raise Exception("Error occurred when fetching website")
    
    # parsing
    soup = BeautifulSoup(observationData.content)
    ls = []
    
    for tag in soup.find_all("p"):
        ls.append(tag.contents)
    
    ########### General data (parsed here but not included in the returned records)
    # yr
    yrData = ls[0][0].replace(" ","")
    colonLocationInyrData = yrData.find(":")
    yr = float(yrData[colonLocationInyrData+1:]) 

    # steady state
    steadyStateData = ls[0][2].replace("    ","")
    colonLocationInSteadyState = steadyStateData.find(":")
    steadyState = float(steadyStateData[colonLocationInSteadyState+1:-1])

    # mean conductivity
    conductivityData = ls[0][8].replace("    ","")
    colonInConductivityData = conductivityData.find(":")
    conductivity = float(conductivityData[colonInConductivityData+1:-1])

    # mean thermal gradient
    thermalGradientData = ls[0][12].replace("    ","")
    colonInThermalGradientData = thermalGradientData.find(":")
    thermalGradient = float(thermalGradientData[colonInThermalGradientData+1:-1])


    # Depth Below Surface (m)        and          Temperature (°C)
    # the even numbers (0, 2, 4, ...) are the Depth Below Surface
    # the odd numbers (1, 3, 5, ...) are Temperature

    depthAndTempString = ls[2][0].split()

    newObservations = []
    # step through the tokens two at a time: (depth, temperature)
    for j in range(0, len(depthAndTempString), 2):
        newObservations.append({"Borehole": observation["Borehole"],
                                "Depth": float(depthAndTempString[j]),
                                "Temperature": float(depthAndTempString[j + 1]),
                                "Longitude": observation["Longitude"],
                                "Latitudes": observation["Latitudes"],
                                "link": observation["Data"],
                                "Data Contact": observation["Data Contact"],
                                "Country": observation["Country"]})
    
    return newObservations
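
The alternating depth/temperature layout described in the comments above can also be split with slicing. This is only a sketch: `pair_depth_temperature` is a hypothetical helper, and it assumes the page text is a whitespace-separated run of numbers in depth, temperature order. The sample values are the first three CA-0001 rows shown in the final table.

```python
def pair_depth_temperature(text: str):
    """Return a list of (depth, temperature) tuples from alternating values."""
    tokens = text.split()
    if len(tokens) % 2 != 0:
        raise ValueError("expected an even number of values")
    depths = [float(t) for t in tokens[0::2]]        # even positions: depth (m)
    temperatures = [float(t) for t in tokens[1::2]]  # odd positions: temperature (°C)
    return list(zip(depths, temperatures))


pairs = pair_depth_temperature("29.78 4.44 39.67 4.44 49.56 4.49")
print(pairs)
# prints: [(29.78, 4.44), (39.67, 4.44), (49.56, 4.49)]
```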
        
        

In [35]:
df

Unnamed: 0,Borehole,Longitude,Latitudes,Data,Data Contact,Country
0,CA-0001,-93.94,51.13,http://geothermal.earth.lsa.umich.edu/DATA/CA-...,J-C Mareschal,CA
1,CA-0002,-93.86,51.18,http://geothermal.earth.lsa.umich.edu/DATA/CA-...,J-C Mareschal,CA
2,CA-0003,-93.17,50.96,http://geothermal.earth.lsa.umich.edu/DATA/CA-...,J-C Mareschal,CA
3,CA-0004,-93.14,50.99,http://geothermal.earth.lsa.umich.edu/DATA/CA-...,J-C Mareschal,CA
4,CA-0005,-92.89,51.12,http://geothermal.earth.lsa.umich.edu/DATA/CA-...,J-C Mareschal,CA
...,...,...,...,...,...,...
430,US-nos92,-72.55,43.33,http://geothermal.earth.lsa.umich.edu/DATA/US-...,E.R. Decker,US
431,US-sar64,-74.27,44.33,http://geothermal.earth.lsa.umich.edu/DATA/US-...,H.N. Pollack,US
432,US-sar92,-74.27,44.33,http://geothermal.earth.lsa.umich.edu/DATA/US-...,E.R. Decker,US
433,US-wad64,-73.47,44.23,http://geothermal.earth.lsa.umich.edu/DATA/US-...,H.N. Pollack,US


In [36]:
allData = []
dataFrame = df

# tqdm supplies i, so no manual counter is needed
for i in tqdm(range(len(dataFrame)), desc="Processing", ncols=100):
    sleep(0.03)  # brief pause between requests
    allData.append(dataWebpageParser(dataFrame.iloc[i]))

Processing: 100%|█████████████████████████████████████████████████| 435/435 [02:16<00:00,  3.18it/s]


In [37]:
allDataSorted = []
for borehole in allData:
    for element in borehole:
        allDataSorted.append(element)
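
The nested loop above flattens a list of per-borehole lists into one list of records. An equivalent sketch uses `itertools.chain.from_iterable` from the standard library. The toy `nested` data below is made up for illustration and only mirrors the shape of `allData`.

```python
from itertools import chain

# Toy stand-in for allData: one inner list of records per borehole.
nested = [
    [{"Borehole": "CA-0001", "Depth": 29.78}, {"Borehole": "CA-0001", "Depth": 39.67}],
    [{"Borehole": "US-wes64", "Depth": 400.30}],
]

# Concatenate the inner lists in order, without changing the records.
flat = list(chain.from_iterable(nested))
print(len(flat))
# prints: 3
```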

In [38]:
northAmericanBoreholesData = pd.DataFrame(allDataSorted)
northAmericanBoreholesData.to_csv("NorthAmericanBoreholes.csv")
northAmericanBoreholesData

Unnamed: 0,Borehole,Depth,Temperature,Longitude,Latitudes,link,Data Contact,Country
0,CA-0001,29.78,4.44,-93.94,51.13,http://geothermal.earth.lsa.umich.edu/DATA/CA-...,J-C Mareschal,CA
1,CA-0001,39.67,4.44,-93.94,51.13,http://geothermal.earth.lsa.umich.edu/DATA/CA-...,J-C Mareschal,CA
2,CA-0001,49.56,4.49,-93.94,51.13,http://geothermal.earth.lsa.umich.edu/DATA/CA-...,J-C Mareschal,CA
3,CA-0001,59.39,4.55,-93.94,51.13,http://geothermal.earth.lsa.umich.edu/DATA/CA-...,J-C Mareschal,CA
4,CA-0001,69.22,4.62,-93.94,51.13,http://geothermal.earth.lsa.umich.edu/DATA/CA-...,J-C Mareschal,CA
...,...,...,...,...,...,...,...,...
19381,US-wes64,371.10,13.11,-72.82,43.28,http://geothermal.earth.lsa.umich.edu/DATA/US-...,H.N. Pollack,US
19382,US-wes64,380.80,13.30,-72.82,43.28,http://geothermal.earth.lsa.umich.edu/DATA/US-...,H.N. Pollack,US
19383,US-wes64,390.60,13.49,-72.82,43.28,http://geothermal.earth.lsa.umich.edu/DATA/US-...,H.N. Pollack,US
19384,US-wes64,400.30,13.68,-72.82,43.28,http://geothermal.earth.lsa.umich.edu/DATA/US-...,H.N. Pollack,US
