# Milestone 1: Exploratory Data Analysis

**Authors**: __Khizer Zakir & Rodrigo Brust Santos__

__October 2023__


<a name="dataset-description" ></a>
## 1. Dataset Description 

**Objective**: Prediction/interpolation of elemental concentrations

### 1.1 How Was the Dataset Collected?

The data was obtained from the Brazilian Geological Survey (CPRM), in the folder of the [Distrito Zincífero de Vazante, Minas Gerais (MG), Brasil](https://rigeo.cprm.gov.br/handle/doc/19397) (_Zinc-enriched District of Vazante, Minas Gerais State, Brazil_), acronym `DZV`. The stream sediments dataset was requested from CPRM by e-mail and promptly made available.

The dataset explored here is the Stream Sediments Geochemical Samples from the DZV, a program conducted in the northwest region of Minas Gerais State, Brazil, and published in 2017 to aid the exploration and development of the mineral industry, with a focus on **Zinc ore**. 

This dataset consists of `stream sediments`: sediment samples collected from a stream or other body of water for geochemical analysis. Stream sediment sampling is considered a good first-order approximation for mineral exploration, because catchment lithology (in layman's terms, the type of rock in the drainage area) is considered the main control on stream sediment geochemistry and can therefore indicate a mineral deposit upstream of the sample location. 

### 1.2 Meaning of the Dataset

#### Context

The `DZV` lies within the Brasília Belt, more specifically in its central-southern part (Almeida, 1967 & 1968). This geological terrain was formed in the Brasiliano Cycle, during the Neoproterozoic (1 Ga to 500 Ma), when the São Francisco, Amazonas and Paranapanema cratons collided. This collision was one of many that built the large southern paleocontinent Gondwana (Valeriano et al., 2004; Pimentel et al., 2011).

Cratons are characterized as solid and stable parts of the continental lithosphere. Because of their low density compared with oceanic crust, cratons can resist tectonic processes such as rifting and subduction. As a consequence, these resistant blocks of rock are commonly found in the interior of tectonic plates and continents (Petit, 2010). Figure 1 shows the main cratons of South America and Africa when these continents were joined in the supercontinent Pangea.

|![cratons-1.svg](./images/Cratons_West_Gondwana.svg)|
|:--:|
|*Figure 1: South America's and Africa's cratons represented by dark brown. Source: [Woudloper, 2010](https://en.wikipedia.org/wiki/Craton#/media/File:Cratons_West_Gondwana.svg)*|

The collision of these Brazilian cratons produced the Brasília Belt (Figure 2), as stated above. It generated three geological compartments in the fold belt: the Inner Zone, the External Zone and the Cratonic Domain. The area of interest (AOI) is located in the External Zone, highlighted by the purple box in Figure 2, so that is where we focus. 

|![brasilia-belt.png](./images/brasilia_belt_draw.png)|
| :--: |
|*Figure 2: Brasília Belt geological setting and position. It is a strip that connects the São Francisco Craton and the Paranapanema Craton, which now lies beneath the Paraná Basin. The stream sediments are located in the region marked by the purple square, in the cities of Paracatu, Vazante and Coromandel. Source: Valeriano, 2016.*|

The `DZV` is located in the External Zone of the Brasília Belt. The main Zn mineralizations are within the Vazante Group (Dias et al., 2015). In geology, a group is a set of geologic formations, and each formation is characterized by a few lithologies. The Vazante Group comprises three formations: Lapa Fm, Serra do Poço and Morro do Calcário Fm, and Serra do Garrote Fm. They are always listed from the youngest to the oldest, as shown in Figure 3. Groups and formations are defined by the similarity of their lithologies, which distinguishes them from the surrounding rock bodies, and by the particular position they occupy in the layers of rock exposed in a geographical region (Boggs, 1987).

| ![vazante_group.png](./images/vazante_group_modified.png) |
|:--:|
|*Figure 3: Vazante Group stratigraphic column. Modified from [Aldi et al. (2022)](https://www.researchgate.net/publication/362722519_LA-ICP-MS_Trace_Element_Composition_of_Sphalerite_and_Galena_of_the_Proterozoic_Carbonate-Hosted_Morro_Agudo_Zn-Pb_Sulfide_District_Brazil_Insights_into_Ore_Genesis).*|


In total, the `DZV` hosts 6 mines extracting Zn ore, mainly from the mineral willemite (Zn₂SiO₄, zinc silicate). In some areas, the concentration of Zn can reach 8000 ppm (Dias et al., 2015).  

#### Objectives

With this background in mind, the objective of this project is to identify, with the aid of Machine Learning regression algorithms, prospective areas rich in Zn. 

In [1]:
# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import time

In [3]:
# Load the dataset
file_path = '../dataset/stream_samples_original.csv'
df = pd.read_csv(file_path, sep = ';')

# Display the first few rows of the dataset
df.head()


| | Estação | N__Lab_ | Long__X_ | Lat__Y_ | Folha | Ag (ppm) | Al (%) | As (ppm) | Au (ppm) | B (ppm) | ... | Ta (ppm) | Te (ppm) | Th (ppm) | Ti (%) | U (ppm) | V (ppm) | W (ppm) | Y (ppm) | Zn (ppm) | Zr (ppm) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AC-0002 | CDE225 | 248757 | 7972050 | Coromandel | 0.03 | 2.63 | 4 | <0.1 | <10 | ... | <0.05 | 0.06 | 7.4 | 0.03 | 0.7 | 73 | 0.3 | 13.15 | 27 | 6.3 |
| 1 | AC-0003 | CDE226 | 244460 | 7973135 | Coromandel | 0.02 | 1.93 | 2 | <0.1 | <10 | ... | <0.05 | 0.14 | 8.2 | 0.06 | 0.94 | 58 | 0.3 | 23.9 | 58 | 6.9 |
| 2 | AC-0004 | CDE227 | 244044 | 7970217 | Coromandel | 0.04 | 1.47 | 3 | <0.1 | <10 | ... | <0.05 | 0.08 | 5.4 | 0.04 | 0.65 | 55 | 0.2 | 10.4 | 34 | 1.8 |
| 3 | AC-0005 | CDE228 | 242895 | 7970593 | Coromandel | 0.05 | 1.72 | 23 | <0.1 | <10 | ... | <0.05 | <0.05 | 6.9 | 0.04 | 1.01 | 66 | 0.7 | 9.34 | 27 | 4.1 |
| 4 | AC-0006 | CDE229 | 242999 | 7971416 | Coromandel | 0.04 | 0.97 | 7 | <0.1 | <10 | ... | <0.05 | <0.05 | 5.7 | 0.05 | 0.82 | 41 | 0.4 | 7.45 | 28 | 1.5 |



### 1.3 Explanatory Variables
List and describe the types and numbers of explanatory variables.


In [13]:
# Keep only the element columns (everything after the five ID/location columns)
exp_variables = df.iloc[:, 5:]

# Remove the response variable (Zinc) by name rather than by position
exp_variables = exp_variables.drop(columns='Zn (ppm)')

exp_variables.head()

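| | Ag (ppm) | Al (%) | As (ppm) | Au (ppm) | B (ppm) | Ba (ppm) | Be (ppm) | Bi (ppm) | Ca (%) | Cd (ppm) | ... | Sr (ppm) | Ta (ppm) | Te (ppm) | Th (ppm) | Ti (%) | U (ppm) | V (ppm) | W (ppm) | Y (ppm) | Zr (ppm) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.03 | 2.63 | 4 | <0.1 | <10 | 76 | 1.0 | 0.31 | 0.05 | 0.03 | ... | 3.3 | <0.05 | 0.06 | 7.4 | 0.03 | 0.7 | 73 | 0.3 | 13.15 | 6.3 |
| 1 | 0.02 | 1.93 | 2 | <0.1 | <10 | 84 | 1.7 | 0.29 | 0.04 | 0.01 | ... | 3.1 | <0.05 | 0.14 | 8.2 | 0.06 | 0.94 | 58 | 0.3 | 23.9 | 6.9 |
| 2 | 0.04 | 1.47 | 3 | <0.1 | <10 | 70 | 0.7 | 0.19 | 0.11 | 0.06 | ... | 3.9 | <0.05 | 0.08 | 5.4 | 0.04 | 0.65 | 55 | 0.2 | 10.4 | 1.8 |
| 3 | 0.05 | 1.72 | 23 | <0.1 | <10 | 80 | 1.1 | 0.32 | 0.04 | 0.08 | ... | 3.1 | <0.05 | <0.05 | 6.9 | 0.04 | 1.01 | 66 | 0.7 | 9.34 | 4.1 |
| 4 | 0.04 | 0.97 | 7 | <0.1 | <10 | 56 | 0.5 | 0.22 | 0.03 | 0.05 | ... | 2.5 | <0.05 | <0.05 | 5.7 | 0.05 | 0.82 | 41 | 0.4 | 7.45 | 1.5 |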


In [17]:
print("Feature Name:", exp_variables.columns.values)

print("\n Type of X Values:")
print(exp_variables.dtypes)
print("\nshape of X:")
print(exp_variables.shape)

Feature Name: ['Ag (ppm)' 'Al (%)' 'As (ppm)' 'Au (ppm)' 'B (ppm)' 'Ba (ppm)' 'Be (ppm)'
 'Bi (ppm)' 'Ca (%)' 'Cd (ppm)' 'Ce (ppm)' 'Co (ppm)' 'Cr (ppm)'
 'Cs (ppm)' 'Cu (ppm)' 'Fe (%)' 'Ga (ppm)' 'Ge (ppm)' 'Hf (ppm)'
 'Hg (ppm)' 'In (ppm)' 'K (%)' 'La (ppm)' 'Li (ppm)' 'LREE (ppm)' 'Mg (%)'
 'Mn (ppm)' 'Mo (ppm)' 'Na (%)' 'Nb (ppm)' 'Ni (ppm)' 'P (ppm)' 'Pb (ppm)'
 'Rb (ppm)' 'Re (ppm)' 'S (%)' 'Sb (ppm)' 'Sc (ppm)' 'Se (ppm)' 'Sn (ppm)'
 'Sr (ppm)' 'Ta (ppm)' 'Te (ppm)' 'Th (ppm)' 'Ti (%)' 'U (ppm)' 'V (ppm)'
 'W (ppm)' 'Y (ppm)' 'Zr (ppm)']

 Type of X Values:
Ag (ppm)       object
Al (%)        float64
As (ppm)       object
Au (ppm)       object
B (ppm)        object
Ba (ppm)       object
Be (ppm)       object
Bi (ppm)      float64
Ca (%)         object
Cd (ppm)       object
Ce (ppm)      float64
Co (ppm)      float64
Cr (ppm)        int64
Cs (ppm)      float64
Cu (ppm)      float64
Fe (%)         object
Ga (ppm)      float64
Ge (ppm)       object
Hf (ppm)       object
Hg (ppm)   

**Location Variables**

- `Estação` (_object_): means Station. It is the ID of the collecting point.

- `N__Lab_` (_object_): identification of the lab that performed the stream sediment analysis.

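- `Long__X_` (_int64_): coordinate of the collecting point along the X axis.

- `Lat__Y_` (_int64_): coordinate of the collecting point along the Y axis.

- `Folha` (_object_): means Sheet. It is the mapping site (map sheet) where the sample was collected, e.g. Coromandel.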
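
As a minimal sketch of how the location variables can be used together, the snippet below rebuilds the five preview rows shown earlier into a small DataFrame and summarises the number of samples and the coordinate extent per `Folha`. The names `preview` and `extent` are illustrative assumptions and not part of the original notebook; on the full dataset the same `groupby` would be applied to `df`.

```python
import pandas as pd

# Five preview rows copied from the df.head() output (illustrative stand-in for df)
preview = pd.DataFrame({
    "Estação": ["AC-0002", "AC-0003", "AC-0004", "AC-0005", "AC-0006"],
    "Long__X_": [248757, 244460, 244044, 242895, 242999],
    "Lat__Y_": [7972050, 7973135, 7970217, 7970593, 7971416],
    "Folha": ["Coromandel"] * 5,
})

# Number of samples and bounding box of the coordinates for each map sheet
extent = preview.groupby("Folha").agg(
    n_samples=("Estação", "count"),
    x_min=("Long__X_", "min"),
    x_max=("Long__X_", "max"),
    y_min=("Lat__Y_", "min"),
    y_max=("Lat__Y_", "max"),
)
print(extent)
```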

**Explanatory Variables**

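- `element (ppm)` (_float64_ expected): element concentration measured in parts per million

- `element (%)` (_float64_ expected): element concentration measured in percentage

As the dtype listing above shows, many of these columns are actually loaded as _object_ (and `Cr (ppm)` as _int64_), most likely because results below the detection limit are stored as strings such as `<0.1`.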

| Chemical Element | Element Name |
|------------------|--------------|
| Ag (ppm)         | Silver       |
| Al (%)           | Aluminum     |
| As (ppm)         | Arsenic      |
| Au (ppm)         | Gold         |
| B (ppm)          | Boron        |
| Ba (ppm)         | Barium       |
| Be (ppm)         | Beryllium    |
| Bi (ppm)         | Bismuth      |
| Ca (%)           | Calcium      |
| Cd (ppm)         | Cadmium      |
| Ce (ppm)         | Cerium       |
| Co (ppm)         | Cobalt       |
| Cr (ppm)         | Chromium     |
| Cs (ppm)         | Cesium       |
| Cu (ppm)         | Copper       |
| Fe (%)           | Iron         |
| Ga (ppm)         | Gallium      |
| Ge (ppm)         | Germanium    |
| Hf (ppm)         | Hafnium      |
| Hg (ppm)         | Mercury      |
| In (ppm)         | Indium       |
| K (%)            | Potassium    |
| La (ppm)         | Lanthanum    |
| Li (ppm)         | Lithium      |
| LREE (ppm)       | Light Rare Earth Elements |
| Mg (%)           | Magnesium    |
| Mn (ppm)         | Manganese    |
| Mo (ppm)         | Molybdenum   |
| Na (%)           | Sodium       |
| Nb (ppm)         | Niobium      |
| Ni (ppm)         | Nickel       |
| P (ppm)          | Phosphorus   |
| Pb (ppm)         | Lead         |
| Rb (ppm)         | Rubidium     |
| Re (ppm)         | Rhenium      |
| S (%)            | Sulfur       |
| Sb (ppm)         | Antimony     |
| Sc (ppm)         | Scandium     |
| Se (ppm)         | Selenium     |
| Sn (ppm)         | Tin          |
| Sr (ppm)         | Strontium    |
| Ta (ppm)         | Tantalum     |
| Te (ppm)         | Tellurium    |
| Th (ppm)         | Thorium      |
| Ti (%)           | Titanium     |
| U (ppm)          | Uranium      |
| V (ppm)          | Vanadium     |
| W (ppm)          | Tungsten     |
| Y (ppm)          | Yttrium      |
| Zr (ppm)         | Zirconium    |
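
Several element columns in the preview contain strings such as `<0.1` or `<0.05`, which is consistent with those columns being listed as _object_ in the dtype output. A minimal sketch of one way to make them numeric is shown below. Replacing a censored value with half of its detection limit is an assumed convention chosen here for illustration, not something stated in the dataset documentation, and the helper name `censored_to_numeric` is hypothetical.

```python
import pandas as pd

# Values copied from the preview rows; "<0.1"-style strings mark results
# reported as below a threshold and make pandas load the column as object.
sample = pd.DataFrame({
    "Au (ppm)": ["<0.1", "<0.1", "<0.1", "<0.1", "<0.1"],
    "Te (ppm)": ["0.06", "0.14", "0.08", "<0.05", "<0.05"],
})

def censored_to_numeric(col: pd.Series, factor: float = 0.5) -> pd.Series:
    """Convert strings such as '<0.05' to factor * limit (assumed convention)."""
    text = col.astype(str).str.strip()
    censored = text.str.startswith("<")              # rows flagged with '<'
    values = pd.to_numeric(text.str.lstrip("<"), errors="coerce")
    # Keep measured values as they are, scale only the censored ones
    return values.where(~censored, values * factor)

numeric = sample.apply(censored_to_numeric)
print(numeric.dtypes)
print(numeric)
```

Other substitution rules (or dropping heavily censored columns) may be more appropriate depending on the modelling step, so the `factor` argument is left adjustable.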


*_Disclaimer_*: in geology, chemical elements are divided into two types: `Major` and `Trace`. 

All `Major elements`, such as Al, K, and Fe, are measured in percentage (%). These elements typically constitute a significant portion of the sediment's composition, so their concentrations are conveniently expressed as percentages. 

Meanwhile, all `Trace elements` are measured in ppm. These elements, present in much lower concentrations, include valuable metals, heavy metals, and other trace elements such as gold (Au), silver (Ag), copper (Cu), and various rare earth elements. Even in trace amounts, these elements can be essential indicators of mineral deposits or provide valuable information about the geological environment.
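
Since 1 % corresponds to 10,000 ppm, values reported in the two units can be put on a common scale. The following is a minimal sketch of that conversion; the helper names `percent_to_ppm` and `ppm_to_percent` are hypothetical and not part of the original notebook.

```python
PPM_PER_PERCENT = 10_000  # 1 % = 10,000 parts per million

def percent_to_ppm(value_percent: float) -> float:
    """Convert a concentration in percent to parts per million."""
    return value_percent * PPM_PER_PERCENT

def ppm_to_percent(value_ppm: float) -> float:
    """Convert a concentration in parts per million to percent."""
    return value_ppm / PPM_PER_PERCENT

# Al in the first preview row is 2.63 %
al_ppm = percent_to_ppm(2.63)

# 8000 ppm is the upper Zn concentration quoted from Dias et al. (2015)
zn_percent = ppm_to_percent(8000)

print(al_ppm, zn_percent)
```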



### 1.4 Response Variable
Describe the response variable and its type.

**Response Variable**

- `Zn (ppm)` (_int64_): concentration of Zinc in ppm.

| Chemical Element | Element Name |
|------------------|--------------|
| Zn (ppm)| Zinc | 


In [25]:
res_variables = df['Zn (ppm)'].reset_index()


In [32]:
print("Feature Name:", res_variables.columns[1])

print("\nType of Y Values:")
print(res_variables.dtypes.iloc[1])

# reset_index() returns a DataFrame with two columns: the old index and 'Zn (ppm)'
print("\nshape of Y:")
print(res_variables.shape)

Feature Name: Zn (ppm)

Type of Y Values:
int64

shape of Y:
(709, 2)
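
Selecting the response and the explanatory variables by column name avoids depending on column positions. A minimal sketch of such a split is given below, using a few columns from the preview rows as a stand-in for `df`; the names `preview`, `target`, `meta_cols`, `X` and `y` are illustrative assumptions.

```python
import pandas as pd

# A few columns from the first three preview rows (illustrative stand-in for df)
preview = pd.DataFrame({
    "Estação": ["AC-0002", "AC-0003", "AC-0004"],
    "Folha": ["Coromandel"] * 3,
    "Al (%)": [2.63, 1.93, 1.47],
    "V (ppm)": [73, 58, 55],
    "Zn (ppm)": [27, 58, 34],
})

target = "Zn (ppm)"
meta_cols = ["Estação", "Folha"]  # identifier/location columns, not predictors

y = preview[target]                             # response variable as a Series
X = preview.drop(columns=meta_cols + [target])  # remaining element columns

print(y.name, y.shape)
print(X.columns.tolist(), X.shape)
```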
