# Milestone 1: Exploratory Data Analysis

**Authors**: __Khizer Zakir & Rodrigo Brust Santos__

__October 2023__
______


<a name="dataset-description" ></a>
## 1. Dataset Description 

**Objective**: Prediction/interpolation of elemental concentrations

### 1.1 How Was the Dataset Collected?

The data was obtained from the Brazilian Geological Survey (CPRM), from the folder of the [Distrito Zincífero de Vazante, Minas Gerais (MG), Brasil](https://rigeo.cprm.gov.br/handle/doc/19397) (_Zinc-enriched District of Vazante, Minas Gerais State, Brazil_), acronym `DZV`. The stream sediments dataset was requested from CPRM by e-mail and was promptly made available.

The dataset explored here is the Stream Sediments Geochemical Samples from the DZV, a program conducted in the northwest region of Minas Gerais State, Brazil, and published in 2017 to aid exploration and development of the mineral industry focused on **zinc ore**.

This dataset consists of `stream sediments`, which are sediment samples collected from a stream or body of water for geochemical analysis. Stream sediment sampling is considered a good first-order approximation for mineral exploration, as catchment lithology (or, in layman's terms, the type of rock in the drainage area) is considered to be the main control on stream sediment geochemistry and can therefore indicate a mineral deposit upstream of the sample location.

### 1.2 Meaning of the Dataset

#### Context

The `DZV` lies within the Brasília Belt, more specifically in its center-south part (Almeida, 1967 & 1968). This geological terrain was formed in the Brasiliano Cycle, during the Neoproterozoic (1 Ga to 500 Ma), when the São Francisco, Amazonas and Paranapanema cratons collided. This collision was one of many that created the large southern paleocontinent Gondwana (Valeriano et al., 2004; Pimentel et al., 2011).

Cratons are solid, stable parts of the continental lithosphere. Because they are less dense than oceanic crust, cratons can resist tectonic processes such as rifting and subduction. As a consequence, these resistant blocks of rock are commonly found in the interior of tectonic plates and continents (Petit, 2010). Figure 1 shows the main cratons of South America and Africa when these continents were joined in the supercontinent Pangea.

|![cratons-1.svg](./images/Cratons_West_Gondwana.svg)|
|:--:|
|*Figure 1: Cratons of South America and Africa, shown in dark brown. Source: [Woudloper, 2010](https://en.wikipedia.org/wiki/Craton#/media/File:Cratons_West_Gondwana.svg)*|

The collision of these Brazilian cratons produced the Brasília Belt (Figure 2), as stated above. It generated three geological compartments of the fold belt: the Inner Zone, the External Zone and the Cratonic Domain. The area of interest (AOI) is located in the External Zone, highlighted by the purple box in Figure 2, so that is where we focus.

|![brasilia-belt.png](./images/brasilia_belt_draw.png)|
| :--: |
|*Figure 2: Geological setting and position of the Brasília Belt. It is a strip that connects the São Francisco Craton and the Paranapanema Craton, the latter now lying beneath the Paraná Basin. The stream sediment samples are located in the region of the purple square, in the cities of Paracatu, Vazante and Coromandel. Source: Valeriano, 2016.*|

The `DZV` is located in the External Zone of the Brasília Belt. The main Zn mineralizations are within the Vazante Group (Dias et al., 2015). In geology, a group is a set of geologic formations, and each formation is characterized by a few lithologies. The Vazante Group comprises three formations: Lapa Fm, Serra do Poço and Morro do Calcário Fm, and Serra do Garrote Fm. They are always listed from the youngest to the oldest, as shown in Figure 3. Groups and formations are defined by the similarity of their lithologies, which distinguishes them from the surrounding rock bodies, and by the particular position they occupy in the layers of rock exposed in a geographical region (Boggs, 1987).

| Figure 3 |
|:--:|
|__Figure 3: Vazante Group__|


In total, there are 6 mines extracting Zn ore in the `DZV`, mainly as the mineral willemite (Zn2SiO4, zinc silicate). In some areas, the concentration of Zn can reach 8000 ppm.

#### Objectives

With this background in mind, the objective of this project is to identify prospective Zn-rich areas with the aid of machine learning regression algorithms.


### 1.3 Explanatory Variables
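The explanatory variables are the 50 element concentrations other than Zn reported for each sample (for example `Ag (ppm)`, `Al (%)` and `Pb (ppm)`), all numeric and expressed in ppm or %. The cell below lists their names, types and count.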


In [None]:
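import pandas as pd

# Load the cleaned dataset; keep the numeric columns after the ID/coordinate fields
df = pd.read_csv('pontos_limpo.csv')
numeric_columns = df.iloc[:, 4:].select_dtypes(include='number')

# Every numeric column except the response (Zn) is an explanatory variable
feature_names = [col for col in numeric_columns.columns if col != "Zn (ppm)"]
X = numeric_columns[feature_names]

# Print the feature names, their types, and how many there are
print("Feature names of X:", feature_names)
print("\nType of X values:")
print(X.dtypes)
print("\nTotal number of explanatory variables:")
print(len(feature_names))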


### 1.4 Response Variable
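The response variable is `Zn (ppm)`, the zinc concentration of each sample in parts per million. It is a continuous numeric variable, which makes this a regression problem.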


In [None]:
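import pandas as pd

# The response variable is the zinc concentration of each sample
df = pd.read_csv('pontos_limpo.csv')
y = df["Zn (ppm)"]

# Print the name, type, and summary statistics of the response variable
print("Response variable:", y.name)
print("\nType of y values:")
print(y.dtype)
print("\nSummary of y:")
print(y.describe())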


## 2. Visual Exploration

### 2.1 Visualizing the Dataset
Provide a few figures to help understand the dataset.
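
Since the goal is spatial prediction, one figure that could help is a map of the sample locations coloured by Zn concentration. The sketch below is only an illustration: it assumes the column names `Long__X_`, `Lat__Y_` and `Zn (ppm)` from the data preview, uses a hypothetical helper named `plot_zn_map`, and runs on the first four rows of that preview rather than on the full `pontos_limpo.csv`.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display; drop in a notebook
import matplotlib.pyplot as plt
import pandas as pd


def plot_zn_map(samples: pd.DataFrame):
    """Scatter the sample locations, coloured by Zn concentration (hypothetical helper)."""
    fig, ax = plt.subplots(figsize=(8, 6))
    points = ax.scatter(
        samples["Long__X_"],
        samples["Lat__Y_"],
        c=samples["Zn (ppm)"],
        cmap="viridis",
        s=25,
    )
    fig.colorbar(points, ax=ax, label="Zn (ppm)")
    ax.set_xlabel("Long__X_")
    ax.set_ylabel("Lat__Y_")
    ax.set_title("Zn concentration at sample locations")
    return fig, ax


# First four rows of the data preview; replace with the full DataFrame `df`
demo = pd.DataFrame({
    "Long__X_": [248757, 244460, 244044, 242895],
    "Lat__Y_": [7972050, 7973135, 7970217, 7970593],
    "Zn (ppm)": [27, 58, 34, 27],
})
fig, ax = plot_zn_map(demo)
```

With the full dataset, a map like this may reveal clusters of high Zn values that a histogram alone cannot show.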

```python
# Code for visualizations (e.g., matplotlib, seaborn)
```


In [None]:
# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


In [5]:

# Load the dataset
file_path = 'pontos_limpo.csv'
df = pd.read_csv(file_path)

# Display the first few rows of the dataset
df.head()


Unnamed: 0,Estação,N__Lab_,Long__X_,Lat__Y_,Folha,Ag (ppm),Al (%),As (ppm),Au (ppm),B (ppm),...,Ta (ppm),Te (ppm),Th (ppm),Ti (%),U (ppm),V (ppm),W (ppm),Y (ppm),Zn (ppm),Zr (ppm)
0,AC-0002,CDE225,248757,7972050,Coromandel,0.03,2.63,4.0,0.05,5,...,25.0,0.06,7.4,0.03,0.7,73,0.3,13.15,27,6.3
1,AC-0003,CDE226,244460,7973135,Coromandel,0.02,1.93,2.0,0.05,5,...,25.0,0.14,8.2,0.06,0.94,58,0.3,23.9,58,6.9
2,AC-0004,CDE227,244044,7970217,Coromandel,0.04,1.47,3.0,0.05,5,...,25.0,0.08,5.4,0.04,0.65,55,0.2,10.4,34,1.8
3,AC-0005,CDE228,242895,7970593,Coromandel,0.05,1.72,23.0,0.05,5,...,25.0,25.0,6.9,0.04,1.01,66,0.7,9.34,27,4.1
4,AC-0006,CDE229,242999,7971416,Coromandel,0.04,0.97,7.0,0.05,5,...,25.0,25.0,5.7,0.05,0.82,41,0.4,7.45,28,1.5


In [9]:
# Keep the numeric columns that follow the identifier and coordinate fields
numeric_columns = df.iloc[:, 4:].select_dtypes(include='number')
numeric_columns.head()

Unnamed: 0,Ag (ppm),Al (%),As (ppm),Au (ppm),B (ppm),Ba (ppm),Be (ppm),Bi (ppm),Ca (%),Cd (ppm),...,Ta (ppm),Te (ppm),Th (ppm),Ti (%),U (ppm),V (ppm),W (ppm),Y (ppm),Zn (ppm),Zr (ppm)
0,0.03,2.63,4.0,0.05,5,76.0,1.0,0.31,0.05,0.03,...,25.0,0.06,7.4,0.03,0.7,73,0.3,13.15,27,6.3
1,0.02,1.93,2.0,0.05,5,84.0,1.7,0.29,0.04,0.01,...,25.0,0.14,8.2,0.06,0.94,58,0.3,23.9,58,6.9
2,0.04,1.47,3.0,0.05,5,70.0,0.7,0.19,0.11,0.06,...,25.0,0.08,5.4,0.04,0.65,55,0.2,10.4,34,1.8
3,0.05,1.72,23.0,0.05,5,80.0,1.1,0.32,0.04,0.08,...,25.0,25.0,6.9,0.04,1.01,66,0.7,9.34,27,4.1
4,0.04,0.97,7.0,0.05,5,56.0,0.5,0.22,0.03,0.05,...,25.0,25.0,5.7,0.05,0.82,41,0.4,7.45,28,1.5


In [None]:

# Plot histograms for numeric columns
for column in numeric_columns.columns:
    plt.figure(figsize=(8, 6))
    sns.histplot(df[column], bins=20, kde=True)
    plt.title(f'Histogram of {column}')
    plt.xlabel(column)
    plt.ylabel('Frequency')
    plt.show()


In [10]:
numeric_columns.describe()

Unnamed: 0,Ag (ppm),Al (%),As (ppm),Au (ppm),B (ppm),Ba (ppm),Be (ppm),Bi (ppm),Ca (%),Cd (ppm),...,Ta (ppm),Te (ppm),Th (ppm),Ti (%),U (ppm),V (ppm),W (ppm),Y (ppm),Zn (ppm),Zr (ppm)
count,709.0,709.0,709.0,709.0,709.0,709.0,709.0,709.0,709.0,709.0,...,709.0,709.0,709.0,709.0,709.0,709.0,709.0,709.0,709.0,709.0
mean,1.539746,1.984654,5.782087,0.096121,7.495063,126.69464,1.354443,0.258166,0.288166,0.256276,...,22.602257,15.349901,11.531312,1.381001,1.431227,66.693935,0.369041,9.807955,37.609309,11.836671
std,2.257801,1.448399,4.563129,1.016514,19.285443,133.159244,1.537795,0.177348,0.998065,0.94915,...,7.311806,12.13193,14.197642,2.184772,1.341452,47.287505,0.712671,6.568492,20.887808,13.266331
min,0.01,0.27,0.5,0.05,5.0,2.5,0.05,0.04,0.01,0.01,...,0.05,0.05,0.9,0.01,0.21,9.0,0.05,0.88,3.0,0.25
25%,0.05,1.1,3.0,0.05,5.0,68.0,0.7,0.15,0.04,0.03,...,25.0,0.1,5.7,0.02,0.77,38.0,0.1,5.64,25.0,4.6
50%,0.08,1.54,5.0,0.05,5.0,92.0,1.1,0.22,0.06,0.05,...,25.0,25.0,7.6,0.06,1.04,53.0,0.2,8.81,34.0,7.8
75%,5.0,2.37,8.0,0.05,5.0,137.0,1.5,0.32,0.09,0.08,...,25.0,25.0,10.7,5.0,1.53,81.0,0.4,12.09,47.0,13.1
max,6.3,11.09,41.0,26.9,188.0,1750.0,30.5,2.85,5.0,5.0,...,25.0,25.0,120.7,5.0,12.02,559.0,12.8,57.9,194.0,111.0


In [14]:
all_column_names = numeric_columns.columns.tolist()
len(all_column_names)


51

In [17]:
feature_names = [col for col in all_column_names if col != "Zn (ppm)"]

# Set X and y variables
X = numeric_columns[feature_names]
y = numeric_columns["Zn (ppm)"]

# Print feature names, X, and y
print("Feature Names:", feature_names)
print("\nX Variable (Features):")
print(X.head())
print("\ny Variable (Target):")
print(y)

Feature Names: ['Ag (ppm)', 'Al (%)', 'As (ppm)', 'Au (ppm)', 'B (ppm)', 'Ba (ppm)', 'Be (ppm)', 'Bi (ppm)', 'Ca (%)', 'Cd (ppm)', 'Ce (ppm)', 'Co (ppm)', 'Cr (ppm)', 'Cs (ppm)', 'Cu (ppm)', 'Fe (%)', 'Ga (ppm)', 'Ge (ppm)', 'Hf (ppm)', 'Hg (ppm)', 'In (ppm)', 'K (%)', 'La (ppm)', 'Li (ppm)', 'LREE (ppm)', 'Mg (%)', 'Mn (ppm)', 'Mo (ppm)', 'Na (%)', 'Nb (ppm)', 'Ni (ppm)', 'P (ppm)', 'Pb (ppm)', 'Rb (ppm)', 'Re (ppm)', 'S (%)', 'Sb (ppm)', 'Sc (ppm)', 'Se (ppm)', 'Sn (ppm)', 'Sr (ppm)', 'Ta (ppm)', 'Te (ppm)', 'Th (ppm)', 'Ti (%)', 'U (ppm)', 'V (ppm)', 'W (ppm)', 'Y (ppm)', 'Zr (ppm)']

X Variable (Features):
   Ag (ppm)  Al (%)  As (ppm)  Au (ppm)  B (ppm)  Ba (ppm)  Be (ppm)  \
0      0.03    2.63       4.0      0.05        5      76.0       1.0   
1      0.02    1.93       2.0      0.05        5      84.0       1.7   
2      0.04    1.47       3.0      0.05        5      70.0       0.7   
3      0.05    1.72      23.0      0.05        5      80.0       1.1   
4      0.04    0.97   

## 3. Statistical Exploration


### 3.1 Descriptive Statistical Analysis
Conduct a descriptive statistical analysis of the dataset.
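
In the `describe()` output of Section 2.1, the mean of several elements sits well above the median (for example Ba: about 126.7 against 92.0), which hints at right-skewed distributions. A minimal sketch of how skewness could be added to the summary is given below; the helper name `summarize` is hypothetical, and the demo values are the first five `As (ppm)` and `Zn (ppm)` entries from the data preview, not the full dataset.

```python
import pandas as pd


def summarize(numeric: pd.DataFrame) -> pd.DataFrame:
    """Per-column summary with median and skewness added (hypothetical helper)."""
    summary = numeric.describe().T          # one row per element
    summary["median"] = numeric.median()
    summary["skew"] = numeric.skew()        # > 0 means a longer right tail
    return summary.sort_values("skew", ascending=False)


# First five samples from the data preview, for illustration only
demo = pd.DataFrame({
    "As (ppm)": [4.0, 2.0, 3.0, 23.0, 7.0],
    "Zn (ppm)": [27, 58, 34, 27, 28],
})
print(summarize(demo)[["mean", "median", "skew"]])
```

On the real data this would be called as `summarize(numeric_columns)`; columns near the top of the result are the most likely candidates for a transform.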

### 3.2 Correlation Analysis
Determine the potential correlation between variables and comment on its implications for machine learning.
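
One way to approach this is a rank (Spearman) correlation matrix, which is less sensitive to skewed distributions than Pearson's coefficient, followed by a list of strongly correlated feature pairs that may carry redundant information for a regression model. The sketch below assumes a numeric DataFrame such as `X`; the helper name `strong_pairs`, the 0.8 threshold and the synthetic demo data are assumptions made for illustration.

```python
from itertools import combinations

import numpy as np
import pandas as pd


def strong_pairs(features: pd.DataFrame, threshold: float = 0.8) -> pd.DataFrame:
    """List feature pairs whose absolute Spearman correlation exceeds `threshold`."""
    # Spearman correlation is the Pearson correlation of the ranks
    corr = features.rank().corr()
    # Visit each unordered pair of columns exactly once
    rows = [(a, b, corr.loc[a, b]) for a, b in combinations(corr.columns, 2)]
    pairs = pd.DataFrame(rows, columns=["feature_a", "feature_b", "spearman"])
    strong = pairs[pairs["spearman"].abs() > threshold]
    return strong.sort_values(
        "spearman", key=lambda s: s.abs(), ascending=False
    ).reset_index(drop=True)


# Synthetic demo data (not the survey data): b follows a, c is independent noise
rng = np.random.default_rng(0)
a = rng.lognormal(size=200)
demo = pd.DataFrame({
    "a": a,
    "b": a * 2.0 + rng.normal(scale=0.01, size=200),
    "c": rng.normal(size=200),
})
print(strong_pairs(demo))
```

In the demo only the `a`/`b` pair should be reported. Pairs flagged this way on the real features are candidates for dropping one member or for using a model that tolerates collinearity.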

### 3.3 Pre-processing
Introduce potential pre-processing steps (e.g., handling outliers, missing values, normalization).
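
As an illustration of the steps named above, the sketch below flags outliers with the interquartile-range rule, fills missing values with the column median, applies a `log1p` transform to reduce right skew, and standardizes the result. These are common choices, not necessarily the ones this project will adopt; the helper names `iqr_outlier_mask` and `preprocess` are hypothetical, the demo values are made up, and the transform assumes non-negative concentrations.

```python
import numpy as np
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """True where a value lies more than k * IQR outside the quartiles."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


def preprocess(features: pd.DataFrame) -> pd.DataFrame:
    """Median-impute, log-transform, and standardize each column."""
    filled = features.fillna(features.median())
    logged = np.log1p(filled)                      # assumes non-negative values
    return (logged - logged.mean()) / logged.std(ddof=0)


# Illustrative values only (not the survey data)
demo = pd.DataFrame({
    "Ba (ppm)": [76.0, 84.0, 70.0, 80.0, 1750.0],
    "As (ppm)": [4.0, 2.0, np.nan, 23.0, 7.0],
})
print(iqr_outlier_mask(demo["Ba (ppm)"]))
print(preprocess(demo).round(2))
```

In a modelling pipeline, the medians, means and standard deviations would be computed on the training split only and then reused on the test split, so that no information from the test data leaks into the transformation.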

In [1]:
# Code for statistical analysis and pre-processing


## 4. Evaluation Protocol


### 4.1 Dataset Splitting
Explain how you will split the dataset into training and testing data to avoid data leakage.
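
Because neighbouring stream-sediment samples may resemble each other, a purely random split can place close relatives of training samples in the test set. One option, sketched below, is to hold out whole groups of samples instead of individual rows. The example groups by the `Folha` (map sheet) column seen in the data preview; whether that is the right grouping for this project is an assumption, as are the helper name `group_train_test_split`, the 30 % test share, and the demo rows (the sheet names other than Coromandel are invented).

```python
import numpy as np
import pandas as pd


def group_train_test_split(data: pd.DataFrame, group_col: str,
                           test_share: float = 0.3, seed: int = 0):
    """Split rows so that every group falls entirely in train or in test."""
    rng = np.random.default_rng(seed)
    groups = sorted(data[group_col].unique())      # list of group labels
    order = rng.permutation(len(groups))           # random order of the groups
    n_test = max(1, round(test_share * len(groups)))
    test_groups = {groups[i] for i in order[:n_test]}
    is_test = data[group_col].isin(test_groups)
    return data[~is_test], data[is_test]


# Demo rows for illustration only; "Sheet B" and "Sheet C" are invented names
demo = pd.DataFrame({
    "Folha": ["Coromandel"] * 4 + ["Sheet B"] * 3 + ["Sheet C"] * 3,
    "Zn (ppm)": [27, 58, 34, 27, 28, 40, 31, 55, 22, 47],
})
train, test = group_train_test_split(demo, "Folha")
print(sorted(train["Folha"].unique()), sorted(test["Folha"].unique()))
```

scikit-learn's `GroupShuffleSplit` implements the same idea and could replace this helper; a grid of spatial blocks built from the coordinates would be another way to define the groups.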

### 4.2 Evaluation Metrics
Present the main evaluation metric and any additional metrics for model comparison.
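
For a regression target such as Zn concentration, common candidates are the root mean squared error (RMSE), the mean absolute error (MAE) and the coefficient of determination (R²). The sketch below computes them with NumPy; which one serves as the main metric is left open here, the function name `regression_metrics` is hypothetical, and the predicted values are made up.

```python
import numpy as np


def regression_metrics(y_true, y_pred) -> dict:
    """Return RMSE, MAE and R² for paired observed and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    rmse = float(np.sqrt(np.mean(residuals ** 2)))
    mae = float(np.mean(np.abs(residuals)))
    ss_res = float(np.sum(residuals ** 2))                    # unexplained variation
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))     # total variation
    return {"rmse": rmse, "mae": mae, "r2": 1.0 - ss_res / ss_tot}


# Observed Zn values from the data preview and made-up predictions (ppm)
observed = [27, 58, 34, 27, 28]
predicted = [30, 50, 35, 25, 30]
print(regression_metrics(observed, predicted))
```

RMSE penalizes large errors more heavily than MAE, which matters when a few high-Zn samples are the ones of greatest exploration interest; both are expressed in ppm, while R² is unitless.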

## 5. References

[Petit 2010](https://onlinelibrary.wiley.com/doi/10.1002/scin.5591781325)

[Valeriano 2016](https://link.springer.com/chapter/10.1007/978-3-319-01715-0_10)

Boggs 1987: Boggs, Sam Jr. (1987). Principles of sedimentology and stratigraphy (1st ed.). Merrill Pub. Co. ISBN 0675204879.