# **Big Data Analysis**

## **Pre-processing**

***In this section we pre-process the data***. Specifically, we ***merge the data*** from the different datasets, *"aligning"* them on the columns *"Entity"* (the country name, renamed to *"Country"* in the code), *"Code"* (the country's identifier code, which follows the ISO 3166-1 alpha-3 convention) and *"Year"* (the year in which the data were collected).

We also *"trim"* the data according to the limiting indicator, in this case the years for which data are available: the ***period from 1990 to 2016*** is the one common to all datasets.
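The trimming and merging steps can be sketched on toy data. This is a minimal illustration, not the notebook's actual pipeline: the two small frames and their values are made up, and the common period is derived from the data rather than hard-coded.

```python
import pandas as pd

# Toy stand-ins for the real datasets (all values are invented for illustration).
pop = pd.DataFrame({
    "Country": ["Portugal", "Portugal", "Portugal"],
    "Code": ["PRT", "PRT", "PRT"],
    "Year": [1989, 1990, 2016],
    "Population": [100, 101, 102],
})
obes = pd.DataFrame({
    "Country": ["Portugal", "Portugal", "Portugal"],
    "Code": ["PRT", "PRT", "PRT"],
    "Year": [1990, 2016, 2017],
    "Obesity": [10.0, 20.0, 21.0],
})

frames = [pop, obes]
keys = ["Country", "Code", "Year"]

# The common period runs from the latest first year to the earliest last year.
first_year = max(df["Year"].min() for df in frames)
last_year = min(df["Year"].max() for df in frames)

# Trim every frame to the common period (both bounds inclusive).
trimmed = [df[df["Year"].between(first_year, last_year)] for df in frames]

# Inner-join the trimmed frames one after another on the shared keys.
merged = trimmed[0]
for df in trimmed[1:]:
    merged = merged.merge(df, on=keys, how="inner")

print(first_year, last_year)
print(merged)
```

Deriving the bounds from the data avoids editing the year limits by hand if a dataset with a different coverage is added later.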

**For comparison purposes**, we merge the datasets with both Pandas and PySpark, so that we can compare the efficiency of the two in different contexts.

### Pre-processing with Pandas

In [None]:
# Import the time module to compare execution times
import time

# Load pandas, used to read the datasets
import pandas as pd

pd_start = time.time()  # start of the pandas timer

In [None]:
# Dataset on world population
pop = pd.read_csv('../datasets/raw/population.csv')
pop.rename(columns={'Entity': 'Country'}, inplace = True)
pop = pop[pop['Year'] >= 1990]
pop = pop[pop['Year'] <= 2016]
pop.head()

In [None]:
# Dataset on obesity
obes = pd.read_csv('../datasets/raw/share-of-adults-defined-as-obese.csv')
obes.rename(columns={'Entity': 'Country'}, inplace = True)
obes = obes[obes['Year'] >= 1990]
obes = obes[obes['Year'] <= 2016] 
obes.head()

In [None]:
# Dataset on mental disorders prevalence
mental = pd.read_csv('../datasets/raw/mental-illnesses-prevalence.csv')
mental.rename(columns={'Entity': 'Country'}, inplace = True)
mental = mental[mental['Year'] >= 1990]
mental = mental[mental['Year'] <= 2016] 
mental.head()

In [None]:
# Merging the dataframes
dataframes = [pop, obes, mental]

fused = dataframes[0]

for dataframe in dataframes[1:]:
    try:
        fused = pd.merge(
            fused,
            dataframe,
            on = ['Country', 'Year', 'Code'],
            how = 'inner'
        )
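    except KeyError:
        # Fallback when a dataset has no 'Code' column. Note that this
        # branch uses an outer join, so unmatched rows are kept with NaN.
        fused = pd.merge(
            fused,
            dataframe,
            on = ['Country', 'Year'],
            how = 'outer'
        )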

# Exporting to CSV (index=False avoids writing the row index as an extra column)
fused.to_csv('../datasets/processed/pd_processed_data.csv', index=False)

In [None]:
pd_end = time.time()
pd_elapsed = pd_end - pd_start  # pandas execution time
print('Pandas took',pd_elapsed,'seconds to process data.')

### Pre-processing with Spark

In [None]:
# Import required libraries
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark_start = time.time()  # start of the Spark timer (includes session start-up)

# Initialize SparkSession
spark = SparkSession.builder \
    .appName("Pandas to PySpark") \
    .getOrCreate()

# Load datasets
pop = spark.read.csv('../datasets/raw/population.csv', header=True)
obes = spark.read.csv('../datasets/raw/share-of-adults-defined-as-obese.csv', header=True)
mental = spark.read.csv('../datasets/raw/mental-illnesses-prevalence.csv', header=True)

# Rename columns
pop = pop.withColumnRenamed('Entity', 'Country')
obes = obes.withColumnRenamed('Entity', 'Country')
mental = mental.withColumnRenamed('Entity', 'Country')

# Merge datasets
fused = pop.join(obes, ['Country', 'Year', 'Code'], 'inner') \
           .join(mental, ['Country', 'Year', 'Code'], 'inner')

# Drop rows with null values
fused = fused.dropna()

# Export to CSV (Spark writes a directory of part files at this path;
# mode("overwrite") lets the cell be re-run without a "path exists" error)
fused.coalesce(1).write.mode("overwrite").option("header", "true").csv('../datasets/processed/spark_processed_data.csv')

spark_elapsed = time.time() - spark_start  # Spark execution time

# Stop SparkSession
spark.stop()


In [None]:
# Compare the computation time of the two approaches
print(f'Pandas took {round(pd_elapsed, 3)} seconds.',
      f'Spark took {round(spark_elapsed, 3)} seconds.',
      sep = '\n')

### Observations
Pandas completed the pre-processing task in approximately 0.15 seconds, whereas Spark took around 6.5 seconds to accomplish the same task. This significant difference in processing time underscores the efficiency of Pandas for smaller datasets, where its lightweight nature and streamlined processes result in faster execution.
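Single wall-clock measurements such as these can vary from run to run. The following is a minimal sketch of a more repeatable timing helper, assuming each pipeline is wrapped in a function. The names `time_call` and `sample_workload` are hypothetical and do not appear elsewhere in this notebook.

```python
import time

def time_call(func, *args, repeats=3, **kwargs):
    """Run func `repeats` times and return (best_seconds, last_result)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution clock
        result = func(*args, **kwargs)
        best = min(best, time.perf_counter() - start)
    return best, result

# Hypothetical stand-in workload; replace it with the pandas or Spark pipeline.
def sample_workload(n):
    return sum(i * i for i in range(n))

elapsed, value = time_call(sample_workload, 10_000)
print(f"Best of 3 runs: {elapsed:.6f} seconds")
```

Taking the best of several runs reduces the influence of one-off costs such as disk caching, although for Spark the session start-up would still dominate unless it is created outside the timed function.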

However, Spark's strength lies in its ability to handle larger-scale datasets efficiently. Although it was slower in our experiment, our colleagues report that Spark outperformed Pandas when processing datasets with millions of rows; we did not reproduce that result here.
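One way to check how the two tools scale would be to time them on synthetic files of increasing size. The generator below is only a sketch under stated assumptions: the function `write_synthetic_csv`, the `Value` column and the random values are hypothetical, while the `Entity`, `Code` and `Year` columns mirror the raw datasets used above.

```python
import csv
import os
import random
import tempfile

def write_synthetic_csv(path, n_countries, first_year=1990, last_year=2016, seed=0):
    """Write one row per (country, year) pair and return the number of data rows."""
    rng = random.Random(seed)  # seeded so that repeated runs produce the same file
    rows = 0
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["Entity", "Code", "Year", "Value"])
        for i in range(n_countries):
            for year in range(first_year, last_year + 1):
                value = round(rng.uniform(0, 100), 2)
                writer.writerow([f"Country{i}", f"C{i:05d}", year, value])
                rows += 1
    return rows

# Example: 100 synthetic countries over the 27 years from 1990 to 2016.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "synthetic.csv")
    n_rows = write_synthetic_csv(path, n_countries=100)
    print(n_rows)
```

Increasing `n_countries` by factors of ten gives a series of files on which both pipelines can be timed.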

Therefore, while Spark may not be the optimal choice for every pre-processing task, particularly for smaller datasets, its capabilities shine when dealing with large-scale data operations. The selection of pre-processing tools should be tailored to the specific requirements and characteristics of the dataset, ensuring optimal performance and efficiency.

## Loading data to database (MongoDB)

In [None]:
from pymongo.mongo_client import MongoClient
from pymongo.server_api import ServerApi

uri = 'YOUR URL'  # replace with your MongoDB connection string

# Create a new client and connect to the server
client = MongoClient(uri, server_api=ServerApi('1'))  # '1' is the Stable API version accepted by ServerApi

# Send a ping to confirm a successful connection
try:
    client.admin.command('ping')
    print("Pinged your deployment. You successfully connected to MongoDB!")
except Exception as e:
    print(e)

# Access database
db = client.get_database("BigData")

# Access/create collection
collection = db.get_collection("ObesPovMen")
collection

# Read CSV file using pandas
csv_file = "../datasets/processed/pd_processed_data.csv"
data = pd.read_csv(csv_file)
data.head()

In [None]:
# Convert DataFrame to dictionary
data_dict = data.to_dict(orient='records')
print(data_dict[:5])  # preview the first records instead of printing the whole dataset

In [None]:
# Insert data into MongoDB collection
collection.insert_many(data_dict)

# Close connection
client.close()
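As an alternative to the single `insert_many` call, the records could be cleaned and sent in batches. This is a sketch under assumptions rather than the method used above: `clean_record`, `chunked`, `insert_in_batches` and `FakeCollection` are hypothetical helpers, and the fake collection only stands in for a real pymongo collection so that the example runs without a server. Replacing NaN with `None` is a design choice, since missing values left by the merge would otherwise be stored as NaN doubles rather than nulls.

```python
import math

def clean_record(record):
    """Return a copy of record with float NaN values replaced by None."""
    return {
        key: None if isinstance(value, float) and math.isnan(value) else value
        for key, value in record.items()
    }

def chunked(records, size):
    """Yield successive lists of at most `size` records."""
    for start in range(0, len(records), size):
        yield records[start:start + size]

def insert_in_batches(collection, records, batch_size=1000):
    """Insert cleaned records batch by batch and return the number sent."""
    sent = 0
    for batch in chunked([clean_record(r) for r in records], batch_size):
        collection.insert_many(batch)  # same method name as a pymongo collection
        sent += len(batch)
    return sent

class FakeCollection:
    """Hypothetical in-memory stand-in for a MongoDB collection."""
    def __init__(self):
        self.docs = []

    def insert_many(self, docs):
        self.docs.extend(docs)

# Invented sample rows; in the notebook these would come from data_dict.
rows = [
    {"Country": "Portugal", "Year": 1990, "Obesity": float("nan")},
    {"Country": "Portugal", "Year": 1991, "Obesity": 11.5},
    {"Country": "Spain", "Year": 1990, "Obesity": 12.0},
]
fake = FakeCollection()
print(insert_in_batches(fake, rows, batch_size=2))
```

With a live connection, `fake` would be replaced by the `collection` object, and the call would have to run before `client.close()`.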