## Data Engineering Capstone Project
#### Project Summary
The data sources are aggregated with Spark SQL, and matplotlib is used to plot the results.
The project follows these steps:
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

In [9]:
import os
import gc
import logging
from datetime import datetime
from sys import stdout
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql import functions as f

In [10]:
def setting_log(flag_stdout=True, flag_logfile=False):
    """
    Applies log settings and returns a logging object.
    :flag_stdout: boolean
    :flag_logfile: boolean
    """
    handler_list = list()
    LOGGER = logging.getLogger()
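    # Iterate over a copy: removing handlers while looping over the live list skips entries.
    for handler in list(LOGGER.handlers):
        LOGGER.removeHandler(handler)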
    if flag_logfile:
        path_log = './logs/{}_{:%Y%m%d}.log'.format('log', datetime.now())
        if not os.path.isdir('./logs'):
            os.makedirs('./logs')
        handler_list.append(logging.FileHandler(path_log))
    if flag_stdout:
        handler_list.append(logging.StreamHandler(stdout))
    logging.basicConfig(
        level=logging.INFO,
        format='[%(asctime)s] {%(filename)s:%(lineno)d} %(levelname)s - %(message)s',
        handlers=handler_list)
    return LOGGER

In [11]:
gc.enable()
LOGGER = setting_log()
JOB_NAME = 'spark_dend_capstone'

In [12]:
spark = SparkSession\
    .builder\
    .appName(JOB_NAME)\
    .enableHiveSupport()\
    .getOrCreate()

In [13]:
LOGGER.info('Set ERROR level JVM logger')
logger_jvm = spark._jvm.org.apache.log4j
logger_jvm.LogManager.getLogger("org").setLevel(logger_jvm.Level.ERROR)
logger_jvm.LogManager.getLogger("akka").setLevel(logger_jvm.Level.ERROR)

[2019-09-27 22:00:52,938] {<ipython-input-13-4f1dd5897f31>:1} INFO - Set ERROR level JVM logger


### Step 1: Scope the Project and Gather Data
The scope of this project is to create a Spark job that reads public datasets persisted to disk and provides clean, reliable data in a dimensional model.

#### Describe and Gather Data
1. **Crime Data in Brazil:** comes from [Kaggle](https://www.kaggle.com/inquisitivecrow/crime-data-in-brazil). All crime data for 10 years of police work in the largest city in South America.
2. **News of the Brazilian Newspaper:** comes from [Kaggle](https://www.kaggle.com/marlesson/news-of-the-site-folhauol). 167,053 news articles from the Folha de São Paulo site (a Brazilian newspaper).
3. **Current Properati Listing Information:** comes from [Kaggle](https://www.kaggle.com/properati-data/properties). Property attributes of 1.5 million Latin American listings.

In [4]:
df_crime = spark.read.option('mergeSchema', 'true').option('header', 'true').csv('./data/crime_data_br/*.csv')

In [5]:
df_crime.show()

+------+------+------------+--------------------+--------------------+--------------------+----------------------+--------------------+--------------------+----+---+------------------+------------------+-------------+--------------------+--------------------+----------+--------+---------+--------------------+--------------------+-----------------+-------------+--------------------+-----------+-----------+------------+--------------------+--------------------+--------------------+----+----+----+
|NUM_BO|ANO_BO|ID_DELEGACIA|   NOME_DEPARTAMENTO|      NOME_SECCIONAL|           DELEGACIA|NOME_DEPARTAMENTO_CIRC| NOME_SECCIONAL_CIRC| NOME_DELEGACIA_CIRC| ANO|MES|DATA_OCORRENCIA_BO|HORA_OCORRENCIA_BO|FLAG_STATUS13|             RUBRICA|       DESDOBRAMENTO|   CONDUTA|LATITUDE|LONGITUDE|              CIDADE|          LOGRADOURO|NUMERO_LOGRADOURO|FLAG_STATUS22|   DESCR_TIPO_PESSOA|CONT_PESSOA|SEXO_PESSOA|IDADE_PESSOA|                 COR|     DESCR_PROFISSAO|DESCR_GRAU_INSTRUCAO|_c30|_c31|_c32|


In [8]:
df_crime.select('_c30', '_c31', '_c32').distinct().show()

+--------------------+----------------+-------------+
|                _c30|            _c31|         _c32|
+--------------------+----------------+-------------+
|1 Grau completo  ...|            null|         null|
|                    |            NULL|         NULL|
|2 Grau completo  ...|            null|         null|
|Preta               |            null|         null|
|                null|                |         NULL|
|                null|            null|         null|
|                    |            NULL|         NULL|
|Outros              |            null|         null|
|Branca              |            null|         null|
|                    |            null|         null|
|                    |            NULL|         NULL|
|                null|            null|             |
|1 Grau incompleto...|            null|         null|
|                    |            NULL|         NULL|
|                NULL|            null|         null|
|Parda               |      
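
The unnamed `_c30` to `_c32` columns (Spark's default name for a column without a header label) mix real nulls, empty strings, and the literal text `NULL`. Values such as `Preta` or `Branca` suggest that some rows may have their fields shifted to the right. A minimal sketch of how such rows could be flagged before deciding whether to drop the extra columns is shown below. The helper names `overflow_columns` and `flag_shifted_rows` and the `has_overflow` column are hypothetical and are not part of the original notebook.

```python
import re
from functools import reduce

# Spark names a column without a header label "_c<position>", e.g. "_c30".
_UNNAMED_COLUMN = re.compile(r"_c\d+")

# Placeholders treated as "no value" (assumption based on the output above).
_NULL_TOKENS = ("", "NULL")


def overflow_columns(columns):
    """Return the column names that follow Spark's default "_c<N>" pattern."""
    return [name for name in columns if _UNNAMED_COLUMN.fullmatch(name)]


def flag_shifted_rows(df, flag_name="has_overflow"):
    """
    Add a boolean column that is True when any unnamed column holds a real value.
    :df: Spark DataFrame read from the CSV files
    :flag_name: name of the (hypothetical) flag column
    """
    # Imported here so overflow_columns() works without a Spark installation.
    from pyspark.sql import functions as f

    extra = overflow_columns(df.columns)
    if not extra:
        return df.withColumn(flag_name, f.lit(False))

    # A cell counts as a real value when it is not null and, after trimming
    # and upper-casing, is neither empty nor the text "NULL".
    checks = [
        f.col(name).isNotNull()
        & ~f.upper(f.trim(f.col(name))).isin(*_NULL_TOKENS)
        for name in extra
    ]
    return df.withColumn(flag_name, reduce(lambda a, b: a | b, checks))
```

Calling `flag_shifted_rows(df_crime).groupBy('has_overflow').count().show()` would then give a rough count of the affected rows, which helps decide whether to repair them or filter them out.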

In [14]:
df_crime.printSchema()

root
 |-- NUM_BO: string (nullable = true)
 |-- ANO_BO: string (nullable = true)
 |-- ID_DELEGACIA: string (nullable = true)
 |-- NOME_DEPARTAMENTO: string (nullable = true)
 |-- NOME_SECCIONAL: string (nullable = true)
 |-- DELEGACIA: string (nullable = true)
 |-- NOME_DEPARTAMENTO_CIRC: string (nullable = true)
 |-- NOME_SECCIONAL_CIRC: string (nullable = true)
 |-- NOME_DELEGACIA_CIRC: string (nullable = true)
 |-- ANO: string (nullable = true)
 |-- MES: string (nullable = true)
 |-- DATA_OCORRENCIA_BO: string (nullable = true)
 |-- HORA_OCORRENCIA_BO: string (nullable = true)
 |-- FLAG_STATUS13: string (nullable = true)
 |-- RUBRICA: string (nullable = true)
 |-- DESDOBRAMENTO: string (nullable = true)
 |-- CONDUTA: string (nullable = true)
 |-- LATITUDE: string (nullable = true)
 |-- LONGITUDE: string (nullable = true)
 |-- CIDADE: string (nullable = true)
 |-- LOGRADOURO: string (nullable = true)
 |-- NUMERO_LOGRADOURO: string (nullable = true)
 |-- FLAG_STATUS22: string (nullabl