## Data Engineering Capstone Project
#### Project Summary
The data sources are aggregated with Spark SQL, and matplotlib is used to plot the results.
The project follows these steps:
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

In [1]:
import os
import gc
import logging
from datetime import datetime
from sys import stdout
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql import functions as f

In [2]:
def setting_log(flag_stdout=True, flag_logfile=False):
    """
    Applies log settings and returns a logging object.
    :flag_stdout: boolean
    :flag_logfile: boolean
    """
    handler_list = list()
    LOGGER = logging.getLogger()
    # Iterate over a copy: removing handlers while looping over the live list skips entries.
    for handler in list(LOGGER.handlers):
        LOGGER.removeHandler(handler)
    if flag_logfile:
        path_log = './logs/{}_{:%Y%m%d}.log'.format('log', datetime.now())
        # Create the log directory on first use.
        os.makedirs('./logs', exist_ok=True)
        handler_list.append(logging.FileHandler(path_log))
    if flag_stdout:
        handler_list.append(logging.StreamHandler(stdout))
    logging.basicConfig(
        level=logging.INFO,
        format='[%(asctime)s] {%(filename)s:%(lineno)d} %(levelname)s - %(message)s',
        handlers=handler_list)
    return LOGGER

In [3]:
gc.enable()
LOGGER = setting_log()
JOB_NAME = 'spark_dend_capstone'

In [7]:
spark = SparkSession\
    .builder\
    .appName(JOB_NAME)\
    .master("local[*]")\
    .enableHiveSupport()\
    .getOrCreate()

In [8]:
LOGGER.info('Set ERROR level JVM logger')
logger_jvm = spark._jvm.org.apache.log4j
logger_jvm.LogManager.getLogger("org").setLevel(logger_jvm.Level.ERROR)
logger_jvm.LogManager.getLogger("akka").setLevel(logger_jvm.Level.ERROR)

[2019-09-29 21:44:13,197] {<ipython-input-8-4f1dd5897f31>:1} INFO - Set ERROR level JVM logger


### Step 1: Scope the Project and Gather Data
The scope of this project is to create a Spark job that reads public datasets persisted to disk and provides clean, reliable data in a dimensional model.

#### Describe and Gather Data
1. **Crime Data in Brazil:** comes from [Kaggle](https://www.kaggle.com/inquisitivecrow/crime-data-in-brazil). Crime data covering 10 years of police work in the biggest city in South America.
2. **News of the Brazilian Newspaper:** comes from [Kaggle](https://www.kaggle.com/marlesson/news-of-the-site-folhauol). 167,053 news articles from the website of Folha de São Paulo (a Brazilian newspaper).
3. **Current Properati Listing Information:** comes from [Kaggle](https://www.kaggle.com/properati-data/properties). Property attributes of 1.5 million Latin American listings.

#### Read the Crime Data in Brazil dataset

In [70]:
READ_PATH_CRIME_DATA = './data/crime_data_br/*.csv'

In [71]:
df_crime_crude = spark.read\
    .option('mergeSchema', 'true')\
    .option('header', 'true')\
    .csv(READ_PATH_CRIME_DATA)

Records with badly formatted strings produce extra columns named `_c#` (here `_c30`, `_c31` and `_c32`). Records affected by this anomaly are discarded from the project because they represent a very low percentage of the data.

In [72]:
df_crime = df_crime_crude.where('(_c30 is null and _c31 is null and _c32 is null)').drop('_c30', '_c31', '_c32')
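
The "very low percentage" of discarded records can be checked with a number rather than taken on trust. The following is a minimal sketch, not part of the original notebook: `discarded_share` is a hypothetical helper that only assumes both arguments expose a `count()` method, as Spark DataFrames do.

```python
def discarded_share(df_before, df_after):
    """Return the fraction of rows in df_before that are absent from df_after."""
    total = df_before.count()
    if total == 0:
        # Avoid dividing by zero on an empty input.
        return 0.0
    return (total - df_after.count()) / total
```

Calling `discarded_share(df_crime_crude, df_crime)` would give the share of rows removed by the filter above. Each call triggers two full counts, so on large inputs it may be worth caching `df_crime_crude` first.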

In [82]:
df_crime.where("SEXO_PESSOA = 'I'").count()

578285

In [83]:
df_crime.where("SEXO_PESSOA = 'M'").count()

9158074
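
Counting one value at a time scans the dataset once per value and can miss nulls or unexpected codes. A single aggregation returns the count for every value of `SEXO_PESSOA` in one pass. This is a minimal sketch under that assumption; `value_counts` is a hypothetical helper, not part of the original notebook.

```python
def value_counts(df, column):
    """Return a dict mapping each distinct value of `column` to its row count.

    The result is collected to the driver, so this is only suitable
    for low-cardinality columns.
    """
    rows = df.groupBy(column).count().collect()
    # groupBy(...).count() yields rows with the grouping column and a 'count' column.
    return {row[column]: row['count'] for row in rows}
```

`value_counts(df_crime, 'SEXO_PESSOA')` should reproduce the counts for `'I'` and `'M'` shown above, alongside any other values present in the column.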