## Data Engineering Capstone Project
#### Introduction
This project will bring together criminal facts of data, news and distinct propriedados information in the same locality, São Paulo / Brazil, so that the result of this project is a clean and reliable basis and can be used for statist modeling and business intelligence.

The project follows the follow steps:
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

In [12]:
import os
import gc
import logging
from datetime import datetime
from sys import stdout
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql import functions as f

In [111]:
def normalize_columns_name(_df):
    """
    Recive Spark data frame and return data frame with clean column names
    """
    cols = _df.columns
    df_result = _df
    for c in cols:
        df_result = df_result.withColumnRenamed(c, c.lower().replace(' ', '_'))
    
    return df_result

def replace_ncols(_df):
    """
    Recive Spark data frame and return data frame with replaced null string values
    """
    col_types = _df.dtypes
    df_result = _df
    
    for item in col_types:
        if item[1] == 'string':
            df_result = df_result.withColumn(item[0]\
                                            ,f.when(f.col(item[0]).isin('NULL','NaN', 'NA'), None)\
                                            .otherwise(item[0]))
    return df_result
    

In [13]:
# Enable garbage collector in Py4j
gc.enable()
JOB_NAME = 'spark_dend_capstone'

In [14]:
spark = SparkSession\
    .builder\
    .appName(JOB_NAME)\
    .master("local[*]")\
    .enableHiveSupport()\
    .getOrCreate()

In [15]:
# Due to the verbosity of log Spark, just leave related error messages org and akka
logger_jvm = spark._jvm.org.apache.log4j
logger_jvm.LogManager.getLogger("org"). setLevel(logger_jvm.Level.ERROR)
logger_jvm.LogManager.getLogger("akka").setLevel(logger_jvm.Level.ERROR)

### Step 1: Scope the Project and Gather Data
The scope of this project is to create a Spark task responsible for the data warehouse intake with clean and reliable data in a dimensional model.

#### Describe and Gather Data
1. **Crime Data in Brazil:** comes from [Kaggle](https://www.kaggle.com/inquisitivecrow/crime-data-in-brazil). All crime data for 10 years of police work in the biggest city of South America.
2. **News of the Brazilian Newspaper:** comes from [Kaggle](https://www.kaggle.com/marlesson/news-of-the-site-folhauol). 167.053 news of the site Folha de São Paulo (Brazilian Newspaper)
3. **Current Properati Listing Information:** comes from [Kaggle](https://www.kaggle.com/properati-data/properties). Property attributes of 1.5 million Latin American listings.

In [16]:
READ_PATH_CRIME_DATA = './data/crime_data_br/*.csv'
READ_PATH_NEWS = './data/news_folhauol/*.csv'
READ_PATH_PROPERTIES = './data/properties_br/*.csv'

#### Read dataset Crime data in Brazil and drop tuples duplicates

In [114]:
df_crime_crude = spark.read\
    .option('mergeSchema', 'true')\
    .option('header', 'true')\
    .option('inferSchema', 'true')\
    .csv(READ_PATH_CRIME_DATA)\
    .dropDuplicates()

df_crime_crude = normalize_columns_name(df_crime_crude)

In [82]:
df_crime_crude.count()

17107839

#### Replace strings 'NULL','NaN', 'NA' to null type

In [112]:
df_crime = replace_ncols(df_crime_crude)

#### The columns are recorded as _col# records of badly formatted strings to our project records affected by these anomalies will be discarded because they represent a very low percentage.

In [None]:
df_crime = df_crime.where('(_c30 is null and _c31 is null and _c32 is null)').drop('_c30', '_c31', '_c32')

#### Normalizing the sexo_pessoa stands for F (female)

In [126]:
df_crime = df_crime.withColumn('sexo_pessoa', f.when(f.col('sexo_pessoa') == 'I', 'F').otherwise(f.col('sexo_pessoa')))

#### Validation if columns have only numeric data, otherwise values will be null

In [80]:
df_crime.select('latitude', 'longitude').where("latitude rlike '[a-z]'").distinct().show(10, False)

+--------+---------+
|latitude|longitude|
+--------+---------+
+--------+---------+



In [119]:
df_crime.select('flag_status13', 'flag_status22').show()

+-------------+-------------+
|flag_status13|flag_status22|
+-------------+-------------+
|        18:00|          972|
|        21:33|         1600|
|        21:20|            0|
|        22:00|          900|
|         NULL|          220|
|        17:35|           50|
|        21:40|          154|
|        11:10|          231|
|         NULL|           74|
|        21:40|         NULL|
|        20:05|          200|
|        04:00|          483|
|         NULL|         2040|
|         NULL|         2913|
|         NULL|        99999|
|        21:05|          226|
|        03:40|          365|
|         NULL|          682|
|        22:30|         2927|
|        22:30|         4000|
+-------------+-------------+
only showing top 20 rows



#### Read dataset News of the Brazilian Newspaper

In [22]:
df_news_crude = spark.read\
    .option('mergeSchema', 'true')\
    .option('header', 'true')\
    .option('quote', '"')\
    .option('escape', '"')\
    .csv(READ_PATH_NEWS)
df_news_crude = normalize_columns_name(df_news_crude)

Total records with malformed delimiter is almost 4% to this project this percentage is acceptable.

In [23]:
df_news = df_news_crude.where('date is not null')

In [24]:
df_news = df_news.where('link is not null')

#### Read dataset Properties

In [4]:
df_properties_crude = spark.read\
    .option('mergeSchema', 'true')\
    .option('header', 'true')\
    .option('quote', '"')\
    .option('escape', '"')\
    .csv(READ_PATH_PROPERTIES)
df_properties_crude = normalize_columns_name(df_properties_crude)

In [5]:
df_properties = df_properties_crude.where("lat-lon is not null")

In [11]:
df_properties.select('price_usd_per_m2','price_per_m2').limit(10).toPandas()

Unnamed: 0,price_usd_per_m2,price_per_m2
0,2877.2106944444445,9444.444444444443
1,3595.199440993789,11801.242236024846
2,3026.4511111111115,10000.0
3,2044.1819298245612,6754.3859649122805
4,1089.52242,3600.0
5,590.5269512195122,1951.219512195122
6,590.5269512195122,1951.219512195122
7,1101.8389671361504,3615.023474178404
8,4315.673132743363,14159.29203539823
9,2133.5609,7000.0


In [25]:
df_crime.limit(2).toPandas()

Unnamed: 0,NUM_BO,ANO_BO,ID_DELEGACIA,NOME_DEPARTAMENTO,NOME_SECCIONAL,DELEGACIA,NOME_DEPARTAMENTO_CIRC,NOME_SECCIONAL_CIRC,NOME_DELEGACIA_CIRC,ANO,...,LOGRADOURO,NUMERO_LOGRADOURO,FLAG_STATUS22,DESCR_TIPO_PESSOA,CONT_PESSOA,SEXO_PESSOA,IDADE_PESSOA,COR,DESCR_PROFISSAO,DESCR_GRAU_INSTRUCAO
0,4,2013,900821,DPPC- DEP.POL.PROT A CIDADANIA,DPPC-DEP.POL.PROT.A CIDADANIA,DIISP - 01ª DEL.POL.,DECAP,DEL.SEC.1º CENTRO,03º D.P. CAMPOS ELISEOS,2009,...,R VINTE E QUATRO DE MAIO,35,C,Declarante,1,F,43,,VENDEDOR(A),2 Grau completo
1,23,2014,30613,DEMACRO,DEL.SEC.CARAPICUIBA,DEL.DEF.MUL. BARUERI,DEMACRO,DEL.SEC.CARAPICUIBA,01º D.P. BARUERI,2009,...,EST DAS ROSAS,0,C,Autor,3,M,49,Parda,EMPRESARIO COMERCIAL,
