# Project Title
### Data Engineering Capstone Project

#### Project Summary
--describe your project at a high level--

The project follows these steps:
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up


In [1]:
## NOTES
# Install a pip package in the current Jupyter kernel

#import sys
#!{sys.executable} -m pip install s3fs
#!{sys.executable} -m pip install boto
#!{sys.executable} -m pip install boto3
#!{sys.executable} -m pip install pyspark

In [1]:
# IMPORTS AND INSTALLS

import pandas as pd

from datetime import datetime

from s3_local_io import *
from create_parquet_tables import *
from data_quality_checks import *

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col, count, lit, when, max, lower, countDistinct


In [2]:
# Global names

##This is now in the s3_local_io file
##config = configparser.ConfigParser()
##config.read('dl.cfg')
##os.environ['AWS_ACCESS_KEY_ID']=config['KEYS']['AWS_ACCESS_KEY_ID']
##os.environ['AWS_SECRET_ACCESS_KEY']=config['KEYS']['AWS_SECRET_ACCESS_KEY']


# URL and PATHS to data
bucket_name = 'raul-udacity'
bucket_parquet_path ='/parquet/'
bucket_path = 's3a://'+bucket_name+'/'
local_path = './input_files/'
local_parquet_path = './input_files/parquet_files/'
#S3_URI = "s3a://raul-udacity/"
#s3a vs s3 explanation https://stackoverflow.com/questions/33356041/technically-what-is-the-difference-between-s3n-s3a-and-s3


# Filenames
data_bares = 'bares.csv'
data_restaurantes = 'restaurantes.csv'
data_cafeterias = 'cafeterias.csv'

data_asociaciones = 'AsociacionesJCyL.csv'
data_clubes_deportivos = 'Clubes deportivos.csv'

data_bibliotecas = 'Directorio de Bibliotecas de Castilla y León.json'
data_museos = 'Directorio de Museos de Castilla y León.json'

data_poblacion = 'Cities population per gender age.csv'

# Other available data/filenames we decided not to use
# Poblacion municipio sexo relacion nacimiento residencia.json
# Municipios Origen Nacimiento.csv
# 

# Step 1: Scope the Project and Gather Data

## Scope 
Explain what you plan to do in the project in more detail. What data do you use? What does your end solution look like? What tools did you use? Etc.
See the Scope.md file.

## Describe and Gather Data 
Describe the data sets you're using. Where did they come from? What type of information is included? 
See the Datasources Description.md file:
https://github.com/rantoncuadrado/udacity_capstone_project/blob/main/Datasources%20Description.md

### COPY FILES FROM s3 TO LOCAL

In [None]:
## WE COPY FILES FROM s3 TO LOCAL
## This step is not needed when working directly with s3 files

# Commented out because we don't need to copy the files every time we run the process
# copy_files_s3_to_local(bucket_name, local_path)

# Step 2: Explore and Assess the Data
## Explore the Data 
Identify data quality issues, like missing values, duplicate data, etc.

## Cleaning Steps
Once the files are in the local filesystem, I'll use DataFrames to clean the data and later Spark to manipulate it.

In [3]:
# Create a Spark session

spark_session = SparkSession \
        .builder \
        .appName("Castilla y Leon -> Fact Tables") \
        .getOrCreate()


# This is needed just if we use spark on s3
#spark_session.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.aws.credentials.provider","org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider")
#spark_session.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.access.key",os.environ['AWS_ACCESS_KEY_ID'])
#spark_session.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.secret.key",os.environ['AWS_SECRET_ACCESS_KEY'])


### CLEANING BAR, RESTAURANT, CAFE and CREATING GARITOS TABLE
These 3 files share the same schema.

In [4]:
sparkdf_garitos=create_garitos(spark_session,local_path,[data_bares,data_restaurantes,data_cafeterias])

### CREATION OF CITY / POSTALCODE PARQUET TABLE


In [6]:
sparkdf_postal_codes=create_postal_code(sparkdf_garitos)

In [7]:
df=sparkdf_postal_codes.toPandas()
df.describe(include='all')

|        | county | city   | postal_code |
|--------|--------|--------|-------------|
| count  | 2881   | 2881   | 2881        |
| unique | 9      | 1730   | 2025        |
| top    | León   | Zamora | 24000       |
| freq   | 519    | 33     | 23          |


In [7]:
toparquet_postal_codes(spark_session,local_parquet_path,sparkdf_postal_codes)

### GARITOS CLEANUP

In [18]:
##CHECK FOR NULL VALUES IN KEY COLUMNS
check_sparkdf_not_nulls(sparkdf_garitos,['name','county','city','garito_kind'])

Checking DataFrame. No null values found in column name
Checking DataFrame. No null values found in column county
Checking DataFrame. No null values found in column city
Checking DataFrame. No null values found in column garito_kind


True
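
The helper `check_sparkdf_not_nulls` lives in `data_quality_checks` and its source is not shown in this notebook. Purely as an illustration of the idea, here is a plain-Python sketch of the same kind of check over a list of dicts. The function name `find_null_columns` and the sample rows are hypothetical, and this is not the helper's actual implementation.

```python
def find_null_columns(rows, columns):
    """Return the columns from `columns` that hold at least one None in `rows`."""
    return [c for c in columns if any(row.get(c) is None for row in rows)]


# Hypothetical sample rows, shaped like the garitos table
sample_rows = [
    {"name": "PUB EBANO", "county": "León", "city": "Valderrueda", "postal_code": None},
    {"name": "LA CHISPA", "county": "Palencia", "city": "Grijota", "postal_code": "34192"},
]

# Columns that contain nulls, and a boolean verdict for the key columns only
null_columns = find_null_columns(sample_rows, ["name", "county", "city", "postal_code"])
key_columns_ok = not find_null_columns(sample_rows, ["name", "county", "city"])
```

Returning the offending column names, rather than a single boolean, makes it easier to decide which cleanup step to run next.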

In [11]:
## WE SHOULD NOT CHECK FOR DUPES, AS DUPES ARE PERFECTLY OK
## They mean several bars/restaurants/cafes share the same name in a given city.
## This happens with Burger King and similar chains.
## An example:

a=sparkdf_garitos.select("*").groupBy('name','county','city','garito_kind') \
    .agg(count("name").alias("Total")) \
    .orderBy("Total", ascending=False)

a.head(10)

[Row(name='BURGER KING', county='Valladolid', city='Valladolid', garito_kind='restaurantes', Total=8),
 Row(name='TELEPIZZA', county='Valladolid', city='Valladolid', garito_kind='bares', Total=6),
 Row(name='LA BODEGUILLA', county='León', city='León', garito_kind='bares', Total=5),
 Row(name='GARCIA', county='León', city='León', garito_kind='bares', Total=5),
 Row(name='EL PASO', county='Valladolid', city='Valladolid', garito_kind='bares', Total=4),
 Row(name='TELEPIZZA', county='Burgos', city='Burgos', garito_kind='bares', Total=4),
 Row(name='BERTIZ', county='Burgos', city='Burgos', garito_kind='cafeterias', Total=4),
 Row(name='BURGER KING', county='León', city='León', garito_kind='restaurantes', Total=4),
 Row(name='PAN EL VISO', county='Zamora', city='Zamora', garito_kind='cafeterias', Total=4),
 Row(name='BURGER KING', county='Burgos', city='Burgos', garito_kind='restaurantes', Total=4)]
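
The `groupBy` / `agg(count(...))` / `orderBy` chain above counts rows per key and sorts by that count. For readers less familiar with Spark, a rough plain-Python equivalent using `collections.Counter` could look like this. The sample keys are hypothetical.

```python
from collections import Counter

# Hypothetical sample of (name, county, city, garito_kind) keys
keys = [
    ("BURGER KING", "Valladolid", "Valladolid", "restaurantes"),
    ("BURGER KING", "Valladolid", "Valladolid", "restaurantes"),
    ("TELEPIZZA", "Burgos", "Burgos", "bares"),
]

# Count rows per key and list the most repeated keys first
key_counts = Counter(keys)
top_keys = key_counts.most_common(2)
```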

In [36]:
## GUESS NULL POSTAL CODE WHEN POSSIBLE AND COMPLETE THESE ROWS

# Some rows may have a null postal code. In these cases we look up the county and city
# in the postal code table and, when that city has exactly one postal code,
# we complete the row with it

check_sparkdf_not_nulls(sparkdf_garitos,['postal_code'])

Checking DataFrame. I found null values in column postal_code


False

In [37]:
# We found rows with null postal_code so we put all these rows in a spark dataframe

sparkdf_garitos_pc_null=sparkdf_garitos.select(
            '*'
            ).where(col('postal_code').isNull())

sparkdf_garitos=sparkdf_garitos.select(
            '*'
            ).where(col('postal_code').isNotNull())

sparkdf_garitos_pc_null.head(2)

[Row(name='PUB EBANO', address='C/ JUAN FERRERO Nº 80', county='León', city='Valderrueda', postal_code=None, garito_kind='bares'),
 Row(name='SILVAN', address=None, county='León', city='Torre del Bierzo', postal_code=None, garito_kind='bares')]

In [38]:
# We extract the list of cities with a single postal code (the only ones we can use to guess postal codes)

cities_with_unique_postal_codes=sparkdf_postal_codes.select(
            'city',
            'postal_code'
            ).groupBy("city") \
            .agg(count('postal_code').alias('postal_codes'),
                 max('postal_code').alias('postal_code')) \
            .orderBy('city', ascending=True) \
            .where("postal_codes=1")


cities_with_unique_postal_codes.show(5)   


+----------------+------------+-----------+
|            city|postal_codes|postal_code|
+----------------+------------+-----------+
|          Abades|           1|      40141|
|Abarca de Campos|           1|      34338|
|          Abejar|           1|      42146|
|         Abusejo|           1|      37640|
|  Adrada de Haza|           1|      09462|
+----------------+------------+-----------+
only showing top 5 rows



In [39]:
# Complete garitos that lack a postal code when their city has exactly one postal code

sparkdf_garitos_pc_fixed = sparkdf_garitos_pc_null.join(
    cities_with_unique_postal_codes,
    sparkdf_garitos_pc_null.city == cities_with_unique_postal_codes.city,
    'left').select(
        'name',
        'address',
        'county',
        sparkdf_garitos_pc_null.city,
        cities_with_unique_postal_codes.postal_code,
        'garito_kind'
    )

# show() prints the table itself and returns None, so it should not be wrapped in print()
sparkdf_garitos_pc_fixed.show(15)
# (It seems we were able to fix just 1)


+--------------------+--------------------+---------+--------------------+-----------+-----------+
|                name|             address|   county|                city|postal_code|garito_kind|
+--------------------+--------------------+---------+--------------------+-----------+-----------+
|           PUB EBANO|C/ JUAN FERRERO N...|     León|         Valderrueda|       null|      bares|
|              SILVAN|                null|     León|    Torre del Bierzo|       null|      bares|
|              FARINA|C/ ANTONIO ALMARZ...|    Ávila|               Ávila|       null|      bares|
|              AGUEDA|CAMINO DEL PEREGR...|     León|                León|       null|      bares|
|               VIFER| MAESTRO URIARTE, 25|     León|                León|       null|      bares|
|           LA CHISPA|C/ OBISPO MONTOYA...| Palencia|             Grijota|      34192|      bares|
|              RAMSES|CARRETERA ESTACIO...|    Ávila|         Sanchidrián|       null|      bares|
|         

In [40]:
sparkdf_garitos = (
        sparkdf_garitos.union(sparkdf_garitos_pc_fixed)
    )
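
The fill-in logic of the last few cells (split off the rows without a postal code, find places with exactly one known code, join, then union) can be sketched in plain Python as below. This is an illustration only, not the notebook's Spark code. The function name `fill_postal_codes` and the sample data are hypothetical. One deliberate difference: the sketch keys the lookup on `(county, city)`, whereas the join above matches on `city` alone. That is an assumption made here to stay safe if two counties ever share a city name.

```python
from collections import defaultdict


def fill_postal_codes(rows, postal_codes):
    """Fill a missing postal code when the (county, city) pair has exactly one known code."""
    codes_by_place = defaultdict(set)
    for entry in postal_codes:
        codes_by_place[(entry["county"], entry["city"])].add(entry["postal_code"])

    filled = []
    for row in rows:
        candidates = codes_by_place.get((row["county"], row["city"]), set())
        if row["postal_code"] is None and len(candidates) == 1:
            # Copy the row so the input list is left untouched
            row = {**row, "postal_code": next(iter(candidates))}
        filled.append(row)
    return filled


# Hypothetical postal code table: Grijota has one code, León has several
postal_codes = [
    {"county": "Palencia", "city": "Grijota", "postal_code": "34192"},
    {"county": "León", "city": "León", "postal_code": "24001"},
    {"county": "León", "city": "León", "postal_code": "24003"},
]
rows = [
    {"name": "LA CHISPA", "county": "Palencia", "city": "Grijota", "postal_code": None},
    {"name": "VIFER", "county": "León", "city": "León", "postal_code": None},
]

filled_rows = fill_postal_codes(rows, postal_codes)
```

Rows from cities with several postal codes stay null, which matches what the output above shows for León and Ávila.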

### EXAMPLE QUERIES TO GARITOS

In [7]:
# I want to practice with both pandas DataFrames and Spark DataFrames.
# The summary below shows that some garitos (garito = bar | restaurant | cafe)
# lack an address or a postal code, but none lack a county or a city
df=sparkdf_garitos.toPandas()
df.describe(include='all')

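|        | name     | address        | county | city       | postal_code | garito_kind |
|--------|----------|----------------|--------|------------|-------------|-------------|
| count  | 22487    | 22426          | 22487  | 22487      | 22472       | 22487       |
| unique | 16100    | 21087          | 9      | 1730       | 2025        | 3           |
| top    | LA PLAZA | PLAZA MAYOR, 2 | León   | Valladolid | 24003       | bares       |
| freq   | 87       | 20             | 4942   | 2668       | 369         | 15080       |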


In [43]:
# Playing with Spark Data Frames. Most repeated names.
garitos_name_top = sparkdf_garitos \
    .select("name",'address') \
    .groupBy("name") \
    .agg(count("address").alias("Total")) \
    .orderBy("Total", ascending=False)
garitos_name_top.head(30)


[Row(name='LA PLAZA', Total=87),
 Row(name='AVENIDA', Total=55),
 Row(name='PLAZA', Total=47),
 Row(name='CENTRAL', Total=43),
 Row(name='TELEPIZZA', Total=41),
 Row(name='EL PASO', Total=36),
 Row(name='PISCINAS MUNICIPALES', Total=35),
 Row(name='LA TABERNA', Total=32),
 Row(name='EL RINCON', Total=32),
 Row(name='LA TERRAZA', Total=31),
 Row(name='BURGER KING', Total=31),
 Row(name='CASTILLA', Total=31),
 Row(name='EL CRUCE', Total=30),
 Row(name='LOS ARCOS', Total=30),
 Row(name='LA PARADA', Total=29),
 Row(name='EL PUENTE', Total=28),
 Row(name='LA BODEGUILLA', Total=27),
 Row(name='LA FUENTE', Total=27),
 Row(name='LOS ANGELES', Total=23),
 Row(name='EL MOLINO', Total=23),
 Row(name='EL PARQUE', Total=22),
 Row(name='MANOLO', Total=22),
 Row(name='LAS PISCINAS', Total=21),
 Row(name='LA CASONA', Total=21),
 Row(name='PISCINA MUNICIPAL', Total=21),
 Row(name='LA BODEGA', Total=21),
 Row(name='EL CASTILLO', Total=21),
 Row(name='EL REFUGIO', Total=20),
 Row(name='LA POSADA', Total=

In [5]:
# Playing with Spark Data Frames. 
# Most repeated restaurant names

restaurante_name_top = sparkdf_garitos \
    .select("name",'address','garito_kind') \
    .where("garito_kind='restaurantes'") \
    .groupBy("name",) \
    .agg(count("address").alias("Total")) \
    .orderBy("Total", ascending=False)

print(restaurante_name_top.head(30))



[Row(name='BURGER KING', Total=27), Row(name='TELEPIZZA', Total=18), Row(name='LA POSADA', Total=12), Row(name='EL MOLINO', Total=10), Row(name='LA TABERNA', Total=10), Row(name='AVENIDA', Total=10), Row(name='LA CASONA', Total=10), Row(name="FOSTER'S HOLLYWOOD", Total=9), Row(name='EL CRUCE', Total=8), Row(name='LOS ARCOS', Total=7), Row(name='BURGUER KING', Total=7), Row(name='CASTILLA', Total=7), Row(name='PLAZA', Total=7), Row(name="DOMINO'S PIZZA", Total=7), Row(name="MC DONALD'S", Total=7), Row(name='EL CASTILLO', Total=7), Row(name='LA PARADA', Total=7), Row(name='LA TERRAZA', Total=7), Row(name='CASA PACO', Total=6), Row(name='EL JARDIN', Total=6), Row(name='EL PASO', Total=6), Row(name='LA GRAN MURALLA', Total=6), Row(name='LA ENCINA', Total=6), Row(name='CENTRAL', Total=6), Row(name='LAS NIEVES', Total=6), Row(name='EL CAPRICHO', Total=6), Row(name='LA MURALLA', Total=6), Row(name='EL MESON', Total=6), Row(name='EL MIRADOR', Total=6), Row(name='HOGAR TERCERA EDAD', Total=5)]


In [51]:
# Playing with Spark Data Frames. 
# Counties ordered by number of cafeterias 

cafe_name_top = sparkdf_garitos \
    .select("county",'address') \
    .where("garito_kind='cafeterias'") \
    .groupBy("county",) \
    .agg(count("address").alias("Total")) \
    .orderBy("Total", ascending=False)

print(cafe_name_top.head(30))

[Row(county='León', Total=317), Row(county='Salamanca', Total=315), Row(county='Valladolid', Total=230), Row(county='Burgos', Total=179), Row(county='Ávila', Total=141), Row(county='Zamora', Total=91), Row(county='Segovia', Total=59), Row(county='Soria', Total=59), Row(county='Palencia', Total=54)]


In [54]:
# Playing with Spark Data Frames. 
# Order cities in a county (Burgos) by number of garitos, showing the count for each type
burgos_top = sparkdf_garitos \
    .select("city",'address',
           when(sparkdf_garitos['garito_kind'] == 'cafeterias', 1).alias("is_cafe"),
           when(sparkdf_garitos['garito_kind'] == 'bares', 1).alias("is_bar"),
           when(sparkdf_garitos['garito_kind'] == 'restaurantes', 1).alias("is_restaurante")
           ) \
    .where("county='Burgos'") \
    .groupBy("city") \
    .agg(count("is_cafe").alias("cafes"), 
         count("is_bar").alias("bars"),
         count("is_restaurante").alias("restaurants"),
         count("address").alias("total"),
        ) \
    .orderBy("total", ascending=False)


burgos_top.head(30)

[Row(city='Burgos', cafes=102, bars=765, restaurants=244, total=1110),
 Row(city='Aranda de Duero', cafes=19, bars=168, restaurants=62, total=249),
 Row(city='Miranda de Ebro', cafes=14, bars=177, restaurants=42, total=233),
 Row(city='Medina de Pomar', cafes=3, bars=67, restaurants=20, total=90),
 Row(city='Villarcayo de Merindad de Castilla la Vieja', cafes=7, bars=42, restaurants=16, total=65),
 Row(city='Briviesca', cafes=3, bars=42, restaurants=13, total=58),
 Row(city='Lerma', cafes=2, bars=19, restaurants=22, total=43),
 Row(city='Valle de Mena', cafes=2, bars=29, restaurants=13, total=41),
 Row(city='Espinosa de los Monteros', cafes=0, bars=25, restaurants=11, total=36),
 Row(city='Salas de los Infantes', cafes=0, bars=20, restaurants=9, total=29),
 Row(city='Roa', cafes=1, bars=18, restaurants=6, total=25),
 Row(city='Belorado', cafes=3, bars=12, restaurants=10, total=25),
 Row(city='Quintanar de la Sierra', cafes=0, bars=18, restaurants=7, total=24),
 Row(city='Melgar de Fern
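
The per-type counts above work because `when(condition, 1)` without an `otherwise` yields null for non-matching rows, and `count` skips nulls. The same reason explains why `total` can be smaller than the sum of the three columns: it counts non-null addresses, and some rows presumably have no address (for Burgos, 102 + 765 + 244 = 1111 against a total of 1110). A plain-Python sketch of the per-city, per-type count follows. The function name `count_kinds_per_city` and the sample rows are hypothetical.

```python
from collections import defaultdict

KINDS = ("cafeterias", "bares", "restaurantes")


def count_kinds_per_city(rows):
    """Count garitos of each kind per city, with an explicit zero for absent kinds."""
    counts = defaultdict(lambda: dict.fromkeys(KINDS, 0))
    for row in rows:
        counts[row["city"]][row["garito_kind"]] += 1
    return {city: dict(per_kind) for city, per_kind in counts.items()}


# Hypothetical sample rows
rows = [
    {"city": "Burgos", "garito_kind": "bares"},
    {"city": "Burgos", "garito_kind": "bares"},
    {"city": "Burgos", "garito_kind": "cafeterias"},
    {"city": "Lerma", "garito_kind": "restaurantes"},
]

kind_counts = count_kinds_per_city(rows)
```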

### CREATION OF GARITOS PARQUET TABLE


In [19]:
## CLEANED UP GARITOS TO PARQUET 
toparquet_by_county_and_postcode(spark_session,local_parquet_path  + 'garitos/',sparkdf_garitos)


# Associations and Sport Clubs

### CLEANING ASOCIACIONES Y CLUBES DEPORTIVOS
These 2 files share the same schema.

In [4]:
## One from s3 (to test) and the other from the local folder

sparkdf_social=create_social(spark_session,local_path,data_asociaciones, data_clubes_deportivos) 

sparkdf_social.head(5)


social Spark Data Frame was created; 
  [Row(name='ASOCIACION DE JUBILADOS Y PENSIONISTAS VIRGEN DE LA PIEDAD', address='c/ Las Parras, 89', county='avila', city='EL BARRACO', postal_code='05110', sports='N/A', social_kind='association'), Row(name='PEÑA SAN MARCOS ', address='DE LAS FUENTES 10 ', county='avila', city='EL BARRACO', postal_code='05000', sports='N/A', social_kind='association')]


[Row(name='ASOCIACION DE JUBILADOS Y PENSIONISTAS VIRGEN DE LA PIEDAD', address='c/ Las Parras, 89', county='avila', city='EL BARRACO', postal_code='05110', sports='N/A', social_kind='association'),
 Row(name='PEÑA SAN MARCOS ', address='DE LAS FUENTES 10 ', county='avila', city='EL BARRACO', postal_code='05000', sports='N/A', social_kind='association'),
 Row(name='ASOCIACION DE JUBILADOS Y PENSIONISTAS DE SAN JUAN DEL OLMO SAN JUAN BAUTISTA', address='C/General Franco s/nº', county='avila', city='SAN JUAN DEL OLMO', postal_code='05145', sports='N/A', social_kind='association'),
 Row(name='ASOCIACION DEPORTIVO CULTURAL DE CAZADORES LA PICOTA', address='C/Piñonera nº 18', county='avila', city='CEBREROS', postal_code='05260', sports='N/A', social_kind='association'),
 Row(name='ASOCIACION CULTURAL -FORO 93-', address='AVDA. DE JOSE ANTONIO, 32 - 6', county='avila', city='EL HOYO DE PINARES', postal_code='05250', sports='N/A', social_kind='association')]

### SOCIAL CLEANUP

In [5]:
##CHECK FOR NULL VALUES IN KEY COLUMNS
check_sparkdf_not_nulls(sparkdf_social,['name','county','city','social_kind'])

Checking DataFrame. No null values found in column name
Checking DataFrame. No null values found in column county
Checking DataFrame. No null values found in column city
Checking DataFrame. No null values found in column social_kind


True

In [6]:
##CHECK FOR REPEATED VALUES IN KEY COLUMNS
# We can accept same name, even same name in same city (if it is a different kind
# of association), but not same name / same city / same kind
dupes=check_sparkdf_find_dupes(sparkdf_social,['name','county','city','social_kind'])

print(dupes.head(15))

[Row(name='ASOCIACION JUVENIL CULTURAL 15 DE AGOSTO ', county='segovia', city='MOZONCILLO', social_kind='association', count=2), Row(name='ASOCIACIÓN CÍRCULO PSICOANALÍTICO DE LEÓN', county='león', city='LEÓN', social_kind='association', count=2), Row(name='ASOCIACION JUVENIL EL RIO ', county='segovia', city='CARRASCAL DEL RÍO', social_kind='association', count=2), Row(name='ANULADO', county='zamora', city='ZAMORA', social_kind='sports_club', count=2), Row(name='ASOCIACIÓN CULTURAL LUCERNA', county='león', city='CARUCEDO', social_kind='association', count=2)]


In [7]:
## There are several duplicates. One of them has name="ANULADO" (meaning "voided"),
## which suggests we should find every other ANULADO/voided row and remove them all

# Keep the DataFrame in the variable; show() only prints and returns None
sparkdf_social_voided = sparkdf_social.select('*').where("name='ANULADO'")
sparkdf_social_voided.show()

+-------+--------------------+----------+--------------------+-----------+--------------------+-----------+
|   name|             address|    county|                city|postal_code|              sports|social_kind|
+-------+--------------------+----------+--------------------+-----------+--------------------+-----------+
|ANULADO|                 XXX|valladolid|          VALLADOLID|      47000|FU001#FÚTBOL - FÚ...|sports_club|
|ANULADO|C/ CORUÑA DEL CON...|    burgos|     ARANDA DE DUERO|       9400|FU001#FÚTBOL - FÚ...|sports_club|
|ANULADO|                JJJJ|  palencia|CAMPORREDONDO DE ...|      34888|CI003#CICLISMO - ...|sports_club|
|ANULADO|      Avda principal|     ávila|NAVAS DEL MARQUES...|       5230|CA001#CAZA - PICH...|sports_club|
|ANULADO|           XXXXXXXXX| salamanca|           SALAMANCA|      37008|MT018#MOTOCICLISM...|sports_club|
|ANULADO|C/ CARTAJENA DE I...|    zamora| MORALES DE VALVERDE|      49697|CA001#CAZA - PICH...|sports_club|
|ANULADO|    XXXXXXXXXXXXXXX

In [8]:
# Changing the name to ANULADO instead of removing the item seems to be common practice
# in the sports_club data source. We get rid of all of these rows

sparkdf_social = sparkdf_social.select('*').where("name <> 'ANULADO'")

In [9]:
# Also, we want to remove other duplicates

print('Rows before removing duplicates',sparkdf_social.count())
sparkdf_social=sparkdf_social.dropDuplicates(['name','county','city','social_kind'])
print('Rows after removing duplicates',sparkdf_social.count())


Rows before removing duplicates 43738
Rows after removing duplicates 43734
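
The two cleanup steps above (drop the voided rows, then drop duplicates on the key columns) can be sketched in plain Python as follows. This is an illustration under assumptions, not the Spark code: the function name `drop_voided_and_duplicates` and the sample rows are hypothetical, and the sketch always keeps the first row seen for each key, whereas Spark's `dropDuplicates` makes no promise about which duplicate survives.

```python
def drop_voided_and_duplicates(rows, key_columns, voided_name="ANULADO"):
    """Remove voided rows, then keep the first row for each key."""
    seen = set()
    kept = []
    for row in rows:
        if row["name"] == voided_name:
            continue  # voided entries are discarded entirely
        key = tuple(row[c] for c in key_columns)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


# Hypothetical sample rows
rows = [
    {"name": "ASOCIACIÓN CULTURAL LUCERNA", "county": "león", "city": "CARUCEDO", "social_kind": "association"},
    {"name": "ASOCIACIÓN CULTURAL LUCERNA", "county": "león", "city": "CARUCEDO", "social_kind": "association"},
    {"name": "ANULADO", "county": "zamora", "city": "ZAMORA", "social_kind": "sports_club"},
]

cleaned = drop_voided_and_duplicates(rows, ["name", "county", "city", "social_kind"])
```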


In [10]:
# CHECK and remove wrong Counties as we are going to partition by county
sparkdf_social=clean_wrong_counties(sparkdf_social)

sparkdf_social.select("County").distinct().head(80)


[Row(County='león'),
 Row(County='ávila'),
 Row(County='segovia'),
 Row(County='palencia'),
 Row(County='soria'),
 Row(County='burgos'),
 Row(County='zamora'),
 Row(County='valladolid'),
 Row(County='salamanca')]
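
The source of `clean_wrong_counties` is not shown here. The earlier sample rows had `county='avila'`, while the list above contains `'ávila'`, so some normalization seems to happen. The sketch below is only a guess at one way to do it: match accent-insensitively against a fixed list of the nine counties and return `None` for anything else. The names `CANONICAL_COUNTIES`, `strip_accents`, and `normalize_county` are hypothetical.

```python
import unicodedata

# The nine counties, spelled as in the output above
CANONICAL_COUNTIES = [
    "ávila", "burgos", "león", "palencia", "salamanca",
    "segovia", "soria", "valladolid", "zamora",
]


def strip_accents(text):
    """Remove combining accent marks, e.g. 'ávila' -> 'avila'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


# Accent-free spelling -> canonical spelling
_LOOKUP = {strip_accents(name): name for name in CANONICAL_COUNTIES}


def normalize_county(raw):
    """Return the canonical county name, or None when the value is not a known county."""
    if raw is None:
        return None
    return _LOOKUP.get(strip_accents(raw.strip().lower()))
```

Rows for which `normalize_county` returns `None` could then be dropped before partitioning by county.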

### CREATION OF SOCIAL PARQUET TABLE

In [11]:

## CLEANED UP SOCIAL TO PARQUET (LOCAL)
toparquet_by_county_and_postcode(spark_session,local_parquet_path  + 'social/',sparkdf_social)
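
Assuming `toparquet_by_county_and_postcode` uses `partitionBy("county", "postal_code")`, as the commented-out S3 variant does, Spark writes one directory per partition value in the `column=value` style, and rows with a null partition value go to a directory named `__HIVE_DEFAULT_PARTITION__`. The sketch below only builds such paths to show the layout. The function name `partition_dir` and the base path are hypothetical, and the sketch ignores the escaping Spark applies to special characters.

```python
import posixpath

# Directory name Spark uses for rows whose partition column is null
NULL_PARTITION = "__HIVE_DEFAULT_PARTITION__"


def partition_dir(base_path, county, postal_code):
    """Build the Hive-style directory for one (county, postal_code) partition."""
    parts = [
        f"county={county if county is not None else NULL_PARTITION}",
        f"postal_code={postal_code if postal_code is not None else NULL_PARTITION}",
    ]
    return posixpath.join(base_path, *parts)


example_dir = partition_dir("parquet_files/social", "burgos", "09400")
missing_code_dir = partition_dir("parquet_files/social", "zamora", None)
```

With 9 counties and roughly 2,000 postal codes, this layout produces many small directories, which is worth keeping in mind when reading the tables back.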

In [32]:
## CLEANED UP SOCIAL TO PARQUET (S3)

#bucket_name = 'raul-udacity'
#bucket_parquet_path ='/parquet/'
#sparkdf_social.write.partitionBy("county","postal_code").parquet('s3a://'+bucket_name+bucket_parquet_path + "social/", mode="overwrite")

### EXAMPLE QUERIES TO SOCIAL

In [17]:
## Sports / county
# https://stackoverflow.com/questions/57066797/pyspark-dataframe-split-column-with-multiple-values-into-rows#57080133

from pyspark.sql.functions import explode, regexp_replace, split

out=sparkdf_social.withColumn(
    "sports", 
    explode(split(col("sports"), r"\|"))  # raw string: "\|" is an invalid escape in a normal Python string
).where("county='burgos'").select(
    col('sports')
    ).groupBy('sports').agg(count('sports').alias('sport_associations')) \
    .orderBy('sport_associations', ascending=False)



out.head(70)

[Row(sports='N/A', sport_associations=5423),
 Row(sports='', sport_associations=1102),
 Row(sports='CA005#CAZA - PERROS DE CAZA Y AGILITY#', sport_associations=385),
 Row(sports='CA013#CAZA - EDUCACIÓN CANINA#', sport_associations=385),
 Row(sports='CA006#CAZA - CETRERIA#', sport_associations=384),
 Row(sports='CA009#CAZA - TIRO A CAZA LANZADA#', sport_associations=384),
 Row(sports='CA001#CAZA - PICHON A BRAZO#', sport_associations=384),
 Row(sports='CA012#CAZA - PERDIZ CON RECLAMO#', sport_associations=384),
 Row(sports='CA014#CAZA - CAZA DE BECADAS#', sport_associations=384),
 Row(sports='CA008#CAZA - CAZA CON ARCO#', sport_associations=384),
 Row(sports='CA010#CAZA - CAZA FOTOGRAFICA Y VIDEO#', sport_associations=384),
 Row(sports='CA011#CAZA - COMPAK SPORTING#', sport_associations=384),
 Row(sports='CA004#CAZA - CAZA SAN HUBERTO#', sport_associations=384),
 Row(sports='CA007#CAZA - PAJAROS DE CANTO#', sport_associations=383),
 Row(sports='CA003#CAZA - RECORRIDOS DE CAZA#', sport_a
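
The `split` plus `explode` step above turns one pipe-delimited `sports` value into one row per sport before counting. A plain-Python sketch of the same idea follows. It also skips empty tokens and the `N/A` placeholder, which dominate the output above. The empty-string row suggests that some values contain a leading, trailing, or doubled delimiter, but that is a guess. The function name `count_sports` and the sample values are hypothetical.

```python
from collections import Counter


def count_sports(sports_values, delimiter="|", placeholder="N/A"):
    """Count individual sports across delimiter-separated values."""
    counts = Counter()
    for value in sports_values:
        for token in value.split(delimiter):
            token = token.strip()
            # Skip empty tokens (stray delimiters) and the placeholder
            if token and token != placeholder:
                counts[token] += 1
    return counts


# Hypothetical sample values; the trailing "|" in the first one is an assumption
sample_values = [
    "CA001#CAZA - PICHON A BRAZO#|CA006#CAZA - CETRERIA#|",
    "CA001#CAZA - PICHON A BRAZO#",
    "N/A",
]

sport_counts = count_sports(sample_values)
```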

# Step 3: Define the Data Model
## 3.1 Conceptual Data Model
Map out the conceptual data model and explain why you chose that model

## 3.2 Mapping Out Data Pipelines
List the steps necessary to pipeline the data into the chosen data model

# Step 4: Run Pipelines to Model the Data 
## 4.1 Create the data model
Build the data pipelines to create the data model.


## 4.2 Data Quality Checks
Explain the data quality checks you'll perform to ensure the pipeline ran as expected. These could include:
 * Integrity constraints on the relational database (e.g., unique key, data type, etc.)
 * Unit tests for the scripts to ensure they are doing the right thing
 * Source/Count checks to ensure completeness
 
Run Quality Checks
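
As a starting point for the checks listed above, here is a minimal plain-Python sketch of a completeness check and a unique-key check. Both function names and the sample rows are hypothetical. In the notebook itself, the checks run on Spark DataFrames through the helpers in `data_quality_checks`.

```python
def check_min_rows(table_name, row_count, minimum=1):
    """Raise if a table has fewer rows than expected; return True otherwise."""
    if row_count < minimum:
        raise ValueError(
            f"{table_name}: expected at least {minimum} rows, found {row_count}"
        )
    return True


def find_duplicate_keys(rows, key_columns):
    """Return the set of key tuples that appear more than once."""
    seen, duplicates = set(), set()
    for row in rows:
        key = tuple(row[c] for c in key_columns)
        if key in seen:
            duplicates.add(key)
        else:
            seen.add(key)
    return duplicates


# Hypothetical sample rows with one repeated key
rows = [
    {"name": "A", "city": "ZAMORA"},
    {"name": "A", "city": "ZAMORA"},
    {"name": "B", "city": "SORIA"},
]

count_ok = check_min_rows("social", len(rows))
duplicate_keys = find_duplicate_keys(rows, ["name", "city"])
```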

## 4.3 Data dictionary 
Create a data dictionary for your data model. For each field, provide a brief description of what the data is and where it came from. You can include the data dictionary in the notebook or in a separate file.

# Step 5: Complete Project Write Up
* Clearly state the rationale for the choice of tools and technologies for the project.
* Propose how often the data should be updated and why.
* Write a description of how you would approach the problem differently under the following scenarios:
 * The data was increased by 100x.
 * The data populates a dashboard that must be updated on a daily basis by 7am every day.
 * The database needed to be accessed by 100+ people.