# Project Title
### Data Engineering Capstone Project

#### Project Summary
This project restructures the 2018 New York City parking ticket data so that it can be analyzed for further insight.

The project follows these steps:
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

In [32]:
import configparser
from datetime import datetime
import os
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col, lit, concat
from pyspark.sql.functions import year, month, dayofmonth, hour, weekofyear, date_format

### Step 1: Scope the Project and Gather Data

#### Scope 
Explain what you plan to do in the project in more detail. What data do you use? What does your end solution look like? What tools did you use?
I plan to warehouse New York City's parking ticket dataset in order to gain more insight into trends in how parking officers issue tickets. The main dataset is the New York City Parking Violations data for fiscal year 2018, and I use the parking violation code data to enrich it. My end solution is a data lake with the dimension tables Registration, Vehicle, Violation Location, and Violation Details, and a fact table, Ticket. I used Spark and AWS S3 to create the data lake.

#### Describe and Gather Data 
Describe the data sets you're using. 

- The first dataset is the City of New York parking violation tickets issued in fiscal year 2018, available at https://data.cityofnewyork.us/City-Government/Parking-Violations-Issued-Fiscal-Year-2018/faiq-9dfq.

The information included in the dataset includes:

Summons Number
Plate ID
Registration State
Plate Type
Issue Date
Violation Code
Vehicle Body Type
Issuing Agency
Street Code1
Street Code2
Street Code3
Vehicle Expiration Date
Violation Location
Violation Precinct
Issuer Precinct
Issuer Code
Issuer Command
Issuer Squad
Violation Time
Time First Observed
Violation County
Violation In Front Of Or Opposite
House Number
Street Name
Intersecting Street
Date First Observed
Law Section
Sub Division
Violation Legal Code
Days Parking In Effect
From Hours In Effect
To Hours In Effect
Vehicle Color
Unregistered Vehicle?
Vehicle Year
Meter Number
Feet From Curb
Violation Post Code
Violation Description
No Standing or Stopping Violation
Hydrant Violation


Where did it come from? What type of information is included? 

- The second dataset is the parking violation code descriptions, available at https://catalog.data.gov/dataset/dof-parking-violation-codes-63051.

The information included in the dataset includes:

Code
Description
Manhattan 96th street & below
All other areas



In [3]:
spark = SparkSession \
        .builder \
        .appName("Capstone Cluster") \
        .getOrCreate()

#### This first dataset is the main dataset, with 4001111 rows

In [4]:
# Read in the data here
df_ticket = spark.read.format("csv").option("header", "true").load("parking-violations-issued-fiscal-year-2018.csv")

In [27]:
pd.set_option('display.max_columns', 999)

In [7]:
df_ticket.count()

4001111

In [28]:
df_ticket.limit(5).toPandas()

Unnamed: 0,Summons Number,Plate ID,Registration State,Plate Type,Issue Date,Violation Code,Vehicle Body Type,Vehicle Make,Issuing Agency,Street Code1,Street Code2,Street Code3,Vehicle Expiration Date,Violation Location,Violation Precinct,Issuer Precinct,Issuer Code,Issuer Command,Issuer Squad,Violation Time,Time First Observed,Violation County,Violation In Front Of Or Opposite,House Number,Street Name,Intersecting Street,Date First Observed,Law Section,Sub Division,Violation Legal Code,Days Parking In Effect,From Hours In Effect,To Hours In Effect,Vehicle Color,Unregistered Vehicle?,Vehicle Year,Meter Number,Feet From Curb,Violation Post Code,Violation Description,No Standing or Stopping Violation,Hydrant Violation,Double Parking Violation
0,1105232165,GLS6001,NY,PAS,2018-07-03T00:00:00.000,14,SDN,HONDA,X,47130,13230,80030,20180702.0,78,78,968,86684,968,0,0811P,,K,F,2,HANSON PLACE,,0,408,D1,,BBYBBBB,ALL,ALL,BLUE,0,2006,-,0,,,,,
1,1121274900,HXM7361,NY,PAS,2018-06-28T00:00:00.000,46,SDN,NISSA,X,28990,14890,15040,20200203.0,112,112,968,103419,968,0,1145A,,Q,F,71-30,AUSTIN ST,,0,408,C,,BBBBBBB,ALL,ALL,GRY,0,2017,-,0,,,,,
2,1130964875,GTR7949,NY,PAS,2018-06-08T00:00:00.000,24,SUBN,JEEP,X,64,18510,99,20180930.0,122,122,835,0,835,0,0355P,,R,,,GREAT KILLS BOAT LAU,,0,408,D5,,BBBBBBB,ALL,ALL,GREEN,0,0,-,0,,,,,
3,1130964887,HH1842,NC,PAS,2018-06-07T00:00:00.000,24,P-U,FORD,X,11310,39800,39735,0.0,122,122,835,0,835,0,0123P,,R,,,GREAT KILLS PARK BOA,,0,408,D5,,BBBBBBB,ALL,ALL,WHITE,0,0,-,0,,,,,
4,1131599342,HDG7076,NY,PAS,2018-06-29T00:00:00.000,17,SUBN,HYUND,X,47130,13230,80030,20190124.0,78,78,868,2354,868,0,0514P,,K,F,2,HANSON PLACE,,0,408,C4,,BBBBBBB,ALL,ALL,GREEN,0,2007,-,0,,,,,
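
The sample rows show `Issue Date` as an ISO-style timestamp (`2018-07-03T00:00:00.000`) and `Violation Time` as four digits followed by `A` or `P` (`0811P`). Below is a minimal sketch of how the two fields might be combined into a single timestamp. It assumes `A`/`P` mean AM/PM and that hour `12` follows the usual 12-hour clock convention. The helper names are hypothetical and are not part of the notebook.

```python
import re
from datetime import datetime, time
from typing import Optional

# Four digits (HHMM) followed by A or P, e.g. "0811P".
_TIME_RE = re.compile(r"([0-9]{2})([0-9]{2})([AP])")


def parse_violation_time(raw: Optional[str]) -> Optional[time]:
    """Convert a value such as '0811P' to a 24-hour time, or None if malformed."""
    if raw is None:
        return None
    match = _TIME_RE.fullmatch(raw.strip().upper())
    if match is None:
        return None
    hour, minute = int(match.group(1)), int(match.group(2))
    if hour > 12 or minute > 59:
        return None
    # Assumption: 12A is midnight and 12P is noon, as on a 12-hour clock.
    hour = hour % 12 + (12 if match.group(3) == "P" else 0)
    return time(hour, minute)


def parse_issue_datetime(issue_date: str, violation_time: Optional[str]) -> datetime:
    """Combine Issue Date with Violation Time; use midnight if the time is unusable."""
    day = datetime.strptime(issue_date, "%Y-%m-%dT%H:%M:%S.%f").date()
    return datetime.combine(day, parse_violation_time(violation_time) or time(0, 0))


print(parse_issue_datetime("2018-07-03T00:00:00.000", "0811P"))
```

A function like this could be wrapped with the `udf` helper the notebook already imports, although Spark's built-in timestamp functions may be faster on the full dataset.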


In [9]:
df_ticket_code = spark.read.json("parking_violation codes.json", multiLine=True)

In [10]:
df_ticket_code.limit(5).toPandas()

Unnamed: 0,data,meta
0,"[[row-vc2y~qug8_qh44, 00000000-0000-0000-CF04-...","((Department of Finance (DOF), http://www.nyc...."


### Step 2: Explore and Assess the Data
#### Explore the Data 
Identify data quality issues, like missing values, duplicate data, etc.

The main ticket file has many null values in the following columns:

- Violation Post Code
- Violation Description
- No Standing or Stopping Violation
- Hydrant Violation
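
One way to quantify these gaps is to count empty values per column. The sketch below uses only the standard library on the first three sample rows shown earlier, reduced to four columns. The function name `count_missing` is hypothetical, and the same idea would need to be expressed with Spark aggregations to run over the full file.

```python
import csv
import io

# First sample rows from the notebook output, reduced to four columns.
SAMPLE_CSV = """Summons Number,Plate ID,Violation Post Code,Violation Description
1105232165,GLS6001,,
1121274900,HXM7361,,
1130964875,GTR7949,,
"""


def count_missing(rows, columns):
    """Return (row_count, {column: number of empty or absent values})."""
    missing = {name: 0 for name in columns}
    total = 0
    for row in rows:
        total += 1
        for name in columns:
            value = row.get(name)
            if value is None or value.strip() == "":
                missing[name] += 1
    return total, missing


reader = csv.DictReader(io.StringIO(SAMPLE_CSV))
total, missing = count_missing(reader, reader.fieldnames)
for name, n in missing.items():
    print(f"{name}: {n}/{total} missing")
```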

#### Cleaning Steps
Document steps necessary to clean the data

In [12]:
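# The JSON file loads as a single row whose `data` field holds every record.
# In each record, position 8 is the violation code and position 9 its definition.
code_list = []
definition_list = []
for data in df_ticket_code.toPandas().data[0]:
    code_list.append(data[8])
    definition_list.append(data[9])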

In [15]:
df_codes = pd.DataFrame(columns=['Code','Definition'])

In [16]:
df_codes['Code'] = code_list
df_codes['Definition'] = definition_list

In [17]:
df_codes_spark  = spark.createDataFrame(df_codes)

In [18]:
df_codes_spark.toPandas().head()

Unnamed: 0,Code,Definition
0,30,NO STOP/STANDNG EXCEPT PAS P/U
1,60,ANGLE PARKING
2,13,NO STANDING-TAXI STAND
3,73,REG STICKER-MUTILATED/C'FEIT
4,38,FAIL TO DSPLY MUNI METER RECPT
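
The extraction above relies on fixed positions 8 and 9. If the file follows the usual open-data JSON export layout, where `meta.view.columns` lists the column names in row order, the positions could be looked up by name instead. That layout, the column names `CODE` and `DEFINITION`, and the helper `column_index` are all assumptions to verify against the actual file. The miniature document below is hypothetical apart from the two code/definition pairs, which come from the output above.

```python
# Hypothetical miniature of the JSON file: eight metadata fields per row,
# followed by the dataset's own columns. Column names are assumptions.
SAMPLE = {
    "meta": {"view": {"columns": [
        {"name": "sid"}, {"name": "id"}, {"name": "position"},
        {"name": "created_at"}, {"name": "created_meta"},
        {"name": "updated_at"}, {"name": "updated_meta"}, {"name": "meta"},
        {"name": "CODE"}, {"name": "DEFINITION"},
    ]}},
    "data": [
        [None] * 8 + ["30", "NO STOP/STANDNG EXCEPT PAS P/U"],
        [None] * 8 + ["60", "ANGLE PARKING"],
    ],
}


def column_index(document, name):
    """Return the position of the column called `name` (case-insensitive)."""
    for index, column in enumerate(document["meta"]["view"]["columns"]):
        if column["name"].strip().lower() == name.lower():
            return index
    raise KeyError(f"column not found: {name}")


code_idx = column_index(SAMPLE, "code")
definition_idx = column_index(SAMPLE, "definition")
pairs = [(row[code_idx], row[definition_idx]) for row in SAMPLE["data"]]
print(pairs)
```

Looking positions up by name means the extraction keeps working if the publisher adds or reorders columns.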


### Step 3: Define the Data Model
#### 3.1 Conceptual Data Model
Map out the conceptual data model and explain why you chose that model

#### 3.2 Mapping Out Data Pipelines
List the steps necessary to pipeline the data into the chosen data model

### Step 4: Run Pipelines to Model the Data 
#### 4.1 Create the data model
Build the data pipelines to create the data model.

### Create Vehicle Table

In [20]:
df_vehicle_table = df_ticket.select(
    col('Plate ID').alias('plate_id'),
    col('Vehicle Make').alias('vehicle_make'),
    col('Vehicle Body Type').alias('vehicle_body_type'),
    col('Vehicle Color').alias('vehicle_color'),
    col('Vehicle Year').alias('vehicle_year'),
)

### Create Registration Table

In [22]:
df_registration_table = df_ticket.select(
    col('Plate ID').alias('plate_id'),
    col('Plate Type').alias('plate_type'),
    col('Registration State').alias('registration_state'),
    col('Vehicle Expiration Date').alias('registration_expired_date'),
    col('Unregistered Vehicle?').alias('unregistered_vehicle'),
)
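
In the sample rows, `Vehicle Expiration Date` appears as a float-like string such as `20180702.0`, and as `0.0` when no date is present. Here is a minimal sketch of turning those values into real dates, assuming the digits are `YYYYMMDD` and that anything else should be treated as missing. The helper name `parse_expiration_date` is hypothetical.

```python
from datetime import date, datetime
from typing import Optional


def parse_expiration_date(raw: Optional[str]) -> Optional[date]:
    """Parse values like '20180702.0' into a date; return None when unusable."""
    if raw is None:
        return None
    text = raw.strip()
    if text.endswith(".0"):
        text = text[:-2]  # drop the float suffix left by the export
    if len(text) != 8 or not (text.isascii() and text.isdigit()):
        return None  # covers '0.0' and other placeholder values
    try:
        return datetime.strptime(text, "%Y%m%d").date()
    except ValueError:
        return None  # eight digits that do not form a valid calendar date


print(parse_expiration_date("20180702.0"))
print(parse_expiration_date("0.0"))
```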

### Create Violation Location Table

In [30]:
df_violation_location_table = df_ticket.select(
    col('Street Code1').alias('street_code1'),
    col('Street Code2').alias('street_code2'),
    col('Street Code3').alias('street_code3'),
    col('Violation Precinct').alias('violation_precinct'),
    col('Violation County').alias('violation_county'),
    col('House Number').alias('house_number'),
    col('Street Name').alias('street_name'),
    col('Days Parking In Effect    ').alias('parking_enforced_days'),
    col('From Hours In Effect').alias('from_enforced_hours'),
    col('To Hours In Effect').alias('to_enforced_hours'),
)

In [33]:
df_violation_location_table = df_violation_location_table.withColumn(
    "street_code_key",
    concat(col("street_code1"), lit('-'), col("street_code2"), lit('-'), col("street_code3")),
)
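
Spark's `concat` returns null when any of its inputs is null, so a row missing one street code ends up with no `street_code_key` at all. The plain-Python sketch below mirrors that behavior and shows one possible alternative that substitutes a placeholder. Both function names and the `"NA"` placeholder are hypothetical choices, not something the notebook defines.

```python
from typing import Optional


def build_street_code_key(*codes: Optional[str], sep: str = "-") -> Optional[str]:
    """Mirror Spark's concat: the key is None if any street code is missing."""
    if any(code is None for code in codes):
        return None
    return sep.join(codes)


def build_street_code_key_lenient(*codes: Optional[str], sep: str = "-",
                                  placeholder: str = "NA") -> str:
    """Always return a key, writing a placeholder where a street code is missing."""
    return sep.join(placeholder if code is None else code for code in codes)


# Street codes taken from the first sample row.
print(build_street_code_key("47130", "13230", "80030"))
print(build_street_code_key_lenient("64", None, "99"))
```

Keeping a placeholder in each position, rather than skipping missing parts as Spark's `concat_ws` does, avoids two different code combinations collapsing into the same key.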

#### 4.2 Data Quality Checks
Explain the data quality checks you'll perform to ensure the pipeline ran as expected. These could include:
 * Integrity constraints on the relational database (e.g., unique key, data type, etc.)
 * Unit tests for the scripts to ensure they are doing the right thing
 * Source/Count checks to ensure completeness
 
Run Quality Checks

In [None]:
# Perform quality checks here
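
The count and unique-key checks listed above could be sketched as follows. This is a plain-Python illustration under stated assumptions, not the notebook's implementation: the function names are hypothetical, `summons_number` is assumed to be the ticket key, and in Spark the inputs would come from calls such as `df_ticket.count()`. The expected row count is the one reported earlier in the notebook.

```python
from collections import Counter
from typing import Iterable, List, Mapping, Optional

EXPECTED_TICKET_ROWS = 4001111  # reported by df_ticket.count() earlier


def check_row_count(table: str, actual: int, expected: Optional[int] = None) -> List[str]:
    """Completeness check: the table is non-empty and matches the source count."""
    if actual == 0:
        return [f"{table}: table is empty"]
    if expected is not None and actual != expected:
        return [f"{table}: expected {expected} rows, found {actual}"]
    return []


def check_unique_key(table: str, rows: Iterable[Mapping], key: str) -> List[str]:
    """Integrity check: the key column has no nulls and no duplicates."""
    counts = Counter(row.get(key) for row in rows)
    problems = []
    if counts.get(None):
        problems.append(f"{table}: {counts[None]} row(s) have a null {key}")
    duplicates = sorted(k for k, n in counts.items() if k is not None and n > 1)
    if duplicates:
        problems.append(f"{table}: duplicate {key} values {duplicates}")
    return problems


# Summons numbers from the sample rows; the last entry is repeated on purpose.
sample_rows = [
    {"summons_number": "1105232165"},
    {"summons_number": "1121274900"},
    {"summons_number": "1105232165"},
]
problems = check_row_count("ticket", EXPECTED_TICKET_ROWS, EXPECTED_TICKET_ROWS)
problems += check_unique_key("ticket", sample_rows, "summons_number")
print(problems)
```

Returning a list of messages instead of raising on the first failure lets a pipeline run report every problem at once.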

#### 4.3 Data dictionary 
Create a data dictionary for your data model. For each field, provide a brief description of what the data is and where it came from. You can include the data dictionary in the notebook or in a separate file.

### Step 5: Complete Project Write Up
* Clearly state the rationale for the choice of tools and technologies for the project.
* Propose how often the data should be updated and why.
* Write a description of how you would approach the problem differently under the following scenarios:
 * The data was increased by 100x.
 * The data populates a dashboard that must be updated on a daily basis by 7am every day.
 * The database needed to be accessed by 100+ people.