# Quantifying completeness

The records for parking violations stored in the `parking_violation` table contain missing values for the `vehicle_body_type` column. Assume this data is missing completely at random (MCAR) due to human error. In an effort to make the data more complete, you have been tasked with filling in these values. You decide to quantify how many records are missing and perform an analysis for an appropriate fill-in value to replace the missing values.

How many `parking_violation` records have a `NULL` value for `vehicle_body_type`? Write and execute a `SELECT` query that computes this number.

```
SELECT COUNT(*) FROM parking_violation
WHERE vehicle_body_type IS NULL;
```

- 179
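
To put a missing-value count in context, it can help to compare it with the total row count. The following is a minimal sketch, assuming an in-memory SQLite database and five made-up rows (not the course dataset). It relies on `COUNT(column)` skipping `NULL` values while `COUNT(*)` counts every row.

```python
import sqlite3

# Hypothetical sample data; the real parking_violation table is much larger.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE parking_violation (summons_number INTEGER, vehicle_body_type TEXT)"
)
conn.executemany(
    "INSERT INTO parking_violation VALUES (?, ?)",
    [(1, "SDN"), (2, None), (3, "SUBN"), (4, None), (5, "VAN")],
)

# COUNT(*) counts all rows; COUNT(vehicle_body_type) ignores NULLs.
total, non_null, missing = conn.execute(
    """
    SELECT
      COUNT(*),
      COUNT(vehicle_body_type),
      COUNT(*) - COUNT(vehicle_body_type)
    FROM parking_violation
    """
).fetchone()

missing_pct = 100.0 * missing / total
print(total, non_null, missing, missing_pct)
```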

# Using a fill-in value

The sedan body type is the most frequently occurring `vehicle_body_type` in the sample parking violations. For this reason, you propose changing all `NULL`-valued `vehicle_body_type` records in the `parking_violation` table to `SDN`. Discussions with your team result in a decision to use a value other than `SDN` as a fill-in value. The body type can be determined by looking up the vehicle using its license plate number, and a license plate number is present in most `parking_violation` records. Rather than using the most frequent value to replace `NULL` `vehicle_body_type` values, a placeholder value of `Unknown` will be used. The actual body type will be updated as license plate lookup data is gathered.

In this exercise, you will replace `NULL` `vehicle_body_type` values with the string `Unknown`.

```
UPDATE
  parking_violation
SET
  -- Replace NULL vehicle_body_type values with `Unknown`
  vehicle_body_type = COALESCE(vehicle_body_type, 'Unknown');

SELECT COUNT(*) FROM parking_violation WHERE vehicle_body_type = 'Unknown';
```
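
An `UPDATE` with `COALESCE` and no `WHERE` clause rewrites every row, even though only the `NULL` values change. The sketch below compares it with an `UPDATE` restricted by `WHERE vehicle_body_type IS NULL`. It assumes an in-memory SQLite database and made-up rows; the `make_table` helper is hypothetical and exists only for this illustration.

```python
import sqlite3

def make_table():
    """Build a small, made-up parking_violation table in memory."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE parking_violation (summons_number INTEGER, vehicle_body_type TEXT)"
    )
    conn.executemany(
        "INSERT INTO parking_violation VALUES (?, ?)",
        [(1, "SDN"), (2, None), (3, "VAN"), (4, None)],
    )
    return conn

select_sql = "SELECT vehicle_body_type FROM parking_violation ORDER BY summons_number"

# Variant A: COALESCE over every row (the form used in the exercise).
conn_a = make_table()
touched_all = conn_a.execute(
    "UPDATE parking_violation "
    "SET vehicle_body_type = COALESCE(vehicle_body_type, 'Unknown')"
).rowcount
rows_a = conn_a.execute(select_sql).fetchall()

# Variant B: only touch the rows that are actually missing a value.
conn_b = make_table()
touched_null = conn_b.execute(
    "UPDATE parking_violation SET vehicle_body_type = 'Unknown' "
    "WHERE vehicle_body_type IS NULL"
).rowcount
rows_b = conn_b.execute(select_sql).fetchall()

print(touched_all, touched_null, rows_a == rows_b)
```

Both variants leave the table in the same state; the second simply reports (and writes) fewer rows.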

# Analyzing incomplete records

In an effort to reduce the number of missing `vehicle_body_type` values going forward, your team has decided to embark on a campaign to educate issuing agencies on the need for complete data. However, each campaign will be customized for individual agencies.

In this exercise, your goal is to use the current missing data values to prioritize these campaigns. You will write a query which outputs the issuing agencies along with the number of records attributable to that agency with a `NULL` `vehicle_body_type`. These records will be listed in descending order to determine the order in which education campaigns should be developed.

```
SELECT
  -- Define the SELECT list: issuing_agency and num_missing
  issuing_agency,
  COUNT(*) AS num_missing
FROM
  parking_violation
WHERE
  -- Restrict the results to NULL vehicle_body_type values
  vehicle_body_type IS NULL
-- Group results by issuing_agency
GROUP BY
  issuing_agency
-- Order results by num_missing in descending order
ORDER BY
  num_missing DESC;
```
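
The grouping logic can be tried on a handful of rows. This is a minimal sketch, assuming an in-memory SQLite database; the agency codes `P`, `S`, and `T` and all rows are made up. A secondary sort on `issuing_agency` is added here only to make ties deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE parking_violation (issuing_agency TEXT, vehicle_body_type TEXT)"
)
# Hypothetical rows: agency P has two missing body types, T has one, S has none.
conn.executemany(
    "INSERT INTO parking_violation VALUES (?, ?)",
    [("P", None), ("P", None), ("P", "SDN"), ("T", None), ("S", "VAN")],
)

missing_by_agency = conn.execute(
    """
    SELECT issuing_agency, COUNT(*) AS num_missing
    FROM parking_violation
    WHERE vehicle_body_type IS NULL
    GROUP BY issuing_agency
    ORDER BY num_missing DESC, issuing_agency
    """
).fetchall()

print(missing_by_agency)  # [('P', 2), ('T', 1)]
```

Agencies with no missing values (here `S`) do not appear at all, because the `WHERE` clause filters their rows out before grouping.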

# Duplicate parking violations

There have been a number of complaints indicating that some New York residents have been receiving multiple parking tickets for a single violation. This is resulting in the affected residents having to incur additional legal fees for a single incident. There is justifiable anger about this situation. You have been tasked with identifying records that reflect this duplication of violations.

In this exercise, using `ROW_NUMBER()`, you will find `parking_violation` records that contain the same `plate_id`, `issue_date`, `violation_time`, `house_number`, and `street_name`, indicating that multiple tickets were issued for the same violation.

```
SELECT
  -- Include all columns
  *
FROM (
  SELECT
    summons_number,
    ROW_NUMBER() OVER(
      PARTITION BY
        plate_id,
        issue_date,
        violation_time,
        house_number,
        street_name
    ) - 1 AS duplicate,
    plate_id,
    issue_date,
    violation_time,
    house_number,
    street_name
  FROM
    parking_violation
) sub
WHERE
  -- Only return records where duplicate is 1 or more
  duplicate > 0;
```
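
To see how `ROW_NUMBER() - 1` labels repeated rows, the query can be run on a few made-up tickets. This sketch assumes an in-memory SQLite database (window functions need SQLite 3.25 or later). The `ORDER BY summons_number` inside the window is an addition for this example so that the numbering is deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE parking_violation (
      summons_number INTEGER, plate_id TEXT, issue_date TEXT,
      violation_time TEXT, house_number TEXT, street_name TEXT
    )
    """
)
# Hypothetical rows: summons 1-3 describe the same incident, summons 4 does not.
conn.executemany(
    "INSERT INTO parking_violation VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, "ABC123", "2019-06-01", "09:15", "12", "MAIN ST"),
        (2, "ABC123", "2019-06-01", "09:15", "12", "MAIN ST"),
        (3, "ABC123", "2019-06-01", "09:15", "12", "MAIN ST"),
        (4, "XYZ789", "2019-06-01", "09:15", "12", "MAIN ST"),
    ],
)

duplicates = conn.execute(
    """
    SELECT summons_number, duplicate
    FROM (
      SELECT
        summons_number,
        ROW_NUMBER() OVER (
          PARTITION BY plate_id, issue_date, violation_time, house_number, street_name
          ORDER BY summons_number
        ) - 1 AS duplicate
      FROM parking_violation
    ) sub
    WHERE duplicate > 0
    ORDER BY summons_number
    """
).fetchall()

print(duplicates)  # [(2, 1), (3, 2)]
```

The first row of each partition gets `duplicate = 0` and is treated as the original, so only the extra tickets are returned.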

# Resolving partial duplicates

The `parking_violation` dataset has been modified to include a `fee` column indicating the fee for the violation. This column would be useful for keeping track of New York City parking ticket revenue. However, due to duplicated violation records, revenue calculations based on the dataset would not be accurate. These duplicate records differ only in the value of the `fee` column; all other column values are identical. A decision has been made to use the minimum `fee` to resolve the ambiguity created by these duplicates.

Identify the 3 duplicated `parking_violation` records and use the `MIN()` function to determine the `fee` that will be used after removing the duplicate records.

```
SELECT
  -- Include SELECT list columns
  summons_number,
  MIN(fee) AS fee
FROM
  parking_violation
GROUP BY
  -- Define column for GROUP BY
  summons_number
HAVING
  -- Restrict to summons numbers with count greater than 1
  COUNT(summons_number) > 1;
```
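
The effect of `GROUP BY` with `HAVING` and `MIN()` is easy to check on a small table. This is a minimal sketch, assuming an in-memory SQLite database; the summons numbers and fees are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE parking_violation (summons_number INTEGER, fee INTEGER)")
# Hypothetical rows: summons 1 and 3 are duplicated with different fees.
conn.executemany(
    "INSERT INTO parking_violation VALUES (?, ?)",
    [(1, 50), (1, 65), (2, 115), (3, 115), (3, 95)],
)

resolved = conn.execute(
    """
    SELECT summons_number, MIN(fee) AS fee
    FROM parking_violation
    GROUP BY summons_number
    HAVING COUNT(summons_number) > 1
    ORDER BY summons_number
    """
).fetchall()

print(resolved)  # [(1, 50), (3, 95)]
```

Summons 2 appears only once, so the `HAVING` clause drops it from the result.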

# Detecting invalid values with regular expressions

In the video exercise, we saw that there are a number of ways to detect invalid values in our data. In this exercise, we will use regular expressions to identify records with invalid values in the `parking_violation` table.

A couple of regular expression patterns that will be useful in this exercise are `c{n}` and `c+`. `c{n}` matches the character `c` repeated exactly `n` times. For example, `x{4}` matches the string `xxxx`. `c+` matches the character `c` repeated one or more times, so `x+` matches `x`, `xx`, and `xxxx`. Note that `SIMILAR TO` succeeds only when the pattern matches the entire string.
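
The two quantifiers can be explored outside the database. The sketch below uses Python's `re.fullmatch` as a stand-in for whole-string matching; Python's regular expression syntax is not identical to `SIMILAR TO`, but `{n}`, `+`, and character classes such as `[A-Z]` behave the same way for these patterns. The `similar_to` helper and the sample values are hypothetical.

```python
import re

def similar_to(pattern, value):
    """Return True only if the pattern matches the whole string."""
    return re.fullmatch(pattern, value) is not None

# x{4}: exactly four x characters.
exact_four = [s for s in ["x", "xx", "xxxx", "xxxxx"] if similar_to(r"x{4}", s)]

# x+: one or more x characters and nothing else.
one_or_more = [s for s in ["", "x", "xx", "xxxx", "xy"] if similar_to(r"x+", s)]

# [A-Z]{2}: two uppercase letters, as used for registration_state.
states = ["NY", "N1", "ny", "NJ", "PAA"]
invalid_states = [s for s in states if not similar_to(r"[A-Z]{2}", s)]

print(exact_four)      # ['xxxx']
print(one_or_more)     # ['x', 'xx', 'xxxx']
print(invalid_states)  # ['N1', 'ny', 'PAA']
```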

```
SELECT
  summons_number,
  plate_id,
  registration_state
FROM
  parking_violation
WHERE
  -- Define the pattern to use for matching
  registration_state NOT SIMILAR TO '[A-Z]{2}';
```

```
SELECT
  summons_number,
  plate_id,
  plate_type
FROM
  parking_violation
WHERE
  -- Define the pattern to use for matching
  plate_type NOT SIMILAR TO '[A-Z]{3}';
```

```
SELECT
  summons_number,
  plate_id,
  vehicle_make
FROM
  parking_violation
WHERE
  -- Flag makes containing anything other than uppercase letters, '/', or whitespace
  vehicle_make NOT SIMILAR TO '[A-Z/\s]+';
```

# Identifying out-of-range vehicle model years

Type constraints are useful for restricting the type of data that can be stored in a table column. However, there are limitations to how thoroughly these constraints can prevent invalid data from entering the column. Range constraints are useful when the goal is to identify column values that are included in a range of values or excluded from a range of values. Using type constraints when defining a table, followed by checking column values with range constraints, is a powerful approach to ensuring the integrity of data.

In this exercise, you will use a `BETWEEN` clause to build a range constraint to identify invalid vehicle model years in the `parking_violation` table. Valid vehicle model years for this dataset are considered to be between 1970 and 2021.

```
SELECT
  -- Define the columns to return from the query
  summons_number,
  plate_id,
  vehicle_year
FROM
  parking_violation
WHERE
  -- Define the range constraint for invalid vehicle years
  vehicle_year NOT BETWEEN 1970 AND 2021;
```
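
Two details of `BETWEEN` are worth checking on sample data: the bounds are inclusive, and a `NULL` value is neither inside nor outside the range, so `NOT BETWEEN` does not flag it. This is a minimal sketch, assuming an in-memory SQLite database and made-up model years.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE parking_violation (summons_number INTEGER, vehicle_year INTEGER)")
# Hypothetical rows covering both boundaries, out-of-range values, and a NULL.
conn.executemany(
    "INSERT INTO parking_violation VALUES (?, ?)",
    [(1, 1969), (2, 1970), (3, 2021), (4, 2022), (5, 0), (6, None)],
)

out_of_range = conn.execute(
    """
    SELECT summons_number, vehicle_year
    FROM parking_violation
    WHERE vehicle_year NOT BETWEEN 1970 AND 2021
    ORDER BY summons_number
    """
).fetchall()

# NULL years need a separate IS NULL check.
null_years = conn.execute(
    "SELECT COUNT(*) FROM parking_violation WHERE vehicle_year IS NULL"
).fetchone()[0]

print(out_of_range)  # [(1, 1969), (4, 2022), (5, 0)]
print(null_years)    # 1
```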

# Identifying invalid parking violations

The `parking_violation` table has three columns populated by related time values. The `from_hours_in_effect` column indicates the start time when parking restrictions are enforced at the location where the violation occurred. The `to_hours_in_effect` column indicates the ending time for enforcement of parking restrictions. The `violation_time` indicates the time at which the violation was recorded. In order to ensure the validity of parking tickets, an audit is being performed to identify tickets given outside of the restricted parking hours.

In this exercise, you will use the parking restriction time range defined by `from_hours_in_effect` and `to_hours_in_effect` to identify parking tickets with an invalid `violation_time`.

```
SELECT 
  summons_number, 
  violation_time, 
  from_hours_in_effect, 
  to_hours_in_effect 
FROM 
  parking_violation 
WHERE 
  -- Exclude results with overnight restrictions 
  from_hours_in_effect < to_hours_in_effect AND 
  violation_time NOT BETWEEN from_hours_in_effect AND to_hours_in_effect;
```
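
The daytime check can be tried on a few rows. This sketch assumes an in-memory SQLite database and stores times as zero-padded 24-hour `HH:MM` text, which compares in chronological order as plain strings; the column types in the actual table may differ. All rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE parking_violation (
      summons_number INTEGER, violation_time TEXT,
      from_hours_in_effect TEXT, to_hours_in_effect TEXT
    )
    """
)
conn.executemany(
    "INSERT INTO parking_violation VALUES (?, ?, ?, ?)",
    [
        (1, "09:30", "08:00", "18:00"),  # inside the restricted hours
        (2, "19:15", "08:00", "18:00"),  # outside the restricted hours
        (3, "23:00", "22:00", "06:00"),  # overnight restriction, excluded here
    ],
)

invalid = conn.execute(
    """
    SELECT summons_number
    FROM parking_violation
    WHERE from_hours_in_effect < to_hours_in_effect
      AND violation_time NOT BETWEEN from_hours_in_effect AND to_hours_in_effect
    ORDER BY summons_number
    """
).fetchall()

print(invalid)  # [(2,)]
```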

# Invalid violations with overnight parking restrictions

In the previous exercise, you identified `parking_violation` records with `violation_time` values that were outside of the restricted parking times. The query for identifying these records was restricted to violations that occurred at locations without overnight restrictions. A modified query can be constructed to capture invalid violation times at locations with overnight parking restrictions. The parking violations in the dataset satisfying these criteria will be identified in this exercise.

For example, this query will identify that a record with a `from_hours_in_effect` value of `10:00 PM`, a `to_hours_in_effect` value of `10:00 AM`, and a `violation_time` of `4:00 PM` is an invalid record.

```
SELECT
  summons_number,
  violation_time,
  from_hours_in_effect,
  to_hours_in_effect
FROM
  parking_violation
WHERE
  -- Ensure from hours greater than to hours
  from_hours_in_effect > to_hours_in_effect AND
  -- Ensure violation_time less than from hours
  violation_time < from_hours_in_effect AND
  -- Ensure violation_time greater than to hours
  violation_time > to_hours_in_effect;
```
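
The 10:00 PM to 10:00 AM example can be reproduced on a small table. This sketch assumes an in-memory SQLite database and zero-padded 24-hour `HH:MM` text for the time columns (so `10:00 PM` is written `22:00`); the column types in the actual table may differ, and the rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE parking_violation (
      summons_number INTEGER, violation_time TEXT,
      from_hours_in_effect TEXT, to_hours_in_effect TEXT
    )
    """
)
conn.executemany(
    "INSERT INTO parking_violation VALUES (?, ?, ?, ?)",
    [
        (1, "16:00", "22:00", "10:00"),  # 4:00 PM, outside the overnight window
        (2, "23:30", "22:00", "10:00"),  # late evening, inside the window
        (3, "02:00", "22:00", "10:00"),  # early morning, inside the window
        (4, "12:00", "08:00", "18:00"),  # daytime restriction, not considered
    ],
)

invalid_overnight = conn.execute(
    """
    SELECT summons_number
    FROM parking_violation
    WHERE from_hours_in_effect > to_hours_in_effect
      AND violation_time < from_hours_in_effect
      AND violation_time > to_hours_in_effect
    ORDER BY summons_number
    """
).fetchall()

print(invalid_overnight)  # [(1,)]
```

For an overnight restriction, the invalid times form the gap between the end time and the start time, which is why both comparisons are combined with `AND`.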

# Recovering deleted data

While maintenance of the film permit data was taking place, a mishap occurred where the column storing the New York City borough was deleted. While the data was backed up the previous day, additional permit applications were processed between the time the backup was made and when the borough column was removed. In an attempt to recover the borough values while preserving the new data, you decide to use some data cleaning skills that you have learned to rectify the situation.

Fortunately, a table mapping zip codes and boroughs is available (`nyc_zip_codes`). You will use the zip codes from the `film_permit` table to re-populate the `borough` column values. This will be done utilizing five sub-queries to specify which of the five boroughs to use in the new `borough` column.

- Missing completely at random

```
-- Select all zip codes from the borough of Manhattan
SELECT zip_code FROM nyc_zip_codes WHERE borough = 'Manhattan';
```

```
SELECT
  event_id,
  CASE
    WHEN zip_code IN (SELECT zip_code FROM nyc_zip_codes WHERE borough = 'Manhattan') THEN 'Manhattan'
    -- Match Brooklyn zip codes
    WHEN zip_code IN (SELECT zip_code FROM nyc_zip_codes WHERE borough = 'Brooklyn') THEN 'Brooklyn'
    -- Match Bronx zip codes
    WHEN zip_code IN (SELECT zip_code FROM nyc_zip_codes WHERE borough = 'Bronx') THEN 'Bronx'
    -- Match Queens zip codes
    WHEN zip_code IN (SELECT zip_code FROM nyc_zip_codes WHERE borough = 'Queens') THEN 'Queens'
    -- Match Staten Island zip codes
    WHEN zip_code IN (SELECT zip_code FROM nyc_zip_codes WHERE borough = 'Staten Island') THEN 'Staten Island'
    -- Use default for non-matching zip_code
    ELSE NULL
  END AS borough
FROM
  film_permit;
```
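
As an alternative to the five sub-queries used in the exercise, the same lookup can be written as a `LEFT JOIN` against the mapping table. This is a sketch of a different technique, not the exercise's solution. It assumes an in-memory SQLite database, a few sample rows, and that each zip code in `nyc_zip_codes` maps to exactly one borough (otherwise the join would produce extra rows).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nyc_zip_codes (zip_code TEXT, borough TEXT)")
conn.execute("CREATE TABLE film_permit (event_id INTEGER, zip_code TEXT)")
# Sample mapping rows and made-up permits; 99999 has no match in the mapping.
conn.executemany(
    "INSERT INTO nyc_zip_codes VALUES (?, ?)",
    [("10001", "Manhattan"), ("11201", "Brooklyn"), ("10451", "Bronx")],
)
conn.executemany(
    "INSERT INTO film_permit VALUES (?, ?)",
    [(1, "10001"), (2, "11201"), (3, "99999")],
)

# LEFT JOIN keeps every permit; unmatched zip codes yield a NULL borough,
# mirroring the ELSE NULL branch of the CASE expression.
boroughs = conn.execute(
    """
    SELECT fp.event_id, z.borough
    FROM film_permit AS fp
    LEFT JOIN nyc_zip_codes AS z ON z.zip_code = fp.zip_code
    ORDER BY fp.event_id
    """
).fetchall()

print(boroughs)  # [(1, 'Manhattan'), (2, 'Brooklyn'), (3, None)]
```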