# Predicting Revenue from Parking Citations in Baltimore
Capstone Project for Springboard Data Science Bootcamp

Tamara Monge

### 1. The Project
Parking citations are a common occurrence in most metropolitan areas. On the one hand, they can be seen as an indicator of people's frustration with what may be limited legal parking options. On the other hand, they provide a non-negligible amount of revenue for a city.

This project explores patterns in parking citations issued in the City of Baltimore and builds several machine learning models that predict which citations will be paid. 

### 2. The Clients
Two clients will be interested in this study: the City of Baltimore's treasury department and city planners. City planners can use the patterns in parking citations elucidated here to learn where within the city more legal parking solutions are needed. The city treasury can use this project's models (of which fines will be paid) to predict the revenue for the coming quarter, given the citations that have been issued. This will allow them to make quarterly adjustments to the amount of money they can budget for items that are funded by parking citation revenues. They can also use these predictions to anticipate the amount of staff and resources that will be needed for collections. 

### 3. The Data: source and description 
The data used for this project comprises all parking citations issued in Baltimore and can be found at [Open Baltimore](https://data.baltimorecity.gov/Transportation/Parking-Citations/n4ma-fj3m/data "Open Baltimore website").

The dataset available on the website contains two distinct temporal cohorts of records: (1) a rolling account of all citations issued over the last two years, updated daily, and (2) an account of all citations issued more than two years ago that still have an outstanding balance. Because the aim of this project is to predict the amount of revenue the city can expect from parking citations, the second cohort was excluded to avoid biasing the analysis toward the older, outstanding accounts. To extend the length of the study, two downloads were performed and the first cohorts of the two downloads were merged. The final dataset contained all citations issued between September 23, 2015 and November 30, 2017, for a total of 912,308 records.

Each record contains the following fields: date, time, and address of citation, violation description and code, citation number, license plate number, license plate state, fine amount, and account balance. Some records contain additional fields, such as: latitude, longitude, neighborhood, police district, council district of the incident, license plate expiration date, vehicle make, and penalty amount (if any).

### 4. Wrangling 
The data came as CSV files, which were imported as pandas dataframes. The first cohorts of each file were merged to form a single dataframe. The resulting dataset was quite large (650 MB), so to conserve space and computing time, features that were not relevant to the study were dropped. Feature names were shortened to ease data operations, and the `date` feature was assigned as the dataframe index to assist with analyses and plotting.

Several pieces of information contained in the `date` feature were extracted into separate features, namely `yr`, `mo`, `day`, and `hr`. Similarly, the information contained in the `location` feature was extracted into `lat`, `lon`, and `lonlat`. Financial features were converted from strings to floats so that they would be treated as numerical variables. Some categorical features required cleaning for text consistency. For example, two records indicated the same vehicle make as `honda` and `HON`. So that these categorical variables would be treated appropriately, their classes were cleaned to have consistent case and a consistent number of characters.

Finally, three features were created to simplify information in the original features. The binary feature `paid` indicates whether or not a citation has been paid (i.e., `bal` = \$0). The binary feature `instate` indicates whether or not the offending vehicle is from Maryland (i.e., `state` = `MD`). The categorical feature `quad` indicates in which quadrant of the city the incident occurred (i.e., in which quadrant `lonlat` fell). The table below shows the cleaned features.

Cleaned Field | Description | Feature Type
:---|:---|:---
`date` | datetime of citation | -
`yr` | year of citation | categorical
`mo` | month of citation | categorical
`day` | day of citation | categorical
`hr` | hour of citation | categorical
`lonlat` | longitude and latitude of citation | -
`quad` | city quadrant of citation | categorical
`cit` | citation number | -
`desc` | violation description | categorical
`fine` | amount of fine | numerical
`bal` | amount due on account | -
`paid` | account paid down | categorical
`ofine` | open fine due on account | -
`openalty` | open penalty due on account | -
`tag` | license plate number | -
`state` | license plate state | -
`instate` | license plate from Maryland | categorical
`make` | car manufacturer | categorical
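
The wrangling steps described above can be sketched roughly as follows. This is a minimal illustration on a tiny made-up frame, not the project's actual code. The raw string format of the financial columns (`"$32.00"`) and the rule used to standardize `make` (upper case, first three characters) are assumptions made for the example.

```python
import pandas as pd

# Tiny made-up sample standing in for the Open Baltimore CSV export.
raw = pd.DataFrame({
    "date": ["2016-03-14 11:25:00", "2017-11-02 06:10:00"],
    "fine": ["$32.00", "$40.00"],   # assumed raw string format
    "bal": ["$0.00", "$40.00"],     # assumed raw string format
    "state": ["MD", "VA"],
    "make": ["honda", "HON"],
})

df = raw.copy()

# Parse the timestamp and convert financial strings to floats.
df["date"] = pd.to_datetime(df["date"])
for col in ["fine", "bal"]:
    df[col] = df[col].str.replace("$", "", regex=False).astype(float)

# Extract the temporal pieces into separate features.
df["yr"] = df["date"].dt.year
df["mo"] = df["date"].dt.month
df["day"] = df["date"].dt.day
df["hr"] = df["date"].dt.hour

# Standardize vehicle make (assumed rule: upper case, first three characters).
df["make"] = df["make"].str.upper().str[:3]

# Derived binary features.
df["paid"] = (df["bal"] == 0).astype(int)
df["instate"] = (df["state"] == "MD").astype(int)

# Use the citation datetime as the index.
df = df.set_index("date")
```

With this rule both `honda` and `HON` map to the same class, which is the kind of consistency the cleaning step was after.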

#### 4.1 Dealing with missing data
Approximately 30% of the records were missing `lonlat` values. Since the original dataset was so large, however, I decided it was acceptable to conduct the study using only records that contained `lonlat` information. This choice brought the number of records down to 641,072.

#### 4.2 Dealing with extraneous data
While examining the data, I noticed a small number (< 3%) of citations were issued outside of the city limits of Baltimore. While the reason for this is unknown, the focus of this project is to examine parking citations in the city of Baltimore. Therefore, citations issued outside the city were removed from the analysis. This was achieved by drawing a rectangle around the city (shown below) and removing observations outside of it. Fortunately, the eastern, northern, and western borders of the city fall approximately along latitude and longitude lines, and thus the only simplification required was to use a straight line along the edge of the southern border. Removing the extraneous data brought the number of records to 623,639.

![title](figures/Baltimore_BoundingMap.png "Map of Baltimore region (source: Google). Red shading indicates Baltimore city limits. Dashed rectangle indicates bounding box used for selecting data within Baltimore.")
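
A rectangular filter of this kind could look like the sketch below. The latitude and longitude bounds are rough, illustrative values chosen for the example and are not the ones used in the project; the helper name `within_city` is likewise hypothetical.

```python
import pandas as pd

# Approximate, illustrative bounds only (assumption, not the project's values).
LAT_MIN, LAT_MAX = 39.20, 39.37
LON_MIN, LON_MAX = -76.71, -76.53


def within_city(df):
    """Keep rows whose coordinates fall inside the bounding rectangle."""
    mask = (
        df["lat"].between(LAT_MIN, LAT_MAX)
        & df["lon"].between(LON_MIN, LON_MAX)
    )
    return df[mask]


# Made-up points: one inside the box, one too far north, one too far west.
sample = pd.DataFrame({
    "lat": [39.29, 39.45, 39.30],
    "lon": [-76.61, -76.61, -77.00],
})
inside = within_city(sample)
```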

### 5. Statistical exploration
In this section of the investigation I performed a statistical exploration of the dataset. In particular, I am interested in questions of the following nature: What kind of offenses were committed? Which offenses were the most common? Were offenses evenly distributed in time or were there peak hours or months when more citations were issued? Which vehicle makes committed the most offenses? From which state did the majority of offenders come? What were the most common fine amounts? How many accounts have been paid down? On average, how many citations were issued per day, per month, per year? On average, how much revenue did the city take in from parking citations, per day, per month, per year?  

The hour of day that saw the most citations was 11:00-12:00 (figure below), followed closely by 12:00-13:00. This suggests that more lunch-hour parking solutions are needed. 
![title](figures/CitationVolumeByHour.png)

Most citations occurred within central Baltimore (figure [here](figures/citation_heatmap_small_2018-05-21.html)). This suggests that more parking solutions are needed downtown.

Citations were issued for 26 unique violations (figure below). The most common violation was `All Other Parking Meter Violations` (24%), followed by `Fixed Speed Camera` (18%), and `No Stop/Park Street Cleaning` (12%). This suggests that a system could be implemented that allows patrons to pay their meter fee from any of numerous pay-stations around town. Alternatively, a mobile app could be developed that would allow patrons to pay their meter fee remotely. This figure also suggests that signage for Street Cleaning may need to be made clearer.

![title](figures/CitationVolumeByDescription.png)


The fines for parking citations ranged from \$23 to \$502 (figure below). Nearly half of fines were exactly \$32 and almost all (97%) fines were less than \$100. ![title](figures/CitationVolumeByFine.png "Note: y-axis is on a logarithmic scale.")

The balance due on accounts ranged from \$0 to \$954 (figure below). Some accounts had balances greater than their initial fine because they were issued a penalty for delinquent payment. ![title](figures/CitationVolumeByBalance.png "Note: y axis is on a logarithmic scale.") 

Two-thirds of accounts carried a balance of \$0 (figure below), meaning they had been paid down. ![title](figures/CitationVolumeByPaid.png)

The car makes that received the most citations were Honda (13%), Toyota (12%), and Ford (11%) (figure below).
![title](figures/CitationVolumeByMake.png)

Most offending vehicles were, not surprisingly, from Maryland (86%) (figure below), followed by neighboring states Virginia (2.4%) and Pennsylvania (2.3%). ![title](figures/CitationVolumeByState.png "Note: y axis is on a logarithmic scale.")

On average, the city issued 780 citations per day (figure below). ![title](figures/CitationVolumeAverages.png)

If one assumes all of these fines will be paid, this would translate to an average revenue from parking citations of ~\$37K per day, or ~\$3.3M per quarter. As was shown, however, only 67% of accounts were paid. So how much revenue did the city actually take in from parking citations? Is it possible to predict which citations will be paid and thus the actual revenue that will be taken in, given the citations that have been issued? In other words, can the city achieve a higher level of budgetary accuracy? This brings me to my machine learning problem. 
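
The arithmetic behind these figures can be laid out as a short sketch. The daily figures and the paid share come from the text above; the 90-day quarter is an assumption, and the "adjusted" figure further assumes that paid and unpaid citations carry the same average fine, which the data exploration does not establish.

```python
citations_per_day = 780     # average from the text
revenue_per_day = 37_000    # ~$37K per day if every fine were paid
days_per_quarter = 90       # assumption
paid_fraction = 0.67        # share of accounts that were paid

# Implied average fine per citation.
mean_fine = revenue_per_day / citations_per_day

# Revenue if every citation were paid.
naive_quarterly = revenue_per_day * days_per_quarter

# Crude adjustment: assumes paid and unpaid citations have the same mean fine.
adjusted_quarterly = naive_quarterly * paid_fraction

print(f"mean fine: ${mean_fine:.2f}")
print(f"naive quarterly revenue: ${naive_quarterly / 1e6:.2f}M")
print(f"adjusted quarterly revenue: ${adjusted_quarterly / 1e6:.2f}M")
```

The gap between the naive and adjusted figures is what motivates predicting payment citation by citation rather than applying a single flat rate.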

### 6. Prediction
I have treated this as a supervised learning classification problem where the binary feature `paid` is the target variable ($y$) and the other features are the predictive variables ($X$ = [`fine`, `desc`, `instate`, `make`, `quad`, `yr`, `mo`, `day`, `hr`]). The null hypothesis for this problem ($H_0$) is that all citations will be paid; because 67% of citations were in fact paid, this baseline has an accuracy of 0.67. The modeling task is to predict which citations will be paid with greater accuracy than $H_0$.

Variable |  Type | No. of Categories
:---|:---:|:---:
`fine` | numerical | n/a
`desc` | categorical | 26
`instate` | categorical | 2
`make` | categorical | 317
`quad` | categorical| 4
`yr`  | categorical | 3
`mo`  | categorical | 12
`day` | categorical | 31
`hr`  | categorical | 24









#### 6.1 Feature preparation and selection
The categorical predictive variables (see table above) were converted to dummy variables using the pandas `get_dummies` function. This increased the feature space from 9 to 411 dimensions. The dataset was already very large in sample size ($n$), so increasing the number of dimensions by 45x could make the training and prediction times untenable for some algorithms. Therefore the choice of which features to include needed to be considered carefully.
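
As a small illustration of how dummy encoding widens the feature space, the sketch below applies `pd.get_dummies` to a made-up three-row frame. The values are invented for the example, and only a few of the project's features are included.

```python
import pandas as pd

# Made-up rows using a subset of the project's feature names.
X = pd.DataFrame({
    "fine": [32.0, 40.0, 52.0],
    "desc": ["EXPIRED TAGS", "FIXED SPEED CAMERA", "EXPIRED TAGS"],
    "quad": ["SOUTHEAST", "NORTHWEST", "SOUTHEAST"],
    "hr": [11, 6, 12],
})

# Encode the categorical columns; the numerical `fine` column passes through.
categorical = ["desc", "quad", "hr"]
X_dummies = pd.get_dummies(X, columns=categorical)

print(X.shape, "->", X_dummies.shape)
print(list(X_dummies.columns))
```

Each categorical column becomes one indicator column per class, so a feature like `make` with hundreds of classes dominates the final width.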

I visually examined countplots of each predictive feature to identify which features may be most important. `fine` appears to be a good predictive variable (figure below). Most citations with a fine of either \$40 or \$302 were unpaid, while most citations of any other fine amount were paid down. ![title](figures/CountplotByFine.png)

`desc` also appears to be a good predictive variable (figure below). Most citations with the description `Fixed Speed Camera` or `Abandoned Vehicle` were unpaid, while most citations with any other description were paid down. ![title](figures/CountplotByDescription.png)

`instate` and `quad` appear to be less powerful predictors (figures below) in that most citations in each class of these variables were paid. ![title](figures/CountplotByState.png) ![title](figures/CountplotByQuadrant.png)

The only good predictor of a temporal nature appears to be `hr` (figure below). The majority of citations issued in the 6 o'clock hour were unpaid, while most citations issued at any other time of day were paid. ![title](figures/CountplotByHour.png)

#### 6.2 Model selection
Six classifiers were fit for this project: Logistic regression, two Support Vector Machines (SVMs) - one with a linear kernel and one with an RBF kernel, Decision Tree, Random Forest, and Naive Bayes. For comparison, a dummy decision tree classifier was fit using only the `fine` feature.


##### (A) Logistic Regression
The benefits of logistic regression are that it is one of the simplest classifiers and is readily interpretable. An especially useful feature of the logistic regression classifier is that it returns coefficients that indicate the level of influence each feature has on the model. This was useful when the computationally expensive nature of the RBF-SVM model required that the number of features be culled (see below).

##### (B) Support Vector Machine 
The benefits of SVMs are that they are effective in high-dimensional feature spaces and, with a non-linear kernel, perform well with non-linearly separable boundaries. One drawback of an SVM with a non-linear kernel such as the RBF is that training can be computationally expensive, particularly at large sample sizes. Indeed, attempting to train the RBF-SVM classifier on all 411 features of the dataset turned out to be prohibitively expensive on my machine. I took two approaches to work around this problem. The first approach was to train an SVM with a linear kernel. The second approach was to drastically reduce the number of features used in prediction. Each of these approaches produced the desired effect of a manageable training time, though each has a disadvantage. The first approach sacrifices the ability to handle non-linearly separable boundaries, while the second may sacrifice performance by culling features (a less meaningful decision boundary is achieved).

To reduce the number of features, I used the coefficients returned by the logistic regression model and selected those features with a coefficient $\geq$ 0.2. This arbitrary cutoff was chosen to achieve a manageable number of features (20-30) and resulted in 21 features.
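
The selection step can be sketched as below, using synthetic data in place of the citation features. The feature names `f0` to `f5`, the synthetic target, and the random seed are all invented for the illustration. The threshold is applied to the signed coefficient, as the text states; thresholding the absolute value instead would be a variation, not what is described here.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic binary features standing in for the dummy-encoded columns.
X = pd.DataFrame(
    rng.integers(0, 2, size=(500, 6)),
    columns=[f"f{i}" for i in range(6)],
)

# Synthetic target that depends strongly on f0 and f1 only.
logits = 2.0 * X["f0"] + 1.5 * X["f1"] - 1.5
y = (rng.random(500) < 1 / (1 + np.exp(-logits))).astype(int)

# Fit the logistic regression and label each coefficient with its feature.
clf = LogisticRegression(C=1).fit(X, y)
coefs = pd.Series(clf.coef_[0], index=X.columns)

# Keep features whose coefficient meets the cutoff.
THRESHOLD = 0.2
selected = coefs[coefs >= THRESHOLD].index.tolist()
X_reduced = X[selected]
```

The reduced frame `X_reduced` is what would then be passed to the more expensive classifier.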

##### (C) Decision Tree
Decision tree classifiers are computationally fast and highly interpretable. They also work well with categorical features. One drawback to decision trees is that they are prone to overfitting.  

##### (D) Random Forest 
Random forest classifiers are an ensemble of decision tree classifiers. As a result, they carry some of the advantages of decision trees, namely they are computationally fast, work well with high dimensions, and work well with categorical features. At the same time the ensemble nature of random forests reduces overfitting, thereby addressing the biggest disadvantage of decision trees. In addition, random forests are easier to interpret than other complex classifiers like the SVM. 

##### (E) Naive Bayes
The benefits of naive Bayes classifiers are that they are computationally very fast and perform well with high dimensions.


#### 6.3 Model fitting
The best hyperparameters for each model were found using a grid search with 5-fold cross-validation, with accuracy as the evaluation metric. The data were first split randomly into a 70% training set and a 30% test set. For each model, cross-validation was performed on the training set only, and the test set was held out until the evaluation stage. The best hyperparameters are shown below.

Classifier | Hyperparameters 
--- | ---
Logistic Regression | $C$ = 1
Linear SVM | $C$ = 0.1
RBF SVM | $C$ = 1, $\gamma$ = 0.25
Decision Tree | `max_depth` = 10
Random Forest | `max_depth` = 50, `min_samples_leaf` = 2, `n_estimators` = 50
Naive Bayes | n/a
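
A grid search of this kind might be set up as in the sketch below, shown for the random forest on synthetic data. The candidate grid is illustrative (it merely includes the best values reported above), and the synthetic dataset and random seeds are assumptions for the example rather than the project's actual configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the citation features and the `paid` target.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# 70%-30% split; the test portion is untouched until final evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Illustrative candidate values (assumption).
param_grid = {
    "max_depth": [10, 50],
    "min_samples_leaf": [1, 2],
    "n_estimators": [10, 50],
}

# 5-fold cross-validated grid search scored on accuracy.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

best_params = search.best_params_
cv_accuracy = search.best_score_
print(best_params, round(cv_accuracy, 3))
```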



Recall that the purpose of the model is to forecast which tickets will be paid so that the city can anticipate the amount of revenue it will receive. The most important metric by which to select the best model is thus accuracy.
As shown in the table below, every classifier except naive Bayes outperformed the null hypothesis (0.67), with the **random forest** just edging out the others. The random forest algorithm predicts which citations will be paid with 76.6% accuracy.

Classifier | Accuracy
--- | ---
Logistic Regression | 0.748 
Linear SVM | 0.746  
RBF SVM | 0.753
Decision Tree | 0.755 
Random Forest | 0.766 
Naive Bayes | 0.328  
Dummy | 0.714

#### 6.4 Model evaluation 
To evaluate its performance on unseen data, the tuned random forest was tested against the held-out test set.

In this project, the most important metric for evaluating model performance is accuracy. However, other metrics are also useful, and understanding the ways in which the model is wrong will let the city know which eventualities to prepare for. For example, a false positive (predicting a citation will be paid when in fact it is not) will result in the city thinking it has more money than it does. This would lead to programs going underfunded with little notice. Another type of error is a false negative (predicting a ticket will not be paid when in fact it is). This would result in the city receiving more money than planned and could mean a program was preemptively cut when it needn't have been. While both of these outcomes are undesirable, one could argue the false positive is more so. Therefore, the secondary metric of evaluation will be precision (i.e., how many of the citations predicted to be paid are actually paid?), and the tertiary metric will be recall (i.e., how many of the citations that are actually paid were anticipated?).

The random forest's performance on the held-out test set is reported below.

Accuracy | Class 1 Precision | Class 1 Recall
:---:|:---:|:---:
0.769|0.786 | 0.902


The random forest algorithm predicts which citations will be paid with 77% accuracy. When it predicts a ticket will be paid, it is correct 79% of the time. And it detects 90% of the paid tickets. 
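
For reference, the three metrics can be computed with scikit-learn as sketched below. The labels and predictions are a made-up ten-citation example (1 = paid, 0 = unpaid), not the project's test set.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Made-up ground truth and predictions (1 = paid, 0 = unpaid).
y_true = [1, 1, 1, 1, 0, 0, 1, 0, 1, 1]
y_pred = [1, 1, 1, 0, 1, 0, 1, 0, 1, 1]

# Share of all citations classified correctly.
accuracy = accuracy_score(y_true, y_pred)

# Of the citations predicted paid, the share that were actually paid.
precision = precision_score(y_true, y_pred, pos_label=1)

# Of the citations actually paid, the share that were predicted paid.
recall = recall_score(y_true, y_pred, pos_label=1)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

In this toy example there is one false positive and one false negative, so precision and recall happen to be equal; in the project's results recall is noticeably higher than precision, meaning false positives are the more common error.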

#### 6.5 Model results
The table below summarizes the most important features, according to the random forest algorithm. 

Feature | Importance
:---|:---:
yr_2017 | 0.148
desc_FIXED SPEED CAMERA | 0.074
yr_2016 | 0.070
mo_11 | 0.055
fine | 0.043
desc_ALL OTHER PARKING METER VIOLATIONS | 0.031
desc_EXPIRED TAGS | 0.031
mo_9 | 0.030
quad_SOUTHEAST | 0.024
mo_8 | 0.022

The most important features in determining whether a ticket will be paid are the **year** of the citation,  the **type of violation**, the **month** of the citation, the **fine**, and the **quadrant** where the citation occurred.
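
A ranking like the one in the table can be read off a fitted random forest through its `feature_importances_` attribute. The sketch below does this on synthetic data; the feature names `f0` to `f7`, the dataset, and the seeds are invented for the illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the dummy-encoded citation features.
X_arr, y = make_classification(
    n_samples=300, n_features=8, n_informative=3, random_state=0
)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(8)])

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Pair each importance with its feature name and sort from high to low.
importances = pd.Series(
    forest.feature_importances_, index=X.columns
).sort_values(ascending=False)

top_features = importances.head(10)
print(top_features)
```

The importances sum to one across all features, so each value can be read as a share of the forest's total impurity reduction.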

### 7. Limitations
- It is possible that there was an unknown pattern (e.g., period of time, geographic area, or particular officers) to the records that did not contain geospatial information, such that the exclusion of those records may have introduced bias into the results.


- Since there is no date-of-payment field, there is no means by which to establish the timing of payment beyond the fact that 67% of citations were paid within two years. We cannot say "X percent are paid within 1 month, 2 months, etc.," because all we have is a snapshot of the data, not a temporal trajectory. (We could say something like 50% of tickets issued in October 2017 were paid off within a month, but we can't generalize it and say tickets issued at any other time are 50% likely to be paid within a month because we only have the data at our two download points.) 

 

### 8. Further research
- Using another score during the gridsearch cross validation, such as the F1 score, may lead to modest improvements in performance. 


- Since `quad` did not have a high logistic regression coefficient, one might consider reincorporating the records that were missing `lonlat` information and excluding `quad` from the predictive features.


- Alternative methods for dimensionality reduction could improve performance. One possibility is to use PCA. Another possibility is to reduce the number of classes in the `make` field (e.g., top 30 and an 'other' class).
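
The class-reduction idea for `make` could be sketched as follows. The helper name `collapse_rare`, the `OTHER` label, and the sample values are hypothetical, and the example uses `top_n=2` only to keep the toy data small.

```python
import pandas as pd


def collapse_rare(series, top_n=30, other_label="OTHER"):
    """Keep the top_n most frequent classes and relabel the rest."""
    top = series.value_counts().nlargest(top_n).index
    return series.where(series.isin(top), other_label)


# Made-up vehicle makes.
make = pd.Series(["HON", "HON", "TOY", "TOY", "FOR", "BMW", "KIA", "HON"])
collapsed = collapse_rare(make, top_n=2)
print(collapsed.value_counts())
```

Applied before dummy encoding, this would cap the number of `make` columns at `top_n + 1`.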

### 9. Client recommendations
*To the city treasury:* 
1. Budget items that are funded by parking citation revenues should be budgeted **no more than one-quarter in advance using the random forest classifier to predict the revenue the city can expect**. 

*To city planners:* 
1. The most common violation is `All Other Parking Meter Violations`. One solution would be to **develop a mobile app** that allows users to pay their meter remotely from their devices. An alternative would be a **centralized system** in which meters can be paid from any of numerous machines throughout the city.

2. Citations peak during the lunch hours (11:00-13:00). One solution could be to **designate more spaces with a 30-60 minute limit during these hours, forcing turnover** so more spaces will be available.


### Acknowledgements

I would like to thank Springboard. I would also like to thank my mentor Justin Breucop for his guidance and suggestions. I also thank OpenBaltimore for making civic information openly available. 

---------
__Copyright 2018 Tamara Monge__