## Understanding and Predicting Property Maintenance Fines

Every year, the city of Detroit issues millions of dollars in fines to residents, and every year many of these fines remain unpaid. Enforcing unpaid blight fines is costly and tedious, so the city wants to know: how can we increase blight ticket compliance?

The first step in answering this question is understanding when and why a resident might fail to comply with a blight ticket. This is where predictive modeling comes in. This code was written to predict whether a given blight ticket will be paid on time.    

Each row in the dataset corresponds to a single blight ticket and includes information about when, why, and to whom the ticket was issued. The target variable is compliance, which is True if the ticket was paid early, on time, or within one month of the hearing date, False if the ticket was paid after the hearing date or not at all, and Null if the violator was found not responsible.

Note: Tickets where the violator was found not responsible are not considered during evaluation. They are included in the training set as an additional source of data for visualization, and to enable unsupervised and semi-supervised approaches. However, they are not included in the test set.
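Because tickets with a null `compliance` carry no label, they have to be set aside before fitting a supervised model. The snippet below is a minimal sketch of that step on a hypothetical toy frame: the values are invented for illustration, and only the column names `ticket_id` and `compliance` come from the data described above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training data; the values are invented for illustration.
toy = pd.DataFrame({
    "ticket_id": [1, 2, 3, 4, 5],
    "compliance": [1.0, 0.0, np.nan, 0.0, np.nan],
})

# Inspect the label distribution, keeping null labels visible.
label_counts = toy["compliance"].value_counts(dropna=False)
print(label_counts)

# Keep only tickets with a known outcome for supervised training.
labeled = toy[toy["compliance"].notna()].copy()
labeled["compliance"] = labeled["compliance"].astype(int)
print(labeled)
```

Filtering on the label column only, rather than calling `dropna()` on the whole frame, avoids discarding labeled tickets that merely have a missing feature value.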

In [3]:
# import required packages 
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

In [11]:
# import data in train, test, address, and latlon dataframes
df_train = pd.read_csv(r'address1\train.csv'
                       , encoding='iso-8859-1')
df_test = pd.read_csv(r'address2\test.csv'
                       , encoding='utf-8')
df_address = pd.read_csv(r'address3\addresses.csv'
                       , encoding='utf-8')
df_latlon = pd.read_csv(r'address4\latlons.csv'
                       , encoding='utf-8')
print(f'train dataframe \n {df_train.head()}\n, \n test dataframe \n {df_test.head()}\n, address dataframe \n {df_address.head()}\n, latlon dataframe \n {df_latlon.head()}\n')

train dataframe 
    ticket_id                                     agency_name  \
0      22056  Buildings, Safety Engineering & Env Department   
1      27586  Buildings, Safety Engineering & Env Department   
2      22062  Buildings, Safety Engineering & Env Department   
3      22084  Buildings, Safety Engineering & Env Department   
4      22093  Buildings, Safety Engineering & Env Department   

     inspector_name                      violator_name  \
0   Sims, Martinzie  INVESTMENT INC., MIDWEST MORTGAGE   
1  Williams, Darrin           Michigan, Covenant House   
2   Sims, Martinzie                    SANDERS, DERRON   
3   Sims, Martinzie                       MOROSI, MIKE   
4   Sims, Martinzie                    NATHANIEL, NEAL   

   violation_street_number violation_street_name  violation_zip_code  \
0                   2900.0                 TYLER                 NaN   
1                   4311.0               CENTRAL                 NaN   
2                   1449.0      

In [12]:
df_train['ticket_id'].nunique()  # number of distinct ticket ids

250306

In [40]:
df_train[['fine_amount','admin_fee', 'state_fee', 'late_fee', 'discount_amount',
       'clean_up_cost', 'judgment_amount' ]]

        fine_amount  admin_fee  state_fee  late_fee  discount_amount  clean_up_cost  judgment_amount
     0        250.0       20.0       10.0      25.0              0.0            0.0            305.0
     1        750.0       20.0       10.0      75.0              0.0            0.0            855.0
     2        250.0        0.0        0.0       0.0              0.0            0.0              0.0
     3        250.0        0.0        0.0       0.0              0.0            0.0              0.0
     4        250.0        0.0        0.0       0.0              0.0            0.0              0.0
   ...          ...        ...        ...       ...              ...            ...              ...
250301       1000.0        0.0        0.0       0.0              0.0            0.0              0.0
250302       1000.0        0.0        0.0       0.0              0.0            0.0              0.0
250303       1000.0        0.0        0.0       0.0              0.0            0.0              0.0
250304       1000.0        0.0        0.0       0.0              0.0            0.0              0.0


In [13]:
# list of columns to be used in the model for training set 
col = ['ticket_id', 'fine_amount', 'late_fee', 'discount_amount',
       'judgment_amount', 'compliance']

In [14]:
df_final_train = df_train[col]
df_final_test = df_test[col[:-1]]  # test features: every column in col except the 'compliance' label
df_final_test_drop = df_final_test.dropna()  # drop test rows with missing feature values

In [15]:
df_final_drop = df_final_train.dropna()  # drop rows with nulls, including tickets whose compliance is null
df_feature = df_final_drop[col[:-1]]  # feature matrix (note: this still includes ticket_id)
df_label = df_final_drop['compliance']  # target variable
X_train, X_test, y_train, y_test = train_test_split(df_feature, df_label, random_state=0)
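When one class is much rarer than the other, a plain random split can leave the two partitions with slightly different class proportions. The snippet below is a minimal sketch of a stratified split using the `stratify` argument of `train_test_split`. The data is synthetic and the names `X_demo` and `y_demo` are hypothetical; it is not the split used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels (100 positives out of 1000); invented for illustration.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({"fine_amount": rng.uniform(50, 1000, size=1000)})
y_demo = pd.Series([1] * 100 + [0] * 900)

# stratify=y_demo keeps the positive rate the same in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=0, stratify=y_demo
)

print("positive rate in train:", y_tr.mean())
print("positive rate in test: ", y_te.mean())
```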


In [16]:
# train a logistic regression model 
clf = LogisticRegression().fit(X_train, y_train)


In [17]:
# check the accuracy
clf.score(X_test, y_test)
# accuracy looks high, but compare it against a majority-class baseline before calling the model good

0.931048286214661
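Accuracy alone can be misleading when most tickets fall into one class, since a model that always predicts the majority class can also score highly. The snippet below is a minimal sketch, on synthetic data invented for illustration, of comparing a classifier against a `DummyClassifier` baseline and also reporting ROC AUC, which is based on the predicted probabilities rather than hard labels. The variable names and the data-generating process are assumptions, not part of this notebook's pipeline.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data with a small minority of positives; invented for illustration.
rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=(n, 2))
# Positives become more likely as the first feature grows.
p = 1 / (1 + np.exp(-(2.0 * x[:, 0] - 3.5)))
y = (rng.uniform(size=n) < p).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(x, y, random_state=0, stratify=y)

# Baseline that always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
model = LogisticRegression().fit(X_tr, y_tr)

baseline_acc = baseline.score(X_te, y_te)
model_acc = model.score(X_te, y_te)
model_auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])

print("baseline accuracy:", baseline_acc)
print("model accuracy:   ", model_acc)
print("model ROC AUC:    ", model_auc)
```

If the model's accuracy is close to the baseline's, the accuracy figure says little about how well the model separates the two classes, and a ranking metric such as ROC AUC is more informative.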

In [89]:
# predict the probability of compliance (class 1) for each ticket in the test set
data = clf.predict_proba(df_final_test_drop[col[:-1]])[:, 1]

In [90]:
pd.Series(data=data, index=df_final_test_drop['ticket_id'])

ticket_id
284932    0.050666
285362    0.004446
285361    0.067789
285338    0.050636
285346    0.067790
            ...   
376496    0.003876
376497    0.003876
376499    0.059578
376500    0.059578
369851    0.096334
Length: 61001, dtype: float64