<a href="https://colab.research.google.com/github/sanjayc2/Prediction_Gradient_Boosting_Classsifier/blob/main/Predict_Property_Fines_Gradient_Boosting_classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---

_You are currently looking at **version 1.1** of this notebook. To download notebooks and datafiles, as well as get help on Jupyter notebooks in the Coursera platform, visit the [Jupyter Notebook FAQ](https://www.coursera.org/learn/python-machine-learning/resources/bANLa) course resource._

---

## Understanding and Predicting Property Maintenance Fines

[Blight violations](http://www.detroitmi.gov/How-Do-I/Report/Blight-Complaint-FAQs) are issued by the city to individuals who allow their properties to remain in a deteriorated condition. Every year, the city of Detroit issues millions of dollars in fines to residents and every year, many of these fines remain unpaid. Enforcing unpaid blight fines is a costly and tedious process, so the city wants to know: how can we increase blight ticket compliance?

The first step in answering this question is understanding when and why a resident might fail to comply with a blight ticket. This is where predictive modeling comes in. For this assignment, your task is to predict whether a given blight ticket will be paid on time.

All data for this assignment has been provided through the [Detroit Open Data Portal](https://data.detroitmi.gov/).
___

We provide you with two data files for use in training and validating your models: train.csv and test.csv. Each row in these two files corresponds to a single blight ticket and includes information about when, why, and to whom each ticket was issued. The target variable is compliance, which is True if the ticket was paid early, on time, or within one month of the hearing date, False if the ticket was paid after the hearing date or not at all, and Null if the violator was found not responsible. Compliance, as well as a handful of other variables that will not be available at test time, is included only in train.csv.

Note: All tickets where the violators were found not responsible are not considered during evaluation. They are included in the training set as an additional source of data for visualization, and to enable unsupervised and semi-supervised approaches. However, they are not included in the test set.
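
As a minimal sketch of the note above, the snippet below filters out tickets whose `compliance` is Null before supervised training. The toy DataFrame is hypothetical and stands in for `train.csv`; only the `ticket_id` and `compliance` column names come from the assignment.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows standing in for train.csv (values are made up).
toy_train = pd.DataFrame({
    "ticket_id": [1, 2, 3, 4],
    "compliance": [1.0, 0.0, np.nan, 0.0],  # NaN = violator found not responsible
})

# Keep only tickets with a known label; these are the rows used for evaluation.
labeled = toy_train.dropna(subset=["compliance"])

# Count the not-responsible tickets that were set aside.
n_not_responsible = int(toy_train["compliance"].isna().sum())
print(len(labeled), n_not_responsible)
```

The dropped rows could still be kept in a separate frame for visualization or semi-supervised experiments, as the note suggests.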

<br>

**File descriptions** (Use only this data for training your model!)

    readonly/train.csv - the training set (all tickets issued 2004-2011)
    readonly/test.csv - the test set (all tickets issued 2012-2016)
    readonly/addresses.csv & readonly/latlons.csv - mapping from ticket id to addresses, and from addresses to lat/lon coordinates. 
     Note: misspelled addresses may be incorrectly geolocated.
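
One way to use the two mapping files is to chain them into a single ticket-to-coordinates table. The sketch below assumes `addresses.csv` has `ticket_id` and `address` columns and `latlons.csv` has `address`, `lat`, and `lon` columns; the rows are invented for illustration.

```python
import pandas as pd

# Hypothetical stand-ins for readonly/addresses.csv and readonly/latlons.csv.
addresses = pd.DataFrame({
    "ticket_id": [10, 11, 12],
    "address": ["1 main st", "2 oak ave", "3 elm rd"],
})
latlons = pd.DataFrame({
    "address": ["1 main st", "2 oak ave"],
    "lat": [42.33, 42.35],
    "lon": [-83.05, -83.10],
})

# Left join on the shared "address" column so every ticket is kept;
# addresses that could not be geolocated end up with NaN coordinates.
ticket_locations = (
    addresses.merge(latlons, how="left", on="address")
             .drop(columns="address")
)
print(ticket_locations)
```

A left join keeps tickets with missing coordinates, so the NaN values must be filled or dropped before fitting a model.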

<br>

**Data fields**

train.csv & test.csv

    ticket_id - unique identifier for tickets
    agency_name - Agency that issued the ticket
    inspector_name - Name of inspector that issued the ticket
    violator_name - Name of the person/organization that the ticket was issued to
    violation_street_number, violation_street_name, violation_zip_code - Address where the violation occurred
    mailing_address_str_number, mailing_address_str_name, city, state, zip_code, non_us_str_code, country - Mailing address of the violator
    ticket_issued_date - Date and time the ticket was issued
    hearing_date - Date and time the violator's hearing was scheduled
    violation_code, violation_description - Type of violation
    disposition - Judgment and judgment type
    fine_amount - Violation fine amount, excluding fees
    admin_fee - $20 fee assigned to responsible judgments
    state_fee - $10 fee assigned to responsible judgments
    late_fee - 10% fee assigned to responsible judgments
    discount_amount - discount applied, if any
    clean_up_cost - DPW clean-up or graffiti removal cost
    judgment_amount - Sum of all fines and fees
    grafitti_status - Flag for graffiti violations
    
train.csv only

    payment_amount - Amount paid, if any
    payment_date - Date payment was made, if it was received
    payment_status - Current payment status as of Feb 1 2017
    balance_due - Fines and fees still owed
    collection_status - Flag for payments in collections
    compliance [target variable for prediction] 
     Null = Not responsible
     0 = Responsible, non-compliant
     1 = Responsible, compliant
    compliance_detail - More information on why each ticket was marked compliant or non-compliant


___

## Evaluation

Your predictions will be given as the probability that the corresponding blight ticket will be paid on time.

The evaluation metric for this assignment is the Area Under the ROC Curve (AUC). 

Your grade will be based on the AUC score computed for your classifier. A model with an AUROC of 0.7 passes this assignment; a model with an AUROC over 0.75 will receive full points.
___
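
To make the metric concrete, here is a small sketch that computes AUC on four made-up tickets with `sklearn.metrics.roc_auc_score`, then checks it against the pairwise interpretation: the fraction of (compliant, non-compliant) pairs in which the compliant ticket receives the higher score. The labels and scores are invented.

```python
from itertools import product

from sklearn.metrics import roc_auc_score

# Invented labels (1 = compliant) and predicted probabilities.
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]

auc_value = roc_auc_score(y_true, y_prob)

# Pairwise view: count positive/negative pairs ranked in the right order.
pos = [p for p, y in zip(y_prob, y_true) if y == 1]
neg = [p for p, y in zip(y_prob, y_true) if y == 0]
pairwise_auc = sum(p > n for p, n in product(pos, neg)) / (len(pos) * len(neg))

print(auc_value, pairwise_auc)
```

Because AUC depends only on how the scores are ranked, it needs no classification threshold, which suits a target as imbalanced as compliance.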

For this assignment, create a function that trains a model to predict blight ticket compliance in Detroit using `readonly/train.csv`. Using this model, return a series of length 61001 with the data being the probability that each corresponding ticket from `readonly/test.csv` will be paid, and the index being the ticket_id.

Example:

    ticket_id
       284932    0.531842
       285362    0.401958
       285361    0.105928
       285338    0.018572
                 ...
       376499    0.208567
       376500    0.818759
       369851    0.018528
       Name: compliance, dtype: float32
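
A minimal sketch of how a return value in this shape could be assembled from classifier output. The `proba` array is a made-up stand-in for what `predict_proba` returns, with one row per ticket and one column per class.

```python
import numpy as np
import pandas as pd

# Made-up ticket ids and class probabilities: columns are P(class 0), P(class 1).
ticket_ids = np.array([284932, 285362, 285361])
proba = np.array([[0.47, 0.53],
                  [0.60, 0.40],
                  [0.89, 0.11]])

# Take the probability of the positive class (compliance = 1) and index by ticket_id.
result = pd.Series(proba[:, 1], index=ticket_ids, dtype="float32", name="compliance")
result.index.name = "ticket_id"
print(result)
```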

import pandas as pd
import numpy as np

def blight_model():
    
    # Your code here
    
    from sklearn.model_selection import train_test_split
    from sklearn.feature_selection import SelectFromModel
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV
    from sklearn.metrics import roc_curve, auc
    from sklearn.metrics import confusion_matrix
    
    # Map ticket_id to address, then join to the address -> (lat, lon) map on the shared "address" column.
    # Note: "on" cannot be combined with left_index/right_index in pd.merge, so only "on" is used here.
    ticket_to_address_df = pd.read_csv("readonly/addresses.csv")
    address_to_lat_lon_df = pd.read_csv("readonly/latlons.csv")
    ticket_to_location_df = pd.merge(ticket_to_address_df, address_to_lat_lon_df, how="left", on="address")
    ticket_to_location_df.drop(labels = 'address', axis = 1, inplace=True)  # don't need address as a feature
    #print(ticket_to_location_df.head(20))
    
    # get training and test data. Drop rows where needed columns have NaN
    #train_df = pd.read_csv("readonly/train.csv", engine="python")
    train_df = pd.read_csv("readonly/train.csv", encoding = "ISO-8859-1", low_memory=False)
    test_df = pd.read_csv("readonly/test.csv")
    # Drop rows where needed columns have NaN.  Consider if fillna (0) is better for any columns, e.g. fine_amount 
    # Note that compliance = null corresponds to notresponsible, which we ignore for this assignment
    train_df.dropna(axis=0, subset = ['compliance'], inplace = True)
    #test_df.dropna(axis=0, subset = ['ticket_id', 'zip_code'], inplace = True)
    
   
    # need to include ticket_id to help join with latlons
    # use disposition as categorical feature; test violation_code too
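    # .copy() makes these independent frames, so the in-place edits below do not trigger SettingWithCopyWarning
    train_df_needed = train_df[['ticket_id', 'judgment_amount', 'disposition', 'violation_code', 'compliance']].copy()
    test_df_needed  = test_df[['ticket_id', 'judgment_amount', 'disposition', 'violation_code']].copy()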
    
    # construct X_train and X_test w/ lat and lon of mailing address by merging X_train/test_df with ticket_to_location_df
    #train_df_used = pd.merge(train_df_needed, ticket_to_location_df, how = "left", on = "ticket_id", left_index=True, right_index=True)
    #test_df_used = pd.merge(test_df_needed, ticket_to_location_df, how = "left", on = "ticket_id", left_index=True, right_index=True)
    train_df_used = train_df_needed # pd.merge(train_df_needed, ticket_to_location_df, how = "inner", on = "ticket_id", left_index=True, right_index=True)
    test_df_used = test_df_needed # pd.merge(test_df_needed, ticket_to_location_df, how = "inner", on = "ticket_id", left_index=True, right_index=True)
    
    #print("train_df_used shape after merge w/ ticket->location:" + str(train_df_used.shape))
    #print(train_df_used.head(10))
    #print("test_df_used shape after merge w/ ticket->location:" + str(test_df_used.shape))
    #print(test_df_used.head(10))
    
    #process the disposition (categorical) column using one-hot or get_dummies, and merge with train_df_used
    disp_df_train = pd.get_dummies(train_df['disposition'])
    violation_df_train = pd.get_dummies(train_df['violation_code'])
    train_df_used = train_df_used.merge(disp_df_train, how = "left", left_index=True, right_index=True)
    train_df_used = train_df_used.merge(violation_df_train, how = "left", left_index=True, right_index=True)
    #print("train_df_used shape after merge w/ disposition dummies:" + str(train_df_used.shape))
    #print(train_df_used.head(10))
    disp_df_test = pd.get_dummies(test_df['disposition'])
    violation_df_test = pd.get_dummies(test_df['violation_code'])
    test_df_used = test_df_used.merge(disp_df_test, how = "left", left_index=True, right_index=True)
    test_df_used = test_df_used.merge(violation_df_test, how = "left", left_index=True, right_index=True)
    #print("test_df_used shape after merge w/ disposition dummies:" + str(test_df_used.shape))
    #print(test_df_used.head(10))
    
    #train_df_used.dropna(inplace= True)         # drop any NaNs created as a result of the left merge
    train_df_used.dropna(axis=0, subset = ['compliance'], inplace=True)     # remove rows with null compliance (y) values
    #train_df_used.fillna({'lat': train_df_used['lat'].mean(), 'lon': train_df_used['lon'].mean()}, inplace=True)     # fill w/ means instead of dropping NaNs
    train_df_used.fillna({'judgment_amount': 0.0}, inplace=True)    # fill missing judgment amounts with 0 (fine_amount is not a column of this frame)
    
    #print("train_df_used shape after fillna:" + str(train_df_used.shape))
    #print(train_df_used.head(10))
    
    #test_df_used.dropna(inplace= True)         # if any NaNs created as a result of the left merge
    #test_df_used.fillna(value = {'lat': test_df_used['lat'].mean(), 'lon': test_df_used['lon'].mean()}, inplace= True)  # fill w/ means instead of dropping NaNs
    test_df_used.fillna({'judgment_amount': 0.0}, inplace= True)    # fill missing judgment amounts with 0
    
    #print("test_df_used shape after merge and fillna:" + str(test_df_used.shape))
    #print(test_df_used.head(10))
    
    # drop non-needed features (ticket_id, disposition, etc.); don't need ticket_id as a feature in X_train or X_test
    # also drop the raw disposition and violation_code values (since these cannot be processed by the sklearn decision tree)
    train_df_used.drop(labels = ['ticket_id', 'disposition', 'violation_code'], axis = 1, inplace=True)   
    # Save test ticket_id as ndarray for later use
    ticket_id = test_df_used['ticket_id'].values
    #print(type(ticket_id))
    test_df_used.drop(labels = ['ticket_id', 'disposition', 'violation_code'], axis = 1, inplace=True)

    #print("train_df_used shape after dropping non-needed features:" + str(train_df_used.shape))
    #print(train_df_used.head(10))
    #print("test_df_used shape after dropping non-needed features:" + str(test_df_used.shape))
    #print(test_df_used.head(10))
    
    # Now split the labeled data into a training set and a held-out dev (validation) set
    X_train, X_dev, y_train, y_dev = train_test_split(train_df_used.drop(labels='compliance',axis=1), train_df_used['compliance'], random_state=0)
    
    #Do model-based feature selection using linear classifier w/ L1 norm
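    # The L1 penalty needs a solver that supports it; liblinear does (the newer default, lbfgs, does not)
    lr = LogisticRegression(penalty="l1", C = 0.05, dual=False, solver="liblinear", random_state=0)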
    model = SelectFromModel(lr, prefit=False)
    transformed_X_array = model.fit_transform(X_train, y_train)
    feature_idx = model.get_support()           # return mask associated with each feature
    selected_feature_names = X_train.columns[feature_idx]
    #feature_importances = model.feature_importance_
    #print("X_train shape:" + str(X_train.shape))
    #X_train.head(10)
    print("feature names after selection:" + str(selected_feature_names))
    
    #Test evaluation metric (auc) using model-based selected features on dev set
    clf_lr = lr.fit(X_train[selected_feature_names], y_train)
    y_scores = clf_lr.decision_function(X_dev[selected_feature_names])
    false_pos_rate, true_pos_rate, _ = roc_curve(y_dev, y_scores)
    roc_auc = auc(false_pos_rate, true_pos_rate)
    #print("dev set roc_auc w/ selected features:" + str(roc_auc))
    y_predict = clf_lr.predict(X_dev[selected_feature_names])
    #print("dev set confusion matrix w/ selected features:")
    #print(confusion_matrix(y_dev, y_predict))
    
    #print("y_train shape used:" + str(y_train.shape))
    #print(y_train.dtype)
    #print("X_train shape used:" + str(X_train.shape))
    #print(X_train.head(10))
    #for col in X_train.columns:
    #    print(X_train[col].dtype)
    
    # Align X_test with the selected training columns: extra test-only dummy columns are dropped,
    # and any selected column that never occurs in the test data is added as all zeros
    X_test = test_df_used.reindex(columns=selected_feature_names, fill_value=0)
    #print("X_test shape used:" + str(X_test.shape))
    #print(X_test.head(10))

    #Do cross-validation to determine best hyperparameters.
    #gbc = GradientBoostingClassifier(random_state=0)
    #grid_dict = {'learning_rate': [0.01, 0.05], 'n_estimators': [100, 200], 'max_depth': [4,8]}
    #grid_clf_auroc = GridSearchCV(gbc, param_grid = grid_dict, scoring = "roc_auc")    # using default 5-fold cross-validation
    #grid_clf_auroc.fit(X_train[selected_feature_names], y_train)
    #dict = grid_clf_auroc.cv_results_
    #mean_score_grid = dict['mean_test_score'].reshape(2,4)
    #print("mean score grid:") 
    #print (mean_score_grid)
    
    # Test on dev set using roc_auc evaluation metric 
    gbc = GradientBoostingClassifier(learning_rate = 0.1,max_depth = 8, n_estimators = 400, random_state=0)
    clf_gb = gbc.fit(X_train[selected_feature_names], y_train)
    y_scores = clf_gb.decision_function(X_dev[selected_feature_names])
    false_pos_rate, true_pos_rate, _ = roc_curve(y_dev, y_scores)
    roc_auc = auc(false_pos_rate, true_pos_rate)
    print("dev set roc_auc:" + str(roc_auc))
    y_predict = clf_gb.predict(X_dev[selected_feature_names])
    print("dev set confusion matrix:")
    print(confusion_matrix(y_dev, y_predict))
    
    # Get predicted probabilities for the test set.
    # clf_gb was already fitted on the same training data above, so it is reused rather than refitted.
    clf = clf_gb
    y_proba = clf.predict_proba(X_test)
    y_proba_series = pd.Series(data=y_proba[:,1],index=ticket_id,dtype="float32",name='compliance')
    y_proba_series.index.name = "ticket_id"
    #print(len(y_proba_series))
    return y_proba_series # Your answer here
blight_model()

feature names after selection:Index(['judgment_amount', 'Responsible (Fine Waived) by Deter',
       'Responsible by Admission', 'Responsible by Default', '19450901',
       '22-2-17', '22-2-43', '22-2-45', '22-2-61', '22-2-88',
       '61-81.0100/32.0066', '9-1-103(C)', '9-1-104', '9-1-110(a)',
       '9-1-36(a)', '9-1-43(a) - (Dwellin', '9-1-81(a)'],
      dtype='object')
dev set roc_auc:0.798854445717
dev set confusion matrix:
[[37008    70]
 [ 2400   492]]


ticket_id
284932    0.158726
285362    0.010554
285361    0.053662
285338    0.042167
285346    0.080901
285345    0.042167
285347    0.059388
285342    0.832888
285530    0.010554
284989    0.016898
285344    0.048164
285343    0.010554
285340    0.010554
285341    0.048164
285349    0.080901
285348    0.042167
284991    0.016898
285532    0.016898
285406    0.016898
285001    0.014205
285006    0.010554
285405    0.010554
285337    0.016898
285496    0.048164
285497    0.042167
285378    0.010554
285589    0.016898
285585    0.042167
285501    0.053662
285581    0.010554
            ...   
376367    0.037511
376366    0.035495
376362    0.182485
376363    0.185069
376365    0.037511
376364    0.035495
376228    0.035495
376265    0.035495
376286    0.329754
376320    0.035495
376314    0.035495
376327    0.329754
376385    0.329754
376435    0.089113
376370    0.930873
376434    0.059388
376459    0.046991
376478    0.000050
376473    0.035495
376484    0.016416
376482    0.016898
37