# Blight violations (Property Maintenance Fines)

### Objective

Objective: How can we increase blight ticket compliance?

The first step in answering this question is understanding when and why a resident might fail to comply with a blight ticket. This is where predictive modeling comes in. For this assignment, your task is to predict whether a given blight ticket will be paid on time.

### Evaluation

Predictions: Probability that the corresponding blight ticket will be paid on time.

Evaluation metric: Area Under the ROC Curve (AUC).

Your grade will be based on the AUC score computed for your classifier. A model with an AUC of 0.7 passes this assignment; a score over 0.75 will receive full points.
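
As a reminder of what the metric measures: AUC equals the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one, with ties counting as one half. Below is a minimal pure-Python sketch on made-up labels and scores. The function name `pairwise_auc` is hypothetical; in practice `sklearn.metrics.roc_auc_score` would normally be used instead.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half a win
    return wins / (len(pos) * len(neg))

# Made-up example: 2 compliant tickets (1) and 3 non-compliant tickets (0).
y_true = [0, 0, 1, 1, 0]
y_score = [0.10, 0.40, 0.35, 0.80, 0.20]
auc = pairwise_auc(y_true, y_score)
print(auc)
```

Only the ordering of the scores matters here, which is why the assignment asks for probabilities rather than hard 0/1 predictions.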

### Mission

Train a model to predict blight ticket compliance in Detroit using train.csv. Using this model, return a series of length 61001 whose values are the probabilities that each corresponding ticket from test.csv will be paid on time, and whose index is the ticket_id.

    Reminder:
    y_proba_cls = cls.fit(X_train, y_train).predict_proba(X_test)

Use a supervised model and MinMaxScaler()

    Reminder:
    from sklearn.preprocessing import MinMaxScaler

    scaler = MinMaxScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

Example:

    ticket_id
       284932    0.531842
       285362    0.401958
       285361    0.105928
       285338    0.018572
                 ...
       376499    0.208567
       376500    0.818759
       369851    0.018528
       Name: compliance, dtype: float32

### Data files

    train.csv      : training data
    test.csv       : testing data
    addresses.csv  : ticket_id to address
    latlons.csv    : address to geo-coordinates

    In train.csv & test.csv,
    Each row       : a single blight ticket including information about when, why, and to whom each ticket was issued.

    In train.csv,
    target variable: compliance (0: late payment, 1: on-time payment, null: ignore)
    
### Data field
    
train.csv & test.csv

    - ticket_id: unique identifier for tickets
    - agency_name: Agency that issued the ticket
    - inspector_name: Name of inspector that issued the ticket
    - violator_name: Name of the person/organization that the ticket was issued to
    - violation_street_number, violation_street_name, violation_zip_code: Address where the violation occurred
    - mailing_address_str_number, mailing_address_str_name, city, state, zip_code, non_us_str_code, country: Mailing address of the violator
    - ticket_issued_date: Date and time the ticket was issued
    - hearing_date: Date and time the violator's hearing was scheduled
    - violation_code, violation_description: Type of violation
    - disposition: Judgment and judgment type
    - fine_amount: Violation fine amount, excluding fees
    - admin_fee: $20 fee assigned to responsible judgments
    - state_fee: $10 fee assigned to responsible judgments
    - late_fee: 10% fee assigned to responsible judgments
    - discount_amount: discount applied, if any
    - clean_up_cost: DPW clean-up or graffiti removal cost
    - judgment_amount: Sum of all fines and fees
    - grafitti_status: Flag for graffiti violations
    
train.csv only

    - payment_amount: Amount paid, if any
    - payment_date: Date payment was made, if it was received
    - payment_status: Current payment status as of Feb 1 2017
    - balance_due: Fines and fees still owed
    - collection_status: Flag for payments in collections
    - compliance: target variable for prediction
         Null = Not responsible
         0 = Responsible, non-compliant
         1 = Responsible, compliant
    - compliance_detail: More information on why each ticket was marked compliant or non-compliant
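
The fee fields combine arithmetically into `judgment_amount`. The sketch below checks that relationship on the first three test.csv rows displayed later in this notebook. Treating the discount as a subtraction and the clean-up cost as an addition is an assumption, since both are 0 in those rows.

```python
import math

# (fine_amount, admin_fee, state_fee, late_fee, discount_amount,
#  clean_up_cost, judgment_amount) copied from the first test.csv rows shown later.
rows = [
    (200.0, 20.0, 10.0, 20.0, 0.0, 0.0, 250.0),
    (1000.0, 20.0, 10.0, 100.0, 0.0, 0.0, 1130.0),
    (100.0, 20.0, 10.0, 10.0, 0.0, 0.0, 140.0),
]

def expected_judgment(fine, admin, state, late, discount, clean_up):
    # Assumption: discounts reduce the total and clean-up costs increase it.
    return fine + admin + state + late - discount + clean_up

# Does the sum of the parts reproduce judgment_amount for every row?
checks = [expected_judgment(*r[:6]) == r[6] for r in rows]

# Is late_fee 10% of fine_amount in every row?
late_fee_is_10_percent = all(math.isclose(r[3], 0.10 * r[0]) for r in rows)
print(checks, late_fee_is_10_percent)
```

If the relationship holds across the data, the individual fee columns carry little information beyond `judgment_amount`, which is one reason to consider dropping them.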


In [1]:
import pandas as pd
import numpy as np

### Explore train data

In [2]:
train_data = pd.read_csv('train.csv', encoding = 'ISO-8859-1')
print(train_data.shape)
train_data.head(3)



(250306, 34)


Unnamed: 0,ticket_id,agency_name,inspector_name,violator_name,violation_street_number,violation_street_name,violation_zip_code,mailing_address_str_number,mailing_address_str_name,city,...,clean_up_cost,judgment_amount,payment_amount,balance_due,payment_date,payment_status,collection_status,grafitti_status,compliance_detail,compliance
0,22056,"Buildings, Safety Engineering & Env Department","Sims, Martinzie","INVESTMENT INC., MIDWEST MORTGAGE",2900.0,TYLER,,3.0,S. WICKER,CHICAGO,...,0.0,305.0,0.0,305.0,,NO PAYMENT APPLIED,,,non-compliant by no payment,0.0
1,27586,"Buildings, Safety Engineering & Env Department","Williams, Darrin","Michigan, Covenant House",4311.0,CENTRAL,,2959.0,Martin Luther King,Detroit,...,0.0,855.0,780.0,75.0,2005-06-02 00:00:00,PAID IN FULL,,,compliant by late payment within 1 month,1.0
2,22062,"Buildings, Safety Engineering & Env Department","Sims, Martinzie","SANDERS, DERRON",1449.0,LONGFELLOW,,23658.0,P.O. BOX,DETROIT,...,0.0,0.0,0.0,0.0,,NO PAYMENT APPLIED,,,not responsible by disposition,


In [3]:
print(train_data[(train_data['compliance'] == 0) | (train_data['compliance'] == 1)].shape)
train_data[(train_data['compliance'] == 0) | (train_data['compliance'] == 1)].head(3)

(159880, 34)


Unnamed: 0,ticket_id,agency_name,inspector_name,violator_name,violation_street_number,violation_street_name,violation_zip_code,mailing_address_str_number,mailing_address_str_name,city,...,clean_up_cost,judgment_amount,payment_amount,balance_due,payment_date,payment_status,collection_status,grafitti_status,compliance_detail,compliance
0,22056,"Buildings, Safety Engineering & Env Department","Sims, Martinzie","INVESTMENT INC., MIDWEST MORTGAGE",2900.0,TYLER,,3.0,S. WICKER,CHICAGO,...,0.0,305.0,0.0,305.0,,NO PAYMENT APPLIED,,,non-compliant by no payment,0.0
1,27586,"Buildings, Safety Engineering & Env Department","Williams, Darrin","Michigan, Covenant House",4311.0,CENTRAL,,2959.0,Martin Luther King,Detroit,...,0.0,855.0,780.0,75.0,2005-06-02 00:00:00,PAID IN FULL,,,compliant by late payment within 1 month,1.0
5,22046,"Buildings, Safety Engineering & Env Department","Sims, Martinzie","KASIMU, UKWELI",6478.0,NORTHFIELD,,2755.0,E. 17TH,LOG BEACH,...,0.0,305.0,0.0,305.0,,NO PAYMENT APPLIED,,,non-compliant by no payment,0.0


In [4]:
train_data = train_data.dropna(subset=['compliance'])   # same rows as the boolean filter above
print(train_data.shape)
train_data.head(3)

(159880, 34)


Unnamed: 0,ticket_id,agency_name,inspector_name,violator_name,violation_street_number,violation_street_name,violation_zip_code,mailing_address_str_number,mailing_address_str_name,city,...,clean_up_cost,judgment_amount,payment_amount,balance_due,payment_date,payment_status,collection_status,grafitti_status,compliance_detail,compliance
0,22056,"Buildings, Safety Engineering & Env Department","Sims, Martinzie","INVESTMENT INC., MIDWEST MORTGAGE",2900.0,TYLER,,3.0,S. WICKER,CHICAGO,...,0.0,305.0,0.0,305.0,,NO PAYMENT APPLIED,,,non-compliant by no payment,0.0
1,27586,"Buildings, Safety Engineering & Env Department","Williams, Darrin","Michigan, Covenant House",4311.0,CENTRAL,,2959.0,Martin Luther King,Detroit,...,0.0,855.0,780.0,75.0,2005-06-02 00:00:00,PAID IN FULL,,,compliant by late payment within 1 month,1.0
5,22046,"Buildings, Safety Engineering & Env Department","Sims, Martinzie","KASIMU, UKWELI",6478.0,NORTHFIELD,,2755.0,E. 17TH,LOG BEACH,...,0.0,305.0,0.0,305.0,,NO PAYMENT APPLIED,,,non-compliant by no payment,0.0


### Check columns' type and missing values in train_data

In [5]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 159880 entries, 0 to 250293
Data columns (total 34 columns):
ticket_id                     159880 non-null int64
agency_name                   159880 non-null object
inspector_name                159880 non-null object
violator_name                 159854 non-null object
violation_street_number       159880 non-null float64
violation_street_name         159880 non-null object
violation_zip_code            0 non-null float64
mailing_address_str_number    157322 non-null float64
mailing_address_str_name      159877 non-null object
city                          159880 non-null object
state                         159796 non-null object
zip_code                      159879 non-null object
non_us_str_code               3 non-null object
country                       159880 non-null object
ticket_issued_date            159880 non-null object
hearing_date                  159653 non-null object
violation_code                159880 non-null obj

    A) Drop the null columns
    B) Convert the missing values null into 0 for float columns
    C) Convert the missing values null into "NA" for string object columns

    violator_name                 159854 non-null object              C
    violation_zip_code            0 non-null float64                  A
    mailing_address_str_number    157322 non-null float64             B
    mailing_address_str_name      159877 non-null object              C
    state                         159796 non-null object              C
    zip_code                      159879 non-null object              C   (an object column because some entries are
                                                                           malformed, so .astype(int) cannot be used)
    non_us_str_code               3 non-null object                   C
    hearing_date                  159653 non-null object              C
    payment_date                  39611 non-null object               C
    collection_status             36897 non-null object               C
    grafitti_status               0 non-null object                   A
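
The three rules above can be sketched as a small helper. This is a minimal illustration on a made-up frame, not the code used later in this notebook; the name `fill_missing` is hypothetical.

```python
import numpy as np
import pandas as pd

def fill_missing(df):
    """Apply rules A, B and C to a copy of df."""
    out = df.dropna(axis=1, how="all").copy()        # A) drop all-null columns
    num_cols = out.select_dtypes(include="number").columns
    other_cols = out.select_dtypes(exclude="number").columns
    out[num_cols] = out[num_cols].fillna(0)          # B) numeric (float) -> 0
    out[other_cols] = out[other_cols].fillna("NA")   # C) string -> "NA"
    return out

# Tiny made-up frame mimicking a few train.csv columns.
toy = pd.DataFrame({
    "violator_name": ["SMITH, A", None],
    "mailing_address_str_number": [3.0, np.nan],
    "violation_zip_code": [np.nan, np.nan],
})
cleaned = fill_missing(toy)
print(cleaned)
```

Selecting columns by dtype avoids maintaining a hand-written column list, at the cost of treating every non-numeric column the same way.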

### Explore test data

In [6]:
test_data = pd.read_csv('test.csv')
print(test_data.shape)
test_data.head()

(61001, 27)


Unnamed: 0,ticket_id,agency_name,inspector_name,violator_name,violation_street_number,violation_street_name,violation_zip_code,mailing_address_str_number,mailing_address_str_name,city,...,violation_description,disposition,fine_amount,admin_fee,state_fee,late_fee,discount_amount,clean_up_cost,judgment_amount,grafitti_status
0,284932,Department of Public Works,"Granberry, Aisha B","FLUELLEN, JOHN A",10041.0,ROSEBERRY,,141,ROSEBERRY,DETROIT,...,Failure to secure City or Private solid waste ...,Responsible by Default,200.0,20.0,10.0,20.0,0.0,0.0,250.0,
1,285362,Department of Public Works,"Lusk, Gertrina","WHIGHAM, THELMA",18520.0,EVERGREEN,,19136,GLASTONBURY,DETROIT,...,Allowing bulk solid waste to lie or accumulate...,Responsible by Default,1000.0,20.0,10.0,100.0,0.0,0.0,1130.0,
2,285361,Department of Public Works,"Lusk, Gertrina","WHIGHAM, THELMA",18520.0,EVERGREEN,,19136,GLASTONBURY,DETROIT,...,Improper placement of Courville container betw...,Responsible by Default,100.0,20.0,10.0,10.0,0.0,0.0,140.0,
3,285338,Department of Public Works,"Talbert, Reginald","HARABEDIEN, POPKIN",1835.0,CENTRAL,,2246,NELSON,WOODHAVEN,...,Allowing bulk solid waste to lie or accumulate...,Responsible by Default,200.0,20.0,10.0,20.0,0.0,0.0,250.0,
4,285346,Department of Public Works,"Talbert, Reginald","CORBELL, STANLEY",1700.0,CENTRAL,,3435,MUNGER,LIVONIA,...,Violation of time limit for approved container...,Responsible by Default,100.0,20.0,10.0,10.0,0.0,0.0,140.0,


In [7]:
# As we only consider Detroit city

print(test_data[test_data['city'].str.upper() == 'DETROIT'].shape)
test_data[test_data['city'].str.upper() == 'DETROIT'].head(3)

(30981, 27)


Unnamed: 0,ticket_id,agency_name,inspector_name,violator_name,violation_street_number,violation_street_name,violation_zip_code,mailing_address_str_number,mailing_address_str_name,city,...,violation_description,disposition,fine_amount,admin_fee,state_fee,late_fee,discount_amount,clean_up_cost,judgment_amount,grafitti_status
0,284932,Department of Public Works,"Granberry, Aisha B","FLUELLEN, JOHN A",10041.0,ROSEBERRY,,141,ROSEBERRY,DETROIT,...,Failure to secure City or Private solid waste ...,Responsible by Default,200.0,20.0,10.0,20.0,0.0,0.0,250.0,
1,285362,Department of Public Works,"Lusk, Gertrina","WHIGHAM, THELMA",18520.0,EVERGREEN,,19136,GLASTONBURY,DETROIT,...,Allowing bulk solid waste to lie or accumulate...,Responsible by Default,1000.0,20.0,10.0,100.0,0.0,0.0,1130.0,
2,285361,Department of Public Works,"Lusk, Gertrina","WHIGHAM, THELMA",18520.0,EVERGREEN,,19136,GLASTONBURY,DETROIT,...,Improper placement of Courville container betw...,Responsible by Default,100.0,20.0,10.0,10.0,0.0,0.0,140.0,


### Check columns' type and missing values in test_data

In [11]:
test_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 61001 entries, 0 to 61000
Data columns (total 27 columns):
ticket_id                     61001 non-null int64
agency_name                   61001 non-null object
inspector_name                61001 non-null object
violator_name                 60973 non-null object
violation_street_number       61001 non-null float64
violation_street_name         61001 non-null object
violation_zip_code            24024 non-null object
mailing_address_str_number    59987 non-null object
mailing_address_str_name      60998 non-null object
city                          61000 non-null object
state                         60670 non-null object
zip_code                      60998 non-null object
non_us_str_code               0 non-null float64
country                       61001 non-null object
ticket_issued_date            61001 non-null object
hearing_date                  58804 non-null object
violation_code                61001 non-null object
violation_

    A) Drop the null columns
    B) Convert the missing values null into 0 for float columns
    C) Convert the missing values null into "NA" for string object columns
    
    
    In test_data:
    violator_name                 60973 non-null object               C
    violation_zip_code            24024 non-null object               C
    mailing_address_str_number    59987 non-null object               C->B   (numeric in train_data; object here only
                                                                              because of malformed entries)
    mailing_address_str_name      60998 non-null object               C
    city                          61000 non-null object               C
    state                         60670 non-null object               C
    zip_code                      60998 non-null object               C
    non_us_str_code               0 non-null float64                  A->B
    hearing_date                  58804 non-null object               C
    grafitti_status               2221 non-null object                C
    
    
    In train_data:
    violator_name                 159854 non-null object              C
    violation_zip_code            0 non-null float64                  A->C
    mailing_address_str_number    157322 non-null float64             B
    mailing_address_str_name      159877 non-null object              C
    state                         159796 non-null object              C
    zip_code                      159879 non-null object              C   (an object column because some entries are
                                                                           malformed, so .astype(int) cannot be used)
    non_us_str_code               3 non-null object                   C
    hearing_date                  159653 non-null object              C
    payment_date                  39611 non-null object               C
    collection_status             36897 non-null object               C
    grafitti_status               0 non-null object                   A->C
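
When a column is numeric in one file and object in the other, one option is to coerce it to numbers before filling. A minimal sketch on made-up values follows; whether malformed entries should become 0 is an assumption that mirrors rule B.

```python
import pandas as pd

# Made-up values mimicking test.csv, where stray text makes the column dtype object.
raw = pd.Series(["141", "19136", "P.O.", None], name="mailing_address_str_number")

# errors="coerce" turns unparseable entries into NaN instead of raising an error.
numeric = pd.to_numeric(raw, errors="coerce").fillna(0)
print(numeric.tolist())
```

After this step the column has the same numeric type in both frames, so the same fill rule can be applied to each.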

### Merge train_data & test_data with address

In [7]:
address = pd.read_csv('addresses.csv')
print(address.shape)
address.head(3)

(311307, 2)


Unnamed: 0,ticket_id,address
0,22056,"2900 tyler, Detroit MI"
1,27586,"4311 central, Detroit MI"
2,22062,"1449 longfellow, Detroit MI"


In [8]:
latlons = pd.read_csv('latlons.csv')
print(latlons.shape)
latlons.head(3)

(121769, 3)


Unnamed: 0,address,lat,lon
0,"4300 rosa parks blvd, Detroit MI 48208",42.346169,-83.079962
1,"14512 sussex, Detroit MI",42.394657,-83.194265
2,"3456 garland, Detroit MI",42.373779,-82.986228


In [9]:
train_data = pd.merge(train_data, address, how='inner', left_on='ticket_id', right_on='ticket_id')
print(train_data.shape)
train_data.head(3)

(159880, 35)


Unnamed: 0,ticket_id,agency_name,inspector_name,violator_name,violation_street_number,violation_street_name,violation_zip_code,mailing_address_str_number,mailing_address_str_name,city,...,judgment_amount,payment_amount,balance_due,payment_date,payment_status,collection_status,grafitti_status,compliance_detail,compliance,address
0,22056,"Buildings, Safety Engineering & Env Department","Sims, Martinzie","INVESTMENT INC., MIDWEST MORTGAGE",2900.0,TYLER,,3.0,S. WICKER,CHICAGO,...,305.0,0.0,305.0,,NO PAYMENT APPLIED,,,non-compliant by no payment,0.0,"2900 tyler, Detroit MI"
1,27586,"Buildings, Safety Engineering & Env Department","Williams, Darrin","Michigan, Covenant House",4311.0,CENTRAL,,2959.0,Martin Luther King,Detroit,...,855.0,780.0,75.0,2005-06-02 00:00:00,PAID IN FULL,,,compliant by late payment within 1 month,1.0,"4311 central, Detroit MI"
2,22046,"Buildings, Safety Engineering & Env Department","Sims, Martinzie","KASIMU, UKWELI",6478.0,NORTHFIELD,,2755.0,E. 17TH,LOG BEACH,...,305.0,0.0,305.0,,NO PAYMENT APPLIED,,,non-compliant by no payment,0.0,"6478 northfield, Detroit MI"


In [10]:
test_data = pd.merge(test_data, address, how='inner', left_on='ticket_id', right_on='ticket_id')
print(test_data.shape)
test_data.head(3)

(61001, 28)


Unnamed: 0,ticket_id,agency_name,inspector_name,violator_name,violation_street_number,violation_street_name,violation_zip_code,mailing_address_str_number,mailing_address_str_name,city,...,disposition,fine_amount,admin_fee,state_fee,late_fee,discount_amount,clean_up_cost,judgment_amount,grafitti_status,address
0,284932,Department of Public Works,"Granberry, Aisha B","FLUELLEN, JOHN A",10041.0,ROSEBERRY,,141,ROSEBERRY,DETROIT,...,Responsible by Default,200.0,20.0,10.0,20.0,0.0,0.0,250.0,,"10041 roseberry, Detroit MI"
1,285362,Department of Public Works,"Lusk, Gertrina","WHIGHAM, THELMA",18520.0,EVERGREEN,,19136,GLASTONBURY,DETROIT,...,Responsible by Default,1000.0,20.0,10.0,100.0,0.0,0.0,1130.0,,"18520 evergreen, Detroit MI"
2,285361,Department of Public Works,"Lusk, Gertrina","WHIGHAM, THELMA",18520.0,EVERGREEN,,19136,GLASTONBURY,DETROIT,...,Responsible by Default,100.0,20.0,10.0,10.0,0.0,0.0,140.0,,"18520 evergreen, Detroit MI"
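
If geo-coordinates are wanted as features, the address merge can be chained with a second merge against latlons. A minimal sketch on made-up miniature tables (the coordinates are invented):

```python
import pandas as pd

# Made-up miniature versions of the three tables.
tickets = pd.DataFrame({"ticket_id": [1, 2, 3],
                        "judgment_amount": [305.0, 855.0, 140.0]})
address = pd.DataFrame({"ticket_id": [1, 2, 3],
                        "address": ["2900 tyler, Detroit MI",
                                    "4311 central, Detroit MI",
                                    "1 unknown st, Detroit MI"]})
latlons = pd.DataFrame({"address": ["2900 tyler, Detroit MI",
                                    "4311 central, Detroit MI"],
                        "lat": [42.39, 42.33],
                        "lon": [-83.12, -83.15]})

# how="left" keeps every ticket even when an address has no coordinates.
merged = (tickets
          .merge(address, on="ticket_id", how="left")
          .merge(latlons, on="address", how="left"))
print(merged)
```

With `how='inner'`, tickets whose address is missing from latlons would be dropped, which would shorten the test set below the required 61001 rows.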


## Answer

### Feature Selection

In [1]:
import pandas as pd
import numpy as np

# !!--- load data files ---!!
train_data = pd.read_csv('train.csv', encoding = "ISO-8859-1")                                 #(250306, 34)
test_data = pd.read_csv('test.csv', encoding = "ISO-8859-1")   #X_test                         #(61001, 27)
#address & latlons are not relevant here



# !!--- Prepare X_train, y_train, X_test ---!!
X_train = train_data.dropna(subset=['compliance'])   #dropna in compliance column              #(159880, 34)
y_train = X_train['compliance'].astype(int)   #compliance column converted to int without na   #(159880,)
X_test = test_data                                                                             #(61001, 27)
#y_prob still needs to be predicted (objective)   #y_proba_cls = cls.fit(X_train, y_train).predict_proba(X_test)



# !!--- Convert missing values Nan to 0 for int and "NA" for string in X_train & X_test ---!!
# !-- X_train --!
X_train_converted_na = X_train.copy()

convert_nan_to_NA_X_train = [
    'violator_name',
    'violation_zip_code',
    'mailing_address_str_name',
    'state',
    'zip_code',
    'non_us_str_code',
    'hearing_date',
    'grafitti_status',
    'payment_date',
    'collection_status',
    'mailing_address_str_number'
]

X_train_converted_na[convert_nan_to_NA_X_train] = X_train[convert_nan_to_NA_X_train].fillna('NA')

# !-- X_test --!
X_test_converted_na = X_test.copy()

convert_nan_to_NA_X_test = [
    'violator_name',
    'violation_zip_code',
    'mailing_address_str_name',
    'city',
    'state',
    'zip_code',
    'non_us_str_code',
    'hearing_date',
    'grafitti_status',
    'mailing_address_str_number'
]

X_test_converted_na[convert_nan_to_NA_X_test] = X_test[convert_nan_to_NA_X_test].fillna('NA')

# X_train_converted_na                                                                          #(159880, 34)
# X_test_converted_na                                                                           #(61001, 27)



# !!--- Drop irrelevant features ---!!
# !-- feature drop for X_train & X_test both --!
feature_drop = [
    'agency_name',                #irrelevant
    'inspector_name',             #irrelevant
    'violator_name',              #irrelevant
    'violation_street_number',    #irrelevant
    'violation_street_name',      #irrelevant
    'violation_zip_code',         #irrelevant
    'violation_description',      #irrelevant
    'violation_code',             #irrelevant
    'mailing_address_str_number', #we have city
    'mailing_address_str_name',   #we have city
    'zip_code',                   #we have city
    'state',                      #we have city
    'hearing_date',               #irrelevant
    'ticket_issued_date',         #irrelevant
    'grafitti_status',            #too few data in X_train
    'country',                    #only 3 rows are not USA
    'non_us_str_code',            #too few data in X_train & X_test
    'clean_up_cost',              #all 0 in X_train
    'fine_amount',                #already included in judgment_amount
    'admin_fee',                  #already included in judgment_amount
    'state_fee',                  #already included in judgment_amount
    'late_fee'                    #already included in judgment_amount
]

# !-- additional feature drop for X_train --!
feature_drop_X_train = [
    'balance_due',                #data leakage
    'payment_amount',             #data leakage
    'payment_date',               #data leakage
    'payment_status',             #data leakage
    'compliance_detail',          #data leakage
    'compliance',                 #data leakage
    'collection_status'           #data leakage
]

X_train_converted_na_dropped = X_train_converted_na.drop(feature_drop, axis=1)
X_train_converted_na_dropped = X_train_converted_na_dropped.drop(feature_drop_X_train, axis=1)    #(159880, 5)
X_test_converted_na_dropped = X_test_converted_na.drop(feature_drop, axis=1)                      #(61001, 5)

# !-- Set index = ticket_id --!
X_train_converted_na_dropped= X_train_converted_na_dropped.set_index('ticket_id')                 #(159880, 4)
X_test_converted_na_dropped = X_test_converted_na_dropped.set_index('ticket_id')                  #(61001, 4)



# !!--- Convert String columns to category columns and then convert them to codes ---!! (Or use one-hot encoding)
X_train_category = X_train_converted_na_dropped.copy()
X_test_category = X_test_converted_na_dropped.copy()

category_column = ['city', 'disposition']

# Note: codes are assigned independently per frame, so the same string may map to
# different codes in X_train and X_test.
for df in [X_train_category, X_test_category]:
    for column in category_column:
        df[column] = df[column].astype('category').cat.codes



# !!--- Final dataset ---!!
X_train_category                                                                                  #(159880, 4)
# y_train                                                                                         #(159880,)
# X_test_category                                                                                 #(61001, 4)
#y_prob still needs to be predicted (objective)   #y_proba_cls = cls.fit(X_train, y_train).predict_proba(X_test)



ticket_id,city,disposition,discount_amount,judgment_amount
22056,543,2,0.0,305.0
27586,952,3,0.0,855.0
22046,1930,2,0.0,305.0
18738,725,2,0.0,855.0
18735,952,2,0.0,140.0
18733,952,2,0.0,140.0
28204,952,2,0.0,855.0
18743,952,2,0.0,855.0
18741,952,2,0.0,855.0
18978,952,2,0.0,855.0
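
One caveat with calling `.astype('category').cat.codes` on each frame separately is that the codes depend on the set of values present in that frame. The sketch below, on made-up city values, contrasts this with an encoding whose category list is fixed from the training data. It is one possible alternative, not the approach used in this notebook.

```python
import pandas as pd

# Made-up city values; "LIVONIA" appears only in the test frame.
train = pd.DataFrame({"city": ["DETROIT", "CHICAGO", "DETROIT"]})
test = pd.DataFrame({"city": ["LIVONIA", "DETROIT"]})

# Independent encoding: the same string can get different codes.
train_sep = train["city"].astype("category").cat.codes
test_sep = test["city"].astype("category").cat.codes

# Shared encoding: fix the category list from the training data.
categories = sorted(train["city"].unique())
shared_dtype = pd.CategoricalDtype(categories=categories)
train_shared = train["city"].astype(shared_dtype).cat.codes
test_shared = test["city"].astype(shared_dtype).cat.codes   # unseen values become -1

print(train_sep.tolist(), test_sep.tolist())
print(train_shared.tolist(), test_shared.tolist())
```

In the independent encoding, DETROIT is 1 in the training frame but 0 in the test frame. In the shared encoding it is 1 in both, and the unseen LIVONIA is marked as -1.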


In [33]:
# X_train_category                                                                                #(159880, 4)
# y_train                                                                                         #(159880,)
# X_test_category                                                                                 #(61001, 4) 

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV            #GridSearchCV(clf, param_grid=grid_vals, scoring='roc_auc')

grid_values = {'learning_rate': [0.01, 0.1, 1], 'max_depth': [3, 4, 5]}

clf = GradientBoostingClassifier(random_state = 0)
grid_clf = GridSearchCV(clf, param_grid = grid_values, scoring = 'roc_auc')
grid_clf.fit(X_train_category , y_train)
y_prob = grid_clf.predict_proba(X_test_category)

print('Grid best parameter (max. auc): ', grid_clf.best_params_)
print('Grid best score (auc): ', grid_clf.best_score_)

print(y_prob.shape)
y_prob    # column 0: P(class 0, non-compliant); column 1: P(class 1, compliant)



Grid best parameter (max. auc):  {'learning_rate': 0.1, 'max_depth': 5}
Grid best score (auc):  0.7883516676312213
(61001, 2)


array([[0.78561743, 0.21438257],
       [0.85804491, 0.14195509],
       [0.78801247, 0.21198753],
       ...,
       [0.78801247, 0.21198753],
       [0.78801247, 0.21198753],
       [0.06946753, 0.93053247]])

In [31]:
pd.Series(y_prob[:, 1], index=X_test_category.index)

ticket_id
284932    0.214383
285362    0.141955
285361    0.211988
285338    0.283439
285346    0.216399
285345    0.217970
285347    0.211526
285342    0.859405
285530    0.123014
284989    0.171205
285344    0.361160
285343    0.245174
285340    0.141955
285341    0.282280
285349    0.186782
285348    0.188960
284991    0.169798
285532    0.165578
285406    0.165578
285001    0.187882
285006    0.134986
285405    0.171800
285337    0.175807
285496    0.222116
285497    0.228790
285378    0.171800
285589    0.236852
285585    0.214383
285501    0.211988
285581    0.141955
            ...   
376367    0.165578
376366    0.195294
376362    0.214995
376363    0.232496
376365    0.165578
376364    0.195294
376228    0.247343
376265    0.176471
376286    0.886243
376320    0.247343
376314    0.195294
376327    0.902828
376385    0.902828
376435    0.211792
376370    0.902828
376434    0.282280
376459    0.211988
376478    0.050572
376473    0.212629
376484    0.190800
376482    0.227645
37
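
The assignment's example output is a series named compliance with dtype float32, indexed by ticket_id. A minimal sketch of building a series in that shape, using small stand-in arrays rather than the real predictions:

```python
import numpy as np
import pandas as pd

# Stand-ins for the notebook's objects: a two-column predict_proba output
# and the ticket_id index of the test features.
y_prob_demo = np.array([[0.79, 0.21], [0.86, 0.14], [0.07, 0.93]])
ticket_ids = pd.Index([284932, 285362, 369851], name="ticket_id")

# Column 1 holds the probability of class 1 (compliant).
answer = pd.Series(y_prob_demo[:, 1], index=ticket_ids,
                   name="compliance", dtype="float32")
print(answer)
```

With the real objects, `y_prob` and `X_test_category.index` would take the place of the stand-ins.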

## Try normalization
#### The result is nearly identical (best AUC of about 0.7884 with and without scaling)

In [2]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train_category)     #fit the scaler on X_train, then transform it
X_test_scaled = scaler.transform(X_test_category)           #only transform X_test, reusing the training statistics

In [3]:
X_train_scaled

array([[0.13269795, 0.66666667, 0.        , 0.02765186],
       [0.23264907, 1.        , 0.        , 0.07751587],
       [0.471652  , 0.66666667, 0.        , 0.02765186],
       ...,
       [0.21334311, 0.66666667, 0.        , 0.05258386],
       [0.21334311, 1.        , 0.        , 0.02085222],
       [0.13318671, 0.66666667, 0.        , 0.02266546]])
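
What the scaler computes per column can be written out by hand as (x - min) / (max - min), with min and max taken from the training data only. A minimal sketch on made-up judgment amounts:

```python
import numpy as np

# Made-up judgment amounts for illustration.
train_col = np.array([140.0, 305.0, 855.0])
test_col = np.array([250.0, 1130.0])

# The statistics come from the training data only.
lo, hi = train_col.min(), train_col.max()
train_scaled = (train_col - lo) / (hi - lo)
test_scaled = (test_col - lo) / (hi - lo)   # may fall outside [0, 1]
print(train_scaled, test_scaled)
```

The test value 1130 exceeds the training maximum, so it scales to a number above 1. That is expected when the test set is transformed with training statistics.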

In [35]:
# X_train_scaled                                                                                  #(159880, 4)
# y_train                                                                                         #(159880,)
# X_test_scaled                                                                                   #(61001, 4)

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV            #GridSearchCV(clf, param_grid=grid_vals, scoring='roc_auc')

grid_values = {'learning_rate': [0.01, 0.1, 1], 'max_depth': [3, 4, 5]}

clf = GradientBoostingClassifier(random_state = 0)
grid_clf = GridSearchCV(clf, param_grid = grid_values, scoring = 'roc_auc')
grid_clf.fit(X_train_scaled, y_train)
y_prob1 = grid_clf.predict_proba(X_test_scaled)

print('Grid best parameter (max. auc): ', grid_clf.best_params_)
print('Grid best score (auc): ', grid_clf.best_score_)

print(y_prob1.shape)
y_prob1    # column 0: P(class 0, non-compliant); column 1: P(class 1, compliant)



Grid best parameter (max. auc):  {'learning_rate': 0.1, 'max_depth': 5}
Grid best score (auc):  0.788352365853119
(61001, 2)


array([[0.78561743, 0.21438257],
       [0.85804491, 0.14195509],
       [0.78801247, 0.21198753],
       ...,
       [0.78801247, 0.21198753],
       [0.78801247, 0.21198753],
       [0.06946753, 0.93053247]])

In [37]:
pd.Series(y_prob1[:, 1], index=X_test_category.index)

ticket_id
284932    0.214383
285362    0.141955
285361    0.211988
285338    0.283439
285346    0.216399
285345    0.217970
285347    0.211526
285342    0.859405
285530    0.123014
284989    0.171205
285344    0.361160
285343    0.245174
285340    0.141955
285341    0.282280
285349    0.186782
285348    0.188960
284991    0.169798
285532    0.165578
285406    0.165578
285001    0.187882
285006    0.134986
285405    0.171800
285337    0.175807
285496    0.222116
285497    0.228790
285378    0.171800
285589    0.236852
285585    0.214383
285501    0.211988
285581    0.141955
            ...   
376367    0.165578
376366    0.195294
376362    0.214995
376363    0.232496
376365    0.165578
376364    0.195294
376228    0.247343
376265    0.176471
376286    0.886243
376320    0.247343
376314    0.195294
376327    0.902828
376385    0.902828
376435    0.211792
376370    0.902828
376434    0.282280
376459    0.211988
376478    0.050572
376473    0.212629
376484    0.190800
376482    0.227645
37

## Try Neural networks (Classification)

In [10]:
# X_train_category                                                                                #(159880, 4)
# y_train                                                                                         #(159880,)
# X_test_category                                                                                 #(61001, 4) 

from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV            #GridSearchCV(clf, param_grid=grid_vals, scoring='roc_auc')

grid_values = {'alpha': [0.0001,0.001,0.01]}

clf = MLPClassifier(hidden_layer_sizes = [100, 10], solver='lbfgs', random_state = 0)
grid_clf = GridSearchCV(clf, param_grid = grid_values, scoring = 'roc_auc')
grid_clf.fit(X_train_category, y_train)
y_prob2 = grid_clf.predict_proba(X_test_category)

print('Grid best parameter (max. auc): ', grid_clf.best_params_)
print('Grid best score (auc): ', grid_clf.best_score_)

print(y_prob2.shape)
y_prob2    # column 0: P(class 0, non-compliant); column 1: P(class 1, compliant)



Grid best parameter (max. auc):  {'alpha': 0.0001}
Grid best score (auc):  0.4935535765483143
(61001, 2)


array([[4.45107889e-01, 5.54892111e-01],
       [1.00000000e+00, 1.30861313e-16],
       [4.45107889e-01, 5.54892111e-01],
       ...,
       [4.45107889e-01, 5.54892111e-01],
       [4.45107889e-01, 5.54892111e-01],
       [5.82003192e-01, 4.17996808e-01]])

In [11]:
pd.Series(y_prob2[:, 1], index=X_test_category.index)

ticket_id
284932    5.548921e-01
285362    1.308613e-16
285361    5.548921e-01
285338    5.022697e-02
285346    2.201538e-01
285345    5.321298e-01
285347    1.289112e-01
285342    1.582729e-01
285530    6.737090e-11
284989    5.548921e-01
285344    5.548921e-01
285343    3.039150e-19
285340    1.308613e-16
285341    4.323454e-01
285349    5.548921e-01
285348    5.548921e-01
284991    5.548921e-01
285532    3.417296e-04
285406    3.417296e-04
285001    5.548921e-01
285006    4.468020e-09
285405    5.548921e-01
285337    5.548921e-01
285496    3.353958e-02
285497    1.941348e-01
285378    5.548921e-01
285589    5.548921e-01
285585    5.548921e-01
285501    5.548921e-01
285581    1.308613e-16
              ...     
376367    6.256851e-10
376366    5.548921e-01
376362    4.371196e-01
376363    2.724578e-01
376365    6.256851e-10
376364    5.548921e-01
376228    5.548921e-01
376265    3.771204e-05
376286    1.654424e-01
376320    5.548921e-01
376314    5.548921e-01
376327    5.548921e-01
3

## Neural networks (with normalization)

In [8]:
# X_train_scaled                                                                                  #(159880, 4)
# y_train                                                                                         #(159880,)
# X_test_scaled                                                                                   #(61001, 4)

from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV            #GridSearchCV(clf, param_grid=grid_vals, scoring='roc_auc')

grid_values = {'alpha': [1,10,100]}

clf = MLPClassifier(hidden_layer_sizes = [100, 10], solver='lbfgs', random_state = 0)
grid_clf = GridSearchCV(clf, param_grid = grid_values, scoring = 'roc_auc')
grid_clf.fit(X_train_scaled, y_train)
y_prob3 = grid_clf.predict_proba(X_test_scaled)

print('Grid best parameter (max. auc): ', grid_clf.best_params_)
print('Grid best score (auc): ', grid_clf.best_score_)

print(y_prob3.shape)
y_prob3    # column 0: P(class 0, non-compliant); column 1: P(class 1, compliant)



Grid best parameter (max. auc):  {'alpha': 10}
Grid best score (auc):  0.7717153009433458
(61001, 2)


array([[0.0208386 , 0.9791614 ],
       [0.04587422, 0.95412578],
       [0.02083024, 0.97916976],
       ...,
       [0.02083024, 0.97916976],
       [0.02083024, 0.97916976],
       [0.02082568, 0.97917432]])

In [9]:
pd.Series(y_prob3[:, 1], index=X_test_category.index)

ticket_id
284932    0.979161
285362    0.954126
285361    0.979170
285338    0.980814
285346    0.979773
285345    0.979765
285347    0.979777
285342    0.982292
285530    0.954928
284989    0.977977
285344    0.978922
285343    0.953334
285340    0.954126
285341    0.979174
285349    0.979114
285348    0.979105
284991    0.978259
285532    0.977643
285406    0.977643
285001    0.978188
285006    0.955218
285405    0.957547
285337    0.978043
285496    0.980297
285497    0.980285
285378    0.957909
285589    0.979381
285585    0.979161
285501    0.979170
285581    0.954126
            ...   
376367    0.967905
376366    0.979157
376362    0.982141
376363    0.982145
376365    0.967905
376364    0.979157
376228    0.979210
376265    0.978737
376286    0.857146
376320    0.979210
376314    0.979157
376327    0.846462
376385    0.846462
376435    0.981288
376370    0.981322
376434    0.979174
376459    0.979170
376478    0.051663
376473    0.980600
376484    0.979192
376482    0.977176
37
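
A possible refinement: when the scaler is fitted on the full training set before the grid search, each cross-validation fold has already seen statistics from its own validation part. Putting the scaler and the classifier into a `Pipeline` lets `GridSearchCV` refit the scaler inside every fold. The sketch below uses synthetic data and a smaller network so that it runs quickly; the sizes and grid values are illustrative assumptions, not tuned settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data; the notebook would pass X_train_category and y_train.
X_demo, y_demo = make_classification(n_samples=300, n_features=4, random_state=0)

pipe = Pipeline([
    ("scale", MinMaxScaler()),   # refitted on the training part of every fold
    ("mlp", MLPClassifier(hidden_layer_sizes=[10], solver="lbfgs",
                          max_iter=500, random_state=0)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>.
grid = GridSearchCV(pipe, param_grid={"mlp__alpha": [1, 10]},
                    scoring="roc_auc", cv=3)
grid.fit(X_demo, y_demo)
print(grid.best_params_, grid.best_score_)
```

Calling `grid.predict_proba` on unscaled test features then applies the fitted scaler automatically, so no separate `X_test_scaled` is needed.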