***Sam Cressman Capstone Project: Shelter Animal Outcomes***

***Help improve outcomes for shelter animals***

***Capstone inspiration:*** [Kaggle](https://www.kaggle.com/c/shelter-animal-outcomes)

***Modeling Notebook***

In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import to_categorical

Using TensorFlow backend.


In [2]:
animals = pd.read_csv("./cleaned_animals_with_dummies")

In [3]:
# outcomes_dict = {"Transfer": 0, "Adoption": 1, "Return to Owner": 2, 
#                "Euthanasia": 3, "Rto-Adopt": 4, "Disposal": 5, "Died": 6,
#                "Missing": 7, "Relocate": 8}

animals["Outcome Type"].value_counts()

1.0    31113
0.0    24099
2.0    12658
3.0     6334
6.0      730
5.0      318
4.0      169
7.0       45
8.0       16
Name: Outcome Type, dtype: int64

In [4]:
# Binary target: 1 = live outcome, 0 = otherwise. The column already holds the numeric
# codes from outcomes_dict above, so the keys here are those codes rather than the labels.
test_dict = {0: 1, 1: 1, 2: 1, 3: 0, 4: 1, 5: 0,
             6: 0, 7: 0, 8: 1}

animals["Outcome Type"] = animals["Outcome Type"].map(test_dict)

In [20]:
# Baseline: class shares of the binary target (always predicting 1 would score about 0.90)

animals["Outcome Type"].value_counts(normalize = True)

1    0.901606
0    0.098394
Name: Outcome Type, dtype: float64
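
The baseline above can be cross-checked from the value counts printed earlier. The following is a minimal sketch in plain Python, assuming the numeric codes follow the commented-out `outcomes_dict` and that the outcomes mapped to 1 in `test_dict` are the live outcomes; the names `outcome_counts` and `live_codes` are illustrative and not part of the notebook.

```python
# Counts copied from the value_counts() output earlier in the notebook,
# keyed by the numeric code in outcomes_dict.
outcome_counts = {
    0: 24099,  # Transfer
    1: 31113,  # Adoption
    2: 12658,  # Return to Owner
    3: 6334,   # Euthanasia
    4: 169,    # Rto-Adopt
    5: 318,    # Disposal
    6: 730,    # Died
    7: 45,     # Missing
    8: 16,     # Relocate
}

# Codes that test_dict maps to 1 (assumed here to mean a live outcome).
live_codes = {0, 1, 2, 4, 8}

total = sum(outcome_counts.values())
live = sum(n for code, n in outcome_counts.items() if code in live_codes)

# Share of the majority class: the accuracy of always predicting 1.
baseline_accuracy = live / total
print(f"{baseline_accuracy:.6f}")
```

Any model in this notebook has to beat this majority-class share to be more useful than always predicting class 1.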

***Setting X, y, features***

In [8]:
# Excluded from the features: DateTime columns, the target (Outcome Type), Outcome Subtype
# (many nulls; kept only for EDA/visualization), Breed (replaced by manually added columns),
# and Color (bucketed, with the buckets concatenated back onto animals)

features_to_disregard = ["Intake Time", "Outcome Time", "Date of Birth",
                         "Outcome Type", "Outcome Subtype", "Breed", "Color"]

In [9]:
features = [feat for feat in animals.columns if feat not in features_to_disregard]

X = animals[features]
y = animals["Outcome Type"]

X_train, X_test, y_train, y_test = train_test_split(X, y)

***Logistic Regression Model***

In [18]:
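# Accuracy score noted from an earlier run (does not match the scores below): 0.7196227014996556

lr = LogisticRegression()

# Fit the scaler on the training split only, then apply it to both splits
ss = StandardScaler()
ss.fit(X_train)
X_train = ss.transform(X_train)
X_test = ss.transform(X_test)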

model = lr.fit(X_train, y_train)

model.score(X_train, y_train)

0.9582413311900514

In [19]:
model.score(X_test, y_test)

0.9557522123893806

In [12]:
model.predict(X_test)

array([1, 1, 1, ..., 0, 1, 1])

In [13]:
predictions = model.predict(X_test)

In [14]:
# Creating DataFrame to view coefficient values

coef_df = pd.DataFrame({
        "coef": lr.coef_[0],
        "feature": features
    })

In [15]:
# Exponentiate the coefficients so they can be read as odds ratios
coef_df["exponential_value"] = np.exp(coef_df["coef"])

In [16]:
coef_df.sort_values("coef", ascending=False).head(20)

       coef       feature                            exponential_value
0      0.561431   has_name                           1.75318
134    0.444501   Sex upon Outcome_Spayed Female     1.559712
133    0.389782   Sex upon Outcome_Neutered Male     1.476659
18     0.353037   retriever                          1.423384
116    0.30005    Intake Condition_Normal            1.349927
121    0.286665   Sex upon Intake_Intact Female      1.331978
111    0.238911   Intake Type_Stray                  1.269865
40     0.232016   ridgeback                          1.26114
122    0.22723    Sex upon Intake_Intact Male        1.255119
14     0.183788   black tan                          1.201762


In [17]:
coef_df.sort_values("coef", ascending=False).tail(20)

       coef        feature                             exponential_value
8      -0.141602   staffordshire                       0.867966
3      -0.148212   Length of Time In Shelter (Days)    0.862249
135    -0.156273   Sex upon Outcome_Unknown            0.855326
125    -0.156273   Sex upon Intake_Unknown             0.855326
186    -0.16643    Intake Year_2013                    0.846682
108    -0.184594   Intake Type_Euthanasia Request      0.831442
1      -0.192931   Age at Intake (Years)               0.824539
120    -0.201884   Intake Condition_Sick               0.81719
2      -0.204923   Age at Outcome (Years)              0.81471
124    -0.276507   Sex upon Intake_Spayed Female       0.758428

***GridSearch Logistic Regression Model***

In [10]:
# gs_params = {
#     "penalty": ["l1", "l2"],
#     "solver": ["liblinear"],
#     "C": [0.1 , 1]
# }

# lr_gridsearch = GridSearchCV(LogisticRegression(), gs_params)

# lr_gridsearch_model = lr_gridsearch.fit(X_train, y_train)

# print(lr_gridsearch_model.best_score_)

# print(lr_gridsearch_model.best_params_)

0.7242938651498825
{'C': 1, 'penalty': 'l2', 'solver': 'liblinear'}


In [12]:
# GridSearch Results (commenting out due to run time):

# 0.7242938651498825
# {'C': 1, 'penalty': 'l2', 'solver': 'liblinear'}

***Random Forest***

In [21]:
rf = RandomForestClassifier()

rf.fit(X_train, y_train)

rf.score(X_test, y_test)

0.9593026336707117

***Neural Network***

In a single line: a neural network iteratively adjusts one or more sets of weights so that, used together, they return the most accurate predictions for a set of inputs. Training is driven by a loss function, which the model attempts to minimize over successive iterations.
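
To make the idea of minimizing a loss over iterations concrete, here is a minimal sketch of gradient descent on a single sigmoid unit in plain Python. The toy data, the learning rate, and the names (`xs`, `ys`, `log_loss`) are made up for illustration; this is not the Keras model used below, which trains many weights with the Adam optimizer.

```python
import math

# Toy, made-up data: one input feature and a binary label.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(w, b):
    """Mean binary cross-entropy of the unit with weight w and bias b."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(xs)

w, b = 0.0, 0.0          # start from zero weights
learning_rate = 0.5
losses = [log_loss(w, b)]

for _ in range(20):
    # Gradient of the mean cross-entropy with respect to w and b.
    grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, ys)) / len(xs)
    grad_b = sum((sigmoid(w * x + b) - y) for x, y in zip(xs, ys)) / len(xs)
    # Step against the gradient to reduce the loss.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b
    losses.append(log_loss(w, b))

print(f"loss before training: {losses[0]:.4f}")
print(f"loss after training:  {losses[-1]:.4f}")
```

Each pass plays the role of one training iteration: predictions are made, the loss is measured, and the weights are nudged in the direction that lowers it.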

In [22]:
X_train, X_test, y_train, y_test = train_test_split(X, y)

ss = StandardScaler()
X_train = ss.fit_transform(X_train)
X_test = ss.transform(X_test)
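
The scaler is fit on the training split only and then applied to both splits, so no information from the test rows leaks into the scaling. Here is a minimal sketch of the same arithmetic in plain Python for a single made-up feature column. `train_col` and `test_col` are hypothetical values, and the population standard deviation is used to mirror the default behavior of `StandardScaler`.

```python
import statistics

# Made-up values for one feature, already split into train and test.
train_col = [1.0, 2.0, 3.0, 4.0, 5.0]
test_col = [2.0, 6.0]

# "fit": learn the mean and (population) standard deviation from train only.
mean = statistics.mean(train_col)
std = statistics.pstdev(train_col)

# "transform": apply the training statistics to both splits.
train_scaled = [(v - mean) / std for v in train_col]
test_scaled = [(v - mean) / std for v in test_col]

print(train_scaled)
print(test_scaled)
```

The scaled training column has mean 0 and standard deviation 1 by construction; the test column generally does not, because it is scaled with the training statistics.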

In [23]:
# One hot encoding target

y_train = to_categorical(y_train)
y_test = to_categorical(y_test)
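
`to_categorical` turns each integer label into a row with a 1 in that label's position, which is the shape the two-unit softmax output layer expects. Here is a minimal sketch of the same transformation in plain Python; the helper `one_hot` is hypothetical and only mimics the idea (Keras returns a NumPy array rather than nested lists).

```python
def one_hot(labels, num_classes):
    """Return one row per label with a 1.0 in the label's column (illustrative helper)."""
    rows = []
    for label in labels:
        row = [0.0] * num_classes
        row[int(label)] = 1.0
        rows.append(row)
    return rows

# Binary target as used in this notebook: 0 or 1.
encoded = one_hot([1, 0, 1], num_classes=2)
print(encoded)
```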

In [24]:
X_train.shape

(56611, 248)

In [26]:
model = Sequential()

model.add(Dense(248, input_dim = 248, activation= "relu"))
model.add(Dense(2, activation = "softmax"))

model.compile(loss = "categorical_crossentropy", optimizer = "adam", metrics=["accuracy"])

model.fit(X_train, y_train, validation_data = (X_test, y_test), epochs = 50)

Train on 56611 samples, validate on 18871 samples
Epoch 1/50
...
Epoch 50/50


<keras.callbacks.History at 0x1a1cbb55f8>

In [25]:
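# Note: this cell (In [25]) was last run before the model above (In [26]) was fit. The output
# below contains classes 3 and 6, which a two-unit softmax layer cannot produce, so it reflects
# an earlier multi-class model rather than the binary one.
model.predict_classes(X_test)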

array([3, 1, 0, ..., 0, 6, 0])
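
`predict_classes` returns, for each row, the index of the largest softmax probability. Here is a minimal sketch of that step in plain Python, using made-up probability rows; `probs` is hypothetical. In Keras releases that no longer provide `predict_classes`, the usual equivalent is to take `np.argmax` over the last axis of `model.predict(...)`.

```python
# Made-up softmax outputs for three rows of a two-class model.
probs = [
    [0.10, 0.90],
    [0.75, 0.25],
    [0.40, 0.60],
]

# The predicted class is the column index holding the largest probability.
predicted = [max(range(len(row)), key=row.__getitem__) for row in probs]
print(predicted)
```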