***Sam Cressman Capstone Project: Shelter Animal Outcomes***

***Help improve outcomes for shelter animals***

***Capstone inspiration:*** [Kaggle](https://www.kaggle.com/c/shelter-animal-outcomes)

***Modeling Notebook***

In [13]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

In [2]:
animals = pd.read_csv("./cleaned_animals_with_dummies")

In [11]:
# outcomes_dict = {"Transfer": 0, "Adoption": 1, "Return to Owner": 2, 
#                "Euthanasia": 3, "Rto-Adopt": 4, "Disposal": 5, "Died": 6,
#                "Missing": 7, "Relocate": 8}

animals["Outcome Type"].value_counts()

1.0    31113
0.0    24099
2.0    12658
3.0     6334
6.0      730
5.0      318
4.0      169
7.0       45
8.0       16
Name: Outcome Type, dtype: int64
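
For context on the accuracy scores reported further down, the class counts above imply a majority-class baseline: the accuracy a model would get by always predicting the most common outcome. A minimal sketch, assuming the counts are copied by hand from the output above (`outcome_counts` and `baseline_accuracy` are illustrative names, and the figure is for the full dataset rather than the test split):

```python
import pandas as pd

# Counts copied from the value_counts() output above (encoded label -> count)
outcome_counts = pd.Series({
    1.0: 31113, 0.0: 24099, 2.0: 12658, 3.0: 6334, 6.0: 730,
    5.0: 318, 4.0: 169, 7.0: 45, 8.0: 16,
})

total = outcome_counts.sum()

# Accuracy of always predicting the most frequent class (1.0 = Adoption)
baseline_accuracy = outcome_counts.max() / total

print(total, round(baseline_accuracy, 4))
```

This works out to roughly 0.41, so the scores of 0.72 to 0.78 below are well above what always predicting "Adoption" would achieve.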

***Setting X, y, features***

In [3]:
# Disregarding DateTime columns, the target (Outcome Type), Outcome Subtype (many nulls; kept for EDA/visualization),
# Breed (columns added "manually"), and Color (bucketed, then concatenated back onto animals)

features_to_disregard = ["Intake Time", "Outcome Time", "Date of Birth",
                         "Outcome Type", "Outcome Subtype", "Breed", "Color"]

In [15]:
features = [feat for feat in animals.columns if feat not in features_to_disregard]

X = animals[features]
y = animals["Outcome Type"]

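# Default split is 75% train / 25% test; no random_state is set, so the split (and the scores below) vary between runs
X_train, X_test, y_train, y_test = train_test_split(X, y)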

***Logistic Regression Model***

In [4]:
# Accuracy score: 0.7196227014996556

lr = LogisticRegression()

# Fit the scaler on the training split only, then apply the same transform to both splits
ss = StandardScaler()
ss.fit(X_train)
X_train = ss.transform(X_train)
X_test = ss.transform(X_test)

model = lr.fit(X_train, y_train)

model.score(X_test, y_test)

0.7196227014996556

In [5]:
model.predict(X_test)

array([2., 2., 0., ..., 0., 1., 1.])

In [16]:
predictions = model.predict(X_test)

In [6]:
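# Creating DataFrame to view coefficient values
# Note: with a multiclass target, lr.coef_ has one row per class;
# row 0 corresponds to lr.classes_[0], i.e. 0.0 ("Transfer"), so these are the Transfer coefficients only

coef_df = pd.DataFrame({
        "coef": lr.coef_[0],
        "feature": features
    })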

In [7]:
coef_df["exponential_value"] = np.exp(coef_df["coef"])

In [8]:
coef_df.sort_values("coef", ascending=False).head(20);

In [9]:
coef_df.sort_values("coef", ascending=False).tail(20);
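
Because the target has several classes, the fitted model holds one row of coefficients per class, and the table above shows only the first row. A hedged sketch of viewing every class side by side, using synthetic data from `make_classification` rather than the shelter dataset (`X_demo`, `y_demo`, `demo_features`, `demo_lr`, and `coef_by_class` are all hypothetical names):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic 3-class data standing in for the real features (assumption)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
demo_features = ["f0", "f1", "f2", "f3"]

demo_lr = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# coef_ has shape (n_classes, n_features); transpose so that each row is a
# feature and each column is a class label taken from classes_
coef_by_class = pd.DataFrame(
    demo_lr.coef_.T, index=demo_features, columns=demo_lr.classes_
)

print(coef_by_class)
```

The same pattern would apply to `lr` and `features` in this notebook, giving one column per outcome type.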

***GridSearch Logistic Regression Model***

In [10]:
# gs_params = {
#     "penalty": ["l1", "l2"],
#     "solver": ["liblinear"],
#     "C": [0.1 , 1]
# }

# lr_gridsearch = GridSearchCV(LogisticRegression(), gs_params)

# lr_gridsearch_model = lr_gridsearch.fit(X_train, y_train)

# print(lr_gridsearch_model.best_score_)

# print(lr_gridsearch_model.best_params_)

0.7242938651498825
{'C': 1, 'penalty': 'l2', 'solver': 'liblinear'}


In [12]:
# GridSearch Results (commenting out due to run time):

# 0.7242938651498825
# {'C': 1, 'penalty': 'l2', 'solver': 'liblinear'}

***Random Forest***

In [14]:
rf = RandomForestClassifier()

rf.fit(X_train, y_train)

rf.score(X_test, y_test)

0.7772773037994807

***Neural Network***

In a single line: a neural network iteratively trains one or more sets of weights that, used together, return the most accurate predictions for a set of inputs. Training is guided by a loss function, which the model attempts to minimize over successive iterations.
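
The idea of iteratively adjusting weights to reduce a loss can be sketched in plain NumPy. This is a minimal illustration under stated assumptions, not the model used in this project: the data is synthetic, the network has a single hidden layer, and every name (`X_demo`, `W1`, `learning_rate`, and so on) is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic two-class data (assumption: stands in for real features)
X_demo = rng.normal(size=(200, 2))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(float).reshape(-1, 1)

# Two sets of weights: input -> hidden and hidden -> output
W1 = rng.normal(scale=0.5, size=(2, 8))
b1 = np.zeros((1, 8))
W2 = rng.normal(scale=0.5, size=(8, 1))
b2 = np.zeros((1, 1))


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


learning_rate = 0.5
eps = 1e-12  # guards against log(0)
losses = []

for step in range(500):
    # Forward pass: compute predictions from the current weights
    hidden = np.tanh(X_demo @ W1 + b1)
    probs = sigmoid(hidden @ W2 + b2)

    # Loss function: binary cross-entropy, averaged over the samples
    loss = -np.mean(
        y_demo * np.log(probs + eps) + (1 - y_demo) * np.log(1 - probs + eps)
    )
    losses.append(loss)

    # Backward pass: gradients of the loss with respect to each weight set
    d_out = (probs - y_demo) / len(X_demo)
    grad_W2 = hidden.T @ d_out
    grad_b2 = d_out.sum(axis=0, keepdims=True)
    d_hidden = (d_out @ W2.T) * (1 - hidden ** 2)
    grad_W1 = X_demo.T @ d_hidden
    grad_b1 = d_hidden.sum(axis=0, keepdims=True)

    # Gradient descent step: move each weight set against its gradient
    W1 -= learning_rate * grad_W1
    b1 -= learning_rate * grad_b1
    W2 -= learning_rate * grad_W2
    b2 -= learning_rate * grad_b2

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```

Each pass through the loop is one iteration: predictions are made with the current weights, the loss measures how wrong they are, and the weights are nudged in the direction that lowers it.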