<h2>Inputs</h2>

In [1]:
home_dir = r"/Users/wrngnfreeman/Github/Shelter-Animal-Outcomes"  # Path to the project root on your machine
data_file = r"Austin_Animal_Center_Outcomes_20250318"  # Raw data file
AnimalID = r"AnimalID"  # Name of the column containing animal IDs
dep_var = r"OutcomeType"  # Name of the column containing the dependent variable
seed = 42  # Random seed for reproducibility

<h2>Importing required modules</h2>

In [2]:
import sys
import random

# Make the project's modules under src/ importable
sys.path.append(home_dir + r"/src")
import data_processing, feature_engineering, models

<h2>Data preparation</h2>

1. **Age**: Cleans the `AgeuponOutcome` column, converts each age to days, and groups ages into categories such as `<6 months`, `<1 year`, and `<5 years`.
2. **Sex**: Cleans the `SexuponOutcome` column by removing stray whitespace and unknown values, then splits it into two columns: sex (`SexuponOutcome`) and sterilization status (`Sterilization`).
3. **Breed**:
    1. Standardizes text in the `Breed` column using regular expressions to handle spaces, unknowns, and specific terms.
    2. Splits breeds containing 'Mix', creating a new `Mix` column indicating mixed breed status.
    3. Separates multiple breeds listed in the same entry of the `Breed` column into individual rows.
    4. Maps each breed to its respective type (e.g., Terrier, Working) using a predefined dictionary and assigns a `nan` category if no match is found.
    5. Counts how many times each animal occurs after the split and updates its `Mix` status based on these counts.
    6. Ensures that breeds are properly categorized and mixed status is accurately reflected across all related DataFrames.
4. **Coat**:
    1. Coat Color Standardization: Adjusts the `Color` attribute according to the `AnimalType` ('Dog', 'Cat') for consistency in color naming.
    2. Pattern Extraction: Identifies and extracts coat patterns from colors.
    3. Pattern Removal: Strips out recognized pattern indicators from the `Color` string.
    4. Data Merging: Combines the original data with processed color information into `coat_color`.
    5. List Separation: Separates multiple colors listed in the same entry of the `Color` column into individual rows.
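
The age and sex steps above can be sketched with the standard library alone. This is a minimal illustration, not the code in `data_processing`: the helper names (`age_to_days`, `age_category`, `split_sex`), the conversion factors, the bin edges, the `5+ years` label, and the raw input formats (`"2 years"`, `"Neutered Male"`) are all assumptions made for the example.

```python
import re

# Assumed conversion factors; the project's own values may differ.
DAYS_PER_UNIT = {"day": 1, "week": 7, "month": 30, "year": 365}

# Assumed bin edges (exclusive upper bound in days, label).
# The labels mirror those visible in the notebook output.
AGE_BINS = [
    (30, "<1 month"),
    (182, "<6 months"),
    (365, "<1 year"),
    (5 * 365, "<5 years"),
]


def age_to_days(text):
    """Convert a string such as '2 years' or '3 weeks' to days; None if unparseable."""
    match = re.fullmatch(r"\s*(\d+)\s+(day|week|month|year)s?\s*", str(text).lower())
    if match is None:
        return None
    count, unit = match.groups()
    return int(count) * DAYS_PER_UNIT[unit]


def age_category(days):
    """Map an age in days to the first bin whose upper bound exceeds it."""
    if days is None:
        return None
    for upper, label in AGE_BINS:
        if days < upper:
            return label
    return "5+ years"  # hypothetical label for the oldest group


def split_sex(text):
    """Split e.g. 'Neutered Male' into (sex, sterilization status)."""
    cleaned = " ".join(str(text).split())  # collapse stray whitespace
    if cleaned.lower() in {"", "unknown", "nan"}:
        return (None, None)
    parts = cleaned.split(" ")
    if len(parts) == 1:
        return (parts[0], None)
    status, sex = parts[0], parts[-1]
    sterilization = "Sterilized" if status in {"Neutered", "Spayed"} else "Intact"
    return (sex, sterilization)


print(age_category(age_to_days("3 weeks")))  # falls in the first bin
print(split_sex("Neutered  Male"))
```

Keeping the bin edges in one list makes it easy to change the grouping without touching the parsing logic.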

<h3>The dataset</h3>

In [3]:
train_df = data_processing.process_data(
    home_dir=home_dir,
    data_file=data_file,
    AnimalID=AnimalID,
    dep_var=dep_var
)
# Display all rows for five randomly sampled animals
train_df.loc[
    train_df[AnimalID].isin(random.sample(train_df[AnimalID].unique().tolist(), 5)),
    :
].drop(columns="BreedType").rename(columns={"Breed_broken": "Breed"})

| | AnimalID | OutcomeType | Name | DateTime | AnimalType | AgeuponOutcome | SexuponOutcome | Sterilization | Breed | Mix | CoatColor | CoatPattern |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6124 | A629747 | Adoption | Otis | 08/05/2014 07:52:00 AM | Dog | <5 years | Male | Sterilized | American Bulldog | Mix | Brown | Brindle |
| 6125 | A629747 | Adoption | Otis | 08/05/2014 07:52:00 AM | Dog | <5 years | Male | Sterilized | American Bulldog | Mix | White | Brindle |
| 79360 | A719587 | Adoption | Andy | 02/10/2016 01:20:00 PM | Dog | <1 year | Male | Sterilized | German Shepherd | Mix | Black | |
| 79361 | A719587 | Adoption | Andy | 02/10/2016 01:20:00 PM | Dog | <1 year | Male | Sterilized | German Shepherd | Mix | White | |
| 79362 | A719587 | Adoption | Andy | 02/10/2016 01:20:00 PM | Dog | <1 year | Male | Sterilized | Border Collie | Mix | Black | |
| 79363 | A719587 | Adoption | Andy | 02/10/2016 01:20:00 PM | Dog | <1 year | Male | Sterilized | Border Collie | Mix | White | |
| 130951 | A765243 | Transfer | Oreo | 01/17/2018 04:51:00 PM | Dog | <1 year | Female | Intact | Beagle | Mix | White | |
| 130952 | A765243 | Transfer | Oreo | 01/17/2018 04:51:00 PM | Dog | <1 year | Female | Intact | Beagle | Mix | Tricolor | |
| 163723 | A792763 | Return_to_owner | Armani | 04/15/2019 06:23:00 PM | Dog | <6 months | Male | Sterilized | Labrador Retriever | Mix | Black | |
| 163724 | A792763 | Return_to_owner | Armani | 04/15/2019 06:23:00 PM | Dog | <6 months | Male | Sterilized | Labrador Retriever | Mix | White | |
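
In the output above, each animal occupies one row per breed and color combination: Andy (`A719587`) has two breeds and two colors, hence four rows. The sketch below reproduces that expansion with the standard library. It is an illustration only: the `expand` helper and the `/` separator in the raw `Breed` and `Color` strings are assumptions, not the project's own code.

```python
from itertools import product


def expand(record, sep="/"):
    """Return one copy of the record per (breed, color) combination."""
    breeds = [b.strip() for b in record["Breed"].split(sep)]
    colors = [c.strip() for c in record["Color"].split(sep)]
    return [{**record, "Breed": b, "Color": c} for b, c in product(breeds, colors)]


# Hypothetical raw record, with values taken from the rows shown above.
andy = {
    "AnimalID": "A719587",
    "Breed": "German Shepherd/Border Collie",
    "Color": "Black/White",
}

for row in expand(andy):
    print(row["AnimalID"], row["Breed"], row["Color"])
```

Because one animal can span several rows, any per-animal statistic should group by `AnimalID` first rather than count rows.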


<h2>Model training</h2>

<h3>Random Forest Model</h3>

In [4]:
export_model_path = home_dir + r"/pickle_files/rf_model.pkl"

# Load and process training data
processed_df = data_processing.process_data(
    home_dir=home_dir,
    data_file=data_file,
    AnimalID=AnimalID,
    dep_var=dep_var
).drop(columns=["Breed_broken"])
# Engineer features
engineered_df = feature_engineering.engineer_features(
    df=processed_df,
    AnimalID=AnimalID,
    dep_var=dep_var
)

models.random_forest_model(
    df=engineered_df,
    AnimalID=AnimalID,
    dep_var=dep_var,
    seed=seed
)

Classification Report
                 precision    recall  f1-score   support

       Adoption       0.72      0.93      0.81     28191
Return_to_owner       0.49      0.32      0.38      9233
       Transfer       0.72      0.57      0.64     15007
           Died       0.00      0.00      0.00       424
     Euthanasia       0.25      0.02      0.04      1549

       accuracy                           0.69     54404
      macro avg       0.43      0.37      0.37     54404
   weighted avg       0.66      0.69      0.66     54404

Random Forest Model Accuracy: 0.6923020366149548
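
The macro and weighted averages in the report differ because the classes are very unevenly sized. A short check, using only the per-class recall and support values printed above, shows how the two averages arise:

```python
# Per-class (recall, support), copied from the classification report above.
report = {
    "Adoption": (0.93, 28191),
    "Return_to_owner": (0.32, 9233),
    "Transfer": (0.57, 15007),
    "Died": (0.00, 424),
    "Euthanasia": (0.02, 1549),
}

total_support = sum(support for _, support in report.values())

# Macro average: every class counts equally, regardless of its size.
macro_recall = sum(recall for recall, _ in report.values()) / len(report)

# Weighted average: each class counts in proportion to its support.
weighted_recall = (
    sum(recall * support for recall, support in report.values()) / total_support
)

print(f"support={total_support}, macro={macro_recall:.2f}, weighted={weighted_recall:.2f}")
```

This prints a support of 54404, a macro recall of 0.37, and a weighted recall of 0.69, matching the report. Support-weighted recall is mathematically the same quantity as accuracy, which is why both read 0.69. The macro average is pulled down by `Died` and `Euthanasia`, which the model almost never identifies.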

Feature Importances


| | feature | importance |
|---|---|---|
| 10 | Sterilization_Sterilized | 5.093290e-01 |
| 9 | Age_<6 months | 9.620536e-02 |
| 3 | Age_<1 month | 6.920570e-02 |
| 0 | AnimalType_Cat | 3.149489e-02 |
| 8 | Age_<5 years | 2.788575e-02 |
| ... | ... | ... |
| 17 | BreedType_Birman | 4.535828e-06 |
| 74 | CoatColor_Ruddy | 3.235077e-06 |
| 38 | BreedType_Ocicat | 2.747422e-06 |
| 31 | BreedType_Javanese | 3.964498e-07 |
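
Scientific notation makes the relative size of these importances hard to compare at a glance. The sketch below converts the five largest values shown above into percentages and sums them. Reading them as shares of the total assumes the importances are normalized to sum to 1 across all features, as scikit-learn's impurity-based importances are. Whether `models.random_forest_model` uses scikit-learn is itself an assumption, since its code is not shown here.

```python
# Rows copied from the head of the feature-importance table above.
top_rows = [
    ("Sterilization_Sterilized", "5.093290e-01"),
    ("Age_<6 months", "9.620536e-02"),
    ("Age_<1 month", "6.920570e-02"),
    ("AnimalType_Cat", "3.149489e-02"),
    ("Age_<5 years", "2.788575e-02"),
]

# float() parses scientific notation directly.
importances = {name: float(value) for name, value in top_rows}
top_five_share = sum(importances.values())

for name, value in importances.items():
    print(f"{name:<26} {value:6.1%}")
print(f"{'top five combined':<26} {top_five_share:6.1%}")
```

Under that assumption, sterilization status alone accounts for about half of the total importance, and the five features shown account for roughly 73%.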
