<h2>Enter your computer's home directory</h2>

In [1]:
home_dir = r"/Users/wrngnfreeman/Github/Shelter-Animal-Outcomes-by-kaggle.com"

<h2>Importing required modules</h2>

In [2]:
import sys

# Make the project's src/ directory importable before loading its modules.
sys.path.append(home_dir + r"/src")
import data_processing, model_training

<h2>Data preparation</h2>

1. **Age**: Cleans the `AgeuponOutcome` column, converts age to days, and groups ages into categories.
2. **Sex**: Cleans the `SexuponOutcome` column by removing unwanted spaces and unknown values, then splits it into two columns: the animal's sex (`SexuponOutcome`) and its sterilization status (`Sterilization`).
3. **Breed**:
    1. Standardizes text in the `Breed` column using regular expressions to handle spaces, unknowns, and specific terms.
    2. Splits breeds containing 'Mix', creating a new `Mix` column indicating mixed breed status.
    3. Separates multiple breeds listed in the same entry of the `Breed` column into individual rows.
    4. Maps each breed to its respective type (e.g., Terrier, Working) using a predefined dictionary and assigns a `nan` category if no match is found.
    5. Counts how often each animal occurs after the split and updates the `Mix` status based on these counts.
    6. Ensures that breeds are properly categorized and mixed status is accurately reflected across all related DataFrames.
4. **Coat**:
    1. Coat Color Standardization: Adjusts the `Color` attribute according to the `AnimalType` ('Dog', 'Cat') for consistency in color naming.
    2. Pattern Extraction: Identifies and extracts coat patterns from colors.
    3. Pattern Removal: Strips out recognized pattern indicators from the `Color` string.
    4. Data Merging: Combines the original data with processed color information into `coat_color`.
    5. List Separation: Separates multiple colors listed in the same entry of the `Color` column into individual rows.
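As an illustration of the **Age** step, the sketch below converts an age string to days and assigns it to a category. It is not the code in `data_processing`: the helper names `age_to_days` and `age_group`, the conversion factors (30 days per month, 365 per year), and the bucket boundaries are assumptions. Only the category labels are taken from the outputs shown in this notebook.

```python
import bisect

# Assumed conversion factors; the project's module may use different ones.
DAYS_PER_UNIT = {"day": 1, "week": 7, "month": 30, "year": 365}

# Exclusive upper bounds in days (assumed), paired with the labels
# that appear in the processed data.
AGE_BOUNDS = [30, 180, 365, 5 * 365, 10 * 365, 15 * 365]
AGE_LABELS = [
    "<1 month", "<6 months", "<1 year",
    "<5 years", "<10 years", "<15 years", "15+ years",
]


def age_to_days(age):
    """Convert a string such as '2 years' or '3 weeks' to days.

    Returns None when the value cannot be parsed.
    """
    if not isinstance(age, str):
        return None
    parts = age.strip().lower().split()
    if len(parts) != 2 or not parts[0].isdigit():
        return None
    unit = parts[1].rstrip("s")  # 'years' -> 'year'
    if unit not in DAYS_PER_UNIT:
        return None
    return int(parts[0]) * DAYS_PER_UNIT[unit]


def age_group(days):
    """Map an age in days to a category label ('Unknown' if missing)."""
    if days is None:
        return "Unknown"
    return AGE_LABELS[bisect.bisect_right(AGE_BOUNDS, days)]


for raw in ["3 weeks", "2 years", "16 years", ""]:
    print(repr(raw), "->", age_group(age_to_days(raw)))
```

Using `bisect` keeps the boundaries in one sorted list, so changing a cut-off only means editing `AGE_BOUNDS` and `AGE_LABELS` together.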

<h3>The train dataset</h3>

In [3]:
train_df = data_processing.process_data(
    home_dir=home_dir,
    data_file=r"train",
    AnimalID=r"AnimalID",
    dep_var=r"OutcomeType"
)
display(train_df)

	Age information processed in: 0.02 seconds
	Sex information processed in: 0.03 seconds
	Breed information processed in: 0.41 seconds
	Coat information processed in: 3.12 seconds
	All data merged in: 0.06 seconds


,AnimalID,OutcomeType,Name,DateTime,AnimalType,AgeuponOutcome,SexuponOutcome,Sterilization,BreedType,Mix,CoatColor,CoatPattern
0,A006100,Return_to_owner,Scamp,2014-12-20 16:35:00,Dog,<10 years,Male,Sterilized,Sporting,Pure breed,Yellow,
1,A006100,Return_to_owner,Scamp,2014-12-20 16:35:00,Dog,<10 years,Male,Sterilized,Sporting,Pure breed,White,
2,A047759,Transfer,Oreo,2014-04-07 15:12:00,Dog,<15 years,Male,Sterilized,Hound,Pure breed,Tricolor,
3,A134067,Return_to_owner,Bandit,2013-11-16 11:54:00,Dog,15+ years,Male,Sterilized,Herding,Pure breed,Brown,
4,A134067,Return_to_owner,Bandit,2013-11-16 11:54:00,Dog,15+ years,Male,Sterilized,Herding,Pure breed,White,
...,...,...,...,...,...,...,...,...,...,...,...,...
44733,A721108,Return_to_owner,Buddy,2016-02-21 15:27:00,Dog,<5 years,Male,Sterilized,Hound,Mix,White,
44734,A721108,Return_to_owner,Buddy,2016-02-21 15:27:00,Dog,<5 years,Male,Sterilized,Sporting,Mix,Chocolate,
44735,A721108,Return_to_owner,Buddy,2016-02-21 15:27:00,Dog,<5 years,Male,Sterilized,Sporting,Mix,White,
44736,A721109,Return_to_owner,Oliver,2016-02-21 16:32:00,Dog,<5 years,Male,Sterilized,Toy,Pure breed,Tricolor,
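The same `AnimalID` appears on several rows because the breed and color steps separate multi-valued entries into individual rows. The output looks consistent with one row per breed and color combination, although that is an inference from the table rather than something confirmed by the source. The sketch below shows that expansion on a plain dictionary. The helper `explode_record`, the `/` separator, and the sample record are hypothetical.

```python
from itertools import product


def explode_record(record, columns=("Breed", "Color"), sep="/"):
    """Return one dict per combination of the separated values in `columns`."""
    value_lists = [
        [part.strip() for part in str(record[col]).split(sep)]
        for col in columns
    ]
    rows = []
    for combo in product(*value_lists):
        row = dict(record)               # copy the untouched fields
        row.update(zip(columns, combo))  # overwrite with one value per column
        rows.append(row)
    return rows


# Hypothetical record, not taken from the dataset.
animal = {"AnimalID": "X000001", "Breed": "Beagle/Pointer", "Color": "Brown/White"}
rows = explode_record(animal)
for row in rows:
    print(row)
```

An animal with two breeds and two colors yields four rows, so row counts in the processed frame are not animal counts.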


<h3>The scoring dataset</h3>

In [4]:
test_df = data_processing.process_data(
    home_dir=home_dir,
    data_file=r"test",
    AnimalID=r"ID"
)
display(test_df)

	Age information processed in: 0.01 seconds
	Sex information processed in: 0.01 seconds
	Breed information processed in: 0.18 seconds
	Coat information processed in: 1.36 seconds
	All data merged in: 0.01 seconds


,ID,Name,DateTime,AnimalType,AgeuponOutcome,SexuponOutcome,Sterilization,BreedType,Mix,CoatColor,CoatPattern
0,1,Summer,2015-10-12 12:15:00,Dog,<1 year,Female,Intact,Sporting,Pure breed,Red,
1,1,Summer,2015-10-12 12:15:00,Dog,<1 year,Female,Intact,Sporting,Pure breed,White,
2,2,Cheyenne,2014-07-26 17:59:00,Dog,<5 years,Female,Sterilized,Herding,Mix,Black,
3,2,Cheyenne,2014-07-26 17:59:00,Dog,<5 years,Female,Sterilized,Herding,Mix,Cream,
4,2,Cheyenne,2014-07-26 17:59:00,Dog,<5 years,Female,Sterilized,Working,Mix,Black,
...,...,...,...,...,...,...,...,...,...,...,...
19171,11453,,2014-10-21 12:57:00,Cat,<1 month,Female,Intact,Unknown,Pure breed,Gray,
19172,11454,,2014-09-29 09:00:00,Cat,<5 years,Female,Intact,Unknown,Pure breed,Calico,
19173,11455,Rambo,2015-09-05 17:16:00,Dog,<10 years,Male,Sterilized,Herding,Pure breed,Black,
19174,11455,Rambo,2015-09-05 17:16:00,Dog,<10 years,Male,Sterilized,Herding,Pure breed,Cream,


<h2>Model training</h2>

<h3>Random Forest Model</h3>

In [5]:
export_model_path = home_dir + r"/pickle_files/rf_model.pkl"

rf_model = model_training.train_model(
    home_dir=home_dir,
    data_file=r"train",
    AnimalID=r"AnimalID",
    dep_var=r"OutcomeType",
    export_model_path=export_model_path
)

	Age information processed in: 0.02 seconds
	Sex information processed in: 0.03 seconds
	Breed information processed in: 0.40 seconds
	Coat information processed in: 3.13 seconds
	All data merged in: 0.06 seconds
Data loaded in: 3.74 seconds
Features engineered in: 0.11 seconds
Classification Report
              precision    recall  f1-score   support

           1       0.60      0.82      0.70      3630
           2       0.43      0.32      0.37      1757
           3       0.70      0.59      0.64      3008
           4       0.00      0.00      0.00        60
           5       0.27      0.09      0.14       493

    accuracy                           0.60      8948
   macro avg       0.40      0.36      0.37      8948
weighted avg       0.58      0.60      0.58      8948

Accuracy: 0.598681269557443
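To see how the two average rows of the report relate to the per-class rows, the following sketch recomputes them from the rounded values printed above. The macro average weights every class equally, while the weighted average weights each class by its support, which is why class 4 (support 60, all scores 0.00) pulls the macro average down much more than the weighted one. The function names are illustrative.

```python
# Per-class rows copied from the report: (precision, recall, f1-score, support).
report = {
    1: (0.60, 0.82, 0.70, 3630),
    2: (0.43, 0.32, 0.37, 1757),
    3: (0.70, 0.59, 0.64, 3008),
    4: (0.00, 0.00, 0.00, 60),
    5: (0.27, 0.09, 0.14, 493),
}


def macro_avg(rows, idx):
    """Unweighted mean of one metric across classes."""
    return sum(r[idx] for r in rows) / len(rows)


def weighted_avg(rows, idx):
    """Mean of one metric across classes, weighted by class support."""
    total_support = sum(r[3] for r in rows)
    return sum(r[idx] * r[3] for r in rows) / total_support


rows = list(report.values())
for name, idx in (("precision", 0), ("recall", 1), ("f1-score", 2)):
    print(
        f"{name:>9}: macro={macro_avg(rows, idx):.2f} "
        f"weighted={weighted_avg(rows, idx):.2f}"
    )
```

The support-weighted recall equals the overall accuracy by definition, so it matches the reported accuracy up to the rounding of the per-class values.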

Feature Importances


,feature,importance
10,Sterilization_Sterilized,0.362180
9,Age_<6 months,0.088598
0,AnimalType_Cat,0.048849
3,Age_<1 month,0.043955
1,Sex_Female,0.040608
...,...,...
32,CoatColor_Chocolate,0.000038
60,CoatColor_White Lynx,0.000029
21,CoatColor_Apricot,0.000019
26,CoatColor_Blue Cream,0.000010


Random Forest Model built in: 1.53 seconds
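The trained model is exported to the path given in `export_model_path`. How `model_training.train_model` writes that file is not shown here. The sketch below assumes the standard-library `pickle` module, which the `.pkl` extension suggests but does not prove. The helpers `save_model` and `load_model` are hypothetical, and a small dictionary stands in for the fitted model.

```python
import pickle
from pathlib import Path


def save_model(model, path):
    """Serialize `model` to `path`, creating parent folders if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as fh:
        pickle.dump(model, fh)


def load_model(path):
    """Load a previously pickled object from `path`."""
    with Path(path).open("rb") as fh:
        return pickle.load(fh)


if __name__ == "__main__":
    import tempfile

    # Stand-in object; a fitted estimator would be saved the same way.
    stand_in = {"name": "rf_model", "n_classes": 5}
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "pickle_files" / "rf_model.pkl"
        save_model(stand_in, target)
        print(load_model(target))
```

If the export does use `pickle`, a later session could reload the model with `load_model(export_model_path)`. Only unpickle files from a source you trust, because loading a pickle can execute arbitrary code.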
