<h2>Enter your computer's home directory</h2>

In [1]:
home_dir = r"/Users/wrngnfreeman/Github/Shelter-Animal-Outcomes-by-kaggle.com"

<h2>Importing required modules</h2>

In [2]:
import sys

# Make the project's src/ directory importable before loading its modules.
sys.path.append(home_dir + r"/src")
import data_processing, model_training

<h2>Data preparation</h2>

1. **Age**: Cleans the `AgeuponOutcome` column, converts age to days, and groups ages into categories.
2. **Sex**: Cleans the `SexuponOutcome` column by removing unwanted spaces and unknown values, then splits it into two columns for detailed categorization.
3. **Breed**:
    1. Standardizes text in the `Breed` column using regular expressions to handle spaces, unknowns, and specific terms.
    2. Splits breeds containing 'Mix', creating a new `Mix` column indicating mixed breed status.
    3. Separates multiple breeds listed in the same entry of the `Breed` column into individual rows.
    4. Maps each breed to its respective type (e.g., Terrier, Working) using a predefined dictionary and assigns a `nan` category if no match is found.
    5. Counts how many times each animal occurs after the split and updates the `Mix` status based on these counts.
    6. Ensures that breeds are properly categorized and mixed status is accurately reflected across all related DataFrames.
4. **Coat**:
    1. Coat Color Standardization: Adjusts the `Color` attribute according to the `AnimalType` ('Dog', 'Cat') for consistency in color naming.
    2. Pattern Extraction: Identifies and extracts coat patterns from colors.
    3. Pattern Removal: Strips out recognized pattern indicators from the `Color` string.
    4. Data Merging: Combines the original data with processed color information into `coat_color`.
    5. List Separation: Separates multiple colors listed in the same entry of the `Color` column into individual rows.
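
The preparation steps above can be illustrated with a minimal, standard-library sketch. It is not the code in `data_processing`: the helper names, the day-conversion factors, the age cut-offs, and the pattern list are all assumptions chosen for illustration, and the real module may clean the columns differently.

```python
import re
from typing import List, Optional, Tuple

# Approximate conversion factors (assumption, not taken from the project).
DAYS_PER_UNIT = {"day": 1, "week": 7, "month": 30, "year": 365}
# Hypothetical list of coat-pattern words.
COAT_PATTERNS = ("tabby", "brindle", "merle", "tick", "point")


def age_to_days(age: Optional[str]) -> Optional[int]:
    """Convert strings such as '2 years' or '3 weeks' to a number of days."""
    if not age:
        return None
    match = re.fullmatch(r"\s*(\d+)\s+(day|week|month|year)s?\s*", age.lower())
    if match is None:
        return None
    return int(match.group(1)) * DAYS_PER_UNIT[match.group(2)]


def age_group(days: Optional[int]) -> str:
    """Bucket an age in days; the cut-offs are illustrative only."""
    if days is None:
        return "unknown"
    if days < 180:
        return "baby"
    if days < 730:
        return "young"
    if days < 2920:
        return "adult"
    return "senior"


def split_sex(value: Optional[str]) -> Tuple[Optional[str], Optional[str]]:
    """Split 'Neutered Male' into (status, sex); unknowns become (None, None)."""
    if not value or value.strip().lower() == "unknown":
        return (None, None)
    parts = value.split()
    if len(parts) != 2:
        return (None, None)
    return (parts[0], parts[1])


def split_breed(breed: str) -> Tuple[List[str], bool]:
    """Return the individual breeds and a mixed-breed flag."""
    is_mix = bool(re.search(r"\bmix\b", breed, flags=re.IGNORECASE))
    cleaned = re.sub(r"\bmix\b", "", breed, flags=re.IGNORECASE)
    breeds = [part.strip() for part in cleaned.split("/") if part.strip()]
    # Treating a multi-breed entry as mixed is an assumption of this sketch.
    return breeds, is_mix or len(breeds) > 1


def split_coat(color: str) -> Tuple[List[str], List[str]]:
    """Separate base colors from pattern words in a 'Color' entry."""
    colors: List[str] = []
    patterns: List[str] = []
    for part in color.split("/"):
        words = part.split()
        patterns += [w for w in words if w.lower() in COAT_PATTERNS]
        base = " ".join(w for w in words if w.lower() not in COAT_PATTERNS)
        if base:
            colors.append(base)
    return colors, patterns
```

For example, `split_breed("Lhasa Apso/Miniature Poodle")` yields two breeds with the mixed flag set, and `split_coat("Brown Brindle/White")` separates the `Brindle` pattern from the `Brown` and `White` colors. Returning plain lists keeps the sketch independent of any DataFrame library; the project itself expands such lists into individual rows.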

<h3>The train dataset</h3>

In [3]:
train_df = data_processing.process_data(
    home_dir=home_dir,
    data_file=r"Austin_Animal_Center_Outcomes_20250318",
    AnimalID=r"AnimalID",
    dep_var=r"OutcomeType"
)
display(train_df)

Age information processed in: 0.16 seconds
Sex information processed in: 0.35 seconds
Breed information processed in: 2.46 seconds
Coat information processed in: 20.37 seconds



<h3>The scoring dataset</h3>

In [None]:
test_df = data_processing.process_data(
    home_dir=home_dir,
    data_file=r"test",
    AnimalID=r"ID"
)
display(test_df)

<h2>Model training</h2>

<h3>Random Forest Model</h3>

In [None]:
export_model_path = home_dir + r"/pickle_files/rf_model(2).pkl"

rf_model = model_training.train_model(
    home_dir=home_dir,
    data_file=r"Austin_Animal_Center_Outcomes_20250318",
    AnimalID=r"AnimalID",
    dep_var=r"OutcomeType",
    export_model_path=export_model_path
)
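
Once training has finished, the exported file can be reloaded in a later session without retraining. The sketch below assumes that `train_model` writes the model with Python's `pickle` module, which the `.pkl` extension suggests but the notebook does not confirm; `load_model` is a hypothetical helper, not part of `model_training`.

```python
import pickle
from pathlib import Path


def load_model(path):
    """Load a pickled object from disk; only unpickle files you trust."""
    with Path(path).open("rb") as handle:
        return pickle.load(handle)
```

Under that assumption, `rf_model = load_model(export_model_path)` would restore the trained model. If the project saves models with another tool, such as `joblib`, use that tool's loader instead.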