# Animal Shelter Color Bias - Data Exploration

This notebook performs initial data exploration and preprocessing of the Jacksonville Humane Society dataset to understand the structure and characteristics of the animal shelter data.

## Objectives:
- Load and explore the raw shelter data
- Separate cats and dogs for individual analysis  
- Examine data quality and basic distributions
- Prepare clean datasets for further analysis

In [35]:
%pip install -r "../requirements.txt"



In [36]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.offline as py
import os
py.init_notebook_mode(connected=True)

import warnings 
warnings.filterwarnings('ignore')

plt.style.use('fivethirtyeight')
sns.set()
sns.set_context('talk')

In [37]:
data = pd.read_csv('../data/raw/JAXhumane.csv')
data.head(10)

Unnamed: 0,Outcome Date,Name,Species,Primary Breed,Secondary Breed,Sex,Primary Color,Secondary Color,Pattern,Age (Months),Altered,Altered before arrival,Outcome Type,Intake Date,Intake Type
0,01/01/2021,Nova,Cat,Domestic Shorthair,,Female,Brown,,,48,No,Yes,Adoption,12/17/2020,Owner/Guardian Surrender
1,01/01/2021,Bud,Dog,Mixed Breed (Medium),,Male,Black,White,,36,Yes,No,Adoption,12/29/2020,Return
2,01/01/2021,Tinsel,Cat,Domestic Longhair,,Female,Black,,,3,Yes,No,Adoption,12/04/2020,Stray
3,01/01/2021,Mustard,Cat,Domestic Shorthair,,Male,Orange,White,,5,Yes,No,Adoption,12/03/2020,Stray
4,01/01/2021,Dionne,Cat,Domestic Shorthair,,Female,Grey,,Tabby,3,Yes,No,Adoption,12/28/2020,Stray
5,01/01/2021,Kiko,Dog,Mixed Breed (Medium),,Male,Black,White,,31,Yes,No,Return to Owner/Guardian,01/01/2021,Stray
6,01/01/2021,Ranger,Cat,Domestic Shorthair,,Female,Orange,White,Tabby,2,Yes,No,Adoption,10/25/2020,Stray
7,01/01/2021,Siam,Cat,Domestic Shorthair,,Male,Brown,,Tabby,2,Yes,No,Adoption,11/27/2020,Stray
8,01/01/2021,Tahiti,Cat,Domestic Shorthair,,Female,Brown,,Tabby,2,Yes,No,Adoption,11/27/2020,Stray
9,01/01/2021,Pompeii,Cat,Domestic Shorthair,,Male,Orange,,Tabby,2,Yes,No,Adoption,11/27/2020,Stray


## Dataset Overview

In [38]:
# Basic dataset information
print(f"Total Records: {data.shape[0]:,}")
print(f"Features: {data.shape[1]}")
print(f"Date Range: {data['Intake Date'].min()} to {data['Outcome Date'].max()}")

# Species breakdown
species_counts = data['Species'].value_counts()
print(f"\nSpecies Distribution:")
print(f"  Cats: {species_counts.get('Cat', 0):,} ({species_counts.get('Cat', 0)/len(data)*100:.1f}%)")
print(f"  Dogs: {species_counts.get('Dog', 0):,} ({species_counts.get('Dog', 0)/len(data)*100:.1f}%)")

Total Records: 27,636
Features: 15
Date Range: 01/01/2021 to 12/31/2023

Species Distribution:
  Cats: 16,903 (61.2%)
  Dogs: 10,733 (38.8%)
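
The date range printed above comes from calling `min()` and `max()` on columns that are still strings at this point, so the comparison is alphabetical rather than chronological. The preview rows already show intake dates in late 2020, which precede the reported start of 01/01/2021. Below is a minimal sketch of a chronological version, assuming every date follows the `MM/DD/YYYY` pattern seen in the preview. The `sample` frame is a toy stand-in for `data`.

```python
import pandas as pd

# Toy stand-in for the raw frame; values copied from the preview rows above.
sample = pd.DataFrame({
    "Intake Date": ["12/17/2020", "10/25/2020", "01/01/2021"],
    "Outcome Date": ["01/01/2021", "01/01/2021", "01/01/2021"],
})

# String comparison is lexicographic: "01/01/2021" sorts before "10/25/2020".
string_min = sample["Intake Date"].min()

# Parsing first gives a chronological minimum and maximum.
# Assumption: all dates use the MM/DD/YYYY format.
intake = pd.to_datetime(sample["Intake Date"], format="%m/%d/%Y")
outcome = pd.to_datetime(sample["Outcome Date"], format="%m/%d/%Y")

print(f"String min of Intake Date: {string_min}")
print(
    "Date Range: "
    f"{intake.min().strftime('%m/%d/%Y')} to {outcome.max().strftime('%m/%d/%Y')}"
)
```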


In [39]:
# Data quality check
missing_pct = (data.isnull().sum() / len(data) * 100).round(1)
missing_cols = missing_pct[missing_pct > 0]

print(f"\nData Quality:")
if len(missing_cols) > 0:
    print("  Columns with missing data:")
    for col, pct in missing_cols.items():
        print(f"    {col}: {pct}% missing")
else:
    print("  No missing data")


Data Quality:
  Columns with missing data:
    Secondary Breed: 98.5% missing
    Secondary Color: 58.8% missing
    Pattern: 77.4% missing
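
Because the percentages are rounded to one decimal place before the `> 0` filter, a column with only a handful of missing values rounds to 0.0% and drops out of this report. The later `dropna()` takes the data from 27,636 to 27,627 rows, which suggests a few such values exist. The sketch below filters on raw counts instead. The toy frame and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: 1 missing value out of 10,000 in "Sex".
toy = pd.DataFrame({
    "Sex": ["Male"] * 9_999 + [np.nan],
    "Pattern": [np.nan] * 7_000 + ["Tabby"] * 3_000,
})

missing_count = toy.isnull().sum()
missing_pct = missing_count / len(toy) * 100

# Filter on the raw count so tiny fractions are not rounded away.
has_missing = missing_count > 0
report = pd.DataFrame({
    "missing": missing_count[has_missing],
    "pct": missing_pct[has_missing].round(2),
})
print(report)

# For comparison: filtering on the rounded percentage misses "Sex".
rounded = missing_pct.round(1)
print(rounded[rounded > 0].index.tolist())
```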


In [40]:
print(f"\nKey Statistics:")
print(f"  Adoption Rate: {(data['Outcome Type'] == 'Adoption').mean()*100:.1f}%")
print(f"  Unique Colors: {data['Primary Color'].nunique()}")
print(f"  Unique Breeds: {data['Primary Breed'].nunique()}")
print(f"  Average Age: {data['Age (Months)'].mean():.1f} months")


Key Statistics:
  Adoption Rate: 94.9%
  Unique Colors: 42
  Unique Breeds: 16
  Average Age: 21.2 months


## Data Preprocessing

In [41]:
# Build data_clean only if it is not already defined in this session
try:
    data_clean.shape
except NameError:
    print("data_clean not found; rebuilding it from the raw CSV.")
    data = pd.read_csv('../data/raw/JAXhumane.csv')
    # Drop the three sparsely populated columns, then any rows with missing values
    data_clean = data.drop(columns=['Secondary Color', 'Secondary Breed', 'Pattern'])
    data_clean = data_clean.dropna()
    # Parse the dates so their difference gives the stay length in days
    data_clean['Intake Date'] = pd.to_datetime(data_clean['Intake Date'])
    data_clean['Outcome Date'] = pd.to_datetime(data_clean['Outcome Date'])
    data_clean['length_of_stay'] = (data_clean['Outcome Date'] - data_clean['Intake Date']).dt.days

In [42]:
data_clean.dtypes

Outcome Date              datetime64[ns]
Name                              object
Species                           object
Primary Breed                     object
Sex                               object
Primary Color                     object
Age (Months)                       int64
Altered                           object
Altered before arrival            object
Outcome Type                      object
Intake Date               datetime64[ns]
Intake Type                       object
length_of_stay                     int64
Intake Type Grouped               object
dtype: object
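
Since `length_of_stay` is the difference between two parsed dates, a quick check for negative or implausibly long stays can catch data-entry errors before any modelling. The sketch below uses toy rows. The `max_days` cutoff is an arbitrary assumption, not a value taken from this project.

```python
import pandas as pd

# Toy rows: the second one has an outcome date before its intake date.
toy = pd.DataFrame({
    "Intake Date": pd.to_datetime(["2020-12-17", "2021-01-05", "2020-10-25"]),
    "Outcome Date": pd.to_datetime(["2021-01-01", "2021-01-01", "2021-01-01"]),
})
toy["length_of_stay"] = (toy["Outcome Date"] - toy["Intake Date"]).dt.days

max_days = 365 * 3  # hypothetical upper bound for a plausible stay

negative = toy[toy["length_of_stay"] < 0]
too_long = toy[toy["length_of_stay"] > max_days]

print(f"Negative stays: {len(negative)}")
print(f"Stays over {max_days} days: {len(too_long)}")
```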

In [43]:
# Check for duplicates
print(f"Duplicate records: {data_clean.duplicated().sum()}")
if data_clean.duplicated().sum() > 0:
    data_clean = data_clean.drop_duplicates()

print(f"Final cleaned dataset: {data_clean.shape[0]:,} records")

Duplicate records: 0
Final cleaned dataset: 27,627 records


In [44]:
# Standardize color labels (if needed)
data_clean['Primary Color'] = data_clean['Primary Color'].str.title()
unique_colors = data_clean['Primary Color'].unique()
print(f"Total unique colors: {len(unique_colors)}")
print("\nAll color values:")
for color in sorted(unique_colors):
    print(f"  '{color}'")

Total unique colors: 24

All color values:
  'Albino'
  'Apricot'
  'Black'
  'Blond'
  'Blue'
  'Brindle'
  'Brown'
  'Buff'
  'Calico'
  'Cream'
  'Fawn'
  'Grey'
  'Lilac'
  'Lynx'
  'Orange'
  'Red'
  'Sable'
  'Seal'
  'Tan'
  'Torbie'
  'Tortoise'
  'Wheaten'
  'White'
  'Yellow'


In [45]:
# Color mapping for standardization
color_mapping = {
    # Fix spelling variations
    'Blonde': 'Blond',
    
    # Group similar browns
    'Chocolate': 'Brown',
    'Mahogany': 'Brown', 
    'Red/Mahogany': 'Brown',
    'Liver': 'Brown',
    'Copper': 'Brown',
    
    # Group similar greys
    'Silver': 'Grey',
    'Charcoal': 'Grey',
    'Smoke': 'Grey',
    
    # Group light colors
    'Cream': 'Buff',
    'Beige': 'Buff',
    
    # Group oranges/reds
    'Flame': 'Orange',
    'Ruddy': 'Red',
    
    # Group yellows
    'Golden': 'Yellow',
    'Sandy': 'Yellow',
    
    # Simplify complex colors
    'Shaded Blue Cream Cameo': 'Buff',  # 'Cream' itself is grouped into 'Buff' above
    'Salt & Pepper': 'Grey',
    'Blue Black': 'Black',
    'Silver Black': 'Black'
}

# Apply the mapping
data_clean['Primary Color'] = data_clean['Primary Color'].replace(color_mapping)
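
Mappings like this one can pick up chains, where one label maps to a value that is itself a key (for instance a label mapped to 'Cream' while 'Cream' maps to 'Buff'). Our understanding is that a dict passed to `Series.replace` is applied in a single pass, so a chained label would stop at the intermediate value. The helper below is a hypothetical sketch, not part of the notebook, that flattens a mapping so every key points at its final target.

```python
def resolve_chains(mapping):
    """Return a copy of `mapping` where every key points at its final target."""
    resolved = {}
    for source in mapping:
        target = mapping[source]
        seen = {source}
        # Follow the chain until the target is no longer a key.
        while target in mapping:
            if target in seen:
                raise ValueError(f"Cycle in mapping at {target!r}")
            seen.add(target)
            target = mapping[target]
        resolved[source] = target
    return resolved


# Small example with one chained entry.
example = {
    "Cream": "Buff",
    "Beige": "Buff",
    "Shaded Blue Cream Cameo": "Cream",
}
flat = resolve_chains(example)
print(flat["Shaded Blue Cream Cameo"])
```

Running a mapping through a helper like this before calling `replace` makes the result independent of how the replacement is applied internally.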

In [46]:
data_clean.isnull().sum()

Outcome Date              0
Name                      0
Species                   0
Primary Breed             0
Sex                       0
Primary Color             0
Age (Months)              0
Altered                   0
Altered before arrival    0
Outcome Type              0
Intake Date               0
Intake Type               0
length_of_stay            0
Intake Type Grouped       0
dtype: int64

In [47]:
# Group rare intake types (<100 records) into "Other" for ML stability
intake_counts = data_clean['Intake Type'].value_counts()
rare_categories = intake_counts[intake_counts < 100].index
data_clean['Intake Type Grouped'] = data_clean['Intake Type'].replace({cat: 'Other' for cat in rare_categories})
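
The threshold here is applied to counts over both species combined, so a category that is rare for cats alone can still be kept. The same step can be wrapped in a small reusable function. This is a sketch with hypothetical names (`group_rare`, `min_count`, `other_label`) and toy data.

```python
import pandas as pd


def group_rare(series, min_count=100, other_label="Other"):
    """Replace categories seen fewer than `min_count` times with `other_label`."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    # Keep values that are not rare; swap the rest for the catch-all label.
    return series.where(~series.isin(rare), other_label)


# Toy intake types; min_count=2 is chosen only to suit this tiny sample.
toy = pd.Series(["Stray"] * 5 + ["Return"] * 3 + ["Born In Care", "Wildlife In"])
grouped = group_rare(toy, min_count=2)
print(grouped.value_counts())
```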

In [48]:
# Export species-specific datasets with grouped intake types
os.makedirs('../data/processed', exist_ok=True)  # make sure the output folder exists
data_cats = data_clean[data_clean['Species'] == 'Cat']
data_cats.to_csv('../data/processed/data_cats.csv', index=False)
data_cats.shape


(16901, 14)

In [49]:
data_dogs = data_clean[data_clean['Species'] == 'Dog']
data_dogs.to_csv('../data/processed/data_dogs.csv', index=False)
data_dogs.shape

(10726, 14)

In [50]:
data_cats.head()

Unnamed: 0,Outcome Date,Name,Species,Primary Breed,Sex,Primary Color,Age (Months),Altered,Altered before arrival,Outcome Type,Intake Date,Intake Type,length_of_stay,Intake Type Grouped
0,2021-01-01,Nova,Cat,Domestic Shorthair,Female,Brown,48,No,Yes,Adoption,2020-12-17,Owner/Guardian Surrender,15,Owner/Guardian Surrender
2,2021-01-01,Tinsel,Cat,Domestic Longhair,Female,Black,3,Yes,No,Adoption,2020-12-04,Stray,28,Stray
3,2021-01-01,Mustard,Cat,Domestic Shorthair,Male,Orange,5,Yes,No,Adoption,2020-12-03,Stray,29,Stray
4,2021-01-01,Dionne,Cat,Domestic Shorthair,Female,Grey,3,Yes,No,Adoption,2020-12-28,Stray,4,Stray
6,2021-01-01,Ranger,Cat,Domestic Shorthair,Female,Orange,2,Yes,No,Adoption,2020-10-25,Stray,68,Stray


In [51]:
data_dogs.head()

Unnamed: 0,Outcome Date,Name,Species,Primary Breed,Sex,Primary Color,Age (Months),Altered,Altered before arrival,Outcome Type,Intake Date,Intake Type,length_of_stay,Intake Type Grouped
1,2021-01-01,Bud,Dog,Mixed Breed (Medium),Male,Black,36,Yes,No,Adoption,2020-12-29,Return,3,Return
5,2021-01-01,Kiko,Dog,Mixed Breed (Medium),Male,Black,31,Yes,No,Return to Owner/Guardian,2021-01-01,Stray,0,Stray
21,2021-01-01,Boomer,Dog,Mixed Breed (Small),Male,White,84,Yes,No,Adoption,2020-12-26,Stray,6,Stray
27,2021-01-01,Opal,Dog,Mixed Breed (Medium),Female,Black,36,No,Yes,Adoption,2020-12-29,Transfer In,3,Transfer In
28,2021-01-01,Ember,Dog,Mixed Breed (Medium),Female,White,48,Yes,No,Adoption,2020-12-26,Stray,6,Stray


In [52]:
cat_color_counts = data_cats['Primary Color'].value_counts()
cat_color_counts

Primary Color
Black       5767
Grey        3381
Brown       3002
Orange      1730
White       1238
Calico       614
Tan          309
Tortoise     292
Buff         227
Torbie       113
Seal          54
Apricot       47
Albino        34
Yellow        19
Lilac         18
Blue          15
Blond         14
Lynx           9
Fawn           8
Red            5
Wheaten        5
Name: count, dtype: int64

In [53]:
dog_color_counts = data_dogs['Primary Color'].value_counts()
dog_color_counts

Primary Color
Black      2920
Brown      2243
Tan        1881
White      1642
Grey        785
Brindle     630
Yellow      170
Buff        145
Blue         74
Blond        71
Fawn         69
Wheaten      62
Apricot      23
Sable        11
Name: count, dtype: int64
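
Several colors have very few animals, which matters for any later per-color comparison. The sketch below flags them, using the tail of the cat counts printed above. The `min_n` cutoff is a hypothetical choice, not a threshold used elsewhere in this notebook.

```python
import pandas as pd

# Tail of the cat color counts from the output above.
cat_tail = pd.Series({
    "Torbie": 113, "Seal": 54, "Apricot": 47, "Albino": 34,
    "Yellow": 19, "Lilac": 18, "Blue": 15, "Blond": 14,
    "Lynx": 9, "Fawn": 8, "Red": 5, "Wheaten": 5,
})

min_n = 30  # hypothetical minimum group size for a per-color comparison
small = cat_tail[cat_tail < min_n]

total_cats = 16901  # rows in data_cats, from the shape shown earlier
share = small.sum() / total_cats * 100

print(f"Colors below {min_n} cats: {small.index.tolist()}")
print(f"They cover {small.sum()} cats ({share:.2f}% of all cats)")
```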

In [54]:
cat_breed_counts = data_cats['Primary Breed'].value_counts()
cat_breed_counts

Primary Breed
Domestic Shorthair      15534
Domestic Medium Hair     1002
Domestic Longhair         357
American Shorthair          7
Siamese                     1
Name: count, dtype: int64

In [55]:
dog_breed_counts = data_dogs['Primary Breed'].value_counts()
dog_breed_counts

Primary Breed
Mixed Breed (Medium)         5626
Mixed Breed (Large)          2546
Mixed Breed (Small)          2543
Boxer                           2
Retriever, Black Labrador       2
Terrier                         2
Hound, Bloodhound               1
Hound, Basset                   1
Retriever, Labrador             1
Beagle                          1
Pekingese                       1
Name: count, dtype: int64

In [56]:
cat_outcome_counts = data_cats['Outcome Type'].value_counts()
cat_outcome_counts

Outcome Type
Adoption                    16693
Return to Owner/Guardian      208
Name: count, dtype: int64

In [57]:
dog_outcome_counts = data_dogs['Outcome Type'].value_counts()
dog_outcome_counts

Outcome Type
Adoption                    9533
Return to Owner/Guardian    1193
Name: count, dtype: int64
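
The overall adoption rate hides a difference between species. As a worked example, the per-species rates can be derived directly from the counts printed above:

```python
# Outcome counts copied from the two outputs above.
outcomes = {
    "Cat": {"Adoption": 16693, "Return to Owner/Guardian": 208},
    "Dog": {"Adoption": 9533, "Return to Owner/Guardian": 1193},
}

# Adoption rate = adoptions / all recorded outcomes for that species.
rates = {
    species: counts["Adoption"] / sum(counts.values()) * 100
    for species, counts in outcomes.items()
}

for species, rate in rates.items():
    print(f"{species}: {rate:.1f}% adopted")
# Cat: 98.8% adopted
# Dog: 88.9% adopted
```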

In [58]:
cat_intake_counts = data_cats['Intake Type'].value_counts()
cat_intake_counts

Intake Type
Stray                       11852
Owner/Guardian Surrender     3017
Transfer In                  1316
Return                        681
Service In                     23
Born In Care                   11
Wildlife In                     1
Name: count, dtype: int64

In [59]:
dog_intake_counts = data_dogs['Intake Type'].value_counts()
dog_intake_counts

Intake Type
Stray                       6008
Owner/Guardian Surrender    2671
Return                      1219
Transfer In                  725
Service In                    97
Born In Care                   6
Name: count, dtype: int64

In [60]:
# Sex distribution analysis (uses the raw `data`, so totals include rows dropped during cleaning)
print("Overall Sex Distribution:")
overall_sex_counts = data['Sex'].value_counts()
print(overall_sex_counts)
print(f"\nPercentages:")
print(overall_sex_counts / len(data) * 100)

Overall Sex Distribution:
Sex
Male       14361
Female     13265
Unknown       10
Name: count, dtype: int64

Percentages:
Sex
Male       51.964828
Female     47.998987
Unknown     0.036185
Name: count, dtype: float64


In [61]:
print("Cat Sex Distribution:")
cat_sex_counts = data_cats['Sex'].value_counts()
print(cat_sex_counts)
print(f"Percentages:")
print(cat_sex_counts / len(data_cats) * 100)

Cat Sex Distribution:
Sex
Female     8506
Male       8390
Unknown       5
Name: count, dtype: int64
Percentages:
Sex
Female     50.328383
Male       49.642033
Unknown     0.029584
Name: count, dtype: float64


In [62]:
print("Dog Sex Distribution:")
dog_sex_counts = data_dogs['Sex'].value_counts()
print(dog_sex_counts)
print(f"Percentages:")
print(dog_sex_counts / len(data_dogs) * 100)

Dog Sex Distribution:
Sex
Male       5966
Female     4755
Unknown       5
Name: count, dtype: int64
Percentages:
Sex
Male       55.621853
Female     44.331531
Unknown     0.046616
Name: count, dtype: float64
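
The three cells above can also be collapsed into a single table. A minimal sketch using `pd.crosstab` with row-wise normalization, shown on toy data rather than the real frame:

```python
import pandas as pd

# Toy stand-in for data_clean with only the two columns needed.
toy = pd.DataFrame({
    "Species": ["Cat", "Cat", "Cat", "Cat", "Dog", "Dog", "Dog", "Dog"],
    "Sex": ["Female", "Female", "Male", "Male", "Male", "Male", "Male", "Female"],
})

# normalize="index" makes each row (species) sum to 1; scale to percent.
sex_share = pd.crosstab(toy["Species"], toy["Sex"], normalize="index") * 100
print(sex_share.round(1))
```

Applied to `data_clean`, the same call would give the per-species percentages side by side.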


## Summary

We're working with 27,636 animals from Jacksonville Humane Society whose outcomes were recorded between January 2021 and December 2023 (some intake dates fall in late 2020). Most are cats (16,903) with fewer dogs (10,733), which is pretty typical for shelter populations. The data includes everything we need: species, breeds, colors, ages, sex, how they came in, and what happened to them.

Cleaning the data was straightforward. We dropped three sparsely populated columns (Secondary Breed, Secondary Color, and Pattern), removed 9 incomplete records (27,636 rows down to 27,627), and standardized the color names, since the raw data used 42 labels, many of them describing essentially the same color. For example, "Brown," "Chocolate," and "Mahogany" all got grouped together.

The sex distribution is interesting. Overall there are slightly more males (52%) than females (48%), but this varies by species. Dogs skew heavily male (55.6% vs 44.3%), while cats are almost perfectly split. Only 10 animals had unknown sex recorded.

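The good news is that the cleaned data is in good shape. Almost 95% of animals were adopted (the only other outcome recorded is return to owner/guardian), the remaining columns have no missing values, and the common colors have hundreds or thousands of animals each. Some rare colors have very few animals (for example, only 5 cats each for Red and Wheaten), so comparisons involving them will need care. The data has been split into separate cat and dog files for the statistical analysis.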