For cats, average total claims clearly trend toward the weighted average once a breed's pet count rises above 25. We see a similar trend for dogs, but the convergence toward the weighted average is weaker until we reach breeds with counts over 125.

It makes sense, then, to pick a threshold for the minimum count of pets per breed. Choosing that threshold is not straightforward, though.
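
One way to make the trade-off concrete is to tabulate, for a few candidate thresholds, how many breeds would be kept and what share of pets they cover. The following is a minimal sketch under stated assumptions: `toy_breeds` is hypothetical data standing in for `pets_by_breed`, and `summarize_thresholds` is an illustrative helper, not part of the original notebook.

```python
import pandas as pd

# Hypothetical per-breed counts standing in for the notebook's pets_by_breed
toy_breeds = pd.DataFrame({
    "Breed": ["A", "B", "C", "D", "E", "F"],
    "Count": [5, 20, 60, 130, 400, 1500],
})

def summarize_thresholds(breeds, thresholds):
    """Report, per candidate threshold, the breeds kept and the share of pets they cover."""
    total_pets = breeds["Count"].sum()
    rows = []
    for t in thresholds:
        kept = breeds[breeds["Count"] >= t]
        rows.append({
            "threshold": t,
            "breeds_kept": len(kept),
            "share_of_pets_kept": kept["Count"].sum() / total_pets,
        })
    return pd.DataFrame(rows)

summary = summarize_thresholds(toy_breeds, [25, 50, 125])
print(summary)
```

A higher threshold leaves fewer, more stable breed categories but pushes more pets into the *Other* groups, so a table like this can help choose a value between the two extremes.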

Here's a list of the steps we'll follow:
1. Set a threshold and save a list of breeds with counts greater than or equal to the threshold
2. Write a function to update the breed for a row based on whether or not it exists in the list from step 1
3. Create a copy of our original df and apply the function
4. Print out the before and after numbers for our count of unique breeds

As part of step 2 above, we'll update the breed name for breeds with a low pet count to group them together in an *Other* category. To ensure we don't lose any species-specific information, we'll create two versions of *Other*, 'Other Cat' and 'Other Dog'. 

Let's get started.

In [None]:
# Set threshold
threshold = 50

# Preserve list of breeds with a count greater than or equal to the threshold
pet_breeds = pets_by_breed[pets_by_breed.Count >= threshold].Breed.tolist()

# Create function to update breed column based on threshold
def update_breed(row):
    # Keep the breed name if it meets the threshold
    if row["Breed"] in pet_breeds:
        return row["Breed"]
    # Otherwise group it into a species-specific 'Other' category
    if row["Species"] == 'Cat':
        return 'Other Cat'
    return 'Other Dog'

# Print number of unique breeds before update
print("Number of unique breeds before: " + str(df.Breed.nunique()))

# Create copy of original df and apply function to update Breed
df_new = df.copy()
df_new["Breed"] = df_new.apply(update_breed, axis=1)
print("Number of unique breeds after: " + str(df_new.Breed.nunique()))
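
The row-wise `apply` above can be slow on a large DataFrame. As an alternative, the same grouping can be expressed with vectorized pandas operations. This is a minimal sketch, assuming hypothetical stand-ins `toy_df` and `common_breeds` for the notebook's `df` and `pet_breeds`:

```python
import pandas as pd

# Hypothetical stand-ins for the notebook's df and pet_breeds
toy_df = pd.DataFrame({
    "Breed": ["Labrador", "Rare Hound", "Siamese", "Rare Cat"],
    "Species": ["Dog", "Dog", "Cat", "Cat"],
})
common_breeds = {"Labrador", "Siamese"}

# Fallback label per row; anything that is not 'Cat' becomes 'Other Dog',
# matching the behavior of update_breed
fallback = toy_df["Species"].map(lambda s: "Other Cat" if s == "Cat" else "Other Dog")

# Keep the breed where it is common, otherwise use the species-specific fallback
toy_new = toy_df.copy()
toy_new["Breed"] = toy_df["Breed"].where(toy_df["Breed"].isin(common_breeds), fallback)
print(toy_new["Breed"].tolist())
```

`Series.where` keeps values where the condition is true and substitutes the aligned fallback elsewhere, so the result should match the row-wise version.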


Now that we've grouped breeds with a pet count below our threshold into the *Other* categories, let's replot the data.

In [None]:
# Group pets by breed and aggregate claims data columns
pets_by_breed = df_new.groupby(by=['Breed', 'Species']).agg({'PetId': ['count'],
                                                         'AmtClaimsTotal': ['mean']}).reset_index()
pets_by_breed.columns = ['Breed', 'Species', 'Count', 'AvgTotalClaims']

# Calculate weighted average
pets_by_breed["weighted_total"] = pets_by_breed["Count"] * pets_by_breed["AvgTotalClaims"]
weighted_avg = pets_by_breed["weighted_total"].sum() / pets_by_breed["Count"].sum()

# Create a scatterplot showing count of breed vs avg total claims
fig, ax = plt.subplots(1, 1, figsize=(16, 8))

sns.scatterplot(x="Count", y="AvgTotalClaims", data=pets_by_breed, hue='Species', hue_order=['Dog', 'Cat'],
                alpha=0.7, size="Count", sizes=(20, 200)).set(xlabel="Count of Pets",
                                                              ylabel="Average Total Claims, USD")

# Plot line showing the average for all pets
plt.axhline(weighted_avg, color='g', linestyle='dashed', linewidth=1)
plt.text(3500, 1525, "All Breeds Weighted Average")

# Add title and display plot
plt.title("Count of Pets vs. Average Total Claims Amount, by Breed and Species", y=1.02, fontsize=14)

plt.legend(loc='upper right')
plt.show()

After grouping together the breeds with a low pet count, we see much less variability at the low end of the range. In fact, no breed now appears to have \$0 in average total claims, which is a more plausible result.

All of that said, we still see a fair bit of variability for breeds with fewer than about 1,200 pets. For now, let's assume that this variability reflects breeds that really are more expensive. We can come back and make adjustments later if we get poor results from our predictive model.
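
One way to probe that assumption is to compare each breed's mean against its standard error, since means computed from fewer pets are noisier. The following is a minimal sketch, assuming a hypothetical per-pet table `toy_claims` in place of `df_new`. The `z_vs_overall` column is an illustrative name, not part of the original analysis.

```python
import pandas as pd

# Hypothetical per-pet claims; in the notebook this would be df_new
toy_claims = pd.DataFrame({
    "Breed": ["A", "A", "A", "A", "B", "B"],
    "AmtClaimsTotal": [100, 200, 300, 400, 0, 1000],
})

# Per-breed count, mean, and standard error of the mean (pandas uses ddof=1 for sem)
breed_stats = toy_claims.groupby("Breed")["AmtClaimsTotal"].agg(["count", "mean", "sem"])

# Distance of each breed mean from the overall mean, in units of standard error
overall_mean = toy_claims["AmtClaimsTotal"].mean()
breed_stats["z_vs_overall"] = (breed_stats["mean"] - overall_mean) / breed_stats["sem"]
print(breed_stats)
```

A breed whose mean sits several standard errors from the overall average is more likely to be genuinely expensive than one whose gap is within a standard error or two.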

# PCA


In [None]:
# Filter df down to a subset of features
cols = ['Breed', 'AgeYr1', 'AmtClaimsYr1', 'NumClaimsYr1']

# Create a new dataframe and set the index to Breed
df_new_scale = df_new[cols].set_index('Breed')

# Save the breed labels
df_new_index = df_new_scale.index

# Save the column names
df_new_columns = df_new_scale.columns
df_new_scale.head()

In [None]:
# Scale the data
df_new_scale = scale(df_new_scale)

In [None]:
# Create a new dataframe using saved column names
df_new_scaled_df = pd.DataFrame(df_new_scale, columns=df_new_columns)
df_new_scaled_df.head()

In [None]:
# Verify the scaling: each column mean should be approximately 0
df_new_scaled_df.mean()

In [None]:
# Verify scaled std: each column should be approximately 1 (ddof=0 matches the population std used by scale)
df_new_scaled_df.std(ddof=0)

In [None]:
# Fit the PCA transformation
pets_pca = PCA().fit(df_new_scale)

In [None]:
# Plot the result
plt.subplots(figsize=(10, 6))
plt.plot(pets_pca.explained_variance_ratio_.cumsum())
plt.xlabel('Component #')
plt.ylabel('Cumulative variance ratio')
plt.title('Cumulative variance ratio explained by PCA components for pet features')
plt.show()

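Looking at the results from PCA, we can see that about 85% of the variance is explained by the first few principal components rather than by individual features. Note that only three features (`AgeYr1`, `AmtClaimsYr1`, and `NumClaimsYr1`) went into the transformation, so there are at most three components in total. This information may be helpful down the road in preprocessing and model creation, as it gives us a better foundation for understanding our data.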
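
Rather than reading the value off the plot, the number of components needed to reach a given share of variance can be computed from the fitted PCA. This is a minimal sketch, assuming synthetic data `X` in place of the scaled pet features and an assumed target of 0.85:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

# Hypothetical data: 200 rows, 3 features (two of them correlated),
# standing in for the scaled pet features
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
X = np.hstack([
    base + 0.1 * rng.normal(size=(200, 1)),
    base + 0.5 * rng.normal(size=(200, 1)),
    rng.normal(size=(200, 1)),
])
X_scaled = scale(X)

pca = PCA().fit(X_scaled)
cum_var = pca.explained_variance_ratio_.cumsum()

target = 0.85  # assumed target share of variance
# Index of the first cumulative ratio that reaches the target,
# converted to a 1-based component count
n_components = int(np.searchsorted(cum_var, target) + 1)
print(cum_var, n_components)
```

Because the cumulative ratio is non-decreasing, `np.searchsorted` finds the first position at or above the target. Keep in mind that the plot's x-axis is zero-based, so position 0 corresponds to the first component.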

# TODO
* Should I do more with the PCA results here before moving on to the summary?