**If you found this notebook helpful, please upvote so that others can see it too :)**

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns # data visualization
import matplotlib.pyplot as plt #more data visualization
plt.style.use('fivethirtyeight')
import warnings
warnings.filterwarnings('ignore') # ignore warnings
from scipy.stats import ttest_ind # for the t-test we'll be doing
from subprocess import check_output 
print(check_output(["ls", "../input"]).decode("utf8"))


# Question: Which dog breeds bite most frequently?

Such a controversial topic!  This data can be useful when talking about public safety and pet ownership in the public square.  One thing to keep in mind is that this dataset comes from a concentrated area of Louisville, Kentucky, so our results may be skewed by which dogs are most commonly owned there.

With that in mind, let's read our data in!

In [None]:
# Read in the Data
bites = pd.read_csv("../input/Health_AnimalBites.csv")
bites.head()

In [None]:
bites.shape #look at how many rows we have (rows, columns)

If you peek at the input data up top, it looks like we have cats **AND** dogs in our dataset.  Let's get rid of everything besides dogs.

In [None]:
# Create a dataframe where there are only dogs included
dogs = bites.loc[bites['SpeciesIDDesc'] == 'DOG', :]
dogs.shape # prints out (rows, columns)

I also noticed that a lot of the values in the 'BreedIDDesc' column (dog breed) are missing.  Let's get rid of those rows.
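
Before dropping anything, it can help to see how much of the breed column is actually missing. The sketch below uses a small made-up dataframe (`toy_dogs` and its rows are hypothetical, not taken from the real file) to show one way to count the gaps; on the real data the same two lines could be pointed at `dogs` instead.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows standing in for the real `dogs` dataframe
toy_dogs = pd.DataFrame({
    "SpeciesIDDesc": ["DOG"] * 5,
    "BreedIDDesc": ["PIT BULL", np.nan, "LABRADOR", np.nan, "PIT BULL"],
})

n_missing = toy_dogs["BreedIDDesc"].isna().sum()       # rows with no breed recorded
share_missing = toy_dogs["BreedIDDesc"].isna().mean()  # fraction of rows with no breed
print(f"{n_missing} of {len(toy_dogs)} rows have no breed ({share_missing:.0%})")
```

If the missing share is large, the breed ranking further down only describes the bites where someone wrote the breed down.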

In [None]:
dogs_with_breed = dogs.dropna(subset = ['BreedIDDesc'])
dogs_with_breed.shape # prints out (rows, columns)

Ok, looks like we're down to a sample set of 3,755 dogs.  Now let's answer our question by visualizing this data!

In [None]:
with sns.plotting_context('notebook', font_scale=2): # just makes our breed names bigger
    sns.set_style("whitegrid") # makes our plot have a white background
    fig, ax = plt.subplots(figsize=(20,25)) # makes our plot larger (plt.subplots returns a (figure, axes) pair)
    
    #Plot the number of dogs in each breed
    sns.countplot(y= 'BreedIDDesc' # the breeds go on our y axis
                  , data = dogs_with_breed # tells sns.countplot which dataset we're using
                  , order = dogs_with_breed['BreedIDDesc'].value_counts().index # Orders our results by size
                 )
    #Change aesthetic stuff
    plt.xticks(rotation=90) # rotates our x-axis labels so that they're readable
    plt.title('Count of Dog Bites by Breed', fontsize = 40) # Puts the title on with larger text size
    plt.xlabel('Count', fontsize = 35) # puts x axis label on with larger text size
    plt.ylabel('Breed', fontsize = 35) # puts y axis label on with larger text size
    plt.subplots_adjust(top=2, bottom=.8, left=0.10, right=0.95, hspace= 1
                        , wspace=0.5) # Changes the size of my bars and spacing

That answers our question: Pit Bulls lead by far in reported bites!  I suspect that Chihuahuas and Shih Tzus bite more frequently, but those bites aren't reported since they're such small dogs.
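
Raw counts can be hard to compare, so one option is to turn them into each breed's share of the reported bites. This is only a sketch on made-up rows (`toy_bites` and its breed values are hypothetical). It also does not correct for how many dogs of each breed live in the area, since that information is not in this dataset.

```python
import pandas as pd

# Hypothetical toy rows standing in for `dogs_with_breed`
toy_bites = pd.DataFrame({
    "BreedIDDesc": ["PIT BULL"] * 3 + ["CHIHUAHUA"] + ["SHIH TZU"],
})

# value_counts(normalize=True) returns fractions sorted from largest to smallest
breed_share = (
    toy_bites["BreedIDDesc"]
    .value_counts(normalize=True)
    .mul(100)   # convert fractions to percentages
    .round(1)
)
print(breed_share)
```

On the real data, swapping `toy_bites` for `dogs_with_breed` should give the percentage of breed-labelled bite reports attributed to each breed.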

I'm curious now: which breeds lead in the percentage of bites with a positive rabies result?

It looks like the "ResultsIDDesc" variable contains the info on whether the rabies result was positive or not. Let's clean that up a bit and see how much data we really have.
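
A quick tally of the outcome column, missing values included, gives a sense of how much usable data there is before any filtering. The rows below are invented for illustration (`toy_results` is hypothetical, and the "NEGATIVE" label is an assumption about what other values the column may hold).

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows standing in for `dogs_with_breed`
toy_results = pd.DataFrame({
    "BreedIDDesc": ["PIT BULL", "PIT BULL", "CHIHUAHUA", "SHIH TZU"],
    "ResultsIDDesc": ["UNKNOWN", np.nan, "NEGATIVE", "POSITIVE"],
})

# Count every outcome label, keeping NaN as its own bucket
outcome_counts = toy_results["ResultsIDDesc"].value_counts(dropna=False)
print(outcome_counts)

# Breed-by-outcome table; NaN is relabelled so it shows up as a column
by_breed = pd.crosstab(
    toy_results["BreedIDDesc"],
    toy_results["ResultsIDDesc"].fillna("MISSING"),
)
print(by_breed)
```

If most rows land in the unknown or missing buckets, any per-breed percentage built from the rest will rest on very few observations.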

In [None]:
# Find out how many of the bites had a known outcome

rabies_data = dogs_with_breed.loc[dogs_with_breed['ResultsIDDesc'] != 'UNKNOWN', :] # Get rid of "UNKNOWN"
rabies_data = rabies_data.dropna(subset = ['ResultsIDDesc']) # Get rid of "NaN"
rabies_data = rabies_data.loc[rabies_data['ResultsIDDesc'] == 'POSITIVE', :] # Keep only "POSITIVE" results (mask built from rabies_data itself)
print('(rows, columns) = ', rabies_data.shape)
rabies_data.head()

Ok, we have 1 dog classified by breed with a positive rabies result.  Not much data to work with here, haha!  Looks like we'll need to find a bigger dataset to really answer our question!  For now, just watch out for Pit Bulls :)