### Data to the Rescue
FetchMaker’s mission is to match prospective dog owners with their perfect pet. They have been collecting data on their adoptable dogs, and our goal is to analyze some of that data to gain valuable insights.


FetchMaker has provided us with data for a sample of dogs from their app, including the following attributes:
- `weight`, an integer representing how heavy a dog is in pounds
- `tail_length`, a float representing tail length in inches
- `age`, in years
- `color`, a string such as "brown" or "grey"
- `is_rescue`, a binary indicator (0 or 1) of whether the dog is a rescue

In [22]:
# Import libraries
import pandas as pd
import numpy as np

from scipy.stats import binom_test  # deprecated in newer SciPy releases in favor of binomtest
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd
from scipy.stats import chi2_contingency

# Load the data and print a sample
dogs = pd.read_csv("dog_data.csv")
print(dogs.head())

   is_rescue  weight  tail_length  age  color  likes_children  \
0          0       6         2.25    2  black               1   
1          0       4         5.36    4  black               0   
2          0       7         3.63    3  black               0   
3          0       5         0.19    2  black               0   
4          0       5         0.37    1  black               1   

   is_hypoallergenic      name      breed  
0                  0      Huey  chihuahua  
1                  0   Cherish  chihuahua  
2                  1     Becka  chihuahua  
3                  0     Addie  chihuahua  
4                  1  Beverlee  chihuahua  


FetchMaker estimates (based on historical data for all dogs) that 8% of dogs in their system are rescues. They would like to know if whippets are significantly more or less likely than other dogs to be a rescue.

In [14]:
# Save 'is_rescue' for whippets
whippet_rescue = dogs.is_rescue[dogs.breed == "whippet"]

# Count the number of 'rescues' and print the result
num_whippet_rescues = np.sum(whippet_rescue == 1)
print(num_whippet_rescues)

# Count the number of whippets in total
num_whippets = len(whippet_rescue)
# Print the result
print(num_whippets)

6
100


Now we use a hypothesis test to test the following null and alternative hypotheses:
- null: 8% of whippets are rescues
- alternative: more or less than 8% of whippets are rescues

For this test, we are focused on a single binary categorical variable, which indicates whether or not each whippet is a rescue. We want to compare the number of rescues in our sample to a hypothetical population-level proportion of 0.08. Therefore, we should use a binomial test.

In [15]:
# Run a binomial test 
pval = binom_test(num_whippet_rescues, num_whippets, .08)
# Print the result
print(pval)

0.5811780106238107


Using a significance threshold of 0.05, we fail to reject the null hypothesis: the p-value (about 0.58) is well above 0.05, so the proportion of whippets that are rescues is NOT significantly different from 8%.
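
For readers on a recent SciPy, the same test can be sketched with `binomtest`, which returns a result object rather than a bare p-value. This is a minimal sketch, assuming SciPy 1.7 or later and plugging in the counts printed above (6 rescues out of 100 whippets); the variable names are illustrative and not part of the original notebook.

```python
from scipy.stats import binomtest

# Counts taken from the output above: 6 rescues among 100 whippets
n_rescues = 6
n_whippets = 100
p_null = 0.08  # FetchMaker's historical rescue rate (null hypothesis)

# Two-sided test: is the whippet rescue rate different from 8%?
result = binomtest(n_rescues, n_whippets, p_null, alternative="two-sided")
pval_new_api = result.pvalue
print(pval_new_api)
```

The p-value should agree with the one printed by `binom_test` above, so the conclusion is unchanged.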

### Mid-Sized Dog Weights
Three of FetchMaker’s most popular mid-sized dog breeds are whippets, terriers, and pitbulls. We'd like to see if there is a significant difference in the average weights of these three breeds.

In [16]:
# Save the weight of whippets
wt_whippets = dogs.weight[dogs.breed == "whippet"]
# Save the weight of terriers
wt_terriers = dogs.weight[dogs.breed == "terrier"]
# Save the weight of pitbulls
wt_pitbulls = dogs.weight[dogs.breed == "pitbull"]

Now we want to run a single hypothesis test to address the following null and alternative hypotheses:
- null: whippets, terriers, and pitbulls all weigh the same amount on average
- alternative: whippets, terriers, and pitbulls do not all weigh the same amount on average (at least one pair of breeds has differing average weights)

This test addresses an association between two variables: a non-binary categorical variable (breed, with three possible options) and a quantitative variable (weight). It is not a good idea to run three separate two-sample t-tests here, because running multiple t-tests increases our chances of a type I error, or a false positive. 
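
To make the inflated error rate concrete, here is a small arithmetic sketch. It assumes, for simplicity, that the three pairwise tests are independent and each uses a threshold of 0.05; real pairwise tests on shared data are not fully independent, so treat the number as an approximation.

```python
# Probability of at least one false positive across several tests,
# assuming the tests are independent (a simplifying assumption)
alpha = 0.05          # per-test significance threshold
n_comparisons = 3     # whippet-terrier, whippet-pitbull, terrier-pitbull

fwer = 1 - (1 - alpha) ** n_comparisons
print(round(fwer, 4))
```

Under these assumptions the chance of at least one type I error rises from 5% to roughly 14%.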

In order to run a single hypothesis test with three categories, we should use an ANOVA.

In [18]:
# Run ANOVA test
fstat, pval = f_oneway(wt_whippets, wt_terriers, wt_pitbulls)
# Print the result
print(pval)

3.276415588274815e-17


Using a significance threshold of 0.05, we can conclude that at least one pair of dog breeds has significantly different average weights.
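
To see what the F statistic behind this p-value measures, the sketch below computes it by hand as the ratio of between-group to within-group variance and compares it with `f_oneway`. The weights are small made-up samples for illustration only, not FetchMaker data.

```python
import numpy as np
from scipy.stats import f_oneway

# Hypothetical weights (in pounds) for three groups; NOT the FetchMaker data
groups = [
    np.array([30.0, 32.0, 35.0, 31.0]),
    np.array([22.0, 25.0, 21.0, 24.0]),
    np.array([36.0, 38.0, 34.0, 40.0]),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)        # number of groups
n = len(all_values)    # total number of observations

# Variation of group means around the grand mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Variation of observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

f_manual = (ss_between / (k - 1)) / (ss_within / (n - k))
f_scipy, p_scipy = f_oneway(*groups)
print(f_manual, f_scipy, p_scipy)
```

A large F means the group means are far apart relative to the spread within each group, which is what drives a small p-value.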

We'll run another hypothesis test to determine which of those breeds (whippets, terriers, and pitbulls) weigh different amounts on average using an overall type I error rate of 0.05 for all three comparisons. 

For this case, we need Tukey’s range test. The inputs to `pairwise_tukeyhsd` are our two variables of interest: the weights and breeds of the dogs.

In [20]:
# Subset to just whippets, terriers, and pitbulls
dogs_wtp = dogs[dogs.breed.isin(["whippet", "terrier", "pitbull"])]

# Run Tukey's range test and print the result
output = pairwise_tukeyhsd(dogs_wtp.weight, dogs_wtp.breed)
print(output)

  Multiple Comparison of Means - Tukey HSD, FWER=0.05  
 group1  group2 meandiff p-adj   lower    upper  reject
-------------------------------------------------------
pitbull terrier   -13.24   -0.0 -16.7278 -9.7522   True
pitbull whippet    -3.34 0.0638  -6.8278  0.1478  False
terrier whippet      9.9    0.0   6.4122 13.3878   True
-------------------------------------------------------


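For any pair where `reject` is `True`, we conclude that those two breeds weigh significantly different amounts. Here, pitbulls and terriers differ, as do terriers and whippets; the difference between pitbulls and whippets is not significant (p-adj = 0.0638).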
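
The significant pairs can also be pulled out programmatically instead of being read off the table. The sketch below is an assumption-laden illustration: it uses made-up data with hypothetical group labels `a`, `b`, and `c`, and it assumes the results object exposes `reject` and `groupsunique` with pairs ordered as in the printed table (first vs. second, first vs. third, second vs. third).

```python
from itertools import combinations

import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical measurements and group labels; NOT the FetchMaker data
values = np.array([10, 11, 12, 10, 11, 12,
                   10, 11, 12, 11, 10, 12,
                   20, 21, 22, 20, 21, 22], dtype=float)
labels = np.array(["a"] * 6 + ["b"] * 6 + ["c"] * 6)

result = pairwise_tukeyhsd(values, labels, alpha=0.05)

# Pair each group combination with its reject flag (assumed table order)
pairs = combinations([str(g) for g in result.groupsunique], 2)
differing = [pair for pair, rej in zip(pairs, result.reject) if rej]
print(differing)
```

In this toy data, groups `a` and `b` have identical means, so only the pairs involving `c` should be flagged.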

### Poodle and Shihtzu Colors
FetchMaker wants to know if poodles and shihtzus come in different colors. To start, we'll subset the data and use it to create a contingency table of dog colors by breed (poodle vs. shihtzu).

In [25]:
# Subset to just poodles and shihtzus
dogs_ps = dogs[dogs.breed.isin(["poodle", "shihtzu"])]

# Create a contingency table of color vs. breed
Xtab = pd.crosstab(dogs_ps.color, dogs_ps.breed)
print(Xtab)

breed  poodle  shihtzu
color                 
black      17       10
brown      13       36
gold        8        6
grey       52       41
white      10        7


Now we want to run a hypothesis test for the following null and alternative hypotheses:
- null: There is no association between breed (poodle vs. shihtzu) and color
- alternative: There is an association between breed (poodle vs. shihtzu) and color

This case investigates an association between two categorical variables, so we can use a Chi-Square test.

In [24]:
# Run Chi-Square Test
chi2, pval, dof, exp = chi2_contingency(Xtab)
# Print the result
print(pval)

0.005302408293244593


Using a significance threshold of 0.05, we reject the null hypothesis and conclude that there is a significant association between breed and color: poodles and shihtzus differ significantly in their color distributions.
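
To show where this p-value comes from, the following sketch rebuilds the expected counts under the null hypothesis (row total times column total, divided by the grand total) and the Chi-Square statistic by hand. It re-enters the contingency table printed above as a plain NumPy array; the variable names are illustrative.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Contingency table from above: rows are colors, columns are (poodle, shihtzu)
observed = np.array([
    [17, 10],   # black
    [13, 36],   # brown
    [8, 6],     # gold
    [52, 41],   # grey
    [10, 7],    # white
])

# Expected counts if color were independent of breed
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected_manual = row_totals * col_totals / observed.sum()

# Chi-Square statistic: squared deviations scaled by expected counts
chi2_manual = ((observed - expected_manual) ** 2 / expected_manual).sum()

chi2_stat, pval_check, dof_check, expected_scipy = chi2_contingency(observed)
print(chi2_manual, chi2_stat, dof_check, pval_check)
```

The largest deviations from the expected counts come from the brown row, where shihtzus are far more common than independence would predict.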