<center><h4>FetchMaker</h4></center>

Congratulations! You’ve just started working at the hottest new tech startup, FetchMaker. FetchMaker’s mission is to match up prospective dog owners with their perfect pet. Data on thousands of adoptable dogs are in FetchMaker’s system, and it’s your job to analyze some of that data.

In [1]:
import numpy as np
import fetchmaker
from scipy.stats import binom_test
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd
from scipy.stats import chi2_contingency 

The attributes that FetchMaker keeps track of are:
* <code>weight</code>, an integer representing how heavy a dog is in pounds
* <code>tail_length</code>, a float representing tail length in inches
* <code>age</code>, in years
* <code>color</code>, a string such as <code>"brown"</code> or <code>"grey"</code>
* <code>is_rescue</code>, a boolean stored as <code>0</code> or <code>1</code>

The <code>fetchmaker</code> package lets you access this data for a specified breed of dog with the following format:

    fetchmaker.get_weight("poodle")
    
This returns a Pandas DataFrame of the weights of the poodles recorded in the system. The other methods are <code>get_tail_length</code>, <code>get_color</code>, <code>get_age</code>, and <code>get_is_rescue</code>, which all take a breed as an input.


Over the years, about <code>8%</code> of dogs in the FetchMaker system have been rescues. We want to know whether whippets are significantly more or less likely than that to be rescues.

In [2]:
whippet_rescue = fetchmaker.get_is_rescue("whippet")
num_whippet_rescues = np.count_nonzero(whippet_rescue)
num_whippets = np.size(whippet_rescue)

In [3]:
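# The p-value is the probability of seeing a rescue count at least this far
# from the expected 8% if whippets really do follow the overall rate.
whippet_pval = binom_test(num_whippet_rescues, n=num_whippets, p=0.08)
print("Binomial test p-value for whippet rescues (expected rate 8%): {}".format(whippet_pval))

Binomial test p-value for whippet rescues (expected rate 8%): 0.5811780106238098

With a p-value this large, we cannot reject the null hypothesis: whippets are not significantly more or less likely to be rescues than the 8% baseline.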
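To make explicit what a two-sided binomial test computes, here is a minimal sketch of an exact p-value using only the standard library. The helper names `binom_pmf` and `two_sided_binom_pvalue` and the counts (6 rescues out of 100 dogs) are hypothetical, not FetchMaker data. The sketch sums the probabilities of every outcome that is no more likely than the observed one, which is one common definition of the two-sided p-value.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n trials with success rate p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def two_sided_binom_pvalue(k, n, p):
    """Exact two-sided p-value: total probability of all outcomes that are
    no more likely than the observed count k under the null rate p."""
    observed = binom_pmf(k, n, p)
    # Small relative tolerance so ties are not lost to floating-point rounding.
    threshold = observed * (1 + 1e-7)
    total = sum(
        prob
        for prob in (binom_pmf(i, n, p) for i in range(n + 1))
        if prob <= threshold
    )
    return min(1.0, total)


# Hypothetical counts, for illustration only (not FetchMaker data).
example_pval = two_sided_binom_pvalue(6, 100, 0.08)
print("p-value:", example_pval)
```

A large p-value here means the observed count is consistent with the assumed rate, so the null hypothesis is not rejected.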


Three of our most popular mid-sized dog breeds are whippets, terriers, and pitbulls. Is there a significant difference in the average weights of these three dog breeds? Perform a comparative numerical test to determine if there is a significant difference.

In [4]:
whippet_weight = fetchmaker.get_weight("whippet")
terrier_weight = fetchmaker.get_weight("terrier")
pitbull_weight = fetchmaker.get_weight("pitbull")

# We use ANOVA because we are comparing a numerical variable (weight) across
# a categorical variable with three categories (the breeds).
# With only two categories, a two-sample t-test would be enough.
weight_pval = f_oneway(whippet_weight, terrier_weight, pitbull_weight).pvalue
print("ANOVA p-value for the average weights of whippets, terriers and pitbulls: {}".format(weight_pval))

ANOVA p-value for the average weights of whippets, terriers and pitbulls: 3.276415588274815e-17
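For intuition about what a one-way ANOVA measures, the following sketch computes the F statistic by hand: the ratio of between-group variance to within-group variance. The function name `one_way_anova_f` and the weights are hypothetical, and the sketch stops at the F statistic because turning it into a p-value requires the F distribution, which the standard library does not provide.

```python
def one_way_anova_f(*groups):
    """Return the one-way ANOVA F statistic for two or more groups."""
    all_values = [x for group in groups for x in group]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(group) / len(group) for group in groups]

    # Between-group sum of squares: how far each group mean is from the grand mean.
    ss_between = sum(
        len(group) * (mean - grand_mean) ** 2
        for group, mean in zip(groups, group_means)
    )
    # Within-group sum of squares: spread of values around their own group mean.
    ss_within = sum(
        (x - mean) ** 2
        for group, mean in zip(groups, group_means)
        for x in group
    )

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Hypothetical weights in pounds, for illustration only (not FetchMaker data).
f_stat = one_way_anova_f([38, 41, 40, 37], [30, 29, 33, 28], [44, 42, 45, 47])
print("F statistic:", f_stat)
```

A large F statistic means the group means are spread far apart relative to the variation inside each group, which is what drives a very small p-value.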


The p-value is far below 0.05, so we reject the null hypothesis that the three breeds have the same average weight. Next, we run Tukey's range test to determine which pairs of breeds differ from each other.

In [5]:
doggos_weight = np.concatenate([whippet_weight, terrier_weight, pitbull_weight])
labels = ['whippet'] * len(whippet_weight) + ['terrier'] * len(terrier_weight) + ['pitbull'] * len(pitbull_weight)

# 0.05 is the family-wise error rate (FWER) across all pairwise comparisons
tukey_results = pairwise_tukeyhsd(doggos_weight, labels, alpha=0.05)
print(tukey_results)

 Multiple Comparison of Means - Tukey HSD, FWER=0.05 
 group1  group2 meandiff p-adj   lower  upper  reject
-----------------------------------------------------
pitbull terrier   -13.24  0.001 -16.728 -9.752   True
pitbull whippet    -3.34 0.0639  -6.828  0.148  False
terrier whippet      9.9  0.001   6.412 13.388   True
-----------------------------------------------------


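This table tells us that terriers differ significantly in average weight from both pitbulls and whippets (both adjusted p-values are 0.001), while the difference between pitbulls and whippets is not significant at the 0.05 level (adjusted p-value 0.0639).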
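One way to read the <code>reject</code> column: a pair is flagged as different exactly when its confidence interval for the mean difference excludes zero. The short sketch below checks this against the rows printed above; the names `tukey_rows` and `interval_excludes_zero` are illustrative only.

```python
# Rows copied from the Tukey HSD output: (group1, group2, meandiff, lower, upper)
tukey_rows = [
    ("pitbull", "terrier", -13.24, -16.728, -9.752),
    ("pitbull", "whippet", -3.34, -6.828, 0.148),
    ("terrier", "whippet", 9.9, 6.412, 13.388),
]


def interval_excludes_zero(lower, upper):
    """True when the whole interval lies on one side of zero."""
    return lower > 0 or upper < 0


significant_pairs = [
    (group1, group2)
    for group1, group2, _, lower, upper in tukey_rows
    if interval_excludes_zero(lower, upper)
]
print(significant_pairs)
# → [('pitbull', 'terrier'), ('terrier', 'whippet')]
```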


We want to see if <code>"poodle"</code>s and <code>"shihtzu"</code>s have significantly different color breakdowns.
You can count the brown poodles with <code>np.count_nonzero(poodle_colors == "brown")</code>.

In [6]:
"""
Using a Chi Squared test (analyzing two categorical variables: breed and color)

contingency table: 

        poodle    shihtzu
black |   x    |     x
brown |   x    |     x
gold  |   x    |     x
grey  |   x    |     x
white |   x    |     x
"""

poodle_colors = fetchmaker.get_color("poodle")
shihtzu_colors = fetchmaker.get_color("shihtzu")

# One row per color, one column per breed: [poodle count, shihtzu count]
colors = ["black", "brown", "gold", "grey", "white"]
color_table = [
  [
    np.count_nonzero(poodle_colors == color),
    np.count_nonzero(shihtzu_colors == color)
  ]
  for color in colors
]

_, color_pval, _, _ = chi2_contingency(color_table)
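print("Chi-square test p-value for breed vs. color (poodle, shihtzu): {}".format(color_pval))

Chi-square test p-value for breed vs. color (poodle, shihtzu): 0.005302408293244593

With a p-value below 0.05, we reject the null hypothesis that color is independent of breed: poodles and shihtzus have significantly different color breakdowns.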
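As a rough illustration of what the chi-square test compares, the sketch below computes the expected counts under independence and the resulting test statistic for a contingency table. The function name `chi_square_statistic` and the counts in `example_table` are hypothetical, not FetchMaker data, and the sketch stops short of the p-value, which requires the chi-square distribution.

```python
def chi_square_statistic(table):
    """Return (statistic, degrees_of_freedom) for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count we would expect in this cell if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected

    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof


# Hypothetical counts: one row per color, columns are [poodle, shihtzu].
example_table = [[17, 10], [13, 11], [8, 9], [52, 71], [10, 9]]
example_stat, example_dof = chi_square_statistic(example_table)
print("statistic:", example_stat, "degrees of freedom:", example_dof)
```

With five colors and two breeds there are (5 - 1) × (2 - 1) = 4 degrees of freedom; the larger the statistic relative to that, the smaller the p-value.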
