## FetchMaker
Congratulations! You’ve just started working at the hottest new tech startup, FetchMaker. FetchMaker’s mission is to match up prospective dog owners with their perfect pet. Data on thousands of adoptable dogs are in FetchMaker’s system, and it’s your job to analyze some of that data.


Let's start by including a data interface called fetchmaker that will give you access to FetchMaker's dog data.

Use import fetchmaker at the top of your script.py file to import the fetchmaker package. The solution cells in this walkthrough import it under the alias fm, alongside NumPy as np.

In [3]:
import numpy as np
import fetchmaker as fm

The attributes that FetchMaker keeps track of are:

* weight, an integer representing how heavy a dog is in pounds
* tail_length, a float representing tail length in inches
* age, in years
* color, a string such as "brown" or "grey"
* is_rescue, a boolean stored as 0 or 1

The fetchmaker package lets you access this data for a specific breed of dog with the following format:

fetchmaker.get_weight("poodle")
This returns a NumPy array of the weights of the poodles recorded in the system. The other methods are get_tail_length, get_color, get_age, and get_is_rescue, which all take a breed as an input.

* Get the tail lengths of all of the "rottweiler"s in the system, and store them in a variable called rottweiler_tl.

* Print out the mean of rottweiler_tl and the standard deviation of rottweiler_tl, using np.mean and np.std.

In [4]:
rottweiler_tl = fm.get_tail_length("rottweiler")

print "Average tail length of rottweiler is {} with standard deviation of {}".format(np.mean(rottweiler_tl), np.std(rottweiler_tl))

Average tail length of rottweiler is 4.2361 with standard deviation of 2.06475368749
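One detail worth knowing about the standard deviation printed above: np.std defaults to ddof=0, which is the population standard deviation. If the recorded dogs are treated as a sample of a larger population, np.std(rottweiler_tl, ddof=1) gives the sample estimate instead. The sketch below shows the difference using Python 3's standard statistics module, where pstdev corresponds to np.std's default and stdev corresponds to ddof=1. The tail lengths are made-up values for illustration, not FetchMaker data.

```python
import statistics

# Hypothetical tail lengths in inches (not taken from the fetchmaker package)
tail_lengths = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# Population standard deviation: divides by n, like np.std(x)
population_sd = statistics.pstdev(tail_lengths)

# Sample standard deviation: divides by n - 1, like np.std(x, ddof=1)
sample_sd = statistics.stdev(tail_lengths)

print("population sd: {:.4f}".format(population_sd))
print("sample sd:     {:.4f}".format(sample_sd))
```

The sample estimate is always slightly larger, and the gap shrinks as the number of dogs grows.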


Historically, we expect **8%** of dogs in the FetchMaker system to be rescues. We want to know whether whippets are significantly more or less likely than that to be rescues.

Store the is_rescue values for "whippet"s in a variable called whippet_rescue.

In [5]:
whippet_rescue = fm.get_is_rescue("whippet")

Use np.count_nonzero to get the number of entries in whippet_rescue that are 1. Store this number in a variable called num_whippet_rescues.

Get the number of samples in the whippet set by taking the np.size of whippet_rescue. Store this in a variable called num_whippets.

In [6]:
num_whippet_rescues = np.count_nonzero(whippet_rescue)
num_whippets = np.size(whippet_rescue)

Use a binomial test to test the number of whippet rescues, num_whippet_rescues, against our expected percentage, 8%.

Remember to import the binomial test by using from scipy.stats import binom_test.

In [7]:
from scipy.stats import binom_test, f_oneway, ttest_ind, chi2_contingency

pval = binom_test(num_whippet_rescues, num_whippets, p=0.08)
print pval

0.581178010624
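A p-value of about 0.58 is far above 0.05, so there is no evidence that whippets differ from the expected 8% rescue rate. To show what the binomial test is doing, here is a minimal Python 3 sketch of an exact two-sided test built from the binomial probability mass function. It sums the probability of every outcome that is no more likely than the observed one, which is the usual definition of the two-sided exact test. The helper names and the counts (6 rescues out of 100 dogs) are hypothetical, since the actual counts are not printed in this walkthrough. Note also that recent SciPy releases (1.12 and later) have removed binom_test in favor of binomtest, which returns a result object with a pvalue attribute.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def two_sided_binom_pvalue(k, n, p):
    """Exact two-sided p-value: total probability of all outcomes that are
    no more likely than the observed count k under the null rate p."""
    # A small relative tolerance guards against floating-point ties.
    threshold = binom_pmf(k, n, p) * (1 + 1e-7)
    return sum(
        binom_pmf(i, n, p)
        for i in range(n + 1)
        if binom_pmf(i, n, p) <= threshold
    )


# Hypothetical counts, for illustration only
num_rescues, num_dogs = 6, 100
pval = two_sided_binom_pvalue(num_rescues, num_dogs, 0.08)
print("p-value: {:.4f}".format(pval))
```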


Three of our most popular mid-sized dog breeds are whippets, terriers, and pitbulls. Is there a significant difference in the average weights of these three dog breeds? Perform a comparative numerical test to determine if there is a significant difference.

In [8]:
whippet_weight = fm.get_weight("whippet")
terrier_weight = fm.get_weight("terrier")
pitbull_weight = fm.get_weight("pitbull")

whippet_weight_avg = np.mean(whippet_weight)
terrier_weight_avg = np.mean(terrier_weight)
pitbull_weight_avg = np.mean(pitbull_weight)
print "Average weight of whippet is {}, terrier is {}, and pitbull is {}\n".format(whippet_weight_avg, terrier_weight_avg, pitbull_weight_avg)

fstat, weight_avg_pval = f_oneway(whippet_weight, terrier_weight, pitbull_weight)
# A small p-value means at least one breed's mean weight differs from the others
if weight_avg_pval < 0.05:
  print "There is a significant difference in average weight between the breeds"
else:
  print "There is no significant difference in average weight between the breeds"

Average weight of whippet is 40.82, terrier is 30.92, and pitbull is 44.16

There is a significant difference in average weight between the breeds
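For intuition about what f_oneway reports, the F statistic of a one-way ANOVA is the ratio of the variation between group means to the variation within groups. The following Python 3 sketch computes that ratio by hand. The function name and the tiny data set are hypothetical, and the sketch stops at the statistic because turning it into a p-value needs the F distribution, which SciPy supplies.

```python
def one_way_f_statistic(*groups):
    """F statistic for a one-way ANOVA over two or more groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means)
    )
    # Variation of each observation around its own group mean
    ss_within = sum(
        (x - m) ** 2 for g, m in zip(groups, means) for x in g
    )

    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


# Hypothetical weights for three small groups
f_stat = one_way_f_statistic(
    [1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]
)
print("F = {:.2f}".format(f_stat))
```

A large F means the group means are spread out relative to the noise inside each group, which is what drives the p-value down.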


Now, perform another test to determine which of the pairs of these dog breeds differ from each other.

In [9]:
tstat, w_t_pval = ttest_ind(whippet_weight, terrier_weight)
tstat, w_p_pval = ttest_ind(whippet_weight, pitbull_weight)
tstat, t_p_pval = ttest_ind(terrier_weight, pitbull_weight)

print "Whippet v Terrier weight - pval {}".format(w_t_pval)
print "Whippet v Pitbull weight - pval {}".format(w_p_pval)
print "Terrier v Pitbull weight - pval {}".format(t_p_pval)

Whippet v Terrier weight - pval 1.16992867788e-09
Whippet v Pitbull weight - pval 0.0374252984019
Terrier v Pitbull weight - pval 2.3360916239e-20
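Taken one at a time, all three p-values fall below 0.05. Running three separate t-tests, however, raises the chance of at least one false positive. One simple and conservative remedy is a Bonferroni correction, which divides the significance threshold by the number of comparisons. Tukey's range test is a common alternative. The Python 3 sketch below applies Bonferroni to the three p-values printed above. The variable names are illustrative only.

```python
# p-values copied from the pairwise t-test output above
pvals = {
    "Whippet v Terrier": 1.16992867788e-09,
    "Whippet v Pitbull": 0.0374252984019,
    "Terrier v Pitbull": 2.3360916239e-20,
}

alpha = 0.05
# Bonferroni: split the overall threshold across the comparisons
corrected_alpha = alpha / len(pvals)

significant = {pair: p < corrected_alpha for pair, p in pvals.items()}

print("Corrected threshold: {:.4f}".format(corrected_alpha))
for pair in sorted(significant):
    print("{}: significant = {}".format(pair, significant[pair]))
```

Under the corrected threshold of roughly 0.0167, the whippet versus pitbull difference no longer counts as significant, while the two comparisons involving terriers still do.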


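We want to see if "poodle"s and "shihtzu"s have significantly different color breakdowns.

Get the poodle colors and store them in a variable called poodle_colors.

Get the shih tzu colors and store them in a variable called shihtzu_colors.

Build a contingency table called color_table with one row per color and one column per breed, where each entry counts the dogs of that breed with that color.

Feed your color_table into SciPy's Chi Square test, save the p-value and print it out.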

Is there a significant difference?

In [10]:
# Get the colors recorded for each breed
poodle_colors = fm.get_color("poodle")
shihtzu_colors = fm.get_color("shihtzu")

# Build the contingency table: one row per color, one column per breed.
# Note: the rows come from the poodle colors only, so a color that appears
# only among shih tzus would not be counted.
unique_col = np.unique(poodle_colors)
color_table = [[np.count_nonzero(poodle_colors == color),
                np.count_nonzero(shihtzu_colors == color)]
               for color in unique_col]

# Chi-square test of independence between breed and color
chi2, pval, dof, expected = chi2_contingency(color_table)

print "Poodle v Shihtzu yield pval of {:.4f}".format(pval)

Poodle v Shihtzu yield pval of 0.0053
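Since 0.0053 is below 0.05, the color breakdowns of poodles and shih tzus differ significantly. To see where the chi-square statistic comes from, the Python 3 sketch below computes the expected counts under independence (row total times column total, divided by the grand total) and sums the squared, scaled deviations from them. The function name and the 3-by-2 table are hypothetical, not FetchMaker counts. The p-value step is left out because it needs the chi-square distribution that SciPy provides.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic, degrees of freedom, and expected
    counts for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    # Expected count in each cell if row and column were independent
    expected = [
        [r * c / grand_total for c in col_totals] for r in row_totals
    ]

    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, dof, expected


# Hypothetical counts: rows are colors, columns are breeds
table = [[10, 30], [20, 20], [30, 10]]
chi2, dof, expected = chi_square_statistic(table)
print("chi2 = {:.2f}, dof = {}".format(chi2, dof))
```

For a 2-by-2 table, chi2_contingency applies Yates' continuity correction by default, so its statistic would differ from this uncorrected sum. With more than two colors, as in the table above, the two agree.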
