# FetchMaker
##### Codecademy | Analyze Data with Python | Hypothesis Testing with SciPy
##### by Sebastian Hsiao
***
Congratulations! You’ve just started working at the hottest new tech startup, FetchMaker. FetchMaker’s mission is to match up prospective dog owners with their perfect pet. Data on thousands of adoptable dogs are in FetchMaker’s system, and it’s your job to analyze some of that data.



### Play around with the data

1. Let’s start by including a data interface called `fetchmaker` that will give you access to FetchMaker’s dog data.

Use `import fetchmaker` at the top of your script.py file to import the `fetchmaker` package.

In [1]:
import fetchmaker
import numpy as np
from scipy.stats import binom_test
from scipy.stats import f_oneway    # ANOVA Test
from statsmodels.stats.multicomp import pairwise_tukeyhsd
from scipy.stats import chi2_contingency    # Chi Square Contingency

2. The attributes that FetchMaker keeps track of are:

- `weight`, an integer representing how heavy a dog is in pounds
- `tail_length`, a float representing tail length in inches
- `age`, the dog's age in years
- `color`, a string such as `"brown"` or `"grey"`
- `is_rescue`, a boolean stored as `0` or `1`

The `fetchmaker` package lets you access this data for a specific breed of dog with the following format:

```py
fetchmaker.get_weight("poodle")
```

This returns a Pandas DataFrame of the weights of the poodles recorded in the system. The other methods are `get_tail_length`, `get_color`, `get_age`, and `get_is_rescue`, which all take a breed as an input.

Get the tail lengths of all of the `"rottweiler"`s in the system, and store them in a variable called `rottweiler_tl`.

In [2]:
fetchmaker.get_attribute('poodle', 'color')
fetchmaker.get_weight('rottweiler')
fetchmaker.get_tail_length
fetchmaker.get_age
fetchmaker.get_color
fetchmaker.get_is_rescue

<function fetchmaker.get_is_rescue(breed)>

3. Print out the mean and the standard deviation of `rottweiler_tl`, using `np.mean` and `np.std`.

In [3]:
rottweiler_tl = fetchmaker.get_tail_length('rottweiler')
print(np.mean(rottweiler_tl))
print(np.std(rottweiler_tl))

4.2360999999999995
2.0647536874891395
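One detail worth keeping in mind when reading the standard deviation above: by default `np.std` uses the population formula (`ddof=0`), which divides by n rather than n - 1. The sketch below uses Python's standard `statistics` module on made-up tail lengths (an assumption for illustration, not FetchMaker data) to show how the two versions differ.

```python
import statistics

# Hypothetical tail lengths in inches; not real FetchMaker data.
tail_lengths = [2.0, 4.0, 4.0, 5.0, 7.0, 8.0]

mean = statistics.fmean(tail_lengths)
# Population standard deviation: divides by n, like np.std(x) with the default ddof=0.
population_sd = statistics.pstdev(tail_lengths)
# Sample standard deviation: divides by n - 1, like np.std(x, ddof=1).
sample_sd = statistics.stdev(tail_lengths)

print(mean, population_sd, sample_sd)
```

The sample version is always slightly larger; with 100 or so dogs per breed the gap is small, but it is worth knowing which one is being reported.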


### Data to the rescue

4. Over the years, we have seen that about `8%` of dogs in the FetchMaker system are rescues. We want to know if whippets are significantly more or less likely to be rescues.

Store the `is_rescue` values for `"whippet"`s in a variable called `whippet_rescue`.


In [4]:
whippet_rescue = fetchmaker.get_is_rescue('whippet')
whippet_rescue;

5. Use `np.count_nonzero` to get the number of entries in `whippet_rescue` that are `1`. Store this number in a variable called `num_whippet_rescues`.


In [5]:
num_whippet_rescues = np.count_nonzero(whippet_rescue)
num_whippet_rescues

6

6. Get the number of samples in the whippet set by taking the `np.size` of `whippet_rescue`. Store this in a variable called `num_whippets`.


In [6]:
num_whippets = np.size(whippet_rescue)
num_whippets

100
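As a rough illustration of what these two counts represent, here is a plain-Python sketch on a short, made-up list of `is_rescue` flags (hypothetical values, not FetchMaker data). It mirrors what `np.count_nonzero` and `np.size` return for a one-dimensional input.

```python
# Hypothetical is_rescue flags for ten dogs (1 = rescue, 0 = not a rescue).
is_rescue = [0, 1, 0, 0, 0, 1, 0, 0, 0, 0]

# Equivalent of np.count_nonzero: count the entries that are not zero.
num_rescues = sum(1 for flag in is_rescue if flag != 0)
# Equivalent of np.size for a 1-D input: the number of entries.
num_dogs = len(is_rescue)

observed_rate = num_rescues / num_dogs
print(num_rescues, num_dogs, observed_rate)
```

For the whippets, 6 rescues out of 100 dogs is an observed rate of 6%, which is the figure to compare against the expected 8%.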

7. Use a binomial test to compare the number of whippet rescues, `num_whippet_rescues`, against our expected percentage, 8%.

Remember to import the binomial test by using `from scipy.stats import binom_test`.


In [7]:
# recall: pval = binom_test(x, n, p), where x is the observed count of successes
pval = binom_test(num_whippet_rescues, num_whippets, 0.08)


8. Print out the p-value. Is your result significant?

In [8]:
print('P-Value:',pval)


P-Value: 0.5811780106238111
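To make the p-value above less of a black box, here is a minimal pure-Python sketch of one common definition of the exact two-sided binomial test: add up the probabilities of every outcome that is no more likely than the observed one. The helper names are hypothetical, and this is an illustration rather than SciPy's own code. In recent SciPy releases `binom_test` has been replaced by `binomtest`, whose result exposes the p-value as `.pvalue`, so the import may need adjusting depending on your version.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n trials with success rate p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def two_sided_binomial_pvalue(x, n, p):
    """Sum the probabilities of all outcomes no more likely than the observed one."""
    observed = binomial_pmf(x, n, p)
    # Small relative tolerance so that ties are not lost to floating-point rounding.
    threshold = observed * (1 + 1e-7)
    return sum(
        prob
        for prob in (binomial_pmf(k, n, p) for k in range(n + 1))
        if prob <= threshold
    )


# Values from the whippet cells: 6 rescues out of 100 dogs, expected rate 8%.
p_value = two_sided_binomial_pvalue(6, 100, 0.08)
print(p_value)
print("significant" if p_value < 0.05 else "not significant")
```

At the conventional 0.05 threshold, a p-value of about 0.58 is not significant: the data give no evidence that whippets are rescued at a rate different from 8%.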


### Size does matter

9. Three of our most popular mid-sized dog breeds are whippets, terriers, and pitbulls. Is there a significant difference in the average weights of these three dog breeds? Perform a comparative numerical test to determine if there is a significant difference.

Hint
Use ANOVA for this scenario. First, use the line `from scipy.stats import f_oneway` to import SciPy’s ANOVA function.

In [9]:
# weight analysis: mid_sized dog breeds: whippet, terrier, pitbull
w_whippet = fetchmaker.get_weight('whippet')
w_terrier = fetchmaker.get_weight('terrier')
w_pitbull = fetchmaker.get_weight('pitbull')

# perform a one-way ANOVA on the three breeds' weights
# (f_oneway returns the F statistic, not a t statistic)
fstat, pval = f_oneway(w_whippet, w_terrier, w_pitbull)
print(pval)


3.276415588274815e-17
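The p-value above is far below 0.05, so at least one breed's average weight differs from the others. To show where the underlying F statistic comes from, here is a small pure-Python sketch that compares between-group variation with within-group variation. The function name and the tiny groups are made up for illustration; converting F to a p-value requires the F distribution, which is the part `f_oneway` handles.

```python
from statistics import fmean


def one_way_anova_f(*groups):
    """Return the one-way ANOVA F statistic for two or more groups of numbers."""
    all_values = [value for group in groups for value in group]
    grand_mean = fmean(all_values)
    group_means = [fmean(group) for group in groups]
    # Between-group sum of squares: how far each group mean sits from the grand mean.
    ss_between = sum(
        len(group) * (mean - grand_mean) ** 2
        for group, mean in zip(groups, group_means)
    )
    # Within-group sum of squares: spread of each value around its own group mean.
    ss_within = sum(
        sum((value - mean) ** 2 for value in group)
        for group, mean in zip(groups, group_means)
    )
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Hypothetical weights for three small groups; not FetchMaker data.
f_stat = one_way_anova_f([1, 2, 3], [2, 3, 4], [5, 6, 7])
print(f_stat)
```

A large F means the group means are spread out much more than the values within each group, which is what drives the p-value down.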


10. Now, perform another test to determine which pairs of these dog breeds differ from each other.

Hint
Use Tukey’s Range Test for this scenario. First, use the line `from statsmodels.stats.multicomp import pairwise_tukeyhsd` to import the test.

In [10]:
# perform Tukey's Range Test
v = np.concatenate([w_whippet, w_terrier, w_pitbull])
labels = ['w_whippet'] * len(w_whippet) + ['w_terrier'] * len(w_terrier) + ['w_pitbull'] * len(w_pitbull)
tukey_results = pairwise_tukeyhsd(v, labels, 0.05)
print(tukey_results)

    Multiple Comparison of Means - Tukey HSD, FWER=0.05    
  group1    group2  meandiff p-adj   lower    upper  reject
-----------------------------------------------------------
w_pitbull w_terrier   -13.24    0.0 -16.7278 -9.7522   True
w_pitbull w_whippet    -3.34 0.0638  -6.8278  0.1478  False
w_terrier w_whippet      9.9    0.0   6.4122 13.3878   True
-----------------------------------------------------------
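In the table above, `reject` is `True` for pitbull vs. terrier and for terrier vs. whippet, so those pairs differ significantly in weight, while pitbull vs. whippet (`p-adj` of 0.0638) does not reach the 0.05 threshold. The sketch below is not Tukey's test itself; it only illustrates how to read the `meandiff` column, assuming it follows the group2 minus group1 convention. The weights are invented for the example.

```python
from itertools import combinations
from statistics import fmean

# Hypothetical weights in pounds; not FetchMaker data.
weights = {
    "pitbull": [50, 54, 52],
    "terrier": [38, 40, 42],
    "whippet": [47, 49, 51],
}

# Sort the names so the pairs come out in the same alphabetical order as the table.
mean_diffs = {
    (group1, group2): fmean(weights[group2]) - fmean(weights[group1])
    for group1, group2 in combinations(sorted(weights), 2)
}
for (group1, group2), diff in mean_diffs.items():
    print(f"{group1} vs {group2}: meandiff = {diff:+.2f}")
```

A negative `meandiff` therefore means the second breed in the row is lighter on average than the first.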


### Categorical dog test

11. We want to see if `"poodle"`s and `"shihtzu"`s have significantly different color breakdowns.

Get the poodle colors and store them in a variable called `poodle_colors`.

Get the shih tzu colors and store them in a variable called `shihtzu_colors`.

Hint
```py 
fetchmaker.get_color("poodle")
fetchmaker.get_color("shihtzu")
```

In [11]:
# get dog's breed colors
poodle_colors = fetchmaker.get_color('poodle')
shihtzu_colors = fetchmaker.get_color('shihtzu')

12. You can get the number of occurrences of brown poodles by using `np.count_nonzero(poodle_colors == "brown")`.

Use this function to build a Chi Square contingency table, called `color_table`, with the following structure:

| Color | Poodle | Shih Tzu |
|-------|--------|----------|
| Black | x      | x        |
| Brown | x      | x        |
| Gold  | x      | x        |
| Grey  | x      | x        |
| White | x      | x        |

Fill in the “x” entries with the number of each poodle or shih tzu with the specified color.

Hint
The `color_table` can be defined like this:
```py
color_table = [[x, x], [x, x], [x, x], [x, x], [x, x]]
```
where each inner array is a row of the table.

In [12]:
# build the contingency table for the Chi Square test:
# one row per color, columns are [poodle count, shih tzu count]
color_table = [
  [np.count_nonzero(poodle_colors == 'black'), np.count_nonzero(shihtzu_colors == 'black')], 
  [np.count_nonzero(poodle_colors == 'brown'), np.count_nonzero(shihtzu_colors == 'brown')], 
  [np.count_nonzero(poodle_colors == 'gold'), np.count_nonzero(shihtzu_colors == 'gold')], 
  [np.count_nonzero(poodle_colors == 'grey'), np.count_nonzero(shihtzu_colors == 'grey')], 
  [np.count_nonzero(poodle_colors == 'white'), np.count_nonzero(shihtzu_colors == 'white')]
  ]
color_table

[[17, 10], [13, 36], [8, 6], [52, 41], [10, 7]]
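The same table can be built with less repetition by counting each breed's colors once and then looking up every color in a fixed order. The sketch below uses `collections.Counter` on short, made-up color lists (hypothetical samples, not FetchMaker data) and plain Python lists rather than the objects `fetchmaker` returns.

```python
from collections import Counter

# Hypothetical color samples; not FetchMaker data.
poodle_sample = ["black", "brown", "grey", "grey", "white"]
shihtzu_sample = ["brown", "brown", "gold", "grey", "black"]

colors = ["black", "brown", "gold", "grey", "white"]
poodle_counts = Counter(poodle_sample)
shihtzu_counts = Counter(shihtzu_sample)

# One row per color, one column per breed; a Counter returns 0 for a missing color.
table = [[poodle_counts[color], shihtzu_counts[color]] for color in colors]
print(table)
```

Keeping the color order in one list makes it harder to mismatch a row with its label when the table is read later.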

13. Feed your `color_table` into SciPy’s Chi Square test, save the p-value and print it out.

Is there a significant difference?

In [13]:
# perform Chi Square analysis
chi2, pval, dof, expected = chi2_contingency(color_table)
print('P-value', pval)

P-value 0.005302408293244593
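With a p-value of about 0.005, well under 0.05, the color breakdowns of poodles and shih tzus differ significantly. As a check on what the test does, the following pure-Python sketch recomputes the statistic from the `color_table` counts shown earlier: expected counts under independence, the chi-square sum, and the degrees of freedom. The p-value line relies on a closed form that holds only for exactly 4 degrees of freedom, which happens to be the case for a 5 by 2 table; it is a sketch, not a general replacement for `chi2_contingency`.

```python
from math import exp

# Observed counts: rows are colors, columns are [poodle, shih tzu].
observed = [[17, 10], [13, 36], [8, 6], [52, 41], [10, 7]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count under independence: row total * column total / grand total.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Chi-square statistic: sum of (observed - expected)^2 / expected over all cells.
chi2_stat = sum(
    (obs_count - exp_count) ** 2 / exp_count
    for obs_row, exp_row in zip(observed, expected)
    for obs_count, exp_count in zip(obs_row, exp_row)
)
dof = (len(row_totals) - 1) * (len(col_totals) - 1)

# Upper-tail probability of a chi-square variable; this formula is valid only for dof == 4.
p_value = exp(-chi2_stat / 2) * (1 + chi2_stat / 2)
print(chi2_stat, dof, p_value)
```

Most of the statistic comes from the brown row, where poodles fall well below and shih tzus well above the expected 24.5 dogs each.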


### Good learner! Have a treat!

14. Great job!

Feel free to play around with `fetchmaker` more and run some hypothesis tests of your own.

The breeds you can explore are `"poodle"`, `"rottweiler"`, `"whippet"`, `"greyhound"`, `"terrier"`, `"chihuahua"`, `"shihtzu"`, and `"pitbull"`.