# **Hands-on 13**
## **One-way ANOVA**
Created by Kathy Lam

<img src='https://static1.cbrimages.com/wordpress/wp-content/uploads/2019/07/Kaiba-stole-all-the-blue-eyed-white-dragon-cards-he-owns-and-killed-one-of-the-previous-owners-to-get-it.jpg?q=50&fit=contain&w=750&h=&dpr=1.5' width=600><br>


Suppose we have four trading card games and want to know whether any one of them is more or less expensive than the others. We collected card prices from Yugioh, Pokemon, Weiss Schwarz, and Magic, then stored them in the dataframe loaded below.

Here is the link to the data: `https://raw.githubusercontent.com/kathylambchops/data/main/tcg_prices.csv`


## **Question 1**: Read in the data and view the LAST 7 lines. (0 pt)

In [None]:
import pandas as pd

# Q1 answer

df = pd.read_csv('https://raw.githubusercontent.com/kathylambchops/data/main/tcg_prices.csv')
df.tail(7)

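| Unnamed: 0 | Yugioh | Pokemon | Weiss | Magic |
|---|---|---|---|---|
| 993 | NaN | NaN | 125.0 | 275 |
| 994 | NaN | NaN | 257.0 | 189 |
| 995 | NaN | NaN | 116.0 | 124 |
| 996 | NaN | NaN | 67.0 | 11 |
| 997 | NaN | NaN | 302.0 | 193 |
| 998 | NaN | NaN | 91.0 | 249 |
| 999 | NaN | NaN | NaN | 111 |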


## **Question 2**: Calculate the mean of each group and the grand mean. (2 pt)
#### **NOTE:** You should see from the previous problem that there are some NaN values in the dataframe; these are missing values. That means the groups DO NOT contain the same number of values. To find how many non-null values are in each column, you can use the `info()` method on the dataframe, or you can use the `count()` method on each column. To sum up all the values in a column you can use the pandas `sum()` method.

Helpful Syntax:     
`dataframe['column'].count()`  
`dataframe['column'].sum()`
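
Because the groups have different sizes, the grand mean is the total of all values divided by the total number of values, which is generally not the same as the average of the group means. The sketch below illustrates this with made-up numbers; `group_a` and `group_b` are hypothetical lists, not values from the dataset.

```python
# Hypothetical toy groups with different sample sizes (not from tcg_prices.csv)
group_a = [10, 20]
group_b = [30, 40, 50, 60]

mean_a = sum(group_a) / len(group_a)
mean_b = sum(group_b) / len(group_b)

# Unweighted average of the group means: treats both groups as equally large
mean_of_means = (mean_a + mean_b) / 2

# Grand mean: every observation counts exactly once
grand_mean_toy = (sum(group_a) + sum(group_b)) / (len(group_a) + len(group_b))

print(mean_of_means)   # 30.0
print(grand_mean_toy)  # 35.0
```

The larger group pulls the grand mean toward its own mean, which is why the sums and counts are combined before dividing.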

In [None]:
# Q2 answer

# store the sample size (number of non-null values) for each group
yugioh_n = df['Yugioh'].count()
pokemon_n = df['Pokemon'].count()
weiss_n = df['Weiss'].count()
magic_n = df['Magic'].count()

# find each group mean: sum of the non-null values / number of non-null values
yugioh_mean = df['Yugioh'].sum() / yugioh_n
pokemon_mean = df['Pokemon'].sum() / pokemon_n
weiss_mean = df['Weiss'].sum() / weiss_n
magic_mean = df['Magic'].sum() / magic_n

# calculate grand mean when groups have different sample sizes:
# total of all prices / total number of prices
total_sum = df['Yugioh'].sum() + df['Pokemon'].sum() + df['Weiss'].sum() + df['Magic'].sum()
total_n = yugioh_n + pokemon_n + weiss_n + magic_n
grand_mean = total_sum / total_n

print(f"Yugioh mean: {round(yugioh_mean, 2)}")
print(f"Pokemon mean: {round(pokemon_mean, 2)}")
print(f"Weiss mean: {round(weiss_mean, 2)}")
print(f"Magic mean: {round(magic_mean, 2)}")
print(f"Grand mean: {round(grand_mean, 2)}")


Yugioh mean: 151.21
Pokemon mean: 149.3
Weiss mean: 201.63
Magic mean: 206.02
Grand mean: 178.29


## **Question 3**: Find the sum of squares for between groups using each sample mean and the grand mean. (2 pt)

Recall:  
Sample sizes are different:  
$\large SS_{between}$ = $\large \sum{n_k (\bar{x}_k - \bar{x}_G)^2}$  

<br>


Sample sizes are the same:  
$\large SS_{between}$ = $\large n\sum{(\bar{x}_k - \bar{x}_G)^2}$
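
The two formulas agree whenever every group has the same size, since the common $n$ can be pulled out of the sum. A minimal sketch on made-up data follows; the helper `ss_between_weighted` and the list `equal_groups` are hypothetical names used only for illustration.

```python
import math

def ss_between_weighted(groups):
    """SS_between with each group's squared deviation weighted by its size."""
    n_total = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n_total
    return sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)

# Hypothetical toy groups, all of size n = 3 (not from tcg_prices.csv)
equal_groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
n = 3
means = [sum(g) / n for g in equal_groups]

# Averaging the group means gives the grand mean only because sizes are equal
grand = sum(means) / len(means)
shortcut = n * sum((m - grand) ** 2 for m in means)

print(math.isclose(ss_between_weighted(equal_groups), shortcut))  # True
```

For the card data the group sizes differ, so only the weighted form applies.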

In [None]:
def squared_dev(values_list, mean):
  """ Takes a list of values and a mean. Returns a list of squared deviations
  """
  sq_dev = []
  for xi in values_list:
    sq_dev.append((xi - mean)**2)

  return sq_dev

###################
# Q3 answer

# list xk stores all the group means
xk = [yugioh_mean, pokemon_mean, weiss_mean, magic_mean]

# list nk stores the sample sizes for each group
nk = [yugioh_n, pokemon_n, weiss_n, magic_n]

# sq_dev holds the squared deviations for the 4 groups by calling squared_dev() function
sq_dev = squared_dev(xk, grand_mean)

# we need to multiply sq_dev[0] with nk[0], then sq_dev[1] with nk[1]...etc
temp = []
for i in range(len(nk)): # i = 0,1,2,3
  temp.append(sq_dev[i]*nk[i])

ss_between = sum(temp)
print(f"SS_between: {ss_between}")

SS_between: 2742649.421705062


## **Question 4**: Find the sum of squares for within groups using each value and the mean of its own group. (2 pt)

$\large SS_{within}$ = $\Large \sum(x_i - \bar{x}_k)^2$  

<br>

#### **Important Tip:** To get rid of the NaN values in a column, you can use the Pandas `dropna()` method.  Doing `df.column.dropna()` will give you a series with only non-null values! You can feed this into the `squared_dev()` function now.

In [None]:
df.Yugioh.dropna()

0       93.0
1      228.0
2        9.0
3      285.0
4      288.0
       ...  
970    248.0
971    205.0
972     91.0
973     85.0
974    209.0
Name: Yugioh, Length: 975, dtype: float64

In [None]:
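# Q4 answer

# drop the NaN values so each series holds only the observed prices
yugioh = df.Yugioh.dropna()
pokemon = df.Pokemon.dropna()
weiss = df.Weiss.dropna()
magic = df.Magic.dropna()

# squared deviations of each price from the mean of its own group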
yugioh_sq_dev = squared_dev(yugioh, yugioh_mean)
pokemon_sq_dev = squared_dev(pokemon, pokemon_mean)
weiss_sq_dev = squared_dev(weiss, weiss_mean)
magic_sq_dev = squared_dev(magic, magic_mean)

ss_within = (sum(yugioh_sq_dev) + sum(pokemon_sq_dev) + sum(weiss_sq_dev) + sum(magic_sq_dev))
print(f"SS_within: {ss_within}")

SS_within: 40298048.69440373
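
A useful sanity check is that the between-group and within-group sums of squares add up to the total sum of squares, i.e. the squared deviations of every value from the grand mean. The sketch below verifies this on made-up data; `toy_groups` and the `*_toy` variables are hypothetical and not part of the assignment.

```python
import math

# Hypothetical toy groups with unequal sizes (not from tcg_prices.csv)
toy_groups = [[10, 20], [30, 40, 50, 60]]

all_values = [x for g in toy_groups for x in g]
toy_grand = sum(all_values) / len(all_values)

# Total: every value against the grand mean
ss_total_toy = sum((x - toy_grand) ** 2 for x in all_values)

# Between: each group mean against the grand mean, weighted by group size
ss_between_toy = sum(len(g) * (sum(g) / len(g) - toy_grand) ** 2 for g in toy_groups)

# Within: every value against the mean of its own group
ss_within_toy = sum((x - sum(g) / len(g)) ** 2 for g in toy_groups for x in g)

print(ss_between_toy, ss_within_toy, ss_total_toy)  # 1200.0 550.0 1750.0
print(math.isclose(ss_between_toy + ss_within_toy, ss_total_toy))  # True
```

The same comparison could be run on the notebook's `ss_between` and `ss_within` against the squared deviations of all non-null prices from `grand_mean`.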


## **Question 5**: Find the degrees of freedom for between groups and the degrees of freedom for within groups (1 pt)

$\large df_{between}$ = $\large k-1$

$\large df_{within}$ = $\large N-k$

In [None]:
# Q5 answer


k = 4
df_between = k - 1

N = yugioh_n + pokemon_n + weiss_n + magic_n
df_within = N - k

print(f"df_between: {df_between}")
print(f"df_within: {df_within}")

df_between: 3
df_within: 3820


## **Question 6**: Find the mean square for between and mean square for within (0.5 pt)

$\large SS_{between} / df_{between}$ = $\large MS_{between}$  

<br>

$\large SS_{within} / df_{within}$ = $\large MS_{within}$

In [None]:
# Q6 answer


ms_between = ss_between / df_between
ms_within = ss_within / df_within

print(f"MS_between: {ms_between}")
print(f"MS_within: {ms_within}")

MS_between: 914216.4739016873
MS_within: 10549.227406911972


## **Question 7**: Find the F-statistic. (0.5 pt)

$F$ = $\Large\frac{MS_{between}}{MS_{within}}$

In [None]:
# Q7 answer


F = ms_between / ms_within
print(f"F-statistic: {F}")

F-statistic: 86.6619363331463
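
One way to double-check a hand-computed F-statistic is to compare it with `scipy.stats.f_oneway`. The sketch below does this on made-up data and assumes SciPy is installed; `toy_groups` and the `*_toy` names are hypothetical.

```python
from scipy import stats

# Hypothetical toy groups (not from tcg_prices.csv)
toy_groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]

k_toy = len(toy_groups)
n_toy = sum(len(g) for g in toy_groups)
grand_toy = sum(sum(g) for g in toy_groups) / n_toy

# Same steps as Questions 3 to 7, applied to the toy data
ss_b = sum(len(g) * (sum(g) / len(g) - grand_toy) ** 2 for g in toy_groups)
ss_w = sum((x - sum(g) / len(g)) ** 2 for g in toy_groups for x in g)
f_manual = (ss_b / (k_toy - 1)) / (ss_w / (n_toy - k_toy))

# f_oneway takes one sequence per group and returns (F, p-value)
f_scipy, p_value = stats.f_oneway(*toy_groups)

print(round(f_manual, 6), round(float(f_scipy), 6))  # 13.0 13.0
```

On the notebook's data, the equivalent call would be `stats.f_oneway(yugioh, pokemon, weiss, magic)` with the NaN-free series from Question 4, and its statistic should agree with the F computed by hand.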


## **Question 8:** Find the F-critical value using the [F-table](https://www.stat.purdue.edu/~lfindsen/stat511/F_alpha_05.pdf) with an $\alpha$ level of 0.05 (0.5 pt)

In [None]:
# Q8 answer

# F-table value for alpha = 0.05 with df_between = 3 and a very large df_within
F_crit = 2.605
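
Printed F-tables only list a few denominator degrees of freedom, so the table value is an approximation for $df_{within} = 3820$. As a sketch, assuming SciPy is installed, the critical value can also be computed from the F distribution's inverse CDF; the variable names below are hypothetical, and the degrees of freedom are copied from Question 5.

```python
from scipy import stats

alpha = 0.05
df_between_example = 3     # k - 1, from Question 5
df_within_example = 3820   # N - k, from Question 5

# ppf is the inverse CDF: the value with (1 - alpha) of the distribution below it
f_crit_exact = stats.f.ppf(1 - alpha, df_between_example, df_within_example)

print(round(float(f_crit_exact), 3))
```

The result should be very close to the table value of 2.605, so the conclusion in the next question does not depend on which of the two is used.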

## **Question 9:** Compare F-statistic with F-critical value. Do we reject or retain the null? (0.5 pt)

In [None]:
# Q9 answer

if F > F_crit:
  print('We reject the null')
else:
  print('We fail to reject the null')


We reject the null


## **Question 10:** What does it mean? (1 pt)
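Because the F-statistic (about 86.66) is far larger than the critical value (2.605), we reject the null hypothesis that all four card games have the same mean price. At the 0.05 level, at least one card game's mean price differs significantly from the others. The ANOVA alone does not tell us which games differ or in which direction; that would require a follow-up (post-hoc) comparison.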