# DS105-04-08 - ANOVAs - One Way Between Subjects - in Python

---
## Load Libraries

In [1]:
import pandas as pd
import numpy as np
import scipy
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd
from statsmodels.stats.multicomp import MultiComparison

___
## Load Data

In [2]:
apps = pd.read_csv('./assets/NEWgoogleplaystore.csv')

In [3]:
apps.head(2)

| | Unnamed: 0 | App | Category | Rating | Reviews | Size | Installs | Type | Price | Content.Rating | Genres | Last.Updated | Current.Ver | Android.Ver |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Photo Editor & Candy Camera & Grid & ScrapBook | ART_AND_DESIGN | 4.1 | 159 | 19M | 10,000+ | Free | 0 | Everyone | Art & Design | 7-Jan-18 | 1.0.0 | 4.0.3 and up |
| 1 | 2 | Coloring book moana | ART_AND_DESIGN | 3.9 | 967 | 14M | 500,000+ | Free | 0 | Everyone | Art & Design;Pretend Play | 15-Jan-18 | 2.0.0 | 4.0.3 and up |


___
## Question Setup
Is there a difference in the number of reviews among the three app categories of `beauty`, `food and drink`, and `photography`? 

___
## Data Wrangling

### Filtering the data
The data has many more categories than three, so you will need to filter the dataset by the categories you want: beauty, food and drink, and photography.

The code below makes a list of the categories you want to keep, then searches through the `Category` column using the `isin()` function to flag only the rows that match.

Then, you can apply that filter to your actual data frame, being sure to use the `.copy()` function to change this from a slice into a data frame.

In [4]:
categories = ['BEAUTY', 'FOOD_AND_DRINK','PHOTOGRAPHY']
apps1 = apps['Category'].isin(categories)
apps2 = apps[apps1].copy()
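
To see what each step produces, here is a minimal sketch of the same filtering pattern on a tiny made-up data frame. The `toy` data and the `GAME` and `TOOLS` rows are assumptions for illustration only and are not taken from the Play Store file.

```python
import pandas as pd

# Toy data standing in for the Play Store data (values are made up)
toy = pd.DataFrame({
    'Category': ['BEAUTY', 'GAME', 'PHOTOGRAPHY', 'FOOD_AND_DRINK', 'TOOLS'],
    'Reviews': ['10', '20', '30', '40', '50'],
})

keep = ['BEAUTY', 'FOOD_AND_DRINK', 'PHOTOGRAPHY']
mask = toy['Category'].isin(keep)   # Boolean Series: True where the row matches
filtered = toy[mask].copy()         # independent data frame rather than a slice

print(mask.tolist())
print(filtered['Category'].tolist())
```

Note that `isin()` on its own returns only the True/False flags; the rows are actually selected when that mask is used to index the data frame.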

### Subsetting only the variables you need
You only want to keep the two variables you'll need in your test: `Category` and `Reviews`.

In [5]:
apps3 = apps2[['Category','Reviews']]

### Changing `Reviews` to an integer
Your dependent variable will need to be numeric; here, you'll store it as an integer. You can check what format it is currently in by using the `.info()` function:

In [6]:
apps3.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 515 entries, 98 to 10740
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Category  515 non-null    object
 1   Reviews   515 non-null    object
dtypes: object(2)
memory usage: 12.1+ KB


Note that both `Category` and `Reviews` are non-null objects (strings). You'll want to convert `Reviews` to an integer:

NOTE: It will give you a warning, because you still technically have a slice masquerading as a data frame.

But it's ok, because the command still works just fine (see the next 2 cells).

In [7]:
apps3.Reviews = apps3.Reviews.astype(int)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  apps3.Reviews = apps3.Reviews.astype(int)


In [8]:
apps3.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 515 entries, 98 to 10740
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Category  515 non-null    object
 1   Reviews   515 non-null    int64 
dtypes: int64(1), object(1)
memory usage: 12.1+ KB
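
One way to avoid the warning seen above is to take an explicit copy when subsetting columns, then assign with bracket notation. This is a sketch on made-up data, not a change to the lesson's cells; the `toy` and `subset` names are hypothetical.

```python
import pandas as pd

# Toy data with Reviews stored as strings, as in the lesson (values are made up)
toy = pd.DataFrame({
    'Category': ['BEAUTY', 'PHOTOGRAPHY'],
    'Reviews': ['159', '967'],
    'Size': ['19M', '14M'],
})

# .copy() makes the subset its own data frame, so the later assignment is unambiguous
subset = toy[['Category', 'Reviews']].copy()
subset['Reviews'] = subset['Reviews'].astype(int)

print(subset.dtypes)
```

Because `subset` is a true copy, the original `toy` data frame keeps its string column untouched.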


### Recoding `Category` as a Number
The code used later for the post-hocs and assumptions works with numeric group codes, so you'll need to recode `Category` as well:

In [9]:
def recode (series):
    if series == "BEAUTY": 
        return 0
    if series == "FOOD_AND_DRINK": 
        return 1
    if series == "PHOTOGRAPHY": 
        return 2

apps3['CategoryR'] = apps3['Category'].apply(recode)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  apps3['CategoryR'] = apps3['Category'].apply(recode)


You get the same warning as above, but again, if you use `.head()` to examine the data, you see that things have worked ok, so you can proceed.

In [10]:
apps3.head()

| | Category | Reviews | CategoryR |
|---|---|---|---|
| 98 | BEAUTY | 18900 | 0 |
| 99 | BEAUTY | 49790 | 0 |
| 100 | BEAUTY | 1150 | 0 |
| 101 | BEAUTY | 1739 | 0 |
| 102 | BEAUTY | 32090 | 0 |
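
The same recoding can also be written with a dictionary and `.map()`, which keeps the label-to-number lookup in one place for later reference. This is an alternative sketch on made-up data, not the lesson's method; the `codes` dictionary and `toy` frame are hypothetical names.

```python
import pandas as pd

# Lookup table: category label -> numeric code (same codes as the recode function)
codes = {'BEAUTY': 0, 'FOOD_AND_DRINK': 1, 'PHOTOGRAPHY': 2}

toy = pd.DataFrame({'Category': ['PHOTOGRAPHY', 'BEAUTY', 'FOOD_AND_DRINK', 'BEAUTY']})
toy['CategoryR'] = toy['Category'].map(codes)   # labels missing from codes become NaN

print(toy)
```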


### Dropping the Original Category Variable
But wait! You now have three variables again! Go ahead and drop the original `Category` variable, since its mere presence will throw off the work you'll do later.

In [11]:
apps4 = apps3[['CategoryR','Reviews']]

And finally, eons later, you are all prepared to run a one-way ANOVA and all its assumptions and post-hoc tests. Phew! R required a lot less wrangling, because it is specifically meant for advanced statistics.

___
## Test Assumptions

Before you go any further, it's important to test for assumptions. If the assumptions for ANOVA are not met and you proceed anyway, you run the risk of biasing your results.

### Assumption: Normality
You only need to test for the normality of the dependent variable, since the IV is categorical.

In [12]:
import seaborn as sns  # seaborn was not loaded in the Load Libraries cell
sns.histplot(apps4['Reviews'], kde=True)  # histplot replaces the deprecated distplot

Looks like that isn't normal in any way - it is very highly positively skewed. So, you'll need to transform `Reviews` by taking the square root or the log.

In [None]:
apps4['ReviewsSQRT'] = np.sqrt(apps4['Reviews'])

After the square root transformation, the distribution looks relatively normal, so keep `ReviewsSQRT` as your dependent variable.
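
A numeric check can complement the visual one. The sketch below compares skewness before and after the two candidate transformations on synthetic right-skewed data; the lognormal sample is an assumption chosen for illustration and is not the Play Store data.

```python
import numpy as np
from scipy import stats

# Synthetic, strictly positive, heavily right-skewed values (stand-in for review counts)
rng = np.random.default_rng(0)
reviews = rng.lognormal(mean=8, sigma=2, size=500)

skew_raw = stats.skew(reviews)            # skewness of the untransformed values
skew_sqrt = stats.skew(np.sqrt(reviews))  # after the square root transformation
skew_log = stats.skew(np.log(reviews))    # after the log transformation

print(skew_raw, skew_sqrt, skew_log)
```

Values closer to 0 indicate a more symmetric distribution. If real data contains zeros, `np.log()` returns negative infinity for them, so `np.log1p()` is a common substitute.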

### Assumption: Homogeneity of Variance
Just like in R, you can test for homogeneity of variance easily using either Bartlett's test or Fligner's test. Bartlett's test is for data that is normally distributed, and Fligner's test is a non-parametric alternative for data that is not. No matter which test you use, you are looking for a non-significant result. The null hypothesis for both is that the groups have equal variance, so you'd like to have a *p* value > .05. Since you have transformed your data, you can use Bartlett's test, but just for learning purposes, you'll try both here.

#### Bartlett's Test
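To do Bartlett's test, use the function `scipy.stats.bartlett()`. Its arguments are the samples to compare, one per group, so pass the `ReviewsSQRT` values for each level of `CategoryR` rather than the two columns themselves.

In [None]:
groups = [g['ReviewsSQRT'] for _, g in apps4.groupby('CategoryR')]  # one sample per category
scipy.stats.bartlett(*groups)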

The *p* value associated with this test is < .05, which means that unfortunately, you have violated the assumption of homogeneity of variance.

#### Fligner's Test
To perform Fligner's test, use the function `scipy.stats.fligner()`. As with Bartlett's test, its arguments are the samples to compare, one per group.

In [None]:
groups = [g['ReviewsSQRT'] for _, g in apps4.groupby('CategoryR')]  # one sample per category
scipy.stats.fligner(*groups)

The p value is still < .05, which means you have violated the assumption of homogeneity of variance.
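
Both `scipy.stats.bartlett()` and `scipy.stats.fligner()` compare the samples they are given, so each group's DV values go in as a separate argument. Here is a minimal sketch on synthetic data in which one group is deliberately given a much larger spread; the group sizes, means, and standard deviations are made up for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group0 = rng.normal(loc=50, scale=5, size=100)
group1 = rng.normal(loc=50, scale=5, size=100)
group2 = rng.normal(loc=50, scale=20, size=100)  # much larger variance than the others

bart = stats.bartlett(group0, group1, group2)   # assumes normally distributed data
flig = stats.fligner(group0, group1, group2)    # rank-based, robust to non-normality

# True means the test is significant, i.e. homogeneity of variance is violated
print(bart.pvalue < 0.05, flig.pvalue < 0.05)
```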

### Correcting for Violations of Homogeneity of Variance
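As you know, there are many different ways to correct for this violation in the general field of statistics. However, the `stats.f_oneway()` function used in this lesson does not support any of them: it always assumes equal variances. This means that you can run the ANOVA, but there is a good chance it will be inaccurate. Other options do exist in Python, such as Welch's ANOVA through `anova_oneway()` in `statsmodels.stats.oneway` or the rank-based `scipy.stats.kruskal()`, but they are outside the scope of this lesson. If you do choose to proceed with the standard ANOVA in Python, ensure that everyone consuming your results understands that the analysis could be inaccurate!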

It is recommended, however, that if you violate the assumption of homogeneity of variance that you switch over to R, and proceed from there. You are becoming a guru in both languages for a reason!
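
For reference, here is a hedged sketch of two approaches in Python that do not assume equal variances: Welch's ANOVA via `anova_oneway()` from `statsmodels.stats.oneway` (available in statsmodels 0.12 and later) and the rank-based Kruskal-Wallis test from SciPy. Neither is part of the original lesson, and the data below is synthetic.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.oneway import anova_oneway

# Three synthetic groups with clearly unequal spreads (made-up values)
rng = np.random.default_rng(2)
g0 = rng.normal(10, 1, 60)
g1 = rng.normal(10, 3, 60)
g2 = rng.normal(20, 8, 60)

# Welch's ANOVA: use_var='unequal' drops the equal-variance assumption
welch = anova_oneway([g0, g1, g2], use_var='unequal')

# Kruskal-Wallis: compares groups using ranks rather than raw values
kw = stats.kruskal(g0, g1, g2)

print(welch.statistic, welch.pvalue)
print(kw.statistic, kw.pvalue)
```

Kruskal-Wallis answers a slightly different question than ANOVA (differences in rank distributions rather than means), so the two are not interchangeable.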

### Assumption: Sample Size
An ANOVA requires a sample size of at least 20 per independent variable. In this case, you only have one independent variable, so as long as you have at least 20 cases, you are fine. Looking at the data, the n is 515, so you are fine to proceed with this assumption!
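
It can also be useful to look at how many cases fall into each group, not just the overall n. A minimal sketch with made-up group sizes, assuming a `CategoryR` column coded as in this lesson:

```python
import pandas as pd

# Toy grouping column: 25, 30, and 22 cases in groups 0, 1, and 2 (made-up sizes)
toy = pd.DataFrame({'CategoryR': [0] * 25 + [1] * 30 + [2] * 22})

counts = toy['CategoryR'].value_counts().sort_index()  # cases per group, ordered by code
print(counts)
print((counts >= 20).all())  # True if every group has at least 20 cases
```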

### Assumption: Independence
There is no statistical test for the assumption of independence, so you can proceed!

---
## Computing ANOVAs with Equal Variance (Met Homogeneity of Variance Assumption)
In this case, your data did not meet this assumption, but for the purposes of learning, you'll be shown what to do if you had.

Below is the code to run a one-way ANOVA in Python. It uses the function `stats.f_oneway()`, which takes one argument per group: the dependent variable, subset to the rows belonging to one level of your IV. Here the transformed DV from `apps4` is subset by `CategoryR` code (0, 1, and 2), because `Reviews` in the original `apps` data frame is still stored as strings. Each group is separated by a comma:

In [None]:
stats.f_oneway(apps4['ReviewsSQRT'][apps4['CategoryR']==0],
               apps4['ReviewsSQRT'][apps4['CategoryR']==1],
               apps4['ReviewsSQRT'][apps4['CategoryR']==2])

Not much here, is there? Just the F value, under the name `statistic`, and the p value. Since the p value is less than .05, there is a significant difference in Reviews between these three categories.
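
The result can be unpacked into separate variables, which makes it easier to report. Since the output omits the degrees of freedom, the sketch below also computes them by hand (k - 1 between groups, N - k within groups). The three samples are synthetic and the variable names are hypothetical.

```python
import numpy as np
from scipy import stats

# Synthetic samples, one per group (made-up means and spreads)
rng = np.random.default_rng(3)
beauty = rng.normal(30, 5, 40)
food = rng.normal(32, 5, 40)
photo = rng.normal(45, 5, 40)

result = stats.f_oneway(beauty, food, photo)
f_value, p_value = result          # the result unpacks as (statistic, pvalue)

k = 3                                           # number of groups
n_total = len(beauty) + len(food) + len(photo)  # total number of cases
df_between, df_within = k - 1, n_total - k

print(f"F({df_between}, {df_within}) = {f_value:.2f}, p = {p_value:.4g}")
```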

___
## Computing ANOVAs with Unequal Variance (Violated Homogeneity of Variance Assumption)
The `stats.f_oneway()` function used in this lesson CANNOT compute an ANOVA with unequal variance! Either switch over to R, use a method designed for unequal variance, or be VERY CAUTIOUS when interpreting your results and don't use them for anything high stakes!

___
## Post Hocs
It's important to run post-hocs to figure out which groups significantly differed from each other. In Python, the post hoc that is most readily available is Tukey's, through `statsmodels`, so that is what you will learn.

### Computing Post Hocs with Tukey's
Here is the code for computing a Tukey's post hoc in Python:

In [None]:
postHoc = MultiComparison(apps4['ReviewsSQRT'], apps4['CategoryR'])
postHocResults = postHoc.tukeyhsd()
print(postHocResults)

First you use the `MultiComparison()` function to specify the variables to use. Then, you call the `tukeyhsd()` function to run the Tukey's correction on the data. Finally, you can print the results.

Interpreting this is a little harder than in R, because you've recoded your categorical IV as numbers. So, make sure you refer back to that recode command to remember which number is which: 0 stands for beauty apps, 1 stands for food and drink apps, and 2 stands for photography apps. This output provides you with the mean difference per comparison (on the square root scale, since the test was run on `ReviewsSQRT`), plus the confidence interval (`lower` and `upper` columns), and whether or not you can reject the null hypothesis. If the value in the `reject` column is `True`, then there was a significant difference in the means between those groups. So, there is a significant difference in the number of reviews between photography apps and both beauty and food and drink apps. What is that difference? Well, you will have to examine the means.
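
To make the output easier to read, the group labels can be passed as text instead of numeric codes. In the statsmodels versions I am aware of, `MultiComparison()` accepts string labels, but treat this as an assumption to verify against your installed version. The data below is synthetic.

```python
import numpy as np
from statsmodels.stats.multicomp import MultiComparison

# Synthetic DV values: the third group is given a clearly higher mean (made-up numbers)
rng = np.random.default_rng(4)
values = np.concatenate([
    rng.normal(30, 5, 40),
    rng.normal(31, 5, 40),
    rng.normal(50, 5, 40),
])
labels = np.repeat(['BEAUTY', 'FOOD_AND_DRINK', 'PHOTOGRAPHY'], 40)

results = MultiComparison(values, labels).tukeyhsd()
print(results)                # summary table with group1, group2, meandiff, reject, ...
print(results.groupsunique)   # group labels in the order used for the comparisons
print(results.reject)         # one True/False per pairwise comparison
```

The comparisons are listed pairwise in label order, so with three groups the `reject` array holds three entries.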

### Computing Post Hocs When You've Violated the Assumption of Homogeneity of Variance
The Tukey test used above assumes equal variance, and the functions covered in this lesson offer NO post hoc for unequal variance! Either switch over to R or be VERY CAUTIOUS when interpreting your results and don't use them for anything high stakes!

___
## Determine Means and Draw Conclusions
The last step is just to examine the means, to determine which apps had the highest and lowest number of reviews.

In [None]:
apps4.groupby('CategoryR').mean()

The `groupby()` function allows you to specify a grouping variable for an entire dataset, and you can then call the `.mean()` function on top of it.

Looking at the reviews column, which has the means, you can say that photography apps had significantly more reviews than both beauty and food and drink apps.
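
Since the group codes are not self-explanatory, it can help to relabel the summary with the original category names and add the median and count alongside the mean. A minimal sketch on made-up numbers; the `names` dictionary mirrors the recode used earlier and `toy` is hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    'CategoryR': [0, 0, 1, 1, 2, 2],
    'Reviews': [100, 300, 200, 400, 1000, 3000],   # made-up review counts
})
names = {0: 'BEAUTY', 1: 'FOOD_AND_DRINK', 2: 'PHOTOGRAPHY'}

# Mean, median, and number of cases per group, with readable row labels
summary = (toy.groupby('CategoryR')['Reviews']
              .agg(['mean', 'median', 'count'])
              .rename(index=names))
print(summary)
```

With skewed data like review counts, the median is often a useful companion to the mean.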