## Install and Load Packages
The manova() function ships with base R (in the stats package), so the only libraries you need to load are for assumption testing: mvnormtest to test for multivariate normality, and car to test for homogeneity of variance.

In [2]:
# install packages
install.packages("mvnormtest")
install.packages("car")


The downloaded binary packages are in
	/var/folders/wk/6why77bn1kn0l0pkd4vd3zl00000gn/T//RtmpZGIruS/downloaded_packages

The downloaded binary packages are in
	/var/folders/wk/6why77bn1kn0l0pkd4vd3zl00000gn/T//RtmpZGIruS/downloaded_packages


In [3]:
# load packages
library("mvnormtest")
library("car")

Loading required package: carData



## Load in Data
You will be using data about Kickstarter projects to learn MANOVAs. This data has information about each project: its category, the deadline and goal for fundraising, when the project was launched, the amount of money pledged, the country, and the current state of the project.

In [64]:
### load data
kickstarter = read.csv('../Data/kickstarter.csv')

In [65]:
# always take a quick look at your data to ensure you've loaded it correctly and to see what you're dealing with
head(kickstarter)

Unnamed: 0_level_0,ID,name,category,main_category,currency,deadline,goal,launched,pledged,state,backers,country,usd.pledged,X,X.1,X.2,X.3
Unnamed: 0_level_1,<int>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<int>
1,1000002330,The Songs of Adelaide & Abullah,Poetry,Publishing,GBP,10/9/2015 11:36,1000,8/11/2015 12:12,0,failed,0,GB,0,,,,
2,1000004038,Where is Hank?,Narrative Film,Film & Video,USD,2/26/2013 0:20,45000,1/12/2013 0:20,220,failed,3,US,220,,,,
3,1000007540,ToshiCapital Rekordz Needs Help to Complete Album,Music,Music,USD,4/16/2012 4:24,5000,3/17/2012 3:24,1,failed,1,US,1,,,,
4,1000011046,Community Film Project: The Art of Neighborhood Filmmaking,Film & Video,Film & Video,USD,8/29/2015 1:00,19500,7/4/2015 8:35,1283,canceled,14,US,1283,,,,
5,1000014025,Monarch Espresso Bar,Restaurants,Food,USD,4/1/2016 13:38,50000,2/26/2016 13:38,52375,successful,224,US,52375,,,,
6,1000023410,Support Solar Roasted Coffee & Green Energy! SolarCoffee.co,Food,Food,USD,12/21/2014 18:30,1000,12/1/2014 18:30,1205,successful,16,US,1205,,,,


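A classic way to iterate in R is the apply family of functions. sapply applies a function to each element of a list (here, each column of the data frame) and simplifies the result if possible (hence the “s” in sapply). Use it to count the missing values in every column: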

In [66]:
sapply(kickstarter, function(x)sum(is.na(x)))
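To make the pattern above concrete, here is a minimal sketch on a small made-up data frame (`toy` and its columns are hypothetical, not part of the Kickstarter data). It shows that sapply returns one count per column, and that colSums(is.na(...)) is an equivalent shortcut.

```r
# Toy data frame (hypothetical), with a different number of NAs per column
toy <- data.frame(
  a = c(1, NA, 3),
  b = c("x", "y", NA),
  c = c(NA, NA, NA)
)

# sapply() visits each column and simplifies the results
# into a named integer vector of NA counts
na_counts <- sapply(toy, function(x) sum(is.na(x)))
na_counts

# colSums() on the logical matrix from is.na() gives the same counts
colSums(is.na(toy))
```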

The four columns on the far right of this dataset (X, X.1, X.2, and X.3) are not needed, so let's get rid of them.

In [67]:
#easy peasy subset
kickstarter2 = subset(kickstarter, select = -c(X,X.1,X.2,X.3))

In [69]:
# Boom Gone!
head(kickstarter2)

Unnamed: 0_level_0,ID,name,category,main_category,currency,deadline,goal,launched,pledged,state,backers,country,usd.pledged
Unnamed: 0_level_1,<int>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
1,1000002330,The Songs of Adelaide & Abullah,Poetry,Publishing,GBP,10/9/2015 11:36,1000,8/11/2015 12:12,0,failed,0,GB,0
2,1000004038,Where is Hank?,Narrative Film,Film & Video,USD,2/26/2013 0:20,45000,1/12/2013 0:20,220,failed,3,US,220
3,1000007540,ToshiCapital Rekordz Needs Help to Complete Album,Music,Music,USD,4/16/2012 4:24,5000,3/17/2012 3:24,1,failed,1,US,1
4,1000011046,Community Film Project: The Art of Neighborhood Filmmaking,Film & Video,Film & Video,USD,8/29/2015 1:00,19500,7/4/2015 8:35,1283,canceled,14,US,1283
5,1000014025,Monarch Espresso Bar,Restaurants,Food,USD,4/1/2016 13:38,50000,2/26/2016 13:38,52375,successful,224,US,52375
6,1000023410,Support Solar Roasted Coffee & Green Energy! SolarCoffee.co,Food,Food,USD,12/21/2014 18:30,1000,12/1/2014 18:30,1205,successful,16,US,1205


One NA is left, in the name column. Let's confirm that and then drop it.

In [71]:
sapply(kickstarter2, function(x)sum(is.na(x)))

Passing your data frame or matrix through the na.omit() function is a simple, efficient way to purge incomplete records from your analysis: it drops every row that contains at least one NA. That is why the X.3 column had to be removed first. It was all NAs, so na.omit() would otherwise have dropped every row.

In [74]:
# let's omit that last NA
kickstarter3 <- na.omit(kickstarter2) 
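The order of operations matters here, and a small sketch may help show why. The data frame `toy` and its column `empty` are made up for illustration; `empty` plays the role of an all-NA column like X.3.

```r
# Toy data frame (hypothetical): one real NA plus an all-NA column
toy <- data.frame(
  name  = c("a", NA, "c"),
  goal  = c(10, 20, 30),
  empty = c(NA, NA, NA)
)

# Every row has an NA in `empty`, so na.omit() keeps nothing
nrow(na.omit(toy))

# Drop the all-NA column first, then omit incomplete rows
trimmed <- subset(toy, select = -empty)
cleaned <- na.omit(trimmed)
nrow(cleaned)
```

With the empty column gone, only the row with the missing name is removed.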

In [75]:
# Boom gone!
sapply(kickstarter3, function(x)sum(is.na(x)))

## Question Set Up
You will be answering the following question with this data:

Does the country the project originated in influence the number of backers and the amount of money pledged?

To answer this question, the independent variable will be the country the project originated in, country. This is a categorical variable. The two dependent variables will be the number of backers (backers) and the amount pledged (pledged). These variables are both continuous.
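For orientation, a question of this shape maps onto a model formula with both dependent variables bound together on the left and the grouping variable on the right. The sketch below uses simulated data (the `toy` data frame and its values are assumptions, not the Kickstarter data), so the result itself is meaningless; only the structure of the call is the point.

```r
set.seed(1)  # arbitrary seed so the simulated values are reproducible

# Simulated stand-in for the real data (hypothetical values)
toy <- data.frame(
  country = factor(rep(c("US", "GB", "CA"), each = 20)),
  backers = rpois(60, lambda = 30),
  pledged = rnorm(60, mean = 1000, sd = 200)
)

# cbind() combines the two dependent variables; country is the
# categorical independent variable
fit <- manova(cbind(pledged, backers) ~ country, data = toy)
summary(fit)
```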

## Data Wrangling
Although no data wrangling is actually required for the MANOVA itself, some wrangling is required to test for assumptions. In order to test for multivariate normality, you will need to create a dataset containing only your two dependent variables that is in a matrix format, and you will need to ensure that they are numeric. 

Note: The test for normality can only handle 5,000 records, so you will also need to limit your data to 5,000 rows.

## Ensure Variables are Numeric
First, check the structure of the data to see what format your dependent variables are in.

In [76]:
str(kickstarter3$pledged)
str(kickstarter3$backers)

 chr [1:323749] "0" "220" "1" "1283" "52375" "1205" "453" "8233" "6240.57" ...
 chr [1:323749] "0" "3" "1" "14" "224" "16" "40" "58" "43" "0" "100" "0" ...


Notice the quotes around each number and the chr type: str() reports the data type of each column, and here it shows that both variables are stored as character strings rather than numbers.

Before converting them, let's subset the data.

## Subsetting
Next, keep only your two dependent variables, pledged and backers.

In [78]:
# we only want to keep our two dependent variables
keepers <- c("pledged", "backers")
kickstarter4 <- kickstarter3[keepers]

In [79]:
# Then limit the number of rows:
kickstarter5 <- kickstarter4[1:5000,]
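Taking the first 5,000 rows is the simplest way to get under the limit, but if a file happens to be ordered in some way, those rows may not be representative. One possible alternative, not what this walkthrough does, is to draw a random sample of rows instead. The sketch below uses a made-up data frame `toy`, and the seed value is arbitrary.

```r
set.seed(42)  # arbitrary seed, only so the sample is reproducible

# Toy stand-in for a two-column data frame (hypothetical values)
toy <- data.frame(
  pledged = as.character(1:20000),
  backers = as.character(20000:1)
)

# Keep at most 5,000 rows, chosen at random without replacement
n_keep  <- min(5000, nrow(toy))
sampled <- toy[sample(nrow(toy), n_keep), ]
dim(sampled)
```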

In [83]:
# we're down to our two columns we need
head(kickstarter5)

Unnamed: 0_level_0,pledged,backers
Unnamed: 0_level_1,<chr>,<chr>
1,0,0
2,220,3
3,1,1
4,1283,14
5,52375,224
6,1205,16


In [84]:
sapply(kickstarter5, function(x)sum(is.na(x)))

In [85]:
# as noted above, these numbers are stored as strings, so let's convert them to numeric
kickstarter5$pledged <- as.numeric(kickstarter5$pledged)

“NAs introduced by coercion”


In [86]:
# do the same for backers
kickstarter5$backers <- as.numeric(kickstarter5$backers)

“NAs introduced by coercion”


Oh no! “NAs introduced by coercion”. What does that mean, and what can we do to fix it? Let's work through an example.

In [87]:
#define character vector
x <- c('1', '2', '3', NA, '4', 'Hey')

#convert to numeric vector
x_num <- as.numeric(x)

#display numeric vector
x_num

“NAs introduced by coercion”
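In the example vector, 'Hey' has no numeric interpretation, so as.numeric() turns it into NA and raises the warning, while the NA that was already there stays NA silently. A minimal sketch for finding out which values triggered the warning is to compare the missing values before and after conversion (the name `bad` is just an illustrative helper):

```r
# Same character vector as in the example
x <- c('1', '2', '3', NA, '4', 'Hey')

# Convert quietly; the warning is expected here
x_num <- suppressWarnings(as.numeric(x))

# Values that were present before conversion but are NA afterwards
bad <- !is.na(x) & is.na(x_num)
x[bad]
```

The same comparison can be applied to a data frame column to inspect the offending entries before deciding to drop them.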


In [88]:
# Let's check out how many NAs we introduced by coercion
sapply(kickstarter5, function(x)sum(is.na(x)))

In [89]:
# Let's remove those 
kickstarter6 <- na.omit(kickstarter5) 

In [91]:
# Did we remove them?
sapply(kickstarter6, function(x)sum(is.na(x)))

Yes, we did.

## Format as a Matrix
Lastly, format the data as a matrix:

In [93]:
kickstarter7 <- as.matrix(kickstarter6)

You are now ready to perform the assumptions test for multivariate normality on kickstarter7.

# Test Assumptions
With the data wrangling out of the way, it is now time to test assumptions!

## Sample Size
The first assumption of MANOVAs is sample size. The rule of thumb is that you must have at least 20 cases per independent variable, and that there must be more cases than dependent variables in every cell, meaning that there must be more than 2 cases for each country. Happily, both of these are fulfilled with a dataset of more than 320,000 rows!
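The per-cell part of this rule can be checked directly by tabulating the grouping variable. The sketch below uses a tiny made-up data frame (`toy` is hypothetical); on the real data the same call would be made on the country column.

```r
# Toy stand-in for the grouping variable (hypothetical values)
toy <- data.frame(
  country = c("US", "US", "US", "GB", "GB", "GB", "CA", "CA", "CA", "CA")
)

# Number of cases in each cell (here, each country)
cell_counts <- table(toy$country)
cell_counts

# Two dependent variables, so every cell needs more than 2 cases
n_dv <- 2
all(cell_counts > n_dv)
```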

## Multivariate Normality
Testing for multivariate normality requires a dataset with no missing values, which you already took care of in the data wrangling phase above.

Now you can use the dataset you wrangled, kickstarter7, in the Shapiro-Wilk test. You can do that with the function mshapiro.test() from the mvnormtest library:

In [95]:
mshapiro.test(t(kickstarter7))


	Shapiro-Wilk normality test

data:  Z
W = 0.07914, p-value < 2.2e-16


You have violated the assumption of multivariate normality if the p value is significant at p < .05. Here p < 2.2e-16, so unfortunately, these data do not meet the assumption for MANOVAs. However, for learning purposes, you will continue.
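The t() in the call above is there because mshapiro.test() expects the variables in rows and the observations in columns, which is the transpose of the usual data layout. A small sketch with simulated values (the matrix `m` is made up) shows the shape change; the test itself only runs if mvnormtest is installed.

```r
set.seed(1)  # arbitrary seed for reproducible simulated data

# Usual layout: one row per observation, one column per variable
m <- cbind(pledged = rnorm(100), backers = rnorm(100))
dim(m)

# Transposed layout: one row per variable, one column per observation
dim(t(m))

# Run the test only when the package is available
if (requireNamespace("mvnormtest", quietly = TRUE)) {
  print(mvnormtest::mshapiro.test(t(m)))
}
```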

## Homogeneity of Variance
You can use Levene's Test from the car library to test for homogeneity of variance on both of your dependent variables:

In [97]:
leveneTest(kickstarter3$pledged, kickstarter3$country, data=kickstarter3)

ERROR: Error in leveneTest.default(kickstarter3$pledged, kickstarter3$country, : kickstarter3$pledged is not a numeric variable


In [98]:
# pledged is still stored as strings in kickstarter3, so convert it to numeric
kickstarter3$pledged <- as.numeric(kickstarter3$pledged)

“NAs introduced by coercion”


In [99]:
# Let's check out how many NAs we introduced by coercion
sapply(kickstarter3, function(x)sum(is.na(x)))

In [100]:
# Let's remove those 
kickstarter3 <- na.omit(kickstarter3) 

In [101]:
sapply(kickstarter3, function(x)sum(is.na(x)))

In [102]:
leveneTest(kickstarter3$pledged, kickstarter3$country, data=kickstarter3)

“kickstarter3$country coerced to factor.”


Unnamed: 0_level_0,Df,F value,Pr(>F)
Unnamed: 0_level_1,<int>,<dbl>,<dbl>
group,23,22.13071,5.663731e-93
,323101,,
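The "coerced to factor" warning appears because country is stored as text. One way to avoid it, sketched here on simulated data (the `toy` data frame is an assumption, not the Kickstarter data), is to convert the grouping variable to a factor first and use the formula interface of leveneTest(). The test is only run if car is installed.

```r
set.seed(1)  # arbitrary seed for reproducible simulated data

# Simulated data: the third group has a deliberately larger spread
toy <- data.frame(
  country = rep(c("US", "GB", "CA"), each = 30),
  pledged = c(rnorm(30, sd = 1), rnorm(30, sd = 1), rnorm(30, sd = 5))
)

# Convert the grouping variable up front so nothing is coerced later
toy$country <- factor(toy$country)

# Formula interface: dependent variable ~ grouping factor
if (requireNamespace("car", quietly = TRUE)) {
  print(car::leveneTest(pledged ~ country, data = toy))
}
```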
