# Data Preprocessing

Data preprocessing is divided into the following steps:
1. Importing the dataset.
2. Completing missing data.
3. Encoding categorical data.
4. Splitting the dataset.
5. Feature scaling.

In [1]:
dataset <- read.csv("Dataset/datasetComplete.csv")

In [2]:
head(dataset)

Age,Income,Graduate
24,35000,no
26,48000,no
35,54000,yes
40,61000,yes
50,49600,no
35,50000,yes


### Completing Missing Data
Completing missing data is optional: if your dataset has no gaps, you can skip this part. When some cells are missing, you have two options:
1. Remove the entire row (not recommended, since you could delete crucial information).
2. Fill in the missing value with the mean of its column.

In [4]:
dataset$Age <- ifelse(is.na(dataset$Age), ave(dataset$Age, FUN = function(x) mean(x, na.rm = TRUE)), dataset$Age)

In [5]:
dataset$Income <- ifelse(is.na(dataset$Income), ave(dataset$Income, FUN = function(x) mean(x, na.rm = TRUE)), dataset$Income)

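ave applies a function (the mean by default) within groups and returns a vector with the same length and order as its input, which makes it convenient for filling in a column. No grouping variable is passed here, so the mean is taken over the whole column.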
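
As a minimal sketch of how ave behaves with and without a grouping variable, the snippet below uses a made-up vector `x` and a made-up grouping vector `g` (neither comes from the dataset). It is meant only to illustrate why the result can be used to fill a column in place.

```r
# Made-up values for illustration; x has one missing entry
x <- c(10, 20, NA, 40)
g <- c("a", "a", "b", "b")

# No grouping variable: one overall mean, repeated to the length of x
overall <- ave(x, FUN = function(v) mean(v, na.rm = TRUE))

# With a grouping variable: each position gets the mean of its own group
grouped <- ave(x, g, FUN = function(v) mean(v, na.rm = TRUE))

# Keep existing values and fill only the gap with the overall mean
filled <- ifelse(is.na(x), overall, x)
```

Because `overall` lines up position by position with `x`, it can be passed straight to ifelse.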

In [6]:
head(dataset)

Age,Income,Graduate
24,35000,no
26,48000,no
35,54000,yes
40,61000,yes
50,49600,no
35,50000,yes
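
The output above is unchanged because this file has no missing cells. To see the two options act on actual gaps, here is a small sketch on a made-up data frame called `toy` (the values are invented for illustration and are not part of the dataset).

```r
# Made-up data frame with one missing Age and one missing Income
toy <- data.frame(Age    = c(24, NA, 35, 40),
                  Income = c(35000, 48000, NA, 61000))

# Option 1: drop every row that contains at least one NA
dropped <- na.omit(toy)

# Option 2: replace each NA with the mean of the non-missing values in its column
filled <- toy
filled$Age[is.na(filled$Age)]       <- mean(toy$Age, na.rm = TRUE)
filled$Income[is.na(filled$Income)] <- mean(toy$Income, na.rm = TRUE)
```

Dropping rows leaves only two of the four rows here, while filling with the column mean keeps all four.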


### Encoding Categorical Data
This step is also optional. Your dataset may already come with its categorical data encoded, in which case you won’t need to do this.
In our case, the Graduate column has two possible values, yes or no. To work with this data we have to encode it, that is, replace the labels with numbers. Doing this in R is simple:

In [7]:
dataset$Graduate <- factor(dataset$Graduate,
                          levels = c('yes','no'),
                          labels = c(1,0))

In [8]:
head(dataset)

Age,Income,Graduate
24,35000,0
26,48000,0
35,54000,1
40,61000,1
50,49600,0
35,50000,1
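
One detail worth checking after this step: factor() returns a factor whose labels are the strings "1" and "0", not a numeric column. The sketch below rebuilds the six Graduate values from the table as a plain vector (the variable names are illustrative) and shows one way to get real numbers out if a later step needs them.

```r
# The six Graduate values shown in the table, as a plain character vector
graduate <- c("no", "no", "yes", "yes", "no", "yes")

encoded <- factor(graduate, levels = c("yes", "no"), labels = c(1, 0))

# Still a factor; its levels are stored as the strings "1" and "0"
encoded_class  <- class(encoded)
encoded_levels <- levels(encoded)

# as.numeric() on a factor returns the internal level codes (1 for "yes", 2 for "no"),
# so convert through as.character() to recover the 1/0 values
codes  <- as.numeric(encoded)
values <- as.numeric(as.character(encoded))
```

Whether the factor or the numeric form is preferable depends on the model function used afterwards; many R modelling functions accept factors directly.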


### Feature Scaling
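This last step is not always necessary either. Some columns in the dataset are on very different scales: Age ranges from 24 to 50, while Income ranges from 35000 to 61000. Calling scale with its default arguments standardizes each column by subtracting its mean and dividing by its standard deviation.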
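
To make the scale difference concrete, the sketch below standardizes the six Age and Income values by hand and compares the result with base R's scale() using its default arguments. The vector names are illustrative.

```r
# Ages and incomes from the six rows shown earlier
age    <- c(24, 26, 35, 40, 50, 35)
income <- c(35000, 48000, 54000, 61000, 49600, 50000)

# Standardize by hand: subtract the mean, divide by the sample standard deviation
age_z    <- (age - mean(age)) / sd(age)
income_z <- (income - mean(income)) / sd(income)

# scale() with default arguments returns the same numbers as a one-column matrix
age_scaled <- as.numeric(scale(age))
```

After standardizing, both columns have mean 0 and standard deviation 1, so neither dominates a distance-based computation purely because of its units.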

In [18]:
dataset[, 1:2] <- scale(dataset[, 1:2])


In [19]:
dataset

Age,Income,Graduate
-1.1569337,-1.70833278,0
-0.9465821,-0.18721455,0
0.0,0.51484001,1
0.525879,1.33390368,1
1.5776369,0.0,0
0.0,0.04680364,1


# Aggregating and analyzing data 

In [1]:
library("dplyr")  

"package 'dplyr' was built under R version 3.6.3"
Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union



In [5]:
dat <- read.csv("http://mgimond.github.io/ES218/Data/FAO_grains_NA.csv", header=TRUE)

In [6]:
head(dat)

Country,Crop,Information,Year,Value,Source
Canada,Barley,Area harvested (Ha),2012,2060000.0,Official data
Canada,Barley,Yield (Hg/Ha),2012,38894.66,Calculated data
Canada,Buckwheat,Area harvested (Ha),2012,0.0,FAO estimate
Canada,Canary seed,Area harvested (Ha),2012,101900.0,Official data
Canada,Canary seed,Yield (Hg/Ha),2012,12161.92,Calculated data
Canada,"Grain, mixed",Area harvested (Ha),2012,57900.0,Official data


In [7]:
dat %>%
  group_by(Country) %>%
  summarise(yr_min = min(Year), yr_max = max(Year))

`summarise()` ungrouping output (override with `.groups` argument)


Country,yr_min,yr_max
Canada,1961,2012
United States of America,1961,2012


In [8]:
dat %>%
  group_by(Crop) %>%
  summarise(yr_min = min(Year), yr_max = max(Year))

`summarise()` ungrouping output (override with `.groups` argument)


Crop,yr_min,yr_max
Barley,1961,2012
Buckwheat,1961,2012
Canary seed,1980,2012
"Grain, mixed",1961,2012
Maize,1961,2012
Millet,1961,2012
Oats,1961,2012
Popcorn,1961,1982
Rye,1961,2012
Sorghum,1961,2012


### Count the number of records in each group

In [9]:
dat %>%
  filter(Information == "Yield (Hg/Ha)",
         Year >= 2005 & Year <= 2010,
         Country == "United States of America") %>%
  group_by(Crop) %>%
  count()

Crop,n
Barley,6
Buckwheat,6
Maize,6
Millet,6
Oats,6
Rye,6
Sorghum,6
