Practicals for Data Analysis & Visualisation


Data Analysis and Visualisation practicals

Here you can find all information and files for the practicals of the elective master's course Data Analysis and Visualisation at Utrecht University (course code 201600038 in Osiris).

You will work inside the practicals folder. Download the folder and unzip it to a sensible location on your computer.

Links to practicals

#   Name                                                  HTML   PDF   Answers
01  R basics for DAV                                      .html  .pdf
02  Data manipulation & EDA                               .html  .pdf  Answers
03  Data Visualisation using ggplot2                      .html  .pdf  Answers
04  Assignment EDA                                        .html  .pdf
05  Supervised learning: Regression 1                     .html  .pdf  Answers
06  Supervised learning: Regression 2                     .html  .pdf  Answers
07  Supervised learning: Regression 3                     .html  .pdf  Answers
08  Supervised learning: Classification 1                 .html  .pdf  Answers
09  Supervised learning: Classification 2                 .html  .pdf  Answers
10  Assignment Prediction Model                           .html  .pdf
11  Unsupervised learning: PCA & Correspondence Analysis  .html  .pdf  Answers
12  Unsupervised learning: Clustering                     .html  .pdf  Answers

Prerequisites

  • Install R and RStudio Desktop (open source) by following the instructions here
  • If you don't yet have a TeX distribution, run the following within RStudio:
    install.packages("tinytex")
    library(tinytex)
    install_tinytex()

If you have no experience with R or another programming language, you will need to catch up before and during the course. This is not an introductory course on programming with R, but a course on data analysis and visualisation.

Some good sources are:

install.packages("swirl")
library(swirl)
swirl()

Then follow the guide to run the interactive course "R Programming: The basics of programming in R".

The following is the minimum you should know about R before starting the first practical:

  • What R is (a fancy calculator) and what an .R file is (a recipe for calculations)
  • What an R package is (a set of functions you can download to use in your own code)
  • How to run R code in RStudio
  • What a variable is: x <- 10
  • What a function is: y <- fun(x = 10)
  • Understand what the following statements do (tip: you can run them in R line by line)
    y <- "Let him go!"
    x <- "Bismillah!"
    z <- paste(x, "No, we will not let you go.", y)
    rep(z, 3)
    1:10
    sample(1:20, 4)
    sample(1:20, 40, replace = TRUE)
    z <- c(1, 2, 3, 4, 5, 4, 3, 2, 1)
    z^2
    z == 2
    z > 2
    install.packages("dplyr")
    library(dplyr)
  • Be able to read the help file of any function (e.g., type ?plot in the console)
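
As a small illustration of the variable and function items above, here is a minimal sketch. The function name add_one is hypothetical and exists only for this example; it is not part of the course material.

```r
# add_one is a hypothetical function, defined here only for illustration
add_one <- function(x) {
  x + 1  # the last evaluated expression is the return value
}

x <- 10              # a variable holding the number 10
y <- add_one(x = x)  # call the function and store its result
y
```

Typing ?sample or ?paste in the console opens the help file for the built-in functions used in the statements above.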

Outline of the practicals

Anything written in italic font is optional/extra material. You can look those up by yourself if you have extra time.

Week 1

  • R basics for DAV

    • R and RStudio
    • Project organisation
    • Help files using ?, CRAN, and internet search
    • R Markdown
    • The ISLR package (datasets from James ISLR)
    • The tidyverse as a dialect of the R language (Wickham R4DS)
    • The Google style guide or the tidyverse style guide (ISLR does not follow these)
    • R packages on GitHub
  • Data manipulation & exploratory data analysis

    • Data types: character, numeric, factor
    • Lists
    • Loading datasets from .csv or .xlsx (or other formats with haven)
    • data.frame() and tibble()
    • View(), head(), tail()
    • summary()
    • filter(), select(), and mutate() from dplyr
    • bind_rows(), bind_cols()
    • missing values (na.omit)
    • group_by() and summarise() from dplyr
    • the pipe operator %>%
    • table()
    • dplyr cheatsheet
    • wide to long format: gather and spread
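
The dplyr verbs listed above can be chained with the pipe operator. The following is a minimal sketch, assuming the dplyr package is installed; it uses the built-in mtcars data only as a stand-in, which may differ from the data used in the practical.

```r
library(dplyr)

# mtcars ships with R; it is used here only as a stand-in dataset
cars_summary <- mtcars %>%
  filter(hp > 100) %>%                     # keep cars with more than 100 horsepower
  select(mpg, cyl, wt) %>%                 # keep three columns
  mutate(wt_kg = wt * 1000 * 0.4536) %>%   # wt is recorded in 1000 lbs; convert to kg
  group_by(cyl) %>%                        # one group per number of cylinders
  summarise(
    n = n(),
    mean_mpg = mean(mpg),
    mean_wt_kg = mean(wt_kg)
  )

cars_summary
```

Each step takes a data frame and returns a data frame, which is what makes the chain readable from top to bottom.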

Week 2

  • Data Visualisation using ggplot2
    • Preparing data for a ggplot() call
    • What is a ggplot object and how to construct it
    • Aesthetics: x, y, size, colour, fill
    • geom_point(), geom_line(), geom_bar()
    • Labels, limits
    • geom_boxplot(), geom_density()
    • themes (ggthemes?)
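
To make the idea of a ggplot object concrete, here is a minimal sketch, assuming the ggplot2 package is installed. The mtcars data and the chosen aesthetics are illustrative assumptions, not taken from the practical.

```r
library(ggplot2)

# Build the plot layer by layer and store it as an object
p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point(size = 2) +                   # one point per car
  labs(
    x = "Weight (1000 lbs)",
    y = "Miles per gallon",
    colour = "Cylinders"
  ) +
  theme_minimal()                          # a built-in ggplot2 theme

p  # printing the object draws the plot
```

Because the plot is an ordinary R object, further layers such as geom_line() or a different theme can be added to p later with +.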

Week 3

  • HANDIN: Pass / Fail assignment

    • Find a dataset and create an Exploratory Data Analysis
    • Tip: try Google Dataset Search.
    • Format: stand-alone RStudio project folder with:
      • the dataset (csv, xlsx, sav, dat, json, or any other common format)
      • one .Rmd notebook file
      • a compiled .pdf or .html
    • Requirements:
      • explain the dataset in 1 or 2 paragraphs
      • use tidyverse
      • clean, legible R code (preferably following the Google style guide)
      • table(s) with relevant summary statistics
      • descriptive plots
      • explain what you did and why (max 3 paragraphs total)
  • Supervised learning: Regression 1

    • lm(), the formula object, the lm object and its methods (print(), summary(), coef(), plot())
    • Regression lines in ggplot with uncertainty
    • Linear regression with multiple variables, interaction effects
    • Model assessment:
      • Train/test split
      • Mean square error calculation (predict())
      • AIC, BIC
    • Bias/variance tradeoff
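
The regression topics above (formula, interaction effects, train/test split, mean square error, AIC and BIC) can be sketched in base R as follows. The mtcars data, the 70/30 split, and the choice of predictors are assumptions made for illustration only.

```r
set.seed(1)  # make the random split reproducible

# Split the built-in mtcars data into roughly 70% train and 30% test
n <- nrow(mtcars)
train_idx <- sample(n, size = round(0.7 * n))
train <- mtcars[train_idx, ]
test <- mtcars[-train_idx, ]

# wt * hp expands to wt + hp + wt:hp (main effects plus interaction)
fit <- lm(mpg ~ wt * hp, data = train)
summary(fit)
coef(fit)

# Mean square error on data the model has not seen
pred <- predict(fit, newdata = test)
mse <- mean((test$mpg - pred)^2)
mse

# Information criteria computed on the training fit
AIC(fit)
BIC(fit)
```

Comparing the test MSE of a simpler model, for example mpg ~ wt, against this one is a quick way to see the bias/variance tradeoff in action.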

Week 4

  • Supervised learning: Regression 2
    • Feature selection
    • Regularisation using the glmnet package
    • Optimising lambda
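
As a minimal sketch of regularisation and of choosing lambda by cross-validation, assuming the glmnet package is installed. The mtcars data, the lasso penalty (alpha = 1), and the number of folds are illustrative assumptions.

```r
library(glmnet)

set.seed(1)  # cross-validation folds are random

# glmnet needs a numeric predictor matrix and a response vector
x <- model.matrix(mpg ~ ., data = mtcars)[, -1]  # drop the intercept column
y <- mtcars$mpg

# alpha = 1 is the lasso penalty; alpha = 0 would be ridge regression
cv_fit <- cv.glmnet(x, y, alpha = 1, nfolds = 5)

cv_fit$lambda.min               # lambda with the lowest cross-validated error
coef(cv_fit, s = "lambda.min")  # coefficients at that lambda; some may be exactly zero
```

Calling plot(cv_fit) shows the cross-validated error across the whole lambda path.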

Week 5

  • Supervised learning: Regression 3
    • Polynomial regression
    • Nonlinear regression using the splines package
    • Visualising nonlinear regression
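
The following sketch fits a polynomial and a natural cubic spline with lm() and draws both curves. It uses the splines package that ships with R; the mtcars data, the polynomial degree, and the spline degrees of freedom are assumptions chosen for illustration, and base graphics are used here for brevity.

```r
library(splines)

# Quadratic polynomial and natural cubic spline in horsepower
fit_poly <- lm(mpg ~ poly(hp, 2), data = mtcars)
fit_ns <- lm(mpg ~ ns(hp, df = 3), data = mtcars)

# Predict on an evenly spaced grid to draw smooth curves
grid <- data.frame(hp = seq(min(mtcars$hp), max(mtcars$hp), length.out = 100))
grid$poly <- predict(fit_poly, newdata = grid)
grid$spline <- predict(fit_ns, newdata = grid)

plot(mpg ~ hp, data = mtcars)
lines(grid$hp, grid$poly, lty = 1)    # solid line: polynomial fit
lines(grid$hp, grid$spline, lty = 2)  # dashed line: spline fit
```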

Week 6

  • Supervised learning: Classification 1
    • (titanic data? default data?)
    • KNN
    • Logistic regression (see also 4.2)
    • LDA
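
As a minimal sketch of one of the classifiers above, here is a logistic regression fitted with glm() from base R. The mtcars data, the outcome am, and the 0.5 threshold are stand-in assumptions; the practical may use a different dataset.

```r
# am is coded 0 (automatic) or 1 (manual) in the built-in mtcars data
fit <- glm(am ~ wt, data = mtcars, family = binomial)
summary(fit)

# Predicted probability of a manual transmission for each car
prob <- predict(fit, type = "response")

# Turn probabilities into class labels using a 0.5 threshold
pred_class <- ifelse(prob > 0.5, 1, 0)

# Proportion of cars classified correctly (in-sample, so optimistic)
mean(pred_class == mtcars$am)
```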

Week 7

  • Supervised learning: assessing classification methods
    • Confusion matrix, errors, AUC, ROC curve
    • Cross validation on classification problems
    • Classification trees
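
A confusion matrix and the error rates derived from it can be computed with table(). This is a sketch under assumptions: it reuses a logistic regression on the built-in mtcars data and evaluates it in-sample, whereas a proper assessment would use a test set or cross-validation.

```r
# Fit a simple classifier and predict class labels with a 0.5 threshold
fit <- glm(am ~ wt, data = mtcars, family = binomial)
pred_class <- as.integer(predict(fit, type = "response") > 0.5)

# Rows are predicted classes, columns are observed classes
conf_mat <- table(
  predicted = factor(pred_class, levels = c(0, 1)),
  observed = factor(mtcars$am, levels = c(0, 1))
)
conf_mat

accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
sensitivity <- conf_mat["1", "1"] / sum(conf_mat[, "1"])  # true positive rate
specificity <- conf_mat["0", "0"] / sum(conf_mat[, "0"])  # true negative rate
c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity)
```

Repeating the calculation over a range of thresholds gives the points of a ROC curve.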

Week 8

  • HANDIN: Pass / Fail assignment
    • Find a dataset and create and assess a prediction model
    • Tip: try Google Dataset Search.
    • Format: stand-alone RStudio project folder with:
      • the dataset (csv, xlsx, sav, dat, json, or any other common format)
      • one .Rmd notebook file
      • a compiled .pdf or .html
      • a .Rproj file
    • Requirements:
      • explain the dataset in 1 or 2 paragraphs
      • use tidyverse
      • clean, legible R code (preferably following the Google style guide)
      • explain which method you use
      • assess your predictions
      • draw conclusions about your predictions
      • use plots where useful (they are almost always useful)
  • Unsupervised learning: PCA & Correspondence Analysis
    • PCA using princomp
    • Visualising PCA
    • SVD
    • Correspondence Analysis & Biplots
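
For the PCA topics above, here is a minimal sketch using princomp() from base R. The built-in USArrests data and the choice to work on the correlation matrix are assumptions made for illustration.

```r
# cor = TRUE standardises the variables, which matters when units differ
pca <- princomp(USArrests, cor = TRUE)

summary(pca)              # proportion of variance explained per component
pca$loadings              # how each variable contributes to each component
head(pca$scores[, 1:2])   # coordinates of the first states on components 1 and 2

biplot(pca)               # observations and variable directions in one plot
```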

Week 9

  • Unsupervised learning: Clustering
    • K-means clustering with kmeans()
    • Hierarchical clustering with hclust()
    • Visualising clusters in ggplot
    • Modularity clustering with igraph
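
K-means and hierarchical clustering can be sketched with kmeans() and hclust() from base R. The built-in USArrests data, the choice of three clusters, and complete linkage are illustrative assumptions.

```r
set.seed(1)  # k-means starts from random centres

# Standardise the variables so no single one dominates the distances
x <- scale(USArrests)

# K-means with 3 clusters; nstart = 20 keeps the best of 20 random starts
km <- kmeans(x, centers = 3, nstart = 20)

# Hierarchical clustering on Euclidean distances, cut into 3 clusters
hc <- hclust(dist(x), method = "complete")
hc_clusters <- cutree(hc, k = 3)

# Cross-tabulate the two solutions to see how far they agree
table(kmeans = km$cluster, hclust = hc_clusters)
```

The cluster labels themselves are arbitrary, so agreement shows up as large counts concentrated in one cell per row rather than on the diagonal.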