% For documentation, see
breaklinks=ning,pdfborder={0 0 1},backref=ning,colorlinks=ning]
\hypersetup{pdfstartview={XYZ null null 1}}
% Fonts
% For tikz
\title{\Huge Open Data Analytics and Reproducible Research}
\author{{Markus Kainu, Joona Lehtomäki, Juuso Parkkinen, Juha Yrjölä, Måns Magnusson, Mikko Tolonen, Niko Ilomäki, Leo Lahti}\\ \Large Contact: \textcolor{blue}{}}
% Define block styles
\tikzstyle{decision} = [diamond, draw, fill=blue!20, text width=4.5em, text badly centered, node distance=3cm, inner sep=0pt]
\tikzstyle{block} = [rectangle, draw, fill=blue!20, text width=5em, text centered, rounded corners, minimum height=4em]
\tikzstyle{greenbox} = [rectangle, draw=blue, fill=green!20, text width=5em, text centered, rounded corners, inner sep=10pt, inner ysep=12pt, very thick]
\tikzstyle{line} = [draw, -latex']
\tikzstyle{cloud} = [draw, ellipse, node distance=4cm, minimum height=2em, text width=5em]
\tikzstyle{mybox} = [draw=blue, fill=green!20, very thick, rectangle, rounded corners, inner sep=10pt, inner ysep=20pt]
\tikzstyle{myboxwhite} = [rectangle, rounded corners, inner sep=10pt, inner ysep=0pt]
\tikzstyle{myboxblue} = [rectangle, fill=blue!10, rounded corners, inner sep=10pt, inner ysep=20pt]
\tikzstyle{fancytitle} = [fill=white, text=black, ellipse, draw=blue]
\node [myboxwhite] (intro){%
{\bf Open data analytics} The recent explosion in open data
availability has created novel opportunities for research. Efficient
data analysis tools are crucial for taking full advantage of these
new information resources. Custom software libraries are emerging
rapidly and have great potential to transform
computational social sciences, digital humanities, and related fields
(Lazer et al. 2009).
\node [myboxblue, below of=intro, node distance=23cm, yshift=0cm, xshift=0cm] (ecosystem){%
{\bf Advantages of the open development model} Efficient data analysis
relies on custom workflows that are best developed jointly by the user
community, as already demonstrated in bioinformatics (Bioconductor),
particle physics, and other fields (rOpenSci). Similar communities are
now forming in the social sciences and humanities. The open development model
has many benefits (Ioannidis 2014, Morin et al. 2012):
\item \textbf{Efficiency}: Standard tasks can be automated, leaving
researchers more time to focus on their specific research tasks.
\item \textbf{Transparency}: Full details of the analysis from raw
data to the final results are made available, often before formal publication.
\item \textbf{Reproducibility}: Reproducible analyses can be modified
easily and repeated exactly without human intervention.
\item \textbf{Standardization}: A developer community can pool time
and skills, develop standards in data analysis and ensure
\item \textbf{Open source}: Open licensing guarantees that the tools
are freely available and can be further expanded. We use GitHub for
shared version control and open development.
%\node [myboxwhite, left of=ropengov, node distance=45cm, yshift=50cm] (flowchart){%
% Place nodes
\node [greenbox, right of=intro, node distance=26cm, yshift=0cm] (init) {Raw Data 1};
\node [greenbox, below of=init, node distance=4cm] (init2) {Raw Data 2};
\node [greenbox, below of=init2, node distance=4cm] (init3) {Raw Data 3};
\node [block, right of=init2, node distance=7cm] (tidydata) {Tidy\\Data};
\node [block, right of=tidydata, node distance=7cm] (summaries) {Statistical\\Summaries};
\node [cloud, above of=tidydata, node distance=7cm] (preprocess) {Preprocessing};
\node [block, right of=summaries, node distance=7cm] (figs) {Tables \&\\Figures};
\node [cloud, above of=summaries, node distance=6cm] (analysis) {Analysis};
\node [greenbox, right of=figs, node distance=7cm] (reporting) {Online Report\\(GitHub)};
\node [cloud, above of=figs, node distance=5cm] (visualization) {Visualization};
\node [cloud, above of=reporting, node distance=4cm] (reports) {Document Generation};
% Draw edges
\path [line] (init) -- (tidydata);
\path [line] (init2) -- (tidydata);
\path [line] (init3) -- (tidydata);
\path [line] (preprocess) -- (tidydata);
\path [line,dashed] (analysis) -- (summaries);
\path [line,dashed] (visualization) -- (figs);
\path [line,dashed] (reports) -- (reporting);
%\path [line] (init) |- (tidydata);
\path [line] (tidydata) |- (summaries);
%\draw [->[width=5cm]] (tidydata) |- (summaries)
\path [line] (summaries) |- (figs);
\path [line] (figs) |- (reporting);
\node [myboxwhite, below of=ecosystem, node distance=29cm, xshift=10cm] (example){%
% \section*{Eurostat open data: a reproducible case study}
<<lataa_shape, echo=FALSE, cache=TRUE, results="hide", message=FALSE>>=
download.file("", destfile="")
df <- get_eurostat("tgs00026", time_format = "raw")
<<page4-1, eval=TRUE, dev='pdf', fig.height=22, fig.width=22, cache=FALSE, message=FALSE, warning=FALSE, echo=FALSE>>=
# Attach the packages used for data retrieval, spatial processing, and plotting
invisible(lapply(c("eurostat","tidyr","rgdal","maptools","rgeos","ggplot2","scales","grid"), require, character.only=TRUE))
df$time <- eurotime2num(df$time)
df <- df[df$time == max(df$time),]
map <- readOGR(dsn = "./NUTS_2010_60M_SH/Data", layer = "NUTS_RG_60M_2010", verbose = FALSE)
map_nuts2 <- subset(map, STAT_LEVL_ == 2)
NUTS_ID <- as.character(map_nuts2$NUTS_ID)
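# Placeholder column, one NA per NUTS2 region
VarX <- rep(NA, length(NUTS_ID))
dat <- data.frame(NUTS_ID,VarX)
# Left join keeps every map region, including those without Eurostat values
dat2 <- merge(dat,df,by.x="NUTS_ID",by.y="geo", all.x=TRUE)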
row.names(dat2) <- dat2$NUTS_ID
row.names(map_nuts2) <- as.character(map_nuts2$NUTS_ID)
dat2 <- dat2[order(row.names(dat2)), ]
map_nuts2 <- map_nuts2[order(row.names(map_nuts2)), ]
dat2$NUTS_ID <- NULL
shape <- spCbind(map_nuts2, dat2)
shape$id <- rownames(shape@data)
map.points <- fortify(shape, region = "id")
map.df <- merge(map.points, shape, by = "id")
map.df$unit <- as.character(map.df$unit)
p <- ggplot(data=map.df, aes(long,lat,group=group))
p <- p + geom_polygon(aes(fill = values), color= alpha("white", 1/2), size=.2)
p <- p + coord_map(projection="orthographic", xlim=c(-10,45), ylim=c(25,90))
p <- p + theme(legend.position = c(0.06,0.42), legend.justification=c(0,0), legend.key.size=unit(20,'mm'),
legend.background=element_rect(colour=NA, fill=alpha("white", 2/3)),
legend.text=element_text(size=30), legend.title=element_text(size=30), title=element_text(size=30),
panel.background = element_blank(), plot.background = element_blank(), panel.grid = element_blank(),
axis.text = element_blank(), axis.title = element_blank(), axis.ticks = element_blank(),
plot.margin = unit(c(-3,-1.5, -3, -1.5), "cm"))
p + guides(fill = guide_legend(title = "Euro per Year", title.position = "top", title.hjust=0))
\node [myboxwhite, below of=example, node distance=41cm, yshift = 0cm, xshift=-7cm] (examplecaption){%
The automatically generated map above shows \textit{the disposable income of private households in NUTS2-level regions of the European Union in 2011}. This poster, including the Eurostat analysis example above, is fully reproducible; for the full source code of this poster, see \textcolor{blue}{\url{}}
\node [myboxwhite, right of=ecosystem, node distance=40cm, yshift = -19cm] (ropengov){%
{\bf Reproducible research workflow} R is a popular statistical programming language with a versatile open source ecosystem that enables completely automated analysis from raw data to the final reports. We create dedicated tools to support reproducible research in computational social science and digital humanities, helping to automate many standard data analysis tasks in these fields. Raw data sets are downloaded from the original sources, preprocessed, integrated with other information, analysed, and visualized. The results are reported in web-based documents via automated document generation. The complete workflow, including full access to every detail, is shared publicly under distributed version control (GitHub). The full source code of this poster is available at \textcolor{blue}{}.
{\bf rOpenGov} (rOpenGov core team, 2013) is an open source ecosystem
and a developer community dedicated to open data analytics for
computational social science and digital humanities. The main
components include:
\item {\bf Reproducible research blog} at
\textcolor{blue}{} highlights the
opportunities of open data analytics.
\item {\bf Online tutorials} demonstrate how to access open data
streams from the original sources and carry out specific analysis
\item {\bf R packages} are used to share computational algorithms to
support reproducible data analysis. We provide tools for open data in
various countries (Finland, Poland, Russia, USA), cities (Helsinki),
and organizations (Eurostat, PX-Web, QOG), data anonymization,
geographic information (OpenStreetMap, WFS), weather, demography,
bibliographies, media APIs, political science, elections and
parliamentary monitoring. We are thankful to a number of
developers. For a full list, see
<<eval=FALSE, echo=TRUE>>=
# Load R packages
library(rgdal)
library(ggplot2)
# Read data
map <- readOGR("NUTS_2010_60M_SH/Data")
# Draw image
p <- ggplot(data=map, aes(long,lat,group=group))
p <- p + geom_polygon(aes(fill = values), color= alpha("white", 1/2), size=.2)
# Print the map
print(p)
\node [myboxwhite, below of=ropengov, node distance=49cm, xshift=3cm] (references){%
\item J. Ioannidis (2014). How to Make More Published Research True. PLoS Medicine 11(10): e1001747.
\item D. Lazer et al. (2009). Computational Social Science. Science 323, 721--723.
\item A. Morin et al. (2012). Research priorities. Shining light into black boxes. Science 336, 159--160.
\item rOpenGov core team (2013). R ecosystem for open government data and computational social science. NIPS Machine Learning Open Source Software workshop (MLOSS). December 2013, Lake Tahoe, Nevada, US.