# Visualisation in R
We will learn how to use visualisation and transformation to explore our data in a systematic way, 
a task that statisticians call exploratory data analysis, or EDA for short. EDA is an iterative cycle. 
<br>
* Generate questions about your data.<br>
* Search for answers by visualising, transforming, and modelling your data.<br>
* Use what you learn to refine your questions and/or generate new questions.<br>


Your goal during EDA is to develop an understanding of your data. The easiest way to do this is to use questions as tools to guide your investigation. 
When you ask a question, it focuses your attention on a specific part of your dataset and helps you decide which graphs, models, or transformations to make.

In [None]:
# First we need to install and load the libraries.
# The easiest way to get ggplot2 is to install the whole tidyverse, which also includes dplyr:

#remove.packages("dplyr")
install.packages("dplyr")

#remove.packages("tidyverse")
install.packages("tidyverse")

library(ggplot2)
library(dplyr)

In [None]:
# Let's list the datasets that ship with the loaded packages
data()

In [None]:
# In this lecture we will analyse the diamonds dataset, which comes with ggplot2
diamonds

In [None]:
# Let's have a look at the data summary
summary(diamonds)

In [None]:
sum(is.na(diamonds))  # total number of missing values in the dataset (0 means there are none)

In [None]:
# You can drop the rows that contain missing values, if there are any.
# na.omit() returns a new data frame; diamonds itself is left unchanged.
na.omit(diamonds)
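A single total does not say where the missing values are. As a minimal sketch, the counts can be broken down per column with base R's `colSums()` and `complete.cases()`; the names `na_per_column` and `complete_rows` are only illustrative.

```r
library(ggplot2)  # provides the diamonds dataset

# Number of missing values in each column
na_per_column <- colSums(is.na(diamonds))
na_per_column

# Number of rows that contain no missing value at all
complete_rows <- sum(complete.cases(diamonds))
complete_rows
```

If every entry of `na_per_column` is zero, `na.omit()` has nothing to remove.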

In [None]:
# Here we show the distribution of a categorical variable, 'color', with a bar chart

# More details related to geom_bar: https://ggplot2.tidyverse.org/reference/geom_bar.html

ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = color))
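The bar heights are simply the number of rows in each category, so they can also be computed directly. The sketch below uses `dplyr::count()`; the name `color_counts` is illustrative.

```r
library(ggplot2)  # provides the diamonds dataset
library(dplyr)

# One row per colour grade with the number of diamonds in it;
# these counts correspond to the bar heights drawn by geom_bar()
color_counts <- count(diamonds, color)
color_counts
```

Reading the table next to the chart is a quick way to check exact values that are hard to judge from bar heights alone.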

To examine the distribution of a continuous variable, use a histogram:

In [None]:
ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = price), binwidth = 1000)
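The counts behind a histogram can be tabulated in a similar way. This is a sketch that assumes ggplot2's `cut_width()` helper to slice `price` into intervals of width 1000; the interval edges should roughly correspond to the bins of the plot, but that depends on the histogram's alignment settings. The names `price_bins` and `bin` are illustrative.

```r
library(ggplot2)  # provides diamonds and cut_width()
library(dplyr)

# cut_width() slices price into intervals that are 1000 wide;
# count() then tallies the diamonds that fall into each interval
price_bins <- diamonds %>%
  count(bin = cut_width(price, 1000))
price_bins
```

Changing the width, both here and in `binwidth`, is worth trying, since different bin widths can reveal different patterns.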

Bar charts and histograms help you answer questions such as:<br>
* Which values are the most common? Why? <br>
* Which values are rare? Why? Does that match your expectations? <br>
* Can you see any unusual patterns? What might explain them? <br>

In [None]:
# Scatter plot of two continuous variables: price on the x axis, carat on the y axis
ggplot(data = diamonds, mapping = aes(x = price, y = carat)) +
  geom_point()
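The diamonds dataset has 53,940 rows, so many points in a scatter plot like this are drawn on top of each other. One common remedy, sketched here, is to make the points semi-transparent so that dense regions appear darker; the value `alpha = 0.1` and the name `p` are arbitrary choices for illustration.

```r
library(ggplot2)  # provides the diamonds dataset

# Same scatter plot with semi-transparent points; alpha must lie between 0 and 1
p <- ggplot(data = diamonds, mapping = aes(x = price, y = carat)) +
  geom_point(alpha = 0.1)
p
```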

In [None]:
# Scatter plot of table against carat, with the points drawn in blue
ggplot(data = diamonds, mapping = aes(x = table, y = carat)) +
  geom_point(color = "blue")

In [None]:
# Price distribution by clarity
ggplot(diamonds, aes(factor(clarity), price)) +
  geom_boxplot(color = "gray")

In [None]:
# Price distribution by clarity: jittered points with box plots drawn on top
ggplot(diamonds, aes(factor(clarity), price)) +
  geom_jitter(alpha = 0.2, color = "green") +  # alpha must lie between 0 and 1; geom_jitter() is a shortcut for geom_point(position = "jitter")
  geom_boxplot(alpha = 0.5)
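A box plot summarises each group by its quartiles. As a rough sketch, the same numbers can be computed with `group_by()` and `summarise()`; the column names `n_diamonds`, `q1`, `median_price` and `q3` are illustrative.

```r
library(ggplot2)  # provides the diamonds dataset
library(dplyr)

# Quartiles of price within each clarity grade;
# q1 and q3 are the box edges, median_price is the line inside the box
price_by_clarity <- diamonds %>%
  group_by(clarity) %>%
  summarise(
    n_diamonds   = n(),
    q1           = quantile(price, 0.25),
    median_price = median(price),
    q3           = quantile(price, 0.75)
  )
price_by_clarity
```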

In [None]:
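# Correlation between variables.
# cor() needs numeric input, so calling it on the full dataset fails
# because cut, color and clarity are factors:
# diamonds.cor = cor(diamonds)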

In [None]:
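# Keep only the numeric columns (select_if() still works but is superseded by select(where(is.numeric)))
diamonds_numeric_columns = select_if(diamonds, is.numeric)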

In [None]:
diamonds_numeric_columns

In [None]:
# Pairwise Pearson correlations (the default method of cor()) between the numeric columns
diamonds_correlation = cor(diamonds_numeric_columns)

In [None]:
diamonds_correlation
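A full correlation matrix can be hard to scan. When one variable is of particular interest, a single column can be pulled out and sorted. The sketch below does this for `price`; it rebuilds the matrix so that it runs on its own, and the names `numeric_cols`, `cor_matrix` and `price_cor` are illustrative.

```r
library(ggplot2)  # provides the diamonds dataset
library(dplyr)

# Rebuild the correlation matrix from the numeric columns
numeric_cols <- select(diamonds, where(is.numeric))
cor_matrix <- cor(numeric_cols)

# Correlation of every numeric variable with price, strongest positive first
price_cor <- sort(cor_matrix[, "price"], decreasing = TRUE)
round(price_cor, 2)
```

The first entry is `price` itself, whose correlation with itself is 1.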

In [None]:
# We can plot the correlation matrix.
# First we need to install and load the corrplot package.
install.packages("corrplot")
library(corrplot)

In [None]:
corrplot(diamonds_correlation)
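The default plot encodes each correlation as a coloured circle. As one possible variation, the `method` and `type` arguments of `corrplot()` can print the coefficients themselves and drop the redundant lower triangle. The sketch recomputes the matrix so that it is self-contained, and the name `cor_matrix` is illustrative.

```r
library(ggplot2)   # provides the diamonds dataset
library(dplyr)
library(corrplot)

cor_matrix <- cor(select(diamonds, where(is.numeric)))

# Show the coefficients as numbers and keep only the upper triangle,
# since the matrix is symmetric
corrplot(cor_matrix, method = "number", type = "upper")
```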