<a href="https://colab.research.google.com/github/zia207/r-colab/blob/main/NoteBook/R_Beginner/01-05-04-eda-skimr-r.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![alt text](http://drive.google.com/uc?export=view&id=1bLQ3nhDbZrCCqy_WCxxckOne2lgVvn3l)


# Data Exploration with {skimr} 

This tutorial will guide you through the process of using the {skimr} package for exploratory data analysis (EDA) in R. The {skimr} package is a powerful tool that provides a comprehensive overview of your dataset, allowing you to quickly understand its structure and characteristics. It generates summary statistics and visualizations, making it easier to identify patterns, trends, and potential issues in your data.


### Introduction

{skimr} provides a frictionless approach to summary statistics that conforms to the principle of least surprise. It displays summary statistics that the user can skim quickly to understand their data. It handles different data types and returns a skim_df object that can be included in a pipeline or displayed nicely for the human reader.

![alt text](http://drive.google.com/uc?export=view&id=13ckjEndrNrruPGuHtC4MPdbUOyWCEfJk)



## Install rpy2
The `rpy2` Python package offers an easy way to run R in Colab with a Python runtime. Install it with `pip`, then load its IPython extension:

In [3]:
!pip uninstall rpy2 -y
! pip install rpy2==3.5.1
%load_ext rpy2.ipython

Found existing installation: rpy2 3.5.1
Uninstalling rpy2-3.5.1:
  Successfully uninstalled rpy2-3.5.1
Collecting rpy2==3.5.1
  Using cached rpy2-3.5.1-cp311-cp311-linux_x86_64.whl
Installing collected packages: rpy2
Successfully installed rpy2-3.5.1


## Mount Google Drive

First, create a folder named "R" in Google Drive so that installed R packages persist across sessions. Then, before installing any R packages from the Python runtime, mount Google Drive and follow the on-screen instructions:

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Check and Install Required R Packages

In [4]:
%%R
packages <- c('tidyverse',
         'skimr'
         )

In [5]:
%%R
# Install missing packages
new.packages <- packages[!(packages %in% installed.packages(lib.loc='drive/My Drive/R/')[,"Package"])]
if(length(new.packages)) install.packages(new.packages, lib='drive/My Drive/R/')

In [6]:
%%R
# Verify installation
cat("Installed packages:\n")
print(sapply(packages, requireNamespace, quietly = TRUE))

Installed packages:
tidyverse     skimr 
     TRUE     FALSE 
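
In the output above, `skimr` reports `FALSE`. A likely cause is that the check runs before the Drive folder is added to the library search path (that happens in the next section with `.libPaths()`), so `requireNamespace()` searches only the default libraries. Below is a minimal sketch of a check that looks in one specific library instead. The helper name `missing_packages()` and the package name `notARealPackage123` are hypothetical, and `.Library` (R's default system library) stands in for the Drive folder so the example runs anywhere:

```r
# Hypothetical helper: return the packages from `pkgs` that are not
# installed in the library directory `lib`.
missing_packages <- function(pkgs, lib) {
  installed <- rownames(installed.packages(lib.loc = lib))
  setdiff(pkgs, installed)
}

# .Library is R's default system library; in this notebook you would pass
# the Drive folder ('drive/My Drive/R') instead.
to_install <- missing_packages(c("stats", "notARealPackage123"), lib = .Library)
print(to_install)
```

Alternatively, calling `.libPaths('drive/My Drive/R')` before the verification cell should let `requireNamespace()` find packages installed in that folder.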


## Load Packages

In [7]:
%%R
# set library path
.libPaths('drive/My Drive/R')
# Load packages with suppressed messages
invisible(lapply(packages, function(pkg) {
  suppressPackageStartupMessages(library(pkg, character.only = TRUE))
}))


In [8]:
%%R
# Check loaded packages
cat("Successfully loaded packages:\n")
print(search()[grepl("package:", search())])

Successfully loaded packages:
 [1] "package:skimr"     "package:lubridate" "package:forcats"  
 [4] "package:stringr"   "package:dplyr"     "package:purrr"    
 [7] "package:readr"     "package:tidyr"     "package:tibble"   
[10] "package:ggplot2"   "package:tidyverse" "package:tools"    
[13] "package:stats"     "package:graphics"  "package:grDevices"
[16] "package:utils"     "package:datasets"  "package:methods"  
[19] "package:base"     


## Data


The data set used in this exercise can be downloaded from my [Dropbox](https://www.dropbox.com/scl/fo/fohioij7h503duitpl040/h?rlkey=3voumajiklwhgqw75fe8kby3o&dl=0) or from my [Github](https://github.com/zia207/r-colab/tree/main/Data/R_Beginners) account.

We will use the `read_csv()` function from the **readr** package to import the data as a tibble.

In [9]:
%%R
mf<-read_csv("https://github.com/zia207/r-colab/raw/main/Data/R_Beginners/gp_soil_data_na.csv")

Rows: 471 Columns: 19
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (4): STATE, COUNTY, NLCD, FRG
dbl (15): ID, FIPS, STATE_ID, Longitude, Latitude, SOC, DEM, Aspect, Slope, ...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
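
The message above suggests either specifying the column types or setting `show_col_types = FALSE`. The following sketch shows the first option. It assumes readr 2.0 or later (for `I()` literal data) and uses a small inline CSV with made-up values in place of the real file; only the column names `NLCD` and `SOC` come from the data set:

```r
library(readr)

# Inline CSV with made-up values standing in for the real file;
# I() tells read_csv() to treat the string as literal data.
csv_text <- I("NLCD,SOC\nForest,10.4\nShrubland,NA\nHerbaceous,5.5\n")

# Supplying col_types declares the expected type of each column and
# also silences the column specification message.
sample_df <- read_csv(
  csv_text,
  col_types = cols(NLCD = col_character(), SOC = col_double())
)

# The string "NA" is parsed as a missing value by default
print(sum(is.na(sample_df$SOC)))
```

Declaring the types up front makes the import fail loudly if a column arrives in an unexpected format, which is useful before summarizing the data.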


## Getting Started with `skim()`

`skim()` is the main function of the {skimr} package. It generates a summary of the dataset, including key statistics for each variable.

You’ll get output grouped by data type (numeric, factor, etc.), showing:

- Count of missing values

- Mean, sd, min, max, and percentiles

- Histograms (in console!)

- Unique counts (for factors)


In [10]:
%%R
mf |> dplyr::select(NLCD, SOC, DEM, MAP, MAT, NDVI) |>
  skimr::skim()

── Data Summary ────────────────────────
                           Values                      
Name                       dplyr::select(mf, NLCD, S...
Number of rows             471                         
Number of columns          6                           
_______________________                                
Column type frequency:                                 
  character                1                           
  numeric                  5                           
________________________                               
Group variables            None                        

── Variable type: character ────────────────────────────────────────────────────
  skim_variable n_missing complete_rate min max empty n_unique whitespace
1 NLCD                  0             1   6  18     0        4          0

── Variable type: numeric ──────────────────────────────────────────────────────
  skim_variable n_missing complete_rate     mean      sd      p0      p25
1 SOC        
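
Because `skim()` returns a data frame (a `skim_df`), its result can be passed on to other verbs rather than only printed. As a minimal sketch, the example below uses R's built-in `airquality` data set (an assumption made so the code runs without downloading anything) to list only the variables that contain missing values:

```r
library(skimr)
library(dplyr)

skimmed <- skim(airquality)

# Convert to a plain tibble, then keep only variables with missing values
with_missing <- skimmed |>
  as_tibble() |>
  filter(n_missing > 0) |>
  select(skim_variable, n_missing, complete_rate)

print(with_missing)
```

The same pattern should apply to `mf`: the `n_missing` and `complete_rate` columns are the ones shown in the printed summary.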

### Skim only numeric columns

In [11]:
%%R
df <- mf |>  dplyr::select(NLCD, SOC, DEM, MAP, MAT, NDVI)
skim(df[, sapply(df, is.numeric)])

── Data Summary ────────────────────────
                           Values                      
Name                       df[, sapply(df, is.numeri...
Number of rows             471                         
Number of columns          5                           
_______________________                                
Column type frequency:                                 
  numeric                  5                           
________________________                               
Group variables            None                        

── Variable type: numeric ──────────────────────────────────────────────────────
  skim_variable n_missing complete_rate     mean      sd      p0      p25
1 SOC                   4         0.992    6.35    5.05    0.408    2.77 
2 DEM                   0         1     1631.    768.    259.    1175.   
3 MAP                   0         1      499.    207.    194.     353.   
4 MAT                   0         1        8.89    4.10   -0.591    5.88 
5 N
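
Base subsetting is one way to restrict the summary to numeric columns. Two alternatives are sketched below on the built-in `iris` data set (used here only as a stand-in for `df`): passing a tidyselect helper to `skim()`, and extracting one type from a full summary with `yank()`:

```r
library(skimr)

# Option 1: select columns inside skim() with a tidyselect helper
numeric_only <- skim(iris, where(is.numeric))

# Option 2: skim everything, then pull out the numeric part
numeric_tbl <- yank(skim(iris), "numeric")

print(numeric_only)
print(numeric_tbl)
```

Option 1 avoids summarizing columns you do not need, while option 2 is convenient when the full summary has already been computed.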

### Skim individual variables or subsets

In [12]:
%%R
skim(df$SOC)

── Data Summary ────────────────────────
                           Values
Name                       df$SOC
Number of rows             471   
Number of columns          1     
_______________________          
Column type frequency:           
  numeric                  1     
________________________         
Group variables            None  

── Variable type: numeric ──────────────────────────────────────────────────────
  skim_variable n_missing complete_rate mean   sd    p0  p25  p50  p75 p100
1 data                  4         0.992 6.35 5.05 0.408 2.77 4.97 8.71 30.5
  hist 
1 ▇▃▁▁▁
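
Note that skimming a bare vector labels the variable `data`, as in the output above. A small sketch, using the built-in `iris` data set as a stand-in, shows that naming the column inside `skim()` keeps the original variable name:

```r
library(skimr)

by_vector <- skim(iris$Sepal.Length)   # vector input
by_name   <- skim(iris, Sepal.Length)  # data frame plus column name

print(by_vector$skim_variable)
print(by_name$skim_variable)
```

Keeping the variable name matters when several summaries are combined or reported together.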


### Grouped Data Summaries

In [14]:
%%R
# Group by NLCD and summarize
df |>
  group_by(NLCD) |>
  skimr::skim()

── Data Summary ────────────────────────
                           Values            
Name                       group_by(df, NLCD)
Number of rows             471               
Number of columns          6                 
_______________________                      
Column type frequency:                       
  numeric                  5                 
________________________                     
Group variables            NLCD              

── Variable type: numeric ──────────────────────────────────────────────────────
   skim_variable NLCD               n_missing complete_rate     mean      sd
 1 SOC           Forest                     0         1       10.4     6.80 
 2 SOC           Herbaceous                 1         0.993    5.48    3.93 
 3 SOC           Planted/Cultivated         0         1        6.70    3.60 
 4 SOC           Shrubland                  3         0.977    4.13    3.74 
 5 DEM           Forest                     0         1     2567.    336.   
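
A grouped summary contains one row per variable and group, so it can be reduced to a compact comparison table. The sketch below follows the same `group_by()` then `skim()` pattern on the built-in `iris` data set (a stand-in for `df`) and keeps only the group means:

```r
library(skimr)
library(dplyr)

group_means <- iris |>
  select(Species, Sepal.Length) |>
  group_by(Species) |>
  skim() |>
  as_tibble() |>
  select(Species, skim_variable, numeric.mean)

print(group_means)
```

Summary columns carry a type prefix in the underlying data frame (here `numeric.mean`), even though the printed output shows them without it.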
 

### Customizing Skim Output

`skim_with()` allows you to customize the summary statistics displayed by `skim()`. You can specify which functions to use for numeric, factor, and character data types.


In [15]:
%%R

my_skim <- skim_with(numeric = sfl(median, mad), append = FALSE)
my_skim(df)

── Data Summary ────────────────────────
                           Values
Name                       df    
Number of rows             471   
Number of columns          6     
_______________________          
Column type frequency:           
  character                1     
  numeric                  5     
________________________         
Group variables            None  

── Variable type: character ────────────────────────────────────────────────────
  skim_variable n_missing complete_rate min max empty n_unique whitespace
1 NLCD                  0             1   6  18     0        4          0

── Variable type: numeric ──────────────────────────────────────────────────────
  skim_variable n_missing complete_rate   median     mad
1 SOC                   4         0.992   NA      NA    
2 DEM                   0         1     1593.    875.   
3 MAP                   0         1      433.    152.   
4 MAT                   0         1        9.17    4.88 
5 NDVI                
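
In the output above, `median` and `mad` are `NA` for `SOC` because that column has missing values and both functions return `NA` unless told to drop them. A minimal sketch of a skimmer that removes missing values first, shown on the built-in `airquality` data set (an assumption; its `Ozone` column also contains `NA`s). The name `my_skim_narm` is hypothetical:

```r
library(skimr)

# Wrap each statistic so that missing values are dropped before computing it
my_skim_narm <- skim_with(
  numeric = sfl(
    median = function(x) stats::median(x, na.rm = TRUE),
    mad    = function(x) stats::mad(x, na.rm = TRUE)
  ),
  append = FALSE
)

result <- my_skim_narm(airquality)
print(result)
```

The `n_missing` column still reports how many values were dropped, so the robust statistics can be read alongside the amount of missing data.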

### Handling Different Data Types

In [16]:
%%R -w 450 -h 400 -u px

library(lubridate)

# Create a date column
data <- tibble(
  date = seq(as.Date("2023-01-01"), by = "month", length.out = 6),
  value = rnorm(6)
)

# Skim dates
skim(data)

── Data Summary ────────────────────────
                           Values
Name                       data  
Number of rows             6     
Number of columns          2     
_______________________          
Column type frequency:           
  Date                     1     
  numeric                  1     
________________________         
Group variables            None  

── Variable type: Date ─────────────────────────────────────────────────────────
  skim_variable n_missing complete_rate min        max        median    
1 date                  0             1 2023-01-01 2023-06-01 2023-03-16
  n_unique
1        6

── Variable type: numeric ──────────────────────────────────────────────────────
  skim_variable n_missing complete_rate   mean   sd    p0    p25     p50   p75
1 value                 0             1 -0.460 1.19 -2.80 -0.415 -0.0328 0.207
   p100 hist 
1 0.358 ▂▁▁▂▇
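
The example above covers dates and numbers. As a further sketch, the data frame below mixes factor, logical, and character columns; all column names and values are made up for illustration:

```r
library(skimr)

# Made-up data with one column per type
mixed <- data.frame(
  land_cover = factor(c("Forest", "Shrubland", "Forest", NA)),
  surveyed   = c(TRUE, FALSE, NA, TRUE),
  site_code  = c("A1", "B2", "", "C3"),
  stringsAsFactors = FALSE
)

skimmed_mixed <- skim(mixed)
print(skimmed_mixed)
```

Each type gets its own block of statistics: for example, character columns report the number of empty strings, which is counted separately from `n_missing`.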


## Summary and Conclusion

This tutorial showed how to use the {skimr} package for exploratory data analysis (EDA) in R: skimming a whole data frame, restricting the summary to numeric columns or a single variable, summarizing by group, customizing the statistics with `skim_with()`, and handling other data types such as dates. You should now be able to use {skimr} to perform EDA on your own datasets and to quickly spot missing values and other potential issues in them.

## Resources

1.  [skimr](https://docs.ropensci.org/skimr/index.html)

2.  [Introduction to skimr](https://github.com/ropensci/skimr)