<a href="https://colab.research.google.com/github/zia207/r-colab/blob/main/NoteBook/R_Beginner/01-03-04-data-wrangling-lubricate-r.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![alt text](http://drive.google.com/uc?export=view&id=1bLQ3nhDbZrCCqy_WCxxckOne2lgVvn3l)

# Data Wrangling with {lubridate}



Date and time data are often messy and inconsistent, which makes them hard to analyze. The {lubridate} package, part of the {tidyverse}, provides functions that simplify working with dates and times in R. With it you can parse dates from a variety of formats, extract components such as year, month, day, hour, minute, and second, perform calculations with dates and times, and handle time zones and daylight saving time. These capabilities make it particularly useful for data wrangling tasks such as cleaning and transforming date/time columns.

![alt text](http://drive.google.com/uc?export=view&id=1sHkaR2OpE-1vdPUPvNj_8zy4IruVzRJj)



## {lubridate} Function Reference

The {lubridate} package provides a variety of functions for working with dates and times. Here are some of the most commonly used functions, categorized by their purpose:

| Category | Functions |
|----------------------|--------------------------------------------------|
| Parsing | `ymd()`, `mdy()`, `dmy()`, `parse_date_time()` |
| Extract Parts | `year()`, `month()`, `day()`, `hour()`, `minute()`, `second()` |
| Manipulation | `make_date()`, `make_datetime()`, `floor_date()`, `ceiling_date()` |
| Timezones | `with_tz()`, `force_tz()` |
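
As a small illustration of the "Manipulation" row, the sketch below rounds a timestamp and builds a date from numeric parts. The timestamp value and the variable names are made up for illustration only; in this notebook the code would run inside a `%%R` cell.

```r
library(lubridate)

# Hypothetical timestamp, chosen only for illustration
x <- ymd_hms("2025-04-10 14:37:25", tz = "UTC")

# Round down and up to the enclosing hour
floor_hour <- floor_date(x, unit = "hour")    # 14:00:00 on the same day
ceil_hour  <- ceiling_date(x, unit = "hour")  # 15:00:00 on the same day

# Round down to the first day of the month
month_start <- floor_date(x, unit = "month")  # midnight on 2025-04-01

# Build a Date from numeric parts
built <- make_date(year = 2025, month = 4, day = 10)

print(floor_hour)
print(ceil_hour)
print(month_start)
print(built)
```

Rounding with `floor_date()` is a common way to bucket timestamps (per hour, day, or month) before grouping.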



## Setup R in Python Runtime - Install {rpy2}

{rpy2} is a Python package that provides an interface to the R programming language, allowing Python users to run R code, call R functions, and manipulate R objects directly from Python. It enables seamless integration between Python and R, leveraging R's statistical and graphical capabilities while using Python's flexibility. The package supports passing data between the two languages and is widely used for statistical analysis, data visualization, and machine learning tasks that benefit from R's specialized libraries.

In [1]:
!pip uninstall rpy2 -y
! pip install rpy2==3.5.1
%load_ext rpy2.ipython

Found existing installation: rpy2 3.5.17
Uninstalling rpy2-3.5.17:
  Successfully uninstalled rpy2-3.5.17
Collecting rpy2==3.5.1
  Downloading rpy2-3.5.1.tar.gz (201 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 201.7/201.7 kB 3.9 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: rpy2
  Building wheel for rpy2 (setup.py) ... done
  Created wheel for rpy2: filename=rpy2-3.5.1-cp311-cp311-linux_x86_64.whl size=314976 sha256=136fb7228deaae2cca9619c32adc71f692787ca6ba858058b4e8baeeeb30a91e
  Stored in directory: /root/.cache/pip/wheels/e9/55/d1/47be85a5f3f1e1f4d1e91cb5e3a4dcb40dd72147f184c5a5ef
Successfully built rpy2
Installing collected packages: rpy2
Successfully installed rpy2-3.5.1


##  Mount Google Drive

Before installing R packages in the Python runtime, create a folder named "R" in Google Drive so that installed packages persist across sessions. Then mount Google Drive and follow the on-screen instructions:

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Check and Install Required R Packages

In [3]:
%%R
packages <- c(
             'tidyverse'
)

In [None]:
%%R
# Install missing packages
new.packages <- packages[!(packages %in% installed.packages(lib='drive/My Drive/R/')[,"Package"])]
if(length(new.packages)) install.packages(new.packages, lib='drive/My Drive/R/')

# Verify installation
cat("Installed packages:\n")
print(sapply(packages, requireNamespace, quietly = TRUE))

## Load Packages

In [4]:
%%R
# set library path
.libPaths('drive/My Drive/R')
# Load packages with suppressed messages
invisible(lapply(packages, function(pkg) {
  suppressPackageStartupMessages(library(pkg, character.only = TRUE))
}))

In [5]:
%%R
# Check loaded packages
cat("Successfully loaded packages:\n")
print(search()[grepl("package:", search())])

Successfully loaded packages:
 [1] "package:lubridate" "package:forcats"   "package:stringr"  
 [4] "package:dplyr"     "package:purrr"     "package:readr"    
 [7] "package:tidyr"     "package:tibble"    "package:ggplot2"  
[10] "package:tidyverse" "package:tools"     "package:stats"    
[13] "package:graphics"  "package:grDevices" "package:utils"    
[16] "package:datasets"  "package:methods"   "package:base"     


## Data

Let’s simulate a small dataset with messy date formats.

In [6]:
%%R
set.seed(123)
df <- tibble(
  id = 1:10,
  name = sample(c("Alice", "Bob", "Carol"), 10, replace = TRUE),
  raw_date = sample(c("2025-04-10", "10/04/2025", "April 10, 2025"), 10, replace = TRUE),
  timestamp = sample(seq(
    as.POSIXct("2025-04-10 08:00"),
    as.POSIXct("2025-04-10 18:00"),
    by = "1 hour"
  ), 10, replace = TRUE)
)

print(df)

# A tibble: 10 × 4
      id name  raw_date       timestamp          
   <int> <chr> <chr>          <dttm>             
 1     1 Carol 10/04/2025     2025-04-10 16:00:00
 2     2 Carol 10/04/2025     2025-04-10 10:00:00
 3     3 Carol 2025-04-10     2025-04-10 15:00:00
 4     4 Bob   10/04/2025     2025-04-10 17:00:00
 5     5 Carol April 10, 2025 2025-04-10 14:00:00
 6     6 Bob   2025-04-10     2025-04-10 17:00:00
 7     7 Bob   April 10, 2025 2025-04-10 16:00:00
 8     8 Bob   April 10, 2025 2025-04-10 10:00:00
 9     9 Carol 2025-04-10     2025-04-10 11:00:00
10    10 Alice 2025-04-10     2025-04-10 08:00:00


## Parse Dates and Times

We’ll clean the inconsistent `raw_date` column and extract useful features.

In [7]:
%%R
df_clean <- df %>%
  mutate(
    parsed_date = lubridate::parse_date_time(raw_date, orders = c("ymd", "dmy", "B d, Y")),
    year = lubridate::year(parsed_date),
    month = lubridate::month(parsed_date, label = TRUE),
    day = lubridate::day(parsed_date),
    weekday = lubridate::wday(parsed_date, label = TRUE),
    hour = lubridate::hour(timestamp),
    minute = lubridate::minute(timestamp)
  )

glimpse(df_clean)

Rows: 10
Columns: 11
$ id          <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
$ name        <chr> "Carol", "Carol", "Carol", "Bob", "Carol", "Bob", "Bob", "…
$ raw_date    <chr> "10/04/2025", "10/04/2025", "2025-04-10", "10/04/2025", "A…
$ timestamp   <dttm> 2025-04-10 16:00:00, 2025-04-10 10:00:00, 2025-04-10 15:00…
$ parsed_date <dttm> 2025-04-10, 2025-04-10, 2025-04-10, 2025-04-10, 2025-04-1…
$ year        <dbl> 2025, 2025, 2025, 2025, 2025, 2025, 2025, 2025, 2025, 2025
$ month       <ord> Apr, Apr, Apr, Apr, Apr, Apr, Apr, Apr, Apr, Apr
$ day         <int> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10
$ weekday     <ord> Thu, Thu, Thu, Thu, Thu, Thu, Thu, Thu, Thu, Thu
$ hour        <int> 16, 10, 15, 17, 14, 17, 16, 10, 11, 8
$ minute      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
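
A string such as `"10/04/2025"` is ambiguous: it can be read as day-month-year or month-day-year, and the `orders` argument decides which reading wins. The sketch below, using only the three example strings from this tutorial, shows the two readings side by side and also converts the `POSIXct` result of `parse_date_time()` to a plain `Date`. The variable names are illustrative.

```r
library(lubridate)

ambiguous <- "10/04/2025"

as_dmy <- dmy(ambiguous)  # day-month-year reading: 10 April 2025
as_mdy <- mdy(ambiguous)  # month-day-year reading: 4 October 2025

print(as_dmy)
print(as_mdy)

# parse_date_time() returns POSIXct (a date-time at midnight);
# as_date() drops the time-of-day part when only the date is needed
mixed  <- c("2025-04-10", "10/04/2025", "April 10, 2025")
parsed <- parse_date_time(mixed, orders = c("ymd", "dmy", "B d, Y"))
only_dates <- as_date(parsed)

print(class(parsed))
print(class(only_dates))
```

If the data were known to use month-day-year, replacing `"dmy"` with `"mdy"` in `orders` would be the corresponding change; mixing both in one column cannot be resolved reliably.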


## Common Wrangling Tasks with `lubridate`

### Reformat Dates

In [8]:
%%R
df_clean <- df_clean %>%
  mutate(date_reformatted = format(parsed_date, "%d-%b-%Y"))
head(df_clean)

# A tibble: 6 × 12
     id name  raw_date timestamp           parsed_date          year month   day
  <int> <chr> <chr>    <dttm>              <dttm>              <dbl> <ord> <int>
1     1 Carol 10/04/2… 2025-04-10 16:00:00 2025-04-10 00:00:00  2025 Apr      10
2     2 Carol 10/04/2… 2025-04-10 10:00:00 2025-04-10 00:00:00  2025 Apr      10
3     3 Carol 2025-04… 2025-04-10 15:00:00 2025-04-10 00:00:00  2025 Apr      10
4     4 Bob   10/04/2… 2025-04-10 17:00:00 2025-04-10 00:00:00  2025 Apr      10
5     5 Carol April 1… 2025-04-10 14:00:00 2025-04-10 00:00:00  2025 Apr      10
6     6 Bob   2025-04… 2025-04-10 17:00:00 2025-04-10 00:00:00  2025 Apr      10
# ℹ 4 more variables: weekday <ord>, hour <int>, minute <int>,
#   date_reformatted <chr>
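
Note that `format()` returns a character vector (the `<chr>` type shown above), so the reformatted column is a label rather than a date. A minimal sketch, assuming a numeric format string to avoid locale-dependent month names, shows the round trip back to a `Date`:

```r
library(lubridate)

d <- ymd("2025-04-10")

# format() produces text; date functions no longer apply to it directly
label <- format(d, "%d/%m/%Y")
print(label)         # "10/04/2025"
print(class(label))  # "character"

# Parse the label again to recover a Date for further calculations
recovered <- dmy(label)
print(recovered == d)  # TRUE
```

Keeping the original date column alongside the formatted label avoids having to re-parse later.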


### Filter Data for Specific Times

In [9]:
%%R
# Keep rows whose timestamp hour is greater than 12 (13:00 or later)
df_clean %>% filter(hour > 12)

# A tibble: 6 × 12
     id name  raw_date timestamp           parsed_date          year month   day
  <int> <chr> <chr>    <dttm>              <dttm>              <dbl> <ord> <int>
1     1 Carol 10/04/2… 2025-04-10 16:00:00 2025-04-10 00:00:00  2025 Apr      10
2     3 Carol 2025-04… 2025-04-10 15:00:00 2025-04-10 00:00:00  2025 Apr      10
3     4 Bob   10/04/2… 2025-04-10 17:00:00 2025-04-10 00:00:00  2025 Apr      10
4     5 Carol April 1… 2025-04-10 14:00:00 2025-04-10 00:00:00  2025 Apr      10
5     6 Bob   2025-04… 2025-04-10 17:00:00 2025-04-10 00:00:00  2025 Apr      10
6     7 Bob   April 1… 2025-04-10 16:00:00 2025-04-10 00:00:00  2025 Apr      10
# ℹ 4 more variables: weekday <ord>, hour <int>, minute <int>,
#   date_reformatted <chr>
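
Filtering on the extracted hour works here because every timestamp falls exactly on the hour. With finer-grained data, `hour > 12` would drop a time such as 12:30. The sketch below uses a small hypothetical `events` table (not part of the tutorial data) to contrast that with comparing against a full timestamp.

```r
library(dplyr)
library(lubridate)

# Hypothetical timestamps, for illustration only
events <- tibble(
  id = 1:4,
  timestamp = ymd_hm(
    c("2025-04-10 11:45", "2025-04-10 12:00",
      "2025-04-10 12:30", "2025-04-10 13:15"),
    tz = "UTC"
  )
)

# hour > 12 keeps only 13:00 onwards, so 12:30 is dropped
late <- events %>% filter(hour(timestamp) > 12)

# Comparing against a full timestamp keeps everything strictly after noon
noon <- ymd_hm("2025-04-10 12:00", tz = "UTC")
after_noon <- events %>% filter(timestamp > noon)

print(late$id)        # 4
print(after_noon$id)  # 3 4
```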


### Grouping by Weekday

In [10]:
%%R
df_clean %>%
  group_by(weekday) %>%
  summarise(entries = n())

# A tibble: 1 × 2
  weekday entries
  <ord>     <int>
1 Thu          10
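
All ten rows share the same date, so the summary above has a single group. To see the grouping spread across weekdays, the sketch below uses a hypothetical `visits` table covering ten consecutive days starting on Monday, 7 April 2025, and sets `week_start = 1` so that weeks begin on Monday.

```r
library(dplyr)
library(lubridate)

# Hypothetical data: ten consecutive days starting on a Monday
visits <- tibble(date = ymd("2025-04-07") + days(0:9))

by_weekday <- visits %>%
  mutate(weekday = wday(date, label = TRUE, week_start = 1)) %>%
  count(weekday, name = "entries")

# Monday to Wednesday occur twice, the remaining weekdays once
print(by_weekday)
```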


### Construct New Datetime from Parts

In [11]:
%%R
df_clean <- df_clean %>%
  mutate(full_datetime = make_datetime(year, month(parsed_date), day, hour = hour, min = minute))

print(df_clean$full_datetime)

 [1] "2025-04-10 16:00:00 UTC" "2025-04-10 10:00:00 UTC"
 [3] "2025-04-10 15:00:00 UTC" "2025-04-10 17:00:00 UTC"
 [5] "2025-04-10 14:00:00 UTC" "2025-04-10 17:00:00 UTC"
 [7] "2025-04-10 16:00:00 UTC" "2025-04-10 10:00:00 UTC"
 [9] "2025-04-10 11:00:00 UTC" "2025-04-10 08:00:00 UTC"
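
The results above are labelled UTC because `make_datetime()` uses `tz = "UTC"` unless told otherwise. A minimal sketch of passing the `tz` argument explicitly, with illustrative variable names:

```r
library(lubridate)

# Default time zone is UTC
utc_default <- make_datetime(2025, 4, 10, hour = 16, min = 0)

# Same clock reading, interpreted in Tokyo
tokyo <- make_datetime(2025, 4, 10, hour = 16, min = 0, tz = "Asia/Tokyo")

print(tz(utc_default))  # "UTC"
print(tz(tokyo))        # "Asia/Tokyo"

# The two values are different instants: 16:00 in Tokyo is 9 hours
# earlier than 16:00 UTC
gap_hours <- as.numeric(difftime(utc_default, tokyo, units = "hours"))
print(gap_hours)  # 9
```

If the original parts came from local-time timestamps, passing the matching `tz` keeps the rebuilt values aligned with the source column.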


### Change Timezones



In [12]:
%%R
df_clean <- df_clean %>%
  mutate(
    full_datetime_utc = with_tz(full_datetime, tzone = "UTC"),
    full_datetime_tokyo = with_tz(full_datetime, tzone = "Asia/Tokyo")
  )
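
`with_tz()` and `force_tz()` are easy to confuse. The sketch below, using a single illustrative timestamp, shows that `with_tz()` keeps the instant and changes how it is displayed, whereas `force_tz()` keeps the clock reading and changes the instant.

```r
library(lubridate)

x <- ymd_hms("2025-04-10 16:00:00", tz = "UTC")

# Same moment in time, displayed on a Tokyo clock (UTC+9)
shown_in_tokyo <- with_tz(x, tzone = "Asia/Tokyo")

# Same clock reading, reinterpreted as Tokyo time (a different moment)
relabelled <- force_tz(x, tzone = "Asia/Tokyo")

print(shown_in_tokyo)       # 2025-04-11 01:00:00 JST
print(relabelled)           # 2025-04-10 16:00:00 JST
print(shown_in_tokyo == x)  # TRUE
print(relabelled == x)      # FALSE
```

Use `with_tz()` to present existing timestamps to users in another zone, and `force_tz()` to correct timestamps that were recorded with the wrong zone label.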


## Summary and Conclusions

In this tutorial, we covered the basics of using the {lubridate} package for date and time manipulation in R. You should now be able to:

-   Parse inconsistent date formats
-   Extract year, month, weekday, hour, minute, etc.
-   Filter/group by date and time
-   Construct and manipulate datetime values
-   Handle timezones and rounding


## Resources

-   [lubridate documentation](https://lubridate.tidyverse.org/)