<a href="https://colab.research.google.com/github/zia207/r-colab/blob/main/NoteBook/R_Beginner/01-03-03-data-wrangling-janitor-r.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![alt text](http://drive.google.com/uc?export=view&id=1bLQ3nhDbZrCCqy_WCxxckOne2lgVvn3l)

# Data Wrangling with {janitor}



The [janitor](https://github.com/sfirke/janitor) package is a popular R package for examining and cleaning messy data. Its main functions format data frame column names, create and format frequency tables of one, two, or three variables, and provide other tools for cleaning and examining data frames, such as finding duplicate records and removing empty rows and columns. The package is designed to be user-friendly and is optimized for those who are new to R, while advanced R users can use it to perform the same tasks more quickly. Used alongside the tidyverse, janitor can save considerable time and effort in data cleaning and preparation.

![alt text](http://drive.google.com/uc?export=view&id=1sHkaR2OpE-1vdPUPvNj_8zy4IruVzRJj)



### Some common functions

Here are some common functions provided by the **`janitor`** package:

**Cleaning Functions:**

1.  `clean_names()`: Renames columns, by default converting them to snake_case and removing or replacing special characters.

2.  `remove_empty()`: Removes rows and columns that are entirely empty.

3.  `remove_constant()`: Removes columns that have constant values throughout.

**Data Frame Manipulation:**

1.  `get_dupes()`: Finds duplicate rows in a data frame.

2.  `tabyl()`: Generates frequency tables (similar to `table()` but returns a data frame).

3.  `adorn_totals()`: Adds total rows or columns to a data frame.

Several functions that are often used alongside janitor come from other packages. The package prefix below shows where each one lives:

**Column Operations:**

1.  `tibble::add_column()`: Adds new columns to a data frame.

2.  `dplyr::recode_factor()`: Modifies levels of a factor variable.

**Missing Values Handling:**

1.  `tidyr::drop_na()`: Removes rows containing missing values.

2.  `tidyr::replace_na()`: Replaces missing values with specified values.

**Factor Variables:**

1.  `as.factor()` (base R): Converts a vector to a factor.

**Other Useful Functions:**

1.  `janitor::row_to_names()`: Converts a row to column names.

2.  `tidyr::crossing()`: Creates all combinations of the supplied vectors or data frames.

Using these functions, you can perform various data cleaning and transformation tasks.
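
`tabyl()` and `adorn_totals()` are listed above but not used in the rest of this tutorial, so here is a minimal sketch of how they fit together. The data frame `toy` and its columns are hypothetical, and the code assumes janitor is already installed and loaded (in this notebook, start the cell with `%%R` after completing the setup steps below).

```r
library(janitor)

# Hypothetical toy data, not part of the tutorial dataset
toy <- data.frame(
  region = c("South", "South", "West", "West", "West"),
  urban  = c("yes", "no", "yes", "yes", "no")
)

# One-way frequency table: counts and proportions per region
one_way <- tabyl(toy, region)
print(one_way)

# Two-way table of region by urban, with a "Total" row appended
two_way <- toy |>
  tabyl(region, urban) |>
  adorn_totals("row")
print(two_way)
```

Unlike base `table()`, both results are data frames, so they can be passed on to other data frame functions.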



## Install rpy2

An easy way to run R in Colab with a Python runtime is the **rpy2** Python package. We install it with the `pip` command and then load its IPython extension:

In [None]:
!pip uninstall rpy2 -y
! pip install rpy2==3.5.1
%load_ext rpy2.ipython

##  Mount Google Drive

Next, create a folder named "R" in Google Drive so that installed packages persist between sessions. Before installing R packages in the Python runtime, mount Google Drive and follow the on-screen instructions:

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Check and Install Required R Packages

In [None]:
%%R
packages <- c(
          'tidyverse',
          'janitor'
)

In [None]:
%%R
# Install missing packages into the Drive-backed library
lib_dir <- 'drive/My Drive/R/'
new.packages <- packages[!(packages %in% installed.packages(lib.loc = lib_dir)[,"Package"])]
if(length(new.packages)) install.packages(new.packages, lib = lib_dir)

# Verify installation (add the Drive library to the search path first)
.libPaths(lib_dir)
cat("Installed packages:\n")
print(sapply(packages, requireNamespace, quietly = TRUE))

## Load Packages

In [None]:
%%R
# set library path
.libPaths('drive/My Drive/R')
# Load packages with suppressed messages
invisible(lapply(packages, function(pkg) {
  suppressPackageStartupMessages(library(pkg, character.only = TRUE))
}))

In [None]:
%%R
# Check loaded packages
cat("Successfully loaded packages:\n")
print(search()[grepl("package:", search())])

## Some Important Functions

We will create some "bad" data and clean it with janitor. We will apply the following functions:

-   `clean_names()`

-   `remove_empty()`

-   `get_dupes()`


### clean_names()

The `clean_names()` function is used to clean column names in a data frame. It converts the column names to lowercase and replaces all spaces and special characters with underscores.

In [None]:
%%R
# Create a data frame with messy column names
df <- data.frame("Column One" = 1:5,
                 "Column Two!!" = 6:10,
                 "Column Three $" = 11:15,
                 "%Column four" = 11:15)
head(df)

  Column.One Column.Two.. Column.Three.. X.Column.four
1          1            6             11            11
2          2            7             12            12
3          3            8             13            13
4          4            9             14            14
5          5           10             15            15


In [None]:
%%R
df  |>
  janitor::clean_names() |>
  glimpse()

Rows: 5
Columns: 4
$ column_one    <int> 1, 2, 3, 4, 5
$ column_two    <int> 6, 7, 8, 9, 10
$ column_three  <int> 11, 12, 13, 14, 15
$ x_column_four <int> 11, 12, 13, 14, 15
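
Note that `data.frame()` had already altered the names in the example above (for instance `Column.One`), because by default it passes them through `make.names()`. The sketch below, which assumes only that janitor is installed, sets `check.names = FALSE` so that `clean_names()` receives the truly messy names. The object names `messy` and `cleaned` are illustrative.

```r
library(janitor)

# Keep the raw, non-syntactic column names by disabling check.names
messy <- data.frame(
  "Column One"     = 1:3,
  "Column Two!!"   = 4:6,
  "Column Three $" = 7:9,
  "%Column four"   = 10:12,
  check.names = FALSE
)
names(messy)

# clean_names() turns them into lowercase, underscore-separated names
cleaned <- clean_names(messy)
names(cleaned)
```

The cleaned names contain only lowercase letters, digits, and underscores, so they can be used in code without backticks.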


### remove_empty()

The `remove_empty()` function is used to remove rows or columns that contain only missing or empty values.

In [None]:
%%R
df <-  data.frame(x = c(1,NA,4),
                    y = c(NA,NA,3),
                    z = c(NA, NA, NA))

head(df)

   x  y  z
1  1 NA NA
2 NA NA NA
3  4  3 NA


In [None]:
%%R
df |>
  janitor::remove_empty(c("rows","cols")) |>
  glimpse()

Rows: 2
Columns: 2
$ x <dbl> 1, 4
$ y <dbl> NA, 3
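
The `which` argument controls whether empty rows, empty columns, or both are dropped. As a small sketch using the same toy data (the object names `rows_only` and `cols_only` are illustrative), each option can be applied on its own:

```r
library(janitor)

df <- data.frame(x = c(1, NA, 4),
                 y = c(NA, NA, 3),
                 z = c(NA, NA, NA))

# Drop only rows in which every value is missing (row 2)
rows_only <- remove_empty(df, which = "rows")
dim(rows_only)

# Drop only columns in which every value is missing (column z)
cols_only <- remove_empty(df, which = "cols")
dim(cols_only)
```

Removing only one dimension is useful when, for example, an all-`NA` column must be kept as a placeholder for later data.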


### get_dupes()

The `get_dupes()` function is used to find duplicate rows in a data frame.

In [None]:
%%R
# Create a data frame with duplicate rows
df <- data.frame("Column One" = c(1, 2, 3, 1), "Column Two" = c("A", "B", "C", "A"))
head(df)

  Column.One Column.Two
1          1          A
2          2          B
3          3          C
4          1          A


In [None]:
%%R
get_dupes(df)





  Column.One Column.Two dupe_count
1          1          A          2
2          1          A          2


## Cleaning a Messy Dataset
Now, we will clean up a messy dataset using some functions from the janitor package. We will use lung cancer mortality data from the USA.
All datasets used in this exercise can be downloaded from my [Dropbox](https://www.dropbox.com/scl/fo/fohioij7h503duitpl040/h?rlkey=3voumajiklwhgqw75fe8kby3o&dl=0) or [GitHub](https://github.com/zia207/r-colab/tree/main/Data/R_Beginners) account.



We will use the `read_csv()` function of the **readr** package to import the data from GitHub as a tibble.

In [None]:
%%R
mf = read_csv("https://github.com/zia207/r-colab/raw/main/Data/R_Beginners/USA_LBC_Data_raw.csv")


In [None]:
%%R
glimpse(mf)

Rows: 3,118
Columns: 26
$ `Lung Cancer Moratlity Rates and Risk in USA, Data Provider: Zia Ahmed` <chr> …
$ ...2                                                                    <chr> …
$ ...3                                                                    <chr> …
$ ...4                                                                    <chr> …
$ ...5                                                                    <chr> …
$ ...6                                                                    <chr> …
$ ...7                                                                    <chr> …
$ ...8                                                                    <chr> …
$ ...9                                                                    <chr> …
$ ...10                                                                   <chr> …
$ ...11                                                                   <chr> …
$ ...12                                                                   

Data files often contain some text at the top of the spreadsheet before the actual data begin. Here, the first line of the file is a title, so `read_csv()` used it as the column heading, and the real column names ended up in the first row of the data frame. We want that first row to become the column headings. To achieve this, we will use the `row_to_names()` function. It takes the following arguments: the data frame, the row number that holds the column names, whether that row should be deleted from the data, and whether the rows above it should be deleted from the data.

In [None]:
%%R
mf.01 = mf |>
  janitor::row_to_names(1, remove_row = TRUE, remove_rows_above = TRUE)  |>
  glimpse()

Rows: 3,117
Columns: 26
$ REGION_ID            <chr> "3", "3", "3", "3", NA, NA, "3", "3", "3", "3", "…
$ STATE                <chr> "Alabama", "Alabama", "Alabama", "Alabama", NA, N…
$ County               <chr> "Baldwin County", "Butler County", "Butler County…
$ `Empty Column 1`     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ X                    <chr> "789777.5039", "877731.5725", "877731.5725", "984…
$ Y                    <chr> "884557.0795", "1007285.71", "1007285.71", "11486…
$ Fips                 <chr> "1003", "1013", "1013", "1017", NA, NA, "1023", "…
$ `Empty_Column 2`     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ `LCB Mortality Rate` <chr> "48.1", "38.3", "38.3", "49.6", NA, NA, "31.8", "…
$ Smoking              <chr> "20.8", "26", "26", "25.1", NA, NA, "21.8", "22.6…
$ `PM  25`             <chr> "7.89", "8.46", "8.46", "8.87", NA, NA, "8.58", "…
$ NO2                  <chr> "0.7939", "0.6344", "0.6344", "0.8442", NA, NA, "…
$ SO2           
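
To make the role of each argument concrete, here is a small sketch on hypothetical data in which the real header sits in the second row, below a title line. The data frame `raw` and its contents are invented for illustration.

```r
library(janitor)

# Row 1 is a title, row 2 holds the real column names
raw <- data.frame(
  V1 = c("Survey export, 2020", "id", "1", "2"),
  V2 = c(NA, "score", "10", "20")
)

# Promote row 2 to column names, then drop it and the title row above it
tidy <- row_to_names(raw, row_number = 2,
                     remove_row = TRUE, remove_rows_above = TRUE)
print(tidy)
```

The values stay as character strings after this step, which is why a type conversion is still needed later in the workflow.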

The data still contain some empty columns and empty rows. We remove them with the `remove_empty()` function:




In [None]:
%%R
mf.02 = mf.01 |>
  janitor::remove_empty()  |>
  glimpse()





Rows: 3,110
Columns: 24
$ REGION_ID            <chr> "3", "3", "3", "3", "3", "3", "3", "3", "3", "3",…
$ STATE                <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alab…
$ County               <chr> "Baldwin County", "Butler County", "Butler County…
$ X                    <chr> "789777.5039", "877731.5725", "877731.5725", "984…
$ Y                    <chr> "884557.0795", "1007285.71", "1007285.71", "11486…
$ Fips                 <chr> "1003", "1013", "1013", "1017", "1023", "1025", "…
$ `LCB Mortality Rate` <chr> "48.1", "38.3", "38.3", "49.6", "31.8", "42", "53…
$ Smoking              <chr> "20.8", "26", "26", "25.1", "21.8", "22.6", "21.2…
$ `PM  25`             <chr> "7.89", "8.46", "8.46", "8.87", "8.58", "8.42", "…
$ NO2                  <chr> "0.7939", "0.6344", "0.6344", "0.8442", "0.5934",…
$ SO2                  <chr> "0.035343", "0.0135", "0.0135", "0.048177", "0.02…
$ Ozone                <chr> "39.79", "38.31", "38.31", "40.1", "37.07", "37.6…
$ `Pop 65`      

Next, we fix the column headings using `clean_names()`. It converts the column names to lowercase and replaces spaces and special characters with underscores.

In [None]:
%%R
mf.03 = mf.02  |>
  janitor::clean_names()  |>
  glimpse()

Rows: 3,110
Columns: 24
$ region_id          <chr> "3", "3", "3", "3", "3", "3", "3", "3", "3", "3", "…
$ state              <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabam…
$ county             <chr> "Baldwin County", "Butler County", "Butler County",…
$ x                  <chr> "789777.5039", "877731.5725", "877731.5725", "98421…
$ y                  <chr> "884557.0795", "1007285.71", "1007285.71", "1148648…
$ fips               <chr> "1003", "1013", "1013", "1017", "1023", "1025", "10…
$ lcb_mortality_rate <chr> "48.1", "38.3", "38.3", "49.6", "31.8", "42", "53.7…
$ smoking            <chr> "20.8", "26", "26", "25.1", "21.8", "22.6", "21.2",…
$ pm_25              <chr> "7.89", "8.46", "8.46", "8.87", "8.58", "8.42", "8.…
$ no2                <chr> "0.7939", "0.6344", "0.6344", "0.8442", "0.5934", "…
$ so2                <chr> "0.035343", "0.0135", "0.0135", "0.048177", "0.0239…
$ ozone              <chr> "39.79", "38.31", "38.31", "40.1", "37.07", "37.68"…
$ pop_65        

All columns were imported into R as `chr`. We will convert columns 4 to 21 with `as.numeric` and columns 22 to 24 with `as.factor`, using the `dplyr::mutate_at()` function:

In [None]:
%%R
mf.04= mf.03  |>
     dplyr::mutate_at(4:21, as.numeric)  |>
     dplyr::mutate_at(22:24, as.factor)  |>
     glimpse()

Rows: 3,110
Columns: 24
$ region_id          <chr> "3", "3", "3", "3", "3", "3", "3", "3", "3", "3", "…
$ state              <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabam…
$ county             <chr> "Baldwin County", "Butler County", "Butler County",…
$ x                  <dbl> 789777.5, 877731.6, 877731.6, 984214.7, 726606.5, 7…
$ y                  <dbl> 884557.1, 1007285.7, 1007285.7, 1148648.7, 1023615.…
$ fips               <dbl> 1003, 1013, 1013, 1017, 1023, 1025, 1031, 1035, 103…
$ lcb_mortality_rate <dbl> 48.1, 38.3, 38.3, 49.6, 31.8, 42.0, 53.7, 46.9, 65.…
$ smoking            <dbl> 20.8, 26.0, 26.0, 25.1, 21.8, 22.6, 21.2, 24.9, 25.…
$ pm_25              <dbl> 7.89, 8.46, 8.46, 8.87, 8.58, 8.42, 8.42, 8.23, 8.2…
$ no2                <dbl> 0.7939, 0.6344, 0.6344, 0.8442, 0.5934, 0.6432, 0.5…
$ so2                <dbl> 0.035343, 0.013500, 0.013500, 0.048177, 0.023989, 0…
$ ozone              <dbl> 39.79, 38.31, 38.31, 40.10, 37.07, 37.68, 38.46, 37…
$ pop_65        
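
`mutate_at()` still works, but dplyr (since version 1.0.0) marks it as superseded in favour of `across()`. As a sketch of the equivalent pattern on a hypothetical four-column data frame (`toy` and its column names are illustrative, not from the tutorial dataset):

```r
library(dplyr)

# Every column starts as character, as in the imported data
toy <- data.frame(
  id  = c("a", "b"),
  x   = c("1.5", "2"),
  y   = c("3", "4"),
  grp = c("u", "v")
)

# Select columns by position: 2 to 3 become numeric, 4 becomes a factor
typed <- toy |>
  mutate(across(2:3, as.numeric)) |>
  mutate(across(4, as.factor))
str(typed)
```

Selecting by position is concise but fragile if the column order changes; `across()` also accepts column names or helpers such as `where(is.character)`.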

Now we will check for duplicate records in these data with the `get_dupes()` function:

In [None]:
%%R
mf.04 |> janitor::get_dupes(fips)

# A tibble: 6 × 25
   fips dupe_count region_id state    county         x      y lcb_mortality_rate
  <dbl>      <int> <chr>     <chr>    <chr>      <dbl>  <dbl>              <dbl>
1  1013          2 3         Alabama  Butler C… 8.78e5 1.01e6               38.3
2  1013          2 3         Alabama  Butler C… 8.78e5 1.01e6               38.3
3  1053          2 3         Alabama  Escambia… 8.39e5 9.34e5               58.3
4  1053          2 3         Alabama  Escambia… 8.39e5 9.34e5               58.3
5  5011          2 3         Arkansas Bradley … 3.54e5 1.16e6               69.9
6  5011          2 3         Arkansas Bradley … 3.54e5 1.16e6               69.9
# ℹ 17 more variables: smoking <dbl>, pm_25 <dbl>, no2 <dbl>, so2 <dbl>,
#   ozone <dbl>, pop_65 <dbl>, pop_black <dbl>, pop_hipanic <dbl>,
#   pop_white <dbl>, education <dbl>, poverty_percent <dbl>,
#   income_equality <dbl>, uninsured <dbl>, dem <dbl>, radon_zone_class <fct>,
#   urban_rural <fct>, coal_production <fct>


As shown above, the data frame is filtered down to the rows with duplicate values in the `fips` column. To remove these duplicate rows, we use `dplyr::distinct()` with `.keep_all = TRUE`, which keeps the first row for each `fips` value together with all other columns.

We then check again for duplicate records:



In [None]:
%%R
# Keep the first row for each FIPS code
mf.05 = mf.04 |>
     dplyr::distinct(fips, .keep_all = TRUE)

# Confirm that no duplicated FIPS codes remain
mf.05 |>
     janitor::get_dupes(fips) |>
     glimpse()




Rows: 0
Columns: 25
$ fips               <dbl> 
$ dupe_count         <int> 
$ region_id          <chr> 
$ state              <chr> 
$ county             <chr> 
$ x                  <dbl> 
$ y                  <dbl> 
$ lcb_mortality_rate <dbl> 
$ smoking            <dbl> 
$ pm_25              <dbl> 
$ no2                <dbl> 
$ so2                <dbl> 
$ ozone              <dbl> 
$ pop_65             <dbl> 
$ pop_black          <dbl> 
$ pop_hipanic        <dbl> 
$ pop_white          <dbl> 
$ education          <dbl> 
$ poverty_percent    <dbl> 
$ income_equality    <dbl> 
$ uninsured          <dbl> 
$ dem                <dbl> 
$ radon_zone_class   <fct> 
$ urban_rural        <fct> 
$ coal_production    <fct> 
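
The effect of `.keep_all = TRUE` is easiest to see on a small scale. The sketch below uses a hypothetical three-row data frame modelled on the duplicated records shown earlier; the object names are illustrative.

```r
library(dplyr)

# Two rows share the same FIPS code
toy <- data.frame(
  fips   = c(1013, 1013, 1017),
  county = c("Butler", "Butler", "Chambers"),
  rate   = c(38.3, 38.3, 49.6)
)

# Without .keep_all, only the columns named in distinct() are returned
keys_only <- distinct(toy, fips)

# With .keep_all = TRUE, the first row per FIPS code is kept with all columns
deduped <- distinct(toy, fips, .keep_all = TRUE)
print(deduped)
```

Because only the first row per key is kept, it is worth inspecting the `get_dupes()` output beforehand to confirm that the duplicated rows are truly identical.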


Now we run all of the above functions in a single pipeline using the pipe ( `|>` ):

In [None]:
%%R
mf_clean = mf |>
  janitor::row_to_names(1, remove_row = TRUE, remove_rows_above = TRUE)  |>
  janitor::remove_empty()  |>
  janitor::clean_names()  |>
  dplyr::mutate_at(4:21, as.numeric)  |>
  dplyr::mutate_at(22:24, as.factor)  |>
  dplyr::distinct(fips,.keep_all = TRUE)  |>
  glimpse()




Rows: 3,107
Columns: 24
$ region_id          <chr> "3", "3", "3", "3", "3", "3", "3", "3", "3", "3", "…
$ state              <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabam…
$ county             <chr> "Baldwin County", "Butler County", "Chambers County…
$ x                  <dbl> 789777.5, 877731.6, 984214.7, 726606.5, 770408.9, 9…
$ y                  <dbl> 884557.1, 1007285.7, 1148648.7, 1023615.8, 988910.5…
$ fips               <dbl> 1003, 1013, 1017, 1023, 1025, 1031, 1035, 1039, 104…
$ lcb_mortality_rate <dbl> 48.1, 38.3, 49.6, 31.8, 42.0, 53.7, 46.9, 65.5, 57.…
$ smoking            <dbl> 20.8, 26.0, 25.1, 21.8, 22.6, 21.2, 24.9, 25.9, 22.…
$ pm_25              <dbl> 7.89, 8.46, 8.87, 8.58, 8.42, 8.42, 8.23, 8.24, 8.4…
$ no2                <dbl> 0.7939, 0.6344, 0.8442, 0.5934, 0.6432, 0.5698, 0.5…
$ so2                <dbl> 0.035343, 0.013500, 0.048177, 0.023989, 0.033700, 0…
$ ozone              <dbl> 39.79, 38.31, 40.10, 37.07, 37.68, 38.46, 37.92, 38…
$ pop_65        

## Summary and Conclusion

This guide explained how to clean up datasets using the R package **janitor**. The package provides functions that simplify everyday data-cleaning tasks: fixing column names, promoting a row to column names, removing empty rows and columns, and finding duplicate records. Combined with dplyr functions for converting data types and dropping duplicates, it is a valuable tool for preparing datasets for analysis. Data cleaning is an essential step, and janitor simplifies it. With the skills gained in this guide, you can efficiently clean and tidy datasets for more meaningful analyses.


## References

1.  [Overview of janitor functions](https://cran.r-project.org/web/packages/janitor/vignettes/janitor.html)

2.  [Cleaning and Exploring Data with the "janitor" Package](https://towardsdatascience.com/cleaning-and-exploring-data-with-the-janitor-package-ee4a3edf085e)