<a href="https://colab.research.google.com/github/zia207/r-colab/blob/main/NoteBook/R%20for%20Beginners/data_wrangling_dplyr_tidyr.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Wrangling with dplyr and tidyr

Zia AHMED, University at Buffalo

In this section, you will take a closer look at data manipulation with two of the most widely used R packages: **tidyr** and **dplyr**. Both are part of the [tidyverse](https://www.tidyverse.org/), a collection of R packages designed for data science. The **tidyr** package provides tools for tidying data into a consistent structure, while the **dplyr** package offers functions to filter, sort, group, and summarize data frames. By mastering the functionality of these two packages, you will be able to clean, transform, and manipulate data efficiently and handle complex data manipulation tasks.

### **tidyr - Package**

[**tidyr**](https://tidyr.tidyverse.org/) is a package for creating **tidy** data, a format that makes data easy to work with, model, and visualize. Tidy data follows a small set of principles for organizing data into tables: each column represents a variable, each row represents an observation, and each cell holds a single value. Variables should have clear, descriptive names, and observations should be organized in a logical order. Data in this format can be filtered, sorted, and summarized easily, which is particularly important when working with large datasets. It also lets you apply a wide range of analysis techniques, including regression, clustering, and machine learning, without first having to fix data formatting issues. Many data analysis tools in R, and the tidyverse packages in particular, expect data in this format.

![alt text](http://drive.google.com/uc?export=view&id=1s2ve_z1T_bXG4BXNUjyHWkvsg4HJHnMh)

The **tidyr** package provides a suite of functions for cleaning, reshaping, and transforming data into a tidy format. It allows users to split, combine, and pivot data frames, which are essential operations when working with messy data.
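As a minimal sketch of what pivoting looks like in practice, the example below reshapes a small table from wide to long form and back. The `yield_wide` table, its column names, and its values are invented for illustration and are not part of the original material.

```r
library(tidyr)
library(tibble)

# Hypothetical "wide" table: one row per site, one column per year
yield_wide <- tibble(
  site   = c("A", "B"),
  `2020` = c(2.1, 3.4),
  `2021` = c(2.5, 3.0)
)

# Wide -> long: each row becomes one site-year observation
yield_long <- pivot_longer(
  yield_wide,
  cols      = c(`2020`, `2021`),
  names_to  = "year",
  values_to = "yield"
)
print(yield_long)

# Long -> wide: reverse the reshaping
yield_back <- pivot_wider(
  yield_long,
  names_from  = year,
  values_from = yield
)
print(yield_back)
```

In the long form, `year` is a variable in its own column rather than being encoded in the column names, which is the layout the tidy data principles describe.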

**tidyr** functions fall into five main categories:

-   **Pivoting**, which converts between long (**pivot_longer()**) and wide (**pivot_wider()**) forms. These functions supersede the older **gather()** and **spread()**.

-   **Rectangling**, which turns deeply nested lists (as from JSON) into tidy tibbles.

-   **Nesting**, which converts grouped data to a form where each group becomes a single row containing a nested data frame.

-   **Splitting and combining character columns**. Use **separate()** and **extract()** to pull a single character column into multiple columns; use **unite()** to combine multiple columns into a single character column.

-   **Handling missing values**. Make implicit missing values explicit with **complete()**; make explicit missing values implicit with **drop_na()**; replace missing values with the next or previous value with **fill()**, or with a known value with **replace_na()**.
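The splitting, combining, and missing-value functions listed above can be sketched on a toy table as follows. The `samples` data, its `id` and `ph` columns, and the replacement value `0` are assumptions made purely for illustration.

```r
library(tidyr)
library(tibble)

# Hypothetical samples: a combined ID column and one missing measurement
samples <- tibble(
  id = c("A_2020", "A_2021", "B_2020"),
  ph = c(6.5, NA, 7.1)
)

# separate(): split "A_2020" into site = "A" and year = "2020"
split_df <- separate(samples, id, into = c("site", "year"), sep = "_")

# complete(): add the missing site-year combination (B, 2021) as an explicit NA row
full_df <- complete(split_df, site, year)

# drop_na(): keep only rows where ph was actually measured
measured_df <- drop_na(full_df, ph)

# fill(): carry the previous non-missing value downward
filled_df <- fill(full_df, ph, .direction = "down")

# replace_na(): substitute a known value (0 is an arbitrary placeholder here)
zero_df <- replace_na(full_df, list(ph = 0))

# unite(): recombine site and year into a single character column
united_df <- unite(filled_df, "id", site, year, sep = "_")
print(united_df)
```

Whether filling or replacing missing values is appropriate depends on the data; the calls here only show the mechanics.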

### **dplyr - Package**

[**dplyr**](https://dplyr.tidyverse.org/) provides a grammar of data manipulation: a consistent set of functions to clean, process, and aggregate data efficiently. It works with data frames and with tibbles, a data structure from the tibble package that is similar to a data frame but designed for easier use. Its core verbs, such as filter(), arrange(), select(), mutate(), and summarize(), each perform one kind of data operation. These verbs can be chained with the pipe operator (`%>%` or `|>`), so that several operations read as a single pipeline. Finally, dplyr also supports working with remote data sources, including databases and big data systems.
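A short sketch of how these verbs chain together is shown below. The `plots` table and its columns are hypothetical, made up for this example. The `%>%` pipe is used here; the native `|>` pipe should work the same way in R 4.1 or later.

```r
library(dplyr)

# Hypothetical plot-level data, invented for illustration
plots <- tibble(
  site  = c("A", "A", "B", "B", "C"),
  crop  = c("rice", "wheat", "rice", "wheat", "rice"),
  yield = c(4.2, 3.1, 5.0, 2.8, 3.9),
  area  = c(2, 1, 4, 2, 3)
)

# filter -> select -> mutate -> arrange: one pipeline, one step per verb
rice_plots <- plots %>%
  filter(crop == "rice") %>%              # keep rice rows only
  select(site, yield, area) %>%           # keep the columns we need
  mutate(production = yield * area) %>%   # derive a new column
  arrange(desc(production))               # sort, largest first
print(rice_plots)

# group_by + summarize: one summary row per crop
crop_summary <- plots %>%
  group_by(crop) %>%
  summarize(
    n_plots    = n(),
    mean_yield = mean(yield)
  )
print(crop_summary)
```

Each verb takes a data frame as its first argument and returns a data frame, which is what makes the pipeline style possible.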

![alt text](http://drive.google.com/uc?export=view&id=1rzkAr_dhjcJKHgn9ae_4No8_BR7_W3xw)

In addition to data frames and tibbles, dplyr code can be run against other backends through the following packages:

[**dtplyr**](https://dtplyr.tidyverse.org/): for large, in-memory datasets. Translates your dplyr code to high performance data.table code.

[**dbplyr**](https://dbplyr.tidyverse.org/): for data stored in a relational database. Translates your dplyr code to SQL.

[**sparklyr**](https://spark.rstudio.com/): for very large datasets stored in Apache Spark.
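To give a rough idea of how a backend translation works, the sketch below uses dbplyr's `lazy_frame()` with a simulated SQLite connection, so no real database is needed. It assumes the dbplyr package is installed; the table contents are invented for illustration, and the exact SQL text printed may differ between dbplyr versions.

```r
library(dplyr)
library(dbplyr)

# lazy_frame() builds a stand-in for a remote table; simulate_sqlite()
# only mimics the SQLite dialect and opens no actual connection
plots_db <- lazy_frame(
  site  = c("A", "B"),
  yield = c(4.2, 5.0),
  con   = simulate_sqlite()
)

# Ordinary dplyr verbs; nothing is computed yet
query <- plots_db %>%
  filter(yield > 4) %>%
  group_by(site) %>%
  summarise(mean_yield = mean(yield, na.rm = TRUE))

# Print the SQL that dbplyr would send to the database
show_query(query)
```

With a real database you would typically create the table reference with `tbl()` on a DBI connection and call `collect()` to bring the results into R.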

### Cheat-sheet

The [dplyr and tidyr cheat sheets](https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf) for data wrangling are shown below:

![Data wrangling with dplyr and tidyr cheat sheet 1](Image/data_wragling%20cheat_sheet_01.png){#fig-dplyr_tidyr_01}

![Data wrangling with dplyr and tidyr cheat sheet 2](Image/data_wragling%20cheat_sheet_02.png){#fig-dplyr_tidyr_02}

In addition to tidyr and dplyr, there are five packages designed to work with specific types of data. Besides stringr (for strings) and forcats (for factors), these are:

-   **lubridate** for dates and date-times.

-   **hms** for time-of-day values.

-   **blob** for storing blob (binary) data.