<a href="https://colab.research.google.com/github/tavi1402/Data_Science_bootcamp/blob/main/3_3_1_advanced_data_analysis_pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Advanced Data Analysis Techniques with Python & Pandas

This tutorial is a part of the [Zero to Data Science Bootcamp by Jovian](https://zerotodatascience.com).

![](https://i.imgur.com/jspPDKJ.png)

Pandas is a popular Python library used for working with tabular data (similar to the data stored in a spreadsheet). Pandas offers several easy-to-use and efficient utilities for loading, processing, cleaning and analyzing large tabular datasets. Datasets containing millions of records can be processed using Pandas in a matter of minutes.

This tutorial covers the following topics:

- Downloading datasets from online sources
- Processing massive datasets using Pandas
- Working with categorical data
- Handling missing and duplicate data
- Transforming data with type-specific functions
- Data frame concatenation and merging

### How to run the code

This tutorial is an executable [Jupyter notebook](https://jupyter.org) hosted on [Jovian](https://www.jovian.ai). You can _run_ this tutorial and experiment with the code examples in a couple of ways: *using free online resources* (recommended) or *on your computer*.

#### Option 1: Running using free online resources (1-click, recommended)

The easiest way to start executing the code is to click the **Run** button at the top of this page and select **Run on Colab**. [Follow these instructions](https://jovian.ai/docs/user-guide/run.html#run-on-colab) to connect your Google Drive with Jovian.


#### Option 2: Running on your computer locally

To run the code on your computer locally, you'll need to set up [Python](https://www.python.org), download the notebook and install the required libraries. We recommend using the [Conda](https://docs.conda.io/projects/conda/en/latest/user-guide/install/) distribution of Python. Click the **Run** button at the top of this page, select the **Run Locally** option, and follow the instructions.

>  **Jupyter Notebooks**: This tutorial is a [Jupyter notebook](https://jupyter.org) - a document made of _cells_. Each cell can contain code written in Python or explanations in plain English. You can execute code cells and view the results, e.g., numbers, messages, graphs, tables, files, etc., instantly within the notebook. Jupyter is a powerful platform for experimentation and analysis. Don't be afraid to mess around with the code & break things - you'll learn a lot by encountering and fixing errors. You can use the "Kernel > Restart & Clear Output" menu option to clear all outputs and start again from the top.

Let's install and import the required libraries.

## Finding and downloading datasets from online sources

There are many great sources for finding datasets online:

- [Kaggle datasets](http://kaggle.com/datasets)
- [World Bank Open Data](https://data.worldbank.org)
- [Yahoo Finance](https://finance.yahoo.com)
- [Google Dataset Search](https://datasetsearch.research.google.com)
- [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets.php)
- [FastAI datasets](https://course.fast.ai/datasets)
- and many more..

While some of these provide public URLs to an easily downloadable dataset archive, others require login to limit abuse. As an example, let's look at Kaggle, which contains over 60,000 community-curated datasets. We'll download the [US Accidents dataset](https://www.kaggle.com/sobhanmoosavi/us-accidents), which contains nearly 3 million records.

**NOTE**: The `us-accidents` dataset has since been updated on Kaggle, and part of it has been removed at the request of one of the main traffic data providers. We've added a **NOTE** wherever this changes the numbers in the notebook.

We can't use `requests` directly to download a dataset from Kaggle, because it doesn't provide a raw URL for the dataset. We'll use the `opendatasets` library, which can download a Kaggle dataset using an API token.

We'll use the `od.download` function to download the dataset.

To download the dataset, you'll need to supply your Kaggle credentials, as explained here: https://github.com/jovianml/opendatasets#kaggle-credentials

The data has been downloaded and unzipped to the folder `./us-accidents`

It consists of just one file, `US_Accidents_Dec20_updated.csv`, which is over 1 GB in size. We can also check the length of the file using the `wc` terminal command (only works on Linux and Mac).


**NOTE**: The latest version of the `us-accidents` dataset is over 500 MB in size.

The file consists of over 2.9 million records! You can learn more about the dataset by reading the dataset description on Kaggle: https://www.kaggle.com/sobhanmoosavi/us-accidents .

**NOTE**: The latest version of the `us-accidents` dataset has 1.5 million records.


Try downloading a few other datasets from the sources listed above.

> **EXERCISE**: Find and download a dataset providing country-wise population for the last 50 years. Use it to identify the countries with the highest percentage growth in population. What other insights can you gather from this data? Experiment with it in a new notebook.
>
> *Hint*: Visit https://data.worldbank.org .



> **EXERCISE**: Download the historical monthly stock price data for Apple Inc. (AAPL) since 1988. If you had bought Apple shares worth $100 on Jan 1, 1991, what would they be worth on Jan 1, 2021? What other insights can you gather from this data? Experiment with it in a new notebook.
>
> *Hint*: Visit https://finance.yahoo.com .

> **EXERCISE**: Learn about and download data set from https://archive.ics.uci.edu/ml/datasets/Air+quality . Show the trend of CO concentration using a line chart. What other insights can you gather from this data? Experiment with it in a new notebook.

## Processing massive datasets using Pandas

Let's load the US accidents data into a Pandas dataframe, and track the amount of time it takes using the `%%time` Jupyter magic command.

While the exact time for this operation depends on the hardware configuration of your computer, you will likely find that it takes less than a minute for Pandas to process a 1.1 GB file containing over 2.9 million records. Isn't that impressive?

**NOTE**: The latest version of the `us-accidents` dataset is over 500 MB in size containing over 1.5 million records.

Let's take a look at the first few rows, and gather some information about the dataset.



The dataset contains 2.9 million rows, 46 columns and occupies 790 MB of memory (RAM). Let's look at some strategies to load the data faster and use less memory.

**NOTE**: The latest version of the `us-accidents` contains 1.5 million rows, 46 columns and occupies 412 MB of memory (RAM).

### Load only the required columns

You can provide the `usecols` argument to `read_csv` to create a dataframe with just the given columns. This reduces the loading time and uses less memory.
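As a minimal sketch of `usecols`, the example below reads a tiny in-memory CSV instead of the real accidents file. The CSV text and the names `csv_text`, `selected_columns` and `small_df` are made up for illustration; only the column names are borrowed from the dataset.

```python
import io
import pandas as pd

# Hypothetical miniature CSV standing in for the accidents file
csv_text = """ID,Severity,Start_Time,City,Distance(mi)
A-1,2,2020-01-01 08:00:00,Dayton,0.5
A-2,3,2020-01-02 09:30:00,Columbus,1.2
A-3,2,2020-01-03 17:45:00,Dayton,0.0
"""

# Only the listed columns are parsed; the remaining columns are skipped
selected_columns = ["ID", "Severity", "City"]
small_df = pd.read_csv(io.StringIO(csv_text), usecols=selected_columns)

print(small_df.columns.tolist())
print(small_df.shape)
```

With a real file, you would pass the file path in place of the `io.StringIO` object.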

We've reduced the load time by over 40% and the memory usage by over 60%.

### Use smaller data types

By default, Pandas uses large datatypes like `int64` and `float64` for numerical data. However, in many cases the data in the CSV file can be represented using a smaller data type such as `int32`, `float32`, `int16` etc.

Date columns can be specified using the `parse_dates` argument.
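Here is a small sketch of both arguments together, assuming a toy CSV rather than the real dataset. The chosen types (`int8`, `float32`) are illustrative guesses that fit the toy values; for real data, check the range of each column before picking a smaller type.

```python
import io
import pandas as pd

# Hypothetical miniature CSV standing in for the accidents file
csv_text = """ID,Severity,Start_Time,Distance(mi)
A-1,2,2020-01-01 08:00:00,0.5
A-2,3,2020-01-02 09:30:00,1.2
"""

typed_df = pd.read_csv(
    io.StringIO(csv_text),
    # Map column names to smaller numeric types
    dtype={"Severity": "int8", "Distance(mi)": "float32"},
    # Parse this column as datetimes while reading
    parse_dates=["Start_Time"],
)

print(typed_df.dtypes)
```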


The load time and memory gains depend on the nature of the dataset. In this case, it leads to a 25% reduction in memory usage, with about the same load time. However, keep in mind that we no longer need to parse date columns separately, which itself would take a few seconds for this dataset.

> **EXERCISE**: Parse the `Start_Time` and `End_Time` columns of `accidents_df2` as dates using `pd.to_datetime`. Measure the time taken for the conversion.

### Using binary formats for intermediate results

Since CSVs are plain text files that store no information about column types, they often take longer to read than binary formats that preserve the tabular structure of the data. Files can be saved and loaded using the `feather` and `parquet` formats for memory efficiency and faster processing.

Let's save `accidents_df` to the feather format and load it back. It requires the `pyarrow` library to be installed.

The feather file is over 40% smaller than the CSV file.

Notice that reading a feather file is 60% faster compared to reading a CSV file.  It's a good idea to save the intermediate results of your analysis in the feather format, so that you can load the file faster and avoid recomputing results when you resume your work.

Check out a comparison of the feather and parquet formats here: https://ursalabs.org/blog/2020-feather-v2/

### Working with a sample

When working with a large dataset, sometimes it's better to work with a sample to set up your notebook, and then repeat your analysis with the entire dataset, to save time. You can use the `nrows` argument to supply the number of rows to be read.
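The following sketch illustrates `nrows` on a generated 100-row CSV held in memory. The data and the names `csv_text` and `sample_df` are hypothetical; the tutorial's own sample is read from the accidents file.

```python
import io
import pandas as pd

# Build a hypothetical 100-row CSV in memory
rows = [f"A-{i},{i % 4 + 1}" for i in range(100)]
csv_text = "ID,Severity\n" + "\n".join(rows)

# Read only the first 10 data rows (the header line is not counted)
sample_df = pd.read_csv(io.StringIO(csv_text), nrows=10)

print(len(sample_df))
```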

Reading the first 1000 rows takes just a few milliseconds.

### Using dask for parallelism and memory efficiency

Dask uses parallel processing to speed up data loading.

Many Pandas operations are implemented using more efficient algorithms in dask.

To compute the memory usage, we need to provide `memory_usage=True`. Warning: This may take a while.

Keep in mind that dask has a slightly different API compared to Pandas, and not all Pandas functions will work the same way. Check out the documentation of Dask to learn more: https://docs.dask.org/en/latest/dataframe.html

> **EXERCISE**: List the various file types supported by Pandas for reading & writing. Demonstrate their usage with some examples. Use the official documentation for reference: https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html

> **EXERCISE**: Save the contents of `accidents_df3` into various file formats like CSV, JSON, Excel, SQLite, Parquet, Feather etc. and read the files back using Pandas. Compare the writing time, size of created file and reading time for different formats.


> **EXERCISE**: Download the New York Taxi Fare Prediction dataset from https://www.kaggle.com/c/new-york-city-taxi-fare-prediction/data . Pick 7 columns of the dataset and save it to an efficient intermediate format. How much improvement can you achieve in the file size, memory usage and reading time using the techniques listed above?
>
> *Warning*: This dataset is quite large (> 10 GB after uncompressing). Make sure you have enough disk space before downloading it, or use an online platform like Google Colab.

## Working with Categorical Data

Consider the `Weather_Condition` column of the `accidents_sample_df`. While the values in the column are strings, there are only a limited number of values or _categories_ that occur in the column. `Weather_Condition` is a _categorical column_.

We can list all the values in the column using the `.unique` method.

To check the number of unique values, use `nunique`.

We can see the number of occurrences of each value using `.value_counts()`.
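The three methods above can be tried on a small series. This is a sketch with made-up weather values; the series `weather` is a stand-in for the `Weather_Condition` column.

```python
import pandas as pd

# Hypothetical stand-in for the Weather_Condition column
weather = pd.Series(
    ["Clear", "Rain", "Clear", "Snow", "Clear", "Rain"],
    name="Weather_Condition",
)

distinct_values = weather.unique()   # distinct values, in order of first appearance
distinct_count = weather.nunique()   # number of distinct values
counts = weather.value_counts()      # occurrences of each value, most frequent first

print(distinct_values)
print(distinct_count)
print(counts)
```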

We can convert the string column to a categorical column in Pandas by changing its data type.

While there's no visible change, the conversion allows Pandas to optimize the storage & querying for the column by representing each category internally using a numeric code.

We can view the codes for each row as follows:

The category code is the index of the category in the following list:

Categorical columns are often replaced with their numeric codes before passing data into a machine learning algorithm which can only work with numbers.
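The conversion and the code lookup described above can be sketched on a toy series. The values are made up; `weather_cat` is a hypothetical name for the converted column.

```python
import pandas as pd

# Hypothetical stand-in for the Weather_Condition column
weather = pd.Series(["Clear", "Rain", "Clear", "Snow"], name="Weather_Condition")

# Change the data type from string to categorical
weather_cat = weather.astype("category")

print(weather_cat.dtype)
print(weather_cat.cat.categories)  # the list of categories
print(weather_cat.cat.codes)       # one integer code per row, indexing into that list
```

Each code is the position of the row's value in `weather_cat.cat.categories`, so rows holding the same category share the same code.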

### Numeric Categorical Columns

The column `Severity` consists of categories too, even though its values are numeric.

Let's convert it into a categorical column.

### One Hot Encoding

![](https://i.imgur.com/n8GuiOO.png)

Sometimes it's useful to create a new column for each category of a categorical column, and set the value in the column to `1` if the row belongs to the category and `0` otherwise. This technique is known as one-hot encoding and is commonly applied before passing data into machine learning algorithms.

We can use the `pd.get_dummies` function to create a new column for each category of a categorical column.

The new columns can be added to the original data frame using the `pd.concat` method (we'll learn more about it later).
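A minimal sketch of both steps on a three-row dataframe follows. The data is made up, and `dtype=int` is passed so the new columns hold `1`/`0` regardless of the Pandas version (newer versions return booleans by default).

```python
import pandas as pd

# Hypothetical stand-in for a few rows of the accidents data
sample_df = pd.DataFrame({
    "ID": ["A-1", "A-2", "A-3"],
    "Weather_Condition": ["Clear", "Rain", "Clear"],
})

# One new column per category, holding 1 where the row belongs to it
one_hot = pd.get_dummies(sample_df["Weather_Condition"], dtype=int)

# Attach the new columns to the original dataframe side by side
encoded_df = pd.concat([sample_df, one_hot], axis=1)

print(encoded_df)
```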

> **EXERCISE**: Repeat the above steps with `accidents_df` and `accidents_dask_df`. Track and compare the times taken for each operation.

> **EXERCISE**: Perform one-hot encoding for the `Weather_Condition` column of the dataframe `accidents_df`.

Learn more about working with categorical data in Pandas here: https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html

## Handling missing & duplicate data

Missing data in Pandas is indicated using `np.nan`. We can find the number of missing values in each column of a dataframe using the following expression:
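As a sketch, one common form of this expression chains `.isna()` with `.sum()`. The dataframe below is a made-up example, so its counts are unrelated to the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical dataframe with a few missing values
sample_df = pd.DataFrame({
    "End_Lat": [39.8, np.nan, 40.1, np.nan],
    "Weather_Condition": ["Clear", None, "Rain", "Clear"],
})

# isna() marks missing cells as True; sum() counts the True values per column
missing_counts = sample_df.isna().sum()

print(missing_counts)
```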

The `End_Lat` and `End_Lng` columns have 96 missing values, and the `Weather_Condition` column has 21 missing values.

**NOTE**: The latest version of the `us-accidents` dataset has no missing values in the `End_Lat` and `End_Lng` columns. Only the `Weather_Condition` column has 9 missing values.

> **EXERCISE**: What is the output of the `isna` method of a Pandas data frame or series? Demonstrate with examples.

We have the following options for dealing with missing values in numerical columns:

1. Leave them as is, if they won't affect your analysis
2. Replace them with an average
3. Replace them with some other fixed value
4. Remove the rows containing missing values
5. Use the values from other rows & columns to estimate the missing value (imputation)

Here's how approach 4 can be applied:
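A minimal sketch using `.dropna`, assuming we only care about missing values in the two coordinate columns. The dataframe is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical dataframe with missing coordinates
sample_df = pd.DataFrame({
    "End_Lat": [39.8, np.nan, 40.1],
    "End_Lng": [-84.2, -83.0, np.nan],
})

# Keep only the rows where both columns have a value
cleaned_df = sample_df.dropna(subset=["End_Lat", "End_Lng"])

print(cleaned_df)
```

Omitting `subset` would drop every row that has a missing value in any column.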

> **EXERCISE**: Replace the missing values in the columns `End_Lng` and `End_Lat` using the average value in each column. Hint: Use the function `.fillna`.

For categorical columns, we have the following options for dealing with missing values:

1. Leave them as is, if they won't affect your analysis
2. Create a new category for missing values
3. Replace them with the most frequent category (or by some other fixed value)
4. Replace them & add a new binary column indicating whether the value was missing
5. Replace the columns with one-hot encoded columns

Let's apply technique 3, i.e., replace the null values with the most common value (the mode).
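Here is a sketch of this technique on a toy series, combining `.mode` and `.fillna`. The values and the names `most_common` and `filled` are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the Weather_Condition column
weather = pd.Series(["Clear", None, "Rain", "Clear", None])

# mode() returns a series (there can be ties), so take the first entry
most_common = weather.mode()[0]

# Replace every missing value with the most frequent category
filled = weather.fillna(most_common)

print(filled)
```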

> **EXERCISE**: Apply the other techniques listed above to handle missing values in the dataframe `accidents_sample_df`.

> **EXERCISE**: Repeat the operations performed in the above section with `accidents_df` and `accidents_dask_df`. Measure and compare the time taken for each operation.

### Duplicate Data

If required, duplicate rows can be removed using the `.drop_duplicates` method.
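A short sketch on a made-up dataframe: `.duplicated` flags repeated rows, and `.drop_duplicates` returns a copy without them.

```python
import pandas as pd

# Hypothetical dataframe in which the third row repeats the first
df = pd.DataFrame({"ID": ["A-1", "A-2", "A-1"], "Severity": [2, 3, 2]})

# Count rows that are exact copies of an earlier row
duplicate_count = df.duplicated().sum()

# Keep the first occurrence of each row
deduped = df.drop_duplicates()

print(duplicate_count)
print(deduped)
```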

Think carefully about how the data was collected before removing duplicates. Removing duplicates may not always be the right approach.

> **EXERCISE**: Check for duplicates in `accidents_df` and remove them if required.

> **EXERCISE**: Repeat the exercises in this section with `accidents_df` and `accidents_dask_df` and track the time taken by each operation.

## Transforming and aggregating data with type-specific functions

Pandas offers several methods for working with specific types of data. Additionally, we can use Numpy functions to perform operations on Pandas series. Let's look at some utility methods for three types of data: numbers, strings and dates.

### Numbers

Here are some functions useful for transforming and aggregating numeric data.

We can also apply Numpy functions to Pandas series.

Pandas series also support arithmetic operators.
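The three ideas in this subsection can be sketched on a small numeric series. The distances are made up, and the conversion factor of 1.60934 km per mile is the only constant assumed.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Distance(mi) column
distance_mi = pd.Series([0.5, 1.2, 0.0, 3.3], name="Distance(mi)")

# Built-in aggregation and transformation methods
average = distance_mi.mean()
longest = distance_mi.max()
rounded = distance_mi.round()

# Numpy functions accept a series and return a series
roots = np.sqrt(distance_mi)

# Arithmetic operators work element by element
distance_km = distance_mi * 1.60934

print(average, longest)
print(distance_km)
```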

> **EXERCISE**: Try out some more arithmetic operations with other numeric columns of `accidents_df` and `accidents_dask_df`. Measure and compare the time taken for each operation.

### Strings

The `.str` property of a Pandas series provides several utility functions for manipulating string data.
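Below is a sketch of a few `.str` methods on made-up description strings. The series `descriptions` is hypothetical and is not taken from the dataset.

```python
import pandas as pd

# Hypothetical free-text descriptions
descriptions = pd.Series([
    "Accident on I-70 E",
    "Lane blocked on I-75 N",
    "Accident on OH-4",
])

lowercase = descriptions.str.lower()                 # lowercase every string
has_accident = descriptions.str.contains("Accident") # boolean mask per row
road = descriptions.str.split(" on ").str[1]         # text after " on "

print(lowercase)
print(has_accident)
print(road)
```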

> **EXERCISE**: Explore other string methods supported by Pandas data frames and series: https://pandas.pydata.org/docs/user_guide/text.html#string-methods . Demonstrate their usage with examples.

### Date & Time

The `.dt` property of a Pandas series provides utility methods for working with dates.

Let's extract different parts of the date.
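A small sketch of the `.dt` accessor, using two made-up timestamps in place of the `Start_Time` column:

```python
import pandas as pd

# Hypothetical stand-in for the Start_Time column
start_times = pd.to_datetime(
    pd.Series(["2020-01-01 08:00:00", "2020-06-15 17:45:00"])
)

years = start_times.dt.year
months = start_times.dt.month
weekdays = start_times.dt.dayofweek  # Monday is 0, Sunday is 6
hours = start_times.dt.hour

print(years.tolist(), months.tolist(), weekdays.tolist(), hours.tolist())
```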

> **EXERCISE**: Explore other date methods supported by Pandas series: https://pandas.pydata.org/pandas-docs/stable/user_guide/basics.html#basics-dt-accessors . Demonstrate their usage with examples.

### `map` and `apply`

The `map` method of a column/series can be used to apply a custom function to each element of a series. Let's use it to convert the distance from miles to kilometres.

The `apply` method  can be used to apply a custom function to each column/row of a dataframe. Let's use it to compute the duration of each event.
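Both methods can be sketched on a two-row dataframe. The data and the helper functions `miles_to_km` and `duration` are hypothetical. For simple cases like these, the vectorized forms (`df["Distance(mi)"] * 1.60934` and `df["End_Time"] - df["Start_Time"]`) give the same result and are usually faster.

```python
import pandas as pd

# Hypothetical stand-in for a few columns of the accidents data
df = pd.DataFrame({
    "Distance(mi)": [0.5, 1.2],
    "Start_Time": pd.to_datetime(["2020-01-01 08:00:00", "2020-01-02 09:30:00"]),
    "End_Time": pd.to_datetime(["2020-01-01 08:45:00", "2020-01-02 11:00:00"]),
})

def miles_to_km(miles):
    """Convert a single distance from miles to kilometres."""
    return miles * 1.60934

# map calls the function once for each element of the series
df["Distance(km)"] = df["Distance(mi)"].map(miles_to_km)

def duration(row):
    """Compute how long one event lasted."""
    return row["End_Time"] - row["Start_Time"]

# apply with axis=1 calls the function once for each row of the dataframe
df["Duration"] = df.apply(duration, axis=1)

print(df[["Distance(km)", "Duration"]])
```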

> **EXERCISE**: Look up the documentation for the `applymap` method of a data frame. How is it different from `apply` and `map` methods? Demonstrate with examples.

> **EXERCISE**: Repeat the operations performed in this section (type-specific functions) with `accidents_df` and `accidents_dask_df`. Measure and compare the time taken for each operation.

Learn more about `map` and `apply` here: https://towardsdatascience.com/introduction-to-pandas-apply-applymap-and-map-5d3e044e93ff

## Data frame concatenation and merging

Pandas provides various utilities for combining multiple data frames. We'll look at two examples in this section: concatenation and merging.

### Concatenation

Concatenation is the process of stacking two or more dataframes vertically or horizontally. When concatenating vertically, columns are lined up together. Here's what vertical concatenation looks like:

![](https://i.imgur.com/ti195t3.png)

We can now concatenate these along axis 0, i.e., vertically, using `pd.concat`.
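As a sketch, here is vertical concatenation of two made-up dataframes with the same columns. The names `df1` and `df2` and their contents are hypothetical.

```python
import pandas as pd

# Two hypothetical dataframes with the same columns
df1 = pd.DataFrame({"A": ["A0", "A1"], "B": ["B0", "B1"]})
df2 = pd.DataFrame({"A": ["A2", "A3"], "B": ["B2", "B3"]})

# Stack the rows of df2 below the rows of df1; original index labels are kept
stacked = pd.concat([df1, df2], axis=0)

# ignore_index=True discards the old labels and numbers the rows afresh
stacked_reindexed = pd.concat([df1, df2], ignore_index=True)

print(stacked)
print(stacked_reindexed)
```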

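In older versions of Pandas, this operation could also be performed using the `.append` method of a dataframe. `DataFrame.append` was deprecated in Pandas 1.4 and removed in Pandas 2.0, so `pd.concat` is the recommended approach.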

> **EXERCISE**: Remove the column `D` from `df3`. How does it affect the result of vertical concatenation? Try passing the argument `join="inner"` to `pd.concat`. Do you observe any change?

> **EXERCISE**: Create two dataframes that don't have any common columns and concatenate them vertically. What do you observe? Try providing the arguments `join="outer"` and `join="inner"`. How do they affect the results?

> **EXERCISE**: Explore the arguments supported by `pd.concat` and come up with some examples to demonstrate the purpose of each argument.

Concatenation can also be performed horizontally by providing the argument `axis=1` to `pd.concat`. Rows are lined up together using the index.



`pd.concat` performs an "outer" join by default, which retains all the indexes from both data frames. An "inner" join only retains the common indices.
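The difference between the two join types can be sketched with two made-up dataframes whose indexes only partly overlap:

```python
import pandas as pd

# Two hypothetical dataframes sharing index labels 1 and 2 only
left = pd.DataFrame({"A": ["A0", "A1", "A2"]}, index=[0, 1, 2])
right = pd.DataFrame({"B": ["B1", "B2", "B3"]}, index=[1, 2, 3])

# Outer join (the default): every index label is kept, gaps are filled with NaN
outer = pd.concat([left, right], axis=1)

# Inner join: only index labels present in both dataframes are kept
inner = pd.concat([left, right], axis=1, join="inner")

print(outer)
print(inner)
```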

Learn more about dataframe concatenation here: https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html#concatenating-objects

### Merging

Two Pandas dataframes can be merged together row-wise on one or more columns using the `.merge` method of a dataframe. A merge can be performed in several ways:

![](https://i.imgur.com/p2fXTFs.png)
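A minimal sketch of two of these join types, using made-up dataframes that share a `key` column. The `how` argument selects the join type; `"left"` and `"right"` work the same way.

```python
import pandas as pd

# Two hypothetical dataframes with partly overlapping keys
left = pd.DataFrame({"key": ["K0", "K1", "K2"], "A": ["A0", "A1", "A2"]})
right = pd.DataFrame({"key": ["K1", "K2", "K3"], "B": ["B1", "B2", "B3"]})

# Inner join: keep only the keys found in both dataframes
inner = left.merge(right, on="key", how="inner")

# Outer join: keep every key, filling the gaps with NaN
outer = left.merge(right, on="key", how="outer")

print(inner)
print(outer)
```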

> **EXERCISE**: Demonstrate the four types of join listed above using the following dataframes. Use the `key` column for merging.

> **EXERCISE**: Show an example of merging two dataframes on two columns.
>
> *Hint*: Read the docs: https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html#database-style-dataframe-or-named-series-joining-merging

> **EXERCISE**: Look up the documentation for the `.join` method of a dataframe. How is it different from `.merge`? Demonstrate with examples. Hint: By default, a join is performed on the index.

## Summary and Further Reading

We've covered the following topics in this tutorial:

- Downloading datasets from online sources
- Processing massive datasets using Pandas
- Handling missing, incorrect & duplicate data
- Transforming data with type-specific functions
- Techniques for encoding categorical data
- Concatenation, merging and comparison

As an exercise, you can apply the above to other datasets, from the following sources:

- [Kaggle datasets](http://kaggle.com/datasets)
- [World Bank Open Data](https://data.worldbank.org)
- [Yahoo Finance](https://finance.yahoo.com)
- [Google Dataset Search](https://datasetsearch.research.google.com)
- [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets.php)
- [FastAI datasets](https://course.fast.ai/datasets)


Check out the following resources to learn more:

- Working with categorical data in Pandas: https://jovian.ai/himani007/categorical-data-with-pandas
- Working with large datasets in Pandas: https://jovian.ai/himani007/pandas1-large-datasets
- Python for Data Analysis: https://www.amazon.com/Python-Data-Analysis-Wrangling-IPython-ebook/dp/B075X4LT6K
- Pandas API reference: https://pandas.pydata.org/pandas-docs/stable/reference/index.html
- Merging Pandas dataframes: https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html
- Advanced Pandas tutorial notebooks: https://www.kaggle.com/residentmario/welcome-to-advanced-pandas
- Dask dataframes documentation: https://docs.dask.org/en/latest/dataframe.html
- [How to load CSV files 10x faster and use 10x less memory](https://towardsdatascience.com/%EF%B8%8F-load-the-same-csv-file-10x-times-faster-and-with-10x-less-memory-%EF%B8%8F-e93b485086c7)


## Questions for Revision

1.	How do you download a dataset from Kaggle?
2.	How do you check the length of a file on Windows?
3.	What is the purpose of `%%time`?
4.	What are the different methods and functions you can use to get information about the data in dataframe?
5.	How to load only the required columns from a large dataset?
6.	What is the purpose of using smaller datatype?
7.	How is `parse_dates` different from `pd.to_datetime`?
8.	What are the different formats one can use when loading CSV files for better memory efficiency and faster processing?
9.	How does working with a sample of your data first help with analysis?
10.	What is dask?
11.	What is categorical data? How to deal with them during analysis?
12.	What is One Hot Encoding?
13.	What are the different techniques to handle missing values?
14.	Why should one be careful when removing duplicates from the data?
15.	What are the different methods you can use on numeric, string, and date type data?
16.	How is `map()` different from `apply()`?
17.	What is `applymap()`?
18.	What is the `axis` parameter in Pandas?
19.	How do `join='inner'` and `join='outer'` work?
20.	What are the several ways to perform `merge()`?
21.	What is `on` parameter in `merge()`?
22.	How is `concat()` different from `merge()`?

## Solutions for Exercises

> **EXERCISE**: Find and download a dataset providing country-wise population for the last 50 years. Use it to identify the countries with the highest percentage growth in population. What other insights can you gather from this data? Experiment with it in a new notebook.
>
> *Hint*: Visit https://data.worldbank.org .


**OBSERVATION**: Middle East countries UAE, Qatar, Bahrain seem to take the top positions.

Other insights that we can gather from this data:
 - Countries with the least percentage growth in population.
 - Was the pandemic during 2019-2020 affecting the population growth?
 - Merge the `population` data with `birth` and `death` data to identify the countries with rapid population growth.
 - Merge the `population` data with `gender` data to calculate the gender ratio among the countries.
 - Plot the distribution of population across countries.

> **EXERCISE**: Download the historical monthly stock price data for Apple Inc. (AAPL) since 1988. If you had bought Apple shares worth $100 on Jan 1, 1991, what would they be worth on Jan 1, 2021? What other insights can you gather from this data? Experiment with it in a new notebook.
>
> *Hint*: Visit https://finance.yahoo.com .

**OBSERVATION**: That's more than 10 times the actual price!

Other insights that we can gather from this data:
 - Highest price the stock reached in a month, year.
 - Lowest price the stock traded in a month, year.
 - Total amount of stocks traded (volume) in a month, year.
 - Calculate the moving average of the prices and plot their trend.

> **EXERCISE**: Learn about and download data set from https://archive.ics.uci.edu/ml/datasets/Air+quality . Show the trend of CO concentration using a line chart. What other insights can you gather from this data? Experiment with it in a new notebook.

- Looks weird, doesn't it? Let's go through the dataset information to understand the data better.

![](https://i.imgur.com/IV4NGp8.png)

**OBSERVATION**: Nov, Dec of 2004 seem to record the highest average CO concentration.

Other insights that we can gather from this data:
 - Show the trends of different gases.
 - Plot the gases with time on an axis to see which time of the day records highest and lowest values.
 - Check for correlation between the gases, if there is a high correlation between any gases, find out the reason.

> **EXERCISE**: Parse the `Start_Time` and `End_Time` columns of `accidents_df2` as dates using `pd.to_datetime`. Measure the time taken for the conversion.

**NOTE**: The time for these operations may vary from person to person as they depend on the hardware configuration of the computer.

**OBSERVATION**: The time taken for converting `Start_Time` and `End_Time` separately using `pd.to_datetime` is 1.03 s.

**OBSERVATION**: The time taken for converting `Start_Time` and `End_Time` together using `pd.to_datetime` and `.apply()` is 958 ms.

> **EXERCISE**: List the various file types supported by Pandas for reading & writing. Demonstrate their usage with some examples. Use the official documentation for reference: https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html

- Link for an article - https://realpython.com/pandas-read-write-files/

> **EXERCISE**: Save the contents of `accidents_df3` into various file formats like CSV, JSON, Excel, SQLite, Parquet, Feather etc. and read the files back using Pandas. Compare the writing time, size of created file and reading time for different formats.


**NOTE**: The time for these operations may vary from person to person as they depend on the hardware configuration of the computer.

- Pandas was throwing an [error](https://stackoverflow.com/questions/47076719/saving-big-xlsx-files-pandas-python) when converting the entire dataframe so we'll pick a sample with the maximum size

**OBSERVATION**: `to_feather()` and `read_feather` took the least amount of time for conversion.

> **EXERCISE**: Download the New York Taxi Fare Prediction dataset from https://www.kaggle.com/c/new-york-city-taxi-fare-prediction/data . Pick 7 columns of the dataset and save it to an efficient intermediate format. How much improvement can you achieve in the file size, memory usage and reading time using the techniques listed above?
>
> *Warning*: This dataset is quite large (> 10 GB after uncompressing). Make sure you have enough disk space before downloading it, or use an online platform like Google Colab.

**OBSERVATION**: `pd.read_csv()` is faster in reading when compared to `pd.read_feather` or `pd.read_parquet`.

> **EXERCISE**: Repeat the above steps with `accidents_df` and `accidents_dask_df`. Track and compare the times taken for each operation.

**NOTE**: The time for these operations may vary from person to person as they depend on the hardware configuration of the computer.

`accidents_df`

**OBSERVATION**: As the size of the data increases, the time taken for conversion also increases.

`accidents_dask_df`

> **EXERCISE**: Perform one-hot encoding for the `Weather_Condition` column of the dataframe `accidents_df`.

> **EXERCISE**: What is the output of the `isna` method of a Pandas data frame or series? Demonstrate with examples.

**OBSERVATION**: `.isna()` is a method used to identify the missing values in a dataframe or series. It returns an object of the same shape containing the boolean values `True` or `False`. `True` indicates a missing value such as `NA` or `NaN`. `False` indicates that a value is present.

We have the following options for dealing with missing values in numerical columns:

1. Leave them as is, if they won't affect your analysis
2. Replace them with an average
3. Replace them with some other fixed value
4. Remove the rows containing missing values
5. Use the values from other rows & columns to estimate the missing value (imputation)

Here's how approach 4 can be applied:

> **EXERCISE**: Replace the missing values in the columns `End_Lng` and `End_Lat` using the average value in each column. Hint: Use the function `.fillna`.

Let's apply technique 3 i.e. replace the null values with the most common value (the mode)

> **EXERCISE**: Apply the other techniques listed above to handle missing values in the dataframe `accidents_sample_df`.

Let's apply technique 4 i.e. Replace them & add a new binary column indicating whether the value was missing

> **EXERCISE**: Repeat the operations performed in the above section with `accidents_df` and `accidents_dask_df`. Measure and compare the time taken for each operation.

- `accidents_df`

That's a lot of missing values, huh?

Here's what we will do for the columns with missing values:
- Technique 1 i.e. Leave them as is, if they won't affect your analysis for `Number`, `zipcode`, `Airport_Code`, `Timezone` and `Weather_Timestamp` columns as they don't affect our analysis.
- Technique 2 i.e. Create a new category for missing values for `city` column with missing value as `unknown` category
- Technique 3 i.e. Replace them with the most frequent category (or by some other fixed value) for `Temperature(F)`, `Wind_Chill(F)`, `Humidity(%)`, `Pressure(in)`, `Visibility(mi)`, `Wind_Speed(mph)` and `Precipitation(in)`.
- Technique 5 i.e. Replace the columns with one-hot encoded columns for `Weather_Condition`, `Wind_Direction`,  `Sunrise_Sunset`, `Civil_Twilight`, `Nautical_Twilight` and `Astronomical_Twilight`

> Technique 2 i.e. Create a new category for missing values for city column with missing value as unknown category

> Technique 3 i.e. Replace them with the most frequent category (or by some other fixed value) for Temperature(F), Wind_Chill(F), Humidity(%), Pressure(in), Visibility(mi), Wind_Speed(mph) and Precipitation(in).

> Technique 5 i.e. Replace the columns with one-hot encoded columns for Weather_Condition, Wind_Direction, Sunrise_Sunset, Civil_Twilight, Nautical_Twilight and Astronomical_Twilight

- You can either drop the original columns after one-hot encoding or keep them.

- We will be following the same procedures as `accidents_df` for handling missing values in `dask_df`

**OBSERVATION**: Dask dataframe seems to take less time when performing these operations, right?

> **EXERCISE**: Check for duplicates in `accidents_df` and remove them if required.

**OBSERVATION**: No duplicates to remove!

> **EXERCISE**: Repeat the exercises in this section with `accidents_df` and `accidents_dask_df` and track the time taken by each operation.

**OBSERVATION**: Again, Dask dataframe seems to perform faster!

> **EXERCISE**: Try out some more arithmetic operations with other numeric columns of `accidents_df` and `accidents_dask_df`. Measure and compare the time taken for each operation.

**OBSERVATION**: Dask dataframe seems to be winning all the races so far!

> **EXERCISE**: Look up the documentation for the `applymap` method of a data frame. How is it different from `apply` and `map` methods? Demonstrate with examples.

- `apply()`

- `map()`

![](https://i.imgur.com/Az4o01Q.png)

- `applymap()`

**OBSERVATION**: Major difference between `apply()`, `map()` and `applymap()` is-
- `apply()` works on both dataframes and pandas series.
- `map()` works only on pandas series.
- `applymap()` works only on dataframes.

For more differences, see https://stackoverflow.com/questions/19798153/difference-between-map-applymap-and-apply-methods-in-pandas

> **EXERCISE**: Repeat the operations performed in this section (type-specific functions) with `accidents_df` and `accidents_dask_df`. Measure and compare the time taken for each operation.

- Numbers

`accidents_df`

`dask_df`

- Strings

`accidents_df`

`dask_df`

- Dates

`accidents_df`

`dask_df`

**OBSERVATION**: The Dask dataframe is proving itself over and over again with its speed.

Learn more about `map` and `apply` here: https://towardsdatascience.com/introduction-to-pandas-apply-applymap-and-map-5d3e044e93ff

> **EXERCISE**: Remove the column `D` from `df3`. How does it affect the result of vertical concatenation? Try passing the argument `join="inner"` to `pd.concat`. Do you observe any change?

**OBSERVATION**: The entire `D` column (`D` column in `df1` and `df2`) is dropped from the concatenated dataframe.

> **EXERCISE**: Create two dataframes that don't have any common columns and concatenate them vertically. What do you observe? Try providing the arguments `join="outer"` and `join="inner"`. How do they affect the results?

**OBSERVATION**: Because the two dataframes share no columns, the result contains the union of their columns, and each row has `NaN` in the columns that came from the other dataframe.

**OBSERVATION**: `join="inner"` returns a dataframe with columns that are common in both the dataframes. As there are no common columns in both the dataframes, it is returning an empty dataframe.

**OBSERVATION**: Passing `join="outer"` produces the same dataframe as omitting the argument, because `join="outer"` is the default setting when performing concatenation.

> **EXERCISE**: Demonstrate the four types of join listed above using the following dataframes. Use the `key` column for merging

> **EXERCISE**: Show an example of merging two dataframes on two columns.
>
> *Hint*: Read the docs: https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html#database-style-dataframe-or-named-series-joining-merging

> **EXERCISE**: Look up the documentation for the `.join` method of a dataframe. How is it different from `.merge`? Demonstrate with examples. Hint: By default, a join is performed on the index.

![](https://i.imgur.com/4FzBs5L.png)

**OBSERVATION**: `merge()` doesn't have this restriction so it works perfectly fine even without setting the index!