# Assignment

## Tidy data

Let's examine this particular dataset, which can be accessed at the following URL:
https://raw.githubusercontent.com/tidyverse/tidyr/4c0a8d0fdb9372302fcc57ad995d57a43d9e4337/vignettes/pew.csv

In [None]:
import pandas as pd
pew_df = pd.read_csv('https://raw.githubusercontent.com/tidyverse/tidyr/4c0a8d0fdb9372302fcc57ad995d57a43d9e4337/vignettes/pew.csv')
pew_df

This dataset describes the relationship between income and religion and was compiled from a survey conducted by the Pew Research Center. For further information, please refer to this link. Now, let's determine whether this dataset is tidy, and why or why not.

Many of the column headers in this dataset are values (income ranges) rather than variable names. This makes the dataset untidy. To make it tidy, we can use the "melt" function provided by pandas.

The "melt" function lets us specify the identifier columns, which are kept as they are, using the id_vars parameter, and the columns to unpivot into rows using the value_vars parameter. We can also name the two resulting columns: var_name names the column that holds the former column headers, and value_name names the column that holds their values. With this function, we can easily transform the dataset into a tidy format.
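As a rough illustration of these parameters, here is a minimal sketch on a made-up table. The `wide_df` data, the city names, and the column names `year` and `visitors` are all hypothetical and have nothing to do with the Pew dataset; only `melt` and its `id_vars`, `value_vars`, `var_name`, and `value_name` parameters come from pandas.

```python
import pandas as pd

# Hypothetical toy table (not the Pew data): one row per city and
# one column per year, so the years are values hiding in the header.
wide_df = pd.DataFrame({
    "city": ["Springfield", "Shelbyville"],
    "2019": [10, 7],
    "2020": [12, 9],
})

# Keep "city" as an identifier and unpivot the year columns into rows.
long_df = wide_df.melt(
    id_vars=["city"],
    value_vars=["2019", "2020"],
    var_name="year",         # holds the former column headers
    value_name="visitors",   # holds the cell values
)
print(long_df)
```

Each (city, year) pair now has its own row, so the result has four rows and three columns.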

In [None]:
# Assignment: Tidy the data table by using the melt function. Use variable name "income" and value name "frequency".
# The following code builds a placeholder in the tidy format you should see when you are successful. Note that the values are dummies.
pew_tidy_df = pd.DataFrame({"religion": ["ABCD" for i in range(15)],
                            "income": ["1k" for i in range(15)],
                            "frequency": [i for i in range(15)]})

# Data types

Let's discuss data types briefly. Understanding data types is crucial not only for selecting appropriate visualizations but also for efficient data computation and storage. You may not have considered how pandas represents data in memory. A pandas DataFrame is a collection of Series, each of which is essentially backed by a numpy array. An array can contain fixed-length items like integers or variable-length items like strings. Taking the time to choose the correct data type can save a significant amount of memory and time.
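To make the fixed-length versus variable-length distinction concrete, here is a small sketch using plain numpy arrays. The array names and the size `n` are made up for illustration.

```python
import numpy as np

n = 1_000  # hypothetical number of items

# Fixed-length items: every element takes the same number of bytes.
as_int64 = np.arange(n, dtype=np.int64)
as_int8 = (np.arange(n) % 100).astype(np.int8)

# Variable-length items: the array holds only pointers to Python strings.
as_object = np.array([str(i) for i in range(n)], dtype=object)

print(as_int64.nbytes)   # 8 bytes per item
print(as_int8.nbytes)    # 1 byte per item
print(as_object.nbytes)  # one pointer per item; the strings live elsewhere
```

When the values fit, picking `int8` over `int64` cuts the array to an eighth of its size. The `nbytes` of the object array does not include the strings themselves.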

A great example of this is the categorical data type. If you have a variable that only has a few possible values, it can be considered categorical data. Let's examine the income variable as an illustration.

In [None]:
pew_tidy_df.income.value_counts()

These were the column names in the original untidy dataset. The value can only fall within one of these income ranges, making it categorical data. What data type does pandas use to store this column?

In [None]:
pew_tidy_df.income.dtype

The "O" stands for the object data type. Unlike an integer or a float, an object does not have a fixed size, so the Series stores pointers to the individual text objects rather than the text itself. We can examine how much memory the dataset uses.

In [None]:
pew_tidy_df.memory_usage()

In [None]:
pew_tidy_df.memory_usage(deep=True)

What does the deep=True option do? Without deep=True, the memory_usage method reports only the memory used by the numpy arrays in the pandas dataframe. With deep=True, it also counts the memory used by the text objects themselves, giving you the total memory usage. Consequently, the religion and income columns occupy nearly ten times more memory than the frequency column, which is simply an array of integers.
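The difference can be reproduced on a small made-up Series. This is only a sketch: the `labels` data is hypothetical, and the dtype is set to `object` explicitly so that the strings are stored as Python objects, as in the discussion above.

```python
import pandas as pd

# Hypothetical text column, stored explicitly as Python objects.
labels = pd.Series(["low", "medium", "high"] * 100, dtype=object)

# Shallow: only the array of pointers is counted.
shallow = labels.memory_usage(index=False)

# Deep: the string objects the pointers refer to are counted as well.
deep = labels.memory_usage(index=False, deep=True)

print(shallow, deep)
```

The exact numbers depend on the platform and Python version, but the deep figure should be several times larger than the shallow one.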

Is there a way to reduce memory usage? Note that the income variable has only 10 categories, so each value can be represented by one of 10 small integer codes. The category names still have to be stored, but only once. The simplest way to convert a column is the astype method.

In [None]:
income_categorical_series = pew_tidy_df.income.astype('category')

In [None]:
income_categorical_series.dtype

This series has the CategoricalDtype dtype and uses much less memory, by roughly a factor of ten!

In [None]:
income_categorical_series.memory_usage(deep=True)
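To see where the savings come from, we can look inside a categorical Series through its `cat` accessor. The sketch below uses a made-up `labels` Series rather than the Pew data; `cat.categories` and `cat.codes` are the pandas attributes that expose the stored names and the per-row integer codes.

```python
import pandas as pd

# Hypothetical text column with only three distinct values.
labels = pd.Series(["low", "medium", "high"] * 100, dtype=object)
as_category = labels.astype("category")

# Each distinct label is stored once...
print(as_category.cat.categories)

# ...and every row holds only a small integer code pointing at it.
print(as_category.cat.codes.dtype)

# Compare the total memory of the two representations.
object_bytes = labels.memory_usage(index=False, deep=True)
category_bytes = as_category.memory_usage(index=False, deep=True)
print(object_bytes, category_bytes)
```

With so few categories, pandas can store each code in a single byte, which is why the categorical version is much smaller.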

If the categories have a natural order, you can specify that too.

In [None]:
from pandas.api.types import CategoricalDtype
income_type = CategoricalDtype(
    categories=["Don't know/refused", '<$10k', '$10-20k', '$20-30k', '$30-40k',
                '$40-50k', '$50-75k', '$75-100k', '$100-150k', '>150k'],
    ordered=True)
income_type
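An ordered categorical dtype lets pandas sort and compare values by category order instead of alphabetically. Here is a minimal sketch with a made-up `size_type`; the category labels and variable names are hypothetical.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical ordered categories, listed from smallest to largest.
size_type = CategoricalDtype(categories=["small", "medium", "large"], ordered=True)
sizes = pd.Series(["medium", "large", "small"]).astype(size_type)

# Sorting follows the declared order, not alphabetical order.
print(sizes.sort_values().tolist())

# Comparisons and min/max work because the categories are ordered.
print((sizes < "large").tolist())
print(sizes.max())
```

Without `ordered=True`, the comparison and `max` calls above would raise an error, because pandas would have no way to rank the labels.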

Now, your assignment is to convert both the religion and income columns of pew_tidy_df to the categorical dtype and show that pew_tidy_df now uses much less memory.

In [None]:
# Assignment: Convert both the religion and income columns of pew_tidy_df to the categorical dtype and show that pew_tidy_df now uses much less memory.

# Want to learn more?
- [Tidy Data in Python](http://www.jeannicholashould.com/tidy-data-in-python.html)
- [Stephen Simmons | Pandas from the Inside](https://www.youtube.com/watch?v=CowlcrtSyME)
- [Data School: How do I make my pandas DataFrame smaller and faster?](https://www.youtube.com/watch?v=wDYDYGyN_cw)