# Exercises for Lecture 2 (Data wrangling with Pandas)

In [1]:
import datetime
now = datetime.datetime.now()
print("Last executed: " + now.strftime("%Y-%m-%d %H:%M:%S"))

Last executed: 2022-01-08 18:39:17


In [2]:
import pandas as pd
import numpy as np

## Exercise 1: Data selection


In [3]:
area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,
                 'Illinois': 12882135})
data = pd.DataFrame({'area':area, 'population':pop})
data

              area  population
California  423967    38332521
Texas       695662    26448193
New York    141297    19651127
Florida     170312    19552860
Illinois    149995    12882135


Create a `DataFrame` containing only those states that have an area greater than 150,000 and a population greater than 20 million.

In [4]:
data[(data.area > 150e3) & (data.population > 20e6)]

              area  population
California  423967    38332521
Texas       695662    26448193


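(Python's `and` and `or` try to convert each whole `Series` to a single `bool`, which Pandas refuses with a `ValueError` because the truth value is ambiguous. Combine conditions with the element-wise operators `&` and `|` instead, wrapping each comparison in parentheses.  Read more [here](http://pandas.pydata.org/pandas-docs/version/0.15/gotchas.html).)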
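
The difference between `and` and `&` can be checked directly. The following is a minimal sketch on a three-row subset of the data above; the names `raised` and `selected` are introduced here purely for illustration.

```python
import pandas as pd

# A three-state subset of the exercise data, rebuilt so the block is self-contained.
data = pd.DataFrame(
    {"area": [423967, 695662, 141297],
     "population": [38332521, 26448193, 19651127]},
    index=["California", "Texas", "New York"],
)

# Python's `and` asks for the truth value of a whole Series, which is ambiguous,
# so Pandas raises a ValueError.
try:
    data[(data.area > 150e3) and (data.population > 20e6)]
    raised = False
except ValueError:
    raised = True

# The element-wise `&` operator combines the two boolean masks row by row.
# The parentheses are needed because `&` binds more tightly than `>`.
selected = data[(data.area > 150e3) & (data.population > 20e6)]
print(raised)
print(selected)
```
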


## Exercise 2: Operating on data in Pandas
Consider the following two series.

In [5]:
area = pd.Series({'Alaska': 1723337, 'Texas': 695662,
                  'California': 423967}, name='area')
population = pd.Series({'California': 38332521, 'Texas': 26448193,
                        'New York': 19651127}, name='population') 

Compute the population density for each state (where possible).

In [6]:
population / area

Alaska              NaN
California    90.413926
New York            NaN
Texas         38.018740
dtype: float64

The Pandas `Series` given by `population/area` is indexed by the *union* of the indices of the two `Series`, with the density computed for the states where both the area and the population are available.

When either the area or the population is not available, the result is NaN, which is how Pandas represents missing data.
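
As a small check of the index alignment described above, the sketch below compares the index of the result with the explicit union of the two indices, then drops the NaN entries to keep only the states found in both `Series`. The names `density`, `union` and `density_known` are illustrative and not part of the exercise.

```python
import pandas as pd

area = pd.Series({'Alaska': 1723337, 'Texas': 695662,
                  'California': 423967}, name='area')
population = pd.Series({'California': 38332521, 'Texas': 26448193,
                        'New York': 19651127}, name='population')

# Division aligns the two Series on their index labels before dividing.
density = population / area

# The explicit union of the two indices, for comparison with density.index.
union = population.index.union(area.index)

# Dropping NaN leaves only the states present in both Series.
density_known = density.dropna()
print(density_known)
```
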

## Exercise 3: Detecting null values

Consider the following series.

In [7]:
data = pd.Series([1, np.nan, 'hello', np.nan])
data

0        1
1      NaN
2    hello
3      NaN
dtype: object

Compute a new `Series` of bools that specifies whether each entry in the above `Series` is *not* NaN.  Using this `Series`, construct a new `Series` from the original data that does not contain the NaN entries.

In [8]:
not_null = data.notnull()
not_null

0     True
1    False
2     True
3    False
dtype: bool

In [9]:
data[not_null]

0        1
2    hello
dtype: object

## Exercise 4: Remove null values directly

Remove null values from the previous data `Series` directly.

In [10]:
data.dropna()

0        1
2    hello
dtype: object
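
Note that `dropna()` returns a new `Series` and leaves the original untouched. The sketch below illustrates this and checks that the result matches the boolean-mask approach of Exercise 3; the name `cleaned` is introduced here only for illustration.

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 'hello', np.nan])

# dropna() returns a new Series; the original index labels (0 and 2) are kept.
cleaned = data.dropna()

# The original Series still holds all four entries, including the NaNs.
print(len(data), len(cleaned))

# The result is the same as selecting with the notnull() mask.
print(cleaned.equals(data[data.notnull()]))
```

To keep the result, assign it to a name, as done with `cleaned` here.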