# Module 2: Essential Python Libraries

In this module, we will learn about some of the essential `Python` libraries for data science, such as *`NumPy`, `pandas`, `matplotlib`, and `pyjanitor`*.

- **NumPy**
  `NumPy`, which stands for *Numerical Python*, is "the fundamental package for scientific computing in Python." In this module, we will briefly discuss `NumPy`. For example, we will demonstrate how to:
    - Create NumPy arrays and perform basic operations.
    - Apply mathematical functions to arrays.
  
- **Pandas**
  `pandas` is the main library for data cleaning and transformation (wrangling) in `Python`. We will use `pandas 2` in this course. Essential functions of the `pandas` library include:
    - Loading datasets.
    - Exploring and manipulating data.

- **Matplotlib**
  > Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python. Matplotlib makes easy things easy and hard things possible.
  > - Create publication quality plots.
  > - Make interactive figures that can zoom, pan, and update.
  > - Customize visual style and layout.
  > - Export to many file formats.
  > - Embed in JupyterLab and Graphical User Interfaces.
  > - Use a rich array of third-party packages built on Matplotlib.

- **Pyjanitor**
  > pyjanitor is a Python implementation of the R package janitor, and provides a clean API for cleaning data.

  Current functionality includes:
  - Cleaning column names (multi-indexes are possible!)
  - Removing empty rows and columns
  - Identifying duplicate entries
  - Encoding columns as categorical
  - Splitting your data into features and targets (for machine learning)
  - Adding, removing, and renaming columns
  - Coalesce multiple columns into a single column
  - Date conversions (from matlab, excel, unix) to Python datetime format
  - Expand a single column that has delimited, categorical values into dummy-encoded variables
  - Concatenating and deconcatenating columns, based on a delimiter
  - Syntactic sugar for filtering the dataframe based on queries on a column
  - Experimental submodules for finance, biology, chemistry, engineering, and pyspark

We will use `pyjanitor` primarily to clean and tidy columns throughout this course.


## Getting Started
Now, let us examine each of the above libraries using examples. We will illustrate each package only briefly, except for `pandas` and `matplotlib`, which we will explore thoroughly in the remainder of the course.

### NumPy

You can install the `NumPy` package in several ways. In this course, we will install `NumPy` with either `conda` or `pip` as follows:

In [42]:
# Installing NumPy with conda:
# %conda install numpy

In [41]:
# Installing NumPy with pip:
# %pip install numpy

Note: you may need to restart the kernel to use updated packages.


### Examples
The following examples illustrate how to use `NumPy`.

In [2]:
# Creating a numpy array
import numpy as np
demo = np.array([1, 2, 3, np.nan]) 
demo

array([ 1.,  2.,  3., nan])

In [46]:
# Create an array
arr = np.array([[1, 3, 4, 8], [0, -8, 12, 10]])
arr

array([[ 1,  3,  4,  8],
       [ 0, -8, 12, 10]])

In [4]:
# Example 2 - creating a 3 x 5 matrix
my_matrix = np.array([[4, 2, 0, -6, 7], [8, 10, 4, 15, 12], [0, 2, 8, 13, -9]])

my_matrix

array([[ 4,  2,  0, -6,  7],
       [ 8, 10,  4, 15, 12],
       [ 0,  2,  8, 13, -9]])

In [49]:
# Manipulating NumPy array: pull out the third row (index 2; indexing starts at 0)
my_matrix[2]

array([ 0,  2,  8, 13, -9])

In [50]:
# Manipulating NumPy array: pull out the fourth column (index 3)
my_matrix[:, 3]

array([-6, 15, 13])
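
Because NumPy indices start at zero, `my_matrix[2]` is the third row and `my_matrix[:, 3]` is the fourth column. The sketch below rebuilds the same 3 x 5 matrix under the illustrative name `m` (not part of the original notebook) and shows how to reach the first row, the first column, a single element, and the last row.

```python
import numpy as np

# Same values as my_matrix above; `m` is only an illustrative name
m = np.array([[4, 2, 0, -6, 7], [8, 10, 4, 15, 12], [0, 2, 8, 13, -9]])

first_row = m[0]      # row at index 0
first_col = m[:, 0]   # every row, column at index 0
element = m[1, 3]     # row index 1, column index 3
last_row = m[-1]      # negative indices count from the end

print(first_row)
print(first_col)
print(element)
print(last_row)
```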

We can also use `NumPy` to create placeholder arrays, either uninitialized or filled with zeros. Creating an array with `np.empty` is faster than with `np.zeros` because `np.empty` does not initialize its contents, so the values it displays are arbitrary.

In [11]:
# Creating an empty array
np.empty(5)

array([ 1.  ,  2.75,  6.  , 10.75, 17.  ])

In [12]:
# Creating an array with ones 
np.ones(3)

array([1., 1., 1.])

In [14]:
# Creating an array with zeroes
np.zeros(4)

array([0., 0., 0., 0.])
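
The output of `np.empty(5)` above looks like arbitrary numbers because the array is allocated without being initialized. A minimal sketch of the practical difference, using the illustrative names `placeholder` and `zeros`:

```python
import numpy as np

# np.empty allocates memory without initializing it: contents are arbitrary
placeholder = np.empty(5)

# np.zeros allocates and fills with 0.0, so the contents are predictable
zeros = np.zeros(5)

# Both share the same shape and default dtype
print(placeholder.shape, placeholder.dtype)
print(zeros)

# Always fill an np.empty array before reading from it
placeholder[:] = np.arange(5)
print(placeholder)
```

Reaching for `np.empty` only makes sense when every element will be overwritten before use; otherwise `np.zeros` is the safer default.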

In [53]:
# Creating a NumPy array with a range of numbers or elements
np.arange(11)

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10])

In [20]:
# Creating an array with linearly spaced values in a specified interval
np.linspace(0, 10, num=5)

array([ 0. ,  2.5,  5. ,  7.5, 10. ])
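
`np.arange` and `np.linspace` both produce evenly spaced values, but they are parameterized differently. The sketch below contrasts them on the same interval; the variable names are illustrative.

```python
import numpy as np

# arange takes start, stop, step; the stop value is excluded
stepped = np.arange(0, 10, 2.5)

# linspace takes start, stop, number of points; the stop value is included by default
spaced = np.linspace(0, 10, num=5)

print(stepped)
print(spaced)
```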

### Adding and sorting elements

In [23]:
my_arr = np.array([2, 1, 5, 3, 7, 4, 6, 8])

# Let's sort the elements in our array
my_arr.sort()

# Print the output
my_arr

array([1, 2, 3, 4, 5, 6, 7, 8])
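
Note that the `.sort()` method sorts the array in place. When the original order must be preserved, `np.sort` returns a sorted copy instead. A short sketch with illustrative names:

```python
import numpy as np

original = np.array([2, 1, 5, 3, 7, 4, 6, 8])

# np.sort returns a sorted copy and leaves `original` untouched
sorted_copy = np.sort(original)

# Reversing the sorted copy gives descending order
descending = sorted_copy[::-1]

print(original)
print(sorted_copy)
print(descending)
```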

In [55]:
# Joining two arrays together
a = np.array([1, 2, 3, 4])
b = np.array([5, 6, 7, 8])

# Concatenate a with b
np.concatenate((a, b))

array([1, 2, 3, 4, 5, 6, 7, 8])

In [59]:
# Element-wise division of a by b
a / b

array([0.2       , 0.33333333, 0.42857143, 0.5       ])

In [27]:
# We could also concatenate 2-D arrays along the rows (axis=0)
a = np.array([[1, 2, 0], [3, 4, 4]])
b = np.array([[5, 6, -3]])

np.concatenate((a,b), axis=0)

array([[ 1,  2,  0],
       [ 3,  4,  4],
       [ 5,  6, -3]])
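
Concatenation joins arrays end to end, whereas arithmetic operators such as `+` and `/` work element by element on arrays of the same shape. The sketch below puts the two side by side; `x` and `y` are illustrative names holding the same values as the 1-D `a` and `b` used earlier.

```python
import numpy as np

x = np.array([1, 2, 3, 4])
y = np.array([5, 6, 7, 8])

joined = np.concatenate((x, y))  # one array of length 8
summed = x + y                   # element-wise addition, length 4
ratio = x / y                    # element-wise division, length 4

print(joined)
print(summed)
print(ratio)
```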

### Checking matrix or array dimension, number of elements, and size

In [28]:
my_matrix_2 = np.array([[[0, 1, 2, 3],
                        [4, 5, 6, 7]],
                        
                        [[0, 1, 2, 3],
                        [4, 5, 6, 7]],
                        
                        [[0 ,1 ,2, 3],
                        [4, 5, 6, 7]]
                        ]
                       )
# Print the output
my_matrix_2

array([[[0, 1, 2, 3],
        [4, 5, 6, 7]],

       [[0, 1, 2, 3],
        [4, 5, 6, 7]],

       [[0, 1, 2, 3],
        [4, 5, 6, 7]]])

In [29]:
# Find the dimensions of this matrix
my_matrix_2.ndim

3

In [30]:
# Compute the number of elements in this matrix
my_matrix_2.size

24

In [31]:
# Find the shape of this matrix
my_matrix_2.shape

(3, 2, 4)

In [60]:
# Reshaping NumPy array
my_arr_2 = np.arange(1, 13)
my_arr_2

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12])

In [61]:
# Reshape my_arr_2 - convert into a 3 x 4 matrix
my_arr_2.reshape(3, 4)

array([[ 1,  2,  3,  4],
       [ 5,  6,  7,  8],
       [ 9, 10, 11, 12]])
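
`reshape` returns a reshaped result and leaves the original array's shape unchanged, and one dimension can be given as `-1` so that NumPy infers it from the number of elements. A small sketch, with `values` standing in for `my_arr_2`:

```python
import numpy as np

values = np.arange(1, 13)

# -1 asks NumPy to infer that dimension from the array's size
inferred = values.reshape(3, -1)

print(inferred.shape)
print(values.shape)  # the original array keeps its 1-D shape

# A shape that does not match the number of elements raises ValueError
try:
    values.reshape(5, 3)
except ValueError as err:
    print("reshape failed:", err)
```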

In [62]:
# Performing mathematical operations on a NumPy array
my_arr_2.sum(axis=0)

78

In [64]:
# Sum the 3-D array along axis 1 (the two rows inside each block)
my_matrix_2.sum(axis=1)

array([[ 4,  6,  8, 10],
       [ 4,  6,  8, 10],
       [ 4,  6,  8, 10]])
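
The `axis` argument names the dimension that is collapsed by the operation. The sketch below rebuilds an array with the same layout as `my_matrix_2` (illustratively called `cube`) and sums it along each axis in turn.

```python
import numpy as np

# Same layout as my_matrix_2: 3 blocks, each with 2 rows and 4 columns
cube = np.array([[[0, 1, 2, 3], [4, 5, 6, 7]]] * 3)

total = cube.sum()            # no axis: sum of every element
per_axis0 = cube.sum(axis=0)  # collapses the 3 blocks, shape (2, 4)
per_axis1 = cube.sum(axis=1)  # collapses the 2 rows, shape (3, 4)
per_axis2 = cube.sum(axis=2)  # collapses the 4 columns, shape (3, 2)

print(total)
print(per_axis0.shape, per_axis1.shape, per_axis2.shape)
```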

In [65]:
my_arr_2.max()

12

In [66]:
my_arr_2.min()

1

### Pandas
In this course, we will be using `pandas 2`, which can use `PyArrow` as its backend. `PyArrow` is designed to work with tabular and heterogeneous data, and using it as the `pandas` backend can make `pandas` faster and more efficient.

### Installing Pandas

Pandas can be installed either with conda or pip, as shown below.

#### Anaconda

In [5]:
# Installing pandas, pyarrow, seaborn, and matplotlib with conda
# %conda install pandas pyarrow seaborn matplotlib

In [1]:
import pandas as pd
pd.__version__


'2.2.1'

#### Pip

First, we create and activate a virtual environment for our project, then install the packages into it.

In [None]:
# python3 -m venv pandas-env
# source pandas-env/bin/activate
# pip install pandas pyarrow

## Data Structures
pandas has two main data structures:

- Series: a one-dimensional array representing a column in a DataFrame.
- DataFrame: a two-dimensional array for representing a tabular, spreadsheet-like data structure (Effective Pandas 2, p. 15).
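
The relationship between the two structures can be sketched with a tiny DataFrame. The values below are made up for illustration and are not taken from the census data.

```python
import pandas as pd

# Hypothetical toy data, not the census dataset
df = pd.DataFrame({"region_name": ["Jonglei", "Unity"], "population": [100, 200]})

# Selecting a single column of a DataFrame returns a Series
col = df["population"]

print(type(df).__name__, df.shape)
print(type(col).__name__, col.shape)
```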

## Data Types

| Data type | Python | Pandas 2 | Pandas 1 |
|---|---|---|---|
| Integer | `int` | `'int64[pyarrow]'` | `'int64'` |
| Float | `float` | `'float64[pyarrow]'` | `'float64'` |
| String | `str` | `pd.ArrowDtype(pa.string())` | `object`, `str` |
| Date | `datetime` | `'timestamp[ns][pyarrow]'` | `'datetime64[ns]'` |
| Categorical | | `'dictionary'` | `'category'` |
| Boolean | `bool` | `'bool[pyarrow]'` | `bool` |

### PyArrow and the Future (Effective Pandas 2, p. 22)

> To overcome the annoyances and issues of NumPy in Pandas, for Pandas 2.0, the developers added a new backend, PyArrow. You can think of PyArrow as a next-generation NumPy. It is designed to work with tabular data and heterogeneous data. It also handles missing values in integers and has better support for strings. If you turn on the PyArrow backend, Pandas will be faster and more efficient. However, there might be some rough edges where PyArrow is missing features that NumPy has. I will try and point out these differences in the book.

> PyArrow also allows us to create arrays and tables similar to NumPy. However, we are going to use the library indirectly through Pandas.
> For the remainder of the chapter, I want to review the common types and show what they look like in pure Python, Pandas 1.x and Pandas 2.0. I will also show how to convert between the types.
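
The point about missing values in integers can be illustrated with a small sketch. It assumes `pandas 2` is installed; because PyArrow may not be available everywhere, the code falls back to pandas' nullable `Int64` dtype, which is a substitution made here for portability and not something the book prescribes.

```python
import pandas as pd

# Assumption: PyArrow may or may not be installed; fall back to pandas'
# nullable "Int64" dtype so the sketch still runs without it.
try:
    import pyarrow  # noqa: F401
    int_dtype = "int64[pyarrow]"
except ImportError:
    int_dtype = "Int64"

# Default (NumPy-backed) inference: the missing value forces float64
numpy_backed = pd.Series([1, 2, None])

# Arrow-backed (or nullable) integers keep the integer type and store <NA>
arrow_backed = pd.Series([1, 2, None], dtype=int_dtype)

print(numpy_backed.dtype)
print(arrow_backed.dtype)
print(arrow_backed.isna().sum())
```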

### Importing Datasets

In [1]:
# Import South Sudan 2008 Census Data
import pandas as pd
import numpy as np
import pyarrow as pa
from janitor import clean_names
census_raw = pd.read_csv(    
    'Python for Data Science_files/data/ss_2008_census_data_raw.csv', 
    dtype_backend='pyarrow', 
    engine='pyarrow',
    # skipfooter=3
    # usecols=['Region Name', 'Variable Name', 'Age Name', '2008']
    )

In [39]:
%%timeit
census_raw = pd.read_csv(    
    'Python for Data Science_files/data/ss_2008_census_data_raw.csv', 
    dtype_backend='pyarrow', 
    engine='pyarrow',
    # usecols=['Region Name', 'Variable Name', 'Age Name', '2008']
    )

1.06 ms ± 24.9 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


In [40]:
%%timeit
census_raw = pd.read_csv(    
    'Python for Data Science_files/data/ss_2008_census_data_raw.csv', 
    dtype_backend='pyarrow', 
    engine='pyarrow',
    usecols=['Region Name', 'Variable Name', 'Age Name', '2008']
    )

747 µs ± 30.6 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
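
The two timings above differ only in `usecols`: in this run, asking `read_csv` for four columns instead of all ten was faster. The sketch below shows the same parameter on a small in-memory CSV. The text in `csv_text` is made up for illustration, and the default parser engine is used so that the example does not depend on PyArrow.

```python
import io

import pandas as pd

# Made-up CSV text standing in for a file on disk
csv_text = "Region Name,Scale,2008\nJonglei,units,10\nUnity,units,20\n"

# Only the listed columns are parsed and returned
subset = pd.read_csv(io.StringIO(csv_text), usecols=["Region Name", "2008"])

print(subset.columns.to_list())
print(subset.shape)
```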


In [42]:
# ??pd.read_csv

In [43]:
# Inspect the first few rows
census_raw.head(5)

Unnamed: 0,Region,Region Name,Region - RegionId,Variable,Variable Name,Age,Age Name,Scale,Units,2008
0,KN.A2,Upper Nile,SS-NU,KN.B2,"Population, Total (Number)",KN.C1,Total,units,Persons,964353
1,KN.A2,Upper Nile,SS-NU,KN.B2,"Population, Total (Number)",KN.C2,0 to 4,units,Persons,150872
2,KN.A2,Upper Nile,SS-NU,KN.B2,"Population, Total (Number)",KN.C3,5 to 9,units,Persons,151467
3,KN.A2,Upper Nile,SS-NU,KN.B2,"Population, Total (Number)",KN.C4,10 to 14,units,Persons,126140
4,KN.A2,Upper Nile,SS-NU,KN.B2,"Population, Total (Number)",KN.C5,15 to 19,units,Persons,103804


In [44]:
# Inspect the last few rows
census_raw.tail(5)

Unnamed: 0,Region,Region Name,Region - RegionId,Variable,Variable Name,Age,Age Name,Scale,Units,2008
448,KN.A11,Eastern Equatoria,SS-EE,KN.B8,"Population, Female (Number)",KN.C14,60 to 64,units,Persons,5274.0
449,KN.A11,Eastern Equatoria,SS-EE,KN.B8,"Population, Female (Number)",KN.C22,65+,units,Persons,8637.0
450,,,,,,,,,,
451,Source:,"National Bureau of Statistics, South Sudan",,,,,,,,
452,Download URL:,http://southsudan.opendataforafrica.org/fvjqdp...,,,,,,,,


In [45]:
# Pull random rows to inspect
census_raw.sample(n=5, random_state=254)

Unnamed: 0,Region,Region Name,Region - RegionId,Variable,Variable Name,Age,Age Name,Scale,Units,2008
179,KN.A5,Warrap,SS-WR,KN.B8,"Population, Female (Number)",KN.C22,65+,units,Persons,10625
402,KN.A10,Central Equatoria,SS-EC,KN.B8,"Population, Female (Number)",KN.C13,55 to 59,units,Persons,6366
289,KN.A8,Lakes,SS-LK,KN.B5,"Population, Male (Number)",KN.C5,15 to 19,units,Persons,37766
96,KN.A4,Unity,SS-UY,KN.B2,"Population, Total (Number)",KN.C7,25 to 29,units,Persons,45101
85,KN.A3,Jonglei,SS-JG,KN.B8,"Population, Female (Number)",KN.C11,45 to 49,units,Persons,20227


In [47]:
# Pull column names
census_raw.columns.to_list()

['Region',
 'Region Name',
 'Region - RegionId',
 'Variable',
 'Variable Name',
 'Age',
 'Age Name',
 'Scale',
 'Units',
 '2008']

In [48]:
# Check the dimensions of our dataset
census_raw.shape

(453, 10)

In [49]:
# Print the dimensions of our dataset
print(f'South Sudan 2008 Census Data has {census_raw.shape[1]} columns (variables/features) \nand {census_raw.shape[0]} rows (observations).')

South Sudan 2008 Census Data has 10 columns (variables/features) 
and 453 rows (observations).


In [51]:
# Check for missing values
census_raw.isna().sum()

Region               1
Region Name          1
Region - RegionId    3
Variable             3
Variable Name        3
Age                  3
Age Name             3
Scale                3
Units                3
2008                 3
dtype: int64

In [52]:
# Clean or tidy column names
census_2008 = census_raw.clean_names()

census_2008.head()


Unnamed: 0,region,region_name,region_regionid,variable,variable_name,age,age_name,scale,units,2008
0,KN.A2,Upper Nile,SS-NU,KN.B2,"Population, Total (Number)",KN.C1,Total,units,Persons,964353
1,KN.A2,Upper Nile,SS-NU,KN.B2,"Population, Total (Number)",KN.C2,0 to 4,units,Persons,150872
2,KN.A2,Upper Nile,SS-NU,KN.B2,"Population, Total (Number)",KN.C3,5 to 9,units,Persons,151467
3,KN.A2,Upper Nile,SS-NU,KN.B2,"Population, Total (Number)",KN.C4,10 to 14,units,Persons,126140
4,KN.A2,Upper Nile,SS-NU,KN.B2,"Population, Total (Number)",KN.C5,15 to 19,units,Persons,103804
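
To make the renaming above less of a black box, the sketch below approximates it with plain `pandas`. The helper `tidy_name` is hypothetical and is not pyjanitor's implementation; it only reproduces the lowercase, underscore-separated names seen in the output above for these particular column names.

```python
import re

import pandas as pd


def tidy_name(name: str) -> str:
    """Hypothetical helper: lowercase, then collapse non-alphanumeric runs to '_'."""
    return re.sub(r"[^0-9a-z]+", "_", str(name).lower()).strip("_")


# An empty DataFrame is enough to demonstrate the renaming
raw = pd.DataFrame(columns=["Region Name", "Region - RegionId", "2008"])
tidy = raw.rename(columns=tidy_name)

print(tidy.columns.to_list())
```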


In [53]:
# Pull out a column using attribute (dot) access
census_2008.region_name

0                                             Upper Nile
1                                             Upper Nile
2                                             Upper Nile
3                                             Upper Nile
4                                             Upper Nile
                             ...                        
448                                    Eastern Equatoria
449                                    Eastern Equatoria
450                                                 <NA>
451           National Bureau of Statistics, South Sudan
452    http://southsudan.opendataforafrica.org/fvjqdp...
Name: region_name, Length: 453, dtype: string[pyarrow]

In [54]:
# Pull out a column using square brackets
census_2008['region_name']

0                                             Upper Nile
1                                             Upper Nile
2                                             Upper Nile
3                                             Upper Nile
4                                             Upper Nile
                             ...                        
448                                    Eastern Equatoria
449                                    Eastern Equatoria
450                                                 <NA>
451           National Bureau of Statistics, South Sudan
452    http://southsudan.opendataforafrica.org/fvjqdp...
Name: region_name, Length: 453, dtype: string[pyarrow]
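
Attribute access and square brackets return the same Series, but attribute access only works when the column name is a valid Python identifier. A column such as `2008` therefore has to be selected with brackets. A small sketch with made-up values:

```python
import pandas as pd

# Made-up values for illustration
df = pd.DataFrame({"region_name": ["Jonglei", "Unity"], "2008": [10, 20]})

# Both access styles return the same Series
same = df.region_name.equals(df["region_name"])

# "2008" is not a valid identifier, so df.2008 would be a syntax error
is_identifier = "2008".isidentifier()
year_col = df["2008"]

print(same, is_identifier)
print(year_col.tolist())
```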

In [55]:
# Value counts
census_2008['region_name'].value_counts()

region_name
Upper Nile                                                                                    45
Jonglei                                                                                       45
Unity                                                                                         45
Warrap                                                                                        45
Northern Bahr el Ghazal                                                                       45
Western Bahr el Ghazal                                                                        45
Lakes                                                                                         45
Western Equatoria                                                                             45
Central Equatoria                                                                             45
Eastern Equatoria                                                                             45
National Bureau of

In [58]:
# Dropping rows with missing values
census_2008.dropna(axis=0)

Unnamed: 0,region,region_name,region_regionid,variable,variable_name,age,age_name,scale,units,2008
0,KN.A2,Upper Nile,SS-NU,KN.B2,"Population, Total (Number)",KN.C1,Total,units,Persons,964353
1,KN.A2,Upper Nile,SS-NU,KN.B2,"Population, Total (Number)",KN.C2,0 to 4,units,Persons,150872
2,KN.A2,Upper Nile,SS-NU,KN.B2,"Population, Total (Number)",KN.C3,5 to 9,units,Persons,151467
3,KN.A2,Upper Nile,SS-NU,KN.B2,"Population, Total (Number)",KN.C4,10 to 14,units,Persons,126140
4,KN.A2,Upper Nile,SS-NU,KN.B2,"Population, Total (Number)",KN.C5,15 to 19,units,Persons,103804
...,...,...,...,...,...,...,...,...,...,...
445,KN.A11,Eastern Equatoria,SS-EE,KN.B8,"Population, Female (Number)",KN.C11,45 to 49,units,Persons,13727
446,KN.A11,Eastern Equatoria,SS-EE,KN.B8,"Population, Female (Number)",KN.C12,50 to 54,units,Persons,9482
447,KN.A11,Eastern Equatoria,SS-EE,KN.B8,"Population, Female (Number)",KN.C13,55 to 59,units,Persons,5740
448,KN.A11,Eastern Equatoria,SS-EE,KN.B8,"Population, Female (Number)",KN.C14,60 to 64,units,Persons,5274


In [59]:
# Dropping NAs in place (this raised the warning shown below; reassigning the result, as in the next cell, avoids it)
census_2008.dropna(inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  census_2008.dropna(inplace=True)


In [60]:
# Drop rows with missing values by reassignment, then check the shape
census_2008 = census_2008.dropna()

census_2008.shape

(450, 10)
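
`dropna` returns a new DataFrame and, by default, removes every row that contains at least one missing value, which is how the three footer rows disappeared above. The sketch below uses made-up rows that mimic a data block followed by an empty line and a source note; the `subset` argument restricts the check to chosen columns.

```python
import pandas as pd

# Made-up rows: two data rows, one empty row, one footer-like row
raw = pd.DataFrame({
    "region_name": ["Jonglei", "Unity", None, "Source note"],
    "2008": [10, 20, None, None],
})

cleaned = raw.dropna()                        # drop rows with any missing value
partial = raw.dropna(subset=["region_name"])  # only check one column

print(raw.shape, cleaned.shape, partial.shape)
```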

In [61]:
# Value counts
census_2008['region_name'].value_counts()

region_name
Upper Nile                 45
Jonglei                    45
Unity                      45
Warrap                     45
Northern Bahr el Ghazal    45
Western Bahr el Ghazal     45
Lakes                      45
Western Equatoria          45
Central Equatoria          45
Eastern Equatoria          45
Name: count, dtype: int64[pyarrow]

In [62]:
# Number of unique regions
census_2008['region_name'].nunique()

10

In [69]:
# Sorted list of unique region names
sorted(list(census_2008['region_name'].unique()))

['Central Equatoria',
 'Eastern Equatoria',
 'Jonglei',
 'Lakes',
 'Northern Bahr el Ghazal',
 'Unity',
 'Upper Nile',
 'Warrap',
 'Western Bahr el Ghazal',
 'Western Equatoria']

In [70]:
# List the column names before selecting the columns of interest
census_2008.columns.to_list()

['region',
 'region_name',
 'region_regionid',
 'variable',
 'variable_name',
 'age',
 'age_name',
 'scale',
 'units',
 '2008']

In [72]:
# Select desired columns
census_2008_2 = census_2008[['region_name', 'variable_name', 'age_name', '2008']]

census_2008_2.head(10)

Unnamed: 0,region_name,variable_name,age_name,2008
0,Upper Nile,"Population, Total (Number)",Total,964353
1,Upper Nile,"Population, Total (Number)",0 to 4,150872
2,Upper Nile,"Population, Total (Number)",5 to 9,151467
3,Upper Nile,"Population, Total (Number)",10 to 14,126140
4,Upper Nile,"Population, Total (Number)",15 to 19,103804
5,Upper Nile,"Population, Total (Number)",20 to 24,82588
6,Upper Nile,"Population, Total (Number)",25 to 29,76754
7,Upper Nile,"Population, Total (Number)",30 to 34,63134
8,Upper Nile,"Population, Total (Number)",35 to 39,56806
9,Upper Nile,"Population, Total (Number)",40 to 44,42139


In [74]:
# Unique values in variable_name column
list(census_2008_2.variable_name.unique())

['Population, Total (Number)',
 'Population, Male (Number)',
 'Population, Female (Number)']

In [75]:
# Selecting columns by label with .loc
census_2008.loc[:, ['region_name', 'variable_name', 'age_name', '2008']]

Unnamed: 0,region_name,variable_name,age_name,2008
0,Upper Nile,"Population, Total (Number)",Total,964353
1,Upper Nile,"Population, Total (Number)",0 to 4,150872
2,Upper Nile,"Population, Total (Number)",5 to 9,151467
3,Upper Nile,"Population, Total (Number)",10 to 14,126140
4,Upper Nile,"Population, Total (Number)",15 to 19,103804
...,...,...,...,...
445,Eastern Equatoria,"Population, Female (Number)",45 to 49,13727
446,Eastern Equatoria,"Population, Female (Number)",50 to 54,9482
447,Eastern Equatoria,"Population, Female (Number)",55 to 59,5740
448,Eastern Equatoria,"Population, Female (Number)",60 to 64,5274


In [76]:
# Column names in order; their positions are used with .iloc in the next cell
census_raw.columns

Index(['Region', 'Region Name', 'Region - RegionId', 'Variable',
       'Variable Name', 'Age', 'Age Name', 'Scale', 'Units', '2008'],
      dtype='object')

In [77]:
# Selecting the same columns by integer position with .iloc
census_2008.iloc[:, [1, 4, 6, 9]]

Unnamed: 0,region_name,variable_name,age_name,2008
0,Upper Nile,"Population, Total (Number)",Total,964353
1,Upper Nile,"Population, Total (Number)",0 to 4,150872
2,Upper Nile,"Population, Total (Number)",5 to 9,151467
3,Upper Nile,"Population, Total (Number)",10 to 14,126140
4,Upper Nile,"Population, Total (Number)",15 to 19,103804
...,...,...,...,...
445,Eastern Equatoria,"Population, Female (Number)",45 to 49,13727
446,Eastern Equatoria,"Population, Female (Number)",50 to 54,9482
447,Eastern Equatoria,"Population, Female (Number)",55 to 59,5740
448,Eastern Equatoria,"Population, Female (Number)",60 to 64,5274


In [78]:
# Selecting rows and columns using .loc
census_2008.loc[1:5, ['region_name', 'variable_name', 'age_name', '2008']]

Unnamed: 0,region_name,variable_name,age_name,2008
1,Upper Nile,"Population, Total (Number)",0 to 4,150872
2,Upper Nile,"Population, Total (Number)",5 to 9,151467
3,Upper Nile,"Population, Total (Number)",10 to 14,126140
4,Upper Nile,"Population, Total (Number)",15 to 19,103804
5,Upper Nile,"Population, Total (Number)",20 to 24,82588
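
Notice that `.loc[1:5]` returned five rows, labels 1 through 5: label-based slices include the end label, whereas position-based `.iloc` slices exclude the end position. A minimal sketch on a made-up DataFrame with the default integer index:

```python
import pandas as pd

# Made-up values; the default index is 0..5
df = pd.DataFrame({"value": [10, 20, 30, 40, 50, 60]})

by_label = df.loc[1:3]      # labels 1, 2, 3 (end label included)
by_position = df.iloc[1:3]  # positions 1, 2 (end position excluded)

print(by_label["value"].tolist())
print(by_position["value"].tolist())
```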


In [92]:
# Sort the data by the region_name column
census_sorted = census_2008.sort_values(by='region_name', ascending=True)


census_sorted.head(5)

Unnamed: 0,region,region_name,region_regionid,variable,variable_name,age,age_name,scale,units,2008
360,KN.A10,Central Equatoria,SS-EC,KN.B2,"Population, Total (Number)",KN.C1,Total,units,Persons,1103557
361,KN.A10,Central Equatoria,SS-EC,KN.B2,"Population, Total (Number)",KN.C2,0 to 4,units,Persons,163539
362,KN.A10,Central Equatoria,SS-EC,KN.B2,"Population, Total (Number)",KN.C3,5 to 9,units,Persons,161582
363,KN.A10,Central Equatoria,SS-EC,KN.B2,"Population, Total (Number)",KN.C4,10 to 14,units,Persons,138342
364,KN.A10,Central Equatoria,SS-EC,KN.B2,"Population, Total (Number)",KN.C5,15 to 19,units,Persons,128564


In [93]:
# Set index to region_name
census_sorted = census_sorted.set_index('region_name')
census_sorted.head(8)

region_name,region,region_regionid,variable,variable_name,age,age_name,scale,units,2008
Central Equatoria,KN.A10,SS-EC,KN.B2,"Population, Total (Number)",KN.C1,Total,units,Persons,1103557
Central Equatoria,KN.A10,SS-EC,KN.B2,"Population, Total (Number)",KN.C2,0 to 4,units,Persons,163539
Central Equatoria,KN.A10,SS-EC,KN.B2,"Population, Total (Number)",KN.C3,5 to 9,units,Persons,161582
Central Equatoria,KN.A10,SS-EC,KN.B2,"Population, Total (Number)",KN.C4,10 to 14,units,Persons,138342
Central Equatoria,KN.A10,SS-EC,KN.B2,"Population, Total (Number)",KN.C5,15 to 19,units,Persons,128564
Central Equatoria,KN.A10,SS-EC,KN.B2,"Population, Total (Number)",KN.C6,20 to 24,units,Persons,111675
Central Equatoria,KN.A10,SS-EC,KN.B2,"Population, Total (Number)",KN.C7,25 to 29,units,Persons,106551
Central Equatoria,KN.A10,SS-EC,KN.B2,"Population, Total (Number)",KN.C8,30 to 34,units,Persons,75048


In [95]:
# Select rows by index label and columns by name with .loc
census_sorted.loc[
    ['Jonglei', 'Unity', 'Upper Nile'], 
    ['variable_name', 'age_name', '2008']
    ]


region_name,variable_name,age_name,2008
Jonglei,"Population, Total (Number)",Total,1358602
Jonglei,"Population, Total (Number)",0 to 4,207424
Jonglei,"Population, Total (Number)",5 to 9,215121
Jonglei,"Population, Total (Number)",10 to 14,179544
Jonglei,"Population, Total (Number)",15 to 19,146141
...,...,...,...
Upper Nile,"Population, Female (Number)",45 to 49,13631
Upper Nile,"Population, Female (Number)",50 to 54,10646
Upper Nile,"Population, Female (Number)",55 to 59,5824
Upper Nile,"Population, Female (Number)",60 to 64,5892
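
Once a column becomes the index, `.loc` selects rows by the values of that column, and a label that occurs several times returns every matching row. The sketch below repeats the pattern on a made-up four-row DataFrame.

```python
import pandas as pd

# Made-up values; "Unity" appears twice on purpose
df = pd.DataFrame({
    "region_name": ["Unity", "Jonglei", "Unity", "Lakes"],
    "age_name": ["0 to 4", "0 to 4", "5 to 9", "0 to 4"],
    "2008": [1, 2, 3, 4],
}).set_index("region_name")

# A repeated label returns all of its rows as a DataFrame
one_region = df.loc["Unity"]

# Lists of labels select several regions and a subset of columns at once
several = df.loc[["Jonglei", "Unity"], ["2008"]]

print(one_region.shape)
print(several)
```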


In [None]:
# All columns for a single region (first five rows)
census_sorted.loc['Jonglei', :].head()